cs.CL

[1] KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering cs.CL

Xin Sun, Zhongqi Chen, Xing Zheng, Qiang Liu, Shu Wu

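TL;DR: KBQA-R1 is a framework that uses reinforcement learning to improve large language models on knowledge base question answering. It treats KBQA as a multi-turn decision process, uses Group Relative Policy Optimization (GRPO) to refine the policy from execution feedback, and introduces Referenced Rejection Sampling (RRS) to address the cold-start problem. It achieves state-of-the-art performance on the WebQSP, GrailQA, and GraphQuestions benchmarks.

Details

Motivation: Current KBQA methods exhibit two failure modes: they either generate hallucinated queries without verifying that the schema exists, or they follow rigid, template-based reasoning without truly understanding the environment. KBQA-R1 aims to overcome these limitations by using reinforcement learning to shift the paradigm from text imitation to interaction optimization.

Result: Extensive experiments on WebQSP, GrailQA, and GraphQuestions show that KBQA-R1 achieves state-of-the-art performance and effectively grounds LLM reasoning in verifiable execution.

Insight: The contributions are: 1) reframing KBQA as a multi-turn decision process and optimizing the interaction policy with reinforcement learning (GRPO); 2) Referenced Rejection Sampling (RRS), a data synthesis method that strictly aligns reasoning traces with ground-truth action sequences to solve the cold-start problem, strengthening the model's genuine understanding of the knowledge base environment.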

Abstract: Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.
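
Example: The abstract names GRPO but does not spell out the reward. The block below is a minimal sketch of the group-relative advantage step that GRPO is generally described with, assuming a hypothetical binary execution reward (1.0 when the executed query returns the gold answer). The function name and the reward values are illustrative and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four rollouts of one question (assumption, not paper data).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to form the baseline.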


[2] MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data cs.CL | cs.AI | cs.LG | cs.MM

Christopher Driggers-Ellis, Detravious Brinkley, Ray Chen, Aashish Dhawan, Daisy Zhe Wang

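TL;DR: This paper presents MultiScript30k, an extension of the Multi30k dataset intended to give multimodal machine translation (MMT) research more diverse language coverage. The dataset is built by translating the English sentences of Multi30k into Arabic, Spanish, Ukrainian, Simplified Chinese, and Traditional Chinese with the NLLB200-3.3B model, and contains over 30,000 sentences to support research on non-Latin scripts and global languages.

Details

Motivation: The existing Multi30k dataset covers only Czech, English, French, and German, all European languages in Latin script, which has limited MMT research on more diverse languages. Overcoming this limit requires extending the dataset to more language families and scripts.

Result: Similarity analysis shows cosine similarity greater than 0.8 and symmetric KL divergence less than 0.000251 for every supported language except Traditional Chinese, comparable to the earlier extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores are mixed: MultiScript30k-Ar scores close to ArEnMulti30k, but MultiScript30k-Uk scores about 6.4% lower than Multi30k-Uk.

Insight: The novelty lies in using a large multilingual model (NLLB) to automatically extend a parallel corpus to non-Latin scripts and global languages, promoting linguistic diversity in MMT research. Viewed objectively, the method offers a practical, efficient way to extend datasets to more languages and scripts, but machine translation quality may vary by language and needs further evaluation.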

Abstract: Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30,000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that this Multi30k extension consistently achieves greater than 0.8 cosine similarity and symmetric KL divergence less than 0.000251 for all languages supported except Zh_Hant which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores 6.4% greater than MultiScript30k-Uk per split.
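
Example: The similarity analysis reports cosine similarity and symmetric KL divergence. The sketch below shows one common way to compute both, assuming that "symmetric KL" means the sum of the two directed divergences. The abstract does not give the exact definition, nor which embeddings or distributions are compared, so the inputs here are made-up toy values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Assumed definition: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Toy inputs (assumptions for illustration only).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(cosine_similarity(p, q), symmetric_kl(p, q))
```

Some authors average the two directions instead of summing them, which halves the value, so a threshold such as 0.000251 is only meaningful against the definition the paper actually uses.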


[3] When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents cs.CL | cs.LG

Mrinal Rawat, Arkajyoti Chakraborty, Neha Gupta, Roberto Pieraccini

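TL;DR: This paper proposes a reinforcement learning (RL) approach to improve the synergy between reasoning and action in conversational agents. The model is optimized with Group Relative Policy Optimization (GRPO) so that it learns reasoning strategies directly from task outcomes, improving the accuracy of tool calls and answer generation.

Details

Motivation: Supervised fine-tuning (SFT) generalizes poorly when the data distribution shifts, and high-quality reasoning annotations are costly and hard to scale, so a method is needed that improves reasoning and generalization without relying on large amounts of annotated data.

Result: Experiments show gains in both reasoning quality and tool-call precision: a 1.5% relative improvement over the SFT model trained without explicit reasoning, and a 40% gain over the base Qwen3-1.7B model.

Insight: The novelty lies in using reinforcement learning to jointly optimize reasoning steps with tool invocation and answer generation, using task outcomes as the reward signal so that the model iteratively refines its reasoning and actions, strengthening the capability and generalization of conversational agents.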

Abstract: Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging – annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain compared to the base of the vanilla Qwen3-1.7B model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents.


[4] CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise cs.CL | cs.AI

Qingsen Ma, Dianyun Wang, Ran Jing, Yujun Sun, Zhenbo Xu

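TL;DR: This paper proposes CIP, a lightweight, plug-and-play causal prompting framework that mitigates hallucinations when large language models process long, noisy retrieval contexts. CIP builds a causal sequence among entities, actions, and events and injects it into the prompt to steer the model toward causally relevant evidence, suppressing non-causal reasoning paths. Experiments show that CIP consistently improves reasoning quality and reliability across several mainstream models.

Details

Motivation: When processing long, noisy retrieval contexts, large language models tend to hallucinate because they rely on spurious correlations rather than genuine causal relationships. The paper aims to mitigate this through causal reasoning, improving factual grounding and interpretability.

Result: Across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, CIP improves Attributable Rate by 2.6 points and Causal Consistency Score by 0.38, increases effective information density fourfold, and reduces end-to-end response latency by up to 55.1%.

Insight: The paper's claimed novelty is a lightweight, plug-and-play causal prompting framework that uses causal intervention and counterfactual reasoning to steer the model toward causal evidence and thereby suppress hallucinations. Viewed objectively, the core contribution is structuring causal reasoning as a sequence that can be injected into the prompt and using it as a general, model-agnostic input-stage intervention, which may improve the explainability, stability, and efficiency of large language models.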

Abstract: Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving 2.6 points improvement in Attributable Rate, 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API level profiling further shows that CIP accelerates contextual understanding and reduces end to end response latency by up to 55.1 percent. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models.


Shogo Fujita, Yuji Naraki, Yiqing Zhu, Shinsuke Mori

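TL;DR: This paper introduces LegalRikai: Open Benchmark, a new benchmark focused on Japanese corporate legal practice. It comprises four complex tasks created by legal professionals under the supervision of an attorney, with 100 samples that require long-form, structured outputs and are evaluated against multiple criteria by both humans and automated methods. The study evaluates leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1, and finds that abstract instructions trigger unnecessary modifications, exposing weaknesses in document-level editing that conventional short-text tasks miss. The analysis also shows that automated evaluation agrees closely with human judgment on criteria with clear linguistic grounding, while assessing structural consistency remains challenging. The results show that automated evaluation is useful as a screening tool when expert availability is limited, and the paper proposes a dataset evaluation framework to promote more practice-oriented research in the legal domain.

Details

Motivation: To address the lack of complex, long-form, structured-output tasks in existing legal benchmarks, particularly for Japanese corporate legal practice, and to encourage more practice-oriented research.

Result: Human evaluation exposes LLM weaknesses in document-level editing. Automated evaluation agrees closely with human judgment on criteria with clear linguistic grounding, making it an effective screening tool, but evaluating structural consistency remains a challenge.

Insight: The novelty lies in a benchmark built by legal experts around complex Japanese corporate legal tasks, which highlights the importance of long-form structured output and document-level editing evaluation. Viewed objectively, the proposed dataset evaluation framework and the analysis of agreement between automated evaluation and human judgment offer useful tools and insights for practice-oriented legal AI research.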

Abstract: This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. The benchmark was created by legal professionals under the supervision of an attorney. This benchmark has 100 samples that require long-form, structured outputs, and we evaluated them against multiple practical criteria. We conducted both human and automated evaluations using leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting model weaknesses in document-level editing that were missed by conventional short-text tasks. Furthermore, our analysis reveals that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, but assessing structural consistency remains a challenge. The results demonstrate the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset evaluation framework to promote more practice-oriented research in the legal domain.


Tomáš Koref, Lena Held, Mahammad Namazov, Harun Kumru, Yassine Thlija

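TL;DR: This study develops automated methods, using state-of-the-art natural language processing, to detect and classify judicial reasoning in decisions of the Czech Supreme Courts, in order to refute common claims about judicial formalism in Central and Eastern Europe (CEE). It creates the MADON dataset of 9,183 paragraphs from 272 decisions, with expert annotations of eight argument types and holistic formalism labels, and adapts transformer LLMs to the Czech legal domain through continued pretraining on a corpus of 300k Czech court decisions. The best models reach macro-F1 scores of 82.6%, 77.5%, and 83.2% on argumentative paragraph detection, classification of traditional legal argument types, and formalistic/non-formalistic decision classification, respectively.

Details

Motivation: Systematically analyzing judicial reasoning at scale remains difficult. The study aims to challenge the prevailing narrative about judicial formalism in Central and Eastern Europe through automated legal argument mining, enabling reliable classification of judicial philosophy.

Result: On the MADON dataset, the best models reach macro-F1 scores of 82.6%, 77.5%, and 83.2% on argumentative paragraph detection, classification of traditional legal argument types, and decision formalism classification, respectively. The proposed three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results on decision classification while reducing computational cost and increasing explainability.

Insight: The novelty lies in a high-quality, expert-annotated legal argument mining dataset (MADON) and in applying large language models to the Czech legal domain through domain adaptation (continued pretraining) and methods for class imbalance (such as asymmetric loss and class weighting). The study shows the potential of legal argument mining for computational legal studies tasks such as judicial philosophy classification. The methodology is easy to replicate across jurisdictions, and the entire pipeline, data, and code are open source.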

Abstract: Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study refutes claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in Czech Supreme Courts’ decisions using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300k Czech court decisions, we adapt transformer LLMs for Czech legal domain by continued pretraining and experiment with methods to address dataset imbalance including asymmetric loss and class weighting. The best models successfully detect argumentative paragraphs (82.6% macro-F1), classify traditional types of legal argument (77.5% macro-F1), and classify decisions as formalistic/non-formalistic (83.2% macro-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. This work shows that legal argument mining enables reliable judicial philosophy classification and shows the potential of legal argument mining for other important tasks in computational legal studies. Our methodology is easily replicable across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at https://github.com/trusthlt/madon.
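
Example: All three headline results are macro-F1 scores, which weight every class equally regardless of how many paragraphs or decisions it contains. That matters here because the abstract notes dataset imbalance. The sketch below computes macro-F1 from scratch. The labels and predictions are invented toy data, not MADON annotations.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy decision labels (assumption for illustration only).
y_true = ["formalistic", "formalistic", "formalistic", "non-formalistic"]
y_pred = ["formalistic", "formalistic", "non-formalistic", "non-formalistic"]
print(macro_f1(y_true, y_pred))
```

In this toy case the per-class F1 scores are 0.8 and 2/3, so the macro average is pulled down by the weaker class even though three of the four predictions are correct.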


[7] Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction cs.CL | cs.CV

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

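TL;DR: This paper proposes a long-video summarization method based on key clip extraction: the video is split into short clips, a lightweight video captioning model generates visual descriptions, and a large language model selects the K most relevant clips to build a multimodal summary. Evaluated on the MovieSum dataset, the method achieves summarization performance close to that of the reference clips at low computational cost.

Details

Motivation: To address the tendency of vision-language models to lose important visual information when processing long videos, and to design a cost-effective tool for analyzing lengthy video content.

Result: On MovieSum, the method achieves summarization performance close to that of the reference clips, which cover less than 6% of each movie and are derived automatically from human-annotated screenplays and summaries. It captures substantially more relevant video information than random clip selection while keeping computational cost low.

Insight: The novelty lies in combining a lightweight video captioning model with a large language model for key clip selection, giving efficient long-video summarization. Viewed objectively, the method balances computational efficiency and summary quality and offers a practical framework for multimodal video processing.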

Abstract: Vision-Language Models (VLMs) are able to process increasingly longer videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
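
Example: The pipeline described above (split into clips, caption each clip, select K clips) can be sketched as follows. In the paper the selection is made by an LLM reading the captions. Here a hypothetical keyword-count function stands in for that LLM so the block runs on its own. The clip length, captions, and keywords are all assumptions for illustration.

```python
def split_into_clips(duration_s, clip_len_s):
    """Return (start, end) boundaries, in seconds, that cover the whole video."""
    clips = []
    start = 0.0
    while start < duration_s:
        clips.append((start, min(start + clip_len_s, duration_s)))
        start += clip_len_s
    return clips

def select_top_k(captions, score_fn, k):
    """Pick the k highest-scoring clips and return their indices in playback order."""
    ranked = sorted(range(len(captions)), key=lambda i: score_fn(captions[i]), reverse=True)
    return sorted(ranked[:k])

# Hypothetical stand-in for the LLM's relevance judgment: count salient keywords.
KEYWORDS = {"hero", "villain", "escapes"}

def keyword_score(caption):
    return sum(word in KEYWORDS for word in caption.split())

captions = ["credits roll", "hero confronts villain", "landscape shot", "villain escapes the city"]
print(split_into_clips(25.0, 10.0))
print(select_top_k(captions, keyword_score, k=2))
```

Returning the selected indices in playback order keeps the resulting summary chronologically coherent, which is a design choice of this sketch rather than something the abstract states.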


[8] Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning cs.CL

Xuanbo Su, Yingfang Zhang, Hao Luo, Xiaoteng Liu, Leo Huang

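TL;DR: This paper proposes Mistake Notebook Learning (MNL), a training-free framework for improving in-context learning (ICL) in large language models (LLMs). MNL maintains a persistent knowledge base of abstracted error patterns, extracts generalizable guidance from multiple failures, and uses hold-out validation to ensure monotonic improvement. Experiments show that MNL nearly matches supervised fine-tuning on GSM8K and clearly outperforms other training-free methods on several complex reasoning benchmarks.

Details

Motivation: Adapting LLMs to tasks through gradient fine-tuning is computationally heavy and prone to catastrophic forgetting, while in-context learning has low robustness and learns poorly from mistakes. The paper aims to develop a training-free method that learns effectively from errors and improves reasoning robustness.

Result: MNL performs well on GSM8K, Spider, AIME, and KaggleDBQA. On GSM8K it nearly matches supervised fine-tuning (93.9% vs 94.3%). On KaggleDBQA with Qwen3-8B it reaches 28% accuracy, a 47% relative gain, outperforming Memento (15.1%) and Training-Free GRPO (22.1%), which suggests it is a strong training-free alternative for complex reasoning tasks.

Insight: The paper's claimed novelty is a batch-wise error abstraction mechanism that extracts general guidance from multiple failures and dynamically maintains a validated "mistake notebook" knowledge base, ensuring monotonic improvement. Viewed objectively, the core contribution is combining error analysis with dynamic, validated updates to a knowledge base, giving training-free in-context learning a systematic optimization paradigm whose improvements accumulate.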

Abstract: Large language models (LLMs) adapt to tasks via gradient fine-tuning (heavy computation, catastrophic forgetting) or In-Context Learning (ICL: low robustness, poor mistake learning). To fix this, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL hits 28% accuracy (47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1%), proving it's a strong training-free alternative for complex reasoning.
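
Example: The "retain only baseline-outperforming guidance via hold-out validation" step can be sketched as a greedy accept/reject loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `update_notebook`, the `evaluate` callback, and the toy gain table are all hypothetical, and a real `evaluate` would run the LLM on held-out questions with the notebook in its context.

```python
def update_notebook(notebook, candidates, evaluate, holdout):
    """Keep a candidate guidance entry only if it raises the hold-out score."""
    best = evaluate(notebook, holdout)
    for guidance in candidates:
        trial = notebook + [guidance]
        score = evaluate(trial, holdout)
        if score > best:  # strict improvement, so the score never decreases
            notebook, best = trial, score
    return notebook, best

# Toy evaluator (assumption): each entry shifts accuracy by a fixed amount.
GAINS = {"check units": 0.05, "guess randomly": -0.10, "verify SQL joins": 0.03}

def toy_evaluate(notebook, holdout):
    return 0.5 + sum(GAINS[g] for g in notebook)

notebook, score = update_notebook([], list(GAINS), toy_evaluate, holdout=None)
print(notebook, score)
```

Because an entry is accepted only on a strict improvement, the hold-out score is non-decreasing across updates, which is the sense in which the abstract speaks of monotonic improvement.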


[9] Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs cs.CL | cs.AI

Mohor Banerjee, Nadya Yuki Wangsajaya, Syed Ali Redha Alsagoff, Min Sen Tan, Zachary Choy Kit Chun

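TL;DR: This paper empirically studies how three hallucination-reduction techniques for large language models (CoVe, DoLa, and RAG) affect model creativity. Evaluating across multiple model families and scales, it finds that the methods affect divergent creativity differently: CoVe enhances it, DoLa suppresses it, and RAG has minimal impact.

Details

Motivation: Existing methods aim to reduce LLM hallucinations, but their effect on creativity is unclear, while applications such as AI-assisted scientific discovery need both factual accuracy and creative hypothesis generation.

Result: LLaMA, Qwen, and Mistral models of various sizes were evaluated on two creativity benchmarks, NeoCoder and CS4: CoVe improves divergent thinking, DoLa suppresses it, and RAG has the least impact.

Insight: Hallucination-reduction methods may come at the cost of creativity, and different methods push creativity in different directions. In scientific applications that must balance factual accuracy with creative exploration, the method should be chosen to fit the specific need.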

Abstract: Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.


[10] Extending a Parliamentary Corpus with MPs’ Tweets: Automatic Annotation and Evaluation Using MultiParTweet cs.CL | cs.MM

Mevlüt Bagci, Ali Abusaleh, Daniel Baumartz, Giuseppe Abrami, Maxim Konca

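TL;DR: This paper introduces MultiParTweet, a multilingual tweet corpus that links the social media discourse of German members of parliament with the parliamentary debate corpus GerParCor, enabling comparative analysis of online communication and parliamentary debates. The corpus contains 39,546 tweets, including 19,056 media items, annotated automatically for emotion, sentiment, and topic with nine text-based models and one vision-language model (VLM). The automatic annotations are evaluated against a manually annotated subset. The paper also provides the data collection tool TTLABTweetCrawler and a methodological demonstration showing that the models are mutually predictable.

Details

Motivation: Social media is central to modern politics: it reflects politicians' ideologies and facilitates communication with younger generations. However, the lack of a multilingual corpus linking social media discourse with formal parliamentary debates has limited comparative analysis of the two.

Result: MultiParTweet contains 39,546 tweets, including 19,056 media items, annotated automatically with nine text-based models and one VLM and evaluated against a manually annotated subset. The analysis shows that the models are mutually predictable and that human annotators preferred the VLM-based annotations, suggesting that multimodal representations align better with human interpretation.

Insight: The novelty lies in a multilingual corpus that links social media with parliamentary debates and integrates automatic text and media annotations (including a VLM), with TTLABTweetCrawler supporting reproducible data collection. Viewed objectively, using a VLM for multimodal annotation to improve agreement with human judgment is a direction worth adopting, especially for political discourse analysis that combines text and visual information.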

Abstract: Social media serves as a critical medium in modern politics because it both reflects politicians’ ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians’ social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39,546 tweets, including 19,056 media items. Furthermore, we enriched the annotation with nine text-based models and one vision-language model (VLM) to annotate MultiParTweet with emotion, sentiment, and topic annotations. Moreover, the automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether the models can predict each other using the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text and media-based annotations validated with human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more with human interpretation.


[11] Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols cs.CL | cs.AI | cs.LG

Björn Deiseroth, Max Henning Höth, Kristian Kersting, Letitia Parcalabescu

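TL;DR: This paper proposes a training framework based on the Merlin-Arthur interactive proof protocol to improve the reliability of retrieval-augmented generation (RAG) systems. The framework treats the retriever and generator as roles in a proof system: the generator (Arthur) is trained with helpful evidence (Merlin) and adversarial, misleading evidence (Morgana), so that it learns to answer when supported, reject when evidence is insufficient, and rely on the context spans that truly support the answer. The paper also introduces a rigorous evaluation framework and the Explained Information Fraction (EIF) to measure model performance.

Details

Motivation: Current RAG systems treat retrieval as a heuristic step rather than verifiable evidence, so large language models (LLMs) answer without support, hallucinate, or rely on spurious evidence. The paper aims to give RAG systems information-theoretic guarantees through a formal interactive proof protocol, reducing hallucinations and improving reliability.

Result: Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs improve in groundedness, completeness, soundness, and reject behavior while hallucinating less. The retriever also improves recall and mean reciprocal rank (MRR) through automatically generated M/A positives and negatives.

Insight: The core contribution is formalizing the RAG pipeline as an interactive proof system, in which adversarial training and an explainability (XAI) method dynamically identify and modify the key evidence so that the generator learns to reason from verifiable evidence. In addition, the proposed EIF metric offers a normalized measure of explanation fidelity, providing a principled path toward reliable RAG systems.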

Abstract: Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline – both the retriever and the generator – as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unknown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context supports the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors. This allows us to introduce and measure the Explained Information Fraction (EIF), which normalizes M/A certified mutual-information guarantees relative to model capacity and imperfect benchmarks. Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show improved groundedness, completeness, soundness, and reject behavior, as well as reduced hallucinations – without needing manually annotated unanswerable questions. The retriever likewise improves recall and MRR through automatically generated M/A hard positives and negatives. Our results demonstrate that autonomous interactive-proof-style supervision provides a principled and practical path toward reliable RAG systems that treat retrieved documents not as suggestions, but as verifiable evidence.
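
Example: The retriever is evaluated with recall and MRR. As a reminder of what MRR measures, the sketch below computes mean reciprocal rank over a batch of queries using the standard definition. The document IDs and relevance sets are invented, and nothing here reflects the paper's datasets.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant document per query (0 if none is retrieved)."""
    total = 0.0
    for docs, gold in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(docs, start=1):
            if doc in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)

# Toy retrieval results for three queries (assumption for illustration only).
ranked_lists = [["d1", "d2"], ["d3", "d4"], ["d5", "d6"]]
relevant_sets = [{"d1"}, {"d4"}, {"d9"}]
print(mean_reciprocal_rank(ranked_lists, relevant_sets))  # (1 + 1/2 + 0) / 3 = 0.5
```

Moving supporting evidence from rank 2 to rank 1 doubles that query's contribution, which is why hard positives and negatives that sharpen the top of the ranking tend to show up in this metric.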


[12] SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support cs.CL

Yuming Feng, Xinrui Jiang

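TL;DR: This paper proposes SUMFORU, a steerable LLM-based review summarization framework that generates personalized summaries from explicit user personas to support purchase decisions. It builds a high-quality data pipeline from the Amazon 2023 Review Dataset and uses two-stage alignment, persona-aware supervised fine-tuning followed by reinforcement learning with AI feedback, to better align summaries with user preferences.

Details

Motivation: Existing LLM-based review summarizers are generic and ignore individual preferences, which limits their practical utility. The paper addresses the overload of noisy online reviews that makes effective decisions hard for users, using personalized summaries to improve decision support.

Result: Across rule-based, LLM-based, and human-centered metrics, SUMFORU shows consistent improvements in consistency, grounding, and preference alignment, achieves the highest performance in all evaluation settings, and generalizes effectively to unseen product categories.

Insight: The contributions include persona-aware supervised fine-tuning via asymmetric knowledge distillation and RLAIF with a preference estimator that captures fine-grained, persona-relevant signals. The approach shows the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.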

Abstract: Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.


cs.CV

[13] Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification cs.CV | cs.AI | cs.LG

Anoop Krishnan

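TL;DR: This paper proposes text-guided methods to improve the demographic fairness of gender classification from facial images. The core methods, Image Text Matching (ITM) guidance and Image Text fusion, use semantic information from image captions to improve generalization and thereby reduce bias.

Details

Motivation: To address demographic bias in AI gender classification algorithms, particularly in facial image analysis, by using text guidance to improve fairness.

Result: Extensive experiments on benchmark datasets show that, compared with existing methods, these approaches effectively mitigate bias and improve classification accuracy across gender and racial groups.

Insight: The novelty lies in integrating textual semantic information into visual model training, strengthening multimodal representations through image-text matching and fusion. The approach works without demographic labels and offers an interpretable, intuitive training paradigm that supports fairer facial analysis algorithms.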

Abstract: In pursuit of fairness in artificial intelligence, this work presents novel text-guided approaches to improving fairness in facial-image-based gender classification algorithms. The core methodology involves leveraging semantic information from image captions during model training to improve generalization capabilities. Two key strategies are presented: Image Text Matching (ITM) guidance and Image Text fusion. ITM guidance trains the model to discern fine-grained alignments between images and texts to obtain enhanced multimodal representations. Image text fusion combines both modalities into comprehensive representations for improved fairness. Extensive experiments conducted on benchmark datasets demonstrate these approaches effectively mitigate bias and improve accuracy across gender and racial groups compared to existing methods. Additionally, the unique integration of textual guidance underscores an interpretable and intuitive training paradigm for computer vision systems. By scrutinizing the extent to which semantic information reduces disparities, this research offers valuable insights into cultivating more equitable facial analysis algorithms. The proposed methodologies contribute to addressing the pivotal challenge of demographic bias in gender classification from facial images. Furthermore, this technique operates in the absence of demographic labels and is application agnostic.


[14] SoccerMaster: A Vision Foundation Model for Soccer Understanding cs.CV | cs.AI

Haolin Yang, Jiayuan Rao, Haoning Wu, Weidi Xie

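TL;DR: This paper presents SoccerMaster, a vision foundation model for soccer understanding that uses supervised multi-task pretraining to handle tasks ranging from fine-grained perception to semantic reasoning in one model. It also builds an automated data curation pipeline and a comprehensive pretraining dataset, SoccerFactory.

Details

Motivation: To address the fragmentation of soccer visual understanding into isolated tasks that depend on task-specific expert models, by building a single unified model for diverse soccer vision tasks.

Result: SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, demonstrating its breadth and superiority.

Insight: The contributions include the first soccer-specific vision foundation model, the SoccerFactory dataset built with an automated data curation pipeline, and a unified framework obtained through multi-task pretraining. Its approach to building a domain-specific foundation model and integrating data is worth borrowing.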

Abstract: Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection) to semantic reasoning (e.g., event classification). Specifically, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline to generate scalable spatial annotations, and integrate them with various existing soccer video datasets to construct SoccerFactory, a comprehensive pretraining data resource; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.


[15] Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning cs.CV

Chenjun Li, Cheng Wan, Laurin Lux, Alexander Berger, Richard B. Rosen

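TL;DR: This paper proposes the Synthetic Vasculature Reasoning (SVR) framework, which controllably synthesizes retinal vasculature images with diabetic retinopathy features together with fine-grained reasoning text, and uses it to build OCTA-100K-SVR, a dataset of 100,000 pairs. Experiments show that a general-purpose vision-language model (Qwen3-VL-8b) trained on the dataset achieves 89.67% zero-shot balanced classification accuracy on real OCTA images, outperforming supervised baselines, and significantly improves explanation quality and pathology localization on clinical data.

Details

Motivation: To address the lack of large-scale image-text pairs with precise pathology descriptions needed to train vision-language models for fine-grained reasoning in specialized medical domains such as OCTA imaging.

Result: On real OCTA images, the model trained on the proposed dataset achieves 89.67% zero-shot balanced classification accuracy, outperforming supervised baselines, and human expert evaluation confirms significant gains in explanation quality and pathology localization.

Insight: The novelty lies in a framework that controllably synthesizes images and text, addressing data scarcity in a specialized domain through data generation and thereby strengthening a general-purpose VLM's zero-shot reasoning and explanation on a specific medical task.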

Abstract: Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by allowing users to ask about clinical explanations alongside predictions and across different modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, for example in reading Optical Coherence Tomography Angiography (OCTA) images, such precise text with grounded description of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text, specifically: realistic retinal vasculature with Diabetic Retinopathy (DR) features: capillary dropout, microaneurysms, neovascularization, and tortuosity, while automatically generating granular reasoning texts. Based on this we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation we also demonstrate that it significantly enhances explanation quality and pathology localization on clinical data.
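
Example: The headline number is a balanced classification accuracy, which is the mean of per-class recall and is therefore not inflated by a majority class. The sketch below contrasts it with plain accuracy on an invented, imbalanced toy label set. The class names and counts are assumptions and do not come from the paper's evaluation data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced labels (assumption): 2 diseased scans, 6 healthy scans.
y_true = ["DR", "DR"] + ["healthy"] * 6
y_pred = ["DR", "healthy"] + ["healthy"] * 6
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

Here plain accuracy is 0.875 while balanced accuracy is 0.75, because missing one of only two diseased scans halves the recall for that class.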


[16] VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation cs.CV

Felix O’Mahony, Roberto Cipolla, Ayush Tewari

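TL;DR: This paper proposes VDAWorld, a framework for world modeling through vision-language model (VLM)-directed abstraction and simulation. It distills an image-caption pair into a tractable abstract representation and adaptively selects a physics simulator for dynamics prediction, overcoming the limits of generative video models in physical and logical consistency, interactivity, and queryability.

Details

Motivation: To address generative video models' violations of physical and logical rules, their lack of interactivity, and their nature as black boxes ill-suited to building structured, queryable worlds.

Result: Experiments show that combining intelligent abstraction with adaptive simulation produces high-quality simulations across a wide range of dynamic scenarios.

Insight: The novelty lies in using a VLM as an intelligent agent that coordinates the construction of the abstract representation and the choice of physics simulator, inferring latent dynamics from a static scene to predict future states, which offers a new paradigm for interpretable, structured world modeling.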

Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high quality simulations across a wide range of dynamic scenarios.


[17] Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description cs.CV | cs.RO PDF

Nazanin Mahjourian, Vinh Nguyen

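TL;DR: This paper proposes VLM-IRIS, a framework that adapts vision-language models (VLMs) trained on RGB data to infrared images, enabling zero-shot workpiece presence detection in low-light industrial settings such as additive manufacturing.

Details

Motivation: Conventional vision systems struggle in low-light or enclosed industrial environments, where infrared cameras offer complementary advantages. Supervised learning also needs large labeled datasets, so a zero-shot framework is more practical.

Result: On a zero-shot workpiece presence detection task on a 3D printer bed, a CLIP ViT-B/32 encoder combined with centroid prompt ensembling achieves high accuracy on infrared images without any model retraining.

Insight: The key novelty is preprocessing infrared images into a magma representation so that they become RGB-compatible inputs that existing VLMs can process directly. This extends the zero-shot capability of VLMs to thermal imaging and opens a path to label-free monitoring.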

Abstract: Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.
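
One plausible reading of "centroid prompt ensembling" is sketched below: the embeddings of several text prompts per class are normalized and averaged into a class centroid, and the image is assigned to the class whose centroid is most similar. This is an assumption about the technique, not the authors' implementation. The 3-D vectors stand in for CLIP embeddings, and the prompt labels and values are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Average unit vectors and re-normalize the mean."""
    unit = [normalize(v) for v in vectors]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def classify(image_emb, class_prompt_embs):
    """Return the class whose prompt centroid has the highest cosine similarity."""
    img = normalize(image_emb)
    scores = {
        label: sum(a * b for a, b in zip(img, centroid(embs)))
        for label, embs in class_prompt_embs.items()
    }
    return max(scores, key=scores.get), scores

# Toy vectors standing in for text embeddings of several prompts per class.
prompt_embs = {
    "workpiece present": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "empty bed": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],
}
label, scores = classify([0.7, 0.2, 0.1], prompt_embs)
print(label)
```

Averaging over several phrasings tends to reduce sensitivity to the wording of any single prompt, which is the usual motivation for prompt ensembling in zero-shot classification.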


[18] VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction cs.CV PDF

Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan

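TL;DR: VGent is a modular encoder-decoder architecture for visual grounding that disentangles high-level reasoning from low-level bounding-box prediction. A frozen multimodal large language model (MLLM) serves as the encoder to preserve reasoning ability, while the decoder takes detector-proposed boxes as queries and selects the target boxes via cross-attention, which enables fast inference and modular upgrades.

Details

Motivation: Current visual grounding models either rely on an MLLM with autoregressive decoding, which is slow and prone to hallucination, or re-align an LLM with vision features to learn new tokens, which may undermine its pretrained reasoning ability. VGent aims to avoid both pitfalls through a modular design that decouples reasoning from prediction.

Result: On multi-target visual grounding benchmarks, VGent sets a new state of the art with a +20.6% F1 improvement over prior methods, and improves gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.

Insight: The contributions include a modular architecture that decouples reasoning and prediction and avoids autoregressive decoding; QuadThinker, which strengthens the encoder's multi-target reasoning; mask-aware labels that resolve detection-segmentation ambiguity; and global target recognition, which improves target recognition and selection. The design draws on advances in both object detection and MLLMs and supports modular upgrades.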

Abstract: Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM’s pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) via cross-attending on encoder’s hidden states. This design fully leverages advances in both object detection and MLLM, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing multi-target reasoning ability of the encoder; (ii) mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.


[19] Learning from a Generative Oracle: Domain Adaptation for Restoration cs.CV | eess.IV PDF

Yuyang Hu, Mojtaba Sahraee-Ardakan, Arpit Bansal, Kangfu Mei, Christian Qi

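TL;DR: This paper proposes LEGO, a framework for adapting pre-trained image restoration models to real-world, out-of-distribution degradations. It uses a frozen, large-scale generative model to produce high-quality pseudo-ground-truths, turning unsupervised adaptation into pseudo-supervised learning and improving performance in the new domain without changing the model architecture.

Details

Motivation: Pre-trained image restoration models degrade on real-world out-of-distribution data, and traditional adaptation methods usually require paired data or complex architectural changes. LEGO targets domain adaptation without paired data.

Result: Experiments show that LEGO significantly improves performance on diverse real-world benchmarks and effectively narrows the domain gap.

Insight: The key novelty is using a frozen large generative model as an oracle that generates pseudo-ground-truths, which converts the unsupervised problem into a pseudo-supervised one. A mixed-supervision strategy preserves the model's original robustness, and no architectural modification is needed.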

Abstract: Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.


[20] Learning complete and explainable visual representations from itemized text supervision cs.CV PDF

Yiwei Lyu, Chenhui Zhao, Soumyanil Banerjee, Shixuan Liu, Akshay Rao

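TL;DR: This paper proposes ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. A cross-attention module produces visual embeddings conditioned on each text item, and tailored objectives enforce item independence and representation completeness. Across several domains with naturally itemized text supervision (such as brain MRI, chest CT, and remote sensing) and one synthetic dataset, ItemizedCLIP substantially outperforms baselines in zero-shot performance and fine-grained interpretability.

Details

Motivation: In non-object-centric visual domains such as medical imaging and remote sensing, existing language-supervised methods struggle with itemized text annotations, where a single image carries multiple text items describing independent semantic findings. The goal is to learn more complete and explainable visual representations from such supervision.

Result: Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and offers better fine-grained interpretability than baselines.

Insight: The key novelty is a cross-attention mechanism and jointly optimized objectives designed for itemized text supervision, which enforce item independence (distinct image regions for distinct items) and representation completeness (coverage of all items). The resulting representations are semantically grounded, item-differentiable, complete, and visually interpretable.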

Abstract: Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable. Our code is available at https://github.com/MLNeurosurg/ItemizedCLIP.


[21] Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context cs.CV | cs.AI PDF

Anatole Jacquin de Margerie, Alexis Roger, Irina Rish

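TL;DR: This paper reproduces and critically analyzes the Monkey vision-language model published at CVPR24, which uses image tiling for high-resolution image understanding. The study confirms the core finding that tiling effectively recovers local detail, further examines the effect of global context, and reports that deviations in the results depend strongly on task type and tile granularity.

Details

Motivation: Complex multimodal models often lack transparent implementation details and accessible training infrastructure. Reproducing the Monkey VLM tests whether its approach to high-resolution image understanding holds up, and the study is extended to examine the role of global context.

Result: The study confirms the key finding of the original Monkey VLM work, namely that tiling effectively recovers local detail. It also reports deviations in the results whose magnitude depends heavily on task type and tile granularity, and offers practical insights for future high-resolution multimodal modeling.

Insight: The contribution is a systematic reproduction and extended analysis of an existing high-resolution image understanding method (image tiling). It highlights the importance of balancing local detail with global context and shows that performance is sensitive to task and parameter settings, providing empirical guidance for model design.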

Abstract: Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published in CVPR24, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work by investigating the effect of including the global context, which provides practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity.
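
The tiling idea can be sketched as follows: a large image is cut into fixed-size crops, and an optional whole-image view supplies global context. This is a minimal illustration, not the Monkey implementation. The function names, the 448-pixel tile size, and the 1344x896 image size are illustrative assumptions.

```python
def tile_boxes(width, height, tile):
    """Return (left, top, right, bottom) boxes covering the image in a grid.

    Edge tiles are clipped to the image bounds rather than padded.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes

def views(width, height, tile, include_global=True):
    """List the crops a tiling front end would pass to the vision encoder."""
    out = [("tile", box) for box in tile_boxes(width, height, tile)]
    if include_global:
        # The global view is the whole image, to be resized to the tile size.
        out.append(("global", (0, 0, width, height)))
    return out

crops = views(1344, 896, 448)
print(len(crops))
```

Finer tiles recover more local detail but produce more crops to encode, and the global view is the only place where the full layout is visible at once, which is the trade-off the study examines.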


[22] Lightweight 3D Gaussian Splatting Compression via Video Codec cs.CV PDF

Qi Yang, Geert Van Der Auwera, Zhu Li

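TL;DR: This paper proposes LGSCV, a lightweight 3D Gaussian Splatting (GS) compression method based on a video codec, to address the high computational cost and long runtime of existing video-based GS compression. It generates blockwise 2D maps with a two-stage Morton scan and combines them with PCA dimensionality reduction and MiniPLAS. On the MPEG dataset it achieves over 20% rate-distortion gain while greatly reducing map generation and encoding time.

Details

Motivation: Existing video-based GS compression methods rely on Parallel Linear Assignment Sorting (PLAS) to convert 3D GS into smooth 2D maps, which is computationally expensive and time-consuming and limits the use of GS on lightweight devices. A more efficient compression scheme is needed.

Result: On the MPEG dataset, LGSCV achieves over 20% rate-distortion gain compared with state-of-the-art methods, while reducing 2D map generation time to approximately 1 second and cutting encoding time by 50%.

Insight: The contributions include a two-stage Morton scan that generates blockwise 2D maps suited to video codecs; PCA to reduce the dimensionality of spherical harmonics; and MiniPLAS, a flexible and fast method that permutes primitives within blocks, improves performance at medium and low bitrates, and guides the codec's CU size configuration to reduce encoding time.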

Abstract: Current video-based GS compression methods rely on using Parallel Linear Assignment Sorting (PLAS) to convert 3D GS into smooth 2D maps, which are computationally expensive and time-consuming, limiting the application of GS on lightweight devices. In this paper, we propose a Lightweight 3D Gaussian Splatting (GS) Compression method based on Video codec (LGSCV). First, a two-stage Morton scan is proposed to generate blockwise 2D maps that are friendly for canonical video codecs in which the coding units (CU) are square blocks. A 3D Morton scan is used to permute GS primitives, followed by a 2D Morton scan to map the ordered GS primitives to 2D maps in a blockwise style. However, although the blockwise 2D maps report close performance to the PLAS map in high-bitrate regions, they show a quality collapse at medium-to-low bitrates. Therefore, a principal component analysis (PCA) is used to reduce the dimensionality of spherical harmonics (SH), and a MiniPLAS, which is flexible and fast, is designed to permute the primitives within certain block sizes. Incorporating SH PCA and MiniPLAS leads to a significant gain in rate-distortion (RD) performance, especially at medium and low bitrates. MiniPLAS can also guide the setting of the codec CU size configuration and significantly reduce encoding time. Experimental results on the MPEG dataset demonstrate that the proposed LGSCV achieves over 20% RD gain compared with state-of-the-art methods, while reducing 2D map generation time to approximately 1 second and cutting encoding time by 50%. The code is available at https://github.com/Qi-Yangsjtu/LGSCV .
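
The two-stage Morton scan described above can be sketched in a few lines: primitives are first ordered by a 3D Morton (Z-order) code of their positions, and the k-th primitive in that order is then written to the k-th cell of a 2D Morton curve, which fills the map block by block. This is a minimal sketch under the assumption that positions are already quantized to non-negative integers. The function names are hypothetical, and the authors' code may differ in detail.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of quantized (x, y, z) into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def morton2d_decode(index, bits=16):
    """Split a 1D index into (col, row) by de-interleaving its bits."""
    col = row = 0
    for i in range(bits):
        col |= ((index >> (2 * i)) & 1) << i
        row |= ((index >> (2 * i + 1)) & 1) << i
    return col, row

def two_stage_layout(points):
    """Stage 1: sort by 3D Morton code. Stage 2: place rank k at 2D Morton cell k."""
    order = sorted(range(len(points)), key=lambda k: morton3d(*points[k]))
    return {morton2d_decode(rank): idx for rank, idx in enumerate(order)}

# Toy quantized positions (hypothetical); values are indices into this list.
points = [(3, 3, 3), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
layout = two_stage_layout(points)
print(layout)
```

Because consecutive Morton indices fill 2x2, 4x4, 8x8 squares in turn, spatially close primitives tend to land in the same square block, which matches the square coding units of a video codec.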


[23] Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization cs.CV PDF

Anh-Kiet Duong, Petra Gomez-Krämer

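TL;DR: This paper presents a solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective, multi-modal video. The method extends the Temporal Shift Module (TSM) to TAL by introducing a background class and classifying fixed-length non-overlapping intervals, uses a multi-task learning framework that jointly optimizes scene classification and action localization, and fuses multiple models with a weighted ensemble to improve the robustness and consistency of predictions. It ranked first in both the initial and extended rounds of the challenge.

Details

Motivation: Fine-grained TAL is challenging in multi-perspective (panoramic, third-person, egocentric) and multi-modal video settings. The work aims to improve localization by exploiting contextual cues between actions and environments.

Result: The method ranked first in both the initial and extended rounds of the ICCV 2025 BinEgo-360 Challenge, demonstrating its effectiveness.

Insight: The main contributions are: 1) extending TSM to TAL by introducing a background class and classifying fixed-length intervals; 2) a multi-task learning framework that jointly optimizes scene classification and action localization to exploit the contextual link between actions and environments; 3) a weighted ensemble that fuses multiple models to improve the robustness and consistency of predictions. More broadly, combining an efficient backbone with multi-task learning and ensembling is an effective and practical recipe for complex multi-modal TAL tasks.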

Abstract: We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
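
Classifying fixed-length, non-overlapping intervals with an added background class turns localization into a decoding step: runs of identical non-background labels become action segments. The sketch below shows that decoding step only. The action names, the 2-second interval length, and the function name are hypothetical, and the paper may post-process predictions differently.

```python
def intervals_to_segments(labels, interval_sec, background="background"):
    """Merge runs of identical non-background labels into (label, start, end) segments."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != background:
                segments.append((labels[start], start * interval_sec, i * interval_sec))
            start = i
    return segments

# One predicted label per 2-second interval (toy predictions).
pred = ["background", "pour", "pour", "stir", "background", "stir"]
segments = intervals_to_segments(pred, interval_sec=2.0)
print(segments)
```

With this formulation, the temporal resolution of the localized boundaries is bounded by the interval length, so shorter intervals give finer boundaries at the cost of less context per classification.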


[24] AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path cs.CV PDF

Zhengyang Yu, Akio Hayakawa, Masato Ishii, Qingtao Yu, Takashi Shibuya

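TL;DR: This paper proposes AutoRefiner, a noise refiner designed for autoregressive video diffusion models (AR-VDMs). With two key designs, pathwise noise refinement and a reflective KV-cache, it improves the fidelity of AR-VDM samples without updating model parameters.

Details

Motivation: AR-VDMs are a scalable alternative to bidirectional VDMs and show promise for real-time and interactive applications, but their sample fidelity still has room for improvement. Existing inference-time alignment methods based on optimization or search are computationally impractical for AR-VDMs, and naively extending the feedforward noise refiners from text-to-image (T2I) work to AR-VDMs fails.

Result: Experiments show that AutoRefiner serves as an efficient plug-in for AR-VDMs and effectively improves sample fidelity by refining noise along stochastic denoising paths.

Insight: The core novelty is tailoring noise refinement to the sequential generation of AR-VDMs through pathwise noise refinement (refining noise along the whole denoising path rather than at a single step) and a reflective KV-cache (using historical information to guide refinement). This addresses the failure of directly transferring T2I noise refiners and gives AR-VDMs an efficient refinement scheme that leaves the model parameters unchanged.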

Abstract: Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. Yet, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner, a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.


[25] SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection cs.CV PDF

Tianye Qi, Weihao Li, Nick Barnes

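TL;DR: This paper introduces SmokeBench, a benchmark for evaluating how well multimodal large language models (MLLMs) recognize and localize wildfire smoke in images. It comprises four tasks: smoke classification, tile-based localization, grid-based localization, and smoke detection. Several mainstream MLLMs, including Idefics2, Qwen2.5-VL, and GPT-4o, are evaluated, and all struggle with accurate smoke localization, especially for early-stage smoke.

Details

Motivation: Wildfire smoke is transparent, amorphous, and easily confused with clouds, which makes early detection very challenging. This motivates an evaluation of what current MLLMs can actually do on this safety-critical task.

Result: Some models can classify smoke when it covers a large area, but all models perform poorly on accurate localization, especially in the early stages. Smoke volume correlates strongly with model performance, whereas contrast has a comparatively minor effect. The benchmark exposes critical limitations of current MLLMs for wildfire monitoring.

Insight: The contribution is SmokeBench, the first multimodal benchmark dedicated to wildfire smoke detection, together with a systematic evaluation of where MLLM capability ends on this task. The analysis indicates that visual localization, particularly the perception of early, sparse, and transparent targets, is the key area needing improvement. This gives clear direction for future work, such as dedicated localization methods or greater sensitivity to subtle visual features.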

Abstract: Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.


[26] VFMF: World Modeling by Forecasting Vision Foundation Model Features cs.CV | cs.AI | cs.LG PDF

Gabrijel Boduljak, Yushi Lan, Christian Rupprecht, Andrea Vedaldi

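TL;DR: This paper proposes VFMF, which builds a world model by forecasting vision foundation model (VFM) features. It performs generative forecasting with autoregressive flow matching in VFM feature space, addressing the inability of deterministic regression to capture uncertainty, and its predictions can be decoded into multiple output modalities.

Details

Motivation: Among existing world-modeling approaches, pixel-based stochastic video generation is computationally expensive and hard to use directly for decision making, while deterministic regression on VFM features is efficient but averages over multiple plausible futures, which lowers forecast accuracy. This work captures uncertainty through generative modeling in VFM feature space to improve forecasting.

Result: With matched architecture and compute, VFMF produces sharper and more accurate predictions than regression across all output modalities (semantic segmentation, depth, surface normals, and RGB), which supports the effectiveness of generative modeling in feature space.

Insight: The key novelty is encoding VFM features into a compact latent space suitable for diffusion. This space preserves information more effectively than PCA-based alternatives and supports decoding into multiple modalities. More broadly, the method offers a scalable generative foundation for world models that balances computational efficiency with predictive uncertainty.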

Abstract: Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.


[27] FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model cs.CV PDF

Hongbin Lin, Yiming Yang, Yifan Zhang, Chaoda Zheng, Jie Feng

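TL;DR: FutureX is a framework that enhances end-to-end autonomous-driving planners with a latent chain-of-thought (CoT) world model. An Auto-think Switch judges scene complexity and, when needed, enters a Thinking mode in which the latent world model performs a CoT-guided rollout of future scene representations and refines the trajectory, yielding more rational motion plans in dynamic traffic.

Details

Motivation: Existing end-to-end planners decide from the current scene only, which can lead to suboptimal responses in highly dynamic traffic where the future scene changes. The goal is to improve planning quality by modeling how future scenes evolve.

Result: On the NAVSIM benchmark, FutureX substantially improves existing methods such as TransFuser, with a gain of 6.2 PDMS, producing more rational motion plans and fewer collisions without compromising efficiency.

Insight: The key novelty is combining CoT reasoning with a latent world model to model future scene evolution and refine trajectories for autonomous driving. Automatic mode switching balances deeper reasoning in complex scenes with instant responses in simple ones, improving the adaptability and robustness of end-to-end planners.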

Abstract: In autonomous driving, end-to-end planners learn scene representations from raw sensor data and utilize them to generate a motion plan or control actions. However, exclusive reliance on the current scene for motion planning may result in suboptimal responses in highly dynamic traffic environments where ego actions further alter the future scene. To model the evolution of future scenes, we leverage the World Model to represent how the ego vehicle and its environment interact and change over time, which entails complex reasoning. The Chain of Thought (CoT) offers a promising solution by forecasting a sequence of future thoughts that subsequently guide trajectory refinement. In this paper, we propose FutureX, a CoT-driven pipeline that enhances end-to-end planners to perform complex motion planning via future scene latent reasoning and trajectory refinement. Specifically, the Auto-think Switch examines the current scene and decides whether additional reasoning is required to yield a higher-quality motion plan. Once FutureX enters the Thinking mode, the Latent World Model conducts a CoT-guided rollout to predict future scene representation, enabling the Summarizer Module to further refine the motion plan. Otherwise, FutureX operates in an Instant mode to generate motion plans in a forward pass for relatively simple scenes. Extensive experiments demonstrate that FutureX enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency, thereby achieving substantial overall performance gains, e.g., 6.2 PDMS improvement for TransFuser on NAVSIM. Code will be released.


[28] REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation cs.CV | cs.SD PDF

Haotian Wang, Yuzhe Weng, Xinyi Yu, Jun Du, Haoran Xu

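TL;DR: This paper proposes REST, the first diffusion-based, real-time, end-to-end streaming framework for audio-driven talking head generation. It learns a compact video latent space with a VAE that applies high spatiotemporal compression, introduces an ID-Context Cache mechanism to maintain temporal consistency and identity coherence during long streaming generation, and uses an Asynchronous Streaming Distillation (ASD) training strategy to reduce error accumulation in autoregressive generation.

Details

Motivation: Diffusion models have significantly advanced talking head generation, but slow inference and non-autoregressive paradigms severely limit practical use, so real-time, end-to-end streaming generation is needed.

Result: Experiments show that REST outperforms state-of-the-art methods in both generation speed and overall performance.

Insight: The contributions include construction of a compact video latent space, an ID-Context Cache mechanism that enables autoregressive streaming generation, and an ASD strategy that improves temporal consistency. Together they bridge the gap between autoregressive and diffusion-based approaches and offer an effective solution for real-time talking head generation.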

Abstract: Diffusion models have significantly advanced the field of talking head generation. However, the slow inference speeds and non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through high spatiotemporal VAE compression. Additionally, to enable autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles to key-value caching for maintaining temporal consistency and identity coherence during long-time streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) training strategy is proposed to mitigate error accumulation in autoregressive generation and enhance temporal consistency, which leverages a non-streaming teacher with an asynchronous noise schedule to supervise the training of the streaming student model. REST bridges the gap between autoregressive and diffusion-based approaches, demonstrating substantial value for applications requiring real-time talking head generation. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.
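
The abstract does not spell out how the ID-Context Cache is built. One plausible reading, sketched here as an assumption, is a key-value cache that pins a few identity entries permanently (in the spirit of attention sinks) and keeps a rolling window of recent context entries. The class name, parameters, and string placeholders are hypothetical. Real entries would be key-value tensors.

```python
from collections import deque

class SinkContextCache:
    """Keep a fixed set of pinned entries plus a rolling window of recent ones."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                        # pinned (identity) entries, never evicted
        self.context = deque(maxlen=window)   # recent entries, oldest evicted first

    def append(self, entry):
        """Fill the pinned slots first, then roll the context window."""
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.context.append(entry)

    def entries(self):
        """Entries visible to attention at the next generation step."""
        return self.sink + list(self.context)

cache = SinkContextCache(num_sink=1, window=3)
for chunk in ["id", "c1", "c2", "c3", "c4", "c5"]:
    cache.append(chunk)
visible = cache.entries()
print(visible)
```

A cache of this shape has constant size regardless of stream length, which is what makes long streaming generation feasible, while the pinned entries keep identity information from being evicted.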


[29] RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing cs.CV PDF

Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

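TL;DR: RoomPilot is a unified framework that generates controllable, interactive indoor scenes by parsing multi-modal inputs, such as text descriptions or CAD floor plans, into an Indoor Domain-Specific Language (IDSL). It overcomes the poor controllability of existing methods, which either accept a narrow range of input modalities or are too stochastic, and synthesizes high-quality, physically consistent 3D indoor environments with realistic object behaviors.

Details

Motivation: Existing methods for controllable, interactive indoor scene generation either handle only limited input modalities or rely on stochastic processes that hinder controllability. A unified framework is needed that parses multi-modal inputs and generates indoor scenes with interaction semantics.

Result: Extensive experiments validate RoomPilot's strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.

Insight: The core novelty is an IDSL that serves as a shared semantic representation into which inputs from different modalities are parsed, combined with a dataset of interaction-annotated assets used to synthesize scenes with realistic object behaviors. This improves functionality and controllability while maintaining visual quality.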

Abstract: Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs–textual descriptions or CAD floor plans–into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.


[30] WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering cs.CV | cs.GR PDF

Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang

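TL;DR: This paper proposes WildCap, a method for capturing high-quality facial appearance from a smartphone video recorded in the wild. A novel hybrid inverse rendering framework combines data-driven and model-based approaches to disentangle high-quality reflectance from complex lighting, and a texel grid lighting model handles the non-physical artifacts in network predictions. Together they substantially narrow the quality gap between in-the-wild capture and capture under controllable lighting.

Details

Motivation: Existing methods achieve high-quality facial appearance capture under controllable lighting, but this increases capture cost and limits usability. This paper addresses high-quality capture from smartphone video under complex in-the-wild lighting.

Result: In the same capture setup, the method achieves significantly better results than prior art and closes the quality gap between in-the-wild and controllable recordings by a large margin.

Insight: The contributions are: 1) a hybrid inverse rendering framework that combines data-driven SwitchLight preprocessing with model-based inverse rendering; 2) a texel grid lighting model that explains non-physical artifacts as clean albedo illuminated by local physical lighting; 3) jointly sampling a diffusion prior for reflectance maps and optimizing the lighting during optimization, which resolves the scale ambiguity between local lights and albedo.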

Abstract: Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior arts in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin. Our code will be released at https://yxuhan.github.io/WildCap/index.html.


[31] Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition cs.CV PDF

Wen-Jue He, Xiaofeng Zhu, Zheng Zhang

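TL;DR: This paper proposes Cross-modal Prompting (ComP), a method that addresses the performance gap and modality under-optimization problem in incomplete multi-modal emotion recognition (IMER). A progressive prompt generation module with a dynamic gradient modulator produces consistent modality semantic cues, cross-modal knowledge propagation strengthens the discriminability of modality-specific features, and a coordinator dynamically re-weights modality outputs to improve overall recognition accuracy.

Details

Motivation: IMER aims to understand human intentions and sentiments from partially observed multi-source data. In practice, the performance gap and modality under-optimization problem hinder effective multi-modal learning, and missing data makes them worse.

Result: Extensive experiments on 4 datasets, comparing against 7 SOTA methods under different missing rates, validate the effectiveness of the proposed method.

Insight: The contributions include a progressive prompt generation module with a dynamic gradient modulator that produces concise and consistent modality semantic prompts; cross-modal knowledge propagation that selectively amplifies the consistent information in modality features; and a coordinator that dynamically re-weights modality outputs as a complement to the balance strategy. These designs strengthen modality-specific features and lift each modality's performance, improving emotion recognition when modalities are incomplete.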

Abstract: Incomplete multi-modal emotion recognition (IMER) aims at understanding human intentions and sentiments by comprehensively exploring the partially observed multi-source data. Although the multi-modal data is expected to provide more abundant information, the performance gap and modality under-optimization problem hinder effective multi-modal learning in practice, and are exacerbated in the confrontation of the missing data. To address this issue, we devise a novel Cross-modal Prompting (ComP) method, which emphasizes coherent information by enhancing modality-specific features and improves the overall recognition accuracy by boosting each modality’s performance. Specifically, a progressive prompt generation module with a dynamic gradient modulator is proposed to produce concise and consistent modality semantic cues. Meanwhile, cross-modal knowledge propagation selectively amplifies the consistent information in modality features with the delivered prompts to enhance the discrimination of the modality-specific output. Additionally, a coordinator is designed to dynamically re-weight the modality outputs as a complement to the balance strategy to improve the model’s efficacy. Extensive experiments on 4 datasets with 7 SOTA methods under different missing rates validate the effectiveness of our proposed method.


[32] PersonaLive! Expressive Portrait Image Animation for Live Streaming cs.CV PDF

Zhiyuan Li, Chi-Man Pun, Chen Fang, Jue Wang, Xiaodong Cun

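TL;DR: This paper proposes PersonaLive, a novel diffusion-based framework for real-time portrait animation aimed at live streaming. It uses hybrid implicit signals (implicit facial representations and 3D implicit keypoints) for expressive image-level motion control, a fewer-step appearance distillation strategy to improve inference efficiency, and an autoregressive micro-chunk streaming generation paradigm to support low-latency, stable long-term video generation.

Details

Motivation: Current diffusion-based portrait animation models focus mainly on visual quality and expression realism while overlooking generation latency and real-time performance, which restricts their use in live streaming.

Result: Extensive experiments show that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.

Insight: The contributions include hybrid implicit signals for precise motion control; a fewer-step appearance distillation strategy that removes appearance redundancy in the denoising process and improves efficiency; and an autoregressive micro-chunk streaming paradigm with a sliding training strategy and a historical keyframe mechanism for low-latency, stable long-term video generation. Together these balance generation quality against real-time performance and offer a practical solution for live streaming.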

Abstract: Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.


[33] Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers cs.CV PDF

Ali El Bellaj, Mohammed-Amine Cheddadi, Rhassan Berber

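TL;DR: This paper examines whether the Reformer architecture is viable as a vision backbone and compares it experimentally with the Vision Transformer (ViT). Reformer achieves higher accuracy on a small-scale dataset (CIFAR-10), but in more realistic settings, namely ImageNet-100 and high-resolution medical images, ViT outperforms Reformer in practical efficiency and end-to-end computation time.

Details

Motivation: Global self-attention in standard ViTs scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. This paper investigates whether Reformer, which has lower theoretical complexity thanks to locality-sensitive hashing (LSH) attention, can be a more efficient vision backbone.

Result: On CIFAR-10, Reformer achieves higher accuracy than the ViT baseline. On ImageNet-100 and the high-resolution medical imaging dataset, ViT outperforms Reformer in practical efficiency and end-to-end computation time.

Insight: The contribution is a systematic application of Reformer (patch-based tokenization combined with LSH attention) to vision tasks, with a comprehensive experimental evaluation. The analysis shows that, despite the theoretical complexity advantage of LSH attention, its practical computational gains are limited at the sequence lengths produced by typical high-resolution images, exposing a gap between theoretical complexity and practical efficiency.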

Abstract: Transformers have recently demonstrated strong performance in computer vision, with Vision Transformers (ViTs) leveraging self-attention to capture both low-level and high-level image features. However, standard ViTs remain computationally expensive, since global self-attention scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. In this work, we investigate the Reformer architecture as an alternative vision backbone. By combining patch-based tokenization with locality-sensitive hashing (LSH) attention, our model approximates global self-attention while reducing its theoretical time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ in the sequence length $n$. We evaluate the proposed Reformer-based vision model on CIFAR-10 to assess its behavior on small-scale datasets, on ImageNet-100 to study its accuracy–efficiency trade-off in a more realistic setting, and on a high-resolution medical imaging dataset to evaluate the model under longer token sequences. While the Reformer achieves higher accuracy on CIFAR-10 compared to our ViT-style baseline, the ViT model consistently outperforms the Reformer in our experiments in terms of practical efficiency and end-to-end computation time across the larger and higher-resolution settings. These results suggest that, despite the theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images.
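The gap between the $\mathcal{O}(n^2)$ and $\mathcal{O}(n \log n)$ terms can be made concrete with a little arithmetic. The sketch below is not from the paper: the patch sizes and image resolutions are illustrative assumptions, and the ratio ignores every constant factor (hash rounds, bucket sorting, chunking), which is where LSH attention pays its practical overhead.

```python
import math

def num_tokens(height, width, patch):
    """Count non-overlapping patch tokens (assumes the patch size divides both sides)."""
    return (height // patch) * (width // patch)

def idealized_ratio(n):
    """Ratio of the n^2 term to the n*log2(n) term, with all constants ignored."""
    return (n * n) / (n * math.log2(n))

# Hypothetical settings: (image side, patch size); not the paper's configurations.
for side, patch in ((32, 4), (224, 16), (1024, 16)):
    n = num_tokens(side, side, patch)
    print(f"{side}x{side}, patch {patch}: n={n}, idealized ratio={idealized_ratio(n):.1f}")
```

Under these assumptions a 224x224 image yields only 196 tokens, so the idealized advantage is modest before any constant-factor overhead is counted, which is in line with the abstract's conclusion that meaningful gains need much longer sequences.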


[34] FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion cs.CVPDF

Xiangyang Luo, Qingyu Li, Xiaokun Liu, Wenyu Qin, Miao Yang

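TL;DR: This paper proposes FilmWeaver, a framework that addresses two weaknesses of existing video generation models on multi-shot videos: difficulty keeping characters and backgrounds consistent, and inability to flexibly generate videos of arbitrary length. The framework adopts an autoregressive diffusion paradigm and uses a dual-level cache mechanism (shot memory and temporal memory) to handle inter-shot consistency and intra-shot coherence respectively, supporting arbitrary-length, multi-shot video generation and multi-round user interaction.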

Details

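Motivation: Existing video generation models perform well at single-shot synthesis but face critical challenges with multi-shot videos: maintaining character and background consistency across shots, and flexibly generating videos of arbitrary length and shot count.

Result: Extensive experiments show that the method surpasses existing approaches on both consistency and aesthetic quality metrics, opening up new possibilities for more consistent, controllable, and narrative-driven video content.

Insight: The core idea is to decouple the consistency challenge of multi-shot video generation into two sub-problems, inter-shot consistency and intra-shot coherence, and to address each with one level of a dual-level cache mechanism (shot memory and temporal memory). The decoupled design also makes the method versatile, supporting downstream tasks such as multi-concept injection and video extension. The paper additionally builds a high-quality multi-shot video dataset for training.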

Abstract: Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce \textbf{FilmWeaver}, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: https://filmweaver.github.io
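The dual-level cache described in the abstract can be pictured as two bounded buffers with different lifetimes. The following is a minimal data-structure sketch only: the class name, method names, capacities, and eviction policy are hypothetical, and how the cached frames condition the diffusion model is not shown because the abstract does not specify it.

```python
from collections import deque

class DualLevelCache:
    """Hypothetical sketch: shot memory persists across shots, temporal memory is per shot."""

    def __init__(self, shot_capacity=4, temporal_capacity=8):
        # Keyframes from preceding shots (character and scene identity).
        self.shot_memory = deque(maxlen=shot_capacity)
        # Recent frames of the current shot (smooth, continuous motion).
        self.temporal_memory = deque(maxlen=temporal_capacity)

    def add_frame(self, frame):
        """Record a newly generated frame of the current shot; oldest frames are evicted."""
        self.temporal_memory.append(frame)

    def end_shot(self, keyframe):
        """Store one keyframe for the finished shot and start the next shot with no history."""
        self.shot_memory.append(keyframe)
        self.temporal_memory.clear()

    def context(self):
        """Return (inter-shot context, intra-shot context) for the next generation step."""
        return list(self.shot_memory), list(self.temporal_memory)

cache = DualLevelCache(shot_capacity=2, temporal_capacity=3)
for i in range(5):
    cache.add_frame(f"shot0_frame{i}")
cache.end_shot("shot0_frame4")
print(cache.context())
```

Keeping the two buffers separate mirrors the decoupling in the abstract: clearing the temporal memory at a shot boundary permits an abrupt cut, while the shot memory still carries identity information forward.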


[35] Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context cs.CVPDF

Cuifeng Shen, Lumin Xu, Xingguo Zhu, Gengdai Liu

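TL;DR: This paper proposes the Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs video by decoupling temporal and spatial context. It processes frames autoregressively, combining a downsampled flow field with spatial relative compensation to achieve high compression efficiency without information loss, and it attains strong reconstruction quality with lightweight models and small-scale training data.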

Details

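Motivation: Existing video autoencoders usually entangle spatial and temporal information, which limits their ability to capture temporal consistency and leads to suboptimal performance. The paper addresses this by decoupling the temporal and spatial representations to improve the efficiency and quality of video compression and reconstruction.

Result: Extensive experiments show that ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data. Evaluations on video generation tasks also highlight its strong potential for downstream applications.

Insight: The contribution is a temporal-spatial decoupled representation that separates temporal motion (a flow field) from spatial supplement (compensation for newly emerged content), combined with an autoregressive framework and a multi-stage training strategy. This maintains high compression efficiency while avoiding information loss and strengthens temporal consistency.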

Abstract: Video autoencoders compress videos into compact latent representations for efficient reconstruction, playing a vital role in enhancing the quality and efficiency of video generation. However, existing video autoencoders often entangle spatial and temporal information, limiting their ability to capture temporal consistency and leading to suboptimal performance. To address this, we propose Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner, allowing flexible processing of videos with arbitrary lengths. ARVAE introduces a temporal-spatial decoupled representation that combines downsampled flow field for temporal coherence with spatial relative compensation for newly emerged content, achieving high compression efficiency without information loss. Specifically, the encoder compresses the current and previous frames into the temporal motion and spatial supplement, while the decoder reconstructs the original frame from the latent representations given the preceding frame. A multi-stage training strategy is employed to progressively optimize the model. Extensive experiments demonstrate that ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data. Moreover, evaluations on video generation tasks highlight its strong potential for downstream applications.


[36] Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining cs.CV | cs.AI | cs.HCPDF

Yasaman Hashem Pour, Nazanin Mahjourian, Vinh Nguyen

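TL;DR: This paper proposes a few-shot vision-language model (VLM) approach to G-code and HMI verification that simultaneously evaluates manually generated G-code and the machine's human-machine interface (HMI) display for errors and safety status in CNC machining. It uses a dataset of paired G-code text and HMI screenshots covering both correct and erroneous cases, and applies few-shot prompting with a structured JSON schema based on prior heuristic knowledge to improve detection of HMI errors and G-code discrepancies.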

Details

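Motivation: Prior G-code verification work mainly uses large language models (LLMs), which check only errors in the written program. CNC machining, however, depends heavily on knowledge of the HMI, which displays machine status and errors, and LLMs cannot use HMI information because they lack access to the vision modality.

Result: On the dataset collected from a 15-slant-PRO lathe, evaluated by per-slot accuracy across multiple scenarios of incorrect G-code and HMI errors, few-shot prompting improved overall detection of HMI errors and of discrepancies with the G-code relative to a zero-shot VLM.

Insight: The contribution is to combine the visual and textual modalities and to enable few-shot learning through a structured JSON schema, so that a VLM can verify G-code and the HMI together, giving a new framework for comprehensive debugging of manually written code in CNC training. Viewed objectively, the method builds domain prior knowledge into prompt engineering and compensates for the limits of text-only models in vision-dependent industrial applications.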

Abstract: Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large-Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated in comparison to a zero-shot VLM through multiple scenarios of incorrect G-code and HMI errors with respect to per-slot accuracy. The VLM showed that few-shot prompting led to overall enhancement of detecting HMI errors and discrepancies with the G-code for more comprehensive debugging. Therefore, the proposed framework was demonstrated to be suitable for verification of manually generated G-code that is typically developed in CNC training.


[37] MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction cs.CVPDF

Bate Li, Houqiang Zhong, Zhengxue Cheng, Qiang Hu, Qiang Wang

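TL;DR: This paper presents MultiEgo, the first multi-view egocentric video dataset for 4D dynamic scene reconstruction. The dataset contains five canonical social interaction scenes (meetings, performances, and a presentation), each with five authentic egocentric videos captured by participants wearing AR glasses. A hardware-based acquisition system and processing pipeline provide sub-millisecond temporal synchronization and accurate pose annotations. Experiments validate the dataset's practical utility and effectiveness for free-viewpoint video (FVV) applications.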

Details

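Motivation: Existing reconstruction datasets focus on static multi-view or single egocentric-view setups, and no multi-view egocentric dataset exists for dynamic scene reconstruction, even though this task has significant research value for applications such as holographic documentation of social interactions.

Result: Experimental validation shows that the dataset is practical and effective for free-viewpoint video (FVV) applications, establishing it as a foundational resource for multi-view egocentric dynamic scene reconstruction research.

Insight: The main contribution is the first multi-view egocentric video dataset built specifically for 4D dynamic scene reconstruction, with a hardware system that delivers high-precision temporal synchronization and pose annotations. It fills a data gap in the field and provides key infrastructure for related research.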

Abstract: Multi-view egocentric dynamic scene reconstruction holds significant research value for applications in holographic documentation of social interactions. However, existing reconstruction datasets focus on static multi-view or single-egocentric view setups, lacking multi-view egocentric datasets for dynamic scene reconstruction. Therefore, we present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Each scene provides five authentic egocentric videos captured by participants wearing AR glasses. We design a hardware-based data acquisition system and processing pipeline, achieving sub-millisecond temporal synchronization across views, coupled with accurate pose annotations. Experiment validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications, establishing MultiEgo as a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research.


[38] KeyframeFace: From Text to Expressive Facial Keyframes cs.CVPDF

Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang

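TL;DR: This paper presents KeyframeFace, a large-scale multimodal dataset for generating expressive 3D facial animation from text, together with the first text-to-animation framework that uses LLM priors for interpretable facial motion synthesis. The dataset contains 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations generated by LLMs and MLLMs. The framework aligns the semantic understanding of LLMs with the interpretable structure of ARKit coefficients to produce high-fidelity expressive animation.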

Details

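Motivation: Existing datasets and methods focus mainly on speech-driven animation or unstructured expression sequences, and lack the semantic grounding and temporal structure needed to generate expressive human performances. The paper targets the need to understand temporally structured semantics and fine-grained expression changes when generating dynamic 3D facial animation from natural language.

Result: The paper introduces a new dataset and framework, but the abstract reports no specific quantitative results (such as comparisons with SOTA methods) or performance on particular benchmarks.

Insight: The main contributions are: 1) KeyframeFace, the first large-scale multimodal text-to-animation dataset with keyframe-level supervision and rich annotations; 2) the first text-to-animation framework that explicitly uses LLM priors for interpretable facial motion synthesis, combining the semantic understanding of LLMs with the interpretable structure of ARKit coefficients. Together they lay a foundation for interpretable, keyframe-guided, and context-aware text-to-animation research.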

Abstract: Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit’s coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at https://github.com/wjc12345123/KeyframeFace.


[39] HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning cs.CV | cs.CL | cs.MMPDF

Yiqing Yang, Kin-Man Lam

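TL;DR: This paper proposes HFS, an end-to-end trainable, task-adaptive key frame selection framework for efficient video reasoning. A Chain-of-Thought-guided small language model generates task-specific implicit query vectors, which are combined with multimodal features for dynamic frame scoring. A continuous set-level objective covering relevance, coverage, and redundancy is optimized differentiably with Gumbel-Softmax to select the best combination of frames. In addition, student-teacher mutual learning aligns the frame importance distributions of the student selector and the teacher reasoner via KL divergence, enabling end-to-end optimization without static pseudo labels.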

Details

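Motivation: Traditional key frame selection methods in video understanding, such as top-K selection with independent scoring, often select temporally clustered and visually redundant frames. In addition, training lightweight selectors on pseudo labels generated offline by an MLLM prevents the supervisory signal from adapting dynamically to task objectives.

Result: Experiments on several benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, show that the method significantly outperforms existing approaches.

Insight: The contributions are: 1) task-adaptive implicit query vectors for dynamic frame scoring; 2) a continuous, multi-objective set-level function that makes the selection of frame combinations differentiable; 3) student-teacher mutual learning for end-to-end training without static pseudo labels. Viewed objectively, the method models frame selection as a differentiable set optimization problem and draws on the reasoning ability of language models, offering a new approach to efficient frame selection in video understanding.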

Abstract: Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
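To illustrate why a set-level objective behaves differently from independent top-K scoring, the sketch below scores a whole selection by relevance, temporal coverage, and pairwise redundancy, and shows a Gumbel-Softmax relaxation that produces soft inclusion weights. Everything here is an assumption made for illustration: the exact form of the paper's objective, the weighting terms `lam_cov` and `lam_red`, the bin-based coverage term, and the `soft_select` helper are not taken from the paper.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = [(l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))) / tau for l in logits]
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_select(logits, k, tau=0.5, rng=random):
    """Hypothetical relaxation: sum k relaxed one-hot draws into weights in [0, 1]."""
    draws = [gumbel_softmax(logits, tau, rng) for _ in range(k)]
    return [min(1.0, sum(d[i] for d in draws)) for i in range(len(logits))]

def set_objective(weights, relevance, similarity, bins, lam_cov=1.0, lam_red=1.0):
    """Illustrative set-level score: relevance + coverage - redundancy."""
    n = len(weights)
    rel = sum(w * r for w, r in zip(weights, relevance))
    red = sum(weights[i] * weights[j] * similarity[i][j]
              for i in range(n) for j in range(i + 1, n))
    cov = 0.0
    for b in set(bins):
        miss = 1.0  # soft probability that no selected frame falls in temporal bin b
        for w, frame_bin in zip(weights, bins):
            if frame_bin == b:
                miss *= 1.0 - w
        cov += 1.0 - miss
    return rel + lam_cov * cov - lam_red * red

# Toy data: frames 0 and 1 are near-duplicates in the same temporal bin.
relevance = [0.9, 0.85, 0.6, 0.5]
bins = [0, 0, 1, 1]
similarity = [[1.0, 0.95, 0.1, 0.1],
              [0.95, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.1],
              [0.1, 0.1, 0.1, 1.0]]
top_k = [1.0, 1.0, 0.0, 0.0]    # independent top-2 by relevance
spread = [1.0, 0.0, 1.0, 0.0]   # lower total relevance, but diverse
print(set_objective(top_k, relevance, similarity, bins))
print(set_objective(spread, relevance, similarity, bins))
print(soft_select([2.0, 1.0, 0.1, 0.0], k=2, rng=random.Random(0)))
```

On this toy input the diverse selection scores higher than the top-2 selection because the redundancy penalty and the coverage bonus outweigh the small loss in relevance. The soft weights from `soft_select` can be passed to `set_objective` directly, which is what would make such an objective trainable by gradient descent in an autodiff framework.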


[40] MLLM Machine Unlearning via Visual Knowledge Distillation cs.CV | cs.AIPDF

Yuhang Wang, Zhenxing Niu, Haoxuan Ji, Guangyu He, Haichang Gao

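TL;DR: This paper proposes a machine unlearning method for multimodal large language models (MLLMs). Its Visual Knowledge Distillation (VKD) scheme selectively erases target visual knowledge while preserving textual knowledge, and it outperforms existing methods in both effectiveness and efficiency.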

Details

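Motivation: Existing machine unlearning methods are mostly tailored to LLMs, and MLLM-oriented unlearning is still at an early stage. The paper addresses how to remove sensitive visual information from an MLLM while preserving overall model performance.

Result: Extensive experiments show that the method outperforms state-of-the-art unlearning methods in both effectiveness and efficiency. The paper is also the first to evaluate the robustness of MLLM unlearning against relearning attacks.

Insight: The contribution is to disentangle the visual and textual knowledge in an MLLM and to distill knowledge using the model's intermediate visual representations as supervision signals instead of relying on output-level supervision. The method fine-tunes only the visual components, which makes it efficient, and it is the first to evaluate unlearning robustness.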

Abstract: Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains at its early stage. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks.


[41] DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry cs.CV | cs.AI | cs.CLPDF

Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li

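TL;DR: This paper presents DentalGPT, a multimodal large language model (MLLM) specialized for dentistry. Domain knowledge is injected through a large, high-quality dental multimodal dataset (over 120k dental images with detailed descriptions), and reinforcement learning then strengthens multimodal complex reasoning. On intraoral and panoramic benchmarks and on dental subsets of medical VQA benchmarks, DentalGPT performs strongly in disease classification and dental VQA, outperforming many SOTA models despite having only 7B parameters.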

Details

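Motivation: Current MLLMs fall short in dentistry: they struggle to capture fine-grained dental visual details and lack the reasoning ability needed for precise diagnosis, which holds back automated oral healthcare.

Result: Comprehensive evaluations on intraoral and panoramic benchmarks and on dental subsets of medical VQA benchmarks show that DentalGPT achieves superior performance in disease classification and dental VQA, outperforming many state-of-the-art MLLMs with only 7B parameters.

Insight: The main contributions are: 1) the largest annotated dental multimodal dataset to date, with descriptions that highlight diagnostically relevant visual features, providing a high-quality basis for domain knowledge injection; 2) a staged adaptation strategy (training on high-quality data followed by reinforcement learning) that improves the model's understanding of dental visual details and its complex reasoning, offering an effective route to domain-specialized MLLMs.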

Abstract: Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM’s visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.


[42] Physics-Informed Video Flare Synthesis and Removal Leveraging Motion Independence between Flare and Scene cs.CVPDF

Junqiao Wang, Yuanfei Huang, Hua Huang

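TL;DR: This paper proposes a physics-informed method for dynamic flare synthesis and removal that targets lens flare in video. By simulating light source motion and modeling the temporal behavior of scattering and reflective flares, the authors build the first video flare dataset. They also design a video flare removal network that combines an attention module with a Mamba-based temporal modeling component, mitigating the effects of the independent motion of flare and scene content and improving removal performance.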

Details

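Motivation: Existing flare removal research focuses mainly on images, and the spatiotemporal characteristics of video flare remain largely unexplored. Because flare, light sources, and scene content move in complex and mutually independent ways, video flare synthesis and removal is considerably harder and often produces flicker and artifacts.

Result: Extensive experiments on the first video flare dataset, which includes synthetic paired videos and real-world videos from the Internet, show that the method consistently outperforms existing video-based restoration methods and image-based flare removal methods on both real and synthetic videos. It removes dynamic flares while preserving light source integrity and the spatiotemporal consistency of the scene.

Insight: The contributions are: a physics-informed dynamic flare synthesis pipeline that simulates light source motion and models the temporal behavior of flares; a video flare removal network that combines spatial attention-based suppression with Mamba-based temporal modeling, capturing long-range spatiotemporal dependencies without multi-frame alignment and alleviating temporal aliasing between flares and scene content; and the first video flare dataset for comprehensive evaluation.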

Abstract: Lens flare is a degradation phenomenon caused by strong light sources. Existing research on flare removal has mainly focused on images, while the spatiotemporal characteristics of video flare remain largely unexplored. Video flare synthesis and removal pose significantly greater challenges than in images, owing to the complex and mutually independent motion of flare, light sources, and scene content. This motion independence further affects restoration performance, often resulting in flicker and artifacts. To address this issue, we propose a physics-informed dynamic flare synthesis pipeline, which simulates light source motion using optical flow and models the temporal behaviors of both scattering and reflective flares. Meanwhile, we design a video flare removal network that employs an attention module to spatially suppress flare regions and incorporates a Mamba-based temporal modeling component to capture long-range spatiotemporal dependencies. This motion-independent spatiotemporal representation effectively eliminates the need for multi-frame alignment, alleviating temporal aliasing between flares and scene content and thereby improving video flare removal performance. Building upon this, we construct the first video flare dataset to comprehensively evaluate our method, which includes a large set of synthetic paired videos and additional real-world videos collected from the Internet to assess generalization capability. Extensive experiments demonstrate that our method consistently outperforms existing video-based restoration and image-based flare removal methods on both real and synthetic videos, effectively removing dynamic flares while preserving light source integrity and maintaining spatiotemporal consistency of the scene.


[43] UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models cs.CVPDF

Hewen Pan, Cong Wei, Dashuang Liang, Zepeng Huang, Pengfei Gao

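TL;DR: This paper presents UFVideo, the first video large language model with unified multi-grained cooperative understanding capabilities. Through a unified visual-language guided alignment mechanism, a single model flexibly handles video understanding at global, pixel, and temporal scales and generates textual responses, temporal localization, or masks.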

Details

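Motivation: Existing Video LLMs are limited to specialized video understanding tasks and cannot achieve comprehensive, multi-grained video perception, which motivates a model that handles multi-grained video understanding tasks in a unified way.

Result: On the self-built UFVideo-Bench, which comprises three collaborative tasks, UFVideo shows flexibility and advantages over GPT-4o. Its effectiveness is also validated on 9 public benchmarks covering a range of common video understanding tasks.

Insight: The contribution is a unified visual-language guided alignment mechanism that lets a single model dynamically encode the visual and text inputs of different tasks and generate diverse outputs (text, temporal localization, masks), giving a unified framework for multi-grained video understanding.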

Abstract: With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform on holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the UFVideo-Bench consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo’s flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.


[44] Task-Specific Distance Correlation Matching for Few-Shot Action Recognition cs.CVPDF

Fei Long, Yao Zhang, Jiaming Lv, Jiangtao Xie, Peihua Li

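TL;DR: This paper proposes TS-FSAR, a framework that targets two key limitations in few-shot action recognition: existing set matching methods use only cosine similarity and instance-level information, so they miss nonlinear relationships and task-specific cues; and the side layers introduced for efficient CLIP fine-tuning are hard to optimize with limited data. The framework comprises a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning, a Task-Specific Distance Correlation Matching (TS-DCM) metric that models both linear and nonlinear inter-frame dependencies and uses a task prototype for task-specific matching, and a Guiding LSN with Adapted CLIP (GLAC) module that regularizes LSN training. On five widely used benchmarks, TS-FSAR outperforms prior state-of-the-art methods.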

Details

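Motivation: The paper addresses two limitations of existing few-shot action recognition methods: set matching metrics rely only on cosine similarity and instance-level information, so they fail to capture complex nonlinear relationships and task-specific cues; and the side layers newly introduced for efficient CLIP fine-tuning are difficult to optimize under limited data.

Result: In extensive experiments on five widely used benchmarks (e.g., Kinetics and Something-Something V2), TS-FSAR achieves better performance than prior state-of-the-art (SOTA) methods.

Insight: The contributions are: α-distance correlation to model both linear and nonlinear inter-frame dependencies; a task prototype for task-specific matching; and the GLAC module, which regularizes side-layer training with the frozen CLIP to improve correlation estimation under limited supervision. Viewed objectively, the method combines a distance correlation metric with a task-aware mechanism to strengthen representation matching in the few-shot setting, and its regularization strategy eases the side-layer optimization problem.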

Abstract: Few-shot action recognition (FSAR) has recently made notable progress through set matching and efficient adaptation of large-scale pre-trained models. However, two key limitations persist. First, existing set matching metrics typically rely on cosine similarity to measure inter-frame linear dependencies and then perform matching with only instance-level information, thus failing to capture more complex patterns such as nonlinear relationships and overlooking task-specific cues. Second, for efficient adaptation of CLIP to FSAR, recent work performing fine-tuning via skip-fusion layers (which we refer to as side layers) has significantly reduced memory cost. However, the newly introduced side layers are often difficult to optimize under limited data conditions. To address these limitations, we propose TS-FSAR, a framework comprising three components: (1) a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) a metric called Task-Specific Distance Correlation Matching (TS-DCM), which uses $α$-distance correlation to model both linear and nonlinear inter-frame dependencies and leverages a task prototype to enable task-specific matching; and (3) a Guiding LSN with Adapted CLIP (GLAC) module, which regularizes LSN using the adapted frozen CLIP to improve training for better $α$-distance correlation estimation under limited supervision. Extensive experiments on five widely-used benchmarks demonstrate that our TS-FSAR yields superior performance compared to prior state-of-the-arts.
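The reason distance correlation can capture what a linear measure misses is easy to show on scalar data. The sketch below implements the standard sample distance correlation (pairwise distances, double centering, normalization) with an `alpha` exponent on the distances, and compares it with Pearson correlation on a purely quadratic relationship. It is a generic textbook-style illustration on 1-D sequences, not the paper's TS-DCM metric, which operates on frame features and uses a task prototype.

```python
import math

def _centered_distances(values, alpha):
    """Pairwise |v_i - v_j|**alpha, double-centered by row, column, and grand means."""
    n = len(values)
    d = [[abs(values[i] - values[j]) ** alpha for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    # The distance matrix is symmetric, so row means equal column means.
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(x, y, alpha=1.0):
    """Sample distance correlation in [0, 1]; alpha is the distance exponent, 0 < alpha < 2."""
    n = len(x)
    a = _centered_distances(x, alpha)
    b = _centered_distances(y, alpha)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)
    dvar_x = sum(v * v for r in a for v in r) / (n * n)
    dvar_y = sum(v * v for r in b for v in r) / (n * n)
    if dvar_x == 0.0 or dvar_y == 0.0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

def pearson(x, y):
    """Plain Pearson correlation, which measures only linear dependence."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_quadratic = [v * v for v in x]       # nonlinear dependence, zero linear correlation
y_linear = [2.0 * v + 1.0 for v in x]  # perfect linear dependence
print(pearson(x, y_quadratic), distance_correlation(x, y_quadratic))
print(pearson(x, y_linear), distance_correlation(x, y_linear))
```

For the quadratic pair, Pearson correlation is zero while distance correlation is clearly positive; for the linear pair both reach one. This is the property the abstract appeals to when it contrasts distance correlation with cosine-similarity-based matching.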


[45] Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture cs.CV | cs.AIPDF

Tanu Singh, Pranamesh Chakraborty, Long T. Truong

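TL;DR: This paper proposes a Transformer-based method for detecting traffic accidents in surveillance video. To address the limited spatiotemporal understanding and poor cross-domain generalization of existing methods, the authors build a comprehensive, balanced dataset and design a model that uses convolutional layers to extract local features and a Transformer to capture temporal dependencies. They also evaluate several strategies for fusing motion cues and find that concatenating RGB features with optical flow works best, reaching 88.3% accuracy.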

Details

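Motivation: Road traffic accidents are a leading cause of death worldwide. Traditional computer vision methods for accident detection suffer from limited spatiotemporal understanding and poor cross-domain generalization, and existing Transformer applications are constrained by small, non-diverse datasets, which hinders the development of robust, generalizable systems.

Result: On the curated dataset, the proposed model reaches its highest accuracy, 88.3%, by concatenating RGB features with optical flow. The results are also compared with vision-language models such as GPT, Gemini, and LLaVA-NeXT-Video to assess the method's effectiveness.

Insight: The contributions are a comprehensive, balanced dataset covering a wide range of traffic environments, accident types, and contextual variations; a hybrid architecture that combines convolutional local feature extraction with Transformer temporal modeling; and a systematic evaluation of motion cue fusion strategies that underlines the importance of dynamic information in accident detection.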

Abstract: Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy. Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision language models (VLM) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.


[46] Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video cs.CVPDF

Meng-Li Shih, Ying-Huan Chen, Yu-Lun Liu, Brian Curless

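TL;DR: This paper presents a fully automatic pipeline for reconstructing dynamic scenes from monocular RGB video by enhancing the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps produces object-level masks, which guide an object-depth loss and skeleton-based sampling. A virtual-view depth loss and a scaffold-projection loss then refine the reconstruction, yielding dynamic scene reconstruction and rendering that surpass existing methods.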

Details

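Motivation: When reconstructing dynamic scenes from casually captured monocular video, existing methods fall short in tracking thin structures, depth consistency, and motion coherence. The paper aims to improve reconstruction accuracy and rendering quality.

Result: The method surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings, setting a new state of the art (SOTA).

Insight: The contribution is to generate fine object-level masks from video segmentation and epipolar-error maps and to use them to guide the depth loss and track sampling, while a virtual-view depth loss and a scaffold-projection loss refine the reconstruction stage. This removes floaters, preserves fine geometry and coherent motion, and improves the robustness and accuracy of dynamic scene reconstruction.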

Abstract: We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps yields object-level masks that closely follow thin structures; these masks (i) guide an object-depth loss that sharpens the consistent video depth, and (ii) support skeleton-based sampling plus mask-guided re-identification to produce reliable, comprehensive 2-D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual-view depth loss removes floaters, and a scaffold-projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings.


[47] Assisted Refinement Network Based on Channel Information Interaction for Camouflaged and Salient Object Detection cs.CVPDF

Kuan Wang, Yanjun Qin, Mengge Lu, Liejun Wang, Xiaoming Tao

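TL;DR: This paper proposes ARNet-v2, an assisted refinement network based on channel information interaction, for camouflaged object detection (COD) and salient object detection (SOD). A Channel Information Interaction Module (CIIM) strengthens cross-channel interaction within same-layer features, and a collaborative decoding architecture guided by boundary and region priors jointly improves segmentation accuracy for object regions and boundaries.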

Details

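Motivation: Current COD methods have two key problems in the decoding stage: insufficient cross-channel information interaction within same-layer features, which limits feature expressiveness; and an inability to co-model boundary and region information effectively, which makes it hard to reconstruct complete object regions and sharp boundaries.

Result: Extensive experiments on four COD benchmark datasets validate the model's effectiveness and state-of-the-art (SOTA) performance. The model also transfers to the SOD task and adapts well to downstream tasks including polyp segmentation, transparent object detection, and industrial and road defect detection.

Insight: The contributions are the Channel Information Interaction Module (CIIM), which reorganizes features and exchanges information across channels, and a collaborative architecture in which Boundary Extraction (BE) and Region Extraction (RE) modules generate priors and hybrid attention collaboratively calibrates the decoded features, overcoming semantic ambiguity and imprecise boundaries. A Multi-scale Enhancement (MSE) module further enriches contextual feature representations.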

Abstract: Camouflaged Object Detection (COD) stands as a significant challenge in computer vision, dedicated to identifying and segmenting objects visually highly integrated with their backgrounds. Current mainstream methods have made progress in cross-layer feature fusion, but two critical issues persist during the decoding stage. The first is insufficient cross-channel information interaction within the same-layer features, limiting feature expressiveness. The second is the inability to effectively co-model boundary and region information, making it difficult to accurately reconstruct complete regions and sharp boundaries of objects. To address the first issue, we propose the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension. This module performs feature reorganization and interaction across channels to effectively capture complementary cross-channel information. To address the second issue, we construct a collaborative decoding architecture guided by prior knowledge. This architecture generates boundary priors and object localization maps through Boundary Extraction (BE) and Region Extraction (RE) modules, then employs hybrid attention to collaboratively calibrate decoded features, effectively overcoming semantic ambiguity and imprecise boundaries. Additionally, the Multi-scale Enhancement (MSE) module enriches contextual feature representations. Extensive experiments on four COD benchmark datasets validate the effectiveness and state-of-the-art performance of the proposed model. We further transferred our model to the Salient Object Detection (SOD) task and demonstrated its adaptability across downstream tasks, including polyp segmentation, transparent object detection, and industrial and road defect detection. Code and experimental results are publicly available at: https://github.com/akuan1234/ARNet-v2.


[48] The N-Body Problem: Parallel Execution from Single-Person Egocentric Video cs.CVPDF

Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen

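TL;DR: This paper introduces the N-Body Problem: given one person's egocentric video, how could N people carry out the same set of tasks in parallel so as to maximize speed-up? Because naive assignment of video segments causes spatial, object, and causal conflicts, the authors propose a suite of evaluation metrics and a structured prompting strategy that guides a vision-language model to reason about the 3D environment, object usage, and temporal dependencies and produce a feasible parallel execution plan.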

Details

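Motivation: Humans can intuitively parallelize complex activities, but can a model learn this from observing a single person? The paper addresses how to derive a plan for multiple people to perform the same set of tasks in parallel from one egocentric video, while keeping the plan physically feasible and logically consistent.

Result: Evaluated on 100 videos from EPIC-Kitchens and HD-EPIC with N = 2, the method raises action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while reducing the spatial collision rate, object conflicts, and causal conflicts by 55%, 45%, and 55% respectively.

Insight: The core contribution is to formalize multi-person parallel task planning as an optimization problem under physical and causal constraints (the N-Body Problem) and to design a comprehensive set of metrics for both performance and feasibility. The structured prompting strategy guides large models to reason about 3D space and temporal logic, offering a new approach to learning complex, executable parallel plans from observation.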

Abstract: Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously slashing collision rates, object conflicts, and causal conflicts by 55%, 45%, and 55% respectively.
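A small sketch may help show how the performance and feasibility sides of such a plan pull against each other. The schedule format, field names, and metric definitions below are hypothetical simplifications written for illustration; the abstract does not give the paper's exact definitions, and spatial collisions and causal constraints are omitted here.

```python
from itertools import combinations

def speed_up(original_duration, schedule):
    """Original single-person duration divided by the parallel plan's makespan."""
    makespan = max(seg["end"] for seg in schedule)
    return original_duration / makespan

def object_conflicts(schedule):
    """Count pairs of segments where different people use the same object at overlapping times."""
    count = 0
    for a, b in combinations(schedule, 2):
        overlap = a["start"] < b["end"] and b["start"] < a["end"]
        if a["person"] != b["person"] and a["object"] == b["object"] and overlap:
            count += 1
    return count

# Hypothetical two-person plan derived from a 60-second single-person video.
schedule = [
    {"person": 0, "action": "wash vegetables", "object": "sink",   "start": 0,  "end": 20},
    {"person": 0, "action": "chop vegetables", "object": "knife",  "start": 20, "end": 40},
    {"person": 1, "action": "boil water",      "object": "kettle", "start": 0,  "end": 10},
    {"person": 1, "action": "rinse pan",       "object": "sink",   "start": 10, "end": 20},
]
print(speed_up(60, schedule))        # 60 / 40
print(object_conflicts(schedule))    # both people need the sink during 10-20
```

In this toy plan the makespan drops from 60 to 40 seconds, but both people need the sink at once, so the plan is faster on paper and infeasible in practice. That tension is what a feasibility metric is meant to expose alongside speed-up.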


[49] JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion cs.CV

Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo

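TL;DR: This paper presents JoyAvatar, an audio-driven avatar generation method based on an autoregressive diffusion model that supports real-time inference and infinite-length video generation. Three techniques, Progressive Step Bootstrapping, Motion Condition Injection, and Unbounded RoPE via Cache-Resetting, address the high computational overhead and inability to generate long videos of existing methods, as well as the error accumulation and quality degradation of autoregressive approaches.

Details

Motivation: Existing DiT-based audio-driven avatar generation methods suffer from high computational overhead and cannot generate long videos. Autoregressive methods mitigate this but introduce error accumulation and quality degradation. The paper aims to build an audio-driven autoregressive model capable of real-time inference and infinite-length generation.

Result: The method achieves competitive results in visual quality, temporal consistency, and lip synchronization; its 1.3B-parameter causal model runs in real time at 16 FPS on a single GPU.

Insight: The contributions are: Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; Motion Condition Injection (MCI), which injects noise-corrupted previous frames as a motion condition to improve temporal coherence; and Unbounded RoPE via Cache-Resetting (URCR), which enables infinite-length generation through dynamic positional encoding. Together these improve the stability and quality of autoregressive diffusion models on long sequences.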

Abstract: Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.
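The idea behind PSB, spending more denoising steps on the earliest frames, can be sketched as a per-block step schedule. The linear decay, the step counts, and the function name below are assumptions for illustration; the abstract does not specify the actual schedule.

```python
def progressive_steps(num_blocks, first_steps=8, base_steps=4, warmup_blocks=3):
    """Denoising steps per autoregressive block: more at the start, then a constant base.

    All defaults are hypothetical; only the 'more steps early' shape comes from the abstract.
    """
    steps = []
    for i in range(num_blocks):
        if i >= warmup_blocks:
            steps.append(base_steps)
        else:
            frac = i / warmup_blocks  # 0 at the first block, approaching 1 at the end of warm-up
            steps.append(round(first_steps - frac * (first_steps - base_steps)))
    return steps

schedule = progressive_steps(6)
print(schedule)
```

After the warm-up blocks the per-block cost is constant, which is what keeps long generation at a fixed frame rate in this sketch.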


[50] Flowception: Temporally Expansive Flow Matching for Video Generation cs.CV | cs.AI

Tariq Berrada Ifriqi, John Nguyen, Karteek Alahari, Jakob Verbeek, Ricky T. Q. Chen

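TL;DR: Flowception is a novel non-autoregressive, variable-length video generation framework that learns a probability path interleaving discrete frame insertions with continuous frame denoising. Its frame-insertion mechanism alleviates the error accumulation/drift of autoregressive methods, it reduces training FLOPs three-fold relative to full-sequence flows, and it learns video length jointly with content.

Details

Motivation: To address error accumulation/drift in autoregressive video generation, as well as the high computational cost of full-sequence flow methods and their difficulty with long sequences and variable-length videos.

Result: In quantitative experiments, Flowception improves on autoregressive and full-sequence baselines in FVD and VBench metrics; qualitative results further support the improvement.

Insight: The novelty lies in learning a probability path that interleaves discrete frame insertion with continuous denoising. This acts as an efficient compression mechanism for long-term context, lowers computational cost, and seamlessly integrates tasks such as image-to-video generation and video interpolation.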

Abstract: We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants, and allowing to learn the length of videos jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.


[51] YawDD+: Frame-level Annotations for Accurate Yawn Prediction cs.CV

Ahmed Mujtaba, Gleb Radchenko, Marc Masana, Radu Prodan

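TL;DR: Targeting the accuracy of yawn prediction in driver fatigue detection, this paper proposes a semi-automated labeling pipeline that refines the YawDD dataset into YawDD+, a dataset with frame-level annotations. Training an MNasNet classifier and a YOLOv11 detector on YawDD+ markedly improves performance, giving high accuracy and real-time operation on edge hardware.

Details

Motivation: Existing video-annotated datasets carry systematic noise from coarse temporal annotations, which hampers machine learning approaches to yawn prediction and degrades the accuracy of driver fatigue monitoring.

Result: Trained on YawDD+, the MNasNet classifier improves frame accuracy by up to 6%, reaching 99.34%; the YOLOv11 detector improves mAP by 5%, reaching 95.69%. The approach runs in real time at 59.8 FPS on NVIDIA Jetson Nano edge AI hardware.

Insight: The contribution is a semi-automated labeling pipeline with human-in-the-loop verification that produces the high-quality, frame-level annotated YawDD+ dataset. It shows that improved data quality alone can support on-device yawning monitoring without server-side computation, an effective solution for edge AI applications.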

Abstract: Driver fatigue remains a leading cause of road accidents, with 24% of crashes involving drowsy drivers. While yawning serves as an early behavioral indicator of fatigue, existing machine learning approaches face significant challenges due to video-annotated datasets that introduce systematic noise from coarse temporal annotations. We develop a semi-automated labeling pipeline with human-in-the-loop verification, which we apply to YawDD, enabling more accurate model training. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP. The resulting approach delivers up to 59.8 FPS on edge AI hardware (NVIDIA Jetson Nano), confirming that enhanced data quality alone supports on-device yawning monitoring without server-side computation.


[52] Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation cs.CV | cs.AI

Jingmin Zhu, Anqi Zhu, Hossein Rahmani, Jun Liu, Mohammed Bennamoun

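TL;DR: This paper proposes Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR). It recasts inference as lightweight retrieval over a non-parametric cache that stores structured skeleton representations, combining global and fine-grained local descriptors. The semantic reasoning of large language models (LLMs) is used to assign class-specific importance weights that guide the fusion of descriptor-wise predictions, so the model adapts dynamically to unseen actions without additional training or access to training data.

Details

Motivation: Skeleton-based zero-shot action recognition models generalize poorly to unseen actions at inference time. The goal is to improve performance through test-time adaptation, without retraining or access to the original training data.

Result: Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II show that Skeleton-Cache consistently improves a variety of SZAR backbones under both zero-shot and generalized zero-shot settings.

Insight: The main novelty is recasting test-time adaptation as lightweight retrieval over a non-parametric cache, and combining structured skeleton descriptors (global and local) with LLM-provided semantic priors to fuse predictions dynamically. Viewed objectively, the training-free design, which relies only on an inference-time cache and external LLM knowledge, offers an efficient and scalable way to improve zero-shot generalization.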

Abstract: We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.
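A minimal sketch of cache retrieval with class-weighted fusion follows. The cache layout, the max-similarity scoring rule, and the hand-set weights (standing in for LLM-assigned importance) are all assumptions for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def descriptor_scores(query, cache, num_classes):
    """Per-class score for one descriptor: best similarity to any cached entry of that class.

    cache is a list of (feature, class_id) pairs; classes with no entry keep score 0.0.
    """
    scores = [0.0] * num_classes
    for feat, cls in cache:
        scores[cls] = max(scores[cls], cosine(query, feat))
    return scores

def fuse(per_descriptor_scores, weights):
    """weights[d][c]: importance of descriptor d for class c (hand-set here, LLM-assigned in the paper)."""
    num_classes = len(weights[0])
    return [
        sum(w[c] * s[c] for w, s in zip(weights, per_descriptor_scores))
        for c in range(num_classes)
    ]

# Two hypothetical descriptors (whole body, arm) and two classes.
cache_global = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
cache_arm = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
scores = [
    descriptor_scores([0.6, 0.8], cache_global, 2),
    descriptor_scores([1.0, 0.0], cache_arm, 2),
]
fused = fuse(scores, weights=[[0.5, 0.5], [0.5, 0.5]])
pred = max(range(2), key=lambda c: fused[c])
print(pred)
```

Nothing here is trained: adapting to new test data only means appending entries to the caches.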


[53] Exploring MLLM-Diffusion Information Transfer with MetaCanvas cs.CV | cs.AI | cs.LG

Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang

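TL;DR: This paper proposes MetaCanvas, a framework addressing the problem that multimodal large language models (MLLMs) are used only as global text encoders in visual generation, leaving their reasoning and planning abilities underused. MetaCanvas lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion models, giving more precise, structured control over image and video generation.

Details

Motivation: Current MLLMs excel at parsing complex scenes, yet in visual generation they are typically reduced to global text encoders for diffusion models. Their strong reasoning and planning abilities therefore go largely unused, and generation cannot be controlled with comparable structural precision.

Result: MetaCanvas is implemented on three different diffusion backbones and evaluated on six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation. It outperforms global-conditioning baselines throughout, suggesting the approach is promising for narrowing the gap between multimodal understanding and generation.

Insight: The novelty is treating the MLLM as a latent-space planner rather than merely a text encoder, letting it reason and plan directly in latent space and couple more tightly with the diffusion generator. This enables more precise layout control, attribute binding, and reasoning-intensive generation.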

Abstract: Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.


[54] DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation cs.CV | cs.LG

Mohamed Abdelsamad, Michael Ulrich, Bin Yang, Miao Zhang, Yakov Miron

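TL;DR: This paper proposes DOS, a novel self-supervised learning framework for 3D point cloud representation learning. It self-distills semantic relevance softmaps at observable (unmasked) points, avoiding information leakage from masked regions, and introduces Zipfian prototypes with a Zipf-Sinkhorn algorithm to handle semantic imbalance in the unsupervised setting.

Details

Motivation: To address key challenges in self-supervised learning for 3D point clouds: irregular geometry, shortcut-prone reconstruction, and unbalanced semantic distribution.

Result: On semantic segmentation and 3D object detection across multiple benchmarks (nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200), DOS outperforms current state-of-the-art methods without relying on extra data or annotations.

Insight: The core novelty is distilling softmaps only at observable points, which avoids information leakage. In addition, Zipfian prototypes and the Zipf-Sinkhorn algorithm model the long-tailed distribution of semantics in the unsupervised setting, providing a scalable and effective paradigm for learning robust 3D representations.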

Abstract: Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantics distribution. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmaps distillation offers a scalable and effective paradigm for learning robust 3D representations.
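One way to read "a power-law prior over prototype usage" is a Sinkhorn-Knopp iteration whose column marginals follow a Zipf distribution instead of a uniform one. The sketch below makes that assumption explicit; the rank-by-index ordering, the exponent `s`, the temperature `eps`, and the function names are hypothetical, and the sharpness modulation mentioned in the abstract is not modeled.

```python
import math

def zipf_marginal(k, s=1.0):
    """Target usage for k prototypes, proportional to 1 / rank**s (rank = index + 1, an assumption)."""
    w = [1.0 / (r ** s) for r in range(1, k + 1)]
    z = sum(w)
    return [x / z for x in w]

def zipf_sinkhorn(scores, s=1.0, eps=0.5, iters=50):
    """Soft assignments of n points to k prototypes whose average usage approaches a Zipf prior.

    scores[i][j] is the similarity of point i to prototype j.
    """
    n, k = len(scores), len(scores[0])
    col_target = zipf_marginal(k, s)
    q = [[math.exp(v / eps) for v in row] for row in scores]
    for _ in range(iters):
        # Rescale columns so total prototype usage matches the Zipf marginal.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * col_target[j] / col_sums[j] for j in range(k)] for i in range(n)]
        # Rescale rows so every point carries equal mass 1/n.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    return [[v * n for v in row] for row in q]  # each row now sums to 1

scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.3],
]
soft = zipf_sinkhorn(scores)
usage = [sum(row[j] for row in soft) / len(soft) for j in range(3)]
print(usage)
```

With uniform column targets this reduces to the balanced assignment used in prototype-based self-supervised methods; the Zipf target instead lets a few prototypes absorb most points.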


[55] VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing cs.CV | cs.IR

Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg

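TL;DR: This paper proposes VLM2GeoVec, an instruction-following, single-encoder vision-language model intended to produce universal multimodal embeddings for remote sensing. Trained contrastively, it embeds interleaved inputs such as images, text, bounding boxes, and geographic coordinates into one joint vector space, unifying scalable retrieval with region-level spatial reasoning.

Details

Motivation: Current remote-sensing approaches are fragmented: dual-encoder retrieval models excel at large-scale cross-modal search but cannot interleave modalities, while generative assistants support region-level interpretation but lack scalable retrieval. The paper aims to close this gap and unify scalable retrieval with region-level spatial reasoning in remote sensing.

Result: On the newly proposed RSMEB benchmark, the model reaches 26.6% P@1 on region-caption retrieval (+25 percentage points over the baseline), 32.5% P@1 on referring-expression retrieval (+19 percentage points), and 17.8% P@1 on semantic geo-localization retrieval (more than 3× the prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval.

Insight: The core novelty is a single-encoder architecture that encodes interleaved multimodal inputs (images, text, bounding boxes, coordinates) into one joint embedding space and is trained with a contrastive loss. This eliminates multi-stage pipelines and task-specific modules and unifies scalable retrieval with fine-grained spatial reasoning in remote sensing. The proposed RSMEB benchmark also provides a comprehensive standard for evaluating the versatility of remote-sensing embedding models.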

Abstract: Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce RSMEB, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves 26.6% P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 pp), and 17.8% P@1 on semantic geo-localization retrieval (over 3× prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
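The P@1 figures above are the fraction of queries whose top-ranked candidate in the embedding space is the correct one. A minimal sketch of that metric over toy embeddings follows; the vectors and the function name are made up for illustration, and the benchmark's exact protocol is not described in this digest.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def precision_at_1(queries, candidates, targets):
    """Fraction of queries whose nearest candidate (by cosine) is the ground-truth index."""
    hits = 0
    for q, t in zip(queries, targets):
        best = max(range(len(candidates)), key=lambda j: cosine(q, candidates[j]))
        hits += int(best == t)
    return hits / len(queries)

# Toy 2-D embeddings; a real system would use the encoder's joint embedding space.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
targets = [0, 1, 2]  # the third query is retrieved wrongly (nearest is candidate 0)
p1 = precision_at_1(queries, candidates, targets)
print(round(p1, 3))
```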


[56] TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition cs.CV

Yanan Liu, Jun Liu, Hao Zhang, Dan Xu, Hossein Rahmani

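TL;DR: This paper proposes TSkel-Mamba, a hybrid Transformer-Mamba framework for human skeleton-based action recognition. It uses a Spatial Transformer to learn spatial features and Mamba for temporal modeling. To better adapt Mamba to skeleton data and strengthen its modeling of temporal dependencies, the authors introduce a Temporal Dynamic Modeling block that contains a Multi-scale Temporal Interaction module.

Details

Motivation: The aim is to model spatio-temporal dynamics effectively in skeleton-based action recognition, and in particular to overcome Mamba's limitation that modeling each channel separately fails to capture inter-channel dependencies.

Result: Extensive experiments on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA, and UAV-Human show that TSkel-Mamba achieves state-of-the-art performance while maintaining low inference time.

Insight: The novelty is a versatile plug-and-play Temporal Dynamic Modeling block built around the Multi-scale Temporal Interaction module, which uses multi-scale Cycle operators to capture cross-channel temporal interactions, a critical factor in action recognition. Viewed objectively, combining the Transformer's strength in spatial modeling with an enhanced Mamba for temporal modeling, with a focus on inter-channel temporal relations, is an instructive architectural design.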

Abstract: Skeleton-based action recognition has garnered significant attention in the computer vision community. Inspired by the recent success of the selective state-space model (SSM) Mamba in modeling 1D temporal sequences, we propose TSkel-Mamba, a hybrid Transformer-Mamba framework that effectively captures both spatial and temporal dynamics. In particular, our approach leverages Spatial Transformer for spatial feature learning while utilizing Mamba for temporal modeling. Mamba, however, employs separate SSM blocks for individual channels, which inherently limits its ability to model inter-channel dependencies. To better adapt Mamba for skeleton data and enhance Mamba's ability to model temporal dependencies, we introduce a Temporal Dynamic Modeling (TDM) block, which is a versatile plug-and-play component that integrates a novel Multi-scale Temporal Interaction (MTI) module. The MTI module employs multi-scale Cycle operators to capture cross-channel temporal interactions, a critical factor in action recognition. Extensive experiments on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA and UAV-Human datasets demonstrate that TSkel-Mamba achieves state-of-the-art performance while maintaining low inference time, making it both efficient and highly effective.


[57] Reconstruction as a Bridge for Event-Based Visual Question Answering cs.CV

Hanyue Lou, Jiayi Zhou, Yang Zhang, Boyu Li, Yi Wang

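TL;DR: This paper proposes combining event cameras with multimodal large language models (MLLMs), using reconstruction as a bridge to resolve the compatibility gap between event data and frame-based models. It introduces two methods, Frame-based Reconstruction and Tokenization (FRT) and Adaptive Reconstruction and Tokenization (ART), along with EvQA, the first objective, real-world benchmark for event-based MLLMs. Experiments show that the methods achieve state-of-the-art performance on EvQA.

Details

Motivation: The aim is to integrate event cameras with MLLMs for general scene understanding under challenging visual conditions, which requires balancing the preservation of event data's unique advantages against compatibility with frame-based models.

Result: On the proposed EvQA benchmark (1,000 event-Q&A pairs drawn from 22 public datasets), the methods achieve state-of-the-art performance, demonstrating the significant potential of MLLMs in event-based vision.

Insight: The novelty is using reconstruction as a bridge between event data and MLLMs. The paper proposes two methods, FRT and ART, of which ART exploits event sparsity for efficiency, and introduces EvQA, the first objective benchmark for event-based MLLMs, for robust evaluation.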

Abstract: Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.


[58] 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation cs.CV

Zhiguo Lu, Jianwen Lou, Mingjun Ma, Hairong Jin, Youyi Zheng

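TL;DR: This paper proposes 3DTeethSAM, which adapts the pretrained 2D image segmentation foundation model SAM2 to 3D teeth segmentation. The method renders 3D teeth models into 2D images from multiple views, segments them in 2D with SAM2, and reconstructs 3D results through 2D-3D projection. It adds lightweight learnable modules and Deformable Global Attention Plugins to improve segmentation accuracy and training efficiency.

Details

Motivation: To address the challenge that the complexity of real-world dentition poses for 3D teeth segmentation, by effectively transferring the powerful 2D segmentation foundation model SAM2 to 3D teeth data.

Result: On the 3DTeethSeg benchmark, the method achieves an IoU of 91.90% on high-resolution 3D teeth meshes, setting a new state of the art (SOTA) in the field.

Insight: The novelty is applying a 2D foundation model to a 3D task through multi-view rendering and 2D-3D projection. The design adds a lightweight prompt embedding generator, mask refiner, and mask classifier, plus Deformable Global Attention Plugins to boost performance, offering an effective transfer-learning paradigm for 3D medical image segmentation.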

Abstract: 3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation, demonstrating a strong backbone in various downstream scenarios. To adapt SAM2 for 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2’s performance depends on input prompts and its initial outputs often have deficiencies, and given its class-agnostic nature, we introduce three light-weight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2’s initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2’s image encoder. The DGAP enhances both the segmentation accuracy and the speed of the training process. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes, establishing a new state-of-the-art in the field.
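The render, segment, and reproject pipeline ends with per-view 2D labels being lifted back onto mesh vertices. One simple way to do that, assumed here for illustration and not confirmed by the abstract, is a per-vertex majority vote over the views in which the vertex is visible; the data layout and function name are hypothetical.

```python
from collections import Counter

def fuse_views(num_vertices, view_labels, background=0):
    """Majority-vote a label for each vertex from per-view predictions.

    view_labels: one dict per rendered view, mapping visible vertex ids to a tooth label.
    Vertices never seen in any view fall back to `background`.
    """
    votes = [Counter() for _ in range(num_vertices)]
    for labels in view_labels:
        for vertex, label in labels.items():
            votes[vertex][label] += 1
    return [c.most_common(1)[0][0] if c else background for c in votes]

# Three hypothetical views of a 5-vertex mesh; labels 11 and 12 stand for two teeth.
views = [
    {0: 11, 1: 11, 2: 12},
    {1: 11, 2: 12, 3: 12},
    {0: 11, 2: 11},  # one view mislabels vertex 2
]
labels = fuse_views(5, views)
print(labels)
```

Voting across views lets a single bad 2D mask be outvoted, which is one motivation for rendering from several predefined viewpoints.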


[59] Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis cs.CV

Valentina Lilova, Toyesh Chakravorty, Julian I. Bibo, Emma Boccaletti, Brandon Li

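TL;DR: This paper proposes a new finetuning-free benchmark for evaluating the intrinsic 3D scene understanding of foundation models. It extends the Hummingbird framework to the 3D multi-view dataset MVImgNet and directly probes the quality of dense visual features by analyzing correspondences between multi-view images.

Details

Motivation: Existing evaluations usually rely on downstream finetuning or task-specific decoders, making it hard to isolate the intrinsic 3D reasoning ability of pretrained encoders. A benchmark that directly evaluates the 3D spatial understanding of foundation models is therefore needed.

Result: Eight state-of-the-art foundation models are tested on MVImgNet. DINO-based encoders remain competitive under large viewpoint shifts, while 3D-aware models such as VGGT require dedicated multi-view adjustments. Results are reported in four categories (easy, medium, hard, and extreme) according to the key-query view contrast.

Insight: The novelty is a finetuning-free benchmark for in-context 3D scene understanding that directly probes dense visual feature quality. It reveals how models differ on multi-view correspondence and gives model design a new evaluation perspective.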

Abstract: Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images of objects at specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query view contrast. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.


[60] Embodied Image Compression cs.CV | eess.IV

Chunyi Li, Rui Qing, Jianbo Zhang, Yuan Tian, Xiangyang Zhu

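TL;DR: This paper is the first to pose the scientific problem of Embodied Image Compression, aimed at the communication constraints and real-time task execution requirements of Embodied AI in multi-agent systems. It establishes a standardized benchmark, EmbodiedComp, for systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Extensive empirical studies show that existing Vision-Language-Action models fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold.

Details

Motivation: With the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to embodied agents operating in real-world environments. Addressing the communication constraints of Embodied AI in multi-agent systems while ensuring real-time task execution calls for image compression methods designed specifically for embodied agents.

Result: Extensive empirical studies in both simulated and real-world settings show that existing Vision-Language-Action models cannot reliably perform simple manipulation tasks when compressed below the Embodied bitrate threshold. The paper establishes the standardized benchmark EmbodiedComp for systematic evaluation in a closed-loop setting under ultra-low bitrates.

Insight: The novelty is explicitly posing Embodied Image Compression as a scientific problem for the first time and building the dedicated benchmark EmbodiedComp to advance the area. This underscores the need for domain-specific compression for embodied agents in order to accelerate real-world deployment of Embodied AI. The shift in compression target from static models to dynamic, interactive agents is a key change of perspective.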

Abstract: Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to Embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for Embodied agents, thereby accelerating the deployment of Embodied AI in the real world.


[61] FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint cs.CV

Jiapeng Tang, Kai Li, Chengxiang Yin, Liuhao Ge, Fei Jiang

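TL;DR: FactorPortrait is a controllable portrait animation method based on a video diffusion model that generates lifelike animation from disentangled control signals for facial expression, head movement, and camera viewpoint. Given a single static portrait, a driving video, and a camera trajectory, it transfers the expressions and head movements of the driving video while synthesizing novel views from arbitrary viewpoints.

Details

Motivation: To overcome the limitations of existing portrait animation methods in controlling expression, pose, and viewpoint simultaneously, enabling more flexible, realistic, and disentangled portrait animation.

Result: The method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency, as validated by extensive experiments.

Insight: The novelty is a disentangled control framework. A pretrained image encoder extracts expression latents, disentangled from identity and pose information, as control signals, while Plücker ray maps and normal maps provide pose and viewpoint control. Training on a curated large-scale synthetic dataset supports the model's generalization.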

Abstract: We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
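For readers unfamiliar with Plücker ray maps: each pixel's camera ray, with origin o and unit direction d, is encoded as six numbers, the direction and the moment o × d. The sketch below computes this for a single ray. The (d, o × d) ordering is one common convention and an assumption here, since the abstract does not state which ordering the authors use; a full ray map would repeat this for every pixel given the camera intrinsics and pose.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker_ray(origin, direction):
    """Return (d, o x d) for a ray, with d normalized to unit length."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    moment = cross(origin, d)
    return d + moment

# Camera centre one unit up the z-axis, looking along +x.
ray = plucker_ray((0, 0, 1), (2, 0, 0))
print(ray)
```

The moment is always perpendicular to the direction, and the encoding does not depend on which point along the ray is chosen as the origin.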


[62] Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation cs.CV

Luca Cazzola, Ahed Alboody

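TL;DR: This paper proposes KineMIC, a framework that tackles the scarcity of annotated data in skeletal action recognition through transfer learning from text-to-motion (T2M) generative models. It uses CLIP text embeddings to establish semantic correspondences between sparse HAR labels and T2M source data, fine-tuning a generalist T2M model into a specialized few-shot action generator.

Details

Motivation: To address the high cost of acquiring large annotated motion datasets, and to bridge the domain gap between generalist text-to-motion models and action recognition's need for kinematically precise, class-discriminative motions.

Result: With HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, using only 10 samples per action class, the generated motions are significantly more coherent and, used for data augmentation, deliver a +23.1% accuracy improvement.

Insight: The novelty is a kinematic distillation strategy based on semantic correspondences, using CLIP text embeddings as soft supervision to adapt a generalist T2M model to the HAR domain for few-shot action synthesis, offering a new approach to data augmentation.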

Abstract: The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR’s requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1% accuracy points improvement. Animated illustrations and supplementary materials are available at (https://lucazzola.github.io/publications/kinemic).
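The mining step, pairing each sparse HAR label with semantically close T2M captions, can be sketched as a top-k cosine search in text-embedding space. The toy 3-D vectors below stand in for CLIP text embeddings, and the function names and the top-k rule are assumptions for illustration rather than the paper's procedure.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def mine_sources(label_emb, source_embs, k=2):
    """Return the k source captions whose embeddings are most similar to the label embedding."""
    q = normalize(label_emb)

    def similarity(caption):
        return sum(a * b for a, b in zip(q, normalize(source_embs[caption])))

    return sorted(source_embs, key=similarity, reverse=True)[:k]

# Toy vectors standing in for text embeddings of a HAR label and T2M captions.
label = [1.0, 0.2, 0.0]  # e.g. the label "drink water"
sources = {
    "a person raises a cup to their mouth": [0.9, 0.3, 0.1],
    "a person walks forward": [0.0, 1.0, 0.2],
    "someone sips from a bottle": [0.8, 0.1, 0.0],
}
matches = mine_sources(label, sources)
print(matches)
```

In the setting described above, the mined source motions would then guide fine-tuning alongside the 10 target samples per class.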


[63] Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing cs.CV

Xu Zhang, Jiabin Fang, Zhuoming Ding, Jin Yuan, Xuan Liu

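TL;DR: This paper proposes CLV-Net, a cross-modal context-aware learning method for visual-prompt-guided multimodal understanding of remote sensing images. Users specify a region of interest with a simple bounding-box visual prompt, which guides the model to generate correlated segmentation masks and captions that accurately reflect user intent.

Details

Motivation: Existing methods struggle to steer the model toward user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects look highly similar and carry rich inter-object relationships, which makes accurate recognition even harder.

Result: Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and sets new state-of-the-art (SOTA) results.

Insight: The core contributions are: 1) a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality; 2) a Semantic and Relationship Alignment module, in which a Cross-modal Semantic Consistency Loss sharpens fine-grained discrimination among visually similar targets and a Relationship Consistency Loss enforces alignment between textual relations and visual interactions.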

Abstract: Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.


[64] Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection cs.CV

Qiushi Guo

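TL;DR: This paper proposes Depth-Copy-Paste, a multimodal, depth-aware data augmentation framework that copies full-body person instances and pastes them into semantically compatible scenes. It generates diverse and physically consistent training samples to make face detection more robust under challenging conditions such as occlusion and illumination variation.

Details

Motivation: Traditional copy-paste augmentation often produces unrealistic composites because of inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. The paper aims to overcome these limitations and generate more realistic, physically consistent augmented samples.

Result: Extensive experiments show that, compared with traditional copy-paste and depth-free augmentation, Depth-Copy-Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection.

Insight: The novelty is threefold: combining BLIP and CLIP to assess semantic and visual coherence for automatic background retrieval; integrating SAM3 and Depth-Anything to obtain high-quality foreground masks that preserve facial details and exclude occluded regions; and introducing a depth-map-based sliding-window placement mechanism for geometric realism (depth continuity and scale alignment).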

Abstract: Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth-Copy-Paste, a multimodal and depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding-window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth-Copy-Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
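The depth-guided placement search can be pictured as a sliding window over the background depth map that scores each location. The cost used below (depth variance inside the window plus the gap between the window's mean depth and the foreground's depth) is an assumption chosen to illustrate "depth continuity and scale alignment"; the paper's actual criterion is not given in the abstract.

```python
def best_paste_location(depth, win_h, win_w, target_depth):
    """Return the top-left (row, col) of the window with the lowest placement cost.

    Hypothetical cost: variance of depth in the window (continuity)
    plus |mean depth - target_depth| (alignment with the pasted person's depth).
    """
    best, best_cost = None, float("inf")
    for r in range(len(depth) - win_h + 1):
        for c in range(len(depth[0]) - win_w + 1):
            vals = [depth[r + i][c + j] for i in range(win_h) for j in range(win_w)]
            mean = sum(vals) / len(vals)
            variance = sum((v - mean) ** 2 for v in vals) / len(vals)
            cost = variance + abs(mean - target_depth)
            if cost < best_cost:
                best, best_cost = (r, c), cost
    return best

# Toy 3x4 depth map (metres): a flat surface at 5 m, one at 2 m, and a far wall at 9 m.
depth_map = [
    [5, 5, 2, 2],
    [5, 5, 2, 2],
    [9, 9, 9, 9],
]
location = best_paste_location(depth_map, 2, 2, target_depth=2)
print(location)
```

A real pipeline would also rescale the pasted person according to the chosen depth; that step is omitted here.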


[65] Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation cs.CV

Yan Zhang, Han Zou, Lincong Feng, Cong Xie, Ruiqi Yu

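TL;DR: This paper reframes music-driven 2D dance pose generation as multi-channel image generation: pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, inheriting the architectural and training advances of modern text-to-image models to better capture high-variance 2D pose distributions. On top of this, it introduces a time-shared temporal indexing scheme and a reference-pose conditioning strategy to synchronize music tokens with pose latents and to preserve subject-specific body proportions and on-screen scale.

Details

Motivation: To address the challenge of generating temporally coherent, rhythm-aligned 2D dance pose sequences from music, especially under complex, high-variance in-the-wild distributions that existing methods handle poorly.

Result: On a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark, the method outperforms representative music-to-dance methods in pose- and video-space metrics and in human preference. Ablations validate the contributions of the representation, temporal indexing, and reference conditioning.

Insight: The contributions are: treating pose sequences as images so as to exploit advances in image generation models; temporal indexing for explicit synchronization of music and pose; and reference-pose conditioning that supports long-horizon segment-and-stitch generation while preserving subject consistency. This offers a new perspective and practical techniques for sequence generation tasks.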

Abstract: Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io


[66] MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator cs.CV

Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao

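TL;DR: This paper proposes a Matting Quality Evaluator (MQE) that needs no ground-truth annotation. It assesses semantic and boundary quality and produces a pixel-wise evaluation map, which is used both as online training feedback and for offline data curation. This makes it possible to build VMReal, a large-scale real-world video matting dataset, and a reference-frame training strategy is introduced to handle appearance changes in long videos. The resulting MatAnyone 2 achieves SOTA performance on synthetic and real-world benchmarks.

Details

Motivation: Video matting is limited by the scale and realism of existing datasets. Leveraging segmentation data improves semantic stability, but without effective boundary supervision the resulting mattes often resemble segmentation masks and lack fine detail.

Result: MatAnyone 2 achieves SOTA performance on synthetic and real-world benchmarks, surpassing prior methods on all metrics.

Insight: The key contributions are the ground-truth-free MQE, used for quality feedback and data selection, and the reference-frame training strategy for large appearance variations in long videos, which together enable the large-scale real-world dataset VMReal.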

Abstract: Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
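
As a rough illustration of the offline selection idea, the sketch below fuses two candidate alpha mattes by keeping, at each pixel, the value whose evaluation score is higher. This is only one plausible reading of "combining the strengths of leading video and image matting models"; the per-pixel argmax rule, the function name `fuse_mattes`, and the score maps standing in for MQE output are all assumptions, not the paper's procedure.

```python
from typing import List

Matte = List[List[float]]  # rows of per-pixel values in [0, 1]


def fuse_mattes(alpha_a: Matte, alpha_b: Matte, score_a: Matte, score_b: Matte) -> Matte:
    """Pick, per pixel, the alpha value whose quality score is higher.

    score_a and score_b are hypothetical stand-ins for pixel-wise evaluation
    maps. Ties favor candidate a.
    """
    fused = []
    for row_a, row_b, s_a, s_b in zip(alpha_a, alpha_b, score_a, score_b):
        fused.append(
            [a if qa >= qb else b for a, b, qa, qb in zip(row_a, row_b, s_a, s_b)]
        )
    return fused
```

A hard per-pixel choice can leave seams between regions; a real curation pipeline would likely need something smoother, which this sketch does not attempt.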


[67] Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation cs.CV

Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna

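TL;DR: This paper proposes SAM2VideoX, a video generation method that extracts structure-preserving motion priors from an autoregressive video tracking model (SAM2) and distills them into a bidirectional video diffusion model (CogVideoX), producing more realistic, structure-preserving motion, particularly for articulated and deformable objects such as humans and animals.

Details

Motivation: Existing video generation models struggle to produce realistic, structure-preserving motion and often generate physically implausible transitions. Existing approaches also depend on noisy motion representations, such as optical flow or skeletons extracted by an imperfect external model.

Result: On VBench, SAM2VideoX scores 95.51%, surpassing REPA (92.91%) by 2.60%. It reduces FVD (Fréchet Video Distance) to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively, and achieves a 71.4% preference rate in human studies.

Insight: The contributions are (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model such as SAM2, and (2) a Local Gram Flow loss that aligns how local features move together. This offers an effective route to improving structural fidelity in video generation by distilling priors from a tracking model.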

Abstract: Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/ .
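
The reported numbers can be sanity-checked with a little arithmetic. The "+2.60%" VBench gain equals the plain difference 95.51 - 92.91, so it is a gain in percentage points rather than a relative change. The abstract does not state the baseline FVD values; the sketch below derives what they would be only under the assumption that the 21.20% and 22.46% figures are relative reductions. The helper names are hypothetical.

```python
def absolute_gain(new_score: float, old_score: float) -> float:
    """Difference in percentage points between two percentage scores."""
    return new_score - old_score


def implied_baseline(new_value: float, relative_reduction: float) -> float:
    """Baseline b such that new_value == b * (1 - relative_reduction).

    Only meaningful if the reported improvement is a relative reduction,
    which is an assumption here, not something the abstract states.
    """
    return new_value / (1.0 - relative_reduction)


vbench_gain = absolute_gain(95.51, 92.91)
repa_fvd = implied_baseline(360.57, 0.2120)
lora_fvd = implied_baseline(360.57, 0.2246)

print(f"VBench gain: {vbench_gain:.2f} points")
print(f"Implied REPA FVD (assumed relative reduction): {repa_fvd:.2f}")
print(f"Implied LoRA FVD (assumed relative reduction): {lora_fvd:.2f}")
```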


[68] V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties cs.CV

Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev

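TL;DR: V-RGBX is the first end-to-end framework for intrinsic-aware video editing. It unifies three key capabilities: inverse rendering of video into intrinsic channels (such as albedo, normal, material, and irradiance), photorealistic video synthesis from these intrinsic representations, and keyframe-based video editing conditioned on intrinsic channels. At its core is an interleaved conditioning mechanism that supports intuitive, physically grounded editing through user-selected keyframes, with flexible manipulation of any intrinsic modality.

Details

Motivation: Large-scale video generation models show strong potential for modeling lighting interactions in real-world scenes, but there is no closed-loop framework that jointly understands intrinsic scene properties, uses them for video synthesis, and supports editable intrinsic representations.

Result: Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos and propagates keyframe edits across the sequence in a physically plausible manner, surpassing prior methods in applications such as object appearance editing and scene-level relighting.

Insight: The contribution is the first end-to-end intrinsic-aware video editing framework, in which an interleaved conditioning mechanism enables keyframe editing over intrinsic channels and flexible manipulation of intrinsic properties, giving video editing more physically accurate and intuitive controls.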

Abstract: Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.


cs.AI

[69] FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration cs.AI | cs.CL

Dongwon Jung, Peng Shi, Yi Zhang

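TL;DR: This paper proposes FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under a fixed compute budget. Through modularized collaboration, it encapsulates reusable multi-agent workflows as callable functions, and it uses a dual-level planning architecture to optimize compute allocation and improve task success rates.

Details

Motivation: Existing test-time compute scaling techniques (such as repeated sampling and self-verification) are hard to apply to multi-agent systems, which lack a principled mechanism for allocating compute under budget constraints in a way that fosters collaboration among agents.

Result: On complex agent benchmarks, FutureWeaver consistently outperforms baselines across different budget settings, validating its effectiveness for optimizing multi-agent collaboration at inference time.

Insight: The contributions are formalizing modularized collaboration as callable functions, with reusable interaction patterns abstracted automatically through self-play reflection, and a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while speculating on future steps. Together they provide a systematic approach to test-time compute optimization in multi-agent systems.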

Abstract: Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task success by allocating more inference-time compute. However, applying these techniques across multiple agents in a multi-agent system is difficult: there are no principled mechanisms to allocate compute to foster collaboration among agents, to extend test-time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi-agent workflows. These modules are automatically derived through self-play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.
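
To make the budget-allocation problem concrete, here is a deliberately simple greedy allocator. It is a stand-in for illustration and is not FutureWeaver's dual-level planner: it assumes each callable module comes with a hypothetical list of estimated marginal gains per additional call, and it spends the budget one call at a time on whichever module currently promises the largest gain. The module names and gain values are invented.

```python
from typing import Dict, List, Optional


def allocate_budget(marginal_gains: Dict[str, List[float]], budget: int) -> Dict[str, int]:
    """Greedily assign `budget` unit calls across modules.

    marginal_gains[name][k] is the (hypothetical) estimated gain of giving
    module `name` its (k + 1)-th call. Allocation stops early once no module
    has a positive remaining gain.
    """
    allocation = {name: 0 for name in marginal_gains}
    for _ in range(budget):
        best_name: Optional[str] = None
        best_gain = 0.0
        for name, gains in marginal_gains.items():
            used = allocation[name]
            if used < len(gains) and gains[used] > best_gain:
                best_name, best_gain = name, gains[used]
        if best_name is None:
            break
        allocation[best_name] += 1
    return allocation


# Invented example: two modules with diminishing returns.
example_gains = {"search": [0.5, 0.2, 0.1], "verify": [0.4, 0.3]}
example_plan = allocate_budget(example_gains, budget=3)
```

A greedy rule like this only looks at the next call; the abstract's point about speculating on future steps is precisely what such a myopic allocator lacks.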


cs.RO

[70] WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control cs.RO | cs.AI | cs.CV

Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi

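TL;DR: This paper proposes WholeBodyVLA, a unified latent Vision-Language-Action framework for whole-body loco-manipulation control of humanoid robots. It learns loco-manipulation knowledge from low-cost, action-free egocentric videos and pairs this with a loco-manipulation-oriented reinforcement learning policy that executes locomotion commands precisely, addressing the lack of manipulation-aware locomotion in existing methods and enabling large-space loco-manipulation tasks.

Details

Motivation: Existing humanoid loco-manipulation approaches, whether modular or end-to-end, lack manipulation-aware locomotion, which confines the robot to a limited workspace and prevents large-space loco-manipulation. Two challenges underlie this: loco-manipulation knowledge is hard to acquire because humanoid teleoperation data is scarce, and existing RL controllers have limited precision and stability, so locomotion commands are not executed faithfully and reliably.

Result: Comprehensive experiments on the AgiBot X2 humanoid validate the method, which outperforms the prior baseline by 21.3%. It also shows strong generalization and high extensibility across a broad range of tasks.

Insight: The main contributions are a unified latent learning framework that lets a VLA system learn loco-manipulation knowledge from low-cost, action-free egocentric videos, easing data scarcity, and an RL policy designed specifically for loco-manipulation that controls core movements (such as advancing, turning, and squatting) precisely and stably. More broadly, combining knowledge from large-scale video pretraining with a targeted RL policy is a promising unified approach to complex robot control.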

Abstract: Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables a Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To more precisely execute the desired locomotion commands, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of its kind enabling large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming the prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.


[71] Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy cs.RO | cs.CV

Kechun Xu, Zhenjie Zhu, Anzhe Chen, Shuqi Zhao, Qing Huang

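TL;DR: This paper proposes BayesVLA, a Bayesian factorization for Vision-Language-Action (VLA) models that addresses the catastrophic forgetting and visual shortcut learning caused by modality imbalance during fine-tuning (language diversity is far lower than visual and action diversity). The policy is decomposed into a visual-action prior that supports "seeing-to-act" and a language-conditioned likelihood that enables "prompt-to-specify", preserving generalization and improving instruction following without external reasoning data.

Details

Motivation: In pursuing out-of-distribution generalization, VLA models suffer catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. Existing methods co-train with external reasoning data, which brings tuning difficulty and data overhead. The paper identifies the modality imbalance inherent in VLA datasets (insufficient language diversity) as an intrinsic cause of the bias toward visual shortcuts and language forgetting.

Result: In extensive experiments, the method generalizes better than existing methods to unseen instructions, objects, and environments. An information-theoretic analysis formally validates its effectiveness in mitigating shortcut learning.

Insight: The core contribution is a Bayesian factorization framework that decouples the VLA policy into a visual-action prior and a language-conditioned likelihood, which structurally mitigates modality imbalance and protects pretrained knowledge. Introducing pre-contact and post-contact phases to make better use of pretrained foundation models is also a practical design idea.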

Abstract: The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires experienced tuning and data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. Information-theoretic analysis formally validates the effectiveness of our approach in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. Project page is available at: https://xukechun.github.io/papers/BayesVLA.
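
The factorization described in the abstract follows the shape of Bayes' rule: writing a for action, v for the visual observation, and l for the language instruction (notation assumed here, not taken from the paper), p(a | v, l) is proportional to p(a | v) · p(l | v, a). The sketch below works this out for a toy discrete action set. The action names and probabilities are invented, and the paper's actual models are learned networks rather than lookup tables.

```python
from typing import Dict


def posterior_policy(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Combine a visual-action prior with a language-conditioned likelihood.

    prior[a]      stands for p(a | v): what the scene alone suggests.
    likelihood[a] stands for p(l | v, a): how well action a fits the instruction.
    Returns p(a | v, l), normalized over the actions in `prior`.
    """
    unnormalized = {a: prior[a] * likelihood[a] for a in prior}
    evidence = sum(unnormalized.values())  # plays the role of p(l | v)
    if evidence == 0.0:
        raise ValueError("instruction has zero likelihood under every action")
    return {a: value / evidence for a, value in unnormalized.items()}


# Invented toy numbers: vision alone is undecided, the instruction disambiguates.
toy_prior = {"pick": 0.5, "push": 0.5}
toy_likelihood = {"pick": 0.9, "push": 0.1}
toy_policy = posterior_policy(toy_prior, toy_likelihood)
```

In this toy case the prior is uniform, so the instruction alone decides between the two actions, which is the "prompt-to-specify" role the abstract assigns to the likelihood term.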


[72] AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis cs.RO | cs.CV

Junjie Ye, Rong Xue, Basile Van Hoorick, Pavel Tokmakov, Muhammad Zubair Irshad

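TL;DR: AnchorDream is a robot data synthesis method built on pretrained video diffusion models. It anchors the diffusion process on renderings of robot motion and generates objects and environments consistent with the robot's kinematics, scaling a small number of human teleoperation demonstrations into large, diverse, high-quality datasets.

Details

Motivation: Collecting large-scale, diverse data is a bottleneck for robot imitation learning. Existing generative methods either change only visual appearance without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions.

Result: The method achieves a relative gain of 36.4% on simulator benchmarks and nearly doubles performance in real-world studies, showing that the generated data consistently improves downstream policy learning.

Insight: The contribution is repurposing pretrained video diffusion models for robot data synthesis and conditioning the diffusion process on robot motion renderings to keep the embodiment consistent, which offers a practical path to scaling imitation learning.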

Abstract: The collection of large-scale and diverse robot demonstrations remains a major bottleneck for imitation learning, as real-world data acquisition is costly and simulators offer limited diversity and fidelity with pronounced sim-to-real gaps. While generative models present an attractive solution, existing methods often alter only visual appearances without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on robot motion renderings, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot’s kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling. Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly doubled performance in real-world studies. These results suggest that grounding generative world models in robot motion provides a practical path toward scaling imitation learning.


cs.LG

[73] Rethinking Expert Trajectory Utilization in LLM Post-training cs.LG | cs.CL

Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu

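TL;DR: This paper proposes a theoretical framework, the Plasticity-Ceiling Framework, for analyzing how best to use expert trajectories in large language model post-training. It finds that the sequential pipeline of supervised fine-tuning followed by reinforcement learning is the superior standard, and it gives concrete scaling guidelines on when to switch to RL, on the roles of data scale and trajectory difficulty, and on how to select expert trajectories.

Details

Motivation: How to make optimal use of expert trajectories (for example, demonstration data from humans or stronger models) in LLM post-training remains unresolved; the goal is to maximize final model performance.

Result: Extensive benchmarking establishes that the sequential SFT-then-RL pipeline outperforms synchronized approaches, and yields concrete scaling guidelines, for example that switching to RL during the SFT Stable or Mild Overfitting Sub-phase maximizes the final performance ceiling.

Insight: The main contribution is a theoretical framework that decomposes post-training performance, from which the paper draws counterintuitive conclusions (such as refuting "Less is More"): data scale is the primary source of post-training potential while trajectory difficulty acts as a performance multiplier, and validation loss is a robust indicator for selecting expert trajectories.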

Abstract: While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
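
Guideline (3) can be read as a simple selection rule: among candidate expert-trajectory sets, prefer the one whose SFT run reaches the lowest minimum validation loss. The sketch below implements that reading; the exact selection protocol is not given in the abstract, so the rule, the function name, and the loss values are assumptions for illustration.

```python
from typing import Dict, List


def select_expert_set(val_loss_curves: Dict[str, List[float]]) -> str:
    """Return the candidate whose SFT validation-loss curve has the lowest minimum.

    val_loss_curves maps a (hypothetical) trajectory-set name to the validation
    losses recorded at successive SFT checkpoints. Each curve must be non-empty.
    """
    if not val_loss_curves:
        raise ValueError("no candidate trajectory sets given")
    return min(val_loss_curves, key=lambda name: min(val_loss_curves[name]))


# Invented curves: both sets start to overfit after the second checkpoint.
curves = {
    "set_a": [1.20, 0.90, 0.95],
    "set_b": [1.50, 0.80, 0.85],
}
chosen = select_expert_set(curves)
```

Note that the rule compares the minimum of each curve, not the final checkpoint, which matters once the curves turn upward from overfitting.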


[74] Multimodal Fusion of Regional Brain Experts for Interpretable Alzheimer’s Disease Diagnosis cs.LG | cs.AI | cs.CV | eess.IV

Farica Zhuang, Dinara Aliyeva, Shu Yang, Zixuan Wen, Duy Duong-Tran

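TL;DR: This paper proposes MREF-AD, a Multimodal Regional Expert Fusion model for Alzheimer's disease diagnosis. It uses a Mixture-of-Experts framework that models the meso-scale brain regions in each modality as independent experts and learns subject-specific fusion weights through two-level gating networks, adaptively fusing multimodal information such as amyloid PET and MRI to improve diagnostic performance and provide region-level interpretability.

Details

Motivation: Existing fusion methods typically rely on simple feature concatenation and cannot adaptively balance the contributions of biomarkers across brain regions, while clinical practice shows that integrating complementary multimodal information benefits accurate, early AD diagnosis.

Result: On the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, MREF-AD achieves SOTA performance over baselines while improving the interpretability of brain region-specific biomarker relevance.

Insight: The contribution is combining a Mixture-of-Experts framework with two-level gating networks to achieve adaptive fusion and interpretability at both the modality and brain-region level, providing a general, interpretable framework for multimodal fusion in neuroimaging.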

Abstract: Accurate and early diagnosis of Alzheimer’s disease (AD) can benefit from integrating complementary information from multiple modalities, mirroring clinical practice. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models meso-scale brain regions in each modality as an independent expert and employs two-level gating networks to learn subject-specific fusion weights. Beyond improving diagnostic performance, MREF-AD provides modality- and region-level insight into how structural and molecular imaging jointly contribute to disease diagnosis. Using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), MREF-AD achieves state-of-the-art performance over baselines while providing enhanced interpretability of brain region-specific biomarker relevance, underscoring its utility as a general framework for adaptive and interpretable multimodal fusion in neuroimaging.
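
The two-level gating idea can be sketched in a few lines: a softmax gate weights the regional experts within each modality, and a second softmax gate weights the modalities. In the model described above the gate values would come from learned, subject-specific networks; here the logits are simply passed in, and the modality names, scores, and function names are hypothetical.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(
    expert_scores: Dict[str, List[float]],
    region_logits: Dict[str, List[float]],
    modality_logits: Dict[str, float],
) -> float:
    """Two-level gated fusion of regional expert outputs.

    expert_scores[m][r]  : output of the expert for region r in modality m.
    region_logits[m][r]  : first-level gate logit for that expert.
    modality_logits[m]   : second-level gate logit for modality m.
    """
    modalities = list(expert_scores)
    modality_weights = softmax([modality_logits[m] for m in modalities])
    fused = 0.0
    for weight, m in zip(modality_weights, modalities):
        region_weights = softmax(region_logits[m])
        modality_score = sum(w * s for w, s in zip(region_weights, expert_scores[m]))
        fused += weight * modality_score
    return fused


# Invented inputs: uniform gates reduce the fusion to a plain average.
scores = {"pet": [1.0, 0.0], "mri": [0.5, 0.5]}
uniform = fuse(scores, {"pet": [0.0, 0.0], "mri": [0.0, 0.0]}, {"pet": 0.0, "mri": 0.0})
```

Because both gate levels are softmax weights, the region and modality weights can be read off directly as a per-subject attribution, which is the kind of interpretability the abstract emphasizes.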