Table of Contents

cs.CL

[1] Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL cs.CL | cs.AI | cs.DB

Khushboo Thaker, Yony Bresler

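TL;DR: This paper proposes Struct-SQL, a knowledge distillation framework that addresses the trade-off between cost, security, and performance in enterprise Text-to-SQL systems. It distills a large language model's reasoning through a structured chain of thought, using the query execution plan as a formal blueprint, to train a high-performing small language model.

Details

Motivation: Enterprises deploying Text-to-SQL systems face a trilemma of cost, security, and performance, and must choose between expensive large models and underperforming small ones. Existing methods for improving small models usually distill unstructured chain-of-thought traces, whose reasoning is inherently ambiguous. The paper hypothesizes that, because Text-to-SQL requires explicit and precise logical steps, a formal, structured reasoning representation provides a clearer and more reliable teaching signal.

Result: The small model distilled with structured chain of thought achieves an absolute improvement of 8.1% on Text-to-SQL over a baseline distilled with unstructured chain of thought. A detailed error analysis identifies a marked reduction in syntactic errors as a key factor in the gain.

Insight: The core contribution is using the query execution plan as a formal, structured reasoning blueprint to guide knowledge distillation. This gives Text-to-SQL a clearer, more reliable teaching signal and improves the ability of small language models to generate reliable SQL, most visibly by reducing syntactic errors.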

Abstract: Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful large LLM. Consequently, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.


[2] Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding cs.CL

Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu

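TL;DR: This paper proposes Mindscape-Aware RAG (MiA-RAG), which adds explicit global context awareness to LLM-based retrieval-augmented generation to improve long-context understanding. It builds a holistic semantic representation, called a mindscape, through hierarchical summarization and uses it to guide both retrieval and generation, helping the model integrate evidence dispersed across a document and reason globally.

Details

Motivation: Current retrieval-augmented generation systems lack the guidance of a holistic semantic representation that humans rely on (the Mindscape-Aware Capability described in psychology), so they struggle to integrate dispersed evidence and to understand a long context as a whole.

Result: MiA-RAG was evaluated on diverse long-context and bilingual benchmarks covering evidence-based understanding and global sense-making, and it consistently surpasses the baselines. Analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.

Insight: The core contribution is the first explicit global context awareness mechanism for RAG systems. A mindscape built by hierarchical summarization guides both retrieval (forming enriched query embeddings) and generation (reasoning within a coherent global context), mimicking the cognitive process by which humans understand long texts.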

Abstract: Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.


[3] Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience cs.CL

Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang

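TL;DR: This paper presents Seed-Prover 1.5, a formal theorem-proving model trained with large-scale agentic reinforcement learning and paired with an efficient test-time scaling workflow. By accumulating experience through interaction with Lean and other tools, the model substantially improves the capability and efficiency of formal theorem proving and achieves strong results on several mathematics competition benchmarks.

Details

Motivation: Theorem proving with large language models in formal languages such as Lean remains challenging and computationally expensive, particularly for mathematical problems at the undergraduate level and beyond.

Result: The model solves 88% of PutnamBench (undergraduate level), 80% of Fate-H (graduate level), and 33% of Fate-X (PhD level). It solved 11 of the 12 problems from Putnam 2025 within 9 hours, outperforming state-of-the-art methods with a smaller compute budget.

Insight: The core contributions are large-scale agentic reinforcement learning, which lets the model keep accumulating experience from interaction with formal tools, and an efficient test-time scaling workflow that bridges the gap between natural-language proofs and formal languages. The results suggest that scaling learning from experience, driven by high-quality formal feedback, holds great potential for formal mathematical reasoning.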

Abstract: Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present Seed-Prover 1.5, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves 88% of PutnamBench (undergraduate-level), 80% of Fate-H (graduate-level), and 33% of Fate-X (PhD-level) problems. Notably, using our system, we solved 11 out of 12 problems from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.


[4] Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers cs.CL

Zeyuan Allen-Zhu

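TL;DR: This paper introduces Canon layers, lightweight architectural components that strengthen core language-model capabilities by promoting horizontal information flow across neighboring tokens. Using controlled synthetic pretraining tasks, it evaluates Canon layers in Transformers, linear attention, state-space models, and other sequence architectures at academic scale (e.g., 1.3B parameters, 100B tokens).

Details

Motivation: Architectural differences between language models are hard to understand at academic-scale pretraining, where results are dominated by noise and randomness.

Result: Canon layers are validated on both synthetic tasks and real-world academic-scale pretraining. They lift weak architectures such as NoPE to match RoPE and bring linear attention to a level comparable with SOTA linear models such as Mamba2/GDN. The paper reports 12 key results, including roughly 2x greater reasoning depth and improved reasoning breadth and knowledge manipulation.

Insight: The contributions are the Canon layer as a general component that promotes horizontal information flow, and controlled synthetic pretraining tasks as an economical, principled way to isolate and evaluate core model capabilities, which may even predict how future architectures will behave as training pipelines improve.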

Abstract: Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS: lightweight architectural components (named after the musical term “canon”) that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results. This includes how Canon layers enhance reasoning depth (e.g., by 2×), reasoning breadth, knowledge manipulation, etc. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN, validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even PREDICT how future architectures will behave as training pipelines improve (e.g., through better data curation or RL-based post-training), unlocking deeper reasoning and hierarchical inference.
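
The abstract characterizes Canon layers only as weighted sums of nearby token representations. A minimal plain-Python sketch of that idea follows, assuming a causal window over the current and preceding tokens with fixed scalar weights. The name `canon_like_mix`, the window size, and the weight values are illustrative assumptions, not the paper's implementation.

```python
def canon_like_mix(tokens, weights):
    """Mix each token vector with its predecessors using fixed weights.

    tokens:  list of equal-length vectors (lists of floats), one per position.
    weights: weights[0] applies to the current token, weights[1] to the
             previous one, and so on (hypothetical, hand-picked values).
    """
    mixed_sequence = []
    for position in range(len(tokens)):
        mixed = [0.0] * len(tokens[position])
        for offset, weight in enumerate(weights):
            source = position - offset
            if source < 0:
                break  # no earlier token exists; keep the sum causal
            for dim, value in enumerate(tokens[source]):
                mixed[dim] += weight * value
        mixed_sequence.append(mixed)
    return mixed_sequence
```

With `tokens = [[1.0], [2.0], [4.0]]` and `weights = [0.5, 0.25]`, each output position combines half of its own value with a quarter of its left neighbor's. A trained layer would presumably learn such weights rather than fix them by hand.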


[5] UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models cs.CL

Jiajun Wu, Jian Yang, Wei Zhang, Lin Jing, Yuqing Ma

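TL;DR: This paper proposes IPC, an unsupervised code generation framework that generates code by internally probing large language models (LLMs), with no external corpus. It uses problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to surface the knowledge and confidence patterns inside LLMs, then selects reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder, a model learned without supervision.

Details

Motivation: Large language models perform well at code generation, but their effectiveness depends heavily on supervised training with large labeled datasets (e.g., question-answer pairs) or unlabeled datasets (e.g., code snippets), which are costly and hard to collect at scale. The paper addresses this limitation with an unsupervised method that reduces the dependence on labeled data and computational resources.

Result: The method was validated on multiple code benchmarks. The results show that unsupervised methods can achieve performance competitive with supervised approaches while significantly reducing the need for labeled data and computational resources.

Insight: The contribution is an unsupervised internal probing framework (IPC) that mines the rich signals about code quality and correctness in LLM internal states to enable effective unsupervised learning. This opens a new direction for training code LLMs in resource-constrained scenarios and highlights the potential of a model's internal knowledge.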

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. We introduce problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
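
The abstract does not spell out how its self-consistency mechanism works. One common reading is sketched below, under the assumption that candidate programs are grouped by the outputs they produce on shared probe inputs and the largest agreeing group is kept. The function name `largest_agreeing_group` and this grouping rule are illustrative assumptions, not the IPC procedure itself.

```python
from collections import defaultdict


def largest_agreeing_group(candidates, probe_inputs):
    """Return indices of the candidates that agree most often on the probes.

    candidates:   list of callables taking one argument (hypothetical
                  stand-ins for generated programs).
    probe_inputs: inputs on which every candidate is executed.
    """
    groups = defaultdict(list)
    for index, candidate in enumerate(candidates):
        signature = []
        for value in probe_inputs:
            try:
                signature.append(repr(candidate(value)))
            except Exception as error:  # a crash is part of the behavior
                signature.append("error:" + type(error).__name__)
        groups[tuple(signature)].append(index)
    # Ties are broken by first-seen order, since max keeps the first maximum.
    return max(groups.values(), key=len) if groups else []
```

For example, given `[lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]` and probes `[1, 2, 3]`, the first two candidates produce identical outputs and form the majority group.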


[6] Are Vision Language Models Cross-Cultural Theory of Mind Reasoners? cs.CL | cs.CV | cs.CY

Zabir Al Nazi, G M Shahariar, Abrar Hossain, Wei Peng

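TL;DR: The paper introduces CulturalToM-VQA, a new evaluation benchmark of 5,095 questions that probes the Theory of Mind (ToM) reasoning of vision-language models (VLMs) across cultural contexts through visual question answering (VQA). The dataset captures cultural cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks.

Details

Motivation: Theory of Mind (ToM) is fundamental to human social intelligence but remains a major challenge for artificial agents. Vision-language models (VLMs) are increasingly applied to socially grounded tasks, yet their capacity for cross-cultural ToM reasoning is largely unexplored.

Result: The paper builds the CulturalToM-VQA benchmark, but the abstract reports no quantitative results or performance levels (e.g., SOTA) for specific models on it.

Insight: The main contribution is a visual question answering evaluation dataset focused on cross-cultural ToM reasoning. It is built through a VLM-assisted human-in-the-loop pipeline and covers six ToM task types and four complexity levels, giving a more culturally comprehensive view for assessing the social intelligence of AI.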

Abstract: Theory of Mind (ToM), the ability to attribute beliefs, desires, and emotions to others, is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.


[7] Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering cs.CL

Riccardo Di Sipio

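TL;DR: This paper explores Bayesian inference as a way to quantify uncertainty in neural question answering. A multilayer perceptron experiment on the Iris dataset shows how posterior inference conveys confidence in predictions. The approach is then extended to language models, where Laplace approximations are compared against maximum a posteriori (MAP) estimates on the CommonsenseQA benchmark. The emphasis is on uncertainty calibration and selective prediction, so that a model can abstain when its confidence is low, which improves interpretability and supports ethical AI deployment.

Details

Motivation: Neural question-answering systems lack the ability to quantify uncertainty. Bayesian methods let a model assess the confidence of its own predictions, supporting more responsible and ethical AI deployment.

Result: Bayesian inference methods (Laplace approximation and MAP estimation) were evaluated on the CommonsenseQA benchmark. The evaluation focuses on uncertainty calibration and selective prediction rather than state-of-the-art accuracy, allowing the model to abstain when confidence is low in order to improve reliability.

Insight: The contribution is applying Bayesian uncertainty quantification to neural question answering and using posterior inference for selective prediction, so the model can actively abstain. This improves the system's interpretability and ethical safety and offers a practical framework for responsible AI deployment.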

Abstract: We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An "I don't know" response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
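
Selective prediction, as described in the abstract, lets a model abstain when its confidence is low. A minimal sketch follows, assuming confidence is the maximum class probability and abstention is triggered by a fixed threshold. The function names and the 0.7 threshold are illustrative assumptions; the probabilities could come from any source, including a Laplace-approximated posterior predictive.

```python
def selective_predict(probs, threshold=0.7):
    """Return the index of the most likely class, or None to abstain."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


def coverage_and_selective_accuracy(prob_rows, labels, threshold=0.7):
    """Fraction of questions answered, and accuracy on those answered."""
    answered = 0
    correct = 0
    for probs, label in zip(prob_rows, labels):
        prediction = selective_predict(probs, threshold)
        if prediction is None:
            continue  # the model says "I don't know"
        answered += 1
        correct += int(prediction == label)
    coverage = answered / len(labels) if labels else 0.0
    accuracy = correct / answered if answered else 0.0
    return coverage, accuracy
```

Raising the threshold trades coverage for accuracy on the answered subset, provided the probabilities are reasonably calibrated. That dependence is why calibration matters for this use.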


[8] DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports cs.CL

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park

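TL;DR: The paper introduces DEER, a benchmark for evaluating expert-level deep research reports generated by large language models. It comprises 50 tasks spanning 13 domains, an expert evaluation taxonomy (7 dimensions, 25 sub-dimensions), and a document-level fact-checking architecture that verifies the claims in a report, both cited and uncited.

Details

Motivation: Existing benchmarks fall short when evaluating expert reports: they lack systematic criteria, LLM judges can miss issues that require expert judgment, and source verification usually covers only a limited part of the report. A more comprehensive and reliable benchmark is needed.

Result: DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses, indicating that its evaluation is reliable and valid.

Insight: The contributions are an expert evaluation taxonomy, task-specific expert guidance that makes LLM-judge assessments more consistent, and a document-level fact-checking architecture that verifies every claim in a report and quantifies external-evidence quality, giving a more comprehensive evaluation.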

Abstract: As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.


[9] ShareChat: A Dataset of Chatbot Conversations in the Wild cs.CL | cs.AI | cs.HC

Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le

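TL;DR: The paper presents ShareChat, a large-scale, cross-platform corpus of chatbot conversations collected from five platforms (ChatGPT, Claude, Gemini, Perplexity, and Grok), comprising 142,808 conversations and over 660,000 turns. The dataset preserves interface context and native features such as reasoning traces, source links, and code artifacts, spans 101 languages over the period from April 2023 to October 2025, and is shown to be useful for analyzing conversation completeness, source citation behavior, and temporal trends.

Details

Motivation: Existing public datasets treat large language models as generic text generators and ignore how interface context shapes user interaction. ShareChat addresses this limitation by providing real-world interactions between users and LLM chatbots.

Result: ShareChat contains 142,808 conversations and more than 660,000 turns across 101 languages, with longer context windows and greater interaction depth than prior datasets. The paper demonstrates its utility through three analyses: a conversation completeness analysis that measures user intent satisfaction, a content generation analysis that evaluates source citation behavior, and a temporal analysis that tracks changing usage patterns.

Insight: The contribution is a large-scale, cross-platform, multilingual corpus of real conversations that preserves native platform features, an important resource for understanding user-LLM interaction. Viewed objectively, the dataset fills a gap left by existing datasets and enables deeper research.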

Abstract: While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset’s multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.


cs.CV

[10] V-Agent: An Interactive Video Search System Using Vision-Language Models cs.CV | cs.AI | cs.IR | cs.MA

SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu

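TL;DR: V-Agent is a multi-agent interactive video search system built on a vision-language model (VLM). By fine-tuning the VLM and incorporating a retrieval vector from an image-text retrieval model, it interprets both the visual and the spoken content of videos for context-aware video retrieval. Three agents (routing, search, and chat) collaborate to refine search results and interact with users.

Details

Motivation: Traditional text-based retrieval systems are limited in multimodal scenarios such as video. The goal is joint understanding of a video's visual content and speech transcripts to provide a more accurate, interactive video search experience.

Result: The system achieves state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, indicating strong performance on video retrieval.

Insight: The contributions are fine-tuning a VLM on a small video preference dataset and enriching its multimodal representation with a retrieval vector from an image-text retrieval model; a collaborative multi-agent framework (routing, search, chat) that responds dynamically to user intent; and a re-ranking module that further improves retrieval quality. Together they offer a reusable architecture for interactive multimodal retrieval systems.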

Abstract: We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.


[11] AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals cs.CV

Qi Xu, Shuai Gong, Xuming Ran, Haihua Luo, Yangfan Hu

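TL;DR: This paper proposes the Adaptive Visual Model (AVM), a structure-preserving framework for modeling neural responses in the visual cortex across stimuli and individuals. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features and uses independently trained modulation paths to account for response variations driven by stimulus content and subject identity.

Details

Motivation: Existing deep learning models of neural responses fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals.

Result: On two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation and improves explained variance (FEVE) by 9.1% in the cross-dataset adaptation setting, showing robust generalization, interpretable condition-wise modulation, and high architectural efficiency.

Insight: The contribution is a modular, structure-preserving framework that decouples core visual encoding from condition-specific modulation paths, enabling robust generalization across conditions and individuals. It offers a scalable approach to cortical modeling for neuroscience and biologically inspired AI systems.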

Abstract: While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.


[12] Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories cs.CV | cs.AI

Chayan Jain, Rishant Sharma, Archit Garg, Ishan Bhanuka, Pratik Narang

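TL;DR: This paper proposes a multi-stage AI video generation method for producing long, cohesive video stories with consistent characters. A large language model first generates a detailed production script, a text-to-image model then creates a consistent visual anchor for each character, and these anchors guide a video generation model that synthesizes the video scene by scene.

Details

Motivation: Current text-to-video AI struggles to keep characters consistent when generating long, cohesive video stories.

Result: Baseline comparisons validate the need for the multi-stage decomposition: removing the visual anchoring mechanism drops the character consistency score from 7.99 to 0.55. The paper also analyzes cultural disparities and biases in current models between Indian-themed and Western-themed generations.

Insight: The contribution is a staged, filmmaking-like pipeline in which a visual anchoring mechanism (a visual prior) preserves character identity. Viewed objectively, decomposing video generation into modular stages of scripting, visual anchoring, and scene synthesis is an effective strategy for keeping long-form content coherent.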

Abstract: Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition; specifically, we observe that removing the visual anchoring mechanism results in a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current models, revealing distinct biases in subject consistency and dynamic degree between Indian-themed and Western-themed generations.


[13] InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression cs.CV | cs.AI

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan

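TL;DR: This paper proposes InfoTok, an information-theoretic framework for adaptive discrete video tokenization. It allocates tokens according to the information density of the video content, addressing the redundancy and information loss caused by existing fixed-rate tokenizers.

Details

Motivation: Existing video tokenizers compress all content at a fixed rate and cannot adapt to the inherent complexity and variable information density of video, which leads to redundancy or information loss. An adaptive tokenization method is therefore needed.

Result: InfoTok achieves state-of-the-art compression performance. It saves 20% of tokens without affecting performance and reaches a 2.3x compression rate while still outperforming prior heuristic adaptive approaches.

Insight: The contributions are a rigorous information-theoretic proof that existing data-agnostic training methods are suboptimal in representation length, and an algorithm based on the evidence lower bound (ELBO) that approaches the theoretical optimum. A transformer-based adaptive compressor realizes adaptive tokenization, yielding a more compressed yet accurate tokenization for video representation.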

Abstract: Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon’s information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens without affecting performance, and achieving 2.3x compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.


[14] Endo-SemiS: Towards Robust Semi-Supervised Image Segmentation for Endoscopic Video cs.CV

Hao Li, Daiwei Lu, Xing Yao, Nicholas Kavoussi, Ipek Oguz

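TL;DR: This paper proposes Endo-SemiS, a robust semi-supervised segmentation framework for endoscopic video frames that aims to provide reliable segmentation from limited annotation. It uses four strategies to exploit all available data, especially unlabeled data: cross-supervision between networks, uncertainty-guided pseudo-label generation, joint pseudo-label supervision, and mutual learning at the feature and image levels. It also adds a corrective network that uses the spatiotemporal information in endoscopic video to improve performance. The method is evaluated on two clinical datasets: kidney stone laser lithotomy and colon polyp screening.

Details

Motivation: Labeled data for endoscopic video segmentation is scarce. The goal is a robust semi-supervised segmentation framework that makes full use of unlabeled data to improve segmentation performance.

Result: On both clinical datasets (kidney stone laser lithotomy and colon polyp screening), Endo-SemiS achieves substantially better results than state-of-the-art segmentation methods when labeled data is limited.

Insight: The contribution is the integration of four complementary semi-supervised learning strategies (cross-supervision, uncertainty-guided pseudo-labels, joint pseudo-label supervision, and mutual learning) with a corrective network that uses spatiotemporal information, which together improve how efficiently the model uses unlabeled data and how robust its segmentation is. Viewed objectively, systematically combining several semi-supervised techniques and designing them around the temporal nature of endoscopic video is an effective piece of engineering.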

Abstract: In this paper, we present Endo-SemiS, a semi-supervised segmentation framework for providing reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS uses 4 strategies to improve performance by effectively utilizing all available data, particularly unlabeled data: (1) Cross-supervision between two individual networks that supervise each other; (2) Uncertainty-guided pseudo-labels from unlabeled data, which are generated by selecting high-confidence regions to improve their quality; (3) Joint pseudo-label supervision, which aggregates reliable pixels from the pseudo-labels of both networks to provide accurate supervision for unlabeled data; and (4) Mutual learning, where both networks learn from each other at the feature and image levels, reducing variance and guiding them toward a consistent solution. Additionally, a separate corrective network utilizes spatiotemporal information from endoscopy video to improve segmentation performance. Endo-SemiS is evaluated on two clinical applications: kidney stone laser lithotomy from ureteroscopy and polyp screening from colonoscopy. Compared to state-of-the-art segmentation methods, Endo-SemiS achieves substantially superior results on both datasets with limited labeled data. The code is publicly available at https://github.com/MedICL-VU/Endo-SemiS
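
Strategies (2) and (3) above can be illustrated with a small sketch for binary segmentation. It assumes each network outputs a per-pixel foreground probability, that "high confidence" means the larger of p and 1 - p exceeds a fixed threshold, and that the joint label at each pixel comes from whichever network is more confident there. The function names, the `IGNORE` marker, and the aggregation rule are illustrative assumptions, not the rules implemented in the Endo-SemiS repository.

```python
IGNORE = -1  # hypothetical marker for pixels excluded from the loss


def confident_pseudo_labels(probs, threshold=0.9):
    """Turn foreground probabilities into 0/1 labels, ignoring unsure pixels."""
    labels = []
    for p in probs:
        confidence = max(p, 1.0 - p)
        label = 1 if p >= 0.5 else 0
        labels.append(label if confidence >= threshold else IGNORE)
    return labels


def joint_pseudo_labels(probs_a, probs_b, threshold=0.9):
    """Per pixel, keep the label of the more confident network if reliable."""
    joint = []
    for pa, pb in zip(probs_a, probs_b):
        conf_a = max(pa, 1.0 - pa)
        conf_b = max(pb, 1.0 - pb)
        p, confidence = (pa, conf_a) if conf_a >= conf_b else (pb, conf_b)
        label = 1 if p >= 0.5 else 0
        joint.append(label if confidence >= threshold else IGNORE)
    return joint
```

Pixels marked `IGNORE` would simply be masked out of the unsupervised loss, so that only regions where at least one network is confident contribute a training signal.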


[15] A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos cs.CV

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori

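TL;DR: This paper introduces LongShOTBench, a diagnostic benchmark for multimodal understanding and tool use in long videos. It includes open-ended, intent-driven questions and multi-turn dialogues, each with a reference answer and a grading rubric. The authors also present LongShOTAgent, an agentic system that analyzes long videos through preprocessing, search, and iterative refinement. Current state-of-the-art models perform poorly on LongShOTBench, underscoring how hard real-world long-video understanding is.

Details

Motivation: Existing benchmarks for long-video multimodal understanding emphasize either duration or multimodal richness but rarely both, and their single-score metrics make it hard to see how models fail. The paper aims to build a more comprehensive, diagnostic benchmark to advance multimodal reasoning and tool use over long videos.

Result: On LongShOTBench, the state-of-the-art MLLM Gemini-2.5-Flash reaches 52.95%, open-source models stay below 30%, and the proposed LongShOTAgent reaches 44.66%. These results are far from ideal and show that the task is highly challenging.

Insight: The contribution is a comprehensive diagnostic benchmark that combines long temporal range, multiple modalities (video, audio, speech), open-ended questions, multi-turn dialogue, and tool use, with interpretable and traceable evaluation. The proposed agent framework handles long videos through preprocessing, search, and iterative refinement, offering a reusable system design for complex multimodal tasks.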

Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both, and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in our LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.


[16] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation cs.CV

Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh

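TL;DR: This paper proposes 4D-RGPT, a multimodal large language model specialized for 4D (3D structure plus temporal dynamics) perception and understanding. It introduces Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT, and builds R4D-Bench, a benchmark of depth-aware dynamic scenes with region-level prompting. The model achieves notable improvements on both existing 4D VQA benchmarks and the new benchmark.

Details

Motivation: Existing MLLMs have limited ability to reason over 3D structure and temporal dynamics, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D VQA benchmarks also emphasize static scenes and lack region-level prompting. The paper addresses both problems to improve region-level understanding of 4D scenes.

Result: 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the newly proposed R4D-Bench, indicating that it outperforms existing methods.

Insight: The main contributions are: (1) 4D-RGPT, an MLLM designed to capture 4D representations from video input; (2) P4D, a training framework that uses knowledge distillation to transfer the comprehensive 4D perception of a frozen expert model to the target model; and (3) R4D-Bench, a benchmark of depth-aware dynamic scenes with region-level prompting, built through a hybrid automated and human-verified pipeline. Together they suggest a new approach to improving the spatiotemporal understanding and fine-grained region reasoning of MLLMs.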

Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.


[17] Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation cs.CV

Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo

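TL;DR: This paper proposes InfCam, a camera-controlled video generation framework that needs no depth estimation. It encodes 3D camera rotations in the 2D latent space through infinite homography warping and, combined with data augmentation, generates novel-view videos of dynamic scenes with high camera-pose fidelity and visual quality.

Details

Motivation: Existing camera-controlled video generation methods suffer from inaccurate depth estimation and from the limited diversity of camera trajectories in training data, which lowers the camera-pose fidelity and quality of the generated videos.

Result: Experiments show that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity and generalizes well from synthetic to real-world data.

Insight: The contribution is encoding 3D rotation directly in the 2D latent space of a diffusion model through infinite homography warping, which avoids error-prone depth estimation, together with a data augmentation pipeline that improves generalization to diverse camera trajectories.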

Abstract: Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on a trajectory-video pair dataset, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model. Conditioning on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Link to our project page: https://emjay73.github.io/InfCam/
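
For background, the infinite homography of two views related by a pure rotation R is the standard multi-view geometry result H = K_target · R · K_source⁻¹, which maps pixels without needing depth. A minimal sketch in pixel space follows, assuming pinhole intrinsics with a single focal length and zero skew, and assuming R maps source-camera coordinates to target-camera coordinates. All function names are illustrative, and the paper applies the warp in a diffusion model's latent space, which this sketch does not attempt.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def intrinsics(f, cx, cy):
    """Pinhole intrinsic matrix K (single focal length, zero skew)."""
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(f, cx, cy):
    """Closed-form inverse of the matrix returned by intrinsics()."""
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]


def rotation_y(angle):
    """Rotation about the camera's y axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def infinite_homography(k_target, rotation, k_source_inv):
    """H = K_target * R * K_source^{-1}; exact for pure camera rotation."""
    return matmul3(matmul3(k_target, rotation), k_source_inv)


def warp_pixel(h, u, v):
    """Apply homography h to pixel (u, v) and dehomogenize."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w
```

With an identity rotation the warp leaves every pixel in place. With translation present, the same warp is exact only for points at infinity, and the remaining displacement is the parallax term that the abstract says is predicted through end-to-end training.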


[18] PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics cs.CVPDF

Nan Zhou, Huandong Wang, Jiahao Li, Yang Li, Xiao-Ping Zhang

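TL;DR: This paper proposes PhysFire-WM, a physics-informed world model for emulating fire spread dynamics. The method encodes structured priors from a physical simulator to correct physical inconsistencies, and combines this with a cross-task collaborative training strategy (CC-Train) that alleviates the limited information available in mask-based modeling, improving both the physical realism and the geometric accuracy of fire prediction.

Details

Motivation: In fine-grained fire prediction, existing methods (for example, those using only binary masks) cannot capture complex fire dynamics because the signal is sparse, and world models applied to fire forecasting suffer from physical inconsistencies.

Result: Extensive experiments on a fine-grained multimodal fire dataset show that PhysFire-WM achieves superior accuracy in fire spread prediction. The validation underscores the importance of physical priors and cross-task collaboration.

Insight: The novelty lies in encoding structured priors from a physical simulator into the world model to correct physical discrepancies, and in the cross-task collaborative training strategy (CC-Train), which integrates thermal radiation dynamics and spatial boundary delineation through parameter sharing and gradient coordination. This offers a new direction for applying physics-informed world models to disaster prediction.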

Abstract: Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.


[19] Can Synthetic Images Serve as Effective and Efficient Class Prototypes? cs.CVPDF

Dianxing Shi, Dingjie Fu, Yuqiao Liu, Jun Wang

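TL;DR: This paper proposes LGCLIP, a new vision-language model framework that uses a large language model to generate class-specific prompts, which drive a diffusion model to synthesize reference images that serve as visual prototypes. Zero-shot image classification then requires only class labels, with no reliance on manually annotated image-text pairs.

Details

Motivation: Existing vision-language models such as CLIP rely on high-quality, manually annotated image-text pairs for modality alignment, which is costly and demands high annotation accuracy; the dual-tower encoder architecture also hinders making models lightweight.

Result: Experimental results show that LGCLIP performs well on zero-shot classification, validating its feasibility and efficiency and establishing a new classification paradigm.

Insight: The core novelty is using an LLM and a diffusion model to generate visual prototypes automatically, removing the dependence on manually annotated image-text pairs and simplifying the architecture to a visual encoder only, which yields lighter and more efficient zero-shot classification. This offers a new direction for settings where data acquisition is costly or annotations are scarce.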

Abstract: Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning visual and textual modalities. This dependency imposes substantial cost and accuracy requirements on the preparation of high-quality datasets. At the same time, processing data from two modalities requires dual-tower encoders in most models, which hinders making them lightweight. To address these limitations, we introduce a “Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)” framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. These generated images then serve as visual prototypes: the visual features of real images are extracted and compared with the visual features of the prototypes to make a comparative prediction. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input during the whole experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating strong performance in zero-shot classification tasks and establishing a novel paradigm for classification.
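
The prototype comparison described in the abstract can be sketched as a nearest-prototype classifier over visual features. This is a minimal illustration, not the authors' implementation: the feature vectors are made-up 2-D stand-ins for visual-encoder outputs, averaging the synthetic-image features into one prototype per class is an assumption, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def build_prototypes(synthetic_features):
    """One prototype per class: the mean feature of its synthetic images (assumed)."""
    return {label: mean_vector(feats) for label, feats in synthetic_features.items()}

def classify(image_feature, prototypes):
    """Return the class whose prototype is most similar to the real image feature."""
    return max(prototypes, key=lambda label: cosine(image_feature, prototypes[label]))

# Toy features standing in for visual-encoder outputs of generated images.
synthetic = {
    "cat": [[0.9, 0.1], [0.8, 0.2]],
    "dog": [[0.1, 0.9], [0.2, 0.8]],
}
prototypes = build_prototypes(synthetic)
print(classify([0.7, 0.3], prototypes))  # prints "cat"
```

Only a visual encoder is needed at inference in this setup, since both the query and the prototypes live in the same visual feature space.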


[20] ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching cs.CV | cs.IRPDF

Qi Zhang, Yuxu Chen, Lei Deng, Lili Shen

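TL;DR: The paper proposes ABE-CLIP, a training-free method that strengthens the binding between attributes and objects in CLIP-like models through a semantic refinement mechanism and a local token-patch alignment strategy, improving compositional image-text matching.

Details

Motivation: CLIP struggles to associate attributes with objects accurately in compositional image-text matching because its global representation overlooks fine-grained semantics. Existing methods require additional training or hard negative sampling, yet generalize poorly and do not fundamentally address the drawbacks of global representations.

Result: Experiments on multiple datasets show that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training, and reaches an advanced level of performance.

Insight: The novelties are the training-free semantic refinement and local alignment strategies, which avoid training overhead, directly strengthen the fine-grained semantic understanding of existing models, and improve generalization to compositional concepts.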

Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
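
A minimal sketch of the local token-patch alignment idea: each refined text token is scored against its most similar image patch, and the per-token scores are aggregated into one image-text similarity. Using the maximum over patches and a plain mean for aggregation are assumptions chosen for illustration, since the abstract does not specify either; the embeddings are toy vectors and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def local_alignment_score(token_embeddings, patch_embeddings):
    """Mean over text tokens of the best cosine match among image patches."""
    best_per_token = [
        max(cosine(token, patch) for patch in patch_embeddings)
        for token in token_embeddings
    ]
    return sum(best_per_token) / len(best_per_token)

# Two phrase tokens (say, an attribute and an object) and two image patches.
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(local_alignment_score(tokens, [[1.0, 0.0], [0.0, 1.0]]))  # both tokens matched
print(local_alignment_score(tokens, [[1.0, 0.0]]))              # second token unmatched
```

Because every token must find its own supporting patch, a caption whose attribute has no matching region scores lower than it would under a single global embedding comparison.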


[21] Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs cs.CVPDF

Xiao Liang, Chenxi Liu, Zhi Ma, Di Wang, Bin Jing

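TL;DR: This paper proposes Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy for mitigating hallucinations in medical vision-language models (MedVLMs). The method uses an anatomical mask to guide a three-tiered contrastive decoding process that dynamically re-weights at the token, attention, and logits levels, steering the model's attention to specified regions, reinforcing anatomical understanding, and suppressing factually incorrect outputs.

Details

Motivation: MedVLMs hold great promise for clinical use, but their reliability is hindered by hallucinations: models often fail to derive answers from visual evidence and instead rely on learned textual priors. Existing mitigation strategies are limited: training-based methods depend on costly expert annotations and scale poorly, while training-free methods such as contrastive decoding are data-efficient but apply a global, untargeted correction whose effect is unreliable in complex real-world clinical settings.

Result: Extensive experiments on multiple datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, show that the method improves regional understanding, reduces hallucinations, and raises overall diagnostic accuracy.

Insight: The core novelty is an interpretable, plug-and-play decoding strategy guided by anatomical regions: introducing an anatomical mask gives region-specific, targeted guidance, which mitigates hallucinations in medical VLMs more reliably and without additional training. Viewed objectively, refining contrastive decoding from a global application into tiered, dynamic adjustment for specific anatomical regions is an idea worth borrowing.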

Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model’s focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method’s effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
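
To make the logits-level tier concrete, the sketch below shows a generic contrastive-decoding step: logits from a region-guided forward pass are amplified against logits from an unguided pass. The (1 + alpha) / alpha weighting is a common formulation in the contrastive decoding literature and is used here as an assumption; the abstract does not give ARCD's exact re-weighting, and the token-level and attention-level tiers are not shown. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(guided_logits, unguided_logits, alpha=1.0):
    """Amplify what the region-guided pass supports beyond the unguided pass.

    alpha = 0 returns the guided logits unchanged; larger alpha penalizes
    tokens that the unguided pass (driven more by textual priors) favors.
    """
    return [(1.0 + alpha) * g - alpha * u
            for g, u in zip(guided_logits, unguided_logits)]

# Toy next-token logits over a two-word vocabulary.
guided = [2.0, 1.0]     # forward pass with the anatomical mask applied
unguided = [2.0, 0.0]   # forward pass without region guidance
print(softmax(contrastive_logits(guided, unguided, alpha=1.0)))
```

In this toy case the second token gains probability because the guided pass raises it relative to the unguided pass, which is the kind of targeted correction the abstract contrasts with global, untargeted contrastive decoding.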


[22] Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs cs.CVPDF

Rujiao Long, Yang Li, Xingyao Zhang, Weixun Wang, Tianqianjin Lin

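TL;DR: This paper proposes the Reasoning Palette framework, which uses a variational autoencoder (VAE) to introduce a latent variable that modulates the reasoning process of large (vision-) language models. A latent context is sampled before reasoning to steer the model toward diverse reasoning strategies, improving exploration and performance.

Details

Motivation: Stochastic sampling at inference time yields redundant reasoning paths that lack high-level diversity; the work aims to improve the efficiency and controllability of exploration during both inference and reinforcement learning.

Result: On multiple reasoning benchmarks, the method achieves consistent performance gains over standard reinforcement learning methods and shows interpretable, controllable modulation of the model's strategic behavior.

Insight: The novelty lies in encoding the reasoning context as a latent variable and decoding it into learnable token prefixes that modulate the model's internal reasoning trajectory. This enables strategy sampling before reasoning begins, which structures exploration and improves reinforcement learning efficiency.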

Abstract: Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model’s internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable modulation of the (vision-) language model’s strategic behavior, thereby achieving consistent performance gains over standard RL methods.
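
The sample-then-prepend flow in the abstract can be sketched with the standard VAE reparameterization, z = mu + sigma * eps, followed by a decoder that turns z into prefix embeddings. Everything concrete here is an assumption for illustration: the linear decoder, the dimensions, the weights, and the function names are hypothetical, and a real system would learn them and feed the prefix to the language model.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + exp(0.5 * log_var) * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def decode_prefix(z, decoder_weights):
    """Toy linear decoder: one weight matrix per prefix token (assumed form)."""
    return [[sum(w * zi for w, zi in zip(row, z)) for row in matrix]
            for matrix in decoder_weights]

def prepend_prefix(prefix_embeddings, prompt_embeddings):
    """The model would consume the prefix followed by the prompt embeddings."""
    return prefix_embeddings + prompt_embeddings

rng = random.Random(0)
mu, log_var = [0.0, 1.0], [0.0, 0.0]          # hypothetical posterior parameters
decoder = [[[1.0, 0.0], [0.0, 1.0]],          # prefix token 1: identity map
           [[0.5, 0.5], [0.5, -0.5]]]         # prefix token 2: a mixing map
z = sample_latent(mu, log_var, rng)
model_input = prepend_prefix(decode_prefix(z, decoder), [[0.1, 0.2]])
print(len(model_input))  # 2 prefix tokens + 1 prompt token
```

Drawing a different `z` per rollout changes the prefix and hence the conditioning, which is how the abstract describes obtaining diverse reasoning modes without relying only on token-level sampling.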


[23] CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency cs.CV | cs.LGPDF

Xiao Liang, Yuxuan An, Di Wang, Jiawei Hu, Zhicheng Jiao

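TL;DR: This paper proposes CheXPO-v2, a new alignment framework for chest X-ray vision-language models (VLMs) that targets hallucinations in medical VLMs. Its core innovation is a shift from outcome supervision to process supervision: a reward mechanism based on knowledge graph consistency, together with entity-relation matching, supervises the model's reasoning steps at a fine granularity and penalizes incoherent logic and atomic-level hallucinations. Combined with a hard-example mining strategy, the method significantly outperforms existing methods on benchmarks such as MIMIC-CXR-VQA and reaches new state-of-the-art accuracy with only 5k samples, showing strong data efficiency and clinical reliability.

Details

Motivation: Medical VLMs are prone to hallucinations, which undermines clinical reliability. Existing reinforcement learning methods such as GRPO rely on sparse, outcome-based rewards that inadvertently encourage models to “overthink”, generating verbose, convoluted, and unverifiable chain-of-thought reasoning to justify answers; this obscures factual errors and poses significant safety risks.

Result: CheXPO-v2 significantly outperforms GRPO and state-of-the-art models on benchmarks such as MIMIC-CXR-VQA, reaches new state-of-the-art (SOTA) accuracy using only 5k samples, which demonstrates strong data efficiency, and produces clinically sound, verifiable reasoning.

Insight: The claimed novelty is the shift from outcome supervision to process supervision. At its core is the knowledge graph consistency reward, which uses entity-relation matching to parse reasoning steps into structured “Disease, Relation, Anatomy” triplets and so provides fine-grained, atomic-level supervision. Viewed objectively, combining this structured process supervision with hard-example mining is an effective route to better reliability and data efficiency in medical VLMs.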

Abstract: Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to “overthink” – generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured “Disease, Relation, Anatomy” triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
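
One way to picture a triplet-level consistency reward is as the fraction of predicted (disease, relation, anatomy) triplets that a reference graph supports. The sketch below is a hypothetical scoring rule chosen for illustration, not the reward defined in the paper; the step that parses reasoning text into triplets is omitted, and the triplets shown are invented examples.

```python
def kg_consistency_reward(predicted_triplets, reference_triplets):
    """Fraction of distinct predicted triplets found in the reference graph.

    Returns 0.0 when nothing is predicted, so empty reasoning earns no reward.
    """
    predicted = set(predicted_triplets)
    if not predicted:
        return 0.0
    supported = predicted & set(reference_triplets)
    return len(supported) / len(predicted)

# Invented example: one supported finding and one hallucinated finding.
reference = {
    ("pneumonia", "located_in", "left lower lobe"),
    ("effusion", "located_in", "right pleural space"),
}
predicted = [
    ("pneumonia", "located_in", "left lower lobe"),
    ("cardiomegaly", "located_in", "heart"),
]
print(kg_consistency_reward(predicted, reference))  # 0.5
```

Scoring each triplet separately is what makes the signal fine-grained: a single unsupported claim lowers the reward even when the final answer is correct, unlike an outcome-only reward.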


[24] DAVE: A VLM Vision Encoder for Document Understanding and Web Agents cs.CVPDF

Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris

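TL;DR: This paper proposes DAVE, a vision encoder designed for vision-language models that addresses the poor performance of existing VLMs on document understanding and web agent tasks, which stems from low-level features lacking robust structural and spatial information. By combining self-supervised pretraining with supervised autoregressive pretraining, and adopting model-merging and ensemble training strategies, DAVE improves performance without large-scale annotated data.

Details

Motivation: The vision encoders of existing VLMs lack sufficient structural and spatial information in their low-level features, which limits performance on document understanding and web agent tasks, so a vision encoder built specifically for these tasks is needed.

Result: DAVE proves effective on classic document tasks, visual question answering, web localization, and agent-based benchmarks, making it a strong vision encoder for document and web applications.

Insight: The novelties are: two-stage self-supervised and supervised training that exploits unlabeled data to cut annotation cost; a model-merging strategy that ensures compatibility with different web agent architectures; and ensemble training that fuses features from pretrained generalist encoders to strengthen document- and web-specific representations. These ideas can inform the design of task-specific vision encoders for multimodal tasks.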

Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder’s alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.


[25] Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning cs.CVPDF

Siqi Yang, Zilve Gao, Haibo Qiu, Fanfan Liu, Peng Shi

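TL;DR: This paper addresses “visual forgetting” in multimodal large language models on complex, long-chain visual reasoning tasks and proposes a disentangled curriculum learning framework. First, a disentangled supervised fine-tuning curriculum builds a robust abstract reasoning backbone on text-only data and then aligns it with vision through a perception-grounded chain-of-thought paradigm. Second, the timing of visual perception is modeled as a reinforcement learning problem with a pivotal perception reward, so that the model learns to decide autonomously when to perceive, based on linguistic markers of cognitive uncertainty.

Details

Motivation: In long-chain reasoning, multimodal large language models suffer from “visual forgetting”, a gradual loss of visual grounding as the reasoning chain grows, which the authors attribute to visual and abstract reasoning skills being entangled too early in training.

Result: The proposed method aims to turn the model from a heuristic-driven observer into a strategic, grounded reasoner. The abstract reports no specific benchmarks or quantitative results, but provides a code link.

Insight: The core novelty is to train two cognitive skills separately, “how to think” (abstract reasoning) and “when to look” (strategic visual perception), and to introduce a reinforcement-learning-based policy for autonomous perception timing, learned through a reward that couples perception to linguistic uncertainty.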

Abstract: Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is “visual forgetting”, where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as “think longer, see less”. We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning (“how-to-think”) and (2) strategic visual perception (“when-to-look”). This creates a foundational cold-start deficiency – weakening abstract reasoning – and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., “wait”, “verify”), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: https://github.com/gaozilve-max/learning-when-to-look


[26] Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos cs.CVPDF

Henghui Du, Chang Zhou, Chunjie Zhang, Xi Chen, Di Hu

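TL;DR: This paper proposes VideoDetective, an efficient method for long video question answering (LVQA). A question-aware memory mechanism processes video sub-segments recurrently and compresses the critical clues, allowing MLLMs with limited context length to handle hour-long videos.

Details

Motivation: In LVQA, the immense context and information overload lead to excessive memory consumption and computation for MLLMs, and existing methods may either lose useful information or incur high computational cost.

Result: Experiments show that the method lets an MLLM with a 32K context length efficiently process 100K tokens (3,600 frames, a one-hour video) in only 2 minutes with 37GB of GPU memory; on multiple long-video benchmarks, it finds critical clues in massive information more effectively.

Insight: The novelties are a question-aware compression strategy, which introduces special memory tokens for purposeful compression, and a memory mechanism that recurrently aggregates the history context. The paper also contributes the GLVC dataset for evaluating long-video understanding more effectively.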

Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model’s context length, they may miss useful information or require considerable computation. In fact, answering a given question requires only a small amount of crucial information. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy introduces a few special memory tokens to achieve purposeful compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, because the history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to measure a model’s long video understanding ability more effectively, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset that features grounding of critical and concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB of GPU memory. Evaluation results across multiple long video benchmarks show that our method can more effectively seek critical clues from massive information.
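
The recurrent control flow described above (process a sub-segment, compress it into a few memory tokens conditioned on the question, append them to the history, reuse the history for the next sub-segment) can be sketched with a toy stand-in for the model. Here tokens are plain strings and "compression" is keyword matching against the question, which is purely an illustrative assumption; in the paper the compression is performed by the MLLM through special memory tokens. Function names are hypothetical.

```python
def compress_segment(segment_tokens, question_terms, history, num_memory_tokens):
    """Toy question-aware compression: keep at most num_memory_tokens tokens
    that are relevant to the question and not already stored in the history."""
    relevant = [t for t in segment_tokens
                if t in question_terms and t not in history]
    return relevant[:num_memory_tokens]

def gather_clues(segments, question_terms, num_memory_tokens=2):
    """Process sub-segments in order, carrying a compact history forward."""
    history = []
    for segment in segments:
        memory = compress_segment(segment, question_terms, history, num_memory_tokens)
        history.extend(memory)  # the updated history is reused for the next segment
    return history

# Three toy sub-segments of a long video and a question about a key and a lock.
segments = [["sky", "key"], ["door", "key", "lock"], ["tree"]]
print(gather_clues(segments, {"key", "lock"}))  # ['key', 'lock']
```

The point of the loop is that the context seen at any step is one sub-segment plus a short history, never the whole video, which is how a fixed context window can cover an input several times its size.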


[27] Mitty: Diffusion-based Human-to-Robot Video Generation cs.CVPDF

Yiren Song, Cheng Liu, Weijia Mao, Mike Zheng Shou

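TL;DR: This paper proposes Mitty, an end-to-end Human2Robot video generation model based on a Diffusion Transformer that learns directly from human demonstration videos and generates robot-execution videos without action labels or intermediate representations.

Details

Motivation: Existing methods rely on intermediate representations such as keypoints or trajectories, which cause information loss and cumulative errors and harm spatiotemporal consistency, so an end-to-end method that preserves visual and temporal consistency is needed.

Result: On the Human2Robot and EPIC-Kitchens benchmarks, Mitty achieves state-of-the-art (SOTA) results and generalizes strongly to unseen environments.

Insight: The novelties are: exploiting the visual-temporal priors of a pretrained video diffusion model; fusing condition tokens from human demonstrations with robot denoising tokens through bidirectional attention; and an automatic synthesis pipeline that produces high-quality human-robot paired data from large-scale egocentric datasets.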

Abstract: Learning directly from human demonstration videos is a key milestone toward scalable and generalizable robot learning. Yet existing methods rely on intermediate representations such as keypoints or trajectories, introducing information loss and cumulative errors that harm temporal and visual consistency. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for end-to-end Human2Robot video generation. Built on a pretrained video diffusion model, Mitty leverages strong visual-temporal priors to translate human demonstrations into robot-execution videos without action labels or intermediate abstractions. Demonstration videos are compressed into condition tokens and fused with robot denoising tokens through bidirectional attention during diffusion. To mitigate paired-data scarcity, we also develop an automatic synthesis pipeline that produces high-quality human-robot pairs from large egocentric datasets. Experiments on Human2Robot and EPIC-Kitchens show that Mitty delivers state-of-the-art results, strong generalization to unseen environments, and new insights for scalable robot learning from human observations.


[28] AnyCXR: Human Anatomy Segmentation of Chest X-ray at Any Acquisition Position using Multi-stage Domain Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning cs.CVPDF

Dong Zifei, Wu Wenjie, Hao Jinkui, Chen Tianqi, Weng Ziqiao

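TL;DR: This paper proposes AnyCXR, a framework for robust anatomical segmentation of chest X-rays acquired at different projection angles. It is trained with supervision from synthetic data only, combining a multi-stage domain randomization engine that generates a large number of diverse synthetic radiographs with a conditional joint annotation regularization learning strategy that exploits imperfect annotations.

Details

Motivation: Annotations for real-world chest X-rays are scarce and acquisition conditions vary widely, which makes robust anatomical segmentation challenging. The paper aims for a unified segmentation framework that relies only on synthetic data and generalizes to arbitrary projection angles, reducing the annotation burden and improving robustness.

Result: AnyCXR achieves zero-shot generalization on multiple real-world datasets, accurately segmenting 54 anatomical structures in PA, lateral, and oblique views. Its segmentations support downstream clinical tasks such as cardiothoracic ratio estimation, spine curvature assessment, and disease classification, and improve diagnostic performance.

Insight: The novelty is a complete framework trained only on synthetic data. Its core combines multi-stage domain randomization, which generates realistic and diverse data, with conditional joint annotation regularization, which makes effective use of imperfect annotations. Together they give generalization to arbitrary projection angles and a scalable, reliable foundation for anatomy-aware CXR analysis.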

Abstract: Robust anatomical segmentation of chest X-rays (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation across arbitrary CXR projection angles using only synthetic supervision. The method combines a Multi-stage Domain Randomization (MSDR) engine, which generates over 100,000 anatomically faithful and highly diverse synthetic radiographs from 3D CT volumes, with a Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial and imperfect labels by enforcing anatomical consistency in a latent space. Trained entirely on synthetic data, AnyCXR achieves strong zero-shot generalization on multiple real-world datasets, providing accurate delineation of 54 anatomical structures in PA, lateral, and oblique views. The resulting segmentation maps support downstream clinical tasks, including automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification, where the incorporation of anatomical priors improves diagnostic performance. These results demonstrate that AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.
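
As an example of how segmentation maps feed a downstream measurement, the cardiothoracic ratio is conventionally the maximal horizontal cardiac width divided by the maximal horizontal thoracic width on a frontal view. The sketch below computes it from binary masks given as nested lists. Approximating the thoracic width by the horizontal extent of a thorax mask, the tiny example masks, and the function names are assumptions for illustration and are not taken from the paper.

```python
def horizontal_extent(mask):
    """Width in pixels between the leftmost and rightmost foreground columns."""
    columns = [x for row in mask for x, value in enumerate(row) if value]
    if not columns:
        return 0
    return max(columns) - min(columns) + 1

def cardiothoracic_ratio(heart_mask, thorax_mask):
    """Ratio of cardiac width to thoracic width from two binary masks."""
    thorax_width = horizontal_extent(thorax_mask)
    if thorax_width == 0:
        raise ValueError("thorax mask is empty")
    return horizontal_extent(heart_mask) / thorax_width

# Tiny 2x6 masks: the heart spans columns 2-4, the thorax spans columns 0-5.
heart = [[0, 0, 0, 0, 0, 0],
         [0, 0, 1, 1, 1, 0]]
thorax = [[1, 1, 1, 1, 1, 1],
          [0, 1, 1, 1, 1, 0]]
print(cardiothoracic_ratio(heart, thorax))  # 0.5
```

The measurement is only meaningful on frontal (PA or AP) projections, so a pipeline covering lateral and oblique views would need to gate it on the view type.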


[29] RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering cs.CV | cs.AI | cs.CLPDF

Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette

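TL;DR: This paper introduces RadImageNet-VQA, a large-scale CT and MRI dataset for radiologic visual question answering (VQA). It contains 750K images and 7.5M question-answer pairs covering three tasks (abnormality detection, anatomy recognition, and pathology identification), and is intended to address the small scale, single modality, and susceptibility to text shortcuts of existing medical VQA datasets.

Details

Motivation: Existing medical VQA datasets are limited in scale, based mainly on X-rays or biomedical illustrations, and prone to text shortcuts, so a large-scale, multi-modality radiologic VQA dataset free of linguistic shortcuts is needed to advance the field.

Result: Experiments show that state-of-the-art vision-language models still perform poorly on fine-grained pathology identification, especially on open-ended questions, even after fine-tuning. A text-only analysis shows that performance is near random without image inputs, confirming that the dataset avoids linguistic shortcuts.

Insight: The novelty is the construction of the first large-scale, expert-annotated CT/MRI VQA dataset covering multiple anatomical regions and pathology categories, together with rigorous experiments verifying that it avoids text shortcuts, which provides a reliable benchmark for evaluating fine-grained understanding in realistic radiology settings.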

Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.


[30] Vision-Language Model Guided Image Restoration cs.CVPDF

Cuixin Yang, Rongkang Dong, Kin-Man Lam

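TL;DR: This paper proposes the VLMIR framework, which uses the vision-language priors of vision-language models such as CLIP to improve image restoration through a two-stage approach (VLM feature extraction followed by diffusion-based restoration), achieving strong results on both universal and degradation-specific tasks.

Details

Motivation: Existing image restoration methods struggle to combine visual and linguistic knowledge effectively and fail to use linguistic priors to ensure semantic coherence during restoration, so a method is needed that exploits both pixel-level fidelity and high-level semantic understanding.

Result: Extensive experiments show that VLMIR achieves superior performance on both universal and degradation-specific image restoration tasks, confirming the key role of integrated visual and linguistic knowledge in improving restoration.

Insight: The novelty lies in extracting complementary visual and linguistic representations with a VLM such as CLIP, aligning the caption embeddings of low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and integrating these embeddings through the cross-attention mechanism of a diffusion model, which strengthens the semantic coherence of the restoration.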

Abstract: Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.


[31] Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images cs.CVPDF

Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen

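TL;DR: This paper proposes DRIM, a model that achieves deep yet reliable multi-turn visual reasoning through three stages (data construction, cold-start supervised fine-tuning, and reinforcement learning), addressing the difficulty existing vision-language models have in reflecting on and correcting erroneous reasoning trajectories in complex visual tasks.

Details

Motivation: When reasoning over images with chain-of-thought, existing large vision-language models often struggle to reflect on and correct erroneous reasoning trajectories, which limits their reliability on complex visual tasks.

Result: Extensive experiments show that DRIM achieves superior performance on visual understanding benchmarks.

Insight: The novelty is a reinforcement learning stage with redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern: reasoning trajectories are judged, and those that produce incorrect answers without sufficient multi-scale exploration are penalized.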

Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.


[32] CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning cs.CVPDF

Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang

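TL;DR: CodeDance is a dynamic, tool-integrated multimodal large language model that performs executable visual reasoning by generating and executing code to orchestrate multiple tools. It goes beyond approaches that rely on fixed schemas or text-only chains, yields transparent, self-checkable reasoning, and outperforms advanced models including GPT-4o on multiple benchmarks.

Details

Motivation: Current open-source approaches to visual reasoning rely mainly on text-only chains, fixed visual schemas, or single-step pipelines, which limits flexibility, interpretability, and transferability on complex tasks. The paper explores executable code as a general solver to overcome these limits.

Result: Extensive experiments on reasoning benchmarks spanning visual search, math, and chart QA show that CodeDance consistently outperforms schema-driven and text-only baselines and also surpasses advanced closed models such as GPT-4o as well as larger open-source models.

Insight: The core novelty is using executable code as a general solver to orchestrate tools and compute intermediate results, together with a reward for balanced, adaptive tool calling. Notably, emergent behaviors were observed during reinforcement learning training, including novel tool invocations, unseen compositions, and cross-task transfer, which suggests a general and scalable mechanism for executable visual reasoning.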

Abstract: Recent releases such as o3 highlight human-like “thinking with images” reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.


[33] Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model cs.CVPDF

SuBeen Lee, GilHan Park, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

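TL;DR: This paper proposes Auxiliary Descriptive Knowledge (ADK), a new framework for improving vision-language models (VLMs) in few-shot adaptation (FSA-VLM). It uses a large language model to generate rich descriptive prompts for each class offline and uses them as compositional knowledge and instance-specific knowledge to enrich text representations, improving the model's understanding of distribution-shifted downstream tasks without sacrificing efficiency.

Details

Motivation: Although VLMs perform well zero-shot, they do poorly on downstream tasks whose distribution is shifted from the pre-training data. Existing parameter-efficient fine-tuning methods rely on fixed, handcrafted prompts that cannot fully capture class semantics, while methods that introduce image-induced prompts incur excessive computational overhead at inference.

Result: Extensive experiments show that ADK consistently improves multiple parameter-efficient fine-tuning baselines and sets a new state of the art (SOTA) across various scenarios.

Insight: The novelty is ADK, a parameter-free, plug-and-play component that generates descriptive prompts offline and uses them in two ways: compositional knowledge, which provides rich semantics, and instance-specific knowledge, which dynamically selects relevant descriptions. This enriches text representations efficiently and addresses both the weak semantics of fixed prompts and the high computational cost of image-induced prompts.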

Abstract: Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
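
The two uses of the pre-computed description features can be sketched as follows: compositional knowledge as the average of a class's description embeddings, and instance-specific knowledge as an attention-weighted sum in which the image feature acts as the query. This is a minimal sketch under stated assumptions: the dot-product scoring, the softmax temperature, the toy 2-D embeddings, and the function names are hypothetical, and the abstract only says the attention mechanism is lightweight and non-parametric.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compositional_knowledge(description_embeddings):
    """Average of all description embeddings for one class."""
    n = len(description_embeddings)
    return [sum(column) / n for column in zip(*description_embeddings)]

def instance_specific_knowledge(image_feature, description_embeddings, temperature=0.1):
    """Parameter-free attention: weight descriptions by similarity to the image."""
    weights = softmax([dot(image_feature, d) / temperature
                       for d in description_embeddings])
    dim = len(description_embeddings[0])
    return [sum(w * d[i] for w, d in zip(weights, description_embeddings))
            for i in range(dim)]

# Two toy description embeddings for one class, and one image feature.
descriptions = [[1.0, 0.0], [0.0, 1.0]]
print(compositional_knowledge(descriptions))                   # [0.5, 0.5]
print(instance_specific_knowledge([1.0, 0.0], descriptions))   # close to [1.0, 0.0]
```

Because the description embeddings are computed offline and the attention has no learned weights, the per-image cost at inference is a handful of dot products per class, which is consistent with the efficiency claim in the abstract.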


[34] A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs cs.CV | cs.AI | cs.MM

Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang

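TL;DR: To address the shortcomings of existing benchmarks for remote sensing multimodal large language models (MLLMs), this paper introduces RSHR-Bench, an ultra-high-resolution benchmark for remote sensing visual understanding and reasoning. It contains 5,329 full-scene images whose long side is at least 4,000 pixels, defines four task families (multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation), and reduces reliance on language priors through adversarial filtering and human verification. Evaluation shows that existing models still have a performance gap in ultra-high-resolution scenarios.

Details

Motivation: Most existing remote sensing MLLM benchmarks rely on low-resolution imagery, and some high-resolution benchmarks have flawed reasoning-task designs. As a result, text-only LLMs can do well on the reasoning tasks without access to the images, so these benchmarks cannot faithfully assess visual understanding.

Result: Evaluating open-source, closed-source, and remote-sensing-specific vision-language models on RSHR-Bench reveals persistent performance gaps in ultra-high-resolution scenarios.

Insight: The novelty lies in building an ultra-high-resolution remote sensing benchmark with rigorous task design. Adversarial filtering and human verification reduce interference from language priors, allowing a more faithful evaluation of visual understanding and reasoning.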

Abstract: Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
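The adversarial filtering step can be sketched as follows. This is a minimal illustration that assumes filtering means discarding any multiple-choice question that a text-only model answers correctly without seeing the image; `text_only_answer` is a hypothetical stand-in for a strong LLM, and the paper's actual procedure may differ.

```python
from typing import Any, Callable, Dict, List

Question = Dict[str, Any]  # assumed keys: "question", "options", "answer"

def adversarial_filter(
    questions: List[Question],
    text_only_models: List[Callable[[str, List[str]], str]],
) -> List[Question]:
    """Keep only questions that no text-only model gets right without the image."""
    kept = []
    for q in questions:
        solved_blind = any(
            model(q["question"], q["options"]) == q["answer"]
            for model in text_only_models
        )
        if not solved_blind:
            kept.append(q)  # survivors would still go to human verification
    return kept

# Hypothetical text-only "model": picks any option containing "runway", else the first option.
def text_only_answer(question: str, options: List[str]) -> str:
    for opt in options:
        if "runway" in opt:
            return opt
    return options[0]

pool = [
    {"question": "What do aircraft land on?", "options": ["river", "runway"], "answer": "runway"},
    {"question": "How many ships are docked?", "options": ["3", "7"], "answer": "7"},
]
filtered = adversarial_filter(pool, [text_only_answer])
```

In this toy pool the first question is answerable from language priors alone and is dropped, while the second needs the image and is kept.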


[35] DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training cs.CV

Jiyun Kong, Jun-Hyuk Kim, Jong-Seok Lee

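TL;DR: The paper proposes DESSERT, a diffusion-based, event-driven single-frame synthesis framework. It uses residual training on top of a pre-trained Stable Diffusion model to synthesize future frames from event data, improving temporal consistency and sharpness. Training has two stages, an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) and a conditional diffusion model, and Diverse-Length Temporal (DLT) augmentation is introduced to improve robustness.

Details

Motivation: Earlier event-camera video frame prediction methods rely on optical-flow prediction, and inaccurate pixel displacement produces holes and blur. The paper aims to improve prediction quality in dynamic scenes.

Result: Experiments show that DESSERT outperforms existing methods for event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation, producing sharper and more temporally consistent frames.

Insight: The novelties are combining a pre-trained diffusion model with event data and training on residuals to ensure temporal consistency; the ER-VAE, which aligns event frames with residuals; and DLT augmentation, which improves robustness. Together they offer a new approach to event-camera frame prediction.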

Abstract: Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.


[36] Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling cs.CV

Sander Moonemans, Sebastiaan Ram, Frédérique Meeuwsen, Carlijn Lems, Jeroen van der Laak

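TL;DR: This paper presents an open pipeline and dataset for democratizing pathology co-pilots, addressing the limitations of current vision-language models (VLMs) for whole-slide images. The main contributions are: 1) Polysome, a standardized tool for synthetic instruction generation; 2) HISTAI-Instruct, a large-scale whole-slide instruction-tuning dataset generated by applying Polysome to the public HISTAI dataset, spanning 24,259 slides and more than 1.1 million instruction-response pairs; and 3) ANTONI-α, a VLM trained on this dataset that outperforms MedGemma on whole-slide visual question answering tasks of tissue identification, neoplasm detection, and differential diagnosis. All methods, data, and code are publicly available.

Details

Motivation: Most current VLMs either focus on small regions within whole-slide images, provide only static slide-level outputs, or rely on non-public data, which limits reproducibility. In addition, training data that pairs whole-slide images with detailed clinical reports is scarce, hindering the development of transparent and generalizable VLMs.

Result: ANTONI-α outperforms MedGemma on whole-slide visual question answering tasks (tissue identification, neoplasm detection, and differential diagnosis). The paper also compares several ANTONI-α variants trained with different amounts of data.

Insight: The novelties are: 1) a standardized synthetic instruction generation tool (Polysome) that automates the creation of large-scale instruction-tuning data; 2) the creation and open release of the first large-scale, publicly available whole-slide instruction-tuning dataset (HISTAI-Instruct); and 3) evidence that a VLM trained on synthetic instruction data (ANTONI-α) can outperform existing models such as MedGemma on key pathology tasks, providing an open framework and data foundation for transparent, reproducible pathology AI assistants.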

Abstract: Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.


[37] Towards Deeper Emotional Reflection: Crafting Affective Image Filters with Generative Priors cs.CV

Peixuan Zhang, Shuchen Weng, Jiajun Tang, Si Li, Boxin Shi

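TL;DR: This paper proposes the Affective Image Filter (AIF) task, which aims to turn visually abstract emotions in text into visually concrete images and thereby create emotionally compelling results. The authors first build the AIF dataset and define the model formulation, then present AIF-B, an initial model based on a multi-modal Transformer, and extend it to AIF-D, which uses generative priors from pre-trained large-scale diffusion models to achieve deeper emotional reflection. Experiments show that AIF models outperform existing methods in both content consistency and emotional fidelity and evoke specific emotions more effectively.

Details

Motivation: Social media users often express emotions by combining text and images, but existing methods struggle to turn the abstract emotions in text into visually concrete images. This calls for image generation techniques that reflect emotion more deeply.

Result: Quantitative and qualitative experiments show that AIF models outperform current state-of-the-art methods in both content consistency and emotional fidelity. Extensive user studies confirm that AIF models are significantly more effective at evoking specific emotions.

Insight: The novelty lies in proposing the AIF task and its dataset, and in using generative priors from pre-trained diffusion models to deepen emotional reflection, which opens a new research direction and technical path for multi-modal affective generation.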

Abstract: Social media platforms enable users to express emotions by posting text with accompanying images. In this paper, we propose the Affective Image Filter (AIF) task, which aims to reflect visually-abstract emotions from text into visually-concrete images, thereby creating emotionally compelling results. We first introduce the AIF dataset and the formulation of the AIF models. Then, we present AIF-B as an initial attempt based on a multi-modal transformer architecture. After that, we propose AIF-D as an extension of AIF-B towards deeper emotional reflection, effectively leveraging generative priors from pre-trained large-scale diffusion models. Quantitative and qualitative experiments demonstrate that AIF models achieve superior performance for both content consistency and emotional fidelity compared to state-of-the-art methods. Extensive user study experiments demonstrate that AIF models are significantly more effective at evoking specific emotions. Based on the presented results, we comprehensively discuss the value and potential of AIF models.


[38] AIFloodSense: A Global Aerial Imagery Dataset for Semantic Segmentation and Understanding of Flooded Environments cs.CV

Georgios Simantiris, Konstantinos Bacharidis, Apostolos Papanikolaou, Petros Giannakakis, Costas Panagiotakis

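TL;DR: This paper introduces AIFloodSense, a global aerial imagery dataset for semantic segmentation and understanding of flooded environments. It contains 470 high-resolution images from 230 distinct flood events across 64 countries and supports three tasks: image classification, semantic segmentation, and visual question answering. It aims to advance domain-generalized AI tools for climate resilience.

Details

Motivation: Existing flood segmentation datasets are limited in geographic scope and annotation detail, which hinders the development of robust, generalizable computer vision methods. To fill this gap, the authors build a public dataset that is globally diverse and temporally current (2022-2024).

Result: The authors establish baseline benchmarks for all tasks using state-of-the-art architectures, demonstrating the dataset's complexity and its value for advancing domain-generalized AI tools.

Insight: The novelty is a comprehensive flood imagery dataset with global geographic and temporal diversity that, for the first time, integrates three complementary tasks (classification, segmentation, and VQA) in a single dataset to support more comprehensive natural-language reasoning for disaster assessment.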

Abstract: Accurate flood detection from visual data is a critical step toward improving disaster response and risk assessment, yet datasets for flood segmentation remain scarce due to the challenges of collecting and annotating large-scale imagery. Existing resources are often limited in geographic scope and annotation detail, hindering the development of robust, generalized computer vision methods. To bridge this gap, we introduce AIFloodSense, a comprehensive, publicly available aerial imagery dataset comprising 470 high-resolution images from 230 distinct flood events across 64 countries and six continents. Unlike prior benchmarks, AIFloodSense ensures global diversity and temporal relevance (2022-2024), supporting three complementary tasks: (i) Image Classification with novel sub-tasks for environment type, camera angle, and continent recognition; (ii) Semantic Segmentation providing precise pixel-level masks for flood, sky, and buildings; and (iii) Visual Question Answering (VQA) to enable natural language reasoning for disaster assessment. We establish baseline benchmarks for all tasks using state-of-the-art architectures, demonstrating the dataset’s complexity and its value in advancing domain-generalized AI tools for climate resilience.


[39] Xiaomi MiMo-VL-Miloco Technical Report cs.CV

Jiaze Li, Jingyang Chen, Yuxun Qu, Jianzhong Ju, Zhenbo Luo

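TL;DR: Xiaomi open-sources MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of vision-language models focused on home scenarios. The model performs well on both home-scenario understanding (such as gesture recognition) and general multimodal reasoning (such as video and language understanding benchmarks). A two-stage training pipeline (supervised fine-tuning and reinforcement learning based on Group Relative Policy Optimization), combined with chain-of-thought supervision and token-budget-aware reasoning, balances specialization and generality.

Details

Motivation: Smart-home environments need vision-language models that combine specialized home-scenario understanding with general multimodal reasoning; the work aims to balance specialization and generality.

Result: The model attains leading F1 scores on home-scenario understanding (gesture recognition and others); delivers consistent gains on video benchmarks such as Video-MME, Video-MMMU, and Charades-STA and on language understanding benchmarks such as MMMU-Pro and MMLU-Pro; and outperforms several closed-source and open-source baselines.

Insight: A two-stage training pipeline with reinforcement learning balances specialization and generality; chain-of-thought supervision and token-budget-aware reasoning improve data efficiency and reasoning efficiency; and targeted home-scenario training improves not only activity and gesture understanding but also text-only reasoning.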

Abstract: We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
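Group Relative Policy Optimization scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that normalization in its commonly used form (reward minus group mean, divided by group standard deviation). The population standard deviation, the epsilon, and the example rewards are illustrative assumptions; the report's exact training setup is not given in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four sampled responses with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.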


[40] LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents cs.CV

Yun He, Francesco Pittaluga, Ziyu Jiang, Matthias Zwicker, Manmohan Chandraker

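TL;DR: LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It uses explicit 3D scene decomposition to represent a driving video as a scene graph containing the static background and dynamic objects. An agentic pipeline coordinated by an Orchestrator turns user instructions into execution graphs that invoke specialized agents and tools (an Object Grounding Agent, a Behavior Editing Agent, and a Behavior Reviewer Agent) for fine-grained editing. Finally, the edited scene is rendered and refined with a video diffusion tool to improve realism.

Details

Motivation: Existing driving-video editing methods fall short in fine-grained control from natural-language instructions, in multi-object behavior editing, and in preserving realistic scene structure and plausible traffic.

Result: It achieves nearly 2× higher instruction alignment than the previous SOTA method, with superior structural preservation, photorealism, and traffic realism.

Insight: The novelty is combining an explicit 3D scene graph representation with a multi-modal agentic pipeline. Decomposed agents (object grounding, behavior editing, and behavior review) cooperate to perform fine-grained editing of object nodes (removal, insertion, replacement) and of multi-object behavior from a single natural-language instruction, and a video diffusion tool post-processes the result to enhance realism.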

Abstract: LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It leverages explicit 3D scene decomposition to represent driving videos as a scene graph, containing static background and dynamic objects. To enable fine-grained editing and realism, it incorporates an agentic pipeline in which an Orchestrator transforms user instructions into execution graphs that coordinate specialized agents and tools. Specifically, an Object Grounding Agent establishes correspondence between free-form text descriptions and target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and then refined using a video diffusion tool to address artifacts introduced by object insertion and significant view changes. LangDriveCTRL supports both object node editing (removal, insertion and replacement) and multi-object behavior editing from a single natural-language instruction. Quantitatively, it achieves nearly 2× higher instruction alignment than the previous SoTA, with superior structural preservation, photorealism, and traffic realism. Project page is available at: https://yunhe24.github.io/langdrivectrl/.


[41] MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation cs.CV | cs.LG

Jon Muhovič, Janez Perš

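TL;DR: This paper presents MULTIAQUA, a new multimodal maritime dataset containing synchronized and calibrated data from several sensors, including RGB, thermal, IR, and LIDAR, with the aim of improving scene understanding for unmanned surface vehicles under poor visual conditions. It also describes strategies for training robust multimodal semantic segmentation models using only daytime images, which simplifies data acquisition and training.

Details

Motivation: Under difficult weather and lighting conditions (such as night or low visibility), unmanned surface vehicles cannot perceive their environment accurately from RGB images alone; multimodal data is needed to make scene interpretation more robust.

Result: Several multimodal methods are evaluated on the proposed difficult nighttime test set. The proposed training approach lets models retain reliable performance even in near-complete darkness, achieving robust semantic segmentation.

Insight: The novelties are a comprehensive multimodal maritime dataset and a robust training strategy that generalizes to nighttime scenes while training only on daytime data. This lowers data acquisition and annotation costs and is of practical value for real-world deployment.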

Abstract: Unmanned surface vehicles can encounter a number of varied visual circumstances during operation, some of which can be very difficult to interpret. While most cases can be solved only using color camera images, some weather and lighting conditions require additional information. To expand the available maritime data, we present a novel multimodal maritime dataset MULTIAQUA (Multimodal Aquatic Dataset). Our dataset contains synchronized, calibrated and annotated data captured by sensors of different modalities, such as RGB, thermal, IR, LIDAR, etc. The dataset is aimed at developing supervised methods that can extract useful information from these modalities in order to provide a high quality of scene interpretation regardless of potentially poor visibility conditions. To illustrate the benefits of the proposed dataset, we evaluate several multimodal methods on our difficult nighttime test set. We present training approaches that enable multimodal methods to be trained in a more robust way, thus enabling them to retain reliable performance even in near-complete darkness. Our approach allows for training a robust deep neural network only using daytime images, thus significantly simplifying data acquisition, annotation, and the training process.


[42] 3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework cs.CV

Tobias Sautter, Jan-Niklas Dihlmann, Hendrik P. A. Lensch

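TL;DR: This paper proposes 3D-RE-GEN, a framework that reconstructs textured 3D meshes of an indoor scene from a single image. By combining several state-of-the-art models, it addresses the weaknesses of existing methods in object decomposition, spatial relationships, and background generation, aiming to produce complete 3D scenes with physically plausible layouts that artists can edit directly.

Details

Motivation: Existing 3D scene generation methods look good, but their representations (such as NeRF) do not meet the needs of artists in visual effects and game development, who require modifiable textured mesh scenes. Current textured mesh reconstruction methods suffer from incorrect object decomposition, inaccurate spatial relationships, and missing backgrounds, so they cannot be used directly in artistic workflows.

Result: The paper claims that 3D-RE-GEN achieves state-of-the-art performance in single-image 3D scene reconstruction, but does not mention specific quantitative metrics or benchmark names.

Insight: The main novelties are: 1) a compositional generative framework that integrates domain-specific SOTA models for asset detection, reconstruction, and placement and extends them beyond their original scope; 2) treating the recovery of occluded objects as an image editing task, using generative models for scene-level reasoning to infer and reconstruct under consistent lighting and geometry; 3) generating a comprehensive background that spatially constrains object optimization and lays the foundation for lighting and simulation tasks; and 4) a novel 4-DoF differentiable optimization that aligns reconstructed objects with the estimated ground plane to obtain physically plausible layouts.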

Abstract: Recent advances in 3D scene generation produce visually appealing output, but current representations hinder artists’ workflows that require modifiable 3D textured mesh scenes for visual effects and game development. Despite significant advances, current textured mesh scene reconstruction methods are far from artist-ready, suffering from incorrect object decomposition, inaccurate spatial relationships, and missing backgrounds. We present 3D-RE-GEN, a compositional framework that reconstructs a single image into textured 3D objects and a background. We show that combining state-of-the-art models from specific domains achieves state-of-the-art scene reconstruction performance, addressing artists’ requirements. Our reconstruction pipeline integrates models for asset detection, reconstruction, and placement, pushing certain models beyond their originally intended domains. Obtaining occluded objects is treated as an image editing task with generative models to infer and reconstruct with scene-level reasoning under consistent lighting and geometry. Unlike current methods, 3D-RE-GEN generates a comprehensive background that spatially constrains objects during optimization and provides a foundation for realistic lighting and simulation tasks in visual effects and games. To obtain physically realistic layouts, we employ a novel 4-DoF differentiable optimization that aligns reconstructed objects with the estimated ground plane. 3D-RE-GEN achieves state-of-the-art performance in single-image 3D scene reconstruction, producing coherent, modifiable scenes through compositional generation guided by precise camera recovery and spatial optimization.


[43] TwinSegNet: A Digital Twin-Enabled Federated Learning Framework for Brain Tumor Analysis cs.CV | cs.LG

Almustapha A. Wakili, Adamu Hussaini, Abubakar A. Musa, Woosub Jung, Wei Yu

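TL;DR: This paper proposes TwinSegNet, a privacy-preserving framework that combines digital twins with federated learning for brain tumor segmentation. It uses a hybrid ViT-UNet model whose convolutional encoders and Vision Transformer bottleneck capture local and global context, and each institution fine-tunes the global model on its private data to build a personalized digital twin. Validated on nine heterogeneous MRI datasets, including BraTS 2019-2021, the model is robust on non-IID data.

Details

Motivation: Current deep learning methods for brain tumor segmentation rely on centralized data collection, which raises privacy concerns and limits generalization across institutions.

Result: Evaluated on nine heterogeneous MRI datasets, including BraTS 2019-2021, TwinSegNet reaches Dice scores of up to 0.90%, with sensitivity and specificity above 90%, and remains robust across non-IID client distributions. Compared with the centralized model TumorVisNet, it preserves privacy without sacrificing performance.

Insight: The novelties are combining digital twins with federated learning to enable personalized model fine-tuning; a hybrid ViT-UNet architecture that integrates the strengths of convolutions and Transformers; and a framework that supports scalable segmentation in multi-institutional clinical settings under privacy protection, in line with data confidentiality requirements.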

Abstract: Brain tumor segmentation is critical in diagnosis and treatment planning for the disease. Yet, current deep learning methods rely on centralized data collection, which raises privacy concerns and limits generalization across diverse institutions. In this paper, we propose TwinSegNet, which is a privacy-preserving federated learning framework that integrates a hybrid ViT-UNet model with personalized digital twins for accurate and real-time brain tumor segmentation. Our architecture combines convolutional encoders with Vision Transformer bottlenecks to capture local and global context. Each institution fine-tunes the global model on its private data to form its digital twin. Evaluated on nine heterogeneous MRI datasets, including BraTS 2019-2021 and custom tumor collections, TwinSegNet achieves high Dice scores (up to 0.90%) and sensitivity/specificity exceeding 90%, demonstrating robustness across non-independent and identically distributed (non-IID) client distributions. Comparative results against centralized models such as TumorVisNet highlight TwinSegNet’s effectiveness in preserving privacy without sacrificing performance. Our approach enables scalable, personalized segmentation for multi-institutional clinical settings while adhering to strict data confidentiality requirements.
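Two building blocks named in the abstract can be sketched in a few lines: the Dice score used for evaluation, and the server-side aggregation step of federated learning. The aggregation rule shown is plain size-weighted averaging (FedAvg), which is an assumption because the abstract does not state which rule TwinSegNet uses. Masks and weights are toy values.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total

def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local dataset size (FedAvg, assumed)."""
    n = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / n
        for i in range(dim)
    ]

# Toy segmentation masks: two overlapping pixels, mask sizes 3 and 2.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 0, 1]
score = dice_score(pred, target)

# Two hypothetical institutions holding 1 and 3 scans; only parameters leave each site.
global_weights = federated_average([[0.0, 2.0], [4.0, 6.0]], [1, 3])
```

The Dice score defined this way lies between 0 and 1. In the federated step only model parameters are exchanged, which is what keeps each institution's scans local.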


[44] MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding cs.CV

Oskar Kristoffersen, Alba R. Sánchez, Morten R. Hannemose, Anders B. Dahl, Dim P. Papadopoulos

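TL;DR: This paper introduces MMLANDMARKS, a multimodal benchmark for geo-spatial understanding. It contains 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States, and supports tasks such as cross-view retrieval and geolocalization.

Details

Motivation: Current geo-spatial benchmarks have limited coverage across modalities, which restricts progress in the field because all relevant modalities cannot be integrated within a unified framework. A comprehensive multimodal dataset is therefore needed.

Result: A simple CLIP-inspired baseline shows broad generalization and performance competitive with off-the-shelf foundation models and specialized SOTA models on tasks such as cross-view ground-to-satellite retrieval and geolocalization.

Insight: The novelty is a multimodal geo-spatial benchmark with one-to-one correspondence across modalities. It underlines the need for multimodal datasets to achieve broad geo-spatial understanding and demonstrates the effectiveness of a simple baseline.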

Abstract: Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.


[45] GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation cs.CV

Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu

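TL;DR: This paper proposes GroundingME, a new benchmark for systematically evaluating the true visual grounding ability of multimodal large language models. It comprises 1,005 challenging real-world examples built along four dimensions: discrimination, spatial relations, limited visual information, and rejection. Evaluating 25 SOTA models reveals a large capability gap: the best model reaches only 45.1% accuracy, and performance on rejection tasks is extremely poor. The paper also explores two improvement strategies, test-time scaling and data-mixture training.

Details

Motivation: Existing benchmarks fail to capture real-world complexity, so the visual grounding ability of MLLMs is overestimated. The paper asks whether MLLMs truly have human-like, fine-grained vision-language alignment or are merely pattern-matching on simplified datasets.

Result: The 25 SOTA MLLMs evaluated on GroundingME perform poorly: the best model reaches only 45.1% accuracy, and most models score 0% on rejection tasks. With the proposed improvement strategies, complex grounding improves by up to 2.9% and rejection accuracy rises from 0% to 27.9%.

Insight: The novelty is a multi-dimensional, high-difficulty visual grounding benchmark that systematically exposes the weaknesses of MLLMs in real-world scenarios, in particular the severe lack of ability to reject ungroundable queries. It provides a clear diagnostic tool and roadmap for safe deployment and capability improvement of future models.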

Abstract: Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs’ true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects, (2) Spatial, understanding complex relational descriptions, (3) Limited, handling occlusions or tiny objects, and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvements: (1) test-time scaling selects optimal response by thinking trajectory to improve complex grounding by up to 2.9%, and (2) data-mixture training teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
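One way to score a grounding benchmark that includes rejection cases is sketched below. It assumes a prediction counts as correct when its box overlaps the ground truth with IoU of at least 0.5, and that an ungroundable query is marked by a ground truth of `None` and is answered correctly only by predicting `None`. The threshold and the `None` convention are illustrative assumptions, not the benchmark's published protocol.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(prediction, ground_truth, threshold=0.5):
    """Rejection cases need a None prediction; other cases need enough overlap."""
    if ground_truth is None:
        return prediction is None
    if prediction is None:
        return False
    return iou(prediction, ground_truth) >= threshold

def accuracy(predictions, ground_truths):
    hits = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy run: a good box, a box that misses, and a hallucinated box on an ungroundable query.
preds = [(0, 0, 10, 10), (0, 0, 4, 4), (5, 5, 9, 9)]
truths = [(0, 0, 10, 8), (6, 6, 10, 10), None]
acc = accuracy(preds, truths)
```

Under this scoring, a model that always outputs some box can never get a rejection case right, which matches the failure mode the abstract describes.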


[46] InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion cs.CV | cs.AI

Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim

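TL;DR: InsertAnywhere is a new framework for realistic video object insertion (VOI). It combines 4D scene geometry understanding with diffusion models to address the shortcomings of existing methods in geometric consistency, occlusion handling, and lighting effects. The method first uses a 4D-aware mask generation module to reconstruct scene geometry and propagate the object placement across frames, then uses a diffusion-based video generation model to jointly synthesize the inserted object and its local lighting changes.

Details

Motivation: In controllable video editing with diffusion-based video generation, realistic video object insertion remains challenging because of limited 4D scene understanding and inadequate handling of occlusion and lighting effects.

Result: In extensive experiments, the framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.

Insight: The novelty is combining 4D scene geometry reconstruction with a diffusion model and introducing ROSE++, an illumination-aware synthetic dataset, for supervised training, which unifies geometric consistency and appearance realism. Viewed objectively, its deep integration of geometric priors with generative models offers a new paradigm for video editing tasks.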

Abstract: Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.


[47] Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection cs.CV

Sairam VCR, Rishabh Lalla, Aveen Dayal, Tejal Kulkarni, Anuj Lalla

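TL;DR: The paper proposes FALCON-SFOD, a framework for source-free object detection (SFOD) that uses priors from vision foundation models to strengthen object focus in feature space. It has two complementary components: Spatial Prior-Aware Regularization (SPAR) and Imbalance-aware Noise Robust Pseudo-Labeling (IRPL).

Details

Motivation: Current SFOD methods rely on Mean-Teacher self-labeling, but domain shift weakens object focus and yields unreliable pseudo-labels. Prior work mainly refines the pseudo-labels and overlooks strengthening the feature space itself.

Result: It achieves competitive performance on SFOD benchmarks.

Insight: The novelty lies in regularizing the feature space with priors from a foundation model (such as OV-SAM) to strengthen object focus, and in an imbalance-aware, noise-robust pseudo-labeling method. A theoretical analysis connects these designs to tighter localization and classification error bounds.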

Abstract: Current state-of-the-art approaches in Source-Free Object Detection (SFOD) typically rely on Mean-Teacher self-labeling. However, domain shift often reduces the detector’s ability to maintain strong object-focused representations, causing high-confidence activations over background clutter. This weak object focus results in unreliable pseudo-labels from the detection head. While prior works mainly refine these pseudo-labels, they overlook the underlying need to strengthen the feature space itself. We propose FALCON-SFOD (Foundation-Aligned Learning with Clutter suppression and Noise robustness), a framework designed to enhance object-focused adaptation under domain shift. It consists of two complementary components. SPAR (Spatial Prior-Aware Regularization) leverages the generalization strength of vision foundation models to regularize the detector’s feature space. Using class-agnostic binary masks derived from OV-SAM, SPAR promotes structured and foreground-focused activations by guiding the network toward object regions. IRPL (Imbalance-aware Noise Robust Pseudo-Labeling) complements SPAR by promoting balanced and noise-tolerant learning under severe foreground-background imbalance. Guided by a theoretical analysis that connects these designs to tighter localization and classification error bounds, FALCON-SFOD achieves competitive performance across SFOD benchmarks.


[48] Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding cs.CV | cs.AI

Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu

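TL;DR: This paper proposes Robust-R1, a framework that improves the robustness of multimodal large language models to real-world visual degradations by explicitly modeling those degradations through reasoning chains. The method combines supervised fine-tuning, reward-driven alignment, and dynamic reasoning depth scaling, and is trained on a dataset of 11K synthetically degraded images. Experiments show that Robust-R1 outperforms existing general and robust baselines on the R-Bench benchmark and maintains strong performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.

Details

Motivation: Multimodal large language models perform unreliably under extreme real-world visual degradations. Existing methods rely mainly on implicit training or adaptation that focuses only on visual encoder generalization, which limits interpretability and optimizes components in isolation.

Result: Robust-R1 achieves state-of-the-art robustness on the real-world degradation benchmark R-Bench, outperforming all general and robust baselines, and maintains superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.

Insight: The novelty is explicitly modeling visual degradations through structured reasoning chains, combining degradation-aware reasoning foundations, reward-driven perception of degradation parameters, and dynamic reasoning depth scaling adapted to degradation intensity. Viewed objectively, the method makes the model's handling of degradations more interpretable and systematic.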

Abstract: Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.


[49] FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views cs.CV

Qijian Tian, Xin Tan, Jiayu Ying, Xuhong Wang, Yuan Xie

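TL;DR: FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussian representations from arbitrary views. Using a training framework that needs no 3D annotations, it draws on large-scale video data and 2D instance information, combined with instance-guided contrastive learning and a geometry-semantic hierarchical sparsification strategy, to efficiently produce 3D representations that are geometrically accurate, realistic in appearance, and semantically aligned.

Details

Motivation: Existing feed-forward reconstruction methods depend on fixed input views and lack sufficient 3D training data. The work aims to lift 2D to 3D from arbitrary uncalibrated, unposed multi-view images and to enrich semantic embedding.

Result: It outperforms existing methods on several related tasks and reconstructs efficiently, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics.

Insight: The novelties are a 3D-annotation-free training framework, the use of large-scale video data to enrich semantic embedding, instance-guided contrastive learning that aligns 2D and 3D representations, and a geometry-semantic hierarchical sparsification strategy that lowers computational cost. Viewed objectively, these methods address data scarcity and computational efficiency and improve the semantic alignment of 3D reconstruction.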

Abstract: We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads but suffer from fixed input views and insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich semantic embedding. We also propose an instance-guided contrastive learning to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we further propose a geometry-semantic hierarchical sparsification strategy. Our FLEG efficiently reconstructs language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: https://fangzhou2000.github.io/projects/fleg.


[50] ClothHMR: 3D Mesh Recovery of Humans in Diverse Clothing from Single Image cs.CV | cs.AIPDF

Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Yuanyuan Liu

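TL;DR: This paper proposes ClothHMR, a method for accurately recovering 3D meshes of humans in diverse clothing, especially loose garments, from a single image. It has two core modules: a clothing tailoring module that uses semantic estimation and edge prediction to fit the clothing to the body silhouette, and a mesh recovery module based on a foundational human visual model that optimizes the 3D mesh parameters by aligning intermediate representations. Experiments show that it significantly outperforms existing state-of-the-art methods on several benchmark datasets and on in-the-wild images, and the authors also developed an online fashion shopping application.

Details

Motivation: Existing 3D human mesh recovery methods mainly target tight clothing and estimate body shape and pose poorly under diverse clothing, especially loose garments.

Result: Experimental results show that ClothHMR significantly outperforms existing state-of-the-art methods on several benchmark datasets and on in-the-wild images.

Insight: The core contribution rests on two insights: tailoring clothing to fit the body silhouette reduces the adverse effect of clothing on recovery, and visual information from a large foundational human visual model improves generalization. These are realized by the clothing tailoring module and the FHVM-based mesh recovery module.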

Abstract: With 3D data rapidly emerging as an important form of multimedia information, 3D human mesh recovery technology has also advanced accordingly. However, current methods mainly focus on handling humans wearing tight clothing and perform poorly when estimating body shapes and poses under diverse clothing, especially loose garments. To this end, we offer two key insights: (1) tailoring clothing to fit the human body can mitigate the adverse impact of clothing on 3D human mesh recovery, and (2) utilizing human visual information from large foundational models can enhance the generalization ability of the estimation. Based on these insights, we propose ClothHMR to accurately recover 3D meshes of humans in diverse clothing. ClothHMR primarily consists of two modules: clothing tailoring (CT) and FHVM-based mesh recovering (MR). The CT module employs body semantic estimation and body edge prediction to tailor the clothing, ensuring it fits the body silhouette. The MR module optimizes the initial parameters of the 3D human mesh by continuously aligning the intermediate representations of the 3D mesh with those inferred from the foundational human visual model (FHVM). ClothHMR can accurately recover 3D meshes of humans wearing diverse clothing, precisely estimating their body shapes and poses. Experimental results demonstrate that ClothHMR significantly outperforms existing state-of-the-art methods across benchmark datasets and in-the-wild images. Additionally, a web application for online fashion and shopping powered by ClothHMR is developed, illustrating that ClothHMR can effectively serve real-world usage scenarios. The code and model for ClothHMR are available at: https://github.com/starVisionTeam/ClothHMR.


[51] G3Splat: Geometrically Consistent Generalizable Gaussian Splatting cs.CVPDF

Mehdi Hosseinzadeh, Shin-Fang Chng, Yi Xu, Simon Lucey, Ian Reid

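TL;DR: G3Splat is a pose-free, generalizable 3D Gaussian splatting method. It introduces geometric priors to resolve the geometric ambiguity that arises when 3D Gaussian parameters are learned from a view-synthesis loss alone, yielding geometrically consistent 3D scene representations.

Details

Motivation: Existing methods rely mainly on a view-synthesis loss to regress per-pixel 3D Gaussian parameters (orientation, scale, opacity, and appearance) predicted from images. A view-synthesis loss alone is not enough to recover geometrically meaningful splats, which leaves the geometry ambiguous.

Result: Trained on RE10K, G3Splat achieves state-of-the-art performance in geometrically consistent reconstruction, relative pose estimation, and novel-view synthesis. In zero-shot generalization tests on ScanNet it also substantially outperforms prior work in geometry recovery and relative pose estimation.

Insight: The contribution is to state explicitly that view-synthesis supervision alone is insufficient, and to enforce geometric priors (such as geometric consistency constraints) so that more accurate 3D Gaussian parameters are learned. This improves geometric reconstruction and pose estimation, and suggests a new direction for self-supervised, generalizable 3D representation learning.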

Abstract: 3D Gaussians have recently emerged as an effective scene representation for real-time splatting and accurate novel-view synthesis, motivating several works to adapt multi-view structure prediction networks to regress per-pixel 3D Gaussians from images. However, most prior work extends these networks to predict additional Gaussian parameters – orientation, scale, opacity, and appearance – while relying almost exclusively on view-synthesis supervision. We show that a view-synthesis loss alone is insufficient to recover geometrically meaningful splats in this setting. We analyze and address the ambiguities of learning 3D Gaussian splats under self-supervision for pose-free generalizable splatting, and introduce G3Splat, which enforces geometric priors to obtain geometrically consistent 3D scene representations. Trained on RE10K, our approach achieves state-of-the-art performance in (i) geometrically consistent reconstruction, (ii) relative pose estimation, and (iii) novel-view synthesis. We further demonstrate strong zero-shot generalization on ScanNet, substantially outperforming prior work in both geometry recovery and relative pose estimation. Code and pretrained models are released on our project page (https://m80hz.github.io/g3splat/).


[52] RoomEditor++: A Parameter-Sharing Diffusion Architecture for High-Fidelity Furniture Synthesis cs.CVPDF

Qilong Wang, Xiaofan Ming, Zhenyi Lin, Jinwen Li, Dongwei Ren

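TL;DR: This paper proposes RoomEditor++, an architecture built on a parameter-sharing dual diffusion backbone for high-fidelity virtual furniture synthesis. It aims to integrate reference objects seamlessly into indoor scenes while preserving geometric coherence and visual realism. The authors also build the RoomBench++ benchmark dataset to support training and evaluation for this task.

Details

Motivation: Virtual furniture synthesis remains underexplored because reproducible benchmarks are scarce and existing image composition methods struggle to achieve high-fidelity synthesis while preserving background integrity.

Result: Extensive experiments on RoomBench++ show that RoomEditor++ outperforms state-of-the-art methods in quantitative metrics, qualitative assessments, and human preference studies, and generalizes well to unseen indoor scenes and general scenes without task-specific fine-tuning.

Insight: The contribution is a parameter-sharing dual diffusion backbone that unifies feature extraction and inpainting for the reference and background images. The aligned feature representations it enforces support precise geometric transformations, texture preservation, and seamless integration. The design is compatible with both U-Net and DiT architectures, which makes it broadly applicable.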

Abstract: Virtual furniture synthesis, which seamlessly integrates reference objects into indoor scenes while maintaining geometric coherence and visual realism, holds substantial promise for home design and e-commerce applications. However, this field remains underexplored due to the scarcity of reproducible benchmarks and the limitations of existing image composition methods in achieving high-fidelity furniture synthesis while preserving background integrity. To overcome these challenges, we first present RoomBench++, a comprehensive and publicly available benchmark dataset tailored for this task. It consists of 112,851 training pairs and 1,832 testing pairs drawn from both real-world indoor videos and realistic home design renderings, thereby supporting robust training and evaluation under practical conditions. Then, we propose RoomEditor++, a versatile diffusion-based architecture featuring a parameter-sharing dual diffusion backbone, which is compatible with both U-Net and DiT architectures. This design unifies the feature extraction and inpainting processes for reference and background images. Our in-depth analysis reveals that the parameter-sharing mechanism enforces aligned feature representations, facilitating precise geometric transformations, texture preservation, and seamless integration. Extensive experiments validate that RoomEditor++ is superior to state-of-the-art approaches in terms of quantitative metrics, qualitative assessments, and human preference studies, while highlighting its strong generalization to unseen indoor scenes and general scenes without task-specific fine-tuning. The dataset and source code are available at https://github.com/stonecutter-21/roomeditor.


[53] 3One2: One-step Regression Plus One-step Diffusion for One-hot Modulation in Dual-path Video Snapshot Compressive Imaging cs.CVPDF

Ge Wang, Xing Liu, Xin Yuan

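TL;DR: This paper proposes 3One2, a new algorithm designed specifically for one-hot modulation in dual-path video snapshot compressive imaging (SCI). It recasts reconstruction as a generative video inpainting problem and introduces a framework that combines one-step regression initialization with one-step diffusion refinement, so as to exploit the temporal decoupling that one-hot modulation offers. A dual optical path design mitigates the resulting spatial degradation.

Details

Motivation: Random binary modulation in existing video SCI causes temporal aliasing. One-hot modulation can in principle achieve perfect temporal decoupling, but no dedicated algorithm has yet exploited this potential.

Result: Experiments on synthetic datasets and real scenes demonstrate the method's effectiveness. According to the authors, this is the first work to integrate diffusion into video SCI reconstruction.

Insight: The contribution is to recast reconstruction as generative video inpainting and to propose a two-step framework that combines regression and diffusion (one-step regression initialization followed by one-step diffusion refinement). On the hardware side, a complementary second optical path mitigates spatial degradation. The authors describe this as the first application of diffusion models to video SCI.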

Abstract: Video snapshot compressive imaging (SCI) captures dynamic scene sequences through a two-dimensional (2D) snapshot, fundamentally relying on optical modulation for hardware compression and the corresponding software reconstruction. While mainstream video SCI using random binary modulation has demonstrated success, it inevitably results in temporal aliasing during compression. One-hot modulation, activating only one sub-frame per pixel, provides a promising solution for achieving perfect temporal decoupling, thereby alleviating issues associated with aliasing. However, no algorithms currently exist to fully exploit this potential. To bridge this gap, we propose an algorithm specifically designed for one-hot masks. First, leveraging the decoupling properties of one-hot modulation, we transform the reconstruction task into a generative video inpainting problem and introduce a stochastic differential equation (SDE) of the forward process that aligns with the hardware compression process. Next, we identify limitations of the pure diffusion method for video SCI and propose a novel framework that combines one-step regression initialization with one-step diffusion refinement. Furthermore, to mitigate the spatial degradation caused by one-hot modulation, we implement a dual optical path at the hardware level, utilizing complementary information from another path to enhance the inpainted video. To our knowledge, this is the first work integrating diffusion into video SCI reconstruction. Experiments conducted on synthetic datasets and real scenes demonstrate the effectiveness of our method.
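
The temporal decoupling described above can be illustrated with a small sketch. It is not the authors' code: the function names, the random mask assignment, and the nested-list image layout are all assumptions made for illustration. It shows only that, under one-hot modulation, each snapshot pixel equals exactly one sub-frame pixel, so scattering the measurement back into the sub-frames leaves a sparse video whose missing entries must be inpainted.

```python
import random

def one_hot_masks(num_frames, height, width, seed=0):
    """Assign each pixel exactly one active sub-frame index (hypothetical random pattern)."""
    rng = random.Random(seed)
    return [[rng.randrange(num_frames) for _ in range(width)] for _ in range(height)]

def snapshot(frames, active):
    """Equivalent to sum_t mask_t * frame_t: with one-hot masks one term survives per pixel."""
    height, width = len(active), len(active[0])
    return [[frames[active[y][x]][y][x] for x in range(width)] for y in range(height)]

def scatter_to_frames(meas, active, num_frames):
    """Place each measured pixel into its sub-frame; other entries stay None (to be inpainted)."""
    height, width = len(active), len(active[0])
    out = [[[None] * width for _ in range(height)] for _ in range(num_frames)]
    for y in range(height):
        for x in range(width):
            out[active[y][x]][y][x] = meas[y][x]
    return out
```

Each sub-frame keeps on average only 1/T of its pixels for T sub-frames, which is the spatial degradation that the second optical path is meant to compensate.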


[54] HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection cs.CVPDF

Zhaolin Cai, Fan Li, Ziwei Zheng, Haixia Bi, Lijun He

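TL;DR: This paper proposes HeadHunt-VAD, a new tuning-free paradigm for video anomaly detection. It bypasses text generation by directly identifying and using anomaly-sensitive attention heads inside a frozen multimodal large language model, which addresses the information loss, normalcy bias, and prompt sensitivity of existing MLLM-based methods. Its core is a Robust Head Identification module that selects a sparse set of expert heads, paired with a lightweight anomaly scorer and a temporal locator for efficient and accurate detection.

Details

Motivation: Existing tuning-free video anomaly detection methods based on multimodal large language models rely on textual outputs, which causes information loss, normalcy bias, and prompt sensitivity. The aim is to capture subtle anomalous cues in video more directly and robustly.

Result: On two major video anomaly detection benchmarks, HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods while remaining highly efficient.

Insight: The contribution is to bypass the MLLM's text generation and directly probe its internal anomaly-sensitive attention heads (expert heads), identifying them robustly through a multi-criteria analysis of saliency and stability. This offers a new, interpretable head-level probing approach for applying frozen large models to downstream tasks, and avoids the overhead of prompt engineering and fine-tuning.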

Abstract: Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.
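
A minimal sketch of the head-selection idea follows. The abstract does not give the actual criteria, so the scoring rule here (mean per-prompt score as saliency, standard deviation across prompts as an instability penalty) and the names `select_expert_heads`, `top_k`, and `stability_weight` are assumptions for illustration only.

```python
from statistics import mean, pstdev

def select_expert_heads(scores, top_k=2, stability_weight=1.0):
    """Pick a sparse subset of heads that are discriminative and consistent across prompts.

    scores maps a head id to a list of per-prompt separation scores
    (higher means the head separates normal from anomalous clips better).
    Heads are ranked by mean score minus a weighted standard deviation.
    """
    ranked = sorted(
        scores,
        key=lambda h: mean(scores[h]) - stability_weight * pstdev(scores[h]),
        reverse=True,
    )
    return ranked[:top_k]
```

For example, a head scoring 1.0, 0.2, and 0.6 across three prompts ranks below one that scores a steady 0.7, because the penalty term discounts prompt-dependent behavior.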


[55] Self-Supervised Weighted Image Guided Quantitative MRI Super-Resolution cs.CVPDF

Alireza Samadifardheris, Dirk H. J. Poot, Florian Wiesinger, Stefan Klein, Juan A. Hernandez-Tamames

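TL;DR: This paper proposes a physics-informed, self-supervised framework for quantitative MRI super-resolution guided by weighted images. It uses routinely acquired high-resolution weighted MRI scans as guidance and needs no high-resolution quantitative MRI ground truth for training. It learns the super-resolution mapping through Bayesian maximum a posteriori inference, minimizing the discrepancy between synthesized images and the guide images and the discrepancy between the low-resolution quantitative MRI and the downsampled predictions.

Details

Motivation: High-resolution quantitative MRI takes long to acquire and is therefore underused clinically. By using routine clinical weighted MRI images as guidance, a super-resolution model can be trained without high-resolution quantitative MRI ground truth, reducing the dependence on long scans.

Result: Networks trained on synthetic data produce super-resolved parameter maps from a 1-minute acquisition with quality comparable to a 5-minute reference scan. Cross-sequence generalization is validated on independently acquired in-vivo data from a different quantitative MRI sequence.

Insight: The contribution is a physics-informed self-supervised learning framework that uses weighted MRI as guidance, enabling super-resolution training without high-resolution quantitative MRI ground truth. Bayesian inference combined with forward signal models makes effective use of routine clinical images to raise quantitative MRI resolution, offering a practical path to integrating quantitative relaxometry into clinical workflows.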

Abstract: High-resolution (HR) quantitative MRI (qMRI) relaxometry provides objective tissue characterization but remains clinically underutilized due to lengthy acquisition times. We propose a physics-informed, self-supervised framework for qMRI super-resolution that uses routinely acquired HR weighted MRI (wMRI) scans as guidance, thus removing the need for HR qMRI ground truth during training. We formulate super-resolution as Bayesian maximum a posteriori inference, minimizing two discrepancies: (1) between HR images synthesized from super-resolved qMRI maps and acquired wMRI guides via forward signal models, and (2) between acquired LR qMRI and downsampled predictions. This physics-informed objective allows the models to learn from clinical wMRI without HR qMRI supervision. To validate the concept, we generate training data by synthesizing wMRI guides from HR qMRI using signal equations, then degrading qMRI resolution via k-space truncation. A deep neural network learns the super-resolution mapping. Ablation experiments demonstrate that T1-weighted images primarily enhance T1 maps, T2-weighted images improve T2 maps, and combined guidance optimally enhances all parameters simultaneously. Validation on independently acquired in-vivo data from a different qMRI sequence confirms cross-qMRI sequence generalizability. Models trained on synthetic data can produce super-resolved maps from a 1-minute acquisition with quality comparable to a 5-minute reference scan, leveraging the scanner-independent nature of relaxometry parameters. By decoupling training from the HR qMRI requirement, our framework enables fast qMRI acquisitions enhanced via routine clinical images, offering a practical pathway for integrating quantitative relaxometry into clinical workflows with acceptable additional scan time.
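
The two-term objective described in the abstract could be written roughly as follows. This is a sketch under stated assumptions rather than the paper's formulation: the symbols, the squared L2 norms (which correspond to Gaussian likelihood terms in a MAP estimate), and the weight \lambda are all introduced here for illustration.

```latex
% Assumed notation (not from the paper):
%   q_{LR}   acquired low-resolution qMRI maps
%   w        acquired high-resolution weighted MRI guide(s)
%   f_\theta super-resolution network with parameters \theta
%   S        forward signal model mapping qMRI maps to weighted images
%   D        downsampling operator (e.g. k-space truncation)
\hat{q} = f_\theta(q_{LR}, w), \qquad
\mathcal{L}(\theta) =
  \underbrace{\lVert S(\hat{q}) - w \rVert_2^2}_{\text{(1) synthesized vs. acquired wMRI}}
  + \lambda\,
  \underbrace{\lVert D(\hat{q}) - q_{LR} \rVert_2^2}_{\text{(2) downsampled prediction vs. LR qMRI}}
```

Neither term involves an HR qMRI target, which is what allows training without HR qMRI supervision.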


[56] PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology cs.CVPDF

Fengchun Liu, Songhan Jiang, Linghan Cai, Ziyue Wang, Yongbing Zhang

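TL;DR: This paper proposes PathFLIP, a framework for multimodal understanding of whole slide images (WSIs) in computational pathology. It decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings, achieving fine-grained vision-language alignment. PathFLIP uses large language models (LLMs) to follow diverse clinical instructions and performs strongly across slide-level classification and retrieval, fine-grained lesion localization, and instruction following.

Details

Motivation: Existing vision-language models (VLMs) in computational pathology struggle to capture fine-grained text-visual correspondences in gigapixel whole slide images (WSIs), which limits performance on downstream tasks.

Result: On four representative benchmarks, PathFLIP outperforms existing large-scale pathology VLMs while requiring significantly less training data.

Insight: The contribution is to decompose slide-level captions into region-level subcaptions for fine-grained alignment, and to use LLMs for flexible, instruction-aware adaptation. This opens a path toward fine-grained, instruction-aware WSI interpretation in clinical practice.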

Abstract: While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in clinical practice.


[57] Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs cs.CVPDF

Zhaolin Cai, Huiyu Duan, Zitong Xu, Fan Li, Zhi Liu

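TL;DR: This paper proposes GRASP-HO, a framework that reformulates human-object interaction detection from a closed-set classification task into an open-vocabulary generation problem. It extracts hybrid interaction representations and designs a lightweight, learnable cognitive steering module that injects fine-grained visual evidence into a frozen multimodal large language model for reasoning. A hybrid guidance strategy combines a language modeling loss with an auxiliary classification loss to resolve the supervision mismatch.

Details

Motivation: Existing HOI detection methods assume a closed world and classify over a small, predefined verb set, so they generalize poorly to unseen or ambiguous long-tail interactions in the real world. Multimodal large language models have the world knowledge needed for open-vocabulary understanding, but because fine-tuning them is computationally expensive, they remain decoupled from existing HOI detectors.

Result: Experiments show state-of-the-art closed-set performance and strong zero-shot generalization, unifying discriminative perception and generative reasoning for open-world HOI detection.

Insight: The contribution is to recast HOI detection as a generation task, bridge vision and cognition with a lightweight cognitive steering module, and balance discriminative and generative ability with a hybrid loss strategy. This provides a generative reasoning framework that other open-world understanding work can draw on.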

Abstract: Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, which struggles to generalize to the long tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors since fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task to an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction representations, then design a lightweight learnable cognitive steering conduit (CSC) module to inject the fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that couples the language modeling loss with an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, achieving a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.
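
The hybrid guidance strategy, which couples a language modeling loss with an auxiliary classification loss, might look like the sketch below. The abstract does not specify how the two terms are weighted or computed, so the weighted sum, the `aux_weight` parameter, and the function names are assumptions; plain Python lists stand in for tensors.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def hybrid_loss(token_logits, token_targets, verb_logits, verb_target, aux_weight=0.5):
    """Mean language-modeling loss over the generated tokens
    plus a weighted auxiliary classification loss over the verb set."""
    lm = sum(cross_entropy(l, t) for l, t in zip(token_logits, token_targets)) / len(token_targets)
    cls = cross_entropy(verb_logits, verb_target)
    return lm + aux_weight * cls
```

Setting `aux_weight` to zero recovers pure generative training, while larger values pull the model toward the closed-set labels of existing HOI datasets.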


[58] Region-Constraint In-Context Generation for Instructional Video Editing cs.CV | cs.MMPDF

Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu

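TL;DR: This paper proposes ReCo, a region-constrained in-context generation paradigm for instruction-driven video editing. It concatenates the source and target videos width-wise for joint denoising and introduces latent regularization and attention regularization to calibrate video diffusion learning, addressing inaccurate editing regions and token interference between editing and non-editing regions during denoising. It also builds ReCo-Data, a large-scale, high-quality video editing dataset.

Details

Motivation: The in-context generation paradigm performs well in instructional image editing, but when applied to video editing without specified editing regions it tends to produce inaccurate editing regions and token interference between editing and non-editing regions during denoising.

Result: Extensive experiments on four major instruction-based video editing tasks show that the method is superior, though the abstract gives no specific quantitative results or benchmark comparisons.

Insight: The contribution is to constrain the modeling between editing and non-editing regions through latent and attention regularization. Latent regularization increases the discrepancy in the editing region and reduces it in non-editing regions; attention regularization suppresses the attention that editing-region tokens pay to their counterpart tokens in the source video, reducing interference. A large-scale dataset is also built to support training.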

Abstract: The in-context generation paradigm has recently demonstrated strong capability in instructional image editing, in both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specified editing regions, the results can suffer from inaccurate editing regions and from token interference between editing and non-editing areas during denoising. To address these, we present ReCo, a new instructional video editing paradigm that explores constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, applied to one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing the modification of the editing area and alleviating unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to their counterpart tokens in the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
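
The latent regularization term can be sketched as follows. The abstract gives only its direction (larger source-target discrepancy inside the editing region, smaller outside), so the absolute-difference form, the equal weighting of the two parts, and the name `latent_regularizer` are assumptions; 2D lists stand in for latent tensors.

```python
def latent_regularizer(src, tgt, edit_mask):
    """Mean |src - tgt| outside the edit mask minus mean |src - tgt| inside it.

    Minimizing this value pulls non-edited latents together
    and pushes edited latents apart.
    """
    inside, outside = [], []
    for s_row, t_row, m_row in zip(src, tgt, edit_mask):
        for s, t, m in zip(s_row, t_row, m_row):
            (inside if m else outside).append(abs(s - t))

    def avg(values):
        return sum(values) / len(values) if values else 0.0

    return avg(outside) - avg(inside)
```

A change confined to the masked region gives a negative (favorable) value, while the same change leaking outside the mask gives a positive one.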


[59] Bitbox: Behavioral Imaging Toolbox for Computational Analysis of Behavior from Videos cs.CV | q-bio.NCPDF

Evangelos Sariyanidi, Gokul Nair, Lisa Yankowitz, Casey J. Zampella, Mohan Kashyap Pargi

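TL;DR: Bitbox is an open-source toolbox that lowers the barrier for behavioral scientists and clinical researchers to use AI for video-based behavior analysis. It provides a standardized interface for extracting high-level behavioral measurements such as facial expression, head movement, and body action, and aims to promote wider use of computational behavioral measurement in psychology, psychiatry, and mental health research.

Details

Motivation: Existing AI methods for video behavior analysis are aimed mainly at engineers, have complex software stacks, and are hard to use directly in hypothesis-driven research, which hinders adoption in behavioral science and clinical fields.

Result: The core modules have been tested and validated on clinical samples and provide robust, high-level behavioral metrics that can be used without engineering expertise.

Insight: With a design that emphasizes reproducibility, modularity, and interpretability, Bitbox bridges the translational gap between computer science and behavioral science. It supports community-driven development so that method developers and domain scientists can both contribute.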

Abstract: Computational measurement of human behavior from video has recently become feasible due to major advances in AI. These advances now enable granular and precise quantification of facial expression, head movement, body action, and other behavioral modalities and are increasingly used in psychology, psychiatry, neuroscience, and mental health research. However, mainstream adoption remains slow. Most existing methods and software are developed for engineering audiences, require specialized software stacks, and fail to provide behavioral measurements at a level directly useful for hypothesis-driven research. As a result, there is a large barrier to entry for researchers who wish to use modern, AI-based tools in their work. We introduce Bitbox, an open-source toolkit designed to remove this barrier and make advanced computational analysis directly usable by behavioral scientists and clinical researchers. Bitbox is guided by principles of reproducibility, modularity, and interpretability. It provides a standardized interface for extracting high-level behavioral measurements from video, leveraging multiple face, head, and body processors. The core modules have been tested and validated on clinical samples and are designed so that new measures can be added with minimal effort. Bitbox is intended to serve both sides of the translational gap. It gives behavioral researchers access to robust, high-level behavioral metrics without requiring engineering expertise, and it provides computer scientists a practical mechanism for disseminating methods to domains where their impact is most needed. We expect that Bitbox will accelerate integration of computational behavioral measurement into behavioral, clinical, and mental health research. Bitbox has been designed from the beginning as a community-driven effort that will evolve through contributions from both method developers and domain scientists.


[60] Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation cs.CV | cs.AI | cs.HCPDF

Alexandre Personnic, Mihai Bâce

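TL;DR: This paper proposes ST-Gaze, a spatio-temporal feature representation network for video-based gaze estimation. The model combines a CNN backbone with channel attention and self-attention modules to fuse eye and face features, and uses spatio-temporal recurrence to model intra-frame spatial context and inter-frame dynamics. It achieves state-of-the-art performance on the EVE dataset.

Details

Motivation: Existing video-based gaze estimation methods are limited by their feature representations when capturing spatial (within a frame) and temporal (across frames) relationships at the same time. The aim is to make gaze estimation with ordinary cameras more robust.

Result: On the EVE dataset, ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. The ablation study shows that preserving and modeling intra-frame spatial context with the proposed spatio-temporal recurrence clearly outperforms premature spatial pooling.

Insight: The contribution is to refine feature fusion with channel attention and self-attention modules, treat the fused features as a spatial sequence, and use spatio-temporal recurrence to model intra-frame spatial context and inter-frame temporal dynamics together. Viewed objectively, combining spatial sequence modeling with temporal propagation offers a new way to exploit spatio-temporal information in video.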

Abstract: Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited by the feature representations both within a frame and across multiple frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing for the capture of an intra-frame context, which is then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.


[61] SAVeD: A First-Person Social Media Video Dataset for ADAS-equipped vehicle Near-Miss and Crash Event Analyses cs.CVPDF

Shaoyan Zhai, Mohamed Abdel-Aty, Chenzhu Wang, Rodrigo Vena Garcia

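TL;DR: This paper introduces SAVeD, a large-scale first-person video dataset collected from social media and built to analyze real-world crashes, near-miss events, and disengagements involving vehicles equipped with Advanced Driver Assistance Systems (ADAS). It contains 2,119 videos with frame-level annotations that support analysis of perception and decision-making failures. The paper also demonstrates the dataset's value: a real-time Time-to-Collision framework that combines semantic segmentation with monocular depth estimation, the use of the Generalized Extreme Value distribution to quantify extreme risk, and performance benchmarks for state-of-the-art video large language models.

Details

Motivation: Existing driving behavior datasets are mostly limited to simulated environments or human-driven vehicle data and lack authentic ADAS vehicle behavior in real high-risk edge cases such as near-miss events and system failures, which holds back safety-critical research.

Result: On SAVeD, domain adaptation significantly improves the performance of state-of-the-art video large language models (VideoLLaMA2 and InternVL2.5 HiCo R16), establishing benchmarks for analyzing complex near-miss scenarios.

Insight: The contribution is a real-world first-person video dataset, built from social media, that focuses on high-risk events involving ADAS vehicles, together with a real-time Time-to-Collision framework combining semantic segmentation and depth estimation and a method that uses the Generalized Extreme Value distribution to quantify extreme risk. These give ADAS safety research new data and analysis tools.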

Abstract: The advancement of safety-critical research on driving behavior in vehicles equipped with an Advanced Driver Assistance System (ADAS) requires real-world datasets that not only include diverse traffic scenarios but also capture high-risk edge cases such as near-miss events and system failures. However, existing datasets are largely limited to either simulated environments or human-driven vehicle data, lacking authentic ADAS vehicle behavior under risk conditions. To address this gap, this paper introduces SAVeD, a large-scale video dataset curated from publicly available social media content, explicitly focused on ADAS vehicle-related crashes, near-miss incidents, and disengagements. SAVeD features 2,119 first-person videos, capturing ADAS vehicle operations in diverse locations, lighting conditions, and weather scenarios. The dataset includes video frame-level annotations for collisions, evasive maneuvers, and disengagements, enabling analysis of both perception and decision-making failures. We demonstrate SAVeD’s utility through multiple analyses and contributions: (1) We propose a novel framework integrating semantic segmentation and monocular depth estimation to compute real-time Time-to-Collision (TTC) for dynamic objects. (2) We utilize the Generalized Extreme Value (GEV) distribution to model and quantify the extreme risk in crash and near-miss events across different roadway types. (3) We establish benchmarks for state-of-the-art VLLMs (VideoLLaMA2 and InternVL2.5 HiCo R16), showing that SAVeD’s detailed annotations significantly enhance model performance through domain adaptation in complex near-miss scenarios.
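
As an illustration of contribution (1), the sketch below computes Time-to-Collision from per-object depth in two consecutive frames. It uses the textbook definition (distance divided by closing speed) and is not the authors' pipeline: the function names, the median-depth-over-mask step, and the nested-list image layout are assumptions.

```python
from statistics import median

def object_depth(depth_map, mask):
    """Median depth (metres) over the pixels a segmentation mask assigns to one object."""
    values = [d for d_row, m_row in zip(depth_map, mask) for d, m in zip(d_row, m_row) if m]
    return median(values)

def time_to_collision(depth_prev_m, depth_curr_m, dt_s):
    """TTC = current distance / closing speed; None when the object is not approaching."""
    closing_speed = (depth_prev_m - depth_curr_m) / dt_s
    if closing_speed <= 0:
        return None
    return depth_curr_m / closing_speed
```

For instance, an object whose depth drops from 20 m to 19 m over 0.1 s closes at 10 m/s, giving a TTC of 1.9 s. Monocular depth is generally scale-ambiguous and noisy, so a real system would need calibration and temporal smoothing that this sketch omits.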


[62] AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection cs.CVPDF

Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen, Fakhri Karray

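TL;DR: This paper proposes AdaptPrompt, a parameter-efficient transfer learning framework that uses the CLIP vision-language model for generalizable deepfake detection. It first builds Diff-Gen, a large-scale benchmark dataset of 100k diffusion-generated images that capture broad spectral artifacts. It then jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen, which significantly improves detection performance.

Details

Motivation: Deepfake detectors generalize poorly: existing detectors perform badly on generative models outside their training data.

Result: Evaluated on 25 challenging test sets covering GANs, diffusion models, and commercial tools, the method sets a new state of the art in both standard and cross-domain scenarios. It also performs well in few-shot generalization with as few as 320 images.

Insight: The contributions are: (1) the Diff-Gen dataset, which captures the spectral artifacts of diffusion models; (2) the parameter-efficient AdaptPrompt framework, which jointly optimizes textual prompts and visual adapters; and (3) pruning the final transformer block of the vision encoder to retain high-frequency generative artifacts, which raises detection accuracy.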

Abstract: Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework’s versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.


[63] Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image cs.CV | cs.AIPDF

Simon Giebenhain, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Zhe Chen

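TL;DR: This paper proposes Pix2NPHM, a vision transformer (ViT) network that regresses the parameters of a Neural Parametric Head Model (NPHM) directly from a single image, enabling high-fidelity 3D face reconstruction. It uses domain-specific ViTs pretrained on geometric prediction tasks as backbones and trains on a mixture of 3D data and 2D video data. It supports reconstruction at interactive frame rates, and inference-time optimization can further improve geometric accuracy.

Details

Motivation: Neural Parametric Head Models (NPHMs) offer higher-fidelity geometric detail than mesh-based 3D morphable models (3DMMs), but the expressiveness of their latent space makes fitting them to visual inputs very challenging. This paper aims to reconstruct NPHM parameters directly and accurately from a single image.

Result: The authors report unprecedented reconstruction quality that can run at scale on in-the-wild data. With mixed supervision (direct supervision in SDF space from over 100K NPHM registrations, plus normal estimates on large-scale 2D video data serving as pseudo ground truth geometry), Pix2NPHM reconstructs more recognizable facial geometry and more accurate expressions.

Insight: The main contributions are: (1) a ViT network (Pix2NPHM) that regresses NPHM parameters directly from a single image; (2) domain-specific ViT backbones pretrained on geometric tasks to improve generalization; (3) a mixed 3D/2D training strategy that combines direct SDF supervision with normal pseudo-labels; and (4) fast interactive reconstruction, with optional inference-time optimization to further improve geometric fidelity. Viewed objectively, combining the expressiveness of NPHMs with an efficient transformer-based single-image regression framework is a key step forward for high-fidelity single-image 3D face reconstruction.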

Abstract: Neural Parametric Head Models (NPHMs) are a recent advancement over mesh-based 3D morphable models (3DMMs) to facilitate high-fidelity geometric detail. However, fitting NPHMs to visual inputs is notoriously challenging due to the expressive nature of their underlying latent space. To this end, we propose Pix2NPHM, a vision transformer (ViT) network that directly regresses NPHM parameters, given a single image as input. Compared to existing approaches, the neural parametric space allows our method to reconstruct more recognizable facial geometry and accurate facial expressions. For broad generalization, we exploit domain-specific ViTs as backbones, which are pretrained on geometric prediction tasks. We train Pix2NPHM on a mixture of 3D data, including a total of over 100K NPHM registrations that enable direct supervision in SDF space, and large-scale 2D video datasets, for which normal estimates serve as pseudo ground truth geometry. Pix2NPHM not only allows for 3D reconstructions at interactive frame rates, but also makes it possible to improve geometric fidelity through a subsequent inference-time optimization against estimated surface normals and canonical point maps. As a result, we achieve unprecedented face reconstruction quality that can run at scale on in-the-wild data.


[64] LiteGE: Lightweight Geodesic Embedding for Efficient Geodesics Computation and Non-Isometric Shape Correspondence cs.CV | cs.GRPDF

Yohanes Yudhi Adikusuma, Qixing Huang, Ying He

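TL;DR: This paper proposes LiteGE, a lightweight method for efficiently computing geodesic distances on 3D surfaces and establishing non-isometric shape correspondence. It builds compact, category-aware shape descriptors by applying principal component analysis (PCA) to unsigned distance field (UDF) samples at informative voxels, which avoids high-capacity networks and substantially reduces memory use and inference time.

Details

Motivation: Existing learning-based methods perform strongly but rely on large 3D backbones, leading to high memory use and latency that limit their use in interactive or resource-constrained settings. This paper proposes a lightweight alternative.

Result: Extensive experiments show that LiteGE reduces memory usage and inference time by up to 300× compared with existing neural approaches. On non-isometric shape pairs, including point-cloud inputs, it is up to 1000× faster than state-of-the-art mesh-based approaches while maintaining comparable accuracy. It remains robust on sparse point clouds with as few as 300 points, where prior methods fail.

Insight: The main contribution is to build an efficient, compact shape descriptor by applying PCA to UDF samples, removing the dependence on large neural networks. The method exploits the intrinsic relationship between geodesic distance and shape correspondence to achieve fast and accurate shape matching, and suits resource-constrained settings and sparse point-cloud inputs in particular.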

Abstract: Computing geodesic distances on 3D surfaces is fundamental to many tasks in 3D vision and geometry processing, with deep connections to tasks such as shape correspondence. Recent learning-based methods achieve strong performance but rely on large 3D backbones, leading to high memory usage and latency, which limit their use in interactive or resource-constrained settings. We introduce LiteGE, a lightweight approach that constructs compact, category-aware shape descriptors by applying PCA to unsigned distance field (UDF) samples at informative voxels. This descriptor is efficient to compute and removes the need for high-capacity networks. LiteGE remains robust on sparse point clouds, supporting inputs with as few as 300 points, where prior methods fail. Extensive experiments show that LiteGE reduces memory usage and inference time by up to 300× compared to existing neural approaches. In addition, by exploiting the intrinsic relationship between geodesic distance and shape correspondence, LiteGE enables fast and accurate shape matching. Our method achieves up to 1000× speedup over state-of-the-art mesh-based approaches while maintaining comparable accuracy on non-isometric shape pairs, including evaluations on point-cloud inputs.


[65] Animate Any Character in Any World cs.CV | cs.AI

Yitong Wang, Fangyun Wei, Hongyang Zhang, Bo Dai, Yan Lu

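TL;DR: This paper proposes AniX, a model that combines the realism of static world generation models with the interactivity of controllable-entity models, letting a user-specified character perform open-ended actions in any 3D Gaussian Splatting (3DGS) scene. Users direct the character through natural-language instructions, from basic locomotion to object interaction, and the model generates temporally coherent video clips that preserve visual fidelity.

Details

Motivation: Existing methods fall mainly into static world generation models (without active agents) and controllable-entity models (a single entity performing limited actions in an otherwise uncontrollable environment). No existing solution both builds realistic worlds and supports user-specified characters performing open-ended actions.

Result: The evaluation covers visual quality, character consistency, action controllability, and long-horizon coherence, and indicates that AniX significantly enhances motion dynamics while maintaining generalization.

Insight: The contribution is formulating the problem as conditional autoregressive video generation. Built on a pre-trained video generator, the training strategy enhances motion dynamics and generalizes across actions and characters, effectively combining static scenes with controllable characters.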

Abstract: Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, leveraging the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors from basic locomotion to object-centric interactions while freely exploring the environment. AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character, formulated as a conditional autoregressive video generation problem. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.


[66] Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding cs.CV

Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren

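TL;DR: This paper proposes Chorus, a multi-teacher pretraining framework for learning a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder. By distilling complementary signals from 2D foundation models, it lets 3DGS primitives encode rich, general-purpose features. It is evaluated on a range of 3D understanding tasks, including open-vocabulary semantic and instance segmentation, linear and decoder probing, and data-efficient supervision.

Details

Motivation: Although 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. This paper aims to fill that gap so that 3DGS can support a broader range of 3D understanding tasks.

Result: Chorus is evaluated on open-vocabulary semantic and instance segmentation, linear probing, decoder probing, and data-efficient supervision. In addition, on benchmarks that support only point clouds, a variant that takes only Gaussian centers, colors, and estimated normals as input shows strong transfer and outperforms the point-cloud baseline while using 39.9 times fewer training scenes.

Insight: The main contribution is a multi-teacher pretraining framework that uses a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. The proposed render-and-distill adaptation also facilitates out-of-domain fine-tuning and strengthens generalization.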

Abstract: While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant that uses only the Gaussians' centers, colors, and estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point-cloud baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.


[67] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models cs.CV | cs.AI

Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang

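TL;DR: InfSplign is a training-free, plug-and-play inference-time method that improves the spatial alignment of text-to-image diffusion models by adjusting the noise with a compound loss at every denoising step, so that generated images more accurately reflect the spatial relations specified in the text prompt.

Details

Motivation: Existing text-to-image diffusion models often fail to capture the spatial relations specified in text prompts, mainly because training data lacks fine-grained spatial supervision and text embeddings struggle to encode spatial semantics.

Result: Comprehensive evaluations on the VISOR and T2I-CompBench benchmarks show that InfSplign establishes a new state of the art, substantially outperforming the strongest existing inference-time baselines and even surpassing fine-tuning-based methods.

Insight: The contribution is a lightweight inference-time spatial alignment method that builds a compound loss from cross-attention maps extracted at different levels of the backbone decoder, enforcing accurate object placement and balanced object presence during sampling. The method is compatible with any diffusion backbone.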

Abstract: Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. The codebase is available on GitHub.


[68] Visually Prompted Benchmarks Are Surprisingly Fragile cs.CV | cs.LG

Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, XuDong Wang

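TL;DR: This paper finds that the visually prompted benchmarks currently used to evaluate the visual perception of vision-language models (VLMs), such as BLINK, are surprisingly fragile. Model performance is highly sensitive to minor details of the visual prompt (such as marker color and size) and to low-level inference choices (such as the JPEG compression level). These details can substantially change leaderboard rankings and can even lift weaker models above stronger ones. To mitigate this instability, the authors build VPBench, a larger benchmark with 16 visual marker variants.

Details

Motivation: A key challenge in evaluating VLMs is testing a model's ability to analyze visual content independently of its textual priors. Visually prompted benchmarks, which ask questions about coordinates explicitly marked in the image, are an important part of that evaluation. The paper examines how robust these benchmarks are and finds that model performance is unusually sensitive to seemingly irrelevant details of the visual prompt.

Result: Nine commonly used open- and closed-source VLMs are evaluated on two visually prompted tasks. Details of the benchmark setup, such as visual marker design and dataset size, significantly affect model performance and leaderboard rankings. For example, slightly enlarging the visual marker lets the open-source InternVL3-8B rank alongside or above much larger proprietary models such as Gemini 2.5 Pro. Low-level inference choices, such as the JPEG compression level in API calls, can also change model rankings. These effects are far larger than in conventional semantic VLM evaluations.

Insight: The paper shows that visually prompted benchmarks are severely fragile when used to evaluate VLM visual perception: benchmark design details (visual marker attributes, data processing) can distort the assessment and comparison of model capabilities. This underscores the importance of strictly controlling variables and reporting details when building and interpreting VLM benchmarks. VPBench and the accompanying analysis tools provide resources and methodology for more robust future evaluation.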

Abstract: A key challenge in evaluating VLMs is testing models’ ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly-used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking alongside or better than much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause model lineup changes. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. VPBench and additional analysis tools are released at https://lisadunlap.github.io/vpbench/.


[69] Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training cs.CV

Kristoffer Wickstrøm, Teresa Dorszewski, Siyan Chen, Michael Kampffmeyer, Elisabeth Wetzer

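TL;DR: This paper proposes Keypoint Counting Classifiers (KCCs), a method that converts any well-trained Vision Transformer (ViT)-based model into a self-explainable model (SEM) without retraining. It uses the ability of ViTs to automatically identify matching keypoints between images to create an easily interpretable decision process that can be visualized in the input, improving the transparency of human-machine communication.

Details

Motivation: Current approaches to designing self-explainable models require complicated training procedures and specific architectures, which makes them impractical, and the problem becomes more acute as general-purpose ViT-based foundation models advance. New methods are therefore needed to improve the transparency and reliability of ViT-based foundation models.

Result: In an extensive evaluation, KCCs improve human-machine communication compared with recent baselines, indicating that they are effective at improving model interpretability.

Insight: The contribution is converting existing ViT models into self-explainable models without retraining, using the keypoint-matching ability of ViTs to produce visual explanations. This offers a simple, practical route to transparency for ViT-based foundation models.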

Abstract: Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures, which makes them impractical. With the advance of general-purpose foundation models based on Vision Transformers (ViTs), this impracticability becomes even more problematic. Therefore, new methods are necessary to provide transparency and reliability to ViT-based foundation models. In this work, we present a new method for turning any well-trained ViT-based model into a SEM without retraining, which we call Keypoint Counting Classifiers (KCCs). Recent works have shown that ViTs can automatically identify matching keypoints between images with high precision, and we build on these results to create an easily interpretable decision process that is inherently visualizable in the input. We perform an extensive evaluation, which shows that KCCs improve human-machine communication compared to recent baselines. We believe that KCCs constitute an important step towards making ViT-based foundation models more transparent and reliable.
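
A rough sketch of what a classifier that counts matching keypoints could look like. The abstract does not spell out the matching rule, so this sketch assumes mutual nearest neighbors under cosine similarity with a hypothetical threshold `min_sim`; the 2-D "patch features" and the class labels are toy values, and none of this is taken from the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def count_matches(query, reference, min_sim=0.9):
    """Count query patches that form a mutual nearest-neighbor pair with a
    reference patch and exceed the (assumed) similarity threshold."""
    sims = [[cosine(q, r) for r in reference] for q in query]
    count = 0
    for i, row in enumerate(sims):
        j = max(range(len(reference)), key=lambda c: row[c])
        if row[j] < min_sim:
            continue
        back = max(range(len(query)), key=lambda r: sims[r][j])
        if back == i:
            count += 1
    return count


def classify(query, references):
    """Predict the class whose reference image shares the most keypoints."""
    counts = {label: count_matches(query, feats) for label, feats in references.items()}
    return max(counts, key=counts.get), counts


# Toy patch features for one reference image per class and one query image.
references = {
    "cat": [[1.0, 0.0], [0.8, 0.25]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
query = [[1.0, 0.05], [0.8, 0.2]]

label, counts = classify(query, references)
print(label, counts)  # cat {'cat': 2, 'dog': 0}
```

Because each counted match ties a query patch to a specific reference patch, the evidence for a prediction can be drawn directly on the two images, which is the kind of input-level visualization the abstract describes.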


[70] RadarGen: Automotive Radar Point Cloud Generation from Cameras cs.CV | cs.AI | cs.LG | cs.RO

Tomer Borreda, Fangqiang Ding, Sanja Fidler, Shengyu Huang, Or Litany

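TL;DR: RadarGen is a diffusion model that generates realistic automotive radar point clouds from multi-view camera images. It represents radar measurements in bird's-eye-view (BEV) form, encoding spatial structure, radar cross section, and velocity attributes, and guides generation with depth, semantic, and motion cues extracted from pretrained foundation models, narrowing the gap to perception models trained on real data.

Details

Motivation: The paper tackles the challenge of generating realistic radar point clouds from visual data, in order to enable unified generative simulation across sensing modalities and to take advantage of the scalability of existing visual datasets and simulation frameworks.

Result: Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and narrows the gap to perception models trained on real data.

Insight: The contributions include adapting efficient image-latent diffusion to the radar domain, representing radar attributes in BEV form, and integrating depth, semantic, and motion cues from pretrained foundation models to guide generation, which offers a scalable direction for multimodal generative simulation.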

Abstract: We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird’s-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.
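
To make the bird's-eye-view representation and the point-cloud recovery step concrete, here is a minimal sketch. It assumes a regular grid with three channels per cell (occupancy, RCS, Doppler) and recovers points at cell centers; the grid extents, the cell size, the "last point wins" rule, and the function names are illustrative assumptions rather than details from the paper.

```python
def rasterize_bev(points, x_range, y_range, cell):
    """Write radar points (x, y, rcs, doppler) into a 3-channel BEV grid."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    # Channels per cell: [occupancy, RCS, Doppler].
    grid = [[[0.0, 0.0, 0.0] for _ in range(ny)] for _ in range(nx)]
    for x, y, rcs, doppler in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # drop points outside the grid
        i = int((x - x_range[0]) / cell)
        j = int((y - y_range[0]) / cell)
        grid[i][j] = [1.0, rcs, doppler]  # last point in a cell wins
    return grid


def recover_points(grid, x_range, y_range, cell):
    """Turn occupied cells back into points located at cell centers."""
    points = []
    for i, column in enumerate(grid):
        for j, (occ, rcs, doppler) in enumerate(column):
            if occ > 0.5:
                cx = x_range[0] + (i + 0.5) * cell
                cy = y_range[0] + (j + 0.5) * cell
                points.append((cx, cy, rcs, doppler))
    return points


x_range, y_range, cell = (0.0, 50.0), (-25.0, 25.0), 1.0
radar_points = [
    (10.2, -3.7, 5.0, -1.5),
    (30.9, 12.1, 2.0, 0.4),
    (60.0, 0.0, 1.0, 0.0),  # outside the grid, so it is dropped
]
bev = rasterize_bev(radar_points, x_range, y_range, cell)
recovered = recover_points(bev, x_range, y_range, cell)
print(recovered)  # [(10.5, -3.5, 5.0, -1.5), (30.5, 12.5, 2.0, 0.4)]
```

The round trip loses sub-cell position, which is the usual trade-off of an image-like radar representation: a finer grid preserves more geometry at the cost of a larger map for the diffusion model to generate.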


[71] Diffusion Forcing for Multi-Agent Interaction Sequence Modeling cs.CV | cs.RO

Vongani H. Maluleke, Kie Horiuchi, Lea Wilken, Evonne Ng, Jitendra Malik

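TL;DR: This paper proposes MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation. Through flexible conditioning and sampling, the model supports a range of interaction tasks, including dyadic prediction, partner inpainting, and full multi-agent motion generation, and can autoregressively generate ultra-long sequences spanning hundreds of frames.

Details

Motivation: Understanding and generating multi-person interactions is a fundamental challenge for robotics and social computing. Existing methods are mostly task-specific and do not generalize to flexible multi-agent generation; the main difficulties are long temporal horizons, strong inter-agent dependencies, and variable group sizes.

Result: On dyadic benchmarks, MAGNet performs on par with specialized methods while naturally extending to polyadic scenarios involving three or more people. Its scalable architecture is agnostic to the number of agents.

Insight: Building on Diffusion Forcing, the key modifications explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. This lets the model capture a broad range of interactions, from tightly synchronized activities (such as dancing and boxing) to loosely structured social interactions.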

Abstract: Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of frames. Building on Diffusion Forcing, we introduce key modifications that explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people, enabled by a scalable architecture that is agnostic to the number of agents. We refer readers to the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: https://von31.github.io/MAGNet/


[72] Adversarial Robustness of Vision in Open Foundation Models cs.CV | cs.AI | cs.CR

Jonathon Fox, William J Buchanan, Pavlos Papadopoulos

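TL;DR: This paper studies the adversarial robustness of the open foundation models LLaVA-1.5-13B and Meta Llama 3.2 Vision-8B-2 in the visual modality. Testing with untargeted PGD attacks on a subset of the VQA v2 dataset, it finds that Llama 3.2 Vision suffers a smaller performance drop under attack despite a lower baseline accuracy, and that the visual modality is a viable avenue for attacking open-weight vision-language models.

Details

Motivation: As deep learning models grow more complex, adversarial attacks that modify images can confuse AI recognition, so the adversarial robustness of open foundation models needs to be evaluated in order to understand their vulnerabilities.

Result: On a subset of the VQA v2 dataset, Llama 3.2 Vision shows a smaller accuracy drop than LLaVA under PGD attack, particularly at higher perturbation levels, despite its lower baseline accuracy.

Insight: The paper confirms that the visual modality is a viable attack vector against open-weight vision-language models, and notes that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by architectural and training factors.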

Abstract: As deep learning models grow in complexity, it becomes increasingly difficult to understand how AI systems identify objects. An adversary could therefore modify an image by adding imperceptible elements that confuse the AI in its recognition of an entity. This paper investigates the adversarial robustness of LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2. Both are tested under untargeted PGD (Projected Gradient Descent) attacks against the visual input modality and empirically evaluated on a subset of the Visual Question Answering (VQA) v2 dataset. The results of these adversarial attacks are quantified using the standard VQA accuracy metric, and the accuracy degradation (accuracy drop) of LLaVA and Llama 3.2 Vision is then compared. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack compared to LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality represents a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta's Llama 3.2 Vision. Furthermore, they highlight that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by underlying architectural and training factors.
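
For readers unfamiliar with untargeted PGD, the following is a minimal sketch of the attack loop under an L-infinity budget. It uses a tiny logistic-regression classifier as a stand-in, since attacking a real VLM requires gradients through its vision encoder; the weights, the input, and the values of `eps`, `alpha`, and `steps` are made up for illustration and do not reflect the paper's setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Hard label of a logistic-regression classifier."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def input_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_untargeted(x, y, w, b, eps, alpha, steps):
    """Untargeted PGD: ascend the loss of the true label y, then project."""
    adv = list(x)
    for _ in range(steps):
        grad = input_gradient(adv, y, w, b)
        # Signed gradient ascent step on the loss.
        adv = [a + alpha * sign(g) for a, g in zip(adv, grad)]
        # Project back into the L-infinity ball of radius eps around x.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
        # Keep values in the valid pixel range [0, 1].
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv


w, b = [2.0, -1.0, 0.5], -0.5  # stand-in model parameters
x, y = [0.6, 0.2, 0.4], 1      # clean input and its true label
eps = 0.3
adv = pgd_untargeted(x, y, w, b, eps=eps, alpha=0.1, steps=10)
print(predict(x, w, b), predict(adv, w, b))  # 1 0
```

"Untargeted" means the attack only pushes the model away from the correct answer, without steering it toward a specific wrong one. The perturbation levels compared in the abstract correspond to different values of the budget `eps`.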


[73] Dexterous World Models cs.CV

Byungjun Kim, Taeksoo Kim, Junyoung Lee, Hanbyul Joo

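TL;DR: This paper proposes Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. By combining static scene renderings with egocentric hand motion sequences, it generates spatiotemporally consistent, physically plausible videos of human-scene interaction.

Details

Motivation: Digital twins created with current 3D reconstruction techniques are mostly static and lack embodied interactivity, being largely limited to navigation and view synthesis. This paper aims to bridge that gap and enable interactive digital twins based on video diffusion.

Result: Experiments show that DWM generates realistic, physically plausible human-scene interaction videos (such as grasping, opening, and moving objects) while maintaining camera and scene consistency.

Insight: The main contribution is a dual-conditioned video generation framework that combines static scene renderings (for spatial consistency) with egocentric hand mesh renderings (encoding geometry and motion cues). The model is trained on a hybrid interaction video dataset in which synthetic data provides aligned supervision and real data contributes diversity. This is a first step toward video diffusion-based interactive digital twins.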

Abstract: Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions.


[74] Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting cs.CV | cs.AI | cs.LG

Ananta R. Bhattarai, Helge Rhodin

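TL;DR: This paper proposes Re-Depth Anything, a test-time self-supervision framework for improving monocular depth estimation. It fuses the Depth Anything V2 (DA-V2) model with the priors of large-scale 2D diffusion models to refine the depth map of an input image at test time without labels. The core idea is to re-light the predicted depth map to augment the input, and to exploit shape from shading (SfS) cues in a generative context through Score Distillation Sampling (SDS), replacing classical photometric reconstruction.

Details

Motivation: Existing foundation models, such as Depth Anything V2, still struggle with monocular depth estimation on real-world images that are far from the training distribution. This paper aims to bridge that domain gap with test-time self-supervision.

Result: Across several diverse benchmarks, Re-Depth Anything achieves substantial gains over DA-V2 in depth accuracy and realism.

Insight: The contribution is a test-time self-supervised depth refinement framework that combines geometric reasoning with generative diffusion priors. Specifically: (1) re-lighting and SDS are used for augmentation and optimization in place of classical photometric reconstruction; (2) a targeted optimization strategy (freezing the encoder, updating only the intermediate embeddings, and fine-tuning the decoder) prevents optimization collapse. This opens a new avenue for self-supervision by augmenting geometric reasoning.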

Abstract: Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting predicted depth maps and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape from shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and only update intermediate embeddings while also fine-tuning the decoder. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over DA-V2, showcasing new avenues for self-supervision by augmenting geometric reasoning.


[75] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing cs.CV

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue

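TL;DR: The paper proposes a systematic framework that regularizes the latent space of representation encoders with a semantic-pixel reconstruction objective, making it both semantically rich and accurate in reconstruction, and therefore suitable for text-to-image generation and editing.

Details

Motivation: The paper addresses two problems with using representation-encoder feature spaces for generation, namely the lack of compact regularization and weak pixel-level reconstruction, in order to unify visual generation and understanding.

Result: The method achieves state-of-the-art reconstruction quality, faster convergence, and substantial performance gains on text-to-image generation and editing tasks.

Insight: The contribution is the semantic-pixel reconstruction objective, which regularizes the latent space to retain both semantic information and fine-grained detail, so that representation encoders can be used effectively for generative tasks.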

Abstract: Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder’s inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.


cs.AI

[76] Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows cs.AI | cs.CL | cs.LG

Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li

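TL;DR: This paper proposes an operational definition of Scientific General Intelligence (SGI) grounded in the Practical Inquiry Model (PIM), and instantiates it through four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. The authors build the SGI-Bench benchmark, with more than 1,000 cross-disciplinary samples, to systematically evaluate state-of-the-art large language models. The results show that current models have significant gaps on multiple tasks. The paper also introduces Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference time to improve hypothesis novelty.

Details

Motivation: Despite progress in scientific AI, a coherent framework for Scientific General Intelligence (SGI) is still lacking. This paper aims to lay a foundation for AI systems that can autonomously conceive, investigate, and reason across scientific domains.

Result: Evaluation of state-of-the-art LLMs on SGI-Bench shows: low exact match (10–20%) on deep research; generated ideas that lack feasibility and detail; high code executability but low execution-result accuracy in dry experiments; low sequence fidelity in wet-experiment protocols; and persistent challenges in multimodal comparative reasoning.

Insight: The main contributions are: (1) an operational definition of SGI grounded in the Practical Inquiry Model (PIM); (2) SGI-Bench, an expert-curated, cross-disciplinary benchmark centered on scientists' workflows; and (3) Test-Time Reinforcement Learning (TTRL), which optimizes hypothesis novelty without reference answers, laying a foundation for AI participation in scientific discovery.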

Abstract: Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10–20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.


[77] PAACE: A Plan-Aware Automated Agent Context Engineering Framework cs.AI | cs.CL | cs.LG | cs.MA

Kamer Ali Yuksel

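TL;DR: PAACE is a plan-aware automated context engineering framework for large language model (LLM) agents that optimizes the agent's evolving context state in complex multi-step workflows. Using synthetic data generation and model distillation, it performs plan-aware context compression, improving task correctness while substantially reducing context load and inference cost.

Details

Motivation: Most existing context summarization and compression methods ignore the multi-step, plan-aware nature of agentic reasoning, so context expands rapidly in complex workflows and degrades fidelity, attention focus, and inference efficiency. This paper aims to solve that problem.

Result: On long-horizon benchmarks (AppWorld, OfficeBench, and 8-Objective QA), PAACE improves agent correctness while substantially reducing peak context length, cumulative dependency, and reasoning steps. On AppWorld, for example, PAACE achieves higher accuracy while lowering peak context and cumulative dependency; on OfficeBench and multi-hop QA, it improves accuracy and F1 while reducing the number of steps, peak tokens, and attention dependency. The distilled PAACE-FT models retain 97% of the teacher's performance while reducing inference cost by more than an order of magnitude.

Insight: The core contribution is a unified plan-aware context engineering framework that optimizes context through next-k-task relevance modeling, plan-structure analysis, instruction co-refinement, and function-preserving compression. Its methodological contributions include: (1) PAACE-Syn, a large-scale dataset of synthetic agent workflows with stepwise compression supervision; and (2) PAACE-FT, a family of lightweight, plan-aware compressors distilled from successful teacher demonstrations. Together these provide a practical path to efficient, plan-aware context management in compact models.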

Abstract: Large Language Model (LLM) agents are increasingly deployed in complex, multi-step workflows involving planning, tool use, reflection, and interaction with external knowledge systems. These workflows generate rapidly expanding contexts that must be curated, transformed, and compressed to maintain fidelity, avoid attention dilution, and reduce inference cost. Prior work on summarization and query-aware compression largely ignores the multi-step, plan-aware nature of agentic reasoning. In this work, we introduce PAACE (Plan-Aware Automated Context Engineering), a unified framework for optimizing the evolving state of LLM agents through next-k-task relevance modeling, plan-structure analysis, instruction co-refinement, and function-preserving compression. PAACE comprises (1) PAACE-Syn, a large-scale generator of synthetic agent workflows annotated with stepwise compression supervision, and (2) PAACE-FT, a family of distilled, plan-aware compressors trained from successful teacher demonstrations. Experiments on long-horizon benchmarks (AppWorld, OfficeBench, and 8-Objective QA) demonstrate that PAACE consistently improves agent correctness while substantially reducing context load. On AppWorld, PAACE achieves higher accuracy than all baselines while lowering peak context and cumulative dependency. On OfficeBench and multi-hop QA, PAACE improves both accuracy and F1, achieving fewer steps, lower peak tokens, and reduced attention dependency. Distilled PAACE-FT retains 97 percent of the teacher’s performance while reducing inference cost by over an order of magnitude, enabling practical deployment of plan-aware compression with compact models.


[78] Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation cs.AI | cs.CL

Daksh Jain, Aarya Jain, Ashutosh Desai, Avyakt Verma, Ishan Bhanuka

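TL;DR: This paper examines the strategic decision-making of large language models (LLMs) in Pokémon battles. It builds a turn-based battle system to evaluate LLMs as battle agents, in terms of both tactical decisions and content generation. The study finds that LLMs can act as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategy games.

Details

Motivation: The goal is to evaluate LLMs in a complex strategic environment (Pokémon battles) that requires reasoning about type matchups, statistical trade-offs, and risk assessment, testing whether LLMs can both make tactical decisions and generate novel, balanced game content.

Result: A systematic evaluation across multiple model architectures measured win rate, decision latency, type-alignment accuracy, and token efficiency. The results indicate that LLMs can act as dynamic game opponents without domain-specific training.

Insight: The contribution is positioning LLMs as both players and designers, demonstrating their capabilities in tactical reasoning and content generation, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.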

Abstract: Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures, we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
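
The mechanics listed in the abstract (type effectiveness multipliers and stat-based damage) can be sketched as follows. The type chart is a small excerpt of well-known matchups, while the damage formula is a deliberately simplified stand-in rather than the games' actual formula or the paper's implementation; the move names and stats are hypothetical. A greedy `pick_move` is included only as a baseline of the kind an LLM agent's choices could be compared against.

```python
# Excerpt of the type chart: (move type, defender type) -> multiplier.
# Pairs that are not listed are treated as neutral (1.0) in this sketch.
TYPE_CHART = {
    ("fire", "grass"): 2.0,
    ("fire", "water"): 0.5,
    ("water", "fire"): 2.0,
    ("grass", "water"): 2.0,
    ("electric", "water"): 2.0,
    ("electric", "ground"): 0.0,
}


def effectiveness(move_type, defender_types):
    """Multiply the per-type multipliers for a single- or dual-type defender."""
    multiplier = 1.0
    for defender_type in defender_types:
        multiplier *= TYPE_CHART.get((move_type, defender_type), 1.0)
    return multiplier


def estimate_damage(power, attack, defense, move_type, defender_types):
    """Simplified stand-in: power scaled by the stat ratio and type multiplier."""
    return power * attack / defense * effectiveness(move_type, defender_types)


def pick_move(moves, attack, defense, defender_types):
    """Greedy baseline: choose the move with the highest estimated damage."""
    return max(
        moves,
        key=lambda m: estimate_damage(m["power"], attack, defense, m["type"], defender_types),
    )["name"]


moves = [
    {"name": "flame_move", "type": "fire", "power": 40},
    {"name": "neutral_move", "type": "normal", "power": 50},
]
print(effectiveness("electric", ["water", "ground"]))                  # 0.0
print(pick_move(moves, attack=50, defense=50, defender_types=["grass"]))  # flame_move
```

The weaker fire move wins against a grass defender because its doubled multiplier (80 estimated damage) outweighs the stronger neutral move (50), which is the kind of type-matchup reasoning the benchmark's type-alignment accuracy is meant to capture.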


[79] When Reasoning Meets Its Laws cs.AI | cs.CL

Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin

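TL;DR: This paper proposes the Laws of Reasoning (LoRe) framework, which formalizes the intrinsic reasoning patterns that large reasoning models (LRMs) should exhibit and comprises a compute law and an accuracy law. It introduces the LoRe-Bench benchmark to evaluate models on two measurable properties, monotonicity and compositionality, and finds that most models lack compositionality. The authors then develop a fine-tuning method that strengthens compute-law compositionality; experiments show that better adherence to the compute law consistently improves reasoning performance on multiple benchmarks.

Details

Motivation: Although large reasoning models perform strongly, their reasoning behavior is often counterintuitive, leading to suboptimal reasoning capability. This paper aims to formalize ideal reasoning behavior theoretically, providing a basis for evaluating and improving models.

Result: Evaluation on LoRe-Bench shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. After compute-law compositionality is strengthened with the proposed fine-tuning method, reasoning performance improves consistently on multiple benchmarks (such as GSM8K and MATH), and synergistic effects across properties and laws emerge.

Insight: The contribution is a unified Laws of Reasoning (LoRe) framework with measurable properties (monotonicity and compositionality), together with a corresponding benchmark (LoRe-Bench) and fine-tuning method. Viewed objectively, the work formalizes intuitive reasoning behavior into testable laws, providing systematic theoretical tools and an empirical path for understanding and improving the reasoning of large reasoning models.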

Abstract: Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose a compute law, hypothesizing that reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since question complexity is difficult to quantify in practice, we examine these hypotheses through two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/
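
One way to read the compute law and its two properties is sketched below. The notation is assumed here for illustration and may differ from the paper's: $C(x)$ is the expected reasoning compute spent on question $x$ (for example, the number of reasoning tokens), $\kappa(x)$ is the question's complexity, and $x_1 \oplus x_2$ is a question composed of two independent sub-questions.

```latex
% Assumed notation: C(x) = expected reasoning compute on question x,
% \kappa(x) = complexity of x, x_1 \oplus x_2 = two independent questions combined.
\begin{align*}
\text{Compute law (hypothesis):} \quad & C(x) \propto \kappa(x) \\
\text{Monotonicity:} \quad & \kappa(x_1) \le \kappa(x_2) \;\Rightarrow\; C(x_1) \le C(x_2) \\
\text{Compositionality:} \quad & C(x_1 \oplus x_2) \approx C(x_1) + C(x_2)
\end{align*}
```

Under this reading, compositionality follows from the linear law if complexity is additive over independent sub-questions, that is, $\kappa(x_1 \oplus x_2) = \kappa(x_1) + \kappa(x_2)$, which is itself an assumption of this sketch. Both properties can be checked without measuring $\kappa$ exactly: monotonicity needs only an ordering of questions by difficulty, and compositionality needs only compute measurements on the parts and on their combination.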


cs.LG

[80] Understanding Generalization in Role-Playing Models via Information Theory cs.LG | cs.AI | cs.CL

Yongqi Li, Hao Lang, Fei Huang, Tieyun Qian, Yongbin Li

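TL;DR: This paper proposes R-EMID, an information-theoretic metric for interpretably measuring the performance degradation of role-playing models (RPMs) under distribution shift, and derives an upper bound on their generalization performance. The authors also propose a co-evolving reinforcement learning framework to improve the estimation of dialogue response generation probability. Evaluation with R-EMID finds that user shift poses the highest risk and that reinforcement learning is the most effective way to improve RPM generalization.

Details

Motivation: Role-playing models are widely used in real-world applications, but their performance degrades when they are deployed in open environments, which is attributed to distribution shifts in user, character, and dialogue composition. Existing methods, such as LLM-as-a-judge, cannot provide a fine-grained diagnosis of how these shifts affect RPM generalization, and no formal framework characterizes RPM generalization behavior.

Result: Using R-EMID to evaluate the generalization of various RPMs, the authors find that user shift poses the highest risk among all shifts and that reinforcement learning is the most effective approach for enhancing RPM generalization.

Insight: The contributions include the information-theoretic metric R-EMID, which interpretably quantifies RPM performance degradation; a generalization upper bound that theoretically reveals how each type of shift contributes; and a co-evolving reinforcement learning framework that models the connection among user, character, and dialogue context to improve probability estimation. Viewed objectively, the work provides new formal tools and theoretical insight for analyzing RPM generalization.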

Abstract: Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and formal frameworks to characterize RPM generalization behaviors are lacking. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.


[81] AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens cs.LG | cs.CL | cs.CR

Tung-Ling Li, Yuhao Wu, Hongliang Liu

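TL;DR: This paper proposes AdvJudge-Zero and reveals a widespread vulnerability in LLM-as-a-Judge evaluation systems: adding a sequence of low-perplexity control tokens to the input can flip the model's binary judgment (for example, from a correct "No" to an incorrect "Yes"), which represents a realistic reward-hacking risk. The method uses the model's next-token distribution and beam search to discover diverse control-token sequences from scratch, and the analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the refusal direction. Experiments show that these tokens cause high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks, while LoRA-based adversarial training markedly reduces false positives and preserves evaluation quality.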

Details

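Motivation: The aim is to expose a recurring vulnerability in reward models and LLM-as-a-Judge systems, which are used in post-training pipelines such as RLHF, DPO, and RLAIF: simple control-token sequences can manipulate binary evaluation outcomes. This represents a realistic reward-hacking risk rather than a worst-case adversarial string.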

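Result: In experiments on math and reasoning benchmarks with large open-weight and specialized judge models, the control tokens cause very high false positive rates on incorrect answers. LoRA-based adversarial training on small sets of control-token-augmented examples markedly reduces these false positives while preserving evaluation quality.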

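Insight: The contributions are the AdvJudge-Zero method for discovering control-token sequences from scratch, the finding that the hidden-state perturbations concentrate in a low-rank "soft mode", and the demonstration that adversarial training is effective. Viewed objectively, the study systematically exposes the fragility of LLM-based evaluation systems and offers important insight for post-training safety.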

Abstract: Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model’s next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge’s refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
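
The search described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary, the per-token logit-gap shifts, the unigram log-probabilities, and the function names (`logit_gap`, `perplexity`, `beam_search`) are all hypothetical stand-ins for a real judge model. The sketch only shows the general idea of restricting candidates to likely (low-perplexity) tokens and running a beam search that pushes the "Yes" minus "No" logit gap upward.

```python
import math

# Hypothetical toy vocabulary of candidate control tokens (not from the paper).
VOCAB = ["\n", "Answer", ":", "Final", "correct", "###"]

# Hypothetical contribution of each token to the judge's "Yes" minus "No" logit gap.
GAP_SHIFT = {"\n": 0.1, "Answer": 0.6, ":": 0.4, "Final": 0.8, "correct": 1.0, "###": 0.2}

# Hypothetical next-token log-probabilities (a unigram stand-in for a language model).
LOGPROB = {"\n": -1.0, "Answer": -1.5, ":": -1.2, "Final": -2.0, "correct": -2.5, "###": -4.0}

BASE_GAP = -2.0  # the toy judge starts out preferring "No" on an incorrect answer


def logit_gap(tokens):
    """Toy 'Yes' minus 'No' logit gap after appending the control tokens."""
    return BASE_GAP + sum(GAP_SHIFT[t] for t in tokens)


def perplexity(tokens):
    """Perplexity of the token sequence under the toy unigram model."""
    avg_logprob = sum(LOGPROB[t] for t in tokens) / len(tokens)
    return math.exp(-avg_logprob)


def beam_search(max_len=3, beam_width=3, top_k=4):
    """Return the beam_width sequences of length max_len with the largest gap."""
    # Keep only the top_k most likely tokens so that sequences stay low-perplexity.
    candidates = sorted(VOCAB, key=lambda t: LOGPROB[t], reverse=True)[:top_k]
    beams = [()]
    for _ in range(max_len):
        expanded = [beam + (tok,) for beam in beams for tok in candidates]
        expanded.sort(key=logit_gap, reverse=True)
        beams = expanded[:beam_width]
    return beams


if __name__ == "__main__":
    best = beam_search()[0]
    print(best, logit_gap(best) > 0, perplexity(best))
```

In this toy setup a positive gap means the judge would flip from "No" to "Yes". A real search would query the judge model for both the next-token distribution and the logit gap instead of using fixed lookup tables.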


eess.IV

[82] Colormap-Enhanced Vision Transformers for MRI-Based Multiclass (4-Class) Alzheimer’s Disease Classification eess.IV | cs.CV | cs.LG

Faisal Ahmed

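TL;DR: This paper proposes PseudoColorViT-Alz, a pseudo-color-enhanced Vision Transformer framework for four-class Alzheimer's disease classification from MRI. The method converts MRI images into pseudo-color representations and combines them with the global feature learning of Vision Transformers, amplifying anatomical texture and contrast cues in the images and thereby improving classification performance.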

Details

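Motivation: Conventional deep learning models struggle to extract features that capture subtle structural variations in brain MRI scans. The work aims to use pseudo-color representations to enhance the discriminative information in MRI images and so improve Alzheimer's disease classification.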

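Result: On the four-class task on the OASIS-1 dataset, the model reaches 99.79% accuracy and 100% AUC, surpassing 2024-2025 methods based on CNNs and Siamese networks (accuracies between 96.1% and 99.68%) and achieving state-of-the-art performance.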

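Insight: The contribution is the combination of pseudo-color transformation with a Vision Transformer, which enhances anatomical texture and contrast features that are not apparent in grayscale MRI images and provides an interpretable and robust framework for medical image classification.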

Abstract: Magnetic Resonance Imaging (MRI) plays a pivotal role in the early diagnosis and monitoring of Alzheimer’s disease (AD). However, the subtle structural variations in brain MRI scans often pose challenges for conventional deep learning models to extract discriminative features effectively. In this work, we propose PseudoColorViT-Alz, a colormap-enhanced Vision Transformer framework designed to leverage pseudo-color representations of MRI images for improved Alzheimer’s disease classification. By combining colormap transformations with the global feature learning capabilities of Vision Transformers, our method amplifies anatomical texture and contrast cues that are otherwise subdued in standard grayscale MRI scans. We evaluate PseudoColorViT-Alz on the OASIS-1 dataset using a four-class classification setup (non-demented, moderate dementia, mild dementia, and very mild dementia). Our model achieves a state-of-the-art accuracy of 99.79% with an AUC of 100%, surpassing the performance of recent 2024–2025 methods, including CNN-based and Siamese-network approaches, which reported accuracies ranging from 96.1% to 99.68%. These results demonstrate that pseudo-color augmentation combined with Vision Transformers can significantly enhance MRI-based Alzheimer’s disease classification. PseudoColorViT-Alz offers a robust and interpretable framework that outperforms current methods, providing a promising tool to support clinical decision-making and early detection of Alzheimer’s disease.
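
The colormap step described in the abstract can be sketched as a mapping from each grayscale intensity to an RGB triple. The abstract does not say which colormap PseudoColorViT-Alz uses, so the piecewise-linear blue-to-green-to-red map below, along with the function names `pseudo_color` and `colorize`, is a hypothetical illustration rather than the paper's preprocessing.

```python
def pseudo_color(value):
    """Map a grayscale intensity in [0, 255] to an (R, G, B) tuple.

    Hypothetical colormap: blue at 0, green at the midpoint, red at 255.
    """
    t = max(0, min(255, value)) / 255.0  # clamp and normalize to [0, 1]
    if t < 0.5:
        u = t / 0.5                      # lower half: blue fades into green
        r, g, b = 0.0, u, 1.0 - u
    else:
        u = (t - 0.5) / 0.5              # upper half: green fades into red
        r, g, b = u, 1.0 - u, 0.0
    return tuple(round(255 * c) for c in (r, g, b))


def colorize(image):
    """Apply the colormap to a 2D list of grayscale intensities."""
    return [[pseudo_color(v) for v in row] for row in image]


if __name__ == "__main__":
    slice_2x2 = [[0, 64], [128, 255]]    # toy stand-in for an MRI slice
    for row in colorize(slice_2x2):
        print(row)
```

The resulting three-channel image has the shape that a standard Vision Transformer expects for RGB input. In practice a library colormap would likely be used instead of this hand-written one.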