Table of Contents
- cs.CL [Total: 27]
- cs.CV [Total: 131]
- cs.AI [Total: 5]
- cs.CR [Total: 1]
- cs.MM [Total: 2]
- cs.CY [Total: 3]
- cs.CE [Total: 1]
- cs.RO [Total: 4]
- q-bio.QM [Total: 1]
- eess.IV [Total: 2]
- physics.optics [Total: 1]
- astro-ph.IM [Total: 2]
- cs.LG [Total: 4]
cs.CL
[1] Enhancing Urban Visual Place Recognition for Crowdsourced Flood Imagery via LLM-Guided Attention cs.CL | cs.AI | cs.CV | cs.CY
Fengyi Xu, Jun Ma, Waishan Qiu, Cui Guo
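TL;DR: This paper proposes VPR-AttLLM, a model-agnostic framework that uses the semantic reasoning and geospatial knowledge of large language models to guide attention, improving visual place recognition on crowdsourced flood imagery without model retraining or additional data.
Motivation: Crowdsourced street-view imagery, such as flood photos posted on social media, suffers from visual distortions and domain shift that sharply degrade existing visual place recognition models. The goal is to support real-time geo-localization for emergency response.
Result: On extended benchmarks including SF-XL (enriched with real social-media flood images), synthetic flooding scenarios, and the newly built HK-URBAN dataset, integrating VPR-AttLLM with the SOTA VPR models CosPlace, EigenPlaces, and SALAD consistently improves recall, with relative gains typically of 1-3% and up to 8% on the most challenging real flood imagery.
Insight: The novelty lies in combining LLM semantic reasoning with a visual retrieval system. Attention guidance identifies location-informative regions and suppresses transient visual noise, yielding a plug-and-play, interpretable, and cross-source-robust paradigm for LLM-guided multimodal fusion that embeds human-like spatial reasoning into modern VPR architectures.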
Abstract: Crowdsourced street-view imagery from social media provides valuable real-time visual evidence of urban flooding and other crisis events, yet it often lacks reliable geographic metadata for emergency response. Existing Visual Place Recognition (VPR) models exhibit substantial performance degradation when applied to such imagery due to visual distortions and domain shifts inherent in cross-source scenarios. This paper presents VPR-AttLLM, a model-agnostic framework that integrates the semantic reasoning and geospatial knowledge of Large Language Models (LLMs) into established VPR pipelines through attention-guided descriptor enhancement. By leveraging LLMs to identify location-informative regions within the city context and suppress transient visual noise, VPR-AttLLM improves retrieval performance without requiring model retraining or additional data. Comprehensive evaluations are conducted on extended benchmarks including SF-XL enriched with real social-media flood images, synthetic flooding scenarios over established query sets and Mapillary photos, and a new HK-URBAN dataset capturing morphologically distinct cityscapes. Integrating VPR-AttLLM with three state-of-the-art VPR models (CosPlace, EigenPlaces, and SALAD) consistently improves recall performance, yielding relative gains typically between 1-3% and reaching up to 8% on the most challenging real flood imagery. Beyond measurable gains in retrieval accuracy, this study establishes a generalizable paradigm for LLM-guided multimodal fusion in visual retrieval systems. By embedding principles from urban perception theory into attention mechanisms, VPR-AttLLM bridges human-like spatial reasoning with modern VPR architectures. Its plug-and-play design, strong cross-source robustness, and interpretability highlight its potential for scalable urban monitoring and rapid geo-localization of crowdsourced crisis imagery.
[2] Reinforcement Learning for Latent-Space Thinking in LLMs cs.CL | cs.LG
Enes Özeren, Matthias Aßenmacher
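TL;DR: This paper studies latent-space thinking in large language models (LLMs) as an alternative to conventional chain-of-thought (CoT) reasoning in the discrete language space. The authors find that existing supervised fine-tuning approaches such as Coconut are sensitive to design choices and have inherent limitations, so they explore reinforcement learning (RL) techniques, including GRPO and a novel Latent RL method, to optimize the latent thinking steps directly. On complex tasks such as mathematical reasoning, however, the RL-trained models still lag behind conventional language-space CoT models.
Motivation: Conventional CoT reasoning operates in the discrete language space, which is inefficient because many generated tokens only satisfy linguistic rules and do not contribute to reasoning. Latent-space thinking lets the model think in the continuous embedding space to bypass this inefficiency, but existing training methods perform poorly on complex tasks, which motivates new optimization approaches.
Result: In mathematical reasoning, latent-space thinking models trained with RL (GRPO and Latent RL) still underperform conventional language-space CoT models.
Insight: The contribution is a systematic exploration of RL, an underexplored direction in latent-space thinking, for optimizing latent thinking steps, together with a new Latent RL method. The work exposes how hard it is to move reasoning into latent space while preserving performance on complex tasks, and it offers a baseline and direction for future research.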
Abstract: Chain-of-Thought (CoT) reasoning typically utilizes the discrete language space for thinking, which is inherently inefficient, as many generated tokens only enforce linguistic rules that are not required for reasoning. To bypass this, latent-space thinking allows models to think using the continuous embedding space. While existing methods for training those models show domain-specific gains, they fail to maintain performance in complex tasks, such as mathematical reasoning. We experimentally demonstrate that the Coconut approach, a form of supervised fine-tuning for latent-space thinking, is highly sensitive to design choices and exhibits several inherent limitations. To address these issues, we investigate reinforcement learning (RL) techniques – an underexplored direction in latent-space thinking – including GRPO and design a novel Latent RL method for directly optimizing the latent thinking steps. Our experimental results reveal that these RL-trained models still lag behind traditional language-space CoT models in the mathematical reasoning domain. We make our codebase publicly available.
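The abstract names GRPO but does not spell out its update. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO is commonly described as using: rewards for several sampled completions of the same prompt are standardized against that group's mean and standard deviation. The function name, the `eps` term, and the toy rewards are assumptions for illustration; this is not the paper's Latent RL method.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. `eps` (an assumed stabilizer) avoids division by zero
    when every completion in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group: two correct completions (reward 1) and two incorrect (reward 0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.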
[3] Hold Onto That Thought: Assessing KV Cache Compression On Reasoning cs.CL | cs.AI | cs.PF
Minghui Liu, Aadi Palnitkar, Tahseen Rabbani, Hyunwoo Jae, Kyle Rui Sang
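TL;DR: This paper benchmarks several KV cache compression algorithms on long-reasoning tasks. For reasoning models, H2O and a decoding-enabled variant of SnapKV are the dominant strategies, and the results reveal a tradeoff between cache size and inference cost.
Motivation: The KV cache grows linearly with context length and becomes a memory bottleneck for LLMs. Existing compression algorithms mostly target the prefill phase and are rarely evaluated on reasoning tasks that require long decoding.
Result: On Llama-3.1-8B-Instruct, no single strategy fits all datasets, and performance depends heavily on dataset type. For reasoning models, H2O and the SnapKV variant perform strongly on benchmarks such as GSM8K and MATH500, and eviction strategies at low budgets can produce longer reasoning traces.
Insight: The novelty is a systematic evaluation of compression strategies on long-reasoning tasks. It finds that heavy-hitter tracking is effective for reasoning traces and reveals a tradeoff between cache size and inference cost, providing empirical grounding for optimizing LLM inference efficiency.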
Abstract: Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.
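To make "heavy-hitter tracking" concrete, here is a much simplified sketch of an H2O-style eviction rule: keep the most recent tokens plus the older tokens with the largest accumulated attention, up to a fixed budget. Real implementations work per layer and per head and update scores during decoding; the function name, the flat score list, and the toy numbers are assumptions for illustration.

```python
def h2o_keep_indices(accumulated_attention, budget, recent_window):
    """Return the cache positions to keep under a heavy-hitter policy.

    accumulated_attention: one running attention-mass score per cached token.
    budget: maximum number of tokens to keep in the cache.
    recent_window: number of most recent tokens that are always kept.
    """
    n = len(accumulated_attention)
    if n <= budget:
        return list(range(n))  # nothing needs to be evicted yet

    recent_window = min(recent_window, budget)
    recent = list(range(n - recent_window, n))
    older = range(0, n - recent_window)

    # Fill the remaining budget with the highest-scoring older tokens.
    n_heavy = budget - recent_window
    heavy = sorted(older, key=lambda i: accumulated_attention[i], reverse=True)[:n_heavy]
    return sorted(heavy) + recent

# Toy scores for six cached tokens; keep four of them.
scores = [0.9, 0.1, 0.5, 0.05, 0.2, 0.3]
kept = h2o_keep_indices(scores, budget=4, recent_window=2)
print(kept)
```

Here positions 4 and 5 survive because they are recent, and positions 0 and 2 survive because they attracted the most attention so far.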
[4] Benchmarking Contextual Understanding for In-Car Conversational Systems cs.CL
Philipp Habicht, Lev Sorokin, Abdullah Saydemir, Ken E. Friedl, Andrea Stocco
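TL;DR: This paper proposes using large language models (LLMs) and prompting techniques (such as chain-of-thought, self-consistency, and multi-agent prompting) to evaluate the contextual understanding of in-car conversational question answering systems. Using synthetic dialogue data with correct and incorrect responses, the study tests 13 LLMs of different sizes and providers on a restaurant recommendation case. Reasoning models generally outperform non-reasoning models: DeepSeek-R1 achieves the best performance with single-agent self-consistency prompting (F1-score 0.99), while the non-reasoning model DeepSeek-V3 offers the best balance between effectiveness and cost-time efficiency.
Motivation: In-car conversational QA systems improve user experience, but assessing their accuracy and reliability remains difficult. The paper explores how LLMs can evaluate whether system responses are coherent with user utterances and whether recommendations respect user constraints.
Result: In the restaurant recommendation case study, reasoning models such as DeepSeek-R1 reach SOTA performance with single-agent self-consistency prompting (F1-score 0.99 at 0.002 USD per request). The non-reasoning model DeepSeek-V3 offers the best balance between effectiveness and cost-time efficiency. Advanced prompting techniques, multi-agent prompting in particular, help small non-reasoning models the most.
Insight: The novelty is combining LLMs with several prompting techniques (input-output, chain-of-thought, self-consistency, multi-agent) to provide a scalable, accurate, automated benchmark for contextual understanding in ConvQA systems in place of traditional human evaluation. The method evaluates coherence systematically using synthetic failure-containing responses and compares the cost-effectiveness tradeoffs of different model types and prompting strategies, which gives practical guidance for deployment.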
Abstract: In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances accompanied by correct and modified failure-containing system responses. We use input-output, chain-of-thought, self-consistency prompting, and multi-agent prompting techniques with 13 reasoning and non-reasoning LLMs of varying sizes and providers, including OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study involving restaurant recommendations. The most substantial improvements occur for small non-reasoning models when applying advanced prompting techniques, particularly multi-agent prompting. However, reasoning models consistently outperform non-reasoning models, with the best performance achieved using single-agent prompting with self-consistency. Notably, DeepSeek-R1 reaches an F1-score of 0.99 at a cost of 0.002 USD per request. Overall, the best balance between effectiveness and cost-time efficiency is reached with the non-reasoning model DeepSeek-V3. Our findings show that LLM-based evaluation offers a scalable and accurate alternative to traditional human evaluation for benchmarking contextual understanding in ConvQA systems.
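Self-consistency prompting, as generally understood, samples several independent judgments and keeps the majority answer. The sketch below shows only that aggregation step; the verdict labels and the idea of reporting an agreement ratio are assumptions for illustration, and the LLM calls that would produce the samples are left out.

```python
from collections import Counter

def self_consistent_verdict(sampled_verdicts):
    """Majority-vote over verdicts sampled from repeated LLM calls.

    Returns the winning label and the share of samples that agree with it.
    Ties are resolved in favor of the label that was seen first.
    """
    if not sampled_verdicts:
        raise ValueError("need at least one sampled verdict")
    counts = Counter(sampled_verdicts)
    verdict, votes = counts.most_common(1)[0]
    return verdict, votes / len(sampled_verdicts)

# Hypothetical labels for five sampled judgments of one utterance-response pair.
samples = ["coherent", "incoherent", "coherent", "coherent", "incoherent"]
verdict, agreement = self_consistent_verdict(samples)
print(verdict, agreement)
```

A low agreement ratio could be used to flag pairs for human review, although the abstract does not say whether the authors do this.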
[5] Semantic Distance Measurement based on Multi-Kernel Gaussian Processes cs.CL | cs.AI
Yinzhu Cheng, Haihua Xie, Yaqing Wang, Miao He, Mingming Sun
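TL;DR: This paper proposes a semantic distance measure based on multi-kernel Gaussian processes (MK-GP). The latent semantic function of texts is modeled as a Gaussian process whose covariance is a combined kernel with Matérn and polynomial components, and the kernel parameters are learned automatically from supervised data instead of being set by hand. The measure is instantiated and evaluated on fine-grained sentiment classification with LLMs under in-context learning (ICL), and the experiments support its effectiveness.
Motivation: Semantic distance measurement is a fundamental problem in computational linguistics, but most classical methods are essentially fixed and hard to adapt to specific data distributions and task requirements. The paper aims for a more adaptive measure that is learned from data.
Result: Experiments on fine-grained sentiment classification with LLMs under an ICL setup indicate that the proposed measure is effective.
Insight: The novelty is recasting semantic distance measurement as Gaussian process modeling and building the covariance function from a learnable multi-kernel combination (Matérn and polynomial), so that the distance adapts to the data instead of relying on a fixed, predefined metric. This suggests a route to task- and data-adaptive semantic representations and measures.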
Abstract: Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) was proposed. The latent semantic function associated with texts was modeled as a Gaussian process, with its covariance function given by a combined kernel combining Matérn and polynomial components. The kernel parameters were learned automatically from data under supervision, rather than being hand-crafted. This semantic distance was instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrated the effectiveness of the proposed measure.
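The abstract describes the covariance as a combination of Matérn and polynomial components without giving the exact form. The sketch below assumes a weighted sum, a Matérn kernel with smoothness 3/2, and fixed parameters, purely to show what such a combined kernel looks like; in the paper the parameters are learned from data, and the weights and defaults here are hypothetical.

```python
import math

def matern32(x, y, length_scale=1.0):
    """Matérn kernel with smoothness nu = 3/2 (an assumed choice)."""
    r = math.dist(x, y)
    s = math.sqrt(3.0) * r / length_scale
    return (1.0 + s) * math.exp(-s)

def polynomial(x, y, degree=2, offset=1.0):
    """Polynomial kernel (x . y + offset) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + offset) ** degree

def combined_kernel(x, y, w_matern=1.0, w_poly=1.0):
    """Weighted sum of the two kernels; the weights are hypothetical
    stand-ins for parameters that would be learned under supervision."""
    return w_matern * matern32(x, y) + w_poly * polynomial(x, y)

# Two toy text embeddings.
u, v = (1.0, 0.0), (0.0, 1.0)
print(combined_kernel(u, u), combined_kernel(u, v))
```

A non-negative weighted sum of valid kernels is itself a valid kernel, which is one reason a sum is a natural first guess for the combination.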
[6] Can GPT replace human raters? Validity and reliability of machine-generated norms for metaphors cs.CL
Veronica Mangiaterra, Hamad Al-Azary, Chiara Barattieri di San Pietro, Paolo Canal, Valentina Bambini
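TL;DR: This paper assesses the validity and reliability of GPT-generated ratings of metaphor familiarity, comprehensibility, and imageability. GPT ratings correlate positively with human ratings and predict behavioral and EEG responses, which suggests that GPT can replace or augment human raters, although it diverges from humans on the conventionality and multimodal aspects of metaphors.
Motivation: As LLMs are used more widely in scientific research, their trustworthiness becomes crucial. In psycholinguistics, LLMs have been used to augment human-rated datasets automatically, but their performance on complex items such as metaphors has not been explored. The paper assesses the validity and reliability of GPT-generated metaphor ratings.
Result: GPT-generated ratings correlate positively with human ratings. Familiarity shows moderate-to-strong correlations for both English and Italian metaphors (weaker for metaphors with high sensorimotor load); imageability shows moderate correlations in English and moderate-to-strong in Italian; comprehensibility for English metaphors shows the strongest correlations. Larger models outperform smaller ones, and machine ratings significantly predict response times and EEG amplitude with a strength comparable to human ratings. GPT ratings are highly stable across independent sessions.
Insight: This is the first systematic assessment of GPT models on the complex task of rating metaphors, validated both against human data and against behavioral and physiological responses. GPT, especially larger models, can validly replace human raters, but it deviates from humans on conventionality and multimodal aspects of metaphorical meaning, so the nature of the stimuli needs careful consideration.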
Abstract: As Large Language Models (LLMs) are increasingly being used in scientific research, the issue of their trustworthiness becomes crucial. In psycholinguistics, LLMs have been recently employed in automatically augmenting human-rated datasets, with promising results obtained by generating ratings for single words. Yet, performance for ratings of complex items, i.e., metaphors, is still unexplored. Here, we present the first assessment of the validity and reliability of ratings of metaphors on familiarity, comprehensibility, and imageability, generated by three GPT models for a total of 687 items gathered from the Italian Figurative Archive and three English studies. We performed a thorough validation in terms of both alignment with human data and ability to predict behavioral and electrophysiological responses. We found that machine-generated ratings positively correlated with human-generated ones. Familiarity ratings reached moderate-to-strong correlations for both English and Italian metaphors, although correlations weakened for metaphors with high sensorimotor load. Imageability showed moderate correlations in English and moderate-to-strong in Italian. Comprehensibility for English metaphors exhibited the strongest correlations. Overall, larger models outperformed smaller ones and greater human-model misalignment emerged with familiarity and imageability. Machine-generated ratings significantly predicted response times and the EEG amplitude, with a strength comparable to human ratings. Moreover, GPT ratings obtained across independent sessions were highly stable. We conclude that GPT, especially larger models, can validly and reliably replace - or augment - human subjects in rating metaphor properties. Yet, LLMs align worse with humans when dealing with conventionality and multimodal aspects of metaphorical meaning, calling for careful consideration of the nature of stimuli.
[7] HyperEdit: Unlocking Instruction-based Text Editing in LLMs via Hypernetworks cs.CL | cs.LG
Yiming Zeng, Jinghan Cao, Zexin Li, Wanhao Yu, Zhankai Ye
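TL;DR: This paper proposes HyperEdit, which uses a hypernetwork and difference-aware regularization to address two failures of LLMs in instruction-based text editing: poor alignment with the user's editing intent and over-editing of unchanged regions. With only 3B parameters, it achieves a 9%-30% relative improvement in BLEU on modified regions over state-of-the-art methods.
Motivation: Instruction-based text editing (for example, in code editors) is increasingly important in real-world applications, but LLMs treat it as generic text generation, which makes it hard to follow the user's editing intent faithfully and often leads to over-editing of unchanged content.
Result: On modified regions, HyperEdit achieves a 9%-30% relative improvement in BLEU over state-of-the-art baselines while using only 3B parameters.
Insight: The novelty is a hypernetwork that adapts parameters dynamically to tailor the editing strategy to each instruction, combined with difference-aware regularization that focuses supervision on modified spans to ensure precise, minimal edits and prevent over-editing.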
Abstract: Instruction-based text editing is increasingly critical for real-world applications such as code editors (e.g., Cursor), but Large Language Models (LLMs) continue to struggle with this task. Unlike free-form generation, editing requires faithfully implementing user instructions while preserving unchanged content, as even minor unintended modifications can break functionality. Existing approaches treat editing as generic text generation, leading to two key failures: they struggle to faithfully align edits with diverse user intents, and they often over-edit unchanged regions. We propose HyperEdit to address both issues. First, we introduce hypernetwork-based dynamic adaptation that generates request-specific parameters, enabling the model to tailor its editing strategy to each instruction. Second, we develop difference-aware regularization that focuses supervision on modified spans, preventing over-editing while ensuring precise, minimal changes. HyperEdit achieves a 9%–30% relative improvement in BLEU on modified regions over state-of-the-art baselines, despite utilizing only 3B parameters.
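The abstract says that difference-aware regularization "focuses supervision on modified spans" but gives no formula. One plausible reading, sketched here under that assumption, is to weight the per-token training loss more heavily on target tokens that differ from the source. The function name, the whitespace tokenization, and the weight value are hypothetical and are not taken from the paper.

```python
import difflib

def diff_aware_weights(source_tokens, target_tokens, modified_weight=3.0):
    """Assign a loss weight to each target token.

    Tokens inside spans that differ from the source get `modified_weight`;
    tokens copied unchanged keep weight 1.0.
    """
    weights = [1.0] * len(target_tokens)
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # "replace" and "insert" cover target positions j1..j2;
            # "delete" has j1 == j2 and therefore touches nothing.
            for j in range(j1, j2):
                weights[j] = modified_weight
    return weights

source = "def add(a, b): return a - b".split()
target = "def add(a, b): return a + b".split()
weights = diff_aware_weights(source, target)
print(list(zip(target, weights)))
```

Multiplying a per-token cross-entropy by these weights would concentrate the gradient on the single edited operator while still penalizing accidental changes elsewhere.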
[8] Coupled Variational Reinforcement Learning for Language Model General Reasoning cs.CL | cs.AI
Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He
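TL;DR: This paper proposes CoVRL (Coupled Variational Reinforcement Learning), a method for improving the general reasoning ability of language models. It couples the prior and posterior distributions through a hybrid sampling strategy, combining variational inference with reinforcement learning to explore more efficiently without a verifier while keeping thoughts and answers coherent.
Motivation: Existing verifier-free RL methods usually sample reasoning traces conditioned only on the question, which decouples the traces from answer information and leads to inefficient exploration and incoherence between traces and final answers. The paper addresses this limitation.
Result: Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and by a further 2.3% over state-of-the-art verifier-free RL baselines.
Insight: The core idea is to couple the prior and posterior distributions and to construct and optimize a composite distribution, combining variational inference and reinforcement learning in a principled way. The result is a framework that needs no external verifier, explores efficiently, and preserves thought-answer coherence.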
Abstract: While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the intrinsic probabilities of LLMs generating reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
[9] StruProKGR: A Structural and Probabilistic Framework for Sparse Knowledge Graph Reasoning cs.CL
Yucan Guo, Saiping Guan, Miao Su, Zeya Zhao, Xiaolong Jin
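TL;DR: This paper proposes StruProKGR, a structural and probabilistic framework for efficient and interpretable reasoning over sparse knowledge graphs (KGs). It lowers computational cost with a distance-guided path collection mechanism and exploits structural information through probabilistic path aggregation, which prioritizes paths that reinforce each other.
Motivation: Reasoning over sparse KGs is hard because knowledge is incomplete and relational patterns are difficult to capture. Existing path-based methods are computationally expensive, produce paths of uneven quality, and fail to exploit graph structure.
Result: Extensive experiments on five sparse KG reasoning benchmarks show that StruProKGR surpasses existing path-based methods in both effectiveness and efficiency.
Insight: The novelty is combining distance-guided path collection (for efficiency and relevance) with probabilistic path aggregation (to bring structural information into reasoning), giving an effective, efficient, and interpretable solution for sparse KG reasoning.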
Abstract: Sparse Knowledge Graphs (KGs) are commonly encountered in real-world applications, where knowledge is often incomplete or limited. Sparse KG reasoning, the task of inferring missing knowledge over sparse KGs, is inherently challenging due to the scarcity of knowledge and the difficulty of capturing relational patterns in sparse scenarios. Among all sparse KG reasoning methods, path-based ones have attracted plenty of attention due to their interpretability. Existing path-based methods typically rely on computationally intensive random walks to collect paths, producing paths of variable quality. Additionally, these methods fail to leverage the structured nature of graphs by treating paths independently. To address these shortcomings, we propose a Structural and Probabilistic framework named StruProKGR, tailored for efficient and interpretable reasoning on sparse KGs. StruProKGR utilizes a distance-guided path collection mechanism to significantly reduce computational costs while exploring more relevant paths. It further enhances the reasoning process by incorporating structural information through probabilistic path aggregation, which prioritizes paths that reinforce each other. Extensive experiments on five sparse KG reasoning benchmarks reveal that StruProKGR surpasses existing path-based methods in both effectiveness and efficiency, providing an effective, efficient, and interpretable solution for sparse KG reasoning.
[10] Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives cs.CL | cs.AI
Aheli Poddar, Saptarshi Sahoo, Sujata Ghosh
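TL;DR: This paper studies syllogistic reasoning in large language models (LLMs) from the logical and natural language perspectives, exploring their fundamental reasoning capabilities and the direction of this line of research. It evaluates 14 LLMs on symbolic inference and natural language understanding. Syllogistic reasoning is not a uniform emergent property across LLMs, but the perfect symbolic performance of some models raises the question of whether LLMs are becoming more formal reasoning mechanisms instead of reflecting the nuances of human reasoning.
Motivation: The study probes the basic syllogistic reasoning ability of LLMs and evaluates it from both the logical (symbolic) and natural language perspectives, in order to understand the nature and trajectory of LLM reasoning.
Result: Across the 14 LLMs evaluated, syllogistic reasoning is not a uniform emergent property, yet certain models achieve perfect performance on symbolic reasoning.
Insight: The novelty is a systematic evaluation of LLM syllogistic reasoning from the dual perspectives of formal logic and natural language understanding. The key observation is that LLMs may be evolving toward formal reasoning mechanisms instead of modeling the nuances of human reasoning, which bears on how LLM reasoning should be understood.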
Abstract: We study syllogistic reasoning in LLMs from the logical and natural language perspectives. In the process, we explore the fundamental reasoning capabilities of LLMs and the direction in which this research is moving. To aid our studies, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of symbolic inferences as well as natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performances in certain models make us wonder whether LLMs are becoming more and more formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.
[11] CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning cs.CL
Xuanzhang Liu, Jianglun Feng, Zhuoran Zhuang, Junzhe Zhao, Maofei Que
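TL;DR: This paper proposes CoDA, a context-decoupled hierarchical agent framework. A single shared LLM backbone is trained with reinforcement learning to play two roles, a high-level planner and a low-level executor, in order to address the "context explosion" caused by accumulating long text outputs and to improve performance on complex multi-step tasks.
Motivation: RL-trained LLM agents on complex multi-step tasks hit a performance bottleneck: accumulated long text outputs cause "context explosion", which in turn leads to reasoning failures.
Result: On complex multi-hop question-answering benchmarks, CoDA achieves significant improvements over state-of-the-art baselines and is robust in long-context scenarios, keeping stable performance while the other baselines degrade severely.
Insight: The novelty is a hierarchical design that decouples high-level planning from low-level execution, trained end to end with PECO (Planner-Executor Co-Optimization), so that a single LLM collaborates with itself across contextually isolated roles and context overload is mitigated.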
Abstract: Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi-step tasks. However, their performance is often crippled by “Context Explosion”, where the accumulation of long text outputs overwhelms the model’s context window and leads to reasoning failures. To address this, we introduce CoDA, a Context-Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high-level planning from low-level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high-level Planner that decomposes tasks within a concise strategic context, and a low-level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end-to-end using PECO (Planner-Executor Co-Optimization), a reinforcement learning methodology that applies a trajectory-level reward to jointly optimize both roles, fostering seamless collaboration through context-dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state-of-the-art baselines on complex multi-hop question-answering benchmarks, and it exhibits strong robustness in long-context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.
[12] NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents cs.CL
Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao
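TL;DR: This paper presents NL2Repo-Bench, a benchmark for evaluating coding agents on long-horizon, multi-step repository generation. Given only a natural-language requirements document and an empty workspace, an agent must design the architecture, manage dependencies, implement multi-module logic, and produce an installable Python library on its own. Experiments show that current state-of-the-art models perform poorly, with average test pass rates below 40%, which points to long-horizon reasoning as a main bottleneck for autonomous coding agents.
Motivation: Existing benchmarks mostly evaluate localized code generation, scaffolded completion, or short-term repair, and cannot rigorously assess the long-horizon capabilities needed to build complete software systems. A new benchmark is needed to measure the sustained, coherent reasoning, planning, and execution that real-world repository construction demands.
Result: On NL2Repo-Bench, even the strongest open- and closed-source models achieve average test pass rates below 40% and rarely complete an entire repository correctly, indicating that long-horizon repository generation remains largely unsolved.
Insight: The contribution is a verifiable benchmark designed specifically for long-horizon repository generation. Its analysis identifies key long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps, which marks out the main challenges for evaluating and developing future coding agents.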
Abstract: Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
[13] State over Tokens: Characterizing the Role of Reasoning Tokens cs.CL | cs.AI
Mosh Levy, Zohar Elyoseph, Shauli Ravfogel, Yoav Goldberg
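TL;DR: This paper introduces the State over Tokens (SoT) conceptual framework, which redefines the nature of the reasoning tokens generated by large language models (LLMs). It argues that these tokens should not be read as a textual narrative explaining the model's reasoning, but as an externalized, persistent computational state carried across the model's stateless generation cycles. This view explains why reasoning tokens can drive correct reasoning without being a faithful explanation of the model's internal process, and it surfaces overlooked research questions.
Motivation: The reasoning token sequences that LLMs generate before their final answer look like human thought processes, but empirical evidence shows they are not a faithful explanation of the model's actual reasoning. The paper aims to close this gap between appearance and function.
Result: The paper mainly contributes a conceptual framework; the abstract reports no quantitative experiments or benchmark results.
Insight: The core idea is to reconceptualize reasoning tokens as externalized computational state instead of readable textual explanation. This gives a new theoretical lens on LLM reasoning and suggests that future research should move beyond reading the tokens as text and focus on decoding them as state.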
Abstract: Large Language Models (LLMs) can generate reasoning tokens before their final answer to boost performance on complex tasks. While these sequences seem like human thought processes, empirical evidence reveals that they are not a faithful explanation of the model’s actual reasoning process. To address this gap between appearance and function, we introduce the State over Tokens (SoT) conceptual framework. SoT reframes reasoning tokens not as a linguistic narrative, but as an externalized computational state – the sole persistent information carrier across the model’s stateless generation cycles. This explains how the tokens can drive correct reasoning without being a faithful explanation when read as text and surfaces previously overlooked research questions on these tokens. We argue that to truly understand the process that LLMs do, research must move beyond reading the reasoning tokens as text and focus on decoding them as state.
[14] Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects cs.CL | cs.AI | cs.IR | cs.LG
Chris Latimer, Nicoló Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava
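TL;DR: This paper presents Hindsight, a new agent memory architecture that addresses shortcomings of existing LLM-based agent memory systems in organizing information over long horizons, supporting explainable reasoning, and separating evidence from inference. It treats memory as a structured, first-class substrate for reasoning, organizes information into four logical networks (world facts, agent experiences, synthesized entity summaries, and evolving beliefs), and supports three core operations: retain, recall, and reflect.
Motivation: Existing add-on memory systems for agents (such as vector or graph stores) organize information poorly, blur the line between evidence and inference, and offer limited support for explainable reasoning, which limits agents in long conversations and cross-session adaptation.
Result: On key long-horizon conversational memory benchmarks such as LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% (full-context baseline) to 83.6% and outperforms full-context GPT-4o. Scaling the backbone further, Hindsight reaches 91.4% on LongMemEval and 89.61% on LoCoMo (versus 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.
Insight: The core idea is to promote memory from an add-on retrieval-augmentation layer to a structured reasoning substrate. The logical networks explicitly separate information types (facts, experiences, summaries, beliefs), and the reflect operation performs traceable reasoning and updates, offering an architectural pattern for agents with long-term memory, explainability, and adaptivity.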
Abstract: Agent memory has been touted as a dimension of growth for LLM-based applications, enabling agents that can accumulate experience, adapt across sessions, and move beyond single-shot question answering. The current generation of agent memory systems treats memory as an external layer that extracts salient snippets from conversations, stores them in vector or graph-based stores, and retrieves top-k items into the prompt of an otherwise stateless model. While these systems improve personalization and context carry-over, they still blur the line between evidence and inference, struggle to organize information over long horizons, and offer limited support for agents that must explain their reasoning. We present Hindsight, a memory architecture that treats agent memory as a structured, first-class substrate for reasoning by organizing it into four logical networks that distinguish world facts, agent experiences, synthesized entity summaries, and evolving beliefs. This framework supports three core operations – retain, recall, and reflect – that govern how information is added, accessed, and updated. Under this abstraction, a temporal, entity aware memory layer incrementally turns conversational streams into a structured, queryable memory bank, while a reflection layer reasons over this bank to produce answers and to update information in a traceable way. On key long-horizon conversational memory benchmarks like LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% to 83.6% over a full-context baseline with the same backbone and outperforms full context GPT-4o. Scaling the backbone further pushes Hindsight to 91.4% on LongMemEval and up to 89.61% on LoCoMo (vs. 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.
[15] Counting Clues: A Lightweight Probabilistic Baseline Can Match an LLM cs.CL | cs.AI | cs.LG
Furong Jia, Yuan Pu, Finn Guo, Monica Agrawal
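TL;DR: This paper examines LLM performance on multiple-choice clinical diagnosis benchmarks and finds that it may not stem primarily from probabilistic reasoning. The authors propose a lightweight frequency-based probabilistic ranker that, using only concept-diagnosis co-occurrence statistics, matches the performance of the corresponding LLMs. The two approaches answer largely different questions correctly, which indicates complementary strengths.
Motivation: The paper asks how much of the strong LLM performance on clinical diagnosis tasks reflects underlying probabilistic reasoning as opposed to simple reliance on frequency statistics in the training data.
Result: On MedQA, FBPR built from the statistics of the same pretraining corpus performs comparably to the corresponding OLMo and Llama models. The overlap between questions answered correctly by direct LLM inference and by FBPR is only slightly above random chance.
Insight: The contribution is a lightweight, interpretable probabilistic baseline that shows the continued value of explicit probabilistic baselines as a performance reference point and a complementary signal. Even in the LLM era, a low-complexity approach similar to historical expert systems still accounts for a substantial portion of benchmark performance, which offers a new angle on model capability analysis and hybrid system design.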
Abstract: Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this through questions from MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores options with a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics from a large corpus. When co-occurrence statistics were sourced from the pretraining corpora for OLMo and Llama, FBPR achieves comparable performance to the corresponding LLMs pretrained on that same corpus. Direct LLM inference and FBPR largely get different questions correct, with an overlap only slightly above random chance, indicating complementary strengths of each method. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While the performance of LLMs seems to be driven by a mechanism other than simple frequency aggregation, we show that an approach similar to the historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.
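As a rough picture of what "a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics" can look like, the sketch below ranks candidate diagnoses by a log prior plus add-alpha smoothed log likelihoods of the observed concepts. The function name, the smoothing scheme, and every count are placeholders invented for illustration; they are neither the paper's implementation nor real clinical data.

```python
import math

def rank_diagnoses(concepts, cooccurrence, diagnosis_counts, vocab_size, alpha=1.0):
    """Rank diagnoses with an add-alpha smoothed Naive Bayes score.

    cooccurrence: {(concept, diagnosis): count} from some corpus.
    diagnosis_counts: {diagnosis: count} used for the prior and as the
        likelihood denominator (a simplifying assumption).
    vocab_size: number of distinct concepts, used in the smoothing term.
    """
    total = sum(diagnosis_counts.values())
    scores = {}
    for dx, dx_count in diagnosis_counts.items():
        score = math.log(dx_count / total)  # log prior
        for concept in concepts:
            joint = cooccurrence.get((concept, dx), 0)
            score += math.log((joint + alpha) / (dx_count + alpha * vocab_size))
        scores[dx] = score
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder counts (not real statistics).
diagnosis_counts = {"dx_a": 100, "dx_b": 100}
cooccurrence = {("fever", "dx_a"): 40, ("cough", "dx_a"): 30, ("fever", "dx_b"): 5}
ranking = rank_diagnoses(["fever", "cough"], cooccurrence, diagnosis_counts, vocab_size=3)
print(ranking)
```

Smoothing keeps an unseen concept-diagnosis pair from driving a score to negative infinity, which matters when the counts come from a sparse corpus.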
[16] QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management cs.CL
Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng
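TL;DR: This paper introduces QwenLong-L1.5, a model that achieves strong long-context reasoning through systematic post-training innovations. Its key components are a long-context data synthesis pipeline that generates challenging tasks requiring multi-hop reasoning, a stabilized long-context reinforcement learning method that overcomes training instability, and a memory-augmented architecture for ultra-long contexts that integrates single-pass reasoning with iterative memory-based processing through multi-stage fusion RL training.
Motivation: Existing models face several obstacles in long-context reasoning: a lack of high-quality training data for long-range reasoning, unstable long-context RL training, and context windows too limited for ultra-long sequences.
Result: Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), its memory-agent framework yields a 9.48-point gain over the agent baseline, and performance also improves in general domains such as scientific reasoning, memory tool use, and extended dialogue.
Insight: The novelty is a systematic post-training recipe: programmatic synthesis of high-quality long-context reasoning data, task-balanced sampling with Adaptive Entropy-Controlled Policy Optimization (AEPO) to stabilize RL training, and a memory management framework that combines single-pass reasoning with iterative memory-based processing for ultra-long sequences. Together these offer reusable engineering and algorithmic ideas for building efficient long-context models.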
Abstract: We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5’s memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool using, and extended dialogue.
[17] Authors Should Annotate cs.CL
Marcus Ma, Cole Johnson, Nolan Bridges, Jackson Trager, Georgios Chochlakis
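TL;DR: This paper proposes author labeling, an annotation technique in which a document's author labels the data at the moment of creation, which suits egocentric features such as sentiment and belief. In collaboration with a commercial chatbot with over 10,000 users, the authors deploy an author labeling system for subjective features related to product recommendation; it identifies task-relevant queries, generates labeling questions on the fly, and records authors' answers in real time. An online-learning recommendation model trained on author labels achieves a 534% increase in click-through rate over an industry advertising baseline, and for sentiment analysis author labeling is higher quality, faster to acquire, and cheaper than traditional annotation.
Details
Motivation: Third-party annotation can introduce proxy bias on egocentric features such as sentiment and belief; the paper explores the advantages of obtaining labels directly from the document's source.
Result: For product recommendation, the online-learning model trained on author labels increases click-through rate by 534% over an industry advertising baseline. For sentiment analysis, author labeling beats three traditional annotation approaches on quality, speed, and cost.
Insight: The novelty is the author labeling paradigm itself: labeling in real time at the point of creation can substantially improve the quality and efficiency of annotations for subjective features. Deploying a live system demonstrates feasibility and a large performance gain, and opens a new direction for annotation research.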
Abstract: The status quo for labeling text is third-party annotation, but there are many cases where information directly from the document’s source would be preferable over a third-person proxy, especially for egocentric features like sentiment and belief. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 10,000 users to deploy an author labeling annotation system for subjective features related to product recommendation. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors’ answers in real time. We train and deploy an online-learning model architecture for product recommendation that continuously improves from author labeling and find it achieved a 534% increase in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at academic.echollm.io.
[18] An Open and Reproducible Deep Research Agent for Long-Form Question Answering cs.CL
Ikuya Yamada, Wataru Ikeda, Ko Yoshida, Mengyu Ye, Hinata Sugimoto
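TL;DR: This paper presents an open and reproducible deep research agent for long-form question answering. The system was a winning entry in the text-to-text track of the MMU-RAG competition at NeurIPS 2025. It combines an open-source LLM with an open web search API to perform iterative retrieval, reasoning, and synthesis in real-world open-domain settings.
Details
Motivation: To address the challenge of high-quality long-form question answering in open-domain settings by building a reproducible system that integrates retrieval and reasoning.
Result: Experiments show that the method consistently improves answer quality on clarity, insightfulness, and factuality, and the system was selected as a winner of the MMU-RAG competition.
Insight: The main contributions are an open, reproducible end-to-end research system and preference tuning based on LLM-as-a-judge feedback that optimizes reasoning quality along multiple dimensions, offering a practical example for building transparent, high-performing retrieval-augmented generation systems.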
Abstract: We present an open deep research system for long-form question answering, selected as a winning system in the text-to-text track of the MMU-RAG competition at NeurIPS 2025. The system combines an open-source large language model (LLM) with an open web search API to perform iterative retrieval, reasoning, and synthesis in real-world open-domain settings. To enhance reasoning quality, we apply preference tuning based on LLM-as-a-judge feedback that evaluates multiple aspects, including clarity, insightfulness, and factuality. Our experimental results show that the proposed method consistently improves answer quality across all three aspects. Our source code is publicly available at https://github.com/efficient-deep-research/efficient-deep-research.
[19] LLM Rationalis? Measuring Bargaining Capabilities of AI Negotiators cs.CL | cs.AI
Cheril Shah, Akshit Agarwal, Kanak Garg, Mourad Heddaya
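TL;DR: This paper proposes a unified mathematical framework based on a hyperbolic tangent curve for modeling concession dynamics in bilateral negotiation, and introduces two metrics, burstiness tau and the Concession-Rigidity Index (CRI), to quantify the timing and rigidity of offer trajectories. A large-scale empirical comparison of human negotiators with four state-of-the-art LLMs across several settings finds fundamental limitations in the LLMs, including systematic anchoring at extremes and limited strategy diversity.
Details
Motivation: Bilateral negotiation is a complex, context-sensitive task in which human negotiators dynamically adjust anchors, pacing, and flexibility to exploit power asymmetries and informal cues. The paper aims to quantify negotiation dynamics and systematically assess the capabilities and limits of current LLMs in negotiation.
Result: In a large-scale empirical comparison covering natural-language and numeric-offer settings, with and without rich market context, and six controlled power-asymmetry scenarios, the four state-of-the-art LLMs systematically anchor at the extremes of the possible agreement zone, and their negotiation ability does not improve with better models. Humans, by contrast, adapt flexibly to context and infer the opponent's position and strategy.
Insight: The novelty is a unified mathematical model and two quantitative metrics for formally analyzing negotiation dynamics, together with a systematic experimental design that exposes a basic weakness of current LLMs in negotiation: a lack of context adaptation and opponent reasoning. This points toward models that better internalize opponent reasoning and context-dependent strategy.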
Abstract: Bilateral negotiation is a complex, context-sensitive task in which human negotiators dynamically adjust anchors, pacing, and flexibility to exploit power asymmetries and informal cues. We introduce a unified mathematical framework for modeling concession dynamics based on a hyperbolic tangent curve, and propose two metrics, burstiness tau and the Concession-Rigidity Index (CRI), to quantify the timing and rigidity of offer trajectories. We conduct a large-scale empirical comparison between human negotiators and four state-of-the-art large language models (LLMs) across natural-language and numeric-offer settings, with and without rich market context, as well as six controlled power-asymmetry scenarios. Our results reveal that, unlike humans, who smoothly adapt to situations and infer the opponent’s position and strategies, LLMs systematically anchor at extremes of the possible agreement zone for negotiations and optimize for fixed points irrespective of leverage or context. Qualitative analysis further shows limited strategy diversity and occasional deceptive tactics used by LLMs. Moreover, the ability of LLMs to negotiate does not improve with better models. These findings highlight fundamental limitations in current LLM negotiation capabilities and point to the need for models that better internalize opponent reasoning and context-dependent strategy.
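Illustrative sketch (not from the paper): the abstract states that concession dynamics are modeled with a hyperbolic tangent curve but does not give the parameterization. The Python below shows one hypothetical form, in which an offer moves from an opening value to a limit as normalized time runs from 0 to 1. The function name `tanh_offer` and the `steepness` parameter are assumptions made for illustration, and the paper's burstiness tau and CRI metrics are not reproduced here.

```python
import math

def tanh_offer(t, start, limit, steepness):
    """Offer at normalized time t in [0, 1] on a tanh-shaped concession curve.

    Hypothetical parameterization (not taken from the paper): the offer moves
    from `start` at t=0 to `limit` at t=1; a larger `steepness` concedes earlier.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    if steepness <= 0:
        raise ValueError("steepness must be positive")
    # Normalize so that progress is exactly 0 at t=0 and exactly 1 at t=1.
    progress = math.tanh(steepness * t) / math.tanh(steepness)
    return start + (limit - start) * progress

# A seller who opens at 100 and is willing to go down to 70 over 5 rounds.
rounds = 5
trajectory = [tanh_offer(i / (rounds - 1), 100.0, 70.0, steepness=3.0)
              for i in range(rounds)]
```

With a steepness of 3.0, most of the concession happens in the early rounds; a value near zero would approach a straight-line concession schedule.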
[20] AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning cs.CL | cs.LG
Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai
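TL;DR: This paper proposes AutoTool, a framework that strengthens dynamic tool selection in LLM agents so they can adaptively select and use new tools along their reasoning trajectories. The authors build a 200k-example dataset covering 1,000+ tools and 100+ tasks, and train with a dual-phase optimization pipeline (trajectory stabilization followed by KL-regularized Plackett-Luce ranking). Across ten diverse benchmarks, AutoTool models based on Qwen3-8B and Qwen2.5-VL-7B show clear gains in math and science reasoning, search-based QA, code generation, and multimodal understanding, and generalize to unseen tools.
Details
Motivation: Existing methods assume a fixed tool inventory, which limits how well LLM agents adapt to new or evolving toolsets; a dynamic tool-selection framework is needed to improve agent flexibility and generalization.
Result: Across ten diverse benchmarks, AutoTool yields average gains of 6.4% in math and science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding, outperforming advanced LLM agents and tool-integration methods with fewer parameters and reaching SOTA-level results.
Insight: Contributions include a large-scale tool-selection dataset, the dual-phase optimization pipeline (supervised and RL-based trajectory stabilization plus KL-regularized Plackett-Luce ranking), and a dynamic tool-selection mechanism. These designs improve the consistency and adaptability of tool use during reasoning and may inform more flexible AI agent systems.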
Abstract: Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents’ adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
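Illustrative sketch (not the authors' implementation): the second training phase relies on Plackett-Luce ranking. The Python below computes the standard Plackett-Luce negative log-likelihood of a ranking given per-item scores. How AutoTool scores tools and how the KL regularizer is applied are not described in the abstract, so the tool names and scores here are hypothetical and the KL term is omitted.

```python
import math

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` (best first) under Plackett-Luce.

    scores:  dict mapping item -> real-valued score (e.g. a tool's logit)
    ranking: list of items ordered from most to least preferred
    """
    nll = 0.0
    for i in range(len(ranking)):
        # At each position, the chosen item competes with all items not yet placed.
        remaining = [scores[item] for item in ranking[i:]]
        m = max(remaining)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores[ranking[i]] - log_norm
    return nll

# Hypothetical tool scores for one reasoning step.
tool_scores = {"calculator": 2.0, "search": 1.0, "code_runner": 0.0}
nll_agree = plackett_luce_nll(tool_scores, ["calculator", "search", "code_runner"])
nll_reversed = plackett_luce_nll(tool_scores, ["code_runner", "search", "calculator"])
```

A ranking that agrees with the scores gets a lower loss than the reversed one, so minimizing this quantity pushes the scores toward the reference ordering.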
[21] AIR: Post-training Data Selection for Reasoning via Attention Head Influence cs.CL
Jinrui Liu, Jeff Wu, Xuanguang Pan, Gavin Cheung, Shuai Ma
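TL;DR: This paper proposes AIR (Attention Influence for Reasoning), an unsupervised, training-free data selection framework for improving reasoning in LLM post-training distillation. It analyzes the attention mechanism of an off-the-shelf model to identify reasoning-critical attention heads, builds a reference model in which their influence is weakened, and computes an Attention Influence Score from the resulting loss divergence, allowing data value to be assessed at both the step and sample levels. Experiments show that AIR improves accuracy on multiple reasoning benchmarks.
Details
Motivation: Existing post-training data selection strategies rely on heuristics such as length, entropy, or overall loss and fail to capture the causal importance of reasoning steps, which limits distillation efficiency. The paper asks how to select high-value post-training data more effectively to improve LLM reasoning.
Result: Experiments on multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpasses heuristic baselines, and identifies the most critical training steps and samples.
Insight: The novelty lies in applying mechanistic interpretability (specifically attention-head influence) directly to data selection, giving a mechanism-driven, data-efficient approach to post-training distillation. Because it quantifies data importance from the internal attention of an off-the-shelf model, it needs no extra training and is comparatively interpretable.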
Abstract: LLMs achieve remarkable multi-step reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training-free framework that leverages mechanistic insights of the retrieval head to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.
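Illustrative sketch (an assumption-laden reading, not the authors' code): the abstract defines the Attention Influence Score as the loss divergence between the original model and a weakened reference model. Assuming per-token losses from both models are already available, the Python below averages the loss gap within each reasoning step. The function name, the step spans, and the choice of averaging are all hypothetical.

```python
def attention_influence_scores(base_losses, weakened_losses, step_spans):
    """Per-step loss gap between a weakened reference model and the base model.

    base_losses / weakened_losses: per-token losses for the same sample
    step_spans: list of (start, end) token index pairs, end exclusive,
                one non-empty span per reasoning step
    Averaging within a step is an assumption made for this sketch.
    """
    if len(base_losses) != len(weakened_losses):
        raise ValueError("loss sequences must align token by token")
    scores = []
    for start, end in step_spans:
        gaps = [w - b for b, w in zip(base_losses[start:end],
                                      weakened_losses[start:end])]
        scores.append(sum(gaps) / len(gaps))
    return scores

# Toy numbers: disabling the heads barely hurts step 1 but hurts step 2 a lot.
base = [0.5, 0.4, 0.6, 0.3]
weak = [0.6, 0.5, 1.6, 1.3]
step_scores = attention_influence_scores(base, weak, [(0, 2), (2, 4)])
```

Under this reading, a step with a large gap depends heavily on the disabled heads and would receive more weight in step-level fine-tuning.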
[22] Integrating Causal Reasoning into Automated Fact-Checking cs.CL
Youssra Rebboud, Pasquale Lisena, Raphael Troncy
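TL;DR: This paper proposes a method for integrating causal reasoning into automated fact-checking. It combines event relation extraction, semantic similarity computation, and rule-based reasoning to detect logical inconsistencies between the chains of events in a claim and in its evidence.
Details
Motivation: Current automated fact-checking methods lack dedicated causal reasoning and may miss an opportunity for semantically rich explainability; the paper aims to fill this gap.
Result: Evaluation on two fact-checking datasets shows that the method establishes the first baseline for integrating fine-grained causal event relationships into fact-checking and improves the explainability of verdict prediction.
Insight: The novelty is the first systematic use of fine-grained causal event relationships in automated fact-checking: detecting logical inconsistencies between event chains improves semantic understanding and explainability and gives the field a new baseline method.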
Abstract: In fact-checking applications, a common reason to reject a claim is the presence of erroneous cause-effect relationships between the events at play. However, current automated fact-checking methods lack dedicated causal-based reasoning, potentially missing a valuable opportunity for semantically rich explainability. To address this gap, we propose a methodology that combines event relation extraction, semantic similarity computation, and rule-based reasoning to detect logical inconsistencies between chains of events mentioned in a claim and in its evidence. Evaluated on two fact-checking datasets, this method establishes the first baseline for integrating fine-grained causal event relationships into fact-checking and enhances the explainability of verdict prediction.
[23] FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models cs.CL | cs.AI
Joona Kytöniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary
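TL;DR: This paper introduces FIN-bench-v2, a unified benchmark suite for evaluating Finnish LLMs. It consolidates Finnish versions of widely used benchmarks and an updated, expanded version of the original FIN-bench into one consistently formatted collection covering reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to the HuggingFace Datasets format with cloze and multiple-choice prompts, and machine-translated resources are human-annotated or reviewed for quality. Task robustness is analyzed with pretrained models, and larger instruction-tuned models are also evaluated. All resources are open source.
Details
Motivation: To give Finnish LLMs a unified, robust evaluation suite, addressing existing benchmarks that are scattered, inconsistently formatted, and of uneven translation quality.
Result: The authors pretrain 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency in order to select robust tasks. They further evaluate larger instruction-tuned models across tasks and prompt formulations.
Insight: The novelty is consolidating several Finnish benchmarks into one format, introducing systematic task-robustness criteria based on the learning curves of pretrained models, and applying human quality control to machine-translated resources. This gives a reproducible, high-quality benchmark construction methodology for LLM evaluation in smaller languages.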
Abstract: We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
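Illustrative sketch (the definitions are assumptions): the abstract names monotonicity and signal-to-noise as task-selection criteria but does not define them. The Python below uses simple stand-in definitions to show how such criteria could be computed from learning curves: monotonicity as the fraction of non-decreasing consecutive checkpoints, and signal-to-noise as the mean over the standard deviation of final scores across runs. Both function definitions and the toy numbers are hypothetical.

```python
import statistics

def monotonicity(curve):
    """Fraction of consecutive checkpoints where the score does not decrease."""
    steps = list(zip(curve, curve[1:]))
    return sum(later >= earlier for earlier, later in steps) / len(steps)

def signal_to_noise(final_scores):
    """Mean over sample standard deviation of final-checkpoint scores across runs."""
    return statistics.mean(final_scores) / statistics.stdev(final_scores)

# Toy learning curve for one task (score at successive checkpoints)
# and final scores from three hypothetical pretraining runs.
curve = [0.25, 0.31, 0.30, 0.38, 0.44]
finals = [0.44, 0.46, 0.42]

task_monotonicity = monotonicity(curve)
task_snr = signal_to_noise(finals)
```

A task would then be retained only if every criterion clears its threshold; the thresholds themselves are not given in the abstract.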
[24] Non-Resolution Reasoning: A Framework for Preserving Semantic Ambiguity in Language Models cs.CL | cs.AI | cs.LG
Kei Saito
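TL;DR: This paper proposes Non-Resolution Reasoning (NRR), a computational framework targeting premature semantic collapse in current language models, where a model is forced to commit to a single meaning before enough context is available. Through three components (multi-vector embeddings, non-collapsing attention, and contextual identity tracking), the framework separates semantic representation from the resolution decision, so the model can keep ambiguity during inference and resolve it explicitly only when required.
Details
Motivation: Softmax-driven competition and greedy decoding cause premature semantic collapse: models discard valid interpretations before sufficient context is available, leading to brittle reasoning and context failures.
Result: In a synthetic evaluation, NRR models enhanced with Contextual Identity Tracking (CIT) reach 90.9% accuracy on out-of-distribution identity-shift tasks, versus 9.1% for transformer baselines.
Insight: The core idea is to treat semantic ambiguity as an explicit representational state rather than a failure mode. By separating representation from resolution, a model can switch between reasoning modes (e.g., creative, factual) without retraining, with an external resolution operator making semantic commitment controllable.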
Abstract: Premature semantic collapse – the forced early commitment to a single meaning – remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing “Dr. Smith the cardiologist” from “Dr. Smith the researcher”). These mechanisms are unified by an external Resolution Operator $ρ$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR’s ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
[25] Memory in the Age of AI Agents cs.CL | cs.AI
Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu
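TL;DR: This survey of memory in AI agents organizes a fragmented research landscape. It classifies agent memory through three unified lenses (forms, functions, and dynamics), summarizes relevant benchmarks and open-source frameworks, and outlines future research directions.
Details
Motivation: Memory is drawing wide attention as a core capability of foundation-model-based agents, but the field is fragmented and its terminology loosely defined. Traditional taxonomies such as long/short-term memory no longer cover the diversity of modern agent memory systems, so a clear, unified picture is needed.
Result: The paper proposes no specific model or experiments; it compiles and summarizes existing memory benchmarks and open-source frameworks to support practical development.
Insight: The contribution is a taxonomy along three dimensions: forms (token-level, parametric, and latent memory), functions (factual, experiential, and working memory), and dynamics (formation, evolution, and retrieval). It goes beyond the traditional dichotomy and lays conceptual groundwork for treating memory as a first-class primitive in future agent design.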
Abstract: Memory has emerged as, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolves, and is retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
[26] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models cs.CL | cs.AI | cs.LG
Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai
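TL;DR: This paper proposes Cascade RL (cascaded reinforcement learning) for building the general-purpose reasoning models Nemotron-Cascade. By running RL sequentially, domain by domain, it addresses the engineering complexity and training difficulties caused by cross-domain heterogeneity (such as differences in response length and verification latency). The models reach SOTA performance on multiple benchmarks; after RL, the 14B model outperforms its SFT teacher, DeepSeek-R1-0528, and performs strongly on LiveCodeBench and the IOI.
Details
Motivation: When building general-purpose reasoning models, cross-domain heterogeneity (such as large variation in inference-time response lengths and verification latency) complicates RL infrastructure, slows training, and makes curriculum design and hyperparameter selection difficult. The paper aims to address these problems.
Result: After RL, the 14B model outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI), with SOTA results across a wide range of benchmarks.
Insight: The core idea is Cascade RL (cascaded domain-wise RL): instead of blending prompts from different domains, it runs sequential, domain-wise RL, which reduces engineering complexity. A second finding is that RLHF used as a pre-step substantially boosts reasoning ability, and that subsequent domain-wise RLVR stages rarely degrade, and may even improve, the performance attained in earlier domains.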
Abstract: Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model’s reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
[27] Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation cs.CL | cs.SE
Richard J. Young
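TL;DR: This study presents a cross-architecture evaluation of four LLM abliteration tools (Heretic, DECCP, ErisForge, FailSpy), which remove the refusal behavior that safety alignment induces for harmful queries, with the stated aim of supporting legitimate research. It tests tool compatibility on 16 instruction-tuned models (7B-14B parameters) and runs quantitative analysis on subsets. Single-pass methods preserve model capabilities better, Bayesian-optimized abliteration causes capability impact and distribution shift, and mathematical reasoning is the capability most sensitive to abliteration.
Details
Motivation: Safety alignment in LLMs prevents responses to harmful queries through learned refusal behavior, but this also impedes legitimate research such as cognitive modeling, adversarial testing, and security analysis. Abliteration techniques (such as directional orthogonalization) can surgically remove refusal representations, yet the relative effectiveness of the available implementations has been unclear.
Result: On the benchmarked subset, single-pass methods (ErisForge, DECCP) preserve capabilities better (average GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp). Bayesian-optimized abliteration produces variable distribution shift (KL divergence: 0.043-1.646), with model-dependent capability impact. Mathematical reasoning (GSM8K) is the most sensitive capability, with changes ranging from +1.51 pp to -18.81 pp (a 26.5% relative drop) depending on tool choice and model architecture.
Insight: The novelty is a systematic cross-architecture comparison of multiple abliteration tools, described as the first of its kind, which gives researchers evidence-based criteria for tool selection. Its core insight is the trade-off between capability preservation and distribution shift across abliteration methods, and that mathematical reasoning is a key sensitive indicator for assessing abliteration effects, a useful reference for future work on safety alignment and model usability.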
Abstract: Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
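Worked arithmetic (a reading aid, not a result from the paper): the abstract reports the largest GSM8K change both in percentage points (-18.81 pp) and as a relative change (-26.5%). The two figures are linked by the pre-intervention accuracy, which the abstract does not state. The Python below back-computes the baseline that the two numbers imply; because 26.5% is itself rounded, the result is approximate.

```python
def relative_change(baseline_pct, change_pp):
    """Relative change (in %) implied by an absolute change in percentage points."""
    return 100.0 * change_pp / baseline_pct

def implied_baseline(change_pp, relative_pct):
    """Baseline accuracy (in %) implied by a pp change and its relative change."""
    return 100.0 * change_pp / relative_pct

# Figures quoted in the abstract; the baseline is derived, not reported.
baseline = implied_baseline(-18.81, -26.5)
check = relative_change(baseline, -18.81)
```

The implied baseline is roughly 71% GSM8K accuracy, so the same intervention would read as a much smaller relative drop on a stronger model and a larger one on a weaker model.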
cs.CV
[28] Explainable Adversarial-Robust Vision-Language-Action Model for Robotic Manipulation cs.CV | cs.AI | cs.RO
Ju-Young Kim, Ji-Hong Park, Myeongjun Kim, Gun-Woo Kim
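TL;DR: This paper proposes an explainable, adversarially robust vision-language-action model for robotic manipulation in smart farming. Built on the OpenVLA-OFT framework, it integrates a module called Evidence-3 that detects photometric perturbations (such as hue, illumination, and noise changes) and generates natural-language explanations of their causes and effects. Experiments show improved action prediction accuracy and explainability under adversarial conditions.
Details
Motivation: Smart-farming systems that rely on RGB cameras for perception and robotic manipulators for control are vulnerable to adversarial photometric perturbations, which can cause malfunction. The paper addresses the robustness and explainability of such systems under adversarial conditions.
Result: Compared with the baseline, the model reduces Current Action L1 loss by 21.7% and Next Actions L1 loss by 18.4%, indicating more accurate action prediction under adversarial conditions.
Insight: The novelty is integrating an explainability module (Evidence-3) into a vision-language-action model so that it not only detects adversarial perturbations but also explains their causes and effects in natural language, improving reliability and transparency in complex environments.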
Abstract: Smart farming has emerged as a key technology for advancing modern agriculture through automation and intelligent control. However, systems relying on RGB cameras for perception and robotic manipulators for control, common in smart farming, are vulnerable to photometric perturbations such as hue, illumination, and noise changes, which can cause malfunction under adversarial attacks. To address this issue, we propose an explainable adversarial-robust Vision-Language-Action model based on the OpenVLA-OFT framework. The model integrates an Evidence-3 module that detects photometric perturbations and generates natural language explanations of their causes and effects. Experiments show that the proposed model reduces Current Action L1 loss by 21.7% and Next Actions L1 loss by 18.4% compared to the baseline, demonstrating improved action prediction accuracy and explainability under adversarial conditions.
[29] Temporal-Anchor3DLane: Enhanced 3D Lane Detection with Multi-Task Losses and LSTM Fusion cs.CV
D. Shainu Suhas, G. Rahul, K. Muni
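TL;DR: This paper proposes Temporal-Anchor3DLane, an enhanced monocular 3D lane detection framework. With improved multi-task losses and a lightweight LSTM temporal fusion module, it addresses Anchor3DLane's sensitivity to regression outliers, weak geometric supervision, difficulty balancing losses, and limited use of temporal continuity, improving detection performance and smoothness on the OpenLane benchmark.
Details
Motivation: Existing methods such as Anchor3DLane are sensitive to regression outliers, supervise global curve geometry only weakly, struggle to balance multiple loss terms, and make limited use of temporal continuity across frames. The paper aims to improve the robustness and temporal stability of monocular 3D lane detection.
Result: On the OpenLane benchmark, the method improves F1 by +6.2 and yields smoother temporal trajectories, indicating more robust 3D lane detection.
Insight: Contributions include multi-task loss improvements (Balanced L1 regression, Chamfer point-set distance, and uncertainty-based loss weighting); a lightweight temporal LSTM fusion module that replaces a heavier Transformer-style fusion; and ESCOP-style training refinements that couple curve-level supervision with temporal consistency. These small architectural and loss refinements improve performance without extra sensors or a larger model.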
Abstract: Monocular 3D lane detection remains challenging due to depth ambiguity, occlusion, and temporal instability across frames. Anchor-based approaches such as Anchor3DLane have demonstrated strong performance by regressing continuous 3D lane curves from multi-camera surround views. However, the baseline model still exhibits (i) sensitivity to regression outliers, (ii) weak supervision of global curve geometry, (iii) difficulty in balancing multiple loss terms, and (iv) limited exploitation of temporal continuity. We propose Temporal-Anchor3DLane, an enhanced 3D lane detection framework that extends Anchor3DLane with three key contributions: (1) a set of multi-task loss improvements, including Balanced L1 regression, Chamfer point-set distance, and uncertainty-based loss weighting, together with focal and Dice components for classification and visibility; (2) a lightweight Temporal LSTM Fusion module that aggregates per-anchor features across frames, replacing a heavier Transformer-style temporal fusion; and (3) ESCOP-style training refinements that couple curve-level supervision with temporal consistency. On OpenLane, Temporal-Anchor3DLane improves F1 by +6.2 and yields smoother temporal trajectories, showing that small architectural and loss refinements significantly enhance 3D lane robustness without extra sensors or scaling.
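Illustrative sketch (generic formulations, not the paper's exact losses): two of the loss components named above have well-known textbook forms. The Python below shows a symmetric Chamfer distance between two 3D point sets and a common uncertainty-based weighting of task losses with learnable log-variances. The paper may use different variants (for example squared distances or different scaling), and the lane points are made up.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

def uncertainty_weighted_loss(losses, log_variances):
    """Combine task losses as sum(exp(-s_i) * L_i + s_i), with s_i = log(sigma_i^2).

    In training, each s_i would be a learnable parameter; here it is a plain number.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_variances))

# A predicted lane that is offset 0.5 m laterally from the ground-truth lane.
predicted = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
ground_truth = [(0.5, 0.0, 0.0), (0.5, 1.0, 0.0), (0.5, 2.0, 0.0)]

lane_chamfer = chamfer_distance(predicted, ground_truth)
total_loss = uncertainty_weighted_loss([lane_chamfer, 0.3], [0.0, 0.0])
```

Unlike a per-point regression loss, the Chamfer term compares whole curves, which is one way to supervise global lane geometry.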
[30] Automated Plant Disease and Pest Detection System Using Hybrid Lightweight CNN-MobileViT Models for Diagnosis of Indigenous Crops cs.CV | cs.AI
Tekleab G. Gebremedhin, Hailom S. Asegede, Bruh W. Tesheme, Tadesse B. Gebremichael, Kalayu G. Redae
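TL;DR: This paper presents an automated, offline system for detecting diseases and pests in indigenous crops (mainly cactus-fig) in the Tigray region of Ethiopia. The system is built on a newly curated dataset of 3,587 field images and, targeting edge deployment, benchmarks three mobile-efficient architectures: a custom lightweight CNN, EfficientNet-Lite1, and the CNN-Transformer hybrid MobileViT-XS.
Details
Motivation: Infrastructural disruptions in the Tigray region of Ethiopia limit access to expert crop disease diagnosis; the paper develops an offline-first detection system for edge environments to support food-security-critical diagnostics.
Result: On the cactus-fig dataset, EfficientNet-Lite1 reaches 90.7% test accuracy, the custom lightweight CNN reaches 89.5% with the most favorable deployment profile (42 ms inference latency, 4.8 MB model size), and MobileViT-XS has the highest mean cross-validation accuracy at 97.3%.
Insight: The novelty is a dedicated dataset for an indigenous crop and a systematic evaluation and deployment of mobile-efficient models in a resource-constrained edge setting. The CNN-Transformer hybrid (MobileViT-XS), which uses global reasoning based on multi-head self-attention (MHSA), separates pest clusters from two-dimensional fungal lesions more reliably than CNN kernels that depend on local texture, illustrating the value of attention for agricultural vision tasks. The study also makes explicit the Pareto trade-off between accuracy and deployment efficiency (latency, size), which helps in choosing a model for practical use.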
Abstract: Agriculture supports over 80% of the population in the Tigray region of Ethiopia, where infrastructural disruptions limit access to expert crop disease diagnosis. We present an offline-first detection system centered on a newly curated indigenous cactus-fig (Opuntia ficus-indica) dataset consisting of 3,587 field images across three core symptom classes. Given deployment constraints in post-conflict edge environments, we benchmark three mobile-efficient architectures: a custom lightweight CNN, EfficientNet-Lite1, and the CNN-Transformer hybrid MobileViT-XS. While the broader system contains independent modules for potato, apple, and corn, this study isolates cactus-fig model performance to evaluate attention sensitivity and inductive bias transfer on indigenous morphology alone. Results establish a clear Pareto trade-off: EfficientNet-Lite1 achieves 90.7% test accuracy, the lightweight CNN reaches 89.5% with the most favorable deployment profile (42 ms inference latency, 4.8 MB model size), and MobileViT-XS delivers 97.3% mean cross-validation accuracy, demonstrating that MHSA-based global reasoning disambiguates pest clusters from two-dimensional fungal lesions more reliably than local-texture CNN kernels. The ARM-compatible models are deployed in a Tigrigna- and Amharic-localized Flutter application supporting fully offline inference on Cortex-A53-class devices, strengthening inclusivity for food-security-critical diagnostics.
[31] Microscopic Vehicle Trajectory Datasets from UAV-collected Video for Heterogeneous, Area-Based Urban Traffic cs.CV
Yawar Ali, K. Ramachandra Rao, Ashish Bhaskar, Niladri Chatterjee
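TL;DR: This paper provides microscopic vehicle trajectory datasets collected by UAVs, covering heterogeneous, area-based urban traffic at six mid-block locations in the national capital region of India. The datasets contain timestamps, vehicle positions, speeds, longitudinal and lateral accelerations, and vehicle classifications, and are intended to support simulation modelling, safety assessment, and behavioural studies.
Details
Motivation: Traditional roadside video collection often fails in dense mixed traffic because of occlusion, limited viewing angles, and irregular vehicle movements; the top-down UAV perspective reduces these problems and captures rich spatial and temporal dynamics.
Result: The datasets were extracted with the Data from Sky platform and validated against manual counts, space mean speeds, and probe trajectories. They are recorded at 30 frames per second and cover diverse traffic compositions and density levels.
Insight: The contribution is openly available UAV-collected microscopic trajectory data that supports analysis of heterogeneous, area-based traffic behaviour, such as lane-keeping preferences, speed distributions, and lateral manoeuvres, giving an empirical basis for modelling complex urban traffic environments.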
Abstract: This paper offers openly available microscopic vehicle trajectory (MVT) datasets collected using unmanned aerial vehicles (UAVs) in heterogeneous, area-based urban traffic conditions. Traditional roadside video collection often fails in dense mixed traffic due to occlusion, limited viewing angles, and irregular vehicle movements. UAV-based recording provides a top-down perspective that reduces these issues and captures rich spatial and temporal dynamics. The datasets described here were extracted using the Data from Sky (DFS) platform and validated against manual counts, space mean speeds, and probe trajectories in earlier work. Each dataset contains time-stamped vehicle positions, speeds, longitudinal and lateral accelerations, and vehicle classifications at a resolution of 30 frames per second. Data were collected at six mid-block locations in the national capital region of India, covering diverse traffic compositions and density levels. Exploratory analyses highlight key behavioural patterns, including lane-keeping preferences, speed distributions, and lateral manoeuvres typical of heterogeneous and area-based traffic settings. These datasets are intended as a resource for the global research community to support simulation modelling, safety assessment, and behavioural studies under area-based traffic conditions. By making these empirical datasets openly available, this work offers researchers a unique opportunity to develop, test, and validate models that more accurately represent complex urban traffic environments.
[32] Read or Ignore? A Unified Benchmark for Typographic-Attack Robustness and Text Recognition in Vision-Language Models cs.CV
Futa Waseda, Shojiro Yamabe, Daiki Shiono, Kento Sasaki, Tsubasa Takahashi
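TL;DR: Large vision-language models (LVLMs) are vulnerable to typographic attacks, where misleading text in an image overrides visual understanding. This paper proposes a new task, Read-or-Ignore VQA (RIO-VQA), and builds a standardized benchmark, RIO-Bench, to evaluate models in scenarios that require selective text use (deciding from context when to read text and when to ignore it).
Details
Motivation: Existing evaluations and defenses focus mainly on object recognition and implicitly encourage ignoring text to achieve robustness. This conflicts with real-world needs, which often require joint reasoning over objects and text (e.g., recognizing pedestrians while reading traffic signs), so a new evaluation framework is needed that balances typographic-attack robustness with text-reading ability.
Result: Evaluation on RIO-Bench shows that neither strong current LVLMs nor existing defenses balance typographic robustness and text-reading capability, highlighting the need for improved approaches.
Insight: The core contribution is the RIO-VQA task and the RIO-Bench benchmark, which formalize selective text use and standardize evaluation by providing, for each real image, counterfactual (read / ignore) scenarios that differ only in textual content and question type. This exposes a fundamental misalignment between the existing evaluation scope and real-world requirements and offers a principled path toward reliable LVLMs. RIO-Bench also enables a novel data-driven defense that learns adaptive selective text use, going beyond prior non-adaptive defenses that simply ignore text.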
Abstract: Large vision-language models (LVLMs) are vulnerable to typographic attacks, where misleading text within an image overrides visual understanding. Existing evaluation protocols and defenses, largely focused on object recognition, implicitly encourage ignoring text to achieve robustness; however, real-world scenarios often require joint reasoning over both objects and text (e.g., recognizing pedestrians while reading traffic signs). To address this, we introduce a novel task, Read-or-Ignore VQA (RIO-VQA), which formalizes selective text use in visual question answering (VQA): models must decide, from context, when to read text and when to ignore it. For evaluation, we present the Read-or-Ignore Benchmark (RIO-Bench), a standardized dataset and protocol that, for each real image, provides same-scene counterfactuals (read / ignore) by varying only the textual content and question type. Using RIO-Bench, we show that strong LVLMs and existing defenses fail to balance typographic robustness and text-reading capability, highlighting the need for improved approaches. Finally, RIO-Bench enables a novel data-driven defense that learns adaptive selective text use, moving beyond prior non-adaptive, text-ignoring defenses. Overall, this work reveals a fundamental misalignment between the existing evaluation scope and real-world requirements, providing a principled path toward reliable LVLMs. Our Project Page is at https://turingmotors.github.io/rio-vqa/.
[33] CLARGA: Multimodal Graph Representation Learning over Arbitrary Sets of Modalities cs.CV | cs.LGPDF
Santosh Patapati
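TL;DR: CLARGA is a general-purpose multimodal fusion architecture that works with any number and type of modalities. It builds an attention-weighted graph and passes messages with a multi-head Graph Attention Network, achieving efficient fusion and adapting to missing modality inputs.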
Details
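Motivation: Existing multimodal learning methods struggle to handle varying numbers and types of modalities flexibly and are inefficient. The goal is a general, efficient fusion framework that can adapt to missing inputs.
Result: Experiments on 7 datasets spanning finance, human-computer interaction, multimedia classification, and affective computing show that CLARGA consistently outperforms baselines, state-of-the-art models, and ablations, and is robust to missing inputs.
Insight: Contributions include building a dynamic attention graph per sample for adaptive fusion, handling missing modalities with a learnable mask, and a hybrid training objective that combines a supervised task loss with a contrastive InfoNCE loss, improving cross-modal consistency and robustness.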
Abstract: We introduce CLARGA, a general-purpose multimodal fusion architecture for multimodal representation learning that works with any number and type of modalities without changing the underlying framework. Given a supervised dataset, CLARGA can be applied to virtually any machine learning task to fuse different multimodal representations for processing by downstream layers. On a sample-by-sample basis, CLARGA learns how modalities should inform one another by building an attention weighted graph over their features and passing messages along this graph with a multi-head Graph Attention Network. Not only does this make CLARGA highly adaptive, as it constructs unique graphs for different samples, it makes for efficient fusion with sub-quadratic complexity as the number of modalities grows. Through a learnable mask, it can also adapt to missing modality inputs. The model is trained with a hybrid objective that combines a supervised task loss with contrastive InfoNCE loss, improving cross-modal consistency and robustness to noisy inputs. We demonstrate CLARGA’s effectiveness in diverse multimodal representation learning tasks across 7 datasets spanning finance, human-computer interaction, general multimedia classification, and affective computing. It consistently outperforms baselines, state-of-the-art models, and ablations. Additional experiments also demonstrate its robustness to missing inputs and ability to excel on niche tasks. Overall, CLARGA can be easily plugged into machine learning models for effective and efficient learning of representations across a wide variety of tasks.
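The contrastive term of the hybrid objective can be illustrated with a minimal InfoNCE sketch. This is not CLARGA's implementation: the abstract does not say which representations are contrasted or which temperature is used, so the pairing of `anchors[i]` with `positives[i]`, the cosine similarity, and `temperature=0.1` are all assumptions made for illustration.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of embedding pairs.

    anchors[i] should match positives[i]; every positives[j] with j != i
    acts as a negative. Pairing and temperature are illustrative assumptions.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)
```

The loss is small when each anchor is most similar to its own positive and grows when the pairs are mismatched, which is the property a cross-modal consistency term relies on.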
[34] Smartphone monitoring of smiling as a behavioral proxy of well-being in everyday life cs.CVPDF
Ming-Zher Poh, Shun Liao, Marco Andreetto, Daniel McDuff, Jonathan Wang
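TL;DR: This study proposes a new method for objectively measuring subjective well-being by passively monitoring everyday smile intensity with smartphones. Analyzing more than 400,000 video clips, it finds that smile-intensity patterns correlate strongly with national survey data on happiness and that their diurnal rhythms agree with established results, suggesting that passive smartphone sensing is an effective tool for studying the dynamics of affective behavior.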
Details
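Motivation: Traditional measures of subjective well-being rely on self-report, which suffers from recall bias and high participant burden, leaving a gap in the objective understanding of how well-being is expressed in everyday life.
Result: Smile intensity correlated strongly with national survey data on happiness (r=0.92), agreed closely with established results from the day reconstruction method (r=0.80), and was significantly positively associated with more physical activity (Beta=0.043) and greater light exposure (Beta=0.038).
Insight: A deep learning model quantifies smile intensity from passively collected smartphone video as an objective behavioral indicator of positive affect, offering a novel approach to large-scale, ecologically valid well-being research.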
Abstract: Subjective well-being is a cornerstone of individual and societal health, yet its scientific measurement has traditionally relied on self-report methods prone to recall bias and high participant burden. This has left a gap in our understanding of well-being as it is expressed in everyday life. We hypothesized that candid smiles captured during natural smartphone interactions could serve as a scalable, objective behavioral correlate of positive affect. To test this, we analyzed 405,448 video clips passively recorded from 233 consented participants over one week. Using a deep learning model to quantify smile intensity, we identified distinct diurnal and daily patterns. Daily patterns of smile intensity across the week showed strong correlation with national survey data on happiness (r=0.92), and diurnal rhythms documented close correspondence with established results from the day reconstruction method (r=0.80). Higher daily mean smile intensity was significantly associated with more physical activity (Beta coefficient = 0.043, 95% CI [0.001, 0.085]) and greater light exposure (Beta coefficient = 0.038, [0.013, 0.063]), whereas no significant effects were found for smartphone use. These findings suggest that passive smartphone sensing could serve as a powerful, ecologically valid methodology for studying the dynamics of affective behavior and open the door to understanding this behavior at a population scale.
[35] MPath: Multimodal Pathology Report Generation from Whole Slide Images cs.CV | cs.LGPDF
Noorul Wahab, Nasir Rajpoot
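TL;DR: MPath is a lightweight multimodal framework for automatically generating diagnostic pathology reports from whole slide images (WSIs). Through a learned visual-prefix prompting mechanism, it embeds WSI-derived visual features into a pretrained biomedical language model (BioBART), avoiding end-to-end vision-language pretraining and improving stability and data efficiency.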
Details
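Motivation: Generating clinically coherent pathology reports directly from high-resolution whole slide images is difficult: large variability in tissue morphology and the complex structure of pathology narratives make it hard for conventional methods to translate visual patterns into text accurately.
Result: Developed and evaluated on the RED 2025 Grand Challenge dataset, MPath ranked 4th in Test Phase 2 despite limited submission opportunities; the results show the potential of prompt-based multimodal conditioning.
Insight: The novelty lies in a visual-prefix prompting mechanism that injects foundation-model WSI features into a frozen language model, enabling scalable and interpretable pathology report generation without expensive end-to-end pretraining.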
Abstract: Automated generation of diagnostic pathology reports directly from whole slide images (WSIs) is an emerging direction in computational pathology. Translating high-resolution tissue patterns into clinically coherent text remains difficult due to large morphological variability and the complex structure of pathology narratives. We introduce MPath, a lightweight multimodal framework that conditions a pretrained biomedical language model (BioBART) on WSI-derived visual embeddings through a learned visual-prefix prompting mechanism. Instead of end-to-end vision-language pretraining, MPath leverages foundation-model WSI features (CONCH + Titan) and injects them into BioBART via a compact projection module, keeping the language backbone frozen for stability and data efficiency. MPath was developed and evaluated on the RED 2025 Grand Challenge dataset and ranked 4th in Test Phase 2, despite limited submission opportunities. The results highlight the potential of prompt-based multimodal conditioning as a scalable and interpretable strategy for pathology report generation.
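The idea of visual-prefix prompting can be sketched in a few lines. The abstract only mentions a "compact projection module", so the single linear layer below, the function names `visual_prefix` and `prepend_prefix`, and the choice of one slide-level embedding as input are hypothetical; the sketch shows the general pattern of turning a visual embedding into a few prefix vectors placed in front of the text token embeddings of a frozen language model.

```python
def visual_prefix(visual_embedding, weight, bias, n_prefix):
    """Linearly project one visual embedding into n_prefix prefix vectors.

    weight has n_prefix * d_model rows, each as long as visual_embedding;
    a single linear layer is an assumption, not the paper's stated module.
    """
    flat = [
        sum(w * x for w, x in zip(row, visual_embedding)) + b
        for row, b in zip(weight, bias)
    ]
    d_model = len(flat) // n_prefix
    # Split the flat projection into n_prefix vectors of size d_model.
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_prefix)]


def prepend_prefix(prefix, token_embeddings):
    """Place the prefix vectors in front of the text token embeddings."""
    return prefix + token_embeddings
```

In this pattern only `weight` and `bias` would be trained, which is consistent with the abstract's point that the language backbone stays frozen.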
[36] MONET – Virtual Cell Painting of Brightfield Images and Time Lapses Using Reference Consistent Diffusion cs.CV | cs.AIPDF
Alexander Peysakhovich, William Berman, Joseph Rufo, Felix Wong, Maxwell Z. Wilson
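TL;DR: This paper presents MONET, a diffusion model that predicts cell painting channels from brightfield images, enabling virtual cell painting. The approach addresses the labor intensity of conventional cell painting and its inability to capture cell dynamics, and shows that quality improves with model scale.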
Details
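Motivation: Conventional cell painting is labor-intensive, requires chemical fixation, and cannot be used to study cell dynamics; an AI model is used to convert brightfield images automatically into high-contrast images of cell morphology.
Result: Model quality improves with scale. A consistency architecture generates time-lapse videos despite the absence of corresponding training data, and it also supports in-context learning, allowing partial generalization to out-of-distribution cell lines and imaging protocols.
Insight: The novelty lies in using a consistency-based diffusion model to generate cell painting channels and time-lapse videos from brightfield images, achieving dynamic prediction without paired video data and demonstrating in-context learning and partial cross-domain generalization.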
Abstract: Cell painting is a popular technique for creating human-interpretable, high-contrast images of cell morphology. There are two major issues with cell paint: (1) it is labor-intensive and (2) it requires chemical fixation, making the study of cell dynamics impossible. We train a diffusion model (Morphological Observation Neural Enhancement Tool, or MONET) on a large dataset to predict cell paint channels from brightfield images. We show that model quality improves with scale. The model uses a consistency architecture to generate time-lapse videos, despite the impossibility of obtaining cell paint video training data. In addition, we show that this architecture enables a form of in-context learning, allowing the model to partially transfer to out-of-distribution cell lines and imaging protocols. Virtual cell painting is not intended to replace physical cell painting completely, but to act as a complementary tool enabling novel workflows in biological research.
[37] CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction cs.CVPDF
Xianghui Xie, Bowen Wen, Yan Chang, Hesam Rabeti, Jiefeng Li
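TL;DR: CARI4D is a category-agnostic 4D reconstruction method that recovers spatially and temporally consistent, metric-scale human-object interaction from monocular RGB video. It integrates predictions from foundation models, jointly refines them through a learned render-and-compare paradigm, and reasons about intricate contacts to satisfy physical constraints, overcoming depth ambiguity, occlusion, and complex motion.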
Details
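Motivation: Inferring 4D interactions from a single RGB view is highly challenging because of unknown object and human information, depth ambiguity, occlusion, and complex motion. Previous methods typically rely on ground-truth object templates or are restricted to a limited set of object categories, which limits their generality. This work aims for a category-agnostic method that achieves consistent 4D reconstruction from ordinary RGB video.
Result: Experiments show that the method reduces reconstruction error relative to prior methods by 38% on an in-distribution dataset and by 36% on an unseen dataset. The model generalizes beyond the training categories and can be applied zero-shot to in-the-wild internet videos.
Insight: The novelty is a robust pose hypothesis selection algorithm that integrates the predictions of multiple foundation models, followed by joint refinement through a learned render-and-compare paradigm that ensures spatial, temporal, and pixel alignment. The method further refines the result by reasoning about intricate contacts to satisfy physical constraints, achieving category-agnostic 4D reconstruction.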
Abstract: Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal, and pixel alignment, and finally reason about intricate contacts for further refinement that satisfies physical constraints. Experiments show that our method outperforms prior art by 38% on an in-distribution dataset and 36% on an unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.
[38] V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions cs.CV | cs.AI | cs.LGPDF
Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li
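TL;DR: This paper presents V-REX, an evaluation suite for assessing vision-language models on multi-step exploratory visual reasoning tasks. V-REX decomposes complex open-ended tasks into a chain of questions and evaluates planning and following abilities separately, enabling fine-grained quantitative analysis of intermediate steps.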
Details
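Motivation: Current vision-language models answer well-defined questions well but struggle with complex open-ended tasks that require multiple rounds of exploration and reasoning in the visual space. Because the exploration space of intermediate steps is large, evaluating such visual thinking paths is challenging.
Result: Evaluating state-of-the-art proprietary and open-source vision-language models, V-REX reveals consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
Insight: The novelty lies in modeling multi-step exploratory reasoning as a Chain-of-Questions and decoupling it into two independently evaluable sub-abilities, planning and following. By curating a finite set of question and answer options for each step, the approach enables reliable, quantitative, and fine-grained analysis of intermediate steps, providing a new framework for evaluating complex visual reasoning.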
Abstract: While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, “Visual Reasoning with multi-step EXploration (V-REX)”, which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs’ capability into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-sourced VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
[39] Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus cs.CV | cs.AI | cs.CL | cs.ROPDF
Antonio Guillen-Perez
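TL;DR: This paper presents Semantic-Drive, a local-first, neuro-symbolic framework for automatically mining training data for long-tail, safety-critical events (such as jaywalking and construction diversions) from massive video logs. It decouples perception into two stages: symbolic grounding with a real-time open-vocabulary detector to anchor attention, followed by cognitive analysis with a reasoning vision-language model. To mitigate hallucination, the system uses a "System 2" inference-time alignment strategy and a multi-model "Judge-Scout" consensus mechanism.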
Details
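Motivation: Autonomous driving development is bottlenecked by the scarcity of training data for "long-tail" safety-critical events. Existing solutions are limited: coarse metadata search lacks precision, and cloud-based vision-language models are privacy-invasive and expensive.
Result: Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a recall of 0.966 (versus 0.475 for CLIP) and reduces risk assessment error by 40% compared to single models. The system runs entirely on consumer hardware (NVIDIA RTX 3090).
Insight: The novelty is a local-first neuro-symbolic framework that decouples open-vocabulary detection from the cognitive analysis of a reasoning VLM and applies inference-time alignment through a multi-model consensus mechanism to mitigate hallucination. This offers a new approach to privacy-preserving, efficient long-tail data mining.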
Abstract: The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of “Long-Tail” training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a “System 2” inference-time alignment strategy, utilizing a multi-model “Judge-Scout” consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to single models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
[40] Exploring Spatial-Temporal Representation via Star Graph for mmWave Radar-based Human Activity Recognition cs.CV | cs.LG | eess.IVPDF
Senhao Gao, Junqing Zhang, Luoyu Mei, Shuai Wang, Xuyu Wang
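TL;DR: This paper proposes a method for human activity recognition (HAR) from mmWave radar point clouds based on a star graph representation and a discrete dynamic graph neural network (DDGNN). A star graph built around a manually added static center point captures the high-dimensional spatial-temporal relationships of dynamic radar points across consecutive frames, addressing the sparsity and variable size of the point clouds and outperforming baseline methods on real-world datasets.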
Details
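Motivation: mmWave radar point clouds used for HAR are sparse and variable in size, yet existing methods usually borrow preprocessing algorithms designed for dense, vision-based point clouds, which may not be optimal. This work aims to design a spatial-temporal feature representation better suited to mmWave radar systems.
Result: Experiments on real-world HAR datasets show that the method outperforms other baselines, reaching an overall classification accuracy of 94.27%, close to that of a method using vision-based skeleton data (97.25%). An inference test on a Raspberry Pi 4 confirms its effectiveness on resource-constrained platforms, and it outperforms three recent radar-specific methods without requiring resampling or frame aggregators.
Insight: The novelty is a star graph representation that models the high-dimensional relative relationships between a manually added static center point and the dynamic radar points, combined with a DDGNN that handles variable-size graph structures. This extracts spatial-temporal features from sparse point clouds more effectively and offers a new representation-learning approach for radar-specific systems.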
Abstract: Human activity recognition (HAR) requires extracting accurate spatial-temporal features of human movements. A mmWave radar point cloud-based HAR system suffers from sparsity and variable-size problems due to the physical features of the mmWave signal. Existing works usually borrow the preprocessing algorithms for the vision-based systems with dense point clouds, which may not be optimal for mmWave radar systems. In this work, we proposed a graph representation with a discrete dynamic graph neural network (DDGNN) to explore the spatial-temporal representation of human movement-related features. Specifically, we designed a star graph to describe the high-dimensional relative relationship between a manually added static center point and the dynamic mmWave radar points in the same and consecutive frames. We then adopted DDGNN to learn the features residing in the star graph with variable sizes. Experimental results demonstrated that our approach outperformed other baseline methods using real-world HAR datasets. Our system achieved an overall classification accuracy of 94.27%, approaching the 97.25% accuracy obtained with vision-based skeleton data. We also conducted an inference test on Raspberry Pi 4 to demonstrate its effectiveness on resource-constrained platforms. We provided a comprehensive ablation study for variable DDGNN structures to validate our model design. Our system also outperformed three recent radar-specific methods without requiring resampling or frame aggregators.
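The star graph construction can be sketched for a single frame as follows. This is a simplified illustration, not the authors' code: the center position `(0, 0, 0)`, the use of coordinate offsets as edge features, and the function name `build_star_graph` are assumptions, and the edges that link consecutive frames in the paper are not shown.

```python
def build_star_graph(points, center=(0.0, 0.0, 0.0)):
    """Build a star graph for one radar frame with any number of points.

    Node 0 is the manually added static center; nodes 1..N are radar points.
    Each edge (0, i) carries the offset of point i relative to the center.
    """
    nodes = [tuple(center)] + [tuple(p) for p in points]
    edges = [(0, i) for i in range(1, len(nodes))]
    edge_features = [
        tuple(coord - origin for coord, origin in zip(nodes[i], center))
        for _, i in edges
    ]
    return nodes, edges, edge_features
```

Because every frame produces one edge per detected point, frames of different sizes are handled without resampling to a fixed number of points.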
[41] Enhancing deep learning performance on burned area delineation from SPOT-6/7 imagery for emergency management cs.CVPDF
Maria Rodriguez, Minh-Tan Pham, Martin Sudmanns, Quentin Poterek, Oscar Narvaez
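TL;DR: This study proposes a supervised semantic segmentation workflow to improve the performance and efficiency of burned area (BA) delineation from very-high-resolution SPOT-6/7 imagery under the time constraints of emergency management. Experiments with U-Net and SegFormer show comparable performance with limited training data, though SegFormer requires more resources. Adding land cover data as an auxiliary task improves robustness, while test-time augmentation (TTA) improves performance at the cost of inference time, which optimizations such as mixed precision can mitigate.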
Details
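Motivation: Current BA delineation methods typically rely on computer vision models trained on post-event remote sensing imagery but overlook their applicability to time-critical emergency management scenarios.
Result: Evaluated by Dice score, Intersection over Union (IoU), and inference time, U-Net and SegFormer perform similarly with limited training data; SegFormer requires more resources, which hinders emergency use; a land cover auxiliary task improves robustness without increasing inference time; and TTA improves delineation performance but increases inference time.
Insight: The novelty lies in optimizing the BA delineation pipeline for emergency management, emphasizing the balance between efficiency and performance. Using land cover as an auxiliary task to improve robustness at low cost, and examining the trade-offs of TTA and optimizations such as mixed precision in time-sensitive tasks, yields practical insights for emergency remote sensing applications.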
Abstract: After a wildfire, delineating burned areas (BAs) is crucial for quantifying damages and supporting ecosystem recovery. Current BA mapping approaches rely on computer vision models trained on post-event remote sensing imagery, but often overlook their applicability to time-constrained emergency management scenarios. This study introduces a supervised semantic segmentation workflow aimed at boosting both the performance and efficiency of BA delineation. It targets SPOT-6/7 imagery due to its very high resolution and on-demand availability. Experiments are evaluated based on Dice score, Intersection over Union, and inference time. The results show that U-Net and SegFormer models perform similarly with limited training data. However, SegFormer requires more resources, challenging its practical use in emergencies. Incorporating land cover data as an auxiliary task enhances model robustness without increasing inference time. Lastly, Test-Time Augmentation improves BA delineation performance but raises inference time, which can be mitigated with optimization methods like Mixed Precision.
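The two segmentation metrics named above have standard definitions: Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|. A minimal sketch for binary masks follows; representing masks as nested lists and scoring two empty masks as 1.0 are conventions chosen here for illustration, not details taken from the study.

```python
def dice_and_iou(pred, target):
    """Dice score and IoU for two binary masks of equal shape (nested lists of 0/1)."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treated as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

For the same pair of masks Dice is never lower than IoU, so the two scores should be compared only with like against like.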
[42] CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos cs.CV | cs.LG | cs.MMPDF
Tejas Panambur, Ishan Rajendrakumar Dave, Chongjian Ge, Ersin Yumer, Xue Bai
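TL;DR: This paper proposes CreativeVR, a diffusion-prior-guided video restoration framework targeting severe structural and temporal artifacts in AI-generated and real videos. A deep-adapter architecture with a single precision knob trades off smoothly between standard degradation restoration and strong structure- and motion-corrective behavior, and a temporally coherent degradation module is introduced for training.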
Details
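Motivation: Existing text-to-video diffusion models are brittle at fine-scale structure, often producing distorted faces, hands, and backgrounds and temporally inconsistent motion; similar severe structural artifacts appear in low-quality real videos. Classical video restoration and super-resolution methods target only synthetic degradations (such as blur and downsampling), while diffusion-prior restorers are usually trained on photometric noise and struggle to balance perceptual quality and fidelity.
Result: On the proposed AIGC54 benchmark (with FIQA, semantic, and perceptual metrics and multi-aspect scoring), CreativeVR achieves state-of-the-art (SOTA) results on videos with severe artifacts and is competitive on standard video restoration benchmarks, while running at a practical throughput of about 13 FPS at 720p on a single 80-GB A100.
Insight: Contributions include: 1) a deep-adapter-based framework whose single precision knob controls how strongly the model follows the input; 2) a temporally coherent degradation module for training that simulates realistic structural failures through carefully designed transformations; and 3) the AIGC54 benchmark for evaluating AIGC-artifact restoration. The method offers a general solution for structure and motion restoration in generated and real videos that balances fidelity and perceptual quality.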
Abstract: Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity. We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures. To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (about 13 FPS at 720p on a single 80-GB A100). Project page: https://daveishan.github.io/creativevr-webpage/.
[43] BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models cs.CV | cs.LGPDF
Ryan Po, Eric Ryan Chan, Changan Chen, Gordon Wetzstein
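TL;DR: This paper proposes BAgger (Backwards Aggregation), a self-supervised scheme for mitigating exposure bias in autoregressive video diffusion models. It builds corrective trajectories from the model's own rollouts and trains the model to recover from its mistakes, reducing quality drift in long-horizon generation.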
Details
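Motivation: Autoregressive video models perform world modeling via next-frame prediction but suffer from exposure bias at inference: training uses clean contexts, whereas inference relies on self-generated frames, so errors accumulate and quality drifts over time.
Result: BAgger is instantiated on causal diffusion transformers and evaluated on text-to-video, video extension, and multi-prompt generation, showing more stable long-horizon motion and better visual consistency with reduced drift.
Insight: The novelty is a self-supervised backwards aggregation scheme that avoids the conventional distillation and distribution-matching losses that depend on large teacher models or long-chain backpropagation through time, training instead with standard score or flow matching objectives, which helps preserve generation quality and diversity.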
Abstract: Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model’s own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.
[44] RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer cs.CVPDF
Guanfang Dong, Luke Schultz, Negar Hassanpour, Chao Gao
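TL;DR: This paper proposes RePack, a framework that addresses the information overload caused by injecting high-dimensional features from pretrained vision foundation models (VFMs) into Diffusion Transformers (DiTs). It projects the high-dimensional VFM features onto low-dimensional manifolds to produce a more compact, decoder-friendly representation, accelerating DiT convergence and improving image generation quality.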
Details
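Motivation: When the rich semantics of pretrained VFMs (such as DINOv3) are used to enhance latent diffusion models, their high-dimensional features may exceed the size of the original image for decoding, causing information overload and hurting model performance.
Result: On DiT-XL/2, RePack reaches an FID of 3.66 in only 64 epochs, converging 35% faster than the current state-of-the-art method, and outperforms methods that directly inject raw VFM features on image reconstruction.
Insight: The novelty is a simple representation compression framework that filters non-semantic noise through low-dimensional projection while preserving core structural information, balancing the utility of VFM features against the side effects of their dimensionality and offering a new approach to efficient semantic injection in diffusion models.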
Abstract: The superior representation capability of pre-trained vision foundation models (VFMs) has been harnessed for enhancing latent diffusion models (LDMs). These approaches inject the rich semantics from high-dimensional VFM representations (e.g., DINOv3) into LDMs at different phases, resulting in accelerated learning and better generation performance. However, the high-dimensionality of VFM representations may also lead to Information Overload, particularly when the VFM features exceed the size of the original image for decoding. To address this issue while preserving the utility of VFM features, we propose RePack (Representation Packing), a simple yet effective framework for improving Diffusion Transformers (DiTs). RePack transforms the VFM representation into a more compact, decoder-friendly representation by projecting onto low-dimensional manifolds. We find that RePack can effectively filter out non-semantic noise while preserving the core structural information needed for high-fidelity reconstruction. Experimental results show that RePack significantly accelerates DiT convergence and outperforms recent methods that directly inject raw VFM features into the decoder for image reconstruction. On DiT-XL/2, RePack achieves an FID of 3.66 in only 64 epochs, which is 35% faster than the state-of-the-art method. This demonstrates that RePack successfully extracts the core semantics of VFM representations while bypassing their high-dimensionality side effects.
[45] VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering cs.CV | cs.CLPDF
Zihu Wang, Boxun Xu, Yuxuan Xia, Peng Li
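TL;DR: This paper proposes VEGAS, which injects the vision encoder's attention maps into the middle layers of the language model in a large vision-language model (LVLM) and adaptively steers tokens that fail to concentrate on key image objects, effectively reducing hallucinations (outputs that are factually inconsistent with the visual evidence).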
Details
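Motivation: LVLMs excel at joint reasoning over visual and textual inputs but often produce hallucinated outputs that are linguistically fluent yet factually inconsistent with the visual evidence. Existing work has not established what form of visual attention can effectively suppress hallucinations during decoding.
Result: Extensive experiments across multiple benchmarks show that VEGAS consistently achieves state-of-the-art (SOTA) performance in reducing hallucinations.
Insight: The key finding is that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder's own more concentrated attention maps substantially reduce hallucinations. Further analysis shows that vision-text conflicts peak in the language model's middle layers, motivating injection of the vision encoder's attention maps into these layers at inference time with adaptive steering. The method provides a simple, effective attention-guided inference-time intervention that improves factual consistency without retraining the model.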
Abstract: Large vision-language models (LVLMs) exhibit impressive ability to jointly reason over visual and textual inputs. However, they often produce outputs that are linguistically fluent but factually inconsistent with the visual evidence, i.e., they hallucinate. Despite growing efforts to mitigate such hallucinations, a key question remains: what form of visual attention can effectively suppress hallucinations during decoding? In this work, we provide a simple answer: the vision encoder’s own attention map. We show that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder’s more concentrated attention maps substantially reduce hallucinations. To further investigate the cause, we analyze vision-text conflicts during decoding and find that these conflicts peak in the language model’s middle layers. Injecting the vision encoder’s attention maps into these layers effectively suppresses hallucinations. Building on these insights, we introduce VEGAS, a simple yet effective inference-time method that integrates the vision encoder’s attention maps into the language model’s mid-layers and adaptively steers tokens which fail to concentrate on key image objects. Extensive experiments across multiple benchmarks demonstrate that VEGAS consistently achieves state-of-the-art performance in reducing hallucinations.
[46] SPDMark: Selective Parameter Displacement for Robust Video Watermarking cs.CV | cs.CR | cs.LGPDF
Samar Fares, Nurbek Tastan, Karthik Nandakumar
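TL;DR: This paper proposes SPDMark, a novel watermarking framework for video generation that embeds watermarks in a video diffusion model through selective parameter displacement. It uses low-rank adaptation (LoRA) to implement layer-wise basis shifts and a cryptographic hash function to derive frame-specific watermark messages, supporting detection of tampered videos and recovery of frame order.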
Details
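Motivation: Existing video watermarking methods, both post-hoc and in-generation, fail to achieve imperceptibility, robustness, and computational efficiency simultaneously, while the rise of high-quality video generation models creates an urgent need for robust watermarking schemes that can reliably detect and track the provenance of generated videos.
Result: Evaluations on text-to-video and image-to-video generation models show that SPDMark produces imperceptible watermarks that can be recovered with high accuracy and is robust to a variety of common video modifications.
Insight: The novelty is an in-generation watermarking framework based on selective parameter displacement, which models watermark embedding as an additive composition of layer-wise basis shifts and uses LoRA for parameter efficiency. Combining cryptographic hashing with maximum bipartite matching enables robust detection of temporal tampering and recovery of frame order.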
Abstract: The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced ‘SpeedMark’) based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.
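The last two steps, hash-derived per-frame messages and frame-order recovery by bipartite matching, can be sketched with the standard library. Everything specific here is an assumption for illustration: SHA-256 over the key concatenated with the frame index, 64-bit messages, a Hamming-distance threshold for candidate edges, and the names `frame_message` and `recover_frame_order`. The sketch also assumes the per-frame bits have already been extracted, which in the paper is the job of a learned extractor.

```python
import hashlib


def frame_message(key: bytes, frame_index: int, n_bits: int = 64):
    """Derive a per-frame bit list by hashing the base key with the frame index."""
    digest = hashlib.sha256(key + frame_index.to_bytes(4, "big")).digest()
    bits = [(byte >> shift) & 1 for byte in digest for shift in range(7, -1, -1)]
    return bits[:n_bits]  # SHA-256 yields 256 bits, so n_bits must be <= 256


def recover_frame_order(extracted, key: bytes, n_frames: int, max_errors: int = 4):
    """Map each received frame to its original index via maximum bipartite matching.

    An edge links a received frame to a candidate index when the Hamming
    distance between its extracted bits and the expected message is at most
    max_errors. Matching uses augmenting paths (Kuhn's algorithm).
    """
    expected = [frame_message(key, i, len(extracted[0])) for i in range(n_frames)]
    candidates = [
        [i for i, exp in enumerate(expected)
         if sum(a != b for a, b in zip(msg, exp)) <= max_errors]
        for msg in extracted
    ]
    owner = {}  # original frame index -> position in the received video

    def try_assign(pos, seen):
        for idx in candidates[pos]:
            if idx in seen:
                continue
            seen.add(idx)
            # Take a free index, or re-route the frame currently holding it.
            if idx not in owner or try_assign(owner[idx], seen):
                owner[idx] = pos
                return True
        return False

    for pos in range(len(extracted)):
        try_assign(pos, set())

    order = [None] * len(extracted)  # None marks frames that matched nothing
    for idx, pos in owner.items():
        order[pos] = idx
    return order
```

A received frame whose bits match no expected message stays `None`, which is one simple way to flag an inserted or heavily altered frame; dropped frames show up as original indices missing from the returned order.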
[47] EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography cs.CVPDF
Yuheng Li, Yue Zhang, Abdoul Aziz Amadou, Yuxiang Lai, Jike Zhong
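TL;DR: This paper proposes EchoVLM, a measurement-grounded multimodal vision-language model for echocardiography interpretation, and builds EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset. With novel pretraining objectives, including a view-informed contrastive loss and a negation-aware contrastive loss, the model achieves state-of-the-art performance on a range of clinical tasks.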
Details
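Motivation: Echocardiography interpretation is labor-intensive and inherently multimodal, and the potential of existing vision-language models has been limited by the lack of large-scale, clinically grounded datasets that include measurement-based reasoning.
Result: Across 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves SOTA performance, for example 86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification.
Insight: The novelty lies in building the first measurement-grounded multimodal echocardiography dataset and designing view-informed and negation-aware contrastive losses that encode the view-dependent structure of echocardiographic imaging and distinguish clinically critical negative from positive findings. This yields transferable visual representations and a foundation model for end-to-end echocardiography interpretation.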
Abstract: Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.
[48] Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video cs.CVPDF
Daniel Adebi, Sagnik Majumder, Kristen Grauman
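TL;DR: This paper proposes an audio-visual framework that uses passive scene sounds to improve relative camera pose estimation. It integrates direction-of-arrival spectra and binauralized embeddings into a state-of-the-art vision-only model, providing complementary cues when visual information is degraded and improving pose estimation performance.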
Details
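Motivation: Vision-only methods for camera pose estimation are limited under visually degraded conditions such as motion blur or occlusion; passive scene sounds are used as complementary information.
Result: On two large datasets, the method achieves consistent gains over strong visual baselines and remains robust when visual information is corrupted.
Insight: According to the authors, this is the first work to successfully leverage audio from real-world videos for relative camera pose estimation, using everyday audio as an unexpected but effective complementary cue for a classic spatial problem.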
Abstract: Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.
[49] SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation cs.CVPDF
Xuancheng Xu, Yaning Li, Sisi You, Bing-Kun Bao
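TL;DR: SMRABooth proposes a new method for customized video generation that uses a self-supervised encoder and an optical flow encoder to extract object-level subject appearance and motion representations, and aligns these representations during LoRA fine-tuning. The method has three stages: subject representations guide subject alignment, optical flow representations capture appearance-independent motion trajectories, and a subject-motion association decoupling strategy reduces interference. Experiments show that SMRABooth performs well at preserving subject appearance similarity and motion pattern consistency.
Details
Motivation: Existing customized video generation methods struggle to ensure both subject appearance similarity and motion pattern consistency, mainly because they lack object-level guidance for subject and motion.
Result: Extensive experiments show that SMRABooth excels at subject and motion customization, maintaining consistent subject appearance and motion patterns, which demonstrates its effectiveness in controllable text-to-video generation.
Insight: The novelty lies in using a self-supervised encoder and an optical flow encoder to provide object-level subject and motion representations, together with a subject-motion association decoupling strategy (sparse LoRA injection across locations and timing) that reduces interference and better aligns subject appearance with motion patterns.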
Abstract: Customized video generation aims to produce videos that faithfully preserve the subject’s appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages a self-supervised encoder and an optical flow encoder to provide object-level subject and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture the overall structure of the subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent, object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRA injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.
[50] A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection cs.CVPDF
Peizheng Li, Ioannis Mavromatis, Ajith Sahadevan, Tim Farnham, Adnan Aijaz
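TL;DR: This paper presents a large-scale, longitudinal dataset of urban streetlight imagery, containing over 526,000 images captured hourly from 2021 to 2025 by 22 fixed-angle cameras deployed in Bristol, U.K., under varied lighting, weather, and seasonal conditions. The dataset comes with rich metadata and a self-supervised framework based on convolutional variational autoencoders (CNN-VAEs) for analyzing visual drift and detecting anomalies.
Details
Motivation: Smart city deployments lack real-world, fine-grained benchmarks for evaluating the long-term stability of vision models, drift detection, and MLOps strategies.
Result: The dataset serves as a real-world, fine-grained benchmark for evaluating long-term model stability, drift-aware learning, and deployment-ready vision systems; the paper does not report specific quantitative comparisons or SOTA results.
Insight: The novelty lies in a unique longitudinal urban street-scene dataset that supports detailed study of visual drift, together with a CNN-VAE-based self-supervised framework and two per-sample drift metrics (relative centroid drift and relative reconstruction error) that capture latent space deviation and image-domain degradation.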
Abstract: We present a large-scale, longitudinal visual dataset of urban streetlights captured by 22 fixed-angle cameras deployed across Bristol, U.K., from 2021 to 2025. The dataset contains over 526,000 images, collected hourly under diverse lighting, weather, and seasonal conditions. Each image is accompanied by rich metadata, including timestamps, GPS coordinates, and device identifiers. This unique real-world dataset enables detailed investigation of visual drift, anomaly detection, and MLOps strategies in smart city deployments. To promote secondary analysis, we additionally provide a self-supervised framework based on convolutional variational autoencoders (CNN-VAEs). Models are trained separately for each camera node and for day/night image sets. We define two per-sample drift metrics: relative centroid drift, capturing latent space deviation from a baseline quarter, and relative reconstruction error, measuring normalized image-domain degradation. This dataset provides a realistic, fine-grained benchmark for evaluating long-term model stability, drift-aware learning, and deployment-ready vision systems. The images and structured metadata are publicly released in JPEG and CSV formats, supporting reproducibility and downstream applications such as streetlight monitoring, weather inference, and urban scene understanding. The dataset can be found at https://doi.org/10.5281/zenodo.17781192 and https://doi.org/10.5281/zenodo.17859120.
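The two per-sample drift metrics named in this abstract can be illustrated with a small sketch. The abstract does not give formulas, so the normalizations below are assumptions: centroid drift is divided by the mean distance of baseline-quarter latents to their own centroid, and reconstruction error is divided by the mean baseline error. All function names are hypothetical and not taken from the released code.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length latent vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_centroid_drift(z, baseline_latents):
    """Distance of latent z from the baseline centroid, divided by the
    mean baseline distance to that centroid (assumed normalization)."""
    c = centroid(baseline_latents)
    mean_spread = sum(euclidean(b, c) for b in baseline_latents) / len(baseline_latents)
    return euclidean(z, c) / mean_spread

def relative_reconstruction_error(error, baseline_errors):
    """Per-sample reconstruction error divided by the mean baseline error
    (assumed normalization)."""
    return error / (sum(baseline_errors) / len(baseline_errors))

# Toy baseline quarter: four 2-D latents around the point (1, 1).
baseline = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
print(relative_centroid_drift([1.0, 1.0], baseline))   # sample at the centroid
print(relative_centroid_drift([3.0, 1.0], baseline))   # sample outside the baseline cloud
print(relative_reconstruction_error(0.2, [0.1, 0.1]))  # error twice the baseline mean
```

Under these assumed definitions, values near or below 1 indicate a sample that looks like the baseline quarter, and larger values indicate drift in latent space or degraded reconstruction.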
[51] ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB cs.CV | cs.AI | cs.LGPDF
Jeongjun Park, Sunwook Hwang, Hyeonho Noh, Jin Mo Yang, Hyun Jong Yang
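TL;DR: To detect distracted driving, this paper proposes a driver activity recognition (DAR) method based on IR-UWB radar. To address the lack of UWB datasets and the fixed input size of ViT models, the authors release ALERT, a large-scale dataset collected in real driving scenarios, and propose an input-size-agnostic Vision Transformer (ISA-ViT) framework that adapts to radar data of non-standard size by adjusting patch configurations and reusing pre-trained positional embedding vectors, combined with a domain fusion strategy to improve classification performance.
Details
Motivation: Two challenges limit the use of IR-UWB radar for distracted driving detection: the lack of large-scale real-world UWB datasets covering diverse distracted behaviors, and the difficulty of adapting fixed-input ViTs to radar data with non-standard dimensions.
Result: On UWB-based DAR, ISA-ViT achieves a 22.68% accuracy improvement over an existing ViT-based approach; the public release of the ALERT dataset and the detailed strategy is intended to facilitate real-world deployment.
Insight: The contributions are the release of ALERT, described as the first large-scale UWB dataset from real driving, and the input-size-agnostic ISA-ViT framework, which preserves radar-specific information (such as Doppler shifts and phase characteristics) and uses a domain fusion strategy to improve adaptation to non-standard input sizes and classification performance.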
Abstract: Distracted driving contributes to fatal crashes worldwide. To address this, researchers are using driver activity recognition (DAR) with impulse radio ultra-wideband (IR-UWB) radar, which offers advantages such as interference resistance, low power consumption, and privacy preservation. However, two challenges limit its adoption: the lack of large-scale real-world UWB datasets covering diverse distracted driving behaviors, and the difficulty of adapting fixed-input Vision Transformers (ViTs) to UWB radar data with non-standard dimensions. This work addresses both challenges. We present the ALERT dataset, which contains 10,220 radar samples of seven distracted driving activities collected in real driving conditions. We also propose the input-size-agnostic Vision Transformer (ISA-ViT), a framework designed for radar-based DAR. The proposed method resizes UWB data to meet ViT input requirements while preserving radar-specific information such as Doppler shifts and phase characteristics. By adjusting patch configurations and leveraging pre-trained positional embedding vectors (PEVs), ISA-ViT overcomes the limitations of naive resizing approaches. In addition, a domain fusion strategy combines range- and frequency-domain features to further improve classification performance. Comprehensive experiments demonstrate that ISA-ViT achieves a 22.68% accuracy improvement over an existing ViT-based approach for UWB-based DAR. By publicly releasing the ALERT dataset and detailing our input-size-agnostic strategy, this work facilitates the development of more robust and scalable distracted driving detection systems for real-world deployment.
[52] A Hybrid Deep Learning Framework for Emotion Recognition in Children with Autism During NAO Robot-Mediated Interaction cs.CV | cs.ROPDF
Indranil Bhattacharjee, Vartika Narayani Srinet, Anirudha Bhattacharjee, Braj Bhushan, Bishakh Bhattacharya
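TL;DR: This study proposes a novel deep learning framework for recognizing, in a controlled experimental setting, the emotional responses of children with autism spectrum disorder to a name-calling event by the humanoid robot NAO. The framework combines a fine-tuned ResNet-50-based convolutional neural network with a three-layer graph convolutional network built on MediaPipe FaceMesh landmarks, using both visual and geometric features. Probabilistic soft labels are generated by a weighted ensemble of the DeepFace and FER models, and final classification uses a fused embedding optimized with KL divergence.
Details
Motivation: Understanding the emotional responses of autistic children during social interaction is a key challenge in developmental psychology and human-robot interaction; the work aims to fill a gap in autism-specific HRI research.
Result: On a dataset of about 50,000 facial frames extracted from videos of 15 children with ASD, the method shows robust performance in modeling subtle affective responses and holds promise for affective analysis in clinical and therapeutic human-robot interaction.
Insight: The novelty lies in a hybrid deep learning framework combining a CNN and a GCN, designed to handle micro-emotional cues in autistic children, and in training with probabilistic soft labels generated by a weighted multi-model ensemble to cope with the complexity of ASD emotion recognition. It is described as the first large-scale, real-world dataset and pipeline from India for autism-focused emotion analysis using social robotics.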
Abstract: Understanding emotional responses in children with Autism Spectrum Disorder (ASD) during social interaction remains a critical challenge in both developmental psychology and human-robot interaction. This study presents a novel deep learning pipeline for emotion recognition in autistic children in response to a name-calling event by a humanoid robot (NAO), under controlled experimental settings. The dataset comprises around 50,000 facial frames extracted from video recordings of 15 children with ASD. A hybrid model combines a fine-tuned ResNet-50-based Convolutional Neural Network (CNN) and a three-layer Graph Convolutional Network (GCN), trained on both visual and geometric features extracted from MediaPipe FaceMesh landmarks. Emotions were probabilistically labeled using a weighted ensemble of two models, DeepFace and FER, each contributing to soft-label generation across seven emotion classes. Final classification leveraged a fused embedding optimized via Kullback-Leibler divergence. The proposed method demonstrates robust performance in modeling subtle affective responses and offers significant promise for affective profiling of ASD children in clinical and therapeutic human-robot interaction contexts, as the pipeline effectively captures micro emotional cues in neurodivergent children, addressing a major gap in autism-specific HRI research. This work represents the first such large-scale, real-world dataset and pipeline from India on autism-focused emotion analysis using social robotics, contributing an essential foundation for future personalized assistive technologies.
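The soft-label step described in this abstract can be sketched as a weighted mixture of two per-frame emotion distributions, scored against a model prediction with Kullback-Leibler divergence. This is a minimal illustration only: the ensemble weight, the function names, and the three-class toy distributions are assumptions, and nothing here calls the actual DeepFace or FER libraries.

```python
import math

def ensemble_soft_label(probs_a, probs_b, weight_a=0.5):
    """Weighted mixture of two class-probability vectors, renormalized to sum to 1.
    weight_a is a hypothetical ensemble weight for the first model."""
    mixed = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]
    total = sum(mixed)
    return [m / total for m in mixed]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(t * math.log(t / max(q, eps))
               for t, q in zip(target, predicted) if t > 0.0)

# Toy example with three classes instead of the seven used in the paper.
label = ensemble_soft_label([0.7, 0.2, 0.1], [0.5, 0.4, 0.1], weight_a=0.6)
prediction = [0.6, 0.3, 0.1]
print(label)
print(kl_divergence(label, prediction))
```

Training against soft labels like these lets a classifier inherit the ensemble's uncertainty on ambiguous frames instead of committing to a single hard class.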
[53] CineLOG: A Training Free Approach for Cinematic Long Video Generation cs.CVPDF
Zahra Dehghanian, Morteza Abolghasemi, Hamid Beigy, Hamid R. Rabiee
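TL;DR: This paper presents the CineLOG dataset and a new video generation pipeline to address the difficulty of fine-grained control (such as camera trajectory and film genre) in controllable video synthesis. CineLOG contains 5,000 high-quality, balanced, uncut video clips, each with a detailed scene description, explicit camera instructions based on a standard cinematic taxonomy, and a genre label. The paper also introduces a pipeline that decouples the complex text-to-video generation task into four simpler stages, with a Trajectory Guided Transition Module for generating coherent multi-shot sequences.
Details
Motivation: Current controllable video synthesis models struggle with fine-grained control beyond text prompts (such as camera trajectory and film genre), and existing datasets suffer from severe data imbalance, noisy labels, or a large simulation-to-real gap.
Result: Extensive human evaluations show that the pipeline significantly outperforms state-of-the-art end-to-end text-to-video models in following specific camera and screenplay instructions while maintaining professional visual quality.
Insight: The contributions are a high-quality, balanced, and richly annotated video dataset (CineLOG) and a pipeline that decouples the complex task into multiple stages and uses a trajectory-guided transition module to keep sequences coherent, offering new data and methods for controllable long video generation.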
Abstract: Controllable video synthesis is a central challenge in computer vision, yet current models struggle with fine-grained control beyond textual prompts, particularly for cinematic attributes like camera trajectory and genre. Existing datasets often suffer from severe data imbalance, noisy labels, or a significant simulation-to-real gap. To address this, we introduce CineLOG, a new dataset of 5,000 high-quality, balanced, and uncut video clips. Each entry is annotated with a detailed scene description, explicit camera instructions based on a standard cinematic taxonomy, and a genre label, ensuring balanced coverage across 17 diverse camera movements and 15 film genres. We also present the novel pipeline designed to create this dataset, which decouples the complex text-to-video (T2V) generation task into four easier stages with more mature technology. To enable coherent multi-shot sequences, we introduce a novel Trajectory Guided Transition Module that generates smooth spatio-temporal interpolation. Extensive human evaluations show that our pipeline significantly outperforms SOTA end-to-end T2V models in adhering to specific camera and screenplay instructions, while maintaining professional visual quality. All code and data are available at https://cine-log.pages.dev.
[54] Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking cs.CV | cs.CL | cs.LGPDF
Rheeya Uppaal, Phu Mon Htut, Min Bai, Nikolaos Pappas, Zheng Qi
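TL;DR: This paper addresses visual unfaithfulness in the chains of thought generated by reasoning-augmented vision language models (VLMs). It proposes a training-free and reference-free evaluation framework that decomposes reasoning chains and assesses the visual faithfulness of their perception steps, and designs a lightweight self-reflection procedure that detects and locally repairs unfaithful perception steps, improving the reliability of multimodal reasoning.
Details
Motivation: Existing evaluations of reasoning-augmented VLMs focus only on final-answer accuracy and cannot tell whether a model reached a correct conclusion through visually unfaithful intermediate steps or reasoned faithfully but failed on the final prediction. The paper aims to establish the visual faithfulness of reasoning chains as a separate evaluation dimension.
Result: Across multiple reasoning-trained VLMs and perception-heavy benchmarks, the method lowers the unfaithful perception rate while preserving final-answer accuracy.
Insight: The novelty lies in decomposing reasoning chains into perception and reasoning steps and using off-the-shelf VLMs as judges for step-level faithfulness, and in a training-free self-reflection mechanism that locally repairs unfaithful perception steps, offering a new way to improve the transparency and reliability of multimodal reasoning.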
Abstract: Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
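A metric like the Unfaithful Perception Rate mentioned in this abstract could be computed as below. The abstract does not define the metric precisely, so this sketch assumes it is the fraction of perception steps, pooled over all chains, that a judge marked as not grounded in the image. The step representation and function name are hypothetical.

```python
def unfaithful_perception_rate(chains):
    """Fraction of perception steps judged unfaithful, pooled over all chains.

    chains: list of reasoning chains; each chain is a list of
            (step_type, is_faithful) tuples, where step_type is
            "perception" or "reasoning" (assumed representation).
    """
    verdicts = [faithful
                for chain in chains
                for step_type, faithful in chain
                if step_type == "perception"]
    if not verdicts:
        return 0.0  # no perception steps to judge
    return sum(1 for faithful in verdicts if not faithful) / len(verdicts)

# Two toy chains with four perception steps in total, one of them unfaithful.
chains = [
    [("perception", True), ("reasoning", True), ("perception", False)],
    [("perception", True), ("perception", True), ("reasoning", False)],
]
print(unfaithful_perception_rate(chains))
```

Reasoning steps are ignored here on purpose: the metric isolates whether what the model claims to see is in the image, independently of whether the final answer is right.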
[55] ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation cs.CVPDF
Minheng Ni, Zhengyuan Yang, Yaowen Zhang, Linjie Li, Chung-Ching Lin
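TL;DR: This paper presents ProImage-Bench, a rubric-based benchmark for evaluating how well professional image generation models produce information-dense, scientifically precise illustrations (such as biology schematics, engineering/patent drawings, and scientific diagrams) from technical descriptions. The benchmark contains 654 figures collected from real textbooks and technical reports, with detailed image instructions and hierarchical rubrics that decompose correctness into 6,076 criteria and 44,131 binary checks. The paper also shows how the rubrics can provide actionable supervision to an editing model, substantially improving generation quality through iterative refinement.
Details
Motivation: Current image generation models perform well in open domains but fall short on professional image generation tasks that demand high scientific fidelity and precision, and there is no method for quantifying this capability.
Result: Benchmarking several representative text-to-image models on ProImage-Bench shows that the best base model reaches only 0.791 rubric accuracy and 0.553 criterion score, revealing a substantial gap in fine-grained scientific fidelity. Feeding failed checks back into an editing model for iterative refinement raises a strong generator from 0.653 to 0.865 in rubric accuracy and from 0.388 to 0.697 in criterion score.
Insight: The novelty lies in a hierarchical, interpretable rubric benchmark built automatically with large multimodal models, which decomposes the correctness of complex professional images into many fine-grained criteria for quantitative evaluation. The benchmark serves not only as a diagnostic but also as a scalable supervision signal for iteratively improving models, closing the loop from evaluation to improvement.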
Abstract: We study professional image generation, where a model must synthesize information-dense, scientifically precise illustrations from technical descriptions rather than merely produce visually plausible pictures. To quantify the progress, we introduce ProImage-Bench, a rubric-based benchmark that targets biology schematics, engineering/patent drawings, and general scientific diagrams. For 654 figures collected from real textbooks and technical reports, we construct detailed image instructions and a hierarchy of rubrics that decompose correctness into 6,076 criteria and 44,131 binary checks. Rubrics are derived from surrounding text and reference figures using large multimodal models, and are evaluated by an automated LMM-based judge with a principled penalty scheme that aggregates sub-question outcomes into interpretable criterion scores. We benchmark several representative text-to-image models on ProImage-Bench and find that, despite strong open-domain performance, the best base model reaches only 0.791 rubric accuracy and 0.553 criterion score overall, revealing substantial gaps in fine-grained scientific fidelity. Finally, we show that the same rubrics provide actionable supervision: feeding failed checks back into an editing model for iterative refinement boosts a strong generator from 0.653 to 0.865 in rubric accuracy and from 0.388 to 0.697 in criterion score. ProImage-Bench thus offers both a rigorous diagnostic for professional image generation and a scalable signal for improving specification-faithful scientific illustrations.
[56] Moment and Highlight Detection via MLLM Frame Segmentation cs.CVPDF
I Putu Andika Bagas Jiwanta, Ayu Purwarianti
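TL;DR: This paper proposes a novel multimodal large language model (MLLM) approach to video moment and highlight detection. It applies segmentation objectives directly to the LLM's output tokens, mapping each frame to a "0" or "1" character, which exploits the LLM's language reasoning ability while providing direct gradients for frame-level predictions. Sampling only 25 frames, the method achieves strong highlight detection (56.74 HIT@1) on the QVHighlights benchmark and moment retrieval performance above the baseline (35.28 MAP).
Details
Motivation: Existing generative MLLM-based methods predict moments or highlights as text timestamps, which is effective but cannot provide direct gradients for frame-level predictions; reinforcement learning methods attempt to address this. This paper instead applies segmentation directly to the LLM's output tokens, combining the LLM's language ability with differentiable frame-level supervision.
Result: On the QVHighlights benchmark, using only 25 frames (fewer than half the number used by comparable methods), the method achieves a highlight detection score of 56.74 HIT@1 and reaches 35.28 MAP on moment retrieval, above the baseline.
Insight: The core idea is to combine the video frame sequence with a specific prompt that forces the LLM to output a continuous sequence of "0"/"1" characters, one per frame. These characters benefit from the LLM's language ability and can also be treated as background/foreground probabilities, which allows segmentation losses to be applied directly during training. This design gives the MLLM a stable complementary learning signal that remains effective even when the causal language modeling loss plateaus, balancing efficiency and performance.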
Abstract: Detecting video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use a generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM’s output tokens. The LLM is fed a fixed number of frames alongside a prompt that forces it to output a continuous sequence of “0” and/or “1” characters, with one character per frame. The “0”/“1” characters benefit from the LLM’s inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates the sequence and logits, which act as moments and saliency scores, respectively. Despite sampling only 25 frames, fewer than half the number used by comparable methods, our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
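The idea of treating the "0" and "1" output tokens as background and foreground probabilities can be sketched as follows. This is an illustration under stated assumptions: each frame position is reduced to the pair of logits for the two characters, a two-way softmax gives the foreground probability, and binary cross-entropy stands in for the paper's unspecified segmentation losses. Function names are hypothetical.

```python
import math

def foreground_probabilities(frame_logits):
    """Two-way softmax over the ("0", "1") token logits at each frame position.
    Returns the probability of "1" (foreground) per frame."""
    probs = []
    for logit_zero, logit_one in frame_logits:
        m = max(logit_zero, logit_one)      # subtract the max for numerical stability
        e_zero = math.exp(logit_zero - m)
        e_one = math.exp(logit_one - m)
        probs.append(e_one / (e_zero + e_one))
    return probs

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between per-frame probabilities and 0/1 labels.
    Used here as a stand-in segmentation loss (an assumption)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)     # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

# Four frames: the model favors "1" on the middle two frames.
logits = [(2.0, -2.0), (-1.0, 3.0), (-2.0, 2.0), (1.5, -0.5)]
labels = [0, 1, 1, 0]
probs = foreground_probabilities(logits)
print(probs)
print(binary_cross_entropy(probs, labels))
```

Because the loss is computed on probabilities derived from token logits, its gradient reaches every frame position directly, which is the property that text-timestamp generation lacks.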
[57] MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models cs.CVPDF
Yuqing Lei, Yingjun Du, Yawen Huang, Xiantong Zhen, Ling Shao
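TL;DR: This paper proposes MetaTPT, a meta-learning framework for vision-language models (such as CLIP) that improves test-time prompt tuning (TPT) under domain shift by meta-learning a self-supervised auxiliary task. The method dynamically learns parameterized augmentations to generate informative views and uses a dual-loop (inner and outer) optimization to couple augmentation learning with prompt tuning, improving the model's ability to adapt at test time.
Details
Motivation: Existing test-time prompt tuning (TPT) methods rely on fixed data augmentations, which can fail under more challenging domain shifts, so more flexible and expressive augmentation strategies are needed to guide prompt tuning.
Result: MetaTPT achieves state-of-the-art (SOTA) performance on domain generalization and cross-dataset benchmarks.
Insight: The novelty lies in bringing meta-learning into test-time prompt tuning: a self-supervised auxiliary task dynamically learns parameterized data augmentations, and dual-loop optimization couples augmentation learning with prompt consistency, capturing key features of the target domain more effectively.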
Abstract: Vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization but remain sensitive to domain shifts at test time. Test-time prompt tuning (TPT) mitigates this issue by adapting prompts with fixed augmentations, which may falter in more challenging settings. In this work, we propose Meta Test-Time Prompt Tuning (MetaTPT), a meta-learning framework that learns a self-supervised auxiliary task to guide test-time prompt tuning. The auxiliary task dynamically learns parameterized augmentations for each sample, enabling more expressive transformations that capture essential features in target domains. MetaTPT adopts a dual-loop optimization paradigm: an inner loop learns a self-supervised task that generates informative views, while the outer loop performs prompt tuning by enforcing consistency across these views. By coupling augmentation learning with prompt tuning, MetaTPT improves test-time adaptation under domain shifts. Extensive experiments demonstrate that MetaTPT achieves state-of-the-art performance on domain generalization and cross-dataset benchmarks.
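One common way to enforce consistency across augmented views in test-time prompt tuning is to minimize the entropy of the prediction averaged over views. The sketch below computes that quantity only; it is an assumption that MetaTPT's outer loop uses an objective of this kind, since the abstract does not state the loss, and the function name is hypothetical.

```python
import math

def marginal_entropy(view_probs):
    """Entropy of the class distribution averaged over augmented views.

    view_probs: list of per-view class-probability vectors for one test sample.
    Low values mean the views agree on a confident prediction.
    """
    n = len(view_probs)
    mean = [sum(col) / n for col in zip(*view_probs)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)

agreeing_views = [[0.9, 0.1], [0.9, 0.1]]
disagreeing_views = [[0.9, 0.1], [0.1, 0.9]]
print(marginal_entropy(agreeing_views))     # lower: views agree
print(marginal_entropy(disagreeing_views))  # higher: views contradict each other
```

In a dual-loop setup of the kind the abstract describes, the inner loop would produce the views and the outer loop would update the prompt to reduce a consistency objective such as this one.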
[58] Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection cs.CVPDF
Jiahao Zhao
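TL;DR: This paper proposes Cognitive-YOLO, a framework that uses a large language model (LLM) to synthesize object detection architectures directly from intrinsic characteristics of the data (such as object scale distribution and scene density), avoiding the high cost of manual design or neural architecture search (NAS).
Details
Motivation: Traditional object detection architecture design is time-consuming and labor-intensive, and NAS is computationally expensive. Existing LLM-based methods also act only as iterative optimizers within a search loop rather than generating architectures directly from a holistic understanding of the data.
Result: Extensive experiments on five different object detection datasets show that the architectures generated by Cognitive-YOLO perform strongly, reaching highly competitive results on multiple benchmarks with a better performance-per-parameter trade-off.
Insight: The novelty lies in a three-stage framework (data analysis; LLM reasoning over data features and RAG-retrieved components to generate an NADL description; compiler instantiation), which emphasizes that a data-driven, "first principles" understanding is key to architecture performance, more so than simply retrieving SOTA components.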
Abstract: Designing high-performance object detection architectures is a complex task, where traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop, rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons upon these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture into a structured Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that our proposed Cognitive-YOLO consistently generates superior architectures, achieving highly competitive performance and demonstrating a superior performance-per-parameter trade-off compared to strong baseline models across multiple benchmarks. Crucially, our ablation studies prove that the LLM’s data-driven reasoning is the primary driver of performance, demonstrating that a deep understanding of data “first principles” is more critical for achieving a superior architecture than simply retrieving SOTA components.
[59] RealDrag: The First Dragging Benchmark with Real Target Image cs.CVPDF
Ahmad Zafarani, Zahra Dehghanian, Mohammadreza Davoodi, Mohsen Shadroo, MohammadAmin Fazli
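TL;DR: This paper presents RealDrag, the first benchmark for drag-based image editing that includes real target images. It contains over 400 annotated samples and four task-specific metrics, is used to systematically evaluate 17 SOTA models, reveals trade-offs among existing methods, and establishes a reproducible baseline.
Details
Motivation: Drag-based image editing models are hard to evaluate objectively because standardized benchmarks and datasets with real target images are lacking.
Result: Evaluating 17 SOTA models on RealDrag reveals trade-offs among methods in pixel matching, region preservation, and semantic alignment, and establishes a reproducible baseline.
Insight: The novelty lies in the first benchmark dataset with real target images and four targeted evaluation metrics (SeD, OMPS, IPPS, DiS), giving the field a systematic evaluation framework.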
Abstract: The evaluation of drag-based image editing models is unreliable due to a lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, the absence of datasets containing ground-truth target images, making objective comparisons between competing methods difficult. To address this, we introduce RealDrag, the first comprehensive benchmark for point-based image editing that includes paired ground-truth target images. Our dataset contains over 400 human-annotated samples from diverse video sources, providing source/target images, handle/target points, editable region masks, and descriptive captions for both the image and the editing action. We also propose four novel, task-specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics are designed to quantify pixel-level matching fidelity, check preservation of non-edited (out-of-mask) regions, and measure semantic alignment with the desired task. Using this benchmark, we conduct the first large-scale systematic analysis of the field, evaluating 17 SOTA models. Our results reveal clear trade-offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available.
[60] GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search cs.CV | cs.LGPDF
Hyunju Lee, Youngmin Oh, Jeimin Jeon, Donghyeon Baek, Bumsub Ham
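TL;DR: This paper proposes GrowTAS, a progressive training framework for efficient vision Transformer architecture search. It starts by training small subnets and gradually incorporates larger ones, reducing the interference caused by weight sharing and stabilizing training. It also proposes GrowTAS+, which fine-tunes only a subset of weights to further improve the performance of large subnets.
Details
Motivation: Existing Transformer architecture search methods typically train an over-parameterized supernet containing all candidate architectures, with all subnets sharing the same weights. This causes severe interference that particularly harms small subnets. The authors find that well-trained small subnets can serve as a good foundation for training larger ones.
Result: Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, show that the method outperforms current TAS methods.
Insight: The core idea is a progressive training paradigm that builds and trains the supernet from easy to hard (from small subnets to large ones), which effectively mitigates the interference caused by weight sharing. GrowTAS+ adds a partial weight fine-tuning strategy that specifically targets large subnets, an efficient and focused way to improve performance.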
Abstract: Transformer architecture search (TAS) aims to automatically discover efficient vision transformers (ViTs), reducing the need for manual design. Existing TAS methods typically train an over-parameterized network (i.e., a supernet) that encompasses all candidate architectures (i.e., subnets). However, all subnets share the same set of weights, which leads to interference that severely degrades the smaller subnets. We have found that well-trained small subnets can serve as a good foundation for training larger ones. Motivated by this, we propose a progressive training framework, dubbed GrowTAS, that begins by training small subnets and gradually incorporates larger ones. This reduces the interference and stabilizes the training process. We also introduce GrowTAS+, which fine-tunes only a subset of weights to further enhance the performance of large subnets. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate the effectiveness of our approach over current TAS methods.
[61] MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding cs.CV | cs.GRPDF
Benjamin Beilharz, Thomas S. A. Wallis
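TL;DR: This paper proposes MRD (metamers rendered differentiably), a method that uses physically based differentiable rendering to probe vision models' implicit understanding of 3D scene properties. It evaluates a model's sensitivity to generative 3D scene properties by finding 3D scene parameters that are physically different but produce the same model activation (model metamers).
Details
Motivation: Although deep learning models have achieved notable success on vision tasks, their internal representations and decisions are hard to explain. Models are usually trained on 2D inputs but are often assumed to implicitly understand the underlying 3D scene (for example, tolerance to partial occlusion or the ability to reason about relative depth). The paper aims to develop a method grounded in physical scene descriptions to systematically probe models' sensitivity to 3D scene properties.
Result: As a proof of principle, the authors assess multiple models in their ability to recover scene parameters for geometry (shape) and the bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. These reconstructions help analyze qualitatively which physical scene attributes models are sensitive or invariant to.
Insight: The novelty lies in combining physically based differentiable rendering with the concept of model metamers, giving an interpretable and physically grounded way to probe vision models' 3D scene understanding. The method can control scene properties (such as shape, material, and lighting) independently, enabling finer analysis of model representations and offering a new tool for research on computer and human vision.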
Abstract: While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models’ implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model’s sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.
[62] VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding cs.CV | cs.CLPDF
Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao
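TL;DR: The paper proposes VideoARM, an agentic reasoning paradigm over hierarchical memory for long-form video understanding. Through an adaptive, dynamic loop of observing, thinking, acting, and memorizing, it interprets video in a coarse-to-fine manner, substantially reducing token consumption, and uses a hierarchical multimodal memory to capture multi-level clues that support decision-making.
Details
Motivation: Long-form video understanding, with its extended temporal structure and dense multimodal cues, currently depends on hand-crafted reasoning pipelines or token-heavy preprocessing.
Result: On prevalent benchmarks, VideoARM outperforms the current state-of-the-art method, DVD, while significantly reducing token consumption for long videos.
Insight: The novelty lies in combining agentic reasoning with hierarchical memory to process video adaptively and dynamically, reducing computational overhead and improving understanding efficiency. The loop-based reasoning framework and memory update mechanism could be reused for other temporal data tasks.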
Abstract: Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
[63] STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative cs.CVPDF
Peixuan Zhang, Zijian Jia, Kaiqi Liu, Shuchen Weng, Si Li
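TL;DR: STAGE proposes a storyboard-anchored generation workflow to address the challenges of cross-shot consistency and capturing cinematic language in multi-shot narrative video generation. The method predicts a start-end frame pair for each shot to form a structured storyboard, and introduces a multi-shot memory pack, a dual-encoding strategy, and a two-stage training scheme to ensure entity consistency and inter-shot transitions.
Details
Motivation: Existing keyframe-based video generation methods fall short in maintaining cross-shot consistency and capturing cinematic language; STAGE aims to improve the quality of multi-shot narrative video generation through a structured storyboard and multi-shot consistency mechanisms.
Result: Extensive experiments on the newly built large-scale ConStoryBoard dataset show that STAGE achieves superior performance in structured narrative control and cross-shot coherence.
Insight: The contributions include replacing sparse keyframes with a structured storyboard, a multi-shot memory pack for long-range entity consistency, a dual-encoding strategy for intra-shot coherence, and a two-stage training scheme for learning cinematic inter-shot transitions, providing a new framework for controllable narrative video generation.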
Abstract: While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.
[64] V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping cs.CVPDF
Hyunkoo Lee, Wooseok Jang, Jini Yang, Taehwan Kim, Sangoh Kim
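TL;DR: V-Warper is a training-free, coarse-to-fine personalization framework for video diffusion models. It uses value warping to generate videos that are appearance-consistent and follow the text prompt, improving fine-grained identity fidelity without large-scale video fine-tuning.
Details
Motivation: Existing video personalization methods rely on computationally expensive video fine-tuning or large-scale video datasets and struggle to maintain fine-grained appearance consistency across frames; V-Warper aims to address these limitations.
Result: V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and achieves these gains efficiently without large-scale video fine-tuning.
Insight: The contributions are: 1) a lightweight coarse appearance adaptation stage that uses only a small set of reference images and encodes global identity through image-only LoRA and subject-embedding adaptation; 2) an inference-time fine appearance injection stage that computes semantic correspondences from RoPE-free mid-layer query-key features and uses them to warp appearance-rich value representations into semantically aligned regions of the generation process, with masking to ensure spatial reliability.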
Abstract: Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based finetuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This step encodes global subject identity through image-only LoRA and subject-embedding adaptation. (2) An inference-time fine appearance injection stage refines visual fidelity by computing semantic correspondences from RoPE-free mid-layer query–key features. These correspondences guide the warping of appearance-rich value representations into semantically aligned regions of the generation process, with masking ensuring spatial reliability. V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and it achieves these gains efficiently without large-scale video finetuning.
[65] M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction cs.CVPDF
Junqiao Fan, Yunjiao Zhou, Yizhuo Yang, Xinyuan Cui, Jiarui Zhang
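TL;DR: M4Human is a large-scale multimodal mmWave radar benchmark for human mesh reconstruction. It contains 661K frames of high-resolution mmWave radar, RGB, and depth data, provides both raw radar tensors and processed point clouds, covers 50 diverse actions from 20 subjects, and includes high-quality motion-capture annotations.
Details
Motivation: Existing large-scale human mesh reconstruction datasets rely heavily on line-of-sight RGB input, which is limited by occlusion, lighting variation, and privacy concerns, while existing radar datasets are constrained by sparse skeleton labels, limited scale, and simple actions. M4Human is proposed to overcome these limitations and advance research.
Result: The paper establishes benchmarks on both raw radar tensors and radar point clouds and runs multimodal fusion experiments with RGB-D. Extensive results highlight the value of M4Human for radar-based human modeling and reveal persistent challenges under fast, unconstrained motion.
Insight: The contribution is the largest multimodal mmWave radar benchmark to date, offering radar signals at different granularities (raw tensors and point clouds) together with high-quality human mesh and global trajectory annotations. It supports cross-modal fusion research, covers a richer set of action types, and helps advance privacy-preserving indoor human sensing.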
Abstract: Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale (661K-frame) ($9\times$ prior largest) multimodal benchmark, featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.
[66] ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States cs.CVPDF
Haowen Wang, Xiaoping Yuan, Fugang Zhang, Rui Jian, Yuanwei Zhu
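TL;DR: ArtGen is a conditional diffusion-based framework that generates articulated 3D objects with accurate geometry and coherent kinematics from single-view images or text descriptions, at arbitrary part-level states. It enforces global kinematic consistency through cross-state Monte Carlo sampling, infers structural priors with a Chain-of-Thought reasoning module, handles diverse kinematic interactions with a sparse-expert Diffusion Transformer, and captures fine-grained geometry and part relationships with a compositional 3D-VAE latent prior enhanced by local-global attention.
Details
Motivation: Existing generative models usually rely on single-view inputs that depict closed states, which entangles geometric shape with joint dynamics and yields ambiguous or unrealistic kinematic structures. ArtGen aims to resolve this structure-motion entanglement in articulated object generation.
Result: Extensive experiments on the PartNet-Mobility benchmark show that ArtGen significantly outperforms state-of-the-art methods.
Insight: The contributions are cross-state Monte Carlo sampling that explicitly enforces global kinematic consistency, a Chain-of-Thought reasoning module that infers structural priors (part semantics, joint types, and connectivity), and a compositional 3D-VAE latent prior enhanced with local-global attention. Together these reduce structure-motion entanglement and improve generation quality.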
Abstract: Generating articulated assets is crucial for robotics, digital twins, and embodied intelligence. Existing generative models often rely on single-view inputs representing closed states, resulting in ambiguous or unrealistic kinematic structures due to the entanglement between geometric shape and joint dynamics. To address these challenges, we introduce ArtGen, a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics from single-view images or text descriptions at arbitrary part-level states. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency, reducing structural-motion entanglement. Additionally, we integrate a Chain-of-Thought reasoning module to infer robust structural priors, such as part semantics, joint types, and connectivity, guiding a sparse-expert Diffusion Transformer to specialize in diverse kinematic interactions. Furthermore, a compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships. Extensive experiments on the PartNet-Mobility benchmark demonstrate that ArtGen significantly outperforms state-of-the-art methods.
[67] ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics cs.CV | cs.LGPDF
Tue-Thu Van-Dinh, Hoang-Duy Tran, Truong-Binh Duong, Mai-Hanh Pham, Binh-Nam Le-Nguyen
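TL;DR: The paper introduces ViInfographicVQA, the first visual question answering benchmark for Vietnamese infographics, with 6747 real-world infographics and 20409 human-verified question-answer pairs. It includes Single-image and Multi-image evaluation tasks that test a model's ability to read and reason over data-rich, layout-heavy Vietnamese infographics.
Details
Motivation: Existing VQA benchmarks mainly target scene text or natural images. They do not evaluate infographics, which demand stronger OCR, layout understanding, and numerical and semantic reasoning, and they especially neglect low-resource languages such as Vietnamese.
Result: The paper evaluates a range of recent vision-language models. Performance is acceptable on the Single-image task, but there is a substantial gap on the Multi-image task, with the most errors on questions that involve cross-image integration and non-span reasoning.
Insight: The main contribution is the first Vietnamese infographic VQA benchmark, together with a Multi-image task that requires cross-image reasoning. It exposes the limitations of current multimodal models in low-resource languages and complex layout understanding, and points to future work on layout-aware and cross-image reasoning methods.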
Abstract: Infographic Visual Question Answering (InfographicVQA) evaluates a model’s ability to read and reason over data-rich, layout-heavy visuals that combine text, charts, icons, and design elements. Compared with scene-text or natural-image VQA, infographics require stronger integration of OCR, layout understanding, and numerical and semantic reasoning. We introduce ViInfographicVQA, the first benchmark for Vietnamese InfographicVQA, comprising over 6747 real-world infographics and 20409 human-verified question-answer pairs across economics, healthcare, education, and more. The benchmark includes two evaluation settings. The Single-image task follows the traditional setup in which each question is answered using a single infographic. The Multi-image task requires synthesizing evidence across multiple semantically related infographics and is, to our knowledge, the first Vietnamese evaluation of cross-image reasoning in VQA. We evaluate a range of recent vision-language models on this benchmark, revealing substantial performance disparities, with the most significant errors occurring on Multi-image questions that involve cross-image integration and non-span reasoning. ViInfographicVQA contributes benchmark results for Vietnamese InfographicVQA and sheds light on the limitations of current multimodal models in low-resource contexts, encouraging future exploration of layout-aware and cross-image reasoning methods.
[68] BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation cs.CVPDF
Hangwei Zhang, Armando Teles Fortes, Tianyi Wei, Xingang Pan
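TL;DR: The paper proposes BokehDepth, a framework that improves monocular depth estimation through bokeh (defocus) generation. It has two stages: Stage-1 uses a pretrained image-editing backbone to generate controllable, depth-free bokeh stacks; Stage-2 fuses features through a lightweight defocus-aware module to improve the accuracy and robustness of existing depth models.
Details
Motivation: Existing methods do not fully exploit the tight coupling between bokeh and monocular depth estimation through lens imaging geometry. Bokeh rendering depends on noisy depth maps and produces artifacts, while depth models perform poorly on weakly textured, distant, and geometrically ambiguous regions.
Result: Across several challenging benchmarks, BokehDepth improves visual fidelity over depth-map-based bokeh baselines and consistently boosts the metric accuracy and robustness of strong monocular depth foundation models.
Insight: The contribution is to decouple bokeh synthesis from depth prediction and to treat defocus as a supervision-free geometric cue. A physically guided controllable bokeh generator and a lightweight defocus-aware aggregation module exploit defocus information without changing the downstream decoder.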
Abstract: Bokeh and monocular depth estimation are tightly coupled through the same lens imaging geometry, yet current methods exploit this connection in incomplete ways. High-quality bokeh rendering pipelines typically depend on noisy depth maps, which amplify estimation errors into visible artifacts, while modern monocular metric depth models still struggle on weakly textured, distant and geometrically ambiguous regions where defocus cues are most informative. We introduce BokehDepth, a two-stage framework that decouples bokeh synthesis from depth prediction and treats defocus as an auxiliary supervision-free geometric cue. In Stage-1, a physically guided controllable bokeh generator, built on a powerful pretrained image editing backbone, produces depth-free bokeh stacks with calibrated bokeh strength from a single sharp input. In Stage-2, a lightweight defocus-aware aggregation module plugs into existing monocular depth encoders, fuses features along the defocus dimension, and exposes stable depth-sensitive variations while leaving downstream decoder unchanged. Across challenging benchmarks, BokehDepth improves visual fidelity over depth-map-based bokeh baselines and consistently boosts the metric accuracy and robustness of strong monocular depth foundation models.
[69] Endless World: Real-Time 3D-Aware Long Video Generation cs.CVPDF
Ke Zhang, Yiqun Mei, Jiacong Xu, Vishal M. Patel
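TL;DR: The paper proposes Endless World, a framework for real-time generation of infinitely long, 3D-consistent video sequences. A conditional autoregressive training strategy and a global 3D-aware attention mechanism enable real-time inference on a single GPU without additional training overhead.
Details
Motivation: Long video generation suffers from unstable 3D structure and poor temporal consistency. The goal is to maintain geometric consistency and physical plausibility, particularly in streaming generation.
Result: The method matches or exceeds existing methods in visual fidelity and spatial consistency, and generates long, stable, and visually coherent videos.
Insight: The contributions are a conditional autoregressive training strategy that preserves long-range dependencies and a global 3D-aware attention mechanism that provides geometric guidance across time, ensuring physical plausibility and geometric consistency over extended sequences.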
Abstract: Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation. To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training overhead. Moreover, Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene synthesis. Extensive experiments demonstrate that Endless World produces long, stable, and visually coherent videos, achieving competitive or superior performance to existing methods in both visual fidelity and spatial consistency. The project page is available at https://bwgzk-keke.github.io/EndlessWorld/.
[70] From Particles to Fields: Reframing Photon Mapping with Continuous Gaussian Photon Fields cs.CV | cs.GRPDF
Jiachen Tao, Benjamin Planche, Van Nguyen Nguyen, Junyi Wu, Yuchun Liu
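TL;DR: The paper proposes the Gaussian Photon Field (GPF), a continuous, learnable representation that accelerates photon mapping for multi-view rendering. GPF encodes photon distributions as anisotropic 3D Gaussian primitives and is optimized with multi-view supervision, distilling photon-based light transport into a continuous field. Once trained, it supports differentiable radiance evaluation without repeated photon tracing.
Details
Motivation: When rendering multiple views of the same scene, photon mapping is computationally inefficient because each view performs photon tracing and stochastic kernel estimation independently, which leads to unavoidable redundant computation.
Result: Extensive experiments on scenes with complex light transport, such as caustics and specular-diffuse interactions, show that GPF retains photon-level accuracy while reducing computation by orders of magnitude.
Insight: The contribution is to recast discrete photon mapping as a continuous, reusable radiance function, encoding photon distributions with a learnable 3D Gaussian representation. This unifies the rigor of physically based photon rendering with the efficiency of neural scene representations.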
Abstract: Accurately modeling light transport is essential for realistic image synthesis. Photon mapping provides physically grounded estimates of complex global illumination effects such as caustics and specular-diffuse interactions, yet its per-view radiance estimation remains computationally inefficient when rendering multiple views of the same scene. The inefficiency arises from independent photon tracing and stochastic kernel estimation at each viewpoint, leading to inevitable redundant computation. To accelerate multi-view rendering, we reformulate photon mapping as a continuous and reusable radiance function. Specifically, we introduce the Gaussian Photon Field (GPF), a learnable representation that encodes photon distributions as anisotropic 3D Gaussian primitives parameterized by position, rotation, scale, and spectrum. GPF is initialized from physically traced photons in the first SPPM iteration and optimized using multi-view supervision of final radiance, distilling photon-based light transport into a continuous field. Once trained, the field enables differentiable radiance evaluation along camera rays without repeated photon tracing or iterative refinement. Extensive experiments on scenes with complex light transport, such as caustics and specular-diffuse interactions, demonstrate that GPF attains photon-level accuracy while reducing computation by orders of magnitude, unifying the physical rigor of photon-based rendering with the efficiency of neural scene representations.
[71] More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models cs.CVPDF
Hoang Anh Just, Yifei Fan, Handong Zhao, Jiuxiang Gu, Ruiyi Zhang
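TL;DR: The paper proposes PeRL-VL, a framework that addresses inaccurate visual extraction and logically inconsistent chains of thought in vision-language models (VLMs) trained with reinforcement learning from verifiable rewards (RLVR). It decouples perception from reasoning: a VLM-based description reward improves the faithfulness and sufficiency of visual perception, and a text-only reasoning SFT stage independently strengthens logical consistency.
Details
Motivation: RLVR-trained VLMs have advanced multimodal reasoning, but their supervision signal acts only on the final answer. As a result, two failure modes persist: errors in extracting visual details (omissions or hallucinations) and logically inconsistent chains of thought.
Result: Across multiple multimodal benchmarks, PeRL-VL raises the average Pass@1 accuracy of the base model Qwen2.5-VL-7B from 63.3% to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.
Insight: The core idea is to decouple the optimization of visual perception from that of textual reasoning and to give each its own targeted supervision: (1) a perception reward in which a VLM scores the model's self-generated descriptions for faithfulness and sufficiency; (2) text-only supervised finetuning on logic-rich chain-of-thought data, which improves the coherence and logical consistency of reasoning independently of the visual modality. This divide-and-conquer strategy compensates for supervising only the final answer.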
Abstract: Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model’s self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.
[72] Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings cs.CV | cs.CLPDF
Shengkai Xu, Hsiang Lun Kao, Tianxiang Xu, Honghui Zhang, Junqiao Wang
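TL;DR: The paper proposes AdaptiveDetector, an adaptive detector-verifier framework for zero-shot generalization of polyp detection in open-world settings. It combines a YOLOv11 detector with a vision-language model verifier. Adaptive confidence thresholds and cost-sensitive reinforcement learning finetuning substantially reduce missed detections and improve performance under adverse imaging conditions.
Details
Motivation: In real endoscopy, adverse imaging conditions such as illumination changes, motion blur, and occlusion degrade polyp detector performance. The goal is to bridge the domain gap between controlled laboratory data and clinical practice.
Result: In zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images, recall improves by 14 to 22 percentage points over YOLO alone, while precision stays between 0.7 points below and 1.7 points above the baseline.
Insight: The contribution is a two-stage detector-verifier framework that combines VLM-guided adaptive thresholding with cost-sensitive reinforcement learning finetuning (GRPO with an asymmetric reward function). It focuses on reducing clinically critical missed detections and uses a systematically degraded synthetic testbed for more realistic evaluation.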
Abstract: Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-verifier framework comprising a YOLOv11 detector with a vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance, while the verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function specifically designed to discourage missed detections – a critical clinical requirement. To enable realistic assessment under challenging conditions, we construct a comprehensive synthetic testbed by systematically degrading clean datasets with adverse conditions commonly encountered in clinical practice, providing a rigorous benchmark for zero-shot evaluation. Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images demonstrates that our approach improves recall by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline. This combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, thereby reducing the risk of missed precancerous polyps and improving patient outcomes.
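A minimal sketch of what an asymmetric, cost-sensitive reward combined with group-relative normalization might look like. The reward values and function names are hypothetical; the abstract does not specify them, and the normalization shown is the commonly described GRPO form (reward minus group mean, divided by group standard deviation), not necessarily the exact variant used in the paper.

```python
from statistics import mean, pstdev

def verifier_reward(predicted_polyp, has_polyp,
                    correct=1.0, false_alarm=-0.5, miss=-2.0):
    """Hypothetical asymmetric reward: a missed polyp costs more
    than a false alarm, which biases the verifier toward recall."""
    if predicted_polyp == has_polyp:
        return correct
    return miss if has_polyp else false_alarm

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled verifier decisions for a frame that does contain a polyp.
rewards = [verifier_reward(p, True) for p in (True, True, False, True)]
advantages = group_relative_advantages(rewards)
```

With these illustrative values, the single missed detection receives a strongly negative advantage relative to the three correct decisions in its group.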
[73] Advancing Cache-Based Few-Shot Classification via Patch-Driven Relational Gated Graph Attention cs.CVPDF
Tasweer Ahmad, Arindam Sikdar, Sandip Pradhan, Ardhendu Behera
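TL;DR: The paper proposes a cache-based few-shot image classification method built on a patch-driven relational graph attention network. It constructs a graph over the patches within an image and uses edge-aware attention to strengthen information exchange between patches, producing more discriminative, context-enriched representations. Predictions come from a residual fusion of cache similarity with CLIP zero-shot scores, which improves classification while keeping zero-shot inference efficiency.
Details
Motivation: Cache-based adaptation methods such as Tip-Adapter ease limited supervision and visual domain shift in few-shot classification by learning lightweight residual adapters. They still inherit CLIP's tendency to encode global, general-purpose representations, which makes it hard to obtain optimally discriminative domain-specific representations in low-data regimes.
Result: Extensive evaluation on 11 benchmarks shows that the method consistently outperforms state-of-the-art CLIP adapter and cache-based baselines while preserving zero-shot efficiency. The paper also introduces an Injured vs. Uninjured Soldier dataset for casualty recognition, validating the method for battlefield-related applications.
Insight: The contribution is a patch-driven relational refinement mechanism: a relational gated graph attention network learns cache adapter weights from intra-image patch dependencies instead of treating an image embedding as a single vector. Graph refinement is used only during training to distill relational structure into the cache, so inference incurs no extra cost, balancing efficiency and performance.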
Abstract: Few-shot image classification remains difficult under limited supervision and visual domain shift. Recent cache-based adaptation approaches (e.g., Tip-Adapter) address this challenge to some extent by learning lightweight residual adapters over frozen features, yet they still inherit CLIP’s tendency to encode global, general-purpose representations that are not optimally discriminative to adapt the generalist to the specialist’s domain in low-data regimes. We address this limitation with a novel patch-driven relational refinement that learns cache adapter weights from intra-image patch dependencies rather than treating an image embedding as a monolithic vector. Specifically, we introduce a relational gated graph attention network that constructs a patch graph and performs edge-aware attention to emphasize informative inter-patch interactions, producing context-enriched patch embeddings. A learnable multi-aggregation pooling then composes these into compact, task-discriminative representations that better align cache keys with the target few-shot classes. Crucially, the proposed graph refinement is used only during training to distil relational structure into the cache, incurring no additional inference cost beyond standard cache lookup. Final predictions are obtained by a residual fusion of cache similarity scores with CLIP zero-shot logits. Extensive evaluations on 11 benchmarks show consistent gains over state-of-the-art CLIP adapter and cache-based baselines while preserving zero-shot efficiency. We further validate battlefield relevance by introducing an Injured vs. Uninjured Soldier dataset for casualty recognition. It is motivated by the operational need to support triage decisions within the “platinum minutes” and the broader “golden hour” window in time-critical UAV-driven search-and-rescue and combat casualty care.
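The final residual fusion step can be sketched in the style of Tip-Adapter, where cache affinities are computed as exp(-beta * (1 - similarity)) and added to the CLIP zero-shot logits with weight alpha. This is an illustrative sketch only: the paper's exact formulation may differ, features are assumed to be L2-normalized, and `fuse_logits`, `alpha`, and `beta` are hypothetical names and values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_logits(query, cache_keys, cache_labels, clip_logits,
                alpha=1.0, beta=5.5):
    """Residual fusion of cache similarity scores with zero-shot logits.

    query, cache_keys: L2-normalized feature vectors (assumed).
    cache_labels: class index of each cached few-shot example.
    clip_logits: zero-shot logits for the query, one per class.
    """
    cache_logits = [0.0] * len(clip_logits)
    for key, label in zip(cache_keys, cache_labels):
        # Affinity is 1 for an identical key and decays with dissimilarity.
        affinity = math.exp(-beta * (1.0 - dot(query, key)))
        cache_logits[label] += affinity
    return [z + alpha * c for z, c in zip(clip_logits, cache_logits)]

fused = fuse_logits(
    query=[1.0, 0.0],
    cache_keys=[[1.0, 0.0], [0.0, 1.0]],
    cache_labels=[1, 0],
    clip_logits=[0.5, 0.4],
)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the zero-shot logits slightly prefer class 0, but the query matches a cached class-1 example exactly, so the fused prediction flips to class 1.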
[74] Generative Spatiotemporal Data Augmentation cs.CV | cs.LGPDF
Jinfan Zhou, Lixin Luo, Sungmin Eum, Heesung Kwon, Jeong Joon Park
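TL;DR: The paper proposes spatiotemporal data augmentation with video foundation models. Off-the-shelf video diffusion models generate realistic 3D spatial and temporal variations from a given image dataset to diversify the training data. Used as supplemental training data, the synthesized clips consistently improve performance in low-data settings where annotations are scarce, such as UAV-captured imagery.
Details
Motivation: Traditional augmentation (simple geometric transforms or appearance perturbations) cannot diversify camera viewpoints and scene dynamics. The method broadens the data distribution by generating realistic spatiotemporal variations, particularly where annotated data is scarce.
Result: Experiments on COCO subsets and UAV-captured datasets show that the method broadens the data distribution along axes that traditional and prior generative methods underrepresent, and improves model performance in data-scarce regimes.
Insight: The contribution is to use off-the-shelf video diffusion models for spatiotemporal augmentation, generating realistic 3D spatial and temporal variations. The paper also gives practical guidelines for choosing the generative setup, transferring annotations to synthetic frames, and handling disocclusion, that is, regions newly revealed and unlabeled in generated views.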
Abstract: We explore spatiotemporal data augmentation using video foundation models to diversify both camera viewpoints and scene dynamics. Unlike existing approaches based on simple geometric transforms or appearance perturbations, our method leverages off-the-shelf video diffusion models to generate realistic 3D spatial and temporal variations from a given image dataset. Incorporating these synthesized video clips as supplemental training data yields consistent performance gains in low-data settings, such as UAV-captured imagery where annotations are scarce. Beyond empirical improvements, we provide practical guidelines for (i) choosing an appropriate spatiotemporal generative setup, (ii) transferring annotations to synthetic frames, and (iii) addressing disocclusion - regions newly revealed and unlabeled in generated views. Experiments on COCO subsets and UAV-captured datasets show that, when applied judiciously, spatiotemporal augmentation broadens the data distribution along axes underrepresented by traditional and prior generative methods, offering an effective lever for improving model performance in data-scarce regimes.
[75] Animus3D: Text-driven 3D Animation via Motion Score Distillation cs.CV | cs.GR | cs.LGPDF
Qi Sun, Can Wang, Jiaxiang Shang, Wensen Feng, Jing Liao
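TL;DR: Animus3D is a text-driven 3D animation framework. Its novel Motion Score Distillation (MSD) method distills motion knowledge from a pretrained text-to-video diffusion model into a static 3D asset, producing animations that match the text description with substantial, detailed motion.
Details
Motivation: Existing methods mostly use the vanilla Score Distillation Sampling (SDS) objective to distill motion from text-to-video diffusion models, which typically yields animations with minimal movement or noticeable jitter. Animus3D aims to solve these problems and achieve more pronounced, smoother text-driven 3D animation.
Result: Extensive experiments show that Animus3D successfully animates static 3D assets from diverse text prompts. Its motion is significantly more substantial and detailed than that of state-of-the-art baselines while maintaining high visual integrity.
Insight: The main contributions are: (1) Motion Score Distillation (MSD) as an alternative to SDS, which uses a LoRA-enhanced video diffusion model to define a static source distribution and an inversion-based noise estimation technique to preserve appearance; (2) explicit temporal and spatial regularization terms that reduce geometric distortion; (3) a motion refinement module that raises temporal resolution and detail, overcoming the fixed-resolution limit of the underlying video model.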
Abstract: We present Animus3D, a text-driven 3D animation framework that generates a motion field given a static 3D asset and a text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution rather than pure noise as in SDS, while an inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at https://qiisun.github.io/animus3d_page.
[76] Supervised Contrastive Frame Aggregation for Video Representation Learning cs.CV | cs.LGPDF
Shaif Chowdhury, Mushfika Rahman, Greg Hamerly
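TL;DR: The paper proposes a supervised contrastive learning framework for video representation learning. It spatially arranges multiple frames of a video into a single input image, which allows pretrained CNN backbones such as ResNet50 to be used and avoids the computational overhead of complex video transformer models. A contrastive objective directly compares the projections produced by the model.
Details
Motivation: The aim is to exploit temporally global context for video representation learning while avoiding the high computational cost of complex video models, using frame aggregation and contrastive learning to learn effective video representations.
Result: Experiments on Penn Action and HMDB51 show that the method beats existing approaches such as ViVIT in classification accuracy: 76% on Penn Action (ViVIT: 43%) and 48% on HMDB51 (ViVIT: 37%), while requiring fewer computational resources.
Insight: The contributions are a video-to-image frame aggregation strategy that enables pretrained CNN backbones; a supervised contrastive objective that uses label information directly to define positive and negative pairs; and natural views created by different temporal frame samplings, which increase diversity and reduce overfitting. The method applies to both supervised and self-supervised settings.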
Abstract: We propose a supervised contrastive learning framework for video representation learning that leverages temporally global context. We introduce a video-to-image aggregation strategy that spatially arranges multiple frames from each video into a single input image. This design enables the use of pre-trained convolutional neural network backbones such as ResNet50 and avoids the computational overhead of complex video transformer models. We then design a contrastive learning objective that directly compares pairwise projections generated by the model. Positive pairs are defined as projections from videos sharing the same label, while all other projections are treated as negatives. Multiple natural views of the same video are created using different temporal frame samplings. Rather than relying on data augmentation, these frame-level variations produce diverse positive samples with global context and reduce overfitting. Experiments on the Penn Action and HMDB51 datasets demonstrate that the proposed method outperforms existing approaches in classification accuracy while requiring fewer computational resources. The proposed Supervised Contrastive Frame Aggregation method learns effective video representations in both supervised and self-supervised settings and supports video-based tasks such as classification and captioning. The method achieves 76% classification accuracy on Penn Action compared to 43% achieved by ViVIT, and 48% accuracy on HMDB51 compared to 37% achieved by ViVIT.
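The frame aggregation idea can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: frames are nested lists of pixel values, sampling is uniform with a shiftable offset to create different "natural views", and `sample_indices` and `tile_frames` are hypothetical helper names.

```python
def sample_indices(num_frames, k, offset=0):
    """Pick k roughly evenly spaced frame indices. Changing the offset
    yields a different temporal sampling (view) of the same video."""
    stride = max(1, num_frames // k)
    return [(offset + i * stride) % num_frames for i in range(k)]

def tile_frames(frames, rows, cols):
    """Arrange rows*cols frames (each an H x W nested list) into one
    (rows*H) x (cols*W) image, filled left to right, top to bottom."""
    assert len(frames) == rows * cols
    height = len(frames[0])
    image = []
    for r in range(rows):
        for y in range(height):
            line = []
            for c in range(cols):
                line.extend(frames[r * cols + c][y])
            image.append(line)
    return image

# A toy "video" of 16 frames, each a 2x2 image filled with its frame index.
video = [[[t, t], [t, t]] for t in range(16)]
view_a = tile_frames([video[i] for i in sample_indices(16, 4)], 2, 2)
view_b = tile_frames([video[i] for i in sample_indices(16, 4, offset=1)], 2, 2)
```

The two tiled images come from the same video and would share a label, so under the scheme described in the abstract their projections would form a positive pair.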
[77] StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding cs.CV | cs.AIPDF
Xinqi Jin, Hanxun Yu, Bohan Yu, Kebin Liu, Jian Liu
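TL;DR: The paper proposes StreamingAssistant, an efficient visual token pruning method for accelerating online video understanding. It introduces a new redundancy metric, MSSAVT (Maximum Similarity to Spatially Adjacent Video Tokens), and a masked pruning strategy, reducing compute and memory overhead while retaining key information and improving accuracy on multiple benchmarks.
Details
Motivation: Applying multimodal large language models (MLLMs) to online video understanding (for example public surveillance and AI glasses) leads to high GPU memory use and computational latency because of the large number of video frames. A method is needed that shortens the context while keeping critical information.
Result: On multiple online and offline video understanding benchmarks, the method improves accuracy by up to 4% with negligible pruning latency (less than 1 ms).
Insight: The contributions are: (1) the MSSAVT redundancy metric, which accounts for both token similarity and spatial position; (2) a masked pruning strategy that resolves the bidirectional dependency between pruning and redundancy; (3) integration with an existing temporal-redundancy-based pruning method to remove temporal redundancy in video. Viewed objectively, the method achieves efficient, low-latency token compression through joint optimization along spatial and temporal dimensions.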
Abstract: Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in high GPU memory usage and computational latency. To address these challenges, we propose token pruning as a means to reduce context length while retaining critical information. Specifically, we introduce a novel redundancy metric, Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT), which accounts for both token similarity and spatial position. To mitigate the bidirectional dependency between pruning and redundancy, we further design a masked pruning strategy that ensures only mutually unadjacent tokens are pruned. We also integrate an existing temporal redundancy-based pruning method to eliminate temporal redundancy of the video modality. Experimental results on multiple online and offline video understanding benchmarks demonstrate that our method significantly improves the accuracy (i.e., by 4% at most) while incurring a negligible pruning latency (i.e., less than 1ms). Our full implementation will be made publicly available.
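The metric and the masked pruning rule might be sketched as below. This is a guess at the mechanics from the abstract alone: 4-neighbor adjacency, cosine similarity, a fixed threshold, and a greedy ordering are all assumptions, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def neighbors(i, j, h, w):
    """4-connected spatial neighbors inside an h x w token grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < h and 0 <= nj < w:
            yield ni, nj

def mssavt(grid):
    """Per-token redundancy: max similarity to any spatially adjacent token."""
    h, w = len(grid), len(grid[0])
    return [[max((cosine(grid[i][j], grid[ni][nj])
                  for ni, nj in neighbors(i, j, h, w)), default=0.0)
             for j in range(w)] for i in range(h)]

def masked_prune(grid, threshold=0.9):
    """Greedily prune the most redundant tokens, skipping any token that
    is adjacent to one already pruned, so pruned tokens are never
    mutually adjacent."""
    h, w = len(grid), len(grid[0])
    scores = mssavt(grid)
    order = sorted(((scores[i][j], i, j)
                    for i in range(h) for j in range(w)), reverse=True)
    pruned = set()
    for score, i, j in order:
        if score < threshold:
            break
        if any(n in pruned for n in neighbors(i, j, h, w)):
            continue
        pruned.add((i, j))
    return pruned

# A 2x2 grid of identical tokens: every token is fully redundant.
grid = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
pruned = masked_prune(grid)
```

The non-adjacency constraint matters because a token is only redundant relative to a neighbor that survives; pruning both members of a similar pair would discard the information entirely.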
[78] From Tokens to Photons: Test-Time Physical Prompting for Vision-Language Models cs.CVPDF
Boyeong Im, Wooseok Lee, Yoojin Kwon, Hyung-Sin Kim
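TL;DR: The paper proposes MVP, a framework that treats the camera exposure triangle (ISO, shutter speed, aperture) as physical prompts. At inference it captures multiple physical views, applies lightweight digital augmentation with low-entropy filtering, and aggregates predictions by hard voting. This test-time adaptation needs no gradients or model modifications and markedly improves the robustness of vision-language models in physical environments.
Details
Motivation: The goal is to extend vision-language models from web images to sensor-mediated physical environments, addressing the fact that conventional test-time adaptation relies only on digital augmentation and ignores control over the physical view.
Result: On ImageNet-ES and ImageNet-ES-Diverse, MVP outperforms digital-only test-time adaptation by up to 25.6 percentage points and adds up to 3.4 percentage points over pipelines that combine conventional sensor control with test-time adaptation. It remains effective when the parameter candidate set is reduced to lower capture latency.
Insight: The contribution is to use camera exposure parameters as physical prompts: selecting and combining real physical views (measurement-time control), rather than relying only on post-capture prompting, substantially improves robustness. The select-then-vote design is simple, easy to calibrate, and requires no model modification, giving a practical framework for deploying vision-language models in physical settings.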
Abstract: To extend the application of vision-language models (VLMs) from web images to sensor-mediated physical environments, we propose Multi-View Physical-prompt for Test-Time Adaptation (MVP), a forward-only framework that moves test-time adaptation (TTA) from tokens to photons by treating the camera exposure triangle–ISO, shutter speed, and aperture–as physical prompts. At inference, MVP acquires a library of physical views per scene, selects the top-k sensor settings using a source-affinity score, evaluates each retained view under lightweight digital augmentations, filters the lowest-entropy subset of augmented views, and aggregates predictions with Zero-temperature softmax (i.e., hard voting). This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP consistently outperforms digital-only TTA on single Auto-Exposure captures, by up to 25.6 percentage points (pp), and delivers up to 3.4 pp additional gains over pipelines that combine conventional sensor control with TTA. MVP remains effective under reduced parameter candidate sets that lower capture latency, demonstrating practicality. These results support the main claim that, beyond post-capture prompting, measurement-time control–selecting and combining real physical views–substantially improves robustness for VLMs.
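The last two steps of the pipeline, entropy filtering and hard voting, can be sketched in a few lines. This sketch assumes that "filters the lowest-entropy subset" means keeping the most confident augmented views; the source-affinity score used to select sensor settings is not specified in the abstract and is left out. `select_and_vote` is a hypothetical name.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_and_vote(prob_views, keep):
    """Keep the `keep` lowest-entropy views, then take a majority vote
    over their argmax predictions (hard voting)."""
    confident = sorted(prob_views, key=entropy)[:keep]
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in confident)
    return votes.most_common(1)[0][0]

# Class probabilities predicted for four augmented physical views.
views = [
    [0.90, 0.05, 0.05],   # confident in class 0
    [0.80, 0.10, 0.10],   # confident in class 0
    [0.34, 0.33, 0.33],   # near-uniform, filtered out
    [0.10, 0.85, 0.05],   # confident in class 1
]
prediction = select_and_vote(views, keep=3)
```

Because the procedure only sorts and counts forward-pass outputs, it needs no gradients, which matches the forward-only framing in the abstract.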
[79] Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space cs.CV | cs.CLPDF
Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu
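TL;DR: The paper proposes DMLR, a dynamic multimodal latent reasoning framework. It refines latent think tokens with confidence-guided latent policy gradient optimization and introduces a dynamic visual injection strategy that interleaves vision and text during reasoning, improving the performance and efficiency of multimodal large language models on reasoning and perception tasks.
Details
Motivation: Existing multimodal large language models depend on explicit step-by-step reasoning, which suffers from unstable perception-reasoning interaction and high computational overhead. Inspired by the dynamic interleaving of reasoning and perception in human cognition, the paper aims for more efficient and stable multimodal reasoning.
Result: Experiments on seven multimodal reasoning benchmarks and several model architectures show that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
Insight: The contribution is to move reasoning from the explicit semantic space into the latent space and to interleave vision and text dynamically through visual injection, which reduces computational overhead and stabilizes the interaction. Viewed objectively, the method offers an efficient paradigm for multimodal reasoning that is closer to human cognition.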
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning and suffer from unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into the latent think tokens to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
[80] StegaVAR: Privacy-Preserving Video Action Recognition via Steganographic Domain Analysis cs.CVPDF
Lixin Chen, Chaomeng Chen, Jiale Zhou, Zhijian Wu, Xun Lin
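TL;DR: The paper proposes StegaVAR, the first framework that embeds action videos into ordinary cover videos and performs video action recognition directly in the steganographic domain, addressing the low concealment and spatiotemporal disruption of existing privacy-preserving methods.
Details
Motivation: Privacy leakage is a concern in video action recognition. Existing anonymization methods have poor concealment (visual distortion during transmission attracts attackers' attention) and disrupt spatiotemporal features (degrading the features needed for accurate VAR).
Result: Experiments show that StegaVAR achieves superior video action recognition and privacy-preserving performance on widely used datasets, and that the framework is effective with multiple steganographic models.
Insight: The contributions are performing video action recognition directly in the steganographic domain for the first time; Secret Spatio-Temporal Promotion (STeP), which uses the secret video to guide spatiotemporal feature extraction in the steganographic domain; and Cross-Band Difference Attention (CroDA), which suppresses cover interference by capturing cross-band semantic differences.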
Abstract: Despite the rapid progress of deep learning in video action recognition (VAR) in recent years, privacy leakage in videos remains a critical concern. Current state-of-the-art privacy-preserving methods often rely on anonymization. These methods suffer from (1) low concealment, because they produce visually distorted videos that attract attackers' attention during transmission, and (2) spatiotemporal disruption, because they degrade the spatiotemporal features essential for accurate VAR. To address these issues, we propose StegaVAR, a novel framework that embeds action videos into ordinary cover videos and, for the first time, performs VAR directly in the steganographic domain. Throughout both data transmission and action analysis, the spatiotemporal information of the hidden secret video remains complete, while the natural appearance of the cover videos ensures the concealment of transmission. Considering the difficulty of steganographic domain analysis, we propose Secret Spatio-Temporal Promotion (STeP) and Cross-Band Difference Attention (CroDA). STeP uses the secret video to guide spatiotemporal feature extraction in the steganographic domain during training. CroDA suppresses cover interference by capturing cross-band semantic differences. Experiments demonstrate that StegaVAR achieves superior VAR and privacy-preserving performance on widely used datasets. Moreover, our framework is effective for multiple steganographic models.
[81] Automatic Wire-Harness Color Sequence Detector cs.CV | eess.IVPDF
Indiwara Nanayakkara, Dehan Jayawickrama, Mervyn Parakrama B. Ekanayake
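TL;DR: This paper proposes a semi-automated machine vision system that checks wire order, connector polarity, and color sequence in wire harnesses. The system integrates five industrial CMOS cameras and uses a color sequence classifier based on comparing HSV and RGB color-domain values. In real-world deployment it achieved 100% detection accuracy and reduced inspection time by 44%.
Details
Motivation: In the modern Electronics Manufacturing Services (EMS) industry, wire harness inspection remains labor-intensive and error-prone, so an automated solution is needed to improve inspection efficiency and accuracy.
Result: After deployment at GPV Lanka Pvt. Ltd., the system achieved 100% detection accuracy and reduced inspection time by 44% compared to manual methods.
Insight: The innovation lies in combining a multi-camera modular mechanical framework with a classifier based on HSV/RGB color-domain value comparison, enabling automated inspection of both linear and circular harness configurations. Training on a small number of reference samples makes the trained file reusable and improves the reliability and efficiency of industrial inspection.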
Abstract: Wire harness inspection remains a labor-intensive, error-prone process in the modern Electronics Manufacturing Services (EMS) industry. This paper introduces a semiautomated machine vision system capable of verifying correct wire positioning, connector polarity, and color sequences for both linear and circular wire harness configurations. Five industrial-standard CMOS cameras are integrated into a modularized mechanical framework, and a color sequence classifier based on comparing HSV and RGB color-domain values is used in operation. For each harness batch, a user can train the system using at least five reference samples; the trained file is stored and reused for similar harness types. The solution is deployed at GPV Lanka Pvt. Ltd. (Fig. 2), where the system achieved 100% detection accuracy and reduced inspection time by 44% compared to manual methods. Additional features include user management, adjustable lighting, session data storage, and secure login. Results from real-world use demonstrate that this approach delivers reliable and efficient inspection capabilities.
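The color-domain comparison described in the abstract above could look roughly like the following. This is a minimal sketch, not the authors' classifier: the reference colors, the hue weighting in `hsv_distance`, and the function names are all assumptions made for illustration, and the abstract does not specify thresholds or how the RGB comparison is combined with HSV.

```python
import colorsys

# Hypothetical reference colors (RGB, 0-255). A real system would derive
# these from the reference harness samples used for training.
REFERENCE_RGB = {
    "red": (200, 30, 30),
    "green": (30, 160, 60),
    "blue": (30, 60, 200),
    "yellow": (220, 210, 40),
}


def to_hsv(rgb):
    """Convert an (R, G, B) tuple in 0-255 to HSV components in 0-1."""
    r, g, b = (c / 255.0 for c in rgb)
    return colorsys.rgb_to_hsv(r, g, b)


def hsv_distance(a, b):
    """Distance between two HSV colors; hue is treated as circular."""
    dh = abs(a[0] - b[0])
    dh = min(dh, 1.0 - dh)  # hue wraps around at 1.0
    ds = a[1] - b[1]
    dv = a[2] - b[2]
    # The factor 4.0 on hue is an assumed weighting, not from the paper.
    return (4.0 * dh * dh + ds * ds + dv * dv) ** 0.5


def classify_wire(rgb):
    """Return the name of the nearest reference color in HSV space."""
    hsv = to_hsv(rgb)
    return min(
        REFERENCE_RGB,
        key=lambda name: hsv_distance(hsv, to_hsv(REFERENCE_RGB[name])),
    )


def check_sequence(measured_rgbs, expected_names):
    """Classify each wire and compare the result with the expected order."""
    detected = [classify_wire(rgb) for rgb in measured_rgbs]
    return detected == list(expected_names), detected
```

Calling `check_sequence` with the mean color of each wire region and the expected color names returns a pass/fail flag together with the detected sequence, which makes a mismatch easy to report to an operator.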
[82] Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation cs.CVPDF
Karthikeya KV
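TL;DR: This work proposes a framework that combines vision-enhanced large language models with advanced transformer architectures to tackle high-resolution image synthesis and multimodal data understanding. The model uses a rectified flow mechanism for efficient, high-quality generation and a bidirectional tokenization strategy to fuse text, image, and video inputs. With spatial-temporal feature embedding and hybrid text-image sequence modeling, it reports gains in synthesized image fidelity and in the coherence of multimodal representations.
Details
Motivation: To address the efficiency and quality challenges of high-resolution image synthesis and multimodal data understanding by integrating vision-enhanced LLMs with transformer architectures, improving generative performance and unified cross-modal understanding.
Result: Evaluations on benchmark datasets show a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods, indicating potential for applications such as autonomous systems, creative content generation, and advanced video analysis.
Insight: The innovations include a rectified flow mechanism for generation along linear paths, a bidirectional tokenization strategy for merging modalities, and a noise-aware learning algorithm for optimizing the architecture. Together these improve generation efficiency, adaptability to data, and cross-modal consistency.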
Abstract: This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data interpretation. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. A bidirectional tokenization strategy is employed to seamlessly merge inputs from text, image, and video modalities, fostering a unified understanding across diverse data types. By embedding spatial-temporal features and leveraging a hybrid text-image sequence modeling approach, the framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations. The architecture is optimized with a noise-aware learning algorithm, addressing discrepancies in noisy data distributions and improving generative performance under varying input conditions. Rigorous evaluations on benchmark datasets demonstrate a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods. Furthermore, the model exhibits robust scalability and adaptability, showcasing its potential in applications like autonomous systems, creative content generation, and advanced video analysis. This work underscores the role of vision-centric LLMs in redefining capabilities in computer vision and multimodal artificial intelligence.
[83] Content-Aware Ad Banner Layout Generation with Two-Stage Chain-of-Thought in Vision Language Models cs.CV | cs.AIPDF
Kei Yoshitake, Kento Hosono, Ken Kobayashi, Kazuhide Nakata
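TL;DR: This paper proposes a method for generating image advertisement layouts with a vision-language model (VLM). In a two-stage chain-of-thought process, the VLM first analyzes the background image to identify object types and spatial relationships and produces a text-based "placement plan"; the plan is then rendered into the final layout as HTML.
Details
Motivation: Conventional ad layout techniques rely mainly on saliency mapping to detect salient regions in the background image, which often fails to account for the image's detailed composition and semantic content. This paper aims to overcome that limitation by using a VLM's understanding of image content to guide the placement of text and logos.
Result: Evaluation experiments compare the method quantitatively and qualitatively with existing approaches. The results show that explicitly considering the background image's content yields noticeably higher-quality ad layouts.
Insight: The innovation is bringing the semantic understanding of VLMs to ad layout generation through a two-stage chain-of-thought process (analyze the image to produce a plan, then render the plan as a layout), achieving content-aware layout generation. It offers a new approach to combining visual understanding with structured output generation.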
Abstract: In this paper, we propose a method for generating layouts for image-based advertisements by leveraging a Vision-Language Model (VLM). Conventional advertisement layout techniques have predominantly relied on saliency mapping to detect salient regions within a background image, but such approaches often fail to fully account for the image’s detailed composition and semantic content. To overcome this limitation, our method harnesses a VLM to recognize the products and other elements depicted in the background and to inform the placement of text and logos. The proposed layout-generation pipeline consists of two steps. In the first step, the VLM analyzes the image to identify object types and their spatial relationships, then produces a text-based “placement plan” based on this analysis. In the second step, that plan is rendered into the final layout by generating HTML-format code. We validated the effectiveness of our approach through evaluation experiments, conducting both quantitative and qualitative comparisons against existing methods. The results demonstrate that by explicitly considering the background image’s content, our method produces noticeably higher-quality advertisement layouts.
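The second stage described in the abstract above, turning a placement plan into HTML, might be sketched as follows. This is only an illustration under stated assumptions: the plan format (a list of dicts with `role`, `text`, and percentage-based `left`, `top`, `width`) and the function name `render_layout` are hypothetical, and in the paper the VLM itself generates the HTML from a text-based plan.

```python
from html import escape


def render_layout(background_src, plan):
    """Render a placement plan as absolutely positioned HTML boxes."""
    lines = [
        '<div class="banner" style="position:relative">',
        f'  <img src="{escape(background_src, quote=True)}" alt="">',
    ]
    for item in plan:
        # Positions are percentages of the banner size (assumed convention).
        style = (
            f"position:absolute;left:{item['left']}%;"
            f"top:{item['top']}%;width:{item['width']}%"
        )
        role = escape(item["role"], quote=True)
        text = escape(item["text"])  # escape user-facing copy
        lines.append(f'  <div class="{role}" style="{style}">{text}</div>')
    lines.append("</div>")
    return "\n".join(lines)


# Hypothetical plan, standing in for the output of the first stage.
plan = [
    {"role": "headline", "text": "Summer sale <today>",
     "left": 5, "top": 10, "width": 40},
    {"role": "logo", "text": "ACME", "left": 70, "top": 80, "width": 20},
]
html_out = render_layout("banner.jpg", plan)
```

Keeping the plan as structured data before rendering makes it possible to validate positions (for example, that boxes stay inside the banner) before any HTML is produced.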
[84] Geometry-Aware Scene-Consistent Image Generation cs.CVPDF
Cong Xie, Che Wang, Yan Zhang, Zheng Pan, Han Zou
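TL;DR: This paper studies geometry-aware scene-consistent image generation: given a reference scene image and a text condition specifying an entity to generate and its spatial relation to the scene, the goal is to synthesize an image that preserves the reference scene's physical environment while correctly generating the entity according to the described spatial relation. Existing methods struggle to balance scene preservation with prompt adherence, either replicating the scene with high fidelity but responding poorly to the prompt, or following the prompt at the expense of scene consistency. To resolve this trade-off, the paper introduces two key contributions: a scene-consistent data construction pipeline and a geometry-guided attention loss.
Details
Motivation: Existing methods for scene-consistent image generation struggle to balance scene preservation with prompt adherence. The aim is image synthesis that both preserves the physical environment and responds accurately to the described spatial relation.
Result: On the scene-consistent benchmark, the method outperforms state-of-the-art baselines in scene alignment and text-image consistency according to both automatic metrics and human preference studies, producing diverse, geometrically coherent images that remain faithful to the text instructions and the scene structure.
Insight: The innovations are a scene-consistent data construction pipeline that generates diverse, geometrically grounded training pairs, and a geometry-guided attention loss that uses cross-view cues to regularize the model's spatial reasoning. Viewed objectively, these contributions address the scene-versus-prompt trade-off through data construction and loss design.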
Abstract: We study geometry-aware scene-consistent image generation: given a reference scene image and a text condition specifying an entity to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same physical environment as the reference scene while correctly generating the entity according to the spatial relation described in the text. Existing methods struggle to balance scene preservation with prompt adherence: they either replicate the scene with high fidelity but poor responsiveness to the prompt, or prioritize prompt compliance at the expense of scene consistency. To resolve this trade-off, we introduce two key contributions: (i) a scene-consistent data construction pipeline that generates diverse, geometrically-grounded training pairs, and (ii) a novel geometry-guided attention loss that leverages cross-view cues to regularize the model’s spatial reasoning. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image consistency than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method produces geometrically coherent images with diverse compositions that remain faithful to the textual instructions and the underlying scene structure.
[85] Efficient Vision-Language Reasoning via Adaptive Token Pruning cs.CV | cs.CL | cs.LGPDF
Xue Li, Xiaonan Song, Henry Hu
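TL;DR: This paper proposes Adaptive Token Pruning (ATP), a dynamic inference mechanism that reduces the computational cost of vision-language models (VLMs). ATP operates at the vision-language interface and uses a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep only the top-K most informative tokens for the large language model (LLM). Without modifying the backbone, it achieves roughly a 40% reduction in FLOPs and about a 1.5x end-to-end latency speedup with negligible accuracy loss (less than 1%).
Details
Motivation: Existing VLMs process all tokens uniformly, which is computationally inefficient and hinders real-world deployment. This paper aims to reduce the high computational demands of VLMs by dynamically pruning unimportant tokens to improve inference efficiency.
Result: Preliminary evaluations on VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and speeds up end-to-end latency by about 1.5x with less than 1% accuracy loss. Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. In robustness tests under corruptions, adaptive pruning suppresses spurious correlations and improves stability.
Insight: The innovation is a lightweight adaptive token pruning gating module that dynamically scores token importance (combining intra-modal saliency and inter-modal relevance), enabling input-adaptive efficient inference. The method is compatible with popular backbones such as BLIP-2, LLaVA, and Flamingo, suggests that resource-constrained inference and model reliability are not competing objectives, and offers ideas for efficient multimodal edge computing.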
Abstract: Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP’s role in efficient multimodal edge computing pipelines.
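The hybrid scoring and top-K selection described in the abstract above can be sketched in a few lines. This is an illustrative sketch only: the min-max normalization, the linear mixing weight `alpha`, and the function names are assumptions, since the abstract does not state how the two signals are scaled or combined.

```python
def normalize(values):
    """Min-max normalize a list of scores to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def prune_tokens(cls_attention, text_similarity, k, alpha=0.5):
    """Return the indices of the k visual tokens with the highest hybrid score.

    cls_attention:   per-token CLS attention weights (intra-modal saliency)
    text_similarity: per-token text-image similarity (inter-modal relevance)
    alpha:           assumed mixing weight between the two signals
    """
    if len(cls_attention) != len(text_similarity):
        raise ValueError("score lists must have the same length")
    saliency = normalize(cls_attention)
    relevance = normalize(text_similarity)
    scores = [
        alpha * s + (1.0 - alpha) * r for s, r in zip(saliency, relevance)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:k])
```

Because the kept indices are returned in their original order, the surviving tokens can be gathered from the visual sequence and passed to the LLM without reordering.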
[86] No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching cs.CVPDF
Tingyan Wen, Haoyu Li, Yihuang Chen, Xing Zhou, Lifei Zhu
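TL;DR: This paper proposes X-Slim (eXtreme-Slimming Caching), a training-free, cache-based accelerator for diffusion models. With a dual-threshold controller, it is the first to jointly exploit cacheable redundancy across timesteps, structure (blocks), and space (tokens), turning caching into a "push-then-polish" process that substantially reduces inference latency while preserving generation quality.
Details
Motivation: Diffusion models achieve high generative quality, but computational cost grows with step count, model depth, and sequence length. Existing feature caching methods face a trade-off: aggressive timestep reuse gives large speedups but easily harms fidelity, while block- or token-level reuse is safer but saves little computation.
Result: On FLUX.1-dev and HunyuanVideo, X-Slim reduces latency by up to 4.97x and 3.52x respectively with minimal perceptual loss. On DiT-XL/2, it reaches 3.13x acceleration and improves FID (Fréchet Inception Distance) by 2.42 over prior methods, advancing the speed-quality frontier.
Insight: The main innovation is the first caching framework to jointly exploit redundancy along the timestep, block, and token dimensions, with a dual-threshold controller that implements the "push-then-polish" caching strategy. Its context-aware caching decisions and layered error control offer a new approach to efficient diffusion model inference.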
Abstract: Diffusion models achieve remarkable generative quality, but computational overhead scales with step count, model depth, and sequence length. Feature caching is effective since adjacent timesteps yield highly similar features. However, an inherent trade-off remains: aggressive timestep reuse offers large speedups but can easily cross the critical line, hurting fidelity, while block- or token-level reuse is safer but yields limited computational savings. We present X-Slim (eXtreme-Slimming Caching), a training-free, cache-based accelerator that, to our knowledge, is the first unified framework to exploit cacheable redundancy across timesteps, structure (blocks), and space (tokens). Rather than simply mixing levels, X-Slim introduces a dual-threshold controller that turns caching into a push-then-polish process: it first pushes reuse at the timestep level up to an early-warning line, then switches to lightweight block- and token-level refresh to polish the remaining redundancy, and triggers full inference once the critical line is crossed to reset accumulated error. At each level, context-aware indicators decide when and where to cache. Across diverse tasks, X-Slim advances the speed-quality frontier. On FLUX.1-dev and HunyuanVideo, it reduces latency by up to 4.97x and 3.52x with minimal perceptual loss. On DiT-XL/2, it reaches 3.13x acceleration and improves FID by 2.42 over prior methods.
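The push-then-polish control flow in the abstract above can be illustrated with a toy scheduler. This is a simplified sketch under assumptions, not X-Slim itself: the scalar per-step change indicator, the threshold values, and the names `choose_action` and `schedule` are hypothetical, and here only full inference resets the accumulated error.

```python
def choose_action(accumulated_error, warn_line, critical_line):
    """Map the accumulated error to one of three caching actions."""
    if accumulated_error >= critical_line:
        return "full"             # run full inference and reset the error
    if accumulated_error >= warn_line:
        return "partial_refresh"  # lightweight block/token-level refresh
    return "reuse_timestep"       # reuse cached features for the whole step


def schedule(step_changes, warn_line=0.3, critical_line=0.6):
    """Decide an action for every denoising step.

    step_changes holds a hypothetical per-step indicator of how much the
    features are estimated to have drifted since the last computation.
    """
    actions = []
    error = 0.0
    for change in step_changes:
        error += change
        action = choose_action(error, warn_line, critical_line)
        if action == "full":
            error = 0.0  # full inference clears the accumulated error
        actions.append(action)
    return actions
```

With `schedule([0.1, 0.1, 0.2, 0.1, 0.2, 0.1])`, the first two steps reuse the cache, the next two fall between the two lines and trigger partial refresh, the fifth crosses the critical line and runs full inference, and the sixth starts reusing again.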
[87] D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation cs.CV | cs.ROPDF
Zihan Wang, Seungjun Lee, Guangzhao Dai, Gim Hee Lee
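TL;DR: This paper proposes the Dynamic 3D Vision-Language-Planning Model (D3D-VLP) to address a dilemma in embodied intelligence: end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. The model unifies planning, grounding, navigation, and question answering through a dynamic 3D chain-of-thought, and learns from large-scale hybrid data with a Synergistic Learning from Fragmented Supervision strategy.
Details
Motivation: To overcome the poor interpretability and lack of explicit 3D reasoning of end-to-end models in embodied intelligence, and the tendency of modular systems to ignore synergies between components.
Result: The model achieves state-of-the-art results on multiple benchmarks, including vision-and-language navigation (R2R-CE, REVERIE-CE, NavRAG-CE), object-goal navigation (HM3D-OVON), and task-oriented sequential grounding and navigation (SG3D). Real-world mobile manipulation experiments further validate its effectiveness.
Insight: The innovations are a dynamic 3D chain-of-thought that unifies multiple tasks in a single 3D vision-language model pipeline, and a Synergistic Learning from Fragmented Supervision strategy that uses a masked autoregressive loss to let components reinforce and implicitly supervise each other when learning from partially annotated hybrid data.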
Abstract: Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning methods such as RL and DAgger. Our D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate the effectiveness.
[88] SignRAG: A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition cs.CV | cs.AI | cs.CL | cs.IR | cs.ROPDF
Minghao Zhu, Zhihao Zhang, Anmol Sidhu, Keith Redmill
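TL;DR: This paper proposes SignRAG, a zero-shot road sign recognition system built on the retrieval-augmented generation paradigm. It combines a vision-language model that generates an image description, a vector database that retrieves candidate signs, and a large language model that performs fine-grained reasoning, addressing the challenges traditional deep learning faces from the large number of sign classes and the shortage of labeled data.
Details
Motivation: Traditional deep learning methods struggle with the very large number of road sign classes and the impracticality of building exhaustive labeled datasets, so a system is needed that achieves scalable and accurate road sign recognition without task-specific training.
Result: Validated on a comprehensive set of 303 regulatory signs from the Ohio MUTCD, the method reaches 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data, demonstrating its effectiveness.
Insight: The innovation is applying the retrieval-augmented generation paradigm to road sign recognition for the first time, combining a vision-language model with a large language model for zero-shot recognition. This avoids the reliance of traditional methods on large labeled datasets and offers a new approach for scalable intelligent transportation systems.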
Abstract: Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework’s effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
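The retrieval step in the abstract above, selecting a small set of candidate signs for the LLM to reason over, can be sketched as a nearest-neighbor lookup. This is a toy sketch under assumptions: the three-dimensional vectors stand in for real text embeddings, the cosine-similarity ranking and the names `cosine` and `retrieve_candidates` are illustrative, and the VLM description and LLM reasoning stages are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_candidates(query_vec, reference_db, k=3):
    """Return the names of the k reference signs most similar to the query.

    query_vec:    embedding of the VLM-generated sign description
    reference_db: dict mapping sign name -> embedding of its reference design
    """
    ranked = sorted(
        reference_db.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy reference database with hypothetical embeddings.
reference_db = {
    "STOP": [1.0, 0.0, 0.0],
    "YIELD": [0.0, 1.0, 0.0],
    "SPEED LIMIT": [0.0, 0.0, 1.0],
}
candidates = retrieve_candidates([0.9, 0.1, 0.0], reference_db, k=2)
```

Restricting the LLM to a short candidate list keeps the final reasoning prompt small even when the reference database covers hundreds of sign designs.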
[89] DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model cs.CV | cs.AIPDF
Zhou Tao, Shida Wang, Yongxiang Hua, Haoyu Cao, Linli Xu
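TL;DR: This paper proposes DiG (Differential Grounding), a framework that strengthens the fine-grained visual perception of multimodal large language models (MLLMs) by having them identify and localize all differences between similar image pairs without prior knowledge of how many there are. Training data is built with an automated 3D rendering-based generation pipeline, and curriculum learning raises task complexity from single to multiple differences for stable optimization. Experiments show that DiG significantly improves MLLM performance on multiple visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks.
Details
Motivation: Multimodal large language models perform well on many vision-language tasks, but their fine-grained visual perception and precise spatial reasoning remain limited. A scalable method is needed to improve the model's perception of and reasoning about detail.
Result: DiG yields significant gains on referring expression comprehension datasets such as RefCOCO, RefCOCO+, and RefCOCOg, as well as on general multimodal perception benchmarks, demonstrating its effectiveness at enhancing fine-grained visual perception.
Insight: The innovation is proposing differential grounding as a proxy task that teaches fine-grained perception by localizing differences without being told their number. Scalable 3D rendering-based data generation and a curriculum learning strategy address the sparsity of difference signals and training stability, providing a scalable and robust route to better visual reasoning in MLLMs.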
Abstract: Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.
[90] CogDoc: Towards Unified thinking in Documents cs.CVPDF
Qixin Xu, Haozhe Wang, Che Liu, Fangzhen Lin, Wenhu Chen
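TL;DR: This paper proposes CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes to resolve the trade-off in document understanding between scalability and fidelity to fine-grained multimodal detail. The framework has a low-resolution "Fast Reading" phase for information localization and a high-resolution "Focused Thinking" phase for deep reasoning.
Details
Motivation: Current document reasoning paradigms face a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained multimodal details). This paper aims to bridge that gap with a unified thinking framework.
Result: On challenging, visually rich document benchmarks, the 7B-parameter CogDoc model achieves state-of-the-art performance within its parameter class and notably surpasses significantly larger proprietary models such as GPT-4o.
Insight: The innovations are: 1) a unified coarse-to-fine thinking framework that mimics human cognition; 2) the finding that direct reinforcement learning outperforms reinforcement learning initialized with supervised fine-tuning, avoiding a "policy conflict"; 3) evidence that small models can surpass large models on complex document understanding tasks.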
Abstract: Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution “Fast Reading” phase for scalable information localization, followed by a high-resolution “Focused Thinking” phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the “policy conflict” observed in SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.
[91] Heart Disease Prediction using Case Based Reasoning (CBR) cs.CV | cs.CLPDF
Mohaiminul Islam Bhuiyan, Chan Hue Wah, Nur Shazwani Kamarudin, Nur Hafieza Ismail, Ahmad Fakhri Ab Nasir
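TL;DR: This study examines heart disease prediction with intelligent systems, comparing three techniques (Fuzzy Logic, Neural Networks, and Case-Based Reasoning) and ultimately selecting CBR for prediction. After data pre-processing and splitting, CBR reached 97.95% accuracy in heart disease prediction; the study also analyzes gender differences and risk factors.
Details
Motivation: Traditional medical practice relies on a doctor's experience and often lacks predictive precision, so intelligent systems are applied as an alternative to improve the accuracy of heart disease prediction.
Result: On the heart disease dataset, Case-Based Reasoning (CBR) achieved 97.95% accuracy. The analysis found a heart disease probability of 57.76% for males and 42.24% for females.
Insight: The contribution is comparing several intelligent system methods and selecting CBR, with emphasis on the importance of data pre-processing. Related studies cited by the authors suggest that risk factors such as smoking and alcohol consumption are significant contributors to heart disease, particularly among males.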
Abstract: This study provides an overview of heart disease prediction using an intelligent system. Predicting disease accurately is crucial in the medical field, but traditional methods relying solely on a doctor’s experience often lack precision. To address this limitation, intelligent systems are applied as an alternative to traditional approaches. While various intelligent system methods exist, this study focuses on three: Fuzzy Logic, Neural Networks, and Case-Based Reasoning (CBR). A comparison of these techniques in terms of accuracy was conducted, and ultimately, Case-Based Reasoning (CBR) was selected for heart disease prediction. In the prediction phase, the heart disease dataset underwent data pre-processing to clean the data and data splitting to separate it into training and testing sets. The chosen intelligent system was then employed to predict heart disease outcomes based on the processed data. The experiment concluded with Case-Based Reasoning (CBR) achieving a notable accuracy rate of 97.95% in predicting heart disease. The findings also revealed that the probability of heart disease was 57.76% for males and 42.24% for females. Further analysis from related studies suggests that factors such as smoking and alcohol consumption are significant contributors to heart disease, particularly among males.
[92] Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning cs.CVPDF
Haiyang Zheng, Nan Pu, Wenjing Li, Teng Long, Nicu Sebe
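TL;DR: This paper proposes Confidence-Aware Asymmetric Learning (CAL), a framework that addresses two key challenges in Open-World DeepFake Attribution (OW-DFA): confidence skew that makes pseudo-labels for novel forgery types unreliable, and the unrealistic assumption that the number of unknown forgery types is known. CAL balances model confidence across known and novel forgery types through Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR), and adds a Dynamic Prototype Pruning (DPP) strategy that automatically estimates the number of novel forgery types.
Details
Motivation: Existing OW-DFA methods have two key limitations: 1) confidence skew leads to unreliable pseudo-labels for novel forgery types and biased training; 2) they unrealistically assume the number of unknown forgery types is known a priori. This paper aims to solve these problems to improve scalability and performance in real-world open-world scenarios.
Result: Extensive experiments on the standard OW-DFA benchmark and a newly extended benchmark incorporating advanced manipulations show that CAL consistently outperforms previous methods, achieving new state-of-the-art (SOTA) performance on attribution of both known and novel forgery types.
Insight: The main innovations are: 1) a mutually reinforcing loop formed by CCR and ACR that adaptively balances model confidence across known and novel forgery types, mitigating pseudo-label bias; 2) the DPP strategy, which estimates the number of novel forgery types in a coarse-to-fine manner without unrealistic prior assumptions, improving scalability. Viewed objectively, the method combines confidence calibration with estimation of the number of unknown classes, giving a systematic solution for open-world recognition tasks.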
Abstract: The proliferation of synthetic facial imagery has intensified the need for robust Open-World DeepFake Attribution (OW-DFA), which aims to attribute both known and unknown forgeries using labeled data for known types and unlabeled data containing a mixture of known and novel types. However, existing OW-DFA methods face two critical limitations: 1) A confidence skew that leads to unreliable pseudo-labels for novel forgeries, resulting in biased training. 2) An unrealistic assumption that the number of unknown forgery types is known a priori. To address these challenges, we propose a Confidence-Aware Asymmetric Learning (CAL) framework, which adaptively balances model confidence across known and novel forgery types. CAL mainly consists of two components: Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR). CCR mitigates pseudo-label bias by dynamically scaling sample losses based on normalized confidence, gradually shifting the training focus from high- to low-confidence samples. ACR complements this by separately calibrating confidence for known and novel classes through selective learning on high-confidence samples, guided by their confidence gap. Together, CCR and ACR form a mutually reinforcing loop that significantly improves the model’s OW-DFA performance. Moreover, we introduce a Dynamic Prototype Pruning (DPP) strategy that automatically estimates the number of novel forgery types in a coarse-to-fine manner, removing the need for unrealistic prior assumptions and enhancing the scalability of our methods to real-world OW-DFA scenarios. Extensive experiments on the standard OW-DFA benchmark and a newly extended benchmark incorporating advanced manipulations demonstrate that CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.
[93] $β$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment cs.CVPDF
Fatimah Zohra, Chen Zhao, Hani Itani, Bernard Ghanem
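TL;DR: This paper proposes β-CLIP, a multi-granular text-conditioned contrastive learning framework for hierarchical alignment between multiple textual granularities (full captions, sentences, and phrases) and their corresponding visual regions. It uses cross-attention to dynamically pool image patches into contextualized visual embeddings, and introduces the β-Contextualized Contrastive Alignment Loss to balance strict query-specific matching against relaxed intra-image contextualization.
Details
Motivation: CLIP excels at global vision-text alignment but performs poorly on fine-grained tasks, even when fine-tuned on long, detailed captions. This paper targets multi-granular vision-language alignment to improve dense alignment.
Result: On Urban1K, β-CLIP achieves R@1 of 91.8% for text-to-image and 92.3% for image-to-text retrieval; on FG-OVD (Hard) it reaches 30.9% accuracy, the state of the art (SOTA) among methods trained without hard negatives.
Insight: The innovations are the multi-granular text-conditioned contrastive learning framework, the cross-attention mechanism that dynamically pools visual embeddings, and the β-Contextualized Contrastive Alignment Loss (β-CAL), which parameterizes the trade-off and supports both soft cross-entropy and hard binary cross-entropy formulations, providing a robust, adaptive baseline for dense vision-language correspondence.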
Abstract: CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose $β$-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities (from full captions to sentences and phrases) and their corresponding visual regions. For each level of granularity, $β$-CLIP utilizes cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the $β$-Contextualized Contrastive Alignment Loss ($β$-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both soft Cross-Entropy and hard Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that $β$-CLIP significantly improves dense alignment, achieving 91.8% T2I and 92.3% I2T at R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting the state of the art among methods trained without hard negatives. $β$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at https://github.com/fzohra/B-CLIP.
[94] Robust Motion Generation using Part-level Reliable Data from Videos cs.CV | cs.AIPDF
Boyuan Li, Sipeng Zheng, Bin Cao, Ruihua Song, Zongqing Lu
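TL;DR: This paper proposes a method that uses credible part-level data from videos to enhance motion generation. The human body is decomposed into five parts, and the parts clearly visible in a video frame are detected as "credible". A part-aware variational autoencoder encodes them into latent tokens, and a robust part-level masked generation model predicts masked credible parts while ignoring noisy ones. The authors also contribute K700-M, a new benchmark of approximately 200k real-world motion sequences for evaluation. Experiments show that the method outperforms baselines in motion quality, semantic consistency, and diversity on both clean and noisy datasets.
Details
Motivation: Extracting human motion from large-scale web videos offers a scalable solution to data scarcity in character animation, but in many video frames some body parts are not visible because of off-screen capture or occlusion. Discarding such data limits scale and diversity, while keeping it harms data quality and model performance.
Result: On the proposed K700-M benchmark, the method outperforms baseline models in motion quality, semantic consistency, and diversity, on both clean and noisy datasets.
Insight: The innovation is using credible part-level data rather than complete-body data, handling noise with a part-aware variational autoencoder and a robust part-level masked generation model, which improves the robustness of motion generation and data utilization. Viewed objectively, decomposing the body and selectively modeling parts addresses the incompleteness common in video data and offers a new approach to learning high-quality motion from noisy videos.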
Abstract: Extracting human motion from large-scale web videos offers a scalable solution to the data scarcity issue in character animation. However, in many video frames some body parts cannot be seen due to off-screen capture or occlusion. This creates a dilemma: discarding data that is missing any part limits scale and diversity, while retaining it compromises data quality and model performance. To address this problem, we propose leveraging credible part-level data extracted from videos to enhance motion generation via a robust part-aware masked autoregression model. First, we decompose a human body into five parts and detect the parts clearly seen in a video frame as “credible”. Second, the credible parts are encoded into latent tokens by our proposed part-aware variational autoencoder. Third, we propose a robust part-level masked generation model to predict masked credible parts, while ignoring those noisy parts. In addition, we contribute K700-M, a challenging new benchmark comprising approximately 200k real-world motion sequences, for evaluation. Experimental results indicate that our method successfully outperforms baselines on both clean and noisy datasets in terms of motion quality, semantic consistency and diversity. Project page: https://boyuaner.github.io/ropar-main/
[95] Towards Interactive Intelligence for Digital Humans cs.CV | cs.CL | cs.GR | cs.HCPDF
Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang
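TL;DR: This paper proposes "Interactive Intelligence", a new paradigm for digital humans aimed at personality-aligned expression, adaptive interaction, and self-evolution. To realize it, the authors present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules (Thinker, Talker, Face Animator, Body Animator, and Renderer) that integrates cognitive reasoning with real-time multimodal embodiment for fluid, consistent interaction. The paper also establishes a new benchmark to rigorously evaluate interactive intelligence. Extensive experiments show that the framework outperforms existing state-of-the-art methods across all evaluated dimensions.
Details
Motivation: Current digital humans stop at superficial imitation and lack genuine intelligent interaction. The goal is to move digital humans toward agents capable of personality expression, adaptive interaction, and self-evolution.
Result: Extensive experiments on the authors' newly established benchmark show that the Mio framework outperforms current state-of-the-art (SOTA) methods across all evaluated dimensions.
Insight: The innovation is the "Interactive Intelligence" paradigm together with Mio, a unified end-to-end framework that integrates cognitive reasoning with multimodal embodiment. Its modular design (five specialized modules) and new evaluation benchmark provide a reusable architecture and evaluation standard for building more intelligent, more natural digital human interaction systems.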
Abstract: We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.
[96] GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation cs.CVPDF
Zhenya Yang, Zhe Liu, Yuxiang Lu, Liping Hou, Chenxuan Miao
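TL;DR: This paper proposes GenieDrive, a framework for generating physics-aware driving videos. It first generates 4D occupancy, which carries rich physical information, as a foundation; compresses it into a latent tri-plane representation with a proposed VAE to reduce latent size; introduces Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution; and uses Normalized Multi-View Attention (NMVA) to generate high-quality multi-view driving videos guided by the 4D occupancy.
Details
Motivation: Existing driving world models usually rely on a single diffusion model to map driving actions directly to videos, which makes learning difficult and produces physically inconsistent outputs. A method is needed that generates physically consistent, controllable driving videos.
Result: For occupancy forecasting, mIoU improves by 7.2% at an inference speed of 41 FPS with only 3.47 M parameters. For video generation, FVD is reduced by 20.7%, yielding high-quality, controllable, multi-view consistent driving videos.
Insight: The innovations are: using 4D occupancy as a physics-aware intermediate representation; a VAE for efficient compression (reducing latent size to 58% of that of previous methods); the MCA module for modeling control-occupancy interaction; NMVA for multi-view video generation; and end-to-end joint training to improve forecasting accuracy.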
Abstract: A physics-aware driving world model is essential for driving planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.
[97] FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning cs.CVPDF
Yue Jiang, Dingkang Yang, Minghao Han, Jinghang Han, Zizhi Chen
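TL;DR: This paper presents FysicsWorld, the first unified full-modality benchmark supporting bidirectional input and output across image, video, audio, and text, for comprehensive evaluation of any-to-any understanding, generation, and reasoning. The benchmark contains 16 primary tasks and 3,268 curated samples covering open-domain categories with diverse question types, with data built using the proposed Cross-Modal Complementarity Screening strategy. An evaluation of more than 30 SOTA baselines reveals the performance gaps and limitations of existing models on full-modality tasks.
Details
Motivation: Multimodal large language models and omni-modal architectures are advancing rapidly, but existing benchmarks fall short in modality coverage, interaction (usually limited to text output), and interdependence and complementarity among modalities. A more comprehensive evaluation framework is needed.
Result: A comprehensive evaluation on FysicsWorld of more than 30 SOTA baselines (including MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models) exposes performance disparities and limitations across understanding, generation, and reasoning tasks.
Insight: The innovation is the first unified full-modality benchmark supporting bidirectional any-to-any input and output, together with the Cross-Modal Complementarity Screening strategy for constructing omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. It establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.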
Abstract: Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, restricted interaction to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy integrated in a systematic data construction framework that produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes the performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.
[98] CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence cs.CV | cs.AI | cs.LGPDF
Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou
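TL;DR: CoRe3D proposes a unified reasoning framework for 3D understanding and generation. By tightly coupling semantic chain-of-thought reasoning with structured spatial reasoning, it lets high-level language intent directly guide low-level 3D content generation.
Details
Motivation: Reasoning-centric methods are effective in language and 2D vision tasks but remain underused in 3D. CoRe3D targets reliable reasoning, interpretability, and cross-modal alignment for 3D intelligence.
Result: The paper reports that CoRe3D's 3D outputs show strong local consistency and faithful alignment with language descriptions, but the abstract gives no specific benchmarks or quantitative comparisons.
Insight: The core innovation is a spatially grounded reasoning representation that decomposes the 3D latent space into localized regions, supporting compositional and procedural reasoning over geometry and thereby bridging high-level semantics and low-level 3D structure.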
Abstract: Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.
[99] L-STEC: Learned Video Compression with Long-term Spatio-Temporal Enhanced Context cs.CVPDF
Tiange Zhang, Zhimeng Huang, Xiandong Meng, Kai Zhang, Zhipin Deng
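TL;DR: This paper proposes L-STEC, a neural video compression method that improves compression performance by enhancing long-term spatio-temporal context. It extends the reference chain with an LSTM to capture long-term dependencies and adds warped spatial context from the pixel domain, fusing spatio-temporal information through a multi-receptive-field network to better preserve reference details.
Details
Motivation: Existing condition-based neural video compression methods mainly predict temporal context from the previous frame's features, which causes two key problems: the short reference window misses long-term dependencies and fine texture details, and propagating only feature-level information accumulates errors and loses detail.
Result: Experiments show that enriching the context significantly improves compression: L-STEC achieves 37.01% bitrate savings in PSNR and 31.65% in MS-SSIM relative to DCVC-TCM, outperforming VTM-17.0 and DCVC-FM and setting a new state of the art.
Insight: The novelty is introducing an LSTM into the reference chain to capture long-term dependencies and fusing warped pixel-domain spatial context to improve detail retention. Viewed objectively, combining long-range temporal modeling with multi-source spatial fusion offers a new way to mitigate error propagation and detail loss in neural video compression.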
Abstract: Neural Video Compression has emerged in recent years, with condition-based frameworks outperforming traditional codecs. However, most existing methods rely solely on the previous frame’s features to predict temporal context, leading to two critical issues. First, the short reference window misses long-term dependencies and fine texture details. Second, propagating only feature-level information accumulates errors over frames, causing prediction inaccuracies and loss of subtle textures. To address these, we propose the Long-term Spatio-Temporal Enhanced Context (L-STEC) method. We first extend the reference chain with LSTM to capture long-term dependencies. We then incorporate warped spatial context from the pixel domain, fusing spatio-temporal information through a multi-receptive field network to better preserve reference details. Experimental results show that L-STEC significantly improves compression by enriching contextual information, achieving 37.01% bitrate savings in PSNR and 31.65% in MS-SSIM compared to DCVC-TCM, outperforming both VTM-17.0 and DCVC-FM and establishing new state-of-the-art performance.
[100] DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning cs.CVPDF
Zhe Liu, Runhui Huang, Rui Yang, Siming Yan, Zining Wang
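TL;DR: This paper proposes DrivePI, a spatial-aware 4D multimodal large language model that unifies understanding, perception, prediction, and planning for autonomous driving. Through end-to-end optimization it performs spatial understanding, 3D perception (such as 3D occupancy), prediction (such as occupancy flow), and planning (such as action outputs) in parallel, integrating point clouds, multi-view images, and language instructions. With only a 0.5B-parameter Qwen2.5 backbone, DrivePI as a single unified model matches or exceeds existing VLA models and specialized VA models on multiple benchmarks.
Details
Motivation: Although multimodal large language models (MLLMs) show strong capabilities across domains, their use for generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. This paper addresses the gap with a unified Vision-Language-Action (VLA) framework that integrates multiple driving tasks.
Result: On nuScenes-QA, DrivePI's mean accuracy is 2.5% higher than OpenDriveVLA-7B, and its collision rate is 70% lower than ORION's (from 0.37% to 0.11%). Against specialized VA models on OpenOcc, DrivePI's 3D occupancy RayIoU exceeds FB-OCC by 10.3 and its occupancy-flow mAVE drops from 0.591 to 0.509; on nuScenes planning, its L2 error is 32% lower than VAD's (from 0.72m to 0.49m).
Insight: The innovations claimed in the abstract are: a unified spatial-aware 4D MLLM framework that integrates point clouds, multi-view images, and language instructions; a data engine that generates text-occupancy and text-flow QA pairs to strengthen 4D spatial understanding; and parallel multi-task processing through end-to-end optimization. Viewed objectively, the core contribution is unifying complex driving tasks (understanding, perception, prediction, planning) in one lightweight MLLM that reaches SOTA or comparable performance on multiple benchmarks, showing the potential of MLLMs for fine-grained 3D tasks.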
Abstract: Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI
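The relative reductions quoted in the abstract can be checked from the before/after values it states. The short sketch below does that arithmetic; the function name `relative_reduction` is a hypothetical helper, and the mAVE percentage is derived here from the stated endpoints rather than quoted by the authors.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when a metric drops from `before` to `after`."""
    return (before - after) / before


# Values taken from the abstract; lower is better for all three metrics.
collision = relative_reduction(0.37, 0.11)   # collision rate vs. ORION
l2_error = relative_reduction(0.72, 0.49)    # planning L2 error vs. VAD (metres)
mave = relative_reduction(0.591, 0.509)      # occupancy-flow mAVE (derived, not quoted)

print(f"collision rate: {collision:.1%}")
print(f"L2 error:       {l2_error:.1%}")
print(f"mAVE:           {mave:.1%}")
```

The first two values round to the 70% and 32% figures given in the abstract. The RayIoU gain of 10.3 is an absolute difference, so no relative figure can be derived for it from the abstract alone.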
[101] Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding cs.CV | cs.AIPDF
Yongyuan Liang, Xiyao Wang, Yuanchen Ju, Jianwei Yang, Furong Huang
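TL;DR: This paper proposes Lemon, a unified and scalable 3D multimodal model for universal spatial understanding. It uses a single Transformer architecture that jointly processes 3D point cloud patches and language tokens as one sequence, enabling early spatial-linguistic fusion and addressing the fragmented architectures, unstable training, and poor scalability of existing models.
Details
Motivation: Scaling large multimodal models to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability.
Result: Lemon sets new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks (from object recognition and captioning to spatial reasoning in 3D scenes) and shows robust scaling as model size and training data grow.
Insight: The main innovation is a unified sequential architecture that achieves early cross-modal fusion and improves parameter efficiency. To handle the complexity of 3D data, the authors also design a structured patchification and tokenization scheme and a three-stage progressive training curriculum that moves from object-level recognition to scene-level spatial reasoning. This provides a unified foundation for 3D spatial intelligence.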
Abstract: Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.
[102] Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners cs.CV | cs.AIPDF
N. K. B. M. P. K. B. Narasinghe, Uthayasanker Thayasivam
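TL;DR: This paper presents a comprehensive empirical study of adaptation strategies for Contrastive Captioners (CoCa) on few-shot image classification. It systematically evaluates methods ranging from training-free hybrid prototyping to deep parameter adaptation with LoRA, and reveals both the contradictory role of data augmentation in low-shot settings and the effectiveness of hybrid loss functions.
Details
Motivation: Existing work focuses mainly on dual-encoder architectures such as CLIP, while the adaptation behavior of generative-contrastive hybrids like CoCa under extreme data scarcity (few-shot learning) remains unclear. This paper fills that gap by examining how CoCa's latent space responds to parameter-efficient fine-tuning (PEFT).
Result: Experiments show that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent improvements over standard cross-entropy across shot counts. The study also provides empirical reference settings for regularization, rank, and sampling strategies for efficient adaptation of generative-contrastive foundation models.
Insight: The novelty is the identification of an "augmentation divergence": strong data augmentation degrades linear probing in low-shot settings but is essential for stabilizing LoRA fine-tuning. In addition, the systematic hierarchy of adaptation strategies offers a reusable tuning framework for few-shot learning with generative-contrastive hybrid models.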
Abstract: Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa’s distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an “augmentation divergence”: while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.
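To make the "hybrid objective" concrete, here is a minimal pure-Python sketch of cross-entropy combined with a supervised contrastive (SupCon) term. The abstract does not say how the two terms are combined, so the additive form, the `weight` parameter, and the temperature value are assumptions for illustration only; the SupCon term follows the standard formulation, in which same-label samples in the batch act as positives.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows."""
    total = 0.0
    for row, y in zip(logits, labels):
        top = max(row)
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_z - row[y]
    return total / len(labels)


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: pull same-label embeddings together."""
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        sims = [_dot(z[i], z[a]) / temperature for a in range(n) if a != i]
        top = max(sims)
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims))
        log_probs = [_dot(z[i], z[p]) / temperature - log_denom for p in positives]
        total += -sum(log_probs) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def hybrid_loss(logits, embeddings, labels, weight=0.5):
    """Assumed combination: CE + weight * SupCon (weight is hypothetical)."""
    return cross_entropy(logits, labels) + weight * supcon_loss(embeddings, labels)


# Toy 2-class batch: embeddings cluster by class under `good_labels`.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
logits = [[2.0, 0.0], [1.5, 0.2], [0.1, 1.8], [0.0, 2.2]]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
```

With these toy inputs the SupCon term is smaller when labels agree with the embedding clusters (`good_labels`) than when they cut across them (`bad_labels`), which is the property the extra term is meant to reward.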
[103] Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal cs.CV | cs.MM | cs.SDPDF
Weihan Xu, Kan Jen Cheng, Koichi Saito, Muhammad Jehanzeb Mirza, Tingle Li
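TL;DR: This paper proposes the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model for object-level joint audio-visual editing, in particular removal of a target object. To address the scarcity of paired data and the heterogeneity across modalities, the authors build the SAVEBench dataset and use a Schrodinger Bridge to learn a direct transport from source to target audiovisual mixtures.
Details
Motivation: Joint editing of audio and visual content is hard because paired data before and after targeted edits is scarce and the modalities are heterogeneous. The goal is precise, controllable content creation.
Result: Evaluation shows that, compared with pairwise combinations of an audio editor and a video editor, SAVE removes target objects from audiovisual content more effectively while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence.
Insight: The novelty lies in introducing the paired SAVEBench dataset to support object-grounded source-to-target learning, and in designing the end-to-end flow-matching model SAVE, whose core is a Schrodinger Bridge for joint cross-modal editing that keeps audio and video aligned throughout processing.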
Abstract: Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence compared with pairwise combinations of an audio editor and a video editor.
[104] Cross-Level Sensor Fusion with Object Lists via Transformer for 3D Object Detection cs.CV | cs.ROPDF
Xiangzhong Liu, Jiajie Zhang, Hao Shen
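TL;DR: This paper proposes a Transformer-based cross-level fusion method for 3D object detection that fuses highly abstract object-list information with raw camera images end to end. Object lists are fed into the Transformer as denoising queries, and a deformable Gaussian mask guides attention and speeds up training convergence. On the nuScenes dataset the method achieves substantial gains over a vision-only baseline.
Details
Motivation: In automotive sensor fusion systems, smart sensors and V2X modules typically provide only processed object lists rather than raw data, which makes it hard for conventional fusion methods to combine abstract object information with raw images.
Result: On nuScenes, the method substantially outperforms the vision-only baseline and generalizes across simulated object lists with different noise levels as well as real detectors.
Insight: The innovations are: the first proposal of cross-level fusion, feeding object lists into a Transformer as denoising queries; a deformable Gaussian mask that uses the position and size priors of the object lists to guide attention; and a method for generating pseudo object lists from ground-truth bounding boxes by simulating state noise and false positives and negatives, filling a gap in public datasets.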
Abstract: In automotive sensor fusion systems, smart sensors and Vehicle-to-Everything (V2X) modules are commonly utilized. Sensor data from these systems are typically available only as processed object lists rather than raw sensor data from traditional sensors. Instead of processing other raw data separately and then fusing them at the object level, we propose an end-to-end cross-level fusion concept with Transformer, which integrates highly abstract object list information with raw camera images for 3D object detection. Object lists are fed into a Transformer as denoising queries and propagated together with learnable queries through the latter feature aggregation process. Additionally, a deformable Gaussian mask, derived from the positional and size dimensional priors from the object lists, is explicitly integrated into the Transformer decoder. This directs attention toward the target area of interest and accelerates model training convergence. Furthermore, as there is no public dataset containing object lists as a standalone modality, we propose an approach to generate pseudo object lists from ground-truth bounding boxes by simulating state noise and false positives and negatives. As the first work to conduct cross-level fusion, our approach shows substantial performance improvements over the vision-based baseline on the nuScenes dataset. It demonstrates its generalization capability over diverse noise levels of simulated object lists and real detectors.
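As a rough illustration of how position and size priors could steer attention, the sketch below builds a 2D Gaussian mask from a projected object centre and extent, then applies it as a log-bias on attention logits. This is not the paper's deformable mask: the function names, the `scale` factor, and the log-bias formulation are all assumptions chosen to keep the example small.

```python
import math


def gaussian_mask(height, width, center, size, scale=0.5):
    """2D Gaussian over a feature map, peaked at the object's centre.

    center: (x, y) in feature-map cells; size: (w, h) of the object, both > 0.
    scale: hypothetical factor mapping object size to standard deviation.
    """
    cx, cy = center
    sx, sy = scale * size[0], scale * size[1]
    return [
        [math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def masked_softmax(logits, mask, eps=1e-6):
    """Softmax over flattened logits with log(mask) added as a bias."""
    biased = [l + math.log(m + eps) for l, m in zip(logits, mask)]
    top = max(biased)
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# A 5x5 feature map with one object prior centred at cell (2, 2).
mask = gaussian_mask(5, 5, center=(2, 2), size=(2, 2))
flat_mask = [value for row in mask for value in row]
weights = masked_softmax([0.0] * 25, flat_mask)
peak_index = max(range(25), key=lambda i: weights[i])
```

With uniform logits the attention weights peak at the object centre and fall off with distance, so queries seeded from an object list start out focused on the region that the list already points to.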
[105] Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification cs.CVPDF
Han Liu, Bogdan Georgescu, Yanbo Zhang, Youngjin Yoo, Michael Baumgartner
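TL;DR: This paper proposes AnyMC3D, a scalable 3D medical image classification framework adapted from 2D foundation models (FMs). It scales to new tasks efficiently by adding lightweight plugins (about 1M parameters per task) on top of a frozen backbone, and supports multi-view inputs, pixel-level supervision, and interpretable heatmap generation. The authors build a comprehensive 12-task benchmark and systematically analyze state-of-the-art 3D classification techniques, finding that effective adaptation is essential to unlock FM potential, that general-purpose FMs can match medical-specific FMs when properly adapted, and that 2D-based methods outperform 3D architectures for 3D classification.
Details
Motivation: Current research on 3D medical image classification suffers from three key pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. The aim is a scalable, unified framework that does not require training a separate model per task.
Result: On a 12-task benchmark covering diverse pathologies, anatomies, and modalities, AnyMC3D achieves state-of-the-art performance (including 1st place in the VLM3D challenge), demonstrating that a single scalable framework can reach SOTA across different applications.
Insight: The novelty lies in adapting a frozen 2D foundation model with lightweight plugins to handle 3D medical image classification efficiently, and in showing the importance of effective adaptation, the potential of general-purpose FMs, and the advantage of 2D methods on 3D tasks. Reusable ideas include the lightweight adaptation strategy, the unified multi-task framework design, and effective 2D-to-3D knowledge transfer.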
Abstract: 3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.
[106] Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution cs.CV | cs.GR | cs.LGPDF
Abhinav Kumar, Tristan Aumentado-Armstrong, Lazar Valkov, Gopal Sharma, Alex Levinshtein
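TL;DR: This paper proposes Qonvolution (Queried Convolution), a new method that addresses the difficulty neural networks have in learning high-frequency signals. It convolves a low-frequency signal with queries (such as coordinates) to improve learning of intricate high-frequency signals. Experiments show that Qonvolution improves performance on a range of high-frequency learning tasks, including 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). For NVS in particular, combining Gaussian splatting with Qonvolution yields state-of-the-art image quality on complex real-world scenes.
Details
Motivation: Because of spectral bias or optimization difficulties, neural networks struggle to learn high-frequency signals accurately. Existing techniques such as Fourier encodings help, but there is still room for improvement on high-frequency information.
Result: Qonvolution improves performance on 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). In NVS, combined with Gaussian splatting, it reaches SOTA image quality on complex real-world scenes, even outperforming strong radiance field models.
Insight: The novelty is exploiting the neighborhood property of convolution by convolving queries (such as coordinates) with a low-frequency signal to strengthen high-frequency learning. Viewed objectively, this is a simple yet effective structural modification that may mitigate spectral bias and suits high-frequency tasks in vision and graphics.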
Abstract: Accurately learning high-frequency signals is a challenge in computer vision and graphics, as neural networks often struggle with these signals due to spectral bias or optimization difficulties. While current techniques like Fourier encodings have made great strides in improving performance, there remains scope for improvement when presented with high-frequency information. This paper introduces Queried-Convolutions (Qonvolutions), a simple yet powerful modification using the neighborhood properties of convolution. Qonvolution convolves a low-frequency signal with queries (such as coordinates) to enhance the learning of intricate high-frequency signals. We empirically demonstrate that Qonvolutions enhance performance across a variety of high-frequency learning tasks crucial to both the computer vision and graphics communities, including 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). In particular, by combining Gaussian splatting with Qonvolutions for NVS, we showcase state-of-the-art performance on real-world complex scenes, even outperforming powerful radiance field models on image quality.
[107] Sharpness-aware Dynamic Anchor Selection for Generalized Category Discovery cs.CVPDF
Zhimao Peng, Enguang Wang, Fei Yang, Xialei Liu, Ming-Ming Cheng
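TL;DR: This paper targets the pseudo-label noise that pre-trained models tend to produce in generalized category discovery. It proposes a new method with a loss sharpness penalty and dynamic anchor selection, which improves pseudo-label quality by making the model more robust to perturbations and by selecting representative samples of unknown classes, and it achieves state-of-the-art results on multiple benchmarks.
Details
Motivation: In generalized category discovery, pseudo-labeling strategies built on pre-trained models encode spurious correlations because the models prefer certain visual patterns, which leads to noisy pseudo-labels.
Result: The method achieves state-of-the-art results on multiple generalized category discovery benchmarks and effectively mitigates pseudo-label noise.
Insight: The innovations are the Loss Sharpness Penalty module, which suppresses the encoding of trivial features by minimizing worst-case loss sharpness, and the Dynamic Anchor Selection module, which selects representative unknown-class samples based on KNN density and class probability and assigns them hard pseudo-labels. Together they improve the model's learning efficiency on unknown classes and its clustering accuracy.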
Abstract: Generalized category discovery (GCD) is an important and challenging task in open-world learning. Specifically, given some labeled data of known classes, GCD aims to cluster unlabeled data that contain both known and unknown classes. Current GCD methods based on parametric classification adopt the DINO-like pseudo-labeling strategy, where the sharpened probability output of one view is used as supervision information for the other view. However, large pre-trained models have a preference for some specific visual patterns, resulting in encoding spurious correlation for unlabeled data and generating noisy pseudo-labels. To address this issue, we propose a novel method, which contains two modules: Loss Sharpness Penalty (LSP) and Dynamic Anchor Selection (DAS). LSP enhances the robustness of model parameters to small perturbations by minimizing the worst-case loss sharpness of the model, which suppresses the encoding of trivial features, thereby reducing overfitting of noise samples and improving the quality of pseudo-labels. Meanwhile, DAS selects representative samples for the unknown classes based on KNN density and class probability during model training and assigns hard pseudo-labels to them, which not only alleviates the confidence difference between known and unknown classes but also enables the model to quickly learn a more accurate feature distribution for the unknown classes, thus further improving the clustering accuracy. Extensive experiments demonstrate that the proposed method can effectively mitigate the noise of pseudo-labels and achieves state-of-the-art results on multiple GCD benchmarks.
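"Worst-case loss sharpness" is commonly estimated as the loss increase under the most harmful small perturbation of the weights, approximated by one step of radius `rho` along the normalized gradient (the approach used in sharpness-aware minimization). The sketch below shows that estimate on toy quadratic losses. Whether LSP uses exactly this first-order approximation is not stated in the abstract, so the formulation and the names `sharpness`, `rho`, and `penalty_weight` are assumptions.

```python
import math


def sharpness(loss_fn, grad_fn, weights, rho=0.05):
    """First-order estimate of max over ||eps|| <= rho of L(w + eps) - L(w)."""
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return 0.0  # at a stationary point the first-order step is zero
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    return loss_fn(perturbed) - loss_fn(weights)


def penalized_loss(loss_fn, grad_fn, weights, rho=0.05, penalty_weight=1.0):
    """Assumed objective: base loss plus a weighted sharpness penalty."""
    return loss_fn(weights) + penalty_weight * sharpness(loss_fn, grad_fn, weights, rho)


# Two toy 1D losses with the same minimiser but different curvature.
def sharp_loss(w):
    return 10.0 * w[0] ** 2


def sharp_grad(w):
    return [20.0 * w[0]]


def flat_loss(w):
    return 0.1 * w[0] ** 2


def flat_grad(w):
    return [0.2 * w[0]]


sharp_value = sharpness(sharp_loss, sharp_grad, [1.0])
flat_value = sharpness(flat_loss, flat_grad, [1.0])
```

At the same point the high-curvature loss has a much larger sharpness estimate than the flat one, so penalizing this quantity steers training toward flatter regions that are less sensitive to small weight perturbations.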
[108] MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation cs.CV | cs.AIPDF
Huu-An Vu, Van-Khanh Mai, Trong-Tam Nguyen, Quang-Duc Dam, Tien-Huy Nguyen
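TL;DR: MADTempo is an interactive system for multi-event temporal video retrieval. It combines a temporal search mechanism with a query augmentation module based on Google Image Search, aiming to improve understanding of the temporal structure of complex events and robustness to queries about unseen or rare visual concepts.
Details
Motivation: Existing video retrieval systems fall short in modeling temporal dependencies across multiple events and in handling queries that involve unseen or rare visual concepts.
Result: The abstract reports no specific quantitative results, benchmarks, or comparisons with existing methods, but claims that the system improves temporal reasoning and generalization.
Insight: The novelty is unifying temporal search (capturing event-level continuity by aggregating similarity scores across sequential video segments) with query augmentation based on web-scale visual grounding (expanding query representations with external web images) to handle out-of-distribution queries and compensate for gaps in pretrained visual embeddings.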
Abstract: The rapid expansion of video content across online platforms has accelerated the need for retrieval systems capable of understanding not only isolated visual moments but also the temporal structure of complex events. Existing approaches often fall short in modeling temporal dependencies across multiple events and in handling queries that reference unseen or rare visual concepts. To address these challenges, we introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh, that unifies temporal search with web-scale visual grounding. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments, enabling coherent retrieval of multi-event queries. Complementarily, a Google Image Search-based fallback module expands query representations with external web imagery, effectively bridging gaps in pretrained visual embeddings and improving robustness against out-of-distribution (OOD) queries. Together, these components advance the temporal reasoning and generalization capabilities of modern video retrieval systems, paving the way for more semantically aware and adaptive retrieval across large-scale video corpora.
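One simple way to aggregate similarity scores across sequential segments is to choose one segment per event, in strictly increasing temporal order, so that the summed similarity is maximal. The dynamic-programming sketch below does this. The abstract does not describe MADTempo's actual aggregation rule, so the sum-of-scores objective, the strict ordering constraint, and the function name are assumptions.

```python
def best_ordered_sequence(scores):
    """Pick one segment per event, in increasing order, maximising total score.

    scores[k][t] is the (hypothetical) similarity between event k of the
    query and video segment t. Returns (total, [t_0, ..., t_{K-1}]).
    """
    num_events, num_segments = len(scores), len(scores[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * num_segments for _ in range(num_events)]
    back = [[-1] * num_segments for _ in range(num_events)]
    best[0] = list(scores[0])
    for k in range(1, num_events):
        run_best, run_arg = neg_inf, -1  # best of event k-1 over segments < t
        for t in range(num_segments):
            if run_arg >= 0:
                best[k][t] = run_best + scores[k][t]
                back[k][t] = run_arg
            if best[k - 1][t] > run_best:
                run_best, run_arg = best[k - 1][t], t
    last = max(range(num_segments), key=lambda t: best[-1][t])
    total = best[-1][last]
    if total == neg_inf:
        return neg_inf, []  # fewer segments than events
    path = [last]
    for k in range(num_events - 1, 0, -1):
        path.append(back[k][path[-1]])
    path.reverse()
    return total, path


# Two-event query over four segments: event 1 must come after event 0.
two_event_scores = [
    [0.9, 0.1, 0.2, 0.1],
    [0.8, 0.2, 0.7, 0.1],
]
total, path = best_ordered_sequence(two_event_scores)
```

In the toy example the second event scores highest on segment 0, but that segment cannot follow the first event, so the search settles on segments 0 and 2 instead. This ordering constraint is what distinguishes a multi-event temporal query from independent per-event lookups.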
[109] Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion cs.CV | cs.AI | cs.IRPDF
Toan Le Ngo Thanh, Phat Ha Huu, Tan Nguyen Dang Duy, Thong Nguyen Le Minh, Anh Nguyen Nhu Tinh
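TL;DR: This paper proposes a unified interactive multimodal moment retrieval system that uses cascaded embedding-reranking and temporal-aware score fusion to meet the retrieval demands created by rapidly growing video content. The system combines BEIT-3 and SigLIP for broad retrieval, reranks with BLIP-2 to balance recall and precision, and introduces a temporal-aware scoring mechanism and GPT-4o-based query decomposition to handle ambiguous queries automatically and build coherent event sequences.
Details
Motivation: Existing multimodal moment retrieval methods face three key challenges: fixed-weight fusion strategies cope poorly with cross-modal noise and ambiguous queries; temporal modeling struggles to capture coherent event sequences while penalizing unrealistic temporal gaps; and systems require manual modality selection, which reduces usability.
Result: Qualitative analysis shows that the system handles ambiguous queries effectively, retrieves temporally coherent sequences, and adapts its fusion strategy dynamically, improving interactive moment search.
Insight: The innovations are: a cascaded dual-embedding pipeline that combines several pretrained models to improve retrieval; a temporal-aware scoring mechanism that applies exponential decay penalties to large temporal gaps to build coherent event sequences; and GPT-4o-based query decomposition that interprets ambiguous queries automatically and performs adaptive score fusion, removing the need for manual modality selection.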
Abstract: The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail under cross-modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2-based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality-specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion, eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.
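The temporal-aware scoring idea can be sketched as a beam search in which each new frame's score is multiplied by an exponential decay in its time gap from the previous frame. The exact penalty form, the `decay` rate, the beam width, and the requirement that timestamps strictly increase are assumptions for illustration; the abstract only states that exponential decay penalties are applied to large gaps via beam search.

```python
import math


def beam_search_sequences(stage_candidates, beam_width=3, decay=0.1):
    """Build temporally coherent frame sequences, one frame per sub-query.

    stage_candidates[k] is a list of (timestamp_seconds, score) pairs for
    sub-query k. Returns up to `beam_width` (total_score, timestamps) tuples,
    best first.
    """
    beams = [(score, [t]) for t, score in stage_candidates[0]]
    beams = sorted(beams, key=lambda b: b[0], reverse=True)[:beam_width]
    for candidates in stage_candidates[1:]:
        expanded = []
        for total, path in beams:
            for t, score in candidates:
                gap = t - path[-1]
                if gap <= 0:
                    continue  # keep events in temporal order
                penalized = score * math.exp(-decay * gap)
                expanded.append((total + penalized, path + [t]))
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams


# The second event has a strong match far away (t=60) and a weaker one nearby (t=12).
stages = [
    [(10, 0.9)],
    [(12, 0.6), (60, 0.8)],
]
with_penalty = beam_search_sequences(stages, decay=0.1)
without_penalty = beam_search_sequences(stages, decay=0.0)
```

With the penalty enabled the nearby frame wins despite its lower raw score, while setting `decay=0.0` falls back to picking the highest-scoring frames regardless of how far apart they are.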
[110] Content Adaptive based Motion Alignment Framework for Learned Video Compression cs.CV | cs.AIPDF
Tiange Zhang, Xiandong Meng, Siwei Ma
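TL;DR: This paper proposes a content-adaptive motion alignment framework (CAMA) to improve end-to-end video compression. It achieves precise feature alignment through a two-stage flow-guided deformable warping mechanism, adjusts distortion weights with a multi-reference quality-aware strategy to reduce error propagation, and integrates a training-free downsampling module to smooth motion estimation.
Details
Motivation: Existing end-to-end video compression frameworks lack adaptation to content characteristics, which leads to suboptimal compression. This paper uses content-adaptive strategies that adjust encoding to different video content in order to improve compression efficiency.
Result: On standard test datasets, CAMA achieves 24.95% BD-rate (PSNR) savings over the baseline DCVC-TCM and outperforms the reproduced DCVC-DC and the traditional codec HM-16.25, reaching SOTA level.
Insight: The innovations are: a two-stage flow-guided deformable warping mechanism for coarse-to-fine motion compensation; a multi-reference quality-aware strategy combined with hierarchical training to reduce error propagation; and a training-free downsampling module that smooths motion estimation based on motion magnitude and resolution. These strengthen content adaptivity and can inform the design of video compression systems.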
Abstract: Recent advances in end-to-end video compression have shown promising results owing to their unified end-to-end learning optimization. However, such generalized frameworks often lack content-specific adaptation, leading to suboptimal compression performance. To address this, this paper proposes a content-adaptive motion alignment framework that improves performance by adapting encoding strategies to diverse content characteristics. Specifically, we first introduce a two-stage flow-guided deformable warping mechanism that refines motion compensation with coarse-to-fine offset prediction and mask modulation, enabling precise feature alignment. Second, we propose a multi-reference quality-aware strategy that adjusts distortion weights based on reference quality, and apply it to hierarchical training to reduce error propagation. Third, we integrate a training-free module that downsamples frames by motion magnitude and resolution to obtain smooth motion estimation. Experimental results on standard test datasets demonstrate that our framework CAMA achieves significant improvements over state-of-the-art Neural Video Compression models, with 24.95% BD-rate (PSNR) savings over our baseline model DCVC-TCM, while also outperforming reproduced DCVC-DC and traditional codec HM-16.25.
[111] VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference cs.CVPDF
Shengling Qin, Hao Yu, Chenxin Wu, Zheng Li, Yizhong Cao
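TL;DR: VLCache is a cache reuse framework for vision-language models that reuses the key-value cache and encoder cache from prior multimodal inputs to avoid expensive recomputation on repeated inputs. It analyzes the cumulative reuse error effect, minimizes non-prefix cache reuse error, and proposes a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experiments show that VLCache matches the accuracy of full recomputation while computing only 2-5% of the vision tokens, giving 1.2x to 16x speedups in time to first token, and it has been integrated into SGLang for practical deployment.
Details
Motivation: Repeated inputs cause redundant computation in multimodal inference. Caching reduces recomputation of vision tokens and improves inference efficiency.
Result: With accuracy on par with full recomputation, only 2-5% of the vision tokens need to be computed, giving 1.2x to 16x TTFT speedups. The method has been integrated into SGLang.
Insight: The innovations are a formal analysis of the cumulative reuse error effect with a method for minimizing non-prefix cache reuse error, and a dynamic layer-aware recomputation strategy that balances accuracy and efficiency, offering a reusable cache optimization approach for efficient multimodal inference.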
Abstract: This paper presents VLCache, a cache reuse framework that exploits both Key-Value (KV) cache and encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to minimize the non-prefix cache reuse error effectively. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves an accuracy on par with full recomputation, while requiring only 2-5% of the tokens to compute, yielding 1.2x-16x TTFT speedups. The proposed VLCache pipeline has been integrated into SGLang, enabling significantly faster inference in practical deployments.
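The basic premise, that a recurring multimodal input should not be re-encoded, can be illustrated with a content-addressed cache keyed by a hash of the image bytes. This sketch covers only the encoder-cache lookup; it does not model VLCache's non-prefix KV reuse, its error analysis, or its layer-aware recomputation, and the `EncoderCache` class and `toy_encoder` function are hypothetical names.

```python
import hashlib


class EncoderCache:
    """Content-addressed cache around an expensive image encoder."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn  # hypothetical stand-in for a vision encoder
        self._store = {}
        self.misses = 0

    def encode(self, image_bytes: bytes):
        # Identical bytes hash to the same key, so repeats skip the encoder.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode_fn(image_bytes)
        return self._store[key]


def toy_encoder(image_bytes: bytes):
    """Placeholder encoder: returns one fake 'token' per input byte."""
    return [b / 255.0 for b in image_bytes]


cache = EncoderCache(toy_encoder)
first = cache.encode(b"\x00\x80\xff")
second = cache.encode(b"\x00\x80\xff")   # same image again: served from cache
third = cache.encode(b"\x01\x02\x03")    # new image: encoder runs
```

Exact-match hashing only helps when the bytes are identical. The harder problem the paper addresses is reusing cached KV entries when the same image appears at a different position in the prompt, which is where the reuse error it analyzes comes from.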
[112] Light Field Based 6DoF Tracking of Previously Unobserved Objects cs.CVPDF
Nikolai Goncharov, James L. Gray, Donald G. Dansereau
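TL;DR: This paper proposes a 6-degree-of-freedom object tracking method based on light field images that does not rely on a pre-trained model. It extracts semantic and geometric features with vision foundation models and converts them into view-dependent Gaussian splats that support differentiable rendering and pose optimization. On a dataset of challenging reflective objects it performs on par with SOTA model-based trackers.
Details
Motivation: Existing high-performing object tracking methods usually rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects and makes them struggle with visually complex appearance (such as reflections), reducing tracking quality.
Result: On a light field object tracking dataset containing challenging reflective objects, the method is competitive with state-of-the-art model-based trackers.
Insight: The novelty is combining light field images with vision foundation model features and using view-dependent Gaussian splats as a unified object representation, which enables robust 6DoF tracking of previously unseen complex objects (especially reflective surfaces) and suggests a route toward universal object tracking in robotic systems.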
Abstract: Object tracking is an important step in robotics and autonomous driving pipelines, which has to generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. However, such reference models can struggle with visually complex appearance, reducing the quality of tracking. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model, while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at https://github.com/nagonch/LiFT-6DoF.
[113] TWLR: Text-Guided Weakly-Supervised Lesion Localization and Severity Regression for Explainable Diabetic Retinopathy Grading cs.CVPDF
Xi Luo, Shixin Xu, Ying Xie, JianZhong Hu, Yuwei He
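TL;DR: TWLR is a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates ophthalmological domain knowledge into text embeddings to jointly perform DR grading and lesion classification. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation, in which a progressive inpainting mechanism removes pathological features, enabling lesion localization without pixel-level annotation and an interpretable visualization of the disease-to-healthy transformation.
Details
Motivation: In medical image analysis, high-quality pixel-level annotations are costly and time-consuming to obtain, and the lack of interpretability in deep learning models limits clinical adoption.
Result: Experiments on FGADR, DDR, and a private dataset show that TWLR achieves competitive performance on both DR classification and lesion segmentation.
Insight: The novelty is integrating domain knowledge into a vision-language model through text embeddings and designing an iterative severity regression framework that achieves accurate lesion localization under weak supervision together with interpretable visualization of disease severity, providing a more explainable and annotation-efficient solution for automated retinal image analysis.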
Abstract: Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations. Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.
[114] What Happens Next? Next Scene Prediction with a Unified Video Model cs.CVPDF
Xinjie Li, Zhimin Chen, Rui Zhao, Florian Schiffers, Zhenyu Liao
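TL;DR: This paper proposes a new task, Next Scene Prediction (NSP), that pushes unified video models toward temporal and causal reasoning. The authors build a unified framework that combines Qwen-VL for understanding with LTX for synthesis, and train it in three stages (text-to-video pre-training, supervised fine-tuning, and reinforcement learning with a causal consistency reward) on a newly constructed large-scale NSP dataset. Experiments show that the model reaches state-of-the-art performance on the benchmark.
Details
Motivation: Current unified models focus mainly on conventional tasks such as text-to-video generation, leaving their temporal reasoning potential underexplored. To fill this gap, the paper introduces the NSP task, which requires the model to predict plausible future scenes from preceding context and therefore demands deeper understanding and reasoning.
Result: Experiments show that the model achieves state-of-the-art performance on the benchmark built by the authors.
Insight: The novelty is the NSP task itself and a unified framework that combines understanding and synthesis modules, trained with a three-stage strategy (in particular, reinforcement learning with a causal consistency reward) to improve the model's temporal reasoning.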
Abstract: Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
[115] SneakPeek: Future-Guided Instructional Streaming Video Generation cs.CVPDF
Cheeun Hong, German Barquero, Fadime Sener, Markos Georgopoulos, Edgar Schönfeld
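TL;DR: This paper proposes SneakPeek, a diffusion-based autoregressive framework that generates coherent, stepwise instructional videos from an initial image and structured text prompts. Through three key innovations (predictive causal adaptation, future-guided self-forcing, and multi-prompt conditioning), it targets the temporal consistency and controllability problems that existing video diffusion models face when generating long, multi-step videos.
Details
Motivation: Instructional video generation aims to synthesize coherent demonstrations of procedural activities from textual descriptions, with broad applications in content creation, education, and human-AI interaction. Existing video diffusion models struggle to maintain temporal consistency and controllability over long, multi-step sequences, which is the core problem this paper addresses.
Result: Experiments show that the method generates temporally coherent, semantically faithful instructional videos that accurately follow complex multi-step task descriptions, and that it performs well on motion consistency and instruction following.
Insight: The contributions are: 1) predictive causal adaptation, in which a causal model learns next-frame prediction and anticipates future keyframes; 2) future-guided self-forcing with a dual-region KV caching scheme to address exposure bias at inference time; 3) multi-prompt conditioning for fine-grained procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive generation in which future prompt updates dynamically influence the ongoing streaming video.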
Abstract: Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.
[116] Motus: A Unified Latent Action World Model cs.CV | cs.LG | cs.ROPDF
Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang
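TL;DR: Motus is a unified latent action world model. It uses a Mixture-of-Transformer architecture to integrate three experts (understanding, video generation, and action) and a UniDiffuser-style scheduler to switch flexibly between modeling modes. The model learns latent actions from optical flow and performs large-scale action pretraining with a three-phase training pipeline and a six-layer data pyramid, extracting pixel-level “delta actions”.
Details
Motivation: Current methods use isolated models for understanding, world modeling, and control, which prevents unifying multimodal generative capabilities and hinders learning from large-scale heterogeneous data. Motus aims to resolve this fragmentation by building a unified system.
Result: In simulation, Motus improves by 15% over X-VLA and by 45% over Pi0.5; in real-world scenarios it improves by 11% to 48%, reaching SOTA performance.
Insight: The contributions are: 1) a Mixture-of-Transformer architecture that integrates multiple experts; 2) a UniDiffuser-style scheduler for flexible mode switching; 3) latent actions learned from optical flow, giving a pixel-level action representation; 4) a three-phase training pipeline and six-layer data pyramid that support large-scale pretraining. These designs unify functionalities and priors and significantly improve downstream robotic task performance.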
Abstract: While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level “delta action” and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improved by +11~48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
[117] GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training cs.CV | cs.AIPDF
Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi
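TL;DR: GTR-Turbo is an efficient method for training vision-language model (VLM) agents. It merges the weights of checkpoints produced during training to create a free teacher model that replaces an expensive external teacher, improving performance while substantially reducing training time and compute cost.
Details
Motivation: To address sparse rewards and long-horizon credit assignment in multi-turn reinforcement learning for multimodal agents, and to overcome the practicality and reproducibility limits of existing methods (such as GTR) that depend on costly, privileged teacher models.
Result: Across diverse visual agentic tasks, GTR-Turbo improves the baseline model’s accuracy by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
Insight: The novelty is building a free teacher by merging the checkpoint weights produced during the model’s own training. This removes the dependence on privileged VLMs (such as GPT or Gemini), mitigates the “entropy collapse” observed in prior work, and keeps training stable.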
Abstract: Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a “free” teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the “entropy collapse” observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
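The abstract says that GTR-Turbo merges checkpoint weights but does not state the merging rule. As a minimal sketch, assuming plain (optionally weighted) parameter averaging and checkpoints stored as name-to-flat-list dictionaries, the merge step might look like the following; `merge_checkpoints` and its `coeffs` argument are hypothetical names, not the authors’ API.

```python
from typing import Dict, List, Optional

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_checkpoints(checkpoints: List[Weights],
                      coeffs: Optional[List[float]] = None) -> Weights:
    """Weighted average of checkpoints that share the same keys and shapes.

    Assumption: uniform averaging by default; the paper may use another rule.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if coeffs is None:
        coeffs = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(coeffs) != len(checkpoints):
        raise ValueError("one coefficient per checkpoint is required")
    total = sum(coeffs)
    merged: Weights = {}
    for name, ref in checkpoints[0].items():
        # Average each scalar entry across checkpoints, then renormalize.
        merged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coeffs, checkpoints)) / total
            for i in range(len(ref))
        ]
    return merged


if __name__ == "__main__":
    early = {"w": [0.0, 2.0]}
    late = {"w": [2.0, 4.0]}
    print(merge_checkpoints([early, late]))
```

In a setup like the one described, the merged weights would then be loaded into a frozen copy of the policy that supplies step-level targets for supervised fine-tuning or logit distillation.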
[118] Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models cs.CVPDF
Zizhi Chen, Yizhen Gao, Minghao Han, Yizhou Liu, Zhaoyu Chen
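TL;DR: This paper proposes a comprehensive framework that combines Retrieval-Augmented Generation (RAG) with Continual Learning (CL) to strengthen the generalist capabilities of multimodal biomedical vision-language models (VLMs). It uses an 18-million-entry multimodal medical retrieval database to guide fine-tuning through dynamic knowledge retrieval, and it introduces a dynamic knowledge distillation framework to resolve the core tension between preserving fine-grained intra-modality features and bridging the large domain gap across modalities. The paper also designs a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark for comprehensive evaluation.
Details
Motivation: To resolve the core dilemma that multimodal biomedical vision-language models face in continual learning: how to preserve fine-grained intra-modality features while bridging the large domain gap across different modalities.
Result: Extensive experiments on the authors’ Medical Generalist Task Incremental Learning (MGTIL) benchmark show that the proposed method achieves state-of-the-art (SOTA) performance across all metrics.
Insight: The novelty lies in pioneering the integration of RAG into continual learning and in building a dynamic knowledge distillation framework that dynamically modulates the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset according to the required level of detail, thereby balancing domain adaptation against feature retention. Viewed objectively, the large-scale multimodal medical retrieval database and the rigorous MGTIL benchmark are also valuable resources and evaluation standards for the field.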
Abstract: Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model’s capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.
[119] UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era cs.CV | cs.AIPDF
Ziqiang Zhu, Bowei Yang
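TL;DR: This paper proposes UniVCD, an unsupervised, open-vocabulary change detection method. Built on the frozen SAM2 and CLIP vision foundation models, it performs category-agnostic change detection across scenes and imaging geometries without labeled data or paired change images. With a lightweight feature alignment module and a streamlined post-processing pipeline, it achieves strong performance on several public benchmarks.
Details
Motivation: Existing change detection methods rely on supervised learning, incur high annotation costs, generalize poorly, and are usually limited to predefined categories. The paper aims to use vision foundation models (such as SAM2 and CLIP) for unsupervised change detection in the open-vocabulary setting.
Result: On several public BCD (binary change detection) and SCD (semantic change detection) benchmarks, UniVCD achieves consistently strong performance on key metrics such as F1 and IoU, matching or surpassing existing open-vocabulary change detection methods.
Insight: The novelty is an unsupervised open-vocabulary change detection framework built on frozen vision foundation models (SAM2 and CLIP). A lightweight feature alignment module fuses spatial detail with semantic priors, and post-processing suppresses noise, giving a practical and effective paradigm for open-vocabulary change detection.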
Abstract: Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.
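The F1 and IoU metrics cited above have standard definitions for binary change masks: F1 = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN). A minimal sketch for flattened 0/1 masks follows; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not taken from the paper.

```python
from typing import Sequence, Tuple


def binary_f1_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (F1, IoU) of the 'changed' class for flattened 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    f1_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    f1 = 2 * tp / f1_den if f1_den else 1.0
    iou = tp / iou_den if iou_den else 1.0
    return f1, iou


if __name__ == "__main__":
    print(binary_f1_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```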
[120] ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning cs.CV | cs.LGPDF
Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng
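TL;DR: This paper proposes ADHint, a method for adaptively using hints in reinforcement learning (RL) post-training. Treating sample difficulty as a key factor, it designs adaptive hint-ratio scheduling and difficulty-based advantage estimation to better balance exploration and imitation, improving the model’s reasoning ability and out-of-distribution generalization.
Details
Motivation: Existing hint-based RL methods usually ignore difficulty when scheduling hint ratios and estimating relative advantages, which leads to unstable learning and excessive imitation of off-policy hints. A method that better balances exploration and imitation is needed.
Result: In extensive experiments across modalities, model scales, and domains, ADHint shows strong reasoning ability and out-of-distribution generalization, consistently surpassing existing methods on pass@1 and avg@8.
Insight: The novelty is making difficulty a core factor in hint scheduling and advantage estimation. Specifically, the method combines adaptive hints based on a sample difficulty prior, consistency-based gradient modulation with selective masking for hint preservation, and advantage estimation based on a rollout difficulty posterior, yielding more stable and balanced training.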
Abstract: To combine the advantages of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), recent methods have integrated “hints”, which are prefix segments of complete reasoning trajectories, into post-training, aiming for powerful knowledge expansion and reasoning generalization. However, existing hint-based RL methods typically ignore difficulty when scheduling hint ratios and estimating relative advantages, leading to unstable learning and excessive imitation of off-policy hints. In this work, we propose ADHint, which treats difficulty as a key factor in both hint-ratio scheduling and relative-advantage estimation to achieve a better trade-off between exploration and imitation. Specifically, we propose Adaptive Hint with Sample Difficulty Prior, which evaluates each sample’s difficulty under the policy model and accordingly schedules an appropriate hint ratio to guide its rollouts. We also introduce Consistency-based Gradient Modulation and Selective Masking for Hint Preservation to modulate token-level gradients within hints, preventing biased and destructive updates. Additionally, we propose Advantage Estimation with Rollout Difficulty Posterior, which leverages the relative difficulty of rollouts with and without hints to estimate their respective advantages, thereby achieving more balanced updates. Extensive experiments across diverse modalities, model scales, and domains demonstrate that ADHint delivers superior reasoning ability and out-of-distribution generalization, consistently surpassing existing methods in both pass@1 and avg@8. Our code and dataset will be made publicly available upon paper acceptance.
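The abstract does not give the actual scheduling rule, so the following is only an illustrative sketch of difficulty-aware hint scheduling. It assumes that difficulty is one minus the empirical pass rate of hint-free rollouts and that the hint ratio grows linearly with difficulty up to a cap; `difficulty_from_rollouts`, `hint_ratio`, `hint_prefix`, and `max_ratio` are all hypothetical names.

```python
from typing import List, Sequence


def difficulty_from_rollouts(successes: Sequence[bool]) -> float:
    """Assumed difficulty prior: 1 - pass rate of the policy's own rollouts."""
    if not successes:
        raise ValueError("need at least one rollout")
    return 1.0 - sum(successes) / len(successes)


def hint_ratio(difficulty: float, max_ratio: float = 0.8) -> float:
    """Assumed linear schedule: harder samples reveal a longer hint prefix."""
    clipped = min(max(difficulty, 0.0), 1.0)
    return clipped * max_ratio


def hint_prefix(trajectory_tokens: List[str], ratio: float) -> List[str]:
    """Take the leading fraction of a reference reasoning trajectory as the hint."""
    keep = int(len(trajectory_tokens) * ratio)
    return trajectory_tokens[:keep]


if __name__ == "__main__":
    d = difficulty_from_rollouts([True, False, False, False])
    r = hint_ratio(d)
    print(d, r, hint_prefix(list("abcdefghij"), r))
```

Under this toy rule, a sample the policy already solves receives no hint and is explored freely, while a sample it never solves receives the longest permitted prefix.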
[121] Harmonizing Generalization and Specialization: Uncertainty-Informed Collaborative Learning for Semi-supervised Medical Image Segmentation cs.CV | cs.AI | cs.LGPDF
Wenjing Lu, Yi Hong, Yang Yang
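TL;DR: This paper proposes Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework for semi-supervised medical image segmentation that addresses the mismatch between the generalization of foundation models and the needs of specific clinical tasks. A frozen foundation model transfers general knowledge while an adapting teacher captures fine-grained, task-specific representations, and predictive uncertainty adaptively regulates pseudo-label learning to balance guidance from the two teachers.
Details
Motivation: Vision foundation models show strong generalization in medical image segmentation through large-scale heterogeneous pretraining. Under limited annotations or rare pathological variations, however, their general priors do not match task-specific requirements, so they generalize poorly to specialized clinical tasks.
Result: On diverse 2D and 3D segmentation benchmarks, UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines, and it reaches near fully supervised performance with markedly reduced annotation requirements.
Insight: The novelty is a dual-teacher framework whose uncertainty-guided collaborative learning reconciles general-purpose generalization with task specialization. Using predictive uncertainty to regulate pseudo-label learning improves the robustness and performance of semi-supervised segmentation on medical images.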
Abstract: Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.
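One common way to let predictive uncertainty regulate pseudo-label learning is to down-weight or mask pixels whose class distribution has high entropy. The sketch below illustrates that generic idea only; the abstract does not specify UnCoL’s uncertainty measure or weighting rule, so the normalized-entropy measure, the `threshold` parameter, and both function names are assumptions.

```python
import math
from typing import List, Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a class distribution, scaled to [0, 1] by log(K)."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two classes")
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(k)


def pseudo_label_weights(pixel_probs: List[Sequence[float]],
                         threshold: float = 0.5) -> List[float]:
    """Per-pixel loss weight: (1 - uncertainty), or 0 above the threshold.

    Hypothetical rule: confident pixels supervise the student strongly and
    ambiguous pixels are suppressed entirely.
    """
    weights = []
    for probs in pixel_probs:
        u = normalized_entropy(probs)
        weights.append(1.0 - u if u <= threshold else 0.0)
    return weights


if __name__ == "__main__":
    print(pseudo_label_weights([[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]))
```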
[122] Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather cs.CV | cs.AIPDF
Zhijian He, Feifei Liu, Yuwei Li, Zhanpeng Liu, Jintao Cheng
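TL;DR: This paper proposes DiffFusion, a framework that improves the robustness of multi-modal 3D object detection in adverse weather through diffusion-based image and point cloud restoration and a bidirectional adaptive fusion and alignment module.
Details
Motivation: To address the weather-induced data distortions and cross-modal misalignment that degrade multi-modal 3D object detection in adverse weather, and thereby improve perception reliability in scenarios such as autonomous driving.
Result: The method achieves SOTA robustness on three public datasets, validates its generalization through zero-shot tests on the real-world DENSE dataset, and preserves strong performance on clean data.
Insight: The novelty is using the denoising and generative capabilities of diffusion models to adapt to different weather conditions, together with a bidirectional adaptive fusion and alignment module that dynamically handles cross-modal misalignment so that the modalities complement and reinforce each other.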
Abstract: Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR, which restores images degraded by weather effects, and Point Cloud Restoration (PCR), which compensates for corrupted LiDAR data using image object cues. To tackle misalignments between the two modalities, we develop the Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird’s-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of DiffFusion will be released as open source.
[123] MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion cs.CV | cs.ROPDF
Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen
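TL;DR: This paper proposes MMDrive, a multimodal vision-language model framework for autonomous driving scene understanding. By fusing three complementary modalities (occupancy maps, LiDAR point clouds, and textual descriptions), it overcomes the limits of conventional 2D-image-based vision-language models in 3D spatial perception and deep semantic fusion.
Details
Motivation: Existing vision-language models are confined to a 2D planar image understanding paradigm, which makes it hard for them to perceive 3D spatial information and perform deep semantic fusion, resulting in poor performance in complex autonomous driving environments.
Result: On the DriveLM benchmark, MMDrive achieves a BLEU-4 score of 54.56 and a METEOR score of 41.78; on the NuScenes-QA benchmark it reaches 62.7% accuracy, significantly surpassing existing vision-language models for autonomous driving.
Insight: The novelty lies in the Text-oriented Multimodal Modulator and the Cross-Modal Abstractor. The former dynamically weights each modality’s contribution according to the question semantics, and the latter uses learnable abstract tokens to produce compact cross-modal summaries that highlight key information, extending the paradigm from 2D image understanding to generalized 3D scene understanding.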
Abstract: Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing them as a core technology for autonomous driving. However, existing vision-language models are constrained by the 2D planar image understanding paradigm, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities: occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
[124] Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance cs.CVPDF
Francesco Ragusa, Michele Mazzamuto, Rosario Forte, Irene D’Ambra, James Fort
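TL;DR: This paper introduces Ego-EXTRA, a video-language egocentric dataset for expert-trainee assistance. It contains 50 hours of unscripted egocentric video in which trainees perform procedural activities while experts give guidance and answer questions in natural language. The data were collected under a “Wizard of OZ” paradigm in which the expert plays a wearable intelligent assistant, observing the activity from the trainee’s egocentric view and interacting with the trainee. From these two-way dialogues, the authors build a benchmark of more than 15,000 high-quality visual question-answer sets for evaluating multimodal large language models. The results show that the dataset is challenging and expose the limits of current models in providing expert-level assistance.
Details
Motivation: Current multimodal models fall short in providing expert-level, real-time assistance, in particular because high-quality egocentric video-language dialogue data from real interactions are lacking. The authors built Ego-EXTRA to provide a benchmark for developing intelligent assistants that understand complex procedural tasks and give expert feedback.
Result: The paper evaluates multimodal large language models on Ego-EXTRA. The benchmark proves challenging, and current models show clear limitations in providing expert-level assistance.
Insight: The novelty is the use of the “Wizard of OZ” paradigm to collect high-quality, unscripted egocentric video and dialogue from expert-trainee interactions, and the construction of a large-scale visual question answering benchmark. This provides a valuable resource for training and evaluating assistants that understand complex scenes and interact proactively, and it underlines the importance of real-world interaction and expert feedback data.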
Abstract: We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a “Wizard of OZ” data collection paradigm, the expert enacts a wearable intelligent assistant, looking at the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively interacting with suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture a high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset is publicly available to support the benchmark of egocentric video-language assistants: https://fpv-iplab.github.io/Ego-EXTRA/.
[125] STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits cs.CVPDF
Foivos Paraperas Papantoniou, Stathis Galanakis, Rolandos Alexandros Potamias, Bernhard Kainz, Stefanos Zafeiriou
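TL;DR: STARCaster is an identity-aware spatio-temporal video diffusion model that handles both speech-driven portrait animation and free-viewpoint talking portrait synthesis within a unified framework. By introducing soft identity constraints and exploiting the multi-view nature of video data, it avoids the reliance on strict reference conditioning and the imperfect 3D reconstructions of existing methods.
Details
Motivation: Existing 2D speech-to-video diffusion models depend heavily on reference guidance, which limits motion diversity, while 3D-aware animation usually relies on inversion through pre-trained tri-plane generators, which causes imperfect reconstructions and identity drift. The paper addresses both problems.
Result: Comprehensive evaluations show that STARCaster generalizes effectively across tasks and identities and consistently surpasses prior approaches on different benchmarks.
Insight: The contributions are: a compositional approach that progresses from identity-aware motion modeling, to audio-visual synchronization via lip-reading-based supervision, and then to novel-view animation through temporal-to-spatial adaptation; a decoupled learning approach that trains view consistency and temporal coherence independently to overcome the scarcity of 4D audio-visual data; and a self-forcing training scheme that lets the model learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in autoregressive methods.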
Abstract: This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.
[126] Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection cs.CVPDF
Juil Koo, Daehyeon Choi, Sangwoo Youn, Phillip Y. Lee, Minhyuk Sung
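TL;DR: This paper proposes the Visually Grounded Active View Selection (VG-AVS) task, in which an agent selects the most informative next viewpoint using only the current visual information, in support of ambulatory vision. The authors construct a synthetic dataset and propose a framework that trains vision-language models with supervised fine-tuning followed by reinforcement-learning-based policy optimization. The method achieves strong question answering performance through viewpoint selection, generalizes to unseen synthetic and real scenes, and improves the accuracy of existing scene-exploration-based question answering systems.
Details
Motivation: Existing vision-language models (VLMs) mostly perform visual question answering on static images, whereas embodied agents need ambulatory vision: the ability to move actively to obtain more informative views. To move from static visual reasoning toward ambulatory vision, the paper proposes the VG-AVS task, which learns to select the next best viewpoint using only the visual information in the current image.
Result: The proposed method achieves strong question answering performance on the VG-AVS task and generalizes well to both synthetic and real scenes. Integrating the learned VG-AVS framework into existing scene-exploration-based EQA systems also improves downstream question answering accuracy.
Insight: The novelty is the VG-AVS task itself, which emphasizes active view selection that relies only on current visual information rather than scene memory or external knowledge. Methodologically, the paper trains pretrained vision-language models by combining supervised fine-tuning with RL-based policy optimization and builds an automatically generated synthetic dataset to support the task, offering a new direction for ambulatory vision and embodied intelligence research.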
Abstract: Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.
[127] CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing cs.CVPDF
Yan Li, Lin Liu, Xiaopeng Zhang, Wei Xue, Wenhan Luo
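TL;DR: This paper proposes CogniEdit, a unified framework for fine-grained image editing that combines multi-modal reasoning with dense reward optimization and propagates gradients through the denoising process of a diffusion model to achieve trajectory-level control. The method has three core components: a multimodal large language model that decomposes complex instructions, Dynamic Token Focus Relocation that emphasizes fine-grained attributes, and dense GRPO-based optimization that propagates gradients across steps.
Details
Motivation: Existing instruction-based image editing methods built on diffusion models struggle with fine-grained instructions that specify precise attributes such as colors, positions, and quantities. In addition, existing GRPO methods optimize only at individual sampling steps, so feedback is sparse and trajectory-level control is limited.
Result: Extensive experiments on benchmark datasets show that CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation.
Insight: The novelty is combining multi-modal reasoning with dense gradient flow optimization: gradients are propagated across consecutive denoising steps to provide trajectory-level supervision, and a Dynamic Token Focus Relocation mechanism strengthens attention to fine-grained attributes. This gives diffusion models finer and more coherent instruction control.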
Abstract: Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose CogniEdit, a unified framework that combines multi-modal reasoning with dense reward optimization and propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation.
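For context, GRPO scores each sample relative to the other samples generated for the same prompt, normalizing rewards by the group mean and standard deviation. The sketch below shows only that standard group-relative advantage; it does not implement CogniEdit’s dense, cross-step gradient propagation. The use of the population standard deviation and the `eps` stabilizer are assumptions, since implementations differ on both.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Advantage of each sample relative to its group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four edits of the same instruction, scored by some reward model.
    print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))
```

Samples that beat the group average receive positive advantages and are reinforced, and when every sample in a group earns the same reward all advantages are zero, so that group contributes no learning signal.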
[128] Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans? cs.CVPDF
Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu
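TL;DR: This paper proposes Video Reality Test, a benchmark for evaluating the perceptual realism of AI-generated ASMR videos under tight audio-visual coupling. It uses an adversarial creator-reviewer protocol to test whether video generation models can fool vision-language models and humans. Experiments show that the best generator, Veo3.1-Fast, misleads most VLM reviewers while human experts are markedly more accurate, revealing the current boundary of video generation realism and the limits of VLMs in perceptual fidelity and audio-visual consistency.
Details
Motivation: Existing AIGC detection benchmarks mostly ignore audio, target broad narrative domains, and focus only on classification. The paper investigates whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs.
Result: On the Video Reality Test benchmark, the strongest reviewer, Gemini 2.5-Pro, reaches only 56% accuracy (random baseline 50%), far below the 81.25% of human experts. The best generator, Veo3.1-Fast, fools most VLMs. Adding audio helps real-fake discrimination, but superficial cues such as watermarks still significantly mislead models.
Insight: The contributions are an immersive audio-visual benchmark built on real ASMR videos and an adversarial creator-reviewer evaluation protocol. Viewed objectively, the study is the first to systematically evaluate video generation realism under tight audio-visual coupling, and it exposes key weaknesses of VLMs on perceptual tasks, pointing to important directions for future generation and detection models.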
Abstract: Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-Review evaluation. An adversarial creator-reviewer protocol in which video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.
[129] CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images cs.CVPDF
Bo Liu, Qiao Qin, Qinghui He
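TL;DR: This paper proposes CausalCLIP, a framework that addresses the limited generalization of generated-image detectors. Grounded in causal inference, it models the image generation process with a structural causal model and uses Gumbel-Softmax feature masking and HSIC constraints to disentangle causal from non-causal features and filter out spurious patterns. The result is a set of stable, transferable forensic cues that improve detector generalization to unseen generation techniques.
Details
Motivation: Existing generated-image detection methods, including those based on pre-trained vision-language models, produce highly entangled feature representations that mix task-relevant causal forensic cues with spurious or irrelevant non-causal patterns, which limits generalization across generation techniques.
Result: When tested on unseen generative models from different series, CausalCLIP shows strong generalization, improving accuracy by 6.83% and average precision by 4.06% over state-of-the-art methods.
Insight: The novelty is the systematic introduction of causal inference principles into generated-image detection, using feature disentanglement and filtering to improve generalization. Specifically, the method models the generation process with a structural causal model and applies Gumbel-Softmax masking and HSIC constraints to separate statistically independent features, a new methodology aimed at extracting stable, invariant feature representations.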
Abstract: The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
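Gumbel-Softmax relaxes a discrete choice (here, keep or drop a feature) into a differentiable sample by adding Gumbel noise to logits and applying a temperature-scaled softmax. The sketch below shows that sampling step in plain Python; it is not CausalCLIP’s implementation, and treating each feature as a two-way keep/drop choice with logits `[l, 0]` is an assumption made for illustration.

```python
import math
import random
from typing import List, Optional, Sequence


def gumbel_softmax(logits: Sequence[float], tau: float = 1.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Draw one relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    if tau <= 0.0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_feature_mask(keep_logits: Sequence[float], tau: float = 0.5,
                        rng: Optional[random.Random] = None) -> List[float]:
    """Soft keep-probability per feature from a two-way (keep, drop) sample."""
    return [gumbel_softmax([l, 0.0], tau, rng)[0] for l in keep_logits]


if __name__ == "__main__":
    print(sample_feature_mask([5.0, -5.0, 0.0], rng=random.Random(0)))
```

Lowering `tau` pushes each mask entry toward 0 or 1, which approximates a hard feature selection while remaining differentiable in a framework with automatic differentiation.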
[130] LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models cs.CV | cs.AI | cs.LGPDF
Shu Yu, Chaochao Lu
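TL;DR: This paper proposes LINA, a framework that adaptively learns prompt-specific interventions to address the weaknesses of diffusion models in physical alignment and out-of-distribution instruction following. The method combines targeted guidance in the prompt and visual latent spaces with a causality-aware denoising schedule, and it achieves state-of-the-art performance on image and video generation tasks.
Details
Motivation: Diffusion models perform well in image and video generation but struggle with physical alignment and out-of-distribution instruction following. The authors attribute this to the models’ failure to learn causal directions and to disentangle causal factors for novel recombination.
Result: The method achieves state-of-the-art performance on challenging causal generation tasks and on the Winoground dataset, effectively improving the physical alignment and out-of-distribution instruction following of image and video diffusion models.
Insight: The contributions include introducing the Causal Scene Graph and the Physical Alignment Probe dataset for diagnostic interventions; deriving key findings about diffusion models concerning multi-hop reasoning, disentangled representations of texture and physics in the prompt embedding, and the establishment of visual causal structure during the initial denoising steps; and, based on these findings, designing a framework that adaptively predicts interventions.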
Abstract: Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models’ failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at https://opencausalab.github.io/LINA.
[131] ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement cs.CVPDF
Zhihang Liu, Xiaoyi Bao, Pandeng Li, Junjie Zhou, Zhaohe Liao
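TL;DR: This paper introduces a new and challenging task, creative table visualization, which requires a model to generate an infographic that visualizes the data in a given table both faithfully and aesthetically. To address it, the authors propose ShowTable, a pipeline in which a multimodal large language model (MLLM) and a diffusion model cooperate through a progressive self-correcting process. The MLLM acts as the central orchestrator, reasoning about the visual plan and judging visual errors to provide refined instructions, while the diffusion model executes those instructions to produce high-fidelity results. To support the task and the pipeline, the authors introduce three automated data construction pipelines for training the different modules, and propose TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions. Experiments show that the pipeline, instantiated with different models, significantly outperforms baselines, highlighting its multimodal reasoning, generation, and error-correction capabilities.
Details
Motivation: Existing generation and unified models excel at general image generation but are limited on tasks that require deep reasoning, planning, and precise data-to-visual mapping beyond general scenarios. To push past these limits, the paper tackles the new task of creative table visualization.
Result: Experiments are run on the authors' new benchmark, TableVisBench, which contains 800 instances across 5 evaluation dimensions. The ShowTable pipeline significantly outperforms baseline methods under different model instantiations, demonstrating its multimodal reasoning, generation, and error-correction capabilities.
Insight: The contributions are: 1) a new task, creative table visualization, with a corresponding benchmark; 2) the ShowTable pipeline, in which an MLLM acts as the central orchestrator and cooperates with a diffusion model to achieve progressive self-correction; 3) automated data construction pipelines for training the different modules. Viewed objectively, combining the planning and reasoning abilities of MLLMs with the high-quality generation of diffusion models, and correcting errors through a feedback loop, offers a reusable framework for generation tasks that need complex planning and precise control.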
Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator, reasoning about the visual plan and judging visual errors to provide refined instructions, while the diffusion model executes the commands from the MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.
[132] KlingAvatar 2.0 Technical Report cs.CVPDF
Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai
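TL;DR: KlingAvatar 2.0 is a spatio-temporal cascade framework for long-duration, high-resolution avatar video generation. It first generates low-resolution blueprint keyframes that capture global semantics and motion, then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy. This addresses the inefficiency, temporal drifting, quality degradation, and weak prompt following that existing methods show on long videos.
Details
Motivation: Existing avatar video generation models face low efficiency, temporal drifting, quality degradation, and weak prompt following when generating long-duration, high-resolution videos.
Result: Extensive experiments show that the model achieves efficient, multimodally aligned long-form high-resolution video generation, with strong visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
Insight: The main contributions are: 1) a spatio-temporal cascade framework that upscales in both spatial resolution and the temporal dimension; 2) a Co-Reasoning Director composed of three modality-specific LLM experts, which improves cross-modal instruction fusion and alignment through multi-turn dialogue; 3) a Negative Director that improves instruction alignment; 4) an extension of the framework to ID-specific multi-character control.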
Abstract: Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
[133] Unlocking Generalization in Polyp Segmentation with DINO Self-Attention “keys” cs.CVPDF
Carla Monteiro, Valentina Corbetta, Regina Beets-Tan, Luís F. Teixeira, Wilson Silva
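TL;DR: This paper proposes a polyp segmentation framework that uses DINO self-attention "key" features and predicts polyp masks with a simple convolutional decoder. It achieves state-of-the-art performance under Domain Generalization and Extreme Single Domain Generalization protocols, markedly improving generalization in data-scarce and challenging scenarios.
Details
Motivation: Existing deep learning methods for polyp segmentation generalize poorly, especially in data-constrained or challenging settings. The paper aims to fix this while avoiding reliance on complex, task-specific architectures.
Result: On a multi-center dataset under the Domain Generalization and Extreme Single Domain Generalization protocols, the method achieves state-of-the-art performance, surpassing established models such as nnU-Net and UM-Net. A statistical analysis supports the significant gain in generalization.
Insight: The novelty is to use the "key" features of the DINO self-attention module rather than the conventional tokens from the deepest ViT layers, paired with a simple decoder, which improves robustness and generalization. The paper also provides a systematic benchmark of the evolution of the DINO framework, quantifying the impact of architectural advances on the downstream task.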
Abstract: Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention “key” features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework’s evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
[134] Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs cs.CVPDF
Anran Qi, Changjian Li, Adrien Bousseau, Niloy J. Mitra
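TL;DR: This paper proposes a training-free image-to-video generation method that separates motion specification from appearance synthesis through a user-editable Proxy Dynamic Graph (PDG), giving the user explicit control over the content of regions newly exposed by object motion in the final frame (disoccluded regions). The PDG deterministically drives part motion and yields a dense motion flow. A frozen diffusion prior then acts as a motion-guided shader to synthesize appearance. The user can edit appearance in the disoccluded regions, and a latent-space composite reconciles motion with user intent.
Details
Motivation: Existing image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motion while enforcing user-specified content in newly revealed areas. The paper aims to enable controllable editing of disoccluded regions.
Result: On turning images into short videos of articulated objects, furniture, vehicles, and deformables, the method shows clear advantages over state-of-the-art alternatives, achieving controllable articulation and user control over disoccluded regions without fine-tuning.
Insight: The novelty is to decouple motion specification from appearance synthesis: a lightweight, user-editable PDG provides deterministic motion, and a frozen diffusion model synthesizes appearance. The method mixes generative control (loose pose and structure) with predictable control (appearance specified in the disoccluded regions of the final frame), unlocking a new image-to-video workflow.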
Abstract: We address image-to-video generation with explicit user control over the final frame’s disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages over state-of-the-art alternatives when turning images into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable control, in the form of appearance specification in the disoccluded regions of the final frame, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyondvisible.github.io/
[135] End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery cs.CV | cs.AIPDF
Lorenzo Pettinari, Sidaty El Hadramy, Michael Wehrli, Philippe C. Cattin, Daniel Studer
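TL;DR: This paper proposes End2Reg, an end-to-end deep learning framework for markerless registration in spine surgery. It jointly optimizes segmentation and registration without weak segmentation labels or manual steps, learning segmentation masks that are optimal for registration under the guidance of the registration objective alone.
Details
Motivation: Current spine surgery navigation systems rely on intraoperative radiographic imaging and bone-anchored markers, which are invasive, radiation-intensive, and disruptive to the workflow. Existing markerless RGB-D registration methods are promising but rely on weak segmentation labels to isolate the relevant anatomy, which can propagate errors through registration.
Result: The method achieves state-of-the-art performance on ex-vivo and in-vivo benchmarks, reducing median Target Registration Error by 32% to 1.83 mm and mean Root Mean Square Error by 45% to 3.95 mm, respectively. An ablation study confirms that end-to-end optimization significantly improves registration accuracy.
Insight: The core novelty is end-to-end joint optimization in which segmentation is driven entirely by the registration objective. The network thus learns a task-specific segmentation that is optimal for registration, avoids the error propagation introduced by weak labels, and advances fully automatic, markerless intraoperative navigation.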
Abstract: Purpose: Intraoperative navigation in spine surgery demands millimeter-level accuracy. Current systems based on intraoperative radiographic imaging and bone-anchored markers are invasive, radiation-intensive, and workflow-disruptive. Recent markerless RGB-D registration methods offer a promising alternative, but existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, which can propagate errors throughout registration. Methods: We present End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for weak segmentation labels and manual steps. The network learns segmentation masks specifically optimized for registration, guided solely by the registration objective without direct segmentation supervision. Results: The proposed framework achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% to 1.83 mm and mean Root Mean Square Error by 45% to 3.95 mm, respectively. An ablation study confirms that end-to-end optimization significantly improves registration accuracy. Conclusion: The presented end-to-end RGB-D registration pipeline removes dependency on weak labels and manual steps, advancing towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.
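For readers unfamiliar with the reported metrics, the following sketch shows one common way to compute per-target registration error and summarize it as a median and an RMSE. The function names, the nested-list rotation format, and the choice of target points are assumptions for illustration. The abstract does not specify how End2Reg defines its targets or aggregates errors.

```python
import math
import statistics

def apply_rigid(R, t, p):
    """Apply a 3x3 rotation R (nested lists) and translation t to a 3D point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def registration_errors(R, t, targets, ground_truth):
    """Euclidean error per target after mapping it with the estimated (R, t)."""
    return [math.dist(apply_rigid(R, t, p), g) for p, g in zip(targets, ground_truth)]

def summarize(errors):
    """Return (median error, root mean square error) over all targets."""
    median_tre = statistics.median(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return median_tre, rmse
```

The median is less sensitive to a few badly registered targets than the RMSE, which is one reason the two numbers are usually reported together.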
[136] Computer vision training dataset generation for robotic environments using Gaussian splatting cs.CV | cs.GRPDF
Patryk Niżeniec, Marcin Iwanowski
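TL;DR: This paper proposes a pipeline that uses 3D Gaussian Splatting (3DGS) and game-engine physics simulation to generate large-scale, highly realistic, automatically labeled datasets for computer vision tasks in robotic environments. A novel two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes, markedly improving the visual realism of the synthetic images, and pixel-level segmentation masks are generated automatically. Experiments show that a hybrid training strategy combining a small set of real images with a large volume of synthetic data gives the best detection and segmentation performance.
Details
Motivation: The paper addresses the domain gap between synthetic data and real-world images in robotics, as well as the time-consuming and costly bottleneck of manual annotation, with the aim of efficiently generating high-quality, automatically labeled training data for computer vision tasks.
Result: Experiments confirm that a hybrid training strategy combining a small set of real images with a large volume of the generated synthetic data achieves the best performance on object detection and segmentation. The authors conclude this is an optimal strategy for efficiently obtaining robust and accurate models.
Insight: The novelty is to build photorealistic environment representations with 3D Gaussian Splatting and combine game-engine physics simulation with a two-pass rendering technique (fusing splat realism with proxy-mesh shadows) to produce highly realistic synthetic images. The pipeline also generates pixel-perfect segmentation masks automatically, formatted for direct use in training models such as YOLO, giving an efficient and scalable data generation solution.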
Abstract: This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses the critical challenges of the domain gap between synthetic and real-world imagery and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then used in a game engine where physics simulations create natural arrangements. A novel, two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes. This map is then algorithmically composited with the image to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models like YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming this as an optimal strategy for efficiently achieving robust and accurate models.
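As an illustration of turning a binary mask into a detector label, the sketch below converts one object mask into a line of the form `class x_center y_center width height`, with coordinates normalized to the image size. This is the box label layout commonly used by YOLO-family detectors. The abstract does not say which label variant (boxes or polygons) the authors export, so the function name and format choice here are assumptions.

```python
def mask_to_yolo_box(mask, class_id=0):
    """Convert a binary mask (list of rows of 0/1) to a normalized box label line.

    Returns None when the mask contains no foreground pixel.
    """
    height, width = len(mask), len(mask[0])
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(mask[r][c] for r in range(height))]
    if not rows:
        return None
    # Pixel-edge bounds of the tight bounding box around the foreground.
    x0, x1 = cols[0], cols[-1] + 1
    y0, y1 = rows[0], rows[-1] + 1
    x_center = (x0 + x1) / 2 / width
    y_center = (y0 + y1) / 2 / height
    box_w = (x1 - x0) / width
    box_h = (y1 - y0) / height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"
```

Because the masks come from the renderer rather than from human annotators, a conversion like this can run over every generated frame with no manual review step.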
[137] USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition cs.CVPDF
Ahmed Abul Hasanaath, Hamzah Luqman
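TL;DR: This paper proposes the Unified Spatio-Temporal Modeling (USTM) framework for continuous sign language recognition (CSLR). By combining an enhanced Swin Transformer backbone with a lightweight temporal adapter (TAPE), it captures fine-grained spatial features together with short- and long-term temporal dependencies, reaching state-of-the-art performance from RGB video input alone.
Details
Motivation: Existing CSLR methods usually pair a CNN backbone with temporal convolution or recurrent modules, which struggle to capture fine-grained hand and facial cues and long-range temporal dependencies. A framework that models spatial and temporal information jointly is therefore needed.
Result: Extensive experiments on PHOENIX14, PHOENIX14T, and CSL-Daily show that USTM, using RGB input only, outperforms both RGB-based and multi-modal methods and remains competitive with multi-stream methods, reaching state-of-the-art performance.
Insight: The novelty is a unified spatio-temporal encoder that combines a Swin Transformer backbone with the lightweight temporal adapter TAPE to jointly model fine-grained spatial features and short- and long-term temporal context, without multi-stream inputs or auxiliary modalities. This simplifies the architecture while improving performance.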
Abstract: Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns by combining a Swin Transformer backbone with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmark datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM
[138] Learning to Generate Cross-Task Unexploitable Examples cs.CVPDF
Haoxuan Qu, Qiuchi Xiang, Yujun Cai, Yirui Wu, Majid Mirmehdi
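TL;DR: This paper proposes a Meta Cross-Task Unexploitable Example Generation framework. It optimizes the generator with a meta-learning scheme so that the generated examples are broadly unexploitable across a range of real-world computer vision tasks, making personal image privacy protection more practical.
Details
Motivation: Unexploitable examples produced by existing methods have limited practical applicability because they are not guaranteed to remain unexploitable across different computer vision tasks.
Result: Extensive experiments demonstrate the efficacy of the framework, but the abstract does not report specific benchmarks or quantitative comparisons with existing methods.
Insight: The core novelty is a flat-minima-oriented meta training and testing scheme for optimizing the generator so that it produces examples that are unexploitable across tasks. This improves the robustness and generalization of the generated examples from a meta-learning perspective.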
Abstract: Unexploitable example generation aims to transform personal images into their unexploitable (unlearnable) versions before they are uploaded online, thereby preventing unauthorized exploitation of online personal images. Recently, this task has garnered significant research attention due to its critical relevance to personal data privacy. Yet, despite recent progress, existing methods for this task can still suffer from limited practical applicability, as they can fail to generate examples that are broadly unexploitable across different real-world computer vision tasks. To deal with this problem, in this work, we propose a novel Meta Cross-Task Unexploitable Example Generation (MCT-UEG) framework. At the core of our framework, to optimize the unexploitable example generator for effectively producing broadly unexploitable examples, we design a flat-minima-oriented meta training and testing scheme. Extensive experiments show the efficacy of our framework.
[139] RecTok: Reconstruction Distillation along Rectified Flow cs.CVPDF
Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang
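TL;DR: RecTok proposes a new training method for visual tokenizers. Through flow semantic distillation and reconstruction-alignment distillation, it overcomes the trade-off between reconstruction fidelity and generation quality in high-dimensional latent spaces, achieving strong image reconstruction, generation quality, and discriminative performance while keeping the latent space semantically rich.
Details
Motivation: Existing visual tokenizers face a fundamental trade-off between latent dimensionality and generation quality: high-dimensional tokenizers usually underperform their low-dimensional counterparts, which limits semantic expressiveness. RecTok aims to break this limit by improving the training strategy for high-dimensional visual tokenizers.
Result: RecTok achieves state-of-the-art results on gFID-50K both with and without classifier-free guidance, and its performance keeps improving as the latent dimensionality increases.
Insight: The core novelty is to distill the semantic information of vision foundation models (VFMs) into the forward flow trajectories of flow matching, which serve as the training space of diffusion transformers, rather than focusing on the latent space itself as in prior work. A masked feature reconstruction loss further strengthens the semantics. This offers a new direction for high-dimensional visual representation learning.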
Abstract: Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction–alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss. RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on gFID-50K both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at https://shi-qingyu.github.io/rectok.github.io.
[140] PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence cs.CVPDF
Ruiyan Wang, Teng Hu, Kaihui Huang, Zihan Su, Ran Yi
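TL;DR: This paper proposes PoseAnything, a universal pose-guided video generation framework that handles both human and non-human characters and supports arbitrary skeletal inputs. To improve consistency during motion, it introduces a Part-aware Temporal Coherence Module that divides the subject into parts and establishes correspondences across frames. It also proposes a Subject and Camera Motion Decoupled CFG guidance strategy, which for the first time enables independent camera motion control in pose-guided video generation. The authors additionally release XPose, a high-quality dataset of 50,000 non-human pose-video pairs.
Details
Motivation: Existing pose-guided video generation methods accept only human poses and generalize poorly. The paper aims for universal pose control over arbitrary subjects, including non-human characters, while improving temporal consistency and camera motion controllability.
Result: In extensive experiments, PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
Insight: The contributions are: 1) a universal pose input framework that supports arbitrary skeletons; 2) a Part-aware Temporal Coherence Module for fine-grained part-level consistency; 3) a Subject and Camera Motion Decoupled CFG strategy that enables independent camera motion control for the first time; 4) the high-quality non-human pose dataset XPose with an automated annotation pipeline.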
Abstract: Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, and thus generalize poorly to poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce a Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
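The "positive and negative anchors of CFG" refer to the two model predictions that classifier-free guidance combines at each denoising step. The sketch below shows only the generic combination rule, with the predictions represented as flat lists of floats. Which control signal (subject or camera motion) PoseAnything feeds into which anchor is not stated in the abstract, so the function and argument names are hypothetical.

```python
def cfg_combine(eps_pos, eps_neg, scale):
    """Generic classifier-free guidance: start from the negative anchor and
    move along the direction toward the positive anchor by `scale`.

    eps_pos, eps_neg: noise predictions from the positive and negative
    conditioning branches (hypothetical flat lists of floats).
    """
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]
```

With `scale` equal to 1 the result is the positive-anchor prediction, and larger values extrapolate further away from whatever the negative anchor encodes. That is why moving a control signal between the two anchors changes what the guidance emphasizes.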
[141] Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10$\times$ cs.CVPDF
Jiangning Zhang, Junwei Zhu, Teng Hu, Yabiao Wang, Donghao Luo
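TL;DR: This paper proposes T3 (Transform Trained Transformer), a Transformer retrofit strategy for native 4K video generation, where the computational explosion of full attention makes it hard to balance efficiency and quality. By optimizing the forward logic of a pretrained model without changing its core architecture, and by introducing a multi-scale weight-sharing window attention mechanism with hierarchical blocking, the method greatly reduces compute requirements and speeds up generation.
Details
Motivation: As spatiotemporal resolution increases, the cost of full attention grows quadratically, which makes it difficult for native 4K video generation to balance efficiency and quality.
Result: On the 4K-VBench benchmark, T3-Video substantially outperforms existing approaches (+4.29 VQA and +0.08 VTC) while accelerating native 4K video generation by more than 10×.
Insight: The claimed novelty is a retrofit strategy that leaves the core architecture of the pretrained model unchanged and efficiently transforms its attention pattern through multi-scale weight-sharing window attention and hierarchical blocking. Viewed objectively, the core idea is to combine an efficient local attention mechanism with a pretrained full-attention model, achieving an order-of-magnitude speedup while maintaining quality, which is a practical efficiency option for high-resolution video generation.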
Abstract: Native 4K (2160$\times$3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an “attention pattern” transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: while delivering performance improvements (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at https://zhangzjn.github.io/projects/T3-Video
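To make the quadratic growth concrete, the sketch below counts query-key pairs for full attention versus non-overlapping window attention on a single frame. The token grid is a hypothetical one: it assumes a 16× reduction per side from pixels to tokens and a 30×30-token window, neither of which is given in the abstract. The numbers illustrate the scaling argument only and are not a reproduction of the speedup T3-Video reports.

```python
def attention_pairs_full(num_tokens):
    """Every token attends to every token: quadratic in the token count."""
    return num_tokens ** 2

def attention_pairs_windowed(num_tokens, window_tokens):
    """Each token attends only within its own non-overlapping window."""
    if num_tokens % window_tokens != 0:
        raise ValueError("token count must be divisible by the window size")
    return num_tokens * window_tokens

# Hypothetical token grid for one 2160x3840 frame, assuming 16x reduction per side.
tokens_per_frame = (2160 // 16) * (3840 // 16)  # 135 * 240 tokens
window = 30 * 30                                # assumed window of 900 tokens
ratio = attention_pairs_full(tokens_per_frame) // attention_pairs_windowed(tokens_per_frame, window)
```

Under these assumptions the ratio is simply the number of windows per frame, so the saving grows linearly with resolution for a fixed window size.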
[142] Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation cs.CVPDF
Jiangning Zhang, Junwei Zhu, Zhenye Gan, Donghao Luo, Chuming Lin
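TL;DR: The paper proposes Soul, a multimodal-driven framework for high-fidelity, long-term consistent digital human animation. From a single portrait image, a text prompt, and audio, it generates semantically coherent video with precise lip synchronization, vivid facial expressions, and robust identity preservation.
Details
Motivation: Generating high-quality, long-term consistent digital human animation from multimodal inputs (image, text, audio) is challenging. The paper aims in particular to overcome data scarcity and to enable comprehensive, fair evaluation of existing methods.
Result: Soul significantly outperforms leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy. With its efficiency optimizations, inference is 11.4× faster with negligible quality loss.
Insight: The main contributions are: the large-scale, high-quality dataset Soul-1M and the evaluation benchmark Soul-Bench; audio-injection layers, multiple training strategies, and threshold-aware codebook replacement integrated on the Wan2.2-5B backbone to ensure long-term consistency; and step/CFG distillation with a lightweight VAE to optimize inference efficiency.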
Abstract: We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at https://zhangzjn.github.io/projects/Soul/
[143] Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model cs.CVPDF
Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng
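TL;DR: This paper presents Seedance 1.5 pro, a foundation model designed specifically for native, joint audio-video generation. It uses a dual-branch Diffusion Transformer architecture with a cross-modal joint module and a multi-stage data pipeline, achieving strong audio-visual synchronization and generation quality. The model is refined with supervised fine-tuning and reinforcement learning from human feedback, and an acceleration framework boosts inference speed by more than 10×. It supports precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, and is positioned as an engine for professional-grade content creation.
Details
Motivation: Recent progress in video generation has paved the way for unified audio-visual generation, but a dedicated model is needed to achieve high-quality, synchronized native joint generation for professional content creation.
Result: The model achieves strong audio-visual synchronization and generation quality, with precise multilingual and dialect lip-syncing, dynamic camera control, and narrative coherence. The acceleration framework boosts inference speed by more than 10×.
Insight: The contributions are: 1) a dual-branch Diffusion Transformer architecture with a cross-modal joint module, designed for native joint audio-video generation; 2) a post-training pipeline combining Supervised Fine-Tuning (SFT) on high-quality data with Reinforcement Learning from Human Feedback (RLHF) using multi-dimensional reward models; 3) an inference framework with more than 10× acceleration; 4) targeted improvements in lip-syncing, camera control, and narrative coherence, which matter most for professional content creation.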
Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
[144] TARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLMs for Video Understanding cs.CV | cs.IRPDF
Piyush Bagad, Andrew Zisserman
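TL;DR: This paper proposes TARA, a simple and efficient recipe for adapting multimodal large language models (MLLMs) into a time-aware video-text embedding model without using any video data. The authors also propose a new benchmark that uses temporally opposite (chiral) actions as hard negatives to evaluate time-awareness in retrieval. TARA outperforms all existing video-text models on this chiral benchmark, achieves strong results on standard benchmarks, and shows additional benefits in understanding negation, verbs, and adverbs.
Details
Motivation: The goal is to build a general, time-aware video-text embedding model for retrieval, addressing the weakness of existing models in understanding temporal order and temporal relations in video.
Result: TARA outperforms all existing video-text models on the proposed chiral action benchmark and achieves strong results on standard benchmarks. It is also negation-aware on NegBench, which evaluates negation in video retrieval, and reaches state-of-the-art performance on verb and adverb understanding in videos.
Insight: The core novelty is a simple and efficient method that builds a time-aware video embedding model by adapting MLLMs alone, with no video data, together with a new benchmark that evaluates time-awareness through hard negatives built from temporally opposite (chiral) actions. The method also unexpectedly improves the understanding of negation, verbs, and adverbs, showing its versatility.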
Abstract: Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware as shown in NegBench benchmark that evaluates negation in video retrieval, (ii) TARA achieves state of the art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state of the art zero-shot performance.
[145] MMhops-R1: Multimodal Multi-hop Reasoning cs.CVPDF
Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Bing Li
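TL;DR: The paper introduces the MMhops benchmark and the MMhops-R1 model to address multimodal multi-hop reasoning. MMhops is a large-scale benchmark with two task formats, Bridging and Comparison, that evaluates a model's ability to integrate external knowledge for complex reasoning. MMhops-R1 is a reinforcement-learning-based multimodal retrieval-augmented generation framework that dynamically plans reasoning paths, formulates queries, and synthesizes multi-level information.
Details
Motivation: Existing MLLMs are largely limited to single-step reasoning because benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. The authors aim to create a benchmark and a corresponding model that systematically evaluate multimodal multi-hop reasoning.
Result: On the MMhops benchmark, MMhops-R1 significantly outperforms strong baselines, showing that dynamic planning and multimodal knowledge integration matter for complex reasoning. The model also generalizes well to tasks that require fixed-hop reasoning.
Insight: The contributions are MMhops, a large-scale benchmark for systematically evaluating multimodal multi-hop reasoning, and MMhops-R1, a multimodal RAG framework that uses reinforcement learning to optimize dynamic reasoning-path planning. The central insight is that dynamic planning and cross-modal knowledge integration are key to stronger complex reasoning.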
Abstract: The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which require models to dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
[146] DA-SSL: self-supervised domain adaptor to leverage foundational models in turbt histopathology slides cs.CV | cs.AIPDF
Haoyue Zhang, Meera Chappidi, Erolcan Sayar, Helen Richards, Zhijun Chen
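TL;DR: The paper proposes DA-SSL, a domain-adaptive self-supervised adaptor that addresses the performance limits of pathology foundational models (PFMs) caused by domain shift on specific specimen types, such as transurethral resection of bladder tumor (TURBT) specimens. The method uses self-supervised learning to realign pretrained PFM features to the target domain without fine-tuning the foundational model itself, and applies them to a multiple-instance-learning task that predicts treatment response.
Details
Motivation: PFMs perform poorly under domain shift on certain cancer or specimen types that were rare in the pretraining data, such as TURBT specimens containing fragmented tissue chips and electrocautery artifacts. This limits their use on clinically challenging tasks such as predicting response to neoadjuvant chemotherapy in bladder cancer.
Result: In a multi-center study, DA-SSL achieved an AUC of 0.77 ± 0.04 in five-fold cross-validation and, using majority voting on the external test set, an accuracy of 0.84, a sensitivity of 0.71, and a specificity of 0.91, improving the PFM-based MIL pipeline.
Insight: The novelty is a lightweight, self-supervised domain adaptation method that realigns features to overcome domain shift without fine-tuning the large foundational model. It offers a reusable approach for adapting general foundational models efficiently to specific, data-scarce medical imaging domains.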
Abstract: Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts: these cancer types were rarely used for pretraining, or the specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT) specimens, which are essential for diagnosing muscle-invasive bladder cancer (MIBC) but contain fragmented tissue chips and electrocautery artifacts and were not widely used in pretraining publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77 ± 0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is available at https://github.com/zhanghaoyue/DA_SSL_TURBT.
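The external-test numbers above are reported "using majority voting". One plausible reading, assumed here and not confirmed by the abstract, is that the binary predictions of the five cross-validation models are combined per patient before accuracy, sensitivity, and specificity are computed. A minimal sketch of that procedure, with hypothetical function names:

```python
def majority_vote(votes):
    """Return 1 if strictly more than half of the binary votes are 1, else 0."""
    return int(2 * sum(votes) > len(votes))

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate).

    Assumes y_true contains at least one positive and one negative label.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

With an odd number of voters, as in a five-fold setup, ties cannot occur, so the strict-majority rule is unambiguous.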
[147] LongVie 2: Multimodal Controllable Ultra-Long Video World Model cs.CVPDF
Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu
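TL;DR: LongVie 2 is an end-to-end autoregressive framework that builds a controllable, high-quality, temporally consistent long-video world model through three-stage training. It combines multi-modal guidance, degradation-aware training, and history-context guidance, achieves SOTA performance on the LongVGenBench benchmark, and supports continuous video generation of up to five minutes.
Details
Motivation: Building video world models on pretrained video generation systems is an important but challenging step toward general spatiotemporal intelligence, and requires three key properties: controllability, long-term visual quality, and temporal consistency.
Result: Extensive experiments on LongVGenBench (a benchmark of 100 high-resolution one-minute videos) show that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal consistency, and visual fidelity, and supports continuous generation of up to five minutes.
Insight: The novelty is a progressive three-stage training strategy: multi-modal guidance improves controllability; degradation-aware training bridges the gap between training and long-term inference to maintain visual quality; history-context guidance ensures temporal consistency. The paper also introduces LongVGenBench, a dedicated benchmark for long-video generation.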
Abstract: Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach-first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
[148] Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models cs.CV | cs.LGPDF
Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli
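TL;DR: The paper proposes the Do-Undo task and benchmark to address a key gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. The task requires a model to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause and effect in the visual world. The authors build a large-scale dataset of reversible actions and design a training strategy for robust action grounding. Experiments show that existing models struggle with physical reversibility, underscoring the task's importance for embodied AI, robotics, and physics-aware generative modeling.
Details
Motivation: Existing vision-language models focus mainly on object-level edits and lack the ability to understand and generate scene transformations driven by physical actions, in particular to simulate and reverse those actions in a way that reflects real causality.
Result: Current models perform poorly on physical reversibility, which underscores the task's importance. The Do-Undo benchmark provides an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
Insight: The novelty is a new task and benchmark centered on the reversibility of physical actions, emphasizing the need to model physical causality. It offers a more rigorous framework for evaluating a model's physical understanding and may drive progress toward more physics-aware generative models and embodied systems.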
Abstract: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
[149] SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning cs.CVPDF
Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu
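TL;DR: The paper proposes SCR2-ST, a framework that uses single-cell sequencing priors to drive active sampling via reinforcement learning, and a hybrid regression-retrieval network to predict spatial transcriptomics expression, so that data from biologically informative tissue regions can be acquired efficiently at low cost.
Details
Motivation: Spatial transcriptomics is expensive, and traditional fixed-grid sampling produces redundant measurements; the resulting data scarcity constrains method development. Single-cell sequencing can serve as an auxiliary data source to ease this, but the two must be bridged to improve sampling efficiency and prediction accuracy.
Result: Evaluated on three public spatial transcriptomics datasets, SCR2-ST achieves SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios.
Insight: The novelties are (1) reward signals built from single-cell foundation model embeddings and spatial density information, enabling biologically grounded, reinforcement-learning-based active sampling; and (2) a hybrid regression-retrieval network in which a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision.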
Abstract: Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: https://github.com/hrlblab/SCR2ST
[150] MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning cs.CV | cs.ROPDF
Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie
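TL;DR: The paper proposes MindDrive, a vision-language-action (VLA) model for autonomous driving that uses online reinforcement learning to address the distribution shift and causal confusion of the conventional imitation learning paradigm. The framework uses one large language model with two distinct sets of LoRA parameters, acting as a Decision Expert and an Action Expert respectively, and maps discrete linguistic decisions to continuous trajectories, so that trial-and-error learning takes place in the linguistic decision space rather than the continuous action space.
Details
Motivation: Current VLA models for autonomous driving rely mainly on imitation learning, which suffers from inherent problems such as distribution shift and causal confusion. Online reinforcement learning offers a promising remedy through trial-and-error learning, but inefficient exploration in continuous action spaces has hindered its application to VLA models.
Result: On the challenging Bench2Drive benchmark, MindDrive achieves strong closed-loop performance, with a Driving Score of 78.04 and a Success Rate of 55.09%. To the authors' knowledge, it is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
Insight: The core idea is to decouple decision generation (discrete language) from action generation (continuous trajectories) using one LLM with two sets of LoRA parameters, and to feed trajectory-level rewards back into the reasoning space. This confines online reinforcement learning exploration to a finite set of discrete linguistic decisions, balancing optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration.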
Abstract: Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. With one set, the LLM serves as a Decision Expert for scenario reasoning and driving decision-making; with the other, it acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
[151] Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All cs.CVPDF
Michal Nazarczuk, Thomas Tanay, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero
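TL;DR: The paper presents a new dataset for novel view synthesis, generated from a high-quality animated film with striking realism and intricate detail. It captures a variety of dynamic scenes with detailed textures, lighting, and motion, making it well suited to training and evaluating advanced 4D scene reconstruction and novel view generation models. Besides high-fidelity RGB images, it provides complementary modalities including depth, surface normals, object segmentation, and optical flow to support deeper understanding of scene geometry and motion. The dataset is organized into three benchmarking scenarios (a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences), enabling broad experimentation and comparison across levels of data sparsity.
Details
Motivation: Existing novel view synthesis datasets fall short in visual richness, dynamic-scene coverage, and multi-modal annotation, and so cannot fully support training and evaluating state-of-the-art 4D scene reconstruction and view generation models. The paper aims to advance view synthesis and 3D vision with a comprehensive dataset that combines high visual quality, detailed annotations, and diverse experimental setups.
Result: The paper mainly introduces the dataset itself; the abstract reports no quantitative results or benchmark rankings. The dataset is designed to support broad experimentation and comparison, and its three benchmark scenarios (dense, sparse, monocular) are intended to serve as a standard for evaluating future methods.
Insight: The contribution is a comprehensive benchmark dataset for novel view synthesis. Its main strengths are: 1) it is based on a high-quality animated film, offering a high degree of visual realism and detail; 2) it includes dynamic scenes and several complementary modalities (depth, normals, segmentation, optical flow), supporting thorough analysis of scene geometry and motion; 3) its three benchmark scenarios at different levels of data sparsity allow systematic evaluation of robustness and generalization under different input conditions. Together these give 4D scene reconstruction and novel view synthesis research a unified evaluation platform.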
Abstract: This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.
[152] Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency cs.CVPDF
Wenhan Chen, Sezer Karaoglu, Theo Gevers
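TL;DR: The paper proposes Grab-3D, a framework for detecting AI-generated videos based on 3D geometric temporal consistency. It uses vanishing points as an explicit representation of 3D geometric patterns, revealing fundamental differences in geometric consistency between real and AI-generated videos, and performs detection with a geometry-aware transformer, supported by a newly constructed dataset of AI-generated static-scene videos.
Details
Motivation: Advances in diffusion-based generation let AI models produce highly realistic videos, so more reliable detection mechanisms are needed, yet existing detection methods have explored the 3D geometric patterns in generated videos only to a limited extent.
Result: Experiments show that Grab-3D significantly outperforms state-of-the-art detectors and generalizes robustly across domains to unseen generators.
Insight: The novelty is using vanishing points as an explicit representation of 3D geometry to expose geometric inconsistencies in generated videos, together with a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head that explicitly injects 3D geometric awareness into temporal modeling. The constructed dataset of AI-generated static-scene videos allows stable extraction of 3D geometric features and provides a basis for reliable evaluation.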
Abstract: Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
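For readers unfamiliar with vanishing points, the sketch below shows the standard projective-geometry construction: in homogeneous coordinates, the line through two image points is their cross product, and the intersection of two lines is again a cross product. This is textbook geometry rather than the Grab-3D pipeline, whose line detection and feature extraction are not described in the abstract; the function names are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Homogeneous line through two 2D image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(seg_a, seg_b, eps=1e-9):
    """Intersect the image lines of two segments that are parallel in 3D.

    Returns (x, y) in pixels, or None when the image lines are also
    parallel, i.e. the vanishing point lies at infinity.
    """
    x, y, w = cross(line_through(*seg_a), line_through(*seg_b))
    if abs(w) < eps:
        return None
    return (x / w, y / w)


# Two converging segments, e.g. the two edges of a road (toy coordinates).
vp = vanishing_point(((0.0, 0.0), (2.0, 1.0)), ((0.0, 4.0), (2.0, 3.0)))

# Two segments that stay parallel in the image have no finite vanishing point.
vp_parallel = vanishing_point(((0.0, 0.0), (1.0, 0.0)), ((0.0, 1.0), (1.0, 1.0)))
```

Tracking how such points move from frame to frame is one plausible way to probe geometric consistency over time, though the exact features Grab-3D derives from them are an open detail here.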
[153] AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection cs.CVPDF
Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang
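TL;DR: The paper proposes AgentIAD, a tool-augmented single-agent framework for industrial anomaly detection. It performs multi-stage visual inspection using a Perceptive Zoomer and a Comparative Retriever, is trained with supervised fine-tuning followed by reinforcement learning, and reaches 97.62% classification accuracy on the MMAD dataset, a new SOTA.
Details
Motivation: Industrial anomaly detection is hampered by scarce normal reference samples and by subtle, localized defects. Existing single-pass vision-language models often overlook small anomalies and lack an explicit mechanism for comparison against canonical normal patterns.
Result: On MMAD, AgentIAD achieves 97.62% classification accuracy, surpassing prior MLLM-based methods and setting a new SOTA.
Insight: The novelty is a tool-driven agentic framework that combines a Perceptive Zoomer for localized fine-grained analysis with a Comparative Retriever for querying normal exemplars. Two-stage training (supervised fine-tuning plus reinforcement learning) and a two-part reward (a perception reward and a behavior reward) guide the model through step-wise observation, zooming, and verification, producing transparent and interpretable inspection traces.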
Abstract: Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
[154] JoVA: Unified Multimodal Learning for Joint Video-Audio Generation cs.CVPDF
Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han
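TL;DR: JoVA is a unified framework for joint video-audio generation. It applies joint self-attention over video and audio tokens in every transformer layer, enabling direct and efficient cross-modal interaction without additional alignment modules. It also introduces a mouth-area loss based on facial keypoint detection to improve lip-speech synchronization.
Details
Motivation: Most existing methods can only generate ambient sounds and cannot produce human speech synchronized with lip movements. Existing unified generation methods also rely on explicit fusion or modality-specific alignment modules, which add architectural complexity and erode the simplicity of the original transformers.
Result: Across benchmarks, JoVA outperforms or is competitive with state-of-the-art unified and audio-driven methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity.
Insight: The novelties are joint self-attention for cross-modal interaction, which simplifies the architecture, and a mouth-area loss based on facial keypoints, which strengthens lip-sync supervision without adding model complexity. The approach addresses alignment in multimodal generation while keeping the transformer simple.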
Abstract: In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
[155] LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction cs.CVPDF
Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo
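TL;DR: LASER is a training-free streaming 4D reconstruction framework that converts offline reconstruction models (such as VGGT and π³) into streaming systems. It uses layer-wise scale alignment to resolve inconsistent predictions across consecutive temporal windows, enabling efficient, low-memory reconstruction of kilometer-scale streaming videos.
Details
Motivation: Existing feed-forward reconstruction models (such as VGGT and π³) cannot process streaming video because of quadratic memory complexity, while existing streaming methods require extensive retraining and may not fully exploit the geometric priors of offline models. The paper therefore proposes a training-free way to convert offline models for streaming.
Result: LASER achieves SOTA performance on camera pose estimation and point map reconstruction while running at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, supporting practical deployment on kilometer-scale streaming videos.
Insight: The novelty is the layer-wise scale alignment mechanism: depth predictions are segmented into discrete layers, a scale factor is computed per layer, and the factors are propagated across adjacent windows and timestamps. This resolves the layer depth misalignment caused by monocular scale ambiguity and lets offline models be used efficiently without training.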
Abstract: Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: https://neu-vi.github.io/LASER/
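The layer-wise idea in this entry can be illustrated with a toy example. The sketch below is a simplified reading of it, not LASER's implementation: it assumes that two windows predict depth for the same pixels of a shared frame, splits those pixels into equal-count depth layers, and uses the median depth ratio within each layer as that layer's scale factor. The binning rule, the use of a median, and all names are assumptions; propagation across windows and timestamps is not shown.

```python
from statistics import median


def layer_ids(depths, num_layers):
    """Assign each pixel to a depth layer using equal-count bins over sorted depth."""
    order = sorted(range(len(depths)), key=lambda i: depths[i])
    ids = [0] * len(depths)
    for rank, i in enumerate(order):
        ids[i] = min(rank * num_layers // len(depths), num_layers - 1)
    return ids


def layerwise_scales(prev_depth, curr_depth, num_layers=2):
    """Per-layer scale factors that map curr_depth onto prev_depth."""
    ids = layer_ids(curr_depth, num_layers)
    scales = []
    for k in range(num_layers):
        ratios = [p / c for p, c, j in zip(prev_depth, curr_depth, ids) if j == k and c > 0]
        scales.append(median(ratios) if ratios else 1.0)
    return ids, scales


def align(curr_depth, ids, scales):
    """Rescale every pixel by the factor of the layer it belongs to."""
    return [d * scales[j] for d, j in zip(curr_depth, ids)]


# Toy overlap frame: the near layer shrank by 2x between windows, the far layer did not.
prev_depth = [1.0, 1.2, 10.0, 12.0]
curr_depth = [0.5, 0.6, 10.0, 12.0]

ids, scales = layerwise_scales(prev_depth, curr_depth, num_layers=2)
aligned = align(curr_depth, ids, scales)

# A single global scale (one factor for all pixels) cannot fit both layers.
global_scale = median(p / c for p, c in zip(prev_depth, curr_depth))
```

In this toy case the per-layer factors recover the previous window's depths, while the single global factor lands between the two layers and misaligns both, which is the failure mode the abstract attributes to plain Sim(3) alignment.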
[156] I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners cs.CVPDF
Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera
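TL;DR: The paper proposes I-Scene, which reprograms a pre-trained 3D instance generator to act as a scene-level learner, supervising with the model's own spatial knowledge rather than a limited dataset, and thereby generalizes to unseen layouts and novel object compositions.
Details
Motivation: Generalization is the central challenge in interactive 3D scene generation: existing methods depend on limited scene datasets, which restricts generalization to new layouts.
Result: Qualitative and quantitative results show that a 3D instance generator can act as an implicit spatial learner and reasoner that generalizes to unseen layouts and object compositions, pointing toward foundation models for interactive 3D scene understanding and generation.
Insight: The novelty is using the transferable spatial prior of a pre-trained instance generator as the learning signal. With a view-centric formulation of the scene space, which replaces the widely used canonical space, the model infers proximity, support, and symmetry relations directly from geometric cues.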
Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator’s transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes consist of randomly composed objects. This demonstrates that the generator’s transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/
[157] Recurrent Video Masked Autoencoders cs.CVPDF
Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira
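TL;DR: The paper proposes Recurrent Video Masked Autoencoders (RVM), a video representation learning method that uses a transformer-based recurrent neural network to aggregate dense image features over time, capturing the spatio-temporal structure of natural video. RVM learns through an asymmetric masked prediction task that needs only a standard pixel reconstruction objective, yielding an efficient "generalist" encoder.
Details
Motivation: The goal is a general-purpose, parameter-efficient video representation learning method that captures spatio-temporal structure efficiently and overcomes the limitations of spatio-temporal attention architectures in long-horizon feature propagation and computational cost.
Result: RVM is competitive with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks such as action recognition and point/object tracking, and compares favorably with image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. In the small-model regime it performs strongly without knowledge distillation, with up to 30x greater parameter efficiency than competing video masked autoencoders, and its recurrent design allows stable feature propagation over long temporal horizons at linear computational cost.
Insight: The novelty is combining a recurrent neural network with masked autoencoding for video representation learning, giving efficient spatio-temporal feature aggregation and long-horizon modeling. The asymmetric masked prediction task keeps the training objective simple, so high parameter efficiency and general-purpose performance are reached at small model scale without extra supervision or knowledge distillation.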
Abstract: We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM’s recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
[158] DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders cs.CV | cs.AI | cs.GR | cs.LGPDF
Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang
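TL;DR: The paper proposes DiffusionBrowser, a model-agnostic, lightweight decoder framework that lets users interactively generate previews at any timestep or transformer block during denoising. It produces multi-modal previews containing RGB and scene intrinsics at more than 4x real-time speed, supports interactive control through stochasticity reinjection and modal steering, and sheds light on the internal composition process during denoising.
Details
Motivation: Video diffusion models are imprecise, slow, and opaque during generation, leaving users unaware of progress for long periods.
Result: The method generates multi-modal previews at more than 4x real-time speed (under 1 second for a 4-second video) that remain consistent in appearance and motion with the final video.
Insight: The novelty is a multi-branch decoder that provides real-time previews and interactive control at arbitrary intermediate steps, and the use of those decoders to systematically probe the otherwise black-box denoising process, revealing how scene, object, and other details are composed.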
Abstract: Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation – keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
cs.AI [Back]
[159] WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment cs.AI | cs.CL | cs.LGPDF
Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, Md Rizwan Parvez
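TL;DR: The paper proposes WebOperator, an autonomous agent framework for web environments that combines action-aware tree search, a safe backtracking mechanism, and multi-context action generation. It targets the inefficiency and error accumulation that existing LLM agents suffer in partially observable web environments because they lack foresight and safe backtracking.
Details
Motivation: Existing LLM-based agents typically make greedy, step-by-step decisions in web environments, without long-term planning or backtracking. Existing tree-search methods assume actions are reversible and lack safe backtracking, so they perform poorly on realistic web tasks that include irreversible actions.
Result: Effectiveness is demonstrated on the WebArena and WebVoyager benchmarks; on WebArena, WebOperator reaches a state-of-the-art 54.6% success rate with gpt-4o.
Insight: The novelties are a best-first search that ranks actions using both reward estimates and safety considerations; a robust backtracking mechanism that verifies path feasibility before replay; and action candidates generated from multiple reasoning contexts and then filtered, which keeps exploration diverse and high quality. Together they form a systematic framework for reliable planning in complex, partially observable environments where actions may be irreversible.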
Abstract: LLM-based agents often operate in a greedy, step-by-step manner, selecting actions solely based on the current observation without considering long-term consequences or alternative paths. This lack of foresight is particularly problematic in web environments, which are only partially observable (limited to browser-visible content such as DOM and UI elements) and where a single misstep often requires complex and brittle navigation to undo. Without an explicit backtracking mechanism, agents struggle to correct errors or systematically explore alternative paths. Tree-search methods provide a principled framework for such structured exploration, but existing approaches lack mechanisms for safe backtracking, making them prone to unintended side effects. They also assume that all actions are reversible, ignoring the presence of irreversible actions. These limitations reduce their effectiveness in realistic web tasks. To address these challenges, we introduce WebOperator, a tree-search framework that enables reliable backtracking and strategic exploration. Our method incorporates a best-first search strategy that ranks actions by both reward estimates and safety considerations, along with a robust backtracking mechanism that verifies the feasibility of previously visited paths before replaying them, preventing unintended side effects. To further guide exploration, WebOperator generates action candidates from multiple, varied reasoning contexts to ensure diverse and robust exploration, and subsequently curates a high-quality action set by filtering out invalid actions pre-execution and merging semantically equivalent ones. Experimental results on WebArena and WebVoyager demonstrate the effectiveness of WebOperator. On WebArena, WebOperator achieves a state-of-the-art 54.6% success rate with gpt-4o, underscoring the critical advantage of integrating strategic foresight with safe execution.
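To make the ranking idea in this entry concrete, the sketch below shows a generic best-first search over a toy state graph in which each candidate action carries a reward estimate and a reversibility flag, and irreversible actions are penalized before being queued. The scoring rule, the `risk_penalty` parameter, and the toy graph are assumptions made for illustration; WebOperator's actual scoring, backtracking verification, and action curation are not reproduced here.

```python
import heapq
from itertools import count


def best_first_search(start, expand, is_goal, risk_penalty=0.5, max_steps=100):
    """Explore states in order of (reward estimate minus a penalty for irreversible actions)."""
    tie = count()  # tie-breaker so the heap never has to compare states or paths
    frontier = [(0.0, next(tie), start, [])]
    visited = set()
    steps = 0
    while frontier and steps < max_steps:
        _, _, state, path = heapq.heappop(frontier)
        if state in visited:
            continue
        visited.add(state)
        steps += 1
        if is_goal(state):
            return path
        for action, next_state, reward, reversible in expand(state):
            score = reward - (0.0 if reversible else risk_penalty)
            # heapq is a min-heap, so push the negated score.
            heapq.heappush(frontier, (-score, next(tie), next_state, path + [action]))
    return None


# Toy web task (hypothetical): tuples are (action, next_state, reward_estimate, reversible).
graph = {
    "home": [
        ("click_search", "results", 0.6, True),
        ("submit_order", "checkout", 0.8, False),  # tempting but irreversible
    ],
    "results": [("open_item", "item", 0.9, True)],
}

plan = best_first_search(
    "home",
    expand=lambda s: graph.get(s, []),
    is_goal=lambda s: s == "item",
)
```

Without the penalty, the irreversible `submit_order` action would be expanded first because its raw reward estimate is higher; with it, the reversible path is explored first.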
[160] M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization cs.AI | cs.CLPDF
Bizhe Bai, Hongming Wu, Peng Ye, Tao Chen
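TL;DR: The paper proposes M-GRPO, a framework that introduces a momentum-anchored model and an adaptive filter based on the interquartile range (IQR) to address the policy collapse and instability that arise during long-horizon self-supervised reinforcement learning of large language models, thereby improving reasoning ability.
Details
Motivation: Existing self-supervised reinforcement learning methods suffer "policy collapse" under long-horizon LLM training, with performance degrading sharply. Increasing the number of rollouts only delays the collapse rather than preventing it, so a more stable training mechanism is needed.
Result: Extensive experiments on multiple reasoning benchmarks show that M-GRPO combined with the IQR filter stabilizes training, prevents premature convergence, and achieves state-of-the-art performance.
Insight: The novelties are a slowly evolving momentum model that provides a stable training target, and IQR-based dynamic filtering of low-entropy trajectories that preserves policy diversity. Combined, they improve both training stability and final performance.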
Abstract: Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods suffer from a critical failure mode under long-horizon training: a “policy collapse” where performance precipitously degrades. We diagnose this instability and demonstrate that simply scaling the number of rollouts – a common strategy to improve performance – only delays, but does not prevent, this collapse. To counteract this instability, we first introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), a framework that leverages a slowly evolving momentum model to provide a stable training target. In addition, we identify that this process is often accompanied by a rapid collapse in policy entropy, resulting in a prematurely confident and suboptimal policy. To specifically address this issue, we propose a second contribution: an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories, preserving essential policy diversity. Our extensive experiments on multiple reasoning benchmarks demonstrate that M-GRPO stabilizes the training process while the IQR filter prevents premature convergence. The combination of these two innovations leads to superior training stability and state-of-the-art performance.
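The two mechanisms named in this entry can be sketched in a few lines. The code below assumes that the momentum model is an exponential moving average (EMA) of the policy weights and that low-entropy trajectories are those falling below the lower Tukey fence, `Q1 - k * IQR`. Both choices, the parameter names `momentum` and `k`, and the toy numbers are assumptions for illustration; the abstract does not give the exact update or threshold.

```python
from statistics import quantiles


def ema_update(anchor, policy, momentum=0.99):
    """Move the slowly evolving anchor weights a small step toward the current policy weights."""
    return [momentum * a + (1.0 - momentum) * p for a, p in zip(anchor, policy)]


def iqr_filter_low_entropy(entropies, k=1.5):
    """Return indices of trajectories whose entropy is not an IQR-based low outlier."""
    q1, _, q3 = quantiles(entropies, n=4, method="inclusive")
    lower_fence = q1 - k * (q3 - q1)
    return [i for i, h in enumerate(entropies) if h >= lower_fence]


# Toy rollout entropies (hypothetical): the last trajectory is far more confident than the rest.
entropies = [1.0, 1.1, 0.9, 1.2, 1.0, 0.1]
kept = iqr_filter_low_entropy(entropies)

# One EMA step with toy scalar "weights".
anchor = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

Because the fence is recomputed from each batch of rollouts, the cutoff adapts as the entropy distribution shifts during training, which matches the abstract's description of the filter as adaptive.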
[161] SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning cs.AI | cs.CLPDF
Emre Can Acikgoz, Jinoh Oh, Jie Hao, Joo Hyuk Jeon, Heng Ji
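TL;DR: The paper proposes SpeakRL, a reinforcement learning method that strengthens a language model's conversational ability by rewarding proactive interaction (such as asking clarification questions at the right time), and builds the synthetic SpeakER dataset to support training. Experiments show a 20.14% absolute improvement in task completion over base models without additional conversation turns.
Details
Motivation: Human-agent collaboration today is mostly one-directional instruction and response: agents lack the ability to proactively clarify user intent and resolve ambiguity. Prior work under-utilizes the conversational capabilities of language models, leaving agents as passive followers rather than effective speakers.
Result: In evaluations on task-oriented dialogue scenarios, SpeakRL achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns, and even surpasses much larger proprietary models.
Insight: The novelty is a reinforcement learning reward design that systematically teaches agents to balance asking with acting, promoting clarification-centric interaction. Treating conversational proactivity as an optimizable objective offers a new direction for building more collaborative agents.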
Abstract: Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, and agents responding directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Prior work under-utilizes the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents’ conversational capabilities by rewarding proactive interactions with users, such as asking the right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns, even surpassing much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
[162] Differentiable Evolutionary Reinforcement Learning cs.AI | cs.CLPDF
Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, Difan Zou
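TL;DR: The paper proposes Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework for autonomously discovering optimal reward signals. DERL evolves a reward function (the Meta-Reward) by composing structured atomic primitives, and uses the validation performance of the inner-loop policy as a signal to update the Meta-Optimizer via reinforcement learning, approximating the "meta-gradient" of task success. It is validated in three domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH).
Details
Motivation: Designing effective reward functions is a central and arduous challenge in reinforcement learning, especially for complex reasoning tasks. Existing automated reward optimization methods typically treat the reward function as a black box and use derivative-free evolutionary heuristics, which fail to capture the causal relationship between reward structure and task performance.
Result: DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods that rely on heuristic rewards, especially in out-of-distribution scenarios. It is also validated on GSM8k and MATH.
Insight: The novelty is differentiable meta-optimization: the inner-loop validation performance is used as a signal to update the Meta-Optimizer, which approximates the meta-gradient of task success and progressively learns to generate denser, more actionable feedback. This lets the framework capture the intrinsic structure of tasks and enables self-improving agent alignment without human intervention.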
Abstract: The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the “meta-gradient” of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
[163] Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning cs.AI | cs.CV | cs.GR | cs.MM | cs.SDPDF
Xin Guo, Yifan Zhao, Jia Li
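TL;DR: This paper proposes Hierarchical Implicit Periodicity (HIP) learning, a unified framework for generating 3D co-speech gestures from speech. The method explores the phase manifolds of gesture motion with periodic autoencoders, adds non-periodic components to increase diversity, and models the hierarchical relationship among face, body, and hand motions with cascaded guidance, producing more natural and better-coordinated 3D gestures.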
Details
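Motivation: Existing end-to-end co-speech gesture generation methods (such as GANs, VQ-VAE, and diffusion models) struggle to model the intrinsic correlations among different motion units (head, body, hands), which leads to unnatural motion and poor coordination. This paper aims to address this multi-modal implicit relationship modeling problem through explicit technical insights.
Result: Extensive experiments on 3D avatars show that the method outperforms current state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations.
Insight: The contributions are: 1) periodic autoencoders disentangle complex gesture motion, imitating natural human periodicity from realistic distributions while incorporating non-periodic latent states for instance-level diversity; 2) cascaded guidance models the hierarchical relationship among face, body, and hand motions to drive the animation. Together these offer a reusable hierarchical implicit periodicity learning framework for multi-modal motion generation.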
Abstract: Generating 3D-based body movements from speech shows great potential in extensive downstream applications, while it still suffers challenges in imitating realistic human movements. Predominant research efforts focus on end-to-end generation schemes to generate co-speech gestures, spanning GANs, VQ-VAE, and recent diffusion models. As an ill-posed problem, in this paper, we argue that these prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units, i.e. head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship by two explicit technique insights: i) To disentangle the complicated gesture movements, we first explore the gesture motion phase manifolds with periodic autoencoders to imitate human natures from realistic distributions while incorporating non-period ones from current latent states for instance-level diversities. ii) To model the hierarchical relationship of face motions, body gestures, and hand movements, driving the animation with cascaded guidance during learning. We exhibit our proposed approach on 3D avatars and extensive experiments show our method outperforms the state-of-the-art co-speech gesture generation methods by both quantitative and qualitative evaluations. Code and models will be publicly available.
cs.CR [Back]
[164] Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring cs.CR | cs.AI | cs.CL | cs.LGPDF
Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, Ning Zhang
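TL;DR: This paper proposes Representational Contrastive Scoring (RCS), a lightweight framework for detecting multimodal jailbreak attacks on Large Vision-Language Models (LVLMs). The framework analyzes the geometry of the model's internal representations in safety-critical layers and learns a projection that maximizes the separation between benign and malicious inputs, yielding a simple yet powerful contrastive score that distinguishes malicious intent from mere novelty. Its instantiations, MCD and KCD, achieve state-of-the-art performance on an evaluation protocol designed to test generalization to unseen attack types.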
Details
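Motivation: Current defenses against LVLM jailbreak attacks either generalize poorly (because they target specific attack patterns) or impose high computational overhead. Lightweight anomaly-detection methods are promising, but their common one-class design tends to misclassify novel benign inputs as malicious, leading to unreliable over-rejection.
Result: On a challenging evaluation protocol designed to test generalization to unseen attack types, the proposed MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection) methods achieve state-of-the-art (SOTA) performance.
Insight: The core insight is that the most potent safety signals reside in the LVLM's own internal representations. Simple, interpretable statistical analysis of that internal geometry is enough for effective jailbreak detection, offering a practical path toward safer LVLM deployment.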
Abstract: Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM’s own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on Github https://github.com/sarendis56/Jailbreak_Detection_RCS.
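As a rough illustration of the contrastive-scoring idea in the abstract above, the sketch below scores an input by how much closer its representation lies to a malicious reference set than to a benign one, using a Mahalanobis-style distance. It is not the authors' MCD implementation: the diagonal-covariance simplification, the function names, and the toy 2-D vectors are all assumptions made for illustration, and the learned projection and layer selection described in the paper are omitted.

```python
from statistics import fmean

def fit_class(vectors):
    """Per-dimension mean and variance of a reference set.

    Assumption: a diagonal covariance, chosen only to keep the sketch short.
    """
    dims = list(zip(*vectors))
    means = [fmean(d) for d in dims]
    # A small floor keeps every variance strictly positive.
    variances = [max(fmean((x - m) ** 2 for x in d), 1e-6)
                 for d, m in zip(dims, means)]
    return means, variances

def mahalanobis_sq(x, stats):
    """Squared Mahalanobis distance under the diagonal-covariance assumption."""
    means, variances = stats
    return sum((xi - m) ** 2 / v for xi, m, v in zip(x, means, variances))

def contrastive_score(x, benign_stats, malicious_stats):
    """Positive when x is closer to the malicious set than to the benign set."""
    return mahalanobis_sq(x, benign_stats) - mahalanobis_sq(x, malicious_stats)

# Hypothetical 2-D "representations"; real hidden states are far larger.
benign = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
malicious = [[3.0, 3.1], [2.8, 3.0], [3.2, 2.9], [3.1, 3.2]]
benign_stats = fit_class(benign)
malicious_stats = fit_class(malicious)

score_near_malicious = contrastive_score([3.0, 3.0], benign_stats, malicious_stats)
score_near_benign = contrastive_score([0.0, 0.0], benign_stats, malicious_stats)

# A novel benign-like input: far from the benign set, but even farther
# from the malicious set, so the contrastive score stays negative.
novel = [-3.0, -3.0]
novel_one_class_distance = mahalanobis_sq(novel, benign_stats)
novel_score = contrastive_score(novel, benign_stats, malicious_stats)
```

The last input shows why a contrastive score can help with over-rejection: a one-class detector that thresholds only the distance to the benign set would flag it, whereas the contrastive score does not, because the input is no closer to the malicious reference set.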
cs.MM [Back]
[165] AutoMV: An Automatic Multi-Agent System for Music Video Generation cs.MM | cs.CV | cs.SD | eess.ASPDF
Xiaoxuan Tang, Xinping Lei, Chaoran Zhu, Shiyun Chen, Ruibin Yuan
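TL;DR: This paper proposes AutoMV, an automatic multi-agent system that generates complete music videos (MVs) directly from a song. The system uses music processing tools to extract features such as musical structure, vocal tracks, and time-aligned lyrics, and coordinates several agents (a screenwriter, a director, and a verifier) to produce long videos that are temporally consistent and aligned with the music.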
Details
Motivation: Existing Music-to-Video (M2V) generation methods produce only short, disjointed clips that are not aligned with musical structure, beats, or lyrics and that lack temporal consistency.
Result: On the authors' proposed benchmark, which comprises four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria, AutoMV significantly outperforms existing baselines in all four categories and narrows the gap to professionally produced MVs.
Insight: The novelty is an automated pipeline built on multi-agent collaboration that decouples music analysis from video generation and uses a dedicated Verifier Agent for quality assessment. The paper also proposes a comprehensive evaluation benchmark for M2V generation and explores using large models as automatic evaluators.
Abstract: Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and constructs these features as contextual inputs for following agents. The screenwriter Agent and director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for “story” or “singer” scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV outperforms current baselines significantly across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.
[166] JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation cs.MM | cs.CVPDF
Jianghan Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song
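TL;DR: This paper proposes JointAVBench, a new benchmark for rigorously evaluating the joint audio-visual reasoning ability of Omni-Large Language Models (Omni-LLMs). An automated pipeline generates questions that strictly depend on audio-video correlation, covering five cognitive dimensions, four audio information types, and three scene spans, to address the shortcomings of existing datasets.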
Details
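Motivation: Existing datasets fall short when evaluating Omni-LLMs, which process multi-modal information such as vision and audio: they lack comprehensive coverage of multi-modal dependency, diverse audio information types, and varying scene spans, which limits strict and comprehensive evaluation.
Result: Leading uni-modal and omni-modal LLMs were evaluated on JointAVBench. Even the best-performing Omni-LLM reaches an average accuracy of only 62.6%; it outperforms the uni-modal baselines but leaves substantial room for improvement, especially in cross-scene reasoning.
Insight: The novelty is a comprehensive evaluation benchmark with strict audio-video correlation that spans multiple dimensions, audio types, and scene spans, together with a pipeline that uses state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers automatically. This achieves high-quality annotation at low cost and provides a new standard for multi-modal model evaluation.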
Abstract: Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
cs.CY [Back]
[167] A Reproducible Workflow for Scraping, Structuring, and Segmenting Legacy Archaeological Artifact Images cs.CY | cs.CVPDF
Juan Palomeque-Gonzalez
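TL;DR: This paper presents a reproducible workflow for converting legacy archaeological artifact image collections into structured, segmentation-ready datasets. It involves two open-source tools: a web scraping script that bulk-downloads images and metadata from the Archaeology Data Service (ADS), and an image processing pipeline that generates binary masks, bounding boxes, and COCO-format annotation files.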
Details
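Motivation: Archaeological image collections such as the one hosted by ADS offer no mechanism for bulk download or automated processing. The work addresses this so that standardized photographs can be converted into machine-learning-friendly formats, facilitating downstream analysis and improving the reproducibility of digital archaeology research.
Result: In the case study (a Palaeolithic hand axe and biface collection), the workflow produced a structured dataset containing thousands of images, masks, bounding boxes, and rich archaeological metadata. No specific quantitative metrics or comparisons with existing methods are mentioned.
Insight: The novelty is the combination of web scraping with classical computer vision in a lightweight, reusable workflow that emphasizes ethical scraping and shares only derived data (such as masks and annotations) rather than the original images, which helps protect data copyright and promotes reproducible research practice in archaeology.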
Abstract: This technical note presents a reproducible workflow for converting a legacy archaeological image collection into a structured and segmentation-ready dataset. The case study focuses on the Lower Palaeolithic hand axe and biface collection curated by the Archaeology Data Service (ADS), a dataset that provides thousands of standardised photographs but no mechanism for bulk download or automated processing. To address this, two open-source tools were developed: a web scraping script that retrieves all record pages, extracts associated metadata, and downloads the available images while respecting ADS Terms of Use and ethical scraping guidelines; and an image processing pipeline that renames files using UUIDs, generates binary masks and bounding boxes through classical computer vision, and stores all derived information in a COCO-compatible JSON file enriched with archaeological metadata. The original images are not redistributed, and only derived products such as masks, outlines, and annotations are shared. Together, these components provide a lightweight and reusable approach for transforming web-based archaeological image collections into machine-learning-friendly formats, facilitating downstream analysis and contributing to more reproducible research practices in digital archaeology.
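The mask, bounding-box, and COCO steps described in the abstract above can be sketched in a few lines. This is a minimal stand-in under stated assumptions, not the author's pipeline: the fixed threshold, the assumption of a dark artifact on a light background, the tiny hand-written image, the category name, and all function names are hypothetical, and the COCO `segmentation` field and the archaeological metadata are omitted.

```python
import json
import uuid

def threshold_mask(gray, threshold):
    """Binary mask: 1 where a pixel is darker than the threshold.

    Assumption: a dark artifact photographed on a light background.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def mask_to_bbox(mask):
    """COCO-style [x, y, width, height] around all foreground pixels, or None."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    x0, y0 = min(xs), min(ys)
    return [x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1]

def coco_annotation(mask, image_id, annotation_id, category_id=1):
    """One annotation entry; area is the number of foreground pixels."""
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": mask_to_bbox(mask),
        "area": sum(map(sum, mask)),
        "iscrowd": 0,
    }

# Hypothetical 5x4 grayscale image (0 = black, 255 = white).
gray = [
    [250, 250, 250, 250, 250],
    [250,  40,  35, 250, 250],
    [250,  30,  20,  45, 250],
    [250, 250, 250, 250, 250],
]
mask = threshold_mask(gray, threshold=128)
annotation = coco_annotation(mask, image_id=1, annotation_id=1)

# Files are renamed with UUIDs, as the abstract describes; the extension is assumed.
image_entry = {"id": 1, "file_name": f"{uuid.uuid4()}.png", "width": 5, "height": 4}
coco = {
    "images": [image_entry],
    "annotations": [annotation],
    "categories": [{"id": 1, "name": "artifact"}],
}
coco_json = json.dumps(coco, indent=2)
```

A real photograph would need a more robust foreground separation than one global threshold (for example Otsu thresholding plus morphological cleanup), but the bounding-box and annotation bookkeeping stays the same.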
[168] Assessing Greenspace Attractiveness with ChatGPT, Claude, and Gemini: Do AI Models Reflect Human Perceptions? cs.CY | cs.AI | cs.CVPDF
Milad Malekzadeh, Magdalena Biernacka, Elias Willberg, Jussi Torkko, Edyta Łaszkiewicz
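TL;DR: This study evaluates how well the multimodal large language models ChatGPT GPT-4o, Claude 3.5 Haiku, and Gemini 2.0 Flash agree with human perception when assessing greenspace attractiveness from Google Street View imagery. The models agree closely with humans on attractive formal greenspaces and unattractive informal greenspaces, but agreement is low for attractive informal greenspaces and unattractive formal greenspaces. The models tend to emphasize aesthetic and design features while underrepresenting the safety, functional infrastructure, and locally embedded qualities that survey respondents value.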
Details
Motivation: Existing approaches to assessing greenspace attractiveness often overlook informal or transient spaces and are too resource-intensive to capture subjective perceptions at scale. This study explores whether multimodal large language models can assess greenspace attractiveness in a human-like way, enabling scalable pre-assessment.
Result: In a comparison against a geo-questionnaire of residents in Lodz, Poland, models and humans showed high agreement on attractive formal greenspaces (such as parks) and unattractive informal spaces (such as wastelands), but low agreement on attractive informal greenspaces (such as meadows) and unattractive formal greenspaces. Model outputs and human explanations also differed in their reasoning categories, with models focusing more on aesthetics and design.
Insight: The novelty is the first systematic comparison of several mainstream MLLMs with humans on a greenspace perception task, revealing how the feature types the models favor differ from what humans attend to. This provides an empirical basis for using AI in large-scale pre-assessment of urban environments, while underscoring the need for human oversight and complementary participatory methods in planning practice: models can serve as a supporting tool, not a replacement.
Abstract: Understanding greenspace attractiveness is essential for designing livable and inclusive urban environments, yet existing assessment approaches often overlook informal or transient spaces and remain too resource-intensive to capture subjective perceptions at scale. This study examines the ability of multimodal large language models (MLLMs), ChatGPT GPT-4o, Claude 3.5 Haiku, and Gemini 2.0 Flash, to assess greenspace attractiveness similarly to humans using Google Street View imagery. We compared model outputs with responses from a geo-questionnaire of residents in Lodz, Poland, across both formal (for example, parks and managed greenspaces) and informal (for example, meadows and wastelands) greenspaces. Survey respondents and models indicated whether each greenspace was attractive or unattractive and provided up to three free-text explanations. Analyses examined how often their attractiveness judgments aligned and compared their explanations after classifying them into shared reasoning categories. Results show high AI-human agreement for attractive formal greenspaces and unattractive informal spaces, but low alignment for attractive informal and unattractive formal greenspaces. Models consistently emphasized aesthetic and design-oriented features, underrepresenting safety, functional infrastructure, and locally embedded qualities valued by survey respondents. While these findings highlight the potential for scalable pre-assessment, they also underscore the need for human oversight and complementary participatory approaches. We conclude that MLLMs can support, but not replace, context-sensitive greenspace evaluation in planning practice.
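The agreement analysis described in the abstract above (how often model and human judgments align, broken down by greenspace type and human judgment) amounts to a grouped match rate. The sketch below shows one way to compute it; the records are made-up toy values rather than data from the study, and the function name and record layout are assumptions.

```python
from collections import defaultdict

def agreement_by_group(records):
    """Share of cases where the model label equals the human label,
    grouped by (greenspace type, human judgment)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for kind, human, model in records:
        key = (kind, human)
        totals[key] += 1
        hits[key] += int(human == model)
    return {key: hits[key] / totals[key] for key in totals}

# Made-up toy records: (greenspace type, human judgment, model judgment).
records = [
    ("formal", "attractive", "attractive"),
    ("formal", "attractive", "attractive"),
    ("formal", "unattractive", "attractive"),
    ("formal", "unattractive", "unattractive"),
    ("informal", "attractive", "unattractive"),
    ("informal", "attractive", "unattractive"),
    ("informal", "unattractive", "unattractive"),
]
rates = agreement_by_group(records)
```

Grouping by the human judgment rather than reporting one overall rate is what exposes the asymmetry the abstract reports: a single accuracy figure would average the well-aligned and poorly aligned groups together.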
[169] Aesthetic Alignment Risks Assimilation: How Image Generation and Reward Models Reinforce Beauty Bias and Ideological “Censorship” cs.CY | cs.AI | cs.CVPDF
Wenqi Marshall Guo, Qingyun Qian, Khalad Hasan, Shan Du
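TL;DR: This paper examines the problems caused by over-aligning image generation models to a generalized aesthetic preference: when users request “anti-aesthetic” images for artistic or critical purposes, the models lean toward conventional aesthetic standards and fail to satisfy the user's intent, undermining user autonomy and aesthetic pluralism.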
Details
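Motivation: The work aims to expose a systemic bias in current image generation and reward models, namely over-alignment to a generalized aesthetic preference. This bias prioritizes developer-centered values and conflicts with user intent, particularly anti-aesthetic requests.
Result: Evaluating SOTA generation and reward models on a purpose-built wide-spectrum aesthetics dataset shows that aesthetic-aligned generation models frequently default to conventionally beautiful outputs and fail to follow instructions for low-quality or negative imagery. Reward models penalize anti-aesthetic images even when they match the user prompt perfectly. Evaluations on image editing and on real abstract artworks confirm this systemic bias.
Insight: The novelty is a systematic quantification of the risks of aesthetic alignment, revealing a potential ideological “censorship” effect in the models. Takeaways include giving weight to the diversity of user intent when aligning model values, and developing more inclusive evaluation frameworks so that no single aesthetic standard dominates.
Abstract: Over-aligning image generation models to a generalized aesthetic preference conflicts with user intent, particularly when “anti-aesthetic” outputs are requested for artistic or critical purposes. This adherence prioritizes developer-centered values, compromising user autonomy and aesthetic pluralism. We test this bias by constructing a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models. We find that aesthetic-aligned generation models frequently default to conventionally beautiful outputs, failing to respect instructions for low-quality or negative imagery. Crucially, reward models penalize anti-aesthetic images even when they perfectly match the explicit user prompt. We confirm this systemic bias through image-to-image editing and evaluation against real abstract artworks.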
cs.CE [Back]
[170] ERA-IT: Aligning Semantic Models with Revealed Economic Preference for Real-Time and Explainable Patent Valuation cs.CE | cs.CLPDF
Yoo Yongmin, Kim Seungwoo, Liu Jingjiang
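TL;DR: This study proposes the ERA-IT framework, which uses instruction tuning to align the generative reasoning of large language models with market realities, taking patent renewal history as a revealed economic preference signal to enable real-time, explainable patent valuation.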
Details
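Motivation: Under the information asymmetry inherent in high-dimensional technical specifications, traditional bibliometric indicators such as citation counts cannot value intangible assets in a timely manner because of the latency of data accumulation.
Result: On a randomly sampled dataset of 10,000 European Patent Office patents, ERA-IT significantly outperforms conventional econometric models and zero-shot large language models in predictive accuracy.
Insight: The work conceptualizes patent renewal history as a revealed economic preference and uses it as a supervisory signal to align the semantic model of the LLM with economic reality (Eco-Semantic Alignment). It also generates logically grounded valuation rationales, giving decision-makers a transparent cognitive scaffold and reducing the opacity of black-box AI in intellectual property management.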
Abstract: Valuing intangible assets under uncertainty remains a critical challenge in the strategic management of technological innovation due to the information asymmetry inherent in high-dimensional technical specifications. Traditional bibliometric indicators, such as citation counts, fail to address this friction in a timely manner due to the systemic latency inherent in data accumulation. To bridge this gap, this study proposes the Economic Reasoning Alignment via Instruction Tuning (ERA-IT) framework. We theoretically conceptualize patent renewal history as a revealed economic preference and leverage it as an objective supervisory signal to align the generative reasoning of Large Language Models (LLMs) with market realities, a process we term Eco-Semantic Alignment. Using a randomly sampled dataset of 10,000 European Patent Office patents across diverse technological domains, we trained the model not only to predict value tiers but also to reverse-engineer the Economic Chain-of-Thought from unstructured text. Empirical results demonstrate that ERA-IT significantly outperforms both conventional econometric models and zero-shot LLMs in predictive accuracy. More importantly, by generating explicit, logically grounded rationales for valuation, the framework serves as a transparent cognitive scaffold for decision-makers, reducing the opacity of black-box AI in high-stakes intellectual property management.
cs.RO [Back]
[171] Benchmarking Tesla’s Traffic Light and Stop Sign Control: Field Dataset and Behavior Insights cs.RO | cs.CV | cs.HCPDF
Zheng Li, Peng Zhang, Shixiao Liang, Hang Zhou, Chengyuan Ma
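TL;DR: This paper presents a field dataset and behavioral analysis of Tesla's Traffic Light and Stop Sign Control (TLSSC) system. Experiments across different speed limits and traffic control device types yielded synchronized high-resolution vehicle trajectory data and driver-perspective video. The study develops a taxonomy of TLSSC interaction behaviors with traffic control devices and calibrates the Full Velocity Difference Model to characterize each behavior mode quantitatively, identifying a key car-following threshold (about 90 m).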
Details
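Motivation: How Advanced Driver-Assistance Systems (ADAS) interact with Traffic Control Devices (TCDs) is critical for assessing their influence on traffic operations, yet this interaction lacks in-depth empirical study. The paper aims to fill this gap with empirical data by analyzing the behavior of Tesla's TLSSC, a mature ADAS.
Result: Stopping behavior is driven by strong responsiveness to both desired speed deviation and relative speed, whereas accelerating behavior is more conservative. Intersection car-following shows smoother dynamics and tighter headways than standard car-following. The calibrated Full Velocity Difference Model characterizes these behavior modes quantitatively.
Insight: The main contributions are: 1) a public field dataset of TLSSC interactions with synchronized high-resolution trajectories and video; 2) a taxonomy of TLSSC-TCD interaction behaviors; 3) model calibration that quantifies the dynamics of each behavior mode, including the new empirical finding of a car-following threshold of about 90 m. These provide a foundation for future simulation, safety evaluation, and the design of ADAS-TCD interaction logic.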
Abstract: Understanding how Advanced Driver-Assistance Systems (ADAS) interact with Traffic Control Devices (TCDs) is critical for assessing their influence on traffic operations, yet this interaction has received little focused empirical study. This paper presents a field dataset and behavioral analysis of Tesla’s Traffic Light and Stop Sign Control (TLSSC), a mature ADAS that perceives traffic lights and stop signs. We design and execute experiments across varied speed limits and TCD types, collecting synchronized high-resolution vehicle trajectory data and driver-perspective video. From these data, we develop a taxonomy of TLSSC-TCD interaction behaviors (i.e., stopping, accelerating, and car following) and calibrate the Full Velocity Difference Model (FVDM) to quantitatively characterize each behavior mode. A novel empirical insight is the identification of a car-following threshold (~90 m). Calibration results reveal that stopping behavior is driven by strong responsiveness to both desired speed deviation and relative speed, whereas accelerating behavior is more conservative. Intersection car-following behavior exhibits smoother dynamics and tighter headways compared to standard car-following behaviors. The established dataset, behavior definitions, and model characterizations together provide a foundation for future simulation, safety evaluation, and design of ADAS-TCD interaction logic. Our dataset is available at GitHub.
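For readers unfamiliar with the Full Velocity Difference Model mentioned in the abstract above, the sketch below shows its standard form: acceleration is a sensitivity term times the gap between an optimal velocity and the current speed, plus a second sensitivity term times the speed difference to the leader. Everything numeric here is a placeholder, not a calibrated value from the paper, and the tanh-shaped optimal-velocity curve, the parameter names, and the forward-Euler simulation are assumptions chosen for illustration.

```python
import math

def optimal_velocity(gap, v_max=15.0, s0=20.0, scale=10.0):
    """Hypothetical optimal-velocity curve: 0 m/s at zero gap, near v_max at large gaps."""
    offset = math.tanh(s0 / scale)
    return v_max * (math.tanh((gap - s0) / scale) + offset) / (1.0 + offset)

def fvdm_acceleration(gap, v, v_lead, kappa=0.4, lam=0.5):
    """FVDM: relax toward the optimal velocity, plus a response to relative speed.

    kappa and lam are placeholder sensitivities, not values from the paper.
    """
    return kappa * (optimal_velocity(gap) - v) + lam * (v_lead - v)

def simulate_follower(gap, v, v_lead, steps, dt=0.1):
    """Forward-Euler integration of a follower behind a leader at constant speed."""
    for _ in range(steps):
        a = fvdm_acceleration(gap, v, v_lead)
        gap += (v_lead - v) * dt      # the gap shrinks while the follower is faster
        v = max(0.0, v + a * dt)      # the follower never reverses
    return gap, v

# Follower approaching a stopped leader, e.g. a vehicle queued at a red light.
initial_acceleration = fvdm_acceleration(gap=60.0, v=12.0, v_lead=0.0)
final_gap, final_v = simulate_follower(gap=60.0, v=12.0, v_lead=0.0, steps=600)
```

Calibration, as described in the abstract, would mean fitting parameters such as `kappa` and `lam` separately for each behavior mode so that simulated trajectories match the recorded ones.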
[172] ReGlove: A Soft Pneumatic Glove for Activities of Daily Living Assistance via Wrist-Mounted Vision cs.RO | cs.CVPDF
Rosh Ho, Jian Zhang
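TL;DR: This paper presents ReGlove, a system that converts low-cost commercial pneumatic rehabilitation gloves into vision-guided assistive orthoses. A wrist-mounted camera and an edge-computing inference engine (Raspberry Pi 5) enable context-aware grasping without relying on unreliable muscle signals.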
Details
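Motivation: Millions of people worldwide live with chronic upper-limb impairment, yet existing assistive technologies are expensive or depend on unreliable biological signals such as EMG. The work aims to offer a more accessible, vision-based alternative.
Result: The system achieves 96.73% grasp classification accuracy with end-to-end latency under 40 ms. It reaches an 82.71% success rate on the YCB object manipulation benchmark and performs reliably across 27 Activities of Daily Living tasks.
Insight: The main novelty is integrating a wrist-mounted vision system with a low-cost commercial pneumatic glove to create a vision-guided assistive platform that does not depend on EMG signals, costs under $250, and uses only commercial components, providing a technical foundation for populations that traditional devices cannot serve.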
Abstract: This paper presents ReGlove, a system that converts low-cost commercial pneumatic rehabilitation gloves into vision-guided assistive orthoses. Chronic upper-limb impairment affects millions worldwide, yet existing assistive technologies remain prohibitively expensive or rely on unreliable biological signals. Our platform integrates a wrist-mounted camera with an edge-computing inference engine (Raspberry Pi 5) to enable context-aware grasping without requiring reliable muscle signals. By adapting real-time YOLO-based computer vision models, the system achieves 96.73% grasp classification accuracy with sub-40 ms end-to-end latency. Physical validation using standardized benchmarks shows 82.71% success on YCB object manipulation and reliable performance across 27 Activities of Daily Living (ADL) tasks. With a total cost under $250 and exclusively commercial components, ReGlove provides a technical foundation for accessible, vision-based upper-limb assistance that could benefit populations excluded from traditional EMG-controlled devices.
[173] World Models Can Leverage Human Videos for Dexterous Manipulation cs.RO | cs.AI | cs.CVPDF
Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou
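TL;DR: This paper proposes DexWM, a world model for dexterous manipulation that learns how hand motion affects objects by predicting the next latent state of the environment. To address data scarcity, the model is trained on over 900 hours of human and non-dexterous robot videos, and an auxiliary hand consistency loss improves fine-grained manipulation. DexWM outperforms existing world models at state prediction and, when generalizing zero-shot to unseen manipulation skills, outperforms Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.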
Details
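Motivation: Dexterous manipulation is hard because subtle hand motions influence the environment through contact, and dexterous manipulation datasets are scarce.
Result: DexWM predicts future states more accurately than prior world models conditioned on text, navigation, and full-body actions. On a Franka Panda arm equipped with an Allegro gripper, zero-shot generalization to unseen manipulation skills outperforms Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
Insight: The contributions include training on human and non-dexterous robot videos to mitigate data scarcity and introducing a hand consistency loss to sharpen fine-grained manipulation prediction. Viewed objectively, multi-source video pre-training combined with the auxiliary loss effectively improves the generalization and accuracy of world models for dexterous manipulation.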
Abstract: Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
[174] RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics cs.RO | cs.CVPDF
Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang
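TL;DR: This paper proposes RoboTracer, a 3D-aware vision-language model that targets spatial tracing, a core embodied interaction task for robots. The model achieves 3D spatial referring and measuring through a universal spatial encoder and a regression-supervised decoder, and strengthens multi-step metric-grounded reasoning through reinforcement fine-tuning with metric-sensitive process rewards. To support training and evaluation, the authors build TraceSpatial, a large-scale dataset of 30 million QA pairs, and the accompanying benchmark TraceSpatial-Bench. Experiments show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, leads existing SOTA models on the new benchmark by a large margin, and can be integrated with various robot control policies to execute complex long-horizon, dynamic tasks.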
Details
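Motivation: Spatial tracing is a fundamental embodied interaction ability for robots. It requires combining multi-step metric-grounded reasoning, complex spatial referring, and real-world metric measurement, and existing methods struggle with this compositional task.
Result: RoboTracer reaches an average success rate of 79.1% in spatial understanding, measuring, and referring. On the proposed TraceSpatial-Bench it achieves SOTA performance by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy.
Insight: The core contributions are: 1) a unified architecture (spatial encoder plus regression-supervised decoder) that handles both 3D spatial referring and measuring during supervised fine-tuning, enhancing scale awareness; 2) a new reinforcement fine-tuning paradigm with metric-sensitive process rewards that supervises key intermediate perceptual cues, improving the accuracy of multi-step metric-grounded reasoning; 3) a large-scale, multi-scene dataset supporting complex reasoning steps and a dedicated evaluation benchmark, filling a gap in the field.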
Abstract: Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
q-bio.QM [Back]
[175] Vision Foundry: A System for Training Foundational Vision AI Models q-bio.QM | cs.AI | cs.CV | cs.LGPDF
Mahmut S. Gokmen, Mitchell A. Klusty, Evan W. Damron, W. Vaiden Logan, Aaron D. Mullen
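TL;DR: Vision Foundry is a code-free, HIPAA-compliant platform that lowers the technical barrier for clinical researchers who want to train foundational vision models with self-supervised learning (SSL). The system integrates the DINO-MX framework and streamlines model pre-training, adaptation, and deployment with strategies such as Magnification-Aware Distillation (MAD) and Parameter-Efficient Fine-Tuning (PEFT). In validation across several medical domains, including neuropathology segmentation, lung cellularity estimation, and coronary calcium scoring, models trained on the platform significantly outperform generic baselines in segmentation fidelity and regression accuracy and show robust zero-shot generalization.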
Details
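Motivation: Technical barriers limit the adoption of self-supervised learning in medicine. The platform lets clinical researchers without a deep engineering background use large unannotated datasets to develop advanced vision AI models.
Result: On tasks including neuropathology segmentation, lung cellularity estimation, and coronary calcium scoring, the models significantly outperform generic baselines in segmentation fidelity and regression accuracy and generalize zero-shot across imaging protocols, reaching an advanced level of performance.
Insight: The novelty includes combining the DINO-MX framework with a code-free platform and introducing strategies such as Magnification-Aware Distillation (MAD) and Parameter-Efficient Fine-Tuning (PEFT), which lowers the technical barrier and annotation cost for domain experts developing clinical AI tools and shifts the focus from engineering optimization to clinical discovery.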
Abstract: Self-supervised learning (SSL) leverages vast unannotated medical datasets, yet steep technical barriers limit adoption by clinical researchers. We introduce Vision Foundry, a code-free, HIPAA-compliant platform that democratizes pre-training, adaptation, and deployment of foundational vision models. The system integrates the DINO-MX framework, abstracting distributed infrastructure complexities while implementing specialized strategies like Magnification-Aware Distillation (MAD) and Parameter-Efficient Fine-Tuning (PEFT). We validate the platform across domains, including neuropathology segmentation, lung cellularity estimation, and coronary calcium scoring. Our experiments demonstrate that models trained via Vision Foundry significantly outperform generic baselines in segmentation fidelity and regression accuracy, while exhibiting robust zero-shot generalization across imaging protocols. By bridging the gap between advanced representation learning and practical application, Vision Foundry enables domain experts to develop state-of-the-art clinical AI tools with minimal annotation overhead, shifting focus from engineering optimization to clinical discovery.
eess.IV [Back]
[176] V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval eess.IV | cs.AI | cs.AR | cs.CV | cs.MMPDF
Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim
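TL;DR: This paper proposes V-Rex, a software-hardware co-designed accelerator that addresses the memory and computation bottlenecks of streaming video large language model (LLM) inference. At its core is the ReSV algorithm, which dynamically retrieves and compresses the KV cache through token clustering based on temporal and spatial similarity, paired with a dedicated hardware engine, DRE, for efficient computation. The system achieves real-time inference at 3.9-8.3 FPS on edge devices, with up to 19.7x speedup and up to 18.5x better energy efficiency at negligible accuracy loss.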
Details
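Motivation: Streaming video LLMs face key challenges in real-time multimodal tasks such as video captioning and question answering: as the video stream keeps arriving, the KV cache grows rapidly, so the iterative prefill stage incurs heavy computation, substantial data transfer, and accuracy degradation. The problem is especially acute in edge deployment.
Result: In edge deployment, V-Rex achieves real-time inference at 3.9-8.3 FPS, with 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvement over an AGX Orin GPU and negligible accuracy loss. The DRE hardware engine accounts for only 2.2% of power and 2.0% of area.
Insight: The work is presented as the first to tackle KV cache retrieval for streaming video LLMs jointly at the algorithm and hardware levels: the training-free ReSV algorithm uses temporal and spatial similarity to compress the KV cache dynamically, and the low-power dedicated accelerator DRE supports bit-level and early-exit computation. This offers a new approach to real-time video LLM inference on resource-constrained edge devices.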
Abstract: Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This process requires an iterative prefill stage, which is a unique feature of streaming video LLMs. Due to its iterative prefill stage, it suffers from significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, this issue is exacerbated for edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time of 3.9-8.3 FPS and energy-efficient streaming video LLM inference on edge deployment with negligible accuracy loss. While DRE only accounts for 2.2% power and 2.0% area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.
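To make the similarity-based token clustering in the abstract above more concrete, the sketch below greedily groups key vectors whose cosine similarity to an existing cluster centroid exceeds a threshold, so that near-duplicate tokens from static regions across frames collapse into one entry. This is a generic illustration, not the ReSV algorithm: the greedy assignment rule, the threshold, the running-mean centroid, and the toy 3-D vectors are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_tokens(tokens, threshold=0.95):
    """Greedy clustering: a token joins the first cluster whose centroid is similar enough."""
    centroids, members = [], []
    for idx, tok in enumerate(tokens):
        for c, centroid in enumerate(centroids):
            if cosine(tok, centroid) >= threshold:
                members[c].append(idx)
                n = len(members[c])
                # Running mean keeps the centroid representative of all members.
                centroids[c] = [(m * (n - 1) + t) / n for m, t in zip(centroid, tok)]
                break
        else:
            # No sufficiently similar cluster: start a new one.
            centroids.append(list(tok))
            members.append([idx])
    return centroids, members

# Toy key vectors: two nearly static patches seen in consecutive frames,
# plus one patch with new content.
tokens = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.02, 0.98, 0.01],
    [0.0, 0.0, 1.0],
]
centroids, members = cluster_tokens(tokens)
compression_ratio = len(tokens) / len(centroids)
```

In a cache-reduction setting the centroids would stand in for their member tokens, so memory scales with the number of clusters rather than the number of tokens; how clusters are later retrieved for attention is beyond this sketch.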
[177] Leveraging Compression to Construct Transferable Bitrate Ladders eess.IV | cs.CVPDF
Krishna Srikar Durbha, Hassene Tmar, Ping-Hao Wu, Ioannis Katsavounidis, Alan C. Bovik
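TL;DR: This paper proposes a new machine-learning-based technique for constructing bitrate ladders. By analyzing the compression procedure and making perceptually relevant measurements on source videos before compression, it accurately predicts the VMAF scores of compressed videos, replacing computationally expensive convex hull construction.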
Details
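Motivation: Conventional fixed bitrate ladders and constant CRF encoding are inefficient. Existing per-title and per-shot encoding techniques improve bitrate gains and viewing experience, but constructing a convex hull for every video is computationally expensive, so more efficient machine learning methods are needed to build content-adaptive bitrate ladders.
Result: Evaluated on a large video corpus, the proposed framework outperforms leading prior methods, and its performance is compared, using Bjontegaard-delta metrics, against the fixed bitrate ladder and the best possible convex hull constructed with exhaustive encoding.
Insight: The novelty is combining analysis of the compression procedure with perceptual measurements on source videos to predict VMAF scores, enabling efficient content-adaptive bitrate ladder construction. The work also examines how per-shot bitrate ladders perform under different encoding settings, improving the transferability and practicality of the method.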
Abstract: Over the past few years, per-title and per-shot video encoding techniques have demonstrated significant gains as compared to conventional techniques such as constant CRF encoding and the fixed bitrate ladder. These techniques have demonstrated that constructing content-gnostic per-shot bitrate ladders can provide significant bitrate gains and improved Quality of Experience (QoE) for viewers under various network conditions. However, constructing a convex hull for every video incurs a significant computational overhead. Recently, machine learning-based bitrate ladder construction techniques have emerged as a substitute for convex hull construction. These methods operate by extracting features from source videos to train machine learning (ML) models to construct content-adaptive bitrate ladders. Here, we present a new ML-based bitrate ladder construction technique that accurately predicts the VMAF scores of compressed videos, by analyzing the compression procedure and by making perceptually relevant measurements on the source videos prior to compression. We evaluate the performance of our proposed framework against leading prior methods on a large corpus of videos. Since training ML models on every encoder setting is time-consuming, we also investigate how per-shot bitrate ladders perform under different encoding settings. We evaluate the performance of all models against the fixed bitrate ladder and the best possible convex hull constructed using exhaustive encoding with Bjontegaard-delta metrics.
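The Bjontegaard-delta comparison mentioned at the end of the abstract above reports the average bitrate difference between two rate-quality curves at equal quality. The sketch below is a simplified variant under stated assumptions: it interpolates log-bitrate piecewise linearly over quality and averages the gap with the trapezoidal rule, whereas common BD-rate implementations fit a cubic polynomial or a monotone spline. The function names and the rate-quality points are hypothetical.

```python
import math

def interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing and contain x."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside interpolation range")

def bd_rate(anchor, test, samples=200):
    """Average bitrate difference (%) of `test` relative to `anchor` at equal quality.

    Each curve is a list of (bitrate, quality) points sorted by increasing quality.
    A negative result means `test` needs less bitrate for the same quality.
    """
    qa, la = [q for _, q in anchor], [math.log(r) for r, _ in anchor]
    qt, lt = [q for _, q in test], [math.log(r) for r, _ in test]
    lo, hi = max(qa[0], qt[0]), min(qa[-1], qt[-1])  # shared quality range
    step = (hi - lo) / samples
    total = 0.0
    for i in range(samples + 1):
        q = min(lo + i * step, hi)
        weight = 0.5 if i in (0, samples) else 1.0   # trapezoidal end weights
        total += weight * (interp(q, qt, lt) - interp(q, qa, la))
    return (math.exp(total / samples) - 1.0) * 100.0

# Hypothetical (bitrate in kbps, VMAF) points for a reference ladder.
anchor = [(1000.0, 70.0), (2000.0, 80.0), (4000.0, 90.0), (8000.0, 95.0)]
# A ladder that reaches the same quality at half the bitrate.
halved = [(rate / 2.0, quality) for rate, quality in anchor]
saving = bd_rate(anchor, halved)
```

Working in log-bitrate is what makes the result a relative (percentage) difference: a ladder that halves the bitrate at every quality level yields a saving of about -50%.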
physics.optics [Back]
[178] Meta-GPT: Decoding the Metasurface Genome with Generative Artificial Intelligence physics.optics | cs.AI | cs.CL | cs.LGPDF
David Dang, Stuart Love, Meena Salib, Quynh Dang, Samuel Rothfarb
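TL;DR: The paper introduces METASTRINGS, a symbolic language that represents photonic nanostructures as textual sequences encoding materials, geometries, and lattice configurations, and builds on it to develop Meta-GPT, a foundation transformer model. Fine-tuned with physics-informed supervised, reinforcement, and chain-of-thought learning, the model generates diverse metasurface prototypes across a range of design tasks whose optical responses match their target spectra, demonstrating its potential to advance AI-driven photonics by learning the compositional rules of light-matter interactions.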
Details
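Motivation: The goal is an AI representation that is both interpretable and consistent with the fundamental laws of nature, in order to handle the complexity of nanostructure design in photonics and to advance AI for the physical sciences.
Result: Across various design tasks, the model achieves <3% mean-squared spectral error and maintains >98% syntactic validity, and the experimentally measured optical responses of the generated metasurface prototypes match their target spectra.
Insight: The novelty is the METASTRINGS symbolic language, which bridges human interpretability and computational design, and Meta-GPT, a transformer model that combines several learning paradigms. Together they lay a rigorous foundation for AI-driven photonic design and represent an important step toward a “metasurface genome project”.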
Abstract: Advancing artificial intelligence for physical sciences requires representations that are both interpretable and compatible with the underlying laws of nature. We introduce METASTRINGS, a symbolic language for photonics that expresses nanostructures as textual sequences encoding materials, geometries, and lattice configurations. Analogous to molecular textual representations in chemistry, METASTRINGS provides a framework connecting human interpretability with computational design by capturing the structural hierarchy of photonic metasurfaces. Building on this representation, we develop Meta-GPT, a foundation transformer model trained on METASTRINGS and finetuned with physics-informed supervised, reinforcement, and chain-of-thought learning. Across various design tasks, the model achieves <3% mean-squared spectral error and maintains >98% syntactic validity, generating diverse metasurface prototypes whose experimentally measured optical responses match their target spectra. These results demonstrate that Meta-GPT can learn the compositional rules of light-matter interactions through METASTRINGS, laying a rigorous foundation for AI-driven photonics and representing an important step toward a metasurface genome project.
astro-ph.IM
[179] Pre-training vision models for the classification of alerts from wide-field time-domain surveys astro-ph.IM | cs.CV
Nabeel Rehemtulla, Adam A. Miller, Mike Walmsley, Ved G. Shah, Theophile Jegou du Laz
TL;DR: This paper examines the impact of adopting pre-trained vision models and standardized architectures for classifying alerts from wide-field time-domain surveys. Standardized models pre-trained on Galaxy Zoo galaxy images match or outperform the custom CNNs commonly used in astronomy today, while being more efficient at inference.
Details
Motivation: Time-domain astronomy still typically relies on custom CNN architectures trained from scratch. The paper investigates whether adopting recent computer vision practices, such as pre-training and standardized architectures, can improve the performance and efficiency of alert classification.
Result: On alert classification, models pre-trained on Galaxy Zoo images tend to outperform models pre-trained on ImageNet or trained from scratch, and they match or exceed a typical custom CNN baseline. The standardized architectures need less inference time and memory despite having more parameters.
Insight: The novelty lies in systematically bringing pre-training and standardized-architecture practices from computer vision into time-domain alert classification and demonstrating their effectiveness. Viewed objectively, pre-training on domain-specific data (such as galaxy images) has an advantage over pre-training on generic images, and the optimized design of standardized architectures yields substantial efficiency gains, offering a new model-building paradigm for upcoming large surveys such as LSST.
Abstract: Modern wide-field time-domain surveys facilitate the study of transient, variable and moving phenomena by conducting image differencing and relaying alerts to their communities. Machine learning tools have been used on data from these surveys and their precursors for more than a decade, and convolutional neural networks (CNNs), which make predictions directly from input images, saw particularly broad adoption through the 2010s. Since then, continually rapid advances in computer vision have transformed the standard practices around using such models. It is now commonplace to use standardized architectures pre-trained on large corpora of everyday images (e.g., ImageNet). In contrast, time-domain astronomy studies still typically design custom CNN architectures and train them from scratch. Here, we explore the effects of adopting various pre-training regimens and standardized model architectures on the performance of alert classification. We find that the resulting models match or outperform a custom, specialized CNN like what is typically used for filtering alerts. Moreover, our results show that pre-training on galaxy images from Galaxy Zoo tends to yield better performance than pre-training on ImageNet or training from scratch. We observe that the design of standardized architectures is much better optimized than the custom CNN baseline, requiring significantly less time and memory for inference despite having more trainable parameters. On the eve of the Legacy Survey of Space and Time and other image-differencing surveys, these findings advocate for a paradigm shift in the creation of vision models for alerts, demonstrating that greater performance and efficiency, in time and in data, can be achieved by adopting the latest practices from the computer vision field.
[180] Semantic search for 100M+ galaxy images using AI-generated captions astro-ph.IM | cs.AI | cs.CV | cs.LG
Nolan Koblischke, Liam Parker, Francois Lanusse, Irina Espejo Morales, Jo Bovy
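TL;DR: This paper presents AION-Search, a semantic search system for large-scale unlabeled astronomical image data. The method uses vision-language models (VLMs) to generate descriptions of galaxy images, then uses contrastive learning to align a pre-trained multimodal astronomy foundation model with those descriptions, producing searchable embeddings. The system enables, for the first time, flexible semantic search over 140 million galaxy images, markedly improving the ability to find rare scientific phenomena in large scientific image archives.
Details
Motivation: Finding scientifically interesting phenomena among the billions of galaxy images produced by telescopes through slow, manual labeling campaigns is inefficient. The goal is to make large unlabeled scientific image archives semantically searchable.
Result: AION-Search achieves state-of-the-art zero-shot performance on finding rare phenomena and outperforms direct image similarity search. In addition, the proposed VLM-based re-ranking method nearly doubles recall for the most challenging targets in the top-100 results.
Insight: The novelty is a complete pipeline for building a semantic search engine from entirely unlabeled image data, centered on training a scalable search model with AI-generated descriptions rather than human annotations. The approach is general and could extend to large unlabeled image archives in other scientific fields, such as Earth observation and microscopy.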
Abstract: Finding scientifically interesting phenomena through slow, manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained multimodal astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search scalable to 140 million galaxy images, enabling discovery from previously infeasible searches. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at https://github.com/NolanKoblischke/AION-Search
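As a rough illustration of what searching over embeddings means here, the sketch below ranks stored vectors by cosine similarity to a query vector. It is a toy stand-in, not AION-Search: the three-dimensional vectors, the image identifiers, and the `search` helper are all hypothetical. A real system would embed the text query with the aligned model and would likely use an approximate nearest-neighbor index at this scale.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, index, k):
    """Return the ids of the k stored embeddings most similar to query_vec."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical image embeddings keyed by made-up identifiers.
    index = {
        "img_spiral": [0.9, 0.1, 0.0],
        "img_merger": [0.1, 0.9, 0.2],
        "img_ring": [0.7, 0.3, 0.1],
    }
    # Hypothetical embedding of a text query.
    print(search([1.0, 0.0, 0.0], index, k=2))
```

A re-ranking stage like the one the abstract mentions would take this short candidate list and reorder it with a more expensive model.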
cs.LG
[181] Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning cs.LG | cs.CL | cs.CV
Yongcan Yu, Lingxiao He, Shuo Lu, Lijun Sheng, Yinuo Xu
TL;DR: This paper reassesses the role of supervised fine-tuning (SFT) in vision-language model (VLM) reasoning. A systematic comparison of SFT and reinforcement learning (RL) shows that the effectiveness of SFT depends on model capacity, data scale, and data distribution, and that SFT is crucial in several scenarios, challenging the prevailing "RL over SFT" view.
Details
Motivation: In response to the current tendency in VLM reasoning research to rely heavily on RL and neglect SFT, the paper re-evaluates the role of SFT empirically to settle whether SFT is really ineffective or harmful.
Result: With identical data sources, SFT more reliably elicits reasoning in weaker models, matches or exceeds RL trained on 20K data while using only 2K, and shows stronger cross-modal generalization. The study also identifies a deceptive-reward problem in RL, in which higher rewards do not correlate with reasoning accuracy.
Insight: The novelty lies in showing that the relative effectiveness of SFT and RL is conditional, and in identifying the key role of SFT in data efficiency, cross-modal transfer, and training weaker models. This supports treating SFT and RL as complementary components of a more balanced post-training pipeline. Viewed objectively, the study provides new empirical evidence for VLM training strategy and challenges existing biases.
Abstract: Recent advances in vision-language model (VLM) reasoning have been largely attributed to the rise of reinforcement learning (RL), which has shifted the community’s focus away from the supervised fine-tuning (SFT) paradigm. Many studies suggest that introducing the SFT stage not only fails to improve reasoning ability but may also negatively impact model training. In this study, we revisit this RL-centric belief through a systematic and controlled comparison of SFT and RL on VLM reasoning. Using identical data sources, we find that the relative effectiveness of SFT and RL is conditional and strongly influenced by model capacity, data scale, and data distribution. Contrary to common assumptions, our findings show that SFT plays a crucial role across several scenarios: (1) Effectiveness for weaker models. SFT more reliably elicits reasoning capabilities in smaller or weaker VLMs. (2) Data efficiency. SFT with only 2K achieves reasoning performance comparable to or better than that of RL with 20K. (3) Cross-modal transferability. SFT demonstrates stronger generalization across modalities. Moreover, we identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL. These results challenge the prevailing “RL over SFT” narrative. They highlight that the role of SFT may have been underestimated and support a more balanced post-training pipeline in which SFT and RL function as complementary components.
[182] Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection cs.LG | cs.CL
Xuwei Tan, Yao Ma, Xueru Zhang
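TL;DR: This paper proposes FinFRE-RAG, which converts structured financial data into natural language via importance-guided feature reduction and retrieval-augmented in-context learning, improving the performance and interpretability of large language models on fraud detection.
Details
Motivation: Traditional tabular models for financial fraud detection depend on heavy feature engineering and are hard to interpret, while LLMs applied directly to tabular data perform poorly because of the large number of features, the extreme class imbalance, and the lack of contextual information.
Result: Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. The LLMs still lag behind specialized classifiers but narrow the performance gap.
Insight: The novelty is a two-stage framework that serializes high-dimensional tabular data into compact natural-language descriptions and combines this with retrieval-augmented learning over label-aware, instance-level exemplars, improving LLM performance on structured financial data while preserving interpretability.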
Abstract: Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions. Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements. However, they perform poorly when applied directly to tabular fraud detection due to the difficulty of reasoning over many features, the extreme class imbalance, and the absence of contextual information. To bridge this gap, we introduce FinFRE-RAG, a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars. Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. Although these LLMs still lag behind specialized classifiers, they narrow the performance gap and provide interpretable rationales, highlighting their value as assistive tools in fraud analysis.
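The first stage described above, reducing a transaction to its most important attributes and serializing them as text, might look roughly like the following. This is a minimal sketch under stated assumptions, not the FinFRE-RAG implementation: the importance scores are assumed to be precomputed elsewhere (for example by a tabular model), and the field names, the sentence template, and the `serialize_row` helper are hypothetical. The retrieval stage is not shown.

```python
def serialize_row(row, importances, top_k):
    """Describe the top_k most important attributes of a record in plain text.

    row:         dict mapping attribute name to its value
    importances: dict mapping attribute name to a precomputed importance score
    top_k:       number of attributes to keep
    """
    # Rank attributes by importance; attributes without a score rank last.
    ranked = sorted(row, key=lambda name: importances.get(name, 0.0), reverse=True)
    kept = ranked[:top_k]
    return "; ".join(f"{name} is {row[name]}" for name in kept) + "."


if __name__ == "__main__":
    # Hypothetical transaction and importance scores.
    transaction = {
        "amount": 912.5,
        "merchant_category": "electronics",
        "hour": 3,
        "device_id": "d-17",
    }
    scores = {"amount": 0.6, "hour": 0.3, "merchant_category": 0.08, "device_id": 0.02}
    print(serialize_row(transaction, scores, top_k=2))
```

Keeping only a few attributes shortens the prompt, which is one plausible way to ease the many-features difficulty the abstract mentions.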
[183] PerNodeDrop: A Method Balancing Specialized Subnets and Regularization in Deep Neural Networks cs.LG | cs.AI | cs.CV
Gelesh G Omathil, Sreeja CS
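TL;DR: This paper proposes PerNodeDrop, a lightweight stochastic regularization method for mitigating overfitting in deep neural networks. By perturbing each node for each sample, it breaks the uniformity of the noise in existing regularization techniques, suppressing harmful co-adaptation among neurons while preserving beneficial co-adaptation, and improving generalization to unseen data.
Details
Motivation: Deep neural networks have strong representational capacity but overfit easily, mainly because neurons tend to co-adapt in ways that capture complex feature interactions yet also reinforce spurious, non-generalizable patterns. Existing noise-based regularizers such as Dropout and DropConnect typically apply uniform noise across a layer or a batch of samples, which can suppress harmful and beneficial co-adaptation alike, so a finer-grained regularization strategy is needed.
Result: Experiments on vision, text, and audio benchmarks show that PerNodeDrop improves generalization over standard noise-based regularizers such as Dropout, narrows the gap between training and validation performance, and improves reliability on unseen data.
Insight: The novelty of PerNodeDrop lies in its per-sample, per-node perturbations, which give finer-grained regularization that suppresses harmful co-adaptation while preserving beneficial predictive interactions. Viewed objectively, its formal expected-loss analysis provides theoretical support for balancing specialized subnetworks against regularization, making it a scalable and lightweight improvement.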
Abstract: Deep neural networks possess strong representational capacity yet remain vulnerable to overfitting, primarily because neurons tend to co-adapt in ways that, while capturing complex and fine-grained feature interactions, also reinforce spurious and non-generalizable patterns that inflate training performance but reduce reliability on unseen data. Noise-based regularizers such as Dropout and DropConnect address this issue by injecting stochastic perturbations during training, but the noise they apply is typically uniform across a layer or across a batch of samples, which can suppress both harmful and beneficial co-adaptation. This work introduces PerNodeDrop, a lightweight stochastic regularization method. It applies per-sample, per-node perturbations to break the uniformity of the noise injected by existing techniques, thereby allowing each node to experience input-specific variability. Hence, PerNodeDrop preserves useful co-adaptation while applying regularization. This narrows the gap between training and validation performance and improves reliability on unseen data, as evidenced by the experiments. Although superficially similar to DropConnect, PerNodeDrop drops weights at the sample level rather than the batch level. An expected-loss analysis formalizes how its perturbations attenuate excessive co-adaptation while retaining predictive interactions. Empirical evaluations on vision, text, and audio benchmarks indicate improved generalization relative to standard noise-based regularizers.
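The contrast this abstract draws between batch-level and sample-level weight dropping can be sketched as follows. This is one reading of the abstract, not the authors' implementation: it uses a plain linear layer, Bernoulli keep/drop masks on the weights, and no rescaling of surviving weights, and every function name here is hypothetical.

```python
import random


def linear(x, weights):
    """Compute y_j = sum_i x_i * weights[i][j] for one input sample x."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_out)]


def sample_weight_mask(n_in, n_out, drop_prob, rng):
    """Draw a 0/1 mask with one independent entry per weight."""
    return [[0.0 if rng.random() < drop_prob else 1.0 for _ in range(n_out)]
            for _ in range(n_in)]


def apply_mask(weights, mask):
    """Element-wise product of the weight matrix and a mask."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


def forward_batch_level(batch, weights, drop_prob, rng):
    """Batch-level dropping: one weight mask is shared by all samples."""
    mask = sample_weight_mask(len(weights), len(weights[0]), drop_prob, rng)
    dropped = apply_mask(weights, mask)
    return [linear(x, dropped) for x in batch]


def forward_sample_level(batch, weights, drop_prob, rng):
    """Sample-level dropping: every sample gets its own weight mask."""
    outputs = []
    for x in batch:
        mask = sample_weight_mask(len(weights), len(weights[0]), drop_prob, rng)
        outputs.append(linear(x, apply_mask(weights, mask)))
    return outputs
```

With a batch of identical inputs, the batch-level version returns identical rows, while the sample-level version generally does not, which is the input-specific variability the abstract describes.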
[184] Image Diffusion Preview with Consistency Solver cs.LG | cs.CV
Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong
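TL;DR: This paper proposes the Diffusion Preview paradigm, which generates preview images with fast, low-step sampling for the user to evaluate and performs full-step refinement only once the user is satisfied. To address the shortcomings of existing acceleration methods in preview quality and consistency, the authors design ConsistencySolver, a lightweight, trainable high-order solver based on general linear multistep methods and optimized via reinforcement learning, which substantially improves generation quality and preview-to-final consistency in low-step settings.
Details
Motivation: Slow inference in image diffusion models degrades the interactive experience, and existing training-free solvers and post-training distillation methods struggle to deliver high-quality previews at low step counts or to keep previews consistent with the final output.
Result: In low-step settings, ConsistencySolver significantly improves generation quality and consistency. Quantitatively, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps and outperforms distillation baselines. User studies indicate that the method cuts overall user interaction time by nearly 50% while maintaining generation quality.
Insight: The novelty lies in proposing a preview-and-refine workflow paradigm and in designing ConsistencySolver, a trainable high-order solver based on general linear multistep methods and optimized via reinforcement learning. It preserves preview quality and consistency while accelerating inference in a lightweight way, offering an efficient solution for interactive image generation.
Abstract: The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via reinforcement learning, which enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at https://github.com/G-U-N/consolver.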
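For readers unfamiliar with linear multistep methods, the sketch below shows the generic explicit update that such solvers build on: the next state is the current state plus a step size times a weighted sum of recent derivative evaluations. It is not ConsistencySolver. The coefficients here are the fixed two-step Adams-Bashforth values rather than anything learned, the test equation dx/dt = -x stands in for a diffusion sampling ODE, and the function name is hypothetical.

```python
import math


def multistep_solve(f, x0, t0, t1, num_steps, coeffs):
    """Integrate dx/dt = f(t, x) from t0 to t1 with an explicit multistep rule.

    coeffs[k] weights the derivative from k steps back, so the update is
    x_next = x + h * sum(coeffs[k] * f_{n-k}). Until enough history has
    accumulated, a plain Euler step is used as a warm-up.
    """
    h = (t1 - t0) / num_steps
    x, t = x0, t0
    history = []  # most recent derivative first
    for _ in range(num_steps):
        history.insert(0, f(t, x))
        history = history[: len(coeffs)]
        if len(history) < len(coeffs):
            x = x + h * history[0]  # Euler warm-up step
        else:
            x = x + h * sum(b * d for b, d in zip(coeffs, history))
        t += h
    return x


if __name__ == "__main__":
    decay = lambda t, x: -x  # toy ODE with exact solution exp(-t)
    ab2 = multistep_solve(decay, 1.0, 0.0, 1.0, 100, [1.5, -0.5])
    euler = multistep_solve(decay, 1.0, 0.0, 1.0, 100, [1.0])
    print(abs(ab2 - math.exp(-1.0)), abs(euler - math.exp(-1.0)))
```

Reusing past derivative evaluations is what lets a higher-order rule reach a given accuracy in fewer steps than Euler. According to the abstract, the paper tunes such a solver via reinforcement learning instead of fixing its coefficients in advance.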