cs.CL

[1] SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs cs.CL | cs.AI

Avijit Shil, Suman Samui

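TL;DR: This paper proposes SKG-Eval, a quasi-deterministic, interpretable framework for evaluating multi-turn dialogue systems. It models a dialogue as a Semantic Knowledge Graph (SKG) that evolves across turns, updates the graph incrementally, and computes three complementary signals (local relevance, historical consistency, and logical coherence) that are fused into a length-invariant session score. Across multiple benchmarks, the method correlates more strongly with human judgments and substantially improves detection of long-range inconsistencies.

Details

Motivation: Existing automatic evaluators, such as LLM-as-a-judge frameworks and embedding-based metrics, mostly rely on flat or turn-isolated representations and struggle to detect long-range problems in dialogue such as contradiction, topic drift, and entity inconsistency. An evaluation framework that models how dialogue state evolves is therefore needed.

Result: Across multiple benchmarks, SKG-Eval correlates more strongly with human judgments than existing methods and substantially improves detection of long-range inconsistencies in extended conversations.

Insight: The core innovation is structured, externalized state tracking through a semantic knowledge graph, which offers a scalable alternative to LLM-based dialogue evaluators. The method is quasi-deterministic and interpretable, and it produces explicit contradiction certificates and reproducible scores.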

Abstract: Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.
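The abstract says per-turn signals are aggregated into a length-invariant session score with recency weighting, but it does not give the formula. As a minimal sketch only, assuming each turn already has a fused score in [0, 1], a geometrically decayed mean is one way to weight recent turns more while keeping the result on the same scale for any dialogue length. The name `session_score` and the `decay` parameter are hypothetical, and this is not claimed to be SKG-Eval's actual trend analysis.

```python
def session_score(turn_scores, decay=0.8):
    """Recency-weighted mean of per-turn scores (hypothetical aggregation).

    turn_scores: fused per-turn scores, oldest first.
    decay: assumed geometric factor in (0, 1]; smaller favors recent turns.
    """
    if not turn_scores:
        raise ValueError("need at least one turn score")
    n = len(turn_scores)
    # The most recent turn gets weight 1; each older turn is scaled by decay.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    # Dividing by the total weight keeps the score on the same scale
    # whether the dialogue has 3 turns or 300.
    return sum(w * s for w, s in zip(weights, turn_scores)) / sum(weights)


early_error = session_score([0.0, 1.0, 1.0])   # problem early, then recovery
late_error = session_score([1.0, 1.0, 0.0])    # problem in the latest turn
```

Under this weighting a late inconsistency lowers the score more than an early one that was followed by recovery, which is one plausible reading of "recency-weighted".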


Li Zhang, Jaromir Savelka, Kevin Ashley

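TL;DR: This paper proposes a retrieval-based approach to multi-label legal text annotation that recasts the annotation task as a retrieval problem. It embeds documents and label descriptions separately with a frozen retrieval model and predicts labels via k-nearest-neighbor search in the embedding space. This avoids the retraining that parametric encoders need when the label set changes, as well as the rising cost and degrading performance of generative large language models as the label space grows.

Details

Motivation: Multi-label legal annotation must cope with large, evolving label taxonomies, long fact-intensive documents, and limited supervision. Existing parametric encoders require task-specific training and must be retrained when the label set changes, while generative large language models are costly and degrade as the label space grows.

Result: On three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), the retrieval approach achieves competitive accuracy and strong data efficiency. On Eurlex, for example, retrieval with Qwen-8B raises Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared with fine-tuning. With only 100 training samples, retrieval reaches a Micro-F1 of 48.29 on ECtHR-A, nearly double that of hierarchical Legal-BERT (27.87).

Insight: The core innovation is to reframe legal annotation as retrieval, using a frozen retrieval model and k-nearest-neighbor prediction. This makes the method extensible (updated by re-embedding and re-indexing rather than backpropagation), data-efficient, and hallucination-free (it strictly respects the predefined label set). Viewed objectively, the retrieval paradigm offers a practical, deployable alternative for high-cardinality, rapidly changing label spaces, and the paper quantifies a reliability problem of generative inference: label hallucination even under deterministic decoding.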

Abstract: Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared to fine-tuning. With only (N=100) training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.
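The k-nearest-neighbor prediction step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the toy 2-D vectors stand in for embeddings from a frozen retrieval model, and the function names, the `min_votes` voting rule, and the index layout (a list of vector and label-set pairs) are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_labels(query_vec, index, k=3, min_votes=2):
    """Predict labels by voting among the k most similar index entries.

    index: list of (vector, set_of_labels) pairs; an entry may be a labelled
    document or a label description carrying its own label (assumed layout).
    """
    neighbors = sorted(index, key=lambda item: cosine(query_vec, item[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, labels in neighbors for label in labels)
    # Only labels already present in the index can ever be returned.
    return sorted(label for label, count in votes.items() if count >= min_votes)


index = [
    ([1.0, 0.0], {"A"}),
    ([0.9, 0.1], {"A", "B"}),
    ([0.0, 1.0], {"C"}),
    ([0.1, 0.9], {"C"}),
]
predicted = knn_labels([1.0, 0.05], index, k=2, min_votes=2)
```

Because predictions are drawn from labels stored in the index, adding or retiring a label amounts to editing `index`, with no gradient step, which mirrors the re-embed and re-index update path the abstract describes.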


[3] RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis cs.CL

Joy Bose

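TL;DR: This paper introduces RTI-Bench, a structured dataset of Central Information Commission decisions under India's Right to Information Act, with outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines. It combines 1,218 cases from a publicly available instruction-response corpus with 298 decision PDFs collected directly from the Commission portal, spanning five commissioners and three document formats from 2023 to 2026. This is the dataset's first release, and it includes a performance evaluation of a zero-shot Mistral 7B baseline.

Details

Motivation: India's Right to Information Act gives citizens the right to request information from public authorities, but in practice the public struggles to understand the complex administrative language of Central Information Commission decisions and cannot predict whether an appeal is worth pursuing. A structured dataset is therefore needed to support the related analysis and prediction tasks.

Result: A zero-shot Mistral 7B baseline on 100 cases reaches 57.3% accuracy and 37.0% macro-F1 on outcome prediction, well above the majority-class baseline of 14.3% macro-F1. Label coverage is 89% on the instruction-response corpus and 51% on the PDF subset, and a manual review of 50 labelled cases gives a label precision of 95.3%.

Insight: The contribution is the first publicly released structured dataset of Indian RTI administrative decisions, integrating multiple sources with detailed annotations (such as IRAC reasoning components) and providing a benchmark for legal text analysis and natural language processing tasks. Viewed objectively, the dataset supports legal AI applications, particularly the understanding and prediction of administrative decisions.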

Abstract: India’s Right to Information Act, 2005 gives every citizen the right to demand information from public authorities, yet in practice most people cannot make sense of the dense administrative language used in Central Information Commission (CIC) decisions, let alone predict whether an appeal is worth filing. This paper introduces RTI-Bench, a structured dataset of CIC decisions with outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines. To the best of our knowledge it is the first publicly released structured dataset for Indian RTI administrative decisions. The dataset draws from two sources: 1,218 cases from a publicly available instruction-response corpus (with structured fields added through rule-based extraction), and 298 CIC decision PDFs collected directly from the Commission portal, spanning five commissioners and three document format generations from 2023 to 2026. Label coverage reaches 89% on the instruction-response corpus. For the PDF subset of 239 primary decisions, coverage is 51% in this first release. A random sample of 50 labelled cases was manually reviewed, yielding a label precision of 95.3%. A zero-shot Mistral 7B baseline on 100 cases gives 57.3% accuracy and 37.0% macro-F1 on outcome prediction, well above the majority-class baseline of 14.3% macro-F1. RTI-Bench is available at https://huggingface.co/datasets/joyboseroy/rti-bench


[4] MixSD: Mixed Contextual Self-Distillation for Knowledge Injection cs.CL

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang

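TL;DR: This paper proposes MixSD, an external-teacher-free knowledge injection method that aims to mitigate the catastrophic forgetting caused by supervised fine-tuning (SFT) when new knowledge is injected into a language model. The method builds its supervision signal by dynamically mixing outputs from the base model's own expert conditional and naive conditional distributions, keeping the supervision sequences closer to the base model's original distribution. Experiments show that, across multiple model scales and tasks, MixSD balances memorizing new knowledge against retaining existing capabilities better than standard SFT and on-policy self-distillation baselines.

Details

Motivation: When standard SFT injects new knowledge into a language model, its training targets (from humans or external systems) diverge from the model's autoregressive distribution. The model is pushed to imitate low-probability token sequences, which damages its existing reasoning and general-domain capabilities, i.e. catastrophic forgetting.

Result: On synthetic corpora built to study factual recall and arithmetic function acquisition, and on benchmarks for open-domain factual question answering and knowledge editing, MixSD consistently achieves a better memorization-retention trade-off than SFT and on-policy self-distillation baselines across multiple model scales and settings. It retains up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT can retain as little as 1%.

Insight: The core innovation is a distribution-aligned principle for knowledge injection: supervision is generated dynamically by mixing the model's own conditional distributions, which avoids reliance on an external teacher and keeps the targets consistent with the model's native generation distribution. Viewed objectively, by lowering the negative log-likelihood (NLL) of the supervision sequences under the base model and reducing harmful movement along Fisher-sensitive parameter directions, the method offers a simple and effective mechanism for mitigating catastrophic forgetting.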

Abstract: Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model’s autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model’s original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model’s distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model’s held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model’s native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
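The abstract describes building supervision by mixing tokens from an expert conditional and a naive conditional, but it does not state the mixing rule. The sketch below assumes the simplest possible rule, an independent coin flip per token, purely to make the idea concrete. `mix_sequence`, `expert_prob`, and the `scripted` toy conditionals are hypothetical; in the real method both conditionals would be the same base model queried with and without the injected fact in context.

```python
import random

EOS = "<eos>"


def mix_sequence(expert_next, naive_next, expert_prob=0.5, max_len=16, seed=0):
    """Build one supervision sequence, choosing token by token which
    conditional supplies the next token (assumed per-token coin flip)."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_len):
        # With probability expert_prob take the expert's token, else the naive one.
        source = expert_next if rng.random() < expert_prob else naive_next
        token = source(tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens


def scripted(script):
    """Toy stand-in for a model conditional: replays a fixed token script."""
    def next_token(prefix):
        return script[len(prefix)] if len(prefix) < len(script) else EOS
    return next_token


expert_script = ["The", "capital", "is", "Paris"]  # conditional that sees the fact
naive_script = ["I", "am", "not", "sure"]          # conditional with the prior only
expert = scripted(expert_script)
naive = scripted(naive_script)
mixed = mix_sequence(expert, naive, expert_prob=0.5)
```

Setting `expert_prob=1.0` recovers pure expert targets and `0.0` recovers the model's unmodified prior, so the parameter traces the memorization-retention trade-off the abstract refers to, at least in this toy form.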


[5] Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps cs.CL | cs.AI

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu

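TL;DR: This paper proposes RTPurbo, a method that targets the efficiency bottleneck in long-context LLM inference caused by the quadratic cost of full attention. By exploiting the sparsity inherent in full-attention models, it converts them into highly sparse models with only a few hundred training steps, substantially improving inference efficiency while preserving near-lossless accuracy.

Details

Motivation: The goal is to address the quadratic cost of full attention in long-context inference while avoiding the undesirable trade-off among efficiency, training cost, and accuracy made by existing efficient alternatives such as native sparse training or heuristic token eviction.

Result: Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36x prefill speedup at 1M context and about a 2.01x decode speedup.

Insight: The contribution is showing that full-attention LLMs are intrinsically sparse, and designing RTPurbo around three key observations: only a few attention heads need full long-context processing, long-range retrieval is governed by a low-dimensional subspace, and the useful token budget is strongly query-dependent. The central insight is that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.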

Abstract: Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-$p$ selection more suitable than fixed top-$k$ sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model’s intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36$\times$ prefill speedup at 1M context and about a 2.01$\times$ decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
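Observation (3) contrasts dynamic top-p selection with fixed top-k. A small sketch of top-p selection over non-negative relevance scores shows why the kept-token count adapts to the query. It assumes the indexer's scores are already non-negative (for example, softmax outputs); `top_p_indices` is a hypothetical name and not RTPurbo's indexer.

```python
def top_p_indices(scores, p=0.9):
    """Return indices of the smallest set of tokens whose normalized
    scores sum to at least p, highest-scoring tokens first."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scores[i] / total
        if mass >= p:
            break
    return kept


peaked = [90, 5, 3, 2]    # one token dominates: a small budget suffices
flat = [25, 25, 25, 25]   # relevance is spread out: more tokens are needed
few = top_p_indices(peaked, p=0.85)
many = top_p_indices(flat, p=0.85)
```

A fixed top-k would keep the same number of tokens for both queries, either wasting budget on the peaked case or dropping relevant tokens in the flat one.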


[6] Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models cs.CL | cs.AI | q-bio.NC

Yueqing Hu, Tianhong Wang

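TL;DR: This paper studies whether the cognitive cost alignment between the reasoning-trace length of Large Reasoning Models (LRMs) and human reaction times is affected by inference-time reasoning effort. Experiments across model scales, effort levels, and reasoning tasks find that the alignment stays stable and is not modulated by the inference-time compute budget. This suggests the alignment reflects a policy crystallized at training time rather than a real-time allocation process.

Details

Motivation: In response to the recent debate over whether the alignment between LRM reasoning traces and human reaction times reflects genuine computational structure or surface verbosity, the paper tests whether this alignment varies with inference-time reasoning effort, in order to clarify the nature of the alignment mechanism.

Result: Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant (Bayes factors lean toward the null, and mean alignment is numerically near-identical). Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, and that model scale improves the match.

Insight: The contribution is showing that cognitive cost alignment is a training-time achievement that is robust to inference-time perturbations, supporting a "compiled" rather than "online" account of LRM problem-solving. This offers a new perspective on the mechanism behind model-human cognitive alignment, suggesting that the alignment may stem from statistical regularities in the training data rather than from dynamic computation.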

Abstract: Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.


[7] Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench cs.CL | cs.AI | cs.CY

Tianyu Wang, Jiajun Li, Jianghao Lin

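TL;DR: This paper presents ConsumerSimBench, a benchmark for evaluating whether large language models can think like real consumers and predict public reactions. The benchmark is built from 1,553 Chinese social-media topics and 23,122 atomic reaction criteria, and it evaluates models through auditable yes-no decisions. Even the strongest current model (Gemini-3.1-Pro) covers only 47.8% of real reaction criteria, indicating that frontier LLMs remain far from predicting what consumers actually care about in high-context Chinese consumer discourse.

Details

Motivation: LLMs are widely used as "digital consumers" to simulate public opinion and predict market reactions, but there is no benchmark for evaluating whether a model can reconstruct the concrete reaction patterns that real consumers show in public discourse.

Result: Among 13 frontier generators, the strongest, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, and GPT-5.2 and Claude-4.6 perform worse. A direct structured reasoning prompt lowers coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset.

Insight: The paper reframes consumer simulation as a forecasting problem over real public-discourse reactions and evaluates it through atomic, auditable yes-no decisions rather than holistic preference judgments, which markedly improves evaluation reliability (three-judge agreement rises from 65.8% to 92.1%). The study reveals a sharp gap between technical-benchmark performance and socially grounded consumer intuition.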

Abstract: LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate–reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.


[8] HalluScore: Large Language Model Hallucination Question Answering Benchmark cs.CL

Aisha Alansari, Hamzah Luqman

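TL;DR: This paper introduces HalluScore, a structured question answering benchmark designed specifically for hallucination in Arabic large language models. It contains 827 carefully curated questions covering different levels of reasoning difficulty, knowledge domains, historical timelines, and Arabic cultural scenarios, for evaluating, detecting, and mitigating hallucination in LLMs.

Details

Motivation: Current hallucination benchmarks focus mainly on English and Chinese, while Arabic is underrepresented because annotated resources are scarce and the language is morphologically complex. Existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic, so a dedicated Arabic hallucination benchmark is needed.

Result: Using HalluScore, the authors conduct a comprehensive empirical analysis of 17 Arabic, multilingual, and reasoning LLMs, revealing hallucination patterns in Arabic LLMs, and provide high-quality human annotations identifying the hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated models.

Insight: The contribution is the first structured, multi-dimensional, culturally sensitive Arabic hallucination question answering benchmark, whose construction pipeline includes quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. The findings suggest that hallucination in Arabic LLMs extends beyond factual inaccuracy to challenges in cultural understanding, linguistic reasoning, and logical consistency.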

Abstract: Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with few benchmarks for LLM hallucination due to scarce annotated resources and the language’s morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations identifying hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated LLMs. These results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.


[9] ACIL: Auto Chain of Thoughts for In-Context Learning cs.CL

Rui Chu

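TL;DR: This paper proposes ACIL (Auto Chain of Thoughts for In-Context Learning), a framework that aims to improve in-context learning (ICL) on complex reasoning tasks by automatically constructing demonstrations that contain reasoning chains. The framework generates reasoning steps for input-output examples and filters for high-quality demonstrations through a systematic selection process, so that the prompt supplies structured intermediate explanations that guide the model toward more reliable reasoning.

Details

Motivation: Standard ICL often performs poorly on tasks that require multi-step reasoning, because demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. The author therefore aims to improve ICL by automatically generating and selecting high-quality reasoning-chain demonstrations.

Result: Experiments on multiple reasoning tasks show that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.

Insight: The contribution is the automatic construction of reasoning-enhanced demonstrations (Auto-CoT): generating reasoning chains, augmenting the prompt context with structured explanations, and removing irrelevant or low-quality demonstrations through systematic selection. This gives the model more reliable reasoning guidance and may be an effective strategy for complex tasks.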

Abstract: Recent advances in large language models (LLMs) have shown that Chain-of-Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In-Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi-step reasoning, because the demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain-of-Thought (Auto-CoT) framework to improve ICL by automatically constructing reasoning-enhanced demonstrations. Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process. By incorporating high-quality reasoning examples into the ICL prompt, Auto-CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.


[10] SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning cs.CL | cs.AI

Yongfeng Huang, Ruiying Chen, James Cheng

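TL;DR: This paper proposes SEMA-RAG, a self-evolving multi-agent retrieval-augmented generation framework for medical question answering. Through task decoupling and dynamic multi-round exploration, it assigns the interpretation, exploration, and adjudication roles of clinical reasoning to three specialist agents, addressing the mismatch between the conventional single-round, static RAG paradigm and the multi-stage process of clinical reasoning.

Details

Motivation: Conventional retrieval-augmented generation (RAG) for medical question answering mostly follows a single-round, static retrieval paradigm that does not match the multi-stage process of clinical reasoning. As a result, question-to-query translation lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains.

Result: Across five benchmarks and five large language model (LLM) backbones, SEMA-RAG improves on the strongest baseline by +6.46 accuracy points on average.

Insight: The core innovation is to restructure the RAG workflow around task decoupling and dynamic multi-round exploration, with three specialist agents (Interpreter, Explorer, and Arbiter) responsible for clinical schema interpretation, sufficiency-driven self-evolving retrieval, and evidence adjudication with answer selection, respectively, so as to better mirror clinical reasoning.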

Abstract: Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.


[11] HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools cs.CL | cs.LG

Aashna Garg, Siddharth Singha Roy, Jinu Jang, Federico Brancasi, Shengyu Fu

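TL;DR: This paper presents HyDRA (Hybrid Dynamic Routing Architecture), a query routing framework for heterogeneous LLM pools. It predicts each query's fine-grained capability requirements along several dimensions (reasoning, code generation, debugging, and tool use) and applies shortfall matching against model profiles to select the cheapest model that meets them. The framework is decoupled from the model catalog, so the pool can be updated without retraining, and it is deployed in GitHub Copilot.

Details

Motivation: The paper addresses routing over heterogeneous LLM pools in production, where costs differ widely. Existing routers typically make a simple binary strong-vs-weak decision and tie learned parameters to specific models, so they must be retrained whenever the model catalog changes.

Result: On SWE-Bench Verified with a 5-model pool, HyDRA's tunable shortfall threshold spans three regimes. At peak quality, it exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings. At iso-quality, it matches Sonnet at 54.1% cost savings, a 6x improvement over the authors' prior in-house binary router (9.1%). An aggressive setting pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench.

Insight: The core innovation is to move routing from a simple binary choice to matching fine-grained, multi-dimensional capability requirements against model profiles, separating routing logic from specific model identities and thereby supporting pool updates with zero retraining. The paper also reports, to the authors' knowledge for the first time in the LLM routing literature, language-invariant routing across CJK, European, and other script families.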

Abstract: Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog – adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA’s tunable shortfall threshold spans three regimes: peak-quality exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; iso-quality matches Sonnet at 54.1% cost savings, a 6x improvement over our prior in-house binary router at 9.1%; aggressive pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot’s VS Code Chat auto-mode and – to our knowledge for the first time in the LLM routing literature – demonstrates language-invariant routing across CJK, European, and other script families.
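The shortfall-matching step can be illustrated with a small routing function. The abstract does not define how shortfall is computed, so this sketch assumes it is the summed per-dimension gap between the predicted requirement and a model's declared capability; the profile names, costs, capability values, and the fallback rule are all hypothetical and are not taken from HyDRA's configuration.

```python
DIMS = ("reasoning", "code_generation", "debugging", "tool_use")


def shortfall(required, capability):
    """Total amount by which a model falls short of the requirements
    (assumed definition: sum of positive per-dimension gaps)."""
    return sum(max(0.0, required[d] - capability[d]) for d in DIMS)


def route(required, profiles, threshold=0.0):
    """Pick the cheapest profile whose shortfall is within the threshold."""
    eligible = [p for p in profiles
                if shortfall(required, p["capability"]) <= threshold]
    if not eligible:
        # Assumed fallback: no model qualifies, so take the closest one.
        return min(profiles,
                   key=lambda p: shortfall(required, p["capability"]))["name"]
    return min(eligible, key=lambda p: p["cost"])["name"]


def flat_profile(name, cost, level):
    """Hypothetical profile with the same capability level on every dimension."""
    return {"name": name, "cost": cost, "capability": {d: level for d in DIMS}}


profiles = [flat_profile("small", 1, 0.25),
            flat_profile("medium", 4, 0.5),
            flat_profile("large", 10, 1.0)]

easy = {d: 0.25 for d in DIMS}
hard = {"reasoning": 0.75, "code_generation": 0.5,
        "debugging": 0.5, "tool_use": 0.5}
strict_choice = route(hard, profiles, threshold=0.0)
relaxed_choice = route(hard, profiles, threshold=0.25)
```

Raising `threshold` tolerates a larger capability gap in exchange for a cheaper model, which is the kind of knob the abstract's peak-quality, iso-quality, and aggressive regimes suggest. Because `profiles` is plain data, adding or removing a model changes no learned parameters.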


[12] The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning cs.CL | cs.AI

Scott Merrill, Shashank Srivastava

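TL;DR: This paper proposes counterfactual localization, a method for locating the point in a language model's reasoning trace where deceptive commitment begins, rather than treating deception only as a property of the final output. The authors build five strategic environments (such as strategic bluffing, maze guidance, and financial advice) in which deception is never prompted but emerges from strategic incentives, and they localize about 1.46M sentences using a corpus drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Attention-based transition features predict deceptive commitment in a way that generalizes better than lexical cues, and compact attention-head sets (under 10% of heads) causally suppress deceptive commitment in unseen environments.

Details

Motivation: Existing deception datasets label only completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception?

Result: Across the five strategic environments, attention-based transition features generalize out of distribution better than lexical cues, and the identified compact attention-head sets (under 10% of heads) causally suppress deceptive commitment in unseen environments. Human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state.

Insight: The contribution is the counterfactual localization method, which moves deception analysis from the output level into the dynamics of the reasoning trace, together with a large-scale, objectively labelled corpus of deceptive commitment. The analysis indicates that deceptive commitment is reflected more in reusable changes in reasoning dynamics (such as attention patterns) than in surface form, offering a new perspective on commitment mechanisms in language models.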

Abstract: Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model’s reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes $\sim$1.46M sentences across four reasoning models, drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Sentence-level human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state. Using this resource, we show that lexical cues for commitment prediction transfer poorly across environments, whereas attention-based transition features generalize out of distribution, suggesting that deceptive commitment is reflected in reusable changes in reasoning dynamics rather than surface form. We further identify compact attention-head sets (under 10% of heads) that, selected on one environment, causally suppress deceptive commitment across held-out environments. We release the corpus as a substrate for studying deception, and more broadly commitment, in language-model reasoning.
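The resampling procedure in the abstract (fix a sentence prefix, resample continuations, estimate the probability of a deceptive outcome) is a Monte Carlo estimate per prefix. The sketch below shows that loop with a toy sampler in place of a real model. `sample_outcome`, the keyword-based toy environment, and the threshold rule for picking a commitment point are all assumptions made for illustration; the paper's own criterion for a commitment point is not given in the abstract.

```python
import random


def deception_curve(sentences, sample_outcome, n_samples=200, seed=0):
    """For each sentence prefix, estimate P(deceptive outcome) by resampling.

    sample_outcome(prefix, rng) is a hypothetical callable that continues the
    trace from the fixed prefix and returns True if the outcome is deceptive.
    """
    rng = random.Random(seed)
    curve = []
    for t in range(1, len(sentences) + 1):
        prefix = sentences[:t]
        hits = sum(sample_outcome(prefix, rng) for _ in range(n_samples))
        curve.append(hits / n_samples)
    return curve


def commitment_point(curve, threshold=0.5):
    """Index of the first prefix whose estimate reaches the threshold
    (assumed rule), or None if it never does."""
    for t, prob in enumerate(curve):
        if prob >= threshold:
            return t
    return None


def toy_sampler(prefix, rng):
    """Toy stand-in for a model: deception becomes likely once 'bluff' appears."""
    p_deceptive = 0.95 if any("bluff" in s for s in prefix) else 0.10
    return rng.random() < p_deceptive


trace = ["I hold a weak hand.", "I will bluff anyway.", "I raise."]
curve = deception_curve(trace, toy_sampler)
point = commitment_point(curve, threshold=0.5)
```

With a real model, each call to `sample_outcome` is a full generation plus an environment check, which is why the corpus in the abstract required tens of millions of sampled continuations.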


[13] Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages cs.CL

Firoj Alam, Shammur Absar Chowdhury, Enamul Hoque Prince

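TL;DR: This tutorial gives an overview of research on building multilingual, multimodal (text, speech, and vision) large language models under limited data and compute. It covers low-cost data creation, adapter stacks for tri-modal alignment, culture-aware evaluation beyond English, and hands-on resources for fine-tuning a compact multilingual vision-language model and building a speech->text->LLM pipeline.

Details

Motivation: Multimodal LLMs are expanding from vision-language to three modalities (vision, audio, and text), yet their pipelines and benchmarks remain English-centric and compute-heavy. The tutorial focuses on developing multilingual, multimodal AI for low-resource languages.

Result: As a tutorial paper, it reports no specific quantitative results. It introduces relevant models (such as PALO and Maya) and speech-text LLMs, and provides practical guidance.

Insight: The contribution is a systematic treatment of practical methods for multilingual tri-modal alignment under low-resource constraints, including low-cost data strategies, adapter-stack architectures, and culture-aware evaluation frameworks, giving researchers and practitioners in resource-constrained settings an actionable roadmap.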

Abstract: Multimodal LLMs are evolving from vision-language models to tri-modal systems that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. The tutorial offers an overview of this emerging research area for multilingual multimodality across text, speech, and vision under limited data/compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), and speech-text LLMs. We cover low-cost data creation/curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English; and hands-on resources for fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline. The content will be delivered as an interactive half-day tutorial, designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.


[14] PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media cs.CL | cs.AI | cs.CY

Zoher Kachwala, Bao Tran Truong, Rasika Muralidharan, Haewoon Kwak, Jisun An

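TL;DR: The paper presents PluRule, a benchmark for evaluating how well AI models can moderate content in pluralistic social-media communities. The benchmark is multimodal and multilingual, covering 13,371 rule violations across 1,989 Reddit communities and 2,885 rules. Even advanced models such as GPT-5.2 perform only slightly better than a trivial baseline, exposing a fundamental challenge for language models in moderating pluralistic communities.

Details

Motivation: As social media shift toward pluralistic, community-governed platforms, the study asks how AI models can help moderate each community's own rules, addressing the complication that the same content may violate different rules in different communities.

Result: On PluRule, state-of-the-art vision-language models (including GPT-5.2) struggle significantly, performing only slightly better than a trivial baseline. Larger models and more context bring only marginal gains, while universal rules (such as civility) are comparatively easier to detect.

Insight: The contribution is formalizing pluralistic community moderation as a multiple-choice problem for the first time and building a large-scale multimodal, multilingual benchmark. The analysis indicates that fine-grained understanding of community-specific rules is a core bottleneck for current language models, which gives future research a clear direction.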

Abstract: Social media are shifting towards pluralism – community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.


[15] Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL

Jen-tse Huang, Didi Zhou, Faith Kamau, Amy Oh, Anne R. Links

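TL;DR: This study examines whether frontier large language models inherit and propagate human bias when processing clinical text, in particular the stigmatizing language common in clinical notes. A systematic evaluation of nine frontier LLMs across four stigmatized medical conditions finds substantial bias in every model, skewing clinical decisions toward less aggressive patient management.

Details

Motivation: LLMs are increasingly deployed in high-stakes domains such as clinical decision support and medical documentation, yet their robustness to subtle linguistic variation, such as the stigmatizing language common in clinical notes, remains under-explored.

Result: All evaluated models show substantial bias, with clinical decisions significantly skewed toward less aggressive patient management. The models are highly sensitive to linguistic framing: a single stigmatizing sentence is enough to change model outputs, and there is a clear dose-response relationship.

Insight: The study exposes a critical vulnerability of current LLMs regarding fairness and robustness in clinical NLP. Standard prompt-based mitigation strategies (such as chain-of-thought reasoning and model self-debiasing) have limited effect, underscoring the need for rigorous algorithmic guardrails to prevent the automation of health disparities.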

Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as clinical decision support and medical documentation. However, the robustness of these models against subtle linguistic variations, specifically stigmatizing language (SL) commonly found in human-authored clinical notes, remains critically under-explored. In this work, we investigate whether frontier LLMs inherit and propagate this human bias when processing clinical text. We systematically evaluate nine frontier LLMs across four stigmatized medical conditions, utilizing clinical vignettes injected with varying intensities and phenotypes of SL (doubt, blame, and maligning). Our results demonstrate that all evaluated models exhibit substantial bias, with clinical decision-making significantly skewed towards less aggressive patient management. Notably, we observe a high sensitivity to linguistic framing, where a single SL sentence is sufficient to alter model outputs, revealing a clear dose-response relationship. Furthermore, we evaluate standard prompt-based mitigation strategies, including Chain-of-Thought (CoT) reasoning and model self-debiasing. These approaches show limited efficacy; models struggle to explicitly identify SL while remaining implicitly influenced by it. Our findings expose a critical vulnerability in current LLMs regarding fairness and robustness in clinical NLP, underscoring the need for rigorous algorithmic guardrails to prevent the automation of health disparities.


[16] Taming "Zombie" Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution cs.CLPDF

Taolin Zhang, Pukun Zhao, Qizhou Chen, Jiuheng Wan, Chen Chen

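TL;DR: This paper proposes AgentRevive, a framework that addresses the risk of hard pruning prematurely discarding valuable agents in LLM-based multi-agent systems. Using a Markov state-aware mechanism, it divides agent states into Active, Standby, and Terminated, and applies soft state transitions and state-based edge optimization to improve system resilience and reduce token consumption.

Details

Motivation: Existing methods improve multi-agent efficiency through aggressive graph evolution (e.g., node or edge pruning), but may prematurely discard valuable agents because of transient issues such as hallucinations or temporary knowledge gaps, overlooking the potential of these "zombie" agents to recover and contribute in later rounds.

Result: Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that the method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.

Insight: The novelty lies in a Markov state-aware framework that manages agent collaboration dynamically through soft state transitions (state-aware policy learning and state-aware edge optimization), avoiding hard pruning and thereby improving system resilience and efficiency.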

Abstract: Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for "zombie" agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into "Active", "Standby", and "Terminated" states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing "Terminated" nodes and retaining "Standby" nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.
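
The three-state scheme described in this abstract can be illustrated with a small state machine. This is only a sketch under stated assumptions, not AgentRevive's learned policy: the risk thresholds, the `next_state` and `prune_edges` helpers, and the rule that a Standby agent sends no messages are all hypothetical stand-ins.

```python
from enum import Enum


class AgentState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    TERMINATED = "terminated"


# Hypothetical thresholds; the abstract gives no concrete values.
STANDBY_RISK = 0.5
TERMINATE_RISK = 0.9


def next_state(state: AgentState, risk: float) -> AgentState:
    """Soft transition driven by an estimated hallucination risk in [0, 1]."""
    if state is AgentState.TERMINATED:
        return state  # terminated agents are removed permanently
    if risk >= TERMINATE_RISK:
        return AgentState.TERMINATED
    if risk >= STANDBY_RISK:
        return AgentState.STANDBY  # silenced this round, kept for later rounds
    return AgentState.ACTIVE  # a standby agent can recover


def prune_edges(edges, states):
    """Drop edges touching terminated agents and outgoing edges of standby agents."""
    kept = []
    for src, dst in edges:
        if AgentState.TERMINATED in (states[src], states[dst]):
            continue
        if states[src] is AgentState.STANDBY:
            continue  # assumption: standby agents listen but do not speak
        kept.append((src, dst))
    return kept
```

The point of the sketch is the contrast with hard pruning: an agent with moderate risk moves to Standby and can return to Active in a later round, while only high-risk agents are removed for good.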


[17] Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains cs.CLPDF

Taolin Zhang, Zijie Zhou, Jiuheng Wan, Tingyuan Hu, Chengyu Wang

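TL;DR: This paper proposes TopoPrior, a framework that addresses the high online search overhead, high inference-time token consumption, and limited cross-domain scalability of topology construction in multi-agent LLM collaboration. It learns transferable topology priors offline and uses them to generate query-conditioned initial collaboration graphs for downstream tasks, amortizing search cost while remaining compatible with existing topology-evolution backbones.

Details

Motivation: For complex reasoning tasks, existing LLM-based multi-agent systems typically construct or optimize a collaboration topology from scratch for each query, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in cross-domain settings.

Result: On multi-domain reasoning benchmarks, TopoPrior consistently improves several heterogeneous topology-evolution backbones with only a modest number of additional trainable parameters, while significantly reducing token usage during online inference.

Insight: The novelty lies in shifting part of topology search from per-query online optimization to offline prior learning. A conditional variational graph framework captures reusable structural regularities across domains, and adversarial alignment reduces unnecessary domain discrepancy while preserving query-relevant structural variation, enabling efficient and lightweight cross-domain multi-agent LLM collaboration.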

Abstract: Large language model (LLM)-based multi-agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology-evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in multi-domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query-conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per-query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology-evolution backbones. Technically, TopoPrior contains two key components. First, a transferable topology prior learning module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a query-conditioned latent adaptation module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query-relevant structural variation. Experiments on multi-domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology-evolution backbones while reducing online inference-time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.


[18] AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering cs.CLPDF

Taolin Zhang, Dongyang Li, Chen Chen, Qizhou Chen, Jiuheng Wan

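TL;DR: This paper proposes AMATA (Adaptive Multi-Agent Trajectory Alignment), a framework that dynamically integrates external knowledge to improve factual consistency and interpretability in knowledge-intensive question answering. Six specialized agents reason collaboratively; the collaboration between agents and external tools is formalized as a trajectory preference alignment problem, with two new techniques: intra-trajectory preference learning and inter-agent dependency learning.

Details

Motivation: To address hallucinations and the limits of LLMs in bridging long-tail knowledge gaps in knowledge-intensive question answering, so that responses are better grounded in facts.

Result: On five established knowledge-intensive QA benchmarks, AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems, while also reducing token consumption.

Insight: The novelty lies in formalizing multi-agent collaboration as a trajectory preference alignment problem and in proposing objective-oriented intra-trajectory preference learning together with inter-agent dependency learning based on dependency-aware direct preference optimization, which prioritize critical agents and capture cross-agent tool dependencies.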

Abstract: Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.


[19] BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering cs.CLPDF

Chang Zong, Hao Ning, Siliang Tang, Jie Huang, Jian Wan

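TL;DR: BELIEF is a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. It converts retrieved literature into structured evidence objects that record clinical attributes, source quality, question relevance, support strength, and the candidate hypothesis, and uses two complementary paths, symbolic and neural reasoning, for evidence fusion and decision-making.

Details

Motivation: Retrieved literature in biomedical QA is uneven in relevance, quality, and support for candidate answers, and existing retrieval-augmented LLM methods treat it as flat text, leaving evidence reliability and uncertainty implicit.

Result: In experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones, BELIEF obtains the best result in 25 of 30 backbone-dataset-metric settings. Compared with biomedical-domain models, it is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA.

Insight: The core novelty is converting unstructured retrieved text into structured evidence objects and combining symbolic uncertainty reasoning based on Dempster-Shafer theory with the neural semantic reasoning of LLMs. Explicitly modeling evidence structure, path disagreement, and decision uncertainty improves the utilization of retrieved evidence.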

Abstract: Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster–Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone–dataset–metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.
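
The symbolic path rests on Dempster-Shafer theory, so a compact sketch of the classical machinery may help. The code below shows Shafer's reliability discounting and Dempster's rule of combination over a small answer space; it illustrates the textbook operations, not BELIEF's actual implementation, and the function names and the yes/no/maybe frame are assumptions made for the example.

```python
from itertools import product


def discount(masses, reliability, frame):
    """Shafer discounting: move (1 - reliability) of the mass onto the whole frame."""
    out = {h: reliability * w for h, w in masses.items() if h != frame}
    out[frame] = 1.0 - reliability + reliability * masses.get(frame, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset hypotheses."""
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise by the non-conflicting mass.
    return {h: w / (1.0 - conflict) for h, w in fused.items()}


def belief(masses, hypothesis):
    """Belief in a hypothesis: total mass of all subsets contained in it."""
    return sum(w for h, w in masses.items() if h <= hypothesis)


# Hypothetical closed answer space and two pieces of evidence.
FRAME = frozenset({"yes", "no", "maybe"})
YES, NO = frozenset({"yes"}), frozenset({"no"})
evidence_1 = discount({YES: 1.0}, reliability=0.6, frame=FRAME)
evidence_2 = {YES: 0.5, NO: 0.3, FRAME: 0.2}
fused = combine(evidence_1, evidence_2)
```

The mass left on the full frame after fusion is the residual uncertainty: the share of evidence that commits to no specific answer.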


[20] NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation cs.CL | cs.IRPDF

Joy Bose

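TL;DR: This paper presents NewsLens, a multi-agent adversarial framework for structured analysis and navigation of news bias. Five collaborating agents (Fact Verifier, Progressive Framing Analyst, Conservative Framing Analyst, Propaganda Detector, and Neutral Summarizer) deconstruct articles into interpretable framing maps that expose ideological omissions, rhetorical manipulation, and framing boundaries. The system is evaluated on 15 articles across four geopolitical event clusters, with cross-model validation using Qwen2.5-3B-Instruct and Mistral 7B, and reports metrics such as the Perspective Divergence Score and the Manipulation Index.

Details

Motivation: Media bias detection has mostly been framed as a classification task (e.g., assigning a political label to an article or outlet). The author argues that this framing is too shallow: it identifies that bias exists but not where or how it occurs, or what is structurally omitted. A deeper, structured analysis is therefore needed.

Result: The system is evaluated with Qwen2.5-3B-Instruct (4-bit quantized) on 15 articles across four geopolitical event clusters (India-Pakistan Kashmir, Gaza, Climate Policy, Ukraine), with cross-model validation using Mistral 7B on the Kashmir cluster. Center outlets show the highest mean Perspective Divergence Score (PDS: Qwen 0.907, Mistral 0.729 on the Kashmir subset), while conservative-framing outlets show the highest mean Manipulation Index (MI: 0.600 for both models). Cross-model comparison shows high consistency for high-propaganda content (e.g., Republic World: delta-PDS=0.125, MI=0.8) and greater variance for nuanced reporting. Because of the limited sample size (n=15), Mann-Whitney U tests find no statistically significant between-group differences.

Insight: The novelty lies in extending media bias analysis from simple classification to a structured, multi-agent adversarial reasoning framework (NewsLens) that deconstructs an article's framing and exposes omissions and manipulation. Viewed objectively, it extends prior lexical-geometric bias work to agentic LLM reasoning and is implemented entirely with open-weight models, requiring no API keys, which supports reproducibility. A partial ablation (removing the Propaganda Detector) also shows the importance of individual components.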

Abstract: Media bias detection has predominantly been framed as a classification task: assign a political label to an article or outlet. We argue this framing is too shallow: it identifies that bias exists but not where, how, or crucially, what is structurally omitted. We present NewsLens, a five-agent adversarial pipeline for structured news bias navigation. A Fact Verifier, Progressive Framing Analyst, Conservative Framing Analyst, Propaganda Detector, and Neutral Summarizer collaborate to deconstruct articles into interpretable framing maps, exposing ideological omissions, rhetorical manipulation, and framing boundaries. The system is evaluated on 15 articles across four geopolitical event clusters (India-Pakistan Kashmir, Gaza, Climate Policy, Ukraine) using Qwen2.5-3B-Instruct (4-bit quantised, Google Colab T4), with cross-model validation using Mistral 7B on the Kashmir cluster. Center outlets show the highest mean Perspective Divergence Score (PDS: Qwen 0.907, Mistral 0.729 on Kashmir subset); conservative-framing outlets show the highest mean Manipulation Index (MI: 0.600 across both models). Cross-model comparison shows high consistency for high-propaganda content (Republic World delta-PDS=0.125, MI=0.8 both models) and greater variance for nuanced reporting. Mann-Whitney U tests find no statistically significant between-group differences at n=15, reported honestly as a sample-size limitation confirmed by post-hoc power analysis. A partial ablation removing the Propaganda Detector shows degraded omission precision in the Neutral Summarizer output. The architecture extends prior lexical-geometric bias work to agentic LLM reasoning, and is fully reproducible using open-weight models without API keys.


[21] Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models cs.CLPDF

Dehai Min, Giovanni Vaccarino, Huiyi Chen, Yongliang Wu, Gal Yona

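TL;DR: This paper proposes PUMA, a framework that achieves semantic-preserving early exit by detecting semantic redundancy in the reasoning process, reducing the reasoning steps and latency of large reasoning models (LRMs). It combines a lightweight redundancy detector with answer-level verification, substantially cutting reasoning tokens while preserving answer accuracy and the coherence of the reasoning chain.

Details

Motivation: Existing early-exit methods rely mainly on answer-level signals (such as confidence or answer consistency), which can trigger an exit before reasoning is complete or before self-correction, harming accuracy and semantic completeness. This paper uses reasoning-level semantic redundancy as a complementary signal to identify when reasoning has converged.

Result: Across five LRMs and five challenging reasoning benchmarks, PUMA reduces token usage by 26.2% on average while preserving accuracy and the quality of the retained chain of thought. Experiments on code generation, zero-shot vision-language reasoning, and other tasks further support its robustness and transferability.

Insight: The novelty lies in using reasoning-level semantic redundancy as an early-exit signal and pairing redundancy detection with verification to achieve efficient, semantic-preserving reasoning. The method is plug-and-play, transferable, and learnable, offering a new perspective on reasoning efficiency.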

Abstract: Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at https://github.com/giovanni-vaccarino/PUMA.
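
The detect-then-verify idea in this abstract can be sketched with a toy stopping rule. This is not PUMA's detector: redundancy is approximated here by the fraction of previously unseen tokens in each step, and verification by a check that the trial answer has not changed. The `novelty` and `exit_index` helpers and both thresholds are hypothetical.

```python
def novelty(step: str, seen: set) -> float:
    """Fraction of a step's tokens that no earlier step has used."""
    tokens = set(step.lower().split())
    if not tokens:
        return 0.0
    return len(tokens - seen) / len(tokens)


def exit_index(steps, answers, min_novelty=0.2, patience=2):
    """Return the index of the last reasoning step to keep.

    steps: reasoning steps in order; answers: trial answer after each step.
    """
    seen = set()
    redundant_run = 0
    for i, step in enumerate(steps):
        if novelty(step, seen) < min_novelty:
            redundant_run += 1  # this step mostly revisits earlier content
        else:
            redundant_run = 0
        seen |= set(step.lower().split())
        # Verification stand-in: only stop if the trial answer is unchanged.
        stable = i > 0 and answers[i] == answers[i - 1]
        if redundant_run >= patience and stable:
            return i - patience  # keep the prefix before the redundant run
    return len(steps) - 1  # no safe exit found; keep everything
```

Requiring both signals is the design point: a redundant stretch alone does not stop generation if the answer is still changing, and a stable answer alone does not stop it while new content is still arriving.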


[22] Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification cs.CL | cs.AIPDF

M. Mikail Demir, M. Abdullah Canbaz

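TL;DR: This paper proposes a more robust evaluation framework for classifying negative treatment in legal precedent. The authors build an expert-annotated dataset of 239 real-world legal citations and introduce a novel Average Severity Error metric to measure the practical impact of classification errors. Benchmarking several modern large language models, they find that Gemini 2.5 Flash achieves the highest accuracy on the high-level classification task, while GPT-5-mini performs best on the more complex fine-grained schema.

Details

Motivation: Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task in which misclassification carries significant risk. Standard accuracy falls short, so a more robust evaluation framework is needed to meet the practical demands of this complex legal reasoning task.

Result: On the new expert-annotated dataset, Gemini 2.5 Flash achieves the highest accuracy on the high-level classification task (79.1%), while GPT-5-mini is the top performer on the more complex fine-grained schema (67.7%).

Insight: The novelty lies in an evaluation framework tailored to legal reasoning, comprising a new context-rich dataset and an Average Severity Error metric that better measures the practical impact of errors, establishing a crucial baseline for the field.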

Abstract: Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google’s Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI’s GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.


[23] SocialMemBench: Are AI Memory Systems Ready for Social Group Settings? cs.CL | cs.AIPDF

Olukunle Owolabi

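TL;DR: This paper introduces SocialMemBench, a benchmark for evaluating AI assistant memory systems in multi-party social group settings. It comprises human-verified synthetic social networks across five archetypes and three group-size tiers, with many conversation turns and QA pairs, and tests nine key capabilities of memory systems in social settings. The evaluation finds that existing open-source memory frameworks perform far below a full-context retrieval baseline, indicating a significant gap in handling social group memory.

Details

Motivation: Current AI assistant memory systems are designed for single-user dialogue and perform poorly in multi-party social group settings. As social assistants and proactive personal assistants develop, memory systems must handle shared history with social context, separate group norms from individual exceptions, and attribute correctly after membership changes, and existing benchmarks do not cover these needs.

Result: On small networks, Gemini 2.5 Flash with full conversation context scores only 0.721, well below the blind-critic reasoning-model mean of 0.98, indicating that the benchmark is challenging. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range, significantly below the uncompressed retrieval baseline (0.345) and the matched-answerer full-context baseline (GPT-4o-mini, 0.369).

Insight: The novelty lies in building the first benchmark for multi-party social group memory. Its nine question categories systematically isolate and test key capabilities of memory architectures, and it defines five testable failure modes, providing a tool and direction for evaluating and improving AI memory systems in complex social settings.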

Abstract: Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal-assistant agents whose holistic model of a user must include their social context. Existing memory benchmarks evaluate dyadic or workplace dialogue; none targets multi-party social groups, where memory must anchor facts in shared history rather than professional roles, separate group norms from individual exceptions, and correctly attribute even after member departure. We introduce SocialMemBench, a benchmark of human-verified synthetic social group networks across five archetypes (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), with 430 personas and 7,355 conversation turns, yielding 1,031 QA pairs across nine question categories. Each category isolates an architectural capability, and the five failure modes (single-stream conflation, temporal-state overwrite, entity merging at scale, missing cross-persona knowledge, norm-individual conflation) are testable hypotheses; our two research probes, Subject-Mem and SMG, provide evidence on two, while three remain open. A full-context Gemini 2.5 Flash reference reaches only 0.721 against a blind-critic reasoning-model mean of 0.98 on small networks, indicating the benchmark is genuinely difficult even with complete access to the conversation. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range with overlapping 95% CIs, well below an uncompressed retrieval reference of 0.345 and a matched-answerer full-context reference of 0.369 (GPT-4o-mini). Current memory systems show a measurable gap.


[24] Generating Pretraining Tokens from Organic Data for Data-Bound Scaling cs.CL | cs.AI | cs.LGPDF

Zichun Yu, Chenyan Xiong

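TL;DR: This paper proposes SynPro, a framework that generates diverse synthetic data from limited organic data through two operations, rephrasing and reformatting, to help large language models learn more thoroughly. The generators are optimized with reinforcement learning and continuously updated as training plateaus, improving model performance under data constraints.

Details

Motivation: As LLM pretraining shifts from compute-bound to data-bound, available human text falls far short of scaling demands, yet existing methods do not fully exploit the organic corpus.

Result: With 10% of the Chinchilla-optimal tokens, SynPro unlocks 3.7x and 5.2x the effective tokens for the 400M and 1.1B models, respectively, even surpassing the non-data-bound oracle trained on equivalent unique data at the 1.1B scale.

Insight: The novelty lies in generating diverse data through faithful, model-aware synthesis operations (rephrasing and reformatting) that introduce no external information, with generation quality optimized by reinforcement learning, so that model capability keeps scaling under data constraints without distribution collapse.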

Abstract: LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.
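
The token budgets quoted in the abstract can be reproduced with a short calculation. The sketch assumes the commonly cited rule of thumb of roughly 20 training tokens per parameter for Chinchilla-optimal training; the abstract does not state the ratio it uses, so `TOKENS_PER_PARAM` and the helper name are assumptions.

```python
# Assumed rule of thumb: about 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20


def data_bound_tokens(n_params: float, fraction: float = 0.10) -> float:
    """Organic-token budget when only a fraction of the optimal tokens is available."""
    return n_params * TOKENS_PER_PARAM * fraction


for n_params in (400e6, 1.1e9):
    budget = data_bound_tokens(n_params)
    print(f"{n_params / 1e9:.1f}B params -> {budget / 1e9:.1f}B tokens")
```

Under that assumption the two budgets come out at 0.8B and 2.2B tokens, matching the figures given for the 400M and 1.1B models.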


[25] A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration cs.CLPDF

Hayden Moore, Suman Saha, Mahfuza Farooque

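TL;DR: This paper presents a pilot benchmark dataset for translating natural language into first-order logic (FOL) in the domain of planetary exploration. The dataset is built from real mission documentation in NASA's Planetary Data System and manually annotated, providing a foundation for research at the intersection of language understanding and formal reasoning.

Details

Motivation: In future planetary exploration, autonomous agents must operate under severe communication constraints, which requires them not only to perceive and act but also to reason over mission objectives, operational constraints, and environmental changes. However, translating high-level mission knowledge into structured, machine-interpretable representations remains underexplored.

Result: The paper builds and releases a pilot benchmark dataset derived from real NASA mission documentation from 2003 to 2013, covering phases such as launch, boost, cruise, and orbital operations, with corresponding FOL annotations, structured predicate vocabularies, and typed constants.

Insight: The novelty lies in building the first NL-to-FOL translation benchmark in the safety-critical domain of planetary exploration, pairing real mission documentation with formal logic representations to provide a controlled experimental basis for work at the intersection of language understanding and formal reasoning.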

Abstract: Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA’s Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json


[26] Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA cs.CL | cs.AIPDF

Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu, Wantong Huo

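TL;DR: This study evaluates the effectiveness of the LLMLingua-2 prompt compression method on diffusion large language models (DLLMs), specifically the 8B-parameter LLaDA. Tests on GSM8K, DUC2004, and ShareGPT covering mathematical reasoning, prompt reconstruction, and summarization show that compression designed for autoregressive models preserves semantics well in diffusion models, but downstream performance is unstable, with mathematical reasoning degrading substantially.

Details

Motivation: Existing evaluations of prompt compression focus on autoregressive architectures. This study investigates whether such methods transfer effectively to diffusion large language models (DLLMs) to reduce their inference cost and context length.

Result: At a compression ratio of about 2x, summarization remains relatively robust, but mathematical reasoning degrades sharply despite high semantic similarity scores (e.g., BLEU, ROUGE). BERTScore recall is consistently lower than precision, indicating that compression failures stem mainly from information omission rather than semantic drift.

Insight: The novelty lies in the first systematic evaluation of how prompt compression transfers to diffusion large language models, revealing a disconnect between semantic preservation and downstream stability and motivating diffusion-aware compression strategies. Viewed objectively, it also highlights the importance of designing compression methods for specific model architectures.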

Abstract: Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression transfers effectively to diffusion large language models (DLLMs) using LLMLingua-2, specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2× compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.
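
Why recall below precision points to omission rather than drift can be seen with a simplified token-overlap analogue. The sketch is not BERTScore, which matches contextual embeddings; it uses exact token matching, and the example prompts are invented for illustration.

```python
from collections import Counter


def token_precision_recall(candidate: str, reference: str):
    """Exact-match token precision and recall of a candidate against a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical original prompt and a compressed version that only drops tokens.
original = "Tom has 3 apples and buys 4 more apples each day for 5 days"
compressed = "Tom 3 apples buys 4 more"
precision, recall = token_precision_recall(compressed, original)
```

Everything the compressed prompt keeps is faithful to the original, so precision stays high, while the dropped "each day for 5 days" lowers recall. In this invented example, that dropped span is exactly the reasoning-critical information a math problem needs.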


[27] AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code cs.CLPDF

Shangzhan Li, Xinyu Yin, Xuanyu Jin, Ye He, Yuxin Zhou

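TL;DR: This paper proposes AutoVecCoder, a framework for strengthening the ability of large language models (LLMs) to generate explicitly vectorized code. VecPrompt automatically synthesizes domain-specific data to inject knowledge of intrinsics, and VecRL, a reinforcement learning framework, aligns code generation with execution efficiency. The resulting AutoVecCoder-8B model achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench, and in some cases its code surpasses standard -O3 optimization.

Details

Motivation: Compiler auto-vectorization often yields suboptimal results because of conservative static analysis, and LLMs struggle to generate explicitly vectorized code because high-quality corpora are scarce and low-level hardware instructions impose strict semantic constraints.

Result: The model achieves state-of-the-art (SOTA) performance on the SSE and AVX subsets of SimdBench, and its generated implementations sometimes surpass standard -O3 compiler optimization.

Insight: The novelty lies in combining automated data synthesis (VecPrompt) to inject domain knowledge with reinforcement learning (VecRL) to align code generation with execution efficiency, overcoming bottlenecks of traditional auto-vectorization and suggesting a route to LLM-driven hardware-level optimization.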

Abstract: Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.


[28] PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows cs.CL | cs.AI | cs.HC | cs.SEPDF

Kazuki Kawamura, Satoshi Waki, Kei Tateno

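TL;DR: This paper presents PROTEA, a unified framework for offline evaluation and iterative refinement of multi-agent LLM workflows. The system executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays node states and rationales on the workflow graph to localize bottlenecks. For complex systems supervised mainly by final answers, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context and compares them with observed outputs. It also proposes targeted prompt revisions and automatically reruns and re-evaluates the workflow to show the effect.

Details

Motivation: Multi-agent LLM workflows often outperform single-prompt baselines but are hard to debug and refine, because subtle errors in intermediate outputs propagate to downstream nodes, forcing developers to inspect long traces and infer which agent to modify.

Result: In two production-adjacent workflows, PROTEA improves document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued its graph-level localization, per-node rationales, and editable before/after prompt revisions.

Insight: The novelty lies in a unified interface for offline, test-driven improvement of multi-agent workflows, in particular backward node evaluation, which derives node-level expectations from final-answer supervision and thus enables fine-grained debugging and iterative prompt refinement without intermediate annotations.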

Abstract: Multi-agent LLM workflows – systems composed of multiple role-specific LLM calls – often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.
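
For readers unfamiliar with the Hit@5 figure reported for the recommendation workflow, the metric can be sketched as follows. This is the usual definition (the share of queries with at least one relevant item in the top k); the abstract does not spell out its exact evaluation protocol, and the function name and sample data are illustrative.

```python
def hit_at_k(ranked_lists, relevant_sets, k=5):
    """Fraction of queries whose top-k ranking contains at least one relevant item."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_lists)


# Hypothetical rankings for three queries and their relevant items.
rankings = [
    ["a", "b", "c", "d", "e", "f"],
    ["x", "y", "z"],
    ["p", "q", "r", "s", "t", "u"],
]
relevant = [{"e"}, {"w"}, {"u"}]
score = hit_at_k(rankings, relevant)
```

Only the first query counts as a hit here: its relevant item sits at rank 5, whereas the third query's relevant item is at rank 6, just outside the cutoff.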


[29] How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking cs.CL | cs.CVPDF

Rafid Ahmed, Intesar Tahmid, Mir Sazzat Hossain, Tasnimul Hossain Tomal, Md Fahim

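TL;DR: To address the lack of a benchmark for Bangla medical visual question answering (MedVQA), this paper introduces BanglaMedVQA, the first clinically validated Bangla MedVQA dataset, and comprehensively evaluates current foundation models (such as Gemini, GPT-4.1 mini, and Gemma-3) on it. Existing models perform far worse on Bangla MedVQA than on English benchmarks, with severe weaknesses in fine-grained medical reasoning.

Details

Motivation: Bangla is one of the most widely spoken languages in the world, yet it lacks a standardized MedVQA benchmark, which hinders the evaluation and development of models for the language. This paper aims to fill that gap.

Result: On BanglaMedVQA, even top models such as Gemini and GPT-4.1 mini perform poorly on specialized diagnostic questions, with low accuracy. Some open-source models (such as Gemma-3) occasionally do better in general categories but struggle equally with clinically complex questions. Overall performance is far below English MedVQA benchmarks, highlighting the challenges of low-resource languages.

Insight: The main contribution is the first clinically validated Bangla MedVQA dataset, a key resource for evaluating medical AI in low-resource languages. Viewed objectively, the study exposes significant limitations of current large models in cross-lingual, fine-grained medical reasoning and stresses the need for more specialized evaluation methods and for models optimized for low-resource languages.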

Abstract: Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation methods.


[30] Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains cs.CLPDF

Zhonghang Yuan, Zhefan Wang, Fang Hu, Zihong Chen, Jinzhe Li

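TL;DR: This paper proposes Knowledge-to-Verification (K2V), a framework that extends reinforcement learning with verifiable rewards (RLVR) to knowledge-intensive domains. It addresses the scarcity of high-quality data through automated synthesis of verifiable data and adds verification of the reasoning process to overcome the limitation of checking only final-answer correctness. Experiments show that K2V improves LLM reasoning in knowledge-intensive domains without significantly harming general capabilities.

Details

Motivation: RLVR has shown promise for enhancing LLM reasoning in domains such as mathematics and coding, but it is under-applied in knowledge-intensive domains, mainly because high-quality verifiable data is scarce. Existing methods also verify only the final answer, which permits flawed reasoning and yields sparse reward signals.

Result: Extensive experiments show that K2V improves LLM reasoning performance in knowledge-intensive domains while leaving general capabilities largely intact. The abstract does not detail specific benchmarks or quantitative results.

Insight: The novelty lies in combining automated data synthesis with reasoning-process verification to extend RLVR to knowledge-intensive domains. Viewed objectively, the approach offers a new way to tackle data scarcity and incomplete reasoning verification, emphasizes the importance of process verification, and is a promising direction for strengthening models across broader domains.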

Abstract: Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematics and coding. However, its application to knowledge-intensive domains has not been effectively explored due to the scarcity of high-quality verifiable data. Furthermore, current RLVR focuses solely on the correctness of final answers, leading to flawed reasoning and sparse reward signals. In this work, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, while enabling verification of the LLM's reasoning process. Extensive experiments demonstrate that K2V enhances the reasoning of LLMs in knowledge-intensive domains without significantly compromising the model's general capabilities. This study also suggests that integrating automated data synthesis with reasoning verification is a promising direction to enhance model capabilities in these broader domains. Code is available at https://github.com/SeedScientist/K2V.


[31] EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective cs.CL | cs.AI | cs.LGPDF

Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li

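TL;DR: This paper introduces EvoMemBench, a benchmark that evaluates the memory of LLM agents from a self-evolving perspective. It is organized along two axes, memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented), and compares 15 representative memory methods against long-context baselines.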

Details

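Motivation: Existing benchmarks mainly evaluate agents' reasoning, planning, and execution. Memory is essential for agents to store, update, and retrieve information, yet there is no systematic way to evaluate it.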

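Result: Current memory systems are far from a general solution: long-context baselines remain highly competitive; memory helps most when the current context is insufficient or tasks are difficult; and no single memory form performs consistently across all settings. Retrieval-based methods are strong in knowledge-intensive settings, while procedural and long-term memory methods are more effective on execution-oriented tasks when the stored experience matches the task structure.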

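Insight: The novelty is a unified memory evaluation framework (EvoMemBench) built from a self-evolving perspective. Its core insight is that the effectiveness of memory mechanisms depends heavily on task type and context sufficiency, which points to the need for more adaptive, general-purpose memory systems.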

Abstract: Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.


[32] Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs cs.CLPDF

Tara Azin, Yongan Yu, Raj Singh, Olessia Jouravlev

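TL;DR: By comparing humans and large language models on presupposition projection in conditionals, this paper finds that humans integrate probabilistic and pragmatic cues in their judgments, whereas LLMs diverge from human patterns. The models that best match human ratings often lack coherent pragmatic reasoning, suggesting that LLM performance on such tasks may stem from surface pattern matching rather than genuine pragmatic competence.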

Details

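Motivation: Presupposition projection in conditionals is a central problem in semantic and pragmatic theory, yet it has not been adequately evaluated in large language models. This paper fills that gap by comparing human judgments with LLM predictions to probe LLMs' pragmatic reasoning.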

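Result: On a normed dataset that controls the relation between the antecedent and the projected presupposition, human participants and four LLMs gave likelihood ratings under matched contextual conditions. LLM alignment with human patterns varied. A further evaluation of model reasoning, using an LLM-as-a-Judge framework with a linguistically motivated checklist, found that the models best matching human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments.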

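Insight: The novelty is applying a benchmark driven by linguistic theory to human-model comparison. It reveals that LLMs may rely on surface pattern matching rather than deeper reasoning on pragmatic tasks, underscores the importance of theory-grounded evaluation benchmarks, and offers a new perspective for future model evaluation.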

Abstract: Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgment, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe that models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs’ performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.


[33] Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning cs.CLPDF

Li Wang, Xiaohan Wang, Xiaodong Lu, Zipeng Zhang, Jinyang Wu

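TL;DR: This paper proposes IH-GRPO, a new algorithm that addresses the tight coupling between tool invocation and execution in tool-integrated reasoning with large language models. It is the first to decouple tool invocation from execution, introducing delayed execution with explicit control, and it trains an implicitly hierarchical policy using a hierarchical control framework and a derived surrogate loss. On multiple mathematical reasoning benchmarks, it achieves absolute improvements of 1.87% to 2.53% over the strongest baseline across Qwen3 models of different sizes.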

Details

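Motivation: Existing LLM tool-calling methods typically couple tool invocation tightly with immediate execution. Such immediate interaction may disrupt the LLM's reasoning coherence and constrain its expressivity, ultimately degrading reasoning performance. This paper therefore aims to remove the interference that this coupling causes in the reasoning process.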

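Result: On six out-of-domain mathematical reasoning benchmarks, IH-GRPO achieves absolute improvements of 1.87%, 2.16%, and 2.53% over the strongest baseline on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, respectively, and also yields consistent gains in other domains.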

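Insight: The core novelty is that the paper is the first to formalize and address the decoupling of tool invocation from execution during reasoning, and it introduces a delayed-execution mechanism. Methodologically, it proposes a hierarchical control framework and shows through theoretical derivation that an implicitly hierarchical policy can learn behavior equivalent to that of an explicit hierarchical policy, offering a new and more flexible paradigm for optimizing tool-integrated reasoning.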

Abstract: Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightly couple tool invocation with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. In extensive experiments, IH-GRPO achieves absolute improvements of 1.87%, 2.16%, and 2.53% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B across six out-of-domain mathematical reasoning benchmarks over the strongest baseline method, while also yielding consistent performance gains in other domains. Our code is available at https://github.com/Lumina04/IH-GRPO-01.


[34] Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics cs.CL | cs.CRPDF

Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, Sebastian Cygert

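TL;DR: This paper proposes a new framework based on probe trajectories for monitoring the internal reasoning dynamics of Large Reasoning Models (LRMs). By evaluating a probe at each generated token, it constructs the continuous evolution of a concept's probability over the reasoning process and extracts signal-processing features that capture volatility, trend, and steady-state behavior, which helps predict the model's future behavior (such as safety risks or math answers) more effectively.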

Details

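Motivation: Existing monitoring methods based on Chain of Thought (CoT) can be unfaithful to the model's final output, which weakens their reliability as monitoring tools. The study therefore explores whether future behavior can be predicted from the model's hidden representations to provide more reliable monitoring.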

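Result: Validated on four datasets and four reasoning models across safety and mathematics, trajectory features significantly improve the separability of future model states (for example, harmful vs. harmless outputs). Max-pooling achieves up to 95% AUROC, outperforming average-pooling and last-token methods.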

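Insight: The novelty is bringing continuous probe trajectories and signal-processing features to model behavior monitoring, which provides richer dynamic information than a static prediction. Key methodological findings: template-based training data can replace costly dynamically generated responses for probe training, and max-pooling is essential for obtaining stable, high-performing probe trajectories.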

Abstract: Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model’s final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept’s probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
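To make the idea of trajectory features concrete, the following is a minimal Python sketch. The abstract names volatility, trend, and steady-state behavior but does not define them, so the definitions here (mean absolute step change, least-squares slope, mean of the last few readings) and the helper name `trajectory_features` are assumptions for illustration, not the authors' implementation.

```python
def trajectory_features(traj, tail=3):
    """Summarize a probe trajectory: one concept probability per generated token.

    The three feature definitions are illustrative assumptions, not the paper's.
    """
    n = len(traj)
    if n < 2:
        raise ValueError("need at least two probe readings")

    # Volatility: mean absolute change between consecutive tokens.
    steps = [b - a for a, b in zip(traj, traj[1:])]
    volatility = sum(abs(s) for s in steps) / len(steps)

    # Trend: least-squares slope of probability against token index.
    mean_t = (n - 1) / 2
    mean_p = sum(traj) / n
    num = sum((t - mean_t) * (p - mean_p) for t, p in enumerate(traj))
    den = sum((t - mean_t) ** 2 for t in range(n))
    trend = num / den

    # Steady state: mean of the last `tail` readings.
    tail_vals = traj[-tail:]
    steady_state = sum(tail_vals) / len(tail_vals)

    return {"volatility": volatility, "trend": trend, "steady_state": steady_state}


# Two toy trajectories: a transient spike and a steady upward drift.
spike = [0.1, 0.2, 0.9, 0.2, 0.1]
drift = [0.1, 0.3, 0.5, 0.7, 0.9]
spike_features = trajectory_features(spike)
drift_features = trajectory_features(drift)
```

The spike has zero overall trend but high volatility, while the drift has a clear positive slope and low volatility. A single static reading (for example, only the last token) would miss that difference, which is the intuition behind examining the full trajectory.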


[35] STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics cs.CL | cs.AIPDF

Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, Kun Zhan

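TL;DR: This paper introduces STT-Arena, a benchmark for evaluating LLM tool use and adaptive replanning in spatio-temporally dynamic environments. It contains 227 high-quality interactive tasks covering nine spatio-temporal conflict types and four solvability levels. Evaluation shows that even frontier models such as Claude-4.6-Opus score below 40% overall accuracy, revealing the fundamental difficulty of spatio-temporal dynamic reasoning. By analyzing failure trajectories, the authors propose an iterative trajectory refinement technique and combine it with online reinforcement learning to train STT-Agent-4B, which outperforms frontier LLMs on STT-Arena.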

Details

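Motivation: Existing dynamic benchmarks mainly measure whether LLMs can detect temporal changes in a timely manner, overlooking the complementary challenge of adaptive replanning under spatio-temporal dynamics. In the real world, LLM agents often face mid-task disruptions that invalidate prior decisions, so they need to be able to replan and adapt.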

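Result: Extensive evaluation on STT-Arena shows that state-of-the-art proprietary models, including Claude-4.6-Opus, achieve less than 40% overall accuracy. The proposed STT-Agent-4B (combining iterative trajectory refinement with online RL) outperforms frontier LLMs on the benchmark.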

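Insight: The novelty is a more realistic benchmark with executable environments (STT-Arena) that systematically evaluates LLMs' adaptive replanning under spatio-temporal dynamics. Viewed objectively, the analysis of three recurring error modes (Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification) and the use of iterative trajectory refinement to remove these failure patterns from training data offer useful insight and a reusable technical path for improving LLM robustness in dynamic environments.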

Abstract: Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even SOTA proprietary models, including Claude-4.6-Opus, achieve less than 40% overall accuracy, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B, which outperforms frontier LLMs on STT-Arena.


[36] LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems cs.CL | cs.AIPDF

Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin

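TL;DR: This paper introduces LongMINT, a benchmark for evaluating memory-augmented agents in long-horizon settings with multi-target interference. It features long, highly interconnected contexts and diverse domains and question types, designed to mimic real-world environments where information is continually updated and memories interfere with one another. Existing systems (long-context LLMs, RAG, and memory-augmented frameworks) average only 27.9% accuracy on the benchmark and do especially poorly on multi-evidence questions that require aggregated reasoning.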

Details

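Motivation: Existing benchmarks focus on static, independent memory recall and cannot capture the complex interactions that real agents face in long-horizon, evolving environments, where information is updated and memories interfere with each other.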

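Result: Seven representative systems were evaluated on LongMINT (contexts averaging 138.8k tokens, up to 1.8M tokens), with an average accuracy of only 27.9%. Performance is especially low on multi-evidence aggregated reasoning and degrades as the number of intervening updates increases.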

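Insight: The novelty is a long-horizon memory benchmark that simulates realistic dynamic interference. It reveals fundamental limitations of current memory systems in retrieval, memory construction, and handling earlier facts that are revised or interfered with by later context.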

Abstract: Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce LongMINT (Long-Horizon Memory under INTerference), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, LongMINT has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are later revised or interfered with by subsequent context, with performance degrading as the number of intervening updates increases.


[37] MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion cs.CLPDF

Dingyi Zhang, Ziqing Zhuang, Linhai Zhang, Ziyang Gao, Deyu Zhou

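TL;DR: This paper proposes MA$^{2}$P, a meta-cognitive autonomous agent framework for complex persuasion. It coordinates modules for perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation, together with a meta-cognitive configurator that selects an initial meta-strategy. The goal is to address generic, untargeted responses in persuasive dialogue and the unstable cross-domain performance of large language models. Experiments show a higher persuasion success rate than baselines.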

Details

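Motivation: The paper targets core challenges in complex persuasion: when the persuadee's internal states are not clearly expressed, existing methods struggle to generate targeted, strategy-consistent responses, and LLM performance fluctuates widely across domains because of uneven knowledge coverage and limited reasoning generalization.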

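Result: Experiments show that the method achieves a higher persuasion success rate than baselines, but the abstract does not say which benchmark or dataset was used.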

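Insight: The main novelty is an autonomous multi-agent architecture with a meta-cognitive configurator that selects a meta-strategy from a structured knowledge base up front to guide subsequent reasoning and planning, which helps improve cross-domain adaptability and the targeting of response strategies.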

Abstract: Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee’s internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee’s latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy-consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA$^{2}$P, a meta-cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi-agent architecture that coordinates perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross-domain performance variation, we further design a meta-cognitive configurator that selects an appropriate meta-strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.


[38] Forecasting Downstream Performance of LLMs With Proxy Metrics cs.CL | cs.LGPDF

Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau

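TL;DR: This paper proposes building proxy metrics that forecast the downstream performance of language models by aggregating token-level statistics (such as entropy, top-k accuracy, and expert token rank) from a candidate model's next-token distribution over expert-written solutions. The method outperforms loss- and compute-based baselines in three settings (model selection, pretraining data selection, and training-time forecasting), giving more reliable and efficient forecasts.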

Details

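Motivation: Comparative decisions in language model development (choosing an architecture, a pretraining corpus, or a training recipe) require reliable performance forecasts, but the two commonly used signals are fundamentally limited: cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and uninformative at early training stages.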

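Result: The proxy metrics beat the baselines in all three settings: 1) in cross-family model selection, mean Spearman correlation reaches 0.81 (vs. 0.36 for cross-entropy loss); 2) in pretraining data selection, they reliably rank 25 candidate corpora at roughly one ten-thousandth of the compute cost of direct evaluation, pushing the Pareto frontier beyond existing methods; 3) in training-time forecasting, they extrapolate downstream accuracy across an 18x compute horizon with roughly half the error of existing alternatives.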

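Insight: The core novelty is using expert trajectories (the model's next-token distribution over expert solutions) as the signal source for assessing model capability. The resulting proxy metrics forecast downstream performance more efficiently and reliably across the model development life cycle, offering a low-cost, information-rich way to support development decisions.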

Abstract: Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model’s next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman Rho = 0.81 (vs. Rho = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly $10{,}000\times$ less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an $18\times$ compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
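As an illustration of the token-level statistics the abstract lists, here is a small Python sketch that computes entropy, top-k accuracy, and expert token rank from next-token distributions and averages them over an expert-written solution. The function names, the tie-handling rule for rank, and the simple mean aggregation are assumptions; the abstract does not specify how the authors aggregate these statistics into a proxy.

```python
import math


def token_stats(probs, expert_token, k=5):
    """Statistics for one position of an expert-written solution.

    probs: next-token probabilities over the vocabulary (sums to 1).
    expert_token: vocabulary index of the token the expert actually wrote.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Rank 1 means no other token is more likely than the expert token.
    rank = 1 + sum(1 for p in probs if p > probs[expert_token])
    return {"entropy": entropy, "rank": rank, "top_k_hit": rank <= k}


def proxy_metrics(distributions, expert_tokens, k=5):
    """Average the token-level statistics over the whole solution (assumed aggregation)."""
    stats = [token_stats(p, t, k) for p, t in zip(distributions, expert_tokens)]
    n = len(stats)
    return {
        "mean_entropy": sum(s["entropy"] for s in stats) / n,
        "top_k_accuracy": sum(s["top_k_hit"] for s in stats) / n,
        "mean_expert_rank": sum(s["rank"] for s in stats) / n,
    }


# Toy example: a 4-token vocabulary and a 2-token expert solution.
dists = [[0.7, 0.1, 0.1, 0.1], [0.5, 0.3, 0.15, 0.05]]
expert = [0, 2]
result = proxy_metrics(dists, expert, k=2)
```

In the toy example the expert token is the model's top choice at the first position and only its third choice at the second, so with `k=2` the top-k accuracy is 0.5 and the mean expert rank is 2.0. None of these quantities requires sampling from the model or running a downstream benchmark, which is what makes this kind of proxy cheap to compute.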


[39] Code as Agent Harness cs.CL | cs.AIPDF

Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li

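TL;DR: This paper proposes the unified view of 'code as agent harness', positioning code as the core foundation for building AI agent infrastructure. Around this view, it systematically surveys methods and applications at three layers (the interface between agents and environments, mechanisms for reliable and adaptive execution, and scaling from single-agent to multi-agent systems) and identifies future engineering challenges.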

Details

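Motivation: The motivation is the recognition that, in emerging agentic systems, code has shifted from being only a target output to serving as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. The paper studies this shift systematically through the unified lens of 'code as harness'.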

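Result: This is a survey paper that proposes no new model or method, so it reports no quantitative results on specific benchmarks. Its contribution is a systematic organization and summary of representative methods and practical application areas (such as coding assistants, GUI/OS automation, embodied agents, and scientific discovery).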

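Insight: The core novelty is the unifying conceptual framework of 'code as agent harness', which elevates code to the core infrastructure of agent systems. Viewed objectively, the framework gives a clear and coherent roadmap for understanding, designing, and evaluating executable, verifiable, and stateful AI agent systems, and helps integrate scattered research directions.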

Abstract: Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.


[40] EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL cs.CL | cs.LGPDF

Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang

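TL;DR: EnvFactory is an automated framework that addresses two bottlenecks for tool-use agents in reinforcement learning: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. It autonomously explores and verifies stateful, executable tool environments from authentic resources and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement. Experiments show that training data generated from only 85 verified environments substantially improves model performance on multiple benchmarks.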

Details

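Motivation: Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) faces two challenges: a lack of scalable, robust execution environments, and a shortage of realistic training data that captures implicit human reasoning. Existing approaches rely on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Synthetic trajectories are also frequently over-specified, resembling instruction sequences rather than natural human intents, which reduces their effectiveness for RL training.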

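Result: Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Although it uses significantly fewer environments than prior work (which often uses 5 times more), EnvFactory achieves better training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $\tau^2$-Bench and VitaBench.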

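Insight: The novelty is fully automating both executable environment construction and natural trajectory synthesis. Specifically, EnvFactory autonomously explores and verifies stateful environments from authentic resources, and generates grounded multi-turn query trajectories with implicit intents through topology-aware sampling and calibrated refinement. This gives Agentic RL a scalable, extensible, and robust foundation, avoids dependence on external APIs or error-prone simulators, and produces high-quality training data that better matches natural human interaction.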

Abstract: Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which often uses 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $\tau^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.


cs.CV [Back]

[41] How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A cs.CV | cs.AIPDF

YiJie Huang, Yiqun Zhang, Zhuoyue Jia, Xiaocui Yang, Junzhao Huang

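TL;DR: This paper proposes F^3A, a training-free router for visual token pruning that targets the inference cost caused by long visual token sequences in multimodal language models. Through question-conditioned evidence search, it selects and allocates visual tokens under a fixed visual token budget, with no additional training and no extra LLM forward pass.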

Details

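Motivation: As multimodal models scale up, long visual token sequences incur high inference cost, which raises the question of how many visual tokens are needed and how to allocate them under a fixed visual token budget. Existing training-free pruning methods typically rely on one-shot proxy metrics, but the authors argue that visual token pruning should be treated as task-conditioned evidence search.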

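Result: The abstract gives no specific quantitative results or benchmark comparisons. It states that F^3A requires no model training and no extra LLM forward pass, and that it preserves the original multimodal prompting and decoding pipeline.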

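Insight: The novelty is reframing visual token pruning as task-conditioned evidence search and designing a lightweight routing mechanism, comprising coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions, that allocates tokens efficiently without training.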

Abstract: Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training, no extra LLM forward pass and preserves the original multimodal prompting and decoding pipeline.


[42] Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs cs.CV | cs.AIPDF

Yigui Feng, Qinglin Wang, Yang Liu, Jie Liu

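TL;DR: This paper proposes Fre-Res, a budget-adaptive dual-track video token compression framework that addresses the inherent tension between spatial fidelity and temporal coverage in video multimodal large language models. It separates high-fidelity spatial anchors, which preserve detail, from compact residual-frequency tokens, which capture dynamics. It applies a 1D DCT to inter-frame residual trajectories in vision-latent space to exploit their low-frequency concentration, and a Spatial-Guided Absorber injects temporal residual information into the corresponding spatial anchor tokens.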

Details

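Motivation: Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual detail requires many spatial tokens, while capturing short-lived events requires dense temporal sampling.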

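Result: On fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy-efficiency trade-off, matching or approaching full-token performance while substantially reducing visual token length.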

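Insight: The core novelty is separating video evidence into sparse high-fidelity spatial anchors and a dense temporal residual representation compressed in the frequency domain, with a Spatial-Guided Absorber that aligns the frequency-domain dynamics with native visual embeddings. Viewed objectively, the contribution is exploiting the low-frequency concentration of residual trajectories in vision-latent space for efficient compression, and decoupling spatial from temporal processing through the dual-track design.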

Abstract: Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy–efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
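The temporal 1D-DCT step can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the Fre-Res implementation: it treats a single scalar latent value per frame (real vision latents are high-dimensional), uses an orthonormal DCT-II written out by hand, and the names `dct_ii` and `compress_residual_trajectory` are hypothetical.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a 1D sequence (preserves the sum of squares)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def compress_residual_trajectory(latents, keep):
    """Keep the first `keep` DCT coefficients of the inter-frame residuals.

    latents: one scalar latent value per frame for a single position/channel
    (a simplifying assumption for this sketch).
    """
    residuals = [b - a for a, b in zip(latents, latents[1:])]
    coeffs = dct_ii(residuals)
    return coeffs[:keep], coeffs


# A smoothly varying toy latent over 17 frames gives 16 residuals.
latents = [math.sin(0.2 * t) for t in range(17)]
kept, full = compress_residual_trajectory(latents, keep=4)

# Share of residual energy captured by the 4 lowest-frequency coefficients.
low_freq_share = sum(c * c for c in kept) / sum(c * c for c in full)
```

For a slowly varying signal like this one, most of the residual energy lands in the first few coefficients, so a short prefix of the DCT is a compact summary of the temporal evolution. How strongly real vision latents concentrate at low frequencies is an empirical observation reported by the authors, not something this toy demonstrates.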


[43] StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs cs.CV | cs.AIPDF

Chang Che, Ziqi Wang, Hui Ma, Cheems Wang, Zenglin Shi

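TL;DR: This paper proposes Streaming Continual Visual Instruction Tuning (StrCVIT), a more general and realistic continual learning setting, to address the inability of existing task-incremental visual instruction tuning methods to handle streams of dynamically interleaved task data. The authors propose StrLoRA, a regularized two-stage expert routing framework that reduces cross-task interference through expert selection based on the textual instruction and token-wise expert weighting based on cross-modal attention, and adds a routing-stability regularization to cope with the non-stationary stream.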

Details

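Motivation: Existing continual visual instruction tuning methods are limited to the task-incremental setting, where each training phase corresponds to a single predefined task. This does not match the real world, where data arrives as a stream of dynamically interleaved, continuously evolving tasks.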

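Result: Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods and effectively improves the model's ability to learn from continuously evolving data streams.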

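Insight: The novelty is a more realistic streaming continual visual instruction tuning setting, together with a two-stage expert routing framework that combines task-aware expert selection, token-wise cross-modal expert weighting, and routing-stability regularization to acquire new abilities, reinforce recurring ones, and mitigate forgetting at the same time in a dynamic mixed-task stream.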

Abstract: Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing the model’s ability to learn from continuously evolving data streams.
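The routing-stability idea (aligning the current routing distribution with an exponential moving average of past routing) can be sketched as follows. The abstract does not say which divergence or decay rate StrLoRA uses, so the KL divergence, the `decay=0.9` default, and all function names below are assumptions for illustration only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def update_ema(reference, current, decay=0.9):
    """Move the historical reference a small step toward the current routing."""
    return [decay * r + (1.0 - decay) * c for r, c in zip(reference, current)]


def routing_stability_penalty(current, reference):
    """Penalty that grows as current routing drifts from the EMA reference.

    KL divergence is an assumed choice; the abstract only says the two are aligned.
    """
    return kl_divergence(current, reference)


# Historical EMA routing over three experts, then two candidate routings.
reference = [0.5, 0.3, 0.2]
penalty_same = routing_stability_penalty([0.5, 0.3, 0.2], reference)
penalty_shift = routing_stability_penalty([0.1, 0.1, 0.8], reference)

# After the step, fold the new routing into the reference.
new_reference = update_ema(reference, [0.1, 0.1, 0.8])
```

A routing that matches the reference incurs no penalty, while an abrupt shift toward the third expert is penalized. Because the reference itself is updated slowly, persistent changes in the task mixture can still move the routing over time, which is the usual motivation for regularizing against an EMA rather than a fixed snapshot.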


[44] Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT cs.CV | cs.AI | cs.LGPDF

Zonglin Yang, Xiaolei Diao, Jishizhan Chen, Xiaozhuang Man, Wei Kong

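TL;DR: This paper proposes a cross-source supervision method for bone infection segmentation in dual-modality PET-CT, targeting the segmentation challenges caused by differences between expert annotations and indistinct lesion boundaries. It builds a bimodal end-to-end segmentation framework through early fusion of PET metabolic signals and CT bone-window anatomy, and uses patient-level 3D volumetric evaluation to avoid the performance inflation caused by inter-slice correlation in small datasets.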

Details

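Motivation: Early, accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment, but indistinct lesion boundaries in PET-CT images and inconsistent annotations from different experts or automated systems make precise segmentation challenging.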

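Result: The experiments use rigorous patient-level 3D volumetric evaluation with cross-validation and report patient-level performance variation (mean ± SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix shows quantitatively how the models internalize the different experts' diagnostic philosophies.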

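Insight: The novelty is a decoupled dual-source learning framework in which parallel models are trained on independent expert annotations, driven by high-sensitivity and high-specificity clinical intents respectively. Rather than forcing a single consensus, this provides a robust, diversity-preserving paradigm for clinical AI deployment.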

Abstract: Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation. To mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.


[45] GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning cs.CV | cs.AIPDF

Jinhao Jing, Zheng Ma, Jinwei Liang, Qiannian Zhao, Shawn Chen

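TL;DR: This paper proposes the GeoSym Engine, a scalable neuro-symbolic framework that automatically generates geometric reasoning data with exact symbolic ground truths. Built on it are the GeoSym127K dataset and the GeoSym-Bench evaluation suite. Supervised fine-tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) on this data significantly improve large multimodal models on geometric reasoning tasks.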

Details

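Motivation: The work addresses the poor geometric reasoning of Large Multimodal Models (LMMs), which stems from visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data.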

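Result: In experiments with Qwen3-VL-8B, the model gains an absolute 22.21% on the MathVerse Vision-Only subset and reaches 61.52% on WeMath (a 6.19% improvement), outperforming advanced closed-source models such as Doubao-1.8. Initializing RLVR from structural SFT checkpoints also clearly outperforms zero-shot RL.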

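Insight: The novelty is a scalable, symbolically verifiable synthesis framework (the GeoSym Engine) that derives exact symbolic ground truths from a type-conditional grammar and an analytic symbolic solver and pairs them with a rendering pipeline that produces high-precision geometric diagrams, yielding a large-scale, difficulty-stratified geometric reasoning dataset. The approach demonstrates the scaling potential of verifiable reasoning synthesis for improving models' geometric abilities.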

Abstract: Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.


[46] SwordBench: Evaluating Orthogonality of Steering Image Representations cs.CV | cs.AI | cs.LG

Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek

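TL;DR: This paper introduces SwordBench, a benchmark for evaluating the orthogonality of steering image representations in vision models, particularly in concept removal tasks. It introduces new evaluation notions, cross-concept robustness and collateral damage, to quantify how well steering methods preserve downstream task performance. The study finds that although a linear support vector machine excels in separability and orthogonality, it cannot achieve zero collateral damage, and sparse autoencoders do better in some cases.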

Details

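Motivation: Existing evaluation protocols are largely limited to ambiguous language modeling tasks and lack a comprehensive assessment of steering in vision models, so a new benchmark is needed to systematically evaluate the orthogonality and practicality of steering image representations.

Result: On SwordBench, the linear support vector machine outperforms other methods in separability and orthogonality but fails to achieve zero collateral damage; sparse autoencoders do better at reducing collateral damage. In simpler regimes, neither standard baselines nor optimization-based methods achieve perfect steering.

Insight: The contribution is the SwordBench benchmark and its new evaluation notions (cross-concept robustness and collateral damage), which quantify the second-order effects of steering methods. Viewed objectively, existing methods still struggle to balance orthogonality against downstream task performance, which points to directions for future work.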

Abstract: Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
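
The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names, assuming the standard linear projection for orthogonalizing a representation against a concept activation vector and a plain accuracy difference for collateral damage. The function names `orthogonalize` and `collateral_damage` are hypothetical and are not taken from the SwordBench code.

```python
def orthogonalize(x, concept):
    """Remove from x its component along the concept direction (linear projection)."""
    norm_sq = sum(c * c for c in concept)
    if norm_sq == 0.0:
        return list(x)  # a zero concept vector defines no direction to remove
    scale = sum(a * c for a, c in zip(x, concept)) / norm_sq
    return [a - scale * c for a, c in zip(x, concept)]


def collateral_damage(acc_before, acc_after):
    """Assumed definition: drop in downstream accuracy on inputs lacking the bias."""
    return acc_before - acc_after


# Hypothetical 3-d representation and concept activation vector.
x = [3.0, 4.0, 0.0]
cav = [1.0, 0.0, 0.0]
steered = orthogonalize(x, cav)
residual = sum(a * c for a, c in zip(steered, cav))  # component left along the concept
```

After steering, `residual` is zero, so a linear probe along `cav` can no longer read the concept from `steered`; a nonzero `collateral_damage` would indicate that the same edit also hurt the downstream task.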


[47] A neurosymbolic Approach with Epistemic Deep Learning for Hierarchical Image Classification cs.CV | cs.AI | stat.ML

Ezel Kilicdere, Shireen Kudukkil Manchingal, Fabio Cuzzolin

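TL;DR: This paper proposes a unified framework that combines neurosymbolic and epistemic deep learning for hierarchical image classification. It augments Swin Transformers with focal set reasoning and differentiable fuzzy logic to capture epistemic uncertainty and keep hierarchical predictions logically consistent. Experiments show that the framework maintains accuracy on par with transformer baselines while giving better calibrated, more interpretable predictions, reducing overconfidence and strengthening consistency across hierarchical outputs.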

Details

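Motivation: Deep neural networks reach high accuracy on image classification but often produce overconfident predictions that fail to express epistemic uncertainty and violate logical or structural constraints in the data, a problem that is especially pronounced in hierarchical classification.

Result: In hierarchical image classification experiments, the framework matches the accuracy of transformer baselines while providing better calibrated and more interpretable predictions, reducing overconfidence and enforcing high logical consistency.

Insight: The contribution is the first combination of focal set reasoning with fuzzy logic: data-driven focal sets capture epistemic uncertainty, while fuzzy membership functions and t-norm conjunctions enforce consistency across hierarchical predictions, balancing accuracy with epistemic awareness.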

Abstract: Deep neural networks achieve high accuracy on image classification tasks. Yet they often produce overconfident predictions, which fail to express epistemic uncertainty, and frequently violate logical or structural constraints present in the data. These limitations are particularly pronounced in hierarchical classification, where predictions across fine and coarse levels must remain coherent. We propose, for the first time, a unified neurosymbolic and epistemic modelling framework that augments Swin Transformers with focal set reasoning and differentiable fuzzy logic. Rather than treating labels as isolated categories, our method induces data-driven focal sets within the learnt embedding space, which helps capture epistemic uncertainty over multiple plausible fine-grained classes. These focal sets form the basis of a belief-theoretic layer that uses fuzzy membership functions and t-norm conjunctions to encourage consistency between fine- and coarse-grained predictions. A learnable loss further balances calibration, mass regularisation, and logical consistency, allowing the model to adaptively trade off symbolic structure with data-driven evidence. In experiments on hierarchical image classification, our framework maintains accuracy on par with transformer baselines while providing more calibrated and interpretable predictions, reducing overconfidence and enforcing high logical consistency across hierarchical outputs. Our experimental results show that combining focal set reasoning with fuzzy logic provides a practical step toward deep learning models that are both accurate and epistemically aware.


[48] StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video cs.CV | cs.AI

Ao Li, Zihan Xiao, Zihao Yue, Boshen Xu, Linli Yao

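TL;DR: This paper proposes the StreamPro-Bench benchmark and the StreamPro training framework to address the shift from reactive perception to proactive decision-making in streaming video understanding. The benchmark evaluates models along three dimensions (Perception Understanding, Temporal Reasoning, and Proactive Agency), while the training framework uses the CB-Stream Loss and GRPO to mitigate supervision imbalance and jointly optimize response correctness and timing.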

Details

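Motivation: Most existing streaming video understanding benchmarks follow a “see-then-answer” paradigm in which responses are triggered only after explicit evidence appears, so they cannot evaluate a model's ability to make timely, reliable decisions under incomplete observations. Training proactive models is also hard because of the extreme imbalance between silence and response signals and the need to jointly optimize response correctness and timing.

Result: On the proposed StreamPro-Bench, StreamPro scores 41.5, substantially outperforming the previous best model (10.4); on the real-time streaming benchmark StreamingBench-RTVU, it also performs strongly at 78.9.

Insight: The contributions are a new benchmark that evaluates streaming models from three complementary perspectives (Perception Understanding, Temporal Reasoning, and Proactive Agency) and a two-stage training framework in which the CB-Stream Loss mitigates the severe imbalance during supervised fine-tuning and GRPO, with a multi-grained reward design (turn-level and trajectory-level), optimizes decisions. Together they provide a systematic evaluation and training approach for learning proactive decision-making in streaming video.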

Abstract: Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a “see-then-answer” paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model’s ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model’s ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.
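
As a rough illustration of the GRPO step mentioned above, the sketch below computes group-relative advantages by standardizing rewards within a group of sampled trajectories, which is the usual GRPO formulation. How StreamPro actually mixes turn-level and trajectory-level rewards is not stated in the abstract; `combined_reward` and its `turn_weight` parameter are hypothetical placeholders.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantages)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_reward(turn_rewards, trajectory_reward, turn_weight=0.5):
    """Hypothetical mix of turn-level and trajectory-level rewards (an assumption)."""
    return turn_weight * mean(turn_rewards) + (1.0 - turn_weight) * trajectory_reward


# Four sampled trajectories for the same stream, with made-up rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Trajectories that beat the group average get positive advantages and the rest get negative ones, so no separate value network is needed to form the baseline.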


[49] Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring cs.CV

Jiaqing Zhang, Sandeep Elluri, Bhanu Cherukuvada, Yonah Joffe, Jessica Sena

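TL;DR: This paper studies how multimodal large language models behave when scoring clinical ordinal scales and finds a pronounced central tendency bias: predictions are systematically compressed toward the middle of the scale, which undermines the accuracy of screening decisions for cognitive impairment.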

Details

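Motivation: Multimodal LLMs are increasingly used as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales is poorly understood; their biases need to be understood in depth to ensure reliable deployment.

Result: On Clock Drawing Test datasets, fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52), while zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67). However, all LLM families exhibit a central tendency effect, with larger prediction errors at both ends of the scale.

Insight: The study reveals a systematic scoring bias of LLMs in clinical assessment, extends the LLM-as-a-judge bias literature, and underlines the need for calibration-aware evaluation and post-hoc calibration when deploying LLM-based raters in high-stakes screening workflows.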

Abstract: Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
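
The metrics named in the abstract can be sketched as follows. MAE and within-1 accuracy are standard; the per-score mean signed error is one plausible way to expose endpoint compression and is an assumption about the paper's per-score analysis. The scores below are made-up illustrations, not data from the study.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Share of predictions at most one scale point from the reference."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_signed_error_by_score(y_true, y_pred):
    """Mean of (prediction - reference) for each reference score."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Hypothetical ratings on a 0-5 scale: 0s are rated 1, 5s are rated 4.
y_true = [0, 0, 2, 3, 5, 5]
y_pred = [1, 1, 2, 3, 4, 4]
per_score = mean_signed_error_by_score(y_true, y_pred)
```

In this toy case within-1 accuracy is perfect, yet `per_score` is positive at score 0 and negative at score 5, which is the pattern a tolerance-based metric alone would hide.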


[50] Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning cs.CV | cs.AI | cs.CL

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

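TL;DR: This paper proposes Hilbert-Geo, a neural-symbolic reasoning framework for solving solid geometry problems. The framework includes a unified predicate library and theorem bank and uses a Parse2Reason method that first parses the natural-language description and the visual diagram into a formal conditional description language (CDL), then uses the theorem bank for relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. The method achieves state-of-the-art performance on the self-built SolidFGeo2k and PlaneFGeo3k datasets and on the MathVerse-Solid benchmark.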

Details

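Motivation: Current research on geometric problem solving focuses mainly on plane geometry, while solid geometry problems usually remain hard to solve because of the complexity of 3D spatial diagrams and the difficulty of the reasoning. This paper aims to fill that gap by providing a unified formal language and reasoning framework for solid geometry.

Result: The method reaches 77.3% accuracy on SolidFGeo2k and 84.1% on the MathVerse-Solid subset, substantially outperforming leading MLLMs such as Gemini-2.5-pro and GPT-5. It also reaches a SOTA accuracy of 80.2% on PlaneFGeo3k.

Insight: The core contributions are the first unified formal language framework for solid geometry (Hilbert-Geo), with its accompanying predicate library and theorem bank, and the two-stage Parse2Reason method, which parses multimodal input (text and diagrams) into formal CDL before symbolic reasoning. This offers a new paradigm for interpretable, verifiable solving of complex geometry problems and shows that the approach generalizes to plane geometry.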

Abstract: Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently; however, most works focus on plane geometry and usually fail on solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps: first parsing, then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (one small subset in MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves a SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets will be publicly available.


[51] ChronoSC: Task-Oriented Semantic Communication via Temporal-to-Color Encoding cs.CV

Phuc H. Nguyen, Trung T. Nguyen, Quy N. Duong, Van-Dinh Nguyen

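TL;DR: This paper proposes ChronoSC, a task-oriented semantic communication framework for video question answering. It uses Chrono-Color Stacking to encode a video's temporal dynamics into a single static image, achieving extreme temporal compression, transmits that image with lightweight deep joint source-channel coding, explicitly reconstructs it at the receiver, and finally runs inference with a pre-trained BLIP model.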

Details

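Motivation: Existing semantic communication methods for video mostly focus on pixel-level reconstruction or rely on complex spatiotemporal pipelines, which leads to high bandwidth usage and latency and makes them hard to deploy in resource-constrained settings. This paper aims to design a lightweight, efficient semantic communication scheme for video question answering.

Result: Experiments on the CLEVRER dataset show that ChronoSC achieves up to 192 times bandwidth reduction compared with raw video transmission while maintaining high VideoQA accuracy.

Insight: The core contribution is a lossless temporal-to-color encoding scheme (Chrono-Color Stacking) that compresses video dynamics into a single static image, achieving extreme temporal compression. Using explicit visual reconstruction rather than a latent-space representation allows direct reuse of pre-trained vision-language models such as BLIP, which lowers computational overhead and improves task performance.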

Abstract: Semantic communication (SC) aims to reduce transmission overhead by conveying task-relevant information rather than raw data. However, existing SC approaches for video largely focus on pixel-level reconstruction or rely on complex spatiotemporal pipelines, leading to excessive bandwidth usage and latency that are unsuitable for low-resource deployments. In this paper, we propose ChronoSC, a task-oriented semantic communication framework for Video Question Answering (VideoQA). ChronoSC introduces Chrono-Color Stacking, a lightweight and lossless projection scheme that encodes temporal video dynamics into a single static image, enabling extreme temporal compression before transmission. This compact semantic representation is transmitted using a lightweight Deep Joint Source-Channel Coding (DeepJSCC) transceiver and explicitly reconstructed at the receiver. Unlike latent-space methods, explicit visual reconstruction enables the direct reuse of pre-trained vision-language models; specifically, a pre-trained BLIP model is employed to infer answers from noisy, reconstructed chrono-images. Experiments on the CLEVRER dataset show that ChronoSC achieves up to 192 times bandwidth reduction compared to raw video transmission while maintaining high VideoQA accuracy.


[52] Inducing Spatial Locality in Vision Transformers through the Training Protocol cs.CV | cs.LG | stat.ML

Eduardo Santiago Toledo, Asael Fabian Martínez

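TL;DR: This paper studies how the training protocol, and CutMix data augmentation in particular, can induce spatial locality in the early layers of a Vision Transformer (ViT) trained from scratch, without large-scale pretraining. Comparing a Baseline protocol with a Modern protocol (AutoAugment/ColorJitter, CutMix, and Label Smoothing) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, it finds that the Modern protocol markedly lowers the Mean Attention Distance (MAD) of early attention layers, making attention more local and more concentrated. Ablations show that CutMix is the key factor behind this locality.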

Details

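Motivation: The goal is to determine whether, with the architecture and optimizer held fixed, the training protocol alone (especially data augmentation and regularization) can induce a ViT to learn spatially local features in its early layers, reducing the reliance on large-scale pretraining.

Result: On CIFAR-100, the Modern protocol lowers the minimum MAD from 0.316 (Baseline) to 0.008, indicating that attention becomes extremely local. The ablation study confirms CutMix as the determining component: all conditions with CutMix have a MAD of 0.024, while conditions without CutMix remain at a MAD of 0.210. AutoAugment and Label Smoothing have no independent effect on locality.

Insight: The contribution is showing that the training protocol, and especially CutMix (a data augmentation that requires classifying from partial image regions), can promote the emergence of locality in early ViT attention layers. This suggests a route to more efficient ViT training: specific augmentation strategies can guide the model toward local features and may reduce its dependence on compute-intensive pretraining.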

Abstract: We investigate whether the training protocol can induce spatial locality in the early layers of a Vision Transformer (ViT) trained from scratch, without large-scale pretraining. Keeping the architecture and optimization procedure fixed, we compare a Baseline protocol with a Modern protocol (AutoAugment/ColorJitter, CutMix, and Label Smoothing) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, characterizing each attention head via Mean Attention Distance (MAD) and normalized entropy. Across all three datasets, the Modern protocol produces more local and more concentrated attention in early layers; on CIFAR-100, the minimum MAD drops from 0.316 (Baseline) to 0.008 (Modern). To identify the source of this effect, we conduct an ablation study on CIFAR-100 by adding or removing each component individually. The results identify CutMix as the determining component within our experiments: all conditions with CutMix exhibit MAD 0.024, while all conditions without CutMix remain at MAD 0.210. AutoAugment and Label Smoothing show no independent effect on locality. Taken together, these findings suggest that the pressure to classify from partial image regions, induced by CutMix, can promote the emergence of local attention in Vision Transformers.
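
For readers unfamiliar with Mean Attention Distance, here is a minimal sketch for a single head. It assumes patch tokens laid out on a square grid, no class token, rows of the attention matrix summing to 1, and normalization by the grid diagonal so the value lies in [0, 1]; the abstract does not specify the paper's exact normalization, so treat that choice as an assumption.

```python
import math


def mean_attention_distance(attn, grid_size):
    """Attention-weighted mean distance between query and key patches.

    attn[q][k] is the weight from query patch q to key patch k.
    Requires grid_size >= 2; the result is divided by the grid diagonal.
    """
    n = grid_size * grid_size
    coords = [(i // grid_size, i % grid_size) for i in range(n)]
    max_dist = math.hypot(grid_size - 1, grid_size - 1)
    total = 0.0
    for q in range(n):
        qr, qc = coords[q]
        for k in range(n):
            kr, kc = coords[k]
            total += attn[q][k] * math.hypot(qr - kr, qc - kc)
    return total / (n * max_dist)


# Two extreme heads on a 2x2 grid: purely local vs. uniform attention.
local_head = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
uniform_head = [[0.25] * 4 for _ in range(4)]
mad_local = mean_attention_distance(local_head, 2)
mad_uniform = mean_attention_distance(uniform_head, 2)
```

A head that attends only to its own patch scores 0, while a uniform head scores well above it, which is the sense in which a lower MAD indicates more local attention.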


[53] Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation cs.CV | cs.AI

Joel Valdivia Ortega, Tingying Peng, Marion Jasnin

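TL;DR: This paper proposes ViTC-UNet, which conditions a UNet on pre-trained, frozen Vision Transformer (ViT) representations through learnable tokens and a two-way attention decoder, combining the global visual priors of ViTs with the local inductive bias and high-resolution decoding capacity of UNets for domain-adaptive semantic segmentation of biomedical images.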

Details

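Motivation: Vision Transformers underperform in biomedical semantic segmentation, particularly for sparse, fine-structured, and low signal-to-noise targets. The authors attribute this in part to the lightweight pixel decoders commonly used in promptable ViT models, which lack the local inductive bias needed for high-precision segmentation.

Result: ViTC-UNet outperforms baseline results on semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.

Insight: The contribution is a domain adaptation method that needs no end-to-end ViT fine-tuning: a conditioning mechanism efficiently combines frozen pre-trained ViT representations with a UNet decoder, exploiting global semantic priors and local spatial detail at the same time.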

Abstract: Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.


[54] Trajectory-Aware Adaptive Inference in Object Detection Models cs.CV | cs.AI

Grigorios Papanikolaou, Ioannis Kontopoulos, Giannis Spiliopoulos, Dimitris Zissis, Konstantinos Tserpes

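TL;DR: This paper proposes a trajectory-aware adaptive inference method for object detection models. It adds an early-exit mechanism to a YOLOv8 detector and uses GPS trajectory data (such as inter-vessel distances) to assess scene complexity and adjust computation dynamically. Vessels that are close together and converging quickly are processed with the full model; otherwise only a subset of the network architecture is activated, giving a flexible trade-off between accuracy and efficiency.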

Details

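Motivation: The paper addresses the challenge of efficient real-time perception posed by multimodal datasets in autonomous maritime navigation systems, in particular the coupling of object detection and trajectory perception in dynamic environments, and the fact that model efficiency during inference is often overlooked.

Result: Experiments show that the strategy maintains satisfactory detection performance while significantly reducing inference time and computational cost, enabling a flexible trade-off between accuracy and efficiency compared with full-model inference.

Insight: The contribution is integrating GPS trajectory data (motion cues) into the inference process of an object detector to enable input-adaptive computation. Viewed objectively, assessing the difficulty (scene complexity) of a frame, or of the set of frames in one second, from inter-object distance and its rate of change, and activating parts of the network accordingly, is an efficient way to allocate compute.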

Abstract: The increasing integration of sensors in autonomous maritime navigation has led to large-scale multimodal datasets, raising challenges in achieving efficient real-time perception. In such systems, object detection and trajectory perception of nearby vessels are tightly coupled, particularly in dynamic environments such as maritime navigation. However, the efficiency of object detection models during inference remains an often-overlooked aspect. To this end, we build upon an existing object detection framework by incorporating GPS trajectory data into the inference process to enable input-adaptive computation. Specifically, we introduce an early-exit mechanism in a YOLOv8-based detector that incorporates motion cues - such as inter-vessel distances. Frames of vessels that are separated by short distances, converging with high speed, are processed using the full model, while only a subset of the network’s architecture is activated otherwise. The difficulty degree (or scene complexity) of a frame or set of frames per second is evaluated by leveraging inter-object distance and the rate at which the distance between them decreases. Experimental results demonstrate that this strategy maintains satisfactory detection performance while significantly reducing inference time and computational cost, thus enabling a flexible trade-off between accuracy and efficiency compared to full-model inference.
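
The gating rule described in the abstract can be sketched as a simple predicate over inter-vessel distance and closing rate. The thresholds `near_m` and `fast_mps` and both function names are hypothetical, since the abstract gives no numeric values, and the detector itself is left out.

```python
def closing_rate(prev_distance_m, curr_distance_m, dt_s):
    """Rate at which the gap shrinks, in m/s; positive when vessels converge."""
    return (prev_distance_m - curr_distance_m) / dt_s


def use_full_model(curr_distance_m, rate_mps, near_m=500.0, fast_mps=5.0):
    """Run the full detector only for close vessels that converge quickly.

    near_m and fast_mps are illustrative thresholds, not values from the paper.
    Returning False means the early-exit path (a subset of the network) is used.
    """
    return curr_distance_m <= near_m and rate_mps >= fast_mps


# Two GPS fixes taken 10 s apart: the gap shrinks from 600 m to 400 m.
rate = closing_rate(600.0, 400.0, 10.0)
hard_frame = use_full_model(400.0, rate)
easy_frame = use_full_model(2000.0, rate)
```

Because the decision uses only GPS-derived scalars, it can be made before any image is processed, which is what makes the computation input-adaptive.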


[55] WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments cs.CV

Haoren Zhao, Tianyi Chen, Zhen Wang

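TL;DR: This paper presents WinDeskGround, a benchmark and synthesis framework for evaluating the GUI grounding robustness of multimodal large language models in complex, real-world desktop environments with stacked windows. The framework generates complex desktop scenarios by parametrically controlling window occlusion, layout density, and semantic similarity, and it is used to build a dataset of 1,356 high-fidelity instruction-target pairs. An evaluation of five leading MLLMs shows that although top-tier models excel in simplified scenarios, their accuracy drops significantly under complex conditions such as partial occlusion.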

Details

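Motivation: The effectiveness of current MLLMs for GUI automation has mostly been validated on idealized, single-layer interfaces, whereas state-of-the-art agents face clear robustness challenges in real-world desktop environments with multi-window stacking, occlusion, and visual clutter. This paper aims to close that reliability gap.

Result: Five leading MLLMs are comprehensively evaluated on WinDeskGround. The results show that top-tier agents excel in simplified settings but lose accuracy under partial occlusion. The benchmark provides a valuable tool for assessing and advancing GUI agent robustness in realistic environments.

Insight: The main contribution is the first benchmark framework focused on GUI grounding robustness in complex multi-window desktop environments. Its core idea is to simulate the distribution shifts of real workflows through parametric synthesis (controlling occlusion, density, and semantic similarity), so that model fragility in complex real-world visual scenes can be evaluated systematically, which reflects practical challenges better than static datasets do.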

Abstract: Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results demonstrate that while top-tier agents excel in simplified settings, their accuracy declines under partial occlusion. WinDeskGround provides a valuable benchmark to facilitate the assessment and advancement of GUI agent robustness in realistic environments. The code is available at https://github.com/ZZZhr-1/WinDeskGround.
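
One of the controlled parameters, window occlusion, can be illustrated with a small geometric helper. This is a sketch under stated assumptions: rectangles are axis-aligned `(left, top, right, bottom)` tuples in pixels, only a single occluding window is considered, and `occlusion_ratio` is a hypothetical name rather than part of the WinDeskGround code.

```python
def intersection_area(a, b):
    """Overlap area of two axis-aligned rectangles (left, top, right, bottom)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def occlusion_ratio(target, occluder):
    """Fraction of the target element covered by one window stacked above it."""
    area = (target[2] - target[0]) * (target[3] - target[1])
    if area <= 0:
        return 0.0
    return intersection_area(target, occluder) / area


# A 100x50 button whose right half is covered by another window.
button = (0, 0, 100, 50)
front_window = (50, 0, 200, 200)
covered = occlusion_ratio(button, front_window)
```

Handling several stacked windows would require the area of their union over the target, which this sketch deliberately leaves out.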


[56] When Vision Speaks for Sound cs.CV | cs.SD

Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu

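TL;DR: This paper studies a key flaw in the audio understanding of current video multimodal large language models (MLLMs): models often rely on visual cues to infer or hallucinate acoustic information rather than actually verifying the audio stream, a phenomenon the authors call an audio-visual “Clever Hans effect”. To diagnose the problem systematically, they introduce Thud, an intervention-driven probing framework that tests audio verification through three counterfactual audio edits (Shift, Mute, and Swap). They also propose a two-stage alignment recipe in which intervention-derived preference pairs teach audio verification and event-level general video preferences keep the model from over-specializing, substantially improving performance with a small number of samples.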

Details

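Motivation: Despite rapid progress in video MLLMs, the authors find that audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate sound instead of verifying the audio stream. The problem is widespread across state-of-the-art open-source and closed-source models, and the authors aim to diagnose and address this “Clever Hans effect” systematically to improve genuine audio understanding.

Result: The best 10K-sample alignment recipe improves average performance across the three intervention dimensions (Shift, Mute, and Swap) by 28 percentage points, while slightly improving results on general video and audio-visual QA benchmarks. This indicates that the method strengthens audio verification without harming general performance.

Insight: The contribution is systematically exposing the vision-driven flaw in MLLM audio understanding and introducing the Thud intervention framework to quantify and diagnose it. The two-stage alignment recipe, which combines intervention-derived preference learning with event-level regularization, offers a reusable approach to improving genuine audio perception and underlines the importance of verifying the audio stream in audio-visual tasks.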

Abstract: Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
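
The three counterfactual edits can be pictured on a mono waveform stored as a list of samples. The abstract only states what each edit tests, so the concrete operations below (delay with silence padding, zeroing, and replacement by another clip's track) are assumptions about how such edits might be implemented, not the Thud implementation.

```python
def shift(audio, offset):
    """Delay the track by `offset` samples, padding with silence, same length."""
    if offset <= 0:
        return list(audio)
    offset = min(offset, len(audio))
    return [0.0] * offset + list(audio[: len(audio) - offset])


def mute(audio):
    """Replace the track with silence of the same length."""
    return [0.0] * len(audio)


def swap(audio, other_audio):
    """Replace the track with another clip's audio, trimmed or padded to length."""
    n = len(audio)
    other = list(other_audio[:n])
    return other + [0.0] * (n - len(other))


# Hypothetical 4-sample track; the video frames stay untouched in every edit.
track = [0.1, 0.2, 0.3, 0.4]
shifted = shift(track, 2)
silenced = mute(track)
swapped = swap(track, [9.0, 8.0])
```

A model that truly checks the audio should change its answer after each edit, while a model relying on visual cues will keep describing the sound the picture suggests.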


[57] Hybrid Quantum-MambaVision: A Quantum-enhanced State Space Model for Calibrated Mixed-type Wafer Defect Detection cs.CV

Satwik Sai Prakash Sahoo, Jyoti Prakash Sahoo, Ting Wang, Subrota Kumar Mondal

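TL;DR: This paper proposes Hybrid Quantum-MambaVision, an efficient hybrid quantum-classical architecture that addresses extreme class imbalance and computational complexity in semiconductor wafer defect detection. The model combines a linear-complexity state-space model (Mamba) backbone, a parameterized quantum context adapter, and low-rank adaptation to capture long-range spatial dependencies efficiently and disentangle complex, overlapping defect features.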

Details

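Motivation: The paper addresses the extreme class imbalance in industrial visual data (especially semiconductor wafer defect detection) and the computational bottleneck of modern foundation models such as Vision Transformers, whose quadratic complexity cannot meet the demands of high-throughput, real-time anomaly detection.

Result: On the highly imbalanced MixedWM38 dataset, Hybrid Quantum-MambaVision achieves excellent multi-label classification performance on complex multi-defect topologies and significantly reduces the error rate; the quantum regularizer acts as an uncertainty calibrator, substantially reducing the Maximum Calibration Error and minimizing expected false-positive costs.

Insight: The contribution is combining a linear-complexity state-space model (Mamba) with parameterized quantum computation: the quantum adapter maps compressed latent features into a high-dimensional Hilbert space to disentangle complex features, and quantum regularization is introduced for uncertainty calibration, establishing a scalable hybrid quantum-classical representation learning paradigm for industrial data mining.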

Abstract: Extracting actionable knowledge from industrial visual data is fundamentally bottlenecked by extreme class imbalance and the prohibitive computational complexity of modern foundation models. In semiconductor manufacturing, identifying multi-label wafer defects is a complex spatial data mining task where overlapping patterns obscure critical root-cause signals. While Vision Transformers (ViTs) excel at global dependency extraction, their quadratic scaling renders them inefficient for high-throughput, real-time anomaly detection. To overcome these computational barriers, this paper introduces Hybrid Quantum-MambaVision, a highly efficient architecture tailored for spatial knowledge discovery. We integrate a linear-complexity State-Space Model (SSM) backbone with a Parameterized Quantum Context Adapter (QCA) and Low-Rank Adaptation (LoRA). The Mamba backbone efficiently captures long-range spatial dependencies, while the quantum adapter maps compressed latent features into a high-dimensional Hilbert space to disentangle complex, overlapping signatures. On the highly imbalanced MixedWM38 dataset, Hybrid Quantum-MambaVision achieves exceptional multi-label classification performance, significantly reducing the error rate on complex multi-defect topologies compared to classical baselines. The quantum regularizer acts as a profound uncertainty calibrator, substantially reducing Maximum Calibration Error (MCE) and minimizing expected false-positive costs. This work establishes a scalable Quantum-Classical hybrid paradigm for efficient representation learning in industrial data mining.
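
Maximum Calibration Error, the calibration metric cited above, is commonly computed by binning predictions by confidence and taking the largest gap between accuracy and mean confidence over non-empty bins. The sketch below follows that common definition with equal-width bins; the bin count and the toy inputs are assumptions, not details from the paper.

```python
def max_calibration_error(confidences, correct, n_bins=10):
    """Largest |accuracy - mean confidence| over non-empty equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    gaps = []
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            gaps.append(abs(accuracy - avg_conf))
    return max(gaps) if gaps else 0.0


# Made-up predictions: the 0.6-confidence bin is always wrong, so it dominates.
mce = max_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 0, 0])
```

Unlike the expected calibration error, which averages the gaps, this metric reports the worst bin, which is why it suits settings where a single badly miscalibrated confidence range is costly.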


[58] Concepts Worth Having: Refining VLM-Guided Concept Bottleneck Models with Minimal Annotations cs.CV

Nicola Debole, Andrea Passerini, Stefano Teso, Andrea Pugnana, Emanuele Marconato

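TL;DR: This paper proposes VH-CBM, a hybrid approach that combines vision-language models (VLMs) with a small amount of dense human annotation to improve concept bottleneck models (CBMs). By using a Gaussian Process in the VLM's embedding space to propagate expert supervision, it predicts more accurate concepts than purely VLM-guided CBMs even when only 1% of the data is annotated, and it improves concept calibration and supports active learning.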

Details

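Motivation: Concept bottleneck models (CBMs) rely on concept-level annotations for interpretability, but such annotations are usually hard to obtain. Existing methods generate annotations automatically with VLMs, which can lower concept quality and reduce interpretability. This paper seeks a middle ground that uses a small amount of human annotation to improve the quality of VLM-guided CBMs.

Result: The empirical evaluation shows that VH-CBM predicts more accurate concepts than purely VLM-guided CBMs even when only 1% of the data is annotated. It also shows better concept calibration and supports an active learning framework.

Insight: The core contribution is a hybrid supervision framework (VH-CBM) that uses a Gaussian Process in the VLM's embedding space to effectively propagate the information from a small number of expert annotations globally. This offers a practical, efficient way to balance automation (VLMs) with expert knowledge and to improve both interpretability and predictive accuracy when annotations are scarce.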

Abstract: Concept-bottleneck models (CBMs) are neural classifiers that compute predictions from high-level concepts extracted from the input. CBMs ensure stakeholders can understand the concepts – and the predictions they entail – by learning these from concept-level annotations, which, however, are seldom available. Recent CBM architectures work around this issue by obtaining annotations from Vision-Language Models (VLMs). While greatly broadening applicability, doing so can yield lower quality concepts and therefore less interpretable models. We strike a middle ground by introducing Vision-plus-Human-guided CBM (VH-CBM), a hybrid approach that exploits both VLMs and a small amount of dense annotations. VH-CBM employs a Gaussian Process in the VLM’s embedding space, which captures useful global information about the target domain, to propagate the expert’s supervision to any target data point. Our empirical evaluation shows how VH-CBM predicts more accurate concepts than VLM-guided CBMs even when annotating as little as 1% of the data, while sporting better concept calibration and supporting active learning.


[59] Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models cs.CV | cs.CL | cs.LG

Qinwu Xu, Xin Liu, Yifan Jiang, Haoyu Ren

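TL;DR: This paper proposes an OCR-aware multilingual training framework for multimodal large language models (MLLMs) that targets OCR and multilingual text understanding in complex real-world images (cluttered layouts, small fonts, blur, occlusion, and so on). The framework combines large-scale synthetic OCR-to-translation data generation, OCR-aware supervised fine-tuning (SFT), and structured visual chain-of-thought (CoT) prompting, and it substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions.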

Details

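Motivation: The paper addresses key failure modes of MLLMs when handling OCR and multilingual text in real-world images, such as cluttered layouts, small fonts, blur, occlusion, and complex typography.

Result: Experiments on multilingual receipts, menus, posters, signs, handwritten text, and document images show significantly improved visual-text grounding over the baseline model. Extraction of small, blurred, spatially scattered, and partially occluded text improves in particular, and comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, show reduced hallucination and improved OCR grounding in noisy and visually ambiguous OCR scenarios.

Insight: The contribution is a data-centric, OCR-aware multimodal post-training framework that uses synthetic data generation, SFT with LoRA adaptation, and structured visual CoT prompting, providing an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems. Viewed objectively, integrating OCR capability deeply into the MLLM training pipeline and using CoT to handle visual uncertainty is an important idea for improving robustness in complex real-world scenes.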

Abstract: Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.


[60] Test-Time Hinting for Black-Box Vision-Language Models cs.CV

Kaihua Hou, Abhijith Varma Mudunuri, Jiaxing Qiu, Roxana Daneshjou, Thomas Hartvigsen

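TL;DR: This paper proposes Test-Time Hinting, a method for improving the performance of black-box vision-language models (VLMs). It trains a lightweight hint generator that predicts a “hint” for a given test input and prepends it to the prompt, steering the VLM away from common failure modes while requiring only a single VLM call and black-box API access.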

Details

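Motivation: Most existing test-time scaling methods for VLMs require open model weights or expensive repeated sampling and are evaluated mainly on multimodal mathematical and scientific reasoning benchmarks rather than general visual understanding tasks. This paper aims to remove those limitations so that the method applies broadly to frontier closed-weight models.

Result: Experiments show that Test-Time Hinting improves the accuracy of multiple closed-weight VLMs on natural-image VQA benchmarks and that the gains generalize to unseen benchmarks and VLMs without retraining the hint generator.

Insight: The contribution builds on the observation that VLM errors tend to cluster around recurring failure patterns and uses predicted hints to give targeted guidance. Viewed objectively, the lightweight, single-call, black-box-only design offers a practical and scalable way to improve the general visual understanding of closed-weight VLMs efficiently.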

Abstract: Test-time scaling (TTS) methods have proven highly effective for LLMs, yet their application to vision-language models (VLMs) remains relatively underexplored. Existing VLM TTS methods largely require open-weight model access or expensive repeated sampling, and are evaluated primarily on multimodal mathematical and scientific reasoning benchmarks rather than general visual understanding tasks. In this paper, we propose Test-Time Hinting, a method that improves VLM performance via a single VLM call and requiring only black-box API access, which makes it broadly applicable to frontier closed-weight models. Our method is motivated by the observation that VLM errors tend to cluster around recurring failure patterns. We therefore train a lightweight hint generator model to predict, for a given test input, which “hint” should be prepended to the prompt, providing targeted contextual or procedural guidance that steers the VLM away from its characteristic failure modes. We show that Test-Time Hinting improves the accuracy of multiple closed-weight VLMs on natural-image VQA benchmarks and that these gains generalize to unseen benchmarks and VLMs without retraining the hint generator.


[61] Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift cs.CV | cs.AI | cs.CL | cs.DB | cs.LG

Qinwu Xu

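TL;DR: This paper proposes a stage-wise preference optimization framework for reducing hallucination in vision-language models. It constructs hallucination-focused multimodal preference pairs and optimizes progressively near known failure boundaries, emphasizing spatial orientation, object relationships, OCR uncertainty, and adversarial false premises. Experiments show that the method improves grounding consistency, reduces hallucination, and produces more informative responses on several benchmarks and in real-world scenarios.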

Details

Motivation: 解决视觉语言模型中因自回归生成和联合概率建模下的似然最大化导致的幻觉问题,即模型可能产生语言上合理但物理上不一致或视觉上无依据的回答。

Result: 在开源基准和真实世界多模态评估场景中,该方法提高了视觉一致性、减少了幻觉,并产生了更具信息量的回答。跨模型定性评估表明,在模糊空间推理和对抗性错误前提设置中,其表现优于多个前沿的专有视觉语言模型。

Insight: 创新点在于提出了分阶段的、针对幻觉的偏好数据构建与优化框架,而非在通用指令数据上直接优化。核心见解是幻觉可能源于自回归概率生成在弱视觉基础下倾向于语言合理延续的内在倾向,而不仅仅是模型容量限制。

Abstract: Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, such as in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding.
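For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO loss can be written in a few lines. The sketch below shows only that generic loss; how the paper builds its grounded/hallucinated pairs and which `beta` it uses are not stated in the abstract, so the value here is a placeholder.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy (logp_*) and under
    the frozen reference model (ref_logp_*). beta=0.1 is a placeholder value.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written so that exp() never overflows
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With a grounded response as "chosen" and a minimally perturbed hallucinated response as "rejected", the loss falls as the policy raises the grounded response's likelihood relative to the reference model.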


[62] NERVE: A Neuromorphic Vision and Radar Ensemble for Multi-Sensor Fusion Research cs.CVPDF

Omar Mansour, Pietro Martinello, Ethan Milon, YingFu Xu, Manolis Sifalakis

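TL;DR: NERVE is a multi-sensor dataset containing 257 minutes of synchronized recordings from two Dynamic Vision Sensors (DVS), an RGB-D camera, and two radar units (24GHz and 77GHz), captured in office environments. It provides about 600GB of uncompressed, temporally aligned data, including about 914,000 frames and about 9.6 million RGB COCO-formatted annotations covering 16 relevant object categories. The dataset is intended for evaluating multi-modal fusion, in particular combining DVS and radar for human detection and distance estimation; baseline experiments show that combining DVS with 77GHz radar consistently improves detection.

Details

Motivation: Multi-sensor fusion research lacks synchronized, large-scale, richly annotated neuromorphic vision and radar data. The dataset is meant to fill this gap and support work on cross-modal perception.

Result: On the DVS+Radar subset, baseline experiments with feed-forward and recurrent detectors show that combining DVS with 77GHz radar improves detection. Recurrent models reach up to 47.5% mAP, and the mean absolute radar distance error is below 1.8 m against LiDAR ground truth.

Insight: The contribution is a synchronized multi-sensor dataset of neuromorphic vision and radar that supports multi-modal fusion research. With temporal alignment and rich annotations, it offers a useful benchmark for developing low-power, real-time perception systems such as those in autonomous driving and robotics.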

Abstract: We present NERVE (Neuromorphic Vision and Radar Ensemble), a multi-sensor dataset comprising 257 minutes of synchronized recordings from five sensors: two Dynamic Vision Sensors (DVS), an RGB-D camera, and two Radar units (24GHz and 77GHz). Captured across 12 measurement days in office environments, NERVE contains around 600GB of uncompressed temporally aligned data with around 914,000 frames and around 9.6 million RGB COCO-formatted annotations covering 16 relevant object categories. To evaluate multi-modal fusion, we construct a DVS+Radar subset for human detection and distance estimation. Baseline experiments using feed-forward and recurrent detectors show that combining DVS with 77GHz Radar consistently improves detection, with recurrent models achieving up to 47.5% mAP and mean absolute Radar distance errors below 1.8m against LiDAR ground truth.


[63] Neural Visual Decoding via Cognitive guided Adaptive Blurring and Information Constrained Alignment cs.CV | cs.AIPDF

Fan Yin, Chuhang Zheng, Peiliang Gong, Donghai Guan, Qi Zhu

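TL;DR: This paper proposes the CAIA framework to tackle two challenges in EEG-based visual decoding: information granularity mismatch and low signal-to-noise ratio. It dynamically integrates visual cues through a cognitive-guided adaptive blurring mechanism and uses neural oscillation priors together with an information bottleneck mechanism to improve EEG signal quality, significantly improving zero-shot brain-to-image retrieval.

Details

Motivation: Existing EEG visual decoding methods typically work with static visual features and ignore the dynamic selectivity of human vision and the frequency specificity of neural oscillations, which leads to severe information granularity mismatch and low SNR.

Result: In zero-shot brain-to-image retrieval, CAIA improves average Top-1 and Top-5 accuracy in both subject-dependent and subject-independent settings, significantly outperforming prior methods.

Insight: The novel components are a cognitive-dynamics-guided adaptive blurring mechanism, a distribution-aware boundary calibration loss, and a cognitively guided information-screening method. The core insight is that optimizing visual information density to match neural granularity offers a more interpretable and robust pathway for neural decoding.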

Abstract: EEG-based visual decoding aims to establish a mapping between neural signals and visual semantics. However, it remains constrained by the dual challenges of severe information granularity mismatch and the low signal-to-noise ratio (SNR) of EEG signals. Existing approaches typically treat static visual features, ignoring the dynamic selectivity of human vision and the frequency specificity of neural oscillations. To bridge this gap, we propose CAIA, a Cognitive-guided Adaptive blurring with Information-Constrained Alignment framework for Neural-Visual decoding. On the visual side, it simulates selective attention to adaptively reduce redundancy. Meanwhile, on the EEG side, it leverages neural oscillation priors and the information bottleneck mechanism to enhance SNR. Specifically, we devise a cognitive-dynamics-based adaptive blurring mechanism that dynamically integrates center-biased and saliency-guided visual cues via cross-modal attention. Furthermore, we introduce a distribution-aware boundary calibration loss to robustly rectify alignment bias caused by outlier samples. Moreover, a cognitively-guided information-screening method is proposed to select task-relevant EEG oscillations. Extensive experiments demonstrate that CAIA improves both subject-dependent and subject-independent average Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval, significantly outperforming prior methods. Our work validates that optimizing visual information density to match neural granularity offers a more interpretable and robust pathway for neural decoding.


[64] CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning cs.CV | cs.AIPDF

Tengda Guo, Jie Leng, Hanlei Li, Yaoyuan Liang, Qingyue Zhang

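TL;DR: This paper proposes CAVE, a structured process-reward method based on GRPO, to address the difficulty vision-language models have in integrating nonlocal visual information for semantically underdetermined visual reasoning (Fragmented Visual Reasoning). CAVE evaluates the contribution of intermediate steps at the action level using three complementary reasoning-process signals (belief update, evidence acquisition, and adaptive focus control), so as to optimize each reasoning action and learn more reliable visual reasoning strategies. The authors also construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths.

Details

Motivation: VLMs perform well on general multimodal reasoning but still struggle to integrate nonlocal visual information in support of semantically underdetermined visual reasoning. The authors define this as the Fragmented Visual Reasoning problem and aim to address models' shortcomings on such tasks.

Result: Experiments show that CAVE substantially improves performance on tasks requiring integration of fragmented visual evidence, across both public benchmarks and the newly proposed TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analysis shows that CAVE improves visual reasoning capacity and is more robust under longer-range and deeper cross-region dependencies.

Insight: The contributions are a structured process-reward method (CAVE) that optimizes the reasoning process through action-level, multi-signal evaluation, and a benchmark dedicated to fragmented visual reasoning (TRACER-Bench) that provides fine-grained supervision. The method brings the idea of credit assignment from reinforcement learning into visual reasoning, improving performance in complex scenarios with nonlocal dependencies by decomposing and evaluating individual reasoning actions. It is a promising direction for fine-grained optimization.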

Abstract: Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments demonstrate that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and our newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE effectively improves the visual reasoning capacity and exhibits stronger robustness under longer-range and deeper cross-region dependencies.
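Since CAVE builds on GRPO, it may help to see the group-relative advantage that GRPO-style training computes from a group of sampled rollouts. The sketch below shows that normalization plus a hypothetical way to combine the three process signals; the weighted sum in `step_reward` and its default weights are assumptions for illustration, not the paper's formula.

```python
from statistics import mean, pstdev

def step_reward(belief_update, evidence_acquisition, focus_control, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combination of the three process signals as a weighted sum."""
    w_b, w_e, w_f = weights
    return w_b * belief_update + w_e * evidence_acquisition + w_f * focus_control

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group.

    The population standard deviation is used here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that score above the group mean receive positive advantages and are reinforced; when every rollout in a group scores the same, all advantages are zero and the group contributes no gradient.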


[65] Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments cs.CV | cs.AI | cs.ROPDF

Juncheng Yu, Lusi A, Haoxuan Xie, Weiming Wang

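TL;DR: This paper proposes an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments. Using two cameras without hardware triggers, it relies on multimodal large language models for automatic video synchronization and agent-driven self-verification, extracts candidate poses with state-of-the-art monocular 2D pose estimation models, applies an agent-based selection mechanism to automatically identify and track the target subject, and finally estimates joint angles by optimization over uncalibrated multi-view pose sequences with explicit geometric modeling.

Details

Motivation: Kinematic monitoring is needed in long-term rehabilitation of patients with spinal cord injury. Existing multi-view markerless motion capture methods depend on calibration and have difficulty achieving multi-view synchronization, which makes them hard to use in patient self-deployed environments.

Result: Validation against a Vicon system shows strong performance, with a mean absolute error of 5.97°±2.36° and a Pearson correlation coefficient of 0.962±0.014.

Insight: The novelty lies in using multimodal large language models for hardware-trigger-free video self-synchronization with an agentic self-verification step, combined with an agent-based selection mechanism that handles multi-person scenes and occlusions. This enables robust, interpretable joint angle estimation in uncalibrated environments and offers a practical solution for patient self-deployed daily monitoring.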

Abstract: Kinematic monitoring plays a critical role in long-term rehabilitation for patients with spinal cord injury (SCI), where multi-view markerless motion capture methods have shown significant potential. However, owing to the reliance on calibration and the difficulty of achieving multi-view synchronization, their deployment in patient self-deployed environments remains challenging. In this work, we propose an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments using two cameras without hardware triggers. Multimodal large language models enable automatic video synchronization and agent-driven self-verification. State-of-the-art monocular 2D pose estimation models are employed to extract candidate poses, and an agent-based selection mechanism is then applied to automatically identify and track the target subject, thereby producing consistent 2D poses in the presence of multiple individuals and occlusions. These 2D poses are optimized to estimate joint angles from uncalibrated multi-view pose sequences, ensuring interpretability through explicit geometric modeling. Validation against a Vicon system demonstrated strong performance, achieving an MAE of $5.97^\circ \pm 2.36^\circ$ and a Pearson correlation coefficient of $0.962 \pm 0.014$. The proposed method is expected to provide a practical, patient self-deployable system to perform daily kinematic monitoring in uncalibrated home environments.
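The two reported metrics, and the notion of a joint angle itself, can be made concrete with a short sketch. It assumes 3D keypoints are already available, which sidesteps the paper's actual contribution (optimizing angles from uncalibrated multi-view 2D poses); the function names are illustrative.

```python
import math

def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b formed by the segments b->a and b->c."""
    u = [ai - bi for ai, bi in zip(a, b)]
    v = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.degrees(math.acos(cos_theta))

def mae(pred, ref):
    """Mean absolute error between predicted and reference angle series."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

For example, a knee angle would use hip, knee, and ankle keypoints as `a`, `b`, and `c`; the resulting per-frame series is then compared against the Vicon reference with `mae` and `pearson`.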


[66] CT-DegradBench: A Physics-Informed Benchmark for CT Degradation Detection and Severity Estimation cs.CVPDF

Yousra Nabila Taifour, Marouane Tliba, Zuheng Ming, Marie Luong, Nour Aburaed

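TL;DR: This paper presents CT-DegradBench, a dataset and benchmark for CT image degradation detection and severity estimation that covers single- and mixed-artifact settings. The authors also propose SeSpeCT, a framework that combines semantic priors from medical vision-language models with frequency-domain cues to jointly predict artifact type and severity without task-specific fine-tuning.

Details

Motivation: CT image enhancement is currently evaluated mainly with image quality metrics of limited perceptual and clinical validity, and existing datasets mostly focus on isolated restoration tasks. A benchmark that evaluates multiple degradation types in a unified way is missing.

Result: Experiments show that SeSpeCT consistently outperforms the evaluated baselines on CT-DegradBench under both single- and mixed-degradation settings.

Insight: The novelty is a semantic quality axis built in the multimodal embedding space from radiology-informed text prompts, without task-specific fine-tuning, combined with spectral features that capture degradation-specific frequency patterns to jointly predict artifact type and severity.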

Abstract: Computed tomography (CT) images are frequently degraded by acquisition artifacts, including noise, blur, streaking, aliasing, and metal artifacts. Yet CT enhancement is still largely evaluated using image quality metrics with limited perceptual and clinical validity, while existing datasets remain focused on isolated restoration tasks, hindering unified benchmarking across diverse degradation types. We present CT-DegradBench, a dataset and benchmark for CT degradation detection and severity estimation under controlled single- and mixed-artifact settings. CT-DegradBench enables systematic evaluation across multiple degradation families and severity levels within a common experimental framework. We further propose SeSpeCT (Semantic-Spectral CT degradation estimation), a framework that combines semantic priors from medical vision-language models with complementary frequency-domain cues for artifact analysis. SeSpeCT constructs a training-free semantic quality axis in the multimodal embedding space using radiology-informed text prompts, without task-specific fine-tuning, and combines it with spectral features that capture degradation-specific frequency patterns. The resulting representation enables joint prediction of artifact type and severity. Experimental results show that SeSpeCT consistently outperforms the evaluated baselines under both single- and mixed-degradation settings. The framework is available at https://github.com/yousranb/CT-DEGRADBENCH.
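One plausible reading of a "training-free semantic quality axis" is a direction in embedding space running from a low-quality text prompt to a high-quality one, onto which image embeddings are projected. The sketch below illustrates that reading only; the abstract does not give the exact construction, and the embeddings would in practice come from a medical vision-language model that is not modeled here.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def quality_axis(good_text_emb, bad_text_emb):
    """Assumed construction: unit direction from the 'degraded' prompt to the 'clean' prompt."""
    good, bad = normalize(good_text_emb), normalize(bad_text_emb)
    return normalize([g - b for g, b in zip(good, bad)])

def quality_score(image_emb, axis):
    """Projection of the unit image embedding on the axis; higher means closer to 'clean'."""
    return sum(i * a for i, a in zip(normalize(image_emb), axis))
```

No parameters are learned here, which is what makes such an axis training-free; in SeSpeCT the semantic signal is then combined with spectral features, which this sketch omits.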


[67] Video Reconstruction using Diffusion-based Image-to-Video Generation with Trajectory Guidance cs.CV | cs.LGPDF

Stelio Bompai, Ioannis Kontopoulos, Giannis Spiliopoulos, Dimitris Zissis, Konstantinos Tserpes

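TL;DR: This paper proposes a diffusion-based image-to-video generation approach for reconstructing missing or dropped frames in drone video of autonomous surface vehicles performing structured maritime manoeuvres. From raw GPS telemetry and a single reference frame, it generates a trajectory-guided video sequence with the pre-trained SG-I2V diffusion model, with no domain-specific fine-tuning.

Details

Motivation: The goal is to reconstruct missing or dropped frames in drone video of autonomous surface vehicles at sea under challenging low-texture, small-object conditions, using GPS trajectory information to guide video generation.

Result: The method outperforms the optical flow extrapolation and RIFE interpolation baselines in perceptual quality (BRISQUE 25.52, the closest to the ground-truth value of 23.64), motion realism (temporal smoothness 1.14 vs. 1.42 for ground truth), and GPS trajectory adherence (9.31px vs. 28.70px for ground truth, where the ground-truth figure reflects approximate temporal alignment between footage and GPS logs rather than generation error).

Insight: The novelty is mapping the GPS trajectory into image space via an equirectangular projection and using it as a motion cue to condition a pre-trained diffusion model. This achieves trajectory-guided video synthesis without fine-tuning and offers a new approach to reconstructing low-texture, small-object video.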

Abstract: This paper addresses the problem of reconstructing missing or dropped frames in top-down drone video of autonomous surface vehicles performing structured maritime manoeuvres. We propose a pipeline that converts raw GPS telemetry and a single reference frame into a trajectory-guided video sequence using a pre-trained image-to-video diffusion model, requiring no domain-specific fine-tuning. GPS coordinates from onboard telemetry logs are projected into image space via an equirectangular mapping, producing per-vessel motion cues that condition the SG-I2V diffusion model. The generated frames are evaluated against ground-truth video using perceptual, temporal and trajectory-based metrics, and benchmarked against optical flow extrapolation and RIFE interpolation baselines. SG-I2V produces the most naturally appearing frames among all methods (BRISQUE 25.52, closest to ground-truth 23.64), the most realistic motion magnitude (temporal smoothness 1.14 vs. ground truth 1.42), and the strongest GPS trajectory adherence (9.31px vs. 28.70px for ground-truth, the latter reflecting approximate temporal alignment between footage and GPS logs rather than generation error), demonstrating that trajectory-guided diffusion synthesis is a viable approach to maritime video reconstruction under challenging low-texture, small-object conditions.
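The equirectangular mapping from GPS coordinates to image space can be sketched as below. The projection itself is the standard local approximation; the north-up camera, the reference pixel `origin_px`, and the ground resolution `metres_per_pixel` are assumptions introduced for illustration, since the abstract does not describe the camera model.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius

def gps_to_local_m(lat, lon, lat0, lon0):
    """Equirectangular approximation: (east, north) offsets in metres from (lat0, lon0)."""
    east = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    north = math.radians(lat - lat0) * EARTH_RADIUS_M
    return east, north

def local_m_to_pixel(east, north, origin_px, metres_per_pixel):
    """Assumes a top-down, north-up frame: x grows to the east, y grows to the south."""
    u = origin_px[0] + east / metres_per_pixel
    v = origin_px[1] - north / metres_per_pixel
    return u, v
```

Applying both steps to each telemetry sample yields a per-vessel pixel trajectory that can serve as the motion cue. The approximation is accurate over the short distances of a single drone shot but degrades over large areas or near the poles.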


[68] KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy cs.CV | cs.AIPDF

Yingbing Huang, Tharun Adithya Srikrishnan, Steven K. Reinhardt, Deming Chen

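TL;DR: This paper proposes KVCapsule, an efficient KV cache compression framework for vision tokens in vision-language models (VLMs). Based on an analysis of the distinctive behavior of vision tokens, it uses lightweight compression and reconstruction components that require no changes to the pretrained model or the attention computation modules, substantially reducing KV cache memory and increasing inference throughput.

Details

Motivation: During autoregressive decoding, images produce longer token sequences and denser feature representations, so KV cache memory overhead is more severe in VLMs than in text-only models. Existing LLM-oriented KV cache compression techniques also work poorly when applied directly to VLMs.

Result: Evaluation on multiple VLMs and benchmark tasks shows that, at a 60% compression ratio, KVCapsule achieves up to 2x improvement in TPS and a 2.4x reduction in KV cache memory, with negligible degradation in accuracy or response quality.

Insight: The novelty is an empirical analysis that reveals key differences between the attention patterns of vision and text tokens, and a structure-aware KV cache compression scheme designed around them. The framework keeps the pretrained backbone frozen and leaves the attention computation unchanged, which makes it plug-and-play and practical, and it points to new directions for cache compression in multimodal models.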

Abstract: Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify a key computational bottleneck: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.


[69] Multi-hop Relational Contrastive Learning: Extending Spatial Contrastive Pre-training Beyond Pairwise Relations cs.CVPDF

Sheikh Tanvir Ahmed, Md. Tanvir Raihan

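TL;DR: This paper proposes Multi-Hop Relational Contrastive Learning (MRCL), a framework that extends spatial contrastive pre-training by tracing k-hop paths through graph-structured scene representations to capture implicit spatial dependencies beyond direct object pairs.

Details

Motivation: Existing contrastive pre-training methods mainly model pairwise relations and overlook richer compositional and multi-hop interactions, which limits how fully spatial relations between objects can be modeled for scene understanding.

Result: On a GQA subset, MRCL improves content-based graph retrieval (NDCG@5 = 0.748) and yields consistent gains on downstream tasks such as spatial relationship recognition and graph-based question answering.

Insight: The novelty is extending contrastive learning from pairwise relations to multi-hop paths. A multi-level contrastive objective over nodes, edges, and multi-hop paths learns embeddings that are stable across object semantics yet sensitive to spatial layout, providing richer structural guidance than pairwise-only methods.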

Abstract: Understanding how objects relate to each other in space is fundamental to scene understanding, yet most contrastive pre-training approaches only model pairwise relationships, leaving richer compositional and multi-hop interactions largely unexplored. We introduce Multi-Hop Relational Contrastive Learning (MRCL), a framework that extends spatial contrastive learning to graph-structured scene representations. By tracing k-hop paths through scene graphs built from detected objects, MRCL captures implicit spatial dependencies that go well beyond what direct object pairs can express. We define a multi-level contrastive objective spanning nodes, edges, and multi-hop paths, encouraging embeddings that remain stable across object semantics while staying responsive to spatial layout. On a GQA subset, MRCL produces spatially-aware representations that improve content-based graph retrieval (NDCG@5 = 0.748) and consistently benefit downstream tasks, including spatial relationship recognition and graph-based question answering. Together, these results suggest that multi-hop relational supervision offers substantially richer structural guidance than pairwise-only methods, leading to visual representations that are more robust, compositional, and geometry-aware.
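To make "tracing k-hop paths through scene graphs" concrete, the sketch below enumerates all simple paths of exactly k edges from a start object. The adjacency-list format and the example relations are illustrative assumptions; how MRCL samples paths and turns them into contrastive pairs is not covered.

```python
def k_hop_paths(graph, start, k):
    """Return all simple paths with exactly k edges that begin at `start`.

    graph maps each node to a list of (relation, neighbour) pairs. Each path is a
    tuple alternating nodes and relations: (node, rel, node, rel, node, ...).
    """
    paths = []

    def walk(node, path, visited):
        if len(path) == 2 * k + 1:  # k hops -> k + 1 nodes and k relations
            paths.append(tuple(path))
            return
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:  # keep paths simple (no revisits)
                walk(neighbour, path + [relation, neighbour], visited | {neighbour})

    walk(start, [start], {start})
    return paths
```

With a graph such as `{"cup": [("on", "table")], "table": [("next to", "chair")]}`, the 2-hop path from `cup` links it to `chair` through `table`, a dependency that no single pairwise edge expresses.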


[70] REC-RL: Referring expression counting via Gaussian and range-based reward optimization cs.CVPDF

Hui Liu, Yunlai Teng, Kunlong Bai, Pengfei Qi, Haotian Yan

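TL;DR: This paper proposes REC-RL, a reinforcement learning framework for referring expression counting that introduces a think-range-answer paradigm to explicitly optimize the visual reasoning process. It uses Group Relative Policy Optimization with two lightweight rewards (an accuracy reward combining range-based interval supervision with Gaussian-based precision guidance, and a format reward that enforces structured outputs), improving performance without additional annotations.

Details

Motivation: Most existing referring expression counting methods rely on rule-based reinforcement learning whose rewards focus primarily on final accuracy and neglect the quality of intermediate reasoning, leaving models without context-aware visual reasoning ability.

Result: Extensive experiments on multiple benchmarks show consistent improvements over strong baselines and good generalization.

Insight: The novelty is modeling intermediate focus prediction as an internal decision-making process, optimizing the reasoning path with a reward that combines range and Gaussian terms, and using a format reward to keep outputs structured, which better aligns with human perception.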

Abstract: Referring expression counting (REC) is an intention-driven task that requires context-aware visual reasoning. While recent vision-language models incorporate language for visual understanding, most existing REC methods rely on rule-based reinforcement learning with rewards focused primarily on final accuracy, overlooking the quality of intermediate reasoning. We propose REC-RL, a reinforcement learning framework that introduces a think-range-answer paradigm to explicitly optimize the visual reasoning process. REC-RL employs Group Relative Policy Optimization and two lightweight rewards: an accuracy reward that combines range-based interval supervision with Gaussian-based precision guidance, and a format reward that enforces structured outputs. By modeling intermediate focus prediction as internal decision-making, REC-RL avoids additional annotations and better aligns with human perception. Extensive experiments demonstrate consistent improvements over strong baselines and robust generalization across benchmarks.
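A reward of the kind the abstract describes might look like the sketch below. Everything specific in it is an assumption made for illustration: the 50/50 weighting, the Gaussian width `sigma`, and the `<think>`/`<range>`/`<answer>` tag names are not taken from the paper.

```python
import math
import re

def accuracy_reward(pred_count, pred_range, gt_count, sigma=2.0, w_range=0.5):
    """Range term: 1 if the true count lies in the predicted interval.
    Gaussian term: decays smoothly with the distance between prediction and truth."""
    low, high = pred_range
    range_hit = 1.0 if low <= gt_count <= high else 0.0
    gaussian = math.exp(-((pred_count - gt_count) ** 2) / (2 * sigma ** 2))
    return w_range * range_hit + (1.0 - w_range) * gaussian

# Hypothetical think-range-answer output format.
_FORMAT = re.compile(
    r"^<think>.*?</think>\s*<range>\s*\d+\s*-\s*\d+\s*</range>\s*<answer>\s*\d+\s*</answer>$",
    re.S,
)

def format_reward(text):
    """1.0 if the output follows the assumed think-range-answer structure, else 0.0."""
    return 1.0 if _FORMAT.match(text.strip()) else 0.0
```

The range term gives coarse credit for bracketing the true count, while the Gaussian term gives graded credit for near misses, so a prediction that is off by one is rewarded more than one that is off by four.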


[71] MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation cs.CV | cs.AIPDF

Hanjun Tao, Hua Wang, Fan Zhang

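TL;DR: This paper proposes MHMamba (Multi-Head Mamba), a new network architecture for 3D brain tumor segmentation. It combines a U-shaped structure with a multi-head state-space model (Mamba), splitting the channel dimension into parallel SSM heads and aggregating them with residuals to strengthen long-range representation and stabilize multimodal training while keeping linear complexity. It also adds a channel-space calibration module and an adaptive fusion mechanism at skip connections to improve boundary consistency and the detection of small-volume lesions. Experiments on BraTS2021 and BraTS2023 show stable and significant improvements in overall accuracy, boundary smoothness, and sensitivity to tumor core and small-volume enhancement areas.

Details

Motivation: Brain tumors are highly heterogeneous in morphology and multimodal contrast, and manual slice-by-slice delineation is time-consuming and experience-dependent, so efficient and stable automated segmentation is needed. CNNs struggle to model long-range dependencies, while Transformers incur heavy computational and memory overhead on 3D MRI and suffer from inter-block contextual incoherence.

Result: Experiments and ablations were run on the BraTS2021 and BraTS2023 benchmarks. MHMamba achieves stable and significant improvements in overall accuracy, boundary smoothness, and sensitivity to tumor core and small-volume enhancement areas, while preserving the linear-complexity advantage of Mamba-based modeling.

Insight: The main contributions are: (1) bringing a multi-head mechanism into the state-space model (Mamba), with parallel SSM heads that strengthen long-range representation; (2) a channel-space calibration module that aligns the statistics of the multi-head outputs and enhances lesion response; and (3) an adaptive fusion mechanism at skip connections that dynamically connects global semantics with local details, improving boundary consistency and small-lesion detection. These designs improve the performance and stability of 3D medical image segmentation while keeping linear complexity.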

Abstract: Brain tumors exhibit high heterogeneity in morphology and multimodal contrast, making manual slice-by-slice delineation time-consuming and experience-dependent, thus necessitating efficient and stable automated segmentation methods. To address the limitations of CNNs in modeling long-range dependencies, and the heavy computational and memory overhead and inter-block contextual incoherence of Transformers in 3D MRI, this paper proposes Multi-Head Mamba (MHMamba). This method combines a U-shaped architecture with a multi-head state-space model (Mamba), splitting the channel dimension into parallel SSM heads and aggregating them with residuals. This enhances long-range representation and improves the stability of multimodal training while maintaining linear complexity. To further align statistics and enhance lesion response, we designed a channel-space calibration module for multi-head outputs and introduced an adaptive fusion mechanism at skip connections to dynamically connect global semantics with local details, thereby improving boundary consistency and the detection of small-volume lesions. We conducted experiments and ablations on BraTS2021 and BraTS2023. The results showed that MHMamba achieved stable and significant improvements in overall accuracy, boundary smoothness, and sensitivity to tumor core and small-volume enhancement areas, while preserving the linear-complexity advantage of Mamba-based modeling, thus verifying the effectiveness and versatility of the method.
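The "split channels into parallel SSM heads, then aggregate with residuals" idea can be illustrated with a toy sketch. Each head below runs a fixed scalar linear recurrence as a stand-in for a real state-space block; this is not Mamba's selective scan, and the per-head parameters `(a, b, c)` are hypothetical.

```python
def ssm_scan(seq, a, b, c):
    """Per-channel linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    state = [0.0] * len(seq[0])
    outputs = []
    for x in seq:  # one pass over the sequence: cost is linear in its length
        state = [a * h + b * xi for h, xi in zip(state, x)]
        outputs.append([c * h for h in state])
    return outputs

def multi_head_ssm(seq, head_params):
    """seq: list of T vectors with C channels; head_params: one (a, b, c) triple per head."""
    n_heads, channels = len(head_params), len(seq[0])
    if channels % n_heads != 0:
        raise ValueError("channel count must be divisible by the number of heads")
    width = channels // n_heads
    head_outputs = []
    for i, (a, b, c) in enumerate(head_params):
        chunk = [x[i * width:(i + 1) * width] for x in seq]  # this head's channel slice
        head_outputs.append(ssm_scan(chunk, a, b, c))
    # Concatenate the heads back along the channel axis, then add the residual input.
    merged = [sum((out[t] for out in head_outputs), []) for t in range(len(seq))]
    return [[m + x for m, x in zip(m_t, x_t)] for m_t, x_t in zip(merged, seq)]
```

Because each head sees only its own channel slice and scans the sequence once, the total cost stays linear in sequence length regardless of the number of heads, and heads with different decay `a` can specialize in different temporal ranges.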


[72] Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex cs.CV | cs.AI | cs.CL | cs.LG | q-bio.NCPDF

Idan Daniel Grosbard, Mor Geva, Galit Yovel

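TL;DR: This paper proposes Mechanistically Interpretable Neural Encoding (MINE), a framework that opens the "black box" of conventional encoding models by applying mechanistic-interpretability tools to localize the features in natural images that drive voxel-level brain activity. It predicts voxel responses from language-aligned image representations, produces semantically interpretable descriptions of the features critical for activation, and generalizes these features into a functional profile for each voxel.

Details

Motivation: Existing approaches that use artificial neural networks as encoding models to predict brain responses to natural images are largely correlational and treat the encoder as a black box, so they cannot reveal which image features drive each voxel's response. The paper uses mechanistic interpretability to uncover the visual features that drive neuronal activity.

Result: Validation shows that the descriptions MINE produces are sufficient to generate images that elicit voxel responses matching those to the original images, more accurately than images generated from random or low-attribution controls. Counterfactual editing provides further causal evidence: inserting or removing the predicted features shifts activation in the expected direction. Applied to well-studied category-selective brain regions, MINE recovers their known categorical preferences while revealing fine-grained, unique voxel structure within each region.

Insight: The main novelty is applying mechanistic-interpretability tools to neural encoding, moving from correlational analysis to causal validation and producing semantically interpretable feature descriptions and voxel-level functional profiles. This opens a path to discovering and validating fine-grained hypotheses about neural function.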

Abstract: A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel’s response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel’s response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel’s activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel’s selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.


[73] Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval cs.CV | cs.AIPDF

Aiden Yiliu Li, Nels Numan, Anthony Steed

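TL;DR: This paper proposes Visual Agentic Memory (VAM), a framework for long video understanding. It is training-free and has three components (Online Indexing, Hierarchical Memory, and Agentic Retrieval), and it supports long-horizon retention of and reasoning over visual evidence through explicit, inspectable, and queryable memory rather than compressed latent state alone.

Details

Motivation: Long video understanding requires more than large context windows. It needs a memory mechanism that decides which visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone.

Result: On OVO-Bench, VAM achieves the highest RT+BT average (68.41) among all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of MM-Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT-5 (17.62%).

Insight: The novelty is treating visual memory as an explicit, inspectable, and queryable substrate: Online Indexing retains evidence selectively under streaming constraints, Hierarchical Memory aligns temporal and spatial context, and Agentic Retrieval searches, inspects, and verifies evidence before producing an answer. This offers a new training-free agentic memory architecture for long-horizon video understanding.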

Abstract: Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of MM-Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT-5 (17.62%). These results suggest that long-horizon video understanding benefits from treating visual memory as an explicit, inspectable, and queryable substrate. Code is available at https://github.com/yiliu-li/Visual-Agentic-Memory.


[74] SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation cs.CVPDF

Ssharvien Kumar Sivakumar, Akwele Johnson, Anirudh Dhingra, Yannik Frisch, Ghazal Ghazaei

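TL;DR: This paper proposes SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component (a rule-based simulator and scene graphs) models motion dynamics and tool-tissue interactions, while a diffusion model generates realistic visual appearance. An inverse pairing strategy reconstructs real surgical videos in the simulator to obtain paired data, which is used to train a video diffusion model for sim-to-real translation.

Details

Motivation: Existing surgical simulation approaches fall short on criteria that are key to clinical applicability: visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution.

Result: Experiments show qualitative and quantitative improvements over prior work. The method generalizes to unseen interaction geometries, improves downstream phase detection, and enables unsupervised video style transfer.

Insight: The core novelty is the decoupled neuro-symbolic architecture together with the inverse pairing strategy for building paired data to train the diffusion model, combining highly realistic surgical simulation with physically grounded interaction.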

Abstract: Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/


[75] DepthPolyp: Pseudo-Depth Guided Lightweight Segmentation for Real-Time Colonoscopy cs.CV | eess.SPPDF

Zhuoyu Wu, Wenhui Ou, Lexi Zhang, Pei-Sze Tan, Dongjun Wu

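TL;DR: This paper proposes DepthPolyp, a lightweight and robust framework for real-time polyp segmentation in colonoscopy. It is based on pseudo-depth-guided multi-task learning and efficient feature modulation, combining hierarchical Ghost factorization, Interleaved Shuffle Fusion, and Dynamic Group Gating. Experiments show strong cross-dataset generalization on both clean and noisy data at very low parameter and compute cost. In real surgical video evaluation it outperforms models up to 20 times larger while running at over 180 FPS.

Details

Motivation: Existing polyp segmentation methods are usually optimized on clean benchmark images and degrade noticeably when deployed in real clinical environments, which involve motion blur, specular reflections, and unstable illumination.

Result: In real surgical video evaluation on PolypGen, DepthPolyp segments better than models up to 20 times larger. With only 3.57M parameters and 0.86 GMACs, it runs at over 180 FPS on mobile devices, consistently outperforms lightweight baselines in cross-dataset generalization experiments, and remains competitive with substantially larger models.

Insight: The novelty is a pseudo-depth-guided multi-task learning framework and an efficient lightweight architecture combining hierarchical Ghost factorization, Interleaved Shuffle Fusion, and Dynamic Group Gating, which achieves robustness to real noisy scenes and real-time speed at very low computational cost.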

Abstract: Accurate polyp segmentation in colonoscopy is essential for early colorectal cancer detection, yet real-world clinical environments pose persistent challenges such as motion blur, specular reflections, and illumination instability. Most existing methods are optimized on clean benchmark images and suffer noticeable performance degradation when deployed in authentic surgical scenarios. We propose DepthPolyp, a lightweight and robust segmentation framework based on pseudo-depth-guided multi-task learning and efficient feature modulation. The architecture combines hierarchical Ghost factorization for compact feature generation, Interleaved Shuffle Fusion for low-cost cross-scale interaction, and Dynamic Group Gating for adaptive group-wise feature weighting. Extensive experiments demonstrate that DepthPolyp achieves strong cross-dataset generalization when trained on degraded data and evaluated on both clean and noisy target domains, consistently outperforming lightweight baselines and remaining competitive with substantially larger models. In real surgical video evaluation on PolypGen, DepthPolyp achieves better segmentation performance than models up to $20\times$ larger while preserving real-time inference speed. With only 3.57M parameters and 0.86 GMACs, the proposed method runs at over 180 FPS on mobile devices, making it well suited for real-time deployment in resource-constrained clinical environments. Code and pretrained weights are available at: https://github.com/ReaganWu/DepthPolyp/


[76] Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition cs.CV | cs.LGPDF

Luiz G F Carreira, Breno A Mariano, Victor H C de Melo, David Menotti, William Robson Schwartz

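TL;DR: This paper proposes an attention-aware Transformer-based aggregation network for video periocular recognition. A convolutional neural network extracts feature vectors from the periocular region, and an encoder-only Transformer adaptively aggregates frame-level features into a video representation and a feature vector for the still reference image. On the COX Face dataset, the method outperforms naive aggregation schemes, reaching 99.8% TPR@1e^{-1} and 96.6% Rank-5 in the best case.

Details

Motivation: In surveillance scenarios, unconstrained acquisition conditions can make conventional face or iris recognition infeasible. The periocular region is one of the most discriminative regions of the face, which makes it suitable as an alternative biometric modality for identity recognition.

Result: On the public COX Face dataset, the method is robust and consistently outperforms naive aggregation schemes. In the best case it achieves 99.8% TPR@1e^{-1} and 96.6% Rank-5.

Insight: The novelty is a Transformer aggregation module with attention that adaptively learns how to aggregate frame-level features, improving video periocular recognition. Applying a Transformer to feature aggregation offers a new approach to video-based biometric recognition.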

Abstract: Video periocular recognition is the task of recognizing an individual’s identity from the region around the eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, which consistently outperforms naive aggregation schemes. In the best scenario, the approach achieves $99.8\%$ TPR@$1e^{-1}$ and $96.6\%$ Rank-5.


[77] ArtMesh: Part-Aware Articulated Mesh Fields with Motion-Consistent Dynamics cs.CV

Sylvia Yuan, Dan Wang, Ravi Ramamoorthi, Xinrui Cui

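TL;DR: ArtMesh is an explicit, mesh-based method for reconstructing articulated objects from multi-view images, representing them as connected triangle meshes with per-part rigid motion. It builds on mesh-based differentiable rendering, uses part-aware restricted Delaunay remeshing to make the topology compatible with articulation, and optimizes articulation with bidirectional vertex-wise and pixel-wise motion consistency.

Details

Motivation: Existing articulated reconstruction methods based on 3D Gaussian Splatting inherit its unstructured point-based geometry, which offers no surface topology for reasoning about part boundaries or enforcing motion consistency along the object's connectivity. A mesh-native method that works directly on structured topology is therefore needed.

Result: On the newly proposed Articulate-100 benchmark (100 articulated objects spanning 16 PartNet-Mobility categories), ArtMesh outperforms prior 3DGS-based methods in joint parameter estimation and part-level geometric reconstruction, with the largest gains on objects with many movable parts.

Insight: The contributions are part-aware restricted Delaunay remeshing, which ensures that mesh triangles do not cross semantic part boundaries; joint optimization of articulation through bidirectional vertex-wise and pixel-wise motion consistency; and a mesh-native design in which part-aware dynamics act directly on the structured topology.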

Abstract: We present ArtMesh, a mesh-native method for reconstructing articulated objects explicitly as connected triangle meshes with per-part rigid motion from multi-view images in start and end states. Existing 3D Gaussian Splatting pipelines for articulated reconstruction inherit the unstructured point-based geometry of their splatting base, which provides no surface topology for reasoning about part boundaries or enforcing motion consistency along the object’s connectivity. ArtMesh instead builds on a mesh-based differentiable rendering backbone, enabling part-aware dynamics to act directly on the structured topology. To make the topology compatible with articulation, we introduce part-aware restricted Delaunay remeshing, producing connected submeshes whose triangles do not cross semantic part boundaries. The dynamic mesh field then optimizes articulation using bidirectional Vertex-wise Motion Consistency on transported mesh vertices and Pixel-wise Motion Consistency on rendered RGB-D observations. We introduce Articulate-100, a new benchmark of 100 articulated objects spanning 16 PartNet-Mobility categories. On this benchmark, ArtMesh outperforms prior 3DGS-based pipelines in joint parameter estimation and part-level geometric reconstruction, with the largest gains on objects with many movable parts.


[78] Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion cs.CV | cs.LG

Kunyang Li, Mubarak Shah, Yuzhang Shang

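TL;DR: This paper proposes ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that addresses the quadratic compute complexity and linear memory growth caused by softmax attention in autoregressive video diffusion models. It decomposes self-attention into an intra-frame softmax branch for spatial detail and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context, achieving linear time complexity and constant memory overhead while improving temporal consistency.

Details

Motivation: Autoregressive video diffusion models are powerful for streaming and interactive video generation, but their softmax self-attention incurs compute that is quadratic in sequence length and memory that grows linearly with the key-value cache, limiting scalability to long video sequences. Existing remedies such as sparse attention and KV-cache compression do not fundamentally resolve linear memory growth or streaming context management.

Result: With 75% of the layers of an autoregressive video diffusion model replaced by hybrid linear attention, the model achieves up to a 2.26x wall-clock speedup and a 54% memory reduction while maintaining comparable generation quality and improving temporal consistency.

Insight: The core idea is to split self-attention into a softmax branch for local detail and a recurrent linear branch that provides controllable long-range memory, balancing efficiency and quality. In the implementation, the recurrent state is updated only after the denoised pass to avoid corruption by noise, and all tokens share the same pre-update state to avoid within-frame information asymmetry. The paper also presents what the authors describe as the first efficient two-stage training scheme for converting a pretrained autoregressive video diffusion model into a hybrid linear attention architecture.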

Abstract: Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to compute that is quadratic in sequence length and to memory usage that grows linearly with the key-value cache, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than receiving sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to a 2.26x wall-clock speedup and a 54% memory reduction, while maintaining comparable quality and improving temporal consistency.
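Illustrative sketch (not from the paper): the inter-frame branch described above can be pictured as a fixed-size matrix state that every token of a frame reads before the state is updated once for that frame. The function names, the scalar `gate`, and the plain outer-product update are assumptions made for illustration; the actual ARL2 gating, normalization, and denoising schedule are not specified in the abstract.

```python
def read_state(state, query):
    """Return query @ state for a d_k-vector query and a d_k x d_v state."""
    d_v = len(state[0])
    return [sum(query[i] * state[i][j] for i in range(len(query)))
            for j in range(d_v)]


def update_state(state, keys, values, gate):
    """Decay the state by `gate`, then add the outer products k^T v of one frame."""
    d_k, d_v = len(state), len(state[0])
    new_state = [[gate * state[i][j] for j in range(d_v)] for i in range(d_k)]
    for k, v in zip(keys, values):
        for i in range(d_k):
            for j in range(d_v):
                new_state[i][j] += k[i] * v[j]
    return new_state


def process_frames(frames, d_k, d_v, gate=0.9):
    """frames: list of (queries, keys, values), each a list of per-token vectors."""
    state = [[0.0] * d_v for _ in range(d_k)]  # size is independent of video length
    outputs = []
    for queries, keys, values in frames:
        # Every token in the frame reads the same pre-update state.
        outputs.append([read_state(state, q) for q in queries])
        # The state is updated once per frame, after all reads.
        state = update_state(state, keys, values, gate)
    return outputs, state
```

The state stays at `d_k x d_v` entries however many frames are processed, which is the constant-memory property the abstract contrasts with a growing key-value cache.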


[79] Controlla: Learning Controllability via Graph-Constrained Latent Geometry cs.CV

Jamuna S. Murthy, Amin Karimi Monsefi, Rajiv Ramnath

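TL;DR: This paper proposes Controlla, a framework that treats controllability as a property of structured latent geometry. It learns identity and attribute factors from multimodal inputs and aligns them with graph priors via graph-constrained optimal transport, so that attributes follow graph-consistent trajectories while the reference identity is preserved. To evaluate the framework, the authors construct the AffectHuman-43K benchmark and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show improvements in controllability, identity preservation, and cross-modal alignment.

Details

Motivation: Existing controllable multimodal generation methods typically condition at inference time through prompts, guidance, or auxiliary modules, but do not explicitly structure how semantic attributes evolve, which leads to identity drift and inconsistent cross-modal behavior.

Result: On the newly constructed AffectHuman-43K benchmark, Controlla yields consistent improvements in controllability, identity preservation, and cross-modal alignment, complemented by analyses of graph sensitivity, extensibility, and robustness.

Insight: The novelty lies in modeling controllability as a structured latent geometry problem, aligning identity and attribute factors through graph-constrained optimal transport, and introducing geometry-aware evaluation metrics. Viewed objectively, constraining attribute trajectories with graph priors gives controllable multimodal generation a more structured framework for learning the latent space.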

Abstract: Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.


[80] AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling cs.CV

Ziyang Mai, Yuyao Zhang, Yu-Wing Tai

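TL;DR: AtlaVid proposes a decoupled global-local modeling framework for efficiently generating ultra-high-resolution long videos. It uses temporally scaled RoPE to generate a low-resolution, low-frame-rate global semantic proxy, which guides a high-resolution detail branch during joint denoising. This enables resolution-agnostic training and substantially improves generation efficiency.

Details

Motivation: Existing diffusion-based video generators become prohibitively expensive when scaled to ultra-high-resolution long videos, especially for long single-shot generation, which must preserve both global temporal coherence and fine-grained spatial detail.

Result: Experiments show that AtlaVid achieves a 60.9x speedup for ultra-high-resolution long video generation at significantly lower training cost, with performance that even exceeds native 4K video generators.

Insight: The novelty lies in decoupling global semantic modeling from local detail generation. Guidance from the global semantic proxy, combined with hierarchical locality-preserving attention, lets a model trained at low resolution generalize to long video synthesis at 4K and beyond, giving an efficient resolution-agnostic training paradigm.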

Abstract: Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation, where a continuous scene must preserve global temporal coherence and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlaVid, a decoupled global-local framework for efficient UHR long video generation. AtlaVid first generates a low-resolution and low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality, and asymmetric global-local attention injects aligned semantic guidance while preserving the model’s pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (>10s) video synthesis. Experiments show that AtlaVid substantially improves the efficiency of ultra-high-resolution long video generation, achieving high-quality results with a 60.9x speedup, significantly lower training cost, and even better performance than native 4K video generators.
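Illustrative sketch (not from the paper): one way to read "temporally scaled RoPE" is that each low-FPS proxy frame is assigned a rotary position stretched by a scale factor, so that a short token sequence spans a longer stretch of the timeline. The helper `scaled_temporal_position` and its `time_scale` parameter are hypothetical; only the standard RoPE angle formula is assumed, and the abstract does not give the exact scaling rule.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """Standard RoPE angles: position * base^(-2i/dim) for each channel pair i."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive channel pairs of `vec` (even length) for `position`."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def scaled_temporal_position(frame_index, time_scale):
    """Hypothetical helper: stretch a proxy frame index onto the full timeline."""
    return frame_index * time_scale
```

Under this reading, proxy frame 3 with `time_scale=4.0` receives the same rotation as frame 12 of the full-rate sequence, so the number of tokens stays fixed while the covered time span grows.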


[81] Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations cs.CV | cs.LG

Narges Babadi, Hadis Karimipour

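TL;DR: This paper studies the robustness of explanation heatmaps in vision-language models (VLMs). It finds that, under adversarial conditions, the explanations of CLIP-based models can be systematically manipulated so that they no longer reflect the model's actual reasoning, even though the prediction stays the same.

Details

Motivation: The goal is to examine how robust VLM explanation mechanisms are. Particularly in high-stakes applications that require human oversight, it remains unclear whether these explanations faithfully reflect the model's reasoning.

Result: Evaluated on ImageNet-1k, MS-COCO, and Flickr30K, the X-Shift attack consistently degrades explanation alignment with imperceptible perturbations while keeping predictions stable, whereas conventional adversarial attacks fail to reproduce this behavior even with larger perturbation budgets.

Insight: The novelty is X-Shift, an attack that targets the integrity of the explanation process rather than prediction accuracy. It exposes a fundamental limitation of current VLM explanation mechanisms and cautions against relying on explanations as indicators of trustworthiness in high-impact applications.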

Abstract: Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model’s original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.


[82] Face inpainting with Identity Preserving Latent Diffusion Models cs.CV

João Santos, Carlos Santiago, Manuel Marques

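TL;DR: This paper proposes ID-ControlNet, an identity-preserving face inpainting framework built on latent diffusion models. It conditions the diffusion process on identity embeddings extracted by a pretrained face recognition network and trains with identity consistency and triplet losses, so that identity is preserved when occluded facial regions are inpainted.

Details

Motivation: Existing diffusion-based face inpainting methods struggle to faithfully retain individual-specific facial characteristics, while existing identity-aware methods typically rely on costly fine-tuning or auxiliary supervision, or show limited robustness to diverse occlusions, poses, and facial variations. A method that effectively preserves identity during inpainting is therefore needed.

Result: Extensive experiments on CelebA-HQ, FFHQ, and the newly built E-Mask dataset show that ID-ControlNet significantly improves identity preservation over standard diffusion-based inpainting methods and performs comparably to state-of-the-art identity-aware methods.

Insight: The novelty is to feed identity embeddings from a pretrained face recognition network to ControlNet as the conditioning input, and to design identity consistency and triplet losses that explicitly align the generated face with the target identity representation, achieving robust identity-preserving inpainting without costly fine-tuning or auxiliary supervision.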

Abstract: Face inpainting techniques recover missing or occluded facial regions in a visually realistic manner, but preserving the identity in the final output remains a fundamental challenge. Identity consistency is crucial for downstream applications such as face recognition, digital forensics, and human-computer interaction, where even subtle identity distortions can significantly degrade performance or trust. Although diffusion-based generative models have recently achieved remarkable progress in image inpainting, they often struggle to faithfully retain individual-specific facial characteristics. On the other hand, existing identity-aware methods typically rely on costly fine-tuning, auxiliary supervision, or exhibit limited robustness to diverse occlusions, poses, and facial variations. To address these limitations, we propose ID-ControlNet, an identity-preserving face inpainting framework built upon latent diffusion models. Based on ControlNet architecture, our approach conditions the diffusion process on facial identity embeddings extracted from a pretrained face recognition network. This design enables reconstruction of occluded facial regions while maintaining global facial coherence and identity fidelity. Furthermore, we introduce an identity consistency and triplet loss training strategy that explicitly enforces alignment between the generated face and the target identity representation. Extensive experiments on CelebA-HQ, FFHQ, and on a new E-Mask dataset demonstrate that ID-ControlNet significantly improves identity preservation over standard diffusion-based inpainting methods, achieving performance comparable to SOTA identity-aware approaches.
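Illustrative sketch (not from the paper): the abstract names a triplet loss on identity representations but does not give its form. The snippet below shows the standard margin-based triplet loss on cosine distance between embeddings, as an assumption about what such a term could look like. The margin value and the function names are hypothetical; in practice the embeddings would come from the pretrained face recognition network.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Penalize triplets where the anchor is not closer to the positive
    (same identity) than to the negative (other identity) by `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

With the inpainted face as the anchor, the loss is zero once its embedding is closer to the target identity than to another identity by at least the margin.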


[83] MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation cs.CV | cs.AI

Shuowei Li, Yuming Zhao, Parth Bhalerao, Oana Ignat

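TL;DR: This paper proposes MAVEN, a multi-agent prompt refinement framework for improving cultural fidelity in text-to-video generation. The framework decomposes a prompt into person, action, and location dimensions, each handled by a specialized agent operating in parallel or sequentially. The authors also build a new benchmark of 243 culturally grounded prompts and 972 videos for systematic evaluation of cross-cultural generation.

Details

Motivation: Text-to-video generation has progressed rapidly in visual fidelity, but its ability to represent multiple cultures accurately and faithfully within a single prompt remains underexplored. This paper targets that gap in cultural fidelity.

Result: The evaluation combines CLIP-based metrics, VLM-as-judge assessments, and video quality measures. The results show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency.

Insight: The contributions are a multi-agent framework that decouples cultural dimensions (person, action, location) and assigns each to a specialized agent, and a multicultural benchmark for systematically evaluating cultural fidelity. Viewed objectively, decomposing a complex cultural representation task into subtasks that can be processed in parallel offers a new direction for improving the cultural understanding of generative models.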

Abstract: Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/CRAFT


[84] GeoWorld-VLM: Geometry from World Models for Vision-Language Models cs.CV | cs.AI

Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang

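TL;DR: GeoWorld-VLM is a distillation framework for vision-language models (VLMs) that extracts geometric structure from frozen camera-conditioned video world models to strengthen a VLM's understanding of spatial relations such as "left of", "on", "behind", and "between". It fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while keeping the backbone and language model frozen, so that spatial reasoning improves and the original semantic recognition ability is retained.

Details

Motivation: Modern vision-language models are strong at semantic recognition but remain brittle on elementary spatial relations. A key reason is that feature extraction in the visual pathway may compress or discard critical 3D structural cues, leaving the language model with image representations that are insufficient for reliable spatial judgment.

Result: On the What’sUp and VSR benchmarks, GeoWorld-VLM improves performance by approximately 4 percent. Applied to two distinct VLM architectures, it yields consistent improvements on both, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.

Insight: The novelty is a VLM-side distillation framework that uses a video world model as a teacher to supply synthetic multi-view spatial signals, strengthening how the VLM's visual pathway encodes geometric structure. Viewed objectively, freezing the language model and most of the backbone allows a targeted improvement in spatial reasoning while the original linguistic capabilities are preserved, making this an efficient and generalizable enhancement strategy.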

Abstract: Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model’s linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What’sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.


[85] Compositional Adversarial Training for Robust Visual Watermarking cs.CV | cs.LG

Anirudh Satheesh, Michael-Andrei Panaitescu-Liess, Andrew Xu, Georgios Milis, Heng Huang

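TL;DR: This paper proposes Compositional Adversarial Training (CAT), a plug-in framework for improving the robustness of visual watermarks. It formulates watermark robustness as a min-max problem over a space of compositional transformations, in which a differentiable sequential adversary dynamically selects attack sequences that maximally disrupt message recovery, covering the combinatorial space of realistic attacks more effectively.

Details

Motivation: Robust watermarking is conventionally trained with random post-processing augmentation, but random sampling under-covers the combinatorial space of realistic attack pipelines and rarely encounters the rare compositions that actually break detection, leading to unstable training and poor sample efficiency.

Result: CAT is evaluated on the post-generation watermarks VideoSeal 0.0, VideoSeal 1.0, and PixelSeal and on the in-generation WMAR, under single-step and two-step attack suites, on in-distribution and multiple out-of-distribution image and video benchmarks. With the same augmentation budget, CAT consistently outperforms random-augmentation baselines, with the largest gains on hard composed attacks and OOD evaluations: overall watermark capacity improves by up to 63.5% in the single-step attack setting and 13.0% in the compositional setting. In the autoregressive setting, CAT improves TPR@FPR=1% by 12% on average on difficult geometric transformations.

Insight: The novelty lies in formalizing watermark robustness as a min-max problem over a space of compositional transformations, and in designing a differentiable sequential adversary that combines straight-through Gumbel-Softmax attack selection with entropy regularization. Training is end-to-end differentiable, aggregates gradient information across attack families, and avoids collapsing to a single attack mode, yielding faster and smoother convergence. This indicates that robust visual watermarking benefits more from training against adaptive compositional adversaries than from independent random corruptions.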

Abstract: Robust watermarking is typically trained with random post-processing augmentation, but random sampling under-covers the combinatorial space of realistic attack pipelines and rarely encounters the rare compositions that actually break detection. This leads to unstable training and poor sample efficiency. We instead formulate watermark robustness as a min-max problem over a structured space of compositional transformations. We propose Compositional Adversarial Training (CAT), a plug-in framework that learns a sequential differentiable adversary that observes the current watermarked image and selects an attack family at each step to maximally disrupt message recovery. CAT combines a straight-through Gumbel-Softmax attack selection with entropy regularization, allowing the backward pass to be end-to-end differentiable and aggregate gradient information across attack families, yielding faster, smoother convergence without collapsing to a single attack mode. We evaluate CAT on post-generation watermarks VideoSeal 0.0, VideoSeal 1.0, and PixelSeal and in-generation WMAR under both single-step and two-step attack suites, on in-distribution and multiple out-of-distribution image and video benchmarks. CAT consistently outperforms random-augmentation baselines trained with the same augmentation budget, with the largest gains on hard composed attacks and OOD evaluations, improving overall watermark capacity by up to $63.5\%$ in the single-step attack setting and $13.0\%$ in the compositional setting. In the autoregressive setting, CAT improves the TPR@FPR$=1\%$ by $12\%$ on average on difficult geometric transformations. These results show that robust visual watermarking benefits from training against adaptive compositional adversaries rather than independent random corruptions.
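Illustrative sketch (not from the paper): the forward pass of a Gumbel-Softmax selection over attack families, plus the entropy used for regularization. This covers only the sampling side. The straight-through part (forward pass uses the one-hot choice, backward pass uses the gradient of the soft probabilities) needs an autograd framework and is indicated in comments only. The function names, the temperature, and the logits are assumptions.

```python
import math
import random


def gumbel_softmax_select(logits, temperature, rng):
    """Sample a one-hot attack choice and the soft probabilities behind it."""
    # Gumbel(0, 1) noise: -log(-log(u)), with u clamped away from 0.
    noisy = [l - math.log(-math.log(max(rng.random(), 1e-12))) for l in logits]
    scaled = [n / temperature for n in noisy]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    soft = [e / total for e in exps]
    choice = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == choice else 0.0 for i in range(len(soft))]
    # Straight-through estimator (autograd only): use `hard` in the forward
    # pass and route gradients through `soft` in the backward pass.
    return hard, soft


def entropy(probs):
    """Shannon entropy; a bonus on it discourages collapse to one attack."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A call such as `gumbel_softmax_select([0.0, 1.0, 2.0], 1.0, random.Random(0))` returns one selected attack family per step; repeating the call per step would give an attack sequence.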


[86] DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers cs.CV | cs.LG

Sayeh Sharify, Mahsa Salmani, Hesham Mostafa

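TL;DR: This paper proposes DiRotQ, a rotation-aware quantization framework for 4-bit Diffusion Transformers (DiTs) that targets the severe quality degradation caused by post-training quantization (PTQ) to 4-bit precision. It uses Principal Component Analysis (PCA) to identify a low-rank subspace capturing the dominant activation variance, keeps the coefficients in that subspace at higher precision, quantizes the remaining components to 4-bit, and combines this with GPTQ-based weight quantization. On PixArt-Σ, DiRotQ outperforms the prior state-of-the-art SVDQuant in FID and PSNR; on the 12B FLUX.1-dev model, it reduces memory usage by 2.1x and speeds up inference by 2.3x.

Details

Motivation: Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but are costly in memory and compute at inference. Existing 4-bit post-training quantization improves efficiency but severely degrades quality, and existing mitigations still leave a noticeable gap to FP16/BF16 performance.

Result: On PixArt-Σ over the MJHQ-30K dataset, DiRotQ achieves an FID of 15.9 (lower is better) and a PSNR of 19.1 dB (higher is better) under the INT W4A4 setting, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6). On the systems side, on a 24 GB RTX 4090 GPU, it reduces the memory usage of the 12B FLUX.1-dev model by 2.1x and delivers a 2.3x speedup over the BF16 baseline.

Insight: The contributions are rotation-aware activation quantization, which uses PCA to identify a low-rank subspace that retains the critical information; a VLM-as-a-Judge evaluation protocol, described as the first in this setting, for holistic assessment of perceptual quality and prompt alignment under diffusion model quantization; and a Triton-based custom kernel for efficient end-to-end inference.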

Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
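Illustrative sketch (not from the paper): the mixed-precision idea in the abstract, reduced to a single activation vector. It assumes an orthonormal basis is already available, with rows sorted by decreasing variance; the paper derives this basis by PCA on calibration data, which is not reproduced here. The function names, the symmetric per-vector scale, and the signed range [-8, 7] are assumptions, not DiRotQ's actual kernel.

```python
def quantize_int4(values):
    """Symmetric fake quantization: snap to integers in [-8, 7], then rescale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 7.0
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def mixed_precision_activation(x, basis, rank):
    """Rotate `x` into `basis`, keep the first `rank` coefficients exactly,
    and quantize the remaining coefficients to 4-bit."""
    coeffs = [sum(b * xi for b, xi in zip(row, x)) for row in basis]
    return coeffs[:rank] + quantize_int4(coeffs[rank:])
```

The result lives in the rotated basis. In the abstract's description the inverse rotation is fused into the layer weights offline, so no explicit rotation back is needed at inference time.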


[87] TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation cs.CV

Pengyu Yan, Akhil Gorugantu, Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi

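TL;DR: The paper proposes TRACE, a framework for multi-video event understanding and claim generation. It follows a ground-before-reasoning strategy: it first builds a structured, text-searchable timeline for each video using OCR and object detection, then uses a text-only LLM for query-aware evidence localization, and finally steers LVLM-based claim generation and cross-video citation consolidation.

Details

Motivation: Existing LVLMs underperform on multi-video event understanding because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards.

Result: Experiments on MAGMaR 2026 and WikiVideo show that structured evidence grounding markedly improves factual completeness and attribution fidelity. On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 to 0.811 and citation recall from 0.440 to 0.628, and it attains state-of-the-art results on the official MAGMaR 2026 leaderboard.

Insight: The novelty is an evidence grounding-guided framework that decouples evidence localization from visual reasoning: a text-only LLM first performs efficient query-aware evidence retrieval, and the retrieved frames and their summaries then guide the LVLM. This addresses the limited context window of LVLMs and their difficulty with precise localization, improving the efficiency and accuracy of multi-video reasoning.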

Abstract: Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision-language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We introduce TRACE, an evidence grounding-guided framework that follows a ground-before-reasoning strategy for multi-video event reasoning. Our approach first builds a structured, text-searchable timeline for each video using OCR and object detection. A text-only LLM then conducts query-aware evidence localization, selecting relevant moments prior to any downstream visual reasoning. The retrieved frames and their grounding summaries are subsequently used to steer LVLM-based claim generation and cross-video citation consolidation. Experiments on MAGMaR 2026 and WikiVideo demonstrate that structured grounding markedly boosts factual completeness and attribution fidelity. On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 to 0.811 compared to an unguided Qwen3-VL-30B baseline, with especially strong improvements in citation recall from 0.440 to 0.628. The method also attains state-of-the-art results on the official MAGMaR 2026 leaderboard.


[88] EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers cs.CV

Zongyuan Yang, Mingjing Yi, Wanli Ma, Chenzhuo Fan, Bocheng Li

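TL;DR: EVA01 is a unified multimodal large language model framework that, through a Mixture-of-Transformers (MoT) architecture, integrates 3D meshes as a native modality, enabling native 3D mesh understanding, generation, and context-aware editing.

Details

Motivation: Existing methods treat 3D as an external output rather than a native component of the multimodal sequence, and diffusion-based reconstruction models decouple semantic understanding from geometric reasoning, relying on dense 2D pixel priors. A systematic analysis of how geometric manifolds align with MLLM feature spaces has been missing.

Result: EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context, multi-turn geometric editing with identity preservation, a capability that stateless reconstruction pipelines cannot provide.

Insight: The core idea is to use a Mixture-of-Transformers architecture to decouple the model into an Understanding Expert and a Generation Expert, coupled through shared global self-attention with hard modality routing. This aligns the semantic latent space of the MLLM backbone with the geometric manifold and enables direct transfer of multimodal priors without intermediate 2D representations.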

Abstract: This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ($E_{\mathrm{und}}$) and a structurally mirrored Generation Expert ($E_{\mathrm{gen}}$), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01


[89] Accelerating Rectified Flow Models via Trajectory-Aware Caching cs.CV

Xiao Liu, Kai Liu, Naiyang Guan, Hongliang Lu, Zhixin Wang

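TL;DR: This paper proposes TACache (Trajectory-Aware Cache), a training-free framework for accelerating sampling in rectified flow (RF) models. Following a skip-then-compensate paradigm, it orthogonally decomposes the discrete velocity acceleration into a parallel component and an orthogonal residual, isolating the magnitude and directional sources of per-step approximation error. Experiments show speedups of up to 4.14x on text-to-image generation and 2.11x on text-to-video generation, with better results than existing caching methods on all reference-based fidelity metrics.

Details

Motivation: Diffusion and rectified flow models generate high-quality images and videos, but their iterative velocity-field evaluations are computationally expensive. Existing caching methods accelerate sampling by skipping timesteps, but their coarse approximations accumulate error over long skip intervals and degrade quality under aggressive acceleration.

Result: In experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B, TACache achieves up to a 4.14x speedup on text-to-image generation and up to 2.11x on text-to-video generation, outperforming prior cache-based methods on all reference-based fidelity metrics.

Insight: The novelty is a training-free, trajectory-aware caching framework that isolates error sources through orthogonal decomposition and reconstructs skipped velocities by combining offline statistics with the sample's online historical direction, without additional model evaluations. This offers an efficient, fidelity-preserving approach to sampling acceleration.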

Abstract: Diffusion and rectified flow (RF) models generate high-fidelity images and videos, but their iterative velocity-field evaluations are computationally expensive. Existing caching methods accelerate sampling by skipping timesteps, yet their coarse approximations introduce accumulated errors over long skip intervals and degrade quality under aggressive acceleration. We propose TACache (Trajectory-Aware Cache), a training-free acceleration framework following a skip-then-compensate paradigm. TACache performs an orthogonal decomposition of discrete velocity acceleration along the RF trajectory into a parallel component and an orthogonal residual, isolating the magnitude and directional sources of per-step approximation error. The framework operates in two stages: offline, cumulative variation thresholds on the magnitude and direction indicators yield the skip schedule and bound how far each skip interval may extend; online, at each skipped step the offline statistics are combined with the sample’s historical orthogonal direction to reconstruct the skipped velocity without additional model evaluations. Experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B show that TACache achieves up to 4.14x speedup on text-to-image generation and 2.11x speedup on text-to-video generation, with consistent improvements over prior cache-based methods on all reference-based fidelity metrics. Code will be released soon.
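Illustrative sketch (not from the paper): the orthogonal decomposition named in the abstract, written for two consecutive velocity vectors. It assumes the discrete acceleration is the difference of consecutive velocities and that it is projected onto the previous velocity. The abstract does not state which reference direction TACache uses, so both choices and the function name are assumptions.

```python
def decompose_acceleration(v_prev, v_curr):
    """Split a = v_curr - v_prev into a part parallel to v_prev (magnitude
    change) and an orthogonal residual (direction change)."""
    accel = [c - p for p, c in zip(v_prev, v_curr)]
    norm_sq = sum(p * p for p in v_prev)
    if norm_sq == 0.0:
        # No reference direction: treat the whole acceleration as residual.
        return [0.0] * len(accel), accel
    coeff = sum(a * p for a, p in zip(accel, v_prev)) / norm_sq
    parallel = [coeff * p for p in v_prev]
    orthogonal = [a - q for a, q in zip(accel, parallel)]
    return parallel, orthogonal
```

The parallel part measures how much the velocity grows or shrinks along its current direction, and the orthogonal part measures how much it turns, which matches the magnitude and direction indicators the abstract mentions.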


[90] CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris cs.CV | cs.AI

Zaid Aljundi, Zahra F. Rahmatullah, Mostafa Elemam, Abdullah Moosa

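TL;DR: This paper presents CANSURF, an ASV-view dataset and benchmark dedicated to detecting and tracking aluminum cans on the water surface. The dataset contains about 7.3k raw images, expanded through augmentation to about 57k training/validation images covering diverse lighting and water states. The paper also evaluates a family of detectors and detector-tracker pipelines tailored to surface operations.

Details

Motivation: Small, reflective targets on the water surface, such as aluminum cans, are hard to detect at distance under glare, ripples, and partial submersion, which is a practical bottleneck for autonomous clean-up. No open dataset currently targets aluminum cans on water from a surface-level viewpoint.

Result: Training YOLOv11 on CANSURF improves performance 12x over training on generic datasets. Experiments show that YOLOv11+ByteTrack gives the most stable tracks (fewer identity switches) and the strongest multi-object accuracy, while YOLOv11+SAHI achieves higher recall on far-field cans at the cost of lower precision on full-context inputs. For the single-can pickup task, YOLOv11+SAHI detects the largest number of cans.

Insight: The contribution is CANSURF, described as the first open ASV-view dataset for detecting and tracking aluminum cans on water, which fills a gap in the field. Its systematic benchmark compares detector-tracker pipelines for specific tasks such as single-can pickup, providing data and guidance on algorithm choice for the vision module of autonomous surface clean-up systems.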

Abstract: Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents an ASV vision system and a new surface-can dataset, CANSURF. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations was benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset’s value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile (single-can pickup with approach and grab), YOLOv11+SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.


[91] VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment cs.CV | cs.AI | cs.LG

Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar

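TL;DR: This paper proposes VolTA-3D, a self-supervised 3D Vision Transformer framework for brain MRI. It jointly aligns global class-style tokens and local patch tokens and enforces fine-grained structural reconstruction, aiming to learn transferable representations from unlabeled 3D volumes.

Details

Motivation: Most current 3D models for brain MRI are specialized for a single task such as segmentation or classification and generalize poorly across datasets, imaging protocols, and downstream tasks, which limits their clinical utility. The paper targets the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing self-supervised methods, in order to learn more transferable representations.

Result: On multiple out-of-distribution downstream tasks (hippocampal segmentation, sex classification, and Alzheimer's disease versus healthy control classification), representations learned by VolTA-3D outperform randomly initialized baselines, showing improved transferability and robustness under domain shift.

Insight: The key idea is to jointly enforce global semantic consistency and local structural learning within a student-teacher paradigm. Combined with fine-grained reconstruction, this global-local alignment enables broader concept learning from brain MRI data and supports effective multi-task downstream performance through task-specific fine-tuning, a step towards generalizable and clinically viable 3D models.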

Abstract: Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabelled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer’s disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific fine-tuning, a step towards generalizable and clinically viable 3D models.


[92] 3DPhysVideo: Consistency-Guided Flow SDE for Video Generation via 3D Scene Reconstruction and Physical Simulation cs.CV | cs.AI | cs.GR

Hwidong Kim, Yunho Kim, Tae-Kyun Kim

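TL;DR: This paper proposes 3DPhysVideo, a training-free method for generating physically realistic videos from a single image. It repurposes an off-the-shelf video model in two stages. First, the model serves as a novel view synthesizer, with rendered point clouds guiding the image-to-video (I2V) flow model to reconstruct complete 360-degree 3D scene geometry. Second, after physics solvers are applied to that geometry, the physically simulated point cloud guides the same I2V flow model to synthesize the final high-quality video. The method produces physically plausible videos in diverse experiments and outperforms the strongest existing baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation.

Details

Motivation: Existing video generative models often produce visual artifacts that violate physical dynamics. Recent work such as PhysGen3D tackles single image-to-3D physics through mesh reconstruction and physically based rendering, but modeling fluid dynamics, multi-object interactions, and photorealism remains challenging.

Result: In experiments covering multi-object and fluid-interaction scenes, the method outperforms state-of-the-art baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation, while running efficiently on a single consumer GPU.

Insight: The contributions include Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias to enforce consistency with the conditional inputs, so the model can be reused for both 3D reconstruction and simulation-guided video generation. The pipeline also uses one off-the-shelf video model as both novel view synthesizer and video generator, with point-cloud guidance unifying physical simulation and generation.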

Abstract: Video generative models have made remarkable progress, yet they often yield visual artifacts that violate grounding in physical dynamics. Recent works such as PhysGen3D tackle single image-to-3D physics through mesh reconstruction and Physically-Based Rendering, but challenges remain in modeling fluid dynamics, multi-object interactions, and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf video model for two stages. First, we use it as a novel view synthesizer to reconstruct complete 360-degree 3D scene geometry by guiding the image-to-video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, the physically simulated point cloud is used to guide the same I2V flow model to synthesize final, high-quality videos. Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency with the conditional inputs, allowing us to effectively repurpose the model for both 3D reconstruction and simulation-guided video generation. In diverse experiments, including multi-object and fluid-interaction scenes, our method successfully bridges the gap from single images to physically plausible videos, while remaining efficient to run on a single consumer GPU. It outperforms state-of-the-art baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation.


[93] Training-Free Occluded Text Rendering via Glyph Priors and Attention-Guided Semantic Blending cs.CV

Jingqi Hou, Hongtian Wang

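TL;DR: This paper proposes a training-free framework for occluded text rendering built on a pretrained FLUX.1-dev backbone. By decoupling text-layout preservation from occluder insertion and using glyph priors with attention-guided semantic blending, it substantially improves text readability and occlusion alignment in occluded scenes.

Details

Motivation: When rendering occluded text, existing text-to-image generators suffer from occluder drift, distorted text, or text that floats on top of the occluder. The paper aims to achieve more stable object-on-text compositions.

Result: Experiments on representative occluded-text scenarios show substantially improved text readability and competitive occlusion alignment without any model fine-tuning.

Insight: The contributions include a restarted dual-stream inference framework that decouples the two sub-tasks, a spectral glyph prior that stabilizes text structure, and localization of the text region from attention and glyph support followed by hard mask-guided K/V feature replacement. Together these enable training-free occluded text rendering.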

Abstract: We present a training-free framework for occluded text rendering with a pretrained FLUX.1-dev backbone. The task requires a model to render recognizable typography and place an occluding object over the intended text region. This setting remains difficult for existing text-to-image generators: the occluder often drifts away from the text, while the text may be distorted or appear to float on top of the occluding object. To address this problem, we propose a restarted dual-stream inference framework that decouples text-layout preservation from occluder insertion. A Base Stream provides a clean typographic reference and same-step key/value (K/V) features, while the Edit Stream is conditioned on the occlusion prompt. We further adopt the spectral glyph-prior idea from FreeText and adapt it to stabilize the target text structure during early-to-mid denoising. In the reasoning pass, our method localizes the target text, estimates a text-band region from token-conditioned attention and glyph support, and derives an anchor-aware hard fusion mask for the occluder. In the final edit pass, generation restarts from the same initial noise and applies hard mask-guided image-token K/V replacement at selected attention sites, preserving the Base layout outside the mask while injecting the occluder appearance from the Edit Stream inside the mask. Experiments on representative occluded text scenarios demonstrate substantially improved text readability and competitive occlusion alignment, yielding more stable object-on-text compositions without any model fine-tuning.


[94] Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data cs.CV | cs.AI | cs.LG

Shiwon Kim, Yu Rang Park

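TL;DR: This paper proposes a post-hoc multimodal alignment method based on relative representations for fine-grained cross-modal matching when paired data are limited. Images and texts are represented by their token-level similarities to a set of learnable anchors in each modality space, and the anchors are trained to induce consistent cross-modal similarity patterns, which captures fine-grained cross-modal structure.

Details

Motivation: Existing post-hoc multimodal alignment methods focus mainly on aligning global representations and miss patch-token relations, which limits transfer to tasks that need fine-grained cross-modal matching beyond coarse sample-level semantics.

Result: On zero-shot classification, cross-modal retrieval, and zero-shot segmentation, the method outperforms existing methods by a substantial margin, even though it learns only the anchors and uses no heavy projection layers.

Insight: The key idea is to learn token-level cross-modal structure through relative representations, with learnable anchors inducing consistent similarity patterns. This offers a new route to effective post-hoc multimodal alignment with limited paired data and underlines the importance of modeling fine-grained cross-modal structure.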

Abstract: Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.
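
Illustrative sketch (not the authors' implementation): a relative representation describes each token by its similarities to a set of anchors rather than by its raw coordinates, so tokens from encoders with different dimensionalities end up in a shared K-dimensional space. In the snippet below the anchors are fixed by hand and cosine similarity is assumed; in the paper the anchors are learned, and the names `cosine` and `relative_representation` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(tokens, anchors):
    """Map each token embedding to its vector of similarities to the anchors."""
    return [[cosine(t, a) for a in anchors] for t in tokens]


# Hypothetical anchors: K = 2 per modality, living in spaces of different size.
image_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-D image token space
text_anchors = [[0.0, 1.0], [1.0, 0.0]]              # 2-D text token space

image_rel = relative_representation([[2.0, 0.0, 0.0]], image_anchors)
text_rel = relative_representation([[0.0, 3.0]], text_anchors)
# Both tokens are now 2-D similarity profiles and can be compared directly.
```

Training would adjust the anchors so that matched image-text pairs produce similar profiles; only the anchors carry parameters, which is consistent with the abstract's remark that no heavy projection layers are needed.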


[95] EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices cs.CV | cs.RO

Liuchuan Yu, Erdem Murat, Beichen Wang, Yan Zeng, Tingting Luo

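TL;DR: This paper presents EgoKit, a toolkit that addresses the fragmentation of egocentric video collection across heterogeneous devices (Android phones, iPhones, iPads, smart glasses, and XR headsets). With a unified recording workflow, locally stored video, and a uniform log format, it supports synchronized ego-view and wrist-view capture across devices, and on XR devices it additionally records head pose and hand tracking.

Details

Motivation: Egocentric video is a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale is fragmented in practice: each host device (phone, smart glasses, XR headset) has a different SDK, raw camera access policy, and set of limitations on external USB cameras and tracking. As a result, synchronized ego-view and wrist-view capture usually depends on a single proprietary platform or on one-off rigs that do not transfer across devices.

Result: The abstract reports no quantitative experiments or benchmarks. It states that EgoKit provides the same recording workflow on six heterogeneous host devices and offers companion accessories (wrist cameras, a head strap, and a USB-C hub) that add wrist-view capture without custom hardware fabrication.

Insight: The contribution is a unified egocentric data collection toolkit across heterogeneous devices. A standardized workflow and log format address device fragmentation, and the accessories extend wrist-view capture, which lowers collection cost and improves portability for embodied AI research.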

Abstract: Egocentric video is increasingly used as a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale remains fragmented in practice: each candidate host device, such as an Android phone, iPhone, iPad, smart glasses, or extended reality (XR) headset, exposes a different SDK, a different policy on raw camera access, and different limitations on external USB cameras and on-device tracking. Synchronized ego-view and wrist-view capture is therefore typically obtained by either committing to a single proprietary platform or building one-off rigs that do not transfer across devices. To address this gap, we present EgoKit, a toolkit that exposes the same egocentric recording workflow across six heterogeneous host devices. Across all supported devices, EgoKit presents the same recording interaction and produces locally stored video with a uniform log format; on XR headsets, it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication. EgoKit is available at https://egokit.chuange.org/.


[96] HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction cs.CV

Xi Liu, Weiwei Sun, Zhou Ren, Chris Broaddus, Siyu Huang

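TL;DR: This paper proposes HAD (Hallucination-Aware Diffusion prior), which addresses the hallucination artifacts that diffusion priors introduce in sparse-view 3D reconstruction. A pretrained feedforward novel view synthesis network estimates pixel-wise hallucination score maps for augmented images. Unreliable pixels are selectively masked during reconstruction, and multiple versions of each augmented image, generated by conditioning on different input views, are fused to reduce hallucinated content.

Details

Motivation: Diffusion priors improve sparse-view 3D reconstruction quality but introduce hallucinated artifacts that are inconsistent with the input views, which harms the accuracy of the final 3D model.

Result: The method achieves state-of-the-art performance on multiple novel view synthesis (NVS) benchmarks and substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction.

Insight: The key idea is to quantify hallucination using the multi-view reasoning ability of a pretrained NVS network, then use pixel-wise masking and cross-view fusion to exploit the diffusion prior selectively, adding detail while suppressing artifacts.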

Abstract: Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content (artifacts inconsistent with the input views) into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis. Our project is publicly available at https://xiliu8006.github.io/HAD-Project-website/.
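
Illustrative sketch (an assumption, not the paper's loss): one simple way to use a pixel-wise hallucination score map is to exclude high-score pixels from the photometric loss that supervises the 3D model with a diffusion-augmented view. The function name `masked_l1`, the hard threshold of 0.5, and the L1 loss are all hypothetical choices; the abstract does not specify how the scores enter the objective.

```python
def masked_l1(rendered, augmented, halluc_score, threshold=0.5):
    """Mean absolute error over pixels whose hallucination score is below threshold.

    All three inputs are 2-D lists (rows of pixel values) of the same shape.
    """
    total, count = 0.0, 0
    for r_row, a_row, s_row in zip(rendered, augmented, halluc_score):
        for r, a, s in zip(r_row, a_row, s_row):
            if s < threshold:  # trust this augmented pixel
                total += abs(r - a)
                count += 1
    return total / count if count else 0.0


# A 1x2 toy image: the second pixel is flagged as hallucinated and ignored.
loss = masked_l1(
    rendered=[[0.5, 0.5]],
    augmented=[[0.75, 0.0]],
    halluc_score=[[0.1, 0.9]],
)
```

A soft weighting such as `(1 - score)` per pixel would be an equally plausible variant; the hard mask is used here only because the abstract speaks of "selective masking".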


[97] Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction cs.CV | cs.AI | cs.CL | cs.LG

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

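TL;DR: This paper addresses the perceptual bottleneck of vision-language models in complex visual planning and proposes a planning approach based on pattern induction. It formulates the Thinking with Images process as gradually building an internal world model and introduces Pattern Inference and Pattern Induction strategies that improve planning efficiency while maintaining accuracy.

Details

Motivation: Current vision-language models have limited ability to plan from raw visual input and struggle when input complexity exceeds their one-step perception capability. Recent advances in Thinking with Images suggest decomposing perception into simpler steps, but existing approaches still face an efficiency bottleneck.

Result: Experiments in the FrozenLake, Crafter, and CubeBench domains show that the approach achieves a desirable balance between accuracy and efficiency and can solve tasks far beyond the model's initial capabilities.

Insight: The contribution is to formalize Thinking with Images as a tool for gradually building an internal world model, and to propose Pattern Inference (actively recognizing known visual patterns) and Pattern Induction (online inductive learning that treats visual patterns as reusable experts), which yields efficient training-free planning.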

Abstract: Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.


[98] Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions cs.CV

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

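TL;DR: This paper proposes FaithTrace, a method for generating faithful zero-shot textual explanations. It computes the directional derivative of the class logit along the text-induced direction as an influence score that measures the explanation's actual effect on the model's prediction, which makes image classifiers more transparent.

Details

Motivation: Existing zero-shot textual explanation methods often miss the features that truly drive the prediction, so their faithfulness to the evidence behind the model's decision is limited. A method that directly measures an explanation's influence on the prediction is needed.

Result: Experiments show that FaithTrace yields more faithful explanations than baselines and reflects the basis of model decisions more accurately. The paper also proposes quantitative evaluation metrics based on the influence score to help fill the gap in faithfulness evaluation for textual explanations.

Insight: The key idea is to use a directional derivative as an influence score that quantifies the direct effect of a textual explanation on the prediction, and to extend it into evaluation metrics that make faithfulness measurable.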

Abstract: Zero-shot textual explanations aim to make image classifiers more transparent by probing their internal representations, without relying on task-specific supervision or LVLMs. However, existing methods often miss the features that truly drive the prediction, resulting in limited faithfulness to the evidence underlying the model’s decision. To address this, we propose FaithTrace. Motivated by the idea that faithful explanations should describe concepts that strongly influence the prediction, FaithTrace directly measures how much the representation induced by the explanation changes the class logit. We introduce an influence score, computed as the directional derivative of the class logit along the text-induced direction in the classifier’s feature space, and use it as a proxy for faithfulness. Moreover, we extend this influence score into quantitative evaluation metrics, helping fill the gap in faithfulness evaluation for textual explanations. Experiments show that FaithTrace yields more faithful explanations than baselines, facilitating a more accurate understanding of the model. The code will be publicly released.
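
Illustrative sketch (not the authors' code): the directional derivative of a class logit at feature f along a direction d measures how fast the logit changes when f is nudged along d. The snippet approximates it with a central finite difference; in practice one would likely use automatic differentiation instead. The function `influence_score`, the normalisation of d to unit length, and the toy linear logit are assumptions for illustration.

```python
import math


def influence_score(logit_fn, feature, direction, eps=1e-5):
    """Central-difference estimate of the directional derivative of logit_fn.

    `direction` stands in for the text-induced direction in feature space
    and is normalised to unit length here (an assumption of this sketch).
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    plus = [f + eps * u for f, u in zip(feature, unit)]
    minus = [f - eps * u for f, u in zip(feature, unit)]
    return (logit_fn(plus) - logit_fn(minus)) / (2.0 * eps)


def toy_logit(f):
    """Hypothetical linear class logit with weights (2, -1)."""
    return 2.0 * f[0] - 1.0 * f[1]


# For a linear logit the result equals the weight vector dotted with the
# unit direction, so a direction aligned with the weights scores highest.
score = influence_score(toy_logit, feature=[0.5, 0.5], direction=[3.0, 4.0])
```

Under this reading, an explanation whose text embedding points along a direction with a large positive score describes a concept that pushes the prediction up, which is the sense of "influence" the abstract uses as a proxy for faithfulness.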


[99] Towards Generalized Image Manipulation Localization via Score-based Model cs.CV

Yunfei Wang, Bo Du, Zhe Yang, Xin Liu, Zhiyu Lin

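TL;DR: This paper proposes DiffIML, a new image manipulation localization framework based on score-based generative modeling, aimed at the weak generalization of existing discriminative methods. It approximates the gradient of the log-likelihood to capture the intrinsic geometric topology of mask distributions and uses diffusion models as the numerical solver, achieving strong generalization on multiple benchmarks.

Details

Motivation: Most existing image manipulation localization methods are discriminative models that learn a fixed decision boundary, which tends to overfit to specific training artifacts and generalizes poorly to unseen manipulation types.

Result: Extensive experiments on eight non-generative and three generative benchmarks show that DiffIML consistently outperforms state-of-the-art methods under two distinct protocols, with marked generalization gains on diverse unseen datasets.

Insight: The contribution is to bring score-based generative modeling to image manipulation localization, capturing the geometric topology of mask distributions to avoid the brittleness of discriminative models. A lightweight mask-specific VAE, a decoupled architecture, edge supervision, and an error prior address the efficiency and stability bottlenecks of standard diffusion models.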

Abstract: With the rapid evolution of synthetic media, Image Manipulation Localization (IML) has emerged as a critical component in multimedia forensics for ensuring the integrity of digital content. However, generalization remains a core challenge, as existing discriminative methods typically learn a fixed decision boundary that tends to overfit to specific training artifacts and fails to adapt to unseen manipulation types. To address this, we propose DiffIML, a novel framework that introduces score-based generative modeling to IML. Diverging from the direct estimation of hard boundaries, DiffIML approximates the score function, the gradient of the log-likelihood, to capture the intrinsic geometric topology of mask distributions. This paradigm leverages structural priors to iteratively recover coherent masks from noise, thereby circumventing the brittleness associated with discriminative models. Under this formulation, diffusion models serve as an effective numerical solver for the learned score function. To ensure practicality, we resolve the efficiency and stability bottlenecks of standard diffusion, respectively, by (1) using a Lightweight Mask-Specific VAE for fast latent-space processing together with a decoupled architecture with a lightweight denoising UNet, and (2) applying edge supervision and an error prior to mitigate error accumulation during sampling. Extensive experiments under two distinct protocols on eight non-generative and three generative benchmarks demonstrate that DiffIML consistently outperforms state-of-the-art methods, yielding remarkable generalization improvements on diverse unseen datasets. The code will be publicly available.
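
Illustrative sketch (a toy stand-in, not DiffIML): the score function is the gradient of the log-density, and a sampler that repeatedly follows the score plus injected noise recovers samples from that density. The snippet uses an analytic 1-D Gaussian score and unadjusted Langevin dynamics in place of a learned score network and a diffusion sampler; the target parameters, step size, and function names are assumptions.

```python
import math
import random


def gaussian_score(x, mu=2.0, sigma=0.5):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / (sigma * sigma)


def langevin_sample(score_fn, x0=0.0, step=0.01, n_steps=500, rng=None):
    """Run unadjusted Langevin dynamics from x0 and return the final state."""
    if rng is None:
        rng = random.Random(0)
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift along the score, then add Gaussian noise scaled by sqrt(step).
        x = x + 0.5 * step * score_fn(x) + math.sqrt(step) * noise
    return x


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [langevin_sample(gaussian_score, rng=rng) for _ in range(500)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The chains start far from the mode at 0.0 and are pulled toward the target mean by the score alone, without any explicit decision boundary. This is the property the abstract contrasts with discriminative models, here shown on a scalar instead of a mask.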


[100] LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map cs.CV

Jinzhou Tang, Sidi Liu, Waikit Xiu, Weixing Chen, Keze Wang

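TL;DR: This paper proposes LASAR, an architecture with a dual-memory system (episodic memory and a semantic cognitive map) trained with the contrastive objective of Spatio-temporal Contextual Representation Learning (ST-CRL). It targets the lack of fine-grained spatial relationship encoding in embodied agents and improves performance on zero-shot generalization tasks.

Details

Motivation: Existing embodied AI approaches based on action (e.g., VLN) and reasoning (e.g., EQA) share a limitation: they lack a learning signal that forces agents to encode fine-grained spatial relationships (such as topology or distance) from long-range, fragmented experiences, which makes it hard to verify whether an agent has really built an internal spatial model.

Result: On the VLN-CE and VSI-Bench benchmarks, the method achieves 2%-3.5% gains in zero-shot generalization, and the proposed cognitive map is shown to have high self-consistency.

Insight: The contribution is the combination of a dual-memory system with the ST-CRL contrastive objective, which builds sample pairs from cognitive queries generated from annotated spatio-temporal context in simulation. This forms an internal cognitive map from the agent's experience and promotes explicit learning of spatio-temporal reasoning.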

Abstract: A fundamental challenge in embodied AI is verifying whether agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent’s experiences. Experiments demonstrate that our method achieves 2%-3.5% gains in zero-shot generalization on both the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.
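
Illustrative sketch (InfoNCE is substituted here; the abstract only says "a contrastive objective"): a contrastive loss pulls an anchor embedding toward its positive and away from negatives. For ST-CRL, the positive could be a memory embedding that shares spatio-temporal context with the query, although the pairing rule is the paper's and is not reproduced here. The function name `info_nce`, cosine similarity, and the temperature of 0.1 are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


query = [1.0, 0.0]
# Positive aligned with the query: low loss.
loss_aligned = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Positive orthogonal to the query while a negative is aligned: high loss.
loss_misaligned = info_nce(query, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Minimising this loss over many query-memory pairs is what would force embeddings of spatially related observations to cluster, which is one way to read "forming the internal cognitive map".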


[101] Controlling Decision Drift in Multimodal Sentiment Analysis with Missing Modalities cs.CV

Chenglizhao Chen, Yuchen Cao, Xinyu Liu, Mengke Song, Guisheng Zhang

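TL;DR: This paper proposes a two-level reference alignment framework that addresses decision drift caused by missing modalities and quality imbalance in multimodal sentiment analysis. It introduces stable references at the feature representation and sentiment decision levels, constraining the representations of different modality combinations and suppressing unreliable modalities so that predictions stay stable under diverse missing-modality patterns.

Details

Motivation: Real-world multimodal data often suffer from missing modalities and quality imbalance. Features that existing methods generate for missing modalities may deviate from the true distributions and mislead prediction, and unreliable modalities may dominate fusion, causing representation shift and unstable sentiment representations.

Result: Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across missing-modality settings. Under full-modality input, the method achieves state-of-the-art performance, with ACC of 86.28% and 85.88% and F1 of 86.24% and 85.86%, respectively.

Insight: The contribution is a two-level reference alignment mechanism. Feature-level alignment uses complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space, while decision-level alignment suppresses unreliable modalities through prototype retrieval and voting to enforce cross-modal consistency. This offers a robust approach to representation learning and decision fusion when modalities are missing.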

Abstract: Multimodal sentiment analysis relies on textual, acoustic, and visual signals, yet real-world data often suffer from missing modalities and quality imbalance. Existing methods generate features for missing modalities from available ones, but differences in expression mechanisms and sentiment dynamics across modalities may cause the generated features to deviate from true distributions and mislead prediction. In addition, unreliable modalities may dominate fusion, resulting in representation shift across modality combinations and unstable sentiment representations. To address these challenges, we propose a two-level reference alignment framework. The framework introduces stable references at the feature representation and sentiment decision levels to improve robustness when modalities are missing. First-level reference alignment leverages complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space. Second-level reference alignment enforces cross-modal consistency at the decision level by suppressing unreliable modalities through prototype retrieval and voting. As a result, the framework maintains stable and reliable sentiment predictions under diverse missing-modality patterns. Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across various missing-modality settings. Under full-modality input, the proposed method achieves state-of-the-art performance, with ACC of 86.28% and 85.88%, and F1 of 86.24% and 85.86%.


[102] HighSync: High-Quality Lip Synchronization via Latent Diffusion Models cs.CV

Saeed Firouzi Daghigh, Majid Iranpour Mobarekeh, Mostafa Alavi, Mehdi Bagheri

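TL;DR: HighSync is an end-to-end diffusion-based lip synchronization framework that generates high-fidelity talking-face videos aligned with arbitrary input audio. It operates at 512*512 resolution, addresses the trade-off between image quality and synchronization accuracy in existing methods, and achieves state-of-the-art performance in both perceptual quality and synchronization accuracy.

Details

Motivation: Existing lip synchronization methods struggle to deliver both image quality and synchronization accuracy, often producing visually degraded outputs or temporally inconsistent lip movements. HighSync aims to resolve this and offer a viable solution for professional production environments such as film and broadcast.

Result: In comprehensive evaluations across perceptual quality and synchronization accuracy metrics, HighSync achieves state-of-the-art performance on both. According to the authors, it is the first lip sync model to operate natively at 512*512 resolution.

Insight: The contributions include identifying and systematically eliminating a data leakage phenomenon in prior work that prevented models from developing a genuine dependence on the audio signal. Native operation at high resolution also opens up professional applications.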

Abstract: We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512*512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at: https://github.com/saeed5959/high_sync


[103] DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios cs.CV | cs.AI | cs.CL

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C. V. Jawahar

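TL;DR: This paper proposes DriveSafe, a framework for risk detection and safety suggestions in driving scenarios. It generates structured natural language descriptions enriched with multimodal motion, spatial, and depth cues for fine-grained, risk-aware scene understanding, and fine-tunes a lightweight adapter module on caption-risk pairs to inject domain knowledge into the base LLM.

Details

Motivation: Zero-shot multimodal large language models underperform domain-specific methods on fine-grained, spatially grounded risk assessment. Closing this gap requires better situational awareness and risk mitigation for autonomous driving systems in safety-critical environments.

Result: Exhaustive experiments on the DRAMA benchmark show that DriveSafe achieves state-of-the-art performance, with significant gains over both zero-shot MLLMs and prior domain-specific baselines.

Insight: The key idea is to ground risk assessment in an explicit, language-based scene representation. Structured captions and lightweight adapter fine-tuning inject domain-specific knowledge into a general-purpose large model efficiently and yield significant performance gains.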

Abstract: Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe


[104] CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model cs.CV

Houji Wen, Jiangyong Yu, Jun Li, Dawei Yang

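TL;DR: This paper proposes CAR-SAM, a unified post-training quantization framework designed for the Segment Anything Model (SAM). It targets attention dissipation and reconstruction oscillation, which arise from SAM's cross-attention architecture under low-bit quantization, by introducing a MatMul-Aware Compensation mechanism and a Joint Cross-Attention Reconstruction strategy, achieving stable 4-bit quantization.

Details

Motivation: Deploying SAM on resource-constrained devices is difficult because of its high computational and memory demands, and existing post-training quantization methods do not handle the problems specific to the cross-attention architecture in the SAM decoder, such as attention dissipation and reconstruction oscillation.

Result: CAR-SAM quantizes SAM-B and SAM-L to 4 bits and surpasses existing methods by 14.6% and 6.6% mAP, respectively.

Insight: The contributions, both designed around SAM's cross-attention structure, are the MAC mechanism, which transfers activation quantization error to the weights, and the JCAR strategy, which jointly reconstructs the coupled attention branches to stabilize convergence. Both offer reusable ideas for error compensation and joint optimization when quantizing complex Transformer architectures.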

Abstract: Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder, and accuracy degrades under low-bit quantization. This degradation primarily stems from two challenges posed by SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. Firstly, to mitigate attention dissipation, we introduce a MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to preceding linear weights. Secondly, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L, respectively.
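
Illustrative sketch (a plain baseline, not MAC or JCAR): for readers unfamiliar with what "4-bit precision" means in practice, the snippet shows symmetric round-to-nearest quantization of a weight vector to the integer range [-8, 7] and the reconstruction error it introduces. Per-tensor scaling and the function names are assumptions; methods like CAR-SAM exist because this naive scheme loses too much accuracy at 4 bits.

```python
def quantize_symmetric(weights, bits=4):
    """Quantize floats to signed integers with one shared scale (per tensor)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp into the valid range.
    q = [min(max(round(w / scale), qmin), qmax) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float weights."""
    return [v * scale for v in q]


weights = [1.4, -0.6, 0.0, 0.25]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

With only 16 levels, each weight can be off by up to half a step (`scale / 2`); in attention layers such errors are multiplied together inside the MatMul, which is the setting the abstract's MAC mechanism addresses.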


[105] Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation cs.CV

Youngdong Jang, Gyeongrok Oh, Jong Wook Kim, Hyunju Ryu, Hyung-gun Chi

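TL;DR: This paper proposes TrackCue, a framework that uses image-based point tracking to obtain dense motion trajectories and combines them with a visually consistent motion compensation strategy to improve dynamic object representation in LiDAR scene flow estimation. True object motion cues isolated in the image domain are lifted to the LiDAR domain to refine static-dynamic point labels, which provides more reliable supervision for self-supervised scene flow learning.

Details

Motivation: Existing self-supervised LiDAR scene flow methods rely on sparse geometric observations for static-dynamic classification, which makes them vulnerable to data sparsity and occlusions and produces noisy labels that mislead motion learning.

Result: Experiments show that TrackCue significantly improves the precision and F1 score of dynamic labels and yields performance gains in self-supervised scene flow estimation.

Insight: The key idea is to use dense trajectories from image point tracking as motion cues beyond sparse geometric observations. A visually consistent motion compensation strategy separates true object motion from ego-induced apparent motion in the image plane, and visual motion cue lifting associates the result with LiDAR points to refine the classification labels.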

Abstract: LiDAR scene flow estimation is essential for autonomous driving, as it provides 3D motion for each point. Self-supervised approaches use static-dynamic classification to mitigate the imbalance between static and dynamic points, deriving targeted supervision. However, existing methods rely on sparse geometric observations for this classification, making them vulnerable to data sparsity and occlusions. The resulting noisy labels provide incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic object representation in LiDAR scene flow estimation. In particular, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. Furthermore, we present a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.
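
Illustrative sketch (a simplified reading, not the authors' pipeline): ego-motion compensation in the image plane can be pictured as predicting where a LiDAR point would appear in the next frame if it were static, then comparing that prediction with where the point tracker actually found it. The snippet assumes a pinhole camera, translation-only ego motion expressed in the camera frame, and a fixed pixel threshold; the intrinsics, threshold, and function names are hypothetical.

```python
import math


def project(point, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0):
    """Pinhole projection of a 3-D camera-frame point (z forward) to pixels."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def is_dynamic(point_cam, ego_translation, tracked_uv_next, threshold_px=3.0):
    """Flag a point as dynamic when tracking disagrees with ego-induced motion.

    If the point is static and the camera translates by `ego_translation`
    (no rotation in this sketch), the point moves by the opposite vector
    in the camera frame.
    """
    static_next = tuple(p - t for p, t in zip(point_cam, ego_translation))
    u_ego, v_ego = project(static_next)
    u_trk, v_trk = tracked_uv_next
    residual = math.hypot(u_trk - u_ego, v_trk - v_ego)
    return residual > threshold_px, residual


point = (2.0, 0.0, 20.0)   # LiDAR point expressed in the camera frame
ego = (0.0, 0.0, 1.0)      # vehicle advanced 1 m along the optical axis
expected = project((2.0, 0.0, 19.0))

static_flag, _ = is_dynamic(point, ego, expected)
moving_flag, moving_residual = is_dynamic(
    point, ego, (expected[0] + 10.0, expected[1]))
```

A full implementation would include rotation, per-frame trajectories rather than a single step, and a way to handle tracker noise; the residual test above only conveys why comparing tracked and ego-induced trajectories isolates true object motion.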


[106] Neuroscience-inspired Staged Representation Learning with Disentangled Coarse- and Fine-Grained Semantics for EEG Visual Decoding cs.CV

Xiang Gao, Hui Tian, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

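TL;DR: This paper proposes a neuroscience-inspired staged representation learning framework for decoding visual information from electroencephalography (EEG) signals. It decomposes EEG representation learning into three complementary phases (low-level visual representation learning, high-level semantic representation learning, and integrative information fusion) and introduces a multimodal dual-level semantic learning mechanism that separates coarse label-level semantics from fine image-level visual-semantic information.

Details

Motivation: Existing EEG visual decoding methods mainly learn a single global EEG embedding for cross-modal alignment and overlook the staged, hierarchical nature of human visual processing, so a framework that follows neuroscience principles more closely is needed.

Result: Extensive experiments on the THINGS-EEG benchmark show superior performance under subject-dependent zero-shot evaluation and improved exact retrieval under subject-independent zero-shot evaluation.

Insight: The contribution is to recast EEG visual decoding as a stage-specific representation decomposition problem and to introduce semantic latent channels, computational representation channels generated from observed visual EEG signals, which expand the channel-level semantic representation space for structured semantic abstraction and cross-modal alignment.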

Abstract: Decoding visual information from electroencephalography (EEG) signals remains a fundamental challenge in brain-computer interfaces and medical rehabilitation. Existing EEG visual decoding methods mainly focus on learning a single global EEG embedding for cross-modal alignment, but they largely overlook the staged and hierarchical characteristics of human visual processing. To address this limitation, we propose a neuroscience-inspired staged representation learning framework that reformulates EEG visual decoding as a stage-specific representation decomposition problem. The proposed framework organizes EEG representation learning into three complementary phases: low-level visual representation learning, high-level semantic representation learning, and integrative information fusion. To strengthen semantic modeling, we further introduce a multimodal dual-level semantic learning mechanism that separates coarse label-level semantics from fine image-level visual-semantic information. In addition, semantic latent channels are introduced as computational representation channels generated from observed visual EEG signals, expanding the channel-level semantic representation space for structured semantic abstraction and cross-modal alignment. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves superior performance under subject-dependent zero-shot evaluation and improved exact retrieval under subject-independent zero-shot evaluation. Additional analyses, including layer-wise retrieval, temporal accumulation, expanded multi-image retrieval, and ablation studies, further support the effectiveness of staged decomposition and structured semantic modeling. These results suggest that explicitly modeling staged perceptual, semantic, and integrative representations provides an effective neuroscience-inspired framework for EEG-based visual decoding.


[107] P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction cs.CVPDF

Kota Shimomura, Hidehisa Arai, Tsubasa Takahashi, Takayoshi Yamashita, Hironobu Fujiyoshi

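TL;DR: P2GS is a physical-prior-guided Gaussian Splatting framework that targets the photometric inconsistency in 3D reconstruction caused by camera exposure differences and dynamic illumination in autonomous-driving scenes. By jointly decomposing a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions, it achieves physically consistent reconstruction from LDR images alone while retaining real-time rendering efficiency.

Details

Motivation: Conventional 3D Gaussian Splatting (3DGS) assumes consistent exposure and tone mapping across views, but real driving data violates this assumption because of heterogeneous camera pipelines and dynamic outdoor illumination. Exposure discrepancies and noise are baked into the radiance field, producing artifacts and inconsistent illumination that particularly harm realistic simulation of static backgrounds. Prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency.

Result: Experiments in real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while substantially improving photometric consistency, providing reliable exposure normalization, and yielding physically coherent illumination across diverse scenes.

Insight: The novelty lies in folding the physical image-formation process into a unified optimization strategy. Relative-exposure consistency and HDR-domain radiance regularization allow a linear HDR radiance field and per-view parameters to be decomposed from LDR images alone, without HDR supervision. The result is robust to inter-camera illumination differences while preserving the real-time advantage of 3DGS.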

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.
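
The decomposition described in this abstract can be illustrated with a toy image-formation model: an LDR pixel is a tone-mapped, exposure-scaled version of a view-invariant HDR radiance. The sketch below is not the authors' implementation; the gamma tone curve, the function names, and the sample values are all assumptions made for illustration.

```python
def render_ldr(hdr_radiance, exposure_scale, gamma=2.2):
    """Map linear HDR radiance to one view's LDR value (assumed gamma tone curve)."""
    exposed = hdr_radiance * exposure_scale      # per-view exposure scaling
    clipped = min(max(exposed, 0.0), 1.0)        # sensor saturation
    return clipped ** (1.0 / gamma)              # hypothetical tone-mapping function


def recover_hdr(ldr_value, exposure_scale, gamma=2.2):
    """Invert the toy model: undo the tone curve, then divide out the exposure."""
    return (ldr_value ** gamma) / exposure_scale


radiance = 0.2                                   # shared, view-invariant radiance
ldr_short = render_ldr(radiance, exposure_scale=1.0)
ldr_long = render_ldr(radiance, exposure_scale=2.0)

# The two views disagree in LDR space but agree once exposure is normalized.
print(ldr_short, ldr_long)
print(recover_hdr(ldr_short, 1.0), recover_hdr(ldr_long, 2.0))
```

In this toy setting the exposure scale and tone curve are known. The abstract's point is that P2GS estimates them jointly with the radiance field from LDR images only.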


[108] DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis cs.CVPDF

Yi Zuo, Huimin Wu, Lingling Li, Fang Liu, Licheng Jiao

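TL;DR: This paper proposes DEVIS-GRPO, a framework for trajectory-controlled video generation based on GRPO (Group Relative Policy Optimization) that targets dynamic extreme-view synthesis under large-view camera motion. Its core contribution is the ADEVIS sampling strategy, which reaches large-view motions by accumulating small-view increments. This removes the need to collect expensive paired video data and improves training efficiency and sampling diversity.

Details

Motivation: Existing trajectory-controlled video generation methods perform well under small-view camera motion but degrade significantly under large-view (extreme-view) motion. Existing extreme-view synthesis solutions typically require large numbers of annotated, dedicated video pairs, which is costly.

Result: Experiments on the Kubric-4D, iPhone, and DL3DV datasets demonstrate the method's superiority. In non-occlusion areas of Kubric-4D, PSNR and SSIM improve by a relative 21.57% and 7.31% over the second-best method; on the iPhone dataset, LPIPS is reduced by 18.56%.

Insight: The main contributions are: 1) the ADEVIS progressive accumulative sampling strategy, which enables efficient training without paired large-view video data and supports flexible, diverse trajectory configurations; 2) a multi-level consistency-quality reward function for selecting high-quality samples for model optimization; 3) the first online policy gradient method for extreme-view video generation.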

Abstract: Trajectory-controlled video generation has become essential for controllable video generation. While current methods perform well under small-view camera motions, they degrade significantly with large-view motions. Existing solutions for extreme-view synthesis typically require dedicated video pairs, demanding substantial annotation effort. To address these limitations, we propose Dynamic Extreme VIew Synthesis-GRPO (DEVIS-GRPO), a GRPO-based framework for trajectory-controlled video generation, the first online policy gradient method for extreme view video generation. Central to our approach is a novel sampling strategy: Accumulative Dynamic Extreme VIew Synthesis (ADEVIS), which achieves large-view camera motions by progressively accumulating small-view increments. This method delivers two key advantages: 1) enhanced training efficiency, as it eliminates the need to warm-start the policy model by collecting expensive paired large-view videos, and 2) increased sampling diversity, achieved by flexibly varying trajectory configurations. Finally, we designed a multi-level consistency-quality reward function to select high-quality samples for model optimization. Experiments on the Kubric-4D, iPhone, and DL3DV datasets demonstrate our method’s superiority. On Kubric-4D, we achieve relative improvements of 21.57% in PSNR and 7.31% in SSIM over the second-best method in non-occlusion areas. On iPhone, LPIPS is reduced by 18.56%.
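
Two ideas in this abstract lend themselves to a small sketch: the group-relative advantage at the heart of GRPO, and the notion of reaching a large view change by accumulating small increments. The code below is a minimal illustration under stated assumptions, not the paper's method. The use of the population standard deviation, the evenly spaced angle schedule, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each sample's reward normalized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def accumulate_view_increments(total_angle_deg, num_steps):
    """Hypothetical schedule: split one large camera rotation into equal small steps."""
    step = total_angle_deg / num_steps
    return [step * (i + 1) for i in range(num_steps)]


# Rewards for four sampled videos from the same prompt and trajectory.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
print(advantages)

# A 90-degree view change expressed as three cumulative targets.
print(accumulate_view_increments(90, 3))
```

Samples with above-average reward in their group receive positive advantages and are reinforced; no separate value network is needed.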


[109] Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers cs.CVPDF

Shaodong Xu, Zhendong Wang, Litong Gong, Zexian Li, Wengang Zhou

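TL;DR: This paper proposes sREPA, a structural representation alignment framework for accelerating the training of Diffusion Transformers (DiTs). Rather than matching features point by point, it explicitly aligns the spatial relational geometry of pre-trained vision features, which yields faster convergence and higher generation quality.

Details

Motivation: Existing representation alignment methods such as REPA mainly use point-wise matching objectives or rely on implicit architectural adjustments. They do not explicitly model the spatial relational geometry inherent in vision foundation models, so they capture the rich spatial topology of visual representations poorly.

Result: Compared with state-of-the-art alignment strategies, sREPA converges faster and more stably and improves sample quality.

Insight: The novelty lies in reformulating effective alignment for generation as an explicit structural constraint that emphasizes consistency in the relational geometry of feature maps rather than matching individual feature points. This helps the model internalize the holistic spatial layout and structural correlations of pre-trained features.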

Abstract: Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features, as pioneered by Representation Alignment (REPA), can substantially accelerate training and improve generation fidelity. Subsequent analysis (e.g., iREPA) suggests that these gains arise primarily from transferring spatial structure contained in pre-trained vision representations. However, most existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA, a structural REPresentation Alignment framework to enforce consistency in the relational geometry of feature maps, rather than merely matching individual feature points. By encouraging the model to internalize holistic spatial layouts and structural correlations from pre-trained features, sREPA achieves faster and more stable convergence, along with improved sample quality, compared to state-of-the-art alignment strategies. Our code and models will be released.
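
The contrast between point-wise matching and structural alignment can be made concrete with a toy example. The abstract does not give sREPA's loss, so the sketch below assumes one plausible relational form (matching pairwise cosine-similarity matrices); the function names and data are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def pointwise_loss(student, teacher):
    """Point-wise matching: each student token is compared to its own teacher token."""
    return sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / len(student)


def similarity_matrix(feats):
    """Relational geometry: cosine similarity between every pair of tokens."""
    return [[cosine(a, b) for b in feats] for a in feats]


def structural_loss(student, teacher):
    """Assumed structural objective: match the two pairwise-similarity matrices."""
    s, t = similarity_matrix(student), similarity_matrix(teacher)
    n = len(student)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The student features are the teacher features rotated by 90 degrees:
# every individual token differs, but all pairwise relations are preserved.
student = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

print(pointwise_loss(student, teacher))   # large: tokens do not match one by one
print(structural_loss(student, teacher))  # near zero: relational geometry is identical
```

The example shows why the two objectives are not interchangeable: a representation can preserve spatial relations perfectly while failing every point-wise comparison.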


[110] Latent Action Control for Reasoning-Guided Unified Image Generation cs.CV | cs.AIPDF

Fuxiang Zhai, Sixiang Chen, Yingjin Li, Shuaibo Li, Jianyu Lai

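TL;DR: This paper proposes Latent Action Control (LAC), which addresses the control gap between visual understanding and image generation in unified multimodal models. LAC represents reasoning as a continuous latent action trajectory inside the generator, covering planning, internal visual drafting, diagnosis, and refinement. It injects these actions directly into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Instantiated on BAGEL-7B-MoT, the method markedly improves compositional and knowledge-grounded image generation on several benchmarks.

Details

Motivation: Unified multimodal models can encode visual understanding and image generation in a shared backbone, but understanding does not automatically translate into control over generation. A model may infer objects, relations, or knowledge cues yet fail to instantiate them in the generated image.

Result: On GenEval, WISE, and T2I-CompBench, LAC consistently improves compositional and knowledge-grounded generation, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts.

Insight: The core idea is to represent reasoning as a learnable, role-structured latent action trajectory, trained with prior-guided variational latent action alignment followed by Latent-Flow GRPO, so that understanding becomes "actionable" during generation. This provides a direct control path from inferred relations to the generation process.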

Abstract: Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.


[111] OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics cs.CV | cs.AIPDF

Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang

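TL;DR: This paper proposes OmniVL-Guard Pro, a tool-augmented agent for open-world vision-language forensics. By integrating external tools such as real-time event search, local cropping, and anomaly screening, it overcomes the limits of conventional self-contained multimodal large language models in dynamic open-world forensics. The paper also introduces Tree-Structured Self-Evolving Tool Trajectory Generation to create high-quality training data and Checker-Guided Agentic Reinforcement Learning to optimize the reasoning process.

Details

Motivation: Existing vision-language forgery detection and grounding methods work under a closed-world assumption and rely on the model's own capabilities. Self-contained multimodal large language models, however, are limited by finite parametric knowledge, static training data, and perceptual resolution. This creates a bottleneck in dynamic open-world forensics that needs external clues and fine-grained local operations, such as real-time event verification and forgery segmentation.

Result: Extensive experiments show that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks and exhibits strong zero-shot generalization.

Insight: The core shift is from simply scaling up the model to augmenting it with external tools, within a unified tool-augmented agent framework. Methodologically, Tree-Structured Self-Evolving Tool Trajectory Generation creates diverse, high-quality training data, and Checker-Guided Agentic Reinforcement Learning provides process-level supervision to keep the reasoning correct.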

Abstract: Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics, particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clue-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.


[112] SHED: Style-Homogenized Embedding Alignment for Domain Generalization cs.CV | cs.LGPDF

Kai Gan, Tong Wei

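TL;DR: This paper proposes SHED, a new method for improving CLIP on domain generalization tasks. It addresses the information asymmetry between image and text embeddings by aligning style-homogenized embeddings: source-domain style centroids are removed during training, and at inference textual domain centroids are projected into the visual space for weighted prediction.

Details

Motivation: The work targets the limited domain generalization of large-scale vision-language models such as CLIP, which stems from information asymmetry between image and text embeddings: images carry both class semantics and domain-specific style, whereas text provides only basic class cues.

Result: Extensive experiments on five benchmarks show that SHED achieves state-of-the-art performance and significantly outperforms prior methods (for example, 4.0% above standard fine-tuning on DomainNet).

Insight: The contributions are a style-homogenized embedding alignment strategy that reduces domain bias by removing style centroids, and an inference procedure that projects textual domain centroids and aggregates predictions by weighting to handle unseen target domains, which improves generalization.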

Abstract: Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from the encoders in CLIP. During training, SHED removes domain-specific style centroids from image embeddings, computed per source domain, and aligns them with text embeddings that are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).
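
The training-time step of removing per-domain style centroids can be sketched in a few lines. This is a simplified illustration under the assumption that "style" is a constant offset shared by all embeddings of a domain; the function names and the toy 2-D embeddings are hypothetical, and the paper's inference-time projection and membership weighting are not reproduced here.

```python
def centroid(vectors):
    """Mean vector of a list of equally sized embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def remove_centroid(vectors):
    """Subtract the group centroid so only within-group variation remains."""
    c = centroid(vectors)
    return [[x - ci for x, ci in zip(v, c)] for v in vectors]


def homogenize_by_domain(embeddings_by_domain):
    """Remove each source domain's own style centroid from its image embeddings."""
    return {d: remove_centroid(vs) for d, vs in embeddings_by_domain.items()}


# Two classes per domain; the "sketch" domain is the "photo" domain plus a style offset.
embeddings = {
    "photo": [[1.0, 0.0], [3.0, 0.0]],
    "sketch": [[11.0, 5.0], [13.0, 5.0]],
}
homogenized = homogenize_by_domain(embeddings)
print(homogenized["photo"])   # [[-1.0, 0.0], [1.0, 0.0]]
print(homogenized["sketch"])  # [[-1.0, 0.0], [1.0, 0.0]]
```

After centering, same-class embeddings from the two domains coincide in this toy case, which is the kind of style removal the abstract describes.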


[113] Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction cs.CVPDF

Kejun Ren, Lei Jin, Tianxin Huang, Lianming Xu, Li Wang

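TL;DR: This paper revisits the state update gate in streaming 3D reconstruction and finds a structural bottleneck in existing methods: per-token update gates are bounded in magnitude and vary little across frames, giving an effective memory horizon of only about 3 frames, which is the structural origin of long-sequence drift. The paper proposes a scalar frame-level gate that needs no parameters, no training, and no extra forward pass. It derives each frame's contribution strength to the state in closed form from changes in internal features and substantially improves long-sequence performance on several benchmarks.

Details

Motivation: In streaming 3D reconstruction under a strict constant-memory budget, existing per-token update gates are bounded in magnitude and nearly frame-invariant, so the effective memory horizon is too short and long sequences drift.

Result: Across six benchmarks spanning camera pose, video depth, and 3D reconstruction (sequence lengths up to 4,541 frames), the method cuts ATE by 51% on long TUM-RGBD pose sequences, reduces AbsRel by 12.8% on Bonn video depth, and surpasses LongStream and Keyframe-VO on KITTI long-sequence pose estimation, while keeping strictly constant memory at zero training cost.

Insight: The novelty is a frame-level update gate that makes the discrete logic of classical SLAM keyframe selection continuous. Each frame's contribution strength is derived in closed form from changes in internal features, with no extra parameters or training. This effectively extends the state's memory horizon and removes the structural bottleneck behind long-sequence drift.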

Abstract: Streaming 3D reconstruction under a strict constant-memory budget hinges on how the recurrent state is updated as the stream evolves. We profile TTT3R-style per-token gates across five benchmarks and discover a structural bottleneck: the gate is intrinsically bounded in magnitude (median $0.31$; never exceeding $0.6$) and nearly frame-invariant, yielding an effective memory horizon of only $\sim 3$ frames per state token, which serves as the structural origin of long-sequence drift. We trace this to a missing axis: existing inference-time methods modulate updates only at the per-token, intra-frame level, while the orthogonal frame-level question of how strongly each frame should contribute to the state has been treated as content-independent. We close this gap with a scalar frame-level gate $\alpha_t \in (0, 1]$ derived in closed form from frame-to-frame changes of internal features, a continuous relaxation of classical Simultaneous Localization and Mapping (SLAM) keyframe selection that requires no parameters, no training, and no extra forward pass. Across six benchmarks spanning camera pose, video depth, and 3D reconstruction at sequence lengths up to $4{,}541$ frames, our gate cuts ATE by $51\%$ on long TUM-RGBD pose sequences, reduces AbsRel by $12.8\%$ on Bonn video depth, and on KITTI long-sequence pose estimation surpasses both LongStream and Keyframe-VO, while retaining strictly constant memory at zero training cost.
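
One way to read the "about 3 frames" figure is through an exponential-moving-average view of the state update. The sketch below assumes the state evolves as state = (1 - g) * state + g * observation with a constant gate g; that update form, the function names, and the horizon definition 1/g are assumptions made for illustration, not details taken from the paper.

```python
def retained_weight(gate, frames_ago):
    """Share of a past frame's contribution still in the state after later updates."""
    return (1.0 - gate) ** frames_ago


def effective_horizon(gate):
    """Rough memory length of an exponential moving average with constant gate."""
    return 1.0 / gate


def ema_update(state, observation, gate):
    """Assumed gated update of a state token toward the new observation."""
    return [(1.0 - gate) * s + gate * o for s, o in zip(state, observation)]


median_gate = 0.31  # median gate magnitude reported in the abstract
print(effective_horizon(median_gate))      # about 3.2 frames
print(retained_weight(median_gate, 3))     # about 0.33 of a frame's weight after 3 updates
print(retained_weight(median_gate, 10))    # about 0.02 after 10 updates
```

Under this assumed model, a gate of 0.31 gives a horizon of roughly 3.2 frames, consistent with the figure quoted in the abstract, and a frame seen ten updates ago contributes only a few percent to the state.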


[114] RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos cs.CVPDF

Lixin Xue, Chengwei Zheng, Georgios Paschalidis, Chen Guo, Manuel Kaufmann

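TL;DR: RHINO is a three-step framework for reconstructing human interactions with unseen objects from monocular RGB video. By combining 3D-aware foundation models, motion decomposition, and differentiable contact priors, it reconstructs the human, the object, and the static scene in 3D within a common world frame.

Details

Motivation: Existing methods usually treat humans or objects in isolation and ignore their interaction, or rely on known 3D shapes or camera parameters. This makes them ill-suited to the depth ambiguity, occlusion, and entangled motion of real-world monocular video.

Result: Evaluated on a new dataset synchronized with a volumetric 4D capture stage, RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction, and ablations confirm that each stage contributes substantially.

Insight: The contributions include using foundation models to stabilize Structure-from-Motion in low-texture regions, separating object motion from camera motion through motion decomposition, and using differentiable contact priors to improve physical plausibility. Together they give a systematic solution for reconstructing complex interactions from monocular video.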

Abstract: Reconstructing people, objects, and their interactions in 3D is a long-standing goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. Code and data are available at https://lxxue.github.io/RHINO.


[115] StreamingEffect: Real-Time Human-Centric Video Effect Generation cs.CVPDF

Yiren Song, Cheng Liu, Yuxin Jiang, Mike Zheng Shou

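TL;DR: This paper proposes StreamingEffect, a framework for real-time generation of human-centric video effects. It adopts an in-context video editing architecture, distills a high-quality bidirectional teacher into a causal autoregressive student, and reduces sampling from 50 steps to 4, while keyframe control enables interactive editing. To address the data bottleneck, the authors build VideoEffect-130K, a dataset of 130K human-centric video effect samples.

Details

Motivation: Real-time applications such as e-commerce live streaming, entertainment, and vlogging need video effects that are generated in real time while preserving human identity, background content, and temporal consistency, but existing methods lack suitable datasets and deployable editing models.

Result: Experiments show that the method achieves real-time, high-quality 720p video editing on a single H200 GPU.

Insight: The contributions include a teacher-student distillation strategy that greatly accelerates inference, keyframe control for online interactive editing, and the large-scale human-centric video effect dataset VideoEffect-130K, which fills a data gap.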

Abstract: Streaming video effect generation is highly desirable for live human-centric applications such as e-commerce streaming, entertainment, and vlogging, yet remains difficult due to the lack of suitable data and deployable editing models. Unlike generic video generation, this task requires real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. Existing acceleration efforts mainly focus on text-to-video generation, while efficient distillation for video editing remains largely underexplored. In this paper, we present StreamingEffect, a real-time human-centric streaming video effect framework. We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps. We also introduce keyframe control, allowing reference effect frames to be injected online and propagated through the stream for interactive editing. To address the data bottleneck, we construct VideoEffect-130K, to our knowledge the largest human-centric video effect dataset, containing 70K effect videos and 60K editing videos across 600 effect categories curated from short-video and editing platforms. Experiments show that our method enables real-time, high-quality 720p video editing on a single H200 GPU.


[116] Thermal-Only Crowd Counting with Deployment-Time Privacy Protection cs.CVPDF

Yifei Qian, Zhongliang Guo, Chun Tong Lei, Bowen Deng, Chun Pong Lau

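TL;DR: This paper proposes the first privacy-preserving crowd counting framework that uses thermal data only. A depth-to-RGB diffusion model serves as a cross-modal bridge to enhance thermal feature representations, and no RGB input is needed at inference, which substantially reduces privacy exposure in public surveillance.

Details

Motivation: Existing RGB-Thermal crowd counting methods raise privacy concerns (RGB data in public surveillance) and suffer from multi-modal misalignment that degrades fusion performance, so a thermal-only, privacy-protecting solution is needed.

Result: Experiments on the RGBT-CC and DroneRGBT datasets show that, with thermal input alone, the method achieves performance competitive with state-of-the-art RGB-T fusion methods.

Insight: The novelty lies in using a depth-to-RGB diffusion model as a bridge for cross-modal feature enhancement, and in the finding that single-step LCM denoising extracts the features most faithful to the structural content of the depth conditioning signal, whereas multi-step approaches decouple the features from the conditioning input and accumulate errors. This offers a new perspective on cross-modal feature learning.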

Abstract: While RGB-Thermal crowd counting has shown promise, the paradigm faces critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework specifically designed for privacy-conscious crowd counting, eliminating RGB dependency at inference time and substantially reducing the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we leverage depth-to-RGB diffusion models as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we demonstrate that single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal, while multi-step approaches progressively decouple features from the conditioning input and accumulate errors that degrade counting accuracy. Experiments on RGBT-CC and DroneRGBT datasets show our method achieves competitive performance against state-of-the-art RGB-T fusion methods, while requiring only thermal input during inference, eliminating the need for continuous RGB capture that constitutes the primary privacy concern in real-world surveillance deployment. The code will be made publicly available.


[117] EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models cs.CVPDF

Haozhe Shan, Xiancong Ren, Han Dong, Haoyuan Shi, Yingji Zhang

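TL;DR: This paper presents EPIC-Bench, a fine-grained visual perception benchmark for embodied intelligence that systematically evaluates the visual grounding ability of vision-language models (VLMs) in real-world embodied interaction environments. It contains 6.6k carefully annotated (image, text, mask) triples covering 23 fine-grained tasks across three core interaction stages: target localization, navigation, and manipulation. An extensive evaluation of over 89 leading VLMs finds that current models generally struggle with the complex visual-text alignment that physical interaction requires.

Details

Motivation: Existing benchmarks, such as those in question-answering or multiple-choice form, let models exploit linguistic priors instead of demonstrating genuine visual grounding. A new benchmark is therefore needed to systematically evaluate the fine-grained visual perception of VLMs in embodied environments.

Result: An extensive evaluation of over 89 leading VLMs shows that, although advanced reasoning models show promise, current VLMs generally perform poorly on the complex visual-text alignment required for physical interaction, with critical bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection.

Insight: The novelty is a benchmark, EPIC-Bench, focused on fine-grained visual perception across the embodied interaction pipeline. Its tasks are designed to force genuine visual grounding rather than reliance on linguistic shortcuts, giving a solid foundation and actionable insights for developing the next generation of vision-driven embodied models.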

Abstract: While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit linguistic priors rather than demonstrating genuine visual grounding. To address this, we present EPIC-Bench, Embodied PerceptIon BenChmark, a fine-grained grounding benchmark designed to systematically evaluate the visual perceptual capabilities of VLMs in real-world embodied environments. Comprising 6.6k meticulously annotated tuples (Image, Text, Mask), EPIC-Bench spans 23 fine-grained tasks across three core stages of the embodied interaction pipeline: Target Localization, Navigation, and Manipulation. Extensive evaluations of over 89 leading VLMs reveal that while advanced reasoning models show promise, current VLMs universally struggle with complex visual-text alignment for physical interactions. Specifically, models exhibit critical bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection. EPIC-Bench provides a robust foundation and actionable insights for advancing the next generation of vision-driven embodied models.


[118] Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC cs.CV | cs.AI | cs.LGPDF

Angela Srbinovska, Christopher Homan, Adrian Martin, Ernest Fokoué

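TL;DR: This paper proposes a method for processing body-worn camera video that splits the video into 10-second windows and labels them under a privacy-preserving protocol. Each window is labeled along two dimensions, operational context and motion intensity, and is classified using frame features encoded by a CLIP model combined with optical flow statistics. On the test set the method reaches 78.75% accuracy for context classification and 88.33% for activity classification, and integrity audits confirm that the visual timeline speeds up incident review and improves the training workflow.

Details

Motivation: Body-worn camera footage is abundant but operationally opaque. The work aims to help analysts and trainers quickly locate the start of key encounters and the points where activity intensity changes, reducing the time spent watching full-length videos.

Result: On test windows, the best context model reaches 78.75% accuracy and the best activity model reaches 88.33%. Integrity audits show that the visual timeline representation supports faster incident review and a more practical officer training workflow.

Insight: The contributions include time-aligned, fixed-length window segmentation with two-dimensional labeling (operational context and motion intensity), and classification that combines CLIP visual encoding with optical flow statistics. The privacy-preserving protocol and the handling of low-evidence labels (for example, dark or blurred scenes) make the method more practical.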

Abstract: Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage, yet this footage remains operationally opaque: analysts and trainers still have to invest considerable time watching full-length videos to pinpoint the start of key encounters and identify the points where activity shifts to something more physically intense. We present an approach that processes BWC video into a time-aligned sequence of fixed-length 10-second windows, labeled using a privacy-conscious protocol. Each window is labeled along two dimensions: (i) the operational context of the window and (ii) the level of motion intensity within the window, with low-evidence labels for windows where darkness, blur, or occlusion leaves insufficient evidence. We train models to classify windows along these two axes using frames sampled from each window, encoded with a CLIP model and aggregated into a window-level representation. We also extract dense optical flow statistics for each window to capture motion intensity. On test windows, the best context model achieves 78.75% accuracy and the best activity model achieves 88.33%. We also include integrity audits of the results and show how the visual timeline representation supports faster incident review and makes the officer training workflow more practical.
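
The windowing and aggregation steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: keeping a shorter final window, using mean pooling for aggregation, and the function names are all assumptions, and the frame embeddings here are toy vectors rather than real CLIP features.

```python
def fixed_windows(duration_s, window_s=10.0):
    """Split a video of the given duration into consecutive fixed-length windows."""
    windows = []
    start = 0.0
    while start < duration_s:
        # Assumption: a shorter trailing window is kept rather than dropped.
        windows.append((start, min(start + window_s, duration_s)))
        start += window_s
    return windows


def mean_pool(frame_embeddings):
    """Aggregate per-frame embeddings into one window-level vector (assumed mean pooling)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(e[i] for e in frame_embeddings) / n for i in range(dim)]


print(fixed_windows(25.0))
# [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]

# Two sampled frames from one window, each with a toy 2-D embedding.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
# [2.0, 3.0]
```

Each window-level vector would then be passed, together with that window's optical flow statistics, to the context and activity classifiers.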


[119] HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation cs.CV | cs.CLPDF

Yihao Liang, Niraj K. Jha

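TL;DR: This paper proposes HEED, a density-weighted residual alignment method that improves distillation of vision-language models into hybrid architectures. It uses patch self-dissimilarity as a training-free proxy for position importance and up-weights image regions with high information density (such as text and edges), addressing the performance drop of conventional uniform distillation on fine-grained text understanding tasks.

Details

Motivation: Distilling vision-language models into faster hybrid architectures (such as 3:1 Mamba-2/attention mixes) has become standard practice for efficient inference, but aggregate benchmarks hide selective failures, in particular a marked drop on optical character recognition and document tasks. The uniformly weighted loss spends most of its terms on low-information background regions and neglects the sparse but critical answer-bearing regions.

Result: HEED improves performance by 8.7 points on OCRBench v2 and by 5.13 points on a 10-benchmark average. The student achieves 4.12x throughput and a 68% memory saving at 128k context, and ultimately reaches the teacher's average performance on the 10 benchmarks.

Insight: The novelty is a density-weighted residual alignment mechanism that uses patch self-dissimilarity as a training-free importance estimate and prioritizes alignment of high-information-density regions. This effectively improves distillation of hybrid models on fine-grained visual tasks, with no additional parameters or inference cost.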

Abstract: Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6$\times$ larger residual drift than the bottom 10% lowest-density patches and 3.5$\times$ larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with a 4.12$\times$ throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.
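
The idea of weighting the alignment loss by patch density can be sketched with a toy example. The abstract does not define self-dissimilarity precisely, so the code below assumes one simple form (one minus the mean cosine similarity to all other patches) and a squared-error residual term; the function names and data are illustrative, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two patch feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def self_dissimilarity(patches):
    """Assumed density proxy: 1 minus mean cosine similarity to the other patches."""
    scores = []
    for i, p in enumerate(patches):
        others = [q for j, q in enumerate(patches) if j != i]
        scores.append(1.0 - sum(cosine(p, q) for q in others) / len(others))
    return scores


def density_weights(patches):
    """Normalize dissimilarity scores into loss weights (uniform if all are zero)."""
    scores = self_dissimilarity(patches)
    total = sum(scores)
    if total == 0.0:
        return [1.0 / len(patches)] * len(patches)
    return [s / total for s in scores]


def weighted_alignment_loss(student, teacher, weights):
    """Weighted squared error between student and teacher residuals per patch."""
    return sum(
        w * sum((a - b) ** 2 for a, b in zip(s, t))
        for w, s, t in zip(weights, student, teacher)
    )


# Three near-identical background patches and one distinct "text" patch.
teacher = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = density_weights(teacher)
print(weights)  # the last (text) patch receives the largest weight

# The same error magnitude costs more on the text patch than on a background patch.
student_bg_error = [[1.0, 0.5], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
student_text_error = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.5, 1.0]]
print(weighted_alignment_loss(student_bg_error, teacher, weights))
print(weighted_alignment_loss(student_text_error, teacher, weights))
```

Because the weights come from the features themselves, this kind of reweighting adds no trainable parameters, in line with the "training-free proxy" described in the abstract.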


[120] A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation cs.CV | cs.AI | cs.LGPDF

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

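TL;DR: This paper systematically surveys deep learning architectures for point cloud classification and segmentation, covering three fundamental tasks: classification, part segmentation, and semantic segmentation. It first defines point cloud data and discusses its structural characteristics, then categorizes representative works by backbone structure and evaluates their performance on popular benchmarks.

Details

Motivation: Point clouds are the most widely used format for representing 3D shapes and scenes because of their simplicity and geometric fidelity, but their inherently unordered and irregular nature, together with sensor noise and occlusion, poses unique challenges for machine-learning-based methods.

Result: The survey evaluates different methods on popular benchmarks and compares them empirically, but the abstract gives no specific quantitative results and does not say whether any method reaches state-of-the-art performance.

Insight: The contribution is a systematic organization and categorization of point cloud processing strategies (such as conversion to ordered formats, extraction of local geometric features, and permutation-invariant or self-attention-based processing), together with an in-depth analysis of architectural innovations, limitations, and open challenges and future directions.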

Abstract: Point clouds are the most widely adopted format for representing 3D shapes and scenes due to their simplicity and geometric fidelity. However, their inherently unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine-learning-based methodologies. To combat these issues, diverse strategies have been developed, including converting point clouds to an ordered format, extracting local geometry, and permutation-invariant or self-attention-based processing. In this paper, we focus on deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion of its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.


[121] Collaborative Learning for Semi-Supervised LiDAR Semantic Segmentation cs.CVPDF

Bin Yang, Alexandru Paul Condurache

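TL;DR: This paper proposes CoLLiS, a new framework for semi-supervised LiDAR semantic segmentation. It uses a collaborative learning strategy that jointly trains multiple representations in a single step, addressing the problem in conventional two-step methods that pseudo-labels come from a single source and tend to cause confirmation bias and error propagation.

Details

Motivation: Annotating large-scale LiDAR point clouds is costly and time-consuming, which motivates semi-supervised learning. Existing methods, however, usually train in two steps with pseudo-labels drawn from a single distillation source, which reinforces confirmation bias, propagates errors, and limits performance.

Result: Extensive experiments on three datasets show that CoLLiS consistently outperforms state-of-the-art LiDAR semi-supervised learning methods, with particularly large gains in low-label regimes.

Insight: The novelty is to treat multiple representations as coequal students trained collaboratively in a single step: each student is adaptively distilled from multiple representations, and inter-student disparities are monitored online to resolve contradictory supervision, which effectively mitigates confirmation bias. This offers a more robust semi-supervised learning paradigm.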

Abstract: Annotating large-scale LiDAR point clouds for 3D semantic segmentation is costly and time-consuming, which motivates the use of semi-supervised learning (SemiSL). Standard LiDAR SemiSL methods typically adopt a two-step training paradigm, where pseudo-labels are separately generated from a single distillation source, either from the same or another LiDAR representation. Such supervision relies on a unique source of pseudo-labels, which can reinforce confirmation bias and propagate errors during training, ultimately limiting performance. To address this challenge, we introduce CoLLiS, a novel framework that leverages Collaborative Learning for LiDAR Semi-supervised segmentation. Unlike prior paradigms with decoupled pseudo-labeling and training phases, CoLLiS trains multiple representations collaboratively in a single step by treating them as coequal students. Each student is adaptively distilled from multiple representations, while inter-student disparities are monitored online to resolve contradictory supervision and effectively mitigate confirmation bias. Extensive experiments on three datasets demonstrate that CoLLiS consistently outperforms state-of-the-art LiDAR SemiSL methods, with particularly strong gains in low-label regimes.


[122] Markerless Motion Capture for Biomechanical Whole-Body Kinematic Estimation in Infants cs.CVPDF

Divya Joshi, J. D. Peiffer, Colleen Peyton, R. James Cotton

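TL;DR: This study systematically evaluates three state-of-the-art pose estimation frameworks (MeTRAbs-ACAE, SAM 3D Body, and Sapiens) on multi-view markerless motion capture video of infants, with the aim of providing automated, objective, video-based assessment of early infant motor development. It quantifies keypoint detection accuracy and demonstrates the feasibility of applying an inverse kinematic framework to infant data.

Details

Motivation: Early identification of motor impairment in infants relies on expert visual assessment of spontaneous movement, which motivates the development of automated, objective alternatives. Computer vision, in particular high-quality video-based pose estimation, is a promising route.

Result: Among the methods evaluated, Sapiens achieved the lowest reprojection error (22.8 pixels) and the highest geometric consistency (0.82). SAM 3D Body provided the most comprehensive 3D information for kinematic reconstruction, with Procrustes-aligned position errors of 19 to 28 mm. A case comparison shows that biomechanical models fit to SAM 3D estimates can distinguish representative infant movement patterns related to motor development, as identified by a clinical expert.

Insight: The contribution is a systematic application of cutting-edge 3D pose estimation frameworks to the challenging domain of infant biomechanics, with quantified performance. Beyond comparing accuracy metrics across frameworks, the study uses a case example to show the potential of combining pose estimation with inverse kinematic models to distinguish clinically relevant movement patterns, laying groundwork for scalable, video-based early motor assessment.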

Abstract: Early identification of motor impairment in infancy relies on expert visual assessment of spontaneous movement, motivating the development of automated, objective alternatives. One promising approach is using computer vision, which benefits from high quality pose estimation from video. In this study, we systematically evaluated three state-of-the-art pose estimation frameworks (MeTRAbs-ACAE, SAM 3D Body, and Sapiens) on 100 videos over 13 sessions of 8 infants recorded with a multi-view markerless motion capture system. We quantified keypoint detection accuracy using reprojection error, geometric consistency, and Procrustes-aligned 3D position error, and demonstrated proof-of-concept for fitting an inverse kinematic framework to infant data. While Sapiens achieved the lowest reprojection error and highest geometric consistency of the methods evaluated (22.8 pixels and 0.82, respectively), SAM 3D Body provided the most comprehensive 3D information for kinematic reconstruction with Procrustes-aligned position errors of 19 to 28 mm. We demonstrate in a case comparison example that biomechanical models fit to SAM 3D estimates distinguish representative movement patterns in infants related to motor development, as identified by a clinical expert. Together, these findings highlight both the promise and current limitations of 3D pose estimation for infant biomechanics and establish preliminary groundwork for scalable, video-based assessment of early motor development.
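
As a rough illustration of the reprojection error metric named in this abstract, the sketch below projects a triangulated 3D keypoint back into one camera with a simple pinhole model and measures its pixel distance to the 2D detection. The function names, the camera parameters, and the assumption that points are already expressed in the camera frame are all hypothetical; the paper's actual calibration and evaluation pipeline is not described here.

```python
import math

def project(point_3d, focal, center):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    fx, fy = focal
    cx, cy = center
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_error(point_3d, detected_2d, focal, center):
    """Pixel distance between the projected 3D keypoint and its 2D detection."""
    u, v = project(point_3d, focal, center)
    return math.hypot(u - detected_2d[0], v - detected_2d[1])

def mean_reprojection_error(points_3d, detections_2d, focal, center):
    """Average reprojection error over a set of keypoints in one view."""
    errors = [
        reprojection_error(p, d, focal, center)
        for p, d in zip(points_3d, detections_2d)
    ]
    return sum(errors) / len(errors)

# Hypothetical camera: 1000 px focal length, principal point at (320, 240).
focal, center = (1000.0, 1000.0), (320.0, 240.0)
points = [(0.0, 0.0, 2.0), (0.1, 0.0, 2.0)]
detections = [(323.0, 244.0), (370.0, 240.0)]
print(mean_reprojection_error(points, detections, focal, center))
```

In a multi-view setup the same quantity would typically be averaged over cameras as well as keypoints, with each camera's extrinsics applied before projection.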


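[123] CAM-VFD: Cross-Attention Multimodal Video Forgery Detection cs.CV | cs.AI [PDF]

Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy

TL;DR: This paper proposes CAM-VFD, a cross-attention-based multimodal video forgery detection framework that detects deepfake videos by modeling inconsistencies among visual, temporal, and geometric modalities. The framework uses CLIP, VideoMAE, and MiDaS to extract appearance, motion, and depth features respectively, and fuses them through cross-attention to capture cross-modal contradiction signals.

Details

Motivation: The rapid advancement of deepfake technologies and video editing tools poses a serious challenge to multimedia forensics, the integrity of judicial evidence, and information authenticity. Existing detectors rely on single-modality signals, whereas advanced generators can maintain within-modality consistency while producing cross-modal contradictions. These contradictions are forensically discriminative but invisible to single-modal detectors.

Result: The method performs strongly on two generative video benchmarks (GenVidBench and GenVideo): 95.31% Top-1 accuracy on GenVidBench, and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Cross-modal attention discrepancy analysis shows statistically separable real and fake distributions (p<0.001, Cohen's d=0.68). The model also remains stable under compression, noise, blur, and adversarial perturbations.

Insight: The core novelty is modeling cross-modal contradiction as a directional forensic signal, with a cross-attention fusion mechanism in which CLIP appearance features serve as queries and VideoMAE motion features and MiDaS depth features serve as keys and values. This offers a new multimodal reasoning paradigm for video forgery detection and may improve robustness in media forensics.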

Abstract: The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions ($p<0.001$, Cohen’s $d=0.68$). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at https://github.com/Hoda-Osama/CAM-VFD/tree/main.
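
The fusion step described in this abstract, appearance tokens querying motion and depth tokens, follows the general shape of scaled dot-product cross-attention. The sketch below is a minimal single-head version in plain Python, not the CAM-VFD implementation: it omits learned projections, and pooling motion and depth tokens into one key/value set is an assumption made only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention without learned weights.

    queries: list of d-dimensional vectors (e.g. appearance tokens)
    keys, values: equally long lists of vectors (e.g. motion and depth tokens)
    Returns (outputs, attention_weights).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy tokens: two appearance queries attend over motion + depth tokens.
appearance = [[1.0, 0.0], [0.0, 1.0]]
motion = [[1.0, 0.0]]
depth = [[0.0, 1.0]]
fused, attn = cross_attention(appearance, motion + depth, motion + depth)
print(attn)
```

The attention weights are the quantity a discrepancy analysis would inspect: each row shows how strongly an appearance token agrees with each motion or depth token.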


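[124] UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation cs.CV | cs.AI | cs.CL [PDF]

Shiv Ghosh, Junayd Lateef, Chih-Hua, Liu, Yannan Yu

TL;DR: This paper introduces the UCSF-PDGM-VQA dataset, a visual question answering (VQA) benchmark for brain tumor MRI interpretation comprising 2,387 QA pairs from 473 glioma MRI studies. The authors evaluate six state-of-the-art vision-language models and one large language model on the dataset and find that existing models suppress visual features and over-rely on language priors when handling multi-sequence 3D MRI scans. This leads to modality collapse and highlights shortcomings in model reliability and safety in clinical settings.

Details

Motivation: Brain tumor diagnosis depends heavily on MRI evaluation, but interpretation by radiologists demands advanced training, imposes a high cognitive load, and is time-consuming. Existing vision-language models remain underutilized in neuro-oncology for lack of specialized evaluation benchmarks.

Result: Baseline evaluation of six SOTA vision-language models and one large language model on UCSF-PDGM-VQA shows that they cannot effectively process multi-sequence 3D MRI scans, which limits performance and reveals modality collapse.

Insight: The novelty is building the first clinically relevant VQA benchmark dataset for brain tumor MRI. Viewed objectively, the analysis shows that current general-purpose VLMs are limited in complex medical imaging and underscores the need for robust, domain-specific models.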

Abstract: Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark – the UCSF-PDGM-VQA dataset – consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.


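[125] iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning cs.CV [PDF]

Chengyan Wang, Haoyu Chen, Hui Wei, Yueyi Yang, Yunquan Chen

TL;DR: This paper presents iMiGUE-3K, a new benchmark for micro-gesture analysis and the largest micro-gesture dataset to date, with more than 3,400 long video clips and 37 million frames covering 32 micro-gesture classes. Building on the dataset, the authors develop the MG-FMs foundation model and establish five evaluation tasks. Systematic evaluation shows that micro-gesture-based analysis significantly improves emotion understanding.

Details

Motivation: Existing affective computing approaches focus mainly on facial expressions and speech and often overlook the rich emotional cues conveyed by body language. Micro-gestures, unintentional and subconscious movements driven by inner feelings, are a promising alternative cue, but no large-scale dataset exists to support foundation-model pre-training.

Result: On the proposed iMiGUE-3K benchmark, systematic evaluation of representative methods shows that micro-gesture-based analysis significantly improves emotion understanding. The work provides a comprehensive toolkit for micro-gesture analysis.

Insight: The main contributions are: 1) a model-based crowd-sourcing data collection strategy used to build iMiGUE-3K, the first large-scale, in-the-wild video dataset for fine-grained gesture-based emotion analysis; 2) MG-FMs, a discriminative foundation model for transferable gesture representation learning; and 3) five comprehensive evaluation tasks spanning unsupervised, semi-supervised, and supervised recognition, retrieval, and emotion recognition, which lay a solid foundation for micro-gesture research.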

Abstract: Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, there are no existing large-scale datasets supporting the pre-training of the MG foundation model. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, featuring key contributions with a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings from 332 distinct professional tennis players’ public press interviews over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild, video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture representation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.


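[126] Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives cs.CV | cs.LG [PDF]

Santosh Premi

TL;DR: This paper reports a small-scale empirical study of auxiliary objectives in the video Joint-Embedding Predictive Architecture (Video-JEPA), examining 18 variants under two pretraining regimes. It finds that many auxiliary objectives exhibit capacity trade-offs: improving one downstream capability often degrades another. To address this, the paper proposes FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that factorizes the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In the mixed-dataset setting, the method improves over the baseline by 5.92 and 3.21 percentage points on ImageNet-100 and Something-Something V2 respectively, while keeping performance on Diving-48 close to the baseline.

Details

Motivation: JEPA is a promising framework for self-supervised video representation learning, but the behavior of auxiliary objectives in small-scale Video-JEPA training has not been well studied. The paper aims to examine empirically how different auxiliary objectives affect downstream performance and to address the resulting trade-offs.

Result: Under mixed-dataset pretraining (UCF-101 + Something-Something V2 + ImageNet-100), FWM-HW-LD improves over the reference baseline by +5.92 percentage points on ImageNet-100 (appearance) and +3.21 percentage points on Something-Something V2 (temporal reasoning), while staying within 0.30 percentage points of the baseline on Diving-48 (fine-grained motion). Evaluation uses three complementary benchmarks: Diving-48, Something-Something V2, and ImageNet-100.

Insight: The main contribution is the FWM-HW-LD objective, which mitigates capacity trade-offs among auxiliary objectives by factorizing the latent representation into appearance and dynamics subspaces and introducing hard-region weighting. Viewed objectively, latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA and can help achieve more balanced gains across downstream tasks.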

Abstract: Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), Something-Something V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed-dataset setting, FWM-HW-LD improves ImageNet-100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving-48. These results indicate that latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA.


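[127] Image-to-Video Diffusion: From Foundations to Open Frontiers cs.CV [PDF]

Xianlong Wang, Wenbo Pan, Shijia Zhou, Ke Li, Yuqi Wang

TL;DR: This paper systematically surveys diffusion-based image-to-video generation and treats it as a standalone research area. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, then classifies existing methods by architecture and training paradigm. It further distills four core design elements (condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling) and discusses representative application scenarios and major open challenges.

Details

Motivation: Research on image-to-video generation is advancing rapidly, but it is mostly discussed within the broader topic of video generation and lacks a dedicated taxonomy and systematic analysis. This paper aims to fill that gap by examining diffusion-based image-to-video generation as a standalone subject.

Result: This is a survey paper and reports no quantitative experimental results of its own. It systematically organizes the field's task formulation, method taxonomy, core design elements, and evaluation practice.

Insight: The novelty is providing the first systematic survey and taxonomy for diffusion-based image-to-video generation, distilling four core design elements, and identifying the task's distinctive challenges in content consistency, identity preservation, and motion coherence, which point to directions for future research.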

Abstract: Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature grows rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, and discusses representative application scenarios together with major open challenges.


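[128] LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs cs.CV [PDF]

Jihwan Kim, Nikhil Parthasarathy, Danfeng Qin, Junhwa Hur, Deqing Sun

TL;DR: This paper proposes LiteFrame, an efficient video encoder backbone that addresses the explosion of visual-token context length faced by Video Large Language Models (Video LLMs) on long videos. Its Compressed Token Distillation (CTD) training framework teaches a compact student vision encoder to directly predict the information-dense, spatio-temporally compressed representations produced by a large teacher model. Combined with Language Model Adaptation (LMA), the approach processes more frames at lower latency while improving video understanding accuracy.

Details

Motivation: Existing methods focus mainly on "post-hoc" token reduction after feature extraction to lighten the LLM's computational load, but this shifts the main latency bottleneck to the expensive per-frame processing of the vision encoder. An efficient yet strong video encoder backbone is therefore needed to unlock frame scaling in Video LLMs.

Result: Compared with InternVL3-8B, LiteFrame reduces end-to-end latency by 35% while processing 8 times more frames and improves average video understanding accuracy across multiple benchmarks, establishing a new latency-accuracy Pareto frontier.

Insight: The novelty is the Compressed Token Distillation (CTD) training framework, which has the student encoder learn the teacher's compressed representations directly and bypass redundant computation, combined with Language Model Adaptation (LMA) to optimize overall performance. This offers a new path to longer video understanding under a fixed compute budget.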

Abstract: The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on “post-hoc” token reduction – reducing visual tokens after feature extraction to alleviate the LLM’s computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier – compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8$\times$ more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.


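[129] CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving cs.CV | cs.AI | cs.LG | cs.RO [PDF]

Ruiyang Zhu, Yuehan He, Boyuan Zheng, Zesen Zhao, Ahmad Chalhoub

TL;DR: This paper proposes CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that addresses the brittleness of end-to-end autonomous driving systems in rare but safety-critical long-tail scenarios such as construction zones and complex yielding geometries. The method optimizes per-roadblock soft prompts from crowdsourced data and retrieves them on demand via Vehicle-to-Everything (V2X) communication to augment a frozen Vision-Language-Action (VLA) driving model, without retraining the model or scaling up data.

Details

Motivation: End-to-end autonomous driving systems perform well in common driving scenarios but remain brittle in long-tail safety-critical ones. Existing approaches rely on data scaling and model training and struggle to handle these rare cases efficiently.

Result: On the NAVSIM benchmark, with a variety of state-of-the-art VLA backbones, CLAP reduces planning error on challenging scenarios by 24% with no regression on normal frames, significantly improving planning performance.

Insight: The novelty builds on two observations about the VLA latent space: scenarios from the same roadblock cluster tightly, and long-tail and normal frames are intermixed in the representation. A two-stage pipeline (supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization) selectively improves challenging frames while preserving normal-frame performance. Viewed objectively, soft prompt optimization with V2X retrieval gives an efficient, targeted way to adapt a frozen model and avoids the cost of full fine-tuning.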

Abstract: End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs’ latent space: (i) at the VLA’s hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.


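[130] Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability cs.CV | cs.AI [PDF]

Nisreen Albzour, Sarah S. Lam

TL;DR: This study systematically optimizes the lightweight Vision Transformer architecture ViT-Tiny for automated screening through cervical cell classification. With data augmentation, class weighting, and hyperparameter tuning, it reaches roughly 95% cross-validation accuracy on the Herlev dataset, and Grad-CAM analysis confirms that model attention is consistent with clinically relevant morphological features such as nuclear regions and cell boundaries.

Details

Motivation: Manual Pap smear analysis suffers from inter-observer variability, time constraints, and limited expert availability, while existing convolutional neural network (CNN) methods fall short in modeling long-range spatial dependencies and in clinical interpretability.

Result: The optimized ViT-Tiny model achieves 94.9%-95.2% cross-validation accuracy on the Herlev dataset (917 images), with random horizontal flipping and class weighting (0.7×1.3) identified as the most effective strategies.

Insight: The paper's novelty is systematically applying Vision Transformers to a medical image classification task and showing, through Grad-CAM visualization, that their attention aligns with cytopathological criteria (nuclear morphology, chromatin texture, and so on). This balances high accuracy with clinical interpretability and suggests a path toward medical AI deployment.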

Abstract: Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which resulted in improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was utilized to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, in which random horizontal flipping and class weighting (0.7 x 1.3) were identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, which include nuclear regions, cell boundaries, and chromatin texture, which align with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, which fulfills both clinical performance and transparency requirements essential for medical AI deployment.
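
To make the class-weighting idea concrete, here is a minimal sketch of a class-weighted cross-entropy loss in plain Python. The abstract reports the weights 0.7 and 1.3 but does not say which class receives which; assigning 1.3 to the minority "normal" class (242 images) and 0.7 to the majority "abnormal" class (675 images) is an assumption made here for illustration, as is normalizing by the sum of sample weights.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Class-weighted cross-entropy, normalized by the total sample weight.

    probs: per-sample predicted class probabilities, e.g. [[0.9, 0.1], ...]
    labels: integer class index per sample
    class_weights: weight per class, indexed by class id
    """
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total_loss += -w * math.log(p[y])
        total_weight += w
    return total_loss / total_weight

# Assumed mapping: class 0 = normal (minority, weight 1.3),
# class 1 = abnormal (majority, weight 0.7).
weights = [1.3, 0.7]

# A confident mistake on a normal cell costs more than the same
# mistake on an abnormal cell.
miss_normal = weighted_cross_entropy([[0.1, 0.9], [0.1, 0.9]], [0, 1], weights)
miss_abnormal = weighted_cross_entropy([[0.9, 0.1], [0.9, 0.1]], [0, 1], weights)
print(miss_normal, miss_abnormal)
```

Up-weighting the rarer class in this way keeps the optimizer from trading away minority-class accuracy for majority-class accuracy on an imbalanced dataset.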


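[131] Beyond Detection: A Structure-Aware Framework for Scene Text Tracking cs.CV [PDF]

Chenmin Yu, Liu Yu, Daiqing Wu, Gengluo Li, Zeyu Chen

TL;DR: This paper proposes SymTrack, a detection-free, structure-aware framework designed for scene text tracking in video. Through a synergistic dual-branch design, the framework integrates Cross-Expert Calibration, Predictive Token Rectification, and an Adaptive Inference Engine to handle geometric distortion, visual ambiguity, and sensitivity to structural detail.

Details

Motivation: Existing general-purpose visual object trackers degrade substantially on scene text. Video text tracking is essential for dynamic text manipulations such as segmentation, removal, and editing, yet the task remains underexplored.

Result: On a benchmark built from video text spotting datasets, SymTrack sets a new state of the art on all three benchmarks, outperforming the previous best tracker by up to 11.97% AUC on BOVText_SOT.

Insight: The novelty is formalizing scene text tracking as a specific task and proposing a unified detection-free framework whose core is reducing semantic bias through Cross-Expert Calibration and correcting structural imbalance through Predictive Token Rectification, which offers a new approach to text-specific tracking challenges.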

Abstract: Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work for it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Considering the lack of dedicated benchmarks for this task, we utilize three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments demonstrate that SymTrack sets the new state-of-the-art on all three benchmarks, outperforming previous best trackers by up to 11.97% AUC on $ \text{BOVText}_{\text{SOT}} $. Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.


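[132] EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning cs.CV [PDF]

Zeyu Wang, Chang Liu, Eduardus Tjitrahardja, Yuntao Wang, Borislav Pavlov

TL;DR: This paper introduces EgoIntrospect, the first self-annotated egocentric dataset captured in user-driven scenarios, which aims to reveal users' internal states while they interact with AI assistants. The dataset contains 180 hours of synchronized video, audio, gaze, motion, and physiological signals, and is used to build a benchmark that evaluates how well multimodal large language models reason about users' internal states. Experiments show that existing models struggle to use multimodal signals to infer users' subjective states.

Details

Motivation: Existing egocentric video datasets and benchmarks fall short in understanding users' internal states, which is crucial for seamless AI assistant experiences.

Result: Experiments on the constructed benchmark show that existing multimodal large language models perform poorly when using multimodal signals to reason about users' subjective internal states, such as affective experience, interactive intent, and cognitive memory.

Insight: The novelty is providing the first egocentric multimodal dataset with self-annotated internal states captured in user-driven scenarios, and formalizing a task benchmark centered on user internal states, which advances research on wearable AI assistants and egocentric vision.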

Abstract: Despite extensive efforts on egocentric video datasets and benchmarks, understanding users’ internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users’ interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users’ internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users’ subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/


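[133] HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone cs.CV [PDF]

Guanyiman Fu, Jingtao Li, Zihang Cheng, Zhuanfeng Li, Diqi Chen

TL;DR: This paper proposes HyperVision, the first pre-trained backbone for ground-based hyperspectral imaging, which addresses differing sensor spectral configurations, scarce and inconsistent labels, and limited dataset scale and scene diversity. Using three core techniques (channel-adaptive dynamic embedding, multi-source pseudo-label generation, and cross-modal knowledge distillation), the model is pre-trained on 26 diverse datasets and generalizes well across multiple downstream tasks.

Details

Motivation: The work aims to fill the absence of a general pre-trained backbone for ground-based hyperspectral imaging, and to address the limited generalization caused by varying sensor spectral configurations, scarce and inconsistent labels, and existing datasets that are small and cover few scene types.

Result: After pre-training on 15k images from 26 datasets, HyperVision reaches state-of-the-art performance on three downstream tasks (hyperspectral semantic segmentation, object tracking, and salient object detection) under varying sensor configurations, with only the task head fine-tuned and the backbone parameters left unchanged. Gains reach up to a 16.3% relative improvement in segmentation accuracy, a 2.1% relative improvement in tracking AUC, and a 35.5% reduction in salient object detection MAE.

Insight: The contributions include: 1) a channel-adaptive dynamic embedding mechanism that handles heterogeneous spectral inputs from different sensors in a unified way; 2) a pseudo-label generation method that fuses spatial structure from SAM2 with spectral information from HyperFree to ease label scarcity; and 3) cross-modal knowledge distillation that transfers semantic knowledge from a pre-trained RGB model to enrich scene representation. Together these offer a new approach to building general hyperspectral vision foundation models.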

Abstract: While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, to address the scarcity and inconsistency of labels, we introduce a multi-source pseudo-labeling method that fuses semantic representations from both spatial structures generated by SAM2 and fine-grained spectral material information extracted by HyperFree. Third, to compensate for limited dataset scale and enrich scene diversity, a cross-modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre-trained RGB vision model to our hyperspectral backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head-only adaptation without adjusting backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_{\mathrm{M}}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available at https://github.com/lronkitty/HyperVision .


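[134] HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing cs.CV [PDF]

Yuyao Zhang, Alexander Huang-Menders, Yu-Wing Tai

TL;DR: HierEdit is a region-aware hierarchical diffusion framework for efficient, scalable high-resolution image editing. The method first applies an off-the-shelf editing model to a low-resolution proxy image to generate a reference and localize the modified regions. A hierarchical local-window diffusion model (Local-Window MMDiT) then refines only the edited regions of the original high-resolution image while reusing the unaltered regions as conditioning inputs. The low-resolution proxy also provides structural guidance and intermediate denoising supervision (Inference Acceleration), ensuring consistent global semantics and stable generation without full-resolution attention computation. This targeted, hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without specialized high-resolution training data.

Details

Motivation: Existing multimodal diffusion-based image editors are computationally inefficient and limited to relatively low resolutions. They either process the entire canvas redundantly or depend on large-scale high-resolution datasets, which makes training and inference costly.

Result: Extensive experiments show that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing.

Insight: The novelty is a region-aware hierarchical editing strategy: a low-resolution proxy localizes the edited regions and guides high-resolution refinement, and this is combined with a local-window diffusion model and inference acceleration to enable efficient high-resolution editing without high-resolution training data. Viewed objectively, decomposing the edit into a low-resolution global edit and a high-resolution local refinement, and reusing unaltered regions as conditioning to cut computation, is a useful approach to handling high-resolution content efficiently.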

Abstract: High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (Local-Window MMDiT) then refines only the edited regions within the original high-resolution image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (Inference Acceleration), ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing. Project page: https://peteryyzhang.github.io/HierEdit-page/


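[135] LISA: Language-guided Interference-aware Spatial-Frequency Attention for Driver Gaze Estimation cs.CV [PDF]

Jun Ma, Zhenye Yang, Ruichen Zhou, Pei Zhang, Huan Li

TL;DR: This paper proposes LISA, a Language-guided Interference-aware Spatial-Frequency Attention framework for driver gaze estimation that combines frequency-domain priors with vision-language knowledge. A dual-domain fusion mechanism integrates stable low-frequency semantics with high-frequency details, and a training-time disentanglement strategy separates gaze features from appearance interference. The method achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing spatial-domain models struggle to separate genuine gaze cues from irrelevant visual attributes and are vulnerable to sudden lighting changes and sensor noise, so a more robust method is needed to estimate driver gaze accurately.

Result: Experiments on two benchmarks show that LISA achieves state-of-the-art performance, with significantly improved robustness to occlusion and lighting variation.

Insight: The contributions include a dual-domain fusion mechanism that exploits the stability of the amplitude spectrum under spatial perturbation, and a training-time disentanglement strategy based on a frozen CLIP encoder and orthogonal regularization that explicitly separates gaze features from appearance interference and reduces semantic ambiguity.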

Abstract: Driver gaze estimation serves as a fundamental metric for evaluating driver attentiveness in modern monitoring systems. Beyond being vulnerable to sudden lighting changes and sensor noise, spatial-domain models struggle to disentangle authentic gaze cues from irrelevant visual attributes. In this paper, we propose LISA, a Language-guided Interference-aware Spatial-Frequency Attention framework that combines frequency-domain priors with vision-language knowledge. Observing that the amplitude spectrum remains relatively stable even under spatial perturbations, we design a dual-domain fusion mechanism. It integrates stable low-frequency semantics into high-frequency details, employing spatial attention to precisely target ocular regions. To reduce semantic ambiguity, we also introduce a training-time disentanglement strategy. Using a frozen CLIP encoder and orthogonal regularization, we explicitly separate gaze features from appearance interference. Experiments on two benchmarks show that LISA achieves state-of-the-art performance, with significantly improved robustness against occlusions and lighting variations. The code repository is available at https://github.com/Mason-bupt/LISA.
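
The observation that the amplitude spectrum is stable under spatial perturbation has a simple exact special case: a circular shift changes only the phase of the discrete Fourier transform, not its magnitude. The sketch below checks this on a 1D signal in plain Python. It illustrates the underlying Fourier property only; how LISA builds its dual-domain fusion on top of it is not shown, and real perturbations (non-circular shifts, lighting changes) preserve the amplitude spectrum only approximately.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def amplitude_spectrum(signal):
    """Magnitude of each DFT coefficient."""
    return [abs(c) for c in dft(signal)]

def circular_shift(signal, s):
    """Shift a list to the right by s positions with wrap-around."""
    s %= len(signal)
    return signal[-s:] + signal[:-s] if s else list(signal)

row = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 1.0]   # toy "image row"
shifted = circular_shift(row, 3)

amp_original = amplitude_spectrum(row)
amp_shifted = amplitude_spectrum(shifted)
max_gap = max(abs(a - b) for a, b in zip(amp_original, amp_shifted))
print(max_gap)  # numerically zero: the shift only altered the phase
```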


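[136] LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos cs.CV [PDF]

Chenyi Xu, Yihao Wu, Liqi Yan, Chao Yang, Jianhui Zhang

TL;DR: LongDPM is a new framework for scalable, long-range dynamic 3D reconstruction from long monocular videos. By splitting a long video into overlapping chunks, it addresses the problem that feed-forward reconstruction models are limited to short clips while long-range trackers cannot produce dense sequence-level reconstructions. The method connects chunk-local coordinate systems through confidence-weighted registration and static-aware overlap abstraction, and associates dynamic identities across chunk boundaries to fuse trajectories, recovering coherent long-range 3D motion.

Details

Motivation: Recovering a dynamic 3D scene from a long monocular video requires dense geometry, camera motion, and temporal correspondence to stay consistent in a shared coordinate system. Existing methods face two challenges: feed-forward reconstruction models only handle short clips, and long-range trackers preserve correspondences but cannot produce dense sequence-level reconstructions.

Result: On the PointOdyssey, Kubric-F, and Kubric-G datasets, LongDPM reduces dense tracking end-point error (EPE) relative to V-DPM. On TUM-dynamics, it obtains the best absolute trajectory error (ATE) for camera pose estimation, demonstrating strong long-range reconstruction and tracking performance.

Insight: The core novelty is an overlap-aware, chunked framework for long videos that bridges chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction, and associates dynamic identities and fuses trajectories across chunks. This provides a memory-efficient, scalable solution for dynamic reconstruction from long monocular sequences.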

Abstract: Recovering a dynamic 3D scene from a long monocular video requires dense geometry, camera motion, and temporal correspondence to remain consistent in a shared coordinate system. Existing methods face two key challenges: (1) feed-forward reconstruction models provide accurate local predictions but are limited to short clips, and (2) long-range trackers preserve correspondences without producing dense sequence-level reconstruction. This paper presents LongDPM, a novel overlap-aware framework for scalable long-range monocular dynamic reconstruction. First, LongDPM processes long videos in overlapping chunks, keeping inference memory bounded by the chunk length. Second, it connects chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction. Third, it associates dynamic identities across chunk boundaries and fuses matched trajectories to recover coherent long-range 3D motion. Experimental results demonstrate that LongDPM achieves superior long-range reconstruction and tracking performance, reducing dense tracking EPE over V-DPM on PointOdyssey, Kubric-F, and Kubric-G, while obtaining the best TUM-dynamics ATE for camera pose estimation.
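
The first step in this abstract, processing a long video in overlapping chunks, can be sketched as a small indexing routine. The function name, the fixed chunk length, and the fixed overlap are hypothetical choices for illustration; the abstract does not state how LongDPM selects chunk boundaries, and the registration and trajectory-fusion steps are not shown.

```python
def overlapping_chunks(num_frames, chunk_len, overlap):
    """Split frame indices [0, num_frames) into half-open (start, end) chunks.

    Consecutive chunks share `overlap` frames, which gives a later
    registration step common frames for aligning chunk-local coordinates.
    """
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    if num_frames <= 0:
        return []
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break
        start += stride
    return chunks

# 10 frames, chunks of 4 frames, 1 shared frame between neighbours.
print(overlapping_chunks(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

Because each chunk holds at most `chunk_len` frames, peak inference memory depends on the chunk length rather than on the total video length, which matches the bounded-memory property the abstract describes.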


[137] Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models cs.CV | cs.AIPDF

Zhiqiang Wang, Dongrui Liu, Yan Li, Zonghao Ying, Wei Xue

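TL;DR: This paper proposes Attention Hijacking, a new adversarial attack for cross-query response manipulation, in which a single adversarial example stays effective across diverse user queries. The method explicitly steers internal attention distributions to maintain an image-dominant attention pattern, reducing the dependence of the manipulated output on the specific wording of the query.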

Details

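Motivation: Existing adversarial attacks on vision-language models (VLMs) can steer outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different text queries. This paper studies cross-query response manipulation, where a single adversarial example must remain effective across diverse user queries.

Result: Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios.

Insight: The claimed novelty is the first explicit link between attention stability and transferable response manipulation, together with enforcing an image-dominant attention pattern to achieve cross-query attacks. Viewed objectively, the work offers a new perspective on VLM robustness: the stability of attention patterns is a key factor in the transferability of adversarial attacks.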

Abstract: Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different textual queries. This paper studies cross-query response manipulation, where a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with preserving an image-dominant attention pattern during response generation. Motivated by this observation, we propose Attention Hijacking, a novel adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of textual tokens, our method reduces the dependence of the manipulated output on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insights into the role of attention stability in transferable response manipulation for VLMs.


[138] SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection cs.CVPDF

Zixi Wei, Huixuaun Zhang, Xiaojun Wan

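TL;DR: This paper proposes SpecSem-Net, a new framework for robust detection of AI-generated video. It introduces a semantic-guided spectral denoising mechanism, described by the authors as the first of its kind, that adaptively fuses semantic features with spectral features (high-frequency features in particular) to meet the detection challenge posed by top-tier video generators such as Sora and Veo.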

Details

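Motivation: Existing AI-generated video detectors over-rely on increasingly realistic semantic features and therefore fail easily, while overlooking the subtle spectral artifacts in synthetic video. A robust detector needs to exploit both semantic and spectral information.

Result: Extensive experiments were run on a comprehensive benchmark covering 5 state-of-the-art commercial generators and on public datasets. SpecSem-Net outperforms existing methods, reaching accuracies of 87.25% on the authors' benchmark and 95.59% on public datasets.

Insight: The main contribution is the first semantic-guided spectral denoising mechanism for high-fidelity AI-generated video detection. It comprises a spectral module based on Fourier-transform filtering and a gated merging mechanism that adaptively fuses semantic context to reduce misjudgments caused by spectral noise. This suggests a detector design that combines semantics with low-level physical artifacts.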

Abstract: The remarkable visual fidelity of recent commercial video generative models, such as Sora and Veo, renders robust AI-generated video detection increasingly essential to prevent synthetic content from being indistinguishable from real videos and exploited for disinformation. However, existing detectors often fail due to an over-reliance on increasingly realistic semantic features, neglecting subtle spectral artifacts. In this paper, we propose SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. Specifically, we design a spectral module to extract high-frequency features via Fourier-Transform based filtering. Furthermore, to reduce misjudgments arising from spectral noise, we employ a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. Additionally, to evaluate detector performance on the latest top-tier generative models, we construct a comprehensive benchmark comprising 5 SOTA commercial generators. Extensive experiments demonstrate that SpecSem-Net outperforms existing methods, achieving accuracies of 87.25% and 95.59% on our benchmark and public datasets, respectively.
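
Fourier-transform based high-frequency extraction can be illustrated on a toy signal. The sketch below is an assumption-laden simplification: it uses a 1-D signal, a naive DFT, and a hard cutoff, whereas the paper operates on video and does not specify its filter in the abstract.

```python
import cmath
from typing import List


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def high_pass(signal: List[float], cutoff: int) -> List[float]:
    """Zero every frequency bin closer than `cutoff` to DC, then transform back."""
    n = len(signal)
    spectrum = dft([complex(v) for v in signal])
    for k in range(n):
        if min(k, n - k) < cutoff:  # distance of bin k from the DC component
            spectrum[k] = 0
    return [v.real for v in dft(spectrum, inverse=True)]
```

A constant signal is removed entirely, while a rapidly alternating one passes through unchanged, which is the behavior a high-frequency feature extractor relies on.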


[139] VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers cs.CVPDF

Yiren Song, Wangzi Yao, Haofan Wang, Mike Zheng Shou

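TL;DR: This paper proposes VISTA, a video style transfer framework based on diffusion transformers and triplet supervision, and builds VISTA-1000, a large-scale synthetic dataset with 1,000 styles. The method extracts style with a lightweight style adapter and performs video style transfer in an in-context learning framework, targeting the temporal inconsistency that heuristic temporal propagation causes in existing methods.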

Details

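Motivation: Existing video style transfer methods usually stylize frames or keyframes and enforce consistency through heuristic temporal propagation, which tends to produce drift and flickering artifacts under occlusion, disocclusion, and long-range motion. The fundamental bottleneck is the lack of large-scale triplet data and of a training paradigm that jointly models and disentangles style, content, and motion.

Result: Extensive experiments show state-of-the-art performance in style fidelity, temporal consistency, and content preservation.

Insight: The contributions are VISTA-1000, a large-scale dataset of motion-aligned (style reference, clean video, stylized video) triplets, and a diffusion-transformer-based in-context learning framework that uses a lightweight style adapter for robust style extraction, jointly modeling style, content, and motion end to end.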

Abstract: Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate SOTA performance in style fidelity, temporal consistency, and content preservation.


[140] Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment cs.CV | cs.AIPDF

Jiaqing Li, Yajuan Lu, Xiaochuan Shi, Gang Wu, ZhongYuan Wang

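TL;DR: This paper proposes a new black-box membership inference attack (MIA) against vision-language models (VLMs), designed for the strict single-sample setting. It distinguishes training members from non-members by quantifying the cross-modal semantic alignment between an image and its generated caption in a joint embedding space, overcoming the reliance of existing methods on internal logits or large-scale statistical distributions.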

Details

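Motivation: The work addresses key bottlenecks of existing MIAs against VLMs: gray-box methods rely on internal logits that are usually restricted, while black-box methods rely on large-scale statistical distributions and struggle in the single-sample setting.

Result: On the VL-MIA/Flicker dataset, the method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines, and it remains robust under a variety of image perturbations.

Insight: The novelty is approaching membership inference from the perspective of cross-modal semantic alignment. Building on the observation that training memorization makes member images align more strongly with their captions, the authors design a quantification framework that needs no unrealistic assumptions. This provides a more practical black-box, single-sample attack for assessing the data security risks of VLMs.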

Abstract: Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model’s training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box methods depend on large-scale statistical distributions and therefore struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment, and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for the strict black-box, single-sample setting that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
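
The alignment signal can be pictured as a similarity score between two vectors in a joint embedding space. The sketch below is a hypothetical simplification: it assumes the image and the model-generated caption have already been embedded by some joint image-text encoder, and it uses cosine similarity with an arbitrary threshold. The abstract does not state the paper's actual scoring function or decision rule.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_member(image_emb: Sequence[float], caption_emb: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """Flag a sample as a likely training member when alignment is high.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    return cosine_similarity(image_emb, caption_emb) >= threshold
```

Because the decision uses only one image and one generated caption, it needs neither model logits nor a reference population of samples.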


[141] Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction cs.CVPDF

Chaoqun He, Mingyang Xiang, Yingjing Xu, Bokai Xu, Junbo Cui

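TL;DR: The paper proposes Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex omni-modal interaction. It covers two complementary scenarios (Real-Time Description and Proactive Reminder), with 660 videos carrying fine-grained annotations and 9 real-world tasks, and introduces an automatic evaluation framework based on LLM-as-a-Judge. Experiments reveal substantial limitations in current state-of-the-art duplex MLLMs.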

Details

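Motivation: Existing MLLMs are mostly evaluated offline. There is no comprehensive benchmark or automatic evaluation method for real-time duplex interaction in real-world scenarios, where a model must continuously process streaming inputs and respond at appropriate moments.

Result: Experiments on state-of-the-art duplex MLLMs show that the best-performing model scores only 39.6% overall and only 20.0% on Proactive Reminder, exposing major challenges in balancing timely responses with coherent content generation and in deciding both when to respond and what to say.

Insight: The contribution is the first comprehensive benchmark for real-time duplex omni-modal interaction, together with an LLM-as-a-Judge automatic evaluation framework that combines timestamp-aware and sequential reasoning to jointly assess response-content alignment and response timing, in close agreement with human judgments.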

Abstract: Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.


[142] UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings cs.CVPDF

Bo Zhao, Maosheng Pang, Chen Zhang, Huan Yang, Yixin Cao

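TL;DR: This paper presents UniPPTBench, a unified benchmark for evaluating presentation-generation systems across diverse input settings, covering four representative scenarios: vague-prompt, long-document, multimodal-document, and multi-source generation. The authors also introduce UniPPTEval, an evaluation protocol that combines shared cross-scenario metrics with scenario-specific metrics targeting each scenario's core requirements (such as content grounding, visual-text alignment, and cross-source synthesis).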

Details

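Motivation: Existing work typically studies presentation generation under isolated input settings, whereas real-world use cases are diverse. Current evaluations rely mainly on generic quality criteria and fail to assess the core capabilities that different input settings require, leaving the field without a unified benchmark and evaluation framework that can faithfully diagnose system behavior across real-world scenarios.

Result: Experiments on UniPPTBench reveal large performance differences across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily mean the task is fulfilled well in scenarios that require content grounding.

Insight: The contribution is a unified benchmark covering diverse real-world input scenarios and a scenario-aware evaluation protocol that combines cross-scenario comparison with scenario-specific assessment, giving the field a more faithful and diagnostic evaluation basis. The work underlines the importance of evaluating a task's core requirements directly rather than relying only on generic quality metrics.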

Abstract: Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.


[143] Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration cs.CVPDF

Yiren Song, Huilin Zhong, Kevin Qinghong Lin, Haofan Wang, Mike Zheng Shou

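TL;DR: This paper proposes Soap2Soap, a multi-agent collaboration framework for series-level remaking of long cinematic video. It aims to remake full episodes or films through stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots.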

Details

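Motivation: The work targets long-term consistency failures of existing video generation and editing methods on long sequences, such as identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes.

Result: On the SoapBench benchmark, the method shows significant improvements over commercial video generation APIs in long-term consistency and narrative fidelity.

Insight: The contributions are: 1) a Dual-Bridge Consistency mechanism that combines a scene-aware JSON screenplay as a persistent semantic backbone with dynamically allocated visual reference anchors at the scene and shot levels; 2) batch keyframe consistency, which jointly generates multiple keyframes in a shared latent context; and 3) a closed-loop verification agent that audits identity, stability, and alignment to trigger selective regeneration.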

Abstract: We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.


[144] Medical Context Distorts Decisions in Clinical Vision Language Models cs.CV | cs.CLPDF

David Restrepo, Ira Ktena, Maria Vakalopoulou, Stergios Christodoulidis, Enzo Ferrante

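TL;DR: This paper studies the reliability of vision-language models for clinical decision support and finds three failure modes when these models integrate visual and textual information from medical records: over-reliance on text over images, spurious reliance on irrelevant clinical history, and prompt sensitivity. Evaluating a range of general-domain and medically tuned VLMs on MIMIC-CXR, the authors find that decisions are dominated by the text modality and are easily swayed by irrelevant reports and minor prompt changes.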

Details

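Motivation: VLMs are not yet reliable at integrating visual and textual information in clinical practice; the paper investigates their potential failure modes in realistic medical scenarios.

Result: A range of open and closed VLMs were evaluated on MIMIC-CXR chest X-ray tasks. Decisions are dominated by the text modality and visual evidence is often ignored; irrelevant clinical history misleads the models; and semantically equivalent prompt variations can reverse correct image-based predictions.

Insight: The paper exposes modality imbalance and context sensitivity as fundamental weaknesses of VLMs in the medical domain, stresses the need for explicit safeguards and stress-testing before clinical deployment, and provides a systematic framework for assessing the reliability of medical AI.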

Abstract: Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.


[145] FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing cs.CV | cs.CLPDF

Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan

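TL;DR: FastOCR is a training-free acceleration framework that uses a dynamic visual fixation mechanism to sharply reduce the number of visual tokens a vision-language model needs for document parsing. It consists of two modules, Focal-Guided Pruning and Cross-Step Fixation Reuse. Without permanently discarding information, it cuts the visual tokens attended at each decoding step to 5% while retaining 98% of the original model's accuracy and reducing attention latency by 3×.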

Details

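Motivation: The paper addresses the excessive inference cost that VLMs incur on dense-document OCR because of the large number of visual tokens they must encode. Existing pruning methods based on physical eviction cause catastrophic accuracy degradation on document images, since almost every visual token may correspond to a character or structural element.

Result: Extensive experiments on five VLMs of different sizes and architectures show that FastOCR generalizes consistently as a plug-and-play module. On Qwen2.5-VL, it retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, and reduces attention latency by 3.0×.

Insight: The core insight is that the model's attention to document images is temporally sparse (the Dynamic Visual Fixation phenomenon), which lets the authors recast an intractable global pruning problem as a tractable local, dynamic one. By dynamically adjusting which cached tokens are attended rather than evicting tokens, the method avoids permanent information loss and accelerates document parsing while keeping accuracy high.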

Abstract: Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model’s attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model’s accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.
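
The difference between evicting tokens and merely restricting attention to a per-step subset can be shown with a toy example. Everything below is a hypothetical simplification: tokens carry a single scalar value, the relevance scores are given as input, and the function names are invented for illustration. How FastOCR derives its scores from focal layers is not reproduced here.

```python
import math
from typing import List


def select_fixation(scores: List[float], keep_ratio: float) -> List[int]:
    """Indices of the highest-scoring visual tokens for one decoding step."""
    if not scores:
        raise ValueError("scores must not be empty")
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def attend(cache: List[float], scores: List[float], keep_ratio: float = 0.05) -> float:
    """Softmax-weighted sum over the selected tokens only.

    The cache is read, never modified, so a token skipped at this step
    can still be selected at a later one.
    """
    idx = select_fixation(scores, keep_ratio)
    peak = max(scores[i] for i in idx)
    weights = [math.exp(scores[i] - peak) for i in idx]
    return sum(w * cache[i] for w, i in zip(weights, idx)) / sum(weights)
```

Calling `attend` with different score vectors at successive steps selects different subsets of the same intact cache, which mirrors the shifting fixation described in the abstract.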


[146] Spatial Blindness in Whole-Slide Multiple Instance Learning cs.CV | cs.AIPDF

Xiangyu Li, Ran Su

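TL;DR: The paper exposes "spatial blindness" in whole-slide multiple instance learning (MIL) models: many models described as context-aware are in fact insensitive to the spatial arrangement of tissue architecture. The authors propose ResTopoMIL, which separates the learning of appearance statistics from the learning of spatial relations, and report improved classification and survival prediction on several public WSI benchmarks.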

Details

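Motivation: The paper identifies and addresses a key defect of current whole-slide MIL models in pathology. Although they are described as context-aware, on tasks where the diagnostic signal depends on tissue architecture (spatial relations) they are in practice insensitive to the arrangement of patch coordinates; that is, they are spatially blind.

Result: Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with only 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.

Insight: The core insight concerns optimization in MIL training: dense appearance statistics are learned quickly and early, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses this with a training strategy that first fits and freezes a permutation-invariant prototype histogram, then trains a lightweight graph branch to learn the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention lies in the training procedure.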

Abstract: Whole-slide MIL models are often called context-aware once graphs, Transformers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide-level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.
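
The coordinate-permutation check described in the abstract can be sketched on toy data. The two "models" below are hypothetical stand-ins invented for illustration: one pools features and ignores coordinates, the other depends on which patches are adjacent. Neither is a real MIL architecture.

```python
import random
from typing import Callable, List, Tuple

Patch = Tuple[float, Tuple[int, int]]  # (feature value, (row, col) coordinate)


def shuffle_coordinates(patches: List[Patch], seed: int = 0) -> List[Patch]:
    """Keep every feature but reassign the patch coordinates at random."""
    rng = random.Random(seed)
    coords = [c for _, c in patches]
    rng.shuffle(coords)
    return [(f, c) for (f, _), c in zip(patches, coords)]


def mean_pool(patches: List[Patch]) -> float:
    """Compositional model: the output depends only on the bag of features."""
    return sum(f for f, _ in patches) / len(patches)


def neighbor_contrast(patches: List[Patch]) -> float:
    """Spatial model: mean absolute difference between horizontal neighbors."""
    by_coord = {c: f for f, c in patches}
    diffs = [abs(f - by_coord[(r, c + 1)])
             for f, (r, c) in patches if (r, c + 1) in by_coord]
    return sum(diffs) / len(diffs) if diffs else 0.0


def spatial_sensitivity(model: Callable[[List[Patch]], float],
                        patches: List[Patch], seed: int = 0) -> float:
    """Absolute change in the model output after coordinates are shuffled."""
    return abs(model(patches) - model(shuffle_coordinates(patches, seed)))
```

A sensitivity of zero for every shuffle, as `mean_pool` produces, is the toy analogue of a slide-level AUC that does not move when patch coordinates are permuted.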


[147] DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking cs.CVPDF

Guyue Hu, Haoming Liu, Siyuan Song, Chenglong Li, Feng Chen

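TL;DR: This paper introduces DeTrack, a new drone-embodied tracking task in which a drone tracks a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. The authors build a large-scale benchmark for the task and propose AaDWorlds, an altitude-aware dual world model framework, to solve it.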

Details

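Motivation: Existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed cameras or predefined flight paths. They treat the drone as a passive camera rather than as an embodied agent that actively perceives, interacts, and controls its motion in dynamic 3D scenes. This paper defines and tackles the more challenging drone-embodied tracking task.

Result: Experiments on the DeTrack benchmark show that AaDWorlds improves closed-loop tracking performance on all evaluation metrics, including target visibility, tracking accuracy, and trajectory success.

Insight: The core contributions are the new drone-embodied tracking task with its benchmark, and the AaDWorlds framework. AaDWorlds uses an altitude-aware perception module and dual world models that imagine future scenes under high- and low-altitude flight regimes. By combining pseudo altitude-aware observations with imagined future states, it alleviates the inherent altitude-mediated conflict between target visibility and flight safety.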

Abstract: Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.


[148] Weighted Reverse Convolution for Feature Upsampling cs.CVPDF

Wentong Li, Zhiyuan Qi, Zichen Zhao, Kai Zhang, Lei Zhang

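TL;DR: This paper proposes Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for upsampling features of vision foundation models (VFMs). It formulates feature upsampling as a weighted Tikhonov-regularized least-squares problem in which spatially varying weights modulate data fidelity and prior strength, preserving critical structures while avoiding over-smoothing. WRC has an efficient, differentiable closed-form FFT solution, works as a drop-in operator, and markedly improves feature quality on a range of dense prediction tasks.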

Details

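Motivation: The patch-level features of pre-trained VFMs are inherently coarse, which limits their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. The paper revisits VFM feature upsampling as an inverse problem in order to obtain finer features.

Result: Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality on downstream benchmarks including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while remaining computationally efficient.

Insight: The core contribution is formalizing feature upsampling as a weighted Tikhonov-regularized inverse problem and proposing the WRC operator with spatially adaptive weights. Its closed-form FFT solution keeps it efficient and differentiable, making it a practical drop-in module that balances structure preservation against over-smoothing.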

Abstract: Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of an inverse problem and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.
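
For orientation, the unweighted special case of a Tikhonov-regularized deconvolution has a well-known closed form in the Fourier domain. The sketch below assumes circular convolution with a known kernel $k$, an observation $y$, a reference estimate $x_0$, a scalar $\lambda > 0$, and no downsampling operator. It is the classical result, not the paper's weighted formulation, whose exact objective and solution are not given in the abstract.

```latex
\begin{aligned}
\hat{x} &= \arg\min_{x}\; \lVert k \ast x - y \rVert_2^2 + \lambda \lVert x - x_0 \rVert_2^2, \\
\hat{x} &= \mathcal{F}^{-1}\!\left( \frac{\overline{\mathcal{F}(k)} \odot \mathcal{F}(y) + \lambda\, \mathcal{F}(x_0)}{\lvert \mathcal{F}(k) \rvert^2 + \lambda} \right).
\end{aligned}
```

The division is element-wise, so the solution costs a few FFTs and is differentiable in $y$, $k$, and $\lambda$. Per the abstract, WRC replaces the single scalar trade-off with spatially varying weights on both terms while keeping an FFT-based closed form.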


[149] Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory cs.CVPDF

Tianchen Deng, Zhenxiang Xiong, Nailin Wang, Fangjinhua Wang, Jiuming Liu

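TL;DR: The paper proposes Mamba-VGGT, an enhanced Visual Geometry Grounded Transformer (VGGT) framework that targets catastrophic geometric forgetting and accumulated drift in 3D scene reconstruction from long video sequences. Its core component is a Sliding Window Mamba (SWM) memory module that uses selective state-space modeling to maintain an explicit external memory token across temporal windows, combined with a Zero-Init Spatial Memory Injector that fuses long-term temporal cues into the spatial features of the pre-trained VGGT.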

Details

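Motivation: When processing long sequences, existing VGGT models must truncate the temporal window because of the quadratic complexity of global attention. This causes catastrophic geometric forgetting and accumulated drift, limiting persistent long-range reasoning in extensive 3D environments.

Result: Extensive experiments show that the method significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing accumulated trajectory error, providing a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.

Insight: The novelty lies in using a selective state-space model (Mamba) to build an external sliding-window memory module that distills and propagates long-range geometric priors at linear complexity, and in using zero-convolution layers to adaptively fuse long-term memory with pre-trained spatial features without disrupting the already optimized representation.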

Abstract: Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulation drift, primarily due to the quadratic complexity of global attention which necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial features of the pre-trained VGGT, we propose a Zero-Init Spatial Memory Injector. Utilizing zero-convolutional layers, this injector adaptively fuses persistent memory into the patch token stream, ensuring structural stability and seamless feature alignment. Extensive experiments demonstrate that our approach significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing trajectory accumulation errors. Our work provides a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.
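
The point of zero initialization is that the injected branch contributes nothing at the start of training, so the pre-trained features pass through unchanged. A minimal sketch follows, assuming a plain linear projection as a stand-in for the paper's zero-convolutional layers; the function names and shapes are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def zero_matrix(rows: int, cols: int) -> Matrix:
    """Projection weights initialized to zero."""
    return [[0.0] * cols for _ in range(rows)]


def matvec(weights: Matrix, vec: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def inject_memory(patch_tokens: List[Vector], memory: Vector, proj: Matrix) -> List[Vector]:
    """Residually add the projected memory vector to every patch token."""
    delta = matvec(proj, memory)
    return [[t + d for t, d in zip(token, delta)] for token in patch_tokens]
```

With `proj = zero_matrix(dim_token, dim_memory)` the output equals the input tokens exactly; the memory only starts to influence them once training moves the projection away from zero.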


[150] Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation cs.CV | cs.MM | cs.SDPDF

Yuheng Chen, Qingdong He, Teng Hu, Yuji Wang, Yabiao Wang

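TL;DR: This paper presents Omni-Customizer, an end-to-end framework for multimodal customization in joint audio-video generation, that is, simultaneously preserving the visual identities and vocal timbres of multiple interacting subjects. It uses an Omni-Context Fusion module and a Masked TTS Cross-Attention mechanism to bind and fuse multimodal identity information precisely, and adopts a training strategy with interleaved audio-video scheduling and a progressive curriculum. Experiments show state-of-the-art performance on dual-modal customized generation.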

Details

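Motivation: Powerful foundation models have transformed joint audio-video generation, but multimodal customization that simultaneously preserves the visual identities and vocal timbres of multiple interacting subjects remains underexplored. This paper aims to fill that gap.

Result: Extensive experiments show that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling in visual identity similarity, timbre consistency, precise audio-video synchronization, and overall audio-video fidelity.

Insight: The contributions are: 1) the Omni-Context Fusion module and Masked TTS Cross-Attention mechanism, which handle multimodal information fusion and the "speech leakage" problem; 2) Semantic-Anchored Multimodal RoPE, which anchors visual and audio reference tokens and TTS embeddings to their semantic descriptions for structured fusion; and 3) a training strategy with interleaved audio-video scheduling and a progressive curriculum, for rapid adaptation to multilingual scenarios and for learning robust identity features.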

Abstract: The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe “speech leakage” problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.


[151] Employing Vision-Language Models for Face Image Quality Assessment cs.CVPDF

Erdi Sarıtaş, Eren Onaran, Vitomir Štruc, Hazım Kemal Ekenel

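TL;DR: This paper studies zero-shot face image quality assessment (FIQA) with off-the-shelf vision-language models (VLMs), addressing the lack of interpretability in traditional FIQA methods. The authors present a comprehensive evaluation framework that benchmarks against traditional FIQA methods using error-versus-reject curves, and analyze the interpretability, consistency, and prompt robustness of VLMs on diverse datasets.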

Details

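Motivation: State-of-the-art traditional FIQA methods achieve high utility but usually act as "black boxes" that output only a scalar score with no human-understandable explanation. This limits their usefulness in human-in-the-loop scenarios that need actionable feedback, such as automated border control.

Result: Biometric utility depends largely on model architecture rather than parameter count. Most VLM outputs agree with traditional methods, but ranking performance and generated scores can vary across prompts. A synthetic ablation study shows that a higher parameter count can improve internal consistency yet may give worse degradation-detection performance than smaller models.

Insight: The novelty is the first systematic exploration of zero-shot VLMs for FIQA, with a framework for assessing their performance, interpretability, and robustness. Viewed objectively, the study provides empirical support for integrating VLMs into traditional FIQA pipelines as an interpretability module, and highlights that architecture design matters more than scale alone.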

Abstract: Face Image Quality Assessment (FIQA) is a crucial control step in biometric pipelines. It ensures only reliable samples are processed to maintain system accuracy. State-of-the-art FIQA methods achieve high utility but typically operate as “black boxes.” They produce scalar scores without human-interpretable justifications. This lack of transparency limits their effectiveness in human-in-the-loop scenarios, such as automated border control, where actionable feedback is essential. In this paper, we investigate the potential of off-the-shelf Vision-Language Models (VLMs) to bridge this gap by performing FIQA in a zero-shot setting. We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves. Additionally, using a diverse set of datasets, ranging from surveillance-oriented to synthetically generated, we analyzed their interpretability, consistency, and robustness to prompt changes. Our results show biometric utility performance depends significantly on architecture, not merely on parameter count. Most VLMs’ outputs align with those of traditional methods. We also find that VLM ranking performance and the generated scores may vary across prompts. Our synthetic ablation study shows that while increasing the parameter count can improve internal consistency, it yields worse degradation-detection performance than smaller models. These findings suggest that zero-shot FIQA score estimation using VLMs is promising and could effectively complement conventional FIQA pipelines as an interpretability module. The codes are available at https://github.com/ThEnded32/VLM4FIQA.git.
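
An error-versus-reject curve asks whether discarding the samples a quality method rates lowest actually lowers the downstream error. The sketch below is a simplified, per-sample version with hypothetical inputs: each sample has a quality score and a flag saying whether it led to a recognition error. In biometric practice the curve is usually computed over comparison pairs at a fixed decision threshold, which is not modeled here.

```python
from typing import List, Tuple


def error_vs_reject(quality: List[float], is_error: List[bool],
                    reject_fractions: List[float]) -> List[Tuple[float, float]]:
    """For each reject fraction, drop the lowest-quality samples and
    report the error rate among the samples that remain."""
    order = sorted(range(len(quality)), key=lambda i: quality[i])  # worst first
    curve = []
    for frac in reject_fractions:
        n_reject = int(frac * len(order))
        kept = order[n_reject:]
        rate = sum(is_error[i] for i in kept) / len(kept) if kept else 0.0
        curve.append((frac, rate))
    return curve
```

A useful quality score yields a curve that falls as the reject fraction grows; a score unrelated to recognition errors yields a roughly flat curve.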


[152] A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle cs.CV | cs.AIPDF

Guancheng Zhou, Yisi Luo, Zhengfu He, Zhenyu Jin, Xuyang Ge

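TL;DR: This paper proposes a theoretical distributional view of visual mechanistic interpretability (MI): by modeling the influence of a feature activation on the natural image distribution, it formalizes the MI task as a KL-minimal optimization problem. Within this framework, the authors identify statistical biases in earlier MI methods and propose a model based on a KL-minimal soft-constraint principle, realized through energy-guided diffusion posterior sampling, to balance interpretability and faithfulness. Experiments support the theoretical soundness of the view and its effectiveness on the DINOv3 vision model.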

Details

Motivation: 当前视觉机制可解释性方法多依赖启发式方法(如top-K激活检索或正则化优化),缺乏理论基础,可能导致结果在人类感知上不可解释或对模型机制不忠实。本文旨在建立一种理论分布视角,以解决这些偏差。

Result: 在DINOv3视觉模型上进行了广泛实验,验证了所提分布视角的理论合理性,并证明了新范式在实际应用中的有效性,但摘要未提及具体定量结果或基准测试对比。

Insight: 创新点在于将视觉机制可解释性任务形式化为KL最小化优化问题,提出KL最小化软约束原则来平衡可解释性和忠实性,并利用能量引导扩散后验采样实现该原则,为MI提供了更严谨的理论基础。

Abstract: Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-$K$ activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.
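The abstract does not state the objective itself. As a hedged illustration only, one standard way to write a KL-minimal problem with a soft constraint is shown below; the symbols $p$ (natural image distribution), $a(x)$ (activation of the feature under study), and $\lambda \ge 0$ (trade-off weight) are assumptions introduced here, not notation from the paper.

$$
q^\star = \arg\min_{q}\; \mathrm{KL}(q \,\|\, p) - \lambda\, \mathbb{E}_{x \sim q}[a(x)],
\qquad
q^\star(x) = \frac{p(x)\, e^{\lambda a(x)}}{\mathbb{E}_{x' \sim p}\!\left[e^{\lambda a(x')}\right]}.
$$

Under this assumed form, $\lambda \to 0$ recovers $p$ (natural-looking images that need not activate the feature), while large $\lambda$ concentrates mass on strongly activating images that may leave the natural image distribution. The tilted density corresponds to an energy $E(x) = -\lambda a(x)$ added to the prior, which is the kind of target that energy-guided posterior sampling is designed to draw from.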


[153] Don’t Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification cs.CVPDF

Yuting Yang, Haichao Jiang, Tianming Liang, Quan Zhang, Jian-Fang Hu

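TL;DR: This paper proposes IC-Seg, an agentic framework that proactively clarifies user intent through multi-turn dialogue to handle ambiguous user queries in referring segmentation. It also introduces Hi-GRPO, a hierarchical optimization strategy that improves dialogue efficiency, and builds Ambi-RVOS, a benchmark of ambiguous queries, for evaluation.

Details

Motivation: Existing referring segmentation methods assume that user queries are always precise and unambiguous, which is unrealistic in practice; when a query is ambiguous, models guess the user's intent arbitrarily and produce undesired results.

Result: On the proposed Ambi-RVOS benchmark, IC-Seg outperforms existing methods by a large margin; it also maintains state-of-the-art performance on standard reasoning segmentation benchmarks.

Insight: The novelty lies in bringing an active clarification mechanism to referring segmentation, resolving ambiguity through multi-turn dialogue. Hi-GRPO injects dense supervision signals at the trajectory, turn, and step levels, which improves dialogue efficiency and intent clarification.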

Abstract: Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.


[154] Designing streetscapes from street-view imagery using diffusion models cs.CVPDF

Yuzhou Chen, Yuebing Liang, Lingqian Hu, Kailai Sun, Qingqi Song

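TL;DR: This paper proposes a diffusion-based multimodal generative AI framework that synthesizes alternative streetscapes from street-view imagery, conditioned on target visual metrics, to support scenario exploration in urban planning and design. The framework combines text and image controls and generates realistic, semantically consistent street-view images.

Details

Motivation: Existing studies mainly quantify current street environments and rarely generate alternative or non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning.

Result: Quantitative evaluation shows that adding visual controls reduces the LPIPS index by about 6% while maintaining global visual realism, and raises mIoU by 23.7% in Orlando and 46.4% in Chicago, with class-wise gains exceeding 100% for building view indices.

Insight: The contribution is a dataset that aligns multimodal data (textual descriptions, segmentation maps, road masks, quantitative metrics) and a demonstration that diffusion models allow fine-grained text and image control in streetscape generation. When the two controls conflict, image controls dominate. The work establishes a benchmark for urban scene generation and points to the importance of visual controls.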

Abstract: Street-view imagery (SVI) is widely used to quantify key indicators of urban environment, such as greenery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative and non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements in Chicago and Orlando. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and imagery controls. Our quantitative evaluations show that incorporating visual controls can improve semantic consistency, reducing the LPIPS index by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency increases by 23.7% in Orlando and 46.4% in Chicago, as measured by the mIoU index, with class-wise gains exceeding even 100% improvement for building view indices. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts, and when textual and visual controls conflict, imagery controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation using SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.
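The two kinds of measurement the abstract relies on, view indices and mIoU, can be sketched on toy segmentation maps. This is an illustrative sketch rather than the authors' pipeline: the class ids and the helper names `view_index` and `mean_iou` are assumptions.

```python
def view_index(seg, cls):
    """Fraction of pixels in a segmentation map labelled `cls`."""
    pixels = [p for row in seg for p in row]
    return sum(p == cls for p in pixels) / len(pixels)


def mean_iou(pred, target, classes):
    """Mean intersection-over-union over classes present in either map."""
    p = [x for row in pred for x in row]
    t = [x for row in target for x in row]
    ious = []
    for c in classes:
        inter = sum(a == c and b == c for a, b in zip(p, t))
        union = sum(a == c or b == c for a, b in zip(p, t))
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x3 label maps; ids are hypothetical: 0 = road, 1 = greenery, 2 = building.
target = [[1, 1, 2], [0, 0, 2]]
pred = [[1, 2, 2], [0, 0, 2]]
green_target = view_index(target, 1)
score = mean_iou(pred, target, [0, 1, 2])
```

Comparing the segmentation of a generated image against the conditioning segmentation map in this way gives one number for semantic consistency, while a view index summarizes a single visual element.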


[155] Rethinking Point Clouds as Sequences: A Causal Next-Token Predictive Learning Framework cs.CVPDF

Yumeng Yao, Jingzhi Dong, Haowen Gu, Tao Chen, Zonghan Wu

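TL;DR: This paper proposes PointNTP, a framework that recasts point cloud pre-training as a fully causal, decoder-free latent next-token prediction problem. Point clouds are partitioned into local patches and serialized into a 3D token sequence, which a causal Transformer models under prefix-only conditioning and which is trained with a shift-based prediction objective. Experiments show competitive results on several downstream tasks, offering a simple, scalable, and potentially modality-agnostic paradigm for point cloud self-supervised learning.

Details

Motivation: Existing point cloud self-supervised methods build on masked reconstruction or explicit geometric generation and therefore remain tied to input recovery rather than predictive dependency modeling. The paper seeks a pre-training paradigm for point clouds that is better aligned with next-token and next-embedding learning.

Result: On ScanObjectNN, PointNTP reaches 93.8% (+0.5%), 92.6% (+0.3%), and 89.3% (+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS respectively; it obtains 85.0% (+0.1%) Cls.mIoU on ShapeNetPart and 71.1% mAcc on S3DIS Area 5, which is highly competitive.

Insight: The novelty lies in treating point clouds as sequences and learning structural dependencies in latent space through fully causal next-token prediction, with no reconstruction decoder or explicit geometric recovery. This offers a new perspective on foundation-style predictive learning for 3D data.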

Abstract: With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction objective stabilized by stop-gradient targets. This design enables the model to learn structural dependencies directly in latent space, without reconstruction decoders or explicit geometric recovery. Extensive experiments demonstrate that the proposed PointNTP is highly competitive across multiple downstream tasks: it achieves 93.8%(+0.5%), 92.6%(+0.3%), and 89.3%(+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS of ScanObjectNN, respectively; obtains 85.0%(+0.1%) in Cls.mIoU on ShapeNetPart; and reaches 71.1% mAcc on S3DIS Area 5. Overall, decoder-free causal latent prediction provides a simple, scalable, and potentially modality-agnostic paradigm for point-cloud self-supervised learning, offering a new 3D perspective on foundation-style predictive learning for 3D data.
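The serialization and shift-based pairing described in the abstract can be sketched as follows. The abstract only says patches are ordered "according to patch-center geometry", so the Morton (Z-order) curve used here is an assumption, as are all function names; a real model would predict latent embeddings with stop-gradient targets rather than patch indices.

```python
def morton_key(center, bits=4):
    """Interleave the bits of quantised (x, y, z) in [0, 1) into one integer."""
    q = [min(int(c * (1 << bits)), (1 << bits) - 1) for c in center]
    key = 0
    for b in range(bits):
        for axis in range(3):
            key |= ((q[axis] >> b) & 1) << (3 * b + axis)
    return key


def serialize(centers):
    """Order patch indices along the (assumed) Morton curve."""
    return sorted(range(len(centers)), key=lambda i: morton_key(centers[i]))


def shifted_pairs(sequence):
    """Causal next-token pairs: the prefix seq[:t+1] predicts seq[t+1]."""
    return [(sequence[: t + 1], sequence[t + 1]) for t in range(len(sequence) - 1)]


# Three hypothetical patch centers in the unit cube.
centers = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1)]
order = serialize(centers)
pairs = shifted_pairs(order)
```

Each pair exposes only the prefix to the predictor, which is the property a causal attention mask enforces inside a Transformer.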


[156] HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos cs.CV | cs.GRPDF

Jeongeun Park, Janghyeok Han, Geonung Kim, Hyun-Seung Lee, Kyuha Choi

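TL;DR: This paper proposes HL-OutPaint, an outpainting framework for high-resolution, long videos. It follows a coarse-to-fine, two-stage strategy: it first builds Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion, and then performs high-resolution outpainting under that guidance to generate spatially detailed and temporally consistent content.

Details

Motivation: Video outpainting generates plausible visual content beyond the original spatial extent of a video so that it can be adapted to different display formats. Most existing methods address only one of the two challenges, large spatial extrapolation or long sequences, and lack explicit mechanisms for global spatio-temporal consistency.

Result: Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.

Insight: The core contribution is a global-local frame swapping mechanism for building GCG. It couples sparse global keyframes with local temporal windows and exchanges information between them during sampling, so that one representation encodes both long-term structural consistency and short-term temporal dynamics. Separating global structure modeling from fine-grained synthesis yields stable, coherent generation for large spatial expansion and long sequences.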

Abstract: Video outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.


[157] SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening cs.CV | cs.CLPDF

Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra

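TL;DR: SafeLens is a video content-safety guardrail framework with a fast-and-slow inference architecture. It screens most videos with fast pattern recognition and applies deeper reasoning only to the few complex cases, which cuts computational cost substantially while preserving accuracy.

Details

Motivation: The rapid growth of online video platforms and AI-generated content makes reliable content guardrails a key challenge. Existing approaches usually apply a large vision-language model uniformly to every input, which leads to high inference cost and inefficient allocation of computation.

Result: On real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, surpassing strong open-source and closed-source models while significantly reducing inference cost.

Insight: The novelty lies in the fast-and-slow inference architecture, which lets computational cost vary across inputs. A small, high-quality dataset built by influence-guided filtering, combined with structured Chain-of-Thought traces for test-time reasoning, supports the claim that efficient design can be more effective than scaling data or model size alone.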

Abstract: The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment. While most videos can be screened through fast pattern recognition, a small subset requires deeper reasoning over temporally complex content and nuanced policy constraints. Existing approaches typically rely on large vision-language models applied uniformly across all inputs, resulting in high inference costs and inefficient allocation of computation. We propose SafeLens, a video guardrail framework that introduces a fast-and-slow inference architecture for efficient and accurate content moderation with variable computational cost across inputs. Additionally, we construct a high-quality dataset by applying influence-guided filtering to the SafeWatch Dataset, retaining only 2.4% of the original data. To further address limitations of training-time scaling, we enable test-time reasoning by augmenting the filtered data with structured Chain-of-Thought traces. Across real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, outperforming strong open-source video guardrails (e.g., SafeWatch-8B, OmniGuard-7B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-pro) while significantly reducing inference cost, demonstrating that efficient design serves to be more effective than scaling data or model size alone.
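A fast-and-slow screening loop of the kind described above can be sketched as a confidence-gated cascade. This is a generic sketch, not SafeLens itself: the threshold, the dictionary-based stand-in models, and the name `cascade` are all assumptions.

```python
def cascade(video, fast_model, slow_model, threshold=0.9):
    """Return (label, route); escalate only when the fast screen is unsure."""
    label, confidence = fast_model(video)
    if confidence >= threshold:
        return label, "fast"
    return slow_model(video), "slow"


# Stand-in models (hypothetical): a real system would call neural networks here.
def fast_model(video):
    risk = video["risk"]  # cheap risk score in [0, 1]
    label = "unsafe" if risk >= 0.5 else "safe"
    return label, max(risk, 1.0 - risk)


def slow_model(video):
    return video["careful_label"]  # result of expensive deliberate reasoning


clear_case = {"risk": 0.02, "careful_label": "safe"}
hard_case = {"risk": 0.6, "careful_label": "safe"}
clear_result = cascade(clear_case, fast_model, slow_model)
hard_result = cascade(hard_case, fast_model, slow_model)
```

Average cost then depends on how often the fast path is confident, which is why a cascade can be cheaper than running the large model on every input.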


[158] Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks cs.CV | cs.CRPDF

Mohammadreza Rashidi, Raja Hashim Ali, Sami Ur Rahman

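TL;DR: This paper proposes a deepfake video detector based on a 3D convolutional neural network (R3D-18) that improves detection by analyzing temporal artifacts. The method reaches 92.8% accuracy on the DeepfakeTIMIT dataset and generalizes well in cross-dataset tests on FaceForensics++.

Details

Motivation: As generator quality improves, frame-level deepfake detectors that rely only on spatial features degrade sharply, while temporal inconsistencies remain evident even in high-quality fakes. Detectors that exploit temporal artifacts are therefore needed.

Result: On DeepfakeTIMIT at 128x128 resolution, the model reaches 92.8% accuracy. Cross-dataset transfer to FaceForensics++ reaches 76.4% without fine-tuning and improves further after minimal fine-tuning. Ablations show that transfer learning contributes 7.2 percentage points and face tracking 3.5 points, while temporal-consistency regularization gives additional gains on high-quality fakes.

Insight: The novelty lies in combining a 3D CNN with a temporal-consistency regularization loss to capture temporal artifacts in deepfake videos. The analysis indicates that temporal artifacts generalize better than spatial ones and survive social-media re-encoding, which makes them a more robust detection signal in practice.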

Abstract: Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.
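The composite loss is described only as binary cross-entropy plus a temporal-consistency regularizer. The sketch below assumes one plausible form for the regularizer, a smoothness penalty on consecutive per-frame features; that choice, the weight `lam`, and the function names are assumptions and not the authors' definition.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one clip: p = predicted P(fake), y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def temporal_penalty(frame_feats):
    """Mean squared change between consecutive per-frame feature vectors."""
    diffs = [
        sum((a - b) ** 2 for a, b in zip(f0, f1))
        for f0, f1 in zip(frame_feats, frame_feats[1:])
    ]
    return sum(diffs) / len(diffs)


def composite_loss(p, y, frame_feats, lam=0.1):
    """Classification term plus a weighted (assumed) temporal regularizer."""
    return bce(p, y) + lam * temporal_penalty(frame_feats)


# Toy clip with three frames and 2-D features (hypothetical values).
feats = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
loss = composite_loss(0.5, 1, feats, lam=0.1)
```

In a real training loop the same two terms would be computed on tensors from the 3D CNN and averaged over a batch of 16-frame clips.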


[159] VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos cs.CVPDF

Zhijing Lu, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal

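TL;DR: VVitCutLER is an unsupervised framework for video object detection and instance segmentation that improves pseudo-label quality through temporal consistency. Its core component, VitCut, uses cross-frame region consistency to reduce the accumulation of pseudo-label errors under motion blur, occlusion, and similar conditions, and uses a distillation decoder to predict instance masks. The framework further integrates cross-frame feature aggregation for video-level robustness and improves both performance and temporal stability on standard video benchmarks.

Details

Motivation: In unsupervised pixel-level video understanding, motion blur, occlusion, and fast object dynamics cause pseudo-labels to drift and flicker over time. The paper aims to make video object detection and segmentation more robust to these effects.

Result: On standard video benchmarks (the abstract does not name them), VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability.

Insight: The novelty lies in VitCut, a temporally stable pseudo-label generator that limits error propagation through cross-frame region consistency, combined with feature distillation and cross-frame aggregation. Together they suggest a scalable design for unsupervised video understanding.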

Abstract: Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. Meanwhile, VitCut uses a distillation decoder to achieve effective instance mask prediction. Then, based on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability. These results highlight the importance of temporally consistent supervision for robust pixel-level video understanding.


[160] TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models cs.CVPDF

Xin Wang, Yixu Wang, Jiaming Zhang, Ruofan Wang, Jiaqi Yu

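TL;DR: This paper proposes TAME, a test-time defense that improves the adversarial robustness of large pre-trained vision-language models such as CLIP without retraining for downstream tasks. It replaces the single adaptive prompt of earlier methods with an input-conditioned Mixture-of-Experts framework, giving a more expressive and adaptive defense.

Details

Motivation: Large pre-trained vision-language models such as CLIP show strong zero-shot generalization but are highly vulnerable to imperceptible adversarial perturbations, which raises serious safety concerns for open-world deployment. The paper aims to improve robustness without downstream task-specific retraining.

Result: Across 11 benchmark datasets including ImageNet, TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. It also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, with an average robustness gain of at least 30.2%.

Insight: The core idea is to recast the test-time defense as an input-conditioned Mixture-of-Experts framework: a bank of learnable expert prompts and an input-dependent router aggregate a customized prompt mixture for each unlabeled test sample. Three unsupervised objectives drive the mechanism: multi-view prediction entropy minimization, layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and MoE regularization that balances expert utilization and prompt diversity.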

Abstract: Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT’s single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.
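Two ingredients named in the abstract, input-dependent routing over expert prompts and multi-view prediction entropy, can be sketched in a few lines. This is a schematic under assumptions: softmax routing, plain weighted averaging of prompt vectors, and the function names are illustrative choices rather than TAME's documented design.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prompts(router_logits, expert_prompts):
    """Weighted sum of expert prompt vectors using softmax routing weights."""
    weights = softmax(router_logits)
    dim = len(expert_prompts[0])
    return [sum(w * p[d] for w, p in zip(weights, expert_prompts)) for d in range(dim)]


def multiview_entropy(view_probs):
    """Entropy of the class distribution averaged over augmented views."""
    n = len(view_probs)
    avg = [sum(v[c] for v in view_probs) / n for c in range(len(view_probs[0]))]
    return -sum(p * math.log(p) for p in avg if p > 0)


# Two hypothetical experts with 2-D prompts, and two views that disagree.
mixed = mix_prompts([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
entropy = multiview_entropy([[1.0, 0.0], [0.0, 1.0]])
```

Minimizing the entropy term at test time pushes the averaged prediction across views toward a single confident class, which is the intuition behind the first objective.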


[161] Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning cs.CV | cs.AI | cs.LGPDF

Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki

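TL;DR: This paper studies the potential of invariant and equivariant semi-supervised learning for training multi-task models on partially labeled datasets. Using FixMatch and its equivariant extension Dense FixMatch, the authors evaluate object detection and semantic segmentation on Cityscapes and BDD100K. Both semi-supervised methods outperform supervised baselines in most settings, especially when few labeled samples are available, and the equivariant method is generally better.

Details

Motivation: Training multi-task models is difficult when datasets are only partially labeled and the tasks have differently structured outputs. The paper explores how to learn effectively from limited labeled data in this setting.

Result: On object detection and semantic segmentation with Cityscapes and BDD100K, invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the largest gains when a task has fewer labeled samples; the equivariant method (Dense FixMatch) generally gives better results.

Insight: The contribution is applying invariant/equivariant semi-supervised learning (FixMatch and Dense FixMatch) to multi-task learning with partial labels and differently structured tasks. The study suggests this is a promising general direction for multi-task learning from limited labeled data.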

Abstract: We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.
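The invariant consistency term at the heart of FixMatch can be sketched on class-probability lists. This is a minimal sketch of the published FixMatch unlabeled loss for classification, not the multi-task training code of this paper; the function name is an assumption.

```python
import math


def fixmatch_unlabeled_loss(weak_probs, strong_probs, tau=0.95):
    """Average masked cross-entropy over a batch of unlabeled samples.

    weak_probs / strong_probs: class probabilities predicted for weakly and
    strongly augmented views of the same samples.
    """
    total = 0.0
    for weak, strong in zip(weak_probs, strong_probs):
        confidence = max(weak)
        if confidence >= tau:  # keep confident predictions only
            pseudo = weak.index(confidence)  # hard pseudo-label
            total += -math.log(strong[pseudo])
    return total / len(weak_probs)  # masked-out samples contribute 0


weak = [[0.97, 0.03], [0.60, 0.40]]
strong = [[0.50, 0.50], [0.90, 0.10]]
loss = fixmatch_unlabeled_loss(weak, strong)
```

For dense outputs such as segmentation maps, an equivariant variant has to apply the same geometric augmentation to the pseudo-labels as to the strongly augmented input, so that each pixel is compared with its own pseudo-label.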


[162] SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing cs.CV | cs.AI | cs.LGPDF

Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm

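TL;DR: This paper presents SynVA, a modular toolkit for vessel mesh generation and aneurysm synthesis. It combines flow-matching-based generation of healthy vessels with learning-based, anatomy-conditioned aneurysm generation, and adds a procedural model based solely on physiological principles and statistical priors for producing large-scale datasets.

Details

Motivation: Intracranial aneurysms (IAs) grow unpredictably, carry a risk of rupture, and are a major cause of stroke; with aging populations, the burden of cerebrovascular disease is expected to grow. Digital twins and deep learning are limited by the scarcity of large-scale, high-quality medical data and labels.

Result: Extensive quantitative and qualitative evaluations show that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Some methods produce aneurysm shapes that align better with expert human perception, while others score better on quantitative similarity to reconstructions of real aneurysms.

Insight: The novelty lies in using flow matching for healthy vessel generation and in anatomy-conditioned aneurysm synthesis, where aneurysms are computed from existing vessel geometry rather than generated in isolation. The procedural model, driven only by physiological and statistical priors, supports large-scale dataset generation, and the authors release 50,000 fully labeled mesh samples for downstream vision tasks.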

Abstract: Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.


[163] CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook cs.CV | cs.AI | cs.CLPDF

Zeyu Chen, Jie Li, Kai Han

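TL;DR: CodeBind is a framework for multimodal representation alignment that optimizes the representation space with a modality-shared and modality-specific codebook design. It does not need fully paired data and achieves state-of-the-art classification and retrieval performance across nine modalities.

Details

Motivation: Traditional multimodal alignment methods suffer from cross-modal information discrepancies and data scarcity, which lead to suboptimal alignment spaces that overlook modality-unique features.

Result: CodeBind reaches state-of-the-art results on classification and retrieval tasks across nine modalities: text, image, video, audio, depth, thermal, tactile, 3D point cloud, and EEG.

Insight: The decoupled shared-specific codebook design decomposes features into shared components that carry consistent semantics and specific components that retain modality-unique detail. Compositional vector quantization mitigates representation bias and keeps dominant modalities from overshadowing the others.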

Abstract: Multimodal representation alignment is pivotal for large language models and robotics. Traditional methods are often hindered by cross-modal information discrepancies and data scarcity, leading to suboptimal alignment spaces that overlook modality-unique features. We propose CodeBind, a framework that optimizes multimodal representation spaces through a modality-shared-specific codebook design. By incrementally aligning target and bridging modalities, CodeBind bypasses the need for fully paired data. Unlike traditional hard alignment, CodeBind decomposes features into shared components for semantic consistency and specific components for modality-unique details. This design utilizes a compositional vector quantization scheme, where a shared codebook bridges modality gaps and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.
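The shared-plus-specific codebook idea can be illustrated with a toy quantizer. The abstract does not say how the two codebooks are composed, so the residual scheme below (shared code first, modality-specific code for what remains) is an assumption, as are the function names and the toy codebooks.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def compositional_quantize(feature, shared_cb, specific_cb):
    """Quantise a feature as a shared code plus a modality-specific code."""
    s = nearest(shared_cb, feature)
    residual = [f - c for f, c in zip(feature, shared_cb[s])]
    m = nearest(specific_cb, residual)  # encodes what the shared code missed
    recon = [a + b for a, b in zip(shared_cb[s], specific_cb[m])]
    return s, m, recon


# Hypothetical 2-D codebooks: one shared, one for a single modality.
shared_cb = [[0.0, 0.0], [1.0, 1.0]]
audio_cb = [[0.0, 0.0], [0.2, -0.1]]
s, m, recon = compositional_quantize([1.2, 0.9], shared_cb, audio_cb)
```

Under this assumed scheme, two modalities that pick the same shared index agree on semantics even when their specific codes differ.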


[164] TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts cs.CVPDF

Sophokles Ktistakis, Rui Wang, Bastian Grande, Hugo Sax

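TL;DR: This paper presents TouchMap-OR, a multi-view RGB-D vision system for operating rooms that reconstructs hand-contact interactions between clinicians and surfaces. By fusing multi-view skeleton tracking, hand mesh reconstruction, and semantic modeling of the environment, the system infers which clinician touched which surface and when.

Details

Motivation: Hand contacts between clinicians, patients, and medical equipment are central to pathogen transmission during medical procedures, yet they are hard to observe and record in detail. The paper aims to move beyond infection-prevention practice that relies on manual observation.

Result: Evaluated on recordings of three real anesthesia inductions, the system achieves 0.75 F1 on contact event detection, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy, and reaches 0.96 identity attribution accuracy.

Insight: The novelty lies in combining multi-view full-body skeleton tracking, detailed MANO hand mesh reconstruction, and a semantic 3D model of the environment, which together allow identity-resolved, spatially and temporally precise reconstruction of hand-surface contact histories in complex clinical scenes.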

Abstract: Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.
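The step from temporal hand-surface proximity to contact episodes can be sketched as thresholding followed by grouping consecutive frames. The distance threshold, the minimum duration, and the function name are assumptions for illustration; the abstract does not give the system's actual parameters.

```python
def contact_episodes(distances, threshold=0.02, min_frames=3):
    """Group consecutive frames with hand-surface distance <= threshold.

    distances: per-frame distance in metres between one hand and one surface.
    Returns (start_frame, end_frame) pairs with an inclusive end.
    """
    episodes, start = [], None
    for t, d in enumerate(distances):
        if d <= threshold:
            if start is None:
                start = t  # a new candidate episode begins
        elif start is not None:
            if t - start >= min_frames:  # discard very short touches
                episodes.append((start, t - 1))
            start = None
    if start is not None and len(distances) - start >= min_frames:
        episodes.append((start, len(distances) - 1))
    return episodes


# Hypothetical distance track: one real contact, one blip, one contact at the end.
track = [0.5, 0.01, 0.01, 0.01, 0.3, 0.01, 0.4, 0.015, 0.01, 0.02]
episodes = contact_episodes(track)
```

Running this per clinician, per hand, and per surface yields records of who touched what and when, which is the form of output the abstract describes.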


[165] What is Holding Back Latent Visual Reasoning? cs.CV | cs.AI | cs.CL | cs.LGPDF

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

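TL;DR: This paper studies the limits of latent visual reasoning in vision-language models. It finds that latent tokens in existing models play only a minimal causal role: prediction accuracy does not depend on whether the latent tokens are informative. By analyzing the training signal and the quality of latent tokens generated at inference time, it identifies two key obstacles: latent tokens in existing datasets carry little additional information, and latent tokens generated at inference deviate from their ideal representations.

Details

Motivation: Humans solve complex visual problems by mentally simulating intermediate visual steps. Inspired by this, the paper examines how current vision-language models use latent tokens for chain-of-thought reasoning and what limits their effectiveness.

Result: On existing datasets, model accuracy is unaffected by how informative the latent tokens are; after fine-tuning on a diagnostic dataset, models do rely on latent tokens causally. Latent tokens generated at inference time collapse to a narrow region and deviate from their ideal representations.

Insight: The paper systematically diagnoses why latent visual reasoning fails and argues that progress depends on two pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.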

Abstract: Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative “dummy” tokens. This indicates that latent tokens play a minimal causal role in the model’s final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning: First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypassing them at inference time. When fine-tuned on a diagnostic dataset, in which latent tokens provide sufficient support for the final prediction, we show that models can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.


[166] Brain-inspired spike-timing plasticity for reliable label-efficient event-camera vision cs.CVPDF

Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad

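TL;DR: This paper proposes brain-inspired spike-timing-dependent plasticity (STDP) modules for building label-efficient event-camera detectors that need no GPU. On the FRED drone benchmark the approach spans supervision tiers from zero labels to a small amount of label information, and on the EVUAV benchmark it reduces false alarms.

Details

Motivation: Deploying event-camera object detectors is constrained by per-frame labeling requirements and GPU compute cost. The paper aims for a reliable, label-efficient detection framework that runs on a single CPU thread.

Result: On the FRED drone benchmark, the zero-label detector reaches 53.8% mAP@30, about 26 train-derived bits of supervision reach 76.9% mAP@30, and the STDP candidate-reliability gate reaches 78.60 +/- 0.42% mAP@30. On the EVUAV benchmark, a tube-level STDP layer reduces false alarms from 454 to 331e-4 at Pd >= 88%.

Insight: The novelty lies in local STDP modules (sequence, candidate, and tube-reliability modules) that operate without gradient training or dense matrix multiplication. They reduce model variance, improve reliability, and support label-efficient learning and robustness to drift.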

Abstract: Deploying event-camera object detectors is constrained by per-frame labeling requirements and GPU compute demands. This work introduces three local spike-timing-dependent plasticity (STDP) modules, including sequence, candidate, and tube-reliability modules, that operate on a single CPU thread without GPU support. On the FRED drone benchmark, the proposed framework spans three label-efficient supervision tiers. A strict zero-label detector achieves 53.8% mAP@30, approximately 26 train-derived bits achieve 76.9% mAP@30, and an STDP candidate-reliability gate achieves 78.60 +/- 0.42% mAP@30. Under acquisition-order drift, the cohort gate outperforms streaming k-means by 2.03 +/- 0.58 percentage points across 20 of 20 positive trials, while a no-drift control falsifies the effect. STDP reduces single-model variance by 6.6 times, and one trained gate matches a 44-seed ensemble bound. The gate transfers to Intel Lava with 89% top-2 agreement. On the EVUAV benchmark, a tube-level STDP layer reduces false alarms from 454 to 331e-4 at Pd >= 88%. By construction, dense gradient-trained detectors cannot provide this combination of gradient-free training, operation without dense matrix multiplication, and local plasticity.
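For readers unfamiliar with STDP, the textbook pair-based rule is sketched below. It is the generic form of spike-timing-dependent plasticity, not the specific modules of this paper; the amplitudes and time constants are hypothetical values.

```python
import math


def stdp_delta(dt, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair; dt = t_post - t_pre in ms."""
    if dt > 0:  # pre fired before post: potentiate
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:  # post fired before pre: depress
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


def apply_stdp(weight, spike_pairs, w_min=0.0, w_max=1.0):
    """Accumulate updates over (t_pre, t_post) pairs and clip the weight."""
    for t_pre, t_post in spike_pairs:
        weight += stdp_delta(t_post - t_pre)
    return min(max(weight, w_min), w_max)


# A synapse whose input reliably precedes the output spike gets stronger.
w_after = apply_stdp(0.5, [(0.0, 5.0), (30.0, 34.0), (60.0, 62.0)])
```

The update uses only the spike times at one synapse, so it needs neither backpropagated gradients nor dense matrix products, which is what makes such rules attractive on a single CPU thread.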


[167] GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations cs.CVPDF

Zesheng Li, Chengchang Pan, Honggang Qi

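TL;DR: This paper proposes GraSP-VL, which learns a shared near-orthogonal prefix transform that reorganizes frozen vision-language model embeddings into a truncatable semantic prefix interface. Short prefixes correspond to coarse semantics and longer prefixes progressively expose finer language-grounded distinctions, turning embedding length into a controllable semantic access interface.

Details

Motivation: Frozen vision-language embeddings contain signals at multiple semantic resolutions but expose them through a fixed-length vector interface with no controllable access mechanism. The paper asks whether embedding length can serve as that control.

Result: On a COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76 while keeping full-space drift below 10^{-6}. On SugarCrepe-clean it reaches 86.03 object accuracy and 11.96 mean external emergence, and it preserves full-dimensional zero-shot CIFAR-100 accuracy.

Insight: The novelty lies in the Semantic Matryoshka interface: a shared orthogonal transform gives control over semantic granularity through embedding length instead of merely compressing the embedding, providing an interpretable and extensible way to access frozen VLM embeddings.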

Abstract: Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below $10^{-6}$. It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.
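The claim that a shared orthogonal transform changes prefix behavior without altering full-dimensional geometry can be checked on a toy example. The sketch uses a fixed 2-D rotation as the simplest orthogonal transform; the angle and the vectors are hypothetical, whereas GraSP-VL learns its transform over real VLM embeddings.

```python
import math


def rotate(vec, theta):
    """Apply a 2-D rotation, the simplest orthogonal transform."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]


def dot(a, b, k=None):
    """Inner product over the first k coordinates (all of them if k is None)."""
    k = len(a) if k is None else k
    return sum(x * y for x, y in zip(a[:k], b[:k]))


image, text = [1.0, 2.0], [2.0, 0.5]  # toy image and text embeddings
theta = 0.7  # hypothetical angle standing in for a learned transform
img_r, txt_r = rotate(image, theta), rotate(text, theta)

full_before, full_after = dot(image, text), dot(img_r, txt_r)
prefix_before, prefix_after = dot(image, text, 1), dot(img_r, txt_r, 1)
```

The full inner product is unchanged by the shared rotation, while the length-1 prefix similarity differs, so training can shape what each prefix length expresses without moving the full-space similarities.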


[168] UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation cs.CV | cs.HCPDF

Tianhao Han, Haoyang Zhang, Liang Xie, Haochen Chang, Kun Gao

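TL;DR: The paper proposes UST-Hand, a self-supervised learning framework for 3D hand pose estimation. It estimates the uncertainty distribution of hand pose and builds a probabilistic point cloud feature space to model complex spatiotemporal relationships, addressing the sensitivity of existing methods to noisy pseudo-labels and their neglect of fine-grained spatial correlations.

Details

Motivation: Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive, while existing self-supervised methods are easily affected by noisy pseudo-labels and fail to fully exploit fine-grained spatial correlations, which undermines training stability.

Result: Extensive experiments on three challenging datasets show that UST-Hand achieves state-of-the-art performance, improving on existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).

Insight: The novelty lies in introducing uncertainty estimation and a conditional normalizing flow model to capture the pose distribution and sample diverse hypotheses, which improves robustness under noisy supervision. The multiple hypotheses are also mapped into a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, so that hand motion patterns and fine-grained spatial correlations are explored comprehensively.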

Abstract: Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multi-view consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates the uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, enabling the modeling of complex spatiotemporal relationships. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and sample diverse hypotheses, facilitating robust and more stable learning under noisy pseudo-label supervision. These multiple hypotheses are mapped to a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).


[169] MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation cs.CVPDF

Ronyu Zhang, Aosong Cheng, Gaole Dai, Yulin Luo, Jiaming Liu

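TL;DR: This paper proposes MoASE++ for continual test-time adaptation, which handles non-stationary, unlabeled target streams by disentangling domain-agnostic structure from domain-specific texture. The method combines Activation Sparsity Experts with Spatial Differentiable Dropout to form complementary high- and low-activation pathways, and adds Domain-Adaptive On-Policy Distillation to curb confirmation bias and improve the robustness-plasticity balance.

Details

Motivation: Texture-biased backbones cause error accumulation and catastrophic forgetting in continual test-time adaptation. Inspired by how the human visual system separates shape from texture, the paper designs an adaptive model that disentangles structural information from texture information.

Result: Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) show consistent state-of-the-art performance.

Insight: The contributions include a mixture-of-experts mechanism based on activation sparsity that disentangles structure from texture, a domain-aware router with input-adaptive threshold selection, and an augmentation policy conditioned on entropy and confidence combined with EMA-anchored on-policy distillation, which together balance robustness and plasticity.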

Abstract: Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.


[170] Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction cs.CVPDF

Yu Li, Puchao Zhou, Yachun Mi, Yanfeng Wu, Xiaoming Wang

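TL;DR: This paper proposes the Global-Local Interaction Adapter (GLIA), a new framework for Blind Image Quality Assessment (BIQA). Through a dual-stream feature extraction mechanism and interactive global-local fusion, it makes effective use of pre-trained Vision Transformers and maintains high prediction accuracy while substantially reducing the number of trainable parameters.

Details

Motivation: Predicting the quality of authentically distorted images in BIQA remains difficult. Existing methods scale poorly because subjective annotation is expensive and datasets are small, and large pre-trained vision models are hard to apply to IQA because of their heavy compute requirements and inefficient fine-tuning.

Result: Extensive experiments on multiple benchmarks validate the effectiveness and superiority of the method, which achieves strong prediction accuracy and robustness.

Insight: The novelty is the dual-stream mechanism of GLIA, which jointly retains global semantic information and fine-grained local detail, uses pre-trained models efficiently, and improves performance while lowering the parameter count.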

Abstract: In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.


[171] ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop cs.CV | cs.AI | cs.CL | cs.LG | cs.ROPDF

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas

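TL;DR: This paper introduces ESI-BENCH, a benchmark for evaluating embodied spatial intelligence that emphasizes active exploration through a perception-action loop to solve tasks that passive observation alone cannot. The benchmark is built on OmniGibson and covers 10 task categories and 29 subcategories. Experiments show that active exploration substantially outperforms passive methods and reveal problems in current models such as action blindness and a metacognitive gap.

Details

Motivation: Prior work on spatial intelligence usually assumes oracle observations and overlooks that an agent should be an active actor, using the perception-action loop to gather the information needed to resolve occlusion, dynamics, containment, and functionality, none of which can be resolved from passive sensing alone.

Result: Extensive experiments with state-of-the-art MLLMs on ESI-BENCH show that active exploration substantially outperforms its passive counterparts, while random multi-view observation often adds noise rather than signal despite consuming more images. Human studies further reveal a metacognitive gap between models and humans in seeking falsifying viewpoints and revising beliefs under contradiction.

Insight: The novelty lies in recasting spatial intelligence as active exploration through a perception-action loop and in building a comprehensive benchmark around it. The study identifies action blindness (poor action choices lead to poor observations, which drive cascading errors) as the main failure mode of current models, and finds that imperfect 3D representations can be more harmful than 2D baselines, which offers useful guidance for future embodied AI design.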

Abstract: Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke’s core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.


[172] Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation cs.CV | cs.AI | cs.CL | cs.LGPDF

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun

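TL;DR: This paper proposes Vision-OPD, a framework that addresses the difficulty multimodal large language models have with fine-grained visual understanding. Through regional-to-global on-policy distillation, it transfers the model's own superior perception on cropped image regions to the task of processing the full image, improving the model's attention to decisive details.

Details

Motivation: The work is motivated by an observed regional-to-global perception gap in MLLMs: a model answers fine-grained questions far more accurately when given evidence-centered crops than when given the full image, which suggests that failures stem mainly from difficulty focusing on the relevant evidence rather than from insufficient local recognition ability.

Result: Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance, surpassing much larger open-source and closed-source models as well as "Thinking-with-Images" agentic models.

Insight: The novelty is an on-policy self-distillation framework that requires no external teacher model, ground-truth labels, reward verifier, or inference-time tool use, allowing the model to internalize the benefit of visual zooming and close the gap between regional and global perception.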

Abstract: Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than from insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model’s own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and “Thinking-with-Images” agentic models.
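The token-level objective described in the abstract can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: the abstract does not state which divergence or direction is used, so the sketch picks KL(student || teacher) averaged over rollout positions, and the logits are made-up toy values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def rollout_divergence(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) along one student-generated rollout.

    student_logits[t] comes from the full-image-conditioned policy and
    teacher_logits[t] from the crop-conditioned policy, both evaluated at the
    same rollout position t (direction of the KL is an assumption).
    """
    per_token = [kl(softmax(s), softmax(t))
                 for s, t in zip(student_logits, teacher_logits)]
    return sum(per_token) / len(per_token)

# Toy 3-token vocabulary, 2 rollout positions (hypothetical numbers).
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher = [[0.2, 2.5, 0.1], [0.3, 0.3, 0.3]]

loss = rollout_divergence(student, teacher)
```

Because both policies come from the same model and differ only in their visual conditioning, a loss of this shape is zero exactly when the full-image policy already predicts what the crop-conditioned policy predicts.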


[173] LatentUMM: Dual Latent Alignment for Unified Multimodal Models cs.CVPDF

Yinyi Luo, Wenwen Wang, Hayes Bai, Marios Savvides, Jindong Wang

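TL;DR: This paper proposes LatentUMM, a framework that addresses the inconsistency between the understanding and generation capabilities of unified multimodal models. Through two stages, dual latent alignment and latent dynamics stabilization, it explicitly aligns the mappings involved in modality transitions within a shared latent space and thereby improves cross-modal consistency.

Details

Motivation: Existing unified multimodal models perform well on both understanding and generation, yet the two capabilities are functionally inconsistent. The cause is not a lack of shared representations but the absence of explicit alignment between the transformations that map into and out of the latent space, which produces semantic drift under modality transitions.

Result: Experiments show that LatentUMM consistently improves cross-modal consistency across diverse model architectures. The abstract gives no detailed quantitative results or benchmark names.

Insight: The core novelty is dual latent alignment (cross-modal alignment and dual capacity alignment) together with latent dynamics stabilization. By explicitly constraining the mapping transformations and making trajectories more robust, the method targets the mechanism behind the functional inconsistency of UMMs rather than only optimizing the shared representation.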

Abstract: Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.


[174] Towards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework cs.CVPDF

Ziyang Liu, Hongyuan Wang, Zijian Wang, Yinxi Lu, Yunzhao Zang

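TL;DR: This paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) to improve the generalization of physical adversarial attacks. The framework selects the optimal surrogate model ensemble through quantitative similarity analysis, uses a dual-level mechanism that jointly suppresses prediction outputs and flattens intermediate feature distributions, and introduces an Orthogonal Gradient Alignment strategy to resolve cross-model gradient conflicts. Experiments show that JMOF outperforms existing baselines in both simulated and real-world settings and generates attacks that deceive several vision tasks, including object detection, semantic segmentation, and monocular depth estimation.

Details

Motivation: Existing physical adversarial attacks usually overfit a single surrogate model and optimization objective, and ensemble attacks suffer from severe gradient conflicts within the restricted physical texture space, which significantly degrades cross-model transferability. The paper aims to remove this generalization bottleneck.

Result: In extensive simulated and real-world experiments, JMOF surpasses state-of-the-art baselines against diverse black-box detectors. JMOF also shows substantial cross-vision-task generalization, generating attacks that simultaneously deceive object detection and semantic segmentation or monocular depth estimation models.

Insight: The contributions are: 1) quantitative similarity analysis to select the optimal surrogate model ensemble; 2) a dual-level optimization mechanism that jointly suppresses prediction outputs and flattens intermediate feature distributions; 3) an Orthogonal Gradient Alignment strategy that resolves cross-model gradient conflicts. Together they provide a robust framework for assessing the vulnerability of real-world visual AI systems.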

Abstract: Physical adversarial attacks often overfit single surrogate models and optimization objectives. While ensemble attacks can mitigate this, existing methods struggle with severe gradient conflicts within restricted physical texture spaces, significantly degrading cross-model transferability. To bridge this gap, this paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that leverages quantitative similarity analysis to select the optimal surrogate model ensemble. Within JMOF, a dual-level mechanism jointly suppresses prediction outputs and flattens intermediate feature distributions, balancing attack efficiency with deep generalization. Additionally, an Orthogonal Gradient Alignment (OGA) strategy resolves cross-model gradient conflicts, transforming mutually repulsive gradients into synergistic optimization directions. Extensive simulated and real-world experiments demonstrate that JMOF outperforms state-of-the-art baselines against diverse black-box detectors. Crucially, JMOF exhibits substantial cross-vision-task generalization, generating attacks capable of simultaneously deceiving object detection and semantic segmentation or monocular depth estimation models. This research advances the generalization limits of physical adversarial attacks, providing a robust framework for evaluating visual AI vulnerabilities in real-world deployments.
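The abstract does not spell out how Orthogonal Gradient Alignment works. As a hedged illustration of the general idea of turning conflicting gradients into compatible ones, the sketch below uses a projection rule in the style of gradient surgery (PCGrad): when two surrogate-model gradients have a negative inner product, the component of one along the other is removed. This is a swapped-in, well-known technique, not a reconstruction of OGA, and all names and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def resolve_conflict(g_a, g_b):
    """Return g_a with its component along g_b removed when the two conflict.

    Two gradients "conflict" here when their inner product is negative.
    After projection, the result is orthogonal to g_b, so following it no
    longer undoes progress on the objective that produced g_b.
    """
    d = dot(g_a, g_b)
    if d >= 0.0:
        return list(g_a)  # no conflict: leave the gradient untouched
    coef = d / dot(g_b, g_b)
    return [a - coef * b for a, b in zip(g_a, g_b)]

# Toy texture gradients from two surrogate detectors (hypothetical values).
grad_model_1 = [1.0, -1.0]
grad_model_2 = [0.0, 1.0]

aligned_1 = resolve_conflict(grad_model_1, grad_model_2)
aligned_2 = resolve_conflict(grad_model_2, grad_model_1)

# Combined update direction for the shared adversarial texture.
update = [x + y for x, y in zip(aligned_1, aligned_2)]
```

In this toy case the raw gradients have inner product -1, while each projected gradient is orthogonal to the other model's raw gradient, so the summed update does not move against either objective to first order.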


[175] Network Knowledge Prior Guided Learning for Data-Efficient Surface Defect Detection cs.CVPDF

Hang-Cheng Dong, Guodong Liu, Dong Ye, Bingguo Liu

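TL;DR: This paper proposes a novel knowledge-guided loss function that addresses the data dependence and limited interpretability of deep learning models in industrial defect detection. The method learns in two phases: the saliency maps of an initial model serve as prior knowledge, and a multi-task learning framework then constrains the final model's saliency maps to be consistent with those of the initial model, improving both performance and interpretability.

Details

Motivation: Deep learning methods for industrial defect detection are data-hungry and behave as black boxes, which leads to performance bottlenecks and limited trustworthiness. The paper aims to integrate model interpretability into the training process without adding any inference cost.

Result: Extensive experiments on multiple public defect datasets show that the method consistently improves baseline models in accuracy and average precision (AP), and visual analysis shows that the resulting saliency maps are more concentrated and easier for humans to interpret.

Insight: The novelty is a knowledge-guided loss that acts as a strong regularizer: the model's own explanations (saliency maps) are used as prior knowledge to guide training, improving performance and interpretability together and offering industrial vision systems a simple, effective way to optimize both.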

Abstract: Deep learning-based methods have become the de facto standard for industrial defect detection. However, their data-hungry nature and inherent “black-box” characteristics often lead to performance bottlenecks and limited trustworthiness in real-world applications. To address these challenges, this paper proposes a novel knowledge-guided loss function that seamlessly integrates model interpretability into the training process without incurring any additional inference cost. Our method operates in two phases: first, a primary classification network is trained, and its explanations, in the form of saliency maps, are generated as prior knowledge. Second, a multi-task learning framework is established, where the main task performs classification, and an auxiliary task imposes consistency between the saliency maps of the final model and the primary model. This consistency is enforced by a dedicated knowledge-guided loss term, effectively acting as a powerful regularizer to steer the model towards robust feature representations. Extensive experiments on multiple public defect datasets demonstrate that our approach consistently enhances the performance of baseline models in terms of accuracy and AP. Moreover, visual analysis reveals that the proposed method yields more concentrated and human-intelligible saliency maps. This work presents a simple yet effective paradigm for bridging the gap between model performance and interpretability, paving the way for more reliable and high-performing vision systems in industrial quality inspection.
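A minimal sketch of the second-phase objective: a classification loss plus a term that penalizes disagreement between the final model's saliency map and the primary model's saliency map. The abstract does not give the form of the consistency term or its weight, so the mean squared difference and the `weight` parameter below are assumptions, and the numbers are toy values.

```python
import math

def cross_entropy(probs, label):
    """Main task: negative log-likelihood of the true class."""
    return -math.log(probs[label])

def saliency_consistency(final_map, prior_map):
    """Auxiliary task: mean squared difference between two flattened saliency maps
    (the choice of distance is an assumption, not taken from the paper)."""
    n = len(final_map)
    return sum((f - p) ** 2 for f, p in zip(final_map, prior_map)) / n

def knowledge_guided_loss(probs, label, final_map, prior_map, weight=0.5):
    """Classification loss plus a weighted saliency-consistency regularizer."""
    return cross_entropy(probs, label) + weight * saliency_consistency(final_map, prior_map)

# Toy example: a 2-class prediction and 4-pixel saliency maps (hypothetical values).
probs = [0.2, 0.8]
prior_map = [0.0, 1.0, 1.0, 0.0]          # saliency from the phase-one primary network

aligned = knowledge_guided_loss(probs, 1, [0.0, 1.0, 1.0, 0.0], prior_map)
diffuse = knowledge_guided_loss(probs, 1, [0.5, 0.5, 0.5, 0.5], prior_map)
```

Both calls share the same class prediction, so the only difference between `aligned` and `diffuse` is the regularizer. Since the saliency maps are needed only to compute the training loss, a term of this kind adds nothing at inference time, consistent with the abstract's claim.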


[176] Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation cs.CV | cs.AIPDF

Baoteng Li, Xianghao Zang, Xinran Wang, Xiangyu Na, Zhixiang He

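TL;DR: This paper proposes Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework that improves the training efficiency of text-to-image (T2I) generation models through an adaptive sampling strategy. The method uses the variance of group rewards as an online proxy for prompt inconsistency, prioritizes prompts the model has partially but not yet stably mastered, and adds a category calibration method to balance training difficulty across categories in multi-category datasets.

Details

Motivation: Current reinforcement learning methods for T2I based on Group Relative Policy Optimization (GRPO) typically sample prompts uniformly, which ignores the match between sample difficulty and the model's current learning capability and leads to low training efficiency.

Result: Experiments on GenEval, T2I-CompBench++, and DPG Bench show that the framework effectively improves generation performance.

Insight: The novelty lies in bringing curriculum learning into reinforcement learning for T2I: group reward variance dynamically estimates how much a prompt can teach the model, and sampling probabilities are adjusted accordingly. To handle imbalance in multi-category data, the paper also designs a category calibration method based on proportional fairness optimization that balances training difficulty across categories.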

Abstract: Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model’s current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model’s evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.
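A rough sketch of variance-driven prompt sampling as the abstract describes it: prompts whose image groups receive mixed rewards get a higher sampling probability. The exact mapping from variance to probability is not given in the abstract, so the linear normalization, the `floor` term that keeps every prompt sampleable, and the toy rewards are assumptions. The category calibration step is not covered here.

```python
from statistics import pvariance

def sampling_probabilities(group_rewards, floor=1e-3):
    """Map per-prompt reward groups to sampling probabilities.

    group_rewards[i] holds the reward-model scores of the images generated
    for prompt i. Higher within-group variance yields higher probability.
    `floor` is a hypothetical smoothing constant so no prompt gets zero mass.
    """
    scores = [pvariance(rewards) + floor for rewards in group_rewards]
    total = sum(scores)
    return [s / total for s in scores]

# Toy reward groups for three prompts (hypothetical values).
rewards = [
    [1.0, 1.0, 1.0, 1.0],  # stably mastered: every image scores high
    [0.0, 1.0, 0.0, 1.0],  # partially mastered: mixed outcomes
    [0.0, 0.0, 0.0, 0.0],  # not yet learnable: every image scores low
]

probs = sampling_probabilities(rewards)
```

Note that the first and third prompts receive the same low probability even though one is solved and the other is out of reach: in both cases the group carries no relative learning signal, which is the intuition the abstract appeals to.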


[177] Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding cs.CV | cs.AIPDF

Shravan Murlidaran, Ziqi Wen, Sana Shehabi, Miguel P. Eckstein

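TL;DR: This paper presents a computational agent with simulated human foveated vision. When optimized for scene comprehension, the agent's emergent fixation patterns closely match the signature fixation patterns of humans during free viewing. The study finds that these patterns are a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision, rather than being driven by search or classification tasks.

Details

Motivation: The work asks why humans show signature fixation patterns when freely viewing scenes (first fixating the scene center, then people, text, and objects being gazed at or grasped) and whether these patterns reflect the optimization of some underlying perceptual task.

Result: Experiments show that the foveated agent trained to optimize scene comprehension predicts human fixation patterns most accurately, while agents trained to search or classify scenes, or agents equipped with peripheral vision better or worse than human vision, predict them less accurately.

Insight: The core contribution is to explain human fixation patterns as a functional byproduct of maximizing scene comprehension under the biological constraints of foveated vision. This offers a new computational account of human visual attention and shows how the task objective (scene comprehension versus search or classification) shapes attention patterns.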

Abstract: When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.


[178] CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models cs.CV | cs.AIPDF

Reem Alzahrani, Hassan Alshanqiti, Bushra Bin Hemid, Zaid Alyafeai, Abdelrahman Eldesokey

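TL;DR: The paper introduces CounterCount, a diagnostic framework for assessing whether vision-language models (VLMs) rely on visual evidence or on prior knowledge when counting. The framework contains paired factual and counterfactual images in which count-relevant attributes have been edited. Existing VLMs degrade under counterfactual conditions, indicating over-reliance on object-level priors. The paper also proposes a unified inference-time attention modulation strategy that reweights visual tokens to improve counterfactual counting accuracy.

Details

Motivation: The paper examines whether VLMs genuinely reason from visual evidence when counting, especially when the visual evidence conflicts with prior knowledge and a model may fall back on language and world priors instead of the image content.

Result: Evaluations of several recent VLMs show strong performance on factual images but consistent degradation under counterfactual attribute changes. The proposed attention modulation strategy improves counterfactual counting accuracy by up to 8% across multiple VLMs.

Insight: The novelty lies in a diagnostic framework that quantifies counting bias in VLMs and in showing that failures arise from insufficient attention to count-relevant visual tokens. The proposed attention modulation strategy is a general inference-time intervention that increases a model's reliance on visual evidence.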

Abstract: Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.


[179] Temporal Aware Pruning for Efficient Diffusion-based Video Generation cs.CV | cs.AIPDF

Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai

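TL;DR: This paper proposes TAPE (Temporal Aware Pruning for Efficient diffusion-based video generation), a training-free, temporally aware pruning method for efficient diffusion-based video generation. Through temporal smoothing, token reselection, and timestep-level budget scheduling, it substantially reduces the compute cost of ViT-based video diffusion models while preserving temporal coherence and visual fidelity.

Details

Motivation: ViT-based video diffusion models must compute attention over long spatiotemporal sequences to generate high-quality video, which is expensive. Most existing attention-based pruning methods operate per frame and cannot guarantee the cross-frame temporal coherence that video generation requires, leading to inconsistent backgrounds, flickering, and reduced image quality.

Result: Experiments show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.

Insight: The novelty is a training-free, temporally aware pruning strategy that aligns token importance across frames (temporal smoothing), reselects tokens according to each layer's semantic focus, and adjusts pruning strength across denoising stages (timestep-level budget scheduling), which together address the temporal coherence problem of pruning in video generation.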

Abstract: Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token importance across adjacent frames and suppress selection jitter; (ii) performs token reselection in selected layers to align token pruning with the layers’ diverse semantic focus and avoid error accumulation in specific areas; and (iii) adopts timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.
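Two of the three components named in the abstract, temporal smoothing and timestep-level budget scheduling, can be sketched roughly as below. The abstract gives no formulas, so the three-frame moving average, the linear keep-ratio schedule, its `early` and `late` endpoints, and the toy scores are all assumptions. Token reselection across layers is not shown.

```python
def smooth_importance(scores):
    """Average each frame's token-importance scores with its adjacent frames.

    scores[f][t] is the importance of token t in frame f. A simple moving
    average over (f-1, f, f+1) stands in for the paper's temporal smoothing.
    """
    n = len(scores)
    smoothed = []
    for f in range(n):
        neighbors = [scores[g] for g in (f - 1, f, f + 1) if 0 <= g < n]
        smoothed.append([sum(col) / len(neighbors) for col in zip(*neighbors)])
    return smoothed

def keep_ratio(step, num_steps, early=0.4, late=0.9):
    """Fraction of tokens kept at a denoising step: few early, more late."""
    frac = step / (num_steps - 1)
    return early + (late - early) * frac

def select_tokens(frame_scores, ratio):
    """Indices of the top-scoring tokens for one frame, in ascending order."""
    k = max(1, round(ratio * len(frame_scores)))
    ranked = sorted(range(len(frame_scores)),
                    key=lambda t: frame_scores[t], reverse=True)
    return sorted(ranked[:k])

# Toy scores: 3 frames x 3 tokens; the middle frame briefly favors token 2.
raw = [[1.0, 0.1, 0.8],
       [0.7, 0.1, 0.8],
       [1.0, 0.1, 0.8]]

raw_picks = [select_tokens(frame, 0.34) for frame in raw]
smooth_picks = [select_tokens(frame, 0.34) for frame in smooth_importance(raw)]
```

Without smoothing, the kept token flips from 0 to 2 and back across the three frames, which is the kind of selection jitter the abstract associates with flicker. After smoothing, the same token is kept in every frame.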


[180] SurgLQA: Scalable Long-Horizon Surgical Video Question Answering cs.CVPDF

Diandian Guo, Xikai Yang, Ruiyang Li, Jialun Pei, Pheng-Ann Heng

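TL;DR: This paper proposes SurgLQA, a unified long-horizon framework for scalable surgical video question answering. It builds compact long-range representations with Faithful Temporal Consolidation (FTC) and performs adaptive test-time inference with Temporally-Grounded Multi-Policy Scaling (TMS). Experiments on the restructured long-duration colonoscopy VideoQA benchmark Colon-LQA and on REAL-Colon-VQA show consistent gains on long-range reasoning tasks.

Details

Motivation: Existing surgical VideoQA methods are mostly restricted to images or short clips and struggle to model the temporal dynamics and causal dependencies of long surgical workflows, which limits their usefulness for real-time decision support in clinical settings.

Result: Extensive experiments on the restructured long-duration colonoscopy VideoQA benchmark Colon-LQA and on REAL-Colon-VQA show that the method achieves consistent performance gains on long-range reasoning tasks.

Insight: The novelty lies in the FTC module, which uses intrinsic temporal cues to build compact long-range representations while preserving fine-grained temporal fidelity, and in the TMS paradigm, which adaptively adjusts policy-level reasoning capacity within temporally grounded contexts, enabling scalable long-horizon surgical video understanding.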

Abstract: Surgical Video Question Answering (VideoQA) provides a promising paradigm for dynamic intraoperative interpretation, enabling real-time decision support and context-aware retrieval in clinical environments. Nevertheless, existing approaches are predominantly restricted to images or short clips, limiting their ability to model long-range procedural dynamics and causal dependencies across extended surgical workflows. To address this challenge, we propose SurgLQA, a unified long-horizon VideoQA framework for scalable surgical reasoning. This framework incorporates Faithful Temporal Consolidation (FTC), which leverages intrinsic temporal cues to construct compact long-range representations while preserving fine-grained temporal fidelity. Further, we develop Temporally-Grounded Multi-Policy Scaling (TMS), an adaptive test-time inference paradigm that strategically adjusts policy-level reasoning capacity within temporally grounded contexts. To facilitate systematic evaluation, we restructured a long-duration colonoscopy VideoQA benchmark, Colon-LQA, and conducted extensive experiments on Colon-LQA and REAL-Colon-VQA. Experimental results demonstrate that our approach achieves consistent performance gains in long-range reasoning with temporally grounded inference. Code link: https://github.com/RascalGdd/SurgLQA.


[181] PySIFT: GPU-Resident Deterministic SIFT for Deep Learning Vision Pipelines cs.CVPDF

Sivakumar K. S., Mohammad Daniyalur Rahman, Gopi Raju Matta

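TL;DR: The paper presents PySIFT, a deterministic SIFT feature extractor that resides entirely in GPU memory, designed to remove the inefficiency of traditional SIFT in deep learning vision pipelines. Experiments show that classical SIFT with DSP multi-scale pooling beats neural descriptor and orientation estimation networks in both accuracy and speed on several benchmarks, and that classical features and learned matchers such as LightGlue complement rather than replace each other.

Details

Motivation: A common assumption holds that classical handcrafted features such as SIFT are accuracy-limited relics that should be replaced by learned features. The paper sets out to show through controlled experiments that this assumption is wrong, and argues that it went unchallenged because no fully GPU-resident, modular SIFT implementation existed.

Result: On four benchmarks (HPatches, ROxford5K, IMC Phototourism, MegaDepth), classical SIFT with DSP multi-scale pooling outperforms neural replacements such as HardNet and OriNet on every accuracy metric while running 2 to 18 times faster. On an NVIDIA RTX 3050 (4 GB VRAM), PySIFT achieves higher Mean Matching Accuracy (MMA) than OpenCV SIFT, faster processing (383 ms faster per pair on MegaDepth), and higher cross-dataset geometric accuracy (for example, +5.6 percentage points AUC@10° on MegaDepth). Its output is bitwise deterministic across runs, with detection reproducing identically even across GPU architectures.

Insight: The core contribution is the first fully GPU-resident SIFT implementation (PySIFT), which uses CuPy/Numba CUDA kernels and DLPack zero-copy handoff for efficient, modular feature extraction and makes controlled classical-versus-learned ablations practical. The key insight reframes the relationship between classical and learned methods: not "replace SIFT" but "compose with SIFT", pairing classical extraction with learned matching only where geometric context demands it. Its bitwise deterministic output is a distinctive advantage that learned extractors are hard-pressed to match without sacrificing performance.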

Abstract: A widespread assumption in local feature research holds that classical handcrafted descriptors are accuracy-limited relics best replaced by learned alternatives. We show this is wrong. Through an 8-configuration ablation spanning four benchmarks (HPatches, ROxford5K, IMC Phototourism, MegaDepth), we demonstrate that classical SIFT with DSP multi-scale pooling outperforms neural descriptor and orientation replacements (HardNet, OriNet) on every accuracy metric–while running 2–18$\times$ faster–and that learned matchers (LightGlue) complement rather than supersede classical features. The conclusion reframes a decade of work: not “replace SIFT” but “compose with SIFT,” classical extraction paired with learned matching only where geometric context demands it. This finding was invisible because no prior GPU SIFT kept the complete pipeline in VRAM or offered modularity for controlled classical-vs-learned ablations. We present PySIFT, the first fully GPU-resident SIFT, implemented in CuPy/Numba CUDA kernels with DLPack zero-copy handoff to downstream DL frameworks–submillisecond O(1) metadata swap regardless of keypoint count. On a laptop-grade NVIDIA RTX 3050 (4 GB VRAM), PySIFT achieves: (i) higher Mean Matching Accuracy (MMA) than OpenCV SIFT on HPatches, (ii) 383 ms faster per pair on high-resolution MegaDepth, (iii) higher geometric accuracy on cross-dataset benchmarks (+5.6 pp AUC@10${}^\circ$ on MegaDepth, more inliers on IMC Phototourism), and (iv) bitwise deterministic output–identical keypoints and descriptors across runs, with detection reproducing identically even across GPU architectures: a guarantee that learned extractors cannot match without significant performance sacrifice, and cannot achieve at all across GPU architectures due to cuDNN’s architecture-dependent algorithm selection. PySIFT is open-source, requiring no C++ compilation.


[182] An Efficient Streaming Video Understanding Framework with Agentic Control cs.CVPDF

Jinming Liu, Jianguo Huang, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang

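TL;DR: This paper proposes R3-Streaming, a framework that models streaming video understanding as a cascaded control problem and handles queries of varying complexity efficiently through dynamic decisions (memory compression, response-readiness judgment, and compute routing). It introduces an age-aware forgetting policy for memory compression and a target-balanced reinforcement learning objective (TB-GRPO) for routing, sharply reducing visual token usage while remaining real-time.

Details

Motivation: Existing streaming video understanding methods usually adopt static strategies, such as fixed memory compression or a single model, which over-compute on simple queries, underperform on complex ones, and cannot adapt to changing information density under a strict latency budget.

Result: R3-Streaming achieves state-of-the-art performance on OVO-Bench and StreamingBench, scoring 57.92 and 76.36 respectively, while reducing visual token usage by 95% to 96%.

Insight: The core novelty is treating streaming video understanding as a cascaded control problem in which sequential decisions (remember, respond, reason) enable adaptive processing. The specific technical contributions are the age-aware forgetting policy and the TB-GRPO routing objective, which prevents mode collapse, offering a new approach to dynamic resource allocation.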

Abstract: Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.


[183] SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning cs.CVPDF

Xiao Yang, Ronghao Fu, Zhiwen Lin, Zhuoran Duan, Jiashun Zhu

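TL;DR: SkyNative is a native multimodal framework for remote sensing visual evidence reasoning. It removes the pretrained visual encoder and represents images directly as raw patch tokens in the language model's token space, addressing the problem that existing methods compress local visual evidence too early and leave fine-grained spatial reasoning vulnerable to language priors.

Details

Motivation: Existing remote sensing vision-language models rely on a pretrained visual encoder to convert images into semantic features before language-model reasoning. This pipeline may compress local visual evidence prematurely, making fine-grained spatial reasoning vulnerable to language priors in ultra-high-resolution remote sensing imagery.

Result: Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors.

Insight: The novelty lies in an encoder-free architecture with a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone to reconcile low-level visual patches with text tokens, together with a visual reliance benchmark that diagnoses whether a model grounds its answers in image evidence.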

Abstract: Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.


[184] AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents cs.CVPDF

Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang

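TL;DR: This paper proposes AtlasVA, a teacher-free visual skill memory framework for vision-language model (VLM) agents. It organizes memory into three complementary layers (spatial heatmaps, visual exemplars, and symbolic text skills), evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and uses them as potential-based shaping rewards for reinforcement learning. Experiments show that AtlasVA consistently outperforms text-centric memory baselines on several spatially intensive task benchmarks.

Details

Motivation: Most existing VLM agents store memory as text and depend on proprietary teacher models to summarize it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and interaction is supervised through delayed textual feedback rather than dense visual signals.

Result: On Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks, AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks.

Insight: The novelty is a visually grounded memory framework that unifies perception, memory, and optimization through a three-layer complementary memory structure and self-evolving potential maps, without external LLM supervision, giving VLM agents a more effective mechanism for reusing spatial experience.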

Abstract: Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
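
The abstract mentions potential-based shaping rewards without giving a formula. A minimal sketch of the standard form, r' = r + gamma * phi(s') - phi(s), is shown below. The corridor, the `affinity` and `danger` values, and the choice of potential as affinity minus danger are illustrative assumptions, not the paper's construction.

```python
def shaped_reward(env_reward, phi_s, phi_next, gamma=0.99):
    """Standard potential-based shaping: r' = r + gamma * phi(s') - phi(s)."""
    return env_reward + gamma * phi_next - phi_s


# Hypothetical 1x4 corridor. The potential of a cell is assumed to be
# affinity minus danger; the paper may combine its atlases differently.
affinity = [0.0, 0.2, 0.6, 1.0]
danger = [0.0, 0.5, 0.0, 0.0]
potential = [a - d for a, d in zip(affinity, danger)]


def rollout_shaping(cells, gamma=0.99):
    """Shaping terms along a path of cell indices, with zero environment reward."""
    return [
        shaped_reward(0.0, potential[s], potential[t], gamma)
        for s, t in zip(cells, cells[1:])
    ]


if __name__ == "__main__":
    print(rollout_shaping([0, 1, 2, 3]))
```

With gamma set to 1 the shaping terms telescope, so their sum along any path depends only on the potentials of the first and last cells.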


[185] Generation Navigator: A State-Aware Agentic Framework for Image Generation cs.CVPDF

Jinming Liu, Ruoyu Feng, Yuqi Wang, Wenjun Zeng, Xin Jin

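TL;DR: This paper proposes Generation Navigator, a multi-turn text-to-image agent framework that recasts image generation as a state-conditioned action-making problem. The agent is trained with reinforcement learning to steer the generation trajectory dynamically, and a PRE-GRPO objective optimizes trajectory quality, retention, and efficiency so that user intent is realized automatically.

Details

Motivation: Existing text-to-image systems struggle to realize user intent faithfully. They typically rely on manual multi-turn trial and error or on closed-loop agents driven by hand-crafted rules, and do not learn to adapt to the evolving generation process.

Result: The method shows substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.

Insight: The novelty lies in modeling image generation as a state-conditioned decision problem and in PRE-GRPO, a trajectory-level reinforcement learning objective whose peak, retention, and efficiency rewards explicitly optimize the quality dynamics and efficiency of the generation trajectory, addressing the credit assignment problem of rewarding a trajectory from a single state.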

Abstract: Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.
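
To make the Peak, Retention, and Efficiency terms concrete, here is a minimal sketch of a trajectory-level reward followed by group-relative normalization. The reward formula, the weights `w_peak`, `w_ret`, `w_eff`, and the per-turn quality scores are all hypothetical; the abstract does not give the actual PRE-GRPO definition.

```python
from statistics import mean, pstdev


def pre_reward(qualities, w_peak=1.0, w_ret=0.5, w_eff=0.1):
    """Hypothetical trajectory reward built from per-turn image quality scores."""
    peak = max(qualities)               # Peak: best image found in the rollout
    retention = qualities[-1] - peak    # Retention: 0 if the final image is the best, else negative
    extra_turns = len(qualities) - 1    # Efficiency: every turn after the first costs something
    return w_peak * peak + w_ret * retention - w_eff * extra_turns


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [[0.5, 0.9], [0.5, 0.9, 0.6], [0.4, 0.5, 0.5, 0.5]]
    rewards = [pre_reward(q) for q in group]
    print(rewards, group_relative_advantages(rewards))
```

In this sketch a rollout that reaches a good image and stops scores higher than one that reaches the same peak and then degrades it or spends extra turns.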


[186] A More Word-like Image Tokenization for MLLMs cs.CV | cs.AI | cs.LGPDF

Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho

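TL;DR: This paper proposes Disentangled Visual Tokenization (DiVT), which aims to make the visual tokens in multimodal large language models (MLLMs) behave more like the word units that language models process. DiVT clusters patch embeddings into coherent semantic units so that each token corresponds to a distinct visual concept rather than a fixed grid cell, and it adapts the number of tokens to image complexity, matching or surpassing baselines with fewer visual tokens.

Details

Motivation: Current MLLMs typically keep the language model fixed and train a visual projector that maps pixels to a token sequence in its embedding space. The projector produces a long stream of continuous, highly correlated embeddings, whereas the language model was optimized for discrete, semantically meaningful word tokens, so visual tokens do not behave as the language model expects.

Result: Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, remains robust under limited token budgets, and significantly reduces memory cost and latency.

Insight: The novelty is the shift from fixed-grid visual tokens to semantically clustered units in which each token corresponds to one visual concept, plus an adaptive token budget that provides an accuracy-compute trade-off without modifying the vision encoder or the language model, making visual inputs more compatible with LLMs.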

Abstract: Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.


[187] TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model cs.CV | cs.AIPDF

Zhaoyuan Ding, Yijing Yang, Han Shu, Xinghao Chen

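TL;DR: This paper proposes TinySAM 2, a lightweight video segmentation model that addresses the computational complexity and memory demands that make SAM 2 hard to deploy in practice. It combines a memory quality management mechanism, joint spatial-temporal token compression, and the lightweight RepViT image encoder to substantially reduce compute and memory overhead.

Details

Motivation: SAM 2 is a core foundation model for video segmentation, but the computational complexity of its multi-stage image encoder and memory module raises the barrier to practical deployment, which calls for a lightweight model that balances performance and efficiency.

Result: Extensive experiments on challenging datasets such as DAVIS and SA-V show that TinySAM 2 reaches 90% of the performance of SAM 2.1 with only 7% of the memory tokens and 3% of the training data.

Insight: The contributions are: 1) a memory quality management mechanism that selects and retains highly informative historical frames; 2) joint spatial-temporal token compression, which reduces redundancy through average pooling in the spatial domain and token-similarity-based selection in the temporal domain; and 3) RepViT as a lightweight image encoder to cut the parameter count. Together they offer an efficient route to deploying video segmentation models on resource-constrained devices.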

Abstract: Segment Anything Model 2 (SAM 2) serves as a core foundation model in video segmentation. Building on the original SAM, it introduces a memory bank mechanism and performs strongly on tasks such as semi-supervised video object segmentation and track-anything. However, the computational complexity of SAM 2's multi-stage image encoder and memory module raises the barrier to practical deployment. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism selects and retains highly informative historical frames as memory. In addition, a joint spatial-temporal token compression reduces memory storage and computational cost: average pooling first compresses redundant tokens in the spatial domain, and in the temporal domain informative tokens are selected across frames in the memory bank based on token-level similarity. We also adopt RepViT as a lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1 with only 7% of the memory tokens and 3% of the training data. This work alleviates the bottlenecks in parameter count, computational load, and deployment cost associated with SAM 2, providing a resource-efficient solution for deploying video segmentation models on devices.


[188] See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV | cs.AI | cs.HCPDF

Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

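TL;DR: This paper proposes SWIM (See What I Mean), a training strategy that aligns vision and language representations so that fine-grained object understanding in video works from text prompts alone. Mask supervision guides cross-modal attention during training, so at inference the model attends to the user-specified object automatically, with no explicit visual prompt such as a mask or points.

Details

Motivation: Existing methods usually need explicit visual prompts (such as masks or points) for fine-grained object understanding, which limits their convenience. The paper targets a representation misalignment in pretrained multimodal large language models: attribute words produce localized activations in the visual modality, whereas object nouns yield diffuse patterns due to semantic reference bias and distributed high-level representations, so text prompts alone localize objects poorly.

Result: Experiments show that SWIM substantially improves text-visual alignment and outperforms visual-prompt-based methods on fine-grained object understanding benchmarks.

Insight: The main novelty is SWIM, a training strategy for fine-grained object understanding that needs only text prompts. It builds the NL-Refer dataset, which pairs each object mask with a precise natural language referring expression, and enforces spatial consistency between the multi-layer cross-attention maps of object nouns and the ground-truth masks. This addresses the cross-modal misalignment and enables precise object localization without visual prompts.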

Abstract: We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.


[189] What Matters for Grocery Product Retrieval with Open Source Vision Language Models cs.CVPDF

Emmanuel G. Maminta, Rowel O. Atienza

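TL;DR: This paper presents the first systematic zero-shot evaluation of 190 open-source vision-language models on the multimodal product retrieval task of the GroceryVision Challenge, analyzing the effects of pre-training data, architecture, and input resolution. It finds that data quality matters more than scale, that efficient models can beat larger models trained on noisy data, and that current state-of-the-art models still show a precision gap when retrieving visually similar SKUs.

Details

Motivation: Multimodal product retrieval underpins checkout-free retail and automated inventory systems, but standard vision-language benchmarks do not capture the fine-grained SKU discrimination it requires, so open-source VLMs need a systematic evaluation on this task.

Result: On the GroceryVision Challenge, the best models reach 94.5% Recall@5 but drop by 17.5% at Recall@1; switching from raw web-scraped data to filtered datasets yields accuracy gains of up to 16.6%; and MobileCLIP-B (150M parameters) outperforms 351M-parameter models trained on noisy data.

Insight: The contributions include the core finding that data quality trumps scale, an efficiency metric called semantic power density that penalizes sub-threshold accuracy, and the observation of a precision gap in which contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Together these give practical guidance for model selection.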

Abstract: Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. (1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. (2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce semantic power density (φ), an efficiency metric that penalizes sub-threshold accuracy. (3) A precision gap persists. State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.
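
The precision gap in finding (3) is the difference between Recall@5 and Recall@1. A minimal sketch of Recall@k for single-label retrieval follows; the SKU identifiers and rankings are made-up toy data, and the function names are illustrative rather than taken from the paper's evaluation scripts.

```python
def recall_at_k(ranked_ids, true_id, k):
    """1.0 if the correct SKU appears in the top-k retrieved results, else 0.0."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(rankings, truths, k):
    """Average Recall@k over all queries."""
    hits = [recall_at_k(r, t, k) for r, t in zip(rankings, truths)]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    # Toy retrieval output: one ranked SKU list per query image.
    rankings = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"], ["p", "q", "r"]]
    truths = ["a", "y", "o", "s"]
    print(mean_recall_at_k(rankings, truths, 1), mean_recall_at_k(rankings, truths, 3))
```

A query whose correct SKU sits at rank 2 or 3 counts toward Recall@3 but not Recall@1, which is how a model can cluster the right category yet still mis-rank near-identical products.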


[190] SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals cs.CVPDF

Soyeon Yoon, Chang Wook Seo, Hyunjung Shim

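TL;DR: SGSoft is a unified intrinsic pipeline that constructs a geodesic correspondence field on a canonical template, uses it to supervise multimodal dense descriptors guided by pretrained semantic priors, and retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. It preserves geometric fidelity while achieving strong inter-category generalization and the best accuracy-efficiency trade-off among compared methods.

Details

Motivation: Dense correspondence across deformable 3D shapes is hard to learn because of structural variability, non-isometric deformation, and inconsistent topology, and existing methods trade off generalization, geometric fidelity, and efficiency.

Result: SGSoft achieves state-of-the-art inter-category generalization, offers a better accuracy-efficiency trade-off than prior methods, and runs at near real-time speed without pre-alignment, pairwise optimization, or post-refinement.

Insight: The novelty is using the geodesic correspondence field as a topology-invariant supervision signal, combined with multimodal descriptor learning guided by semantic priors, which gives stable handling of large pose variation and structural differences. The result is a scalable, deployment-ready paradigm whose learned descriptors transfer to downstream tasks such as semantic segmentation and deformation transfer.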

Abstract: Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy-efficiency trade-off among prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence.
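
Step (iii) of the pipeline, retrieving correspondences by nearest-neighbor search in descriptor space, can be sketched as below. This is a brute-force illustration on toy 2D descriptors with squared Euclidean distance; the distance metric and the function name are assumptions, and a real system would likely use a vectorized or approximate search.

```python
def nearest_neighbor_matches(src_desc, tgt_desc):
    """For each source-vertex descriptor, return the index of the closest target descriptor."""

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(tgt_desc)), key=lambda j: sqdist(d, tgt_desc[j]))
        for d in src_desc
    ]


if __name__ == "__main__":
    # Toy per-vertex descriptors for a source and a target shape.
    src = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
    tgt = [[1.0, 0.1], [0.1, 0.9]]
    print(nearest_neighbor_matches(src, tgt))
```

Because matching is a lookup in descriptor space, no pairwise optimization between the two shapes is needed, which is consistent with the single feed-forward pass described in the abstract.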


[191] OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models cs.CVPDF

Morunliu Yang, Ruotao Xu, Le Li, Yue Wang, Jianxin Zhang

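TL;DR: This paper proposes OmniSelect, a training-free, modality-adaptive token pruning framework for efficiently processing long multimodal inputs in omni-modal large language models. A lightweight AudioCLIP model estimates cross-modal relevance, each input is assigned to one of three pruning regimes (audio-centric, video-centric, or uniform), and fine-grained token pruning within each temporal group adaptively allocates pruning ratios to preserve the informative tokens of each modality.

Details

Motivation: Existing omni-modal large language models incur heavy computational overhead on long multimodal token sequences, and existing compression methods typically rely on fixed, modality-specific guidance that cannot adapt to how modality importance varies across queries.

Result: Extensive experiments show that the method achieves efficient multimodal token compression while maintaining strong performance, without any additional training.

Insight: The novelty is explicitly modeling modality preference and selecting the strategy dynamically, which avoids the pitfalls of one-size-fits-all compression. Cross-modal relevance estimation and adaptive pruning-ratio allocation make the compression modality-aware without training.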

Abstract: Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.
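
The regime selection and ratio allocation described above can be illustrated with a short sketch. The `margin`, `budget`, and `tilt` parameters, the decision rule, and the top-score pruning are all hypothetical stand-ins; the abstract does not specify how OmniSelect maps relevance scores to regimes or ratios.

```python
def choose_regime(audio_rel, video_rel, margin=0.1):
    """Pick a pruning regime from query relevance scores (hypothetical rule)."""
    if audio_rel - video_rel > margin:
        return "audio-centric"
    if video_rel - audio_rel > margin:
        return "video-centric"
    return "uniform"


def keep_ratios(regime, budget=0.5, tilt=0.3):
    """Fraction of tokens kept per modality; the favored modality keeps more."""
    if regime == "audio-centric":
        return {"audio": budget + tilt, "video": budget - tilt}
    if regime == "video-centric":
        return {"audio": budget - tilt, "video": budget + tilt}
    return {"audio": budget, "video": budget}


def prune(scores, keep_ratio):
    """Keep the highest-scoring tokens of one temporal group, preserving their order."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return sorted(top)


if __name__ == "__main__":
    regime = choose_regime(audio_rel=0.7, video_rel=0.3)
    ratios = keep_ratios(regime)
    print(regime, prune([0.9, 0.1, 0.5, 0.7], ratios["audio"]))
```

The point of the sketch is that the same token scores lead to different kept sets depending on which modality the query favors.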


[192] Efficient 3D Content Reconstruction and Generation cs.CVPDF

Jiahao Li

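TL;DR: This thesis presents efficient methods for 3D content reconstruction and generation: Instant3D for fast text- or image-to-3D generation and FastMap for fast 3D reconstruction. Instant3D combines multi-view diffusion with feed-forward sparse-view reconstruction to produce high-quality 3D assets in 5-20 seconds. FastMap uses first-order optimization and fused GPU kernels to run up to 10x faster than the prior state of the art while maintaining comparable pose accuracy and novel view synthesis quality.

Details

Motivation: The goal is to replace labor-intensive modeling and scanning pipelines by synthesizing or recovering 3D assets directly from text or images, with applications in video games, virtual reality, robotics, and simulation that speed up asset prototyping, interactive world generation, and 3D data collection.

Result: On the generation side, Instant3D produces high-quality 3D assets in 5-20 seconds. On the reconstruction side, FastMap is up to 10x faster than the prior state of the art while maintaining comparable pose accuracy and downstream novel view synthesis quality.

Insight: The contributions are the combination of multi-view diffusion with sparse-view reconstruction for fast 3D generation, and the use of first-order optimization with fused GPU kernels to accelerate the structure-from-motion pipeline, both with substantial gains in efficiency.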

Abstract: Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to 10x speedup over prior state-of-the-art by using first-order optimization with fused GPU kernels extensively, while maintaining comparable pose accuracy and downstream novel view synthesis quality.


[193] The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting cs.CV | cs.LGPDF

Corentin Dumery, Niki Amini-Naieni, Shervin Naini, Pascal Fua

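TL;DR: This paper introduces MixCount, a dataset and benchmark that targets the performance bottleneck of open-vocabulary object counting in mixed-object scenes. An automatic pipeline generates synthetic images, fine-grained textual descriptions, and pixel-perfect counting annotations, which avoids both the high annotation cost of real data and the limited diversity of existing synthetic data.

Details

Motivation: Existing counting models fail systematically in real mixed-object settings such as industrial inspection and product sorting, largely because of limitations in current training and evaluation data: real datasets are expensive to annotate and noisy, while synthetic data lacks diversity and realism.

Result: Evaluating state-of-the-art counting models on MixCount reveals severe degradation in the mixed-object setting, while training them on the synthesized data reduces MAE by 20.14% on the real benchmark FSC-147 and by 18.3% on PairTally.

Insight: The novelty is an automatic generation pipeline, aimed at the failure modes of counting models in mixed-object scenes, that synthesizes data free of labeling ambiguity at scale. The results indicate that synthetic data can ease the long-standing data bottleneck of counting models and improve real-world performance.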

Abstract: Object counting is a foundational vision task with over a decade of dedicated research, yet state-of-the-art models still fail systematically in the mixed-object setting that dominates real-world applications such as industrial inspection and product sorting. We show that this gap is strongly driven by limitations in existing training and evaluation data: real counting datasets are prohibitively expensive to annotate and suffer from labeling noise, while existing synthetic alternatives lack diversity and realism. We address this with MixCount, a dataset and benchmark for mixed-object counting designed to target the failure modes of current counting models. To overcome the high cost of constructing and labeling such data, we develop an automatic generation pipeline that synthesizes images, fine-grained textual descriptions, and pixel-perfect counting annotations at scale, eliminating the labeling ambiguity that plagues prior datasets. Evaluating state-of-the-art counting models on MixCount exposes severe degradation in the mixed-object setting. More importantly, training these models on our synthesized data yields substantial gains on real-world benchmarks, reducing MAE by 20.14% on FSC-147 and by 18.3% on PairTally. These results establish MixCount as both a benchmark and a training dataset for fine-grained counting, and demonstrate that our pipeline, which produces effectively unlimited labeled data, helps address a long-standing bottleneck in counting models.


[194] SENSE: Satellite-based ENergy Synthesis for Sustainable Environment cs.CV | cs.AIPDF

Kailai Sun, Mingyi He, Heye Huang, Can Rong, Alok Prakash

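TL;DR: This paper proposes SENSE, a generative framework for urban building energy modeling based on satellite imagery. Built on a controllable diffusion model, it jointly generates realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. Experiments in four cities show high visual fidelity and physical consistency, and the synthetic data improves downstream prediction.

Details

Motivation: Existing urban building energy modeling work based on satellite imagery and deep learning is mostly predictive and does not reflect the generative nature of urban planning; generative AI for satellite imagery cannot yet generate urban functional layers such as energy; and high-quality, high-resolution building energy data aligned with satellite imagery is scarce.

Result: In experiments across New York City, Boston, Lyon, and Busan, the synthetic data generated by SENSE shows high visual fidelity, and its physical consistency satisfies the ASHRAE standard metric. Using less than 20% of the labeled energy data, SENSE generates enough annotated synthetic data to raise downstream prediction performance by 10% IoU. Compared with state-of-the-art urban energy prediction methods, it reduces prediction error (NMBE by 3%-11% and CVRMSE by 1%-9%).

Insight: The novelty is a unified generative UBEM framework that generates satellite imagery together with building energy and height information in the latent space. It leverages knowledge learned by large vision models and is conditioned on road networks and urban density metrics, enabling controllable generation of urban functional attributes aligned with satellite imagery and offering a data augmentation route around the scarcity of high-quality labeled data.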

Abstract: Urban Building Energy Modeling (UBEM) plays a critical role in achieving the United Nations' Sustainable Development Goals 7 and 11. Although existing studies based on satellite imagery and deep learning have made remarkable progress, three challenges remain. First, most existing studies are inherently predictive and fail to reflect the generative nature of urban planning. Second, although generative AI and diffusion models have seen explosive growth in satellite imagery, they lack urban functional generation (e.g., an energy layer). Third, high-quality, high-resolution building energy data aligned with satellite imagery is scarce. Here we propose SENSE (Satellite-based ENergy Synthesis for Sustainable Environment), a unified generative UBEM framework that jointly synthesizes realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. Built on a controllable diffusion model and conditioned on road networks and urban density metrics, SENSE leverages the knowledge learned by large vision models to generate urban building energy consumption and height information (annotations) in the latent space. Experiments across four cities (New York City, Boston, Lyon, Busan) demonstrate that SENSE achieves high visual fidelity and strong physical consistency, satisfying the ASHRAE standard metric. They also show that SENSE can generate enough annotated synthetic data using less than 20% of the labeled energy data, boosting downstream prediction performance by 10% IoU. Compared to SOTA urban energy prediction methods, SENSE significantly reduces prediction error (NMBE by 3%-11% and CVRMSE by 1%-9%). This study offers an energy-efficient urban planning and physical generation solution for urban science, energy science, and building science. The dataset and code are available at https://huggingface.co/datasets/skl24/MUSE and https://github.com/kailaisun/GenAI4Urban-Energy/.
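
For readers unfamiliar with the two error metrics quoted above, here is a minimal sketch of NMBE and CV(RMSE) in the form commonly associated with ASHRAE Guideline 14. The degrees-of-freedom adjustment `p=1` and the sign convention (measured minus predicted) are assumptions; the paper may use a different variant, and the sample values are made up.

```python
from math import sqrt


def nmbe(measured, predicted, p=1):
    """Normalized mean bias error in percent (sign: measured minus predicted)."""
    n = len(measured)
    mean_m = sum(measured) / n
    bias = sum(m - s for m, s in zip(measured, predicted))
    return 100.0 * bias / ((n - p) * mean_m)


def cvrmse(measured, predicted, p=1):
    """Coefficient of variation of the RMSE in percent."""
    n = len(measured)
    mean_m = sum(measured) / n
    sq_err = sum((m - s) ** 2 for m, s in zip(measured, predicted))
    return 100.0 * sqrt(sq_err / (n - p)) / mean_m


if __name__ == "__main__":
    measured = [100.0, 200.0, 300.0]   # toy energy readings
    predicted = [110.0, 190.0, 300.0]  # toy model outputs
    print(nmbe(measured, predicted), cvrmse(measured, predicted))
```

NMBE captures systematic over- or under-prediction and can be zero when errors cancel, whereas CV(RMSE) also penalizes errors that cancel out on average, so the two are usually reported together.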


[195] Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis cs.CVPDF

Fengyi Zhang, Xujie Zeng, Mohan Liu, Zengyi Wang, Yalong Jiang

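TL;DR: The paper proposes Rad-VLSM, a two-stage cross-modal framework for medical image segmentation and diagnosis. First, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. These prompts then drive a SAM-based multitask network that segments the lesion, and the predicted masks serve as spatial priors for diagnosis through a visual-radiomics fusion head.

Details

Motivation: Existing medical image segmentation models can be distracted by background tissue, acoustic artifacts, and irrelevant visual correlations, whereas the lesion cues needed for clinical diagnosis are often subtle and localized. A method is needed that focuses on the lesion, segments it robustly, and grounds the diagnosis in visual evidence.

Result: Experiments on a private clinical breast ultrasound dataset and public benchmarks show that Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.

Insight: The novelty is using semantic information for localization rather than direct prediction: the two-stage design (semantics-guided localization followed by SAM-prompted segmentation) and the fusion of visual and radiomics features reduce the direct dependence of diagnosis on text and ground it in lesion-level evidence. A multi-candidate region aggregation strategy also improves prompt stability.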

Abstract: Medical image segmentation is more clinically valuable when it supports diagnosis rather than merely producing lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, while existing models may be distracted by background tissues, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks are then used as spatial priors for diagnosis, and a visual-radiomics fusion head integrates lesion-aware visual features with selected radiomics descriptors. By using semantic information for localization rather than direct prediction, Rad-VLSM reduces text-to-diagnosis dependence and grounds diagnosis in lesion-level evidence. Experiments on a private clinical breast ultrasound dataset and public benchmarks show that Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.


[196] DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos cs.CVPDF

Wenhao Shen, Ming Zhou, Hengyuan Zhang, Siyuan Bian, Youjiang Xu

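TL;DR: This paper proposes DanceHMR, a framework that recovers temporally coherent whole-body human meshes (SMPL-X) from monocular video. Residual body-hand fusion unifies body context with hand-specific observations, giving stable body motion and detailed hand recovery within a single temporal architecture.

Details

Motivation: Existing video HMR methods produce coherent body motion but often overlook hand detail, while image-based whole-body methods recover meshes independently per frame, which makes hand motion jittery and inaccurate. The paper aims for both temporal stability and expressive whole-body motion, hands included, in challenging in-the-wild monocular videos.

Result: On whole-body and body-only benchmarks, the method shows improved hand reconstruction and competitive body accuracy. It also produces temporally stable, 2D-consistent SMPL-X motion on challenging real-world videos.

Insight: The main novelty is a unified temporal framework that combines body and hand estimates through residual body-hand fusion, plus close-up-aware augmentation that improves robustness under upper-body framing. Together they address the weak hand detail and poor temporal consistency of existing methods.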

Abstract: Monocular video human mesh recovery is essential for digital humans, avatar animation, and embodied simulation, where both temporal stability and expressive whole-body motion are required. Existing video HMR methods produce coherent body motion but often overlook detailed hand articulation, while image-based whole-body methods recover SMPL-X meshes independently per frame, often leading to jittery and inaccurate hand motion. We present a temporally coherent whole-body HMR framework for challenging in-the-wild monocular videos. Our model unifies body context and part-specific hand observations through residual body-hand fusion, enabling stable body motion and detailed hand recovery within a single temporal architecture. We further introduce close-up-aware augmentation to improve robustness under upper-body framing. Experiments on whole-body and body-only benchmarks demonstrate improved hand reconstruction and competitive body accuracy. Our method also produces temporally stable and 2D-consistent SMPL-X motion in challenging real-world videos.


[197] Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models cs.CV | cs.AIPDF

Sihan Ma, Siyuan Liang, Dacheng Tao

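TL;DR: This paper studies source attribution for generative 3D models: deciding whether, and by which generative model, a 3D asset was created. The authors build the first passive source attribution benchmark, covering 22 representative 3D generators, and propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features to capture dispersed attribution signals. The method reaches 97.22% accuracy under full supervision and 77.17% with only 1% of the training data.

Details

Motivation: As generative 3D models spread through gaming, robotics, and immersive creation, identifying the source of a 3D asset becomes critical. Two challenges stand in the way: attribution signals are dispersed across multi-view, geometric, and frequency-domain cues, and realistic deployment brings scarce labels, degraded prompts, and mixed real and synthetic assets.

Result: On the new benchmark, the method reaches 97.22% accuracy under the fully supervised protocol and 77.17% under the few-shot protocol that uses only 1% of the training data (fewer than five samples per generator).

Insight: The key finding is that generative 3D models leave two kinds of stable fingerprints: cross-view inconsistency, and structural artifacts reflected in geometric statistics and frequency-domain cues. The hierarchical multi-view multi-modal Transformer fuses these dispersed features, and the work establishes a benchmark and methodological foundation for trustworthy 3D content provenance.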

Abstract: Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.


[198] Semi-LAR: Semi-supervised Contrastive Learning with Linear Attention for Removal of Nighttime Flares cs.CVPDF

Xiyu Zhu, Wei Wang, Kui Jiang, Zhengguo Li

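TL;DR: This paper proposes Semi-LAR, a semi-supervised lens flare removal framework that combines an adaptive pseudo-label repository with a flare-aware contrastive loss to learn stably from unlabeled images, reducing the heavy reliance of existing methods on large-scale paired data.

Details

Motivation: Lens flare removal is difficult because flare artifacts cover large areas and are entangled with scene structure, and existing methods rely heavily on large-scale paired data, so a semi-supervised scheme that exploits unlabeled data is needed.

Result: Extensive experiments on multiple flare benchmarks show that the framework is model-agnostic and consistently improves performance and robustness.

Insight: The contributions are an adaptive pseudo-label repository that progressively refines pseudo supervision through no-reference quality assessment, momentum-based updates, and invalid-label filtering, and a flare-aware contrastive loss that treats flare-contaminated inputs as negatives in patch-level contrastive learning to sharpen discrimination against flare patterns.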

Abstract: Lens flare removal is challenging due to the large spatial extent of flare artifacts and their entanglement with scene structures, while existing methods heavily rely on large-scale paired data. We propose a semi-supervised flare removal framework that enables stable learning from unlabeled images by jointly addressing pseudo-label reliability and representation discrimination. We propose an adaptive pseudo-label repository that progressively refines pseudo supervision through no-reference quality assessment, momentum-based updates, and invalid label filtering, effectively mitigating error accumulation. Moreover, we propose a flare-aware contrastive loss that explicitly treats flare-contaminated inputs as negatives and performs patch-level contrastive learning, encouraging representations that are discriminative against flare patterns while remaining consistent with reliable pseudo targets. Extensive experiments on multiple flare benchmarks demonstrate that the proposed framework is model-agnostic and consistently improves performance and robustness.


[199] Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models cs.CV | cs.AIPDF

Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu

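TL;DR: This paper proposes Vision Inference Former (VIF), a lightweight architectural module that addresses two problems in multimodal large language models (MLLMs): the diminished contribution of visual information, and the decay of vision-language alignment as generation length grows. VIF injects visual semantics continuously throughout the decoding phase of inference so that generation stays grounded in visual content. Experiments on 14 benchmark tasks support its effectiveness.

Details

Motivation: The dominant connector-based MLLM paradigm projects visual features into the textual sequence, which dilutes the distinct contribution of the visual modality. As generation length increases, the model relies less on visual information, degrading vision-language alignment and semantic consistency.

Result: Experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination show that VIF consistently improves performance across model architectures with minimal additional overhead.

Insight: The main novelty is a lightweight module (VIF) that continuously injects pure visual representations during decoding, directly bridging visual representations and the model's output space to sustain visual consistency during inference. The mechanism is decoupled from the underlying model architecture.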

Abstract: In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) Although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) As generation length increases, particularly within a limited context window, the model’s dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model’s output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.


[200] Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency cs.CV | cs.AI

Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang, Piotr Koniusz

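TL;DR: This paper proposes SAGE, a self-evolving framework for making spatial reasoning in vision-language models more robust. It enforces logical consistency through geometric and linguistic duality operations and incorporates that consistency as an auxiliary reward in GRPO training. SAGE is a model-agnostic, data-efficient, lightweight post-training method that shows performance gains and generalization on multiple benchmarks.

Details

Motivation: Spatial reasoning in current vision-language models remains fragile: a model may answer the original input correctly yet fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning.

Result: Experiments on video and spatial reasoning benchmarks show consistent improvements over strong baselines and better generalization to unseen data.

Insight: The core innovation is defining and enforcing logical consistency through geometric and linguistic duality operations and folding it into training as an auxiliary reward. A dynamic operation pool keeps probing for inconsistencies and concentrates training on the most informative signals, enabling data-efficient self-evolving learning.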

Abstract: Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.
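
The duality-consistency reward described in this abstract can be illustrated with a small sketch. The abstract does not give the reward's exact form, so everything here is an assumption made for illustration: the horizontal-flip operation, the `FLIP_ANSWER_MAP` table, the function names, and the additive weighting.

```python
# Hypothetical answer mapping for a horizontal-flip operation (an assumption,
# not taken from the paper): flipping the image swaps "left" and "right".
FLIP_ANSWER_MAP = {"left": "right", "right": "left"}


def expected_after_flip(answer: str) -> str:
    """Answer a logically consistent model should give on the flipped input."""
    return FLIP_ANSWER_MAP.get(answer, answer)


def duality_reward(orig_answer: str, flipped_answer: str, gold_answer: str,
                   consistency_weight: float = 0.5) -> float:
    """Task reward plus an auxiliary bonus for consistency across the pair."""
    correct = 1.0 if orig_answer == gold_answer else 0.0
    consistent = 1.0 if flipped_answer == expected_after_flip(orig_answer) else 0.0
    return correct + consistency_weight * consistent
```

In this sketch a model that answers "left" on the original image but also "left" on the flipped image earns only the task reward, which is the kind of instance-level correctness without robustness that the abstract describes.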


[201] Xiaomi EV World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving cs.CV

Lijun Zhou, Hongcheng Luo, Zhenxin Zhu, Cheng Chi, Mingfei Tu

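TL;DR: This report presents the Xiaomi EV World Model, a unified technical system for autonomous driving built around two core capabilities: world representation and world generation. It comprises WorldRec, a feed-forward reconstruction architecture that uses sparse scene queries to produce compact, high-fidelity 3D Gaussian scene representations; WorldGen, a two-stage training framework that supports high-quality online causal video generation; and JWM, a joint world model that deeply integrates the two to improve generation stability, cross-frame consistency, and visual fidelity.

Details

Motivation: To address the two core capabilities of world models for autonomous driving, world representation and world generation, and thereby provide a solid foundation for closed-loop simulation, data synthesis, and end-to-end training.

Result: WorldGen achieves high-quality online causal video generation in as few as 4 denoising steps. JWM delivers synergistic gains in generation stability, cross-frame consistency, and visual fidelity.

Insight: The innovations are: WorldRec, a feed-forward reconstruction architecture based on sparse 3D scene queries that naturally enforces spatial consistency across frames; WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD) for efficient, high-quality video generation; and JWM, which deeply integrates the two modules for synergistic performance gains.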

Abstract: This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross-view, cross-temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the JWM, which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross-frame consistency, and visual fidelity, providing a solid foundation for closed-loop simulation, data synthesis, and end-to-end training in autonomous driving.


[202] View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification cs.CV

Quan Zhang, Zeqiang Cai, Peiming Zhao, Jingze Wu, Cailun Wu

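TL;DR: This paper proposes ViSA (View-aware Semantic Alignment), a view-aware framework for Aerial-Ground Person Re-Identification (AGPReID), where viewpoints differ drastically between drones and fixed cameras. The framework combines an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM) to perceive viewpoint-specific patterns and align semantics consistently across views.

Details

Motivation: Existing methods typically follow a view-invariant paradigm that gains robustness by aligning features shared across views, but this inherently enforces part-level alignment and ignores view-specific cues and discriminative identity information. The paper targets the resulting difficulty of aligning features under drastic viewpoint changes.

Result: Extensive experiments on three AGPReID benchmarks (AG-ReID.v2, CARGO, and LAGPeR) show that ViSA consistently achieves superior performance, including a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol, reaching state-of-the-art results.

Insight: The innovation is the shift from a view-invariant to a view-aware paradigm. ETGM generates adaptive semantic queries that perceive viewpoint-specific patterns, and DLFM uses graph reasoning to extract and align local regions responsive to different experts, making better use of view-specific cues for cross-view semantic consistency.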

Abstract: Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, the view-invariant paradigm inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency through an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts. Extensive experiments on three AGPReID benchmarks including AG-ReID.v2, CARGO and LAGPeR demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code is available at https://github.com/Cat-Zero/ViSA.


[203] Best Segmentation Buddies for Image-Shape Correspondence cs.CV | cs.GR

Itai Lang, Dongwei Lyu, Dale Decatur, Rana Hanocka

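TL;DR: This paper proposes "Best Segmentation Buddies," a method for estimating segmentation-to-segmentation correspondence between images and untextured 3D shapes. It distills features from a 2D vision model onto the 3D shape surface and computes feature similarity between pixels and vertices, establishing reliable semantic correspondences despite the cross-modality gap.

Details

Motivation: To tackle the underexplored task of estimating segmentation-to-segmentation correspondence between in-the-wild images and untextured 3D shapes, which is highly challenging because of substantial differences in appearance, geometry, and viewpoint.

Result: The paper demonstrates generality and robustness across a wide range of image-shape pairs, with accurate and semantically meaningful correspondences, but the abstract mentions no specific benchmarks or comparisons with the state of the art.

Insight: The innovation is bridging the cross-modality gap by distilling 2D features onto the 3D surface and introducing the "Best Segmentation Buddies" concept to reliably identify vertices in semantically corresponding shape parts. Viewed objectively, using features from the 2D segmentation model to segment the shape directly in 3D, and thereby guide the correspondence process, is an effective bootstrapping strategy.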

Abstract: Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored task of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. Then, we identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D image segmentation model to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our project page is at https://threedle.github.io/bsb/.
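
The Best Segmentation Buddies criterion stated in this abstract (a vertex qualifies when its most similar image pixel falls inside the image segment) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity as the feature similarity, plain Python lists as features, and all function and parameter names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_segmentation_buddies(vertex_feats, pixel_feats, pixel_in_segment):
    """Return indices of vertices whose most similar pixel lies in the segment.

    vertex_feats: per-vertex features distilled onto the shape surface.
    pixel_feats: per-pixel features from the same 2D model.
    pixel_in_segment: one boolean per pixel, True inside the image segment.
    """
    buddies = []
    for v_idx, v in enumerate(vertex_feats):
        best_pixel = max(range(len(pixel_feats)),
                         key=lambda p: cosine(v, pixel_feats[p]))
        if pixel_in_segment[best_pixel]:
            buddies.append(v_idx)
    return buddies
```

The exhaustive nearest-pixel search is quadratic in the number of vertices and pixels; it is written this way only to keep the criterion readable.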


[204] MARS: Technical Report for the CASTLE Challenge at EgoVis 2026 cs.CV | cs.AI

Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, Weili Guan

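TL;DR: This report presents MARS (Multimodal Agentic Reasoning with Source selection), a system built for the CASTLE Challenge at EgoVis 2026. The challenge requires answering 185 closed-form questions over a dataset spanning multiple days of activity, 15 synchronized perspectives, official transcripts, and several auxiliary modalities (personal photos, auxiliary videos, gaze, thermal imagery, and heart-rate measurements). MARS treats the task as agentic evidence selection over multimodal sources rather than as a text-only pipeline. It first builds evidence memories from two primary sources (videos and transcripts) and four auxiliary sources (gaze, heart rate, photos, and thermal imagery), converting long videos into captions and DeepSeek-based summaries to compress temporal evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when evidence is insufficient. The system placed second on the final leaderboard.

Details

Motivation: The CASTLE Challenge demands complex reasoning over long-duration, multi-view, multimodal egocentric data, a setting that traditional single-video benchmarks do not cover.

Result: Second place on the final CASTLE Challenge leaderboard.

Insight: The multimodal question-answering task is reframed as an agentic evidence-selection problem. A decision agent dynamically manages the multimodal sources (compressing temporal evidence while retaining source-specific evidence) instead of relying on a text-only pipeline, which suits long-duration, heterogeneous data.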

Abstract: This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heart-rate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources (videos and transcripts) and four auxiliary sources (gaze, heart rate, photos, and thermal imagery). Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our code is available at https://github.com/Hyu-Zhang/MARS.


[205] Token-Space Mask Prediction for Efficient Vision Transformer Segmentation cs.CV

Calvin Galagain, Martyna Poreba, François Goulette

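TL;DR: This paper proposes TokenMask, a token-space mask prediction head for query-based Vision Transformer segmentation models. It computes mask logits directly from query-token affinities and interpolates in logit space rather than feature space, avoiding explicit reconstruction of image-space feature maps. The approach simplifies the computational structure and, across ViT backbones, datasets, and segmentation tasks, substantially reduces compute and memory costs while keeping competitive accuracy, with real speedups on an embedded platform.

Details

Motivation: Existing query-based ViT segmentation models typically reconstruct dense spatial feature maps to predict masks, a design pattern inherited from convolutional architectures. The authors argue that this explicit image-space reconstruction is unnecessary and explore a more efficient, deployment-friendly design.

Result: Across ViT backbones (e.g., ViT, Swin Transformer), datasets (e.g., COCO, ADE20K), and segmentation tasks (instance and panoptic segmentation), TokenMask keeps accuracy competitive with prior methods while substantially reducing computation (FLOPs) and memory footprint, and achieves tangible speedups on NVIDIA Jetson AGX Orin with TensorRT FP16 inference.

Insight: The core innovation is moving mask prediction from feature-space reconstruction to logit space, computing mask logits directly from query-token affinities. This preserves the original linear scoring mechanism while simplifying the overall computation, giving embedded vision systems a simpler and more efficient design.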

Abstract: Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
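
The claim that interpolating in logit space preserves the linear scoring mechanism follows from linearity: a dot product commutes with linear interpolation. The sketch below checks this on a toy example. It is an illustration only, assuming a 1D token sequence and linear interpolation in place of the 2D token grid a real segmentation head would use; all names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def upsample_linear(values, factor):
    """Linearly interpolate a 1D sequence, inserting factor - 1 points per gap."""
    n = len(values)
    out = []
    for i in range((n - 1) * factor + 1):
        lo, step = divmod(i, factor)
        frac = step / factor
        hi = min(lo + 1, n - 1)
        out.append((1 - frac) * values[lo] + frac * values[hi])
    return out


def token_space_logits(query, tokens, factor):
    """Score each token against the query, then upsample the scalar logits."""
    logits = [dot(query, t) for t in tokens]
    return upsample_linear(logits, factor)


def feature_space_logits(query, tokens, factor):
    """Upsample every feature channel first, then score the dense features."""
    channels = [list(c) for c in zip(*tokens)]
    dense = zip(*[upsample_linear(c, factor) for c in channels])
    return [dot(query, f) for f in dense]
```

Both paths give the same logits, but the token-space path interpolates one scalar per token instead of one value per channel, which is where the compute and memory saving in this toy setting comes from.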


[206] EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation cs.CV

Rosario Leonardi, Francesco Ragusa, Daniele Materia, Alessandro Passanisi, James Fort

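TL;DR: This paper introduces EgoInteract, a controllable simulator for synthetic egocentric video generation that models fine-grained egocentric interactions and their temporal dynamics. The simulator gives precise control over camera, human body and hand motion, object manipulation, and scene composition across environments. Using it, the authors generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. Models trained with the simulated data are evaluated on multiple real-world egocentric benchmarks and outperform strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of the simulation-based approach.

Details

Motivation: Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly and slow, and is often constrained by environmental biases, privacy restrictions, and limited coverage of interaction patterns. Synthetic data has shown strong potential in several vision domains, but its use for egocentric perception remains underexplored, especially for tasks that need temporally coherent human-object interactions.

Result: On multiple real-world egocentric benchmarks spanning different environments, object categories, and interaction patterns, models show consistent improvements over strong baselines on temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection.

Insight: The innovation is a controllable egocentric video simulator that precisely models the temporal dynamics of interactions, together with a large-scale synthetic dataset with dense annotations. Viewed objectively, the approach eases the bottleneck of real data collection and shows that synthetic data transfers to complex temporal interaction understanding tasks, offering a new direction for data-scarce domains.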

Abstract: Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.


[207] SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning cs.CV | cs.AI

Pawat Chunhachatrachai, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

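TL;DR: This paper proposes SpatioRoute, a dynamic prompt routing method for zero-shot spatial reasoning. Through two complementary modes, one rule-based and one LLM-driven, it generates semantically tailored prompt templates for spatial question answering without additional training or 3D sensor input.

Details

Motivation: To address the challenges vision-language models (VLMs) face on spatial question answering over egocentric video in the zero-shot setting, such as reasoning about 3D object positions, scene affordances, and directional relationships, without task-specific fine-tuning.

Result: On the SQA3D benchmark, SpatioRoute improves overall accuracy by up to 5% over fixed-prompt baselines and sets a new state of the art for zero-shot video-only spatial visual question answering (VQA) without requiring 3D point-cloud inputs.

Insight: The innovation is the dynamic prompt routing mechanism, which produces context-aware prompts through question-type mapping or LLM generation and outperforms uniform Chain-of-Thought (CoT) prompting. The analysis shows that question-aware routing is more effective than uniform reasoning instructions; on Qwen series models in particular, CoT prompting degrades performance.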

Abstract: Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template – without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains up to 5% over fixed prompt baselines, establishing a new state-of-the-art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
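
The rule-based mode, SpatioRoute-R, deterministically maps question typologies to prompt templates. A minimal sketch of that idea follows. The abstract lists the typologies (What, Is, How, Can, Which) but not the templates, so the template texts, the fallback, and the function names here are hypothetical placeholders, and keying on the first word is an assumed simplification.

```python
# Hypothetical templates, one per question typology named in the abstract.
TEMPLATES = {
    "what": "Identify the object or attribute asked about. {situation} {question}",
    "is": "Answer yes or no after checking the spatial relation. {situation} {question}",
    "how": "Count or measure as the question requires. {situation} {question}",
    "can": "Judge whether the action is feasible from this position. {situation} {question}",
    "which": "Pick a direction or candidate relative to the viewer. {situation} {question}",
}
DEFAULT_TEMPLATE = "Answer the spatial question. {situation} {question}"


def route(question: str) -> str:
    """Deterministically pick a template from the question's first word."""
    words = question.split()
    key = words[0].lower() if words else ""
    return TEMPLATES.get(key, DEFAULT_TEMPLATE)


def build_prompt(question: str, situation: str) -> str:
    """Fill the routed template with the situation text and the question."""
    return route(question).format(situation=situation, question=question)
```

For example, `build_prompt("Can I reach the door?", "I am sitting on the sofa.")` selects the "can" template, while a question starting with an unlisted word falls back to the default.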


[208] Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos cs.CV

X. Feng, J. Zhu, M. Wu, C. Chen, F. Mao

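TL;DR: This paper proposes MIGA, a train-free infinite-frame method for long video generation. It targets three problems foundation video generation models face when producing long videos: high computational overhead, a mismatch between training and inference, and difficulty maintaining long-term consistency. MIGA uses a two-stage alignment mechanism to reduce the excessive noise span fed to the model, and a dual consistency enhancement mechanism, combining self-reflection and long-range frame guidance, to improve temporal consistency.

Details

Motivation: To let foundation video generation models produce longer videos without significant extra computation, while overcoming the training-inference mismatch and the long-term consistency problems of frame-level autoregressive frameworks such as FIFO-diffusion.

Result: Extensive experiments on the VBench and NarrLV benchmarks show that MIGA achieves state-of-the-art performance.

Insight: The innovations are a two-stage alignment mechanism that narrows the training-inference gap, and a dual consistency enhancement mechanism that pairs self-reflection (correcting early high-noise frames) with long-range frame guidance (using later low-noise frames with broad coverage to steer generation). Together they improve the temporal consistency of generated videos.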

Abstract: Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.


[209] GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance cs.CV

Jiale Shi, Jiarui Hu, Zesong Yang, Kaixuan Luan, Hujun Bao

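TL;DR: GaussianZoom is a generative progressive zoom-in 3D reconstruction system that combines geometry-consistent scene modeling with multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. It uses a multi-view consistent super-resolution module and an expandable continuous Level-of-Detail (LOD) hierarchy, achieving superior perceptual quality and multi-view consistency on Mip-NeRF360 and Tanks&Temples.

Details

Motivation: To preserve geometric consistency and rich detail when rendering extreme 3D zoom-ins from low-resolution inputs, and in particular to keep transitions smooth across large magnification ranges.

Result: On the Mip-NeRF360 and Tanks&Temples benchmarks, GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.

Insight: The innovations are a multi-view consistent super-resolution module that combines depth-based feature warping with VLM-driven detail synthesis, and an expandable continuous LOD hierarchy that dynamically modulates Gaussian visibility, enabling smooth, alias-free cross-scale rendering.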

Abstract: We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.


[210] Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning cs.CV

Yuyang Ji, Yixuan Shen, Anil Jain, Xiaoming Liu, Feng Liu

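TL;DR: This paper introduces Biometric Identity Provisioning (BIP), a new problem and solution framework for provisioning virtual identities for digital entities such as AI agents and humanoid robots. The core idea is that real face identities occupy a low-dimensional subspace of the embedding hypersphere, so virtual identities must be allocated as unclaimed gaps within the real face manifold itself. Building on this geometric insight, the authors propose a repulsion-based allocation method that produces large numbers of non-colliding virtual identity embeddings, and develop the GapGen generator to synthesize those embeddings into high-fidelity face images. They also construct the v-LFW virtual face dataset for evaluating virtual face verification, cross-reality matching, and real-vs-virtual detection.

Details

Motivation: Digital entities such as AI agents and humanoid robots increasingly operate alongside real humans, yet their identity infrastructure still rests on credentials rather than embodied biometric identity. A method is therefore needed to provision virtual identities that do not collide with any enrolled real identity, remain sufficiently separable from one another, and can be realized as high-fidelity face images.

Result: Against a gallery of 360K real identities, the method generates 10M non-colliding virtual identity embeddings. The GapGen generator synthesizes 1M photorealistic virtual face images. The v-LFW dataset is constructed as a virtual counterpart to LFW, with evaluation protocols for virtual face verification, cross-reality matching, real-vs-virtual detection, and unified recognition and detection.

Insight: The key innovation is treating virtual identity provisioning as a constrained packing problem inside the real face manifold rather than as allocation in a residual subspace, which removes any fixed limit on the number of provisioned identities. The GapGen generator is trained with a curriculum that progressively extends synthesis into non-colliding regions, enabling high-quality generation outside the training distribution of real face images. The v-LFW dataset provides a standardized benchmark for virtual identity tasks.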

Abstract: Digital entities such as AI agents and humanoid robots increasingly operate alongside real humans, yet their identity infrastructure is based on credentials rather than embodied biometric identity. We introduce Biometric Identity Provisioning (BIP), a new problem and solution framework that addresses: given an enrollment gallery of real human identities, provision virtual identities that are non-colliding with every enrolled identity, maintain sufficient inter-class separability, and are realizable as high-fidelity face images. The key geometric insight is that real face identities occupy a low-dimensional subspace of the embedding hypersphere, leaving no residual subspace for virtual identities. Hence, virtual identities must instead be allocated as unclaimed gaps within the real face manifold itself. BIP is therefore a constrained packing problem: available gaps vastly exceed any foreseeable enrollment scale, and provisioned identities remain non-colliding even as new real identities are subsequently enrolled. Grounded in this geometry, our repulsion-based allocation is not bounded by any fixed provisioning count; we demonstrate 10M non-colliding virtual identity embeddings against a gallery of 360K real identities. Realizing these embeddings as face images requires a generator that operates outside the training distribution of real face images; we introduce GapGen, a gap-aware generator trained with a curriculum that progressively extends synthesis into non-colliding regions, validated at 1M photorealistic virtual face images. We further construct v-LFW, a virtual counterpart to LFW face dataset, with protocols for virtual face verification, cross-reality matching, real-vs-virtual detection, and unified recognition and detection.


[211] CineMatte: Background Matting for Virtual Production and Beyond cs.CV

Yuanjian He, Chen Zhang, Fasheng Chen, Jiangbo Cao

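TL;DR: CineMatte is a robust background matting framework for virtual production (VP) and beyond. It uses a cross-attention-conditioned design that encodes the input frame and the captured background separately and predicts the foreground through a cross-attention module. It also replaces the conventional detail branch with a pretrained image-guided feature upsampler to reduce boundary artifacts, and releases CineMatte-4K, the first non-synthetic 4K HDR dataset for VP matting.

Details

Motivation: LED virtual production renders backgrounds in real time on large LED screens, which enables in-camera visual effects but makes post-shot changes labor-intensive. The paper addresses this with a robust background matting framework for VP and broader settings.

Result: On CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte excels in VP scenarios and generalizes robustly to real-world footage.

Insight: The innovations are: a cross-attention-conditioned design that encodes input and background separately, preserving pretrained semantics and improving robustness to background changes; replacing the conventional parallel convolutional detail branch with a pretrained image-guided feature upsampler, which largely mitigates boundary artifacts caused by semantic misalignment; and CineMatte-4K, the first non-synthetic 4K HDR image-video dataset for VP matting, with images obtained via green-screen insertion and videos with tracked camera trajectories so that arbitrary backgrounds can later be rendered with correct parallax.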

Abstract: LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional “detail branch” to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.


[212] StableVLA: Towards Robust Vision-Language-Action Models without Extra Data cs.CV | cs.RO

Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng, Kaiwei Sun

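TL;DR: This paper proposes StableVLA, a robust vision-language-action (VLA) model that addresses the sharp performance drop VLA models suffer under visual disturbances absent from their training data. The core contribution is a lightweight adapter module grounded in information theory, the IB-Adapter, which selectively filters potential noise from visual inputs and substantially improves robustness without extra data or augmentation strategies.

Details

Motivation: Training data cannot cover every possible visual disturbance, so VLA models show serious robustness problems when they meet unseen disturbances under imperfect real-world visual conditions. The paper systematically studies and improves the robustness of VLA models to such perturbations.

Result: Without extra data, the IB-Adapter improves over the baseline by 30% on average while adding fewer than 10M parameters. Even with a 14x smaller backbone (0.5B parameters) and no pretraining on the Open X-Embodiment dataset, StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLA models and surpasses OpenPi under both synthetic and physical visual corruptions.

Insight: The main innovation is a lightweight, information-theoretic adapter (IB-Adapter) that dynamically filters visual noise at inference time, an efficient way to improve robustness without additional training data. Viewed objectively, the core insight is recasting robustness as an information bottleneck problem: selective information filtering improves generalization to noise at minimal parameter cost, which makes the approach practical.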

Abstract: It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.


[213] Improved Baselines with Representation Autoencoders cs.CV | cs.AI | cs.GR | cs.LG | stat.ML

Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang, Eli Shechtman

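TL;DR: This paper proposes RAEv2, an improved Representation Autoencoder (RAE). By studying three design choices (a generalized definition of the representation, the complementary mechanisms of RAE and representation alignment (REPA), and using REPA to obtain guidance for free), it substantially improves RAE performance and training efficiency.

Details

Motivation: To simplify and improve existing Representation Autoencoders by addressing their limitations in how the representation is defined, how their relationship to REPA is understood, and how efficiently guidance is obtained.

Result: On ImageNet-256, RAEv2 reaches a state-of-the-art gFID of 1.06 in only 80 epochs. On FDr^k it reaches a state-of-the-art 2.17, compared with the previous best of 3.26 (which required 800 epochs). The training-efficiency metric EP_FID@2 is 35 epochs versus 177 for the original RAE; overall, convergence is more than 10x faster.

Insight: The main innovations are: (1) defining the representation as the sum of the last k encoder layers rather than the final layer alone, which greatly improves reconstruction quality; (2) showing that RAE and REPA have complementary working mechanisms, so the same representation can serve as both the encoder and the target for intermediate diffusion layers; and (3) re-parameterizing the output of the DiT model so that REPA provides guidance for free, without additional training. Together these yield the efficient, high-performing RAEv2 framework.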

Abstract: Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for “free”. Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr^k, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EP_FID@k (epochs to reach unguided gFID <= k) as a measure of training efficiency. RAEv2 attains an EP_FID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. Code is available at https://raev2.github.io.
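
Two definitions in this abstract are simple enough to write down directly: the generalized representation (the sum of the last k encoder layers) and the EP_FID@k metric (epochs needed to reach unguided gFID <= k). The sketch below is illustrative only; the function names and the list-based layer outputs are assumptions, and the numbers in the usage note are made up rather than taken from the paper.

```python
def sum_last_k_layers(layer_outputs, k):
    """Element-wise sum of the last k encoder layer outputs.

    layer_outputs: list of equal-length feature vectors, ordered first to last.
    """
    if not 1 <= k <= len(layer_outputs):
        raise ValueError("k must be between 1 and the number of layers")
    return [sum(values) for values in zip(*layer_outputs[-k:])]


def ep_fid_at_k(fid_by_epoch, k):
    """First epoch whose unguided gFID is <= k, or None if never reached.

    fid_by_epoch: dict mapping epoch number to the gFID measured at that epoch.
    """
    for epoch in sorted(fid_by_epoch):
        if fid_by_epoch[epoch] <= k:
            return epoch
    return None
```

With a made-up curve such as `{1: 9.0, 2: 4.0, 3: 1.9, 4: 1.5}`, `ep_fid_at_k(curve, 2)` returns 3; setting `k=1` in `sum_last_k_layers` recovers the final-layer representation of the original formulation.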


[214] Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering cs.CV | cs.AI

Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, Bernhard Kainz

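TL;DR: This paper proposes Wasserstein equilibrium decoding to make small vision-language models more reliable on open-ended medical visual question answering. It extends game-theoretic decoding, previously limited to text-only, closed-ended NLP tasks, to vision-language models, and introduces a Wasserstein stopping criterion based on semantic consensus to replace conventional lexical order matching.

Details

Motivation: Small vision-language models (2-8B) suit clinical deployment because of their privacy, connectivity, and low-latency advantages, but their limited capacity worsens the generation of plausible yet incorrect answers. The paper targets the unreliability of model outputs in open-ended medical VQA.

Result: On the VQA-RAD and PathVQA benchmarks, the method yields consistent, statistically significant improvements over greedy decoding and discriminative baselines. On VQA-RAD, Qwen3-VL-2B improves by 3.5 percentage points, surpassing the greedy-decoded 4B model. On PathVQA, Gemma-3-4B with this method, without any domain-specific fine-tuning, matches the domain-fine-tuned MedGemma-4B under greedy decoding.

Insight: The main innovation is transferring the game-theoretic decoding paradigm to vision-language models and designing a semantically aware stopping criterion based on the Wasserstein distance. The criterion declares convergence from semantic consensus among near-synonymous candidate answers, avoiding unnecessary iterations caused by swaps in the ranking of clinically equivalent answers. It preserves equilibrium behaviour while cutting average convergence iterations by about 20%, improving inference efficiency.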

Abstract: Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA.


[215] Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion cs.CV | cs.AI

Peiliang Cai, Evelyn Zhang, Jiacheng Liu, Hao Lin, Ruiqi Zhang

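TL;DR: This paper proposes Focused Forcing, a training-free KV cache selection method that makes autoregressive video diffusion models more efficient. For each generated frame it selects the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, and it allocates budgets across attention heads according to their estimated importance. Across multiple autoregressive generation paradigms it achieves up to 1.48x end-to-end acceleration while improving visual quality and text alignment.

Details

Motivation: Autoregressive video diffusion models need very large KV caches for long sequences. Existing compression methods usually select historical frames coarsely by attention score, overlooking that frames within the same generated chunk can depend on different historical frames, that the attention score of a given historical frame changes with its temporal distance, and that attention heads differ in importance.

Result: Across multiple autoregressive generation paradigms, Focused Forcing achieves up to 1.48x end-to-end acceleration without training, while improving visual quality and text alignment.

Insight: The innovation is a fine-grained, content-aware per-frame KV selection strategy: historical frames are chosen by combining attention and diversity scores, and attention heads receive budgets based on estimated importance. This compresses the KV cache more effectively, markedly improving efficiency while maintaining or even improving generation quality.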

Abstract: Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose Focused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to 1.48x end-to-end acceleration without training, while improving visual quality and text alignment. Our code will be released on GitHub.
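
The two selection ideas in this abstract, combining attention with diversity when choosing historical frames and giving more budget to more important heads, can be sketched as follows. The abstract does not specify the scoring formula or the budget rule, so this sketch substitutes simple stand-ins that are assumptions: a greedy, maximal-marginal-relevance-style score and a proportional budget split. All names are hypothetical.

```python
def select_history(attn, sims, budget, diversity_weight=1.0):
    """Greedily pick history frames that are both attended to and distinctive.

    attn[i]: attention the current frame pays to history frame i.
    sims[i][j]: similarity in [0, 1] between history frames i and j.
    """
    selected = []
    candidates = list(range(len(attn)))
    while candidates and len(selected) < budget:
        def score(i):
            # Redundancy is the highest similarity to any already-kept frame.
            redundancy = max((sims[i][j] for j in selected), default=0.0)
            return attn[i] + diversity_weight * (1.0 - redundancy)
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def head_budgets(importance, total):
    """Split a total KV budget across heads in proportion to importance."""
    scale = total / sum(importance)
    raw = [w * scale for w in importance]
    budgets = [int(r) for r in raw]
    # Hand leftover slots to the heads with the largest fractional remainder.
    order = sorted(range(len(raw)), key=lambda h: raw[h] - budgets[h], reverse=True)
    for h in order[: total - sum(budgets)]:
        budgets[h] += 1
    return budgets
```

With `diversity_weight=0` the selection reduces to plain top-attention ranking, the coarse baseline the abstract argues against; a positive weight lets a distinctive frame displace a near-duplicate of one already kept.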


[216] Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport cs.CV | cs.AI

Aida Rostamza, Enrico Del Re, Joshua Cherian Varughese, Cristina Olaverri-Monreal

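TL;DR: This paper studies parameter-free attention mechanisms for crowd counting in public transport scenes. It proposes PFCASA, a new module that combines channel and spatial attention, and evaluates it with a CSRNet backbone on the ShanghaiTech dataset.

Details

Motivation: Crowd density in public transport varies widely and occlusion is complex, while conventional parameterized attention mechanisms increase computational cost and limit deployment on resource-constrained edge devices. The paper therefore investigates how effective parameter-free attention mechanisms are.

Result: Experiments on the ShanghaiTech dataset show that parameter-free attention mechanisms achieve accuracy comparable or superior to parameterized attention without adding parameters. PFCASA performs best in scenes with fewer than 40 people, while PFCA is more effective in high-density crowds.

Insight: The novelty lies in a customized combination of channel and spatial parameter-free attention modules (PFCASA) and a systematic evaluation of how well different parameter-free attention mechanisms suit crowd counting, which provides a lightweight option for edge deployment.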

Abstract: Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.
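
To make "parameter-free attention" concrete, here is a minimal pure-Python sketch of SimAM-style reweighting for a single channel, following the commonly published SimAM formulation as we understand it. The abstract does not specify PFCA, SA, or PFCASA, so they are not sketched here; the function name and the default `e_lambda` are illustrative assumptions.

```python
import math

def simam_weights(x, e_lambda=1e-4):
    """Return per-pixel attention weights for one channel (2D list).

    No learnable parameters are involved: weights depend only on how far
    each activation lies from the channel mean, relative to the variance.
    Requires at least two pixels in the channel.
    """
    flat = [v for row in x for v in row]
    n = len(flat) - 1
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / n
    weights = []
    for row in x:
        w_row = []
        for v in row:
            # Inverse energy: larger for activations that stand out.
            energy_inv = (v - mu) ** 2 / (4 * (var + e_lambda)) + 0.5
            w_row.append(1.0 / (1.0 + math.exp(-energy_inv)))  # sigmoid
        weights.append(w_row)
    return weights

def simam_channel(x, e_lambda=1e-4):
    """Reweight the channel: output = activation * attention weight."""
    w = simam_weights(x, e_lambda)
    return [[v * wv for v, wv in zip(row, w_row)] for row, w_row in zip(x, w)]
```

Because the weights are computed from channel statistics alone, inserting such a module changes the forward pass without adding anything to the parameter count, which is the property the abstract emphasizes.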


[217] RAVE: Re-Allocating Visual Attention in Large Multimodal Models cs.CV

Xi Leng, Xinhong Ma, Ziqiang Dong, Feng Zhang, Xiaoying Tang

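TL;DR: This paper proposes RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that improves how large multimodal models (LMMs) allocate visual attention. It adjusts the attention scores of visual keys with a learned query-key bias, requires no modification to the backbone architecture, and can be trained end-to-end.

Details

Motivation: LMMs inherit self-attention from pretrained language backbones, but standard attention allocates suboptimally both across modalities (between textual and visual evidence) and within the visual modality (among visual tokens), which leads to inaccurate visual grounding.

Result: Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks such as multilingual OCR, chart understanding, document VQA, and scene text VQA.

Insight: The novelty is a lightweight learned bias, derived from pre-RoPE query and key features, that re-allocates visual attention. It is a plug-and-play method that leaves the backbone architecture unchanged, strengthens attention to key visual information, and improves visual grounding.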

Abstract: Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evidence and intra-visual imbalance among visual tokens. We propose RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that adds a learned query–key bias to pre-softmax attention scores over visual keys, derived from pre-RoPE query and key features. RAVE requires no architectural modification to the backbone and can be trained end-to-end with the rest of the model. Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks – including multilingual OCR, chart understanding, document VQA, and scene text VQA – where accurate visual grounding is critical.
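
The mechanism of adding a bias to pre-softmax scores over visual keys only can be sketched for a single query as follows. How RAVE derives the bias from pre-RoPE features is not described in the abstract, so the bias is taken here as a given input; the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_with_visual_bias(scores, bias, is_visual):
    """Attention weights for one query.

    scores    : pre-softmax attention scores, one per key
    bias      : query-key bias, one per key (assumed to come from a
                learned gate; its computation is not modeled here)
    is_visual : True where the key is a visual token; the bias is
                applied only at those positions
    """
    adjusted = [s + (b if v else 0.0) for s, b, v in zip(scores, bias, is_visual)]
    return softmax(adjusted)
```

With an all-zero bias the function returns standard softmax attention, which matches the claim that the backbone itself is left unmodified.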


[218] GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation cs.CV

Jan Ackermann, Shengqu Cai, Boyang Deng, Zhengfei Kuang, Songyou Peng

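TL;DR: This paper proposes GeoFlow, which explicitly optimizes the geometric consistency of video generation through reinforcement fine-tuning. It uses optical flow, depth-pose predictions, and feature matching to separate rigid background from dynamic object regions and evaluates the motion consistency of each, reducing geometric artifacts such as object deformation, texture drift, and non-rigid backgrounds.

Details

Motivation: Existing text-to-video diffusion models handle geometry only implicitly, which leads to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing remedies improve consistency only as a byproduct, apply only to static scenes, or completely realign the model's latent space.

Result: Experiments show that the method substantially reduces temporal geometric artifacts over strong baselines while preserving perceptual quality.

Insight: The core novelty is turning geometric consistency from an emergent property into an explicit optimization objective. A reward based on physical consistency (rigid background motion should be explained by camera-induced flow, and independently moving objects should preserve their appearance along motion trajectories) is combined with reinforcement fine-tuning. The resulting approach is model-agnostic and applies to diverse dynamic scenes with both camera and object motion.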

Abstract: Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.


[219] Vision Foundation Models as Generalist Tokenizers for Image Generation cs.CV

Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong

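TL;DR: This paper proposes VFMTok, a generalist image tokenizer built on a frozen vision foundation model (VFM). A region-adaptive quantization framework reduces spatial redundancy and a semantic reconstruction objective preserves semantic fidelity, so the tokenizer works in both discrete and continuous latent spaces and substantially improves the quality and efficiency of image generation.

Details

Motivation: The work explores building a generalist image tokenizer directly on top of a frozen VFM, aiming to remove the spatial redundancy of standard 2D grid features and to keep decoded outputs semantically aligned with the VFM's representations.

Result: On ImageNet class-conditional synthesis, VFMTok speeds up convergence of discrete autoregressive generation by 3 times and reaches a state-of-the-art gFID of 1.36. For continuous-space generation, pairing it with a denoising model yields a gFID of 1.25. It also enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG), significantly accelerating inference.

Insight: The novelties are the region-adaptive quantization framework and the semantic reconstruction objective, which let the tokenizer handle both discrete and continuous spaces. The study also finds that a VFM pretrained with self-supervised objectives combining global contrastive learning and latent masked image modeling provides the best representations, offering a foundation and practical guidance for future image tokenizer design.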

Abstract: In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM’s representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID of 1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of 1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.


[220] NEWTON: Agentic Planning for Physically Grounded Video Generation cs.CV

Yuxiang Feng, Juncheng Wang, Chao Xu, Yijie Qian, Huihan Wang

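TL;DR: This paper proposes NEWTON, an agentic planning framework for physically grounded video generation. It demotes video generation to one action in an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier enables iterative re-planning. The framework targets the widespread violation of physical commonsense by existing video generation models.

Details

Motivation: Existing video generation models produce visually compelling results but systematically violate physical commonsense; on the VideoPhy-2 benchmark the best model achieves only 32.6% joint accuracy. The authors attribute this to a specification bottleneck: text prompts are a lossy compression of the physical world that omits the parameters fully determining the dynamics, so no amount of model scaling can recover information that was never specified.

Result: On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying the underlying video generators.

Insight: The paper identifies three properties that physics conditioning must satisfy (sufficiency, dynamism, and verifiability) and builds an agentic system centered on a planner. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. It coordinates untrained tools to iteratively construct and verify physics conditioning, improving the physical plausibility of generated videos.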

Abstract: Video generation models produce visually compelling results but systematically violate physical commonsense – on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are a lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy – sufficiency, dynamism, and verifiability – and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent’s toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Our project page: https://Newton026.github.io/newton


[221] Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models cs.CV

Nicola Farronato, Niccolo Avogaro, Thomas Frick, Mattia Rigotti, Rizwan Ullah Khan

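TL;DR: This paper introduces the CiF (Cracks in the Foundation) dataset, the largest and most detailed civil infrastructure (instance) segmentation dataset to date, with about 150,000 high-resolution images. It finds that although current promptable foundation models and vision-language models excel at general image understanding, they still struggle with dense image understanding of real-world infrastructure, which exposes fundamental weaknesses on domain-specific perception tasks.

Details

Motivation: Automated structural health monitoring of civil infrastructure needs defect segmentation, but progress has been slow because data is extremely scarce and expert annotation is costly. The problem is compounded by inherent algorithmic hurdles such as center bias and the need to rely on shape when inspecting nearly textureless building materials.

Result: Evaluations show that even the most recent zero-shot foundation models face significant difficulties on real-world infrastructure, and that specialized models with domain-specific supervision plateau at about 25% mAP.

Insight: The contribution is a large-scale, high-quality, domain-specific dataset (CiF), used to show that foundation models trained predominantly on internet images have fundamental weaknesses on a seemingly easy real-world perception task such as infrastructure inspection. This challenges the current visual AI paradigm.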

Abstract: Automated structural health monitoring is essential to prevent catastrophic infrastructure failures. Precise, pixel-level defect segmentation is needed to accurately assess structural integrity, but progress in defect segmentation for civil infrastructures has been held back by an extreme scarcity of data, which requires costly expert annotation. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, including center-bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove the bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising approximately 150,000 high-resolution images meticulously curated over five years in collaboration with civil engineering experts. With the help of this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable Foundation Models (FMs) and Vision Language Models (VLMs), and despite the impressive abilities of today’s specialised segmentation models, it turns out that dense image understanding in the built environment is nowhere near solved. Our evaluations indicate that even the most recent zero-shot FMs face significant challenges when deployed on real-world infrastructure and even the performance of specialised models with domain-specific supervision plateaus at approximately 25% mAP. CiF establishes inspection of civil infrastructure, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present-day models trained predominantly on internet images, literally and figuratively highlighting cracks in the current foundation model paradigm.


[222] Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology cs.CV | cs.AI

Franciskus Xaverius Erick, Johanna Paula Müller, Bernhard Kainz

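TL;DR: This paper proposes GAUC, a training-free coreset selection method that improves the in-context learning performance of vision-language models in computational pathology. Operating directly in the pretrained multimodal embedding space, it jointly optimizes three objectives to select more representative image-text pairs, improving diagnostic accuracy, calibration, and prompt robustness.

Details

Motivation: Vision-language models are promising for computational pathology, but fine-tuning is prohibitively expensive, and in-context learning is highly sensitive to example selection and query phrasing, which makes diagnostics unreliable. Existing selection strategies ignore global data structure, require parameter updates, or disregard the geometry of the multimodal embedding space.

Result: On CRC-100K and MHIST, across multiple open-source vision-language model architectures, GAUC outperforms recent in-context learning selection methods and dataset-distillation baselines in accuracy, calibration, and prompt robustness, without any gradient updates.

Insight: The novelty is a training-free coreset selection framework that jointly optimizes three objectives: distributional fidelity, a regularizer based on vision-text alignment that bounds performance degradation, and a predictive-variance penalty. It exploits the geometry of the pretrained model directly to make in-context learning more robust.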

Abstract: Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM’s joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.
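
The first of the three objectives, the Maximum Mean Discrepancy (MMD) term, can be illustrated on its own. The sketch below computes a biased squared-MMD estimate with an RBF kernel and greedily grows a coreset that minimizes it. The kernel choice, `gamma`, and the greedy procedure are assumptions for illustration; the other two GAUC terms and the paper's actual optimizer are not modeled.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

def greedy_coreset(data, k, gamma=1.0):
    """Pick k indices whose embeddings best match the full set under MMD."""
    chosen = []
    for _ in range(k):
        candidates = [i for i in range(len(data)) if i not in chosen]
        best = min(
            candidates,
            key=lambda i: mmd2([data[j] for j in chosen + [i]], data, gamma),
        )
        chosen.append(best)
    return chosen
```

On data with two well-separated clusters, a two-element coreset chosen this way takes one point from each cluster, which is the "distributional fidelity" behaviour that query-dependent nearest-neighbour retrieval does not guarantee.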


[223] Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis cs.CV | cs.GR

Yixuan Yang, Zhen Luo, Wanshui Gan, Jinkun Hao, Junru Lu

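TL;DR: This paper proposes Code-as-Room, an agentic framework based on a multimodal large language model (MLLM) for generating 3D indoor scenes from top-down views. It parses the reference image to extract scene elements and spatial relationships and synthesizes executable Blender code for geometry, materials, and lighting in a multi-stage pipeline, with a cross-stage memory module to mitigate the context forgetting of existing agent frameworks.

Details

Motivation: Existing text-based and image-based 3D room generation methods have shortcomings: text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when generating whole rooms from top-down views.

Result: The paper introduces a dedicated benchmark for code-based 3D room synthesis and uses it for comprehensive comparisons against existing agent-based methods, validating the proposed execution harness.

Insight: The novelty is representing 3D rooms as Blender code and designing a structured execution harness with a cross-stage memory module, which suggests a route to controllable, stable 3D content generation through code synthesis.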

Abstract: Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms with Blender codes. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.


[224] Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models cs.CV

Kunyu Peng, Zhikun Zhou, Kailun Yang, Di Wen, Ruiping Liu

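TL;DR: This paper studies multi-robot cooperative dynamic spatial reasoning. It introduces CoopSR, the first benchmark for the task, and the EgoTeam dataset, and develops the SP-CoR framework, which combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation. SP-CoR substantially improves multimodal large language models on cooperative reasoning across the evaluated benchmarks.

Details

Motivation: Multimodal large language models have made substantial progress in single-view egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. The paper addresses multi-robot cooperative dynamic spatial reasoning, where a model must integrate synchronized egocentric videos from a team of moving robots to answer spatial, temporal, visibility, and coordination questions.

Result: Compared against 22 MLLM baselines, SP-CoR outperforms the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson, and generalizes better to unseen team sizes and real-world robot tests.

Insight: The novelty is the first benchmark and dataset for multi-robot cooperative spatial reasoning, together with the SP-CoR framework. Through dynamics-aware sampling, spectral- and physics-guided view fusion, and prompt distillation, the model benefits from privileged robot-pose supervision during training while needing only egocentric videos at test time, which improves its ability to integrate and reason over multi-view information.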

Abstract: Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, enabling the model to benefit from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Across 22 MLLM baselines, SP-CoR consistently improves cooperative reasoning, outperforming the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson. It also shows stronger generalization to unseen team sizes and real-world robot tests. Code can be found at https://github.com/KPeng9510/seeing-together.git.


[225] PERL: Parameter Efficient Reasoning in CLIP Latent Space cs.CV

Simone Carnemolla, Salvatore Calcagno, Daniela Giordano, Concetto Spampinato, Matteo Pennisi

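TL;DR: This paper proposes PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that adapts CLIP by introducing an iterative reasoning mechanism in its latent space, without a large number of additional parameters. A compact shared reasoning module is applied recurrently across refinement steps, generating and injecting latent reasoning tokens to progressively refine higher-level semantic representations while preserving CLIP's pretrained multimodal structure.

Details

Motivation: The goal is to adapt contrastively trained vision-language models such as CLIP to downstream tasks without degrading their zero-shot generalization. Existing parameter-efficient adaptation methods mainly rely on adding trainable parameters; this paper asks whether adaptation can instead emerge from iterative reasoning on latent representations rather than from parameter count alone.

Result: Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods in a fast-adaptation few-shot setting. With only about 6K trainable parameters (up to 817x fewer than the largest compared approach), it obtains strong novel-class accuracy and competitive transfer performance.

Insight: The novelty is introducing iterative latent reasoning into discriminative vision-language models as an adaptation mechanism complementary to parameter scaling. By recurrently applying a lightweight module for semantic refinement, the method improves adaptation at very low parameter cost and offers a new direction for efficient fine-tuning.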

Abstract: Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP’s pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.


[226] Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI cs.CV

Daiqi Liu, Lukas Mulzer, Md Hasan, Nyvenn de Castro, Fangxu Xing

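TL;DR: This paper proposes a three-stage multimodal learning framework for segmenting vocal tract articulators in real-time MRI (rtMRI). The framework uses speech signals and phonological supervision during training but needs only the rtMRI image at inference. Phonological representations generate spatial bounding-box priors that localize articulators, dual-level cross-modal contrastive pretraining aligns the visual and acoustic encoders, and a cross-attention decoder fuses the learned representations, transferring multimodal knowledge into a single-modality inference pipeline.

Details

Motivation: Vocal tract segmentation in rtMRI is difficult because of low contrast, rapid motion, and limited spatial resolution. Existing methods discard the synchronously acquired audio, and the few multimodal methods that use audio cannot be deployed when audio is unavailable. A method is therefore needed that exploits multimodal supervision during training while requiring only images at inference.

Result: Evaluations on the 75-Speaker Annot-16 and USC-TIMIT datasets show that the method outperforms existing unimodal and multimodal methods, achieving precise and clinically deployable segmentation.

Insight: The novelty is a three-stage framework that uses audio and phonological supervision during training and only images at inference. Phonological priors, cross-modal contrastive pretraining, and cross-attention fusion together transfer multimodal knowledge to single-modality inference, improving segmentation accuracy and clinical practicality.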

Abstract: Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. However, while rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a single-modality inference pipeline. Evaluated on 75-Speaker Annot-16 and USC-TIMIT datasets, our method outperforms existing unimodal and multimodal methods, demonstrating that multimodal supervision provides transferable benefits for precise and clinically deployable vocal tract segmentation.
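
Cross-modal contrastive pretraining of the kind mentioned in the abstract is commonly implemented with a symmetric InfoNCE loss over paired embeddings. The sketch below shows that generic loss at a single level; the paper's "dual-level" design is not described in the abstract and is not reproduced here, and the temperature value is an assumption.

```python
import math

def info_nce(img, aud, temperature=0.1):
    """Symmetric InfoNCE over paired embeddings.

    img[i] and aud[i] are assumed to be L2-normalised embeddings of the
    same time step (a positive pair); all other pairings are negatives.
    """
    n = len(img)
    sim = [
        [sum(a * b for a, b in zip(img[i], aud[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(m):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(m):
            mx = max(row)
            lse = mx + math.log(sum(math.exp(v - mx) for v in row))
            total += lse - row[i]
        return total / len(m)

    sim_t = [list(col) for col in zip(*sim)]  # audio-to-image direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))
```

The loss is small when each image embedding is closest to its own audio embedding and large when pairs are mismatched, which is what pulls the two encoders into a shared space before the audio branch is dropped at inference.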


[227] InstructAV2AV: Instruction-Guided Audio-Video Joint Editing cs.CV

Haojie Zheng, Yixin Yang, Siqi Yang, Shuchen Weng, Boxin Shi

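TL;DR: This paper proposes InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. It builds the large-scale InsAVE-80K dataset and adapts an audio-video generation backbone, combining source-context anchoring, a gated attention mechanism, and a two-stage training strategy to edit a video and its accompanying audio in sync.

Details

Motivation: Existing diffusion-based video editing methods usually ignore the accompanying audio, so the edited video and the audio become disjointed. The paper addresses joint audio-video editing, modifying video and audio content together according to natural-language instructions.

Result: On two evaluation sets, InstructAV2AV outperforms state-of-the-art methods across 11 metrics spanning three aspects, showing its potential for controllable content creation.

Insight: The novelties are: 1) InsAVE-80K, the first large-scale, high-quality audio-video editing dataset; 2) source-context anchoring and source-instruction gated attention, which improve instruction following and content preservation; 3) a two-stage training strategy that effectively transfers pretrained priors. Formalizing joint audio-video editing and building a matching dataset is a key step for the field.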

Abstract: Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive experiments demonstrate that InstructAV2AV outperforms state-of-the-art methods across 11 metrics spanning three aspects on two evaluation sets, highlighting its potential for controllable content creation. Project page: https://hjzheng.net/projects/InstructAV2AV/.


[228] Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification cs.CV | cs.AI | cs.LG

Farnaz Kheiri, Shahryar Rahnamayan, Masoud Makrehchi

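TL;DR: This study examines how much diagnostic power color features alone (with morphological information excluded) have in cancer classification. It extracts statistical color moments and discretized RGB/HSV color histograms and evaluates classical machine learning classifiers across ten different experimental settings.

Details

Motivation: The paper addresses a basic question in histopathology: can color features, treated as raw statistical information independent of morphological cues, support cancer classification on their own?

Result: Color features reach classification accuracies of up to 89% on binary diagnostic tasks (for example, benign vs. malignant), substantially outperforming random baselines, which indicates that raw color distributions encode a non-random, diagnostically relevant signal.

Insight: The novelty is a systematic quantification of the standalone discriminative power of global color features, showing that simple, computationally efficient color features can serve as an effective pre-screening tool and reduce the computational burden on complex deep learning architectures.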

Abstract: In histopathology, human experts primarily rely on color as a means of enhancing contrast to interpret tissue morphology, whereas machine vision models process color as raw statistical information. This distinction raises a fundamental question: to what extent can pixel intensity alone, independent of structural and morphological cues, support cancer classification? To address this question, we systematically evaluated the standalone discriminative power of global color features while deliberately excluding all morphological information. Specifically, we extracted statistical color moments and discretized RGB and HSV color histograms, and assessed their performance across ten diverse experimental settings using classical machine learning classifiers. Our results demonstrate that color features alone can achieve strong performance in binary diagnostic tasks (e.g., benign versus malignant), with classification accuracies reaching up to 89%. This performance is likely attributable to global chromatic shifts associated with malignancy. Importantly, these simple color-based representations consistently outperformed random baselines by a substantial margin, indicating that raw color distributions encode a non-random and diagnostically relevant signal for cancer detection. Consequently, this study suggests that simple, computationally efficient color features can serve as an effective pre-screening tool. By identifying samples with strong chromatic indicators of malignancy, these lightweight models could function as a first-pass triage system, reducing the computational burden on complex deep learning architectures.
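
The two feature families named in the abstract, statistical color moments and discretized color histograms, are simple enough to sketch in pure Python. The choice of three moments (mean, standard deviation, skewness), the bin count, and the RGB-only input are assumptions for illustration; the abstract does not give the paper's exact settings.

```python
def color_moments(channel):
    """Mean, standard deviation, and skewness of one channel's values."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = var ** 0.5
    third = sum((v - mean) ** 3 for v in channel) / n
    skew = third / std ** 3 if std > 0 else 0.0
    return mean, std, skew

def histogram(channel, bins=8, max_value=256):
    """Normalised histogram of one channel with `bins` equal-width bins."""
    counts = [0] * bins
    for v in channel:
        counts[min(int(v * bins / max_value), bins - 1)] += 1
    return [c / len(channel) for c in counts]

def color_features(pixels, bins=8):
    """Global color descriptor for an image.

    pixels: list of (r, g, b) tuples with values in 0..255. Spatial layout
    is deliberately discarded, so no morphology is encoded.
    """
    feats = []
    for c in range(3):
        channel = [p[c] for p in pixels]
        feats.extend(color_moments(channel))
        feats.extend(histogram(channel, bins))
    return feats
```

The resulting vector (33 values for 8 bins) can be fed to any classical classifier. HSV variants could be obtained by converting each pixel with the standard-library `colorsys.rgb_to_hsv` (after scaling to 0..1) before computing the same statistics.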


[229] Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation cs.CV

Jingyun Fu, Zhiyu Xiang, Na Zhao

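TL;DR: This paper proposes a weakly supervised cross-modal learning method for 4D radar scene flow estimation. Using only images and odometry as auxiliary supervision, it introduces instance-aware self-supervised losses and a rigid static loss, overcoming the reliance of prior methods on costly LiDAR and the weak results of pure self-supervision. Experiments on the real-world View-of-Delft dataset show that it surpasses existing LiDAR-dependent cross-modal supervised methods as well as fully supervised scene flow estimation methods.

Details

Motivation: Ground-truth annotations for 4D radar scene flow estimation are hard to obtain. Existing self-supervised methods perform poorly because radar measurements have low fidelity, while cross-modal supervised methods depend on costly LiDAR and complex multi-task architectures.

Result: Extensive experiments on the real-world View-of-Delft dataset show that the method surpasses state-of-the-art cross-modal supervised methods that rely on 3D multi-object tracking over dense LiDAR point clouds, and also outperforms existing fully supervised scene flow estimation methods.

Insight: The novelty is a task-specific iterative framework that uses only images and odometry for weak supervision. Off-the-shelf 2D tracking and segmentation algorithms provide tracked instance masks, which are back-projected into 3D space to give instance-level semantic guidance; for static regions, vehicle odometry is combined with radar's intrinsic motion cues to build a rigid static loss. The method uses low-cost sensors (camera, odometry) and mature 2D algorithms to strengthen radar 3D perception, making it an efficient and practical weakly supervised paradigm.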

Abstract: Due to the difficulty of obtaining ground-truth data for 4D radar scene flow estimation, previous methods typically rely on either self-supervised losses or cross-modal supervision using 3D LiDAR data, 2D images, and odometry. However, self-supervised approaches often yield suboptimal results due to radar’s inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architectures and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework for weakly supervised radar scene flow learning, using only images and odometry for auxiliary supervision during training. Specifically, we establish two novel instance-aware self-supervised losses by exploiting off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks, which are back-projected into 3D space to provide instance-level semantic guidance; for static regions, we integrate vehicle odometry with radar’s intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset demonstrate that our method not only surpasses state-of-the-art cross-modal supervised approaches that rely on 3D multi-object tracking on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The code is open-sourced at https://github.com/FuJingyun/IterFlow.
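
The odometry part of the rigid static loss rests on a simple geometric fact: a static point's apparent flow is fully determined by the ego-motion. The sketch below illustrates that idea only. It assumes a planar yaw-plus-translation transform that maps coordinates from the sensor frame at time t to the sensor frame at time t+1; the radar's intrinsic motion cues (for example, radial velocity) and the paper's actual loss form are not modeled.

```python
import math

def rigid_flow(point, yaw, translation):
    """Flow of a static point induced purely by ego-motion.

    (yaw, translation) is assumed to map coordinates in the sensor frame
    at time t to coordinates in the sensor frame at time t+1.
    """
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    nx = c * x - s * y + translation[0]
    ny = s * x + c * y + translation[1]
    nz = z + translation[2]
    return (nx - x, ny - y, nz - z)

def static_loss(points, pred_flows, static_mask, yaw, translation):
    """Mean Euclidean error between predicted and ego-induced flow,
    averaged over the points flagged as static."""
    errs = []
    for p, f, is_static in zip(points, pred_flows, static_mask):
        if is_static:
            errs.append(math.dist(f, rigid_flow(p, yaw, translation)))
    return sum(errs) / len(errs) if errs else 0.0
```

For example, if the vehicle drives 1 m straight ahead along x, static points shift by -1 m in x in the sensor frame, so a prediction of (-1, 0, 0) on a static point incurs zero loss while a prediction of (0, 0, 0) incurs an error of 1.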


[230] LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift cs.CV

Haozhe Si, Yuxuan Wan, Yuqing Wang, Minh Do, Han Zhao

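TL;DR: This paper proposes LESSViT, a sensor-flexible architecture that addresses the difficulty of cross-sensor generalization in hyperspectral imagery (HSI) modeling, which arises from variations in wavelength coverage, band sampling, and channel dimensionality. It is built on Low-rank Efficient Spatial-Spectral attention (LESS Attention), which models joint spatial-spectral interactions through separable spatial and spectral components and significantly reduces computational complexity. Channel-agnostic patch embedding and wavelength-aware positional encoding support flexible spectral inputs. The paper also proposes a hyperspectral masked autoencoder (HyperMAE) for efficient and robust pretraining.

Details

Motivation: When handling hyperspectral images from different sensors, existing Vision Transformer (ViT) approaches either rely on implicit spectral modeling with fixed channel assumptions or use explicit spatial-spectral attention at prohibitive computational cost, which creates a fundamental trade-off between efficiency and expressiveness. A sensor-flexible architecture is therefore needed for cross-spectral generalization.

Result: The method is evaluated on the SpectralEarth benchmark under a cross-spectral generalization setting that simulates cross-sensor variability. Experiments show that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution.

Insight: The core novelty is LESS Attention, a structured low-rank factorized attention that reduces the complexity of full spatial-spectral attention from O(N^2 C^2) to O(rNC), enabling efficient yet explicit spatial-spectral modeling. Together with channel-agnostic patch embedding, wavelength-aware positional encoding, and HyperMAE pretraining (with decoupled spatial-spectral masking and hierarchical channel sampling), it forms a scalable and generalizable framework for hyperspectral representation learning.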

Abstract: Modeling hyperspectral imagery (HSI) across different sensors presents a fundamental challenge due to variations in wavelength coverage, band sampling, and channel dimensionality. As a result, models trained under a fixed spectral configuration often fail to generalize to other sensors. Existing Vision Transformer (ViT) approaches either rely on implicit spectral modeling with fixed channel assumptions or adopt explicit spatial-spectral attention with prohibitive computational cost, leading to a fundamental trade-off between efficiency and expressiveness. In this work, we introduce Low-rank Efficient Spatial-Spectral ViT (LESSViT), a sensor-flexible architecture for cross-spectral generalization. LESSViT is built on LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components, reducing the complexity of full spatial-spectral attention from $O(N^2 C^2)$ to $O(rNC)$, where $N$ is the number of spatial tokens, $C$ is the number of spectral channels, and $r$ is the rank of the low-rank approximation. We further incorporate channel-agnostic patch embedding and wavelength-aware positional encoding to support flexible spectral inputs. To enable efficient and robust pretraining, we introduce a hyperspectral masked autoencoder (HyperMAE) with decoupled spatial-spectral masking and hierarchical channel sampling. We evaluate LESSViT under a cross-spectral generalization setting that simulates cross-sensor variability. Experiments on the SpectralEarth benchmark demonstrate that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution, and that explicit and efficient spatial-spectral modeling is essential for scalable and generalizable hyperspectral representation learning.
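
To give a feel for the complexity reduction quoted above, the following arithmetic sketch plugs illustrative values into the two asymptotic expressions. The values of N, C, and r are assumptions chosen for the example, not settings from the paper, and constant factors and the embedding dimension are ignored.

```python
def full_attention_cost(n_tokens, n_channels):
    """Order-of-magnitude cost of attention over all N*C spatial-spectral tokens."""
    return (n_tokens * n_channels) ** 2

def low_rank_cost(n_tokens, n_channels, rank):
    """Order-of-magnitude cost quoted for the rank-r factorization."""
    return rank * n_tokens * n_channels

# Illustrative values only: a 14x14 patch grid, 200 bands, rank 8.
N, C, r = 196, 200, 8
full = full_attention_cost(N, C)
low = low_rank_cost(N, C, r)
print(full, low, full // low)  # 1536640000 313600 4900
```

The ratio simplifies to NC / r, so the saving grows linearly with both the number of spatial tokens and the number of spectral channels, which is why the gap matters most for sensors with many bands.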


[231] OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding cs.CV

Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao

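TL;DR: This paper presents OmniPro, the first benchmark for comprehensively evaluating omni-modal proactive streaming video understanding. It contains 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, and introduces a dual-mode evaluation protocol (Probe mode and Online mode) that separately assesses content understanding and proactive responding. An evaluation of 11 representative models reveals three key findings: audio utilization varies widely across models, long-horizon robustness is limited, and non-speech audio perception is the weakest dimension of current models.

Details

Motivation: Existing video understanding benchmarks rely primarily on visual signals, adopt polling or fixed-timestamp protocols rather than true proactive evaluation, and cover only a limited range of tasks, so they cannot reliably assess or differentiate omni-modal proactive streaming models.

Result: Eleven representative models were evaluated on OmniPro. Audio yields consistent gains, but its utilization varies widely across models; performance degrades significantly over time, indicating limited long-horizon robustness; and non-speech audio perception remains the weakest dimension.

Insight: The novelty lies in building the first benchmark that jointly evaluates omni-modal perception, proactive responding, and diverse video understanding tasks. Samples annotated with modality-isolation labels and the dual-mode evaluation protocol together provide a systematic framework for fine-grained multimodal analysis and for diagnosing model capabilities.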

Abstract: Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.


[232] StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video cs.CV | cs.AI

Huajian Zeng, Chaohua Yao, Yuantai Zhang, Jiaqi Yang, Rolandos Alexandros Potamias

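TL;DR: This paper proposes StableHand, a quality-aware flow-matching framework for estimating the world-space 4D motion of both hands from egocentric video. It decomposes the quality of observations from an off-the-shelf hand pose estimator into four channels (wrist global translation and finger articulation for each hand) and integrates these signals into the flow-matching process so that high-quality observations are preserved and unreliable ones are reconstructed.

Details

Motivation: Recovering the world-space 4D motion of two interacting hands from egocentric video is essential for supervising robot policy learning, but it faces two major challenges: head motion causes the hands to leave the camera view frequently and for extended periods, and persistent hand-object interaction causes severe occlusion. Existing methods treat noisy observations uniformly without accounting for their per-frame reliability, which leads to substantial performance degradation.

Result: On HOT3D and ARCTIC, two egocentric benchmarks featuring long missing-hand spans and persistent hand-object occlusions, StableHand achieves state-of-the-art performance on all reported metrics, reducing W-MPJPE by 20-25% compared to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.

Insight: The core insight is that the accuracy of world-space hand motion estimation is tightly coupled with the quality of per-frame observations, which the paper addresses with a unified generative process. The technical contributions are the four-channel decomposition of quality signals and their integration into flow matching through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization.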

Abstract: Recovering world space 4D motion of two interacting hands from egocentric video is a fundamental capability for supervising robot policy learning, where wrist trajectories track the end-effector and finger articulations specify the grasp pose. Two major challenges arise in this setting: hands frequently leave the camera view for extended periods due to head motion, and persistent hand-object interactions cause severe occlusions of one or both hands. Existing methods uniformly condition on noisy hand motion observations without accounting for their per-frame reliability, leading to substantial performance degradation. Our key insight is that accurate world space hand motion estimation is tightly coupled with the quality of per-frame hand observations. To this end, we decompose the quality of hand motion observations extracted from an off-the-shelf hand pose estimator into four channels: wrist global translation and finger articulations for both hands. We propose StableHand, a quality-aware flow-matching framework conditioned on these four-channel quality signals, which are predicted by a learned quality network. We naturally incorporate the quality signals into the flow-matching process through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization. This unified generative process preserves high-quality observations while reconstructing unreliable ones using a learned bimanual motion prior. Experiments on HOT3D and ARCTIC, two egocentric benchmarks featuring long missing-hand spans and persistent hand-object occlusions, show that StableHand achieves state-of-the-art performance across all reported metrics, reducing W-MPJPE by 20-25% compared to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.


[233] Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling cs.CV

Yihang Wu, Yihang Sun, Shaofeng Zhang, Zuxuan Wu, Junchi Yan

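TL;DR: This paper proposes a semantic-spatial decoupled representation for Transformer-based feedforward novel view synthesis (NVS) models, addressing the loss of rendering quality that arises when semantic and spatial information interfere within a shared feature space. By decoupling the representation into separate semantic and spatial branches and adding categorized supervision and a bidirectional modulation mechanism, it improves several feedforward NVS models with almost no additional inference latency.

Details

Motivation: Existing feedforward NVS Transformers (e.g., GS-LRM, LVSM) mix semantic information (e.g., RGB) with Plücker-ray information, which has a lattice-like spatial structure, in a shared feature space. The resulting spatial bias interferes with appearance representation and lowers rendering fidelity.

Result: The proposed decoupled design yields consistent improvements on both decoder-only and encoder-decoder feedforward NVS models, demonstrating its effectiveness.

Insight: The core innovation is explicitly decoupling semantic and spatial representations into separate branches while preserving cross-branch interaction through shared attention routing. Optional categorized supervision (branch-specific training signals) and bidirectional modulation (stronger interaction between the branches) are added on top, and the base decoupled design adds almost no inference latency.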

Abstract: Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design introduces virtually zero additional inference latency due to its architectural design. The proposed designs achieve consistent improvements, demonstrating effectiveness across decoder-only and encoder-decoder feedforward NVS models.
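For readers unfamiliar with the spatial input mentioned above, the sketch below computes the standard 6-D Plücker coordinates of a camera ray: the unit direction d concatenated with the moment o × d. It only illustrates what a per-pixel Plücker ray is; the function name and tuple layout are choices made for this example, and the ordering of direction and moment used by GS-LRM, LVSM, or the proposed model is not specified in the abstract.

```python
import math
from typing import Sequence, Tuple


def plucker_ray(origin: Sequence[float], direction: Sequence[float]) -> Tuple[float, ...]:
    """Return (d, o x d) for a ray with origin o and direction d (normalized here)."""
    norm = math.sqrt(sum(x * x for x in direction))
    dx, dy, dz = (x / norm for x in direction)
    ox, oy, oz = origin
    # Moment vector m = o x d; it is unchanged if o slides along the ray.
    moment = (oy * dz - oz * dy, oz * dx - ox * dz, ox * dy - oy * dx)
    return (dx, dy, dz) + moment


if __name__ == "__main__":
    # Two origins on the same ray give identical Plücker coordinates.
    print(plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)))
    print(plucker_ray((1.0, 0.0, 5.0), (0.0, 0.0, 2.0)))
```

Because these coordinates depend only on camera geometry and pixel position, they vary smoothly across the image grid, which is consistent with the lattice-like structure the abstract attributes to them.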


[234] CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic cs.CV | cs.AI | cs.LG

Shen Lin, Junhao Dong, Rongjie Chen, Xiaoyu Zhang, Li Xu

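TL;DR: This paper proposes CATA, a method for continual machine unlearning in vision-language models (VLMs). It represents each forget request as a task vector and uses conflict-averse aggregation to remove target knowledge effectively, preserve model utility, and prevent knowledge from re-emerging under sequential updates.

Details

Motivation: The large-scale training data of VLMs raises concerns about privacy, copyright, and undesirable content. Existing work focuses mainly on single-shot unlearning, whereas practical deployment must handle sequential forget requests, which motivates the study of continual machine unlearning.

Result: Extensive experiments in both single-shot and continual settings show that CATA outperforms baselines in forgetting effectiveness, model fidelity, and forgetting persistence.

Insight: The novelty lies in the first study of continual unlearning for VLMs and in a method based on conflict-averse task arithmetic, which maintains historical task vectors and applies sign-aware aggregation to suppress conflicting updates so that earlier forgetting effects are maintained.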

Abstract: Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.
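The abstract does not spell out the aggregation rule, so the following is only a minimal sketch of one plausible sign-aware scheme, written over plain Python lists. It is an assumption for illustration, not the authors' algorithm: a task vector is taken to be the parameter difference between an unlearned model and the base model, and any coordinate of a new task vector whose sign opposes the accumulated historical update is zeroed before merging. All function names are hypothetical.

```python
from typing import List


def task_vector(base: List[float], unlearned: List[float]) -> List[float]:
    """Unlearning task vector: parameters after unlearning minus base parameters."""
    return [u - b for b, u in zip(base, unlearned)]


def conflict_averse_merge(history: List[List[float]], new_vec: List[float]) -> List[float]:
    """Add new_vec to the summed history, dropping sign-conflicting coordinates."""
    merged = []
    for i, value in enumerate(new_vec):
        past = sum(vec[i] for vec in history)
        if past * value < 0:  # new update would partly undo earlier forgetting
            value = 0.0
        merged.append(past + value)
    return merged


def apply_vector(base: List[float], vec: List[float], scale: float = 1.0) -> List[float]:
    """Standard task arithmetic: theta = theta_base + scale * vec."""
    return [b + scale * v for b, v in zip(base, vec)]


if __name__ == "__main__":
    history = [[-0.2, 0.1, 0.0]]   # task vector from an earlier forget request
    new_vec = [0.3, 0.2, -0.5]     # first coordinate conflicts with history
    print(conflict_averse_merge(history, new_vec))
```

In this toy run the first coordinate of the new vector is suppressed because it points against the earlier update, while the other two are added unchanged.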


[235] Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models cs.CV

Shangwen Zhu, Qianyu Peng, Zhao Pu, Zhilei Shu, Xiangrui Ke

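TL;DR: The paper presents Incantation, the first interactive video world model with natural-language conditioning at every latent frame (0.25 s), targeting the lack of fine-grained multi-entity control and of cross-entity, cross-world generalization in existing models. It uses natural language as the action interface, pairs a pretrained bidirectional video backbone with frame-local text cross-attention, and achieves real-time long-horizon streaming generation through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache.

Details

Motivation: Existing interactive video world models have high visual fidelity but lack fine-grained multi-entity control and cross-entity, cross-world generalization, because standard control protocols (e.g., animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time.

Result: Incantation surpasses the Action-Index baseline on cross-entity transfer (89% vs. 43%) and on out-of-vocabulary prompts (90% vs. 0%). Its 2-step student model sustains 19.7 FPS at 480p and keeps FVD stable over 2-hour rollouts. The same architecture and training recipe were also applied to The King of Fighters, changing only the per-entity action vocabulary slots.

Insight: The novelty lies in using natural language as a general action interface, which enables simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. The technical contributions are frame-local text cross-attention, ODE-initialized Self-Forcing distillation, and a RoPE-decoupled sliding KV-cache, which together balance expressiveness and real-time performance.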

Abstract: Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
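As a rough illustration of what per-latent-frame conditioning implies for the input side, the sketch below expands a timeline of timestamped text commands into one prompt per 0.25 s latent frame. The 0.25 s frame duration comes from the abstract; the event format, the function name, and the example prompts are hypothetical, and the real system's prompt handling is not described in the source.

```python
from typing import List, Tuple

FRAME_SECONDS = 0.25  # latent-frame duration stated in the abstract


def prompts_per_latent_frame(events: List[Tuple[float, str]], duration_s: float) -> List[str]:
    """Hold each timestamped prompt until the next one; emit one prompt per latent frame."""
    n_frames = int(duration_s / FRAME_SECONDS)
    ordered = sorted(events)
    prompts, current, j = [], "", 0
    for i in range(n_frames):
        t = i * FRAME_SECONDS
        # Advance to the most recent prompt issued at or before this frame's start.
        while j < len(ordered) and ordered[j][0] <= t:
            current = ordered[j][1]
            j += 1
        prompts.append(current)
    return prompts


if __name__ == "__main__":
    # Hypothetical multi-entity commands expressed in plain language.
    timeline = [(0.0, "player rolls left; boss raises sword"),
                (0.5, "player attacks; boss staggers")]
    for frame, prompt in enumerate(prompts_per_latent_frame(timeline, 1.0)):
        print(frame, prompt)
    print(len(prompts_per_latent_frame([], 2 * 60 * 60)), "latent frames in a 2-hour rollout")
```

At four latent frames per second, a 2-hour rollout corresponds to 28,800 conditioning steps, which helps explain why a sliding cache is needed for streaming generation.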


[236] Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth cs.CV

Yuhuan Wu, Cong Wei, Fangzhen Lin, Wenhu Chen, Haozhe Wang

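TL;DR: The paper proposes Starve to Perceive, a training paradigm that addresses "lazy perception" in vision-language models (VLMs) deployed as situated agents. By strictly limiting the visual token budget of each observation, it forces the model to complete tasks through active perception (such as zooming, cropping, and panning), improving performance across diverse benchmarks without additional loss functions or architectural changes.

Details

Motivation: Current VLM training paradigms produce models that imitate the form of active-perception operations without actually depending on their outputs ("lazy perception"). Because a coarse global view combined with language priors already yields moderate accuracy, the model has no incentive to learn harder multi-step visual search.

Result: Across diverse benchmarks, models trained under perceptual starvation achieve an average relative improvement of 5%.

Insight: The core innovation is "starvation" (constraining visual bandwidth), a simple plug-in training strategy that changes the model's learning incentive so that it must master active perception to succeed. This suggests a new way to approach similar "lazy learning" problems.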

Abstract: Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception – the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth – restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes – serving as a minimal, plug-in modification to standard post-training pipelines – models trained under perceptual starvation achieve substantial gains of 5% average relative improvement across diverse benchmarks.
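The effect of a tight token budget can be sketched with simple arithmetic. The example below assumes a ViT-style encoder that emits one token per 14×14 pixel patch and a budget of 256 tokens per observation; both numbers, and the helper names, are assumptions for illustration rather than values from the paper. Under these assumptions a full high-resolution view must be downscaled heavily to fit the budget, while a small crop fits at native resolution, which is why zooming and cropping become necessary.

```python
import math
from typing import Tuple


def num_tokens(width: int, height: int, patch: int = 14) -> int:
    """Visual tokens for an image, assuming one token per patch x patch block."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def fit_to_budget(width: int, height: int, budget: int, patch: int = 14) -> Tuple[int, int]:
    """Downscale (keeping aspect ratio) until the view costs at most `budget` tokens."""
    tokens = num_tokens(width, height, patch)
    scale = 1.0 if tokens <= budget else math.sqrt(budget / tokens)
    while True:
        w, h = max(1, int(width * scale)), max(1, int(height * scale))
        if num_tokens(w, h, patch) <= budget:
            return w, h
        scale *= 0.95  # shrink a little more if rounding pushed us over budget


if __name__ == "__main__":
    budget = 256  # hypothetical per-observation token budget
    print("global view 4096x2160 ->", fit_to_budget(4096, 2160, budget))
    print("crop 224x224          ->", fit_to_budget(224, 224, budget))
```

The global view loses most of its detail once squeezed into the budget, whereas the crop is passed through unchanged, so fine-grained evidence is only reachable through a sequence of targeted views.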


[237] Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging cs.CV

Zhilin Zhu, Yabin Wang, Zhiheng Ma, Yaguang Song, Yaowei Wang

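TL;DR: This paper proposes Dynamic Style Bridging, a forward-facilitation paradigm for continual test-time adaptation (CTTA). Before deployment, it builds a compact knowledge base of generated class exemplars; at test time, a multi-level bridging mechanism dynamically injects the style of incoming data into these proxy samples to provide reliable supervisory signals, enabling stable adaptation under continual distribution shifts.

Details

Motivation: Existing CTTA methods mostly follow a backward-alignment paradigm that rigidly aligns incoming data with supervisory surrogates derived from the source domain, so they perform poorly under unreliable supervision and evolving distribution shifts. This paper aims to overcome these limitations.

Result: Extensive experiments on standard CTTA benchmarks show consistent and substantial improvements over recent state-of-the-art (SOTA) methods.

Insight: The core innovation is the shift from backward alignment to forward facilitation, realized by the Dynamic Style Bridging mechanism. It fuses the style of incoming data into the proxies at the input, statistical, and representation levels while preserving their original semantics, producing high-fidelity proxies that supply reliable, on-demand supervision under continual distribution shift.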

Abstract: Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle dynamic distribution shifts encountered after deployment. Existing methods predominantly follow a backward-alignment paradigm, which rigidly aligns incoming data with supervisory surrogates derived from the source domain. Consequently, they struggle with unreliable supervision and evolving distribution shifts. To overcome these limitations, we introduce a novel forward-facilitation paradigm through a method termed Dynamic Style Bridging. Prior to deployment, we construct a compact knowledge base of generated class exemplars. During test time, to mitigate inherent generative bias and adapt these proxies to incoming data, we propose a multi-level bridging mechanism. This mechanism dynamically injects the proxies with incoming data styles at the input, statistical, and representation levels, while preserving the original semantics of the proxies. These high-fidelity proxies are then used to provide reliable, on-demand supervisory signals, enabling stable adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate that our method achieves consistent and substantial improvements over recent state-of-the-art approaches. Code is available at https://github.com/z1358/DAS.


[238] SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents cs.CV

Wencan Jiang, Jiangning Zhang, Jianbiao Mei, Jinzhuo Liu, Yu Yang

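TL;DR: The paper presents SPIKE, an adaptive dual controller framework for long-horizon multimodal agents in open-world games. A Strategic Controller performs low-frequency global planning and failure recovery, a Reactive Controller handles fast local execution under a strict token budget, an Event Trigger monitors state changes to decide when to switch control modes, and Hierarchical Memory enables experience reuse.

Details

Motivation: Long-horizon agents in open-world games must stay goal-directed under limited token and latency budgets while balancing costly per-step reasoning against reactive execution that tends to drift and repeat failures.

Result: On the Lite-100 split of StarDojo, SPIKE improves success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest baseline and Budgeted SR by 9.3 percentage points (75.6% relative), while reducing token consumption by 54.9% and latency by 40.8%.

Insight: The core innovation is event-triggered selective reasoning: expensive strategic planning is invoked only at state boundaries (such as visual change, task progress, repeated actions, or failure signals) rather than at every step. Combined with hierarchical memory (SA-MB for short-term experience reuse, SA-KG for structured knowledge) and reactive override, this balances cost efficiency and robustness.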

Abstract: Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.
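The event-triggered switching described above can be sketched as a small decision rule. The four monitored signals come from the abstract; the field names, thresholds, and the simple OR-combination are hypothetical choices for this example, since the paper's actual trigger logic is not given here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepSignals:
    visual_change: float      # hypothetical change score in [0, 1]
    subgoal_completed: bool   # task-progress boundary
    repeated_actions: int     # consecutive identical actions
    failure: bool             # explicit failure signal


def should_escalate(s: StepSignals, change_threshold: float = 0.5, repeat_limit: int = 3) -> bool:
    """Return True when control should move from the reactive to the strategic controller."""
    return (
        s.failure
        or s.subgoal_completed
        or s.visual_change >= change_threshold
        or s.repeated_actions >= repeat_limit
    )


def count_strategic_calls(episode: List[StepSignals]) -> int:
    """Number of steps in an episode that would invoke expensive strategic reasoning."""
    return sum(1 for s in episode if should_escalate(s))


if __name__ == "__main__":
    episode = [
        StepSignals(0.9, False, 0, False),  # scene changed: plan
        StepSignals(0.1, False, 1, False),  # stable: stay reactive
        StepSignals(0.1, False, 3, False),  # stuck repeating: re-plan
        StepSignals(0.0, True, 0, False),   # sub-goal done: plan next
    ]
    print(count_strategic_calls(episode), "strategic calls in", len(episode), "steps")
```

In longer, locally stable segments most steps would fall through to the reactive path, which is the source of the token and latency savings the abstract reports.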


[239] CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark cs.CV | cs.AI

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang

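TL;DR: This paper presents CrossView Suite, a comprehensive framework for improving the cross-view spatial intelligence of multimodal large language models (MLLMs). It has three core components: CrossViewSet, a large-scale, high-quality cross-view instruction dataset; CrossViewBench, a benchmark for systematic evaluation; and CrossViewer, a three-stage model that follows a perception-alignment-reasoning paradigm.

Details

Motivation: Progress of MLLMs in cross-view spatial reasoning is limited by three key gaps: the scarcity of large-scale, well-annotated training data; the lack of a systematic evaluation benchmark; and the absence of explicit alignment mechanisms that establish object-level consistency across views.

Result: Experiments show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. CrossViewer was evaluated comprehensively on CrossViewBench, which demonstrated its effectiveness.

Insight: The main contribution is a complete ecosystem of dataset, benchmark, and model (CrossView Suite) that tackles cross-view reasoning systematically. Technical highlights include a multi-agent data engine for building a high-quality dataset and a three-stage reasoning framework with an adaptive spatial region tokenizer and explicit cross-view object alignment, which strengthens the cross-view reasoning of MLLMs.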

Abstract: Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite across three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method employs an adaptive spatial region tokenizer to capture fine-grained object representations, then explicitly aligns objects across views, and finally fuses the aligned features to strengthen the cross-view reasoning capacity of MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.


[240] Leveraging Latent Visual Reasoning in Silence cs.CV

Dongyao Zhu, Zhen Wang, Xi Xiao, Han Jiang, Saeed Vahidian

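TL;DR: This paper asks whether latent visual reasoning is necessary at inference time. It finds that replacing latent tokens with random noise or removing them entirely has little effect on spatial reasoning benchmarks, and that latent generation diminishes further after reinforcement learning. The authors argue that the value of latent visual reasoning should be measured by how effectively it guides learning rather than by whether it persists at inference, and they propose an attention-based reward that encourages latent tokens to interact with later text tokens during RL, improving performance on perception and visual reasoning benchmarks even when latent tokens are rarely generated after training.

Details

Motivation: Existing methods insert continuous latent tokens before text generation so that visual evidence is used more directly in multimodal reasoning, but whether these tokens are necessary at inference is unclear. The authors aim to assess their actual value and to strengthen their role in guiding learning.

Result: On spatial reasoning benchmarks, replacing or removing latent tokens causes little performance degradation. The proposed attention-based reward improves performance on perception and visual reasoning benchmarks even when latent tokens are rarely generated after training, yielding better visual grounding and more accurate textual reasoning.

Insight: The novelty lies in arguing that the value of latent visual reasoning should be judged by its effect on learning rather than by its form at inference, and in designing an attention-based reward that promotes interaction between latent tokens and text. This improves performance while keeping the flexibility of pure-text reasoning, and offers an implicit learning strategy that other multimodal reasoning models could adopt.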

Abstract: Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by these findings, we propose an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL. This reward promotes latent utilization when the latent mode is activated while preserving the flexibility to use pure-text reasoning. Experiments show that our method improves performance across perception and visual reasoning benchmarks, even when latent tokens are rarely generated after post-training. Our results highlight that, without explicit expression at inference, latent visual reasoning can shape better visual grounding and more accurate textual reasoning in silence. Our code and trained models are publicly available on GitHub (https://github.com/ddydyd32/silent-lvr/tree/master) and Hugging Face (https://huggingface.co/collections/cornuHGF/silent-lvr).


[241] Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video cs.CV

Arslan Artykov, Tom Ravaud, Nicolás Violante-Grezzi, Vincent Lepetit

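TL;DR: This paper proposes a category-agnostic optimization framework that treats articulated object understanding as a geometric primitive-fitting problem. It organizes primitives into coherent parts constrained by revolute and prismatic joints and jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. It also introduces a visibility-aware procedure for the partial observations and occlusions found in real-world data, and outperforms existing methods on the newly proposed AiP benchmarks.

Details

Motivation: Recovering the 3D kinematics of articulated objects from monocular video is a fundamental challenge. Existing methods are brittle under severe occlusion, rapid camera motion, or weak local features, and learning-based methods struggle to generalize beyond their training categories.

Result: On the newly proposed AiP-synth and AiP-real benchmarks, which feature significant camera motion and heavy occlusions, the method outperforms existing approaches.

Insight: The paper recasts articulated object understanding as geometric primitive fitting, using primitives as a proxy representation that avoids the pitfalls of unstable point tracks. It introduces a novel mechanism that organizes primitives into coherent, joint-constrained parts and jointly optimizes segmentation and joint parameters, and it adds a visibility-aware procedure to handle partial observations and occlusions in real-world data.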

Abstract: Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods. Project page: https://aartykov.github.io/Articulation-in-Prime/


[242] Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV | cs.AI

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo

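TL;DR: Lance is a lightweight, native unified multimodal model that supports understanding, generation, and editing for both images and videos. It explores a practical paradigm for unified multimodal modeling through collaborative multi-task training, built on unified context modeling and decoupled capability pathways. Experiments show that Lance substantially outperforms existing open-source unified models in image and video generation while retaining strong multimodal understanding.

Details

Motivation: To explore a practical paradigm for unified multimodal modeling that does not rely on scaling model capacity or on text-image-dominant designs, while supporting understanding, generation, and editing of images and videos.

Result: On image and video generation tasks, Lance substantially outperforms existing open-source unified models while retaining strong multimodal understanding capabilities.

Insight: The paper proposes two core principles, unified context modeling and decoupled capability pathways, and adopts a dual-stream mixture-of-experts architecture, modality-aware rotary positional encoding, and a staged multi-task training paradigm to strengthen semantic understanding and visual generation while reducing interference among heterogeneous visual tokens.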

Abstract: We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.


[243] CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation cs.CV

Rajeev Goel, Jason Ding, Phani Harish Wajjala, Pavan Turaga, Tejaswi Gowda

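TL;DR: CMAG is a concept-scaffolded retrieval and verified composition framework for avatar generation in metaverse marketplaces. It synthesizes a 3D concept scaffold to disambiguate the text prompt, combines view-aware part discovery, a taxonomy router, and a hybrid retriever, and finally uses an agentic vision-language model for iterative verification, assembling prompt-faithful, topologically consistent avatars from catalog assets.

Details

Motivation: Under strict category and topology constraints, generating avatars through text-only retrieval suffers from ambiguity, noisy metadata, stylistic inconsistency, and geometric incompatibility.

Result: Evaluated on diverse compositional prompts, CMAG shows improved retrieval robustness and compositional correctness compared to strong baselines, highlighting the importance of 3D concept scaffolding under prompt ambiguity.

Insight: The novelty lies in introducing a 3D concept scaffold as an intermediate representation for global disambiguation, combined with prompt decomposition, taxonomy-based hybrid retrieval, and an agent-driven iterative verification loop to handle the complex constraints of marketplace settings.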

Abstract: Metaverse platforms rely on creator-driven marketplaces where avatars are assembled from discrete, taxonomy-labeled 3D assets (e.g., tops, bottoms, shoes, accessories) under strict category and topology constraints. While users increasingly expect free-form text control, text-only retrieval is brittle: natural language is ambiguous with respect to platform taxonomies, metadata is often noisy or informal, and independently retrieved components can be stylistically inconsistent or geometrically incompatible. We propose CMAG, a concept-scaffolded retrieval and verified composition framework for marketplace avatar generation. Given a prompt, CMAG first synthesizes an intermediate 3D concept scaffold that disambiguates intent beyond text by providing global spatial and stylistic context. In parallel, a view-aware part discovery module extracts localized visual evidence via prompt decomposition and text-grounded segmentation. A prompt-conditioned taxonomy router enforces category coverage and resolves semantic-to-taxonomic mismatch, after which a hybrid category-wise retriever combines part-based fusion with a concept-residual fallback using feature suppression. Finally, an agentic vision–language model filters and re-ranks candidates across categories and drives an iterative verification loop to assemble prompt-faithful, topologically consistent avatars from catalog assets. We evaluate CMAG on diverse compositional prompts and demonstrate improved retrieval robustness and compositional correctness compared to strong baselines, highlighting the importance of 3D concept scaffolding under prompt ambiguity.


[244] MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents cs.CV

Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris

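TL;DR: MementoGUI is a plug-in agentic memory framework designed to improve MLLM-based GUI agents on long-horizon tasks. Its MementoCore controller selects, compresses, and retrieves memory online, treating long-horizon GUI control as an online memory-control problem: working memory preserves task-relevant interface events, and episodic memory retrieves reusable past trajectories.

Details

Motivation: Existing GUI agents are brittle on long-horizon tasks that require maintaining task state across many interface transitions. They typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions.

Result: Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, and that larger MementoCore backbones further strengthen memory-augmented GUI control.

Insight: The novelty lies in formulating long-horizon GUI control as an online memory-control problem and in the modular MementoCore controller, which enables plug-in memory augmentation without finetuning the GUI agent backbone. The paper also develops a scalable data curation pipeline and the MementoGUI-Bench benchmark, and introduces MLLM-based metrics for semantic action matching, task progress, and memory consistency.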

Abstract: Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.


[245] Semantic Generative Tuning for Unified Multimodal Models cs.CV | cs.AI

Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li

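TL;DR: This paper proposes Semantic Generative Tuning (SGT), a new paradigm that addresses the representation misalignment between visual understanding and visual generation in unified multimodal models. It uses high-level semantic tasks, image segmentation in particular, as generative proxies to align and synergize multimodal capabilities, improving both understanding and generation.

Details

Motivation: Current training paradigms for unified multimodal models usually decouple visual understanding (optimized through sparse text signals) from visual generation (optimized through dense pixel objectives). This misaligns the representation spaces and prevents the two capabilities from reinforcing each other. The paper aims to bridge this isolation through generative post-training.

Result: Extensive evaluations on mainstream benchmarks show that SGT consistently improves both multimodal comprehension and generative fidelity. Mechanistic analyses further show that SGT improves the linear separability of features and the allocation of visual-textual attention.

Insight: The core contribution is the first systematic study of generative post-training, which finds that high-level semantic tasks such as image segmentation are better generative proxies than low-level tasks: they provide structural semantics rather than texture details and enhance both perception and layout fidelity. This offers a new approach to synergistic training of unified multimodal models.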

Abstract: Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.


[246] SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training cs.CVPDF

Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal

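TL;DR: This paper proposes SafeDiffusion-R1, an online reinforcement learning framework that steers diffusion models toward safe content during post-training. Through on-policy optimization, it builds a steering reward from an inherent property of CLIP embeddings, needs neither a dedicated reward model nor paired supervised data, and addresses both data scarcity and model degradation.

Details

Motivation: Existing methods depend on expensive supervised data (unsafe-text/safe-image pairs or positive/negative image pairs) and are hard to scale, while offline reinforcement learning and supervised fine-tuning degrade generation quality through catastrophic forgetting. The paper targets data scarcity and model degradation in diffusion post-training.

Result: Experiments show that the method lowers the share of inappropriate content from 48.9% for SD v1.4 to 18.07%, cuts nudity detections from 646 (baseline) to 15, and raises compositional generation quality on GenEval from 42.08% to 47.83%. It achieves state-of-the-art performance on out-of-domain unsafe prompts across seven harm categories.

Insight: The contribution is an online reinforcement learning framework with a CLIP-embedding-based steering reward that requires no specialized reward model and no paired supervision. The core idea exploits the geometry of the CLIP embedding space: text representations are steered toward safe directions and away from unsafe ones, which gives an efficient and scalable form of safety control.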

Abstract: Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe-text paired with safe-image groundtruth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
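
The two ingredients named in this abstract, a direction-based reward in embedding space and GRPO's group-relative normalization, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the reward form (cosine similarity to a safe direction minus cosine similarity to an unsafe direction) and the function names `steering_reward` and `group_relative_advantages` are assumptions. Plain Python lists stand in for CLIP embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def steering_reward(sample_emb, safe_dir, unsafe_dir):
    # Assumed reward shape: higher when the embedding aligns with the
    # safe direction and lower when it aligns with the unsafe one.
    return cosine(sample_emb, safe_dir) - cosine(sample_emb, unsafe_dir)

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style normalization: each sample in a group is scored
    # relative to the group's mean and standard deviation.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are computed within a group of samples for the same prompt, no learned value function or separately trained reward model is needed in this sketch, which matches the motivation stated in the abstract.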


[247] Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory cs.CVPDF

Jinzhuo Liu, Jiangning Zhang, Wencan Jiang, Yabiao Wang, Dingkang Liang

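TL;DR: This paper proposes IAMFlow, a training-free identity-aware memory framework that addresses long-term inconsistency and memory degradation in autoregressive video generation. An LLM extracts entities from each prompt and assigns global IDs, while a VLM asynchronously verifies attributes in rendered frames, which yields explicit entity tracking. The paper also introduces NarraStream-Bench, a benchmark for narrative streaming video generation, on which IAMFlow achieves the best performance.

Details

Motivation: When prompts evolve, existing autoregressive video generation methods either compress historical frames or retrieve keyframes through implicit attention. This often causes identity drift, character duplication, and attribute loss, so evolving entities are not tracked reliably.

Result: On the proposed NarraStream-Bench, IAMFlow achieves the best overall performance, exceeding the strongest baseline by 2.56 points, and runs 1.39x faster than the most efficient baseline in the 60-second multi-prompt setting.

Insight: The contributions are a training-free framework for explicit entity identity modeling and tracking, asynchronous verification that combines an LLM and a VLM, and a systematic inference acceleration pipeline. Reusable ideas include pairing entity-aware memory with multimodal models to improve long-term consistency and building a dedicated benchmark for multi-dimensional evaluation.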

Abstract: Autoregressive video generation has improved rapidly in visual fidelity and interactivity, but it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames using predefined strategies or retrieve keyframes based on coarse implicit attention signals, both of which fail to handle evolving prompts with shifting entity references, leading to identity drift, character duplication, and attribute loss. To address this, we propose IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. Specifically, an LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. To keep the proposed framework computationally practical, we design a systematic inference acceleration pipeline, including asynchronous visual verification, adaptive prompt transition, and model quantization, which achieves faster generation than existing baselines. Furthermore, we introduce NarraStream-Bench, a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates both traditional metrics and multimodal large language model-based assessments. Extensive experiments show that IAMFlow, despite being training-free, achieves the best overall performance on NarraStream-Bench, outperforming the strongest baseline by 2.56 points, while achieving a 1.39x speedup over the most efficient baseline in the 60-second multi-prompt setting.
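
The idea of assigning persistent global IDs to entities across prompts can be pictured with a small registry. This is only a sketch under stated assumptions: the class `EntityMemory`, its fields, and the name-normalization rule are hypothetical, and the LLM extraction and VLM verification steps described in the abstract are replaced by plain function arguments.

```python
from dataclasses import dataclass, field

@dataclass
class EntityMemory:
    """Hypothetical identity-aware memory: one global ID per entity name."""
    next_id: int = 0
    ids: dict = field(default_factory=dict)         # normalized name -> global ID
    attributes: dict = field(default_factory=dict)  # global ID -> attribute dict

    def register(self, name, attrs):
        # Normalize the name so repeated mentions map to the same identity.
        key = name.strip().lower()
        if key not in self.ids:
            self.ids[key] = self.next_id
            self.attributes[self.next_id] = {}
            self.next_id += 1
        gid = self.ids[key]
        # Later prompts (or a verifier) may refine attributes of a known entity.
        self.attributes[gid].update(attrs)
        return gid
```

A second mention of the same entity in a later prompt returns the existing ID and merges any new attributes, which is the behavior that would prevent character duplication in this simplified picture.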


[248] Spectral Progressive Diffusion for Efficient Image and Video Generation cs.CVPDF

Howard Xiao, Brian Chao, Lior Yariv, Gordon Wetzstein

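TL;DR: This paper proposes Spectral Progressive Diffusion, a general framework for efficient image and video generation that builds on the order in which pretrained diffusion models generate frequency components during denoising. With a spectral noise expansion mechanism and an optimal resolution schedule derived from the model's power spectrum, it speeds up generation substantially while preserving visual quality.

Details

Motivation: Diffusion models implicitly generate visual content autoregressively in the frequency domain: low-frequency components are produced early in denoising, and high-frequency details emerge only later. Exploiting this property avoids high-resolution computation on noise-dominated high frequencies and makes generation more efficient.

Result: Experiments show significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.

Insight: The contribution is a framework that supports training-free acceleration, together with a power-spectrum-based optimal resolution schedule and a novel fine-tuning recipe that further improves efficiency and quality. This offers a new approach to efficient generation with diffusion models.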

Abstract: Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later timesteps. This structure offers a natural opportunity for efficient generation, as high-resolution computation on noise-dominated frequencies is largely redundant. We propose Spectral Progressive Diffusion, a general framework that progressively grows resolution along the denoising trajectory of pretrained diffusion models. To this end, we develop a spectral noise expansion mechanism and derive an optimal resolution schedule from the model’s power spectrum. Our framework supports training-free acceleration and a novel fine-tuning recipe that further improves efficiency and quality. We demonstrate significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.


[249] EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos cs.CVPDF

Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen, Shaofang Quan

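TL;DR: This paper introduces EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized first-person (egocentric) and third-person (exocentric) videos. It contains 2,600 high-quality multiple-choice questions across eight temporal, spatial, and cross-view QA types. For dual-view retrieval, the authors propose E^2-Select, a training-free frame selection method that combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that the two views provide complementary memory cues, while existing multimodal large language models (MLLMs) are far from solving the benchmark: the best model reaches only 55.3%. E^2-Select achieves state-of-the-art performance of 58.2% over frame-selection and retrieval-augmented generation (RAG) based memory baselines.

Details

Motivation: Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field (first-person) and observer (third-person) perspectives, the authors introduce a synchronized dual-view video benchmark to advance research on cross-view memory reasoning.

Result: On EgoExoMem, the best existing MLLM reaches only 55.3% accuracy. The proposed E^2-Select achieves state-of-the-art performance of 58.2% over frame-selection and RAG-based memory baselines.

Insight: The main contributions are: 1) EgoExoMem, the first cross-view memory reasoning benchmark over synchronized ego-exo videos, with eight QA types; 2) the training-free E^2-Select frame selection method, which combines budget allocation and k-DPP sampling to handle view asymmetry and cross-view consistency; 3) the finding that question framing and answer grounding show systematic view-preference conflicts, which underscores the novelty and difficulty of cross-view reasoning.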

Abstract: Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains 2.6K high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E^2-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only 55.3%. E^2-Select achieves state-of-the-art performance of 58.2% over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.
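
Relevance-based budget allocation across the two views could look roughly like the sketch below. It assumes the frame budget is split in proportion to each view's total query relevance, which the abstract does not specify. It also swaps the per-view k-DPP sampling for a plain top-k pick, so the diversity term of the actual method is not modeled. The function names are hypothetical.

```python
def allocate_budget(ego_scores, exo_scores, budget):
    """Split a frame budget between views in proportion to total relevance."""
    ego_mass, exo_mass = sum(ego_scores), sum(exo_scores)
    total = ego_mass + exo_mass
    if total == 0:
        ego_k = budget // 2  # no signal: split evenly
    else:
        ego_k = round(budget * ego_mass / total)
    # Never request more frames than a view has, or more than the budget.
    ego_k = max(0, min(ego_k, len(ego_scores), budget))
    exo_k = min(budget - ego_k, len(exo_scores))
    return ego_k, exo_k

def top_k_frames(scores, k):
    # Stand-in for k-DPP sampling: keep the k most relevant frames,
    # returned in temporal order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```

With this split, a question whose evidence is mostly visible in the egocentric stream draws most of its frames from that stream, which is one simple way to respond to the view asymmetry the abstract mentions.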


[250] LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation cs.CV | cs.DCPDF

Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang

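TL;DR: LongLive-2.0 is a parallel infrastructure based on NVFP4 (NVIDIA FP4) precision that covers the full training and inference workflow of long video generation. Through sequence-parallel autoregressive training, NVFP4 computation, KV cache quantization, and asynchronous VAE decoding, it addresses the speed and memory bottlenecks of long video generation and delivers substantial speedups in both training and inference.

Details

Motivation: In long video generation, compute (GEMM) and memory costs grow sharply with video length. In addition, existing methods such as the Self-Forcing series rely on complex initialization and distillation stages, which makes the training pipeline less clean and efficient.

Result: Experiments show up to 2.15x speedup in training and up to 1.84x in inference. The LongLive-2.0-5B model reaches 45.7 FPS at inference while maintaining strong benchmark performance.

Insight: The main contributions are: 1) co-design of sequence-parallel autoregressive training (Balanced SP) with the teacher-forcing layout, which enables natural chunked VAE encoding; 2) NVFP4 low-precision computation across training and inference, combined with KV cache quantization, for large memory savings; 3) a clean training pipeline that tunes a diffusion model directly into a long-video autoregressive diffusion model, without ODE initialization or distribution matching distillation; 4) inference optimizations tailored to different GPU architectures (such as Blackwell), including asynchronous streaming VAE decoding.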

Abstract: We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
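
To make the idea of 4-bit block-scaled quantization concrete, the sketch below rounds a block of values onto the magnitudes representable in a 4-bit E2M1 float (0, 0.5, 1, 1.5, 2, 3, 4, 6) using one scale per block. It is a simplified illustration, not the LongLive-2.0 kernels. The scale here is kept as an ordinary Python float, whereas NVFP4 stores block scales in a low-precision format, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 floating-point value.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Map each value to the nearest E2M1 magnitude after per-block scaling."""
    amax = max(abs(x) for x in block)
    # Scale so the largest magnitude lands on the largest grid value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return codes, scale

def dequantize_block(codes, scale):
    # Reconstruction is just the grid value times the block scale.
    return [c * scale for c in codes]
```

The same round trip applies to weights, activations, or KV cache entries in this simplified view. The memory saving comes from storing a 4-bit code per value plus one shared scale per block.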


[251] Aurora: Unified Video Editing with a Tool-Using Agent cs.CVPDF

Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua

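TL;DR: This paper presents Aurora, a unified video editing framework built around a tool-using agent. A tool-augmented vision-language model (VLM) agent parses the raw user request into a structured edit plan aligned with the input format of the underlying unified video diffusion transformer, resolving textual and visual underspecification before generation.

Details

Motivation: Existing unified video editing models are flexible by design, but they assume that users already supply precise text, reference images, and spatial grounding in a form the model can consume directly. Real user requests are often incomplete or ambiguous, which limits the practical usefulness of these models.

Result: Experiments on the proposed AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that its VLM agent transfers to other compatible frozen video editing models.

Insight: The core contribution is a tool-using VLM agent that acts as a front end, turning ambiguous user requests into structured, model-ready edit plans and closing the gap between user intent and model input. Decoupling the agent from the generative model improves robustness and transferability.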

Abstract: Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer’s conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: https://yeates.github.io/Aurora-Page


cs.SD [Back]

[252] SIREM: Speech-Informed MRI Reconstruction with Learned Sampling cs.SD | cs.CL | cs.CV | cs.LG | physics.med-phPDF

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter

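TL;DR: SIREM is a speech-informed real-time MRI reconstruction framework that uses synchronized speech as a cross-modal prior to improve reconstructions of dynamic vocal-tract motion during speech production. It models each MRI frame as a fusion of an audio-driven component and an MRI-driven component, and introduces a learnable soft weighting profile over spiral arms, giving a unified multimodal reconstruction.

Details

Motivation: Real-time MRI for speech production research faces trade-offs among spatial resolution, temporal resolution, and acquisition speed, which lead to undersampled k-space and degraded reconstructions. The paper exploits the correlation between the speech signal and vocal-tract configuration to improve reconstruction through a cross-modal prior.

Result: On the USC speech real-time MRI benchmark, compared with gridding, wavelet-based compressed sensing, and total variation baselines, SIREM reconstructs anatomically plausible vocal-tract structure at higher throughput and establishes an initial benchmark for multimodal speech-informed reconstruction.

Insight: The contribution lies in treating synchronized speech as a cross-modal prior that can predict part of the image content, fusing the audio and MRI branches through a learnable spatial weighting map, and introducing a differentiable weighting profile over spiral arms so that sampling and reconstruction can be optimized jointly. This suggests a new direction for fast multimodal medical imaging.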

Abstract: Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM
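
The per-frame fusion through a spatial weighting map can be illustrated with a pixelwise blend. The convex-combination form used here, weight w for the audio-driven prediction and 1 - w for the MRI-driven reconstruction, is an assumption. The abstract states only that the two components are fused through a spatial weighting map. `fuse_frame` is a hypothetical name and nested lists stand in for image arrays.

```python
def fuse_frame(audio_pred, mri_recon, weight_map):
    """Blend two same-sized 2D images pixel by pixel.

    weight_map values are assumed to lie in [0, 1]: 1 trusts the
    audio-driven prediction fully, 0 trusts the MRI-driven reconstruction.
    """
    fused = []
    for a_row, m_row, w_row in zip(audio_pred, mri_recon, weight_map):
        fused.append([w * a + (1.0 - w) * m
                      for a, m, w in zip(a_row, m_row, w_row)])
    return fused
```

In this picture, regions dominated by articulators whose shape is predictable from speech would receive weights near 1, while the remaining anatomy would rely on measured k-space data.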


[253] WavFlow: Audio Generation in Waveform Space cs.SD | cs.CVPDF

Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu

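TL;DR: WavFlow is a framework that generates high-fidelity audio directly in raw waveform space. It overcomes the difficulty of modeling high-dimensional, low-energy signals through waveform patchify and amplitude lifting, and is trained on large-scale video-text-audio triplets. Experiments show performance on VGGSound and AudioCaps that matches or exceeds mainstream latent-space methods.

Details

Motivation: Modern audio generation relies mainly on latent-space compression, which adds complexity and may lose information. This work challenges that paradigm and examines whether high-quality audio can be generated directly in raw waveform space.

Result: WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding established latent-based methods.

Insight: The core contribution is evidence that intermediate compression is not a prerequisite for high-quality audio synthesis, which yields a simpler and more scalable alternative. The specific techniques are reshaping audio into 2D token grids, amplitude lifting to align signal scales, and direct x-prediction in flow matching for stable optimization.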

Abstract: Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
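
Waveform patchify can be pictured as cutting the 1D sample stream into fixed-length patches and arranging those patches in rows. The sketch below is an assumed layout, not the paper's: `patch_size`, `row_len`, and the zero-padding rule are all hypothetical, and amplitude lifting is left out because the abstract does not define it.

```python
def patchify(waveform, patch_size, row_len):
    """Reshape a 1D waveform into a 2D grid of fixed-length patches (tokens)."""
    # Zero-pad so the waveform length is a multiple of the patch size.
    pad = (-len(waveform)) % patch_size
    padded = list(waveform) + [0.0] * pad
    # Each token is a run of patch_size consecutive samples.
    tokens = [padded[i:i + patch_size]
              for i in range(0, len(padded), patch_size)]
    # Arrange tokens row by row; the final row may be shorter.
    return [tokens[i:i + row_len] for i in range(0, len(tokens), row_len)]
```

Grouping samples this way shortens the sequence a transformer has to attend over by a factor of `patch_size`, which is the usual reason for patchifying high-rate signals.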


cs.IR [Back]

[254] MARQUIS: A Three-Stage Pipeline for Video Retrieval-Augmented Generation cs.IR | cs.CVPDF

Debashish Chakraborty, Dengjia Zhang, Jialiang Jin, Hanting Liu, Katherine Guerrerio

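TL;DR: MARQUIS is a three-stage pipeline for video retrieval-augmented generation that targets multimodal retrieval under complex queries and synthesis of information across videos. Through query expansion and reranking, structured evidence extraction, and controllable article generation, it substantially improves retrieval and generation performance on the MAGMaR2026 task.

Details

Motivation: Current methods fall short on complex, multi-faceted video queries: retrieval methods cannot capture the query semantics with a single embedding, while generation methods lack the high-level reasoning needed to synthesize across videos and are limited by memory constraints over long contexts.

Result: On the MAGMaR2026 shared task, retrieval performance (nDCG@10) rises from 0.195 to 0.759. For article generation, ITER-QA-BASE raises the average human score from 3.09 to 3.83, while MARQUIS-RLM scores 3.30 and achieves the strongest citation recall among non-QA systems.

Insight: The contribution is the decomposition of retrieval-augmented generation into three interpretable stages. Query expansion and fusion address the representation of complex queries, and structured evidence extraction with controllable generation strengthens cross-video synthesis and attribution.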

Abstract: Retrieval-augmented generation from videos requires systems to retrieve relevant audiovisual evidence from large corpora and synthesize it into coherent, attributed text. Current approaches struggle at both ends: retrieval methods fail on complex, multi-faceted queries that cannot be captured by a single embedding, while generation methods lack the high-level reasoning needed to synthesize across multiple videos and face memory constraints over long, multi-video contexts. We present MARQUIS: a three-stage pipeline that addresses these limitations through (1) query expansion, fusion, and reranking, (2) calibrated structured evidence extraction, and (3) article generation from extracted evidence, optionally controlled by an RLM. On the MAGMaR2026 shared task, we improve retrieval performance from 0.195 to 0.759 (nDCG@10). For article generation, ITER-QA-BASE improves average human score from 3.09 to 3.83 over the CAG baseline, while MARQUIS-RLM achieves a human score of 3.30 and the strongest citation recall among non-QA systems.
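
Two pieces of this pipeline lend themselves to a short worked example: fusing the ranked lists produced by several expanded queries, and the nDCG@10 metric used to report retrieval quality. The abstract does not say which fusion rule MARQUIS uses, so the sketch substitutes reciprocal rank fusion (RRF), a common choice, purely as an assumption. The nDCG function uses the linear-gain form of DCG.

```python
import math

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def ndcg_at_k(ranked_docs, relevance, k=10):
    """nDCG@k with linear gains; relevance maps doc id -> graded relevance."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevance.get(d, 0) for d in ranked_docs[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0
```

A document that appears near the top of several expanded-query rankings accumulates score from each list, which is how fusion can help with multi-faceted queries that no single embedding captures.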


[255] TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval cs.IR | cs.CVPDF

Xinyu Sun, Huangyu Dai, Lingtao Mao, Zexin Zheng, Zihan Liang

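TL;DR: This paper proposes TIGER-FG, a framework that addresses the modality and granularity asymmetries in e-commerce image retrieval. It uses item text as semantic guidance to produce target-focused item representations without object detection, and applies dual distillation objectives to make the representations more stable and discriminative.

Details

Motivation: E-commerce image retrieval faces a modality asymmetry (a visual query must match image-text items) and a granularity asymmetry (a cropped query must be compared with full images that contain background and distractors). The aim is to avoid the cost and error propagation of detection-based methods as well as the vulnerability of CLIP-style encoders to backgrounds and irrelevant items.

Result: On the self-built ECom-RF-IMMR benchmark, which covers standard and cluttered item layouts, TIGER-FG improves Recall@1 over the strongest baseline by 6.1 and 34.4 percentage points on the two evaluation benchmarks, respectively. Results on public e-commerce benchmarks also show generalization to noisy and one-to-many retrieval scenarios.

Insight: The contributions are a text-guided implicit fine-grained grounding framework that uses text as semantic guidance to bypass explicit detection; dual distillation objectives that preserve target-region spatial consistency and query-item similarity structure; and ECom-RF-IMMR, a large-scale and more realistic e-commerce retrieval benchmark.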

Abstract: E-commerce image search often takes a cropped image as the query, while each candidate is represented by full item images and structured text. This image-to-multimodal retrieval setting presents two asymmetries: a modality disparity (a visual query must match image–text items) and a granularity disparity (a cropped query must be compared with full images containing background context and possible distractors). Detection-based pipelines handle the granularity disparity through explicit localization but incur extra cost and error propagation, whereas CLIP-style encoders avoid detection but are vulnerable to backgrounds or irrelevant items. To address these limitations, we propose TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. TIGER-FG uses item text as semantic guidance to produce target-focused item representations without object detection for retrieval. We further introduce dual distillation objectives that preserve target-region spatial consistency and query–item similarity structure, yielding more stable and discriminative multimodal representations. In addition, we construct ECom-RF-IMMR, a realistic benchmark suite with a 10M-pair training set and two evaluation benchmarks covering standard and cluttered item layouts. TIGER-FG improves Recall@1 over the strongest baseline by 6.1 and 34.4 percentage points on the two evaluation benchmarks, respectively, with only 85.7M query-side parameters and 256-dim embeddings. Results on public e-commerce benchmarks further demonstrate its generalization to noisy and one-to-many retrieval scenarios. Code and data will be released.


cs.ET [Back]

[256] BIDO: A Biometric Identity Online Authentication Framework cs.ET | cs.CR | cs.CVPDF

Aditya Mithra, Sibi Chakkaravarthy S, Srinivas Kankanala

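TL;DR: BIDO is a device-free online biometric authentication framework. It deterministically derives ECDSA key material from a live biometric measurement and a user-memorized secret, achieving authentication at NIST AAL2. The framework stores no long-lived biometric templates or any personally identifiable information, and it produces non-resident WebAuthn credentials that are fully compatible with FIDO2, so verification can run from any commodity sensor terminal.

Details

Motivation: Security systems need continuous, cryptographically robust identity verification without requiring users to carry physical tokens or dedicated hardware authenticators, and without persistently storing biometric templates or personally identifiable information.

Result: Evaluated on three prominent face benchmarks (VGGFace2, LFW, and MegaFace), BIDO reaches 99.51% verification accuracy on LFW and 92.14% Rank-1 identification accuracy on MegaFace Challenge 1 (10^6 distractors), with a cryptographic false accept rate of 0.03% and a false reject rate of 0.90%.

Insight: The contribution lies in combining a biometric with a user-memorized secret and using deterministic key derivation to achieve strong authentication without template storage, with a multi-stage pipeline that produces transient WebAuthn credentials to preserve privacy and compatibility. From an outside perspective, the integration of biometrics with cryptographic protocols such as WebAuthn is a promising direction for device-free authentication.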

Abstract: Security systems demand continuous, cryptographically robust identity verification without requiring subjects to carry physical tokens, smart cards, or dedicated hardware authenticators. This paper presents BIDO (Biometric Identity Online), a device-free authentication standard that achieves Authenticator Assurance Level 2 (AAL2) per NIST SP 800-63B without storing long-lived biometric templates, facial images, or any other form of Personally Identifiable Information (PII). BIDO derives Elliptic Curve Digital Signature Algorithm (ECDSA) key material deterministically from a live biometric measurement salted with a user-defined memorized secret at every authentication event, eliminating persistent private-key storage while enabling verification from any commodity sensor terminal. The generated credentials are non-discoverable (non-resident) Web Authentication (WebAuthn) credentials, fully compatible with all FIDO2-enabled websites and services without modification on the server side. A multi-stage pipeline, comprising capture of 200 valid biometric samples, feature extraction using the Dlib 68-point facial landmark predictor, affine face alignment, frontality gating, Euclidean distance computation from the inter-eye midpoint, floor-division quantization with divisor q = 8, inter-session drift stabilization, and majority-voting SHA-256 hash binding, produces a Verification Seed (Vseed) from which the WebAuthn credential is transiently derived and immediately zeroized after signing. Evaluated against three prominent face benchmarks (VGGFace2, LFW, and MegaFace), BIDO achieves 99.51% verification accuracy on LFW and 92.14% Rank-1 identification accuracy on MegaFace Challenge 1 at 10^6 distractors, with a cryptographic False Accept Rate (FAR) of 0.03% and a False Reject Rate (FRR) of 0.90%.
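
The quantize, vote, and hash steps of the pipeline can be sketched as follows. This is a loose illustration of the idea that small measurement noise should map to the same seed, not the BIDO algorithm. Voting per feature across samples, the payload layout, and the function names are assumptions. Landmark extraction, alignment, drift stabilization, and ECDSA key derivation are omitted, and a bare SHA-256 call should not be read as a recommended key-derivation function.

```python
import hashlib
from collections import Counter

Q = 8  # quantization divisor stated in the abstract

def quantize(distances, q=Q):
    # Floor-division quantization: nearby distances fall into the same bucket.
    return tuple(int(d // q) for d in distances)

def majority_vector(samples):
    # Assumed voting rule: for each feature, keep the most common bucket
    # across all captured samples.
    quantized = [quantize(s) for s in samples]
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*quantized))

def derive_seed(samples, secret):
    # Bind the stabilized feature vector to the memorized secret.
    stable = majority_vector(samples)
    payload = (secret.encode("utf-8") + b"|"
               + ",".join(map(str, stable)).encode("ascii"))
    return hashlib.sha256(payload).hexdigest()
```

Two capture sessions whose per-feature majority buckets agree yield the same seed, while changing the memorized secret changes the seed even for identical measurements.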


q-bio.NC [Back]

[257] MIRAGE: Robust multi-modal architectures translate fMRI-to-image models from vision to mental imagery q-bio.NC | cs.CVPDF

Reese Kneeland, Cesar Kadir Torrico Villanueva, Jordyn Ojeda, Shuhb Khanna, Jonathan Xu

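TL;DR: This paper proposes MIRAGE, a method that extends vision decoding models, which reconstruct seen images from human brain activity, to decoding internally generated mental images. It uses a linear backbone and feeds multi-modal text and image features into a diffusion model, achieving state-of-the-art mental image reconstruction on the NSD-Imagery benchmark.

Details

Motivation: Existing vision decoding models reconstruct seen images well, but that performance is not guaranteed to carry over to mental images, that is, internally generated visual representations. In an analysis of the NSD-Imagery dataset, the authors found that some models fail at this task, which calls for a method designed specifically to cross-decode mental images from brain activity.

Result: On the NSD-Imagery benchmark, feature metrics and human raters establish MIRAGE as state of the art for mental image reconstruction. Ablation analysis shows that reconstruction works best with image features of relatively few dimensions, combined with text-based guidance and both high- and low-level image features.

Insight: The core contribution is MIRAGE, an architecture designed for mental image reconstruction, which shows that models trained on existing large-scale datasets with external stimuli (such as vision datasets) can decode mental images. The key design insight is to use a linear backbone and fuse multi-modal features (text plus high- and low-level image features) to guide a diffusion model, which supports optimism about practical mental image reconstruction and offers an architectural reference.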

Abstract: To be useful for downstream applications, vision decoding models that are trained to reconstruct seen images from human brain activity must be able to generalize to internally generated visual representations, i.e., mental images. In an analysis of the recently released NSD-Imagery dataset, we demonstrated that while some modern vision decoders can perform quite well on mental image reconstruction, some fail, and that state-of-the-art (SOTA) performance on seen image reconstruction is no guarantee of SOTA performance on mental image reconstruction. Motivated by these findings, we developed MIRAGE, a method explicitly designed to train on vision datasets and cross-decode mental images from brain activity. MIRAGE employs a linear backbone and multi-modal text and image features as input to a diffusion model. Feature metrics and human raters establish MIRAGE as SOTA for mental image reconstruction on the NSD-Imagery benchmark. With ablation analysis we show that mental image reconstruction works best when decoders use image features with relatively few dimensions and include guidance from text-based and both high- and low-level image-based features. Our work indicates that, given the right architecture, existing large-scale datasets using external stimuli are viable training data for decoding mental images, and it warrants optimism about the future success and utility of mental image reconstruction.


cs.CE [Back]

[258] The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence cs.CE | cs.AI | cs.CLPDF

Yuxuan Ye, Jun Han, Ao Hu, Juncheng Bu, Yiyi Chen

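TL;DR: This paper argues that return metrics such as Sharpe ratios reported by end-to-end LLM trading agents (FinCon, FinMem, and others) should not be taken as evidence of deployability. Before such returns can support claims of deployable trading capability, they must pass a set of structural validity tests, including temporal integrity, real-world frictions, and counterfactual robustness. The paper proposes a minimum reporting protocol suite (P1-P6) and a conservative modular alternative to address both the evaluative and the structural problems.

Details

Motivation: Academia and industry have been too quick to equate architecture research on LLM trading agents with deployment capability. Reported headline Sharpe ratios can mislead, and stricter evaluation standards are needed to separate robust predictive ability from potential biases such as temporal contamination and unmodeled frictions.

Result: The abstract provides no quantitative experimental results or benchmark rankings. The paper instead critiques existing evaluation practice and proposes a new evaluation framework (the P1-P6 protocols) and a modular architecture.

Insight: The contribution is a systematic account of key structural problems in current LLM trading agent evaluation (language confidence is not tradable probability; narrative reasoning is not numerical execution), a minimum reporting protocol with tiered applicability, and a conservative architecture that uses LLMs as auditable information interfaces decoupled from independent calibration, risk, and execution modules. Together these give methodological guidance for future research and evaluation in this area.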

Abstract: End-to-end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG-Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading-task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia–industry divide. We take a position on that gap: reported alpha from end-to-end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real-world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi-agent disaggregation. Current public evidence cannot yet distinguish robust predictive ability from temporal contamination, unmodeled frictions, short-window Sharpe uncertainty, narrative fitting, and parametric priors. The problem is not only evaluative but structural. Language confidence is not tradable probability, narrative reasoning is not numerical execution, and model priors may become undisclosed implicit factor exposures. We contribute a minimum reporting protocol suite, P1–P6, with tiered applicability by claim strength, and a conservative modular alternative that uses LLMs as auditable information interfaces upstream of independent calibration, risk, and execution modules. Code and reproduction harness: https://github.com/hj1650782738/Trading.
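
The point about short-window Sharpe uncertainty can be made concrete with the standard large-sample approximation for the standard error of an estimated Sharpe ratio under i.i.d. returns, SE ≈ sqrt((1 + SR²/2) / T), where SR is the per-period ratio and T the number of periods. The sketch below is not part of the paper's P1–P6 protocol. The function name and the 252-day annualization are assumptions, and real returns are rarely i.i.d., so the true uncertainty is typically larger.

```python
import math
import statistics

def sharpe_with_se(returns, periods_per_year=252):
    """Annualized Sharpe ratio and its approximate standard error.

    Assumes i.i.d. per-period excess returns; this is a textbook
    approximation, not a substitute for a proper robustness analysis.
    """
    t = len(returns)
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)  # sample standard deviation
    sr = mean / std                  # per-period Sharpe ratio
    se = math.sqrt((1.0 + 0.5 * sr * sr) / t)
    ann = math.sqrt(periods_per_year)
    return sr * ann, se * ann
```

Because the standard error shrinks only with the square root of the number of periods, a backtest over a few months can produce a large annualized Sharpe ratio whose confidence interval still comfortably includes zero.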


eess.IV [Back]

[259] Flow Matching with Optimized Subclass Priors for Medical Image Augmentation eess.IV | cs.CVPDF

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

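TL;DR: This paper proposes a flow matching method for medical image augmentation that uses optimized subclass priors to address data scarcity for rare diseases. In the generative model's latent space, it partitions each coarse label into coherent submodes with a Gaussian mixture model and learns subclass-conditioned source distributions that shorten transport paths and reduce within-class dispersion.

Details

Motivation: Rare disease diagnosis in medical imaging suffers from severe data shortage. In existing generative augmentation methods, coarse labels aggregate diverse subtypes and acquisition settings, which biases generators toward dominant submodes, and a shared Gaussian source forces rare subpopulations through disproportionately long transport paths.

Result: On long-tailed chest X-ray (MIMIC-LT, NIH-LT) and CT slice (CT-RATE) benchmarks, the method consistently improves tail-class generation fidelity and diversity (FID, IRS). Used as an augmentation strategy, it reliably improves downstream balanced accuracy and macro-F1 over a non-augmented baseline.

Insight: The contribution is a two-level informative prior: submodes are found via Gaussian mixture modeling in latent space, and subclass-conditioned source distributions are learned to improve transport trajectories. A geometric control mechanism, which concentrates normalized displacement directions around learnable prototypes, prevents degenerate solutions and improves generation quality for rare subpopulations.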

Abstract: Rare diseases dominate the diagnostic challenge in medical imaging yet are severely underrepresented in clinical datasets, causing classifiers to fail on exactly the conditions where reliable detection matters most. Generative augmentation can supply the missing tail-class coverage, but coarse disease labels aggregate diverse subtypes and acquisition settings into multi-modal conditionals that bias generators toward dominant submodes, while a shared Gaussian source forces rare subpopulations through disproportionately long transport paths. We propose an offline strategy that introduces informative priors at two levels: first, we partition each coarse label into coherent submodes via Gaussian mixture modeling in the generative model’s latent space; second, we learn subclass-conditioned source distributions that re-center and re-scale the starting distribution per submode, shortening trajectories and reducing within-subclass dispersion. To prevent degenerate solutions we impose explicit geometric control, moderately concentrating normalized displacement directions around learnable prototypes while capping path-length outliers. On long-tailed chest X-ray (MIMIC-LT, NIH-LT) and CT slice (CT-RATE) benchmarks the proposed method consistently improves tail-class generation fidelity and diversity (FID, IRS) and is a promising augmentation strategy that reliably improves downstream balanced accuracy and macro-F1 over a non-augmented baseline across modalities.
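
The effect of a subclass-conditioned source on transport paths can be illustrated with the common linear-interpolation form of flow matching, where a point on the path is x_t = (1 - t) x0 + t x1 and the regression target is the constant velocity x1 - x0. The abstract does not state which flow matching variant the paper uses, so this form is an assumption, as are the function names. Per-submode mean and scale vectors stand in for the learned source distributions.

```python
import random

def sample_source(mu, sigma, rng):
    # Subclass-conditioned source: a Gaussian re-centered at mu and
    # re-scaled by sigma, instead of a shared standard normal.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def flow_matching_pair(x0, x1, t):
    # Linear interpolation path and its constant velocity target.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def path_length(x0, x1):
    # Euclidean length of the straight path from source to data sample.
    return sum((b - a) ** 2 for a, b in zip(x0, x1)) ** 0.5
```

If a rare submode sits far from the origin in latent space, starting from a source centered near that submode gives a shorter `path_length` than starting from a standard normal, which is the intuition behind the shortened trajectories described in the abstract.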


[260] See Silhouettes in Motion with Neuromorphic Vision eess.IV | cs.CV | cs.ROPDF

Pei Zhang, Shijie Lin, Zhou Ge, Jinpeng Chen, Wei Pu

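TL;DR: This paper proposes a dual-modal binarization method based on event cameras and frame images for extracting clear object silhouettes in dynamic scenes. It exploits the high temporal resolution and high dynamic range of event cameras together with conventional frames to achieve real-time, high-frame-rate binarization on CPU-only devices, overcoming motion blur and harsh lighting.

Details

Motivation: On mobile platforms such as drones and self-driving cars, conventional frame-based cameras suffer severe motion blur and loss of detail in dynamic scenes because of rapid motion and harsh lighting, so clear silhouettes of quasi-bimodal objects such as text and road signs cannot be extracted reliably.

Result: The method is competitive with leading techniques in reducing motion blur and brings notable improvements under challenging illumination. Its asynchronous workflow maintains clear target shapes even at extreme kilohertz frame rates.

Insight: The contribution is a simple yet effective dual-modal real-time binarization method that exploits the synergy between events and frames. Its asynchronous processing removes the dependence of traditional time-binning reconstruction on event density, which suggests a new direction for lightweight perception and interaction on resource-constrained edge platforms.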

Abstract: Quasi-bimodal objects, such as text, road signs, and barcodes, play a basic yet vital role in daily visual communication. By boiling these down to clear silhouettes, binarization uses a minimal language to convey essential vision cues for maximum downstream efficiency. The catch is that frame-based imaging often struggles on mobile platforms like drones, self-driving cars, and underwater vehicles. In these dynamic scenes, rapid motion and harsh lighting can make it blind, causing severe motion blur and erasing crucial details. To overcome these limits, neuromorphic vision via event cameras, featuring microsecond-level temporal resolution and high dynamic range, steps in as a natural solution. Building upon this event-driven sensing paradigm, we introduce a simple yet effective dual-modal approach that harnesses the synergy between frames and events to achieve real-time, high-frame-rate binarization on CPU-only devices. Extensive evaluations show that it achieves competitive performance against leading techniques in reducing motion blur, while delivering impressive improvements under challenging illumination. In addition, our asynchronous workflow bypasses the event scarcity that breaks traditional time-binning reconstruction, maintaining clear target shapes even at extreme kilohertz frame rates. Its binary results further serve as reliable representations that facilitate a range of downstream tasks. This work paves the way towards lightweight perception and interaction in embodied intelligence on resource-constrained edge platforms.


[261] CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery eess.IV | cs.CV | cs.MM

Tung-I Chen, Lingdong Wang, Subhransu Maji, Ramesh K. Sitaraman

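TL;DR: This paper proposes CATRF, a codec-adaptive tri-plane radiance field compression framework for volumetric content delivery. It quantizes and packs 2D feature planes into canvases suited to standard codecs (JPEG/VP9/HEVC/AV1) and uses a straight-through estimator (STE) to place the non-differentiable standard codec pipeline inside the training loop, so that radiance-field features adapt directly to the real distortions of the client-side codec.

Details

Motivation: Volumetric media is key to next-generation content delivery, but its heavy bandwidth demand is the main bottleneck. Existing implicit and hybrid volumetric representations shrink model size yet still need careful coding to reach bitrates comparable to 2D video, which calls for an efficient compression method that works with the standard codecs actually deployed.

Result: On static and dynamic benchmarks, CATRF consistently achieves a better rate-distortion trade-off than codec-agnostic and learned-codec-in-the-loop baselines, and it surpasses recent compressed 3D Gaussian Splatting (3DGS) methods in both compression efficiency and decoding speed.

Insight: The main novelty is integrating the non-differentiable standard codec pipeline into the training loop via STE, so that features are optimized against real codec distortion without any learned codec parameters. This offers a practical path to low-bitrate, compression-resilient free-viewpoint video streaming.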

Abstract: Volumetric media promises next-generation content delivery applications, but its bandwidth demand remains a key bottleneck. Implicit and hybrid volumetric representations reduce model sizes, yet still require careful coding to reach 2D video-like bitrates. We present CATRF, a standard-codec-in-the-loop compression framework for plane-factorized radiance fields. During training, we quantize and pack 2D feature planes into codec-friendly canvases, run a standard codec roundtrip (JPEG/VP9/HEVC/AV1), then unpack and dequantize the decoded features before volume rendering. We use a straight-through estimator (STE) to insert the non-differentiable, standard codec pipeline into the training loop, allowing radiance-field features to adapt directly to the real, client-side codec distortions without introducing any learned codec parameters. On both static and dynamic benchmarks, CATRF consistently achieves a better rate-distortion trade-off over codec-agnostic and learned-codec-in-the-loop baselines, and also outperforms recent compressed 3DGS methods in both compression efficiency and decoding speed. These results highlight a practical path toward low-bitrate, compression-resilient volumetric representations for free-viewpoint video streaming.
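
The quantize, codec roundtrip, dequantize loop with a straight-through estimator can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `fake_codec_roundtrip` is a hypothetical stand-in that snaps 8-bit codes to a coarse grid (it is not JPEG, VP9, HEVC, or AV1), the features are plain scalars instead of feature planes, and the loss is a simple squared error instead of a rendering loss. The point is only the gradient rule: the forward pass goes through the lossy, non-differentiable path, while the backward pass treats that path as the identity.

```python
def quantize(x, lo=-1.0, hi=1.0, levels=255):
    """Map a float in [lo, hi] to an integer code in [0, levels]."""
    x = min(max(x, lo), hi)
    return round((x - lo) / (hi - lo) * levels)


def dequantize(q, lo=-1.0, hi=1.0, levels=255):
    """Map an integer code back to a float in [lo, hi]."""
    return lo + q / levels * (hi - lo)


def fake_codec_roundtrip(codes, step=8):
    """Hypothetical lossy codec stand-in: snap codes to a coarse grid."""
    return [min(255, step * round(c / step)) for c in codes]


def forward(features):
    """Quantize, run the (stand-in) codec roundtrip, then dequantize."""
    codes = [quantize(f) for f in features]
    decoded = fake_codec_roundtrip(codes)
    return [dequantize(c) for c in decoded]


def loss(features, targets):
    """Squared error measured on the decoded (distorted) features."""
    return sum((d - t) ** 2 for d, t in zip(forward(features), targets))


def ste_step(features, targets, lr=0.1):
    """One gradient step using the straight-through estimator."""
    decoded = forward(features)
    # Gradient of the loss with respect to the decoded features.
    grads = [2.0 * (d - t) for d, t in zip(decoded, targets)]
    # STE: pretend d(decoded)/d(features) is the identity, so the same
    # gradient is applied to the raw, pre-codec features.
    return [f - lr * g for f, g in zip(features, grads)]
```

Repeating `ste_step` moves the raw features so that their decoded values approach the targets, up to the grid spacing of the stand-in codec, which is the sense in which features "adapt" to codec distortion in this toy setting.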


cs.RO [Back]

[262] Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms cs.RO | cs.CV | eess.SP

Zhixiang Cao, Di Tian, Runwei Guan, Yanzhou Mu, Xiaolou Sun

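TL;DR: This survey covers tactile-based multimodal fusion in embodied intelligence, systematically reviewing research progress up to the first quarter of 2026. It proposes a hierarchical taxonomy organized along two main dimensions, multimodal datasets and multimodal methods, covering tactile-vision and tactile-language datasets among others, together with three method pillars: perception and recognition, cross-modal generation, and interaction.

Details

Motivation: Tactile sensing is a foundational modality for embodied intelligence, but unimodal tactile perception is inherently limited by sparse spatial coverage and a lack of global semantics. With advances in deep learning and large language models, fusing touch with vision and language has become essential for connecting physical interaction to semantic reasoning, yet existing work is scattered across datasets, sensing modalities, and tasks and lacks a unified theoretical framework.

Result: As a survey, the paper proposes no new model and reports no quantitative results of its own. Instead it systematically organizes and classifies existing work and summarizes representative tactile sensing hardware, commonly used evaluation metrics, and benchmark settings.

Insight: The contribution is a unified taxonomy for multimodal tactile fusion that arranges datasets and methods in a clear hierarchy, giving future research a structured roadmap. Dividing methods into the three pillars of perception, generation, and interaction gives broad coverage and helps consolidate a currently fragmented research landscape.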

Abstract: Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile sensing with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, and lacks a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.


[263] WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform cs.RO | cs.CV

Yu Shang, Yinzhou Tang, Yiding Ma, Zhuohang Li, Lei Jin

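TL;DR: WorldArena 2.0 is an expanded benchmark for embodied world models that systematically broadens evaluation along three dimensions: modality, functionality, and platform. It extends evaluation from vision-only to visuotactile modalities, from policy evaluation to use as interactive RL environments, and from simulator-only evaluation to a diverse set of simulated and real-world robot platforms.

Details

Motivation: Existing embodied world model benchmarks are largely confined to vision-only prediction, offline applications, and simulator-based evaluation, which is insufficient for assessing increasingly comprehensive world models.

Result: Under a standardized protocol, the benchmark evaluates perceptual quality, interactive utility, and cross-platform performance, providing an integrated testbed for tracking progress in embodied world models.

Insight: The novelty is a benchmark framework extended along three dimensions, bringing multimodal perception, interactive reinforcement learning environments, and both simulated and real robot platforms into one evaluation scheme and offering the field a more complete yardstick.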

Abstract: World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.


[264] ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics cs.RO | cs.AI | cs.CV

Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu

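TL;DR: This paper introduces ManiSoft, a benchmark for studying the challenges of vision-language manipulation with soft continuum robots. It includes a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions and defines four tasks that highlight different aspects of deformable control. An automated pipeline generates 6,300 diverse scenes and expert trajectories for policy training and evaluation.

Details

Motivation: Existing vision-language manipulation research mainly targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft arms are an appealing alternative because they deform, but they face challenges such as unreliable proprioception and distributed low-level actuation.

Result: Three representative policy models evaluated on ManiSoft show relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem mainly from inaccurate visual estimation of proprioceptive state and from limited use of deformability for adaptive obstacle avoidance.

Insight: The novelty lies in introducing what is presented as the first vision-language manipulation benchmark for soft arms, with a simulator that models soft-body dynamics and contact interactions through an elastic force constraint. The automated pipeline for generating large-scale diverse scenes and expert trajectories, together with hierarchical planning (a high-level planner decomposes each task into a waypoint sequence and a low-level reinforcement learning policy generates torque commands), provides a useful testbed for soft robot control research.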

Abstract: Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track the waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate that ManiSoft will serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.


[265] Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction cs.RO | cs.CV

Nga Teng Chan, Yi Zhang, Yechi Liu, Renwen Cui, Fanhu Zeng

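TL;DR: This paper proposes Robo-Cortex, a self-evolving framework for embodied agents that targets "experiential amnesia", the failure to generalize strategies when navigating unseen environments. Through an Autonomous Knowledge Induction mechanism it distills multimodal trajectories into a structured library of navigation heuristics, and it combines a Dual-Grain Cognitive Memory system (short-term reflective memory and long-term principle memory) with an Imagine-then-Verify loop so that the robot can actively induce and refine navigation strategies.

Details

Motivation: Embodied agents navigating unseen, complex environments suffer from "experiential amnesia": existing trajectory-driven or reactive policies cannot synthesize generalizable strategies from past interactions, which limits autonomous adaptation.

Result: Extensive evaluation on the IGNav, AR, and AEQA benchmarks shows that Robo-Cortex consistently outperforms strong baselines in task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real-world robot experiments also support its effectiveness.

Insight: The core novelty is the Autonomous Knowledge Induction mechanism and the Dual-Grain Cognitive Memory system, which abstract concrete experience into natural-language heuristic principles and shift the agent from passive execution to active strategy evolution. The Imagine-then-Verify loop uses a world model and a VLM-based verifier to make decisions more robust, suggesting a route to embodied agents that keep learning and adapting to new environments.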

Abstract: The ability to navigate and interact with complex environments is central to real-world embodied agents, yet navigation in unseen environments remains challenging due to “experiential amnesia,” where existing trajectory-driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo-Cortex, a self-evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection-adaptation loop. By abstracting success patterns and failure pitfalls into natural-language heuristics, Robo-Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual-Grain Cognitive Memory system, comprising a Short-term Reflective Memory (SRM) for real-time local progress analysis, and a Long-term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision-making, we introduce a multimodal Imagine-then-Verify loop, where a world model simulates potential outcomes and a VLM-based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo-Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real-world robotic experiments further support the effectiveness of Robo-Cortex in physical settings.
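
The SPL gains quoted above are easier to read with the metric's definition in hand. The abstract does not expand the acronym, so the sketch below assumes it is the Success weighted by Path Length metric commonly used in embodied navigation: the average, over episodes, of the success indicator times the ratio of the shortest-path length to the longer of the shortest and actual path lengths. The function name and the episode tuple layout are illustrative choices, not taken from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a tuple (success, shortest_len, taken_len), where
    `success` is a bool and the lengths are positive numbers.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, taken_len in episodes:
        if success:
            # A successful episode scores 1.0 only if the agent's path
            # is as short as the shortest path; detours lower the score.
            total += shortest_len / max(taken_len, shortest_len)
    return total / len(episodes)
```

For example, one optimal success, one success with a path twice the optimal length, and one failure give `(1 + 0.5 + 0) / 3`, so the metric rewards both reaching the goal and doing so efficiently.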


cs.CY [Back]

[266] AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course cs.CY | cs.AI | cs.CL | cs.MM

David James Woo, Deliang Wang, Kai Guo

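TL;DR: This paper examines the use of AI retrieval-augmented generation tools to create supplementary multimedia materials for an English for Academic Purposes course, using a mixed-methods study of students' acceptance of the materials, their preferences, and the relation to academic performance. Students generally found the materials useful and easy to use and preferred visual and multimodal formats; video preference correlated positively with academic performance, while higher cognitive load was associated with lower grades.

Details

Motivation: The study asks whether AI-generated content acts as a pedagogical scaffold or as low-quality "AI slop", and assesses its practical educational value in an English for Academic Purposes course.

Result: Among 106 English learners, students rated the materials highly for perceived usefulness and ease of use and preferred assessment-linked visual and multimodal content such as videos and infographics. Video preference correlated positively with academic performance, whereas higher cognitive load correlated negatively with course grades.

Insight: The novelty is applying RAG tools to personalized feedback at scale and analyzing the outcome through the Technology Acceptance Model and Cognitive Load Theory. The findings suggest that teacher-prompted AI-generated content can enhance the learning ecosystem rather than produce low-quality material, and that some lower-performing students independently adopted the materials as remedial scaffolds.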

Abstract: Artificial intelligence (AI) retrieval-augmented generation (RAG) tools now enable educators to transform course materials into diverse multimedia at scale. However, it remains unclear whether such AI-generated content functions as a pedagogical scaffold or AI slop: high volume, low quality material. This innovative practice paper reports on the development, implementation, and evaluation of teacher-prompted, AI-generated supplemental materials in an English for Academic Purposes (EAP) course at a Hong Kong Community College. Using primarily Google Notebook LM, the instructor generated videos, podcasts, infographics, and individualized feedback reports from course materials and student work for 106 English as a Foreign Language learners. An explanatory sequential mixed-methods design comprising a survey, semi-structured interviews, and correlation analysis with academic scores was employed to examine students’ preferences, perceptions, and learning outcomes. Findings are framed through the Technology Acceptance Model and Cognitive Load Theory. Students rated the materials highly for perceived usefulness and ease of use, and preferred assessment-linked content presented in visual and multimodal formats, particularly videos and infographics. Video preference correlated positively with academic performance; however, higher cognitive load was negatively associated with course grades, indicating that material complexity must be carefully calibrated. Notably, some lower-performing students independently adopted the materials as remedial scaffolds. The practice demonstrates that RAG tools enable scalable personalized feedback that would be less feasible through traditional methods. When aligned with student goals and cognitive principles, teacher-prompted AI generation can meaningfully enhance the EAP learning ecosystem rather than producing AI slop.


[267] ANVIL: Analogies and Videos for Lecturers cs.CY | cs.AI | cs.CL | cs.GR | cs.HC | cs.MM

Yuri Noviello, Anastasiia Birillo, Gosia Migut

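TL;DR: ANVIL is a multimodal generative system that automates the production of analogy-based instructional animations for computer science. Given a concept definition, it generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render the animation, with an automated repair mechanism to improve robustness. The paper assesses pedagogical validity and scalability through a teacher evaluation, automated screening, and a user study.

Details

Motivation: The goal is to automate the efficient, large-scale production of high-quality analogy-based instructional animations for computer science topics in support of teaching.

Result: The teacher evaluation indicates that ANVIL's materials are frequently rated as adequate; an LLM-based evaluator is introduced for scalable quality screening of textual analogies; and the user study shows that educators respond positively to the system's perceived value and usability.

Insight: The novelty is an end-to-end automated pipeline from concept definition to instructional animation, together with a hybrid evaluation framework that combines human assessment (teachers) with automated proxies (an LLM evaluator and a screenplay-fidelity check) to balance pedagogical validity against scalability.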

Abstract: We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.


cs.AI [Back]

[268] How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study cs.AI | cs.CL

Shuqi Zhu, Yi Zhong, Ziyi Ye, Bangde Du, Yujia Zhou

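TL;DR: This paper uses electroencephalography (EEG) to study the neural mechanisms by which humans process AI-generated hallucinated content. It shows that when people judge the correctness of image descriptions generated by a multimodal large language model (MLLM), the brain exhibits distinct patterns for hallucinated versus non-hallucinated content across cognitive processes including semantic integration, inferential processing, memory retrieval, and cognitive load.

Details

Motivation: How AI-generated hallucinations affect human cognition is not well understood; the paper explores the neural dynamics that underlie recognizing hallucinated content or being misled by it.

Result: An averaged event-related potential (ERP) analysis finds significant differences across multiple cognitive processes between hallucinated and non-hallucinated content, and hallucinations that participants misjudged failed to trigger the standard neurocognitive fact-verification pathway.

Insight: The novelty is described as the first use of neuroimaging to quantify differences in how AI hallucinations are processed cognitively, providing neuroscientific evidence on how humans come to trust AI output that could inform more reliable AI evaluation and human-AI interaction design.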

Abstract: While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans’ neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.


[269] ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding cs.AI | cs.CL | cs.CV

Mingyang Rao, Kehua Feng, Zhihui Zhu, Jiangzhen Fu, Hao Yu

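TL;DR: This paper proposes the ChemVA framework to address the visual and semantic bottlenecks that large language models face in understanding chemical reaction diagrams. Using a Visual Anchor mechanism and a semantic alignment method, it translates visual features into entity names to activate the LLM's chemical knowledge, reaching 92.0% structural recognition accuracy on the OCRD-Bench dataset and a gain of about 20 percentage points across 9 different LLMs.

Details

Motivation: LLMs handle scientific text well but show a significant capability gap on chemical reaction diagrams. Two limits dominate: generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs (the Visual Deficit), and linear strings such as SMILES fail to activate the model's latent chemical reasoning (the Semantic Disconnect).

Result: On the newly built OCRD-Bench dataset, which features dense visual-semantic contexts and comprehensive reaction coverage, ChemVA achieves 92.0% structural recognition accuracy and a consistent gain of about 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems on complex chemical reasoning tasks.

Insight: The novelties are the Visual Anchor mechanism, which grounds functional groups via hybrid-granularity detection, and the semantic alignment method, which translates visual features into entity names to maximize knowledge activation. Viewed objectively, by bridging the visual and semantic bottlenecks the work offers a reusable cross-modal alignment framework for LLM visual understanding in specialized domains such as chemistry.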

Abstract: While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model’s latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.


[270] CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models cs.AI | cs.CL

Yuning Wu, Yingmin Liu, Yang Shu

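TL;DR: This paper proposes CyberCorrect, a framework that formalizes self-correction in large language models (LLMs) as a closed-loop control system grounded in cybernetics. It models the LLM generator as the plant, introduces as the sensor a tri-modal error detector combining self-consistency, verbalized confidence, and logic-chain verification, and steers correction with a type-directed correction controller and a convergence judge based on stability criteria. Experiments on a self-built benchmark show higher final accuracy and less over-correction.

Details

Motivation: Current LLM self-correction methods are unsystematic, rely on generic prompts, and offer no convergence guarantees; the paper aims for a theoretically grounded, systematic self-correction framework.

Result: On the self-built CyberCorrect-Bench (440 reasoning tasks annotated with error types and correction paths), CyberCorrect reaches 79.8% final accuracy, 6.2 percentage points above the best existing self-correction method, and its convergence control mechanism reduces the over-correction rate by 41%.

Insight: The main novelty is the systematic application of cybernetic principles (closed-loop system, sensor, controller, convergence judgment) to LLM self-correction, along with the tri-modal error detector and control-theory-inspired dynamic metrics (convergence rate, overshoot rate, oscillation rate), which give a new perspective and theoretical tools for understanding and improving the correction process.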

Abstract: Large language model (LLM) self-correction – the ability to detect and fix errors in generated outputs – remains largely ad hoc, relying on generic prompts such as “please reconsider your answer” without systematic error analysis or convergence guarantees. We propose CyberCorrect, a framework that formalizes LLM self-correction as a closed-loop control system grounded in cybernetic theory. The framework models the LLM generator as the plant and introduces a tri-modal Error Detector (combining self-consistency, verbalized confidence, and logic-chain verification) as the sensor. A type-directed Correction Controller generates targeted repair instructions based on diagnosed error categories, while a Convergence Judge determines iteration termination using stability criteria adapted from control theory. We further introduce three control-theoretic evaluation metrics – convergence rate, overshoot rate, and oscillation rate – that capture correction dynamics beyond final accuracy. Experiments on our constructed CyberCorrect-Bench (440 reasoning tasks with annotated error types and correction paths) show that CyberCorrect achieves 79.8% final accuracy, improving upon the best existing self-correction method by 6.2 percentage points, while reducing overshoot (erroneous over-correction) by 41% through its convergence control mechanism.
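
The plant, sensor, controller, and judge roles described above can be pictured as a small loop. The sketch below is a hypothetical skeleton, not the paper's implementation: the callables `generate`, `detect`, and `make_instruction` are placeholders supplied by the caller, and the stopping rule (stop when the answer has not changed for `patience` consecutive rounds) is an assumed stand-in for the stability criteria, which the abstract does not specify.

```python
from typing import Callable, Optional, Tuple


def closed_loop_correct(
    generate: Callable[[str, Optional[str], Optional[str]], str],
    detect: Callable[[str, str], Optional[str]],
    make_instruction: Callable[[str], str],
    task: str,
    max_iters: int = 5,
    patience: int = 2,
) -> Tuple[str, int]:
    """Run a correction loop; return (final answer, correction rounds)."""
    # Plant: produce an initial answer with no previous answer or instruction.
    answer = generate(task, None, None)
    stable_rounds = 0
    for rounds in range(max_iters):
        # Sensor: return an error category, or None if no error is found.
        error_type = detect(task, answer)
        if error_type is None:
            return answer, rounds
        # Controller: turn the diagnosed category into a targeted instruction.
        instruction = make_instruction(error_type)
        revised = generate(task, answer, instruction)
        # Judge (assumed rule): count rounds in which the output did not move.
        stable_rounds = stable_rounds + 1 if revised == answer else 0
        answer = revised
        if stable_rounds >= patience:
            return answer, rounds + 1
    return answer, max_iters
```

Separating the three callables keeps the loop itself independent of any particular detector or prompt format, so different sensors or controllers can be swapped in without touching the termination logic.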


[271] QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI cs.AI | cs.CL | cs.GR

Marjan Veysi, Pirooz Shamsinejadbabaki, Mohammad Zare, Mohammad Sabouri

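TL;DR: This paper proposes the Quantifying Qualitative Judgment (QQJ) framework to address two problems in evaluating generative AI: automatic metrics that diverge from human perception and human evaluation that does not scale. QQJ calibrates LLM evaluators with expert-designed multi-dimensional rubrics and a small, high-quality annotation set, yielding evaluation that is scalable, interpretable, and aligned with human judgment.

Details

Motivation: Rapid progress in generative AI has exposed fundamental limits of existing evaluation methods. Traditional automatic metrics fail to reflect human perceptions of quality, while purely human evaluation is costly, subjective, and hard to scale. LLM-based evaluators scale but lack explicit grounding in human-defined evaluation principles, which leads to bias and inconsistency.

Result: Extensive experiments on text and image generation show that QQJ aligns substantially better with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. QQJ is also more stable across repeated evaluations and better at diagnosing critical failure modes such as hallucination and intent mismatch.

Insight: The novelty is separating the definition of quality from its execution: evaluation is anchored in expert-designed rubrics and LLM evaluators are calibrated with a small set of high-quality annotations, which makes structured qualitative judgment operational at scale and provides a practical basis for reliably evaluating generative AI systems.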

Abstract: The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.


[272] Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models cs.AI | cs.CL

Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang

TL;DR: 本文针对大型推理模型(LRMs)提出了一种名为熵-梯度反转的几何指纹,用于表征模型内部推理机制,并基于此提出了相关性正则化组策略优化(CorR-PO)方法,通过将反转特征嵌入强化学习奖励正则化来提升推理性能。实验表明,该方法在多个推理基准测试中超越了现有最优基线。

Details

Motivation: 解决大型推理模型领域存在的两个关键问题:一是令牌级行为分析与内部推理机制之间的根本差距;二是依赖昂贵外部验证器的强化学习优化方法的不稳定性。

Result: 在多个模型规模和推理基准测试上的广泛实验表明,CorR-PO方法一致性地超越了现有最优基线,且更强的熵-梯度反转特征与更优的推理性能直接相关。

Insight: 创新点在于首次形式化定义了熵-梯度反转这一几何指纹作为推理能力的标志,并利用其设计了一种无需外部验证器的强化学习正则化方法,为理解LRM内部机制提供了新视角。

Abstract: The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.
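
A correlation between token entropy and logit gradients can be measured in a few lines once both quantities are fixed. The sketch below is only one possible reading: it assumes "logit gradient" means the norm of the cross-entropy gradient with respect to the logits (which equals the softmax probabilities minus the one-hot target), and it uses the Pearson coefficient. The abstract does not state which gradient or which correlation measure the paper uses, and the sign obtained on any model is an empirical matter that this toy code makes no claim about.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def ce_logit_grad_norm(probs, target):
    """L2 norm of d(cross-entropy)/d(logits), which is p - onehot(target)."""
    return math.sqrt(
        sum((p - (1.0 if i == target else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def entropy_gradient_correlation(token_logits, targets):
    """Correlate per-token entropy with per-token logit-gradient norm."""
    probs = [softmax(row) for row in token_logits]
    entropies = [entropy(p) for p in probs]
    grad_norms = [ce_logit_grad_norm(p, t) for p, t in zip(probs, targets)]
    return pearson(entropies, grad_norms)
```

A value near -1 from `entropy_gradient_correlation` would correspond to the inversion pattern the abstract describes, under the assumptions stated above.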


[273] SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning cs.AI | cs.CL | cs.IR

Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai

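TL;DR: This paper proposes SD-Search, an on-policy hindsight self-distillation technique for search-augmented reasoning. A single model plays both student and teacher, so step-level supervision comes from the policy itself with no external teacher model or extra annotation.

Details

Motivation: The performance of search-augmented reasoning agents depends on the quality of each query, but under outcome-reward reinforcement learning all search decisions share one trajectory-level reward and receive no step-level credit assignment. Existing process-supervision methods depend on a much larger external model or on annotations from a stronger system.

Result: The abstract gives no quantitative results or benchmark figures; it states that the method runs inside the standard RL training loop without external model inference, an auxiliary annotation pipeline, or an additional training stage.

Insight: The core novelty is on-policy hindsight self-distillation: the model, conditioned on a hindsight summary of search queries and outcomes, produces a teacher distribution that gives the student step-level supervision. This addresses credit assignment for search decisions while remaining fully self-contained within policy training.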

Abstract: Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen–Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO’s coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.
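
The token-level Jensen-Shannon divergence mentioned above has a compact definition: the average of the two KL divergences from each distribution to their midpoint. The sketch below computes it for explicit probability lists and averages it over positions flagged as search-query tokens. The function names, the boolean mask, and the use of a plain mean are illustrative assumptions; the abstract does not say how positions are weighted or how the term is combined with the GRPO objective.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (in nats)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def query_position_loss(student_dists, teacher_dists, is_query_token):
    """Mean JSD over positions marked as search-query tokens."""
    values = [
        js_divergence(s, t)
        for s, t, flag in zip(student_dists, teacher_dists, is_query_token)
        if flag
    ]
    return sum(values) / len(values) if values else 0.0
```

Unlike KL divergence, the Jensen-Shannon divergence is symmetric and bounded above by `log 2`, so it stays finite even when the student assigns zero probability to a token the teacher prefers.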


[274] GIM: Evaluating models via tasks that integrate multiple cognitive domains cs.AI | cs.CL | cs.LG

Rohit Patel, Alexandre Rezende, Steven McClain

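TL;DR: This paper presents the Grounded Integration Measure (GIM), a benchmark of 820 original problems whose difficulty comes from integrating multiple cognitive operations (such as constraint satisfaction and state tracking) rather than from specialized knowledge or purely abstract reasoning. The authors calibrate an item response theory (IRT) model over 28 models to obtain robust ability estimates, publish a leaderboard covering 22 models and 47 test configurations, and study the trade-off between test-time compute and model capability.

Details

Motivation: Existing LLM benchmarks either lean heavily on knowledge recall (e.g., GPQA) or strip away practical context in favor of abstract reasoning (e.g., ARC-AGI), which either conflates memorization with capability or divorces reasoning from practical use. The paper aims for a more realistic benchmark that does not depend on specialized expertise, built from tasks that integrate multiple cognitive operations over accessible knowledge.

Result: On GIM, the authors calibrate a 2-parameter logistic IRT model on more than 200k prompt-response pairs, producing robust ability estimates that order models correctly even when raw accuracy is distorted by errors or missing data. The leaderboard covers 22 models and 47 test configurations, accompanied by an extensive study of how test-time compute trades off against model capability.

Insight: The novelty is building difficulty from the integration of cognitive operations rather than from depth of knowledge or pure abstraction, and using an IRT model for more robust benchmark evaluation. Viewed objectively, grounding reasoning tasks in practical contexts and releasing the calibration framework and the public problems give model evaluation a new perspective and methodology.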

Abstract: As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration: individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, and the majority use rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous-response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.
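
The reason an IRT model can rank configurations despite missing data is that ability is estimated only from the items actually observed, each weighted by its own discrimination and difficulty. The sketch below uses the standard 2PL response function, P = 1 / (1 + exp(-a(θ - b))), and a simple grid-search maximum-likelihood estimate of the ability θ. Treating a continuous score in [0, 1] as a fractional Bernoulli outcome is an assumption made here for illustration; the paper's continuous-response formulation and fitting procedure are not described in the abstract, and the item parameters in any usage are made up.

```python
import math


def p_2pl(theta, a, b):
    """2PL probability of success for ability theta on item (a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, scores):
    """Log-likelihood of scores in [0, 1]; None marks a missing response."""
    total = 0.0
    for (a, b), s in zip(items, scores):
        if s is None:
            continue  # missing items simply drop out of the likelihood
        p = p_2pl(theta, a, b)
        total += s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
    return total


def estimate_theta(items, scores, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of ability."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, scores))
```

With items given as `(discrimination, difficulty)` pairs, a configuration that skipped an item is compared with others through the items it did answer, rather than being penalized as if the skipped item were wrong.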


[275] Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents cs.AI | cs.CV | cs.CY | cs.LG

Jiaxing Li, Hao Fang, Chi Xu, Miao Zhang, Jiangchuan Liu

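TL;DR: This paper proposes an architecture called "knowledge-adaptive edge expert agents" to address the limited performance of on-device AI under environmental variability in wild ecological monitoring. The method separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base; an explicit knowledge base replaces the conventional practice of implicitly encoding expert knowledge in model parameters, reducing reliance on cloud resources and continuous data upload.

Details

Motivation: Manual surveys for ecological monitoring in the wild are resource-intensive, existing on-device AI performs poorly under environmental variability, and cloud-based model retraining consumes the limited power and network connectivity of remote deployments.

Result: The abstract reports no quantitative experimental results, benchmarks, or performance levels (such as SOTA).

Insight: The claimed novelty is the shift from model adaptation to knowledge adaptation: separating perception from reasoning and using an explicit, dynamic knowledge base to improve adaptability and support knowledge sustainability. Viewed objectively, the architecture (visual encoder plus knowledge base) and the emphasis on ethical cross-disciplinary collaboration with biologists and Indigenous communities are the elements most worth borrowing.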

Abstract: Rapid biodiversity loss underscores the urgency of effective monitoring, yet manual surveys remain resource-intensive. While on-device AI offers a scalable alternative, its performance in the wild is often challenged by environmental variability. Current methods rely heavily on cloud resources, which requires continuous uploading of field data for model retraining. This approach is unsuitable for remote deployments because it consumes limited power and network connectivity. To address these constraints, this research proposes a shift from model adaptation to knowledge adaptation. We introduce an architecture that separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base. We use an explicit knowledge base instead of implicitly encoding expert knowledge into model parameters. This method also supports knowledge sustainability by preserving expert insights in a structured form. Through cross-disciplinary collaboration with biologists and Indigenous communities, this work advances ethical AI co-development, fostering responsible and culturally informed ecosystem management.


[276] Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation cs.AI | cs.CV | cs.RO

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

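TL;DR: This paper presents the first systematic study of reasoning faithfulness in Vision-Language-Action (VLA) driving models. Analyzing 300 inferences of Alpamayo-R1-10B across 100 diverse PhysicalAI-AV scenarios, it finds that the natural-language Chain-of-Causation rationales the model outputs alongside its trajectories are significantly unfaithful.

Details

Motivation: To examine whether the reasoning output by VLA driving models is faithful to the real scene, that is, whether the model's causal explanations match what actually happens, in order to assess safety and reliability.

Result: Overall reasoning fidelity is only 42.5%. In one-third of pedestrian-relevant scenes, 94 pedestrians were missed. Under mild visual perturbations, 97.7% of trajectories are fragile. Mean reasoning-action consistency is only 48.3%, with 53.3% of inferences showing low consistency, including 37.9% of stop-claimed cases in which the model keeps driving.

Insight: The novelty lies in the first systematic evaluation of reasoning faithfulness in VLA models. The paper formalizes faithfulness information-theoretically, proposes verification criteria for entity and action fidelity, and outlines a four-component safety architecture built on the results, offering a new perspective on improving the interpretability and safety of autonomous-driving models.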

Abstract: We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that output natural-language rationales with trajectories may be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 missed pedestrians in one-third of pedestrian-relevant scenes; (iii) 97.7% trajectory fragility under mild visual perturbations; and (iv) only 48.3% mean reasoning-action consistency, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.


[277] AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment cs.AI | cs.CV | cs.LG

Kuei-Chun Kao, Daixuan Huo, Yuanhao Ban, Cho-Jui Hsieh

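TL;DR: This paper proposes AutoRubric-T2I, a robust rule-based reward-model framework for text-to-image (T2I) alignment. It automatically synthesizes and selects explicit rubrics that guide a vision-language model (VLM) in judging images, yielding high-quality, interpretable reward signals while greatly reducing the need for large-scale human-annotated preference data.

Details

Motivation: Existing T2I reward models are usually trained as Bradley-Terry preference models on large-scale human preference data, which makes them costly to train, hard to adapt, and opaque in their evaluation criteria. VLM judges can give finer-grained assessments, but their manually designed or heuristically generated scoring rules may not reliably reflect human preferences.

Result: On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward-model baselines. On downstream T2I tasks (TIIF and UniGenBench++), using it as the RL reward in the Flow-GRPO pipeline on diffusion models improves generation quality over scalar reward models.

Insight: The novelty is the first framework in T2I that learns rubrics automatically. Candidate rubrics are synthesized from reasoning traces over preference pairs, and an L1-regularized logistic regression refiner selects the Top-N most discriminative ones, giving data-efficient, interpretable, and strong reward modeling.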

Abstract: Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ a $\ell_1$-Regularized Logistic Regression Refiner, which selects the Top-$N$ most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.
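
The rubric-selection step described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the authors' refiner: it fits an L1-regularized logistic regression with plain proximal gradient descent on a toy matrix of pairwise rubric-score differences, and the data, the hyperparameters, and the names `fit_l1_logistic` and `top_n_rubrics` are all assumptions.

```python
import math

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)

def fit_l1_logistic(X, y, lam=0.05, lr=0.5, steps=2000):
    """L1-regularized logistic regression via proximal gradient descent.
    X[i][j] is the score difference (image A minus image B) under rubric j;
    y[i] is 1 if annotators preferred image A, else 0."""
    n, k = len(X), len(X[0])
    w = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(k):
                grad[j] += (p - yi) * xi[j] / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w

def top_n_rubrics(w, n):
    """Indices of the n largest nonzero weights by magnitude."""
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return [j for j in ranked[:n] if w[j] != 0.0]

# Toy data: rubric 0 tracks the preference, rubric 1 is noise, rubric 2 is constant.
X = [[2.0, 0.1, 0.0], [1.0, -0.1, 0.0], [-2.0, 0.1, 0.0], [-1.0, -0.1, 0.0]]
y = [1, 1, 0, 0]
weights = fit_l1_logistic(X, y)
selected = top_n_rubrics(weights, n=2)
```

The L1 penalty drives the weights of uninformative rubrics exactly to zero, so the selection returns fewer than N rubrics when fewer than N carry signal.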


[278] SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain cs.AI | cs.CV | cs.LG

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen

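TL;DR: This paper introduces SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming vertical. It contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each based on a paused game scene from a real short-video clip. SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces to support fair and reproducible evaluation.

Details

Motivation: Existing benchmarks rarely evaluate multimodal large language models in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge.

Result: The evaluation covers representative paradigms ranging from direct QA and RAG workflows to Plan-Act-Replan agents and learned search models. The best open-source direct-QA model reaches 66.4% accuracy, the best practical agent reaches 79.1%, and oracle knowledge reaches 95.4%, revealing a large gap between model-only answering, practical agentic search, and oracle knowledge.

Insight: The novelty is the first short-video frame search benchmark for the gaming vertical, together with a controlled offline retrieval environment for fair evaluation. Viewed objectively, the study exposes bottlenecks of current multimodal agents in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior (over-search, answer-only shortcuts, and retrieval-induced misleading).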

Abstract: Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.


[279] TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning cs.AI | cs.CV | cs.RO

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li

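TL;DR: This paper proposes TaskGround, a framework for the challenges household agents face in full-scene household reasoning. It compresses the complete household scene into task-relevant scene slices, infers executable task structure, and compiles that structure into skill-level action sequences, raising task success rates while remaining training-free and model-agnostic.

Details

Motivation: In real deployments, household agents must efficiently and accurately infer executable task structure from a complete household scene and an ambiguous household request, while working within privacy and local-compute constraints that favor compact open-weight models with limited long-context reasoning ability.

Result: On the proposed FullHome evaluation suite (400 human-validated household tasks), TaskGround substantially improves task success rates for both proprietary and open-weight models. For example, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while cutting total input-token cost by up to 18x.

Insight: The paper identifies executable task-structure inference as the central bottleneck in full-scene household reasoning and proposes a structured Ground-Infer-Execute framework. Its core insight is that structured compression of scene information can make compact local models far more effective in practical deployment, which suggests a new direction for agent reasoning in resource-constrained settings.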

Abstract: In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.


[280] Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks cs.AI | cs.CV

Yajing Zhou, Xiangyu Kong

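TL;DR: Targeting the "Cartesian Illusion" that limits the embodied spatial intelligence of multimodal large language models, this paper proposes a novel audio-visual task that tests their second-order Theory of Mind. The authors design an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought that guides the model through a "geometric-to-semantic" projection for spatial reasoning under perceptual bottlenecks. Experiments show that the method clearly outperforms egocentric and allocentric baselines.

Details

Motivation: Multimodal large language models perform well at general reasoning, but their embodied spatial intelligence is hampered by the "Cartesian Illusion": a reliance on text-based probability distributions without grounded 3D topological understanding. The limitation is most visible in multi-agent environments, which require not only scene perception but also second-order Theory of Mind, where Agent A must infer Agent B's belief about the environment given B's physical orientation and sensory limits.

Result: Extensive evaluation shows that current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguity, establishing a rigorous zero-shot baseline of 42% accuracy. In contrast, the proposed sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines.

Insight: The core innovation is the Anchor-Based Embodied Spatial Decomposition Chain-of-Thought reasoning framework. It abandons rigid, rule-based coordinate transformations and instead guides the model to first establish B's local coordinate system and then dynamically weight the visual and auditory modalities depending on whether A falls within B's visual frustum, realizing a "geometric-to-semantic" projection. This establishes a foundational paradigm for epistemic, modality-aware reasoning in embodied AI.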

Abstract: While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a “Cartesian Illusion” - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B’s belief about the environment, governed strictly by Agent B’s physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B’s estimation of A’s relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a “geometric-to-semantic” projection, forcing it to first establish B’s local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B’s visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.


cs.SE

[281] ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse cs.SE | cs.AI | cs.CL | cs.CR

Simiao Liu, Fang Liu, Li Zhang, Yang Liu, Yinghao Zhu

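TL;DR: This paper presents ContraFix, an LLM-agent framework for automated vulnerability repair that combines differential runtime evidence with reusable repair skills. It has three core components: the Mutator constructs PoC variants that straddle the failure boundary, the Analyzer uses state probes to compare crashing and non-crashing executions and produces a repair specification, and the Patcher converts the specification into verified source-code patches. After a successful repair, the repair specification and mutation strategy are stored in a two-track skill base and reused for similar future cases through a three-tier retrieval policy.

Details

Motivation: The main failure mode of existing LLM-based repair agents on real-world vulnerabilities is semantic misunderstanding: the chosen repair direction does not match the root cause. There are two reasons. First, agents usually reason from the failing execution alone, and a crash report does not reveal which of the many candidate variables or state transitions near the fault site separates crashing behavior from safe execution, so agents produce symptom-oriented patches rather than causal fixes. Second, evidence collected for one vulnerability is rarely retained, so similar later cases must be diagnosed from scratch.

Result: On SEC-Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT-5-mini resolves 84.0% and 73.8% of the tasks, respectively, reaching state-of-the-art performance on both benchmarks at less than one-third of the cost of the strongest comparable baseline.

Insight: The core innovation is coupling differential runtime evidence with reusable repair skills. Constructing PoC variants that straddle the failure boundary yields richer contrastive runtime information and therefore more accurate repair specifications. A two-track skill base of repair specifications and mutation strategies, reused through a three-tier retrieval policy, improves repair efficiency and lets the system keep improving as experience accumulates. Viewed objectively, combining dynamic analysis, specification generation, and an experience base offers a systematic way to address semantic misunderstanding and repeated work in LLM agents for complex code repair.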

Abstract: Large language model (LLM) agents are increasingly used for automated vulnerability repair (AVR), where repository-level reasoning enables them to inspect context and produce source-code patches. However, recent empirical results show that these agents still struggle with real-world vulnerabilities. Their main failure mode is semantic misunderstanding: choosing a repair direction that does not match the root cause. We identify two reasons for this gap. Existing agents usually reason from the failing execution alone. A crash report can pinpoint where the program failed, but it does not reveal which variable or state transition, among many candidates near the fault site, separates the crashing behavior from safe execution. As a result, agents often produce symptom-oriented patches instead of causal fixes. Moreover, evidence collected for one vulnerability is rarely retained, so similar cases in later repositories must be diagnosed again from scratch. We present ContraFix, an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non-crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two-track skill base containing repair specifications and mutation strategies, which are retrieved through a three-tier policy for future instances. On SEC-Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT-5-mini resolves 84.0% and 73.8% of the tasks, respectively, achieving state-of-the-art performance on both benchmarks while costing less than one-third of the strongest comparable baseline.


eess.SP

[282] Learning Displacement-Aware WiFi Representations for Weakly Supervised Relative Localization eess.SP | cs.AI | cs.CV

Tzu-Ti Wei, Po-Cheng Chen, Yu-Chee Tseng, Jen-Jee Chen

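TL;DR: This paper proposes Intersection Pathway (IP), a cross-modal learning framework for weakly supervised relative localization from WiFi fingerprints. It aligns WiFi fingerprint traces and displacement traces obtained from inertial sensing in a shared latent space and enforces an additive structure on that space, so the relative displacement between two fingerprints can be inferred directly without predicting absolute positions.

Details

Motivation: Most existing WiFi fingerprint-based indoor localization methods focus on absolute positioning and rely on dense coordinate annotations, which are costly. This paper studies a fundamentally different problem, relative localization, which directly estimates the displacement between two WiFi fingerprint traces, and uses stepwise motion vectors from inertial sensing as weak supervision to reduce annotation overhead.

Result: Experiments on a synthesized dataset derived from real measurements show that the method learns displacement-aware WiFi representations and achieves accurate relative localization across varying displacement ranges. The learned model can also be extended to few-shot absolute localization with sparse anchors.

Insight: The core innovation is a cross-modal alignment framework that enforces an additive latent structure matching physical motion composition, which lets the model infer relative displacement directly from WiFi fingerprints. Viewed objectively, combining relative localization with weak supervision from motion vectors offers a new way to lower the deployment cost of localization systems.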

Abstract: WiFi fingerprint-based indoor localization has been widely studied, but most existing approaches focus on absolute positioning and rely on dense coordinate annotations, which are costly to obtain at scale. In this paper, we study a fundamentally different problem: relative localization, where the goal is to directly estimate the displacement between two WiFi fingerprint traces without predicting their absolute positions. To reduce annotation overhead, we adopt weak supervision in the form of stepwise motion vectors obtained from inertial sensing. We propose Intersection Pathway (IP), a cross-modal learning framework that aligns fingerprint traces (f-traces) and displacement traces (d-traces) in a shared latent space. The key idea is to enforce an additive structure in the latent space, such that latent addition and subtraction correspond to physical motion composition, enabling direct relative-displacement inference. Experiments on a synthesized dataset derived from real measurements demonstrate that the proposed method learns displacement-aware WiFi representations and achieves accurate relative localization across varying displacement ranges. Furthermore, the learned model can be extended to few-shot absolute localization with sparse anchors.


cs.LG

[283] Reducing Credit Assignment Variance via Counterfactual Reasoning Paths cs.LG | cs.AI | cs.CL

Fei Ding, Yongkang Zhang, Yeling Peng, Youwei Wang, Guoxiong Zhou

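TL;DR: This paper proposes a credit-assignment framework based on counterfactual comparison. By sampling multiple reasoning trajectories for the same input, it turns sparse terminal rewards into step-sensitive learning signals, addressing the training instability that high credit-assignment variance causes in multi-step reasoning with large language models.

Details

Motivation: Reinforcement learning for multi-step reasoning with large language models relies on sparse terminal rewards, which leads to poorly conditioned credit assignment, high gradient variance, unstable training, and many ineffective updates. The paper aims to improve the credit-assignment mechanism.

Result: On mathematical and code reasoning benchmarks, the proposed Implicit Behavior Policy Optimization (IBPO) significantly improves training stability and the performance upper bound.

Insight: The novelty is an implicit process-level advantage estimator built from counterfactual comparison, which converts sparse rewards into step-sensitive signals and points to a new direction for unlocking the performance potential of large language models.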

Abstract: Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards, leading to poor credit assignment conditions where the final feedback is evenly propagated across all intermediate decisions. This results in high gradient variance, unstable training, and numerous ineffective updates, ultimately causing the model to fail and preventing sustained improvement. We introduce a counterfactual comparison-based credit assignment framework, which samples multiple reasoning trajectories under the same input. By treating their differences as an implicit approximation of alternative decisions, we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals. Based on this, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training stability and performance upper bounds on mathematical and code reasoning benchmarks, pointing to a promising direction for unlocking the performance potential of LLMs.


[284] DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models cs.LG | cs.AI | cs.CL

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Lokesh Boominathan, Manuel R. Ciosici

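TL;DR: This paper proposes DACA-GRPO, a lightweight, plug-and-play enhancement for reinforcement learning in diffusion large language models. Its two complementary mechanisms, Denoising Progress Scores and Stratified Masking Likelihood, address the lack of temporal credit assignment over the denoising trajectory and the use of biased, high-variance likelihood estimates in existing methods.

Details

Motivation: Existing RL methods for diffusion large language models treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. This causes two fundamental weaknesses: no temporal credit assignment across the denoising trajectory, and systematic bias in the mean-field likelihood estimates used for policy optimization.

Result: Applied on top of three GRPO base methods, DACA-GRPO yields consistent improvements on seven benchmarks covering mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6 percentage points on math reasoning, 7.4 on code generation, 36.3 on constraint satisfaction, and 5.9 on JSON schema adherence.

Insight: The core innovations are Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing mean-field bias. Together they give finer-grained credit assignment over the denoising process and more accurate likelihood estimates.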

Abstract: Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing the mean-field bias. Applied on top of three GRPO base methods, DACA-GRPO achieves consistent improvements across seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction, and 5.9pp on JSON schema adherence.
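
One way to picture the Stratified Masking Likelihood idea is the Python sketch below. It is an assumption-laden illustration rather than the DACA-GRPO implementation: the random round-robin partition, the `None` mask marker, and the caller-supplied `logprob_fn(context, position, token)` scoring function are all hypothetical.

```python
import random

def make_strata(seq_len, num_strata, seed=0):
    """Randomly partition token positions into num_strata disjoint strata."""
    positions = list(range(seq_len))
    random.Random(seed).shuffle(positions)
    return [sorted(positions[i::num_strata]) for i in range(num_strata)]

def stratified_log_likelihood(tokens, num_strata, logprob_fn, seed=0):
    """Sum per-token log-probabilities, masking one stratum at a time so each
    token is scored with every other stratum visible as context."""
    total = 0.0
    for stratum in make_strata(len(tokens), num_strata, seed):
        masked = set(stratum)
        # None marks a masked position; all other tokens stay visible.
        context = [None if i in masked else tok for i, tok in enumerate(tokens)]
        for pos in stratum:
            total += logprob_fn(context, pos, tokens[pos])
    return total

# Dummy scorer standing in for a diffusion LM: every token gets log-prob -1.
tokens = list("abcdefghij")
strata = make_strata(len(tokens), num_strata=3)
score = stratified_log_likelihood(tokens, 3, lambda ctx, pos, tok: -1.0)
```

With K strata, each token is predicted while roughly (K-1)/K of the sequence remains unmasked, which is the sense in which "most of the sequence" serves as context; the cost is K forward passes per sequence in this sketch.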


[285] The Unlearnability Phenomenon in RLVR for Language Models cs.LG | cs.CL

Yulin Chen, He He, Chen Zhao

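TL;DR: This paper reveals an "unlearnability" phenomenon in training language models with Reinforcement Learning with Verifiable Reward (RLVR): among hard examples the model initially struggles with, a subset is never learned even when correct rollouts are available. Gradient analysis shows that these unlearnable examples have a fundamental representation problem, marked by low gradient similarity and ungeneralizable reasoning patterns, and that existing optimization, sampling, and data-augmentation techniques do little to mitigate it.

Details

Motivation: RLVR has proven effective at improving the reasoning ability of large language models, but its learning dynamics remain underexplored. The paper aims to expose and systematically analyze the counterintuitive unlearnability phenomenon in RLVR training.

Result: Using cross-example gradient analysis, the study characterizes the representation flaws of unlearnable examples (low gradient similarity, ungeneralizable reasoning patterns) and shows that existing optimization, sampling, and data-augmentation techniques fail to resolve the problem.

Insight: The novelty is the first systematic characterization of unlearnable data in RLVR training, which reveals fundamental limitations of current RL approaches for reasoning tasks. Viewed objectively, diagnosing representation flaws through gradient-similarity analysis offers a new lens on where model learning gets stuck.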

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.


[286] TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition cs.LG | cs.AI | cs.CL

Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu

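TL;DR: This paper proposes TIER (Trajectory-Invariant Execution Rewards), a reward framework that addresses sparse rewards and dependence on reference trajectories when training large language models with reinforcement learning for multi-step tool composition. It derives supervision directly from function schemas and runtime execution and decomposes the reward into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback.

Details

Motivation: Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalize valid alternative execution paths, and limit scalability in multi-step composition settings.

Result: On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves over 90% accuracy at every depth, whereas trajectory-supervised rewards collapse beyond 4 steps. It also shows consistent gains on benchmarks such as BFCL v3 and NestFUL.

Insight: The novelty is a reward framework that does not depend on reference trajectories. Supervision comes directly from function schemas and runtime execution, so multiple valid solution paths receive credit and the reward adapts to evolving tool interfaces. The multi-level decomposition (format, schema, execution, answer) supplies fine-grained supervision for compositional reasoning, which the ablations identify as necessary for the reported performance.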

Abstract: Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.
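
The four-part reward decomposition can be illustrated with a short Python sketch. Everything specific here is an assumption made for the example and not taken from the paper: the JSON call format, the type-map schema, the equal-thirds-plus-answer weights, and the names `step_reward` and `trajectory_reward`.

```python
import json

def step_reward(raw_call, tools):
    """Score one tool call on format, schema, and execution; also return
    the call's result (None if the call never ran successfully)."""
    scores = {"format": 0.0, "schema": 0.0, "execution": 0.0}
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call["arguments"]
    except (ValueError, KeyError, TypeError):
        return scores, None
    if not isinstance(name, str) or not isinstance(args, dict):
        return scores, None
    scores["format"] = 1.0  # parsed into a well-formed call
    tool = tools.get(name)
    if tool is None or set(args) != set(tool["schema"]):
        return scores, None
    if any(not isinstance(args[k], t) for k, t in tool["schema"].items()):
        return scores, None
    scores["schema"] = 1.0  # argument names and types match the schema
    try:
        result = tool["fn"](**args)
    except Exception:
        return scores, None
    scores["execution"] = 1.0  # the call ran without raising
    return scores, result

def trajectory_reward(raw_calls, answer, gold, tools, weights=(0.2, 0.2, 0.2, 0.4)):
    """Average the per-step components and add final-answer correctness.
    No reference trajectory is consulted, so any valid path earns credit."""
    totals = {"format": 0.0, "schema": 0.0, "execution": 0.0}
    for raw in raw_calls:
        scores, _ = step_reward(raw, tools)
        for key in totals:
            totals[key] += scores[key]
    n = max(len(raw_calls), 1)
    w_format, w_schema, w_exec, w_answer = weights
    correct = 1.0 if answer == gold else 0.0
    return (w_format * totals["format"] / n + w_schema * totals["schema"] / n
            + w_exec * totals["execution"] / n + w_answer * correct)

# Hypothetical tool registry with a single integer adder.
tools = {"add": {"schema": {"a": int, "b": int}, "fn": lambda a, b: a + b}}
good = '{"name": "add", "arguments": {"a": 1, "b": 2}}'
bad = '{"name": "add", "arguments": {"a": "x"}}'
```

For example, `trajectory_reward([good], 3, 3, tools)` earns every component, while `trajectory_reward([bad], 0, 3, tools)` earns only the format share, which shows how a partially correct call still receives a graded signal.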


[287] FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation cs.LG | cs.CL

Ramakrishnan Sathyavageeswaran

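TL;DR: This paper proposes FIM-LoRA, which allocates ranks to LoRA adapters in a task-informed way using gradient-variance estimates at calibration time. Before fine-tuning, it runs a small number of backward passes, computes the gradient variance of each layer as a proxy for layer informativeness, and redistributes the uniform rank budget proportionally. The result is a standard LoRA adapter with a per-layer rank pattern, with no extra parameters or training overhead.

Details

Motivation: Existing LoRA methods assign a uniform rank to every adapted weight matrix, ignoring the fact that layers contribute unequally to task adaptation. The paper aims for a more efficient allocation of parameters.

Result: On GLUE with DeBERTa-v3-base, FIM-LoRA is on par with standard LoRA at the same parameter budget (88.6 vs. 88.7). On commonsense reasoning with LLaMA-3-8B it reaches 68.5 vs. 68.7.

Insight: The novelty is a lightweight engineering solution (an efficient approximation of the diagonal of the empirical Fisher Information Matrix) for task-oriented rank allocation. The resulting rank maps are interpretable: value projections and early-to-middle layers receive higher rank, consistent with established findings on transformer layer roles.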

Abstract: Low-rank adaptation (LoRA) assigns a uniform rank to every adapted weight matrix - a practical convenience that ignores a fundamental reality: different layers contribute unequally to task adaptation. We address this with a lightweight engineering solution: before fine-tuning begins, run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally. The resulting adapter is a standard LoRA with a per-layer rank pattern - no new parameters, no training overhead, no changes to serving infrastructure. We implement this via an efficient approximation of the empirical Fisher Information Matrix (eFIM) diagonal, restricted to LoRA adapter matrices only, which reduces memory cost by approximately 256x compared to full-model Fisher estimation. On GLUE with DeBERTa-v3-base, FIM-LoRA matches LoRA (88.6 vs. 88.7) at the same parameter budget, and on commonsense reasoning with LLaMA-3-8B reaches 68.5 vs. 68.7 for LoRA. The per-layer rank maps are interpretable: value projections and early-to-middle layers consistently receive higher rank, consistent with established findings on transformer layer roles.
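
The allocation step above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's code: the per-entry population variance as the layer score, the minimum rank of 1, the largest-remainder rounding, and the names `layer_score` and `allocate_ranks` are all choices made here.

```python
from statistics import pvariance

def layer_score(grad_samples):
    """Mean per-entry variance of a layer's gradient across calibration passes.
    grad_samples: one flat list of gradient entries per backward pass."""
    per_entry = [pvariance(values) for values in zip(*grad_samples)]
    return sum(per_entry) / len(per_entry)

def allocate_ranks(scores, uniform_rank, min_rank=1):
    """Redistribute the budget uniform_rank * num_layers in proportion to the
    scores, keeping the total fixed and every layer at least at min_rank."""
    names = list(scores)
    budget = uniform_rank * len(names)
    spare = budget - min_rank * len(names)
    total = sum(scores.values())
    shares = {n: spare * scores[n] / total for n in names}
    ranks = {n: min_rank + int(shares[n]) for n in names}
    # Hand out the ranks lost to rounding, largest fractional part first.
    leftover = budget - sum(ranks.values())
    by_remainder = sorted(names, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for n in by_remainder[:leftover]:
        ranks[n] += 1
    return ranks

# Hypothetical gradient-variance scores for four projection matrices.
scores = {"q_proj": 1.0, "k_proj": 1.0, "v_proj": 4.0, "o_proj": 2.0}
ranks = allocate_ranks(scores, uniform_rank=8)
```

Keeping the total equal to the uniform budget is what lets the comparison with standard LoRA stay at the same parameter count.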


[288] Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning cs.LG | cs.CL

Shuo Liu, Ding Liu, Shi-Ju Ran

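TL;DR: This paper studies the token-level confidence trajectories that large language models (LLMs) produce during reasoning and finds that their geometric structure (the confidence geometry) is associated with trace-level final-answer correctness, without access to the question, the reasoning text, or external verifiers. It proposes NeuralConf, a lightweight estimator that assesses correctness from confidence trajectories, and shows that confidence-weighted answer aggregation improves performance under a fixed trace budget.

Details

Motivation: To determine whether the token-level confidence trajectories generated by LLMs relate to reasoning correctness, and thus whether the model's internal uncertainty dynamics contain a usable correctness signal that can improve reasoning evaluation without extra information.

Result: On GSM8K, MATH, and MMLU, low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces, and the degree of geometric separation (measured by the Davies-Bouldin index) correlates positively with downstream correctness-discrimination AUC. Under a fixed trace budget, confidence-weighted aggregation with NeuralConf outperforms majority voting, tail confidence, and other baselines.

Insight: The novelty is showing that the content-agnostic geometric structure encoded in LLM confidence dynamics acts as an intrinsic statistical signal of reasoning correctness, and proposing NeuralConf, a lightweight estimator that exploits the correctness-rich confidence dynamics in the tail of a trace. This opens a route to improving reasoning using only information the model already generates.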

Abstract: Large language models (LLMs) generate not only reasoning text, but also token-level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content-agnostic confidence geometry associated with trace-level final-answer correctness. Using only token-level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies–Bouldin index, consistently corresponds to higher correctness-discrimination AUC. We further show that correctness-related information is enriched in the tail of reasoning, suggesting that late-stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.
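
The contrast between majority voting, tail confidence, and confidence-weighted aggregation can be shown with a small Python sketch. The weights below are made-up stand-ins for a learned score: NeuralConf itself is not reproduced here, and the function names and the 25% tail fraction are assumptions.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Baseline: the most frequent final answer across sampled traces."""
    return Counter(answers).most_common(1)[0][0]

def tail_confidence(token_confidences, tail_fraction=0.25):
    """Static baseline: mean token confidence over the last part of a trace."""
    k = max(1, int(len(token_confidences) * tail_fraction))
    tail = token_confidences[-k:]
    return sum(tail) / len(tail)

def weighted_vote(answers, weights):
    """Confidence-weighted aggregation: sum the weights per answer, pick the max."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)

# Three traces under a fixed budget; weights stand in for per-trace scores.
answers = ["A", "A", "B"]
weights = [0.2, 0.2, 0.9]
```

Here `majority_vote(answers)` picks "A" while `weighted_vote(answers, weights)` picks "B", showing how a single high-confidence trace can overturn a low-confidence majority.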


[289] Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation cs.LG | cs.AI | cs.CL

Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li

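TL;DR: This paper proposes a decoupled view of knowledge distillation for large language models, splitting sequence-level KL divergence into two orthogonal axes, prefix source and token-level KL direction, which yields four valid distillation objectives. Gradient identities show that these objectives correspond to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and on-policy distillation. Experiments on math reasoning show that KL direction, prefix source, and training length induce accuracy-entropy, quality-compute, and accuracy-stability tradeoffs, respectively, which motivates two methods for improving distillation: KL mixing and an entropy-gated length curriculum.

Details

Motivation: Existing distillation methods for large language models (off-policy and on-policy distillation) implicitly couple the prefix source with the token-level KL direction. This coupling is not intrinsic and restricts the design space of distillation objectives. The paper decouples the two axes to provide a unified framework for understanding and designing more flexible objectives.

Result: Controlled experiments on math reasoning show that the four decoupled objectives exhibit different tradeoffs, both as standalone methods and as initializations for subsequent RL. The proposed KL mixing and entropy-gated length curriculum improve performance markedly: the curriculum raises Avg@k by 3.6 points and Pass@k by up to 5.8 points while cutting average response length by roughly 3x.

Insight: The core innovation is decoupling sequence-level KL into two independently selectable design axes, which establishes a theoretical link between the four distillation objectives and the classic methods (SFT, DAgger, offline RL, OPD). Viewed objectively, KL mixing and the entropy-based dynamic length curriculum are practical, effective ways to balance accuracy, diversity, compute cost, and stability in distillation.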

Abstract: Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
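
As a small illustration of the two axes, the Python sketch below computes forward and reverse token-level KL on a toy next-token distribution and lays out the four prefix-source and KL-direction combinations. The grid labels are a reading of the abstract's wording, and the distributions and names (`token_loss`, `OBJECTIVES`) are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_loss(teacher, student, direction):
    """Token-level distillation loss at one position of a given prefix."""
    if direction == "forward":   # KL(teacher || student): mode-covering
        return kl(teacher, student)
    if direction == "reverse":   # KL(student || teacher): mode-seeking
        return kl(student, teacher)
    raise ValueError(f"unknown KL direction: {direction}")

# The two decoupled axes: who generated the prefix, and which KL direction
# is applied at each token of that prefix.
OBJECTIVES = {
    ("teacher", "forward"): "off-policy SFT",
    ("student", "forward"): "DAgger-style on-policy SFT",
    ("teacher", "reverse"): "offline-RL-style distillation",
    ("student", "reverse"): "on-policy distillation (OPD)",
}

# Toy next-token distributions: the teacher is bimodal, the student is peaked.
teacher = [0.49, 0.49, 0.02]
student = [0.98, 0.01, 0.01]
forward_loss = token_loss(teacher, student, "forward")
reverse_loss = token_loss(student=student, teacher=teacher, direction="reverse")
```

On this toy pair the forward loss is much larger than the reverse loss, because forward KL penalizes the student for missing one of the teacher's two modes while reverse KL tolerates a student that commits to a single mode.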


[290] D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning cs.LG | cs.AI | cs.CL

Ru Zhang, Renda Li, Ziyu Ma, Weijie Qiu, Chongyang Tao

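TL;DR: This paper proposes D$^2$Evo, a dual difficulty-aware self-evolving reinforcement learning framework that targets the scarcity of effective data and the shifting difficulty of samples in data-efficient RL. It mines medium-difficulty anchors, trains a Questioner to generate questions of suitable difficulty, and jointly optimizes the Solver and the Questioner to progressively improve the reasoning ability of large language models.

Details

Motivation: RL training needs medium-difficulty samples, but such data are scarce and sample difficulty shifts as the model improves. Existing methods suffer from anchor-free generation, neglect of co-evolution, and difficulty mismatch.

Result: On mathematical reasoning benchmarks, D$^2$Evo outperforms existing methods using fewer than 2K real mathematical samples, and it generalizes strongly to general reasoning benchmarks.

Insight: The novelty is a dual difficulty-aware self-evolution framework in which the Solver and the Questioner co-evolve, adapting dynamically to changes in model capability so that training data are generated and used more efficiently. This addresses the core problems of difficulty matching and progressive learning.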

Abstract: Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver’s capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.


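[291] DISA: Offline Importance Sampling for Distribution-Matching LLM-RL cs.LG | cs.CL

Shaobo Wang, Yujie Chen, Yafeng Sun, Wenjie Qiu, Zhihui Xie

TL;DR: This paper proposes DISA (Decoupled Importance-Sampled Anchoring), an offline importance sampling technique for distribution-matching RL. By sampling trajectories offline and estimating the partition function from them, it decouples partition-function calibration from the RL loop and matches or outperforms existing baselines on multiple math and code benchmarks.

Details

Motivation: Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Existing methods learn the partition function online, so calibration errors directly distort policy updates and are hard to diagnose independently. A decoupled approach is needed to avoid this.

Result: On six math and three code benchmarks with two open-weight backbones, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms the reward-maximization baselines GRPO and GSPO on math averages, and exceeds LoRA-SFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines.

Insight: The novelty is the strict separation of partition-function estimation from policy learning in data, gradients, loss, and diagnostics: offline importance sampling fixes the partition-function estimate, so online calibration errors no longer feed directly into policy updates. The sensitivity studies follow the bias-variance trade-off predicted by the theory, giving a more stable and diagnosable framework for distribution-matching RL.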

Abstract: Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms reward-maximization baselines GRPO and GSPO on math averages, and exceeds LoRA-SFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.
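The offline estimation step described in the abstract can be sketched on a toy trajectory space. This is a minimal illustration, not DISA's implementation: it assumes the partition function has the form Z = sum_y p_ref(y) * exp(beta * r(y)), which the abstract does not state, and every name below (`log_partition_is`, `p_ref`, `q_prop`, `rewards`) is hypothetical.

```python
import math

def log_partition_is(samples, log_ref, log_prop, reward, beta):
    """Importance-sampling estimate of log Z for samples y ~ q (the proposal).

    Assumed target: Z = sum_y p_ref(y) * exp(beta * r(y)).
    """
    log_w = [log_ref(y) - log_prop(y) + beta * reward(y) for y in samples]
    shift = max(log_w)  # log-sum-exp shift for numerical stability
    mean_w = sum(math.exp(lw - shift) for lw in log_w) / len(log_w)
    return shift + math.log(mean_w)

# Toy "trajectory space" with three outcomes (all values are made up).
p_ref = {"a": 0.7, "b": 0.2, "c": 0.1}      # reference policy
q_prop = {"a": 0.5, "b": 0.25, "c": 0.25}   # offline proposal
rewards = {"a": 0.0, "b": 1.0, "c": 1.0}
beta = 2.0                                  # inverse temperature

# Offline draws whose frequencies happen to match q_prop exactly,
# so the estimate coincides with the exact value in this toy case.
offline_samples = ["a", "a", "b", "c"]

log_z_frozen = log_partition_is(
    offline_samples,
    lambda y: math.log(p_ref[y]),
    lambda y: math.log(q_prop[y]),
    lambda y: rewards[y],
    beta,
)
exact_log_z = math.log(sum(p_ref[y] * math.exp(beta * rewards[y]) for y in p_ref))
print(log_z_frozen, exact_log_z)
```

Once computed, `log_z_frozen` would be held fixed during policy optimization, which is what lets estimation error be inspected separately from policy learning.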


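[292] How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning cs.LG | cs.CL

Minghao Tian, Yunfei Xie, Chen Wei

TL;DR: This paper proposes Mu-GRPO, an efficient RL framework for large language models that organizes training into a small number of large sequential generation-optimization stages. The design substantially raises rollout staleness and cuts switching overhead, yielding about a 2x training speedup while matching or exceeding standard GRPO.

Details

Motivation: Standard GRPO in RL with verifiable rewards is trained in a low-staleness, near-on-policy regime that incurs large system overhead. The paper asks how much off-policy training GRPO can tolerate, with the goal of improving training efficiency.

Result: Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds standard GRPO while speeding up wall-clock training by about 2x, markedly improving the performance-efficiency trade-off.

Insight: The core novelty is a high-staleness, multi-stage training framework (Mu-GRPO) stabilized by two techniques: relaxed clipping, which preserves useful stale gradients, and negative-advantage veto, which removes unstable suffix updates in negative-advantage responses.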

Abstract: Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.
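As background for the abstract, the sketch below shows standard group-relative advantages and a PPO-style clipped surrogate, then widens the clip range to show why a stale rollout's gradient can survive. Treating "relaxed clipping" as a wider clip range is an assumption made for illustration; the abstract does not define it, and the negative-advantage veto is not sketched because its trigger is unspecified.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_low=0.2, clip_high=0.2):
    """PPO-style objective term for one token.

    ratio is pi_new / pi_old; stale rollouts tend to push it far from 1.
    """
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)

advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

stale_ratio = 1.5  # hypothetical ratio produced by a stale rollout
tight = clipped_surrogate(stale_ratio, 1.0)                    # clipped: no gradient
relaxed = clipped_surrogate(stale_ratio, 1.0, clip_high=0.6)   # unclipped: gradient kept
print(advantages, tight, relaxed)
```

With the tight range the term is constant in `ratio`, so the token contributes no gradient; with the wider range the same token still updates the policy.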


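[293] HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents cs.LG | cs.AI | cs.CL

Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

TL;DR: This paper proposes HINT-SD, a targeted hindsight self-distillation framework for training long-horizon LLM agents. It uses full-trajectory hindsight to identify the action segments associated with task failure and applies feedback-conditioned distillation only on those targeted action spans, addressing the difficulty of locating and correcting faulty intermediate actions under sparse rewards.

Details

Motivation: When training long-horizon LLM agents, sparse outcome rewards reveal only whether a task succeeded, not which intermediate actions caused the outcome or how to correct them. Existing approaches such as per-turn feedback generation are inefficient, and feedback applied at a fixed or misaligned turn often fails to supervise the actions that led to failure.

Result: Experiments on BFCL v3 and AppWorld show improvements of up to 18.80% over the dense per-turn feedback baseline, with 2.26x lower time per training step.

Insight: The core novelty is targeted distillation: hindsight analysis of the whole trajectory selects the action segments that contributed to failure, rather than supervising every turn indiscriminately. This shows that choosing where to distill is a key factor for both effectiveness and efficiency in long-horizon agent training.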

Abstract: Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26$\times$ lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.


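[294] The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought cs.LG | cs.CC | cs.CL

Moritz Brösamle, Stephan Eckstein

TL;DR: This paper analyzes the expressive power of standard transformer decoders with softmax attention and rounding of activations and attention weights, proving that they can simulate Turing machines when depth and width grow logarithmically with context length. As an intermediate step it constructs hardmax transformers with ternary activations that use Chain-of-Thought (CoT), then converts them to softmax versions, avoiding unrealistic parameter magnitudes or precision requirements. It also analyzes the summarized CoT paradigm, showing more efficient simulation with model size scaling in a space bound rather than a time bound, and tests the predictions empirically on a Sudoku reasoning task.

Details

Motivation: Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that are disconnected from practical models. This paper bridges the gap by analyzing the computational power of standard softmax transformers under finite precision.

Result: Empirical results on a Sudoku reasoning task show that the predictions of the low-precision analysis align better with learnability than those of prior high-precision results. The code is open source.

Insight: The novelty is using CoT and ternary-activation hardmax transformers as a bridge to extend Turing machine simulation to standard softmax architectures without unrealistic parameter requirements. The paper also reveals the space-efficiency advantage of summarized CoT, offering a new perspective on the theory of practical low-precision transformers.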

Abstract: Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.


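[295] AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning cs.LG | cs.AI | cs.CL

Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu

TL;DR: This paper proposes AMARIS, a memory-augmented rubric improvement system for rubric-based reinforcement learning (RL). It introduces a persistent evaluation memory that accumulates evaluation knowledge over the course of training and reuses it strategically, dynamically improving reward rubrics for more effective RL fine-tuning of large language models (LLMs).

Details

Motivation: Existing rubric-based reward shaping methods typically adapt rubrics from local signals at the current step, such as current-step rollouts or pairwise comparisons, but discard the diagnostics produced during evaluation. They cannot accumulate and reuse evaluation knowledge over the long term, which limits their ability to detect recurring suboptimal behaviors.

Result: Experiments in closed and open-ended domains show that AMARIS consistently outperforms the baselines. Ablations confirm that both static and dynamic memory retrieval contribute to the gain, that their combination gives the strongest results under a moderate retrieval budget, and that the asynchronous pipeline adds only about 5% time overhead.

Insight: The core novelty is bringing a persistent evaluation memory into rubric-based reward shaping, turning it from a stateless, per-step heuristic into an evidence-driven RL training loop. Rubrics are updated using historical context retrieved both statically (recent steps) and dynamically (semantic matches), enabling long-term accumulation and strategic reuse of evaluation knowledge.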

Abstract: Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals such as the rollouts from the current step or pairwise comparisons. However, these methods discard the diagnostics produced during evaluation after immediate use and prevent the long-term accumulation and strategic reuse of evaluation knowledge. This forces the system to re-derive evaluation principles from scratch, limits its ability to detect recurring suboptimal behaviors, and forfeits the curriculum-like progression that a persistent training history would naturally support. To address these limitations, we introduce AMARIS, which grounds rubric modifications in long-term training history. At each training step, AMARIS analyzes individual rollouts, aggregates findings into step-level summaries, retrieves relevant historical context from a persistent evaluation memory through both static (recent steps) and dynamic (semantically matched) retrieval, and updates rubrics based on these accumulated analyses. This procedure runs asynchronously alongside the normal RL loop with minimal overhead. Experiments across both closed and open-ended domains show that AMARIS consistently outperforms the baselines. Ablation studies show that static and dynamic memory retrieval each contribute to the performance gain, that their combination provides the strongest results, with moderate retrieval budgets sufficient to provide most of the gain, and that the entire pipeline adds only ~5% time overhead through asynchronous execution. These results show that persistent evaluation memory can transform rubric-based reward shaping from a stateless, per-step heuristic into an evidence-driven loop for RL training.


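[296] An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration cs.LG | cs.AI | cs.CL

Maja Pavlovic, Silviu Paun, Massimo Poesio

TL;DR: Through controlled experiments, this paper evaluates how human soft labels differ from synthetic labels in model learning, focusing on how human uncertainty affects calibration and training stability. Human soft labels improve accuracy, but more importantly they act as a regularizer that improves calibration on difficult samples and promotes stable convergence.

Details

Motivation: The paper aims to clarify the real value of human soft labels relative to synthetic labels. Prior studies conflate the benefits of soft labels with the correction of mislabeled data, so separating the two pins down the role of human uncertainty in AI alignment.

Result: Experiments on MNIST and a synthetic variant show that human soft labels improve accuracy and markedly improve calibration, especially on difficult samples, with greater training stability. Models trained on synthetic labels fail to align with human uncertainty.

Insight: The novelty is decoupling soft-label supervision from underlying label mode shifts, which reveals the core value of human soft labels as a regularizer. The approach provides a reproducible diagnostic framework for human-AI uncertainty alignment.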

Abstract: Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.


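[297] General Preference Reinforcement Learning cs.LG | cs.CL

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt

TL;DR: This paper proposes General Preference Reinforcement Learning (GPRL), which aims to bridge online reinforcement learning and preference optimization in large language model (LLM) alignment. Built on the General Preference Model (GPM), it uses the GPM's structured, multi-dimensional, intransitivity-aware preferences to guide policy updates, and a closed-loop drift monitor to prevent single-axis reward hacking. Experiments show strong and stable performance on several benchmarks.

Details

Motivation: LLM alignment training currently follows two tracks. Online reinforcement learning (RL) with verifiable rewards works for math and code but cannot handle open-ended tasks, while preference optimization handles open-ended generation but lacks the continuous exploration of online RL. Bridging the gap requires a verifier for open-ended quality, and a scalar reward model cannot assess quality that is multi-dimensional.

Result: Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0, outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench, and resists reward hacking over extended training runs.

Insight: The core novelty is replacing the scalar reward model with the structured, multi-dimensional, intransitivity-aware GPM, together with per-dimension advantage computation, normalized aggregation, and a closed-loop drift monitor. This enables stable and efficient online RL alignment on open-ended tasks.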

Abstract: Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from $\texttt{Llama-3-8B-Instruct}$, GPRL reaches a length-controlled win rate of $56.51\%$ on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
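The per-dimension advantage step in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the score matrix is made up, the fixed `weights` stand in for the context-dependent eigenvalues (whose computation the abstract does not describe), and the function names are hypothetical.

```python
import math

def per_dimension_advantages(scores, eps=1e-8):
    """scores[i][d]: score of response i on preference dimension d.

    Each dimension is normalized across the group on its own scale.
    """
    n, k = len(scores), len(scores[0])
    adv = [[0.0] * k for _ in range(n)]
    for d in range(k):
        column = [row[d] for row in scores]
        mean = sum(column) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
        for i in range(n):
            adv[i][d] = (column[i] - mean) / (std + eps)
    return adv

def aggregate(adv, weights):
    """Weighted average over dimensions; weights stand in for the eigenvalues."""
    total = sum(weights)
    return [sum(w * a for w, a in zip(weights, row)) / total for row in adv]

# Three responses, two dimensions with very different raw scales.
scores = [[10.0, 0.1],
          [20.0, 0.3],
          [30.0, 0.2]]
adv = per_dimension_advantages(scores)
combined = aggregate(adv, weights=[1.0, 1.0])
print(adv)
print(combined)
```

Without the per-dimension normalization, the first dimension's raw scale (tens) would swamp the second (tenths); after it, both contribute comparably to the combined advantage.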


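[298] Identifiable Token Correspondence for World Models cs.LG | cs.AI | cs.CV

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

TL;DR: This paper proposes a new transformer-based world model that introduces identifiable token correspondence to address temporal inconsistency in long-horizon rollouts, such as object duplication, disappearance, and transmutation. Next-frame prediction is formulated as structured probabilistic inference with latent token correspondence variables, so that each next-frame token is either copied from the previous frame or newly generated.

Details

Motivation: Transformer-based world models perform well in visual reinforcement learning but often show temporal inconsistency in long-horizon rollouts, mainly because they treat next-frame prediction purely as token generation without explicitly modeling token correspondence across time.

Result: The method achieves state-of-the-art performance on 4 challenging benchmarks. On Craftax-classic it reaches a return of 72.5% and a score of 35.6%, significantly surpassing the previous best of 67.4% and 27.9%.

Insight: The core novelty is recasting next-frame prediction as structured probabilistic inference with identifiable latent token correspondence variables. This gives world models an explicit way to model temporal consistency and may improve the stability of long-sequence generation.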

Abstract: Transformer-based world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without explicitly modeling correspondence between tokens across time. We formulate next-frame prediction as a structured probabilistic inference problem with latent token correspondence variables, deriving a model in which each next-frame token is explained either by copying a token from the previous frame or by generating a new token. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.


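[299] Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning cs.LG | cs.AI | cs.CV

Wen-Hsin Tsai, Chia-Ming Lee, Yuk-Ying Tung

TL;DR: This paper proposes the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core is the Cross-modal Affinity-guided Modality Alignment (CAMA) module. CAMA models inter-modal relationships with affinity matrices and strengthens cross-modal consistency through contrastive learning, adaptively suppressing uninformative modalities. The framework maps multimodal features, including facial action units, head pose, eye gaze, and interaction logs, into a unified semantic space to predict student collaboration satisfaction in game-based learning environments.

Details

Motivation: The paper addresses the challenge of automatically predicting student collaboration satisfaction in collaborative game-based learning, in particular modality degradation: individual modalities such as eye gaze are inconsistently informative across student cohorts, so implicit attention-based fusion yields brittle representations.

Result: Experiments with 50 middle school students in the EcoJourneys collaborative learning environment show that AAMLA outperforms unimodal baselines and prior cross-attention approaches under both standard and modality-degradation conditions. SHAP and t-SNE analyses confirm that CAMA produces robust, interpretable cross-modal representations.

Insight: The novelty is explicitly modeling inter-modal relationships with affinity matrices and enforcing cross-modal consistency through contrastive learning. This adaptively suppresses uninformative modalities instead of discarding them, improving the robustness and interpretability of multimodal representations.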

Abstract: Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.


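[300] Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE cs.LG | cs.AI | cs.CV | physics.ao-ph

Minjong Cheon

TL;DR: This paper proposes KAN-SAE, a sparse autoencoder that uses B-spline activations from Kolmogorov-Arnold Networks (KANs) to interpret the internal representations of AI weather prediction models such as Sonny. Replacing the conventional linear-superposition assumption with nonlinear activations substantially increases the number and quality of interpretable features found, and identifies climate features such as European heatwaves and western Pacific typhoons.

Details

Motivation: Existing mechanistic interpretability methods based on sparse autoencoders (SAEs) assume strictly linear feature superposition. This is a poor match for the highly nonlinear atmospheric dynamics encoded in modern transformers and makes it hard to decompose the internal representations of AI weather models.

Result: On the Sonny weather model, KAN-SAE discovers 975 alive features (72% more than the 566 of a linear baseline) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without supervision it identifies an interpretable European heatwave feature and a western Pacific typhoon tracker, the latter confirmed by causal steering experiments.

Insight: The novelty is bringing the learnable B-spline activations of KANs into the SAE encoder, letting each latent dimension develop its own nonlinear gating profile and capture nonlinear climate phenomena more effectively. This indicates that nonlinear activations matter for mechanistic interpretability of deep learning weather models, recovering features that linear baselines miss.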

Abstract: Deep learning weather prediction models achieve remarkable predictive skill yet remain largely opaque: we know little about how they represent physical climate phenomena internally. Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature superposition - a constraint ill-suited for the highly nonlinear atmospheric dynamics encoded in modern transformers. We introduce KAN-SAE, a sparse autoencoder whose encoder replaces the standard ReLU with learnable per-feature B-spline activations drawn from Kolmogorov-Arnold Networks (KANs), allowing each latent dimension to develop its own nonlinear gating profile. Applied to Sonny, KAN-SAE discovers 975 alive features (vs. 566 for a linear baseline, a 72% improvement) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without any climate supervision, KAN-SAE identifies an interpretable European heatwave feature spatially concentrated over western Europe, and a western Pacific typhoon tracker confirmed by causal steering experiments. Our results demonstrate that nonlinear activations are essential for mechanistic interpretability of deep learning weather prediction models, recovering climate features that remain invisible to linear baselines.


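[301] Domain Transfer Becomes Identifiable via a Single Alignment cs.LG | cs.AI | cs.CV

Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu

TL;DR: This paper proposes a new route to identifiability in domain transfer (DT): a single aligned sample suffices to identify the ground-truth transfer map, resolving the non-identifiability caused by measure-preserving automorphisms (MPAs). The method relies on a structural sparsity condition on the Jacobian, combined with distribution matching and a single anchor sample, which greatly reduces the supervision required. It also proposes an efficient randomized masked finite-difference regularizer for high-dimensional learning.

Details

Motivation: DT is widely used in tasks such as unsupervised image translation, single-cell analysis, and cross-platform medical imaging, but it is inherently non-identifiable because of MPAs, which leads to content-misaligned translation. Existing methods eliminate MPAs by jointly transferring multiple conditional distributions, but supervision labeling these conditionals is often unavailable in practice.

Result: Empirical results on synthetic and real-world DT tasks validate the theory, showing that the method identifies the ground-truth transfer map with less supervision than prior approaches.

Insight: The novelty is proving that, under a structural sparsity condition on the Jacobian, a single paired anchor sample plus distribution matching suffices for DT identifiability, which sharply reduces supervision. The randomized masked finite-difference regularizer avoids explicit Jacobian evaluation, so the method scales to high dimensions.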

Abstract: Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer – requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.


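[302] MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization cs.LG | cs.AI | cs.CV

Le Su, Xing Luo, Zhi Jin

TL;DR: This paper proposes MARR (Module-Adaptive Residual Reconstruction) for low-bit post-training quantization (PTQ). It assigns each module its own scaling coefficient to adaptively balance accumulated-error correction against residual-related Hessian-approximation bias, and uses a PID-based adaptive update strategy to optimize these coefficients stably.

Details

Motivation: Existing residual reconstruction-based quantization methods reduce accumulated error by introducing cross-layer residuals, but the residual term can itself introduce extra bias through the Hessian-approximation assumption, leading to suboptimal quantization. A method that adaptively trades off error correction against bias is needed.

Result: Experiments on several typical large language models (LLMs) and vision transformers (ViTs) show that MARR is effective under low-bit (≤4-bit) quantization, with gains of up to 20.2% on LLMs and relative gains of up to 4.6% on ViTs over state-of-the-art residual reconstruction methods.

Insight: The novelty is recognizing that the trade-off governed by the residual scaling coefficient is module-dependent. This motivates module-adaptive coefficients optimized stably by PID-based feedback control, a new angle on balancing bias and correction in quantization.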

Abstract: Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, our analysis shows that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over the residual reconstruction state-of-the-art methods. Code will be made publicly available upon acceptance.
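To make the feedback idea concrete, here is a minimal sketch of a generic PID loop refining one module's scaling coefficient. It is not MARR's update rule: the abstract does not say how reconstruction error is turned into the controller's input, so the sketch assumes the signed feedback is the negative finite-difference slope of a toy error curve. `toy_reconstruction_error`, the gains, and the optimum at 0.6 are all hypothetical.

```python
class PIDController:
    """Textbook discrete PID: output = kp*e + ki*sum(e) + kd*(e - e_prev)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def toy_reconstruction_error(alpha):
    # Hypothetical stand-in for one module's reconstruction error.
    return (alpha - 0.6) ** 2 + 0.1

def refine_coefficient(alpha=1.0, steps=200, h=1e-3):
    pid = PIDController(kp=0.3, ki=0.02, kd=0.05)
    for _ in range(steps):
        # Assumed feedback signal: negative slope of the error w.r.t. alpha.
        slope = (toy_reconstruction_error(alpha + h)
                 - toy_reconstruction_error(alpha - h)) / (2.0 * h)
        alpha += pid.step(-slope)
        alpha = min(max(alpha, 0.0), 1.0)  # keep the coefficient in [0, 1]
    return alpha

alpha_star = refine_coefficient()
print(alpha_star)
```

Starting from the unscaled residual (`alpha = 1.0`), the loop settles near the toy optimum without a grid search over coefficients, which is the role the abstract assigns to the PID-based strategy.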


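[303] PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics cs.LG | cs.AI | cs.CV | cs.RO

Xueyu Luan, Chenwei Shi

TL;DR: The paper proposes PH-Dreamer, a physics-driven world model based on the Port-Hamiltonian framework. It embeds implicit physical priors into a recurrent state space architecture to build energy-aware generative dynamics, improving the physical structure of latent imagination and the fidelity of the internal simulator.

Details

Motivation: World models built on recurrent state space architectures lack physical structure, and the dynamics they generate can violate conservation and dissipation principles. The paper integrates physical structure into the world model through the Port-Hamiltonian framework to produce dynamics more consistent with physical laws.

Result: On visual control benchmarks the model attains higher asymptotic returns and improves internal simulator fidelity, while also reducing latent phase space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.

Insight: The novelty is a unified Port-Hamiltonian framework that brings physical structure into world models through three synergistic mechanisms: recurrent transitions with embedded physical priors, a proprioception-based energy world model, and an energy-guided Actor-Critic. Together they improve performance as well as physical interpretability and efficiency.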

Abstract: World models built on recurrent state space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port-Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action-controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics-aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy-guided Actor-Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower-variance alignment between imagined and real rewards, all while reducing latent phase space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.


cs.GR

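[304] Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback cs.GR | cs.CL

Guijin Son, Jehyun Park, Seyeon Park, Sunghee Ahn, Youngjae Yu

TL;DR: This paper proposes CAD generation agents that improve themselves using finite element analysis (FEA) as feedback. It reformulates CAD generation as producing a fully assembled multi-part STEP file directly from a free-form engineering brief, and introduces FEA validation plus two new supervision signals to improve generation quality.

Details

Motivation: Learned CAD generators neither iterate like engineers nor evaluate what engineering requires, and they usually treat part synthesis and assembly as two disjoint steps. The paper aims to establish a CAD generation pipeline that is closer to industrial practice and improves iteratively through physical validation.

Result: In the main first-attempt sweep, neither the GPT-5.5 nor the Claude Code agents produce a single artifact that strictly passes FEA validation, and the best configuration meets only about 20% of typed requirements on average. With the feedback tools, GPT-5.5/xhigh improves Box-IoU from 0.444 to 0.592 on S2O and from 0.397 to 0.505 on Fusion360.

Insight: The novelty is integrating FEA physical validation as the core feedback signal in the CAD generation loop, together with two new supervision signals: a text-only blueprint schema and a 21-view image renderer. This brings generation closer to how engineers actually iterate and pushes CAD programs toward artifacts that are both visually plausible and checked against physical and structural requirements.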

Abstract: Computer-aided design (CAD) is the backbone of modern industrial design, yet learned CAD generators still fall short of real engineering pipelines: they neither iterate like engineers nor evaluate what engineering requires. Prior work has treated CAD generation as two disjoint steps, part synthesis and assembly, where the former is graded by proximity to a gold reference and the latter, when handled at all, is reduced to a separate constraint solving step. In this work, we introduce a more industry-native task formulation that requires a model to produce a fully assembled multi-part STEP file from a free-form engineering brief, which is then validated via finite element analysis (FEA). FEA validation reveals that Codex (GPT-5.5) and Claude Code (Opus-4.7) agents do not produce a single strict-passing artifact in the main first-attempt sweep, with the best configuration meeting only about 20% of typed requirements on average. Moreover, we introduce two additional supervision signals, a novel text-only blueprint schema and a 21-view image renderer that aids the agent’s visual inspection, that better align the generation loop with how engineers iterate in practice. On S2O and Fusion360, the same feedback tools improve geometric reconstruction, with GPT-5.5/xhigh rising from 0.444 to 0.592 Box-IoU on S2O and from 0.397 to 0.505 on Fusion360. Together these signals move CAD programs toward artifacts that are not only visually plausible but also checked against physical and structural requirements.


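[305] Generative 3D Gaussians with Learned Density Control cs.GR | cs.CV

Runjie Yan, Yan-Pei Cao, Peng Wang, Ding Liang, Yuan-Chen Guo

TL;DR: This paper proposes Density-Sampled Gaussians (DeG), a new 3D representation that bridges adaptive rendering primitives and scalable generative modeling. It models Gaussian centers as samples from a learnable probability density function defined over an octree, which enables adaptive density control, and trains a latent diffusion model for generative synthesis. Experiments show state-of-the-art quality in single-image-to-3D generation.

Details

Motivation: Existing approaches usually constrain 3D Gaussians to fixed voxel grids or arrays, which limits their adaptivity and the scalability of generative modeling. The paper addresses this with a mathematically rigorous framework that concentrates rendering primitives in geometrically complex regions and supports variable-resolution decoding from a single latent code.

Result: Extensive experiments on single-image-to-3D generation show state-of-the-art generation quality, combining the structural adaptivity of unstructured primitives with the training stability of grid-based methods.

Insight: The main novelties are: 1) modeling Gaussian centers as samples from a learnable probability density function, which gives a mathematical framework for adaptive density control; 2) a render loss contribution gradient that serves as a fully differentiable analogue of the discrete densification and pruning heuristics in standard Gaussian Splatting; 3) VecSeq, a canonical re-indexing mechanism that anchors unordered set-structured latents to a deterministic 3D Sobol sequence, turning an ambiguous set-generation problem into a robust sequence modeling task and resolving the convergence difficulty of applying diffusion models.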

Abstract: We present Density-Sampled Gaussians (DeG), a novel 3D representation designed to bridge the gap between adaptive rendering primitives and scalable generative modeling. Unlike existing approaches that constrain 3D Gaussians to fixed voxel grids or arrays, DeG models Gaussian centers as samples from a learnable probability density function defined over an octree. This formulation provides a rigorous mathematical framework for adaptive density control: by jointly optimizing the spatial density and Gaussian attributes under rendering supervision, our model naturally concentrates primitives in regions of high geometric complexity. We achieve this via a new render loss contribution gradient that serves as a fully differentiable analogue to the discrete densification and pruning heuristics used in standard Gaussian Splatting. The resulting representation is highly flexible, supporting variable-resolution decoding from a single latent code by simply adjusting the sampling budget. To enable generative synthesis, we train a latent diffusion model on DeG. We identify a critical challenge in applying diffusion to unordered set-structured latents, which can significantly slow convergence, and propose VecSeq, a canonical re-indexing mechanism that anchors latent tokens to a deterministic 3D Sobol sequence. This transforms the ambiguous set-generation problem into a robust sequence modeling task. Extensive experiments demonstrate that our pipeline achieves state-of-the-art quality in single-image-to-3D generation, combining the structural adaptivity of unstructured primitives with the training stability of grid-based methods.


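[306] Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation cs.GR | cs.AI | cs.CV | cs.LG | cs.MA | cs.MM

Debanshu Das, Lavi Nigam, Sunil Kumar Jang Bahadur, Gopala Dhar

TL;DR: This paper introduces Genflow Ad Studio, a compound AI system for brand-aligned, self-correcting video generation. By combining a retrieval-based 'Brand DNA' extraction module with an adversarial multi-agent quality control loop, it addresses the brand-consistency and temporal-coherence shortcomings of existing generative video models and substantially raises the yield of brand-compliant videos.

Details

Motivation: Generative video models achieve high visual fidelity, but enterprise use is held back by temporal inconsistency and severe brand misalignment. Monolithic architectures struggle to enforce strict brand constraints and frequently produce unapproved visual assets.

Result: With a multi-stage, self-correcting pipeline, Genflow raises the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.

Insight: The core novelty is a compound AI architecture that combines parameterized brand-constraint extraction with a quality control loop based on iterative multi-agent evaluation and correction. This shifts generation from a single pass to continual self-refinement, ensuring deterministic, brand-aligned outputs.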

Abstract: Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval-based ‘Brand DNA’ extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi-Agent Quality Control (QC) loop. Instead of a single-pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi-stage, self-correcting pipeline, Genflow improved the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.