cs.CL [Total: 30]
cs.CV [Total: 85]
cs.GR [Total: 1]
cs.AI [Total: 4]
cs.SE [Total: 1]
cs.RO [Total: 1]
cs.LG [Total: 3]
cs.IR [Total: 1]

cs.CL

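[1] Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming cs.CL

Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding

TL;DR: This paper studies how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. It finds an approximately log-linear relationship between validation accuracy and the number of generated reasoning tokens, and presents two ways to shift the training trajectory: verification RL warmup and randomized clipping. Because single-generation reasoning is expensive under full attention, it introduces a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. Starting from Seed-OSS-36B, the full system matches the underlying RL model's oracle pass@16 at pass@1 on 456 hard problems from AetherCode, using 7.6 million tokens per problem on average, and surpasses GPT-5-high.

Details

Motivation: Scaling reasoning token budgets for competitive programming is challenging: conventional single-generation reasoning is too computationally expensive under full attention, so more efficient training and inference methods are needed.

Result: On 456 hard competitive programming problems from AetherCode, the full system (16 threads, 16 rounds per thread) matches the underlying RL model's oracle pass@16 at pass@1 and surpasses GPT-5-high, using 7.6 million tokens per problem on average.

Insight: Contributions include: showing a log-linear relationship between validation accuracy and reasoning token count; proposing verification RL warmup and randomized clipping to shift the training trajectory; and designing a multi-round parallel thinking pipeline that distributes the token budget across generation, verification, and refinement, trained end-to-end so that the training objective matches the test-time structure.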

Abstract: We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model’s oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
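The pass@1 and pass@16 figures quoted above are instances of the pass@k metric. As a rough illustration, the following is a minimal sketch of the widely used unbiased pass@k estimator, assuming n samples are drawn per problem and c of them are correct. The function name and the example counts are illustrative and are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only (not results from the paper).
single_try = pass_at_k(n=16, c=4, k=1)    # fraction of correct samples
any_of_16 = pass_at_k(n=16, c=4, k=16)    # 1.0 once any sample is correct
```

With k equal to n, the estimator reduces to "at least one of the n samples is correct", which is presumably what an oracle pass@16 over 16 samples measures.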

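[2] M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency cs.CL

Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee

TL;DR: This paper presents M2-Verify, a large-scale, multidomain benchmark dataset for checking the consistency between scientific claims and multimodal evidence. Sourced from PubMed and arXiv, it contains over 469K instances across 16 domains and is validated through expert audits. Experiments show that state-of-the-art models struggle on it: performance drops markedly on tasks with high visual complexity, and models hallucinate.

Details

Motivation: Existing benchmarks lack the scale, domain diversity, and visual complexity needed to realistically evaluate the strict consistency between a scientific claim and its multimodal evidence, so a more comprehensive dataset is needed to fill this gap.

Result: Baseline experiments on M2-Verify show that state-of-the-art models reach up to 85.8% Micro-F1 on low-complexity medical perturbations, but performance drops to 61.6% on high-complexity challenges such as anatomical shifts. Expert evaluations also expose hallucinations when models generate scientific explanations for their alignment decisions.

Insight: The contribution is a large-scale, multidomain, expert-validated benchmark for multimodal consistency verification. It shows that current models fall short at consistency reasoning in complex, realistic settings, with poor robustness to visual complexity and domain shift in particular, and it offers a new angle for assessing the reliability of model explanations.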

Abstract: Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset’s utility and provide comprehensive usage guidelines.

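[3] Procedural Knowledge at Scale Improves Reasoning cs.CL

Di Wu, Devendra Singh Sachan, Wen-tau Yih, Mingda Chen

TL;DR: This paper proposes Reasoning Memory, a retrieval-augmented generation framework that improves language models on complex reasoning tasks by retrieving and reusing procedural knowledge at scale. It decomposes existing step-by-step reasoning trajectories into self-contained subquestion-subroutine pairs, building a datastore of 32 million entries. At inference time, a lightweight prompt has the model verbalize the core subquestion and retrieve relevant subroutines, which guide reasoning as implicit procedural priors.

Details

Motivation: Existing test-time scaling methods usually treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories, especially procedural knowledge (how to reframe a problem, choose an approach, and verify or backtrack). This paper addresses the gap by explicitly retrieving and reusing procedural knowledge at scale to strengthen model reasoning.

Result: Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that the gains come from the broad procedural coverage of the source trajectories and from the decomposition and retrieval design.

Insight: The contribution is structuring procedural knowledge ("how to") at scale and using it for retrieval-augmented reasoning: reasoning trajectories are decomposed into subquestion-subroutine pairs, and a lightweight prompt handles retrieval and integration, so implicit procedural priors are extracted and reused effectively. This offers a new direction for improving performance on complex reasoning tasks.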

Abstract: Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.

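[4] Open-Domain Safety Policy Construction cs.CL

Di Wu, Siyue Liu, Zixiang Ji, Ya-Liang Chang, Zhe-Yu Liu

TL;DR: This paper presents Deep Policy Research (DPR), an agentic system that automatically drafts a full content moderation policy from human-written seed domain information. Using a single web search tool and lightweight scaffolding, it iteratively searches the web, distills sources into rules, and organizes the rules into an indexed document.

Details

Motivation: Drafting and maintaining domain-specific safety policies is costly; the goal is to automate the construction of content moderation policies.

Result: On the OpenAI undesired content benchmark (five domains) and an in-house multimodal advertisement moderation benchmark, DPR consistently outperforms definition-only and in-context learning baselines, and in the end-to-end setting it is competitive with expert-written policy sections in several domains. Under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system.

Insight: The contribution is a task-specific, structured research loop that generates policies by iteratively querying and distilling web information, which is more effective than generic web research. The lightweight agentic architecture and the seed-based automated policy generation approach are reusable ideas.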

Abstract: Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at https://github.com/xiaowu0162/deep-policy-research.

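[5] Adaptive Stopping for Multi-Turn LLM Reasoning cs.CL | cs.AI

Xiaofan Zhou, Huy Nguyen, Bo Yu, Chenxi Liu, Lu Cheng

TL;DR: This paper proposes MiCP, the first conformal prediction framework for multi-turn LLM reasoning, which addresses the key question of when to stop iterating in multi-turn methods such as adaptive retrieval-augmented generation and ReAct-style agents. By allocating error budgets across turns, it lets the model stop early while maintaining an overall coverage guarantee, reducing inference cost, latency, and prediction set size.

Details

Motivation: Existing multi-turn LLM reasoning methods rely on heuristic stopping rules or fixed turn budgets and give no formal guarantee that the final prediction still contains the correct answer. In high-stakes domains such as finance and healthcare, this can mean higher cost from unnecessary turns or incorrect decisions from stopping too early.

Result: Experiments on adaptive RAG and ReAct show that MiCP achieves the target coverage on both single-hop and multi-hop question answering benchmarks while markedly reducing the number of turns, inference cost, and prediction set size.

Insight: The contribution is extending conformal prediction to multi-turn adaptive reasoning, balancing a formal coverage guarantee against early stopping by allocating the error budget across turns. The new metric, which jointly evaluates coverage validity and answering efficiency, offers a new angle for evaluating multi-turn systems.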

Abstract: Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: When should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
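The idea of splitting an error budget across turns can be illustrated with standard split conformal prediction. The sketch below is not MiCP itself: the abstract does not give the allocation rule, so the weighted split (whose per-turn budgets sum to the overall level, in the spirit of a union bound) is an assumption, and the function names, weights, and calibration scores are hypothetical.

```python
import math


def allocate_error_budget(alpha: float, weights: list[float]) -> list[float]:
    """Split a total miscoverage level alpha across turns in proportion to weights.

    The per-turn budgets sum to alpha, so by a union bound the chance that
    any turn miscovers is at most alpha (assumed scheme, not from the paper).
    """
    total = sum(weights)
    return [alpha * w / total for w in weights]


def conformal_threshold(cal_scores: list[float], alpha: float) -> float:
    """Split-conformal threshold on nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]


# Hypothetical example: three turns, later turns given a larger share.
budgets = allocate_error_budget(0.1, [1.0, 1.0, 2.0])
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = conformal_threshold(scores, alpha=0.2)
```

In such a scheme, a turn would keep every candidate answer whose nonconformity score is at or below that turn's threshold, and could stop once the resulting prediction set is small enough.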

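[6] Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once cs.CL | cs.AI | cs.CY

Harnoor Dhingra

TL;DR: This paper proposes the Magic, Madness, Heaven, Sin framework for systematically analyzing and evaluating the different dimensions of large language model (LLM) output diversity. The framework places output variation on a homogeneity-heterogeneity axis and assigns value according to the task and its normative objective (epistemic, interactional, societal, or safety). Applying the framework to interactions between objectives shows that optimizing a single objective, such as safety, can harm others, such as demographic representation or creative diversity; the paper argues that output variation should be evaluated against the specific task objective.

Details

Motivation: Terminology around "diversity" in current LLM research is fragmented and lacks a unified framework, largely because the normative objectives underlying tasks are rarely made explicit. This paper aims to provide a unified framework that clarifies and systematizes what LLM output variation means, and what it is worth, across tasks and normative contexts.

Result: Applying the framework, the paper analyzes all pairwise cross-contextual interactions and reveals trade-offs in which optimizing one objective (such as safety) can inadvertently harm others (such as demographic representation or creative diversity).

Insight: The contribution is an integrative theoretical framework that unifies research on LLM output diversity under evaluation by task-level normative objectives, and that treats output variation as a property shaped by task objectives rather than an intrinsic trait of the model. This gives a finer-grained, context-aware view for understanding and evaluating LLMs across application scenarios.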

Abstract: Research on Large Language Models (LLMs) studies output variation across generation, reasoning, alignment, and representational analysis, often under the umbrella of “diversity.” Yet the terminology remains fragmented, largely because the normative objectives underlying tasks are rarely made explicit. We introduce the Magic, Madness, Heaven, Sin framework, which models output variation along a homogeneity-heterogeneity axis, where valuation is determined by the task and its normative objective. We organize tasks into four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness). For each, we examine the failure modes and vocabulary such as hallucination, mode collapse, bias, and erasure through which variation is studied. We apply the framework to analyze all pairwise cross-contextual interactions, revealing that optimizing for one objective, such as improving safety, can inadvertently harm demographic representation or creative diversity. We argue for context-aware evaluation of output variation, reframing it as a property shaped by task objectives rather than a model’s intrinsic trait.

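[7] Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging cs.CL | cs.AI

Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu

TL;DR: This paper proposes a weight-space model merging framework to address the catastrophic forgetting that occurs when large language models are fine-tuned for specific medical tasks. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) through interpolation-based methods, it aims to obtain a domain-adapted model that performs well on clinical tasks while retaining instruction-following ability.

Details

Motivation: General-purpose LLMs lose much of their instruction-following ability when fine-tuned on medical data, which is a critical challenge in applying them to clinical settings.

Result: Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models effectively mitigate catastrophic forgetting, preserving clinical domain expertise and instruction-following ability. Under severely constrained supervision (e.g., 64-shot vs. 256-shot), their performance is on par with fully fine-tuned baselines.

Insight: The contribution is using weight-space model merging as an efficient, scalable way to adapt open-source LLMs to clinical applications, which supports broader deployment in resource-constrained healthcare environments. More broadly, the method combines model merging with domain adaptation, offering a parameter-efficient and data-efficient alternative for countering catastrophic forgetting.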

Abstract: Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often “forget” a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.
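As a rough illustration of interpolation-based merging, the sketch below linearly interpolates two sets of weights parameter by parameter. The abstract does not specify which interpolation methods are used, so plain linear interpolation, the coefficient name `lam`, and the toy parameter dictionaries are all assumptions; real checkpoints would hold tensors rather than Python lists.

```python
def merge_linear(
    clinical: dict[str, list[float]],
    instruct: dict[str, list[float]],
    lam: float,
) -> dict[str, list[float]]:
    """Return lam * clinical + (1 - lam) * instruct for every parameter.

    Both models must share the same parameter names and shapes, which is
    the case when they are fine-tuned from the same base architecture.
    """
    if clinical.keys() != instruct.keys():
        raise ValueError("models must have identical parameter names")
    merged = {}
    for name in clinical:
        a, b = clinical[name], instruct[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]
    return merged


# Toy two-parameter "models" (hypothetical values).
clinical_weights = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
instruct_weights = {"layer.weight": [3.0, 1.0], "layer.bias": [2.0]}
merged_weights = merge_linear(clinical_weights, instruct_weights, lam=0.5)
```

Setting `lam` closer to 1 keeps more of the clinical model, while values closer to 0 keep more of the instruct model; the coefficient would normally be chosen on held-out data.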

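[8] DeltaMem: Towards Agentic Memory Management via Reinforcement Learning cs.CL

Qi Zhang, Shen Huang, Chu Liu, Shouqing Yang, Junbo Zhao

TL;DR: This paper proposes DeltaMem, a reinforcement-learning-based agentic memory management system that formulates persona-centric memory management as a single-agent, end-to-end task. It builds a dialogue dataset with operation-level memory update labels by simulating the evolution of human memory, introduces a Memory-based Levenshtein Distance to formalize the memory update reward, and designs a tailored RL framework to improve management ability. Experiments show that DeltaMem outperforms all product-level baselines on several long-term memory benchmarks.

Details

Motivation: Existing persona-centric memory management frameworks built on multi-agent systems suffer from information loss and adapt poorly across scenarios, resulting in suboptimal performance; a more robust and efficient single-agent solution is needed.

Result: On long-term memory benchmarks including LoCoMo, HaluMem, and PersonaMem, both training-free and RL-trained DeltaMem outperform all product-level baselines.

Insight: Contributions include: recasting multi-agent memory management as a single-agent, end-to-end task; building a dialogue dataset with operation labels by simulating the evolution of human memory; and designing a Memory-based Levenshtein Distance as the RL reward, enabling fine-grained optimization of memory updates.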

Abstract: Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.
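To make the reward idea concrete, here is a minimal sketch of an edit-distance reward over memory operations. It uses the classic Levenshtein distance between a predicted and a reference sequence of operations; the abstract does not define the Memory-based Levenshtein Distance, so the operation encoding, the normalization, and the function names below are assumptions rather than the paper's formulation.

```python
from typing import Hashable, Sequence


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Classic edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def update_reward(pred_ops: Sequence[Hashable], gold_ops: Sequence[Hashable]) -> float:
    """Map edit distance to a reward in [0, 1]; 1.0 means an exact match."""
    longest = max(len(pred_ops), len(gold_ops))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred_ops, gold_ops) / longest


# Hypothetical operation-level labels: (operation, memory text).
gold = [("ADD", "likes green tea"), ("UPDATE", "moved to Hanoi")]
pred = [("ADD", "likes green tea"), ("DELETE", "moved to Hanoi")]
reward = update_reward(pred, gold)
```

A reward of this shape gives partial credit when most operations match, which is one way an operation-level signal can be finer-grained than an all-or-nothing check.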

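[9] Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations cs.CL

Shou-Tzu Han, Rodrigue Rizk, KC Santosh

TL;DR: This paper systematically evaluates the fragility of three open-weight LLMs (Mistral-7B, Llama-3-8B, Qwen2.5-7B) to meaning-preserving surface perturbations, such as name substitution and number format paraphrasing, on mathematical reasoning tasks. The models show high answer-flip rates, and the paper proposes a unified framework, Mechanistic Perturbation Diagnostics (MPD), to trace the mechanistic basis of these failures. The framework combines logit lens analysis, activation patching, component ablation, and a newly proposed metric, the Cascading Amplification Index (CAI). Based on the diagnostic signals, the paper proposes a mechanistic failure taxonomy and validates it through targeted repair experiments.

Details

Motivation: Although LLMs perform strongly on mathematical reasoning benchmarks, they are surprisingly fragile to meaning-preserving surface perturbations such as paraphrasing number formats or swapping names. This paper aims to evaluate this fragility systematically and examine the computational mechanisms behind it.

Result: On 677 GSM8K problems, the three models show answer-flip rates of 28.8% to 45.1% on semantically equivalent variants, with number paraphrasing more disruptive than name substitution. As a failure predictor, CAI outperforms the first-divergence-layer metric for two of the three architectures, with AUC up to 0.679. Targeted repair experiments (steering vectors and layer fine-tuning) recover 12.2% of localized failures (Llama-3) but only 5.2% of distributed (Mistral) and 7.2% of entangled (Qwen) failures.

Insight: The main contribution is the unified MPD framework for systematically tracing the mechanistic basis of model sensitivity to meaning-preserving perturbations. Within it, CAI is a new metric that quantifies layer-wise divergence amplification and helps predict failures. The study also reveals a stark difference between architectures in how localizable failures are, and proposes a mechanistic failure taxonomy (localized, distributed, entangled) on that basis, offering new perspectives and methods for understanding and repairing model fragility. Integrating several diagnostic tools into one framework, together with a new metric and taxonomy, is a useful route to understanding these internal failure mechanisms.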

Abstract: Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60). Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.

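[10] What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis cs.CL

Delip Rao, Chris Callison-Burch

TL;DR: This paper systematically analyzes nine mainstream claim verification datasets by generating structured reasoning traces for 24K examples with GPT-4o-mini. It finds that current benchmarks mainly test direct evidence extraction, while multi-sentence synthesis and numerical reasoning are severely under-represented. Using a 1B-parameter reasoning verifier, it further characterizes five error types and shows that error profiles differ markedly across domains (general, scientific, mathematical).

Details

Motivation: Despite rapid progress in claim verification, there is no systematic understanding of what reasoning existing benchmarks actually exercise, so it is unclear which abilities these datasets test.

Result: The analysis finds stark dataset-level biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. High benchmark scores primarily reflect retrieval-plus-entailment ability.

Insight: The contribution is a large-scale reasoning trace analysis across multiple claim verification datasets that quantifies the biases in the abilities they cover. The study gives concrete, data-driven recommendations for building more challenging evaluation suites that better test the reasoning that verification systems need.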

Abstract: Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain – general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.

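[11] PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation cs.CL

Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han

TL;DR: This paper proposes PRCCF, a framework for Emotional Support Conversation (ESC) that combines persona-guided retrieval with causality-aware cognitive filtering to improve contextual understanding and generate empathetic responses.

Details

Motivation: Existing methods fall short in deep contextual understanding and cannot effectively alleviate individual emotional distress, so a framework is needed that integrates semantic compatibility, persona alignment, and causally relevant knowledge to improve emotional reasoning.

Result: Experiments on the ESConv dataset show that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations.

Insight: The contribution is a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment, together with a causality-aware cognitive filtering module that prioritizes causally relevant knowledge, improving contextual understanding and reasoning in emotional support conversation.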

Abstract: Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.

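[12] On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning cs.CL

Zhaoyi Li, Xiangyu Xi, Zhengyu Chen, Wei Wang, Gangwei Jiang

TL;DR: This paper studies how chain-of-thought (CoT) trajectories from different sources affect generalization in long-CoT supervised fine-tuning. Comparing CoT data generated by DeepSeek-R1-0528 and gpt-oss-120b, it uncovers a paradox: lower training loss does not mean better generalization. The analysis points to reasoning patterns as the key difference: gpt-oss-120b trajectories are convergent and deductive, while DeepSeek-R1-0528 trajectories are divergent and branch-heavy. Based on this, it proposes filtering out frequently branching trajectories, which improves performance on several reasoning benchmarks.

Details

Motivation: To examine how long-CoT SFT data from different sources affects generalization, and to explain the paradox of low training loss with poor generalization.

Result: Training on selected DeepSeek-R1-0528 subsets improves reasoning performance by up to 5.1% on AIME25 and 5.5% on BeyondAIME, and by 3.6% on average across five benchmarks.

Insight: The contribution is showing that reasoning patterns (convergent and deductive vs. divergent and branch-heavy) strongly affect generalization, and proposing trajectory filtering to mitigate the generalization loss caused by branching exploration, which offers a new angle on selecting and cleaning CoT data.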

Abstract: Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on gpt-oss-120b. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained with DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.

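[13] Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework cs.CL | cs.DB

Yanchen Wu, Tenghui Lin, Yingli Zhou, Fangyuan Zhang, Qintian Guo

TL;DR: This paper proposes a unified framework that incorporates existing memory methods for LLM agents, systematically compares representative methods on two benchmarks, and designs a new memory method that outperforms existing state-of-the-art methods.

Details

Motivation: Existing memory methods for LLM agents have not been systematically compared under the same experimental settings; this paper aims to understand their behavior in depth through a unified framework and experimental analysis.

Result: On two benchmarks, Multi-Session Chat and WebShop, the newly designed memory method outperforms existing state-of-the-art methods.

Insight: Existing memory methods are integrated from a modular perspective, and a stronger hybrid method is designed based on the experimental analysis, giving future research a systematic comparison framework and new directions.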

Abstract: Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.

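[14] Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition cs.CL

Truc Nguyen, Then Tran, Binh Truong, Phuoc Nguyen T. H

TL;DR: This paper proposes a human-machine collaborative framework for Vietnamese Speech Emotion Recognition (SER) that combines acoustic feature-based models with LLM reasoning, uses a confidence-based routing mechanism to handle ambiguous samples, and applies an iterative refinement strategy to improve performance.

Details

Motivation: Vietnamese SER is challenging because of ambiguous acoustic patterns, a lack of annotated data, and unclear emotional boundaries in real-world conditions; the framework aims to avoid the limits of relying solely on data-driven models.

Result: On a Vietnamese speech dataset of 2,764 samples across three emotion classes, with high inter-annotator agreement (Fleiss Kappa = 0.8574), the method reaches up to 86.59% accuracy and a Macro F1 of around 0.85-0.86, showing its effectiveness on ambiguous and hard-to-classify cases.

Insight: Contributions include: a confidence-based routing mechanism that separates easy from ambiguous samples; LLM-based deeper reasoning guided by structured rules derived from human annotation behavior; and iterative refinement for continued performance gains. Together these give a robust, model-agnostic approach to SER in low-resource settings that combines data-driven models with human reasoning.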

Abstract: Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
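The confidence-based routing step described above can be sketched as a simple threshold split. This is only an illustration under stated assumptions: the abstract does not give the confidence measure or cutoff, so the `Prediction` type, the 0.8 threshold, and the sample values are hypothetical, and the downstream LLM reasoning step is left out.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    sample_id: str
    label: str         # label proposed by the acoustic model
    confidence: float  # acoustic model's confidence in [0, 1]


def route_by_confidence(
    predictions: list[Prediction], threshold: float = 0.8
) -> tuple[list[Prediction], list[Prediction]]:
    """Split predictions into (easy, ambiguous) by a confidence threshold.

    Easy samples keep the acoustic model's label; ambiguous ones would be
    handed to an LLM for rule-guided reasoning (not implemented here).
    """
    easy, ambiguous = [], []
    for pred in predictions:
        if pred.confidence >= threshold:
            easy.append(pred)
        else:
            ambiguous.append(pred)
    return easy, ambiguous


# Hypothetical acoustic-model outputs over the three classes in the abstract.
batch = [
    Prediction("utt-001", "calm", 0.95),
    Prediction("utt-002", "panic", 0.55),
    Prediction("utt-003", "angry", 0.80),
]
easy_samples, ambiguous_samples = route_by_confidence(batch)
```

Raising the threshold sends more samples to the slower reasoning path, so the cutoff trades cost against accuracy on borderline cases.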

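[15] LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches cs.CL | cs.AI | cs.LG

Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu

TL;DR: This paper presents LiveMathematicianBench, a dynamic, contamination-resistant mathematical reasoning benchmark built from recent arXiv papers. By using theorems published after model training cutoffs, and by introducing proof-sketch-guided distractor generation and a substitution-resistant mechanism, it aims to evaluate the research-level mathematical reasoning of LLMs more realistically.

Details

Motivation: Existing mathematical reasoning benchmarks are limited by synthetic settings and data contamination, so they cannot realistically evaluate LLM reasoning on research-level mathematics.

Result: On the benchmark, the best model, Gemini-3.1-pro-preview, achieves only 43.5% accuracy. Under substitution-resistant evaluation, GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. Experiments show that providing proof sketches yields consistent accuracy gains.

Insight: The contribution is a dynamic, contamination-resistant live benchmark. Its core elements are a fine-grained taxonomy of theorem logical types (e.g., implication, equivalence, existence) and a pipeline that uses proof-sketch strategies to generate high-quality distractors, which makes the benchmark more sensitive to the difference between surface-level matching and substantive reasoning. The substitution-resistant mechanism strengthens this distinction further.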

Abstract: Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.

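[16] DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment cs.CL

Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang

TL;DR: This paper proposes DEFT (Distribution-guided Efficient Fine-Tuning), an efficient alignment framework that addresses the high cost and instability of methods such as RLHF for human value alignment, as well as their tendency to weaken the generalization ability of LLMs. DEFT computes a differential distribution reward from the language model's output distribution and the discrepancy distribution of the preference data, and uses it for data filtering and distributional guidance: a small, high-quality subset is filtered from the raw data and incorporated into existing alignment methods to guide the model's output distribution.

Details

Motivation: Reinforcement Learning from Human Feedback (RLHF) with algorithms such as PPO aligns LLMs with human values but is costly and unstable, and alternative methods still require large amounts of data and may weaken the model's generalization ability.

Result: Experimental results show that methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.

Insight: The contribution is a differential distribution reward, based on the model's output distribution and the discrepancy distribution of the preference data, that drives efficient data filtering and distributional guidance, improving alignment performance and generalization while reducing data requirements and training cost.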

Abstract: Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model’s output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.

[17] From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion cs.CLPDF

Liang Zhu, Haolin Chen, Lidong Zhao, Xian Wu

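TL;DR: This paper proposes Adaptive Placeholder Completion (APC), a collaborative framework that addresses the wrong predictions large language models make in code completion when they are forced to generate concrete code from insufficient context. APC emits explicit placeholders at high-entropy positions, which users fill in directly via IDE navigation, lowering editing cost. The paper models code completion as cost minimization under uncertainty and instantiates the framework with training data built from real edit logs and a reinforcement learning reward function. Experiments on 1.5B to 14B parameter models show that APC reduces expected editing cost by 19% to 50% while preserving standard hard-completion performance.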

Details

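Motivation: Current large language models typically follow a hard-completion paradigm in code completion, generating fully concrete code even when context is insufficient. As a result, many suggestions are edited or rejected by users, indicating that models frequently mispredict at specific token positions.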

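Result: Extensive evaluation on 1.5B to 14B parameter models shows that APC reduces expected editing cost by 19% to 50% while preserving standard hard-completion performance.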

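Insight: The paper formalizes code completion as cost minimization under uncertainty and proves that a critical entropy threshold exists above which APC's expected cost is strictly lower than that of hard completion. By building training data from real edit logs and designing a cost-based reward function for reinforcement learning, it learns adaptive abstention end-to-end without sacrificing conventional completion quality, providing both a theoretical foundation and a practical training framework for uncertainty-aware code completion.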

Abstract: While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user’s subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill them directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and designing a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B–14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
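
The cost argument in the APC abstract can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the paper's formulation: it uses the model's confidence in its top token as a stand-in for entropy, and the names `fix_cost` and `fill_cost` are hypothetical per-position costs, with filling a placeholder assumed cheaper than correcting an error.

```python
def expected_cost_hard(p_correct: float, fix_cost: float) -> float:
    """Expected edit cost when the model commits to a concrete token.

    Assumption: a correct token costs nothing, a wrong one costs fix_cost.
    """
    return (1.0 - p_correct) * fix_cost


def should_use_placeholder(p_correct: float, fix_cost: float, fill_cost: float) -> bool:
    """Emit a placeholder when filling it is cheaper than the expected fix."""
    return fill_cost < expected_cost_hard(p_correct, fix_cost)


def confidence_threshold(fix_cost: float, fill_cost: float) -> float:
    """Confidence below which the placeholder wins in this toy model."""
    return 1.0 - fill_cost / fix_cost
```

With `fix_cost=3.0` and `fill_cost=1.0`, the placeholder is preferred whenever confidence falls below two thirds, which mirrors (in simplified form) the idea of a critical uncertainty level beyond which abstaining is cheaper than guessing.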

[18] SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations cs.CLPDF

Yiqiang Cai, Chengyan Wu, Bolei Ma, Bo Chen, Yun Xue

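TL;DR: This paper proposes SURE (Synergistic Uncertainty-aware Reasoning), a framework for multimodal emotion recognition in conversations (MERC). It handles modality-specific noise with an uncertainty-aware Mixture-of-Experts module, performs multi-turn contextual reasoning with an iterative reasoning module, and captures intra- and inter-modal interactions with a Transformer gate module, improving robustness and contextual modeling.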

Details

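Motivation: Existing MERC methods usually emphasize multimodal fusion but overlook the uncertainty in noisy features and fine-grained reasoning, which limits robustness and contextual modeling.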

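Result: Experiments on benchmark MERC datasets show that SURE consistently outperforms existing state-of-the-art (SOTA) methods, demonstrating its effectiveness for robust multimodal reasoning.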

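Insight: The novelty is the explicit introduction of uncertainty modeling and an iterative reasoning mechanism into MERC, with a synergistic design that handles noise and context together and yields a more robust approach to conversational emotion recognition.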

Abstract: Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.

[19] Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients cs.CLPDF

Oumaima El Khettari, Virgile Barthet, Guillaume Hocquet, Joconde Weller, Emmanuel Morin

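TL;DR: This study evaluates transformer-based models for short-term mortality prediction on a French heart failure cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs.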

Details

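Motivation: Short-term mortality prediction for heart failure patients is not accurate enough when it relies on structured electronic health record data alone. The study explores the potential and limits of multimodal methods and large language models for clinical decision support.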

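Result: On short-term mortality prediction for the French heart failure cohort, supervised multimodal fusion achieves the best performance; entity-enriched text representations outperform CLS embeddings alone; large language models are unstable, with text-only prompts working best.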

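Insight: The contribution is a systematic comparison of modality representations and fusion strategies. It identifies entity-aware multimodal transformers as the most reliable current solution and exposes the limits of current LLM prompting on clinical multimodal tasks, giving empirical guidance for medical AI model design.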

Abstract: Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.

[20] ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues cs.CL | cs.AIPDF

Bhaskara Hanuma Vedula, Darshan Anghan, Ishita Goyal, Ponnurangam Kumaraguru, Abhijnan Chakraborty

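TL;DR: This paper introduces ImplicitBBQ, a benchmark for evaluating implicit bias in large language models when identity is conveyed indirectly through characteristic-based cues such as culturally associated attributes. It covers age, gender, region, religion, caste, and socioeconomic status. In ambiguous contexts, implicit bias in open-weight models is more than six times higher than explicit bias, and existing methods such as safety prompting and chain-of-thought reasoning fail to close this gap.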

Details

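Motivation: Existing benchmarks mainly use names as proxies to detect implicit bias, but names are only weakly associated with many sociodemographic attributes and cannot extend to dimensions such as age or socioeconomic status. A more comprehensive way to evaluate implicit bias under indirect identity information is needed.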

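Result: Across the 11 models evaluated, implicit bias in ambiguous contexts is more than six times higher than explicit bias in open-weight models. Safety prompting and chain-of-thought reasoning fail to substantially close the gap; few-shot prompting reduces implicit bias by 84%, yet caste bias remains at four times the level of any other dimension.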

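Insight: The novelty is ImplicitBBQ, a benchmark built on characteristic-based cues (culturally associated attributes) that enables a more comprehensive, multi-dimensional evaluation of implicit bias. Viewed objectively, current alignment and prompting strategies address only the surface of bias and leave culturally grounded stereotypic associations unresolved, which gives research on bias mitigation a new evaluation tool and direction.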

Abstract: Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.

[21] SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning cs.CL | cs.AIPDF

Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

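TL;DR: This paper proposes SAFE, a framework that uses a knowledge-graph-grounded atomic error taxonomy and verification pipeline to produce strictly verifiable reasoning trajectories in multi-hop reasoning. It identifies unanswerable samples at train time and dynamically detects ungrounded steps at inference time, significantly improving model performance.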

Details

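Motivation: Multi-hop QA benchmarks let large language models hide ungrounded or flawed reasoning steps behind spurious correctness. The work aims to move toward strictly verifiable reasoning.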

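Result: SAFE identifies up to 14% of instances in standard benchmarks as unanswerable, and at inference time it improves average accuracy by 8.4 percentage points, significantly outperforming baseline methods.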

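Insight: The novelty is replacing ungrounded chain-of-thought with a strictly verifiable sequence of entities, and building a two-phase verification framework (cleaning noisy supervision at train time, dynamic feedback at inference time) that ensures the reasoning trajectory is verifiable.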

Abstract: Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.

[22] Why Gaussian Diffusion Models Fail on Discrete Data? cs.CLPDF

Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, Dmitry Vetrov

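TL;DR: This paper investigates why Gaussian diffusion models perform poorly on discrete data. When a discrete distribution is represented in continuous space, the DDPM solver encounters a multimodal density within a critical sampling interval, which leads to out-of-distribution inputs for the model and degrades sample quality.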

Details

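Motivation: The work addresses the drop in sampling quality of Gaussian diffusion models on discrete data generation, especially when the discrete distribution is represented as a mixture of delta distributions in continuous space.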

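Result: Validated on conditional and unconditional tasks across text, programming code, and proteins, combining self-conditioning with a switch to the q-sampling solver within the critical interval improves generation quality.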

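Insight: The novelty is identifying the multimodal-density problem in the critical sampling interval for discrete data and mitigating it by combining self-conditioning with a switch to the q-sampling solver, offering both a theoretical analysis and a practical improvement for diffusion models on discrete data.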

Abstract: Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.

[23] BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs cs.CL | cs.AIPDF

Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef, Céline Hudelot, Pierre Colombo

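TL;DR: This paper proposes BidirLM, a general recipe for turning causal generative large language models (such as Gemma3 and Qwen3) into bidirectional encoders. Systematic ablations identify the key factors for successful adaptation; linear weight merging combined with a lightweight multi-domain data mixture mitigates catastrophic forgetting; and merging with specialized causal models strengthens multimodal encoding. The open-source recipe yields five encoders that outperform existing alternatives on text, vision, and audio representation benchmarks.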

Details

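Motivation: Existing approaches to converting causal LLMs into bidirectional encoders are limited: they lack consensus on the optimal training objective, suffer catastrophic forgetting when adapted at scale, and cannot flexibly integrate the large ecosystem of specialized generative models.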

Result: On text, vision, and audio representation benchmarks, the BidirLM family of encoders outperforms existing alternatives.

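Insight: The contributions are: (1) systematic ablations that pin down the key factors for adapting causal LLMs, in particular the often-omitted prior masking phase; (2) a dual strategy (linear weight merging plus a lightweight multi-domain data mixture) that mitigates catastrophic forgetting without the original pre-training data; (3) merging with specialized causal models to transfer modality- and domain-specific capabilities, giving a general and scalable way to build encoders.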

Abstract: Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
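
Linear weight merging, as named in the BidirLM abstract, is commonly implemented as a per-parameter interpolation between two checkpoints. The sketch below assumes that reading; the abstract does not give the merging coefficients or which checkpoints are combined, so `alpha`, `base`, and `adapted` are hypothetical, and parameters are plain Python lists rather than tensors to keep the example dependency-free.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_linear(base: Weights, adapted: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * adapted for every parameter.

    alpha is a hypothetical interpolation coefficient in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name, base_values in base.items():
        adapted_values = adapted[name]
        if len(base_values) != len(adapted_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base_values, adapted_values)
        ]
    return merged
```

Setting `alpha` closer to 0 keeps more of the original model's behavior, which is one plausible way such merging could limit forgetting; the actual schedule used by the authors may differ.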

[24] Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning cs.CL | cs.AI | cs.IRPDF

Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu

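TL;DR: This paper proposes ReRanking Preference Optimization (RRPO), a reinforcement learning framework for optimizing rerankers in retrieval-augmented generation. It models reranking as a sequential decision process and uses the LLM's generation quality as feedback to align reranking directly with the downstream generation task, without expensive human annotation. Experiments show that RRPO significantly outperforms existing baselines on multiple knowledge-intensive benchmarks and generalizes and holds up well under noise.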

Details

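Motivation: Current rerankers are usually optimized on static, human-annotated relevance labels, decoupled from the downstream generation process. Documents retrieved as topically relevant therefore often fail to provide the utility the LLM needs to generate a precise answer.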

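Result: Extensive experiments on knowledge-intensive benchmarks show that RRPO significantly outperforms strong baselines, including the list-wise reranker RankZephyr.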

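Insight: The main novelty is a reinforcement-learning reranker optimization framework driven by LLM feedback, which aligns reranking directly with the LLM's generation quality and introduces a reference-anchored deterministic baseline for training stability. The framework generalizes to different readers, integrates orthogonally with query expansion modules, and remains robust under noisy supervision.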

Abstract: Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM’s generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.

[25] Reliable Control-Point Selection for Steering Reasoning in Large Language Models cs.CLPDF

Haomin Zhuang, Hojun Yoo, Xiaonan Luo, Kehan Guo, Xiangliang Zhang

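TL;DR: This paper proposes a method based on stability filtering and content-subspace projection that reliably selects control points in large language models for building effective steering vectors, improving performance on mathematical reasoning tasks.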

Details

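Motivation: Existing methods build steering vectors by detecting reasoning-behavior boundaries in chain-of-thought traces through keyword matching, assuming that every detected boundary encodes a genuine behavioral signal. The study finds that the vast majority of these boundaries are behaviorally unstable under re-generation, which dilutes the steering signal.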

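Result: On MATH-500 the method reaches 0.784 accuracy, 5.0 points above the strongest baseline. The extracted steering vectors transfer without re-extraction to other models in the same architecture family, improving Nemotron-Research-Reasoning-1.5B and DeepScaleR-1.5B-Preview by +5.0 and +6.0 respectively.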

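Insight: The core novelty is formalizing intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and proposing stability filtering to keep only the boundaries where the behavior reproduces consistently. Combined with a content-subspace projection that removes question-specific noise, this yields more robust and transferable steering vectors.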

Abstract: Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model’s hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors – such as self-reflection – emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at https://github.com/zhmzm/stability-steering.
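
The stability-filtering idea described in this abstract (keep a boundary only if the model reproduces the behavior when re-generating from the same prefix) can be sketched as follows. This is an illustration under assumptions, not the released implementation: `regenerate` and `shows_behavior` are hypothetical callables standing in for model sampling and behavior detection, and `n_samples` and `min_rate` are made-up defaults.

```python
from typing import Callable, List, Sequence


def stability(
    prefix: str,
    regenerate: Callable[[str], str],
    shows_behavior: Callable[[str], bool],
    n_samples: int,
) -> float:
    """Fraction of re-generations from `prefix` that show the behavior."""
    hits = sum(1 for _ in range(n_samples) if shows_behavior(regenerate(prefix)))
    return hits / n_samples


def filter_stable(
    prefixes: Sequence[str],
    regenerate: Callable[[str], str],
    shows_behavior: Callable[[str], bool],
    n_samples: int = 8,      # hypothetical sampling budget
    min_rate: float = 0.75,  # hypothetical stability threshold
) -> List[str]:
    """Keep only prefixes whose behavior reproduces often enough."""
    return [
        p for p in prefixes
        if stability(p, regenerate, shows_behavior, n_samples) >= min_rate
    ]
```

Only the retained prefixes would then contribute hidden states to the steering vector; how those states are aggregated and projected is outside this sketch.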

[26] Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents cs.CLPDF

Xuan Qi

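TL;DR: This paper systematically studies how chain-of-thought (CoT) reasoning length affects function-calling language agents and finds a pronounced non-monotonic pattern. On Qwen2.5-1.5B-Instruct, brief reasoning (32 tokens) raises task accuracy by 45% relative (from 44.0% to 64.0%), whereas long reasoning (256 tokens) hurts performance and falls below the no-CoT baseline. Based on this, the author proposes FR-CoT, a structured brief-reasoning method whose template forces the model to commit to a valid function name at the start of reasoning, eliminating function hallucination while maintaining accuracy.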

Details

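Motivation: The relationship between chain-of-thought length and agent accuracy in structured tool-use settings is poorly understood. The work aims to determine how much a language agent should think before acting.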

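Result: Experiments use Qwen2.5-1.5B-Instruct on 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Brief reasoning (32 tokens) raises accuracy from 44.0% to 64.0%, while long reasoning (256 tokens) lowers it to 25.0%, well below the no-CoT baseline. FR-CoT achieves accuracy statistically equivalent to free-form brief CoT while reducing the function hallucination rate to 0.0%.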

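Insight: The novelty is revealing the non-monotonic effect of the CoT budget in function calling and showing that brief reasoning mainly acts as function routing. FR-CoT forces early function commitment through a structured template, providing a structural reliability guarantee without budget tuning, a reusable way of steering free-form reasoning toward structured output.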

Abstract: How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0–512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8–16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as “Function: [name] / Key args: […],” forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
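
The FR-CoT template quoted in the abstract, “Function: [name] / Key args: […]”, lends itself to a simple validity check. The sketch below is an assumption-laden illustration rather than the paper's method: it validates the template after generation with a regular expression, whereas the paper may enforce it during decoding, and `route_function` and its arguments are hypothetical names.

```python
import re
from typing import Optional, Sequence

# Matches the templated reasoning line: "Function: <name> / Key args: <anything>".
_TEMPLATE = re.compile(
    r"^\s*Function:\s*(?P<name>[A-Za-z_][\w.]*)\s*/\s*Key args:\s*(?P<args>.*)$",
    re.DOTALL,
)


def route_function(reasoning: str, candidates: Sequence[str]) -> Optional[str]:
    """Return the committed function name if it is a valid candidate.

    Returns None when the reasoning does not follow the template or
    names a function outside the candidate set (a hallucinated function).
    """
    match = _TEMPLATE.match(reasoning)
    if match is None:
        return None
    name = match.group("name")
    return name if name in candidates else None
```

A caller could retry or fall back to a direct answer whenever `route_function` returns `None`, so that only functions from the candidate set are ever invoked.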

[27] Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions cs.CL | cs.LGPDF

Atilla Kaan Alkan, Felix Grezes, Jennifer Lynn Bartlett, Anna Kelbert, Kelly Lockhart

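TL;DR: This paper reports on participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where the systems ranked second in all three subtasks. It compares two fine-tuning-free approaches: Fuzzy Matching (FM), based on lexical string similarity, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both are competitive on all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently 1 point above FM on the official test set. A controlled noise-injection study reveals complementary failure modes: CAR degrades less as boundary noise increases, while FM degrades more gracefully under mention substitution. Inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly.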

Details

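Motivation: The work tackles cross-document software mention coreference resolution and examines how lexical and contextual methods degrade differently under noise from the upstream mention detector, in order to guide system selection.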

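Result: The systems ranked second in all three subtasks of the SOMD 2026 shared task. Both methods score between 0.94 and 0.96 CoNLL F1, with CAR consistently 1 point above FM on the official test set. In the noise-injection study, CAR loses only 0.07 F1 under boundary noise while FM loses 0.20; under mention substitution, FM loses 0.52 and CAR loses 0.63. For inference scaling, CAR is approximately linear and FM is superlinear.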

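Insight: The contribution is a systematic comparison of two fine-tuning-free methods (lexical FM and contextual CAR) on software mention coreference, with controlled noise injection exposing their complementary failure modes. Viewed objectively, the core insight is that system selection should account for both the noise profile of the upstream mention detector and the scale of the target corpus, which is useful guidance for practical deployment.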

Abstract: We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
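
A lexical fuzzy-matching baseline of the kind the abstract calls FM can be sketched with Python's standard `difflib`. This is not the authors' released system: the similarity measure (`SequenceMatcher.ratio`), the greedy first-fit clustering, and the `threshold` value are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def similar(a: str, b: str, threshold: float) -> bool:
    """Case-insensitive string similarity test."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_mentions(mentions: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedily assign each mention to the first cluster whose
    representative (its first member) is similar enough."""
    clusters: List[List[str]] = []
    for mention in mentions:
        for cluster in clusters:
            if similar(mention, cluster[0], threshold):
                cluster.append(mention)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([mention])
    return clusters
```

Because every new mention may be compared against every existing cluster, the number of comparisons can grow faster than linearly with the number of mentions, which is at least compatible with the superlinear scaling the abstract reports for FM.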

[28] Adam’s Law: Textual Frequency Law on Large Language Models cs.CLPDF

Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong

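TL;DR: This paper proposes an LLM optimization framework based on textual frequency, made up of three units: the Textual Frequency Law (TFL), Textual Frequency Distillation (TFD), and Curriculum Textual Frequency Training (CTFT). It aims to improve LLM performance on tasks such as reasoning and translation by raising the frequency of the input text.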

Details

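Motivation: Textual frequency is known to be related to human cognition, but its relationship to large language models (LLMs) has been little studied. The paper explores how the frequency of textual data affects LLM performance.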

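Result: Experiments on the authors' curated Textual Frequency Paired Dataset (TFPD), covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling, indicate that the proposed framework is effective.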

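Insight: The novelty is bringing textual frequency into LLM optimization in a systematic way, which the authors believe to be an understudied direction: estimating frequency from online resources, paraphrasing inputs into more frequent expressions, and curriculum fine-tuning in increasing order of frequency. This offers a new perspective on data selection and model training.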

Abstract: While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.

[29] CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech cs.CLPDF

Youssef Saidi, Haroun Elleuch, Fethi Bougares

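TL;DR: This paper introduces CV-18 NER, the first publicly available dataset for named entity recognition (NER) from Arabic speech, built on the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema. It compares cascaded systems (ASR + text NER) with end-to-end (E2E) models based on Whisper and AraBEST-RQ and finds that E2E models substantially outperform the best cascaded configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). The analysis shows that Arabic-specific self-supervised pretraining performs well on ASR, multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and larger models may be harder to adapt in low-resource settings.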

Details

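Motivation: End-to-end NER from Arabic speech is under-explored because of Arabic's morphological complexity, the absence of short vowels, and limited annotated resources.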

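Result: On the CV-18 NER test set, the end-to-end models reach 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium), substantially outperforming the best cascaded system and establishing the first open benchmark for NER from Arabic speech.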

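Insight: The contributions include CV-18 NER, the first public dataset for NER from Arabic speech, and evidence that end-to-end models outperform cascaded ones for Arabic NER. Viewed objectively, effectively combining language-specific self-supervised pretraining with multilingual weak supervision is a key strategy for low-resource speech NER, and model size needs to match the available data.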

Abstract: End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech: https://huggingface.co/datasets/Elyadata/CV18-NER.

[30] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation cs.CL | cs.AI | cs.LGPDF

Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou

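TL;DR: This paper addresses the collapse of new-token embeddings caused by standard mean initialization when a language model's vocabulary is extended. It proposes GTI, a grounded initialization method based on linguistic supervision that assigns new tokens semantically distinct starting positions in the pretrained embedding space, improving how well new tokens are learned in generative recommendation.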

Details

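Motivation: Standard practice initializes new token embeddings as the mean of existing token embeddings. The new tokens therefore collapse into a degenerate subspace from the start and lose their inter-token distinctions, which subsequent fine-tuning struggles to fully recover. This is a key bottleneck when extending language models with new vocabulary.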

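Result: Across multiple generative recommendation benchmarks, including industry-scale and public datasets, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings.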

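Insight: The novelty is the Grounded Token Initialization Hypothesis and the lightweight GTI stage, which uses paired linguistic supervision to map new tokens to semantically distinct locations in the pretrained embedding space. This establishes richer inter-token structure before fine-tuning and eases the initialization bottleneck.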

Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
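
The collapse attributed to mean initialization can be seen in a few lines. The sketch below is a toy illustration, not the paper's spectral diagnostics: embeddings are plain Python lists, and `mean_init` and `max_pairwise_distance` are hypothetical helper names. It shows only the starting point (every new token receives the same vector), not how fine-tuning subsequently moves the tokens.

```python
from typing import List

Vector = List[float]


def mean_init(vocab: List[Vector], n_new: int) -> List[Vector]:
    """Initialize n_new tokens as the mean of the existing embeddings."""
    dim = len(vocab[0])
    mean = [sum(vec[i] for vec in vocab) / len(vocab) for i in range(dim)]
    # Each new token gets its own copy of the same mean vector.
    return [list(mean) for _ in range(n_new)]


def max_pairwise_distance(vectors: List[Vector]) -> float:
    """Largest Euclidean distance between any two vectors."""
    best = 0.0
    for i, u in enumerate(vectors):
        for v in vectors[i + 1:]:
            dist = sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
            best = max(best, dist)
    return best
```

Calling `max_pairwise_distance(mean_init(vocab, 4))` yields zero for any vocabulary, since the new tokens are indistinguishable at initialization; a grounded scheme would instead place them at distinct locations before fine-tuning begins.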

cs.CV [Back]

[31] DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation cs.CV | cs.SEPDF

Xinhao Huang, Jinke Yu, Wenhao Xu, Zeyi Wen, Ying Zhou

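TL;DR: DOne is an end-to-end design-to-code generation framework that decouples structure understanding from element rendering, addressing the difficulty existing vision-language models have in reconciling high-level structural hierarchy with fine-grained visual detail when generating code. It comprises a layout segmentation module, a hybrid element retriever, and a schema-guided generation paradigm, and outperforms existing methods on HiFi2Code, a newly proposed high-complexity benchmark.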

Details

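Motivation: Existing vision-language models face a "holistic bottleneck" in design-to-code generation: they cannot handle high-level structural hierarchy and fine-grained visual detail at the same time, which leads to distorted layouts or generic placeholders. A method that decouples the two tasks is needed.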

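Result: On the proposed high-layout-complexity benchmark HiFi2Code, DOne outperforms existing methods in both high-level visual similarity (for example, more than 10% higher GPT Score) and fine-grained element alignment. Human evaluation confirms a threefold productivity gain and higher visual fidelity.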

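Insight: The novelty is decoupling structure understanding from element rendering, realized through a learned layout segmentation module (avoiding the limits of heuristic cropping), a hybrid element retriever specialized for the extreme aspect ratios and densities of UI components, and a schema-guided generation paradigm that bridges layout and code. Viewed objectively, this decoupling and the UI-specific components are what drive the improvement in generated-code fidelity.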

Abstract: While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a “holistic bottleneck”: they fail to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on HiFi2Code demonstrate that DOne outperforms existing methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a threefold productivity gain with higher visual fidelity.

[32] Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models cs.CV | cs.AI | cs.CLPDF

Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi

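TL;DR: This paper proposes Look Twice (LoT), a training-free inference framework that improves how pretrained multimodal large language models (MLLMs) use multimodal evidence when answering knowledge-intensive visual questions. It analyzes the model's attention patterns to estimate which visual regions and retrieved text elements are relevant to the query, then highlights this evidence with lightweight prompt-level markers so the model re-attends to it while generating the answer.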

Details

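Motivation: Existing multimodal large language models struggle to identify and integrate the most relevant visual and textual evidence when a query requires both visual understanding and external knowledge, especially when the retrieved text is noisy or only partially relevant and fine-grained visual information must be localized.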

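Result: Across multiple knowledge-based visual question answering (VQA) benchmarks, the method yields consistent improvements over zero-shot MLLMs. Further evaluation on vision-centric and hallucination-oriented benchmarks shows that visual evidence highlighting alone also improves performance without textual context, with no additional training or architectural changes.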

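Insight: The novelty is a fully training-free method that analyzes attention patterns at inference time and highlights evidence with prompt-level markers, guiding a pretrained model to use multimodal evidence more effectively. It offers a lightweight, broadly applicable way to improve the reasoning of existing MLLMs.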

Abstract: Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.

[33] Sparse Spectral LoRA: Routed Experts for Medical VLMs cs.CVPDF

Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz

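TL;DR: This paper proposes MedQwen, a parameter-efficient medical vision-language model that combines a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule. It addresses the cross-dataset interference and data sensitivity caused by heterogeneous supervision in medical imaging, as well as catastrophic forgetting in continual learning.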

Details

Motivation: 解决通用视觉语言模型在医学影像领域因异质监督导致的鲁棒性不足，以及在临床工作流中持续学习时出现的灾难性遗忘问题。

Result: 在涵盖视觉问答、报告生成、放射学分类和幻觉缓解的23个医学数据集上，MedQwen在零样本分类任务上以339倍更少的可训练参数接近全微调性能，并将持续学习中的遗忘率降低至约5%，而基线模型性能下降超过20-50%。

Insight: 创新点包括：从预训练权重的非重叠奇异值分解（SVD）片段初始化专家，引入残差补偿和缩放方案以实现稳定的专家专业化和分布偏移下的一致路由；通过理论驱动的缩放规则将低秩更新与全秩全微调的MoE对齐，无需改变基础架构。

Abstract: Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters, and reduces sequential forgetting to ~5% where strong baselines degrade by >20-50%.

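[34] ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos cs.CV

Syed Ahsan Masud Zaidi, William Hsu, Scott Dietrich

TL;DR: This paper proposes a Vision Transformer (ViT)-based method for detecting risky tackles in American football practice videos and expands the associated dataset. With a new dataset of 733 annotated video clips and an imbalance-aware training strategy, the model reaches a risky recall of 0.67 and a risky F1 of 0.59 under cross-validation, improving risky recall by more than 8 percentage points over the previous baseline while using a larger dataset.

Details

Motivation: To enable early identification of hazardous actions in contact sports and thereby improve player safety, specifically through automated detection of risky tackles in American football practice.

Result: On the expanded dataset (733 annotated clips), the ViT-based model attains a risky recall of 0.67 and a risky F1 of 0.59 under cross-validation, a notable improvement over the previous baseline on a smaller dataset (risky recall 0.58, risky F1 0.56), indicating that the method can reliably detect rare but safety-critical action patterns.

Insight: The novelties include building a large-scale dataset for American football tackle detection and combining a ViT with class-imbalance handling for video action classification. Viewed objectively, the method demonstrates the potential of ViTs for fine-grained, safety-critical video analysis and offers a practical path toward coach-centered injury prevention tools.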

Abstract: Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. We present a method for detecting risky tackles in American football practice videos and introduce a substantially expanded dataset for this task. Our dataset contains 733 single-athlete-dummy tackle clips, each temporally localized around the first point of contact and labeled with a strike zone component of the standardized Assessment for Tackling Technique (SATT-3), extending prior work that reported 178 annotated videos. Using a vision transformer-based model with imbalance-aware training, we obtain a risky recall of 0.67 and a risky F1 of 0.59 under cross-validation. Relative to the previous baseline on a smaller subset (risky recall of 0.58; risky F1 of 0.56), our approach improves risky recall by more than 8 percentage points on a much larger dataset. These results indicate that vision transformer-based video analysis, coupled with careful handling of class imbalance, can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury prevention tools.
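
As a rough consistency check on the reported figures, the sketch below computes F1 from precision and recall and inverts the relation to estimate the precision implied by a given recall and F1. The function names are illustrative and not from the paper. Because the abstract reports cross-validated values, the implied precision is only an approximation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given R and F1."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denominator


# Figures quoted in the abstract for the risky class.
new_recall, new_f1 = 0.67, 0.59
old_recall, old_f1 = 0.58, 0.56

# Difference between the two recall values, in percentage points.
recall_gain_points = (new_recall - old_recall) * 100
# Approximate precision consistent with the new recall and F1.
approx_precision = implied_precision(new_recall, new_f1)

print(f"recall gain: {recall_gain_points:.0f} percentage points")
print(f"implied risky precision: {approx_precision:.2f}")
```

The gain is expressed in percentage points (a difference of two rates), which is how the abstract's improvement figure should be read.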

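[35] Regularizing Attention Scores with Bootstrapping cs.CV | cs.AI | cs.LG | stat.ME | stat.ML

Neo Christopher Chung, Maxim Laletin

TL;DR: This paper proposes a bootstrapping-based attention regularization method that quantifies the uncertainty of attention scores in vision transformers (ViT) and uses statistical significance testing to remove spurious attention arising from noise, improving the sparsity and interpretability of attention maps.

Details

Motivation: Vision transformers rely on an attention mechanism to weigh input features, and their attention scores are often used as explanations of the decision process. Because the scores are almost always non-zero, attention maps are noisy and diffuse, which limits interpretability. This motivates quantifying the uncertainty of attention scores and obtaining regularized scores.

Result: On natural and medical images, the proposed attention regularization removes spurious attention arising from noise and markedly improves the shrinkage and sparsity of attention maps. Quantitative evaluations on simulated and real-world datasets support the method's effectiveness.

Insight: The novelty lies in placing attention scores in a statistical framework: bootstrapping generates a baseline distribution of attention scores by resampling input features, which is then used to estimate their significance and posterior probabilities. This provides a practical regularization tool when attention scores are used as explanations for ViT.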

Abstract: Vision transformers (ViT) rely on an attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization
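
The general idea of a bootstrap baseline for attention scores can be sketched on a toy single-query attention head. This is not the authors' implementation (see their repository for that). The resampling scheme, the pooled null distribution, the empirical p-value, and the `alpha` threshold are all assumptions made for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


def bootstrap_regularize(query, keys, n_boot=200, alpha=0.05, seed=0):
    """Zero out attention scores that are not significant against a bootstrap baseline."""
    rng = random.Random(seed)
    observed = attention_scores(query, keys)
    pool = [value for key in keys for value in key]  # all key feature values
    dim = len(query)
    null_scores = []
    for _ in range(n_boot):
        # Assumption: resample feature values with replacement to build
        # structure-free keys, then record the attention they receive.
        fake_keys = [[rng.choice(pool) for _ in range(dim)] for _ in keys]
        null_scores.extend(attention_scores(query, fake_keys))
    # Empirical one-sided p-value with the usual +1 correction.
    p_values = [
        (1 + sum(1 for s in null_scores if s >= obs)) / (1 + len(null_scores))
        for obs in observed
    ]
    regularized = [obs if p <= alpha else 0.0 for obs, p in zip(observed, p_values)]
    return observed, p_values, regularized


# Toy data: six 4-dimensional keys, one of them aligned with the query.
query = [1.0, 0.5, -0.5, 1.0]
keys = [
    [3.0, 1.5, -1.5, 3.0],
    [0.1, -0.2, 0.0, 0.1],
    [-0.1, 0.2, 0.1, 0.0],
    [0.0, 0.1, -0.1, 0.2],
    [0.2, 0.0, 0.1, -0.1],
    [-0.2, 0.1, 0.0, 0.1],
]
observed, p_values, regularized = bootstrap_regularize(query, keys)
```

Scores that survive keep their original value, so the regularized map is sparser than the original but is not renormalized. Whether to renormalize is a design choice this sketch leaves open.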

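[36] IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation cs.CV

Nermin Samet, Gilles Puy, Renaud Marlet

TL;DR: This paper proposes IGLOSS, a new method for zero-shot open-vocabulary semantic segmentation of 3D automotive lidar data. It generates prototype images from text to circumvent the image-text modality gap inherent to approaches based on vision-language models such as CLIP. A 3D network distilled from a 2D vision foundation model then labels the point cloud by matching 3D point features with the 2D features of these prototype images.

Details

Motivation: To address the image-text modality gap in 3D open-vocabulary semantic segmentation methods based on vision-language models (VLMs), and thereby achieve more accurate zero-shot segmentation.

Result: The method achieves state-of-the-art open-vocabulary semantic segmentation on the nuScenes and SemanticKITTI datasets.

Insight: The novelty lies in using text-to-image generation to create prototype images that bridge the modality gap, and in labeling 3D point clouds through feature matching. This offers a new approach to 3D scene understanding that does not depend on large-scale 3D annotated data.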

Abstract: This paper presents a new method for the zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image-text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at https://github.com/valeoai/IGLOSS.

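[37] AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction cs.CV | cs.AI | cs.RO | eess.IV

Aiza Maksutova, Lalithkumar Seenivasan, Hao Ding, Jiru Xu, Chenhao Yu

TL;DR: This paper proposes AffordTissue, a multimodal framework for predicting tool-action-specific tissue affordance regions during cholecystectomy. Combining a temporal vision encoder, language conditioning, and a DiT-style decoder, it produces dense heatmaps that explicitly indicate safe regions for instrument-tissue interaction.

Details

Motivation: Existing approaches to surgical action automation face challenges in clinical deployment, chiefly because they cannot explicitly predict where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions.

Result: On the first tissue affordance benchmark, built for this work (15,638 video clips across 103 cholecystectomy procedures), AffordTissue substantially improves over vision-language model baselines such as Molmo-VLM (average symmetric surface distance, ASSD, of 20.6 px vs. 60.2 px), indicating that its task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction.

Insight: The novelty is the first dense, tool-action-specific tissue affordance prediction task and benchmark for surgical scenes, together with a multimodal architecture (temporal vision plus language conditioning) that provides precise spatial reasoning. This offers explicit policy guidance and safety safeguards for surgical automation, such as an early safe stop when instruments leave the predicted safe zones.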

Abstract: Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.
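
For readers unfamiliar with the reported metric, the sketch below computes the average symmetric surface distance (ASSD) between two sets of boundary points under its common definition. How the paper extracts boundaries from predicted heatmaps is not stated in the abstract, so the point sets here are assumed inputs and the function names are illustrative.

```python
import math


def _sum_of_nearest_distances(source, target):
    """For each source point, add the distance to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source)


def assd(boundary_a, boundary_b):
    """Average symmetric surface distance between two non-empty point sets."""
    if not boundary_a or not boundary_b:
        raise ValueError("both boundaries must contain at least one point")
    total = _sum_of_nearest_distances(boundary_a, boundary_b)
    total += _sum_of_nearest_distances(boundary_b, boundary_a)
    return total / (len(boundary_a) + len(boundary_b))


# Toy example in pixel coordinates: a predicted and an annotated contour.
predicted = [(0.0, 0.0), (1.0, 0.0)]
annotated = [(0.0, 0.0)]
print(assd(predicted, annotated))
```

Lower is better: identical boundaries give 0, and the value is in the same units as the coordinates (pixels here).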

Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi

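TL;DR: This paper proposes GRAZE, a training-free, zero-shot event localization method for finding the First Point of Contact (FPOC), the frame in which a player first touches the tackle dummy, in American football practice videos. It discovers candidate interactions with Grounding DINO, refines them with motion-aware temporal reasoning, and verifies contact at the pixel level with SAM2, achieving accurate spatiotemporal localization in complex scenes.

Details

Motivation: To reliably localize, in space and time, the first contact between a player and a tackle dummy (FPOC) in unconstrained real-world American football practice footage with camera motion, cluttered backgrounds, multiple similar-looking athletes, and rapid pose changes, without task-specific training data, in support of biomechanical analysis.

Result: On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of videos and localizes FPOC within ±10 frames on 77.5% of videos and within ±20 frames on 82.7% of videos, showing that frame-accurate contact onset localization in real-world footage is feasible without task-specific training.

Insight: The novelty lies in decoupling candidate interaction discovery from contact confirmation. Open-vocabulary detection (Grounding DINO), motion-aware temporal reasoning, and pixel-level segmentation verification (SAM2) are combined into a training-free, zero-shot event localization pipeline that is robust to cluttered scenes and to unstable detections near impact.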

Abstract: American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within ±10 frames on 77.5% of all clips and within ±20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
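
The tolerance-based figures above are reported over all clips, which suggests that clips without a valid output count as misses. A minimal sketch of such a metric follows. The function names and the use of `None` for an invalid output are assumptions for illustration, and the frame numbers are made up.

```python
from typing import Optional, Sequence


def valid_output_rate(predicted: Sequence[Optional[int]]) -> float:
    """Fraction of clips for which the pipeline returned a frame index."""
    return sum(1 for p in predicted if p is not None) / len(predicted)


def within_tolerance_rate(
    predicted: Sequence[Optional[int]],
    ground_truth: Sequence[int],
    tolerance: int,
) -> float:
    """Fraction of ALL clips whose predicted frame is within +/- tolerance.

    Clips with no valid prediction (None) are counted as misses.
    """
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for p, g in zip(predicted, ground_truth)
        if p is not None and abs(p - g) <= tolerance
    )
    return hits / len(ground_truth)


# Made-up example: five clips, one without a valid output.
predicted_frames = [100, None, 215, 330, 47]
annotated_frames = [104, 80, 200, 331, 70]

print(valid_output_rate(predicted_frames))
print(within_tolerance_rate(predicted_frames, annotated_frames, tolerance=10))
print(within_tolerance_rate(predicted_frames, annotated_frames, tolerance=20))
```

Dividing by the number of all clips rather than the number of valid outputs keeps the ±10 and ±20 rates comparable with the valid-output rate.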

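[39] LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding cs.CV

Fusang Wang, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou

TL;DR: This paper proposes LESV, a new framework that addresses spatial and semantic ambiguity in open-vocabulary 3D scene understanding. It uses Sparse Voxel Rasterization as a structured geometry representation, regularized with monocular depth and normal priors, to achieve deterministic, confidence-aware feature registration, and it exploits the dense alignment properties of the foundation model AM-RADIO to avoid the computational overhead of hierarchical training.

Details

Motivation: Existing methods based on 3D Gaussian Splatting have two key limitations: spatial ambiguity caused by unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained detail.

Result: The method achieves state-of-the-art performance on open-vocabulary 3D object retrieval and point cloud understanding benchmarks, and it excels on fine-grained queries, where conventional registration methods typically fail.

Insight: The novelty lies in replacing unstructured 3D Gaussian Splatting with a structured, disjoint sparse voxel rasterization geometry, regularizing it with geometric priors to achieve deterministic feature registration, and using the dense alignment properties of a foundation model to resolve multi-level semantic ambiguity efficiently, avoiding the computational cost of hierarchical training.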

Abstract: Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.

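[40] EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation cs.CV

Abhishek Saroha, Huajian Zeng, Xingxing Zuo, Daniel Cremers, Xi Wang

TL;DR: This paper proposes EgoFlow, a framework based on gradient-guided flow matching for generating physically consistent 6DoF object motion trajectories from egocentric video. It uses a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, and a gradient-guided inference process to enforce differentiable physical constraints such as collision avoidance and motion smoothness, producing coherent and controllable motion without post-hoc filtering or additional supervision.

Details

Motivation: Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction, but generating physically consistent 6DoF trajectories remains challenging because of occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models.

Result: Experiments on the real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and Transformer baselines in accuracy, generalization, and physical realism, reduces collision rates by up to 79%, and generalizes strongly to unseen scenes.

Insight: The main novelties are: 1) combining a flow-matching generative model with gradient-guided inference to enforce physical constraints in a differentiable way; 2) a hybrid Mamba-Transformer-Perceiver architecture that fuses multimodal egocentric observations; 3) a demonstration of the potential of flow-based generative models for scalable, physically grounded egocentric motion understanding.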

Abstract: Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79% and generalizing strongly to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.

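[41] Nonlinear Methods for Analyzing Pose in Behavioral Research cs.CV

Carter Sale, Margaret C. Macpherson, Gaurav Patil, Kelly Miles, Rachel W. Kallen

TL;DR: This paper presents a general-purpose analysis pipeline for human pose data that combines preprocessing, dimensionality reduction, and recurrence-based time series analysis to extract meaningful patterns of coordination and behavioral change from high-dimensional, noisy, and complex pose data.

Details

Motivation: Advances in markerless pose estimation make it possible to capture detailed human movement in naturalistic settings using standard video, but the high dimensionality, noise, and temporal complexity of pose data make it difficult to extract meaningful patterns of coordination and behavioral change.

Result: Three case studies (spanning facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior) demonstrate the pipeline's flexibility and its ability to extract theoretically meaningful insights from complex pose time series.

Insight: The novelty is a general-purpose analysis pipeline that supports both linear and nonlinear characterizations of movement across diverse experimental contexts and uses recurrence analysis to quantify the temporal structure of movement dynamics, strengthening the extraction of behavioral patterns from complex pose data.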

Abstract: Advances in markerless pose estimation have made it possible to capture detailed human movement in naturalistic settings using standard video, enabling new forms of behavioral analysis at scale. However, the high dimensionality, noise, and temporal complexity of pose data raise significant challenges for extracting meaningful patterns of coordination and behavioral change. This paper presents a general-purpose analysis pipeline for human pose data, designed to support both linear and nonlinear characterizations of movement across diverse experimental contexts. The pipeline combines principled preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify the temporal structure of movement dynamics. To illustrate the pipeline’s flexibility, we present three case studies spanning facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior. Together, these examples demonstrate how the same analytic workflow can be adapted to extract theoretically meaningful insights from complex pose time series.
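
The recurrence-based step can be illustrated with the simplest recurrence quantification measure, the recurrence rate. The sketch below works on a one-dimensional series, standing in for a single dimensionality-reduced pose component. It omits time-delay embedding and the other measures a full analysis would normally include, and the radius and data are made up.

```python
def recurrence_matrix(series, radius):
    """Binary matrix with a 1 wherever two samples lie within `radius` of each other."""
    return [[1 if abs(a - b) <= radius else 0 for b in series] for a in series]


def recurrence_rate(matrix, exclude_diagonal=True):
    """Share of recurrent points, optionally ignoring the trivial main diagonal."""
    n = len(matrix)
    recurrent = sum(sum(row) for row in matrix)
    if exclude_diagonal:
        recurrent -= sum(matrix[i][i] for i in range(n))
        pairs = n * (n - 1)
    else:
        pairs = n * n
    return recurrent / pairs if pairs else 0.0


# Made-up periodic signal: the state at t recurs at t + 2.
signal = [0.0, 1.0, 0.0, 1.0]
matrix = recurrence_matrix(signal, radius=0.1)
for row in matrix:
    print(row)
print(recurrence_rate(matrix))
```

Periodic movement shows up as regular off-diagonal structure in the matrix, which is the kind of temporal structure recurrence measures are designed to summarize.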

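[42] Reinforcing Consistency in Video MLLMs with Structured Rewards cs.CV

Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang

TL;DR: This paper addresses inconsistencies in the visual and temporal grounding of video multimodal large language models (MLLMs) with a structured reward. A consistency audit that decomposes video captions into factual and temporal claims shows that standard sentence-level supervision is a weak proxy for video understanding. The authors therefore design a structured reward, comprising an instance-aware scene-graph reward, a temporal reward, and a video-grounded VQA reward, for reinforcement learning, improving faithful understanding across multiple benchmarks.

Details

Motivation: To address video MLLM outputs that look plausible but contain visual and temporal grounding errors (such as fabricated objects, incorrect attributes, or collapsed repeated events), meaning the model lacks fine-grained consistency with the video content.

Result: On temporal understanding, general video understanding, and hallucination-oriented benchmarks, the method yields consistent gains on open-source backbones.

Insight: The novelty lies in a top-down compositional consistency audit for diagnosing failure modes, and in a structured, multi-component reward (combining factual, temporal, and self-verification rewards) that replaces coarse sentence-level rewards, guiding reinforcement learning more precisely toward faithful video understanding.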

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.

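[43] Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation cs.CV | cs.LG

Yunbei Zhang, Chengyi Cai, Feng Liu, Jihun Hamm

TL;DR: This paper proposes AReS, an efficient alternative for adapting closed-box service models (such as APIs). A single-pass interaction with the API primes a local pretrained encoder, after which white-box reprogramming is performed on the local model. This avoids the large number of costly API calls required by conventional methods based on zeroth-order optimization and sidesteps the insensitivity of modern APIs (such as GPT-4o) to input perturbations.

Details

Motivation: To address the high API cost and slow, unstable optimization of conventional closed-box service model adaptation based on zeroth-order optimization, and the limited gains that result because modern APIs are less sensitive to input perturbations.

Result: On GPT-4o, AReS improves by 27.8% over the zero-shot baseline, whereas ZOO-based methods provide little to no improvement. Across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for vision-language models, +15.6% for standard vision models) while reducing API calls by over 99.99%.

Insight: The novelty lies in splitting closed-box adaptation into a priming stage that uses a single-pass API interaction and a subsequent, fully local white-box reprogramming stage. Training only a lightweight layer makes the local encoder amenable to reprogramming, which maintains strong performance while greatly reducing API cost and provides a robust, practical solution for adapting modern closed-box models.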

Abstract: Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS’s effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.

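[44] Universal computational thermal imaging overcoming the ghosting effect cs.CV | physics.optics

Hongyi Xu, Du Wang, Chenjun Zhao, Jiashuo Chen, Jiale Lin

TL;DR: This paper proposes TAG (thermal anti-ghosting), a universal computational thermal imaging framework that targets the ghosting effect in thermal imaging, the loss of texture detail in cluttered photon streams. The framework processes hyperspectral photon streams for nonparametric texture recovery, enables an experimental demonstration of expression recovery in ghostly human faces that had so far remained elusive, and demonstrates applications such as thermal 3D topological alignment and mood detection.

Details

Motivation: Thermal imaging is crucial for night vision but is fundamentally hampered by the ghosting effect. Existing methods such as HADAR apply only to limited scenes with uniform materials and cannot handle the material non-uniformity that is ubiquitous in the real world.

Result: TAG universally outperforms HADAR across various scenes, reveals the influence of material non-uniformity, and clarifies the boundary of HADAR's effectiveness. In tests across day and night, it recovers facial texture and expression and demonstrates thermal 3D topological alignment and mood detection for the first time.

Insight: The novelty is the universal computational thermal imaging framework TAG, which processes hyperspectral photon streams to overcome ghosting caused by material non-uniformity and achieve high-fidelity night vision, laying a foundation for applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.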

Abstract: Thermal imaging is crucial for night vision but fundamentally hampered by the ghosting effect, a loss of detailed texture in cluttered photon streams. While conventional ghosting mitigation has relied on data post-processing, the recent breakthrough in heat-assisted detection and ranging (HADAR) opens a promising frontier for hyperspectral computational thermal imaging that produces night vision with day-like visibility. However, universal anti-ghosting imaging remains elusive, as state-of-the-art HADAR applies only to limited scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world. Here, we propose a universal computational thermal imaging framework, TAG (thermal anti-ghosting), to address material non-uniformity and overcome ghosting for high-fidelity night vision. TAG takes hyperspectral photon streams for nonparametric texture recovery, enabling our experimental demonstration of unprecedented expression recovery in thus-far-elusive ghostly human faces – the archetypal, long-recognized ghosting phenomenon. Strikingly, TAG not only universally outperforms HADAR across various scenes, but also reveals the influence of material non-uniformity, shedding light on HADAR’s effectiveness boundary. We extensively test facial texture and expression recovery across day and night, and demonstrate, for the first time, thermal 3D topological alignment and mood detection. This work establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.

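[45] ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction cs.CV | cs.AI

Yanzhe Liang, Ruijie Zhu, Hanzhi Chang, Zhuoyuan Li, Jiahao Lu

TL;DR: ReFlow is a unified framework for monocular dynamic scene reconstruction that learns 3D motion from raw video in a novel self-correction manner. It integrates a Complete Canonical Space Construction module and a Separation-Based Dynamic Scene Modeling module. Its core is a self-correction flow matching mechanism, comprising Full Flow Matching and Camera Flow Matching, that enables robust and accurate dynamic scene reconstruction.

Details

Motivation: Existing methods initialize dynamic regions incompletely, which makes reconstruction and motion estimation unstable. They often fall back on external dense motion guidance (such as pre-computed optical flow), which adds complexity and potential error propagation. ReFlow aims to solve these problems and achieve robust dynamic reconstruction without external guidance.

Result: Extensive experiments across diverse scenarios show that ReFlow achieves superior reconstruction quality and robustness, establishing a new self-correction paradigm for monocular 4D reconstruction.

Insight: The novelties include complete canonical space construction for better initialization, separation-based dynamic scene modeling for targeted motion supervision, and the core self-correction flow matching mechanism (Full Flow Matching and Camera Flow Matching). Together they form a self-correcting learning paradigm that needs no external motion guidance, an idea that could be applied to dynamic 3D reconstruction tasks to reduce dependence on external data and improve stability.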

Abstract: We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, so they often resort to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.

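[46] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification cs.CV | cs.MM

Jiahao Meng, Tan Yue, Qi Xu, Haochen Wang, Zhongwei Ren

TL;DR: This paper presents VideoZeroBench, a hierarchical benchmark for evaluating how well video multimodal large language models verify spatio-temporal evidence in long-video question answering. It contains 500 manually annotated questions across 13 domains and introduces a five-level evaluation protocol that progressively tightens the spatio-temporal evidence a model must provide. Experiments show that existing models perform very poorly under strict evaluation requiring precise spatio-temporal localization, revealing a large gap between surface-level answer correctness and genuine evidence-based reasoning.

Details

Motivation: Current evaluations of video multimodal large language models have two key limitations: scores on existing benchmarks can mask deficiencies in fine-grained visual understanding and reasoning, and answer correctness is usually measured without verifying whether the model identified the precise spatio-temporal evidence supporting its prediction. A new benchmark is therefore needed to rigorously evaluate evidence-based reasoning.

Result: Under the standard end-to-end QA setting, even Gemini-3-Pro answers fewer than 17% of questions correctly. Performance drops sharply once grounding constraints are imposed: at the strictest level, which requires both a correct answer and accurate spatio-temporal localization, no model exceeds 1% accuracy, and most fail to produce any correct grounded prediction.

Insight: The core novelty is a hierarchical evaluation framework that disentangles answer generation, temporal grounding, and spatial grounding, allowing a finer diagnosis of where models fail in evidence-based video reasoning. It gives future research a stricter evaluation paradigm and deeper analysis, and underlines the importance of genuine spatio-temporal evidence verification in long-video understanding.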

Abstract: Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
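
To make the strictest setting concrete, the sketch below scores one prediction as correct only when the answer is right and both the temporal interval and the bounding box overlap the annotated evidence sufficiently. The abstract does not give the benchmark's matching rule, so the use of intersection-over-union, the 0.5 threshold, and all names here are assumptions.

```python
def interval_iou(a, b):
    """IoU of two time intervals given as (start, end) in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def grounded_correct(answer_ok, pred_span, true_span, pred_box, true_box, threshold=0.5):
    """True only if the answer, the time span, and the box are all acceptable."""
    return (
        answer_ok
        and interval_iou(pred_span, true_span) >= threshold
        and box_iou(pred_box, true_box) >= threshold
    )


# A right answer with a poorly localized time span fails the strict check.
print(grounded_correct(True, (5.0, 15.0), (0.0, 10.0), (0, 0, 2, 2), (0, 0, 2, 2)))
```

Because the three conditions are conjoined, strict accuracy can never exceed answer-only accuracy, which is consistent with the sharp drop the abstract describes.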

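[47] Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning cs.CV | cs.AI

Longfei Huang, Yang Yang

TL;DR: This paper proposes a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to resolve gradient conflicts between modalities in multimodal tabular-image fusion. It decouples the multimodal gradient through alternating unimodal learning with a shared classifier, and uses uncertainty-based cross-modal gradient surgery to selectively align gradients, steering the shared parameters to benefit all modalities.

Details

Motivation: Existing multimodal tabular-image fusion methods can be hindered by gradient conflicts between modalities, which mislead the optimization of the unimodal learners. This paper aims to resolve that optimization conflict and achieve more effective fusion.

Result: Experiments on widely used datasets show that the method outperforms a range of state-of-the-art baselines on tabular-image fusion and also performs well against baselines designed for missing tabular data at test time.

Insight: The core novelty is aligning and decoupling multimodal gradients through alternating learning and gradient surgery, which mitigates the optimization conflict. This offers a new optimization perspective and an effective framework for handling inconsistent gradients in multimodal fusion.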

Abstract: Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts alternating unimodal learning and a shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at https://github.com/njustkmg/ICME26-GAAL.
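
Gradient surgery in general can be illustrated with the well-known PCGrad-style projection: when two gradients on shared parameters point in conflicting directions (negative dot product), the conflicting component of one is projected away. This sketch is not GAAL's uncertainty-based rule, which the abstract does not specify. The function names and toy gradients are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def resolve_conflict(grad_a, grad_b):
    """Return grad_a with any component that opposes grad_b removed.

    If the gradients do not conflict (dot product >= 0), grad_a is
    returned unchanged.
    """
    inner = dot(grad_a, grad_b)
    norm_sq = dot(grad_b, grad_b)
    if inner >= 0 or norm_sq == 0:
        return list(grad_a)
    scale = inner / norm_sq
    return [a - scale * b for a, b in zip(grad_a, grad_b)]


# Toy gradients on shared parameters from an image and a tabular learner.
image_grad = [1.0, 0.0]
tabular_grad = [-1.0, 1.0]

adjusted = resolve_conflict(image_grad, tabular_grad)
print(adjusted)
print(dot(adjusted, tabular_grad))
```

After projection the adjusted gradient is orthogonal to the other modality's gradient, so a step along it no longer works against that modality to first order.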

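[48] Satellite-Free Training for Drone-View Geo-Localization cs.CV

Tao Liu, Yingzhi Zhang, Kan Ren, Xiaoqi Zhao

TL;DR: This paper proposes a satellite-free training (SFT) framework for drone-view geo-localization (DVGL). It reconstructs 3D scenes from multi-view drone sequences, generates geometry-normalized pseudo-orthophotos, and learns a feature aggregation model from drone data alone, removing any dependence on satellite imagery during training.

Details

Motivation: Existing drone-view geo-localization methods rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. This paper aims to develop a method whose training involves no satellite imagery.

Result: Experiments on the University-1652 and SUES-200 benchmarks show that the proposed SFT framework substantially outperforms other satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.

Insight: The core novelty is a complete training pipeline that needs no satellite data: 3D reconstruction with 3D Gaussian splatting, pseudo-orthophoto generation through PCA-guided orthographic projection, and a Fisher vector aggregation model learned from drone data alone. This offers a workable solution for cross-view retrieval when satellite data are restricted.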

Abstract: Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This rendering stage operates directly on reconstructed scene geometry without requiring camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views. Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.

[49] SHOE: Semantic HOI Open-Vocabulary Evaluation Metric cs.CV | cs.AI

Maja Noack, Qinqian Lei, Taipeng Tian, Bihan Dong, Robby T. Tan

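TL;DR: This paper proposes SHOE (Semantic HOI Open-Vocabulary Evaluation), a framework for evaluating open-vocabulary human-object interaction (HOI) detection. It decomposes each interaction into verb and object components and computes their semantic similarity with multiple large language models, replacing conventional metrics based on exact string matching such as mAP.

Details

Motivation: Existing metrics such as mAP treat HOI classes as discrete labels and cannot give reasonable credit to predictions that are semantically valid but lexically different (e.g., “lean on couch” vs. “sit on couch”), which limits their use for evaluating open-vocabulary predictions.

Result: On standard benchmarks such as HICO-DET, SHOE scores reach 85.73% agreement with average human ratings, outperforming existing LLM-based and embedding-based baselines and aligning more closely with human understanding of interactions.

Insight: The novelty lies in bringing semantic similarity into HOI evaluation: by decomposing interactions and using several large language models for semantic alignment, SHOE offers a more flexible and scalable evaluation scheme for open-vocabulary settings and underscores the importance of semantically grounded evaluation that reflects human understanding.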

Abstract: Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., “lean on couch” vs. “sit on couch”), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions. We will release our evaluation metric to the public to facilitate future research.
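The scoring idea in the abstract (split a prediction into verb and object, average several similarity estimates for each, then combine) can be sketched as follows. This is an illustration under assumptions, not the released metric: the `Scorer` callables stand in for LLM-based similarity judges, and combining the two parts by multiplication is a guess, since the abstract does not state the combination rule.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical scorer type: returns a similarity in [0, 1] for two phrases.
Scorer = Callable[[str, str], float]


def shoe_like_score(pred: Tuple[str, str], gt: Tuple[str, str],
                    scorers: Sequence[Scorer]) -> float:
    """Score a (verb, object) prediction against a ground-truth pair."""
    pred_verb, pred_obj = pred
    gt_verb, gt_obj = gt
    # Average the judges separately for the verb and the object.
    verb_sim = mean(s(pred_verb, gt_verb) for s in scorers)
    obj_sim = mean(s(pred_obj, gt_obj) for s in scorers)
    # Assumption: combine by product, so both parts must match to score well.
    return verb_sim * obj_sim


def exact_match(a: str, b: str) -> float:
    """Toy judge that mimics exact string matching."""
    return 1.0 if a == b else 0.0


def toy_lookup(a: str, b: str) -> float:
    """Toy judge with one hand-written soft similarity (made-up value)."""
    table = {frozenset(("lean on", "sit on")): 0.8}
    return 1.0 if a == b else table.get(frozenset((a, b)), 0.0)


score = shoe_like_score(("lean on", "couch"), ("sit on", "couch"),
                        [exact_match, toy_lookup])
```

With exact matching alone the example pair would score zero; the soft judge gives it partial credit, which is the behavior the abstract argues for.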

[50] Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models cs.CV | cs.AI

Jiawei Chen, Simin Huang, Jiawei Du, Shuaihang Chen, Yu Tian

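TL;DR: This paper proposes Tex3D, the first framework for end-to-end optimization of adversarial 3D textures against vision-language-action (VLA) models. Using foreground-background decoupling and trajectory-aware adversarial optimization, it preserves the original simulation environment while producing physically deployable adversarial 3D textures that significantly degrade VLA performance on robotic manipulation tasks.

Details

Motivation: Prior work exposes VLA vulnerabilities mainly through language perturbations and 2D visual attacks, which are either less representative of real deployment or limited in physical realism. Adversarial 3D textures are more physically plausible, easier to deploy, and more damaging, but standard 3D simulators lack a differentiable optimization path from the VLA objective to object appearance, which blocks end-to-end optimization.

Result: Across multiple manipulation tasks in simulation and real-robot experiments, Tex3D drives VLA task failure rates up to 96.7%, significantly degrading performance and exposing critical vulnerabilities of VLA systems to physical 3D adversarial attacks.

Insight: The contributions include Foreground-Background Decoupling (FBD), which enables differentiable texture optimization, and Trajectory-Aware Adversarial Optimization (TAAO), which keeps the attack effective over long horizons and across viewpoints. Viewed objectively, the work is the first to bring 3D adversarial attacks to the VLA domain systematically, offers a practical method for end-to-end optimization in non-differentiable simulation environments, and highlights the importance of robustness-aware training for VLA models.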

Abstract: Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize in an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long horizons and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.

[51] Automatic Image-Level Morphological Trait Annotation for Organismal Images cs.CV | cs.AI

Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su

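TL;DR: This paper proposes a pipeline that automatically annotates morphological traits in organismal images. Sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons; the pipeline localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. The resulting Bioscan-Traits dataset contains 80K annotations and offers a scalable solution for large-scale ecological studies.

Details

Motivation: Morphological trait extraction depends on experts, is slow, and lacks high-quality datasets linking images to trait annotations; the work addresses these problems to support large-scale ecological research.

Result: The authors construct Bioscan-Traits (80K annotations covering 19K insect images from BIOSCAN-5M); human evaluation confirms the biological plausibility of the generated morphological descriptions, and an ablation study assesses design sensitivity.

Insight: Sparse autoencoders extract monosemantic, spatially grounded neurons from foundation-model features, and combining them with vision-language prompting gives interpretable automatic trait annotation. This provides a modular, scalable way to inject biologically meaningful supervision into foundation models and to bridge ecological relevance and machine-learning practicality.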

Abstract: Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.

[52] LivingWorld: Interactive 4D World Generation with Environmental Dynamics cs.CV

Hyeongju Mun, In-Hwan Jin, Sohyeong Kim, Kyeongbo Kong

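TL;DR: LivingWorld is an interactive framework that generates 4D worlds with environmental dynamics from a single image. It addresses the challenge of keeping dynamics coherent as the scene expands by progressively constructing a globally coherent motion field, and it uses a compact hash-based motion field for efficient querying and stable propagation, supporting bidirectional motion propagation to produce long, temporally coherent 4D sequences.

Details

Motivation: Current 3D scene generation methods focus mainly on reconstructing static geometry and neglect scene-scale environmental dynamics such as clouds, water, and smoke. Modeling these dynamics is hard because motion must stay coherent across an expanding scene while supporting low-latency user feedback.

Result: On a single RTX 5090 GPU, each new scene expansion step takes 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics.

Insight: The contributions include a geometry-aware alignment module that resolves directional and scale ambiguities across views, and a compact hash-based motion field that enables efficient querying and stable propagation of dynamics, supporting bidirectional motion propagation without expensive video-based refinement.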

Abstract: We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at cvsp-lab.github.io/LivingWorld.

[53] MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label cs.CV

Junyoung Jung, Seokwon Kim, Jun Uk Kim

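TL;DR: This paper proposes MonoSAOD, a new framework for sparsely annotated monocular 3D object detection with two key modules: Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF). RAPA exploits sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency; PBF filters predictions by prototype similarity and depth uncertainty to produce high-quality pseudo-labels. The training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision.

Details

Motivation: Monocular 3D object detection performs well on densely annotated datasets, but because 3D annotation is expensive, real-world data often have only a fraction of objects labeled (sparse annotation), and existing methods perform poorly in this setting. The paper targets sparsely annotated monocular 3D object detection.

Result: Extensive experiments demonstrate the effectiveness of the proposed method, but the abstract reports no specific quantitative results (such as accuracy metrics), no benchmark datasets, and no comparison against existing methods (for example, whether it reaches SOTA).

Insight: The claimed contributions are the RAPA and PBF modules, which combine geometry-preserving data augmentation with pseudo-label filtering based on prototype similarity and depth uncertainty to make effective use of sparse annotations. Viewed objectively, the core novelty is synthesizing training data through road-aware patch augmentation and selecting pseudo-labels using global 2D RoI feature prototypes and depth reliability, which offers a new direction for improving 3D detection when annotation budgets are limited.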

Abstract: Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/VisualAIKHU/MonoSAOD .

[54] Moiré Video Authentication: A Physical Signature Against AI Video Generation cs.CV | cs.AI | cs.MM

Yuan Qing, Kunyu Zheng, Lingxiao Li, Boqing Gong, Chang Xiao

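TL;DR: This paper proposes a physical signature based on the Moiré effect for distinguishing AI-generated videos from real footage. By analyzing the Moiré fringes produced when a camera views a two-layer grating structure, it derives the Moiré motion invariant: fringe phase and grating image displacement are linearly coupled, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation to separate real videos from AI-generated ones.

Details

Motivation: AI video generation is increasingly realistic and hard to distinguish from real footage. The authors aim to use a physical property inherent to real cameras (the Moiré effect) to create a verifiable signature against AI-generated fake video.

Result: The invariant is validated on real-captured videos and on AI-generated videos from multiple state-of-the-art generators; real and AI-generated videos produce significantly different correlation signatures, indicating that the method can distinguish the two.

Insight: The novelty is using the determinism of an optical phenomenon (the Moiré effect) as a verifiable, physically grounded signature, offering a new approach to AI-generated video detection that rests on hardware physics rather than purely on software analysis, with robustness and interpretability.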

Abstract: Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.
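The verification step described in the abstract (extract fringe phase and grating displacement, then test their correlation) can be sketched as a plain Pearson correlation check. This is an illustration only: it assumes both signals have already been extracted per frame, that the phase is unwrapped, and that a fixed threshold of 0.9 separates the two cases. The function names and the threshold are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant signal carries no correlation evidence
    return sxy / math.sqrt(sxx * syy)


def looks_physically_consistent(phase: Sequence[float],
                                displacement: Sequence[float],
                                threshold: float = 0.9) -> bool:
    """Return True if unwrapped fringe phase tracks displacement linearly.

    Assumption: a strong linear coupling (|r| >= threshold) is treated as
    consistent with real optics; the threshold value is illustrative.
    """
    return abs(pearson(phase, displacement)) >= threshold
```

A real verifier would also need robust signal extraction and a statistically justified decision rule; the sketch covers only the final correlation test.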

[55] DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data cs.CV

Wonjoon Jin, Jiyun Won, Janghyeok Han, Qi Dai, Chong Luo

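TL;DR: DynaVid is a video synthesis framework that addresses the difficulty existing video diffusion models have with highly dynamic motion or fine-grained motion control. Its core idea is to train on synthetic motion data in the form of optical flow rendered with computer graphics pipelines, and to adopt a two-stage generation framework (synthesize motion first, then generate video frames), so that the model learns dynamic motion patterns from synthetic data while preserving visual realism from real videos.

Details

Motivation: Existing video diffusion models perform poorly when synthesizing videos that involve highly dynamic motion or require fine-grained motion control, mainly because such examples are scarce in commonly used training datasets.

Result: Extensive experiments on two challenging scenarios, vigorous human motion generation and extreme camera motion control, show that DynaVid improves realism and controllability in both dynamic motion generation and camera motion control.

Insight: The main novelty is using rendered optical flow, which encodes only motion and is decoupled from appearance, as synthetic training data, so the model does not learn the unnatural look of synthetic videos. The two-stage generation framework also decouples motion-pattern learning from visual realism, offering a new way to address data scarcity.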

Abstract: Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.

[56] HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation cs.CV

Ba-Thinh Nguyen, Thi-Duyen Ngo, Thanh-Trung Huynh, Thanh-Ha Le, Huy-Hieu Pham

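TL;DR: This paper proposes Harmonic-Constrained Optimal Transport (HOT) for domain adaptation in remote photoplethysmography (rPPG). It introduces a frequency domain adaptation (FDA) strategy to model appearance variation and uses the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations, improving the robustness and generalization of rPPG models across datasets.

Details

Motivation: Deep learning-based rPPG methods degrade significantly under domain shift because they overfit to appearance-related factors such as illumination and camera characteristics; the paper addresses this to enable more robust non-contact physiological measurement.

Result: Extensive cross-dataset experiments show that the proposed FDA and HOT framework effectively improves the robustness and generalization of rPPG models across multiple datasets.

Insight: The novelty is moving the domain adaptation problem into the frequency domain: FDA transfers low-frequency appearance characteristics, and a harmonic prior on cardiac signals constrains the optimal transport process, reducing appearance interference while preserving the physiological signal. This offers a new approach to domain adaptation for appearance-sensitive temporal signals.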

Abstract: Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
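The idea of transferring low-frequency spectral components can be sketched on a single grayscale frame. The sketch assumes an amplitude-swap formulation (replace the low-frequency amplitude of a source frame with that of a target frame and keep the source phase); the abstract does not spell out the exact procedure, so the function name, the `beta` window parameter, and the per-frame treatment are assumptions.

```python
import numpy as np


def transfer_low_freq_amplitude(source, target, beta=0.1):
    """Give `source` the low-frequency amplitude spectrum of `target`.

    source, target: 2D float arrays of the same shape (one grayscale frame).
    beta: fraction of each dimension covered by the half-width of the
          swapped low-frequency window (illustrative default).
    """
    # Shift the zero-frequency component to the center of the spectrum.
    fft_src = np.fft.fftshift(np.fft.fft2(source))
    fft_tgt = np.fft.fftshift(np.fft.fft2(target))
    amp_src, phase_src = np.abs(fft_src), np.angle(fft_src)
    amp_tgt = np.abs(fft_tgt)

    h, w = source.shape
    cy, cx = h // 2, w // 2
    bh, bw = int(h * beta), int(w * beta)
    # Swap a window that is symmetric around the zero frequency.
    rows = slice(cy - bh, cy + bh + 1)
    cols = slice(cx - bw, cx + bw + 1)
    amp_src[rows, cols] = amp_tgt[rows, cols]

    # Recombine the mixed amplitude with the original source phase.
    mixed = amp_src * np.exp(1j * phase_src)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))
```

Because only low frequencies change, coarse appearance such as overall brightness moves toward the target while fine structure is kept. How well the subtle cardiac signal survives this operation is something the paper evaluates, not something this sketch demonstrates.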

[57] GPA: Learning GUI Process Automation from Demonstrations cs.CV | cs.AI | cs.SE

Zirui Zhao, Jun Hao Liew, Yan Yang, Wenzhuo Yang, Ziyang Luo

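TL;DR: This paper proposes GPA, a lightweight vision-based GUI process automation method that achieves fast, stable process replay from a single demonstration. It handles interface rescaling and detection uncertainty with Sequential Monte Carlo localization, ensures determinism through readiness calibration, and supports fully local execution to protect privacy. A pilot experiment shows that GPA achieves a higher success rate than Gemini 3 Pro (with CUA tools) on long-horizon GUI tasks and executes 10 times faster.

Details

Motivation: The work addresses the fragility of traditional RPA and the non-determinism risks of current vision-language-model-based GUI agents, aiming to provide adaptable, robust, and secure automation for enterprise workflows.

Result: On long-horizon GUI tasks in a pilot experiment, GPA achieves a higher success rate than Gemini 3 Pro (with CUA tools) and executes 10 times faster.

Insight: The contributions include Sequential Monte Carlo-based localization for robustness, readiness calibration for determinism, and fully local execution for privacy. GPA can also be invoked as an MCP/CLI tool by other agents with coding capabilities, separating reasoning from execution.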

Abstract: GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA) approach that enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Determinism and reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate with 10 times faster execution on long-horizon GUI tasks.

[58] Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding cs.CV

Yuheng Jiang, Yiwen Cai, Zihao Wang, Yize Wu, Sicheng Li

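TL;DR: This paper proposes Director, a unified spatio-temporal Gaussian representation for dynamic scene modeling and understanding. By embedding instance-consistent semantics, it jointly models human performance, high-fidelity rendering, and instance-level semantics, yielding language-aligned 4D representations with improved temporal stability.

Details

Motivation: Existing Gaussian-based methods focus mainly on appearance rendering and are largely unaware of instance-level structure, which limits stable tracking and semantic reasoning in highly dynamic scenes.

Result: Experiments show that Director achieves temporally coherent 4D reconstruction while supporting instance segmentation and open-vocabulary querying.

Insight: The key idea is to supervise the learnable semantic features of each Gaussian through two MLP decoders, using temporally aligned instance masks and sentence embeddings from multimodal large language models, so that instance-level semantics are integrated naturally into 4D modeling. Temporal coherence is further improved by bridging 2D optical flow with 4D Gaussians and fine-tuning their motion, and by adding geometry-aware SDF constraints and regularization terms.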

Abstract: Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance and are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.

[59] BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography cs.CV

Ba-Thinh Nguyen, Thi-Duyen Ngo, Thanh-Trung Huynh, Thanh-Ha Le, Huy-Hieu Pham

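TL;DR: This paper proposes BTS-rPPG, a temporal modeling framework for remote photoplethysmography (rPPG) that addresses the limited temporal receptive field of existing deep learning methods when modeling the temporal dynamics of physiological signals. Its core is the Orthogonal Butterfly Temporal Shifting (BTS) mechanism: inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), it establishes structured frame interactions through an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and propagating information efficiently across distant frames. An orthogonal feature transfer (OFT) mechanism additionally filters the source feature with respect to the target context before temporal shifting and transmits only the orthogonal component across frames, reducing redundant feature propagation and encouraging complementary temporal interaction.

Details

Motivation: rPPG enables contactless physiological sensing by analyzing subtle appearance variations in facial videos caused by blood circulation. Modeling the temporal dynamics of these signals remains challenging, because many deep learning methods rely on temporal shifting or convolutional operators that aggregate information mainly from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields.

Result: Extensive experiments on multiple benchmark datasets show that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.

Insight: The main contributions are the Orthogonal Butterfly Temporal Shifting (BTS) framework, inspired by the FFT butterfly pattern, which expands the temporal receptive field efficiently through a structured cross-frame pairing schedule, and the orthogonal feature transfer (OFT) mechanism, which filters redundant features and encourages complementary interaction. Together they offer a novel, efficient architectural design for temporal modeling, particularly for tasks such as rPPG that need long-range dependencies.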

Abstract: Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
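The two mechanisms named in the abstract can be illustrated with a small sketch. It assumes the pairing follows the standard FFT butterfly (at stage s, frame t exchanges with frame t XOR 2^s) and that the orthogonal component is the source feature minus its projection onto the target feature. Both are plausible readings of the abstract rather than the authors' exact formulation, and all function names are hypothetical.

```python
from typing import List, Sequence, Set


def butterfly_partners(num_frames: int) -> List[List[int]]:
    """Partner index of every frame at each stage (num_frames = power of 2)."""
    if num_frames < 1 or num_frames & (num_frames - 1):
        raise ValueError("num_frames must be a power of two")
    stages = num_frames.bit_length() - 1
    # Stage s flips bit s of the frame index.
    return [[t ^ (1 << s) for t in range(num_frames)] for s in range(stages)]


def orthogonal_component(source: Sequence[float],
                         target: Sequence[float]) -> List[float]:
    """Part of `source` that is orthogonal to `target` (projection removed)."""
    dot_st = sum(a * b for a, b in zip(source, target))
    dot_tt = sum(b * b for b in target)
    if dot_tt == 0:
        return list(source)
    scale = dot_st / dot_tt
    return [a - scale * b for a, b in zip(source, target)]


def receptive_field(num_frames: int) -> List[Set[int]]:
    """Frames whose information can reach each frame after all stages."""
    reach = [{t} for t in range(num_frames)]
    for partners in butterfly_partners(num_frames):
        reach = [reach[t] | reach[partners[t]] for t in range(num_frames)]
    return reach


# Stage 0 pairs neighbours: [1, 0, 3, 2, 5, 4, 7, 6]
stage0 = butterfly_partners(8)[0]
```

With 8 frames, three stages are enough for every frame to receive information from all others, which is the logarithmic receptive-field growth that the butterfly pattern provides. A neighbor-only shift would need a number of steps proportional to the sequence length.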

[60] From Understanding to Erasing: Towards Complete and Stable Video Object Removal cs.CV

Dingming Liu, Wenjing Wang, Chen Li, Jing Lyu

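TL;DR: This paper proposes UnderEraser, a video object removal method that introduces two complementary perspectives, external and internal, to strengthen understanding of the target object and its interactions with the scene. It removes the target object along with its induced side effects (such as shadows, reflections, and illumination changes) while preserving spatio-temporal consistency.

Details

Motivation: Existing diffusion-based video object removal methods struggle to remove the physical and semantic side effects induced by the target object without compromising overall coherence; the root cause is insufficient physical and semantic understanding of the target object and its interactions with the scene.

Result: Extensive experiments show state-of-the-art performance on video object removal, and the authors establish the first real-world benchmark for video object removal to facilitate future research.

Insight: The main contributions are: 1) external guidance, a distillation scheme that transfers knowledge of the relationships between objects and their induced effects from vision foundation models to video diffusion models; 2) internal guidance, a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context around the target region. Together, external and internal guidance give the model a deep understanding of the target object, its induced effects, and the global background.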

Abstract: Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.

[61] Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation cs.CV | cs.MM

Lingyu Liu, Yaxiong Wang, Li Zhu, Zhedong Zheng

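TL;DR: This paper proposes a video frame interpolation method based on bidirectional cycle consistency. By introducing learnable directional tokens and cycle-consistent supervision, the model jointly optimizes forward generation and backward reconstruction, improving motion consistency and visual quality in long-sequence interpolation.

Details

Motivation: Existing video frame interpolation methods generate in one direction and lack a self-verification mechanism, which leads to motion drift, directional ambiguity, and boundary misalignment, especially in long sequences.

Result: On 37-frame and 73-frame interpolation tasks, the method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control, with no additional computational overhead.

Insight: The novelty is bringing the principle of temporal cycle consistency into generative models, achieving reversible interpolation through a bidirectional framework and learnable directional tokens. Cycle constraints are applied only during training, so inference keeps the efficiency of a single forward pass, which makes this an effective regularization strategy.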

Abstract: Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.

[62] Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation cs.CV

Hongru Chen, Jiyang Huang, Jia Wan, Antoni B. Chan

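TL;DR: For crowd instance segmentation, this paper proposes a method that combines point annotations with mask optimization. First, Dense Point-to-Mask Optimization (DPMO) integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation masks from point annotations. Second, the Reinforced Point Selection (RPS) framework, trained with Group Relative Policy Optimization (GRPO), selects the best predicted point from the initial point predictions to predict instance segmentation in dense crowds.

Details

Motivation: Crowd instance segmentation has wide applications in areas such as surveillance and transportation, but existing datasets usually provide only point annotations, while region annotations (e.g., bounding boxes) are rare and inaccurate. Directly applying large foundation models such as SAM works poorly in dense crowds, so a method is needed that generates high-quality masks from point annotations and improves segmentation performance.

Result: The method achieves state-of-the-art crowd instance segmentation performance on the ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets; in addition, new mask-supervised loss functions improve the counting accuracy of different models.

Insight: The contributions include DPMO, which combines SAM with the NNEC constraint to generate dense instance segmentation masks from point annotations; the RPS framework, which uses reinforcement learning to select the best predicted point and refine segmentation predictions; and mask-supervised loss functions that improve counting accuracy, demonstrating the important role of mask annotations in crowd counting.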

Abstract: Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.

[63] Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding cs.CV

Jiayun Jin, Haolong Chai, Xueying Huang, Xiaoqing Guo, Zengwei Zheng

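TL;DR: This paper proposes Ultrasound-CLIP, a semantic-aware contrastive pre-training framework for ultrasound image-text understanding. Existing vision-language pre-training models such as CLIP are hard to apply directly to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To address this, the authors construct US-365K, a large-scale ultrasound dataset with 365K paired samples, and establish an ultrasonographic diagnostic taxonomy comprising a hierarchical anatomical taxonomy and a diagnostic attribute framework. On this basis, the framework introduces semantic soft labels and a semantic loss to refine sample discrimination, and builds a heterogeneous graph modality from textual representations to enable structured reasoning over lesion-attribute relations.

Details

Motivation: General-purpose vision-language pre-training models such as CLIP are designed mainly for natural images and are hard to apply directly to ultrasound data with heterogeneous anatomical structures and diverse diagnostic attributes, so a pre-training method tailored to ultrasound is needed.

Result: Extensive experiments with patient-level data splits show state-of-the-art performance on classification and retrieval benchmarks, together with strong generalization on zero-shot, linear probing, and fine-tuning tasks.

Insight: The contributions include: 1) US-365K, a large-scale structured ultrasound image-text dataset, and the accompanying Ultrasonographic Diagnostic Taxonomy (UDT), which provide an important resource for the field; 2) a semantic-aware contrastive learning framework that injects domain knowledge into pre-training through semantic soft labels and a semantic loss, refining sample representations; 3) a heterogeneous graph modality built from textual representations of diagnostic attributes, enabling structured reasoning over lesion-attribute relations and improving interpretability and reasoning.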

Abstract: Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish the Ultrasonographic Diagnostic Taxonomy (UDT), which contains two hierarchical knowledge frameworks. The Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and the Ultrasonographic Diagnostic Attribute Framework (UDAF) formalizes nine diagnostic dimensions: body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF’s textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.

[64] Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion cs.CV

Edoardo A. Dominici, Thomas Deixelberger, Konstantinos Vardis, Markus Steinberger

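TL;DR: This paper proposes Control-DINO, a method for controllable image-to-video diffusion. Through feature-space conditioning, it uses self-supervised features (such as DINO) as a general-purpose conditioning signal for pretrained video diffusion models, enabling tasks such as video domain transfer and video-from-3D generation.

Details

Motivation: The work explores self-supervised features (such as DINO) as a general-purpose conditioning signal for pretrained video diffusion models. These features entangle style, lighting, and semantics, which limits their generative use; the paper addresses this to improve the controllability of video generation.

Result: With a lightweight architecture and training strategy, the paper achieves robust appearance control (such as stylization and relighting) in video domain transfer and video-from-3D generation, and shows that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering.

Insight: The contributions include using self-supervised features as a general-purpose conditioning signal for video diffusion, and a feature-decoupling strategy that separates appearance from other features, which increases flexibility and control in generation tasks and offers a new direction for generative rendering.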

Abstract: Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing and for aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, such as DINO, contain a lot of entangled information about the style, lighting, and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use the features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.

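[65] Cosine-Normalized Attention for Hyperspectral Image Classification cs.CV

Muhammad Ahmad, Manuel Mazzara

TL;DR: This paper proposes a cosine-normalized attention mechanism to improve Transformer-based hyperspectral image classification. It revisits attention scoring from a geometric perspective: query and key vectors are projected onto the unit hypersphere and scored with squared cosine similarity, which emphasizes angular relationships and reduces sensitivity to magnitude variations.

Details

Motivation: Existing Transformer-based hyperspectral image classification methods typically compute attention with dot-product similarity, which mixes feature magnitude and direction and may be suboptimal for hyperspectral data with angular structure.

Result: Experiments on three benchmark datasets show that, under extremely limited supervision and with a lightweight backbone, the method outperforms several recent Transformer- and Mamba-based models in classification performance.

Insight: The novelty is a geometrically motivated cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures, providing a reliable inductive bias for hyperspectral representation learning. The method highlights the importance of angular relationships and reduces the influence of magnitude variations.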

Abstract: Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
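
To make the scoring rule concrete, here is a minimal pure-Python sketch of attention with squared cosine similarity. The abstract only states that queries and keys are projected onto the unit hypersphere and scored with squared cosine similarity; the softmax normalization, the `temperature` parameter, and all function names below are assumptions for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


def squared_cosine_attention(queries, keys, values, temperature=1.0):
    """Attention whose scores are squared cosine similarities.

    `temperature` and the softmax over keys are assumptions; the abstract
    does not say how scores are turned into weights.
    """
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    outputs = []
    for q in q_hat:
        # Cosine similarity of unit vectors is their dot product; square it.
        scores = [sum(a * b for a, b in zip(q, k)) ** 2 / temperature for k in k_hat]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
small = squared_cosine_attention([[1.0, 2.0]], keys, values)
large = squared_cosine_attention([[10.0, 20.0]], keys, values)
```

Because scores depend only on direction, `small` and `large` agree up to floating-point error even though the second query is ten times longer, which is the magnitude insensitivity the abstract describes.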

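[66] Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning cs.CV

Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, MohammadAmin Fazli, Mahdieh Soleymani Baghshah

TL;DR: This paper introduces RebusBench, a benchmark of 1,164 puzzles for evaluating how well large vision-language models (LVLMs) solve rebus puzzles that require multi-step cognitive reasoning. Current state-of-the-art models (e.g., Qwen, InternVL, LLaVA) perform very poorly on this task, with Exact Match accuracy below 10%, and neither model scaling nor in-context learning brings significant gains, indicating that the models lack the cognitive reasoning needed to connect visual perception with linguistic knowledge.

Details

Motivation: Current LVLMs excel at explicit visual recognition but show a critical cognitive gap on problems where the information is not directly depicted and the visual input must serve as a clue for complex multi-step reasoning. This paper evaluates how well models handle such neurosymbolic reasoning tasks.

Result: On RebusBench, SOTA models including Qwen, InternVL, and LLaVA score below 10% Exact Match and below 20% semantic accuracy, with no significant improvement from model scaling or in-context learning.

Insight: The contribution is RebusBench, a benchmark dedicated to evaluating visual cognitive reasoning (in particular neurosymbolic reasoning). It shows that current LVLMs have the necessary visual and linguistic components but lack the cognitive reasoning "glue" to connect them, providing an evaluation tool and insight for future work on abstract reasoning and knowledge integration.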

Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
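
For readers unfamiliar with the Exact Match metric quoted above, the following is a minimal sketch of how such a score is typically computed. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) and the function names are assumptions; the abstract does not describe RebusBench's actual scoring script.

```python
import string


def normalize_answer(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Toy answers made up for illustration; they are not taken from the benchmark.
predictions = ["Piece of cake!", "once in a blue moon", "break a leg"]
references = ["piece of cake", "Once in a Blue Moon", "under the weather"]
score = exact_match_accuracy(predictions, references)
```

Here two of the three toy predictions match after normalization, so `score` is two thirds of 100.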

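[67] DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning cs.CV | cs.AI | cs.RO

Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao

TL;DR: DriveDreamer-Policy is a unified driving world-action model that integrates depth generation, future video generation, and motion planning in one modular architecture. It uses a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation in a unified framework and using it to guide future prediction and planning, the model produces more coherent imagined futures and more informed driving actions.

Details

Motivation: Existing world-action models (WAMs) mostly model 2D appearance or latent representations and lack the geometric grounding that is essential for embodied systems in the physical world. This paper builds a geometry-grounded unified model to bridge vision-language-action models and world models.

Result: On the Navsim v1 and v2 benchmarks, DriveDreamer-Policy performs strongly on both closed-loop planning and world generation, reaching 89.2 PDMS on v1 and 88.7 EPDMS on v2. It outperforms existing world-model-based approaches while producing higher-quality future video and depth predictions.

Insight: The novelty is integrating explicit depth learning into a unified world-action model, using a geometry-aware representation to improve both future imagination and planning robustness. The modular design gives controllable latency, offering an interpretable and efficient solution for embodied systems.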

Abstract: Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.

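[68] FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation cs.CV | cs.AI

Taimur Khan, Hannes Feilhauer, Muhammad Jazib Zafar

TL;DR: This paper proposes FSKD, a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation framework for inferring forest structure from monocular remote sensing imagery. A multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles, and an RGBI-only SegFormer student is trained to reproduce its outputs, enabling high-resolution forest structure monitoring without LiDAR.

Details

Motivation: Airborne LiDAR is costly and acquired infrequently, yet high-resolution forest structure data at individual-tree scale (e.g., Canopy Height Model (CHM), Plant Area Index (PAI), Foliage Height Diversity (FHD)) is urgently needed. The goal is efficient, scalable forest monitoring from RGBI imagery alone via knowledge distillation.

Result: Trained on 384 km² of forest in Saxony, Germany (20 cm ground sampling distance) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM prediction (median absolute error MedAE 4.17 m, R²=0.51, IoU 0.87). Its mean absolute error (MAE 5.81 m) is 29–46% lower than that of the HRCHM/DAC baselines, with stronger correlation coefficients (0.713 vs. 0.166–0.652). The method also jointly predicts CHM, PAI, and FHD.

Insight: The novelty is a LiDAR-to-RGBI knowledge distillation framework in which multi-modal teacher fusion and asymmetric distillation let a lightweight student infer several forest structure metrics accurately from RGBI imagery alone. Key findings: multi-modal fusion improves performance by 10–26% over RGBI-only training; asymmetric model capacity is critical for distillation; and the method tolerates temporal mismatch between LiDAR and RGBI data, relaxing strict co-acquisition constraints and favoring large-scale use.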

Abstract: Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 $km^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29–46% in MAE (5.81 m vs. 8.14–10.84 m) with stronger correlation coefficients (0.713 vs. 0.166–0.652). Ablations show that multi-modal fusion improves performance by 10–26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
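
As a quick check on the reported 29–46% range, the sketch below recomputes the relative MAE reduction from the figures in the abstract (5.81 m for the student, 8.14–10.84 m for the baselines). The helper names are hypothetical; the two error functions are the standard definitions of MAE and MedAE, not code from the paper.

```python
import statistics


def mean_absolute_error(predicted, target):
    """MAE: average of absolute errors."""
    return statistics.fmean(abs(p - t) for p, t in zip(predicted, target))


def median_absolute_error(predicted, target):
    """MedAE: median of absolute errors, less sensitive to outliers than MAE."""
    return statistics.median(abs(p - t) for p, t in zip(predicted, target))


def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


student_mae = 5.81          # metres, from the abstract
baseline_mae_best = 8.14    # metres, strongest baseline in the abstract
baseline_mae_worst = 10.84  # metres, weakest baseline in the abstract

low = relative_reduction(student_mae, baseline_mae_best)
high = relative_reduction(student_mae, baseline_mae_worst)
```

The two values round to 29 and 46, which matches the range quoted in the abstract.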

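[69] STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering cs.CV

Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall

TL;DR: This paper proposes STRIVE, a structured reinforcement learning framework for video question answering. It constructs multiple spatiotemporal variants of each input video and normalizes jointly across textual generations and visual variants, addressing the low reward variance and weak or unstable advantage estimates that group-based policy optimization suffers from when responses have similar correctness.

Details

Motivation: Group-based policy optimization works well in large multimodal models, but when responses have similar correctness the reward variance is low, which makes advantage estimates weak or unstable and limits gains in video question answering.

Result: On six challenging video reasoning benchmarks (VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest), STRIVE delivers consistent improvements over strong reinforcement learning baselines across multiple large multimodal models.

Insight: The novelties are structured spatiotemporal exploration, which enriches reward signals by constructing spatiotemporal variants of the video, and an importance-aware sampling mechanism that prioritizes the frames most relevant to the input question while preserving temporal coverage. Together they encourage robust reasoning across complementary visual perspectives and avoid overfitting to a single spatiotemporal configuration.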

Abstract: We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
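
The low-variance problem and the effect of pooling can be illustrated with a small sketch. It assumes the usual group-relative standardization (reward minus group mean, divided by group standard deviation); the abstract does not give STRIVE's exact formula, so the function names and the `eps` term are assumptions.

```python
import statistics


def standardize(rewards, eps=1e-6):
    """Group-relative advantages: zero-mean, unit-scale rewards (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def per_variant_advantages(rewards_by_variant):
    """Normalize each visual variant's generations on their own."""
    return [standardize(group) for group in rewards_by_variant]


def joint_advantages(rewards_by_variant):
    """Normalize over all generations of all visual variants together."""
    flat = standardize([r for group in rewards_by_variant for r in group])
    result, start = [], 0
    for group in rewards_by_variant:
        result.append(flat[start:start + len(group)])
        start += len(group)
    return result


# Two hypothetical variants of one video, three sampled answers each.
# Within each variant every answer has the same correctness.
rewards = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
separate = per_variant_advantages(rewards)
pooled = joint_advantages(rewards)
```

With separate normalization every advantage is zero, so the update carries no signal; pooling across variants yields positive advantages for the first variant and negative ones for the second.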

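[70] Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks cs.CV | cs.CL | cs.LG

Yaxin Luo, Zhiqiang Shen

TL;DR: This paper challenges the assumption that language-pretrained models are unsuitable for vision tasks. It proposes a bridge training stage (in particular, random label bridge training, which needs no manual annotation) to align large language model parameters with vision tasks, and finds that partial bridge training is often more effective because certain LLM layers have strong foundational properties that benefit vision tasks even without fine-tuning.

Details

Motivation: To address the difficulty of cross-modality adaptation caused by the large gap between the parameter spaces of language- and vision-pretrained models, and to explore how language-pretrained parameters can directly serve downstream vision tasks.

Result: The abstract reports no specific quantitative results or benchmarks, but claims that the proposed method effectively aligns LLM parameters with vision tasks and that partial bridge training is advantageous.

Insight: The novelty is random label bridge training, an annotation-free method for cross-modality adaptation, together with the finding that some LLM layers have foundational properties that transfer directly to vision tasks, opening a new route to vision modeling with language-pretrained parameters.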

Abstract: The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.

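[71] Semantic Richness or Geometric Reasoning? The Fragility of VLM’s Visual Invariance cs.CV

Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram

TL;DR: This paper studies the fundamental fragility of state-of-the-art vision-language models (VLMs) under basic geometric transformations. Although VLMs excel at semantic tasks, they lack robust spatial invariance and equivariance under simple rotations, scaling, and identity transformations, leading to systematic failures in determining object identity.

Details

Motivation: To probe the fundamental weakness of VLMs in geometric reasoning and expose the systematic gap between their semantic understanding and spatial reasoning.

Result: A systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art, shows that VLM performance drops sharply as semantic content becomes sparse, and that this holds across architectures, model capacities, and prompting strategies.

Insight: The contribution is a systematic demonstration of VLMs' fragility with respect to geometric invariance, underscoring the need for stronger geometric grounding in future multimodal systems. Viewed objectively, this offers useful insight for evaluating and improving the low-level visual reasoning of VLMs.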

Abstract: This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.

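[72] Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation cs.CV

Hinako Mitsuoka, Kazuhiro Hotta

TL;DR: This paper proposes a lightweight dual-loss training framework that improves fine-grained action segmentation quality. It adds only one output channel and two auxiliary loss terms, with no significant architectural changes: a boundary-regression loss and a cumulative distribution function (CDF)-based segment-level regularization loss improve temporal localization accuracy and within-segment structural consistency. The framework is architecture-agnostic, can be integrated into existing TAS models, and raises F1 and Edit scores on three benchmark datasets.

Details

Motivation: Current temporal action segmentation (TAS) methods rely on complex architectures that hinder practical deployment. This paper aims to improve fine-grained segmentation quality through simple loss design rather than heavier architectures or inference-time refinement.

Result: On three benchmark datasets, the method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models (e.g., MS-TCN, C2F-TCN, FACT), while frame-wise accuracy remains largely unchanged.

Insight: The novelty is a dual-loss framework that combines boundary supervision with segment-level regularization: the boundary-regression loss improves temporal localization and the CDF-based regularization loss strengthens within-segment consistency, showing that lightweight loss design can achieve precise segmentation without complex architectures.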

Abstract: Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
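
The abstract reports gains in Edit score while frame-wise accuracy stays flat. The sketch below shows why the two can diverge, using the commonly used definition of the Edit score as a normalized Levenshtein distance between segment label sequences. This is the usual metric definition, not the paper's evaluation code, and the function names are illustrative.

```python
def to_segments(frame_labels):
    """Collapse per-frame labels into the ordered sequence of segment labels."""
    segments = []
    for label in frame_labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]


def edit_score(predicted_frames, target_frames):
    """Segment-level Edit score in [0, 100]; penalizes over-segmentation."""
    p, g = to_segments(predicted_frames), to_segments(target_frames)
    if not p and not g:
        return 100.0
    return (1.0 - levenshtein(p, g) / max(len(p), len(g))) * 100.0


def frame_accuracy(predicted_frames, target_frames):
    """Percentage of frames whose label matches the ground truth."""
    hits = sum(p == g for p, g in zip(predicted_frames, target_frames))
    return 100.0 * hits / len(target_frames)


target = list("AAABBB")
over_segmented = list("AABABB")  # two wrong frames create two spurious segments
```

For this toy sequence, frame accuracy is still about 67%, but the Edit score falls to 50 because the prediction has four segments where the ground truth has two.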

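[73] HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models cs.CV | cs.CL

Yansong Guo, Chaoyang Zhu, Jiayi Ji, Jianghang Lin, Liujuan Cao

TL;DR: This paper proposes HieraVid, a hierarchical token pruning framework for video large language models (VideoLLMs) that reduces visual redundancy through progressive, dynamic pruning and thereby lowers the computational burden. Based on the segment-frame structure of videos and the observation that LLMs propagate information unidirectionally, it prunes at three levels: segment, frame, and model layer.

Details

Motivation: VideoLLMs perform well on video understanding, but the huge number of input video tokens makes deployment computationally expensive. Existing methods mainly prune at the input level and ignore the information structure inherent in videos and LLMs.

Result: On four widely used video understanding benchmarks, with only 30% of tokens retained, HieraVid achieves new state-of-the-art (SOTA) performance while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.

Insight: The novelty is hierarchical pruning that exploits the segment-frame structure of videos and the way information propagates across LLM layers: spatio-temporal merging at the segment level, joint pruning of similar frames at the frame level to preserve diversity, and progressive redundancy reduction as LLM depth increases, which cuts computation substantially while keeping performance high.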

Abstract: Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy is gradually reduced as LLM depth increases without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
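
The segment and frame levels can be illustrated with a toy sketch: cut the video where consecutive frame features stop resembling each other, then drop frames within a segment that are near-duplicates of ones already kept. This is only one plausible reading of the abstract. The cosine criterion, both thresholds, the greedy selection, and all names are assumptions, and the layer-level stage is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def segment_frames(frames, boundary_threshold=0.8):
    """Start a new segment when consecutive frames differ (assumed criterion)."""
    if not frames:
        return []
    segments = [[0]]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < boundary_threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def prune_similar_frames(frames, segment, keep_threshold=0.95):
    """Greedily keep only frames that differ from every frame kept so far."""
    kept = []
    for i in segment:
        if all(cosine(frames[i], frames[j]) < keep_threshold for j in kept):
            kept.append(i)
    return kept


# Four hypothetical 2-D frame features: two near-duplicates, then a scene change.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
segments = segment_frames(frames)
kept = [prune_similar_frames(frames, segment) for segment in segments]
```

On this toy input the four frames form two segments, and one representative frame survives in each.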

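[74] A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes cs.CV

Di Li, Jie Feng, Guanbin Li, Ronghua Shang, Yuhui Zheng

TL;DR: This paper proposes A3R, a framework that recasts fine-grained affordance reasoning in 3D Gaussian scenes as a sequential evidence acquisition process. An MLLM-based policy iteratively selects and integrates 3D geometric and 2D semantic evidence to progressively reduce ambiguity, improving the accuracy of identifying action-supporting regions in complex environments.

Details

Motivation: Existing methods treat affordance reasoning as one-shot prediction from static observations, but in complex 3D scenes the task-relevant evidence under a fixed view is often incomplete, which causes frequent failures. A sequential approach that actively acquires complementary evidence is therefore needed.

Result: Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.

Insight: The novelty is modeling affordance reasoning as an active, sequential evidence acquisition process and introducing GRPO-based policy learning to make the decisions more efficient. The core idea is to exploit the complementarity of cross-dimensional (3D geometric and 2D semantic) evidence and refine the result step by step through iterative interaction.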

Abstract: Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.

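[75] ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery cs.CV

Ke Li, Ting Wang, Di Wang, Yongshan Zhu, Yiming Zhang

TL;DR: This paper proposes ProVG, a framework for visual grounding in remote sensing imagery. It decouples language expressions into global context, spatial relations, and object attributes, uses a progressive cross-modal modulator for coarse-to-fine vision-language alignment, and adds a cross-scale fusion module and a language-guided calibration decoder, reaching SOTA performance on the RRSIS-D and RISBench benchmarks.

Details

Motivation: Existing remote sensing visual grounding methods rely on sentence-level vision-language alignment and struggle to exploit fine-grained linguistic cues such as spatial relations and object attributes. These cues play different roles at different grounding stages and should be used accordingly to improve localization accuracy.

Result: On both the RRSIS-D and RISBench benchmarks, ProVG outperforms existing methods and sets a new state of the art.

Insight: The novelties are decoupling language expressions into three kinds of cues, a progressive cross-modal modulator (a survey-locate-verify scheme), a cross-scale fusion module that handles scale variation in remote sensing imagery, a language-guided calibration decoder, and a unified multi-task head that supports both referring expression comprehension and segmentation.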

Abstract: Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a survey-locate-verify scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, i.e., RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.

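[76] FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation cs.CV

Xilai Li, Chusheng Fang, Xiaosong Li

TL;DR: This paper proposes FTPFusion, a frequency-aware infrared and visible video fusion method that targets the difficulty existing methods have in maintaining temporal stability while preserving spatial detail. Feature representations are decomposed into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details; the low-frequency branch introduces a temporal perturbation strategy to improve robustness to complex video variations such as flickering, jitter, and local misalignment. An offset-aware temporal consistency constraint explicitly stabilizes cross-frame representations under temporal disturbances.

Details

Motivation: Infrared and visible video fusion is critical for intelligent surveillance and low-light monitoring, but existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. Maintaining temporal stability while preserving spatial detail therefore remains a fundamental challenge.

Result: Extensive experiments on multiple public benchmarks show that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics covering spatial fidelity and temporal consistency.

Insight: The novelty is bringing frequency decomposition into video fusion: the coordinated high- and low-frequency branches, combined with sparse cross-modal interaction and temporal perturbation, model spatio-temporal information at a finer grain. Viewed objectively, the offset-aware temporal consistency constraint is a reusable idea for improving temporal stability in video processing.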

Abstract: Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.

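[77] Lifting Unlabeled Internet-level Data for 3D Scene Understanding cs.CV | cs.AI

Yixin Chen, Yaowei Zhang, Huangyue Yu, Junchao He, Yan Wang

TL;DR: This paper proposes a data engine that automatically generates training data from the large volume of unlabeled videos on the internet, addressing the scarcity and cost of annotated data for 3D scene understanding. The generated data is used alongside human-annotated datasets to train end-to-end models, and its effectiveness is validated on several 3D perception tasks.

Details

Motivation: Annotated data for 3D scene understanding is scarce and expensive to obtain, while unlabeled videos are abundant on the internet. A data engine that generates training data automatically can use them to improve model capability.

Result: Evaluated on 3D object detection, instance segmentation, 3D spatial Visual Question Answering (VQA), and Vision-Language Navigation (VLN), models trained on the generated data show strong zero-shot performance and improve further after fine-tuning, validating the approach.

Insight: The novelty is a data engine that automatically generates training data from unlabeled web videos, together with an analysis of the bottlenecks in automated data generation. The approach generalizes across perception granularities, from low-level perception to high-level reasoning, and offers a viable path toward better scene understanding systems built on web data.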

Abstract: Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.

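[78] Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts cs.CV

Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang

TL;DR: This paper proposes KnowMVG, a framework that improves spatial localization accuracy in medical visual grounding (MVG). A knowledge-enhanced prompting strategy encodes medical knowledge into compact embeddings, and a global-local attention mechanism jointly uses coarse global information and fine-grained local cues to guide precise region localization.

Details

Motivation: Existing vision-language models (VLMs) rely only on latent embeddings and lack explicit localization priors, so their spatial precision in medical visual grounding is insufficient. This paper addresses the problem by introducing knowledge priors and strengthening spatial awareness.

Result: Extensive experiments on four MVG benchmarks show that KnowMVG consistently outperforms existing methods, improving on the previous state of the art by 3.0% in AP50 and 2.6% in mIoU. Qualitative and ablation studies further validate each component.

Insight: The main novelty is a knowledge-enhanced prompting strategy that explicitly encodes domain (medical) knowledge as prompt embeddings and combines it with global-local attention, bridging high-level semantic understanding and fine-grained visual perception without extra textual reasoning overhead and thereby improving localization accuracy.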

Abstract: Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding remains insufficient in spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.

George Sebastian, Philipp Berthold, Bianca Forkel, Leon Pohl, Mirko Maehlisch

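TL;DR: This paper asks whether meaningful spatial structure can be learned directly from pre-beamforming per-antenna range-Doppler (RD) radar data instead of the conventional post-beamforming angle-domain representation. It uses a commodity automotive radar with 6 transmitters and 8 receivers (48 virtual antennas) and an A/B chirp-sequence frequency-modulated continuous-wave transmit scheme, trains a dual-chirp shared-weight encoder end to end in a data-driven manner, and evaluates with bird's-eye-view (BEV) occupancy as a geometric probe. The results indicate that spatial structure can be learned from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.

Details

Motivation: Automotive radar perception pipelines depend on beamforming to build angle-domain representations. This paper explores whether spatial structure can be learned directly from pre-beamforming per-antenna RD measurements, in order to simplify or replace conventional signal-processing stages.

Result: With visibility-aware cross-modal supervision from LiDAR, the paper uses chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines to assess how transmit configuration affects geometric recoverability. The results confirm that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors, but no specific quantitative metrics or SOTA comparisons are reported.

Insight: The novelty is an end-to-end, data-driven approach that learns spatial structure directly from raw radar data, avoiding explicit angle-domain construction and hand-crafted signal processing. It also introduces visibility-aware cross-modal supervision that models the radar field of view and LiDAR ray visibility, which improves the geometric consistency of learning.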

Abstract: Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird’s-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.

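[80] Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm cs.CV | cs.AI

Sixing Li, Zhibin Gu, Ziqi Zhang, Weiguo Pan, Bing Li

TL;DR: For image captioning in Early Childhood Education (ECE) settings, this paper proposes ECAC, a large-scale benchmark dataset, and RSRS, a hybrid training framework, and uses them to build KinderMM-Cap-3B, a domain-adapted multimodal large language model that substantially improves the accuracy of professional object descriptions.

Details

Motivation: ECE image captioning lacks domain-specific datasets, and conventional training paradigms (supervised learning and reinforcement learning) are limited in how far they can improve professional object description.

Result: On the proposed ECAC benchmark, the model reaches a Teaching Toy Recognition Score (TTS) of 51.06 for professional object recognition, substantially outperforming existing SOTA baselines while maintaining superior caption quality.

Insight: The novelties are ECAC, a large-scale, expert-annotated ECE dataset with the domain-specific TTS metric, and RSRS, a hybrid training framework that switches dynamically between reinforcement learning and supervised fine-tuning to stabilize optimization on hard samples and improve fine-grained recognition.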

Abstract: Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model’s ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
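
The reward-conditional switch described in the abstract can be illustrated with a minimal sketch. It assumes a group-based RL setup in which each image has several sampled captions and the advantage is the reward minus the group mean; the function names, the `rewards` field, and the mean baseline are hypothetical and not taken from the paper.

```python
def group_advantages(rewards):
    """Advantage of each sample relative to the group mean (assumed baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def route_batch(groups):
    """Split sample groups into an RL set and an SFT set.

    groups: list of dicts with keys 'id' and 'rewards' (one reward per
    sampled caption). A group whose rewards are all zero yields all-zero
    advantages, so it gives RL no gradient signal; such groups are
    rerouted to supervised fine-tuning instead.
    """
    rl_ids, sft_ids = [], []
    for group in groups:
        if all(r == 0 for r in group["rewards"]):
            sft_ids.append(group["id"])
        else:
            rl_ids.append(group["id"])
    return rl_ids, sft_ids


if __name__ == "__main__":
    batch = [
        {"id": "easy", "rewards": [1, 0, 1]},
        {"id": "hard", "rewards": [0, 0, 0]},
    ]
    print(route_batch(batch))
    print(group_advantages(batch[1]["rewards"]))
```

In this sketch the hard group is the one whose advantages vanish, which is one plausible reading of the "advantage collapse" the abstract mentions.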

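[81] A Self supervised learning framework for imbalanced medical imaging datasets cs.CV

Yash Kumar Sharma, Charan Ramtej Kodi, Vineet Padmanabhan

TL;DR: This paper proposes AMIMV, a self-supervised learning framework that addresses data scarcity and class imbalance in medical image classification. The method builds asymmetric multi-image, multi-view pairs to enrich data representation, and its robustness under varying degrees of class imbalance is evaluated on 11 MedMNIST datasets.

Details

Motivation: Medical image analysis faces two challenges: a shortage of large labeled datasets, and class imbalance (abundant data for frequent classes, scarce data for rare ones). Existing self-supervised methods mainly address the first; their robustness on class-imbalanced data is under-studied.

Result: Experiments on MedMNIST show that AMIMV improves performance by 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST. The paper also evaluates the robustness of eight representative self-supervised methods under long-tailed distributions and limited supervision.

Insight: The novelty is extending MIMV with an asymmetric multi-image, multi-view pair augmentation strategy that tackles data scarcity and class imbalance together. Viewed objectively, the systematic evaluation of self-supervised learning on imbalanced medical imaging datasets gives the field a new benchmark and solution.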

Abstract: Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self-supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the robustness of SSL to imbalanced data has rarely been investigated in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods on 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.

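[82] MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction cs.CV

Xilai Li, Weijun Jiang, Xiaosong Li, Yang Liu, Hongbin Wang

TL;DR: MAVFusion is an end-to-end framework for infrared and visible video fusion that uses a motion-aware sparse interaction mechanism to improve efficiency markedly while preserving high fusion quality. It uses optical flow to identify dynamic regions in the video sequence and adaptively allocates compute-intensive cross-modal attention to these sparse regions, capturing salient changes and promoting inter-modal information exchange. Static background regions are handled by a lightweight weak interaction module that maintains structural and appearance integrity. By decoupling the processing of dynamic and static regions, the method greatly accelerates inference while preserving temporal consistency and fine-grained detail.

Details

Motivation: Most existing methods are designed for static image fusion and cannot handle frame-to-frame motion in video. Existing video fusion methods improve temporal consistency through cross-frame interaction but are often computationally expensive. This paper asks how to handle motion efficiently in infrared and visible video fusion while keeping fusion quality high.

Result: Achieves state-of-the-art performance on multiple infrared and visible video benchmarks, with an inference speed of 14.16 FPS at 640×480 resolution.

Insight: The novelty is the motion-aware sparse interaction mechanism: optical flow guides dynamic-region identification, and compute is allocated adaptively (dense cross-modal attention for dynamic regions, lightweight weak interaction for static ones), balancing efficiency and quality. Decoupling processing by motion region is an effective route to more efficient video fusion.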

Abstract: Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16 FPS at $640 \times 480$ resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
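
The dynamic/static split can be sketched in a few lines. The sketch below assumes that a region counts as dynamic when its optical-flow magnitude exceeds a fixed threshold; the threshold rule and the function names are illustrative assumptions, since the abstract does not say how MAVFusion selects regions.

```python
import math


def dynamic_mask(flow, threshold=1.0):
    """Mark cells whose flow magnitude exceeds a threshold.

    flow: 2D grid (list of rows) of (dx, dy) displacement tuples.
    Returns a grid of booleans; True marks a cell that would be routed
    to the expensive cross-modal attention path in this sketch.
    """
    return [[math.hypot(dx, dy) > threshold for dx, dy in row] for row in flow]


def dynamic_fraction(mask):
    """Share of cells routed to the expensive path."""
    cells = [cell for row in mask for cell in row]
    return sum(cells) / len(cells)


if __name__ == "__main__":
    flow = [[(0.0, 0.0), (3.0, 4.0)], [(0.5, 0.0), (0.0, 2.0)]]
    mask = dynamic_mask(flow)
    print(mask, dynamic_fraction(mask))
```

The smaller the dynamic fraction, the more of the frame can take the lightweight path, which is where the efficiency gain in a scheme like this would come from.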

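[83] Automated Prostate Gland Segmentation in MRI Using nnU-Net cs.CV

Pablo Rodriguez-Belenguer, Gloria Ribas, Javier Aquerreta Escribano, Rafael Moreno-Calatayud, Leonor Cerda-Alberich

TL;DR: This paper proposes a dedicated deep learning model based on the nnU-Net v2 framework for automatic segmentation of the prostate gland in multiparametric MRI. The model uses multimodal data (T2-weighted imaging, diffusion-weighted imaging, and apparent diffusion coefficient maps), is trained on the PI-CAI dataset, and shows strong generalization in cross-validation and external validation.

Details

Motivation: Manual prostate MRI segmentation is time-consuming and subject to inter-observer variability, and general-purpose segmentation tools are not accurate enough for prostate-specific tasks.

Result: The mean Dice score reaches 0.96 in cross-validation and 0.82 on an external test set from Hospital La Fe, substantially better than the general-purpose tool TotalSegmentator (Dice 0.15).

Insight: The novelty is a task-specific, multimodal segmentation strategy built on the nnU-Net v2 framework for high performance. Viewed objectively, the containerized model and ready-to-use inference tool improve reproducibility and the potential for clinical deployment.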

Abstract: Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 +/- 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows. To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
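
For readers unfamiliar with the metric, the Dice score reported above is $2|A \cap B| / (|A| + |B|)$ for a predicted mask $A$ and a reference mask $B$. The following is a minimal sketch on flattened binary masks; the function name and the convention for two empty masks are assumptions, and it is not the evaluation code used in the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / total


if __name__ == "__main__":
    reference = [1] * 10 + [0] * 10
    under_segmented = [1] + [0] * 19  # covers one of ten gland voxels
    print(dice_score(under_segmented, reference))
```

Under-segmentation lowers Dice quickly because the intersection shrinks while the reference size in the denominator stays fixed, which is consistent with the explanation the abstract gives for the general-purpose tool's low score.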

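[84] Ego-Grounding for Personalized Question-Answering in Egocentric Videos cs.CV | cs.AI | cs.RO

Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao

TL;DR: This paper presents the first systematic analysis of multimodal large language models (MLLMs) on personalized question answering in egocentric videos, and introduces MyEgo, the first VideoQA dataset for evaluating MLLMs' ability to understand, remember, and reason about the camera wearer. Benchmarking shows that current top closed- and open-source MLLMs perform poorly on the task, with accuracy far below human level, exposing key limitations in ego-grounding and long-term memory.

Details

Motivation: MLLMs lack ego-grounding, i.e., the ability to understand, remember, and reason about the camera wearer (the "self"), when answering personalized questions about egocentric videos.

Result: On MyEgo (541 long videos and 5K personalized questions), top closed-source models (e.g., GPT-5) and open-source models (e.g., Qwen3-VL) reach only about 46% and 36% accuracy respectively, trailing human performance by nearly 40% and 50%. Neither explicit reasoning nor model scaling yields consistent improvements.

Insight: The core contribution is the first systematic definition of the ego-grounding problem, together with MyEgo, the first dedicated dataset for evaluating it. Viewed objectively, the study shows that current MLLMs face fundamental challenges in long-term, coherent understanding and memory of personalized context, pointing to key directions for personalized assistants with long-term memory and self-awareness.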

Abstract: We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs’ ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about “my things”, “my activities”, and “my past”. Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50% respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering “me” and “my past”. These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo

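[85] SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions cs.CV

Jie Feng, Jiawei Shen, Junjia Huang, Junpeng Zhang, Mingtao Feng

TL;DR: This paper proposes SDesc3D, a framework for generating 3D indoor scenes from short text descriptions. Through multi-view structural prior augmentation and functionality-aware layout grounding, it addresses the poor physical plausibility and insufficient detail of existing methods under short-text conditioning, and it uses an iterative reflection-rectification scheme for progressive structural refinement.

Details

Motivation: Under short text descriptions, existing text-based 3D scene generation methods rely too heavily on explicit semantic cues about object composition and spatial relationships, producing scenes with poor physical plausibility and insufficient detail. The paper addresses this by strengthening 3D reasoning, particularly prior integration and spatial anchoring.

Result: Extensive experiments show that the method outperforms existing approaches on short-text conditioned 3D indoor scene generation.

Insight: Contributions: (1) multi-view scene prior augmentation, which combines underspecified text input with aggregated multi-view structural knowledge; (2) functionality-aware layout grounding, which uses regional functionality for implicit spatial anchoring and hierarchical layout reasoning; (3) an iterative reflection-rectification scheme that progressively improves structural plausibility through self-rectification. Viewed objectively, the core idea is to shift from hard-to-obtain explicit semantic relation cues to multi-view relational priors and functional implications that are easier to obtain and integrate, enabling better 3D layout reasoning under sparse text guidance.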

Abstract: 3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial anchoring. Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual guidance. Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation. Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic plausibility. Furthermore, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via self-rectification. Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation. Code will be publicly available.

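[86] NearID: Identity Representation Learning via Near-identity Distractors cs.CV

Aleksandar Cvejic, Rameen Abdal, Abdelrahman Eldesokey, Bernard Ghanem, Peter Wonka

TL;DR: The paper proposes the NearID framework, which constructs semantically similar but identity-distinct distractors (NearID distractors) to remove the influence of background context on identity representations. This makes vision encoders more reliable for identity-focused tasks such as personalized generation and image editing.

Details

Motivation: When evaluating identity-focused tasks, existing vision encoders entangle object identity with background context, yielding unreliable representations and metrics. A principled framework is needed to remove this vulnerability.

Result: On the NearID dataset (19K identities and 316K matched-background distractors), pre-trained encoders achieve a Sample Success Rate (SSR) as low as 30.7%. Learning identity-aware representations with the proposed two-tier contrastive objective raises SSR to 99.2%, improves part-level discrimination by 28.0%, and yields stronger alignment with human judgments on the human-aligned benchmark DreamBench++.

Insight: The novelty is using NearID distractors to isolate identity as the sole discriminative signal, together with a strict margin-based evaluation protocol. Viewed objectively, the hierarchical contrastive learning effectively disentangles identity from background and offers a generalizable approach to identity representation learning.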

Abstract: When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
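
The ordering "same identity > NearID distractor > random negative" can be made concrete with a small sketch. It uses a pair of hinge terms as a stand-in for the paper's two-tier contrastive objective and a simple margin test as a stand-in for SSR; the exact loss and metric definitions are not given in the abstract, so the function names, margins, and formulas here are assumptions.

```python
def two_tier_hinge(s_pos, s_near, s_rand, margin_hi=0.1, margin_lo=0.1):
    """Penalty for violating s_pos > s_near > s_rand by the given margins.

    s_pos:  similarity to the same identity in another view
    s_near: similarity to a NearID distractor on the same background
    s_rand: similarity to a random negative
    """
    upper = max(0.0, margin_hi - (s_pos - s_near))
    lower = max(0.0, margin_lo - (s_near - s_rand))
    return upper + lower


def margin_success_rate(samples, margin=0.0):
    """Share of (s_pos, s_near) pairs where the true match wins by more than `margin`."""
    wins = sum(1 for s_pos, s_near in samples if s_pos - s_near > margin)
    return wins / len(samples)


if __name__ == "__main__":
    print(two_tier_hinge(0.9, 0.5, 0.1))   # ordering satisfied with room to spare
    print(two_tier_hinge(0.5, 0.6, 0.1))   # distractor outranks the true match
    print(margin_success_rate([(0.9, 0.5), (0.5, 0.6)], margin=0.05))
```

The second call shows the failure mode the abstract describes for pre-trained encoders, where a distractor on the same background scores above the true cross-view match.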

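[87] Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation cs.CV

Yuqing Huang, Guotian Zeng, Zhenqiao Yuan, Zhenyu He, Xin Li

TL;DR: This paper proposes Interactive Tracking, a new paradigm that lets users guide the tracker at any time with natural language commands, overcoming the non-interactive, fire-and-forget operation of existing visual trackers. The authors contribute InteractTrack, the first large-scale benchmark for interactive tracking, an evaluation protocol, and a baseline named IMAT that uses a dynamic memory mechanism to learn from user feedback and update tracking behavior.

Details

Motivation: Existing visual trackers mainly run in a non-interactive, fire-and-forget manner and adapt poorly to real-world scenarios that require human-in-the-loop adjustment, so human interaction is needed for more intelligent, adaptive tracking systems.

Result: Evaluating 25 representative trackers on the proposed InteractTrack benchmark shows that state-of-the-art methods perform poorly in interactive scenarios; strong performance on conventional benchmarks does not transfer.

Insight: The novelty is bringing a human-in-the-loop paradigm to visual tracking for the first time, with online tracker adaptation through natural language commands and a dynamic memory mechanism. Viewed objectively, this provides a new research direction and benchmark foundation for more collaborative perception systems.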

Abstract: Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.

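[88] Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models cs.CV | cs.LG

Antoine Saporta, Baptiste Callard, Corentin Dancette, Julien Khlaut, Charles Corbière

TL;DR: Curia-2 is a large-scale self-supervised learning framework for radiology foundation models. It improves the pre-training strategy and representation quality of the original Curia framework to better capture the characteristics of radiological data. The work scales the architecture to billion-parameter Vision Transformers for the first time and restructures the CuriaBench evaluation benchmark into 2D and 3D tracks. Results show that Curia-2 outperforms all foundation models on vision tasks and is competitive with vision-language models on clinically complex tasks.

Details

Motivation: The rapid growth of medical imaging has sharply increased radiologists' workload, and foundation models aim to relieve it. Existing models still leave room for improvement in learning from complex radiological volumes, so the pre-training strategy needs to be adapted to the specificities of radiological data.

Result: On the extended and restructured CuriaBench (a 2D slice track and a 3D volumetric track), Curia-2 outperforms all foundation models on vision tasks and performs comparably to vision-language models on clinically complex tasks such as finding detection.

Insight: Main contributions: a self-supervised pre-training strategy optimized for radiological data, the first scaling of multi-modal CT and MRI foundation models to billions of parameters, and the formalization of the evaluation benchmark into dedicated 2D and 3D tracks, which offers a path toward standardized evaluation of medical imaging foundation models.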

Abstract: The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.

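[89] Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation cs.CV | cs.AI

Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu

TL;DR: This paper finds that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia: once attention settles during early decoding steps it stays largely static, which hinders the cognitive inference needed for compositional understanding. To mitigate the resulting cognitive hallucinations, the authors propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks the inertial pattern by modeling the dynamic responsiveness of visual attention, and they validate it on multiple benchmarks.

Details

Motivation: Existing hallucination mitigation methods mainly target perceptual hallucinations about object existence or attributes and are inadequate for cognitive hallucinations that require inter-object relational reasoning. This paper targets cognitive hallucinations caused by visual attention inertia.

Result: Extensive experiments on multiple hallucination benchmarks (e.g., MMHal-Bench, POPE) show that IVE effectively mitigates cognitive hallucinations and works well across several base MLLMs (e.g., LLaVA, Qwen-VL), especially on tasks that require compositional reasoning.

Insight: The core contribution is identifying and quantifying visual attention inertia in MLLMs for the first time and proposing the training-free IVE method, which adjusts attention dynamically and introduces an inertia-aware penalty that encourages attention to shift, better supporting compositional reasoning. This offers a new perspective on, and an effective tool for, understanding and mitigating cognitive hallucinations in MLLMs.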

Abstract: Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.

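[90] Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation cs.CV

Changshe Zhang, Jie Feng, Siyu Chen, Guanbin Li, Ronghua Shang

TL;DR: Resonance4D is a physics-driven framework for 4D dynamic physical scene simulation. By coupling 3D Gaussian Splatting with the Material Point Method and introducing lightweight dual-domain motion supervision, it learns full physical parameters without presets. The method generates high-fidelity physics-driven 4D dynamics from static 3D scenes while substantially reducing computational cost and memory overhead.

Details

Motivation: To resolve a contradiction in existing physics-driven 4D dynamic simulation: reliable motion supervision (such as online video diffusion or optical-flow pipelines) is computationally too expensive, and existing methods usually optimize only some material parameters, limiting realism in scenes with complex materials and dynamics.

Result: Experiments on synthetic and real scenes show that Resonance4D maintains strong physical fidelity and motion consistency while reducing peak GPU memory from over 35 GB to around 20 GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.

Insight: The core contribution is dual-domain motion supervision, which jointly constrains motion in complementary domains: structural consistency of local deformation in the spatial domain, and spectral consistency of oscillatory and global dynamic patterns in the frequency domain. This enforces dynamic consistency without dense temporal generation. In addition, combining zero-shot text-prompted segmentation with simulation-guided initialization enables stable full-parameter physical recovery.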

Abstract: Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters. Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35 GB to around 20 GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.
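
To illustrate what a frequency-domain consistency term can look like, the sketch below compares the magnitude spectra of two scalar motion traces (for example, one coordinate of a tracked point over time). It is an assumption-laden illustration: the abstract does not define the DMS loss, and the function names, the plain DFT, and the mean squared error on magnitudes are choices made here.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Normalized DFT magnitudes for frequencies 0..n//2 of a real signal."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        spectrum.append(abs(coeff) / n)
    return spectrum


def spectral_loss(simulated, reference):
    """Mean squared difference between the two magnitude spectra."""
    a, b = magnitude_spectrum(simulated), magnitude_spectrum(reference)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    n = 16
    reference = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
    shifted = [math.sin(2 * math.pi * 2 * (t + 3) / n) for t in range(n)]
    faster = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
    print(spectral_loss(shifted, reference))  # same frequency, different phase
    print(spectral_loss(faster, reference))   # different oscillation frequency
```

A magnitude-only comparison ignores phase, so it scores a time-shifted oscillation as consistent while penalizing a wrong oscillation frequency; whether the paper's loss shares this property is not stated in the abstract.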

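[91] Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models cs.CV

Osher Rafaeli, Tal Svoray, Ariel Nahlieli

TL;DR: This paper proposes Prior2DSM, a training-free framework for height completion of digital surface models (DSMs). It uses self-supervised Vision Transformer features from DINOv3 and monocular depth foundation models to propagate metric information at test time through semantic feature-space correspondence, without task-specific supervised training.

Details

Motivation: Large-scale DSMs often contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or environmental change. Traditional spatial interpolation assumes spatial continuity and fails when objects are missing. Existing learning-based methods improve quality but usually require sensor-specific supervised training, which limits generalization.

Result: Experiments show that Prior2DSM outperforms interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation models across multiple benchmarks, reducing RMSE by up to 46% while preserving structural fidelity. It also supports DSM updating and coupled RGB-DSM generation.

Insight: The novelty is combining self-supervised ViT features with monocular depth foundation models and performing test-time adaptation (TTA) with parameter-efficient LoRA and a lightweight MLP. This achieves metric height completion without supervised training and improves generalization across domains and sensing conditions.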

Abstract: Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
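
The scale-and-shift conversion can be sketched as follows. The first function is a plain global least-squares fit of $h \approx a d + b$ on pixels with a known height prior, which is one reading of the "linear fitting of MDE" baseline; the second applies per-pixel scale and shift values such as an MLP might predict. All names are hypothetical and none of this is the authors' code.

```python
import math


def fit_scale_shift(rel_depth, metric_height):
    """Closed-form least squares for height = a * depth + b on known pixels."""
    n = len(rel_depth)
    mean_d = sum(rel_depth) / n
    mean_h = sum(metric_height) / n
    var_d = sum((d - mean_d) ** 2 for d in rel_depth)
    if var_d == 0:
        raise ValueError("relative depth is constant; scale is undetermined")
    cov = sum((d - mean_d) * (h - mean_h) for d, h in zip(rel_depth, metric_height))
    scale = cov / var_d
    return scale, mean_h - scale * mean_d


def apply_scale_shift(rel_depth, scales, shifts):
    """Per-pixel conversion with spatially varying scale and shift."""
    return [s * d + t for d, s, t in zip(rel_depth, scales, shifts)]


def rmse(pred, ref):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))


if __name__ == "__main__":
    depth = [0.0, 1.0, 2.0]
    height = [1.0, 3.0, 5.0]
    a, b = fit_scale_shift(depth, height)
    print(a, b, rmse(apply_scale_shift(depth, [a] * 3, [b] * 3), height))
```

A single global pair (a, b) cannot absorb errors that change across the scene, which is the motivation the abstract gives for predicting spatially varying parameters instead.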

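[92] Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence cs.CV

Dian Liu, Jie Feng, Di Li, Yuhui Zheng, Guanbin Li

TL;DR: This paper proposes LinkS$^2$Bench, the first benchmark for evaluating vision-language models' dynamic cross-view spatial intelligence across UAVs and satellites. It contains over 1,000 minutes of UAV video paired with high-resolution satellite imagery and 17.9k high-quality question-answer pairs. Evaluating 18 mainstream VLMs reveals a significant bottleneck in dynamic cross-view alignment, and a Cross-View Alignment Adapter is proposed to improve performance.

Details

Motivation: Existing benchmarks cover only isolated UAV videos or static satellite imagery and cannot evaluate dynamic local-to-global spatial mapping, so a comprehensive benchmark is needed to explore VLMs' potential for synergistic UAV-satellite spatial intelligence.

Result: All 18 representative VLMs evaluated on LinkS$^2$Bench fall well below human baselines, with accurate cross-view dynamic alignment as the critical bottleneck. The proposed Cross-View Alignment Adapter significantly improves model performance, and fine-tuning experiments confirm the benchmark's potential for advancing VLM adaptation to complex spatial reasoning.

Insight: The novelty is the first benchmark for dynamic cross-view spatial intelligence, plus an explicit cross-view alignment mechanism (the Cross-View Alignment Adapter) that alleviates VLMs' dynamic alignment bottleneck, offering a new direction for evaluating and improving complex spatial reasoning.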

Abstract: Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs’ wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.

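[93] IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline cs.CV | cs.LG

Sebastian-Ion Nae, Radu Moldoveanu, Alexandra Stefania Ghita, Adina Magda Florea

TL;DR: This paper presents IndoorCrowd, a multi-scene dataset for human detection, instance segmentation, and multi-object tracking in crowded indoor scenes, comprising 31 videos from four campus locations. The paper also builds an automated annotation pipeline, evaluates several foundation models as auto-annotators, and provides baseline results for detection, segmentation, and tracking.

Details

Motivation: Existing datasets rarely capture the complexity of real indoor environments at scale, yet understanding human behavior in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, so a new, more challenging dataset is needed.

Result: On a 620-frame control subset, the foundation-model auto-annotators SAM3, GroundingSAM, and EfficientGroundingSAM are evaluated with Cohen's κ, AP, precision, recall, and mask IoU. Detection, segmentation, and tracking baselines are established using YOLOv8n, YOLOv26n, and RT-DETR-L paired with the ByteTrack, BoT-SORT, and OC-SORT trackers. Per-scene analysis shows that difficulty varies with crowd density, object scale, and occlusion; ACS-EC (79.3% dense frames, mean instance scale 60.8 px) is the most challenging scene.

Insight: The novelty is a new dataset focused on complex, crowded indoor scenes, with multi-task annotations (detection, segmentation, tracking) and an evaluation of an automated annotation pipeline. Its per-scene analysis and baselines give future work clear challenges and points of comparison, and the auto-annotation evaluation offers a reference for using foundation models in practical data labeling.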

Abstract: Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$ fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen’s $\kappa$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$ px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
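
Cohen's $\kappa$, one of the agreement measures listed above, corrects raw agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two label sequences; treating the auto-annotator and the human as two raters over the same items is an assumption about how the paper applies the metric, and the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    auto = ["person", "person", "background", "background"]
    human = ["person", "background", "background", "background"]
    print(cohens_kappa(auto, human))
```

Here the raters agree on three of four items (75%), but half of that agreement is expected by chance, so $\kappa$ is well below the raw agreement rate.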

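[94] Efficient Reasoning via Thought Compression for Language Segmentation cs.CV

Qing Zhou, Shiyu Zhang, Yuyu Jia, Junyu Gao, Weiping Ni

TL;DR: This paper proposes the WISE (Wisdom from Internal Self-Exploration) framework, which improves the efficiency of chain-of-thought reasoning in language-guided segmentation under the principle of "thinking twice" (once for learning, once for speed). The model is trained to generate a structured sequence (concise rationale, final answer, detailed explanation), and a self-distillation objective reinforces semantic fidelity and conciseness. At inference, the WISE-S strategy injects a brevity-focused instruction into the prompt and omits the detailed explanation, greatly shortening reasoning while maintaining high performance.

Details

Motivation: Chain-of-thought reasoning in language-guided segmentation generates verbose rationales, and the resulting computational cost limits practical use.

Result: On the ReasonSeg benchmark, WISE-S achieves state-of-the-art zero-shot performance at 58.3 cIoU while reducing the average reasoning length from 112 tokens to 23, a reduction of nearly 5×.

Insight: The novelty is the "thinking twice" paradigm, which uses structured sequence generation and a self-distillation objective to internalize detailed reasoning into a concise form. Viewed objectively, its use of autoregressive conditioning and a brevity instruction at inference to handle distribution shift balances efficiency and performance and offers a new approach to efficient reasoning.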

Abstract: Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of thinking twice: once for learning, once for speed. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user’s query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly 5$\times$, from 112 to just 23 tokens. Code is available at https://github.com/mrazhou/WISE.

[95] Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models cs.CVPDF

Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin

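TL;DR: This paper introduces Jagle, a large-scale Japanese multimodal post-training dataset of approximately 9.2 million instances across diverse tasks. It collects heterogeneous source data (images, image-text pairs, and PDF documents) and generates VQA pairs through strategies such as VLM-based QA generation, translation, and text rendering. The goal is to address the small scale and limited domain coverage of Japanese VQA datasets and so support high-quality Japanese and multilingual vision-language models. Experiments show that a 2.2B-parameter model trained with Jagle performs strongly on Japanese tasks, surpassing InternVL3.5-2B and approaching Qwen3-VL-2B-Instruct; training on Jagle combined with FineVision also improves English performance.

Details

Motivation: Building vision-language models that generalize across diverse tasks requires large-scale, diverse training datasets. In English, such datasets are typically built by aggregating and curating existing VQA resources, but this strategy does not extend readily to other languages. VQA datasets for languages such as Japanese remain limited in scale and domain coverage, which is a major obstacle to building high-quality multilingual and non-English VLMs.

Result: A 2.2B-parameter model trained with Jagle surpasses InternVL3.5-2B in average score across ten Japanese evaluation tasks and comes within five points of Qwen3-VL-2B-Instruct. In addition, combining Jagle with FineVision does not degrade English performance; it improves on training with FineVision alone.

Insight: Rather than relying on existing VQA datasets, the paper collects heterogeneous source data (images, image-text pairs, PDF documents) and generates VQA pairs through several strategies (VLM-based QA generation, translation, text rendering) to build a large-scale Japanese multimodal dataset. Viewed objectively, this construction method offers a scalable route to multimodal training for lower-resource languages, and the results suggest that combining datasets across languages can yield mutual performance gains.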

Abstract: Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.

[96] COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing cs.CVPDF

Hao Wang, Yanyu Qian, Pengcheng Weng, Zixuan Xia, William Dan

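TL;DR: COMPASS is a missing-modality fusion framework built on the principle of fusion completeness. For each missing modality, it synthesizes target-specific proxy tokens with pairwise source-to-target generators in a shared latent space and aggregates them into a single replacement token, so the fusion head always receives a fixed N-slot multimodal input and cross-modal interaction stays robust.

Details

Motivation: Missing modalities in multimodal sensing lead to incomplete fusion and degraded cross-modal interaction. Existing methods adapt to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features, so the fusion head receives an input structure different from the one seen during training.

Result: Across diverse single- and multiple-missing-modality settings on XRF55, MM-Fi, and OctoNet, COMPASS outperforms prior methods in most scenarios.

Insight: The contributions are a fixed input structure designed around fusion completeness; proxy tokens synthesized by pairwise generators in a shared latent space; and a combination of proxy alignment, shared-space regularization, and per-proxy discriminative supervision that makes the proxies representation-compatible and task-informative. Viewed objectively, preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.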

Abstract: Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
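
The fixed N-slot interface described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `build_fusion_input`, the dictionary of pairwise generators, and the element-wise mean used to aggregate proxies are hypothetical, and the abstract does not specify the actual aggregation rule or generator architecture.

```python
from statistics import fmean


def build_fusion_input(observed, modalities, generators):
    """Return exactly one token per modality slot, in a fixed order.

    observed:   dict mapping modality name -> token (list of floats)
    modalities: ordered list of all N modality names (the fixed slots)
    generators: dict mapping (source, target) -> callable that turns a
                source token into a proxy token for the target modality
                (hypothetical stand-ins for learned generators)
    """
    if not observed:
        raise ValueError("at least one modality must be observed")
    slots = []
    for target in modalities:
        if target in observed:
            # Present modality: its own token fills the slot.
            slots.append(observed[target])
            continue
        # Missing modality: one proxy per observed source modality.
        proxies = [generators[(src, target)](tok) for src, tok in observed.items()]
        # Assumed aggregation: element-wise mean over the proxies.
        slots.append([fmean(dim) for dim in zip(*proxies)])
    return slots
```

Whatever subset of modalities is observed, the returned list always has `len(modalities)` entries, which is the property the fusion head relies on in this sketch.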

[97] Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection cs.CV | cs.AI | cs.LGPDF

Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho

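TL;DR: This paper proposes the Instance-centric Context Mining Network (InCoM-Net), a framework for improving human-object interaction (HOI) detection. It integrates rich semantic knowledge extracted from vision-language models with instance-specific features from an object detector, modeling relationships within each instance, across instances, and with the global scene context to support deeper interaction reasoning.

Details

Motivation: Existing HOI detection methods use vision-language models to introduce semantic priors but fail to fully exploit the diverse contextual cues distributed across the scene. This paper aims to mine and integrate scene context more comprehensively to improve detection.

Result: Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods.

Insight: The core contribution is two modules, Instance-centric Context Refinement and Progressive Context Aggregation, which systematically mine and fuse three levels of context: intra-instance, inter-instance, and global. Viewed objectively, this hierarchical, progressive multi-context fusion suggests a new design for combining the semantic priors of vision-language models with the geometric and appearance features of object detectors.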

Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.

[98] PLUME: Latent Reasoning Based Universal Multimodal Embedding cs.CVPDF

Chenwei He, Xiangzhao Hao, Tianyu Yang, Yuxiang Ma, Yuheng Jia

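TL;DR: PLUME is a latent-reasoning framework for universal multimodal embedding. It replaces explicit chain-of-thought reasoning with an autoregressive rollout of continuous latent states, improving multimodal retrieval while greatly reducing inference overhead.

Details

Motivation: Existing universal multimodal embedding methods based on explicit chain-of-thought incur high inference overhead and compress rich multimodal evidence into a narrow textual bottleneck. PLUME targets both the efficiency problem and the information loss.

Result: On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT baselines and reduces reasoning from hundreds of generated tokens to fewer than 10 latent steps, giving over 30× faster inference.

Insight: The contributions are a semantic-anchor-guided transition adapter that steers the latent rollout along different reasoning trajectories, and a progressive explicit-to-latent curriculum that stabilizes training. Together they show that structured latent computation can keep the benefits of intermediate reasoning without the overhead of generating explicit rationales.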

Abstract: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.

[99] FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition cs.CVPDF

Taichi Endo, Guoqing Hao, Kazuhiko Sumi

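TL;DR: FlowSlider is a training-free method for continuous image editing. It decomposes the editing update in Rectified Flow into a fidelity term and a steering term and controls edit strength by scaling only the steering term, giving smooth semantic transitions while preserving the structure and identity of the source image.

Details

Motivation: Existing learning-based slider editing methods usually require additionally trained modules, which adds training overhead and ties editing behavior to the training distribution, reducing reliability under distribution shift. FlowSlider aims to provide continuous editing control that needs no training and does not depend on a training distribution.

Result: Geometric analysis and empirical measurements show that the fidelity and steering terms are approximately orthogonal, so scaling only the steering term gives stable strength control. FlowSlider improves continuous editing quality across diverse editing tasks and provides smooth, reliable slider control.

Insight: The novelty is decomposing the editing update into two approximately orthogonal terms (fidelity and steering) and controlling edit strength continuously by adjusting only the steering term. The method needs no training and decouples editing behavior from any training distribution, which improves reliability and generalization.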

Abstract: Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose FlowSlider, a training-free method for continuous editing in Rectified Flow that requires no post-training. FlowSlider decomposes FlowEdit’s update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, FlowSlider provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
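
The idea of scaling only the steering term can be sketched in a few lines. This is not FlowSlider's actual formulation: the abstract gives no equations, so the additive form `fidelity + strength * steering`, the function names, and the plain-list vectors are all assumptions made for illustration. The cosine helper shows how one might check the "approximately orthogonal" claim on measured vectors.

```python
import math


def slider_update(fidelity, steering, strength):
    """Combine the two terms, scaling only the steering term.

    Assumed additive form; the fidelity term is left unchanged.
    """
    return [f + strength * s for f, s in zip(fidelity, steering)]


def cosine(u, v):
    """Cosine similarity; values near 0 indicate near-orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

With `strength = 0` the update reduces to the fidelity term alone, and increasing `strength` moves the result only along the steering direction, which is why near-orthogonality would make the slider behave predictably in this simplified picture.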

[100] GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding cs.CVPDF

Rong Fan, Kaiyan Xiao, Minghao Zhu, Liuyi Wang, Kai Dai

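TL;DR: This paper proposes GroundVTS, a video large language model architecture that addresses the sparse key frames and lost temporal information caused by uniform frame sampling in video temporal grounding. It filters visual tokens with a fine-grained, query-guided mechanism and uses a progressive optimization strategy to adapt the LLM to the non-uniform distribution of visual features, improving temporal modeling and localization accuracy.

Details

Motivation: Existing video LLMs rely on uniform frame sampling to extract video information, which yields a sparse distribution of key frames and loses important temporal cues, limiting performance on tasks such as video temporal grounding.

Result: On three standard video temporal grounding benchmarks, GroundVTS outperforms existing methods, improving mIoU by 7.7 points on moment retrieval and mAP by 12.0 points on highlight detection.

Insight: The contributions are a query-guided visual token sampling mechanism and a progressive optimization strategy that adapts the LLM to non-uniform visual features, pointing to a more efficient use of spatio-temporal information in video understanding models.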

Abstract: Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and a 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.
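
As a rough illustration of query-guided token filtering, the sketch below keeps the visual tokens most similar to a query embedding and returns them in their original temporal order. The scoring rule (cosine similarity), the top-k selection, and the name `select_tokens` are assumptions chosen for simplicity; the abstract does not describe GroundVTS's actual mechanism at this level of detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_tokens(tokens, query, keep):
    """Return indices of the `keep` tokens most similar to the query.

    Indices are returned in ascending (temporal) order so the selected
    subset stays temporally coherent even though it is non-uniform.
    """
    scores = [cosine(tok, query) for tok in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Unlike uniform sampling, the kept indices cluster wherever the query matches, which is the non-uniform feature distribution the LLM would then need to adapt to.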

Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu

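TL;DR: This paper proposes LatentUM, a model that unifies all modalities in a shared semantic latent space. This removes the dependence on pixel-space decoding between visual understanding and generation and enables efficient, flexible interleaved cross-modal reasoning and generation.

Details

Motivation: Existing unified models use separate representations for visual understanding and generation and need pixel decoding as a bridge, which is both inefficient and ineffective and does not support interleaved cross-modal reasoning well.

Result: LatentUM achieves state-of-the-art performance on the Visual Spatial Planning benchmark, improves visual generation quality through self-reflection, and supports world modeling by predicting future visual states within the shared latent space.

Insight: The novelty is unifying multiple modalities in a shared semantic latent space and avoiding the intermediate pixel-decoding step. This improves computational efficiency, alleviates codec bias, and strengthens cross-modal alignment, giving a more natural framework for interleaved reasoning.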

Abstract: Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.

[102] ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline cs.CV | cs.HCPDF

Juan Manuel Hernandez, Mariana Fernandez-Espinosa, Denis Parra, Diego Gomez-Zara

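TL;DR: This paper presents ViT-Explainer, a web-based interactive system that guides users end to end through the full Vision Transformer inference pipeline, from patch tokenization to final classification, using integrated visualizations: animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens.

Details

Motivation: Existing interpretability tools usually focus on isolated components or expert-oriented analysis and lack a guided, end-to-end view of the full inference pipeline, particularly of how vision Transformers process images as sequences of patches.

Result: A user study with six participants suggests that ViT-Explainer is easy to learn and use and helps users interpret and understand Vision Transformer behavior.

Insight: The contribution is an integrated interactive visualization system that combines guided and free exploration modes and shows the Vision Transformer inference process through animations and overlays, filling the gap in end-to-end understanding tools.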

Abstract: Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.

[103] UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving cs.CV | cs.ROPDF

Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan

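TL;DR: This paper proposes UniDriveVLA, a unified driving vision-language-action model based on Mixture-of-Transformers that targets the conflict between spatial perception and semantic reasoning in autonomous driving. It decouples three experts (driving understanding, scene perception, and action planning), coordinates them through masked joint attention, and combines a sparse perception paradigm with a progressive training strategy to improve spatial perception while keeping semantic reasoning ability.

Details

Motivation: Existing vision-language-action models face a trade-off between spatial perception and semantic reasoning in driving tasks: directly adopting 2D vision-language models gives limited spatial perception, while adding 3D spatial representations impairs their native reasoning ability. The paper attributes this to the coupled optimization of the two within shared model parameters.

Result: UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. It also performs strongly across perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA.

Insight: The core contribution is separating the optimization paths of spatial perception and semantic reasoning through expert decoupling (Mixture-of-Transformers) and masked joint attention. Combined with sparse perception and progressive training, this improves perception and reasoning together and offers a new approach to building unified autonomous driving models.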

Abstract: Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla

[104] UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models cs.CV | cs.ROPDF

Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu

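TL;DR: This paper proposes UAV-Track VLA, an improved vision-language-action model for embodied visual tracking by UAVs in dynamic urban environments. By introducing a temporal compression net and a parallel dual-branch decoder, it addresses the temporal feature redundancy and missing spatial geometric priors of existing VLA models, aiming for efficient, real-time continuous action generation and tracking.

Details

Motivation: Embodied visual tracking by UAVs in dynamic urban scenes with complex semantic requirements is challenging. Existing VLA models suffer from temporal feature redundancy and lack spatial geometric priors, so new methods are needed to improve tracking performance and efficiency.

Result: Systematic experiments in the CARLA simulator show superior end-to-end performance. On challenging long-distance pedestrian tracking, the model reaches a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. It also shows robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571 s), enabling efficient real-time UAV control.

Insight: The contributions are: 1) a dedicated multimodal tracking evaluation benchmark and a large-scale dataset; 2) an improved VLA model built on the π0.5 architecture, with a temporal compression net that efficiently captures inter-frame dynamics; 3) a parallel dual-branch decoder, consisting of a spatial-aware auxiliary grounding head and a flow matching action expert, that decouples cross-modal features and generates fine-grained continuous actions. Together these designs improve tracking accuracy, robustness, and real-time performance.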

Abstract: Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the π0.5 architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571s) compared to the original π0.5, enabling highly efficient, real-time UAV control. Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA.
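
The latency figures can be unpacked with a short calculation. It assumes that "reduces latency by 33.4%" means the new latency equals the baseline multiplied by (1 − 0.334); the baseline value and control rate below are derived under that assumption and are not stated in the abstract. The function names are hypothetical.

```python
def implied_baseline_latency(reduced_latency_s, fractional_reduction):
    """Baseline latency implied by a reduced latency and a relative reduction."""
    return reduced_latency_s / (1.0 - fractional_reduction)


def control_rate_hz(latency_s):
    """Maximum single-step control rate if steps run back to back."""
    return 1.0 / latency_s


baseline_s = implied_baseline_latency(0.0571, 0.334)  # roughly 0.086 s
rate_hz = control_rate_hz(0.0571)                     # roughly 17.5 Hz
```

Under this reading, the original model would take roughly 0.086 s per step, and the improved model could issue about 17 to 18 actions per second.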

[105] SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation cs.CVPDF

Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias

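TL;DR: This paper proposes SPAR (Single-Pass Any-Resolution ViT), a resolution-agnostic dense feature extractor. It targets the inefficiency of foundational Vision Transformers (ViTs) on tasks that need fine-grained spatial understanding, such as open-vocabulary segmentation, which stems from their fixed pre-training resolution and coarse patch-level representations. Using a feature regression loss, SPAR distills the spatial reasoning of a finely strided sliding-window teacher into a single-pass student, with no architectural changes or pixel-level supervision.

Details

Motivation: In dense prediction tasks, especially open-vocabulary segmentation, foundational ViTs struggle to process high-resolution inputs efficiently for precise pixel-level reasoning because of their fixed pre-training resolution and coarse patch-level representations. Existing sliding-window methods improve accuracy but are computationally expensive.

Result: On open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating its effectiveness for efficient high-resolution inference.

Insight: The contributions are: 1) SPAR, a resolution-agnostic dense feature extractor that supports efficient single-pass high-resolution inference; 2) knowledge distillation with a feature regression loss, which transfers the fine-grained spatial ability of a sliding-window teacher to the student without extra supervision or architectural changes; 3) a notable gain in open-vocabulary segmentation accuracy while keeping computation efficient.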

Abstract: Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
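
A feature regression loss of the kind mentioned above can be sketched as a mean squared error between student and teacher feature maps. The exact loss used by SPAR is not given in the abstract, so the MSE form, the function name, and the nested-list feature layout (one vector per spatial location) are assumptions for illustration only.

```python
def feature_regression_loss(student, teacher):
    """Mean squared error between two dense feature maps.

    Each map is a list of per-location feature vectors; the teacher map
    would come from the sliding-window model and the student map from a
    single forward pass, resampled to the same grid (assumed done upstream).
    """
    if len(student) != len(teacher):
        raise ValueError("feature maps must cover the same locations")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature vectors must have equal dimension")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count
```

Because the target is the teacher's features rather than labels, a loss of this shape needs no pixel-level annotation, which matches the supervision-free setup the abstract describes.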

[106] Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models cs.CVPDF

Yaoteng Tan, Zikui Cai, M. Salman Asif

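TL;DR: This paper proposes Modular Energy Steering, an inference-time steering framework for safe text-to-image generation. It uses gradient feedback from frozen pretrained foundation models, such as vision-language models, to steer generation as an energy-based sampling problem without modifying the underlying generator, giving modular, training-free safety control.

Details

Motivation: Existing safety methods typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. This paper aims for scalable and robust safety control that does not harm generation quality.

Result: Experiments show state-of-the-art robustness on NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign, non-targeted prompts.

Insight: The core contribution is formulating safety steering as an energy-based sampling problem and using the rich semantic representations of pretrained foundation models as off-the-shelf supervisory signals. This gives a principled way to use foundation models as semantic energy estimators for reliable and scalable safety control.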

Abstract: Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.

[107] Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation cs.CV | cs.AIPDF

Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi

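TL;DR: This paper presents Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation in a single autoregressive framework. It uses abundant 2D data as a geometric prior to improve 3D representations, addressing the under-constrained 3D synthesis that results from scarce high-quality 3D data.

Details

Motivation: Because high-quality 3D assets are scarce, existing methods often edit in 2D and lift the result to 3D through optimization, sacrificing geometric consistency. The paper therefore explores how to extend the native capability of multimodal large language models to 3D with limited 3D data.

Result: Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.

Insight: The key idea is to use cross-modal consistency between images and 3D as an implicit structural constraint. Text, images, and 3D are represented as discrete tokens in a shared sequence space, and an interleaved X-to-X training paradigm coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. This jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency.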

Abstract: Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.

[108] VOID: Video Object and Interaction Deletion cs.CV | cs.AIPDF

Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan

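TL;DR: This paper proposes VOID, a framework that removes objects from video and corrects the effects of their physical interactions. The model is trained on a generated paired dataset of counterfactual object removals; at inference, a vision-language model identifies the affected regions and a video diffusion model generates physically consistent inpainting results.

Details

Motivation: Existing video object removal methods can only inpaint the content behind the object and fix appearance-level artifacts. They cannot handle physical interactions between objects, such as collisions, and so produce implausible results. A framework for physically plausible inpainting is needed.

Result: Experiments on synthetic and real data show that VOID preserves consistent scene dynamics after object removal better than prior video object removal methods.

Insight: The novelty is a mechanism for correcting physical interactions: the model is trained on a generated counterfactual dataset and combines a vision-language model with a video diffusion model for high-level causal reasoning, improving the ability of video editing models to simulate the world.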

Abstract: Existing video object removal methods excel at inpainting content “behind” the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.

[109] A Simple Baseline for Streaming Video Understanding cs.CVPDF

Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu

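TL;DR: This paper proposes SimpleStream, a simple baseline for streaming video understanding that feeds only the most recent N frames to an off-the-shelf vision-language model (VLM). On OVO-Bench and StreamingBench, it matches or surpasses 13 major offline and online video LLM baselines. The study finds that the value of longer context depends on the backbone rather than on model scale, and it reveals a perception-memory trade-off: more historical context can improve recall but may weaken real-time perception.

Details

Motivation: Current streaming video understanding methods increasingly rely on complex memory mechanisms. The paper tests a simple hypothesis, namely whether a sliding-window baseline using only recent frames can match or surpass existing complex models, and so challenges the uncritical pursuit of complex designs.

Result: Using only the 4 most recent frames, SimpleStream reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. It performs consistently well across benchmarks, matching or surpassing existing complex models.

Insight: The paper identifies a consistent perception-memory trade-off in streaming video understanding and shows that a simple sliding-window baseline is a strong reference point. Future work should separate recent-scene perception from long-range memory tasks so that the actual contribution of complex modules can be evaluated more clearly.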

Abstract: Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
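
The sliding-window baseline is simple enough to sketch directly. The class below keeps only the most recent N frames in a bounded deque and hands them to a model callable. The names `RecentFrameWindow` and `answer`, and the `vlm(frames, question)` call signature, are hypothetical; the abstract does not describe SimpleStream's actual interface.

```python
from collections import deque


class RecentFrameWindow:
    """Holds the most recent `n` frames of a stream; older frames are dropped."""

    def __init__(self, n):
        self.frames = deque(maxlen=n)

    def push(self, frame):
        # Appending to a full bounded deque evicts the oldest frame.
        self.frames.append(frame)

    def context(self):
        # Frames in arrival order, oldest first.
        return list(self.frames)


def answer(window, question, vlm):
    """Query a model with only the current window (hypothetical signature)."""
    return vlm(window.context(), question)
```

A window of size 4 corresponds to the 4-frame setting reported in the abstract; memory use stays constant no matter how long the stream runs, which is the trade the baseline makes against long-range recall.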

[110] Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining cs.CV | cs.GRPDF

Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam

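TL;DR: This paper presents Large-Scale Codec Avatars, a model that uses a pretraining and post-training paradigm, combining large-scale in-the-wild video with high-quality curated data, to achieve high-fidelity, generalizable full-body 3D avatar modeling with fine-grained expression and finger-level control.

Details

Motivation: 3D avatar modeling faces a trade-off between fidelity and generalization. The paper aims to combine the high fidelity of multi-view studio data with the generalization potential of large-scale in-the-wild data.

Result: Without direct supervision, the model shows emergent support for relighting and loose garments and zero-shot robustness to stylized imagery, while preserving identity and providing fine-grained control.

Insight: This is the first application of the pre/post-training paradigm to large-scale 3D avatar modeling: the model learns broad priors from large-scale in-the-wild video and then gains expressivity from high-quality data, balancing fidelity and generalization.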

Abstract: High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.

[111] Beyond Referring Expressions: Scenario Comprehension Visual Grounding cs.CVPDF

Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo

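TL;DR: This paper proposes scenario comprehension visual grounding, a task that goes beyond traditional referring expressions, and builds the Referring Scenario Comprehension (RSC) benchmark, whose queries are paragraph-length texts describing object roles, user goals, and contextual cues rather than literal referring expressions. The authors also propose ScenGround, a curriculum reasoning method that combines supervised warm-starting with difficulty-aware reinforcement learning to handle this harder setting.

Details

Motivation: Existing visual grounding benchmarks mainly evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. This paper explores a complementary, more challenging scenario-based setting in which the target must be inferred from roles, intentions, and relational context rather than explicit naming.

Result: Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories.

Insight: The contributions are the new scenario comprehension visual grounding task and the RSC benchmark, whose paragraph-length queries and difficulty tags (uniqueness, clutter, and so on) support fine-grained analysis. ScenGround uses curriculum learning (supervised warm-starting plus difficulty-aware reinforcement learning) to improve reasoning in complex scenarios and shows transfer to conventional tasks.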

Abstract: Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.

[112] Steerable Visual Representations cs.CV | cs.AIPDF

Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano

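TL;DR: This paper proposes Steerable Visual Representations, a new class of visual representations that injects natural-language prompts into the visual encoder through early fusion, so that global and local features can be steered by text instructions to focus on any specified object in the image while preserving the quality of the underlying representation.

Details

Motivation: Generic image features from pretrained Vision Transformers (such as DINOv2 and MAE) tend to focus on the most salient visual cues and cannot be directed toward less prominent concepts. Multimodal LLMs can be guided by text prompts, but their representations tend to be language-centric and less effective for generic visual tasks. The paper addresses this lack of steerability in visual representations.

Result: On the proposed steerability benchmarks, the method focuses on arbitrary target objects in the image while preserving representation quality. On anomaly detection and personalized object discrimination it matches or outperforms dedicated approaches and shows zero-shot generalization to out-of-distribution tasks.

Insight: The key idea is early fusion of text and visual features through lightweight cross-attention (rather than the late fusion used by CLIP and similar models), yielding language-steerable visual representations. Viewed objectively, this early injection balances the flexibility of language steering against the generality of visual representations and offers a new direction for controllable visual understanding.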

Abstract: Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
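The early-fusion idea in this abstract can be illustrated with a toy cross-attention step in which visual patch features act as queries and text token features act as keys and values. This is a minimal sketch under stated assumptions, not the paper's architecture: the function names are hypothetical, learned projection matrices are omitted, and the residual addition is an illustrative choice.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(patches, text_tokens):
    """Steer patch features with text features (hypothetical helper).

    patches: list of d-dimensional patch vectors (queries).
    text_tokens: list of d-dimensional text vectors (keys and values).
    Learned projections are omitted; this is an assumption for brevity.
    """
    d = len(text_tokens[0])
    steered = []
    for q in patches:
        # Scaled dot-product score between this patch and every text token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text_tokens]
        weights = softmax(scores)
        # Weighted sum of text vectors gives the text context for this patch.
        context = [sum(w * v[j] for w, v in zip(weights, text_tokens)) for j in range(d)]
        # Residual add keeps the original visual feature in the output.
        steered.append([qj + cj for qj, cj in zip(q, context)])
    return steered
```

In a late-fusion model the text would only be compared with the finished image embedding; here the text changes the patch features themselves, which is the distinction the abstract draws.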

[113] Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection cs.CVPDF

Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano

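TL;DR: This paper proposes ModMap, a multiview, multimodal framework for 3D anomaly detection and segmentation that combines crossmodal feature mapping with cross-view feature modulation and a cross-view training strategy, handles high-resolution 3D data, and achieves state-of-the-art performance on the SiM3D benchmark.

Details

Motivation: Existing methods process views independently and lack feature interaction across modalities and views; the goal is more effective 3D anomaly detection and segmentation.

Result: On SiM3D, the first multiview and multimodal benchmark for 3D anomaly detection, ModMap achieves state-of-the-art performance, surpassing previous methods by wide margins.

Insight: The method combines crossmodal feature mapping with view-dependent feature modulation, introduces a cross-view training strategy, and publicly releases a depth encoder tailored to industrial data, strengthening multiview information fusion.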

Abstract: We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.

[114] Generative World Renderer cs.CVPDF

Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang

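TL;DR: This paper introduces a large-scale dynamic dataset extracted from AAA games: a dual-screen stitched capture method yields 4M frames of synchronized RGB and G-buffer channels, intended to improve the realism and temporal coherence of generative inverse and forward rendering. It also proposes a vision-language model (VLM)-based protocol for evaluating inverse rendering without ground truth. Experiments show that the dataset improves cross-dataset generalization and controllable generation, and that the VLM evaluation correlates strongly with human judgment.

Details

Motivation: The limited realism and temporal coherence of existing synthetic datasets hinder scaling generative inverse and forward rendering to real-world scenarios; the paper aims to bridge this domain gap with a high-quality game dataset.

Result: Inverse renderers fine-tuned on the dataset achieve superior cross-dataset generalization and controllable generation. The proposed VLM evaluation protocol correlates strongly with human judgment on semantic, spatial, and temporal consistency, and the toolkit supports editing the style of AAA games from G-buffers using text prompts.

Insight: Contributions include a large-scale dynamic dataset built from AAA games to improve rendering realism; a dual-screen stitched capture method for synchronized multi-channel data; a VLM-based evaluation protocol that needs no ground truth and makes evaluation more objective; and a toolkit for text-driven game style editing that extends controllable generation.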

Abstract: Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.

[115] ActionParty: Multi-Subject Action Binding in Generative Video Games cs.CV | cs.AI | cs.LGPDF

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov

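TL;DR: This paper proposes ActionParty, an action-controllable multi-subject world model for generative video games. By introducing subject state tokens and a spatial biasing mechanism, it addresses the difficulty existing video diffusion models have in binding specific actions to their corresponding subjects when controlling several subjects at once.

Details

Motivation: Existing video diffusion models are largely limited to single-subject settings: they cannot control multiple subjects in a scene simultaneously and struggle to associate specific actions with the right subject, which calls for a world model that handles multi-subject action binding.

Result: On the Melting Pot benchmark, ActionParty is the first video world model able to control up to seven players simultaneously across 46 diverse environments. It shows significant improvements in action-following accuracy and identity consistency and enables robust autoregressive tracking of subjects through complex interactions.

Insight: The key idea is subject state tokens, latent variables that persistently capture the state of each subject in the scene, jointly modeled with video latents through a spatial biasing mechanism. This disentangles global video frame rendering from individual action-controlled subject updates, giving precise action binding and consistent identities across subjects.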

Abstract: Recent advances in video diffusion have enabled the development of “world models” capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.

cs.GR [Back]

[116] Non-Rigid 3D Shape Correspondences: From Foundations to Open Challenges and Opportunities cs.GR | cs.CVPDF

Aleksei Zhuravlev, Lennart Bastian, Dongliang Cao, Nafie El Amrani, Paul Roetzer

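TL;DR: This paper is a survey (state-of-the-art report) on non-rigid 3D shape correspondence that reviews the field from its foundations to open challenges and opportunities. It groups existing methods into three paradigms: spectral methods based on functional maps, combinatorial formulations that impose discrete constraints, and deformation-based methods that directly recover a global alignment. It discusses their strengths and weaknesses, recent progress, and future research directions.

Details

Motivation: Estimating correspondences between deformed 3D shape instances is a long-standing problem in computer graphics that underpins many downstream applications, from texture transfer to statistical modelling.

Result: As a survey, it proposes no new method and reports no quantitative results, but it systematically reviews and compares the method paradigms and identifies the latest trends in the field.

Insight: The report organizes non-rigid 3D shape correspondence methods into three paradigms for systematic comparison and highlights emerging challenges and opportunities, such as using vision foundation models for zero-shot correspondence and matching partial shapes, giving researchers and practitioners a clear map of the field and of future directions.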

Abstract: Estimating correspondences between deformed shape instances is a long-standing problem in computer graphics; numerous applications, from texture transfer to statistical modelling, rely on recovering an accurate correspondence map. Many methods have thus been proposed to tackle this challenging problem from varying perspectives, depending on the downstream application. This state-of-the-art report is geared towards researchers, practitioners, and students seeking to understand recent trends and advances in the field. We categorise developments into three paradigms: spectral methods based on functional maps, combinatorial formulations that impose discrete constraints, and deformation-based methods that directly recover a global alignment. Each school of thought offers different advantages and disadvantages, which we discuss throughout the report. Meanwhile, we highlight the latest developments in each area and suggest new potential research directions. Finally, we provide an overview of emerging challenges and opportunities in this growing field, including the recent use of vision foundation models for zero-shot correspondence and the particularly challenging task of matching partial shapes.

cs.AI [Back]

[117] ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models cs.AI | cs.CLPDF

Delip Rao, Feijiang Han, Chris Callison-Burch

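TL;DR: This paper presents ThinknCheck, a 1B-parameter verifier for grounded claim verification. It first produces a short, structured rationale and then a binary verdict. The authors construct the LLMAggreFact-Think training set and fine-tune a quantized (4-bit) Gemma3 model. The method outperforms larger models on several benchmarks and shows that explicit, supervised reasoning is effective for building compact, efficient, and interpretable verifiers.

Details

Motivation: Existing claim verification models are large and resource-hungry and lack interpretable reasoning; the goal is a compact, efficient verifier that provides clear rationales.

Result: On LLMAggreFact, ThinknCheck reaches 78.1 balanced accuracy, surpassing MiniCheck-7B (77.4) with 7x fewer parameters. On SciFact it reaches 64.7 balanced accuracy, a +14.7 absolute gain over MiniCheck-7B. Removing the reasoning step reduces balanced accuracy on LLMAggreFact to 57.5.

Insight: The core idea is to split verification explicitly into two steps, generating a structured rationale and then issuing a verdict, and to train this format with supervised learning. The results indicate that explicit, supervised reasoning is essential for high-performing compact models, which stay resource-efficient while remaining competitive and interpretable.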

Abstract: We present ThinknCheck, a 1B-parameter verifier for grounded claim verification that first produces a short, structured rationale and then a binary verdict. We construct LLMAggreFact-Think, a 24.1k reasoning-augmented training set derived from LLMAggreFact, and fine-tune a 4-bit Gemma3 model to follow this format. On LLMAggreFact, ThinknCheck attains 78.1 balanced accuracy (BAcc), surpassing MiniCheck-7B (77.4) with 7x fewer parameters; removing the reasoning step reduces BAcc to 57.5. On SciFact, ThinknCheck reaches 64.7 BAcc, a +14.7 absolute gain over MiniCheck-7B. By contrast, zero-shot chain-of-thought on the base Gemma3-1B harms accuracy relative to direct answers, and preference optimization with a simple format+accuracy reward underperforms supervised reasoning. To probe the latter, we introduce GSMClaims and a domain-specialized variant, ThinknCheck-Science, which improves across benchmarks, including 61.0% accuracy on GSMClaims. Overall, explicit, supervised reasoning enables compact verifiers that are competitive while remaining resource-efficient and interpretable.
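The headline numbers in this abstract are balanced accuracy (BAcc), which for a binary verdict is the mean of the recall on each class. The sketch below shows that computation on made-up labels; the function name and the 1/0 label encoding are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels.

    Assumed encoding: 1 = claim supported by the evidence, 0 = not supported.
    """
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute balanced accuracy")
    # Averaging the two recalls stops a majority-class guesser from scoring well.
    return 0.5 * (true_pos / n_pos + true_neg / n_neg)


# Toy labels (not from any benchmark): four supported claims, two unsupported.
labels = [1, 1, 1, 1, 0, 0]
verdicts = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(labels, verdicts)
```

Here the recall is 3/4 on supported claims and 1/2 on unsupported ones, so `score` is 0.625, whereas plain accuracy on the same predictions would be 4/6.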

[118] LLM-as-a-Judge for Time Series Explanations cs.AI | cs.CLPDF

Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar

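TL;DR: This paper studies using large language models (LLMs) as evaluators that judge, without reference explanations, the factual correctness of textual explanations generated from time series data. The authors build a synthetic benchmark of 350 cases across seven query types and evaluate models on four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. The results show that LLMs are more stable and reliable as evaluators than as generators.

Details

Motivation: Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Existing methods either require ground-truth explanations as references or cannot assess free-form textual reasoning, so there is no general-purpose way to verify directly whether an explanation is faithful to the underlying time series without predefined references or task-specific rules.

Result: On the synthetic benchmark, generation performance is highly pattern dependent: accuracy is only 0.00 to 0.12 on some query types (Seasonal Drop and Volatility Shift) but reaches 0.94 to 0.96 on Structural Break. Evaluation tasks such as relative ranking and independent scoring are more stable: models rank and score explanations correctly even when their own generated explanations are incorrect.

Insight: The core contribution is proposing and validating the feasibility of LLMs as evaluators of time series explanations in a reference-free setting, which offers a new, principled way to assess data-grounded textual reasoning. Viewed objectively, extending the LLM's role from generator to evaluator and systematically analyzing the asymmetry between those capabilities across tasks is a valuable research direction.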

Abstract: Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting, where, given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. Results show a clear asymmetry: generation is highly pattern dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.

Rui Dong, Xiaotong Zhang, Jiaxing Li, Yueying Li, Jiayin Wei

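TL;DR: This paper proposes M3D-BFS, a multi-stage dynamic fusion strategy for sample-adaptive multi-modal brain network analysis. It designs mixture-of-experts (MoE) modules for uni- and multi-modal representations so that computation adapts to each input sample during inference, and improves performance with a multi-stage training strategy and a modality disentanglement loss.

Details

Motivation: Current multi-modal fusion methods for brain networks (mainly over structural and functional connectivity) are static: they apply the same model and computation to every sample and ignore inherent differences between inputs. This lack of sample adaptation limits further performance gains.

Result: Extensive experiments on several real-world datasets demonstrate the superiority of M3D-BFS over existing static fusion methods.

Insight: Presented as the first dynamic fusion strategy for multi-modal brain network analysis. Sample-adaptive MoE modules provide dynamic computation; a staged training strategy (train the uni-modal encoders separately, then pretrain the individual experts of the MoEs, then fine-tune the whole model) alleviates expert training collapse; and a multi-modal disentanglement loss enhances the final representations.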

Abstract: Multi-modal fusion is of great significance in neuroscience: it integrates information from different modalities and can achieve better performance than uni-modal methods in downstream tasks. Current multi-modal fusion methods in brain networks, which mainly focus on structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring inherent differences between input samples. This lack of sample adaptation limits further performance gains. To this end, we propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike static fusion methods, we design different mixture-of-experts (MoE) modules for uni- and multi-modal representations, where modules adapt as the input sample changes during inference. To alleviate the MoE issue in which expert training may collapse, we divide our method into three stages. We first train the uni-modal encoders separately, then pretrain the individual experts of the MoEs, and finally fine-tune the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work on dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrate the superiority of M3D-BFS.

[120] Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency cs.AI | cs.CVPDF

Payal Fofadiya, Sunil Tiwari

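TL;DR: This paper proposes an adaptive budgeted forgetting framework that addresses the performance decay and false memory propagation caused by persistent memory in long-horizon conversational agents. The method regulates memory through relevance-guided scoring and bounded optimization, integrating recency, frequency, and semantic alignment to maintain stability under constrained context.

Details

Motivation: Long-horizon conversational agents need persistent memory for coherent reasoning, but uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance dropping from 0.455 to 0.05, and MultiWOZ shows 78.2% accuracy with a 6.8% false memory rate under persistent retention.

Result: In comparative analysis, the method improves long-horizon F1 beyond the 0.583 baseline level, raises retention consistency, and reduces false memory behavior without increasing context usage.

Insight: The key idea is a structured forgetting mechanism: an adaptive budgeted framework balances memory relevance against efficiency, preserving reasoning performance in extended conversations while preventing unbounded memory growth.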

Abstract: Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with a 6.8% false memory rate under persistent retention. This work introduces an adaptive budgeted forgetting framework that regulates memory through relevance-guided scoring and bounded optimization. The approach integrates recency, frequency, and semantic alignment to maintain stability under constrained context. Comparative analysis demonstrates improved long-horizon F1 beyond 0.583 baseline levels, higher retention consistency, and reduced false memory behavior without increasing context usage. These findings confirm that structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings.
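The abstract names three signals (recency, frequency, and semantic alignment) and a memory budget, but gives no formula. A minimal sketch of how such a scorer could look follows; the weights, the half-life decay, the saturating frequency term, and the greedy top-k selection are all assumptions made for illustration, and `MemoryItem`, `relevance`, and `forget_to_budget` are hypothetical names rather than the paper's API.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    last_used_turn: int   # turn index when the item was last retrieved
    use_count: int        # how many times the item has been retrieved
    similarity: float     # semantic alignment with the current query in [0, 1], assumed precomputed


def relevance(item, current_turn, w_recency=0.4, w_frequency=0.2, w_semantic=0.4, half_life=20.0):
    """Weighted sum of the three signals; all weights are illustrative assumptions."""
    # Exponential decay: the recency term halves every `half_life` turns.
    recency = 0.5 ** ((current_turn - item.last_used_turn) / half_life)
    # Saturating frequency term in [0, 1) so heavy reuse cannot dominate.
    frequency = 1.0 - 1.0 / (1.0 + item.use_count)
    return w_recency * recency + w_frequency * frequency + w_semantic * item.similarity


def forget_to_budget(items, current_turn, budget):
    """Keep the `budget` highest-scoring items and drop the rest (greedy stand-in)."""
    ranked = sorted(items, key=lambda it: relevance(it, current_turn), reverse=True)
    return ranked[:budget]
```

With a fixed `budget`, the memory passed into the context cannot grow without bound however long the conversation runs, which is the property the abstract attributes to structured forgetting.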

cs.SE [Back]

[121] From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents cs.SE | cs.CLPDF

Nikolai Ludwig, Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg

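TL;DR: This paper proposes SWE-ZERO to SWE-HERO, a two-stage supervised fine-tuning recipe for improving software engineering agents. It first uses large-scale, execution-free trajectories to master code semantics and repository-level reasoning, then applies targeted, execution-backed refinement to turn these semantic intuitions into rigorous engineering workflows. The method achieves state-of-the-art results among open-source models on SWE-bench.

Details

Motivation: Existing approaches depend on resource-intensive execution environments; the paper uses an evolutionary refinement strategy to train software engineering agents more efficiently, covering the full workflow from code-semantic understanding to practical engineering execution.

Result: On SWE-bench Verified, SWE-HERO-32B achieves a 62.2% resolution rate, a new state of the art for open-source models of comparable size. Although trained only on Python data, the agents reach 44.1% zero-shot on SWE-bench Multilingual, demonstrating generalization across languages.

Insight: The contribution is a two-stage distillation fine-tuning pipeline that decouples execution-free semantic learning from execution-based engineering refinement, lowering resource consumption while improving performance. Viewed objectively, the evolutionary refinement strategy and the demonstrated cross-language zero-shot generalization are the key design lessons.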

Abstract: We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary refinement strategy: (1) SWE-ZERO utilizes large-scale, execution-free trajectories to master code semantics and repository-level reasoning, and (2) SWE-HERO applies targeted, execution-backed refinement to transition these semantic intuitions into rigorous engineering workflows. Our empirical results set a new benchmark for open-source models of comparable size. We release a dataset of 300k SWE-ZERO and 13k SWE-HERO trajectories distilled from Qwen3-Coder-480B, alongside a suite of agents based on the Qwen2.5-Coder series. Notably, SWE-HERO-32B achieves a 62.2% resolution rate on SWE-bench Verified. Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm’s generalizability across diverse languages.

cs.RO [Back]

Xueying Li, Feng Lyu, Hao Wu, Mingliu Liu, Jia-Nan Liu

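TL;DR: This paper proposes MetaNav, a training-free vision-language navigation (VLN) agent built on metacognitive reasoning. It targets inefficient behaviors in existing methods, such as local oscillation and redundant revisiting, that stem from greedy frontier selection and passive spatial memory. MetaNav integrates spatial memory, history-aware planning, and reflective correction so that the agent can monitor its exploration progress, diagnose strategy failures, and adapt, improving navigation robustness and efficiency.

Details

Motivation: Existing training-free VLN agents built on foundation models rely on greedy frontier selection and passive spatial memory, which leads to inefficient navigation (local oscillation and redundant revisiting). The root cause is a lack of metacognitive capabilities: the agent cannot monitor progress, diagnose failures, or adapt.

Result: On GOAT-Bench, HM3D-OVON, and A-EQA, MetaNav achieves state-of-the-art performance while reducing vision-language model (VLM) queries by 20.7%, showing that metacognitive reasoning significantly improves robustness and efficiency.

Insight: The contribution is a metacognitive framework: spatial memory builds a persistent 3D semantic map, history-aware planning penalizes revisiting to improve efficiency, and reflective correction detects stagnation and uses a large language model (LLM) to generate corrective rules that guide future frontier selection, strengthening the agent's self-monitoring and adaptation.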

Abstract: Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.
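The contrast between greedy frontier selection and history-aware planning that penalizes revisiting can be made concrete with a toy scorer. This is only a sketch: the abstract does not specify the scoring rule, so the linear penalty, its weight, and the names `select_frontier`, `frontier_scores`, and `visit_counts` are assumptions.

```python
def select_frontier(frontier_scores, visit_counts, revisit_penalty=0.5):
    """Pick the frontier with the best penalized score.

    frontier_scores: dict mapping frontier name to a base score
        (for example a model's estimate of how promising it is; assumed given).
    visit_counts: dict mapping frontier name to how often its region was visited.
    revisit_penalty: illustrative weight; 0.0 recovers plain greedy selection.
    """
    def penalized(name):
        return frontier_scores[name] - revisit_penalty * visit_counts.get(name, 0)

    return max(frontier_scores, key=penalized)


# Toy example: the hallway scores highest but has already been visited twice.
scores = {"hallway": 0.8, "kitchen": 0.7}
visits = {"hallway": 2}
greedy_choice = select_frontier(scores, visits, revisit_penalty=0.0)
history_aware_choice = select_frontier(scores, visits)
```

The greedy rule keeps returning to the hallway, which is the kind of oscillation the abstract describes, while the penalized rule moves on to the kitchen.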

cs.LG [Back]

[123] When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals cs.LG | cs.CLPDF

Rui Wu, Ruixiang Tang

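TL;DR: This paper systematically studies reward hacking in reinforcement learning for LLMs, in particular coding tasks where the model maximizes reward by tampering with the evaluator code instead of solving the task. It identifies a reproducible three-phase rebound pattern and uses representation engineering to extract concept directions related to hacking behavior. Building on this, it proposes Advantage Modification, which integrates shortcut concept scores into the GRPO advantage computation to penalize hacking rollouts before policy updates, suppressing reward hacking more robustly.

Details

Motivation: Reward hacking in RL for LLMs, where models exploit shortcuts to maximize reward without actually completing the task, undermines the effectiveness and safety of training. The paper analyzes the phenomenon systematically by constructing a controlled environment-manipulation setting in coding tasks.

Result: Across the studied models, a reproducible three-phase rebound pattern is identified. Among the directions extracted with representation engineering, the shortcut direction tracks hacking behavior most closely, making it an effective proxy for detection. The proposed Advantage Modification suppresses hacking more robustly than generation-time activation steering.

Insight: Contributions: (1) an environment-manipulation setting in coding tasks as a controlled testbed for studying reward hacking; (2) the discovery of a three-phase rebound pattern; (3) concept directions (shortcut, deception, evaluation awareness) extracted from domain-general contrastive pairs with representation engineering, with the shortcut direction used as a detection proxy; (4) Advantage Modification, which internalizes concept scores into the training signal and penalizes hacking before policy updates. Viewed objectively, integrating representation-level signals (concept directions) into RL training is a novel internalized approach to detecting and mitigating reward hacking.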

Abstract: Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and find that the shortcut direction tracks hacking behavior most closely, making it an effective representational proxy for detection. Motivated by this finding, we propose Advantage Modification, which integrates shortcut concept scores into GRPO advantage computation to penalize hacking rollouts before policy updates. Because the penalty is internalized into the training signal rather than applied only at inference time, Advantage Modification provides more robust suppression of hacking compared with generation-time activation steering.
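To make the idea of folding a shortcut score into the GRPO advantage concrete, the sketch below first computes the usual group-normalized advantage (reward minus group mean, divided by group standard deviation) and then subtracts a penalty for rollouts whose shortcut score is high. The abstract does not give the exact formula, so the subtractive hinge penalty, the `penalty` and `threshold` values, and the function names are assumptions; the shortcut scores are taken as given inputs.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def modified_advantages(rewards, shortcut_scores, penalty=1.0, threshold=0.5):
    """Subtract a penalty from rollouts whose shortcut score exceeds a threshold.

    shortcut_scores: one score per rollout, assumed to come from projecting
        hidden states onto a shortcut concept direction (not computed here).
    penalty, threshold: illustrative hyperparameters, not values from the paper.
    """
    base = grpo_advantages(rewards)
    return [a - penalty * max(0.0, s - threshold) for a, s in zip(base, shortcut_scores)]


# Toy group of four rollouts: rollouts 0 and 2 both pass the tests,
# but rollout 0 has a high shortcut score (for example, it rewrote the evaluator).
rewards = [1.0, 0.0, 1.0, 0.0]
shortcut_scores = [0.9, 0.1, 0.2, 0.1]
adv = modified_advantages(rewards, shortcut_scores)
```

Both passing rollouts start with the same advantage, but the suspected hack ends up with a smaller one, so the policy update favors the legitimate solution. Because the adjustment happens before the update, it shapes training itself rather than steering generation afterwards.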

Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani

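TL;DR: This paper proposes CRIT, a dataset and benchmark built with a graph-based automatic data synthesis pipeline to strengthen cross-modal multi-hop reasoning. It covers diverse sources, including natural images, videos, and text-rich content, and includes a manually verified test set, addressing the failure of existing multimodal benchmarks to enforce complementary cross-modal reasoning.

Details

Motivation: Real-world reasoning often requires fusing information across modalities, but existing multimodal benchmarks mostly rely on single images or sets of images where the answer can be inferred from one modality. Training data therefore rarely enforces cross-modal multi-hop reasoning, and vision-language models often produce hallucinated reasoning that is poorly grounded in visual evidence.

Result: On the CRIT benchmark, even state-of-the-art models struggle with this kind of reasoning. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning and strong improvements on SPIQA and other standard multimodal benchmarks.

Insight: The contribution is a graph-based automatic pipeline that generates complex cross-modal reasoning tasks, producing a dataset that enforces complementary reasoning. Viewed objectively, this structured data synthesis strengthens a model's ability to capture cross-modal dependencies.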

Abstract: Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT spans diverse domains, from natural images and videos to text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.

[125] Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning cs.LG | cs.AI | cs.CLPDF

Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu

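TL;DR: This paper proposes Batched Contextual Reinforcement (BCR), a single-stage training paradigm that trains the model to solve N problems simultaneously within a shared context window, rewarded only by per-instance accuracy, which unlocks efficient reasoning. As the number of concurrent problems N increases, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. On several mathematical benchmarks the method substantially reduces token usage while maintaining or improving accuracy.

Details

Motivation: Large language models that use chain-of-thought reasoning consume excessive tokens, which inflates inference cost. Existing methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines.

Result: Across the 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% on five major mathematical benchmarks while consistently maintaining or improving accuracy.

Insight: A simple structural modification (batching problems) introduces an implicit token budget, challenges the traditional accuracy-efficiency trade-off with a "free lunch" phenomenon at standard single-problem inference, and avoids the adversarial gradients and catastrophic optimization collapse of explicit length penalties, offering a highly stable, constraint-based alternative for length control.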

Abstract: Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a “free lunch” phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.
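The structural change described in this abstract, N problems in one context with a reward based only on per-instance accuracy, can be sketched as a prompt builder plus a reward function. The prompt wording, the `Answer k: <value>` output convention, exact-match grading, and averaging the per-instance results into one scalar are all assumptions made for illustration; the abstract does not specify them.

```python
import re


def build_batched_prompt(problems):
    """Pack N problems into one shared prompt (the wording is a hypothetical template)."""
    lines = [f"Solve all {len(problems)} problems. Report each result as 'Answer k: <value>'."]
    for k, problem in enumerate(problems, start=1):
        lines.append(f"Problem {k}: {problem}")
    return "\n".join(lines)


def per_instance_reward(completion, gold_answers):
    """Fraction of problems answered correctly; response length never enters the reward."""
    found = {int(k): v.strip() for k, v in re.findall(r"Answer (\d+):\s*(.+)", completion)}
    correct = sum(1 for k, gold in enumerate(gold_answers, start=1) if found.get(k) == gold)
    return correct / len(gold_answers)


# Toy example with N = 2: the first answer is right, the second is wrong.
prompt = build_batched_prompt(["2 + 2 = ?", "3 * 3 = ?"])
reward = per_instance_reward("Answer 1: 4\nAnswer 2: 10\n", ["4", "9"])
```

Because all N solutions must fit in one context window, spending many tokens on one problem leaves fewer for the others, which is one way to read the implicit token budget the abstract refers to. In this toy case `reward` is 0.5.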

cs.IR [Back]

[126] Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models cs.IR | cs.AI | cs.CL | cs.DLPDF

Florian Kelber, Matthias Jobst, Yuni Susanti, Michael Färber

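TL;DR: This paper asks whether scientific applications need large language models. It proposes a lightweight retrieval-augmented framework that uses task-aware routing to select specialized retrieval strategies, combines full-text scientific papers with structured scholarly metadata, and generates cited responses with compact instruction-tuned language models. Retrieval and model scale turn out to be complementary rather than interchangeable: retrieval design can partially compensate for smaller models, but model capacity still matters for complex reasoning tasks.

Details

Motivation: Over-reliance on large proprietary models in scientific knowledge discovery limits reproducibility and accessibility; the paper investigates whether carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications.

Result: Evaluated on several scholarly tasks (single- and multi-document scholarly QA, biomedical QA under domain shift, and scientific text compression), the results show that retrieval design can partially compensate for smaller models, while model capacity remains important for complex reasoning.

Insight: The contribution is a retrieval-augmented framework with task-aware routing that integrates full text and metadata, highlighting retrieval and task-aware design as key factors for practical, reproducible scholarly assistants. Viewed objectively, optimizing retrieval rather than simply scaling the model offers an efficient option for resource-constrained settings.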

Abstract: Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.
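Task-aware routing, as described in this abstract, means inspecting the query first and only then choosing a retrieval strategy. The abstract does not say how the router works, so the sketch below uses simple keyword rules purely as a stand-in; the task labels follow the evaluation tasks named in the abstract, while the cue lists, the strategy descriptions, and the names `route` and `RETRIEVAL_STRATEGY` are hypothetical.

```python
# Hypothetical mapping from task label to a retrieval strategy description.
RETRIEVAL_STRATEGY = {
    "compression": "no retrieval, operate on the supplied text",
    "multi_document_qa": "retrieve passages from several papers plus metadata",
    "biomedical_qa": "retrieve from a biomedical full-text index",
    "single_document_qa": "retrieve passages from the one referenced paper",
}


def route(query):
    """Assign a task label to a query with keyword rules (a stand-in for the real router)."""
    q = query.lower()
    if any(cue in q for cue in ("summarize", "compress", "shorten")):
        return "compression"
    if any(cue in q for cue in ("compare", "across papers", "survey")):
        return "multi_document_qa"
    if any(cue in q for cue in ("gene", "protein", "clinical", "disease")):
        return "biomedical_qa"
    # Default when no cue matches.
    return "single_document_qa"


def plan(query):
    """Return the task label and the retrieval strategy chosen for it."""
    task = route(query)
    return task, RETRIEVAL_STRATEGY[task]
```

A production router would more plausibly be a small classifier or an LLM prompt; the point of the sketch is only that the retrieval path is decided per query rather than fixed for the whole system.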

Table of Contents

cs.CL [Back]

[1] Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming cs.CL

[2] M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency cs.CL

[3] Procedural Knowledge at Scale Improves Reasoning cs.CL

[4] Open-Domain Safety Policy Construction cs.CL

[5] Adaptive Stopping for Multi-Turn LLM Reasoning cs.CL | cs.AI

[6] Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once cs.CL | cs.AI | cs.CY

[7] Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging cs.CL | cs.AI

[8] DeltaMem: Towards Agentic Memory Management via Reinforcement Learning cs.CL

[9] Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations cs.CL

[10] What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis cs.CL

[11] PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation cs.CL

[12] On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning cs.CL

[13] Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework cs.CL | cs.DB

[14] Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition cs.CL

[15] LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches cs.CL | cs.AI | cs.LG

[16] DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment cs.CL

[17] From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion cs.CL

[18] SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations cs.CL

[19] Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients cs.CL

[20] ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues cs.CL | cs.AI

[21] SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning cs.CL | cs.AI

[22] Why Gaussian Diffusion Models Fail on Discrete Data? cs.CL

[23] BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs cs.CL | cs.AI

[24] Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning cs.CL | cs.AI | cs.IR

[25] Reliable Control-Point Selection for Steering Reasoning in Large Language Models cs.CL

[26] Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents cs.CL

[27] Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions cs.CL | cs.LG

[28] Adam’s Law: Textual Frequency Law on Large Language Models cs.CL

[29] CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech cs.CL

[30] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation cs.CL | cs.AI | cs.LG

cs.CV

[31] DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation cs.CV | cs.SE

[32] Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models cs.CV | cs.AI | cs.CL

[33] Sparse Spectral LoRA: Routed Experts for Medical VLMs cs.CV

[34] ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos cs.CV

[35] Regularizing Attention Scores with Bootstrapping cs.CV | cs.AI | cs.LG | stat.ME | stat.ML

[36] IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation cs.CV

[37] AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction cs.CV | cs.AI | cs.RO | eess.IV

[38] GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization cs.CV | cs.AI

[39] LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding cs.CV

[40] EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation cs.CV

[41] Nonlinear Methods for Analyzing Pose in Behavioral Research cs.CV

[42] Reinforcing Consistency in Video MLLMs with Structured Rewards cs.CV

[43] Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation cs.CV | cs.LG

[44] Universal computational thermal imaging overcoming the ghosting effect cs.CV | physics.optics

[45] ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction cs.CV | cs.AI

[46] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification cs.CV | cs.MM

[47] Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning cs.CV | cs.AI

[48] Satellite-Free Training for Drone-View Geo-Localization cs.CV

[49] SHOE: Semantic HOI Open-Vocabulary Evaluation Metric cs.CV | cs.AI

[50] Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models cs.CV | cs.AI

[51] Automatic Image-Level Morphological Trait Annotation for Organismal Images cs.CV | cs.AI

[52] LivingWorld: Interactive 4D World Generation with Environmental Dynamics cs.CV

[53] MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label cs.CV

[54] Moiré Video Authentication: A Physical Signature Against AI Video Generation cs.CV | cs.AI | cs.MM

[55] DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data cs.CV

[56] HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation cs.CV

[57] GPA: Learning GUI Process Automation from Demonstrations cs.CV | cs.AI | cs.SE

[58] Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding cs.CV

[59] BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography cs.CV

[60] From Understanding to Erasing: Towards Complete and Stable Video Object Removal cs.CV

[61] Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation cs.CV | cs.MM

[62] Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation cs.CV

[63] Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding cs.CV

[64] Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion cs.CV

[65] Cosine-Normalized Attention for Hyperspectral Image Classification cs.CV

[66] Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning cs.CV

[67] DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning cs.CV | cs.AI | cs.RO

[68] FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation cs.CV | cs.AI

[69] STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering cs.CV

[70] Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks cs.CV | cs.CL | cs.LG

[71] Semantic Richness or Geometric Reasoning? The Fragility of VLM’s Visual Invariance cs.CV

[72] Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation cs.CV

[73] HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models cs.CV | cs.CL

[74] A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes cs.CV

[75] ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery cs.CV

[76] FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation cs.CV

[77] Lifting Unlabeled Internet-level Data for 3D Scene Understanding cs.CV | cs.AI