cs.CL [Total: 21]
cs.CV [Total: 98]
cs.AI [Total: 1]
q-bio.NC [Total: 1]
cs.RO [Total: 2]
cs.LG [Total: 4]

cs.CL

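[1] Evaluating Prompting Strategies for Chart Question Answering with Large Language Models cs.CL | cs.AI | cs.LG

Ruthuparna Naikar, Ying Zhu

TL;DR: This paper systematically evaluates four prompting strategies (Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought, and Few-Shot Chain-of-Thought) for chart question answering on the ChartQA dataset with GPT-3.5, GPT-4, and GPT-4o. Few-Shot Chain-of-Thought prompting yields the highest accuracy (up to 78.2%), particularly on reasoning-intensive questions, while Few-Shot prompting improves format adherence.

Details

Motivation: The role of prompting strategies in chart-based question answering remains underexplored. The paper uses controlled experiments to give actionable guidance on choosing a prompting strategy for structured data reasoning tasks.

Result: On 1,200 ChartQA samples, Few-Shot Chain-of-Thought prompting consistently achieves the highest accuracy (up to 78.2%), particularly on reasoning-intensive questions; Few-Shot prompting improves adherence to the expected answer format.

Insight: The contribution is a framework that operates only on structured chart data, isolating prompt structure as the sole experimental variable, and shows that combining chain-of-thought with few-shot examples helps most on complex reasoning. In practice, the choice of strategy should weigh task complexity (Zero-Shot with a high-capacity model is adequate for simpler tasks) against accuracy requirements.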

Abstract: Prompting strategies affect LLM reasoning performance, but their role in chart-based QA remains underexplored. We present a systematic evaluation of four widely used prompting paradigms (Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought, and Few-Shot Chain-of-Thought) across GPT-3.5, GPT-4, and GPT-4o on the ChartQA dataset. Our framework operates exclusively on structured chart data, isolating prompt structure as the only experimental variable, and evaluates performance using two metrics: Accuracy and Exact Match. Results from 1,200 diverse ChartQA samples show that Few-Shot Chain-of-Thought prompting consistently yields the highest accuracy (up to 78.2%), particularly on reasoning-intensive questions, while Few-Shot prompting improves format adherence. Zero-Shot performs well only with high-capacity models on simpler tasks. These findings provide actionable guidance for selecting prompting strategies in structured data reasoning tasks, with implications for both efficiency and accuracy in real-world applications.
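
To make the four prompting paradigms concrete, here is a minimal sketch of how such prompts might be assembled from structured chart data. The template wording, the `build_prompt` and `exact_match` helpers, and the "Let's think step by step" cue are illustrative assumptions, not the prompts or metric definitions used in the paper.

```python
from typing import Sequence, Tuple

STRATEGIES = ("zero_shot", "few_shot", "zero_shot_cot", "few_shot_cot")
COT_CUE = "Let's think step by step."  # common CoT cue; assumed, not taken from the paper

# Each example is (table, question, rationale, answer); all hypothetical.
Example = Tuple[str, str, str, str]

def build_prompt(table: str, question: str, strategy: str,
                 examples: Sequence[Example] = ()) -> str:
    """Assemble one prompt; only the prompt structure varies with `strategy`."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    use_cot = strategy.endswith("cot")
    parts = []
    if strategy.startswith("few_shot"):
        for ex_table, ex_question, ex_rationale, ex_answer in examples:
            block = f"Table:\n{ex_table}\nQuestion: {ex_question}\n"
            if use_cot:
                # Few-shot CoT demonstrations include the worked rationale.
                block += f"Reasoning: {ex_rationale}\n"
            block += f"Answer: {ex_answer}\n"
            parts.append(block)
    query = f"Table:\n{table}\nQuestion: {question}\n"
    query += f"Reasoning: {COT_CUE}" if use_cot else "Answer:"
    parts.append(query)
    return "\n".join(parts)

def exact_match(prediction: str, gold: str) -> bool:
    """A simple case- and whitespace-insensitive exact match (an assumed definition)."""
    return prediction.strip().lower() == gold.strip().lower()
```

Because the chart table and question stay fixed and only `strategy` changes, a harness built this way keeps prompt structure as the single experimental variable, which is the design the abstract describes.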

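[2] MERIT: Memory-Enhanced Retrieval for Interpretable Knowledge Tracing cs.CL | cs.AI

Runze Li, Kedi Chen, Guwei Feng, Mo Yu, Jun Wang

TL;DR: The paper proposes MERIT, a training-free, memory-enhanced retrieval framework for interpretable knowledge tracing. It combines a frozen large language model with structured pedagogical memory: raw interaction logs are transformed into an interpretable memory bank, semantic denoising categorizes students into latent cognitive schemas, and a paradigm bank is built by analyzing representative error patterns offline to generate explicit Chain-of-Thought rationales. At inference time, a hierarchical routing mechanism retrieves relevant contexts and a logic-augmented module applies semantic constraints to calibrate predictions, achieving state-of-the-art performance on real-world datasets.

Details

Motivation: Traditional deep learning knowledge tracing models achieve high accuracy but often lack interpretability. LLMs reason well but are limited by context windows and hallucinations, and existing LLM-based methods typically require expensive fine-tuning, which limits scalability and adaptability to new data.

Result: MERIT achieves state-of-the-art performance on real-world datasets without gradient updates, reducing computational cost and supporting dynamic knowledge updates.

Insight: The novelty is a training-free, memory-enhanced retrieval framework that uses structured pedagogical memory and semantic constraints to improve the interpretability and accuracy of an LLM, avoiding fine-tuning and making educational diagnosis more accessible and transparent.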

Abstract: Knowledge Tracing (KT) models students’ evolving knowledge states to predict future performance, serving as a foundation for personalized education. While traditional deep learning models achieve high accuracy, they often lack interpretability. Large Language Models (LLMs) offer strong reasoning capabilities but struggle with limited context windows and hallucinations. Furthermore, existing LLM-based methods typically require expensive fine-tuning, limiting scalability and adaptability to new data. We propose MERIT (Memory-Enhanced Retrieval for Interpretable Knowledge Tracing), a training-free framework combining frozen LLM reasoning with structured pedagogical memory. Rather than updating parameters, MERIT transforms raw interaction logs into an interpretable memory bank. The framework uses semantic denoising to categorize students into latent cognitive schemas and constructs a paradigm bank where representative error patterns are analyzed offline to generate explicit Chain-of-Thought (CoT) rationales. During inference, a hierarchical routing mechanism retrieves relevant contexts, while a logic-augmented module applies semantic constraints to calibrate predictions. By grounding the LLM in interpretable memory, MERIT achieves state-of-the-art performance on real-world datasets without gradient updates. This approach reduces computational costs and supports dynamic knowledge updates, improving the accessibility and transparency of educational diagnosis.

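[3] TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs cs.CL | cs.AI | cs.LG

Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li

TL;DR: This paper proposes TIPS (Turn-Level Information-Potential Reward Shaping), a framework that addresses sparse rewards and difficult credit assignment across turns when training search-augmented LLMs with reinforcement learning for open-domain question answering. Using a teacher model, it assigns a dense, information-potential-based reward to each reasoning + tool-call segment, providing fine-grained guidance. Across seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability.

Details

Motivation: Search-augmented LLMs trained with reinforcement learning achieve strong results on open-domain QA, but optimization is often unstable because rewards are sparse and credit is hard to assign across reasoning steps and tool calls.

Result: On seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For example, with Qwen-2.5 7B Instruct, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO.

Insight: The claimed novelty is a turn-level reward shaping framework based on information potential, which gives each reasoning turn dense, policy-invariant guidance and addresses sparse-reward credit assignment in multi-turn LLM reasoning. Viewed objectively, the core idea is to use the increase in the teacher model's likelihood of the correct answer as a potential-based reward assigned per turn, which decomposes the sparse outcome-level supervision signal across intermediate steps. This is a general and effective approach to reward design.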

Abstract: Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging the potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
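
The turn-level reward described above can be illustrated with the standard potential-based shaping form, where each turn is rewarded by the change in a potential function. The sketch below assumes the potential is the teacher model's log-likelihood of the correct answer given the context up to a turn; the function name, the example values, and the choice of a discount of 1.0 are illustrative assumptions, not details confirmed by the abstract.

```python
from typing import List

def turn_level_shaping(potentials: List[float], gamma: float = 1.0) -> List[float]:
    """Potential-based shaping rewards for a multi-turn trajectory.

    potentials[t] is the potential of the state before turn t, with one extra
    entry for the final state. Here it stands for a hypothetical teacher
    log-likelihood of the correct answer given the context so far.
    Turn t receives gamma * potentials[t + 1] - potentials[t].
    """
    if len(potentials) < 2:
        raise ValueError("need at least an initial and a final potential")
    return [gamma * potentials[t + 1] - potentials[t]
            for t in range(len(potentials) - 1)]

# Illustrative values only: the second turn retrieves an unhelpful document,
# so the correct answer becomes less likely and that turn is penalized.
example_potentials = [-3.0, -2.0, -2.5, -0.5]
example_rewards = turn_level_shaping(example_potentials)
```

With `gamma = 1.0` the shaped rewards telescope: their sum equals the final potential minus the initial one, so the dense signal redistributes credit across turns without changing which trajectories are preferred overall. This is the property that makes potential-based shaping policy-invariant.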

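[4] Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs cs.CL | cs.AI | cs.LG

Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang

TL;DR: Through a systematic empirical study, this paper analyzes how token-level distributions change when LLMs are fine-tuned using reinforcement learning with verifiable rewards (RLVR). RL fine-tuning induces only sparse, targeted distributional changes, and a small set of token-level decisions is directly responsible for the performance gains.

Details

Motivation: RLVR has significantly improved LLM reasoning, but the token-level mechanisms behind these improvements remain unclear. The paper aims to characterize the fine-grained distributional shifts induced by RLVR fine-tuning and their effect on performance.

Result: Inserting only a small fraction of RL-sampled tokens into base-model generations progressively recovers the RL model's performance gains; conversely, injecting a similarly small number of base-model token choices into otherwise RL-generated sequences collapses performance to base levels. In addition, divergence-weighted variants of the advantage signal, explored as a diagnostic intervention, can yield improvements over baselines.

Insight: The novelty is a systematic token-level analysis of the distributional shifts induced by RLVR fine-tuning, showing that the performance gains depend mainly on sparse, critical token decisions. This offers a fine-grained view of RL fine-tuning as a targeted refinement process, and the paper also explores a diagnostic intervention based on the advantage signal.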

Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR’s distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR’s performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.
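
One way to picture the token-level characterization is to compare the base and RL next-token distributions position by position and flag the positions where they diverge. The sketch below uses KL divergence and a fixed threshold; both are assumptions made for illustration, since the abstract does not name the divergence measure or any cutoff.

```python
import math
from typing import List, Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) over one next-token distribution; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def shifted_positions(rl_dists: Sequence[Sequence[float]],
                      base_dists: Sequence[Sequence[float]],
                      threshold: float = 0.1) -> List[int]:
    """Indices of token positions whose RL distribution diverges from the base
    distribution by more than `threshold` (a hypothetical cutoff)."""
    return [t for t, (p, q) in enumerate(zip(rl_dists, base_dists))
            if kl_divergence(p, q) > threshold]

def sparsity(rl_dists, base_dists, threshold: float = 0.1) -> float:
    """Fraction of positions flagged as shifted."""
    return len(shifted_positions(rl_dists, base_dists, threshold)) / len(rl_dists)
```

Under the paper's finding, a measure of this kind would flag only a small fraction of positions, and those are the positions whose token choices the cross-sampling experiments swap between models.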

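[5] Towards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception cs.CL | cs.SI

Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru

TL;DR: This paper proposes ACCNote, an automated Community Notes generation method built on large vision-language models for combating image-based contextual deception. It uses a retrieval-augmented, multi-agent collaboration framework to generate concise, grounded corrective notes, and introduces a new evaluation metric, the Context Helpfulness Score (CHS). On the XCheck dataset, it outperforms baseline methods and GPT5-mini.

Details

Motivation: Image-based contextual deception is a problem on online social media. Existing Community Notes depend on human contributors, which limits timeliness and scalability, and prior work focuses mostly on binary deception detection rather than generating corrective notes.

Result: On the authors' curated real-world XCheck dataset, ACCNote improves both deception detection and note generation performance over baselines, and exceeds the commercial tool GPT5-mini.

Insight: The contributions are the real-world XCheck dataset that supports this line of research, the retrieval-augmented multi-agent collaboration framework ACCNote, and the CHS metric, which aligns with user study outcomes rather than relying on lexical overlap. Together they move automated generation of context-corrective notes closer to practical use.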

Abstract: Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, its reliance on human contributors limits both the timeliness and scalability. In this work, we study the automated Community Notes generation method for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), Community Notes-style systems need to generate concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to three reasons: (i) datasets that support the research are scarce; (ii) methods must handle the dynamic nature of contextual deception; (iii) evaluation is difficult because standard metrics do not capture whether notes actually improve user understanding. To address these gaps, we curate a real-world dataset, XCheck, comprising X posts with associated Community Notes and external contexts. We further propose the Automated Context-Corrective Note generation method, named ACCNote, which is a retrieval-augmented, multi-agent collaboration framework built on large vision-language models. Finally, we introduce a new evaluation metric, Context Helpfulness Score (CHS), that aligns with user study outcomes rather than relying on lexical overlap. Experiments on our XCheck dataset show that the proposed ACCNote improves both deception detection and note generation performance over baselines, and exceeds a commercial tool GPT5-mini. Together, our dataset, method, and metric advance practical automated generation of context-corrective notes toward more responsible online social networks.

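[6] CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context cs.CL

Giovana Kerche Bonás, Roseval Malaquias Junior, Marcos Piau, Thiago Laitz, Thales Sales Almeida

TL;DR: This paper introduces CAPITU, a benchmark for evaluating the instruction-following ability of large language models in Brazilian Portuguese. All tasks are set in the context of eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally grounded content. The benchmark contains 59 instruction types in seven categories, all automatically verifiable without LLM judges or human evaluation. The authors evaluate 18 state-of-the-art models: frontier reasoning models perform strongly, while Portuguese-specialized models are competitive on cost-efficiency. Multi-turn evaluation reveals significant variation in constraint persistence.

Details

Motivation: Existing benchmarks focus on English or use generic prompts, and there is no evaluation tool for Brazilian Portuguese that incorporates cultural context. CAPITU is built to fill this gap.

Result: The authors evaluate 18 state-of-the-art models in single-turn and multi-turn settings. Frontier reasoning models perform strongly (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13, versus Claude-Haiku-4.5: 73.5% at $1.12). In multi-turn evaluation, conversation-level accuracy ranges from 60% to 96% across models, showing significant variation in constraint persistence.

Insight: The novelty lies in tightly combining instruction-following evaluation with cultural context (Brazilian literature) and in designing automatically verifiable instruction types that include Portuguese-specific linguistic constraints (such as word termination patterns) and structural requirements. Viewed objectively, the benchmark provides a systematic, reproducible framework for evaluating LLMs in Brazilian Portuguese and highlights constraint persistence across turns as a key challenge in multi-turn interaction.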

Abstract: We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13 vs Claude-Haiku-4.5: 73.5% at $1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.
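
A constraint that is "automatically verifiable" can be checked by a small deterministic function rather than an LLM judge. The sketch below checks one kind of word-termination constraint, for example "use at least two words ending in -mente". The function names and the exact constraint wording are hypothetical; the benchmark's real checkers may differ.

```python
import re
from typing import List

def words_with_suffix(text: str, suffix: str) -> List[str]:
    """Return the words in `text` that end with `suffix` (case-insensitive).

    `\\w+` matches Unicode letters in Python 3, so accented Portuguese
    words such as "coração" are tokenized as single words.
    """
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w.endswith(suffix.lower())]

def satisfies_min_suffix_count(text: str, suffix: str, minimum: int) -> bool:
    """Hypothetical checker: does the response contain at least `minimum`
    words ending in `suffix`?"""
    return len(words_with_suffix(text, suffix)) >= minimum

response = "Ela falou calmamente e lentamente sobre o livro."
```

A checker of this form returns the same verdict on every run, which is what lets a benchmark report strict accuracy without human evaluation.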

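[7] Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models? cs.CL | cs.AI

Richard J. Young

TL;DR: This study evaluates the faithfulness of chain-of-thought (CoT) reasoning in 12 open-weight reasoning models, that is, whether a model's CoT accurately acknowledges the factors that influence its output. It injects six categories of reasoning hints (such as sycophancy and consistency) into 498 multiple-choice questions from MMLU and GPQA Diamond and measures how often models acknowledge the hint's influence when the hint successfully changes the answer. Overall faithfulness rates range from 39.7% to 89.9%, and training methodology and model family predict faithfulness more strongly than parameter count. Keyword-based analysis suggests that models may internally recognize hint influence but systematically suppress that acknowledgment in their outputs.

Details

Motivation: CoT reasoning has been proposed as a transparency mechanism for LLMs in safety-critical deployments, but its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs). Prior evaluations covered only two proprietary models and found acknowledgment rates as low as 25% and 39%. This study extends the evaluation to the open-weight ecosystem to test whether CoT is viable as a safety monitoring mechanism.

Result: Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale), with consistency hints (35.5%) and sycophancy hints (53.9%) showing the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis shows a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%).

Insight: The novelty is a systematic evaluation of CoT faithfulness in open-weight reasoning models, which reveals that models may internally recognize hint influence yet systematically suppress acknowledging it in their outputs. Viewed objectively, the results show that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue, which raises serious questions about using CoT as a safety mechanism.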

Abstract: Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.
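
The faithfulness rate described above is a conditional rate: it is computed only over runs in which the hint actually changed the answer. A minimal sketch follows, assuming that "successfully alters" means the model did not pick the hinted option without the hint but did pick it with the hint. That operational definition and the `Run` record are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Run:
    baseline_answer: str   # answer without the hint
    hinted_answer: str     # answer with the hint injected
    hint_target: str       # option the hint points to
    acknowledged: bool     # whether the CoT mentions the hint's influence

def faithfulness_rate(runs: List[Run]) -> Optional[float]:
    """Share of hint-influenced runs whose CoT acknowledges the hint.

    Returns None when no run was influenced, since the rate is undefined.
    """
    influenced = [r for r in runs
                  if r.baseline_answer != r.hint_target
                  and r.hinted_answer == r.hint_target]
    if not influenced:
        return None
    return sum(r.acknowledged for r in influenced) / len(influenced)
```

Runs where the model already chose the hinted option, or ignored the hint, are excluded from the denominator. This is why a model can follow hints rarely and still score low on faithfulness.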

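[8] How Utilitarian Are OpenAI’s Models Really? Replicating and Reinterpreting Pfeffer, Krügel, and Uhl (2025) cs.CL | cs.CY

Johannes Himmelreich

TL;DR: This paper replicates and reinterprets the study by Pfeffer et al. (2025) on the utilitarian tendencies of OpenAI models in the trolley problem and the footbridge dilemma, and finds the original conclusion misleading. Testing four current OpenAI models with prompt variants, the author shows that GPT-4o's low utilitarian response rate does not reflect a deontological commitment but safety refusals triggered by the prompt's advisory framing; when the prompt is reframed as "Is it morally permissible…?", GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers once prompt confounds are removed. In the footbridge dilemma, reasoning models are more utilitarian than non-reasoning models, but they often refuse to answer or give a non-utilitarian answer. The study argues that single-prompt evaluations of LLM moral reasoning are unreliable and that multi-prompt robustness testing should be standard practice.

Details

Motivation: The goal is to test whether the conclusion of Pfeffer et al. (2025), that OpenAI's reasoning model o1-mini is more utilitarian in moral dilemmas than the non-reasoning model GPT-4o, is reliable, and to expose the bias that single-prompt evaluation can introduce, in support of more robust empirical methods for studying LLM behavior.

Result: In the trolley problem, changing the framing from "Should I…?" to "Is it morally permissible…?" raises GPT-4o's utilitarian response rate from a low level to 99%, and all models give utilitarian answers once prompt confounds are removed. In the footbridge dilemma, reasoning models tend to be more utilitarian than non-reasoning models across prompt variations, but they often refuse to answer or give a non-utilitarian answer; the original finding survives, with blemishes.

Insight: The novelty is the use of systematic prompt-variant testing to show that evaluations of LLM moral reasoning are highly sensitive to prompt framing and that single-prompt conclusions are easily misleading. Viewed objectively, the study makes the case for multi-prompt robustness testing as standard practice in empirical research on LLM behavior and offers a methodological reference for evaluating moral alignment.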

Abstract: Pfeffer, Krügel, and Uhl (2025) report that OpenAI’s reasoning model o1-mini produces more utilitarian responses to the trolley problem and footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o’s low utilitarian rate doesn’t reflect a deontological commitment but safety refusals triggered by the prompt’s advisory framing. When framed as “Is it morally permissible…?” instead of “Should I…?”, GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. But often they refuse to answer the dilemma or, when they answer, give a non-utilitarian rather than a utilitarian answer. These results demonstrate that single-prompt evaluations of LLM moral reasoning are unreliable: multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior.

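[9] Explanation Generation for Contradiction Reconciliation with LLMs cs.CL

Jason Chan, Zhixue Zhao, Robert Gaizauskas

TL;DR: This paper introduces a new task, reconciliatory explanation generation, in which large language models (LLMs) must generate explanations that render seemingly contradictory statements compatible. The authors build an evaluation benchmark by repurposing existing natural language inference (NLI) datasets and introduce metrics for scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success on the task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases.

Details

Motivation: Existing NLP work usually treats contradictions as errors to be resolved by choosing which statements to accept or discard, whereas a key part of human reasoning in social interactions and professional domains is hypothesizing explanations that reconcile contradictions. Despite the growing reasoning capabilities of LLMs, their ability to generate such reconciliatory explanations remains largely unexplored; this paper aims to fill that gap.

Result: Experiments with 18 LLMs show that most models achieve limited success on this task. The benefit of extending test-time compute by "thinking" plateaus as model size increases.

Insight: The novelty is the new task of reconciliatory explanation generation, together with a method for automatic evaluation that repurposes NLI datasets. Viewed objectively, the study exposes a weakness of LLMs in hypothesizing explanations that reconcile contradictions, and points to a direction for improving LLM reasoning in downstream applications such as chatbots and scientific aids.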

Abstract: Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, “Cassie hates coffee” and “She buys coffee everyday” may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by “thinking” plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs’ downstream applications such as chatbots and scientific aids.

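[10] PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation cs.CL

Ruidi Chang, Jiawei Zhou, Hanjie Chen

TL;DR: The paper proposes PRISM, a framework for jointly analyzing two levels of large language model (LLM) reasoning: the semantic flow across reasoning steps (token sequences) and the latent computation within a single step (hidden-state vectors), providing a unified view of how reasoning evolves.

Details

Motivation: Existing methods usually analyze complex LLM reasoning traces from a single perspective, either the generated text sequence or the hidden states across model layers. A unified framework for analyzing both levels together, and thus for understanding the reasoning process in depth, has been missing.

Result: Across multiple reasoning models and benchmarks, PRISM uncovers systematic patterns in the reasoning process: failed trajectories are more likely to become trapped in unproductive verification loops, and they diverge into distinct modes such as overthinking and premature commitment.

Insight: The novelty is a unified framework that jointly analyzes semantic flow and latent computation, modeling reasoning trajectories as structured processes so that their behavior can be observed and analyzed rather than judged only by final-task accuracy. This gives a practical tool for analyzing and diagnosing LLM reasoning.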

Abstract: Large language models (LLMs) solve complex problems by generating multi-step reasoning traces. Yet these traces are typically analyzed from only one of two perspectives: the sequence of tokens across different reasoning steps in the generated text, or the hidden-state vectors across model layers within one step. We introduce PRISM (Probabilistic Reasoning Inspection through Semantic and Implicit Modeling), a framework and diagnostic tool for jointly analyzing both levels, providing a unified view of how reasoning evolves across steps and layers. Across multiple reasoning models and benchmarks, PRISM uncovers systematic patterns in the reasoning process, showing that failed trajectories are more likely to become trapped in unproductive verification loops and further diverge into distinct modes such as overthinking and premature commitment, which behave differently once a candidate answer is reached. It further reveals how prompting reshapes reasoning behavior beyond aggregate accuracy by altering both semantic transitions and internal computational patterns. By modeling reasoning trajectories as structured processes, PRISM makes these behaviors observable and analyzable rather than relying solely on final-task accuracy. Taken together, these insights position PRISM as a practical tool for analyzing and diagnosing reasoning processes in LLMs.

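[11] When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning cs.CL | cs.AI | cs.LG

Abhinaba Basu, Pavan Chakraborty

TL;DR: This paper proposes a new method, "step-level evaluation", for testing whether the step-by-step reasoning of frontier large language models (LLMs) genuinely affects their final answers or is merely a decorative narrative. Testing 10 frontier models on several tasks, the study finds that most models' reasoning steps are decorative: removing them has little effect on the answer. It also finds that faithfulness is model-specific and task-specific, and it identifies a phenomenon the authors call "output rigidity".

Details

Motivation: The paper addresses a central question: when an LLM shows its step-by-step reasoning, are those steps actually used to derive the answer, or are they a decorative narrative generated after the model has already decided? The aim is to assess the faithfulness of model reasoning, meaning the causal link between reasoning steps and the final answer.

Result: Ten frontier models (including GPT-5.4, Claude Opus, and DeepSeek-V3.2) are tested on four tasks: sentiment, mathematics, topic classification, and medical QA. Most models produce decorative reasoning: removing any single reasoning step changes the answer less than 17% of the time. By contrast, smaller models (0.8-8B) show genuine step dependence on math (55% necessity). Only MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) partly break the pattern, and both take shortcuts on other tasks.

Insight: The claimed novelty is a simple, low-cost step-level evaluation method (API access only, approximately $1-2 per model per task) for quantifying the faithfulness of model reasoning. Viewed objectively, the core insight is that the step-by-step reasoning of current frontier LLMs is largely unfaithful, and that this property depends on both model and task. The study also indicates that training objectives, not model scale, determine whether reasoning is genuine, which has implications for future model design and evaluation.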

Abstract: Language models increasingly “show their work” by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes “The patient’s eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B.” If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no - the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access – no model weights – and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) - but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover “output rigidity”: on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine.
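
The step-level test in the abstract can be sketched as a leave-one-out loop over reasoning sentences. In the sketch below, `answer_fn` is a hypothetical stand-in for an API call that returns the model's answer given a question and a (possibly edited) reasoning trace, and the way per-step outcomes are averaged into "necessity" and "sufficiency" scores is an assumption, not the paper's exact definition.

```python
from typing import Callable, Dict, List

# Hypothetical signature: (question, reasoning_steps) -> answer string.
AnswerFn = Callable[[str, List[str]], str]

def step_level_scores(question: str, steps: List[str],
                      answer_fn: AnswerFn) -> Dict[str, float]:
    """Leave-one-out check of whether reasoning steps affect the answer."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    baseline = answer_fn(question, steps)
    # Necessity: remove one step at a time and see if the answer changes.
    changed = [answer_fn(question, steps[:i] + steps[i + 1:]) != baseline
               for i in range(len(steps))]
    # Sufficiency: keep a single step and see if it alone recovers the answer.
    recovered = [answer_fn(question, [step]) == baseline for step in steps]
    n = len(steps)
    return {"necessity": sum(changed) / n, "sufficiency": sum(recovered) / n}
```

Under this reading, a low necessity score combined with a high sufficiency score is the signature of decorative reasoning that the abstract reports for most frontier models.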

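[12] Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts cs.CL

Maida Aizaz, Quang Minh Nguyen

TL;DR: This paper analyses personas generated for Palestinian and Israeli identities by five popular large language models across 640 experimental conditions. The generated attributes differ markedly in distribution between war and non-war contexts. When instructed to avoid harmful assumptions, the models change their output distributions in diverse ways, but socioeconomic distinctions often persist.

Details

Motivation: As LLMs are increasingly used for social simulation and persona generation, it is necessary to understand how they represent geopolitical identities. The study analyses how models generate such personas and how they interpret fairness.

Result: Across 640 experimental conditions, Palestinian personas in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli personas predominantly retain middle-class status and specialised professional attributes. When prompted to avoid harmful assumptions, models show distributional changes such as marked increases in non-binary gender inferences or convergence toward generic occupational roles, while the underlying socioeconomic distinctions often remain.

Insight: The novelty is a large-scale experiment that reveals systematic bias in personas generated for geopolitical contexts and the diverse, inconsistent ways models respond to fairness instructions. Viewed objectively, model rationales mention fairness-related concepts, yet the final personas do not consistently translate those concepts into representative outcomes. This is a useful finding for evaluating and improving fairness and social representation in LLMs.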

Abstract: Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles. We observe significant distributional patterns in the generated attributes: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli profiles predominantly retain middle-class status and specialised professional attributes. When prompted with explicit instructions to avoid harmful assumptions, models exhibit diverse distributional changes, e.g., marked increases in non-binary gender inferences or a convergence toward generic occupational roles (e.g., “student”), while the underlying socioeconomic distinctions often remain. Furthermore, analysis of reasoning traces reveals an interesting dynamics between model reasoning and generation: while rationales consistently mention fairness-related concepts, the final generated personas follow the aforementioned diverse distributional changes. These findings illustrate a picture of how models interpret geopolitical contexts, while suggesting that they process fairness and adjust in varied ways; there is no consistent, direct translation of fairness concepts into representative outcomes.

Chaoqun Cui, Caiyan Jia

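TL;DR: This paper proposes the Pre-Trained Propagation Tree Transformer (P2T3), a method based on a pure Transformer architecture for rumor detection on social media. By extracting conversation chains from the propagation tree, using token-wise embedding to inject connection information, and pre-training on large-scale unlabeled datasets, it addresses the over-smoothing and long-range dependency problems that Graph Neural Networks (GNNs) face when processing rumor propagation structures.

Details

Motivation: GNN-based rumor detection methods suffer from over-smoothing when processing rumor propagation trees, a problem tied to structural characteristics such as the majority of nodes being 1-level nodes. This degrades performance, and GNNs also struggle to capture long-range dependencies.

Result: Experiments indicate that P2T3 surpasses previous state-of-the-art methods on multiple benchmark datasets and performs well under few-shot conditions.

Insight: The novelty is replacing GNNs with a pure Transformer architecture: conversation chains are extracted, token-wise embedding injects structural information and the necessary inductive bias, and the model is pre-trained. This avoids the over-smoothing issue of GNNs and potentially offers a large model or unified multi-modal scheme for future social media research.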

Abstract: Deep learning techniques for rumor detection typically utilize Graph Neural Networks (GNNs) to analyze post relations. These methods, however, falter due to over-smoothing issues when processing rumor propagation structures, leading to declining performance. Our investigation into this issue reveals that over-smoothing is intrinsically tied to the structural characteristics of rumor propagation trees, in which the majority of nodes are 1-level nodes. Furthermore, GNNs struggle to capture long-range dependencies within these trees. To circumvent these challenges, we propose a Pre-Trained Propagation Tree Transformer (P2T3) method based on pure Transformer architecture. It extracts all conversation chains from a tree structure following the propagation direction of replies, utilizes token-wise embedding to infuse connection information and introduces necessary inductive bias, and pre-trains on large-scale unlabeled datasets. Experiments indicate that P2T3 surpasses previous state-of-the-art methods in multiple benchmark datasets and performs well under few-shot conditions. P2T3 not only avoids the over-smoothing issue inherent in GNNs but also potentially offers a large model or unified multi-modal scheme for future social media research.
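
The chain extraction step can be pictured as enumerating the paths of a reply tree in the direction replies propagate. The sketch below reads "conversation chain" as a root-to-leaf path and represents the tree as a parent-to-children mapping; both choices, and all names, are assumptions made for illustration, since the abstract does not give the exact procedure.

```python
from typing import Dict, Hashable, List

def conversation_chains(children: Dict[Hashable, List[Hashable]],
                        root: Hashable) -> List[List[Hashable]]:
    """Enumerate root-to-leaf paths of a propagation tree.

    `children` maps a post id to the ids of its direct replies.
    An explicit stack avoids recursion limits on deep reply threads.
    """
    chains = []
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        replies = children.get(node, [])
        if not replies:
            chains.append(path)  # a leaf closes one conversation chain
        # Push in reverse so chains come out in the original reply order.
        for reply in reversed(replies):
            stack.append((reply, path + [reply]))
    return chains

# Source post "r" has two direct replies; "a" has one nested reply.
example_tree = {"r": ["a", "b"], "a": ["c"]}
```

Each chain is an ordinary sequence, so it can be fed to a Transformer directly. In a tree where most nodes are direct replies to the source post, most chains are only two posts long.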

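[14] Quality Over Clicks: Intrinsic Quality-Driven Iterative Reinforcement Learning for Cold-Start E-Commerce Query Suggestion cs.CL

Qi Sun, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

TL;DR: This paper proposes Cold-EQS, an iterative reinforcement learning framework for e-commerce query suggestion in cold-start scenarios. The framework uses answerability, factuality, and information gain as reward signals to optimize the quality of suggested queries, and it estimates the uncertainty of candidate queries to select hard samples for continued optimization. The paper also provides EQS-Benchmark, a dataset of 16,949 online user queries for offline training and evaluation.

Details

Motivation: Dialogue systems rely on query suggestion to increase user engagement, but current approaches that pair large language models with click-through rate models depend heavily on abundant online click data and perform poorly in cold-start scenarios where click signals are missing. This paper targets cold-start e-commerce query suggestion.

Result: Offline and online experiments show a strong positive correlation between offline and online effectiveness, and Cold-EQS achieves a significant +6.81% improvement in online chatUV.

Insight: The novelty is a reinforcement learning framework that does not depend on click data. It uses intrinsic quality measures (answerability, factuality, information gain) as rewards and uses uncertainty estimation to actively select hard samples for iterative optimization, which addresses the cold-start problem.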

Abstract: Existing dialogue systems rely on Query Suggestion (QS) to enhance user engagement. Recent efforts typically employ large language models with Click-Through Rate (CTR) model, yet fail in cold-start scenarios due to their heavy reliance on abundant online click data for effective CTR model training. To bridge this gap, we propose Cold-EQS, an iterative reinforcement learning framework for Cold-Start E-commerce Query Suggestion (EQS). Specifically, we leverage answerability, factuality, and information gain as reward to continuously optimize the quality of suggested queries. To continuously optimize our QS model, we estimate uncertainty for grouped candidate suggested queries to select hard and ambiguous samples from online user queries lacking click signals. In addition, we provide an EQS-Benchmark comprising 16,949 online user queries for offline training and evaluation. Extensive offline and online experiments consistently demonstrate a strong positive correlation between online and offline effectiveness. Both offline and online experimental results demonstrate the superiority of our Cold-EQS, achieving a significant +6.81% improvement in online chatUV.

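[15] DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube cs.CL | cs.AI | cs.LG

Jawid Ahmad Baktash, Mosa Ebrahimi, Mohammad Zarif Joya, Mursal Dawodi

TL;DR: This paper introduces DariMis, the first manually annotated dataset of Dari-language YouTube videos, containing 9,224 videos labeled by Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). The two dimensions are structurally coupled: misinformation tends to carry higher harm. The paper also proposes a pair-input encoding strategy that feeds the video title and description as separate BERT segment inputs to model the semantic relationship between headline claims and body content, improving recall on misinformation.

Details

Motivation: Dari, the primary language of Afghanistan, is spoken by tens of millions of people but has long been absent from misinformation detection research. The paper fills this gap by building the first Dari misinformation detection dataset and by examining the relationship between information type and harm level in support of content moderation.

Result: Pair-input encoding raises recall on Misinformation, the safety-critical minority class, by 7.0 percentage points (from 60.1% to 67.1%). The Dari/Farsi-specialized model ParsBERT achieves the best test performance, with accuracy of 76.60% and macro F1 of 72.77%, outperforming XLM-RoBERTa-base.

Insight: The contributions are: 1) the first Dari misinformation detection dataset with multi-dimensional annotation, which reveals the structural coupling between information type and harm level (55.9% of Misinformation carries at least Medium harm potential); 2) a pair-input encoding strategy that explicitly models the semantic relationship between video title and description and improves Misinformation recall, which is of practical value for safety-critical content moderation.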

Abstract: Dari, the primary language of Afghanistan, is spoken by tens of millions of people yet remains largely absent from the misinformation detection literature. We address this gap with DariMis, the first manually annotated dataset of 9,224 Dari-language YouTube videos, labeled across two dimensions: Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). A central empirical finding is that these dimensions are structurally coupled, not independent: 55.9 percent of Misinformation carries at least Medium harm potential, compared with only 1.0 percent of True content. This enables Information Type classifiers to function as implicit harm-triage filters in content moderation pipelines. We further propose a pair-input encoding strategy that represents the video title and description as separate BERT segment inputs, explicitly modeling the semantic relationship between headline claims and body content, a key signal of misleading information. An ablation study against single-field concatenation shows that pair-input encoding yields a 7.0 percentage point gain in Misinformation recall (60.1 percent to 67.1 percent), the safety-critical minority class, despite modest overall macro F1 differences (0.09 percentage points). We benchmark a Dari/Farsi-specialized model (ParsBERT) against XLM-RoBERTa-base; ParsBERT achieves the best test performance with accuracy of 76.60 percent and macro F1 of 72.77 percent. Bootstrap 95 percent confidence intervals are reported for all metrics, and we discuss both the practical significance and statistical limitations of the results.

[16] Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation cs.CL | cs.CY

Nils A. Herrmann, Tobias Eder, Jingyi He, Georg Groh

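TL;DR: This paper proposes a fine-grained annotation scheme for multimodal content moderation that separates two dimensions, incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities), and applies it to 2,030 memes from the Hateful Memes dataset. The study evaluates several vision-language models under coarse-label training, transfer learning across label schemes, and joint learning that combines the coarse label with the fine-grained annotations. The results show that fine-grained annotations complement the existing coarse labels; used jointly, they improve overall model performance and yield a more balanced error distribution with less under-detection of harmful content.

Details

Motivation: Current multimodal toxicity benchmarks typically use a single binary hatefulness label, a coarse approach that conflates two fundamentally different characteristics of expression: tone and content. Making moderation systems more reliable and accurate requires a finer-grained annotation scheme.

Result: On the Hateful Memes dataset, combining the coarse hatefulness label with the fine-grained annotations in joint learning improves overall model performance. Models trained with the fine-grained scheme are also markedly less prone to under-detecting harmful content (FNR-FPR falls from 0.74 to 0.42 for LLaVA-1.6-Mistral-7B and from 0.54 to 0.28 for Qwen2.5-VL-7B), giving them a more balanced moderation-relevant error profile.

Insight: The novelty lies in drawing on communication science theory to propose a fine-grained multimodal annotation scheme that separates incivility (tone) from intolerance (content). Viewed objectively, this data-centric approach offers a practical route to more reliable and accurate multimodal moderation systems through better data quality: coarse and fine-grained labels complement each other when used jointly and improve model performance.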

Abstract: Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that distinguishes two separable dimensions: incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities) and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone (FNR-FPR, the difference between false negative and false positive rates: 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B). This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining both coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.
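
The FNR-FPR figure quoted above is the false negative rate minus the false positive rate, so a large positive value means a model misses harmful content far more often than it over-flags benign content. A minimal sketch of the computation follows; the function name and the toy labels are assumptions for illustration and are unrelated to the paper's data.

```python
def fnr_fpr_gap(y_true, y_pred):
    """Return (FNR, FPR, FNR - FPR) for binary labels where 1 = harmful."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(1 for t in y_true if t == 1)
    negatives = sum(1 for t in y_true if t == 0)
    # Guard against empty classes so the toy function never divides by zero.
    fnr = fn / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return fnr, fpr, fnr - fpr

# Toy labels: 3 of 4 harmful items missed, 1 of 4 benign items flagged.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]
fnr, fpr, gap = fnr_fpr_gap(y_true, y_pred)
print(fnr, fpr, gap)  # 0.75 0.25 0.5
```

A gap near zero indicates that the two error types are balanced, which is the direction of the reported changes (0.74 to 0.42 and 0.54 to 0.28).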

[17] PaperVoyager : Building Interactive Web with Visual Language Models cs.CL

Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang

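TL;DR: This paper proposes PaperVoyager, an agent that automatically converts research paper PDFs into executable interactive web systems. Through end-to-end processing (paper understanding, system modeling, and webpage synthesis), it lets users manipulate inputs and observe dynamic behavior, addressing the limitation that existing document agents produce only static artifacts.

Details

Motivation: Existing document agents driven by visual language models mainly turn papers into static summaries, webpages, or slides, which cannot adequately handle technical papers involving dynamic mechanisms and state transitions. A new approach that generates interactive systems is therefore needed.

Result: On a benchmark of 19 research papers paired with expert-built interactive systems as ground truth, PaperVoyager significantly improves the quality of the generated interactive systems.

Insight: The novelty is a structured generation framework that explicitly models mechanisms and interaction logic during synthesis, offering a new paradigm for interactive scientific paper understanding.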

Abstract: Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.

[18] When Language Models Lose Their Mind: The Consequences of Brain Misalignment cs.CL

Gabriele Merlin, Mariya Toneva

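TL;DR: This paper studies the role of brain alignment in the linguistic competence of large language models (LLMs). It trains "brain-misaligned models" that are deliberately poor at predicting brain activity while retaining high language modeling performance, and compares them with brain-aligned models on more than 200 downstream tasks. Brain misalignment substantially impairs downstream performance, which indicates that brain alignment plays a critical role in robust linguistic competence.

Details

Motivation: To examine whether and how brain alignment affects the linguistic competence of LLMs, beyond its potential for cognitive modeling or safety.

Result: On more than 200 downstream tasks spanning semantics, syntax, discourse, reasoning, and morphology, brain-misaligned models perform substantially worse than well-matched brain-aligned counterparts, indicating that brain alignment has a substantial effect on downstream performance.

Insight: The novelty is the construction of "brain-misaligned models" as a controlled comparison that isolates the specific contribution of brain alignment to language understanding. Viewed objectively, this offers a new experimental method and evidence on the relationship between neural representations and linguistic processing, and underscores the potential value of brain alignment for building more robust AI models.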

Abstract: While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for enhanced safety and trustworthiness in AI, the role of this brain alignment for linguistic competence remains uncertain. In this work, we investigate the functional implications of brain alignment by introducing brain-misaligned models: LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. We evaluate these models on over 200 downstream tasks encompassing diverse linguistic domains, including semantics, syntax, discourse, reasoning, and morphology. By comparing brain-misaligned models with well-matched brain-aligned counterparts, we isolate the specific impact of brain alignment on language understanding. Our experiments reveal that brain misalignment substantially impairs downstream performance, highlighting the critical role of brain alignment in achieving robust linguistic competence. These findings underscore the importance of brain alignment in LLMs and offer novel insights into the relationship between neural representations and linguistic processing.

[19] ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment cs.CL | cs.AI | stat.AP

Hao Wang, Haocheng Yang, Licheng Pan, Lei Shen, Xiaoxi Li

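TL;DR: This paper proposes ImplicitRM, a method for learning unbiased reward models from implicit preference data (such as clicks and copies) as an alternative to costly explicit feedback data. It addresses two challenges of implicit data: the lack of definitive negative samples and user preference bias.

Details

Motivation: Current reward modeling depends on explicit human feedback data that is expensive to collect. Implicit feedback (such as clicks and copies) is cheaper but lacks definitive negative samples and suffers from user preference bias, so an unbiased reward modeling method is needed.

Result: Experiments show that ImplicitRM learns accurate reward models across implicit preference datasets; the abstract does not name specific benchmarks or state-of-the-art comparisons.

Insight: The contributions include stratifying training samples into four latent groups via a stratification model and deriving a learning objective by likelihood maximization that is proven to be theoretically unbiased, which addresses both the missing negative samples and the preference bias in implicit data.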

Abstract: Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon experimental feedback data with high collection costs. In this work, we study implicit reward modeling, that is, learning reward models from implicit human feedback (e.g., clicks and copies), as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) Implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.

[20] I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes cs.CL

Shijia Zhou, Saif M. Mohammad, Barbara Plank, Diego Frassinelli

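TL;DR: This paper evaluates eight state-of-the-art multimodal large language models (MLLMs) on detecting and explaining six types of figurative meaning in internet memes, and conducts a human evaluation of whether the generated explanations support the predicted label and remain faithful to the meme content. All models show a strong bias toward associating a meme with figurative meaning even when none is present, and correct predictions are not always accompanied by faithful explanations.

Details

Motivation: It remains largely unclear how MLLMs combine and interpret visual and textual information to identify figurative meaning in memes; this work addresses that gap.

Result: Eight state-of-the-art generative MLLMs were evaluated on three datasets. All models show a strong bias toward associating memes with figurative meaning, and the human evaluation shows that correct predictions are not necessarily accompanied by faithful explanations.

Insight: The novelty is a systematic evaluation of MLLMs on figurative meaning in memes, with a human evaluation that analyzes explanation faithfulness. Viewed objectively, the analysis exposes bias in cross-modal reasoning and inconsistency between predictions and explanations, pointing to directions for improving the reliability and interpretability of MLLMs.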

Abstract: Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.

[21] WISTERIA: Weak Implicit Signal-based Temporal Relation Extraction with Attention cs.CL | cs.AI

Duy Dao Do, Anaïs Halftermeyer, Thi-Bich-Hanh Dao

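TL;DR: This paper proposes WISTERIA, a framework for temporal relation extraction (TRE) that combines multi-head attention with pair-conditioned top-K pooling to isolate the most informative context tokens for each event pair and so capture the implicit signals that determine the temporal relation.

Details

Motivation: Existing attention-based models often highlight globally salient tokens but overlook the pair-specific cues that actually determine the temporal relation. WISTERIA addresses this by examining whether the top-K attention components conditioned on each event pair encode interpretable evidence.

Result: On the TimeBank-Dense, MATRES, TDDMan, and TDDAuto benchmarks, WISTERIA achieves competitive accuracy, and linguistic analysis of the top-K tokens reveals pair-level rationales aligned with temporal linguistic cues.

Insight: The novelty is to treat a temporal signal as any lexical, syntactic, or morphological element that implicitly expresses temporal order, rather than only explicit markers, and to use pair-conditioned top-K pooling to obtain a localized, interpretable view of temporal reasoning.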

Abstract: Temporal Relation Extraction (TRE) requires identifying how two events or temporal expressions are related in time. Existing attention-based models often highlight globally salient tokens but overlook the pair-specific cues that actually determine the temporal relation. We propose WISTERIA (Weak Implicit Signal-based Temporal Relation Extraction with Attention), a framework that examines whether the top-K attention components conditioned on each event pair truly encode interpretable evidence for temporal classification. Unlike prior works assuming explicit markers such as before, after, or when, WISTERIA considers signals as any lexical, syntactic, or morphological element implicitly expressing temporal order. By combining multi-head attention with pair-conditioned top-K pooling, the model isolates the most informative contextual tokens for each pair. We conduct extensive experiments on TimeBank-Dense, MATRES, TDDMan, and TDDAuto, including linguistic analyses of top-K tokens. Results show that WISTERIA achieves competitive accuracy and reveals pair-level rationales aligned with temporal linguistic cues, offering a localized and interpretable view of temporal reasoning.
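
The idea of pair-conditioned top-K pooling can be sketched as follows: given per-token attention scores computed for one event pair, keep the K highest-scoring tokens and pool only their vectors. This is a simplified sketch, not the WISTERIA implementation; the scores, the 2-dimensional vectors, and the use of mean pooling are all assumptions made for illustration.

```python
def top_k_pool(token_vectors, pair_scores, k):
    """Keep the k tokens with the highest pair-conditioned scores and
    average their vectors (mean pooling is an assumption of this sketch)."""
    ranked = sorted(range(len(pair_scores)),
                    key=lambda i: pair_scores[i], reverse=True)
    top = sorted(ranked[:k])  # restore sentence order for readability
    dim = len(token_vectors[0])
    pooled = [sum(token_vectors[i][d] for i in top) / len(top)
              for d in range(dim)]
    return top, pooled

# Hypothetical sentence and scores for the event pair (left, arrived).
tokens = ["she", "left", "once", "he", "arrived"]
vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0], [0.0, 3.0]]
scores = [0.05, 0.10, 0.50, 0.05, 0.30]

top, pooled = top_k_pool(vectors, scores, k=2)
print([tokens[i] for i in top], pooled)  # ['once', 'arrived'] [1.0, 2.5]
```

Because the selected indices are explicit, they can be inspected directly, which is what makes the linguistic analysis of top-K tokens described in the abstract possible.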

cs.CV

[22] Founder effects shape the evolutionary dynamics of multimodality in open LLM families cs.CV | cs.AI | cs.CL

Manuel Cebrian

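TL;DR: Analyzing metadata for more than 1.8 million models on Hugging Face, this paper studies how multimodal capabilities evolve within open LLM families. It finds that multimodality enters mainly through rare founder events and then expands rapidly within the descendant lineages, producing a punctuated adoption pattern.

Details

Motivation: To study how multimodal capabilities (especially vision-language tasks) emerge and spread over time within open LLM families, and to quantify their evolutionary dynamics and the limits of cross-type transfer.

Result: Within major open LLM families, multimodality is rare through 2023 and most of 2024, then rises sharply in 2024-2025, dominated by image-text vision-language tasks. Among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants, whereas 94.5% of VLM-child fine-tuning edges originate from VLM parents. About 60% of VLM releases appear as new roots without recorded parents, and the remainder derive mainly from VLM lineages.

Insight: Multimodality spreads through open LLM families by founder effects followed by rapid within-lineage amplification, with very low cross-type transfer rates. This may give multimodal capabilities a distinct, transfer-limited scaling behavior, and it has methodological implications for studying how model capabilities evolve.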

Abstract: Large language model (LLM) families are improving rapidly, yet it remains unclear how quickly multimodal capabilities emerge and propagate within open families. Using the ModelBiome AI Ecosystem dataset of Hugging Face model metadata and recorded lineage fields (>1.8x10^6 model entries), we quantify multimodality over time and along recorded parent-to-child relations. Cross-modal tasks are widespread in the broader ecosystem well before they become common within major open LLM families: within these families, multimodality remains rare through 2023 and most of 2024, then increases sharply in 2024-2025 and is dominated by image-text vision-language tasks. Across major families, the first vision-language model (VLM) variants typically appear months after the first text-generation releases, with lags ranging from 1 month (Gemma) to more than a year for several families and ~26 months for GLM. Lineage-conditioned transition rates show weak cross-type transfer: among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants. Instead, multimodality expands primarily within existing VLM lineages: 94.5% of VLM-child fine-tuning edges originate from VLM parents, versus 4.7% from text-generation parents. At the model level, most VLM releases appear as new roots without recorded parents (60%), while the remainder are predominantly VLM-derived; founder concentration analyses indicate rapid within-lineage amplification followed by diversification. Together, these results show that multimodality enters open LLM families through rare founder events and then expands rapidly within their descendant lineages, producing punctuated adoption dynamics that likely induce distinct, transfer-limited scaling behavior for multimodal capabilities.
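
The two headline percentages condition on different things: 0.218% is a share of edges leaving text-generation parents, while 94.5% is a share of edges arriving at VLM children. A small sketch makes the distinction concrete. The edge list below is invented toy data, and the function names are hypothetical; nothing here reproduces the ModelBiome dataset.

```python
def transition_rate(edges, parent_type, child_type):
    """Among edges whose parent has parent_type, the share whose child
    has child_type (conditioning on the parent)."""
    children = [c for p, c in edges if p == parent_type]
    return children.count(child_type) / len(children) if children else 0.0

def parent_share(edges, child_type, parent_type):
    """Among edges whose child has child_type, the share whose parent
    has parent_type (conditioning on the child)."""
    parents = [p for p, c in edges if c == child_type]
    return parents.count(parent_type) / len(parents) if parents else 0.0

# Toy fine-tuning edges as (parent_type, child_type) pairs.
edges = ([("text", "text")] * 8
         + [("text", "vlm")] * 2
         + [("vlm", "vlm")] * 6)

text_to_vlm = transition_rate(edges, "text", "vlm")  # 2 of 10 text-parent edges
vlm_from_vlm = parent_share(edges, "vlm", "vlm")     # 6 of 8 VLM-child edges
print(text_to_vlm, vlm_from_vlm)  # 0.2 0.75
```

The same edge list therefore yields two different rates depending on which end of the edge is held fixed, which is why the abstract reports both.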

[23] From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs cs.CV | cs.AI | cs.CL

Federico Toschi, Nicolò Brunello, Andrea Sassella, Vincenzo Scotti, Mark James Carman

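TL;DR: This paper builds the Manual to Action Dataset (M2AD) to evaluate multimodal large language models (MLMs) as real-time assistants in technical tasks such as furniture assembly. The dataset contains step-by-step labels and manual references and is used to test whether MLMs can reduce the need for detailed labeling, track the progression of assembly steps, and refer correctly to instruction manual pages.

Details

Motivation: As MLMs develop, a natural next step is to extend them into real-time assistants that support users in complex technical tasks (for example, furniture assembly in VR/AR settings). This requires evaluating how openly available MLMs actually perform on such tasks, to move them toward real-world use.

Result: Evaluation on M2AD shows that while some models understand procedural sequences, their performance is limited by architectural and hardware constraints, which highlights the need for better multi-image and interleaved text-image reasoning.

Insight: The novelty is M2AD, a dataset built specifically to evaluate how well MLMs assist in technical tasks, together with three evaluation dimensions (annotation efficiency, step tracking, and manual reference). Viewed objectively, current MLMs still face significant bottlenecks on multimodal procedural tasks, which underscores the need for architectures that support complex multimodal reasoning.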

Abstract: The recent advancements introduced by Large Language Models (LLMs) have transformed how Artificial Intelligence (AI) can support complex, real-world tasks, pushing research outside the text boundaries towards multi-modal contexts and leading to Multimodal Large Language Models (MLMs). Given the current adoption of LLM-based assistants in solving technical or domain-specific problems, the natural continuation of this trend is to extend the input domains of these assistants exploiting MLMs. Ideally, these MLMs should be used as real-time assistants in procedural tasks, hopefully integrating a view of the environment where the user being assisted is, or even better sharing the same point of view via Virtual Reality (VR) or Augmented Reality (AR) supports, to reason over the same scenario the user is experiencing. With this work, we aim to evaluate the quality of currently openly available MLMs to provide this kind of assistance on technical tasks. To this end, we annotated a dataset of furniture assembly with step-by-step labels and manual references: the Manual to Action Dataset (M2AD). We used this dataset to assess (1) to what extent the reasoning abilities of MLMs can be used to reduce the need for detailed labelling, allowing for more efficient, cost-effective annotation practices, (2) whether MLMs are able to track the progression of assembly steps, and (3) whether MLMs can refer correctly to the instruction manual pages. Our results showed that while some models understand procedural sequences, their performance is limited by architectural and hardware constraints, highlighting the need for multi-image and interleaved text-image reasoning.

[24] When Visuals Aren’t the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations cs.CV | cs.AI

Harsh Nishant Lalai, Raj Sanjay Shah, Hanspeter Pfister, Sashank Varma, Grace Guo

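TL;DR: This paper evaluates how well vision-language models detect misleading data visualizations, focusing on deception caused by chart design errors and by reasoning errors in captions. It builds a benchmark that pairs real-world visualizations with human-authored misleading captions to analyze model performance systematically across error categories.

Details

Motivation: Existing VLMs perform well on chart understanding tasks, but their ability to detect misleading visualizations, especially when the deception arises from subtle reasoning errors in captions, remains poorly understood. The work aims to close the gap between coarse detection of misleading content and attribution of the specific errors behind it.

Result: Across many commercial and open-source VLMs, models detect visual design errors substantially more reliably than reasoning-based misinformation, and they frequently misclassify non-misleading visualizations as deceptive.

Insight: The novelty is a benchmark grounded in a fine-grained error taxonomy (reasoning errors such as cherry-picking and causal inference, and design errors such as truncated axes and dual axes), which enables controlled analysis of what makes a visualization misleading and exposes the weakness of VLMs in detecting reasoning-related deception.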

Abstract: Visualizations help communicate data insights, but deceptive data representations can distort their interpretation and propagate misinformation. While recent Vision Language Models (VLMs) perform well on many chart understanding tasks, their ability to detect misleading visualizations, especially when deception arises from subtle reasoning errors in captions, remains poorly understood. Here, we evaluate VLMs on misleading visualization-caption pairs grounded in a fine-grained taxonomy of reasoning errors (e.g., Cherry-picking, Causal inference) and visualization design errors (e.g., Truncated axis, Dual axis, inappropriate encodings). To this end, we develop a benchmark that combines real-world visualization with human-authored, curated misleading captions designed to elicit specific reasoning and visualization error types, enabling controlled analysis across error categories and modalities of misleadingness. Evaluating many commercial and open-source VLMs, we find that models detect visual design errors substantially more reliably than reasoning-based misinformation, and frequently misclassify non-misleading visualizations as deceptive. Overall, our work fills a gap between coarse detection of misleading content and the attribution of the specific reasoning or visualization errors that give rise to it.

[25] Efficient Universal Perception Encoder cs.CV

Chenchen Zhu, Saksham Suri, Cijo Jose, Maxime Oquab, Marc Szafraniec

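TL;DR: This paper proposes the Efficient Universal Perception Encoder (EUPE) to address the challenge of running AI models on smart edge devices, where compute is limited and multiple tasks must be handled at once. By distilling from multiple domain-expert foundation vision encoders, the method balances inference efficiency with general-purpose representations.

Details

Motivation: Smart edge devices have limited compute but must handle multiple vision tasks simultaneously, which calls for a vision encoder that is small yet powerful and versatile.

Result: Experiments show that EUPE matches or exceeds individual domain experts of the same size across diverse task domains, and outperforms previous agglomerative encoders.

Insight: The novelty is a distillation strategy that first scales up to a large proxy teacher and then scales down from that single teacher, rather than scaling down directly from multiple teachers. Viewed objectively, this two-step distillation may integrate multi-domain knowledge more effectively and yield better generality.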

Abstract: Running AI models on smart edge devices can unlock versatile user experiences, but presents challenges due to limited compute and the need to handle multiple tasks simultaneously. This requires a vision encoder with small size but powerful and versatile representations. We present our method, Efficient Universal Perception Encoder (EUPE), which offers both inference efficiency and universally good representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert foundation vision encoders. Unlike previous agglomerative methods that directly scale down from multiple teachers to an efficient encoder, we demonstrate the importance of first scaling up to a large proxy teacher and then scaling down from this single teacher. Experiments show that EUPE achieves on-par or better performance than individual domain experts of the same size on diverse task domains and also outperforms previous agglomerative encoders. We will release the full family of EUPE models and the code to foster future research.

[26] Static Scene Reconstruction from Dynamic Egocentric Videos cs.CV | cs.GR

Qifei Cui, Patrick Chen

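TL;DR: This paper proposes a robust method for reconstructing static scenes from dynamic egocentric videos. A mask-aware reconstruction mechanism and a chunked reconstruction strategy with pose-graph stitching address the trajectory drift and "ghost" geometry caused by rapid camera motion and dynamic hand interactions, significantly improving reconstruction accuracy and visual quality.

Details

Motivation: Existing static reconstruction systems (such as MapAnything) degrade on long-form egocentric videos with frequent dynamic interactions, where rapid camera motion and dynamic foreground objects (such as hands) cause catastrophic trajectory drift and "ghost" geometry artifacts.

Result: Experiments on HD-EPIC and indoor drone datasets show that the method significantly reduces absolute trajectory error and yields visually clean static geometry, outperforming naive baselines.

Insight: The novelty is to integrate a mask-aware mechanism into the attention layers to suppress dynamic foreground explicitly, and to use chunked reconstruction with pose-graph stitching to ensure global consistency and eliminate long-term drift, thereby extending foundation-model capability to dynamic first-person scenes.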

Abstract: Egocentric videos present unique challenges for 3D reconstruction due to rapid camera motion and frequent dynamic interactions. State-of-the-art static reconstruction systems, such as MapAnything, often degrade in these settings, suffering from catastrophic trajectory drift and “ghost” geometry caused by moving hands. We bridge this gap by proposing a robust pipeline that adapts static reconstruction backbones to long-form egocentric video. Our approach introduces a mask-aware reconstruction mechanism that explicitly suppresses dynamic foreground in the attention layers, preventing hand artifacts from contaminating the static map. Furthermore, we employ a chunked reconstruction strategy with pose-graph stitching to ensure global consistency and eliminate long-term drift. Experiments on HD-EPIC and indoor drone datasets demonstrate that our pipeline significantly improves absolute trajectory error and yields visually clean static geometry compared to naive baselines, effectively extending the capability of foundation models to dynamic first-person scenes.

[27] MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding cs.CV

Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang

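TL;DR: This paper proposes MinerU-Diffusion, a unified diffusion-based framework for document OCR. It recasts document OCR as an inverse rendering problem and replaces conventional autoregressive sequential decoding with parallel diffusion denoising, aiming to address sequential latency and error propagation in long documents.

Details

Motivation: Most existing document OCR systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. The authors argue that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task, and therefore revisit the problem from an inverse rendering perspective.

Result: Extensive experiments show that MinerU-Diffusion improves robustness while decoding up to 3.2x faster than autoregressive baselines. Evaluation on the proposed Semantic Shuffle benchmark further confirms reduced dependence on linguistic priors and stronger visual OCR capability.

Insight: The core novelty is to treat document OCR as inverse rendering and to introduce a diffusion-based parallel decoding paradigm. Specific techniques include a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy, which enable stable training and efficient long-sequence inference, reduce dependence on linguistic priors, and strengthen visual OCR capability.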

Abstract: Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.

[28] Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing cs.CV | cs.AI | cs.HC | cs.MM

Weitong Cai, Hang Zhang, Yukai Huang, Shitong Sun, Jiankang Deng

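TL;DR: This paper proposes ColorTrigger, built on a new paradigm for efficient streaming video understanding: "grayscale-always, color-on-demand". The analysis finds that sparse RGB frames suffice for performance comparable to full-color video when temporal structure is preserved through a continuous grayscale stream. ColorTrigger is an online, training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs.

Details

Motivation: Continuous high-fidelity RGB video capture is prohibitively expensive for resource-constrained edge and wearable AI systems; the goal is practical always-on video sensing.

Result: On streaming video understanding benchmarks, ColorTrigger reaches 91.6% of full-color baseline performance while using only 8.1% of RGB frames, demonstrating substantial color redundancy in natural videos.

Insight: The novelty is the "grayscale-always, color-on-demand" sensing paradigm and a lightweight online trigger that detects chromatic redundancy causally. Viewed objectively, the method optimizes sensing and inference costs jointly, offering a practical solution for always-on video analysis on resource-constrained devices.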

Abstract: Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.
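
To give a feel for credit-budgeted triggering, the sketch below uses a simple token-bucket rule: credit accrues every frame, a color capture costs one credit, and capture fires only when a grayscale novelty score exceeds a threshold. This is a deliberately simplified stand-in, not ColorTrigger itself: the paper's quadratic-programming affinity analysis is replaced by a precomputed hypothetical `novelty` list, and the threshold and credit rate are arbitrary assumptions.

```python
def select_color_frames(novelty, threshold=0.5, credit_rate=0.25,
                        max_credit=1.0):
    """Causal trigger: decide per frame, using only past and current data.

    novelty: per-frame grayscale change scores (hypothetical input).
    Returns the indices of frames for which color capture is triggered.
    """
    credit = max_credit
    chosen = []
    for i, score in enumerate(novelty):
        # Trigger only if the frame looks novel and budget is available.
        if score > threshold and credit >= 1.0:
            chosen.append(i)
            credit -= 1.0
        # Budget refills slowly, capping the long-run color frame rate.
        credit = min(max_credit, credit + credit_rate)
    return chosen

novelty = [0.9, 0.8, 0.1, 0.1, 0.7, 0.2, 0.9, 0.1]
frames = select_color_frames(novelty)
print(frames)  # [0, 4]
```

With a refill rate of 0.25 credits per frame, at most one frame in four can be captured in color over the long run, regardless of how often the novelty score spikes.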

[29] Tiny Inference-Time Scaling with Latent Verifiers cs.CV | cs.AI | cs.MM

Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi

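TL;DR: This paper proposes VHS (Verifier on Hidden States), a verifier for improving generative models at inference time. It scores and selects candidate outputs directly from the intermediate hidden representations of Diffusion Transformer (DiT) single-step generators, avoiding the decoding to pixel space and re-encoding that verifiers based on multimodal large language models (MLLMs) require, which substantially reduces compute overhead.

Details

Motivation: Inference-time scaling methods commonly use MLLMs as verifiers to improve generation quality, but an MLLM verifier must decode each candidate from latent space to pixel space and re-encode it into visual embeddings, which is redundant and costly. This work aims to design a more efficient verifier that operates directly on the generator's hidden states to reduce verification cost.

Result: Under tiny inference budgets (only a small number of candidates per prompt), VHS reduces joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% relative to a standard MLLM verifier, and achieves a +2.7% improvement on GenEval at the same inference-time budget.

Insight: The novelty is to move verification from pixel space to the generator's intermediate hidden representations, avoiding costly decode-and-re-encode operations. This suggests a new approach to efficient inference-time scaling: assess quality with the generative model's internal representations rather than with an external heavyweight model, substantially reducing compute while matching or improving performance.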

Abstract: Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51% and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.
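
Verifier-guided inference-time scaling reduces to best-of-N selection: generate a few candidates, score each, keep the best. The sketch below scores candidates from their hidden features without any decode step. It is illustrative only: the linear scorer stands in for a learned verifier, and the candidate dictionaries, feature values, and weights are all invented assumptions.

```python
def hidden_state_score(hidden, weights):
    """Hypothetical verifier: a linear score over hidden features.
    A real verifier would be a trained network; this is a placeholder."""
    return sum(h * w for h, w in zip(hidden, weights))

def best_of_n(candidates, weights):
    """Pick the candidate whose hidden features score highest.
    No pixel-space decoding is needed to rank the candidates."""
    return max(candidates,
               key=lambda c: hidden_state_score(c["hidden"], weights))

# Toy candidates for one prompt, each carrying generator hidden features.
candidates = [
    {"id": 0, "hidden": [0.1, 0.9]},
    {"id": 1, "hidden": [0.8, 0.4]},
    {"id": 2, "hidden": [0.5, 0.5]},
]
weights = [1.0, -0.5]

best = best_of_n(candidates, weights)
print(best["id"])  # 1
```

Only the selected candidate would then need to be decoded to pixels, which is where the savings over a decode-then-re-encode verifier come from.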

[30] Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation cs.CV

Delin An, Chaoli Wang

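TL;DR: This paper proposes Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation. Guided jointly by a user-provided 2D sketch and a textual description of 3D geometric semantics, the framework first generates 3D segmentation masks of the target organ and then uses these masks to guide a latent diffusion model that synthesizes 3D CT volumes, producing medical images with accurate anatomy and realistic appearance.

Details

Motivation: Generating 3D medical volumes with anatomically consistent structures under multimodal conditions (sketch plus text) remains a complex, unresolved problem; solving it would help with data scarcity in the medical field.

Result: Extensive experiments on public CT datasets show that Sketch2CT achieves superior performance in generating multimodal medical volumes.

Insight: The novelty is a multimodal diffusion framework guided jointly by sketch and text, with two key modules (one refines sketch features with localized textual cues, the other integrates global sketch-text representations) that align and fuse the inputs. Its capsule-attention backbone exploits the complementary strengths of sketches and text, enabling a controllable, low-cost pipeline for medical data augmentation.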

Abstract: Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at https://github.com/adlsn/Sketch2CT.

[31] Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos cs.CV | cs.AI | cs.CL

Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara

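TL;DR: This paper introduces Ego2Web, the first benchmark that bridges egocentric (first-person) video perception and web agent execution. An automatic generation pipeline combined with human verification yields high-quality video-task pairs across web task types such as e-commerce and media retrieval. The paper also develops Ego2WebJudge, an LLM-based automatic evaluation method that reaches approximately 84% agreement with human judgment. Experiments show that current state-of-the-art agents perform weakly on the benchmark, leaving substantial headroom.

Details

Motivation: Current web-agent benchmarks focus entirely on web-based interaction and perception and lack grounding in the user's real physical surroundings. They cannot evaluate scenarios in which an agent uses egocentric visual perception (for example, via AR glasses) to recognize an object in the environment and then completes a related task online. Ego2Web is proposed to fill this gap.

Result: On Ego2Web, diverse state-of-the-art (SoTA) agents perform weakly, with substantial headroom across all task categories. The proposed automatic evaluation method, Ego2WebJudge, reaches approximately 84% agreement with human judgment, substantially higher than existing evaluation methods.

Insight: The core novelty is the first agent benchmark connecting the physical world (first-person video) with the digital world (web tasks), which emphasizes understanding and execution across visual and web modalities. The LLM-based automatic evaluation framework (Ego2WebJudge) also offers an efficient and reliable way to evaluate complex multimodal tasks.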

Abstract: Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user’s real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user’s surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification and refinement to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, knowledge lookup, etc. To facilitate accurate and scalable evaluation for our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA agents on our Ego2Web show that their performance is weak, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding in the proposed task and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.
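
An agreement figure such as the roughly 84% reported for Ego2WebJudge is commonly computed as the fraction of tasks on which the automatic judge and the human annotator give the same verdict. The sketch below shows that calculation on invented toy verdicts; whether the paper uses exactly this raw agreement measure is an assumption, and the function name is hypothetical.

```python
def agreement_rate(human, judge):
    """Fraction of items where the judge's verdict matches the human's."""
    if len(human) != len(judge):
        raise ValueError("verdict lists must have the same length")
    if not human:
        raise ValueError("at least one verdict is required")
    matches = sum(1 for h, j in zip(human, judge) if h == j)
    return matches / len(human)

# Toy success/failure verdicts for five tasks (True = task completed).
human = [True, False, True, True, False]
judge = [True, False, False, True, False]

rate = agreement_rate(human, judge)
print(rate)  # 0.8
```

Raw agreement does not correct for chance, so a chance-adjusted statistic may be worth reporting alongside it when the verdicts are heavily skewed toward one class.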

[32] CanViT: Toward Active-Vision Foundation Models cs.CV

Yohaï-Eliel Berreby, Sabrina Du, Audrey Durand, B. Suresh Krishna

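TL;DR: This paper introduces CanViT, the first task- and policy-agnostic active-vision foundation model. It uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone to a spatiotopic scene-wide latent workspace (the canvas), and introduces Canvas Attention for efficient interaction between them. The model is pretrained with a label-free active-vision scheme that reconstructs scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. On ADE20K segmentation it outperforms the best prior active model and its FLOP- or input-matched DINOv3 teacher, and it reaches 81.2% top-1 accuracy on ImageNet-1k.

Details

Motivation: Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but it lacks scalable general-purpose architectures and pretraining pipelines, so active-vision foundation models have remained unexplored. This work aims to fill that gap.

Result: On ADE20K semantic segmentation, a frozen CanViT-B achieves 38.5% mIoU from a single low-resolution glimpse, outperforming the best active model's 27.6% with 19.5x fewer inference FLOPs and no fine-tuning; with additional glimpses it reaches 45.9% mIoU. On ImageNet-1k classification it reaches 81.2% top-1 accuracy. The model was pretrained on a single H100 using 13.2 million ImageNet-21k scenes and 1 billion random glimpses.

Insight: The core novelty is the first task- and policy-agnostic active-vision foundation model architecture. Its key design choices are: 1) scene-relative RoPE to bind the retinotopic backbone and the spatiotopic canvas; 2) Canvas Attention, a novel asymmetric cross-attention mechanism for efficient interaction with working memory; 3) decoupling "thinking" (backbone-level) from "memory" (canvas-level), removing canvas-side self-attention and fully-connected layers for low latency and scalability; 4) a policy-agnostic passive-to-active dense latent distillation pretraining scheme. This opens a new direction for active-vision research and substantially narrows the performance gap between passive and active vision on tasks such as semantic segmentation.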

Abstract: Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve low-latency sequential inference and scalability to large scenes. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes – an order of magnitude more than previous active models – and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model’s 27.6% with 19.5x fewer inference FLOPs and no fine-tuning, as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B reaches 81.2% top-1 accuracy with frozen teacher probes. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work closes the wide gap between passive and active vision on semantic segmentation and demonstrates the potential of AVFMs as a new research axis.

[33] A vision-language model and platform for temporally mapping surgery from video cs.CV | cs.RO

Dani Kiyasseh

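TL;DR: This paper introduces Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA) to automatically map surgical behaviour from video. HSA is a comprehensively annotated video library built through an iterative self-labelling framework and containing over 650,000 videos. The work also releases the HSA-27k subset for benchmarking and a Halsted web platform that lets surgeons map their own procedures automatically and quickly.

Details

Motivation: Current AI models for mapping surgeon behaviour from video are narrow in scope, capturing only limited behavioural components within single procedures, and offer limited translational value because practising surgeons cannot access them. The paper addresses both problems with a more comprehensive and efficient model and an accessible platform, narrowing the translational gap between surgical AI and clinical deployment.

Result: Halsted surpasses previous state-of-the-art (SOTA) models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. The HSA-27k subset is publicly released to facilitate benchmarking.

Insight: The contributions are: building HSA, a large-scale annotated surgical video library, through an iterative self-labelling framework; training the Halsted vision-language model for more comprehensive mapping of surgical behaviour; and providing an easily accessible web platform that lets surgeons apply the model directly to their own procedures, which supports clinical deployment of surgical AI and progress toward autonomous robotic surgery.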

Abstract: Mapping surgery is fundamental to developing operative guidelines and enabling autonomous robotic surgery. Recent advances in artificial intelligence (AI) have shown promise in mapping the behaviour of surgeons from videos, yet current models remain narrow in scope, capturing limited behavioural components within single procedures, and offer limited translational value, as they remain inaccessible to practising surgeons. Here we introduce Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA), one of the most comprehensive annotated video libraries grown through an iterative self-labelling framework and encompassing over 650,000 videos across eight surgical specialties. To facilitate benchmarking, we publicly release HSA-27k, a subset of the Halsted Surgical Atlas. Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. To bridge the longstanding translational gap of surgical AI, we develop the Halsted web platform (https://halstedhealth.ai/) to provide surgeons anywhere in the world with the previously-unavailable capability of automatically mapping their own procedures within minutes. By standardizing unstructured surgical video data and making these capabilities directly accessible to surgeons, our work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.

[34] Language Models Can Explain Visual Features via Steering cs.CV | cs.AI

Javier Ferrando, Enrique Lopez-Cuena, Pablo Agustin Martin-Torres, Daniel Hinjos, Anna Arias-Duart

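TL;DR: This paper proposes explaining visual features through causal intervention: it steers sparse autoencoder (SAE) features inside a vision-language model and prompts the language model to describe the resulting visual concept, automating the explanation of features in vision models. The method complements traditional approaches based on input examples, and a hybrid that combines the two achieves state-of-the-art explanation quality.

Details

Motivation: Sparse autoencoders uncover thousands of features in vision models, but explaining them without human intervention remains an open challenge. Prior methods generate correlation-based explanations from top-activating input examples; this paper offers a fundamentally different alternative based on causal interventions.

Result: Experiments show that Steering is a scalable alternative that complements traditional input-example approaches and serves as a new axis for automated interpretability of vision models. Explanation quality improves consistently with language model scale, and the proposed hybrid, Steering-informed Top-k, achieves state-of-the-art explanation quality at no additional computational cost.

Insight: The key idea is to exploit the structure of vision-language models: after an empty image is provided, individual SAE features in the vision encoder are steered and the language model is prompted to explain what it sees, yielding an automated description of each visual concept. Combining causal intervention with input-example methods gives an efficient, scalable hybrid explanation framework and a direction for future work.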

Abstract: Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations based on top activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. Then, we prompt the language model to explain what it "sees", effectively eliciting the visual concept represented by each feature. Results show that Steering offers a scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed Top-k, a hybrid approach that combines the strengths of causal interventions and input-based approaches to achieve state-of-the-art explanation quality without additional computational cost.

[35] TrajLoom: Dense Future Trajectory Generation from Video cs.CV

Zewei Zhang, Jia Jun Cheng Xian, Kaiwen Liu, Ming Liang, Hang Chu

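TL;DR: This paper proposes TrajLoom, a framework for generating dense future trajectories from video. It predicts future trajectories and visibility using three components (Grid-Anchor Offset Encoding, TrajLoom-VAE, and TrajLoom-Flow) and introduces a unified benchmark, TrajLoomBench. Compared with existing methods, it substantially extends the prediction horizon while improving motion realism and stability.

Details

Motivation: Predicting future motion is crucial for video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. This paper targets that problem.

Result: On TrajLoomBench, compared with state-of-the-art methods, the approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets.

Insight: The contributions are: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories using masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. The unified evaluation benchmark is a further contribution.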

Abstract: Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at https://trajloom.github.io/.
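The offset idea behind Grid-Anchor Offset Encoding can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each track is anchored at the center of the grid pixel it is associated with, taken here as (col + 0.5, row + 0.5), and that the encoding is a plain subtraction. The function names `pixel_center_anchor`, `encode_track`, and `decode_track` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pixel_center_anchor(col: int, row: int) -> Point:
    # Assumption: the anchor is the center of the pixel at (col, row).
    return (col + 0.5, row + 0.5)


def encode_track(track: List[Point], col: int, row: int) -> List[Point]:
    # Express every position on the track as an offset from the anchor,
    # so the values no longer depend on where in the image the track lives.
    ax, ay = pixel_center_anchor(col, row)
    return [(x - ax, y - ay) for x, y in track]


def decode_track(offsets: List[Point], col: int, row: int) -> List[Point]:
    # Invert the encoding by adding the anchor back.
    ax, ay = pixel_center_anchor(col, row)
    return [(dx + ax, dy + ay) for dx, dy in offsets]
```

Under these assumptions, two tracks with the same motion but different starting pixels produce the same offsets, which is one way to read the abstract's claim about reducing location-dependent bias.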

[36] Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off cs.CV

Fulvio Sanguigni, Davide Lobba, Bin Ren, Marcella Cornia, Nicu Sebe

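TL;DR: This paper introduces Dress-ED, described as the first large-scale instruction-guided dataset for virtual try-on and try-off, together with a unified multimodal diffusion framework built on it. The dataset is constructed by an automated pipeline, contains over 146k samples spanning multiple garment categories and edit types, and is intended to support controllable, interactive fashion image generation.

Details

Motivation: Existing virtual try-on and try-off datasets are static and lack instruction-driven, controllable editing, so they cannot support interactive fashion generation.

Result: The paper proposes a unified multimodal diffusion framework as a strong baseline and evaluates it on the new Dress-ED benchmark. The dataset and code will be made publicly available.

Insight: The main contributions are the first large-scale instruction dataset unifying VTON, VTOFF, and text-guided garment editing, and a unified generative framework that reasons jointly over language instructions and visual garment cues. The automated construction pipeline, which combines MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, is also a useful reference.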

Abstract: Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking instruction-driven editing for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction of the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. Dataset and code will be made publicly available.

[37] A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images cs.CV

Heesup Yun, Isaac Kazuo Uyehara, Ioannis Droutsas, Earl Ranario, Christine H. Diepenbrock

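TL;DR: This paper proposes a vision-language model approach for generating a procedural representation of 3D plant architecture from a single image. The model is trained on synthetic images. The XML definition of the plant architecture is converted into a token sequence that a language model can predict, so that organ-level geometric and topological parameters can be extracted from the image to create a functional structural plant model.

Details

Motivation: 3D procedural plant architecture models matter for plant research, parameter extraction, and computer graphics, but measuring their architectural parameters and nested structures in the field is very time-consuming. The paper aims to extract organ-level architectural parameters automatically from image data, without 3D sensors or multi-view image processing.

Result: The model achieved a token F1 score of 0.73 during teacher-forced training. Under autoregressive generation it achieved a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182. The authors conclude that generating plant architecture models and extracting parameters from synthetic images is feasible.

Insight: The novelty lies in converting the procedural (XML) definition of plant architecture into a token sequence and using a vision-language model to predict that sequence from a single image, bypassing conventional 3D reconstruction. This offers a new way to extract complex plant structure parameters from images, with extension to real imagery left to future work.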

Abstract: Three-dimensional (3D) procedural plant architecture models have emerged as an important tool for simulation-based studies of plant structure and function, for extracting plant architectural parameters from field measurements, and for generating realistic plants in computer graphics. However, measuring the architectural parameters and nested structures for these models at field scale remains prohibitively labor-intensive. We present a novel algorithm that generates a 3D plant architecture from an image, creating a functional structural plant model that reflects organ-level geometric and topological parameters and provides a more comprehensive representation of the plant’s architecture. Instead of using 3D sensors or processing multi-view images with computer vision to obtain the 3D structure of plants, we propose a method that generates token sequences that encode a procedural definition of plant architecture. This work used only synthetic images for training and testing, with exact architectural parameters known, allowing testing of the hypothesis that organ-level architectural parameters could be extracted from image data using a vision-language model (VLM). A synthetic dataset of cowpea plant images was generated using the Helios 3D plant simulator, with the detailed plant architecture encoded in XML files. We developed a plant architecture tokenizer for the XML file defining plant architecture, converting it into a token sequence that a language model can predict. The model achieved a token F1 score of 0.73 during teacher-forced training. Evaluation of the model was performed through autoregressive generation, achieving a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182. This led to the conclusion that such plant architecture model generation and parameter extraction were possible from synthetic images; thus, future work will extend the approach to real imagery data.

[38] To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models cs.CV | cs.AI

OFM Riaz Rahman Aranya, Kevin Desai

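TL;DR: This paper evaluates six vision-language models (general-purpose and medical-specialist) on three medical visual question answering datasets and finds a tradeoff between grounding (avoiding hallucination) and resistance to sycophancy (holding up under social pressure): the models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more. Three new metrics (L-VASE, CCS, and CSI) quantify the tradeoff. None of the evaluated 7-8B parameter models is both well-grounded and pressure-resistant, which suggests the two properties should be evaluated jointly before clinical use.

Details

Motivation: The robustness of medical vision-language models against two critical failure modes, hallucination and sycophancy, is poorly understood, particularly when the two are combined.

Result: Across 1,151 test cases, no evaluated 7-8B parameter model achieves a score above 0.35 on the proposed unified safety index (CSI), indicating that none is simultaneously well-grounded and robust to social pressure.

Insight: The paper exposes an inherent tradeoff between grounding and sycophancy in medical VLMs and proposes three metrics, L-VASE, CCS, and CSI, to quantify it systematically, giving a more complete framework for model safety evaluation.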

Abstract: Vision-language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes, hallucination and sycophancy, remains poorly understood, particularly in combination. We evaluate six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and uncover a grounding-sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models. To characterize this tradeoff, we propose three metrics: L-VASE, a logit-space reformulation of VASE that avoids its double-normalization; CCS, a confidence-calibrated sycophancy score that penalizes high-confidence capitulation; and Clinical Safety Index (CSI), a unified safety index that combines grounding, autonomy, and calibration via a geometric mean. Across 1,151 test cases, no model achieves a CSI above 0.35, indicating that none of the evaluated 7-8B parameter VLMs is simultaneously well-grounded and robust to social pressure. Our findings suggest that joint evaluation of both properties is necessary before these models can be considered for clinical use. Code is available at https://github.com/UTSA-VIRLab/AgreeOrRight
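The abstract says CSI combines grounding, autonomy, and calibration via a geometric mean. A minimal sketch of that aggregation follows. It assumes three component scores already normalized to [0, 1] and an unweighted mean; the paper's exact normalization and any weighting are not given here, and the function name `clinical_safety_index` is hypothetical.

```python
import math


def clinical_safety_index(grounding: float, autonomy: float, calibration: float) -> float:
    """Unweighted geometric mean of three component scores in [0, 1].

    Assumption: the components are already normalized; how the paper
    derives them is not reproduced here.
    """
    scores = (grounding, autonomy, calibration)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("each component score must lie in [0, 1]")
    # The cube root of the product is the geometric mean of three values.
    return math.prod(scores) ** (1.0 / 3.0)
```

A geometric mean penalizes a single weak component more than an arithmetic mean does: if any one score is zero the index is zero, which fits the goal of requiring good grounding and pressure resistance at the same time.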

[39] CAM3R: Camera-Agnostic Model for 3D Reconstruction cs.CV

Namitha Guruprasad, Abhay Yadav, Cheng Peng, Rama Chellappa

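TL;DR: This paper proposes CAM3R, a camera-agnostic feed-forward model for dense 3D reconstruction from unposed wide-angle images such as fisheye and panoramic imagery. A two-view network estimates per-pixel ray directions and, across views, radial distance, confidence maps, pointmaps, and relative poses. A Ray-Aware Global Alignment framework then refines poses and unifies scale, giving robust 3D reconstruction across camera models.

Details

Motivation: State-of-the-art 3D reconstruction models are trained mainly on standard pinhole-camera datasets and suffer significant geometric degradation on wide-angle imagery captured with non-rectilinear optics such as fisheye or panoramic sensors. The paper aims to remove this dependence on the camera model so that wide-angle images can be processed without prior calibration.

Result: Extensive experiments on datasets covering panorama, fisheye, and pinhole camera models show that CAM3R sets a new state of the art (SOTA) in pose estimation and 3D reconstruction.

Insight: The main contribution is a camera-agnostic framework that separates ray-direction estimation from cross-view geometric reasoning and pairs them with a ray-aware global alignment strategy that strictly preserves the predicted local geometry, which lets the model generalize across camera models. At its core, reconstruction no longer depends on the parameters of a specific camera model and instead models per-pixel rays and scene geometry directly.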

Abstract: Recovering dense 3D geometry from unposed images remains a foundational challenge in computer vision. Current state-of-the-art models are predominantly trained on perspective datasets, which implicitly constrains them to a standard pinhole camera geometry. As a result, these models suffer from significant geometric degradation when applied to wide-angle imagery captured via non-rectilinear optics, such as fisheye or panoramic sensors. To address this, we present CAM3R, a Camera-Agnostic, feed-forward Model for 3D Reconstruction capable of processing images from wide-angle camera models without prior calibration. Our framework consists of a two-view network which is bifurcated into a Ray Module (RM) to estimate per-pixel ray directions and a Cross-view Module (CVM) to infer radial distance with confidence maps, pointmaps, and relative poses. To unify these pairwise predictions into a consistent 3D scene, we introduce a Ray-Aware Global Alignment framework for pose refinement and scale optimization while strictly preserving the predicted local geometry. Extensive experiments on various camera model datasets, including panorama, fisheye and pinhole imagery, demonstrate that CAM3R establishes a new state-of-the-art in pose estimation and reconstruction.

[40] Q-Tacit: Image Quality Assessment via Latent Visual Reasoning cs.CV

Yuxuan Jiang, Yixuan Li, Hanwei Zhu, Siyue Teng, Fan Zhang

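TL;DR: This paper proposes Q-Tacit, a new paradigm for image quality assessment (IQA) based on vision-language models (VLMs). It questions whether natural language is the ideal space for quality reasoning and instead has the VLM reason in a latent quality space, injecting visual quality priors and calibrating reasoning trajectories. The approach uses significantly fewer tokens while achieving strong overall performance.

Details

Motivation: Existing chain-of-thought (CoT) approaches to VLM-based IQA are typically language-centric and treat visual information as a static precondition, yet quality-related visual cues cannot be fully abstracted into discrete text tokens, which limits reasoning on visually intensive IQA tasks. The paper asks whether natural language suits quality reasoning and proposes reasoning in latent space instead.

Result: Extensive experiments show that Q-Tacit performs quality reasoning effectively with significantly fewer tokens than previous reasoning-based methods while achieving strong overall performance.

Insight: The novelty is moving beyond language-centric reasoning to reasoning in a latent quality space, improved by injecting structural visual quality priors and calibrating latent reasoning trajectories. The results support the claim that language is not the only compact representation suitable for visual quality and open the way to further latent reasoning paradigms for IQA.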

Abstract: Vision-Language Model (VLM)-based image quality assessment (IQA) has been significantly advanced by incorporating Chain-of-Thought (CoT) reasoning. Recent work has refined image quality reasoning by applying reinforcement learning (RL) and leveraging active visual tools. However, such strategies are typically language-centric, with visual information being treated as static preconditions. Quality-related visual cues often cannot be abstracted into text in extenso due to the gap between discrete textual tokens and quality perception space, which in turn restricts the reasoning effectiveness for visually intensive IQA tasks. In this paper, we revisit this by asking the question, “Is natural language the ideal space for quality reasoning?” and, as a consequence, we propose Q-Tacit, a new paradigm that elicits VLMs to reason beyond natural language in the latent quality space. Our approach follows a synergistic two-stage process: (i) injecting structural visual quality priors into the latent space, and (ii) calibrating latent reasoning trajectories to improve quality assessment ability. Extensive experiments demonstrate that Q-Tacit can effectively perform quality reasoning with significantly fewer tokens than previous reasoning-based methods, while achieving strong overall performance. This paper validates the proposition that language is not the only compact representation suitable for visual quality, opening possibilities for further exploration of effective latent reasoning paradigms for IQA. Source code will be released to support future research.

[41] GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning cs.CV

Jiayin Sun, Caixia Sun, Boyu Yang, Hailin Li, Xiao Chen

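TL;DR: This paper proposes GeoTikzBridge, a framework that improves the fine-grained geometric perception and visual reasoning of multimodal large language models (MLLMs) by generating TikZ-based code. It comprises two models: GeoTikzBridge-Base, trained on GeoTikz-Base, the largest image-to-TikZ dataset to date (2.5M pairs), and GeoTikzBridge-Instruct, fine-tuned on GeoTikz-Instruct, the first instruction-augmented TikZ dataset supporting visual reasoning. Experiments show state-of-the-art (SOTA) performance among open-source MLLMs, and the models can serve as plug-and-play modules that improve the geometric problem solving of other MLLMs and LLMs.

Details

Motivation: Current MLLMs struggle to perceive fine-grained geometric structures, which limits their geometric understanding and visual reasoning. The paper addresses this limitation.

Result: In extensive experiments, the GeoTikzBridge models achieve state-of-the-art (SOTA) performance among open-source MLLMs.

Insight: The contributions are: (1) a TikZ code generation framework that enhances local geometric perception; (2) GeoTikz-Base, the largest image-to-TikZ dataset to date, built through iterative data expansion and a localized geometric transformation strategy; (3) GeoTikz-Instruct, the first instruction-augmented TikZ dataset supporting visual reasoning; and (4) models that can be integrated into other MLLMs and LLMs as plug-and-play reasoning modules to improve geometric reasoning.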

Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, which constrains their geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through TikZ-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on the GeoTikz-Base dataset, the largest image-to-TikZ dataset to date with 2.5M pairs (16× larger than existing open-source datasets). The dataset is built via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on the GeoTikz-Instruct dataset, which is the first instruction-augmented TikZ dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-source MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM or LLM, enhancing reasoning performance in geometric problem-solving. Datasets and codes are publicly available at: https://github.com/sjy-1995/GeoTikzBridge-Advancing-Multimodal-Code-Generation-for-Geometric-Perception-and-Reasoning.

[42] Think 360°: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth cs.CV

Mingrui Chen, Hexiong Yang, Haogeng Liu, Huaibo Huang, Ran He

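TL;DR: This paper presents Think 360°, a multimodal benchmark that evaluates multimodal large language models (MLLMs) on reasoning width, a dimension complementary to the more commonly studied reasoning depth. The benchmark contains more than 1,200 high-quality cases across domains and uses a fine-grained tree-of-thought evaluation protocol that jointly quantifies width and depth. An evaluation of 12 major model families (over 30 advanced MLLMs) finds that current models do well on general or common-sense VQA tasks but still struggle to combine deep sequential reasoning chains with wide exploratory search for genuine insight-based reasoning.

Details

Motivation: Evaluations of MLLM reasoning have focused mainly on reasoning depth (long-chain, sequential reasoning) and neglected reasoning width (broad trial-and-error search or multi-constrained optimization). The paper fills this gap by systematically evaluating MLLMs in complex scenarios that require exploring many possible paths in parallel, applying constraints to prune unpromising branches, and iterating or backtracking efficiently.

Result: Twelve major model families (over 30 advanced MLLMs) were evaluated on Think 360° across difficulty tiers, question types, and required skills. Although current models perform strongly on general or common-sense visual question answering, they have substantial difficulty with tasks that require combining deep sequential thought chains with wide exploratory search for genuine insight-based reasoning.

Insight: The paper makes reasoning width an explicit evaluation dimension and builds a multimodal benchmark, presented as the first, to evaluate it systematically. The fine-grained tree-of-thought protocol jointly quantifies reasoning width and depth, providing a concrete evaluation tool and failure-mode analysis for building MLLMs that reason both deeper and wider.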

Abstract: In this paper, we present a holistic multimodal benchmark that evaluates the reasoning capabilities of MLLMs with an explicit focus on reasoning width, a complementary dimension to the more commonly studied reasoning depth. Specifically, reasoning depth measures the model’s ability to carry out long-chain, sequential reasoning in which each step is tightly and rigorously linked to the next. Reasoning width focuses on the model’s capacity for broad trial-and-error search or multi-constrained optimization: it must systematically traverse many possible and parallelized reasoning paths, apply diverse constraints to prune unpromising branches, and identify valid solution routes for efficient iteration or backtracking. To this end, we carefully curate 1200+ high-quality multimodal cases spanning heterogeneous domains, and propose a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning width and depth. We evaluate 12 major model families (over 30 advanced MLLMs) across difficulty tiers, question types, and required skills. Results show that while current models exhibit strong performance on general or common-sense VQA tasks, they still struggle to combine deep sequential thought chains with wide exploratory search to perform genuine insight-based reasoning. Finally, we analyze characteristic failure modes to provide possible directions for building MLLMs that reason not only deeper but also wider.

[43] WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment cs.CV | cs.AI

Tzu-Ti Wei, Chu-Yu Huang, Yu-Chee Tseng, Jen-Jee Chen

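TL;DR: WiFi2Cap is a three-stage framework that generates semantic action descriptions from Wi-Fi channel state information (CSI). A vision-language teacher model learns transferable supervision, a Mirror-Consistency Loss reduces direction-sensitive ambiguity, and a prefix-tuned language model generates the action captions.

Details

Motivation: Existing Wi-Fi CSI-based systems focus on pose estimation or predefined action classification and lack fine-grained language generation. Mapping CSI to language is further complicated by the semantic gap between wireless signals and language and by direction-sensitive ambiguities such as left/right limb confusion.

Result: On the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark, WiFi2Cap outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, demonstrating effective privacy-friendly semantic sensing.

Insight: The contributions are cross-modal alignment with a vision-language teacher to bridge the semantic gap, a Mirror-Consistency Loss to reduce direction-sensitive ambiguity, and a synchronized CSI-RGB-sentence dataset, presented as the first of its kind, for semantic captioning from Wi-Fi signals.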

Abstract: Privacy-preserving semantic understanding of human activities is important for indoor sensing, yet existing Wi-Fi CSI-based systems mainly focus on pose estimation or predefined action classification rather than fine-grained language generation. Mapping CSI to natural-language descriptions remains challenging because of the semantic gap between wireless signals and language and direction-sensitive ambiguities such as left/right limb confusion. We propose WiFi2Cap, a three-stage framework for generating action captions directly from Wi-Fi CSI. A vision-language teacher learns transferable supervision from synchronized video-text pairs, and a CSI student is aligned to the teacher’s visual space and text embeddings. To improve direction-sensitive captioning, we introduce a Mirror-Consistency Loss that reduces mirrored-action and left-right ambiguities during cross-modal alignment. A prefix-tuned language model then generates action descriptions from CSI embeddings. We also introduce the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark for semantic captioning from Wi-Fi signals. Experimental results show that WiFi2Cap consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, demonstrating effective privacy-friendly semantic sensing.

[44] How Far Can VLMs Go for Visual Bug Detection? Studying 19,738 Keyframes from 41 Hours of Gameplay Videos cs.CV | cs.SE

Wentao Lu, Alexander Senchenko, Alan Sayle, Abram Hindle, Cor-Paul Bezemer

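TL;DR: This study evaluates how well vision-language models (VLMs) detect visual bugs in long gameplay videos in practice. Analyzing 19,738 keyframes from 41 hours of gameplay video, it finds that off-the-shelf VLMs can detect a certain range of visual bugs, but that common enhancement strategies (a secondary judge model and metadata-augmented prompting) bring only marginal improvements, suggesting that hybrid approaches that better separate textual and visual anomaly detection may be needed.

Details

Motivation: Manual, video-based quality assurance (QA) for long-form gameplay is labor-intensive and error-prone. The study explores the potential of VLMs to detect visual bugs in a real industrial setting.

Result: On 19,738 keyframes from 41 hours of gameplay video, the single-prompt baseline achieves a precision of 0.50 and an accuracy of 0.72. The enhancement strategies bring only marginal improvements while increasing computational cost and output variance.

Insight: Off-the-shelf VLMs already have some ability to detect visual bugs, but practical use calls for hybrid approaches that separate textual from visual anomalies rather than relying on simple enhancement strategies. Because the study uses large-scale real industrial data, it offers empirical evidence of how VLMs actually perform.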

Abstract: Video-based quality assurance (QA) for long-form gameplay video is labor-intensive and error-prone, yet valuable for assessing game stability and visual correctness over extended play sessions. Vision language models (VLMs) promise general-purpose visual reasoning capabilities and thus appear attractive for detecting visual bugs directly from video frames. Recent benchmarks suggest that VLMs can achieve promising results in detecting visual glitches on curated datasets. Building on these findings, we conduct a real-world study using industrial QA gameplay videos to evaluate how well VLMs perform in practical scenarios. Our study samples keyframes from long gameplay videos and asks a VLM whether each keyframe contains a bug. Starting from a single-prompt baseline, the model achieves a precision of 0.50 and an accuracy of 0.72. We then examine two common enhancement strategies used to improve VLM performance without fine-tuning: (1) a secondary judge model that re-evaluates VLM outputs, and (2) metadata-augmented prompting through the retrieval of prior bug reports. Across 100 videos totaling 41 hours and 19,738 keyframes, these strategies provide only marginal improvements over the simple baseline, while introducing additional computational cost and output variance. Our findings indicate that off-the-shelf VLMs are already capable of detecting a certain range of visual bugs in QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.
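Precision and accuracy, the two figures reported for the baseline, answer different questions, and a short sketch makes the difference concrete. The confusion-matrix counts in the usage lines are made up for illustration and are not taken from the study; they were chosen only so that the two metrics land on 0.50 and 0.72.

```python
def precision(tp: int, fp: int) -> float:
    # Of the frames flagged as buggy, what fraction really contain a bug?
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    # Of all frames, what fraction were labeled correctly?
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts (NOT from the paper): 100 frames, most of them bug-free.
tp, fp, tn, fn = 10, 10, 62, 18
print(precision(tp, fp))          # 0.5
print(accuracy(tp, fp, tn, fn))   # 0.72
```

When most keyframes are bug-free, correct "no bug" answers dominate the accuracy figure, so accuracy can look moderate even though half of the flagged frames are false alarms.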

[45] SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts cs.CV

Khanh Binh Nguyen, Chae Jung Park

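TL;DR: This paper proposes SOUPLE (Sound-aware Prompt Learning) to address the difficulties of applying the pretrained CLIP model to audio-visual localization. It replaces fixed prompts with learnable context tokens that incorporate visual features to generate conditional context, strengthening the semantic correspondence between audio and visual inputs and improving localization and segmentation.

Details

Motivation: Large-scale pretrained image-text models such as CLIP provide robust multimodal representations, but when they are applied to audio-visual localization, the usual approach of replacing the classification token and using a fixed prompt struggles to capture semantic cues and to connect audio embeddings with context tokens.

Result: Experiments on VGGSound, SoundNet, and AVSBench show that SOUPLE improves audio-visual localization and segmentation performance.

Insight: The novelty is the learnable prompt context: generating conditional context dynamically bridges audio and visual semantics. The idea may transfer to other multimodal tasks that need stronger cross-modal alignment.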

Abstract: Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt “a photo of a [V_A]” fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.

Purui Bai, Tao Wu, Jiayang Sun, Xinyue Liu, Huaibo Huang

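TL;DR: This paper introduces MVPBench, a multi-video perception evaluation benchmark for assessing how well multimodal large language models (MLLMs) process and understand complex interactions across multiple videos. It comprises 14 subtasks across diverse visual domains, with about 5,000 question-answering tests involving 2,700 video clips. Evaluations show that existing models struggle significantly with multi-video inputs.

Details

Motivation: Existing benchmarks are limited to static images or single videos and overlook complex interactions across multiple videos, so a new benchmark is needed to evaluate multi-video perception and comprehension specifically.

Result: Extensive evaluation on MVPBench shows that current models perform poorly when they must process multi-video inputs to make informed decisions, underscoring substantial limitations in multi-video comprehension.

Insight: The main contribution is MVPBench, presented as the first benchmark focused on multi-video perception and comprehension. Its tasks cover complex cross-video interactions, giving future models an evaluation tool and a direction for multi-video capabilities.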

Abstract: The rapid progress of Large Language Models (LLMs) has spurred growing interest in Multi-modal LLMs (MLLMs) and motivated the development of benchmarks to evaluate their perceptual and comprehension abilities. Existing benchmarks, however, are limited to static images or single videos, overlooking the complex interactions across multiple videos. To address this gap, we introduce the Multi-Video Perception Evaluation Benchmark (MVPBench), a new benchmark featuring 14 subtasks across diverse visual domains designed to evaluate models on extracting relevant information from video sequences to make informed decisions. MVPBench includes 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips. Extensive evaluations reveal that current models struggle to process multi-video inputs effectively, underscoring substantial limitations in their multi-video comprehension. We anticipate MVPBench will drive advancements in multi-video perception.

[47] Multimodal Industrial Anomaly Detection via Geometric Prior cs.CV

Min Li, Jinghui He, Gang Li, Jiachen Li, Jin Wan

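TL;DR: This paper proposes GPAD, a Geometric Prior-based Anomaly Detection network for multimodal industrial anomaly detection. By making effective use of geometric information such as surface normal vectors and 3D shape topology, it targets complex geometric defects that 2D methods find hard to detect, such as subtle surface deformations and irregular contours.

Details

Motivation: Current multimodal industrial anomaly detection methods make poor use of crucial geometric information, which leads to low detection accuracy, so a method that fully exploits the geometric prior in 3D point clouds is needed.

Result: Extensive experiments on the MVTec-3D AD and Eyecandies datasets show that the model outperforms state-of-the-art (SOTA) methods in detection accuracy.

Insight: The contributions are a point cloud expert model for fine-grained geometric feature extraction, which uses differential normal vector computation to enhance geometric detail, and a two-stage fusion strategy, which combines attention fusion with geometric-prior-based segmentation of anomaly regions, exploiting both the complementarity of multimodal data and the geometric prior of 3D points.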

Abstract: Multimodal industrial anomaly detection aims to detect complex geometric defects, such as subtle surface deformations and irregular contours, that are difficult to detect with 2D-based methods. However, current multimodal industrial anomaly detection makes little effective use of crucial geometric information such as surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). First, we propose a point cloud expert model that performs fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate a geometric prior. Second, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose attention fusion and anomaly region segmentation based on the geometric prior, which enhance the model’s ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms state-of-the-art (SOTA) methods in detection accuracy on both the MVTec-3D AD and Eyecandies datasets.

[48] Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning cs.CV | cs.LGPDF

WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

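TL;DR: This paper proposes SlotCurri, a reconstruction-guided slot curriculum that addresses object over-fragmentation in video object-centric learning. Training starts with a few coarse slots and progressively allocates new slots to regions where reconstruction error is high. Combined with a structure-aware loss and a cyclic inference mechanism, this improves the semantic consistency and temporal continuity of object segmentation.

Details

Motivation: In video object-centric learning, the reconstruction objective encourages existing slot-attention models to occupy all slots, so a single object is often represented by several redundant slots, which causes severe over-fragmentation.

Result: On the YouTube-VIS and MOVi-C benchmarks, SlotCurri improves FG-ARI by +6.8 and +8.3, respectively, which validates its effectiveness.

Insight: The contributions are a reconstruction-guided progressive slot allocation strategy, a structure-aware loss that uses local contrast and edge information to sharpen semantic boundaries, and a cyclic inference mechanism that rolls slots forward and then backward through the frames to ensure temporal consistency. Together they address fragmentation and improve representation quality.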

Abstract: Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. Yet, during slot expansion, meaningful sub-parts can emerge only if coarse-level semantics are already well separated; however, with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. Therefore, we augment MSE with a structure-aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries. Lastly, we propose a cyclic inference that rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. All combined, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at github.com/wjun0830/SlotCurri.

[49] ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding cs.CVPDF

Ao Cheng, Xingming Li, Xuanyu Ji, Xixiang He, Qiyao Sun

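TL;DR: This paper presents ENC-Bench, the first professional benchmark for evaluating multimodal large language models (MLLMs) on electronic navigational chart (ENC) understanding. It contains 20,490 expert-validated samples drawn from 840 authentic NOAA ENCs and covers three task levels: perception, spatial reasoning, and maritime decision-making. Ten SOTA MLLMs were evaluated in a zero-shot setting; the best reaches only 47.88% accuracy, revealing systematic challenges in symbol understanding, spatial computation, and multi-constraint reasoning.

Details

Motivation: ENCs are safety-critical systems for maritime navigation, yet it is unclear whether MLLMs can interpret them reliably. ENCs encode regulations, bathymetry, and route constraints through standardized vector symbols, scale-dependent rendering, and precise geometric structure, so interpreting them requires specialized maritime knowledge. The ability of existing models in this domain is unknown.

Result: Ten SOTA MLLMs, including GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, were evaluated under a unified zero-shot protocol. The best model reaches an overall accuracy of only 47.88% on ENC-Bench, indicating systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations.

Insight: The contribution is the first professional benchmark for ENC understanding, with samples created from raw S-57 data through a calibrated vector-to-image pipeline and a three-level evaluation hierarchy of perception, spatial reasoning, and maritime decision-making. This opens a research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, and provides the infrastructure needed to move MLLMs toward professional maritime applications.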

Abstract: Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure – requiring specialized maritime expertise for interpretation. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning under multiple constraints). All samples are generated from raw S-57 data through a calibrated vector-to-image pipeline with automated consistency checks and expert review. We evaluate 10 state-of-the-art MLLMs such as GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, under a unified zero-shot protocol. The best model achieves only 47.88% accuracy, with systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations. By establishing the first rigorous ENC benchmark, we open a new research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, providing essential infrastructure for advancing MLLMs toward professional maritime applications.

[50] From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery cs.CVPDF

Bijay Shakya, Catherine Hoier, Khandaker Mamun Ahmed

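TL;DR: This paper proposes a multi-stage AI framework for structural damage detection in satellite imagery. It integrates AI-based super-resolution, deep learning object detection, and vision-language models (VLMs) to assess post-disaster building damage comprehensively. A Video Restoration Transformer (VRT) first upscales pre- and post-disaster satellite images from 1024x1024 to 4096x4096 to enhance structural detail. A YOLOv11-based detector then localizes buildings in the pre-disaster imagery, and VLMs analyze the cropped building regions to assess damage across four severity levels. Finally, CLIPScore provides reference-free semantic alignment, and a multi-model VLM-as-a-Jury strategy reduces individual model bias.

Details

Motivation: Rapid and accurate structural damage assessment is needed after natural disasters, but traditional remote sensing detection pipelines are unreliable because of low spatial resolution, contextual ambiguity, and limited semantic interpretability.

Result: Experiments on subsets of the xBD dataset (such as the Moore Tornado and Hurricane Matthew events) show that the proposed framework improves the semantic interpretation of damaged buildings and gives first responders recovery recommendations based on the damage analysis.

Insight: The contribution is the integration of super-resolution, object detection, and vision-language models into one unified framework for end-to-end semantic assessment of post-disaster building damage, together with reference-free semantic alignment and a multi-model jury strategy that make safety-critical decisions more robust. Its multimodal fusion and model ensembling approach can be reused for other remote sensing image analysis tasks.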

Abstract: Rapid and accurate structural damage assessment following natural disasters is critical for effective emergency response and recovery. However, remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, reducing the reliability of traditional detection pipelines. In this work, we propose a novel hybrid framework that integrates AI-based super-resolution, deep learning object detection, and Vision-Language Models (VLMs) for comprehensive post-disaster building damage assessment. First, we enhance pre- and post-disaster satellite imagery using a Video Restoration Transformer (VRT) to upscale images from 1024x1024 to 4096x4096 resolution, improving structural detail visibility. Next, a YOLOv11-based detector localizes buildings in pre-disaster imagery, and cropped building regions are analyzed using VLMs to semantically assess structural damage across four severity levels. To ensure robust evaluation in the absence of ground-truth captions, we employ CLIPScore for reference-free semantic alignment and introduce a multi-model VLM-as-a-Jury strategy to reduce individual model bias in safety-critical decision making. Experiments on subsets of the xBD dataset, including the Moore Tornado and Hurricane Matthew events, demonstrate that the proposed framework enhances the semantic interpretation of damaged buildings. In addition, our framework provides helpful recommendations to first responders for recovery based on damage analysis.
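
The multi-model VLM-as-a-Jury step can be pictured as a vote over per-building severity labels. The following is a minimal sketch, not the authors' aggregation rule: the abstract does not say how jury outputs are combined, so plain majority voting, the tie-break toward the more severe level, and the function name `jury_verdict` are all assumptions. The four label strings follow the damage classes of the xBD dataset.

```python
from collections import Counter

# Four severity levels, ordered from least to most severe (xBD label names).
SEVERITY_LEVELS = ["no-damage", "minor-damage", "major-damage", "destroyed"]


def jury_verdict(votes):
    """Combine per-model severity labels by majority vote (assumed rule)."""
    if not votes:
        raise ValueError("at least one vote is required")
    for label in votes:
        if label not in SEVERITY_LEVELS:
            raise ValueError(f"unknown severity label: {label!r}")
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    # Assumed tie-break: prefer the more severe level among tied labels.
    return max(tied, key=SEVERITY_LEVELS.index)


# Three hypothetical jurors vote on one cropped building region.
print(jury_verdict(["minor-damage", "major-damage", "major-damage"]))
```

Breaking ties toward the more severe class is a conservative choice for a safety-critical setting; a real system might instead weight jurors or abstain on ties.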

[51] Typography-Based Monocular Distance Estimation Framework for Vehicle Safety Systems cs.CVPDF

Manognya Lokesh Reddy, Zheng Liu

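TL;DR: This paper proposes a monocular distance estimation framework based on license plate typography. It detects the height of license plate characters and applies the pinhole camera model to estimate metric distance, giving accurate inter-vehicle distance measurement at low cost.

Details

Motivation: Low-cost monocular vision suffers from scale ambiguity and sensitivity to environmental disturbances when estimating vehicle distance. The goal is to address these problems and replace expensive LiDAR and radar sensors.

Result: In experiments with a calibrated monocular camera in a controlled indoor setup, the coefficient of variation of character height across consecutive frames was 2.3% and the mean absolute error was 7.7%. Compared with a plate-width-based method, the character-based method reduced the standard deviation of the estimates by 35%.

Insight: The standardized typography of license plates serves as a passive fiducial marker for metric estimation. Interactive calibration, adaptive detection, multi-method character segmentation, camera pose compensation, deep-learning fusion, Kalman filtering, and multi-feature fusion make the system more robust and keep it real-time.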

Abstract: Accurate inter-vehicle distance estimation is a cornerstone of advanced driver assistance systems and autonomous driving. While LiDAR and radar provide high precision, their cost prohibits widespread adoption in mass-market vehicles. Monocular vision offers a low-cost alternative but suffers from scale ambiguity and sensitivity to environmental disturbances. This paper introduces a typography-based monocular distance estimation framework, which exploits the standardized typography of license plates as passive fiducial markers for metric distance estimation. The core geometric module uses robust plate detection and character segmentation to measure character height and computes distance via the pinhole camera model. The system incorporates interactive calibration, adaptive detection with strict and permissive modes, and multi-method character segmentation leveraging both adaptive and global thresholding. To enhance robustness, the framework further includes camera pose compensation using lane-based horizon estimation, hybrid deep-learning fusion, temporal Kalman filtering for velocity estimation, and multi-feature fusion that exploits additional typographic cues such as stroke width, character spacing, and plate border thickness. Experimental validation with a calibrated monocular camera in a controlled indoor setup achieved a coefficient of variation of 2.3% in character height across consecutive frames and a mean absolute error of 7.7%. The framework operates without GPU acceleration, demonstrating real-time feasibility. A comprehensive comparison with a plate-width based method shows that character-based ranging reduces the standard deviation of estimates by 35%, translating to smoother, more consistent distance readings in practice, where erratic estimates could trigger unnecessary braking or acceleration.
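
The core geometric step, ranging from character height with the pinhole camera model, reduces to similar triangles: distance = focal length (in pixels) x real character height / character height in the image (in pixels). The sketch below illustrates that relation only. The function name, the 70 mm character height, and the 1000 px focal length are illustrative assumptions, not values from the paper.

```python
def distance_from_char_height(focal_length_px, real_char_height_m, pixel_char_height):
    """Pinhole-model range estimate: Z = f * H / h.

    focal_length_px     -- camera focal length in pixels (from calibration)
    real_char_height_m  -- physical character height in metres (assumed known)
    pixel_char_height   -- measured character height in the image, in pixels
    """
    if pixel_char_height <= 0:
        raise ValueError("pixel_char_height must be positive")
    return focal_length_px * real_char_height_m / pixel_char_height


# Hypothetical numbers: f = 1000 px, characters 0.07 m tall, seen as 14 px.
z = distance_from_char_height(1000.0, 0.07, 14.0)
print(f"estimated distance: {z:.2f} m")  # about 5 m
```

Because distance is inversely proportional to pixel height, a one-pixel measurement error matters more at long range, which is one reason frame-to-frame stability of the character height measurement is worth reporting.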

[52] Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models cs.CVPDF

Wenyue Chen, Wenjue Chen, Peng Li, Qinghe Wang, Xu Jia

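TL;DR: This paper proposes Know3D, a framework that injects the rich knowledge of multimodal large language models (MLLMs) into the 3D generation process. It addresses the problem that existing methods, when generating 3D assets from a single view, produce random and uncontrollable back-side regions because of view ambiguity and weak structural priors. A VLM-diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model, enabling language-controllable back-view generation.

Details

Motivation: Because single-view observations are inherently ambiguous and limited 3D training data yields weak global structural priors, the unseen regions (such as the back side) generated by existing 3D models are often random and hard to control, and sometimes fail to match user intent or produce implausible geometry.

Result: With the proposed Know3D framework, the traditionally stochastic back-view hallucination becomes a semantically controllable process, which points to a promising direction for future 3D generation models. The abstract reports no quantitative benchmark results or SOTA comparisons.

Insight: The core contribution is bridging the semantic understanding of MLLMs and the 3D generation process through latent hidden-state injection, so that language instructions control the geometric reconstruction of unseen regions. This offers a knowledge-guided paradigm for resolving ambiguity and improving controllability in 3D generation.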

Abstract: Recent advances in 3D generation have improved the fidelity and geometric details of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing models are often stochastic and difficult to control, which may sometimes fail to align with user intentions or produce implausible geometries. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into 3D generative processes via latent hidden-state injection, enabling language-controllable generation of the back-view for 3D assets. We utilize a VLM-diffusion-based model, where the VLM is responsible for semantic understanding and guidance. The diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model. In this way, we successfully bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process, demonstrating a promising direction for future 3D generation models.

[53] PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding cs.CV | cs.AI | cs.ROPDF

Lirong Che, Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang

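TL;DR: PhotoAgent is a robotic photographer with spatial and aesthetic understanding. It combines the reasoning of large multimodal models (LMMs) with a novel control paradigm to turn high-level language commands into geometric control. The agent first uses LMM-driven chain-of-thought (CoT) reasoning to translate subjective aesthetic goals into solvable geometric constraints, from which an analytical solver computes a high-quality initial viewpoint. It then refines this initial pose iteratively through visual reflection inside a photorealistic internal world model built with 3D Gaussian Splatting (3DGS), replacing physical trial-and-error with "mental simulation" and converging quickly to aesthetically superior results.

Details

Motivation: In creative tasks such as photography, embodied agents face a semantic gap between high-level language commands and geometric control.

Result: Evaluations confirm that PhotoAgent performs strongly in spatial reasoning and achieves better final image quality.

Insight: The contribution is combining LMM chain-of-thought reasoning, which turns subjective aesthetics into geometric constraints, with "mental simulation" in a 3DGS-based internal world model to refine the viewpoint iteratively. This avoids time-consuming physical trial-and-error and gives efficient aesthetic control.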

Abstract: Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.

[54] Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding cs.CV | cs.AIPDF

Mincheol Kwon, Minseung Lee, Seonga Choi, Miso Choi, Kyeong-Jin Oh

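TL;DR: This paper proposes PinPoint, a two-stage framework that identifies instruction-relevant image regions and then extracts fine-grained visual features from them. It addresses the high computational cost that large vision-language models incur on information-dense images, and achieves higher accuracy and efficiency on several VQA benchmarks.

Details

Motivation: When large vision-language models process visually complex, information-rich images (such as infographics or document layouts), they must generate a large number of visual tokens, which causes significant computational overhead. A method that identifies instruction-relevant regions is therefore needed to improve efficiency and reasoning.

Result: On challenging VQA benchmarks including InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA, PinPoint achieves higher accuracy than existing methods and lowers computational overhead by reducing irrelevant visual tokens.

Insight: The contributions are an Instruction-Region Alignment mechanism for localizing relevant regions and new annotations that provide richer ground-truth supervision for instruction-relevant regions, which improve computational efficiency while preserving model performance.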

Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.

[55] TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment cs.CV | cs.AIPDF

Chunxia Qin, Chenyu Liu, Pengcheng Xia, Jun Du, Baocai Yin

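TL;DR: TDATR is an improved end-to-end table recognition method that uses table detail-aware learning and cell-level visual alignment to address the weak performance of existing methods in data-constrained scenarios. It adopts a "perceive-then-fuse" strategy that jointly perceives table structure and content, integrates a structure-guided cell localization module, and generates structured HTML output.

Details

Motivation: Existing modular table recognition pipelines model structure and content separately, which leads to suboptimal integration and complex workflows. End-to-end methods depend heavily on large-scale data and perform poorly in data-constrained scenarios. This paper aims to solve these problems and achieve more efficient and robust end-to-end table recognition.

Result: The method achieves state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.

Insight: The contributions are detail-aware learning that jointly perceives table structure and content through multiple tasks designed under a language modeling paradigm, a "perceive-then-fuse" strategy that uses limited data efficiently, and a structure-guided cell localization module that strengthens vision-language alignment and model interpretability. These designs improve robustness and accuracy when data is scarce.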

Abstract: Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a "perceive-then-fuse" strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cells and strengthens vision-language alignment, enhancing the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.

Zhiceng Shi, Changmiao Wang, Jun Wan, Wenwen Min

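TL;DR: This paper proposes SpaHGC, a multi-modal heterogeneous graph model for predicting spatial gene expression from pathology images. The model captures intra-slice and inter-slice spot-spot relationships by integrating local spatial context within the target slide with cross-slide similarities computed from image embeddings extracted by a pathology foundation model. It further uses masked graph contrastive learning to enhance feature representation and transfers spatial gene expression knowledge from reference slides to target slides, which significantly improves prediction accuracy.

Details

Motivation: Spatial transcriptomics (ST) experiments are expensive, which limits large-scale use. Predicting ST from pathology images is a promising low-cost alternative, but existing methods struggle to capture complex cross-slide spatial relationships.

Result: Comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes shows that SpaHGC significantly outperforms nine existing state-of-the-art methods on all evaluation metrics. Its predictions are also significantly enriched in multiple cancer-related pathways.

Insight: The contribution is a framework that combines a multi-modal heterogeneous graph with masked graph contrastive learning and uses image embeddings from a pathology foundation model for cross-slide knowledge transfer, which models complex spatial dependencies effectively. This suggests a way to infer costly spatial gene expression information from low-cost image data with computational models.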

Abstract: While spatial transcriptomics (ST) has advanced our understanding of gene expression in tissue context, its high experimental cost limits its large-scale application. Predicting ST from pathology images is a promising, cost-effective alternative, but existing methods struggle to capture complex cross-slide spatial relationships. To address the challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. It integrates local spatial context within the target slide and cross-slide similarities computed from image embeddings extracted by a pathology foundation model. These embeddings enable inter-slice knowledge transfer, and SpaHGC further incorporates Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides, enabling it to model complex spatial dependencies and significantly improve prediction accuracy. We conducted comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes. The results demonstrate that SpaHGC significantly outperforms the existing nine state-of-the-art methods across all evaluation metrics. Additionally, the predictions are significantly enriched in multiple cancer-related pathways, thereby highlighting its strong biological relevance and application potential.

[57] MVRD-Bench: Multi-View Learning and Benchmarking for Dynamic Remote Photoplethysmography under Occlusion cs.CVPDF

Zuxian He, Xu Cheng, Zhaodong Sun, Haoyu Chen, Jingang Shi

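TL;DR: This paper proposes MVRD-rPPG, a multi-view learning framework for remote photoplethysmography (rPPG) measurement under dynamic occlusion, together with its companion benchmark dataset MVRD. The framework fuses complementary visual cues from multiple views and combines adaptive temporal optical compensation, a rhythm-visual dual-stream network, and multi-view correlation-aware attention to suppress motion artifacts and make the signal more robust. Experiments show strong performance in motion scenarios.

Details

Motivation: Existing rPPG methods rely on static, single-view facial videos, and their performance drops markedly under facial motion and occlusion. This paper targets motion-induced occlusion in unconstrained multi-view facial videos to better match real-world conditions.

Result: In the movement scenario of the proposed MVRD dataset, MVRD-rPPG achieves a mean absolute error (MAE) of 0.90 and a Pearson correlation coefficient (R) of 0.99. Extensive experiments and ablation studies demonstrate its superiority.

Insight: The contributions are: 1) MVRD, a high-quality multi-view rPPG benchmark dataset; 2) a unified multi-view learning framework that integrates the ATOC module, a dual-stream network, and the MVCA attention mechanism; 3) a Correlation Frequency Adversarial learning strategy that jointly optimizes temporal accuracy, spectral consistency, and perceptual realism. Viewed objectively, the design for multi-view fusion and motion artifact suppression is a useful reference for making rPPG more robust under dynamic occlusion.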

Abstract: Remote photoplethysmography (rPPG) is a non-contact technique that estimates physiological signals by analyzing subtle skin color changes in facial videos. Existing rPPG methods often encounter performance degradation under facial motion and occlusion scenarios due to their reliance on static and single-view facial videos. Thus, this work focuses on tackling the motion-induced occlusion problem for rPPG measurement in unconstrained multi-view facial videos. Specifically, we introduce a Multi-View rPPG Dataset (MVRD), a high-quality benchmark dataset featuring synchronized facial videos from three viewpoints under stationary, speaking, and head movement scenarios to better match real-world conditions. We also propose MVRD-rPPG, a unified multi-view rPPG learning framework that fuses complementary visual cues to maintain robust facial skin coverage, especially under motion conditions. Our method integrates an Adaptive Temporal Optical Compensation (ATOC) module for motion artifact suppression, a Rhythm-Visual Dual-Stream Network to disentangle rhythmic and appearance-related features, and a Multi-View Correlation-Aware Attention (MVCA) for adaptive view-wise signal aggregation. Furthermore, we introduce a Correlation Frequency Adversarial (CFA) learning strategy, which jointly enforces temporal accuracy, spectral consistency, and perceptual realism in the predicted signals. Extensive experiments and ablation studies on the MVRD dataset demonstrate the superiority of our approach. In the MVRD movement scenario, MVRD-rPPG achieves an MAE of 0.90 and a Pearson correlation coefficient (R) of 0.99. The source code and dataset will be made available.
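
For readers unfamiliar with the two reported metrics, the sketch below computes a mean absolute error and a Pearson correlation coefficient between predicted and reference values. It is a generic illustration using the standard definitions; the function names and the sample values are made up and are not data from the MVRD dataset.

```python
import math


def mean_absolute_error(pred, ref):
    """Average absolute difference between paired predictions and references."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def pearson_r(pred, ref):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(pred) != len(ref) or len(pred) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_p = sum(pred) / len(pred)
    mean_r = sum(ref) / len(ref)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


# Made-up per-clip estimates and reference values, for illustration only.
predicted = [60.0, 72.0, 81.0]
reference = [61.0, 70.0, 80.0]
print(mean_absolute_error(predicted, reference), pearson_r(predicted, reference))
```

MAE captures how far individual estimates are from the reference, while the Pearson coefficient captures whether estimates track the reference across samples, so the two are usually reported together.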

[58] Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought cs.CVPDF

Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu

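TL;DR: This paper proposes Perception-Exploration Policy Optimization (PEPO), a method for optimizing multimodal chain-of-thought (CoT) reasoning. By analyzing the token-level dynamics of reasoning trajectories, it finds that successful reasoning combines perceptual grounding with exploratory inference. PEPO derives a perception prior from hidden state similarity and combines it with token entropy through a smooth gating mechanism to compute token-level advantages, so it integrates seamlessly into existing RLVR frameworks without additional supervision or auxiliary branches.

Details

Motivation: Existing reinforcement learning methods with verifiable rewards usually optimize multimodal CoT reasoning at a coarse granularity and do not distinguish the varying degrees of visual grounding across reasoning steps. This paper aims to improve the optimization of multimodal reasoning policies through finer token-level analysis.

Result: Extensive experiments on several multimodal benchmarks (covering geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification) show consistent and robust gains over strong RL baselines while maintaining stable training dynamics.

Insight: The contribution is a token-level analysis of multimodal reasoning trajectories that reveals the structured dynamics of perceptual grounding and exploratory inference, and a token-level advantage computation that combines a perception prior with token entropy. The method integrates seamlessly and needs no additional supervision, offering a route to fine-grained optimization of multimodal reasoning policies.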

Abstract: Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
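
Token entropy, one of the two signals the abstract says PEPO combines, is the Shannon entropy of the model's next-token distribution at a given position. The sketch below computes only that quantity from an explicit probability vector. It does not reproduce PEPO's perception prior or gating mechanism, whose exact forms the abstract does not give; the function name is illustrative.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing, by the convention 0 * log 0 = 0.
    return -sum(p * math.log(p) for p in probs if p > 0)


# A confident token versus an uncertain one over a toy 4-word vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(token_entropy(confident), token_entropy(uncertain))
```

High-entropy positions are those where the policy is undecided among continuations, which is why entropy is a natural proxy for exploratory steps in a reasoning trajectory.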

Chengxin Lv, Yihui Li, Hongyu Yang, YunHong Wang

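TL;DR: This paper presents Gau-Occ, a multi-modal framework for 3D semantic occupancy prediction in autonomous driving. It models the scene as a compact set of semantic 3D Gaussians and so bypasses computationally intensive dense voxel or BEV tensor processing. Its core contributions are a LiDAR Completion Diffuser (LCD), which recovers missing structures from sparse LiDAR to initialize robust Gaussian anchors, and a Gaussian Anchor Fusion (GAF) module, which efficiently fuses multi-view image semantics through geometry-aligned 2D sampling and cross-modal alignment.

Details

Motivation: Existing multi-modal fusion methods for 3D semantic occupancy prediction usually rely on computationally expensive dense voxel or BEV tensors, which makes them inefficient. The aim is to balance high accuracy with high efficiency.

Result: Extensive experiments on several challenging benchmarks show that Gau-Occ achieves state-of-the-art (SOTA) performance while significantly improving computational efficiency.

Insight: The main contribution is replacing dense voxel representations with a compact set of 3D Gaussians, combined with purpose-built LiDAR completion and geometry-aligned cross-modal fusion. This provides a new, more efficient representation and fusion paradigm for 3D scene understanding that balances geometric completeness and semantic discriminability.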

Abstract: 3D semantic occupancy prediction is crucial for autonomous driving. While multi-modal fusion improves accuracy over vision-only methods, it typically relies on computationally expensive dense voxel or BEV tensors. We present Gau-Occ, a multi-modal framework that bypasses dense volumetric processing by modeling the scene as a compact collection of semantic 3D Gaussians. To ensure geometric completeness, we propose a LiDAR Completion Diffuser (LCD) that recovers missing structures from sparse LiDAR to initialize robust Gaussian anchors. Furthermore, we introduce Gaussian Anchor Fusion (GAF), which efficiently integrates multi-view image semantics via geometry-aligned 2D sampling and cross-modal alignment. By refining these compact Gaussian descriptors, Gau-Occ captures both spatial consistency and semantic discriminability. Extensive experiments across challenging benchmarks demonstrate that Gau-Occ achieves state-of-the-art performance with significant computational efficiency.

Hyojin Park, Yi Li, Janghoon Cho, Sungha Choi, Jungsoo Lee

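TL;DR: This paper presents ForeSea, an AI forensic search system for video surveillance, and its companion benchmark dataset ForeSeaQA. The system handles multimodal queries made of an image and text through a three-stage, plug-and-play pipeline (target tracking, multimodal embedding indexing, and VideoLLM reasoning). It targets the problems existing methods have when retrieving specific targets from long, multi-camera video: manual filtering, shallow attribute capture, and failed temporal reasoning.

Details

Motivation: Existing surveillance video search methods (such as tracking pipelines, CLIP-based models, and VideoRAG) require heavy manual filtering, capture only shallow attributes, and lack temporal reasoning. Real-world searches are inherently multimodal (for example, queries that combine an image with text), yet this setting is underexplored and lacks a suitable benchmark for evaluation.

Result: On the newly proposed ForeSeaQA benchmark, ForeSea improves accuracy by 3.5% and temporal IoU (intersection over union) by 11.0 over prior VideoRAG models. ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system to perform well in this setting.

Insight: The core contributions are: 1) ForeSeaQA, the first video question-answering benchmark built specifically for image-and-text multimodal queries, which fills an evaluation gap; 2) ForeSea, a modular three-stage AI forensic search system that combines target tracking, multimodal embedding, and VideoLLM reasoning to retrieve and localize complex events in long surveillance video end to end. This gives the video understanding field, and surveillance and forensic applications in particular, a new evaluation framework and system architecture.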

Abstract: Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods (tracking pipelines, CLIP-based models, and VideoRAG) require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., “When does this person join the fight?” with the person’s image), yet this setting remains underexplored. Moreover, no suitable benchmark exists for evaluating this setting, that is, video question answering with multimodal queries. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning in realistic forensic conditions. Beyond the benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline: (1) a tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models. To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system built to excel in this setting.
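
Temporal IoU, the localization metric reported above, measures the overlap between a predicted time interval and the annotated one. The sketch below uses the standard interval definition (intersection length divided by union length). The function name and the example timestamps are illustrative, and the benchmark's exact evaluation protocol may differ.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time intervals given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    if pred_end < pred_start or gt_end < gt_start:
        raise ValueError("each interval must satisfy start <= end")
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical prediction of 10-20 s against an annotated event at 15-25 s.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))  # 5 s overlap over a 15 s union
```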

[61] Group Editing : Edit Multiple Images in One Go cs.CVPDF

Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng

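TL;DR: This paper proposes GroupEditing, a framework for applying consistent, unified edits to a group of related images. It extracts explicit geometric correspondences with VGGT and reformulates the image group as a pseudo-video to exploit the implicit temporal coherence priors of pre-trained video models, then combines the two kinds of correspondence through a fusion mechanism. It also builds the GroupEditData dataset and the GroupEditBench benchmark, and introduces an alignment-enhanced RoPE module to preserve identity consistency. Experiments show that the method significantly outperforms existing methods in visual quality, cross-view consistency, and semantic alignment.

Details

Motivation: Editing a group of related images consistently is hard when they differ widely in pose, viewpoint, and spatial layout. The key is to establish reliable cross-image correspondences so that semantically aligned regions can be modified precisely.

Result: Extensive experiments on the proposed GroupEditBench benchmark show that GroupEditing significantly outperforms existing methods in visual quality, cross-view consistency, and semantic alignment.

Insight: The contributions are: 1) dual relationship modeling that combines explicit geometric correspondence (VGGT) with implicit temporal coherence priors (pseudo-video reformulation); 2) the large-scale training dataset GroupEditData and the dedicated evaluation benchmark GroupEditBench; 3) an alignment-enhanced RoPE module that improves identity preservation across multiple images. Viewed objectively, treating an image group as a pseudo-video to exploit the priors of video models is a novel cross-modal transfer idea.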

Abstract: In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model’s ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing. Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.

[62] SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes cs.CVPDF

Zhicheng Qiu, Jiarui Meng, Tong-an Luo, Yican Huang, Xuan Feng

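TL;DR: SLARM is a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. It captures complex, non-uniform motion through higher-order motion modeling and is trained with differentiable rendering alone, without optical flow supervision. It distills semantic features from LSeg to obtain language-aligned representations that support natural-language semantic queries, and the tight coupling of semantics and geometry makes dynamic reconstruction more accurate and robust. The model processes image sequences with window-based causal attention, giving stable, low-latency streaming inference without accumulating memory cost.

Details

Motivation: Unifying reconstruction, semantic understanding, and real-time streaming inference in dynamic scenes is challenging. The aim is to strengthen semantic querying through language-aligned representations and to make dynamic reconstruction more accurate and robust.

Result: SLARM reaches SOTA in dynamic estimation, rendering quality, and scene parsing. Compared with existing methods, it improves motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20%.

Insight: The contributions are higher-order motion modeling without optical flow supervision, language-aligned semantic features distilled from LSeg for natural-language querying, and window-based causal attention for efficient streaming inference. Viewed objectively, the unified framework is a useful reference for coupling semantics and geometry.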

Abstract: We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
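
Window-based causal attention is what keeps streaming cost bounded: each frame attends only to itself and a fixed number of preceding frames, so memory does not grow with sequence length. The sketch below builds such a mask at frame granularity. It is a generic illustration; the function name and the window size are assumptions, and SLARM's actual attention layout (for example, token-level masking within frames) is not specified in the abstract.

```python
def windowed_causal_mask(num_frames, window):
    """Boolean mask where entry [i][j] is True if frame i may attend to frame j.

    Frame i sees frames j with i - window < j <= i: never the future,
    and at most `window` frames including itself.
    """
    if num_frames < 1 or window < 1:
        raise ValueError("num_frames and window must be positive")
    return [
        [(i - window < j <= i) for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Four frames with a window of two: each row has at most two True entries.
for row in windowed_causal_mask(4, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, cached states older than `window` frames can be discarded as new frames arrive, which is the property that makes bounded-memory streaming possible.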

[63] Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation cs.CV | cs.LGPDF

Zhe Zhang, Jing Li, Wanli Xue, Xu Cheng, Jianhua Zhang

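TL;DR: This paper proposes DDSR, a dual-teacher distillation model with subnetwork rectification for black-box domain adaptation. It jointly exploits the specific knowledge of the black-box source model and the general semantic information of a vision-language model, adaptively integrates their complementary predictions to generate reliable pseudo labels, and introduces a subnetwork regularization strategy to mitigate overfitting caused by noisy supervision. Experiments show that it outperforms existing methods on several benchmark datasets.

Details

Motivation: In black-box domain adaptation, neither the source data nor the source model is accessible, and the black-box source model can only be queried for predictions on target samples. As a result, existing methods are often limited by noisy supervision or by insufficient use of the semantic priors of vision-language models.

Result: Extensive experiments on several benchmark datasets validate the method, which consistently outperforms existing state-of-the-art methods, including those that use source data or models.

Insight: The contribution is a dual-teacher distillation framework that adaptively fuses the complementary predictions of the black-box source model and a vision-language model to generate reliable pseudo labels, with subnetwork regularization to reduce the effect of noise. Iteratively refining the pseudo labels and the vision-language model prompts yields more accurate and semantically consistent adaptation.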

Abstract: Assuming that neither source data nor the source model is accessible, black box domain adaptation represents a highly practical yet extremely challenging setting, as transferable information is restricted to the predictions of the black box source model, which can only be queried using target samples. Existing approaches attempt to extract transferable knowledge through pseudo label refinement or by leveraging external vision language models (ViLs), but they often suffer from noisy supervision or insufficient utilization of the semantic priors provided by ViLs, which ultimately hinder adaptation performance. To overcome these limitations, we propose a dual teacher distillation with subnetwork rectification (DDSR) model that jointly exploits the specific knowledge embedded in black box source models and the general semantic information of a ViL. DDSR adaptively integrates their complementary predictions to generate reliable pseudo labels for the target domain and introduces a subnetwork driven regularization strategy to mitigate overfitting caused by noisy supervision. Furthermore, the refined target predictions iteratively enhance both the pseudo labels and ViL prompts, enabling more accurate and semantically consistent adaptation. Finally, the target model is further optimized through self training with classwise prototypes. Extensive experiments on multiple benchmark datasets validate the effectiveness of our approach, demonstrating consistent improvements over state of the art methods, including those using source data or models.

[64] ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling cs.CV | cs.AI

Shaobo Ju, Baiyang Song, Tao Chen, Jiapeng Zhang, Qiong Wu

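TL;DR: ForestPrune is a training-free visual token compression method for video multimodal large language models (MLLMs) that achieves high-ratio token pruning through spatial-temporal forest modeling. It builds token forests across video frames under semantic, spatial, and temporal constraints to form an overall understanding of the video, then evaluates the importance of token trees and nodes (based on tree depth and node roles) to reach a globally optimal pruning decision. Experiments show that on models such as LLaVA-Video and LLaVA-OneVision it substantially reduces computation and memory overhead while maintaining high performance.

Details

Motivation: Existing token compression methods struggle to reach high compression ratios on video MLLMs, mainly because they do not adequately model the temporal and continuous content of video. ForestPrune is proposed to address this.

Result: Across multiple video benchmarks, ForestPrune applied to LLaVA-OneVision retains 95.8% average accuracy while reducing tokens by 90%. On LLaVA-Video, compared with FrameFusion, it improves accuracy on MLVU by 10.1% and cuts pruning time by 81.4%, showing better performance and efficiency than existing token compression methods.

Insight: The novelty is high-ratio token pruning via spatial-temporal forest modeling, without additional training, combining semantic, spatial, and temporal constraints for global optimization. Viewed objectively, the method balances compression ratio and performance in video token compression and offers a new direction for improving the efficiency of video MLLMs.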

Abstract: Because it greatly reduces computation and memory overhead, token compression has become a research hotspot for MLLMs and has achieved remarkable progress in image-language tasks. However, for video, existing methods still fall short of high-ratio token compression. We attribute this shortcoming to the insufficient modeling of temporal and continual video content, and propose a novel, training-free token pruning method for video MLLMs, termed ForestPrune, which achieves effective and high-ratio pruning via Spatial-Temporal Forest Modeling. In practice, ForestPrune constructs token forests across video frames based on semantic, spatial, and temporal constraints, forming an overall comprehension of videos. Afterwards, ForestPrune evaluates the importance of token trees and nodes based on tree depth and node roles, thereby obtaining a globally optimal pruning decision. To validate ForestPrune, we apply it to two representative video MLLMs, namely LLaVA-Video and LLaVA-OneVision, and conduct extensive experiments on a range of video benchmarks. The experimental results not only show its effectiveness for video MLLMs, e.g., retaining 95.8% average accuracy while reducing 90% of tokens for LLaVA-OneVision, but also show superior performance and efficiency over the compared token compression methods, e.g., +10.1% accuracy on MLVU and -81.4% pruning time relative to FrameFusion on LLaVA-Video.
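To make the idea of grouping tokens into trees across frames concrete, here is a much-simplified sketch. It is not ForestPrune's construction or importance scoring: the same-position linking rule, the cosine threshold `tau`, and the "keep only roots" policy are all assumptions chosen to keep the example short.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def build_forest(frames, tau=0.9):
    """Link each token to the same-position token in the previous frame
    when they are similar enough; otherwise the token starts a new tree.

    frames: list of frames, each a list of token vectors (same count per frame).
    Returns a dict mapping (frame, position) -> parent node, or None for roots.
    """
    parent = {}
    for t, frame in enumerate(frames):
        for i, tok in enumerate(frame):
            if t > 0 and cosine(tok, frames[t - 1][i]) >= tau:
                parent[(t, i)] = (t - 1, i)
            else:
                parent[(t, i)] = None
    return parent

def keep_roots(parent):
    """Toy pruning policy: keep only tree roots, drop temporally redundant children."""
    return sorted(node for node, p in parent.items() if p is None)
```

In this toy policy a static region contributes one token for the whole clip, while a region that changes starts a new tree and is kept again. The method described in the abstract scores trees and nodes by depth and role instead of keeping roots unconditionally.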

[65] When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse cs.CV

Yihuan Huang, Jun Xue, Liu Jiajun, Daixian Li, Tong Zhang

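TL;DR: This paper presents the first systematic evaluation of audio-visual speech recognition (AVSR) models on mainstream video conferencing platforms and finds that transmission distortions and spontaneous human hyper-expression cause severe performance degradation. To address this, the authors build MLD-VC, the first multimodal dataset designed for video conferencing, comprising 31 speakers and 22.79 hours of audio-visual data, and use the Lombard effect to enhance human hyper-expression. The analysis shows that speech enhancement algorithms are the primary source of distribution shift, and that the shift induced by the Lombard effect resembles the one introduced by speech enhancement, which explains why models trained on Lombard data are more robust in video conferencing. Fine-tuning AVSR models on MLD-VC mitigates the problem, reducing character error rate (CER) by 17.5% on average across several video conferencing platforms.

Details

Motivation: AVSR performs well in offline conditions but degrades in real video conferencing because of transmission distortions and human hyper-expression. This work fills that research gap.

Result: Fine-tuning AVSR models on MLD-VC yields an average 17.5% reduction in CER across several video conferencing platforms, improving robustness in real video conferencing environments.

Insight: The contributions include building MLD-VC, the first multimodal dataset dedicated to video conferencing, and revealing that the shift in the first and second audio formants caused by speech enhancement algorithms is the key mechanism behind the performance drop, while the similar shift produced by the Lombard effect improves model robustness. This offers a new perspective for developing more general AVSR systems.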

Abstract: Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.
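For readers unfamiliar with the metric, character error rate is the character-level edit distance between hypothesis and reference divided by the reference length. The sketch below computes it and a relative reduction. The abstract does not say whether its 17.5% figure is relative or absolute, and the numbers in the usage note are hypothetical, not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.175 means 17.5%)."""
    return (before - after) / before
```

As a hypothetical example, a CER that falls from 0.40 to 0.33 after fine-tuning is a 17.5% relative reduction, whereas a 17.5-point absolute reduction would mean going from 0.40 to 0.225.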

[66] EVA: Efficient Reinforcement Learning for End-to-End Video Agent cs.CV | cs.AI | cs.CL

Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng

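TL;DR: This paper proposes EVA, an end-to-end video agent framework based on efficient reinforcement learning, to address the long sequences, complex dependencies, and redundant frames that multimodal large language models face when processing long videos. Through iterative summary-plan-action-reflection reasoning, EVA realizes "planning-before-perception" autonomous video understanding and adaptively decides what to watch, when to watch, and how to watch.

Details

Motivation: Existing methods typically treat multimodal large language models as passive recognizers that process entire videos or uniformly sampled frames without adaptive reasoning. Agent-based methods introduce external tools but still depend on manually designed workflows and a "perception-first" strategy, which makes long-video processing inefficient.

Result: On six video understanding benchmarks, EVA improves by 6-12% over general multimodal large language model baselines and by a further 1-3% over prior adaptive agent methods, demonstrating comprehensive video understanding capabilities.

Insight: The novelty is a "planning-before-perception" iterative reasoning framework (summary-plan-action-reflection) that lets the agent perform query-driven, efficient video understanding. The authors also design a three-stage learning pipeline comprising supervised fine-tuning, Kahneman-Tversky Optimization, and Generalized Reward Policy Optimization, which, combined with high-quality datasets, enables stable and reproducible training.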

Abstract: Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.

[67] FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification cs.CV | cs.LG

Daniel Beckmann, Benjamin Risse

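TL;DR: FixationFormer is a transformer-based architecture that integrates expert gaze trajectories directly into chest X-ray classification as sequences of tokens. By jointly modeling gaze sequences and image features, it addresses the sparsity and variability of gaze data and achieves state-of-the-art classification performance on three public benchmark datasets.

Details

Motivation: Expert gaze trajectories are a rich, passive source of domain knowledge in radiology, but conventional CNNs have difficulty integrating this data directly because it is sequential, temporally dense yet spatially sparse, noisy, and variable across experts. Transformers, with their sequential nature and attention mechanism, are a natural fit for gaze trajectories.

Result: Evaluated on three public chest X-ray benchmark datasets, the method achieves state-of-the-art classification performance.

Insight: The novelty is representing expert gaze trajectories as token sequences and using explicit cross-attention between image and gaze token sequences to integrate expert diagnostic cues more directly and at a finer granularity. This suggests a new way to use sequential gaze data in transformer-based medical image analysis.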

Abstract: Expert eye movements provide a rich, passive source of domain knowledge in radiology, offering a powerful cue for integrating diagnostic reasoning into computer-aided analysis. However, direct integration into CNN-based systems, which historically have dominated the medical image analysis domain, is challenging: gaze recordings are sequential, temporally dense yet spatially sparse, noisy, and variable across experts. As a consequence, most existing image-based models utilize reduced representations such as heatmaps. In contrast, gaze naturally aligns with transformer architectures, as both are sequential in nature and rely on attention to highlight relevant input regions. In this work, we introduce FixationFormer, a transformer-based architecture that represents expert gaze trajectories as sequences of tokens, thereby preserving their temporal and spatial structure. By modeling gaze sequences jointly with image features, our approach addresses sparsity and variability in gaze data while enabling a more direct and fine-grained integration of expert diagnostic cues through explicit cross-attention between the image and gaze token sequences. We evaluate our method on three publicly available benchmark chest X-ray datasets and demonstrate that it achieves state-of-the-art classification performance, highlighting the value of representing gaze as a sequence in transformer-based medical image analysis.

[68] YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception cs.CV | cs.AI | cs.CL | cs.LG | cs.RO

Marios Impraimakis, Daniel Vazquez, Feiyu Zhou

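TL;DR: This paper proposes an interpretable object detection framework based on Kolmogorov-Arnold networks (KANs) and vision-language foundation models to improve the trustworthiness of YOLOv10 in computer vision perception. The framework uses a KAN as an interpretable post-hoc surrogate that models the trustworthiness of YOLOv10 detections from seven geometric and semantic features, and uses the BLIP foundation model to generate scene descriptions, yielding a transparent and trustworthy multimodal AI perception system.

Details

Motivation: In computer vision systems such as autonomous driving, object detectors like YOLOv10 offer limited transparency about how reliable their confidence scores are in visually degraded or ambiguous scenes. The work aims to provide interpretable and trustworthy detections.

Result: Experiments on the COCO dataset and on images of the University of Bath campus show that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture, providing actionable insights for filtering, review, or downstream risk mitigation.

Insight: The novelty is using an interpretable Kolmogorov-Arnold network as a post-hoc surrogate to quantify detection trustworthiness, whose additive spline-based structure allows direct visualization of each feature's influence. Integrating the BLIP vision-language foundation model to generate scene descriptions adds a lightweight multimodal interface that does not affect the interpretability layer, improving transparency and practicality.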

Abstract: The interpretable object detection capabilities of a novel Kolmogorov-Arnold network framework are examined here. The approach addresses a key limitation in computer vision for autonomous vehicle perception and beyond: these systems offer limited transparency regarding the reliability of their confidence scores in visually degraded or ambiguous scenes. To address this limitation, a Kolmogorov-Arnold network is employed as an interpretable post-hoc surrogate to model the trustworthiness of You Only Look Once (YOLOv10) detections using seven geometric and semantic features. The additive spline-based structure of the Kolmogorov-Arnold network enables direct visualisation of each feature's influence. This produces smooth and transparent functional mappings that reveal when the model's confidence is well supported and when it is unreliable. Experiments on both Common Objects in Context (COCO) and images from the University of Bath campus demonstrate that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture. This provides actionable insights for filtering, review, or downstream risk mitigation. Furthermore, a bootstrapped language-image (BLIP) foundation model generates descriptive captions of each scene. This tool enables a lightweight multimodal interface without affecting the interpretability layer. The resulting system delivers interpretable object detection with trustworthy confidence estimates. It offers a transparent and practical perception component for autonomous and multimodal artificial intelligence applications.

[69] Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining cs.CV

Weijun Zhuang, Yuqing Huang, Weikang Meng, Xin Li, Ming Liu

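TL;DR: This paper proposes ClusterSTM, a cluster-wise spatio-temporal masking strategy for efficient video-language pretraining. It preserves key visual information through intra-frame clustering and temporal-density-based masking, and introduces a video-text relevance reconstruction objective to strengthen multimodal semantic alignment.

Details

Motivation: Large-scale video-language pretraining is computationally expensive, and existing masked visual modeling methods suffer from severe visual information loss at high masking ratios and from temporal information leakage caused by inter-frame correlations.

Result: Extensive experiments on multiple benchmarks (covering video-text retrieval, video question answering, and video captioning) show that ClusterSTM sets a new state of the art among efficient video-language models.

Insight: The novelties include partitioning visual tokens into semantically independent groups through clustering, a temporal-density-based masking strategy that preserves holistic video content and strong temporal correlation, and a video-text relevance reconstruction objective that goes beyond conventional visual reconstruction to improve multimodal semantic alignment.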

Abstract: Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state of the art among efficient video-language models.
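The cluster-wise selection step described in the abstract (keep, within each cluster, the token with the highest temporal density) can be sketched as follows. How clusters are formed and how temporal density is computed are not given here, so both are taken as precomputed inputs, and the function name `cluster_wise_keep` is hypothetical.

```python
def cluster_wise_keep(cluster_ids, density):
    """Return sorted indices of retained tokens: one per cluster,
    the token with the highest density score in that cluster.

    cluster_ids[i] is the cluster of token i (assumed precomputed).
    density[i] is its temporal-density score (assumed precomputed).
    Ties keep the first token encountered.
    """
    if len(cluster_ids) != len(density):
        raise ValueError("cluster_ids and density must have the same length")
    best = {}
    for idx, (c, d) in enumerate(zip(cluster_ids, density)):
        if c not in best or d > density[best[c]]:
            best[c] = idx
    return sorted(best.values())

def masking_ratio(num_tokens, kept):
    """Fraction of tokens that are masked out."""
    return 1.0 - len(kept) / num_tokens
```

Because exactly one token survives per cluster, the masking ratio in this sketch is set by the number of clusters per frame rather than by a separately sampled random mask.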

[70] WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion cs.CV

Manuel-Andreas Schneider, Angela Dai

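TL;DR: WorldMesh proposes a geometry-first approach for generating navigable multi-room 3D scenes from text descriptions. It first builds a mesh scaffold that captures the environment's geometry (such as walls and floors), then uses a mesh-conditioned image diffusion model to synthesize realistic appearance and object layouts, enabling large-scale, highly consistent, and highly realistic 3D scene generation.

Details

Motivation: Existing text-to-image and text-to-video methods lack explicit geometry and therefore struggle to maintain scene-level and object-level consistency when generating large-scale 3D scenes. This work addresses the problem by introducing a geometric prior.

Result: The method generates realistic, 3D-consistent scenes of arbitrary size with rich and diverse objects, combining robust geometric consistency with photorealistic detail. The authors regard it as a significant step toward generating environment-scale, immersive 3D worlds.

Insight: The core novelty is decoupling large-scale 3D scene generation into two stages, structural composition (the mesh scaffold) and appearance synthesis (mesh-conditioned image synthesis), using the mesh as a structural backbone to guide image generation and thereby achieve scalable, consistent large-scale scene construction.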

Abstract: Recent progress in image and video synthesis has inspired their use in advancing 3D scene generation. However, we observe that text-to-image and -video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decouples this complex problem of large-scale 3D scene synthesis into its structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment’s geometry (walls, floors, etc.), and then use image synthesis, segmentation and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily-sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.

[71] VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models cs.CV

Jintao Cheng, Haozhe Wang, Weibin Li, Gang Wang, Yipu Zhang

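TL;DR: This paper proposes VLA-IAP, a training-free visual token pruning method for vision-language-action models. By introducing a geometric prior mechanism and a dynamic scheduling strategy, it prioritizes the key visual regions relevant to physical interaction, substantially improving inference speed while maintaining task success rates.

Details

Motivation: Existing visual token pruning methods rely mainly on semantic saliency or simple temporal cues and overlook continuous physical interaction, a fundamental property of VLA tasks. They may therefore prune regions that are visually sparse yet structurally critical for manipulation, causing unstable behavior in early task phases.

Result: On the LIBERO benchmark, VLA-IAP achieves a 97.8% success rate with a 1.25× speedup, and up to a 1.54× speedup while performing comparably to the unpruned backbone. The method shows superior and consistent performance across multiple model architectures, three simulation environments, and a real robot platform.

Insight: The core novelty is an explicit "Interaction-First" paradigm: a geometric prior preserves structural anchors, and pruning intensity is scheduled dynamically based on semantic-motion alignment, giving a smooth conservative-to-aggressive transition that ensures robustness during early uncertainty and efficiency once the interaction is locked. The method requires no training and shows good generalization and practical value.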

Abstract: Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or simple temporal cues, overlooking the continuous physical interaction, a fundamental property of VLA tasks. Consequently, current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our proposed training-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a 97.8% success rate with a 1.25× speedup on the LIBERO benchmark, and up to a 1.54× speedup while maintaining performance comparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures and three different simulation environments, as well as a real robot platform, validating its strong generalization capability and practical applicability. Our project website is: https://chengjt1999.github.io/VLA-IAP.github.io/.

[72] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning cs.CV | cs.CL

Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang

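TL;DR: This paper proposes TRACE, a prompting method that aims to improve the performance of multimodal large language models on 3D spatial reasoning tasks. It guides the model to generate text-based spatial representations as intermediate reasoning steps, addressing existing models' weak ability to form structured abstractions of the 3D environment in a video.

Details

Motivation: Existing MLLMs struggle with 3D spatial reasoning because they cannot construct structured abstractions of the 3D environment from video inputs. Inspired by cognitive theories of egocentric-to-allocentric spatial reasoning, the work studies how to enable MLLMs to model and reason over textual spatial representations of video.

Result: On the VSI-Bench and OST-Bench benchmarks, TRACE yields notable and consistent improvements over prior prompting strategies across MLLM backbones of different parameter scales and training schemes.

Insight: The novelty is introducing textual spatial representations as intermediate reasoning traces, encoding meta-context, camera trajectories, and detailed object entities to support structured reasoning. This offers an effective, text-guided way to strengthen the spatial cognition of MLLMs and may transfer to other multimodal tasks that require complex environment modeling.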

Abstract: Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.

[73] VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought cs.CV

Xuanyu Zhang, Weiqi Li, Qunliang Xing, Jingfen Xie, Bin Chen

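TL;DR: This paper proposes VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent that targets the heterogeneous degradations found in real-world video restoration. The agent builds VSR-Compare, a large-scale video paired enhancement dataset, and trains quality-judgment and degradation-perception models on it to achieve "sharp vision". It also adopts a hierarchical operator scheduling strategy (one-step retrieval from a retrieval-augmented generation library combined with greedy search) to achieve "fast thought", discovering optimal restoration trajectories dynamically and efficiently.

Details

Motivation: In real-world video restoration, heterogeneous degradations cause static architectures and fixed inference pipelines to generalize poorly, and existing video restoration agents are limited by insufficient quality perception and inefficient search strategies.

Result: Extensive experiments show that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.

Insight: The novelty is a retrieval-augmented agent framework that combines "sharp vision" (perception models trained on a large-scale paired dataset) with "fast thought" (a hierarchical operator scheduling strategy). Viewed objectively, two ideas are worth borrowing: building VSR-Compare, the first large-scale video paired enhancement dataset, to train the perception models, and combining retrieval-augmented generation (RAG) with hierarchical greedy search to optimize the restoration path dynamically.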

Abstract: Video restoration in real-world scenarios is challenged by heterogeneous degradations, where static architectures and fixed inference pipelines often fail to generalize. Recent agent-based approaches offer dynamic decision making, yet existing video restoration agents remain limited by insufficient quality perception and inefficient search strategies. We propose VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper vision and faster thought. VQ-Jarvis is designed to accurately perceive degradations and subtle differences among paired restoration results, while efficiently discovering optimal restoration trajectories. To enable sharp vision, we construct VSR-Compare, the first large-scale video paired enhancement dataset with 20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content domains. Based on this dataset, we train a multiple operator judge model and a degradation perception model to guide agent decisions. To achieve fast thought, we introduce a hierarchical operator scheduling strategy that adapts to video difficulty: for easy cases, optimal restoration trajectories are retrieved in a one-step manner from a retrieval-augmented generation (RAG) library; for harder cases, a step-by-step greedy search is performed to balance efficiency and accuracy. Extensive experiments demonstrate that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.
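The two-level scheduling described in the abstract (one-step retrieval for easy cases, step-by-step greedy search for harder ones) can be sketched as below. Every name here is hypothetical: `perceive`, `judge`, the dictionary-shaped `library`, and the difficulty threshold are stand-ins for the paper's degradation perception model, operator judge model, and RAG library, whose real interfaces are not given in this digest.

```python
def plan_restoration(video, perceive, library, operators, judge,
                     easy_threshold=0.5, max_steps=3):
    """Return a list of operator names to apply, in order.

    perceive(video) -> (degradation_signature, difficulty in [0, 1])
    library: dict mapping a signature to a stored operator sequence
    operators: dict mapping an operator name to a function video -> video
    judge(video) -> quality score, higher is better
    """
    signature, difficulty = perceive(video)

    # Easy case: reuse a stored trajectory in one step.
    if difficulty < easy_threshold and signature in library:
        return list(library[signature])

    # Harder case: greedily add the operator that most improves the score.
    trajectory = []
    current, current_score = video, judge(video)
    for _ in range(max_steps):
        if not operators:
            break
        scored = [(judge(op(current)), name) for name, op in operators.items()]
        best_score, best_name = max(scored, key=lambda s: s[0])
        if best_score <= current_score:
            break  # no operator helps any further
        current = operators[best_name](current)
        current_score = best_score
        trajectory.append(best_name)
    return trajectory
```

Greedy search calls the judge once per operator per step, so its cost grows with the number of operators times the number of steps, which is why a one-step retrieval path for easy inputs saves time.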

[74] SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning cs.CV | cs.CL

Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji

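TL;DR: This paper proposes SpecEyes, a speculative acceleration framework for agentic multimodal large language models (MLLMs). It uses a lightweight, tool-free MLLM as a speculative planner to predict the execution trajectory, combined with a cognitive gating mechanism and a heterogeneous parallel funnel design, to break the sequential bottleneck created by the perception, reasoning, and tool-calling loops of agentic MLLMs, substantially lowering latency and raising system throughput.

Details

Motivation: Agentic MLLMs achieve strong reasoning by iteratively invoking visual tools, but the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead (termed agentic depth), which leads to excessive latency and severely limits system-level concurrency.

Result: Extensive experiments on benchmarks including V* Bench, HR-Bench, and POPE show that SpecEyes achieves a 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.

Insight: The core novelties are using a lightweight MLLM as a speculative planner that predicts the execution trajectory and enables early termination, breaking the sequential bottleneck; a cognitive gating mechanism based on answer separability that performs self-verification without ground-truth labels; and a heterogeneous parallel funnel that uses the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. It is an efficient architecture that applies the idea of speculative execution to multimodal agent systems.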

Abstract: Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model’s confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
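The gating idea, accepting the small model's answer only when it is confident and otherwise falling back to the full agentic pipeline, can be sketched as follows. The abstract does not define answer separability, so the top-two probability margin used here is an assumed proxy, and `small_model`, `agentic_model`, and the threshold are hypothetical placeholders.

```python
def separability(probs):
    """Margin between the two most likely answers (an assumed confidence proxy)."""
    ranked = sorted(probs.values(), reverse=True)
    if len(ranked) < 2:
        return ranked[0] if ranked else 0.0
    return ranked[0] - ranked[1]

def gated_answer(query, small_model, agentic_model, threshold=0.5):
    """Answer with the lightweight model when its margin clears the threshold,
    otherwise run the expensive tool-using pipeline.

    small_model(query) -> dict mapping candidate answers to probabilities
    agentic_model(query) -> final answer (slow path)
    Returns (answer, path) where path is "speculative" or "agentic".
    """
    probs = small_model(query)
    if probs and separability(probs) >= threshold:
        best = max(probs, key=probs.get)
        return best, "speculative"
    return agentic_model(query), "agentic"
```

Raising the threshold sends more queries down the slow path, trading speed for caution. No ground-truth label is needed at inference time because the gate looks only at the small model's own output distribution.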

[75] MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage cs.CV | cs.AI | cs.CL

Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal

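TL;DR: The paper introduces the MedObvious benchmark, which evaluates whether medical vision language models (VLMs) can validate their input before clinical triage, that is, check that image modality, anatomy, viewpoint, and integrity are consistent. The study finds that existing models are unreliable at identifying invalid or inconsistent inputs and readily produce plausible-sounding diagnostic narratives based on bad input.

Details

Motivation: Most existing medical VLM benchmarks assume input validation is already solved and miss a critical failure mode: a model can still generate fluent but wrong diagnostic text when the input is invalid, which is a safety risk in medical applications.

Result: Seventeen different VLMs were evaluated on the 1,880-task MedObvious benchmark. The models proved unreliable at input validation (consistency checking): some hallucinate anomalies on normal (negative-control) inputs, performance degrades as the image set grows, and accuracy differs substantially between multiple-choice and open-ended settings.

Insight: The novelty is framing the pre-triage "sanity check" (input validation) as a distinct, safety-critical capability and building a dedicated multi-tier benchmark to isolate and evaluate it. Viewed objectively, this exposes an overlooked evaluation dimension before medical VLM deployment: a model's ability to understand the consistency of its input matters as much as its generation ability.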

Abstract: Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

[76] Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps cs.CV

Chanyoung Gwak, Yoonwoo Jeong, Byungwoo Jeon, Hyunseok Lee, Jinwoo Shin

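TL;DR: This paper proposes Cog3DMap, a framework that addresses the lack of geometric grounding in multimodal large language models when they perform spatial understanding from multi-view images. It recurrently constructs an explicit 3D memory from multi-view images so that each token carries both semantic and geometric information in 3D space, allowing the MLLM to reason directly over a structured 3D map.

Details

Motivation: The visual representations of existing MLLMs are predominantly semantic and lack explicit geometric grounding, which limits precise spatial understanding from multi-view images. Existing methods augment visual tokens with geometric cues from visual geometry models, but the MLLM must still implicitly infer the 3D structure of the scene from these augmented tokens, which limits its spatial reasoning.

Result: The framework achieves state-of-the-art performance on multiple spatial reasoning benchmarks.

Insight: The core novelty is building an explicit, tokenized 3D cognitive map as input to the MLLM, representing semantic and geometric information explicitly and jointly in 3D space, so that spatial reasoning shifts from implicit inference to direct operation on an explicit structured 3D map.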

Abstract: Precise spatial understanding from multi-view images remains a fundamental challenge for Multimodal Large Language Models (MLLMs), as their visual representations are predominantly semantic and lack explicit geometric grounding. While existing approaches augment visual tokens with geometric cues from visual geometry models, their MLLM is still required to implicitly infer the underlying 3D structure of the scene from these augmented tokens, limiting its spatial reasoning capability. To address this issue, we introduce Cog3DMap, a framework that recurrently constructs an explicit 3D memory from multi-view images, where each token is grounded in 3D space and possesses both semantic and geometric information. By feeding these tokens into the MLLM, our framework enables direct reasoning over a spatially structured 3D map, achieving state-of-the-art performance on various spatial reasoning benchmarks. Code will be made publicly available.

[77] Traffic Sign Recognition in Autonomous Driving: Dataset, Benchmark, and Field Experiment cs.CV

Guoyang Zhao, Weiqing Qi, Kai Zhang, Chenguang Zhang, Zeying Gong

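TL;DR: This paper presents TS-1M, a large-scale, globally diverse traffic sign dataset with over one million images across 454 standardized categories, together with a diagnostic benchmark for systematically evaluating how different learning paradigms perform under practical challenges such as cross-region variation, long-tailed categories, and semantic ambiguity.

Details

Motivation: Existing traffic sign recognition (TSR) datasets and benchmarks offer limited insight into how different modeling paradigms cope with practical challenges, so a new dataset and benchmark are needed to analyze the capability boundaries of models in depth.

Result: On the TS-1M benchmark, classical supervised models, self-supervised pretrained models, and multimodal vision-language models (VLMs) are evaluated under a unified protocol, revealing paradigm-dependent behaviors, for example that semantic alignment is a key factor for cross-region generalization and rare-category recognition.

Insight: The novelty is building a large-scale, diagnostic TSR dataset and benchmark and systematically analyzing the performance boundaries of different learning paradigms, highlighting the importance of semantic understanding for robust perception. Viewed objectively, it tightly couples dataset construction with model diagnosis and provides a new evaluation framework for autonomous driving perception.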

Abstract: Traffic Sign Recognition (TSR) is a core perception capability for autonomous driving, where robustness to cross-region variation, long-tailed categories, and semantic ambiguity is essential for reliable real-world deployment. Despite steady progress in recognition accuracy, existing traffic sign datasets and benchmarks offer limited diagnostic insight into how different modeling paradigms behave under these practical challenges. We present TS-1M, a large-scale and globally diverse traffic sign dataset comprising over one million real-world images across 454 standardized categories, together with a diagnostic benchmark designed to analyze model capability boundaries. Beyond standard train-test evaluation, we provide a suite of challenge-oriented settings, including cross-region recognition, rare-class identification, low-clarity robustness, and semantic text understanding, enabling systematic and fine-grained assessment of modern TSR models. Using TS-1M, we conduct a unified benchmark across three representative learning paradigms: classical supervised models, self-supervised pretrained models, and multimodal vision-language models (VLMs). Our analysis reveals consistent paradigm-dependent behaviors, showing that semantic alignment is a key factor for cross-region generalization and rare-category recognition, while purely visual models remain sensitive to appearance shift and data imbalance. Finally, we validate the practical relevance of TS-1M through real-scene autonomous driving experiments, where traffic sign recognition is integrated with semantic reasoning and spatial localization to support map-level decision constraints. Overall, TS-1M establishes a reference-level diagnostic benchmark for TSR and provides principled insights into robust and semantic-aware traffic sign perception. Project page: https://guoyangzhao.github.io/projects/ts1m.

[78] MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding cs.CV

Basit Alawode, Arif Mahmood, Muaz Khalifa Al-Radi, Shahad Albastaki, Asim Khan

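TL;DR: This paper proposes MLLM-HWSI, a multimodal large language model for hierarchical whole slide image (WSI) understanding. The model aligns visual features with pathology language at four scales (cell, patch, region, and whole WSI) to support interpretable, evidence-grounded reasoning. Trained in multiple stages, it achieves new state-of-the-art (SOTA) results on 13 WSI-level benchmarks across six computational pathology tasks.

Details

Motivation: Existing computational pathology MLLMs typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across scales. This work addresses the problem by building a hierarchical model that mirrors the diagnostic workflow.

Result: MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six computational pathology tasks (such as open-ended reasoning, visual question answering, and report and caption generation).

Insight: The main novelties are: 1. a hierarchical multi-scale feature alignment framework (cell as word, patch as phrase, region as sentence, WSI as paragraph); 2. a hierarchical contrastive objective and a cross-scale consistency loss that preserve semantic coherence from cells to the WSI; 3. a lightweight Cell-Cell Attention Fusion (CCAF) transformer that aggregates cell embeddings; 4. a three-stage training strategy. Together these offer a reusable architectural pattern for multimodal large models that must understand hierarchically structured, high-resolution medical images.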

Abstract: Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic information emerges from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce MLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales (cell as word, patch as phrase, region as sentence, and WSI as paragraph) to support interpretable evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. We compute diagnostically relevant patches and aggregate segmented cell embeddings into a compact cellular token per patch using a lightweight Cell-Cell Attention Fusion (CCAF) transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By aligning language with multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows and advance holistic WSI understanding. Code is available at: https://github.com/BasitAlawode/HWSI-MLLM.

[79] PolarAPP: Beyond Polarization Demosaicking for Polarimetric Applications cs.CV

Yidong Luo, Chenggong Li, Yunfeng Song, Ping Wang, Boxin Shi

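TL;DR: This paper proposes PolarAPP, the first framework to jointly optimize polarization image demosaicking and its downstream tasks (such as normal estimation and de-reflection). A meta-learned feature alignment mechanism and an equivalent imaging constraint make the reconstruction task-aware, which improves downstream performance.

Details

Motivation: Existing polarimetric applications rely on naively regrouped raw measurements, which yields incomplete and suboptimal reconstruction targets, while current demosaicking methods optimize only photometric fidelity and ignore downstream utility. A task-aware joint optimization method is therefore needed.

Result: Extensive experiments show that PolarAPP outperforms existing methods in both demosaicking quality and downstream task performance, reaching SOTA.

Insight: The contributions are: semantic feature alignment between the demosaicking and downstream networks via meta-learning; an equivalent imaging constraint that lets demosaicking regress directly to physically meaningful outputs; and a task-refinement stage that further fine-tunes the downstream network to improve overall accuracy.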

Abstract: Polarimetric imaging enables advanced vision applications such as normal estimation and de-reflection by capturing unique surface-material interactions. However, existing applications (alternatively called downstream tasks) rely on datasets constructed by naively regrouping raw measurements from division-of-focal-plane sensors, where pixels of the same polarization angle are extracted and aligned into sparse images without proper demosaicking. This reconstruction strategy results in suboptimal, incomplete targets that limit downstream performance. Moreover, current demosaicking methods are task-agnostic, optimizing only for photometric fidelity rather than utility in downstream tasks. Towards this end, we propose PolarAPP, the first framework to jointly optimize demosaicking and its downstream tasks. PolarAPP introduces a feature alignment mechanism that semantically aligns the representations of demosaicking and downstream networks via meta-learning, guiding the reconstruction to be task-aware. It further employs an equivalent imaging constraint for demosaicking training, enabling direct regression to physically meaningful outputs without relying on rearranged data. Finally, a task-refinement stage fine-tunes the task network using the stable demosaicking front-end to further enhance accuracy. Extensive experimental results demonstrate that PolarAPP outperforms existing methods in both demosaicking quality and downstream performance. Code is available upon acceptance.

[80] A Synchronized Audio-Visual Multi-View Capture System cs.CV

Xiangwei Shi, Era Dorta Perez, Ruud de Jong, Ojas Shirekar, Chirag Raman

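TL;DR: This paper presents a synchronized audio-visual multi-view capture system that addresses the weak support for audio acquisition and strict audio-video alignment in existing systems. It integrates multiple cameras and multiple microphone channels under a unified timing architecture, provides a complete workflow for calibration, acquisition, and quality control, supports repeatable recordings at scale, and is shown to be temporally consistent enough for fine-grained analysis and data-driven modeling of conversational behavior.

Details

Motivation: Existing multi-view capture systems are designed mainly around video streams and offer little support for audio acquisition or strict audio-video alignment, even though both are essential for studying timing in conversational interaction (turn-taking, overlap, and prosody).

Result: Synchronization performance is quantified in deployment, showing that the recordings are temporally consistent enough to support fine-grained analysis and data-driven modeling of conversational behavior.

Insight: The key idea is to treat synchronized audio and synchronized video as first-class signals: multiple cameras and multi-channel microphones are integrated under a unified timing architecture, with a complete calibration, acquisition, and quality-control workflow, enabling scalable, high-precision synchronized audio-video capture.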

Abstract: Multi-view capture systems have been an important tool in research for recording human motion under controlled conditions. Most existing systems are designed around video streams and provide little or no support for audio acquisition and rigorous audio-video alignment, despite both being essential for studying conversational interaction where timing at the level of turn-taking, overlap, and prosody matters. In this technical report, we describe an audio-visual multi-view capture system that addresses this gap by treating synchronized audio and synchronized video as first-class signals. The system combines a multi-camera pipeline with multi-channel microphone recording under a unified timing architecture and provides a practical workflow for calibration, acquisition, and quality control that supports repeatable recordings at scale. We quantify synchronization performance in deployment and show that the resulting recordings are temporally consistent enough to support fine-grained analysis and data-driven modeling of conversational behavior.

[81] AgentFoX: LLM Agent-Guided Fusion with eXplainability for AI-Generated Image Detection cs.CV

Yangxin Yu, Yue Zhou, Bin Li, Kaiqing Lin, Haodong Li

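TL;DR: This paper proposes AgentFoX, a framework in which an LLM-driven agent detects AI-generated images (AIGI) through a dynamic, multi-phase analysis. It uses a quick-integration fusion mechanism guided by a knowledge base of Expert Profiles and Clustering Profiles, moves during inference from high-level semantic assessment to fine-grained synthesis of signal-level evidence, and produces a detailed, interpretable forensic report.

Details

Motivation: Existing AI-generated image detectors are usually tailored to specific forgery artifacts (such as frequency-domain patterns or semantic inconsistencies), which leads to specialized performance and sometimes conflicting judgments. An interpretable tool that reliably separates synthetic from authentic images is therefore needed.

Result: The abstract reports no specific quantitative results or benchmarks; it emphasizes that the method produces detailed, human-readable forensic reports that improve interpretability and trustworthiness for real-world deployment.

Insight: The novelty lies in recasting AIGI detection as an LLM-guided, dynamic, multi-phase analysis in which a knowledge-base-guided fusion mechanism resolves contradictions among existing detectors and the output is an interpretable report. Viewed objectively, the scalable agentic paradigm offers a new route to intelligently integrating future forensic tools.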

Abstract: The increasing realism of AI-Generated Images (AIGI) has created an urgent need for forensic tools capable of reliably distinguishing synthetic content from authentic imagery. Existing detectors are typically tailored to specific forgery artifacts, such as frequency-domain patterns or semantic inconsistencies, leading to specialized performance and, at times, conflicting judgments. To address these limitations, we present AgentFoX, a Large Language Model-driven framework that redefines AIGI detection as a dynamic, multi-phase analytical process. Our approach employs a quick-integration fusion mechanism guided by a curated knowledge base comprising calibrated Expert Profiles and contextual Clustering Profiles. During inference, the agent begins with high-level semantic assessment, then transitions to fine-grained, context-aware synthesis of signal-level expert evidence, resolving contradictions through structured reasoning. Instead of returning a coarse binary output, AgentFoX produces a detailed, human-readable forensic report that substantiates its verdict, enhancing interpretability and trustworthiness for real-world deployment. Beyond providing a novel detection solution, this work introduces a scalable agentic paradigm that facilitates intelligent integration of future and evolving forensic tools.

[82] Automatic Segmentation of 3D CT scans with SAM2 using a zero-shot approach cs.CV

Miquel Lopez Escoriza, Pau Amargant Alvarez

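TL;DR: This study explores how to apply the segmentation foundation model SAM2 zero-shot to automatic segmentation of 3D CT scans. It identifies SAM2's main limitation on CT volumes, the lack of inherent volumetric awareness, and proposes inference-only architectural and procedural modifications, requiring no fine-tuning, that adapt SAM2's video memory mechanism by treating CT slices as an ordered sequence. A systematic ablation is run on the TotalSegmentator dataset, and the approach is then validated on a larger sample.

Details

Motivation: Segmentation foundation models such as SAM2 generalize strongly on natural images, but their direct use on 3D medical imaging such as CT remains limited. This study asks how SAM2 can segment volumetric CT scans zero-shot, without any fine-tuning or domain-specific training.

Result: A systematic ablation on a 500-scan subset of TotalSegmentator evaluates prompt strategies, memory propagation schemes, and multi-pass refinement. The best configuration is then evaluated on a larger sample of 2,500 CT scans. The results show that, even with frozen weights, SAM2 can produce coherent 3D segmentations when its inference pipeline is carefully structured.

Insight: The core contribution is a set of purely inference-time modifications that adapt SAM2's video memory mechanism to 3D data, compensating for its lack of volumetric awareness by processing CT slices as a sequence. This offers a feasible, training-free way to apply a strong 2D foundation model zero-shot to 3D medical image segmentation.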

Abstract: Foundation models for image segmentation have shown strong generalization in natural images, yet their applicability to 3D medical imaging remains limited. In this work, we study the zero-shot use of Segment Anything Model 2 (SAM2) for automatic segmentation of volumetric CT data, without any fine-tuning or domain-specific training. We analyze how SAM2 should be applied to CT volumes and identify its main limitation: the lack of inherent volumetric awareness. To address this, we propose a set of inference-only architectural and procedural modifications that adapt SAM2’s video-based memory mechanism to 3D data by treating CT slices as ordered sequences. We conduct a systematic ablation study on a subset of 500 CT scans from the TotalSegmentator dataset to evaluate prompt strategies, memory propagation schemes, and multi-pass refinement. Based on these findings, we select the best-performing configuration and report final results on a larger sample of the TotalSegmentator dataset comprising 2,500 CT scans. Our results show that, even with frozen weights, SAM2 can produce coherent 3D segmentations when its inference pipeline is carefully structured, demonstrating the feasibility of a fully zero-shot approach for volumetric medical image segmentation.

[83] SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLMs to Perceive Visual Illusions cs.CV | cs.MM

Jinzhe Tu, Ruilei Guo, Zihan Guo, Junxiao Yang, Shiyao Cui

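TL;DR: Multimodal Large Language Models (MLLMs) are easily fooled by hidden-pattern visual illusions. This paper proposes SMSP, a plug-and-play multi-scale perception strategy that suppresses distracting high-frequency background textures so that model perception aligns with human vision, substantially improving MLLM performance on illusion images.

Details

Motivation: Current MLLMs fail badly at perceiving hidden-pattern visual illusions, which exposes a perceptual misalignment between models and humans and may raise safety concerns. The paper sets out to study this failure mechanism systematically and to provide a remedy.

Result: On the proposed comprehensive illusion dataset IlluChar, SMSP significantly improves every evaluated MLLM. For example, it raises the accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%.

Insight: The core contribution is identifying the key mechanism behind MLLM failures on visual illusions, a high-frequency attention bias, and building on it a plug-and-play framework, SMSP, that mimics human perceptual strategies. Multi-scale perception with suppression of high-frequency backgrounds strengthens the model's ability to see hidden patterns, giving a practical and robust approach to perceptual alignment in MLLMs.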

Abstract: Recent works have shown that Multimodal Large Language Models (MLLMs) are highly vulnerable to hidden-pattern visual illusions, where the hidden content is imperceptible to models but obvious to humans. This deficiency highlights a perceptual misalignment between current MLLMs and humans, and also introduces potential safety concerns. To systematically investigate this failure, we introduce IlluChar, a comprehensive and challenging illusion dataset, and uncover a key underlying mechanism for the models’ failure: high-frequency attention bias, where the models are easily distracted by high-frequency background textures in illusion images, causing them to overlook hidden patterns. To address the issue, we propose the Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that aligns with human visual perceptual strategies. By suppressing distracting high-frequency backgrounds, SMSP generates images closer to human perception. Our experiments demonstrate that SMSP significantly improves the performance of all evaluated MLLMs on illusion images, for instance, increasing the accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%. Our work provides novel insights into MLLMs’ visual perception, and offers a practical and robust solution to enhance it. Our code is publicly available at https://github.com/Tujz2023/SMSP.
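
The abstract does not say how SMSP suppresses high-frequency backgrounds, so the following is only a minimal sketch of the general idea, assuming plain average pooling as the low-pass step. The function names `average_pool` and `multi_scale_views` are hypothetical and are not taken from the SMSP code base.

```python
def average_pool(image, factor):
    """Downsample a 2D grayscale image by averaging factor x factor blocks."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - rows % factor, factor):
        row = []
        for c in range(0, cols - cols % factor, factor):
            block = [image[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def multi_scale_views(image, factors=(1, 2, 4)):
    """Assumed stand-in for multi-scale perception: one pooled view per factor."""
    return {factor: average_pool(image, factor) for factor in factors}


# A checkerboard texture hides a coarse pattern: dark left half, bright right half.
image = [
    [0, 2, 2, 4],
    [2, 0, 4, 2],
    [0, 2, 2, 4],
    [2, 0, 4, 2],
]
views = multi_scale_views(image)
```

At factor 2 the checkerboard texture averages out and only the coarse left/right contrast remains, which is the effect the abstract attributes to suppressing high-frequency backgrounds.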

[84] PiCo: Active Manifold Canonicalization for Robust Robotic Visual Anomaly Detection cs.CV

Teng Yan, Binkai Liu, Shuai Liu, Yue Yu, Bingzhuo Zhong

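TL;DR: This paper proposes PiCo, a framework that uses active canonicalization to make robotic visual anomaly detection robust to complex pose and illumination changes. It uses a cascaded mechanism: Active Physical Canonicalization first adjusts the object's pose, then Neural Latent Canonicalization applies multi-level denoising, so that observations are projected onto a condition-invariant canonical manifold.

Details

Motivation: Industrial robotic visual anomaly detection under 6-DoF pose variation and unstable illumination must cope with semantic anomalies and physical disturbances that coexist. Traditional passive perception handles this poorly, which motivates a shift to an active canonicalization paradigm.

Result: On the M2AD benchmark, PiCo reaches 93.7% O-AUROC (a 3.7% improvement over prior methods in the static setting) and 98.5% accuracy in active closed-loop scenarios, achieving SOTA performance.

Insight: The novelty is the shift from passive feature learning to active canonicalization, with a cascaded physical-plus-neural mechanism that removes disturbances at multiple scales. Reusable ideas include projection onto a condition-invariant manifold and the progressive denoising hierarchy across representational scales.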

Abstract: Industrial deployment of robotic visual anomaly detection (VAD) is fundamentally constrained by passive perception under diverse 6-DoF pose configurations and unstable operating conditions such as illumination changes and shadows, where intrinsic semantic anomalies and physical disturbances coexist and interact. To overcome these limitations, a paradigm shift from passive feature learning to Active Canonicalization is proposed. PiCo (Pose-in-Condition Canonicalization) is introduced as a unified framework that actively projects observations onto a condition-invariant canonical manifold. PiCo operates through a cascaded mechanism. The first stage, Active Physical Canonicalization, enables a robotic agent to reorient objects in order to reduce geometric uncertainty at its source. The second stage, Neural Latent Canonicalization, adopts a three-stage denoising hierarchy consisting of photometric processing at the input level, latent refinement at the feature level, and contextual reasoning at the semantic level, progressively eliminating nuisance factors across representational scales. Extensive evaluations on the large-scale M2AD benchmark demonstrate the superiority of this paradigm. PiCo achieves a state-of-the-art 93.7% O-AUROC, representing a 3.7% improvement over prior methods in static settings, and attains 98.5% accuracy in active closed-loop scenarios. These results demonstrate that active manifold canonicalization is critical for robust embodied perception.

[85] 3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio cs.CV

Jihwan Hong, Jaeyoung Do

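TL;DR: This paper presents VIRST-Audio, a framework for Audio-based Referring Video Object Segmentation (ARVOS). Built on a pretrained referring video object segmentation model with a vision-language architecture, it converts audio queries to text with automatic speech recognition and then segments using text-based supervision, transferring text-based reasoning to audio scenarios. An existence-aware gating mechanism decides whether the target object is present in the video, which reduces hallucinated masks and stabilizes segmentation.

Details

Motivation: ARVOS requires aligning audio queries with spatio-temporal visual representations in video, which raises the cross-modal challenge of linking acoustic signals to vision. The paper aims for a practical framework that reuses existing text-based models for audio-referred segmentation and avoids audio-specific training.

Result: VIRST-Audio took 3rd place on the MeViS-Audio track of the 5th PVUW Challenge, demonstrating strong generalization and reliable performance on audio-based referring video segmentation.

Insight: The novelty is converting audio to text with ASR so that mature text-supervised models can perform the segmentation, giving an effective transfer from text to audio scenarios. In addition, the existence-aware gating mechanism suppresses erroneous predictions by estimating whether the target object is present, improving robustness and segmentation stability.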

Abstract: Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.
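
The existence-aware gating described above can be illustrated with a small sketch. The threshold value, the way the existence probability is obtained, and the name `gate_masks` are all assumptions made for illustration; the abstract only states that predictions are suppressed when the target is estimated to be absent.

```python
def gate_masks(frame_masks, existence_prob, threshold=0.5):
    """Return the predicted masks, or all-zero masks when the target is judged absent.

    frame_masks: list of 2D binary masks, one per frame.
    existence_prob: assumed scalar estimate that the referred object is in the video.
    """
    if existence_prob >= threshold:
        return frame_masks
    return [[[0 for _ in row] for row in mask] for mask in frame_masks]


predicted = [[[0, 1], [1, 1]], [[0, 0], [1, 1]]]  # two frames of 2x2 masks
kept = gate_masks(predicted, existence_prob=0.9)
suppressed = gate_masks(predicted, existence_prob=0.1)
```

Gating at the video level, as sketched here, is one possible design; a per-frame existence estimate would be an equally plausible reading of the abstract.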

[86] InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance cs.CV

Dongwei Pan, Longwei Guo, Jiazhi Guan, Luying Huang, Yiding Li

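TL;DR: This paper proposes InterDyad, a framework for speech-driven video generation in two-person conversations that tackles cross-individual dependencies and fine-grained control of reactive behavior. It queries structured motion guidance and combines an Interactivity Injector, a MetaQuery-based modality alignment mechanism, an MLLM that extracts linguistic intent from audio, and Role-aware Dyadic Gaussian Guidance (RoDG) for better lip sync and spatial consistency, producing natural, contextually grounded dyadic interaction videos.

Details

Motivation: Existing speech-to-video synthesis methods struggle to capture cross-individual dependencies in dyadic interaction and offer little fine-grained control over reactive behavior. This paper addresses both issues to generate more natural and contextually grounded two-person interaction videos.

Result: Comprehensive experiments show that InterDyad significantly outperforms state-of-the-art (SOTA) methods at producing natural, contextually grounded two-person interactions. The paper also introduces a dedicated evaluation suite with newly designed metrics to quantify dyadic interaction.

Insight: The contributions are: (1) synthesizing interactive dynamics by querying structured motion guidance; (2) using an MLLM to extract linguistic intent from audio to control the precise timing and appropriateness of reactions; (3) RoDG, which improves lip sync and spatial consistency under extreme head poses; and (4) new metrics designed specifically for dyadic interaction. Viewed objectively, aligning MLLM intent understanding with motion priors and modeling spatial constraints specific to two-person scenes are ideas worth borrowing.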

Abstract: Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To improve lip synchronization and spatial consistency under extreme head poses, we further propose Role-aware Dyadic Gaussian Guidance (RoDG). Finally, we introduce a dedicated evaluation suite with newly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: https://interdyad.github.io/.

Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Tobias Glück, Anke Schmeink

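TL;DR: This paper proposes CCMA, a cross-modal active learning framework that links vision and language through a teacher-student architecture. A pretrained vision-language model supplies semantically grounded uncertainty estimates, which are conformally calibrated and used to guide sample selection for a vision-only student model, yielding higher data efficiency on multiple benchmarks.

Details

Motivation: Existing active learning methods rely mainly on a single modality (for example, vision only) and overlook multimodal information. The goal is to exploit the rich multimodal knowledge embedded in modern vision-language models to reduce annotation cost more efficiently.

Result: Across multiple benchmarks, CCMA consistently outperforms state-of-the-art active learning baselines and shows clear advantages over methods that rely solely on uncertainty or diversity metrics, reaching SOTA.

Insight: The novelty is integrating multimodal knowledge into active learning through a teacher-student architecture and conformal calibration, providing semantically grounded uncertainty estimates combined with diversity-aware selection. This offers a new cross-modal perspective on data-efficient learning.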

Abstract: Foundation models for vision have transformed visual recognition with powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active Learning (AL) aims to minimize annotation costs by strategically selecting the most informative samples for labeling, but existing methods largely overlook the rich multimodal knowledge embedded in modern vision-language models (VLMs). We introduce Conformal Cross-Modal Acquisition (CCMA), a novel AL framework that bridges vision and language modalities through a teacher-student architecture. CCMA employs a pretrained VLM as a teacher to provide semantically grounded uncertainty estimates, conformally calibrated to guide sample selection for a vision-only student model. By integrating multimodal conformal scoring with diversity-aware selection strategies, CCMA achieves superior data efficiency across multiple benchmarks. Our approach consistently outperforms state-of-the-art AL baselines, demonstrating clear advantages over methods relying solely on uncertainty or diversity metrics.
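
To make "conformally calibrated uncertainty" concrete, here is a minimal sketch of standard split conformal prediction used as an acquisition score. It is an assumption that CCMA's scoring resembles this; the abstract does not give the scoring rule, and the names `conformal_threshold`, `prediction_set`, and `select_for_labeling` are hypothetical. The teacher probabilities are made-up numbers.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal quantile of the nonconformity score 1 - p(true label)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))  # 1-indexed rank
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Classes whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


def select_for_labeling(pool_probs, threshold, budget):
    """Pick the samples with the largest prediction sets (most ambiguous)."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: (-len(prediction_set(pool_probs[i], threshold)), i),
    )
    return ranked[:budget]


# Hypothetical teacher (VLM) class probabilities on a small labeled calibration set.
cal_probs = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5], [0.35, 0.45, 0.2]]
cal_labels = [0, 1, 2, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

# Hypothetical teacher probabilities on the unlabeled pool.
pool_probs = [[0.95, 0.03, 0.02], [0.45, 0.40, 0.15], [0.8, 0.15, 0.05]]
chosen = select_for_labeling(pool_probs, threshold, budget=1)
```

Prediction-set size is used here as the uncertainty signal; the diversity-aware part of the selection mentioned in the abstract is left out of this sketch.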

[88] GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field cs.CV

Jingtao Zhou, Xuan Gao, Dongyu Liu, Junhui Hou, Yudong Guo

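TL;DR: GSwap is a novel video head-swapping system that uses dynamic neural Gaussian field priors to achieve high-fidelity, 3D-consistent, and realistic head swaps, significantly advancing the state of the art in face and head replacement.

Details

Motivation: Existing methods rely mainly on 2D generative models or 3D morphable face models and suffer from poor 3D consistency, unnatural expressions, and limited synthesis quality. Insufficient holistic head modeling and ineffective background blending also cause artifacts and misalignments.

Result: Extensive experiments show that GSwap surpasses existing methods in visual quality, temporal coherence, identity preservation, and 3D consistency.

Insight: The core idea is to lift 2D portrait videos into a dynamic neural Gaussian feature field embedded in a full-body SMPL-X surface, which ensures high-fidelity rendering and natural head-torso relationships. The method also adapts a generative model to the source head domain from only a few reference images and proposes a neural re-rendering strategy for seamless blending of foreground and background.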

Abstract: We present GSwap, a novel consistent and realistic video head-swapping system empowered by dynamic neural Gaussian portrait priors, which significantly advances the state of the art in face and head replacement. Unlike previous methods that rely primarily on 2D generative models or 3D Morphable Face Models (3DMM), our approach overcomes their inherent limitations, including poor 3D consistency, unnatural facial expressions, and restricted synthesis quality. Moreover, existing techniques struggle with full head-swapping tasks due to insufficient holistic head modeling and ineffective background blending, often resulting in visible artifacts and misalignments. To address these challenges, GSwap introduces an intrinsic 3D Gaussian feature field embedded within a full-body SMPL-X surface, effectively elevating 2D portrait videos into a dynamic neural Gaussian field. This innovation ensures high-fidelity, 3D-consistent portrait rendering while preserving natural head-torso relationships and seamless motion dynamics. To facilitate training, we adapt a pretrained 2D portrait generative model to the source head domain using only a few reference images, enabling efficient domain adaptation. Furthermore, we propose a neural re-rendering strategy that harmoniously integrates the synthesized foreground with the original background, eliminating blending artifacts and enhancing realism. Extensive experiments demonstrate that GSwap surpasses existing methods in multiple aspects, including visual quality, temporal coherence, identity preservation, and 3D consistency.

[89] ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting cs.CV

Yeonkyung Lee, Dayun Ju, Youngmin Kim, Seil Kang, Seong Jae Hwang

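TL;DR: The paper proposes ViKey, a framework that combines visual prompting (VP) with a lightweight Keyword-Frame Mapping (KFM) module to strengthen the temporal understanding of Video Large Language Models (VideoLLMs) on sparsely sampled frames. It is training-free and, on some datasets, preserves dense-frame baseline performance with as few as 20% of the frames.

Details

Motivation: VideoLLMs commonly use frame selection to cut compute, but this hurts tasks that need temporal reasoning because models misread temporal relations when intermediate frames are missing, whereas humans can infer event progression from sparse visual cues. The paper therefore explores visual prompting as a lightweight, effective remedy.

Result: ViKey substantially improves temporal reasoning across multiple datasets and, on some of them, maintains dense-frame baseline performance using as few as 20% of the frames.

Insight: The novelty is using visual prompts (such as explicit ordinal information drawn on each frame) to help the model perceive temporal continuity, together with a Keyword-Frame Mapping module that links textual cues to the relevant frames and provides explicit temporal anchors. The approach is training-free, simple, and efficient.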

Abstract: Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.
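
The two ingredients described above, ordinal frame labels and dictionary-like keyword-to-frame keys, can be sketched as follows. This is a sketch under stated assumptions: uniform sampling and caption substring matching are stand-ins chosen for illustration, since the abstract does not say how frames are selected or how KFM scores relevance. All function names and the captions are hypothetical.

```python
def sample_frame_indices(num_frames, keep_ratio=0.2):
    """Uniformly pick about keep_ratio of the frames, always at least one."""
    count = max(1, round(num_frames * keep_ratio))
    step = num_frames / count
    return [int(i * step) for i in range(count)]


def ordinal_labels(indices):
    """Map a 1-based ordinal (the number drawn on the frame) to the frame index."""
    return {order: frame_idx for order, frame_idx in enumerate(indices, start=1)}


def keyword_frame_map(labels, captions, keywords):
    """Link each keyword to the ordinals of frames whose caption mentions it."""
    mapping = {}
    for keyword in keywords:
        mapping[keyword] = [order for order, frame_idx in labels.items()
                            if keyword in captions.get(frame_idx, "")]
    return mapping


indices = sample_frame_indices(num_frames=20, keep_ratio=0.2)
labels = ordinal_labels(indices)
captions = {  # hypothetical per-frame descriptions
    0: "person opens fridge",
    5: "person pours milk",
    10: "person drinks milk",
    15: "person closes fridge",
}
anchors = keyword_frame_map(labels, captions, ["milk", "fridge"])
```

The resulting `anchors` dictionary could be rendered into the text prompt (for example, "milk: frames 2, 3") so that the model has explicit temporal anchors at inference time.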

[90] Gaze-Regularized VLMs for Ego-Centric Behavior Understanding cs.CV

Anupam Pani, Yanchao Yang

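TL;DR: The paper proposes a gaze-regularized vision-language model framework for egocentric behavior understanding. By integrating eye-gaze data (fixations and saccades) into VLM training, the model dynamically attends to regions humans look at, which lets it predict future events more accurately and generate detailed action descriptions.

Details

Motivation: Existing methods rely on visual data alone and ignore gaze, even though gaze reflects human intention and upcoming actions. A VLM that effectively fuses gaze data is therefore needed to improve egocentric behavior understanding.

Result: Experiments show a nearly 13% improvement in semantic scores over baselines that do not use gaze data, along with markedly better accuracy and robustness in future event prediction.

Insight: The contributions are a gaze-based query mechanism that makes the model focus dynamically on attended regions, and a gaze-regularization mechanism that aligns model attention with human attention patterns. This provides a new paradigm for using human gaze in VLMs and can extend to applications that need fine-grained behavior prediction.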

Abstract: Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13% improvement in semantic scores compared to baseline models not leveraging gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging the human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future event prediction.

[91] FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation cs.CV

Yukinori Yamamoto, Kazuya Nishimura, Tsukasa Fukusato, Hirokazu Nosato, Tetsuya Ogata

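TL;DR: This paper proposes FDIF, a formula-driven supervised learning framework for 3D medical image segmentation. It uses implicit functions based on signed distance functions to generate training data and labels from mathematical formulas, requiring no real data or expert annotation and enabling scalable pre-training.

Details

Motivation: Deep-learning methods for 3D medical image segmentation depend on large labeled datasets, which are hard to obtain because of privacy constraints and the high cost of expert annotation. Existing voxel-based formula-driven methods are limited in geometric expressiveness and texture synthesis.

Result: Across three medical image segmentation benchmarks (AMOS, ACDC, and KiTS) and three architectures (SwinUNETR, nnUNet ResEnc-L, and nnUNet Primus-M), FDIF consistently outperforms a formula-driven baseline, performs comparably to self-supervised methods pre-trained on large-scale real datasets, and also benefits 3D classification tasks.

Insight: The novelty is an implicit-function representation based on signed distance functions, which models complex geometry compactly and uses the surface representation to support controllable synthesis of both geometric and intensity textures. This makes it a promising paradigm for data-free representation learning.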

Abstract: Deep learning-based 3D medical image segmentation methods rely on large-scale labeled datasets, yet acquiring such data is difficult due to privacy constraints and the high cost of expert annotation. Formula-Driven Supervised Learning (FDSL) offers an appealing alternative by generating training data and labels directly from mathematical formulas. However, existing voxel-based approaches are limited in geometric expressiveness and cannot synthesize realistic textures. We introduce Formula-Driven supervised learning with Implicit Functions (FDIF), a framework that enables scalable pre-training without using any real data or medical expert annotations. FDIF introduces an implicit-function representation based on signed distance functions (SDFs), enabling compact modeling of complex geometries while exploiting the surface representation of SDFs to support controllable synthesis of both geometric and intensity textures. Across three medical image segmentation benchmarks (AMOS, ACDC, and KiTS) and three architectures (SwinUNETR, nnUNet ResEnc-L, and nnUNet Primus-M), FDIF consistently improves over a formula-driven method, and achieves performance comparable to self-supervised approaches pre-trained on large-scale real datasets. We further show that FDIF pre-training also benefits 3D classification tasks, highlighting implicit-function-based formula supervision as a promising paradigm for data-free representation learning. Code is available at https://github.com/yamanoko/FDIF.
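
As a minimal illustration of formula-driven data from a signed distance function, the sketch below voxelizes a single sphere into a label volume and a paired intensity volume. The sphere shape and the depth-based intensity rule are assumptions chosen for brevity; they are not FDIF's actual shape family or texture model, and `sphere_sdf` and `synthesize_volume` are hypothetical names.

```python
import math


def sphere_sdf(x, y, z, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist((x, y, z), center) - radius


def synthesize_volume(size, center, radius):
    """Voxelize one SDF shape into (labels, intensity) volumes indexed as [z][y][x]."""
    labels, intensity = [], []
    for z in range(size):
        label_slice, intensity_slice = [], []
        for y in range(size):
            label_row, intensity_row = [], []
            for x in range(size):
                d = sphere_sdf(x, y, z, center, radius)
                inside = d <= 0
                label_row.append(1 if inside else 0)
                # Illustrative intensity texture (an assumption): brighter deeper inside.
                intensity_row.append(-d / radius if inside else 0.0)
            label_slice.append(label_row)
            intensity_slice.append(intensity_row)
        labels.append(label_slice)
        intensity.append(intensity_slice)
    return labels, intensity


labels, intensity = synthesize_volume(size=9, center=(4, 4, 4), radius=3)
```

Because both the image and its label come from the same formula, every synthetic volume is perfectly annotated for free, which is the property that makes formula-driven pre-training scalable.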

[92] Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation cs.CV

Anupam Pani, Yanchao Yang

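TL;DR: This paper proposes a training framework that regularizes Vision-Language-Action (VLA) models with human gaze to improve fine-grained robotic manipulation. It converts temporally aggregated gaze heatmaps into patch-level distributions and uses KL divergence to align the transformer's internal attention with human visual patterns, steering the model toward task-relevant features without architectural changes or inference-time overhead.

Details

Motivation: Current VLA models perform poorly on fine-grained manipulation tasks, mainly because they lack a mechanism for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns, making it a strong supervisory signal for guiding robot perception.

Result: Across multiple robotic manipulation benchmarks, the method improves existing VLA models by 4-12%. The models reach equivalent performance in fewer training steps and remain robust under lighting variations and sensor noise.

Insight: The novelty is using human gaze patterns as a regularization signal that gives VLA models a task-relevant inductive bias, improving performance, training efficiency, and interpretability. The method needs no eye-tracking equipment and applies directly to existing datasets, showing that human perceptual priors can accelerate robot learning while improving task performance and system interpretability.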

Abstract: Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns, offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models’ internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer’s attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.
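
The regularizer described above (gaze heatmap to patch-level distribution, then a KL term against attention) can be sketched in a few lines. The KL direction, KL(gaze || attention), the patch size, and the function names are assumptions; the abstract states only that KL divergence is used.

```python
import math


def heatmap_to_patch_distribution(heatmap, patch):
    """Sum a 2D heatmap over non-overlapping patch x patch cells and normalize to sum to 1."""
    rows, cols = len(heatmap), len(heatmap[0])
    sums = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            sums.append(sum(heatmap[i][j]
                            for i in range(r, min(r + patch, rows))
                            for j in range(c, min(c + patch, cols))))
    total = sum(sums)
    if total <= 0:
        return [1.0 / len(sums)] * len(sums)  # no gaze signal: fall back to uniform
    return [s / total for s in sums]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


gaze_heatmap = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 4, 4],
    [0, 0, 4, 4],
]
gaze = heatmap_to_patch_distribution(gaze_heatmap, patch=2)  # 4 patches, row-major
aligned_attention = [0.05, 0.05, 0.05, 0.85]
uniform_attention = [0.25, 0.25, 0.25, 0.25]
loss_aligned = kl_divergence(gaze, aligned_attention)
loss_uniform = kl_divergence(gaze, uniform_attention)
```

Attention that concentrates on the gazed patch incurs a smaller penalty than uniform attention, so adding this term to the training loss pushes attention toward human-attended regions while leaving the inference path unchanged.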

[93] GO-Renderer: Generative Object Rendering with 3D-aware Controllable Video Diffusion Models cs.CV

Zekai Gu, Shuoxuan Feng, Yansong Wang, Hanzhuo Huang, Zhongshuo Du

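TL;DR: GO-Renderer is a unified framework that uses reconstructed 3D proxies to guide a video generative model, enabling high-quality object rendering from arbitrary viewpoints under arbitrary lighting. It combines the precise viewpoint control of 3D reconstruction with the high-quality appearance synthesis of diffusion models and avoids explicitly modeling complex materials and lighting.

Details

Motivation: When reconstructing renderable 3D models from images, existing feedforward methods struggle to model complex appearance accurately, while diffusion-based generative models synthesize realistic images but lack precise viewpoint control.

Result: Extensive experiments show that GO-Renderer achieves state-of-the-art performance on object rendering tasks, including novel-view image synthesis, rendering under novel lighting, and inserting objects into existing videos.

Insight: The novelty is combining a reconstructed 3D geometric proxy with a video diffusion model: the proxy supplies precise geometry and viewpoint control, while the generative model handles complex appearance and lighting implicitly, giving controllable, high-quality neural rendering.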

Abstract: Reconstructing a renderable 3D model from images is a useful but challenging task. Recent feedforward 3D reconstruction methods have demonstrated remarkable success in efficiently recovering geometry, but still cannot accurately model the complex appearances of the reconstructed 3D models. Recent diffusion-based generative models can synthesize realistic images or videos of an object using reference images without explicitly modeling its appearance, which provides a promising direction for object rendering, but they lack accurate control over the viewpoint. In this paper, we propose GO-Renderer, a unified framework that integrates reconstructed 3D proxies to guide video generative models, achieving high-quality object rendering from arbitrary viewpoints under arbitrary lighting conditions. Our method not only enjoys accurate viewpoint control using the reconstructed 3D proxy but also enables high-quality rendering in different lighting environments using diffusion generative models, without explicitly modeling complex materials and lighting. Extensive experiments demonstrate that GO-Renderer achieves state-of-the-art performance across object rendering tasks, including synthesizing images from new viewpoints, rendering objects in a novel lighting environment, and inserting an object into an existing video.

Xue Wang, Zheng Guan, Wenhua Qian, Chengchao Wang, Runzhuo Ma

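TL;DR: This paper proposes a causal-intervention-based framework for multi-modal image fusion. It designs three intervention strategies to identify robust cross-modal dependencies and introduces a Causal Feature Integrator (CFI) that learns intervention-stable features, avoiding data-driven spurious correlations and improving fusion performance.

Details

Motivation: Current multi-modal image fusion methods mainly optimize statistical correlations between modalities, so they tend to capture dataset-induced spurious associations and degrade under distribution shift. Inspired by causal principles, the paper aims to identify robust cross-modal dependencies.

Result: The method achieves state-of-the-art performance on public benchmarks and on downstream high-level vision tasks.

Insight: The contributions are: (1) three principled intervention strategies derived from Pearl's causal hierarchy (complementary masking, random masking, and modality dropout) that probe different aspects of modal relationships; and (2) the Causal Feature Integrator (CFI), which uses adaptive invariance gating to learn intervention-stable features, capturing robust modal dependencies rather than spurious correlations.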

Abstract: Multi-modal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced spurious associations that degrade under distribution shifts. In this paper, we propose an intervention-based framework inspired by causal principles to identify robust cross-modal dependencies. Drawing insights from Pearl’s causal hierarchy, we design three principled intervention strategies to probe different aspects of modal relationships: i) complementary masking with spatially disjoint perturbations tests whether modalities can genuinely compensate for each other’s missing information, ii) random masking of identical regions identifies feature subsets that remain informative under partial observability, and iii) modality dropout evaluates the irreplaceable contribution of each modality. Based on these interventions, we introduce a Causal Feature Integrator (CFI) that learns to identify and prioritize intervention-stable features maintaining importance across different perturbation patterns through adaptive invariance gating, thereby capturing robust modal dependencies rather than spurious correlations. Extensive experiments demonstrate that our method achieves SOTA performance on both public benchmarks and downstream high-level vision tasks.
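
The three interventions listed above can be sketched as simple mask operations on a pair of aligned images. The pixel-level granularity, the keep probability, and all function names are assumptions made for illustration; the abstract does not specify mask shapes or ratios.

```python
import random


def complementary_masks(height, width, keep_prob=0.5, seed=0):
    """Two binary masks with disjoint hidden regions: each pixel is hidden in exactly one modality."""
    rng = random.Random(seed)
    mask_a = [[1 if rng.random() < keep_prob else 0 for _ in range(width)]
              for _ in range(height)]
    mask_b = [[1 - value for value in row] for row in mask_a]
    return mask_a, mask_b


def shared_random_mask(height, width, keep_prob=0.5, seed=0):
    """One binary mask applied identically to both modalities (identical hidden regions)."""
    rng = random.Random(seed)
    return [[1 if rng.random() < keep_prob else 0 for _ in range(width)]
            for _ in range(height)]


def apply_mask(image, mask):
    """Zero out pixels where the mask is 0."""
    return [[pixel * keep for pixel, keep in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]


def zeros_like(image):
    return [[0 for _ in row] for row in image]


def modality_dropout(image_a, image_b, drop="a"):
    """Remove one whole modality to test how irreplaceable it is."""
    if drop == "a":
        return zeros_like(image_a), image_b
    return image_a, zeros_like(image_b)


infrared = [[1, 2], [3, 4]]  # toy values, all nonzero
visible = [[5, 6], [7, 8]]
mask_a, mask_b = complementary_masks(2, 2, seed=0)
masked_ir = apply_mask(infrared, mask_a)
masked_vis = apply_mask(visible, mask_b)
dropped_ir, kept_vis = modality_dropout(infrared, visible, drop="a")
```

A fusion network trained under all three perturbations can then compare which features stay important across them, which is the role the abstract assigns to the Causal Feature Integrator.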

Yuchen Wu, Kun Wang, Yining Pan, Na Zhao

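TL;DR: This paper proposes CCF, a fusion method that improves the robustness of multi-modal 3D object detection under domain generalization. Its three components (Query-Decoupled Loss, LiDAR-Guided Depth Prior, and Complementary Cross-Modal Masking) address modality degradation and the underuse of visual cues caused by LiDAR dominance, giving more stable detection in cross-domain conditions such as rain and nighttime.

Details

Motivation: Multi-modal 3D object detection degrades substantially when deployed across domains. The authors identify two limiting factors: in challenging domains (such as rain or nighttime) one modality may degrade severely, and the LiDAR branch often dominates detection, so visual cues are underused and the system becomes fragile when point clouds are compromised.

Result: Extensive experiments show substantial gains over state-of-the-art baselines on domain generalization benchmarks while preserving source-domain performance. The abstract gives no quantitative figures but notes that code and models are publicly available.

Insight: The contributions are: (1) Query-Decoupled Loss, which gives independent supervision to queries from different modalities and rebalances gradient flow; (2) LiDAR-Guided Depth Prior, which strengthens the geometric prior of 2D queries by probabilistically fusing image-predicted and LiDAR-derived depth distributions; and (3) Complementary Cross-Modal Masking, which encourages multi-modal queries to compete in the fused decoder and promotes adaptive fusion. Viewed objectively, the structured design mitigates modality imbalance and degradation and offers a new approach to robust multi-modal fusion.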

Abstract: Multi-modal fusion has emerged as a promising paradigm for accurate 3D object detection. However, performance degrades substantially when deployed in target domains different from training. In this work, focusing on dual-branch proposal-level detectors, we identify two factors that limit robust cross-domain generalization: 1) in challenging domains such as rain or nighttime, one modality may undergo severe degradation; 2) the LiDAR branch often dominates the detection process, leading to systematic underutilization of visual cues and vulnerability when point clouds are compromised. To address these challenges, we propose three components. First, Query-Decoupled Loss provides independent supervision for 2D-only, 3D-only, and fused queries, rebalancing gradient flow across modalities. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors through probabilistic fusion of image-predicted and LiDAR-derived depth distributions, improving their spatial initialization. Third, Complementary Cross-Modal Masking applies complementary spatial masks to the image and point cloud, encouraging queries from both modalities to compete within the fused decoder and thereby promoting adaptive fusion. Extensive experiments demonstrate substantial gains over state-of-the-art baselines while preserving source-domain performance. Code and models are publicly available at https://github.com/IMPL-Lab/CCF.
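
One simple way to read "probabilistic fusion of image-predicted and LiDAR-derived depth distributions" is a normalized product of two discrete distributions over depth bins. That product-of-experts rule is an assumption made here for illustration, not the formulation confirmed by the abstract; the bin centers and probabilities are made-up values.

```python
def fuse_depth_distributions(image_probs, lidar_probs, eps=1e-8):
    """Normalized elementwise product of two discrete depth distributions (assumed fusion rule)."""
    product = [(a + eps) * (b + eps) for a, b in zip(image_probs, lidar_probs)]
    total = sum(product)
    return [value / total for value in product]


def expected_depth(probs, bin_centers):
    """Mean depth under a discrete distribution over depth bins."""
    return sum(p * d for p, d in zip(probs, bin_centers))


bin_centers = [5.0, 10.0, 15.0, 20.0]  # metres, hypothetical bins
image_probs = [0.1, 0.4, 0.4, 0.1]     # ambiguous image-only estimate
lidar_probs = [0.0, 0.1, 0.8, 0.1]     # sharper LiDAR-derived estimate
fused = fuse_depth_distributions(image_probs, lidar_probs)
depth = expected_depth(fused, bin_centers)
```

The fused distribution resolves the image branch's ambiguity between the 10 m and 15 m bins, which is the kind of geometric prior a 2D query could use for its spatial initialization.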

[96] Mamba-driven MRI-to-CT Synthesis for MRI-only Radiotherapy Planning cs.CV

Konstantinos Barmpounakis, Theodoros P. Vagenas, Maria Vakalopoulou, George K. Matsopoulos

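TL;DR: This paper explores Mamba-based models for MRI-to-CT image synthesis as an alternative to the prevailing nnU-Net frameworks, with the goal of MRI-only radiotherapy planning. The U-Mamba and SegMamba architectures are adapted for cross-modality image generation and validated on the SynthRAD2025 dataset.

Details

Motivation: The work aims to advance MRI-only radiotherapy planning, which reduces patient exposure to ionizing radiation and avoids inter-modality registration errors, and to examine the advantages of state-space models over standard convolutional neural networks for cross-modality translation.

Result: On three anatomical regions of the SynthRAD2025 dataset, quantitative evaluation with image similarity metrics in Hounsfield units and segmentation metrics from TotalSegmentator indicates that the 3D Mamba architecture synthesizes CT accurately while keeping inference fast.

Insight: The novelty lies in applying Mamba architectures to cross-modality medical image generation, using state-space modeling to capture complex volumetric features and long-range dependencies, which opens a route to integrating state-space models into radiotherapy workflows.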

Abstract: Radiotherapy workflows for oncological patients increasingly rely on multi-modal medical imaging, commonly involving both Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). MRI-only treatment planning has emerged as an attractive alternative, as it reduces patient exposure to ionizing radiation and avoids errors introduced by inter-modality registration. While nnU-Net-based frameworks are predominantly used for MRI-to-CT synthesis, we explore Mamba-based architectures for this task, aiming to showcase the advantages of state-space modeling for cross-modality translation compared to standard convolutional neural networks. Specifically, we adapt both the U-Mamba and SegMamba architectures, originally proposed for segmentation, to perform cross-modality image generation. Our 3D Mamba architecture effectively captures complex volumetric features and long-range dependencies, thus allowing accurate CT synthesis while maintaining fast inference times. Experiments were conducted on a subset of the SynthRAD2025 dataset, comprising registered single-channel MRI-CT volume pairs across three anatomical regions. Quantitative evaluation is performed via a combination of image similarity metrics computed in Hounsfield Units (HU) and segmentation-based metrics obtained from TotalSegmentator to ensure geometric consistency is preserved. The findings pave the way for the integration of state-space models into radiotherapy workflows.
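The abstract does not list which image similarity metrics are computed in HU. As an illustration only, mean absolute error restricted to a body mask is one commonly used choice; the sketch below assumes flattened voxel lists and hypothetical values.

```python
def mae_hu(synthetic_ct, reference_ct, body_mask):
    """Mean absolute error in Hounsfield units over voxels where the mask is set.

    Inputs are flat sequences of equal length; this metric is an assumed
    stand-in, not necessarily the one used in the paper.
    """
    total, count = 0.0, 0
    for s, r, m in zip(synthetic_ct, reference_ct, body_mask):
        if m:
            total += abs(s - r)
            count += 1
    if count == 0:
        raise ValueError("body_mask selects no voxels")
    return total / count


# Hypothetical voxels: air (masked out), soft tissue, trabecular and cortical bone.
synthetic = [-1000.0, 40.0, 300.0, 1200.0]
reference = [-990.0, 50.0, 280.0, 1000.0]
mask = [0, 1, 1, 1]
error = mae_hu(synthetic, reference, mask)
```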

[97] Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression cs.CV | cs.AI

V. K. Cody Bumgardner, Mitchell A. Klusty, Mahmut S. Gokmen, Evan W. Damron

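TL;DR: This paper presents Ker-VLJEPA-3B, a four-phase curriculum learning framework that generates free-text radiology reports from thoracic 3D CT scans. Progressive training aligns a Llama 3.2 3B decoder with a visual encoder (LeJEPA ViT-Large) pre-trained by self-supervision on unlabeled CT, and techniques such as zone-constrained compression address long sequences, class imbalance, and neglected visual information.

Details

Motivation: Automatic report generation from 3D CT volumes faces three challenges: extremely long sequences, severe class imbalance, and the tendency of large language models to ignore visual tokens and rely on linguistic priors.

Result: On the CT-RATE benchmark (2,984 validation volumes, 18 classes), Ker-VLJEPA-3B reaches a macro F1 of 0.429, exceeding the previous best model U-VLM (macro F1 = 0.414) by 3.6% in relative terms, and 0.448 (+8.2%) after threshold optimization. Ablations indicate that 56.6% of generation quality derives from patient-specific visual content.

Insight: The main contributions are: 1) a language-free self-supervised visual encoder that yields modality-pure representations, with vision-language alignment deferred to later phases; 2) zone-constrained cross-attention that compresses slice embeddings into 32 spatially grounded visual tokens; 3) a positive-findings-only strategy that avoids posterior collapse; 4) selective cross-attention freezing with elastic weight consolidation to prevent catastrophic forgetting. The framework is modality-agnostic and can integrate any self-supervised encoder into an LLM.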

Abstract: Automated radiology report generation from 3D computed tomography (CT) volumes is challenging due to extreme sequence lengths, severe class imbalance, and the tendency of large language models (LLMs) to ignore visual tokens in favor of linguistic priors. We present Ker-VLJEPA-3B, a four-phase curriculum learning framework for free-text report generation from thoracic CT volumes. A phased training curriculum progressively adapts a Llama 3.2 3B decoder to ground its output in visual features from a frozen, self-supervised encoder. Our visual backbone (LeJEPA ViT-Large) is trained via self-supervised joint-embedding prediction on unlabeled CTs, without text supervision. Unlike contrastive models (CLIP, BiomedCLIP), this language-free backbone yields modality-pure representations. Vision-language alignment is deferred to the curriculum’s bridge and generation phases. This modality-agnostic design can integrate any self-supervised encoder into an LLM without paired text during foundation training. Methodological innovations include: (1) zone-constrained cross-attention compressing slice embeddings into 32 spatially-grounded visual tokens; (2) PCA whitening of anisotropic LLM embeddings; (3) a positive-findings-only strategy eliminating posterior collapse; (4) warm bridge initialization transferring projection weights; and (5) selective cross-attention freezing with elastic weight consolidation to prevent catastrophic forgetting. Evaluated on the CT-RATE benchmark (2,984 validation volumes, 18 classes), Ker-VLJEPA-3B achieves a macro F1 of 0.429, surpassing the state-of-the-art (U-VLM, macro F1 = 0.414) by 3.6%, and reaching 0.448 (+8.2%) with threshold optimization. Ablation studies confirm 56.6% of generation quality derives from patient-specific visual content. Code and weights are available.
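The percentage gains quoted in this abstract are relative to the U-VLM baseline, not absolute differences in macro F1. A short check using only the figures stated above (the helper name is hypothetical):

```python
def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


baseline = 0.414        # U-VLM macro F1, as reported in the abstract
default_f1 = 0.429      # Ker-VLJEPA-3B macro F1
tuned_f1 = 0.448        # with threshold optimization

default_gain_pct = round(relative_gain(default_f1, baseline) * 100, 1)
tuned_gain_pct = round(relative_gain(tuned_f1, baseline) * 100, 1)
absolute_gain = round(default_f1 - baseline, 3)  # 0.015 macro F1 points
```

The two rounded percentages reproduce the 3.6% and 8.2% in the abstract, while the absolute gap at default thresholds is 0.015.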

[98] ARGENT: Adaptive Hierarchical Image-Text Representations cs.CV | cs.LG

Chuong Huynh, Hossein Souri, Abhinav Kumar, Vitali Petsiuk, Deen Dayal Mohan

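TL;DR: This paper proposes ARGENT, an adaptive hierarchical image-text representation model that addresses the hierarchy collapse caused by unstable entailment losses in existing hyperbolic vision-language models (VLMs). An adaptive entailment loss and a norm regularizer let the model learn hierarchical representations stably, and an angle-based probabilistic entailment protocol (PEP) evaluates hierarchical understanding more reliably.

Details

Motivation: Large-scale VLMs such as CLIP learn representations in Euclidean space and cannot effectively capture the inherent hierarchy of visual and linguistic concepts. Hyperbolic geometry can in theory embed hierarchies with low distortion, but the entailment losses of existing hyperbolic VLMs are unstable and prone to hierarchy collapse, and current evaluation methods are unreliable.

Result: ARGENT improves on the best existing hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively, setting a new state of the art.

Insight: The contributions are the combination of an adaptive entailment loss with a norm regularizer, which prevents cone collapse in hyperbolic space, and the angle-based probabilistic entailment protocol (PEP), scored with AUC-ROC and Average Precision, which gives a more reliable framework for evaluating hierarchical understanding.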

Abstract: Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics and is prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline, ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation). ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics, respectively.
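The cone-collapse behaviour described in this abstract can be seen numerically. The sketch below assumes a half-aperture of the form arcsin(2K / (sqrt(c) * norm)), similar to what earlier entailment-cone losses use; the abstract does not give the formula, and the constants `k` and `curvature` are illustrative. The `min(1.0, ...)` clamp only keeps `asin` defined and corresponds to the heuristic clipping that the proposed loss is said to avoid.

```python
import math


def half_aperture(norm, curvature=1.0, k=0.1):
    """Half-aperture (radians) of an entailment cone rooted at an embedding.

    Assumed form: asin(2k / (sqrt(curvature) * norm)), clamped so the
    argument of asin never exceeds 1.
    """
    if norm <= 0:
        raise ValueError("norm must be positive")
    arg = 2.0 * k / (math.sqrt(curvature) * norm)
    return math.asin(min(1.0, arg))


# As the parent embedding contracts toward the origin, the cone widens
# until it saturates at pi/2, i.e. a half-space.
apertures = [half_aperture(n) for n in (2.0, 1.0, 0.5, 0.1)]
```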

[99] Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors cs.CV

Chuanqing Zhuang, Xin Lu, Zehui Deng, Zhengda Lu, Yiqun Wang

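TL;DR: This paper proposes PFGS360, a pose-free omnidirectional 3D Gaussian Splatting method that reconstructs 3D Gaussians from unposed omnidirectional videos. A spherical consistency-aware pose estimation module recovers camera poses by building 2D-3D correspondences from the Gaussians' internal depth priors, and a depth-inlier-aware densification module uses monocular depth priors to extract depth inliers and Gaussian outliers for efficient densification and photorealistic novel view synthesis. Experiments show that the method significantly outperforms existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos.

Details

Motivation: Existing omnidirectional 3D Gaussian Splatting methods usually depend on slow structure-from-motion to supply camera poses and sparse point priors, which limits their use. This paper tackles direct reconstruction of 3D Gaussians from unposed omnidirectional video, producing high-quality scene representations without prior pose estimation.

Result: On real-world and synthetic 360-degree videos, PFGS360 significantly outperforms existing pose-free and pose-aware 3D Gaussian Splatting methods, reaching a new state of the art.

Insight: The contributions are: 1) a spherical consistency-aware pose estimation module that uses the Gaussians' internal depth priors to establish 2D-3D correspondences and recover camera poses for unposed video; 2) a depth-inlier-aware densification module that uses monocular depth priors to refine the Gaussian distribution and improve the realism of novel view synthesis. These ideas may carry over to other Gaussian-splatting scene representation tasks.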

Abstract: Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse point priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D-3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians’ internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photorealistic novel view synthesis. Experiments show that PFGS360 significantly outperforms existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos. Code is available at https://github.com/zcq15/PFGS360.
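Establishing 2D-3D correspondences on panoramas requires mapping each pixel to a ray on the unit sphere and scaling it by depth. The sketch below shows the standard equirectangular mapping under an assumed axis convention (+z forward at the image center, +y up); the abstract does not specify the paper's convention, and the function names are hypothetical.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Unit ray direction for pixel (u, v) of an equirectangular image.

    Assumed convention: longitude 0 at the image center, +y up, +z forward.
    """
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def backproject(u, v, depth, width, height):
    """3D point in the camera frame for a pixel with a known depth along its ray."""
    ray = equirect_pixel_to_ray(u, v, width, height)
    return tuple(depth * c for c in ray)


center_ray = equirect_pixel_to_ray(511.5, 255.5, 1024, 512)
point = backproject(100, 40, 3.0, 1024, 512)
```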

[100] ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images cs.CV

Yunfeng Wu, Hongying Cheng, Zihao He, Songhua Liu

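TL;DR: This paper proposes ViBe, a pure image adaptation framework that upgrades a pre-trained video diffusion model into an ultra-high-resolution video generator without any high-resolution video training data. A two-stage Relay LoRA strategy first aligns the image and video modalities using low-resolution images and then strengthens spatial extrapolation using high-resolution images, and a high-frequency-aware training objective improves detail synthesis.

Details

Motivation: Transformer-based video diffusion models are hard to train end to end at ultra-high resolution because spatio-temporal attention is computationally expensive. The paper aims to remove this bottleneck by raising the resolution capability of a pre-trained model through image-only adaptation, avoiding the cost of direct training.

Result: On the VBench benchmark, ViBe scores 0.8 higher than the previous state-of-the-art models trained on high-resolution video, without using any high-resolution video data, and shows strong synthesis of visual detail.

Insight: The contributions are: 1) a two-stage Relay LoRA strategy that decouples modality alignment from learning spatial extrapolation; 2) a high-frequency-aware training objective that explicitly improves recovery of high-frequency detail through a reconstruction loss. The method offers a lightweight adaptation recipe for scaling video generation resolution efficiently.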

Abstract: Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.
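The quadratic cost of 3D attention noted in this abstract is easy to quantify. The sketch below counts tokens and pairwise attention interactions for illustrative settings; the patch size, frame count, and resolutions are assumptions, and latent compression by a VAE is ignored.

```python
def attention_tokens(frames, height, width, patch=16):
    """Number of spatio-temporal tokens for a clip split into square patches."""
    return frames * (height // patch) * (width // patch)


def attention_pairs(tokens):
    """Full self-attention compares every token with every token."""
    return tokens * tokens


# Hypothetical 16-frame clip at two resolutions.
low = attention_tokens(16, 512, 512)
high = attention_tokens(16, 1024, 1024)

token_ratio = high // low
pair_ratio = attention_pairs(high) // attention_pairs(low)
```

Doubling each spatial side quadruples the token count and multiplies the attention cost by sixteen, which is why adapting with images (a single frame each) is so much cheaper than training on high-resolution video.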

[101] Object Pose Transformer: Unifying Unseen Object Pose Estimation cs.CV

Weihang Li, Lorenzo Garattoni, Fabien Despinoy, Nassir Navab, Benjamin Busam

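TL;DR: This paper proposes Object Pose Transformer (OPT), a unified single-model feed-forward framework for pose estimation of unseen objects. By jointly predicting depth, point maps, camera parameters, and normalized object coordinates, the model handles both category-level absolute pose estimation and relative pose estimation for unseen objects, needs no semantic labels at inference time, and does not depend on known camera intrinsics.

Details

Motivation: Existing approaches to unseen-object pose estimation are limited: category-level methods depend on predefined taxonomies and predict only absolute pose, while relative pose methods cannot recover absolute pose from a single view.

Result: On several benchmarks, including NOCS, HouseCat6D, Omni6DPose, and Toyota-Light, the model achieves state-of-the-art performance on both absolute and relative pose estimation.

Insight: Task factorization unifies absolute and relative pose estimation in a single model; contrastive object-centric latent embeddings enable canonicalization without semantic labels at inference time; point maps serve as a camera-space representation that supports multi-view geometric reasoning; and cross-frame feature interaction with shared object embeddings exploits relative geometric consistency across views to reduce the ambiguity of single-view absolute pose estimation.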

Abstract: Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer (OPT), a unified feed-forward framework that bridges these paradigms through task factorization within a single model. OPT jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model leverages relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, OPT is camera-agnostic, learning camera intrinsics on-the-fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.
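For readers unfamiliar with normalized object coordinates, the sketch below shows the common NOCS convention: an object in its canonical frame is centered in the unit cube and scaled so that the diagonal of its tight bounding box has length 1. The abstract does not restate this convention, so treat it as an assumption; the function name and sample points are hypothetical.

```python
import math


def to_nocs(points):
    """Map canonical-frame 3D points into normalized object coordinates.

    Assumed convention: tight bounding box centered at (0.5, 0.5, 0.5)
    with its diagonal scaled to length 1.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    diagonal = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    if diagonal == 0:
        raise ValueError("points are degenerate (zero-size bounding box)")
    return [tuple((p[i] - center[i]) / diagonal + 0.5 for i in range(3)) for p in points]


# Hypothetical object whose bounding box measures 2 x 1 x 2 (diagonal 3).
object_points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (2.0, 1.0, 2.0)]
nocs_points = to_nocs(object_points)
```

A network that predicts such coordinates per pixel, together with camera-space points, gives the correspondences from which an absolute pose and scale can be fitted.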

[102] ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment cs.CV | cs.RO

Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi

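TL;DR: This paper presents ABot-PhysWorld, a 14-billion-parameter Diffusion Transformer that generates visually realistic, physically plausible, and action-controllable robot manipulation videos. To counter the physically implausible motion produced by existing world models, it builds on a dataset of three million manipulation clips with physics-aware annotation and uses a novel DPO-based post-training framework with decoupled discriminators. The paper also introduces EZSbench, the first training-independent embodied zero-shot benchmark, to assess generalization.

Details

Motivation: Existing video-based world models, trained on generic visual data with likelihood-based objectives that ignore physical laws, often produce physically implausible behavior such as object penetration and anti-gravity motion when simulating robot manipulation. This paper aims to generate manipulation videos that are physically more plausible.

Result: ABot-PhysWorld sets a new state of the art on PBench and the newly proposed EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency.

Insight: The main contributions are: 1) a large-scale robot manipulation video dataset with physics-aware annotation; 2) a DPO-based post-training framework with decoupled discriminators that suppresses unphysical behavior while preserving visual quality; 3) a parallel context block for precise spatial action injection, supporting cross-embodiment control; 4) EZSbench, the first training-independent embodied zero-shot benchmark, which combines real and synthetic unseen robot-task-scene combinations and uses a decoupled protocol to assess physical realism and action alignment separately.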

Abstract: Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
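As background for the DPO-based post-training mentioned in this abstract, the sketch below computes the standard Direct Preference Optimization loss for one preferred/rejected pair from log-likelihoods. The paper's diffusion-specific variant with decoupled discriminators will differ; the function name, `beta` value, and inputs are illustrative assumptions.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * ((logp_win - ref_logp_win)
                                - (logp_lose - ref_logp_lose))))
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy raises the preferred (physically plausible) sample and lowers the rejected one.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```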

[103] From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching cs.CV

Feifan Luo, Hongyang Chen

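TL;DR: This paper proposes Advanced Functional Maps, a unified and flexible framework for efficient and robust 3D shape matching. It replaces the fixed spectral basis with a learnable one, optimized through inhibition functions, so that feature extraction and the spectral basis are learned jointly end to end. The method includes a heat diffusion module and an unsupervised loss, needs no expensive solver, and significantly outperforms existing methods under non-isometric deformation and topological noise.

Details

Motivation: Existing deep functional map methods concentrate on learning feature representations and neglect optimization of the spectral basis, which leads to suboptimal matches; they also depend on slow conventional functional map solvers, which adds considerable computational cost. The paper fills these gaps with a unified framework that optimizes features and the spectral basis together.

Result: On non-rigid 3D shape matching, the method significantly outperforms state-of-the-art feature-learning approaches across multiple benchmarks, especially in challenging non-isometric and topological-noise scenarios, while remaining efficient.

Insight: The paper presents what it describes as the first unsupervised spectral basis learning method, formulating basis optimization as learnable inhibition functions and enabling joint end-to-end optimization of features and basis. It shows theoretically that optimizing the basis is equivalent to spectral convolution, with the inhibition functions acting as filters, which opens a direction for enhanced representations inspired by spectral graph networks.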

Abstract: Shape matching is a fundamental task in computer graphics and vision, with deep functional maps becoming a prominent paradigm. However, existing methods primarily focus on learning informative feature representations by constraining pointwise and functional maps, while neglecting the optimization of the spectral basis, a critical component of the functional map pipeline. This oversight often leads to suboptimal matching results. Furthermore, many current approaches rely on conventional, time-consuming functional map solvers, incurring significant computational overhead. To bridge these gaps, we introduce Advanced Functional Maps, a framework that generalizes standard functional maps by replacing fixed basis functions with learnable ones, supported by rigorous theoretical guarantees. Specifically, the spectral basis is optimized through a set of learned inhibition functions. Building on this, we propose the first unsupervised spectral basis learning method for robust non-rigid 3D shape matching, enabling the joint, end-to-end optimization of feature extraction and basis functions. Our approach incorporates a novel heat diffusion module and an unsupervised loss function, alongside a streamlined architecture that bypasses expensive solvers and auxiliary losses. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art feature-learning approaches, particularly in challenging non-isometric and topological noise scenarios, while maintaining high efficiency. Finally, we reveal that optimizing basis functions is equivalent to spectral convolution, where inhibition functions act as filters. This insight enables enhanced representations inspired by spectral graph networks, opening new avenues for future research. Our code is available at https://github.com/LuoFeifan77/Unsupervised-Spectral-Basis-Learning.
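The statement that inhibition functions act as spectral filters can be made concrete with a small sketch. A function on a shape, expressed by its coefficients in the Laplacian eigenbasis, is filtered by scaling each coefficient with a function of the corresponding eigenvalue. Here a fixed heat kernel exp(-t * lambda) stands in for the paper's learned inhibition functions, which are not specified in the abstract; all names and values are hypothetical.

```python
import math


def apply_spectral_filter(eigenvalues, coefficients, g):
    """Scale each spectral coefficient by g(eigenvalue)."""
    return [g(lam) * c for lam, c in zip(eigenvalues, coefficients)]


def heat_diffusion(eigenvalues, coefficients, t):
    """Heat diffusion for time t: a low-pass filter with g(lam) = exp(-t * lam)."""
    return apply_spectral_filter(eigenvalues, coefficients, lambda lam: math.exp(-t * lam))


# Hypothetical spectrum with three modes of equal initial weight.
eigs = [0.0, 1.0, 4.0]
coeffs = [1.0, 1.0, 1.0]
diffused = heat_diffusion(eigs, coeffs, t=0.5)
```

Replacing the fixed kernel with a trainable function of the eigenvalue is the kind of learnable filtering the abstract alludes to.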

[104] SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM cs.CV | cs.GR | cs.RO

Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li

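TL;DR: SIMART is a unified MLLM framework that decomposes a monolithic mesh into articulated assets ready for physical simulation. A sparse 3D VQ-VAE reduces the number of tokens, and part-level decomposition and kinematic prediction are performed jointly to produce high-quality interactive 3D objects.

Details

Motivation: Current 3D generation focuses on static meshes and lacks "sim-ready" interactive objects, while existing methods for creating articulated objects rely on multi-stage pipelines that accumulate errors. SIMART addresses this with a unified MLLM framework that performs static asset understanding and sim-ready asset generation in a single stage.

Result: SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets; its sparse 3D VQ-VAE uses 70% fewer tokens than dense voxel tokens, supports high-fidelity multi-part assemblies, and enables physics-based robotic simulation.

Insight: The contributions are the sparse 3D VQ-VAE, which lowers memory overhead and improves scalability, and the unified framework that handles decomposition and kinematic prediction jointly, avoiding the error accumulation of multi-stage pipelines and offering an efficient route to sim-ready assets.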

Abstract: High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in “sim-ready” interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.

[105] Harnessing Lightweight Transformer with Contextual Synergic Enhancement for Efficient 3D Medical Image Segmentation cs.CV | eess.IV

Xinyu Liu, Zhen Chen, Wuyang Li, Chenxin Li, Yixuan Yuan

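TL;DR: This paper proposes Light-UNETR, a lightweight transformer for efficient 3D medical image segmentation, together with a Contextual Synergic Enhancement learning strategy. A Lightweight Dimension Reductive Attention module and a Compact Gated Linear Unit improve model efficiency, and a synergic enhancement strategy that exploits extrinsic and intrinsic contextual information improves data efficiency.

Details

Motivation: Transformers for 3D medical image segmentation are computationally expensive and depend on large amounts of labeled data. The paper aims to improve both model efficiency (less computation and fewer parameters) and data efficiency (less reliance on labels).

Result: The method shows superior performance and efficiency on multiple benchmarks. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, it exceeds BCP by 1.43% Jaccard while reducing FLOPs by 90.8% and parameters by 85.8%.

Insight: The novelty is the combination of model lightweighting (the LIDR module and CGLU unit) with a semi-supervised learning strategy (CSE, which combines Attention-Guided Replacement and Spatial Masking Consistency) to address the computation and annotation bottlenecks together, giving an efficient solution for resource-constrained medical image analysis.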

Abstract: Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages the extrinsic contextual information to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency that utilizes intrinsic contextual information to enhance the spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing the FLOPs by 90.8% and parameters by 85.8%. Code is released at https://github.com/CUHK-AIM-Group/Light-UNETR.
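The Jaccard figure quoted in this abstract is the intersection-over-union of the predicted and reference segmentation masks. A minimal sketch on flattened binary masks follows; the voxel values are hypothetical, and returning 1.0 for two empty masks is a convention chosen here.

```python
def jaccard(pred, target):
    """Jaccard index (intersection over union) of two flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 5-voxel masks: 2 voxels overlap, 4 are covered by either mask.
predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
score = jaccard(predicted, reference)
```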

[106] I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation cs.CV

Jia Li, Han Yan, Yihang Chen, Siqi Li, Xibin Song

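TL;DR: This paper proposes I3DM, an implicit 3D-aware memory mechanism for long-term scene consistency in video generation. It retrieves memory in a 3D-aware way using intermediate features of a pre-trained feed-forward novel view synthesis model, and a 3D-aligned memory injection module implicitly warps historical content, improving revisit consistency and camera control precision without explicit 3D reconstruction.

Details

Motivation: Existing video generation methods struggle to maintain long-term scene consistency: approaches that build explicit 3D geometry suffer from error accumulation and scale ambiguity, while simple retrieval based on camera field of view usually fails under complex occlusion. The paper proposes a solution that avoids explicit 3D reconstruction.

Result: Extensive experiments show that the method outperforms state-of-the-art approaches on multiple metrics, with strong revisit consistency, generation fidelity, and camera control precision.

Insight: The core contribution is an implicit 3D-aware framework for memory retrieval and injection. The retrieval strategy scores view relevance with intermediate features of a pre-trained model, which improves robustness under occlusion; the 3D-aligned injection module implicitly warps historical content to the target view and adaptively conditions generation on reliably warped regions, improving consistency.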

Abstract: Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.
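The retrieval step in this abstract scores stored frames by relevance to the target view and keeps the best ones. The sketch below assumes each frame is summarized by a feature vector and that relevance is cosine similarity; the abstract does not say how the FF-NVS features are compared, so both choices, and all names, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_feature, memory_features, k=2):
    """Indices of the k memory frames most similar to the query, best first."""
    scored = [(cosine_similarity(query_feature, f), i) for i, f in enumerate(memory_features)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Hypothetical 2D features for four stored frames.
memory = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7], [-1.0, 0.0]]
selected = retrieve_top_k([1.0, 0.0], memory, k=2)
```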

[107] 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding cs.CV | cs.AI

Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang

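TL;DR: This paper proposes 3DCity-LLM, a unified framework for 3D city-scale vision-language perception and understanding. It uses a coarse-to-fine feature encoding strategy with three parallel branches for the target object, inter-object relationships, and the global scene. To support large-scale training, the authors build the 3DCity-LLM-1.2M dataset of about 1.2 million high-quality samples and propose a multi-dimensional evaluation protocol based on text similarity and LLM-based semantic assessment. Experiments show that the method significantly outperforms state-of-the-art methods on two benchmarks.

Details

Motivation: Multi-modality large language models perform well on object-centric and indoor scenes, but extending them to complex 3D city-scale environments remains a major challenge. The paper aims to close this gap and advance spatial reasoning and urban intelligence.

Result: Extensive experiments on two benchmarks show that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising approach to 3D city-scale understanding.

Insight: The main contributions are: 1) a unified framework for 3D city-scale perception with a coarse-to-fine three-branch feature encoding strategy; 2) the large-scale, high-quality, strictly quality-controlled 3DCity-LLM-1.2M dataset, which integrates explicit 3D numerical information and diverse user-oriented simulations; 3) a multi-dimensional evaluation protocol combining text similarity with LLM-based semantic assessment for faithful and comprehensive evaluation.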

Abstract: While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.

Gautam Rajendrakumar Gare, Neehar Peri, Matvei Popov, Shruti Jain, John Galeotti

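TL;DR: This paper proposes DetPO (Detection Prompt Optimization), a gradient-free test-time optimization method that improves multi-modal large language models (MLLMs) on few-shot object detection. It optimizes text-only prompts on a small set of visual training examples and calibrates prediction confidence, making effective use of in-context information where existing MLLMs fail to generalize to out-of-distribution classes and tasks.

Details

Motivation: State-of-the-art MLLMs perform well on standard object detection benchmarks but generalize poorly to out-of-distribution classes, tasks, and imaging modalities. In-context prompting is a common way to improve performance across tasks, yet the authors find that it often yields lower detection accuracy than prompting with class names alone, which suggests that current MLLMs cannot effectively use few-shot visual examples and rich textual descriptions for detection. Because frontier MLLMs are usually accessible only through APIs, and the best open-weights models are too expensive to fine-tune on consumer-grade hardware, the authors explore black-box prompt optimization instead.

Result: On the Roboflow20-VL and LVIS benchmarks, DetPO yields consistent improvements across generalist MLLMs, outperforming prior black-box approaches by up to 9.7%.

Insight: The core contribution is DetPO, a gradient-free, black-box, test-time prompt optimization framework designed for few-shot object detection. It refines text-only prompts by directly maximizing detection accuracy on a few visual examples while calibrating confidence, which avoids any need for access to model weights or costly fine-tuning and gives a practical way to improve MLLM detection in resource-constrained settings.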

Abstract: Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO
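The idea of gradient-free prompt optimization in this abstract can be illustrated with a generic greedy search. This is a sketch under stated assumptions and not the DetPO algorithm: `propose` and `score` are hypothetical callbacks (in practice `score` would query an MLLM and measure detection accuracy on the few-shot examples), and confidence calibration is omitted.

```python
def optimize_prompt(seed_prompt, propose, score, rounds=3):
    """Greedy black-box search: keep the candidate prompt with the best score.

    propose(prompt) returns candidate rewrites; score(prompt) returns a number
    to maximize. Neither needs gradients or access to model weights.
    """
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        for candidate in propose(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins so the sketch runs without any model or API.
def toy_propose(prompt):
    return [prompt + " " + word for word in ("small", "metal", "round")]


def toy_score(prompt):
    # Pretend that mentioning these attributes improves detection accuracy.
    return len(set(prompt.split()) & {"metal", "round"})


best_prompt, best_value = optimize_prompt("bolt", toy_propose, toy_score)
```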

[109] RealMaster: Lifting Rendered Scenes into Photorealistic Video cs.CV

Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni

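TL;DR: RealMaster uses video diffusion models to lift rendered video into photorealistic video while staying fully aligned with the 3D engine's output, addressing the lack of precise control and 3D consistency in existing video generation models. It builds a paired dataset with an anchor-based propagation strategy and trains an IC-LoRA model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and running inference without anchor frames.

Details

Motivation: Video generation models achieve strong photorealism but lack precise control over scene elements and cannot guarantee 3D consistency; 3D engines provide fine-grained control and native 3D consistency, but their output often falls into the "uncanny valley". The paper aims to bridge the sim-to-real gap, which requires output that preserves the structural precision of the input geometry and dynamics while holistically transforming materials, lighting, and textures to reach photorealism.

Result: Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.

Insight: The contributions are an anchor-based propagation strategy that generates the paired dataset, using geometric conditioning cues to make intermediate frames more realistic, and an IC-LoRA model that distills the pipeline's high-quality output into a model that generalizes, handles newly appearing objects, and runs without anchor frames. Viewed objectively, the method combines the structural control of 3D engines with the photorealistic generation of diffusion models and offers a scalable approach to sim-to-real conversion.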

Abstract: State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the “uncanny valley”. Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline’s constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.

[110] InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting cs.CV | cs.AI

Duc Vu, Kien Nguyen, Trong-Tung Nguyen, Ngan Nguyen, Phong Nguyen

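TL;DR: This paper proposes InverFill, which uses one-step inversion to inject semantic information from the masked input image into the initial noise, enabling high-quality few-step diffusion inpainting with markedly better results and text consistency.

Details

Motivation: Existing diffusion-based inpainting methods need many sampling steps and are computationally expensive; few-step text-to-image models applied directly to inpainting perform poorly because random Gaussian noise initialization causes semantic misalignment and artifacts.

Result: At low NFEs (numbers of function evaluations), InverFill significantly improves on baseline few-step models and can even match specialized inpainting models, improving image quality and text coherence with only minimal added inference overhead.

Insight: The novelty is a one-step inversion that injects semantic information into the initial noise to improve few-step inpainting; it needs no real-image supervision or costly retraining and reuses existing text-to-image models for efficient inpainting.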

Abstract: Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.
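The blended sampling pipeline referred to in this abstract combines, at each step, the generated content inside the mask with the known content outside it. The sketch below shows only that blending operation on flat pixel lists; in actual blended sampling the known region is usually re-noised to the current timestep first, which is omitted here, and the names and values are hypothetical.

```python
def blend_step(generated, known, mask):
    """Blend one sampling step: mask value 1 marks pixels to inpaint.

    Inside the mask the generated value is kept; outside it the known
    (background) value is restored.
    """
    return [m * g + (1 - m) * k for g, k, m in zip(generated, known, mask)]


# Hypothetical 4-pixel example with the middle two pixels masked.
generated_pixels = [0.9, 0.2, 0.4, 0.1]
known_pixels = [0.5, 0.5, 0.5, 0.5]
inpaint_mask = [0, 1, 1, 0]
blended = blend_step(generated_pixels, known_pixels, inpaint_mask)
```

With random initial noise the generated and known regions can disagree at their boundary, which is the harmonization problem the abstract attributes to noise initialization.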

[111] UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation cs.CV

Jiaying Lin, Dan Xu

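TL;DR: UniFunc3D is a unified, training-free framework for functionality segmentation in 3D scenes. It treats a multimodal large language model as an active observer, consolidating semantic, temporal, and spatial reasoning into a single forward pass to ground natural-language instructions in masks of fine-grained interactive elements. A coarse-to-fine active spatial-temporal grounding strategy adaptively selects video frames and focuses on high-detail interactive parts while preserving the global context needed for disambiguation.

Details

Motivation: Existing 3D functionality segmentation methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing, and they are limited by single-scale, passive, heuristic frame selection.

Result: UniFunc3D achieves state-of-the-art performance on the SceneFun3D benchmark with a relative mIoU improvement of 59.9%, surpassing both training-free and training-based methods by a large margin without any task-specific training.

Insight: The novelty is a unified active spatial-temporal grounding framework that uses the multimodal large language model as an active observer for joint reasoning, with a coarse-to-fine strategy for adaptive focus. This avoids the visual blindness of conventional pipelines and yields end-to-end instruction grounding.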

Abstract: Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.

[112] TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation cs.CV

Jini Yang, Eunbeen Hong, Soowon Son, Hyunkoo Lee, Sunghwan Hong

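TL;DR: This paper proposes TETO (Tracking Events with Teacher Observation), a teacher-student knowledge distillation framework that learns motion estimation from only about 25 minutes of unannotated real-world event-camera data, then uses the estimated motion as a prior to condition a pretrained video diffusion transformer for frame interpolation.

Details

Motivation: Existing event-based motion estimators depend on large-scale synthetic data and suffer a significant sim-to-real gap. The goal is to learn accurate motion estimation from a small amount of unannotated real data.

Result: TETO achieves state-of-the-art point tracking on EVIMO2 and state-of-the-art optical flow estimation on DSEC while using orders of magnitude less training data. On BS-ERGB and HQ-EVFI, accurate motion estimation translates directly into better frame interpolation quality.

Insight: The contributions are: (1) knowledge distillation from a pretrained RGB tracker to learn from very little real event data; (2) motion-aware data curation and query sampling that disentangle object motion from dominant ego-motion; (3) using predicted point trajectories and dense optical flow as explicit motion priors to condition a pretrained video diffusion model for frame interpolation.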

Abstract: Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.

[113] One View Is Enough! Monocular Training for In-the-Wild Novel View Generation cs.CV

Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard

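TL;DR: The paper proposes OVIE, which trains on single images only, with no multi-view image-pair supervision. It uses a monocular depth estimator as a geometric scaffold and a masked training formulation to handle disocclusions, trains on 30 million unpaired internet images, and achieves zero-shot novel view generation in the wild. It needs no geometry at inference and is 600x faster than the second-best baseline.

Details

Motivation: Conventional monocular novel-view synthesis relies on multi-view image pairs for supervision, which limits the scale and diversity of training data. This work trains on single images only, so that large-scale unpaired internet images can be used to improve generalization.

Result: In the zero-shot setting, OVIE outperforms prior methods on in-the-wild images and runs 600x faster than the second-best baseline. Code and models are publicly available.

Insight: The contributions are the use of a monocular depth estimator as a training-time geometric scaffold and a masked training formulation for disoccluded regions. Together they enable efficient novel view generation trained on single images, with no geometric component at inference.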

Abstract: Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
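
The lift, transform, and project steps used to build a pseudo-target view can be illustrated for a single pixel. This is a minimal sketch under assumptions that are not from the paper: a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, and a camera motion that is a pure translation (no rotation).

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D point in the source camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D point in a camera frame back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def warp_pixel(u, v, depth, translation, fx, fy, cx, cy):
    """Find where a source pixel lands after the camera moves by `translation`."""
    x, y, z = unproject(u, v, depth, fx, fy, cx, cy)
    tx, ty, tz = translation
    # A camera shifted by t sees the same world point at p - t (translation only).
    moved = (x - tx, y - ty, z - tz)
    return project(moved, fx, fy, cx, cy)


# Hypothetical intrinsics: 100 px focal length, principal point at (50, 50).
near = warp_pixel(50.0, 50.0, 2.0, (0.5, 0.0, 0.0), 100.0, 100.0, 50.0, 50.0)
far = warp_pixel(50.0, 50.0, 4.0, (0.5, 0.0, 0.0), 100.0, 100.0, 50.0, 50.0)
```

Nearer points shift further than distant ones, so some target pixels receive no source pixel at all. Those are the disoccluded regions that a masked loss would exclude.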

[114] AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation cs.CV

Woojeong Jin, Jaeho Lee, Heeseong Shin, Seungho Jang, Junhwan Heo

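TL;DR: This paper proposes AgentRVOS, a training-free agentic pipeline for zero-shot referring video object segmentation (RVOS). SAM3 generates reliable mask tracks over the full spatio-temporal extent of the video as object-level evidence. A multimodal large language model (MLLM) then identifies the target through query-grounded reasoning over this evidence, with iterative pruning guided by SAM3’s temporal existence information. Experiments show state-of-the-art performance among training-free methods on multiple benchmarks.

Details

Motivation: Existing training-free RVOS methods typically have the MLLM select keyframes and ground the target first, after which a segmentation model propagates the result. This forces the MLLM to make temporal decisions before any object-level evidence is available, limiting reasoning quality and spatio-temporal coverage. This work aims to remove that limitation.

Result: Extensive experiments on multiple benchmarks show that AgentRVOS achieves state-of-the-art (SOTA) performance among training-free methods, with consistent results across different MLLM backbones.

Insight: The core idea is to invert the pipeline: a strong vision foundation model (SAM3) first provides reliable object-level perceptual evidence (mask tracks) over the full spatio-temporal extent, and the MLLM then reasons over this evidence to identify the target. This avoids forcing premature temporal decisions on the MLLM and adds iterative pruning based on temporal existence information to refine the reasoning.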

Abstract: Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3’s temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.

[115] Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation cs.CV

Brian Chao, Lior Yariv, Howard Xiao, Gordon Wetzstein

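TL;DR: This paper proposes Foveated Diffusion, an efficient, spatially adaptive method for image and video generation. It exploits the eccentricity-dependent acuity of human vision: when the user’s gaze location is known or can be estimated (for example, by eye tracking), tokens are allocated non-uniformly, with high density in the region around the gaze (the foveal region) and low density in the periphery. Images or videos generated in this mixed-resolution token setting are perceptually indistinguishable from full-resolution generation while using far fewer tokens and much less generation time.

Details

Motivation: Diffusion and flow matching models face an efficiency challenge when generating high-resolution, high-frame-rate, long-context content, because computational complexity grows quadratically with the number of generated tokens. The work uses properties of human vision to improve generation efficiency in settings where the gaze location is known.

Result: Extensive quantitative analysis and a carefully designed user study validate the method: results are perceptually indistinguishable from full-resolution generation, with substantially lower token counts and generation time, establishing foveation as a practical and scalable axis for efficient generation.

Insight: The core contribution is to bring foveation systematically into the diffusion generation process. A principled mechanism constructs mixed-resolution tokens directly from high-resolution data, and a foveated diffusion model can be post-trained from an existing base model while keeping content consistent across resolutions.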

Abstract: Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user’s gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.
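
The token savings from non-uniform allocation can be estimated with a small counting sketch. It assumes a simple two-level scheme that is not taken from the paper: blocks of `factor` x `factor` patches whose centre lies within `fovea_radius` of the gaze keep one token per patch, and every other block is merged into a single token. The function name and all parameter values are hypothetical.

```python
def foveated_token_count(grid_h, grid_w, gaze, fovea_radius, factor):
    """Count tokens for a two-level (foveal / peripheral) patch layout."""
    gaze_y, gaze_x = gaze
    tokens = 0
    for by in range(0, grid_h, factor):
        for bx in range(0, grid_w, factor):
            block_h = min(factor, grid_h - by)
            block_w = min(factor, grid_w - bx)
            centre_y = by + factor / 2
            centre_x = bx + factor / 2
            dist_sq = (centre_y - gaze_y) ** 2 + (centre_x - gaze_x) ** 2
            if dist_sq <= fovea_radius ** 2:
                tokens += block_h * block_w   # foveal block: full resolution
            else:
                tokens += 1                   # peripheral block: one coarse token
    return tokens


full = 32 * 32
foveated = foveated_token_count(32, 32, gaze=(16, 16), fovea_radius=8, factor=4)
token_ratio = foveated / full
attention_ratio = token_ratio ** 2   # self-attention cost scales quadratically
```

In this toy layout roughly a quarter of the tokens remain, and because attention cost is quadratic in token count, the attention cost falls by a much larger factor than the token count does.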

[116] VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions cs.CV | cs.AI | cs.LG

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos

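TL;DR: This paper proposes VISOR, a method for improving the inference efficiency of large vision-language models (LVLMs). Unlike conventional visual token compression, VISOR cuts cost by sparsifying and dynamically selecting the interactions between visual and language tokens while retaining the full visual information.

Details

Motivation: Existing approaches to LVLM efficiency are mainly based on visual token compression, which creates an information bottleneck and hurts performance on complex tasks that require fine-grained understanding and reasoning. A new paradigm that does not discard visual information is needed.

Result: Across multiple benchmarks, VISOR drastically reduces computational cost while matching or exceeding state-of-the-art (SOTA) results, and it does particularly well on challenging tasks that require detailed visual understanding.

Insight: The novelty is to optimize vision-language interaction by sparsifying and dynamically selecting attention layers rather than compressing the image. Specifically, a small set of strategically placed attention layers is used (efficient cross-attention supplies general visual context, and dynamically selected self-attention layers refine the visual representations when needed), together with a lightweight policy mechanism that allocates visual computation according to per-sample complexity.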

Abstract: Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.

[117] WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG cs.CV

Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu

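TL;DR: This paper presents WildWorld, a large-scale, action-conditioned world modeling dataset with over 108 million frames, collected automatically from the photorealistic AAA action role-playing game Monster Hunter: Wilds. It provides more than 450 actions and explicit per-frame state annotations, including character skeletons, world states, camera poses, and depth maps, to address the lack of diverse semantic actions and explicit state in existing datasets.

Details

Motivation: Existing video world model datasets usually lack diverse, semantically meaningful action spaces, and their actions are tied directly to visual observations rather than mediated by underlying states. Actions therefore become entangled with pixel-level changes, making it hard to learn structured world dynamics and maintain long-horizon consistency.

Result: The authors derive the WildBench evaluation benchmark from WildWorld and evaluate models on Action Following and State Alignment tasks. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency.

Insight: The core contribution is a large-scale, action-driven world modeling dataset with explicit state annotations. It underlines the importance of state-aware video generation and supplies key data for learning structured world dynamics.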

Abstract: Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.

[118] DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models cs.CV

Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim

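TL;DR: This paper proposes DA-Flow, a degradation-aware optical flow estimation method based on diffusion models, targeting the severe performance drop of existing models on degraded real-world video (blur, noise, compression artifacts). The key idea is to use the intermediate representations of an image restoration diffusion model as degradation-aware features and to make them temporally aware through full spatio-temporal attention, yielding a hybrid architecture that fuses diffusion features with convolutional features.

Details

Motivation: Optical flow models trained on high-quality data degrade severely when faced with degradations common in real-world video, such as blur, noise, and compression artifacts. The paper therefore formulates a new task, degradation-aware optical flow estimation.

Result: Across multiple benchmarks, DA-Flow substantially outperforms existing optical flow methods under severe degradation.

Insight: The paper shows that the intermediate representations of image restoration diffusion models are inherently degradation-aware but lack temporal awareness. Adding full spatio-temporal attention gives them zero-shot correspondence capability, and the resulting features are fused with convolutional features in an iterative refinement framework, offering a new approach to optical flow estimation on degraded video.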

Abstract: Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.

[119] UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation cs.CV

Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao

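TL;DR: This paper proposes UniGRPO, a unified reinforcement learning framework for interleaved generation (for example, reasoning first and then generating an image). It models multimodal generation as a Markov decision process with sparse terminal rewards and uses GRPO to jointly optimize the text reasoning and image generation policies. Two key modifications (eliminating classifier-free guidance, and replacing the latent KL penalty with an MSE penalty on the velocity fields) keep the method scalable to multi-round interleaved generation.

Details

Motivation: Unified models show promise for interleaved generation (such as interleaved text and images), but no unified reinforcement learning training framework has been designed for it. The paper addresses optimization of the basic unit, a single round of reasoning-driven image generation (the model first expands the user prompt through reasoning and then synthesizes the image), and lays the groundwork for training fully interleaved models.

Result: Experiments show that the unified training recipe significantly improves image generation quality through reasoning, providing a robust and scalable baseline for future training of fully interleaved models.

Insight: The contributions are: (1) a unified MDP formulation and GRPO optimization framework for interleaved generation; (2) two key modifications to FlowGRPO for scalability: eliminating classifier-free guidance to keep rollouts linear, and replacing the latent KL penalty with a velocity-field MSE penalty to mitigate reward hacking more effectively. Together these give a simple, effective recipe for post-training on complex multi-turn, multi-condition generation tasks.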

Abstract: Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
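
Two ingredients named in the abstract can be sketched in a few lines: the group-relative advantage used by GRPO, and an MSE penalty between velocity fields. This is a minimal sketch, not the UniGRPO training code. The function names are hypothetical, the normalization uses the population standard deviation (implementations differ on this detail), and velocities are flattened to plain lists.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def velocity_mse_penalty(policy_velocity, reference_velocity):
    """Mean squared error between the policy's and the reference model's velocity fields."""
    squared = [(p - q) ** 2 for p, q in zip(policy_velocity, reference_velocity)]
    return sum(squared) / len(squared)


# Four rollouts for one prompt, each with a sparse terminal reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
penalty = velocity_mse_penalty([1.0, 2.0], [1.0, 0.0])
```

Because the advantage is computed within a group of rollouts for the same prompt, no separate value network is needed, and the MSE term acts directly on the quantity the flow model predicts.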

cs.AI

[120] The Efficiency Attenuation Phenomenon: A Computational Challenge to the Language of Thought Hypothesis cs.AI | cs.CL | cs.LG

Di Zhang

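TL;DR: Through computational experiments, this paper asks whether thought requires a language-like format, challenging the Language of Thought hypothesis. The author proposes the “AI Private Language” thought experiment: if two agents develop an efficient but inscrutable communication protocol through multi-agent reinforcement learning, and their performance declines when they are forced to use a human-comprehensible language, this “Efficiency Attenuation Phenomenon” challenges the hypothesis. In a cooperative navigation task under partial observability, agents using an emergent protocol are 50.5% more efficient than agents using a predefined human-like symbolic protocol, confirming the phenomenon.

Details

Motivation: The aim is to test the Language of Thought hypothesis, that is, whether thought requires a language-like format, by using computational experiments to examine whether efficient collaborative cognition must depend on symbolic structures.

Result: In a cooperative navigation task under partial observability, agents using an emergent communication protocol are 50.5% more efficient than agents using a predefined human-like symbolic protocol, confirming the Efficiency Attenuation Phenomenon.

Insight: The novelty lies in the Efficiency Attenuation Phenomenon and the “AI Private Language” thought experiment, which suggest that optimal collaborative cognition may be naturally coupled with sub-symbolic computation rather than symbolic structures. This offers computational evidence for pluralism in cognitive architectures and connects philosophy, cognitive science, and AI.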

Abstract: This paper computationally investigates whether thought requires a language-like format, as posited by the Language of Thought (LoT) hypothesis. We introduce the “AI Private Language” thought experiment: if two artificial agents develop an efficient, inscrutable communication protocol via multi-agent reinforcement learning (MARL), and their performance declines when forced to use a human-comprehensible language, this Efficiency Attenuation Phenomenon (EAP) challenges the LoT. We formalize this in a cooperative navigation task under partial observability. Results show that agents with an emergent protocol achieve 50.5% higher efficiency than those using a pre-defined, human-like symbolic protocol, confirming the EAP. This suggests optimal collaborative cognition in these systems is not mediated by symbolic structures but is naturally coupled with sub-symbolic computations. The work bridges philosophy, cognitive science, and AI, arguing for pluralism in cognitive architectures and highlighting implications for AI ethics.

q-bio.NC

[121] Ca2+ transient detection and segmentation with the Astronomically motivated algorithm for Background Estimation And Transient Segmentation (Astro-BEATS) q-bio.NC | astro-ph.IM | cs.CV

Bolin Fan, Anthony Bilodeau, Frederic Beaupre, Theresa Wiesner, Christian Gagne

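TL;DR: This paper presents Astro-BEATS, an automatic segmentation algorithm for detecting and segmenting miniature synaptic calcium transients in fluorescence microscopy. It borrows techniques used in astronomy to detect astronomical transients, combining image estimation and source-finding methods to process calcium-imaging videos and tackle the challenge of detecting faint signals.

Details

Motivation: In fluorescence calcium imaging, miniature synaptic calcium transients produce only subtle changes in the fluorescence signal, barely above baseline, which makes automated detection and segmentation very challenging. Detecting transients in astronomy poses a similar problem and requires algorithms that stay robust across a large field of view with varying noise properties, so the authors adapt astronomical techniques to this neuroscience problem.

Result: Astro-BEATS outperforms current threshold-based approaches for synaptic calcium transient detection and segmentation.

Insight: The novelty is transferring astronomical background estimation and transient segmentation techniques (such as image estimation and source finding) to calcium imaging in neuroscience to handle faint signals. The method applies to new datasets without re-optimization, is fast, and can efficiently produce annotations for training deep learning models, giving supervised deep learning approaches a high-quality foundation.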

Abstract: Fluorescence-based Ca$^{2+}$-imaging is a powerful tool for studying localized neuronal activity, including miniature Synaptic Calcium Transients, providing real-time insights into synaptic activity. These transients induce only subtle changes in the fluorescence signal, often barely above baseline, which poses a significant challenge for automated synaptic transient detection and segmentation. Detecting astronomical transients similarly requires efficient algorithms that will remain robust over a large field of view with varying noise properties. We leverage techniques used in astronomical transient detection for miniature Synaptic Calcium Transient detection in fluorescence microscopy. We present Astro-BEATS, an automatic miniature Synaptic Calcium Transient segmentation algorithm that incorporates image estimation and source-finding techniques used in astronomy and designed for Ca$^{2+}$-imaging videos. Astro-BEATS outperforms current threshold-based approaches for synaptic Ca$^{2+}$ transient detection and segmentation. The produced segmentation masks can be used to train a supervised deep learning algorithm for improved synaptic Ca$^{2+}$ transient detection in Ca$^{2+}$-imaging data. The speed of Astro-BEATS and its applicability to previously unseen datasets without re-optimization makes it particularly useful for generating training datasets for deep learning-based approaches.
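
The general pattern of estimating a background and then flagging sources above it can be illustrated with sigma clipping, a technique commonly used in astronomy. This is a generic sketch on a 1-D intensity trace and not the Astro-BEATS algorithm, whose details are not given in the abstract. The function names, the 3-sigma threshold, and the toy trace are assumptions.

```python
import statistics


def sigma_clipped_background(values, n_sigma=3.0, iterations=5):
    """Estimate background level and noise by repeatedly discarding outliers."""
    kept = list(values)
    for _ in range(iterations):
        median = statistics.median(kept)
        std = statistics.pstdev(kept)
        clipped = [v for v in kept if abs(v - median) <= n_sigma * std]
        if len(clipped) == len(kept):
            break   # nothing was rejected, so the estimate has converged
        kept = clipped
    return statistics.median(kept), statistics.pstdev(kept)


def detect_transients(trace, n_sigma=3.0):
    """Return indices whose intensity exceeds background + n_sigma * noise."""
    background, noise = sigma_clipped_background(trace, n_sigma)
    return [i for i, v in enumerate(trace) if v > background + n_sigma * noise]


# Toy trace: a noisy baseline near 10 with one brief event at index 7.
trace = [10.0, 10.2, 10.0, 10.2, 10.0, 10.2, 10.0, 15.0, 10.2, 10.0, 10.2, 10.0]
events = detect_transients(trace)
```

Clipping first keeps the event itself from inflating the noise estimate, which matters when the signal is only slightly above baseline.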

cs.RO

[122] Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion cs.RO | cs.CV

Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou

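TL;DR: This paper proposes a framework that improves autonomous driving for sidewalk micromobility robots through corrective behavior expansion and multi-scale imitation learning, addressing the compounding errors, poor robustness, and weak generalization of conventional imitation learning in complex urban environments.

Details

Motivation: Current learning-based control methods perform poorly in complex urban sidewalk environments. Imitation learning relies on fixed offline data, which leads to compounding errors, limited robustness, and poor generalization. A method is needed that learns to recover from mistakes and captures multi-scale behavior.

Result: Real-world experiments show significantly improved robustness and generalization in diverse sidewalk scenarios. The abstract does not mention specific benchmarks or quantitative comparisons with SOTA.

Insight: The contributions are enriching the dataset through corrective behavior expansion and sensor augmentation, and a multi-scale imitation learning architecture with horizon-based trajectory clustering and hierarchical supervision that learns both short-horizon interactive behaviors and long-horizon goal-directed intentions.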

Abstract: Sidewalk micromobility is a promising solution for last-mile transportation, but current learning-based control methods struggle in complex urban environments. Imitation learning (IL) learns policies from human demonstrations, yet its reliance on fixed offline data often leads to compounding errors, limited robustness, and poor generalization. To address these challenges, we propose a framework that advances IL through corrective behavior expansion and multi-scale imitation learning. On the data side, we augment teleoperation datasets with diverse corrective behaviors and sensor augmentations to enable the policy to learn to recover from its own mistakes. On the model side, we introduce a multi-scale IL architecture that captures both short-horizon interactive behaviors and long-horizon goal-directed intentions via horizon-based trajectory clustering and hierarchical supervision. Real-world experiments show that our approach significantly improves robustness and generalization in diverse sidewalk scenarios.

[123] VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs cs.RO | cs.AI | cs.CV | cs.LG

Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo

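TL;DR: This paper proposes the Video-Tactile-Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal to strengthen physical interaction in contact-rich scenarios. Through lightweight modality transfer finetuning, it adds tactile streams to a pretrained video transformer, with no tactile-language paired data or independent tactile pretraining, enabling efficient cross-modal representation learning.

Details

Motivation: Existing Video-Action Models (VAMs) are limited in contact-rich scenarios when they rely on vision alone, because fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, which leads to unstable or imprecise behavior.

Result: VTAM performs strongly on contact-rich manipulation tasks, maintaining a robust 90% average success rate. In challenging scenarios that require high-fidelity force awareness (such as potato chip pick-and-place), it outperforms the π0.5 baseline by 80%.

Insight: The contributions are a tactile regularization loss that stabilizes multimodal fusion and prevents visual latent dominance, and lightweight modality transfer that enables cross-modal learning without paired data. The findings indicate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, offering a scalable route to physically grounded embodied foundation models.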

Abstract: Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile-Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90% on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the π0.5 baseline by 80%. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.

cs.LG

[124] Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits cs.LG | cs.CL | stat.ML

Eric Czech, Zhiwei Xu, Yael Elmatad, Yixin Wang, William Held

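TL;DR: This paper analyzes the systematic biases that Chinchilla Approach 2 introduces when fitting neural scaling laws, showing errors in its compute-optimal allocation estimates. It advocates an improved Chinchilla Approach 3 that exploits the partially linear structure of the objective via Variable Projection to achieve unbiased, numerically stable parameter inference.

Details

Motivation: Chinchilla Approach 2 is a widely used method for fitting neural scaling laws, yet its parabolic approximation introduces systematic biases in compute-optimal allocation estimates, even on noise-free synthetic data. This misallocates model parameters and compute, causing significant wasted compute and opportunity cost.

Result: On Llama 3 IsoFLOP data, the bias from Approach 2 implies a parameter underallocation corresponding to 6.5% of the 3.8×10^25 FLOP training budget and about USD 1.4 million (90% CI: USD 412K to USD 2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even higher opportunity costs.

Insight: The paper systematically analyzes three sources of Approach 2 bias (IsoFLOP sampling grid width, uncentered sampling, and loss surface asymmetry) and presents an improved version of Approach 3. Variable Projection exploits the partially linear structure of the objective to turn the five-parameter optimization into a two-dimensional problem that is well-conditioned, analytically differentiable, and amenable to dense grid search, giving an unbiased, stable, and easier-to-implement scaling law fit.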

Abstract: Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute-optimal allocation estimates, even on noise-free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the $3.8\times10^{25}$ FLOP training budget and \$1.4M (90% CI: \$412K to \$2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry ($\alpha \neq \beta$). Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data-efficient, numerically unstable, prone to local minima, and harder to implement. Each concern is shown to be unfounded or addressable, especially when the partially linear structure of the objective is exploited via Variable Projection, enabling unbiased inference on all five loss surface parameters through a two-dimensional optimization that is well-conditioned, analytically differentiable, and amenable to dense, or even exhaustive, grid search. It may serve as a more convenient replacement for Approach 2 or a more scalable alternative for adaptations of Approach 3 to richer scaling law formulations.
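
The partially linear structure can be made concrete with the Chinchilla loss form L(N, D) = E + A·N^(-α) + B·D^(-β): once α and β are fixed, the loss is linear in E, A, and B, so those three follow from ordinary least squares and only (α, β) needs a search. The sketch below is a minimal pure-Python illustration on noise-free synthetic data, not the authors' implementation. The parameter values, the data grid, and all function names are assumptions chosen for the example.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))) / a[r][r]
    return x


def fit_linear_part(alpha, beta, data):
    """For fixed exponents, solve least squares for (E, A, B); return (sse, coeffs)."""
    rows = [(1.0, n ** -alpha, d ** -beta) for n, d, _ in data]
    losses = [loss for _, _, loss in data]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * y for r, y in zip(rows, losses)) for i in range(3)]
    coeffs = solve3(normal, rhs)
    sse = sum((sum(c * v for c, v in zip(coeffs, r)) - y) ** 2
              for r, y in zip(rows, losses))
    return sse, coeffs


def variable_projection_grid(data, grid):
    """Grid-search the two nonlinear parameters; the linear ones are projected out."""
    best = None
    for alpha in grid:
        for beta in grid:
            sse, coeffs = fit_linear_part(alpha, beta, data)
            if best is None or sse < best[0]:
                best = (sse, alpha, beta, coeffs)
    return best


# Illustrative ground truth (assumed values, not taken from the paper).
E, A, ALPHA, B, BETA = 1.7, 400.0, 0.34, 410.0, 0.28
data = [(n, d, E + A * n ** -ALPHA + B * d ** -BETA)
        for n in (1e6, 1e7, 1e8, 1e9) for d in (1e8, 1e9, 1e10, 1e11)]
grid = [round(0.20 + 0.02 * i, 2) for i in range(11)]   # 0.20, 0.22, ..., 0.40
best = variable_projection_grid(data, grid)
```

The search space here is only the 11 x 11 grid over (α, β), which is why exhaustive search is feasible. The sketch solves the normal equations for brevity; a QR-based solver would be the more numerically careful choice on real data.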

[125] Off-Policy Value-Based Reinforcement Learning for Large Language Models cs.LG | cs.CL

Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu

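TL;DR: This paper proposes ReVal, a value-based reinforcement learning method for training large language models. It combines stepwise internal-consistency signals with trajectory-level outcome-verification signals and supports off-policy learning from a replay buffer, improving data utilization efficiency.

Details

Motivation: The dominant RL methods for LLMs are mainly on-policy: each batch of data is used once and then discarded, giving poor sample efficiency, especially on long-horizon tasks where generating trajectories is expensive.

Result: On standard mathematical reasoning benchmarks, ReVal converges faster and reaches better final performance than GRPO. On DeepSeek-R1-Distill-1.5B, ReVal improves by 2.7% on AIME24 and by 4.5% on the out-of-domain benchmark GPQA.

Insight: The novelty is a Bellman-update-based value-function RL framework for LLMs that naturally supports off-policy learning and experience replay, enabling more efficient data reuse and offering a practical alternative to policy-based methods.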

Abstract: Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they use each batch of data for a single update, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvements over GRPO of 2.7% on AIME24 and 4.5% on the out-of-domain benchmark GPQA. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.

[126] TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration cs.LG | cs.CV

Chunxiao Li, Lijun Li, Jing Shao

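TL;DR: This paper presents TreeTeaming, a framework for automated red teaming of safety vulnerabilities in vision-language models (VLMs). A strategic orchestrator driven by a large language model (LLM) turns strategy exploration from static testing into a dynamic, evolutionary discovery process: it autonomously decides whether to evolve promising attack paths or explore diverse strategy branches, dynamically building and expanding a strategy tree. A multimodal actuator executes these complex strategies.

Details

Motivation: Existing red teaming methods are constrained by an inherently linear exploration paradigm: they can only optimize within a predefined strategy set and cannot discover novel, diverse attacks. Overcoming this requires a paradigm that explores diverse attack strategies dynamically and autonomously.

Result: In experiments on 12 prominent VLMs, TreeTeaming achieves state-of-the-art attack success rates on 11 models, outperforming existing methods and reaching up to 87.60% on GPT-4o. It also shows greater strategy diversity than the union of previously public jailbreak strategies, and its generated attacks have 23.09% lower toxicity on average, demonstrating their stealth and subtlety.

Insight: The core idea is to recast red teaming as a dynamic, evolutionary exploration of a strategy tree, with an LLM-driven orchestrator that autonomously evolves strategies and explores branches, moving beyond static heuristics. This introduces a new paradigm for automated vulnerability discovery and underscores the need for proactive exploration beyond static rules.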

Abstract: The rapid advancement of Vision-Language Models (VLMs) has brought their safety vulnerabilities into sharp focus. However, existing red teaming methods are fundamentally constrained by an inherent linear exploration paradigm, confining them to optimizing within a predefined strategy set and preventing the discovery of novel, diverse exploits. To transcend this limitation, we introduce TreeTeaming, an automated red teaming framework that reframes strategy exploration from static testing to a dynamic, evolutionary discovery process. At its core lies a strategic Orchestrator, powered by a Large Language Model (LLM), which autonomously decides whether to evolve promising attack paths or explore diverse strategic branches, thereby dynamically constructing and expanding a strategy tree. A multimodal actuator is then tasked with executing these complex strategies. In the experiments across 12 prominent VLMs, TreeTeaming achieves state-of-the-art attack success rates on 11 models, outperforming existing methods and reaching up to 87.60% on GPT-4o. The framework also demonstrates superior strategic diversity over the union of previously public jailbreak strategies. Furthermore, the generated attacks exhibit an average toxicity reduction of 23.09%, showcasing their stealth and subtlety. Our work introduces a new paradigm for automated vulnerability discovery, underscoring the necessity of proactive exploration beyond static heuristics to secure frontier AI models.

[127] Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards cs.LG | cs.CV

Orhun Buğra Baran, Melih Kandemir, Ramazan Gokberk Cinbis

TL;DR: 本文提出了一种轻量级强化学习框架，用于优化自回归图像生成模型。该方法将基于token的自回归合成建模为马尔可夫决策过程，并通过组相对策略优化进行优化。其核心创新是引入了一种新颖的分布级奖励LOO-FID，结合实例级奖励和自适应熵正则化，以同时提升生成样本的质量和多样性。

Details

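Motivation: Standard maximum-likelihood training of autoregressive image models does not directly optimize sample quality or diversity. Existing reinforcement learning methods for aligning diffusion models typically suffer from output diversity collapse, while concurrent RL methods for autoregressive models rely strictly on instance-level rewards and often trade distributional coverage for quality. This paper aims to address these limitations.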

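Result: Extensive experiments on the LlamaGen and VQGAN architectures show clear improvements on standard quality and diversity metrics after only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, thereby avoiding its 2x inference cost.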

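Insight: The paper claims three innovations: 1) formalizing autoregressive image synthesis as an MDP and optimizing it with GRPO; 2) introducing a novel distribution-level LOO-FID reward that uses an exponential moving average of feature moments to explicitly encourage sample diversity and prevent mode collapse; 3) combining the distribution-level reward with instance-level rewards (CLIP and HPSv2) and stabilizing the multi-objective learning with adaptive entropy regularization. Viewed objectively, the method balances quality and diversity within a lightweight framework and reduces inference overhead, which makes it a useful reference.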

Abstract: Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, and bypass its 2x inference cost.
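
Sketch: The two reward ingredients named in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation. The group-relative advantage (each reward minus the group mean, divided by the group standard deviation) is the standard GRPO normalization. The leave-one-out reward is an assumed form: a sample is rewarded by how much a Fréchet distance to reference moments grows when that sample is removed from the batch. Diagonal covariances, 1-D toy features, the absence of the exponential moving average, and all function names (`group_relative_advantages`, `diag_frechet`, `moments`, `leave_one_out_rewards`) are hypothetical simplifications chosen for illustration.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one sampled group: (r - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def moments(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    dims = list(zip(*features))
    mu = [fmean(d) for d in dims]
    var = [fmean([(x - m) ** 2 for x in d]) for d, m in zip(dims, mu)]
    return mu, var


def diag_frechet(mu_a, var_a, mu_b, var_b):
    """Frechet distance between two Gaussians, ASSUMING diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb)
                   for va, vb in zip(var_a, var_b))
    return mean_term + cov_term


def leave_one_out_rewards(features, ref_mu, ref_var):
    """ASSUMED reward: distance without sample i minus distance with the full batch.

    A positive value means the batch matches the reference better with sample i
    than without it.
    """
    d_full = diag_frechet(*moments(features), ref_mu, ref_var)
    rewards = []
    for i in range(len(features)):
        rest = features[:i] + features[i + 1:]
        rewards.append(diag_frechet(*moments(rest), ref_mu, ref_var) - d_full)
    return rewards


# Toy batch: three identical samples and one that adds spread.
features = [[0.0], [0.0], [0.0], [2.0]]
rewards = leave_one_out_rewards(features, ref_mu=[0.5], ref_var=[1.0])
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy batch the sample that adds spread receives the largest reward and the only positive advantage, which is the qualitative behavior a diversity-aware, distribution-level reward is meant to produce.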

Table of Contents

cs.CL

[1] Evaluating Prompting Strategies for Chart Question Answering with Large Language Models cs.CL | cs.AI | cs.LG

[2] MERIT: Memory-Enhanced Retrieval for Interpretable Knowledge Tracing cs.CL | cs.AI

[3] TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs cs.CL | cs.AI | cs.LG

[4] Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs cs.CL | cs.AI | cs.LG

[5] Towards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception cs.CL | cs.SI

[6] CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context cs.CL

[7] Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models? cs.CL | cs.AI

[8] How Utilitarian Are OpenAI’s Models Really? Replicating and Reinterpreting Pfeffer, Krügel, and Uhl (2025) cs.CL | cs.CY

[9] Explanation Generation for Contradiction Reconciliation with LLMs cs.CL

[10] PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation cs.CL

[11] When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning cs.CL | cs.AI | cs.LG

[12] Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts cs.CL

[13] Avoiding Over-smoothing in Social Media Rumor Detection with Pre-trained Propagation Tree Transformer cs.CL | cs.AI

[14] Quality Over Clicks: Intrinsic Quality-Driven Iterative Reinforcement Learning for Cold-Start E-Commerce Query Suggestion cs.CL

[15] DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube cs.CL | cs.AI | cs.LG

[16] Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation cs.CL | cs.CY

[17] PaperVoyager : Building Interactive Web with Visual Language Models cs.CL

[18] When Language Models Lose Their Mind: The Consequences of Brain Misalignment cs.CL

[19] ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment cs.CL | cs.AI | stat.AP

[20] I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes cs.CL

[21] WISTERIA: Weak Implicit Signal-based Temporal Relation Extraction with Attention cs.CL | cs.AI

cs.CV

[22] Founder effects shape the evolutionary dynamics of multimodality in open LLM families cs.CV | cs.AI | cs.CL

[23] From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs cs.CV | cs.AI | cs.CL

[24] When Visuals Aren’t the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations cs.CV | cs.AI

[25] Efficient Universal Perception Encoder cs.CV

[26] Static Scene Reconstruction from Dynamic Egocentric Videos cs.CV | cs.GR

[27] MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding cs.CV

[28] Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing cs.CV | cs.AI | cs.HC | cs.MM

[29] Tiny Inference-Time Scaling with Latent Verifiers cs.CV | cs.AI | cs.MM

[30] Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation cs.CV

[31] Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos cs.CV | cs.AI | cs.CL

[32] CanViT: Toward Active-Vision Foundation Models cs.CV

[33] A vision-language model and platform for temporally mapping surgery from video cs.CV | cs.RO

[34] Language Models Can Explain Visual Features via Steering cs.CV | cs.AI

[35] TrajLoom: Dense Future Trajectory Generation from Video cs.CV

[36] Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off cs.CV

[37] A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images cs.CV

[38] To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models cs.CV | cs.AI

[39] CAM3R: Camera-Agnostic Model for 3D Reconstruction cs.CV

[40] Q-Tacit: Image Quality Assessment via Latent Visual Reasoning cs.CV

[41] GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning cs.CV

[42] Think 360°: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth cs.CV

[43] WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment cs.CV | cs.AI

[44] How Far Can VLMs Go for Visual Bug Detection? Studying 19,738 Keyframes from 41 Hours of Gameplay Videos cs.CV | cs.SE

[45] SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts cs.CV

[46] MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding cs.CV

[47] Multimodal Industrial Anomaly Detection via Geometric Prior cs.CV

[48] Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning cs.CV | cs.LG

[49] ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding cs.CV

[50] From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery cs.CV

[51] Typography-Based Monocular Distance Estimation Framework for Vehicle Safety Systems cs.CV

[52] Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models cs.CV

[53] PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding cs.CV | cs.AI | cs.RO

[54] Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding cs.CV | cs.AI

[55] TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment cs.CV | cs.AI

[56] Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference cs.CV

[57] MVRD-Bench: Multi-View Learning and Benchmarking for Dynamic Remote Photoplethysmography under Occlusion cs.CV

[58] Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought cs.CV

[59] Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction cs.CV

[60] ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance cs.CV

[61] Group Editing : Edit Multiple Images in One Go cs.CV

[62] SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes cs.CV

[63] Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation cs.CV | cs.LG

[64] ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling cs.CV | cs.AI

[65] When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse cs.CV

[66] EVA: Efficient Reinforcement Learning for End-to-End Video Agent cs.CV | cs.AI | cs.CL

[67] FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification cs.CV | cs.LG

[68] YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception cs.CV | cs.AI | cs.CL | cs.LG | cs.RO

[69] Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining cs.CV

[70] WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion cs.CV

[71] VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models cs.CV

[72] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning cs.CV | cs.CL

[73] VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought cs.CV

[74] SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning cs.CV | cs.CL

[75] MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage cs.CV | cs.AI | cs.CL

[76] Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps cs.CV

[77] Traffic Sign Recognition in Autonomous Driving: Dataset, Benchmark, and Field Experiment cs.CV