cs.CL [Total: 29]
cs.CV [Total: 68]
eess.AS [Total: 2]
cs.AI [Total: 2]
cs.SE [Total: 1]
cs.CR [Total: 1]
cs.LG [Total: 7]
cs.RO [Total: 4]
eess.IV [Total: 1]

cs.CL

[1] Reasoning-Based Personalized Generation for Users with Sparse Data cs.CL | cs.AI

Bo Ni, Branislav Kveton, Samyadeep Basu, Subhojyoti Mukherjee, Leyao Wang

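TL;DR: The paper proposes GraSPer (Graph-based Sparse Personalized Reasoning), a framework for improving personalized text generation when user context is sparse. It augments the user context by predicting items the user is likely to interact with in the future, generates text for those interactions to enrich the augmented context, and then produces personalized output conditioned on both the real and the synthetic history.

Details

Motivation: Real-world users, such as cold-start users on social platforms or newly registered customers on e-commerce platforms, often have sparse interaction histories and limited personal context, which degrades LLM-based personalized generation.

Result: Extensive experiments on three benchmark personalized generation datasets show that GraSPer achieves significant performance gains and substantially improves personalization in sparse user context settings.

Insight: The novelty lies in using reasoning alignment to predict future interaction items and generate their text, synthesizing an augmented context that supports better personalized generation under sparse data. The idea of combining graph-based prediction with text generation to enrich context can be reused to mitigate cold-start problems.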

Abstract: Large Language Model (LLM) personalization holds great promise for tailoring responses by leveraging personal context and history. However, real-world users usually have sparse interaction histories with limited personal context, such as cold-start users on social platforms and newly registered customers on online e-commerce platforms, which compromises LLM-based personalized generation. To address this challenge, we introduce GraSPer (Graph-based Sparse Personalized Reasoning), a novel framework for enhancing personalized text generation under sparse context. GraSPer first augments user context by predicting items that the user would likely interact with in the future. With reasoning alignment, it then generates texts for these interactions to enrich the augmented context. Finally, it generates personalized outputs conditioned on both the real and synthetic histories, ensuring alignment with user style and preferences. Extensive experiments on three benchmark personalized generation datasets show that GraSPer achieves significant performance gains, substantially improving personalization in sparse user context settings.

[2] Field-Theoretic Memory for AI Agents: Continuous Dynamics for Context Preservation cs.CL | cs.AI | cs.LG

Subhadip Mitra

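TL;DR: This paper proposes a field-theoretic memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than as discrete database entries. The system shows marked gains on long-context benchmarks, and in multi-agent scenarios field coupling yields near-perfect collective intelligence.

Details

Motivation: To address the limits of conventional discrete memory systems in retaining information and reasoning over long contexts and multi-turn dialogue, the paper borrows from classical field theory and models memory as a continuous dynamic field, aiming for more natural and robust context preservation.

Result: On LongMemEval, the method improves F1 on multi-session reasoning by 116% (p<0.01), temporal reasoning by 43.8% (p<0.001), and retrieval recall on knowledge updates by 27.8% (p<0.001). Multi-agent experiments reach more than 99.8% collective intelligence through field coupling.

Insight: The novelty is bringing the continuous dynamics of field theory into a memory system: diffusion through semantic space, importance-based thermodynamic decay, and interaction via field coupling together provide a new theoretical framework for long-context processing. Viewed objectively, this physics-inspired continuous representation may improve the robustness and interpretability of memory.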

Abstract: We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi-agent scenarios. We evaluate the system on two established long-context benchmarks: LoCoMo (ACL 2024) with 300-turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi-session reasoning over 500+ turns. On LongMemEval, the field-theoretic approach achieves significant improvements: +116% F1 on multi-session reasoning (p<0.01, d = 3.06), +43.8% on temporal reasoning (p<0.001, d = 9.21), and +27.8% retrieval recall on knowledge updates (p<0.001, d = 5.00). Multi-agent experiments show near-perfect collective intelligence (>99.8%) through field coupling. Code is available at github.com/rotalabs/rotalabs-fieldmem.
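
The abstract describes memories that diffuse through semantic space and decay according to importance. The following is a minimal sketch of one such update step, not the paper's actual equations: it assumes a 1-D grid as a stand-in for semantic space, an explicit Euler step, and a decay rate that shrinks as importance grows. The function name `field_step` and all parameter values are hypothetical.

```python
def field_step(field, importance, diffusion=0.1, base_decay=0.05, dt=1.0):
    """One explicit Euler step of du/dt = D * laplacian(u) - decay_i * u.

    Assumption: decay_i = base_decay * (1 - importance_i), so important
    cells fade more slowly. This is an illustrative choice only.
    """
    n = len(field)
    new_field = []
    for i in range(n):
        # Reflecting boundaries: a missing neighbour mirrors the cell itself.
        left = field[i - 1] if i > 0 else field[i]
        right = field[i + 1] if i < n - 1 else field[i]
        laplacian = left - 2.0 * field[i] + right
        decay = base_decay * (1.0 - importance[i])
        new_field.append(field[i] + dt * (diffusion * laplacian - decay * field[i]))
    return new_field


# A single stored memory at the centre of the grid, marked as important.
field = [0.0, 0.0, 1.0, 0.0, 0.0]
importance = [0.0, 0.0, 0.9, 0.0, 0.0]
after = field_step(field, importance)
```

After one step the central value spreads symmetrically to its neighbours and the total activation shrinks slightly, which is the qualitative behaviour the abstract attributes to diffusion plus decay.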

[3] Task-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases cs.CL | cs.AI | cs.LG

Riya Adsul, Balachandra Devarangadi Sunil, Isha Nalawade, Sudharshan Govindan

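TL;DR: The paper proposes a dynamic LoRA adapter composition framework based on similarity retrieval in vector databases. It builds a task-aware vector store by embedding training examples from 22 datasets. At inference time it retrieves similar training examples and dynamically merges the relevant LoRA adapters with a retrieval-weighted fusion strategy, achieving zero-shot generalization across NLP tasks.

Details

Motivation: Parameter-efficient fine-tuning methods such as LoRA struggle to compose multiple specialized adapters for unseen tasks. The goal is scalable multitask learning without retraining the full model.

Result: Linear merging reaches 70.95% on PIQA and 77.62% on RTE, well above the single-task baselines (46% and 52%), and the dataset-centric retrieval approach often matches or exceeds individually fine-tuned task-specific adapters.

Insight: The novelty is building the vector database from pretrained embeddings with no retriever training, and computing task similarity distributions via nucleus sampling to fuse adapters dynamically. This offers an interpretable and scalable approach to parameter-efficient multitask learning.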

Abstract: Parameter-efficient fine-tuning methods like LoRA have enabled task-specific adaptation of large language models, but efficiently composing multiple specialized adapters for unseen tasks remains challenging. We present a novel framework for dynamic LoRA adapter composition that leverages similarity retrieval in vector databases to enable zero-shot generalization across diverse NLP tasks. Our approach constructs a task-aware vector database by embedding training examples from 22 datasets spanning commonsense reasoning, question answering, natural language inference, and sentiment analysis. At inference time, we retrieve the most similar training examples, compute task similarity distributions via nucleus sampling, and dynamically merge relevant LoRA adapters using retrieval-weighted fusion strategies. We evaluated four merging methods (Linear, Concatenation, TIES, and Magnitude Prune), demonstrating that our dataset-centric retrieval approach often matches or exceeds the performance of individually fine-tuned task-specific adapters. Notably, Linear merging achieves 70.95% on PIQA and 77.62% on RTE, substantially outperforming single-task baselines (46% and 52%, respectively). Our framework requires no additional retriever training, operates with frozen embeddings, and enables efficient, interpretable adapter composition. These results suggest that retrieval-based dynamic merging offers a promising direction for scalable, parameter-efficient multitask learning without requiring full model retraining for each new task.
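
The inference-time procedure in the abstract (retrieve neighbours, form a task similarity distribution, truncate it, merge adapters) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: "nucleus sampling" is read here as top-p truncation of the task distribution, the retrieval step is replaced by a hard-coded list of neighbour task labels, and flat lists stand in for LoRA weight deltas. All function names are hypothetical.

```python
from collections import Counter


def task_distribution(retrieved_tasks):
    """Turn the task labels of retrieved neighbours into a probability distribution."""
    counts = Counter(retrieved_tasks)
    total = sum(counts.values())
    return {task: c / total for task, c in counts.items()}


def nucleus(dist, top_p=0.85):
    """Keep the smallest set of tasks whose cumulative mass reaches top_p, then renormalise."""
    kept, mass = {}, 0.0
    for task, p in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept[task] = p
        mass += p
        if mass >= top_p:
            break
    return {task: p / mass for task, p in kept.items()}


def linear_merge(adapters, weights):
    """Weighted sum of adapter parameter vectors (flat lists stand in for LoRA deltas)."""
    size = len(next(iter(adapters.values())))
    merged = [0.0] * size
    for task, w in weights.items():
        for i, value in enumerate(adapters[task]):
            merged[i] += w * value
    return merged


# Hypothetical retrieval result: task labels of the 10 nearest training examples.
retrieved = ["piqa"] * 6 + ["rte"] * 3 + ["sst2"]
adapters = {"piqa": [1.0, 0.0], "rte": [0.0, 1.0], "sst2": [5.0, 5.0]}
weights = nucleus(task_distribution(retrieved), top_p=0.85)
merged = linear_merge(adapters, weights)
```

In this toy run the low-mass `sst2` adapter falls outside the nucleus, so only `piqa` and `rte` contribute, with weights of roughly 2/3 and 1/3.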

[4] Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal cs.CL | cs.AI | cs.LG

Mohammed Hamdan, Vincenzo Dentamaro, Giuseppe Pirlo, Mohamed Cheriet

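TL;DR: The paper studies the efficiency gains of progressive data scheduling (a curriculum learning strategy) on document understanding models with different architectures: text-only BERT and multimodal LayoutLMv3. Experiments on the FUNSD and CORD benchmarks show that the strategy cuts training time by about 33% and yields a significant performance gain for the capacity-constrained BERT model but no additional benefit for LayoutLMv3, indicating that the benefit depends on the interaction between model capacity and task complexity.

Details

Motivation: To test whether progressive data scheduling (curriculum learning) yields consistent efficiency gains across architecturally distinct document understanding models, and to separate gains due to reduced compute from gains due to the schedule itself.

Result: On FUNSD, progressive scheduling reduces BERT training time by about 33% and significantly improves F1 over the matched-compute baseline (Standard-7), which controls for total gradient updates (ΔF1 = +0.023, p=0.022). LayoutLMv3 shows no significant gain (p=0.621). On CORD, all schedules reach similarly high performance (F1 ≥ 0.947), indicating a performance ceiling. Ablations confirm that the efficiency gain comes mainly from reduced data volume rather than ordering.

Insight: Progressive data scheduling is a reliable compute-reduction strategy, but its curriculum-specific gains (beyond compute reduction alone) depend on the interaction between model capacity and task complexity. It helps capacity-constrained text-only models such as BERT on complex tasks, and offers no extra benefit for multimodal models with sufficient inductive bias, such as LayoutLMv3, or on simple tasks. This gives empirical grounding for efficient data scheduling in model training.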

Abstract: We investigate whether progressive data scheduling – a curriculum learning strategy that incrementally increases training data exposure (33%$\rightarrow$67%$\rightarrow$100%) – yields consistent efficiency gains across architecturally distinct document understanding models. By evaluating BERT (text-only, 110M parameters) and LayoutLMv3 (multimodal, 126M parameters) on the FUNSD and CORD benchmarks, we establish that this schedule reduces wall-clock training time by approximately 33%, commensurate with the reduction from 10.0 to 6.67 effective epoch-equivalents of data. To isolate curriculum effects from compute reduction, we introduce matched-compute baselines (Standard-7) that control for total gradient updates. On the FUNSD dataset, the curriculum significantly outperforms the matched-compute baseline for BERT ($\Delta$F1 = +0.023, $p=0.022$, $d_z=3.83$), constituting evidence for a genuine scheduling benefit in capacity-constrained models. In contrast, no analogous benefit is observed for LayoutLMv3 ($p=0.621$), whose multimodal representations provide sufficient inductive bias. On the CORD dataset, all conditions converge to equivalent F1 scores ($\geq$0.947) irrespective of scheduling, indicating a performance ceiling. Schedule ablations comparing progressive, two-phase, reverse, and random pacing confirm that the efficiency gain derives from reduced data volume rather than ordering. Taken together, these findings demonstrate that progressive scheduling is a reliable compute-reduction strategy across model families, with curriculum-specific benefits contingent on the interaction between model capacity and task complexity.
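
The relation between the 33%, 67%, 100% schedule and the roughly 33% saving can be checked with a few lines of arithmetic. The sketch below assumes a 10-epoch budget split evenly across the three phases; the abstract gives the 6.67 and 10.0 epoch-equivalent figures but not the phase lengths, so the even split is an assumption made here for illustration.

```python
def effective_epochs(fractions, epochs_per_phase):
    """Sum of (data fraction x epochs) over phases, in full-dataset epoch equivalents."""
    return sum(f * e for f, e in zip(fractions, epochs_per_phase))


# Assumption: a 10-epoch budget split evenly across three phases.
phase_epochs = [10 / 3] * 3

progressive = effective_epochs([1 / 3, 2 / 3, 1.0], phase_epochs)
standard = effective_epochs([1.0, 1.0, 1.0], phase_epochs)
saving = 1 - progressive / standard
```

Under this assumption `progressive` comes to about 6.67 epoch equivalents against 10.0 for the standard schedule, a saving of one third, which also suggests why a 7-epoch standard run (Standard-7) serves as the matched-compute baseline.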

[5] IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions cs.CL | cs.AI

Ezieddin Elmahjub, Junaid Qadir, Abdullah Mushtaq, Rafay Naeem, Ibrahim Ghaznavi

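TL;DR: The paper introduces IslamicLegalBench, the first benchmark for Islamic legal reasoning, which evaluates the knowledge and reasoning of large language models across seven schools of Islamic jurisprudence. It contains 718 instances covering 13 tasks of varying complexity. The evaluation reveals severe limitations: the best model reaches only 68% correctness with 21% hallucination, and several models fall below 35% correctness with more than 55% hallucination.

Details

Motivation: As millions of Muslim users rely on LLMs such as GPT and Claude for religious guidance, the study asks whether these systems can reason reliably about Islamic law and exposes current models' shortcomings in religious knowledge.

Result: Across nine state-of-the-art models, the best reaches 68% correctness (21% hallucination), while several fall below 35% correctness with more than 55% hallucination. Few-shot prompting improves only 2 of 9 models by more than 1%. Moderate-complexity tasks show the highest error rates, whereas high-complexity tasks display apparent competence through semantic reasoning.

Insight: The novelty is the first systematic evaluation framework spanning the plural traditions of Islamic jurisprudence. The results show empirically that prompt-based methods cannot compensate for missing foundational knowledge, and false-premise detection reveals risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%.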

Abstract: As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results highlight that prompt-based methods cannot compensate for missing foundational knowledge. IslamicLegalBench offers the first systematic framework to evaluate Islamic legal reasoning in AI, revealing critical gaps in tools increasingly relied on for spiritual guidance.

[6] ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following cs.CL | cs.AI

Yuancheng Yang, Lin Yang, Xu Wang, Chao Tong, Haihua Yang

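TL;DR: The paper proposes ImpRIF, a method that improves how well large language models (LLMs) follow complex instructions by strengthening their understanding of instructions that involve implicit reasoning. It formalizes complex instructions as verifiable reasoning graphs, which support programmatic verification and graph-driven chain-of-thought reasoning. By synthesizing large-scale single- and multi-turn data, fine-tuning with graph reasoning, and applying reinforcement learning, the models are explicitly trained to reason along the graph. On five complex instruction-following benchmarks, the models substantially outperform their base models.

Details

Motivation: As LLM applications grow more complex, so does the need for robust complex instruction following. The authors argue that a thorough understanding of the instruction itself, especially the latent reasoning structure hidden between the lines, is crucial for better instruction following. They therefore target complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies.

Result: On five complex instruction-following benchmarks, the proposed models substantially outperform their base models, indicating that stronger implicit reasoning can significantly improve complex instruction following.

Insight: The core novelty is formalizing complex instructions that involve implicit reasoning as verifiable reasoning graphs, and building on that a training framework comprising data synthesis, graph-reasoning fine-tuning, and reinforcement learning. This gives a structured, verifiable, and scalable way to improve how LLMs handle instructions with complex, implicit logic.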

Abstract: As applications of large language models (LLMs) become increasingly complex, the demand for robust complex instruction following capabilities is growing accordingly. We argue that a thorough understanding of the instruction itself, especially the latent reasoning structure embedded between the lines, is crucial for improving instruction following. Therefore, we target complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies. We propose ImpRIF, a method to enhance LLMs’ understanding of implicit reasoning instructions, thereby improving their ability to follow complex instructions. We formalize such instructions as verifiable reasoning graphs, enabling programmatic verification and graph-driven chain-of-thought reasoning. Based on this formulation, we synthesize large-scale single- and multi-turn data, propose fine-tuning with graph reasoning, and apply reinforcement learning to explicitly train models to reason along the graph. On five complex instruction following benchmarks, our models substantially outperform their base models. These results demonstrate that enhancing implicit reasoning capabilities can significantly improve complex instruction following. This project will be open-sourced in the near future.

[7] TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents cs.CL

Yanyu Chen, Jiyue Jiang, Jiahong Liu, Yifei Zhang, Xiao Guo

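TL;DR: The paper proposes TRACE (Trajectory-Aware Comprehensive Evaluation), a framework for holistically evaluating the complex reasoning of Deep Research Agents. It quantifies process efficiency and cognitive quality with a Hierarchical Trajectory Utility Function and introduces a Scaffolded Capability Assessment protocol to measure an agent's latent ability.

Details

Motivation: Current evaluation of Deep Research Agents relies mainly on singular metrics such as Pass@1, which ignore the quality, efficiency, and soundness of the reasoning process, and static benchmarks cannot quantify latent capability. TRACE aims to close these evaluation gaps.

Result: Experiments show that TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness, which singular metrics miss entirely.

Insight: The novelty is extending evaluation from a single outcome metric to the entire problem-solving trajectory. The hierarchical utility function and the scaffolded assessment protocol together quantify both the quality of the reasoning process and latent capability.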

Abstract: The evaluation of Deep Research Agents is a critical challenge, as conventional outcome-based metrics fail to capture the nuances of their complex reasoning. Current evaluation faces two primary challenges: 1) a reliance on singular metrics like Pass@1, creating a “high-score illusion” that ignores the quality, efficiency, and soundness of the reasoning process; and 2) the failure of static benchmarks to quantify crucial attributes like robustness and latent capability. To address these gaps, we introduce TRACE (Trajectory-Aware Comprehensive Evaluation), a framework that holistically assesses the entire problem-solving trajectory. To counter the “high-score illusion”, we propose a Hierarchical Trajectory Utility Function that quantifies process efficiency and cognitive quality, including evidence grounding, alongside accuracy. To measure deeper attributes, TRACE introduces a Scaffolded Capability Assessment protocol, quantifying an agent’s latent ability by determining the minimum guidance needed for success. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity. Experiments show TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness entirely missed by singular metrics.
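
The idea of "determining the minimum guidance needed for success" can be illustrated with a small search over increasing amounts of help. This is only a sketch of the general idea: the abstract does not specify how TRACE structures guidance, so the ordered hint list, the `minimum_guidance` function, and the toy agent below are all hypothetical.

```python
def minimum_guidance(solve, task, hints):
    """Return the fewest leading hints with which solve() succeeds, or None if it never does."""
    for k in range(len(hints) + 1):
        if solve(task, hints[:k]):
            return k
    return None


# Toy agent (hypothetical): it only succeeds once it has been told which unit to use.
def toy_agent(task, hints):
    return "answer in metres" in hints


# Hints ordered from weakest to strongest guidance (an assumption of this sketch).
hints = ["re-read the question", "answer in metres", "the result is 42 m"]
level = minimum_guidance(toy_agent, "convert 4200 cm", hints)
```

A lower `level` would indicate higher latent capability: an agent that succeeds with zero hints needs no scaffolding, while one that returns `None` fails even with all available guidance.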

[8] ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning cs.CL | cs.LG | cs.SE

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee

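TL;DR: The paper introduces ToolMATH, a math-grounded benchmark for evaluating tool-augmented language models in realistic multi-tool environments that require calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable test with roughly 8k questions and 12k tools, and provides an additional hard version, ToolMATHHard, to systematically evaluate model reliability under large, overlapping tool catalogs and when the intended capability is absent.

Details

Motivation: Tool-augmented language models face challenges in realistic multi-tool environments, such as handling large, overlapping tool catalogs and staying robust when the intended capability is missing. The benchmark aims to supply actionable diagnostic evidence for identifying failure modes and the control mechanisms needed.

Result: The evaluation shows that the key failure factor is insufficient reasoning ability, which lets errors in intermediate results accumulate and affect later decisions. Tool-list redundancy amplifies early deviations into irreversible execution drift. When the intended capability is missing, distractor tools can act as partial substitutes but can also mislead models into ungrounded tool trajectories.

Insight: The novelty is a math-grounded, controllable multi-tool benchmark. It shows that long-range plan coherence and disciplined use of observations matter more than local action selection, provides diagnostic evidence of failure modes in tool-augmented agents, and reveals how tool redundancy and substitute capabilities affect model performance.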

Abstract: We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. ToolMATH contains roughly 8k questions and 12k tools; we provide an additional hard set, ToolMATHHard, with questions and tools. Our evaluation reveals that the key failure factor is the inability to reason, which leads to an accumulation of errors in intermediate results that constrain later decisions. Tool-list redundancy does not simply add noise but amplifies small early deviations into irreversible execution drift. The benchmark highlights that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols emphasize that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.

[9] Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment cs.CL | cs.AI

Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li

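TL;DR: The paper proposes Alignment-Weighted DPO, a method that makes the safety alignment of large language models (LLMs) more robust by strengthening their deep reasoning. Using causal intervention experiments, it first shows that the fragility of existing alignment techniques (SFT, RLHF, DPO) stems from shallow alignment mechanisms that lack deep reasoning. The authors then construct and release a new fine-tuning dataset with step-by-step chain-of-thought (CoT) rationales for reasoning-aware post-training. Building on this, Alignment-Weighted DPO assigns different preference weights to the reasoning segment and the final-answer segment of an output, producing finer-grained and more targeted model updates than standard DPO.

Details

Motivation: Existing alignment techniques such as SFT, RLHF, and DPO have improved LLM safety, but models remain vulnerable to jailbreak attacks disguised through indirect or deceptive phrasing. The paper's empirical analysis finds that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning: models often refuse harmful prompts without truly understanding why they are harmful.

Result: Extensive experiments across multiple safety and utility benchmarks show that the proposed methods (CoT fine-tuning and Alignment-Weighted DPO) consistently improve alignment robustness while maintaining overall model utility, outperforming standard SFT baselines.

Insight: The core contributions are: 1) a causal intervention analysis that traces the fragility of safety alignment to shallow reasoning, together with reasoning-aware post-training on a CoT dataset as a remedy; 2) Alignment-Weighted DPO, a fine-grained preference optimization method that weights the reasoning chain and the final answer differently, yielding more precise and robust alignment updates that resist diverse jailbreak strategies.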

Abstract: Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
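
One way to picture "assigning different preference weights to the reasoning and final-answer segments" is to split the DPO implicit reward into two per-segment sums and weight them separately. The sketch below is an assumption about the general shape of such a loss, not the paper's formula: the weights `w_reason` and `w_answer`, the dictionary layout, and the function names are all hypothetical. With both weights set to 1 it reduces to the standard DPO loss.

```python
import math


def segment_margin(policy_logps, ref_logps):
    """Sum of per-token log-ratios log pi(y_t) - log pi_ref(y_t) over one segment."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def weighted_dpo_loss(chosen, rejected, w_reason=1.0, w_answer=1.0, beta=0.1):
    """DPO-style loss whose implicit reward weights reasoning and answer segments separately.

    chosen / rejected: dicts with keys "reason" and "answer", each holding a pair
    (policy_token_logps, reference_token_logps).
    """
    def reward(sample):
        return (w_reason * segment_margin(*sample["reason"])
                + w_answer * segment_margin(*sample["answer"]))

    logits = beta * (reward(chosen) - reward(rejected))
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))


# Toy per-token log-probabilities (made up for illustration).
chosen = {"reason": ([-1.0, -1.0], [-1.5, -1.5]), "answer": ([-0.5], [-1.0])}
rejected = {"reason": ([-2.0, -2.0], [-1.5, -1.5]), "answer": ([-1.0], [-1.0])}

vanilla = weighted_dpo_loss(chosen, rejected)
reason_heavy = weighted_dpo_loss(chosen, rejected, w_reason=2.0)
```

In this toy case the preference gap sits mostly in the reasoning segment, so up-weighting that segment enlarges the margin and lowers the loss relative to the unweighted version.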

[10] VecGlypher: Unified Vector Glyph Generation with Language Models cs.CL

Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu

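TL;DR: VecGlypher is a unified multimodal language model that generates high-quality, editable vector glyphs (SVG paths) directly from text descriptions or image exemplars, with no raster intermediates or postprocessing.

Details

Motivation: Conventional learning-based vector glyph pipelines depend on carefully curated exemplar sheets and raster-to-vector postprocessing. The work removes these dependencies to make font creation more accessible and its output more editable.

Result: On cross-family OOD evaluation, VecGlypher substantially outperforms general-purpose LLMs and specialized vector-font baselines for text-only generation, and image-referenced generation reaches state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector.

Insight: The contributions include a two-stage training recipe (large-scale pretraining on noisy fonts, then post-training on expert-annotated fonts) that aligns language and imagery with geometry, absolute-coordinate serialization for better geometry, and direct emission of SVG path tokens, which avoids raster intermediates.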

Abstract: Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.

[11] Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment cs.CL | cs.AI | cs.IR

Barah Fazili, Koustava Goswami

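TL;DR: The paper proposes strengthening the cross-lingual alignment of multilingual pretrained models through contrastive learning on a multi-way parallel corpus. Compared with conventional bilingual (English to target language) parallel data, multi-way parallel data substantially improves performance on several natural language understanding tasks, including bitext mining, semantic similarity, and classification.

Details

Motivation: Multilingual pretraining typically lacks explicit alignment signals, which leads to suboptimal cross-lingual alignment in the representation space. The paper aims to improve multilingual and cross-lingual representations by using a multi-way parallel corpus.

Result: On the MTEB benchmark, evaluated with XLM-Roberta and mBERT base models, contrastive training on a multi-way parallel corpus improves over English-centric bilingual parallel data by 21.3% on bitext mining, 5.3% on semantic similarity, and 28.4% on classification. When fine-tuning the mE5 model, even a small multi-way parallel dataset significantly improves bitext mining.

Insight: The novelty is using a multi-way parallel corpus, rather than conventional bilingual parallel data, for contrastive learning to strengthen cross-lingual alignment. Viewed objectively, the richer cross-lingual supervision signal improves generalization to both seen and unseen languages, and it brings clear gains even for models already pretrained for high-quality sentence embeddings.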

Abstract: Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, fine-tuning the mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to one without, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
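
The difference between English-centric and multi-way supervision can be sketched with a contrastive loss averaged over every ordered language pair. The abstract does not state the exact objective, so the InfoNCE form, the temperature, and the function names below are assumptions made for illustration; tiny 2-D vectors stand in for sentence embeddings.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor_lang, target_lang, temperature=0.1):
    """Mean InfoNCE loss; row i of each list embeds the same sentence i."""
    total = 0.0
    for i, anchor in enumerate(anchor_lang):
        logits = [cosine(anchor, cand) / temperature for cand in target_lang]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchor_lang)


def multiway_loss(embeddings_by_lang, temperature=0.1):
    """Average InfoNCE over every ordered language pair, not only En-X pairs."""
    pairs = list(permutations(embeddings_by_lang, 2))
    return sum(info_nce(embeddings_by_lang[a], embeddings_by_lang[b], temperature)
               for a, b in pairs) / len(pairs)


# Two sentences embedded in three languages (toy values).
aligned = {
    "en": [[1.0, 0.0], [0.0, 1.0]],
    "de": [[0.9, 0.1], [0.1, 0.9]],
    "hi": [[1.0, 0.2], [0.2, 1.0]],
}
# Same data with the German rows swapped, i.e. misaligned translations.
shuffled = dict(aligned, de=[[0.1, 0.9], [0.9, 0.1]])

loss_aligned = multiway_loss(aligned)
loss_shuffled = multiway_loss(shuffled)
```

Because the loss covers pairs such as de-hi as well as en-de, a misalignment in any one language raises the total, which is the extra supervision that multi-way parallel data provides over En-X pairs alone.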

[12] When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning cs.CL

Muku Akasaka, Soyeon Caren Han

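TL;DR: The paper presents a systematic analysis of information injection strategies for visual spatial reasoning (VSR) and finds that more information does not always lead to better reasoning. Experiments show that a targeted single spatial cue outperforms multi-context aggregation, excessive or irrelevant commonsense knowledge degrades performance, and chain-of-thought prompting helps only when spatial grounding is sufficiently precise.

Details

Motivation: Despite advances in multimodal architectures, modern vision-language models (VLMs) still struggle with visual spatial reasoning. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought reasoning instructions, but it remains unclear when this information genuinely improves reasoning and when it introduces noise.

Result: Experiments on three representative VLMs and two public benchmarks show that targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and chain-of-thought prompting improves accuracy only when spatial grounding is sufficiently precise.

Insight: The novelty is a hypothesis-driven, systematic analysis that reveals a "less is more" principle for information injection. It underlines the importance of selective, task-aligned information injection and offers practical guidance for designing reliable multimodal reasoning pipelines.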

Abstract: Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.

[13] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning cs.CL

Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li

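TL;DR: The paper proposes RuCL, a stratified rubric-based curriculum learning framework for strengthening the reasoning of multimodal large language models. It generates generalized rubrics, stratifies them according to the model's competence, and dynamically adjusts rubric weights during training, guiding the model from mastering foundational perception toward advanced logical reasoning.

Details

Motivation: Methods based on reinforcement learning with verifiable rewards risk reward hacking, where the model learns spurious reasoning patterns to pass final-answer checks. Rubric-based methods provide fine-grained supervision but suffer from the high computational cost of instance-level generation and from inefficient training dynamics caused by treating all rubrics as equally learnable. RuCL addresses these problems by shifting the focus of curriculum learning from data selection to reward design.

Result: On several visual reasoning benchmarks, RuCL improves over the Qwen2.5-VL-7B model by 7.83% on average, reaching a state-of-the-art accuracy of 60.06%.

Insight: The novelty is recasting curriculum learning as a reward design problem. Generating generalized rubrics, stratifying them by model competence, and adjusting their weights dynamically yields progressive learning from perception to reasoning while avoiding the computational inefficiency and reward hacking of earlier methods.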

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model’s competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.

[14] Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion cs.CL

Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang

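TL;DR: The paper proposes a speech-guided machine translation framework that feeds speech and text as fused inputs into a multimodal large language model to improve translation quality. By exploiting the natural alignment between speech and text and the abundance of speech data, it overcomes the scarcity of multilingual image-text pairs that limits image-guided methods, and a self-evolution mechanism reduces its reliance on low-resource data.

Details

Motivation: Existing research focuses mainly on image-guided multimodal machine translation, whose applicability is limited by the scarcity of multilingual image-text pairs. Speech aligns naturally with text and has abundant datasets, which enables scalable language coverage, so the paper explores speech-text fusion to improve translation performance.

Result: The framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, setting a new state of the art. On the general machine translation dataset FLORES-200, it achieves average state-of-the-art performance across 108 translation directions. Ablation studies on CoVoST-2 show that differences between synthetic and authentic speech have a negligible impact on translation quality.

Insight: The contributions include a Speech-guided Machine Translation (SMT) framework based on speech-text fusion, which uses the speech modality to extend multilingual coverage, and a Self-Evolution Mechanism in which a text-to-speech model generates synthetic speech and the MLLM classifies the synthetic samples and iteratively optimizes itself, reducing reliance on low-resource data and improving scalability.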

Abstract: Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results. Furthermore, on general machine translation datasets, particularly FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirm that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at https://github.com/yxduir/LLM-SRT.

[15] Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning cs.CL | cs.AI

Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson

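TL;DR: Using reinforcement learning methods, the study examines the relationship between regularity and learnability in recursive numeral systems (such as the English base-10 counting system). It finds that highly regular human numeral systems are easier to learn than irregular but possible ones, and that this asymmetry arises because such systems are designed to generalize from limited data in order to represent all integers exactly. The effect of regularity on learnability disappears for unnatural, highly irregular systems, whose learnability is instead influenced by signal length.

Details

Motivation: To investigate why human recursive numeral systems are so regular, and to test whether regularity is cross-linguistically common because it facilitates learning, thereby linking learning biases to linguistic universals.

Result: The study confirms that highly regular human (or human-like) systems are easier to learn than unattested irregular systems. For unnatural, highly irregular systems, learnability is influenced by signal length rather than regularity.

Insight: The novelty is applying reinforcement learning methods to the study of language learning. The results show that the learning advantage of regularity follows naturally from systems being designed for generalization, and that different pressures shape learnability in different regions of the space of possible systems, providing computational evidence for the link between learnability and cross-linguistic prevalence.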

Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learning, we ask whether regular systems are common because regularity facilitates learning. Adopting methods from the Reinforcement Learning literature, we confirm that highly regular human(-like) systems are easier to learn than unattested but possible irregular systems. This asymmetry emerges under the natural assumption that recursive numeral systems are designed for generalisation from limited data to represent all integers exactly. We also find that the influence of regularity on learnability is absent for unnatural, highly irregular systems, whose learnability is influenced instead by signal length, suggesting that different pressures may influence learnability differently in different parts of the space of possible numeral systems. Our results contribute to the body of work linking learnability to cross-linguistic prevalence.

[16] Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling cs.CL

Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen

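TL;DR: The paper proposes Explore-on-Graph (EoG), a framework that addresses the reasoning problems large language models face in knowledge graph question answering due to hallucinations and missing facts. It uses reinforcement learning to encourage the model to explore diverse reasoning paths on the knowledge graph autonomously, and uses path information as an additional reward signal to refine exploration, improving generalization.

Details

Motivation: Existing KG-enhanced methods typically constrain LLM reasoning by enforcing rules during generation or by imitating a fixed set of demonstration paths. This confines the reasoning patterns to prior experience or the fine-tuning data and limits generalization to out-of-distribution graph reasoning problems.

Result: Extensive experiments on five KGQA benchmark datasets show that the method achieves state-of-the-art performance, outperforming not only open-source models but also closed-source LLMs.

Insight: The novelty is introducing reinforcement learning with the correctness of a reasoning path's final answer as the reward, which encourages the model to explore a broader reasoning space on its own. Incorporating path information as an additional reward signal refines the exploration, makes it more efficient and meaningful, and strengthens generalization to unseen reasoning problems.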

Abstract: The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks. A promising solution is to ground LLMs’ answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrain LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, they naturally confine the reasoning patterns of LLMs within the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems. To tackle this problem, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs. To incentivize exploration and discovery of novel reasoning paths, we introduce reinforcement learning during training, whose reward is the correctness of the reasoning paths’ final answers. To enhance the efficiency and meaningfulness of the exploration, we incorporate path information as additional reward signals to refine the exploration process and reduce futile efforts. Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also closed-source LLMs.
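
The combination of an outcome reward (final-answer correctness) with path information as an additional signal might look roughly like the sketch below. The abstract does not say how path information is scored, so the use of a reference path, the overlap measure, the `path_weight` value, and the function name are all assumptions made for illustration.

```python
def path_refined_reward(predicted, gold_answers, explored_path, reference_path, path_weight=0.5):
    """Outcome reward (answer correctness) plus a weighted path-overlap bonus.

    Assumption: paths are sequences of (head, relation, tail) triples and the
    bonus is the fraction of reference triples that the explored path covers.
    """
    outcome = 1.0 if predicted in gold_answers else 0.0
    reference = set(reference_path)
    overlap = len(set(explored_path) & reference) / len(reference) if reference else 0.0
    return outcome + path_weight * overlap


reference = [("Paris", "capital_of", "France"), ("France", "member_of", "EU")]

# Correct answer reached through the reference path.
good = path_refined_reward("EU", {"EU"}, reference, reference)
# Correct answer reached through an unrelated path: no path bonus.
lucky = path_refined_reward("EU", {"EU"}, [("Paris", "located_in", "Europe")], reference)
# Wrong answer, but half of the reference path was explored.
wrong = path_refined_reward("NATO", {"EU"}, reference[:1], reference)
```

The point of the extra term is visible in the ordering of the three cases: a well-grounded correct answer scores highest, and a partially correct path still receives some signal even when the final answer is wrong.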

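[17] Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLMs cs.CL

Heng Wang, Changxing Wu

TL;DR: This paper proposes using natural language explanations generated by large language models (LLMs) to improve both the performance and the interpretability of implicit discourse relation recognition (IDRR). An LLM is prompted to generate an explanation for each training instance conditioned on its gold label, and a classification-generation framework that jointly performs relation prediction and explanation generation is trained with those explanations as additional supervision. The method is plug-and-play and integrates easily with existing IDRR models. Experiments on PDTB show significant IDRR gains, the generated explanations improve interpretability, and the approach is also validated on sentiment classification and natural language inference.

Details

Motivation: IDRR is challenging because, without explicit discourse markers, it requires deep semantic understanding, and existing methods only predict relations without offering any supporting explanation. LLMs have shown strong reasoning capabilities in both deep language understanding and natural language explanation generation.

Result: Experiments on the PDTB benchmark show that the method significantly improves IDRR performance, and human evaluation further confirms that the generated explanations enhance model interpretability. The generality of the approach is also validated on sentiment classification and natural language inference.

Insight: The core contribution is to use the reasoning ability of LLMs to generate explanations for the training data, and to design a joint classification-generation framework in which explanation generation serves as an auxiliary task that distills LLM knowledge, improving both the performance and the interpretability of lightweight IDRR models. It is an effective plug-and-play paradigm that combines knowledge distillation with multi-task learning.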

Abstract: Implicit Discourse Relation Recognition (IDRR) remains a challenging task due to the requirement for deep semantic understanding in the absence of explicit discourse markers. A further limitation is that existing methods only predict relations without providing any supporting explanations. Recent advances in large language models (LLMs) have shown strong reasoning capabilities in both deep language understanding and natural language explanation generation. In this work, we propose a simple yet effective approach to distill the reasoning capabilities of LLMs into lightweight IDRR models to improve both performance and interpretability. Specifically, we first prompt an LLM to generate explanations for each training instance conditioned on its gold label. Then, we introduce a novel classification-generation framework that jointly performs relation prediction and explanation generation, and train it with the additional supervision of LLM-generated explanations. Our framework is plug-and-play, enabling easy integration with most existing IDRR models. Experimental results on PDTB demonstrate that our approach significantly improves IDRR performance, while human evaluation further confirms that the generated explanations enhance model interpretability. Furthermore, we validate the generality of our approach on sentiment classification and natural language inference.

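[18] D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models cs.CL

Shunsuke Ubukata

TL;DR: This paper proposes Disciplined Chain-of-Thought (D-CoT), a framework that targets the “overthinking” that arises when chain-of-thought reasoning is distilled from large language models into small language models. During training it uses control tags (for example, one for fact-checking and one for multi-perspective exploration) as auxiliary scaffolding to enforce a structured reasoning process, which optimizes the CoT trajectory, suppresses reasoning drift, and both lowers computational cost and improves performance.

Details

Motivation: To address the performance degradation and excessive token consumption caused by “overthinking” when CoT is distilled from LLMs into small language models.

Result: On Qwen3-8B, with only 5,000 training samples, D-CoT raises accuracy by 9.9% on GPQA-diamond and by 9.1% on MMLU-Pro (0-shot), while drastically reducing computational cost.

Insight: The novelty is the use of control-tag scaffolding to discipline CoT learning, which suppresses reasoning drift and improves efficiency and performance together. The model internalizes this disciplined thought structure and maintains high performance at inference even without explicit control tags, offering a new approach to efficient reasoning in small language models.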

Abstract: Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces “overthinking” in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags (such as one for fact-checking and one for multi-perspective exploration) as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.

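[19] FewMMBench: A Benchmark for Multimodal Few-Shot Learning cs.CL

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem

TL;DR: This paper introduces FewMMBench, a comprehensive benchmark for evaluating multimodal large language models (MLLMs) under few-shot conditions, with a focus on in-context learning (ICL) and chain-of-thought (CoT) prompting. It covers multimodal understanding tasks ranging from attribute recognition to temporal reasoning, and systematically evaluates 26 open-weight MLLMs from six model families in zero-shot, few-shot, and CoT-augmented few-shot settings.

Details

Motivation: As MLLMs advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge, so a dedicated benchmark is needed to systematically diagnose and advance those capabilities.

Result: Instruction-tuned models perform strongly in the zero-shot setting but gain little, or even regress, when given additional demonstrations or CoT reasoning. Retrieval-based demonstrations and larger context sizes also yield limited gains. The study analyzes these effects systematically on FewMMBench.

Insight: The contribution is a comprehensive benchmark dedicated to multimodal few-shot learning, together with a systematic demonstration that current MLLMs gain little in few-shot scenarios, especially when ICL and CoT are combined, which gives a diagnostic basis for future model improvements. Its task diversity and its comparison of multiple prompting strategies are worth borrowing.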

Abstract: As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: https://huggingface.co/datasets/mustafaa/FewMMBench

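[20] ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection cs.CL

Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li

TL;DR: This paper proposes ExpLang, a novel LLM post-training pipeline that introduces on-policy thinking language selection, using multiple languages to improve exploration and exploitation during reinforcement learning and thereby the performance of large reasoning models.

Details

Motivation: Current large reasoning models focus mainly on English reasoning in pursuit of the strongest performance, overlooking the potential advantage of multilingual thinking and the demand from global users for thinking traces in their native language. A method is needed that brings multilingual ability into training.

Result: With the same training budget, ExpLang steadily outperforms English-only training and shows high thinking-language compliance on both seen and unseen languages, effectively extending the RL exploration space and improving exploitation outcomes.

Insight: The novelty is treating thinking language selection as an on-policy action during RL, which diversifies the exploration space through varied language preferences and improves exploitation by leveraging non-English advantages. The method is orthogonal to most RL algorithms and offers a new perspective on using multilinguality to improve large reasoning models.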

Abstract: Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in expectation of the strongest performance, despite the demonstrated potential advantage of multilingual thinking, as well as the requirement for native thinking traces by global users. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection to improve exploration and exploitation during RL with the use of multiple languages. The results show that our method steadily outperforms English-only training with the same training budget, while showing high thinking language compliance for both seen and unseen languages. Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged non-English advantage. The method is orthogonal to most RL algorithms and opens up a new perspective on using multilinguality to improve LRMs.

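[21] MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents cs.CL

Zhenyu Wang, Xiaofen Xing, Yirong Chen, Xiangmin Xu

TL;DR: This paper proposes MERRY, a framework that evaluates the emotional consistency and role consistency of multimodal role-playing agents (MRPAs) in a semantically decoupled way. It introduces eight refined metrics and recasts subjective scoring as a bidirectional evidence-finding task, improving the human agreement of LLM-as-Judge evaluations.

Details

Motivation: Existing studies rely on purely textual benchmarks to evaluate the text responses of MRPAs and leave the assessment of multimodal expression to modality-synthesis metrics alone. This entangles semantic assessment with modality generation, makes error attribution ambiguous, and depends heavily on human judgment.

Result: Extensive evaluation with MERRY shows that training on synthetic datasets tends to reduce emotional consistency whereas training on real-world datasets improves it; that existing models suffer from emotional templatization and simplification, with a positive bias and a performance bottleneck on fine-grained negative emotions; and that simple prompting strengthens weak models but constrains strong ones, while simple fine-tuning generalizes poorly across roles.

Insight: Contributions include the semantically decoupled evaluation framework, eight refined consistency metrics, and the conversion of subjective scoring into a bidirectional evidence-finding task that improves the human agreement of LLM judging. The approach offers a more precise and interpretable basis for evaluating multimodal interaction.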

Abstract: Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on pure textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm, on the one hand, entangles semantic assessment with modality generation, leading to ambiguous error attribution, and on the other hand remains constrained by the heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing Multimodal Emotional and Role consistencies of Role-playing agents. This framework introduces five refined metrics for emotional consistency (EC) and three for role consistency (RC). Notably, we transform the traditional subjective scoring approach into a novel bidirectional-evidence-finding task, significantly improving the human agreement of LLM-as-Judge evaluations. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) Training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) Existing models suffer from emotional templatization and simplification, exhibiting a positive bias and a performance bottleneck in fine-grained negative emotions; (3) A simple prompting method strengthens the weak models but constrains the strong ones, while a simple fine-tuning method suffers from poor role generalization. Code and dataset are available.

Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote

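TL;DR: This paper evaluates the ability of eight frontier large language models to predict algorithm performance in causal discovery and finds systematic, near-total failure, which the authors term “algorithmic blindness”: the models lack calibrated procedural prediction about computational processes.

Details

Motivation: LLMs reason poorly about computational processes, and closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment.

Result: Tested against ground truth derived from large-scale algorithm executions, all models fail almost completely. Most perform worse than random guessing, and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning.

Insight: The work exposes a fundamental gap in LLMs between declarative knowledge about algorithms and calibrated procedural prediction (algorithmic blindness), which calls into question their reliability for complex algorithmic reasoning in practice.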

Abstract: Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed and evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.
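The calibration failure described in this abstract, ranges that are wide yet still miss the true mean, can be measured with a simple coverage check. The sketch below is an assumption about how such a check might look, not the paper's evaluation code; the function names and all numbers are invented for illustration.

```python
def interval_coverage(pred_ranges, true_means):
    """Fraction of instances whose predicted range contains the true mean."""
    hits = sum(1 for (low, high), mean in zip(pred_ranges, true_means) if low <= mean <= high)
    return hits / len(true_means)


def mean_width(pred_ranges):
    """Average width of the predicted ranges."""
    return sum(high - low for low, high in pred_ranges) / len(pred_ranges)


# Made-up model-predicted ranges and made-up means from algorithm executions.
predicted = [(0.10, 0.60), (0.20, 0.90), (0.50, 0.95), (0.00, 0.40)]
executed_means = [0.72, 0.35, 0.30, 0.55]
cov = interval_coverage(predicted, executed_means)
width = mean_width(predicted)
```

A calibrated predictor would pair narrow ranges with high coverage; the pattern the abstract reports is the opposite, large width together with coverage below one half.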

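[23] MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models cs.CL

Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng

TL;DR: MEDSYN is a new benchmark for evaluating how well multimodal large language models synthesize multiple pieces of evidence in complex clinical cases. It contains multilingual, multimodal cases with up to 7 distinct types of visual clinical evidence each. The study evaluates 18 MLLMs on differential diagnosis generation and final diagnosis selection, and finds that top models rival human experts on the former but fall well short on the latter, revealing a failure mode in synthesizing heterogeneous evidence types.

Details

Motivation: Existing benchmarks do not adequately capture real-world clinical complexity, so a new benchmark is needed to evaluate MLLM diagnostic ability in complex, multi-evidence clinical scenarios that better mirror actual clinical workflow.

Result: On MEDSYN, top MLLMs often match or outperform human experts on differential diagnosis generation, but every model shows a much larger gap than expert clinicians between differential diagnosis and final diagnosis performance. Ablations attribute the failure to overreliance on less discriminative textual evidence and to a cross-modal evidence utilization gap.

Insight: The contribution is MEDSYN, a highly complex clinical multimodal benchmark, together with Evidence Sensitivity, a metric that quantifies the cross-modal evidence utilization gap; a smaller gap correlates with higher diagnostic accuracy, and the metric can guide interventions that improve model performance. The study underlines the need to evaluate multi-evidence synthesis in medical AI and gives concrete directions for model improvement.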

Abstract: Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity. We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case. Mirroring clinical workflow, we evaluate 18 MLLMs on differential diagnosis (DDx) generation and final diagnosis (FDx) selection. While top models often match or even outperform human experts on DDx generation, all MLLMs exhibit a much larger DDx–FDx performance gap compared to expert clinicians, indicating a failure mode in synthesis of heterogeneous CE types. Ablations attribute this failure to (i) overreliance on less discriminative textual CE (e.g., medical history) and (ii) a cross-modal CE utilization gap. We introduce Evidence Sensitivity to quantify the latter and show that a smaller gap correlates with higher diagnostic accuracy. Finally, we demonstrate how it can be used to guide interventions to improve model performance. We will open-source our benchmark and code.

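[24] RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning cs.CL

Bo Xue, Yuan Jin, Luoyi Fu, Jiaxin Ding, Xinbing Wang

TL;DR: This paper proposes RADAR, which reformulates LLM-based knowledge graph reasoning from a generative paradigm into discriminative relational reasoning. It recasts knowledge graph completion as discriminative entity selection, uses reinforcement learning to enforce separability of entity representations, and performs inference directly in representation space, avoiding generation-induced hallucinations and improving generalization.

Details

Motivation: Existing LLM-based knowledge graph reasoning methods are mostly generative and tend to memorize surface-level co-occurrences rather than learn genuine relational semantics, which limits out-of-distribution generalization.

Result: Across four benchmarks, RADAR achieves 5-6% relative gains over strong LLM baselines on link prediction and triple classification, and increases task-relevant mutual information in intermediate representations by 62.9%, indicating more robust and transferable relational reasoning.

Insight: The core idea is the shift from generative pattern matching to discriminative relational reasoning: RL-driven discriminative optimization learns separable entity representations, and inference runs directly in representation space in a way consistent with that optimization, which reduces hallucination and improves semantic understanding and generalization.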

Abstract: Knowledge graph reasoning (KGR) infers missing facts, with recent advances increasingly harnessing the semantic priors and reasoning abilities of Large Language Models (LLMs). However, prevailing generative paradigms are prone to memorizing surface-level co-occurrences rather than learning genuine relational semantics, limiting out-of-distribution generalization. To address this, we propose RADAR, which reformulates KGR from generative pattern matching to discriminative relational reasoning. We recast KGR as discriminative entity selection, where reinforcement learning enforces relative entity separability beyond token-likelihood imitation. Leveraging this separability, inference operates directly in representation space, ensuring consistency with the discriminative optimization and bypassing generation-induced hallucinations. Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more robust and transferable relational reasoning.
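Inference as discriminative entity selection in representation space can be illustrated with a toy scorer. This is a sketch under assumptions: the abstract does not state RADAR's scoring function, so cosine similarity, the function names, and the three-dimensional vectors are all hypothetical stand-ins.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_entity(query_vec, candidates):
    """Return the candidate whose embedding is closest to the query (no text generation)."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))


# Toy embeddings for the query (head entity, relation, ?) and three candidate tails.
query = [0.9, 0.1, 0.0]
candidate_entities = {
    "Canberra": [1.0, 0.0, 0.0],
    "Sydney": [0.0, 1.0, 0.0],
    "Perth": [0.0, 0.0, 1.0],
}
best = select_entity(query, candidate_entities)
```

Because the answer is chosen from a fixed candidate set rather than decoded token by token, the output cannot be an entity that does not exist in the graph, which is the sense in which this style of inference sidesteps generation-induced hallucination.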

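[25] Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models cs.CL | cs.AI

Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek

TL;DR: This study evaluates the robustness of large language models on Theory of Mind (ToM) tasks using perturbed tasks and Chain-of-Thought (CoT) prompting. Performance drops sharply under perturbation, and although CoT prompting improves performance overall, it lowers accuracy for some perturbation types.

Details

Motivation: To examine whether LLMs have genuine ToM capabilities by testing their robustness on perturbed false-belief tasks, and to study the potential of CoT prompting to improve performance and explain model decisions.

Result: All evaluated LLMs show a steep drop in ToM capability under task perturbation, which calls into question the presence of any robust form of ToM. CoT prompting improves performance overall in a faithful manner, but unexpectedly degrades accuracy for some perturbation classes.

Insight: Contributions are a handcrafted, richly annotated ToM dataset with classic and perturbed tasks, and metrics for evaluating reasoning-chain correctness and the faithfulness of final answers to the reasoning. The study shows how fragile LLM Theory of Mind is and that CoT prompting should be applied selectively.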

Abstract: Theory of Mind (ToM) refers to an agent’s ability to model the internal states of others. Contributing to the debate whether large language models (LLMs) exhibit genuine ToM capabilities, our study investigates their ToM robustness using perturbations on false-belief tasks and examines the potential of Chain-of-Thought prompting (CoT) to enhance performance and explain the LLM’s decision. We introduce a handcrafted, richly annotated ToM dataset, including classic and perturbed false belief tasks, the corresponding spaces of valid reasoning chains for correct task completion, subsequent reasoning faithfulness, task solutions, and propose metrics to evaluate reasoning chain correctness and to what extent final answers are faithful to reasoning traces of the generated CoT. We show a steep drop in ToM capabilities under task perturbation for all evaluated LLMs, questioning the notion of any robust form of ToM being present. While CoT prompting improves the ToM performance overall in a faithful manner, it surprisingly degrades accuracy for some perturbation classes, indicating that selective application is necessary.

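[26] IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages cs.CL

Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan

TL;DR: This paper presents IndicIFEval, a benchmark for evaluating the instruction-following ability of large language models in 14 Indic languages. It has two complementary subsets: IndicIFEval-Ground, based on translated and localized prompts, and IndicIFEval-Ground, based on synthetically generated instructions grounded in native content. A comprehensive evaluation of major open-weight and proprietary models shows that models handle formatting constraints well but struggle significantly with lexical and cross-lingual tasks, and that instruction following in Indic languages lags far behind English.

Details

Motivation: Existing instruction-following benchmarks are predominantly English-centric and leave Indic language speakers without evaluation coverage, so an automatically verifiable multilingual benchmark is needed to fill the gap.

Result: Across 14 Indic languages, models maintain strong adherence to formatting constraints but perform significantly worse on lexical and cross-lingual tasks, and instruction following across the Indic family lags clearly behind English.

Insight: The contribution is the first automatically verifiable instruction-following benchmark for 14 Indic languages, combining translated, localized prompts with synthetic instructions grounded in native content and giving research on multilingual constrained generation a standardized evaluation tool.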

Abstract: Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language spread across two complementary subsets: IndicIFEval-Ground, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval-Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks – and despite progress in high-resource languages, instruction-following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (http://github.com/ai4bharat/IndicIFEval).
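Automatically verifiable, rule-based instructions are ones a program can check without a human or model judge. The sketch below shows the general idea only; the checker names, the constraint types, and the Hindi example are assumptions for illustration and are not taken from the IndicIFEval evaluation scripts.

```python
def check_min_words(response, min_words):
    """Formatting-style constraint: the response has at least `min_words` words."""
    return len(response.split()) >= min_words


def check_keyword_count(response, keyword, times):
    """Lexical constraint: `keyword` appears at least `times` times."""
    return response.count(keyword) >= times


def verify(response, checks):
    """Apply every (checker, kwargs) pair; the response passes only if all pass."""
    return all(fn(response, **kwargs) for fn, kwargs in checks)


checks = [
    (check_min_words, {"min_words": 5}),
    (check_keyword_count, {"keyword": "नमस्ते", "times": 1}),
]
ok = verify("नमस्ते दोस्तों आज मौसम बहुत अच्छा है", checks)
```

Whitespace splitting is a simplification; scripts that do not delimit words with spaces, or lexical constraints that depend on inflected forms, would need language-aware checks.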

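[27] DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs cs.CL

Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen

TL;DR: This paper proposes DySCO, a decoding algorithm that improves language model performance on long-context reasoning. It uses retrieval heads, attention heads inside the model that specialize in long-context retrieval, to identify task-relevant tokens at each decoding step and up-weight attention to them, making better use of relevant information in the long context. DySCO needs no additional training and applies directly to off-the-shelf language models.

Details

Motivation: Although modern language models support ever longer context windows, their reasoning accuracy often deteriorates as input length grows, and they struggle to keep attention on the most relevant context throughout decoding. DySCO targets this attention drift in long-context reasoning.

Result: Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks such as MRCR and LongBenchV2, with relative gains of up to 25% at 128K context length and only modest additional compute.

Insight: The core contribution is a training-free, dynamic attention-scaling decoding mechanism that uses the model's own specialized retrieval heads to guide how attention is reallocated during decoding. It is a lightweight, interpretable intervention for improving the long-context ability of off-the-shelf models, and it shows that internal attention can be steered effectively at decoding time.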

Abstract: Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DySCO, a novel decoding algorithm for improving long-context reasoning. DySCO leverages retrieval heads (a subset of attention heads specialized for long-context retrieval) to identify task-relevant tokens at each decoding step and explicitly up-weight them. By doing so, DySCO dynamically adjusts attention during generation to better utilize relevant context. The method is training-free and can be applied directly to any off-the-shelf LM. Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modest additional compute. Further analysis highlights the importance of both dynamic attention rescaling and retrieval-head-guided selection for the effectiveness of the method, while providing interpretability insights into decoding-time attention behavior. Our code is available at https://github.com/princeton-pli/DySCO.
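The two steps named in this abstract, selecting tokens with a retrieval head and up-weighting them, can be sketched on toy attention vectors. This is a hedged illustration, not the released DySCO code: the top-k selection rule, the multiplicative `scale`, and rescaling post-softmax weights rather than logits are all assumptions made here.

```python
def top_k_tokens(retrieval_head_attn, k):
    """Indices of the k context tokens a (hypothetical) retrieval head attends to most."""
    order = sorted(range(len(retrieval_head_attn)),
                   key=lambda i: retrieval_head_attn[i], reverse=True)
    return set(order[:k])


def rescale_attention(attn, relevant_idx, scale=2.0):
    """Multiply the weights of the selected tokens by `scale`, then renormalize to sum to 1."""
    boosted = [w * scale if i in relevant_idx else w for i, w in enumerate(attn)]
    total = sum(boosted)
    return [w / total for w in boosted]


# Toy attention distributions over four context tokens at one decoding step.
retrieval_head = [0.05, 0.60, 0.05, 0.30]
other_head = [0.25, 0.25, 0.25, 0.25]

relevant = top_k_tokens(retrieval_head, k=2)
adjusted = rescale_attention(other_head, relevant, scale=2.0)
```

Repeating the selection at every decoding step is what makes the adjustment dynamic: the set of up-weighted tokens follows whatever the retrieval head currently points at.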

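[28] Improving Parametric Knowledge Access in Reasoning Language Models cs.CL

Melody Ma, John Hewitt

TL;DR: This paper studies how to improve the ability of reasoning language models to access their parametric world knowledge. Models do not produce their best knowledge reasoning by default, but a simple “think step-by-step” cue significantly improves knowledge recall. The authors therefore propose using world-knowledge question answering as a verifiable reward and training models with reinforcement learning to reason better over their parametric knowledge. The method yields clear gains on TriviaQA and other benchmarks and generalizes well.

Details

Motivation: Reasoning language models, such as those trained with reinforcement learning for mathematical reasoning, may not reason well when accessing the world knowledge (for example, factual knowledge) stored in their own parameters, which limits their use in knowledge-intensive tasks. The paper addresses this weakness in knowledge recall.

Result: After reinforcement learning on TriviaQA, performance on TriviaQA improves by 9.9%. The gains generalize to other knowledge QA benchmarks: Natural Questions, HotpotQA, SimpleQA, and StrategyQA improve by 4.2%, 2.1%, 0.6%, and 3.0%, respectively.

Insight: The key finding is that reasoning models are under-optimized for parametric knowledge access, and the paper proposes a simple, effective training recipe: use world-knowledge QA as a verifiable reward signal and optimize the reasoning process directly with RL so that the model extracts its internal knowledge better. This offers a new route to improving factual recall in LLMs.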

Abstract: We study reasoning for accessing world knowledge stored in a language model’s parameters. For example, recalling that Canberra is Australia’s capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning language models are trained via reinforcement learning to produce reasoning traces on tasks such as mathematics, they may not reason well for accessing their own world knowledge. We first find that models do not generate their best world knowledge reasoning by default: adding a simple “think step-by-step” cue demonstrates statistically significant improvement in knowledge recall but not math. Motivated by this, we propose training models to reason over their parametric knowledge using world-knowledge question answering as a verifiable reward. After reinforcement learning on TriviaQA (+9.9%), performance also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA by 4.2%, 2.1%, 0.6%, and 3.0%, respectively. Reasoning models are under-optimized for parametric knowledge access, but can be easily trained to reason better.
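A verifiable reward for question answering is one that a program can compute from the model's final answer and a set of gold answers. The sketch below assumes a normalized exact-match rule of the kind commonly used for open-domain QA; the abstract does not specify the authors' matching rule, and the function names are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def qa_reward(model_answer, gold_aliases):
    """Binary verifiable reward: 1.0 on a normalized exact match with any alias, else 0.0."""
    gold = {normalize(alias) for alias in gold_aliases}
    return 1.0 if normalize(model_answer) in gold else 0.0


r = qa_reward("The Canberra.", ["Canberra"])
```

Only the final answer is scored, so the reasoning trace that precedes it is free to take whatever form helps recall, which is what lets reinforcement learning shape it.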

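[29] SumTablets: A Transliteration Dataset of Sumerian Tablets cs.CL

Cole Simmons, Richard Diehl Martinez, Dan Jurafsky

TL;DR: This paper introduces SumTablets, a dataset that pairs Unicode representations of 91,606 Sumerian cuneiform tablets with the transliterations published by Oracc, addressing the lack of paired data that has kept modern NLP methods from being applied to Sumerian transliteration. The authors also use the dataset to implement two transliteration baselines and show the potential of a fine-tuned autoregressive language model.

Details

Motivation: Many Sumerian tablet transliterations are available online, but there has been no comprehensive, accessible dataset pairing transliterations with a digital (Unicode) representation of the tablets' cuneiform glyphs, which has prevented the application of modern NLP methods to the task.

Result: The fine-tuned autoregressive language model reaches an average character-level F-score (chrF) of 97.55 on transliteration, showing the potential of transformer-based transliteration models to let experts rapidly verify generated transliterations instead of transliterating tablets manually one by one.

Insight: The contribution is SumTablets, the first large-scale, structured dataset pairing Unicode cuneiform with transliterations of Sumerian tablets, which preserves structural information (surfaces, newlines, broken segments) through special tokens and is a key resource for computational Assyriology. The paper also shows how to use the dataset to build baselines and assess the feasibility of modern NLP models, such as fine-tuned language models, on this historical-language task.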

Abstract: Sumerian transliteration is a conventional system for representing a scholar’s interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet’s cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and standardizing the Oracc transliterations before mapping each reading back to the Unicode representation of the source glyph. Further, we retain parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We release SumTablets as a Hugging Face Dataset (CC BY 4.0) and open source data preparation code via GitHub. Additionally, we leverage SumTablets to implement and evaluate two transliteration baselines: (1) weighted sampling from a glyph’s possible readings, and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the immediate potential of transformer-based transliteration models in allowing experts to rapidly verify generated transliterations rather than manually transliterating tablets one-by-one.
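The first baseline named in this abstract, weighted sampling from a glyph's possible readings, can be sketched as follows. This is an assumed reading of that one-line description, not the released data preparation or baseline code; the function names and the four training pairs are toy values invented for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_reading_counts(pairs):
    """Count how often each glyph is transliterated with each reading."""
    counts = defaultdict(Counter)
    for glyph, reading in pairs:
        counts[glyph][reading] += 1
    return counts


def sample_reading(counts, glyph, rng):
    """Sample a reading for `glyph` in proportion to its training frequency."""
    readings = list(counts[glyph])
    weights = [counts[glyph][r] for r in readings]
    return rng.choices(readings, weights=weights, k=1)[0]


AN = "\U0001202D"  # CUNEIFORM SIGN AN
KI = "\U000121A0"  # CUNEIFORM SIGN KI

# Toy (glyph, reading) pairs; real counts would come from the dataset.
train = [(AN, "an"), (AN, "an"), (AN, "dingir"), (KI, "ki")]
counts = fit_reading_counts(train)
rng = random.Random(0)
guess = sample_reading(counts, AN, rng)
```

The baseline ignores context entirely, so it cannot resolve which reading of a polyvalent sign fits a given line; that is the gap a fine-tuned language model is positioned to close.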

cs.CV

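[30] HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles cs.CV

Yifan Wang, Francesco Pittaluga, Zaid Tasneem, Chenyu You, Manmohan Chandraker

TL;DR: This paper proposes HorizonForge, a framework for controllable driving scene generation. It reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion, and renders edits through a noise-aware video diffusion process that enforces spatial and temporal consistency. The authors also propose the HorizonSuite benchmark to standardize evaluation.

Details

Motivation: Existing approaches struggle to achieve photorealism and precise control together, which limits the realism and scalability of autonomous driving simulation. The paper addresses this trade-off in controllable driving scene generation.

Result: Extensive experiments on the proposed HorizonSuite benchmark show that the Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations and that temporal priors from video diffusion are essential for coherent synthesis. HorizonForge achieves an 83.4% user-preference gain and a 25.19% FID improvement over the second-best state-of-the-art method.

Insight: The core idea is to represent scenes uniformly as editable Gaussian Splats and Meshes and to render them with noise-aware video diffusion in a single feed-forward pass, with no per-trajectory optimization. This yields a simple yet powerful paradigm for photorealistic, controllable driving simulation, and the accompanying benchmark helps standardize evaluation in the field.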

Abstract: Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonForge establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second best state-of-the-art method. Project page: https://horizonforge.github.io/ .

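[31] Towards Controllable Video Synthesis of Routine and Rare OR Events cs.CV | cs.AI | cs.LG | eess.IV

Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova

TL;DR: This paper proposes a diffusion framework for controllable synthesis of operating room (OR) video. Combining a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model, it generates videos of routine events and of rare, safety-critical events from abstract geometric representations. It outperforms existing video diffusion baselines when synthesizing routine OR events, and synthetic data from the framework is used to train an AI model that detects near-misses of sterile-field violations, showing its potential to support the development of ambient intelligence models.

Details

Motivation: Operational and ethical challenges make it very hard to curate large-scale OR workflow datasets that include rare, safety-critical, or atypical events, which holds back the development of ambient intelligence for detecting, understanding, and mitigating such events in the OR.

Result: When synthesizing routine OR events, the method outperforms off-the-shelf video diffusion baselines on both in-domain and out-of-domain datasets, with lower FVD/LPIPS and higher SSIM/PSNR. An AI model trained and validated on the generated synthetic data achieves a recall of 70.13% in detecting near safety-critical events.

Insight: The novelty is an OR video diffusion framework that integrates geometric abstraction with conditioning, enabling controlled synthesis of counterfactual event videos and using synthetic data to ease the scarcity of real data, which opens a new route for developing ambient intelligence models.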

Abstract: Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging. This data bottleneck complicates the development of ambient intelligence for detecting, understanding, and mitigating rare or safety-critical events in the OR. Methods: This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events. The framework integrates a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model to first transform OR scenes into abstract geometric representations, then condition the synthesis process, and finally generate realistic OR event videos. Using this framework, we also curate a synthetic dataset to train and validate AI models for detecting near-misses of sterile-field violations. Results: In synthesizing routine OR events, our method outperforms off-the-shelf video diffusion baselines, achieving lower FVD/LPIPS and higher SSIM/PSNR in both in- and out-of-domain datasets. Through qualitative results, we illustrate its ability for controlled video synthesis of counterfactual events. An AI model trained and validated on the generated synthetic data achieved a RECALL of 70.13% in detecting near safety-critical events. Finally, we conduct an ablation study to quantify performance gains from key design choices. Conclusion: Our solution enables controlled synthesis of routine and rare OR events from abstract geometric representations. Beyond demonstrating its capability to generate rare and safety-critical scenarios, we show its potential to support the development of ambient intelligence models.

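[32] Momentum Memory for Knowledge Distillation in Computational Pathology cs.CV

Yongxin Guo, Hao Lu, Onur C. Koyun, Zhengjie Zhu, Muhammet Fatih Demir

TL;DR: This paper proposes Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework that addresses the scarcity of paired histology-genomics data in computational pathology. A momentum-updated memory aggregates genomic and histopathology information across batches, enlarging the supervisory context, and the gradients of the two branches are decoupled to avoid the modality gap, so that accurate cancer diagnosis is possible from histology images alone.

Details

Motivation: Multimodal learning that integrates genomics and histopathology has strong potential in cancer diagnosis, but clinical translation is limited by the scarcity of paired histology-genomics data. Existing knowledge distillation methods rely on batch-local alignment, whose limited within-batch comparisons cause instability and degrade performance.

Result: Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification) and on an independent in-house test dataset show that, under histology-only inference, MoMKD consistently outperforms state-of-the-art multiple instance learning (MIL) and multimodal knowledge distillation baselines, with strong performance and generalization.

Insight: The novelty is a momentum-updated memory that aggregates cross-batch information to enlarge the supervisory context, combined with decoupled gradients for the genomics and histology branches, which prevents genomic signals from dominating feature learning and removes the modality gap at inference time. Together they establish a robust and generalizable knowledge distillation paradigm for computational pathology.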

Abstract: Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology-genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance. To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time. Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology.
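A momentum-updated memory is typically an exponential moving average over per-batch statistics. The abstract does not give MoMKD's update rule, so the sketch below assumes the standard form, with a hypothetical function name, a momentum of 0.9, and a two-dimensional memory slot chosen only for illustration.

```python
def momentum_update(memory, batch_mean, momentum=0.9):
    """EMA update: memory <- momentum * memory + (1 - momentum) * batch_mean."""
    return [momentum * m + (1.0 - momentum) * b for m, b in zip(memory, batch_mean)]


# One memory slot, updated with the mean feature of three successive mini-batches.
memory = [0.0, 0.0]
for batch_mean in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    memory = momentum_update(memory, batch_mean)
```

Because each update keeps most of the previous state, the memory reflects many past batches at once, which is how it can enlarge the supervisory context beyond the samples that happen to share a mini-batch.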

Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani

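TL;DR: This paper proposes MMLoP (Multi-Modal Low-Rank Prompting), a framework for efficiently adapting vision-language models such as CLIP. It parameterizes vision and text prompts through a low-rank factorization, needs only 11.5K trainable parameters, and adds three components (a consistency loss, a uniform drift correction, and a shared up-projection) to improve performance, achieving a favorable accuracy-efficiency trade-off across multiple benchmarks.

Details

Motivation: When existing prompt learning methods are extended to both modalities and multiple layers, the parameter count grows sharply and the parameter-efficiency advantage is lost. The paper aims for deep multi-modal prompting with a small parameter budget.

Result: Experiments across three benchmarks and 11 datasets show that MMLoP reaches a harmonic mean of 79.70% on base-to-novel generalization and outperforms most existing methods, including those with orders of magnitude more parameters.

Insight: Contributions are parameter-efficient deep multi-modal prompting via low-rank factorization, plus a self-regulating consistency loss at the feature and logit levels, a uniform drift correction, and a shared up-projection that improve performance and preserve cross-modal alignment.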

Abstract: Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70% on base-to-novel generalization.
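The parameter saving from a low-rank factorization follows from simple counting. The sketch below is generic arithmetic, not MMLoP's actual layout: the prompt length, embedding width, and rank are hypothetical values, and it does not attempt to reproduce the 11.5K figure, which depends on details the abstract does not give.

```python
def full_params(n_tokens, dim):
    """Parameters in a directly learned prompt matrix P of shape (n_tokens, dim)."""
    return n_tokens * dim


def low_rank_params(n_tokens, dim, rank):
    """Parameters when P is factored as A (n_tokens x rank) times B (rank x dim)."""
    return n_tokens * rank + rank * dim


# Hypothetical sizes: 4 prompt tokens, 512-dimensional embeddings, rank 1.
dense = full_params(4, 512)
factored = low_rank_params(4, 512, rank=1)
ratio = dense / factored
```

The saving compounds when prompts are inserted at many layers and in both encoders, and sharing one factor across modalities (as the shared up-projection does) removes a further copy of the larger matrix.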

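[34] Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation cs.CV

Asim Unmesh, Kaki Ramesh, Mayank Patel, Rahul Jain, Karthik Ramani

TL;DR: This paper explores using vision-language models (VLMs) for open-vocabulary zero-shot temporal action segmentation (OVTAS). It proposes a training-free segmentation pipeline consisting of frame-action embedding similarity matching and similarity-matrix temporal segmentation, and presents a systematic analysis across 14 VLMs.

Details

Motivation: Temporal action segmentation is confined to closed vocabularies and fixed label sets because the space of activities is vast and annotation is costly. The paper explores whether open-vocabulary zero-shot segmentation is feasible.

Result: On standard benchmarks, OVTAS achieves strong results without task-specific supervision, showing the potential of VLMs for structured temporal understanding.

Insight: The novelty lies in the first systematic application of VLMs to open-vocabulary zero-shot action segmentation, a training-free segmentation pipeline, and extensive experiments that assess the suitability of different VLMs.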

Abstract: Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
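
The segmentation-by-classification idea can be sketched as follows, assuming frame and label embeddings are already available as plain lists of floats. The cosine-similarity matching mirrors the description of FAES; the sliding-window majority vote is only a simple stand-in for temporal consistency and is not the authors' SMTS. All function names and the toy embeddings are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def frame_labels(frame_embs, label_embs):
    """Assign each frame the index of its most similar label embedding."""
    labels = []
    for frame in frame_embs:
        sims = [cosine(frame, label) for label in label_embs]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


def majority_smooth(labels, window=3):
    """Replace each label by the majority label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed


# Toy 2-D embeddings: label 0 points along x, label 1 along y.
label_embs = [[1.0, 0.0], [0.0, 1.0]]
frame_embs = [
    [0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.7, 0.3],
    [0.9, 0.2], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7],
]
raw = frame_labels(frame_embs, label_embs)
segments = majority_smooth(raw, window=3)
print(raw, segments)
```

In this toy run the isolated third frame is first matched to label 1 and then absorbed into the surrounding label-0 segment by the smoothing step.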

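[35] WildSVG: Towards Reliable SVG Generation Under Real-Word Conditions cs.CV

Marco Terral, Haotian Zhang, Tianyang Zhang, Meng Lin, Xiaoqing Xie

TL;DR: This paper introduces the task of SVG extraction, which converts visual content in an image into scalable vector graphics. To address the noise, clutter, and domain shift found in real-world scenes, it introduces the WildSVG benchmark, made up of two datasets: Natural WildSVG, built from real images of company logos, and Synthetic WildSVG, which composites complex scenes synthetically. Current multimodal models fall far short of reliable SVG extraction in real scenes, although iterative refinement methods show promise.

Details

Motivation: Existing multimodal models generate SVGs well from clean renderings or textual descriptions, but in real-world scenes the noise, clutter, and domain shift introduced by natural images degrade performance, and no suitable benchmark exists for evaluating the task.

Result: State-of-the-art multimodal models evaluated on the WildSVG benchmark perform well below the level needed for reliable SVG extraction in real scenes. Iterative refinement methods show potential for improvement, and model capabilities are steadily improving.

Insight: The novelty is the first systematic definition of the SVG extraction task and the construction of the WildSVG benchmark, which combines real and synthetic data to simulate difficult conditions and provides a foundation for evaluating robustness in real scenes. Viewed objectively, the study highlights the importance of domain adaptation and noise handling in graphics generation, and iterative refinement may be a key direction for future gains.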

Abstract: We introduce the task of SVG extraction, which consists in translating specific visual inputs from an image into scalable vector graphics. Existing multimodal models achieve strong results when generating SVGs from clean renderings or textual descriptions, but they fall short in real-world scenarios where natural images introduce noise, clutter, and domain shifts. A central challenge in this direction is the lack of suitable benchmarks. To address this need, we introduce the WildSVG Benchmark, formed by two complementary datasets: Natural WildSVG, built from real images containing company logos paired with their SVG annotations, and Synthetic WildSVG, which blends complex SVG renderings into real scenes to simulate difficult conditions. Together, these resources provide the first foundation for systematically benchmarking SVG extraction. We benchmark state-of-the-art multimodal models and find that current approaches perform well below what is needed for reliable SVG extraction in real scenarios. Nonetheless, iterative refinement methods point to a promising path forward, and model capabilities are steadily improving.

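[36] ECHOSAT: Estimating Canopy Height Over Space And Time cs.CV | cs.AI | cs.LG

Jan Pauls, Karsten Schrödter, Sven Ligensa, Martin Schwartz, Berkant Turan

TL;DR: This paper proposes ECHOSAT, a model for producing a global, temporally consistent, multi-year tree height map at 10 m spatial resolution. It combines multi-sensor satellite data with a specialized vision transformer that performs pixel-level temporal regression, and introduces a self-supervised growth loss that regularizes predictions so they capture the dynamics of tree growth and disturbance.

Details

Motivation: Existing global tree height maps provide only static snapshots and cannot capture the temporal dynamics of forests, which are essential for accurate carbon accounting. Temporally consistent maps that quantify tree growth and disturbance are therefore needed.

Result: Experimental evaluation shows that the model improves state-of-the-art (SOTA) accuracy in the single-year prediction setting and provides the first global-scale height map that accurately quantifies tree growth and disturbance over time.

Insight: The novelty is training a specialized vision transformer on multi-sensor satellite data for pixel-level temporal regression and designing a self-supervised growth loss that regularizes predictions to follow natural tree growth curves (gradual increases as well as abrupt drops from disturbances such as fire), enabling dynamic, temporally consistent global forest monitoring.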

Abstract: Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT, a global and temporally consistent tree height map at 10 m resolution spanning multiple years. To this end, we resort to multi-sensor satellite data to train a specialized vision transformer model, which performs pixel-level temporal regression. A self-supervised growth loss regularizes the predictions to follow growth curves that are in line with natural tree development, including gradual height increases over time, but also abrupt declines due to forest loss events such as fires. Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions. We also provide the first global-scale height map that accurately quantifies tree growth and disturbances over time. We expect ECHOSAT to advance global efforts in carbon monitoring and disturbance assessment. The maps can be accessed at https://github.com/ai4forest/echosat.

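[37] PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models cs.CV | cs.LG

Binesh Sadanandan, Vahid Behzadan

TL;DR: This paper introduces the PSF-Med benchmark, which measures how sensitive medical vision-language models (VLMs) are to clinicians rephrasing the same question. Across six medical VLMs, answer flip rates range from 8% to 58%, and a low flip rate does not imply visual grounding, because some models rely only on language priors. Applying sparse autoencoders to MedGemma 4B, the authors identify a sparse feature tied to prompt framing; causal patching and feature clamping then substantially reduce flip rates while also reducing reliance on text priors.

Details

Motivation: Address the deployment risk that medical VLMs give inconsistent answers when a question is rephrased, and evaluate both the paraphrase sensitivity and the visual grounding of these models.

Result: Built on MIMIC-CXR and PadChest, the PSF-Med benchmark contains 19,748 chest X-ray questions and about 92,000 meaning-preserving paraphrases. Yes/no flip rates across six medical VLMs range from 8% to 58%. Feature clamping reduces the flip rate by 31% in relative terms at a cost of only 1.3 percentage points of accuracy.

Insight: Contributions include the PSF-Med benchmark for quantifying paraphrase sensitivity in medical VLMs, the use of sparse autoencoders to identify a sparse feature that influences decisions, and feature interventions that improve robustness. Viewed objectively, the work stresses that evaluation should consider both paraphrase stability and image reliance, and it offers an interpretable, mechanistic route to improving model reliability.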

Abstract: Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest X-ray questions paired with about 92,000 meaning-preserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature’s contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.
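
To make the two quantities in the abstract concrete, here is a minimal sketch of a flip-rate computation and of the difference between a relative reduction and a percentage-point change. It assumes a flip is counted whenever a paraphrase's yes/no answer differs from the answer to the original question; the paper's exact definition may differ. The record layout, function names, and all numbers below are made up for illustration.

```python
def flip_rate(records):
    """Fraction of paraphrase answers that differ from the original answer.

    records: list of (original_answer, [paraphrase_answers]) tuples.
    """
    flips = 0
    total = 0
    for original, paraphrase_answers in records:
        for answer in paraphrase_answers:
            total += 1
            if answer != original:
                flips += 1
    return flips / total if total else 0.0


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, e.g. 0.31 for a 31% relative cut."""
    return (before - after) / before


# Hypothetical answers for two questions with 3 and 2 paraphrases.
records = [
    ("yes", ["yes", "no", "yes"]),
    ("no", ["no", "no"]),
]
rate = flip_rate(records)  # 1 flip out of 5 paraphrases

# Illustrative only: a flip rate falling from 20% to 13.8% is a 31% relative
# reduction, which is different from a drop of 31 percentage points.
reduction = relative_reduction(0.20, 0.138)
print(rate, reduction)
```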

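[38] Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking cs.CV

Shengqiong Wu, Bobo Li, Xinkai Wang, Xiangtai Li, Lei Cui

TL;DR: This paper proposes a new thinking paradigm, the interleaved Analyzing-Drafting loop (AD-Loop), which aims to make understanding and generation work together in unified vision-language models. By alternating textual and visual thoughts, the model iteratively refines both its comprehension and its outputs. The method uses a two-stage training strategy (supervised learning followed by reinforcement learning), and experiments on multiple standard benchmarks confirm performance gains and generality across architectures.

Details

Motivation: Existing unified vision-language models focus on architectural unification and overlook the need for explicit interaction between understanding and generation during task solving, so the two are treated as parallel skills instead of a synergistic process. This paper aims for real synergy by introducing dynamically alternating analyzing and drafting operations.

Result: Extensive experiments on standard understanding and generation benchmarks show that AD-Loop consistently improves performance and transfers well to a variety of unified vision-language model architectures. Visual analyses further validate the effectiveness of implicit visual thoughts.

Insight: The novelty is the dynamically interleaved Analyzing-Drafting problem-solving loop (AD-Loop) as a new thinking paradigm, together with a training strategy that combines supervised and reinforcement learning, so that understanding and generation genuinely cooperate instead of merely running in parallel. The approach is principled and broadly applicable.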

Abstract: Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new thinking paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLM architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation. The project page is at https://sqwu.top/AD-Loop.

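[39] See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs cs.CV

Yongchang Zhang, Xianzheng Ma, Tianyi Liu, Guangquan Zhou, Yang Chen

TL;DR: This paper proposes a training-free iterative framework that improves large vision-language models (LVLMs) on visually grounded multimodal reasoning. It builds a textual visual-evidence pool and dynamically extracts additional evidence to supervise reasoning steps, which reduces the propagation of visual hallucinations.

Details

Motivation: When generating chain-of-thought (CoT) responses, existing LVLMs are prone to reasoning errors caused by propagated visual hallucinations, while reinforcement-learning-based fixes are costly and hard to generalize. A lightweight, training-free method is needed to keep reasoning steps consistent with visual evidence.

Result: Across multiple LVLM backbones and benchmarks, the method achieves improvements of 16.5%-29.5% on TreeBench and a 13.7% RH-AUC gain on RH-Bench, substantially lowering hallucination rates and improving reasoning accuracy.

Insight: The novelty is a dynamic visual decider module that iteratively expands the visual-evidence pool, supervising at test time so that every reasoning step has visual support. It is a plug-and-play method that requires no additional training and effectively reduces hallucination in multimodal reasoning.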

Abstract: Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to “think with images” via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model’s reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

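[40] Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning cs.CV

Zheang Huai, Honglong Yang, Xiaomeng Li

TL;DR: This paper proposes a tool-expertise-aware chest X-ray agent (TEA-CXA). Through a multimodal agentic learning framework, the agent interacts with tools and empirically learns how trustworthy each tool actually is for different query types, which lets it resolve conflicts among medical AI tools effectively.

Details

Motivation: Existing research on medical agents lacks sufficient understanding of the realistic reliability of tools and cannot effectively handle contradictory responses from error-prone AI tools. A framework that learns the practical trustworthiness of tools and resolves conflicts is therefore needed.

Result: Experiments show that TEA-CXA outperforms existing state-of-the-art methods and a range of baselines on chest X-ray analysis.

Insight: Contributions include an agentic learning framework that uses reinforcement learning for multi-turn multimodal tool calling; codebase extensions that support multimodal contexts, multiple tool calls in one turn, parallel tool inference, and multiple images in a single query; and a code framework applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings.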

Abstract: AI agents with tool-use capabilities show promise for integrating the domain expertise of various tools. In the medical field, however, tools are usually AI models that are inherently error-prone and can produce contradictory responses. Existing research on medical agents lacks sufficient understanding of the tools’ realistic reliability and thus cannot effectively resolve tool conflicts. To address this gap, this paper introduces a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning. As a concrete instantiation, we focus on chest X-ray analysis and present a tool-expertise-aware chest X-ray agent (TEA-CXA). When tool outputs disagree, the agent experimentally accepts or rejects multimodal tool results, receives rewards, and learns which tool to trust for each query type. Importantly, TEA-CXA extends existing codebases for reinforcement learning with multi-turn tool-calling that focus on textual inputs, to support multimodal contexts effectively. In addition, we enhance the codebase for medical use scenarios by supporting multiple tool calls in one turn, parallel tool inference, and multi-image accommodation within a single user query. Our code framework is applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings. Experiments show that TEA-CXA outperforms the state-of-the-art methods and a comprehensive set of baselines. Code will be released.

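[41] IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model cs.CV

Pengli Zhu, Yitao Zhu, Haowen Pang, Anqi Qiu

TL;DR: This paper proposes IHF-Harmony, a unified invertible hierarchy flow framework for harmonizing multi-modality magnetic resonance images (MRI). Using unpaired data, it decomposes the translation process into reversible feature transformations, guaranteeing bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while artefact-aware normalization (AAN) applies anatomy-fixed feature modulation to transfer target characteristics accurately. Combined with anatomy and artefact consistency losses, the method achieves high-fidelity harmonization that retains source anatomy.

Details

Motivation: Retrospective MRI harmonization methods scale poorly across modalities and depend on traveling-subject datasets. To address these limitations, the paper aims to develop a unified multi-modality harmonization framework that uses unpaired data.

Result: Experiments across multiple MRI modalities show that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, supporting robust harmonization for large-scale multi-site imaging studies.

Insight: The main novelty is a unified framework for multi-modality MRI harmonization built on an invertible hierarchy flow and artefact-aware normalization. Its core is a guaranteed bijective mapping that avoids anatomical distortion, and training on unpaired data improves practicality and scalability.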

Abstract: Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code will be released upon acceptance.
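
Why a subtractive coupling step is bijective can be seen in a generic sketch of an additive/subtractive coupling layer, in the style commonly used in normalizing flows. This is not the authors' IHF module: the split into two halves, the function `shift_fn`, and the toy input are assumptions for illustration. The point is that the inverse exists even though `shift_fn` itself is not invertible.

```python
def shift_fn(x1):
    """Arbitrary, non-invertible function of the untouched half (hypothetical)."""
    return [0.5 * v * v + 1.0 for v in x1]


def coupling_forward(x):
    """Keep the first half, subtract a function of it from the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    shift = shift_fn(x1)
    return x1 + [b - s for b, s in zip(x2, shift)]


def coupling_inverse(y):
    """Undo the forward step by adding the same shift back."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    shift = shift_fn(y1)  # y1 equals x1, so the shift is recomputed exactly
    return y1 + [b + s for b, s in zip(y2, shift)]


x = [1.0, 2.0, 3.0, 4.0]
y = coupling_forward(x)
x_rec = coupling_inverse(y)
print(y, x_rec)
```

Because the first half passes through unchanged, the shift can always be recomputed from the output, which is what makes the reconstruction lossless up to floating-point rounding.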

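[42] Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction cs.CV

Changqing Zhou, Yueru Luo, Changhao Chen

TL;DR: This paper proposes GPOcc, a framework that uses generalizable visual geometry priors for monocular occupancy prediction. It extends surface points inward along camera rays to generate volumetric samples, represents them as Gaussian primitives for probabilistic occupancy inference, and adds a training-free incremental update strategy to handle streaming input.

Details

Motivation: Existing methods rely mainly on depth priors and make only limited use of 3D cues, while visual geometry models provide rich 3D priors but still operate at the level of visible surfaces. The paper therefore explores how to use geometry priors more effectively for 3D occupancy prediction.

Result: On the Occ-ScanNet and EmbodiedOcc-ScanNet benchmarks, GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting, surpassing the prior SOTA in both. Under the same depth prior it gains +6.73 mIoU while running 2.65× faster.

Insight: The novelty is extending surface geometry priors into a volumetric Gaussian representation and designing a training-free incremental fusion strategy. Viewed objectively, the core contribution is probabilistic volumetric modeling that exploits geometry priors more fully, improving accuracy and efficiency together.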

Abstract: Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65× faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at https://github.com/JuIvyy/GPOcc.
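
The step of extending surface points inward along camera rays can be sketched geometrically as below. The number of samples, the fixed step size, and the function name `extend_along_ray` are hypothetical; the abstract does not state how GPOcc spaces its samples or how the resulting Gaussians are parameterized.

```python
import math


def extend_along_ray(camera, surface_point, num_samples=3, step=0.1):
    """Sample points starting at the surface and moving away from the camera.

    The ray direction is the unit vector from the camera center to the
    surface point, so later samples lie behind the visible surface.
    """
    direction = [p - c for p, c in zip(surface_point, camera)]
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [
        [p + k * step * u for p, u in zip(surface_point, unit)]
        for k in range(num_samples)
    ]


# Camera at the origin looking down +z at a surface point 2 m away.
samples = extend_along_ray([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], num_samples=3, step=0.5)
print(samples)
```

Each sample could then serve as the center of a Gaussian primitive, which is the role the abstract assigns to these volumetric samples.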

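[43] MultiAnimate: Pose-Guided Image Animation Made Extensible cs.CV

Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu

TL;DR: This paper proposes MultiAnimate, an extensible multi-character image animation framework built on Diffusion Transformers, which addresses the identity confusion and implausible occlusions that arise when existing methods are extended to multi-character scenes.

Details

Motivation: Existing diffusion-based pose-guided human image animation methods are mostly limited to single-character scenes. Extending them directly to multiple characters causes identity confusion and implausible occlusions, so an animation framework designed for multi-character interaction is needed.

Result: Across multiple benchmarks, the method achieves state-of-the-art performance on multi-character image animation, surpassing existing diffusion-based baselines, and it generalizes to scenes with more characters despite being trained only on a two-character dataset.

Insight: The core novelty is the Identifier Assigner and Identifier Adapter, which use a mask-driven scheme to capture each character's positional cues and the spatial relationships between characters. Combined with a scalable training strategy, this enables generalization beyond the number of characters seen in training.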

Abstract: Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, Identifier Assigner and Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.

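[44] SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction cs.CV

Haoxiang Fu, Lingfeng Zhang, Hao Li, Ruibing Hu, Zhengrong Li

TL;DR: This paper proposes SEF-MAP, a subspace-expert fusion framework for robust multimodal HD map prediction. The core idea is to explicitly disentangle BEV features into four semantic subspaces (LiDAR-private, image-private, shared, and interaction) and assign a dedicated expert to each. With an uncertainty-aware gating mechanism and a distribution-aware masking training strategy, the method achieves state-of-the-art performance on the nuScenes and Argoverse2 benchmarks.

Details

Motivation: Address the performance degradation in HD map prediction for autonomous driving that arises when camera-LiDAR fusion suffers from modality inconsistency, for example under low light, occlusion, or sparse point clouds.

Result: The method reaches the state of the art on the nuScenes and Argoverse2 benchmarks, surpassing prior methods by +4.2% and +4.8% mAP, respectively.

Insight: The novelty is explicitly disentangling BEV features into four semantic subspaces with dedicated experts, combined with uncertainty-aware gating and distribution-aware masking during training. This offers a structured, robust paradigm for multimodal fusion, and its ideas of subspace decomposition and expert specialization can be reused elsewhere.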

Abstract: High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEF-MAP, a Subspace-Expert Fusion framework for robust multimodal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEF-MAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAP provides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
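
One simple way to down-weight experts by predictive variance is normalized inverse-variance weighting, sketched below for a single BEV cell. The abstract does not give the actual gating function, so this rule, the `eps` constant, and the function names are assumptions; the usage balance regularizer is not modeled.

```python
def gate_weights(variances, eps=1e-6):
    """Normalized inverse-variance weights: higher variance, lower weight."""
    inverse = [1.0 / (v + eps) for v in variances]
    total = sum(inverse)
    return [w / total for w in inverse]


def fuse(expert_outputs, variances):
    """Weighted sum of per-expert feature vectors for one BEV cell."""
    weights = gate_weights(variances)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[j] for w, out in zip(weights, expert_outputs))
        for j in range(dim)
    ]


# Four hypothetical experts (LiDAR-private, image-private, shared, interaction);
# the last one is very uncertain in this cell.
variances = [0.1, 1.0, 0.1, 10.0]
weights = gate_weights(variances)
fused = fuse([[1.0], [3.0]], [1.0, 1.0])  # equal variances give a plain average
print(weights, fused)
```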

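[45] A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers cs.CV

Trung X. Pham, Kang Zhang, Ji Woo Hong, Chang D. Yoo

TL;DR: This paper presents the first systematic study of the structure of conditional embeddings in Diffusion Transformers and finds pronounced redundancy. On tasks such as ImageNet-1K, class-conditional embeddings exhibit extremely high angular similarity (above 99%), and semantic information is concentrated in a few dimensions. Pruning low-magnitude dimensions (removing up to two-thirds of the embedding space) leaves generation quality and fidelity largely unaffected and in some cases improves them.

Details

Motivation: Diffusion Transformers achieve SOTA performance on conditional generation, but the structure of their learned conditional embeddings is poorly understood. This paper aims to reveal the semantic bottleneck and redundancy patterns in these embeddings.

Result: On ImageNet-1K, the angular similarity of class-conditional embeddings exceeds 99%; on continuous-condition tasks such as pose-guided image generation and video-to-audio generation, it exceeds 99.9%. After pruning low-magnitude dimensions, generation quality remains largely unchanged and sometimes improves, indicating substantial redundancy in the embedding space.

Insight: The novelty is the first systematic analysis of the redundant structure of conditional embeddings in Diffusion Transformers. It shows that semantic information is concentrated in head dimensions and that pruning can substantially compress the embedding space without hurting performance, suggesting directions for more efficient conditioning mechanisms.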

Abstract: Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions (removing up to two-thirds of the embedding space), we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer-based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.
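
The two measurements in the abstract, angular similarity between embeddings and pruning of low-magnitude dimensions, can be sketched as follows. The per-vector top-k rule in `prune_low_magnitude` is an assumption, since the abstract does not say whether magnitudes are ranked per embedding or across the whole table, and the toy vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as the measure of angular similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prune_low_magnitude(vec, keep_fraction=1 / 3):
    """Zero all but the largest-magnitude dimensions of one embedding."""
    keep = max(1, round(len(vec) * keep_fraction))
    ranked = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    kept = set(ranked[:keep])
    return [v if i in kept else 0.0 for i, v in enumerate(vec)]


# Two toy "class embeddings" dominated by a shared large component.
emb_a = [10.0, 0.1]
emb_b = [10.0, -0.1]
similarity = cosine(emb_a, emb_b)

# Keep one third of the dimensions, i.e. remove two-thirds of the space.
pruned = prune_low_magnitude([5.0, -0.1, 0.2, -4.0, 0.05, 0.3])
print(similarity, pruned)
```

The toy pair shows how a single dominant shared component pushes cosine similarity above 0.99 even when the remaining dimensions disagree in sign.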

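[46] Virtual Biopsy for Intracranial Tumors Diagnosis on MRI cs.CV | cs.AI

Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai

TL;DR: This paper proposes an MRI-based virtual biopsy framework for non-invasive diagnosis of intracranial tumors. The framework comprises an MRI processor, a tumor localizer based on vision-language models, and an adaptive diagnoser with a masked channel attention mechanism, and it targets the challenges posed by scarce biopsy data and tumor spatial heterogeneity.

Details

Motivation: For deep intracranial tumors in brain regions that control vital functions, conventional biopsy carries risks of hemorrhage and neurological deficits and suffers from sampling bias, since pathological changes are region-selective and not tumor-wide. Non-invasive MRI-based pathology prediction is therefore needed to support holistic tumor assessment and clinical decision-making.

Result: On ICT-MRI, the first public biopsy-verified benchmark dataset (249 cases across four tumor categories), the proposed framework achieves over 90% accuracy, outperforming baselines by more than 20%.

Insight: Contributions include constructing the first public biopsy-verified MRI dataset to ease data scarcity; using vision-language models for weakly supervised coarse-to-fine tumor localization; and fusing local discriminative features with global context through masked channel attention to improve diagnosis. Viewed objectively, the method combines multimodal information (vision and language) with attention mechanisms and offers a weakly supervised learning framework that medical image analysis can draw on.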

Abstract: Deep intracranial tumors situated in eloquent brain regions controlling vital functions present critical diagnostic challenges. Clinical practice has shifted toward stereotactic biopsy for pathological confirmation before treatment. Yet biopsy carries inherent risks of hemorrhage and neurological deficits and struggles with sampling bias due to tumor spatial heterogeneity, because pathological changes are typically region-selective rather than tumor-wide. Therefore, advancing non-invasive MRI-based pathology prediction is essential for holistic tumor assessment and modern clinical decision-making. The primary challenge lies in data scarcity: low tumor incidence requires long collection cycles, and annotation demands biopsy-verified pathology from neurosurgical experts. Additionally, tiny lesion volumes lacking segmentation masks cause critical features to be overwhelmed by background noise. To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories. We propose a Virtual Biopsy framework comprising: MRI-Processor for standardization; Tumor-Localizer employing vision-language models for coarse-to-fine localization via weak supervision; and Adaptive-Diagnoser with a Masked Channel Attention mechanism fusing local discriminative features with global contexts. Experiments demonstrate over 90% accuracy, outperforming baselines by more than 20%.

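[47] Tokenizing Semantic Segmentation with RLE cs.CV

Abhineet Singh, Justin Rozeboom, Nilanjan Ray

TL;DR: This paper proposes a unified approach to semantic segmentation that uses language modeling to output segmentation masks in images and videos as sequences of discrete tokens. It discretizes the masks with run-length encoding (RLE) and trains a modified Pix2Seq model to output the RLE tokens autoregressively. The paper also proposes novel tokenization strategies that shorten the token sequence so the approach extends to video, and shows how instance information can be folded into tokenization to perform panoptic segmentation.

Details

Motivation: Address the lack of a unified framework for semantic segmentation across images and videos, simplifying the task through language modeling and discrete tokenization and extending it to video and panoptic segmentation.

Result: Evaluation on two datasets shows that the proposed models are competitive with the state of the art (SOTA) despite being limited by computational resources.

Insight: Contributions include RLE-based mask discretization, a unified language-modeling framework for segmentation, token-sequence compression strategies that enable video processing, and folding instance information into tokenization for panoptic segmentation, offering a new sequence-based view of segmentation.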

Abstract: This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks and then train a modified version of Pix2Seq to output these RLE tokens through autoregression. We propose novel tokenization strategies to compress the length of the token sequence to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our proposed models on two datasets to show that they are competitive with the state of the art in spite of being bottlenecked by our limited computational resources.
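
Run-length encoding of a segmentation mask can be illustrated with a short sketch that turns a flattened, row-major mask into `(start, length, class)` runs and back. This triple layout is an assumption made for the example; the abstract does not specify the paper's token format or its sequence-compression strategies.

```python
def rle_encode(mask_flat, background=0):
    """Encode a flattened mask as (start, length, class) runs, skipping background."""
    runs = []
    i = 0
    n = len(mask_flat)
    while i < n:
        value = mask_flat[i]
        j = i
        while j < n and mask_flat[j] == value:
            j += 1
        if value != background:
            runs.append((i, j - i, value))
        i = j
    return runs


def rle_decode(runs, length, background=0):
    """Rebuild the flattened mask from its runs."""
    mask = [background] * length
    for start, run_length, value in runs:
        mask[start:start + run_length] = [value] * run_length
    return mask


mask = [0, 0, 1, 1, 1, 0, 2, 2]
runs = rle_encode(mask)
restored = rle_decode(runs, len(mask))
print(runs, restored)
```

Each run could then be mapped to a few discrete tokens for an autoregressive model; the sequence length depends on the number of runs and not on the number of pixels, which is why run-based encodings suit long inputs such as video.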

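[48] UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling cs.CV

Zhihao Sun, Tong Wu, Ruirui Tu, Daoguo Dong, Zuxuan Wu

TL;DR: UniHand is a unified diffusion-based framework for 4D hand motion modeling. It casts hand motion estimation and generation as a single conditional motion synthesis problem, embeds heterogeneous inputs into a shared latent space with a joint variational autoencoder, and uses a latent diffusion model to synthesize consistent motion sequences from diverse conditions.

Details

Motivation: Hand motion modeling is usually split into two separate tasks, estimation and generation, which limits effective use of heterogeneous condition signals and blocks knowledge transfer between the tasks. UniHand aims to solve this with a unified framework.

Result: Extensive experiments across multiple benchmarks show that UniHand delivers robust and accurate hand motion modeling and maintains performance under severe occlusion and temporally incomplete inputs.

Insight: Contributions include unifying estimation and generation as conditional synthesis, aligning heterogeneous conditions (such as MANO parameters and 2D skeletons) through a joint VAE, and a dedicated hand perceptron that extracts hand-specific cues directly from image features without complex detection and cropping pipelines.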

Abstract: Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.

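[49] Axial-Centric Cross-Plane Attention for 3D Medical Image Classification cs.CV

Doyoung Park, Jinsoo Kim, Lohendran Baskaran

TL;DR: This paper proposes an axial-centric cross-plane attention architecture for 3D medical image classification. By capturing the inherent asymmetric dependencies between anatomical planes, it mirrors the axial-first diagnostic workflow of clinicians.

Details

Motivation: Existing 3D deep learning methods typically process volumes holistically or treat all anatomical planes equally, failing to reflect the asymmetric clinical workflow in which the axial plane is the primary diagnostic reference and the coronal and sagittal planes provide complementary information.

Result: Experiments on six datasets from the MedMNIST3D benchmark show that the architecture consistently outperforms existing 3D and multi-plane models in accuracy and AUC.

Insight: The novelty is an axial-centric query-key-value allocation and a directional cross-plane fusion mechanism, together with the use of MedDINOv3, pretrained on large-scale axial CT images, as a frozen feature extractor. Aligning architectural design with clinical interpretation workflows yields robust and data-efficient 3D medical image analysis.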

Abstract: Clinicians commonly interpret three-dimensional (3D) medical images, such as computed tomography (CT) scans, using multiple anatomical planes rather than as a single volumetric representation. In this multi-planar approach, the axial plane typically serves as the primary acquisition and diagnostic reference, while the coronal and sagittal planes provide complementary spatial information to increase diagnostic confidence. However, many existing 3D deep learning methods either process volumetric data holistically or assign equal importance to all planes, failing to reflect the axial-centric clinical interpretation workflow. To address this gap, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that captures the inherent asymmetric dependencies between different anatomical planes. Our architecture incorporates MedDINOv3, a medical vision foundation model pretrained via self-supervised learning on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information within each anatomical plane, while axial-centric cross-plane transformer encoders condition axial features on complementary information from auxiliary planes. Experimental results on six datasets from the MedMNIST3D benchmark demonstrate that the proposed architecture consistently outperforms existing 3D and multi-plane models in terms of accuracy and AUC. Ablation studies further confirm the importance of axial-centric query-key-value allocation and directional cross-plane fusion. These results highlight the importance of aligning architectural design with clinical interpretation workflows for robust and data-efficient 3D medical image analysis.

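[50] Lie Flow: Video Dynamic Fields Modeling and Predicting with Lie Algebra as Geometric Physics Principle cs.CV

Weidong Qiao, Wangmeng Zuo, Hui Li

TL;DR: LieFlow is a dynamic radiance field representation framework that explicitly models motion with the SE(3) Lie group, learning translation and rotation in a unified geometric space to improve the physical consistency and spatiotemporal coherence of 4D dynamic scene modeling.

Details

Motivation: Existing methods rely mainly on translational displacements and struggle to represent rotations and articulated transformations, leading to spatial inconsistency and physically implausible motion. A physically consistent representation that handles complex rigid and non-rigid motion in a unified way is needed.

Result: On a synthetic dataset with rigid-body trajectories and two real-world datasets capturing complex motion under natural lighting and occlusion, LieFlow outperforms NeRF-based baselines in view-synthesis fidelity, temporal coherence, and physical realism.

Insight: The novelty is introducing the SE(3) Lie group as a geometric physics principle for dynamic field modeling. The SE(3) transformation field imposes physically inspired constraints, giving a robust and physically grounded framework for representing dynamic 4D scenes.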

Abstract: Modeling 4D scenes requires capturing both spatial structure and temporal motion, which is challenging due to the need for physically consistent representations of complex rigid and non-rigid motions. Existing approaches mainly rely on translational displacements, which struggle to represent rotations and articulated transformations, often leading to spatial inconsistency and physically implausible motion. We introduce LieFlow, a dynamic radiance representation framework that explicitly models motion within the SE(3) Lie group, enabling coherent learning of translation and rotation in a unified geometric space. The SE(3) transformation field enforces physically inspired constraints to maintain motion continuity and geometric consistency. The evaluation includes a synthetic dataset with rigid-body trajectories and two real-world datasets capturing complex motion under natural lighting and occlusions. Across all datasets, LieFlow consistently improves view-synthesis fidelity, temporal coherence, and physical realism over NeRF-based baselines. These results confirm that SE(3)-based motion modeling offers a robust and physically grounded framework for representing dynamic 4D scenes.

[51] CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning cs.CV | cs.AI

Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou

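TL;DR: This paper proposes CCCaption, an image captioning framework based on dual-reward reinforcement learning that aims to generate more complete and correct captions. It rewards completeness through visual queries, penalizes hallucinations through sub-caption verification to improve correctness, and achieves consistent improvements on multiple standard benchmarks.

Details

Motivation: Image captioning relies on human annotations for supervision, but those annotations reflect subjective preferences and varying expertise, so reference captions may be incomplete or incorrect, which limits caption models. Caption quality should therefore be assessed from an objective standpoint, in terms of completeness and correctness.

Result: Extensive experiments on standard captioning benchmarks (e.g., COCO Captions) show consistent performance gains, offering a principled path to training caption models beyond imitation of human annotations.

Insight: The contributions are: 1) a dual-reward reinforcement learning framework that optimizes caption completeness and correctness separately; 2) the use of diverse LVLMs to generate visual queries for assessing completeness, with dynamic query sampling to improve training efficiency; 3) penalizing hallucinations by decomposing the caption and verifying the authenticity of each sub-caption, which improves correctness. The method moves caption quality assessment from subjective human standards toward objective, measurable criteria, which is a useful idea to borrow.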

Abstract: Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.

[52] Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis cs.CV | cs.AI

Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang

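TL;DR: This paper proposes a visual cognition-guided cooperative network (VCC-Net) for human-AI collaborative chest X-ray diagnosis. It captures radiologists' visual search traces and attention patterns through clinically compatible interfaces such as eye tracking or a mouse, and uses them as spatial cognition guidance to learn hierarchical visual search strategies that localize diagnostically key regions. A cognition-graph co-editing module combines radiologist visual cognition with model inference to build a disease-aware graph, mitigating radiologist bias and supporting complementary, transparent decision-making.

Details

Motivation: Existing computer-aided diagnosis (CAD) systems are disconnected from clinical workflows and lack reliable decision support and interpretability. Human-AI collaboration aims to make diagnostic models more reliable by integrating controllable radiologist behavior, but interactive tools that embed seamlessly in the diagnostic routine are missing, and the semantic gap between radiologists' decision patterns and model representations limits clinical adoption.

Result: Experiments on the public SIIM-ACR and EGD-CXR datasets and the self-constructed TB-Mouse dataset reach classification accuracies of 88.40%, 85.05%, and 92.41%, respectively. The attention maps produced by VCC-Net agree closely with radiologists' gaze distributions, indicating mutual reinforcement between radiologist and model inference.

Insight: The novelty is a collaborative diagnostic paradigm centered on visual cognition (VC), which captures radiologists' visual search strategies through clinically compatible interfaces and integrates them as spatial guidance. The cognition-graph co-editing module aligns model representations with VC-driven features and captures dependencies among anatomical regions, improving performance while also strengthening interpretability and clinical applicability.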

Abstract: Computer-aided diagnosis (CAD) has significantly advanced automated chest X-ray diagnosis but remains isolated from clinical workflows and lacks reliable decision support and interpretability. Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists. However, the absence of interactive tools seamlessly embedded within diagnostic routines impedes collaboration, while the semantic gap between radiologists’ decision-making patterns and model representations further limits clinical adoption. To overcome these limitations, we propose a visual cognition-guided collaborative network (VCC-Net) to achieve the cooperative diagnostic paradigm. VCC-Net centers on visual cognition (VC) and employs clinically compatible interfaces, such as eye-tracking or the mouse, to capture radiologists’ visual search traces and attention patterns during diagnosis. VCC-Net employs VC as a spatial cognition guide, learning hierarchical visual search strategies to localize diagnostically key regions. A cognition-graph co-editing module subsequently integrates radiologist VC with model inference to construct a disease-aware graph. The module captures dependencies among anatomical regions and aligns model representations with VC-driven features, mitigating radiologist bias and facilitating complementary, transparent decision-making. Experiments on the public datasets SIIM-ACR, EGD-CXR, and self-constructed TB-Mouse dataset achieved classification accuracies of 88.40%, 85.05%, and 92.41%, respectively. The attention maps produced by VCC-Net exhibit strong concordance with radiologists’ gaze distributions, demonstrating a mutual reinforcement of radiologist and model inference. The code is available at https://github.com/IPMI-NWU/VCC-Net.

[53] Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping cs.CV | cs.GR

Junmyeong Lee, Hoseung Choi, Minsu Cho

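TL;DR: This paper presents MoGaF (Motion Group-aware Gaussian Forecasting), a framework for long-term space-time forecasting of dynamic scenes. Built on the 4D Gaussian Splatting representation, it uses motion-aware Gaussian grouping and group-wise optimization to model physically consistent motion in rigid and non-rigid regions, and a lightweight forecasting module predicts future motion to produce realistic, temporally stable scene evolution.

Details

Motivation: Forecasting dynamic scenes is a fundamental challenge in computer vision because limited observations make it hard to capture coherent object-level motion and long-term temporal evolution. The paper addresses this by introducing a structured space-time representation that improves the coherence and stability of forecasts.

Result: Experiments on synthetic and real-world datasets show that MoGaF outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability.

Insight: The contributions include motion-aware Gaussian grouping and a group-wise optimization strategy, which help enforce physically consistent motion representations, and a lightweight forecasting module built on 4D Gaussian Splatting that predicts future motion efficiently, offering a new approach to long-term extrapolation of dynamic scenes.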

Abstract: Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability. Our project page is available at https://slime0519.github.io/mogaf

[54] SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR cs.CV

Rajai Alhimdiat, Ramy Battrawy, René Schuster, Didier Stricker, Wesam Ashour

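TL;DR: This paper proposes SF3D-RGB, a deep learning architecture that estimates scene flow from monocular camera images and sparse LiDAR point clouds. The model is end-to-end: it encodes and fuses features from both modalities, computes an initial scene flow with a graph matching module, and refines it with a residual module, aiming to balance accuracy and efficiency.

Details

Motivation: Existing learning-based scene flow methods mostly focus on a single modality (image or LiDAR). The paper fuses 2D image and 3D point cloud information to obtain more robust and accurate sparse scene flow estimates.

Result: Experiments show that the method outperforms single-modality methods on real-world datasets and, compared with other state-of-the-art fusion methods, achieves better scene flow accuracy with fewer parameters.

Insight: The novelty is an end-to-end multimodal fusion architecture that combines graph matching with residual refinement, improving the robustness and accuracy of sparse scene flow estimation while keeping the model efficient.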

Abstract: Scene flow estimation is an extremely important task in computer vision to support the perception of dynamic changes in the scene. For robust scene flow, learning-based approaches have recently achieved impressive results using either image-based or LiDAR-based modalities. However, these methods have tended to focus on the use of a single modality. To tackle these problems, we present a deep learning architecture, SF3D-RGB, that enables sparse scene flow estimation using 2D monocular images and 3D point clouds (e.g., acquired by LiDAR) as inputs. Our architecture is an end-to-end model that first encodes information from each modality into features and fuses them together. Then, the fused features enhance a graph matching module for better and more robust mapping matrix computation to generate an initial scene flow. Finally, a residual scene flow module further refines the initial scene flow. Our model is designed to strike a balance between accuracy and efficiency. Furthermore, experiments show that our proposed method outperforms single-modality methods and achieves better scene flow accuracy on real-world datasets while using fewer parameters compared to other state-of-the-art methods with fusion.

[55] Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models cs.CV | cs.AI

Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu

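TL;DR: This paper proposes Dynamic Multimodal Activation Steering, a training-free method for mitigating hallucination in large vision-language models. Based on an in-depth analysis of model activation patterns, it builds a semantic-based database of truthfulness steering vectors together with visual perception steering vectors. At inference time it dynamically selects the most relevant steering vectors by input semantic similarity and applies them to the most influential attention heads, giving context-aware intervention.

Details

Motivation: Large vision-language models perform well on vision-language tasks but hallucinate severely. By analyzing activation patterns, the authors find that truthfulness and visual perception mainly engage different subsets of attention heads, and that truthfulness steering vectors differ markedly across semantic contexts, which offers a new entry point for addressing hallucination.

Result: Comprehensive experiments across multiple models and datasets show that the method significantly improves model performance and outperforms existing state-of-the-art methods.

Insight: The novelty is the finding that truthfulness and visual perception are functionally separated at the attention-head level in LVLMs, and a dynamic, context-aware activation steering mechanism built on that finding. The method needs no training: a steering vector database plus dynamic selection allows targeted intervention in the model's inference, offering an efficient and interpretable way to mitigate hallucination.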

Abstract: Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems. Through in-depth analysis of LVLM activation patterns, we reveal two key findings: 1) truthfulness and visual perception capabilities predominantly engage different subsets of attention heads within the model architecture; and 2) truthfulness steering vectors vary significantly across different semantic contexts. Based on these observations, we propose Dynamic Multimodal Activation Steering, a training-free approach for hallucination mitigation. Our method constructs a semantic-based truthfulness steering vector database and computes visual perception steering vectors, enabling context-aware interventions during inference by dynamically selecting the most relevant steering vectors based on input semantic similarity and applying them to the most influential attention heads. We conduct comprehensive experiments across multiple models and datasets, demonstrating that our approach significantly enhances model performance, outperforming existing state-of-the-art methods.
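
The selection step described in the abstract above (pick the steering vector whose context is most similar to the input, then add it to chosen attention heads) can be sketched in a few lines. This is only an illustration under stated assumptions: the database layout, the use of cosine similarity, the names `select_steering_vector` and `apply_steering`, and the scaling factor `alpha` are hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_steering_vector(query_emb, database):
    """Return the steering vector whose context embedding best matches the query.

    `database` is assumed to be a list of (context_embedding, steering_vector) pairs.
    """
    return max(database, key=lambda entry: cosine(query_emb, entry[0]))[1]

def apply_steering(head_outputs, steering, target_heads, alpha=1.0):
    """Add alpha * steering to the outputs of the selected heads only."""
    return [
        [x + alpha * s for x, s in zip(out, steering)] if i in target_heads else list(out)
        for i, out in enumerate(head_outputs)
    ]
```

For example, with two stored contexts and a query closer to the first, `select_steering_vector` returns the first vector, and `apply_steering(..., target_heads={1})` shifts only head 1 while leaving the other heads untouched.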

[56] SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video cs.CV | cs.AI

Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao

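TL;DR: This paper presents the SurGo-R1 model and a benchmark dataset named ResGo for evaluating and modeling contextual reasoning about safe operative zones in surgical video. The model uses a multi-stage architecture that first identifies the surgical phase and then generates reasoning and operative-zone coordinates conditioned on that context, and it substantially outperforms mainstream generalist vision-language models on unseen procedures.

Details

Motivation: Identifying safe operative zones in minimally invasive surgery is difficult. Existing AI systems offer only binary safety verification or static detection and ignore the fact that intraoperative reasoning depends on the surgical phase, so systems are needed that integrate visual cues, surgical phase, and anatomical context.

Result: On unseen procedures, SurGo-R1 achieves 76.6% phase recognition accuracy, 32.7% mean intersection over union (mIoU), and 54.8% hardcore accuracy, a 6.6× improvement over mainstream generalist vision-language models.

Insight: The contributions are a benchmark dataset with surgical phase annotations and clinician-authored rationales, and a multi-stage (phase-then-go) model optimized with reinforcement learning from human feedback (RLHF). The architecture forces the model to understand the surgical phase before reasoning about the zone, which suits context-dependent reasoning tasks.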

Abstract: Minimally invasive surgery has dramatically improved patient operative outcomes, yet identifying safe operative zones remains challenging in critical phases, requiring surgeons to integrate visual cues, procedural phase, and anatomical context under high cognitive load. Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning. We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder. We introduce evaluation metrics that treat correct grounding under incorrect phase as failures, revealing that most vision-language models cannot handle such tasks and perform poorly. We then present SurGo-R1, a model optimized via RLHF with a multi-turn phase-then-go architecture where the model first identifies the surgical phase, then generates reasoning and Go Zone coordinates conditioned on that context. On unseen procedures, SurGo-R1 achieves 76.6% phase accuracy, 32.7 mIoU, and 54.8% hardcore accuracy, a 6.6× improvement over the mainstream generalist VLMs. Code, model and benchmark will be available at https://github.com/jinlab-imvr/SurGo-R1
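
The abstract above mentions metrics that count a correctly grounded box as a failure when the predicted phase is wrong. A minimal sketch of such a phase-gated check follows. The exact definition used by ResGo is not given here, so the function name `phase_gated_hit`, the 0.5 IoU threshold, and the phase labels in the example are all assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def phase_gated_hit(pred_phase, true_phase, pred_box, true_box, iou_threshold=0.5):
    """Count a prediction as correct only if BOTH the phase and the box are right.

    The threshold is an assumed value, not one reported by the paper.
    """
    return pred_phase == true_phase and box_iou(pred_box, true_box) >= iou_threshold
```

Under this gating, a perfectly localized box with a wrong phase label, for example `phase_gated_hit("clipping", "dissection", box, box)`, evaluates to `False`.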

[57] Innovative Tooth Segmentation Using Hierarchical Features and Bidirectional Sequence Modeling cs.CV

Xinxin Zhao, Jian Jiang, Yan Tian, Liqin Wu, Zhaocheng Xu

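TL;DR: This paper proposes a tooth image segmentation method that combines hierarchical feature representation with bidirectional sequence modeling. It targets the discontinuous segmentation and poor target-background discrimination that traditional methods suffer from because of insufficient modeling of environmental and global context, while avoiding the high computational overhead of Transformer self-attention.

Details

Motivation: Traditional tooth segmentation methods rely on fixed-resolution feature maps, which leads to discontinuous segmentation and poor separation of target and background. Transformer-based self-attention is inefficient on high-resolution dental images because of its quadratic computational complexity.

Result: The method is validated on two dental datasets and shown to outperform existing approaches. On the OralVision dataset, mean intersection over union (mIoU) improves by 1.1%.

Insight: The contributions are: 1) a three-stage encoder with hierarchical feature representation that captures scale-adaptive information; 2) cross-scale feature fusion that jointly uses low-level details and high-level semantics, preserving fine structure while keeping strong contextual awareness; 3) a bidirectional sequence modeling strategy that improves global spatial context understanding without high computational cost.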

Abstract: Tooth image segmentation is a cornerstone of dental digitization. However, traditional image encoders relying on fixed-resolution feature maps often lead to discontinuous segmentation and poor discrimination between target regions and background, due to insufficient modeling of environmental and global context. Moreover, transformer-based self-attention introduces substantial computational overhead because of its quadratic complexity (O(n^2)), making it inefficient for high-resolution dental images. To address these challenges, we introduce a three-stage encoder with hierarchical feature representation to capture scale-adaptive information in dental images. By jointly leveraging low-level details and high-level semantics through cross-scale feature fusion, the model effectively preserves fine structural information while maintaining strong contextual awareness. Furthermore, a bidirectional sequence modeling strategy is incorporated to enhance global spatial context understanding without incurring high computational cost. We validate our method on two dental datasets, with experimental results demonstrating its superiority over existing approaches. On the OralVision dataset, our model achieves a 1.1% improvement in mean intersection over union (mIoU).

[58] TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection cs.CV

Wenbin Wang, Yuge Huang, Jianqing Xu, Yue Yu, Jiangtao Yan

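TL;DR: This paper proposes TranX-Adapter, a lightweight fusion adapter that makes multimodal large language models (MLLMs) more robust at detecting AI-generated images. Using task-aware optimal-transport fusion and a cross-fusion mechanism, it addresses the performance limit caused by attention dilution when texture-level artifact features are fused with semantic features.

Details

Motivation: Existing methods combine texture-level artifact features with semantic features inside MLLMs to improve AI-generated image detection. However, artifact features are highly similar to one another, so the attention map becomes nearly uniform, which dilutes attention and prevents effective fusion of the two feature types.

Result: On standard AI-generated image detection benchmarks with several advanced MLLMs, TranX-Adapter brings consistent and significant gains, with accuracy improving by up to 6%.

Insight: The novelty is a Task-aware Optimal-Transport Fusion that uses the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, together with an X-Fusion mechanism that uses cross-attention to transfer semantic information into artifact features. The lightweight design thus achieves effective, complementary interaction between the two feature types.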

Abstract: Rapid advances in AI-generated image (AIGI) technology enable highly realistic synthesis, threatening public information integrity and security. Recent studies have demonstrated that incorporating texture-level artifact features alongside semantic features into multimodal large language models (MLLMs) can enhance their AIGI detection capability. However, our preliminary analyses reveal that artifact features exhibit high intra-feature similarity, leading to an almost uniform attention map after the softmax operation. This phenomenon causes attention dilution, thereby hindering effective fusion between semantic and artifact features. To overcome this limitation, we propose a lightweight fusion adapter, TranX-Adapter, which integrates a Task-aware Optimal-Transport Fusion that leverages the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, and an X-Fusion that employs cross-attention to transfer semantic information into artifact features. Experiments on standard AIGI detection benchmarks with several advanced MLLMs show that our TranX-Adapter brings consistent and significant improvements (up to +6% accuracy).
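
The abstract above uses a Jensen-Shannon divergence cost matrix inside an optimal-transport step. The sketch below shows one common way such a pipeline can be assembled: a JS cost between per-token prediction distributions, followed by entropy-regularized Sinkhorn iterations. Sinkhorn is swapped in here as a standard solver; the paper's actual solver, the per-token granularity, and the names `js_divergence` and `sinkhorn` are assumptions.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropy-regularized transport plan between marginals a (rows) and b (columns)."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    n, m = len(a), len(b)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

With `cost[i][j] = js_divergence(artifact_probs[i], semantic_probs[j])`, the resulting plan assigns more mass to artifact-semantic pairs whose predictions agree. Those weights could then be used to mix artifact features into semantic ones.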

[59] SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning cs.CV

Jiayi Wang, Hadrien Reynaud, Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit

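TL;DR: This paper proposes SigVLP, a new vision-language pre-training model that addresses the highly variable size and resolution of CT volumes caused by differences in scanners and acquisition parameters. It treats a volume as a sequence of 3D chunks, handles the z-axis with Rotary Position Embeddings, and introduces chunk-wise localized text alignment for fine-grained supervision, avoiding the information loss of conventional fixed-size cropping or interpolation.

Details

Motivation: In large-scale medical imaging datasets, CT scans come from different vendors and devices, so resolution, slice thickness, and slice count vary widely. Conventional methods crop or interpolate the data into fixed-size blocks for training, which loses information. The paper aims to overcome this limitation and learn CT volume representations that adapt to different input sizes.

Result: The model is trained with the Muon optimizer and evaluated on several downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval. The abstract reports no specific quantitative results or comparison with the state of the art.

Insight: The main contributions are: 1) treating a volume as a sequence of 3D chunks and handling the z-axis with Rotary Position Embeddings, which makes it an unconstrained "temporal" dimension and supports variable input sizes; 2) a chunk-wise alignment strategy with localized organ-level text which, compared with conditioning on the whole report, gives finer-grained supervision, strengthens the link between text and volume representations, and improves the precision of text-to-volume alignment.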

Abstract: Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimension. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input size. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.
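
The reason Rotary Position Embeddings adapt to any number of chunks along the z-axis can be seen in a small sketch. The code below is the generic rotary formulation, not SigVLP's implementation; the function name `rope`, the frequency base of 10000, and the pairing of adjacent feature dimensions are assumptions.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to pos.

    x must have an even length; pos is the chunk index along the z-axis.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(p * q for p, q in zip(a, b))
```

Because each pair is rotated, `dot(rope(q, m), rope(k, n))` depends only on the offset `m - n`, so the same attention weights apply whether a scan has 20 chunks or 200. No learned table of absolute positions has to be extended for longer volumes.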

Jinghan Li, Junfeng Fang, Jinda Lu, Yuan Wang, Xiaoyan Guo

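TL;DR: This paper proposes difficulty-aware group normalization (Durian) to address the instability that arises in reinforcement learning training of multimodal large language models because std-based normalization is sensitive to samples with extreme rewards. It defines sample difficulty through visual entropy and model confidence, groups samples accordingly, and normalizes within each group, preserving intra-group distinctions while improving performance on multimodal reasoning tasks.

Details

Motivation: Extending Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) to multimodal settings faces a key challenge: std-based normalization is unstable and easily distorted by samples with extremely positive or negative rewards. Multimodal models are especially sensitive to such distortion because both perceptual and reasoning errors affect their responses.

Result: The method yields significant performance gains on multiple multimodal reasoning benchmarks.

Insight: The core novelty is the notion of sample difficulty (perceptual complexity measured by visual entropy, reasoning uncertainty captured by model confidence), which is used to regroup training samples dynamically and normalize within each group. This keeps GRPO's intra-group relative comparison while removing the sensitivity to extreme samples, offering a new approach to stable training of multimodal models.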

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO’s intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.
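
The abstract above says Durian regroups samples by difficulty level and shares the standard deviation within each group. One possible reading is sketched below: rewards are still centered per prompt as in GRPO, but the scale comes from a std pooled over all prompts in the same difficulty bucket. The bucketing rule, the assumption that a difficulty score in [0, 1] is already available, and the function name `difficulty_aware_advantages` are illustrative guesses, not the paper's definition.

```python
from statistics import mean, pstdev

def difficulty_aware_advantages(rewards, difficulty, n_levels=2, eps=1e-6):
    """Compute advantages with a std shared across prompts of similar difficulty.

    rewards:    dict prompt_id -> list of rewards for that prompt's sampled responses
    difficulty: dict prompt_id -> score in [0, 1] (assumed precomputed, e.g. from
                visual entropy and model confidence)
    """
    # Center each prompt's rewards on its own mean, as in group-relative methods.
    centered = {p: [r - mean(rs) for r in rs] for p, rs in rewards.items()}

    # Assign each prompt to a difficulty bucket.
    buckets = {}
    for p in rewards:
        level = min(int(difficulty[p] * n_levels), n_levels - 1)
        buckets.setdefault(level, []).append(p)

    # Scale by the std pooled over the whole bucket instead of the per-prompt std.
    advantages = {}
    for prompts in buckets.values():
        pooled = [c for p in prompts for c in centered[p]]
        shared_std = pstdev(pooled)
        for p in prompts:
            advantages[p] = [c / (shared_std + eps) for c in centered[p]]
    return advantages
```

With a per-prompt std, a prompt whose rewards are almost all identical gets a tiny denominator and a blown-up advantage for its one outlier. Pooling the std over a bucket damps that effect while leaving the ordering of responses within each prompt unchanged.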

[61] Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling cs.CV

Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee

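TL;DR: This paper proposes a hybrid parallelism framework for accelerating inference in conditional diffusion models. It combines a new data-parallel strategy based on conditional guidance with a pipeline scheduling method that switches parallelism adaptively, aiming to reduce generation latency while keeping image quality high.

Details

Motivation: Current diffusion acceleration methods based on distributed parallelism produce noticeable generation artifacts and do not deliver speedups proportional to the number of GPUs, so a method is needed that reduces latency and preserves quality at the same time.

Result: On SDXL and SD3, the framework achieves 2.31× and 2.07× latency reductions, respectively, using two NVIDIA RTX 3090 GPUs while preserving image quality, and it outperforms existing acceleration methods under high-resolution synthesis settings.

Insight: The novelty is to treat the conditional and unconditional denoising paths as a new way to partition data, and to enable optimal pipeline parallelism adaptively according to the denoising discrepancy between the two paths. The framework applies to both U-Net-based diffusion models and DiT-based flow-matching architectures.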

Abstract: Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves 2.31× and 2.07× latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://github.com/kaist-dmlab/Hybridiff.
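
Idea (i) in the abstract above rests on the fact that, under classifier-free guidance, the conditional and unconditional predictions for a step are independent and can run on separate devices. The toy sketch below illustrates that split with two worker threads standing in for two GPUs. `toy_denoiser` is a made-up stand-in for a real network, and the discrepancy measure (maximum absolute difference) is an assumption, not the paper's criterion.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_denoiser(x, cond):
    """Stand-in for a noise-prediction network (NOT a real diffusion model)."""
    shift = 0.0 if cond is None else 0.1 * len(cond)
    return [0.5 * v + shift for v in x]

def guided_step(x, cond, scale, pool):
    """Run the conditional and unconditional paths in parallel, then combine them."""
    fut_cond = pool.submit(toy_denoiser, x, cond)    # would run on device 0
    fut_uncond = pool.submit(toy_denoiser, x, None)  # would run on device 1
    eps_c, eps_u = fut_cond.result(), fut_uncond.result()
    # Classifier-free guidance: eps = eps_u + scale * (eps_c - eps_u).
    eps = [u + scale * (c - u) for c, u in zip(eps_c, eps_u)]
    # Gap between the two paths; a scheduler could use it to switch parallel modes.
    discrepancy = max(abs(c - u) for c, u in zip(eps_c, eps_u))
    return eps, discrepancy
```

Usage: `with ThreadPoolExecutor(max_workers=2) as pool: eps, gap = guided_step([1.0, 2.0], "cat", 2.0, pool)`. A scheduler in the spirit of idea (ii) could compare `gap` against a threshold to decide when to switch parallelism modes, though the actual switching rule is not specified in the abstract.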

[62] From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors cs.CV

Liangbing Zhao, Le Zhuo, Sayak Paul, Hongsheng Li, Mohamed Elhoseiny

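TL;DR: This paper proposes PhysicEdit, a physics-aware image editing framework that addresses the lack of physical realism in existing instruction-based editing models when edits involve complex causal dynamics such as refraction or material deformation. It recasts editing as predicting physical state transitions and introduces the large-scale video dataset PhysicTran38K for supervision. The framework combines a frozen Qwen2.5-VL for physical reasoning with learnable transition queries that provide timestep-adaptive visual guidance.

Details

Motivation: Existing instruction-based image editing models do well on semantic alignment but often produce physically implausible results in editing scenarios with complex physical dynamics (e.g., refraction, material deformation). The cause is that the current paradigm treats editing as a discrete mapping between image pairs, which supplies only boundary conditions and leaves the transition dynamics unspecified.

Result: Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state of the art for open-source methods while remaining competitive with leading proprietary models.

Insight: The contributions include recasting physics-aware editing as the prediction of physical state transitions, building the large-scale video dataset PhysicTran38K for supervision, and designing a textual-visual dual-thinking mechanism that combines a frozen vision-language model for physical reasoning with learnable transition queries to strengthen dynamic guidance of the diffusion backbone.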

Abstract: Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.

[63] Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models cs.CV | cs.AI

Zheyuan Gu, Qingsong Zhao, Yusong Wang, Zhaohong Huang, Xinqi Li

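TL;DR: This paper proposes the Forensic Answer-Questioning (FAQ) benchmark for evaluating and strengthening the temporal reasoning of vision-language models in video deepfake detection. Through a multiple-choice task with three levels (facial perception, temporal deepfake grounding, forensic reasoning), the benchmark systematically measures how well models analyze dynamic forgery cues. The authors also generate a corresponding instruction-tuning dataset, FAQ-IT, and experiments show that models fine-tuned on it achieve advanced performance on both in-domain and cross-dataset detection benchmarks.

Details

Motivation: Current vision-language models for deepfake detection are good at spotting spatial artifacts but overlook temporal inconsistency, a key dimension of video forgeries. Getting VLMs to reason about these dynamic cues remains a distinct challenge.

Result: A range of VLMs is evaluated on the FAQ benchmark, and the instruction-tuning set FAQ-IT is generated. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of the key design choices.

Insight: The novelty is to systematize temporal deepfake analysis as a multiple-choice task with a progressive hierarchy of capabilities, and to build the corresponding benchmark and instruction-tuning set. This provides a new framework and data for evaluating and improving the forensic reasoning of VLMs on dynamic visual content.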

Abstract: Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.

[64] XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression cs.CV

Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang

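TL;DR: This paper proposes XStreamVGGT, a tuning-free method that integrates pruning and quantization to compress the key-value (KV) cache in Transformer models. It targets the memory consumption and inference latency caused by unbounded KV cache growth when streaming 3D visual geometry models such as StreamVGGT process multi-image and long-video inputs, enabling extremely memory-efficient streaming inference.

Details

Motivation: When Transformer-based streaming 3D visual geometry models (such as StreamVGGT) process long input sequences, the massive influx of vision tokens makes the KV cache grow without bound. Memory consumption and inference latency increase as a result, which limits scalability for long-horizon applications.

Result: Extensive evaluations show that XStreamVGGT reduces memory usage by 4.42× and accelerates inference by 5.48× with mostly negligible performance degradation, making practical, scalable streaming 3D applications possible.

Insight: The novelty is a tuning-free KV cache compression scheme that combines an efficient pruning mechanism based on token-importance identification (fully compatible with high-performance attention kernels such as FlashAttention) with dimension-adaptive quantization that exploits the inherent distribution patterns of KV tensors. The scheme compresses the cache systematically within a fixed KV memory budget while preserving numerical accuracy.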

Abstract: Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs generated from multi-frame inputs are initially pruned to conform to a fixed KV memory budget using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive KV quantization within the pruning pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42× and accelerating inference by 5.48×, enabling practical and scalable streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
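
The two compression stages named in the abstract above, pruning to a fixed budget and then quantizing what remains, can be sketched on plain Python lists. This illustrates the general pattern only: the importance scores are assumed to be given, and simple per-channel uniform quantization stands in for the paper's dimension-adaptive scheme, whose details are not described here. All function names are hypothetical.

```python
def prune_kv(keys, values, importance, budget):
    """Keep the `budget` most important tokens, preserving their original order."""
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep], keep

def quantize_per_channel(rows, bits=4):
    """Uniformly quantize each channel (column) with its own offset and scale."""
    levels = 2 ** bits - 1
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    # A constant channel has zero range; fall back to scale 1.0 to avoid dividing by 0.
    scale = [(max(c) - min(c)) / levels or 1.0 for c in cols]
    q = [[round((x - l) / s) for x, l, s in zip(row, lo, scale)] for row in rows]
    return q, lo, scale

def dequantize(q, lo, scale):
    """Recover approximate floating-point values from the integer codes."""
    return [[code * s + l for code, l, s in zip(row, lo, scale)] for row in q]
```

Pruning caps the number of cached tokens regardless of how many frames have streamed in, and quantization shrinks the bytes per retained token. Per-channel scaling matters when some channels have much larger ranges than others, because a single global scale would waste most quantization levels on the small-range channels.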

Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei

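TL;DR: SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. It adopts a dual-stream multimodal diffusion Transformer architecture in which one branch synthesizes video and the other generates temporally aligned audio, with both sharing a powerful text encoder based on a multimodal large language model. It accepts instructions in multiple modalities and, through an efficient joint generation strategy, produces high-fidelity, cinema-level video at up to 1080p resolution, 32 FPS, and 15 seconds in length.

Details

Motivation: Existing video generation models fall short in multimodal input, joint audio-video generation, and unified handling of generation, inpainting, and editing. The goal is a unified framework that accepts rich instructions and efficiently generates high-quality, long videos with synchronized audio.

Result: The abstract gives no specific quantitative benchmark results or comparison with the state of the art. It claims that the model achieves high-fidelity, multi-shot, cinema-level video generation, supports 1080p resolution, 32 FPS, and 15-second duration, and performs well in efficiency and generation quality.

Insight: The contributions are: 1) a dual-stream multimodal diffusion Transformer that handles audio and video generation in one model; 2) fine-grained multimodal instruction following under complex conditioning, enabled by a multimodal large language model; 3) a channel concatenation formulation that unifies inpainting-style tasks such as image-to-video, video extension, and video editing under a single interface; 4) an efficiency strategy that jointly generates low-resolution full sequences and high-resolution keyframes to support high-resolution long video generation.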

Abstract: SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multimodal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multimodal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing under a single interface, and naturally extends to vision-referenced inpainting and editing via multimodal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multimodal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.

[66] SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance cs.CV | cs.AI

Minghan Yang, Lan Yang, Ke Li, Honggang Zhang, Kaiyue Pang

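TL;DR: SemVideo is a novel fMRI-to-video reconstruction framework that aims to reconstruct dynamic visual experiences from brain activity. A hierarchical semantic guidance module called SemMiner extracts three levels of semantic cues from the original video stimulus (static anchor descriptions, motion-oriented narratives, and holistic summaries), and these cues guide the reconstruction. The framework has three key components, a Semantic Alignment Decoder, a Motion Adaptation Decoder, and a Conditional Video Render, which address the inconsistent appearance of salient objects and the poor temporal coherence of existing methods.

Details

Motivation: Current fMRI-based video reconstruction methods have two main shortcomings: inconsistent visual representations of salient objects across frames, which cause appearance mismatches, and poor temporal coherence, which causes motion misalignment or abrupt frame transitions. The paper aims to remove these limitations and reconstruct video more accurately and coherently.

Result: Experiments on the CC2017 and HCP datasets show that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state of the art for fMRI-to-video reconstruction.

Insight: The core novelty is hierarchical semantic guidance (SemMiner), which builds multi-level semantic cues, together with a new tripartite attention fusion architecture for motion adaptation decoding, improving both the semantic accuracy and the dynamic coherence of the reconstructed video. Systematically aligning high-level semantic information with neural signals is an idea that brain-computer interface and neural decoding work can borrow.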

Abstract: Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.

[67] Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps cs.CVPDF

Shan Wang, Peixia Li, Chenchen Xu, Ziang Cheng, Jiayu Yang

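TL;DR: The paper proposes Light-Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. On top of this representation, the authors build a unified pipeline for joint shadow generation and relighting, in contrast to prior methods that treat the two as separate tasks. Embedding LGI into a bridge-matching generative backbone reduces ambiguity and enforces physically consistent light-shadow reasoning. To enable effective training, the authors also construct the first large-scale benchmark dataset for joint shadow generation and relighting.

Details

Motivation: Without physical priors, existing generative models often produce floating shadows, inconsistent illumination, and implausible shadow geometry. The paper aims to fix this by constraining generative models with a physics-inspired prior that explicitly ties illumination direction to geometry.

Result: Experiments show significant gains in realism and consistency on both synthetic and real images.

Insight: The core innovation is the LGI representation, which reliably captures light-shadow interactions without full 3D reconstruction and gives generative models a physically grounded prior. A second key contribution is the first unified framework that handles shadow generation and relighting jointly, together with a matching large-scale benchmark dataset; this helps model the intrinsic coupling needed for indirect illumination effects.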

Abstract: We propose Light-Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. Unlike ray tracing, which requires full 3D reconstruction, LGI captures essential light-shadow interactions reliably and accurately, computed from off-the-shelf 2.5D depth map predictions. LGI explicitly ties illumination direction to geometry, providing a physics-inspired prior that constrains generative models. Without such a prior, these models often produce floating shadows, inconsistent illumination, and implausible shadow geometry. Building on this representation, we propose a unified pipeline for joint shadow generation and relighting (unlike prior methods that treat them as disjoint tasks), capturing the intrinsic coupling of illumination and shadowing essential for modeling indirect effects. By embedding LGI into a bridge-matching generative backbone, we reduce ambiguity and enforce physically consistent light-shadow reasoning. To enable effective training, we curated the first large-scale benchmark dataset for joint shadow generation and relighting, covering reflections, transparency, and complex interreflections. Experiments show significant gains in realism and consistency across synthetic and real images. LGI thus bridges geometry-inspired rendering with generative modeling, enabling efficient, physically consistent shadow generation and relighting.

[68] UniVBench: Towards Unified Evaluation for Video Foundation Models cs.CVPDF

Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang

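TL;DR: The paper presents UniVBench, a benchmark for unified evaluation of video foundation models across four core abilities: video understanding, generation, editing, and reconstruction. It contains 200 high-quality, diverse, multi-shot videos with detailed annotations, plus a unified agentic evaluation system (UniV-Eval) for standardized, scalable, and reproducible evaluation.

Details

Motivation: Existing video evaluation benchmarks are fragmented and limited in scope: they typically target a single task, use task-specific metrics, and rely on simple video clips, so they cannot fully assess the unified capabilities that video foundation models aim for.

Result: With a dataset of 200 high-quality, diverse, multi-shot videos and a unified evaluation system, UniVBench provides the first framework for evaluating the integrated capabilities of video foundation models.

Insight: The innovations are a unified multi-task video evaluation benchmark and a new task, video reconstruction, which measures how faithfully a model reproduces video content it has already seen. The unified agentic evaluation system (UniV-Eval) standardizes prompting, instruction parsing, and scoring, which supports fair and scalable model comparison.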

Abstract: Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.

[69] Understanding Annotation Error Propagation and Learning an Adaptive Policy for Expert Intervention in Barrett’s Video Segmentation cs.CV | cs.AIPDF

Lokesha Rasanjalee, Jin Lin Tan, Dileepa Pitawela, Rajvinder Singh, Hsiang-Ting Chen

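TL;DR: The paper tackles the difficulty of annotating irregular, poorly bounded regions such as Barrett’s esophagus dysplasia in endoscopic videos, and proposes L2RP, a framework that learns an adaptive policy for expert intervention. It systematically studies how annotation errors propagate under different prompt types (masks, boxes, and points) in semi-automatic tools such as SAM2, and balances annotation cost against segmentation accuracy by learning when and where expert input is needed.

Details

Motivation: Accurate annotation of endoscopic videos is essential but time-consuming, especially for challenging datasets such as Barrett’s esophagus dysplasia, where regions are irregular and boundaries are unclear. Semi-automatic tools such as SAM2 simplify the process by propagating annotations across frames, but small errors accumulate and reduce accuracy, so expert review and correction are still required.

Result: Experiments on a private Barrett’s dysplasia dataset and the public SUN-SEG benchmark show that the method improves temporal consistency and outperforms baseline strategies, balancing annotation effort against segmentation accuracy.

Insight: The innovations are a systematic study of how annotation errors propagate under different prompt types and a cost-aware Learning-to-Re-Prompt framework that adaptively decides when to bring in an expert, offering a new approach to optimizing human-machine collaboration in semi-automatic annotation tools.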

Abstract: Accurate annotation of endoscopic videos is essential yet time-consuming, particularly for challenging datasets such as dysplasia in Barrett’s esophagus, where the affected regions are irregular and lack clear boundaries. Semi-automatic tools like Segment Anything Model 2 (SAM2) can ease this process by propagating annotations across frames, but small errors often accumulate and reduce accuracy, requiring expert review and correction. To address this, we systematically study how annotation errors propagate across different prompt types, namely masks, boxes, and points, and propose Learning-to-Re-Prompt (L2RP), a cost-aware framework that learns when and where to seek expert input. By tuning a human-cost parameter, our method balances annotation effort and segmentation accuracy. Experiments on a private Barrett’s dysplasia dataset and the public SUN-SEG benchmark demonstrate improved temporal consistency and superior performance over baseline strategies.

[70] DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs cs.CV | cs.AI | cs.CL | cs.GRPDF

Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu

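TL;DR: The paper proposes DynamicGTR, a framework for improving vision-language model (VLM) performance on zero-shot graph question answering. Instead of using one fixed representation, it dynamically selects the graph topology representation (GTR) best suited to each query, giving a customizable trade-off between accuracy and brevity. Experiments show that DynamicGTR improves graph algorithm QA and also transfers experience learned on synthetic graph tasks to real-world applications such as link prediction and node classification without additional training.

Details

Motivation: Existing methods usually rely on a single graph topology representation, such as fixed-style visual images or unified text descriptions. This "one-size-fits-all" strategy ignores model-specific and task-specific preferences, producing inaccurate or overly long responses to graph-related queries. The paper addresses this challenge in how VLMs understand and process structured graph data.

Result: Extensive experiments show that DynamicGTR significantly improves zero-shot VLM performance on graph algorithm QA. Experience learned on synthetic graph algorithm tasks transfers to real-world applications such as link prediction and node classification without additional training, demonstrating strong transferability across tasks, domains, and models.

Insight: The core innovation is the strategy of dynamically selecting the GTR, which removes the limitation of a single fixed representation and adapts better to model and task preferences. More broadly, this dynamic adaptation mechanism offers a flexible, transferable way to improve VLM understanding and reasoning over structured graph data.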

Abstract: Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on a single graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or overly long responses to graph-related queries. To address this, we propose the DynamicGTR framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.

[71] GFPL: Generative Federated Prototype Learning for Resource-Constrained and Data-Imbalanced Vision Task cs.CV | cs.LGPDF

Shiwei Lu, Yuhang He, Jiashuo Li, Qiang Wang, Yihong Gong

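TL;DR: The paper proposes Generative Federated Prototype Learning (GFPL), a framework addressing two key challenges of federated learning on resource-constrained, data-imbalanced vision tasks: inefficient knowledge fusion caused by model updates biased toward majority-class features, and excessive communication overhead from frequently transmitting high-dimensional model parameters. GFPL captures class-wise feature statistics with a prototype generation method based on Gaussian mixture models, fuses semantically similar knowledge across clients with a prototype aggregation strategy based on Bhattacharyya distance, and generates pseudo-features from the fused prototypes to mitigate feature distribution imbalance across clients. It also uses a dual-classifier architecture, optimized with a hybrid loss combining Dot Regression and Cross-Entropy, to strengthen feature alignment during local training.

Details

Motivation: To solve two key problems federated learning faces in real-world deployment: ineffective knowledge fusion caused by model updates biased toward majority-class features, and prohibitive communication overhead from frequent transmission of high-dimensional model parameters.

Result: Extensive benchmark experiments show that GFPL improves model accuracy by 3.6% under imbalanced data settings while keeping communication cost low.

Insight: The innovations are GMM-based class prototype generation, prototype aggregation using Bhattacharyya distance, pseudo-feature generation from fused prototypes to mitigate feature imbalance, and a dual-classifier architecture trained with a combined Dot Regression and Cross-Entropy loss. Together these improve knowledge fusion efficiency and reduce communication overhead.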

Abstract: Federated learning (FL) facilitates the secure utilization of decentralized images, advancing applications in medical image recognition and autonomous driving. However, conventional FL faces two critical challenges in real-world deployment: ineffective knowledge fusion caused by model updates biased toward majority-class features, and prohibitive communication overhead due to frequent transmissions of high-dimensional model parameters. Inspired by the human brain’s efficiency in knowledge integration, we propose a novel Generative Federated Prototype Learning (GFPL) framework to address these issues. Within this framework, a prototype generation method based on Gaussian Mixture Model (GMM) captures the statistical information of class-wise features, while a prototype aggregation strategy using Bhattacharyya distance effectively fuses semantically similar knowledge across clients. In addition, these fused prototypes are leveraged to generate pseudo-features, thereby mitigating feature distribution imbalance across clients. To further enhance feature alignment during local training, we devise a dual-classifier architecture, optimized via a hybrid loss combining Dot Regression and Cross-Entropy. Extensive experiments on benchmarks show that GFPL improves model accuracy by 3.6% under imbalanced data settings while maintaining low communication cost.
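
The prototype aggregation step compares class-wise Gaussian components by Bhattacharyya distance. As a minimal sketch, assuming each prototype is a single Gaussian with a diagonal covariance (the abstract does not state the covariance structure, and the function name `bhattacharyya_diag` is hypothetical), the closed-form distance could be computed like this:

```python
import math


def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Uses the closed form
        D_B = 1/8 * d^T S^-1 d + 1/2 * ln(|S| / sqrt(|S1| * |S2|)),
    with d = mu1 - mu2 and S = (S1 + S2) / 2, evaluated per dimension.
    The diagonal-covariance restriction is an assumption of this sketch.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                      # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v       # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))  # covariance-mismatch term
    return dist


# Two client prototypes for the same class: a small distance suggests
# they describe semantically similar features and could be fused.
d = bhattacharyya_diag([0.0, 1.0], [1.0, 2.0], [0.1, 0.9], [1.1, 1.9])
```

A server could fuse prototypes whose pairwise distance falls below a threshold; how GFPL actually chooses which prototypes to merge is not described in the abstract.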

[72] How to Take a Memorable Picture? Empowering Users with Actionable Feedback cs.CVPDF

Francesco Laiti, Davide Talon, Jacopo Staiano, Elisa Ricci

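TL;DR: The paper introduces the Memorability Feedback (MemFeed) task, which aims to give users actionable, interpretable guidance for making photos more memorable, and presents MemCoach, the first training-free method for it based on multimodal large language models (MLLMs); it uses a teacher-student steering strategy to generate natural-language suggestions (e.g., "emphasize facial expression"). The paper also builds the MemBench benchmark for evaluation, and experiments show that MemCoach outperforms zero-shot models across several MLLMs.

Details

Motivation: Traditional image memorability research focuses only on passive prediction or generative modification and cannot guide users at capture time, so a system that provides actionable feedback is needed to help users take more memorable photos.

Result: On the MemBench benchmark, MemCoach performs well across multiple MLLMs and consistently outperforms several zero-shot models, demonstrating its effectiveness.

Insight: The innovation is shifting memorability research from passive prediction to an actionable feedback task, using a training-free teacher-student steering strategy that aligns model activations to produce concrete natural-language suggestions. This offers a new paradigm for AI-assisted creation that emphasizes teaching rather than prediction alone.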

Abstract: Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image’s likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo’s memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal of enhancing an image’s future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., “emphasize facial expression,” “bring the subject forward”). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model’s internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators.

[73] UNet-Based Keypoint Regression for 3D Cone Localization in Autonomous Racing cs.CV | cs.ROPDF

Mariia Baidachna, James Carty, Aidan Ferguson, Joseph Agrane, Varad Kulkarni

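TL;DR: The paper proposes a UNet-based neural network that detects keypoints on cones in autonomous racing scenes to enable accurate 3D cone localization. Trained on a large custom-labeled dataset, the method accurately estimates cone positions, offers the potential for color prediction, and is evaluated end to end within the perception pipeline.

Details

Motivation: To solve accurate 3D cone localization in autonomous racing: traditional computer vision methods are sensitive to environmental variation, while existing neural network approaches are often limited by scarce training data and are hard to run in real time.

Result: The method substantially improves keypoint detection accuracy over conventional methods and performs strongly on all metrics of the end-to-end autonomous system, showing its potential for competitive autonomous racing systems.

Insight: The innovation is applying the UNet architecture to cone keypoint regression and training it on a large custom dataset, achieving accurate 3D localization with potential for real-time use; integrating color prediction makes the perception system more practical.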

Abstract: Accurate cone localization in 3D space is essential in autonomous racing for precise navigation around the track. Approaches that rely on traditional computer vision algorithms are sensitive to environmental variations, and neural networks are often trained on limited data and are infeasible to run in real time. We present a UNet-based neural network for keypoint detection on cones, leveraging the largest custom-labeled dataset we have assembled. Our approach enables accurate cone position estimation and the potential for color prediction. Our model achieves substantial improvements in keypoint accuracy over conventional methods. Furthermore, we leverage our predicted keypoints in the perception pipeline and evaluate the end-to-end autonomous system. Our results show high-quality performance across all metrics, highlighting the effectiveness of this approach and its potential for adoption in competitive autonomous racing systems.

[74] TIRAuxCloud: A Thermal Infrared Dataset for Day and Night Cloud Detection cs.CVPDF

Alexis Apostolakis, Vasileios Botsos, Niklas Wölki, Andrea Spichtinger, Nikolaos Ioannis Bountos

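TL;DR: The paper introduces TIRAuxCloud, a multi-modal dataset centered on thermal infrared spectral data, designed to support cloud detection by day and by night. It combines multispectral data from Landsat and VIIRS (thermal infrared, optical, and near-infrared bands) with auxiliary layers, including elevation, land cover, meteorological variables, and cloud-free reference images, to reduce surface-cloud confusion and cloud formation uncertainty.

Details

Motivation: Clouds are a major obstacle in Earth observation, limiting the usability and reliability of critical remote sensing applications such as fire response, urban heat island monitoring, and snow and ice cover mapping, so round-the-clock cloud detection is essential. Visible and near-infrared bands work well for daytime cloud detection but depend on solar illumination and are unsuitable at night. Thermal infrared imagery is key to nighttime cloud detection, but accurate detection remains difficult because of limited spectral information and lower spatial resolution.

Result: Performance baselines are established through supervised and transfer learning, demonstrating the dataset’s value for developing new day and night cloud detection methods.

Insight: The innovations are combining multi-modal data (thermal infrared, optical, near-infrared, and auxiliary information) to reduce uncertainty; providing samples with automated cloud masks plus a manually annotated subset to address label scarcity; and focusing on round-the-clock cloud detection, especially at night, which fills a gap in existing datasets.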

Abstract: Clouds are a major obstacle in Earth observation, limiting the usability and reliability of critical remote sensing applications such as fire disaster response, urban heat island monitoring, and snow and ice cover mapping. Therefore, the ability to detect clouds 24/7 is of paramount importance. While visible and near-infrared bands are effective for daytime cloud detection, their dependence on solar illumination makes them unsuitable for nighttime monitoring. In contrast, thermal infrared (TIR) imagery plays a crucial role in detecting clouds at night, when sunlight is absent. Due to their generally lower temperatures, clouds emit distinct thermal signatures that are detectable in TIR bands. Despite this, accurate nighttime cloud detection remains challenging due to limited spectral information and the typically lower spatial resolution of TIR imagery. To address these challenges, we present TIRAuxCloud, a multi-modal dataset centered around thermal spectral data to facilitate cloud segmentation under both daytime and nighttime conditions. The dataset comprises a unique combination of multispectral data (TIR, optical, and near-infrared bands) from Landsat and VIIRS, aligned with auxiliary information layers. Elevation, land cover, meteorological variables, and cloud-free reference images are included to help reduce surface-cloud ambiguity and cloud formation uncertainty. To overcome the scarcity of manual cloud labels, we include a large set of samples with automated cloud masks and a smaller manually annotated subset to further evaluate and improve models. Comprehensive benchmarks are presented to establish performance baselines through supervised and transfer learning, demonstrating the dataset’s value in advancing the development of innovative methods for day and night time cloud detection.

[75] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors cs.CV | cs.AI | cs.CLPDF

Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang

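TL;DR: The paper proposes NoLan, a training-free framework for mitigating object hallucination in large vision-language models (LVLMs). It refines the output distribution by dynamically suppressing language priors, with the suppression modulated by the difference between the output distributions for multimodal and text-only inputs. Experiments show that NoLan effectively reduces object hallucination across various LVLMs and tasks.

Details

Motivation: To address object hallucination in LVLMs, where the output mentions objects that do not appear in the input image. The study investigates whether hallucination stems mainly from the vision encoder or the language decoder, and designs a solution based on the findings.

Result: On the POPE benchmark, NoLan substantially improves performance, raising the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21 points, respectively.

Insight: The innovation is showing through systematic experiments that object hallucination is mainly associated with strong priors from the language decoder, and proposing a training-free dynamic suppression method on that basis. The method uses the distribution difference between multimodal and text-only outputs to modulate how strongly language priors are suppressed, a simple and general post-processing strategy.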

Abstract: Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.
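
The idea of suppressing language priors at decoding time can be illustrated with a small contrastive-style sketch. The abstract does not give NoLan's formula, so everything below is an assumption made for illustration: the adjustment rule `logit_mm + alpha * (logit_mm - logit_text)`, the use of KL divergence to set `alpha`, and all function names are hypothetical and should not be read as the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def suppress_language_prior(mm_logits, text_logits, base_alpha=1.0):
    """Push next-token logits away from what the text-only pass predicts.

    mm_logits:   logits from the image + text input
    text_logits: logits from the same prompt without the image
    The suppression strength grows with the divergence between the two
    distributions (a hypothetical modulation rule for this sketch).
    """
    p_mm, p_text = softmax(mm_logits), softmax(text_logits)
    alpha = base_alpha * kl_divergence(p_mm, p_text)
    adjusted = [lm + alpha * (lm - lt) for lm, lt in zip(mm_logits, text_logits)]
    return softmax(adjusted)


# Token 0 is supported by the image; token 2 is favored by the text-only prior.
probs = suppress_language_prior([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

In this toy case the token that only the language prior favors loses probability mass, while the visually supported token gains it. When the two passes agree exactly, the divergence is zero and the output distribution is unchanged.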

[76] Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration cs.CVPDF

Chen Wu, Ling Wang, Zhuoran Zheng, Yuning Cui, Zhixiong Yang

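TL;DR: The paper proposes C$^2$SSM, a visual state space model for ultra-high-definition (UHD) image restoration that significantly improves computational efficiency by moving from pixel-by-pixel scanning to a new cluster-scanning paradigm.

Details

Motivation: Existing UHD image restoration models are bound to pixel-wise operations and are too expensive computationally. State space models such as Mamba have linear complexity, but their pixel-serial scanning is still a bottleneck for images with millions of pixels. The paper asks whether every pixel must be processed to understand an image.

Result: C$^2$SSM achieves new state-of-the-art (SOTA) results on five UHD restoration tasks while greatly reducing computational cost.

Insight: The innovation is distilling an image’s feature distribution into a sparse set of semantic cluster centers and performing efficient global modeling through a dual-path process (scan the cluster centers first, then diffuse global context back to the pixels), offering a "scan clusters, not pixels" approach for large-scale vision tasks.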

Abstract: Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces C$^2$SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. Our core discovery is that the rich feature distribution of a UHD image can be distilled into a sparse set of semantic centroids via a neural-parameterized mixture model. C$^2$SSM leverages this to reformulate global modeling into a novel dual-path process: it scans and reasons over a handful of cluster centers, then diffuses the global context back to all pixels through a principled similarity distribution, all while a lightweight modulator preserves fine details. This cluster-centric paradigm achieves a decisive leap in efficiency, slashing computational costs while establishing new state-of-the-art results across five UHD restoration tasks. More than a solution, C$^2$SSM charts a new course for efficient large-scale vision: scan clusters, not pixels.
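
The dual-path process described above (scan a few cluster centers, then diffuse context back to every pixel) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the C$^2$SSM architecture: the centroids are taken as given rather than produced by the neural-parameterized mixture model, a plain linear recurrence stands in for the state space scan, and all function names are hypothetical.

```python
import math


def soft_assign(pixels, centroids, temperature=1.0):
    """Similarity distribution: softmax over negative squared distances."""
    weights = []
    for p in pixels:
        scores = [-sum((a - b) ** 2 for a, b in zip(p, c)) / temperature
                  for c in centroids]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def scan_centroids(centroids, decay=0.5):
    """Stand-in for the SSM scan: h_k = decay * h_{k-1} + c_k over K centers."""
    state = [0.0] * len(centroids[0])
    outputs = []
    for c in centroids:
        state = [decay * s + x for s, x in zip(state, c)]
        outputs.append(list(state))
    return outputs


def diffuse_to_pixels(weights, context):
    """Each pixel receives a similarity-weighted mix of scanned cluster states."""
    dim = len(context[0])
    return [[sum(w[k] * context[k][d] for k in range(len(context)))
             for d in range(dim)]
            for w in weights]


pixels = [[1.0, 1.0], [10.0, 10.0], [1.2, 0.9]]   # N pixel features
centroids = [[1.0, 1.0], [10.0, 10.0]]            # K << N cluster centers
context = scan_centroids(centroids)               # sequential work scales with K
pixel_context = diffuse_to_pixels(soft_assign(pixels, centroids), context)
```

The point of the sketch is the cost structure: the sequential recurrence runs over K cluster centers instead of N pixels, and the per-pixel work is a parallelizable weighted sum.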

[77] Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context cs.CVPDF

JiaKui Hu, Jialun Liu, Liying Yang, Xinliang Zhang, Kaiwen Li

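TL;DR: The paper proposes "geometry-as-context", a method for improving scene-consistent video generation. Using an autoregressive camera-controlled video generation model, it iteratively estimates the geometry of the current view for 3D reconstruction, then simulates and restores novel-view images rendered from the 3D scene, which avoids error accumulation during inference.

Details

Motivation: To address the inference-time error accumulation in existing scene-consistent video generation methods, which rely on external memory or on iterative 3D reconstruction and inpainting, and which involve non-differentiable processes and separate models.

Result: Tested on scene video generation with one-direction and back-and-forth camera trajectories, the method outperforms previous approaches in maintaining scene consistency and camera control.

Insight: The innovation is treating geometry as context: a multi-task framework with a camera gated attention module makes effective use of camera poses, and randomly dropping the geometry context during training ensures the model can generate RGB-only output at inference, yielding an end-to-end learnable pipeline with better consistency.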

Abstract: Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce "geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model’s capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and back-and-forth trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.

[78] A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography cs.CV | cs.AIPDF

Mahmut S. Gokmen, Moneera N. Haque, Steve W. Leung, Caroline N. Leach, Seth Parker

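TL;DR: The paper proposes an automated framework for coronary artery calcium detection and lesion-specific Agatston scoring on both gated and non-gated CT scans. Its core is CARD-ViT, a Vision Transformer trained with DINO self-supervision on gated CT data only, which generalizes across domains without any non-gated training data.

Details

Motivation: Coronary artery calcium scoring is a key predictor of cardiovascular risk, but it traditionally relies on ECG-gated CT scans, which limits its use outside specialized cardiac imaging settings. The paper addresses how to use routine non-gated chest CT for scalable cardiovascular screening without additional scans or annotations.

Result: On the Stanford non-gated dataset, the framework achieves 0.707 accuracy and a Cohen’s kappa of 0.528, matching models trained directly on non-gated scans. On gated test sets, it achieves 0.910 accuracy with Cohen’s kappa scores of 0.871 and 0.874 across independent datasets, demonstrating robust risk stratification.

Insight: The main innovation is training a Vision Transformer (CARD-ViT) with self-supervised learning (DINO) on a single domain (gated CT) and achieving effective zero-shot generalization to an unseen domain (non-gated CT). This offers a practical path for extending specialized cardiac assessment tools to routine clinical imaging while reducing dependence on data annotation and acquisition.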

Abstract: Coronary artery calcium (CAC) scoring is a key predictor of cardiovascular risk, but it relies on ECG-gated CT scans, restricting its use to specialized cardiac imaging settings. We introduce an automated framework for CAC detection and lesion-specific Agatston scoring that operates across both gated and non-gated CT scans. At its core is CARD-ViT, a self-supervised Vision Transformer trained exclusively on gated CT data using DINO. Without any non-gated training data, our framework achieves 0.707 accuracy and a Cohen’s kappa of 0.528 on the Stanford non-gated dataset, matching models trained directly on non-gated scans. On gated test sets, the framework achieves 0.910 accuracy with Cohen’s kappa scores of 0.871 and 0.874 across independent datasets, demonstrating robust risk stratification. These results demonstrate the feasibility of cross-domain CAC scoring from gated to non-gated domains, supporting scalable cardiovascular screening in routine chest imaging without additional scans or annotations.
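
For readers unfamiliar with Agatston scoring, the conventional rule weights each calcified lesion's area by its peak attenuation and sums over lesions and slices. The sketch below shows only that standard arithmetic, assuming lesions have already been detected and measured; it is not the paper's pipeline, the function names are hypothetical, and details such as minimum-lesion-size filtering and slice-thickness conventions are omitted.

```python
def density_weight(peak_hu):
    """Conventional Agatston density factor from peak attenuation (HU)."""
    if peak_hu < 130:
        return 0   # below the calcium threshold: not counted
    if peak_hu < 200:
        return 1
    if peak_hu < 300:
        return 2
    if peak_hu < 400:
        return 3
    return 4


def agatston_score(lesions):
    """Sum of area (mm^2) times density factor over per-slice lesions.

    lesions: iterable of (area_mm2, peak_hu) pairs, one per lesion per slice.
    """
    return sum(area * density_weight(peak_hu) for area, peak_hu in lesions)


# Three candidate regions; the last one is below 130 HU and contributes nothing.
score = agatston_score([(4.0, 150), (2.5, 420), (3.0, 100)])  # 4*1 + 2.5*4 + 0
```

Because the score is a per-lesion sum, it naturally supports the lesion-specific reporting the abstract mentions.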

[79] Mobile-Ready Automated Triage of Diabetic Retinopathy Using Digital Fundus Images cs.CVPDF

Aadi Joshi, Manav S. Sharma, Vijay Uttam Rathod, Ashlesha Sawant, Prajakta Musale

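TL;DR: The paper proposes a lightweight deep learning framework for automatically assessing diabetic retinopathy (DR) severity from digital fundus images. It is based on the MobileNetV3 architecture with a CORAL head that models the ordered progression of the disease while staying computationally efficient for resource-constrained environments. The model is trained and validated on a combined APTOS 2019 and IDRiD dataset, with preprocessing (circular cropping and illumination normalization) and extensive experiments (3-fold cross-validation and ablation studies) showing strong performance. Model calibration and mobile-device optimization address practical deployment challenges, providing a scalable and practical tool for early DR screening.

Details

Motivation: DR is a major cause of vision impairment worldwide, but manual diagnosis is time-consuming and error-prone, which delays screening. An efficient, automated way to assess DR severity is therefore needed, particularly for fast screening in resource-constrained settings.

Result: On the combined APTOS 2019 and IDRiD dataset, with 3-fold cross-validation and ablation studies, the model achieves a Quadratic Weighted Kappa (QWK) of 0.9019 and an accuracy of 80.03%.

Insight: The innovations are combining MobileNetV3 with a CORAL head to model the ordered progression of DR while balancing model size and accuracy; improving data quality through preprocessing such as circular cropping and illumination normalization; and, for deployment, calibrating the model to reduce overconfidence and optimizing it for mobile devices, giving a reusable recipe for automated DR screening in resource-constrained environments.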

Abstract: Diabetic Retinopathy (DR) is a major cause of vision impairment worldwide. However, manual diagnosis is often time-consuming and prone to errors, leading to delays in screening. This paper presents a lightweight automated deep learning framework for efficient assessment of DR severity from digital fundus images. We use a MobileNetV3 architecture with a Consistent Rank Logits (CORAL) head to model the ordered progression of disease while maintaining computational efficiency for resource-constrained environments. The model is trained and validated on a combined dataset of APTOS 2019 and IDRiD images using a preprocessing pipeline including circular cropping and illumination normalization. Extensive experiments including 3-fold cross-validation and ablation studies demonstrate strong performance. The model achieves a Quadratic Weighted Kappa (QWK) score of 0.9019 and an accuracy of 80.03 percent. Additionally, we address real-world deployment challenges through model calibration to reduce overconfidence and optimization for mobile devices. The proposed system provides a scalable and practical tool for early-stage diabetic retinopathy screening.
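
Two pieces of this abstract are easy to make concrete: how a CORAL-style head turns K-1 binary "grade exceeds k" outputs into one ordinal grade, and how Quadratic Weighted Kappa is computed. The sketch below is a generic illustration of both, assuming integer grades 0 to K-1; it is not the authors' code, and the function names are hypothetical.

```python
def coral_decode(threshold_logits):
    """Ordinal grade = number of 'grade > k' outputs with probability > 0.5.

    sigmoid(z) > 0.5 is equivalent to z > 0, so the logits are compared to 0.
    """
    return sum(1 for z in threshold_logits if z > 0)


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), w = (i-j)^2 / (K-1)^2.

    Undefined (division by zero) if all labels and predictions share one class.
    """
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(observed[i]) for i in range(num_classes)]
    col = [sum(observed[i][j] for i in range(num_classes))
           for j in range(num_classes)]
    numerator = denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * row[i] * col[j] / n  # chance agreement
    return 1.0 - numerator / denominator


grade = coral_decode([3.0, 1.2, -0.5, -2.0])  # two thresholds passed
kappa = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 3], 5)
```

The quadratic weight is why QWK suits DR grading: predicting grade 3 for a true grade 4 is penalized far less than predicting grade 0, which a plain accuracy figure cannot express.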

[80] Learning to Fuse and Reconstruct Multi-View Graphs for Diabetic Retinopathy Grading cs.CVPDF

Haoran Li, Yuxin Lin, Huan Wang, Xiaoling Luo, Qi Zhu

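TL;DR: The paper proposes MVGFDR, an end-to-end multi-view graph fusion framework for diabetic retinopathy (DR) grading. Its novel Multi-View Graph Fusion (MVGF) module explicitly disentangles shared and view-specific visual features to better exploit correlations between multi-view fundus images and improve grading accuracy.

Details

Motivation: Existing methods often ignore inter-view correlations when fusing multi-view fundus images and fail to fully exploit the inherent consistency across views of the same patient. The paper aims to use multi-view information more effectively for DR grading.

Result: Extensive experiments on MFIDDR, the largest multi-view fundus image dataset to date, show that the method outperforms existing state-of-the-art (SOTA) methods on DR grading.

Insight: The innovation is a structured multi-view graph fusion module with three key components: multi-view graph initialization, multi-view graph fusion based on frequency-domain relevance, and masked cross-view reconstruction. It explicitly models shared and view-specific information and promotes view-invariant representation learning.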

Abstract: Diabetic retinopathy (DR) is one of the leading causes of vision loss worldwide, making early and accurate DR grading critical for timely intervention. Recent clinical practices leverage multi-view fundus images for DR detection with a wide coverage of the field of view (FOV), motivating deep learning methods to explore the potential of multi-view learning for DR grading. However, existing methods often overlook the inter-view correlations when fusing multi-view fundus images, failing to fully exploit the inherent consistency across views originating from the same patient. In this work, we present MVGFDR, an end-to-end Multi-View Graph Fusion framework for DR grading. Different from existing methods that directly fuse visual features from multiple views, MVGFDR is equipped with a novel Multi-View Graph Fusion (MVGF) module to explicitly disentangle the shared and view-specific visual features. Specifically, MVGF comprises three key components: (1) Multi-view Graph Initialization, which constructs visual graphs via residual-guided connections and employs Discrete Cosine Transform (DCT) coefficients as frequency-domain anchors; (2) Multi-view Graph Fusion, which integrates selective nodes across multi-view graphs based on frequency-domain relevance to capture complementary view-specific information; and (3) Masked Cross-view Reconstruction, which leverages masked reconstruction of shared information across views to facilitate view-invariant representation learning. Extensive experimental results on MFIDDR, by far the largest multi-view fundus image dataset, demonstrate the superiority of our proposed approach over existing state-of-the-art approaches in diabetic retinopathy grading.

[81] MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving cs.CVPDF

Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai

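TL;DR: MindDriver proposes a progressive multimodal reasoning framework for end-to-end autonomous driving. Imitating human-like progressive thinking, it combines semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning to address the weaknesses of the reasoning strategies that existing vision-language models use for driving.

Details

Motivation: The chain-of-thought reasoning that existing vision-language models use for autonomous driving leaves a gap between the text semantic space and the trajectory physical space. Recent methods that use future images as the reasoning process lack clear planning-oriented objective guidance, so the generated images do not evolve the scene accurately.

Result: MindDriver shows superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation.

Insight: The innovations are the progressive multimodal reasoning framework, a feedback-guided automatic data annotation pipeline that generates aligned multimodal reasoning training data, and a progressive reinforcement fine-tuning method that optimizes alignment through progressive high-level reward-based learning, improving the reasoning accuracy and planning ability of the driving system.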

Abstract: Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), the reasoning strategy most widely used by VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the text semantic space and the trajectory physical space. Although a recent approach uses future images instead of text as the CoT process, it lacks clear planning-oriented objective guidance for generating images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables VLMs to imitate human-like progressive thinking for autonomous driving. MindDriver performs semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method to optimize the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Code is available at https://github.com/hotdogcheesewhite/MindDriver.

[82] Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation cs.CVPDF

Junxin Lu, Tengfei Song, Zhanglin Wu, Pengfei Li, Xiaowei Liang

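TL;DR: The paper proposes GLoTran, a global-local dual visual perception framework for text image machine translation (TIMT) based on multimodal large language models (MLLMs). Under an instruction-guided alignment strategy, it integrates a low-resolution global image with multi-scale region-level text image slices to address text omission, semantic drift, and contextual inconsistency when translating high-resolution, text-rich images.

Details

Motivation: Existing TIMT methods, whether cascaded pipelines or end-to-end MLLMs, perform poorly on high-resolution, text-rich images because of cluttered layouts, diverse fonts, and non-textual distractions, leading to incomplete translations and semantic errors.

Result: Extensive experiments show that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution, text-rich conditions.

Insight: The innovations are the global-local dual perception paradigm, which uses instruction-guided alignment to integrate global context with local text detail, and GLoD, a large-scale dataset built to support the paradigm and to fill the gap in existing data for high-resolution, text-rich scenes.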

Abstract: Text Image Machine Translation (TIMT) aims to translate text embedded in images from the source language into the target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.

[83] Global-Aware Edge Prioritization for Pose Graph Initialization cs.CVPDF

Tong Wei, Giorgos Tolias, Jiri Matas, Daniel Barath

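TL;DR: This paper proposes a global-aware edge prioritization method for pose graph initialization in Structure-from-Motion (SfM). A graph neural network predicts globally consistent edge reliability, and the pose graph is built from multiple minimal spanning trees, yielding more reliable and compact pose graphs and higher reconstruction accuracy in sparse and high-speed settings.

Details

Motivation: Existing SfM pipelines rely on image retrieval to connect each image to its k nearest neighbors, treating image pairs independently and ignoring global consistency, which limits the quality of pose graph initialization. This paper addresses that limitation.

Result: The method outperforms state-of-the-art retrieval methods on ambiguous scenes and improves reconstruction accuracy in sparse and high-speed settings.

Insight: The novelty lies in the concept of edge prioritization and the integration of three components: a GNN trained with SfM-derived supervision to predict edge reliability, graph construction based on multiple minimal spanning trees, and connectivity-aware score modulation. Together they give SfM graph initialization a globally consistent perspective.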

Abstract: The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. The code and trained models are available at https://github.com/weitong8591/global_edge_prior.
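
The abstract does not spell out how the multi-minimal-spanning-tree construction works, so the following is only a minimal sketch of one plausible reading: repeatedly extract a spanning tree from the edges not yet selected, always preferring the most reliable edge, and take the union of the trees. The function name `multi_mst_edges`, the score dictionary, and the `num_trees` parameter are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def _find(parent: List[int], x: int) -> int:
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def multi_mst_edges(num_nodes: int, scores: Dict[Edge, float], num_trees: int) -> Set[Edge]:
    """Union of up to `num_trees` edge-disjoint spanning trees (or forests).

    `scores` maps an undirected edge (u, v) to a predicted reliability.
    Picking the highest score first is Kruskal's algorithm on the cost
    (1 - score), i.e. a minimum spanning tree over unreliability.
    This is an illustrative assumption, not the authors' procedure.
    """
    remaining = dict(scores)  # copy so the caller's dict is not modified
    selected: Set[Edge] = set()
    for _ in range(num_trees):
        parent = list(range(num_nodes))
        tree: List[Edge] = []
        for (u, v), _score in sorted(remaining.items(), key=lambda kv: -kv[1]):
            ru, rv = _find(parent, u), _find(parent, v)
            if ru != rv:  # edge joins two components, so keep it
                parent[ru] = rv
                tree.append((u, v))
        if not tree:  # no usable edges left
            break
        for edge in tree:
            selected.add(edge)
            del remaining[edge]
    return selected
```

With `num_trees=1` this reduces to a single spanning tree over the most reliable edges; larger values add redundant connections at the cost of more geometric verification.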

[84] PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning cs.CVPDF

Zekai Lin, Xu Zheng

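TL;DR: This paper presents PanoEnv, a large-scale VQA benchmark for evaluating and improving the 3D spatial reasoning of vision-language models on 360-degree panoramic images. The benchmark contains 14.8K questions across five categories, built from synthetic 3D environments. Evaluation shows that existing SOTA VLMs perform poorly (49.34% overall accuracy). The authors therefore propose a two-stage curriculum post-training framework based on reinforcement learning (GRPO) with geometry-aware rewards; their 7B model reaches a new SOTA on PanoEnv (52.93% overall accuracy).

Details

Motivation: Current vision-language models struggle with 3D spatial reasoning on equirectangular projection panoramas because of geometric distortion and limited 3D supervision, so dedicated benchmarks and methods are needed to improve their 3D spatial intelligence in panoramic environments.

Result: On the proposed PanoEnv benchmark, the authors' 7B model sets a new SOTA: 52.93% overall accuracy (+3.59%) and 14.83% open-ended accuracy, and it surpasses 32B models on semantic evaluation scores (Q-Score 6.24, P-Score 5.95).

Insight: Contributions include: 1) a large-scale panoramic VQA benchmark with accurate 3D annotations (depth, segmentation, bounding boxes); 2) a GRPO-based reinforcement learning reward that combines several geometry-aware strategies (such as distance tolerance and spatial consistency); 3) a two-stage curriculum (structured tasks first, then mixed open-ended data) that mitigates catastrophic forgetting and improves generalization.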

Abstract: 360 panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59%) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
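
GRPO scores each sampled answer relative to the other answers drawn for the same question, normalizing rewards by the group mean and standard deviation. The sketch below illustrates that normalization together with a hypothetical distance-tolerance reward; the 10% tolerance, the binary reward, and the function names are assumptions for illustration, not the paper's reward design.

```python
import math
from typing import List


def distance_reward(pred: float, target: float, rel_tol: float = 0.1) -> float:
    """Hypothetical distance-tolerance reward.

    Returns 1.0 when the predicted distance is within `rel_tol` (relative)
    of the ground truth, else 0.0. The tolerance value is an assumption.
    """
    return 1.0 if abs(pred - target) <= rel_tol * abs(target) else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r_i - mean) / (std + eps) within one group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one question whose true distance is 2.0 m.
predictions = [2.05, 3.0, 1.9, 0.5]
rewards = [distance_reward(p, 2.0) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Answers inside the tolerance band receive positive advantages and the others negative ones; if every answer in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.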

[85] RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations cs.CVPDF

I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang

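TL;DR: This paper proposes RobustVisRAG, a causality-guided dual-path framework that addresses the performance drop of vision-based retrieval-augmented generation (VisRAG) models when input images are degraded by blur, noise, low light, and similar distortions. A non-causal path captures degradation signals and a causal path learns purified semantic representations, enabling stable retrieval and generation under challenging visual conditions.

Details

Motivation: Existing VisRAG models degrade noticeably when visual inputs are distorted, because pretrained visual encoders entangle semantic information with degradation factors, causing errors in both the retrieval and generation stages. This paper aims to improve VisRAG robustness under visual degradation.

Result: On the proposed Distortion-VisRAG benchmark, a large-scale dataset of synthetic and real-world degraded documents, RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, under real-world degradations, while maintaining comparable accuracy on clean inputs.

Insight: The core novelty is the causality-guided dual-path framework: the non-causal path models degradation signals, which then guide the causal path toward purified semantics, disentangling semantics from degradation factors. This offers a new approach to making multimodal models robust under non-ideal visual conditions.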

Abstract: Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.

[86] Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments cs.CVPDF

Shuang Song, Debao Huang, Deyan Deng, Haolin Xiong, Yang Tang

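TL;DR: This paper introduces Olbedo, a large-scale aerial dataset for intrinsic image decomposition (IID) of outdoor scenes. It contains 5,664 UAV images with multi-view consistent albedo, shading maps, depth, normals, illumination components, and other annotations generated by an inverse-rendering pipeline. The dataset lets diffusion-based IID models generalize to real outdoor imagery, significantly improves single-view albedo prediction on the MatrixCity benchmark, and supports applications such as relighting of 3D assets and material editing.

Details

Motivation: Intrinsic image decomposition of outdoor scenes is essential for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision.

Result: On the MatrixCity benchmark, models fine-tuned on Olbedo significantly improve single-view outdoor albedo prediction, reaching state-of-the-art performance.

Insight: The novelty lies in building a large-scale, multi-view consistent real-world outdoor aerial dataset with varied illumination conditions and detailed annotations, and in generating high-quality supervision through an inverse-rendering pipeline. This addresses the scarcity of outdoor IID data and helps models generalize from synthetic indoor data to real outdoor scenes.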

Abstract: Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce Olbedo, a large-scale aerial dataset for outdoor albedo–shading decomposition in the wild. Olbedo contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each view is accompanied by multi-view consistent albedo and shading maps, metric depth, surface normals, sun and sky shading components, camera poses, and, for recent flights, measured HDR sky domes. These annotations are derived from an inverse-rendering refinement pipeline over multi-view stereo reconstructions and calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on the MatrixCity benchmark. We further illustrate applications of Olbedo-trained models to multi-view consistent relighting of 3D assets, material editing, and scene change analysis for urban digital twins. We release the dataset, baseline models, and an evaluation protocol to support future research in outdoor intrinsic decomposition and illumination-aware aerial vision.

[87] RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models cs.CV | cs.AIPDF

Xiaoyu Xian, Shiao Wang, Xiao Wang, Daxin Tian, Yan Tian

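TL;DR: This paper addresses visual perception for metro trains in complex environments (illumination changes, high-speed motion, adverse weather) and proposes an RGB-Event hypergraph prompt method, built on pre-trained foundation models, for Kilometer Marker Recognition (KMR). It integrates event cameras to exploit their advantages in low light, high speed, and low power consumption, constructs EvMetro5K, the first large-scale RGB-Event dataset for this task, and adapts a pre-trained RGB OCR foundation model to multiple modalities to improve recognition robustness.

Details

Motivation: Metro trains rely on vision for autonomous localization in GNSS-denied environments, but conventional RGB cameras are limited under illumination changes, high-speed motion, and adverse weather. The paper therefore explores adding event cameras to strengthen the perception system, focusing on the critical task of kilometer marker recognition.

Result: Extensive experiments on the newly built EvMetro5K dataset (5,599 synchronized RGB-Event pairs, split into 4,479 training and 1,120 testing samples) and other widely used benchmarks demonstrate the effectiveness of the proposed method for kilometer marker recognition.

Insight: Contributions include: the first large-scale RGB-Event dataset for metro scenes, EvMetro5K; a multi-modal adaptation method based on a pre-trained RGB OCR foundation model that integrates event data to improve robustness; and the use of event cameras' strengths in complex environments, which offers a new direction for visual perception systems.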

Abstract: Metro trains often operate in highly complex environments, characterized by illumination variations, high-speed motion, and adverse weather conditions. These factors pose significant challenges for visual perception systems, especially those relying solely on conventional RGB cameras. To tackle these difficulties, we explore the integration of event cameras into the perception system, leveraging their advantages in low-light conditions, high-speed scenarios, and low power consumption. Specifically, we focus on Kilometer Marker Recognition (KMR), a critical task for autonomous metro localization under GNSS-denied conditions. In this context, we propose a robust baseline method based on a pre-trained RGB OCR foundation model, enhanced through multi-modal adaptation. Furthermore, we construct the first large-scale RGB-Event dataset, EvMetro5K, containing 5,599 pairs of synchronized RGB-Event samples, split into 4,479 training and 1,120 testing samples. Extensive experiments on EvMetro5K and other widely used benchmarks demonstrate the effectiveness of our approach for KMR. Both the dataset and source code will be released on https://github.com/Event-AHU/EvMetro5K_benchmark

[88] RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking cs.CVPDF

Yanqiu Yu, Zhifan Jin, Sijia Chen, Tongfei Chu, En Yu

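TL;DR: This paper proposes the RT-RMOT task, referring multi-object tracking with RGB and thermal modalities, which fuses RGB appearance features with the illumination robustness of thermal imaging for all-day tracking. The authors build RefRT, the first RGB-Thermal referring multi-object tracking dataset, and propose RTrack, a framework based on a multimodal large language model, with strategies such as Group Sequence Policy Optimization and Clipped Advantage Scaling to improve performance.

Details

Motivation: Existing referring multi-object tracking is limited in low-visibility conditions (such as nighttime and smoke); fusing the thermal modality is needed to improve robustness and enable all-day tracking.

Result: Extensive experiments on the self-built RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.

Insight: Contributions include: a new RGB-Thermal referring tracking task and the first corresponding dataset; an MLLM-based multimodal fusion framework; two reinforcement learning strategies, Group Sequence Policy Optimization and Clipped Advantage Scaling, that exploit the model's potential and stabilize training; and a Structured Output Reward and Comprehensive Detection Reward that balance exploration and exploitation.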

Abstract: Referring Multi-Object Tracking has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset under RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model’s potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design Structured Output Reward and Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.

[89] AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting cs.CVPDF

Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés

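TL;DR: AdaSpot is a framework for precise event spotting. It processes low-resolution video to extract global features and adaptively selects the most informative region of interest in each frame for high-resolution processing, substantially improving computational efficiency while maintaining high precision.

Details

Motivation: Existing methods typically process all frames uniformly and overlook the spatio-temporal redundancy inherent in video, which leads to redundant computation on non-informative regions and the loss of fine-grained detail that is crucial for precise localization.

Result: On standard PES benchmarks such as Tennis and FineDiving, AdaSpot achieves state-of-the-art performance under strict evaluation metrics (for example, mAP@0 gains of 3.96 and 2.26, respectively), while maintaining strong results under looser metrics.

Insight: The core novelty is an unsupervised, task-aware adaptive region selection strategy that maintains spatio-temporal consistency across frames and avoids the instability of learnable alternatives, preserving key fine-grained visual cues with marginal computational overhead.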

Abstract: Precise Event Spotting (PES) aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose AdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that AdaSpot achieves state-of-the-art performance under strict evaluation metrics (e.g., +3.96 and +2.26 mAP@0 frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code is available at: https://github.com/arturxe2/AdaSpot.

[90] Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos cs.CVPDF

Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu

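TL;DR: This paper proposes LFG, a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed in-the-wild driving videos. Using a feedforward architecture with a lightweight autoregressive module, it jointly predicts current and future point maps, camera poses, semantic segmentation, and motion masks from multi-modal teacher signals, learning a unified pseudo-4D representation. The encoder surpasses multi-camera and LiDAR baselines on the NAVSIM benchmark with only a single monocular camera and performs well on a range of semantic, geometric, and motion prediction tasks.

Details

Motivation: Ego-centric driving videos available online offer abundant visual data for autonomous driving, but the lack of annotations makes it hard to learn representations that capture both semantic structure and 3D geometry. Existing self-supervised methods focus mainly on frame-to-frame consistency, whereas safe and reactive driving depends heavily on temporal context.

Result: On the downstream autonomous driving planning task of the NAVSIM benchmark, the method surpasses multi-camera and LiDAR baselines with only a single monocular camera, reaching SOTA. It also performs strongly on a range of semantic, geometric, and qualitative motion prediction tasks.

Insight: The novelty is a fully label-free, teacher-guided framework that learns a unified pseudo-4D representation from raw video through multi-modal pseudo-supervision. It highlights the importance of temporal context for driving perception and shows the effectiveness of combining a feedforward architecture with a lightweight autoregressive module.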

Abstract: Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.

[91] Overview of the CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification cs.CVPDF

Hexin Dong, Yi Lin, Pengyu Zhou, Fengnian Zhao, Alan Clint Legasto

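TL;DR: This paper presents the CXR-LT 2026 challenge, a benchmark focused on multi-center, long-tailed, and zero-shot chest X-ray (CXR) classification. Built on the PadChest and NIH Chest X-ray datasets, it comprises a multi-center dataset of more than 145,000 images and defines two core tasks: robust multi-label classification on 30 known classes, and open-world generalization to 6 unseen (out-of-distribution) rare disease classes.

Details

Motivation: The work targets the challenges that the long-tailed distribution of pathologies and the open-world nature of clinical environments pose for chest X-ray interpretation. Existing benchmarks usually rely on closed-set classes from a single institution and fail to capture the prevalence of rare diseases or the appearance of novel findings.

Result: The top-performing teams achieved a mean Average Precision (mAP) of 0.5854 on Task 1 (known-class classification) and 0.4315 on Task 2 (zero-shot open-world generalization). The results indicate that large-scale vision-language pre-training significantly mitigates the performance drop typically associated with zero-shot diagnosis.

Insight: The novelty lies in a multi-center, long-tailed chest X-ray benchmark and a formally defined open-world generalization task for assessing how well models diagnose rare and unseen diseases. Viewed objectively, the work underlines the effectiveness of large-scale pre-training, vision-language models in particular, for handling data imbalance and zero-shot learning in medical imaging.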

Abstract: Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from single institutions, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT 2026 challenge. This third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. The challenge defines two core tasks: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. We report the results of the top-performing teams, evaluating them via mean Average Precision (mAP), AUROC, and F1-score. The winning solutions achieved an mAP of 0.5854 on Task 1 and 0.4315 on Task 2, demonstrating that large-scale vision-language pre-training significantly mitigates the performance drop typically associated with zero-shot diagnosis.
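
For readers unfamiliar with the headline metric, the sketch below shows one common way to compute mean Average Precision for multi-label classification: rank the images by score for each class, average the precision at each positive hit, then average over classes. The abstract does not state the challenge's exact AP variant or tie handling, so this non-interpolated form and the function names are assumptions.

```python
from typing import Dict, List, Tuple


def average_precision(labels: List[int], scores: List[float]) -> float:
    """Non-interpolated AP for one class: mean precision at each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_class: Dict[str, Tuple[List[int], List[float]]]) -> float:
    """Unweighted mean of per-class AP; rare classes count as much as common ones."""
    if not per_class:
        return 0.0
    aps = [average_precision(labels, scores) for labels, scores in per_class.values()]
    return sum(aps) / len(aps)
```

Because every class contributes equally to the mean, a model that ignores rare findings is penalized heavily, which is why mAP suits long-tailed benchmarks.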

[92] Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D cs.CVPDF

Mariano Barone, Francesco Di Serio, Giuseppe Riccio, Antonio Romano, Marco Postiglione

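TL;DR: This paper proposes Brain3D, a staged vision-language framework for automatically generating radiology reports from 3D brain tumor MRI. It inflates a pretrained 2D medical image encoder into a native 3D architecture and progressively aligns it with a causal language model in three stages (contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization), overcoming the fragmented spatial context that existing models suffer from when processing 3D medical images.

Details

Motivation: Current medical vision-language models typically process volumetric brain MRI with 2D slice-based approximations, which destroys the spatial context needed for accurate neuroradiological interpretation. This paper aims to build an automated report generation system specific to neuroradiology to address this limitation.

Result: Evaluated on a dataset of 468 subjects (BraTS pathological cases and healthy controls), the model achieves a Clinical Pathology F1 of 0.951, far above the 0.413 of a strong 2D baseline, while maintaining perfect specificity on healthy scans.

Insight: Contributions: 1) inflating a pretrained 2D medical encoder into a native 3D architecture to preserve full spatial context; 2) a three-stage alignment strategy (contrastive grounding, projector warmup, LoRA specialization) that progressively stabilizes the mapping from visual features to structured clinical reports; 3) tailoring to neuroradiology (for example hemispheric laterality and tumor infiltration patterns) rather than building a generalist 3D medical VLM. Viewed objectively, the staged training approach and the emphasis on domain specificity are design ideas worth borrowing.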

Abstract: Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed Brain3D, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, Brain3D is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports. Our code is publicly available for transparency and reproducibility.
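
The abstract says the 2D encoder is inflated into a 3D architecture but gives no formula. A common way to inflate pretrained 2D weights is to repeat them along the new depth axis and divide by the depth, so that a volume made of identical slices produces the same response as the 2D filter on a single slice. The sketch below shows that scheme on one small kernel; whether Brain3D uses this exact scheme is an assumption, and `inflate_kernel` and the response helpers are hypothetical names.

```python
from typing import List

Kernel2D = List[List[float]]
Kernel3D = List[List[List[float]]]


def inflate_kernel(kernel_2d: Kernel2D, depth: int) -> Kernel3D:
    """Repeat a 2D kernel `depth` times along a new axis, scaled by 1/depth."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def response_2d(kernel: Kernel2D, patch: Kernel2D) -> float:
    """Dot product of a 2D kernel with an equally sized 2D patch."""
    return sum(w * x for k_row, p_row in zip(kernel, patch) for w, x in zip(k_row, p_row))


def response_3d(kernel: Kernel3D, volume: Kernel3D) -> float:
    """Dot product of a 3D kernel with an equally sized 3D volume."""
    return sum(response_2d(k, v) for k, v in zip(kernel, volume))


kernel_2d = [[1.0, -2.0], [4.0, 8.0]]
kernel_3d = inflate_kernel(kernel_2d, depth=4)
patch = [[1.0, 2.0], [3.0, 4.0]]
# A volume of four identical slices gives the same response as the 2D filter.
same = abs(response_3d(kernel_3d, [patch] * 4) - response_2d(kernel_2d, patch)) < 1e-12
```

The 1/depth scaling keeps activation magnitudes at initialization close to those of the pretrained 2D model, which tends to make the subsequent fine-tuning more stable than starting from random 3D weights.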

[93] GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models cs.CVPDF

Abhipsa Basu, Mohana Singh, Shashank Agnihotri, Margret Keuper, R. Venkatesh Babu

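TL;DR: This paper proposes GeoDiv, a framework for evaluating the geographical diversity of images generated by text-to-image (T2I) models. It uses large language models and vision-language models to measure diversity along two complementary axes, the Socio-Economic Visual Index (SEVI) and the Visual Diversity Index (VDI). Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across 10 entities and 16 countries, the analysis reveals a general lack of geographical diversity and identifies a bias toward impoverished and worn depictions of countries such as India, Nigeria, and Colombia.

Details

Motivation: Outputs of existing text-to-image models often lack geographical diversity, reinforce stereotypes, and misrepresent regions. Existing metrics either rely on curated datasets or focus only on surface-level visual similarity, with limited interpretability. A systematic and interpretable framework is therefore needed to rigorously evaluate how these models portray the world.

Result: Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across 10 entities and 16 countries, GeoDiv shows a consistent lack of diversity in model outputs and pinpoints, at a fine-grained level, a bias toward impoverished and worn depictions of specific countries (such as India, Nigeria, and Colombia). It provides the first systematic and interpretable framework for measuring such biases.

Insight: The novelty is a two-axis evaluation framework combining SEVI and VDI that uses large models for fine-grained, interpretable measurement of geographical diversity, going beyond earlier approaches that depend on datasets or surface similarity and offering a new tool for assessing fairness and inclusiveness in generative models.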

Abstract: Text-to-image (T2I) models are rapidly gaining popularity, yet their outputs often lack geographical diversity, reinforce stereotypes, and misrepresent regions. Given their broad reach, it is critical to rigorously evaluate how these models portray the world. Existing diversity metrics either rely on curated datasets or focus on surface-level visual similarity, limiting interpretability. We introduce GeoDiv, a framework leveraging large language and vision-language models to assess geographical diversity along two complementary axes: the Socio-Economic Visual Index (SEVI), capturing economic and condition-related cues, and the Visual Diversity Index (VDI), measuring variation in primary entities and backgrounds. Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across $10$ entities and $16$ countries, GeoDiv reveals a consistent lack of diversity and identifies fine-grained attributes where models default to biased portrayals. Strikingly, depictions of countries like India, Nigeria, and Colombia are disproportionately impoverished and worn, reflecting underlying socio-economic biases. These results highlight the need for greater geographical nuance in generative models. GeoDiv provides the first systematic, interpretable framework for measuring such biases, marking a step toward fairer and more inclusive generative systems. Project page: https://abhipsabasu.github.io/geodiv

[94] WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs cs.CVPDF

Yulin Zhang, Cheng Shi, Sibei Yang

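TL;DR: This paper proposes WeaveTime, a framework that addresses the time-agnosticism of existing video large language models in streaming video. A lightweight temporal reconstruction objective (Streaming Order Perception) instills order-aware representations, and at inference a Past-Current Dynamic Focus Cache performs uncertainty-triggered retrieval, improving temporal reasoning and efficiency in streaming settings without modifying the model architecture.

Details

Motivation: Despite progress in video understanding, the quadratic attention and offline training of current multimodal large language models make them ill-suited to streaming settings, where frames arrive sequentially and the future is unknown. Existing Video-LLMs are time-agnostic: they treat a video as an unordered collection of evidence rather than a causally ordered sequence, leading to two problems, temporal order ambiguity and past-current focus blindness.

Result: On representative streaming benchmarks, WeaveTime delivers consistent accuracy gains and lower latency without changing the architecture of existing Video-LLMs, a practical step toward time-aware streaming Video-LLMs under strict online, time-causal constraints.

Insight: The novelty is a simple, efficient, model-agnostic framework that teaches temporal order through a lightweight temporal reconstruction objective (with no specialized streaming data) and manages history dynamically through uncertainty-triggered, coarse-to-fine retrieval, effectively addressing temporal modeling in streaming video.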

Abstract: Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective, our Streaming Order Perception enhancement, that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/

[95] MedTri: A Platform for Structured Medical Report Normalization to Enhance Vision-Language Pretraining cs.CVPDF

Yuetan Chu, Xinhua Ma, Xinran Jin, Gongning Luo, Xin Gao

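TL;DR: This paper presents MedTri, a platform that normalizes free-text medical reports into an [Anatomical Entity: Radiologic Description + Diagnosis Category] triplet structure to improve the quality of medical vision-language pretraining. The study shows that this anatomy-grounded text normalization removes stylistic noise and image-irrelevant content, provides consistent image-grounded textual supervision, and improves performance on multiple X-ray and CT datasets.

Details

Motivation: Medical vision-language pretraining relies on medical reports as supervision, but raw reports are stylistically heterogeneous, vary in length, and contain a large amount of image-irrelevant content, while the design principles of text normalization and its effect on pretraining have not been studied systematically.

Result: Across multiple X-ray and CT datasets, anatomy-grounded text normalization yields consistent improvements over raw reports and existing normalization baselines, raising the quality of medical vision-language pretraining.

Insight: The novelty is MedTri, a deployable normalization framework that converts free-text reports into a unified triplet structure while preserving essential morphological and spatial information. The normalization also supports modular text-level augmentation strategies (such as knowledge enrichment and anatomy-grounded counterfactual supervision) that improve robustness and generalization without altering the core pipeline, providing a general preprocessing component for medical vision-language learning.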

Abstract: Medical vision-language pretraining increasingly relies on medical reports as large-scale supervisory signals; however, raw reports often exhibit substantial stylistic heterogeneity, variable length, and a considerable amount of image-irrelevant content. Although text normalization is frequently adopted as a preprocessing step in prior work, its design principles and empirical impact on vision-language pretraining have not been examined systematically. In this study, we present MedTri, a deployable normalization framework for medical vision-language pretraining that converts free-text reports into a unified [Anatomical Entity: Radiologic Description + Diagnosis Category] triplet. This structured, anatomy-grounded normalization preserves essential morphological and spatial information while removing stylistic noise and image-irrelevant content, providing consistent and image-grounded textual supervision at scale. Across multiple datasets spanning both X-ray and computed tomography (CT) modalities, we demonstrate that structured, anatomy-grounded text normalization is an important factor in medical vision-language pretraining quality, yielding consistent improvements over raw reports and existing normalization baselines. In addition, we illustrate how this normalization can easily support modular text-level augmentation strategies, including knowledge enrichment and anatomy-grounded counterfactual supervision, which provide complementary gains in robustness and generalization without altering the core normalization process. Together, our results position structured text normalization as a critical and generalizable preprocessing component for medical vision-language learning, while MedTri provides this normalization platform. Code and data will be released at https://github.com/Arturia-Pendragon-Iris/MedTri.

[96] Solaris: Building a Multiplayer Video World Model in Minecraft cs.CVPDF

Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan

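TL;DR: The paper presents Solaris, a multiplayer video world model built in Minecraft that simulates consistent multi-view observations. The authors develop a data collection system that supports coordinated multi-agent interaction and synchronized video and action capture, collect 12.64 million frames of multiplayer data, and propose a multiplayer evaluation framework. The model is trained with a staged pipeline combining bidirectional, causal, and Self Forcing training, and the final stage introduces Checkpointed Self Forcing, a variant that enables a longer-horizon teacher. Results show that the architecture and training design outperform existing baselines.

Details

Motivation: Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives and cannot capture the multi-agent interactions of real-world environments. The paper addresses this by building a multiplayer video world model that simulates consistent multi-view observations.

Result: The proposed architecture and training design outperform existing baselines, as validated on the paper's evaluation framework covering multiplayer movement, memory, grounding, building, and view consistency, reaching SOTA.

Insight: Contributions include: 1) an automated data collection system designed for multiplayer settings, supporting coordinated interaction and synchronized capture; 2) a staged training pipeline that transitions progressively from single-player to multiplayer modeling and combines several training techniques; 3) Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher; 4) a comprehensive multiplayer evaluation framework. Viewed objectively, the system provides important infrastructure and benchmarks for research on multi-agent world models.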

Abstract: Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.

[97] WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos cs.CVPDF

Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu

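TL;DR: WHOLE is a method that holistically reconstructs hand and object motion in world space from egocentric videos. It jointly learns a generative prior over hand-object motion to reason about interactions, addressing the limitations of existing methods with occlusions and out-of-view objects.

Details

Motivation: Existing methods usually recover hand or object pose in isolation, perform poorly during interactions, and cannot handle out-of-sight cases, producing inconsistent hand-object relations. A joint reasoning approach is therefore needed.

Result: WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and reconstruction of their relative interaction, substantially outperforming approaches that process hands and objects separately and then post-process.

Insight: The novelty is a joint generative prior that models hand-object interaction, with video observations guiding the generated trajectories. This gives robust reconstruction under occlusion and for out-of-view objects and offers a new direction for multimodal interaction understanding.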

Abstract: Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www

eess.AS [Back]

[98] iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis eess.AS | cs.CLPDF

Sofoklis Kakouros, Fang Kang, Haoyu Chen

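TL;DR: This paper introduces iMiGUE-Speech, an extension of the iMiGUE dataset that focuses on spontaneous affective speech collected in natural settings and adds metadata including speech transcripts, speaker-role separation, and word-level forced alignments. The dataset supports the study of spontaneous affective states from acoustic and linguistic modalities and can be synchronized with the micro-gesture annotations of the original dataset, forming a multimodal resource for studying speech-gesture affective dynamics.

Details

Motivation: Existing emotional speech datasets mostly rely on acted or laboratory-elicited emotions and lack spontaneous affect from natural settings. iMiGUE-Speech fills this gap with a spontaneous affective corpus arising naturally from real match outcomes, for a more realistic study of emotional and affective states.

Result: The paper introduces two evaluation tasks, speech emotion recognition and transcript-based sentiment analysis, and uses state-of-the-art pre-trained representations to assess how well the dataset captures spontaneous affective states from acoustic and linguistic modalities, establishing initial benchmarks.

Insight: The novelty is a natural speech dataset focused on spontaneous affect, combined with rich metadata and the ability to synchronize speech with gesture, a distinctive resource for studying real-world emotional expression and cross-modal affective dynamics.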

Abstract: This work presents iMiGUE-Speech, an extension of the iMiGUE dataset that provides a spontaneous affective corpus for studying emotional and affective states. The new release focuses on speech and enriches the original dataset with additional metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. Unlike existing emotional speech datasets that rely on acted or laboratory-elicited emotions, iMiGUE-Speech captures spontaneous affect arising naturally from real match outcomes. To demonstrate the utility of the dataset and establish initial benchmarks, we introduce two evaluation tasks for comparative assessment: speech emotion recognition and transcript-based sentiment analysis. These tasks leverage state-of-the-art pre-trained representations to assess the dataset’s ability to capture spontaneous affective states from both acoustic and linguistic modalities. iMiGUE-Speech can also be synchronously paired with micro-gesture annotations from the original iMiGUE dataset, forming a uniquely multimodal resource for studying speech-gesture affective dynamics. The extended dataset is available at https://github.com/CV-AC/imigue-speech.

[99] TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition eess.AS | cs.AI | cs.CL | cs.SDPDF

Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang

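TL;DR: This paper proposes TG-ASR, a translation-guided framework for low-resource automatic speech recognition that targets languages such as Taiwanese Hokkien, where transcribed data is scarce. The framework uses multilingual translation embeddings and integrates auxiliary-language information through a parallel gated cross-attention mechanism to improve recognition. The authors also release the YT-THDC corpus, containing 30 hours of speech with aligned Mandarin subtitles and manually verified transcriptions.

Details

Motivation: Low-resource languages such as Taiwanese Hokkien lack transcribed data for ASR. Large amounts of spoken content exist in television dramas and similar sources, but the subtitles are mostly in Mandarin and transcriptions in the language itself are missing.

Result: In experiments on the proposed YT-THDC corpus, selecting effective auxiliary languages yields a 14.77% relative reduction in character error rate, demonstrating the effectiveness of translation-guided learning for low-resource ASR.

Insight: The novelty is the translation-guided ASR framework and the parallel gated cross-attention mechanism, which adaptively integrates multilingual translation embeddings into the ASR decoder, providing cross-lingual semantic guidance while keeping optimization stable and minimizing interference between languages. This suggests a way to use abundant subtitle data to improve low-resource ASR.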

Abstract: Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien exemplifies this issue, with transcriptions often being scarce and the majority of available subtitles provided only in Mandarin. To address this deficiency, we introduce TG-ASR for Taiwanese Hokkien drama speech recognition, a translation-guided ASR framework that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered around the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research initiatives, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.
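The abstract describes PGCA as adaptively integrating embeddings from several auxiliary languages into the ASR decoder, but it does not give the equations. The following is a minimal sketch of one plausible reading, not the authors' implementation: each auxiliary language gets its own single-head cross-attention branch, every branch is scaled by a `tanh` gate, and the branches are added to the decoder state in parallel. The function names, the scalar gates, and the single-head attention are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, memory):
    """Single-head dot-product attention of one query vector over memory vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, vec)) / scale for vec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * vec[d] for w, vec in zip(weights, memory)) for d in range(dim)]


def parallel_gated_cross_attention(decoder_state, aux_embeddings, gates):
    """Add one gated cross-attention branch per auxiliary language (hypothetical form).

    decoder_state:  list of floats (one decoder position)
    aux_embeddings: one list of embedding vectors per auxiliary language
    gates:          one scalar gate parameter per auxiliary language
    """
    fused = list(decoder_state)
    for memory, gate in zip(aux_embeddings, gates):
        context = cross_attend(decoder_state, memory)
        g = math.tanh(gate)  # a gate parameter of 0 switches the branch off
        fused = [f + g * c for f, c in zip(fused, context)]
    return fused
```

With every gate parameter at zero the decoder state passes through unchanged, which is one common way to keep optimization stable when new branches are attached to an existing decoder.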

cs.AI [Back]

[100] Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem cs.AI | cs.CLPDF

Heejin Jo

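TL;DR: This variable isolation study examines how prompt architecture affects large language model performance on the "car wash problem" reasoning task. The STAR (Situation-Task-Action-Result) reasoning framework alone raises Claude 3.5 Sonnet's accuracy from 0% to 85%, and adding user profile context retrieved from a vector database plus RAG context brings accuracy to 100%.

Details

Motivation: Large language models consistently fail the "car wash problem," a viral benchmark that requires inferring implicit physical constraints. The study asks which prompt architecture layers in a production system enable correct reasoning.

Result: On Claude 3.5 Sonnet, the STAR framework raises accuracy from 0% to 85% (p=0.001). Adding user profile context and RAG context contributes a further 10 and 5 percentage points respectively, reaching 100% accuracy in the full-stack condition.

Insight: The paper's central claim is that its variable isolation study shows structured reasoning scaffolds (specifically, forced goal articulation before inference) matter far more than context injection for implicit constraint reasoning. Viewed objectively, the study gives prompt engineering an empirical basis and highlights structured guidance of the reasoning process as a key factor in complex reasoning performance.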

Abstract: Large language models consistently fail the “car wash problem,” a viral reasoning benchmark requiring implicit physical constraint inference. We present a variable isolation study (n=20 per condition, 6 conditions, 120 total trials) examining which prompt architecture layers in a production system enable correct reasoning. Using Claude 3.5 Sonnet with controlled hyperparameters (temperature 0.7, top_p 1.0), we find that the STAR (Situation-Task-Action-Result) reasoning framework alone raises accuracy from 0% to 85% (p=0.001, Fisher’s exact test, odds ratio 13.22). Adding user profile context via vector database retrieval provides a further 10 percentage point gain, while RAG context contributes an additional 5 percentage points, achieving 100% accuracy in the full-stack condition. These results suggest that structured reasoning scaffolds – specifically, forced goal articulation before inference – matter substantially more than context injection for implicit constraint reasoning tasks.

[101] Distill and Align Decomposition for Enhanced Claim Verification cs.AI | cs.CL | cs.LGPDF

Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero

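TL;DR: This paper proposes a reinforcement learning approach that uses Group Relative Policy Optimization (GRPO) to jointly improve decomposition quality and verifier alignment in complex claim verification, enabling smaller language models to reach state-of-the-art claim verification performance.

Details

Motivation: In complex claim verification, existing methods struggle to align the quality of decomposing sentences into verifiable subclaims with verification performance. A method that jointly optimizes decomposition quality and verifier alignment is needed.

Result: Across six evaluation settings, the trained 8B decomposer raises downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (by +1.99 and +6.24) and existing RL methods (by +5.84). Human evaluation also confirms the high quality of the generated subclaims.

Insight: The novelty is the combination of structured sequential reasoning, supervised fine-tuning on teacher-distilled exemplars, and a multi-objective reward that balances format compliance, verifier alignment, and decomposition quality, jointly optimized with reinforcement learning so that a small model achieves SOTA claim verification.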

Abstract: Complex claim verification requires decomposing sentences into verifiable subclaims, yet existing methods struggle to align decomposition quality with verification performance. We propose a reinforcement learning (RL) approach that jointly optimizes decomposition quality and verifier alignment using Group Relative Policy Optimization (GRPO). Our method integrates: (i) structured sequential reasoning; (ii) supervised finetuning on teacher-distilled exemplars; and (iii) a multi-objective reward balancing format compliance, verifier alignment, and decomposition quality. Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84). Human evaluation confirms the high quality of the generated subclaims. Our framework enables smaller language models to achieve state-of-the-art claim verification by jointly optimising for verification accuracy and decomposition quality.
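As a rough illustration of the two ingredients named in the abstract, the sketch below combines three reward components into one scalar and then applies the group-relative normalization that gives GRPO its name: each sampled decomposition is scored against the mean and standard deviation of its own group. The component weights and the function names are hypothetical; the abstract does not state how the three objectives are weighted.

```python
import statistics


def combined_reward(format_ok, verifier_alignment, decomposition_quality,
                    weights=(0.2, 0.5, 0.3)):
    """Weighted sum of three reward components, each in [0, 1].

    The weights are placeholders chosen for illustration only.
    """
    w_format, w_verifier, w_quality = weights
    return (w_format * format_ok
            + w_verifier * verifier_alignment
            + w_quality * decomposition_quality)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, `group_relative_advantages([1.0, 0.0, 0.0, 1.0])` gives values close to `[1, -1, -1, 1]`, and a group whose samples all receive the same reward yields zero advantage for every sample, so it contributes no learning signal.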

cs.SE [Back]

[102] SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents cs.SE | cs.AI | cs.CL | cs.LGPDF

Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt

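TL;DR: SWE-Protégé is a post-training framework that recasts software repair as an expert-protégé collaboration problem. A small language model (SLM) remains the sole decision-maker and learns to selectively seek guidance from a strong expert model, recognize stalled states, and act on expert feedback. By combining supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning, the method substantially improves SLM performance on long-horizon software engineering tasks such as SWE-bench.

Details

Motivation: Small language models offer advantages in cost, latency, and adaptability, but perform poorly on long-horizon software engineering tasks such as SWE-bench, where action looping is pervasive and resolution rates are low. The paper aims to unlock SLMs as software engineering agents through an expert collaboration framework.

Result: On SWE-bench Verified, a lightly post-trained Qwen2.5-Coder-7B-Instruct reaches 42.4% Pass@1, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (about 4 calls per task and 11% of total tokens).

Insight: The novelty is reframing software repair as a selective expert collaboration problem in which the SLM itself decides when to ask the expert, combined with supervised fine-tuning and reinforcement learning that suppress degenerative looping and unproductive collaboration. This offers a way to use a large expert model efficiently in resource-constrained settings.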

Abstract: Small language models (SLMs) offer compelling advantages in cost, latency, and adaptability, but have so far lagged behind larger models on long-horizon software engineering tasks such as SWE-bench, where they suffer from pervasive action looping and low resolution rates. We introduce SWE-Protégé, a post-training framework that reframes software repair as an expert-protégé collaboration problem. In SWE-Protégé, an SLM remains the sole decision-maker while learning to selectively seek guidance from a strong expert model, recognize stalled states, and follow through on expert feedback. Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration. We lightly post-train Qwen2.5-Coder-7B-Instruct to achieve 42.4% Pass@1 on SWE-bench Verified, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (~4 calls per task and 11% of total tokens).

cs.CR [Back]

[103] Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG cs.CR | cs.AI | cs.CL | cs.LGPDF

Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni

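TL;DR: This paper proposes MMA-RAG^T, an inference-time control framework that protects multimodal agentic RAG systems against distributed malicious attacks. It models the security challenge as a partially observable Markov decision process, in which a Modular Trust Agent maintains an approximate belief state to infer latent attack intent.

Details

Motivation: Current stateless defences cannot detect attack strategies that distribute malicious semantics across the retrieval, planning, and generation components, so a stateful defence that can infer latent adversarial intent is needed.

Result: Extensive evaluation on 43,774 instances shows a 6.50x average reduction in attack success rate relative to undefended baselines, with negligible utility cost. Ablations confirm that both statefulness and spatial coverage are necessary.

Insight: The novelty is modelling adversarial intent as a latent variable and maintaining a stateful belief through structured LLM reasoning, giving a model-agnostic defence-in-depth. The theoretical bounds show that stateless multi-point intervention can yield zero marginal benefit when checkpoint detections are perfectly correlated.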

Abstract: Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components. We formulate this security challenge as a Partially Observable Markov Decision Process (POMDP), where adversarial intent is a latent variable inferred from noisy multi-stage observations. We introduce MMA-RAG^T, an inference-time control framework governed by a Modular Trust Agent (MTA) that maintains an approximate belief state via structured LLM reasoning. Operating as a model-agnostic overlay, MMA-RAG^T mediates a configurable set of internal checkpoints to enforce stateful defence-in-depth. Extensive evaluation on 43,774 instances demonstrates a 6.50x average reduction factor in Attack Success Rate relative to undefended baselines, with negligible utility cost. Crucially, a factorial ablation validates our theoretical bounds: while statefulness and spatial coverage are individually necessary (26.4 pp and 13.6 pp gains respectively), stateless multi-point intervention can yield zero marginal benefit under homogeneous stateless filtering when checkpoint detections are perfectly correlated.
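To make the "latent intent inferred from noisy multi-stage observations" idea concrete, here is a minimal sketch that swaps the paper's LLM-maintained approximate belief for an exact two-state Bayes filter. It is not the MTA itself: the likelihoods, the prior, the blocking threshold, and the notion of a binary "flag" per checkpoint are all assumptions chosen for illustration.

```python
def update_belief(prior_adv, flagged, p_flag_adv=0.7, p_flag_benign=0.2):
    """One Bayes update of P(intent = adversarial) after a checkpoint observation.

    p_flag_adv and p_flag_benign are made-up detector likelihoods.
    """
    like_adv = p_flag_adv if flagged else 1.0 - p_flag_adv
    like_benign = p_flag_benign if flagged else 1.0 - p_flag_benign
    numerator = like_adv * prior_adv
    return numerator / (numerator + like_benign * (1.0 - prior_adv))


def run_checkpoints(flags, prior_adv=0.05, block_threshold=0.5):
    """Carry the belief across checkpoints; return (belief, index of block or None)."""
    belief = prior_adv
    for index, flagged in enumerate(flags):
        belief = update_belief(belief, flagged)
        if belief >= block_threshold:
            return belief, index
    return belief, None
```

Under these assumed numbers, a single flagged checkpoint moves the belief from 0.05 to only about 0.16, so a stateless filter that looks at each checkpoint in isolation would never block. Three flagged checkpoints in a row (for example at retrieval, planning, and generation) push the carried-over belief past 0.5 at the third one, which is the kind of distributed signal a stateful defence can accumulate.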

cs.LG [Back]

[104] Latent Context Compilation: Distilling Long Context into Compact Portable Memory cs.LG | cs.AI | cs.CLPDF

Zeju Li, Yizhou Zhou, Qiang Xu

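TL;DR: This paper proposes Latent Context Compilation (LCC), a framework for efficient long-context processing in large language models (LLMs). A disposable LoRA module acts as a "compiler" that distills long context into compact, stateless buffer tokens, which plug directly into the frozen base model without modifying model weights or relying on expensive synthetic data.

Details

Motivation: Efficient deployment of long-context LLMs faces a dilemma: amortized compression generalizes poorly out of distribution, while Test-Time Training is costly, requires modifying model weights, and creates stateful parameters that hinder concurrent serving. The paper addresses this trade-off between efficiency and generalization.

Result: Experiments on Llama-3.1-8B show that at a 16x compression ratio the method still preserves fine-grained details and reasoning capabilities, outperforming prior methods and decoupling memory density from model parameters.

Insight: The core novelty is shifting context processing from an "adaptation" paradigm to a "compilation" paradigm, together with a self-aligned optimization strategy. That strategy regularizes the context reconstruction task with context-agnostic random queries, forcing the compressed tokens to lie within the model's existing instruction-following manifold, so training needs no synthetic QA pairs and yields efficient, portable long-context compression.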

Abstract: Efficient long-context LLM deployment is stalled by a dichotomy between amortized compression, which struggles with out-of-distribution generalization, and Test-Time Training, which incurs prohibitive synthetic data costs and requires modifying model weights, creating stateful parameters that complicate concurrent serving. We propose Latent Context Compilation, a framework that fundamentally shifts context processing from adaptation to compilation. By utilizing a disposable LoRA module as a compiler, we distill long contexts into compact buffer tokens – stateless, portable memory artifacts that are plug-and-play compatible with frozen base models. Crucially, we introduce a self-aligned optimization strategy that eliminates the need for synthetic context-relevant QA pairs. By regularizing context reconstruction task with context-agnostic random queries, we force compressed tokens to reside within the model’s existing instruction-following manifold. Experiments with Llama-3.1-8B demonstrate that Latent Context Compilation preserves fine-grained details and reasoning capabilities where prior methods falter, effectively decoupling memory density from model parameters even at a 16x compression ratio.

[105] ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces cs.LG | cs.AI | cs.CLPDF

Ramchand Kumaresan

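TL;DR: This paper presents ACAR (Adaptive Complexity and Attribution Routing), a framework for studying multi-model orchestration under auditable conditions. The system routes tasks dynamically across single-model, two-model, and three-model execution modes based on self-consistency variance (sigma), and is implemented on TEAMLLM, a deterministic execution substrate. Experiments on four benchmarks (MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA) show that sigma-based routing reaches 55.6% accuracy, exceeding the two-model baseline, while avoiding full ensembling on 54.2% of tasks.

Details

Motivation: The paper asks how a multi-model ensemble can adaptively choose an execution mode (one, two, or three models) according to task complexity without relying on learned components, while keeping every decision fully auditable and traceable.

Result: The evaluation covers 1,510 tasks from MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA, using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash. Sigma-based routing reaches 55.6% accuracy, exceeding the two-model baseline (54.4%), while avoiding full (three-model) ensembling on 54.2% of tasks.

Insight: The contributions are: 1) self-consistency variance as a learning-free, model-agnostic routing signal; 2) fully auditable decision traces on a deterministic execution substrate (TEAMLLM); 3) empirical findings that retrieval augmentation can introduce noise when it is not semantically aligned, that ensembling cannot correct errors on which the models agree (sigma=0), and that attribution estimates based on proxy signals correlate weakly with ground truth. Together these provide falsifiable baselines for research on routing, retrieval, and multi-model attribution.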

Abstract: We present ACAR (Adaptive Complexity and Attribution Routing), a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes. The system is implemented on top of TEAMLLM, a deterministic execution substrate with immutable artifacts and complete decision traces. We evaluate ACAR on 1,510 tasks spanning four benchmarks: MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA, using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, producing more than 7,550 auditable runs. Results show that sigma-based routing achieves 55.6 percent accuracy, exceeding the two-model baseline of 54.4 percent while avoiding full ensembling on 54.2 percent of tasks. The routing mechanism is model-agnostic and requires no learned components. We also document negative results. First, retrieval augmentation reduced accuracy by 3.4 percentage points, as median retrieval similarity was only 0.167, demonstrating that experience injection without semantic alignment introduces noise rather than grounding. Second, when models agree on incorrect answers (sigma equals zero), no downstream ensemble can recover; this agreement-but-wrong failure mode is intrinsic to self-consistency and bounds achievable accuracy at approximately eight percentage points below full ensembling. Third, attribution estimates based on proxy signals such as response similarity and entropy showed weak correlation with ground-truth leave-one-out values, indicating that practical attribution requires explicit counterfactual computation. This work documents which assumptions fail in practice and provides falsifiable baselines for future research on routing, retrieval, and multi-model attribution.
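The routing rule in the abstract (probe three times, measure disagreement, escalate to more models only when needed) can be sketched as below. The abstract does not define how sigma is computed from the N=3 probe samples or which thresholds separate the modes, so this sketch substitutes a simple disagreement rate (one minus the share of the most common answer) and hypothetical thresholds `low` and `high`.

```python
from collections import Counter


def disagreement(probe_answers):
    """Stand-in for sigma: 1 - (share of the most common probe answer)."""
    counts = Counter(probe_answers)
    top_count = counts.most_common(1)[0][1]
    return 1.0 - top_count / len(probe_answers)


def route(probe_answers, low=0.1, high=0.5):
    """Pick an execution mode from probe disagreement (thresholds are assumptions)."""
    sigma = disagreement(probe_answers)
    if sigma <= low:
        return "single-model"
    if sigma <= high:
        return "two-model"
    return "three-model"
```

With three probes the stand-in takes only the values 0, 1/3, and 2/3, so `route(["42", "42", "42"])` returns `"single-model"`, `route(["42", "42", "41"])` returns `"two-model"`, and three distinct answers return `"three-model"`. The first case also illustrates the failure mode the abstract reports: if all three probes agree on a wrong answer, the disagreement is zero and the task is never escalated.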

[106] GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning cs.LG | cs.AI | cs.CLPDF

Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang

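TL;DR: This paper proposes GradAlign, a gradient-aligned data selection technique for LLM reinforcement learning. It uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, forming an adaptive curriculum. The method outperforms existing baselines in three challenging data regimes (unreliable reward signals, distribution imbalance, and low-utility training corpora), demonstrating the importance of directional gradient signals in non-stationary policy optimization.

Details

Motivation: Reinforcement learning is a central post-training paradigm for LLMs, and its performance is highly sensitive to the quality of training problems. Existing methods rely on manual curation or simple heuristic filters, which can admit incorrect or low-utility problems.

Result: Across the three data regimes (unreliable reward signals, distribution imbalance, and low-utility training corpora), GradAlign consistently outperforms existing baselines, with more stable training and higher final performance.

Insight: The novelty is using gradient alignment as the data selection criterion to build an adaptive curriculum that copes with the non-stationarity of RL. Viewed objectively, the method provides a data filtering mechanism based on directional gradient signals that can improve the robustness and efficiency of policy optimization.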

Abstract: Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient signals in navigating non-stationary policy optimization and yielding more stable training and improved final performance. We release our implementation at https://github.com/StigLidu/GradAlign
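The selection criterion in the abstract, prioritizing problems whose policy gradients align with validation gradients, can be sketched with cosine similarity and a top-k cut. This is an assumption-laden illustration rather than the released implementation: cosine similarity as the alignment score, a fixed `k`, and gradients passed in as plain lists are all choices made here for brevity.

```python
import math


def cosine(u, v):
    """Cosine similarity of two gradient vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_aligned(problem_grads, validation_grad, k):
    """Return the k problem ids whose gradients best align with the validation gradient.

    problem_grads: dict mapping a problem id to its (flattened) policy gradient.
    """
    ranked = sorted(
        problem_grads.items(),
        key=lambda item: cosine(item[1], validation_grad),
        reverse=True,
    )
    return [problem_id for problem_id, _ in ranked[:k]]
```

Because the policy keeps changing during RL, the scores would need to be recomputed periodically with fresh gradients, which is what turns a one-off filter into the adaptive curriculum the abstract describes. A problem with a faulty reward would tend to produce a gradient that points away from the validation gradient and so rank low.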

[107] GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL cs.LG | cs.AI | cs.CLPDF

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang

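TL;DR: This paper proposes GUI-Libra, a training recipe tailored to native GUI agents that targets their gap to closed-source systems on long-horizon navigation tasks. By building a high-quality action-aligned reasoning dataset, proposing action-aware supervised fine-tuning (SFT), and introducing techniques that stabilize reinforcement learning under partial verifiability, the method substantially improves step-wise accuracy and end-to-end task completion.

Details

Motivation: Open-source native GUI agents perform poorly on long-horizon navigation tasks because of two limitations: a shortage of high-quality action-aligned reasoning data, and the direct adoption of generic post-training pipelines that ignore challenges specific to GUI agents (standard chain-of-thought fine-tuning hurts grounding, and step-wise RL faces partial verifiability).

Result: Across diverse web and mobile benchmarks, GUI-Libra consistently improves step-wise accuracy and end-to-end task completion. The results indicate that carefully designed post-training and data curation can unlock much stronger task-solving ability without costly online data collection.

Insight: The contributions are: 1) a curated 81K GUI reasoning dataset, built and released to ease data scarcity; 2) action-aware SFT, which mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding; 3) evidence that KL regularization matters in RLVR, plus success-adaptive scaling that downweights unreliable negative gradients, stabilizing RL under partial verifiability.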

Abstract: Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
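The token reweighting in action-aware SFT can be pictured as a weighted negative log-likelihood in which action and grounding tokens count for more than reasoning tokens. The sketch below is only an illustration of that idea: the three token categories, the weight values, and the normalization by total weight are assumptions, since the abstract does not give the actual weighting scheme.

```python
import math


def weighted_nll(token_probs, token_kinds, weights=None):
    """Weighted average negative log-likelihood over one target sequence.

    token_probs: model probability assigned to each target token
    token_kinds: "reasoning", "action", or "grounding" for each token
    weights:     per-kind loss weights (the defaults are placeholders)
    """
    if weights is None:
        weights = {"reasoning": 1.0, "action": 2.0, "grounding": 2.0}
    total = 0.0
    weight_sum = 0.0
    for prob, kind in zip(token_probs, token_kinds):
        w = weights[kind]
        total += -w * math.log(prob)
        weight_sum += w
    return total / weight_sum
```

With equal weights this reduces to the ordinary mean token loss. Raising the action and grounding weights means that a poorly predicted click target contributes more to the loss than an equally poorly predicted reasoning token, which is one way to counter the grounding degradation the abstract attributes to standard SFT with CoT.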

[108] Uncertainty-Aware Diffusion Model for Multimodal Highway Trajectory Prediction via DDIM Sampling cs.LG | cs.CV | cs.ROPDF

Marion Neumeier, Niklas Roßberg, Michael Botsch, Wolfgang Utschick

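TL;DR: This paper proposes cVMDx, a diffusion-based framework for highway trajectory prediction. DDIM sampling reduces inference time by up to 100x, a Gaussian Mixture Model extracts tractable multimodal predictions from the generated trajectories, and a CVQ-VAE variant is evaluated for scenario encoding. The method achieves higher accuracy and efficiency on the highD dataset.

Details

Motivation: Accurate, uncertainty-aware trajectory prediction is a core challenge in autonomous driving. Existing diffusion-based approaches such as cVMD suffer from slow sampling, limited use of generative diversity, and brittle scenario encodings.

Result: On the public highD dataset, cVMDx achieves higher accuracy and significantly improved efficiency compared with cVMD, enabling fully stochastic, multimodal trajectory prediction.

Insight: The contributions are: 1) bringing DDIM sampling to a diffusion model for trajectory prediction, which greatly accelerates inference; 2) fitting a Gaussian Mixture Model to the generated trajectories to extract interpretable multimodal predictions; 3) exploring CVQ-VAE for robust scenario encoding, making the framework more practical.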

Abstract: Accurate and uncertainty-aware trajectory prediction remains a core challenge for autonomous driving, driven by complex multi-agent interactions, diverse scene contexts and the inherently stochastic nature of future motion. Diffusion-based generative models have recently shown strong potential for capturing multimodal futures, yet existing approaches such as cVMD suffer from slow sampling, limited exploitation of generative diversity and brittle scenario encodings. This work introduces cVMDx, an enhanced diffusion-based trajectory prediction framework that improves efficiency, robustness and multimodal predictive capability. Through DDIM sampling, cVMDx achieves up to a 100x reduction in inference time, enabling practical multi-sample generation for uncertainty estimation. A fitted Gaussian Mixture Model further provides tractable multimodal predictions from the generated trajectories. In addition, a CVQ-VAE variant is evaluated for scenario encoding. Experiments on the publicly available highD dataset show that cVMDx achieves higher accuracy and significantly improved efficiency over cVMD, enabling fully stochastic, multimodal trajectory prediction.
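The speed-up from DDIM comes from visiting only a strided subset of the training timesteps, with a deterministic update between them. The sketch below shows the standard deterministic DDIM step (the eta = 0 case) and a strided schedule on plain Python lists. It is a generic illustration, not cVMDx code: the noise prediction `eps_pred` would come from the trained denoiser, and the 1000-to-10 step counts in the usage note are hypothetical numbers, not figures from the paper.

```python
import math


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update (eta = 0) from timestep t to an earlier timestep.

    abar_t and abar_prev are the cumulative products of (1 - beta) at the
    current and the target timestep.
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = [
        (x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        for x, e in zip(x_t, eps_pred)
    ]
    # Re-noise that estimate to the target noise level along the same direction.
    return [
        math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
        for x0, e in zip(x0_pred, eps_pred)
    ]


def strided_timesteps(num_train_steps, num_sample_steps):
    """Evenly strided, descending subset of the training timesteps."""
    stride = num_train_steps // num_sample_steps
    return list(range(num_train_steps - 1, -1, -stride))
```

For instance, `strided_timesteps(1000, 10)` visits 10 timesteps instead of 1000, which corresponds to 100 times fewer denoiser evaluations per sample. Cheaper sampling is what makes it practical to draw many trajectories per scene and then fit a mixture model to them, as the abstract describes.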

[109] Causal Decoding for Hallucination-Resistant Multimodal Large Language Models cs.LG | cs.AI | cs.CVPDF

Shiwei Tan, Hengyi Wang, Weiyi Qin, Qi Xu, Zhigang Hua

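TL;DR: This paper proposes a causal decoding framework to address object hallucination in multimodal large language models (MLLMs), that is, describing objects that are not present in the image. The method applies targeted causal interventions during generation, reshaping the decoding dynamics to weaken spurious dependencies, which lowers hallucination rates while maintaining output quality.

Details

Motivation: Existing methods (heuristic penalties, post-hoc correction, or generic decoding tweaks) do not directly intervene in the mechanisms that trigger object hallucination, so their gains are limited. The paper aims to design a framework that intervenes directly in the decoding process.

Result: On captioning and QA benchmarks, the framework substantially lowers object hallucination rates and achieves state-of-the-art faithfulness while maintaining overall output quality.

Insight: The novelty is applying causal intervention directly to the decoding steps during generation, using targeted interventions to cut spurious dependencies rather than relying on post-processing or generic adjustments. This offers a mechanism-level approach to reducing hallucination in MLLMs.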

Abstract: Multimodal Large Language Models (MLLMs) deliver detailed responses on vision-language tasks, yet remain susceptible to object hallucination (introducing objects not present in the image), undermining reliability in practice. Prior efforts often rely on heuristic penalties, post-hoc correction, or generic decoding tweaks, which do not directly intervene in the mechanisms that trigger object hallucination and thus yield limited gains. To address this challenge, we propose a causal decoding framework that applies targeted causal interventions during generation to curb spurious object mentions. By reshaping the decoding dynamics to attenuate spurious dependencies, our approach reduces false object tokens while maintaining descriptive quality. Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.

[110] Breaking Semantic-Aware Watermarks via LLM-Guided Coherence-Preserving Semantic Injection cs.LG | cs.CR | cs.CVPDF

Zheng Gao, Xiaoyu Li, Zhicheng Bao, Xiaoyan Feng, Jiaojiao Jiang

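TL;DR: This paper proposes CSI (Coherence-Preserving Semantic Injection), an attack that uses LLM-guided semantic manipulation to perturb the local semantics tied to a semantic watermark while preserving the image's global semantic coherence, thereby breaking current content-aware semantic watermarking schemes. Experiments show that CSI reliably causes watermark detectors to misclassify, revealing a fundamental security weakness of existing semantic watermark designs against LLM-driven semantic perturbations.

Details

Motivation: Generated images circulate widely on social media and in online copyright distribution, and semantic watermarks are integrated into diffusion models for provenance tracking and forgery prevention. Traditional noise-layer watermarks are vulnerable to inversion attacks, while newer content-aware semantic watermarks bind the watermark signal to high-level image semantics to resist local edits. However, large language models (LLMs) have structured reasoning capabilities that enable locally fine-grained but globally coherent semantic changes, which can break this binding. The paper aims to expose this overlooked vulnerability.

Result: Extensive empirical results show that CSI consistently outperforms existing attack baselines against content-aware semantic watermarking and successfully induces detector misclassification.

Insight: The novelty is the first use of LLM semantic reasoning to carry out coherence-preserving semantic injection under embedding-space similarity constraints, bypassing the binding mechanism of semantic watermarks. This exposes a key security flaw: even when a watermark is bound to high-level semantics, LLM-guided local perturbations that keep global consistency can still invalidate it, which is an important lesson for designing more robust semantic watermarks.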

Abstract: Generative images have proliferated on Web platforms in social media and online copyright distribution scenarios, and semantic watermarking has increasingly been integrated into diffusion models to support reliable provenance tracking and forgery prevention for web content. Traditional noise-layer-based watermarking, however, remains vulnerable to inversion attacks that can recover embedded signals. To mitigate this, recent content-aware semantic watermarking schemes bind watermark signals to high-level image semantics, constraining local edits that would otherwise disrupt global coherence. Yet, large language models (LLMs) possess structured reasoning capabilities that enable targeted exploration of semantic spaces, allowing locally fine-grained but globally coherent semantic alterations that invalidate such bindings. To expose this overlooked vulnerability, we introduce a Coherence-Preserving Semantic Injection (CSI) attack that leverages LLM-guided semantic manipulation under embedding-space similarity constraints. This alignment enforces visual-semantic consistency while selectively perturbing watermark-relevant semantics, ultimately inducing detector misclassification. Extensive empirical results show that CSI consistently outperforms prevailing attack baselines against content-aware semantic watermarking, revealing a fundamental security weakness of current semantic watermark designs when confronted with LLM-driven semantic perturbations.

cs.RO [Back]

[111] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies cs.RO | cs.AI | cs.CV | cs.LG | eess.SYPDF

Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding

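TL;DR: This paper proposes LiLo-VLA (Linked Local VLA), a modular framework for long-horizon manipulation. It decouples each task into a global motion module and an object interaction module, uses an object-centric vision-language-action model to process the objects of interest, generalizes zero-shot to unseen long-horizon tasks, and achieves high success rates in both simulation and real-world tasks.

Details

Motivation: General-purpose robots struggle with long-horizon manipulation in unstructured environments, where tasks involve multiple kinematic structure changes (such as attaching or detaching objects). Existing end-to-end vision-language-action models have difficulty with the combinatorial complexity of sequencing skills and are prone to cascading failures caused by environmental sensitivity.

Result: On the proposed 21-task simulation benchmark (LIBERO-Long++ and Ultra-Long), the average success rate is 69%, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Across 8 real-world long-horizon tasks, the average success rate is 85%.

Insight: The core novelty is the modular design that decouples transport from interaction, handled by a global motion module and an object-centric interaction module respectively. This improves robustness to irrelevant visual features and invariance to spatial configurations, and supports dynamic replanning and skill reuse for robust failure recovery, mitigating the cascading errors of end-to-end approaches.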

Abstract: General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.

Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li

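TL;DR: This paper proposes a Self-Correcting Vision-Language-Action (SC-VLA) model that performs online action refinement through sparse world imagination to strengthen the model's understanding of physical dynamics. It adds auxiliary predictive heads that forecast task progress and future trajectory trends, and introduces an online action refinement module that reshapes progress-dependent dense rewards, achieving state-of-the-art performance on robot manipulation tasks.

Details

Motivation: Standard VLA models rely on statistical data priors and have limited understanding of physical dynamics; reinforcement learning needs external reward signals that are isolated from the agent's internal states; world action models lack explicit mechanisms for self-improvement. The paper aims to achieve self-improvement by using sparse imagination to intrinsically guide action refinement.

Result: On challenging robot manipulation tasks in simulation benchmarks and real-world settings, SC-VLA achieves state-of-the-art performance, with 16% fewer steps and a 9% higher success rate than the best-performing baselines, and a 14% gain in real-world experiments.

Insight: The contributions are the sparse world imagination design (auxiliary predictive heads that encode short-term physical evolution) and the online action refinement module (which adjusts trajectory orientation based on the predicted sparse future states). These mechanisms let the model correct itself and improve its physical grounding.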

Abstract: Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent’s internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at https://github.com/Kisaragi0/SC-VLA.

[113] Dream-SLAM: Dreaming the Unseen for Active SLAM in Dynamic Environments cs.RO | cs.CVPDF

Xiangqi Meng, Pengxu Hou, Zhenjun Zhao, Javier Civera, Daniel Cremers

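TL;DR: This paper proposes Dream-SLAM, a monocular active SLAM method that addresses three limitations of active SLAM in dynamic environments: the restrictions of the underlying SLAM module, shortsighted motion planning, and poor handling of dynamic scenes. It generates cross-spatio-temporal images and semantically plausible scene structures, fuses them with real observations to improve localization and mapping accuracy, and supports long-horizon planning for efficient exploration.

Details

Motivation: Existing active SLAM methods are limited by their underlying SLAM modules, by motion planning that lacks long-term vision, and by difficulty with dynamic scenes. The paper aims to overcome these limits by generating cross-spatio-temporal images and semantic structures, improving active SLAM in dynamic environments.

Result: Extensive experiments on public and self-collected datasets show that Dream-SLAM outperforms state-of-the-art methods in localization accuracy, mapping quality, and exploration efficiency.

Insight: The novelty is using a generative model to "dream" unobserved cross-spatio-temporal images and semantically plausible scene structures and fusing them with real observations, which strengthens SLAM robustness in dynamic environments and makes planning more farsighted. This offers a new approach to data augmentation and scene understanding for SLAM with dynamic, incomplete observations.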

Abstract: Beyond the core tasks of simultaneous localization and mapping (SLAM), active SLAM involves generating robot actions that enable effective and efficient exploration of unknown environments. However, existing active SLAM pipelines are limited by three main factors. First, they inherit the restrictions of the underlying SLAM modules that they may be using. Second, their motion planning strategies are typically shortsighted and lack long-term vision. Third, most approaches struggle to handle dynamic scenes. To address these limitations, we propose a novel monocular active SLAM method, Dream-SLAM, which is based on dreaming cross-spatio-temporal images and semantically plausible structures of partially observed dynamic environments. The generated cross-spatio-temporal images are fused with real observations to mitigate noise and data incompleteness, leading to more accurate camera pose estimation and a more coherent 3D scene representation. Furthermore, we integrate dreamed and observed scene structures to enable long-horizon planning, producing farsighted trajectories that promote efficient and thorough exploration. Extensive experiments on both public and self-collected datasets demonstrate that Dream-SLAM outperforms state-of-the-art methods in localization accuracy, mapping quality, and exploration efficiency. Source code will be publicly available upon paper acceptance.

[114] World Guidance: World Modeling in Condition Space for Action Generation cs.RO | cs.CVPDF

Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang

TL;DR: 本文提出了World Guidance (WoG)框架，旨在通过将未来观测映射到紧凑的条件空间来增强视觉-语言-动作（VLA）模型的动作生成能力。该方法在动作推理流程中注入这些条件，并训练VLA模型同时预测压缩条件和未来动作，从而在条件空间内实现有效的世界建模，以指导细粒度的动作生成。

Details

Motivation: 现有方法难以在保持高效、可预测的未来表示与保留足够细粒度信息以指导精确动作生成之间取得平衡，因此本文提出WoG来解决这一局限。

Result: 在仿真和真实环境的大量实验表明，该方法显著优于基于未来预测的现有方法，并展现出优异的泛化能力，且能有效从大量人类操作视频中学习。

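Insight: The novelty lies in compressing future observations into a condition space for modeling and prediction. This facilitates fine-grained action generation, and the compact representation makes world modeling efficient, giving VLA action inference a new guidance mechanism.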

Abstract: Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions by injecting them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities. Moreover, it learns effectively from substantial human manipulation videos. Extensive experiments across both simulation and real-world environments validate that our method significantly outperforms existing methods based on future prediction. Project page is available at: https://selen-suyue.github.io/WoGNet/
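
Illustrative sketch (not the authors' implementation): the abstract describes training the VLA to predict compressed conditions alongside future actions. One way to picture that joint objective is a weighted sum of an action loss and a condition-prediction loss. Everything below is a hypothetical stand-in chosen for illustration: the linear encoder `W_enc`, the linear heads `W_cond` and `W_act`, the dimensions, the mean-squared-error losses, and the `cond_weight` parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, COND_DIM, ACT_DIM = 16, 4, 7  # hypothetical sizes, not from the paper

# Hypothetical frozen encoder: compresses a future observation into a compact condition.
W_enc = rng.normal(size=(OBS_DIM, COND_DIM)) / np.sqrt(OBS_DIM)
# Hypothetical policy heads: predict the condition from the current observation,
# then predict the action from the observation plus the predicted condition.
W_cond = rng.normal(size=(OBS_DIM, COND_DIM)) * 0.1
W_act = rng.normal(size=(OBS_DIM + COND_DIM, ACT_DIM)) * 0.1


def joint_loss(obs_now, obs_future, action_target, cond_weight=1.0):
    """Return (total, action_loss, cond_loss) for one batch."""
    cond_target = obs_future @ W_enc            # compressed future observation
    cond_pred = obs_now @ W_cond                # model's guess at that condition
    act_in = np.concatenate([obs_now, cond_pred], axis=-1)
    action_pred = act_in @ W_act                # action conditioned on the guess
    cond_loss = np.mean((cond_pred - cond_target) ** 2)
    action_loss = np.mean((action_pred - action_target) ** 2)
    return action_loss + cond_weight * cond_loss, action_loss, cond_loss


# Toy batch of 5 random samples.
obs_now = rng.normal(size=(5, OBS_DIM))
obs_future = rng.normal(size=(5, OBS_DIM))
action_target = rng.normal(size=(5, ACT_DIM))
total, action_loss, cond_loss = joint_loss(obs_now, obs_future, action_target)
print(total, action_loss, cond_loss)
```

In this toy setup the predicted condition has only `COND_DIM` entries, which is the sense in which the future is modeled in a compact space rather than at full observation resolution.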

eess.IV

[115] Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels eess.IV | cs.CV

Dhruv Verma, Andrew Qiu, Roberto Rangel, Ayandev Barman, Hao Yang

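TL;DR: Lumosaic is a compact active hyperspectral video system that combines a narrowband LED array with a coded-exposure-pixel (CEP) camera to capture hyperspectral video of dynamic scenes in real time. By actively synchronizing illumination with per-pixel exposure, it jointly encodes scene information across space, time, and wavelength, and a learning-based reconstruction pipeline recovers 31-channel hyperspectral video at 30 fps and VGA resolution.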

Details

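Motivation: Existing passive snapshot hyperspectral imaging systems suffer from low photon utilization and degraded spectral fidelity on dynamic scenes because they split light across channels and assume no motion; the goal is real-time hyperspectral video capture with high fidelity and high temporal consistency.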

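Result: Experiments on synthetic and real data show that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyperspectral imaging systems, enabling robust hyperspectral video across diverse materials and motion conditions.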

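Insight: The novelty lies in the hardware co-design of active illumination with per-pixel coded-exposure control, which jointly encodes information across the spatial, temporal, and spectral dimensions; combined with learned reconstruction, this improves spectral accuracy and temporal consistency on dynamic scenes. It is an example of hardware-algorithm co-design.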

Abstract: We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed, per-pixel exposure control, enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame’s exposure, Lumosaic actively synchronizes illumination and pixel-wise exposure, improving photon utilization and preserving spectral fidelity under motion. A learning-based reconstruction pipeline then recovers 31-channel hyperspectral (400-700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyperspectral imaging systems, enabling robust hyperspectral video across diverse materials and motion conditions.
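
Illustrative sketch (not the authors' forward model): the abstract describes jointly encoding space, time, and wavelength in each frame by synchronizing narrowband LEDs with per-pixel exposure. A toy way to picture this is a frame built from sub-exposures, one per LED, where a binary per-pixel code decides which sub-exposures each pixel integrates. Only the 31 channels over 400-700 nm come from the abstract. The sensor size, LED count, Gaussian LED spectra, random binary code, and static scene are assumptions made for illustration, and the sensor's spectral response is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4      # tiny hypothetical sensor
N_BANDS = 31     # from the abstract: 31 channels over 400-700 nm
N_LEDS = 8       # assumed number of narrowband LEDs

# 31 evenly spaced samples from 400 to 700 nm gives 10 nm spacing.
wavelengths = np.linspace(400.0, 700.0, N_BANDS)

# Assumed Gaussian emission spectrum per LED, shape (N_LEDS, N_BANDS).
centers = np.linspace(420.0, 680.0, N_LEDS)
led_spectra = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / 10.0) ** 2)

# Toy static scene reflectance, shape (H, W, N_BANDS).
scene = rng.uniform(0.0, 1.0, size=(H, W, N_BANDS))

# Binary per-pixel exposure code: code[k, i, j] == 1 means pixel (i, j)
# integrates light during the sub-exposure in which LED k is lit.
code = rng.integers(0, 2, size=(N_LEDS, H, W))


def coded_frame(scene, led_spectra, code):
    """Sum the sub-exposures each pixel was open for into one coded frame."""
    # Light returned to each pixel under each LED, shape (N_LEDS, H, W).
    per_led = np.einsum("hwb,kb->khw", scene, led_spectra)
    return (code * per_led).sum(axis=0)


frame = coded_frame(scene, led_spectra, code)
print(frame.shape)
```

A reconstruction method would then have to invert this many-to-one measurement, which is the role the abstract assigns to the learning-based pipeline.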

Table of Contents

cs.CL

[1] Reasoning-Based Personalized Generation for Users with Sparse Data cs.CL | cs.AI

[2] Field-Theoretic Memory for AI Agents: Continuous Dynamics for Context Preservation cs.CL | cs.AI | cs.LG

[3] Task-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases cs.CL | cs.AI | cs.LG

[4] Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal cs.CL | cs.AI | cs.LG

[5] IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions cs.CL | cs.AI

[6] ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following cs.CL | cs.AI

[7] TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents cs.CL

[8] ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning cs.CL | cs.LG | cs.SE

[9] Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment cs.CL | cs.AI

[10] VecGlypher: Unified Vector Glyph Generation with Language Models cs.CL

[11] Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment cs.CL | cs.AI | cs.IR

[12] When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning cs.CL

[13] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning cs.CL

[14] Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion cs.CL

[15] Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning cs.CL | cs.AI

[16] Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling cs.CL

[17] Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLMs cs.CL

[18] D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models cs.CL

[19] FewMMBench: A Benchmark for Multimodal Few-Shot Learning cs.CL

[20] ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection cs.CL

[21] MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents cs.CL

[22] Large Language Models are Algorithmically Blind cs.CL

[23] MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models cs.CL

[24] RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning cs.CL

[25] Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models cs.CL | cs.AI

[26] IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages cs.CL

[27] DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs cs.CL

[28] Improving Parametric Knowledge Access in Reasoning Language Models cs.CL

[29] SumTablets: A Transliteration Dataset of Sumerian Tablets cs.CL

cs.CV

[30] HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles cs.CV

[31] Towards Controllable Video Synthesis of Routine and Rare OR Events cs.CV | cs.AI | cs.LG | eess.IV

[32] Momentum Memory for Knowledge Distillation in Computational Pathology cs.CV

[33] MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation cs.CV | cs.LG

[34] Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation cs.CV

[35] WildSVG: Towards Reliable SVG Generation Under Real-Word Conditions cs.CV

[36] ECHOSAT: Estimating Canopy Height Over Space And Time cs.CV | cs.AI | cs.LG

[37] PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models cs.CV | cs.LG

[38] Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking cs.CV

[39] See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs cs.CV

[40] Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning cs.CV

[41] IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model cs.CV

[42] Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction cs.CV

[43] MultiAnimate: Pose-Guided Image Animation Made Extensible cs.CV

[44] SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction cs.CV

[45] A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers cs.CV

[46] Virtual Biopsy for Intracranial Tumors Diagnosis on MRI cs.CV | cs.AI

[47] Tokenizing Semantic Segmentation with RLE cs.CV

[48] UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling cs.CV

[49] Axial-Centric Cross-Plane Attention for 3D Medical Image Classification cs.CV

[50] Lie Flow: Video Dynamic Fields Modeling and Predicting with Lie Algebra as Geometric Physics Principle cs.CV

[51] CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning cs.CV | cs.AI

[52] Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis cs.CV | cs.AI

[53] Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping cs.CV | cs.GR

[54] SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR cs.CV

[55] Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models cs.CV | cs.AI

[56] SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video cs.CV | cs.AI

[57] Innovative Tooth Segmentation Using Hierarchical Features and Bidirectional Sequence Modeling cs.CV

[58] TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection cs.CV

[59] SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning cs.CV

[60] Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization cs.CV

[61] Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling cs.CV

[62] From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors cs.CV

[63] Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models cs.CV | cs.AI

[64] XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression cs.CV

[65] SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model cs.CV

[66] SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance cs.CV | cs.AI

[67] Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps cs.CV

[68] UniVBench: Towards Unified Evaluation for Video Foundation Models cs.CV

[69] Understanding Annotation Error Propagation and Learning an Adaptive Policy for Expert Intervention in Barrett’s Video Segmentation cs.CV | cs.AI

[70] DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs cs.CV | cs.AI | cs.CL | cs.GR

[71] GFPL: Generative Federated Prototype Learning for Resource-Constrained and Data-Imbalanced Vision Task cs.CV | cs.LG

[72] How to Take a Memorable Picture? Empowering Users with Actionable Feedback cs.CV

[73] UNet-Based Keypoint Regression for 3D Cone Localization in Autonomous Racing cs.CV | cs.RO

[74] TIRAuxCloud: A Thermal Infrared Dataset for Day and Night Cloud Detection cs.CV

[75] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors cs.CV | cs.AI | cs.CL

[76] Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration cs.CV

[77] Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context cs.CV