cs.CL [Total: 38]
cs.CV [Total: 60]
eess.AS [Total: 1]
cs.IR [Total: 3]
eess.IV [Total: 3]
cs.CE [Total: 1]
cs.HC [Total: 2]
cs.LG [Total: 4]
cs.MA [Total: 1]
hep-ex [Total: 1]
cs.RO [Total: 2]
cs.CR [Total: 4]
physics.optics [Total: 1]
cs.AI [Total: 4]

cs.CL

[1] The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs? cs.CL | cs.AI | cs.IT | cs.LG

Mar Gonzàlez I Català, Haitz Sáez de Ocáriz Borde, George D. Montañez, Pietro Liò

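TL;DR: This paper proposes the Stepwise Informativeness Assumption (SIA) to explain why internal entropy dynamics in large language models correlate robustly with external answer correctness. The assumption states that, during generation, a model progressively accumulates information about the true answer through answer-informative prefixes. The authors formalize SIA from the perspective of maximum-likelihood optimization, derive its observable signatures, and validate it empirically on multiple reasoning benchmarks and open-weight models.

Details

Motivation: To resolve a central open puzzle: why internal entropy dynamics, defined under the model's predictive distribution, correlate so robustly with external correctness defined by the ground-truth answer.

Result: Empirical tests on multiple reasoning benchmarks (GSM8K, ARC, SVAMP) and a range of open-weight LLMs (e.g., Gemma-2 and LLaMA-3.2) show that training induces SIA and that correct reasoning traces exhibit characteristic conditional answer entropy patterns.

Insight: The core contribution is the Stepwise Informativeness Assumption (SIA), which provides a theoretical explanation for the correlation between entropy dynamics and reasoning correctness. The paper argues that this property emerges from maximum-likelihood optimization and standard fine-tuning and reinforcement-learning pipelines, and that it is both derivable and observable.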

Abstract: Recent work uses entropy-based signals at multiple representation levels to study reasoning in large language models, but the field remains largely empirical. A central unresolved puzzle is why internal entropy dynamics, defined under the predictive distribution of a model, correlate so robustly with external correctness given by the ground-truth answer. In this paper, we argue that this correlation arises because autoregressive models reason correctly when they accumulate information about the true answer via answer-informative prefixes. We formalize this intuition via the Stepwise Informativeness Assumption (SIA), which states that reasoning prefixes accumulate answer-relevant information in expectation as generation progresses. We show that SIA naturally emerges from maximum-likelihood optimization on human reasoning traces and is reinforced by standard fine-tuning and reinforcement-learning pipelines. We then derive observable signatures of SIA linking conditional answer entropy dynamics to correctness. We empirically test SIA across multiple reasoning benchmarks (GSM8K, ARC, SVAMP) and a diverse set of open-weight LLMs (Gemma-2, LLaMA-3.2, Qwen-2.5, DeepSeek and Olmo variants), showing that training induces it and that correct traces exhibit characteristic conditional answer entropy patterns.
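As a rough illustration of what a conditional answer entropy trajectory could look like, the sketch below computes the Shannon entropy of a distribution over candidate answers after each reasoning prefix. The function names and the toy probabilities are assumptions made for illustration; the abstract does not say how the paper estimates these distributions.

```python
import math

def answer_entropy(probs):
    """Shannon entropy (in bits) of a distribution over candidate answers."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_trajectory(per_prefix_probs):
    """Entropy of the answer distribution after each reasoning prefix."""
    return [answer_entropy(p) for p in per_prefix_probs]

# Toy data (hypothetical): the belief over 4 candidate answers sharpens
# as the reasoning prefix grows.
toy = [
    [0.25, 0.25, 0.25, 0.25],
    [0.50, 0.25, 0.125, 0.125],
    [1.0, 0.0, 0.0, 0.0],
]
trajectory = entropy_trajectory(toy)
```

Under SIA as described in the abstract, one would expect such a trajectory to decrease in expectation along correct traces; the toy numbers here are constructed to show that shape, not measured.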

[2] SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams cs.CL | cs.AI | cs.HC

Bufang Yang, Lilin Xu, Yixuan Li, Kaiwei Liu, Xiaofan Jiang

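TL;DR: This paper presents SensorPersona, a system that uses large language models (LLMs) to continuously infer stable user personas from multimodal sensor streams collected unobtrusively over long periods from users' mobile devices. Using person-oriented context encoding, hierarchical persona reasoning, and clustering-aware incremental verification, the system extracts personas along several dimensions: physical patterns, psychosocial traits, and life experiences.

Details

Motivation: Existing LLM-based agents mainly infer user personas from chat histories, which capture only self-disclosed information and do not reflect users' everyday behavior in the physical world, limiting how comprehensive the resulting personas can be. This work aims to use mobile sensor data to infer more comprehensive personas unobtrusively and continuously.

Result: Evaluated on a self-collected dataset of 1,580 hours of sensor data from 20 participants. SensorPersona achieves up to 31.4% higher recall in persona extraction than state-of-the-art (SOTA) baselines, an 85.7% win rate in persona-aware agent responses, and notable improvements in user satisfaction.

Insight: The novelty lies in combining LLMs with continuous, multimodal mobile sensor streams. Through hierarchical reasoning (intra- and inter-episode) and an incremental verification and updating mechanism, the system unobtrusively infers multi-dimensional, stable, yet evolving personas from physical behavior, giving personalized agents a richer and more realistic data foundation.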

Abstract: Personalization is essential for Large Language Model (LLM)-based agents to adapt to users’ preferences and improve response quality and task performance. However, most existing approaches infer personas from chat histories, which capture only self-disclosed information rather than users’ everyday behaviors in the physical world, limiting the ability to infer comprehensive user personas. In this work, we introduce SensorPersona, an LLM-empowered system that continuously infers stable user personas from multimodal longitudinal sensor streams unobtrusively collected from users’ mobile devices. SensorPersona first performs person-oriented context encoding on continuous sensor streams to enrich the semantics of sensor contexts. It then employs hierarchical persona reasoning that integrates intra- and inter-episode reasoning to infer personas spanning physical patterns, psychosocial traits, and life experiences. Finally, it employs clustering-aware incremental verification and temporal evidence-aware updating to adapt to evolving personas. We evaluate SensorPersona on a self-collected dataset containing 1,580 hours of sensor data from 20 participants, collected over up to 3 months across 17 cities on 3 continents. Results show that SensorPersona achieves up to 31.4% higher recall in persona extraction, an 85.7% win rate in persona-aware agent responses, and notable improvements in user satisfaction compared to state-of-the-art baselines.

[3] Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation cs.CL | cs.AI

Shutong Zhang, Dylan Zhou, Yinxiao Liu, Yang Yang, Huiwen Luo

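TL;DR: This paper proposes Tool-MCoT, a small language model (SLM) for content safety moderation. It is fine-tuned on tool-augmented chain-of-thought data generated by a large language model (LLM), so that the small model learns to use external tools effectively to improve its reasoning and decision-making, balancing accuracy against inference efficiency.

Details

Motivation: The growth of online platforms and user content calls for strong content moderation systems, but the computational cost and latency of LLMs make them hard to deploy at scale. This work aims to develop an efficient, scalable solution based on a small language model.

Result: Experiments show that the fine-tuned SLM achieves significant performance gains and learns to use tools selectively, calling them only when necessary, which balances moderation accuracy against inference efficiency.

Insight: The novelty lies in applying the tool-augmented chain-of-thought paradigm to SLM fine-tuning, so that the small model learns to imitate a large model's complex tool use and reasoning. This yields performance close to the large model's at lower cost, along with the ability to choose tools dynamically for efficiency.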

Abstract: The growth of online platforms and user content requires strong content moderation systems that can handle complex inputs from various media types. While large language models (LLMs) are effective, their high computational cost and latency present significant challenges for scalable deployment. To address this, we introduce Tool-MCoT, a small language model (SLM) fine-tuned for content safety moderation leveraging an external framework. By training our model on tool-augmented chain-of-thought data generated by an LLM, we demonstrate that the SLM can learn to effectively utilize these tools to improve its reasoning and decision-making. Our experiments show that the fine-tuned SLM achieves significant performance gains. Furthermore, we show that the model can learn to use these tools selectively, achieving a balance between moderation accuracy and inference efficiency by calling tools only when necessary.

[4] STDec: Spatio-Temporal Stability Guided Decoding for dLLMs cs.CL

Yuzhe Chen, Jiale Cao, Xuyang Liu, Jin Xie, Aiping Yang

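TL;DR: This paper proposes STDec, a spatio-temporal stability guided decoding method for diffusion large language models (dLLMs). Through spatial-aware and temporal-aware decoding, it dynamically adjusts decoding thresholds to exploit spatio-temporal stability during decoding, substantially speeding up inference without sacrificing task performance.

Details

Motivation: Most current dLLM decoders use a global confidence threshold and do not explicitly model local context from neighboring decoded states or the temporal consistency of predicted token IDs across steps, which limits decoding efficiency and performance.

Result: On textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task scores. Notably, on MBPP with LLaDA, it achieves up to a 14.17x speedup with a comparable score.

Insight: The novelty lies in observing strong spatio-temporal stability in dLLM decoding (newly decoded tokens tend to lie near already-decoded neighbors, and predicted IDs stay consistent across several denoising steps) and, based on this, designing a training-free dynamic-threshold decoding strategy that is compatible with cache-based acceleration. The idea may transfer to improving decoding efficiency in other sequence generation models.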

Abstract: Diffusion Large Language Models (dLLMs) have progressed rapidly and are viewed as a promising alternative to the autoregressive paradigm. However, most dLLM decoders still adopt a global confidence threshold, and do not explicitly model local context from neighboring decoded states or temporal consistency of predicted token IDs across steps. To address this issue, we propose a simple spatio-temporal stability guided decoding approach, named STDec. We observe strong spatio-temporal stability in dLLM decoding: newly decoded tokens tend to lie near decoded neighbors, and their predicted IDs often remain consistent across several denoising steps. Inspired by this stability, our STDec includes spatial-aware decoding and temporal-aware decoding. The spatial-aware decoding dynamically generates the token-adaptive threshold by aggregating the decoded states of nearby tokens. The temporal-aware decoding relaxes the decoding thresholds for tokens whose predicted token IDs remain consistent over denoising steps. Our STDec is training-free and remains compatible with cache-based acceleration methods. Across textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task performance scores. Notably, on MBPP with LLaDA, STDec achieves up to 14.17x speedup with a comparable score. Homepage: https://yzchen02.github.io/STDec.
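The two threshold adjustments described in the abstract can be pictured with a toy function. This is only a sketch under stated assumptions: the linear form, the window size, and every parameter name (`base`, `spatial_relax`, `temporal_relax`, `stable_steps`) are hypothetical and are not taken from the paper.

```python
def adaptive_threshold(decoded, pos, base=0.9, window=2, spatial_relax=0.2,
                       stable_steps=0, temporal_relax=0.05):
    """Toy token-adaptive confidence threshold for one masked position.

    decoded: list of bools, True where a token is already decoded.
    stable_steps: number of consecutive denoising steps in which the
    predicted token ID at `pos` did not change.
    """
    lo, hi = max(0, pos - window), min(len(decoded), pos + window + 1)
    neighbors = [decoded[i] for i in range(lo, hi) if i != pos]
    frac = sum(neighbors) / len(neighbors) if neighbors else 0.0
    # Spatial term (assumed linear): more decoded neighbors, lower threshold.
    tau = base - spatial_relax * frac
    # Temporal term (assumed linear): relax further while the prediction is stable.
    tau -= temporal_relax * stable_steps
    return max(tau, 0.0)

# A position surrounded by decoded tokens gets a lower threshold than an isolated one.
surrounded = adaptive_threshold([True, True, False, True, True], pos=2)
isolated = adaptive_threshold([False, False, False, False, False], pos=2)
stable = adaptive_threshold([False] * 5, pos=2, stable_steps=2)
```

In a decoder, a masked token would be committed once its confidence exceeds the returned threshold; how STDec actually aggregates neighbor states is not specified in the abstract.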

[5] The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models cs.CL | cs.LG

Michael Rizvi-Martel, Guillaume Rabusseau, Marius Mosbach

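TL;DR: This paper studies whether language models exploit superposition in continuous chain-of-thought reasoning. Analyzing three training regimes, it finds that only models trained from scratch show signs of using superposition; in the training-free and fine-tuned regimes, superposition collapses or goes unused, and models tend to find shortcut solutions.

Details

Motivation: To examine whether language models actually use superposition (maintaining multiple candidate solutions simultaneously within a single representation) in continuous chain-of-thought reasoning, in order to test theoretical hypotheses and analyze the conditions under which it applies in practice.

Result: Analysis of internal representations with Logit Lens and entity-level probing shows signs of superposition use in the from-scratch regime, whereas in the training-free and fine-tuned regimes superposition collapses or goes unused and models prefer shortcut solutions.

Insight: The novelty lies in systematically analyzing how superposition behaves under different training regimes. The paper shows that pretraining on natural language data biases models to commit to a specific token in the last layers and that capacity affects which solutions a model favors, giving a unified explanation of when superposition arises and collapses in continuous chain-of-thought reasoning.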

Abstract: Latent reasoning via continuous chain-of-thought (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating in continuous space increases expressivity and has been hypothesized to enable superposition: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning using latent CoTs. We investigate this question across three regimes: a training-free regime that constructs latent thoughts as convex combinations of token embeddings, a fine-tuned regime where a base model is adapted to produce latent thoughts, and a from-scratch regime where a model is trained entirely with latent thoughts to solve a given task. Using Logit Lens and entity-level probing to analyze internal representations, we find that only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: (i) pretraining on natural language data biases models to commit to a token in the last layers, and (ii) capacity has a huge effect on which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.

[6] Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning cs.CL

Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao

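TL;DR: This paper proposes a multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to improve the pedagogical knowledge of large language models (LLMs). The strategy comprises: (1) RL optimization with progressive difficulty training, a focus on challenging examples, and extended reasoning rollouts; (2) an SFT phase that uses the RL-trained model to synthesize high-quality training data via difficulty-weighted sampling; and (3) an optional second round of RL optimization. The EduQwen model family, built on a Qwen3-32B backbone, sets new SOTA results on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark, surpassing larger proprietary systems such as Gemini-3 Pro.

Details

Motivation: To address the limited pedagogical knowledge of open-source LLMs. Through domain-specialized optimization, the work aims to turn mid-sized open-source models into genuine pedagogical domain experts that outperform larger general-purpose systems while meeting the transparency, customizability, and cost-efficiency requirements of educational AI deployment.

Result: On the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark and the interactive Pedagogy Benchmark Leaderboard, EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 achieve new SOTA results, surpassing significantly larger proprietary systems such as the previous benchmark leader, Gemini-3 Pro.

Insight: The contributions include: (1) a multi-stage optimization strategy combining RL and SFT, notably progressive difficulty training and extended reasoning in the RL stage; (2) a difficulty-weighted sampling method that uses the RL model to synthesize high-quality SFT data; and (3) evidence that domain-specialized optimization lets mid-sized open-source LLMs outperform larger general-purpose models on specific tasks, offering an efficient path to responsible educational AI deployment.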

Abstract: We present an innovative multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model EduQwen 32B-SFT-RL2: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly larger proprietary systems such as the previous benchmark leader Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.

[7] State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation cs.CL

Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi

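TL;DR: This paper introduces Arabic-DeepSeek-R1, an application-driven open-source Arabic large language model. It uses a sparse mixture-of-experts (MoE) architecture and a four-phase chain-of-thought (CoT) distillation scheme that integrates Arabic-specific linguistic verification and regional ethical norms. The model achieves the highest average score across the seven benchmarks of the Open Arabic LLM Leaderboard (OALL) and reaches SOTA or near-SOTA on multiple tasks, including substantially surpassing both GPT-5.1 and the OALL leader on the grammar-focused MadinahQA.

Details

Motivation: To address the digital equity gap that under-represented languages such as Arabic face in the LLM ecosystem, improving performance through specialized adaptation rather than large-scale pretraining.

Result: The model achieves the highest average score across OALL's seven benchmarks, reaches SOTA or near-SOTA on tasks such as MadinahQA, AraTrust, AlGhafa, and ALRAGE, and outperforms GPT-5.1 on the majority of benchmarks.

Insight: Combining a sparse MoE architecture, culturally informed CoT distillation (with explicit Arabic linguistic checks), and strategic bilingual data curation can deliver SOTA performance in a parameter-efficient way, providing a replicable, cost-effective adaptation framework for low-resource languages.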

Abstract: This paper introduces Arabic-DeepSeek-R1, an application-driven open-source Arabic LLM that leverages a sparse MoE backbone to address the digital equity gap for under-represented languages, and establishes a new SOTA across the entire Open Arabic LLM Leaderboard (OALL). Our four-phase CoT distillation scheme integrates Arabic-specific linguistic verification and regional ethical norms into a 372M-token, contamination-controlled 80/20 Arabic-English training mixture. Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by substantial margins), safety-oriented AraTrust, multi-ability AlGhafa, and retrieval-augmented ALRAGE. Our results indicate that the combination of sparse MoE architecture, culturally-informed CoT distillation with explicit Arabic linguistic checks, and strategic bilingual data curation enables an open-source adapted model to systematically outperform the proprietary frontier system GPT-5.1 on the majority of benchmarks evaluating comprehensive language-specific tasks: the first such demonstration for Arabic LLMs. These findings indicate that much of Arabic’s performance deficit in current LLM ecosystems stems from under-specialization rather than architectural limitations, and that parameter-efficient adaptation of open reasoning models can yield breakthrough SOTA performance without industrial-scale pretraining costs. Arabic-DeepSeek-R1 establishes a validated and replicable framework for sovereign and domain-specific language technologies, demonstrating that strategic, culturally-grounded adaptation of sparse MoE backbones offers a viable and cost-effective pathway to achieving record-breaking performance across standardized benchmarks for low-resource languages.

[8] When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don’t cs.CL | cs.AI | cs.CV

Jonathan Nemitz, Carsten Eickhoff, Junyi Jessy Li, Kyle Mahowald, Michal Golovanevsky

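TL;DR: This paper introduces the Graded Color Attribution (GCA) dataset to study whether vision-language models (VLMs) and humans follow their own introspective rules when attributing colors. It finds that VLMs (e.g., GPT-5-mini) systematically violate their own stated introspective rules, whereas human participants remain faithful to theirs, with apparent violations explained by a tendency to overestimate color coverage.

Details

Motivation: Understanding when VLMs will behave unexpectedly, whether models can reliably predict their own behavior, and whether they adhere to their introspective reasoning are central challenges for trustworthy deployment.

Result: On the GCA benchmark, GPT-5-mini violates its stated introspective rules in nearly 60% of cases on objects with strong color priors, while human participants remain faithful to their rules. VLMs estimate color coverage very well yet clearly contradict their own reasoning in their final responses.

Insight: The novelty lies in GCA, a controlled benchmark that quantifies how faithfully models and humans follow their introspective rules. It shows that VLM introspective self-knowledge is miscalibrated, that reasoning failures are not driven by task difficulty, and that world-knowledge priors systematically degrade model faithfulness in ways that differ from human cognition.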

Abstract: Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.
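The faithfulness check described in the abstract, comparing a stated coverage threshold with later labeling decisions, can be sketched as follows. The function names, the `>=` comparison, and the trial data are illustrative assumptions, not the paper's evaluation code.

```python
def violates_rule(threshold_pct, coverage_pct, labeled_as_color):
    """True when a label contradicts the participant's own stated threshold."""
    rule_says_color = coverage_pct >= threshold_pct
    return labeled_as_color != rule_says_color

def violation_rate(threshold_pct, trials):
    """trials: list of (coverage_pct, labeled_as_color) pairs."""
    flags = [violates_rule(threshold_pct, c, lab) for c, lab in trials]
    return sum(flags) / len(flags)

# Hypothetical trials for a stated rule "call it red at >= 50% red pixels".
# The first two trials label a mostly non-red object as red anyway.
trials = [(10, True), (30, True), (60, True), (80, True), (20, False)]
rate = violation_rate(50, trials)
```

With these made-up trials the rate is 0.4, since two of the five labels contradict the stated rule.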

[9] Multi-objective Evolutionary Merging Enables Efficient Reasoning Models cs.CL | cs.AI

Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli

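TL;DR: This paper proposes Evo-L2S, a framework for the Long-to-Short (L2S) problem of reducing the output length of reasoning models while preserving high accuracy. It formulates L2S reasoning as a multi-objective optimization problem and uses evolutionary model merging to trade off accuracy against output length, producing a robust Pareto front of merged models. To reduce the computational overhead for large language models, it also proposes an entropy-based subset sampling technique.

Details

Motivation: Reasoning models can solve complex problems with long chains of thought, but at substantial inference-time cost. Existing training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are brittle and force suboptimal compromises, so they cannot effectively balance accuracy and output length.

Result: Comprehensive experiments at the 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks show that Evo-L2S reduces the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.

Insight: The novelty lies in formalizing L2S reasoning as multi-objective optimization and applying evolutionary model merging to explicitly optimize the accuracy-length trade-off, yielding a set of Pareto-optimal models. The entropy-based subset sampling technique reduces the cost of fitness evaluation for large models and makes the evolutionary search computationally tractable, offering an efficient and scalable approach to model compression and acceleration.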

Abstract: Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.
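To make the notion of a Pareto front over accuracy and output length concrete, the sketch below filters a set of candidate models down to the non-dominated ones. It illustrates only the selection criterion, not the evolutionary merging search itself; the candidate names and numbers are hypothetical.

```python
def pareto_front(models):
    """Return models not dominated on (higher accuracy, fewer tokens).

    models: list of (name, accuracy, avg_tokens) tuples.
    """
    def dominates(a, b):
        # a dominates b if it is no worse on both objectives and strictly
        # better on at least one of them.
        no_worse = a[1] >= b[1] and a[2] <= b[2]
        better = a[1] > b[1] or a[2] < b[2]
        return no_worse and better

    return [m for m in models if not any(dominates(o, m) for o in models)]

# Hypothetical merged candidates: (name, accuracy, average output tokens).
candidates = [
    ("A", 0.80, 1000),
    ("B", 0.78, 400),
    ("C", 0.75, 600),   # dominated by B: less accurate and longer
    ("D", 0.81, 1200),
]
front = pareto_front(candidates)
front_names = [m[0] for m in front]
```

A scalarized method would pick a single point by fixing a weight between the two objectives up front, whereas a front like this leaves the choice among A, B, and D to the user.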

[10] DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling cs.CL

Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena

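TL;DR: This paper presents DataSTORM, a large language model (LLM)-based agentic system for autonomous deep research over large-scale structured databases and internet sources. It reframes deep research over structured data as a thesis-driven analytical process grounded in the principles of Exploratory Data Analysis and Data Storytelling.

Details

Motivation: Existing LLM-based deep research approaches focus mainly on unstructured web data. The challenges of deep research over large-scale structured databases, which requires iterative hypothesis generation, quantitative reasoning over structured schemas, and a coherent analytical narrative, remain underexplored.

Result: On InsightBench, DataSTORM sets a new state-of-the-art (SOTA) result, with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. On a new dataset built on ACLED, a complex real-world database, it outperforms proprietary systems such as ChatGPT Deep Research in both automated and human evaluations.

Insight: The novelty lies in formalizing deep research over structured data as a thesis-driven analytical process and building an LLM agent system that combines Exploratory Data Analysis with Data Storytelling. Its paradigm of pairing quantitative reasoning with narrative construction for structured data is worth drawing on.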

Abstract: Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.

[11] ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs cs.CL

Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka

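TL;DR: This paper proposes ValueGround, a benchmark for evaluating multimodal large language models (MLLMs) on culture-conditioned visual value grounding. Built from World Values Survey questions, it represents opposing response options with minimally contrastive image pairs and requires models to choose the image that matches a country's value tendency from the country, the question, and the image pair, without access to the original response-option texts. Experiments show that average accuracy drops from 72.8% to 65.8% when options move from text to visual form, revealing weak visual grounding.

Details

Motivation: Existing evaluations of cultural values are mainly text-based and cannot verify whether models can make culture-conditioned judgments when options are presented visually, so a cross-modal benchmark is needed.

Result: Experiments across 13 countries and six MLLMs show 72.8% accuracy in the text-only setting, dropping to 65.8% with visualized options, despite 92.8% accuracy on the option-image alignment task. Stronger models are more robust, but all models remain prone to prediction reversals.

Insight: The novelty lies in building the first benchmark for culture-conditioned visual value grounding, using minimally contrastive image pairs to control irrelevant variation. The benchmark exposes a systematic weakness of MLLMs in cross-modal transfer of cultural values and offers a controlled testbed for studying the visual grounding of value judgments.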

Abstract: Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country’s value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the text-only setting to 65.8% when options are visualized, despite 92.8% accuracy on option-image alignment. Stronger models are more robust, but all remain prone to prediction reversals. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.

[12] MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts cs.CL | cs.AI

Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long

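TL;DR: This paper presents MedConclusion, a large-scale benchmark dataset of 5.7 million PubMed structured abstracts for biomedical conclusion generation, used to evaluate whether large language models can infer scientific conclusions from structured biomedical evidence.

Details

Motivation: Resources for testing whether large language models can infer scientific conclusions from structured biomedical evidence are limited, so a dedicated benchmark is needed to fill this gap.

Result: In an initial study, diverse LLMs are evaluated under conclusion and summary prompting settings and scored with both reference-based metrics and LLM-as-a-judge. Conclusion writing is found to be behaviorally distinct from summary writing, strong models score close together under current automatic metrics, and judge identity can substantially shift absolute scores.

Insight: The novelty lies in building the first large-scale structured benchmark for biomedical conclusion generation, with journal-level metadata that supports subgroup analysis across biomedical domains, providing a reusable data resource for studying evidence-to-conclusion reasoning.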

Abstract: Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.

[13] The Detection–Extraction Gap: Models Know the Answer Before They Can Say It cs.CL | cs.AI | cs.IT | cs.LG

Hanyang Wang, Mingxuan Zhu

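TL;DR: This paper identifies a "detection–extraction gap" in large language model reasoning: models keep generating many redundant chain-of-thought tokens after the answer is already determined. The authors propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, substantially reducing serial generation while improving accuracy.

Details

Motivation: Modern reasoning models keep generating redundant content after the answer is determined: the model state already contains the answer, but standard decoding cannot extract it effectively, which wastes computation and can hurt performance.

Result: Across five model configurations, two model families, and three benchmarks, BAEE truncates 70–78% of serial generation while improving accuracy by 1–5 percentage points. For thinking-mode models, preventing the answer from being overwritten yields gains of up to 5.8 percentage points.

Insight: The core contribution is identifying the detection–extraction gap as a structural phenomenon and formalizing it through a total-variation bound between free and forced continuation distributions. BAEE exploits this asymmetry to make inference more efficient in a black-box setting, without access to model internals.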

Abstract: Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix. This post-commitment generation reveals a structural phenomenon: the detection–extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. We formalize this mismatch via a total-variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix-induced shift. Exploiting this asymmetry, we propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70–78% of serial generation while improving accuracy by 1–5 pp across all models. For thinking-mode models, early exit prevents post-commitment overwriting, yielding gains of up to 5.8 pp; a cost-optimized variant achieves 68–73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.

[14] DiffuMask: Diffusion Language Model for Token-level Prompt Pruning cs.CL

Caleb Zheng, Jyotika Singh, Fang Tu, Weiyi Sun, Sujeeth Bharadwaj

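TL;DR: This paper proposes DiffuMask, a diffusion-based framework for parallel prompt pruning. Using hierarchical shot-level and token-level signals, it compresses prompts quickly, reducing prompt length by up to 80% while preserving essential reasoning context, and maintains or improves reasoning accuracy across domains and models.

Details

Motivation: Existing pruning-based prompt compression methods rely on sequential token removal, which is computationally expensive, and long prompts may contain redundant information. A fast, controllable compression method is needed to make in-context reasoning in large language models more efficient and reliable.

Result: In in-domain, out-of-domain, and cross-model settings, DiffuMask reduces prompt length by up to 80% and accelerates compression while maintaining or improving accuracy.

Insight: The novelty lies in applying a diffusion model to prompt pruning: hierarchical mask prediction removes multiple tokens in parallel and offers tunable control over compression, making this a generalizable and controllable prompt compression framework.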

Abstract: In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs). These typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal which is computationally intensive. We present DiffuMask, a diffusion-based framework integrating hierarchical shot-level and token-level pruning signals, that enables rapid and parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates the compression process via masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context and achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.

[15] Feedback Adaptation for Retrieval-Augmented Generation cs.CL

Jihwan Bang, Seunghan Yang, Kyuhong Shim, Simyung Chang, Juntae Lee

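TL;DR: This paper introduces feedback adaptation as a problem setting for retrieval-augmented generation (RAG) systems, asking how effectively and how quickly corrective feedback propagates to future queries. The authors propose two evaluation metrics, correction lag and post-feedback performance, and show that training-based approaches exhibit a trade-off between delayed correction and reliable adaptation.

Details

Motivation: RAG systems are typically evaluated under static assumptions, which overlooks the fact that deployed systems are corrected through user or expert feedback. Evaluations of how systems adapt and evolve after feedback is introduced are missing.

Result: Under the proposed evaluation framework, training-based approaches show a trade-off between correction lag and reliable adaptation, whereas PatchRAG, the authors' inference-time method that requires no retraining, demonstrates immediate correction and strong post-feedback generalization.

Insight: The novelty lies in establishing feedback adaptation as a new RAG problem setting and proposing two quantifiable metrics (correction lag and post-feedback performance) to measure it systematically. PatchRAG, as a lightweight inference-time adaptation scheme, offers a practical route to more interactive and adaptive RAG systems.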

Abstract: Retrieval-Augmented Generation (RAG) systems are typically evaluated under static assumptions, despite being frequently corrected through user or expert feedback in deployment. Existing evaluation protocols focus on overall accuracy and fail to capture how systems adapt after feedback is introduced. We introduce feedback adaptation as a problem setting for RAG systems, which asks how effectively and how quickly corrective feedback propagates to future queries. To make this behavior measurable, we propose two evaluation axes: correction lag, which captures the delay between feedback provision and behavioral change, and post-feedback performance, which measures reliability on semantically related queries after feedback. Using these metrics, we show that training-based approaches exhibit a trade-off between delayed correction and reliable adaptation. We further propose PatchRAG, a minimal inference-time instantiation that incorporates feedback without retraining, demonstrating immediate correction and strong post-feedback generalization under the proposed evaluation. Our results highlight feedback adaptation as a previously overlooked dimension of RAG system behavior in interactive settings.
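The two evaluation axes are defined only verbally in the abstract. One possible way to operationalize them, assuming each related query issued after feedback is scored as simply correct or incorrect, is sketched below; the function names and this binary scoring are assumptions, not the paper's protocol.

```python
def correction_lag(outcomes):
    """Number of post-feedback queries answered incorrectly before the first
    correct one; None if the behavior never changes.

    outcomes: booleans for related queries issued after feedback, in order.
    """
    for i, ok in enumerate(outcomes):
        if ok:
            return i
    return None

def post_feedback_performance(outcomes):
    """Fraction of post-feedback related queries answered correctly."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Hypothetical traces: a system that needs two more queries before its
# behavior changes, and one that is corrected immediately.
delayed = [False, False, True, True]
immediate = [True, True, True, True]
lags = (correction_lag(delayed), correction_lag(immediate))
scores = (post_feedback_performance(delayed), post_feedback_performance(immediate))
```

In this toy reading, an inference-time method that applies feedback at once would have a lag of 0, while a method that waits for a retraining cycle would show a positive lag.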

[16] A Parameter-Efficient Transfer Learning Approach through Multitask Prompt Distillation and Decomposition for Clinical NLP cs.CL | cs.AI

Cheng Peng, Mengxian Lyu, Ziyi Chen, Yonghui Wu

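TL;DR: This paper proposes a multitask prompt distillation and decomposition framework for parameter-efficient transfer learning in clinical natural language processing (NLP). Starting from a shared metaprompt, it adapts to new tasks with fewer than 0.05% trainable parameters and outperforms LoRA and single-task prompt tuning across multiple clinical NLP task types and backbone models.

Details

Motivation: Existing prompt-based fine-tuning methods learn task-specific prompts independently, which imposes heavy computing and storage overhead when deploying multiple clinical NLP systems.

Result: Evaluated on 10 held-out target datasets covering five task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) with three backbone models (LLaMA 3.1 8B, Meditron3 8B, and gpt-oss 20B). The framework consistently outperforms LoRA by 1.5–1.7% while using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1–6.6%. gpt-oss 20B achieves the highest overall performance, particularly on clinical reasoning tasks. Strong zero- and few-shot performance indicates better transferability of the shared prompt representation.

Insight: The core idea is to learn a shared metaprompt through multitask distillation and then adapt it efficiently to new tasks through a decomposition mechanism, achieving very high parameter efficiency (<0.05% trainable parameters) and strong cross-task generalization. This provides a lightweight, efficient way to deploy NLP systems at scale in specialized domains such as clinical care.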

Abstract: Existing prompt-based fine-tuning methods typically learn task-specific prompts independently, imposing significant computing and storage overhead at scale when deploying multiple clinical natural language processing (NLP) systems. We present a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters. Evaluated across five clinical NLP task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) on 10 held-out target datasets using three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B), our framework consistently outperforms LoRA by 1.5–1.7% despite using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1–6.6%. The gpt-oss 20B model achieves the highest overall performance, particularly on clinical reasoning tasks. The strong zero- and few-shot performance demonstrates better transferability of the shared prompt representation.

[17] ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding cs.CL | cs.AI

Xuanle Zhao, Xinyuan Cai, Xiang Cheng, Xiuyi Chen, Bo Xu

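TL;DR: This paper presents ChemVLR, a chemical vision-language model that prioritizes reasoning. It first identifies fine-grained chemical descriptors (such as functional groups) and then generates the answer, giving interpretable reasoning paths for complex visual chemistry problems. To support this, the authors build a large-scale reasoning-and-captioning dataset of 760k high-quality samples and adopt a three-stage training framework. Experiments show that ChemVLR achieves state-of-the-art performance on molecular and reaction tasks.

Details

Motivation: Current chemical vision-language models are mainly optimized for direct visual question answering, which produces "black-box" systems that fail to exploit the ability of large language models to infer underlying reaction mechanisms. The paper addresses this by building reasoning into the perception process.

Result: ChemVLR achieves state-of-the-art performance on molecular and reaction tasks, surpassing leading proprietary models and domain-specific open-source baselines.

Insight: The core idea is to decouple perception from reasoning: identifying fine-grained chemical descriptors first yields explicit, interpretable reasoning paths. Methodological contributions include a large-scale dataset built with a cross-modality reverse-engineering strategy and a rigorous filtering pipeline, and a three-stage training framework that systematically builds the model's perception and reasoning capacity.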

Abstract: While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in “black-box” systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systematically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.

[18] Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs cs.CL | cs.LG

Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, Xiaoying Tang

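TL;DR: This paper proposes Adaptive Prompt Structure Factorization (aPSF), a framework for automatically discovering and optimizing compositional prompt programs. It is API-only (prompt in, text out, with no access to model internals): an Architect model factorizes the prompt into semantic factors, and interventional single-factor updates improve optimization efficiency and performance.

Details

Motivation: Existing API-only prompt optimizers typically edit a monolithic prompt iteratively, which couples components, obscures credit assignment, limits controllability, and wastes tokens. aPSF addresses these problems by factorizing the prompt structure to enable more efficient optimization.

Result: Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average. On MultiArith it reduces optimization cost by 45–87% in tokens and reaches peak validation performance in 1 step.

Insight: Contributions include factorizing prompts into semantic factors to decouple them, and allocating updates efficiently through interventional factor-level scoring and error-guided factor selection. The result is a more controllable and sample-efficient prompt optimization method that is applicable to LLM prompt engineering and automated reasoning tasks.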

Abstract: Automated prompt optimization is crucial for eliciting reliable reasoning from large language models (LLMs), yet most API-only prompt optimizers iteratively edit monolithic prompts, coupling components and obscuring credit assignment, limiting controllability, and wasting tokens. We propose Adaptive Prompt Structure Factorization (aPSF), an API-only framework (prompt-in/text-out; no access to model internals) that uses an Architect model to discover task-specific prompt structures as semantic factors. aPSF then performs interventional, single-factor updates: interventional factor-level scoring estimates each factor’s marginal contribution via validation-performance changes, and error-guided factor selection routes updates to the current dominant failure source for more sample-efficient optimization. Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization cost by 45–87% tokens on MultiArith while reaching peak validation in 1 step.

[19] Luwen Technical Report cs.CL | cs.AI

Yiquan Wu, Yuhang Liu, Yifei Liu, Ang Li, Siying Zhou

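TL;DR: This paper introduces Luwen, an open-source Chinese legal large language model built on the Baichuan foundation model. It combines three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning on carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. Together they target the challenges general-purpose LLMs face in the legal domain: specialized terminology, complex reasoning, and rapidly changing knowledge.

Details

Motivation: General-purpose LLMs perform well on NLP tasks, but applying them to the legal domain remains challenging because of its specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge.

Result: On five representative legal tasks spanning prediction and generation settings (legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning), Luwen outperforms several strong baselines, demonstrating the effectiveness of the approach in adapting general-purpose language models to the legal domain.

Insight: The contribution is a model adaptation recipe tailored to the legal domain that combines continual pre-training, supervised fine-tuning, and retrieval-augmented generation. Viewed objectively, its systematic engineering approach to deeply combining a general foundation model with domain knowledge, and its construction of and evaluation on Chinese legal benchmarks covering multiple tasks, are useful references for developing domain-specific LLMs.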

Abstract: Large language models have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their application in the legal domain remains challenging due to the specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge involved. In this paper, we present Luwen, an open-source Chinese legal language model built upon the Baichuan foundation model through three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning with carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. We evaluate Luwen on five representative legal tasks spanning both prediction and generation settings, including legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Experimental results show that Luwen outperforms several strong baselines, demonstrating the effectiveness of our approach in adapting general-purpose language models to the legal domain.

[20] Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents cs.CL

Heng Zhou, Zelin Tan, Zhemeng Zhang, Yutao Fan, Yibing Lin

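TL;DR: This paper studies how different reasoning paradigms (such as CoT and ReAct) affect LLM agent task performance. It finds that no single paradigm is best on all tasks and that task-specific paradigm selection is crucial. The authors therefore propose a "select-then-solve" approach in which a lightweight embedding-based router dynamically selects the most suitable reasoning paradigm for each task, significantly improving average accuracy across multiple benchmarks.

Details

Motivation: The work asks whether improvements in LLM agent performance come from the model itself or from the reasoning paradigm it uses. Different paradigms prove complementary across tasks and none dominates, so a method is needed that dynamically selects the best paradigm per task.

Result: Experiments on four frontier LLMs and ten benchmarks show that the learned router raises average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm (50.3%) by 2.8 percentage points and recovering up to 37% of the oracle gap.

Insight: The core idea is to treat reasoning paradigm selection as a learned, per-task dynamic decision, not a fixed architectural choice, implemented with a lightweight embedding router. This offers a new form of inference-time optimization for LLM agents.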

Abstract: When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per-task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select-then-solve approach: before answering each task, a lightweight embedding-based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero-shot self-routing only works for GPT-5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per-task decision made by a learned router, not a fixed architectural choice.
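
A minimal sketch of what an embedding-based router could look like, assuming task embeddings are already available as plain vectors and that each training task is labeled with the paradigm that worked best on it. The nearest-centroid rule, the function names, and the toy vectors are illustrative assumptions; the abstract does not specify the router's architecture.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fit_centroids(examples: List[Tuple[Vector, str]]) -> Dict[str, List[float]]:
    """Average the embeddings of the tasks on which each paradigm was best."""
    groups = defaultdict(list)
    for embedding, paradigm in examples:
        groups[paradigm].append(embedding)
    return {
        paradigm: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for paradigm, vectors in groups.items()
    }


def route(task_embedding: Vector, centroids: Dict[str, List[float]]) -> str:
    """Pick the paradigm whose centroid is most similar to the task embedding."""
    return max(centroids, key=lambda p: cosine(task_embedding, centroids[p]))


# Toy 2-d embeddings standing in for a real sentence encoder (hypothetical data).
train = [([1.0, 0.0], "ReAct"), ([0.8, 0.2], "ReAct"), ([0.0, 1.0], "Direct")]
centroids = fit_centroids(train)
chosen = route([0.9, 0.1], centroids)
```

The selected paradigm name would then determine which prompting scaffold wraps the model call for that task.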

[21] How Long Reasoning Chains Influence LLMs’ Judgment of Answer Factuality cs.CL

Minzhu Tu, Shiyu Ni, Keping Bi

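TL;DR: This paper systematically studies how reasoning chains affect LLM judgments of answer factuality. Weak judges are easily misled by the mere presence of reasoning; strong judges can partially exploit the information in reasoning but are still fooled by superficially high-quality reasoning. The findings underline the need for more robust LLM judges.

Details

Motivation: The paper examines whether exposing a generator's reasoning content to the judge improves judgment accuracy, and how reasoning chains affect LLM judge behavior on factual QA and mathematical reasoning benchmarks.

Result: Experiments on factual QA and mathematical reasoning benchmarks show that weak judges tend to accept incorrect answers when reasoning is present. Strong judges can partially use reasoning as evidence but are still misled by superficially high-quality reasoning.

Insight: The fluency and factuality of reasoning chains are the key signals driving LLM judgments, which highlights the need for robust judges that can distinguish genuine reasoning quality from superficial fluency.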

Abstract: Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information in assessing answer correctness. With the rise of reasoning-capable models, exposing a generator’s reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.

[22] Multilingual Cognitive Impairment Detection in the Era of Foundation Models cs.CL

Damar Hoogland, Boshko Koloski, Jaya Caporusso, Tine Kolenik, Ana Zwitter Vitez

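TL;DR: This paper evaluates cognitive impairment classification from speech transcripts in English, Slovene, and Korean. It compares zero-shot LLMs used as direct classifiers under three input settings with supervised tabular models trained under a leave-one-out protocol, and finds that the supervised tabular models generally perform better, especially those that combine engineered linguistic features with embeddings.

Details

Motivation: The goal is to compare zero-shot LLMs with supervised methods based on traditional linguistic features and embeddings across languages on the small-data task of cognitive impairment detection, in order to identify the most effective sources of signal.

Result: Experiments on English, Slovene, and Korean show that supervised tabular models (particularly those fusing engineered linguistic features with embeddings) generally outperform zero-shot LLM baselines. Few-shot experiments indicate that the value of supervision is language-dependent.

Insight: The contribution is a systematic comparison of zero-shot LLMs with several supervised tabular approaches for multilingual, small-data cognitive impairment detection. The central finding is that structured linguistic signals and simple fusion-based classifiers remain strong, reliable sources of signal in this domain, so relying entirely on foundation models is not warranted.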

Abstract: We evaluate cognitive impairment (CI) classification from transcripts of speech in English, Slovene, and Korean. We compare zero-shot large language models (LLMs) used as direct classifiers under three input settings – transcript-only, linguistic-features-only, and combined – with supervised tabular approaches trained under a leave-one-out protocol. The tabular models operate on engineered linguistic features, transcript embeddings, and early or late fusion of both modalities. Across languages, zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, particularly when engineered linguistic features are included and combined with embeddings. Few-shot experiments focusing on embeddings indicate that the value of limited supervision is language-dependent, with some languages benefiting substantially from additional labelled examples while others remain constrained without richer feature representations. Overall, the results suggest that, in small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable signals.

[23] TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks cs.CL | cs.AI

Xiangyu Wang, Jin Wu, Haoran Shi, Wei Xia, Jiarui Yu

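TL;DR: This paper proposes TeamLLM, a multi-LLM collaboration framework that emulates the role division of human teams to solve multi-step contextualized tasks. The framework defines four team roles and uses a three-phase collaboration process. To evaluate it, the authors build the CGPST benchmark, which features contextual grounding, procedural structure, process-oriented evaluation, and multi-dimensional assessment. Experiments show that TeamLLM substantially improves performance on CGPST.

Details

Motivation: Existing multi-LLM frameworks for contextualized tasks do not explicitly emulate human team role division, which can lead to a single perspective and weaken performance on multi-step contextualized tasks.

Result: Ten popular LLMs are evaluated on the proposed CGPST benchmark at the overall, step, and dimension levels. TeamLLM substantially improves performance on CGPST.

Insight: The main contributions are bringing the role division of human teamwork into a multi-LLM collaboration framework, and building a process-oriented, multi-dimensional benchmark (CGPST) for evaluating multi-step contextualized tasks. Together they offer new ideas and an evaluation tool for understanding and improving LLM collaboration on complex, structured tasks.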

Abstract: Recently, multi-Large Language Model (LLM) frameworks have been proposed to solve contextualized tasks. However, these frameworks do not explicitly emulate human team role division, which may lead to a single perspective, thereby weakening performance on multi-step contextualized tasks. To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework. TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks. To evaluate the effectiveness of TeamLLM on multi-step contextualized tasks, we propose Contextually-Grounded and Procedurally-Structured tasks (CGPST) and construct the CGPST benchmark. This benchmark has four core features: contextual grounding, procedural structure, process-oriented evaluation and multi-dimensional assessment. We evaluate ten popular LLMs on CGPST at overall-level, step-level, and dimension-level. Results show that TeamLLM substantially improves performance on CGPST. We release the benchmark with scenarios, full-process responses and human scores from ten LLMs. The code and data are available at https://anonymous.4open.science/r/TeamLLM-anonymous-C50E/.

[24] When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning cs.CL

Yang Xiang, Yixin Ji, Ruotao Xu, Dan Qiao, Zheming Yang

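TL;DR: This paper proposes Dynamic Thought Sufficiency in Reasoning (DTSR), a framework that addresses the inefficiency caused by "overthinking" in large reasoning models (LRMs) on complex reasoning tasks. It dynamically assesses the sufficiency of the chain of thought to decide when to exit early, reducing computational redundancy.

Details

Motivation: LRMs tend to overthink during inference, causing substantial computational redundancy and reduced efficiency. Existing early-exit methods mostly rely on handcrafted or empirical indicators that are unreliable and impractical, so a more reliable dynamic assessment mechanism is needed.

Result: Experiments on Qwen3 models show that DTSR reduces reasoning length by 28.9%-34.9% with minimal performance loss, effectively mitigating overthinking.

Insight: Inspired by human metacognition, the paper proposes a two-stage framework (Reflection Signal Monitoring and Thought Sufficiency Check) that dynamically assesses the sufficiency of the chain of thought, enabling efficient, adaptive early exit. It offers a new perspective on early-exit reasoning and self-evaluation paradigms.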

Abstract: Large reasoning models (LRMs) have achieved remarkable performance in complex reasoning tasks, driven by their powerful inference-time scaling capability. However, LRMs often suffer from overthinking, which results in substantial computational redundancy and significantly reduces efficiency. Early-exit methods aim to mitigate this issue by terminating reasoning once sufficient evidence has been generated, yet existing approaches mostly rely on handcrafted or empirical indicators that are unreliable and impractical. In this work, we introduce Dynamic Thought Sufficiency in Reasoning (DTSR), a novel framework for efficient reasoning that enables the model to dynamically assess the sufficiency of its chain-of-thought (CoT) and determine the optimal point for early exit. Inspired by human metacognition, DTSR operates in two stages: (1) Reflection Signal Monitoring, which identifies reflection signals as potential cues for early exit, and (2) Thought Sufficiency Check, which evaluates whether the current CoT is sufficient to derive the final answer. Experimental results on the Qwen3 models show that DTSR reduces reasoning length by 28.9%-34.9% with minimal performance loss, effectively mitigating overthinking. We further discuss overconfidence in LRMs and self-evaluation paradigms, providing valuable insights for early-exit reasoning.

[25] GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering cs.CL

Guanran Luo, Wentao Qiu, Zhongquan Jian, Meihong Wang, Qingqiang Wu

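TL;DR: This paper proposes GCoT-decoding, a general decoding strategy that extends prompt-free chain-of-thought reasoning to a broader range of question-answering tasks. It generates candidate decoding paths through two-stage branching (combining Fibonacci sampling and heuristic error backtracking), splits each path into a reasoning span and an answer span to compute confidence accurately, and reaches a consensus answer by aggregating semantically similar paths instead of using traditional majority voting. Experiments on six datasets covering fixed and free QA validate its generality and effectiveness.

Details

Motivation: Existing CoT-decoding applies only to problems with fixed answer sets, which limits its use in broader QA tasks, especially those with free-form answers. The paper addresses this limitation with a general decoding strategy that widens the applicability of prompt-free chain-of-thought reasoning.

Result: Extensive experiments on six datasets covering fixed and free QA show that the method maintains strong performance on fixed QA while achieving significant improvements on free QA, demonstrating its generality.

Insight: The contribution is a general two-stage branching decoding strategy (GCoT-decoding) that computes confidence accurately through path splitting (reasoning span and answer span) and uses semantic aggregation instead of majority voting to reach the final answer, successfully extending prompt-free chain-of-thought reasoning to free-form QA.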

Abstract: Chain-of-Thought reasoning can enhance large language models, but it requires manually designed prompts to guide the model. Recently proposed CoT-decoding enables the model to generate CoT-style reasoning paths without prompts, but it is only applicable to problems with fixed answer sets. To address this limitation, we propose a general decoding strategy GCoT-decoding that extends applicability to a broader range of question-answering tasks. GCoT-decoding employs a two-stage branching method combining Fibonacci sampling and heuristic error backtracking to generate candidate decoding paths. It then splits each path into a reasoning span and an answer span to accurately compute path confidence, and finally aggregates semantically similar paths to identify a consensus answer, replacing traditional majority voting. We conduct extensive experiments on six datasets covering both fixed and free QA tasks. Our method not only maintains strong performance on fixed QA but also achieves significant improvements on free QA, demonstrating its generality.

[26] Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions cs.CL | cs.CY

Parth Patil, Dhruv Kumar, Yash Sinha, Murari Mandal

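TL;DR: This paper proposes a nine-dimension algebraic complexity framework for systematically diagnosing why LLMs fail at algebraic reasoning. Each complexity dimension (such as expression nesting depth, number of simultaneous intermediate results, and operator hardness) is controlled independently, and problems are generated and verified automatically, so the specific bottleneck behind a failure can be pinpointed. The study finds that working memory is the dominant bottleneck across model scales: every model collapses at 20-30 parallel branches, pointing to a hard architectural constraint.

Details

Motivation: Existing algebraic reasoning benchmarks report only accuracy and cannot reveal why a model failed (for example, the expression was too deeply nested or the operator too uncommon). No systematic framework varies each complexity factor independently, and no mechanism automatically generates and verifies progressively harder problems, which makes model progress hard to track.

Result: Seven instruction-tuned models (8B to 235B parameters) were evaluated on all nine dimensions. Working memory is the dominant scale-invariant bottleneck: every model collapses at 20-30 parallel branches regardless of parameter count. The study also identifies a minimal set of five diagnostic dimensions that fully covers the known algebraic failure modes.

Insight: Contributions include: 1) a multi-dimensional algebraic complexity framework with independent control of each dimension and automatic problem generation and verification; 2) the finding that working memory is a universal bottleneck across model scales, with the parallel-branch limit pointing to an architectural constraint and not a capacity problem that scaling can solve; 3) five key diagnostic dimensions that allow efficient and comprehensive assessment of a model's algebraic reasoning capacity.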

Abstract: Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control. No prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline requiring no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluated seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck. Every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a solvable capacity limitation. Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model’s algebraic reasoning capacity.
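
To illustrate what varying one factor while holding the others fixed can look like, here is a minimal sketch of a parametric generator for a single dimension, expression nesting depth. It is not the authors' pipeline: the operator set, operand range, and function names are assumptions chosen for the example, and the ground-truth answer is computed alongside the expression so no human annotation is needed.

```python
import operator
import random
from typing import Tuple

# Assumed operator set for the sketch; the paper's operators are not listed here.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_problem(depth: int, seed: int = 0) -> Tuple[str, int]:
    """Return (expression, answer) with exactly `depth` levels of parentheses."""
    rng = random.Random(seed)          # seeded for reproducible problem sets
    value = rng.randint(1, 9)
    expr = str(value)
    for _ in range(depth):
        symbol = rng.choice(sorted(OPS))
        operand = rng.randint(1, 9)
        expr = f"({expr} {symbol} {operand})"
        value = OPS[symbol](value, operand)   # track the ground truth as we build
    return expr, value


def max_nesting(expr: str) -> int:
    """Measure the deepest parenthesis level, to verify the controlled dimension."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


expression, answer = make_problem(depth=4, seed=7)
```

Sweeping `depth` while keeping the seed policy and operand range unchanged gives a difficulty ladder along this one axis; other dimensions would need their own generators.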

[27] Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning cs.CL

Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong

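TL;DR: This paper proposes Cognitive Loop of Thought (CLoT), a new chain-of-thought framework based on a reversible hierarchical Markov chain, which targets the excessive sequence length and computational burden of long chains of thought. It decomposes a problem into sub-problems with hierarchical dependencies, introduces a backward verification mechanism inspired by human cognitive processes, and prunes redundant lower-level sub-problems once higher-level ones are verified, improving reasoning efficiency and robustness.

Details

Motivation: Long chain-of-thought methods often produce sequences that exceed computational limits, while existing Markov chain-like methods suffer from two critical limitations: memorylessness (loss of context) and limited backward reasoning capability. The paper addresses these with a reversible hierarchical structure and a backward verification mechanism.

Result: Experiments on four mathematical benchmarks demonstrate the effectiveness of the method. Notably, on the AddSub dataset with GPT-4o-mini, CLoT reaches 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.

Insight: The main contribution is the CLoT framework based on a reversible hierarchical Markov chain, with a backward verification mechanism and a hierarchical pruning strategy that mitigate error propagation, strengthen reasoning robustness, and improve computational efficiency. Viewed objectively, the method formalizes the verification and hierarchical processing found in human cognition, offering a structured approach to improving LLM mathematical reasoning.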

Abstract: Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths that exceed manageable computational limits. While existing approaches attempt to alleviate this by reducing KV Cache redundancy via Markov chain-like structures, they introduce two critical limitations: inherent memorylessness (loss of context) and limited backward reasoning capability. To address these limitations, we propose a novel Chain-of-Thought framework based on Reversible Hierarchical Markov Chain, termed Cognitive Loop of Thought (CLoT), and a backward reasoning dataset CLoT-Instruct. In CLoT, problems are decomposed into sub-problems with hierarchical dependencies. Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer. Furthermore, we implement a pruning strategy: once higher-level sub-problems are verified, redundant lower-level sub-problems are pruned to maximize efficiency. This approach effectively mitigates error propagation and enhances reasoning robustness. Experiments on four mathematical benchmarks demonstrate the effectiveness of our method. Notably, on the AddSub dataset using GPT-4o-mini, CLoT achieves 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.

[28] WRAP++: Web discoveRy Amplified Pretraining cs.CL | cs.AI

Jiang Zhou, Yunhao Wang, Xing Wu, Tinghao Yu, Feng Zhang

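TL;DR: WRAP++ is a method for enhancing LLM pretraining. It discovers cross-document relationships from web hyperlinks and synthesizes joint question-answer pairs over them, amplifying the associative context of factual knowledge. Instantiated on Wikipedia, it amplifies about 8.4B tokens of raw text into 80B tokens of cross-document QA data.

Details

Motivation: Existing synthetic data methods based on single-document rewriting are confined to intra-document knowledge and miss cross-document relationships, leaving facts with limited associative context. WRAP++ amplifies that associative context by discovering cross-document relationships.

Result: On the SimpleQA benchmark, OLMo-based 7B and 32B models trained with WRAP++ substantially outperform single-document approaches and show sustained scaling gains.

Insight: The contribution is to discover high-confidence relational motifs (such as dual-links and co-mentions) from web hyperlinks and synthesize QA pairs that require cross-document reasoning. This produces relational knowledge absent from either source document alone and creates diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, the approach also amplifies data scale considerably.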

Abstract: Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
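
The motif discovery step can be sketched on a toy hyperlink graph. The abstract names the motifs but does not define them, so the definitions below are assumptions: a dual-link is taken to be a pair of pages that link to each other, and a co-mention a pair of pages that are both linked from the same third page. The function names and the graph are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

LinkGraph = Dict[str, Set[str]]  # page -> set of pages it links to
Pair = Tuple[str, str]           # always stored in sorted order


def dual_links(graph: LinkGraph) -> Set[Pair]:
    """Pairs (a, b) where a links to b and b links back to a (assumed definition)."""
    pairs: Set[Pair] = set()
    for src, targets in graph.items():
        for dst in targets:
            # src < dst keeps one canonical ordering and skips self-links.
            if src < dst and src in graph.get(dst, set()):
                pairs.add((src, dst))
    return pairs


def co_mentions(graph: LinkGraph) -> Set[Pair]:
    """Pairs of pages linked from the same third page (assumed definition)."""
    pairs: Set[Pair] = set()
    for src, targets in graph.items():
        others = sorted(t for t in targets if t != src)
        pairs.update(combinations(others, 2))
    return pairs


# Toy graph: page_a and page_b link to each other; page_a also links to page_c.
toy_graph: LinkGraph = {
    "page_a": {"page_b", "page_c"},
    "page_b": {"page_a"},
    "page_c": set(),
}
mutual = dual_links(toy_graph)
shared = co_mentions(toy_graph)
```

A page with n outgoing links contributes up to n(n-1)/2 co-mention pairs, which matches the abstract's remark that the number of candidate pairs grows combinatorially; a real pipeline would need the confidence filtering the paper describes.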

[29] Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM cs.CL

Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang

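TL;DR: Fast-dVLM is a block-diffusion-based vision-language model that addresses the low inference throughput of autoregressive decoding on edge devices. By directly converting a pretrained autoregressive VLM, it enables KV-cache-compatible parallel decoding and speculative block decoding, substantially accelerating inference.

Details

Motivation: Autoregressive decoding generates tokens one at a time, limiting VLM inference throughput, especially in edge deployments such as robotics and autonomous driving where hardware parallelism is underutilized. Block diffusion works well for text generation, but extending it to VLMs is challenging because continuous visual representations and discrete text tokens must be handled jointly while preserving pretrained multimodal capabilities.

Result: Across 11 multimodal benchmarks, Fast-dVLM matches the autoregressive baseline in generation quality. With SGLang integration and FP8 quantization, it achieves over 6x end-to-end inference speedup.

Insight: Contributions include the direct conversion strategy (which outperforms the two-stage approach), multimodal diffusion adaptations (block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation), and a KV-cache-compatible parallel decoding mechanism, together offering a new route to efficient VLM inference.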

Abstract: Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation, that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.

[30] On the Step Length Confounding in LLM Reasoning Data Selection cs.CL | cs.AI

Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu

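TL;DR: This paper shows that naturalness-based selection (average log probability) of LLM reasoning data suffers from step length confounding: it prefers samples with longer reasoning steps over higher-quality samples. Through quantitative analysis the authors attribute this to low-probability first tokens in reasoning steps, and propose two variants (ASLEC-DROP and ASLEC-CASL) to mitigate it, validated across multiple models and benchmarks.

Details

Motivation: When used to build LLM reasoning datasets, existing naturalness-based data selection systematically prefers samples with longer reasoning steps over higher-quality ones, a bias termed step length confounding. The paper aims to remove this bias.

Result: Experiments on four LLMs and five evaluation benchmarks show that ASLEC-DROP and ASLEC-CASL effectively mitigate step length confounding and improve the quality of data selection.

Insight: The paper identifies and quantifies step length confounding in reasoning data selection, attributes it to low-probability first tokens in reasoning steps, and proposes practical remedies: dropping first-token probabilities or applying a causal debiasing regression. This offers a new perspective on building high-quality reasoning datasets.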

Abstract: Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised fine-tuning on large-scale and high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manually heuristic or naturalness-based selection methods to filter high-quality samples. Despite the proven effectiveness of naturalness-based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens’ confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
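
The dilution effect described in the abstract can be reproduced with toy numbers. The sketch assumes a reasoning sample is represented as a list of steps, each a list of per-token log probabilities; the function names and the values are hypothetical, and the first-token-dropping variant only mirrors the one-sentence description of ASLEC-DROP above, not the authors' implementation.

```python
from typing import List

Steps = List[List[float]]  # one list of token log-probs per reasoning step


def avg_logprob(steps: Steps) -> float:
    """Naturalness score: mean log probability over all tokens."""
    tokens = [lp for step in steps for lp in step]
    return sum(tokens) / len(tokens)


def avg_logprob_drop_first(steps: Steps) -> float:
    """Mean log probability after dropping the first token of every step."""
    tokens = [lp for step in steps for lp in step[1:]]
    if not tokens:
        raise ValueError("every step has a single token; nothing left to score")
    return sum(tokens) / len(tokens)


# Toy data: each step opens with an unlikely token (-4.0); all other tokens
# are equally likely (-0.5). Both samples contain 12 tokens in total.
short_steps: Steps = [[-4.0, -0.5, -0.5] for _ in range(4)]  # 4 steps x 3 tokens
long_steps: Steps = [[-4.0] + [-0.5] * 11]                   # 1 step x 12 tokens

naive_gap = avg_logprob(long_steps) - avg_logprob(short_steps)
debiased_gap = avg_logprob_drop_first(long_steps) - avg_logprob_drop_first(short_steps)
```

With these values the naive score ranks the long-step sample higher even though its non-initial tokens are no more probable, while the first-token-dropping score treats the two samples identically.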

[31] iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations cs.CL

Wenshuo Wang, Boyu Cao, Nan Zhuang, Wei Li

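TL;DR: This paper proposes iTAG, a method for generating natural text with accurate causal graph annotations. It assigns real-world concepts to the nodes of a causal graph and iteratively refines the concept selection through chain-of-thought reasoning, so that the generated text is natural while annotation accuracy stays high.

Details

Motivation: Existing methods for generating causally annotated text either sacrifice text naturalness (template-based methods) or cannot guarantee annotation accuracy (LLM-dependent methods); none satisfies both.

Result: iTAG shows extremely high annotation accuracy and text naturalness in extensive tests. Results of testing text-based causal discovery algorithms on its generated data correlate strongly with results on real-world data, so it can serve as a practical surrogate for scalable benchmarking.

Insight: The contribution is to frame concept assignment as an inverse problem with the causal graph as the target and to refine it iteratively through CoT reasoning, unifying natural text generation with high-accuracy causal annotation. The generated data can effectively stand in for real data in algorithm evaluation.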

Abstract: A fundamental obstacle to causal discovery from text is the lack of causally annotated text data for use as ground truth, due to high annotation costs. This motivates an important task of generating text with causal graph annotations. Early template-based generation methods sacrifice text naturalness in exchange for high causal graph annotation accuracy. Recent Large Language Model (LLM)-dependent methods directly generate natural text from target graphs through LLMs, but do not guarantee causal graph annotation accuracy. Therefore, we propose iTAG, which performs real-world concept assignment to nodes before converting causal graphs into text in existing LLM-dependent methods. iTAG frames this process as an inverse problem with the causal graph as the target, iteratively examining and refining concept selection through Chain-of-Thought (CoT) reasoning so that the induced relations between concepts are as consistent as possible with the target causal relationships described by the causal graph. iTAG demonstrates both extremely high annotation accuracy and naturalness across extensive tests, and the results of testing text-based causal discovery algorithms with the generated data show high statistical correlation with real-world data. This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.

[32] DTCRS: Dynamic Tree Construction for Recursive Summarization cs.CL

Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, Qingqiang Wu

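TL;DR: This paper proposes DTCRS, which dynamically constructs summary trees to address the redundancy of recursive summarization in RAG. Based on document structure and query semantics, it analyzes the question type to decide whether a summary tree is needed and uses sub-question embeddings as initial cluster centers, reducing redundant summary nodes and improving question-answering efficiency.

Details

Motivation: Existing recursive summarization methods produce many redundant nodes when building hierarchical summary trees, which increases construction time and may hurt question-answering performance. Recursive summarization is also not suitable for all question types.

Result: On three QA tasks, DTCRS significantly reduces summary tree construction time and achieves substantial performance improvements.

Insight: The contribution is to decide dynamically whether to build a summary tree and to optimize clustering through question decomposition and sub-question embeddings, which improves the relevance of summaries to the question. The paper also examines which question types recursive summarization suits.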

Abstract: Retrieval-Augmented Generation (RAG) mitigates the hallucination problem of Large Language Models (LLMs) by incorporating external knowledge. Recursive summarization constructs a hierarchical summary tree by clustering text chunks, integrating information from multiple parts of a document to provide evidence for abstractive questions involving multi-step reasoning. However, summary trees often contain a large number of redundant summary nodes, which not only increase construction time but may also negatively impact question answering. Moreover, recursive summarization is not suitable for all types of questions. We introduce DTCRS, a method that dynamically generates summary trees based on document structure and query semantics. DTCRS determines whether a summary tree is necessary by analyzing the question type. It then decomposes the question and uses the embeddings of sub-questions as initial cluster centers, reducing redundant summaries while improving the relevance between summaries and the question. Our approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. Additionally, we investigate the applicability of recursive summarization to different question types, providing valuable insights for future research.
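
The idea of seeding clusters with sub-question embeddings can be sketched as a k-means-style loop whose initial centers are supplied by the caller. The abstract does not name the clustering algorithm, so the cosine-based k-means variant, the function names, and the toy 2-d vectors below are assumptions for illustration only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seeded_clusters(chunks: List[Vector], seeds: List[Vector], iters: int = 5) -> List[int]:
    """Cluster chunk embeddings, starting from the given seed centers."""
    centers = [list(seed) for seed in seeds]
    labels = [0] * len(chunks)
    for _ in range(iters):
        # Assign each chunk to its most similar center.
        labels = [
            max(range(len(centers)), key=lambda k: cosine(chunk, centers[k]))
            for chunk in chunks
        ]
        # Move each center to the mean of its members; empty clusters keep their seed.
        for k in range(len(centers)):
            members = [c for c, label in zip(chunks, labels) if label == k]
            if members:
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy embeddings: two chunks near each hypothetical sub-question embedding.
chunk_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
sub_question_embeddings = [[1.0, 0.2], [0.2, 1.0]]
assignments = seeded_clusters(chunk_embeddings, sub_question_embeddings)
```

Because the number of clusters equals the number of sub-questions, each summary node is tied to one sub-question from the start, which is one plausible reading of how the seeding reduces redundant summaries.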

[33] Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models cs.CL

Md Motaleb Hossen Manik, Ge Wang

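TL;DR: This paper presents a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models (dense and MoE), measuring accuracy, latency, GPU memory usage, and approximate FLOPs on four reasoning benchmarks (ARC-Challenge, GSM8K, Math Level 1-3, TruthfulQA MC1) under three prompting strategies (zero-shot, chain-of-thought, few-shot chain-of-thought). In the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieves the best overall result. Sparse activation (MoE) alone does not guarantee the best practical operating point: accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition.

Details

Motivation: The study asks whether sparsely activated MoE language models offer better quality-efficiency tradeoffs than dense models under realistic inference constraints, and aims to characterize their end-to-end behavior.

Result: In the weighted multi-task summary, Gemma-4-E4B (few-shot chain-of-thought) reaches the highest weighted accuracy, 0.675, with mean VRAM of 14.9 GB; Gemma-4-26B-A4B is close in accuracy (0.663) but uses substantially more memory (48.1 GB). At the task level, Gemma models perform best on ARC and Math, Phi models are strongest on TruthfulQA, and GSM8K shows the largest prompt sensitivity.

Insight: The paper provides a reproducible benchmark pipeline for evaluating reasoning LLMs under real resource constraints. Its analysis indicates that the advantage of sparse activation (MoE) is not absolute: the practical accuracy-efficiency tradeoff is a joint function of architecture, prompting strategy, and task characteristics, which challenges the common assumption that MoE is necessarily more efficient.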

Abstract: Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks – ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 – under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.

[34] STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems cs.CL | cs.AI

Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu

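TL;DR: This paper proposes STRIDE-ED, a strategy-grounded, interpretable, deep reasoning framework for modeling empathetic dialogue. It combines a structured, strategy-driven reasoning process with a high-quality strategy-aware data construction pipeline and a training paradigm that pairs supervised fine-tuning with multi-objective reinforcement learning, significantly improving the performance of empathetic dialogue systems.

Details

Motivation: Existing empathetic dialogue methods lack a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data, so they struggle to model empathetic dialogue as a complex, multi-stage cognitive and decision-making process.

Result: Extensive experiments show that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluation.

Insight: The main contributions are: 1) a structured, strategy-driven multi-stage reasoning framework that explicitly models empathetic dialogue as a cognitive decision process; 2) a strategy-aware data refinement pipeline that combines LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data; 3) a training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behavior with target emotions, empathetic strategies, and response formats.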

Abstract: Empathetic dialogue requires not only recognizing a user’s emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. However, the lack of a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data fundamentally limits existing approaches, preventing them from effectively modeling empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a STRategy-grounded, Interpretable, and DEep reasoning framework that models Empathetic Dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with empathetic strategies. Furthermore, we adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behaviors with target emotions, empathetic strategies, and response formats. Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.

[35] Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering cs.CL

Elyas Irankhah, Samah Fodeh

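TL;DR: This paper describes the system developed by the Yale DM Lab for the ArchEHR-QA 2026 shared task, which addresses patient-authored questions about hospitalization records and comprises four subtasks: clinician-interpreted question reformulation, evidence sentence identification, answer generation, and evidence-answer alignment. The system uses a dual-model pipeline (Claude Sonnet 4 and GPT-4o) for question reformulation and Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) with few-shot prompting and voting strategies for the other subtasks. Experiments show that model diversity and ensemble voting consistently improve performance, and that supplying the full clinician answer paragraph as prompt context helps evidence alignment, although development-set results indicate that alignment accuracy is mainly limited by reasoning.

Details

Motivation: To automate question answering for patient questions grounded in electronic health records (EHRs), specifically hospitalization records. This requires turning patient questions into a clinician-interpretable form, accurately identifying evidence, generating answers, and aligning evidence with answers, with the goal of making medical QA systems more reliable and practical.

Result: On the development set, the best scores are 88.81 micro F1 on subtask 4 (evidence-answer alignment), 65.72 macro F1 on subtask 2 (evidence sentence identification), 34.01 on subtask 3 (answer generation), and 33.05 on subtask 1 (question reformulation). Model ensembling and voting consistently improve over single-model baselines, but alignment accuracy remains limited by reasoning.

Insight: The contributions include a dual-model pipeline for question reformulation to improve clinical interpretability; multi-model ensembles (o3, GPT-5.2, and others) combined with few-shot prompting and voting to improve robustness and performance; and the use of the full clinician answer paragraph as additional prompt context to improve evidence alignment. Viewed objectively, the work highlights the effectiveness of model diversity and ensembling on a complex medical QA task and identifies reasoning ability as the key bottleneck.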

Abstract: We describe the Yale-DM-Lab system for the ArchEHR-QA 2026 shared task. The task studies patient-authored questions about hospitalization records and contains four subtasks (ST): clinician-interpreted question reformulation, evidence sentence identification, answer generation, and evidence-answer alignment. ST1 uses a dual-model pipeline with Claude Sonnet 4 and GPT-4o to reformulate patient questions into clinician-interpreted questions. ST2-ST4 rely on Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) combined with few-shot prompting and voting strategies. Our experiments show three main findings. First, model diversity and ensemble voting consistently improve performance compared to single-model baselines. Second, the full clinician answer paragraph is provided as additional prompt context for evidence alignment. Third, results on the development set show that alignment accuracy is mainly limited by reasoning. The best scores on the development set reach 88.81 micro F1 on ST4, 65.72 macro F1 on ST2, 34.01 on ST3, and 33.05 on ST1.
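
The ensemble voting used for ST2-ST4 can be pictured with a simple majority vote over per-model evidence predictions. This is a minimal sketch under stated assumptions: the abstract does not say which voting rule the system uses, so the strict-majority threshold, the function name `majority_vote`, and the sentence IDs below are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions, min_votes=None):
    """Keep the sentence IDs selected by at least `min_votes` models.

    predictions: one collection of sentence IDs per model.
    min_votes:   defaults to a strict majority of the models (an assumption;
                 the system's actual voting rule is not given in the abstract).
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    # set() ensures a model cannot vote twice for the same sentence.
    counts = Counter(sid for pred in predictions for sid in set(pred))
    return sorted(sid for sid, n in counts.items() if n >= min_votes)


# Four hypothetical models each return the sentences they consider evidence.
model_outputs = [{1, 2, 5}, {2, 5}, {2, 3}, {5, 7}]
evidence = majority_vote(model_outputs)
```

Lowering `min_votes` trades precision for recall, which is one way an ensemble can be tuned toward a macro-F1 target.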

[36] Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent cs.CL

Bingxuan Li, Simo Du, Yue Guo

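TL;DR: This paper proposes SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module. A reinforcement learning framework jointly optimizes reasoning and memory management, with the aim of improving experience reuse and continual adaptation in clinical diagnosis.

Details

Motivation: Existing LLM-based diagnostic agents typically handle cases independently, which limits experience reuse and continual adaptation. The paper addresses this by emulating how clinical experts accumulate reusable diagnostic patterns.

Result: In the standard evaluation on the MedCaseReasoning dataset, SEA reaches 92.46% accuracy, outperforming the strongest baseline by 19.6%. In the long-horizon evaluation on the ER-Reason dataset, SEA attains the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100). Expert evaluation also confirms that the rules it produces are clinically correct, useful, and trustworthy.

Insight: The novelty is the combination of a cognitively inspired dual-memory module with a reinforcement learning framework, which jointly optimizes reasoning and memory management. This turns experience into reusable knowledge and improves both diagnostic reasoning and continual learning.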

Abstract: Clinical expertise improves not only by acquiring medical knowledge, but also by accumulating experience that yields reusable diagnostic patterns. Recent LLM-based diagnostic agents have shown promising progress in clinical reasoning for decision support. However, most approaches treat cases independently, limiting experience reuse and continual adaptation. We propose SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module. We design a reinforcement training framework tailored to this agent for joint optimization of reasoning and memory management. We evaluate SEA in two complementary settings. In the standard evaluation on the MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory. In the long-horizon setting on the ER-Reason dataset, SEA attains the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100), while baseline methods show limited or unstable gains. Expert evaluation further indicates that rules consolidated from SEA show strong clinical correctness, usefulness, and trust, suggesting that the induced rules in the dual-memory module are reliable and practically meaningful. Overall, SEA improves both diagnostic reasoning ability and continual learning by effectively transforming experience into reusable knowledge.

[37] A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering cs.CL | cs.AI | cs.LG

Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik

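TL;DR: This paper systematically studies how individual retrieval components (language model, embedding model, retrieval strategy, query reformulation, and cross-encoder reranking) affect the performance of retrieval-augmented (RAG) medical question answering, using a unified experimental framework on the MedQA USMLE benchmark and a structured textbook knowledge corpus.

Details

Motivation: Despite growing interest in RAG-based medical systems, the specific impact of each retrieval component on performance remains unclear. The study aims to close this gap through systematic evaluation, addressing the knowledge gaps and limited factual grounding of purely parametric LLMs in medical QA.

Result: On the MedQA USMLE benchmark, retrieval augmentation significantly improves zero-shot medical QA performance; the best configuration (dense retrieval with query reformulation and reranking) reaches 60.49% accuracy. Domain-specialized language models make better use of retrieved medical evidence than general-purpose models, and the experiments reveal a tradeoff between retrieval effectiveness and computational cost.

Insight: The novelty is a systematic, unified evaluation of multiple RAG pipeline components in medical QA, described in this summary as the first of its kind, which exposes interactions between components and the performance-cost tradeoff. Viewed objectively, the lightweight experimental framework, which supports systematic evaluation on a single consumer-grade GPU, is of practical value.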

Abstract: Large language models (LLMs) have demonstrated strong capabilities in medical question answering; however, purely parametric models often suffer from knowledge gaps and limited factual grounding. Retrieval-augmented generation (RAG) addresses this limitation by integrating external knowledge retrieval into the reasoning process. Despite increasing interest in RAG-based medical systems, the impact of individual retrieval components on performance remains insufficiently understood. This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus. We analyze the interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking within a unified experimental framework comprising forty configurations. Results show that retrieval augmentation significantly improves zero-shot medical question answering performance. The best-performing configuration, dense retrieval with query reformulation and reranking, achieved 60.49% accuracy. Domain-specialized language models were also found to better utilize retrieved medical evidence than general-purpose models. The analysis further reveals a clear tradeoff between retrieval effectiveness and computational cost, with simpler dense retrieval configurations providing strong performance while maintaining higher throughput. All experiments were conducted on a single consumer-grade GPU, demonstrating that systematic evaluation of retrieval-augmented medical QA systems can be performed under modest computational resources.

[38] OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CL

Jianhui Liu, Haoze Sun, Wenbo Li, Yanbing Zhang, Rui Yang

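TL;DR: This paper introduces OpenSpatial, an open-source data engine for spatial intelligence that addresses the lack of a principled, open-source system for generating high-quality spatial data. The engine uses 3D bounding boxes as its basic primitive to build a data hierarchy covering five foundational tasks (spatial measurement, spatial relationship, camera perception, multi-view consistency, and scene-aware reasoning), and it is used to produce OpenSpatial-3M, a large-scale dataset of 3 million high-fidelity samples.

Details

Motivation: Current spatial intelligence research relies mainly on domain-specific data production and lacks a principled, open-source data generation engine that can fully unlock the potential of high-quality spatial data. This paper aims to fill that gap.

Result: Across a wide range of spatial reasoning benchmarks, general-purpose models trained on OpenSpatial-3M achieve state-of-the-art performance. The best-performing model shows an average relative improvement of 19%.

Insight: The novelty is a principled data engine design that builds a unified data hierarchy on 3D bounding boxes, enabling high-quality, scalable, task-diverse, and efficient data generation. By open-sourcing the engine and the large-scale dataset, the work provides a solid foundation for spatial intelligence research, and it systematically analyzes how data attributes affect spatial perception.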

Abstract: Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial, an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model exhibits a substantial relative average improvement of 19 percent. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.

cs.CV

[39] CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale cs.CV

Jichao Fang, Lei Zhang, Michael Phillips, Wei Luo

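TL;DR: This paper reframes crater analysis as an instance-level image retrieval problem and introduces the CraterBench-R benchmark, which contains multi-scale images and queries for about 25,000 crater identities. Self-supervised Vision Transformers (ViTs) perform best on the task, and the authors propose instance-token aggregation, a training-free, scalable method that substantially reduces storage overhead while maintaining high accuracy.

Details

Motivation: Existing deep learning pipelines usually treat craters purely as a detection problem, whereas key scientific workflows (catalog deduplication, cross-observation matching, and morphological analog discovery) are inherently retrieval tasks. A dedicated crater retrieval benchmark and methods are therefore needed.

Result: On CraterBench-R, self-supervised ViTs (especially those with in-domain pretraining) outperform generic models with more parameters. Instance-token aggregation improves mAP by 17.9 points over raw token selection at K=16, and at K=64 it matches the accuracy of using all 196 tokens with significantly less storage. A two-stage pipeline (single-vector shortlisting followed by instance-token reranking) recovers 89-94% of the full late-interaction accuracy.

Insight: The contributions include formalizing crater analysis as an instance-level retrieval task and creating a dedicated benchmark; showing the advantage of self-supervised ViTs in domain-specific retrieval; proposing training-free instance-token aggregation, which clusters and aggregates tokens to balance accuracy and storage efficiency; and designing an efficient two-stage retrieval pipeline.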

Abstract: Impact craters are a cornerstone of planetary surface analysis. However, while most deep learning pipelines treat craters solely as a detection problem, critical scientific workflows such as catalog deduplication, cross-observation matching, and morphological analog discovery are inherently retrieval tasks. To address this, we formulate crater analysis as an instance-level image retrieval problem and introduce CraterBench-R, a curated benchmark featuring about 25,000 crater identities with multi-scale gallery views and manually verified queries spanning diverse scales and contexts. Our baseline evaluations across various architectures reveal that self-supervised Vision Transformers (ViTs), particularly those with in-domain pretraining, dominate the task, outperforming generic models with significantly more parameters. Furthermore, we demonstrate that retaining multiple ViT patch tokens for late-interaction matching dramatically improves accuracy over standard single-vector pooling. However, storing all tokens per image is operationally inefficient at a planetary scale. To close this efficiency gap, we propose instance-token aggregation, a scalable, training-free method that selects K seed tokens, assigns the remaining tokens to these seeds via cosine similarity, and aggregates each cluster into a single representative token. This approach yields substantial gains: at K=16, aggregation improves mAP by 17.9 points over raw token selection, and at K=64, it matches the accuracy of using all 196 tokens with significantly less storage. Finally, we demonstrate that a practical two-stage pipeline, with single-vector shortlisting followed by instance-token reranking, recovers 89-94% of the full late-interaction accuracy while searching only a small candidate set. The benchmark is publicly available at hf.co/datasets/jfang/CraterBench-R.
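
The instance-token aggregation step described in the abstract (pick K seed tokens, assign the remaining tokens to seeds by cosine similarity, collapse each cluster into one token) can be sketched as follows. This is an illustrative sketch only: the abstract does not say how seeds are chosen or how a cluster is aggregated, so here the seed indices are supplied by the caller and each cluster is reduced by a plain mean; the function names and toy tokens are hypothetical.

```python
import math


def normalize(v):
    """Return v scaled to unit length (unchanged if it is the zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def aggregate_tokens(tokens, seed_idx):
    """Reduce a list of patch tokens to len(seed_idx) representative tokens.

    tokens:   list of equal-length feature vectors (e.g. ViT patch tokens).
    seed_idx: unique indices of the K seed tokens (seed selection is assumed
              to happen elsewhere; the paper's rule is not given in the abstract).
    """
    unit = [normalize(t) for t in tokens]
    clusters = {s: [tokens[s]] for s in seed_idx}
    seeds = set(seed_idx)
    for i, u in enumerate(unit):
        if i in seeds:
            continue
        # On unit vectors, the dot product equals cosine similarity.
        best = max(seed_idx, key=lambda s: sum(a * b for a, b in zip(u, unit[s])))
        clusters[best].append(tokens[i])
    # Mean aggregation per cluster (an assumption, not the paper's stated rule).
    return [
        [sum(col) / len(clusters[s]) for col in zip(*clusters[s])]
        for s in seed_idx
    ]


patch_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.2], [0.1, 3.0]]
compact = aggregate_tokens(patch_tokens, seed_idx=[0, 1])
```

With 196 patch tokens and K=16, a gallery image would be stored as 16 vectors instead of 196, which is the storage saving the abstract refers to.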

[40] DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs cs.CV | cs.AI

Dikshant Kukreja, Kshitij Sah, Karan Goyal, Mukesh Mohania, Vikram Goyal

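TL;DR: This paper proposes DISSECT, a diagnostic benchmark that exposes the perception-integration gap in scientific vision-language models (VLMs): the model extracts visual information but loses it during downstream reasoning. The benchmark contains 12,000 chemistry and biology questions and evaluates 18 VLMs under five input modes. Open-source models perform better when reasoning from their own verbalized descriptions than from raw images, while closed-source models show no such gap.

Details

Motivation: VLMs in scientific domains suffer from a perception-integration gap: they correctly perceive image content (for example, identifying a molecular structure) but fail to integrate that visual information into reasoning. Existing benchmarks often conflate the two and cannot diagnose the specific cause of failure.

Result: Evaluating 18 VLMs on DISSECT across chemistry and biology shows that language priors are substantially less exploitable in chemistry than in biology, which indicates that molecular visual content is a stricter test of visual reasoning. Open-source models score higher when reasoning from their own verbalized descriptions than from raw images, whereas closed-source models show no such gap, revealing the integration bottleneck as a key difference between open-source and closed-source multimodal capability.

Insight: The contributions include the concept of the perception-integration gap, the DISSECT diagnostic benchmark, and its five input modes (notably the Model Oracle mode), which can be applied post hoc to any VLM evaluation to diagnose integration failures. Viewed objectively, the work provides a systematic methodology for decomposing multimodal model capability and underlines the importance of the integration step in visual reasoning.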

Abstract: When asked to describe a molecular diagram, a Vision-Language Model correctly identifies “a benzene ring with an -OH group.” When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes (Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description), yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. Evaluating 18 VLMs, we find that: (1) Chemistry exhibits substantially lower language-prior exploitability than Biology, confirming molecular visual content as a harder test of genuine visual reasoning; (2) Open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing a systematic integration bottleneck; and (3) Closed-source models show no such gap, indicating that bridging perception and integration is the frontier separating open-source from closed-source multimodal capability. The Model Oracle protocol is both model and benchmark agnostic, applicable post-hoc to any VLM evaluation to diagnose integration failures.

[41] Evolution of Video Generative Foundations cs.CV

Teng Hu, Jiangning Zhang, Hongrui Huang, Ran Yi, Zihan Su

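TL;DR: This survey systematically reviews the development of video generation, from early generative adversarial networks (GANs) to the now-dominant diffusion models and on to emerging auto-regressive (AR) models and multimodal techniques. It also discusses trends in multimodal video generation and applications in areas such as virtual reality and education.

Details

Motivation: Existing surveys of video generation are mostly confined to specific techniques (such as GANs or diffusion models) or tasks (such as video editing) and lack a comprehensive view of the field's evolution, especially regarding auto-regressive models and the integration of multimodal information. This survey aims to fill that gap.

Result: The paper reports no specific quantitative results; instead, it traces the field's technical development through an in-depth analysis of the foundational principles, key advances, and comparative strengths and limitations of each model family.

Insight: The novelty is a systematic survey of the evolution of video generation that places particular emphasis on the emerging trends of auto-regressive models and multimodal fusion, and that looks ahead to future research on building "world models" and on multiple application areas.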

Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI’s Sora, Google’s Veo3, and Bytedance’s Seedance to powerful open-source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building “world models” that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews on video generation often focus on narrow technical fields (e.g., Generative Adversarial Networks (GANs) and diffusion models) or specific tasks (e.g., video editing), lacking a comprehensive perspective on the field’s evolution, especially regarding Auto-Regressive (AR) models and the integration of multimodal information. To address these gaps, this survey first provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to dominant diffusion models, and further to emerging AR-based and multimodal techniques. We conduct an in-depth analysis of the foundational principles, key advancements, and comparative strengths/limitations. Then, we explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in video generation and its applications, including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models, in this rapidly evolving field. For more details, please refer to the project at https://github.com/sjtuplayer/Awesome-Video-Foundations.

[42] Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents cs.CV

Peng Huang, Yiming Wang, Yineng Chen, Liangqiao Gui, Hui Guo

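TL;DR: This paper proposes EchoTrust, an evidence-based Actor-Verifier reasoning framework for building trustworthy vision-language model (VLM) agents for echocardiography. It targets the template shortcuts and spurious explanations that arise when existing methods map video and question directly to an answer.

Details

Motivation: Echocardiography is essential for screening and diagnosing cardiovascular disease, but automated analysis is very challenging because of complex cardiac dynamics and strong view heterogeneity. Existing VLM methods usually reduce the task to a direct mapping from video and question to answer, which makes them vulnerable to template shortcuts and spurious explanations.

Result: The abstract reports no specific quantitative results, benchmarks, or state-of-the-art comparisons.

Insight: The novelty is an evidence-driven Actor-Verifier framework that generates a structured intermediate representation and has it analyzed by distinct roles, enabling more reliable and interpretable clinical decision support. This helps strengthen the trustworthiness of VLMs in high-stakes medical applications.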

Abstract: Echocardiography plays an important role in the screening and diagnosis of cardiovascular diseases. However, automated intelligent analysis of echocardiographic data remains challenging due to complex cardiac dynamics and strong view heterogeneity. In recent years, visual language models (VLM) have opened a new avenue for building ultrasound understanding systems for clinical decision support. Nevertheless, most existing methods formulate this task as a direct mapping from video and question to answer, making them vulnerable to template shortcuts and spurious explanations. To address these issues, we propose EchoTrust, an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents. EchoTrust produces a structured intermediate representation that is subsequently analyzed by distinct roles, enabling more reliable and interpretable decision-making for high-stakes clinical applications.

[43] DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images cs.CV | cs.AI | cs.MM | eess.IV

Gautham Vinod, Siddeshwar Raghavan, Bruce Coburn, Fengqing Zhu

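TL;DR: This paper proposes DietDelta, a vision-language approach to dietary assessment that performs food-item-level nutritional analysis from paired before-and-after eating images. It uses natural language prompts to localize food items and estimates their weight directly from a single RGB image, predicts weight differences with a two-stage training strategy, and outperforms existing methods on three public datasets.

Details

Motivation: Existing image-based dietary assessment methods usually rely on a single pre-consumption image, provide only coarse meal-level estimates, cannot determine what was actually consumed, and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation.

Result: Evaluated on three public datasets, the method achieves consistent improvements over existing approaches and establishes a strong baseline for before-and-after dietary image analysis.

Insight: The novelty is the use of natural language prompts instead of rigid segmentation masks to localize food, together with a two-stage training strategy that predicts weight differences directly from paired images. This simplifies the input requirements and improves the accuracy of food-item-level analysis.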

Abstract: Accurate dietary assessment is critical for precision nutrition, yet most image-based methods rely on a single pre-consumption image and provide only coarse, meal-level estimates. These approaches cannot determine what was actually consumed and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. In this paper, we propose a simple vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images. Instead of relying on rigid segmentation masks, our method leverages natural language prompts to localize specific food items and estimate their weight directly from a single RGB image. We further estimate food consumption by predicting weight differences between paired images using a two-stage training strategy. We evaluate our method on three publicly available datasets and demonstrate consistent improvements over existing approaches, establishing a strong baseline for before-and-after dietary image analysis.

[44] MTA-Agent: An Open Recipe for Multimodal Deep Search Agents cs.CV

Xiangyu Peng, Can Qin, An Yan, Xinyi Yang, Zeyuan Chen

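TL;DR: This paper proposes MTA-Agent, an open recipe for building multimodal deep-search agents. At its core, a multi-hop tool-augmented agent automatically generates and verifies high-quality, multi-step vision-language training data, addressing the limitations of existing MLLMs on complex reasoning tasks that require deep search and the integration of external knowledge.

Details

Motivation: To address the limited ability of multimodal large language models on complex multi-step reasoning tasks that require deep search and the integration of visual evidence with external knowledge.

Result: A 32B open-source multimodal search agent trained on the resulting MTA-Vision-DeepSearch dataset achieves state-of-the-art performance on six challenging benchmarks, averaging 54.63% and surpassing GPT-5, Gemini-2.5-Pro, and Gemini-3-Pro. Training also markedly improves the model's reasoning depth and tool-use behavior.

Insight: The novelty is a complete pipeline for automatically generating high-quality, verified multi-hop vision-language training data, together with a demonstration that replaying cached interactions effectively reduces training cost. The work offers a fully open recipe, including the dataset, training trajectories, and implementation details, to support reproducibility and future research.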

Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings. We further show that training on our data improves both reasoning depth and tool-use behavior, increasing the average number of steps from 2.27 to 4.28, and leading to more systematic and persistent search strategies. Additionally, we demonstrate that training can be performed without real-time tool calls by replaying cached interactions, significantly reducing training cost. Importantly, we present MTA-Agent as a fully open recipe for multimodal deep search: we release the entire dataset, training trajectories, and implementation details to enable reproducibility and future research on open multimodal search agents.

[45] Visual prompting reimagined: The power of the Activation Prompts cs.CV | cs.LG

Yihua Zhang, Hongkang Li, Yuguang Yao, Aochuan Chen, Shuai Zhang

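TL;DR: This paper proposes activation prompt (AP), a generalization of visual prompting (VP). AP applies a universal perturbation to the activation maps of intermediate layers rather than only to the input data, overcoming the inherent performance and efficiency limitations of conventional VP. Theoretical analysis and extensive experiments on 29 datasets and multiple model architectures show that AP outperforms VP and parameter-efficient fine-tuning baselines in both accuracy and efficiency (time, parameters, memory, throughput).

Details

Motivation: Visual prompting (VP) repurposes pretrained vision models for downstream tasks, but its performance lags noticeably behind conventional fine-tuning. The paper aims to understand and advance input-level VP in order to narrow this gap, exploring an area of VP that remains underdeveloped in both theory and practice.

Result: Extensive experiments on 29 datasets and various model architectures (including convolutional neural networks and vision transformers) show that AP outperforms VP and parameter-efficient fine-tuning baselines in both accuracy and efficiency, considering time, parameters, memory usage, and throughput.

Insight: The novelty is extending visual prompting from the input level to the activation maps of intermediate layers (AP), which reveals why input-level prompting may lack effectiveness. AP is closely related to normalization tuning in convolutional neural networks and vision transformers, yet each model type prefers different layers for prompting; the paper explains this preference theoretically by analyzing global features across layers.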

Abstract: Visual prompting (VP) has emerged as a popular method to repurpose pretrained vision models for adaptation to downstream tasks. Unlike conventional model fine-tuning techniques, VP introduces a universal perturbation directly into the input data to facilitate task-specific fine-tuning rather than modifying model parameters. However, there exists a noticeable performance gap between VP and conventional fine-tuning methods, highlighting an unexplored realm in theory and practice to understand and advance the input-level VP to reduce its current performance gap. Towards this end, we introduce a generalized concept, termed activation prompt (AP), which extends the scope of the input-level VP by enabling universal perturbations to be applied to activation maps within the intermediate layers of the model. By using AP to revisit the problem of VP and employing it as an analytical tool, we demonstrate the intrinsic limitations of VP in both performance and efficiency, revealing why input-level prompting may lack effectiveness compared to AP, which exhibits a model-dependent layer preference. We show that AP is closely related to normalization tuning in convolutional neural networks and vision transformers, although each model type has distinct layer preferences for prompting. We also theoretically elucidate the rationale behind such a preference by analyzing global features across layers. Through extensive experiments across 29 datasets and various model architectures, we provide a comprehensive performance analysis of AP, comparing it with VP and parameter-efficient fine-tuning baselines. Our results demonstrate AP’s superiority in both accuracy and efficiency, considering factors such as time, parameters, memory usage, and throughput.

[46] PhysHead: Simulation-Ready Gaussian Head Avatars cs.CV

Berna Kabadayi, Vanessa Sklyarova, Wojciech Zielonka, Justus Thies, Gerard Pons-Moll

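TL;DR: This paper proposes PhysHead, a hybrid representation for animatable head avatars that combines a 3D parametric head mesh with strand-based hair to enable realistic hair dynamics simulation. Appearance is modeled with Gaussian primitives attached to the head mesh and hair segments, and VLM-based models generate the appearance of regions that are occluded in the dynamic training sequences, yielding photorealistic head avatars with physically plausible hair motion such as wind-blown effects.

Details

Motivation: Existing head avatar methods usually assume rigid hair motion, fail to disentangle hair from the head, and represent hair as a simple outer shell that cannot capture its natural volumetric behavior. The paper aims to overcome these limitations and produce animatable head avatars with realistic hair dynamics.

Result: Quantitative and qualitative studies demonstrate the capabilities of the proposed model and compare it with existing baselines. The results show that, in addition to expression and camera control, the method can synthesize physically plausible hair motion.

Insight: The core novelty is a hybrid representation that combines a layered 3D Gaussian representation, a parametric mesh, and physically simulable hair strands, together with a new training scheme that uses VLM-based models to handle the appearance of regions occluded during dynamic training. This enables realistic modeling and simulation of hair's volumetric behavior and dynamics.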

Abstract: Realistic digital avatars require expressive and dynamic hair motion; however, most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. At the core is a 3D Gaussian-based layered representation of the head. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. In particular, we propose the use of VLM-based models to generate appearance of regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method can synthesize physically plausible hair motion besides expression and camera control.

[47] LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation cs.CV | eess.IV

Shuai Li, Huibin Bai, Yanbo Gao, Chong Lv, Hui Yuan

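TL;DR: This paper proposes LiftFormer, a monocular depth estimation method based on lifting theory and frame theory. It constructs a depth-oriented geometric representation subspace and an edge-aware representation subspace to map color features to geometric depth values and to improve depth prediction accuracy around edges.

Details

Motivation: Monocular depth estimation is a highly ill-posed problem, and existing methods are prone to errors when predicting depth maps from monocular images or video, especially around depth edges. The paper aims to bridge image color features and depth values through intermediate subspaces and to handle edge-region prediction specifically.

Result: Experiments show that LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of the two proposed lifting modules.

Insight: The novelty is recasting depth value prediction as feature representation in a depth-oriented geometric representation subspace, and using frame theory to build a redundant and robust subspace. A separate edge-aware subspace enhances local edge features. Together these offer a new subspace-representation-based geometric learning paradigm for monocular depth estimation.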

Abstract: Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.

[48] VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography cs.CV

Ilerioluwakiiye Abolade, Prince Mireku, Kelechi Chibundu, Peace Ododo, Emmanuel Idoko

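TL;DR: This paper proposes VAMAE, a vessel-aware masked autoencoding framework for self-supervised pretraining on OCTA images. It uses anatomically informed masking based on vesselness and skeleton cues to emphasize vessel-rich regions, and multi-target reconstruction to capture appearance, structural, and topological information. On the OCTA-500 benchmark, the pretraining strategy outperforms standard masked autoencoding baselines on several vessel segmentation tasks, particularly in limited-label settings.

Details

Motivation: Vessel structures in OCTA images are sparse and subject to strong topological constraints. Existing self-supervised methods such as masked autoencoders are designed mainly for dense natural images and rely on uniform masking and pixel-level reconstruction, which capture vascular geometry poorly.

Result: Across several vessel segmentation tasks on the OCTA-500 benchmark, vessel-aware masking and multi-target reconstruction yield consistent improvements over standard masked autoencoding baselines, especially under limited supervision.

Insight: The contributions are an anatomically informed masking strategy (using vesselness and skeleton cues to emphasize key regions) and a multi-target reconstruction pretraining objective. These help the model learn vascular connectivity and branching patterns and suggest the potential of geometry-aware self-supervised learning for OCTA analysis.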

Abstract: Optical coherence tomography angiography (OCTA) provides non-invasive visualization of retinal microvasculature, but learning robust representations remains challenging due to sparse vessel structures and strong topological constraints. Many existing self-supervised learning approaches, including masked autoencoders, are primarily designed for dense natural images and rely on uniform masking and pixel-level reconstruction, which may inadequately capture vascular geometry. We propose VAMAE, a vessel-aware masked autoencoding framework for self-supervised pretraining on OCTA images. The approach incorporates anatomically informed masking that emphasizes vessel-rich regions using vesselness and skeleton-based cues, encouraging the model to focus on vascular connectivity and branching patterns. In addition, the pretraining objective includes reconstructing multiple complementary targets, enabling the model to capture appearance, structural, and topological information. We evaluate the proposed pretraining strategy on the OCTA-500 benchmark for several vessel segmentation tasks under varying levels of supervision. The results indicate that vessel-aware masking and multi-target reconstruction provide consistent improvements over standard masked autoencoding baselines, particularly in limited-label settings, suggesting the potential of geometry-aware self-supervised learning for OCTA analysis.
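
The vessel-aware masking idea can be illustrated with a minimal sketch. It assumes a per-patch vesselness score is already available; the abstract does not say how the scores or the sampling weights are computed, so the function name `vessel_aware_mask`, the `bias` parameter, and the weighting rule below are hypothetical and are not the authors' implementation.

```python
import random


def vessel_aware_mask(patch_scores, mask_ratio=0.75, bias=1.0, seed=0):
    """Pick patch indices to mask, favouring vessel-rich patches.

    patch_scores: per-patch vesselness in [0, 1] (assumed precomputed).
    bias: 0 gives uniform masking; larger values favour vessel-rich patches.
    """
    rng = random.Random(seed)
    n_mask = int(round(mask_ratio * len(patch_scores)))
    # Hypothetical weighting: a uniform floor plus a vesselness-dependent term.
    weights = [1.0 + bias * s for s in patch_scores]
    remaining = list(range(len(patch_scores)))
    masked = []
    for _ in range(n_mask):
        # Weighted sampling without replacement, one patch at a time.
        current = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        masked.append(pick)
        remaining.remove(pick)
    return sorted(masked)
```

Setting `bias=0` recovers the uniform masking used by standard masked autoencoders, which makes the two strategies easy to compare in an ablation.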

[49] Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels cs.CV | cs.LG

Yaqi Zhao, Haoliang Sun, Yating Wang, Yongshun Gong, Yilong Yin

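TL;DR: This paper proposes Holistic Optimal Label Selection (HopS) to address label ambiguity and insufficient supervision in prompt learning when only partial labels are available. It combines local density-based filtering with a global selection objective based on optimal transport, selecting labels robustly from both local and global perspectives and thereby improving the adaptation of pre-trained vision-language models to downstream tasks.

Details

Motivation: When only partial labels are available, prompt learning performance is often limited by label ambiguity and insufficient supervisory information, so a method is needed that exploits the generalization ability of pre-trained feature encoders to improve label selection.

Result: Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines.

Insight: The contribution is a holistic label selection framework that combines local structural regularities (the most frequent labels in the nearest neighbors' candidate sets, plus softmax scores) with global distribution matching (optimal transport that maps a uniform sampling distribution to the candidate label distributions across a batch), offering a practical solution for prompt learning in weakly supervised settings.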

Abstract: Prompt learning has gained significant attention as a parameter-efficient approach for adapting large pre-trained vision-language models to downstream tasks. However, when only partial labels are available, its performance is often limited by label ambiguity and insufficient supervisory information. To address this issue, we propose Holistic Optimal Label Selection (HopS), leveraging the generalization ability of pre-trained feature encoders through two complementary strategies. First, we design a local density-based filter that selects the top frequent labels from the nearest neighbors’ candidate sets and uses the softmax scores to identify the most plausible label, capturing structural regularities in the feature space. Second, we introduce a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions across a batch. By minimizing the expected transport cost, it can determine the most likely label assignments. These two strategies work together to provide robust label selection from both local and global perspectives. Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. Those results highlight the merit of holistic label selection and offer a practical solution for prompt learning in weakly supervised settings.
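
The global selection step can be sketched with entropy-regularized optimal transport solved by Sinkhorn scaling. This is an illustration under stated assumptions, not the HopS algorithm itself: the cost `-log p`, the column marginal taken from candidate-set frequencies, and the names `select_labels_by_transport`, `eps`, and `n_iters` are all hypothetical choices that the abstract does not specify.

```python
def select_labels_by_transport(probs, candidates, eps=0.5, n_iters=100):
    """Pick one label per sample from its candidate set via entropic OT.

    probs: n x k softmax scores (assumed strictly positive).
    candidates: list of n non-empty sets of candidate label indices.
    Returns (labels, plan), where plan is the n x k transport plan.
    """
    n, k = len(probs), len(probs[0])
    row = [1.0 / n] * n  # uniform distribution over the samples in the batch
    # Assumed column marginal: how often each label occurs in candidate sets.
    counts = [sum(1 for c in candidates if j in c) for j in range(k)]
    col = [cnt / sum(counts) for cnt in counts]
    # Gibbs kernel exp(-cost / eps) with cost = -log p; zero outside candidates.
    kernel = [[probs[i][j] ** (1.0 / eps) if j in candidates[i] else 0.0
               for j in range(k)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale columns, then rows, to match the marginals.
        for j in range(k):
            denom = sum(kernel[i][j] * u[i] for i in range(n))
            v[j] = col[j] / denom if denom > 0 else 0.0
        for i in range(n):
            denom = sum(kernel[i][j] * v[j] for j in range(k))
            u[i] = row[i] / denom if denom > 0 else 0.0
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    # Each sample takes the candidate label that receives the most mass.
    labels = [max(sorted(candidates[i]), key=lambda j: plan[i][j])
              for i in range(n)]
    return labels, plan
```

Restricting the kernel to candidate labels guarantees that a sample is never assigned a label outside its candidate set, while the column marginal discourages assigning every ambiguous sample to the same label.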

[50] WeatherRemover: All-in-one Adverse Weather Removal with Multi-scale Feature Map Compression cs.CV

Weikai Qu, Sijun Liang, Cheng Pan, Zikuan Yang, Guanchi Zhou

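TL;DR: This paper presents WeatherRemover, an all-in-one image restoration model for removing the effects of multiple adverse weather conditions such as rain, snow, and fog. The model adopts a UNet-like structure with gating mechanisms and a multi-scale pyramid vision Transformer, aiming to improve image quality efficiently while balancing parameter count, computational overhead, and memory usage.

Details

Motivation: Most existing methods restore images for a single weather condition, and those that handle multiple conditions often neglect efficiency (parameter count, inference time, memory cost), which makes deployment difficult. The paper aims to develop a model that handles multiple adverse weather conditions while balancing efficiency and restoration quality.

Result: The paper claims that its lightweight model achieves an optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage, meeting practical application demands. The abstract gives no specific quantitative results (such as PSNR/SSIM) or benchmark names.

Insight: The contributions include: 1) gating mechanisms in the feed-forward and downsampling stages that adaptively select essential information and handle redundancy; 2) a multi-scale pyramid vision Transformer with channel-wise attention to optimize feature extraction; 3) linear spatial reduction to lower the cost of attention computation. Viewed objectively, the core contribution is combining an efficient Transformer structure with gating mechanisms and systematically optimizing computational efficiency to obtain a practical multi-weather restoration solution.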

Abstract: Photographs taken in adverse weather conditions often suffer from blurriness, occlusion, and low brightness due to interference from rain, snow, and fog. These issues can significantly hinder the performance of subsequent computer vision tasks, making the removal of weather effects a crucial step in image enhancement. Existing methods primarily target specific weather conditions, with only a few capable of handling multiple weather scenarios. However, mainstream approaches often overlook performance considerations, resulting in large parameter sizes, long inference times, and high memory costs. In this study, we introduce the WeatherRemover model, designed to enhance the restoration of images affected by various weather conditions while balancing performance. Our model adopts a UNet-like structure with a gating mechanism and a multi-scale pyramid vision Transformer. It employs channel-wise attention derived from convolutional neural networks to optimize feature extraction, while linear spatial reduction helps curtail the computational demands of attention. The gating mechanisms, strategically placed within the feed-forward and downsampling phases, refine the processing of information by selectively addressing redundancy and mitigating its influence on learning. This approach facilitates the adaptive selection of essential data, ensuring superior restoration and maximizing efficiency. Additionally, our lightweight model achieves an optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage, distinguishing it from other multi-weather models, thereby meeting practical application demands effectively. The source code is available at https://github.com/RICKand-MORTY/WeatherRemover.

[51] Controllable Generative Video Compression cs.CV

Ding Ding, Daowen Li, Ying Chen, Yixin Gao, Ruixiao Dong

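TL;DR: This paper proposes Controllable Generative Video Compression (CGVC), which combines keyframe coding with dense per-frame control priors to preserve both signal fidelity and perceptual quality in generated video. Non-keyframes are reconstructed by a controllable video generation model guided by structural and semantic priors, and a color-distance-guided keyframe selection algorithm adaptively chooses keyframes to recover color information accurately. Experiments show that CGVC outperforms previous perceptual video compression methods in both signal fidelity and perceptual quality.

Details

Motivation: To resolve the trade-off between perceptual realism and signal fidelity in perceptual video compression, reproducing the visual signal faithfully while improving perceptual quality.

Result: In experiments, CGVC outperforms previous perceptual video compression methods in both signal fidelity and perceptual quality. The abstract does not name specific benchmarks or state whether the results are state of the art.

Insight: The contributions include the controllable generative video compression paradigm, which guides generation with keyframe structural priors and dense per-frame control priors; a color-distance-guided keyframe selection algorithm for better color recovery; and the use of a controllable video generation model to ensure temporal and content consistency.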

Abstract: Perceptual video compression adopts generative video modeling to improve perceptual realism but frequently sacrifices signal fidelity, diverging from the goal of video compression, which is to faithfully reproduce the visual signal. To alleviate the dilemma between perception and fidelity, in this paper we propose the Controllable Generative Video Compression (CGVC) paradigm to faithfully generate details guided by multiple visual conditions. Under the paradigm, representative keyframes of the scene are coded and used to provide structural priors for non-keyframe generation. A dense per-frame control prior is additionally coded to better preserve the finer structure and semantics of each non-keyframe. Guided by these priors, non-keyframes are reconstructed by a controllable video generation model with temporal and content consistency. Furthermore, to accurately recover the color information of the video, we develop a color-distance-guided keyframe selection algorithm to adaptively choose keyframes. Experimental results show that CGVC outperforms previous perceptual video compression methods in terms of both signal fidelity and perceptual quality.

[52] GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation cs.CV

Chung-Ming Lo, I-Yun Liu, Wei-Yang Lin

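TL;DR: This paper proposes GPAFormer, a lightweight network for 3D medical image segmentation. With two core modules, multi-scale attention-guided stacked aggregation (MASA) and the mutual-aware patch graph aggregator (MPGA), it substantially improves computational efficiency while maintaining high accuracy.

Details

Motivation: To address the difficulty of achieving both accuracy and computational efficiency in 3D medical image segmentation, which stems from diverse imaging modalities, high-dimensional data, and heterogeneous anatomical structures, particularly in multi-organ segmentation.

Result: On the public BTCV, Synapse, ACDC, and BraTS datasets, GPAFormer achieves the highest Dice similarity coefficient (DSC) with only 1.81M parameters, for example 75.70% on BTCV. Inference on a single BTCV validation case takes less than one second on a consumer-grade GPU, balancing accuracy and efficiency.

Insight: The contribution is a graph-guided dynamic feature aggregation mechanism (MPGA) that improves discrimination of the internal and boundary structures of organs, combined with multi-scale parallel paths (MASA) for structures of varying sizes, providing an efficient lightweight segmentation solution for resource-constrained clinical environments.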

Abstract: Deep learning has been widely applied to 3D medical image segmentation tasks. However, due to the diversity of imaging modalities, the high-dimensional nature of the data, and the heterogeneity of anatomical structures, achieving both segmentation accuracy and computational efficiency in multi-organ segmentation remains a challenge. This study proposed GPAFormer, a lightweight network architecture specifically designed for 3D medical image segmentation, emphasizing efficiency while keeping high accuracy. GPAFormer incorporated two core modules: the multi-scale attention-guided stacked aggregation (MASA) and the mutual-aware patch graph aggregator (MPGA). MASA utilized three parallel paths with different receptive fields, combined through planar aggregation, to enhance the network’s capability in handling structures of varying sizes. MPGA employed a graph-guided approach to dynamically aggregate regions with similar feature distributions based on inter-patch feature similarity and spatial adjacency, thereby improving the discrimination of both internal and boundary structures of organs. Experiments were performed on public whole-body CT and MRI datasets including BTCV, Synapse, ACDC, and BraTS. Compared to existing 3D segmentation networks, GPAFormer, using only 1.81 M parameters, achieved the highest overall DSC on BTCV (75.70%), Synapse (81.20%), ACDC (89.32%), and BraTS (82.74%). On a consumer-level GPU, inference for one BTCV validation case took less than one second. The results demonstrated that GPAFormer balanced accuracy and efficiency in multi-organ, multi-modality 3D segmentation tasks across various clinical scenarios, especially for resource-constrained and time-sensitive clinical environments.
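
For readers unfamiliar with the reported metric, the Dice similarity coefficient for a binary mask is DSC = 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, and the paper's evaluation protocol (per-organ averaging, for instance) is not described in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / size_sum


# One of two predicted voxels overlaps one of two target voxels.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```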

[53] VDPP: Video Depth Post-Processing for Speed and Scalability cs.CV

Daewon Yoon, Injun Baek, Sangyu Han, Yearim Kim, Nojun Kwak

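TL;DR: This paper proposes VDPP, a post-processing framework for video depth estimation that shifts computation from expensive scene reconstruction to targeted geometric refinement in low-resolution space, substantially improving speed and accuracy while maintaining temporal coherence.

Details

Motivation: Existing end-to-end video depth models are tightly coupled and slow to integrate better single-image depth estimators, while existing post-processing methods fall short in speed, accuracy, and RGB reliance, which limits their practicality.

Result: VDPP runs at more than 43.5 FPS on an NVIDIA Jetson Orin Nano, matches the temporal coherence of end-to-end systems, and achieves a superior balance of speed, accuracy, and memory efficiency.

Insight: The contributions are a purely geometric post-processing paradigm, operation in low-resolution space, and an RGB-free architecture, which make the framework efficient, practical for real-time edge deployment, and plug-and-play compatible with any image depth model.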

Abstract: Video depth estimation is essential for providing 3D scene structure in applications ranging from autonomous driving to mixed reality. Although current end-to-end (E2E) models have achieved state-of-the-art performance, they function as tightly coupled systems that suffer from a significant adaptation lag whenever superior single-image depth estimators are released. To mitigate this issue, post-processing methods such as NVDS offer a modular plug-and-play alternative to incorporate any evolving image depth model without retraining. However, existing post-processing methods still struggle to match the efficiency and practicality of E2E systems due to limited speed, accuracy, and RGB reliance. In this work, we revitalize the role of post-processing by proposing VDPP (Video Depth Post-Processing), a framework that improves the speed and accuracy of post-processing methods for video depth estimation. By shifting the paradigm from computationally expensive scene reconstruction to targeted geometric refinement, VDPP operates purely on geometric refinements in low-resolution space. This design achieves exceptional speed (>43.5 FPS on NVIDIA Jetson Orin Nano) while matching the temporal coherence of E2E systems, with dense residual learning driving geometric representations rather than full reconstructions. Furthermore, VDPP’s RGB-free architecture ensures true scalability, enabling immediate integration with any evolving image depth model. Our results demonstrate that VDPP provides a superior balance of speed, accuracy, and memory efficiency, making it the most practical solution for real-time edge deployment. Our project page is at https://github.com/injun-baek/VDPP

[54] RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection cs.CV

Hui Li, Peien Ding, Jun Li, Guoqi Ma, Zhanyu Liu

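TL;DR: This paper proposes RASR (Retrieval-Augmented Semantic Reasoning), a framework for multimodal fake news video detection. A cross-instance semantic parser and retriever fetches associative evidence from a dynamic memory bank, domain priors drive an expert multimodal large language model to generate in-depth analysis reports, and a multi-view feature decoupling and fusion module integrates features adaptively to improve detection performance.

Details

Motivation: Existing methods lack cross-instance global semantic correlations, so they struggle to use historical associative evidence to verify the current video. Semantic discrepancies across domains also hinder the transfer of general knowledge, and guidance from domain-specific expert knowledge is missing.

Result: Extensive experiments on the FakeSV and FakeTT datasets show that RASR significantly outperforms state-of-the-art baselines, achieves superior cross-domain generalization, and improves overall detection accuracy by up to 0.93%.

Insight: The contributions include a cross-instance semantic parser and retriever that builds a dynamic memory bank to exploit historical evidence, domain priors that guide a multimodal large language model toward domain-aware in-depth reasoning, and multi-view feature decoupling and fusion for adaptive feature integration, which together improve detection robustness and accuracy.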

Abstract: Multimodal fake news video detection is a crucial research direction for maintaining the credibility of online information. Existing studies primarily verify content authenticity by constructing multimodal feature fusion representations or utilizing pre-trained language models to analyze video-text consistency. However, these methods still face the following limitations: (1) lacking cross-instance global semantic correlations, making it difficult to effectively utilize historical associative evidence to verify the current video; (2) semantic discrepancies across domains hinder the transfer of general knowledge, lacking the guidance of domain-specific expert knowledge. To this end, we propose a novel Retrieval-Augmented Semantic Reasoning (RASR) framework. First, a Cross-instance Semantic Parser and Retriever (CSPR) deconstructs the video into high-level semantic primitives and retrieves relevant associative evidence from a dynamic memory bank. Subsequently, a Domain-Guided Multimodal Reasoning (DGMP) module incorporates domain priors to drive an expert multimodal large language model in generating domain-aware, in-depth analysis reports. Finally, a Multi-View Feature Decoupling and Fusion (MVDFF) module integrates multi-dimensional features through an adaptive gating mechanism to achieve robust authenticity determination. Extensive experiments on the FakeSV and FakeTT datasets demonstrate that RASR significantly outperforms state-of-the-art baselines, achieves superior cross-domain generalization, and improves the overall detection accuracy by up to 0.93%.

[55] Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation cs.CV | cs.CL

Jianing Zhang, Runan Li, Honglin Pang, Ding Xia, Zhou Zhu

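TL;DR: This paper proposes an agent-driven vision-language model (VLM) framework for interpreting Oracle Bone Script (OBS). The framework uses a VLM for precise visual grounding and an LLM-based agent that automates a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference to produce linguistically accurate interpretations. To support it, the paper also introduces OB-Radix, an expert-annotated dataset with image and semantic data for characters and components.

Details

Motivation: Existing methods treat OBS decipherment as closed-set image recognition and fail to bridge the "interpretation gap": individual characters are unique and rare, yet they are composed of a limited set of recurring pictographic components that carry transferable semantics. The paper aims to exploit this structural logic for more accurate interpretation.

Result: Evaluation on three benchmarks covering different tasks shows that the framework produces more detailed and precise interpretations than baseline methods.

Insight: The contribution is reframing OBS interpretation from closed-set image recognition to multimodal reasoning over component structure, proposing component-grounded multimodal knowledge augmentation, and building OB-Radix, a structured dataset with fine-grained component semantics, which offers a new approach to domain specialization of large models.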

Abstract: Deciphering ancient Chinese Oracle Bone Script (OBS) is a challenging task that offers insights into the beliefs, systems, and culture of the ancient era. Existing approaches treat decipherment as a closed-set image recognition problem, which fails to bridge the "interpretation gap": while individual characters are often unique and rare, they are composed of a limited set of recurring, pictographic components that carry transferable semantic meanings. To leverage this structural logic, we propose an agent-driven Vision-Language Model (VLM) framework that integrates a VLM for precise visual grounding with an LLM-based agent to automate a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. To support this, we also introduce OB-Radix, an expert-annotated dataset providing structural and semantic data absent from prior corpora, comprising 1,022 character images (934 unique characters) and 1,853 fine-grained component images across 478 distinct components with verified explanations. By evaluating our system across three benchmarks of different tasks, we demonstrate that our framework yields more detailed and precise decipherments compared to baseline methods.

[56] Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning cs.CV

Jiahua Chen, Qihong Tang, Weinong Wang, Qi Fan

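TL;DR: This paper proposes a training-free framework that strengthens multimodal large language models on complex 3D spatial reasoning through a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The framework first reconstructs a high-fidelity 3D mesh from a single image, then uses an external knowledge base to iteratively compute optimal camera extrinsics and synthesize novel views, emulating human perspective-taking.

Details

Motivation: Existing MLLMs rely on 2D visual priors and fall short on complex 3D spatial reasoning. Existing remedies are either computationally expensive and limited by data, or lack explicit geometric understanding and viewpoint flexibility. The paper aims to address these challenges.

Result: On major benchmarks such as 3DSRBench and Rel3D, the framework significantly improves spatial understanding, outperforming specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash.

Insight: The contribution is a training-free Visual Chain-of-Thought framework grounded in explicit 3D reconstruction: MLLM-guided keyword extraction and mask generation at multiple granularities drive the 3D reconstruction, and external knowledge is used to iteratively synthesize novel views that emulate human perspective-taking, improving spatial understanding. The method avoids expensive post-training and offers more flexible geometric understanding.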

Abstract: Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.

[57] URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection cs.CV | cs.AI | cs.MM

Zhenyu Wang, Weichen Cheng, Weijia Li, Junjie Mou, Zongyou Zhao

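TL;DR: This paper proposes Uncertainty-aware Robust Multimodal Fusion (URMF), a framework for multimodal sarcasm detection (MSD) that addresses differences in modality reliability caused by ambiguous text or weakly relevant images. It models modality uncertainty to dynamically regulate each modality's contribution during fusion and combines several training objectives to improve accuracy and robustness.

Details

Motivation: Existing MSD methods usually assume all modalities are equally reliable, but on real social media text can be ambiguous and images weakly relevant or irrelevant, so deterministic fusion introduces noisy evidence and weakens robust reasoning.

Result: On public MSD benchmarks, URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating that uncertainty-aware fusion improves both accuracy and robustness.

Insight: The contributions include explicit modeling of modality reliability, with unified modality uncertainty estimation via parameterized Gaussian posteriors; uncertainty-based dynamic regulation of fusion weights; and a joint training objective combining task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning.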

Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even irrelevant, causing deterministic fusion to introduce noisy evidence and weaken robust reasoning. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework that explicitly models modality reliability during interaction and fusion. URMF first employs multi-head cross-attention to inject visual evidence into textual representations, followed by multi-head self-attention in the fused semantic space to enhance incongruity-aware reasoning. It then performs unified unimodal aleatoric uncertainty modeling over text, image, and interaction-aware latent representations by parameterizing each modality as a learnable Gaussian posterior. The estimated uncertainty is further used to dynamically regulate modality contributions during fusion, suppressing unreliable modalities and yielding a more robust joint representation. In addition, we design a joint training objective integrating task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning. Experiments on public MSD benchmarks show that URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating the effectiveness of uncertainty-aware fusion for improving both accuracy and robustness.
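
One common way to let estimated uncertainty regulate modality contributions is precision (inverse-variance) weighting, sketched below. The abstract does not state URMF's actual fusion rule, so this is a stand-in technique chosen for illustration; the function name and the use of a single scalar variance per modality are assumptions.

```python
def precision_weighted_fusion(means, variances):
    """Fuse per-modality feature vectors, down-weighting uncertain modalities.

    means: list of equal-length feature vectors, one per modality.
    variances: list of positive scalar variances, one per modality
               (assumed to come from a learned Gaussian posterior).
    Returns (fused_vector, weights).
    """
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    # Weights sum to 1; a larger variance yields a smaller weight.
    weights = [p / total for p in precisions]
    dim = len(means[0])
    fused = [sum(w * m[d] for w, m in zip(weights, means)) for d in range(dim)]
    return fused, weights


text_feat, image_feat = [1.0, 0.0], [0.0, 1.0]
# The image modality is three times as uncertain, so it gets weight 0.25.
fused, weights = precision_weighted_fusion([text_feat, image_feat], [1.0, 3.0])
```

With this rule, an irrelevant image that the model flags as highly uncertain contributes little to the joint representation, which matches the behavior the abstract describes as suppressing unreliable modalities.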

[58] DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting cs.CV

Hantang Li, Qiang Zhu, Xiandong Meng, Debin Zhao, Xiaopeng Fan

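TL;DR: This paper proposes DOC-GS, a framework that jointly models and calibrates the reliability of Gaussian primitives across two domains (optimization and observation) to address overfitting, structural distortion, and translucent haze-like artifacts in sparse-view 3D Gaussian Splatting reconstruction.

Details

Motivation: Sparse-view 3D Gaussian Splatting reconstruction is ill-posed because geometric supervision is insufficient, leading to severe overfitting and artifacts. Existing dropout-based regularization lacks a unified understanding of how artifacts form. The core challenge is that the reliability of Gaussian primitives is unobservable.

Result: The abstract reports no specific quantitative results or benchmarks. The method aims to suppress artifacts and improve reconstruction quality through reliability modeling and calibration.

Insight: The contribution is approaching the problem from the new perspective that Gaussian primitive reliability is unobservable, and proposing a dual-domain framework. In the optimization domain, a continuous depth-guided dropout strategy models reliability explicitly as a smooth depth-aware inductive bias. In the observation domain, floater artifacts are linked to atmospheric scattering, the Dark Channel Prior serves as a structural consistency cue for identifying anomalous regions, and reliability-driven geometric pruning is performed based on cross-view evidence.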

Abstract: Sparse-view reconstruction with 3D Gaussian Splatting (3DGS) is fundamentally ill-posed due to insufficient geometric supervision, often leading to severe overfitting and the emergence of structural distortions and translucent haze-like artifacts. While existing approaches attempt to alleviate this issue via dropout-based regularization, they are largely heuristic and lack a unified understanding of artifact formation. In this paper, we revisit sparse-view 3DGS reconstruction from a new perspective and identify the core challenge as the unobservability of Gaussian primitive reliability. Unreliable Gaussians are insufficiently constrained during optimization and accumulate as haze-like degradations in rendered images. Motivated by this observation, we propose a unified Dual-domain Observation and Calibration (DOC-GS) framework that models and corrects Gaussian reliability through the synergy of optimization-domain inductive bias and observation-domain evidence. Specifically, in the optimization domain, we characterize Gaussian reliability by the degree to which each primitive is constrained during training, and instantiate this signal via a Continuous Depth-Guided Dropout (CDGD) strategy, where the dropout probability serves as an explicit proxy for primitive reliability. This imposes a smooth depth-aware inductive bias to suppress weakly constrained Gaussians and improve optimization stability. In the observation domain, we establish a connection between floater artifacts and atmospheric scattering, and leverage the Dark Channel Prior (DCP) as a structural consistency cue to identify and accumulate anomalous regions. Based on cross-view aggregated evidence, we further design a reliability-driven geometric pruning strategy to remove low-confidence Gaussians.
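
The Dark Channel Prior mentioned above has a simple definition from the dehazing literature: for each pixel, take the minimum over color channels and then the minimum over a local patch. The sketch below computes that map for a small image stored as nested lists. How DOC-GS thresholds, accumulates, or aggregates the map across views is not stated in the abstract, so only the standard dark channel itself is shown, and the function name and default patch size are illustrative.

```python
def dark_channel(image, patch=3):
    """Dark channel of an H x W x 3 image given as nested lists of floats.

    For every pixel, returns the minimum intensity over all channels and
    over all pixels in the surrounding patch (clipped at the borders).
    """
    h, w = len(image), len(image[0])
    r = patch // 2
    # Step 1: per-pixel minimum over the color channels.
    min_rgb = [[min(px) for px in row] for row in image]
    # Step 2: minimum filter over the local patch.
    dark = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dark[y][x] = min(
                min_rgb[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return dark
```

In haze-free regions the dark channel tends toward zero, so persistently high values in a rendered view are a plausible cue for the haze-like floaters the abstract describes.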

[59] LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video cs.CV

Pedro Quesado, Erkut Akdag, Yasaman Kashefbahrami, Willem Menu, Egor Bondarev

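TL;DR: LiveStre4m is a feed-forward model for real-time novel view synthesis from unposed sparse multi-view video. It targets the problem that dynamic scene representation methods depend on known camera parameters and lengthy optimization, which rules out live streaming. The method uses a multi-view vision transformer for keyframe 3D scene reconstruction, a diffusion-transformer interpolation module for temporal consistency, and a camera pose predictor that estimates camera parameters directly from RGB images, enabling real-time, temporally consistent novel-view video streaming from as few as two uncalibrated synchronized input streams.

Details

Motivation: To address live-streaming novel view synthesis from unposed multi-view video. Existing dynamic scene representation methods require ground-truth camera parameters and time-consuming optimization, making them unsuitable for real-time streaming.

Result: At 1024×768 resolution, the average reconstruction time is 0.07 seconds per frame, which the paper reports as orders of magnitude faster in runtime than optimization-based dynamic scene representation methods, achieving real-time performance.

Insight: The contributions include: 1) a feed-forward architecture combining a multi-view vision transformer with diffusion-transformer interpolation for fast, temporally consistent rendering; 2) a camera pose predictor module that estimates intrinsics and extrinsics directly from RGB images without known calibration; 3) a complete system that synthesizes novel-view video in real time from only a few uncalibrated input streams, a significant step toward deployable real-time novel view synthesis.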

Abstract: Live-streaming Novel View Synthesis (NVS) from unposed multi-view video remains an open challenge in a wide range of applications. Existing methods for dynamic scene representation typically require ground-truth camera parameters and involve lengthy optimizations ($\approx 2.67$ s), which makes them unsuitable for live streaming scenarios. To address this issue, we propose a novel viewpoint video live-streaming method (LiveStre4m), a feed-forward model for real-time NVS from unposed sparse multi-view inputs. LiveStre4m introduces a multi-view vision transformer for keyframe 3D scene reconstruction coupled with a diffusion-transformer interpolation module that ensures temporal consistency and stable streaming. In addition, a Camera Pose Predictor module is proposed to efficiently estimate both poses and intrinsics directly from RGB images, removing the reliance on known camera calibration information. Our approach enables temporally consistent novel-view video streaming in real time using as few as two synchronized unposed input streams. LiveStre4m attains an average reconstruction time of $0.07$ s per frame at $1024 \times 768$ resolution, outperforming the optimization-based dynamic scene representation methods by orders of magnitude in runtime. These results demonstrate that LiveStre4m makes real-time NVS streaming feasible in practical settings, marking a substantial step toward deployable live novel-view synthesis systems. Code available at: https://github.com/pedro-quesado/LiveStre4m

[60] How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study cs.CV

Roberto Brusnicki, Mattia Piccinini, Johannes Betz

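TL;DR: This paper introduces VENUSS, a framework for systematically evaluating and analyzing the sensitivity of vision-language model performance on sequential driving scenes, revealing significant weaknesses of existing models in understanding dynamics and temporal relations.

Details

Motivation: Vision-language models are increasingly used for autonomous driving tasks, but their performance on sequential driving scenes, and in particular how input configurations affect their capabilities, has not been well characterized.

Result: Comparing more than 25 existing VLMs across more than 2,600 scenarios, the study finds that even top models reach only 57% accuracy, below the 65% achieved by humans under similar constraints, exposing significant capability gaps.

Insight: The contribution is the first systematic analysis of how input image configurations (resolution, frame count, temporal intervals, spatial layouts, and presentation modes) affect VLM performance on sequential driving scenes, along with baselines for future research.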

Abstract: Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance in similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding the vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://V3NU55.github.io

[61] FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching cs.CV

Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li

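TL;DR: FlowInOne is a unified multimodal generation framework that converts all modality inputs (such as text descriptions, spatial layouts, and editing instructions) into visual prompts and uses a single flow matching model for a purely visual image-in, image-out pipeline. It unifies text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm, removing cross-modal alignment bottlenecks and task-specific branches.

Details

Motivation: To challenge the text-dominated paradigm in multimodal generation and explore whether all modalities can be unified into a single visual representation, overcoming the limitation that language cannot reason or create within vision.

Result: On the VP-Bench benchmark, FlowInOne achieves state-of-the-art performance in instruction faithfulness, spatial precision, visual realism, and content consistency, surpassing open-source models and competitive commercial systems across all unified generation tasks.

Insight: The core contribution is redefining multimodal generation as a purely visual flow, unifying inputs through visual prompts, which simplifies the architecture and removes the complexity of cross-modal alignment. The paper also builds the large-scale visual prompt dataset VisPrompt-5M and the rigorously curated benchmark VP-Bench, laying a foundation for fully vision-centric generative modeling.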

Abstract: Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.

[62] FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts cs.CV | cs.AI

Guillermo Gil de Avalle, Laura Maruster, Eric Sloot, Christos Emmanouilidis

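TL;DR: FlowExtract is a system for extracting directed graphs from maintenance flowcharts that follow the ISO 5807 standard. It separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for node detection and text extraction, together with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes, addressing the difficulty existing vision-language models have with the topology of such diagrams.

Details

Motivation: Maintenance procedures in manufacturing are often documented as flowcharts in static PDFs or scanned images. These diagrams encode procedural knowledge essential for asset lifecycle management, but modern operator support systems cannot access it directly. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams.

Result: Evaluation on industrial troubleshooting guides shows that FlowExtract achieves very high node detection accuracy and substantially outperforms vision-language model baselines on edge extraction.

Insight: The contribution is decoupling flowchart parsing into two independent steps, element detection and connectivity reconstruction, and designing a novel edge detection method that reconstructs connections by analyzing arrowhead orientations and tracing lines backward rather than relying on an end-to-end vision-language model. This offers a practical path to extracting queryable procedural knowledge representations from static diagrams.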

Abstract: Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet inaccessible to modern operator support systems. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807-standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain-aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection and substantially outperforms vision-language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available at https://github.com/guille-gil/FlowExtract.

[63] Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV

Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen

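TL;DR: This paper proposes Multimodal Agentic Policy Optimization (MAPO) to address the reasoning-action gap that multimodal large language models exhibit in multi-turn reasoning. The method requires the model to generate explicit textual descriptions of the visual content obtained through tools, and combines semantic alignment with the task reward in advantage estimation to reduce gradient variance and improve multimodal reasoning performance.

Details

Motivation: Current reinforcement learning practice based on outcome rewards overlooks the fact that textual plausibility can mask execution failures: within a multi-turn reasoning trajectory, a model may show intuitive textual reasoning while executing imprecise or irrelevant visual actions. This reasoning-action discrepancy introduces noise that accumulates, severely degrading multimodal reasoning and potentially causing training collapse.

Result: Extensive experiments show that the method achieves superior performance on multiple visual reasoning benchmarks.

Insight: The contribution is explicitly bridging the gap between textual reasoning and visual actions by requiring textual descriptions of visual content and coupling semantic alignment with the task reward in advantage estimation. Theoretical analysis indicates that the method reduces gradient variance, improving training stability and reasoning accuracy.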

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to “think with images” by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model’s multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.
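As a rough illustration of coupling semantic alignment with the task reward, the sketch below multiplies each rollout's task reward by an alignment score in [0, 1] and then normalizes within the rollout group. The multiplicative coupling, the group normalization, and the name `coupled_advantages` are assumptions made for illustration; the abstract does not specify MAPO's actual estimator.

```python
from statistics import mean, pstdev

def coupled_advantages(task_rewards, alignment_scores, eps=1e-8):
    """Group-normalized advantages from alignment-weighted rewards (hypothetical)."""
    if len(task_rewards) != len(alignment_scores):
        raise ValueError("one alignment score is needed per rollout")
    # Assumed coupling: a rollout keeps its task reward only to the extent that
    # its textual descriptions match what the tools actually returned.
    coupled = [r * a for r, a in zip(task_rewards, alignment_scores)]
    mu = mean(coupled)
    sigma = pstdev(coupled)
    return [(c - mu) / (sigma + eps) for c in coupled]

# Four rollouts: two solve the task, but only the first describes its
# visual observations accurately.
advantages = coupled_advantages([1.0, 1.0, 0.0, 0.0], [0.9, 0.2, 0.8, 0.1])
```

Under this assumed coupling, a rollout that reaches the right answer with poorly aligned descriptions receives a smaller advantage than one whose descriptions match its observations.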

[64] Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer cs.CV

Bohao Xing, Deng Li, Rong Gao, Xin Liu, Heikki Kälviäinen

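TL;DR: This paper proposes the OG-ReG Transformer, a dual-path network inspired by the overall-glance and refined-gaze mechanisms of human visual cognition. It aims to address the splitting of spatiotemporal correlations in video tasks so as to capture motion information and long-range dependencies more efficiently.

Details

Motivation: Existing Transformer methods for video tasks commonly use factorized or window-based self-attention, which splits the spatiotemporal correlations between regions of interest and limits the model's ability to capture motion and long-range dependencies. Inspired by the human visual system, in which the importance of temporal and spatial information varies across time scales and attention is allocated sparsely through glance and gaze, the paper asks whether equal consideration of time and space is crucial and proposes a corresponding solution.

Result: The model achieves state-of-the-art (SOTA) results on benchmark datasets including Kinetics-400, Something-Something v2, and Diving-48, demonstrating competitive performance.

Insight: The novelty is a dual-path network modeled on human visual cognition (the Glance path extracts coarse-grained overall spatiotemporal information, and the Gaze path supplements local details), which balances computation and efficiency more effectively and strengthens the modeling of motion and long-range dependencies in video. Viewed objectively, this biologically inspired design offers a new architectural direction for video understanding and may improve performance on complex spatiotemporal dynamics.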

Abstract: Recently, Transformer has made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models’ ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on the Kinetics-400, Something-Something v2, and Diving-48, demonstrating its competitive performance. The code will be available at https://github.com/linuxsino/OG-ReG.

[65] Video-guided Machine Translation with Global Video Context cs.CV | cs.CL

Jian Chen, JinZe Lv, Zi Long, XiangHua Fu

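TL;DR: This paper proposes a globally video-guided multimodal translation framework. It uses a pretrained semantic encoder and vector-database subtitle retrieval to build a context set of video segments closely related to the semantics of the target subtitle, and applies an attention mechanism to focus on highly relevant visual content while preserving the remaining video features to retain broader context. It also designs a region-aware cross-modal attention mechanism to strengthen semantic alignment during translation.

Details

Motivation: Most existing video-guided multimodal translation methods rely on locally aligned video segments paired one-to-one with subtitles, which limits their ability to capture global narrative context spanning multiple segments in long videos. The paper sets out to overcome this limitation.

Result: Experiments on a large-scale documentary translation dataset show that the method significantly outperforms baseline models, highlighting its effectiveness in long-video scenarios.

Insight: The novel elements are the construction of a set of relevant video segments from global video context, an attention mechanism that balances local focus against global information retention, and region-aware cross-modal attention for better semantic alignment. Together these designs help handle the complex narrative structure of long videos.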

Abstract: Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on locally aligned video segments paired one-to-one with subtitles, limiting their ability to capture global narrative context across multiple segments in long videos. To overcome this limitation, we propose a globally video-guided multimodal translation framework that leverages a pretrained semantic encoder and vector database-based subtitle retrieval to construct a context set of video segments closely related to the target subtitle semantics. An attention mechanism is employed to focus on highly relevant visual content, while preserving the remaining video features to retain broader contextual information. Furthermore, we design a region-aware cross-modal attention mechanism to enhance semantic alignment during translation. Experiments on a large-scale documentary translation dataset demonstrate that our method significantly outperforms baseline models, highlighting its effectiveness in long-video scenarios.
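The retrieval step can be sketched as a top-k nearest-neighbor search over subtitle embeddings. The example below is an assumption-laden illustration: it uses cosine similarity over toy 2-D vectors and a linear scan in place of a real semantic encoder and vector database, and the names `cosine` and `retrieve_context` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_context(target_emb, subtitle_embs, k=2):
    """Return indices of the k subtitles most similar to the target subtitle."""
    ranked = sorted(
        range(len(subtitle_embs)),
        key=lambda i: cosine(target_emb, subtitle_embs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings standing in for the output of a pretrained semantic encoder.
subtitle_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
context_ids = retrieve_context([1.0, 0.05], subtitle_embs, k=2)
```

The returned indices would then select the video segments paired with those subtitles, forming the context set that the attention mechanism operates on.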

[66] Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning cs.CV

Subin Park, Jung Uk Kim

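TL;DR: This paper proposes GAR, a training-free sound source localization framework that uses the intrinsic reasoning ability of multimodal large language models to localize sound-emitting objects through a three-stage generate-analyze-refine pipeline, showing competitive results on single-source and multi-source benchmarks.

Details

Motivation: Existing contrastive-learning-based sound source localization methods lack explicit reasoning and verification, which limits their effectiveness in complex acoustic scenes. Inspired by human metacognitive processes, the paper aims to address this with the reasoning ability of MLLMs.

Result: Extensive experiments on single-source and multi-source benchmarks show that the method achieves competitive performance.

Insight: The novelty is a fully training-free, three-stage GAR pipeline based on MLLM meta-reasoning. It quantifies audio-visual consistency through open-set role tagging and anchor voting and refines with adaptive gating, avoiding the reliance of traditional methods on large amounts of annotated data and contrastive learning.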

Abstract: Sound source localization task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.

[67] Vision-Language Model-Guided Deep Unrolling Enables Personalized, Fast MRI cs.CV

Fangmao Ju, Yuzhu He, Zhiwen Xue, Chunfeng Lian, Jianhua Ma

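TL;DR: This paper proposes PASS (Personalized, Anomaly-aware Sampling and reconStruction), an intelligent MRI framework that uses a vision-language model (VLM) to guide a deep unrolling network for fast, clinically task-oriented magnetic resonance imaging. PASS dynamically personalizes the imaging pipeline through three core contributions: a deep unrolled reconstruction network derived from a physics-based model, a sampling module that generates patient-specific k-space trajectories, and an anomaly-aware prior extracted from a pretrained VLM that steers sampling and reconstruction toward clinically relevant regions.

Details

Motivation: Traditional accelerated MRI methods optimize generic image quality and lack adaptability to specific clinical tasks, leading to long acquisition times and limited personalization. The paper aims to address this by integrating the clinical reasoning ability of a VLM with an interpretable, physics-aware network to achieve task-oriented fast imaging.

Result: PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors, and directly improves performance on downstream diagnostic tasks such as fine-grained anomaly detection, localization, and diagnosis.

Insight: The novelty lies in combining the high-level clinical reasoning of a vision-language model with a physics-based deep unrolling network, enabling joint personalized optimization of sampling and reconstruction. Viewed objectively, using a VLM to extract task-relevant priors that guide the whole imaging pipeline offers a new, interpretable paradigm for intelligent acceleration in medical imaging.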

Abstract: Magnetic Resonance Imaging (MRI) is a cornerstone in medicine and healthcare but suffers from long acquisition times. Traditional accelerated MRI methods optimize for generic image quality, lacking adaptability for specific clinical tasks. To address this, we introduce PASS (Personalized, Anomaly-aware Sampling and reconStruction), an intelligent MRI framework that leverages a Vision-Language Model (VLM) to guide a deep unrolling network for task-oriented, fast imaging. PASS dynamically personalizes the imaging pipeline through three core contributions: (1) a deep unrolled reconstruction network derived from a physics-based MRI model; (2) a sampling module that generates patient-specific $k$-space trajectories; and (3) an anomaly-aware prior, extracted from a pretrained VLM, which steers both sampling and reconstruction toward clinically relevant regions. By integrating the high-level clinical reasoning of a VLM with an interpretable, physics-aware network, PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors. This enhancement directly translates to improvements in downstream diagnostic tasks, including fine-grained anomaly detection, localization, and diagnosis.

[68] Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible–Infrared Evasion cs.CV | cs.AI

Miguel A. DelaCruz, Patricia Mae Santos, Rafael T. Navarro

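TL;DR: This paper surveys physical adversarial attacks from a surveillance-oriented viewpoint. It stresses that real deployments require person detection, multi-object tracking, dual-modal visible–infrared sensing, and the practical form of the attack carrier to be considered together, and it proposes a four-part taxonomy to organize prior work.

Details

Motivation: Existing research on physical adversarial attacks is mostly based on isolated image benchmarks, whereas real surveillance systems require joint consideration of temporal persistence, sensing modality, carrier realism, and system-level objectives. The paper aims to re-examine the field at the system level.

Result: The paper reports no specific quantitative results. It reviews recent progress on multi-object tracking, dual-modal visible–infrared evasion, and controllable clothing, and summarizes evaluation practices and unresolved gaps such as distance robustness and camera-pipeline variation.

Insight: The central point is that the robustness of surveillance systems cannot be judged from single-frame benchmarks alone and must be examined as a system problem unfolding over time, across sensors, and under realistic physical constraints. The four-part taxonomy provides a new structured view of physical attacks.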

Abstract: Physical adversarial attacks are increasingly studied in settings that resemble deployed surveillance systems rather than isolated image benchmarks. In these settings, person detection, multi-object tracking, visible–infrared sensing, and the practical form of the attack carrier all matter at once. This changes how the literature should be read. A perturbation that suppresses a detector in one frame may have limited practical effect if identity is recovered over time; an RGB-only result may say little about night-time systems that rely on visible and thermal inputs together; and a conspicuous patch can imply a different threat model from a wearable or selectively activated carrier. This paper reviews physical attacks from that surveillance-oriented viewpoint. Rather than attempting a complete catalogue of all physical attacks in computer vision, we focus on the technical questions that become central in surveillance: temporal persistence, sensing modality, carrier realism, and system-level objective. We organize prior work through a four-part taxonomy and discuss how recent results on multi-object tracking, dual-modal visible–infrared evasion, and controllable clothing reflect a broader change in the field. We also summarize evaluation practices and unresolved gaps, including distance robustness, camera-pipeline variation, identity-level metrics, and activation-aware testing. The resulting picture is that surveillance robustness cannot be judged reliably from isolated per-frame benchmarks alone; it has to be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.

Dewei Zhou, You Li, Zongxin Yang, Yi Yang

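TL;DR: This paper proposes RefineAnything, a model for region-specific image refinement that restores fine-grained details within a user-specified region (such as a scribble mask or bounding box) while keeping all non-edited pixels strictly unchanged. The model is based on multimodal diffusion, supports both reference-based and reference-free refinement, and introduces a Focus-and-Refine strategy and a Boundary Consistency Loss to improve effectiveness and naturalness.

Details

Motivation: Existing image generation models frequently suffer from local detail collapse (for example, distorted text, logos, and thin structures), while instruction-driven editing models focus on coarse-grained semantic edits and tend to overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small part of the input image.

Result: On the RefineEval benchmark, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, providing a practical solution for high-precision local refinement.

Insight: The novel elements are: framing region-specific refinement as a dedicated problem setting; the Focus-and-Refine strategy, built on the counter-intuitive observation that crop-and-resize improves local reconstruction, which reallocates the resolution budget to the target region; a Boundary Consistency Loss that reduces seam artifacts; and the Refine-30K dataset and RefineEval benchmark.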

Abstract: We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
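The crop, refine, and paste-back flow can be sketched on a plain 2-D grid. This is a simplified illustration under stated assumptions: `refine_fn` stands in for the diffusion-based refiner, the resize to the VAE input resolution is omitted, and a hard box mask replaces the blended mask described in the abstract. The function name `focus_and_refine` is hypothetical.

```python
def focus_and_refine(image, box, refine_fn):
    """Refine only the pixels inside `box`; all other pixels are copied unchanged.

    image: list of rows of numbers.
    box: (top, left, bottom, right), with bottom and right exclusive.
    refine_fn: callable mapping a crop to a refined crop of the same size.
    """
    top, left, bottom, right = box
    crop = [row[left:right] for row in image[top:bottom]]
    refined = refine_fn(crop)
    if len(refined) != len(crop) or any(len(a) != len(b) for a, b in zip(refined, crop)):
        raise ValueError("refine_fn must return a crop of the same size")
    out = [row[:] for row in image]  # copy so the input image is never mutated
    for offset, refined_row in enumerate(refined):
        out[top + offset][left:right] = refined_row  # paste back inside the box only
    return out

image = [[0] * 4 for _ in range(4)]
result = focus_and_refine(image, (1, 1, 3, 3), lambda crop: [[p + 1 for p in row] for row in crop])
```

Because pixels outside the box are copied rather than regenerated, background preservation holds by construction in this sketch, independent of how the refiner behaves.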

[70] Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer cs.CV

Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm

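TL;DR: This paper proposes a deep regression framework based on FDG-PET/CT images and temporal information to predict overall survival (OS) in patients with non-small cell lung cancer (NSCLC). The method extracts image embeddings with a ResNet-50 backbone and combines them with a temporal input representing a scalar time horizon to predict OS probability as a parametric function of time. It was developed on the U-CAN cohort (n=556) and evaluated on a test set (n=292), where it outperforms an image-only baseline.

Details

Motivation: The paper addresses automated medical-image-based prediction of clinical outcomes such as overall survival, with the goal of improving patient prognostics and personalized treatment planning. Existing methods usually predict only at preset time points (such as 2 or 5 years), whereas this work aims for a framework that parameterizes survival probability as a continuous function of time.

Result: On the test set, the method that incorporates temporal data improves AUC by 4.3% over the image-only baseline. A model using clinical + IDP features performs strongly, and an ensemble of the imaging and clinical + IDP models achieves the best overall performance (AUC=0.788). The method also stratifies patients into high-risk and low-risk categories.

Insight: The novelty is taking the scalar time dimension as an explicit input, so that the model outputs a continuous, time-varying survival probability function rather than predictions at discrete time points. Viewed objectively, the method demonstrates the complementary value of multimodal inputs (imaging, temporal, and clinical data), and saliency-map analysis confirms that tumor regions are the key structures for prediction, which adds interpretability.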

Abstract: Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2- or 5-year. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provided an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.

[71] Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models cs.CV | cs.LG

Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Hanane Azzag, Mustapha Lebbah

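TL;DR: This paper proposes Energy-Regularized Spatial Masking (ERSM), a framework that reformulates feature selection as a differentiable energy minimization problem. A lightweight Energy-Mask Layer embedded in standard convolutional backbones assigns each visual token a scalar energy composed of an intrinsic unary importance cost and a pairwise spatial coherence penalty, letting the network autonomously find an optimal information-density balance for each input.

Details

Motivation: Deep convolutional neural networks achieve remarkable performance by densely processing spatial feature maps, but this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. Modern vision models are therefore brittle and hard to interpret, which motivates a new approach to improving robustness and interpretability.

Result: Validation on convolutional architectures shows that ERSM produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks while preserving classification accuracy. In deletion-based robustness tests, the learned energy ranking significantly outperforms magnitude-based pruning.

Insight: The novelty is modeling feature selection as differentiable energy minimization, which lets the network adapt to each input and find an optimal information density. It isolates semantic object regions without pixel-level supervision and acts as an intrinsic denoising mechanism, going beyond prior pruning methods that depend on rigid sparsity budgets or heuristic importance scores.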

Abstract: Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
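The two-term energy can be made concrete with a small sketch. The specific forms used here are assumptions chosen for illustration: the unary term is a per-token cost scaled by the mask value, and the pairwise term is a squared difference between a token's mask value and each of its four grid neighbors. The abstract names the two forces but not their formulas, and `token_energies` is a hypothetical name.

```python
def token_energies(mask, unary, pairwise_weight=0.5):
    """Per-token energy = unary cost of keeping the token + spatial coherence penalty.

    mask, unary: 2-D lists of equal shape; mask values lie in [0, 1].
    """
    rows, cols = len(mask), len(mask[0])
    energies = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            energy = unary[r][c] * mask[r][c]  # assumed unary importance cost
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Assumed pairwise term: penalize disagreement with neighbors.
                    energy += pairwise_weight * (mask[r][c] - mask[nr][nc]) ** 2
            energies[r][c] = energy
    return energies

mask = [[1.0, 1.0], [1.0, 0.0]]
unary = [[0.2, 0.2], [0.2, 0.2]]
energies = token_energies(mask, unary)
```

In this toy grid the dropped token at the bottom right pays no unary cost but incurs the largest coherence penalty, which shows how the two terms pull in opposite directions.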

[72] Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models cs.CV | cs.AI

Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, Chang Xu

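TL;DR: This paper proposes Q-Zoom, a query-aware adaptive high-resolution perception framework that targets the computational redundancy and low inference efficiency caused by global high-resolution inputs when multimodal large language models handle fine-grained tasks. The framework works coarse-to-fine: a dynamic gating network and a self-distilled region proposal network apply high-resolution processing to task-relevant regions only when needed, greatly speeding up inference while maintaining or even improving accuracy.

Details

Motivation: Current MLLMs need high-resolution visual inputs for fine-grained tasks such as document understanding and dense scene perception, but the global resolution scaling paradigm indiscriminately feeds large numbers of redundant visual tokens into self-attention. This severely reduces inference throughput and ignores both spatial sparsity and query intent.

Result: Experiments on Qwen2.5-VL-7B show that Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and by 4.39 times in high-resolution scenarios while matching the baseline's peak accuracy. When configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. The improvements transfer seamlessly to models such as Qwen3-VL and LLaVA and establish a dominant Pareto frontier.

Insight: The core novelty is a query-aware adaptive perception mechanism: dynamic gating decides whether high-resolution processing is needed, and a self-distilled region proposal network precisely localizes the task-relevant region, so that compute is allocated on demand. The training strategy (consistency-aware generation, self-supervised distillation, continuous spatio-temporal alignment) is also notable because it optimizes the modules efficiently without additional human annotation. This coarse-to-fine, query-driven sparse processing paradigm offers a new direction for efficient MLLM design.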

Abstract: MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline’s peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline’s peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.

Milad Moradi, Ke Yan, David Colwell, Matthias Samwald, Rhona Asgari

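TL;DR: This paper proposes a YOLOv5-based multi-modal method for detecting user interface controls. Cross-attention modules align GPT-generated textual descriptions of UI images with visual features, making detection more robust and context-aware.

Details

Motivation: Pixel-only UI control detection is hampered by visual ambiguity, design variability, and missing contextual cues. The paper aims to improve detection performance for automated testing, accessibility, and software analytics.

Result: The method is evaluated on a dataset of over 16,000 annotated UI screenshots covering 23 control classes. Experiments compare three fusion strategies (element-wise addition, weighted sum, and convolutional fusion). Convolutional fusion performs best, with significant gains on semantically complex or visually ambiguous classes, and outperforms the baseline YOLOv5 model.

Insight: The novelty is bringing a text modality (GPT-generated descriptions) into a visual detection framework through cross-attention to achieve multi-modal alignment. Viewed objectively, the method uses semantic information to compensate for insufficient visual evidence and offers a useful reference for multi-modal detection systems in software engineering.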

Abstract: Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.
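The three fusion strategies named in the abstract can be sketched for a single spatial location, where each modality contributes one feature vector. This is an illustrative assumption rather than the paper's implementation: convolutional fusion is modeled as a 1x1 convolution, which at one location reduces to a linear map over the concatenated channels, and the weights shown are hand-picked rather than learned.

```python
def add_fusion(visual, text):
    """Element-wise addition of visual and text features."""
    return [v + t for v, t in zip(visual, text)]

def weighted_sum_fusion(visual, text, alpha=0.7):
    """Convex combination; alpha is an assumed (possibly learnable) mixing weight."""
    return [alpha * v + (1 - alpha) * t for v, t in zip(visual, text)]

def conv_fusion(visual, text, weights, bias):
    """1x1-convolution-style fusion: a linear map over concatenated channels.

    weights: one row per output channel, each of length len(visual) + len(text).
    """
    concat = visual + text
    return [sum(w * x for w, x in zip(row, concat)) + b for row, b in zip(weights, bias)]

visual, text = [1.0, 2.0], [3.0, 4.0]
added = add_fusion(visual, text)
mixed = weighted_sum_fusion(visual, text, alpha=0.5)
# These hand-picked weights make conv fusion reproduce plain addition.
conv = conv_fusion(visual, text, [[1, 0, 1, 0], [0, 1, 0, 1]], [0.0, 0.0])
```

Addition and weighted sum are special cases of the linear map, so convolutional fusion is the most expressive of the three, which is consistent with it being reported as the strongest variant.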

[74] POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP cs.CV

Jiyun Won, Heemin Yang, Woohyeok Kim, Jungseul Ok, Sunghyun Cho

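TL;DR: This paper proposes POS-ISP, a sequence-level reinforcement learning framework for optimizing task-aware image signal processing (ISP) pipelines. It formulates modular ISP optimization as a global sequence prediction problem, predicts the entire module sequence and its parameters in one pass, and optimizes with a terminal task reward, avoiding intermediate supervision and redundant computation.

Details

Motivation: Existing methods, such as neural architecture search (NAS) or step-wise reinforcement learning (RL), struggle to jointly optimize ISP module sequences and parameters: NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead.

Result: Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost.

Insight: The main novelty is recasting the joint optimization of module sequence and parameters as a global sequence prediction problem solved with a single sequence-level RL decision, which yields a more stable and efficient optimization paradigm and avoids the drawbacks of step-wise decision-making.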

Abstract: Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at https://w1jyun.github.io/POS-ISP

[75] Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis cs.CV

Jintao Chen, Chengyu Bai, Junjun hu, Xinda Xue, Mu Xu

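TL;DR: This paper proposes Grounded Forcing, a framework for long-term consistency in autoregressive video synthesis. Three interlocking mechanisms (a Dual Memory KV Cache, Dual-Reference RoPE Injection, and Asymmetric Proximity Recache) bridge time-independent semantics and proximal dynamics, mitigating semantic forgetting, visual drift, and loss of controllability.

Details

Motivation: Autoregressive video synthesis is promising for infinite-horizon generation but is hindered by three intertwined fundamental challenges: semantic forgetting caused by context limits, visual drift caused by positional extrapolation, and loss of controllability during interactive instruction switching. Existing methods usually tackle these issues in isolation, which limits long-term coherence.

Result: Extensive experiments show that Grounded Forcing significantly improves long-range consistency and visual stability, laying a solid foundation for interactive long-form video synthesis.

Insight: The claimed novelty is a set of cooperating mechanisms that anchor generation to a stable semantic core while accommodating flexible local dynamics. Viewed objectively, the contribution is a systematic bridge between time-independent semantic representations and proximal dynamics modeling, which addresses several key problems of long-horizon generation together by decoupling the memory cache, constraining positional embeddings, and designing a smooth cache update strategy.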

Abstract: Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.

[76] NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results cs.CV

Wenbin Zou, Tianyi Li, Kejun Wu, Huiping Zhuang, Zongwei Wu

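TL;DR: This paper presents the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR), which aims to advance research on recovering visually coherent videos from corrupted bitstreams and provides a public benchmark for evaluating restoration methods under realistic corruption settings. The paper describes the dataset, evaluation protocol, and participating methods, and summarizes the final results and main technical trends.

Details

Motivation: Decoding video from corrupted bitstreams produces severe spatial-temporal artifacts and content distortion. The challenge addresses this problem and establishes a unified evaluation benchmark for the emerging task.

Result: The challenge summarizes the final results of the participating methods and highlights the difficulty of the task, but the abstract gives no quantitative results (such as PSNR/SSIM) and no direct comparison with specific SOTA models.

Insight: As a public benchmark and competition platform, the challenge advances research on robust video restoration under practical bitstream corruption, reveals the technical trends and difficulties of current methods, and points to directions for future research.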

Abstract: This paper reports on the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR). The challenge aims to advance research on recovering visually coherent videos from corrupted bitstreams, whose decoding often produces severe spatial-temporal artifacts and content distortion. Built upon recent progress in bitstream-corrupted video recovery, the challenge provides a common benchmark for evaluating restoration methods under realistic corruption settings. We describe the dataset, evaluation protocol, and participating methods, and summarize the final results and main technical trends. The challenge highlights the difficulty of this emerging task and provides useful insights for future research on robust video restoration under practical bitstream corruption.

[77] Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation cs.CV

Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv

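TL;DR: This paper reveals a new threat to multimodal large language models (MLLMs) used for content moderation: adversarial smuggling attacks. The attack encodes harmful content into visual formats that humans can read but AI struggles to recognize, exploiting the human-AI capability gap to evade automated detection. The authors build SmuggleBench, the first comprehensive benchmark for this threat, find attack success rates above 90% on mainstream MLLMs, and make a preliminary exploration of mitigation strategies.

Details

Motivation: As MLLMs are widely deployed as automated content moderators, the paper aims to uncover and study a new adversarial attack that exploits the human-AI capability gap, exposing serious vulnerabilities of existing MLLMs in content safety moderation.

Result: On the SmuggleBench benchmark (1,700 attack instances), state-of-the-art models, both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL), are highly vulnerable, with attack success rates (ASR) exceeding 90%.

Insight: The novelty is the first systematic definition and classification of adversarial smuggling attacks against MLLM content moderation (split into two pathways, Perceptual Blindness and Reasoning Blockade) and the first benchmark for evaluating them. Viewed objectively, the analysis of root causes at the perception and reasoning levels (limited vision encoder capability, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples) and the preliminary exploration of mitigations (test-time scaling and adversarial training) are valuable for improving the safety and robustness of MLLMs.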

Abstract: Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.

[78] Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction cs.CV

Pablo Parte, Roberto Valle, José M. Buenaposada, Luis Baumela

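TL;DR: This paper systematically audits age, gender, and race bias in facial landmark detection models and introduces a statistical methodology to separate demographic effects from confounding visual factors such as head pose and image resolution. After controlling for confounders, performance disparities related to gender and race vanish, but age-related bias remains significant and affects older individuals most.

Details

Motivation: The paper examines a foundation of fairness in human-robot interaction: whether facial landmark detection, a low-level vision task, carries demographic biases that have not yet been explored, with the goal of ensuring reliable and fair robot perception systems.

Result: Evaluation of a standard representative model shows that confounding visual factors, particularly head pose and image resolution, far outweigh demographic attributes. After controlling for these factors, performance disparities across gender and race vanish, but age bias is statistically significant, with higher bias for older individuals.

Insight: The novelty is the first systematic audit of demographic bias in facial landmark detection, together with a statistical method for separating out confounding factors. Viewed objectively, the analysis shows that low-level vision components can also raise fairness issues, which may propagate through the HRI pipeline and affect vulnerable groups, underscoring the need to audit and correct such biases.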

Abstract: Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender and race biases. To this end we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Evaluations of a standard representative model demonstrate that confounding visual factors, particularly head pose and image resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, we show that performance disparities across gender and race vanish. However, we identify a statistically significant age-related effect, with higher biases observed for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline, disproportionately affecting vulnerable populations. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.

[79] MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation cs.CVPDF

Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu

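TL;DR: This paper proposes MAR-GRPO, a stabilized reinforcement learning framework that addresses the interleaved inference and noisy log-probability estimation encountered when applying RL to hybrid autoregressive-diffusion models. It introduces multi-trajectory expectation, which averages over multiple diffusion trajectories to reduce gradient noise, and combines it with uncertainty-based top-k% token selection and a consistency-aware token filtering strategy to improve training stability and generation quality.

Details

Motivation: RL has been applied successfully to autoregressive and diffusion models, but extending it to hybrid AR-diffusion frameworks remains challenging, mainly because interleaved inference and noisy log-probability estimation cause unstable training and early performance saturation.

Result: Extensive experiments on multiple benchmarks show that the method consistently outperforms baseline GRPO and pre-RL models in visual quality, training stability, and spatial structure understanding.

Insight: The contributions are: multi-trajectory expectation to reduce diffusion-induced gradient noise; top-k% selection based on token uncertainty to avoid over-smoothing; and a consistency-aware token selection strategy that filters out AR tokens poorly aligned with the final generated content. Together these improve the stability and effectiveness of RL for the hybrid framework.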

Abstract: Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: https://github.com/AMAP-ML/mar-grpo.
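
A minimal sketch of the multi-trajectory idea described in this abstract, under stated assumptions: the data layout (`logps[t][i]`), the use of variance across trajectories as the uncertainty measure, and the fallback to the first trajectory for unselected tokens are all illustrative choices, not details taken from the paper.

```python
from statistics import mean, pvariance

def multi_trajectory_estimate(logps, top_k_percent):
    """logps[t][i]: log-prob of token i under diffusion trajectory t (hypothetical layout)."""
    num_tokens = len(logps[0])
    per_token = [[traj[i] for traj in logps] for i in range(num_tokens)]
    averaged = [mean(vals) for vals in per_token]          # multi-trajectory average
    uncertainty = [pvariance(vals) for vals in per_token]  # assumed uncertainty proxy
    k = max(1, round(num_tokens * top_k_percent / 100))
    ranked = sorted(range(num_tokens), key=lambda i: uncertainty[i], reverse=True)
    selected = set(ranked[:k])
    # Only the most uncertain tokens use the averaged estimate; the others
    # keep the single-trajectory value, which avoids smoothing every token.
    mixed = [averaged[i] if i in selected else logps[0][i] for i in range(num_tokens)]
    return mixed, sorted(selected)

logps = [
    [-1.0, -2.0, -0.5, -3.0],
    [-1.0, -4.0, -0.5, -3.2],
    [-1.0, -3.0, -0.5, -2.8],
]
mixed, selected = multi_trajectory_estimate(logps, top_k_percent=25)
```

With four tokens and `top_k_percent=25`, only the single token whose log-probability varies most across the three trajectories is replaced by its average.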

[80] Synthetic Dataset Generation for Partially Observed Indoor Objects cs.CVPDF

Jelle Vermandere, Maarten Bassier, Maarten Vergauwen

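TL;DR: This paper presents a virtual scanning framework implemented in Unity for generating realistic synthetic 3D scan datasets, addressing the need of learning-based 3D scene reconstruction and object completion methods for large-scale paired data. The framework simulates real scanner behaviour and, combined with a procedural indoor scene generation pipeline, produces the V-Scan dataset, which contains synthetic indoor scans, object-level point clouds, voxel-based occlusion grids, and complete ground-truth geometry.

Details

Motivation: Learning-based 3D reconstruction and object completion methods need large datasets that pair partial scans with complete ground-truth geometry, but acquiring such data with real scanning systems is costly and time-consuming, especially when accurate ground truth for occluded regions is required.

Result: The proposed virtual scanning framework was used to generate the V-Scan dataset, which provides supervision for training and evaluating learning-based scene reconstruction and object completion methods. The abstract reports no quantitative results or comparisons with existing benchmarks.

Insight: The contributions are: 1) a virtual scanning framework that simulates real scanner behaviour through configurable parameters (scan resolution, measurement range, distance-dependent noise); 2) ray-based scanning instead of direct mesh sampling, which models sensor visibility and occlusion effects realistically; 3) integration of the scanner with a procedural indoor scene generation pipeline for scalable dataset creation.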

Abstract: Learning-based methods for 3D scene reconstruction and object completion require large datasets containing partial scans paired with complete ground-truth geometry. However, acquiring such datasets using real-world scanning systems is costly and time-consuming, particularly when accurate ground truth for occluded regions is required. In this work, we present a virtual scanning framework implemented in Unity for generating realistic synthetic 3D scan datasets. The proposed system simulates the behaviour of real-world scanners using configurable parameters such as scan resolution, measurement range, and distance-dependent noise. Instead of directly sampling mesh surfaces, the framework performs ray-based scanning from virtual viewpoints, enabling realistic modelling of sensor visibility and occlusion effects. In addition, panoramic images captured at the scanner location are used to assign colours to the resulting point clouds. To support scalable dataset creation, the scanner is integrated with a procedural indoor scene generation pipeline that automatically produces diverse room layouts and furniture arrangements. Using this system, we introduce the V-Scan dataset, which contains synthetic indoor scans together with object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry. The resulting dataset provides valuable supervision for training and evaluating learning-based methods for scene reconstruction and object completion.

[81] Not all tokens contribute equally to diffusion learning cs.CVPDF

Guoqing Zhang, Lu Shi, Wanru Xu, Linna Zhang, Sen Wang

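TL;DR: This paper proposes DARE, a unified framework that addresses the tendency of conditional diffusion models to neglect semantically important tokens in text-to-video generation, improving semantic guidance through distribution-aware rectification and spatial ensembling.

Details

Motivation: Existing conditional diffusion models often ignore semantically important tokens at inference, producing biased or incomplete generations under classifier-free guidance. The paper traces this to distributional bias from long-tailed token frequencies in the training data and to spatial misalignment in cross-attention.

Result: Extensive experiments on multiple benchmark datasets show that DARE consistently improves generation fidelity and semantic alignment, with significant gains over existing methods.

Insight: The contributions are Distribution-Rectified Classifier-Free Guidance (DR-CFG), which dynamically suppresses dominant tokens with low semantic density to balance the conditional distribution, and Spatial Representation Alignment (SRA), which adaptively reweights attention maps by token importance and enforces representation consistency, so that tokens with high semantic density provide stronger spatial guidance during generation.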

Abstract: With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.

[82] SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation cs.CV | cs.AI | cs.MM | eess.IVPDF

Qizhou Wang, Guansong Pang, Christopher Leckie

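TL;DR: This paper introduces SurFITR, a dataset for forgery detection and localisation in surveillance-style images. Generated through a multimodal LLM-driven pipeline, it contains over 137k tampered images with varying resolutions, edit types, and diverse surveillance scenes (varied viewpoints, small or occluded subjects). It targets the poor generalisation of existing forgery detectors in surveillance settings, where tampered regions are localised and subtle. Experiments show that existing detectors degrade significantly on the dataset, while training on it substantially improves both in-domain and cross-domain performance.

Details

Motivation: Advances in open-access image generation models raise the risk of falsified visual evidence. Existing forgery detection models are usually trained on object-centric datasets with full-image synthesis or large manipulated regions and struggle to generalise to surveillance scenarios, where tampering is typically localised and subtle and scenes feature varied viewpoints, small targets, frequent occlusion, and low visual quality.

Result: Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial gains in both in-domain and cross-domain evaluation. The dataset is publicly available on GitHub.

Insight: The main contribution is SurFITR, a large-scale, diverse image forgery detection dataset built specifically for surveillance scenes. Its core is a multimodal LLM-driven pipeline for semantically aware, fine-grained editing, which produces data that better matches the localised, subtle nature of real surveillance tampering. This offers a new data foundation and benchmark for improving generalisation in specific, challenging real-world scenarios such as surveillance.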

Abstract: We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for surveillance-style image forgery detection and localisation, in response to recent advances in open-access image generation models that raise concerns about falsifying visual evidence. Existing forgery models, trained on datasets with full-image synthesis or large manipulated regions in object-centric images, struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is typically localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To address this gap, SurFITR provides a large collection of forensically valuable imagery generated via a multimodal LLM-powered pipeline, enabling semantically aware, fine-grained editing across diverse surveillance scenes. It contains over 137k tampered images with varying resolutions and edit types, generated using multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.

Chenhao Liu, Zelin Wen, Yan Tong, Junjie Zhu, Xinyu Tian

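TL;DR: This paper proposes a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. A generative filtering mechanism synthesizes images that are privacy-filtered yet retain pathology information; combined with ID-filtered reports, these protect privacy while preserving the data's utility for large-scale vision-language model training and cross-hospital transfer.

Details

Motivation: Large-scale radiology data are critical for developing robust medical AI systems, but cross-hospital sharing is constrained by privacy concerns. Existing de-identification research focuses mainly on removing identifiable information for compliant data release; whether de-identified data retain enough utility for large-scale vision-language model training and cross-hospital transfer remains underexplored.

Result: On public chest X-ray benchmarks, the method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with models trained on the original data, while identity-related accuracy drops markedly, confirming effective privacy protection. In the cross-hospital setting, combining de-identified data with local data yields better performance.

Insight: The contributions are a compiled blacklist of privacy-sensitive terms and whitelist of pathology-related terms, and a generative filtering mechanism for synthesizing images. By balancing privacy protection against data utility, the method offers a practical approach to cross-hospital data sharing, particularly for large-scale AI model training.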

Abstract: Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.

[84] CSA-Graphs: A Privacy-Preserving Structural Dataset for Child Sexual Abuse Research cs.CV | cs.AI | cs.LGPDF

Carlos Caetano, Camila Laranjeira, Clara Ernesto, Artur Barros, João Macedo

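TL;DR: This paper presents CSA-Graphs, a privacy-preserving structural dataset that addresses the legal and ethical restrictions preventing public sharing of data in Child Sexual Abuse Imagery (CSAI) classification research. Instead of raw images, the dataset provides two graph-based structural representations: scene graphs describing object relationships and skeleton graphs encoding human pose, which retain contextual information while removing explicit visual content. Experiments show that both representations retain the information needed for CSAI classification and that combining them further improves performance.

Details

Motivation: CSAI classification research faces strict legal and ethical restrictions that prevent datasets from being shared publicly, which hinders reproducibility and the development of automated methods.

Result: Experiments show that both the scene-graph and skeleton-graph representations in CSA-Graphs retain information useful for CSAI classification, and that combining the two further improves classification performance.

Insight: The contribution is a privacy-preserving approach to dataset construction: graph-structured representations (scene graphs and skeleton graphs) replace raw images, removing sensitive visual content while preserving key contextual information. This gives a workable data-sharing scheme for computer vision research in restricted domains such as CSAI.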

Abstract: Child Sexual Abuse Imagery (CSAI) classification is an important yet challenging problem for computer vision research due to the strict legal and ethical restrictions that prevent the public sharing of CSAI datasets. This limitation hinders reproducibility and slows progress in developing automated methods. In this work, we introduce CSA-Graphs, a privacy-preserving structural dataset. Instead of releasing the original images, we provide structural representations that remove explicit visual content while preserving contextual information. CSA-Graphs includes two complementary graph-based modalities: scene graphs describing object relationships and skeleton graphs encoding human pose. Experiments show that both representations retain useful information for classifying CSAI, and that combining them further improves performance. This dataset enables broader research on computer vision methods for child safety while respecting legal and ethical constraints.

[85] USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification cs.CVPDF

Changmiao Wang, Songqi Zhang, Yongquan Zhang, Yifei Wang, Liya Liu

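TL;DR: This paper proposes USCNet (Urinary Stone Segmentation and Classification Network), a method for precise preoperative classification of kidney stones. It integrates CT images with clinical data from Electronic Health Records (EHR) in a Transformer-based multimodal fusion framework with a CT-EHR attention module and a segmentation-guided attention module, and introduces a dynamic loss function to balance the dual objectives of segmentation and classification.

Details

Motivation: Kidney stone analysis currently depends on postoperative specimens, which prevents rapid classification before surgery. Removing this limitation would support personalized treatment and recurrence prevention.

Result: Experiments on an in-house kidney stone dataset show that USCNet performs strongly on all evaluation metrics, with classification results significantly surpassing existing mainstream methods.

Insight: The contribution is the use of a Transformer for multimodal fusion of CT and EHR data, together with a segmentation-guided attention mechanism and a dynamic loss function, offering a reusable framework for classification that combines medical images with clinical data.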

Abstract: Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: https://github.com/ZhangSongqi0506/KidneyStone.

[86] Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering cs.CVPDF

Zhuohong Chen, Zhenxian Wu, Yunyao Yu, Hangrui Xu, Zirui Liao

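TL;DR: This paper proposes a decision-based search agent for knowledge-based visual question answering (KB-VQA). It reformulates KB-VQA as a search-agent problem and models the solving process as a multi-step decision procedure: at each step the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, or Caption) based on its current information state, and automatically collected multi-step trajectories supervise fine-tuning.

Details

Motivation: Existing retrieval-augmented generation (RAG) methods typically use a fixed retrieve-filter-generate pipeline that adapts poorly to diverse question types and separates retrieval from reasoning, so the retrieved evidence is often poorly aligned with the question.

Result: On the InfoSeek and E-VQA benchmarks, the method achieves state-of-the-art performance, consistently outperforming prior baselines.

Insight: The contribution is modelling KB-VQA as a multi-step decision process in which the agent decides dynamically when to search, how to refine queries, and when to stop, which integrates retrieval with reasoning more tightly. Supervised learning on automatically collected trajectories provides a scalable training framework.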

Abstract: Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, and Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent's reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.
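
The multi-step decision procedure in this abstract can be pictured with a toy loop. This is only a sketch under assumptions: the four action names come from the abstract, but `toy_policy`, the `tools` table, the state keys, and the stub tool outputs are hypothetical stand-ins for the fine-tuned model and real retrievers.

```python
from enum import Enum

class Action(Enum):
    ANSWER = "answer"
    IMAGE_RETRIEVAL = "image_retrieval"
    TEXT_RETRIEVAL = "text_retrieval"
    CAPTION = "caption"

def toy_policy(state):
    """Hypothetical rule-based stand-in for the learned action selector."""
    if "caption" not in state:
        return Action.CAPTION
    if "entity" not in state:
        return Action.IMAGE_RETRIEVAL
    if "facts" not in state:
        return Action.TEXT_RETRIEVAL
    return Action.ANSWER

def run_agent(question, tools, max_steps=6):
    state = {"question": question}   # the agent's current information state
    trajectory = []                  # recorded actions, usable as supervision
    for _ in range(max_steps):
        action = toy_policy(state)
        trajectory.append(action)
        if action is Action.ANSWER:
            return state["facts"], trajectory
        key, tool = tools[action]
        state[key] = tool(state)     # each tool call enriches the state
    return None, trajectory          # step budget exhausted without answering

# Stub tools returning made-up values, for illustration only.
tools = {
    Action.CAPTION: ("caption", lambda s: "a tall stone tower"),
    Action.IMAGE_RETRIEVAL: ("entity", lambda s: "Example Tower"),
    Action.TEXT_RETRIEVAL: ("facts", lambda s: f"{s['entity']}: made-up fact"),
}
answer, trajectory = run_agent("When was this tower completed?", tools)
```

The point of the design is that the stopping decision is itself an action, so the number of retrieval steps can vary per question instead of being fixed by the pipeline.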

[87] Bridging MRI and PET physiology: Untangling complementarity through orthogonal representations cs.CV | cs.AIPDF

Sonja Adomeit, Kartikay Tehlan, Lukas Förner, Katharina Weisser, Helen Scholtiseek

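TL;DR: This paper proposes a subspace decomposition framework for multimodal medical imaging (MRI and PET) that reframes multimodal fusion as orthogonal subspace separation rather than image translation. It decomposes Prostate-Specific Membrane Antigen (PSMA) PET uptake into an MRI-explainable physiological envelope and an orthogonal residual reflecting PET signal components that cannot be expressed within the MRI feature manifold.

Details

Motivation: Existing joint latent representation methods do not clearly delineate information shared across modalities from modality-specific information. Making this distinction is clinically important for establishing each modality's irreducible contribution and for guiding acquisition strategies.

Result: Tested on data from 13 prostate cancer patients, the model shows that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumour regions. This indicates that PSMA PET contains signal components not recoverable from MRI-derived physiological descriptors.

Insight: The contribution is recasting multimodal fusion as orthogonal subspace separation and introducing a projection-based regularization using singular value decomposition that enforces mathematical orthogonality between MRI features and the PET residual in representation space. The result is a structured characterization of modality complementarity grounded in representation geometry rather than image translation.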

Abstract: Multimodal imaging analysis often relies on joint latent representations, yet these approaches rarely define what information is shared versus modality-specific. Clarifying this distinction is clinically relevant, as it delineates the irreducible contribution of each modality and informs rational acquisition strategies. We propose a subspace decomposition framework that reframes multimodal fusion as a problem of orthogonal subspace separation rather than translation. We decompose Prostate-Specific Membrane Antigen (PSMA) PET uptake into an MRI-explainable physiological envelope and an orthogonal residual reflecting signal components not expressible within the MRI feature manifold. Using multiparametric MRI, we train an intensity-based, non-spatial implicit neural representation (INR) to map MRI feature vectors to PET uptake. We introduce a projection-based regularization using singular value decomposition to penalize residual components lying within the span of the MRI feature manifold. This enforces mathematical orthogonality between tissue-level physiological properties (structure, diffusion, perfusion) and intracellular PSMA expression. Tested on 13 prostate cancer patients, the model demonstrates that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumour regions. This indicates that PSMA PET contains signal components not recoverable from MRI-derived physiological descriptors. The resulting decomposition provides a structured characterization of modality complementarity grounded in representation geometry rather than image translation.
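
A rough sketch of a projection-based penalty of the kind this abstract describes: it penalizes the part of a residual that lies in the span of the feature vectors. Two assumptions are worth flagging: the orthonormal basis is built with Gram-Schmidt rather than the singular value decomposition the authors use (both give a basis of the same span), and the vector layout and function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt basis for the span of `vectors` (stand-in for an SVD basis)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis

def span_penalty(residual, feature_vectors):
    """Squared norm of the residual's projection onto the feature span."""
    return sum(dot(residual, q) ** 2 for q in orthonormal_basis(feature_vectors))

# Toy data: two feature vectors spanning the first two coordinates.
features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
inside = span_penalty([2.0, 3.0, 4.0], features)      # components 2 and 3 lie in the span
orthogonal = span_penalty([0.0, 0.0, 5.0], features)  # fully orthogonal residual
```

Driving this penalty to zero leaves only the residual component orthogonal to the feature span, which is the quantity the abstract interprets as modality-specific signal.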

[88] TeaLeafVision: An Explainable and Robust Deep Learning Framework for Tea Leaf Disease Classification cs.CV | cs.AI | cs.LGPDF

Rafi Ahamed, Sidratul Moon Nafsin, Md Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha

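TL;DR: This paper proposes TeaLeafVision, a framework for tea leaf disease classification. It evaluates several CNN models on the teaLeafBD dataset, where DenseNet201 reaches the highest test accuracy of 99%, uses Grad-CAM, occlusion sensitivity analysis, and adversarial training to improve interpretability and robustness, and develops a prototype system for use in real agricultural settings.

Details

Motivation: Tea is the world's second most consumed beverage, so accurate identification of tea leaf diseases matters for the economy and for agricultural management. The paper targets automatic tea leaf disease classification under real field conditions.

Result: On the teaLeafBD dataset, which has 7 classes (6 disease classes and 1 healthy class), DenseNet201 achieved 99% test accuracy, outperforming models such as MobileNetV2 and InceptionV3.

Insight: The contributions are the combination of several interpretability methods (such as Grad-CAM and occlusion analysis) to improve model transparency, and adversarial training to increase noise resistance, yielding an explainable and robust deep learning framework for agricultural disease detection.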

Abstract: As the world's second most consumed beverage after water, tea is not just a cultural staple but a global economic force of profound scale and influence. More than a mere drink, it represents a quiet negotiation between nature, culture, and the human desire for a moment of reflection. The precise identification and detection of tea leaf disease is therefore crucial. With this goal, we evaluated several Convolutional Neural Network (CNN) models on the teaLeafBD dataset, of which three showed notable performance: DenseNet201, MobileNetV2, and InceptionV3. The teaLeafBD dataset contains seven classes (six disease classes and one healthy class) collected under various field conditions reflecting real-world challenges. Among the CNN models, DenseNet201 achieved the highest test accuracy of 99%. To enhance model reliability and interpretability, we implemented Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, together with adversarial training to increase the noise resistance of the model. Finally, we developed a prototype to apply the model's capabilities to real-life agriculture. This paper illustrates the capability of deep learning models to classify diseases in real-life tea leaf disease detection and management.

[89] INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling cs.CVPDF

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji

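TL;DR: This paper proposes INSPATIO-WORLD, a real-time framework that recovers and generates high-fidelity, dynamic interactive scenes from a single reference video. At its core is a Spatiotemporal Autoregressive (STAR) architecture, in which an Implicit Spatiotemporal Cache and an Explicit Spatial Constraint Module ensure global consistency during long-horizon navigation and physically plausible camera trajectories. The paper also introduces Joint Distribution Matching Distillation (JDMD) to improve realism.

Details

Motivation: Existing video generation paradigms fall short on spatial persistence and visual realism, which makes seamless navigation in complex environments hard to support. The goal is a world model with spatial consistency and real-time interactivity.

Result: On the WorldScore-Dynamic benchmark, INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision and ranks first among real-time interactive methods.

Insight: The contributions are the STAR architecture, which couples an implicit world representation with explicit geometric constraints for controllable scene evolution, and JDMD, which uses real-world data distributions as a regularizing guide to counter the fidelity degradation caused by synthetic data. Together they give a practical framework for navigating 4D environments reconstructed from monocular videos.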

Abstract: Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

[90] PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing cs.CVPDF

Ruihang Xu, Dewei Zhou, Xiaolong Shen, Fan Ma, Yi Yang

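TL;DR: This paper proposes PhyEdit, a framework that improves the physical accuracy of object manipulation in image editing by using explicit geometric simulation as 3D-aware visual guidance. The authors also build the RealManip-10K real-world dataset and the ManipEval benchmark for evaluating 3D spatial control and geometric consistency. Experiments show that the method outperforms existing approaches in 3D geometric accuracy and manipulation consistency.

Details

Motivation: Existing visual generative models lack precise spatial manipulation in image editing and often scale and position objects incorrectly, mainly because they have no explicit mechanism for incorporating 3D geometry and perspective projection.

Result: On the proposed ManipEval benchmark, PhyEdit outperforms existing methods, including strong closed-source models, in 3D geometric accuracy and manipulation consistency, reaching state-of-the-art results.

Insight: The contributions are a plug-and-play 3D prior combined with joint 2D-3D supervision, where explicit geometric simulation provides contextual guidance, and a real-world dataset with depth annotations plus an evaluation benchmark, which give research on 3D-aware object manipulation new tools.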

Abstract: Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D–3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.

[91] Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving cs.CVPDF

Yatong Lan, Rongkui Tang, Lei He

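TL;DR: This paper proposes Geo-EVS, a geometry-conditioned extrapolative novel view synthesis framework for autonomous driving that generates standardized virtual views from heterogeneous sensors to reduce dependence on camera rigs. It produces condition maps through geometry-aware reprojection and uses artifact-guided latent diffusion to learn structure recovery where geometric support is missing.

Details

Motivation: Existing extrapolative view synthesis methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and there is no dense target-view supervision. The key is to expose the model explicitly to out-of-trajectory condition defects during training.

Result: On the Waymo dataset, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings, and improves downstream 3D detection. Evaluation uses a LiDAR-Projected Sparse-Reference protocol.

Insight: The contributions are a geometry-conditioned framework in which geometry-aware reprojection unifies the reprojection path between training and inference, and an artifact-guided latent diffusion model that learns to recover structure where geometric support is insufficient, as indicated by artifact masks.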

Abstract: Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.

[92] Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments cs.CV | cs.LGPDF

Icaro Re Depaolini, Uri Hasson

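TL;DR: This paper studies the robustness of explanation methods (such as attribution heatmaps) for deep neural networks that predict human image-authenticity judgments. Although models with different architectures all predict human ratings well (about 80% of the noise ceiling), their explanations agree only weakly across architectures, so model behavior does not yield identifiable explanations.

Details

Motivation: The paper asks whether explanation methods are reliable when deep networks predict human judgments, that is, whether the models reveal the cues underlying human judgments rather than relying on unrelated features such as image quality.

Result: Several pretrained vision models predict human authenticity ratings well. VGG models rely on image quality rather than authenticity-specific variance. EfficientNetB3 and Barlow Twins show high explanation stability within an architecture, but cross-architecture agreement is weak. Ensembles improve predictive performance and support pixel-level attribution.

Insight: The contribution is a systematic evaluation of the robustness of attribution-based explanations, showing that predictive success does not guarantee identifiable explanations. More broadly, post hoc explanations from behavioral models should be treated as weak evidence for cognitive mechanism and handled with caution.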

Abstract: Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.
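
One way to quantify the within-architecture versus cross-architecture agreement discussed in this abstract is to correlate flattened attribution maps. The sketch below assumes Pearson correlation as the agreement measure and uses made-up 2x2 heatmaps; the abstract does not specify the metric the authors used.

```python
from math import sqrt

def flatten(heatmap):
    return [v for row in heatmap for v in row]

def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Made-up attribution maps: two seeds of one architecture, one of another.
arch_a_seed1 = [[0.9, 0.1], [0.2, 0.0]]
arch_a_seed2 = [[0.8, 0.2], [0.1, 0.0]]
arch_b_seed1 = [[0.0, 0.1], [0.2, 0.9]]

within = pearson(flatten(arch_a_seed1), flatten(arch_a_seed2))
across = pearson(flatten(arch_a_seed1), flatten(arch_b_seed1))
```

In this toy case the two seeds highlight the same region while the second architecture highlights a different one, mirroring the pattern of stable within-architecture maps and weak cross-architecture agreement.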

[93] GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos cs.CVPDF

Yiqian Wu, Rawal Khirodkar, Egor Zakharov, Timur Bagautdinov, Lei Xiao

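TL;DR: GenLCA is a diffusion-based generative model that generates and edits photorealistic full-body avatars from text and image inputs. Its core idea is a new paradigm for training a full-body 3D diffusion model from partially observable 2D data (such as real-world videos), which allows large-scale datasets to be used to improve realism and generalization.

Details

Motivation: The paper addresses the challenge of training a high-quality, animatable full-body 3D avatar generator from real-world videos, which usually show only part of the body, and of overcoming the blurring or transparency artifacts that incomplete data cause in existing approaches.

Result: The method produces diverse, high-fidelity results on generation and editing tasks and outperforms existing solutions by a large margin. The abstract names no specific benchmarks and gives no quantitative comparisons.

Insight: The main contributions are: 1) repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer that encodes unstructured video frames into structured 3D tokens; 2) a visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions, which makes it possible to train a 3D diffusion model on incomplete observations.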

Abstract: We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.
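
The visibility-aware strategy in this abstract has two parts: invalid regions are replaced by a learnable token, and the loss is computed only over valid regions. A minimal sketch with scalar tokens follows; the squared-error loss, the function names, and the toy values are assumptions for illustration, not the paper's formulation.

```python
def replace_invalid(tokens, valid, learnable_token):
    """Swap tokens from unobserved regions for a shared placeholder token."""
    return [tok if v else learnable_token for tok, v in zip(tokens, valid)]

def masked_mse(pred, target, valid):
    """Mean squared error over valid positions only."""
    terms = [(p - t) ** 2 for p, t, v in zip(pred, target, valid) if v]
    return sum(terms) / len(terms) if terms else 0.0

tokens = [0.2, 9.9, 0.4]        # middle token comes from an unobserved body part
valid = [True, False, True]     # visibility mask
model_input = replace_invalid(tokens, valid, learnable_token=0.0)

pred = [0.1, 5.0, 0.6]          # stand-in for a model output
loss = masked_mse(pred, tokens, valid)
```

Because the invalid position is excluded from the loss, the corrupted value 9.9 never acts as a training target, whatever the model predicts there.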

[94] Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CVPDF

Changkun Liu, Jiezhi Yang, Zeman Li, Yuan Deng, Jiancong Guo

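TL;DR: Mem3R is a streaming 3D reconstruction model that decouples camera tracking from geometric mapping and uses a hybrid memory design to improve temporal consistency over long sequences. It uses an implicit fast-weight memory updated via Test-Time Training for camera tracking and an explicit token-based fixed-size state for geometric mapping.

Details

Motivation: Existing recurrent models suffer from drift accumulation and temporal forgetting over long sequences because their compressed latent memories have limited capacity. The paper aims to make streaming 3D perception more efficient and consistent.

Result: Compared with CUT3R, the model size falls from 793M to 644M parameters, and integration with TTT3R reduces Absolute Trajectory Error by up to 39% on sequences of 500 to 1000 frames. Downstream tasks such as video depth estimation and 3D reconstruction also improve, while GPU memory usage stays constant and inference throughput remains comparable.

Insight: The contribution is the hybrid memory design: camera tracking and geometric mapping are decoupled and use implicit (an MLP updated via Test-Time Training) and explicit (token-based) memory respectively. This mitigates drift and forgetting over long sequences and supports plug-and-play state update strategies.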

Abstract: Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution by maintaining fixed-size states and enabling linear-time inference, but they often suffer from drift accumulation and temporal forgetting over long sequences due to the limited capacity of compressed latent memories. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit token-based fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R supports existing improved plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500 to 1000 frame sequences. The resulting improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: https://lck666666.github.io/Mem3R/
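
To make the fast-weight idea in this abstract concrete, the sketch below updates a small weight matrix with one gradient step per incoming observation. It is a simplification under assumptions: a single linear layer replaces the lightweight MLP, and the key/value reconstruction loss, learning rate, and toy vectors are illustrative rather than taken from the paper.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def recall_error(W, key, value):
    """Squared error when reading `value` back out of the memory with `key`."""
    return sum((p - v) ** 2 for p, v in zip(matvec(W, key), value))

def ttt_step(W, key, value, lr=0.5):
    """One gradient step on 0.5 * ||W @ key - value||^2 with respect to W."""
    err = [p - v for p, v in zip(matvec(W, key), value)]
    return [[w - lr * e * k for w, k in zip(row, key)] for row, e in zip(W, err)]

W = [[0.0, 0.0], [0.0, 0.0]]          # fixed-size fast-weight state
key, value = [1.0, 0.0], [0.5, -0.5]  # toy observation to memorize

before = recall_error(W, key, value)
W = ttt_step(W, key, value)
after = recall_error(W, key, value)
```

The memory stays the same size however many steps are applied, which is the property that keeps GPU memory constant over long streams.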

[95] Are Face Embeddings Compatible Across Deep Neural Network Models? cs.CV | cs.LG

Fizza Rubab, Yiying Tong, Arun Ross

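TL;DR: This paper studies whether face embeddings produced by different deep neural network models are compatible. Despite differences in training data, loss functions, and architectures, a simple affine transformation can align the embedding spaces of different models and substantially improves cross-model face recognition.

Details

Motivation: As domain-specific models and foundation models are both widely used for biometrics and related tasks, the paper asks whether different DNN models encode facial identity in similar ways, with the goal of addressing the compatibility and interoperability of embeddings across models.

Result: Low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines, for both identification and verification. The alignment patterns generalize across datasets and vary systematically across model families.

Insight: The novelty lies in using geometric analysis to reveal representational convergence in how different models encode facial identity, which offers a new perspective on ensemble design, biometric template security, and cross-model interoperability. Viewed objectively, treating embeddings as point clouds and aligning them with affine transformations is simple and effective, with practical potential.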

Abstract: Automated face recognition has made rapid strides over the past decade due to the unprecedented rise of deep neural network (DNN) models that can be trained for domain-specific tasks. At the same time, foundation models that are pretrained on broad vision or vision-language tasks have shown impressive generalization across diverse domains, including biometrics. This raises an important question: Do different DNN models, both domain-specific and foundation models, encode facial identity in similar ways, despite being trained with different datasets, loss functions, and architectures? In this regard, we directly analyze the geometric structure of embedding spaces induced by different DNN models. Treating embeddings of face images as point clouds, we study whether simple affine transformations can align face representations of one model with another. Our findings reveal surprising cross-model compatibility: low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines for both face identification and verification tasks. Alignment patterns generalize across datasets and vary systematically across model families, indicating representational convergence in facial identity encoding. These findings have implications for model interoperability, ensemble design, and biometric template security.

[96] Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling cs.CV

Junqi Liu, Xinze Zhou, Wenxuan Li, Scott Ye, Arkadiusz Sitek

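TL;DR: The paper proposes SUMI, a simulated degradation-to-enhancement method that uses high-quality photon-counting CT (PCCT) images as reference and learns to reverse realistic acquisition artifacts in conventional energy-integrating CT (EICT), distilling PCCT image quality into routine chest CT. With clinically validated degradation modeling, it substantially enhances EICT images without large-scale paired data, builds a large public dataset, and improves downstream task performance.

Details

Motivation: PCCT offers higher spatial resolution and lower noise than conventional EICT, but its limited clinical availability hinders large-scale research and deployment. The paper aims to bridge this gap by simulating realistic acquisition degradations and using a limited number of high-quality PCCT scans as reference to systematically distill PCCT's imaging advantages into routine EICT.

Result: On external data, SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility in reader studies, and enhances downstream lesion detection, raising sensitivity by up to 15% and F1 score by up to 10%.

Insight: The core idea is clinically validated degradation modeling: high-quality PCCT is transformed into clinically plausible lower-quality counterparts and the model learns to invert that process, which provides effective supervision without paired acquisitions at scale. The method also pre-trains an autoencoder to extract general CT latent features and builds a large public enhanced dataset, both of which are reusable for other generative medical imaging tasks.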

Abstract: Photon-counting CT (PCCT) provides superior image quality with higher spatial resolution and lower noise compared to conventional energy-integrating CT (EICT), but its limited clinical availability restricts large-scale research and clinical deployment. To bridge this gap, we propose SUMI, a simulated degradation-to-enhancement method that learns to reverse realistic acquisition artifacts in low-quality EICT by leveraging high-quality PCCT as reference. Our central insight is to explicitly model realistic acquisition degradations, transforming PCCT into clinically plausible lower-quality counterparts and learning to invert this process. The simulated degradations were validated for clinical realism by board-certified radiologists, enabling faithful supervision without requiring paired acquisitions at scale. As outcomes of this technical contribution, we: (1) train a latent diffusion model on 1,046 PCCTs, using an autoencoder first pre-trained on both these PCCTs and 405,379 EICTs from 145 hospitals to extract general CT latent features that we release for reuse in other generative medical imaging tasks; (2) construct a large-scale dataset of over 17,316 publicly available EICTs enhanced to PCCT-like quality, with radiologist-validated voxel-wise annotations of airway trees, arteries, veins, lungs, and lobes; and (3) demonstrate substantial improvements: across external data, SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility in reader studies, and enhances downstream top-ranking lesion detection performance, increasing sensitivity by up to 15% and F1 score by up to 10%. Our results suggest that emerging imaging advances can be systematically distilled into routine EICT using limited high-quality scans as reference.

[97] Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images cs.CV | cs.CL | cs.MM

Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou

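TL;DR: The paper introduces Appear2Meaning, a cross-cultural benchmark that evaluates how well vision-language models infer structured cultural metadata (such as creator, origin, and period) from images. It measures semantic alignment with an LLM-as-Judge framework and finds that current models perform inconsistently across cultures and metadata types and have clear limitations.

Details

Motivation: Although vision-language models have improved image captioning for cultural heritage, inferring structured cultural metadata from visual input remains underexplored, so a benchmark is needed to evaluate models on this task.

Result: On the cross-cultural benchmark, models capture only fragmented signals, and performance varies substantially across cultural regions and metadata types. Predictions are inconsistent and weakly grounded, which highlights the shortcomings of current models in structured cultural metadata inference.

Insight: The novelty is the multi-category cross-cultural benchmark together with the LLM-as-Judge evaluation framework. Viewed objectively, the analysis exposes the limited generalization of vision-language models on cultural reasoning and points to directions for improving their cultural awareness and structured reasoning.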

Abstract: Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.

[98] MoRight: Motion Control Done Right cs.CV | cs.AI | cs.GR | cs.LG | cs.RO

Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta

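TL;DR: MoRight is a unified framework for motion-controlled video generation. Through disentangled motion modeling it allows independent control of object motion and camera viewpoint, and it further decomposes motion into active (user-driven) and passive (consequence) components to learn motion causality.

Details

Motivation: Existing methods for motion-controlled video generation have two main shortcomings: they entangle camera and object motion in a single tracking signal, which prevents independent control, and they treat motion as kinematic displacement without modeling causal relationships between objects. MoRight aims to address both.

Result: Experiments on three benchmarks show that MoRight achieves state-of-the-art performance in generation quality, motion controllability, and interaction awareness.

Insight: Object motion is specified in a canonical static view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, which disentangles camera and object control. Motion is also decomposed into active and passive components, so the model can learn causality from data and support both forward reasoning (predicting consequences from active motion) and inverse reasoning (recovering the driving action from passive outcomes).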

Abstract: Generating motion-controlled videos, where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints, demands two capabilities: (1) disentangled motion control, allowing users to separately control the object motion and adjust camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motions. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and let MoRight predict the consequences (forward reasoning), or specify desired passive outcomes and let MoRight recover plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.

eess.AS [Back]

[99] Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment eess.AS | cs.AI | cs.CL | cs.SD

Asif Azad, MD Sadik Hossain Shanto, Mohammad Sadat Hossain, Bdour Alwuqaysi, Sabri Boughorbel

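TL;DR: The paper presents Harf-Speech, a modular system for Arabic speech assessment that scores pronunciation at the phoneme level, intended as a scalable automated tool for speech therapy and language learning. It combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein alignment, and a blended scorer, and in clinical validation it agrees closely with expert ratings.

Details

Motivation: Automated phoneme-level pronunciation assessment tools for Arabic are scarce; the work aims to provide a scalable, validated solution for clinical speech therapy and language learning.

Result: The best ASR model fine-tuned on Arabic phoneme data (OmniASR-CTC-1B-v2) reaches a phoneme error rate of 8.92%. In clinical validation, Harf-Speech attains a Pearson correlation of 0.791 and an ICC(2,1) of 0.659 with mean expert scores, outperforming existing end-to-end assessment frameworks.

Insight: The novelty is a modular, clinically aligned framework that combines speech recognition, an alignment algorithm, and blended scoring metrics to produce interpretable phoneme-level scores. Its agreement with experts is comparable to inter-rater agreement, which suggests a path for speech assessment in low-resource languages.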

Abstract: Automated phoneme-level pronunciation assessment is vital for scalable speech therapy and language learning, yet validated tools for Arabic remain scarce. We present Harf-Speech, a modular system scoring Arabic pronunciation at the phoneme level on a clinical scale. It combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein alignment, and a blended scorer using longest common subsequence and edit-distance metrics. We fine-tune three ASR architectures on Arabic phoneme data and benchmark them with zero-shot multimodal models; the best, OmniASR-CTC-1B-v2, achieves 8.92% phoneme error rate. Three certified speech-language pathologists independently scored 40 utterances for clinical validation. Harf-Speech attains a Pearson correlation of 0.791 and ICC(2,1) of 0.659 with mean expert scores, outperforming existing end-to-end assessment frameworks. These results show Harf-Speech yields clinically aligned, interpretable scores comparable to inter-rater expert agreement.
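The scoring ingredients named in the abstract (edit distance, longest common subsequence, phoneme error rate) can be sketched as follows. This is a minimal illustration, not the Harf-Speech scorer: the equal blending weight, the normalization by reference length, the function names, and the toy phoneme sequences are all assumptions, and the mapping to a clinical scale is omitted.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def lcs_length(ref, hyp):
    """Length of the longest common subsequence."""
    prev = [0] * (len(hyp) + 1)
    for r in ref:
        curr = [0]
        for j, h in enumerate(hyp, start=1):
            curr.append(prev[j - 1] + 1 if r == h else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

def blended_score(ref, hyp, weight=0.5):
    """Hypothetical blend of LCS similarity and edit-distance similarity in [0, 1]."""
    lcs_similarity = lcs_length(ref, hyp) / len(ref)
    edit_similarity = max(0.0, 1.0 - phoneme_error_rate(ref, hyp))
    return weight * lcs_similarity + (1.0 - weight) * edit_similarity

# Toy sequences: one substituted phoneme out of five.
reference = ["k", "i", "t", "aa", "b"]
predicted = ["k", "i", "d", "aa", "b"]
per = phoneme_error_rate(reference, predicted)
score = blended_score(reference, predicted)
```

Keeping the two similarity terms separate makes it easy to see which phonemes were substituted, inserted, or dropped, which is the kind of interpretability a modular design allows.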

cs.IR [Back]

[100] WebExpert: domain-aware web agents with critic-guided expert experience for high-precision search cs.IR | cs.AI | cs.CL

Yuelin Hu, Zhengxue Cheng, Ronghua Wu, Qunshan Gu, Hongwei Hu

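TL;DR: The paper proposes WebExpert, a domain-aware web agent for specialized fields such as finance and biomedicine that targets query drift, noisy evidence, and brittle reasoning in professional web search. It achieves high-precision search through sentence-level experience retrieval, schema-light facet induction, and preference-optimized planning, improving answer exact match and reducing page hops on several benchmarks.

Details

Motivation: Specialized web tasks in finance, biomedicine, and similar domains suffer from query drift, noisy evidence, and brittle reasoning because domain priors are missing; the goal is more precise and more robust search.

Result: On GAIA, GPQA, HLE, and WebWalkerQA, WebExpert improves Answer Exact Match (EM) by 1.5-3.6 percentage points over the strongest browsing baseline and reduces page hops. The analysis reports consistent gains from retrieval, topic merging, facet induction, and preference-aware training.

Insight: The contributions are (i) sentence-level experience retrieval with topic merging and rule distillation; (ii) schema-light facet induction from weak supervision, which bootstraps time, region, policy, and industry facets instead of relying on static lexicons; and (iii) preference-optimized planning that jointly improves query planning and retrieval through pairwise preference learning and a coverage-aware objective. Viewed objectively, the lightweight experience gate, which biases decoding toward active facets and falls back under low retrieval confidence, adds adaptability and efficiency.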

Abstract: Specialized web tasks in finance, biomedicine, and pharmaceuticals remain challenging due to missing domain priors: queries drift, evidence is noisy, and reasoning is brittle. We present WebExpert, a domain-aware web agent that we implement end-to-end, featuring: (i) sentence-level experience retrieval with topic merging and rule distillation, (ii) schema-light facet induction that bootstraps time, region, policy, and industry facets from weak supervision instead of static hand-written lexicons, and (iii) preference-optimized planning that jointly improves query planning and retrieval via pairwise preference learning alongside a coverage-aware objective. At inference, a lightweight experience gate biases decoding toward active facets, with fallback under low retrieval confidence. On GAIA, GPQA, HLE, and WebWalkerQA, WebExpert improves Answer Exact Match (EM) by 1.5-3.6 pp over the strongest browsing baseline and reduces page hops. Analysis and ablations show consistent gains from retrieval, topic merging, facet induction, and preference-aware training.

[101] ARIA: Adaptive Retrieval Intelligence Assistant – A Multimodal RAG Framework for Domain-Specific Engineering Education cs.IR | cs.CL

Yue Luo, Dibakar Roy Sarkar, Rachel Herring Sangree, Somdatta Goswami

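TL;DR: The paper proposes ARIA (Adaptive Retrieval Intelligence Assistant), a multimodal retrieval-augmented generation framework for domain-specific engineering education. It processes complex course materials with a multimodal content extraction pipeline that combines Docling, Nougat, and the GPT-4 Vision API, together with the e5-large-v2 embedding model, to build intelligent teaching assistants for university courses.

Details

Motivation: Large language models face hallucinations, limited knowledge updates, and a lack of domain expertise in specialized educational applications. Fine-tuning carries substantial computational overhead, and general-purpose LLMs often answer inaccurately in specialized contexts because they rely on generalized training data.

Result: ARIA was evaluated on lecture material from Statics and Mechanics of Materials, a sophomore-level civil engineering course at Johns Hopkins University, and benchmarked against ChatGPT-5. It reached 97.5% accuracy in domain-specific question filtering, answered all 20 relevant course questions correctly, and rejected 58 of 60 non-relevant queries, giving 90.9% precision, 100% recall, and an average response quality of 4.89/5.0.

Insight: The novelty is a course-agnostic, scalable framework for deploying domain-specific educational AI. Its core is a multimodal content extraction pipeline that combines structured document analysis, mathematical formula recognition, and diagram interpretation, with engineered prompts and response controls to maintain pedagogical consistency. Viewed objectively, the framework applies multimodal RAG systematically to a highly specialized educational domain and integrates several tools to handle complex material, making it a promising approach that emphasizes teaching accuracy and scalability.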

Abstract: Developing effective, domain-specific educational support systems is central to advancing AI in education. Although large language models (LLMs) demonstrate remarkable capabilities, they face significant limitations in specialized educational applications, including hallucinations, limited knowledge updates, and lack of domain expertise. Fine-tuning requires complete model retraining, creating substantial computational overhead, while general-purpose LLMs often provide inaccurate responses in specialized contexts due to reliance on generalized training data. To address this, we propose ARIA (Adaptive Retrieval Intelligence Assistant), a Retrieval-Augmented Generation (RAG) framework for creating intelligent teaching assistants across university-level courses. ARIA leverages a multimodal content extraction pipeline combining Docling for structured document analysis, Nougat for mathematical formula recognition, and GPT-4 Vision API for diagram interpretation, with the e5-large-v2 embedding model for high semantic performance and low latency. This enables accurate processing of complex educational materials while maintaining pedagogical consistency through engineered prompts and response controls. We evaluate ARIA using lecture material from Statics and Mechanics of Materials, a sophomore-level civil engineering course at Johns Hopkins University, benchmarking against ChatGPT-5. Results demonstrate 97.5% accuracy in domain-specific question filtering and superior pedagogical performance. ARIA correctly answered all 20 relevant course questions while rejecting 58 of 60 non-relevant queries, achieving 90.9% precision, 100% recall, and 4.89/5.0 average response quality. These findings demonstrate that ARIA’s course-agnostic architecture represents a scalable framework for domain-specific educational AI deployment.
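The reported filtering figures can be reproduced from the stated counts under one plausible reading, which is an assumption rather than something the abstract spells out: relevant course questions are the positive class, all 20 were accepted, and the 2 non-relevant queries that were not rejected count as false positives. The function name below is illustrative.

```python
def filtering_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Assumed mapping of the counts in the abstract:
# 20 relevant questions accepted, 58 of 60 non-relevant queries rejected.
metrics = filtering_metrics(tp=20, fp=2, fn=0, tn=58)
rounded = {name: round(100 * value, 1) for name, value in metrics.items()}
```

Under this reading the numbers are 78/80 = 97.5% accuracy, 20/22 ≈ 90.9% precision, and 20/20 = 100% recall, which matches the abstract.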

[102] BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment cs.IR | cs.CV

Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek, Mahmoud Abdalla, Mahmoud SalahEldin Kasem

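TL;DR: The paper proposes BRIDGE, a system that addresses the poor retrieval performance of multimodal (image-text) queries against text-only corpora. It has two core components: FORGE, a query alignment model trained with reinforcement learning that distills noisy multimodal queries into compact, retrieval-optimized search strings, and LENS, a language-enhanced neural retriever that handles the intent-rich queries FORGE produces. On the MM-BRIGHT benchmark, BRIDGE outperforms multimodal encoder baselines, and FORGE combined with Nomic-Vision also exceeds the best text-only retriever, indicating that query alignment is the key bottleneck in multimodal-to-text retrieval.

Details

Motivation: Current multimodal retrieval systems handle image-text queries against text-only corpora poorly. The authors argue that the bottleneck is the raw query rather than the retriever: it mixes visual descriptions, conversational noise, and retrieval intent, which systematically degrades embedding similarity.

Result: On MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines, including Nomic-Vision (27.6). When FORGE is used as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10, exceeding the best text-only retriever (32.2) and setting a new state of the art.

Insight: The core contribution is reframing the problem as query alignment rather than encoder improvement, with a two-stage system that needs no multimodal encoder. The key observation is that a query alignment model trained with reinforcement learning (FORGE) to distill query intent, paired with a dense retriever fine-tuned on reasoning-intensive data (LENS), bridges the semantic gap between multimodal queries and text corpora more effectively, which suggests a new and efficient architecture for multimodal retrieval.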

Abstract: Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10, exceeding the best text-only retriever (32.2) and demonstrating that query alignment is the key bottleneck in multimodal-to-text retrieval. https://github.com/mm-bright/multimodal-reasoning-retrieval
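For readers unfamiliar with the metric, nDCG@10 can be computed as in the sketch below. It assumes the common linear-gain form with a log2 rank discount and takes the ideal ranking from the same list of judged documents; the benchmark's own evaluation script may differ in these details, and the reported scores appear to be on a 0-100 scale.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    Assumes `ranked_relevances` holds the relevance label of every judged
    document, in the order the retriever returned them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Toy query: the single relevant document is returned at rank 2.
example = ndcg_at_k([0, 1, 0, 0], k=10)
```

Because of the logarithmic discount, moving a relevant document from rank 2 to rank 1 raises the score more than moving one from rank 9 to rank 8, so a rewritten query that pulls the right passage toward the top is rewarded strongly.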

eess.IV [Back]

[103] MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis eess.IV | cs.CV | cs.LG | cs.MA

Ashmal Vayani, Parth Parag Kulkarni, Joseph Fioresi, Song Wang, Mubarak Shah

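TL;DR: The paper proposes MedRoute, a dynamic multi-agent medical diagnosis framework based on reinforcement learning. It mirrors real clinical workflows: a General Practitioner with an RL-trained router dynamically selects specialists, and a Moderator produces the final decision, improving diagnostic accuracy.

Details

Motivation: Existing large multimodal models are too general for medical diagnosis and do not adapt to the variety of real-world medical scenarios. The work emulates the collaborative, multi-specialist diagnostic process of clinical practice.

Result: Extensive evaluation on text-based and image-based medical datasets shows improved diagnostic accuracy, outperforming state-of-the-art baselines.

Insight: The novelty is an RL-trained router for dynamic specialist selection, combined with a multi-agent collaboration framework that mirrors real clinical workflows, laying a foundation for future research.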

Abstract: Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to the capability of these models to provide precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and unsuitable for the wide range of medical conditions in real-world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain-specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi-agent LMM framework, where each agent functions as a medical specialist. Current approaches in this emerging area typically rely on static or predefined selection of specialists and cannot adapt to changing practical scenarios. In this paper, we propose MedRoute, a flexible and dynamic multi-agent framework that comprises a collaborative system of specialist LMM agents. Furthermore, we add a General Practitioner with an RL-trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text and image-based medical datasets demonstrate improved diagnostic accuracy, outperforming the state-of-the-art baselines. Our work lays a strong foundation for future research. Code and models are available at https://github.com/UCF-CRCV/MedRoute/.

[104] CWRNN-INVR: A Coupled WarpRNN based Implicit Neural Video Representation eess.IV | cs.CV

Yiyang Li, Yanbo Gao, Shuai Li, Zhenyu Du, Jinglin Zhang

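TL;DR: The paper proposes CWRNN-INVR, an implicit neural video representation that combines a neural network with a residual grid, used respectively for the regular, structured information and the irregular detail in a video. A Coupled WarpRNN-based multi-scale motion representation and compensation module explicitly handles the regular information, while a mixed residual grid jointly represents irregular appearance and motion and allows network reuse. Experiments show the best reconstruction results on the UVG dataset and better performance than existing INVR methods on other downstream tasks.

Details

Motivation: Existing implicit neural video representation methods focus on efficient grid structures and neural architectures with large representation capability, but the respective roles of the two in video representation have not been studied. The paper examines the difference between network-based and grid-based INVR from the perspective of video information composition, identifies the strengths of each, and proposes a hybrid framework accordingly.

Result: On the UVG dataset, the method achieves an average PSNR of 33.73 dB under the 3M model, the best reconstruction result among compared methods, and it outperforms existing INVR methods on other downstream tasks.

Insight: The paper is the first to distinguish the roles of networks and grids in INVR from the perspective of information composition (networks for general structure, grids for specific detail) and builds a mixed network and residual-grid framework on that basis. Concretely, it designs a Coupled WarpRNN module that explicitly handles regular motion information and a mixed residual grid that jointly represents irregular appearance and motion; the design allows network reuse and improves efficiency. Viewed objectively, decomposing the task by information type and designing modules accordingly is an idea worth borrowing.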

Abstract: Implicit Neural Video Representation (INVR) has emerged as a novel approach for video representation and compression, using learnable grids and neural networks. Existing methods focus on developing new grid structures that are efficient for latent representation and neural network architectures with large representation capability, but the respective roles of grids and networks in video representation have not been studied. In this paper, the difference between INVR based on neural networks and INVR based on grids is first investigated from the perspective of video information composition to specify their respective advantages, i.e., neural networks for general structure and grids for specific detail. Accordingly, an INVR based on a mixed neural network and residual grid framework is proposed, where the neural network represents the regular and structured information and the residual grid represents the remaining irregular information in a video. A Coupled WarpRNN-based multi-scale motion representation and compensation module is specifically designed to explicitly represent the regular and structured information, hence the name CWRNN-INVR. For the irregular information, a mixed residual grid is learned in which the irregular appearance and motion information are represented together. The mixed residual grid can be combined with the coupled WarpRNN in a way that allows for network reuse. Experiments show that our method achieves the best reconstruction results compared with existing methods, with an average PSNR of 33.73 dB on the UVG dataset under the 3M model, and outperforms existing INVR methods in other downstream tasks. The code can be found at https://github.com/yiyang-sdu/CWRNN-INVR.git.
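The reconstruction metric quoted above, PSNR, is defined as 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference and reconstructed frames. The sketch below assumes 8-bit pixel values flattened into plain lists and a hypothetical function name; how the paper averages over frames and videos is not stated in the abstract.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy frame: every pixel is off by exactly one gray level, so MSE = 1.
frame = [100, 100, 100, 100]
decoded = [101, 99, 101, 99]
value_db = psnr(frame, decoded)
```

Since the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.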

[105] 4D Vessel Reconstruction for Benchtop Thrombectomy Analysis eess.IV | cs.CV | physics.med-ph

Ethan Nguyen, Javier Carmona, Arisa Matsuzaki, Naoki Kaneko, Katsushi Arisaka

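TL;DR: The paper presents a low-cost, nine-camera multi-view workflow for benchtop mechanical thrombectomy testing. It reconstructs time-resolved 3D vessel motion of silicone middle cerebral artery phantoms with 4D Gaussian Splatting and converts the result into fixed-connectivity edge graphs for displacement tracking and a relative surface stress proxy.

Details

Motivation: Mechanical thrombectomy can cause vessel deformation and procedure-related injury. Existing benchtop models lack time-resolved, full-field 3D vessel-motion measurements, so a standardized framework is needed to compare biomechanical responses across procedural conditions.

Result: In synthetic validation, the method agrees closely with ground truth in geometry and time (symmetric Chamfer distance of 1.714-1.815 mm, precision of 0.964-0.972 at a 1 mm threshold). In preliminary benchtop trials, cervical aspiration catheter placement produced higher max-median ROI displacement and stress-proxy values than internal carotid artery terminus placement.

Insight: The contributions are: 1) combining a low-cost multi-view system with 4D Gaussian Splatting for dynamic vessel reconstruction; 2) converting point clouds into fixed-connectivity edge graphs for ROI displacement tracking; and 3) introducing a relative surface stress proxy based on a Neo-Hookean mapping, which gives benchtop testing a standardized basis for comparison rather than an absolute stress estimate.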

Abstract: Introduction: Mechanical thrombectomy can cause vessel deformation and procedure-related injury. Benchtop models are widely used for device testing, but time-resolved, full-field 3D vessel-motion measurements remain limited. Methods: We developed a nine-camera, low-cost multi-view workflow for benchtop thrombectomy in silicone middle cerebral artery phantoms (2160p, 20 fps). Multi-view videos were calibrated, segmented, and reconstructed with 4D Gaussian Splatting. Reconstructed point clouds were converted to fixed-connectivity edge graphs for region-of-interest (ROI) displacement tracking and a relative surface-based stress proxy. Stress-proxy values were derived from edge stretch using a Neo-Hookean mapping and reported as comparative surface metrics. A synthetic Blender pipeline with known deformation provided geometric and temporal validation. Results: In synthetic bulk translation, the stress proxy remained near zero for most edges (median $\approx 0$ MPa; 90th percentile 0.028 MPa), with sparse outliers. In synthetic pulling (1-5 mm), reconstruction showed close geometric and temporal agreement with ground truth, with symmetric Chamfer distance of 1.714-1.815 mm and precision of 0.964-0.972 at $\tau = 1$ mm. In preliminary benchtop comparative trials (one trial per condition), cervical aspiration catheter placement showed higher max-median ROI displacement and stress-proxy values than internal carotid artery terminus placement. Conclusion: The proposed protocol provides standardized, time-resolved surface kinematics and comparative relative displacement and stress proxy measurements for thrombectomy benchtop studies. The framework supports condition-to-condition comparisons and methods validation, while remaining distinct from absolute wall-stress estimation. Implementation code and example data are available at https://ethanuser.github.io/vessel4D
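The two geometric validation metrics in the Results can be sketched with a brute-force nearest-neighbor search. Conventions for Chamfer distance vary (sum or mean of the two directions, squared or unsquared distances), so the form below, the sum of the two mean unsquared nearest-neighbor distances, is an assumption, as are the function names and the toy point clouds.

```python
import math

def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbor in `cloud`."""
    return min(math.dist(point, other) for other in cloud)

def symmetric_chamfer(cloud_a, cloud_b):
    """Sum of the mean nearest-neighbor distances in both directions."""
    a_to_b = sum(nearest_distance(p, cloud_b) for p in cloud_a) / len(cloud_a)
    b_to_a = sum(nearest_distance(q, cloud_a) for q in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a

def precision_at_tau(reconstructed, ground_truth, tau):
    """Fraction of reconstructed points within `tau` of the ground truth."""
    hits = sum(1 for p in reconstructed
               if nearest_distance(p, ground_truth) <= tau)
    return hits / len(reconstructed)

# Toy clouds in millimetres: one reconstructed point is 2 mm off the surface.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.0, 0.5), (1.0, 0.0, 2.0)]
chamfer = symmetric_chamfer(reconstructed, ground_truth)
precision = precision_at_tau(reconstructed, ground_truth, tau=1.0)
```

The brute-force search is quadratic in the number of points; for dense reconstructions a spatial index such as a k-d tree would normally replace `nearest_distance`.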

cs.CE [Back]

[106] XR-CareerAssist: An Immersive Platform for Personalised Career Guidance Leveraging Extended Reality and Multimodal AI cs.CE | cs.AI | cs.CV | cs.CY | cs.ET

N. D. Tantaroudas, A. J. McCracken, I. Karachalios, E. Papatheou, V. Pastrikakis

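TL;DR: The paper presents XR-CareerAssist, an immersive career guidance platform that combines Extended Reality (XR) with multimodal AI. The system integrates automatic speech recognition, neural machine translation, a Langchain-based conversational assistant, a BLIP-based vision-language model, and AWS Polly text-to-speech, and presents career trajectories through an interactive 3D avatar and dynamic Sankey diagrams, aiming for a personalised, multilingual, immersive career development experience.

Details

Motivation: Conventional career guidance platforms rely on static, text-driven interfaces, lack interactivity and personalisation, and neglect the narrative dimension of career development. The authors aim to build a more engaging, accessible, and effective career development tool.

Result: A pilot evaluation at the University of Exeter with 23 participants reported 95.6% speech recognition accuracy, 78.3% overall user satisfaction, and 91.3% favourable ratings for system responsiveness. Feedback informed subsequent improvements to motion comfort, audio clarity, and text legibility.

Insight: The main novelty is the integration of five AI modules (speech recognition, machine translation, conversational assistant, vision-language model, and text-to-speech) in a single immersive XR environment, producing a multimodal interaction experience that distinguishes it from existing platforms. Viewed objectively, visualising a large repository of more than 100,000 anonymised professional profiles as dynamic Sankey diagrams offers a new interactive way to explore career trajectories.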

Abstract: Conventional career guidance platforms rely on static, text-driven interfaces that struggle to engage users or deliver personalised, evidence-based insights. Although Computer-Assisted Career Guidance Systems have evolved since the 1960s, they remain limited in interactivity and pay little attention to the narrative dimensions of career development. We introduce XR-CareerAssist, a platform that unifies Extended Reality (XR) with several Artificial Intelligence (AI) modules to deliver immersive, multilingual career guidance. The system integrates Automatic Speech Recognition for voice-driven interaction, Neural Machine Translation across English, Greek, French, and Italian, a Langchain-based conversational Training Assistant for personalised dialogue, a BLIP-based Vision-Language model for career visualisations, and AWS Polly Text-to-Speech delivered through an interactive 3D avatar. Career trajectories are rendered as dynamic Sankey diagrams derived from a repository of more than 100,000 anonymised professional profiles. The application was built in Unity for Meta Quest 3, with backend services hosted on AWS. A pilot evaluation at the University of Exeter with 23 participants returned 95.6% speech recognition accuracy, 78.3% overall user satisfaction, and 91.3% favourable ratings for system responsiveness, with feedback informing subsequent improvements to motion comfort, audio clarity, and text legibility. XR-CareerAssist demonstrates how the fusion of XR and AI can produce more engaging, accessible, and effective career development tools, with the integration of five AI modules within a single immersive environment yielding a multimodal interaction experience that distinguishes it from existing career guidance platforms.

cs.HC [Back]

[107] LLM Spirals of Delusion: A Benchmarking Audit Study of AI Chatbot Interfaces cs.HC | cs.AI | cs.CL

Peter Kirgis, Ben Hawriluk, Sherrie Feng, Aslan Bilimer, Sam Paech

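TL;DR: Through an audit and benchmarking study, this work evaluates how different large language models affect users' delusional or conspiratorial thinking in sustained conversations. It compares API outputs with real chat interfaces and finds that API testing alone does not reflect real-world impact and that model updates can reverse behavior.

Details

Motivation: As people increasingly hold open-ended conversations with LLMs, models may reinforce delusional or conspiratorial ideation and even amplify harmful beliefs. Existing tests mostly use the API and do not account for real chat interfaces, so model behavior needs to be evaluated in both environments.

Result: Across 56 conversations of 20 turns each, ChatGPT-5 showed less sycophancy, escalation, and delusion reinforcement than ChatGPT-4o in the chat interface; performance differed substantially between the API and the chat interface; the same API endpoint tested two months apart showed a complete reversal in behavior; and levels of negative behavior remained high.

Insight: The novelty is a systematic comparison of how the API and real chat interfaces affect model behavior. The study stresses the importance of temporal dynamics in multi-turn evaluation and argues that transparency about model updates is necessary for auditing, offering a new perspective on AI safety evaluation.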

Abstract: People increasingly hold sustained, open-ended conversations with large language models (LLMs). Public reports and early studies suggest that, in such settings, models can reinforce delusional or conspiratorial ideation or even amplify harmful beliefs and engagement patterns. We present an audit and benchmarking study that measures how different LLMs encourage, resist, or escalate disordered and conspiratorial thinking. We explicitly compare API outputs to user chat interfaces, such as the ChatGPT desktop app or web interface, which are how people converse with chatbots in real life but are almost never used for testing. In total, we run 56 20-turn conversations testing ChatGPT-4o and ChatGPT-5, via both the API and chat interface, and have each conversation graded by two research assistants (RAs) as well as by GPT-5. We document five results. First, we observe large differences in performance between the API and chat interface environments, showing that the universally used method of automated testing through the API is not sufficient to assess the impact of chatbots in the real world. Second, when tested in the chat interface, we find that ChatGPT-5 displays less sycophancy, escalation, and delusion reinforcement than ChatGPT-4o, showing that these behaviors are influenced by the policy choices of major AI companies. Third, conversations with nearly identical aggregate intensity in a behavior display large differences in how the behavior evolves turn by turn, highlighting the importance of temporal dynamics in multi-turn evaluation. Fourth, even updated models display substantial levels of negative behaviors, revealing that model improvement does not imply model safety. Fifth, the same API endpoint tested just two months apart yields a complete reversal in behavior, underscoring how transparency in model updates is a necessary prerequisite for robust audit findings.

[108] BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving cs.HC | cs.CV | cs.MM

Yuhang Wang, Yiyao Xu, Chaoyun Yang, Lingyao Li, Jingran Sun

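TL;DR: The paper introduces BATON, a large-scale multimodal benchmark for studying bidirectional control transitions in naturalistic driving, where the driver hands control to the automation system or takes it back. The dataset synchronizes front-view video, in-cabin video, vehicle CAN bus signals, radar-based interaction, and GPS-derived route context, forming a closed-loop multimodal record around each control transition. The paper defines three benchmark tasks (driving action understanding, handover prediction, and takeover prediction) and evaluates baselines including sequence models, classical classifiers, and zero-shot vision-language models.

Details

Motivation: Driving automation systems on production vehicles rely on the driver to decide when to engage them and require the driver to stay attentive and ready to take over. This imposes heavy situational judgment and cognitive load, leading to steep learning curves, poor user experience, and safety risks from over-reliance or delayed takeover. Predicting when drivers hand over or take back control is critical for proactive, context-aware human-machine interaction, yet existing datasets rarely capture the multimodal context of road scene, driver state, vehicle dynamics, and route environment.

Result: Baseline results on BATON show that visual input alone is insufficient for reliable transition prediction: front-view video captures the road environment but not driver state, while in-cabin video reflects driver readiness but not the external scene. Adding CAN bus and route-context signals substantially outperforms video-only settings, indicating strong complementarity across modalities. Takeover events also develop more gradually and benefit from longer prediction horizons, whereas handover events depend more on immediate contextual cues, revealing an asymmetry relevant to human-machine interaction design in assisted driving.

Insight: The core contribution is a large-scale, multimodal, closed-loop naturalistic driving dataset dedicated to bidirectional control transitions, filling a gap in existing datasets. Viewed objectively, the study shows that multimodal information (especially vehicle dynamics and route context) is necessary and complementary for transition prediction and that handover and takeover events differ in their temporal dynamics, with direct implications for designing smarter, more proactive driver-assistance interfaces.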

Abstract: Existing driving automation (DA) systems on production vehicles rely on human drivers to decide when to engage DA while requiring them to remain continuously attentive and ready to intervene. This design demands substantial situational judgment and imposes significant cognitive load, leading to steep learning curves, suboptimal user experience, and safety risks from both over-reliance and delayed takeover. Predicting when drivers hand over control to DA and when they take it back is therefore critical for designing proactive, context-aware HMI, yet existing datasets rarely capture the multimodal context, including road scene, driver state, vehicle dynamics, and route environment. To fill this gap, we introduce BATON, a large-scale naturalistic dataset capturing real-world DA usage across 127 drivers and 136.6 hours of driving. The dataset synchronizes front-view video, in-cabin video, decoded CAN bus signals, radar-based lead-vehicle interaction, and GPS-derived route context, forming a closed-loop multimodal record around each control transition. We define three benchmark tasks (driving action understanding, handover prediction, and takeover prediction) and evaluate baselines spanning sequence models, classical classifiers, and zero-shot VLMs. Results show that visual input alone is insufficient for reliable transition prediction: front-view video captures road context but not driver state, while in-cabin video reflects driver readiness but not the external scene. Incorporating CAN and route-context signals substantially improves performance over video-only settings, indicating strong complementarity across modalities. We further find that takeover events develop more gradually and benefit from longer prediction horizons, whereas handover events depend more on immediate contextual cues, revealing an asymmetry with direct implications for HMI design in assisted driving systems.

cs.LG [Back]

[109] The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning cs.LG | cs.AI | cs.CL

Yi Xu, Philipp Jettkant, Laura Ruis

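TL;DR: The paper studies the limits of large language models in discovering and executing multi-step latent planning strategies without supervision on intermediate steps. Using graph path-finding tasks that precisely control the number of required latent planning steps, it finds a depth ceiling that massive scaling does not remove: tiny transformers trained from scratch discover strategies of up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Models learn at most five latent planning steps during training, yet a discovered strategy generalizes to eight steps at test time, revealing a dissociation between discovering a strategy and executing it.

Details

Motivation: The work probes how far large language models can reason effectively in their latent representations, in particular whether a model supervised only on final answers can discover a multi-step planning strategy and execute it latently within a single forward pass, which bears on the viability of chain-of-thought monitoring.

Result: On graph path-finding tasks, tiny transformers trained from scratch discover strategies of up to three latent planning steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 reaches seven under few-shot prompting. The maximum latent planning depth learned during training is five, but a discovered strategy generalizes to eight steps at test time.

Insight: The paper shows an inherent depth ceiling in how LLMs discover and execute multi-step latent planning strategies, and that the ability to discover a strategy can be decoupled from the ability to execute it. Strategies that require several coordinated latent planning steps may therefore need to be taught explicitly or externalized (for example as chain of thought), which lends support to CoT monitoring and indicates the limits of scaling for this kind of intrinsic constraint.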

Abstract: The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph path-finding tasks that precisely control the number of required latent planning steps, we uncover a striking limitation unresolved by massive scaling: tiny transformers trained from scratch discover strategies requiring up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Although the maximum latent planning depth models can learn during training is five, the discovered strategy generalizes up to eight latent steps at test-time. This reveals a dissociation between the ability to discover a latent strategy under final-answer supervision alone and the ability to execute it once discovered. If similar limits hold more broadly, strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, lending credence to CoT monitoring.
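The abstract does not describe how the path-finding instances are built, so the following is only a minimal sketch of one way to control the number of required planning steps. It assumes a star-shaped graph in which several chains of equal length leave a shared start node, and the answer is the first hop toward a given goal leaf. The names `make_path_task` and `solve_backward` are hypothetical and do not come from the paper.

```python
import random


def make_path_task(depth, n_branches=3, seed=0):
    """Build a star graph: `n_branches` chains of length `depth` leave node 0.

    Hypothetical construction: the goal is the leaf of one chain and the
    answer is the first node on that chain, so tracing the path from the
    goal back to the start takes exactly `depth` edge lookups.
    """
    rng = random.Random(seed)
    labels = list(range(1, n_branches * depth + 1))
    rng.shuffle(labels)  # random node ids, so the id itself gives no hint
    start = 0
    edges, branches = [], []
    for b in range(n_branches):
        chain = labels[b * depth:(b + 1) * depth]
        prev = start
        for node in chain:
            edges.append((prev, node))
            prev = node
        branches.append(chain)
    target = rng.randrange(n_branches)
    rng.shuffle(edges)  # edge order in the prompt should not leak the path
    return {
        "edges": edges,
        "start": start,
        "goal": branches[target][-1],
        "answer": branches[target][0],
    }


def solve_backward(task):
    """Reference solver: follow parent pointers from the goal to the start."""
    parent = {child: par for par, child in task["edges"]}
    node, hops = task["goal"], 1
    while parent[node] != task["start"]:
        node = parent[node]
        hops += 1
    return node, hops  # first hop from the start, number of edges traversed
```

Under these assumptions, varying `depth` changes how many lookups separate the goal from the answer, which is the quantity a latent planner would have to compute without writing intermediate nodes down.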

[110] FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling cs.LG | cs.AI | cs.CVPDF

Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu

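TL;DR: This paper proposes Sol-RL (Speed-of-light RL), a two-stage reinforcement learning framework for efficient human-preference alignment post-training of large-scale text-to-image diffusion models. It uses FP4 quantization for the exploratory generation of candidate samples and BF16 precision for policy optimization, which significantly accelerates training and improves alignment while preserving training integrity.

Details

Motivation: RL-based post-training is an effective way to align text-to-image diffusion models with human preferences, but increasing the number of exploratory samples (the rollout group size) imposes a heavy computational burden, especially on large foundation models such as FLUX.1-12B. A method is needed that improves efficiency without harming training quality.

Result: Extensive experiments across SANA, FLUX.1, and SD3.5-L show that the method achieves better alignment on multiple metrics while accelerating training convergence by up to 4.64×.

Insight: The core contribution is algorithm-hardware co-design: exploration (FP4-quantized generation of a large candidate pool) is decoupled from optimization (BF16 regeneration of the selected samples). The system thus exploits FP4's high throughput while the algorithm keeps high-fidelity samples for policy optimization, balancing efficiency and quality.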

Abstract: Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.
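The abstract says a "highly contrastive subset" is extracted from the FP4 candidate pool but does not state the selection rule. The sketch below assumes one plausible rule, keeping the `k` lowest- and `k` highest-reward candidates, followed by group-standardized advantages on the retained rewards. Both choices are assumptions made for illustration, and the function names are hypothetical.

```python
def select_contrastive(rewards, k):
    """Return indices of the k lowest- and k highest-reward candidates.

    Assumed selection rule: the two reward extremes form the
    'contrastive' subset that would be regenerated at higher precision.
    """
    if k < 1 or 2 * k > len(rewards):
        raise ValueError("need 1 <= k and at least 2*k candidates")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[:k] + order[-k:]


def group_advantages(rewards):
    """Standardize rewards within the group (assumed normalization)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid dividing by zero for a constant group
    return [(r - mean) / std for r in rewards]
```

In a two-stage pipeline of this shape, the low-precision rewards would be used only to pick indices. The rewards passed to `group_advantages` would come from the regenerated high-precision samples, so that quantization noise does not enter the policy update.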

[111] Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning cs.LG | cs.CLPDF

Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis

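TL;DR: This paper studies how to train transformers with Meta-Learning for Compositionality (MLC) to solve letter-string analogy tasks. It finds that including copying tasks as an intermediate step guides the model to attend to the key information, which improves analogical reasoning. Models trained on heterogeneous datasets generalize better to new alphabets and show some generalization to compositions of transformations, but remain limited on completely novel transformations.

Details

Motivation: Analogical reasoning is a core capability of human intelligence, yet building AI systems with robust analogical reasoning remains challenging. The paper explores the potential of transformers to learn analogical reasoning through MLC, in particular how the training strategy can improve generalization.

Result: On letter-string analogies, a 3-layer encoder-decoder model trained on heterogeneous datasets generalizes to new alphabets better than most frontier models. It partially generalizes to compositions of trained transformations, but generalization to completely novel transformations is limited.

Insight: Contributions include using copying tasks as an intermediate training step to guide the model's attention, and applying MLC to improve generalization in analogical reasoning. Viewed objectively, the approach offers an interpretability-based route to understanding the model's internal computation and highlights the importance of data heterogeneity for generalization.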

Abstract: Analogical reasoning is a hallmark of human intelligence, enabling us to solve new problems by transferring knowledge from one situation to another. Yet, developing artificial intelligence systems capable of robust human-like analogical reasoning has proven difficult. In this work, we train transformers using Meta-Learning for Compositionality (MLC) on an analogical reasoning task (letter-string analogies) and assess their generalization capabilities. We find that letter-string analogies become learnable when guiding the models to attend to the most informative problem elements induced by including copying tasks in the training data. Furthermore, generalization to new alphabets becomes better when models are trained with more heterogeneous datasets, where our 3-layer encoder-decoder model outperforms most frontier models. The MLC approach also enables some generalization to compositions of trained transformations, but not to completely novel transformations. To understand how the model operates, we identify an algorithm that approximates the model’s computations. We verify this using interpretability analyses and show that the model can be steered precisely according to expectations derived from the algorithm. Finally, we discuss implications of our findings for generalization capabilities of larger models and parallels to human analogical reasoning.

[112] SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning cs.LG | cs.AI | cs.CLPDF

Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu

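TL;DR: This paper proposes SHAPE (Stage-aware Hierarchical Advantage via Potential Estimation), a framework for improving the reasoning ability of large language models (LLMs). It formalizes reasoning as a trajectory through a state space of empirical solvability and uses hierarchical credit assignment: at the segment level, a stage-aware advantage function prioritizes efficient breakthroughs in low-potential states; at the token level, entropy-driven redistribution sharpens execution signals. The result is higher reasoning accuracy with substantially lower token consumption.

Details

Motivation: Existing process-supervision methods cannot distinguish meaningful progress from mere verbosity, which limits reasoning ability and leaves token inefficiency unresolved. SHAPE aims to address this.

Result: Extensive experiments on three base models and five math reasoning benchmarks show that SHAPE improves average accuracy by 3% while reducing token consumption by 30%.

Insight: The contribution lies in formalizing reasoning as a trajectory through a state space and introducing hierarchical credit assignment (a stage-aware advantage function at the segment level and entropy-driven redistribution at the token level), which supervises the reasoning process at a finer granularity and balances accuracy with efficiency.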

Abstract: Process supervision has emerged as a promising approach for enhancing LLM reasoning, yet existing methods fail to distinguish meaningful progress from mere verbosity, leading to limited reasoning capabilities and unresolved token inefficiency. To address this, we propose Stage-aware Hierarchical Advantage via Potential Estimation (SHAPE), a framework that formalizes reasoning as a trajectory through a state space of empirical solvability. SHAPE introduces a hierarchical credit assignment mechanism: at the segment level, it employs a stage-aware advantage function to prioritize efficient breakthroughs in low-potential states; at the token level, it utilizes entropy-driven redistribution to sharpen execution signals. Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.

cs.MA [Back]

[113] Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation cs.MA | cs.AI | cs.CLPDF

Philipp D. Siedler

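TL;DR: This paper presents the Strategic Courtroom Framework, a multi-agent simulation environment for studying strategic persuasion in law. LLM agents built from interpretable traits simulate prosecution and defense teams in multi-round legal argumentation. Through large-scale simulation, the study analyzes how different trait combinations affect outcomes and introduces a reinforcement-learning-based Trait Orchestrator that generates strategies dynamically.

Details

Motivation: Existing game-theoretic models of adversarial domains such as law and diplomacy tend to ignore the role of language as the medium of persuasion. The paper builds a language-based multi-agent simulation environment that treats language as a first-class strategic action space, in order to study strategic persuasion in iterative legal argumentation.

Result: Over 7,000 simulated trials were run with DeepSeek-R1 and Gemini 2.5 Pro across 10 synthetic legal cases and 84 three-trait team configurations. Heterogeneous teams with complementary traits outperform homogeneous configurations; moderate interaction depth yields more stable verdicts; and certain traits (notably quantitative and charismatic) contribute markedly to persuasive success. The RL-based Trait Orchestrator also discovers dynamic strategies that outperform static, human-designed combinations.

Insight: The contribution is to model language as a strategic action space and to build a controllable, interpretable multi-agent framework for simulated legal debate. Interpretable traits, organized into four archetypes, systematically control each agent's rhetorical style and strategic orientation, and reinforcement learning provides dynamic, adaptive strategy generation. This lays a foundation for autonomous agents capable of adaptive persuasion in multi-agent environments.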

Abstract: Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7,000 simulated trials using DeepSeek-R1 and Gemini 2.5 Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

hep-ex [Back]

[114] Towards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training hep-ex | cs.CVPDF

Saúl Alonso-Monsalve, Fabio Cufino, Umut Kose, Anna Mascellani, André Rubbia

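TL;DR: This paper proposes a sparse Vision Transformer (ViT) framework that learns reusable representations from heterogeneous detector data through self-supervised pre-training. The method combines masked-autoencoder reconstruction with relational voxel-level objectives (hierarchy, ghost, and particle identification) and jointly fine-tunes on classification and regression tasks. Evaluated on simulated FASERCal events at the LHC, pre-training markedly improves neutrino flavour and charm-quark identification, momentum regression, and vertex reconstruction, with the relational objectives adding further gains in topologically complex channels. The learned representations also transfer to public benchmarks spanning different detector technologies and energy scales, matching or exceeding published baselines.

Details

Motivation: Accelerator-based neutrino physics is entering an energy-frontier regime in which interactions reach the TeV scale and produce exceptionally dense, overlapping detector signatures. Conventional reconstruction struggles in this regime, labelled data are scarce, and downstream tasks are diverse, so a general model that learns reusable representations from heterogeneous data is needed.

Result: On simulated FASERCal events at the LHC, self-supervised pre-training consistently improves neutrino flavour and charm-quark identification, momentum regression, and vertex reconstruction over training from scratch, and the relational objectives yield further gains in topologically complex channels. With only about 10^3 labelled events, the pre-trained encoder matches the flavour-classification performance of a randomly initialised model trained on an order of magnitude more data. The learned representations also transfer to public benchmarks spanning different detector technologies and energy scales, matching or exceeding published baselines.

Insight: Contributions include a self-supervised pre-training strategy that combines a masked autoencoder with relational voxel-level objectives to learn transferable representations of heterogeneous detectors; a sparse ViT framework for dense, overlapping detector signatures; and interpretability analyses and detector-subsystem ablations that check the physical plausibility of the representations. Together these offer a scalable route to general-purpose representation learning for particle-detector analysis.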

Abstract: Accelerator-based neutrino physics is entering an energy-frontier regime in which interactions reach the TeV scale and produce exceptionally dense, overlapping detector signatures. In this regime, event interpretation becomes impractical for conventional reconstruction approaches, particularly when labelled data are scarce and the analysis spans diverse downstream objectives. We present a sparse ViT framework for learning reusable representations from heterogeneous detector data. Self-supervised pre-training combines masked autoencoder reconstruction with relational voxel-level objectives for hierarchy, ghost and particle identification, and the resulting shared encoder is then jointly fine-tuned across classification and regression tasks. Evaluated on simulated events from the proposed FASERCal concept at the LHC, we find that pre-training consistently improves neutrino flavour and charm-quark identification, momentum regression, and vertex reconstruction over training from scratch, with the addition of relational objectives yielding further gains in the most topologically complex channels. Interpretability analyses further show that pre-training yields a more structured latent space, while detector-subsystem ablations recover physically plausible channel-dependent roles for the heterogeneous inputs. A data-efficiency study shows that, with roughly $10^3$ labelled events, the pre-trained encoder already matches the flavour-classification performance of a randomly initialised model trained on an order of magnitude more data. The learned representations also transfer effectively to publicly available benchmarks spanning different detector technologies and energy scales, matching or exceeding published baselines. These results support self-supervised pre-training on multimodal detector data as a scalable route towards reusable representations for neutrino and particle-detector analysis.

cs.RO [Back]

[115] KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis cs.RO | cs.AI | cs.CVPDF

Mehdi Hosseinzadeh, King Hang Wong, Feras Dayoub

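TL;DR: This paper presents KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for robot failure analysis with vision-language models (VLMs). KITE extracts motion-salient keyframes, runs open-vocabulary detection, and builds a bird's-eye-view layout representation, then combines these with robot-profile and scene-context tokens into a unified prompt that supports failure detection, identification, localization, explanation, and correction.

Details

Motivation: To convert long videos into structured, interpretable input so that off-the-shelf VLMs can perform comprehensive robot failure analysis without complex model training.

Result: On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially outperforms vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, and is competitive with a RoboFAC-tuned baseline. QLoRA fine-tuning further improves explanation and correction quality.

Insight: The contribution is to distill a video into a serialized representation of keyframes and structured bird's-eye-view layouts, giving a training-free, interpretable front-end that supports multiple failure-analysis tasks in a unified way. Viewed objectively, the compact encoding of visual cues and the layout grounding strengthen a VLM's understanding of robot tasks.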

Abstract: We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird’s-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/
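The abstract does not say how "motion-salient" keyframes are chosen. As a rough illustration only, the sketch below swaps in simple frame differencing: it scores each frame by its mean absolute pixel change from the previous frame and keeps the `k` highest-scoring frames in temporal order. The frame format (flat lists of intensities) and both function names are assumptions for this example, not KITE's method.

```python
def motion_scores(frames):
    """Mean absolute difference between each frame and its predecessor.

    `frames` is a list of equal-length flat lists of pixel intensities.
    The first frame has no predecessor and gets a score of 0.0.
    """
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        scores.append(diff / len(cur))
    return scores


def select_keyframes(frames, k):
    """Return indices of the k frames with the largest motion score,
    sorted back into temporal order."""
    scores = motion_scores(frames)
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Each selected index could then be paired with detections and a layout sketch for that frame. A practical system would likely also enforce a minimum temporal gap between keyframes, which this sketch omits.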

[116] RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild cs.RO | cs.AI | cs.CVPDF

Wenjing Margaret Mao, Jefferson Ng, Luyang Hu, Daniel Gehrig, Antonio Loquercio

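TL;DR: This paper introduces RoSHI, a hybrid wearable system that combines low-cost sparse IMUs with Project Aria glasses to estimate the wearer's 3D pose and body shape from egocentric perception, with metric consistency in a global coordinate frame. The system is designed to collect rich, long-horizon human interaction data in the wild to support robot learning.

Details

Motivation: Existing approaches to collecting in-the-wild human data trade off portability, robustness to occlusion, and global consistency. RoSHI addresses these limitations by fusing IMUs with egocentric SLAM sensing, providing high-quality human motion data for robot learning.

Result: On a dataset of agile activities, RoSHI generally outperforms other egocentric baselines and performs comparably to a state-of-the-art exocentric baseline (SAM3D). Motion data recorded with the system are also shown to be suitable for real-world humanoid policy learning.

Insight: The contribution is to exploit the complementarity of IMUs and egocentric SLAM: IMUs provide robustness to occlusion and high-speed motion, while egocentric SLAM anchors long-horizon motion and stabilizes upper-body pose. This hybrid design improves the global consistency and accuracy of pose estimation while remaining portable.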

Abstract: Scaling up robot learning will likely require human data containing rich and long-horizon interactions in the wild. Existing approaches for collecting such data trade off portability, robustness to occlusion, and global consistency. We introduce RoSHI, a hybrid wearable that fuses low-cost sparse IMUs with the Project Aria glasses to estimate the full 3D pose and body shape of the wearer in a metric global coordinate frame from egocentric perception. This system is motivated by the complementarity of the two sensors: IMUs provide robustness to occlusions and high-speed motions, while egocentric SLAM anchors long-horizon motion and stabilizes upper body pose. We collect a dataset of agile activities to evaluate RoSHI. On this dataset, we generally outperform other egocentric baselines and perform comparably to a state-of-the-art exocentric baseline (SAM3D). Finally, we demonstrate that the motion data recorded from our system are suitable for real-world humanoid policy learning. For videos, data and more, visit the project webpage: https://roshi-mocap.github.io/

cs.CR [Back]

[117] SE-Enhanced ViT and BiLSTM-Based Intrusion Detection for Secure IIoT and IoMT Environments cs.CR | cs.AI | cs.CVPDF

Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari, Seref Sagiroglu, Onur Ceran

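TL;DR: This study proposes an intrusion detection framework based on a hybrid Squeeze-and-Excitation attention Vision Transformer and Bidirectional LSTM (SE ViT-BiLSTM) for protecting Industrial IoT (IIoT) and Internet of Medical Things (IoMT) environments. The model replaces the Vision Transformer's conventional multi-head attention with SE attention and integrates BiLSTM layers to improve detection accuracy and computational efficiency. It is evaluated on two real-world benchmark datasets, EdgeIIoT and CICIoMT2024, before and after data balancing with SMOTE and RandomOverSampler.

Details

Motivation: With the rapid growth of interconnected devices in industrial and medical IoT, timely and accurate detection of cyber threats has become a critical challenge.

Result: Experiments show that SE ViT-BiLSTM outperforms existing methods on multiple metrics. Before balancing, the model reaches 99.11% accuracy on EdgeIIoT and 96.10% on CICIoMT2024. After balancing, accuracy improves to 99.33% on EdgeIIoT and 98.16% on CICIoMT2024, with low false-positive rates and latency, which the summary describes as state of the art.

Insight: The contribution claimed in the abstract is to introduce SE attention into the Vision Transformer in place of conventional multi-head attention and to combine it with BiLSTM layers, strengthening the modeling of sequential and spatial features. Viewed objectively, the study combines an attention mechanism from computer vision with sequence modeling for intrusion detection, and uses data balancing to further improve performance on real, imbalanced datasets.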

Abstract: With the rapid growth of interconnected devices in Industrial and Medical Internet of Things (IIoT and IoMT) ecosystems, ensuring timely and accurate detection of cyber threats has become a critical challenge. This study presents an advanced intrusion detection framework based on a hybrid Squeeze-and-Excitation Attention Vision Transformer-Bidirectional Long Short-Term Memory (SE ViT-BiLSTM) architecture. In this design, the traditional multi-head attention mechanism of the Vision Transformer is replaced with Squeeze-and-Excitation attention and integrated with BiLSTM layers to enhance detection accuracy and computational efficiency. The proposed model was trained and evaluated on two real-world benchmark datasets, EdgeIIoT and CICIoMT2024, both before and after data balancing using the Synthetic Minority Over-sampling Technique (SMOTE) and RandomOverSampler. Experimental results demonstrate that the SE ViT-BiLSTM model outperforms existing approaches across multiple metrics. Before balancing, the model achieved accuracies of 99.11% (FPR: 0.0013%, latency: 0.00032 sec/inst) on EdgeIIoT and 96.10% (FPR: 0.0036%, latency: 0.00053 sec/inst) on CICIoMT2024. After balancing, performance further improved, reaching 99.33% accuracy with 0.00035 sec/inst latency on EdgeIIoT and 98.16% accuracy with 0.00014 sec/inst latency on CICIoMT2024.
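For readers unfamiliar with Squeeze-and-Excitation gating, the following is a minimal sketch of a standard SE gate applied to a sequence of token features. How the paper wires SE attention into the ViT in place of multi-head attention is not stated in the abstract, so the token layout, the bias-free layers, and the name `se_gate` are assumptions for illustration.

```python
import math


def se_gate(x, w1, w2):
    """Apply a Squeeze-and-Excitation gate to token features.

    x:  list of N tokens, each a list of C channel values.
    w1: C x H weight matrix (bottleneck reduce), biases omitted.
    w2: H x C weight matrix (expand back to C channels), biases omitted.
    Returns the rescaled tokens and the per-channel gate values.
    """
    n, c = len(x), len(x[0])
    hidden = len(w1[0])
    # Squeeze: average each channel over all tokens.
    z = [sum(tok[j] for tok in x) / n for j in range(c)]
    # Excitation: bottleneck layer with ReLU, then sigmoid gate per channel.
    h = [max(0.0, sum(z[j] * w1[j][i] for j in range(c))) for i in range(hidden)]
    s = [
        1.0 / (1.0 + math.exp(-sum(h[i] * w2[i][j] for i in range(hidden))))
        for j in range(c)
    ]
    # Scale: reweight every token channel-wise by its gate.
    return [[tok[j] * s[j] for j in range(c)] for tok in x], s
```

Unlike multi-head attention, this gate computes one weight per channel from globally pooled statistics rather than pairwise token interactions, so its cost grows linearly with the number of tokens.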

[118] Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization cs.CR | cs.AI | cs.CVPDF

Igor Maljkovic, Maria Rosaria Briglia, Iacopo Masi, Antonio Emanuele Cinà, Fabio Roli

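TL;DR: This paper proposes a framework that uses hyperbolic geometry to detect and sanitize harmful prompts, protecting vision-language models from malicious prompt attacks. It has two complementary components: Hyperbolic Prompt Espial (HyPE) for lightweight anomaly detection and Hyperbolic Prompt Sanitization (HyPS) for explainable sanitization.

Details

Motivation: The flexibility of the shared embedding space makes vision-language models vulnerable to malicious prompts that elicit unsafe content. Existing defenses, such as blacklist filters or heavy classifiers, are costly, fragile, and easily evaded by embedding-level attacks.

Result: Extensive experiments across multiple datasets and adversarial scenarios show that the framework consistently outperforms prior defenses in detection accuracy and robustness.

Insight: The contribution is to use the structured geometry of hyperbolic space to model benign prompts and detect harmful ones as outliers, and to combine this with explainable attribution for selective modification that neutralizes unsafe intent while preserving the original semantics, yielding an efficient, interpretable, and robust defense.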

Abstract: Vision-Language Models (VLMs) have become essential for tasks such as image synthesis, captioning, and retrieval by aligning textual and visual information in a shared embedding space. Yet, this flexibility also makes them vulnerable to malicious prompts designed to produce unsafe content, raising critical safety concerns. Existing defenses either rely on blacklist filters, which are easily circumvented, or on heavy classifier-based systems, both of which are costly and fragile under embedding-level attacks. We address these challenges with two complementary components: Hyperbolic Prompt Espial (HyPE) and Hyperbolic Prompt Sanitization (HyPS). HyPE is a lightweight anomaly detector that leverages the structured geometry of hyperbolic space to model benign prompts and detect harmful ones as outliers. HyPS builds on this detection by applying explainable attribution methods to identify and selectively modify harmful words, neutralizing unsafe intent while preserving the original semantics of user prompts. Through extensive experiments across multiple datasets and adversarial scenarios, we prove that our framework consistently outperforms prior defenses in both detection accuracy and robustness. Together, HyPE and HyPS offer an efficient, interpretable, and resilient approach to safeguarding VLMs against malicious prompt misuse.

[119] Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerability Detection cs.CR | cs.CL | cs.SEPDF

Zi Liang, Qipeng Xie, Jun He, Bohuan Xue, Weizheng Wang

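TL;DR: This paper proposes Argus, described as the first multi-agent framework designed specifically for vulnerability detection. It reorchestrates the Static Application Security Testing (SAST) workflow from LLM-assisted to LLM-centered, combining supply chain analysis, multi-agent collaboration, and RAG and ReAct techniques to reduce false positives and hallucinations and improve detection.

Details

Motivation: Existing LLM-based SAST approaches try to replace human experts directly without integrating effectively with existing tools, which leads to high false-positive rates, hallucinations, limited reasoning depth, and heavy token usage, making industrial deployment impractical. Argus aims to solve these problems by restructuring the workflow.

Result: Extensive empirical evaluation shows that Argus detects more true vulnerabilities while significantly reducing false positives and operational costs, outperforming existing methods, and that it has found several critical zero-day vulnerabilities that received CVE assignments.

Insight: Contributions include comprehensive supply chain analysis, a collaborative multi-agent workflow, and the integration of state-of-the-art techniques such as RAG and ReAct to minimize hallucinations and strengthen reasoning. Viewed objectively, the framework uses multi-agent collaboration and retrieval augmentation to integrate vulnerability detection more effectively.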

Abstract: Recent advancements in Large Language Models (LLMs) have sparked interest in their application to Static Application Security Testing (SAST), primarily due to their superior contextual reasoning capabilities compared to traditional symbolic or rule-based methods. However, existing LLM-based approaches typically attempt to replace human experts directly without integrating effectively with existing SAST tools. This lack of integration results in ineffectiveness, including high rates of false positives, hallucinations, limited reasoning depth, and excessive token usage, making them impractical for industrial deployment. To overcome these limitations, we present a paradigm shift that reorchestrates the SAST workflow from current LLM-assisted structure to a new LLM-centered workflow. We introduce Argus (Agentic and Retrieval-Augmented Guarding System), the first multi-agent framework designed specifically for vulnerability detection. Argus incorporates three key novelties: comprehensive supply chain analysis, collaborative multi-agent workflows, and the integration of state-of-the-art techniques such as Retrieval-Augmented Generation (RAG) and ReAct to minimize hallucinations and enhance reasoning. Extensive empirical evaluation demonstrates that Argus significantly outperforms existing methods by detecting a higher volume of true vulnerabilities while simultaneously reducing false positives and operational costs. Notably, Argus has identified several critical zero-day vulnerabilities with CVE assignments.

[120] TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories cs.CR | cs.AI | cs.CL | cs.LG | cs.SEPDF

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen

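TL;DR: This paper introduces TraceSafe-Bench, described as the first benchmark designed specifically to evaluate safety guardrails on multi-step tool-calling trajectories. It covers 12 risk categories and more than 1,000 unique execution instances. An evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails finds that guardrail efficacy is driven mainly by structural data competence, that model architecture matters more than model size, and that detection accuracy stays stable over long trajectories and even improves in later stages.

Details

Motivation: As large language models evolve from static chatbots into autonomous agents, the main vulnerability surface shifts from final outputs to intermediate execution traces. Guardrails are currently benchmarked mostly on natural-language responses, and their effectiveness on multi-step tool-use trajectories remains largely unexplored.

Result: On TraceSafe-Bench, guardrail efficacy correlates strongly with performance on structured-to-text benchmarks (ρ = 0.79) but shows near-zero correlation with standard jailbreak robustness. General-purpose LLMs consistently outperform specialized safety guardrails in trajectory analysis, and detection accuracy remains stable over long trajectories, improving in later stages.

Insight: The contribution is the first systematic evaluation of safety guardrails on multi-step tool-calling trajectories, together with the key finding that structural data competence (such as JSON parsing) affects guardrail efficacy more than semantic safety alignment. Viewed objectively, the study shows that securing agentic workflows requires jointly optimizing structural reasoning and safety alignment to mitigate mid-trajectory risks.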

Abstract: As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($\rho=0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
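The abstract reports a correlation of ρ = 0.79 between guardrail performance and structured-to-text benchmark scores without naming the coefficient. Assuming ρ denotes Spearman's rank correlation, a common reading of the symbol, the following minimal sketch shows how such a value is computed from two lists of per-model scores. The function names are hypothetical, and no scores from the paper are reproduced here.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because the coefficient depends only on ranks, it captures whether models that rank higher on one benchmark also rank higher on the other, regardless of the scales of the two scores.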

physics.optics [Back]

[121] Enhanced Self-Supervised Multi-Image Super-Resolution for Camera Array Images physics.optics | cs.CVPDF

Yating Chen, Feng Huang, Xianyu Wu, Jing Wu, Ying Shen

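TL;DR: This paper proposes an enhanced self-supervised multi-image super-resolution method for camera-array images. It builds a new framework that combines the strengths of multi-image-to-single-image (Multi-to-Single) and multi-image-to-multi-image (Multi-to-Multi) self-supervised learning, and introduces a dual Transformer network suited to self-supervised learning that recovers high-frequency details from aliasing artifacts, producing visually appealing, high-fidelity, texture-rich images.

Details

Motivation: Conventional multi-image super-resolution relies on sequential frames from a single camera and is vulnerable to complex degradation and severe occlusion, which makes accurate restoration difficult. Multi-aperture camera-array imaging captures spatially distributed views whose sampling offsets form a stable disk-like distribution, increasing the non-redundancy of the observations, but existing methods do not fully exploit these properties. Supervised methods tend to overfit the degradation patterns in the training data, and current self-supervised techniques struggle to recover fine-grained details.

Result: Experiments on synthetic and real-world datasets demonstrate the superiority of the proposed method.

Insight: The contribution is the Multi-to-Single-Guided Multi-to-Multi self-supervised learning framework, which combines the advantages of both approaches and offers a new paradigm for integrating deep neural networks with classical physics-based variational methods. The paper also designs a dual Transformer network suited to self-supervised learning to strengthen the recovery of high-frequency details from aliasing artifacts.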

Abstract: Conventional multi-image super-resolution (MISR) methods, such as burst and video SR, rely on sequential frames from a single camera. Consequently, they suffer from complex image degradation and severe occlusion, increasing the difficulty of accurate image restoration. In contrast, multi-aperture camera-array imaging captures spatially distributed views with sampling offsets forming a stable disk-like distribution, which enhances the non-redundancy of observed data. Existing MISR algorithms fail to fully exploit these unique properties. Supervised MISR methods tend to overfit the degradation patterns in training data, and current self-supervised learning (SSL) techniques struggle to recover fine-grained details. To address these issues, this paper thoroughly investigates the strengths, limitations, and applicability boundaries of multi-image-to-single-image (Multi-to-Single) and multi-image-to-multi-image (Multi-to-Multi) SSL methods. We propose the Multi-to-Single-Guided Multi-to-Multi SSL framework, which combines the advantages of Multi-to-Single and Multi-to-Multi to generate visually appealing and high-fidelity images rich in texture details. The framework provides a new paradigm for integrating deep neural networks with classical physics-based variational methods. To enhance the ability of the MISR network to recover high-frequency details from aliasing artifacts, this paper proposes a novel camera-array SR network, called dual Transformer, that is suitable for SSL. Experiments on synthetic and real-world datasets demonstrate the superiority of the proposed method.

cs.AI [Back]

[122] ProofSketcher: Hybrid LLM + Lightweight Proof Checker for Reliable Math/Logic Reasoning cs.AI | cs.CE | cs.CV | cs.LGPDF

Kranthi Kommuru, Kunal Khanvilkar, Gaurav Parekh

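TL;DR: ProofSketcher is a hybrid system that combines a large language model (LLM) with a lightweight proof checker to make mathematical and logical reasoning more reliable. The LLM generates a typed proof sketch in a compact domain-specific language (DSL), and a lightweight trusted kernel expands it into explicit proof obligations so that subtle reasoning errors can be detected and corrected.

Details

Motivation: LLMs can produce plausible-looking mathematical and logical arguments that contain subtle errors, such as omitted side conditions, invalid inference patterns, or appeals to lemmas that cannot be derived. The paper aims to address this while avoiding the heavy burden of traditional interactive theorem provers such as Lean and Coq, which require fully formalized proofs and large amounts of low-level detail.

Result: The abstract reports no quantitative experiments or benchmarks. It implies that the hybrid pipeline improves reliability by combining the generative ability of the LLM with the rigor of a lightweight checker.

Insight: The contribution is a hybrid approach in which an LLM generates a compact proof sketch and a lightweight trusted kernel expands and checks it, balancing flexibility with reliability. The idea could carry over to other AI-assisted reasoning tasks that demand high reliability.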

Abstract: Large language models (LLMs) can produce persuasive arguments in mathematics and logic, but such arguments often contain subtle missteps: omitted side conditions, invalid inference patterns, or appeals to a lemma that cannot be derived from the context at hand. These errors are notoriously hard to spot from the text alone, because even a flawed argument can look mostly correct. Conversely, interactive theorem provers such as Lean and Coq achieve rigorous reliability by accepting only statements that pass every syntactic and semantic check performed by a small trusted kernel. Although this provides strong guarantees, it comes at a heavy price: the proof must be completely formalized, and the user or an auxiliary search program must supply an avalanche of low-level detail. This paper presents a hybrid pipeline in which an LLM generates a typed proof sketch in a compact DSL and a lightweight trusted kernel expands the sketch into explicit proof obligations.

[123] Steering the Verifiability of Multimodal AI Hallucinations cs.AI | cs.CL | cs.CV | cs.LGPDF

Jianhong Pang, Ruoxi Cheng, Ziyi Ye, Xingjun Ma, Zuxuan Wu

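TL;DR: This paper addresses hallucinations in multimodal large language models (MLLMs) with a fine-grained control method based on verifiability. It first collects and analyzes human responses to AI hallucinations and divides hallucinations into obvious and elusive types. It then proposes an activation-space intervention that learns separate probes for the two types, enabling fine-grained control over the verifiability of model output. Experiments demonstrate the method's effectiveness and show that mixing interventions adapts flexibly to the security and usability needs of different scenarios.

Details

Motivation: MLLM-driven applications are prone to hallucinations that pose risks to human users. These hallucinations differ widely in verifiability: some are easily caught by users (obvious hallucinations), while others are often missed or take more effort to verify (elusive hallucinations). Little research has examined how to control this property for applications with different security and usability demands.

Result: Empirical results show that the proposed activation-space intervention is effective. Targeted interventions, with separate probes for obvious and elusive hallucinations, are better at regulating the corresponding verifiability. Simply mixing these interventions gives flexible control over verifiability for different scenarios.

Insight: The main contributions are: 1) a fine-grained categorization of multimodal hallucinations along the new dimension of verifiability (obvious vs. elusive) rather than mere presence or absence; and 2) an intervention learned in the model's activation space that trains separate probes to control the verifiability of each type, giving fine-grained, flexible control over model output. This suggests a way to tailor AI behavior to the application scenario, for example high-security versus high-creativity settings.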

Abstract: AI applications driven by multimodal large language models (MLLMs) are prone to hallucinations and pose considerable risks to human users. Crucially, such hallucinations are not equally problematic: some hallucinated content can be detected by human users (i.e., obvious hallucinations), while other content is often missed or requires more verification effort (i.e., elusive hallucinations). This indicates that multimodal AI hallucinations vary significantly in their verifiability. Yet, little research has explored how to control this property for AI applications with diverse security and usability demands. To address this gap, we construct a dataset from 4,470 human responses to AI-generated hallucinations and categorize these hallucinations into obvious and elusive types based on their verifiability by human users. Further, we propose an activation-space intervention method that learns separate probes for obvious and elusive hallucinations. We reveal that obvious and elusive hallucinations elicit different intervention probes, allowing for fine-grained control over the model's verifiability. Empirical results demonstrate the efficacy of this approach and show that targeted interventions yield superior performance in regulating corresponding verifiability. Moreover, simply mixing these interventions enables flexible control over the verifiability required for different scenarios.
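The abstract does not specify how the probes are learned or applied. As a hedged illustration of activation-space steering in general, the sketch below swaps in a difference-of-means direction per hallucination type and adds a weighted mix of the two directions to a hidden state. The difference-of-means choice, the additive update, and both function names are assumptions for this example rather than the paper's procedure.

```python
import math


def mean_difference_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` activations to the
    mean of `pos` activations (a simple stand-in for a learned probe)."""
    dim = len(pos[0])
    mean_pos = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to steer along")
    return [d / norm for d in diff]


def steer(hidden, directions, strengths):
    """Add each direction to the hidden state, scaled by its strength.

    Passing two directions (for example one per hallucination type)
    with different strengths mixes the interventions.
    """
    out = list(hidden)
    for direction, alpha in zip(directions, strengths):
        out = [h + alpha * d for h, d in zip(out, direction)]
    return out
```

With one direction per hallucination type, the two strengths act as separate dials, which is one simple way to realize the kind of mixed intervention the abstract describes.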

[124] Weakly Supervised Distillation of Hallucination Signals into Transformer Representations cs.AI | cs.CL | cs.LGPDF

Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit

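TL;DR: This paper proposes a weakly supervised distillation framework that distills hallucination signals into a Transformer model's internal representations, so that hallucinations can be detected at inference time from the model's own activations alone, without relying on external verification.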

Details

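Motivation: Existing hallucination detection methods for large language models rely on external verification at inference time, such as gold answers, retrieval systems, or auxiliary judge models, which adds computational overhead and complexity. This paper aims to distill these external supervision signals into the model's internal representations during training, so that hallucinations can be detected from internal activations alone.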

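Result: On a 15000-sample dataset built from SQuAD v2, Transformer-based probes perform best: M2 on 5-fold average AUC/F1, and M3 on single-fold validation and held-out test evaluation. Inference is efficient: probe latency ranges from 0.15 to 6.66 ms, and end-to-end generation plus probe throughput is approximately 0.231 queries per second, indicating negligible overhead.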

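Insight: Contributions include: 1) a weak supervision framework that combines three complementary signals (substring matching, sentence embedding similarity, and LLM judgment) to label hallucinations automatically; 2) distilling hallucination detection signals into Transformer representations, enabling internal detection without external verification at inference time; 3) several probing classifier designs, which confirm that Transformer representations encode hallucination signals.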

Abstract: Existing hallucination detection methods for large language models (LLMs) rely on external verification at inference time, requiring gold answers, retrieval systems, or auxiliary judge models. We ask whether this external supervision can instead be distilled into the model’s own representations during training, enabling hallucination detection from internal activations alone at inference time. We introduce a weak supervision framework that combines three complementary grounding signals (substring matching, sentence embedding similarity, and an LLM-as-a-judge verdict) to label generated responses as grounded or hallucinated without human annotation. Using this framework, we construct a 15000-sample dataset from SQuAD v2 (10500 train/development samples and a separate 5000-sample test set), where each example pairs a LLaMA-2-7B-generated answer with its full per-layer hidden states and structured hallucination labels. We then train five probing classifiers, ProbeMLP (M0), LayerWiseMLP (M1), CrossLayerTransformer (M2), HierarchicalTransformer (M3), and CrossLayerAttentionTransformerV2 (M4), directly on these hidden states, treating external grounding signals as training-time supervision only. Our central hypothesis is that hallucination detection signals can be distilled into transformer representations, enabling internal detection without any external verification at inference time. Results support this hypothesis. Transformer-based probes achieve the strongest discrimination, with M2 performing best on 5-fold average AUC/F1, and M3 performing best on both single-fold validation and held-out test evaluation. We also benchmark inference efficiency: probe latency ranges from 0.15 to 5.62 ms (batched) and 1.55 to 6.66 ms (single sample), while end-to-end generation plus probe throughput remains approximately 0.231 queries per second, indicating negligible practical overhead.
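
The abstract names three grounding signals but not how they are combined into a label. The sketch below assumes a simple two-of-three majority vote and a cosine-similarity threshold of 0.8; both choices, and every function and parameter name, are illustrative assumptions rather than the authors' procedure. Embeddings and the judge verdict are passed in as plain values so the example stays self-contained.

```python
import math
from typing import Sequence


def substring_match(answer: str, gold: str) -> bool:
    """Signal 1: the gold answer appears (case-insensitively) inside the answer."""
    gold_norm = gold.strip().lower()
    if not gold_norm:  # an empty gold answer gives no evidence of grounding
        return False
    return gold_norm in answer.strip().lower()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weak_label(
    answer: str,
    gold: str,
    answer_emb: Sequence[float],
    gold_emb: Sequence[float],
    judge_says_grounded: bool,
    sim_threshold: float = 0.8,  # assumed threshold, not taken from the paper
) -> str:
    """Combine the three signals by majority vote (assumed combination rule)."""
    votes = [
        substring_match(answer, gold),
        cosine_similarity(answer_emb, gold_emb) >= sim_threshold,
        judge_says_grounded,
    ]
    return "grounded" if sum(votes) >= 2 else "hallucinated"


# Toy usage with made-up embeddings and judge verdicts.
label_a = weak_label("It was built in Paris.", "paris", [1.0, 0.0], [1.0, 0.0], False)
label_b = weak_label("London", "Paris", [1.0, 0.0], [0.0, 1.0], True)
```

Labels produced this way would serve only as training-time supervision for the probes; at inference time none of the three signals is needed.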

[125] How Much LLM Does a Self-Revising Agent Actually Need? cs.AI | cs.CLPDF

Seongwoo Jeong, Seonil Son

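TL;DR: This paper proposes a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. It evaluates four progressively structured agents on noisy Collaborative Battleship, empirically decomposing and quantifying the marginal contribution of the LLM in a self-revising agent.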

Details

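Motivation: Current LLM-based agents place world modeling, planning, and reflection inside a single language model loop, which makes it hard to separate scientifically how much of the agent's competence comes from the LLM itself and how much from the explicit structure around it.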

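Result: Experiments on noisy Collaborative Battleship (54 games) show that explicit world-model planning improves substantially over a greedy posterior-following baseline (+24.1 percentage points win rate, +0.017 F1); symbolic reflection operates as a real runtime mechanism; and adding conditional LLM revision at only about 4.3% of turns yields just a small, non-monotonic change (average F1 rises slightly by +0.005, while wins drop from 31 to 29 out of 54).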

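Insight: The contribution is that externalizing reflection turns otherwise latent agent behavior into inspectable runtime structure, so the marginal role of LLM intervention can be studied directly. Methodologically, the paper offers an empirical framework for decomposing agent components and quantifying the LLM's contribution, rather than pursuing a leaderboard result.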

Abstract: Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop. This can produce capable behavior, but it makes a basic scientific question difficult to answer: which part of the agent’s competence actually comes from the LLM, and which part comes from explicit structure around it? We study this question not by claiming a general answer, but by making it empirically tractable. We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. We instantiate this protocol in a declarative runtime and evaluate it on noisy Collaborative Battleship [4] using four progressively structured agents over 54 games (18 boards $\times$ 3 seeds). The resulting decomposition isolates four components: posterior belief tracking, explicit world-model planning, symbolic in-episode reflection, and sparse LLM-based revision. Across this decomposition, explicit world-model planning improves substantially over a greedy posterior-following baseline (+24.1pp win rate, +0.017 F1). Symbolic reflection operates as a real runtime mechanism – with prediction tracking, confidence gating, and guarded revision actions – even though its current revision presets are not yet net-positive in aggregate. Adding conditional LLM revision at about 4.3% of turns yields only a small and non-monotonic change: average F1 rises slightly (+0.005) while win rate drops (31$\rightarrow$29 out of 54). These results suggest a methodological contribution rather than a leaderboard claim: externalizing reflection turns otherwise latent agent behavior into inspectable runtime structure, allowing the marginal role of LLM intervention to be studied directly.
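
To put the reported win counts on the same scale as the percentage-point figures, the short calculation below converts 31 and 29 wins out of 54 games into win rates. It uses only numbers stated in the abstract (18 boards, 3 seeds, 31 and 29 wins); the variable and function names are illustrative.

```python
BOARDS = 18
SEEDS = 3
GAMES = BOARDS * SEEDS  # 54 games, as stated in the abstract


def win_rate_pct(wins: int, games: int = GAMES) -> float:
    """Win rate as a percentage of games played."""
    return 100.0 * wins / games


before = win_rate_pct(31)   # before adding conditional LLM revision
after = win_rate_pct(29)    # after adding conditional LLM revision
delta_pp = after - before   # change in percentage points
pp_per_game = 100.0 / GAMES # how much one game moves the win rate

print(f"{before:.1f}% -> {after:.1f}% ({delta_pp:+.1f} pp)")
# prints "57.4% -> 53.7% (-3.7 pp)"
```

With only 54 games, each game moves the win rate by about 1.85 percentage points, so the two-game drop is small compared with the reported +24.1pp gain from explicit world-model planning.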

Table of Contents

cs.CL [Back]

[1] The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs? cs.CL | cs.AI | cs.IT | cs.LGPDF

[2] SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams cs.CL | cs.AI | cs.HCPDF

[3] Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation cs.CL | cs.AIPDF

[4] STDec: Spatio-Temporal Stability Guided Decoding for dLLMs cs.CLPDF

[5] The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models cs.CL | cs.LGPDF

[6] Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning cs.CLPDF

[7] State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation cs.CLPDF

[8] When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don’t cs.CL | cs.AI | cs.CVPDF

[9] Multi-objective Evolutionary Merging Enables Efficient Reasoning Models cs.CL | cs.AIPDF

[10] DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling cs.CLPDF

[11] ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs cs.CLPDF

[12] MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts cs.CL | cs.AIPDF

[13] The Detection–Extraction Gap: Models Know the Answer Before They Can Say It cs.CL | cs.AI | cs.IT | cs.LGPDF

[14] DiffuMask: Diffusion Language Model for Token-level Prompt Pruning cs.CLPDF

[15] Feedback Adaptation for Retrieval-Augmented Generation cs.CLPDF

[16] A Parameter-Efficient Transfer Learning Approach through Multitask Prompt Distillation and Decomposition for Clinical NLP cs.CL | cs.AIPDF

[17] ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding cs.CL | cs.AIPDF

[18] Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs cs.CL | cs.LGPDF

[19] Luwen Technical Report cs.CL | cs.AIPDF

[20] Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents cs.CLPDF

[21] How Long Reasoning Chains Influence LLMs’ Judgment of Answer Factuality cs.CLPDF

[22] Multilingual Cognitive Impairment Detection in the Era of Foundation Models cs.CLPDF

[23] TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks cs.CL | cs.AIPDF

[24] When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning cs.CLPDF

[25] GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering cs.CLPDF

[26] Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions cs.CL | cs.CYPDF

[27] Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning cs.CLPDF

[28] WRAP++: Web discoveRy Amplified Pretraining cs.CL | cs.AIPDF

[29] Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM cs.CLPDF

[30] On the Step Length Confounding in LLM Reasoning Data Selection cs.CL | cs.AIPDF

[31] iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations cs.CLPDF

[32] DTCRS: Dynamic Tree Construction for Recursive Summarization cs.CLPDF

[33] Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models cs.CLPDF

[34] STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems cs.CL | cs.AIPDF

[35] Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering cs.CLPDF

[36] Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent cs.CLPDF

[37] A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering cs.CL | cs.AI | cs.LGPDF

[38] OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CLPDF

cs.CV [Back]

[39] CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale cs.CVPDF

[40] DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs cs.CV | cs.AIPDF

[41] Evolution of Video Generative Foundations cs.CVPDF

[42] Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents cs.CVPDF

[43] DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images cs.CV | cs.AI | cs.MM | eess.IVPDF

[44] MTA-Agent: An Open Recipe for Multimodal Deep Search Agents cs.CVPDF

[45] Visual prompting reimagined: The power of the Activation Prompts cs.CV | cs.LGPDF

[46] PhysHead: Simulation-Ready Gaussian Head Avatars cs.CVPDF

[47] LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation cs.CV | eess.IVPDF

[48] VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography cs.CVPDF

[49] Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels cs.CV | cs.LGPDF

[50] WeatherRemover: All-in-one Adverse Weather Removal with Multi-scale Feature Map Compression cs.CVPDF

[51] Controllable Generative Video Compression cs.CVPDF

[52] GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation cs.CVPDF

[53] VDPP: Video Depth Post-Processing for Speed and Scalability cs.CVPDF

[54] RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection cs.CVPDF

[55] Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation cs.CV | cs.CLPDF

[56] Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning cs.CVPDF

[57] URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection cs.CV | cs.AI | cs.MMPDF

[58] DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting cs.CVPDF

[59] LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video cs.CVPDF

[60] How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study cs.CVPDF

[61] FlowInOne:Unifying Multimodal Generation as Image-in, Image-out Flow Matching cs.CVPDF

[62] FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts cs.CV | cs.AIPDF

[63] Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CVPDF

[64] Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer cs.CVPDF

[65] Video-guided Machine Translation with Global Video Context cs.CV | cs.CLPDF

[66] Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning cs.CVPDF

[67] Vision-Language Model-Guided Deep Unrolling Enables Personalized, Fast MRI cs.CVPDF

[68] Physical Adversarial Attacks on AI Surveillance Systems:Detection, Tracking, and Visible–Infrared Evasion cs.CV | cs.AIPDF

[69] RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details cs.CVPDF

[70] Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer cs.CVPDF

[71] Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models cs.CV | cs.LGPDF

[72] Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models cs.CV | cs.AIPDF

[73] Multi-modal user interface control detection using cross-attention cs.CV | cs.AIPDF

[74] POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP cs.CVPDF

[75] Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis cs.CVPDF

[76] NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results cs.CVPDF

[77] Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation cs.CVPDF