Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 80]
- cs.HC [Total: 1]
- cs.AI [Total: 6]
- cs.RO [Total: 6]
- cs.CE [Total: 1]
- cs.LG [Total: 4]
- cs.SD [Total: 1]
cs.CL
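[1] Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm cs.CL | cs.AI
Anna Babarczy, Andras Lukacs, Peter Vedres, Zeteny Bujka
TL;DR: This study examines whether current large language models have Theory of Mind abilities, that is, the ability to infer others' beliefs, intentions, and emotions from text. Five LLMs were tested with a text-based instrument adapted from human Theory of Mind research and compared with human performance. GPT-4o showed high accuracy and robustness, on par with humans, whereas earlier and smaller models were strongly affected by the number of relevant cues and by irrelevant information.
Details
Motivation: Because LLMs are trained only on language data, without social embodiment or direct access to mental representations, their apparent social-cognitive reasoning raises questions about the nature of their understanding: do they possess robust mental-state attribution, or do they merely reflect superficial pattern completion?
Result: On the text-based Theory of Mind test, GPT-4o showed high accuracy and strong robustness, performing comparably to humans in the most challenging conditions; earlier and smaller models were clearly affected by the number of relevant inferential cues and by irrelevant information, leaving a performance gap.
Insight: The novelty lies in systematically applying a classic paradigm from human Theory of Mind research (Strange Stories) to LLM evaluation, revealing differences in model performance and sensitivity to cues and distractors. Viewed objectively, this provides empirical evidence for distinguishing "genuine understanding" from "statistical approximation" in LLMs and underscores how much evaluation-framework design matters for revealing the limits of model cognition.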
Abstract: The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities – specifically, the ability to infer others’ beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.
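[2] Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction cs.CL | cs.LG
Hui Wen Goh, Jonas Mueller
TL;DR: This paper presents CONSTRUCT, a method for scoring the trustworthiness of LLM structured outputs in real time. It scores both the whole output and each individual field, helping users quickly locate errors. The method works with any LLM (including black-box APIs), requires no labeled data or custom model deployment, and handles complex nested JSON schemas.
Details
Motivation: Structured outputs from current LLMs contain sporadic errors that keep enterprise AI applications from reaching their potential; a method is needed to assess output trustworthiness in real time without exhaustive human review.
Result: On the authors' four-dataset benchmark, described as one of the first public LLM structured-output benchmarks with reliable ground truth, CONSTRUCT detects errors in outputs from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.
Insight: The novelty is a general, lightweight, training-free trustworthiness scoring framework that evaluates each field of a structured output at fine granularity, together with one of the first high-quality public benchmarks for evaluating this task.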
Abstract: Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI efforts from realizing their immense potential. We present CONSTRUCT, a method to score the trustworthiness of LLM Structured Outputs in real-time, such that lower-scoring outputs are more likely to contain errors. This reveals the best places to focus limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within a LLM Structured Output, helping reviewers quickly identify which parts of the output are wrong. Our method is suitable for any LLM (including black-box LLM APIs without logprobs such as reasoning models and Anthropic models), does not require labeled training data nor custom model deployment, and works for complex Structured Outputs with many fields of diverse types (including nested JSON schemas). We additionally present one of the first public LLM Structured Output benchmarks with reliable ground-truth values that are not full of mistakes. Over this four-dataset benchmark, CONSTRUCT detects errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.
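[3] GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation cs.CL
Ja Young Lee, Mírian Silva, Mohamed Nasr, Shonda Witherspoon, Enzo Bozzani
TL;DR: This paper presents GRAFITE, a platform for continuous evaluation of large language models (LLMs) that addresses inflated performance caused by training-data contamination by maintaining and evaluating a repository of model issues. The repository is built from user feedback, quality assurance tests are run with an LLM as the judge, and the platform supports side-by-side comparison of multiple models and regression detection across releases.
Details
Motivation: Over-exposure of benchmark data during training (data contamination) risks inflating LLM performance estimates, so a system is needed that continuously tracks and evaluates model issues.
Result: The platform is open source on GitHub and comes with a demo video. It supports side-by-side comparison and regression detection across multiple LLMs, but the abstract reports no quantitative benchmark results or state-of-the-art comparisons.
Insight: The novelty is a dynamic issue repository built from user feedback, combined with an LLM-as-a-judge QA testing pipeline for continuous evaluation, giving a systematic framework for long-term LLM performance monitoring and regression analysis.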
Abstract: Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.
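[4] How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence cs.CL | cs.CY
Alex Anvi Eponon, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov
TL;DR: This paper examines how psychological learning paradigms shaped and constrained the development of artificial intelligence. It argues that reinforcement learning, deep learning, and current integrative approaches inherited the strengths and structural limitations of behaviorism, cognitivism, and constructivism, respectively, and proposes ReSynth, a trimodular framework that separates reasoning, purpose, and knowledge, aiming at a representational architecture in which systematic behavior is a necessary rather than accidental property.
Details
Motivation: The paper aims to show how mainstream AI paradigms (reinforcement learning, deep learning, and others) were influenced by psychological theories (behaviorism, cognitivism, constructivism) and inherited their structural limitations, leaving them unable to fully address adaptability, the central challenge of artificial general intelligence.
Result: The abstract gives no quantitative experiments or benchmarks; it proposes the ReSynth theoretical framework as a solution, intended to make systematic behavior a necessary consequence of architectural separation.
Insight: Contributions include: 1) a systematic analysis of what AI inherited from psychological paradigms and the resulting limitations; 2) the Eastern conception of rote memorization as a structured, multi-phase precursor to understanding, offered as a bridge between psychology and AI; 3) the ReSynth trimodular framework, which separates reasoning, purpose, and knowledge into independent components to support adaptive representational architectures.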
Abstract: The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but also the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.
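[5] From Noise to Signal: When Outliers Seed New Topics cs.CL | cs.AI
Evangelia Zve, Gauvain Bourgne, Benjamin Icard, Jean-Gabriel Ganascia
TL;DR: This paper proposes a temporal taxonomy of news-document trajectories in dynamic topic modeling. It distinguishes "anticipatory outliers" that foreshadow emerging topics from documents that reinforce existing topics or remain isolated, and validates the framework with cumulative clustering on a news corpus about the hydrogen economy.
Details
Motivation: Conventional dynamic topic modeling treats outliers as noise, but the paper argues that some outliers are early signals of emerging topics and uses a temporal taxonomy to expose the dynamic role documents play in topic formation.
Result: On HydroNewsFr, a French news corpus, cumulative clustering was evaluated using document embeddings from 11 state-of-the-art language models. Inter-model agreement identified a small, high-consensus subset of anticipatory outliers, and qualitative case studies further supported the trajectory taxonomy.
Insight: The novelty is to recast outliers as seed signals of emerging topics and to propose a trajectory taxonomy that links weak-signal detection with dynamic topic modeling, offering a new view of how documents can anticipate topic evolution.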
Abstract: Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.
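[6] Synthetic Data Generation for Training Diversified Commonsense Reasoning Models cs.CL
Tianhui Zhang, Bei Peng, Danushka Bollegala
TL;DR: This paper proposes a two-stage method that creates CommonSyn, described as the first synthetic dataset for diversified generative commonsense reasoning (GCR), to address the shortage of training data. Fine-tuning on the synthetic data improves both generation diversity and quality across large language models (LLMs) of different sizes, compared with vanilla models and models fine-tuned on human-crafted datasets.
Details
Motivation: Training diversified commonsense reasoning models requires large, high-quality, diverse training datasets, but high annotation costs have kept existing GCR datasets small and narrow in scenario coverage, holding back progress.
Result: Models fine-tuned on CommonSyn improve both generation diversity and quality across LLMs of different sizes, compared with vanilla models and models fine-tuned on human-crafted datasets.
Insight: The novelty is a two-stage synthetic data generation method and the first synthetic dataset for diversified GCR, which eases the scarcity of high-quality diverse training data and suggests a route to more robust and diverse conversational agents.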
Abstract: Conversational agents are required to respond to their users not only with high-quality (i.e., commonsense-bearing) responses, but also by considering multiple plausible alternative scenarios, reflecting diversity in their responses. Despite the growing need to train diverse commonsense generators, the progress of this line of work has been significantly hindered by the lack of large-scale, high-quality, diverse commonsense training datasets. Due to the high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first-ever synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data jointly increase both generation diversity and quality compared with vanilla models and models fine-tuned on a human-crafted dataset, across Large Language Models (LLMs) of different sizes.
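[7] PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching cs.CL | cs.AI | cs.LG
Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang
TL;DR: PowerFlow is an unsupervised reinforcement learning framework grounded in distribution matching. Casting GFlowNet as an amortized variational sampler for unnormalized densities, it proposes a length-aware Trajectory-Balance objective that removes the structural length bias of autoregressive generation, and thereby unlocks the dual nature of LLMs: targeting α-power distributions either sharpens the distribution for logical reasoning (α > 1) or flattens it for expressive creativity (α < 1).
Details
Motivation: Current unsupervised reinforcement learning methods rely on heuristic intrinsic rewards, which lack a well-defined theoretical optimization target and are prone to degenerative biases, so a principled framework is needed to elicit the latent capabilities of LLMs more effectively.
Result: Extensive experiments show that PowerFlow consistently outperforms existing unsupervised RLIF methods, matching or even exceeding supervised GRPO. By mitigating over-sharpening in aligned models, it improves diversity and quality simultaneously on creative tasks, shifting the Pareto frontier.
Insight: The novelty is to reformulate unsupervised fine-tuning as a distribution matching problem, to use α-power distributions for directional elicitation of the dual nature of LLMs, and to neutralize the structural length bias of autoregressive generation with a length-aware Trajectory-Balance objective, yielding both theoretical and empirical improvements.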
Abstract: Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $α$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($α> 1$) to intensify logical reasoning, or flattening it ($α< 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
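The sharpening and flattening effect of an α-power distribution can be illustrated on a toy categorical distribution. The sketch below is not PowerFlow itself (which targets distributions over whole generated sequences and trains a sampler); it only assumes the standard definition p_i^α / Σ_j p_j^α, and the function names and example probabilities are hypothetical.

```python
import math

def power_distribution(probs, alpha):
    """Normalized alpha-power of a discrete distribution: p_i**alpha / sum_j p_j**alpha."""
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [x / total for x in powered]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy base distribution over three outcomes (illustrative values only).
base = [0.5, 0.3, 0.2]
sharp = power_distribution(base, 2.0)  # alpha > 1 concentrates mass on the mode
flat = power_distribution(base, 0.5)   # alpha < 1 spreads mass more evenly

print(entropy(sharp), entropy(base), entropy(flat))
```

With α = 2 the entropy falls below that of the base distribution, and with α = 0.5 it rises above it, which is the sense in which α > 1 sharpens and α < 1 flattens.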
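[8] TARo: Token-level Adaptive Routing for LLM Test-time Alignment cs.CL | cs.AI | cs.LG
Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli
TL;DR: This paper proposes TARo (Token-level Adaptive Routing), a test-time alignment method that improves the structured reasoning of large language models (LLMs) without expensive post-training. It trains reward models to capture fine-grained logical-consistency signals and introduces a learnable token-level router that automatically steers the base model at inference time, significantly improving performance on mathematical reasoning, clinical reasoning, and instruction following.
Details
Motivation: Existing test-time alignment methods focus mainly on preference alignment and do not optimize structured reasoning, so a lightweight method is needed that improves LLM reasoning directly at inference time.
Result: On mathematical reasoning, TARo improves by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods. It also improves out-of-distribution tasks such as MedXpertQA (clinical reasoning) and AlpacaEval (instruction following), and generalizes to base models of different sizes without retraining.
Insight: The novelty is to extend test-time alignment from preference optimization to structured reasoning, using token-level adaptive routing for fine-grained reward guidance. The method is lightweight and generalizes across models without retraining, suggesting a new route to improving LLM performance at inference time.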
Abstract: Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
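[9] Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs cs.CL
Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura
TL;DR: This paper studies task interference in multimodal large language models, that is, the performance degradation caused by task switches within a single conversation. The authors introduce a benchmark covering six text and vision tasks that systematically varies the mismatch between history and target along three axes: modality, reasoning, and answer format. Interference turns out to be highly directional: switching from text-only tasks to image-based tasks causes severe degradation, while the reverse has little effect. Mismatches along multiple axes amplify interference, with modality differences mattering most, answer format next, and changes in reasoning requirements least.
Details
Motivation: Multimodal dialogue systems are increasingly common, yet task interference has been studied only in text-only settings and has not been explored systematically in multimodal ones. The paper fills this gap by building a benchmark and analyzing the degradation caused by history-target mismatch in multimodal LLMs, to better understand how task interference arises.
Result: Experiments on open-weights and proprietary models show that task interference is directional: switching from text to image targets causes severe degradation (for example, accuracy drops of more than 20% on some tasks), while the reverse switch has negligible effect. Mismatches along multiple axes (such as simultaneous changes in modality and answer format) further amplify interference, with modality differences the main driver. The benchmark covers multiple tasks and offers a standard for evaluating the robustness of multimodal LLMs.
Insight: The novelty is the first systematic study of task interference in multimodal LLMs, with a benchmark covering modality, reasoning, and answer-format mismatch. Viewed objectively, the paper reveals the directionality of interference and the compounding effect of multi-axis mismatch, and highlights the key role of modality alignment in multimodal dialogue, with implications for model design and optimization.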
Abstract: Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.
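[10] Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation cs.CL | cs.AI
Asmita Bhardwaj, Yuya Jeremy Ong, Eelaaf Zahid, Basel Shbita
TL;DR: This paper proposes a reinforcement learning-based decoder sampler that treats decoding as a sequential decision problem and learns a lightweight policy that adjusts sampling parameters at test time while keeping the LLM weights frozen. Evaluated on several summarization datasets (BookSum, arXiv, WikiHow) with Granite-3.3-2B and Qwen-2.5-0.5B, it consistently outperforms greedy decoding and static baselines.
Details
Motivation: Widely used decoding strategies (such as greedy decoding or fixed temperature/top-p sampling) are static and task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility.
Result: Relative gains reach +88% on BookSum with the Granite model and +79% on WikiHow with the Qwen model, both over greedy decoding and static baselines. Reward ablations show that overlap-only objectives underperform composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements.
Insight: The novelty is to use reinforcement learning as a practical mechanism for test-time adaptation in decoding: a lightweight policy adjusts sampling parameters dynamically, enabling domain-aware and user-controllable generation without retraining large models. Viewed objectively, the method moves decoding from static heuristics to learned dynamic policies, a scalable route to better generation quality.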
Abstract: Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test time while keeping LLM weights frozen. We evaluate on summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.
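For readers unfamiliar with the sampling parameters such a policy would adjust, here is a minimal sketch of temperature and top-p (nucleus) sampling over a list of logits. It does not reproduce the paper's learned policy or reward; the function names and toy logits are hypothetical, and the point is only that `temperature` and `top_p` are plain arguments that could be chosen anew at each decoding step.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def nucleus_indices(probs, top_p):
    """Smallest set of highest-probability indices whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index; a policy could pass different parameters per step."""
    probs = softmax(logits, temperature)
    kept = nucleus_indices(probs, top_p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # illustrative values only
token = sample_token(toy_logits, temperature=0.7, top_p=0.9, rng=random.Random(0))
```

Greedy decoding corresponds to the limit of a very small `top_p` (only the most likely token survives), which is why a static choice of these two numbers can be too rigid across domains.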
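[11] Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition cs.CL
Ivaxi Sheth, Zeno Jonke, Amin Mantrach, Saab Mansour
TL;DR: This paper proposes a decomposition-based evaluation framework built around a Universal Criteria Set (UCS) to address the challenge of automatically evaluating large language models in multilingual settings. Shared, language-agnostic evaluation dimensions produce an interpretable intermediate representation, allowing evaluation ability to transfer across languages without target-language annotations.
Details
Motivation: Automatic evaluation of large language models in non-English languages is hard because existing methods mainly target English and most languages lack costly human-annotated judgments.
Result: On multiple faithfulness tasks across languages and model backbones, the method improves consistently over strong baselines without target-language annotations.
Insight: The novelty is the Universal Criteria Set (UCS) as a language-agnostic intermediate representation, with evaluation decomposition supporting cross-lingual transfer, giving an interpretable and data-efficient framework for multilingual evaluation.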
Abstract: As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.
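[12] ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs cs.CL | cs.AI | cs.LG
Abhinaba Basu, Pavan Chakraborty
TL;DR: This paper proposes the ICE framework for evaluating the faithfulness of large language model (LLM) explanations. Using randomization tests under multiple intervention operators, it compares explanations against matched random baselines and reports win rates with confidence intervals. Evaluation across tasks, languages, and attribution methods shows that faithfulness is operator-dependent and unrelated to human plausibility.
Details
Motivation: Existing evaluations use a single intervention and no statistical testing, so they cannot distinguish genuine faithfulness from chance-level performance; a statistically sound framework is needed to assess whether explanations truly reflect a model's reasoning.
Result: Seven LLMs were evaluated on 4 English tasks, 6 non-English languages, and 2 attribution methods. Faithfulness depends strongly on the intervention operator (gaps of up to 44 percentage points), one-third of configurations show anti-faithfulness, and faithfulness is uncorrelated with human plausibility (|r| < 0.04). Multilingual evaluation reveals pronounced model-language interactions.
Insight: The novelty is ICE, a faithfulness evaluation framework based on statistical (randomization) tests and multiple intervention operators, which argues that faithfulness should be treated as a comparative measure across operators rather than a single score. Viewed objectively, its methodology (such as the use of random baselines) gives explanation evaluation a more rigorous statistical footing and exposes how existing evaluations can mislead (operator dependence, anti-faithfulness).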
Abstract: Evaluating whether explanations faithfully reflect a model’s reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.
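The idea of a win rate against matched random baselines, reported with a confidence interval, can be sketched as follows. This is not the ICE implementation: the effect scores are made-up numbers, the helper names are hypothetical, and the Wilson score interval is an assumed choice of interval that the abstract does not specify.

```python
import math

def win_rate(explanation_effects, baseline_effects):
    """Count how often the explanation's intervention effect beats its matched random baseline."""
    if len(explanation_effects) != len(baseline_effects):
        raise ValueError("each explanation needs exactly one matched baseline")
    wins = sum(e > b for e, b in zip(explanation_effects, baseline_effects))
    return wins, len(explanation_effects)

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical effect sizes, e.g. drop in model confidence after deleting tokens.
explained = [0.9, 0.8, 0.1, 0.7]   # tokens chosen by the explanation
random_set = [0.2, 0.3, 0.4, 0.1]  # same-sized random token sets
wins, n = win_rate(explained, random_set)
low, high = wilson_interval(wins, n)
print(f"win rate {wins / n:.2f}, interval ({low:.2f}, {high:.2f})")
```

A win rate whose interval lies entirely below 0.5 would correspond to what the abstract calls anti-faithfulness: the explanation does worse than random token selection.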
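[13] Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media cs.CL | cs.CV
Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl
TL;DR: This paper proposes an explainable multimodal classification framework for humanitarian information on social media. A vision-language Transformer learns a joint representation of text and image and extracts text rationales; cross-modal rationale transfer then maps text rationales to image rationales, reducing annotation cost. Tweets are finally classified based on the extracted rationales.
Details
Motivation: Existing methods that classify text and images into humanitarian categories have opaque decision processes, which hinders real-world use, and existing explainable classification methods focus mainly on text and do not explain crisis-related images.
Result: On the CrisisMMD benchmark, the method raises classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales; human evaluation confirms that it retrieves better image rationale patches (a 12% improvement). In zero-shot mode it adapts well to new, unseen datasets, reaching 80% accuracy.
Insight: The novelty is cross-modal rationale transfer, which learns image rationales from text rationales and so reduces the need to annotate image rationales, within an interpretable multimodal classification framework that improves both classification performance and interpretability.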
Abstract: Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.
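[14] Learning to Self-Evolve cs.CL | cs.AI
Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao
TL;DR: This paper proposes Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models to improve their own contexts at test time. It reduces the multi-step evolution problem to a single-step reinforcement learning objective and pairs it with a tree-guided evolution loop. On Text-to-SQL generation and general question answering, a 4B-parameter model trained with the framework outperforms self-evolving policies based on GPT-5 and Claude Sonnet 4.5 as well as other prompt optimization methods.
Details
Motivation: Existing approaches rely entirely on the model's inherent reasoning ability and never explicitly train it for test-time self-evolution, in which the model iteratively refines its context from feedback on seen problems to do better on new ones. The paper addresses this by treating self-evolution as a learnable skill.
Result: On the Text-to-SQL benchmark BIRD and the general question answering benchmark MMLU-Redux, a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training.
Insight: The core novelty is to formalize test-time self-evolution as a trainable reinforcement learning problem, optimizing context edits with a single-step reward (the improvement in downstream performance) and guiding evolution with tree search. This goes beyond relying solely on a model's intrinsic ability and shows that self-evolution can be learned as a skill.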
Abstract: We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.
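[15] Mi:dm K 2.5 Pro cs.CL | cs.AI
KT Tech innovation Group
TL;DR: This paper introduces Mi:dm K 2.5 Pro, a 32B-parameter large language model designed for Korean and enterprise applications. Through reasoning-focused optimization, it targets complex tasks such as multi-step reasoning, long-context understanding, and agentic workflows. It is built with high-quality data curation based on AST analysis and LLM evaluation, depth-upscaled pre-training supporting a 128K context, and a multi-stage post-training pipeline comprising Reasoning SFT, model merging, and asynchronous reinforcement learning.
Details
Motivation: In enterprise environments, especially Korean-language and domain-specific scenarios, existing large language models struggle to deliver multi-step reasoning, long-context understanding, and agentic workflows, and scaling alone is insufficient.
Result: Evaluations show that Mi:dm K 2.5 Pro is competitive with leading global and Korean domestic models and achieves state-of-the-art (SOTA) results on Korean-specific benchmarks. Responsible AI evaluations also validate its safety.
Insight: Contributions include: 1) infrastructure for building high-quality data through AST analysis, gap-filling synthesis, and an LLM evaluator; 2) layer-predictor-based depth upscaling and a progressive strategy for a long context window; 3) a multi-stage post-training pipeline comprising Reasoning SFT, model merging, and asynchronous reinforcement learning, with "Fusion Training" balancing reasoning against conversational fluency and tool use. These methods are systematically optimized for enterprise-grade complexity and for a specific language (Korean).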
Abstract: The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. “Fusion Training” then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.
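[16] A Human-in/on-the-Loop Framework for Accessible Text Generation cs.CL
Lourdes Moreno, Paloma Martínez
TL;DR: This paper proposes a hybrid Human-in/on-the-Loop framework for generating and evaluating accessible text (such as Plain Language and Easy-to-Read formats). Through human guidance during generation (HiTL) and systematic supervision after generation (HoTL), it turns user studies and annotated resources into standardized checklists, trigger rules, and key performance indicators, aiming to improve the comprehensibility, traceability, and ethical accountability of text simplification systems.
Details
Motivation: Current automatic text simplification and evaluation pipelines are heavily automated and metric-driven and do not reflect user comprehension or normative standards, so an approach that explicitly integrates human participation is needed to ensure cognitive accessibility.
Result: Drawing on empirical evidence from user studies and annotated resources, the framework provides checklists aligned with standards, Event-Condition-Action trigger rules for activating expert oversight, and accessibility Key Performance Indicators (KPIs), giving structured feedback for model adaptation; the abstract reports no benchmarks or quantitative results.
Insight: The novelty is to encode the human role as operational mechanisms (such as checklists and trigger rules) embedded throughout generation and supervision, establishing a traceable, reproducible, and auditable process for creating and evaluating text, with explainability and ethical accountability as core design principles for more transparent and inclusive NLP systems.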
Abstract: Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated, metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.
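[17] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought cs.CL | cs.LG
Xinghao Zhao
TL;DR: This paper studies whether the shape of the uncertainty trajectory during chain-of-thought (CoT) reasoning in large language models (LLMs), that is, how the entropy of the answer distribution changes from step to step, predicts whether the reasoning is correct. Monotonicity of the entropy trajectory (entropy decreasing at every step) is a stronger predictor than aggregate measures such as total entropy reduction, separates correct from incorrect chains effectively, and costs far less than conventional self-consistency sampling.
Details
Motivation: Chain-of-thought reasoning improves LLM accuracy, but detecting reasoning failures cheaply remains difficult. The paper asks whether the "shape" of uncertainty dynamics across reasoning steps can predict correctness.
Result: On GSM8K with Qwen2.5-7B-Instruct, monotone chains (entropy decreasing at every step) reach 68.8% accuracy versus 46.8% for non-monotone chains (+21.9 percentage points). Violation counts of 0/1/2 correspond to accuracies of 68.8%/50.8%/28.6%. At about 1,500 tokens per question, the method achieves +5.8 percentage points at 73.7% coverage, at 1/8 the cost of 40-chain self-consistency. The results replicate on Mistral-7B (72.3% for monotone chains vs. 37.6% for non-monotone chains).
Insight: The core novelty is the diagnostic of entropy-trajectory monotonicity and the finding that the "shape" of uncertainty dynamics (whether entropy decreases at every step) predicts reasoning reliability better than aggregate measures (such as total entropy reduction), offering a cheap and efficient way to assess the quality and reliability of LLM reasoning.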
Abstract: Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps (captured by sampling a few answer completions per step) predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive ($ρ = -0.06$, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: from 0.186 to 0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approximately 1,500 tokens per question, 1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.
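The monotonicity check described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions rather than the author's code: the sampled answers are invented, entropy is computed from empirical answer frequencies, and treating a tie (no change in entropy) as a violation is an assumed reading of "decreases at every step".

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution over sampled answers."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())

def violation_count(entropies):
    """Number of steps where entropy fails to decrease (ties count, by assumption)."""
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur >= prev)

def is_monotone(entropies):
    """A chain is monotone if entropy strictly decreases at every step."""
    return violation_count(entropies) == 0

# Hypothetical answers sampled after each of three reasoning steps.
samples_per_step = [
    ["3", "4", "5", "4"],  # early step: answers disagree
    ["4", "4", "5", "4"],  # agreement grows
    ["4", "4", "4", "4"],  # final step: unanimous
]
trajectory = [answer_entropy(step) for step in samples_per_step]
print(trajectory, is_monotone(trajectory))
```

Note that the check looks only at the direction of each step, not the size of the drop, which matches the abstract's point that shape rather than magnitude carries the signal.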
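[18] What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time? cs.CL | cs.AI
Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao
TL;DR: This paper presents MultiTempBench, a multilingual temporal reasoning benchmark covering three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages and multiple calendars. Evaluating 20 large language models, and introducing the multilingual Date Fragmentation Ratio (mDFR) and geometric-probing analyses, it finds that the tokenisation quality of temporal expressions is a resource-dependent bottleneck: in low-resource languages and rarer calendars, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are usually robust to digit-level splitting. Crossed mixed-effects regression further shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages.
Details
Motivation: The paper asks what chiefly controls the temporal reasoning of large language models across languages and calendars: tokenisation or the internal representation of time.
Result: Evaluating 20 LLMs on MultiTempBench, the paper finds that fragmentation of temporal tokens causes accuracy to collapse in low-resource languages and rarer calendars, while high-resource languages are more robust to digit-level splitting; regression analysis shows temporal linearity to be the strongest predictor in high-resource languages and fragmentation the stronger predictor in low-resource languages.
Insight: The novelty is a multilingual, multi-calendar temporal reasoning benchmark together with the mDFR metric and geometric-probing analyses that quantify the effect of tokenisation quality. Viewed objectively, the study shows that temporal reasoning performance is resource-dependent: tokenisation is the key bottleneck in low-resource settings, while temporal representation (linearity) matters more in high-resource ones, pointing to directions for improving temporal processing in multilingual LLMs.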
Abstract: We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
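[19] MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models cs.CL
Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng
TL;DR: MoRI is an LLM-based framework for scientific ideation that uses motivation-grounded reasoning to propose novel, technically rigorous solutions from a given scientific context. The model first learns to generate research motivations through supervised fine-tuning, and is then trained with a reinforcement learning reward that combines information gain and semantic gain to approximate scientific rigor.
Details
Motivation: Existing LLM-based agentic approaches model the scientific reasoning process inadequately, so their conceptual recombinations stay superficial and lack technical depth and scientific grounding.
Result: Experiments show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines on several dimensions, including novelty, technical rigor, and feasibility.
Insight: The core contribution is to model scientific ideation explicitly as reasoning from research motivation to methodology, with a composite reinforcement learning reward that combines entropy-aware information gain (encouraging the model to elaborate high-complexity technical details) and contrastive semantic gain (keeping the reasoning trajectory conceptually aligned with scientifically valid solutions) to approximate scientific rigor.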
Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub: https://github.com/ECNU-Text-Computing/IdeaGeneration
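[20] Optimal Splitting of Language Models from Mixtures to Specialized Domains cs.CL | cs.LG
Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune
TL;DR: The paper proposes a scaling-law-based method for optimally allocating compute between language model pretraining and domain specialization. It pretrains multiple models independently and determines the best compute split between pretraining and continued pretraining, improving performance on common-sense knowledge and reasoning tasks.
Details
Motivation: In the multi-domain setting, the standard two-stage recipe (pretrain on the full corpus, then specialize on high-quality domain data) leaves open how to allocate compute optimally when training several specialized models.
Result: Across model sizes and compute budgets, the method consistently improves performance on common-sense knowledge and reasoning benchmarks. It also predicts model loss accurately and extrapolates to larger model sizes and more training tokens.
Insight: The novelty lies in using scaling laws to optimize the compute split between pretraining and specialization, enabling more efficient multi-domain language model training. Viewed objectively, the method offers a scalable framework for model specialization under resource constraints.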
Abstract: Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D’ specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.
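[21] VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models cs.CL | cs.AI
Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai
TL;DR: This paper proposes Variable Entropy Policy Optimization (VEPO), a reinforcement learning framework for improving LLM performance on low-resource languages. It uses verifiable rewards to build deterministic structural constraints into policy alignment, improving tokenization efficiency and translation quality.
Details
Motivation: LLMs underperform on low-resource languages mainly because of inefficient subword segmentation and systemic training data imbalance. A method is needed that enforces sequence length, format consistency, and linguistic well-formedness.
Result: Experiments across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, show substantial gains in both tokenization efficiency and translation quality, narrowing the performance gap between low-resource and mainstream languages.
Insight: The novelty is a variable entropy mechanism that dynamically balances literal fidelity against semantic naturalness, combined with entropy-tempered advantage estimation and asymmetric clipping to sustain robust exploration and prevent policy collapse. This suggests a new approach to optimizing foundation models for low-resource languages.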
Abstract: Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 directions, measured with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
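[22] Evaluating Counterfactual Strategic Reasoning in Large Language Models cs.CL
Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou
TL;DR: The paper evaluates LLMs in repeated game-theoretic settings to test whether their strategic performance reflects genuine reasoning or reliance on memorized patterns. It introduces counterfactual variants of two canonical games (Prisoner's Dilemma and Rock-Paper-Scissors) that alter payoff structures and action labels, breaking familiar symmetries and dominance relations, and uses a multi-metric evaluation framework to compare default and counterfactual settings. The comparison reveals LLM limitations in incentive sensitivity, structural generalization, and strategic reasoning in counterfactual environments.
Details
Motivation: To determine whether LLM strategic behavior rests on memorized patterns or on genuine reasoning, addressing concerns about their reliability in complex decision-making environments.
Result: In the counterfactual variants of Prisoner's Dilemma and Rock-Paper-Scissors, LLMs show weak incentive sensitivity, weak structural generalization, and limited strategic reasoning, which points to degraded performance in non-standard game environments.
Insight: The novelty lies in designing counterfactual game variants that break the familiar patterns LLMs may rely on, and in building a multi-metric evaluation framework that quantifies their strategic reasoning deficits systematically. Viewed objectively, the approach offers a new way to evaluate deeper reasoning in LLMs and underlines the importance of testing generalization across environments.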
Abstract: We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner’s Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
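To make "breaking dominance relations" concrete, the sketch below builds a textbook Prisoner's Dilemma, a relabeled copy, and a payoff-altered copy, then checks whether the row player still has a strictly dominant action. The payoff numbers and the specific counterfactual edits are hypothetical and are not taken from the paper.

```python
# Payoffs are (row player, column player). Action labels: 0 = cooperate, 1 = defect.
DEFAULT_PD = {
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}

def relabel(game, mapping):
    """Rename actions for both players; the strategic structure is unchanged."""
    return {(mapping[a], mapping[b]): payoff for (a, b), payoff in game.items()}

def strictly_dominant_action(game, actions=(0, 1)):
    """Return the row player's strictly dominant action, or None if there is none."""
    for a in actions:
        others = [b for b in actions if b != a]
        if all(game[(a, opp)][0] > game[(b, opp)][0]
               for b in others for opp in actions):
            return a
    return None

# Counterfactual 1 (labels only): the dominant move now carries the other label.
SWAPPED_LABELS = relabel(DEFAULT_PD, {0: 1, 1: 0})

# Counterfactual 2 (payoffs): lower the temptation payoff from 5 to 2 (hypothetical),
# so defecting against a cooperator no longer beats mutual cooperation.
ALTERED_PAYOFFS = dict(DEFAULT_PD)
ALTERED_PAYOFFS[(1, 0)] = (2, 0)
ALTERED_PAYOFFS[(0, 1)] = (0, 2)

for name, game in [("default", DEFAULT_PD),
                   ("swapped labels", SWAPPED_LABELS),
                   ("altered payoffs", ALTERED_PAYOFFS)]:
    print(name, strictly_dominant_action(game))
```

A model that keeps choosing the action labeled "defect" in the second and third games would be following the memorized pattern rather than the incentives, which is the kind of failure the evaluation is designed to expose.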
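[23] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation cs.CL | cs.AI | cs.LG
Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang
TL;DR: Nemotron-Cascade 2 is an open 30B-parameter MoE model (3B activated parameters) whose mathematical and coding reasoning approaches that of frontier open models and which reaches gold-medal-level performance in several international competitions, giving it very high intelligence density. Its core techniques are an expanded Cascade RL and multi-domain on-policy distillation.
Details
Motivation: To develop a parameter-efficient open LLM with high intelligence density that reaches top-tier performance on complex reasoning and agentic tasks while staying compact.
Result: The model achieves gold-medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals. It is the second open-weight model to do so, after DeepSeekV3.2, with 20x fewer parameters. Its mathematical and coding reasoning performance approaches that of frontier open models.
Insight: The core contributions are an expanded Cascade RL that covers a broader range of reasoning and agentic domains, and multi-domain on-policy distillation from the strongest intermediate teacher model for each domain throughout the RL process. Together they recover benchmark regressions, sustain performance gains, and balance parameter efficiency against performance.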
Abstract: We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.
cs.CV [Back]
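[24] RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers cs.CV
X. Gao, C. Chien, G. Liu, A. Manullang
TL;DR: This paper addresses multi-label classification of capsule endoscopy videos by fine-tuning a Google Vision Transformer (ViT) model to detect 17 labels covering rare pathologies and anatomical landmarks, including bleeding, polyps, and ulcers. On the test set, the model reaches an mAP@0.5 of 0.0205 and an mAP@0.95 of 0.0196.
Details
Motivation: To automatically detect and classify multiple rare pathologies and anatomical landmarks in capsule endoscopy videos in support of medical diagnosis.
Result: On a test set of three videos, the overall mAP@0.5 is 0.0205 and the overall mAP@0.95 is 0.0196, which serves as an initial baseline.
Insight: The work applies a Transformer-based vision model (ViT) to capsule endoscopy video analysis and explores its potential for multi-label classification of medical video, offering a deep learning approach to rare-pathology detection.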
Abstract: This work addresses the Gastro Competition on multi-label classification from capsule endoscopic videos (CEV). A Transformer-based deep learning network is fine-tuned for this task. The baseline model is Google Vision Transformer (ViT) patch16 at 224 x 224 resolution. In total, 17 labels are classified: mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. On the test dataset of three videos, the overall mAP@0.5 is 0.0205 and the overall mAP@0.95 is 0.0196.
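[25] S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition cs.CV | cs.AI
Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yujia Wang
TL;DR: This paper proposes S3T-Former, a purely spike-driven state-space topology Transformer designed for energy-efficient skeleton action recognition. It converts multimodal skeleton features into sparse event streams through a multi-stream anatomical spiking embedding, and introduces lateral spiking topology routing and a spiking state-space engine to capture long-range temporal dynamics, keeping accuracy high while substantially reducing theoretical energy consumption.
Details
Motivation: Skeleton-based action recognition relies on power-hungry artificial neural networks that are hard to deploy on resource-constrained edge devices. Existing spiking neural network models sacrifice sparsity by relying on dense fusion or non-sparse frequency-domain transformations, and spiking neurons suffer from poor short-term memory.
Result: Experiments on multiple large-scale datasets show that S3T-Former reaches highly competitive accuracy while theoretically consuming less energy than classic artificial neural networks, setting a new state of the art for energy-efficient neuromorphic action recognition.
Insight: The contributions are: the first purely spike-driven Transformer architecture for this task (to the best of the authors' knowledge); M-ASE, a generalized kinematic differential operator that converts features into sparse event streams; LSTR for on-demand conditional spike propagation; and the S3 engine, which captures long-range temporal dependencies without non-sparse spectral workarounds, achieving true topological and temporal sparsity.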
Abstract: Skeleton-based action recognition is crucial for multimedia applications but heavily relies on power-hungry Artificial Neural Networks (ANNs), limiting their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency domain transformations. Furthermore, they severely suffer from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, elegantly transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds. Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while theoretically reducing energy consumption compared to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.
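[26] DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment cs.CV | cs.DB
Wuqi Wang, Haochen Yang, Baolu Li, Jiaqi Sun, Xiangmo Zhao
TL;DR: This paper presents DarkDriving, a real-world day and night aligned dataset for studying low-light enhancement in autonomous driving. Using an automatic trajectory-tracking-based pose matching method in a large closed test field, the authors collected 9,538 precisely aligned day and night image pairs with 2D bounding box annotations. The paper also introduces four perception-related tasks for evaluating low-light enhancement techniques.
Details
Motivation: Existing real-world low-light enhancement datasets can usually be collected only in static scenes and over small exposure ranges, and the dark images in existing nighttime driving datasets lack precisely aligned daytime counterparts. Both constraints severely limit research in this area.
Result: Experiments show that DarkDriving provides a comprehensive benchmark for evaluating low-light enhancement in autonomous driving, and that it generalizes to enhancing dark images and improving detection in other low-light driving environments such as nuScenes.
Insight: The main contribution is the first real-world dataset of precisely aligned day and night images in dynamic driving scenes, with an alignment error of only a few centimeters. The proposed automatic TTPM method overcomes the extreme difficulty of collecting aligned data in dynamic scenes, and the accompanying tasks covering enhancement and detection give the field a new benchmark.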
Abstract: Low-light conditions are challenging for vision-centric perception systems in autonomous driving. In this paper, we propose a new benchmark dataset, DarkDriving, to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets can only be collected by controlling exposure within small ranges and in static scenes, and the dark images in current nighttime driving datasets lack precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day and night aligned dataset in dynamic driving scenes has significantly limited research in this area. Using a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day and night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset has 9,538 day and night image pairs precisely aligned in location and spatial content, with an alignment error of only several centimeters. For each pair, we also manually label 2D object bounding boxes. DarkDriving introduces four perception-related tasks: low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and 3D detection in autonomous driving in the dark environment. The experimental results show that DarkDriving provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving, and that it generalizes to enhancing dark images and improving detection in other low-light driving environments, such as nuScenes.
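[27] Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model cs.CV | cs.RO
Chen Zhao, Zhuoran Wang, Haoyang Li, Shifeng Bao, Guanlin Li
TL;DR: This paper proposes Action-Draft-and-Verify (ADV), a self-verifying framework that combines the strengths of diffusion and auto-regressive models to improve vision-language-action (VLA) models on embodied tasks. A diffusion action expert generates multiple candidate action chunks, and the vision-language model (VLM) scores all candidates in a single forward pass with a perplexity-style metric and selects one.
Details
Motivation: Modern VLA models commonly use diffusion action experts to generate high-precision continuous action chunks efficiently, while auto-regressive generation can be slower and less accurate at low-level control. Auto-regressive paradigms nevertheless provide complementary priors that improve robustness and generalization in out-of-distribution environments. The paper aims to combine the strengths of both.
Result: Under matched backbones, training data, and action-chunk length, ADV improves success rate over a diffusion-based baseline by 4.3 points in simulation and 19.7 points in the real world, adding only the overhead of a single VLM reranking pass.
Insight: The core contribution is a draft-then-verify framework that pairs the efficient action generation of diffusion models with the discriminative selection ability of a VLM. Reranking in a single forward pass improves the accuracy and robustness of action selection, with the largest gains on real-world tasks.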
Abstract: Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, with only a single-pass VLM reranking overhead.
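The verification step can be pictured as a rerank over drafted candidates. The sketch below assumes that each candidate action chunk has already been tokenized and scored by a VLM, so that per-token log-probabilities are available, and that the candidate with the lowest perplexity is kept. The function names, the selection rule, and the numbers are illustrative assumptions, not the paper's implementation, and the batching that makes this a single forward pass is not modeled.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one candidate: exp of the mean negative log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_candidate(candidate_logprobs):
    """Return (index of the lowest-perplexity candidate, all perplexity scores)."""
    scores = [perplexity(lp) for lp in candidate_logprobs]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical per-token log-probabilities for three drafted action chunks.
candidates = [
    [-2.0, -1.5, -3.0],
    [-0.2, -0.4, -0.3],
    [-1.0, -1.0, -1.0],
]
best_index, scores = select_candidate(candidates)
print(best_index, scores)
```

Normalizing by sequence length keeps candidates comparable when chunks tokenize to different lengths, which is one reason a perplexity-style score is a natural choice over a raw summed log-probability.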
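[28] One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control cs.CV
Haoxiang Rao, Zhao Wang, Chenyang Si, Yan Lyu, Yuanyi Duan
TL;DR: This paper proposes O2MAG, a training-free few-shot anomaly generation method. It exploits the self-attention of a reference anomalous image and combines self-attention grafting, anomaly mask guidance, and anomaly-guided optimization to synthesize anomalous images that closely follow the real anomaly distribution, providing effective data augmentation for downstream industrial anomaly detection.
Details
Motivation: In industrial anomaly detection, normal images are abundant but anomalous images are scarce. Existing few-shot anomaly synthesis methods usually require time-consuming training and struggle to learn the real anomaly distribution faithfully, which limits the performance of anomaly detection models.
Result: Extensive experiments validate O2MAG, which outperforms prior state-of-the-art methods on downstream anomaly detection tasks.
Insight: The novelty is a fully training-free anomaly generation framework that uses self-attention grafting, anomaly mask guidance, and anomaly-guided optimization to synthesize high-fidelity samples from the real anomaly distribution, avoiding both the training cost of conventional methods and their inaccurate distribution learning.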
Abstract: Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.
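[29] Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters cs.CV | cs.AI | cs.LG
Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed
TL;DR: This paper proposes TOGA, a training-only framework for improving CLIP-based few-shot adapters such as Tip-Adapter. It builds a high-capacity heterogeneous graph teacher that is used only during training; the teacher integrates multi-scale visual patches and text prompts and performs deep cross-modal reasoning with a modality-aware graph transformer to extract high-quality class features. A cache-aware dual-objective strategy injects this relational knowledge directly into the adapter's key-value cache, improving prototype quality with no extra cost at inference time.
Details
Motivation: Existing adapter-based CLIP tuning methods such as Tip-Adapter rely on global uni-modal feature vectors for fast prototype matching, overlooking fine-grained patch relations and their structural alignment with class text. The paper aims to close this gap without increasing inference cost.
Result: On standard 1-16-shot benchmarks, the method consistently sets a new state of the art. Ablations confirm that auxiliary graph supervision, text-guided reasoning, and node filtering are the key components for robust few-shot adaptation.
Insight: The novelty is an asymmetric, training-only, high-capacity heterogeneous graph teacher that extracts fine-grained relational knowledge through deep cross-modal graph reasoning and node filtering, and a cache-aware dual-objective strategy that distills this knowledge into the lightweight adapter's cache. Few-shot performance improves substantially without changing the inference architecture or adding overhead.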
Abstract: Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter’s key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
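Because inference is stated to remain identical to Tip-Adapter, it may help to see what a key-value cache lookup of that kind looks like. The sketch below follows the commonly published Tip-Adapter form, in which cache affinities exp(-beta * (1 - cosine similarity)) weight one-hot label values and are added to the zero-shot text logits. The values of `alpha` and `beta`, the toy 2-D features, and the omission of CLIP's logit scale are assumptions; the training-time graph teacher is not modeled at all.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tip_adapter_logits(query, cache_keys, cache_values, text_weights,
                       alpha=1.0, beta=5.5):
    """Cache-based logits in the style of Tip-Adapter (features assumed L2-normalized).

    cache_keys:   support-image features, one per cached shot
    cache_values: one-hot class labels for the cached shots
    text_weights: one class-text embedding per class (zero-shot classifier)
    """
    # Affinity of the query to each cached key, sharpened by beta.
    affinities = [math.exp(-beta * (1.0 - dot(query, key))) for key in cache_keys]
    num_classes = len(text_weights)
    # Affinity-weighted vote of the cached labels.
    cache_logits = [sum(a * value[c] for a, value in zip(affinities, cache_values))
                    for c in range(num_classes)]
    # Zero-shot logits from the text classifier.
    zero_shot_logits = [dot(query, w) for w in text_weights]
    return [alpha * cl + zl for cl, zl in zip(cache_logits, zero_shot_logits)]

# Toy example: two classes, one cached shot per class, 2-D unit features.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
logits = tip_adapter_logits([1.0, 0.0], keys, values, text)
print(logits)
```

Under this reading, a training-only method improves accuracy purely by changing the contents of `keys` and `values`; the lookup itself, and therefore its latency and memory, stays the same.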
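[30] Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models cs.CV | cs.AI | cs.LG
Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu
TL;DR: The paper proposes Insight-V++, a unified multi-agent visual reasoning framework that addresses two obstacles to long-chain visual reasoning in multimodal LLMs: the scarcity of high-quality data and the lack of optimized training pipelines. It combines a scalable data generation pipeline with a dual-agent architecture (a reasoning agent and a summary agent) and introduces two new algorithms, ST-GRPO and J-GRPO, achieving significant gains on image and video benchmarks.
Details
Motivation: To extend the strong test-time reasoning of LLMs to multimodal LLMs, which is hindered by the lack of high-quality long-chain reasoning data and optimized training pipelines.
Result: Extensive experiments on base models such as LLaVA-NeXT and Qwen2.5-VL show significant gains on challenging image and video reasoning benchmarks while preserving strong performance on traditional perception tasks.
Insight: The contributions are: 1) a scalable data generation pipeline with multi-granularity assessment that needs no human intervention; 2) a dual-agent architecture with a reasoning agent and a summary agent; 3) ST-GRPO and J-GRPO, which strengthen spatial-temporal reasoning and evaluative robustness for long-horizon video understanding; 4) an iterative reasoning path generation process guided by reliable feedback from the summary agent, which retrains the whole system in a continuous self-improving loop.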
Abstract: Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
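[31] VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events cs.CV | cs.AI
Mohammad Qazim Bhat, Yufan Huang, Niket Agarwal, Hao Wang, Michael Woods
TL;DR: This paper proposes VLM-AutoDrive, a modular post-training framework that adapts pretrained vision-language models (VLMs) to detecting safety-critical events in autonomous driving. The framework combines metadata-derived captions, LLM-generated descriptions, visual question answering pairs, and chain-of-thought supervision, substantially improving detection of collisions and near-collisions, and is validated on real-world Nexar dashcam videos.
Details
Motivation: General-purpose VLMs underperform in driving scenarios because of domain and temporal misalignment, especially when detecting safety-critical events in dashcam video (collisions and near-collisions) that are brief, rare, and hard to capture.
Result: Evaluated on real-world Nexar dashcam videos, the framework raises Collision F1 from a zero-shot 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%, a substantial improvement in safety-critical event detection.
Insight: The novelty is a modular post-training framework that uses multimodal supervision signals (metadata captions, LLM descriptions, VQA pairs, chain-of-thought) to achieve domain alignment and interpretable learning. Viewed objectively, the work offers a scalable recipe for adapting general-purpose VLMs to time-sensitive, safety-critical perception tasks and connects perception, causal reasoning, and decision reasoning.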
Abstract: The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA’s Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.
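[32] MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles cs.CV
Alexander Rasch, Rahul Rajendra Pai
TL;DR: The paper presents MicroVision, an open image dataset for detecting vulnerable road users (VRUs) and micromobility vehicles (MMVs). It contains more than 8,000 full-HD images captured from a VRU perspective with more than 30,000 annotated instances, and comes with benchmark detection models based on state-of-the-art architectures that reach an mAP of up to 0.723 on an unseen test set.
Details
Motivation: Existing open image datasets categorize VRUs and MMVs too coarsely (for example, labeling both pedestrians and MMV riders as "person") and lack data captured from a VRU perspective, such as sidewalks and cycle paths. This makes them poorly suited to the fine-grained road-user detection that traffic safety and planning require.
Result: Benchmark object-detection models based on state-of-the-art architectures reach a mean average precision of up to 0.723 on an unseen test set, giving later work a performance baseline.
Insight: The contributions are: 1) an open dataset focused on fine-grained detection of VRUs and MMVs, with classes for pedestrians, cyclists, e-scooter riders, and stationary vehicles; 2) data captured from a VRU perspective, covering areas that car-mounted cameras miss; 3) diverse data collected over an entire year across many scenes, supporting model generalization.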
Abstract: Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images – a computer-vision task relying heavily on the quality of the images to train on. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as “person”, or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and part of almost 2,000 unique interaction scenes. Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.723 on an unseen test set. The dataset and model can support traffic safety to distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility. The dataset and model weights can be accessed at https://doi.org/10.71870/eepz-jd52.
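[33] LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression cs.CV | cs.AI
Tamer Shanableh
TL;DR: This paper proposes LRConv-NeRV, an efficient neural video representation that replaces selected dense 3x3 convolutional layers in the NeRV decoder with structured low-rank separable convolutions, trained end-to-end within the decoder architecture, giving a controllable trade-off between reconstruction quality and efficiency.
Details
Motivation: NeRV's convolutional decoder is computationally expensive and memory intensive, which limits deployment in resource-constrained environments. The paper targets this efficiency bottleneck.
Result: Applying LRConv only to the final decoder stage reduces decoder complexity by 68% (from 201.9 to 64.9 GFLOPs) and model size by 9.3%, with negligible quality loss and about 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV keeps reconstruction quality close to the dense NeRV baseline. Under layer-aligned settings it achieves a better efficiency versus quality trade-off than existing work, with higher PSNR/MS-SSIM and better temporal stability.
Insight: The novelty is applying low-rank factorization to the decoder progressively, from the largest stage to earlier stages, to obtain a controllable efficiency versus quality trade-off. The paper shows that low-rank convolution in the final decoder stage cuts computation and storage substantially while preserving reconstruction quality and temporal coherence, making it a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.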
Abstract: Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.
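To see why a low-rank separable replacement saves parameters, the sketch below compares a dense 3x3 convolution with one common factorization: a 3x1 convolution into `rank` intermediate channels followed by a 1x3 convolution. The abstract does not spell out the exact LRConv structure, so this factorization, the channel counts, and the rank are assumptions chosen for illustration, and the resulting percentage is not the paper's 68% figure.

```python
def dense_conv_params(c_in, c_out, k=3):
    """Weights in a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def separable_lowrank_params(c_in, c_out, rank, k=3):
    """Weights in a k x 1 conv (c_in -> rank) followed by a 1 x k conv (rank -> c_out)."""
    return k * c_in * rank + k * rank * c_out

def reduction(c_in, c_out, rank, k=3):
    """Fraction of weights removed by the factorization."""
    dense = dense_conv_params(c_in, c_out, k)
    low_rank = separable_lowrank_params(c_in, c_out, rank, k)
    return 1.0 - low_rank / dense

# Hypothetical layer: 96 input and output channels, rank 32.
dense = dense_conv_params(96, 96)
low_rank = separable_lowrank_params(96, 96, rank=32)
print(dense, low_rank, round(reduction(96, 96, 32), 4))
```

For a stride-1 layer the multiply-accumulate count per output pixel scales with the same two expressions, so the compute saving tracks the weight saving. That is consistent with the abstract's observation that factorizing the largest, highest-resolution stage yields most of the GFLOPs reduction while changing model size far less.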
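[34] CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning cs.CV
Marios Krestenitis, Christos Tzelepis, Konstantinos Ioannidis, Steafanos Vrochidis, Ioannis Kompatsiaris
TL;DR: CycleCap is a fine-tuning scheme based on self-supervised cycle consistency. An image-to-text model generates a caption, a text-to-image model reconstructs the image from it, and the similarity between the original and reconstructed images serves as the reward signal. This improves the captioning performance of vision-language models and yields more accurate, less hallucinated descriptions without annotated data.
Details
Motivation: Vision-language models suffer from vision-language misalignment in captioning, producing overly generic or hallucinated descriptions. The paper aims to fix this without relying on costly large-scale annotated datasets or complex test-time frameworks.
Result: Applied to four vision-language models ranging from 1B to 7B parameters, CycleCap yields consistent improvements on captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
Insight: The novelty is using cycle consistency directly as a self-supervised training signal. With Group Relative Policy Optimization (GRPO) and an image-similarity reward computed on the fly, the model can be fine-tuned on raw images alone, improving the accuracy and grounding of its captions.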
Abstract: Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this either via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
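The reward-to-advantage step described in the CycleCap abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity over image embeddings is an assumption, as are the function names and the toy embedding vectors. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO recipe.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (ASSUMED reward)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of sampled captions:
    advantage_i = (r_i - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical embeddings: one original image and the reconstructions
    # obtained from three captions sampled for that image.
    original = [1.0, 0.0, 1.0]
    reconstructions = [[1.0, 0.1, 0.9], [0.2, 1.0, 0.1], [0.8, 0.3, 0.7]]
    rewards = [cosine_similarity(original, r) for r in reconstructions]
    print(group_relative_advantages(rewards))
```

Captions whose reconstructions land closer to the original image than the group average receive positive advantages and are reinforced; the rest are pushed down, which is why no reference captions are needed.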
[35] VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection cs.CVPDF
Bo-Cheng Qiu, Yu-Fan Lin, Yu-Zhe Pien, Chia-Ming Lee, Fu-En Yang
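TL;DR: This paper proposes VISTA, a framework for rare-pathology event detection in capsule endoscopy video. It combines two complementary foundation models, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, and applies validation-guided hierarchical fusion and anatomy-aware temporal event decoding, recasting the task as metric-aligned event detection rather than pure frame-wise classification.
Details
Motivation: Capsule endoscopy event detection is hard because diagnostically relevant findings are sparse and visually heterogeneous, the video streams are long and noisy, and evaluation is performed at the event level rather than by frame accuracy alone. This calls for an event detection approach aligned with the evaluation metric.
Result: On the official hidden test set, the method achieves an overall temporal mAP@0.5 of 0.3530 and a temporal mAP@0.95 of 0.3235. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance.
Insight: The novelty lies in recasting the task as metric-aligned event detection and in designing a validation-guided hierarchical fusion strategy (class-wise model weighting, backbone weighting, and probability calibration) together with anatomy-aware temporal event decoding (temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation), which effectively combines the strengths of spatial and temporal foundation models.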
Abstract: Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.
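The decoding stage in the VISTA abstract turns per-frame probabilities into events. The sketch below shows the generic idea for a single label, assuming a centered moving average for the smoothing and a fixed threshold; the window size, threshold, `min_len` filter, and function names are illustrative assumptions, and the anatomical constraints and threshold refinement mentioned in the abstract are not modeled.

```python
def smooth(probs, window=3):
    """Centered moving average; the window shrinks at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(probs)):
        lo, hi = max(0, i - half), min(len(probs), i + half + 1)
        smoothed.append(sum(probs[lo:hi]) / (hi - lo))
    return smoothed


def frames_to_events(probs, threshold=0.5, min_len=1):
    """Group consecutive frames with prob >= threshold into
    (start_frame, end_frame) events, dropping runs shorter than min_len."""
    events, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a new run begins
        elif p < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i - 1))
            start = None  # the run has ended
    if start is not None and len(probs) - start >= min_len:
        events.append((start, len(probs) - 1))  # run reaches the last frame
    return events


if __name__ == "__main__":
    # Hypothetical per-frame probabilities for one label.
    frame_probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.1]
    print(frames_to_events(smooth(frame_probs), threshold=0.5, min_len=2))
```

Because the benchmark scores temporal mAP at the event level, small changes to the smoothing window or threshold can merge or split events, which is presumably why the abstract stresses tuning these steps on validation data.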
[36] To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs cs.CV | cs.AIPDF
Rui Hong, Shuxue Quan
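TL;DR: This paper proposes a Tri-Layer Diagnostic Framework for exposing visual sycophancy and split beliefs in vision-language models (VLMs). In most samples, the model perceives the visual anomaly but still hallucinates an answer to satisfy user expectations, and increasing model scale makes this worse.
Details
Motivation: When VLMs answer correctly, it is unclear whether they genuinely rely on visual information or exploit language shortcuts. The paper investigates this to expose the internal conflict between visual grounding and instruction following.
Result: Experiments across 7 VLMs and 7,000 model-sample pairs show that 69.6% of samples exhibit Visual Sycophancy, while zero samples show Robust Refusal. A post-hoc selective prediction strategy based on the diagnostic scores achieves up to +9.5 percentage points of accuracy at 50% coverage.
Insight: The novelty lies in a systematic diagnostic framework that disentangles the sources of hallucination, and in the finding that alignment training may have systematically suppressed truthful acknowledgment of uncertainty. The study also shows that scaling alone cannot resolve the grounding problem, while the diagnostic scores enable low-cost post-hoc improvement.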
Abstract: When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy (models detect visual anomalies but hallucinate to satisfy user expectations), while zero samples show Robust Refusal, indicating that alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows that larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating that scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.
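Two quantities in this abstract lend themselves to a small worked example: a KL-divergence-based dependency score and selective prediction at a fixed coverage. The sketch below is a hedged illustration only. The paper's exact definition of the Visual Necessity Score is not given in the abstract, so comparing the answer distribution under the real image with the one under a blind image is an assumption, and the function names and toy numbers are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answers."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def selective_accuracy(scores, correct, coverage=0.5):
    """Answer only the top `coverage` fraction of samples ranked by score
    (higher = more trustworthy) and report accuracy on that subset."""
    n_keep = max(1, int(round(coverage * len(scores))))
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[:n_keep]
    return sum(c for _, c in kept) / n_keep


if __name__ == "__main__":
    # Hypothetical answer distributions with the real image and a blind image.
    with_image = [0.85, 0.10, 0.05]
    blind_image = [0.40, 0.35, 0.25]
    print("visual dependency (KL):", kl_divergence(with_image, blind_image))

    # Hypothetical diagnostic scores and correctness flags for four samples.
    print("accuracy at 50% coverage:",
          selective_accuracy([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0], coverage=0.5))
```

A KL value near zero would mean the answer barely changes when the image is removed, which is the signature of a language shortcut. Selective prediction then trades coverage for accuracy without retraining, matching the abstract's "no additional training cost" claim.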
[37] Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning cs.CVPDF
Yonghan Lee, Dinesh Manocha
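TL;DR: Inst4DGS is an instance-decomposed 4D Gaussian Splatting method. It introduces per-video label-permutation latents and a differentiable Sinkhorn layer to associate inconsistent instance labels across multi-view videos, preserving consistent instance identities in dynamic scenes while delivering high-quality rendering and segmentation.
Details
Motivation: Dynamic 4D Gaussian Splatting has advanced rapidly, but instance-decomposed 4DGS remains underexplored. The main challenge is associating inconsistent instance labels across independently segmented multi-view videos.
Result: Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On Panoptic Studio, it improves PSNR from 26.10 to 28.36 and instance mIoU from 0.6310 to 0.9129 over the strongest baseline.
Insight: Contributions include: 1) learning cross-video instance matches through per-video label-permutation latents and a differentiable Sinkhorn layer, enabling multi-view supervision with stable identities; 2) instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization; 3) explicit label alignment that yields sharp decision boundaries and temporally stable identities without identity drift.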
Abstract: We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
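The Sinkhorn layer mentioned in the Inst4DGS abstract relaxes a hard label permutation into a doubly stochastic matrix that gradients can flow through. Below is a minimal pure-Python sketch of the standard Sinkhorn normalization (alternating row and column normalization of an exponentiated score matrix). The score values, temperature, and iteration count are illustrative assumptions; the paper's actual latents and training loop are not reproduced here.

```python
import math


def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Turn a square score matrix into an approximately doubly stochastic
    matrix by alternately normalizing rows and columns."""
    matrix = [[math.exp(s / temperature) for s in row] for row in scores]
    n_cols = len(matrix[0])
    for _ in range(n_iters):
        # Row normalization: each row sums to 1.
        normalized = []
        for row in matrix:
            row_sum = sum(row)
            normalized.append([v / row_sum for v in row])
        # Column normalization: each column sums to 1.
        col_sums = [sum(row[j] for row in normalized) for j in range(n_cols)]
        matrix = [[row[j] / col_sums[j] for j in range(n_cols)]
                  for row in normalized]
    return matrix


if __name__ == "__main__":
    # Hypothetical matching scores between the instance labels of one video
    # (rows) and a shared set of global instance identities (columns).
    soft_perm = sinkhorn([[3.0, 1.0, 0.0],
                          [0.0, 2.0, 1.0],
                          [1.0, 0.0, 4.0]])
    for row in soft_perm:
        print([round(v, 3) for v in row])
```

Lowering `temperature` pushes the output toward a hard permutation, while a higher value keeps the assignment soft, which is the usual knob when such a layer is trained end to end.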
[38] Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning? cs.CV | cs.AIPDF
Yang Liu, Jiyao Yang, Hongjin Zhao, Xiaoyong Li, Yanzhe Ji
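TL;DR: This paper builds DermCase, a long-context benchmark for evaluating large vision-language models (LVLMs) on diagnostic reasoning for rare skin diseases. It finds significant deficiencies in existing models' diagnosis accuracy, differential diagnosis, and clinical reasoning, and uses instruction-tuning experiments to show that there is room for improvement.
Details
Motivation: Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process. Evaluation of diagnostic reasoning for rare skin diseases remains largely unexplored, so a dedicated dataset and evaluation metrics are needed.
Result: Benchmarking 22 leading LVLMs on DermCase exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Instruction tuning substantially improves performance, whereas Direct Preference Optimization (DPO) yields minimal gains.
Insight: Contributions include DermCase, a long-context multimodal benchmark built from peer-reviewed case reports, and DermLIP-based similarity metrics for assessing differential diagnosis quality. Together they offer a new way to evaluate clinical reasoning and expose the limits of current models on complex medical reasoning.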
Abstract: Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models’ reasoning capabilities.
[39] SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation cs.CVPDF
Leyuan Fang, Zan Mao, Zijing Wang, Yinlong Yan
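TL;DR: This paper proposes SR-Nav, a framework that improves zero-shot object-goal navigation by modeling spatial relationships between the target object and its surroundings. It builds a Dynamic Spatial Relationship Graph (DSRG), uses a relation-aware matching module to make visual perception more robust, and adds a dynamic relationship planning module that shrinks the search space, so the agent navigates unseen environments more efficiently.
Details
Motivation: Existing foundation-model-based methods for zero-shot object-goal navigation reason unreliably in perception and planning under poor viewpoints or weak semantic cues, leading to inefficient or failed navigation. The authors observe that inherent spatial relationships among objects and regions encode structured scene priors that help infer target locations under partial observation.
Result: Experiments on HM3D show state-of-the-art (SOTA) performance in both success rate and navigation efficiency.
Insight: The novelty lies in explicitly modeling and dynamically updating a target-centered spatial relationship graph (DSRG), using relationship matching rather than plain detection to verify and correct perception errors, and dynamically planning optimal paths over the graph to shrink the search space, which improves robustness and efficiency in complex scenes.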
Abstract: Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models’ comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy. Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at https://github.com/Mzyw-1314/SR-Nav
[40] Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching cs.CVPDF
Arushi Rai, Adriana Kovashka
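TL;DR: This paper improves the temporal grounding of Video-LLMs on sports coaching tasks through a self-consistency objective that requires no additional annotations. Building on the observation that related tasks (such as generation and verification) must attend to the same frames, it enforces consistency between the visual attention maps of tightly related tasks.
Details
Motivation: Video-LLMs often attend to irrelevant frames in tasks that need precise temporal grounding, such as sports coaching, while frame-level supervision is both expensive and unreliable to obtain.
Result: Across three sports coaching tasks (Exact, FitnessQA, and ExpertAF), training with the objective yields gains of +3.0% and +14.1% accuracy and +0.9 BERTScore over supervised fine-tuning, even surpassing closed-source models. Ground-truth keyframe annotations from VidDiffBench are used to confirm that attention misallocation is a significant bottleneck.
Insight: The novelty lies in using the inherent consistency between tasks (generation and verification must attend to the same frames) as a self-supervised signal for optimizing attention. This avoids costly frame-level annotation and suggests a new approach to temporal grounding in multi-task learning.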
Abstract: Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.
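One way to picture a self-consistency objective over attention maps is as a divergence between the frame-level attention distributions that two related tasks produce for the same video. The abstract does not state which divergence the authors use, so the Jensen-Shannon divergence below is purely an assumed stand-in, and the attention values and function names are hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over frames.
    Symmetric, zero when p == q, and at most log(2)."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


if __name__ == "__main__":
    # Hypothetical per-frame attention mass for a generation prompt and a
    # verification prompt about the same five-frame clip.
    generation_attention = normalize([0.1, 0.6, 2.0, 0.2, 0.1])
    verification_attention = normalize([1.5, 0.4, 0.5, 0.3, 0.3])
    print("consistency penalty:",
          js_divergence(generation_attention, verification_attention))
```

Minimizing such a penalty would pull the two tasks toward the same frames without telling the model which frames are correct, which is what lets the approach run without frame-level labels.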
[41] Interpretable Prostate Cancer Detection using a Small Cohort of MRI Images cs.CV | cs.AIPDF
Vahid Monfared, Mohammad Hadi Gharib, Ali Sabri, Maryam Shahali, Farid Rashidi
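TL;DR: This paper proposes an interpretable framework for automatic prostate cancer detection using a small dataset of only 162 T2-weighted MRI images. It addresses data scarcity through transfer learning and data augmentation, and comprehensively compares Vision Transformers, CNNs, and classical methods such as HOG+SVM. Transfer-learned ResNet18 performs best, classical handcrafted-feature methods are comparable on the small dataset, and T2-weighted images alone yield competitive results, reducing acquisition and computational cost.
Details
Motivation: Prostate cancer is a leading cause of death in men, yet T2-weighted prostate MRI is hard to interpret because lesions are subtle and heterogeneous. The paper aims to develop an interpretable automatic detection framework that copes with data scarcity in small datasets, and to examine how well different models work with limited data.
Result: Transfer-learned ResNet18 achieves the best performance on the small T2-weighted-only dataset (90.9% accuracy, 95.2% sensitivity, AUC 0.905), while Vision Transformers perform worse despite higher complexity. HOG+SVM reaches comparable accuracy (AUC 0.917). In a reader study of 22 cases, the AI model's sensitivity (95.2%) is well above the mean sensitivity of five radiologists (67.5%), suggesting that AI-assisted screening could reduce missed cancers and improve consistency.
Insight: The novelty lies in reaching competitive performance with only T2-weighted images and a small dataset, which lowers data acquisition and computational complexity. The analysis shows that on small medical imaging datasets, classical handcrafted-feature methods such as HOG+SVM remain competitive, and a lightweight transfer-learned CNN such as ResNet18 outperforms more complex Vision Transformers, offering practical guidance for model selection in resource-limited settings.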
Abstract: Prostate cancer is a leading cause of mortality in men, yet interpretation of T2-weighted prostate MRI remains challenging due to subtle and heterogeneous lesions. We developed an interpretable framework for automatic cancer detection using a small dataset of 162 T2-weighted images (102 cancer, 60 normal), addressing data scarcity through transfer learning and augmentation. We performed a comprehensive comparison of Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Transfer-learned ResNet18 achieved the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters, while Vision Transformers showed lower performance despite substantially higher complexity. Notably, HOG+SVM achieved comparable accuracy (AUC 0.917), highlighting the effectiveness of handcrafted features in small datasets. Unlike state-of-the-art approaches relying on biparametric MRI (T2+DWI) and large cohorts, our method achieves competitive performance using only T2-weighted images, reducing acquisition complexity and computational cost. In a reader study of 22 cases, five radiologists achieved a mean sensitivity of 67.5% (Fleiss Kappa = 0.524), compared to 95.2% for the AI model, suggesting potential for AI-assisted screening to reduce missed cancers and improve consistency. Code and data are publicly available.
[42] MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling cs.CVPDF
Jiyao Liu, Junzhi Ning, Wanying Qu, Lihao Liu, Chenglong Ma
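TL;DR: This paper proposes MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, using structured natural language descriptions to jointly perform medical image quality assessment (Med-IQA) and medical image restoration (Med-IR). It targets the poor generalization of existing methods across modalities and degradation types.
Details
Motivation: Existing Med-IR methods are typically modality-specific or degradation-specific and generalize poorly to the heterogeneous degradations seen in clinical practice. The authors attribute this to restoration lacking an explicit understanding of image quality.
Result: On a large-scale dataset spanning three imaging modalities and five restoration tasks, a single MedQ-UNI model achieves state-of-the-art restoration performance on all tasks without any task-specific adaptation, while also generating better quality descriptions.
Insight: The novelty lies in unifying medical image quality assessment and restoration in one framework, with structured language descriptions acting as the bridge. The restoration model can then perform targeted recovery based on an explicit understanding of quality, which improves cross-task generalization, restoration fidelity, and interpretability.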
Abstract: Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to diverse degradation types across modalities. To address these challenges, we propose MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, explicitly leveraging Med-IQA to guide Med-IR across arbitrary modalities and degradation types. MedQ-UNI adopts a multimodal autoregressive dual-expert architecture with shared attention: a quality assessment expert first identifies degradation issues through structured natural language descriptions, and a restoration expert then conditions on these descriptions to perform targeted image restoration. To support this paradigm, we construct a large-scale dataset of approximately 50K paired samples spanning three imaging modalities and five restoration tasks, each annotated with structured quality descriptions for joint Med-IQA and Med-IR training, along with a 2K-sample benchmark for evaluation. Extensive experiments demonstrate that a single MedQ-UNI model, without any task-specific adaptation, achieves state-of-the-art restoration performance across all tasks while generating superior descriptions, confirming that explicit quality understanding meaningfully improves restoration fidelity and interpretability.
[43] Do Vision Language Models Understand Human Engagement in Games? cs.CV | cs.AI | cs.HCPDF
Ziyi Wang, Qizan Guo, Rishitosh Singh, Xiyang Hu
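TL;DR: This paper evaluates whether vision-language models (VLMs) can infer human engagement from gameplay video. Existing models are weak in the zero-shot setting, and even theory-guided or retrieval-augmented prompting yields limited predictive performance, especially for cross-game prediction and pairwise prediction of engagement change. This reveals a gap between perception and understanding of human psychological states in current VLMs.
Details
Motivation: The study asks whether VLMs can infer latent psychological states of players, such as engagement, from visual cues alone. This matters for game design and player-experience research, but the effectiveness of VLMs on this task is still unclear.
Result: On the GameVibe Few-Shot dataset (nine first-person shooter games), zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Retrieval-augmented prompting improves pointwise prediction in some settings, but pairwise prediction remains difficult across all strategies. Theory-guided prompting does not reliably help and can instead reinforce surface-level shortcuts.
Insight: The novelty lies in a systematic evaluation of multiple prompting strategies (zero-shot, theory-guided, and retrieval-augmented) for cross-game engagement inference. The analysis points to a perception-understanding gap in current VLMs: they can recognize visible gameplay cues but struggle to robustly infer human engagement across games, which is a useful signal for future model design.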
Abstract: Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision–language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception–understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.
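The "per-game majority-class baseline" that zero-shot VLMs often fail to beat is simple enough to write out in full. The sketch below is an illustration under assumed data: the game names, labels, and the train/test split are hypothetical and do not come from GameVibe Few-Shot.

```python
from collections import Counter, defaultdict


def fit_majority_baseline(train_examples):
    """Map each game to its most frequent engagement label.
    `train_examples` is a list of (game, label) pairs."""
    counts = defaultdict(Counter)
    for game, label in train_examples:
        counts[game][label] += 1
    return {game: counter.most_common(1)[0][0]
            for game, counter in counts.items()}


def baseline_accuracy(majority_by_game, test_examples):
    """Fraction of test windows whose label equals the game's majority label."""
    hits = sum(majority_by_game[game] == label for game, label in test_examples)
    return hits / len(test_examples)


if __name__ == "__main__":
    # Hypothetical labeled windows for two games.
    train = [("game_a", "high"), ("game_a", "high"), ("game_a", "low"),
             ("game_b", "low"), ("game_b", "low")]
    test = [("game_a", "high"), ("game_a", "low"),
            ("game_b", "low"), ("game_b", "low")]
    majority = fit_majority_baseline(train)
    print(majority, baseline_accuracy(majority, test))
```

The baseline never looks at the video, so any model that cannot beat it is extracting little engagement signal beyond knowing which game is being played.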
[44] T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World cs.CV | cs.LGPDF
Aditi Naiknaware, Salimeh Sekeh
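TL;DR: This paper proposes T-QPM, a novel two-step framework for strengthening temporal out-of-distribution (OOD) detection and robustness to covariate distribution shift in vision-language models operating in dynamic open-world environments. It extends dual-pattern matching to Temporal Quadruple-Pattern Matching, combining semantic matching and visual typicality through cross-modal consistency patterns and learned lightweight fusion weights, with regularization based on Average Thresholded Confidence (ATC) for stability.
Details
Motivation: Existing multimodal OOD detection methods built on vision-language models such as CLIP typically rely on fixed fusion rules and assume static environments, so they fail under temporal drift and lack robustness to covariate-shifted inputs.
Result: On temporally partitioned benchmarks, the method significantly outperforms static baselines, providing a robust and temporally consistent framework for multimodal OOD detection in dynamic environments.
Insight: The novelty lies in extending dual-pattern matching to temporal quadruple-pattern matching, introducing cross-modal consistency patterns to refine the decision boundary, and learning adaptive fusion weights to handle temporal distribution drift, with ATC regularization keeping the model stable.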
Abstract: Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally consistent framework for multimodal OOD detection in non-stationary environments.
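Average Thresholded Confidence (ATC) is generally described as a way to estimate accuracy on unlabeled data: pick a confidence threshold on labeled held-out data so that the fraction of samples below it matches the observed error rate, then report the fraction of new samples at or above that threshold. The sketch below follows that general idea only; how T-QPM turns ATC into a regularizer is not specified in the abstract, and the function names and toy confidences are assumptions.

```python
def fit_atc_threshold(confidences, correct):
    """Choose a threshold so that the share of held-out confidences below it
    approximates the held-out error rate. `correct` holds 0/1 flags."""
    n = len(confidences)
    n_errors = n - sum(correct)
    ranked = sorted(confidences)
    if n_errors == 0:
        return ranked[0]        # every sample counts as confident
    if n_errors == n:
        return float("inf")     # no sample counts as confident
    # Midpoint between the n_errors-th lowest confidence and the next one.
    return (ranked[n_errors - 1] + ranked[n_errors]) / 2


def estimate_accuracy(confidences, threshold):
    """Estimated accuracy: share of samples with confidence >= threshold."""
    return sum(c >= threshold for c in confidences) / len(confidences)


if __name__ == "__main__":
    # Hypothetical held-out confidences with correctness flags.
    held_out_conf = [0.9, 0.8, 0.4, 0.3]
    held_out_correct = [1, 1, 1, 0]
    threshold = fit_atc_threshold(held_out_conf, held_out_correct)

    # Hypothetical confidences from a later, unlabeled time slice.
    later_conf = [0.2, 0.5, 0.95, 0.1]
    print("threshold:", threshold)
    print("estimated accuracy after drift:",
          estimate_accuracy(later_conf, threshold))
```

Tracking this estimate over successive time slices gives a label-free signal of degradation, which is plausibly what makes it useful as a stability constraint when distributions evolve.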
[45] NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data cs.CVPDF
Daniel DeTone, Federica Bogo, Eric-Tuan Le, Duncan Frost, Julian Straub
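TL;DR: This paper introduces NymeriaPlus, an upgrade of the Nymeria dataset released in 2024. Nymeria is a large-scale, in-the-wild dataset of egocentric human activity captured with multiple synchronized devices. NymeriaPlus adds and improves several annotations and modalities, including more accurate human motion (MHR and SMPL formats), dense 3D and 2D object bounding box annotations, instance-level 3D object reconstructions, and additional modalities (such as basemap recordings, audio, and wristband videos), aiming at a more powerful and comprehensive benchmark dataset.
Details
Motivation: The goal is to strengthen the existing Nymeria dataset by integrating complementary annotations and modalities, filling a gap in current egocentric data resources and supporting a broader range of research, particularly multimodal learning for embodied AI.
Result: The abstract reports no quantitative results or benchmark rankings. The main contribution is the upgraded NymeriaPlus dataset itself.
Insight: The novelty lies in systematically consolidating several high-quality, complementary annotations (improved human motion, dense object annotations, instance-level reconstructions) and additional modalities (audio, video) into a single, coherent egocentric benchmark, providing richer infrastructure for complex multimodal embodied AI tasks.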
Abstract: The Nymeria Dataset, released in 2024, is a large-scale collection of in-the-wild human activities captured with multiple egocentric wearable devices that are spatially localized and temporally synchronized. It provides body-motion ground truth recorded with a motion-capture suit, device trajectories, semi-dense 3D point clouds, and in-context narrations. In this paper, we upgrade Nymeria and introduce NymeriaPlus. NymeriaPlus features: (1) improved human motion in Momentum Human Rig (MHR) and SMPL formats; (2) dense 3D and 2D bounding box annotations for indoor objects and structural elements; (3) instance-level 3D object reconstructions; and (4) additional modalities, e.g., basemap recordings, audio, and wristband videos. By consolidating these complementary modalities and annotations into a single, coherent benchmark, NymeriaPlus strengthens Nymeria into a more powerful in-the-wild egocentric dataset. We expect NymeriaPlus to bridge a key gap in existing egocentric resources and to support a broader range of research, including unique explorations of multimodal learning for embodied AI.
[46] Efficient Video Diffusion with Sparse Information Transmission for Video Compression cs.CV | cs.AIPDF
Mingde Zhou, Zheng Chen, Yulun Zhang
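TL;DR: This paper proposes Diff-SIT, an efficient video diffusion model for video compression at ultra-low bitrates. A Sparse Temporal Encoding Module (STEM) sparsely encodes the original frame sequence into an information-rich intermediate sequence to save bitrate, and a One-Step Video Diffusion with Frame Type Embedder (ODFTE) processes that sequence as a whole and reconstructs it adaptively, improving perceptual quality and temporal consistency.
Details
Motivation: Traditional end-to-end compression models produce blurry images with poor perceptual quality at ultra-low bitrates, while existing generative compression methods often treat video frames independently and fall short in temporal coherence and efficiency.
Result: Extensive experiments on multiple datasets show that Diff-SIT sets a new state of the art (SOTA) in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime.
Insight: The novelty lies in combining sparse encoding with a diffusion model: STEM compresses information efficiently, and the Frame Type Embedder (FTE) inside ODFTE drives adaptive reconstruction by frame type, exploiting temporal correlation to improve the overall quality and coherence of the compressed video.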
Abstract: Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.
[47] HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection cs.CVPDF
Teerapong Panboonyuen
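TL;DR: This paper proposes HOMEY, a framework that combines a YOLO detector, a heuristic object masking mechanism, and a custom loss function to automatically detect 17 classes of property-insurance risk, such as structural damage, maintenance neglect, and safety hazards, with the aim of improving detection accuracy and reliability in cluttered backgrounds.
Details
Motivation: Automated risk detection for property insurance is a high-impact but underexplored problem; solving it would support real estate, underwriting, and insurance operations.
Result: Experiments on real-world property imagery show that HOMEY achieves better detection accuracy and reliability than baseline YOLO models while retaining fast inference.
Insight: Contributions include heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting, laying a foundation for interpretable, cost-efficient risk analysis in scalable AI-driven insurance workflows.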
Abstract: Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.
[48] From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions cs.CVPDF
Jingzhi Chen, Lijian Xu
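TL;DR: This review systematically surveys the paradigm shift of artificial intelligence in protein science, from static structure prediction to modeling dynamic conformational ensembles and complex biomolecular interactions. It is organized along five interconnected dimensions: unified multimodal representations, refinement of static prediction, generative frameworks, prediction of heterogeneous interactions, and functional inference, and it analyzes current bottlenecks and future directions.
Details
Motivation: The paper sets out to describe and organize the fundamental shift in AI-driven protein research from static structure prediction toward dynamic, generative, and multimodal modeling, addressing the limitations of traditional methods in capturing protein dynamics, complex interactions, and functional inference.
Result: As a review, the paper proposes no new model or method and therefore reports no quantitative results on specific benchmarks. It systematically analyzes and summarizes current progress, bottlenecks, and future trends in the field.
Insight: The core contribution is an evolutionary framing from "snapshots" (static structures) to "symphonies" (dynamics and interactions), with five key dimensions for examining the paradigm shift. Notable takeaways are its emphasis on unified multimodal representations, on generative models that capture thermodynamically consistent conformational distributions, and its forward-looking view of AI as a universal simulator capable of understanding and rewriting the dynamic language of life.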
Abstract: The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI-driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA-free architectures and all-atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein-ligand, protein-nucleic acid, and protein-protein complexes; and functional inference of fitness landscapes, mutational effects, and text-guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. This methodological transformation marks artificial intelligence’s transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.
[49] Foundations and Architectures of Artificial Intelligence for Motor Insurance cs.CV | cs.AIPDF
Teerapong Panboonyuen
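TL;DR: This handbook systematically presents the foundations and architectures of artificial intelligence for motor insurance. Drawing on large-scale real-world deployment, it proposes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into an intelligence stack for vehicle risk assessment and claims processing.
Details
Motivation: The handbook addresses how to translate modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments, with end-to-end automation of motor insurance as the target domain.
Result: The methods in the handbook are composed into a scalable pipeline that operates under the practical constraints of nationwide motor insurance systems in Thailand, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows.
Insight: The novelty lies in the vertically integrated AI paradigm and in domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence. The handbook also stresses the co-evolution of learning algorithms and MLOps practices, offering a principled framework for deploying AI systems in high-stakes industrial environments.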
Abstract: This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.
[50] OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting cs.CVPDF
Hongjia Zhai, Qi Zhang, Xiaokun Pan, Xiyu Zhang, Yitong Dong
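TL;DR: OnlinePG is an online open-vocabulary panoptic mapping system. It uses 3D Gaussian Splatting to integrate geometric reconstruction and open-vocabulary perception in an online setting, enabling real-time, efficient scene understanding.
Details
Motivation: Existing methods are mostly offline or lack instance-level understanding, which limits their use in real-world robotic tasks. An online open-vocabulary panoptic mapping method with instance-level understanding is therefore needed.
Result: Extensive experiments on widely used datasets show that the method achieves better performance among online approaches while maintaining real-time efficiency.
Insight: Contributions include an efficient local-to-global paradigm with a sliding window for online panoptic mapping; a 3D segment clustering graph that combines geometric and semantic cues to fuse inconsistent segments; robust bidirectional bipartite 3D Gaussian instance matching to fuse local maps into the global map; and open-vocabulary scene understanding from fused VLM features inside 3D spatial attribute grids.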
Abstract: Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a locally consistent map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.
[51] Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models cs.CV | cs.AIPDF
Liwei Che, Zhiyu Xue, Yihao Quan, Benlin Liu, Zeru Shi
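TL;DR: This paper studies counting mechanisms in Large Vision-Language Models (LVLMs). Combining synthetic and real-world datasets with mechanistic interpretability methods, it shows that LVLMs count in a human-like way: precisely for small quantities and by estimation for large ones. The authors propose two new interpretability methods (Visual Activation Patching and HeadLens) and find a structured “counting circuit” shared across multiple visual reasoning tasks. Building on this, they propose a lightweight intervention that fine-tunes pretrained LVLMs on counting using only synthetic images. It improves counting accuracy and also generalizes to out-of-distribution counting and complex visual reasoning tasks.
Details
Motivation: Counting is a simple but effective test of LVLM reasoning: the model must identify each object and then add them up. The paper investigates how LVLMs implement counting and aims to understand the internal mechanism behind it.
Result: On Qwen2.5-VL, the proposed intervention yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks.
Insight: The novelty lies in two new mechanistic interpretability methods (Visual Activation Patching and HeadLens) that reveal a “counting circuit” shared across tasks, and in showing that targeted enhancement of the counting mechanism (fine-tuning on synthetic data only) can improve overall visual reasoning, which offers an efficient path to model improvement.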
Abstract: Counting serves as a simple but powerful test of a Large Vision-Language Model’s (LVLM’s) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured “counting circuit” that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.
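For readers unfamiliar with activation patching, the general idea can be shown on a toy network. This is a sketch of generic activation patching, not the paper's Visual Activation Patching or HeadLens; the two-layer network, its random weights, and the inputs are all hypothetical. An activation from a "clean" run is copied into a "corrupted" run, and the change in the output measures how much that unit matters.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 6))  # input -> hidden weights (toy model)
W2 = rng.normal(size=(6, 1))  # hidden -> scalar output weights

def forward(x, patch=None):
    """Run the toy network; `patch` maps hidden-unit index -> value to overwrite."""
    h = np.tanh(x @ W1)
    if patch:
        h = h.copy()
        for idx, value in patch.items():
            h[idx] = value
    return float((h @ W2)[0]), h

clean_x = rng.normal(size=4)
corrupt_x = rng.normal(size=4)
clean_out, clean_h = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch one hidden unit at a time from the clean run into the corrupted run.
effects = []
for i in range(6):
    patched_out, _ = forward(corrupt_x, patch={i: clean_h[i]})
    effects.append(patched_out - corrupt_out)

# Patching every hidden unit at once recovers the clean output.
full_out, _ = forward(corrupt_x, patch={i: clean_h[i] for i in range(6)})
```

Units with large entries in `effects` are candidates for a circuit. Because this toy readout is linear, the per-unit effects add up exactly to the clean-versus-corrupted gap, which does not generally hold in a real LVLM.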
[52] 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model cs.CVPDF
Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park
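TL;DR: This paper proposes the 3DreamBooth framework for generating high-fidelity, view-consistent, dynamic videos of customized 3D subjects from one or more reference images. By decoupling spatial geometry from temporal motion and introducing the 3Dapter visual conditioning module, it achieves 3D-aware video customization without large amounts of multi-view video data.
Details
Motivation: Existing subject-driven video generation methods mostly treat subjects as 2D entities and lack the comprehensive spatial priors needed to reconstruct 3D geometry, so they cannot preserve the true 3D identity when synthesizing novel views. In addition, because multi-view video data is scarce, directly fine-tuning a model tends to cause temporal overfitting.
Result: The method is reported to achieve high-fidelity, view-consistent 3D video generation on multiple benchmarks, outperforming existing 2D methods in both qualitative and quantitative evaluation and reaching a new state of the art.
Insight: The core innovations are a 1-frame optimization paradigm that decouples space from time, and the 3Dapter module trained under an asymmetrical conditioning strategy. 3Dapter acts as a dynamic selective router that queries view-specific geometric hints from a minimal reference set, which injects 3D priors efficiently and improves texture detail.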
Abstract: Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
[53] CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models cs.CV | cs.AIPDF
Xiang Chen, Fangfang Yang, Chunlei Meng, Chengyin Hu, Ang Li
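TL;DR: This paper proposes CoDA (Chain-of-Distribution Attacks), a framework for evaluating the robustness threats that medical vision-language models (MVLMs) face in clinical workflows. CoDA composes simulated distribution shifts from image acquisition, reconstruction, display, and delivery to produce medical images that look plausible yet cause model failures. The study finds that chained attacks are more damaging than single-stage attacks and substantially degrade the zero-shot performance of CLIP-style MVLMs. It also evaluates multimodal large language models (MLLMs) as image-authenticity auditors and finds degraded reliability and high-confidence errors. Finally, it proposes a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment to improve robustness to CoDA attacks.
Details
Motivation: The reliability of MVLMs in clinical workflows remains underexplored. Existing robustness evaluations usually assume clean inputs or study isolated corruptions, overlooking the distribution shifts introduced by routine clinical image-processing steps (acquisition, reconstruction, display, delivery). These shifts can preserve readability while changing image statistics, which threatens model performance.
Result: On brain MRI, chest X-ray, and abdominal CT datasets, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, and chained compositions are more damaging than any single stage. When MLLMs are evaluated as image-authenticity auditors, proprietary models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs tested show clear deficiencies in medical image quality auditing. The proposed post-hoc repair strategy, which uses teacher-guided token-space adaptation and patch-level alignment, improves accuracy on archived CoDA outputs.
Insight: The novelty lies in CoDA, a clinically plausible chain-of-distribution attack framework that systematically simulates compound perturbations along the medical imaging pipeline, and in a lightweight post-hoc repair strategy that adapts in token space through teacher guidance and patch alignment to improve deployment robustness. Viewed objectively, the study underlines the importance of evaluating models under realistic clinical distribution shifts and offers reusable attack and defense methods for strengthening medical multimodal models.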
Abstract: Medical vision–language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.
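The idea of chaining pipeline-like shifts under a similarity constraint can be sketched as follows. This is an illustrative toy, not CoDA itself: the three stage functions, their parameters, and the single-window (global) SSIM are stand-ins chosen for brevity, and nothing here optimizes the chain to induce model failures as the paper describes.

```python
import numpy as np

def shading(img, strength=0.2):
    """Acquisition-like shading: a smooth left-to-right intensity ramp."""
    ramp = np.linspace(1 - strength, 1 + strength, img.shape[1])[None, :]
    return np.clip(img * ramp, 0.0, 1.0)

def gamma_remap(img, gamma=1.3):
    """Display-like remapping of intensities."""
    return np.clip(img, 0.0, 1.0) ** gamma

def quantize(img, levels=32):
    """Export-like degradation: reduce the number of gray levels."""
    return np.round(img * (levels - 1)) / (levels - 1)

def global_ssim(x, y, mask=None, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM, optionally restricted to a boolean mask."""
    if mask is not None:
        x, y = x[mask], y[mask]
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def apply_chain(img, stages):
    """Apply the stages in order, feeding each output into the next stage."""
    out = img
    for stage in stages:
        out = stage(out)
    return out

rng = np.random.default_rng(0)
img = rng.random((16, 16))                      # stand-in for a normalized scan
shifted = apply_chain(img, [shading, gamma_remap, quantize])
score = global_ssim(img, shifted)
accepted = score >= 0.5                         # hypothetical plausibility threshold
```

A search procedure would vary the stage order and parameters, keep only candidates that pass the similarity check, and evaluate the downstream model on those candidates.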
[54] HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering cs.CV | cs.AIPDF
Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
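TL;DR: HiMu is a training-free hierarchical multimodal frame selection framework for long video question answering. A single text-only LLM call decomposes the query into a hierarchical logic tree, lightweight expert models handle the visual and audio modalities, and fuzzy-logic operators compose their signals into a continuous satisfaction curve from which key frames are selected efficiently.
Details
Motivation: Long video question answering requires reasoning over long temporal contexts, and existing methods trade efficiency against accuracy: similarity-based selectors are fast but lose sub-event ordering and cross-modal bindings, while agent-based methods recover this structure at prohibitive computational cost. HiMu aims to bridge this gap.
Result: On Video-MME, LongVideoBench, and HERBench-Lite, HiMu advances the efficiency-accuracy Pareto front. With Qwen3-VL 8B at 16 frames it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
Insight: The contributions are: (1) query decomposition with a text-only LLM, which avoids expensive multimodal inference; (2) a hierarchical logic tree that preserves event ordering and cross-modal relations; (3) lightweight experts combined through fuzzy-logic composition for efficient multimodal alignment; and (4) a training-free design that is easy to deploy and extend.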
Abstract: Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
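The composition step described in the abstract (normalize, smooth, then combine bottom-up with fuzzy operators) can be sketched in a few lines. The operator definitions below are common fuzzy-logic choices (min for AND, max for OR, a running maximum for "then") and are assumptions, not HiMu's actual operators; the per-frame scores are made-up stand-ins for expert outputs.

```python
import numpy as np

def normalize(scores):
    """Min-max normalize per-frame expert scores to [0, 1]."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def smooth(scores, width=3):
    """Moving average so that signals from different modalities overlap in time."""
    return np.convolve(scores, np.ones(width) / width, mode="same")

def fuzzy_and(a, b):
    return np.minimum(a, b)

def fuzzy_or(a, b):
    return np.maximum(a, b)

def then(a, b):
    """High where b holds now and a held at the same or an earlier frame."""
    return np.minimum(np.maximum.accumulate(a), b)

# Hypothetical expert outputs over 10 frames for the query
# "what happens after the dog appears and a bark is heard?"
dog_visible = smooth(normalize([0, 0, 5, 9, 8, 1, 0, 0, 0, 0]))  # e.g. a detector
bark_heard = smooth(normalize([0, 0, 0, 0, 0, 0, 7, 9, 2, 0]))   # e.g. an audio model

curve = then(dog_visible, bark_heard)           # continuous satisfaction curve
top_frames = sorted(np.argsort(curve)[::-1][:2].tolist())
```

Here the curve peaks at the frames where the bark follows the dog's appearance, so those frames would be passed to the LVLM.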
[55] HAViT: Historical Attention Vision Transformer cs.CVPDF
Swarnendu Banik, Manish Das, Shiv Ram Dubey, Satish Kumar Singh
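TL;DR: This paper proposes HAViT (Historical Attention Vision Transformer), which strengthens information flow and feature learning in Vision Transformers by propagating historical attention matrices across layers. The method stores and blends historical attention within the encoder layers, progressively refining attention patterns. It adds only attention-matrix storage and blending operations, so the architectural change is minimal. On CIFAR-100 and TinyImageNet, it consistently improves the accuracy of ViT and related models.
Details
Motivation: In existing Vision Transformers, the attention mechanism operates independently in each layer, which limits inter-layer information flow and feature learning. A method that integrates attention information across layers is needed to improve the Transformer's internal information flow.
Result: On CIFAR-100, ViT accuracy rises from 75.74% to 77.07% (+1.33%); on TinyImageNet, from 57.82% to 59.07% (+1.25%). Cross-architecture validation shows similar gains for other transformer variants, with CaiT improving by 1.01%. Systematic analysis identifies a historical-attention blending hyperparameter of α = 0.45 as optimal, and random initialization outperforms zero initialization.
Insight: The novelty is a lightweight cross-layer attention propagation mechanism: by retaining and blending historical attention matrices, attention patterns are refined progressively, which improves feature acquisition and optimization dynamics. Viewed objectively, the method improves information flow systematically at very small computational overhead, and its findings on hyperparameter robustness and initialization strategy are of general reference value.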
Abstract: Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.
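A minimal sketch of cross-layer attention blending follows. The abstract does not give the exact update rule, so this assumes a convex combination applied after the softmax, `(1 - alpha) * current + alpha * history`, with the blended map carried forward as the next layer's history. The random initial history mirrors the reported preference for random over zero initialization. Function names and shapes are hypothetical; the authors' repository is the authoritative reference.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blended_attention_stack(q_list, k_list, alpha=0.45, seed=0):
    """Return one blended attention map per layer (assumed blending rule)."""
    rng = np.random.default_rng(seed)
    n = q_list[0].shape[0]
    # Randomly initialized, row-normalized history (rather than all zeros).
    history = softmax(rng.normal(size=(n, n)))
    blended_maps = []
    for q, k in zip(q_list, k_list):
        current = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        blended = (1 - alpha) * current + alpha * history
        blended_maps.append(blended)
        history = blended  # carry the blended map forward to the next layer
    return blended_maps

rng = np.random.default_rng(1)
num_layers, tokens, dim = 3, 5, 8
q_list = [rng.normal(size=(tokens, dim)) for _ in range(num_layers)]
k_list = [rng.normal(size=(tokens, dim)) for _ in range(num_layers)]
maps = blended_attention_stack(q_list, k_list)
```

Because both inputs to the blend are row-stochastic, each blended map is still a valid attention distribution, so it can multiply the value matrix without renormalization.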
[56] Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness cs.CVPDF
Lu Yu, Haiyang Zhang, Changsheng Xu
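TL;DR: This paper addresses the weak adversarial robustness of the pre-trained vision-language model CLIP in the zero-shot setting and proposes an enhancement strategy based on text-guided attention. It first introduces the TGA-ZSR framework, which comprises a Local Attention Refinement Module and a Global Attention Constraint Module, to preserve performance on clean samples while improving adversarial robustness. Because TGA-ZSR may attend to irrelevant features, the paper then proposes Comp-TGA, which integrates two complementary attention mechanisms, foreground attention guided by the class prompt and reversed attention driven by the non-class prompt, to obtain a more comprehensive and accurate foreground representation.
Details
Motivation: Pre-trained vision-language models such as CLIP have strong zero-shot capabilities but have been found vulnerable to adversarial examples. Through experimental analysis, the authors observe that adversarial perturbations shift text-guided attention. They therefore aim to strengthen CLIP's adversarial robustness while preserving its generalization.
Result: Across 16 datasets, TGA-ZSR and Comp-TGA improve zero-shot robust accuracy over the current state of the art by 9.58% and 11.95%, respectively, setting a new state of the art.
Insight: The core innovation is using text-guided attention as a supervisory signal for adversarial robustness. In particular, the complementary text-guided attention mechanism (Comp-TGA) combines forward (class-prompt) and reversed (non-class-prompt) attention to capture foreground features more completely and reduce attention to spurious features, which is a novel and effective form of attention regularization.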
Abstract: Owing to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements, respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.
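The two complementary attention maps can be illustrated with a toy computation. This sketch assumes text-guided attention is a min-max normalized cosine similarity between patch features and a text embedding, and that the two maps are simply averaged. The abstract specifies neither choice, so both are assumptions, and the 2-D features are invented for readability.

```python
import numpy as np

def text_guided_attention(patch_feats, text_emb):
    """Cosine similarity of each patch to a text embedding, rescaled to [0, 1]."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = p @ t
    span = sim.max() - sim.min()
    return (sim - sim.min()) / span if span > 0 else np.zeros_like(sim)

def complementary_attention(patch_feats, class_emb, non_class_emb):
    """Average class-prompt attention with reversed non-class-prompt attention."""
    foreground = text_guided_attention(patch_feats, class_emb)
    reversed_bg = 1.0 - text_guided_attention(patch_feats, non_class_emb)
    return 0.5 * (foreground + reversed_bg)

# Hypothetical features: patches 0-1 resemble the class, patches 2-3 the background.
patch_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
class_emb = np.array([1.0, 0.0])        # stand-in for a class-prompt embedding
non_class_emb = np.array([0.0, 1.0])    # stand-in for a non-class-prompt embedding

attention = complementary_attention(patch_feats, class_emb, non_class_emb)
```

A patch ranks highly only when it matches the class prompt and also fails to match the non-class prompt, which is the intuition behind combining the two signals.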
[57] Improving Joint Audio-Video Generation with Cross-Modal Context Learning cs.CVPDF
Bingqi Ma, Linlong Lang, Ming Zhang, Dailan He, Xingtong Ge
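TL;DR: This paper proposes Cross-Modal Context Learning (CCL) to improve joint audio-video generation models built on the dual-stream transformer architecture. CCL introduces Temporally Aligned RoPE and Partitioning (TARP), Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) within a Cross-Modal Context Attention (CCA) module, and Unconditional Context Guidance (UCG) at inference time. Together these address problems in cross-modal interaction, background bias, train-inference consistency, and multi-condition conflicts, producing higher-quality, temporally synchronized audio-video content with fewer resources.
Details
Motivation: Mainstream dual-stream transformer methods for joint audio-video generation have several key limitations: model manifold variations caused by the gating mechanism, biases in multi-modal background regions introduced by cross-modal attention, inconsistencies in multi-modal classifier-free guidance (CFG) between training and inference, and conflicts between multiple conditions. The paper aims to alleviate these issues with the CCL framework.
Result: In comprehensive evaluations, CCL achieves state-of-the-art (SOTA) performance compared with recent academic methods while requiring substantially fewer computational resources.
Insight: The core contribution is systematically identifying and resolving several concrete bottlenecks in dual-stream transformer audio-video generation. Reusable techniques include: (1) TARP, which improves temporal alignment through a modified positional encoding; (2) LCT and DCR in the CCA module, which provide stable unconditional anchors for cross-modal information and route dynamically, improving convergence speed and generation quality; and (3) UCG, which uses the unconditional support provided by LCT to coordinate different forms of CFG, improving train-inference consistency and alleviating condition conflicts. Together these modules form a more robust and efficient cross-modal generation framework.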
Abstract: The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model’s convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.
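One way to read "temporally aligned RoPE" is that audio and video tokens receive rotary positions derived from their timestamps rather than from their index in the sequence, so tokens that co-occur in time share the same rotation. The sketch below illustrates that reading only: the token rates, the use of seconds as the position unit, and the function names are assumptions, and the partitioning part of TARP is not modeled.

```python
import numpy as np

def time_positions(num_tokens, tokens_per_second):
    """Position of each latent token in seconds (hypothetical convention)."""
    return np.arange(num_tokens) / tokens_per_second

def rope(x, positions, base=10000.0):
    """Standard rotary embedding applied to consecutive feature pairs."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

video_pos = time_positions(8, tokens_per_second=4)    # 2 s of video latents
audio_pos = time_positions(16, tokens_per_second=8)   # 2 s of audio latents

rng = np.random.default_rng(0)
v = rng.normal(size=8)
video_q = rope(np.tile(v, (8, 1)), video_pos)
audio_k = rope(np.tile(v, (16, 1)), audio_pos)

# Video token 2 and audio token 4 both sit at t = 0.5 s.
aligned_score = float(video_q[2] @ audio_k[4])
```

For tokens at the same timestamp the rotations cancel in the dot product, so the cross-modal attention logit depends only on content. For other pairs it depends on the time difference rather than on the mismatched token indices.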
[58] GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection? cs.CVPDF
Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui
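TL;DR: This paper proposes GenVideoLens, a benchmark for evaluating the fine-grained capabilities of Large Vision-Language Models (LVLMs) in AI-generated video detection. It contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions. An evaluation of 11 representative LVLMs reveals marked weaknesses in optical consistency, physical interactions, and temporal-causal reasoning.
Details
Motivation: Existing evaluations treat AI-generated video detection as binary classification and rely on coarse-grained metrics such as overall accuracy, which cannot show on which specific dimensions LVLMs succeed or fail. A fine-grained diagnostic benchmark is therefore needed.
Result: Eleven representative LVLMs were evaluated on GenVideoLens. The results show a pronounced dimensional imbalance: models perform relatively well on perceptual cues but poorly on optical consistency, physical interactions, and temporal-causal reasoning. Smaller open-source models sometimes outperform stronger proprietary models on specific authenticity cues. Temporal perturbation experiments show that current LVLMs make limited use of temporal information.
Insight: The novelty is a multi-dimensional, fine-grained benchmark covering perceptual, optical, physical, and temporal cues that can diagnose specific LVLM capability gaps. Viewed objectively, the study reveals systematic weaknesses of LVLMs in video understanding, especially in spatio-temporal consistency and causal reasoning, and gives clear direction for improving future detection systems.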
Abstract: In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.
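The difference between overall accuracy and dimension-wise evaluation is easy to see in code. The record layout below (`dimension`, `prediction`, `label`) and the example data are hypothetical and do not reflect the benchmark's actual file format.

```python
from collections import defaultdict

def per_dimension_accuracy(records):
    """Accuracy of real/fake predictions grouped by authenticity dimension."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += int(r["prediction"] == r["label"])
    return {dim: hits[dim] / totals[dim] for dim in totals}

# Made-up predictions from one model on eight annotated items.
records = [
    {"dimension": "perceptual", "prediction": "fake", "label": "fake"},
    {"dimension": "perceptual", "prediction": "fake", "label": "fake"},
    {"dimension": "perceptual", "prediction": "real", "label": "real"},
    {"dimension": "perceptual", "prediction": "real", "label": "fake"},
    {"dimension": "temporal", "prediction": "real", "label": "fake"},
    {"dimension": "temporal", "prediction": "real", "label": "fake"},
    {"dimension": "temporal", "prediction": "fake", "label": "fake"},
    {"dimension": "temporal", "prediction": "real", "label": "fake"},
]

by_dimension = per_dimension_accuracy(records)
overall = sum(r["prediction"] == r["label"] for r in records) / len(records)
```

In this toy data the overall accuracy of 0.5 hides the gap between 0.75 on perceptual cues and 0.25 on temporal cues, which is the kind of imbalance the benchmark is designed to expose.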
[59] SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery cs.CV | cs.LGPDF
Rong Fu, Jiekai Wu, Haiyun Wei, Xiaowen Ma, Shiyin Lin
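TL;DR: SwiftGS is a meta-learned system for fast, large-scale 3D reconstruction from multi-date satellite imagery. In a single forward pass it predicts geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, and it uses episodic training to capture transferable priors in place of expensive per-scene optimization. Combining a differentiable physics graph, spatial gating, and semantic-geometric fusion, it achieves accurate digital surface model reconstruction and view-consistent rendering at inference without fine-tuning, at significantly reduced computational cost.
Details
Motivation: Fast, large-scale 3D reconstruction from multi-date satellite imagery has remained difficult because of illumination changes, sensor heterogeneity, and the high cost of per-scene optimization.
Result: At inference the method operates zero-shot (with optional compact calibration) and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost. Ablations highlight the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.
Insight: The main contributions are: (1) a hybrid representation of geometry-radiation-decoupled Gaussian primitives and a lightweight SDF; (2) episodic meta-training that captures transferable priors and replaces per-scene optimization; (3) a differentiable physics graph that models projection, illumination, and sensor response; (4) spatial gating that blends sparse Gaussian detail with global SDF structure; and (5) a loss that integrates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision.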
Abstract: Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.
[60] Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering cs.CVPDF
Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu
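TL;DR: This paper proposes SVOO, a training-free sparse attention framework for accelerating inference in video-generation Diffusion Transformers (DiTs). Through a two-stage approach, offline layer-wise sparsity profiling followed by online bidirectional co-clustering, it addresses the layer heterogeneity and query-key coupling that existing methods ignore, achieving substantial speedup while preserving generation quality.
Details
Motivation: Existing training-free sparse attention methods for video generation have two unresolved limitations: they ignore layer heterogeneity in attention pruning, and they ignore query-key coupling in block partitioning. Both hinder a better quality-speedup trade-off.
Result: Extensive experiments on seven widely used video generation models show that SVOO achieves a better quality-speedup trade-off than state-of-the-art methods, delivering up to 1.93x speedup on the Wan2.1 model while maintaining a peak signal-to-noise ratio (PSNR) of up to 29 dB.
Insight: The core insight is that each layer's attention sparsity is an intrinsic property that varies little across inputs. On this basis the authors design offline layer-wise sensitivity profiling and an online bidirectional co-clustering algorithm. Viewed objectively, moving sparsity analysis from online computation to an offline stage and using co-clustering to optimize block partitioning is an efficient system-level optimization that needs no retraining.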
Abstract: Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
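The offline stage can be pictured as measuring, once per layer, how much of the attention map must be kept. The sketch below uses attention-mass coverage as a stand-in sparsity measure, whereas the paper describes sensitivity profiling and gives no formula in the abstract. The layer names, the 0.95 mass target, and the calibration maps are hypothetical.

```python
import numpy as np

def keep_ratio(attn, mass=0.95):
    """Average fraction of keys each query needs to cover `mass` of its attention."""
    sorted_desc = np.sort(attn, axis=-1)[:, ::-1]
    cumulative = np.cumsum(sorted_desc, axis=-1)
    needed = (cumulative < mass).sum(axis=-1) + 1
    needed = np.minimum(needed, attn.shape[-1])
    return float(needed.mean() / attn.shape[-1])

def profile_layers(attn_by_layer, mass=0.95):
    """Average the keep ratio of each layer over a set of calibration inputs."""
    return {name: float(np.mean([keep_ratio(a, mass) for a in maps]))
            for name, maps in attn_by_layer.items()}

# Two toy layers, each observed on two calibration prompts (4 queries x 10 keys).
dense_map = np.full((4, 10), 0.1)                                    # uniform attention
sparse_map = np.tile(np.r_[0.96, np.full(9, 0.04 / 9)], (4, 1))      # one dominant key

profile = profile_layers({
    "layer_dense": [dense_map, dense_map],
    "layer_sparse": [sparse_map, sparse_map],
})
```

If the per-layer ratios are stable across prompts, as the paper's insight suggests, they can be computed once offline and reused as per-layer pruning budgets at inference time.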
[61] PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance cs.CVPDF
Cong Wang, Hanxin Zhu, Xiao Tang, Jiayi Luo, Xin Jin
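TL;DR: This paper proposes PhysVideo, a two-stage video generation framework that addresses the lack of physical consistency in existing video generation methods. It first generates physics-aware foreground videos from orthogonal viewpoints, then uses them as guidance to synthesize full videos with background.
Details
Motivation: Current video generation methods have advanced in visual fidelity but struggle to guarantee physically consistent motion, because real-world object motion unfolds in 3D space while video observations are only view-dependent 2D projections.
Result: Extensive experiments with the constructed PhysMV dataset (40K scenes, 160K video sequences in total) show that PhysVideo significantly outperforms existing video generation methods in physical realism and spatio-temporal coherence.
Insight: The novelty is introducing 3D geometric and physical constraints through two-stage generation (physics-aware orthogonal foreground generation followed by guided synthesis). Physics-aware attention, geometry-enhanced cross-view attention, and temporal attention strengthen spatio-temporal consistency and thereby improve the physical plausibility of the generated videos.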
Abstract: Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Home page: https://anonymous.4open.science/w/Phys4D/.
[62] Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA cs.CVPDF
Ruizhi Yu, Keyang Zhong, Peng Liu, Qi Wu, Haoran Zhang
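TL;DR: This paper proposes Click-to-Ask, an AI assistant for live streaming commerce with two complementary modules. The offline module processes multimodal product information and produces structured data and compliant promotional copy. The online module lets streamers click on viewer questions during a broadcast and answers in real time, drawing on the structured information from the offline module and an event-level historical memory maintained in a streaming architecture. The system aims to cut promotional preparation time, increase content engagement, and improve the effectiveness of live streaming commerce.
Details
Motivation: Streamers promoting products in live streaming commerce face inefficient preparation and inconvenient interaction. The goal is an AI assistant that simplifies promotional preparation and enables high-quality, real-time interaction with viewers.
Result: On a collected dataset of TikTok live stream frames, the method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, indicating considerable potential for practical use.
Insight: The novelty is the explicit split of the live-commerce assistant into offline multimodal information processing and copy generation, and an online click-based interactive QA mechanism that combines structured product data with streaming event memory. This complementary architecture improves the system's overall practicality and responsiveness.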
Abstract: Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: https://www.youtube.com/shorts/mWIXK-SWhiE.
[63] Multimodal Model for Computational Pathology: Representation Learning and Image Compression cs.CVPDF
Peihang Wu, Zehong Chen, Lijian Xu
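TL;DR: This survey reviews recent advances in multimodal models for computational pathology, focusing on representation learning and image compression for whole slide images (WSIs). It systematically analyzes four research directions: self-supervised representation learning and structure-aware token compression for WSIs; multimodal data generation and augmentation; parameter-efficient adaptation and reasoning-enhanced few-shot learning; and multi-agent collaborative reasoning for trustworthy diagnosis.
Details
Motivation: Whole slide imaging (WSI) has advanced digital pathology, but challenges remain: computational hurdles (ultra-high resolution makes visual learning difficult), limited expert annotations, the difficulty of integrating multimodal information while preserving biological interpretability, and the opacity of modeling ultra-long visual sequences, which hinders clinical transparency.
Result: As a survey, the paper reports no quantitative experimental results. It summarizes progress in cross-scale modeling and in simulating a pathologist's “Chain of Thought” across magnifications for uncertainty-aware evidence fusion.
Insight: The paper emphasizes how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's diagnostic reasoning. Viewed objectively, the unified multimodal framework it advocates, integrating high-resolution visual data with clinical and biomedical knowledge, is a key direction for interpretable and safe AI-assisted diagnosis.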
Abstract: Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist’s “Chain of Thought” across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.
[64] EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation cs.CVPDF
Longfei Liu, Yongjie Hou, Yang Li, Qirui Wang, Youyang Sha
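TL;DR: This paper proposes EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on the ECDet detection model. Through task-specialized distillation and an edge-friendly encoder-decoder design, it addresses the insufficient task-specific representation learning of compact ViTs on edge devices and balances high accuracy with high efficiency on object detection, instance segmentation, and pose estimation.
Details
Motivation: Deploying high-performance dense prediction models on resource-constrained edge devices is challenging because of strict limits on computation and memory. In practice, lightweight systems are still dominated by CNN-based architectures such as YOLO, while compact ViTs struggle to reach a similarly strong accuracy-efficiency trade-off even with large-scale pretraining. The authors argue that this gap stems mainly from insufficient task-specific representation learning in small-scale ViTs rather than from an inherent mismatch between ViTs and edge dense prediction.
Result: On COCO, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg performs comparably to RF-DETR with substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP), which relies on extensive Objects365 pretraining. These results show that compact ViTs combined with task-specialized distillation and edge-aware design are practical and competitive for edge dense prediction.
Insight: The novelty is strengthening the task-specific representation learning of compact ViTs through task-specialized distillation rather than generic pretraining, combined with an edge-friendly encoder-decoder design. This closes the performance gap between compact ViTs and CNNs on edge dense prediction and offers a new approach to deploying efficient ViT models on resource-constrained devices.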
Abstract: Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency tradeoff, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter’s reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/
[65] 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models cs.CVPDF
Rundong Su, Jintao Zhang, Zhihang Yuan, Haojie Duanmu, Jianfei Chen
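TL;DR: This paper proposes 6Bit-Diffusion, an inference-time mixed-precision quantization framework for video diffusion models. A lightweight predictor dynamically allocates NVFP4 and INT8 precision, and a Temporal Delta Cache exploits the temporal consistency of the residual between a Transformer block's input and output to skip computation. Together they significantly reduce memory usage and computational cost while preserving generation quality.
Details
Motivation: Diffusion Transformers excel at video generation, but their high memory footprint and computational cost severely limit practical deployment. Existing quantization methods typically use a static bit-width allocation and overlook how the quantization difficulty of activations varies across diffusion timesteps, which leads to a poor trade-off between efficiency and quality.
Result: Extensive experiments show that the method achieves 1.92x end-to-end acceleration and 3.32x memory reduction, setting a new baseline for efficient inference in video DiTs.
Insight: The novelty is the finding of a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers, on which the dynamic precision allocation strategy is built. The authors also observe that the Transformer block residual is highly consistent across timesteps and exploit this redundancy with a Temporal Delta Cache that skips computation. This offers a new approach to optimizing inference for efficient video generation models.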
Abstract: Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block’s input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. In addition, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92× end-to-end acceleration and 3.32× memory reduction, setting a new baseline for efficient inference in Video DiTs.
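The two ideas in this abstract (precision chosen from a block's input-output difference, and reuse of a cached residual) can be illustrated with a toy sketch. This is not the authors' implementation: the relative L1 measure, the fixed `threshold`, and the names `relative_change`, `choose_precision`, and `DeltaCache` are assumptions made for illustration, and no real quantization is performed.

```python
def relative_change(block_in, block_out):
    """Relative L1 change between a block's input and its output."""
    diff = sum(abs(o - i) for i, o in zip(block_in, block_out))
    norm = sum(abs(i) for i in block_in) or 1.0  # avoid division by zero
    return diff / norm


def choose_precision(block_in, block_out, threshold=0.5):
    """Hypothetical rule: stable blocks get NVFP4, volatile blocks keep INT8."""
    return "NVFP4" if relative_change(block_in, block_out) < threshold else "INT8"


class DeltaCache:
    """Caches a block's residual (output - input) for reuse at later timesteps."""

    def __init__(self):
        self.delta = None

    def run(self, block, x, reuse):
        if reuse and self.delta is not None:
            # Skip the block and add the cached residual instead.
            return [xi + di for xi, di in zip(x, self.delta)]
        y = block(x)
        self.delta = [yi - xi for xi, yi in zip(x, y)]
        return y
```

In this sketch the caller decides when `reuse` is safe; the paper describes a learned lightweight predictor for the precision decision, which a fixed threshold only approximates.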
[66] WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification cs.CV | cs.AIPDF
Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira
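TL;DR: This paper proposes WeNLEX, a weakly supervised model that generates natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from the explanations with the original images in the black-box model's feature space, and plausibility is maintained through distribution alignment with a small set of clinician-annotated explanations. Experiments show that WeNLEX produces faithful and plausible explanations with as few as 5 ground-truth explanations per diagnosis, improves classification performance when trained jointly with the classifier, and can adapt its explanations to the target audience, such as non-medical users.
Details
Motivation: Prior work usually supervises explanation generation explicitly with explanation-annotated datasets, so the generated explanations look plausible but are not faithful to the model's actual reasoning. This paper aims to generate natural language explanations for multilabel chest X-ray classification that are both faithful and plausible.
Result: Across multiple metrics assessing faithfulness, simulatability, diversity, and plausibility, WeNLEX generates faithful and plausible explanations. When trained jointly with the multilabel classifier, it improves the classification AUC of the standalone classifier by 2.21%.
Insight: The novelty lies in ensuring faithfulness through image matching in feature space while keeping plausibility through distribution alignment with a small amount of annotated data. The model supports both post-hoc and in-model settings and can be adapted to different target audiences by swapping the database (for example, producing a simplified version for lay users). The results indicate that training for interpretability can improve downstream task performance.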
Abstract: Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model’s reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model’s feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as few as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance. Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.
[67] ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation cs.CVPDF
Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau
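TL;DR: This paper proposes ProCal, a probability calibration method for source-free domain adaptation (SFDA). It dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism, mitigating the knowledge forgetting and overfitting to local noise that arise when existing methods over-rely on prediction similarity among neighbors.
Details
Motivation: Existing SFDA methods over-rely on prediction similarity among neighbors, which accelerates the forgetting of source knowledge and increases susceptibility to overfitting local noise. ProCal aims to address these problems.
Result: Extensive experiments on 31 cross-domain tasks across four public datasets validate the method's effectiveness, and the theoretical analysis shows that ProCal converges to an equilibrium in which source knowledge and target information are effectively fused.
Insight: The novelty includes calibrating neighbor probabilities collaboratively with two models (the source model's initial predictions and the current model's online outputs), and a joint optimization objective that combines a soft supervision loss with a diversity loss to balance knowledge retention and domain adaptation.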
Abstract: Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model’s initial predictions with the current model’s online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: https://github.com/zhengyinghit/ProCal.
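The abstract does not give ProCal's calibration formula, so the following is only a minimal sketch of the general idea of blending a frozen source model's prediction with the current model's prediction for each neighbor. The convex mixing weight `alpha`, the averaging over neighbors, and the names `calibrate` and `soft_target` are all assumptions for illustration.

```python
def calibrate(source_probs, current_probs, alpha=0.5):
    """Blend source-model and current-model class probabilities (assumed convex mix)."""
    mixed = [alpha * s + (1.0 - alpha) * c for s, c in zip(source_probs, current_probs)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize to a probability vector


def soft_target(neighbors, alpha=0.5):
    """Average the calibrated probabilities of a sample's neighbors.

    `neighbors` is a list of (source_probs, current_probs) pairs, one per neighbor.
    """
    calibrated = [calibrate(s, c, alpha) for s, c in neighbors]
    n = len(calibrated)
    return [sum(column) / n for column in zip(*calibrated)]
```

With `alpha=1.0` the target relies entirely on the source model and with `alpha=0.0` entirely on the current model, which makes the retention-versus-adaptation trade-off described in the abstract explicit.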
[68] SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction cs.CVPDF
Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink, Malcolm Mielle
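TL;DR: This paper proposes SEAR, a simple and efficient fine-tuning strategy that adapts pretrained visual geometry Transformers to multimodal RGB-thermal (RGB-T) inputs, addressing the performance drop of existing visual geometry models on mixed modalities such as RGB-T. After fine-tuning on a relatively small RGB-T dataset, it significantly outperforms existing methods on 3D reconstruction and camera pose estimation and remains reliable under challenging conditions such as low light and dense smoke.
Details
Motivation: Foundational visual geometry models pretrained on RGB data degrade when handling mixed sensing modalities such as RGB-thermal images, and in particular struggle to align the RGB and thermal modalities.
Result: On RGB-T data, SEAR significantly outperforms existing state-of-the-art methods on all metrics for 3D reconstruction and camera pose estimation (for example, an improvement of over 29% in AUC@30), with negligible inference-time overhead, and performs reliably under challenging conditions such as low light and dense smoke.
Insight: The contributions include a lightweight, efficient fine-tuning strategy that adapts a pretrained geometry Transformer to multimodal inputs; ablation studies that show how the model aligns the two modalities; and a new RGB-T dataset with sequences captured at different times, viewpoints, and illumination conditions, providing a benchmark for multimodal 3D scene reconstruction.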
Abstract: Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving significant improvements over all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.
[69] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation cs.CV | cs.AIPDF
Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler
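TL;DR: This paper proposes Perceptio, a large vision-language model whose perception is enhanced through spatial token generation. By explicitly generating semantic segmentation tokens and depth tokens, the model substantially improves 2D and 3D spatial reasoning and achieves state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing large vision-language models excel at semantic understanding but fall short on fine-grained spatial grounding, because the model must implicitly infer complex geometry without ever producing an explicit spatial interpretation.
Result: On RefCOCO/+/g, the cIoU for referring expression segmentation improves by +0.8, +1.4, and +1.1 respectively; HardBLINK spatial understanding accuracy improves by 10.3%; and MMBench accuracy improves by 1.0%, reaching state-of-the-art results on multiple benchmarks.
Insight: The contributions include explicitly generating spatial tokens via a VQ-VAE depth codebook and SAM2 semantic segmentation tokens; introducing composite depth-token objectives (marker, token, and count losses) and a soft-merging technique to stabilize depth token generation; and adopting a multi-task co-training strategy so the model learns perception tokens that serve multiple downstream tasks.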
Abstract: Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
[70] SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues cs.CV | cs.AI | cs.CL | cs.LGPDF
Carlos Hinojosa, Clemens Grange, Bernard Ghanem
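TL;DR: This paper studies how semantic cues affect the safety judgments of vision-language models (VLMs). It proposes a semantic steering framework that manipulates model behavior through textual, visual, and cognitive interventions, and introduces the SAVeS benchmark to evaluate models on situational safety. The study finds that VLM safety decisions depend heavily on semantic cues rather than on reasoning grounded in visual understanding, revealing a potential fragility in multimodal safety systems.
Details
Motivation: When VLMs make safety decisions in real-world and embodied settings, it is unclear what evidence drives their judgments. The paper investigates whether simple semantic cues can steer multimodal safety behavior.
Result: Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues and that models rely on learned visual-linguistic associations rather than visually grounded reasoning. Automated steering pipelines can exploit these mechanisms, confirming a vulnerability in multimodal safety systems.
Insight: The novelty lies in the semantic steering framework and the SAVeS benchmark, whose evaluation protocol separates behavioral refusal, grounded safety reasoning, and false refusals. Viewed objectively, the analysis exposes how shallow the associations behind VLM safety judgments are and offers useful guidance for improving model robustness.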
Abstract: Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.
[71] HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models cs.CVPDF
Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas
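TL;DR: This paper proposes HORNet, a lightweight frame selection policy trained with reinforcement learning (GRPO) for video question answering (VQA). By selecting key video frames, it reduces the number of input frames by up to 99% and VLM processing time by up to 93% while improving answer quality on several benchmarks, and it shows good generalization and transfer across VLMs.
Details
Motivation: Most existing video question answering systems select frames by uniform or heuristic sampling, which cannot be optimized for downstream answer quality, making them inefficient and prone to missing key information. This paper aims to optimize the visual input to the VLM with a learnable frame selection policy, improving both answering performance and efficiency.
Result: F1 improves by 1.7% on the MSVD-QA benchmark, and on the temporal reasoning tasks of NExT-QA the method gains 7.3 points over uniform sampling. Evaluated on six benchmarks (341,877 QA pairs and 114.2 hours of video in total), HORNet achieves strong performance while reducing computation.
Insight: The core idea is to decouple visual input selection (the Select Any Frames task) from VLM reasoning and to train a lightweight selection policy with GRPO. Beyond gains in efficiency and accuracy, the policy generalizes out of distribution and transfers across different VLMs, offering a way to optimize VLM systems that is complementary to optimizing generation.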
Abstract: Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet’s policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
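For readers unfamiliar with GRPO, its central step is to score each sampled action relative to the other samples in the same group rather than against a learned value function. The sketch below shows that normalization only. The reward definition (for example, 1.0 when the frozen VLM answers correctly from the selected frames) is an assumption, as is the use of the population standard deviation; the abstract does not specify HORNet's reward or training details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled frame selections.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]
```

For one question, a policy would sample several candidate frame subsets, collect one reward per subset, and weight each subset's log-probability by its advantage. When every subset earns the same reward, all advantages are zero and the group contributes no gradient.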
[72] Motion-o: Trajectory-Grounded Video Reasoning cs.CV | cs.AIPDF
Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas
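TL;DR: This paper proposes Motion-o, a motion-centric extension to vision-language models for video understanding. By introducing Spatial-Temporal-Trajectory (STT) reasoning, it makes the trajectories of objects between successive observations explicit and verifiable. The paper also presents a trajectory-grounding dataset and a structured reasoning pathway called Motion Chain of Thought (MCoT).
Details
Motivation: Existing video reasoning research has advanced spatio-temporal evidence chains but has neglected reasoning about how objects move between observations; motion patterns are not explicitly articulated, and trajectories are not explicitly understood or verified.
Result: Experiments show that Motion-o improves spatio-temporal grounding and trajectory prediction without requiring architectural modifications, and that it is fully compatible with existing frameworks.
Insight: The novelty lies in formalizing motion trajectory reasoning as the STT task and in designing the MCoT structured reasoning pathway and a reward function, so that the model reasons directly from visual evidence, strengthening the evidential grounding and interpretability of video understanding.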
Abstract: Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about how objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories through discrete \texttt{
[73] PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment cs.CV | cs.LGPDF
Tianci Luo, Jinpeng Wang, Shiyu Qin, Niu Lian, Yan Feng
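TL;DR: This paper proposes PromptHub, a framework that strengthens multi-prompt visual in-context learning through locality-aware fusion, concentration, and alignment. It aims to overcome the limitations of existing patch-wise fusion frameworks and model-agnostic supervision, making better use of informative cues to improve performance.
Details
Motivation: Existing prompt fusion methods in visual in-context learning use patch-level fusion and model-agnostic supervision, which limits how fully informative cues can be exploited and caps performance gains. A more holistic approach to multi-prompt learning is therefore needed.
Result: Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub, and its universality, transferability, and robustness are validated in out-of-distribution settings and various retrieval scenarios.
Insight: The novelty lies in exploiting spatial priors to capture richer contextual information, using complementary concentration, alignment, and prediction objectives that guide each other during training, and adding data augmentation to reinforce supervision, establishing a reliable locality-aware paradigm for prompt fusion.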
Abstract: Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.
[74] MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model cs.CV | cs.AIPDF
Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee
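TL;DR: This paper introduces MultihopSpatial, a benchmark for evaluating the multi-hop compositional spatial reasoning of vision-language models (VLMs). It comprises a benchmark of complex 1- to 3-hop queries, a new metric, Acc@50IoU, that combines answer selection with bounding box prediction, and a large-scale training corpus, MultihopSpatial-Train. An evaluation of 37 advanced VLMs shows that compositional spatial reasoning remains challenging, and that reinforcement learning fine-tuning on the corpus improves VLM spatial reasoning and downstream embodied manipulation performance.
Details
Motivation: Existing benchmarks focus mainly on elementary single-hop spatial relations and neglect the multi-hop compositional reasoning and precise visual grounding that real-world scenarios require, so a new benchmark is needed to fill this gap.
Result: An extensive evaluation of 37 state-of-the-art vision-language models reveals that compositional spatial reasoning remains a major challenge. Experiments also show that reinforcement learning post-training on the corpus improves both the intrinsic spatial reasoning of VLMs and their performance on downstream embodied manipulation tasks.
Insight: The contributions are MultihopSpatial, a benchmark focused on multi-hop compositional spatial reasoning, and Acc@50IoU, a combined metric that evaluates reasoning and visual grounding together. Viewed objectively, the large-scale training corpus and the reinforcement learning post-training recipe offer an effective new route to improving the spatial intelligence of VLMs.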
Abstract: Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
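The abstract describes Acc@50IoU as requiring both a correct answer selection and a precise bounding box. A plausible reading, sketched below, counts a sample as correct only when the chosen answer matches and the predicted box overlaps the ground-truth box with IoU of at least 0.5. The inclusive threshold, the `(x1, y1, x2, y2)` box format, and the sample dictionary keys are assumptions, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_50_iou(samples, threshold=0.5):
    """Fraction of samples with a correct answer AND a box IoU >= threshold."""
    hits = sum(
        1
        for s in samples
        if s["pred_answer"] == s["gt_answer"]
        and iou(s["pred_box"], s["gt_box"]) >= threshold
    )
    return hits / len(samples) if samples else 0.0
```

Under this reading, a model that picks the right option but localizes the wrong object receives no credit, which is what separates the metric from plain multiple-choice accuracy.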
[75] Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness cs.CV | cs.AIPDF
Yitong Li, Igor Yakushev, Dennis M. Hedderich, Christian Wachinger
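TL;DR: This paper proposes PASTA, a new image translation framework based on conditional diffusion models that generates synthetic PET images from MRI with a particular emphasis on pathology awareness. Through a dual-arm architecture and multi-modal condition integration, it captures pathological information while preserving structural detail, and it introduces cycle exchange consistency and a volumetric generation strategy to improve 3D PET image quality. Experiments show that synthetic PET improves Alzheimer's disease diagnosis performance over MRI by 4%, approaching the level of real PET.
Details
Motivation: PET imaging provides critical functional information for diagnosing neurodegenerative diseases but is costly and involves radiation exposure; MRI has neither limitation but is less sensitive diagnostically. Existing generative methods mostly focus on structural preservation and neglect pathology awareness, so a cross-modality translation method that preserves both structural and pathological detail is needed.
Result: On Alzheimer's disease diagnosis, synthetic PET scans improve over MRI by 4%, almost reaching the performance of real PET. Qualitative and quantitative results both show that PASTA outperforms existing state-of-the-art methods at generating high-quality, pathology-aware 3D PET images.
Insight: The contributions include an interactive dual-arm architecture built on conditional diffusion models that strengthens pathology awareness, multi-modal condition integration, and cycle exchange consistency with a volumetric generation strategy to improve 3D image quality. Viewed objectively, by stressing the role of pathological information in cross-modality translation, the method offers a new direction for medical image synthesis.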
Abstract: Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA’s ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer’s diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at https://github.com/ai-med/PASTA.
[76] GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting cs.CVPDF
Ahmed Tawfik Aboukhadra, Marcel Rogge, Nadia Robertini, Abdalla Arafa, Jameel Malik
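TL;DR: GHOST is a fast, category-agnostic framework based on 2D Gaussian Splatting for reconstructing dynamic hand-object interactions from monocular RGB video. It represents hands and objects as dense, view-consistent Gaussian discs and introduces a geometric-prior retrieval and consistency loss, grasp-aware alignment, and a hand-aware background loss, producing complete, physically consistent, and animatable reconstructions an order of magnitude faster than existing category-agnostic methods.
Details
Motivation: Understanding realistic hand-object interactions from monocular RGB video is essential for AR/VR, robotics, and embodied AI, but existing methods rely on category-specific templates or heavy computation and often produce physically inconsistent 3D hand-object alignment.
Result: Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets show that GHOST achieves state-of-the-art accuracy in 3D reconstruction and 2D rendering quality while running an order of magnitude faster than prior category-agnostic methods.
Insight: The contributions include a geometric-prior retrieval and consistency loss that completes occluded object regions; grasp-aware alignment that refines hand translation and object scale to ensure realistic contact; and a hand-aware background loss that avoids penalizing object regions occluded by the hand. Together these techniques yield efficient, robust, and physically consistent reconstruction.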
Abstract: Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at https://github.com/ATAboukhadra/GHOST.
[77] Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching cs.CVPDF
Feifan Luo, Hongyang Chen
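TL;DR: This paper proposes a novel unsupervised contrastive learning method for efficient and robust 3D spectral shape matching. An unsupervised contrastive learning framework improves the quality of the feature representations, and a simplified functional map learning architecture avoids the time-consuming functional map solvers and multiple auxiliary losses of traditional methods, reaching state-of-the-art accuracy and efficiency.
Details
Motivation: Current deep functional map methods focus mainly on optimizing pointwise or functional maps rather than directly improving feature representations in the embedding space, which results in inadequate feature quality and suboptimal matching. They also depend heavily on time-consuming traditional functional map solvers, which makes them computationally expensive.
Result: Across several challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, the method achieves state-of-the-art accuracy and efficiency and even surpasses supervised techniques.
Insight: The novelty lies in bringing unsupervised contrastive learning to 3D shape matching for the first time, improving the consistency and discriminability of features by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs. The paper also designs a greatly simplified functional map learning architecture that drops computationally expensive solvers and multiple auxiliary losses, significantly improving efficiency.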
Abstract: Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.
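The abstract does not state which contrastive objective is used, so the sketch below substitutes the widely used InfoNCE loss to illustrate what "maximizing consistency within positive pairs and minimizing it within negative pairs" can look like for per-point feature vectors. The cosine similarity, the `temperature` value, and the function names are assumptions, and how positive and negative pairs are mined without supervision is left out entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is closer to the positive than to any negative."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

Minimizing this loss pulls matched features together and pushes unmatched ones apart, which is one common way to obtain the consistency and discriminability the abstract refers to.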
[78] VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation cs.CVPDF
Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao
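TL;DR: This paper proposes VGGT-360, a training-free, geometry-consistent framework for zero-shot panoramic depth estimation. By leveraging the intrinsic 3D consistency of VGGT-like foundation models, it reformulates the task as panoramic reprojection over a 3D model reconstructed from multiple views, unifying fragmented per-view reasoning into a coherent panoramic understanding.
Details
Motivation: Existing training-free methods process views independently and lack geometric consistency. The goal is geometry-consistent panoramic depth estimation that requires no additional training.
Result: Extensive experiments across multiple resolutions and diverse indoor and outdoor datasets show that VGGT-360 outperforms existing state-of-the-art methods, both trained and training-free.
Insight: The novelty lies in recasting panoramic depth estimation as panoramic reprojection over a multi-view 3D reconstruction, and in three plug-and-play modules (uncertainty-guided adaptive projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction) that form a unified panorama-to-3D-to-depth framework and ensure geometric consistency across views.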
Abstract: This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT’s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT’s robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
[79] CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think cs.CV | cs.LGPDF
Zening Sun, Zhengpeng Xie, Lichen Bai, Shitong Shao, Shuo Yang
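TL;DR: This paper proposes CRAFT (Composite Reward Assisted Fine-Tuning), a lightweight yet powerful fine-tuning paradigm for aligning diffusion models. It builds a high-quality training dataset with a composite reward filtering technique and then performs an enhanced variant of supervised fine-tuning, addressing the heavy dependence on high-quality data and the low computational efficiency of existing methods.
Details
Motivation: Existing diffusion model alignment methods, such as supervised fine-tuning and DPO-style preference optimization, depend heavily on costly high-quality images or on large-scale preference datasets of inconsistent quality, and they are computationally inefficient. This paper targets both challenges.
Result: Experiments show that CRAFT with only 100 samples outperforms recent SOTA preference optimization methods that use thousands of preference-paired samples, and that it converges 11 to 220 times faster than baseline methods, demonstrating strong data and computational efficiency.
Insight: The novelty lies in the composite reward filtering technique for building a high-quality dataset and in connecting the enhanced supervised fine-tuning to reinforcement learning theory: the authors prove that it optimizes a lower bound of group-based reinforcement learning, which gives a theoretical basis for supervised fine-tuning on selected data. The method stands out for its very low data requirements and high computational efficiency.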
Abstract: Aligning diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220× faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
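The abstract names Composite Reward Filtering but does not say how rewards are combined or how samples are selected. As a rough illustration only, the sketch below scores each candidate with several reward functions, averages them (an assumed aggregation), and keeps the top `keep` samples for subsequent fine-tuning. The function name, the reward functions, and the sample fields are hypothetical.

```python
def composite_reward_filter(samples, reward_fns, keep):
    """Keep the `keep` samples with the highest mean reward across `reward_fns`."""
    scored = [
        (sum(fn(sample) for fn in reward_fns) / len(reward_fns), sample)
        for sample in samples
    ]
    # Sort by composite score, best first; ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in scored[:keep]]


# Hypothetical usage: each sample carries precomputed scores from two reward models.
candidates = [
    {"id": "a", "aesthetic": 0.9, "alignment": 0.2},
    {"id": "b", "aesthetic": 0.7, "alignment": 0.8},
    {"id": "c", "aesthetic": 0.1, "alignment": 0.9},
]
selected = composite_reward_filter(
    candidates,
    reward_fns=[lambda s: s["aesthetic"], lambda s: s["alignment"]],
    keep=2,
)
```

Combining several rewards before filtering is one way to avoid selecting images that maximize a single reward model while scoring poorly on the others; whether CRAFT aggregates by mean, by agreement, or by some other rule is not stated in the abstract.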
[80] Generalized Hand-Object Pose Estimation with Occlusion Awareness cs.CVPDF
Hui Yang, Wei Sun, Jian Liu, Jian Xiao, Tao Xie, Hossein Rahmani
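TL;DR: This paper proposes GenHOI, a framework for generalized 3D hand-object pose estimation from a single RGB image, a task made difficult by large variations in object appearance and interaction patterns and by heavy occlusion. The framework integrates hierarchical semantic knowledge with hand priors to improve generalization under occlusion and achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks.
Details
Motivation: Generalized 3D hand-object pose estimation from a single RGB image is hard, especially when object appearance and interaction patterns vary widely and occlusion is severe, and existing methods do not generalize well enough.
Result: Extensive experiments on the DexYCB and HO3Dv2 benchmarks show that the method achieves state-of-the-art (SOTA) performance in hand-object pose estimation.
Insight: The contributions include a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns as textual descriptions, so the model learns abstract high-level representations that generalize to unseen objects and novel interactions; a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions for robust occlusion reasoning; and the use of hand priors as stable spatial references for extracting implicit interaction constraints.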
Abstract: Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.
[81] Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token cs.CVPDF
Anqi Zhang, Xiaokang Ji, Guangyu Gao, Jianbo Jiao, Chi Harold Liu
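TL;DR: This paper proposes SELF1E, which explores whether a single segmentation embedding (segmentation token) is enough to unlock the segmentation ability of a multi-modal large language model (MLLM) itself, removing the need for an external mask decoder. The method retains image features at their original uncompressed resolution, refills them with residual features extracted from the compressed features processed by the MLLM, applies pixel-unshuffle operations to enhance feature detail, and designs an attention mask with dual perception pathways to promote interaction between pixels and the segmentation token.
Details
Motivation: Most existing MLLM-based segmentation methods rely on specialist mask decoders or introduce multiple additional tokens to generate segmentation masks. This paper investigates whether a single segmentation embedding from the MLLM itself can deliver competitive results, eliminating the need for an external decoder.
Result: Comprehensive experiments across multiple segmentation tasks show that SELF1E performs on par with specialist mask decoder-based methods, confirming that decoder-free segmentation in MLLMs is feasible.
Insight: The novelty is a decoder-free MLLM segmentation framework. By preserving and enhancing feature resolution, refilling with residual features, and using a dual-pathway attention design, it segments effectively with only a single segmentation token, pointing to simpler MLLM segmentation architectures.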
Abstract: Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.
[82] SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models cs.CV | cs.AI | cs.LGPDF
Quentin Guimard, Federico Bartsch, Simone Caldarella, Rahaf Aljundi, Elisa Ricci
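TL;DR: This paper proposes Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework for mitigating social and spurious biases in vision-language models such as CLIP. The method operates in the latent space of a Sparse Autoencoder (SAE): it decomposes CLIP text embeddings into disentangled features, then identifies and modulates bias-relevant neurons while preserving query-relevant information, enabling more precise, non-linear interventions.
Details
Motivation: The large-scale, uncurated training data of vision-language models such as CLIP introduce severe biases. Existing post-hoc debiasing methods usually operate in the dense CLIP embedding space, where bias and task-relevant information are highly entangled, making it hard to remove bias without degrading semantic fidelity.
Result: Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification, indicating that sparse latent representations are an effective foundation for post-hoc debiasing of vision-language models.
Insight: The novelty lies in using a sparse autoencoder to disentangle embedding features and applying targeted modulation in the sparse latent space, which isolates and intervenes on bias-relevant neurons more precisely while keeping semantics intact.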
Abstract: Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
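The encode, modulate, decode pattern described in the SEM abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the ReLU encoder, the linear decoder, the tiny hand-written weights, and the names `sae_encode`, `sae_decode`, `modulate`, and `bias_neurons` are hypothetical, and the sketch takes the set of bias-relevant neurons as given rather than showing how SEM identifies them.

```python
def sae_encode(embedding, enc_weights, enc_bias):
    """Sparse code: ReLU(W_enc @ embedding + b). Assumed encoder form."""
    return [
        max(0.0, sum(w * e for w, e in zip(row, embedding)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def sae_decode(latent, dec_weights):
    """Reconstruct the embedding; dec_weights[j] is the direction of neuron j."""
    dim = len(dec_weights[0])
    return [
        sum(latent[j] * dec_weights[j][d] for j in range(len(latent)))
        for d in range(dim)
    ]


def modulate(embedding, enc_weights, enc_bias, dec_weights, bias_neurons, scale=0.0):
    """Scale the activations of the given neurons, leave the rest untouched."""
    latent = sae_encode(embedding, enc_weights, enc_bias)
    edited = [a * scale if j in bias_neurons else a for j, a in enumerate(latent)]
    return sae_decode(edited, dec_weights)
```

With `scale=0.0` the selected neurons are switched off entirely; a value between 0 and 1 would attenuate them instead. Because the edit happens after the ReLU, the overall mapping from input embedding to output embedding is non-linear.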
[83] TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation cs.CVPDF
Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir
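TL;DR: This paper presents TerraScope, a unified vision-language model for earth observation that addresses the difficulty existing models have in grounding complex spatial reasoning in precise pixel-level visual representations. The model supports modality-flexible and multi-temporal reasoning, and the paper also introduces Terra-CoT, a dataset of 1 million samples, and TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning. Experiments show that TerraScope significantly outperforms existing vision-language models on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
Details
Motivation: Existing vision-language models struggle to ground complex spatial reasoning in pixel-level visual representations for earth observation tasks, so a unified model capable of pixel-grounded geospatial reasoning is needed.
Result: TerraScope significantly outperforms existing vision-language models on pixel-grounded geospatial reasoning. It is evaluated on the six sub-tasks of TerraScope-Bench for both answer accuracy and mask quality.
Insight: The contributions are (1) modality-flexible reasoning, which handles single-modality inputs (optical or SAR) and adaptively fuses modalities when both are available; (2) multi-temporal reasoning, which integrates temporal sequences for change analysis; and (3) the large-scale Terra-CoT dataset with pixel-level masks and TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, which together advance interpretable geospatial AI.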
Abstract: Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
[84] Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos cs.CVPDF
Weijia Dou, Wenzhao Zheng, Weiliang Chen, Yu Zheng, Jie Zhou
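TL;DR: This paper proposes SGC, a new metric for evaluating 3D spatial geometric consistency in dynamically generated videos. It quantifies geometric inconsistency by estimating multiple camera poses from different static regions of a video and measuring how much they diverge.
Details
Motivation: Current generative models produce high-fidelity videos but often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods such as FVD are insensitive to geometric distortions, and consistency benchmarks wrongly penalize valid foreground dynamics, so a dedicated metric is needed to characterize these inconsistencies accurately.
Result: Experiments on real and generated videos show that SGC robustly quantifies geometric inconsistency and identifies critical failure cases missed by existing metrics.
Insight: The novelty lies in partitioning the static background into spatially coherent sub-regions, estimating a local camera pose for each, and computing the divergence among those poses to evaluate 3D geometric consistency specifically, which offers a new angle on video generation quality assessment.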
Abstract: Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each sub-region, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generated videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
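The final step of the SGC pipeline, comparing camera poses estimated from different sub-regions, can be sketched as follows. The abstract does not define the divergence measure, so this sketch assumes one plausible choice: the mean pairwise geodesic angle between the rotation parts of the poses. The function names and the `rot_z` helper are hypothetical, and depth prediction and per-region pose estimation are assumed to have happened upstream.

```python
import math
from itertools import combinations


def rotation_angle_between(r_a, r_b):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the element-wise inner product of the matrices.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)


def pose_divergence(rotations):
    """Mean pairwise rotation angle; 0.0 means all regions agree.

    Assumption: only rotations are compared. A fuller score would also
    account for translation disagreement.
    """
    pairs = list(combinations(rotations, 2))
    if not pairs:
        return 0.0
    return sum(rotation_angle_between(a, b) for a, b in pairs) / len(pairs)


def rot_z(theta):
    """Helper for the example: rotation by theta about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

For a geometrically consistent static background, every sub-region should imply the same camera motion, so the divergence stays near zero; sub-regions that disagree push the score up.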
[85] SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation cs.CV | cs.GRPDF
Phuc Pham, Uy Dieu Tran, Binh-Son Hua, Phong Nguyen
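TL;DR: SwiftTailor is a novel two-stage framework for efficient 3D garment generation. It unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation, significantly speeding up generation.
Details
Motivation: Existing 3D garment generation methods built on vision-language models and the GarmentCode framework are slow at inference (30 seconds to a minute). The goal is more efficient, high-quality 3D garment generation.
Result: Extensive experiments on the Multimodal GarmentCodeData dataset show that SwiftTailor maintains state-of-the-art (SOTA) accuracy and visual fidelity while significantly reducing inference time.
Insight: The core innovation is the unified Garment Geometry Image representation, which encodes the 3D garment surface in a single UV space. Combined with efficient inverse mapping, remeshing, and dynamic stitching algorithms, it avoids the overhead of physical simulation and delivers both inference efficiency and quality.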
Abstract: Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.
[86] Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding cs.CV | cs.AIPDF
Yikai Zheng, Xin Ding, Yifan Yang, Shiqi Jiang, Hao Wu
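TL;DR: This paper proposes Em-Garde, a framework that addresses the efficiency-accuracy trade-off in proactive streaming video understanding. It decouples semantic understanding from streaming perception: an instruction-guided proposal parser turns user queries into structured visual proposals, and a lightweight proposal matching module performs efficient matching during streaming to trigger responses.
Details
Motivation: Current proactive video LLMs rely on per-frame triggering decisions, which makes it hard to achieve both efficiency and accuracy. This paper aims to design a more efficient framework that delivers accurate proactive video understanding under strict computational constraints.
Result: Experiments on the StreamingBench and OVO-Bench benchmarks show that Em-Garde consistently improves on prior models in both proactive response accuracy and efficiency.
Insight: The core innovation is decoupling semantic understanding (proposal generation) from streaming perception (proposal matching), together with instruction-guided visual proposal parsing and a lightweight embedding matching mechanism, which significantly improves processing efficiency while keeping accuracy high.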
Abstract: Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions, which suffer from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
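The streaming half of the propose-match idea can be illustrated with a minimal sketch: proposals are embedded once at query time, and each incoming frame is only compared against them. The cosine similarity measure, the fixed `threshold`, and the names `cosine` and `first_trigger` are assumptions for illustration; the abstract only states that matching is embedding-based and lightweight.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_trigger(frame_embeddings, proposal_embeddings, threshold=0.8):
    """Return the index of the first frame matching any proposal, else None.

    Assumption: a response fires when the best similarity reaches a fixed
    threshold. The actual trigger rule is not given in the source.
    """
    for idx, frame in enumerate(frame_embeddings):
        best = max(cosine(frame, p) for p in proposal_embeddings)
        if best >= threshold:
            return idx
    return None
```

The per-frame cost here is a handful of dot products rather than a full model call, which is the kind of saving that separating query-time parsing from stream-time matching is meant to provide.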
[87] SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation cs.CVPDF
Oliver Cory, Ozge Mercanoglu Sincan, Richard Bowden
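TL;DR: This paper proposes SignAgent, an agentic framework built on large language models (LLMs) for scalable, linguistically grounded sign language (SL) annotation and dataset curation. The SignAgent Orchestrator (a reasoning LLM that coordinates tools) and SignGraph (a knowledge-grounded LLM) work together to overcome the limits of traditional computational methods and manual annotation in capturing linguistic detail at scale.
Details
Motivation: Traditional computational methods for sign languages usually operate at the gloss level and overlook crucial linguistic detail, while manual linguistic annotation is slow and expensive, making it a bottleneck for building large-scale, phonologically aware datasets.
Result: On two downstream tasks, pseudo-gloss annotation and ID glossing, the agentic approach achieves strong performance: it uses multi-modal evidence for constrained assignment, and it detects and refines visual clusters by reasoning over visual similarity and phonological overlap.
Insight: The novelty lies in using an LLM as an agentic orchestrator that combines linguistic tools with a knowledge base to produce fine-grained, linguistically driven automatic annotation of signed sequences, offering a new scalable route to large sign language datasets.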
Abstract: This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. The first is Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. The second is ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.
[88] Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline cs.CVPDF
Ye Wang, Wei Lu, Zhihui You, Keyan Chen, Tongfei Liu
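TL;DR: Building change detection in optical remote sensing imagery is easily disturbed by changes in illumination, season, and land cover. This paper introduces LSMD, a multi-modal dataset targeting large-scale small changes, and designs the Multi-modal Spectral Complementarity Network (MSCNet). Through local spatial detail enhancement, cross-modal alignment and interaction, and saliency-aware multi-source refinement, MSCNet fuses RGB and near-infrared information effectively and improves fine-grained building change detection.
Details
Motivation: RGB-only imagery tends to produce pseudo-changes and semantic ambiguity in change detection, while near-infrared information supplies complementary physical cues. However, existing datasets lack high-resolution, accurately registered bi-temporal imagery, and existing methods do not fully exploit the heterogeneity between modalities.
Result: On the proposed LSMD benchmark, MSCNet outperforms existing methods under multiple input configurations, validating its effectiveness for fine-grained building change detection in complex environments.
Insight: The contributions are LSMD, a multi-modal RGB-NIR benchmark dataset targeting small changes in realistic scenarios, and MSCNet, a network with neighborhood context enhancement, cross-modal alignment and interaction, and saliency-aware refinement that achieves effective cross-modal feature fusion.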
Abstract: Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD
[89] TAU-R1: Visual Language Model for Traffic Anomaly Understanding cs.CVPDF
Yuqiang Lin, Kehua Chen, Sam Lockyer, Arjun Yadav, Mingxuan Sui
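TL;DR: This paper proposes TAU-R1, a two-layer vision-language framework for Traffic Anomaly Understanding (TAU). To address the lack of benchmarks and dedicated methods in this area, the authors first build the Roundabout-TAU dataset, which contains 342 real-world roundabout video clips and more than 2,000 question-answer pairs. TAU-R1 consists of a lightweight anomaly classifier and a larger anomaly reasoner, trained with decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training strategy.
Details
Motivation: Traffic anomaly understanding is critical to the safety of intelligent transportation systems, but vision-language models have made limited progress here, mainly because dedicated benchmark datasets and task-specific methodologies are missing.
Result: Experiments show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency.
Insight: The main contributions are (1) Roundabout-TAU, the first dataset focused on traffic anomaly understanding at real-world roundabouts; (2) an efficient two-layer VLM framework that decouples coarse-grained classification from fine-grained reasoning; and (3) a two-stage training strategy that combines decomposed-QA-enhanced supervised fine-tuning with GRPO post-training based on task-specific reward functions to strengthen task-specific reasoning.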
Abstract: Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1
[90] GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning cs.CV | cs.ROPDF
Yiren Lu, Yi Du, Disheng Liu, Yunlai Zhou, Chen Wang
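TL;DR: This paper proposes GSMem, a framework that uses 3D Gaussian Splatting (3DGS) as persistent spatial memory for zero-shot embodied exploration and reasoning. By parameterizing continuous geometry and dense appearance, 3DGS gives the agent spatial recollection, that is, the ability to render photorealistic images from optimal novel viewpoints. Combined with a retrieval mechanism over object-level scene graphs and semantic-level language fields, and a hybrid exploration strategy that mixes VLM-driven semantic scoring with a 3DGS coverage objective, the framework proves robust and effective on embodied question answering and lifelong navigation.
Details
Motivation: Existing scene representations, such as discrete scene graphs or static view snapshots, lack post-hoc re-observability. If the initial observation misses a target, the resulting memory gap is hard to recover, which limits how well spatial knowledge is accumulated and retained during embodied exploration.
Result: Extensive experiments on embodied question answering (EQA) and lifelong navigation show that the framework is robust and effective, achieving strong zero-shot exploration and reasoning.
Insight: The novelty lies in using 3DGS as a renderable, persistent spatial memory that gives the agent spatial recollection; in a complementary object-level and semantic-level retrieval mechanism that localizes target regions and generates optimal virtual views; and in a hybrid exploration strategy that combines semantic scoring with geometric coverage to balance task awareness and spatial coverage.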
Abstract: Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.
[91] ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis cs.CV | cs.AIPDF
Zhan Jin, Yu Luo, Yizhou Zhang, Ziyang Cui, Yuqing Wei
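TL;DR: This paper proposes ARIADNE, a framework that uses perception-reasoning synergy to address topologically inconsistent vessel segmentation in coronary angiography analysis. It takes a two-stage approach. The perception module fine-tunes a vision-language foundation model with DPO, using Betti number constraints as preference signals to ensure geometrically complete vessel structures. The reasoning module models stenosis localization as a Markov Decision Process with an explicit rejection mechanism that excludes ambiguous anatomical candidates (such as bifurcations and vessel crossings), optimizing for reliability rather than coverage.
Details
Motivation: Conventional segmentation methods based on pixel-wise losses cannot enforce topological constraints, which produces fragmented vascular trees. Pixel accuracy may be high, but the reliability of clinical diagnosis suffers.
Result: On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on the multi-center ARCADE and XCAD benchmarks confirms generalization across acquisition protocols.
Insight: The contributions include the first application of DPO for topological alignment in medical imaging, where preference learning over structural constraints mitigates topological violations, and a shift in stenosis detection from coverage maximization to reliability optimization, with an explicit rejection mechanism for ambiguous anatomy that improves diagnostic trustworthiness.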
Abstract: Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
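To make the idea of Betti numbers as preference signals concrete, here is a minimal sketch that ranks two candidate masks by their zeroth Betti number (the count of connected components). It is only an illustration: the 8-connectivity, the target of a single component, and the names `betti0` and `preference_pair` are assumptions, and the abstract does not say which Betti numbers are used or how pairs are actually built for DPO.

```python
def betti0(mask):
    """Count connected foreground components (8-connectivity) in a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            count += 1
            seen.add((r, c))
            stack = [(r, c)]
            # Flood fill the component that contains (r, c).
            while stack:
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return count


def preference_pair(mask_a, mask_b, target_b0=1):
    """Return (chosen, rejected) by closeness to the target component count.

    Assumption: a well-formed vessel tree is one connected component.
    Returns None when the two masks are equally close to the target.
    """
    err_a = abs(betti0(mask_a) - target_b0)
    err_b = abs(betti0(mask_b) - target_b0)
    if err_a == err_b:
        return None
    return (mask_a, mask_b) if err_a < err_b else (mask_b, mask_a)
```

A fragmented mask and a connected mask can have nearly identical pixel overlap with the ground truth, yet this comparison always prefers the connected one, which is the gap between pixel-wise and topological criteria that the abstract describes.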
[92] Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting cs.CVPDF
Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong
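TL;DR: This paper proposes Splat2BEV, a framework that learns geometry-aligned BEV representations through 3D Gaussian Splatting. It first pre-trains a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs and produces geometry-aligned feature representations, then projects these into BEV space as inputs for downstream tasks. Experiments on the nuScenes and Argoverse datasets show state-of-the-art performance.
Details
Motivation: Most existing BEV perception frameworks are trained end to end: image features are transformed directly into BEV space and optimized only through downstream task supervision. This lacks explicit 3D geometric understanding and interpretability and leads to suboptimal performance. The paper argues that an explicit 3D representation is essential for accurate BEV perception.
Result: Extensive experiments on the nuScenes and Argoverse datasets show that Splat2BEV achieves state-of-the-art (SOTA) performance.
Insight: The core innovation is bringing explicit 3D scene reconstruction (via 3D Gaussian Splatting) into the BEV perception framework to learn BEV features that are semantically rich and geometrically precise, improving both downstream performance and interpretability.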
Abstract: Bird’s-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.
[93] Tinted Frames: Question Framing Blinds Vision-Language Models cs.CVPDF
Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta
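TL;DR: This paper shows that vision-language models (VLMs) suffer from selective visual blindness: how much attention a model allocates to its visual input depends strongly on how the question is framed (multiple choice, yes/no, or open-ended), even when the framings demand the same visual reasoning. By quantifying attention distributions, the authors find that constrained framings reduce attention to image context and task-relevant regions and scatter it onto uninformative tokens, which hurts accuracy and consistency. Building on this mechanistic insight, the paper proposes a lightweight prompt-tuning method with learnable tokens that encourages more robust, visually grounded attention patterns, improving visual grounding and performance across framings.
Details
Motivation: VLMs often underuse their visual input in visual reasoning tasks ("visual blindness"). The paper investigates how a model's attention is selectively affected by the linguistic framing of a question, even when the framings are equivalent in the visual reasoning they require.
Result: Using visual attention as a probe, the experiments quantify how different framings (multiple choice and yes/no versus open-ended) change both the amount and the distribution of attention over the image. Constrained framings substantially lower attention to image context, reduce focus on task-relevant regions, and shift attention toward uninformative tokens, which is the principal cause of degraded accuracy and cross-framing inconsistency. The proposed lightweight prompt-tuning method improves visual grounding and performance across framings.
Insight: The core contribution is exposing selective visual blindness as a mechanistic problem: the degree to which a VLM uses visual information depends heavily on the linguistic form of the question rather than on the visual reasoning required. The learnable-token prompt-tuning method offers a lightweight and effective way to reduce this bias by adjusting attention patterns, which is relevant to improving VLM robustness and reliability.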
Abstract: Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended framing, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and performance across framings.
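One simple way to probe "amount of attention over the image" is to measure what fraction of a token's attention mass lands on image tokens. The sketch below is an assumption about how such a probe could look, not the paper's measurement protocol: the function name `image_attention_share`, the single attention row, and the boolean `is_image_token` mask are all hypothetical.

```python
def image_attention_share(attention_row, is_image_token):
    """Fraction of attention mass that falls on image tokens.

    attention_row: non-negative attention weights from one query token.
    is_image_token: booleans, True where the attended token is an image token.
    """
    total = sum(attention_row)
    if total == 0:
        return 0.0
    on_image = sum(w for w, m in zip(attention_row, is_image_token) if m)
    return on_image / total
```

Computing this share for the same image and question under an open-ended prompt and a multiple-choice prompt gives two directly comparable numbers; a lower share under the constrained framing would correspond to the effect the abstract reports.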
[94] Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders cs.CV | cs.LGPDF
Shang-Jui Ray Kuo, Paola Cascante-Bonilla
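TL;DR: This paper asks whether state space models (SSMs) can replace the standard transformer-based vision backbone (such as ViT) as the vision encoder in large vision-language models (VLMs). A systematic evaluation finds that SSM backbones perform strongly on VQA and localization at a smaller model scale, and the paper proposes stabilization strategies to improve robustness.
Details
Motivation: The goal is to explore the potential of non-transformer vision encoders, SSMs in particular, in VLMs, to assess how they perform as alternatives, and to address the instability that some vision backbones show in localization tasks.
Result: Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across VQA and localization. After adaptation with detection or segmentation training, it remains competitive at a substantially smaller model scale. The study also finds that ImageNet accuracy and backbone size do not reliably predict VLM performance, and that some vision backbones are unstable in localization.
Insight: The novelty lies in the first systematic evaluation of SSMs as VLM vision encoders, which shows them to be a strong alternative to transformers, together with stabilization strategies that improve robustness. More broadly, this opens a new direction for VLM architecture design and suggests that model efficiency and task adaptability matter more than simply scaling up.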
Abstract: Large vision–language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
[95] DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising cs.CV | cs.AI | cs.LGPDF
Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen
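TL;DR: DreamPartGen is a framework for semantically grounded, part-aware text-to-3D generation. It introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces geometric and semantic consistency, yielding coherent, interpretable, and text-aligned 3D synthesis.
Details
Motivation: Most existing text-to-3D methods overlook the semantic and functional structure of parts. Recent part-aware methods introduce decomposition but focus mainly on geometry; they lack semantic grounding and do not model how parts align with textual descriptions or relate to each other. This paper targets semantically grounded, part-aware 3D generation.
Result: Across multiple benchmarks, DreamPartGen achieves state-of-the-art performance in geometric fidelity and text-shape alignment.
Insight: The novelty lies in Duplex Part Latents (DPLs) and Relational Semantic Latents (RSLs), which model part attributes and inter-part semantic relations respectively, and in a synchronized co-denoising process that enforces geometric and semantic consistency.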
Abstract: Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part’s geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.
[96] LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs cs.CVPDF
Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao
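TL;DR: This paper introduces LVOmniBench, a new benchmark for evaluating how well omnimodal large language models (OmniLLMs) understand long-form audio-video content. The dataset contains 275 high-quality videos lasting 10 to 90 minutes and 1,014 question-answer pairs, and it evaluates long-term memory, temporal localization, fine-grained understanding, and multimodal perception. The evaluation shows that existing models face significant challenges with long audio-video inputs: open-source models generally score below 35% accuracy, while Gemini 3 Pro peaks at about 65%.
Details
Motivation: Current evaluations of OmniLLMs focus on short audio and video clips (10 seconds to 5 minutes), which do not reflect real-world applications where videos run for tens of minutes. A new benchmark is needed to fill this gap.
Result: On LVOmniBench, current OmniLLMs perform poorly on long audio-video inputs: open-source models generally achieve accuracies below 35%, while Gemini 3 Pro reaches a peak accuracy of about 65%.
Insight: The contribution is LVOmniBench, the first benchmark dedicated to evaluating long-form audio-video understanding, built through manual selection and annotation of long videos, which advances research on long-term cross-modal understanding. More broadly, the work highlights the importance of long-video understanding in realistic settings and gives future model development a clear evaluation target and supporting data.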
Abstract: Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset consists of high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.
[97] DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding cs.CV | cs.LGPDF
Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou
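TL;DR: This paper proposes DriveTok, an efficient 3D multi-view tokenizer for autonomous driving scenes that addresses the inefficiency and inter-view inconsistency of existing tokenizers on high-resolution multi-view driving data. It extracts semantically rich visual features with vision foundation models, converts them into scene tokens using 3D deformable cross-attention, and reconstructs multi-view features with a multi-view transformer to produce RGB, depth, and semantic reconstructions. A 3D head on the scene tokens predicts 3D semantic occupancy to strengthen spatial awareness.
Details
Motivation: As vision-language-action models and world models are adopted more widely in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular, 2D scenes, which leads to inefficiency and inter-view inconsistency on high-resolution multi-view driving scenes.
Result: Extensive experiments on the widely used nuScenes dataset show that the scene tokens produced by DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction.
Insight: The contributions are a unified 3D driving scene tokenization framework that integrates semantic, geometric, and textural information through 3D deformable cross-attention; a multi-view transformer with multi-task training objectives for efficient multi-view reconstruction and understanding; and 3D semantic occupancy prediction directly from the scene tokens for better spatial awareness. More broadly, the method encodes multi-view information into a compact 3D representation, giving autonomous driving a more efficient visual interface.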
Abstract: With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.
[98] Spectrally-Guided Diffusion Noise Schedules cs.CV | cs.LGPDF
Carlos Esteves, Ameesh Makadia
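TL;DR: This paper proposes designing per-instance noise schedules for pixel diffusion models from the spectral properties of the image. By deriving theoretical bounds on the efficacy of the minimum and maximum noise levels, it designs “tight” noise schedules that eliminate redundant steps, and it samples these schedules conditionally at inference. Experiments show improved generative quality for single-stage pixel diffusion models in the low-step regime.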
Details
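Motivation: The performance of denoising diffusion models depends on the noise schedule, yet existing schedules are typically handcrafted and must be tuned for each resolution, which limits efficiency and generation quality. A more principled way to design per-instance noise schedules is needed.
Result: Experiments show that the proposed noise schedules improve the generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
Insight: The novelty lies in using the image's spectral properties to design per-instance noise schedules, with theoretically derived bounds yielding tight schedules that remove redundant steps at inference. This offers a new direction for efficient training and sampling of diffusion models.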
Abstract: Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image’s spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design “tight” noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
[99] EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing cs.CVPDF
Yang Fu, Yike Zheng, Ziyun Dai, Henghui Ding
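TL;DR: This paper proposes EffectErase, a method that jointly performs video object removal and insertion to achieve high-quality effect erasing in video. It is trained on VOR, a newly constructed large-scale dataset, and uses task-aware region guidance and an insertion-removal consistency objective to improve the handling of effect regions.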
Details
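Motivation: Existing diffusion-based video inpainting and object removal methods struggle to erase the visual effects produced by target objects (such as deformation, shadows, and reflections) and to synthesize coherent backgrounds. There is also no comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation.
Result: Trained on the VOR dataset, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
Insight: The contributions are: 1) VOR, a large-scale, diverse dataset that systematically covers multiple effect types and complex dynamic scenes; 2) EffectErase, which treats video object insertion as the inverse auxiliary task and jointly optimizes removal and insertion within a reciprocal learning scheme; 3) task-aware region guidance and an insertion-removal consistency objective that improve localization and handling of effect regions.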
Abstract: Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. It also includes an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
[100] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing cs.CVPDF
Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang
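TL;DR: This paper proposes SAMA, a framework that factorizes video editing into two tasks, semantic anchoring and motion alignment, to address the difficulty existing instruction-guided video editing models have in balancing precise semantic modification with faithful motion preservation. The framework uses two-stage optimization; its pre-training stage needs no paired video-instruction editing data and already yields strong zero-shot video editing ability.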
Details
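Motivation: Existing instruction-guided video editing models struggle to achieve precise semantic modification and faithful motion preservation at the same time, and their reliance on external priors limits robustness and generalization.
Result: SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni).
Insight: The novelty lies in factorizing video editing into two sub-problems: semantic anchoring (jointly predicting semantic tokens and video latents at sparse anchor frames for purely instruction-aware structural planning) and motion alignment (pre-training the backbone on motion-centric video restoration pretext tasks so that it internalizes temporal dynamics directly from raw videos). These are learned through two-stage optimization (factorized pre-training followed by supervised fine-tuning). The pre-training stage alone already yields strong zero-shot editing ability, which supports the factorization.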
Abstract: Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
[101] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction cs.CVPDF
Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong
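TL;DR: MonoArt is a unified framework for reconstructing articulated 3D objects from a single image. Through progressive structural reasoning, it converts visual observations step by step into canonical geometry, structured part representations, and motion-aware embeddings, which allows stable inference of articulation parameters.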
Details
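Motivation: Reconstructing articulated 3D objects from a single image requires jointly inferring geometry, part structure, and motion parameters from limited visual evidence, but the entanglement of motion cues with object structure makes direct articulation regression unstable. Existing methods typically rely on multi-view supervision, retrieval-based assembly, or auxiliary video generation, sacrificing scalability or efficiency.
Result: Extensive experiments on the PartNet-Mobility dataset show that the method achieves state-of-the-art (SOTA) performance in both reconstruction accuracy and inference speed.
Insight: The novelty is a unified framework based on progressive structural reasoning that avoids the instability of direct articulation regression and needs no external motion templates or multi-stage pipelines, giving stable and interpretable articulation inference. The framework further generalizes to robotic manipulation and articulated scene reconstruction.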
Abstract: Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
[102] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens cs.CVPDF
Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu
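TL;DR: This paper proposes Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations (e.g., 768-1024 dims). It performs fine-grained masking and prediction over the high-dimensional discrete representation, learning rich correlations within and across spatial positions, with the number of generation steps fixed at T, far smaller than the total number of feature entries. On ImageNet-256, CubiD achieves state-of-the-art discrete generation at scales from 900M to 3.7B parameters, and the authors verify that the discretized tokens preserve the original representation capabilities, so the same tokens can serve both understanding and generation.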
Details
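Motivation: Current discrete-token visual generation methods are limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness needed for understanding. High-dimensional pretrained representations (768-1024 dims) could close this gap, but generating them discretely poses fundamental challenges. The paper targets discrete generation of high-dimensional representations as a step toward more unified multimodal architectures.
Result: On the ImageNet-256 benchmark, CubiD achieves state-of-the-art discrete generation performance and shows strong scaling behavior from 900M to 3.7B parameters.
Insight: The core novelty is Cubic Discrete Diffusion, which masks and predicts any dimension at any position of the high-dimensional discrete representation, learning rich correlations within and across spatial positions with a fixed, small number of generation steps. This shows that high-dimensional discrete tokens can retain both representation and generation capability, offering a new direction for unified multimodal architectures.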
Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation – any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
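The fine-grained masking described in the CubiD abstract can be illustrated with a minimal NumPy sketch. It only shows what it means for any dimension at any position of an h × w × d discrete token tensor to be maskable. The uniform random masking, the codebook size `vocab`, and the reserved `mask_id` are assumptions made for illustration; the abstract does not specify the masking schedule, and this is not the authors' implementation.

```python
import numpy as np

def random_cubic_mask(h, w, d, mask_ratio, rng):
    """Return a boolean (h, w, d) mask; True marks an entry to be masked.

    Entries are drawn uniformly over the whole h*w*d cube, so masking is
    per dimension and per position rather than per spatial token.
    (Uniform sampling is an illustrative assumption.)
    """
    total = h * w * d
    n_masked = int(round(mask_ratio * total))
    flat = np.zeros(total, dtype=bool)
    flat[rng.choice(total, size=n_masked, replace=False)] = True
    return flat.reshape(h, w, d)

def apply_mask(tokens, mask, mask_id):
    """Replace masked entries of a discrete token tensor with mask_id."""
    return np.where(mask, mask_id, tokens)

rng = np.random.default_rng(0)
h, w, d, vocab = 4, 4, 8, 16          # hypothetical sizes
tokens = rng.integers(0, vocab, size=(h, w, d))
mask = random_cubic_mask(h, w, d, mask_ratio=0.5, rng=rng)
masked_tokens = apply_mask(tokens, mask, mask_id=vocab)  # vocab used as [MASK] id
```

A model trained this way would predict the entries where `mask` is True from the unmasked remainder; the predictor itself is outside the scope of this sketch.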
[103] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding cs.CV | cs.ROPDF
Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang
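TL;DR: This paper proposes VEGA-3D, a framework that uses the implicit 3D priors in pre-trained video generation models to strengthen the spatial perception of multimodal large language models and address their spatial blindness. It improves performance on scene understanding, spatial reasoning, and embodied manipulation tasks without explicit 3D supervision.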
Details
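Motivation: Multimodal large language models suffer from spatial blindness and struggle with fine-grained geometric reasoning and physical dynamics. Existing methods rely on explicit 3D modalities or complex geometric scaffolding and are limited by data scarcity and generalization challenges.
Result: On 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks, the method outperforms state-of-the-art baselines, supporting the claim that generative priors provide a scalable foundation for physical-world understanding.
Insight: The novelty lies in using a large-scale video generation model as an implicit world simulator: its spatiotemporal features are extracted and combined with semantic representations through an adaptive gated fusion mechanism, injecting dense geometric cues into the MLLM without explicit 3D supervision. Viewed objectively, repurposing the internal representations of a generative model for understanding tasks is a novel paradigm shift.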
Abstract: While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
cs.HC [Back]
[104] Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning cs.HC | cs.CV | cs.SDPDF
Hung-Yue Suen, Kuo-En Hung, Fan-Hsun Tseng
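TL;DR: This paper proposes a machine learning approach based on speaker expressiveness for predicting audience affective engagement and vocal attractiveness in asynchronous video learning. Using only the speaker's affective expressions, it predicts affective engagement and vocal attractiveness with two separate regression models and validates them on a corpus of Massive Open Online Courses (MOOCs).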
Details
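Motivation: To meet the demand for scalable, privacy-preserving affective computing applications, the paper aims to predict audience feedback from the speaker's affective expressions alone, without audience-side input.
Result: On speaker-independent test sets, both regression models show strong predictive performance: R² = 0.85 for affective engagement and R² = 0.88 for vocal attractiveness, confirming that speaker-side affect can serve as a proxy for aggregated audience feedback.
Insight: The novelty is a purely speaker-centric Emotion AI approach that predicts affective engagement from multimodal features (facial dynamics, oculomotor features, prosody, and cognitive semantics) and predicts vocal attractiveness from acoustic features alone, showing that audience feedback can be forecast prospectively without audience-side information.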
Abstract: This paper outlines a machine learning-enabled speaker-centric Emotion AI approach capable of predicting audience-affective engagement and vocal attractiveness in asynchronous video-based learning, relying solely on speaker-side affective expressions. Inspired by the demand for scalable, privacy-preserving affective computing applications, this speaker-centric Emotion AI approach incorporates two distinct regression models that leverage a massive corpus developed within Massive Open Online Courses (MOOCs) to enable affectively engaging experiences. The regression model predicting affective engagement is developed by assimilating emotional expressions emanating from facial dynamics, oculomotor features, prosody, and cognitive semantics, while incorporating a second regression model to predict vocal attractiveness based exclusively on speaker-side acoustic features. Notably, on speaker-independent test sets, both regression models yielded impressive predictive performance (R² = 0.85 for affective engagement and R² = 0.88 for vocal attractiveness), confirming that speaker-side affect can functionally represent aggregated audience feedback. This paper provides a speaker-centric Emotion AI approach substantiated by an empirical study discovering that speaker-side multimodal features, including acoustics, can prospectively forecast audience feedback without necessarily employing audience-side input information.
cs.AI [Back]
[105] Memento-Skills: Let Agents Design Agents cs.AI | cs.CL | cs.LGPDF
Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong
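TL;DR: This paper proposes Memento-Skills, a generalist, continually learning large language model (LLM) agent system that acts as an “agent-designing agent”: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, in which reusable skills (stored as structured Markdown files) serve as persistent, evolving memory. With the Read-Write Reflective Learning mechanism, the system learns continually without updating LLM parameters; all adaptation happens through the evolution of externalized skills and prompts.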
Details
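Motivation: Existing approaches rely on human-designed agents. The goal is for a generalist agent to design agents end-to-end for new tasks on its own and to keep improving its capabilities.
Result: Experiments on the General AI Assistants benchmark and Humanity’s Last Exam show sustained gains, with relative improvements in overall accuracy of 26.2% and 116.2%, respectively.
Insight: The core novelty is a closed-loop “agent-designing agent” framework. By externalizing skills as structured, evolvable memory (stateful prompts and a skill library) and combining this with Read-Write Reflective Learning, the system achieves continual learning and self-improvement without updating LLM parameters.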
Abstract: We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read–Write Reflective Learning mechanism introduced in Memento2 [wang2025memento2]. In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity’s Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.
[106] RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models cs.AI | cs.CL | cs.LGPDF
Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao
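TL;DR: RewardFlow is a lightweight method for agentic reasoning tasks. It builds state graphs and exploits their intrinsic topology to analyze how each state contributes to success, then quantifies those contributions through topology-aware graph propagation. The resulting objective state-level rewards address the sparsity of terminal rewards in reinforcement learning.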
Details
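Motivation: Reinforcement learning (RL) holds great promise for improving the agentic reasoning of large language models (LLMs) that interact with external environments, but the sparsity of terminal rewards hinders fine-grained, state-level optimization, while training dedicated reward models is computationally expensive and hard to scale.
Result: On four agentic reasoning benchmarks, RewardFlow, integrated as dense rewards into RL optimization, substantially outperforms prior RL baselines in performance, robustness, and training efficiency.
Insight: The novelty lies in building state graphs from the intrinsic topology of states in reasoning trajectories and quantifying state contributions through graph propagation, which yields dense state-level rewards without training a dedicated reward model. It is a lightweight and efficient alternative.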
Abstract: Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
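The idea of turning a sparse terminal reward into state-level rewards by propagating over a state graph can be sketched as follows. The abstract does not state RewardFlow's actual propagation rule, so this sketch substitutes a simple stand-in: each state receives `gamma ** k`, where `k` is its shortest hop distance to a successful terminal state in the merged trajectory graph. The function names, the discount `gamma`, and the toy trajectories are all hypothetical.

```python
from collections import defaultdict, deque

def build_state_graph(trajectories):
    """Merge trajectories into one directed graph; return reverse adjacency."""
    predecessors = defaultdict(set)
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            predecessors[dst].add(src)
    return predecessors

def propagate_rewards(trajectories, success_states, gamma=0.9):
    """Assign gamma**k to every state k hops (shortest) before a success state.

    Breadth-first search backwards from the success states guarantees that
    each state is first reached along its shortest path. States that cannot
    reach any success state get no entry (treat them as reward 0).
    This discount rule is an illustrative assumption, not the paper's method.
    """
    predecessors = build_state_graph(trajectories)
    rewards = {s: 1.0 for s in success_states}
    queue = deque(success_states)
    while queue:
        state = queue.popleft()
        for prev in predecessors[state]:
            if prev not in rewards:
                rewards[prev] = gamma * rewards[state]
                queue.append(prev)
    return rewards

# Two toy trajectories sharing the start state; only the first one succeeds.
trajectories = [["s0", "s1", "goal"], ["s0", "s2", "s3"]]
rewards = propagate_rewards(trajectories, success_states=["goal"])
```

Because the two trajectories are merged into one graph, the shared state `s0` receives credit from the successful rollout even though it also appears in the failed one, which is the kind of cross-trajectory signal a purely terminal reward cannot provide.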
[107] Reasoning over mathematical objects: on-policy reward modeling and test time aggregation cs.AI | cs.CLPDF
Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin
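TL;DR: This paper introduces the Principia suite, which contains training data and benchmarks for deriving mathematical objects, and presents training recipes that use strong LLM judges and verifiers, showing that on-policy judge training boosts performance and that on-policy training can also be used to scale test-time compute via aggregation. Even strong language models such as Qwen3-235B and o3 struggle on Principia, whereas the proposed recipes bring significant improvements across different LLM backbones and also improve results on existing numerical and multiple-choice QA tasks, demonstrating cross-format generalization of reasoning ability.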
Details
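Motivation: Current evaluations of mathematical and scientific reasoning in language models rely heavily on simplified answer formats (numerical values or multiple choice). The paper addresses this to support downstream STEM applications that require precise derivation of mathematical objects as formally structured expressions.
Result: On the Principia benchmark, the proposed training recipes significantly improve the performance of different LLM backbones while also improving results on existing numerical and multiple-choice QA tasks, showing cross-format generalization of reasoning ability.
Insight: The contributions are the Principia suite for deriving mathematical objects, on-policy judge training to boost performance, and on-policy training to enable test-time compute aggregation. Together these strengthen language models on complex mathematical reasoning.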
Abstract: The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.
[108] How Uncertainty Estimation Scales with Sampling in Reasoning Models cs.AI | cs.CL | cs.LGPDF
Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya
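TL;DR: This paper studies how uncertainty estimation scales with the number of samples in chain-of-thought reasoning models. Using two fully black-box methods, verbalized confidence and self-consistency, it characterizes how these signals scale across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities. Both signals scale in reasoning models, but self-consistency has lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most of the uncertainty gain comes from combining signals: with just two samples, a hybrid estimator improves AUROC by up to +12 on average and outperforms either signal alone even at much larger sampling budgets, after which returns diminish. The effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality, stronger complementarity, and faster scaling.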
Details
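Motivation: Uncertainty estimation is critical for deploying reasoning language models, yet it remains poorly understood under extended chain-of-thought reasoning. The paper studies how uncertainty estimation scales with sampling in reasoning models to fill this gap.
Result: Experiments on three reasoning models and 17 tasks show that a hybrid estimator with just two samples improves AUROC by up to +12 on average and still outperforms either signal alone at large sampling budgets. In mathematics, models achieve higher uncertainty quality, along with stronger complementarity and faster scaling.
Insight: The novelty is a systematic study of how uncertainty estimation (specifically verbalized confidence and self-consistency) scales in reasoning models, with the finding that signal combination (the hybrid estimator) improves performance substantially and that the effects are domain-dependent. Viewed objectively, the study offers practical scaling insights and an efficient hybrid strategy for black-box uncertainty estimation.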
Abstract: Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
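The two black-box signals and their combination can be made concrete with a small sketch. Self-consistency is taken here as the fraction of sampled answers that agree with the majority answer, and verbalized confidence as the mean of the confidences the model states. The abstract does not say how the hybrid estimator combines the two, so the weighted average below (and the name `hybrid_confidence` and the `weight` parameter) is an assumption for illustration only.

```python
from collections import Counter

def hybrid_confidence(answers, verbalized, weight=0.5):
    """Combine self-consistency and verbalized confidence for one question.

    answers:    final answers from N parallel samples
    verbalized: the confidence in [0, 1] stated with each sample
    weight:     mixing weight on self-consistency (hypothetical choice)
    Returns (majority_answer, self_consistency, mean_verbalized, hybrid).
    """
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]
    self_consistency = votes / len(answers)
    mean_verbalized = sum(verbalized) / len(verbalized)
    hybrid = weight * self_consistency + (1 - weight) * mean_verbalized
    return majority, self_consistency, mean_verbalized, hybrid

# Two samples that agree on the answer but state different confidences.
majority, sc, vc, hybrid = hybrid_confidence(["42", "42"], [0.9, 0.7])
```

With only two samples, self-consistency can take just the values 0.5 or 1.0, which is one way to see why it discriminates poorly at small budgets and why adding the graded verbalized signal helps.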
[109] Box Maze: A Process-Control Architecture for Reliable LLM Reasoning cs.AI | cs.CLPDF
Zou Qiang
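TL;DR: This paper proposes Box Maze, a process-control architecture for improving the reliability of large language model reasoning. It decomposes LLM reasoning into three explicit layers (memory grounding, structured inference, and boundary enforcement), and preliminary simulation-based experiments suggest it can reduce boundary failure rates under adversarial prompting.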
Details
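Motivation: Large language models are prone to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches (such as RLHF and output filtering) operate mainly at the behavioral level and lack explicit architectural mechanisms for enforcing the integrity of the reasoning process.
Result: In a simulation-based evaluation of progressive boundary erosion scenarios across several heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen), results from n=50 adversarial scenarios suggest that explicit cognitive control layers reduce boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions.
Insight: The novelty is a process-level architectural control method that improves reliability by decomposing reasoning into layers and enforcing boundary constraints. Viewed objectively, making cognitive control mechanisms explicit and embedding them in the architecture offers a new design direction for more robust LLM reasoning.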
Abstract: Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches – such as reinforcement learning from human feedback (RLHF) and output filtering – primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.
[110] Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding cs.AI | cs.CVPDF
Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Junnan Dong
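TL;DR: This paper systematically evaluates how well multimodal large language models understand discrete symbols (such as mathematical formulas and chemical structures). It reveals a counterintuitive phenomenon: models may perform well on complex reasoning tasks yet fail at basic symbol recognition, suggesting that they rely on linguistic probability rather than true visual perception.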
Details
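Motivation: Whether MLLMs can precisely and deeply understand discrete symbols, a cornerstone of human cognition, remains a critical open question. The paper evaluates their performance across five domains: language, culture, mathematics, physics, and chemistry.
Result: Using a comprehensive new benchmark, the paper finds a “cognitive mismatch” in how top-tier MLLMs handle discrete semantic spaces: they fail at basic symbol recognition yet succeed at complex reasoning, exposing a significant gap in current AI capabilities.
Insight: The novelty lies in building a cross-domain benchmark for discrete symbol understanding and exposing the tendency of MLLMs to rely on linguistic priors rather than true visual perception. The work offers a roadmap for developing more rigorous, human-aligned intelligent systems.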
Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols – the fundamental building blocks of human cognition – remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these “discrete semantic spaces” across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this “cognitive mismatch”, we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.
cs.RO [Back]
[111] Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation cs.RO | cs.AI | cs.CL | cs.CV | cs.LGPDF
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen
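TL;DR: This paper proposes MAPG (Multi-Agent Probabilistic Grounding), a framework for vision-language navigation in which a robot must execute complex natural language instructions that contain both semantic and metric constraints. It decomposes a language query into structured subcomponents, grounds each component with a vision-language model (VLM), and probabilistically composes the grounded outputs to produce metrically consistent, actionable decisions in 3D space.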
Details
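Motivation: Existing VLM-based grounding methods struggle with complex language queries that combine semantic references, spatial relations, and metric constraints (e.g., “go two meters to the right of the fridge”), because they are not designed for metric reasoning in physically defined spaces.
Result: On the HM-EQA benchmark, MAPG achieves consistent improvements over strong baselines. The authors also introduce a new benchmark, MAPG-Bench, designed specifically to evaluate metric-semantic goal grounding, and present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
Insight: The main novelty is a multi-agent probabilistic grounding framework that decomposes and structures complex language queries and uses probabilistic composition to compensate for the weakness of VLMs in metric reasoning. The new benchmark fills a gap in existing language grounding evaluations by targeting joint metric-semantic grounding, a specific and important task dimension.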
Abstract: Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as “go two meters to the right of the fridge” requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
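To make the idea of probabilistically composing grounded components concrete, here is a minimal 2D sketch for the command “go two meters to the right of the fridge”. It assumes the anchor object has already been grounded to a map position and that “right” has been resolved to a direction vector in the map frame. The Gaussian factors, their widths, and the grid are illustrative assumptions; the abstract does not describe MAPG's actual factors or how they are composed.

```python
import numpy as np

def compose_goal_distribution(anchor_xy, direction_xy, distance_m,
                              grid_x, grid_y,
                              sigma_dist=0.3, sigma_angle=0.4):
    """Multiply a distance factor and a direction factor over a 2D grid.

    anchor_xy:    grounded position of the reference object (e.g. the fridge)
    direction_xy: map-frame vector for the spatial relation (e.g. "right of")
    distance_m:   metric constraint from the query
    Returns (xs, ys, joint) where joint sums to 1 over the grid.
    """
    xs, ys = np.meshgrid(grid_x, grid_y, indexing="xy")
    dx, dy = xs - anchor_xy[0], ys - anchor_xy[1]
    radius = np.hypot(dx, dy)
    # Metric factor: high where the cell is about distance_m from the anchor.
    dist_factor = np.exp(-0.5 * ((radius - distance_m) / sigma_dist) ** 2)
    # Relation factor: high where the bearing matches the requested direction.
    target_angle = np.arctan2(direction_xy[1], direction_xy[0])
    angle_err = np.arctan2(dy, dx) - target_angle
    angle_err = (angle_err + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    dir_factor = np.exp(-0.5 * (angle_err / sigma_angle) ** 2)
    joint = dist_factor * dir_factor
    joint /= joint.sum()
    return xs, ys, joint

grid = np.linspace(-4.0, 4.0, 33)          # 0.25 m cells, hypothetical map
xs, ys, joint = compose_goal_distribution(
    anchor_xy=(0.0, 0.0), direction_xy=(1.0, 0.0), distance_m=2.0,
    grid_x=grid, grid_y=grid)
iy, ix = np.unravel_index(joint.argmax(), joint.shape)
goal = (float(xs[iy, ix]), float(ys[iy, ix]))
```

Keeping each constraint as its own factor means an uncertain component (say, an ambiguous “right of”) widens the joint distribution instead of silently producing a wrong point estimate.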
[112] Sparse3DTrack: Monocular 3D Object Tracking Using Sparse Supervision cs.RO | cs.AI | cs.CVPDF
Nikhil Gosala, B. Ravi Kiran, Senthil Yogamani, Abhinav Valada
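TL;DR: This paper proposes Sparse3DTrack, a sparsely supervised framework for monocular 3D object tracking. It decomposes the task into two sub-problems, 2D query matching and 3D geometry estimation, uses spatio-temporal consistency to learn rich 2D and 3D scene representations from sparse annotations, and automatically generates high-quality 3D pseudolabels, turning sparse supervision into dense 3D track annotations.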
Details
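Motivation: State-of-the-art monocular 3D object tracking methods depend on dense, expensive 3D annotations over long video sequences, which limits scalability. The paper addresses this fundamental limitation with the first sparsely supervised framework, reducing the dependence on dense annotation.
Result: Extensive experiments on the KITTI and nuScenes datasets show that the method significantly improves tracking performance, by up to 15.50 percentage points, while using at most four ground-truth annotations per track.
Insight: The main novelty is the first sparsely supervised framework for monocular 3D tracking, which converts sparse supervision into dense supervision through task decomposition and pseudolabels generated from spatio-temporal consistency. The core idea, using the model's own learning capacity to amplify a limited supervision signal, is relevant to reducing data annotation costs more broadly.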
Abstract: Monocular 3D object tracking aims to estimate temporally consistent 3D object poses across video frames, enabling autonomous agents to reason about scene dynamics. However, existing state-of-the-art approaches are fully supervised and rely on dense 3D annotations over long video sequences, which are expensive to obtain and difficult to scale. In this work, we address this fundamental limitation by proposing the first sparsely supervised framework for monocular 3D object tracking. Our approach decomposes the task into two sequential sub-problems: 2D query matching and 3D geometry estimation. Both components leverage the spatio-temporal consistency of image sequences to augment a sparse set of labeled samples and learn rich 2D and 3D representations of the scene. Leveraging these learned cues, our model automatically generates high-quality 3D pseudolabels across entire videos, effectively transforming sparse supervision into dense 3D track annotations. This enables existing fully-supervised trackers to effectively operate under extreme label sparsity. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method significantly improves tracking performance, achieving an improvement of up to 15.50 p.p. while using at most four ground truth annotations per track.
[113] DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving cs.RO | cs.AI | cs.CVPDF
Zilin Huang, Zihao Sheng, Zhengyang Wan, Yansong Qu, Junwei You
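TL;DR: This paper proposes DriveVLM-RL, a neuroscience-inspired reinforcement learning framework that integrates vision-language models into autonomous driving decision-making through a dual-pathway architecture, addressing the challenges of safe exploration and real-time deployment. The framework uses VLMs for semantic reward learning during offline training and removes them at deployment to preserve real-time operation. In the CARLA simulator it significantly improves collision avoidance, task success rate, and generalization.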
Details
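Motivation: Traditional reinforcement learning methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich context needed for safe driving and cannot avoid unsafe exploration in the real world. Existing vision-language models offer semantic understanding, but their high inference latency and susceptibility to hallucination prevent direct use in real-time vehicle control.
Result: Experiments in the CARLA simulator show that the method significantly improves collision avoidance, task success rate, and generalization across diverse traffic scenarios, and remains robust even in settings without explicit collision penalties.
Insight: The novelty lies in the neuroscience-inspired dual-pathway architecture (a Static Pathway for continuous spatial safety assessment and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning) and the asynchronous training pipeline, which together inject the semantic capabilities of foundation models into the RL policy offline while keeping deployment feasible in real time.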
Abstract: Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrasting language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: https://zilin-huang.github.io/DriveVLM-RL-website/
[114] REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation cs.RO | cs.AI | cs.CV
Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong
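TL;DR: This paper proposes REST (Receding Horizon Explorative Steiner Tree), a training-free framework for zero-shot object-goal navigation (ZSON). Its core idea is to structure the option space as a tree of paths: it builds an open-vocabulary 3D map online, generates an agent-centric tree of paths via sampling-based planning, and uses chain-of-thought reasoning by a large language model (LLM) to select the best path.
Details
Motivation: Existing hierarchical training-free methods for zero-shot object-goal navigation focus on scene understanding and high-level decision-making but overlook the design of the "option", i.e., a subgoal candidate proposed from the evolving belief. They reduce options to isolated waypoints that are scored independently, which ignores the information gained en route, and an unstructured set of candidates obscures the relationships among them.
Result: On the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, showing a favorable balance between efficiency and success.
Insight: The core contribution is to structure the option space as a "tree of paths". Full paths expose the en-route information gain that destination-only scoring systematically neglects, while the tree of shared path segments lets the LLM reason from coarse to fine (dismissing or pursuing entire branches before examining individual leaves), compressing the combinatorial path space into an efficient hierarchy. The method realizes this idea through an explicit map, path-tree planning, and LLM reasoning.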
Abstract: Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (belief) and high-level decision-making (policy), yet overlook the design of option, i.e., a subgoal candidate proposed from evolving belief and presented to policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a tree of paths. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in REST (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.
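The "tree of shared segments" idea in the REST abstract can be illustrated with a small prefix tree. This is only a sketch: the waypoint names, the `PathNode` class, and the helper functions are hypothetical, and it does not reproduce REST's sampling-based planner, its Steiner-tree construction, or its LLM prompting. It only shows how merging full paths by shared prefix lets a whole branch be dismissed in one step before any leaf is examined.

```python
from dataclasses import dataclass, field


@dataclass
class PathNode:
    """One waypoint in the tree; children are keyed by the next waypoint name."""
    name: str
    children: dict = field(default_factory=dict)


def build_path_tree(root_name, paths):
    """Merge full paths (lists of waypoint names) into a tree of shared segments."""
    root = PathNode(root_name)
    for path in paths:
        node = root
        for waypoint in path:
            # Reuse the existing child when two paths share this segment.
            node = node.children.setdefault(waypoint, PathNode(waypoint))
    return root


def prune_branch(root, prefix):
    """Dismiss every path that begins with the non-empty `prefix` in one step."""
    node = root
    for waypoint in prefix[:-1]:
        node = node.children[waypoint]
    node.children.pop(prefix[-1], None)


def leaf_paths(node, so_far=()):
    """List the remaining root-to-leaf paths, i.e. the surviving options."""
    if not node.children:
        return [list(so_far)]
    out = []
    for child in node.children.values():
        out.extend(leaf_paths(child, so_far + (child.name,)))
    return out


# Hypothetical candidate paths starting at the agent.
paths = [
    ["hallway", "kitchen", "counter"],
    ["hallway", "kitchen", "pantry"],
    ["hallway", "bedroom"],
    ["garage"],
]
tree = build_path_tree("agent", paths)
prune_branch(tree, ["hallway", "kitchen"])  # coarse decision: drop the kitchen branch
remaining = leaf_paths(tree)
```

After pruning, `remaining` holds only the bedroom and garage paths, so the two kitchen leaves never need to be scored individually.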
[115] FASTER: Rethinking Real-Time Flow VLAs cs.RO | cs.CV
Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou
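TL;DR: This paper proposes FASTER. By rethinking the notion of reaction in action chunking policies, it systematically analyzes the factors that govern reaction time and shows that the constant schedule used in flow-based VLA models is an efficiency bottleneck. FASTER introduces a Horizon-Aware Schedule that adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction tenfold into a single step while preserving long-horizon trajectory quality. Combined with a streaming client-server pipeline, it substantially reduces the effective reaction latency on real robots.
Details
Motivation: Existing asynchronous inference methods mainly optimize trajectory smoothness but neglect the critical latency in reacting to environmental changes, which hinders real-time deployment of Vision-Language-Action (VLA) models in the physical world.
Result: In real-world experiments, including a highly dynamic table tennis task, FASTER substantially reduces the effective reaction latency, particularly when deployed on consumer-grade GPUs, giving generalist policies unprecedented real-time responsiveness and enabling rapid generation of accurate and smooth trajectories.
Insight: The novelty lies in showing that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon, and in proposing a Horizon-Aware Schedule that breaks the reaction-latency bottleneck by prioritizing the denoising of near-term actions without sacrificing long-horizon trajectory quality.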
Abstract: Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
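The abstract states that reaction time is uniformly distributed and determined by TTFA and the execution horizon, but it does not give the bounds. The toy simulation below makes one plausible assumption explicit: a new inference query starts every `horizon` seconds, an event arrives at a random moment, and the first action that reflects the event is executed `ttfa` seconds after the next query starts. Under that assumption (ours, not necessarily the paper's exact model) the reaction time is uniform on the interval from `ttfa` to `ttfa + horizon`. All names and numbers are illustrative.

```python
import math
import random


def reaction_time(event_time, ttfa, horizon):
    """Delay from an event until the first action that can reflect it.

    Assumption: queries start at 0, horizon, 2*horizon, ...; the event is only
    seen by the next query, whose first action is available `ttfa` later.
    """
    next_query = math.ceil(event_time / horizon) * horizon
    return (next_query - event_time) + ttfa


def simulate(ttfa, horizon, n=100_000, seed=0):
    """Draw event times uniformly over many cycles and collect reaction times."""
    rng = random.Random(seed)
    span = 1000.0 * horizon
    return [reaction_time(rng.uniform(0.0, span), ttfa, horizon) for _ in range(n)]


samples = simulate(ttfa=0.10, horizon=0.50)
mean_reaction = sum(samples) / len(samples)
```

In this toy model the mean reaction time is `ttfa + horizon / 2`, which makes it clear why shrinking TTFA (the quantity FASTER targets) lowers every sample by the same amount.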
[116] NavTrust: Benchmarking Trustworthiness for Embodied Navigation cs.RO | cs.AI | cs.CV | cs.LG | eess.SY
Huaide Jiang, Yash Chaudhary, Yuping Wang, Zehao Wang, Raghav Sharma
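TL;DR: This paper presents NavTrust, a unified benchmark for evaluating the robustness of embodied navigation systems. It simulates a range of realistic corruptions of the RGB, depth, and instruction input modalities, measures how existing methods perform under them, and explores four mitigation strategies for improving robustness.
Details
Motivation: Existing embodied navigation research mainly evaluates models under nominal conditions and overlooks the input corruptions that can arise in real environments, so a systematic benchmark is needed to evaluate and improve the robustness and trustworthiness of navigation systems.
Result: An evaluation of seven state-of-the-art methods on NavTrust (including Uni-NaVid and ETPNav) shows substantial performance degradation under realistic corruptions; with the mitigation strategies, models deployed on a real mobile robot showed improved robustness to corruptions.
Insight: The novelty lies in being the first to systematically introduce RGB-Depth corruptions and instruction variations within a unified framework for evaluating embodied-navigation robustness, exposing the fragility of existing methods and providing a benchmark and a roadmap of mitigation strategies for building more trustworthy navigation systems.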
Abstract: There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
cs.CE [Back]
[117] FinTradeBench: A Financial Reasoning Benchmark for LLMs cs.CE | cs.AI | cs.CL | cs.IR | q-fin.CP
Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta
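TL;DR: This paper introduces FinTradeBench, a benchmark for evaluating the financial reasoning of large language models. It integrates company fundamentals with trading signals and contains 1,400 questions grounded in ten years of historical data on NASDAQ-100 companies, organized into three categories: fundamentals-focused, trading-signal-focused, and hybrid reasoning.
Details
Motivation: Existing financial question-answering benchmarks focus mainly on company balance sheet data and rarely evaluate reasoning about how company stocks trade in the market or how that trading interacts with fundamentals, so a benchmark that integrates both kinds of signal is needed to test LLMs' financial decision-making comprehensively.
Result: Fourteen LLMs were evaluated under zero-shot prompting and retrieval-augmented settings. Retrieval substantially improves reasoning over textual fundamentals but offers limited benefit for trading-signal reasoning, revealing fundamental challenges for current LLMs in numerical and time-series reasoning.
Insight: The novelty lies in building a benchmark that fuses heterogeneous financial signals and in adopting a calibration-then-scaling framework to ensure reliability. Viewed objectively, the study highlights the limitations of LLMs on complex numerical reasoning tasks and points to directions for future research in financial intelligence.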
Abstract: Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.
cs.LG [Back]
[118] Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning cs.LG | cs.CL
Yinan Xia, Haotian Zhang, Huiming Wang
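TL;DR: This paper proposes Difficulty-Differentiated Policy Optimization (DDPO), a reinforcement learning algorithm that targets two problems of large reasoning models (LRMs): "overthinking" (generating long, redundant answers) and "overconfidence" (generating short but incorrect answers to problems beyond the model's capabilities). The method optimizes simple and complex tasks separately according to task difficulty and, based on a theoretical derivation, redistributes answer length to maximize expected accuracy.
Details
Motivation: Large reasoning models suffer from overthinking and overconfidence, which leads to a poor trade-off between answer length and accuracy and hurts their efficiency and robustness.
Result: Extensive experiments on in-domain and out-of-domain benchmarks show that, compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length.
Insight: The core contribution is difficulty-differentiated policy optimization, together with the proposal, derived from the theoretical conditions for maximizing expected accuracy, to use the difficulty-level average length as a well-founded reference for length optimization, so that the reasoning load is allocated more efficiently. This offers new theoretical guidance and a practical method for improving the reasoning efficiency of large language models.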
Abstract: Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model’s capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at https://github.com/Yinan-Xia/DDPO.
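The abstract proposes the difficulty-level average as a reference for length optimization. The sketch below shows one way such a reference could be computed. It is not the DDPO implementation: the use of an empirical pass rate as the difficulty signal, the two thresholds, and all function names are assumptions made for illustration, and the way the resulting deviation enters the reward is not specified in the abstract, so it is left out.

```python
from collections import defaultdict


def difficulty_level(pass_rate, thresholds=(0.3, 0.7)):
    """Map an empirical pass rate to a coarse level: 0 = hard, 1 = medium, 2 = easy.

    The thresholds are hypothetical, chosen only for this example.
    """
    return sum(pass_rate >= t for t in thresholds)


def level_average_lengths(samples):
    """samples: iterable of (pass_rate, answer_length). Returns {level: mean length}."""
    buckets = defaultdict(list)
    for pass_rate, length in samples:
        buckets[difficulty_level(pass_rate)].append(length)
    return {level: sum(v) / len(v) for level, v in buckets.items()}


def length_deviation(pass_rate, length, references):
    """Signed gap between an answer's length and its level's reference length."""
    return length - references[difficulty_level(pass_rate)]


# Hypothetical (pass_rate, answer_length_in_tokens) pairs.
samples = [(0.9, 200), (0.8, 400), (0.5, 600), (0.1, 300), (0.2, 500)]
references = level_average_lengths(samples)
gap = length_deviation(0.9, 200, references)
```

A negative `gap` means the answer is shorter than the average for its difficulty level; a positive one means it is longer.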
[119] HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning cs.LG | cs.AI | cs.CL
Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye
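TL;DR: This paper proposes HISR, a method that uses hindsight information to modulate segmental process rewards, addressing the delayed propagation of sparse outcome rewards and unreliable credit assignment in multi-turn agentic reinforcement learning. It designs a segment-level process reward model that assigns a reward to each sub-goal in the task and uses a hindsight model to assess action importance, making credit assignment more reliable.
Details
Motivation: Although large language models excel in many domains, their performance on complex long-horizon agentic decision-making tasks remains limited. Most existing methods focus on designing effective reward models, but they suffer from delayed propagation of sparse outcome rewards and from unreliable credit assignment caused by overly fine-grained and unfocused turn-level process rewards.
Result: Extensive experiments on three public benchmarks show that the method improves performance, validating its effectiveness.
Insight: The novelty lies in using hindsight information to modulate segmental process rewards, aligning rewards closely with sub-goals and emphasizing important trajectory segments, which makes credit assignment more reliable. This avoids the overly fine-grained reward allocation of conventional methods and quantifies action importance through ratios of sequence likelihoods, providing a more reliable reward signal for multi-turn reinforcement learning.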
Abstract: While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited. Most existing methods concentrate on designing effective reward models (RMs) to advance performance via multi-turn reinforcement learning. However, they suffer from delayed propagation in sparse outcome rewards and unreliable credit assignment with potentially overly fine-grained and unfocused turn-level process rewards. In this paper, we propose HISR, which exploits Hindsight Information to modulate Segmental process Rewards, closely aligning rewards with sub-goals and emphasizing significant segments to enhance the reliability of credit assignment. Specifically, a segment-level process RM is presented to assign rewards for each sub-goal in the task, avoiding excessively granular allocation to turns. To emphasize significant segments in the trajectory, a hindsight model is devised to reflect the preference for performing a certain action after knowing the trajectory outcome. With this characteristic, we use the ratios of sequence likelihoods between the hindsight and policy models to measure action importance. The ratios are subsequently aggregated into segment importance scores, which in turn modulate segmental process rewards, enhancing credit assignment reliability. Extensive experimental results on three publicly available benchmarks demonstrate the validity of our method.
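The pipeline described in the abstract (likelihood ratios per action, aggregated into segment importance scores that modulate segment rewards) can be sketched as follows. The abstract does not say how ratios are aggregated or normalized, so the mean-per-segment aggregation and the normalization to an average weight of 1 are assumptions, as are all function names and the example numbers.

```python
import math


def action_importance(hindsight_logps, policy_logps):
    """Likelihood ratio p_hindsight(a) / p_policy(a) per action, from log-likelihoods."""
    return [math.exp(h - p) for h, p in zip(hindsight_logps, policy_logps)]


def segment_weights(ratios, segments):
    """Aggregate action ratios into one weight per segment.

    segments: list of (start, end) index pairs, end exclusive.
    Assumption: use the mean ratio per segment, scaled so weights average to 1.
    """
    scores = [sum(ratios[s:e]) / (e - s) for s, e in segments]
    mean_score = sum(scores) / len(scores)
    return [score / mean_score for score in scores]


def modulate(segment_rewards, weights):
    """Scale each segment-level process reward by its importance weight."""
    return [r * w for r, w in zip(segment_rewards, weights)]


# Hypothetical trajectory with four actions split into two segments.
hindsight_logps = [math.log(0.5), math.log(0.5), math.log(0.8), math.log(0.8)]
policy_logps = [math.log(0.5), math.log(0.5), math.log(0.2), math.log(0.4)]
ratios = action_importance(hindsight_logps, policy_logps)
weights = segment_weights(ratios, [(0, 2), (2, 4)])
rewards = modulate([1.0, 1.0], weights)
```

Here the second segment contains actions that the hindsight model prefers far more than the policy does, so its reward is scaled up relative to the first segment.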
[120] CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks cs.LG | cs.AI | cs.CL | stat.ML
Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu
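TL;DR: This paper proposes CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational user feedback (e.g., clicks, copies, and upvotes) as a replacement for the costly, controlled human-annotated feedback used in conventional RLHF. It introduces a noise-aware surrogate loss term by explicitly modeling the annotation error generation process and reweights training samples with propensity scores, addressing the noise and the user preference bias in observational feedback.
Details
Motivation: Current RLHF reward modeling relies heavily on costly feedback data collected from human annotators under controlled conditions. The paper proposes reward modeling from scalable, cost-effective observational user feedback and tackles its two inherent challenges: noise and user preference bias.
Result: Extensive experiments across diverse LLM backbones and benchmark datasets validate CausalRM: it learns accurate reward signals from noisy and biased observational feedback and delivers substantial improvements on downstream RLHF tasks, including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench.
Insight: The novelty lies in bringing causal theory to observational reward modeling: a noise-aware loss and propensity-score reweighting formally handle annotation errors and user selection bias, respectively, providing theoretical guarantees and a practical framework for learning reliable reward functions from natural, biased interaction data.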
Abstract: Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling – learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) – as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, which makes it deviate from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores – the probability of a user providing feedback for a given response – to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks – including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.
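The propensity reweighting mentioned for challenge (2) can be illustrated with the standard inverse-propensity-scoring (IPS) estimator. This is a generic sketch, not CausalRM's loss: the noise-aware surrogate term is omitted, propensities are assumed known rather than estimated, and every name and number is hypothetical. The example enumerates all feedback patterns exactly to check that, in expectation, the reweighted loss over observed samples equals the loss over all samples.

```python
import itertools
import math


def bce(pred, label):
    """Binary cross-entropy of a predicted probability against a 0/1 label."""
    return -(label * math.log(pred) + (1 - label) * math.log(1 - pred))


def full_loss(preds, labels):
    """Ideal loss if feedback were available for every response."""
    return sum(bce(p, y) for p, y in zip(preds, labels)) / len(preds)


def ips_loss(preds, labels, observed, propensities):
    """IPS loss: each observed sample is weighted by 1 / propensity."""
    total = 0.0
    for p, y, o, e in zip(preds, labels, observed, propensities):
        if o:
            total += bce(p, y) / e
    return total / len(preds)


def expected_ips_loss(preds, labels, propensities):
    """Exact expectation of the IPS loss over all observation patterns."""
    expectation = 0.0
    for mask in itertools.product([0, 1], repeat=len(preds)):
        prob = 1.0
        for o, e in zip(mask, propensities):
            prob *= e if o else (1.0 - e)
        expectation += prob * ips_loss(preds, labels, mask, propensities)
    return expectation


# Hypothetical reward-model outputs, true labels, and feedback propensities.
preds = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
propensities = [0.8, 0.3, 0.5]
ideal = full_loss(preds, labels)
reweighted = expected_ips_loss(preds, labels, propensities)
```

`ideal` and `reweighted` agree up to floating-point error, which is the sense in which propensity weighting removes the selection bias; in practice the weights can have high variance when propensities are small.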
[121] Are complicated loss functions necessary for teaching LLMs to reason? cs.LG | cs.AI | cs.CL
Gabriele Carrino, Andrea Sassella, Nicolo Brunello, Federico Toschi, Mark James Carman
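TL;DR: Through a systematic analysis of GRPO, this paper finds that the PPO-style constraints in its complex loss function are not necessary for improving the reasoning ability of large language models. It proposes RGRA, a simplified method that retains group relative advantage estimation but removes PPO-style clipping and policy ratio terms, and that shows the potential to outperform GRPO on mathematical reasoning benchmarks.
Details
Motivation: To investigate whether complex loss functions such as GRPO's are necessary for training reasoning ability in large language models, in search of simpler and more efficient training methods.
Result: On standard mathematical reasoning benchmarks, the proposed RGRA method shows the potential to achieve stronger performance than GRPO.
Insight: The paper's contribution lies in showing the importance of negative feedback for training and in demonstrating that a simplified REINFORCE-based method (with PPO-style clipping removed) can effectively improve LLM reasoning, offering a more transparent and efficient alternative for post-training.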
Abstract: Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this domain by combining group relative advantage estimation, PPO-style clipping, and KL regularization. However, its complexity raises the question of whether all components are necessary for fostering reasoning behaviors. We conduct a systematic analysis of GRPO and identify two key findings: (1) incorporating negative feedback is essential: training solely on actions above a baseline limits learning; and (2) PPO-style constraints, such as policy ratio clipping, are not required to improve mathematical reasoning or performance. Building on these insights, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that retains group relative advantage estimation but removes PPO-style clipping and policy ratio terms. Experiments across standard mathematical benchmarks indicate that RGRA has the potential to achieve stronger performance than GRPO. Our results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs, offering a more transparent and efficient alternative to GRPO.
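A REINFORCE-style loss with group relative advantages, as described in the abstract, can be sketched in a few lines. The normalization by the group standard deviation follows the usual GRPO formulation and is assumed here to carry over to RGRA; whether RGRA keeps a KL term is not stated in the abstract, so none is included. Sequence log-probabilities are passed as plain floats, and all names and numbers are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def reinforce_loss(seq_logprobs, advantages):
    """Plain REINFORCE objective: no policy ratio and no clipping."""
    n = len(seq_logprobs)
    return -sum(a * lp for a, lp in zip(advantages, seq_logprobs)) / n


# Hypothetical group of four sampled answers: two correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
seq_logprobs = [-2.0, -1.0, -3.0, -4.0]
advantages = group_relative_advantages(rewards)
loss = reinforce_loss(seq_logprobs, advantages)
```

Incorrect answers receive negative advantages, so minimizing the loss pushes their log-probability down; this is the negative feedback that the abstract identifies as essential.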
cs.SD [Back]
[122] Few-shot Acoustic Synthesis with Multimodal Flow Matching cs.SD | cs.CV | eess.AS
Amandine Brunetto
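TL;DR: This paper proposes FLAC, a few-shot acoustic synthesis method. Built on a probabilistic flow-matching model, it generates plausible room impulse responses (RIRs) from sparse scene context (such as spatial, geometric, and acoustic cues), enabling acoustically consistent audio generation in virtual environments.
Details
Motivation: Existing neural acoustic field methods typically require dense audio measurements and costly per-scene training, which limits scalability. Few-shot methods improve on this but still rely on multiple recordings and are deterministic, so they cannot capture the inherent uncertainty of scene acoustics under sparse context.
Result: On the AcousticRooms and Hearing Anything Anywhere datasets, FLAC outperforms state-of-the-art eight-shot baselines using only a single sample (one-shot).
Insight: It is the first work to apply generative flow matching to explicit RIR synthesis, opening a new direction for robust and data-efficient acoustic synthesis. It also introduces AGREE, a joint acoustic-geometry embedding, which enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics.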
Abstract: Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
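For readers unfamiliar with the flow-matching objective named in the abstract, here is a minimal sketch of its most common form, assuming a straight-line path between a noise sample and a data sample. Whether FLAC uses this exact path is not stated in the abstract, and the scene conditioning and the diffusion transformer are omitted; `velocity_fn` stands in for the learned network, and the three-element "RIR" is a toy vector.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t = 0) to data x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, which is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Generate a sample by integrating the velocity field from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an oracle velocity field for a single noise/data pair.
noise = [0.0, 1.0, -1.0]
rir = [0.5, -0.25, 0.125]
oracle = lambda x, t: target_velocity(noise, rir)
oracle_loss = flow_matching_loss(oracle, noise, rir, t=0.3)
generated = euler_sample(oracle, noise)
```

With the oracle field the loss is zero and Euler integration carries the noise sample onto the target; a trained model approximates this field on average over many noise/data pairs, which is what lets sampling produce a distribution of plausible outputs rather than a single deterministic one.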