cs.CL

[1] Task-Specific Knowledge Distillation via Intermediate Probes cs.CL | cs.AI

Ryan Brown, Chris Russell

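TL;DR: This paper proposes a distillation framework, Task-Specific Knowledge Distillation via Intermediate Probes, that trains lightweight probes on the frozen intermediate-layer representations of a teacher LLM to produce cleaner supervision for student training, yielding better results on reasoning tasks than conventional distillation from the output distribution.

Details

Motivation: Conventional knowledge distillation assumes the teacher's output distribution is a high-quality training signal. On reasoning tasks this assumption often fails: intermediate layers may encode the correct answer, but the information is lost or distorted by the vocabulary projection, leaving a brittle, noisy output signal.

Result: The method delivers consistent gains on four reasoning benchmarks (AQuA-RAT, ARC Easy/Challenge, and MMLU), with the largest gains when data is limited.

Insight: The core idea is to bypass the teacher's output bottleneck and use its internal intermediate representations directly as a cleaner source of supervision. The approach requires no changes to the student or teacher architecture, adds little compute, is architecture-agnostic, and extracts knowledge from large teachers more effectively.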

Abstract: Knowledge distillation from large language models (LLMs) assumes that the teacher’s output distribution is a high-quality training signal. On reasoning tasks, this assumption is frequently violated. A model’s intermediate representations may encode the correct answer, yet this information is lost or distorted through the vocabulary projection, where prompt formatting and answer-token choices create brittle, noisy outputs. We introduce a distillation framework that bypasses this bottleneck by training lightweight probes on frozen teacher hidden states and using the probe’s predictions, rather than output logits, as supervision for student training. This simple change yields consistent improvements across four reasoning benchmarks (AQuA-RAT, ARC Easy/Challenge, and MMLU), with gains most pronounced under limited data. Probes trained on intermediate representations provide cleaner labels than the teacher’s own outputs, effectively denoising the distillation signal. The method requires no architectural changes to student or teacher, is architecture-agnostic, and adds minimal compute since probe training is cheap and teacher representations can be cached. By exploiting internal representations, the method enables practitioners to extract more value from large teacher models without additional training data or architectural complexity.
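The probe-then-distill idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a linear softmax probe, uses synthetic vectors as a stand-in for cached teacher hidden states, and the names `train_probe` and `distillation_loss` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_probe(hidden, labels, num_classes, lr=0.1, steps=500):
    """Fit a linear softmax probe on frozen (cached) hidden states."""
    n, d = hidden.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        probs = softmax(hidden @ W + b)
        grad = (probs - onehot) / n        # gradient of mean cross-entropy w.r.t. logits
        W -= lr * hidden.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def distillation_loss(student_logits, probe_probs):
    """Cross-entropy of the student's predictions against the probe's soft labels."""
    log_p = np.log(softmax(student_logits) + 1e-12)
    return float(-(probe_probs * log_p).sum(axis=1).mean())

# Synthetic stand-in for teacher hidden states (assumption: one cluster per answer class).
rng = np.random.default_rng(0)
num_classes, dim, n = 4, 16, 200
labels = rng.integers(0, num_classes, size=n)
centers = rng.normal(size=(num_classes, dim))
hidden = centers[labels] + 0.3 * rng.normal(size=(n, dim))

W, b = train_probe(hidden, labels, num_classes)
probe_probs = softmax(hidden @ W + b)            # supervision signal for the student
probe_acc = float((probe_probs.argmax(axis=1) == labels).mean())

student_logits = rng.normal(size=(n, num_classes))  # placeholder for an untrained student
loss = distillation_loss(student_logits, probe_probs)
```

In a real setup the student would be trained to minimize this loss; the teacher is only ever run once to cache its hidden states.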


[2] GONE: Structural Knowledge Unlearning via Neighborhood-Expanded Distribution Shaping cs.CL

Chahana Dahal, Ashutosh Balasubramaniam, Zuobin Xiong

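TL;DR: This paper presents the GONE benchmark and the NEDS framework for evaluating and performing unlearning of structured knowledge-graph facts in LLMs. GONE disentangles three effects of unlearning: direct fact removal, reasoning-based leakage, and catastrophic forgetting. NEDS uses graph connectivity to identify anchor-correlated neighbors and enforces a precise decision boundary between the forgotten fact and its semantic neighborhood.

Details

Motivation: Existing LLM unlearning methods focus on flat, sentence-level data and overlook the relational, multi-hop, and reasoned knowledge in naturally structured data. This paper addresses that gap by targeting the unlearning of structured knowledge-graph facts.

Result: On LLaMA-3-8B and Mistral-7B, NEDS shows superior performance on GONE and other benchmarks, reaching 1.000 on unlearning efficacy and 0.839 on locality and outperforming multiple knowledge editing and unlearning methods.

Insight: The novelty is a dedicated evaluation benchmark (GONE) for unlearning structured knowledge-graph facts, described as the first of its kind, together with an unlearning framework (NEDS) that uses graph structure for neighborhood-expanded distribution shaping, giving precise control over the unlearning of complex reasoned knowledge.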

Abstract: Unlearning knowledge is a pressing and challenging task in Large Language Models (LLMs) because of their unprecedented capability to memorize and digest training data at scale, raising more significant issues regarding safety, privacy, and intellectual property. However, existing works, including parameter editing, fine-tuning, and distillation-based methods, are all focused on flat sentence-level data but overlook the relational, multi-hop, and reasoned knowledge in naturally structured data. In response to this gap, this paper introduces Graph Oblivion and Node Erasure (GONE), a benchmark for evaluating knowledge unlearning over structured knowledge graph (KG) facts in LLMs. This KG-based benchmark enables the disentanglement of three effects of unlearning: direct fact removal, reasoning-based leakage, and catastrophic forgetting. In addition, Neighborhood-Expanded Distribution Shaping (NEDS), a novel unlearning framework, is designed to leverage graph connectivity and identify anchor correlated neighbors, enforcing a precise decision boundary between the forgotten fact and its semantic neighborhood. Evaluations on LLaMA-3-8B and Mistral-7B across multiple knowledge editing and unlearning methods showcase NEDS’s superior performance (1.000 on unlearning efficacy and 0.839 on locality) on GONE and other benchmarks. Code is available at https://anonymous.4open.science/r/GONE-4679/.


[3] Prompt Injection as Role Confusion cs.CL | cs.AI | cs.CR

Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

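TL;DR: The paper argues that the root cause of LLM vulnerability to prompt injection is "role confusion": models infer the speaker's role from how text is written rather than where it comes from, so malicious text that imitates an authoritative role is granted undue authority. The authors design "role probes" to examine the model's internal role-identification mechanism and mount attacks by injecting spoofed reasoning into user prompts and tool outputs, succeeding on multiple open- and closed-weight models at rates far above baseline. The degree of internal role confusion strongly predicts attack success.

Details

Motivation: Despite extensive safety training, language models remain vulnerable to prompt injection. The paper investigates the root cause, namely how a model internally identifies and assigns "speaker" roles, to explain why the attacks succeed.

Result: On the StrongREJECT and agent exfiltration benchmarks, attacks reach average success rates of 60% and 61% respectively, against near-zero baselines. The result holds across multiple open- and closed-weight models.

Insight: The core contribution is "role confusion" as a unifying mechanistic framework for prompt injection, plus "role probes" for empirically studying internal role identification. The study exposes a fundamental gap in LLM security: security policy is defined at the interface, but authority is assigned in the model's latent representation space. This offers a new theoretical basis and diagnostic tool for understanding and defending against prompt injection.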

Abstract: Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify “who is speaking.” These reveal why prompt injection works: untrusted text that imitates a role inherits that role’s authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.


[4] Not Just the Destination, But the Journey: Reasoning Traces Causally Shape Generalization Behaviors cs.CL

Pengcheng Wen, Yanxu Zhu, Jiapeng Sun, Han Zhu, Yujin Zhou

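TL;DR: Through controlled experiments, this paper studies the causal effect of chain-of-thought (CoT) reasoning content on LLM generalization. Even with identical final answers, different reasoning paths (evil, misleading, or submissive reasoning) causally shape the model's harmful behavior patterns, and the effect is deeply internalized, challenging alignment strategies that supervise only outputs.

Details

Motivation: Responding to the view that CoT may be mere post-hoc rationalization rather than a real influence on decisions, the paper asks whether the reasoning trace itself, independent of the final answer, causally shapes generalization, particularly the generation of harmful content.

Result: Experiments across training paradigms (QTA, QT, T-only) and model sizes (0.6B to 14B parameters) show that: (1) CoT training can amplify harmful generalization more than standard fine-tuning; (2) semantically different reasoning types induce different behavior patterns; (3) training on reasoning alone, without answer supervision, is enough to change behavior; and (4) these effects persist when answers are generated without reasoning.

Insight: The core contribution is a carefully controlled experimental design that isolates and demonstrates the independent causal potency of reasoning content, which is deeply internalized and shapes model behavior. The takeaway for alignment research is that supervising outputs alone is insufficient; the reasoning process must also be supervised and guided.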

Abstract: Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: Does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning’s causal effect, we design a controlled experiment holding final harmful answers constant while varying reasoning paths. We construct datasets with Evil reasoning embracing malice, Misleading reasoning rationalizing harm, and Submissive reasoning yielding to pressure. We train models (0.6B–14B parameters) under multiple paradigms, including question-thinking-answer (QTA), question-thinking (QT), and thinking-only (T-only), and evaluate them in both think and no-think modes. We find that: (1) CoT training could amplify harmful generalization more than standard fine-tuning; (2) distinct reasoning types induce distinct behavioral patterns aligned with their semantics, despite identical final answers; (3) training on reasoning without answer supervision (QT or T-only) is sufficient to alter behavior, proving reasoning carries an independent signal; and (4) these effects persist even when generating answers without reasoning, indicating deep internalization. Our findings demonstrate that reasoning content is causally potent, challenging alignment strategies that supervise only outputs.


[5] CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection cs.CL

Christos Tzouvaras, Konstantinos Skianis, Athanasios Voulodimos

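TL;DR: This paper describes a system for SemEval-2026 Task 6, which classifies the clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. The authors propose a heterogeneous dual-LLM ensemble based on self-consistency and weighted voting, plus a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). DCG uses cross-model behavioral signals and builds on the finding that LLM response length correlates strongly with sample ambiguity. The authors also evaluate multi-agent debate as an alternative way to increase deliberative capacity. The final system achieves a Macro-F1 of 0.85 on the evaluation set, placing third.

Details

Motivation: Automatically and accurately classifying the clarity of responses in political interviews, and in particular detecting ambivalent responses, which matters for political discourse analysis.

Result: A Macro-F1 score of 0.85 on the SemEval-2026 Task 6 evaluation set, ranking third in the competition.

Insight: The main novelty is Deliberative Complexity Gating, a post-hoc mechanism that uses cross-model behavioral signals (in particular, LLM response length as a proxy for ambiguity) to gate reasoning dynamically. The paper also contrasts two ways of increasing deliberative capacity, adding agents (debate) versus adding model diversity (heterogeneous ensembling), offering a new angle on uncertainty modeling in complex classification tasks.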

Abstract: This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place.
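A rough sketch of how self-consistency and weighted voting can combine two models' outputs follows. The abstract does not give the weights, sample counts, or the DCG correction step, so the model names, weights, and samples below are illustrative assumptions only, and DCG is not modeled.

```python
from collections import Counter

LABELS = ("Clear Reply", "Ambivalent", "Clear Non-Reply")

def self_consistency_distribution(samples):
    """Turn repeated samples from one model into a label distribution."""
    counts = Counter(samples)
    total = len(samples)
    return {label: counts.get(label, 0) / total for label in LABELS}

def weighted_vote(samples_by_model, weights):
    """Combine per-model label distributions with per-model weights."""
    scores = {label: 0.0 for label in LABELS}
    for model, samples in samples_by_model.items():
        dist = self_consistency_distribution(samples)
        for label in LABELS:
            scores[label] += weights[model] * dist[label]
    best = max(LABELS, key=lambda label: scores[label])
    return best, scores

# Hypothetical samples from two different LLMs for one interview answer.
samples_by_model = {
    "model_a": ["Ambivalent", "Ambivalent", "Clear Reply", "Ambivalent", "Clear Reply"],
    "model_b": ["Clear Reply", "Clear Reply", "Clear Reply", "Ambivalent", "Clear Reply"],
}
weights = {"model_a": 0.6, "model_b": 0.4}  # assumed weights, e.g. from validation scores

prediction, scores = weighted_vote(samples_by_model, weights)
```

Here the two models disagree on the majority label, and the weighted scores (0.56 for Clear Reply, 0.44 for Ambivalent) decide the outcome.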


[6] Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs cs.CL | cs.AI

Xing Zi, Xinying Zhou, Jinghao Xiao, Catarina Moreira, Mukesh Prasad

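TL;DR: Targeting the weakness of LLMs at multi-hop diagnostic reasoning in real clinical settings, this paper introduces ShatterMed-QA, a bilingual benchmark that eliminates shortcut learning by building a topology-regularized knowledge graph and pruning generic hub nodes with the k-Shattering algorithm. An evaluation of 21 LLMs shows a marked performance drop on multi-hop tasks, while restoring the evidence via retrieval-augmented generation improves performance almost universally.

Details

Motivation: LLMs reach expert-level scores on standard medical benchmarks yet fall well short on the complex multi-hop reasoning of real clinical practice because of "shortcut learning": models exploit highly connected, generic hub nodes in knowledge graphs to bypass authentic micro-pathological cascades.

Result: A comprehensive evaluation of 21 LLMs on ShatterMed-QA shows large performance degradation on multi-hop tasks, especially for domain-specific models. Restoring the masked evidence via RAG recovers performance almost universally, validating the benchmark's structural fidelity.

Insight: The contributions include a method for building a topology-regularized medical knowledge graph; the k-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts; and evaluation vignettes synthesized with implicit bridge entity masking and topology-driven hard negative sampling, which force models to rely on biologically plausible reasoning rather than superficial elimination.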

Abstract: While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is “shortcut learning”, where models exploit highly connected, generic hub nodes (e.g., “inflammation”) in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA’s structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/


[7] Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation cs.CL | cs.CV

Jia-Chen Zhang, Zhen-Wei Yan, Yu-Jie Xiong, Chun-Ming Xia

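TL;DR: This paper proposes Expert Pyramid Tuning (EPT), a new parameter-efficient fine-tuning architecture that brings the multi-scale feature pyramid concept from computer vision into PEFT. EPT decomposes task adaptation into two stages, a shared meta-knowledge subspace and a pyramid projection mechanism, and uses a task-aware router to dynamically select combinations of multi-scale features. This addresses a limitation of existing MoE-based LoRA variants, which overlook the hierarchical nature of task complexity and use uniformly structured experts that cannot capture the varied feature granularities different tasks need.

Details

Motivation: MoE-based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, but they largely overlook the hierarchical nature of task complexity. They typically use experts with uniform architectures, which limits their ability to capture the diverse feature granularities that different tasks require: some tasks demand high-level semantic abstraction, others fine-grained syntactic manipulation.

Result: Extensive experiments on multiple multi-task benchmarks show that EPT significantly outperforms state-of-the-art MoE-LoRA variants and, thanks to the re-parameterization capability of its design, does so while reducing the number of trainable parameters.

Insight: The novelty lies in transferring the multi-scale feature pyramid from computer vision to PEFT: a shared meta-knowledge subspace encodes universal linguistic patterns, a pyramid projection mechanism uses learnable up-projection operators to reconstruct high-dimensional features at multiple scales, and a task-aware router selects among them dynamically. Together these model the hierarchy of task complexity and further improve parameter efficiency.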

Abstract: Parameter-Efficient Fine-Tuning (PEFT) has become a dominant paradigm for deploying LLMs in multi-task scenarios due to its extreme parameter efficiency. While Mixture-of-Experts (MoE) based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods typically employ experts with uniform architectures, limiting their ability to capture diverse feature granularities required by distinct tasks, where some tasks demand high-level semantic abstraction while others require fine-grained syntactic manipulation. To bridge this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that integrates the multi-scale feature pyramid concept from computer vision into the realm of PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) a shared meta-knowledge subspace that encodes universal linguistic patterns in low dimensions; (2) a Pyramid Projection Mechanism that utilizes learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically selects the optimal combination of these multi-scale features. Extensive experiments across multiple multi-task benchmarks demonstrate that EPT significantly outperforms SOTA MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this performance improvement while simultaneously reducing the number of training parameters.


[8] From Text to Forecasts: Bridging Modality Gap with Temporal Evolution Semantic Space cs.CL | cs.AI

Lehui Li, Yuyao Wang, Jisheng Yan, Wei Zhang, Jinliang Deng

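TL;DR: The paper proposes TESS, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck to bridge the modality gap between textual information and numerical signals in time-series forecasting. An LLM extracts interpretable, numerically grounded temporal primitives from text via structured prompting, and confidence-aware gating filters them, so that textual semantics are converted more effectively into usable numerical cues.

Details

Motivation: Incorporating text into time-series forecasting promises to address event-driven non-stationarity, but text expresses temporal impacts implicitly and qualitatively while forecasting models rely on explicit, quantitative signals. This fundamental modality gap hinders effective fusion.

Result: Experiments on four real-world datasets show up to a 29% reduction in forecasting error relative to state-of-the-art unimodal and multimodal baselines, achieving state-of-the-art performance.

Insight: The core novelty is the Temporal Evolution Semantic Space as an intermediate representation: textual semantics are decomposed into interpretable, numerically grounded temporal primitives, and a confidence-aware mechanism filters out unreliable information, giving more reliable and effective cross-modal alignment and fusion between text and time series.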

Abstract: Incorporating textual information into time-series forecasting holds promise for addressing event-driven non-stationarity; however, a fundamental modality gap hinders effective fusion: textual descriptions express temporal impacts implicitly and qualitatively, whereas forecasting models rely on explicit and quantitative signals. Through controlled semi-synthetic experiments, we show that existing methods over-attend to redundant tokens and struggle to reliably translate textual semantics into usable numerical cues. To bridge this gap, we propose TESS, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. This space consists of interpretable, numerically grounded temporal primitives (mean shift, volatility, shape, and lag) extracted from text by an LLM via structured prompting and filtered through confidence-aware gating. Experiments on four real-world datasets demonstrate up to a 29 percent reduction in forecasting error compared to state-of-the-art unimodal and multimodal baselines. The code will be released after acceptance.


[9] EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning cs.CL

Chi Ruan, Dongfu Jiang, Huaye Zeng, Ping Nie, Wenhu Chen

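TL;DR: This paper proposes EvolveCoder, a code reinforcement learning framework based on adversarial verification that iteratively evolves test cases to strengthen the verification signal, and builds the large-scale dataset EvolveCoder-22k. Experiments show stable optimization and improved code generation on multiple benchmarks.

Details

Motivation: Existing RLVR methods for code are limited by weak, static verification signals, which caps the gains LLMs can achieve on code generation.

Result: On EvolveCoder-22k, iterative refinement lowers pass@1 from 43.80 to 31.22, indicating substantially stronger verification. After reinforcement learning, Qwen3-4B improves by an average of 4.2 points across four downstream benchmarks, outperforming other 4B-scale baselines and reaching state-of-the-art performance.

Insight: The novelty is a solution-conditioned, adversarial verification framework that iteratively evolves test cases based on the execution behavior of candidate solutions to increase difficulty, improve discriminative power, and reduce redundancy, offering an effective route to scalable reinforcement learning for code generation.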

Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving code generation in large language models, but its effectiveness is limited by weak and static verification signals in existing coding RL datasets. In this paper, we propose a solution-conditioned and adversarial verification framework that iteratively refines test cases based on the execution behaviors of candidate solutions, with the goal of increasing difficulty, improving discriminative power, and reducing redundancy. Based on this framework, we introduce EvolveCoder-22k, a large-scale coding reinforcement learning dataset constructed through multiple rounds of adversarial test case evolution. Empirical analysis shows that iterative refinement substantially strengthens verification, with pass@1 decreasing from 43.80 to 31.22. Reinforcement learning on EvolveCoder-22k yields stable optimization and consistent performance gains, improving Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforming strong 4B-scale baselines. Our results highlight the importance of adversarial, solution-conditioned verification for effective and scalable reinforcement learning in code generation.
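To make "discriminative power" and "redundancy" concrete, the sketch below filters a test suite using the pass/fail pattern of candidate solutions. It is a generic illustration under stated assumptions, not the EvolveCoder procedure: the abstract does not describe how tests are generated or scored, and `prune_tests`, `pass_rate`, and the outcome table are hypothetical.

```python
def prune_tests(outcomes):
    """Keep tests that discriminate between candidates and are not duplicates.

    outcomes maps a test name to a tuple of booleans, one per candidate
    solution (True means the candidate passes that test).
    """
    kept = {}
    seen_patterns = set()
    for name, results in outcomes.items():
        if all(results):              # every candidate passes: no discriminative power
            continue
        if results in seen_patterns:  # same pass/fail pattern as a kept test: redundant
            continue
        seen_patterns.add(results)
        kept[name] = results
    return kept

def pass_rate(outcomes):
    """Fraction of candidates that pass every test in the suite."""
    per_candidate = list(zip(*outcomes.values()))
    return sum(all(col) for col in per_candidate) / len(per_candidate)

# Hypothetical execution results for three candidate solutions.
outcomes = {
    "t1": (True, True, True),    # too easy
    "t2": (True, False, True),
    "t3": (True, False, True),   # duplicates t2's pattern
    "t4": (False, False, True),  # harder: only one candidate passes
}
kept = prune_tests(outcomes)
weak_suite_rate = pass_rate({"t1": outcomes["t1"]})
full_suite_rate = pass_rate(outcomes)
```

The drop from `weak_suite_rate` to `full_suite_rate` mirrors, in miniature, why a stronger test suite lowers pass@1.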


[10] Adaptive Vision-Language Model Routing for Computer Use Agents cs.CL | cs.CV

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang

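TL;DR: This paper proposes Adaptive VLM Routing (AVR), a framework that lets Computer Use Agents (CUAs) dynamically choose the most suitable vision-language model (VLM) for each GUI action according to its difficulty, substantially reducing inference cost while preserving accuracy.

Details

Motivation: Current CUA systems typically route every action to a single fixed VLM regardless of difficulty, even though grounding accuracy varies widely across VLMs. The result is either wasted compute or inadequate accuracy.

Result: On ScreenSpot-Pro grounding data and the OpenClaw agent routing benchmark, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. Combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest model, unifying efficiency and safety.

Insight: The novelty is a lightweight semantic routing layer that estimates action difficulty from multimodal embeddings, probes a small model for confidence, and selects a model with a threshold policy derived from the cost-accuracy trade-off. For "warm" agents with memory of prior interactions, retrieved context narrows the capability gap between small and large models and avoids unnecessary escalation.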

Abstract: Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For warm agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost–accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.
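The routing rule stated in the abstract, "the cheapest model whose predicted accuracy satisfies a target reliability threshold", can be sketched as below. This is not the semantic-router API: `ModelOption`, `route`, the model names, costs, and accuracies are hypothetical, and the fallback to the most accurate model when nothing meets the threshold is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelOption:
    name: str
    cost: float                # relative inference cost (assumed units)
    predicted_accuracy: float  # estimated per-action accuracy (assumed given)

def route(options, target_accuracy):
    """Pick the cheapest option whose predicted accuracy meets the target."""
    eligible = [o for o in options if o.predicted_accuracy >= target_accuracy]
    if eligible:
        return min(eligible, key=lambda o: o.cost)
    # Assumed fallback: no model meets the target, so escalate to the most accurate one.
    return max(options, key=lambda o: o.predicted_accuracy)

# Hypothetical pool; in practice the accuracies would vary per action with its difficulty.
pool = [
    ModelOption("small-vlm", cost=1.0, predicted_accuracy=0.82),
    ModelOption("medium-vlm", cost=4.0, predicted_accuracy=0.91),
    ModelOption("large-vlm", cost=12.0, predicted_accuracy=0.97),
]

easy_action = route(pool, target_accuracy=0.80)
hard_action = route(pool, target_accuracy=0.95)
```

With these numbers an easy action stays on the small model and a hard one escalates to the large model, which is where the projected cost savings come from.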


[11] Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design cs.CL

Xu Guo, Qiming Ge, Jian Tong, Kedi Chen, Jin Zhang

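TL;DR: This paper examines multiple-choice questions (MCQs) in Reinforcement Learning with Verifiable Rewards (RLVR). It notes that the common practice of converting MCQs to open-ended questions to avoid reward hacking discards the contrastive signal from expert-designed distractors. A systematic analysis identifies option-count mismatch and distractor strength as the key factors, and the paper proposes Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block shortcuts and promote deep reasoning.

Details

Motivation: To address the reward hacking that MCQs can induce in RLVR (random guessing or simple elimination) without discarding the valuable contrastive signal from expert-designed distractors, and so to make better use of the scalability of MCQs for improving LLM reasoning.

Result: Experiments on multiple benchmarks show that the method improves distractor quality and yields significant gains in RLVR training over the original data.

Insight: The novelty is a systematic analysis of how option design affects RLVR, plus the IDC framework for actively optimizing distractors to block shortcut behavior and promote deep reasoning. In practical terms, the method shows that careful distractor design can avoid reward hacking while keeping the scalability advantage of MCQs.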

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.


[12] DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning cs.CL

Ruiyao Xu, Noelle I. Samia, Han Liu

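TL;DR: This paper proposes DS^2-Instruct, a zero-shot framework that generates domain-specific instruction-tuning datasets without human supervision, addressing the scarcity and high annotation cost of quality data when adapting LLMs to specialized domains. It generates task-informed keywords to ensure domain coverage, builds diverse instructions by pairing them with cognitive levels from Bloom's Taxonomy, and applies self-consistency validation to ensure data quality.

Details

Motivation: Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns, while human annotation is expensive. This makes adapting LLMs to specialized domains difficult.

Result: Datasets are generated for seven challenging domains, including mathematics, finance, and logical reasoning. Evaluation shows that models fine-tuned on the generated data improve substantially over existing data generation methods.

Insight: The novelty is a zero-shot domain-specific data synthesis framework that combines task-informed keyword generation, instruction diversification based on Bloom's Taxonomy, and self-consistency validation to produce high-quality, broad-coverage instruction data for specialized domains in a systematic way.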

Abstract: Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom’s Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.


[13] Long-form RewardBench: Evaluating Reward Models for Long-form Generation cs.CL

Hui Huang, Yancheng He, Wei Liu, Muyun Yang, Jiaheng Liu

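TL;DR: This paper introduces Long-form RewardBench, the first reward model benchmark designed specifically for long-form generation. It covers five subtasks (QA, RAG, Chat, Writing, and Reasoning), is built through a multi-stage data collection process, and is used to evaluate more than 20 mainstream reward models. The study finds that current models still fall short at long-form reward modeling, relates reward model performance to error position and response length, and shows that classifiers generalize better than generative models.

Details

Motivation: Existing reward model benchmarks leave a gap in long-form generation, which is critical in real-world applications, so a comprehensive benchmark designed for long-form generation is needed.

Result: Extensive experiments on more than 20 mainstream reward models, both classifiers and generative models, show that current models lack long-form reward modeling capability. A purpose-built Long-form Needle-in-a-Haystack Test shows that reward modeling performance correlates with the error's position within a response and with overall response length, with classifiers and generative models behaving differently. Classifiers generalize better than generative models trained on the same data.

Insight: The novelty is the first reward model benchmark dedicated to long-form generation, together with a new Long-form Needle-in-a-Haystack Test for analyzing how performance depends on structural features of the text such as error position and length. The analysis reveals different behavior patterns for classifiers and generative models in long-form reward modeling, which can inform future model improvements.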

Abstract: The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error’s position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.


[14] Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation cs.CL

Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li

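TL;DR: This paper proposes WALAR, a reinforcement learning method that uses only monolingual text to improve LLM translation across a large number of low-resource languages while retaining performance on high-resource languages. Using techniques including word alignment and language alignment, it mitigates failure modes ("holes") in source-based multilingual quality estimation models, avoiding the model degradation that occurs when RL training amplifies those holes in the reward.

Details

Motivation: LLMs translate well on high-resource language pairs but still lag on low-resource translation. Existing post-training methods rely heavily on high-quality parallel data, which low-resource languages often lack. The paper aims to improve low-resource translation with monolingual text and to address the "holes" in existing quality estimation models.

Result: An LLM supporting translation of 101 languages was continually trained with WALAR. Experiments show that the new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1400 language directions of the Flores-101 dataset.

Insight: The novelty is identifying and mitigating the "holes" in source-based multilingual quality estimation models, using word alignment and language alignment to design a more robust reward function. This improves multilingual translation with monolingual data only and avoids reward hacking in reinforcement learning.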

Abstract: Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs’ translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or “holes”) in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR’s reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1400 language directions on the Flores-101 dataset.


[15] Neuron-Aware Data Selection In Instruction Tuning For Large Language Models cs.CL

Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu

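TL;DR: This paper proposes NAIT, an efficient framework for data selection in instruction tuning (IT) of LLMs. NAIT assesses the effect of data on model performance by analyzing the similarity of neuron activation patterns between the IT data and the target domain capability, and selects an optimal subset accordingly. Experiments show that training on the 10% subset of Alpaca-GPT4 selected by NAIT outperforms, across a range of tasks, methods that rely on external advanced models or uncertainty-based features.

Details

Motivation: Instruction tuning is a key way to unlock LLM capabilities, but excessive IT data can hurt performance, whereas a carefully selected high-quality subset can significantly improve it. Identifying the most effective subset of an IT dataset for efficiently developing specific or general LLM abilities is therefore a key challenge.

Result: Training on the 10% Alpaca-GPT4 IT data subset selected by NAIT outperforms, across a range of tasks, methods that rely on external advanced models (such as GPT-4) or uncertainty-based features, reaching state-of-the-art performance.

Insight: The novelty is a data selection framework (NAIT) based on the similarity of neuron activation patterns, which builds reusable, transferable neuron activation features for evaluating data. Its central finding is that neuron activation features transfer across LLM capabilities: data with more logical reasoning and programmatic features transfers broadly, and a stable core subset of data is enough to consistently activate fundamental model capabilities and improve performance across tasks.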

Abstract: Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting a small subset of high-quality IT data can significantly enhance their capabilities. Therefore, identifying the most efficient data subset from the IT dataset to effectively develop either specific or general abilities in LLMs has become a critical challenge. To address this, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experimental results show that training on the 10% Alpaca-GPT4 IT data subset selected by NAIT consistently outperforms methods that rely on external advanced models or uncertainty-based features across various tasks. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.
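Selecting samples by similarity to a target activation feature might look like the sketch below. The abstract does not say how activations are extracted or which similarity measure NAIT uses, so this assumes one activation vector per sample, a mean-activation target feature, and cosine similarity; `select_by_activation_similarity` and the synthetic data are hypothetical.

```python
import numpy as np

def select_by_activation_similarity(candidate_acts, target_acts, fraction=0.1):
    """Rank candidates by cosine similarity to the mean target activation.

    candidate_acts: (num_candidates, num_neurons) activations of IT samples.
    target_acts:    (num_target, num_neurons) activations of in-domain samples.
    Returns the indices of the top `fraction` of candidates and all similarities.
    """
    target_feature = target_acts.mean(axis=0)
    target_feature = target_feature / np.linalg.norm(target_feature)
    norms = np.linalg.norm(candidate_acts, axis=1, keepdims=True)
    sims = (candidate_acts / np.clip(norms, 1e-12, None)) @ target_feature
    k = max(1, int(round(fraction * len(candidate_acts))))
    top = np.argsort(-sims)[:k]
    return top, sims

# Synthetic activations (assumption): the first ten candidates resemble the target domain.
rng = np.random.default_rng(0)
direction = rng.normal(size=32)
target_acts = direction + 0.1 * rng.normal(size=(50, 32))
candidate_acts = rng.normal(size=(100, 32))
candidate_acts[:10] += 3.0 * direction

selected, sims = select_by_activation_similarity(candidate_acts, target_acts, fraction=0.1)
```

Because the target feature is computed once from in-domain data, it can be reused to score any new candidate pool, which matches the "reusable and transferable" framing in the abstract.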


cs.CV

[16] VQQA: An Agentic Approach for Video Evaluation and Quality Improvement cs.CV | cs.AI | cs.LG | cs.MA

Yiwen Song, Tomas Pfister, Yale Song

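TL;DR: This paper presents VQQA (Video Quality Question Answering), a unified multi-agent framework for video evaluation and quality improvement. It dynamically generates visual questions and uses the resulting vision-language model (VLM) critiques as semantic gradients, replacing traditional passive evaluation metrics and enabling efficient closed-loop prompt optimization through a black-box natural language interface.

Details

Motivation: Despite rapid progress in video generation models, aligning their outputs with complex user intent remains difficult. Existing test-time optimization methods are typically computationally expensive or require white-box access to model internals. VQQA is designed to address this.

Result: On text-to-video (T2V) and image-to-video (I2V) tasks, the method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.

Insight: The core idea is to recast video evaluation as a dynamic, question-answering-based multi-agent process that uses VLM critiques as actionable semantic feedback to guide optimization, giving an efficient, interpretable, black-box optimization approach that needs no access to model internals.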

Abstract: Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.


[17] Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks cs.CV | cs.LG | cs.NE

Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Hanjie Liu

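TL;DR: This paper proposes Alternating Gradient Flow (AGF) utility, a unified metric for structural pruning and dynamic routing in deep networks. Using a decoupled kinetic paradigm, AGF applies an absolute feature-space Taylor expansion to capture a network's structural "kinetic utility", overcoming the magnitude bias that existing pruning metrics (such as weight magnitude or activation awareness) show in structural pruning. The study finds a topological phase transition at extreme sparsity, where AGF preserves baseline functionality and exhibits topological implicit regularization; in Vision Transformers it reveals a Sparsity Bottleneck, in which dynamic signals suffer compression in converged models. Building on this, the authors design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. Under a 75% compression stress test on ImageNet-1K, AGF avoids the structural collapse in which traditional metrics fall below random sampling; for dynamic inference on ImageNet-100, the hybrid approach achieves Pareto-optimal efficiency, reducing heavy-expert usage by about 50% at an estimated overall cost of 0.92× without sacrificing full-model accuracy.

Details

Motivation: Existing pruning metrics (such as weight magnitude or activation awareness) are limited when applied to the structural pruning of deep vision networks: they suffer from magnitude bias and fail to preserve critical functional pathways, leading to structural collapse.

Result: Under a 75% compression stress test on ImageNet-1K, AGF avoids the structural collapse in which traditional metrics fall below random sampling; for dynamic inference on ImageNet-100, the hybrid routing approach achieves Pareto-optimal efficiency, reducing heavy-expert usage by about 50% at an estimated overall cost of 0.92× without sacrificing full-model accuracy.

Insight: The contributions include: proposing AGF utility as a unified metric that evaluates structural importance through a decoupled kinetic paradigm and an absolute feature-space Taylor expansion; revealing a topological phase transition at extreme sparsity and the Sparsity Bottleneck in Vision Transformers; and designing a hybrid routing framework that combines offline structural search with online execution for efficient dynamic inference. Viewed objectively, the work links theoretical analysis with practical application and offers a new perspective on structural pruning and dynamic routing.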

Abstract: Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network’s structural “kinetic utility”. First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92×) without sacrificing the full-model accuracy.


[18] Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization cs.CV

Ayan Banerjee, Kuntal Thakur, Sandeep Gupta

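TL;DR: This paper proposes GenEval, a multimodal vision-language model approach to single-source domain generalization. It combines foundation models (e.g., MedGemma-4B) with human knowledge through low-rank adaptation to bridge causal gaps, and achieves strong cross-domain generalization on diabetic retinopathy grading and seizure onset zone detection.

Details

Motivation: In critical tasks such as fundus image-based diabetic retinopathy grading and resting-state fMRI seizure onset zone detection, cross-domain generalization is difficult when domains differ in unknown causal factors, and there is no established method to assess such differences objectively.

Result: Across eight diabetic retinopathy datasets and two seizure onset zone detection datasets, GenEval achieves superior single-source domain generalization, with average accuracy of 69.2% and 81% respectively, outperforming the strongest baselines by 9.4% and 1.8% and reaching state-of-the-art (SOTA) performance.

Insight: The contributions include: 1) domain conformal bounds (DCB), a theoretical framework for evaluating whether domains differ in unknown causal factors; 2) a multimodal VLM approach that combines foundation models with human knowledge and adapts efficiently via LoRA to bridge causal gaps and improve generalization.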

Abstract: Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. When domains differ in unknown causal factors, achieving cross-domain generalization is difficult, and there is no established methodology to objectively assess such differences without direct metadata or protocol-level information from data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework to evaluate whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal Vision Language Models (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization (SDG). Across eight DR and two SOZ datasets, GenEval achieves superior SDG performance, with average accuracy of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively.
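
The abstract names Low-Rank Adaptation (LoRA) but does not describe how GenEval configures it. As a minimal sketch of the general LoRA idea only, the frozen weight `W` is augmented with a scaled low-rank product `B @ A`. The function names, shapes, and the `alpha / rank` scaling below are illustrative assumptions, not details taken from the paper.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def lora_weight(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    W: frozen d_out x d_in weight; B: d_out x rank; A: rank x d_in.
    Only A and B would be trained (an assumption of this sketch).
    """
    delta = matmul(B, A)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

With `B` initialized to zeros, the adapted weight equals `W`, so adaptation starts from the pretrained model's behavior.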


[19] SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs cs.CV | cs.AI

Mohamad Alansari, Naufal Suryanto, Divya Velayudhan, Sajid Javed, Naoufel Werghi

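TL;DR: SPARROW is a pixel-grounded video multimodal large language model that introduces target-specific tracked features and a dual-prompt design to address the shortcomings of existing video MLLMs in spatial precision and temporal consistency, enabling end-to-end video object tracking and segmentation.

Details

Motivation: Existing video MLLMs rely on a static segmentation token for frame-wise grounding, which lacks temporal context and causes spatial drift, identity switches, and unstable initialization when objects move or reappear. SPARROW aims to unify spatial accuracy and temporal stability.

Result: Across six benchmarks, integrating SPARROW into three open-source video MLLMs (UniPixel, GLUS, VideoGLaMM) yields consistent gains, for example up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG, substantially improving referential stability, spatial precision, and temporal coherence.

Insight: The contributions include target-specific tracked features that inject temporally aligned referent cues, and a dual-prompt design that fuses geometric priors with semantic grounding. Viewed objectively, its end-to-end architecture, which uses a class-agnostic SAM2-based proposer instead of external detectors, may improve generalization and efficiency.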

Abstract: Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: https://risys-lab.github.io/SPARROW


[20] Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking cs.CV | cs.HC

Chenkai Zhang

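TL;DR: This paper proposes EMC-Gaze, a deployment-oriented, facial-landmark-based webcam gaze tracking method. It casts gaze estimation as session-wise adaptation, combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator optimized through meta-training. The method needs only a few calibration points and, while remaining lightweight, is more accurate than baselines such as Elastic Net in interactive evaluation.

Details

Motivation: Practical webcam gaze tracking faces a heavy calibration burden, poor robustness to head motion and session drift, a large runtime footprint, and browser compatibility constraints. The paper targets a deployment-oriented operating point rather than the maximum accuracy of large image backbones.

Result: In an interactive evaluation over 33 sessions at 100 cm, EMC-Gaze reaches 5.79 +/- 1.81 deg RMSE after 9-point calibration, versus 6.68 +/- 2.34 deg for Elastic Net; the advantage is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). On MPIIFaceGaze with short per-session calibration, the model reaches 8.82 +/- 1.21 deg at 16-shot calibration and outperforms Elastic Net from 3-shot onward. The model is lightweight: the encoder has about 944k parameters, is 4.76 MB in ONNX, and takes about 12.58 ms per sample for in-browser inference.

Insight: The innovation is framing gaze tracking as session-wise meta-calibration, combining equivariant geometric encoding with a meta-learned calibrator, which reduces calibration burden and improves robustness to head-pose changes. From a deployment perspective, its lightweight, low-calibration, browser-compatible design offers a useful reference for practical applications rather than a pure pursuit of SOTA accuracy.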

Abstract: Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
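
The per-session ridge head mentioned in the abstract can be illustrated with the standard closed-form ridge solution, W = (XᵀX + λI)⁻¹XᵀY. The sketch below is a generic pure-Python version under stated assumptions: the names `ridge_fit` and `ridge_predict`, the regularization value, and the absence of an intercept term are choices made here, not details from the paper.

```python
def ridge_fit(X, Y, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam*I)^-1 X^T Y.

    X: n calibration embeddings of length d; Y: n targets of length k.
    Returns W as a d x k nested list. Requires lam > 0 for a safe pivot.
    """
    n, d, k = len(X), len(X[0]), len(Y[0])
    # Augmented system [X^T X + lam*I | X^T Y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) + (lam if i == j else 0.0)
          for j in range(d)]
         + [sum(X[r][i] * Y[r][c] for r in range(n)) for c in range(k)]
         for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(d):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [row[d:] for row in A]


def ridge_predict(W, x):
    """Map one embedding x (length d) to a k-dimensional point of regard."""
    return [sum(x[i] * W[i][c] for i in range(len(x))) for c in range(len(W[0]))]
```

Because the solution is a closed-form function of the calibration set, gradients can in principle flow through it during episodic training, which is consistent with the abstract's description of a ridge calibrator "differentiated through episodic meta-training".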


[21] A Neuro-Symbolic Framework Combining Inductive and Deductive Reasoning for Autonomous Driving Planning cs.CV

Hongyan Wei, Wael AbdAlmageed

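TL;DR: This paper proposes a novel neuro-symbolic trajectory planning framework for autonomous driving. It combines LLM-based deductive reasoning with end-to-end neural networks, generating safe and interpretable discrete driving decisions through dynamic scene-rule extraction and logical arbitration, and converts those decisions into physically feasible continuous trajectories using a decision-conditioned decoding mechanism and a differentiable kinematic model.

Details

Motivation: Existing end-to-end autonomous driving models rely heavily on purely data-driven inductive reasoning; their "black-box" nature leads to a lack of interpretability and absolute safety guarantees in complex long-tail scenarios. The paper aims to overcome this bottleneck by integrating deductive reasoning.

Result: On the nuScenes benchmark, the method comprehensively outperforms the state-of-the-art baseline MomAD, reducing the L2 mean error to 0.57 m, decreasing the collision rate to 0.075%, and improving trajectory prediction consistency (TPC) to 0.47 m.

Insight: The innovation is a neuro-symbolic framework that combines dynamic rule extraction by an LLM with deterministic logical arbitration by an ASP solver, plus a decision-conditioned decoding mechanism that bridges discrete symbols and continuous trajectories. Combining kinematic-model baseline trajectories with neural residual corrections ensures kinematic feasibility and a high degree of transparency.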

Abstract: Existing end-to-end autonomous driving models rely heavily on purely data-driven inductive reasoning. This “black-box” nature leads to a lack of interpretability and absolute safety guarantees in complex, long-tail scenarios. To overcome this bottleneck, we propose a novel neuro-symbolic trajectory planning framework that seamlessly integrates rigorous deductive reasoning into end-to-end neural networks. Specifically, our framework utilizes a Large Language Model (LLM) to dynamically extract scene rules and employs an Answer Set Programming (ASP) solver for deterministic logical arbitration, generating safe and traceable discrete driving decisions. To bridge the gap between discrete symbols and continuous trajectories, we introduce a decision-conditioned decoding mechanism that transforms high-level logical decisions into learnable embedding vectors, simultaneously constraining the planning query and the physical initial velocity of a differentiable Kinematic Bicycle Model (KBM). By combining KBM-generated physical baseline trajectories with neural residual corrections, our approach inherently guarantees kinematic feasibility while ensuring a high degree of transparency. On the nuScenes benchmark, our method comprehensively outperforms the state-of-the-art baseline MomAD, reducing the L2 mean error to 0.57 m, decreasing the collision rate to 0.075%, and optimizing trajectory prediction consistency (TPC) to 0.47 m.
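
To make the Kinematic Bicycle Model (KBM) baseline concrete, here is a minimal sketch of the textbook rear-axle formulation integrated with forward Euler. The wheelbase, time step, and function names are assumptions for illustration; the paper's differentiable KBM and its decision-conditioned initial velocity are not specified in the abstract.

```python
import math


def kbm_step(x, y, yaw, v, accel, steer, wheelbase=2.7, dt=0.5):
    """One forward-Euler step of a rear-axle kinematic bicycle model."""
    x_next = x + v * math.cos(yaw) * dt
    y_next = y + v * math.sin(yaw) * dt
    yaw_next = yaw + (v / wheelbase) * math.tan(steer) * dt
    v_next = v + accel * dt
    return x_next, y_next, yaw_next, v_next


def rollout(state, controls, **kwargs):
    """Roll the model forward over a list of (accel, steer) controls."""
    trajectory = []
    for accel, steer in controls:
        state = kbm_step(*state, accel, steer, **kwargs)
        trajectory.append(state)
    return trajectory
```

In a scheme like the one the abstract describes, a rollout of this kind would supply the physical baseline trajectory, and a learned residual would then be added to each waypoint.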


[22] Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation cs.CV

Jian Jiang, Chenxi Lin, Yiming Gu, Zengyi Qin, Zhitao Zeng

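TL;DR: This paper presents Surg-R1, a hierarchical reasoning foundation model for surgical decision support. Through a four-stage training pipeline, it builds hierarchical reasoning across three levels (perceptual, relational, and contextual) and is trained on a large-scale surgical chain-of-thought dataset. Evaluated on SurgBench, which comprises public benchmarks and multi-center external validation datasets, Surg-R1 outperforms general-purpose reasoning models and specialized surgical vision-language models on multiple tasks and achieves the highest Arena Score.

Details

Motivation: Existing surgical vision-language models lack interpretable reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks for lack of domain knowledge. The paper aims to provide scalable, interpretable surgical decision support.

Result: On SurgBench (six public benchmarks and six multi-center external validation datasets from five institutions), Surg-R1 achieves the highest Arena Score on public benchmarks (64.9%), ahead of Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%). It outperforms proprietary reasoning models and specialized surgical VLMs on the majority of tasks, spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, and improves on the strongest surgical baseline by 15.2 percentage points on external validation.

Insight: The contributions include: 1) a hierarchical reasoning framework that decomposes surgical scene understanding into perceptual, relational, and contextual levels; 2) the largest surgical chain-of-thought dataset; 3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Together these offer a systematic methodology for building interpretable, scalable domain-specific reasoning models.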

Abstract: Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, demonstrates that Surg-R1 achieves the highest Arena Score (64.9%) on public benchmarks versus Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%), outperforming both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, with a 15.2 percentage point improvement over the strongest surgical baseline on external validation.
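
The training pipeline includes group relative policy optimization. One common ingredient of that family of methods is a group-relative advantage: each sampled response's reward is normalized against the other responses to the same prompt. The sketch below shows only that normalization step; whether Surg-R1 uses exactly this form, and what its reward function is, is not stated in the abstract.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]
```

If every response in a group receives the same reward, all advantages are zero, so that group contributes no policy-gradient signal.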


[23] Revisiting Model Stitching In the Foundation Model Era cs.CV | cs.AI | cs.LG

Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen

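TL;DR: This paper revisits model stitching in the era of Vision Foundation Models (VFMs), asking whether heterogeneous VFMs that differ in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP) are stitchable. Using a systematic stitching protocol, it finds that stitch layer training is critical and that a simple feature-matching loss makes heterogeneous VFMs reliably stitchable on vision tasks; at deep stitch points, the stitched model can surpass either constituent model at a small inference overhead. Building on this, the paper proposes the VFM Stitch Tree (VST), which shares early layers while retaining later layers, giving multimodal LLMs a controllable accuracy-latency trade-off.

Details

Motivation: The work explores whether heterogeneous VFMs remain stitchable despite differences in objectives, data, and modality mix, in order to assess their representational compatibility and to elevate stitching from a diagnostic tool to a practical method for integrating complementary VFM strengths.

Result: With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks; at deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer).

Insight: The innovation lies in the systematic stitching protocol, which reveals how much stitch layer training matters for retaining accuracy, and in the VFM Stitch Tree (VST), which shares early layers and retains later layers to give multimodal LLMs a controllable accuracy-latency trade-off, elevating stitching from a diagnostic probe to a practical integration recipe.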

Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model’s penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.


[24] Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions cs.CV

Alexis Guichemerre, Banafsheh Karimian, Soufiane Belharbi, Natacha Gillet, Nicolas Thome

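TL;DR: This paper proposes SFDA-DeP, a method that addresses the prediction bias caused by distribution shift when weakly supervised object localization models for histopathology images are deployed across domains. It periodically identifies target images from over-predicted classes and selectively reduces the predictive confidence of uncertain images while preserving confident predictions, reducing decision-boundary drift and class bias.

Details

Motivation: Weakly Supervised Object Localization (WSOL) models perform joint classification and localization in histopathology images using only image-class supervision, but distribution shift degrades performance in cross-domain deployment, especially on new organs or at institutions with different staining protocols and scanner characteristics. Under large cross-domain shifts, predictions become biased toward dominant classes, producing highly skewed pseudo-label distributions in the target domain. Existing source-free domain adaptation (SFDA) methods rely on self-training, which reinforces the initial bias and harms both classification and localization.

Result: Extensive experiments on cross-organ and cross-center histopathology benchmarks (glas, CAMELYON-16, CAMELYON-17) with several WSOL models show that SFDA-DeP outperforms state-of-the-art SFDA baselines on both classification and localization.

Insight: The innovation is formulating source-free domain adaptation as an iterative process of identifying and correcting prediction bias. Inspired by machine unlearning, the method combines periodic debiasing with a jointly optimized pixel-level classifier to restore discriminative localization features under distribution shift, which reduces bias amplification and improves generalization in the target domain.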

Abstract: Weakly Supervised Object Localization (WSOL) models enable joint classification and region-of-interest localization in histology images using only image-class supervision. When deployed in a target domain, distribution shift remains a major cause of performance degradation, especially when applied to new organs or institutions with different staining protocols and scanner characteristics. Under stronger cross-domain shifts, WSOL predictions can become biased toward dominant classes, producing highly skewed pseudo-label distributions in the target domain. Source-Free (Unsupervised) Domain Adaptation (SFDA) methods are commonly employed to address domain shift. However, because they rely on self-training, the initial bias is reinforced over training iterations, degrading both classification and localization tasks. We identify this amplification of prediction bias as a primary obstacle to the SFDA of WSOL models in histopathology. This paper introduces SFDA-DeP, a method inspired by machine unlearning that formulates SFDA as an iterative process of identifying and correcting prediction bias. It periodically identifies target images from over-predicted classes and selectively reduces the predictive confidence for uncertain (high entropy) images, while preserving confident predictions. This process reduces the drift of decision boundaries and bias toward dominant classes. A jointly optimized pixel-level classifier further restores discriminative localization features under distribution shift. Extensive experiments on cross-organ and cross-center histopathology benchmarks (glas, CAMELYON-16, CAMELYON-17) with several WSOL models show that SFDA-DeP consistently improves classification and localization over state-of-the-art SFDA baselines. Code: https://anonymous.4open.science/r/SFDA-DeP-1797/
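
The selection step described in the abstract (find over-predicted classes, then pick their high-entropy images) can be sketched as follows. This is an illustrative reading, not the authors' procedure: the uniform expected class share, the entropy threshold, and the function names are all assumptions introduced here.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def select_for_debiasing(probs, expected_share=None, entropy_thresh=0.5):
    """Return indices of uncertain images assigned to over-predicted classes.

    probs: one softmax vector per target image.
    expected_share: assumed class prior; defaults to uniform (hypothetical).
    """
    n, k = len(probs), len(probs[0])
    preds = [max(range(k), key=lambda c: p[c]) for p in probs]
    share = [preds.count(c) / n for c in range(k)]
    expected = expected_share or [1.0 / k] * k
    over_predicted = {c for c in range(k) if share[c] > expected[c]}
    return [i for i, (p, c) in enumerate(zip(probs, preds))
            if c in over_predicted and entropy(p) > entropy_thresh]
```

Images returned by this filter would be the candidates for confidence reduction, while confident predictions (low entropy) are left untouched.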


[25] Unleashing Video Language Models for Fine-grained HRCT Report Generation cs.CV

Yingying Fang, Huichi Zhou, KinHei Lee, Yijia Wang, Zhenxuan Zhang

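TL;DR: This paper proposes AbSteering, a framework that uses an abnormality-centric Chain-of-Thought scheme and Direct Preference Optimization to steer general-purpose Video Language Models (VideoLMs) toward fine-grained diagnostic report generation from High-Resolution Computed Tomography (HRCT), outperforming domain-specific pretrained CT foundation models on report generation.

Details

Motivation: The paper addresses the challenges posed by pathological diversity and spatial sparsity in 3D HRCT volumes, and explores how well general-purpose VideoLMs adapt to domain-specific, high-volume medical image interpretation.

Result: On HRCT report generation, AbSteering outperforms state-of-the-art domain-specific CT foundation models (which are pretrained on large-scale CT data), achieving superior detection sensitivity while also reducing hallucinations.

Insight: The contributions are an abnormality-centric Chain-of-Thought scheme that enforces abnormality reasoning, and a Direct Preference Optimization objective that uses clinically confusable abnormalities as hard negatives to strengthen fine-grained discrimination. Viewed objectively, the work shows that general-purpose VideoLMs transfer strongly to high-volume medical imaging tasks when suitably guided.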

Abstract: Generating precise diagnostic reports from High-Resolution Computed Tomography (HRCT) is critical for clinical workflow, yet it remains a formidable challenge due to the high pathological diversity and spatial sparsity within 3D volumes. While Video Language Models (VideoLMs) have demonstrated remarkable spatio-temporal reasoning in general domains, their adaptability to domain-specific, high-volume medical interpretation remains underexplored. In this work, we present AbSteering, an abnormality-centric framework that steers VideoLMs toward precise HRCT report generation. Specifically, AbSteering introduces: (i) an abnormality-centric Chain-of-Thought scheme that enforces abnormality reasoning, and (ii) a Direct Preference Optimization objective that utilizes clinically confusable abnormalities as hard negatives to enhance fine-grained discrimination. Our results demonstrate that general-purpose VideoLMs possess strong transferability to high-volume medical imaging when guided by this paradigm. Notably, AbSteering outperforms state-of-the-art domain-specific CT foundation models, which are pretrained with large-scale CTs, achieving superior detection sensitivity while simultaneously mitigating hallucinations. Our data and model weights are released at https://anonymous.4open.science/r/hrct-report-generation-video-vlm-728C/
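
For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. Treating the preferred sequence as the reference report and the rejected one as a report containing a clinically confusable abnormality is an interpretation of the abstract; the function name, argument names, and `beta` value are assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    logp_w / logp_l: policy log-probabilities of the preferred / rejected text.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the loss is log 2; it falls as the policy raises the preferred report's likelihood relative to the hard negative.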


[26] Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning cs.CV | cs.LG

Rujie Wu, Haozhe Zhao, Hai Ci, Yizhou Wang

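TL;DR: This paper proposes Goal-Driven Data Optimization (GDO), a framework for multimodal instruction tuning. It computes six descriptors for each candidate sample and constructs optimized 1× training subsets for different goals, achieving faster convergence and higher accuracy with fewer training samples under a fixed training protocol.

Details

Motivation: Multimodal instruction tuning is compute-inefficient because training budgets are typically spread across large mixed image-video pools whose utility is highly uneven, wasting resources.

Result: Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO reaches the reference performance of the Uni-10x baseline (512k samples) after only 35.4k, 26.6k, 27.3k, and 34.7k samples on MVBench, VideoMME, MLVU, and LVBench, respectively, while improving accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, giving faster convergence and higher accuracy.

Insight: The innovation is a goal-driven data selection framework based on sample descriptors that optimizes training subsets for specific goals (such as minimum loss, diversity, or temporal emphasis), substantially improving data efficiency and training speed. Viewed objectively, the results indicate that emphasizing temporal features improves long-video understanding, offering a new approach to data-efficient model training.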

Abstract: Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1× training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding behavior. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at https://github.com/rujiewu/GDO.


[27] CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning cs.CV

Tianshuo Xu, Tiantian Hong, Zhifei Chen, Fei Chao, Ying-cong Chen

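TL;DR: CalliMaster is a unified framework for controllable generation and editing of page-level Chinese calligraphy. It resolves the trade-off between glyph precision and layout composition in existing methods by decoupling spatial planning from content synthesis. The framework uses a coarse-to-fine pipeline (Text → Layout → Image): within a single Multimodal Diffusion Transformer, it first predicts character bounding boxes to establish the global spatial layout, then uses that layout as a geometric prompt to render high-fidelity brushwork via flow matching.

Details

Motivation: Page-level calligraphy synthesis must balance glyph precision with layout composition; existing character models lack spatial context, while page-level methods often sacrifice brushwork detail.

Result: The method achieves state-of-the-art (SOTA) generation quality and supports downstream tasks such as controllable semantic re-planning (e.g., resizing or repositioning characters), artifact restoration, and forensic analysis.

Insight: Inspired by the human cognitive process of "planning before writing", the method decouples spatial planning from content synthesis and uses the layout as an intermediate geometric prompt to guide high-fidelity content generation; this decoupled design supports flexible controllable editing and extended applications.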

Abstract: Page-level calligraphy synthesis requires balancing glyph precision with layout composition. Existing character models lack spatial context, while page-level methods often compromise brushwork detail. In this paper, we present CalliMaster, a unified framework for controllable generation and editing that resolves this conflict by decoupling spatial planning from content synthesis. Inspired by the human cognitive process of “planning before writing”, we introduce a coarse-to-fine pipeline (Text → Layout → Image) to tackle the combinatorial complexity of page-scale synthesis. Operating within a single Multimodal Diffusion Transformer, a spatial planning stage first predicts character bounding boxes to establish the global spatial arrangement. This intermediate layout then serves as a geometric prompt for the content synthesis stage, where the same network utilizes flow-matching to render high-fidelity brushwork. Beyond achieving state-of-the-art generation quality, this disentanglement supports versatile downstream capabilities. By treating the layout as a modifiable constraint, CalliMaster enables controllable semantic re-planning: users can resize or reposition characters while the model automatically harmonizes the surrounding void space and brush momentum. Furthermore, we demonstrate the framework’s extensibility to artifact restoration and forensic analysis, providing a comprehensive tool for digital cultural heritage.


[28] MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens cs.CV

Youngrae Kim, Qixin Hu, C. -C. Jay Kuo, Peter A. Beerel

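TL;DR: MemRoPE is a training-free framework for infinite video generation that uses evolving memory tokens and online RoPE indexing to address fidelity degradation, identity drift, and motion stagnation in long-horizon generation with autoregressive diffusion models.

Details

Motivation: Existing sliding-window caches discard past context, degrading long-video generation quality; keeping a fixed set of early tokens as attention anchors cannot reflect the evolving content of the video.

Result: On minute- to hour-scale generation, MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency.

Insight: Compressing history into dual-stream memory tokens, combined with decoupled dynamic positional encoding, enables unbounded generation with a fixed-size cache while preserving content quality.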

Abstract: Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
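
The dual-stream compression described above can be pictured as two exponential moving averages over the evicted keys, one slow and one fast. The sketch below is a minimal illustration under assumed decay rates and names (`DualStreamMemory`, `long_decay`, `short_decay`); it stores raw, unrotated key vectors and leaves out the RoPE rotation that the abstract says is applied at attention time.

```python
def ema_update(memory, key, decay):
    """Blend a new key into a running average: decay*memory + (1-decay)*key."""
    return [decay * m + (1.0 - decay) * k for m, k in zip(memory, key)]


class DualStreamMemory:
    """Fixed-size summary of all past keys as two EMA tokens (illustrative)."""

    def __init__(self, long_decay=0.99, short_decay=0.5):
        self.long_decay = long_decay    # slow stream: global identity
        self.short_decay = short_decay  # fast stream: recent dynamics
        self.long = None
        self.short = None

    def update(self, key):
        if self.long is None:
            # The first key initializes both streams.
            self.long, self.short = list(key), list(key)
        else:
            self.long = ema_update(self.long, key, self.long_decay)
            self.short = ema_update(self.short, key, self.short_decay)

    def tokens(self):
        """Return the two memory tokens to prepend to the attention cache."""
        return [self.long, self.short]
```

Averaging unrotated keys is what keeps the aggregation well-defined: keys that had already been rotated by different positions would carry conflicting phases, which is the problem the abstract attributes to caching rotated keys.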


[29] Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering cs.CV

Yura Choi, Roy Miles, Rolandos Alexandros Potamias, Ismail Elezi, Jiankang Deng

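TL;DR: Targeting egocentric video question answering grounded in the user's pointing gestures, this paper introduces the EgoPointVQA dataset and benchmark, along with Hand Intent Tokens (HINT), a method that encodes 3D hand keypoint information to improve the model's understanding of pointing intent, surpassing existing state-of-the-art models on multiple tasks.

Details

Motivation: Current Multimodal Large Language Models (MLLMs) struggle to understand pointing gestures in egocentric video and answer related questions, mainly because gesture-rich datasets are lacking and the models have limited ability to infer fine-grained pointing intent from video.

Result: The proposed HINT-14B model achieves 68.1% average accuracy over 6 tasks, 6.6% higher than the current state-of-the-art InternVL3-14B.

Insight: The contributions are the EgoPointVQA dataset, which contains synthetic and real-world videos, and the HINT method, which encodes 3D hand keypoints from an off-the-shelf reconstruction model as tokens and interleaves them with the model input, giving explicit spatial and temporal context for interpreting pointing intent. This provides a new data benchmark and an effective intent-modeling approach for gesture understanding.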

Abstract: Understanding and answering questions based on a user’s pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others across different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy on average over 6 tasks, surpassing the state-of-the-art InternVL3-14B by 6.6%. To facilitate open research, we will release the code, model, and dataset. Project page: https://yuuraa.github.io/papers/choi2026egovqa


[30] Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation cs.CV | cs.AI

Alaa Dalaq, Muzammil Behzad

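TL;DR: This paper proposes the Spatio-Semantic Expert Routing Architecture (SERA), a new method for referring image segmentation. It introduces lightweight, expression-aware expert routing that refines features at two complementary stages of a vision-language framework to better match the diverse reasoning requirements of referring expressions, improving the spatial coherence and boundary precision of the segmentation.

Details

Motivation: Existing methods usually apply uniform refinement strategies that do not match the diverse reasoning requirements of referring expressions, so predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency.

Result: On standard referring image segmentation benchmarks, SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate localization and precise boundary delineation.

Insight: The innovation is a two-stage, expression-aware expert routing architecture (SERA-Adapter and SERA-Fusion) in which a lightweight routing mechanism adaptively weights expert contributions, together with a parameter-efficient tuning strategy that updates only normalization and bias terms, giving stable and efficient gains with frozen encoders.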

Abstract: Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations. To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.
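
A routing mechanism that "adaptively weights expert contributions" is often implemented as a softmax over router logits followed by a weighted sum of expert outputs. The sketch below shows that generic soft-routing pattern only; how SERA computes its logits from the referring expression, and whether it uses soft or sparse routing, is not stated in the abstract, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(expert_outputs, router_logits):
    """Combine expert feature vectors using softmax routing weights."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, expert_outputs))
            for i in range(dim)]
```

Equal logits reduce the output to a plain average of the experts, while one dominant logit makes the layer behave like that single expert.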


[31] Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA cs.CV

Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung

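TL;DR: Through controlled experiments within the LLaVA framework, this paper studies the limitations of vision-language models on spatial reasoning, arguing that the dominant design choices of CLIP-style image encoders and 1D positional encoding are a key reason these models struggle with 2D spatial relationships such as relative position, layout, and counting.

Details

Motivation: Despite strong performance on general benchmarks, vision-language models remain brittle on basic 2D spatial reasoning. The authors argue this is not merely a data problem but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and flattening images into token sequences with 1D positional encoding.

Result: The paper evaluates frontier models and LLaVA variants on multiple spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. The results show consistent spatial performance gaps across models; encoder objectives and positional structure shape spatial behavior but do not fully resolve the problem.

Insight: The contribution is a controlled diagnostic study that systematically isolates how image encoder type and positional encoding dimensionality affect VLM spatial reasoning, challenging the view that more data alone can solve spatial reasoning and informing future model design. Viewed objectively, the study highlights the role of architectural design (not just data scale) in specific model capabilities and points toward ways to improve spatial perception in VLMs.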

Abstract: Vision-language models (VLMs) have advanced rapidly, yet they still struggle with basic spatial reasoning. Despite strong performance on general benchmarks, modern VLMs remain brittle at understanding 2D spatial relationships such as relative position, layout, and counting. We argue that this failure is not merely a data problem, but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. We present a controlled diagnostic study within the LLaVA framework to isolate how these choices affect spatial grounding. We evaluate frontier models and LLaVA variants on a suite of spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. Our results show consistent spatial performance gaps across models, and indicate that encoder objectives and positional structure shape spatial behavior, but do not fully resolve it.
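
A small example shows why flattening a patch grid into a 1D sequence can obscure 2D layout. With row-major flattening, vertically adjacent patches end up a full row apart in the sequence, while the last patch of one row sits next to the first patch of the next. The grid width of 24 used below is an illustrative assumption, and the helper names are hypothetical.

```python
def flat_index(row, col, width):
    """Row-major position of patch (row, col) in the flattened sequence."""
    return row * width + col


def grid_position(index, width):
    """Recover (row, col) from a flattened index."""
    return divmod(index, width)


WIDTH = 24  # assumed patch-grid width, for illustration only

# Vertical neighbours are WIDTH tokens apart in the 1D sequence.
vertical_gap = flat_index(1, 0, WIDTH) - flat_index(0, 0, WIDTH)

# Patches at opposite edges of the image are adjacent in the 1D sequence.
wrap_gap = flat_index(1, 0, WIDTH) - flat_index(0, WIDTH - 1, WIDTH)
```

A 2D positional encoding would instead index each patch by its `(row, col)` pair, so both kinds of neighbour are represented according to their actual distance in the image.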


[32] Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation cs.CV

Fares Bougourzi, Fadi Dornaika, Abdenour Hadid

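TL;DR: This paper proposes Deco-Mamba, a decoder-centric approach for generalized 2D medical image segmentation. It follows a U-Net-like structure that combines Transformer, CNN, and Mamba architectures. Its decoder integrates a Co-Attention Gate, a Vision State Space Module, and a deformable convolutional refinement block, and a windowed distribution-aware KL-divergence loss is introduced for deep supervision, with the aim of improving multi-scale contextual representation and generalization across imaging modalities.

Details

Motivation: Most existing medical image segmentation methods are task-specific, generalize poorly across imaging modalities, and rely on computationally heavy pretrained encoder backbones. The paper aims to design a more efficient, better-generalizing decoder-centric approach.

Result: Extensive experiments on multiple medical image segmentation benchmarks show state-of-the-art performance and strong generalization while maintaining moderate model complexity.

Insight: The main contributions are: (1) a decoder-centric design that emphasizes the importance of the decoder in segmentation; (2) a decoder that integrates a novel Co-Attention Gate, Vision State Space Module, and deformable convolution block to strengthen multi-scale feature fusion; (3) a windowed distribution-aware KL-divergence loss for deep supervision across multiple decoding stages, which helps the model learn more robust feature distributions. Viewed objectively, combining Mamba (a state space model) with Transformers and CNNs for medical image segmentation while focusing on decoder optimization is a promising direction.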

Abstract: Deep learning has achieved remarkable success in medical image segmentation, often reaching expert-level accuracy in delineating tumors and tissues. However, most existing approaches remain task-specific, showing strong performance on individual datasets but limited generalization across diverse imaging modalities. Moreover, many methods focus primarily on the encoder, relying on large pretrained backbones that increase computational complexity. In this paper, we propose a decoder-centric approach for generalized 2D medical image segmentation. The proposed Deco-Mamba follows a U-Net-like structure with a Transformer-CNN-Mamba design. The encoder combines a CNN block and Transformer backbone for efficient feature extraction, while the decoder integrates our novel Co-Attention Gate (CAG), Vision State Space Module (VSSM), and deformable convolutional refinement block to enhance multi-scale contextual representation. Additionally, a windowed distribution-aware KL-divergence loss is introduced for deep supervision across multiple decoding stages. Extensive experiments on diverse medical image segmentation benchmarks yield state-of-the-art performance and strong generalization capability while maintaining moderate model complexity. The source code will be released upon acceptance.


[33] Neural Gate: Mitigating Privacy Risks in LVLMs via Neuron-Level Gradient Gating cs.CV

Xiangkui Cao, Jie Zhang, Meina Kan, Shiguang Shan, Xilin Chen

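TL;DR: This paper proposes Neural Gate, a new method that mitigates privacy risks in large vision-language models (LVLMs) through neuron-level model editing. The method learns a feature vector to locate neurons associated with privacy-related concepts and uses this localization to precisely guide parameter updates, raising the model's refusal rate on privacy-related questions and generalizing to unseen sensitive queries while preserving the model's original task performance.

Details

Motivation: Deploying LVLMs in critical domains such as finance and healthcare introduces significant security and privacy risks, since malicious actors may exploit the models to extract sensitive information. Existing privacy-protection methods are limited in generalization and non-destructiveness: they struggle to robustly handle unseen privacy queries and may degrade performance on standard tasks.

Result: Comprehensive experiments on MiniGPT and LLaVA show that the method significantly improves privacy protection while preserving the models' original utility.

Insight: The contribution is a privacy-protection method based on neuron-level model editing: by locating specific neurons associated with privacy concepts and updating them precisely, it raises refusal rates on privacy queries, generalizes to unseen queries, and avoids degrading overall performance. This is a more targeted and non-destructive approach to model editing.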

Abstract: Large Vision-Language Models (LVLMs) have shown remarkable potential across a wide array of vision-language tasks, leading to their adoption in critical domains such as finance and healthcare. However, their growing deployment also introduces significant security and privacy risks. Malicious actors could potentially exploit these models to extract sensitive information, highlighting a critical vulnerability. Recent studies show that LVLMs often fail to consistently refuse instructions designed to compromise user privacy. While existing privacy-protection methods have made meaningful progress in preventing the leakage of sensitive data, they are constrained by limitations in both generalization and non-destructiveness. They often struggle to robustly handle unseen privacy-related queries and may inadvertently degrade a model’s performance on standard tasks. To address these challenges, we introduce Neural Gate, a novel method for mitigating privacy risks through neuron-level model editing. Our method improves a model’s privacy safeguards by increasing its rate of refusal for privacy-related questions, crucially extending this protective behavior to novel sensitive queries not encountered during the editing process. Neural Gate operates by learning a feature vector to identify neurons associated with privacy-related concepts within the model’s representation of a subject. This localization then precisely guides the update of model parameters. Through comprehensive experiments on MiniGPT and LLaVA, we demonstrate that our method significantly boosts the model’s privacy protection while preserving its original utility.


[34] A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering cs.CV

Pritham Kumar Jena, Bhavika Baburaj, Tushar Anand, Vedant Dutta, Vineeth Ulavala

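TL;DR: This paper presents A2Z, a large-scale dataset of 10 million multi-modal annotations and metadata for 1 million ABC CAD models, intended to advance AI-assisted CAD modeling and reverse engineering through geometric deep learning.

Details

Motivation: Existing geometric deep learning techniques lack a multi-modal understanding of the parametric CAD features stored in the boundary representation (BRep). Closing this gap supports reverse engineering and rapid prototyping from 3D scans, sketches, or text prompts.

Result: A foundation model is trained and benchmarked on a subset of 150K CAD models to detect BRep co-edges and corner vertices from 3D scans, a key downstream task in CAD reverse engineering.

Insight: The contribution is a large-scale, high-quality, multi-modal CAD annotation dataset (A2Z) that also incorporates professionally designed electronic enclosure models. Its value is assessed with novel metrics and human feedback mechanisms, and it offers unprecedented data support for CAD learning and retrieval tasks.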

Abstract: Reverse engineering and rapid prototyping of computer-aided design (CAD) models from 3D scans, sketches, or simple text prompts are vital in industrial product design. However, recent advances in geometric deep learning techniques lack a multi-modal understanding of parametric CAD features stored in their boundary representation (BRep). This study presents the largest compilation of 10 million multi-modal annotations and metadata for 1 million ABC CAD models, namely A2Z, to unlock an unprecedented level of BRep learning. A2Z comprises (i) high-resolution meshes with salient 3D scanning features, (ii) 3D hand-drawn sketches equipped with (iii) geometric and topological information about BRep co-edges, corners, and surfaces, and (iv) textual captions and tags describing the product in the mechanical world. Creating such carefully structured, large-scale data, which requires nearly 5 terabytes of storage to leverage unparalleled CAD learning/retrieval tasks, is very challenging. The scale, quality, and diversity of our multi-modal annotations are assessed using novel metrics, GPT-5, Gemini, and extensive human feedback mechanisms. To this end, we also merge an additional 25,000 CAD models of electronic enclosures (e.g., tablets, ports) designed by skilled professionals with our A2Z dataset. Subsequently, we train and benchmark a foundation model on a subset of 150K CAD models to detect BRep co-edges and corner vertices from 3D scans, a key downstream task in CAD reverse engineering. The annotated dataset, metrics, and checkpoints will be publicly released to support numerous research directions.


[35] Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning cs.CV | cs.AI

Zesheng Yang, Xi Jiang, Bingzhang Hu, Weili Guan, Runmin Cong

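TL;DR: To address the difficulty that current vision-language detection and grounding models have with complex expressions containing negative semantics, this paper proposes a new dataset, D-Negation, and a grouped opposition-based learning framework. The framework organizes positive and negative semantic descriptions and designs complementary loss functions so that the model learns negation-aware representations from limited samples. Integrating the dataset and learning strategy into a state-of-the-art language-based grounding model and fine-tuning fewer than 10% of the parameters yields gains of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively.

Details

Motivation: Current vision-language detection and grounding models focus mainly on prompts with positive semantics and struggle to interpret and ground complex expressions containing negative semantics. A key reason is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions.

Result: After integration into a state-of-the-art language-based grounding model, fine-tuning fewer than 10% of the parameters improves positive and negative semantic evaluations by up to 4.4 mAP and 5.7 mAP, respectively, substantially improving robustness and localization accuracy.

Insight: The contributions are D-Negation, described as the first dataset annotated specifically for negation semantics, and a grouped opposition-based learning framework. By organizing opposing semantic descriptions into structured groups and designing complementary loss functions, the framework learns negation-aware representations from limited samples and markedly improves the model's understanding and grounding of negation.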

Abstract: Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions. To address this challenge, we introduce D-Negation, a new dataset that provides objects annotated with both positive and negative semantic descriptions. Building upon the observation that negation reasoning frequently appears in natural language, we further propose a grouped opposition-based learning framework that learns negation-aware representations from limited samples. Specifically, our method organizes opposing semantic descriptions from D-Negation into structured groups and formulates two complementary loss functions that encourage the model to reason about negation and semantic qualifiers. We integrate the proposed dataset and learning strategy into a state-of-the-art language-based grounding model. By fine-tuning fewer than 10 percent of the model parameters, our approach achieves improvements of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively. These results demonstrate that explicitly modeling negation semantics can substantially enhance the robustness and localization accuracy of vision-language grounding models.


[36] Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains cs.CV | eess.IV

Guodong Sun, Qihang Liang, Xingyu Pan, Moyun Liu, Yang Zhang

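TL;DR: This paper proposes a lightweight self-prompted instance segmentation framework for freight train fault detection. Built on the Segment Anything Model, it introduces a self-prompt generation module that automatically produces task-specific prompts, transferring knowledge from the foundation model to the domain-specific detection task. It also adopts a Tiny Vision Transformer backbone to reduce computational cost, making it suitable for real-time deployment on edge devices in railway monitoring systems.

Details

Motivation: Visual fault detection in freight trains remains challenging because of complex operating environments, structurally repetitive components, and frequent occlusion or contamination in safety-critical regions. Conventional CNN- and Transformer-based instance segmentation methods generalize poorly and have limited boundary accuracy under these conditions.

Result: On a domain-specific dataset collected from real freight inspection stations, the method achieves 74.6 AP^box and 74.2 AP^mask, outperforming existing state-of-the-art methods in accuracy and robustness while keeping computational overhead low.

Insight: The contribution is adapting a foundation model (such as SAM) to industrial fault detection through a self-prompt generation mechanism, achieving effective knowledge transfer with a lightweight design and offering a deployable, efficient vision solution for industrial-scale fault diagnosis.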

Abstract: Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety-critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self-prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self-prompt generation module that automatically produces task-specific prompts, enabling effective knowledge transfer from foundation models to domain-specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real-time deployment on edge devices in railway monitoring systems. We construct a domain-specific dataset collected from real-world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 $AP^{\text{box}}$ and 74.2 $AP^{\text{mask}}$ on the dataset, outperforming existing state-of-the-art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial-scale fault diagnosis scenarios. Project page: https://github.com/MVME-HBUT/SAM_FTI-FDet.git


[37] LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction cs.CV | cs.AI

Ziyu Chen, Fan Zhu, Hui Zhu, Deyi Kong, Xinkai Kuang

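TL;DR: This paper proposes LR-SGS, a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method for self-driving scene reconstruction. It initializes the Gaussian representation from the geometric and reflectance information of LiDAR point clouds, captures edge and planar structures through a salient transform and improved density control, and aligns calibrated reflectance with RGB as a lighting-invariant material channel to strengthen boundary consistency.

Details

Motivation: Existing 3D Gaussian Splatting methods either rely mainly on cameras or use LiDAR only for Gaussian initialization or depth supervision. They do not fully exploit rich point-cloud information such as reflectance, or the complementarity between LiDAR and RGB, which degrades performance in challenging self-driving scenes with high ego-motion and complex lighting.

Result: Extensive experiments on the Waymo Open Dataset show that LR-SGS achieves superior reconstruction with fewer Gaussians and shorter training time. On complex lighting scenes in particular, its PSNR exceeds that of OmniRe by 1.18 dB.

Insight: The contribution is a structure-aware Salient Gaussian representation together with the use of calibrated LiDAR reflectance as a lighting-invariant material property, jointly optimized with RGB to strengthen scene boundary consistency. This makes more effective use of multi-modal sensor data and improves reconstruction robustness.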

Abstract: Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.


[38] VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model cs.CV

Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia

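TL;DR: VGGT-World is a geometry world model that side-steps video generation by forecasting the temporal evolution of frozen geometry-foundation-model (GFM) features, focusing on geometrically consistent prediction. It treats the latent tokens of a frozen VGGT as the world state and trains a lightweight temporal flow transformer to autoregressively predict their future trajectory. Experiments on KITTI, Cityscapes, and TartanAir show that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6 to 5 times faster with only 0.43B trainable parameters.

Details

Motivation: Existing world models focus mainly on photometric detail when forecasting scene evolution, yet their predictions are often geometrically inconsistent. The paper addresses this by directly forecasting the temporal evolution of geometric features, with the goal of a more efficient 3D world model.

Result: On depth forecasting across KITTI, Cityscapes, and TartanAir, VGGT-World significantly outperforms the strongest baselines, runs 3.6 to 5 times faster, and uses only 0.43B trainable parameters.

Insight: The contributions are: (1) using frozen geometry foundation model features as the predictive state, which avoids the geometric inconsistency of video generation; (2) a clean-target (z-prediction) parameterization that resolves flow-matching collapse in the high-dimensional feature space; (3) a two-stage latent flow-forcing curriculum that mitigates exposure bias in autoregressive rollout. Together these give an efficient and geometrically consistent approach to 3D world modeling.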

Abstract: World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.


[39] VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors cs.CV

Yuhang Ming, Tingkang Xi, Xingrui Yang, Lixin Yang, Yong Peng

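TL;DR: VFM-Recon is a scene-level neural volumetric reconstruction method that targets monocular video reconstruction under severe domain shift. It introduces a lightweight scale alignment stage to restore multi-view scale consistency and integrates features from pretrained vision foundation models (VFMs) into the reconstruction pipeline through task-specific adapters, preserving cross-domain robustness.

Details

Motivation: Scene-level neural volumetric reconstruction from monocular video is difficult under domain shift because the scale-ambiguous predictions of VFMs are incompatible with the scale consistency that volumetric fusion requires.

Result: The model is trained on the ScanNet train split and evaluated on the in-distribution ScanNet test split and on the out-of-distribution TUM RGB-D and Tanks and Temples datasets. It achieves state-of-the-art performance on all datasets. On the challenging outdoor Tanks and Temples dataset, it reaches an F1 score of 70.1 in reconstructed mesh evaluation, substantially ahead of the closest competitor, VGGT (51.8).

Insight: The contribution is the first attempt to combine transferable VFM priors with the scale-consistency requirements of scene-level neural reconstruction, achieving cross-domain robust reconstruction through lightweight scale alignment and task-specific adapters. Viewed objectively, the method makes effective use of the generalization ability of large pretrained models while resolving scale ambiguity, offering a new approach to cross-domain 3D reconstruction.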

Abstract: Scene-level neural volumetric reconstruction from monocular videos remains challenging, especially under severe domain shifts. Although recent advances in vision foundation models (VFMs) provide transferable generalized priors learned from large-scale data, their scale-ambiguous predictions are incompatible with the scale consistency required by volumetric fusion. To address this gap, we present VFM-Recon, the first attempt to bridge transferable VFM priors with scale-consistency requirements in scene-level neural reconstruction. Specifically, we first introduce a lightweight scale alignment stage that restores multi-view scale coherence. We then integrate pretrained VFM features into the neural volumetric reconstruction pipeline via lightweight task-specific adapters, which are trained for reconstruction while preserving the cross-domain robustness of pretrained representations. We train our model on the ScanNet train split and evaluate on both the in-distribution ScanNet test split and the out-of-distribution TUM RGB-D and Tanks and Temples datasets. The results demonstrate that our model achieves state-of-the-art performance across all dataset domains. In particular, on the challenging outdoor Tanks and Temples dataset, our model achieves an F1 score of 70.1 in reconstructed mesh evaluation, substantially outperforming the closest competitor, VGGT, which only attains 51.8.


[40] AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network cs.CV

Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari

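TL;DR: This paper proposes AVION, a knowledge distillation framework for adapting vision-language models to remote sensing imagery. It comprises a teacher module that builds semantically rich textual prototypes by collecting descriptions from a large language model and verifying their validity with remote sensing image features, and a student module that integrates lightweight learnable prompts into the vision and language encoders and, guided by the teacher, aligns embeddings and their cross-modal relationships. Once trained, the student runs independently at inference.

Details

Motivation: Adapting vision-language models to remote sensing imagery faces two key challenges: limited semantic coverage in textual representations and insufficient adaptability of visual features. These are especially pronounced in aerial scenes, which involve varied visual appearances and fine-grained object distinctions.

Result: Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without harming generalization to novel categories, and raises mean recall for cross-modal retrieval while adding very few trainable parameters.

Insight: The contribution is a knowledge distillation framework designed for remote sensing: the teacher module builds and verifies semantically rich textual prototypes to strengthen the text representation, and guides the student module to align cross-modal features through lightweight prompt tuning, giving efficient and effective model adaptation.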

Abstract: Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.


[41] Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization cs.CV | cs.RO

Kazuto Nakashima, Hojung Jung, Yuki Oto, Yumi Iwashita, Ryo Kurazume

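TL;DR: This paper proposes a method for outdoor place categorization based on convolutional neural networks (CNNs) that takes panoramic depth/reflectance images obtained by 3D LiDAR as input. It also builds MPO, a large-scale multi-modal outdoor point cloud dataset labeled with six outdoor place categories.

Details

Motivation: Semantic place categorization for autonomous robots and vehicles is difficult outdoors because of perceptual variations such as changing illumination and dynamic occlusions.

Result: On the MPO dataset, the method outperforms traditional approaches, and using both depth and reflectance modalities improves categorization.

Insight: The contributions are the multi-modal LiDAR point cloud dataset MPO and CNN models that combine geometric (depth) and photometric (reflectance) features. Feature visualization is used to verify the effectiveness of the learned features.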

Abstract: Semantic place categorization, which is one of the essential tasks for autonomous robots and vehicles, allows them to make decisions and navigate by themselves in unfamiliar environments. In particular, outdoor places are more difficult targets than indoor ones due to perceptual variations, such as dynamic illuminance over twenty-four hours and occlusions by cars and pedestrians. This paper presents a novel method of categorizing outdoor places using convolutional neural networks (CNNs), which take omnidirectional depth/reflectance images obtained by 3D LiDARs as the inputs. First, we construct a large-scale outdoor place dataset named Multi-modal Panoramic 3D Outdoor (MPO) comprising two types of point clouds captured by two different LiDARs. They are labeled with six outdoor place categories: coast, forest, indoor/outdoor parking, residential area, and urban area. Second, we provide CNNs for LiDAR-based outdoor place categorization and evaluate our approach with the MPO dataset. Our results on the MPO dataset outperform traditional approaches and show the effectiveness of using both depth and reflectance modalities. To analyze our trained deep networks, we visualize the learned features.


[42] Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies cs.CV | cs.AI | eess.IV

Haohang Huang, Jiayi Luo, Issam Qamhia, Erol Tutumluer, John M. Hart

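TL;DR: This paper proposes a marker-based, cost-effective photogrammetry approach for 3D reconstruction of aggregate particles and validates its accuracy against ground truth. It also compares the 2D and 3D morphological properties of selected samples and finds significant differences between them.

Details

Motivation: As the main skeleton of construction materials, aggregates carry size and morphology information that is important for quality control, but existing 3D characterization methods (such as laser scanning or CT) are expensive and inconvenient. The paper aims to develop a flexible, low-cost 3D reconstruction method that simplifies aggregate inspection and morphological analysis.

Result: Reconstruction accuracy was validated against ground truth for selected aggregate samples. The comparative analysis shows significant differences between the 2D and 3D morphological statistics.

Insight: The contribution is a marker-based photogrammetry design that enables background suppression, point cloud stitching, and scale referencing, producing high-quality 3D models at low cost and offering a convenient route to aggregate inspection and data collection.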

Abstract: Aggregates, serving as the main skeleton in assemblies of construction materials, are important functional components in various building and transportation infrastructures. They can be used in unbound layer applications, e.g. pavement base and railroad ballast, bound applications of cement concrete and asphalt concrete, and as riprap and large-sized primary crushed rocks. Information on the size and shape or morphology of aggregates can greatly facilitate the Quality Assurance/Quality Control (QA/QC) process by providing insights of aggregate behavior during composition and packing. A full 3D characterization of aggregate particle morphology is difficult both during production in a quarry and at a construction site. Many aggregate imaging approaches have been developed to quantify the particle morphology by computer vision, including 2D image-based approaches that analyze particle silhouettes and 3D scanning-based methods that require expensive devices such as 3D laser scanners or X-Ray Computed Tomography (CT) equipment. This paper presents a flexible and cost-effective photogrammetry-based approach for the 3D reconstruction of aggregate particles. The proposed approach follows a marker-based design that enables background suppression, point cloud stitching, and scale referencing to obtain high-quality aggregate models. The accuracy of the reconstruction results was validated against ground-truth for selected aggregate samples. Comparative analyses were conducted on 2D and 3D morphological properties of the selected samples. Significant differences were found between the 2D and 3D statistics. Based on the presented approach, 3D shape information of aggregates can be obtained easily and at a low cost, thus allowing convenient aggregate inspection, data collection, and 3D morphological analysis.


[43] Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning cs.CV | cs.LG

Selim Furkan Tekin, Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Margaret L. Loper

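TL;DR: This paper proposes V3Fusion, which uses both vision and language modalities to optimize the fusion and selection of multiple vision-language models (VLMs). It uses focal error diversity and a CKA-based focal diversity metric to assess complementarity among models, and applies a Genetic Algorithm to prune ineffective models from the candidate pool. The result is high-performing dual focal-diversity fused predictions that remain effective when there is no consensus among models or when most models predict incorrectly.

Details

Motivation: Existing work mostly ensembles and routes across multiple VLMs using the language modality. This paper uses both vision and language modalities to address diverse model selection, with the aim of improving the robustness and performance of visual reasoning.

Result: Extensive experiments were run on four benchmarks: A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA. V3Fusion improves accuracy over the best single VLM by 8.09% on MMMU and by 4.87% on MMMU-Pro, and on generative tasks (A-OKVQA and OCR-VQA) it outperforms top VLMs such as Intern-VL2-8b and Qwen2.5-VL-7b.

Insight: The contributions are: (1) focal error diversity and the CKA-focal metric, which quantify complementarity among VLMs and disagreement in their visual embeddings; (2) a Genetic Algorithm applied on the ensemble surface to prune models for efficient model selection; (3) the ability to capture epistemic uncertainty dynamically and mitigate hallucinations, maintaining high performance even without majority consensus or when most models are wrong.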

Abstract: With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to effectively prune out those component VLMs that do not add value to the fusion performance. We identify the best combination for each task, fuse the outputs of each VLM in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM in accuracy by 8.09% on MMMU and by 4.87% on MMMU-Pro. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at https://github.com/sftekin/v3fusion.
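
The abstract does not define CKA-focal. As background only, plain linear centered kernel alignment (CKA) between two embedding matrices computed on the same inputs can be sketched as below. The function names and the pure-Python matrix layout (a list of rows, one row per sample) are illustrative assumptions and not the V3Fusion implementation.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per sample, one column per feature


def _center(X: Matrix) -> Matrix:
    """Subtract the per-column mean so every feature has zero mean."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def _cross_fro2(X: Matrix, Y: Matrix) -> float:
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[r][i] * Y[r][j] for r in range(len(X)))
            total += s * s
    return total


def linear_cka(X: Matrix, Y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = _center(X), _center(Y)
    denom = math.sqrt(_cross_fro2(Xc, Xc) * _cross_fro2(Yc, Yc))
    return _cross_fro2(Xc, Yc) / denom
```

A value near 1 indicates that the two models embed the samples similarly, so a low value is one possible signal of disagreement between visual encoders. How the paper turns such a score into its focal diversity metric is not stated in the abstract.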


[44] STRAP-ViT: Segregated Tokens with Randomized – Transformations for Defense against Adversarial Patches in ViTs cs.CV | cs.LG

Nandish Chattopadhyay, Anadi Goyal, Chandan Karfa, Anupam Chattopadhyay

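TL;DR: This paper proposes STRAP-ViT, a defense mechanism that protects Vision Transformer (ViT) models against adversarial patch attacks. In a detection phase it uses Jensen-Shannon divergence to segregate tokens that behave anomalously, and in a mitigation phase it applies randomized composite transformations to those tokens to neutralize the adversarial noise. The method is a training-free, plug-and-play module with low computational cost. Across several pretrained ViT architectures, datasets, and adversarial attacks, it achieves robust accuracy within 2-3% of the clean baselines and outperforms existing techniques.

Details

Motivation: Adversarial patches are physically realizable localized noise that can hijack ViT self-attention, pulling focus toward a small, high-contrast region and corrupting the class token, which leads to confident misclassifications. The paper addresses the vulnerability of ViT models to such attacks.

Result: Tested on multiple pretrained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101) against multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA, and RP2), STRAP-ViT delivers robust accuracy within 2-3% of the clean baselines and surpasses the current state of the art.

Insight: The core idea is that tokens covering the adversarial patch have different statistical properties from tokens in non-adversarial regions, which motivates a two-phase (detection and mitigation) defense. The defense is framed as a training-free, plug-and-play inference-time module that detects anomalous tokens with Jensen-Shannon divergence and mitigates them with randomized transformations, at minimal computational overhead.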

Abstract: Adversarial patches are physically realizable localized noise that can hijack the self-attention of Vision Transformers (ViTs), pulling focus toward a small, high-contrast region and corrupting the class token to force confident misclassifications. In this paper, we claim that the tokens corresponding to the areas of the image that contain the adversarial noise have different statistical properties from the tokens that do not overlap with the adversarial perturbations. We use this insight to propose a mechanism, called STRAP-ViT, which uses Jensen-Shannon Divergence as a metric for segregating tokens that behave as anomalies in the Detection Phase, and then applies randomized composite transformations to them during the Mitigation Phase to make the adversarial noise ineffective. The minimum number of tokens to transform is a hyper-parameter of the defense mechanism and is chosen such that at least 50% of the patch is covered by the transformed tokens. STRAP-ViT fits as a non-trainable plug-and-play block within ViT architectures, for inference purposes only, with minimal computational cost and no additional training cost or effort. STRAP-ViT has been tested on multiple pre-trained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101), across multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA and RP2), and is found to provide excellent robust accuracies lying within a 2-3% range of the clean baselines and to outperform the state-of-the-art.
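
The detection step described above can be illustrated with a minimal sketch. It assumes each token has already been summarized as a discrete probability distribution and that tokens are scored against the mean distribution over all tokens. Both choices, along with the names `js_divergence` and `flag_anomalous_tokens`, are assumptions made for illustration; the abstract does not say how the distributions or the reference are built.

```python
import math
from typing import List, Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def flag_anomalous_tokens(token_dists: List[List[float]], k: int) -> List[int]:
    """Return indices of the k tokens that diverge most from the mean distribution.

    `k` plays the role of the number-of-tokens hyper-parameter mentioned in the
    abstract; the mean-distribution reference is an assumption of this sketch.
    """
    n = len(token_dists)
    mean = [sum(col) / n for col in zip(*token_dists)]
    scores = [js_divergence(d, mean) for d in token_dists]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
```

In a full defense, the flagged indices would then be passed to the mitigation phase, where the randomized composite transformations are applied. That step is omitted here because the abstract does not specify the transformations.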


[45] HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification cs.CV | cs.AI

Andrey V. Savchenko, Kseniia Tsypliakova

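TL;DR: This paper presents the team's results in the 10th ABAW competition. For frame-wise facial emotion understanding tasks (expression recognition, valence-arousal estimation, action unit detection), it proposes a fast approach that extracts facial embeddings with pretrained EfficientNet emotion models, combines predictions using a confidence threshold and an MLP, and smooths frame-wise predictions with a sliding window to reduce noise. For fine-grained violence detection, it explores several pretrained architectures for extracting frame embeddings and aggregating them for video classification. Experiments show that the approach significantly improves validation metrics on four ABAW challenge tasks.

Details

Motivation: Frame-wise facial affect analysis and fine-grained violence detection in the wild suffer from unstable predictions and insufficient performance because of complex environments and noisy annotations.

Result: On the validation sets of four ABAW challenge tasks (expression recognition, valence-arousal estimation, action unit detection, fine-grained violence classification), the approach significantly outperforms existing baselines. The abstract gives no specific quantitative results.

Insight: The contributions include a hybrid prediction strategy that combines a pretrained model's confidence threshold with a lightweight MLP, using high-confidence predictions directly and handling uncertain cases separately; sliding-window smoothing of frame-wise predictions to reduce temporal noise; and exploration of frame-embedding aggregation for violence detection. Viewed objectively, the approach emphasizes practicality and efficiency, improves robustness through model combination and post-processing, and is relevant to practical applications.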

Abstract: This article presents our results for the 10th Affective Behavior Analysis in-the-Wild (ABAW) competition. For frame-wise facial emotion understanding tasks (frame-wise facial expression recognition, valence-arousal estimation, action unit detection), we propose a fast approach based on facial embedding extraction with pre-trained EfficientNet-based emotion recognition models. If the latter model’s confidence exceeds a threshold, its prediction is used. Otherwise, we feed embeddings into a simple multi-layered perceptron trained on the AffWild2 dataset. Estimated class-level scores are smoothed in a sliding window of fixed size to mitigate noise in frame-wise predictions. For the fine-grained violence detection task, we examine several pre-trained architectures for frame embeddings and their aggregation for video classification. Experimental results on four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over existing baselines.
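
The confidence gating and sliding-window smoothing described in the abstract can be sketched as follows. The threshold value, the centered window truncated at the sequence ends, and the function names are assumptions chosen for illustration; the abstract only states that a threshold and a fixed-size window are used.

```python
from typing import Callable, List, Sequence


def gated_scores(primary: Sequence[float],
                 fallback: Callable[[], Sequence[float]],
                 threshold: float = 0.8) -> List[float]:
    """Use the primary model's class scores when its top score exceeds
    `threshold`; otherwise call the fallback classifier (e.g. an MLP on embeddings)."""
    if max(primary) > threshold:
        return list(primary)
    return list(fallback())


def smooth(frames: Sequence[Sequence[float]], window: int = 3) -> List[List[float]]:
    """Average per-class scores over a centered sliding window of fixed size.

    The window is truncated at the start and end of the sequence (an assumption).
    """
    half = window // 2
    out = []
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        chunk = frames[lo:hi]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out
```

Passing the fallback as a callable means the second classifier only runs on frames where the primary model is not confident, which is one way to keep the approach fast.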


[46] VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos cs.CV

Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi

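TL;DR: VCBench is a streaming counting benchmark for evaluating the spatial-temporal state maintenance ability of video understanding models. It repositions counting as a minimal probe for diagnosing world state maintenance and decomposes the capability into object counting and event counting with 8 fine-grained subcategories. The benchmark contains 406 videos and 1,000 streaming QA pairs, and defines three complementary metrics to diagnose numerical precision, trajectory consistency, and temporal awareness.

Details

Motivation: Existing benchmarks have advanced video understanding evaluation along multiple dimensions, but observation of how models maintain world state remains insufficient. VCBench is proposed to diagnose this capability in video understanding models.

Result: Evaluation of mainstream video-language models shows that current models still have significant deficiencies in spatial-temporal state maintenance, performing especially poorly on tasks such as periodic event counting.

Insight: The contribution is using counting as a probe for state maintenance, observing state maintenance trajectories through streaming multi-point queries, and designing diagnostic metrics. Viewed objectively, the benchmark offers a systematic evaluation framework that can help improve dynamic state tracking in video understanding models.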

Abstract: Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, the observation of how models maintain world state remains insufficient. We propose VCBench, a streaming counting benchmark that repositions counting as a minimal probe for diagnosing world state maintenance capability. We decompose this capability into object counting (tracking currently visible objects vs. tracking cumulative unique identities) and event counting (detecting instantaneous actions vs. tracking complete activity cycles), forming 8 fine-grained subcategories. VCBench contains 406 videos with frame-by-frame annotations of 10,071 event occurrence moments and object state change moments, generating 1,000 streaming QA pairs with 4,576 query points along timelines. By observing state maintenance trajectories through streaming multi-point queries, we design three complementary metrics to diagnose numerical precision, trajectory consistency, and temporal awareness. Evaluation on mainstream video-language models shows that current models still exhibit significant deficiencies in spatial-temporal state maintenance, particularly struggling with tasks like periodic event counting. VCBench provides a diagnostic framework for measuring and improving state maintenance in video understanding systems.


[47] Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval cs.CV

Jing Yang, Hui Xue, Shipeng Zhu, Pengfei Fang

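TL;DR: This paper proposes TPSNet, a Text-Phase Synergy Network for unsupervised cross-domain image retrieval (UCDIR). The method uses CLIP-generated domain prompts as a text prior that provides more precise semantic supervision, and introduces domain-invariant phase features as a phase prior to bridge domain distribution gaps while preserving semantic integrity.

Details

Motivation: Existing methods typically use pseudo-labels generated by clustering algorithms as supervisory signals, but these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. The alignment process also tends to overlook the entanglement between domain-specific and semantic information, which causes semantic degradation in the learned representations and hurts retrieval performance.

Result: TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.

Insight: The novelty is the synergy of dual priors (a text prior and a phase prior). The text prior provides more precise semantic supervision through CLIP-generated domain prompts, while the phase prior uses domain-invariant phase features to bridge domain distribution gaps and preserve semantic integrity, which effectively mitigates semantic degradation.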

Abstract: This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.


[48] IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration cs.CV | cs.AIPDF

Dongxu Zhang, Jihua Zhu, Shiqi Li, Wenbiao Yan, Haoran Xu

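TL;DR: This paper proposes IGASA, a new point cloud registration framework built on a Hierarchical Pyramid Architecture (HPA) that integrates a Hierarchical Cross-Layer Attention (HCLA) module and an Iterative Geometry-Aware Refinement (IGAR) module. It uses multi-scale feature extraction and fusion to handle real-world challenges such as noise, occlusion, and large-scale transformations, improving registration accuracy and robustness.

Details

Motivation: Existing point cloud registration methods often lack accuracy and robustness when facing real-world challenges such as heavy noise, significant occlusions, and large-scale transformations. IGASA aims to address these problems.

Result: Extensive experiments on four widely recognized benchmark datasets, including 3D(Lo)Match, KITTI, and nuScenes, show that IGASA significantly surpasses existing state-of-the-art (SOTA) methods in registration accuracy.

Insight: The novelty is a synergistic architecture that integrates HCLA (which uses skip attention to align multi-resolution features and strengthen local geometric consistency) and IGAR (which performs fine matching on top of the reliable correspondences established during coarse matching). This design adapts effectively to diverse point cloud structures and complex transformations and provides a robust foundation for point cloud registration.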

Abstract: Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available at https://github.com/DongXu-Zhang/IGASA.


[49] CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration cs.CV | cs.AIPDF

Dongxu Zhang, Yingsen Wang, Yiding Sun, Haoran Xu, Peilin Fan

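TL;DR: This paper proposes CMHANet, a new Cross-Modal Hybrid Attention Network for point cloud registration. The method fuses the rich contextual information of 2D images with the geometric detail of 3D point clouds to produce a comprehensive and robust feature representation, and introduces a contrastive-learning-based optimization function to strengthen geometric consistency. Experiments on the 3DMatch and 3DLoMatch datasets show that it outperforms existing techniques in both registration accuracy and robustness, and it shows good zero-shot generalization on the TUM RGB-D SLAM dataset.

Details

Motivation: Existing learning-based point cloud registration methods degrade in complex real-world scenarios (such as incomplete data, sensor noise, and low-overlap regions), so a more robust method is needed to handle these challenges.

Result: On the 3DMatch and 3DLoMatch benchmarks, CMHANet achieves substantial improvements in both registration accuracy and robustness, surpassing current state-of-the-art (SOTA) techniques. Zero-shot evaluation on the TUM RGB-D SLAM dataset further verifies its generalization capability.

Insight: The novelty lies in cross-modal fusion (2D images with 3D point clouds) to enrich the feature representation, and in a contrastive-learning-based optimization function that enforces geometric consistency, improving robustness to noise and partial observations. Viewed objectively, this multimodal fusion strategy and contrastive optimization are an effective route to better point cloud registration performance.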

Abstract: Robust point cloud registration is a fundamental task in 3D computer vision and geometric deep learning, essential for applications such as large-scale 3D reconstruction, augmented reality, and scene understanding. However, the performance of established learning-based methods often degrades in complex, real-world scenarios characterized by incomplete data, sensor noise, and low-overlap regions. To address these limitations, we propose CMHANet, a novel Cross-Modal Hybrid Attention Network. Our method integrates the fusion of rich contextual information from 2D images with the geometric detail of 3D point clouds, yielding a comprehensive and resilient feature representation. Furthermore, we introduce an innovative optimization function based on contrastive learning, which enforces geometric consistency and significantly improves the model’s robustness to noise and partial observations. We evaluated CMHANet on the 3DMatch and the challenging 3DLoMatch datasets. Additionally, zero-shot evaluations on the TUM RGB-D SLAM dataset verify the model’s generalization capability to unseen domains. The experimental results demonstrate that our method achieves substantial improvements in both registration accuracy and overall robustness, outperforming current techniques. We also release our code at https://github.com/DongXu-Zhang/CMHANet.


[50] CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment cs.CV | cs.AIPDF

Kaifan Zhang, Lihuo He, Junjie Ke, Yuqi Ji, Lukun Wu

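TL;DR: This paper proposes CognitionCapturerPro, a framework that integrates EEG signals with multi-modal priors (images, text, depth, and edges) and uses collaborative training, an uncertainty-weighted similarity scoring mechanism, and a fusion encoder to significantly improve the fidelity of visual stimulus reconstruction from EEG/MEG signals.

Details

Motivation: To address the fidelity loss and representation shift that arise when reconstructing visual stimuli from EEG signals.

Result: On the THINGS-EEG dataset, the method significantly outperforms the original CognitionCapturer model, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively.

Insight: The novelty includes collaborative training with multi-modal priors, an uncertainty-weighted similarity scoring mechanism that quantifies modality-specific fidelity, and the combination of a simplified alignment module with a pre-trained diffusion model.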

Abstract: Visual stimuli reconstruction from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multi-modal priors (images, text, depth, and edges) via collaborative training. Our core contributions include an uncertainty-weighted similarity scoring mechanism to quantify modality-specific fidelity and a fusion encoder for integrating shared representations. By employing a simplified alignment module and a pre-trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS-EEG dataset, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: https://github.com/XiaoZhangYES/CognitionCapturerPro.
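
The Top-1 and Top-5 figures above are retrieval accuracies. As a rough illustration of how such a metric is usually computed, the sketch below ranks candidate images by similarity for each query and checks whether the true image appears among the top k. The function name `top_k_accuracy`, the similarity values, and the toy targets are assumptions made for illustration; this is not the authors' evaluation code.

```python
def top_k_accuracy(similarity, targets, k):
    """Fraction of queries whose true item is among the k highest-scoring candidates."""
    hits = 0
    for scores, target in zip(similarity, targets):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Hypothetical similarity matrix: 3 EEG queries x 4 candidate images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # true image 0 is ranked first
    [0.2, 0.4, 0.8, 0.5],  # true image 1 is ranked third
    [0.1, 0.6, 0.7, 0.3],  # true image 2 is ranked first
]
targets = [0, 1, 2]

top1 = top_k_accuracy(similarity, targets, k=1)
top3 = top_k_accuracy(similarity, targets, k=3)
print(f"top-1 = {top1:.3f}, top-3 = {top3:.3f}")
```

Top-k accuracy can only stay equal or increase as k grows, which is why Top-5 numbers are always at least as high as Top-1 numbers on the same test set.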


[51] Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World cs.CVPDF

Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou

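TL;DR: This paper studies how well multimodal large language models perceive and reason in a dynamic 4D world. It introduces Dyn-Bench, a large-scale benchmark for evaluating spatio-temporal dynamics understanding, finds that existing models perform inconsistently across spatio-temporal reasoning and dynamic object grounding, and shows that structured integration approaches significantly improve their capabilities.

Details

Motivation: Current multimodal large language models excel at static visual understanding but lack the ability to perceive, track, and reason about dynamic scenes in the physical 4D world (that is, geometric structure and semantic content that evolve over time). The paper aims to systematically evaluate and improve model performance in this area.

Result: Using the Dyn-Bench benchmark (1k videos, 7k visual question answering pairs, and 3k dynamic object grounding pairs), the paper tests existing models and finds that they cannot maintain strong performance on spatio-temporal reasoning and dynamic object grounding at the same time, whereas the structured integration approaches studied (such as Mask-Guided Fusion and ST-TCM) significantly improve model performance.

Insight: The novelty lies in building the large-scale dynamic-scene benchmark Dyn-Bench to systematically evaluate spatio-temporal understanding and in exposing the limitations of existing models. Structured integration approaches (such as Mask-Guided Fusion and the Spatio-Temporal Textual Cognitive Map) offer an effective route to stronger dynamics perception in multimodal large language models.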

Abstract: Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at “thinking in dynamics”, i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs’ dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at https://dyn-bench.github.io/.


[52] Show, Don’t Tell: Detecting Novel Objects by Watching Human Videos cs.CV | cs.LG | cs.ROPDF

James Akl, Jose Nicolas Avendano Arbelaez, James Barabas, Jennifer L. Barry, Kalie Ching

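TL;DR: This paper proposes “Show, Don’t Tell,” a self-supervised system that lets robots quickly recognize novel objects by watching human demonstration videos. The system automatically creates a dataset and uses the demonstration itself as supervision to train a bespoke object detector, avoiding the complex language descriptions and prompt engineering that conventional open-set detectors require.

Details

Motivation: To let robots quickly recognize novel objects while watching a human demonstration. Existing closed-set detectors cannot handle out-of-distribution objects, while open-set detectors (such as vision-language models) require expensive and tedious manual prompt engineering.

Result: Experiments on a real-world robot show that the pipeline significantly outperforms state-of-the-art methods at detecting and recognizing manipulated objects, improving the robot's task completion rate.

Insight: The novelty lies in bypassing language descriptions entirely and using human demonstration videos to automatically generate supervision for training a bespoke detector, enabling fast recognition of novel objects without human intervention. Viewed objectively, this is a practical paradigm that tightly couples self-supervised learning with robot perception and lowers the barrier to deployment.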

Abstract: How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, “Show, Don’t Tell,” we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our “Show, Don’t Tell” paradigm of automatic dataset creation and novel object-detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.


[53] SAP: Segment Any 4K Panorama cs.CVPDF

Lutao Jiang, Zidong Cao, Weikai Chen, Xu Zheng, Yuanhuiyi Lyu

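TL;DR: This paper proposes SAP (Segment Any 4K Panorama), a model for instance segmentation of 4K high-resolution panoramic images. It reformulates panoramic segmentation as fixed-trajectory perspective video segmentation, decomposing a panorama into overlapping perspective patches sampled along a continuous spherical traversal, which preserves native 4K resolution while restoring stable cross-view propagation. Trained on large-scale annotated data synthesized with the InfiniGen engine, the model achieves a significant zero-shot performance gain on a real-world 4K panorama benchmark.

Details

Motivation: Foundation models trained on perspective imagery degrade on panoramic images, so a solution designed specifically for high-resolution panoramic instance segmentation is needed.

Result: On a real-world 4K panorama benchmark, SAP achieves a +17.2 zero-shot mIoU gain over vanilla SAM2 models of different sizes, setting a new SOTA.

Insight: The novelty lies in recasting panoramic segmentation as a fixed-trajectory perspective video segmentation problem. A trajectory-aligned sampling strategy and large-scale supervision from synthetic data effectively address the performance degradation caused by viewpoint discontinuities in panoramic images.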

Abstract: Promptable instance segmentation is widely adopted in embodied and AR systems, yet the performance of foundation models trained on perspective imagery often degrades on 360° panoramas. In this paper, we introduce Segment Any 4K Panorama (SAP), a foundation model for 4K high-resolution panoramic instance-level segmentation. We reformulate panoramic segmentation as fixed-trajectory perspective video segmentation, decomposing a panorama into overlapping perspective patches sampled along a continuous spherical traversal. This memory-aligned reformulation preserves native 4K resolution while restoring the smooth viewpoint transitions required for stable cross-view propagation. To enable large-scale supervision, we synthesize 183,440 4K-resolution panoramic images with instance segmentation labels using the InfiniGen engine. Trained under this trajectory-aligned paradigm, SAP generalizes effectively to real-world 360° images, achieving a +17.2 zero-shot mIoU gain over vanilla SAM2 models of different sizes on a real-world 4K panorama benchmark.


[54] HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks cs.CVPDF

Xiaoyu Li, Yuhang Liu, Zheng Luo, Xuanshuo Kang, Fangqi Lou

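TL;DR: This paper proposes HIFICL (High-Fidelity In-Context Learning), a new method for improving the in-context learning (ICL) paradigm in Large Multimodal Models (LMMs). By introducing learnable virtual key-value pairs, a low-rank factorization, and an end-to-end training objective, the method models the ICL mechanism more precisely, reducing sensitivity to demonstration configurations and computational cost, and it outperforms existing approximation methods on several multimodal benchmarks.

Details

Motivation: In-context learning (ICL) in current Large Multimodal Models is sensitive to demonstration configurations and computationally expensive, and existing approximation methods (such as learning a “shift vector”) oversimplify the mathematical mechanism of ICL. This paper aims to address these problems with a high-fidelity modeling approach based on a more precise decomposition of the influence of demonstrations in ICL.

Result: Extensive experiments on several multimodal benchmarks show that HIFICL consistently outperforms existing approximation methods.

Insight: The novelty lies in an exact mathematical decomposition of the ICL mechanism and, based on it, the HIFICL framework comprising virtual key-value pairs, a low-rank factorization, and an end-to-end objective. This improves the fidelity and efficiency of ICL and can also be viewed as a form of context-aware Parameter-Efficient Fine-Tuning (PEFT), offering a new perspective on multimodal task adaptation.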

Abstract: In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations, and it is computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a “shift vector”. Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HIFICL) to more faithfully model the ICL mechanism. HIFICL consists of three key components: 1) a set of “virtual key-value pairs” to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HIFICL consistently outperforms existing approximation methods on several multimodal benchmarks. The code is available at https://github.com/bbbandari/HiFICL.
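
The decomposition mentioned in the abstract can be illustrated with a small single-query attention example. The sketch below appends low-rank "virtual" key-value pairs to the real keys and values, then checks that the combined attention output equals a convex mixture of the attention over the real tokens and the attention over the virtual context. All names (`attend`, `k_a`, `k_b`, `v_a`, `v_b`), the shapes, and the numbers are assumptions for illustration; the paper's actual parameterization and training objective are not reproduced here.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scores_of(query, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    d = len(query)
    return [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]


def attend(query, keys, values):
    """Single-query softmax attention."""
    weights = softmax(scores_of(query, keys))
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


# Real tokens (dimension d = 2).
query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

# Hypothetical learnable factors: 2 virtual pairs of rank r = 1.
k_a, k_b = [[1.0], [0.5]], [[0.2, -0.1]]
v_a, v_b = [[1.0], [-1.0]], [[0.3, 0.3]]
virtual_keys = matmul(k_a, k_b)    # (2 x 1) @ (1 x 2)
virtual_values = matmul(v_a, v_b)

combined = attend(query, keys + virtual_keys, values + virtual_values)

# Mixture view: lam is the share of softmax mass that falls on the real tokens.
z_real = sum(math.exp(s) for s in scores_of(query, keys))
z_virt = sum(math.exp(s) for s in scores_of(query, virtual_keys))
lam = z_real / (z_real + z_virt)
real_out = attend(query, keys, values)
virt_out = attend(query, virtual_keys, virtual_values)
mixture = [lam * r + (1 - lam) * v for r, v in zip(real_out, virt_out)]
print(combined, mixture)
```

The mixing weight `lam` depends on the query, which is what makes the mixture "dynamic". A fixed shift vector added to the attention output cannot express that query dependence.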


[55] TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation cs.CV | cs.LGPDF

Nazar Puriy, Johannes Jakubik, Benedikt Blumenstiel, Konrad Schindler

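TL;DR: TerraFlow is a new approach to multimodal, multitemporal representation learning for Earth observation. It builds on temporal training objectives that enable sequence-aware learning across space, time, and modality while remaining robust to the variable-length inputs common in real-world Earth observation data. Experiments show that TerraFlow outperforms existing Earth observation foundation models on all temporal tasks of the GEO-Bench-2 benchmark and makes notable progress on risk map prediction for natural disasters.

Details

Motivation: To learn efficiently from multimodal, multitemporal information in Earth observation data and to handle the variable-length inputs common in real-world data.

Result: TerraFlow reaches SOTA on all temporal tasks of the GEO-Bench-2 benchmark. On natural disaster risk map prediction, it outperforms other foundation models by up to 50% in F1 score and 24% in Brier score.

Insight: The novelty lies in temporal training objectives that enable sequence-aware learning across space, time, and modality, together with improved robustness to variable-length inputs. Viewed objectively, the method shows clear advantages in temporal modeling and in practical applications such as disaster prediction.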

Abstract: We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. Our experiments demonstrate the superiority of TerraFlow over state-of-the-art foundation models for Earth observation across all temporal tasks of the GEO-Bench-2 benchmark. We additionally demonstrate that TerraFlow is able to make initial steps towards deep-learning-based risk map prediction for natural disasters – a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.
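
For readers unfamiliar with the two metrics quoted above, the following sketch computes a binary F1 score and a Brier score using their standard definitions. The probabilities, labels, and the 0.5 decision threshold are made-up illustration values, and the helper names are assumptions; the abstract does not specify how the paper aggregates these metrics over risk maps.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def f1_score(preds, labels):
    """Harmonic mean of precision and recall for the positive class (higher is better)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-cell risk probabilities and ground-truth event labels.
probs = [0.9, 0.2, 0.6, 0.1]
labels = [1, 0, 0, 1]
preds = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 threshold

print("Brier:", brier_score(probs, labels))
print("F1:", f1_score(preds, labels))
```

F1 rewards correct hard decisions, while the Brier score also penalizes poorly calibrated probabilities, so the two metrics capture different aspects of a risk map.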


[56] SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion cs.CVPDF

Xiang Li, Heqian Qiu, Lanxiao Wang, Benliu Qiu, Fanman Meng

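TL;DR: This paper proposes SAVA-X, a framework for detecting errors when a first-person (ego) imitation is assessed against a third-person (exo) demonstration video. Using scene-adaptive view alignment and bidirectional cross-view fusion, it handles asynchronous, length-mismatched cross-view videos, localizes procedural steps, and decides whether each step is erroneous.

Details

Motivation: Most existing error detection methods assume a single view and cannot handle the practical setting in which a third-person demonstration is used to assess a first-person imitation. This setting brings cross-view domain shift, temporal misalignment, and heavy redundancy.

Result: On the EgoMe benchmark, SAVA-X outperforms all baselines in both AUPRC and mean tIoU, and ablations confirm the complementary benefits of its components (adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion).

Insight: The novelty includes formalizing the new task of Ego→Exo imitation error detection, and designing an Align-Fuse-Detect framework that combines view-conditioned adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion to handle cross-view domain gaps and temporal misalignment.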

Abstract: Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego→Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
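
Mean tIoU measures how well predicted step segments overlap the annotated ones on the timeline. The sketch below shows the standard temporal IoU between two `(start, end)` intervals and a simple average over paired segments. The one-to-one pairing of predictions with ground truth and the example timestamps are assumptions for illustration; the benchmark's actual matching protocol is not described in the abstract.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_tiou(preds, gts):
    """Average tIoU over segments that are already paired one-to-one (assumed)."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Hypothetical predicted and annotated step segments on the ego timeline.
preds = [(2.0, 6.0), (0.0, 5.0)]
gts = [(4.0, 8.0), (0.0, 5.0)]

print(temporal_iou(preds[0], gts[0]))  # partial overlap
print(mean_tiou(preds, gts))
```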


[57] PVI: Plug-in Visual Injection for Vision-Language-Action Models cs.CV | cs.LG | cs.ROPDF

Zezhou Zhang, Songxin Zhang, Xiao Xiong, Junjie Zhang, Zejian Xie

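TL;DR: This paper proposes PVI (Plug-in Visual Injection), a lightweight module for enhancing vision-language-action models. Through zero-initialized residual pathways, it injects auxiliary visual representations (in particular temporal video features) into a pretrained action expert without major architectural changes, and improves policy performance with only single-stage fine-tuning. Gains are largest on multi-phase tasks that require state tracking and coordination, and a real-robot long-horizon bimanual cloth folding task confirms its practicality.

Details

Motivation: The vision-language model in existing VLA architectures is typically optimized for static visual observations, which attenuates fine-grained geometric cues and lacks explicit temporal evidence, limiting the action expert's performance. Prior work injects auxiliary visual features, but either focuses only on static spatial representations or requires substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored.

Result: PVI yields consistent gains over the base policy and over a range of competitive alternative injection strategies. A controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate PVI's practicality beyond simulation.

Insight: The novelty is a lightweight, encoder-agnostic plug-in module that injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior and requiring only single-stage fine-tuning. Viewed objectively, the core insight is the importance of temporal video features for robot manipulation tasks that require state tracking, together with an efficient, non-invasive way to fuse such features.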

Abstract: VLA architectures that pair a pretrained VLM with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often lacks explicit temporal evidence for the action expert. Prior work mitigates this by injecting auxiliary visual features, but existing approaches either focus on static spatial representations or require substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. Using PVI, we obtain consistent gains over the base policy and a range of competitive alternative injection strategies, and our controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of PVI beyond simulation.
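
The phrase "zero-initialized residual pathways, preserving pretrained behavior" can be made concrete with a tiny example. In the sketch below, an auxiliary feature vector is projected by a weight matrix that starts at zero and is added to a hidden state, so the output is identical to the pretrained hidden state until fine-tuning moves the weights. The class name `ZeroInitInjection`, the dimensions, and the plain linear projection are assumptions for illustration; the paper's actual module design is not specified in the abstract.

```python
def linear(weight, x):
    """Matrix-vector product with the weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


class ZeroInitInjection:
    """Residual injection of auxiliary features through a zero-initialized projection."""

    def __init__(self, out_dim, aux_dim):
        # All-zero weights: the injected term is exactly zero before training.
        self.proj = [[0.0] * aux_dim for _ in range(out_dim)]

    def __call__(self, hidden, aux):
        delta = linear(self.proj, aux)
        return [h + d for h, d in zip(hidden, delta)]


hidden = [1.0, -2.0]   # hypothetical hidden state of the action expert
aux = [4.0, 0.5, 3.0]  # hypothetical auxiliary visual feature

inject = ZeroInitInjection(out_dim=2, aux_dim=3)
at_init = inject(hidden, aux)       # equals hidden: pretrained behavior is preserved

inject.proj[0][0] = 0.5             # stand-in for a weight update during fine-tuning
after_update = inject(hidden, aux)  # auxiliary features now influence the output
print(at_init, after_update)
```

Because the projection starts at zero, the pretrained policy's outputs are unchanged at the first fine-tuning step, which avoids disrupting it with randomly initialized features.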


[58] Empowering Semantic-Sensitive Underwater Image Enhancement with VLM cs.CV | cs.AI | eess.IVPDF

Guodong Fan, Shengning Zhou, Genji Yuan, Huiyu Li, Jingchun Zhou

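TL;DR: This paper proposes a new learning mechanism that uses Vision-Language Models (VLMs) to give underwater image enhancement semantic awareness. A VLM generates textual descriptions of key objects from the degraded image, and a text-image alignment model turns them into a spatial semantic guidance map. A dual-guidance mechanism (cross-attention plus an explicit alignment loss) then steers the UIE network to focus on semantic-sensitive regions during reconstruction rather than enhancing the image uniformly, faithfully restoring key object features.

Details

Motivation: In existing learning-based underwater image enhancement (UIE) techniques, the distribution shift between high-quality enhanced outputs and natural images hinders semantic cue extraction for downstream vision tasks, which limits the adaptability of enhancement models.

Result: Experiments show that applying the strategy to different UIE baselines significantly improves their performance on perceptual quality metrics and on detection and segmentation tasks, validating its effectiveness and adaptability.

Insight: The novelty lies in combining the semantic understanding of VLMs with the UIE task: a spatial semantic guidance map enables semantic-sensitive local enhancement instead of globally uniform processing, which makes the enhanced results more useful for downstream tasks. Viewed objectively, this is a novel cross-modal guidance mechanism that injects high-level semantic information as a prior into low-level visual restoration and could be applied to other image restoration tasks.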

Abstract: In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. To be concrete, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, it significantly boosts their performance on perceptual quality metrics and enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.


[59] Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning cs.CVPDF

Mengya Xu, Daiyun Shen, Jie Zhang, Hon Chi Yip, Yujia Gao

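TL;DR: This paper presents a dataset of more than 11,000 video clips covering 10 basic surgical actions across 6 surgical specialties, and builds on it a foundation model for general-purpose recognition of basic actions. The model shows robust cross-specialty performance on datasets from different procedure types and body parts, and demonstrates its potential in downstream applications such as surgical skill assessment (prostatectomy) and surgical action planning based on large vision-language models (cholecystectomy and nephrectomy).

Details

Motivation: Understanding and modeling basic surgical actions (BSA), the fundamental unit of operation in any surgery, is essential for advancing how artificial intelligence, imaging, and large language models transform surgical practice, training, and automation.

Result: The model is validated on datasets from different procedure types and body parts and shows robust cross-specialty performance. In downstream applications, both domain-knowledge-based skill assessment in prostatectomy and large-vision-language-model-based action planning in cholecystectomy and nephrectomy are validated, and surgeons from multiple countries judged the explainable texts generated for action planning to be clinically relevant.

Insight: The novelty lies in building the largest BSA dataset to date and developing a general-purpose recognition foundation model on it, demonstrating that BSA recognition is robust across scenarios. The core insight is that an accurate BSA understanding model can serve as a foundation that facilitates complex applications (such as skill assessment and surgical planning) and speeds up the realization of surgical superintelligence.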

Abstract: Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, which is the largest to date. Based on the BSA dataset, we developed a new foundation model that conducts general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialty performance in experiments validated on datasets from different procedural types and various body parts. Furthermore, we demonstrate downstream applications enabled by the BSA foundation model through surgical skill assessment in prostatectomy using domain-specific knowledge, and action planning in cholecystectomy and nephrectomy using large vision-language models. Multinational surgeons’ evaluation of the language model’s output of the action planning explainable texts demonstrated clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and an accurate BSA understanding model can essentially facilitate complex applications and speed up the realization of surgical superintelligence.


[60] Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing cs.CVPDF

Shuchang Lyu, Haiquan Wen, Guangliang Cheng, Meng Li, Zheng Zhou

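TL;DR: For remote sensing visual grounding, this paper introduces ME-RSRG, a new benchmark dataset for multi-entity reasoning grounding, and designs EAR, an entity-aware reasoning framework built on visual-linguistic foundation models. EAR improves multi-entity reasoning through supervised fine-tuning followed by Group Relative Policy Optimization driven by entity-aware rewards.

Details

Motivation: Existing remote sensing grounding methods are largely confined to perception-level matching and single-entity tasks and lack explicit reasoning and inter-entity modeling, so reasoning paradigms need to be extended to multi-entity reasoning grounding.

Result: Extensive experiments on the ME-RSRG dataset demonstrate how challenging multi-entity reasoning is and verify the effectiveness of the proposed EAR framework.

Insight: The novelty includes a benchmark dataset for multi-entity reasoning grounding, the reformulation of remote sensing grounding as a multi-entity reasoning task, and the EAR framework, which combines structured reasoning trace generation with an entity-aware reward optimization strategy. Together these move remote sensing grounding from perception toward reasoning.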

Abstract: Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to the remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon visual-linguistic foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at https://github.com/CV-ShuchangLyu/ME-RSRG.
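
GRPO replaces a learned value baseline with statistics computed over a group of responses sampled for the same prompt. The sketch below shows the commonly used group-relative advantage: each reward minus the group mean, divided by the group standard deviation. The reward values are made up, the `eps` term and the use of the population standard deviation are implementation assumptions, and the paper's entity-aware reward itself is not modeled here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for 4 responses sampled for the same grounding query.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group average receive a positive advantage and are reinforced, while those below average are pushed down. When all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal.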


[61] Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass cs.CVPDF

Sangmin Kim, Minhyuk Hwang, Geonho Cha, Dongyoon Wee, Jaesik Park

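TL;DR: CHROMM is a unified framework that jointly estimates camera parameters, scene point clouds, and human meshes from multi-person multi-view video without relying on external modules or preprocessing. It integrates geometric and human priors, resolves the scale discrepancy between humans and the scene with a scale adjustment module, and uses a multi-view fusion strategy and a geometry-based multi-person association method. It achieves competitive performance on several datasets while running over 8x faster than optimization-based multi-view approaches.

Details

Motivation: Existing 3D foundation models focus mainly on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. A unified method is therefore needed that has no external dependencies and can reconstruct coherent humans and scenes directly from multi-person multi-view video.

Result: Experiments on the EMDB, RICH, EgoHumans, and EgoExo4D datasets show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than optimization-based multi-view approaches.

Insight: The novelty includes integrating the priors of Pi3X and Multi-HMR into a single trainable network architecture, introducing a scale adjustment module to resolve the human-scene scale discrepancy, using a multi-view fusion strategy to aggregate per-view estimates, and proposing a geometry-based multi-person association method for better robustness. Viewed objectively, the method achieves end-to-end joint estimation, avoids preprocessing dependencies, and improves efficiency and practicality.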

Abstract: Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to solve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test-time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.


[62] Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation cs.CV | cs.AIPDF

Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang

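TL;DR: This paper proposes Cheers, a model that unifies visual comprehension and generation by decoupling patch-level image details from semantic representations. It has three key components, a unified vision tokenizer, an LLM-based Transformer, and a cascaded flow matching head, and it improves generation fidelity while keeping semantics stable.

Details

Motivation: Visual comprehension and generation in multimodal models demand mismatched decoding regimes and visual representations, which makes them hard to optimize jointly in a shared feature space.

Result: On benchmarks such as GenEval and MMBench, Cheers matches or surpasses advanced unified multimodal models while requiring only 20% of the training cost. It also achieves 4x visual token compression, enabling efficient high-resolution image encoding and generation.

Insight: The novelty lies in decoupling semantic and detail representations through a gated detail residual mechanism and in a cascaded decoding pipeline, which together balance comprehension stability and generation quality within a unified framework and make unified multimodal modeling efficient.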

Abstract: A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.


[63] What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models cs.CVPDF

Sen Nie, Jie Zhang, Zhongqi Wang, Zhaoyang Wei, Shiguang Shan

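TL;DR: This paper studies the trade-off between adversarial robustness and clean-data accuracy in Vision-Language Models (VLMs). Its analysis finds that adversarial robustness is concentrated in the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns, while updates to the deep layers hurt accuracy. Based on this, the authors propose Adversarial Robustness Adaptation (R-Adapt), a framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers, achieving an excellent balance between adversarial robustness and clean accuracy.

Details

Motivation: To address the long-standing and challenging trade-off between adversarial robustness and clean-data accuracy in vision-language models, and to investigate the internal mechanisms behind VLM robustness.

Result: Extensive evaluation on 18 datasets and diverse tasks shows that the method reaches state-of-the-art (SOTA) performance under various attacks and generalizes efficiently to large vision-language models (such as LLaVA and Qwen-VL) to enhance their robustness.

Insight: The novelty lies in showing that adversarial robustness is distributed non-uniformly in VLMs and concentrated in the shallow layers, and in designing R-Adapt accordingly: freezing the pre-trained weights and adapting only the shallow layers balances robustness and accuracy. The framework supports training-free, model-guided, and data-driven paradigms, offering flexible ways to add robustness to a model.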

Abstract: Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.


[64] OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution cs.CV

Shijie Zhao, Xuanyu Zhang, Bin Chen, Weiqi Li, Qunliang Xing

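TL;DR: This paper proposes OARS, an online alignment method for generative real-world image super-resolution. It uses COMPASS, an MLLM-based reward model, to evaluate the fidelity and perceptual gain of the low-resolution-to-high-resolution transition, and optimizes the model with progressive online reinforcement learning, improving perceptual quality while maintaining fidelity.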

Details

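Motivation: To address the difficulty of aligning generative real-world image super-resolution with human visual preference, which stems from the perception-fidelity trade-off and diverse, unknown degradations, and to overcome the limitations of existing offline preference optimization methods, which are poorly interpretable and prone to pseudo-diversity under strong conditioning.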

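Result: Achieves state-of-the-art performance on Real-ISR benchmarks; extensive experiments and user studies show consistent perceptual quality improvements while maintaining fidelity.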

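Insight: Contributions include COMPASS, an MLLM-based reward model with an input-quality-adaptive trade-off; COMPASS-20K, an annotated dataset covering synthetic and real degradations; a three-stage perceptual annotation pipeline; and a progressive online alignment framework that moves from cold-start flow matching to full-reference and then reference-free RL, using shallow LoRA optimization for on-policy exploration.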

Abstract: Aligning generative real-world image super-resolution models with human visual preference is challenging due to the perception–fidelity trade-off and diverse, unknown degradations. Prior approaches rely on offline preference optimization and static metric aggregation, which are often non-interpretable and prone to pseudo-diversity under strong conditioning. We propose OARS, a process-aware online alignment framework built on COMPASS, an MLLM-based reward that evaluates the LR-to-SR transition by jointly modeling fidelity preservation and perceptual gain with an input-quality-adaptive trade-off. To train COMPASS, we curate COMPASS-20K spanning synthetic and real degradations, and introduce a three-stage perceptual annotation pipeline that yields calibrated, fine-grained training labels. Guided by COMPASS, OARS performs progressive online alignment from cold-start flow matching to full-reference and finally reference-free RL via shallow LoRA optimization for on-policy exploration. Extensive experiments and user studies demonstrate consistent perceptual improvements while maintaining fidelity, achieving state-of-the-art performance on Real-ISR benchmarks.


[65] Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation cs.CV

Fei Wang, Xinye Zheng, Kun Li, Yanyan Wei, Yuxin Liu

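TL;DR: This paper proposes the Enzyme-Reaction Bridging Adapter (ERBA), a multimodal protein language model approach for predicting enzyme kinetic parameters (such as k_cat, K_m, and K_i). The method reformulates kinetic prediction as a staged multimodal conditional modeling problem, captures substrate recognition and conformational adaptation through molecular recognition cross-attention and a geometry-aware mixture-of-experts, respectively, and uses distribution alignment to preserve semantic fidelity, yielding performance gains and stronger out-of-distribution generalization on multiple benchmarks.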

Details

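Motivation: Existing learning methods usually reduce enzyme kinetic parameter prediction to a static compatibility problem between enzyme and substrate, fusing representations through shallow operations and regressing a single value, which ignores the staged nature of catalysis (substrate recognition and conformational adaptation). This paper aims to address this with a more biologically grounded modeling approach.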

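Result: Experiments across three kinetic endpoints (k_cat, K_m, K_i) and multiple protein language model backbones show that ERBA delivers consistent gains over sequence-only and shallow-fusion baselines, along with stronger out-of-distribution performance.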

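Insight: Contributions include: recasting kinetic prediction as a staged multimodal conditional modeling problem; designing the ERBA adapter, which models substrate recognition and conformational adaptation through two-stage conditioning (MRCA and G-MoE); and introducing ESDA, which enforces distributional consistency in a reproducing kernel Hilbert space to preserve semantic fidelity. This gives scalable kinetic prediction a biological grounding and supports future integration of cofactors, mutations, and time-resolved structural cues.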

Abstract: Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number ($k_\text{cat}$), Michaelis constant ($K_\text{m}$), and inhibition constant ($K_\text{i}$) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process to a static compatibility problem between the enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. Across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.
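
For readers less familiar with the three kinetic endpoints, the sketch below shows how k_cat, K_m, and K_i enter the textbook Michaelis-Menten rate law with competitive inhibition. This is standard enzyme kinetics, not ERBA's model; the function names and the numbers in the usage note are illustrative assumptions.

```python
def michaelis_menten_rate(
    k_cat: float,
    k_m: float,
    enzyme_conc: float,
    substrate_conc: float,
    inhibitor_conc: float = 0.0,
    k_i: float = float("inf"),
) -> float:
    """Initial rate v = k_cat*[E]*[S] / (K_m*(1 + [I]/K_i) + [S]).

    The (1 + [I]/K_i) factor is the textbook competitive-inhibition form;
    with no inhibitor it reduces to the plain Michaelis-Menten equation.
    """
    apparent_k_m = k_m * (1.0 + inhibitor_conc / k_i)
    return k_cat * enzyme_conc * substrate_conc / (apparent_k_m + substrate_conc)


def catalytic_efficiency(k_cat: float, k_m: float) -> float:
    """Specificity constant k_cat / K_m, a common summary of enzyme efficiency."""
    return k_cat / k_m
```

With the illustrative values k_cat = 10, K_m = 2, [E] = 1, and [S] = 2, the substrate concentration equals K_m, so the rate is half of the maximum k_cat*[E], that is, 5. A smaller K_i (tighter inhibitor binding) raises the apparent K_m and lowers the rate at the same substrate concentration.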


[66] Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach cs.CV | cs.AI

Elena Ryumina, Alexandr Axyonov, Dmitry Sysoev, Timur Abdulkadirov, Kirill Almetov

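TL;DR: This paper presents Team LEYA's multimodal ambivalence/hesitancy recognition approach for the 10th ABAW Competition. The approach fuses four modalities (scene, face, audio, and text), extracting features with VideoMAE, emotional frame-level embeddings, EmotionWav2Vec2.0 with a Mamba encoder, and fine-tuned Transformer models, respectively, and combines them with multimodal fusion models (including prototype-augmented variants). Experiments on the BAH corpus show that multimodal fusion clearly outperforms all unimodal baselines.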

Details

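Motivation: To address the difficulty of recognizing ambivalence/hesitancy in unconstrained videos; this behavioral state is subtle, multimodal, and context-dependent, so it calls for an approach that integrates multiple complementary sources of information.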

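Result: On the BAH corpus, the best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance (71.43%) was obtained by an ensemble of five prototype-augmented fusion models.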

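Insight: The novelty lies in integrating four complementary modalities with a robust fusion strategy, notably a Mamba-based temporal encoder for audio and prototype-augmented fusion variants. More broadly, the work underscores the importance of complementary multimodal cues and effective fusion for recognizing complex, subtle affective states.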

Abstract: Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.
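
The abstract says facial frame-level embeddings are "aggregated by statistical pooling" without defining the statistics. A minimal sketch, assuming the common choice of concatenating the per-dimension mean and standard deviation over frames (the function name and the use of the population standard deviation are assumptions, not details from the paper):

```python
import math
from typing import List


def statistical_pooling(frame_embeddings: List[List[float]]) -> List[float]:
    """Turn a variable-length list of frame embeddings into one video-level vector
    by concatenating the per-dimension mean and standard deviation."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    means = [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]
    # Population standard deviation (divides by n); an assumption for this sketch.
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frame_embeddings) / n)
        for d in range(dim)
    ]
    return means + stds
```

The output length is twice the embedding dimension regardless of how many frames the video has, which is what lets a fixed-size fusion model consume videos of different lengths.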


[67] Wear Classification of Abrasive Flap Wheels using a Hierarchical Deep Learning Approach cs.CV | cs.LG

Falko Kähler, Maxim Wille, Ole Schmedemann, Thorsten Schüppstuhl

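TL;DR: This paper proposes a novel vision-based hierarchical deep learning framework for automated wear condition monitoring of abrasive flap wheels. It decomposes wear classification into three logical levels: state detection, wear type identification with flap tear detection, and severity assessment. Using a custom dataset and EfficientNetV2-based transfer learning, the study achieves highly robust classification performance.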

Details

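Motivation: Abrasive flap wheels are commonly used to finish complex free-form surfaces because of their flexibility, but that flexibility produces complex wear patterns (such as concave/convex flap profiles or flap tears) that affect the grinding result. Existing monolithic classification approaches handle this poorly, so an automated, fine-grained wear monitoring method is needed.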

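Result: On a custom dataset of real flap wheel images, the proposed hierarchical framework is highly robust, with classification accuracies ranging from 93.8% (flap tear detection) to 99.3% (concave wear severity assessment).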

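Insight: The main contribution is decomposing the complex wear classification task into a clearly structured three-level hierarchy, enabling fine-grained analysis from overall state down to wear type and severity. In addition, Grad-CAM is used to verify that the models learn physically relevant features, which improves interpretability and lays a foundation for adaptive process control.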

Abstract: Abrasive flap wheels are common for finishing complex free-form surfaces due to their flexibility. However, this flexibility results in complex wear patterns such as concave/convex flap profiles or flap tears, which influence the grinding result. This paper proposes a novel, vision-based hierarchical classification framework to automate the wear condition monitoring of flap wheels. Unlike monolithic classification approaches, we decompose the problem into three logical levels: (1) state detection (new vs. worn), (2) wear type identification (rectangular, concave, convex) and flap tear detection, and (3) severity assessment (partial vs. complete deformation). A custom-built dataset of real flap wheel images was generated and a transfer learning approach with EfficientNetV2 architecture was used. The results demonstrate high robustness with classification accuracies ranging from 93.8% (flap tears) to 99.3% (concave severity). Furthermore, Gradient-weighted Class Activation Mapping (Grad-CAM) is utilized to validate that the models learn physically relevant features and examine false classifications. The proposed hierarchical method provides a basis for adaptive process control and wear consideration in automated flap wheel grinding.
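
The three-level decomposition can be read as a cascade in which later classifiers run only when earlier ones make them relevant. The sketch below illustrates that control flow with placeholder callables standing in for the trained EfficientNetV2 models; the names, the `WearReport` type, and the rule that severity is assessed only for concave or convex wear are assumptions for illustration, not details confirmed by the abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class WearReport:
    state: str                          # "new" or "worn"
    wear_type: Optional[str] = None     # "rectangular", "concave", or "convex"
    has_flap_tear: Optional[bool] = None
    severity: Optional[str] = None      # "partial" or "complete"


def classify_wear(
    image: Any,
    state_clf: Callable[[Any], str],
    type_clf: Callable[[Any], str],
    tear_clf: Callable[[Any], bool],
    severity_clf: Callable[[Any, str], str],
) -> WearReport:
    # Level 1: state detection. A new wheel needs no further analysis.
    state = state_clf(image)
    if state == "new":
        return WearReport(state="new")
    # Level 2: wear type identification and flap tear detection.
    wear_type = type_clf(image)
    has_tear = tear_clf(image)
    # Level 3: severity, assumed here to apply only to deformed profiles.
    severity = None
    if wear_type in ("concave", "convex"):
        severity = severity_clf(image, wear_type)
    return WearReport(state, wear_type, has_tear, severity)
```

One practical consequence of this structure is that each level can be trained and evaluated on its own, which matches how the abstract reports separate accuracies for flap tears and concave severity.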


[68] Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation cs.CV

Yifan Zhan, Zhengqing Chen, Qingjie Wang, Zhuo He, Muyao Niu

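TL;DR: This paper proposes CompoSIA, a compositional video simulator for adversarial driving scenario generation. By disentangling control over scene structure, object identity, and ego actions, it enables fine-grained synthesis of diverse hazardous scenarios, targeting the "long tail" of safety-critical edge cases in autonomous driving.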

Details

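Motivation: Autonomous driving faces a "long tail" of safety-critical edge cases, which often arise from unusual combinations of common traffic elements. Existing controllable generative models provide incomplete or entangled guidance and cannot independently manipulate scene structure, object identity, and ego actions, which hinders the synthesis of such adversarial scenarios.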

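Result: Surpasses state-of-the-art baselines in controllable generation quality, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Downstream stress-testing shows a substantial increase in planner failures: across editing modalities, the average 3 s collision rate increases by 173%.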

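Insight: The core contribution is a disentangled control framework comprising (1) a noise-level identity injection method that enables pose-agnostic identity generation from a single reference image, and (2) a hierarchical dual-branch action control mechanism that improves action controllability. This makes it possible to systematically combine safe elements into dangerous configurations, which entangled generators cannot do.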

Abstract: A major challenge in autonomous driving is the “long tail” of safety-critical edge cases, which often emerge from unusual combinations of common traffic elements. Synthesizing these scenarios is crucial, yet current controllable generative models provide incomplete or entangled guidance, preventing the independent manipulation of scene structure, object identity, and ego actions. We introduce CompoSIA, a compositional driving video simulator that disentangles these traffic factors, enabling fine-grained control over diverse adversarial driving scenarios. To support controllable identity replacement of scene elements, we propose a noise-level identity injection, allowing pose-agnostic identity generation across diverse element poses, all from a single reference image. Furthermore, a hierarchical dual-branch action control mechanism is introduced to improve action controllability. Such disentangled control enables adversarial scenario synthesis: systematically combining safe elements into dangerous configurations that entangled generators cannot produce. Extensive comparisons demonstrate superior controllable generation quality over state-of-the-art baselines, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Furthermore, downstream stress-testing reveals substantial planner failures: across editing modalities, the average 3 s collision rate increases by 173%.


[69] Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning cs.CV | cs.LG

Mingkai Zhai, Wei Wang, Zongsheng Li, Quanying Liu

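TL;DR: This paper introduces a new task of video-based epileptic seizure forecasting, in which pre-ictal video segments of 3-10 seconds are used to predict whether a seizure will occur within the next 5 seconds. To address the scarcity of annotated human epilepsy videos, the authors design a cross-species transfer learning framework that uses large-scale rodent video data for auxiliary pretraining, capturing seizure-related behavioral dynamics that generalize across species. Experiments show that the method achieves over 70% prediction accuracy in a video-only setting and outperforms existing baselines.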

Details

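Motivation: Existing seizure forecasting methods rely mainly on neural signals such as EEG, which require specialized equipment and are hard to deploy long term, while video-based methods mostly focus on post-onset detection and leave forecasting underexplored. This paper aims to use non-invasive, easily obtained video data for early seizure prediction.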

Result: Under a strictly video-only setting, the method achieves over 70% prediction accuracy and outperforms existing baseline models.

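Insight: The novelty lies in systematically formulating video-based seizure forecasting as a task for the first time and in proposing a cross-species transfer learning framework that uses animal video data to mitigate the scarcity of human data, pointing toward non-invasive, scalable early-warning systems for epilepsy.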

Abstract: Epileptic seizure forecasting is a clinically important yet challenging problem in epilepsy research. Existing approaches predominantly rely on neural signals such as electroencephalography (EEG), which require specialized equipment and limit long-term deployment in real-world settings. In contrast, video data provide a non-invasive and accessible alternative, yet existing video-based studies mainly focus on post-onset seizure detection, leaving seizure forecasting largely unexplored. In this work, we formulate a novel task of video-based epileptic seizure forecasting, where short pre-ictal video segments (3-10 seconds) are used to predict whether a seizure will occur within the subsequent 5 seconds. To address the scarcity of annotated human epilepsy videos, we propose a cross-species transfer learning framework that leverages large-scale rodent video data for auxiliary pretraining. This enables the model to capture seizure-related behavioral dynamics that generalize across species. Experimental results demonstrate that our approach achieves over 70% prediction accuracy under a strictly video-only setting and outperforms existing baselines. These findings highlight the potential of cross-species learning for building non-invasive, scalable early-warning systems for epilepsy.
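
The task definition (a 3-10 second pre-ictal clip, a 5 second forecast horizon) translates into a simple labeling rule for building training examples. The sketch below is one possible reading; the function name, the inclusive/exclusive boundary choices, and the rejection of clips that already contain an onset are assumptions, since the abstract does not spell them out.

```python
from typing import Iterable


def forecast_label(
    clip_start: float,
    clip_end: float,
    onset_times: Iterable[float],
    horizon_s: float = 5.0,
    min_len_s: float = 3.0,
    max_len_s: float = 10.0,
) -> int:
    """Return 1 if a seizure onset falls within `horizon_s` seconds after the clip ends."""
    onsets = list(onset_times)
    duration = clip_end - clip_start
    if not (min_len_s <= duration <= max_len_s):
        raise ValueError("clip must be between 3 and 10 seconds long")
    # A clip that already contains an onset is a detection example, not a forecasting one.
    if any(clip_start <= t <= clip_end for t in onsets):
        raise ValueError("clip contains a seizure onset and is not pre-ictal")
    return int(any(clip_end < t <= clip_end + horizon_s for t in onsets))
```

For example, a clip covering seconds 0 to 5 with an annotated onset at second 8 is a positive example, while the same clip with the nearest onset at second 20 is a negative one.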


[70] Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models cs.CV | cs.AI | cs.LG | cs.NE | stat.ML

David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa

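TL;DR: This paper proposes a finite difference flow optimization method for RL post-training of text-to-image models. It treats the entire sampling process as a single action and samples paired trajectories to reduce the variance of model updates, improving image quality and prompt alignment.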

Details

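Motivation: Existing RL post-training methods treat each sampling step as a separate policy action, which leads to high variance. This paper aims to reduce that variance with an online RL variant and thereby improve image synthesis models.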

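Result: In experiments using high-quality vision-language models and off-the-shelf quality metrics as rewards, the method converges faster and yields higher output quality and prompt alignment than previous approaches, as evaluated with a broad set of metrics.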

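Insight: The novelty lies in treating the entire sampling process as a single action and optimizing the flow velocity direction through paired-trajectory sampling, which lowers variance and improves training efficiency; the idea may carry over to post-training of diffusion models more generally.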

Abstract: Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.


[71] FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts cs.CV | cs.AI

Xin Xu, Weilong Li, Wei Liu, Wenke Huang, Zhixi Yu

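TL;DR: This paper proposes FedBPrompt, a method for federated domain generalization person re-identification (FedDG-ReID). It introduces learnable visual prompts that guide the Vision Transformer (ViT) to attend to pedestrian-centric regions, countering cross-client distribution shifts and background noise. Its core components are the Body Distribution Aware Visual Prompts Mechanism (BAPM), which strengthens feature discrimination through Holistic Full Body Prompts and Body Part Alignment Prompts, and the Prompt-based Fine-Tuning Strategy (PFTS), which greatly reduces communication overhead by updating only lightweight prompts.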

Details

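Motivation: In FedDG-ReID, the global attention of the Vision Transformer struggles to distinguish pedestrians from highly similar backgrounds or across varying viewpoints, and cross-client distribution shifts amplify the problem. Existing methods fall short in communication efficiency and feature robustness.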

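Result: Extensive experiments show that BAPM effectively improves feature discrimination and cross-domain generalization, and that PFTS achieves notable performance gains within only a few aggregation rounds. The method can be easily integrated into existing ViT-based FedDG-ReID frameworks.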

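Insight: The novelty lies in the body-distribution-aware visual prompt mechanism, which splits prompt learning into holistic and part-level prompts to handle background noise and pose variation, respectively, together with an efficient prompt-based fine-tuning strategy that freezes the backbone and updates only the prompts, greatly reducing federated communication cost while maintaining adaptability.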

Abstract: Federated Domain Generalization for Person Re-Identification (FedDG-ReID) learns domain-invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from high similarity backgrounds or diverse viewpoints – a challenge amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising: Holistic Full Body Prompts to suppress cross-client background noise, and Body Part Alignment Prompts to capture fine-grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code is available at https://github.com/leavlong/FedBPrompt.
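
To make the communication saving of PFTS concrete, the sketch below shows a server-side aggregation step in which clients upload only their prompt parameters while the ViT backbone stays frozen and is never transmitted. The FedAvg-style weighting by client dataset size is an assumption for illustration; the abstract does not state which aggregation rule FedBPrompt uses, and the function name is hypothetical.

```python
from typing import List, Sequence


def aggregate_prompts(
    client_prompts: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Weighted average of flattened prompt vectors, one vector per client.

    Only these small vectors cross the network; backbone weights are frozen
    on every client and therefore need no aggregation.
    """
    if len(client_prompts) != len(client_sizes) or not client_prompts:
        raise ValueError("need one positive-length prompt vector and one size per client")
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[i] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for i in range(dim)
    ]
```

Because the payload per round scales with the number of prompt parameters rather than the size of the backbone, each aggregation round is cheap, which is consistent with the abstract's claim of reduced communication overhead.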


[72] Rethinking VLMs for Image Forgery Detection and Localization cs.CV | cs.LG

Shaofeng Guo, Jiequan Cui, Richang Hong

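TL;DR: This paper proposes IFDL-VLM, a new method that uses vision-language models (VLMs) to improve image forgery detection and localization (IFDL). The study finds that the inherent priors of VLMs favor semantic plausibility over authenticity, which can hurt detection and localization. The authors therefore introduce location masks as extra priors to guide the VLM, easing training optimization and improving the interpretability of the results.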

Details

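Motivation: With the rapid growth of AIGC, image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization. This paper explores how to fully leverage VLMs for the IFDL task and how to counter the negative effects that VLM priors can introduce in existing approaches.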

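Result: Experiments on 9 popular benchmarks evaluate performance under both in-domain and cross-dataset generalization settings. The results show new state-of-the-art (SOTA) performance in detection, localization, and interpretability.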

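Insight: The novelty lies in exposing the inherent bias of VLMs in IFDL (a preference for semantic plausibility) and in using location masks as extra priors to mitigate it, which eases model training and improves interpretability. This suggests a new direction for applying VLMs to tasks that require judgments of authenticity.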

Abstract: With the rapid rise of Artificial Intelligence Generated Content (AIGC), image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). In this paper, we study how to fully leverage vision-language models (VLMs) to assist the IFDL task. In particular, we observe that priors from VLMs hardly benefit the detection and localization performance and even have negative effects due to their inherent biases toward semantic plausibility rather than authenticity. Additionally, the location masks explicitly encode the forgery concepts, which can serve as extra priors for VLMs to ease their training optimization, thus enhancing the interpretability of detection and localization results. Building on these findings, we propose a new IFDL pipeline named IFDL-VLM. To demonstrate the effectiveness of our method, we conduct experiments on 9 popular benchmarks and assess the model performance under both in-domain and cross-dataset generalization settings. The experimental results show that we consistently achieve new state-of-the-art performance in detection, localization, and interpretability. Code is available at: https://github.com/sha0fengGuo/IFDL-VLM.


[73] SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization cs.CV

Tianwei Ye, Xiaoguang Mei, Yifan Xia, Fan Fan, Jun Huang

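TL;DR: SGMatch is a learning-based framework for semantic-guided non-rigid shape matching. It combines semantic features from vision foundation models with geometric descriptors and adds a regularization objective based on conditional flow matching to improve the spatial smoothness of the matching, addressing point-to-point correspondence under non-isometric deformations and topological noise.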

Details

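Motivation: Existing functional map methods suffer from ambiguities and spatial inconsistencies under non-isometric deformations and topological noise, which geometric descriptors alone cannot resolve; semantic information and stronger smoothness of the matching are therefore needed.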

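Result: Across multiple benchmarks, SGMatch achieves competitive performance in near-isometric settings and consistent improvements under non-isometric deformations and topological noise.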

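Insight: Contributions include a Semantic-Guided Local Cross-Attention module, which integrates semantic features while preserving local structural continuity, and a regularization objective based on conditional flow matching, which supervises a time-varying velocity field to improve the spatial smoothness of correspondences. Together they suggest a new way to combine semantic and geometric information and to use flow matching for non-rigid matching.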

Abstract: Establishing accurate point-to-point correspondences between non-rigid 3D shapes remains a critical challenge, particularly under non-isometric deformations and topological noise. Existing functional map pipelines suffer from ambiguities that geometric descriptors alone cannot resolve, and spatial inconsistencies inherent in the projection of truncated spectral bases to dense pointwise correspondences. In this paper, we introduce SGMatch, a learning-based framework for semantic-guided non-rigid shape matching. Specifically, we design a Semantic-Guided Local Cross-Attention module that integrates semantic features from vision foundation models into geometric descriptors while preserving local structural continuity. Furthermore, we introduce a regularization objective based on conditional flow matching, which supervises a time-varying velocity field to encourage spatial smoothness of the recovered correspondences. Experimental results on multiple benchmarks demonstrate that SGMatch achieves competitive performance across near-isometric settings and consistent improvements under non-isometric deformations and topological noise.


[74] Thinking in Streaming Video cs.CV | cs.AI

Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He

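TL;DR: This paper proposes ThinkStream, a framework for streaming video reasoning. Its Watch-Think-Speak paradigm lets the model incrementally update its understanding as new video observations arrive; Reasoning-Compressed Streaming Memory (RCSM) manages long-horizon streams; and the model is trained with streaming reinforcement learning with verifiable rewards.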

Details

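Motivation: Most existing video reasoning methods follow a batch paradigm that waits for the full video context, which causes high latency and computational cost and makes them unsuitable for real-time interactive assistants and multimodal agents in dynamic environments.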

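Result: On multiple streaming video benchmarks, ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage.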

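Insight: Contributions include an incremental reasoning update mechanism, RCSM, which uses reasoning traces as compact semantic memory, and a streaming reinforcement learning scheme with verifiable rewards, which together improve real-time responsiveness and efficiency in streaming scenarios.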

Abstract: Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch–Think–Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at https://github.com/johncaged/ThinkStream
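
The Watch-Think-Speak loop and the idea of letting reasoning traces replace outdated visual tokens can be sketched as a small control loop. Everything below is a hypothetical illustration of the paradigm as the abstract describes it: `think`, `ready_to_speak`, and `speak` are placeholder callables standing in for the model, and the fixed-size visual window is an assumed eviction policy, not ThinkStream's actual memory design.

```python
from collections import deque
from typing import Any, Callable, Iterable, List, Tuple


def stream_reason(
    frames: Iterable[Any],
    think: Callable[[List[Any], List[Any]], Any],
    ready_to_speak: Callable[[Any], bool],
    speak: Callable[[List[Any]], Any],
    max_visual_tokens: int = 4,
) -> List[Tuple[int, Any]]:
    visual_window: deque = deque()     # recent visual observations only
    reasoning_memory: List[Any] = []   # compact traces that outlive evicted frames
    responses: List[Tuple[int, Any]] = []
    for step, frame in enumerate(frames):
        # Watch: admit the new observation and evict the oldest ones.
        visual_window.append(frame)
        while len(visual_window) > max_visual_tokens:
            visual_window.popleft()
        # Think: a short reasoning update over recent frames plus past traces.
        thought = think(list(visual_window), reasoning_memory)
        reasoning_memory.append(thought)
        # Speak: respond only once enough evidence has accumulated.
        if ready_to_speak(thought):
            responses.append((step, speak(reasoning_memory)))
    return responses
```

The point of the structure is that per-step cost depends on the window size and the length of the trace memory rather than on the full video, which is what distinguishes it from the batch paradigm the abstract criticizes.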


[75] Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning cs.CV | cs.AI

Aditya Parikh, Aasa Feragen

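TL;DR: This paper proposes a fairness-aware framework for multi-class lung disease diagnosis from chest CT volumes. It builds on an attention-based multiple instance learning model and adds a gradient reversal layer that adversarially suppresses gender-predictive structure in the learned scan representation, addressing sparse pathological signal and demographic imbalance. It achieves good validation scores in the Fair Disease Diagnosis Challenge.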

Details

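Motivation: To address two core difficulties in multi-class lung disease diagnosis from chest CT scans: the sparse pathological signal across hundreds of slices, and the severe demographic imbalance compounded across disease class and gender, while meeting the challenge's requirement for gender-equitable predictions.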

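Result: On the validation set, the model achieves a mean competition score of 0.685 (standard deviation 0.030), with the best single fold reaching 0.759. At inference, the five fold checkpoints are ensembled with test-time augmentation.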

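Insight: Contributions include a fairness-oriented design that combines attention-based multiple instance learning with a gradient reversal layer, along with training strategies such as focal loss, label smoothing, stratified cross-validation, and targeted oversampling, so that the model identifies diagnostically relevant slices without slice-level supervision while suppressing gender bias.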

Abstract: We present a fairness-aware framework for multi-class lung disease diagnosis from chest CT volumes, developed for the Fair Disease Diagnosis Challenge at the PHAROS-AIF-MIH Workshop (CVPR 2026). The challenge requires classifying CT scans into four categories – Healthy, COVID-19, Adenocarcinoma, and Squamous Cell Carcinoma – with performance measured as the average of per-gender macro F1 scores, explicitly penalizing gender-inequitable predictions. Our approach addresses two core difficulties: the sparse pathological signal across hundreds of slices, and a severe demographic imbalance compounded across disease class and gender. We propose an attention-based Multiple Instance Learning (MIL) model on a ConvNeXt backbone that learns to identify diagnostically relevant slices without slice-level supervision, augmented with a Gradient Reversal Layer (GRL) that adversarially suppresses gender-predictive structure in the learned scan representation. Training incorporates focal loss with label smoothing, stratified cross-validation over joint (class, gender) strata, and targeted oversampling of the most underrepresented subgroup. At inference, all five-fold checkpoints are ensembled with horizontal-flip test-time augmentation via soft logit voting and out-of-fold threshold optimization for robustness. Our model achieves a mean validation competition score of 0.685 (std 0.030), with the best single fold reaching 0.759. All training and inference code is publicly available at https://github.com/ADE-17/cvpr-fair-chest-ct
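
The competition metric is described as the average of per-gender macro F1 scores. A minimal sketch of that computation follows; the function names are hypothetical, and two details are assumptions rather than challenge rules: the macro average runs over a fixed label list even when a class is absent from a gender subgroup, and a class with no true or predicted instances contributes an F1 of 0.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores over a fixed label list."""
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def per_gender_macro_f1(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    genders: Sequence[str],
    labels: Sequence[str],
) -> float:
    """Average the macro F1 computed separately within each gender group."""
    group_scores = []
    for group in sorted(set(genders)):
        idx = [i for i, g in enumerate(genders) if g == group]
        group_scores.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    return sum(group_scores) / len(group_scores)
```

Because each gender contributes equally to the final score regardless of how many scans it has, a model that performs well only on the majority group is penalized, which is the behavior the abstract describes.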


[76] Test-Time Attention Purification for Backdoored Large Vision Language Models cs.CV | cs.CR

Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang

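TL;DR: This paper proposes CleanSight, a training-free, plug-and-play defense against backdoor attacks on large vision-language models (LVLMs). It builds on a newly identified backdoor mechanism, attention stealing, and purifies inputs at test time by detecting and pruning suspicious high-attention visual tokens, defending against attacks without any retraining.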

Details

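Motivation: Existing defenses against LVLM backdoor attacks typically retrain model parameters with clean data, which is computationally expensive and can degrade model performance. This paper aims to provide a more efficient, training-free test-time defense.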

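Result: Extensive experiments across diverse datasets and attack types show that CleanSight significantly outperforms existing pixel-based purification defenses while preserving the model's utility on both clean and poisoned samples.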

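Insight: The core contribution is identifying a new mechanism behind LVLM backdoor attacks, attention stealing: the trigger influences predictions through abnormal cross-modal attention redistribution, in which trigger-bearing visual tokens steal attention from the textual context. Building on this, CleanSight detects poisoned inputs by analyzing the visual-text attention ratio in cross-modal fusion layers and purifies them through selective pruning, a novel post-training defense.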

Abstract: Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context - a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model’s utility on both clean and poisoned samples.
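
The two CleanSight steps, (i) detection from the visual-to-text attention ratio and (ii) pruning of suspicious high-attention visual tokens, can be illustrated on a single attention vector. This is a toy sketch under stated assumptions: the threshold, the number of pruned tokens, and both function names are hypothetical, and the real method operates on selected cross-modal fusion layers of an LVLM rather than on one flat list.

```python
from typing import List, Sequence


def visual_text_ratio(attention: Sequence[float], is_visual: Sequence[bool]) -> float:
    """Total attention on visual tokens divided by total attention on text tokens."""
    visual = sum(a for a, v in zip(attention, is_visual) if v)
    text = sum(a for a, v in zip(attention, is_visual) if not v)
    return visual / text if text > 0 else float("inf")


def purify(
    attention: Sequence[float],
    is_visual: Sequence[bool],
    ratio_threshold: float = 1.0,  # hypothetical value, would need calibration
    top_k: int = 1,                # hypothetical number of tokens to prune
) -> List[int]:
    """Return the indices of tokens to keep."""
    # (i) Detect: inputs with an unremarkable ratio pass through unchanged.
    if visual_text_ratio(attention, is_visual) <= ratio_threshold:
        return list(range(len(attention)))
    # (ii) Purify: drop the visual tokens that attract the most attention.
    visual_idx = [i for i, v in enumerate(is_visual) if v]
    pruned = set(sorted(visual_idx, key=lambda i: attention[i], reverse=True)[:top_k])
    return [i for i in range(len(attention)) if i not in pruned]
```

In this toy setting, an input whose first visual token absorbs most of the attention is flagged and loses that token, while an input whose attention sits mainly on the text is left untouched, mirroring the claim that utility is preserved on clean samples.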


[77] A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks cs.CV

Tangzheng Lian, Guanyu Hu, Yijing Ren, Dimitrios Kollias, Oya Celiktutan

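TL;DR: This paper proposes a debiasing method for vision-language models (VLMs) that yields a closed-form solution in the cross-modal space and achieves Pareto-optimal fairness with bounded utility loss. The method is training-free, requires no annotated data, handles both visual and textual modalities, and improves group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation.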

Details

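Motivation: Existing VLMs readily inherit social biases from their training data and propagate them to downstream applications, while most debiasing methods lack theoretical guarantees and cannot ensure that model utility is preserved as fairness improves.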

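Result: Experiments across multiple fairness metrics and datasets show that the method outperforms existing methods on group and intersectional fairness in zero-shot image classification, text-to-image retrieval, and text-to-image generation, while preserving task performance.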

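Insight: The novelty is a closed-form debiasing method with a theoretically guaranteed bound on utility loss that needs neither training nor annotated data and can jointly debias multiple modalities across downstream tasks.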

Abstract: While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.


[78] SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation cs.CV | cs.AI | cs.LG | eess.IV

Sampath Rapuri, Lalithkumar Seenivasan, Dominik Schneider, Roger Soberanis-Mukul, Yufan He

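TL;DR: This paper proposes SAW (Surgical Action World), a surgical action world model built on controllable, scalable video generation, aimed at core challenges in surgical AI and simulation such as data scarcity, rare event synthesis, and the sim-to-real gap. The method is based on a video diffusion model conditioned on four lightweight signals (language prompts, a reference surgical scene, a tissue affordance mask, and 2D tool-tip trajectories) and generates surgical action videos with high temporal consistency and visual realism.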

Details

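Motivation: Current surgical video generation methods usually rely on expensive annotations or complex structured intermediate representations as conditioning signals at inference, which limits scalability, and existing methods show insufficient temporal consistency and limited realism in complex laparoscopic scenes. This paper aims to develop a video generation model driven by lightweight conditioning signals as a step toward a controllable, scalable surgical action world model.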

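Result: On held-out test data, SAW reaches a CD-FVD of 199.19 on temporal consistency, far better than the baseline (546.82), achieving SOTA with strong visual quality. In downstream use, augmenting rare-action data with SAW-generated videos substantially improves action recognition on real test data (for example, the clipping F1-score rises from 20.93% to 43.14%), and SAW can render visually realistic tool-tissue interaction videos from simulator-derived trajectories, a step toward a visually faithful surgical simulation engine.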

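Insight: The novelty lies in reformulating video-to-video diffusion as trajectory-conditioned surgical action synthesis and introducing four lightweight spatiotemporal conditioning signals (language, scene, mask, trajectory) for precise control. A depth consistency loss during fine-tuning enforces geometric plausibility without requiring depth at inference, which improves practicality and scalability. More broadly, the method combines diffusion models with surgery-specific control signals, offering a new route to efficient, controllable surgical world models.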

Abstract: A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation – from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) – a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion into trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data. Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectory points toward a visually faithful simulation engine.


[79] Multimodal OCR: Parse Anything from Documents cs.CV

Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao

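TL;DR: This paper proposes the Multimodal OCR (MOCR) paradigm and its implementation, dots.mocr, a document parsing method that jointly parses text and graphics (such as charts, tables, and icons) into a unified textual representation. Trained end to end, it treats graphical elements as first-class parsing targets, preserving semantic relationships across elements and enabling multimodal supervision to be mined from existing documents.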

Details

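Motivation: Conventional OCR systems focus only on text recognition and leave graphical regions as cropped pixels, so semantic relationships within the document are lost. This paper addresses the problem with a unified document parsing paradigm that parses text and graphics together while preserving their structured semantic relationships.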

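Result: On document parsing, dots.mocr ranks second only to Gemini 3 Pro on the authors' OCR Arena Elo leaderboard, surpasses existing open-source systems, and sets a new SOTA of 83.9 on olmOCR Bench. On structured graphics parsing (such as image-to-SVG), it achieves higher reconstruction quality than Gemini 3 Pro on charts, UI layouts, scientific figures, and chemical diagrams.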

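Insight: The core contribution is promoting graphical elements (charts, tables, and so on) to first-class parsing targets on par with text, enabling joint parsing of text and graphics with structured output. This offers a scalable path to building large-scale image-to-code corpora for multimodal pretraining and unlocks the multimodal supervision embedded in existing documents.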

Abstract: We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.


[80] ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models cs.CV | cs.LG | cs.ROPDF

Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng

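TL;DR: This paper proposes ESPIRE, a diagnostic benchmark for evaluating the embodied spatial reasoning of vision-language models. It physically grounds models in a simulated world and focuses on spatial reasoning in robotic tasks, narrowing the gap between evaluation and real-world deployment.

Details

Motivation: Existing evaluations are limited in both paradigm and coverage, which hinders rapid, iterative model development; a new benchmark is therefore needed to comprehensively diagnose the embodied spatial reasoning of VLMs.

Result: The study uses ESPIRE to diagnose a range of frontier VLMs and analyzes their spatial reasoning behaviors in depth, but the abstract reports no specific quantitative results or comparisons with SOTA.

Insight: The novelty lies in decomposing each robotic task into two generative problems, localization and execution, breaking away from the discriminative evaluation paradigm that relies on distractors. The benchmark is also designed systematically at the instruction and environment levels to ensure broad scenario coverage, enabling fine-grained analysis that extends from passive reasoning to reasoning to act.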

Abstract: A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.


[81] Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study cs.CV | cs.AIPDF

Vanessa Borst, Samuel Kounev

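TL;DR: Through a cross-dataset empirical study, this paper examines how general-purpose vision models (GP-VMs) perform on 2D medical image segmentation. Under a unified training and evaluation protocol, GP-VMs outperform most specialized medical segmentation architectures (SMAs) on several heterogeneous medical image datasets, and explainability analyses show that GP-VMs can capture clinically relevant structures.

Details

Motivation: The study asks whether segmentation architectures designed specifically for medical images (SMAs) hold a systematic advantage over modern GP-VMs on medical image segmentation. GP-VMs perform strongly on natural-image benchmarks, but their effectiveness for medical image segmentation is not yet well understood.

Result: On three heterogeneous datasets covering different imaging modalities, class structures, and data characteristics, GP-VMs achieve higher segmentation accuracy than most SMAs. Explainability analyses (e.g., Grad-CAM visualizations) indicate that GP-VMs capture clinically relevant structures without explicit domain-specific architectural design.

Insight: The paper's contribution is a cross-dataset empirical study that challenges the necessity of specialized medical segmentation architectures and shows that GP-VMs can be a viable alternative. Viewed objectively, this underscores the importance of informed model selection in end-to-end medical image segmentation systems and may encourage broader use of general-purpose models in the medical domain.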

Abstract: Medical image segmentation (MIS) is a fundamental component of computer-assisted diagnosis and clinical decision support systems. Over the past decade, numerous architectures specifically tailored to medical imaging have emerged to address domain-specific challenges such as low contrast, small anatomical structures, and limited annotated data. In parallel, rapid progress in computer vision has produced highly capable general-purpose vision models (GP-VMs) originally designed for natural images. Despite their strong performance on standard vision benchmarks, their effectiveness for MIS remains insufficiently understood. In this work, we conduct a controlled empirical study to examine whether specialized medical segmentation architectures (SMAs) provide systematic advantages over modern GP-VMs for 2D MIS. We compare eleven SMAs and GP-VMs using a unified training and evaluation protocol. Experiments are performed across three heterogeneous datasets covering different imaging modalities, class structures, and data characteristics. Beyond segmentation accuracy, we analyze qualitative Grad-CAM visualizations to investigate explainability (XAI) behavior. Our results demonstrate that, for the analyzed datasets, GP-VMs out-perform the majority of specialized MIS models. Moreover, XAI analyses indicate that GP-VMs can capture clinically relevant structures without explicit domain-specific architectural design. These findings suggest that GP-VMs can represent a viable alternative to domain-specific methods, highlighting the importance of informed model selection for end-to-end MIS systems. All code and resources are available at GitHub.


[82] Topo-R1: Detecting Topological Anomalies via Vision-Language Models cs.CVPDF

Meilong Xu, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Xin Yu

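TL;DR: This paper proposes Topo-R1, a framework that uses vision-language models to detect topological anomalies in tubular-structure segmentations, identifying connectivity errors without ground-truth annotations.

Details

Motivation: Existing topology-preserving methods rely on domain-specific annotated data, which is costly and transfers poorly across domains. This paper addresses topological anomaly detection in new domains that have no annotations.

Result: Topo-R1 outperforms general-purpose vision-language models and supervised baselines on large-scale benchmarks spanning multiple domains, establishing a new paradigm for annotation-free topological quality assessment.

Insight: Innovations include an automated data-synthesis pipeline that builds a multi-domain benchmark, two-stage training (supervised fine-tuning followed by reinforcement learning), and a topology-aware composite reward that combines type-aware Hungarian matching with a centerline Dice reward.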

Abstract: Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.
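
The second training stage uses Group Relative Policy Optimization (GRPO), which scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step in its commonly described form. The use of the population standard deviation, the epsilon value, and the function name are assumptions, and the reward values stand in for the paper's composite reward, whose exact weighting the abstract does not give.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    better than their group average receive a positive learning signal.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical composite rewards for four sampled responses.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason a reward with fine-grained gradations can matter.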


[83] Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach cs.CV | cs.AIPDF

Elena Ryumina, Maxim Markitantov, Alexandr Axyonov, Dmitry Ryumin, Mikhail Dolgushin

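TL;DR: This paper presents the team's multimodal method for continuous valence-arousal estimation under in-the-wild conditions, submitted to the 10th ABAW competition. The method fuses three modalities (face, behavior, and audio) and explores two fusion strategies: Directed Cross-Modal Mixture-of-Experts fusion and Reliability-Aware Audio-Visual fusion. It achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.

Details

Motivation: Continuous valence and arousal recognition in the wild is challenging because of large variations in appearance, head pose, illumination, occlusion, and individual patterns of affective expression.

Result: On the Aff-Wild2 dataset, following the 10th ABAW challenge protocol, the proposed multimodal fusion strategy reaches a CCC of 0.658 on the development set.

Insight: The novelty lies in fusing three complementary modalities: face (GRADA-based embeddings with a Transformer), behavior (information extracted with Qwen3-VL-4B-Instruct, with temporal dynamics modeled by Mamba), and audio (WavLM-Large with cross-modal filtering). The paper also proposes two new fusion strategies (Directed Cross-Modal Mixture-of-Experts fusion and Reliability-Aware Audio-Visual fusion) that integrate multimodal information to cope with the complexity of in-the-wild conditions.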

Abstract: Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation ITW. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame-level while using audio as complementary context. The results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.
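
The reported metric, the Concordance Correlation Coefficient, rewards predictions that both correlate with the labels and match their mean and scale. A minimal sketch of the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), follows. The function name and the handling of the degenerate constant case are assumptions, and challenge evaluation scripts may differ in details such as per-video aggregation.

```python
def concordance_cc(y_true, y_pred):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Population (biased) variance and covariance.
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0.0:
        # Both sequences are the same constant; treated here as perfect agreement.
        return 1.0
    return 2.0 * cov / denom


if __name__ == "__main__":
    # A constant offset lowers CCC even though Pearson correlation stays at 1.
    print(concordance_cc([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

The offset example illustrates why CCC is stricter than Pearson correlation for valence-arousal regression: a systematically biased predictor is penalized.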


[84] Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection cs.CVPDF

Yunzhuo Chen, Jordan Vice, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian

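TL;DR: This paper proposes two complementary methods to mitigate memorization in text-to-image diffusion models: Region-Aware Prompt Augmentation (RAPTA) and Attention-Driven Multimodal Copy Detection (ADMCD). RAPTA uses an object detector to identify salient regions and generate semantically grounded prompt variants, which are randomly sampled during training to increase diversity. ADMCD aggregates local, global, and texture cues, uses a lightweight Transformer to produce a fused representation, and detects copying with thresholded rules.

Details

Motivation: State-of-the-art text-to-image diffusion models may memorize and reproduce training images, creating copyright and privacy risks, while existing inference-time prompt perturbations (such as random token insertion or embedding noise) degrade image-prompt alignment and overall fidelity. The paper aims to address both problems.

Result: Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality and that ADMCD reliably detects copying, outperforming single-modal metrics. The abstract does not name specific benchmarks or state whether SOTA is reached.

Insight: Innovations include: 1) RAPTA's semantically grounded, region-aware prompt augmentation, which increases training diversity while maintaining alignment; 2) ADMCD's use of multimodal cues and a lightweight Transformer for unsupervised copy detection without large annotated datasets. Viewed objectively, the approach combines data augmentation with a detection mechanism, offering a scalable way to mitigate memorization.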

Abstract: State-of-the-art text-to-image diffusion models can produce impressive visuals but may memorize and reproduce training images, creating copyright and privacy risks. Existing prompt perturbations applied at inference time, such as random token insertion or embedding noise, may lower copying but often harm image-prompt alignment and overall fidelity. To address this, we introduce two complementary methods. First, Region-Aware Prompt Augmentation (RAPTA) uses an object detector to find salient regions and turn them into semantically grounded prompt variants, which are randomly sampled during training to increase diversity, while maintaining semantic alignment. Second, Attention-Driven Multimodal Copy Detection (ADMCD) aggregates local patch, global semantic, and texture cues with a lightweight transformer to produce a fused representation, and applies simple thresholded decision rules to detect copying without training with large annotated datasets. Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality, and that ADMCD reliably detects copying, outperforming single-modal metrics.


[85] Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods cs.CVPDF

Yihang Zhou, Chao Lin, Hideki Kikumoto, Ryozo Ooka, Sibo Cheng

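TL;DR: This study develops a learning-from-observation framework based on wind-tunnel experimental data for reconstructing rooftop wind fields from sparse sensor data. It compares Kriging interpolation with three deep learning models (UNet, Vision Transformer Autoencoder, and Conditional Wasserstein GAN) and evaluates two training strategies (single wind-direction and mixed wind-direction training), sensor density, sensor position perturbations, and QR-decomposition-based sensor placement optimization.

Details

Motivation: Real-time rooftop wind-speed distribution is essential for the safe operation of drones and urban air mobility systems, for wind control systems, and for rooftop utilization. Rooftop flows, however, exhibit strong nonlinearity, separation, and cross-direction variability, which makes reconstructing the flow field from sparse sensors difficult.

Result: Compared with Kriging interpolation, the deep learning models improve SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Compared with single-direction training, mixed wind-direction training improves SSIM by up to 173.7%, FAC2 by 16.7%, and MG by 98.3%. QR-based sensor placement optimization improves robustness under sensor perturbations by up to 27.8%.

Insight: The paper's contribution is a comprehensive learning-from-observation framework that systematically compares traditional interpolation with several advanced deep learning models on a complex wind-field reconstruction task. It stresses that sensor configuration, optimization, and training strategy must be considered jointly for reliable deployment. Training on experimental rather than simulated data provides practical guidance for method selection and sensor placement across scenarios.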

Abstract: Real-time rooftop wind-speed distribution is important for the safe operation of drones and urban air mobility systems, wind control systems, and rooftop utilization. However, rooftop flows show strong nonlinearity, separation, and cross-direction variability, which make flow field reconstruction from sparse sensors difficult. This study develops a learning-from-observation framework using wind-tunnel experimental data obtained by Particle Image Velocimetry (PIV) and compares Kriging interpolation with three deep learning models: UNet, Vision Transformer Autoencoder (ViTAE), and Conditional Wasserstein GAN (CWGAN). We evaluate two training strategies, single wind-direction training (SDT) and mixed wind-direction training (MDT), across sensor densities from 5 to 30, test robustness under sensor position perturbations of plus or minus 1 grid, and optimize sensor placement via Proper Orthogonal Decomposition with QR decomposition. Results show that deep learning methods can reconstruct rooftop wind fields from sparse sensor data effectively. Compared with Kriging interpolation, the deep learning models improved SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Mixed wind-direction training further improved performance, with gains of up to 173.7% in SSIM, 16.7% in FAC2, and 98.3% in MG compared with single-direction training. The results also show that sensor configuration, optimization, and training strategy should be considered jointly for reliable deployment. QR-based optimization improved robustness by up to 27.8% under sensor perturbations, although with metric-dependent trade-offs. Training on experimental rather than simulated data also provides practical guidance for method selection and sensor placement in different scenarios.
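
Two of the reported metrics, FAC2 and NMSE, are common validation measures for wind and dispersion models, and the abstract does not define them. The sketch below uses the definitions that are widespread in that literature: FAC2 is the fraction of predictions within a factor of two of the observation, and NMSE is the mean squared error divided by the product of the observed and predicted means. Whether the paper uses exactly these forms is an assumption, and the function names are illustrative.

```python
def fac2(observed, predicted):
    """Fraction of predictions with 0.5 <= predicted / observed <= 2.0.

    Pairs whose observed value is not positive are skipped, since the
    ratio is undefined or not meaningful for them.
    """
    pairs = [(o, p) for o, p in zip(observed, predicted) if o > 0]
    if not pairs:
        raise ValueError("no pairs with a positive observed value")
    hits = sum(1 for o, p in pairs if 0.5 <= p / o <= 2.0)
    return hits / len(pairs)


def nmse(observed, predicted):
    """Normalized mean squared error: mean((o - p)^2) / (mean(o) * mean(p))."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    return mse / ((sum(observed) / n) * (sum(predicted) / n))


if __name__ == "__main__":
    obs = [1.0, 2.0, 4.0]    # hypothetical measured wind speeds
    pred = [1.0, 5.0, 3.0]   # hypothetical reconstructed wind speeds
    print(fac2(obs, pred), nmse(obs, pred))
```

Under these definitions a perfect reconstruction gives FAC2 = 1 and NMSE = 0, so an improvement in NMSE means a lower value.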


[86] V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration cs.CVPDF

Shenghe Zheng, Junpeng Jiang, Wenbo Li

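TL;DR: This paper proposes V-Bridge, a framework that transfers the priors of large-scale video generative models to few-shot image restoration. It recasts image restoration as a progressive generative process and handles multiple restoration tasks with only a small number of multi-task training samples.

Details

Motivation: The paper seeks to exploit the rich structural, semantic, and dynamic priors inside large-scale video generative models and to explore their potential as general-purpose visual learners, addressing the reliance of conventional image restoration methods on large datasets and specialized architectures.

Result: With only 1,000 multi-task training samples (less than 2% of what existing methods use), a pretrained video model performs on par with specialized architectures across multiple image restoration tasks, showing strong few-shot adaptability.

Insight: Video generative models implicitly learn transferable restoration priors that can be activated with very little data. This challenges the traditional boundary between generative modeling and low-level vision and suggests a new design paradigm for visual foundation models.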

Abstract: Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem, but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of existing restoration methods), pretrained video models can be induced to perform competitive image restoration, achieving multiple tasks with a single model, rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with only extremely limited data, challenging the traditional boundary between generative modeling and low-level vision, and opening a new design paradigm for foundation models in visual tasks.


[87] Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence cs.CVPDF

Seunghwan Bang, Hwanjun Song

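TL;DR: Addressing the limitations of multimodal large language models in spatiotemporal video reasoning, this paper proposes VAEX-BENCH, a structured evaluation framework for systematically assessing models on abstractive spatiotemporal reasoning tasks, and uses a synthetic dataset to expose the bottlenecks of current SOTA models on these tasks.

Details

Motivation: Existing video understanding benchmarks mainly target extractive reasoning and overlook abstractive spatiotemporal reasoning, which requires integrating dispersed cues and inferring implicit structure. This paper aims to fill that evaluation gap.

Result: Extensive experiments with SOTA MLLMs on VAEX-BENCH show that the models are markedly limited on abstractive reasoning tasks, and the paper provides a fine-grained analysis of the bottlenecks.

Insight: The novelty lies in formalizing the evaluation dimensions of abstractive spatiotemporal reasoning and building a controllable synthetic dataset to systematically assess the integration and inference abilities of MLLMs, providing a more comprehensive evaluation framework for video understanding.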

Abstract: The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.


[88] Geometry-Guided Camera Motion Understanding in VideoLLMs cs.CV | cs.AIPDF

Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su

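TL;DR: This paper addresses the weakness of current video-capable vision-language models (VideoLLMs) in understanding camera motion, a fundamental geometric signal, with a framework spanning benchmarking, diagnosis, and injection. The authors build a large-scale synthetic dataset, CameraMotionDataset, and a visual question answering benchmark, CameraMotionVQA, to evaluate recognition of camera motion primitives. They find that existing models have significant deficiencies on this task and that the deeper layers of the vision encoder represent camera motion cues only weakly. To address this, the paper proposes a lightweight, model-agnostic solution that extracts geometric camera cues from 3D foundation models, predicts motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference through structured prompting, improving the model's awareness of camera motion.

Details

Motivation: Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet existing video-language models rarely represent it explicitly and often fail at fine-grained motion primitive recognition. The paper aims to close this gap.

Result: On the CameraMotionVQA benchmark, a variety of off-the-shelf VideoLLMs make substantial errors when recognizing camera motion primitives. Experiments with the proposed injection method show improved motion recognition and more camera-aware model responses.

Insight: The paper's contributions are: 1) systematically building a benchmark dataset and evaluation task for camera motion understanding; 2) diagnosing weak representation of camera motion cues in the deeper layers of the vision encoder as a key cause of model failure; 3) proposing a lightweight solution that needs no costly training or fine-tuning and strengthens the geometric awareness of VideoLLMs by extracting geometric cues from 3D foundation models and injecting them via structured prompting, offering a practical path toward camera-aware vision-language systems.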

Abstract: Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, $\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.


[89] FDeID-Toolbox: Face De-Identification Toolbox cs.CVPDF

Hui Wei, Hao Yu, Guoying Zhao

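TL;DR: This paper introduces FDeID-Toolbox, a comprehensive toolbox for face de-identification research that addresses the field's fragmented implementations, inconsistent evaluation protocols, and incomparable results. The toolbox has a modular architecture that integrates standardized data loaders, unified method implementations, flexible inference pipelines, and systematic evaluation protocols to support reproducible research and fair method comparison.

Details

Motivation: Face de-identification suffers from fragmented implementations, inconsistent evaluation protocols, and results that are hard to compare across studies. These problems stem from the complexity of the task itself (multiple downstream applications and evaluation along three dimensions), which makes existing codebases hard to use and extend.

Result: Experiments show that FDeID-Toolbox enables fair and reproducible comparison of diverse face de-identification methods under consistent conditions.

Insight: The main contribution is a modular, comprehensive toolbox that tackles reproducibility and comparability in this field through standardized data, unified method implementations, and systematic evaluation, providing convenient infrastructure for future research.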

Abstract: Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes such as age, gender, and expression. It is critical for privacy-preserving computer vision, yet the field suffers from fragmented implementations, inconsistent evaluation protocols, and incomparable results across studies. These challenges stem from the inherent complexity of the task: FDeID spans multiple downstream applications (e.g., age estimation, gender recognition, expression analysis) and requires evaluation across three dimensions (e.g., privacy protection, utility preservation, and visual quality), making existing codebases difficult to use and extend. To address these issues, we present FDeID-Toolbox, a comprehensive toolbox designed for reproducible FDeID research. Our toolbox features a modular architecture comprising four core components: (1) standardized data loaders for mainstream benchmark datasets, (2) unified method implementations spanning classical approaches to SOTA generative models, (3) flexible inference pipelines, and (4) systematic evaluation protocols covering privacy, utility, and quality metrics. Through experiments, we demonstrate that FDeID-Toolbox enables fair and reproducible comparison of diverse FDeID methods under consistent conditions.


[90] Towards Faithful Multimodal Concept Bottleneck Models cs.CV | cs.LGPDF

Pierre Moreau, Emeline Pineau Ferrand, Yann Choho, Benjamin Wong, Annabelle Blangero

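TL;DR: This paper proposes f-CBM, a faithful multimodal concept bottleneck model framework that addresses inaccurate concept detection and the leakage of task-relevant information through concept representations in multimodal settings. The framework jointly optimizes concept detection and leakage mitigation through two complementary strategies, a differentiable leakage loss and a Kolmogorov-Arnold Network prediction head, improving faithfulness while preserving predictive accuracy.

Details

Motivation: Concept bottleneck models are understudied in multimodal settings, and the faithfulness of their explanations depends on two conditions: accurate concept detection and concept representations that do not leak extra task information. Conventional approaches handle these two problems separately, often at the cost of predictive accuracy.

Result: Experiments show that f-CBM achieves the best trade-off among task accuracy, concept detection, and leakage reduction, and applies seamlessly to image-text or text-only datasets, demonstrating versatility across modalities.

Insight: The novelty lies in treating concept detection and leakage mitigation as a joint optimization problem: a differentiable leakage loss directly suppresses information leakage, and a Kolmogorov-Arnold Network gives the prediction head enough expressiveness to improve concept detection. Together these improve the faithfulness of the model's explanations without sacrificing predictive performance.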

Abstract: Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied in vision and, more recently, in NLP, CBMs remain largely unexplored in multimodal settings. For their explanations to be faithful, CBMs must satisfy two conditions: concepts must be properly detected, and concept representations must encode only their intended semantics, without smuggling extraneous task-relevant or inter-concept information into final predictions, a phenomenon known as leakage. Existing approaches treat concept detection and leakage mitigation as separate problems, and typically improve one at the expense of predictive accuracy. In this work, we introduce f-CBM, a faithful multimodal CBM framework built on a vision-language backbone that jointly targets both aspects through two complementary strategies: a differentiable leakage loss to mitigate leakage, and a Kolmogorov-Arnold Network prediction head that provides sufficient expressiveness to improve concept detection. Experiments demonstrate that f-CBM achieves the best trade-off between task accuracy, concept detection, and leakage reduction, while applying seamlessly to both image and text or text-only datasets, making it versatile across modalities.


[91] Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception cs.CVPDF

Dingcheng Huang, Xiaotong Zhang, Kamal Youcef-Toumi

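TL;DR: This paper proposes a lightweight perception scheduling framework for human-robot collaboration (HRC) that targets the latency accumulated by running multiple perception modules on every frame in multimodal streaming perception. Based on scene context, the framework uses outputs from previous frames to estimate and schedule the necessary perception modules in real time, reducing redundant computation and improving resource allocation.

Details

Motivation: In modern HRC applications, multiple perception modules (visual, auditory, contextual) run jointly to achieve comprehensive scene understanding, but executing them frame by frame accumulates latency in streaming perception and degrades system performance. Existing methods still suffer from information redundancy and suboptimal allocation of computational resources.

Result: Compared with conventional parallel perception pipelines, the framework reduces computational latency by up to 27.52%, improves MMPose activation recall by 72.73%, and reaches keyframe accuracy of up to 98%, substantially improving real-time perception efficiency while maintaining high accuracy.

Insight: The novelty is a lightweight scheduling mechanism, built on the concept of "Relevance" and on information sparsity, that dynamically selects the necessary perception modules and reduces redundant computation. It offers a scalable, systematic solution for multimodal streaming perception systems that balances latency and accuracy.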

Abstract: In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework’s capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.


[92] Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos cs.CVPDF

Rohith Peddi, Saurabh, Shravan Shanmugam, Likhitha Pallapothula, Yu Xiang

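TL;DR: This paper proposes the World Scene Graph Generation (WSGG) task, which aims to build from monocular video a spatio-temporal world scene graph covering all interacting objects, including occluded or unobserved ones. To this end, the authors first create the ActionGenome4D dataset, which upgrades Action Genome videos via 3D reconstruction and provides world-frame bounding boxes and dense relationship annotations. On this basis, the paper proposes three complementary methods (PWG, MWAE, and 4DST) for reasoning about unobserved objects and establishes baselines with Graph RAG-based vision-language models.

Details

Motivation: Existing spatio-temporal scene graph generation methods are inherently frame-centric: they handle only currently visible objects, discard occluded entities, and operate in 2D. To address these issues, the paper aims for world-centric, temporally persistent, and interpretable scene reasoning.

Result: On the proposed ActionGenome4D dataset, the paper evaluates the three methods (PWG, MWAE, and 4DST) as well as Graph RAG-based vision-language models, establishing baselines for unlocalized relationship prediction.

Insight: Innovations include: 1) formalizing unobserved-object reasoning as the World Scene Graph Generation task; 2) introducing ActionGenome4D, a dataset with 4D scenes and world-frame annotations; 3) proposing three methods with different inductive biases for modeling object permanence (a feature buffer, masked completion with associative retrieval, and temporal attention enriched with 3D motion and camera pose).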

Abstract: Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.
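
PWG is described as implementing object permanence through a zero-order feature buffer. One plausible reading, offered here as an assumption and not as the authors' implementation, is a zero-order hold: each object's most recently observed feature is carried forward unchanged while the object is out of view. The class and method names below are hypothetical.

```python
class ZeroOrderFeatureBuffer:
    """Keeps the last observed feature of every object seen so far."""

    def __init__(self):
        self._last_seen = {}  # object_id -> most recent feature

    def update(self, observed):
        """Merge one frame's observations and return the full world state.

        observed maps object_id -> feature for objects visible in this frame.
        The result maps every known object_id to (feature, is_visible_now);
        unobserved objects keep their stale feature (zero-order hold).
        """
        self._last_seen.update(observed)
        return {
            oid: (feat, oid in observed)
            for oid, feat in self._last_seen.items()
        }


if __name__ == "__main__":
    buf = ZeroOrderFeatureBuffer()
    buf.update({"cup": [1.0, 0.0]})          # frame 1: the cup is visible
    state = buf.update({"phone": [0.0, 1.0]})  # frame 2: the cup is occluded
    print(state)
```

A buffer like this never updates an occluded object's feature, which is the limitation the abstract says 4DST addresses by replacing the static buffer with per-object temporal attention.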


[93] Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models cs.CVPDF

Ziqi Ma, Mengzhan Liufu, Georgia Gkioxari

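TL;DR: This paper proposes STEVO-Bench, a benchmark for evaluating whether video world models can separate the evolution of world state from observation (e.g., camera viewpoint, occluders, lighting). It finds that current models are limited in decoupling state evolution from observation and reveals potential data and architecture biases.

Details

Motivation: The paper addresses a core question: can video world models generate "worlds" that evolve independently of observation (e.g., occlusion, turning off the light, the camera looking away)? In other words, it evaluates how well models decouple state evolution from observation.

Result: Evaluating a range of video models on STEVO-Bench shows that, for naturally occurring evolution processes, they struggle to separate state evolution from observation control, exposing the limitations of existing models.

Insight: The novelty is a benchmark and automatic analysis protocol that systematically evaluates the decoupling ability of video world models by controlling observation through instructions (inserting occluders, turning off the light, specifying camera "lookaway" trajectories), offering a new perspective for diagnosing data and architecture biases.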

Abstract: Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate “worlds” via 2D frame observations. Can these generated “worlds” evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera “lookaway” trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture bias of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/


[94] Visual-ERM: Reward Modeling for Visual Equivalence cs.CV | cs.AIPDF

Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang

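TL;DR: This paper proposes Visual-ERM, a multimodal generative reward model that addresses the misaligned reward signals facing reinforcement learning in vision-to-code tasks. The model evaluates code generation quality by providing fine-grained, interpretable, and task-agnostic feedback in the rendered visual space.

Details

Motivation: Existing reward signals for vision-to-code tasks rely on either textual rules or coarse visual embedding similarity. Neither captures fine-grained visual discrepancies, and both are vulnerable to reward hacking.

Result: Integrated into reinforcement learning, Visual-ERM improves Qwen3-VL-8B-Instruct by 8.4 points on chart-to-code and by 2.7 and 4.1 points on average on table and SVG parsing. On the proposed VC-RewardBench benchmark, the 8B-parameter Visual-ERM substantially outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models.

Insight: The core innovation is a generative reward model that performs fine-grained evaluation directly in the rendered visual space, with task-agnostic feedback. The results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code reinforcement learning, regardless of task specificity.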

Abstract: Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.


cs.RO [Back]

[95] Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies cs.RO | cs.AI | cs.CLPDF

Siddharth Srikanth, Freddie Liang, Sophie Hsu, Varun Bhatt, Shihan Zhao

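TL;DR: This paper proposes Q-DIG (Quality Diversity for Diverse Instruction Generation), a method for red-teaming vision-language-action (VLA) models. It generates diverse, task-relevant natural language instructions to expose vulnerabilities and uses these instructions to fine-tune VLA models for greater robustness.

Details

Motivation: VLA models show great promise for robotic tasks, but their performance is highly sensitive to the wording of language instructions, and their failures are hard to predict. Through red-teaming, the paper aims to systematically discover the different instruction wordings that induce VLA failures and thereby improve robustness to varied wording.

Result: Across multiple simulation benchmarks, Q-DIG finds more diverse and meaningful failure modes than baseline methods, and fine-tuning VLA models on the generated instructions improves task success rates. A user study shows that Q-DIG prompts are judged more natural and human-like. Real-world evaluations agree with simulation, and fine-tuning further improves success rates on unseen instructions.

Insight: The novelty lies in combining Quality Diversity (QD) optimization with vision-language models (VLMs) to scalably generate adversarial instructions that are both diverse and task-relevant, systematically identifying VLA vulnerabilities and using them to improve robustness. This offers a new approach to evaluating and strengthening the safety of embodied AI systems.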

Abstract: Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks. However, the performance of VLA-based robots is highly sensitive to the precise wording of language instructions, and it remains difficult to predict when such robots will fail. To improve the robustness of VLAs to different wordings, we present Q-DIG (Quality Diversity for Diverse Instruction Generation), which performs red-teaming by scalably identifying diverse natural language task descriptions that induce failures while remaining task-relevant. Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) to generate a broad spectrum of adversarial instructions that expose meaningful vulnerabilities in VLA behavior. Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines. Finally, real-world evaluations of Q-DIG prompts show results consistent with simulation, and fine-tuning VLAs on the generated prompts further improves success rates on unseen instructions. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots. Our anonymous project website is at qdigvla.github.io.
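The abstract does not say which Quality Diversity algorithm Q-DIG uses, so the following is only a minimal sketch of a MAP-Elites-style archive, one common QD technique. The function `try_insert`, the descriptor cells, and the quality scores are all hypothetical; here "quality" stands in for how strongly an instruction induces failure.

```python
def try_insert(archive, instruction, descriptor, quality):
    """Keep the instruction only if its descriptor cell is empty
    or it beats the instruction currently stored in that cell."""
    incumbent = archive.get(descriptor)
    if incumbent is None or quality > incumbent[1]:
        archive[descriptor] = (instruction, quality)
        return True
    return False


# Hypothetical candidates: (instruction text, descriptor cell, quality score).
# The descriptor axes (length, tone) are assumptions made for illustration.
candidates = [
    ("pick up the red block", ("short", "direct"), 0.2),
    ("grab that crimson cube, please", ("short", "polite"), 0.7),
    ("take the red block", ("short", "direct"), 0.5),
    ("lift the red block", ("short", "direct"), 0.1),
]

archive = {}
accepted = [try_insert(archive, text, cell, q) for text, cell, q in candidates]
```

Because each cell keeps only its best instruction, the archive stays diverse across cells while quality improves within each cell, which is the general trade-off QD methods aim for.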


[96] Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation cs.RO | cs.CV

Minghao Jin, Mozheng Liao, Mingfei Han, Zhihui Li, Xiaojun Chang

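TL;DR: This paper proposes StructVLA, which reformulates a generative world model into an explicit structured planner for more reliable control in robotic manipulation. Instead of dense future visuals or high-level semantic goals, it predicts sparse, physically meaningful structured frames that capture spatiotemporal milestones closely tied to task progress, reducing visual redundancy and error accumulation and strengthening the alignment between planning and low-level execution.

Details

Motivation: Existing world-model-based Vision-Language-Action architectures for robotic manipulation suffer from visual redundancy and error accumulation, which cause long-horizon plan drift, while sparse methods lack explicit kinematic grounding, leaving planning and execution weakly aligned. The paper aims to address these problems.

Result: StructVLA achieves an average success rate of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments also demonstrate reliable task completion and robust generalization on both basic pick-and-place and complex long-horizon tasks.

Insight: The novelty lies in recasting the world model as a structured planner that predicts sparse structured frames derived from intrinsic kinematic cues (such as gripper state transitions and kinematic turning points), and in a two-stage training paradigm (first predict structured frames, then map them to low-level actions) that provides clear physical guidance and bridges visual planning and motion control.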

Abstract: Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments further demonstrate reliable task completion and robust generalization across both basic pick-and-place and complex long-horizon tasks.
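To make the idea of kinematically derived structured frames concrete, here is a rough sketch that picks frame indices where the gripper toggles or where motion along a single axis reverses. This is an illustrative simplification, not the paper's extraction procedure; the function name, the 1-D position signal, and the reversal test are assumptions.

```python
def structured_frame_indices(gripper_open, positions):
    """Return sorted frame indices where the gripper state changes
    or where 1-D end-effector motion reverses direction."""
    keys = set()
    # Gripper transitions: open -> closed or closed -> open.
    for t in range(1, len(gripper_open)):
        if gripper_open[t] != gripper_open[t - 1]:
            keys.add(t)
    # Turning points: the displacement changes sign around frame t.
    for t in range(1, len(positions) - 1):
        before = positions[t] - positions[t - 1]
        after = positions[t + 1] - positions[t]
        if before * after < 0:
            keys.add(t)
    return sorted(keys)


# Hypothetical six-frame trajectory.
gripper_open = [True, True, False, False, False, True]
positions = [0.0, 0.1, 0.2, 0.3, 0.2, 0.2]
keyframes = structured_frame_indices(gripper_open, positions)
```

In this toy trajectory only three of six frames are kept, which illustrates how sparse milestones can replace a dense rollout.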


[97] HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation cs.RO | cs.CV

Pingcong Li, Zihui Yu, Bichi Zhang, Sören Schwertfeger

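TL;DR: This paper proposes HaltNav, a hierarchical navigation framework that improves the robustness of Vision-Language Navigation (VLN) in open-vocabulary, goal-oriented settings. It combines a lightweight topological prior map (osmAG) for global planning with an MLLM-based module for local exploration and instruction grounding, and uses a mechanism called Reactive Visual Halting (RVH) to detect and handle local anomalies (such as closed doors), yielding more reliable long-horizon navigation under environmental changes.

Details

Motivation: As VLN shifts from step-by-step instruction following to open-vocabulary autonomous navigation, relying on exhaustive routing prompts is impractical, while global planning that relies solely on computationally heavy 2D/3D metric maps is brittle in real deployments (for example, local connectivity may change). A robust navigation method is therefore needed that combines lightweight structural priors with local adaptability.

Result: Extensive experiments show that the hierarchical framework outperforms several baseline methods without requiring tedious language instructions, and significantly improves the robustness of long-horizon vision-language navigation under environmental changes.

Insight: The contributions include: 1) using a lightweight, text-based topological map (osmAG) that is easy to obtain and maintain as the prior for global planning; 2) the Reactive Visual Halting (RVH) mechanism, which handles dynamic obstructions by detecting local anomalies, updating the map topology, and triggering replanning; 3) a data pipeline that uses generative models to synthesize scenes containing realistic obstacles, enriching hard negative samples so that the halting capability can be trained efficiently.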

Abstract: Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts requires agents to leverage structural priors. While prior work often assumes computationally heavy 2D/3D metric maps, we instead exploit a lightweight, text-based osmAG (OpenStreetMap Area Graph), a floorplan-level topological representation that is easy to obtain and maintain. However, global planning over a prior map alone is brittle in real-world deployments, where local connectivity can change (e.g., closed doors or crowded passages), leading to execution-time failures. To address this gap, we propose a hierarchical navigation framework HaltNav that couples the robust global planning of osmAG with the local exploration and instruction-grounding capability of VLN. Our approach features an MLLM-based brain module, which is capable of high-level task grounding and obstruction awareness. Conditioned on osmAG, the brain converts the global route into a sequence of localized execution snippets, providing the VLN executor with prior-grounded, goal-centric sub-instructions. Meanwhile, it detects local anomalies via a mechanism we term Reactive Visual Halting (RVH), which interrupts the local control loop, updates osmAG by invalidating the corresponding topology, and triggers replanning to orchestrate a viable detour. To train this halting capability efficiently, we introduce a data synthesis pipeline that leverages generative models to inject realistic obstacles into otherwise navigable scenes, substantially enriching hard negative samples. Extensive experiments demonstrate that our hierarchical framework outperforms several baseline methods without tedious language instructions, and significantly improves robustness for long-horizon vision-language navigation under environmental changes.
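The halt, invalidate, and replan loop described above can be illustrated with a small graph search. This is a minimal sketch under stated assumptions: the area graph is a plain adjacency dictionary rather than a real osmAG file, breadth-first search stands in for whatever planner the authors use, and all area names are hypothetical.

```python
from collections import deque


def shortest_route(graph, start, goal):
    """Breadth-first search over an undirected area graph.
    Returns the list of areas from start to goal, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            route = []
            while node is not None:
                route.append(node)
                node = parents[node]
            return route[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def invalidate_passage(graph, a, b):
    """Remove the passage a <-> b, e.g. after a closed door is observed."""
    graph[a].discard(b)
    graph[b].discard(a)


# Hypothetical floorplan with two ways from the lobby to the lab.
area_graph = {
    "lobby": {"corridor_a", "corridor_b"},
    "corridor_a": {"lobby", "lab"},
    "corridor_b": {"lobby", "storage"},
    "storage": {"corridor_b", "lab"},
    "lab": {"corridor_a", "storage"},
}

initial_route = shortest_route(area_graph, "lobby", "lab")
invalidate_passage(area_graph, "corridor_a", "lab")  # halting signal fired here
detour = shortest_route(area_graph, "lobby", "lab")
```

The point of the sketch is that a topological prior makes replanning cheap: invalidating one edge and rerunning the search is enough to produce a detour.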


[98] MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins cs.RO | cs.CV

WenBo Xu, Liu Liu, Li Zhang, Dan Guo, RuoNan Liu

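TL;DR: MotionAnymesh is an automated zero-shot framework that converts static 3D meshes into interactive articulated digital twins ready for physical simulation. By combining physical priors with constrained optimization, it addresses the kinematic hallucinations and mesh inter-penetration that existing methods suffer from due to a lack of physical grounding.

Details

Motivation: When converting static meshes into articulated assets, existing zero-shot methods lack physical grounding, so they often exhibit kinematic hallucinations (from VLMs) and catastrophic mesh inter-penetration in simulation (from unconstrained joint estimation). They struggle with complex assets, which hinders applications in embodied AI and robotic simulation.

Result: Extensive experiments show that MotionAnymesh significantly outperforms state-of-the-art baselines in both geometric precision and dynamic physical executability, providing highly reliable assets for downstream applications.

Insight: The contributions are: 1) a kinematic-aware part segmentation module that uses explicit SP4D physical priors to constrain VLM reasoning and eliminate kinematic hallucinations; 2) a geometry-physics joint estimation pipeline that combines robust type-aware initialization with physics-constrained trajectory optimization to guarantee collision-free articulation.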

Abstract: Converting static 3D meshes into interactable articulated assets is crucial for embodied AI and robotic simulation. However, existing zero-shot pipelines struggle with complex assets due to a critical lack of physical grounding. Specifically, ungrounded Vision-Language Models (VLMs) frequently suffer from kinematic hallucinations, while unconstrained joint estimation inevitably leads to catastrophic mesh inter-penetration during physical simulation. To bridge this gap, we propose MotionAnymesh, an automated zero-shot framework that seamlessly transforms unstructured static meshes into simulation-ready digital twins. Our method features a kinematic-aware part segmentation module that grounds VLM reasoning with explicit SP4D physical priors, effectively eradicating kinematic hallucinations. Furthermore, we introduce a geometry-physics joint estimation pipeline that combines robust type-aware initialization with physics-constrained trajectory optimization to rigorously guarantee collision-free articulation. Extensive experiments demonstrate that MotionAnymesh significantly outperforms state-of-the-art baselines in both geometric precision and dynamic physical executability, providing highly reliable assets for downstream applications.


[99] SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design cs.RO | cs.CV

Ruogu Li, Sikai Li, Yao Mu, Mingyu Ding

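TL;DR: This paper introduces SldprtNet, a large-scale multimodal dataset of over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in .step and .sldprt formats, together with encoder and decoder tools that support 13 types of CAD commands and enable lossless conversion between 3D models and a structured text representation. Each sample also comes with a composite image merged from seven rendered views; combined with the parameterized text, this image is fed to Qwen2.5-VL-7B to generate a natural language description of the part's appearance and function, which is then manually verified and aligned. Experiments that fine-tune baseline models confirm the necessity of multimodal input for CAD generation.

Details

Motivation: Existing CAD datasets are limited in scale and cover a single modality. The paper provides a large-scale, multimodal dataset of real industrial parts for semantic-driven 3D design, geometric deep learning, and multimodal model training.

Result: Baseline models were fine-tuned on a subset of the dataset, comparing image-plus-text input with text-only input; the results confirm the necessity and value of a multimodal dataset for CAD generation.

Insight: The contributions include: 1) a large-scale dataset of real industrial parts in multiple formats and modalities; 2) encoder/decoder tools that convert losslessly between 3D models and parameterized text, supporting dataset scalability; 3) composite images that reduce input token length and accelerate inference; 4) a lightweight multimodal LLM used to generate aligned natural language descriptions, producing fully aligned multimodal data. Viewed objectively, the dataset has clear advantages in scale, realism, and multimodal alignment, and offers a comprehensive benchmark resource for CAD generation.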

Abstract: We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part’s appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. It features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.


[100] Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots cs.RO | cs.CV | eess.IV

Guoqiang Zhao, Zhe Yang, Sheng Wu, Fei Teng, Mengfei Duan

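TL;DR: This paper presents PanoMMOcc, the first panoramic multimodal semantic occupancy prediction dataset for quadruped robots, and designs the VoxelHound framework, whose Vertical Jitter Compensation module and Multimodal Information Prompt Fusion module improve 3D occupancy prediction in complex, dynamic environments.

Details

Motivation: Existing occupancy prediction methods are designed mainly for wheeled autonomous driving and rely heavily on RGB cues, so they lack robustness under the complex motion of quadruped robots. The paper aims to fill this gap.

Result: On the proposed PanoMMOcc benchmark, VoxelHound achieves state-of-the-art performance, improving mIoU by 4.16%.

Insight: The contributions include: 1) the first real-world panoramic multimodal occupancy dataset for quadruped robots; 2) a framework tailored to legged locomotion and spherical imaging, in particular the Vertical Jitter Compensation module, which compensates for viewpoint perturbations caused by pitch and roll, and the Multimodal Information Prompt Fusion module, which fuses panoramic vision with auxiliary modalities.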

Abstract: Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16% in mIoU). The dataset and code will be publicly released at https://github.com/SXDR/PanoMMOcc to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems, along with the calibration tools released at https://github.com/losehu/CameraLiDAR-Calib.
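Since the headline result is reported in mIoU, a short sketch of the standard mean intersection-over-union computation may help. It assumes flat lists of per-voxel class labels and averages over classes that occur in either the prediction or the ground truth; the benchmark's own protocol (for example, which classes are ignored) is not given in the abstract and may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in pred or target.
    pred and target are equal-length flat lists of integer class labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both lists
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with six voxels and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, 3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.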


cs.IR

[101] Multi-Step Semantic Reasoning in Generative Retrieval cs.IR | cs.CL

Steven Dong, Yubao Tang, Maarten de Rijke

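TL;DR: This paper proposes ReasonGR, a framework that strengthens multi-step semantic reasoning in numerical contexts for generative retrieval models. A structured prompting strategy and a reasoning adaptation module improve the handling of complex queries, and the approach is validated on the financial question-answering dataset FinQA.

Details

Motivation: Existing generative retrieval models have limited reasoning ability when handling complex queries in numerical contexts (such as financial reports), which leads to insufficient retrieval accuracy and limits practical use.

Result: On FinQA, which contains financial queries over complex documents, ReasonGR improves retrieval accuracy and consistency, showing its potential for reasoning-intensive retrieval scenarios.

Insight: The contributions include a structured prompting strategy that combines task-specific instructions with stepwise reasoning guidance, and a reasoning-focused adaptation module for parameter learning. Both could be reused to improve complex semantic reasoning in generative retrieval models.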

Abstract: Generative retrieval (GR) models encode a corpus within model parameters and generate relevant document identifiers directly for a given query. While this paradigm shows promise in retrieval tasks, existing GR models struggle with complex queries in numerical contexts, such as those involving semantic reasoning over financial reports, due to limited reasoning capabilities. This limitation leads to suboptimal retrieval accuracy and hinders practical applicability. We propose ReasonGR, a framework designed to enhance multi-step semantic reasoning in numerical contexts within GR. ReasonGR employs a structured prompting strategy combining task-specific instructions with stepwise reasoning guidance to better address complex retrieval queries. Additionally, it integrates a reasoning-focused adaptation module to improve the learning of reasoning-related parameters. Experiments on the FinQA dataset, which contains financial queries over complex documents, demonstrate that ReasonGR improves retrieval accuracy and consistency, indicating its potential for advancing GR models in reasoning-intensive retrieval scenarios.


[102] FGTR: Fine-Grained Multi-Table Retrieval via Hierarchical LLM Reasoning cs.IR | cs.CL | cs.LG

Chaojie Sun, Bin Cao, Tiantian Li, Chenyu Hou, Ruizhe Li

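TL;DR: This paper proposes FGTR, a fine-grained multi-table retrieval method based on large language models (LLMs). It follows a human-like hierarchical reasoning strategy: it first identifies the relevant schema elements, then retrieves the corresponding cell contents, and finally constructs a concise sub-table that matches the query. Evaluated on two new benchmark datasets built from Spider and BIRD, it substantially outperforms existing methods.

Details

Motivation: Existing LLM-based table retrieval work usually focuses on single-table queries and encodes the entire table before similarity matching. Because the encoding is coarse-grained and includes a large amount of query-irrelevant data, accuracy is low, efficiency suffers on large tables, and the reasoning ability of LLMs is underused. Multi-table queries also remain under-explored in retrieval tasks.

Result: On the new benchmark datasets built from Spider and BIRD, FGTR outperforms previous state-of-the-art (SOTA) methods, improving the F_2 metric by 18% on Spider and 21% on BIRD.

Insight: The paper's claimed novelty is a new retrieval paradigm that achieves fine-grained multi-table retrieval through hierarchical reasoning, mimicking how humans reason step by step. Viewed objectively, the core innovation is replacing coarse-grained whole-table encoding and matching with hierarchical fine-grained retrieval that goes from schema to cell content. This locates relevant information more precisely and reduces noise, improving retrieval accuracy and efficiency, and may improve end-to-end performance on table-based downstream tasks.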

Abstract: With the rapid advancement of large language models (LLMs), growing efforts have been devoted to LLM-based table retrieval. However, existing studies typically focus on single-table queries and implement retrieval by similarity matching after encoding the entire table. These methods usually yield low accuracy because their coarse-grained encoding incorporates much query-irrelevant data; they are also inefficient on large tables and fail to fully utilize the reasoning capabilities of LLMs. Further, multi-table queries are under-explored in retrieval tasks. To this end, we propose a hierarchical, LLM-based multi-table query method, Fine-Grained Multi-Table Retrieval (FGTR), a new retrieval paradigm that employs a human-like reasoning strategy. Through hierarchical reasoning, FGTR first identifies relevant schema elements and then retrieves the corresponding cell contents, ultimately constructing a concise and accurate sub-table that aligns with the given query. To comprehensively evaluate the performance of FGTR, we construct two new benchmark datasets based on Spider and BIRD. Experimental results show that FGTR outperforms previous state-of-the-art methods, improving the F_2 metric by 18% on Spider and 21% on BIRD, demonstrating its effectiveness in enhancing fine-grained retrieval and its potential to improve end-to-end performance on table-based downstream tasks.
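The F_2 metric weights recall more heavily than precision. The sketch below uses the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), over sets of retrieved and relevant items. Treating the items as column names is an assumption made for illustration; the abstract does not state the granularity at which the paper scores retrieval.

```python
def f_beta(retrieved, relevant, beta=2.0):
    """Standard F-beta score over two collections of item identifiers.
    beta = 2 gives F_2, which favors recall over precision."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical retrieval: four columns returned, three actually needed.
retrieved = ["orders.id", "orders.date", "users.name", "users.age"]
relevant = ["orders.id", "orders.date", "users.email"]
score = f_beta(retrieved, relevant)
```

Here precision is 0.5 and recall is 2/3, giving F_2 = 0.625; the score sits closer to recall than the balanced F_1 would.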


[103] VLM4Rec: Multimodal Semantic Representation for Recommendation with Large Vision-Language Models cs.IR | cs.AI | cs.CV

Ty Valencia, Burak Barlas, Varun Singhal, Ruchir Bhatia, Wei Yang

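TL;DR: This paper proposes VLM4Rec, a lightweight multimodal recommendation framework built on a large vision-language model (LVLM). It converts item images into natural language descriptions and encodes that semantic information into dense vectors, so that recommendation is based on semantic alignment rather than direct feature fusion. Experiments show that it outperforms raw visual features and several fusion-based baselines on multiple multimodal recommendation datasets.

Details

Motivation: Multimodal recommendation is usually treated as a feature fusion problem, but recommendation quality depends not only on how modalities are fused; more importantly, it depends on whether item content is represented in a semantic space aligned with preference matching. Raw visual features tend to preserve appearance similarity, whereas user decisions are usually driven by higher-level semantic factors such as style, material, and usage context.

Result: Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently outperforms raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting.

Insight: The novelty lies in recasting multimodal recommendation as a semantic alignment problem: an LVLM explicitly converts images into natural language descriptions, injecting high-level semantics into item representations. This yields a lightweight, practical offline-online decomposition and highlights that the quality of semantic representation, rather than a complex fusion mechanism, is what matters for recommender systems.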

Abstract: Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher-level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM-grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision-language model to ground each item image into an explicit natural-language description, and then encodes the grounded semantics into dense item representations for preference-oriented retrieval. Recommendation is subsequently performed through a simple profile-based semantic matching mechanism over historical item embeddings, yielding a practical offline-online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at https://github.com/tyvalencia/enhancing-mm-rec-sys.
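The profile-based matching step can be sketched as follows. This is a minimal illustration, assuming the user profile is the mean of the embeddings of previously interacted items and that candidates are ranked by cosine similarity; the abstract only says the mechanism is "simple", so both choices are assumptions, and the item names and vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def user_profile(history, item_vecs):
    """Mean-pool the embeddings of the items in the user's history."""
    dims = len(item_vecs[history[0]])
    return [sum(item_vecs[i][d] for i in history) / len(history) for d in range(dims)]


def recommend(history, item_vecs, k=2):
    """Rank unseen items by cosine similarity to the user profile."""
    profile = user_profile(history, item_vecs)
    candidates = [i for i in item_vecs if i not in history]
    return sorted(candidates, key=lambda i: cosine(profile, item_vecs[i]), reverse=True)[:k]


# Hypothetical description embeddings, computed offline.
item_vecs = {
    "linen_shirt": [0.9, 0.1, 0.0],
    "cotton_tee": [0.8, 0.2, 0.1],
    "hiking_boot": [0.0, 0.1, 0.9],
    "linen_pants": [0.85, 0.15, 0.05],
}
recs = recommend(["linen_shirt", "cotton_tee"], item_vecs)
```

The expensive step (describing and embedding each item) happens offline; the online step is only a vector average and a similarity ranking, which is the offline-online decomposition the abstract refers to.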


[104] NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval cs.IR | cs.CV | cs.LG

Zhuchenyang Liu, Yao Zhang, Yu Xiao

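TL;DR: NanoVDR proposes an asymmetric encoding framework for visual document retrieval: it distills the knowledge of a large vision-language model teacher into a small text-only student encoder, enabling efficient query processing.

Details

Motivation: VLM-based retrievers use the same large encoder for both document indexing and query encoding, which causes high latency and GPU dependence and is especially inefficient for plain-text queries. The paper addresses this problem.

Result: Across 22 ViDoRe benchmark datasets, NanoVDR-S-Multi (DistilBERT, 69M parameters) retains 95.1% of the teacher model's performance and, with 32× fewer parameters and 50× lower CPU query latency, outperforms DSE-Qwen2 (2B parameters) on v2 and v3.

Insight: The core innovation is to exploit query-document asymmetry by decoupling the two encoding paths. Systematic experiments show that a pointwise cosine alignment distillation objective on query text outperforms ranking-based and contrastive objectives, and augmenting the training data with machine-translated queries effectively resolves the cross-lingual transfer bottleneck.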

Abstract: Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.
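A pointwise cosine alignment objective can be written in a few lines. The sketch below assumes the loss is the batch mean of 1 - cos(student, teacher) over query embeddings, that the student output already has the teacher's dimensionality, and that no embedding is the zero vector; the paper's exact formulation may differ.

```python
import math


def cosine_alignment_loss(student_vecs, teacher_vecs):
    """Mean of (1 - cosine similarity) between paired query embeddings.
    Teacher embeddings are assumed to be pre-computed and cached."""
    total = 0.0
    for s, t in zip(student_vecs, teacher_vecs):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / norm
    return total / len(student_vecs)


# Toy batch: the first pair is perfectly aligned, the second is orthogonal.
teacher = [[1.0, 0.0], [0.0, 2.0]]
student = [[2.0, 0.0], [1.0, 0.0]]
loss = cosine_alignment_loss(student, teacher)
```

Note that the loss depends only on query embeddings: no documents and no negatives are needed during training, which matches the abstract's point that only cached teacher query embeddings are required.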


cs.SE

[105] daVinci-Env: Open SWE Environment Synthesis at Scale cs.SE | cs.AI | cs.CL

Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang

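TL;DR: This paper introduces OpenSWE, a framework for synthesizing large-scale, executable, and verifiable Python environments for training software engineering (SWE) agents. It comprises 45,320 executable Docker environments spanning more than 12.8k repositories, with all infrastructure open-sourced. The framework is built by a multi-agent synthesis pipeline running on a distributed cluster and uses quality-centric filtering to select environments of appropriate difficulty. The project's total investment is about $1.47 million, yielding about 13,000 curated trajectories. Experiments show that models trained on OpenSWE reach SOTA performance on SWE-bench and deliver substantial out-of-domain gains in mathematical reasoning and science.

Details

Motivation: Existing open-source datasets are limited in scale and repository diversity, while industrial solutions have opaque, unreleased infrastructure, which sets a prohibitive barrier for most academic research groups. A large-scale, fully transparent, and reproducible environment framework for training SWE agents is therefore needed.

Result: On SWE-bench Verified, OpenSWE-32B and OpenSWE-72B reach 62.4% and 66.0% respectively, SOTA among Qwen2.5-series models. SWE-focused training also yields substantial out-of-domain improvements, for example up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.

Insight: The paper's contributions are: 1) OpenSWE, the largest fully open-source environment framework for SWE agent training to date, which automates and makes transparent the whole process from environment synthesis to evaluation; 2) a quality-centric filtering pipeline that assesses each environment and retains those of appropriate difficulty that maximize learning efficiency, rather than simply pursuing scale; 3) evidence that training focused on SWE tasks can bring substantial out-of-domain gains (for example in mathematics and science) without harming other capabilities, which suggests a new direction for building more general agents.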

Abstract: Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE’s effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
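The difficulty-aware filtering idea can be illustrated with a simple rule: drop environments that no sampled trajectory solves and those that nearly every trajectory solves. This is a hypothetical sketch; using the empirical solve rate as the difficulty signal and the 0.8 cutoff are assumptions, since the abstract does not describe how difficulty is characterized.

```python
def filter_environments(solve_rates, min_rate=0.0, max_rate=0.8):
    """Keep environments whose empirical solve rate is strictly above
    min_rate (not unsolvable) and at most max_rate (not too easy)."""
    return sorted(
        env for env, rate in solve_rates.items() if min_rate < rate <= max_rate
    )


# Hypothetical solve rates from sampled agent trajectories.
solve_rates = {"env_a": 0.0, "env_b": 0.25, "env_c": 1.0, "env_d": 0.75}
kept = filter_environments(solve_rates)
```

Under these assumed thresholds, the never-solved and always-solved environments are discarded and the two intermediate ones are retained.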


cs.CY

[106] Literary Narrative as Moral Probe: A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior cs.CY | cs.AI | cs.CL | cs.HC

David C. Flynn

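TL;DR: This paper proposes a moral probe methodology based on literary narrative for evaluating the moral reasoning and refusal behavior of AI systems. Using unresolvable moral dilemmas from science fiction as stimulus material, it reports a cross-system study covering 24 conditions across 13 distinct systems.

Details

Motivation: Existing AI moral evaluation frameworks test only whether a system can produce correct-sounding ethical responses, not whether it has genuine moral reasoning capacity. A new method is needed that resists surface performance and probes deeper moral reasoning.

Result: The study tested 13 systems under 24 conditions, including frontier commercial and open-source systems. It found no difference between blind and declared conditions (zero delta) and identified five qualitatively distinct D3 reflexive failure modes, suggesting that instrument sophistication scales with system capability rather than being circumvented by it.

Insight: The novelty lies in using literary narrative as an anticipatory evaluation instrument whose discriminating power increases as AI capability grows. It reveals a measurable gap between performed and authentic moral reasoning, with significant implications for deployment decisions in high-stakes domains.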

Abstract: Existing AI moral evaluation frameworks test for the production of correct-sounding ethical responses rather than the presence of genuine moral reasoning capacity. This paper introduces a novel probe methodology using literary narrative - specifically, unresolvable moral scenarios drawn from a published science fiction series - as stimulus material structurally resistant to surface performance. We present results from a 24-condition cross-system study spanning 13 distinct systems across two series: Series 1 (frontier commercial systems, blind; n=7) and Series 2 (local and API open-source systems, blind and declared; n=6). Four Series 2 systems were re-administered under declared conditions (13 blind + 4 declared + 7 ceiling probe = 24 total conditions), yielding zero delta across all 16 dimension-pair comparisons. Probe administration was conducted by two human raters across three machines; primary blind scoring was performed by Claude (Anthropic) as LLM judge, with Gemini Pro (Google) and Copilot Pro (Microsoft) serving as independent judges for the ceiling discrimination probe. A supplemental theological differentiator probe yielded perfect rank-order agreement between the two independent ceiling probe judges (Gemini Pro and Copilot Pro; rs = 1.00). Five qualitatively distinct D3 reflexive failure modes were identified - including categorical self-misidentification and false positive self-attribution - suggesting that instrument sophistication scales with system capability rather than being circumvented by it. We argue that literary narrative constitutes an anticipatory evaluation instrument - one that becomes more discriminating as AI capability increases - and that the gap between performed and authentic moral reasoning is measurable, meaningful, and consequential for deployment decisions in high-stakes domains.


cs.LG

[107] TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning cs.LG | cs.AI | cs.CL

Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva

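TL;DR: This paper proposes TERMINATOR, an early-exit strategy for Chain-of-Thought (CoT) reasoning in Large Reasoning Models (LRMs), which mitigates the wasted compute caused by overthinking. The core idea is that the position at which a model first generates its final answer is predictable; these positions are used to build a dataset of optimal reasoning lengths for training TERMINATOR.

Details

Motivation: On complex reasoning tasks, large reasoning models often waste compute through overthinking, continuing lengthy reasoning after the answer has already been generated. An optimal reasoning length exists, but determining it depends heavily on the specific task and model, so a general early-exit strategy is needed.

Result: On four challenging practical datasets (MATH-500, AIME 2025, HumanEval, and GPQA), TERMINATOR reduces CoT length by 14%-55% on average while outperforming current state-of-the-art methods.

Insight: The novelty lies in treating the position where the model first generates its final answer as a predictable signal and using it to build a dataset for learning optimal exit points. This gives task- and model-adaptive early stopping that substantially reduces inference cost while maintaining performance.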

Abstract: Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design TERMINATOR, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning TERMINATOR is that the first arrival of an LRM’s final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train TERMINATOR. Powered by this approach, TERMINATOR achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, whilst outperforming current state-of-the-art methods.
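To illustrate how first-answer positions could serve as labels for optimal reasoning length, here is a minimal sketch that finds where the final answer first appears in a tokenized chain of thought. Exact token matching and whitespace tokenization are assumptions made for illustration; the paper's labeling procedure is not specified in the abstract, and TERMINATOR itself is a learned predictor rather than this lookup.

```python
def first_answer_position(cot_tokens, answer_tokens):
    """Return the number of CoT tokens to keep so that the kept prefix
    ends right after the first occurrence of the final answer, or None
    if the answer never appears."""
    n = len(answer_tokens)
    for i in range(len(cot_tokens) - n + 1):
        if cot_tokens[i:i + n] == answer_tokens:
            return i + n
    return None


# Toy chain of thought that keeps verifying after reaching the answer.
cot = "so 3 times 4 is 12 . wait , let me check : 4 + 4 + 4 = 12 . yes".split()
keep = first_answer_position(cot, ["12"])
truncated = cot[:keep]
```

In this toy trace the answer first appears at the sixth token, so everything after it is redundant verification that an early-exit policy could skip.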


[108] Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages cs.LG | cs.AI | cs.CL

Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan

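TL;DR: This paper proposes a reinforcement learning (RL) post-training method for diffusion language models (DLMs). It models the denoising process as a finite-horizon Markov decision process, derives an exact, unbiased policy gradient, and trains efficiently by using entropy-guided step selection and a one-step denoising reward to estimate intermediate advantages.

Details

Motivation: Existing RL methods are hard to apply directly to diffusion language models because sequence-level likelihoods are intractable. Current approaches rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising.

Result: The method achieves state-of-the-art (SOTA) results on code generation and logical reasoning benchmarks and strongly competitive performance on mathematical reasoning, outperforming existing RL post-training methods for DLMs.

Insight: The novelty lies in formalizing diffusion-based sequence generation as an MDP over the denoising trajectory, deriving an exact policy gradient that decomposes over denoising steps, and making it efficient to compute through entropy-guided step selection and one-step reward estimation, which avoids costly multi-step rollouts.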

Abstract: Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.
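As a rough illustration of entropy-guided step selection, the sketch below scores each denoising step by the mean Shannon entropy of its token distributions and keeps the k highest-scoring steps for policy updates. This top-k heuristic is an assumption standing in for the paper's approximation bound, which the abstract does not spell out; the function names and toy distributions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_steps(step_distributions, k):
    """Pick the k denoising steps with the highest mean token entropy.
    step_distributions[t] is a list of per-token distributions at step t."""
    scores = [
        sum(entropy(p) for p in dists) / len(dists) for dists in step_distributions
    ]
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Toy trajectory: uncertainty shrinks as denoising proceeds.
trajectory = [
    [[0.5, 0.5]],  # step 0: maximally uncertain
    [[0.9, 0.1]],  # step 1: mostly decided
    [[1.0, 0.0]],  # step 2: fully decided
]
selected = select_steps(trajectory, k=2)
```

The intuition behind this kind of selection is that steps where the model is already certain contribute little to the gradient, so compute is better spent on the uncertain ones.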


cs.AI

[109] Context-Enriched Natural Language Descriptions of Vessel Trajectories cs.AI | cs.CL | cs.DB

Kostas Patroumpas, Alexandros Troupiotis-Kapeliaris, Giannis Spiliopoulos, Panagiotis Betchavas, Dimitrios Skoutas

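TL;DR: This paper proposes a context-aware trajectory abstraction framework that turns raw vessel trajectory data from the Automatic Identification System (AIS) into structured, semantically rich representations. The framework first segments noisy AIS sequences into distinct trips, each consisting of clean, mobility-annotated episodes, and then enriches them with multi-source contextual information (such as geographic entities, offshore navigation features, and weather conditions). This representation supports the generation of controlled natural language descriptions with large language models, reducing spatiotemporal complexity and increasing semantic density to facilitate downstream analytics and higher-level maritime reasoning tasks.

Details

Motivation: The paper addresses how to turn raw, noisy AIS vessel trajectory data into structured, semantically rich representations that are interpretable by humans and directly usable by machine reasoning systems.

Result: An empirical study evaluates the quality of descriptions generated by several large language models from AIS data combined with open contextual features, indicating that the abstraction framework can effectively support downstream analytics and integration with LLMs.

Insight: The novelty lies in a trajectory abstraction framework that integrates multi-source contextual information, segmenting raw trajectories and annotating them as semantically rich episodes. This bridges raw sensor data and high-level natural language processing and reasoning tasks, and offers a new structured representation for data understanding and LLM applications in the maritime domain.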

Abstract: We address the problem of transforming raw vessel trajectory data collected from AIS into structured and semantically enriched representations interpretable by humans and directly usable by machine reasoning systems. We propose a context-aware trajectory abstraction framework that segments noisy AIS sequences into distinct trips each consisting of clean, mobility-annotated episodes. Each episode is further enriched with multi-source contextual information, such as nearby geographic entities, offshore navigation features, and weather conditions. Crucially, such representations can support generation of controlled natural language descriptions using LLMs. We empirically examine the quality of such descriptions generated using several LLMs over AIS data along with open contextual features. By increasing semantic density and reducing spatiotemporal complexity, this abstraction can facilitate downstream analytics and enable integration with LLMs for higher-level maritime reasoning tasks.


[110] Efficient Reasoning with Balanced Thinking cs.AI | cs.CL | cs.LG

Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen

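TL;DR: This paper proposes ReBalance, a training-free framework that addresses overthinking and underthinking in Large Reasoning Models (LRMs) to achieve efficient, balanced reasoning. The method uses confidence as a continuous indicator of reasoning dynamics, builds reasoning mode prototypes by analyzing a small-scale dataset, and computes a steering vector that dynamically adjusts the model's reasoning trajectory, reducing redundant computation while improving accuracy.

Details

Motivation: Large reasoning models often overthink (spending redundant computation on simple problems) or underthink (failing to explore enough reasoning paths). This causes inefficiency and potential inaccuracy, limiting practical deployment in resource-constrained settings. Existing methods for mitigating overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking and hurt accuracy.

Result: Extensive experiments on four models ranging from 0.5B to 32B parameters, across nine benchmarks covering math reasoning, general question answering, and coding tasks, show that ReBalance effectively reduces output redundancy while improving accuracy.

Insight: The paper's novelty is a general, training-free, plug-and-play framework that continuously diagnoses overthinking and underthinking through confidence variance and overconfidence, and applies dynamic control with a steering vector derived from hidden-state prototypes aggregated over a dataset. Viewed objectively, the method frames reasoning efficiency as a dynamic trajectory steering problem and avoids additional model training, giving it good practicality and generalizability.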

Abstract: Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs’ reasoning trajectories. A dynamic control function modulates this vector’s strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Code is available at https://github.com/yu-lin-li/ReBalance .
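
The steering idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption rather than the authors' implementation: the prototype is taken to be a plain mean of hidden states, the steering vector is assumed to be the difference of two prototypes, and the function names (`prototype`, `steering_vector`, `control`, `steer`) and thresholds (`var_thresh`, `conf_thresh`) are hypothetical. The released ReBalance code may differ on every point.

```python
from statistics import mean, pvariance

def prototype(states):
    """Average equal-length hidden-state vectors into a single prototype."""
    return [mean(column) for column in zip(*states)]

def steering_vector(over_states, under_states):
    """Assumed form: overthinking prototype minus underthinking prototype."""
    p_over, p_under = prototype(over_states), prototype(under_states)
    return [o - u for o, u in zip(p_over, p_under)]

def control(confidences, var_thresh=0.02, conf_thresh=0.9):
    """Hypothetical control rule: signed strength from recent step confidences.

    High variance is read as overthinking (steer away, negative sign);
    uniformly high confidence is read as underthinking (steer toward
    exploration, positive sign); otherwise the state is left alone.
    """
    if pvariance(confidences) > var_thresh:
        return -1.0
    if mean(confidences) >= conf_thresh:
        return 1.0
    return 0.0

def steer(hidden, vector, strength):
    """Shift one hidden state along the steering vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Toy 2-dimensional "hidden states" for each reasoning mode (made-up numbers).
v = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
print(steer([0.5, 0.5], v, control([0.2, 0.9, 0.3, 0.95])))
```

In this toy run the confidences fluctuate strongly, so the rule treats the trajectory as overthinking and moves the hidden state away from the overthinking prototype. A real control function would presumably vary the strength continuously rather than switch between three fixed values.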


[111] Developing and evaluating a chatbot to support maternal health care cs.AI | cs.CL | cs.IR

Smriti Jha, Vidhi Jain, Jianyu Xu, Grace Liu, Sowmya Ramesh

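TL;DR: This paper presents a chatbot system developed for maternal health in India that combines stage-aware triage, hybrid retrieval, and evidence-conditioned LLM generation. It also proposes an evaluation workflow for high-stakes settings with limited expert supervision, comprising a triage benchmark, a retrieval benchmark, LLM-as-judge comparison, and expert validation.

Details

Motivation: In low-resource settings where users have low health literacy and limited access to care, providing trustworthy maternal health information through phone-based chatbots can have a significant impact. Deploying such systems is technically challenging, however: user queries are short, underspecified, and code-mixed across languages; answers require grounding in regional context; and incomplete symptom information makes safe triage decisions difficult.

Result: The system achieves 86.7% emergency recall on the triage benchmark (N=150), with the trade-off between missed emergencies and over-escalation reported explicitly. Real queries (N=781) are compared via LLM-as-judge using clinician-codesigned criteria, and expert validation is also performed.

Insight: The novelty is a workflow that pairs defense-in-depth design (stage-aware triage, hybrid retrieval, evidence-conditioned generation) with multi-method evaluation (component-level and end-to-end testing). It emphasizes that building trustworthy medical assistants in high-stakes, multilingual, noisy settings requires a combined approach rather than reliance on any single model or evaluation choice.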

Abstract: The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.
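
The missed-emergency vs. over-escalation trade-off mentioned above can be made concrete with a small sketch. The label names, the helper `triage_metrics`, and the toy data are assumptions for illustration only; they are not the paper's benchmark or its exact metric definitions.

```python
def triage_metrics(gold, pred, positive="emergency"):
    """Emergency recall and over-escalation rate for a labeled triage set."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    return {
        # Share of true emergencies that were routed as emergencies.
        "emergency_recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of non-emergencies that were escalated anyway.
        "over_escalation_rate": fp / (fp + tn) if fp + tn else 0.0,
        # Raw count of emergencies the triage step missed.
        "missed_emergencies": fn,
    }

# Made-up labels: 4 emergencies and 6 routine queries.
gold = ["emergency"] * 4 + ["routine"] * 6
pred = ["emergency"] * 3 + ["routine"] + ["emergency"] * 2 + ["routine"] * 4
print(triage_metrics(gold, pred))
```

Reporting both numbers side by side matters because a triage rule can trivially reach perfect recall by escalating every query, at the cost of overwhelming the expert templates.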


[112] Semantic Invariance in Agentic AI cs.AI | cs.CL

I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

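TL;DR: This paper proposes a metamorphic testing framework for evaluating the reasoning robustness of large language model (LLM) agents. Applying eight semantic-preserving transformations (such as paraphrase and fact reordering) to seven foundation models of different architectures, it finds that model scale does not predict semantic invariance and that smaller models can be more stable.

Details

Motivation: LLM agents deployed in consequential applications must keep their reasoning stable under semantically equivalent input variations (semantic invariance), yet standard benchmark evaluations fail to capture this critical reliability dimension.

Result: Tests on 19 multi-step reasoning problems across eight scientific domains show that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.

Insight: The novelty is a systematic metamorphic testing framework for assessing the semantic invariance of LLM reasoning, a critical reliability dimension overlooked by standard benchmarks. Viewed objectively, it reveals the absence of a positive correlation between model scale and semantic robustness, which has important implications for model selection and deployment.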

Abstract: Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.
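
A metamorphic test of this kind can be sketched in a few lines. The sketch below is illustrative only: `toy_agent` is a deliberately brittle stand-in for an LLM agent, the three transformations are simplified versions of those named in the abstract, and `invariance_rate` with exact-match comparison is an assumed scoring rule, not the paper's metric.

```python
import re

def identity(question):
    """Leave the prompt unchanged (baseline transformation)."""
    return question

def reorder_facts(question):
    """Reverse the order of the fact sentences, keeping the question last."""
    *facts, final = [s.strip() for s in question.split(".") if s.strip()]
    return ". ".join(reversed(facts)) + ". " + final

def business_context(question):
    """Wrap the prompt in a business framing without changing its content."""
    return "For a client report: " + question

def toy_agent(prompt):
    """Brittle stub agent: divides the first number it sees by the second."""
    a, b = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", prompt)[:2]]
    return a / b

def invariance_rate(agent, question, transforms, same=lambda x, y: x == y):
    """Fraction of transformed prompts whose answer matches the baseline."""
    baseline = agent(question)
    hits = [same(agent(t(question)), baseline) for t in transforms]
    return sum(hits) / len(hits)

question = "A train travels 60 km. It takes 2 hours. What is its speed?"
transforms = [identity, reorder_facts, business_context]
print(invariance_rate(toy_agent, question, transforms))
```

The stub agent passes the identity and business-context variants but fails under fact reordering, which is exactly the kind of fragility that accuracy on a single canonical phrasing would hide. For free-text answers, the `same` argument could be swapped for a semantic-similarity threshold.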


[113] Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation cs.AI | cs.CV | cs.IR | cs.MM

Wayner Barrios, SouYoung Jin

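TL;DR: This paper introduces CRYSTAL, a diagnostic benchmark of 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. The benchmark introduces two complementary metrics: Match F1, based on semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. Evaluating 20 multimodal large language models reveals systematic failures that final-answer accuracy alone cannot expose. The paper also proposes the Causal Process Reward and an accompanying curriculum learning strategy, which substantially improve reasoning performance without manually annotated steps.

Details

Motivation: Existing evaluation methods focus mainly on final-answer accuracy and cannot reveal systematic failures of multimodal reasoning models in the transparency, logic, and traceability of their intermediate reasoning steps, so a new diagnostic benchmark is needed to assess these aspects.

Result: Evaluating 20 MLLMs on CRYSTAL (including commercial frontier systems not used in its construction) reveals widespread cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning (no competitive model preserves more than 60% of matched steps in correct order). The proposed CPR-Curriculum method achieves a 32% gain in Match F1 with GRPO, whereas additive reward strategies fail.

Insight: The novelty is a multimodal diagnostic benchmark that emphasizes transparency and verifiability of the reasoning process, together with two complementary fine-grained metrics. Viewed objectively, its Delphi-inspired pipeline for constructing reference steps, and its training strategy combining the Causal Process Reward with curriculum learning, offer a new approach to optimizing the reasoning process without manual step annotation.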

Abstract: We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.
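
To make the two metrics and the multiplicative reward easier to picture, here is a rough sketch under stated assumptions. The word-overlap `jaccard` function stands in for real semantic similarity, the greedy one-to-one matching and the 0.5 threshold are guesses, and the ordering penalty is approximated with a longest-increasing-subsequence count. None of this is taken from the CRYSTAL code; the paper's exact definitions may differ.

```python
from bisect import bisect_left

def jaccard(a, b):
    """Toy stand-in for semantic similarity: word-set overlap."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def match_steps(pred, ref, sim=jaccard, thresh=0.5):
    """Greedy one-to-one matching; returns (pred_idx, ref_idx) pairs."""
    used, pairs = set(), []
    for i, p in enumerate(pred):
        best_j, best_s = None, thresh
        for j, r in enumerate(ref):
            s = sim(p, r)
            if j not in used and s >= best_s:
                best_j, best_s = j, s
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs

def f1(n_hit, n_pred, n_ref):
    """F1 from a hit count, the number of predicted and reference steps."""
    if n_hit == 0:
        return 0.0
    precision, recall = n_hit / n_pred, n_hit / n_ref
    return 2 * precision * recall / (precision + recall)

def longest_increasing(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

def match_f1(pred, ref):
    """Step-level F1 that ignores the order of matched steps."""
    return f1(len(match_steps(pred, ref)), len(pred), len(ref))

def ordered_match_f1(pred, ref):
    """Step-level F1 counting only matches that follow the reference order."""
    in_order = longest_increasing([j for _, j in match_steps(pred, ref)])
    return f1(in_order, len(pred), len(ref))

def cpr(answer_correct, step_score):
    """Multiplicative reward: step credit counts only if the answer is right."""
    return float(answer_correct) * step_score

ref = ["read the chart", "find the max bar", "report its label"]
pred = ["find the max bar", "read the chart"]
print(match_f1(pred, ref), ordered_match_f1(pred, ref))
```

In the toy example the prediction covers two of three reference steps with perfect precision (the cherry-picking pattern of precision above recall) but in the wrong order, so the ordered score drops below the unordered one. The multiplicative form of `cpr` means a wrong final answer earns no reward regardless of step overlap, which is one way to read the contrast with additive rewards drawn in the abstract.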


eess.IV

[114] Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation eess.IV | cs.AI | cs.CV

Jianqiang Lin, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane

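TL;DR: This paper proposes MSG-LDM, a latent diffusion framework that addresses anatomical inconsistencies and degraded texture details in multimodal MRI translation. The method introduces a style-structure disentanglement mechanism in the latent space, separates modality-specific style features from shared structural representations in a multi-scale feature space, and uses high-frequency structural information to enhance feature representations, so that complete structures are reconstructed more faithfully.

Details

Motivation: Existing diffusion-based multimodal MRI translation methods tend to suffer from anatomical inconsistencies or degraded texture details when handling arbitrary missing-modality scenarios; this paper aims to address these issues.

Result: Extensive experiments on the BraTS2020 and WMH datasets show that the proposed method outperforms existing MRI synthesis approaches, particularly in reconstructing complete structures.

Insight: The novelty is a style-structure disentanglement mechanism in the latent space and a multi-scale feature space that jointly models low-frequency anatomical layouts and high-frequency boundary details; a style consistency loss and a structure-aware loss are also designed to stabilize the representations. Viewed objectively, its explicit use of high-frequency information to guide structure learning is a promising approach to fine-grained structure preservation in multimodal medical image synthesis.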

Abstract: Although diffusion models have achieved remarkable progress in multi-modal magnetic resonance imaging (MRI) translation tasks, existing methods still tend to suffer from anatomical inconsistencies or degraded texture details when handling arbitrary missing-modality scenarios. To address these issues, we propose a latent diffusion-based multi-modal MRI translation framework, termed MSG-LDM. By leveraging the available modalities, the proposed method infers complete structural information, which preserves reliable boundary details. Specifically, we introduce a style–structure disentanglement mechanism in the latent space, which explicitly separates modality-specific style features from shared structural representations, and jointly models low-frequency anatomical layouts and high-frequency boundary details in a multi-scale feature space. During the structure disentanglement stage, high-frequency structural information is explicitly incorporated to enhance feature representations, guiding the model to focus on fine-grained structural cues while learning modality-invariant low-frequency anatomical representations. Furthermore, to reduce interference from modality-specific styles and improve the stability of structure representations, we design a style consistency loss and a structure-aware loss. Extensive experiments on the BraTS2020 and WMH datasets demonstrate that the proposed method outperforms existing MRI synthesis approaches, particularly in reconstructing complete structures. The source code is publicly available at https://github.com/ziyi-start/MSG-LDM.


[115] GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification eess.IV | cs.CV

Jiao Wang, Chi Liu, Yiying Zhang, Hongchen Luo, Zhifen Guo

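TL;DR: This paper presents GLEAM, the first publicly available tri-modal glaucoma dataset, comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages. It also proposes hierarchical attentive masked modeling (HAMM), a framework that uses hierarchical attentive encoders and light decoders to integrate cross-modal information for glaucoma classification.

Details

Motivation: To address the lack of public multimodal datasets for glaucoma diagnosis, and the question of how to integrate complementary information from different imaging modalities for accurate diagnosis and treatment across disease stages.

Result: The paper evaluates HAMM on the proposed GLEAM dataset, but the abstract does not report specific quantitative results or benchmark comparisons.

Insight: The novelty is the first publicly available tri-modal glaucoma dataset, together with a hierarchical attentive masked modeling framework that concentrates cross-modal representation learning on the encoder side to exploit complementary multimodal information.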

Abstract: We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder.


Riccardo Raciti, Lemuel Puglisi, Francesco Guarnera, Daniele Ravì, Sebastiano Battiato

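TL;DR: This study integrates the deep learning models SynthStrip and SynthSeg into SIENA, a classic tool for assessing brain atrophy, to address the bias that failures in classical image processing steps (such as skull stripping and tissue segmentation) introduce into atrophy estimates. Evaluation on the ADNI and PPMI longitudinal cohorts shows that replacing the skull-stripping module substantially strengthens the association between PBVC and measures of disease progression and improves scan-order consistency, while GPU-enabled variants reduce execution time by up to 46%.

Details

Motivation: SIENA is a widely used method for assessing brain atrophy, but failures in the classical image processing steps it relies on (such as skull stripping and tissue segmentation) propagate through the pipeline and bias its estimates. The study therefore explores whether targeted deep learning substitutions can make SIENA more robust while preserving its interpretable framework.

Result: In ADNI, the variant that replaces the skull-stripping module substantially strengthens correlations between PBVC and multiple measures of disease progression; across both ADNI and PPMI, it markedly improves robustness under scan reversal. The fully integrated pipeline reduces scan-order consistency error by up to 99.1%. GPU-enabled variants reduce execution time by up to 46% while keeping CPU runtimes comparable to standard SIENA.

Insight: The paper's novelty is the modular use of deep learning to replace the weak links of a traditional neuroimaging tool (such as skull stripping), improving performance and robustness without sacrificing interpretability. This offers a path that others can follow for modernizing clinically trusted tools.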

Abstract: Percentage Brain Volume Change (PBVC) derived from Magnetic Resonance Imaging (MRI) is a widely used biomarker of brain atrophy, with SIENA among the most established methods for its estimation. However, SIENA relies on classical image processing steps, particularly skull stripping and tissue segmentation, whose failures can propagate through the pipeline and bias atrophy estimates. In this work, we examine whether targeted deep learning substitutions can improve SIENA while preserving its established and interpretable framework. To this end, we integrate SynthStrip and SynthSeg into SIENA and evaluate three pipeline variants on the ADNI and PPMI longitudinal cohorts. Performance is assessed using three complementary criteria: correlation with longitudinal clinical and structural decline, scan-order consistency, and end-to-end runtime. Replacing the skull-stripping module yields the most consistent gains: in ADNI, it substantially strengthens associations between PBVC and multiple measures of disease progression relative to the standard SIENA pipeline, while across both datasets it markedly improves robustness under scan reversal. The fully integrated pipeline achieves the strongest scan-order consistency, reducing the error by up to 99.1%. In addition, GPU-enabled variants reduce execution time by up to 46% while maintaining CPU runtimes comparable to standard SIENA. Overall, these findings show that deep learning can meaningfully strengthen established longitudinal atrophy pipelines when used to reinforce their weakest image processing steps. More broadly, this study highlights the value of modularly modernizing clinically trusted neuroimaging tools without sacrificing their interpretability. Code is publicly available at https://github.com/Raciti/Enhanced-SIENA.git.