Table of Contents

cs.CL

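[1] NOTAI.AI: Explainable Detection of Machine-Generated Text via Curvature and Feature Attribution cs.CL

Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh, Salima Lamsiyah

TL;DR: NOTAI.AI is an explainable framework for machine-generated text detection that extends Fast-DetectGPT by combining curvature-based signals with neural and stylometric features in a supervised setting. The system integrates 17 interpretable features, including Conditional Probability Curvature, a ModernBERT detector score, readability metrics, and stylometric cues, in a gradient-boosted tree (XGBoost) meta-classifier that decides whether a text is human- or AI-generated. NOTAI.AI also applies SHAP for local and global feature attribution and converts these attributions into structured natural-language rationales through an LLM-based explanation layer, providing user-facing interpretability. The system is deployed as an interactive web application that supports real-time analysis, visual feature inspection, and structured evidence presentation.

Details

Motivation: Existing machine-generated text detectors lack interpretability. The work aims for a framework that is both accurate and able to explain its decisions clearly.

Result: The abstract reports no quantitative results or benchmark performance; it states only that the method extends Fast-DetectGPT and integrates multiple feature types.

Insight: The novelty lies in combining curvature signals with neural and stylometric features in a supervised classifier, and in pairing SHAP with an LLM explanation layer to provide multi-level interpretability, which improves the transparency of the detector and user trust in it.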

Abstract: We present NOTAI.AI, an explainable framework for machine-generated text detection that extends Fast-DetectGPT by integrating curvature-based signals with neural and stylometric features in a supervised setting. The system combines 17 interpretable features, including Conditional Probability Curvature, ModernBERT detector score, readability metrics, and stylometric cues, within a gradient-boosted tree (XGBoost) meta-classifier to determine whether a text is human- or AI-generated. Furthermore, NOTAI.AI applies Shapley Additive Explanations (SHAP) to provide both local and global feature-level attribution. These attributions are further translated into structured natural-language rationales through an LLM-based explanation layer, which enables user-facing interpretability. The system is deployed as an interactive web application that supports real-time analysis, visual feature inspection, and structured evidence presentation. A web interface allows users to input text and inspect how neural and statistical signals influence the final decision. The source code and demo video are publicly available to support reproducibility.
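
The SHAP step in the abstract rests on Shapley values, which split a prediction among the input features. The sketch below computes exact Shapley values for a toy three-feature scoring function in plain Python. The feature names, the `toy_score` function, and the all-zero baseline are illustrative assumptions, not NOTAI.AI's model; the real system uses an XGBoost classifier over 17 features with the SHAP library.

```python
from itertools import combinations
from math import factorial

# Hypothetical subset of features; the paper's full feature list is not given here.
FEATURES = ["curvature", "detector_score", "readability"]

def toy_score(x):
    """Stand-in for a trained classifier's raw score (assumed for illustration)."""
    return (2.0 * x["curvature"]
            + 1.5 * x["detector_score"]
            + 0.5 * x["curvature"] * x["readability"])

def shapley_values(score, x, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over all subsets of the other features, with absent features set to baseline."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_f = {g: (x[g] if (g in subset or g == f) else baseline[g])
                          for g in FEATURES}
                without_f = {g: (x[g] if g in subset else baseline[g])
                             for g in FEATURES}
                total += weight * (score(with_f) - score(without_f))
        phi[f] = total
    return phi
```

The attributions always sum to the difference between the score of the input and the score of the baseline, which is the property that makes a per-feature breakdown readable as "how much each signal moved the decision".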


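[2] Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs cs.CL

Patrick Ahrend, Tobias Eder, Xiyang Yang, Zhiyi Pan, Georg Groh

TL;DR: This paper studies the risk that personally identifiable information (PII) leaks into reasoning traces when large language models use chain-of-thought prompting. The authors propose a model-agnostic framework to quantify this leakage and find that chain-of-thought increases it, with the extent depending strongly on model family and reasoning budget. They also evaluate several lightweight inference-time gatekeeping methods, find that no single method is best in all settings, and therefore propose hybrid, style-adaptive gatekeeping policies that balance utility and risk.

Details

Motivation: Chain-of-thought prompting improves LLM reasoning but raises privacy risk: PII from the prompt can resurface in reasoning traces and outputs even when the model is instructed not to restate it. The paper aims to measure and mitigate this direct, inference-time PII leakage.

Result: Chain-of-thought consistently increases PII leakage, especially for high-risk categories, and leakage depends strongly on model family and reasoning budget. Among the four lightweight gatekeepers evaluated (a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge), none dominates across all models or budgets, which motivates the hybrid policies.

Insight: The main contribution is a model-agnostic framework that systematically defines and measures inference-time PII leakage, introducing leakage curves and risk-weighted evaluation metrics. The study underlines that privacy protection must be considered alongside gains in reasoning ability, and it supplies an empirical basis and an evaluation protocol for adaptive, hybrid real-time privacy filters.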

Abstract: Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
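
As a rough illustration of risk-weighted, token-level leakage and of a leakage curve over the CoT budget, the sketch below counts trace tokens that restate a PII value from the prompt and weights each hit by its type. The risk weights, the three PII types, the tokenizer, and the function names are all assumptions made for the example; the paper's 11-type taxonomy and its actual weights are not reproduced here.

```python
import re

# Hypothetical risk weights per PII type (not the paper's taxonomy).
RISK_WEIGHTS = {"email": 1.0, "phone": 2.0, "ssn": 3.0}

def tokenize(text):
    """Crude tokenizer that keeps e-mail addresses and dashed numbers in one piece."""
    raw = re.findall(r"[\w@.\-]+", text.lower())
    return [t.strip(".") for t in raw if t.strip(".")]

def risk_weighted_leakage(tokens, pii_items):
    """Sum the risk weights of tokens that exactly restate a known PII value."""
    lookup = {value.lower(): kind for kind, value in pii_items}
    return sum(RISK_WEIGHTS[lookup[t]] for t in tokens if t in lookup)

def leakage_curve(trace, pii_items, budgets):
    """Leakage of the trace truncated to each token budget."""
    tokens = tokenize(trace)
    return [risk_weighted_leakage(tokens[:b], pii_items) for b in budgets]
```

Exact string matching only catches verbatim restatement; paraphrased or partially masked PII would need the learned gatekeepers the abstract lists.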


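[3] Tutor Move Taxonomy: A Theory-Aligned Framework for Analyzing Instructional Moves in Tutoring cs.CL

Zhuqian Zhou, Kirk Vanacore, Tamisha Thompson, Jennifer St John, Rene Kizilcec

TL;DR: This paper presents a tutor move taxonomy for systematically analyzing instructional moves in tutoring dialogue at large scale. Developed through a hybrid deductive-inductive process, the taxonomy groups tutoring behaviors into four categories: tutoring support, learning support, social-emotional and motivational support, and logistical support. Learning support moves are further subdivided by the degree of student engagement.

Details

Motivation: Understanding what makes tutoring effective requires systematic analysis of what tutors do during instructional interactions. The authors therefore develop a structured annotation framework to support large-scale analysis of tutoring dialogue, specifically within the National Tutoring Observatory.

Result: The paper reports no quantitative experiments or benchmarks. The taxonomy was refined through iterative coding by expert annotators, and the authors emphasize that it supports scalable AI annotation, computational modeling of tutoring strategies, and empirical analysis of how tutoring behaviors relate to learning outcomes.

Insight: The contribution is a theory-aligned taxonomy of tutor moves that integrates research from several disciplines through a mixed method and subdivides learning support moves by student engagement, giving AI-driven tutoring analysis and strategy modeling a structured foundation.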

Abstract: Understanding what makes tutoring effective requires methods for systematically analyzing tutors’ instructional actions during learning interactions. This paper presents a tutor move taxonomy designed to support large-scale analysis of tutoring dialogue within the National Tutoring Observatory. The taxonomy provides a structured annotation framework for labeling tutors’ instructional moves during one-on-one tutoring sessions. We developed the taxonomy through a hybrid deductive-inductive process. First, we synthesized research from cognitive science, the learning sciences, classroom discourse analysis, and intelligent tutoring systems to construct a preliminary framework of tutoring moves. We then refined the taxonomy through iterative coding of authentic tutoring transcripts conducted by expert annotators with extensive instructional and qualitative research experience. The resulting taxonomy organizes tutoring behaviors into four categories: tutoring support, learning support, social-emotional and motivational support, and logistical support. Learning support moves are further organized along a spectrum of student engagement, distinguishing between moves that elicit student reasoning and those that provide direct explanation or answers. By defining tutoring dialogue in terms of discrete instructional actions, the taxonomy enables scalable annotation using AI, computational modeling of tutoring strategies, and empirical analysis of how tutoring behaviors relate to learning outcomes.


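[4] RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning cs.CL

Yuhang Liu, Ruijie Wang, Yunlong Chu, Bing Hao, Yumeng Lin

TL;DR: This paper proposes RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. During graph reasoning it prioritizes strong models for planning and synthesis, dynamically assigns lightweight models and cost-effective strategies to leaf subtasks according to predicted difficulty, and integrates explicit budget constraints to control graph expansion, yielding predictable performance-cost trade-offs.

Details

Motivation: Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can raise accuracy on some benchmarks, but they usually add substantial token consumption and latency, and their gains are unstable across task distributions, sometimes falling below simpler Chain-of-Thought (CoT) or direct input-output (IO) prompting. The authors attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines.

Result: Experiments on reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage. Compared to AGoT, it achieves an average accuracy gain of 8.1 percentage points and a 79.1% reduction in output tokens. RouteGoT also outperforms existing routing baselines by maintaining a better cost-accuracy trade-off and is more robust across budget targets and tasks.

Insight: The paper identifies node heterogeneity as a problem in graph-structured reasoning and proposes a node-adaptive routing framework that builds explicit budget constraints into a global inference scheduler, giving budgeted control over graph expansion. Its core idea is to choose models and allocate resources dynamically according to task difficulty and budget, a system-level approach to cost-efficient complex reasoning.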

Abstract: Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions, sometimes underperforming simpler Chain-of-Thought (CoT) or direct input-output prompting (IO). We attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines: high-quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs. Experiments across reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average accuracy improvement of 8.1 percentage points and a 79.1% reduction in output tokens compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, demonstrating improved robustness under varying budget targets and tasks.
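
The routing idea can be pictured with a small greedy allocator: planning and synthesis nodes ask for the strong model, leaf nodes pick a model from their predicted difficulty, and a token budget forces downgrades or skips. Everything concrete here (the per-call costs, the difficulty threshold, the node tuples, the function name `route_nodes`) is a hypothetical stand-in; RouteGoT's actual scheduler and difficulty predictor are not described in enough detail in the abstract to reproduce.

```python
# Hypothetical per-call token costs for the two model tiers.
COST = {"strong": 1000, "light": 200}

def route_nodes(nodes, budget, threshold=0.5):
    """Greedy, in-order routing of (name, role, predicted_difficulty) nodes.

    Returns the routing plan and the unspent budget.
    """
    plan, remaining = [], budget
    for name, role, difficulty in nodes:
        # Planning and synthesis prefer the strong model; leaves depend on difficulty.
        if role in ("plan", "synthesize") or difficulty >= threshold:
            model = "strong"
        else:
            model = "light"
        # Downgrade when the strong model no longer fits the budget.
        if model == "strong" and COST[model] > remaining:
            model = "light"
        # Skip the node entirely when even the light model does not fit.
        if COST[model] > remaining:
            plan.append((name, "skipped"))
            continue
        remaining -= COST[model]
        plan.append((name, model))
    return plan, remaining
```

Because this sketch allocates in visiting order, a tight budget can leave the final synthesis node with only the light model; a global scheduler would reserve budget for that stage up front.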


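[5] ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning cs.CL | cs.LG | cs.SE

Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim

TL;DR: ReflexiCoder is a reinforcement learning framework that teaches large language models (LLMs) to reflect on and correct the code they generate. It internalizes a structured reasoning trajectory (initial generation, bug- and optimization-aware reflection, and self-correction) into the model weights, so that code refinement at inference time is fully autonomous and needs no external feedback or execution engine.

Details

Motivation: Standard single-forward-pass ("System 1") generation hits a performance ceiling on complex algorithmic tasks, while existing iterative refinement strategies depend heavily on external feedback or computationally expensive prompt-response cycles at inference time. The paper uses reinforcement learning to make self-reflection and self-correction intrinsic to the model, removing the dependence on external resources.

Result: Extensive experiments on seven benchmarks show that ReflexiCoder-8B sets a new SOTA among leading open-source models in the 1.5B-14B parameter range. Reported results, all in a single-attempt setting: HumanEval (Plus) 94.51% (87.20%), MBPP (Plus) 81.80% (78.57%), BigCodeBench 35.00%, LiveCodeBench 52.21%, and CodeForces 37.34%, rivaling or surpassing proprietary models such as GPT-5.1. The framework is also markedly more token-efficient, reducing inference-time compute overhead by about 40% through efficient reasoning patterns.

Insight: The main innovation is to internalize the full generate-reflect-correct trajectory into the model weights through reinforcement learning, giving fully autonomous self-refinement at inference time without external feedback. The RL-zero training paradigm and fine-grained reward design let the model learn to debug and optimize without an execution engine or ground-truth feedback, a shift from external dependence to intrinsic capability. The method raises performance while also lowering inference cost, which makes it practically valuable.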

Abstract: While Large Language Models (LLMs) have revolutionized code generation, standard “System 1” approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug and optimization aware reflection, and self-correction, directly into the model’s weights. Unlike prior methods, ReflexiCoder shifts the paradigm from external-dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.


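[6] Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation cs.CL

Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An

TL;DR: This paper proposes CoCA (Co-optimized Confidence and Answers), a new paradigm in which a GRPO reinforcement learning framework jointly optimizes confidence calibration and answer accuracy, so that a large language model outputs its confidence before generating the answer, making uncertainty estimation more efficient and practical.

Details

Motivation: Most existing uncertainty estimation methods for large language models are answer-first: confidence is computed only after the answer has been generated, which limits practical use. The paper studies a confidence-first paradigm in which the model states its probability of answering correctly before it answers, aiming at more reliable uncertainty estimates.

Result: Experiments on math, code, and factual QA benchmarks show that CoCA improves calibration and uncertainty discrimination while preserving answer quality, enabling a broader range of downstream applications.

Insight: The contribution is the confidence-first paradigm together with the CoCA framework, which uses segmented credit assignment to reward the confidence segment and the answer segment separately. This yields stable joint optimization and avoids reward hacking, offering a new route to efficient uncertainty estimation.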

Abstract: Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer, which measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model’s probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
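
Segmented credit assignment can be sketched as two independent group-relative advantage computations, one per segment. In the sketch below the answer reward is plain correctness, and the confidence reward is the negative squared gap between the stated confidence and the group's empirical accuracy. That confidence reward is an assumption chosen to match the "probability of answering correctly under the current policy" reading; the abstract does not give CoCA's actual reward definitions.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def segmented_advantages(confidences, correct):
    """Return (confidence-segment advantages, answer-segment advantages).

    confidences: stated confidence in [0, 1] for each sampled response.
    correct:     1 if that response's answer was right, else 0.
    """
    ans_rewards = [float(y) for y in correct]
    accuracy = mean(ans_rewards)  # empirical success rate of the current policy
    # Assumed calibration reward: penalize distance from the group accuracy.
    conf_rewards = [-(c - accuracy) ** 2 for c in confidences]
    return group_relative(conf_rewards), group_relative(ans_rewards)
```

During the policy update, tokens in the confidence segment would be weighted by the first list and tokens in the answer segment by the second, so a well-calibrated but wrong sample is still rewarded for its confidence and penalized for its answer.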


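[7] Learning Next Action Predictors from Human-Computer Interaction cs.CL | cs.HC

Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber

TL;DR: The paper introduces the next action prediction (NAP) task: predicting a user's next action from a sequence of multimodal interactions with a computer (screenshots, clicks, sensor data). To advance the task, the authors build a large annotated dataset and develop LongNAP, a model that combines parametric and in-context learning to reason over long interaction histories. Trained with policy gradient methods, it significantly outperforms baselines on an LLM-as-judge metric.

Details

Motivation: Truly proactive AI systems must anticipate what users will do next, which requires reasoning over the full context rather than over sparse prompt signals. The paper formalizes this as the next action prediction task.

Result: On held-out data, LongNAP improves over supervised finetuning and prompted baselines by 79% and 39%, respectively (LLM-as-judge evaluation, 0-1 similarity). 17.1% of predicted trajectories align well with what the user actually does (score ≥ 0.5), rising to 26% for high-confidence predictions, and the model generalizes to unseen users.

Insight: The contributions are a large-scale open-source pipeline that labels naturalistic computer use with vision-language models, and the LongNAP architecture, which combines parametric and in-context learning and uses policy gradients to generate user-specific reasoning traces. Together they make predicting user needs from full-context behavior a workable approach.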

Abstract: Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts – it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user’s multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user’s next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held-out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP’s predicted trajectories are well-aligned with what a user does next (LLM-judge score $\geq$ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.


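[8] Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling cs.CL | cs.LG

Chanhui Zhu

TL;DR: This paper proposes a structured style-rewrite framework for the problem that small language models struggle to generate highly stylized text in low-resource character modeling. The framework disentangles style into three interpretable dimensions (lexical, syntactic, and pragmatic) and achieves implicit style conditioning through chain-of-thought distillation, so that high-fidelity stylized generation needs no explicit reasoning tokens at inference time.

Details

Motivation: In role-play, small language models produce inconsistent style because data is scarce and style disentanglement is complex. Standard supervised fine-tuning captures surface-level semantics but fails to reproduce a character's fine-grained syntactic and pragmatic traits.

Result: In experiments on anime characters, a highly stylized domain, the method lets a Qwen-1.7B model clearly outperform larger baselines (for example, a 4B model with standard supervised fine-tuning) in style consistency and semantic fidelity.

Insight: The contribution is the structured disentanglement of style into three interpretable dimensions combined with implicit style conditioning via chain-of-thought distillation, a data-efficient paradigm for efficient inference and deployment on consumer hardware.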

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing (RP); however, equipping Small Language Models (SLMs) with highly stylized personas remains a challenge due to data scarcity and the complexity of style disentanglement. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics while failing to reproduce the intricate syntactic and pragmatic nuances of a character, leading to “Out-Of-Character” (OOC) generation. To address this, we propose a Structured Style-Rewrite Framework that explicitly disentangles style into three interpretable dimensions: lexical signatures (via PMI), syntactic patterns (grounded in PCFG rules), and pragmatic style. Furthermore, we introduce an implicit style conditioning strategy via Chain-of-Thought (CoT) distillation. By leveraging explicit reasoning traces during training as a strong inductive bias, our approach aligns the model’s latent representations with structured style features, enabling high-fidelity stylized generation without requiring explicit reasoning tokens during inference. Extensive experiments on a specific high-stylization domain (anime characters) demonstrate that our method enables a Qwen-1.7B model to outperform significantly larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity. Our approach offers a data-efficient paradigm for democratizing inference and deployment on consumer hardware.
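
The lexical dimension relies on pointwise mutual information, PMI(w, c) = log2( p(w, c) / (p(w) p(c)) ), which is high for words a character uses far more often than the corpus as a whole. The sketch below ranks a speaker's words by PMI over a toy dialogue corpus. The whitespace tokenization, the `min_count` cutoff, and the function name are assumptions for illustration; the paper's exact extraction procedure is not given in the abstract.

```python
from collections import Counter
from math import log2

def pmi_signatures(lines_by_speaker, target, min_count=2):
    """Rank the target speaker's words by PMI(word, speaker), highest first."""
    word_counts, target_counts = Counter(), Counter()
    total = target_total = 0
    for speaker, lines in lines_by_speaker.items():
        for line in lines:
            for word in line.lower().split():
                word_counts[word] += 1
                total += 1
                if speaker == target:
                    target_counts[word] += 1
                    target_total += 1
    p_speaker = target_total / total
    scores = {}
    for word, count in target_counts.items():
        if count < min_count:  # rare words give unstable PMI estimates
            continue
        p_joint = count / total
        p_word = word_counts[word] / total
        scores[word] = log2(p_joint / (p_word * p_speaker))
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

A catchphrase used only by the target character gets the maximum score log2(1 / p(c)), while words shared with other speakers are pulled toward zero.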


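[9] MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing cs.CL | cs.AI | cs.MA

Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu

TL;DR: This paper presents MASFactory, a graph-centric framework for orchestrating LLM-based multi-agent systems, and introduces Vibe Graphing, a method that generates editable and executable workflow graphs from natural-language intent.

Details

Motivation: Implementing complex graph workflows in current LLM-based multi-agent frameworks takes substantial manual effort, offers limited reuse, and makes it hard to integrate heterogeneous external context. MASFactory aims to overcome these limitations.

Result: The framework is evaluated on seven public benchmarks, validating both its reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing.

Insight: The contribution is Vibe Graphing, a human-in-the-loop method that compiles natural-language intent into a workflow specification, together with reusable components, pluggable context integration, and a visualization tool, which raise the automation and flexibility of workflow orchestration.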

Abstract: Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory) and video (https://youtu.be/ANynzVfY32k) are publicly available.


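[10] ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning cs.CL | cs.CV

Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang

TL;DR: This paper proposes ViewFusion, a two-stage framework for multi-view spatial reasoning that separates cross-view spatial pre-alignment from question answering. In the first stage, "spatial pre-thinking" infers viewpoint relations and spatial transformations and builds an intermediate workspace; in the second stage, question-driven reasoning conditioned on that workspace produces the final answer. The model is trained with synthetic reasoning supervision followed by GRPO reinforcement learning and achieves a clear gain on MMSI-Bench.

Details

Motivation: Current vision-language models perform poorly on multi-view spatial reasoning. Even with several viewpoints available, they often underuse cross-view relations and fall back on single-image shortcuts, which makes them fragile on viewpoint transformation and occlusion-sensitive cases. The paper addresses this by structurally separating spatial alignment from reasoning so that the model understands and uses cross-view relations better.

Result: On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.

Insight: The core idea is to decompose multi-view reasoning explicitly into spatial pre-alignment (a structured spatial thinking chain) and conditioned question answering, with an intermediate workspace that encodes cross-view relations. This structured separation, with reinforcement learning stabilizing the two-stage generation behavior, suggests a way to improve robustness and interpretability in complex spatial reasoning.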

Abstract: Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.


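[11] Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality cs.CL | cs.AI

Xi Wang, Mengdie Zhuang, Jiqun Liu

TL;DR: This study uses continued pre-training to expose large language models to domain-specific texts, simulating the accumulation of experience, and examines how different experiences shape model personality and affect problem-solving. It finds that model competence is bimodal, peaking at "Expressive Generalists" and "Suppressed Specialists", and identifies a "Suppression Advantage": reducing social traits improves complex reasoning performance.

Details

Motivation: Human problem-solving benefits from diverse styles and personality traits, yet LLM development has mainly prioritized uniform performance benchmarks that favor particular behavioral tendencies such as assertiveness. The study investigates how different experiences shape machine personality and influence problem-solving.

Result: Using the Machine Personality Inventory, the study quantifies the personality traits of the model variants and analyzes their relationship to linguistic style and reasoning behavior. Competence peaks at "Expressive Generalists" and "Suppressed Specialists", and a "Suppression Advantage" appears, in which reduced social traits enhance complex reasoning.

Insight: The study reports a causal link between linguistic features of the training data (such as imperative frequency and lexical diversity) and model personality and performance, offering a roadmap for "Personality Engineering". Its novelty is to simulate experience through unsupervised exposure to domain-specific text and to quantify the effect on model personality and function.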

Abstract: Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such as assertiveness. To investigate how diverse experiences shape machine personality and influence problem-solving, this study employs continued pre-training to expose models to domain-specific texts in an unsupervised manner, simulating the accumulation of experience. By adapting the Big Five framework via the Machine Personality Inventory (MPI), we quantify the personality traits of these model variants and analyse their relationship to linguistic style and reasoning behaviour. The findings reveal that model competence is bimodal, peaking at “Expressive Generalists” and “Suppressed Specialists,” while identifying a “Suppression Advantage” where reduced social traits enhance complex reasoning performance. This study further establishes a causal link between training data linguistics, such as imperative frequency, and lexical diversity, providing a roadmap for “Personality Engineering”.


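[12] Diffusion Language Models Are Natively Length-Aware cs.CL | cs.LG

Vittorio Rossi, Giacomo Cirò, Davide Beltrame, Luca Gandolfi, Paul Röttger

TL;DR: Diffusion language models (DLMs) always generate over a fixed maximum-length context window, which wastes computation. This paper proposes a zero-shot mechanism that uses the latent prompt representation to crop the context window dynamically, reducing diffusion steps and saving compute. The approach is validated on four benchmarks (GSM8K, HumanEval, IfEval, and LongFormQA), with significant FLOPs reductions and minimal performance impact.

Details

Motivation: DLMs use a fixed maximum context window regardless of the required response length, which wastes computation on the short responses common in reasoning and chat tasks. A method is needed to adapt the generation length dynamically.

Result: On GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering), the method significantly reduces FLOPs on every task with no statistically significant performance degradation, and performance improves significantly on two of the four tasks.

Insight: The contribution is to estimate output length from the latent prompt representation and crop the context window zero-shot, giving DLMs native length awareness. The idea is applicable wherever efficiency must improve without sacrificing performance.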

Abstract: Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks – GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) – revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.


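[13] MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue cs.CL | cs.AI

Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan

TL;DR: This paper proposes MAPO (Mixed Advantage Policy Optimization), a critic-free, efficient reinforcement learning algorithm for policy optimization in long-horizon, subjective multi-turn dialogue tasks such as emotional support. It uses dense process feedback from a judge model, propagates long-horizon effects through Monte Carlo returns, and introduces a mixed advantage estimator that combines turn-level and batch-level normalization for fine-grained yet scalable credit assignment.

Details

Motivation: Reinforcement learning for subjective multi-turn dialogue is hard because reliable process supervision is missing. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naive turn-level group sampling incurs prohibitive rollout costs in interactive environments.

Result: Across several subjective dialogue benchmarks (EMPA, EmoBench, EQ-Bench) and model scales from 7B to 32B, MAPO improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, the success rate rises by up to 9 percentage points and dialogue scores by as much as +43.2 over the 7B base model. Although trained only on EMPA-style environments, the method generalizes to unseen emotional-intelligence benchmarks, gaining up to +4 points on EmoBench and +3.5 on EQ-Bench.

Insight: The core idea is to combine dense process supervision with mixed-level (turn-level and batch-level) normalization for effective, scalable reinforcement learning. By combining the two normalizations, the mixed advantage estimator keeps credit assignment fine-grained while making optimization more stable, which suits open-ended, long-horizon multi-turn dialogue.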

Abstract: Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
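
One way to read the mixed advantage estimator: compute a Monte Carlo return for every turn from the judge's per-turn rewards, standardize those returns once across trajectories at the same turn index and once across the whole batch, then blend the two. The blending weight `alpha`, the discount `gamma`, and the linear mix are assumptions for illustration; the abstract does not state MAPO's exact formula.

```python
from statistics import mean, pstdev

def mc_returns(rewards, gamma=1.0):
    """Discounted reward-to-go for each turn of one dialogue."""
    running, out = 0.0, []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def normalize(values, eps=1e-8):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def mixed_advantages(batch_rewards, alpha=0.5, gamma=1.0):
    """Blend turn-level and batch-level normalized returns (assumed linear mix)."""
    returns = [mc_returns(r, gamma) for r in batch_rewards]
    # Batch-level: standardize over every turn of every trajectory.
    flat = iter(normalize([g for traj in returns for g in traj]))
    batch_level = [[next(flat) for _ in traj] for traj in returns]
    # Turn-level: standardize across trajectories at the same turn index.
    turn_level = [[0.0] * len(traj) for traj in returns]
    for t in range(max(len(traj) for traj in returns)):
        idx = [i for i, traj in enumerate(returns) if len(traj) > t]
        for i, adv in zip(idx, normalize([returns[i][t] for i in idx])):
            turn_level[i][t] = adv
    return [[alpha * tl + (1 - alpha) * bl for tl, bl in zip(tr, br)]
            for tr, br in zip(turn_level, batch_level)]
```

No value network appears anywhere, which is what "critic-free" refers to: the baseline comes from normalization statistics rather than from a learned critic.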


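[14] LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation cs.CL

Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki

TL;DR: This paper introduces LIT-RAGBench, a benchmark for evaluating the generator capabilities of large language models within retrieval-augmented generation (RAG). It defines five categories (Integration, Reasoning, Logic, Table, and Abstention), builds 114 human-written Japanese questions with an English version from fictional entities and scenarios, and scores answers with LLM-as-a-Judge to measure how models handle multi-capability tasks under unified conditions.

Details

Motivation: Existing RAG generator benchmarks cover a limited set of capabilities. None evaluates several key capabilities at once under unified conditions (long-context evidence integration, multi-step reasoning, table interpretation, and abstention when evidence is missing), which falls short of practical deployment needs.

Result: Across API-based and open-weight models, no model exceeds 90% overall accuracy. LIT-RAGBench reports category-wise and overall accuracy, making each model's strengths and weaknesses measurable per capability.

Insight: The contribution is a unified benchmark that systematically covers the key capability dimensions of RAG generators (Integration, Reasoning, Logic, Table, Abstention) and supports evaluation of patterns that combine aspects across categories. Fictional scenarios ensure that answers are grounded in the external documents, and the bilingual dataset is built by human authoring combined with machine translation, giving a measurable tool for selecting RAG models and developing specialized ones.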

Abstract: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at https://github.com/Koki-Itai/LIT-RAGBench.


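[15] SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models cs.CL

Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin

TL;DR: SPOT is a framework for efficient latent reasoning in large language models. It compresses explicit chain-of-thought into compact latent pause tokens, cutting inference overhead while keeping the reasoning interpretable. Its core components are span-level semantic alignment and a frozen-head decoding constraint, which together raise accuracy while sharply reducing the number of generated tokens.

Details

Motivation: Explicit chain-of-thought reasoning is expensive at inference time, while earlier latent reasoning methods face two key problems: rigid alignment, which struggles to capture the dense, variable-length semantics of a whole reasoning segment, and poor interpretability, since latent states are hard to decode or audit.

Result: On reasoning benchmarks, SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5%, and it provides faithful semantic interpretations of the latent reasoning process.

Insight: The contributions are flexible span-level semantic alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, and a frozen-head decoding constraint that keeps latent states directly decodable into readable token distributions under the pretrained LM head, balancing efficiency with interpretability.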

Abstract: Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts. Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
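
The Sinkhorn objective mentioned above solves an entropy-regularized optimal-transport problem: given a cost matrix between pause tokens and the tokens of a reasoning segment, it finds a soft matching whose row and column sums equal prescribed marginals. The sketch below is the textbook Sinkhorn iteration in plain Python with uniform marginals. The cost values, the regularization strength `reg`, and the iteration count are illustrative assumptions; how SPOT builds its cost matrix and loss is not specified in the abstract.

```python
from math import exp

def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropy-regularized optimal transport with uniform marginals.

    cost[i][j]: cost of matching pause token i to segment token j.
    Returns the transport plan as a list of rows.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass carried by each pause token
    b = [1.0 / m] * m  # mass required by each segment token
    kernel = [[exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each row of the resulting plan spreads one pause token's mass over several segment tokens, which is the "soft" alternative to forcing it onto the single final token of a step.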


[16] Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason’s Selection Task cs.CL

Hirohiko Abe, Kentaro Ozeki, Risako Ando, Takanobu Morishita, Koji Mineshima

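TL;DR: This paper introduces a new Wason Selection Task dataset that encodes deontic modality and uses it to systematically evaluate deontic conditional reasoning in large language models. LLMs reason better with deontic rules than with descriptive ones, and their error patterns resemble those of humans, showing matching bias.

Details

Motivation: As the linguistic competence of LLMs improves, their reasoning abilities are drawing growing attention. Human reasoning performs better in normative domains, but the domain specificity of LLM reasoning remains underexplored. The paper systematically compares LLM reasoning over deontic and descriptive conditionals.

Result: Experiments show that LLMs reason better under deontic rules than under descriptive rules, and that their errors exhibit matching bias (ignoring negation and selecting items that lexically match the rule), similar to the known human bias in this paradigm.

Insight: The novelty is a new dataset that explicitly separates deontic from descriptive conditionals. It shows that LLM reasoning performance varies systematically with rule type and that LLM error patterns can parallel human cognitive biases, offering a new perspective on LLM reasoning mechanisms.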

Abstract: As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often performs well in domain specific settings, particularly in normative rather than purely formal contexts. Although prior studies have compared LLM and human reasoning, the domain specificity of LLM reasoning remains underexplored. In this study, we introduce a new Wason Selection Task dataset that explicitly encodes deontic modality to systematically distinguish deontic from descriptive conditionals, and use it to examine LLMs’ conditional reasoning under deontic rules. We further analyze whether observed error patterns are better explained by confirmation bias (a tendency to seek rule-supporting evidence) or by matching bias (a tendency to ignore negation and select items that lexically match elements of the rule). Results show that, like humans, LLMs reason better with deontic rules and display matching-bias-like errors. Together, these findings suggest that the performance of LLMs varies systematically across rule types and that their error patterns can parallel well-known human biases in this paradigm.
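
The two bias hypotheses named in the abstract predict different card selections once a rule contains a negation. The sketch below encodes the textbook characterisation of each for a rule "if antecedent then consequent". It is not the paper's dataset or scoring code, and the string encoding of literals and all function names are assumptions made for illustration.

```python
def negate(literal):
    """'P' -> 'not P', 'not P' -> 'P'."""
    return literal[4:] if literal.startswith("not ") else "not " + literal

def strip_negation(literal):
    """Drop a leading 'not ', keeping only the named item."""
    return literal[4:] if literal.startswith("not ") else literal

def falsifying_cards(antecedent, consequent):
    # Logically correct choice: the cards that could violate the rule,
    # i.e. antecedent true and consequent false.
    return {antecedent, negate(consequent)}

def confirmation_cards(antecedent, consequent):
    # Confirmation bias: look for cases that support the rule as stated.
    return {antecedent, consequent}

def matching_cards(antecedent, consequent):
    # Matching bias: pick the items named in the rule, ignoring negation.
    return {strip_negation(antecedent), strip_negation(consequent)}

for rule in [("P", "Q"), ("P", "not Q")]:
    print("if %s then %s" % rule)
    print("  correct:     ", sorted(falsifying_cards(*rule)))
    print("  confirmation:", sorted(confirmation_cards(*rule)))
    print("  matching:    ", sorted(matching_cards(*rule)))
```

For the affirmative rule the two biases predict the same selection, so only rules with negated components can tell them apart.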


[17] Abductive Reasoning with Syllogistic Forms in Large Language Models cs.CL | cs.AI

Hirohiko Abe, Risako Ando, Takanobu Morishita, Kentaro Ozeki, Koji Mineshima, Mitsuhiro Okada

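TL;DR: This paper studies how large language models perform at abductive reasoning by converting a syllogistic dataset into a form suitable for abduction. It investigates whether state-of-the-art LLMs exhibit biases in abduction and where they could improve, emphasizing the importance of contextualized reasoning beyond formal deduction.

Details

Motivation: Human reasoning involves not only formal deduction but also abduction, which draws tentative conclusions from limited information. Prior work criticizes LLMs for sharing human biases, such as dismissing logically valid inferences that contradict common beliefs, and this criticism may be unfair. The accuracy of LLMs in abductive reasoning therefore needs to be examined.

Result: The paper evaluates the abductive reasoning accuracy of LLMs on the converted syllogistic dataset in order to identify biases and room for improvement. The abstract reports no specific quantitative results or benchmark performance.

Insight: The novelty is recasting syllogistic forms as abduction tasks to evaluate LLMs systematically. The paper stresses the role of contextualized reasoning in bridging the gap between machine and human cognition and offers a new perspective on applying LLMs to complex reasoning tasks.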

Abstract: Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern. Prior studies have indicated that LLMs and humans share similar biases, such as dismissing logically valid inferences that contradict common beliefs. However, criticizing LLMs for these biases might be unfair, considering our reasoning not only involves formal deduction but also abduction, which draws tentative conclusions from limited information. Abduction can be regarded as the inverse form of syllogism in its basic structure, that is, a process of drawing a minor premise from a major premise and conclusion. This paper explores the accuracy of LLMs in abductive reasoning by converting a syllogistic dataset into one suitable for abduction. It aims to investigate whether the state-of-the-art LLMs exhibit biases in abduction and to identify potential areas for improvement, emphasizing the importance of contextualized reasoning beyond formal deduction. This investigation is vital for advancing the understanding and application of LLMs in complex reasoning tasks, offering insights into bridging the gap between machine and human cognition.


[18] Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing cs.CL

Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul

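TL;DR: This paper proposes Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for understanding and editing complex spreadsheet workbooks. It replaces conventional single-pass retrieval with an iterative tool-calling loop, supports end-to-end Excel workflows from complex analysis to structured editing, and achieves state-of-the-art performance on several frontier benchmarks.

Details

Motivation: State-of-the-art multimodal retrieval-augmented generation approaches fall short on enterprise spreadsheets containing millions of cells, cross-sheet dependencies, and embedded visual elements. Single-pass retrieval misses critical context, compression loses data resolution, and naive full-context injection exceeds the LLM context window. These limitations prevent reliable multi-step reasoning over complex workbooks.

Result: Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance on three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. Cost analysis shows that GPT-5.2 achieves the best efficiency-accuracy trade-off.

Insight: The claimed novelty is replacing single-pass retrieval with an iterative, tool-calling agentic loop, which enables end-to-end understanding and editing of complex spreadsheets while maintaining full auditability. Viewed objectively, the work applies an agentic architecture systematically to multimodal spreadsheet tasks and uses extensive ablations to verify that the planner, retrieval, and iterative reasoning each contribute substantially, offering a new paradigm for large-scale structured enterprise data.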

Abstract: Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross-sheet dependencies, and embedded visual artifacts. However, state-of-the-art approaches exclude critical context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full-context injection, preventing reliable multi-step reasoning over complex enterprise workbooks. We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. We evaluate five multimodal embedding models, identifying NVIDIA NeMo Retriever 1B as the top performer for mixed tabular and visual data, and vary nine LLMs. Ablation experiments confirm that the planner, retrieval, and iterative reasoning each contribute substantially, and cost analysis shows GPT-5.2 achieves the best efficiency-accuracy trade-off. Throughout all evaluations, BRTR maintains full auditability through explicit tool-call traces.


cs.CV

[19] Edges Are All You Need: Robust Gait Recognition via Label-Free Structure cs.CV | eess.IV

Chao Zhang, Zhuang Zheng, Ruixin Li, Zhanyong Mei

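TL;DR: This paper proposes SKETCHGAIT, a new gait recognition method. It introduces label-free dense edge structure (SKETCH) as a new visual modality and combines it with the parsing modality in a hierarchically disentangled multi-modal framework. The goal is to fix the unstable performance of existing silhouette- and parsing-based methods, which either lack structural detail or depend on strong semantic priors.

Details

Motivation: Existing gait recognition methods rely mainly on sparse silhouettes or on parsing representations that depend on upstream human parsers. The former lack internal structural detail; the latter are unstable because their performance depends heavily on parser quality. The paper revisits gait representation from a structural perspective and explores dense part-level structure without explicit semantic labels.

Result: Extensive experiments on SUSTech1K and CCPG show that SketchGait reaches 92.9% Rank-1 accuracy on SUSTech1K and 93.1% mean Rank-1 accuracy on CCPG, validating the proposed modality and framework.

Insight: The novelty is the label-free dense edge structure (SKETCH) as a new modality for gait recognition: edge detectors extract high-frequency structural cues (such as limb articulations and self-occlusion contours) directly from RGB images without semantic labels. A hierarchically disentangled multi-modal framework learns each modality independently and adds early-stage fusion to capture structural complementarity, improving robustness and discriminability.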

Abstract: Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.
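
As a rough illustration of label-free edge extraction, the sketch below applies a Sobel operator to a tiny grayscale image. The abstract says only "edge-based detectors", so the choice of Sobel, the threshold value, and the function name `sobel_edges` are assumptions and not the authors' pipeline.

```python
def sobel_edges(img, threshold=1.0):
    """Binary edge map from a grayscale image given as a list of rows."""
    h, w = len(img), len(img[0])
    gx_kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # horizontal gradient
    gy_kernel = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # vertical gradient
    edges = [[0] * w for _ in range(h)]
    # Border pixels are left at 0 because the 3x3 window does not fit there.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(gx_kernel[dy][dx] * img[y + dy - 1][x + dx - 1]
                     for dy in range(3) for dx in range(3))
            gy = sum(gy_kernel[dy][dx] * img[y + dy - 1][x + dx - 1]
                     for dy in range(3) for dx in range(3))
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges

# Toy image: dark left half, bright right half (a single vertical boundary).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
for row in sobel_edges(image):
    print(row)
```

No semantic labels are involved at any point: the map depends only on local intensity changes, which is what makes this kind of cue independent of an upstream human parser.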


[20] Thinking with Spatial Code for Physical-World Video Reasoning cs.CV

Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu

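TL;DR: This paper proposes Thinking with Spatial Code, a framework that converts RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. A spatial encoder parses video into structured spatial code with 3D oriented bounding boxes and semantic labels, so that large language models can reason directly over explicit spatial variables.

Details

Motivation: Existing vision-language models struggle with complex 3D geometry and dynamic scenes in physical-world video reasoning. The paper aims to improve reasoning through explicit 3D spatial representations.

Result: The model outperforms proprietary vision-language models on VSI-Bench, setting a new state of the art.

Insight: The novelty lies in parsing video into structured spatial code, unifying 6D object parsing and tracking backbones with geometric prediction, and fine-tuning LLMs with reinforcement learning under a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. Viewed objectively, the explicit 3D representation bridges visual and linguistic reasoning and provides an interpretable intermediate representation for physical-world understanding.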

Abstract: We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose the spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further fine-tune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at https://github.com/Beckschen/spatialcode.


[21] From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications cs.CV | cs.LG | cs.RO

Xusheng Luo, Changliu Liu

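TL;DR: This paper proposes the first coupled robustness verification framework for heatmap-based keypoint detectors. It uses a mixed-integer linear program (MILP) to bound the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements, and so gives tighter robustness guarantees than conventional decoupled methods.

Details

Motivation: Keypoint detection is central to vision tasks such as pose estimation, viewpoint recovery, and 3D reconstruction, yet existing neural models are vulnerable to small input perturbations. Formal robustness verification for keypoint detectors is underexplored because of high-dimensional inputs and continuous coordinate outputs, and conventional decoupled verification treats each keypoint independently, which yields conservative guarantees.

Result: Experiments show that the coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.

Insight: The novelty is moving robustness verification from decoupled (each keypoint verified independently) to coupled (all keypoints verified jointly). The MILP combines reachable heatmap sets with a polytope encoding of joint deviation constraints, which verifies the collective behavior of the detector and yields provably sound robustness certificates as well as counterexamples.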

Abstract: Keypoint detection underpins many vision tasks, including pose estimation, viewpoint recovery, and 3D reconstruction, yet modern neural models remain vulnerable to small input perturbations. Despite its importance, formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. We propose the first coupled robustness verification framework for heatmap-based keypoint detectors that bounds the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements. Unlike prior decoupled, classification-style approaches that verify each keypoint independently and yield conservative guarantees, our method verifies collective behavior. We formulate verification as a falsification problem using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility provides counterexamples, and we prove the method is sound: if it certifies the model as robust, then the keypoint detection model is guaranteed to be robust. Experiments show that our coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.


[22] DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces cs.CV | cs.AI

Mohammad Sadil Khan, Muhammad Usama, Rolandos Alexandros Potamias, Didier Stricker, Muhammad Zeshan Afzal

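TL;DR: DreamCAD is a multi-modal generative framework that produces editable boundary representations (BReps) directly from point-level supervision, without CAD-specific annotations. It represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and generates meshes with a differentiable tessellation method, enabling large-scale training on 3D datasets. The paper also introduces CADCap-1M, the largest CAD captioning dataset to date with 1M+ descriptions, to advance text-to-CAD research.

Details

Motivation: Existing generative methods are limited to small annotated datasets with explicit design histories or BRep labels, while millions of unannotated 3D meshes go unused. This limits progress in scalable CAD generation.

Result: On the ABC and Objaverse benchmarks, DreamCAD achieves state-of-the-art performance across text, image, and point modalities, improving geometric fidelity and surpassing 75% user preference.

Insight: The novelty includes representing BReps as differentiable parametric surfaces to enable large-scale training, and introducing the large-scale CAD captioning dataset CADCap-1M. Viewed objectively, point-level supervision combined with differentiable processing lets the method exploit unannotated 3D data and improves the scalability of multi-modal CAD generation.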

Abstract: Computer-Aided Design (CAD) relies on structured and editable geometric representations, yet existing generative methods are constrained by small annotated datasets with explicit design histories or boundary representation (BRep) labels. Meanwhile, millions of unannotated 3D meshes remain untapped, limiting progress in scalable CAD generation. To address this, we propose DreamCAD, a multi-modal generative framework that directly produces editable BReps from point-level supervision, without CAD-specific annotations. DreamCAD represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes. This enables large-scale training on 3D datasets while reconstructing connected and editable surfaces. Furthermore, we introduce CADCap-1M, the largest CAD captioning dataset to date, with 1M+ descriptions generated using GPT-5 for advancing text-to-CAD research. DreamCAD achieves state-of-the-art performance on ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and surpassing 75% user preference. Code and dataset will be publicly available.
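
For readers unfamiliar with parametric patches, the sketch below evaluates a Bézier surface with Bernstein polynomials and tessellates it into a triangle mesh on a regular grid. It illustrates the general technique only. DreamCAD's actual patch degree, sampling scheme, and gradient handling are not given in the abstract, so the bicubic control grid, the resolution, and the function names here are assumptions.

```python
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_point(ctrl, u, v):
    """Surface point at (u, v) for a control grid ctrl[i][j] = (x, y, z)."""
    n, m = len(ctrl) - 1, len(ctrl[0]) - 1
    point = [0.0, 0.0, 0.0]
    for i in range(n + 1):
        for j in range(m + 1):
            w = bernstein(n, i, u) * bernstein(m, j, v)
            for k in range(3):
                point[k] += w * ctrl[i][j][k]
    return tuple(point)

def tessellate(ctrl, res=4):
    """Sample a (res+1) x (res+1) vertex grid and split each cell into 2 triangles."""
    verts = [bezier_point(ctrl, i / res, j / res)
             for i in range(res + 1) for j in range(res + 1)]
    tris = []
    for i in range(res):
        for j in range(res):
            a = i * (res + 1) + j      # corners of one grid cell
            b, c = a + 1, a + res + 1
            d = c + 1
            tris += [(a, b, c), (b, d, c)]
    return verts, tris

# Hypothetical bicubic patch: flat 4x4 grid with the four inner control points raised.
ctrl = [[(float(i), float(j), 1.0 if 0 < i < 3 and 0 < j < 3 else 0.0)
         for j in range(4)] for i in range(4)]
verts, tris = tessellate(ctrl)
print(len(verts), "vertices,", len(tris), "triangles")
```

Every vertex is a weighted sum of control points with weights that depend only on (u, v), so mesh vertices vary smoothly with the control points. That property is what allows a point-level loss on the mesh to be propagated back to patch parameters in an autodiff framework.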


[23] Post Fusion Bird’s Eye View Feature Stabilization for Robust Multimodal 3D Detection cs.CV | cs.AI

Trung Tien Dong, Dev Thakkar, Arman Sargolzaei, Xiaomin Lin

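TL;DR: This paper proposes the Post Fusion Stabilizer (PFS), a lightweight module that improves the robustness of bird's-eye view (BEV) camera-LiDAR fusion 3D object detectors under domain shift and sensor failures. The module operates on the intermediate BEV feature maps of an existing detector: it stabilizes feature statistics, suppresses degraded regions, and applies adaptive residual correction to produce a refined feature map for the original detection head.

Details

Motivation: Existing BEV fusion detectors degrade significantly in real-world deployment under domain shift and sensor failures (such as camera dropout and low light). Existing robustness methods usually require architectural changes or retraining, which makes them hard to integrate into deployed systems.

Result: On the nuScenes benchmark, PFS achieves state-of-the-art results in several failure modes, improving camera-dropout robustness by +1.2% and low-light performance by +4.4% mAP, while the module has only 3.3M parameters.

Insight: The novelty is a lightweight post-fusion stabilization module that leaves the existing detector architecture unchanged. Designed as a near-identity transformation, it preserves the original performance while stabilizing feature maps to improve robustness to sensor degradation and domain shift, and it is easy to integrate into deployed systems.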

Abstract: Camera-LiDAR fusion is widely used in autonomous driving to enable accurate 3D object detection. However, bird’s-eye view (BEV) fusion detectors can degrade significantly under domain shift and sensor failures, limiting reliability in real-world deployment. Existing robustness approaches often require modifying the fusion architecture or retraining specialized models, making them difficult to integrate into already deployed systems. We propose a Post Fusion Stabilizer (PFS), a lightweight module that operates on intermediate BEV representations of existing detectors and produces a refined feature map for the original detection head. The design stabilizes feature statistics under domain shift, suppresses spatial regions affected by sensor degradation, and adaptively restores weakened cues through residual correction. Designed as a near-identity transformation, PFS preserves performance while improving robustness under diverse camera and LiDAR corruptions. Evaluations on the nuScenes benchmark demonstrate that PFS achieves state-of-the-art results in several failure modes, notably improving camera dropout robustness by +1.2% and low-light performance by +4.4% mAP while maintaining a lightweight footprint of only 3.3 M parameters.


[24] Rethinking Concept Bottleneck Models: From Pitfalls to Solutions cs.CV

Merve Tapli, Quentin Bouniot, Wolfgang Stammer, Zeynep Akata, Emre Akbas

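TL;DR: This paper proposes CBM-Suite, a framework that systematically addresses four core problems of Concept Bottleneck Models (CBMs): the lack of a metric to pre-evaluate concept relevance, the linearity problem that lets models bypass the concept bottleneck, the accuracy gap to black-box models, and the lack of systematic study of how different visual backbones and vision-language models affect results.

Details

Motivation: CBMs face fundamental limitations in interpretability practice: the suitability of a concept set is hard to assess, models may bypass the concept bottleneck and so invalidate their explanations, accuracy lags behind opaque models, and the influence of individual components is not systematically understood.

Result: Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.

Insight: The contributions are: (1) an entropy-based metric that quantifies the intrinsic suitability of a concept set; (2) a non-linear layer inserted between concept activations and the classifier, which resolves the linearity problem and ensures that model accuracy faithfully reflects concept relevance; (3) a distillation loss guided by a linear teacher probe, which narrows the accuracy gap; and (4) a systematic analysis of how different vision encoders, vision-language models, and concept sets jointly affect the accuracy and interpretability of CBMs.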

Abstract: Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the “linearity problem” causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework that systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.


[25] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On cs.CV | cs.AI | cs.LG

Wisdom Ikezogwo, Mehmet Saygin Seyfioglu, Ranjay Krishna, Karim Bouyarmane

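TL;DR: This paper proposes Implicit Error Counting (IEC), a method for designing rewards in RL post-training for tasks that lack an ideal reference answer (the reference-free setting). Rather than checking correctness against a reference, IEC enumerates and weights the errors in an output to produce calibrated reward signals. Using virtual try-on (VTO) as a case study, the paper validates IEC and introduces Cascaded Error Counting (CEC) as an evaluation metric along with the MDressBench benchmark.

Details

Motivation: Existing reference-based reward methods (such as Rubrics as Rewards) break down on tasks whose outputs are diverse and lack a single ideal answer. The paper aims to fill this gap in post-training methods for the reference-free setting.

Result: On the proposed MDressBench benchmark, IEC outperforms RaR on all metrics (lower CEC scores, indicating fewer errors). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. The CEC metric tracks human preferences well (60% top-1 vs. 30% others).

Insight: The core novelty is shifting reward design from enumerating what is correct to enumerating errors, together with two key design choices, implicit score emission and group calibration, that stabilize optimization. This gives an effective RL post-training reward for generative tasks without a clear reference answer, such as virtual try-on.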

Abstract: Reinforcement learning with verifiable rewards (RLVR) and Rubrics as Rewards (RaR) have driven strong gains in domains with clear correctness signals and even in subjective domains by synthesizing evaluation criteria from ideal reference answers. But many real-world tasks admit multiple valid outputs and lack the single ideal answer that rubric generation depends on. We identify this reference-free setting as a gap in current post-training methods and propose Implicit Error Counting (IEC) to fill it. Instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. We show that naïve explicit enumeration is too noisy for stable optimization, and that two design choices, implicit score emission and group calibration, are necessary to make error counting a reliable reward. As a case study, we validate IEC on virtual try-on (VTO), a domain that is simultaneously too constrained for holistic scoring and too permissive for rubric-based evaluation: subtle garment errors are unacceptable, yet many output variations are correct. We introduce Cascaded Error Counting (CEC) as an evaluation metric, which tracks human preferences well (60% top-1 vs. 30% others), and curate Mismatch-DressCode (MDressBench), a benchmark with maximal attribute mismatch to stress-test reward designs. On MDressBench, IEC outperforms RaR across all metrics (CEC: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. These results suggest that when ideal answers are unavailable, counting errors provides a stronger signal than constructing rubrics.


[26] Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding cs.CV

Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu

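TL;DR: This paper proposes SemVID, a training-free visual token pruning framework that tackles the prohibitive compute cost of processing long videos in video temporal grounding (VTG). Based on two VTG-specific principles, Evidence Retention and Connectivity Strength, it allocates per-frame token budgets by balancing query relevance and inter-frame variation, then selects object, motion, and context tokens to build a compact yet coherent token subset.

Details

Motivation: VTG must process long videos, which makes video-language-model pipelines prohibitively expensive. Existing training-free visual token pruning methods degrade sharply when applied directly to VTG, because VTG depends heavily on boundary-sensitive evidence and cross-frame reasoning chains.

Result: Extensive experiments on VTG benchmarks show that SemVID retains up to 95.4% mIoU with only 12.5% of the visual tokens and delivers up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets and achieving a strong accuracy-efficiency trade-off.

Insight: The novelty is identifying two VTG-specific pruning principles (Evidence Retention and Connectivity Strength) and designing a semantics-driven token allocation strategy around them. Selecting token types with complementary semantic roles keeps the evidence chain intact, which enables efficient pruning without any additional training.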

Abstract: Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, making video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.


[27] MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents cs.CV

Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu

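TL;DR: This paper introduces MultiHaystack, the first benchmark for evaluating retrieval and reasoning under large-scale, cross-modal conditions. It comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Models perform well when given the corresponding evidence, but performance drops sharply when they must retrieve that evidence from the full corpus, indicating that heterogeneous multimodal retrieval remains the main bottleneck for MLLMs.

Details

Motivation: Most existing benchmarks restrict retrieval to small, single-modality candidate sets, which simplifies the search space and overstates end-to-end reliability. They cannot assess the key real-world requirement of retrieving relevant evidence from a large, heterogeneous multimodal corpus before reasoning.

Result: On MultiHaystack, even the strongest retriever, E5-V, reaches only 40.8% Recall@1. State-of-the-art MLLMs such as GPT-5 achieve 80.86% reasoning accuracy when given the corresponding evidence, but accuracy falls to 51.4% under top-5 retrieval.

Insight: The novelty is the first large-scale, cross-modal benchmark that jointly evaluates retrieval and reasoning. It exposes a significant bottleneck of current multimodal large language models in heterogeneous multimodal retrieval and underlines the importance of retrieval-centric methods for advancing multimodal systems.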

Abstract: Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
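
Because each question is grounded in a single evidence item, Recall@k reduces to a hit rate: the fraction of questions whose gold item appears among the top k retrieved candidates. The sketch below shows that computation on made-up rankings. The identifiers and data are hypothetical, and this is not the benchmark's evaluation script.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the single gold evidence item is within the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(runs, k):
    """Average Recall@k over (ranked_ids, gold_id) pairs, one pair per question."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)

# Hypothetical retrieval output for three questions.
runs = [
    (["img_3", "doc_1", "vid_2"], "doc_1"),
    (["vid_5", "doc_4"], "doc_4"),
    (["img_9", "img_8", "vid_7"], "vid_7"),
]
for k in (1, 2, 3):
    print("Recall@%d = %.3f" % (k, mean_recall_at_k(runs, k)))
```

A reader can take the reported gap between 40.8% Recall@1 and downstream accuracy under top-5 retrieval as the gap between this kind of hit rate and what the reasoning model can recover from a longer candidate list.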


[28] Interpretable Perception and Reasoning for Audiovisual Geolocation cs.CV

Yiyang Su, Xiaoming Liu

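TL;DR: This paper proposes Audiovisual Geolocation, a framework that resolves geographic ambiguity through interpretable perception and reasoning. It has three stages: a Perception stage that uses a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; a Multimodal Reasoning stage in which a multimodal large language model fine-tuned via Group Relative Policy Optimization fuses the acoustic atoms with visual features; and a Precision Prediction stage that localizes using Riemannian Flow Matching on the $S^2$ manifold. Experiments show that the framework significantly outperforms unimodal baselines on the AVG benchmark, indicating that interpretable perception of the soundscape provides a critical, orthogonal signal for high-precision global localization.

Details

Motivation: Precise global localization is hard for image-based geolocation because visual landscapes are inherently ambiguous and the potential of auditory cues is largely untapped.

Result: On the proposed AVG benchmark (20,000 curated video clips covering 1,000 distinct locations), the framework significantly outperforms unimodal baselines and achieves high-precision global localization.

Insight: The novelty is decomposing the soundscape into interpretable "acoustic atoms", reasoning over them together with visual features using a fine-tuned MLLM, and predicting locations with Riemannian Flow Matching on the $S^2$ manifold, which together form a new interpretable framework for multimodal geolocation.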

Abstract: While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded “acoustic atoms”; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results entail that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
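
Working on the $S^2$ manifold means that paths between locations follow great-circle arcs rather than straight lines in latitude/longitude space. The sketch below shows only that geometric ingredient, geodesic interpolation between two points on the unit sphere. It is not the paper's flow-matching model, and the function names and example coordinates are assumptions for illustration.

```python
import math

def to_unit(lat_deg, lon_deg):
    """Latitude/longitude in degrees -> unit vector on the sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def geodesic_interp(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)  # angle between the two endpoints
    s = math.sin(omega)
    if s < 1e-12:
        # Identical or antipodal endpoints: the arc is degenerate or ambiguous,
        # so this sketch simply returns the start point.
        return x0
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return tuple(w0 * a + w1 * b for a, b in zip(x0, x1))

start = to_unit(0.0, 0.0)    # hypothetical starting sample
target = to_unit(0.0, 90.0)  # hypothetical target location
mid = geodesic_interp(start, target, 0.5)
print(mid, "norm =", math.sqrt(sum(c * c for c in mid)))
```

Unlike linear interpolation of coordinates, every intermediate point stays on the sphere, so a model trained along such paths never has to represent locations off the Earth's surface.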


[29] Unlocking ImageNet’s Multi-Object Nature: Automated Large-Scale Multilabel Annotation cs.CV

Junyu Chen, Md Yousuf Harun, Christopher Kanan

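TL;DR: This paper proposes an automated pipeline that converts the ImageNet training set from a single-label to a multi-label dataset without human annotation. It uses self-supervised Vision Transformers for unsupervised object discovery, selects regions aligned with the original labels to train a lightweight classifier, and applies that classifier to all regions to generate coherent multi-label annotations.

Details

Motivation: The original ImageNet benchmark enforces a single-label assumption even though many images contain multiple objects, which causes label noise and limits the richness of the learning signal. Multi-label annotations reflect real-world scenes, where multiple objects co-occur, more accurately and help models learn richer and more robust representations.

Result: In qualitative evaluation, the generated labels align strongly with human judgment. On quantitative benchmarks, models trained with multi-label supervision achieve better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and transfer better to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively).

Insight: The novelty is a scalable, automated multi-label annotation pipeline that needs no human intervention and addresses the lack of high-quality multi-label annotations for the ImageNet training set. Combining unsupervised object discovery from self-supervised ViTs with a lightweight classifier gives efficient and consistent multi-label generation, opening a new route to better classification performance and representation learning.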

Abstract: The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNet-V2) have improved the validation set, there has not yet been a scalable, high-quality multi-label annotation for the training set. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to the traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning. Project code and the generated multi-label annotations are available at https://github.com/jchen175/MultiLabel-ImageNet.


[30] From Phase Grounding to Intelligent Surgical Narratives cs.CV

Ethan Peterson, Huixin Zhan

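TL;DR: This paper proposes a CLIP-based multi-modal framework that automatically generates surgical timelines and narratives from surgical video. It aligns surgical video frames with textual gesture descriptions and predicts gestures and phases for video frames, building a structured surgical timeline and reducing the need for surgeons to review and annotate videos manually.

Details

Motivation: Current ways of creating surgical timelines rely either on post-operation reports, which are often vague, or on manual video annotation, which is time-consuming. The paper aims to generate accurate timelines and narratives automatically from the surgical video.

Result: The abstract reports no specific quantitative results or benchmarks. The method fine-tunes a CLIP model to improve the alignment between video gestures and textual tokens, which enables gesture and phase prediction.

Insight: The novelty is using pretrained multi-modal representations (CLIP) to bridge visual gestures and textual narratives for automatic structured analysis of surgical video. The idea could carry over to other medical video analysis tasks to reduce manual annotation.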

Abstract: Video surgery timelines are an important part of tool-assisted surgeries, as they allow surgeons to quickly focus on key parts of the procedure. Current methods involve the surgeon filling out a post-operation (OP) report, which is often vague, or manually annotating the surgical videos, which is highly time-consuming. Our proposed method sits between these two extremes: we aim to automatically create a surgical timeline and narrative directly from the surgical video. To achieve this, we employ a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. Specifically, we use the CLIP visual encoder to extract representations from surgical video frames and the text encoder to embed the corresponding gesture sentences into a shared embedding space. We then fine-tune the model to improve the alignment between video gestures and textual tokens. Once trained, the model predicts gestures and phases for video frames, enabling the construction of a structured surgical timeline. This approach leverages pretrained multi-modal representations to bridge visual gestures and textual narratives, reducing the need for manual video review and annotation by surgeons.
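
Once frames and gesture sentences live in a shared embedding space, prediction amounts to picking the sentence whose embedding is most similar to the frame embedding. The sketch below shows that step with toy 3-dimensional vectors in place of real CLIP encoder outputs. The gesture labels, the vectors, the temperature value, and the function names are all assumptions, and no actual CLIP API is called.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def predict_gesture(frame_emb, text_embs, temperature=0.07):
    """Return (best index, softmax probabilities) over gesture sentences."""
    logits = [cosine(frame_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Hypothetical gesture sentences and stand-in embeddings.
gestures = ["positioning the needle", "pushing the needle through tissue", "pulling the suture"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_emb = [0.1, 0.9, 0.2]

best, probs = predict_gesture(frame_emb, text_embs)
print(gestures[best], [round(p, 3) for p in probs])
```

Running this per frame and merging consecutive frames with the same predicted label would give a simple timeline, though how the paper actually segments phases is not stated in the abstract.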


[31] Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers cs.CV

Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang

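TL;DR: This paper proposes LayerBind, a training-free, plug-and-play method for controlling regional layout and occlusion order in text-to-image generation. It models regional generation as distinct layers and binds them during generation, which gives precise regional and occlusion controllability. The method builds on the observation that spatial layout and occlusion order are established early in denoising, so rearranging the early latent structure is enough to modify the final output.

Details

Motivation: Existing training-based methods for regional layout control inherit data bias and degrade image quality, and current techniques struggle with occlusion order, which limits practical use. The paper aims for a training-free method that controls regions and occlusion relations precisely and flexibly.

Result: Both qualitative and quantitative results demonstrate the effectiveness of LayerBind. It serves as a regional and occlusion controller across multiple Diffusion Transformer models and shows strong potential for creative applications.

Insight: The novelty is modeling regional generation as layered instances in two phases, instance initialization and semantic nursing. Layer-wise Instance Initialization uses the contextual sharing mechanism in multimodal joint attention, and Layer-wise Semantic Nursing uses layer-wise attention enhancement to reinforce regional details and maintain the occlusion order. The method needs no training and supports editable workflows, so instances and their visible order can be changed flexibly.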

Abstract: Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind’s effectiveness, highlighting its strong potential for creative applications.


[32] Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval cs.CV | cs.AIPDF

Donghoon Han, Eunhwan Park, Seunghyeon Seo

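TL;DR: This paper proposes BM25-V, which applies the Okapi BM25 scoring scheme from information retrieval to sparse visual-word activations produced by a Sparse Auto-Encoder (SAE) on Vision Transformer patch features, for image retrieval. BM25's inverse document frequency (IDF) weighting suppresses common, low-information visual words and emphasizes rare, discriminative ones, giving efficient, high-recall candidate retrieval that can serve as a first-stage retriever for dense reranking.

Details

Motivation: Dense image retrieval is accurate but offers limited interpretability and attribution, and it is compute-intensive at scale. The paper aims to exploit sparse representations to provide an efficient, interpretable image retrieval method that can serve as a first stage.

Result: Across seven benchmarks, BM25-V achieves Recall@200 ≥ 0.993, so a two-stage pipeline that reranks only K=200 candidates per query recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks.

Insight: The novelty lies in combining the classic BM25 text-retrieval algorithm with sparse visual-word representations from a Vision Transformer, exploiting the highly imbalanced, Zipfian-like distribution of visual-word document frequencies through IDF weighting. The result is a sparse retrieval scheme that is compute-efficient, interpretable, and connects directly to dense reranking, with attributable sparse representations and retrieval decisions.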

Abstract: Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25’s inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 $\geq$ 0.993, enabling a two-stage pipeline that reranks only $K=200$ candidates per query and recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
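
To make the scoring idea concrete, here is a minimal sketch of Okapi BM25 applied to visual words. It assumes each image is reduced to a list of active visual-word ids (so activations are treated as counts), uses the common `k1` and `b` defaults, and uses one widely used IDF variant; none of these choices are taken from the paper, and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter


def bm25_scores(query_words, gallery, k1=1.2, b=0.75):
    """Score every gallery image against a query with Okapi BM25.

    query_words: visual-word ids active in the query image.
    gallery: list of lists of visual-word ids, one list per gallery image.
    """
    n_docs = len(gallery)
    avg_len = sum(len(doc) for doc in gallery) / n_docs
    # Document frequency: number of gallery images containing each word.
    df = Counter(word for doc in gallery for word in set(doc))
    scores = []
    for doc in gallery:
        tf = Counter(doc)
        score = 0.0
        for word in set(query_words):
            if word not in tf:
                continue
            # Rare words get a large IDF, ubiquitous words a small one.
            idf = math.log(1.0 + (n_docs - df[word] + 0.5) / (df[word] + 0.5))
            norm = tf[word] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[word] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Word 0 appears in every image (low information); word 4 is rare.
gallery = [[0, 1, 2], [0, 3], [0, 4, 4]]
scores = bm25_scores([0, 4], gallery)
best = max(range(len(gallery)), key=lambda i: scores[i])
```

In this toy gallery the third image ranks first because it shares the rare word 4 with the query, while the shared word 0 contributes little to any score. A real system would use an inverted index rather than scanning every image.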


[33] Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction cs.CVPDF

Ling Xiao, Yuliang Xiu, Yue Chen, Guoming Wang, Toshihiko Yamasaki

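TL;DR: This paper proposes a spectral analysis framework for diagnosing feature upsamplers in 2D-to-3D scene reconstruction. The framework comprises six complementary metrics that evaluate upsampling methods in terms of amplitude redistribution, structural spectral alignment, and directional stability. The study finds that reconstruction quality is more closely related to preserving spectral structure than to enhancing spatial detail, and that learnable upsamplers rarely outperform classical interpolation in reconstruction quality.

Details

Motivation: In a typical 2D-to-3D reconstruction pipeline, the feature upsampler turns sparse features into dense representations and is critical for preserving geometric consistency across views. Existing learnable upsampling methods mainly aim to enhance spatial detail, and their impact on 3D awareness remains underexplored.

Result: Classical interpolation and learnable methods are evaluated on CLIP and DINO backbones. Key findings: structural spectral consistency (SSC/CSC) is the strongest predictor of novel view synthesis (NVS) quality; High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance; learnable upsamplers rarely outperform classical interpolation in reconstruction quality, and their effectiveness depends on the reconstruction model.

Insight: The paper's novelty is introducing the first spectral diagnostic framework for assessing how feature upsamplers affect 3D reconstruction. Viewed objectively, the core insight is that reconstruction quality hinges on keeping spectral structure consistent rather than merely boosting high-frequency spatial detail, which gives a new principle for designing upsampling strategies in 2D-to-3D pipelines.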

Abstract: A typical 2D-to-3D pipeline takes multi-view images as input, where a Vision Foundation Model (VFM) extracts features that are spatially upsampled to dense representations for 3D reconstruction. If dense features across views preserve geometric consistency, differentiable rendering can recover an accurate 3D representation, making the feature upsampler a critical component. Recent learnable upsampling methods mainly aim to enhance spatial details, such as sharper geometry or richer textures, yet their impact on 3D awareness remains underexplored. To address this gap, we introduce a spectral diagnostic framework with six complementary metrics that characterize amplitude redistribution, structural spectral alignment, and directional stability. Across classical interpolation and learnable upsampling methods on CLIP and DINO backbones, we observe three key findings. First, structural spectral consistency (SSC/CSC) is the strongest predictor of NVS quality, whereas High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance, indicating that emphasizing high-frequency details alone does not necessarily improve 3D reconstruction. Second, geometry and texture respond to different spectral properties: Angular Energy Consistency (ADC) correlates more strongly with geometry-related metrics, while SSC/CSC influence texture fidelity slightly more than geometric accuracy. Third, although learnable upsamplers often produce sharper spatial features, they rarely outperform classical interpolation in reconstruction quality, and their effectiveness depends on the reconstruction model. Overall, our results indicate that reconstruction quality is more closely related to preserving spectral structure than to enhancing spatial detail, highlighting spectral consistency as an important principle for designing upsampling strategies in 2D-to-3D pipelines.


[34] EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition cs.CVPDF

Adam D. Hines, Gokul B. Nair, Nicolás Marticorena, Michael Milford, Tobias Fischer

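TL;DR: This paper presents EventGeM, a global-to-local feature fusion pipeline for event-based Visual Place Recognition (VPR). It first uses a pre-trained vision transformer (ViT-S/16) to extract global features from event histogram images for initial matching, then detects local keypoints with a pre-trained MaxViT backbone for 2D-homography-based re-ranking with RANSAC, and finally refines the ranking by using a pre-trained vision foundation model for depth estimation to compare structural similarity. The method achieves state-of-the-art localization across several benchmark datasets and lighting conditions, runs in real time on a variety of compute architectures, and is demonstrated for online localization on a real robotic platform.

Details

Motivation: Event cameras are increasingly popular for robotic navigation and localization because of their sparse activation and high temporal resolution, especially where accurate positioning must occur on small and frequent time scales or where energy use is paramount. The paper addresses event-based visual place recognition with a pipeline that fuses global and local features to improve localization accuracy and robustness.

Result: Across several benchmark datasets and lighting conditions, EventGeM outperforms the best currently available event-based place recognition method, reaching state-of-the-art (SOTA) performance. It runs in real time on a variety of compute architectures and was demonstrated for online localization on a robotic platform.

Insight: The novelty is a global-to-local feature fusion pipeline that combines a pre-trained ViT for global feature extraction, MaxViT for local keypoint detection with RANSAC-based homography re-ranking, and a pre-trained vision foundation model for depth estimation to compare structural similarity, yielding accurate, robust, real-time event-based visual place recognition. Viewed objectively, the method integrates several pre-trained vision models, and its multi-stage matching and re-ranking strategy improves localization from event data.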

Abstract: Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount. In this work, we present EventGeM, a state-of-the-art global-to-local feature fusion pipeline for event-based Visual Place Recognition. We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global feature patch embeddings from event histogram images for initial match predictions. Local feature keypoints are then detected using a pre-trained MaxViT backbone for 2D-homography-based re-ranking with RANSAC. For additional re-ranking refinement, we subsequently use a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries. Our work achieves state-of-the-art localization when compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions, while being fully capable of running in real time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera. Project page: https://eventgemvpr.github.io/


[35] Training-free Latent Inter-Frame Pruning with Attention Recovery cs.CVPDF

Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir

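TL;DR: This paper proposes the training-free Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which reduces the computational latency of video generation models by detecting and skipping recomputation of duplicated video latent patches, and introduces an Attention Recovery mechanism that approximates the attention values of pruned tokens to avoid visual artifacts.

Details

Motivation: Current video generation models have high computational latency, which makes real-time applications prohibitively costly. The paper addresses this by exploiting the temporal redundancy inherent in video latent patches.

Result: The method averages 12.2 FPS on an NVIDIA A6000 versus 8.4 FPS for the baseline, a 1.45× increase in video editing throughput, without compromising generation quality, and it integrates with the model without additional training.

Insight: The novelty lies in combining traditional compression algorithms with modern generative pipelines: inter-frame pruning plus attention recovery removes redundant computation efficiently, offering a lightweight optimization for real-time video generation.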

Abstract: Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects and skips recomputing duplicated latent patches. Additionally, we introduce a novel Attention Recovery mechanism that approximates the attention values of pruned tokens, thereby removing visual artifacts arising from naively applying the pruning method. Empirically, our method increases video editing throughput by $1.45\times$, on average achieving 12.2 FPS on an NVIDIA A6000 compared to the baseline 8.4 FPS. The proposed method does not compromise generation quality and can be seamlessly integrated with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.
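
The pruning idea can be illustrated with a small sketch that flags latent patches which are (nearly) unchanged from the previous frame, so only the remaining patches would be recomputed. The tolerance `tol`, the max-absolute-difference criterion, and the name `prune_mask` are assumptions made for illustration; the abstract does not specify the detection rule, and the Attention Recovery step is not sketched here. The last line only checks the reported throughput ratio.

```python
def prune_mask(prev_frame, cur_frame, tol=1e-3):
    """Return True for patches in cur_frame that nearly duplicate prev_frame.

    Each frame is a list of patches; each patch is a list of floats.
    """
    mask = []
    for prev_patch, cur_patch in zip(prev_frame, cur_frame):
        max_diff = max(abs(p - c) for p, c in zip(prev_patch, cur_patch))
        mask.append(max_diff <= tol)
    return mask


prev_frame = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.1]]
cur_frame = [[0.1, 0.2], [0.7, 0.4], [0.9, 0.1]]
mask = prune_mask(prev_frame, cur_frame)

# Indices of patches that still need to be recomputed.
kept = [i for i, skip in enumerate(mask) if not skip]

# Reported throughput ratio: 12.2 FPS against the 8.4 FPS baseline.
speedup = 12.2 / 8.4
```

Only the middle patch changed between the two toy frames, so it is the only one left to recompute, and the ratio of the two reported frame rates rounds to the stated 1.45.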


[36] Margin and Consistency Supervision for Calibrated and Robust Vision Models cs.CV | cs.AI | cs.LGPDF

Salim Khazem

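TL;DR: This paper proposes Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS adds a margin penalty and a consistency regularizer to improve the calibration and robustness of deep vision classifiers, with no additional data or architectural changes.

Details

Motivation: Deep vision classifiers often achieve high accuracy while remaining poorly calibrated (predicted confidence does not match accuracy) and fragile under small distribution shifts.

Result: Across several image classification benchmarks (including evaluation of robustness to common corruptions) and backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness while preserving or improving top-1 accuracy.

Insight: The novelty is the joint optimization of the classification margin (a hinge-squared margin penalty enforces a logit gap between the correct class and the strongest competitor) and local stability (a KL-divergence consistency regularizer minimizes the difference between predictions on clean inputs and mildly perturbed views). The theoretical analysis shows that this yields improved generalization guarantees and a provable robustness radius.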

Abstract: Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing classification margin while reducing local sensitivity, formalized via a Lipschitz-type stability proxy, yields improved generalization guarantees and a provable robustness radius bound scaling with the margin-to-sensitivity ratio. Across several image classification benchmarks and several backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data, no architectural changes, and negligible inference overhead, making it an effective drop-in replacement for standard training objectives.
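
The three terms named in the abstract can be written out for a single example as a rough sketch. The target margin, the two loss weights, and the direction of the KL term (clean relative to perturbed) are assumptions chosen for illustration, and `macs_loss` is a hypothetical name rather than the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def macs_loss(clean_logits, perturbed_logits, label,
              margin=1.0, lam_margin=0.1, lam_cons=1.0):
    """Cross-entropy + hinge-squared margin penalty + KL consistency term.

    margin, lam_margin, and lam_cons are illustrative values only.
    """
    p_clean = softmax(clean_logits)
    p_pert = softmax(perturbed_logits)

    # Standard cross-entropy on the clean prediction.
    ce = -math.log(p_clean[label])

    # Gap between the correct logit and the strongest competing logit.
    strongest_other = max(z for i, z in enumerate(clean_logits) if i != label)
    gap = clean_logits[label] - strongest_other
    margin_penalty = max(0.0, margin - gap) ** 2

    # KL(clean || perturbed); zero-probability terms contribute nothing.
    kl = sum(p * math.log(p / q) for p, q in zip(p_clean, p_pert) if p > 0)

    return ce + lam_margin * margin_penalty + lam_cons * kl


confident = macs_loss([3.0, 1.0, 0.0], [3.0, 1.0, 0.0], label=0)
narrow = macs_loss([1.2, 1.0, 0.0], [1.2, 1.0, 0.0], label=0)
```

When the logit gap already exceeds the margin and the perturbed view matches the clean one, both extra terms vanish and the loss reduces to plain cross-entropy; a narrow gap adds a penalty that grows quadratically with the shortfall.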


[37] Remote Sensing Image Classification Using Deep Ensemble Learning cs.CV | cs.AIPDF

Niful Islam, Md. Rayhan Ahmed, Nur Mohammad Fahad, Salekul Islam, A. K. M. Muzahidul Islam

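TL;DR: This paper proposes a deep ensemble learning method for remote sensing image classification that fuses the strengths of CNNs and ViTs and uses ensembling to overcome a performance bottleneck. The method achieves higher classification accuracy than existing architectures on several remote sensing datasets.

Details

Motivation: CNNs struggle to capture global contextual information in remote sensing image classification, and simply adding more CNN and ViT components creates a performance bottleneck caused by redundant feature representations.

Result: The method reaches accuracies of 98.10%, 94.46%, and 95.45% on the UC Merced, RSSCN7, and MSRSI datasets, respectively, outperforming competing architectures and achieving SOTA performance.

Insight: The novelty is training multiple independent CNN-ViT fusion models and ensembling them, which uses the strengths of both architectures while avoiding feature redundancy, improving performance while keeping training efficient.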

Abstract: Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. While Convolutional Neural Networks (CNNs), the most widely used models for image classification, excel at local feature extraction, they struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self-attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs, therefore, leads to better performance than standalone architectures. However, the use of additional CNN and ViT components does not lead to further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combine their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly due to its efficient use of computational resources during training.
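
Combining four models at the prediction stage can be sketched as below. The abstract does not say which ensembling rule is used, so this assumes simple averaging of class probabilities (soft voting); `ensemble_predict` and the probability values are hypothetical.

```python
def ensemble_predict(model_probs):
    """Average class probabilities across models and pick the top class.

    model_probs: one probability vector per model, all of equal length.
    Averaging (soft voting) is an assumption, not the paper's stated rule.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    avg = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg


# Four independent fusion models disagree on the top class for one image.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.3, 0.6, 0.1],
]
label, avg = ensemble_predict(probs)
```

Two of the four toy models prefer class 0 and two prefer class 1, but the averaged probabilities favor class 1 because its supporters are more confident.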


[38] VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction cs.CVPDF

Muhua Zhu, Xinhao Jin, Yu Zhang, Yifei Xue, Tie Ji

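TL;DR: The paper proposes VS3R, a framework that combines feed-forward 3D reconstruction with generative video diffusion to resolve the trade-off between geometric robustness and full-frame consistency in video stabilization. It jointly estimates camera parameters, depth, and masks for all-scenario reliability, introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency, and uses a Dual-Stream Video Diffusion Model to restore disoccluded regions and rectify artifacts.

Details

Motivation: In video stabilization, conventional 2D methods crop aggressively, while 3D methods rely on optimization that is fragile under extreme motion. The goal is robust full-frame video stabilization.

Result: VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.

Insight: The novelty is the synergy of feed-forward 3D reconstruction and generative video diffusion: joint estimation of camera parameters, depth, and masks together with hybrid rendering ensures reliability, and a dual-stream diffusion model performs restoration, improving the handling of extreme motion and occluded scenes.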

Abstract: Video stabilization aims to mitigate camera shake but faces a fundamental trade-off between geometric robustness and full-frame consistency. While 2D methods suffer from aggressive cropping, 3D techniques are often undermined by fragile optimization pipelines that fail under extreme motions. To bridge this gap, we propose VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion. Our pipeline jointly estimates camera parameters, depth, and masks to ensure all-scenario reliability, and introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency. Finally, a Dual-Stream Video Diffusion Model restores disoccluded regions and rectifies artifacts by synergizing structural guidance with semantic anchors. Collectively, VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.


[39] TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis cs.CVPDF

Sijing Li, Zhongwei Qiu, Jiang Liu, Wenqiao Zhang, Tianwei Lin

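TL;DR: The paper proposes TumorChain, a multimodal interleaved chain-of-thought reasoning framework for traceable clinical tumor analysis, and builds TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions. The framework tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment, and performs cross-modal alignment and iterative interleaved causal reasoning to improve diagnostic traceability and reduce hallucination risk.

Details

Motivation: Clinical tumor analysis requires step-by-step, traceable reasoning from imaging findings to clinical impressions to pathology conclusions, in order to improve diagnostic accuracy and reduce errors.

Result: The method improves consistently over strong baselines in lesion detection, impression generation, and pathology classification, and generalizes well on the DeepTumorVQA benchmark.

Insight: The novelty is a large-scale, step-aligned multimodal chain-of-thought dataset (TumorCoT) and a multimodal reasoning framework (TumorChain) that self-refines through cross-modal alignment and iterative interleaved causal reasoning, aiming to make clinical decisions more interpretable and reliable.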

Abstract: Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment guide diagnosis and treatment planning. Chain-of-Thought (CoT) reasoning is particularly important in this setting because it enables step-by-step interpretation from imaging findings to clinical impressions and pathology conclusions, improving traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions. We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the trajectory from findings to impression to pathology, enabling evaluation of both answer accuracy and reasoning consistency. We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk. Experiments show consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification, and demonstrate strong generalization on the DeepTumorVQA benchmark. These results highlight the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. Detailed information about our project can be found on our project homepage at https://github.com/ZJU4HealthCare/TumorChain.


[40] PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues cs.CVPDF

Yukun Qi, Pei Fu, Hang Li, Yuhan Liu, Chao Jiang

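TL;DR: This paper proposes PatchCue, a patch-based visual cue paradigm for enhancing the reasoning of vision-language models (VLMs). It partitions images into patches and represents visual cues at the patch level, which better matches human perceptual habits and aligns with the patch-tokenized input of modern VLMs. With two-stage training (supervised fine-tuning followed by reinforcement learning with a process-supervised reward), it consistently improves performance across multiple VLMs and benchmarks.

Details

Motivation: Existing VLM reasoning paradigms such as Chain-of-Thought rely mainly on textual information and underuse important visual cues, while prior methods that incorporate pixel-level visual cues require precise spatial localization, which adds learning complexity.

Result: On benchmarks covering general visual question answering, complex reasoning, and document understanding, PatchCue consistently improves the overall performance of multiple VLMs, and its patch-level cues outperform both pixel-level bounding boxes and point-based cues.

Insight: The novelty is a patch-level visual cue paradigm that aligns naturally with human perception and with the patch-tokenized input of VLMs, together with a two-stage training strategy whose process-supervised cue reward guides intermediate visual reasoning steps, giving a more effective and cognitively aligned approach to visual reasoning.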

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.


[41] Shifting Adaptation from Weight Space to Memory Space: A Memory-Augmented Agent for Medical Image Segmentation cs.CVPDF

Bowen Chen, Qiaohui Gao, Shaowen Wan, Shanhui Sun, Wei Liu

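TL;DR: This paper proposes a memory-augmented segmentation agent (MemSeg-Agent) that shifts model adaptation from weight space to memory space. A fixed backbone is conditioned on lightweight static, few-shot, and test-time working memories that an agentic controller composes dynamically, addressing domain generalization, federated-learning communication overhead, and continuous knowledge evolution in medical image segmentation.

Details

Motivation: Medical image segmentation models generalize poorly across institutions, scanners, or patient populations. The paper also aims to avoid the high communication overhead that conventional weight fine-tuning incurs in federated learning, and to support continuous adaptation during deployment.

Result: Experiments on four public datasets show that static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and that test-time working memory further improves in-domain and cross-domain performance without fine-tuning, demonstrating strong performance and robustness to domain shift.

Insight: The novelty is moving adaptation from model weights to lightweight memory units that can be composed dynamically. This offers a new paradigm for scalable, adaptive medical image segmentation, well suited to federated and continual learning, with lower communication and computation cost.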

Abstract: Medical image segmentation is fundamental to clinical workflows, yet models trained on a single dataset often fail to generalize across institutions, scanners, or patient populations. While vision foundation models have shown great promise in addressing this challenge, their deployment typically requires task-specific fine-tuning, which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment. In this work, we propose a memory-augmented segmentation agent (MemSeg-Agent) that shifts adaptation from weight space to memory space, enabling few-shot learning, federated supervised learning, and test-time adaptation within a unified architecture. MemSeg-Agent conditions a fixed backbone with lightweight static, few-shot, and test-time working memories, which are dynamically composed by an agentic controller. In federated settings, we update compact memory units instead of model parameters, substantially reducing communication overhead. Experiments on four public datasets demonstrate strong performance and robustness to domain shift: Static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning. Overall, MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI.


[42] Systematic Evaluation of Novel View Synthesis for Video Place Recognition cs.CV | cs.ROPDF

Muhammad Zawad Mahmud, Samiha Islam, Damian Lyons

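TL;DR: This paper systematically evaluates synthetic novel views in Video Place Recognition (VPR), using five public VPR image databases and seven typical image similarity methods. It finds that small additions of synthetic novel views improve VPR recognition performance, and that the magnitude of viewpoint change matters less than the number of views added and the type of imagery in the dataset.

Details

Motivation: Synthetic novel view generation has potential value for robot navigation (for example, cooperation between ground and aerial robots) and for Video Place Recognition. The paper evaluates its practical effect to guide deployment of these techniques.

Result: In VPR, adding a small number of synthetic novel views improves recognition statistics. For larger additions, performance depends more on the number of views added and the type of imagery in the dataset than on the magnitude of viewpoint change.

Insight: The novelty is the first systematic evaluation of how synthetic novel views affect VPR. It shows that the number of views and the dataset type matter more than the magnitude of viewpoint change, which informs multi-view fusion and dataset design.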

Abstract: The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.


[43] PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction cs.CV | cs.GR | cs.LGPDF

Xiang Zhang, Sohyun Yoo, Hongrui Wu, Chuan Li, Jianwen Xie

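TL;DR: PixARMesh autoregressively reconstructs complete 3D indoor scene meshes from a single RGB image. A unified model jointly predicts object layout and geometry, producing coherent, artist-ready meshes in a single forward pass. The method augments a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image.

Details

Motivation: Existing methods rely on implicit signed distance fields and post-hoc layout optimization and cannot directly produce high-quality, lightweight, coherent 3D scene meshes. The goal is to reconstruct scene meshes usable in downstream applications directly and efficiently from a single view.

Result: Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes.

Insight: The novelty is casting scene reconstruction as autoregressive prediction over a unified token stream of context, pose, and mesh, so that layout and geometry are predicted jointly. Pixel-aligned features and cross-attention strengthen spatial reasoning, and compact, high-fidelity meshes are generated directly, avoiding implicit representations and subsequent optimization steps.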

Abstract: We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.


[44] CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection cs.CVPDF

Xuecheng Bai, Yuxiang Wang, Chuanzhi Xu, Boyu Hu, Kang Han

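TL;DR: This paper proposes CollabOD, a lightweight collaborative detection framework for small object detection in UAV imagery, which is challenged by scale variation, structural detail degradation, and limited computational resources. Through Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies, it optimizes the architecture of conventional UAV perception models, enhancing representation stability while maintaining efficient inference.

Details

Motivation: In UAV imagery, small object detection suffers from scale variation, and fine structural details are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness.

Result: The proposed design enhances representation stability while maintaining efficient inference, and a unified detail-aware detection head improves regression robustness without additional deployment overhead.

Insight: The novelty is optimizing the model from the perspectives of image processing, channel structure, and lightweight design, using Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies plus a unified detail-aware detection head to improve small object detection.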

Abstract: Small object detection in unmanned aerial vehicle (UAV) imagery is challenging, mainly due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness. To address this issue, we propose CollabOD, a lightweight collaborative detection framework that explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion. The framework integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies. From the perspectives of image processing, channel structure, and lightweight design, it optimizes the architecture of conventional UAV perception models. The proposed design enhances representation stability while maintaining efficient inference. A unified detail-aware detection head further improves regression robustness without introducing additional deployment overhead. The code is available at: https://github.com/Bai-Xuecheng/CollabOD.


[45] Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D cs.CVPDF

Ping Chen, Zezhou Chen, Xingpeng Zhang, Yanlin Qian, Huan Hu

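TL;DR: This paper proposes a new paradigm for 2D-to-3D conversion, Artistic Disparity Synthesis, which goes beyond conventional geometric accuracy and aims to replicate the immersive, emotionally resonant experience of professional 3D cinema. The authors propose the Art3D framework, which uses a dual-path architecture to decouple global depth parameters from local artistic effects and learns from professional 3D film data via indirect supervision.

Details

Motivation: Existing 2D-to-3D methods are geometrically accurate but artistically deficient and cannot replicate the immersion and emotional resonance of professional 3D cinema, because they mistake deliberate artistic choices (such as zero-plane shifts and local depth sculpting) for data noise or ambiguity.

Result: Experiments show the approach has potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying groundwork for artistically driven conversion tools.

Insight: The novelty is shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis, with a dual-path architecture that decouples global and local artistic intent. Viewed objectively, indirect supervision from film data and a preliminary evaluation method that quantifies cinematic alignment offer a new direction for artistic 3D content generation.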

Abstract: Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because geometric reconstruction paradigms mistake deliberate artistic intent, such as strategic zero-plane shifts for pop-out effects and local depth sculpting, for data noise or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.


[46] Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image cs.CVPDF

Zidian Qiu, Ancong Wu

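TL;DR: This paper proposes Pano3DComposer, an efficient feed-forward framework for compositional 3D scene generation from a single panoramic image. A plug-and-play Object-World Transformation Predictor converts objects generated by off-the-shelf image-to-3D models from local to world coordinates, decoupling object generation from layout estimation. For inputs from unseen domains, a coarse-to-fine alignment mechanism iteratively refines geometric consistency. The method achieves superior geometric accuracy on synthetic and real-world datasets and generates a high-fidelity 3D scene in approximately 20 seconds.

Details

Motivation: Current compositional image-to-3D scene generation methods use time-consuming iterative layout optimization or inflexible joint object-layout generation, and most rely on limited field-of-view perspective images, which hinders the creation of complete 360-degree environments.

Result: The method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets and can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU.

Insight: The main contributions are: 1) a feed-forward framework for panoramic image-to-3D scene generation that improves efficiency; 2) a plug-and-play Object-World Transformation Predictor, built on the adapted Alignment-VGGT architecture and trained with pseudo-geometric supervision, that decouples object generation from layout; 3) a coarse-to-fine alignment mechanism that uses iterative feedback to refine geometric consistency for inputs from unseen domains.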

Abstract: Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360-degree environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback of scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. Project page: https://qiuzidian.github.io/pano3dcomposer-page/.
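
Converting an object from local to world coordinates amounts to applying a predicted transformation to its vertices. The sketch below assumes a similarity transform made of a uniform scale, a rotation about the z axis (yaw), and a translation; this parameterization and the name `local_to_world` are illustrative assumptions, since the abstract does not state how the predicted transformation is represented.

```python
import math


def local_to_world(points, scale, yaw, translation):
    """Apply x_world = scale * R_z(yaw) @ x_local + translation to each point.

    points: iterable of (x, y, z) tuples in the object's local frame.
    yaw: rotation about the z axis in radians (z-up is an assumption).
    """
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    world = []
    for x, y, z in points:
        # Rotate in the xy-plane, then scale, then translate.
        xr = c * x - s * y
        yr = s * x + c * y
        world.append((scale * xr + tx, scale * yr + ty, scale * z + tz))
    return world


placed = local_to_world(
    [(1.0, 0.0, 0.0)],
    scale=2.0,
    yaw=math.pi / 2,
    translation=(1.0, 1.0, 0.0),
)
```

Here the unit point on the local x axis is rotated onto the y axis, doubled in size, and shifted, so it lands near (1, 3, 0) in world coordinates, up to floating-point rounding.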


[47] CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning cs.CV | cs.AIPDF

Yuxin Xie, Yuming Chen, Yishan Yang, Yi Zhou, Tao Zhou

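TL;DR: This paper proposes CORE-Seg, an end-to-end framework for reasoning-driven segmentation of complex medical lesions, trained with reinforcement learning. It combines the common sense of multimodal large language models with specialized visual reasoning, and introduces ComLesion-14K, the first diverse chain-of-thought benchmark for this task.

Details

Motivation: Medical image segmentation is shifting from conventional visual pattern matching to cognitive reasoning analysis, but existing general MLLMs lack the specialized visual reasoning required for complex lesions, while traditional segmentation models lack logical interpretability.

Result: On the proposed ComLesion-14K benchmark, the method achieves SOTA results with a mean Dice of 37.06%, 14.89% higher than the second-best baseline, and reduces the failure rate to 18.42%.

Insight: The novelty is end-to-end integration of reasoning and segmentation through a Semantic-Guided Prompt Adapter, a progressive training strategy from SFT to GRPO, and an adaptive dual-granularity reward mechanism that mitigates reward sparsity. This offers a new way to combine high-level reasoning with low-level pixel segmentation.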

Abstract: Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: https://xyxl024.github.io/CORE-Seg.github.io/
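
For readers unfamiliar with the reported metric, the Dice coefficient of two binary masks is twice their overlap divided by the sum of their sizes, and mean Dice averages this over cases. The sketch below uses flattened 0/1 masks; treating two empty masks as a perfect match is a convention assumed here, and the "failure rate" mentioned in the abstract is not defined there, so it is not computed.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pairs):
    """Average Dice over a list of (pred, target) mask pairs."""
    return sum(dice(p, t) for p, t in pairs) / len(pairs)


half_overlap = dice([1, 1, 0, 0], [1, 0, 1, 0])
perfect = dice([0, 1, 1, 0], [0, 1, 1, 0])
average = mean_dice([
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([0, 1, 1, 0], [0, 1, 1, 0]),
])
```

The first toy pair shares one of two foreground pixels on each side, giving a Dice of 0.5, while the second pair matches exactly and scores 1.0.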


[48] Towards Driver Behavior Understanding: Weakly-Supervised Risk Perception in Driving Scenes cs.CVPDF

Nakul Agarwal, Yi-Ting Chen, Behzad Dariush

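TL;DR: This paper introduces RAID, a large-scale dataset curated for research on driver risk perception and contextual risk assessment, comprising 4,691 annotated video clips that cover diverse traffic scenarios. Building on the dataset, the authors propose a weakly supervised risk object identification framework that models the relationship between the driver's intended maneuver and responses to identify potential risk sources, and they analyze the role of pedestrian attention in risk estimation. Experiments show gains of 20.6% and 23.1% over prior state-of-the-art methods on the RAID and HDDS datasets, respectively.

Details

Motivation: Zero-collision mobility is a key objective for intelligent vehicle systems, and it requires understanding driver risk perception, a complex cognitive process shaped by the driver's voluntary response to external stimuli and by the attentiveness of surrounding road users towards the ego-vehicle. Advancing this area requires dedicated datasets and methods.

Result: Experimental evaluations on the RAID and HDDS datasets show that the proposed method achieves performance gains of 20.6% and 23.1%, respectively, over prior state-of-the-art approaches.

Insight: The contributions are: 1) RAID, a large-scale dataset with multi-dimensional annotations dedicated to driver risk perception; 2) a weakly supervised risk object identification framework that identifies risk sources by relating driver intent to driver responses, avoiding the need for dense annotation; 3) the first systematic annotation and analysis in a dataset of how pedestrian attention affects risk assessment, offering a new view of interaction risk.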

Abstract: Achieving zero-collision mobility remains a key objective for intelligent vehicle systems, which requires understanding driver risk perception, a complex cognitive process shaped by the voluntary response of the driver to external stimuli and the attentiveness of surrounding road users towards the ego-vehicle. To support progress in this area, we introduce RAID (Risk Assessment In Driving scenes), a large-scale dataset specifically curated for research on driver risk perception and contextual risk assessment. RAID comprises 4,691 annotated video clips, covering diverse traffic scenarios with labels for the driver’s intended maneuver, road topology, risk situations (e.g., crossing pedestrians), driver responses, and pedestrian attentiveness. Leveraging RAID, we propose a weakly supervised risk object identification framework that models the relationship between the driver’s intended maneuver and responses to identify potential risk sources. Additionally, we analyze the role of pedestrian attention in estimating risk and demonstrate the value of the proposed dataset. Experimental evaluations demonstrate that our method achieves 20.6% and 23.1% performance gains over prior state-of-the-art approaches on the RAID and HDDS datasets, respectively.


[49] Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation cs.CV

Hongwei Fang, Jiahang Cai, Xun Wang, Wenwu Yang

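TL;DR: This paper proposes TAR-ViTPose, a Temporal Aggregate-and-Restore Vision Transformer for 2D human pose estimation in video. In a plug-and-play manner, it aggregates temporal cues across frames to enhance static ViT representations, yielding more robust and accurate pose estimates.

Details

Motivation: Existing ViT-based pose estimators are designed for static images and process each frame independently, ignoring the temporal coherence of video sequences. This leads to unstable predictions in challenging scenes with motion blur, occlusion, or defocus.

Result: On the PoseTrack2017 benchmark, TAR-ViTPose gains +2.3 mAP over the single-frame baseline ViTPose and outperforms existing state-of-the-art video-based pose estimation methods, while also achieving a higher runtime frame rate in real-world applications.

Insight: The novelty lies in two modules: joint-centric temporal aggregation (JTA), which assigns each joint a learnable query token that selectively attends to the corresponding regions in neighboring frames, and global restoring attention (GRA), which restores the aggregated temporal features into the token sequence of the current frame, enriching the pose representation while fully preserving the global context needed for precise keypoint localization.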

Abstract: Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Project page: https://github.com/zgspose/TARViTPose.


[50] FTSplat: Feed-forward Triangle Splatting Network cs.CV | cs.RO

Xiong Jinlin, Li Can, Shen Jiawei, Qi Zhigang, Sun Lei

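TL;DR: This paper proposes FTSplat, a feed-forward triangle splatting network that directly predicts continuous triangle surfaces from multi-view images, enabling real-time 3D reconstruction without per-scene optimization.

Details

Motivation: Existing NeRF and 3DGS methods rely on time-consuming per-scene optimization, while feed-forward Gaussian splatting methods lack an explicit geometric representation and are hard to use directly in simulation. The paper aims to build a real-time reconstruction framework with explicit manifold geometry.

Result: Experiments show that the method reconstructs efficiently while remaining seamlessly compatible with standard graphics and robotic simulators.

Insight: The contributions include a pixel-aligned triangle generation module and relative 3D point cloud supervision, which improve the stability and consistency of geometric learning and directly produce simulation-ready triangle-primitive models.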

Abstract: High-fidelity three-dimensional (3D) reconstruction is essential for robotics and simulation. While Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) achieve impressive rendering quality, their reliance on time-consuming per-scene optimization limits real-time deployment. Emerging feed-forward Gaussian splatting methods improve efficiency but often lack explicit, manifold geometry required for direct simulation. To address these limitations, we propose a feed-forward framework for triangle primitive generation that directly predicts continuous triangle surfaces from calibrated multi-view images. Our method produces simulation-ready models in a single forward pass, obviating the need for per-scene optimization or post-processing. We introduce a pixel-aligned triangle generation module and incorporate relative 3D point cloud supervision to enhance geometric learning stability and consistency. Experiments demonstrate that our method achieves efficient reconstruction while maintaining seamless compatibility with standard graphics and robotic simulators.


[51] OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving cs.CV

Kota Shimomura, Masaki Nambata, Atsuya Ishikawa, Ryota Mimura, Takayuki Kawabuchi

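TL;DR: This paper proposes OD-RASE, a framework that builds an ontology of road traffic systems and combines a large-scale visual language model (LVLM) with a diffusion model to automatically detect accident-causing road structures and generate infrastructure improvement proposals, with the goal of improving the safety of autonomous driving systems.

Details

Motivation: Autonomous driving systems remain limited when handling rare situations or complex road structures. Current road infrastructure is designed mainly for human drivers, and safety improvements are mostly reactive, which does not meet the need of autonomous systems for proactive risk mitigation.

Result: Experiments show that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. The work also constructs a new dataset and validates the effectiveness of the method.

Insight: The novelty lies in combining a domain-knowledge ontology with an LVLM and a diffusion model to automate the path from accident data to infrastructure improvement, providing a forward-looking risk assessment framework for autonomous driving safety.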

Abstract: Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Road infrastructure is designed for human drivers, and safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce a baseline approach (the OD-RASE model), which leverages an LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.


[52] LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution cs.CV

Song Fei, Tian Ye, Sixiang Chen, Zhaohu Xing, Jianyu Lai

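TL;DR: This paper proposes LucidNFT, a framework that introduces LR-anchored multi-reward preference optimization to address the difficulty of balancing faithfulness to the LR evidence against perceptual quality in generative real-world image super-resolution.

Details

Motivation: Generative real-world super-resolution is prone to semantic and structural hallucination and lacks a degradation-robust, LR-referenced faithfulness signal. In addition, multi-reward scalarization causes advantage collapse, limiting the effectiveness of preference-based reinforcement learning.

Result: Experiments show that LucidNFT consistently improves flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs across diverse real-world scenarios.

Insight: The contributions are: LucidConsistency, a degradation-robust semantic evaluator that enables LR-anchored faithfulness optimization; a decoupled advantage normalization strategy that prevents advantage collapse when multiple rewards are fused; and LucidLR, a large-scale dataset of real degraded images that supports robust RL fine-tuning.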

Abstract: Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence (semantic and structural hallucination), while such LR-anchored faithfulness is difficult to assess without HR ground truth. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidates to compare. However, effective alignment in Real-ISR is hindered by (i) the lack of a degradation-robust LR-referenced faithfulness signal, and (ii) a rollout-group optimization bottleneck where naive multi-reward scalarization followed by normalization compresses objective-wise contrasts, causing advantage collapse and weakening the reward-weighted updates in DiffusionNFT-style forward fine-tuning. Moreover, (iii) limited coverage of real degradations restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable; a decoupled advantage normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse; and LucidLR, a large-scale collection of real-world degraded images to support robust RL fine-tuning. Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
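
The contrast between naive scalarization and decoupled normalization described in the abstract can be illustrated with a small sketch. It assumes one plausible reading: each reward is z-scored within the rollout group before a weighted sum, rather than summing raw rewards and normalizing once. The function names, the reward names, the weights, and the toy numbers are all hypothetical; the paper's actual formulation may differ.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize a list of rewards within one rollout group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def scalarize_then_normalize(rewards, weights):
    """Naive baseline: weighted sum of raw rewards, then one normalization."""
    n = len(next(iter(rewards.values())))
    fused = [sum(weights[k] * rewards[k][i] for k in rewards) for i in range(n)]
    return zscore(fused)


def decoupled_normalize(rewards, weights):
    """Assumed decoupled variant: normalize each reward first, then fuse."""
    normed = {k: zscore(v) for k, v in rewards.items()}
    n = len(next(iter(rewards.values())))
    return [sum(weights[k] * normed[k][i] for k in rewards) for i in range(n)]


# Toy rollout group of four candidates; the two rewards live on very
# different scales, so the raw sum is dominated by "perceptual".
group = {
    "perceptual": [10.0, 20.0, 30.0, 40.0],
    "faithfulness": [0.9, 0.1, 0.8, 0.2],
}
w = {"perceptual": 1.0, "faithfulness": 1.0}

naive = scalarize_then_normalize(group, w)
decoupled = decoupled_normalize(group, w)
print(naive.index(max(naive)))          # 3: ranking follows "perceptual" only
print(decoupled.index(max(decoupled)))  # 2: faithfulness now influences the ranking
```

In the naive variant the small-scale reward barely changes the ordering, which is one way the objective-wise contrast can be lost; normalizing per reward keeps both signals on a comparable scale before fusion.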


[53] Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models cs.CV | cs.AI

Jialuo He, Huangxun Chen

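TL;DR: This paper proposes E-AdaPrune, an energy-driven adaptive visual token pruning framework for accelerating vision-language models. It determines the token budget for each input image by analyzing the singular value spectrum of the visual feature space and preserving a set proportion of spectral energy, so that information-dense scenes receive more tokens and redundant ones are compressed aggressively, without additional learnable parameters.

Details

Motivation: Existing visual token reduction methods typically apply a fixed token budget to all inputs, overlooking the large variation in image information density, which limits acceleration efficiency. The paper aims to allocate compute adaptively according to image content.

Result: E-AdaPrune is evaluated on nine benchmarks with three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, it consistently yields an average improvement of up to 0.6%, including a +5.1% relative gain on the MMVet reasoning task. With randomized singular value decomposition, the added latency is limited to 8 ms per image.

Insight: The novelty is using the energy of the singular value spectrum as a proxy for image information density, giving fully adaptive token pruning with no trained parameters. This offers a lightweight, low-latency approach to model acceleration that preserves performance (and improves it on some tasks). The core idea is to allocate the compute budget dynamically to the images that actually need it.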

Abstract: Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
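
The budget rule in the abstract can be sketched as follows, assuming the singular values of the visual feature matrix have already been computed (for example by a randomized SVD) and that "spectral energy" means the sum of squared singular values. Both points are assumptions, as are the function name, the 0.9 default ratio, and the toy spectra; the abstract does not specify them.

```python
def energy_token_budget(singular_values, energy_ratio=0.9, min_tokens=1):
    """Smallest k whose top-k singular values keep `energy_ratio` of the energy.

    Energy is taken to be the squared singular value (an assumption).
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return min_tokens
    cumulative = 0.0
    for k, energy in enumerate(energies, start=1):
        cumulative += energy
        if cumulative >= energy_ratio * total:
            return max(k, min_tokens)
    return len(energies)


# A slowly decaying spectrum (information-dense image) needs more tokens
# than a sharply peaked one (redundant image) at the same energy ratio.
dense_spectrum = [4.0, 3.0, 2.0, 1.0]
redundant_spectrum = [10.0, 0.1, 0.1, 0.1]
print(energy_token_budget(dense_spectrum))      # 3
print(energy_token_budget(redundant_spectrum))  # 1
```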


[54] Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation cs.CV

Hongli Liu, Yu Wang, Shengjie Zhao

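TL;DR: This paper proposes VINE (View-Informed NEtwork), a framework that refines class-specific prototypes for few-shot segmentation by jointly modeling structural consistency and foreground discrimination. It builds a spatial-view graph to propagate view-invariant structural semantics, derives a discriminative prior from the support-query feature discrepancy to enhance foreground regions, and integrates the features through masked cross-attention to produce consistent prototypes, which serve as adaptive prompts for the SAM decoder to generate accurate segmentation masks.

Details

Motivation: To address structural misalignment and cross-view inconsistency in few-shot segmentation under large appearance or viewpoint variations.

Result: Extensive experiments on multiple few-shot segmentation benchmarks validate the effectiveness and robustness of VINE in challenging scenarios such as viewpoint shifts and complex structures.

Insight: The contributions include a spatial-view graph that jointly models local geometric topology and cross-view features to propagate structural semantics, and a discriminative prior derived from the support-query feature discrepancy that reweights SAM features and recalibrates backbone activations, strengthening foreground discrimination and structural focus. Progressively integrating the features through masked cross-attention yields class-consistent prototypes that act as adaptive prompts and improve segmentation accuracy.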

Abstract: Few-shot segmentation (FSS) has gained significant attention for its ability to generalize to novel classes with limited supervision, yet remains challenged by structural misalignment and cross-view inconsistency under large appearance or viewpoint variations. This paper tackles these challenges by introducing VINE (View-Informed NEtwork), a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes. Specifically, VINE introduces a spatial-view graph on backbone features, where the spatial graph captures local geometric topology and the view graph connects features from different perspectives to propagate view-invariant structural semantics. To further alleviate foreground ambiguity, we derive a discriminative prior from the support-query feature discrepancy to capture category-specific contrast, which reweights SAM features by emphasizing salient regions and recalibrates backbone activations for improved structural focus. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes used as adaptive prompts for the SAM decoder to generate accurate masks. Extensive experiments on multiple FSS benchmarks validate the effectiveness and robustness of VINE, particularly under challenging scenarios with viewpoint shifts and complex structures. The code is available at https://github.com/HongliLiu1/VINE-main.


[55] OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer cs.CV

Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng

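TL;DR: OVGGT is a Transformer framework for 3D geometry reconstruction from streaming video. Using Self-Selective Caching and Dynamic Anchor Protection, it keeps compute and memory cost constant, supports processing of arbitrarily long video streams, and achieves SOTA accuracy on multiple benchmarks.

Details

Motivation: Existing geometric foundation models are confined to short, offline sequences by the quadratic cost of full attention. Causal-attention variants support streaming, but their ever-growing KV cache exhausts memory and rules out long-sequence deployment.

Result: On indoor, outdoor, and ultra-long sequence benchmarks, OVGGT processes arbitrarily long videos within a constant VRAM budget and achieves state-of-the-art 3D geometric accuracy.

Insight: The contributions are: 1) Self-Selective Caching, which compresses the KV cache using FFN residual magnitudes and remains compatible with FlashAttention; 2) Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction and suppresses geometric drift over long trajectories. Constant-cost streaming inference is achieved without any training.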

Abstract: Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
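
A fixed-budget cache with protected entries, in the spirit of the two mechanisms named in the abstract, might look like the sketch below. It is only an illustration: the class name, the scalar `score` standing in for an FFN residual magnitude, the `is_anchor` flag, and the fallback when every entry is an anchor are all hypothetical and are not the paper's implementation.

```python
class BoundedKVCache:
    """Keeps at most `budget` entries; evicts the lowest-scoring non-anchor."""

    def __init__(self, budget):
        self.budget = budget
        self.entries = []  # each entry: {"token_id", "score", "is_anchor"}

    def add(self, token_id, score, is_anchor=False):
        self.entries.append(
            {"token_id": token_id, "score": score, "is_anchor": is_anchor}
        )
        self._evict()

    def _evict(self):
        while len(self.entries) > self.budget:
            candidates = [e for e in self.entries if not e["is_anchor"]]
            if not candidates:
                # Hypothetical fallback: if only anchors remain, drop the oldest.
                self.entries.pop(0)
                continue
            victim = min(candidates, key=lambda e: e["score"])
            self.entries = [e for e in self.entries if e is not victim]

    def token_ids(self):
        return [e["token_id"] for e in self.entries]


cache = BoundedKVCache(budget=3)
cache.add(0, score=0.1, is_anchor=True)  # protected despite its low score
for token_id, score in [(1, 0.5), (2, 0.9), (3, 0.2), (4, 0.7)]:
    cache.add(token_id, score)
print(cache.token_ids())  # [0, 2, 4]
```

The cache size never exceeds the budget regardless of how many tokens stream in, which is the property that bounds memory; the anchor flag shows how selected entries can be exempted from eviction.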


[56] Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models cs.CV | cs.AI

Siyuan Yang, Jun Liu, Hao Cheng, Chong Wang, Shijian Lu

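TL;DR: This paper proposes Skeleton-to-Image Encoding (S2I), a new representation that converts 3D human skeleton sequences into an image format partitioned by body-part semantics. For the first time, this enables self-supervised skeleton representation learning with large-scale vision-pretrained models and transfers visual knowledge to skeleton analysis.

Details

Motivation: Large-scale vision-pretrained models are hard to apply directly to 3D skeleton data. The paper also addresses the scarcity and heterogeneity of skeleton data and the need for additional model branches in multi-modal action recognition.

Result: Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of the method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.

Insight: The core novelty is encoding skeleton sequences into a unified image format, which both exploits powerful vision-pretrained models and naturally accommodates heterogeneous skeleton structures from different data sources, giving a general and efficient framework for skeleton representation learning.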

Abstract: Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.
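
One way to picture the encoding is the sketch below. It assumes rows are joints ordered by body-part group, columns are frames, and the three channels are min-max normalized (x, y, z) coordinates, followed by a nearest-neighbor resize. That layout, the function name, and the toy four-joint skeleton are assumptions for illustration; the abstract only states that joints are partitioned and arranged by body-part semantics and resized to standard image dimensions.

```python
def skeleton_to_image(frames, part_groups, out_h, out_w):
    """Map a skeleton sequence to an out_h x out_w grid of (r, g, b) pixels.

    frames: list of frames, each a list of (x, y, z) tuples indexed by joint.
    part_groups: lists of joint indices, one list per body part (hypothetical).
    """
    joint_order = [j for group in part_groups for j in group]
    # Rows follow the body-part ordering; columns follow time.
    grid = [[frames[t][j] for t in range(len(frames))] for j in joint_order]
    flat = [c for row in grid for px in row for c in px]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    norm = [
        [tuple(round(255 * (c - lo) / span) for c in px) for px in row]
        for row in grid
    ]
    h, w = len(norm), len(norm[0])
    # Nearest-neighbor resize to the target image size.
    return [
        [norm[r * h // out_h][c * w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy sequence: 2 frames, 4 joints; joint 3 forms one "part", joints 0-2 another.
toy_frames = [
    [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3)],
    [(0, 0, 0), (1, 1, 1), (2, 2, 2), (4, 4, 4)],
]
image = skeleton_to_image(toy_frames, [[3], [0, 1, 2]], out_h=8, out_w=4)
print(len(image), len(image[0]))  # 8 4
```

Because the output has a fixed image shape regardless of joint count or sequence length, skeletons with different joint layouts can share one input format, which is the property the abstract attributes to S2I.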


[57] CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection cs.CV

Jinyeong Park, Donghwa Kim, Brent ByungHoon Kang, Hyeongboo Baek, Jibum Kim

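TL;DR: This paper proposes CR-QAT (curriculum relational quantization-aware training), an integrated framework that addresses the severe performance degradation of open-vocabulary object detection (OVOD) models under extreme low-bit (e.g., 4-bit) quantization. It combines stage-by-stage curriculum QAT (CQAT) with text-centric relational knowledge distillation (TRKD) to stabilize optimization and preserve fine-grained vision-language alignment and inter-region relational structure.

Details

Motivation: OVOD models are typically large and hard to deploy on resource-constrained devices. Quantization is a practical compression method, but the authors find that naive extreme low-bit quantization severely degrades fine-grained vision-language alignment and distorts inter-region relational structures, so a more robust quantization training method is needed.

Result: On the LVIS and COCO zero-shot benchmarks, CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, with relative average precision (AP) improvements of up to 38.9% and 40.9%, respectively.

Insight: The main novelty is decomposing quantization into a staged curriculum (CQAT) to isolate and mitigate error accumulation, and introducing TRKD, which builds text-anchored pairwise similarity matrices to transfer the multi-dimensional relational knowledge of the teacher to the student. This compresses the model while preserving the semantic alignment that open-vocabulary detection depends on.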

Abstract: Open-vocabulary object detection (OVOD) enables novel category detection via vision-language alignment, but massive model sizes hinder deployment on resource-constrained devices. While quantization offers practical compression, we reveal that naive extreme low-bit (e.g., 4-bit) quantization severely degrades fine-grained vision-language alignment and distorts inter-region relational structures. To address this, we propose curriculum relational quantization-aware training (CR-QAT), an integrated framework combining stage-by-stage optimization with relational knowledge distillation. Within CR-QAT, curriculum QAT (CQAT) mitigates error accumulation by partitioning the model for progressive quantization, ensuring stable optimization via error isolation. Concurrently, text-centric relational KD (TRKD) is applied to task-relevant modules. By constructing text-anchored pairwise similarity matrices, TRKD comprehensively transfers the teacher’s multi-dimensional relational knowledge. Experiments on LVIS and COCO zero-shot benchmarks demonstrate that CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, achieving relative AP improvements of up to 38.9% and 40.9%, respectively.
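
To give a sense of why 4-bit quantization is aggressive, the sketch below applies generic symmetric uniform "fake quantization" (quantize, then dequantize) to a few weights. This is a standard building block of quantization-aware training in general and is not the paper's CQAT or TRKD procedure; the function name and sample weights are illustrative assumptions.

```python
def fake_quantize(values, num_bits=4):
    """Symmetric uniform quantize-dequantize with a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (num_bits - 1))    # -8 for 4 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax
    out = []
    for v in values:
        q = round(v / scale)          # nearest integer level
        q = max(qmin, min(qmax, q))   # clamp to the representable range
        out.append(q * scale)         # map back to floating point
    return out


weights = [-1.0, -0.33, 0.0, 0.05, 0.5, 1.0]
q4 = fake_quantize(weights, num_bits=4)
q8 = fake_quantize(weights, num_bits=8)
err4 = max(abs(a - b) for a, b in zip(weights, q4))
err8 = max(abs(a - b) for a, b in zip(weights, q8))
print(err4 > err8)  # True: 4-bit rounding error is much larger than 8-bit
```

With only 16 levels, the worst-case rounding error is half a quantization step, which here is roughly 7% of the largest weight; small weights such as 0.05 collapse to zero.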


[58] Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions cs.CV

Jingtao Ye, Kexin Zhang, Xunchi Ma, Yuehan Li, Guangming Zhu

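TL;DR: This paper presents DynUAV, a new benchmark dataset for multi-object tracking (MOT) from the UAV perspective. It targets the failure of existing benchmarks to reflect real-world challenges because they lack intense ego-motion and complex apparent trajectories. The dataset contains 42 video sequences with over 1.7 million bounding box annotations covering vehicles, pedestrians, and construction machinery.

Details

Motivation: Existing UAV-perspective MOT benchmarks generally assume smooth camera motion and linear target motion, so they cannot adequately capture the challenges caused by rapid UAV maneuvers, including intense ego-motion, scale changes, viewpoint changes, and motion blur. A new, more demanding benchmark is needed to advance the field.

Result: A comprehensive evaluation of state-of-the-art trackers on DynUAV reveals clear limitations in handling the intertwined challenges of detection and association under such dynamic conditions, establishing DynUAV as a rigorous benchmark.

Insight: The novelty is in explicitly challenging the smooth-motion assumption common in MOT and building a benchmark centered on intense ego-motion. Its annotation categories include specialized industrial objects (e.g., excavators and cranes), providing a key testbed for evaluating and developing more robust UAV MOT algorithms for complex and adverse real-world conditions.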

Abstract: The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at link.


[59] DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model cs.CV | cs.CL

Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin

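TL;DR: DeepSight is presented as the first MLLM dedicated to depth information. It builds a depth image-text pair dataset and a depth instruction dataset, and modifies the ViT encoder in CLIP to better capture continuous depth variations, substantially improving 3D scene understanding.

Details

Motivation: Existing MLLMs fall short in interpreting the depth information in visual data. The paper aims to address inaccurate depth perception in order to improve 3D scene understanding.

Result: On a depth question-answering benchmark built from existing depth image datasets, DeepSight significantly improves depth perception and downstream task performance, a substantial step forward in 3D understanding.

Insight: The novelty lies in building the first dedicated depth MLLM, exploiting the single-channel grayscale nature of depth maps to improve spatial reasoning, and capturing continuous depth variations through a generated depth instruction dataset and a modified encoder.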

Abstract: Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach takes advantage of the unique characteristics of depth images (single-channel grayscale images whose pixel values directly reflect depth cues) to improve spatial reasoning. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate the performance of our model, we develop a comprehensive depth question-answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.


[60] MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs cs.CV | cs.AI

Zhi Lei, Chenxi Liu, Hao Miao, Wanghui Qiu, Bin Yang

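TL;DR: This paper proposes MM-ISTS, a multimodal framework augmented by vision-text large language models for forecasting irregularly sampled time series. Through a two-stage encoding mechanism, it fuses time series, visual images, and text to capture complex temporal patterns and contextual semantics.

Details

Motivation: Irregularly sampled time series are common in the real world, but existing methods forecast from historical observations alone and struggle to learn contextual semantics and fine-grained temporal patterns.

Result: Extensive experiments on real data demonstrate the effectiveness of the proposed solution, but the abstract does not mention specific benchmarks or quantitative comparisons with SOTA.

Insight: The contributions are: 1) using multimodal LLMs to automatically generate informative visual images and text to improve understanding; 2) an adaptive query-based feature extractor that compresses MLLM tokens to reduce computational cost; 3) a multimodal alignment module with modality-aware gating to reduce the modality gap.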

Abstract: Irregularly sampled time series (ISTS) are widespread in real-world scenarios, exhibiting asynchronous observations on uneven time intervals across variables. Existing ISTS forecasting methods often solely utilize historical observations to predict future ones while falling short in learning contextual semantics and fine-grained temporal patterns. To address these problems, we propose MM-ISTS, a multimodal framework augmented by vision-text large language models that bridges temporal, visual, and textual modalities to facilitate ISTS forecasting. MM-ISTS encompasses a novel two-stage encoding mechanism. In particular, a cross-modal vision-text encoding module is proposed to automatically generate informative visual images and textual data, enabling the capture of intricate temporal patterns and comprehensive contextual understanding, in collaboration with multimodal LLMs (MLLMs). In parallel, ISTS encoding extracts complementary yet enriched temporal features from historical ISTS observations, including multi-view embedding fusion and a temporal-variable encoder. Further, we propose an adaptive query-based feature extractor to compress the learned tokens of MLLMs, retaining a compact set of useful knowledge, which in turn reduces computational costs. In addition, a multimodal alignment module with modality-aware gating is designed to alleviate the modality gap across ISTS, images, and text. Extensive experiments on real data offer insight into the effectiveness of the proposed solutions.


[61] Demystifying KAN for Vision Tasks: The RepKAN Approach cs.CV | cs.AI

Minjong Cheon

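TL;DR: This paper proposes RepKAN, a new architecture for remote sensing image classification. It combines the structural efficiency of convolutional neural networks (CNNs) with the non-linear representational power of Kolmogorov-Arnold networks (KANs). A dual-path design (Spatial Linear and Spectral Non-linear) lets it autonomously discover class-specific spectral fingerprints and physical interaction manifolds. Experiments show that on the EuroSAT and NWPU-RESISC45 datasets, RepKAN outperforms existing state-of-the-art models while providing explicit, physically interpretable reasoning.

Details

Motivation: To address the fact that standard CNN and Transformer models act as uninterpretable black boxes in remote sensing image classification, by developing an architecture that is both high-performing and physically interpretable.

Result: Experiments on the EuroSAT and NWPU-RESISC45 datasets show that RepKAN outperforms current state-of-the-art (SOTA) models while providing explicit physical interpretability.

Insight: The main novelty is integrating the non-linear representational power of KANs with the structural efficiency of CNNs, using a dual-path design for autonomous feature discovery and physical interpretability. This offers a promising backbone architecture for future interpretable visual foundation models.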

Abstract: Remote sensing image classification is essential for Earth observation, yet standard CNNs and Transformers often function as uninterpretable black boxes. We propose RepKAN, a novel architecture that integrates the structural efficiency of CNNs with the non-linear representational power of KANs. By utilizing a dual-path design (Spatial Linear and Spectral Non-linear), RepKAN enables the autonomous discovery of class-specific spectral fingerprints and physical interaction manifolds. Experimental results on the EuroSAT and NWPU-RESISC45 datasets demonstrate that RepKAN provides explicit physically interpretable reasoning while outperforming state-of-the-art models. These findings indicate that RepKAN holds significant potential to serve as the backbone for future interpretable visual foundation models.


[62] EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation cs.CV

Shiyuan Yang, Ruihuang Li, Jiale Tao, Shuai Shao, Qinglin Lu

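TL;DR: EffectMaker is a unified reasoning-generation framework for reference-based customized visual effect (VFX) creation. A multimodal large language model interprets high-level semantics and reasons about how they should adapt to the target subject, while a diffusion transformer uses in-context learning to capture fine-grained visual cues from reference videos, with no per-effect fine-tuning.

Details

Motivation: High-quality VFX production depends on expert knowledge and costly pipelines, and existing AIGC systems scale and generalize poorly because effect data are scarce, supernatural or stylized effects are hard to model, and per-effect fine-tuning is usually required.

Result: Using EffectData, a newly constructed synthetic dataset described as the largest high-quality one of its kind (130k videos across 3k VFX categories), experiments show that EffectMaker outperforms state-of-the-art baselines in visual quality and effect consistency.

Insight: The novelty is a semantic-visual dual-path guidance mechanism that unifies high-level semantic reasoning with fine-grained visual cue capture, enabling accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. A large-scale synthetic dataset is also built to improve generalization and scalability.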

Abstract: Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning-generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic-visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, the largest high-quality synthetic dataset containing 130k videos across 3k VFX categories, to improve generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Project page: https://effectmaker.github.io


[63] MOSIV: Multi-Object System Identification from Videos cs.CV

Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen

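TL;DR: This paper proposes MOSIV, a framework for multi-object system identification from videos that uses a differentiable simulator to directly optimize continuous material parameters for each object, and introduces a synthetic benchmark with contact-rich interactions for evaluation.

Details

Motivation: To overcome the limitations of existing methods, which focus on single-object scenes or on discrete classification over a fixed set of material prototypes, and to tackle the challenging problem of multi-object system identification.

Result: On the proposed synthetic benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines, making it a strong baseline for the new task.

Insight: The novelty is optimizing continuous material parameters with a differentiable simulator guided by geometry-aligned objectives. The key finding is that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in complex multi-object scenes.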

Abstract: We introduce the challenging problem of multi-object system identification from videos, for which prior methods are ill-suited due to their focus on single-object scenes or discrete material classification with a fixed set of material prototypes. To address this, we propose MOSIV, a new framework that directly optimizes for continuous, per-object material parameters using a differentiable simulator guided by geometric objectives derived from video. We also present a new synthetic benchmark with contact-rich, multi-object interactions to facilitate evaluation. On this benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines, establishing it as a strong baseline for this new task. Our analysis shows that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in these complex, multi-object settings. The source code and dataset will be released.


[64] StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision cs.CV

Yuanhuiyi Lyu, Kaiyu Lei, Ziqiao Weng, Xu Zheng, Lutao Jiang

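TL;DR: This paper proposes StruVis, a framework that enhances reasoning-based text-to-image generation through thinking with structured vision. Rather than generating intermediate images, it uses text-based structured visual representations as intermediate reasoning states, so that a multimodal large language model can effectively perceive visual structure within a purely text-based reasoning process, unlocking its reasoning potential.

Details

Motivation: To address two problems in existing reasoning-based text-to-image methods: text-only reasoning lacks visual context and so omits critical spatial and visual elements, while text-image interleaved reasoning improves visual grounding but is computationally expensive and limited by the representational capacity of the generator.

Result: The method achieves significant gains on reasoning-based text-to-image benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.

Insight: The novelty is a generator-agnostic reasoning framework that uses text-based structured visual representations as intermediate states, avoiding the computational overhead of image generation while still providing visual structure, and thereby efficiently improving a range of text-to-image generators on complex reasoning tasks.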

Abstract: Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While this approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To this end, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual representations as intermediate reasoning states, thereby enabling the MLLM to effectively “perceive” visual structure within a purely text-based reasoning process. Powered by this, the reasoning potential for T2I generation of the MLLM is unlocked through structured-vision-guided reasoning. Additionally, as a generator-agnostic reasoning framework, our proposed StruVis can be seamlessly integrated with diverse T2I generators and efficiently enhance their performance in reasoning-based T2I generation. Extensive experiments demonstrate that StruVis achieves significant performance improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.


[65] Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking cs.CV

Chunjiang Li, Jianbo Ma, Li Shen, Yanru Chen, Liangyin Chen

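TL;DR: This paper proposes Occlusion-Aware SORT (OA-SORT), a plug-and-play, training-free framework that addresses the matching cost confusion caused by partial occlusion in multi-object tracking. It comprises three components: the Occlusion-Aware Module, the Occlusion-Aware Offset, and the Bias-Aware Momentum. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets validate its effectiveness and reusability.

Details

Motivation: To resolve the positional matching cost confusion caused by partial occlusion in 2D multi-object tracking and improve tracking robustness.

Result: On the DanceTrack test set, OA-SORT achieves 63.1% HOTA and 64.2% IDF1. Integrating the framework into four other trackers improves HOTA by 2.08% and IDF1 by 3.05% on average, demonstrating its reusability.

Insight: The novelty is a module dedicated to analyzing occlusion status, with that information used to mitigate matching confusion and suppress estimation instability. Being plug-and-play and training-free, it is easy to integrate into existing trackers and improves their handling of occluded scenes.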

Abstract: Multi-object tracking (MOT) involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to positional cost confusion arising from partial occlusion. To address this issue, we present the novel Occlusion-Aware SORT (OA-SORT) framework, a plug-and-play and training-free framework that includes the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). Specifically, OAM analyzes the occlusion status of objects, where a Gaussian Map (GM) is introduced to reduce background influence. OAO and BAM, in turn, leverage the OAM-described occlusion status to mitigate cost confusion and suppress estimation instability. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05%, respectively, demonstrating the reusability of the occlusion awareness.


[66] Ensemble Learning with Sparse Hypercolumns cs.CVPDF

Julia Dietlmeier, Vayangi Ganepola, Oluwabukola G. Adegboro, Mayug Maniparambil, Claudia Mazo

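TL;DR: This paper proposes an ensemble learning approach based on sparse hypercolumns for image segmentation. By applying stratified subsampling to VGG16-derived hypercolumns and combining them with ensemble strategies such as stacking and voting, it outperforms a standard multi-scale UNet baseline on a brain tumor dataset; in the low-shot regime (N ≤ 20), a Logistic Regression classifier performs best.

Details

Motivation: Hypercolumns, feature vectors inspired by biological vision, hold great promise for image segmentation but remain little studied, mainly because their computational complexity grows linearly with training-set size, which limits practical use. This paper aims to remove that bottleneck through hypercolumn sparsification and ensemble learning.

Result: On the brain tumor dataset, with a 10% stratified subsampling rate and N = 20, the best average Dice score reaches 0.66, a 24.53% improvement over the standard multi-scale UNet baseline (p-value = 3.07e-11, Wilcoxon signed-rank test); in the extreme low-shot case, the Logistic Regression classifier is the most effective.

Insight: The contributions are sparsifying hypercolumns through stratified subsampling to cut computational cost and examining how ensemble learning performs on sparse hypercolumns. Viewed objectively, the method effectively mitigates overfitting in low-shot settings and offers a practical optimization route for computationally intensive features.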

Abstract: Directly inspired by findings in biological vision, high-dimensional hypercolumns are feature vectors built by concatenating multi-scale activations of convolutional neural networks for a single image pixel location. Together with powerful classifiers, they can be used for image segmentation, i.e., pixel classification. In practice, however, very few works are dedicated to the use of hypercolumns. One reason is the computational complexity of processing concatenated dense hypercolumns, which grows linearly with the size $N$ of the training set. In this work, we address this challenge by applying stratified subsampling to the VGG16-based hypercolumns. Furthermore, we investigate the performance of ensemble learning on sparse hypercolumns. Our experiments on a brain tumor dataset show that stacking and voting ensembles deliver competitive performance, but in the extreme low-shot case of $N \leq 20$, a simple Logistic Regression classifier is the most effective method. For a 10% stratified subsampling rate, our best average Dice score is 0.66 for $N=20$. This is a statistically significant improvement of 24.53% over the standard multi-scale UNet baseline ($p$-value = $3.07 \times 10^{-11}$, Wilcoxon signed-rank test), which is less effective due to overfitting.
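
The stratified subsampling step and the Dice metric named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `stratified_subsample` and `dice`, the toy label list, and the rule of keeping at least one pixel per class are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, rate, seed=0):
    """Return sorted pixel indices, keeping about `rate` of each class.

    `labels` is a flat list of per-pixel class labels (assumed layout).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    kept = []
    for idxs in by_class.values():
        # Assumption: never drop a class entirely, even at low rates.
        k = max(1, round(rate * len(idxs)))
        kept.extend(rng.sample(idxs, k))
    return sorted(kept)


def dice(pred, target):
    """Dice score for two binary masks given as flat lists of 0/1."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total


# Toy example: 90 background pixels and 10 tumor pixels, 10% rate.
labels = [0] * 90 + [1] * 10
kept = stratified_subsample(labels, rate=0.10)
```

Sampling per class rather than uniformly keeps the minority (tumor) class represented in the sparse hypercolumn set, which is presumably why a stratified scheme is preferred here.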


[67] FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography cs.CV | cs.GRPDF

Xia Xin, Yuki Endo, Yoshihiro Kanamori

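TL;DR: This paper proposes FontUse, a data-centric approach to generating in-image typography that matches a specified font style and use case. It builds a large-scale typography dataset, annotates it automatically with segmentation models and multimodal large language models, and trains existing image generation models to better follow user prompts about font style and use case.

Details

Motivation: Existing text-to-image models struggle to control typography and often ignore or only weakly follow font-style prompts. This paper addresses that limitation with a data-driven approach that makes models more responsive to typographic conditions.

Result: Experiments show that, across diverse prompts and layouts, models trained with the pipeline produce text renderings more consistent with the prompts than competitive baselines, as evaluated with a Long-CLIP-based metric.

Insight: The novelty is an automatically annotated typography dataset whose prompts combine font style and use case, enabling conditional generation without modifying the model architecture. Viewed objectively, data-centric supervision offers a scalable solution for typography control.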

Abstract: Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.


[68] Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models cs.CVPDF

Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang

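TL;DR: This paper proposes GvU, a method that improves the image generation quality of unified multimodal models (UMMs) by exploiting their own visual understanding capability. It designs a token-level intrinsic text-image alignment reward and builds a self-supervised reinforcement learning framework in which the UMM evaluates its own outputs through its understanding branch and iteratively improves generation, narrowing the gap between the visual understanding and generation capabilities of UMMs.

Details

Motivation: UMMs excel at visual understanding but are comparatively weak at generation: the understanding and generation processes are intrinsically decoupled, which makes it hard to generate semantically coherent images from complex text prompts. This paper aims to use the internal understanding capability of UMMs to improve generation quality directly.

Result: Experiments show that the method substantially improves the generation capability of UMMs, which in turn strengthens their fine-grained visual understanding, effectively narrowing the gap between visual understanding and generation.

Insight: The novelty is an intrinsic-reward-based self-supervised reinforcement learning framework (GvU) in which the model acts as both teacher (evaluating through the understanding branch) and student (optimizing generation according to the reward), so that understanding and generation reinforce each other without external supervision. This offers a new route to unifying generation and understanding in multimodal models.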

Abstract: Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs’ internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals, without reliance on external supervision. Experimental results show that our method substantially boosts UMMs’ generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs’ visual understanding and generation.


[69] GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection cs.CVPDF

Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu

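TL;DR: This paper proposes GenHOI, a lightweight augmentation for pretrained video generation models that addresses inconsistent object appearance across frames in hand-object interaction video synthesis. It injects reference-object information in a temporally balanced and spatially selective manner to improve the physical plausibility of interactions and object consistency.

Details

Motivation: Existing hand-object interaction reenactment methods generalize poorly and struggle with complex real-world scenes, while all-in-one video editing models suffer from task-specific issues such as inconsistent object appearance. This paper aims to develop a hand-object interaction generation method that generalizes to unseen real-world scenes while preserving object consistency.

Result: Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes show that GenHOI significantly outperforms state-of-the-art hand-object interaction reenactment methods and all-in-one video editing methods.

Insight: The novelty lies in the temporally balanced Head-Sliding RoPE mechanism, which distributes the influence of reference information evenly, and a spatially selective two-level spatial attention gate, which focuses on interaction regions, enhancing interaction fidelity while preserving background realism. It is a lightweight adaptation that effectively modulates a large pretrained video generation model for a specific task (HOI).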

Abstract: Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/


[70] Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models cs.CV | cs.ROPDF

Canyu Chen, Yuguang Yang, Zhewen Tan, Yizhi Wang, Ruiyi Zhan

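TL;DR: This paper proposes Curious-VLA, a framework that tackles the "Narrow Policy" limitation in vision-language-action (VLA) models for autonomous driving. This limitation collapses exploration during imitation learning and caps the potential of the subsequent reinforcement learning stage. Curious-VLA eases the exploration-exploitation dilemma with a two-stage design: in the imitation learning stage it introduces a Feasible Trajectory Expansion strategy and a step-wise normalized trajectory representation to generate and adapt to diverse data; in the reinforcement learning stage it uses Adaptive Diversity-Aware Sampling and a Spanning Driving Reward to prioritize high-diversity samples and widen the reward's value span. The method achieves state-of-the-art performance on the Navsim benchmark.

Details

Motivation: To address a fundamental "Narrow Policy" limitation in autonomous driving VLA models: imitation learning tends to collapse exploration, and the subsequent reinforcement learning stage saturates prematurely because of insufficient feedback diversity, which limits model performance.

Result: On the Navsim benchmark, Curious-VLA achieves state-of-the-art (SoTA) results of PDMS 90.3 and EPDMS 85.4, with a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models.

Insight: The claimed contributions are: 1) in the imitation learning stage, a Feasible Trajectory Expansion strategy and a step-wise normalized trajectory representation for generating and encoding diverse data; 2) in the reinforcement learning stage, Adaptive Diversity-Aware Sampling and a Spanning Driving Reward with focal-style weighting to increase sensitivity to driving quality. Viewed objectively, the core contribution is systematically identifying and mitigating the fundamental tension between exploration and exploitation in VLA training, using novel data generation, representation, and reward design to improve exploration and final performance.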

Abstract: We identify a fundamental Narrow Policy limitation undermining the performance of autonomous driving VLA models, where Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. To address this, we propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt to this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS), which prioritizes high-diversity samples, and introduce a Spanning Driving Reward (SDR) with focal-style weighting that amplifies the reward's value span to improve sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.


[71] Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving cs.CV | cs.AIPDF

Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising

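TL;DR: This study analyzes the intermediate activations of lightweight vision-language models (VLMs) to investigate why they fail on simple visual questions relevant to automated driving. The authors create counterfactual image sets that differ only in a targeted visual concept and train linear probes to assess how linearly visual concepts are encoded in four SOTA VLMs, identifying two failure modes: perceptual failure and cognitive failure.

Details

Motivation: VLMs are increasingly used in automated driving applications, yet they often fail on simple, highly relevant visual questions for reasons that remain unclear. This paper analyzes intermediate activations to locate bottlenecks in the flow of visual information and understand the mechanisms behind these failures.

Result: Concepts such as object presence are explicitly and linearly encoded, whereas spatial visual concepts such as object orientation are only implicitly encoded by the spatial structure retained by the vision encoder. Even when a concept is linearly encoded, the model may still answer incorrectly, which leads to the two failure modes, perceptual and cognitive. In addition, increasing object distance quickly degrades the linear separability of the corresponding visual concept.

Insight: The novelty is the systematic use of counterfactual image sets and linear probes to assess how visual concepts are encoded in VLMs, together with the first explicit distinction between perceptual and cognitive failure, offering a new perspective on the limitations of VLMs in automated driving scenarios.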

Abstract: The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model’s activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model’s activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
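
To make the probing idea concrete, the sketch below trains a linear (logistic-regression) probe on toy "activations" and then applies the perceptual/cognitive distinction described in the abstract. It is an illustration only: the two-dimensional toy vectors, the function names, and the 0.9 accuracy threshold are assumptions, and the paper's actual probe setup is not specified in the abstract.

```python
import math


def train_linear_probe(xs, ys, lr=0.5, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of samples on the correct side of the probe's hyperplane."""
    hits = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(xs)


def failure_mode(probe_acc, model_correct, threshold=0.9):
    """Label a failure; `threshold` is a hypothetical cut-off."""
    if model_correct:
        return "none"
    # Concept is linearly decodable but the answer is wrong: cognitive.
    return "cognitive" if probe_acc >= threshold else "perceptual"


# Toy activations for a counterfactual pair (concept absent = 0, present = 1).
acts = [[-2.0, 0.3], [-1.0, -0.5], [-1.5, 0.1],
        [1.0, 0.2], [2.0, -0.4], [1.5, 0.6]]
concept = [0, 0, 0, 1, 1, 1]
w, b = train_linear_probe(acts, concept)
acc = probe_accuracy(w, b, acts, concept)
```

In practice the probe would be scored on held-out activations rather than on its training set; the toy example reuses the same points only to stay short.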


[72] Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting cs.CV | cs.ROPDF

Semin Bae, Hansol Lim, Jongseong Brad Choi

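TL;DR: This paper proposes a pipeline that converts omnidirectional RGB and LiDAR logs into initialization assets for 3D Gaussian Splatting (3DGS), making use of multimodal sensor data that robots and autonomous driving platforms routinely collect but often discard, in order to build large-scale digital twins at low cost.

Details

Motivation: Building large-scale 3DGS-based digital twins usually requires expensive, purpose-built data collection, while deployed platforms already collect extensive omnidirectional RGB and LiDAR logs that go to waste because of transmission constraints and the lack of a scalable reuse pipeline. This paper aims to address that waste and overcome the practical bottlenecks of converting raw logs directly into 3DGS assets.

Result: Compared with vision-only baselines, the pipeline's LiDAR-reinforced initialization consistently improves the fidelity of the final 3DGS rendering in structurally complex scenes.

Insight: The contributions are: 1) a deterministic workflow in which an ERP-to-cubemap conversion module provides deterministic spatial anchoring; 2) PRISM, a color-stratified downsampling strategy; 3) bridging the multimodal inputs with FPFH-based global registration and ICP, which turns discarded data into usable SfM geometry. Together they offer a feasible route to simulation-grade digital twins from standard archived sensor logs.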

Abstract: The demand for large-scale digital twins is rapidly growing in robotics and autonomous driving. However, constructing these environments with 3D Gaussian Splatting (3DGS) usually requires expensive, purpose-built data collection. Meanwhile, deployed platforms routinely collect extensive omnidirectional RGB and LiDAR logs, but a significant portion of this sensor data is directly discarded or strictly underutilized due to transmission constraints and the lack of a scalable reuse pipeline. In this paper, we present an omnidirectional RGB-LiDAR reuse pipeline that transforms these archived logs into robust initialization assets for 3DGS. Direct conversion of such raw logs introduces practical bottlenecks: inherent non-linear distortion leads to unreliable Structure-from-Motion (SfM) tracking, and dense, unorganized LiDAR clouds cause computational overhead during 3DGS optimization. To overcome these challenges, our pipeline strategically integrates an ERP-to-cubemap conversion module for deterministic spatial anchoring, alongside PRISM, a color-stratified downsampling strategy. By bridging these multi-modal inputs via Fast Point Feature Histograms (FPFH)-based global registration and Iterative Closest Point (ICP), our pipeline successfully repurposes a considerable fraction of discarded data into usable SfM geometry. Furthermore, our LiDAR-reinforced initialization consistently enhances the final 3DGS rendering fidelity in structurally complex scenes compared to vision-only baselines. Ultimately, this work provides a deterministic workflow for creating simulation-grade digital twins from standard archived sensor logs.
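
The ERP-to-cubemap step rests on a standard geometric mapping between cube-face pixels and equirectangular (ERP) coordinates, which can be sketched as below. The axis convention (+z forward, +x right, +y up), the function names, and the pixel-centre sampling are assumptions for illustration; the paper's own conversion module may differ.

```python
import math


def direction_to_erp(direction, width, height):
    """Map a 3D viewing direction to (u, v) pixel coordinates in an ERP image."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)       # longitude in [-pi, pi], 0 = forward (+z)
    lat = math.asin(y / norm)    # latitude in [-pi/2, pi/2], positive = up
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v


def front_face_direction(i, j, n):
    """Viewing direction through the centre of pixel (i, j) on an n x n front face."""
    a = 2.0 * (i + 0.5) / n - 1.0   # horizontal offset in [-1, 1]
    b = 2.0 * (j + 0.5) / n - 1.0   # vertical offset in [-1, 1], downwards
    return (a, -b, 1.0)


# The centre pixel of a 3 x 3 front face looks straight ahead, so it
# samples the middle of a 2048 x 1024 ERP panorama.
u, v = direction_to_erp(front_face_direction(1, 1, 3), 2048, 1024)
```

Each cube face is a distortion-free perspective view, which is consistent with the abstract's point that ERP distortion makes SfM tracking unreliable.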


[73] Text-Driven Emotionally Continuous Talking Face Generation cs.CV | cs.AIPDF

Hao Yang, Yanyan Zhao, Tian Zheng, Hongbo Zhang, Bichen Wang

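TL;DR: This paper introduces a new task, Emotionally Continuous Talking Face Generation (EC-TFG), which aims to generate high-quality talking face videos that reflect continuous emotional change from a text description with varying emotions. To this end, the authors propose the TIE-TFG model, which uses Temporal-Intensive Emotion Fluctuation Modeling to drive continuous facial expression changes in the synthesized video.

Details

Motivation: Existing talking face generation methods can usually express only a fixed target emotion and cannot reproduce the continuous, natural emotional changes people show when conveying information.

Result: Extensive evaluations show that the method produces smooth emotion transitions across diverse emotional states while maintaining high visual quality and motion authenticity.

Insight: The core contribution is defining the new EC-TFG task and proposing the TIE-TFG model, which manages emotional variation dynamically through Temporal-Intensive Emotion Fluctuation Modeling and thereby drives continuous facial expression changes in synthesized videos.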

Abstract: Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have mastered the creation of naturalistic facial movements, they typically express a fixed target emotion in synthetic videos and lack the ability to exhibit continuously changing and natural expressions like humans do when conveying information. To synthesize realistic videos, we propose a novel task called Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotions as driving data, aiming to generate a video where the person speaks the text while reflecting the emotional changes within the description. Alongside this, we introduce a customized model, i.e., Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which innovatively manages dynamic emotional variations by employing Temporal-Intensive Emotion Fluctuation Modeling, allowing it to provide emotion variation sequences corresponding to the input text to drive continuous facial expression changes in synthesized videos. Extensive evaluations demonstrate our method’s exceptional ability to produce smooth emotion transitions and uphold high-quality visuals and motion authenticity across diverse emotional states.


[74] Lyapunov Probes for Hallucination Detection in Large Foundation Models cs.CVPDF

Bozhi Luan, Gen Li, Yalan Qin, Jifeng Guo, Yun Zhou

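TL;DR: This paper proposes a new approach, grounded in dynamical systems stability theory, to detecting hallucinations in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). It treats the model as a dynamical system, models factual knowledge as stable equilibrium points in the representation space, and argues that hallucinations tend to arise at the knowledge-transition boundaries between stable and unstable regions. To capture this, the paper introduces Lyapunov Probes, lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations, so that stable factual regions can be reliably distinguished from unstable, hallucination-prone ones.

Details

Motivation: To address hallucination detection in LLMs and MLLMs not as a simple classification task but through the lens of dynamical systems stability theory, with the aim of modeling knowledge representation and the dynamics of hallucination at a more fundamental level.

Result: Experiments on multiple datasets and models show consistent improvements over existing baselines.

Insight: The claimed novelty is bringing dynamical systems stability theory to hallucination detection through Lyapunov Probes; the core insight is that hallucinations arise at the boundary between stable and unstable regions of the knowledge representation space. Viewed objectively, this provides a novel theoretical framework and a workable detection method for understanding the relationship between internal representations and hallucination.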

Abstract: We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge-transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone regions. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.
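
One way to picture a constraint that "enforces a monotonic decay in confidence under input perturbations" is a hinge penalty on any increase in probe confidence as the perturbation grows. The sketch below is a finite-difference stand-in chosen for illustration, not the paper's derivative-based loss, which the abstract does not spell out; the function name and the `margin` parameter are hypothetical.

```python
def monotonic_decay_penalty(confidences, margin=0.0):
    """Penalize confidence that fails to decrease with perturbation size.

    `confidences[k]` is the probe's confidence at the k-th (increasing)
    perturbation magnitude. The penalty is zero when every step drops
    by at least `margin`.
    """
    penalty = 0.0
    for prev, nxt in zip(confidences, confidences[1:]):
        penalty += max(0.0, nxt - prev + margin)
    return penalty


# A smoothly decaying profile incurs no penalty; a bump does.
stable = monotonic_decay_penalty([0.9, 0.7, 0.4, 0.2])
bumpy = monotonic_decay_penalty([0.9, 0.95, 0.4])
```

Added to a standard classification loss, a term of this kind would push the probe toward confidence profiles that fall off steadily away from the unperturbed input.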


[75] Cross-Resolution Distribution Matching for Diffusion Distillation cs.CVPDF

Feiyang Chen, Hongpeng Pan, Haonan Xu, Xinyu Duan, Yang Yang

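TL;DR: This paper proposes Cross-Resolution Distribution Matching Distillation (RMD), a new framework that addresses the image quality degradation caused by cross-resolution distribution gaps in diffusion distillation. It compensates for resolution changes with a mapping based on the logarithmic signal-to-noise ratio (logSNR) and introduces distribution matching and predicted-noise re-injection to accelerate high-fidelity, few-step, multi-resolution cascaded inference.

Details

Motivation: Existing diffusion distillation methods are limited by the denoising process, where step reduction has largely saturated. Generating at low resolution for part of the timesteps can accelerate inference further, but cross-resolution distribution gaps cause noticeable quality degradation. This paper aims to bridge that gap for efficient, high-quality image generation.

Result: On the SDXL and Wan2.1-14B backbones, RMD achieves inference speedups of up to 33.4x and 25.6x, respectively, while preserving high visual fidelity. Both qualitative and quantitative experiments show that the method maintains generation quality while accelerating inference.

Insight: The contributions are: 1) dividing timestep intervals using logSNR curves and applying a mapping to compensate for resolution-induced shifts; 2) performing distribution matching along resolution trajectories to narrow the gap between the low-resolution generator distributions and the teacher's high-resolution distribution; 3) a predicted-noise re-injection mechanism during upsampling that stabilizes training and improves synthesis quality. These offer a new approach to accelerating multi-resolution cascaded inference.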

Abstract: Diffusion distillation is central to accelerating image and video generation, yet existing methods are fundamentally limited by the denoising process, where step reduction has largely saturated. Partial timestep low-resolution generation can further accelerate inference, but it suffers noticeable quality degradation due to cross-resolution distribution gaps. We propose Cross-Resolution Distribution Matching Distillation (RMD), a novel distillation framework that bridges cross-resolution distribution gaps for high-fidelity, few-step multi-resolution cascaded inference. Specifically, RMD divides the timestep intervals for each resolution using logarithmic signal-to-noise ratio (logSNR) curves, and introduces logSNR-based mapping to compensate for resolution-induced shifts. Distribution matching is conducted along resolution trajectories to reduce the gap between low-resolution generator distributions and the teacher’s high-resolution distribution. In addition, a predicted-noise re-injection mechanism is incorporated during upsampling to stabilize training and improve synthesis quality. Quantitative and qualitative results show that RMD preserves high-fidelity generation while accelerating inference across various backbones. Notably, RMD achieves up to 33.4X speedup on SDXL and 25.6X on Wan2.1-14B, while preserving high visual fidelity.


[76] Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion cs.CV | cs.AIPDF

Bohai Gu, Taiyi Wu, Dazhao Du, Jian Liu, Shuai Yang

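TL;DR: This paper proposes Place-it-R1, a framework that unlocks the environment-aware reasoning ability of multimodal large language models to achieve physically consistent video object insertion. It adopts a chain-of-thought reasoning paradigm, combines it with Spatial Direct Preference Optimization and closed-loop iterative refinement, and offers flexible and standard modes to balance physical plausibility against visual fidelity.

Details

Motivation: Existing video editing techniques focus on visual fidelity and neglect physical causality, so inserted objects are physically inconsistent with their environment. A video object insertion method that ensures physical plausibility is therefore needed.

Result: In extensive experiments, Place-it-R1 achieves more physically coherent video object insertion than state-of-the-art solutions and commercial models.

Insight: The contributions are: using an MLLM for physical scene understanding and interaction reasoning to generate environment-aware chain-of-thought tokens; MLLM-guided Spatial Direct Preference Optimization to assess visual naturalness; and a closed-loop iterative refinement mechanism that progressively improves editing quality, with user-selectable modes for the plausibility-fidelity trade-off.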

Abstract: Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations. First, the MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions to explicitly guide the diffusion toward physically plausible insertion. Second, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, enabling visual naturalness. Third, during inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that progressively enhances editing quality. Furthermore, we provide two user-selectable modes: a plausibility-oriented flexible mode that permits environment modifications (e.g., generating support structures) to enhance physical plausibility, and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility-fidelity trade-off. Extensive experiments demonstrate that Place-it-R1 achieves more physically coherent video object insertion than state-of-the-art solutions and commercial models.


[77] Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models cs.CVPDF

Nicoleta-Nina Basoc, Adrian Cosma, Emilian Radoi

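TL;DR: This paper introduces Spatial Colour Mixing, a programmatic family of colour distortions, to systematically evaluate perceptual weaknesses in vision-language models (VLMs). It finds that VLM prediction accuracy drops sharply under structured colour distortions even when the image content remains easily recognizable to humans, and that scaling the language model does not reliably mitigate the problem. Humans substantially outperform VLMs under the same distortions. Finally, a simple human-inspired preprocessing step partially restores model performance, suggesting a practical strategy for improving VLM robustness.

Details

Motivation: Although VLMs perform well on benchmarks, they can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions even when the underlying scene remains easily recognizable to humans. This paper uses Spatial Colour Mixing distortions to study this gap between human and model perception.

Result: Nine VLMs from three model families were evaluated on four datasets. Accuracy degrades sharply with increasing distortion across all models and datasets, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs. In addition, a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types.

Insight: The novelty is a systematic framework (eight Spatial Colour Mixing variants) for quantifying the perceptual robustness of VLMs, which exposes systematic weaknesses under structured colour distortion. Viewed objectively, the study highlights the gap between VLM and human perception and proposes perception-aware preprocessing and tool use as practical directions for improving robustness, offering new perspectives and methods for model evaluation and enhancement.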

Abstract: Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we show that a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types, motivating perception-aware preprocessing and tool-use as practical strategies for improving VLM robustness.


[78] Longitudinal NSCLC Treatment Progression via Multimodal Generative Models cs.CVPDF

Massimiliano Mantegna, Elena Mulero Ayllón, Alice Natalina Caragliano, Francesco Di Feola, Claudia Tacconi

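TL;DR: This paper proposes Virtual Treatment (VT), a multimodal generative framework for predicting tumor evolution in non-small cell lung cancer (NSCLC) patients during radiotherapy. It formulates tumor progression as a dose-aware multimodal conditional image-to-image translation problem: given a CT scan, baseline clinical variables, and a radiation dose increment, it synthesizes follow-up CT images that reflect treatment-induced anatomical changes.

Details

Motivation: To predict tumor evolution during radiotherapy, a clinically critical challenge, particularly when longitudinal changes are driven by both anatomy and treatment.

Result: Evaluated on a longitudinal dataset of 222 stage III NSCLC patients (895 CT scans in total), diffusion-based models benefit more from multimodal, dose-aware conditioning than GAN-based baselines and produce more stable and anatomically plausible tumor evolution trajectories.

Insight: The novelty is formulating NSCLC progression as a dose-aware multimodal conditional image translation problem that incorporates clinical variables and dose increments. Viewed objectively, diffusion models outperform GANs on this task, supporting the potential of VT as a tool for in-silico treatment monitoring and adaptive radiotherapy research.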

Abstract: Predicting tumor evolution during radiotherapy is a clinically critical challenge, particularly when longitudinal changes are driven by both anatomy and treatment. In this work, we introduce a Virtual Treatment (VT) framework that formulates non-small cell lung cancer (NSCLC) progression as a dose-aware multimodal conditional image-to-image translation problem. Given a CT scan, baseline clinical variables, and a specified radiation dose increment, VT aims to synthesize plausible follow-up CT images reflecting treatment-induced anatomical changes. We evaluate the proposed framework on a longitudinal dataset of 222 stage III NSCLC patients, comprising 895 CT scans acquired during radiotherapy under irregular clinical schedules. The generative process is conditioned on delivered dose increments together with demographic and tumor-related clinical variables. Representative GAN-based and diffusion-based models are benchmarked across 2D and 2.5D configurations. Quantitative and qualitative results indicate that diffusion-based models benefit more consistently from multimodal, dose-aware conditioning and produce more stable and anatomically plausible tumor evolution trajectories than GAN-based baselines, supporting the potential of VT as a tool for in-silico treatment monitoring and adaptive radiotherapy research in NSCLC.


[79] VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models cs.CV | cs.AIPDF

Rohit Saxena, Alessandro Suglia, Pasquale Minervini

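TL;DR: This paper presents VLM-RobustBench, a benchmark for comprehensively evaluating the robustness of vision-language models under real-world image distortions. It covers 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated at different severities. VLMs from four families are evaluated on two benchmarks, MMBench and MMMU-Pro. The study finds that visual severity is a weak predictor of performance degradation and that spatial perturbations affect models more than photometric distortions, revealing that current VLMs are semantically strong but spatially fragile.

Details

Motivation: Although VLMs perform well on standard, high-quality datasets, their performance under real-world image distortions is not yet well understood, so a comprehensive robustness benchmark is needed to evaluate models under a range of perturbations.

Result: Models from the Qwen, InternVL, Molmo, and Gemma families were evaluated on VLM-RobustBench. Low-severity spatial perturbations such as glass_blur reduce MMBench accuracy by about 8 percentage points on average, while resampling and geometric distortions such as upsample and elastic_transform cause the largest drops, up to 34 percentage points, indicating that the models are particularly sensitive to geometric transformations.

Insight: The novelty is a comprehensive robustness benchmark spanning a wide range of distortion types and severities, which reveals the weak correlation between visual severity and performance degradation and the fragility of current VLMs with respect to spatial invariance. This motivates robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.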

Abstract: Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.
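
The figure of 133 corrupted settings can be reconciled with 49 augmentation types by a short calculation. The split below is an inference, not something the abstract states: it assumes every graded augmentation contributes exactly three settings (low/mid/high) and every binary transform contributes exactly one.

```python
def split_augmentations(total_types, total_settings, levels=3):
    """Solve graded + binary = total_types and
    levels * graded + binary = total_settings for non-negative integers."""
    extra = total_settings - total_types  # equals (levels - 1) * graded
    if extra < 0 or extra % (levels - 1) != 0:
        raise ValueError("no integer split under the stated assumptions")
    graded = extra // (levels - 1)
    binary = total_types - graded
    if binary < 0:
        raise ValueError("no integer split under the stated assumptions")
    return graded, binary


graded, binary = split_augmentations(49, 133)  # → (42, 7)
```

Under these assumptions the benchmark would contain 42 graded augmentations and 7 binary transforms, since 42 × 3 + 7 = 133.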


[80] FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models cs.CVPDF

Andrew Caunes, Thierry Chateau, Vincent Fremont

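TL;DR: FreeOcc is a training-free, end-to-end panoptic occupancy prediction method that uses pretrained foundation models to recover semantics and geometry from multi-view images. It extracts priors with a promptable segmentation model, reconstructs 3D point clouds with a reconstruction model, and applies depth-aware filtering, temporal fusion, and voxelization to produce panoptic occupancy predictions without training any 3D model.

Details

Motivation: Existing camera-based occupancy prediction methods for road scenes typically rely on costly dense 3D annotations or require training on target-domain data, which limits deployment in unseen environments. A training-free method that generalizes to new scenes is therefore needed.

Result: On the Occ3D-nuScenes benchmark, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU without training, on par with state-of-the-art weakly supervised methods. Used as a pseudo-label generation pipeline for training downstream models, it reaches 21.1 RayIoU, surpassing the previous weakly supervised baseline. For panoptic occupancy prediction, it achieves 3.1 RayPQ and 3.9 RayPQ in the training-free and weakly supervised settings, respectively, setting new baselines.

Insight: The novelty is achieving fully training-free 3D scene understanding with pretrained foundation models (a promptable segmentation model and a reconstruction model): multi-view information is fused through depth-aware filtering and a deterministic refinement stack, and instance information is recovered by fitting and merging 3D bounding-box candidates, offering a practical route to training-free perception.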

Abstract: Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle’s surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
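
The voxelization step, in which labelled 3D points are collapsed into an occupancy grid, can be sketched as a per-voxel majority vote. This is a generic illustration, not FreeOcc's refinement stack: the function name, the 0.4 m voxel size, and the majority-vote rule are assumptions.

```python
import math
from collections import Counter, defaultdict


def voxelize(points, labels, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Assign each occupied voxel the most frequent label among its points."""
    votes = defaultdict(Counter)
    for (x, y, z), label in zip(points, labels):
        key = (
            math.floor((x - origin[0]) / voxel_size),
            math.floor((y - origin[1]) / voxel_size),
            math.floor((z - origin[2]) / voxel_size),
        )
        votes[key][label] += 1
    # Voxels that received no points are simply absent (treated as free).
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (0.3, 0.2, 0.2), (1.1, 0.0, 0.0)]
labels = ["car", "car", "road", "road"]
grid = voxelize(points, labels, voxel_size=0.4)
```

Because the vote is a pure function of the input points, the result is deterministic, in keeping with the abstract's description of a deterministic pipeline.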


[81] A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement cs.CVPDF

Ruili Li, Jiayi Ding, Ruiyu Li, Yilun Jin, Shiwen Ge

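TL;DR: This paper proposes a semi-supervised learning framework for breast ultrasound image segmentation that uses training-free pseudo-label generation and label refinement. Simple appearance-based descriptions (e.g., "dark oval") enable cross-domain structural transfer between natural and medical images and yield structurally consistent pseudo labels. These pseudo labels warm up a static teacher model that captures global structural priors of breast lesions; combined with an exponential moving average teacher, the framework further introduces uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination.

Details

Motivation: To address the unstable pseudo labels, inaccurate supervision, and degraded performance that semi-supervised learning (SSL) suffers under extremely limited annotations, and to overcome the limited effectiveness of existing vision-language models (VLMs) on breast ultrasound images, where domain-specific prompts are difficult to transfer.

Result: Experiments on four breast ultrasound datasets show that the method achieves performance comparable to fully supervised models with only 2.5% labeled data, significantly outperforming existing SSL approaches.

Insight: The novelty is a training-free pseudo-label generation mechanism that achieves cross-domain structural transfer through simple global appearance descriptions and thus provides reliable pseudo supervision for medical image segmentation, combined with a static teacher, uncertainty-weighted fusion, and adaptive contrastive learning to refine labels and sharpen boundaries. The paradigm extends readily to other imaging modalities or diseases: only a global appearance description is required for scalable semi-supervised medical image segmentation under limited annotations.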

Abstract: Semi-supervised learning (SSL) has emerged as a promising paradigm for breast ultrasound (BUS) image segmentation, but it often suffers from unstable pseudo labels under extremely limited annotations, leading to inaccurate supervision and degraded performance. Recent vision-language models (VLMs) provide a new opportunity for pseudo-label generation, yet their effectiveness on BUS images remains limited because domain-specific prompts are difficult to transfer. To address this issue, we propose a semi-supervised framework with training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions (e.g., dark oval), our method enables cross-domain structural transfer between natural and medical images, allowing VLMs to generate structurally consistent pseudo labels. These pseudo labels are used to warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, we further introduce uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination. Experiments on four BUS datasets demonstrate that our method achieves performance comparable to fully supervised models even with only 2.5% labeled data, significantly outperforming existing SSL approaches. Moreover, the proposed paradigm is readily extensible: for other imaging modalities or diseases, only a global appearance description is required to obtain reliable pseudo supervision, enabling scalable semi-supervised medical image segmentation under limited annotations.
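
A plausible reading of "uncertainty entropy weighted fusion" is that the two teachers' per-pixel predictions are averaged with weights that shrink as predictive entropy grows. The sketch below implements that reading for a single foreground probability; the weighting function `exp(-entropy)` and all names are assumptions, since the abstract does not give the actual formula.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli foreground probability."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_fusion(p_static, p_ema):
    """Fuse two teachers' probabilities, trusting the less uncertain one more."""
    w_static = math.exp(-binary_entropy(p_static))
    w_ema = math.exp(-binary_entropy(p_ema))
    return (w_static * p_static + w_ema * p_ema) / (w_static + w_ema)


# The static teacher is maximally unsure (0.5); the EMA teacher is
# fairly confident (0.9), so the fused value leans toward 0.9.
fused = entropy_weighted_fusion(0.5, 0.9)
```

Applied per pixel, such a rule lets whichever teacher is more certain dominate locally, which would matter most near lesion boundaries where the two teachers are likely to disagree.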


[82] JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas cs.CVPDF

Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota

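TL;DR: This paper presents JOPP-3D, a joint open-vocabulary semantic segmentation framework that handles both 3D point clouds and panoramic images. It converts RGB-D panoramas into tangential perspective images and 3D point clouds, then extracts and aligns foundational vision-language features, enabling cross-modal semantic segmentation driven by natural language queries.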

Details

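Motivation: Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains difficult, mainly because annotated data is scarce and fixed-label models adapt poorly.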

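Result: Experiments on the Stanford-2D-3D-s and ToF-360 datasets show that the method produces coherent, semantically meaningful masks in both panoramic and 3D domains and significantly improves over the state of the art (SOTA) on open- and closed-vocabulary 2D and 3D semantic segmentation.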

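Insight: The novelty is an open-vocabulary framework that jointly exploits panoramic and point cloud data, achieving language-driven cross-modal scene understanding through modality conversion and feature alignment. At its core, it unifies data from different modalities in a shared vision-language feature space to support flexible text queries.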

Abstract: Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.


[83] Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward Learning cs.CVPDF

Yueying Tian, Xudong Han, Meng Zhou, Rodrigo Aviles-Espinosa, Rupert Young

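TL;DR: This paper proposes optimizing 3D diffusion models with multi-scale reward learning to improve medical image generation quality. A 3D diffusion model is first pretrained on MRI volumes and then fine-tuned with Proximal Policy Optimization (PPO) under a multi-scale reward system that combines 2D slice-wise assessment with 3D volumetric analysis, jointly optimizing local texture detail and global structural coherence.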

Details

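Motivation: To close the gap between standard training objectives and clinical relevance, improving the practical utility and generation quality of 3D medical image generation models for clinical tasks.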

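Result: On BraTS 2019 and OASIS-1, quantitative analysis shows significant improvements in Fréchet Inception Distance (FID), and the synthetic data performs better than non-optimized baselines on downstream tumor and disease classification tasks.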

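Insight: The novelty is integrating reinforcement learning with a multi-scale reward (combining 2D and 3D assessment) into diffusion model fine-tuning, optimizing local and global features together and improving the clinical utility of generated images.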

Abstract: Diffusion models have emerged as powerful tools for 3D medical image generation, yet bridging the gap between standard training objectives and clinical relevance remains a challenge. This paper presents a method to enhance 3D diffusion models using Reinforcement Learning (RL) with multi-scale feedback. We first pretrain a 3D diffusion model on MRI volumes to establish a robust generative prior. Subsequently, we fine-tune the model using Proximal Policy Optimization (PPO), guided by a novel reward system that integrates both 2D slice-wise assessments and 3D volumetric analysis. This combination allows the model to simultaneously optimize for local texture details and global structural coherence. We validate our framework on the BraTS 2019 and OASIS-1 datasets. Our results indicate that incorporating RL feedback effectively steers the generation process toward higher quality distributions. Quantitative analysis reveals significant improvements in Fréchet Inception Distance (FID) and, crucially, the synthetic data demonstrates enhanced utility in downstream tumor and disease classification tasks compared to non-optimized baselines.
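
As a rough illustration of the two ingredients named in the abstract, the sketch below combines 2D slice-wise scores with a 3D volumetric score into one scalar reward and shows the standard PPO clipped surrogate term for a single sample. The equal weighting, the averaging over slices, and all function names are assumptions. The abstract does not specify how the reward is composed.

```python
def multi_scale_reward(slice_scores, volume_score, w_2d=0.5, w_3d=0.5):
    """Combine per-slice 2D scores with one 3D volume score.

    Assumption: slices are averaged and the two scales are mixed linearly.
    """
    mean_slice = sum(slice_scores) / len(slice_scores)
    return w_2d * mean_slice + w_3d * volume_score

def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); the clip limits how far one update
    can move the policy away from the one that generated the sample.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

The clipped term caps the benefit of pushing the probability ratio beyond `1 + clip_eps` when the advantage is positive, which is the usual reason PPO is chosen for stable fine-tuning.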


[84] Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots cs.CVPDF

Mingzhe Li, Mengyin Liu, Zekai Wu, Xincheng Lin, Junsheng Zhang

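TL;DR: This paper proposes the Motion Turing Test, a framework for evaluating how human-like humanoid robot motion is, and builds the HHMotion dataset of human and humanoid motion sequences. Large-scale human annotation shows that robot motions still differ noticeably from human motions in dynamic behaviors. The paper also formulates a task of automatically predicting human-likeness scores and shows that a simple baseline outperforms existing LLM-based methods.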

Details

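Motivation: Inspired by the Turing Test, the work aims to establish an objective framework for evaluating the human-likeness of humanoid robot motion, addressing the current lack of systematic evaluation standards.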

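Result: On the HHMotion dataset, analysis of human annotations reveals perceptible deviations between humanoid and human motions, especially in dynamic actions such as jumping, boxing, and running. On the automatic human-likeness score prediction task, the proposed simple baseline outperforms several contemporary LLM-based methods.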

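Insight: The paper applies the Turing Test idea to motion evaluation, introducing the Motion Turing Test concept and a corresponding evaluation framework; builds HHMotion, a large-scale, standardized human-humanoid motion dataset spanning multiple humanoid models, with fine-grained human annotations; and exposes the shortcomings of current multimodal large language models at assessing motion human-likeness, offering an effective simple baseline for the task.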

Abstract: Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0-5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based methods. The dataset, code, and benchmark will be publicly released to support future research in the community.


[85] SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection cs.CVPDF

Shuailin Xue, Jun Wan, Lihua Zhang, Wenwen Min

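TL;DR: This paper proposes SpaCRD, which deeply integrates histology images with spatial transcriptomics data to detect cancer tissue regions (CTR) across samples, platforms, and batches. It uses a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network to adaptively capture latent co-expression patterns between histological features and gene expression. Experiments on 23 matched datasets spanning multiple disease types, platforms, and batches show that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.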

Details

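Motivation: Traditional cancer region detection based on histology images is prone to high false-positive rates because different tissue regions look morphologically similar. Spatial transcriptomics (ST) data provides detailed cellular phenotypes and spatial localization, but existing methods cannot effectively integrate the two modalities, especially for cross-sample and cross-platform/batch CTR detection.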

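Result: Extensive benchmarking on 23 matched histology-ST datasets spanning multiple disease types, platforms, and batches shows that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.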

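Insight: The novelty is a transfer learning-based deep fusion framework built around a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network. This design lets the model adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives, giving reliable generalization across samples, platforms, and batches. Viewed objectively, the method offers an effective technical route for multimodal biomedical data integration, particularly for handling data heterogeneity and batch effects.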

Abstract: Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data, especially in cross-sample and cross-platform/batch settings for CTR detection. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.


[86] Point-Supervised Skeleton-Based Human Action Segmentation cs.CVPDF

Hongsong Wang, Yiqin Shen, Pengbo Yan, Jie Gui

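TL;DR: This paper proposes a point-supervised framework for skeleton-based temporal action segmentation that needs only one labeled frame per action segment. A pretrained unified model encodes multimodal skeleton data (joint, bone, and motion) to extract features; prototype similarity, an energy function, and constrained K-Medoids clustering generate reliable pseudo-labels; and multimodal pseudo-label integration guides training. The paper establishes new benchmarks on PKU-MMD, MCFS-22, and MCFS-130. Experiments show competitive performance at a much lower annotation cost, even surpassing some fully supervised methods.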

Details

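Motivation: Fully supervised skeleton-based temporal action segmentation depends on costly frame-level annotations and is sensitive to ambiguous action boundaries. Point supervision (one labeled frame per action segment) aims to cut annotation cost and improve robustness.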

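Result: The paper establishes new point-supervised skeleton-based action segmentation benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130. Experiments show competitive performance, even surpassing some fully supervised methods.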

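Insight: The contributions include a point-supervised framework that greatly reduces annotation requirements; multimodal skeleton features (joint, bone, motion) that enrich the representation; a prototype similarity method that is integrated with existing techniques to generate reliable pseudo-labels; and multimodal pseudo-label integration that improves training stability. Viewed objectively, the pseudo-label generation and integration strategy effectively mitigates noise in weakly supervised learning and offers a practical route to low-cost action segmentation.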

Abstract: Skeleton-based temporal action segmentation is a fundamental yet challenging task, playing a crucial role in enabling intelligent systems to perceive and respond to human activities. While fully-supervised methods achieve satisfactory performance, they require costly frame-level annotations and are sensitive to ambiguous action boundaries. To address these issues, we introduce a point-supervised framework for skeleton-based action segmentation, where only a single frame per action segment is labeled. We leverage multimodal skeleton data, including joint, bone, and motion information, encoded via a pretrained unified model to extract rich feature representations. To generate reliable pseudo-labels, we propose a novel prototype similarity method and integrate it with two existing methods: energy function and constrained K-Medoids clustering. Multimodal pseudo-label integration is proposed to enhance the reliability of the pseudo-label and guide the model training. We establish new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130, and implement baselines for point-supervised skeleton-based human action segmentation. Extensive experiments show that our method achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.
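
The abstract does not define the prototype similarity method, so the following is only a minimal sketch of the general idea under stated assumptions: each class prototype is the mean feature of its point-labeled frames, every frame is assigned the class of its most similar prototype by cosine similarity, and frames below a threshold stay unlabeled. The function names, the threshold, and the use of cosine similarity are hypothetical choices.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def prototype_pseudo_labels(frames, point_labels, threshold=0.8):
    """Assign pseudo-labels from point annotations.

    frames: list of per-frame feature vectors.
    point_labels: dict mapping a labeled frame index to its class label.
    Returns one label per frame, or None where no prototype is similar enough.
    """
    # Build one prototype per class by averaging its labeled frames' features.
    grouped = {}
    for idx, label in point_labels.items():
        grouped.setdefault(label, []).append(frames[idx])
    prototypes = {
        label: [sum(col) / len(feats) for col in zip(*feats)]
        for label, feats in grouped.items()
    }
    labels = []
    for feat in frames:
        best_label, best_sim = max(
            ((label, cosine(feat, proto)) for label, proto in prototypes.items()),
            key=lambda pair: pair[1],
        )
        labels.append(best_label if best_sim >= threshold else None)
    return labels
```

Leaving ambiguous frames as `None` mirrors the motivation stated in the abstract: frames near action boundaries are the ones most likely to be mislabeled, so a conservative pseudo-labeler skips them.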


[87] VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction cs.CV | cs.ROPDF

Xiaoyang Yan, Muleilan Pei, Shaojie Shen

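TL;DR: This paper proposes VG3S, a new framework for 3D semantic occupancy prediction in autonomous driving. Its core idea is to inject the strong geometric grounding capability of Vision Foundation Models (VFMs) into occupancy modeling based on 3D Gaussian splatting, addressing the insufficient geometric cues of purely vision-based methods.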

Details

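Motivation: Occupancy prediction based on 3D Gaussian splatting is computationally efficient, but generating high-quality 3D Gaussians relies heavily on accurate geometric cues, which purely vision-centric paradigms often lack. The paper aims to bridge this gap.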

Result: Extensive experiments on the nuScenes occupancy benchmark show that VG3S improves over the baseline by 12.6% in IoU and 7.5% in mIoU.

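Insight: The main novelty is a plug-and-play hierarchical geometric feature adapter that transforms generic VFM features into geometric priors suited to occupancy prediction through feature aggregation, task-specific alignment, and multi-scale restructuring. This offers a general and effective way to use the geometric priors of large-scale pretrained vision foundation models to strengthen downstream 3D perception tasks.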

Abstract: 3D semantic occupancy prediction has become a crucial perception task for comprehensive scene understanding in autonomous driving. While recent advances have explored 3D Gaussian splatting for occupancy modeling to substantially reduce computational overhead, the generation of high-quality 3D Gaussians relies heavily on accurate geometric cues, which are often insufficient in purely vision-centric paradigms. To bridge this gap, we advocate for injecting the strong geometric grounding capability from Vision Foundation Models (VFMs) into occupancy prediction. In this regard, we introduce Visual Geometry Grounded Gaussian Splatting (VG3S), a novel framework that empowers Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Specifically, to fully exploit the rich 3D geometric priors from a frozen VFM, we propose a plug-and-play hierarchical geometric feature adapter, which can effectively transform generic VFM tokens via feature aggregation, task-specific alignment, and multi-scale restructuring. Extensive experiments on the nuScenes occupancy benchmark demonstrate that VG3S achieves remarkable improvements of 12.6% in IoU and 7.5% in mIoU over the baseline. Furthermore, we show that VG3S generalizes seamlessly across diverse VFMs, consistently enhancing occupancy prediction accuracy and firmly underscoring the immense value of integrating priors derived from powerful, pre-trained geometry-grounded VFMs.


[88] Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events cs.CV | cs.AIPDF

Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu

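TL;DR: This paper proposes CoE, a training-free multimodal summarization framework that performs structured reasoning through a Chain-of-Events guided by a hierarchical event graph. It targets three weaknesses of existing methods: reliance on domain-specific supervision, weak cross-modal grounding, and the lack of temporal modeling of event transitions.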

Details

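Motivation: Existing multimodal summarization methods face three main challenges: reliance on domain-specific supervision, implicit fusion with weak cross-modal grounding, and flat temporal modeling without event transitions.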

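Result: Experiments on eight diverse datasets show that CoE consistently outperforms state-of-the-art video CoT baselines, with average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, demonstrating robustness, interpretability, and cross-domain generalization.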

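Insight: The core novelty is a training-free framework that builds an explicit hierarchical event graph to guide structured reasoning (Chain-of-Events), making cross-modal grounding and temporal event modeling explicit, complemented by lightweight style adaptation for domain alignment. Viewed objectively, using event structure as an intermediate representation to bridge multimodal information and support causal reasoning is an idea worth borrowing.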

Abstract: Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, CoE localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.


[89] Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention cs.CVPDF

Haiqing Hao, Zhipeng Sui, Rong Zou, Zijia Dai, Nikola Zubić

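TL;DR: This paper proposes Spatially-Sparse Linear Attention (SSLA) and, built on it, SSLA-Det, an end-to-end asynchronous linear attention model that addresses the accuracy-efficiency trade-off of low-latency object detection with event cameras. A mixture-of-spaces state decomposition and a scatter-compute-gather training procedure provide state-level sparsity and parallel training, greatly reducing per-event computation while maintaining high accuracy.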

Details

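Motivation: Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, which suits low-latency object detection. Existing asynchronous event-based neural networks achieve low latency by updating predictions event by event but face two bottlenecks: recurrent architectures are hard to train efficiently on long sequences, and improving accuracy usually increases per-event computation and latency. Linear attention is attractive because it supports parallel training and recurrent inference, but its standard form updates a global state for every event, giving a poor accuracy-efficiency trade-off.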

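Result: On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared with the strongest prior asynchronous baseline.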

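Insight: The core novelty is the Spatially-Sparse Linear Attention (SSLA) mechanism, which uses a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure to turn event sparsity into state-level sparsity while preserving efficient parallel training. This offers a novel and effective architectural approach to reconciling low latency with high accuracy in event-based vision.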

Abstract: Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, making them attractive for low-latency object detection. Existing asynchronous event-based neural networks realize this low-latency advantage by updating predictions event-by-event, but still suffer from two bottlenecks: recurrent architectures are difficult to train efficiently on long sequences, and improving accuracy often increases per-event computation and latency. Linear attention is appealing in this setting because it supports parallel training and recurrent inference. However, standard linear attention updates a global state for every event, yielding a poor accuracy-efficiency trade-off, which is problematic for object detection, where fine-grained representations and thus states are preferred. The key challenge is therefore to introduce sparse state activation that exploits event sparsity while preserving efficient parallel training. We propose Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism. Built on SSLA, we develop an end-to-end asynchronous linear attention model, SSLA-Det, for event-based object detection. On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared to the strongest prior asynchronous baseline, demonstrating the potential of linear attention for low-latency event-based vision.
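
To make the contrast between a global state and sparse state activation concrete, here is a toy sketch of recurrent linear attention in which each spatial cell keeps its own state and an event touches only the state of the cell it falls in. This is not the paper's mixture-of-spaces decomposition, whose details the abstract does not give. The grid partition, class name, and parameters are illustrative assumptions.

```python
class SparseLinearAttention:
    """Toy recurrent linear attention with one state per spatial grid cell."""

    def __init__(self, dim, cell_size):
        self.dim = dim
        self.cell_size = cell_size
        self.states = {}  # (cell_x, cell_y) -> (S, z), created lazily

    def _state(self, x, y, create):
        cell = (x // self.cell_size, y // self.cell_size)
        if cell not in self.states and create:
            S = [[0.0] * self.dim for _ in range(self.dim)]  # sum of key-value outer products
            z = [0.0] * self.dim                             # sum of keys (normalizer)
            self.states[cell] = (S, z)
        return self.states.get(cell)

    def update(self, x, y, key, value):
        """Process one event: only the state of its own cell is modified."""
        S, z = self._state(x, y, create=True)
        for i in range(self.dim):
            z[i] += key[i]
            for j in range(self.dim):
                S[i][j] += key[i] * value[j]

    def query(self, x, y, q, eps=1e-6):
        """Read out the attention result from the cell containing (x, y)."""
        state = self._state(x, y, create=False)
        if state is None:
            return [0.0] * self.dim
        S, z = state
        den = sum(q[i] * z[i] for i in range(self.dim)) + eps
        return [sum(q[i] * S[i][j] for i in range(self.dim)) / den
                for j in range(self.dim)]
```

Keys and queries are assumed to be non-negative (for example, after a positive feature map), as is usual for linear attention. The per-event cost depends on the state of one cell only, which is the intuition behind exploiting event sparsity.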


[90] TaPD: Temporal-adaptive Progressive Distillation for Observation-Adaptive Trajectory Forecasting in Autonomous Driving cs.CV | cs.AI | cs.ROPDF

Mingyu Fan, Yi Liu, Hao Zhou, Deheng Qian, Mohammad Haziq Khan

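TL;DR: This paper proposes TaPD (Temporal-adaptive Progressive Distillation), a plug-and-play framework that addresses the performance drop of trajectory prediction in autonomous driving when the observed history has variable length (for example, extremely short observations caused by occlusion or limited sensing range). It comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) that explicitly reconstructs the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon "teachers" to short-horizon "students" via hierarchical feature regression, with a cosine-annealed distillation weighting scheme balancing forecasting supervision and feature alignment. For extremely short histories, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD.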

Details

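Motivation: Most existing trajectory predictors assume fixed-length histories and degrade substantially in real-world settings where observations are variable or extremely short (for example, due to occlusion or limited sensing range). The paper aims to develop a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths.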

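Result: Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, with especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner.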

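Insight: The contributions include: 1) TaPD, a unified observation-adaptive forecasting framework that combines progressive knowledge distillation (PKD) with explicit history backfilling (TBM) to handle variable-length observations; 2) a cosine-annealed distillation weighting scheme within PKD that balances forecasting supervision and feature alignment, improving optimization stability and cross-length consistency; 3) a decoupled pretrain-reconstruct-finetune protocol that preserves real-motion priors while adapting to backfilled inputs. Viewed objectively, combining knowledge distillation with history reconstruction yields a modular solution for incomplete observations that other work can borrow.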

Abstract: Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed-length histories and suffer substantial performance degradation when observations are variable or extremely short in real-world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon “teachers” to short-horizon “students” via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross-length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD and thereby improve OAF. We employ a decoupled pretrain-reconstruct-finetune protocol to preserve real-motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner. Code will be available at https://github.com/zhouhao94/TaPD.
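
The cosine-annealed distillation weighting mentioned in the abstract can be sketched as a schedule that scales the feature-alignment loss over training. The version below assumes the weight decays from `w_max` to `w_min` along a half cosine and is added to the forecasting loss. The direction of the schedule, its endpoints, and the function names are assumptions, since the abstract does not state them.

```python
import math

def cosine_annealed_weight(step, total_steps, w_max=1.0, w_min=0.0):
    """Half-cosine schedule from w_max (step 0) down to w_min (final step)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(math.pi * progress))

def total_loss(forecast_loss, feature_loss, step, total_steps):
    """Forecasting supervision plus the annealed feature-alignment (distillation) term."""
    return forecast_loss + cosine_annealed_weight(step, total_steps) * feature_loss
```

Under this assumed schedule, the student leans on the teacher's features early in training and is driven mostly by the forecasting objective near the end.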


[91] Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation cs.CVPDF

Keshen Zhou, Runnan Chen, Mingming Gong, Tongliang Liu

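TL;DR: This paper proposes HCF-RES, a multimodal framework for 3D instance-aware referring expression segmentation that addresses existing methods' reliance on sparse point clouds alone and their lack of rich visual semantics. Through hierarchical visual semantic decomposition and progressive multi-level fusion, it combines SAM instance masks with CLIP encoding, preserves object boundaries during 2D-to-3D projection, and integrates 2D semantic with 3D geometric features.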

Details

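Motivation: Existing 3D referring expression segmentation methods rely solely on sparse point clouds and lack the rich visual semantics needed for fine-grained descriptions, which makes cases where a description matches multiple or zero targets hard to handle.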

Result: HCF-RES achieves state-of-the-art (SOTA) results on the ScanRefer and Multi3DRefer benchmarks.

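Insight: The contributions are Hierarchical Visual Semantic Decomposition, which uses SAM instance masks to guide CLIP encoding at two granularities (pixel level and instance level), and Progressive Multi-level Fusion, which integrates features through intra-modal collaboration, cross-modal adaptive weighting, and language-guided refinement. Together they improve natural-language-based object localization in 3D scenes.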

Abstract: Generalised 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes based on natural language, even when descriptions match multiple or zero targets. Existing methods rely solely on sparse point clouds, lacking rich visual semantics for fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations. First, Hierarchical Visual Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities – pixel-level and instance-level features – preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.


[92] NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving cs.CV | cs.RO | eess.IVPDF

Kai Luo, Xu Wang, Rui Fan, Kailun Yang

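TL;DR: This paper proposes NOVA (Next-step Open-Vocabulary Autoregression), a new paradigm for 3D multi-object tracking in autonomous driving. It reformulates 3D trajectories as structured spatio-temporal semantic sequences and uses the autoregressive capabilities of large language models to turn tracking into a principled next-step sequence completion process, improving generalization in open-vocabulary settings.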

Details

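Motivation: Existing 3D multi-object tracking methods are limited by closed-set assumptions and "semantic-blind" heuristics and generalize poorly to unknown targets. The paper aims to address generalization in open-world perception through generative spatio-temporal semantic modeling.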

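Result: Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate NOVA's superior performance. On nuScenes in particular, NOVA achieves an AMOTA of 22.41% on Novel categories, a 20.21% absolute improvement over the baseline, using only a compact 0.5B-parameter autoregressive model.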

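Insight: The core novelty is shifting 3D tracking from traditional fragmented distance-based matching to generative spatio-temporal semantic modeling. By treating trajectories as sequences and using the autoregressive capabilities of LLMs, the model can explicitly exploit the hierarchical structure of language space to resolve fine-grained semantic ambiguities and can maintain identity consistency over complex long sequences through high-level commonsense reasoning, suggesting a new direction for open-vocabulary perception tasks.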

Abstract: Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.


[93] GazeMoE: Perception of Gaze Target with Mixture-of-Experts cs.CV | cs.AIPDF

Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero

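TL;DR: This paper proposes GazeMoE, an end-to-end Mixture-of-Experts (MoE) framework for estimating human gaze targets from visible images. MoE modules selectively use multimodal cues (such as eyes, head pose, gestures, and contextual features) from a frozen foundation model, and a class-balancing auxiliary loss together with data augmentation addresses class imbalance and improves robustness.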

Details

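Motivation: Estimating human gaze targets from visible images is critical for robots to understand human attention, but building generalizable neural architectures and training paradigms remains challenging, and integrating multimodal cues requires adaptive, efficient decoding mechanisms.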

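Result: Extensive experiments on benchmark datasets show that GazeMoE achieves state-of-the-art performance on challenging gaze estimation tasks, outperforming existing methods.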

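Insight: The novelty is bringing the MoE mechanism to gaze target perception so that multimodal cues from a foundation model are used adaptively. A class-balancing auxiliary loss and data augmentations such as region-specific cropping and photometric transformations improve robustness to class imbalance.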

Abstract: Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues – including eyes, head poses, gestures, and contextual features – demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE
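
For readers unfamiliar with Mixture-of-Experts routing, the sketch below shows a generic top-k softmax gate that mixes the outputs of the selected experts. It illustrates the general MoE mechanism only. The abstract does not describe GazeMoE's gating, expert count, or routing rule, so `top_k`, the renormalization, and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_combine(gate_logits, expert_outputs, top_k=2):
    """Mix the outputs of the top_k experts, weighted by renormalized gate scores.

    gate_logits: one score per expert, produced by a gating network.
    expert_outputs: one output vector per expert, all of the same length.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    dim = len(expert_outputs[0])
    return [sum(probs[i] / norm * expert_outputs[i][d] for i in chosen)
            for d in range(dim)]
```

Only the selected experts contribute to the output, which is how an MoE layer can specialize experts on different cues while keeping the per-input cost bounded.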


[94] ODD-SEC: Onboard Drone Detection with a Spinning Event Camera cs.CVPDF

Kuan Dai, Hongxin Zhang, Sheng Zhong, Yi Zhou

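TL;DR: This paper presents ODD-SEC, a real-time drone detection system for moving carriers. A spinning event camera provides a 360° horizontal field of view without motion compensation, and a lightweight neural network performs efficient spatiotemporal learning. The system runs in real time on an onboard Jetson Orin NX and achieves reliable detection and bearing estimation under adverse illumination and fast motion.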

Details

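Motivation: Existing event-camera-based drone detection solutions usually assume a static camera, which limits their use on moving carriers (such as quadrupedal robots and unmanned ground vehicles) during field operations.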

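Result: Outdoor experiments validate reliable detection under challenging conditions, with a mean angular error below 2°, making the system suitable for real-world surveillance applications.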

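Insight: The contributions include: 1) a spinning event camera that provides a 360° horizontal field of view and drone bearing estimation; 2) an image-like event representation that needs no motion compensation; 3) a lightweight neural network architecture for efficient spatiotemporal learning, suited to deployment on mobile platforms.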

Abstract: The rapid proliferation of drones requires balancing innovation with regulation. To address security and privacy concerns, techniques for drone detection have attracted significant attention. Passive solutions, such as frame camera-based systems, offer versatility and energy efficiency under typical conditions but are fundamentally constrained by their operational principles in scenarios involving fast-moving targets or adverse illumination. Inspired by biological vision, event cameras asynchronously detect per-pixel brightness changes, offering high dynamic range and microsecond-level responsiveness that make them uniquely suited for drone detection in conditions beyond the reach of conventional frame-based cameras. However, the design of most existing event-based solutions assumes a static camera, greatly limiting their applicability to moving carriers, such as quadrupedal robots or unmanned ground vehicles, during field operations. In this paper, we introduce a real-time drone detection system designed for deployment on moving carriers. The system utilizes a spinning event-based camera, providing a 360° horizontal field of view and enabling bearing estimation of detected drones. A key contribution is a novel image-like event representation that operates without motion compensation, coupled with a lightweight neural network architecture for efficient spatiotemporal learning. Implemented on an onboard Jetson Orin NX, the system can operate in real time. Outdoor experimental results validate reliable detection with a mean angular error below 2° under challenging conditions, underscoring its suitability for real-world surveillance applications. We will open-source our complete pipeline to support future research.


[95] HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models cs.CV | cs.AIPDF

Lincen Bai, Hedi Tabia, Raul Santos-Rodriguez

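TL;DR: This paper proposes HiPP-Prune, a hierarchical preference-conditioned structured pruning framework for vision-language models (VLMs). It treats pruning as conditional resource allocation under multiple objectives: a single policy invocation outputs a global pruning blueprint that factorizes decisions into an overall sparsity budget and a layer-wise allocation, letting users query trade-offs by specifying a preference vector.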

Details

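Motivation: Pruning vision-language models (VLMs) for efficient deployment is challenging because compression affects not only task utility but also visual grounding, and can amplify object hallucinations even at the same sparsity level.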

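Result: Experiments on LLaVA with the POPE and ScienceQA benchmarks show that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness-utility trade-offs under matched sparsity budgets.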

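Insight: The novelty is treating pruning as a hierarchical, preference-driven planning problem and introducing a visual sensitivity signal, derived from attention flow between vision tokens and language hidden states, to guide the policy and discourage over-pruning of the vision-critical layers that facilitate cross-modal fusion. Plans are optimized with plan-level GRPO under a multi-objective return.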

Abstract: Pruning vision-language models (VLMs) for efficient deployment is challenging because compression can affect not only task utility but also visual grounding, often amplifying object hallucinations even at the same sparsity level. We present HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as conditional resource allocation under multiple objectives. HiPP-Prune makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into an overall sparsity budget and a layer-wise allocation, enabling queryable trade-offs via a user-specified preference vector. To account for VLM-specific failure modes, our policy state integrates a visual sensitivity signal derived from attention flow between vision tokens and language hidden states, discouraging over-pruning of vision-critical layers that facilitate cross-modal fusion. We optimize pruning plans with plan-level Group Relative Policy Optimization (GRPO) under a multi-objective return that combines task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy to reduce unproductive exploration in high-sparsity regimes. Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness–utility trade-offs under matched sparsity budgets.
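
The factorization into an overall sparsity budget and a layer-wise allocation can be illustrated with a small sketch. Here the budget fixes the mean sparsity, allocation logits decide how it is spread across layers, and a per-layer visual sensitivity value reduces the share of vision-critical layers. The softmax-style split, the way sensitivity enters, and the cap are assumptions made for illustration. The abstract does not give the policy's actual parameterization.

```python
import math

def layerwise_sparsity(budget, alloc_logits, sensitivity, max_sparsity=0.9):
    """Spread an overall sparsity budget across layers.

    budget: target mean sparsity over all layers (0 to 1).
    alloc_logits: per-layer allocation scores (hypothetical policy output).
    sensitivity: per-layer visual sensitivity; higher means prune less.
    The mean of the result equals the budget unless the cap clips a layer.
    """
    scores = [math.exp(a - s) for a, s in zip(alloc_logits, sensitivity)]
    total = sum(scores)
    n_layers = len(scores)
    raw = [budget * n_layers * sc / total for sc in scores]
    return [min(r, max_sparsity) for r in raw]
```

For two layers with equal logits and a budget of 0.5, raising the second layer's sensitivity shifts pruning toward the first layer while keeping the mean at 0.5.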


[96] Can we Trust Unreliable Voxels? Exploring 3D Semantic Occupancy Prediction under Label Noise cs.CV | cs.RO | eess.IVPDF

Wenxin Li, Kunyu Peng, Di Wen, Junwei Zheng, Jiale Wei

TL;DR: 本文针对3D语义占据预测任务中体素标注存在结构伪影和动态拖尾噪声的问题,建立了首个面向占据不对称和动态拖尾噪声的基准OccNL,并提出了一种鲁棒的标签噪声学习框架DPR-Occ。该框架通过双源部分标签推理构建可靠监督,结合时序模型记忆和表征级结构亲和力动态扩展和修剪候选标签集,在SemanticKITTI数据集上验证了其有效性。

Details

Motivation: 解决3D语义占据预测中真实世界体素标注因结构伪影和动态拖尾效应而不可靠的问题,探究自动驾驶系统能否安全依赖此类有噪声的监督信号。

Result: 在SemanticKITTI数据集上的实验表明,DPR-Occ在极端噪声(如90%标签噪声)下能防止几何和语义崩溃,相比适应3D占据预测任务的现有标签噪声学习基线,取得了显著的性能提升(最高提升2.57% mIoU和13.91% IoU)。

Insight: 揭示了先进的2D标签噪声学习策略在稀疏3D体素空间中会灾难性失效的关键领域差距;创新性地提出了通过双源(时序记忆与结构亲和力)部分标签推理来动态构建可靠监督的框架,为动态环境中安全关键的机器人感知提供了可靠基础。

Abstract: 3D semantic occupancy prediction is a cornerstone of robotic perception, yet real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects. This raises a critical but underexplored question: can autonomous systems safely rely on such unreliable occupancy supervision? To systematically investigate this issue, we establish OccNL, the first benchmark dedicated to 3D occupancy under occupancy-asymmetric and dynamic trailing noise. Our analysis reveals a fundamental domain gap: state-of-the-art 2D label noise learning strategies collapse catastrophically in sparse 3D voxel spaces, exposing a critical vulnerability in existing paradigms. To address this challenge, we propose DPR-Occ, a principled label noise-robust framework that constructs reliable supervision through dual-source partial label reasoning. By synergizing temporal model memory with representation-level structural affinity, DPR-Occ dynamically expands and prunes candidate label sets to preserve true semantics while suppressing noise propagation. Extensive experiments on SemanticKITTI demonstrate that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Notably, even at 90% label noise, our method achieves significant performance gains (up to 2.57% mIoU and 13.91% IoU) over existing label noise learning baselines adapted to the 3D occupancy prediction task. By bridging label noise learning and 3D perception, OccNL and DPR-Occ provide a reliable foundation for safety-critical robotic perception in dynamic environments. The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL.


[97] FlowMotion: Training-Free Flow Guidance for Video Motion Transfer cs.CVPDF

Zhen Wang, Youcan Xu, Jun Xiao, Long Chen

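TL;DR: FlowMotion is a training-free video motion transfer framework that achieves efficient, flexible motion transfer by directly using the predicted outputs of flow-based text-to-video (T2V) models. Its core is a flow guidance mechanism that extracts motion representations from latent predictions to align motion patterns between the source and generated videos, plus a velocity regularization strategy that stabilizes optimization and ensures smooth motion evolution.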

Details

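Motivation: Existing training-free methods depend on intermediate outputs of pretrained T2V models, which incurs heavy computational overhead and limits flexibility. The paper aims for more efficient and flexible video motion transfer.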

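Result: Compared with existing training-free methods, FlowMotion is more time- and resource-efficient, and its performance is competitive with state-of-the-art methods.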

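Insight: The novelty is building flow guidance from the rich temporal information inherent in early latent predictions and introducing velocity regularization to stabilize optimization, which suggests an efficient new approach to motion control in video generation.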

Abstract: Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.


[98] DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models cs.CV | cs.AIPDF

Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne

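TL;DR: This paper proposes DEX-AR, a dynamic explainability method designed for autoregressive vision-language models. By computing layer-wise gradients during token-by-token generation, it produces per-token and sequence-level heatmaps that highlight the key image regions, addressing the difficulty traditional methods have in explaining the complex decision process of such models.

Details

Motivation: As vision-language models grow more complex and more widely used, understanding their decision-making becomes crucial. Traditional explainability methods designed for classification struggle with modern autoregressive VLMs because of their complex token-by-token generation and the intricate interactions between visual and textual modalities.

Result: Evaluations on ImageNet, VQAv2, and PascalVOC show consistent improvements on both perturbation-based metrics, which use a novel normalized perplexity measure, and segmentation-based metrics.

Insight: The paper's claimed contributions are two key mechanisms: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing visually grounded tokens from purely linguistic ones. Viewed objectively, integrating explainability analysis dynamically into autoregressive generation and designing quantitative metrics tailored to generation tasks is a direction worth borrowing.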

Abstract: As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model’s textual responses. The proposed method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually-grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics, using a novel normalized perplexity measure, and segmentation-based metrics.


[99] Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces cs.CVPDF

Eitan Shaar, Ariel Shaulov, Yalcin Tur, Gal Chechik, Ravid Shwartz-Ziv

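TL;DR: This paper proposes LTA (Latent Transfer Attack), an adversarial attack that optimizes perturbations in the latent space of a pretrained Stable Diffusion VAE rather than directly in pixel space, producing adversarial examples that are more transferable and more robust.

Details

Motivation: Traditional pixel-space attacks usually produce high-frequency, texture-like noise that is sensitive to common preprocessing (such as resizing and cropping) and transfers poorly across model architectures. LTA aims to address these problems by using a generative model's latent space to produce more structured, low-frequency perturbations.

Result: Experiments on multiple CNN and vision-transformer targets show that LTA achieves strong transfer attack success. Its perturbations are spatially more coherent and predominantly low-frequency, differ qualitatively from pixel-space baselines, and occupy a distinct point in the transfer-quality trade-off.

Insight: The novelty is moving the optimization domain from pixel space to the latent space of a pretrained generative model, combined with Expectation Over Transformations (EOT) and latent Gaussian smoothing to improve robustness to preprocessing and optimization stability. This offers a new way to use modern generative priors for robustness evaluation.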

Abstract: Adversarial attacks are a central tool for probing the robustness of modern vision models, yet most methods optimize perturbations directly in pixel space under $\ell_\infty$ or $\ell_2$ constraints. While effective in white-box settings, pixel-space optimization often produces high-frequency, texture-like noise that is brittle to common preprocessing (e.g., resizing and cropping) and transfers poorly across architectures. We propose LTA (Latent Transfer Attack), a transfer-based attack that instead optimizes perturbations in the latent space of a pretrained Stable Diffusion VAE. Given a clean image, we encode it into a latent code and optimize the latent representation to maximize a surrogate classifier loss, while softly enforcing a pixel-space $\ell_\infty$ budget after decoding. To improve robustness to resolution mismatch and standard input pipelines, we incorporate Expectation Over Transformations (EOT) via randomized resizing, interpolation, and cropping, and apply periodic latent Gaussian smoothing to suppress emerging artifacts and stabilize optimization. Across a suite of CNN and vision-transformer targets, LTA achieves strong transfer attack success while producing spatially coherent, predominantly low-frequency perturbations that differ qualitatively from pixel-space baselines and occupy a distinct point in the transfer-quality trade-off. Our results highlight pretrained generative latent spaces as an effective and structured domain for adversarial optimization, bridging robustness evaluation with modern generative priors.


[100] WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection cs.CVPDF

Peng Chen, Chao Huang

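TL;DR: This paper proposes WMoE-CLIP, a zero-shot anomaly detection method that combines wavelet transforms with a mixture-of-experts mechanism to strengthen prompt learning for CLIP, aiming to improve detection of unseen anomaly patterns through multi-frequency features and global semantic modeling.

Details

Motivation: Existing zero-shot anomaly detection methods based on vision-language models typically rely on fixed textual prompts, which struggle to capture complex semantics, and attend only to spatial-domain features, which limits their ability to detect subtle anomalies.

Result: Extensive experiments on 14 industrial and medical datasets demonstrate the method's effectiveness, but the abstract gives no specific quantitative results (such as accuracy or AUC) and no direct comparison with SOTA models.

Insight: Contributions include a variational autoencoder that models global semantics to make prompts more adaptable; wavelet decomposition that extracts multi-frequency image features and dynamically refines text embeddings through cross-modal interaction; and a semantic-aware mixture-of-experts module that aggregates contextual information.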

Abstract: Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.


[101] P-SLCR: Unsupervised Point Cloud Semantic Segmentation via Prototypes Structure Learning and Consistent Reasoning cs.CVPDF

Lixin Zhan, Jie Jiang, Tianjian Zhou, Yukun Du, Yan Zheng

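TL;DR: This paper proposes P-SLCR, an unsupervised point cloud semantic segmentation method that uses prototype structure learning and consistent reasoning to address the lack of annotated data for point cloud scenes. It first applies Consistent Structure Learning, selecting high-quality features to establish structural feature learning between consistent points and the consistent prototype library. It then applies Semantic Relation Consistent Reasoning, which builds a prototype inter-relation matrix between the consistent and ambiguous prototype libraries to preserve semantic consistency.

Details

Motivation: Current point cloud semantic segmentation methods rely heavily on manual annotation, while unsupervised point cloud semantic segmentation is still at an early stage. The absence of annotations and of pre-training poses major challenges, so effective unsupervised strategies are needed.

Result: Evaluated extensively on S3DIS, SemanticKITTI, and Scannet, the method achieves the best performance among unsupervised methods. Notably, it reaches 47.1% mIoU on Area-5 of S3DIS, surpassing the classical fully supervised method PointNet by 2.5%.

Insight: The novelty is a prototype library-driven unsupervised segmentation strategy that learns structural features of point clouds and preserves semantic consistency through consistent structure learning and semantic relation consistent reasoning, achieving effective segmentation without manual labels and suggesting a new direction for unsupervised point cloud learning.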

Abstract: Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre-training. The development of effective strategies is crucial in this context. In this paper, we propose a novel prototype library-driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P-SLCR). First, we propose a Consistent Structure Learning to establish structural feature learning between consistent points and the library of consistent prototypes by selecting high-quality features. Second, we propose a Semantic Relation Consistent Reasoning that constructs a prototype inter-relation matrix between consistent and ambiguous prototype libraries separately. This process ensures the preservation of semantic consistency by imposing constraints on consistent and ambiguous prototype libraries through the prototype inter-relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and Scannet datasets, achieving the best performance compared to unsupervised methods. Specifically, the mIoU of 47.1% is achieved for Area-5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%.


[102] WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching cs.CVPDF

Weilun Feng, Guoxin Fan, Haotong Qin, Chuanguang Yang, Mingqiang Wu

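TL;DR: This paper proposes WorldCache, a framework for accelerating inference in diffusion-based world models. By introducing curvature-guided heterogeneous token prediction and chaotic-prioritized adaptive skipping, it addresses the caching challenges posed by token heterogeneity and non-uniform temporal dynamics in world models, achieving substantial speedups without significant loss of generation quality.

Details

Motivation: Diffusion world models hold great potential for unified world simulation, but their iterative denoising is computationally expensive, which makes them hard to use for interactive applications and long-horizon rollouts. Existing feature caching methods work poorly because of the token heterogeneity and non-uniform temporal dynamics specific to world models, so a dedicated solution is needed.

Result: Experiments on diffusion world models show that WorldCache delivers up to 3.7× end-to-end speedup while maintaining 98% rollout quality, a clear advantage in resource-constrained scenarios.

Insight: Contributions include (1) curvature-guided heterogeneous token prediction, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor to chaotic tokens; and (2) chaotic-prioritized adaptive skipping, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Both are tailored to world models and balance speed against accuracy.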

Abstract: Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: token heterogeneity from multi-modal coupling and spatial variation, and non-uniform temporal dynamics where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose WorldCache, a caching framework tailored to diffusion world models. We introduce Curvature-guided Heterogeneous Token Prediction, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor for chaotic tokens with abrupt direction changes. We also design Chaotic-prioritized Adaptive Skipping, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to 3.7× end-to-end speedups while maintaining 98% rollout quality, demonstrating the vast advantages and practicality of WorldCache in resource-constrained scenarios. Our code is released at https://github.com/FofGofx/WorldCache.


[103] K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging cs.CV | cs.AIPDF

Jiajun Zeng, Shadi Albarqouni

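TL;DR: This paper proposes K-MaT (Knowledge-Anchored Manifold Transport), a framework that addresses the performance collapse seen when large-scale biomedical vision-language models are transferred through cross-modal prompt learning from high-end imaging modalities (such as CT) to low-end ones (such as X-ray) and fall into modality-specific shortcuts. By factorizing prompts, anchoring them to clinical text descriptions, and aligning the low-end prompt manifold to the visually grounded high-end one with Fused Gromov-Wasserstein optimal transport, it enables zero-shot cross-modal deployment without low-end training images.

Details

Motivation: Existing biomedical vision-language models tend to fall into modality-specific shortcuts when moved from high-end to low-end imaging modalities, causing significant performance degradation or outright collapse. The goal is robust cross-modal knowledge transfer without low-end training data.

Result: K-MaT achieves state-of-the-art results on four cross-modal benchmarks (including dermoscopy, mammography to ultrasound, and CT to chest X-ray), raising the average harmonic mean of accuracy to 44.1% (from 42.0% for BiomedCoOp) and macro-F1 to 36.2%. On the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods (CoOp, for example, drops to 27.0% accuracy on the low-end modality) and preserves robust cross-modal performance.

Insight: The main novelty is factorizing prompts and anchoring them to clinical knowledge (text descriptions), and using Fused Gromov-Wasserstein optimal transport to align prompt manifolds across modalities. This gives an efficient route to zero-shot cross-modal deployment of medical vision-language models without target-modality training data. Viewed objectively, applying optimal transport to cross-modal prompt manifold alignment is an idea worth borrowing.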

Abstract: Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp’s 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.
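
The headline metric above is a harmonic mean of accuracies. As a minimal sketch, assuming it combines accuracy on the high-end and the low-end modality (the abstract does not spell out which two quantities are averaged), the following shows why a harmonic mean penalizes a collapse on one modality more than an arithmetic mean would. The function name and all numbers are illustrative, not values from the paper.

```python
def harmonic_mean(acc_a: float, acc_b: float) -> float:
    """Harmonic mean of two accuracies in [0, 1]; returns 0.0 if either is 0."""
    if acc_a <= 0.0 or acc_b <= 0.0:
        return 0.0
    return 2.0 * acc_a * acc_b / (acc_a + acc_b)


# Two hypothetical models with the same arithmetic mean accuracy (0.60).
balanced = harmonic_mean(0.60, 0.60)   # equally good on both modalities
collapsed = harmonic_mean(0.93, 0.27)  # strong on one modality, weak on the other

print(f"balanced:  {balanced:.4f}")
print(f"collapsed: {collapsed:.4f}")
```

Both hypothetical models share the same arithmetic mean, but the harmonic mean of the collapsed one is markedly lower, which is why the metric rewards methods that stay usable on the low-end modality.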


[104] Dynamic Chunking Diffusion Transformer cs.CV | cs.AI | cs.LGPDF

Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia

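TL;DR: This paper proposes the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner. The method learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, and it adapts the compression to the diffusion timestep, using fewer tokens at noisy stages and more as details emerge.

Details

Motivation: Conventional Diffusion Transformers (DiT) process images with a static patchify operation that spends uniform compute on all regions, whether low- or high-information, ignoring both the varying detail across image regions and the coarse-to-fine nature of denoising.

Result: On class-conditional ImageNet 256×256 generation, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines at 4× and 16× compression.

Insight: The core novelty is a data-dependent, timestep-adaptive dynamic token compression mechanism that allocates compute non-uniformly. It learns meaningful visual segmentations without explicit supervision, can be obtained quickly by fine-tuning pretrained DiT checkpoints, and composes with other dynamic computation methods to further reduce generation FLOPs.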

Abstract: Diffusion Transformers process images as fixed-length sequences of tokens produced by a static patchify operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. Furthermore, it also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet 256×256, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across 4× and 16× compression, showing this is a promising technique with potential further applications to pixel-space, video and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to 8× fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.


[105] Computer vision-based estimation of invertebrate biomass cs.CVPDF

Mikko Impiö, Philipp M. Rehsen, Jarrett Blair, Cecilie Mielec, Arne J. Beermann

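TL;DR: This paper proposes two computer vision-based methods for estimating invertebrate dry mass: fitting a linear model with novel predictors (area and sinking speed) computed automatically by an imaging device, and training end-to-end deep neural networks (single-view, multi-view, and metadata-aware architectures). Both use image sequences of specimens sinking in an ethanol column, captured by the BIODISCOVER dual-camera system, and aim to replace the traditional time-consuming and destructive dry weighing process.

Details

Motivation: In quantitative monitoring of invertebrate biodiversity, traditional dry mass measurement is manual, time-consuming, destructive to specimens, and hard to scale. The goal is automated biomass estimation from images alone.

Result: On a large collected dataset of paired dry mass measurements and image sequences, and combined with automatic taxonomic classification, the methods achieve a median percentage error of 10-20% for individuals and accurate group-level dry mass estimates.

Insight: Contributions include introducing area and sinking speed as automatically computable biomass predictors and exploring several deep learning architectures and optimization strategies. Viewed objectively, by fusing image and metadata inputs and paying attention to the choice of evaluation metrics, the approach offers a scalable, non-destructive solution for ecological monitoring.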

Abstract: The ability to estimate invertebrate biomass using only images could help scaling up quantitative biodiversity monitoring efforts. Computer vision-based methods have the potential to omit the manual, time-consuming, and destructive process of dry weighing specimens. We present two approaches for dry mass estimation that do not require additional manual effort apart from imaging the specimens: fitting a linear model with novel predictors, automatically calculated by an imaging device, and training a family of end-to-end deep neural networks for the task, using single-view, multi-view, and metadata-aware architectures. We propose using area and sinking speed as predictors. These can be calculated with BIODISCOVER, which is a dual-camera system that captures image sequences of specimens sinking in an ethanol column. For this study, we collected a large dataset of dry mass measurement and image sequence pairs to train and evaluate models. We show that our methods can estimate specimen dry mass even with complex and visually diverse specimen morphologies. Combined with automatic taxonomic classification, our approach is an accurate method for group-level dry mass estimation, with a median percentage error of 10-20% for individuals. We highlight the importance of choosing appropriate evaluation metrics, and encourage using both percentage errors and absolute errors as metrics, because they measure different properties. We also explore different optimization losses, data augmentation methods, and model architectures for training deep-learning models.
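
The abstract's point that percentage and absolute errors measure different properties can be illustrated with a small sketch. The masses below are made-up values in milligrams, and the helper names are hypothetical; nothing here comes from the paper's data.

```python
from statistics import median


def percentage_errors(true_mass, pred_mass):
    """Per-specimen absolute percentage error, in percent."""
    return [abs(p - t) / t * 100.0 for t, p in zip(true_mass, pred_mass)]


def absolute_errors(true_mass, pred_mass):
    """Per-specimen absolute error, in the same unit as the inputs."""
    return [abs(p - t) for t, p in zip(true_mass, pred_mass)]


# Illustrative dry masses in mg, spanning two orders of magnitude.
true_mg = [0.10, 0.50, 2.00, 10.00]
pred_mg = [0.12, 0.45, 2.20, 9.00]

pct = percentage_errors(true_mg, pred_mg)
abs_err = absolute_errors(true_mg, pred_mg)

median_pct = median(pct)
median_abs = median(abs_err)

print(f"median percentage error: {median_pct:.1f}%")
print(f"median absolute error:   {median_abs:.3f} mg")
```

In this toy data the smallest specimen has the largest percentage error but the smallest absolute error, while the largest specimen shows the opposite pattern, so each metric highlights a different kind of mistake.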


[106] OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis cs.CVPDF

Yuxuan Fan, Jing Hao, Hong Chen, Jiahao Bao, Yihua Shao

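TL;DR: This paper proposes OralGPT-Plus, an agentic vision-language model for panoramic dental radiograph analysis. It learns to use visual tools through a reinforcement learning framework and performs iterative, symmetry-aware diagnostic reasoning. The authors build the DentalProbe dataset and the MMOral-X benchmark, and experiments show the method outperforms strong baselines on multiple benchmarks.

Details

Motivation: Existing vision-language models follow a static single-pass inference paradigm that cannot meet the needs of panoramic dental radiograph analysis, namely fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, which limits their clinical reliability.

Result: On the proposed MMOral-X benchmark and on established panoramic benchmarks, OralGPT-Plus shows consistent and reliable improvements over strong baselines, indicating the effectiveness of interactive and symmetry-aware reasoning.

Insight: The core novelty is an agentic modeling paradigm with a reinforcement learning framework (using a rubric-based reward and a conditioned diagnostic-driven reward) that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning. The work also contributes a structured dataset with expert diagnostic trajectories and the first benchmark for holistic panoramic diagnosis, giving the field new supervision data and an evaluation standard.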

Abstract: Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.


[107] Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation cs.CVPDF

Jonas Ernst, Wolfgang Boettcher, Lukas Hoyer, Jan Eric Lenssen, Bernt Schiele

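TL;DR: Rewis3d is a framework that uses feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation of 2D images. It treats 3D scene reconstruction as an auxiliary supervisory signal, propagating sparse annotations through 3D geometry recovered from 2D videos to narrow the gap with fully supervised methods.

Details

Motivation: Dense pixel-level annotation is costly and a bottleneck for training segmentation models. Sparse annotation is an efficient weakly supervised alternative but still leaves a performance gap. The paper aims to use geometric cues from 3D reconstruction to improve semantic segmentation under sparse supervision.

Result: Under sparse supervision, Rewis3d achieves state-of-the-art performance, outperforming existing approaches by 2-7% with no additional labels or inference overhead.

Insight: The novelty is using 3D reconstruction as an auxiliary signal for weakly supervised semantic segmentation: a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, and state-of-the-art feed-forward reconstruction supplies the geometry used to propagate sparse annotations. This suggests a new way to use multimodal information in weakly supervised learning.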

Abstract: We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.


[108] Prompt Group-Aware Training for Robust Text-Guided Nuclei Segmentation cs.CV | cs.AIPDF

Yonghuang Wu, Zhenyang Liang, Wenwen Zeng, Xuan Xie, Jinhua Yu

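TL;DR: This paper proposes a prompt group-aware training framework to address the sensitivity of foundation models such as SAM3 to prompt wording in text-guided medical image segmentation. It organizes semantically related prompts into prompt groups that share the same ground-truth mask, and combines quality-guided group regularization with a logit-level consistency constraint that uses a stop-gradient strategy. Without modifying the model architecture or inference, it substantially improves segmentation robustness and generalization.

Details

Motivation: In text-guided medical image segmentation, existing foundation models such as SAM3 produce inconsistent masks even for semantically equivalent descriptions. This prompt sensitivity limits their reliability in clinical and pathology workflows.

Result: On multi-dataset nuclei segmentation benchmarks, the method yields consistent gains under textual prompting and markedly reduces performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, it improves Dice by an average of 2.16 points.

Insight: The novelty is recasting prompt sensitivity as a within-group consistency problem and designing a prompt group-aware training framework that needs no architectural changes, using quality-guided regularization and a logit consistency constraint to make the model robust to prompt variation. This provides a more reliable basis for vision-language segmentation in computational pathology.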

Abstract: Foundation models such as Segment Anything Model 3 (SAM3) enable flexible text-guided medical image segmentation, yet their predictions remain highly sensitive to prompt formulation. Even semantically equivalent descriptions can yield inconsistent masks, limiting reliability in clinical and pathology workflows. We reformulate prompt sensitivity as a group-wise consistency problem. Semantically related prompts are organized into prompt groups sharing the same ground-truth mask, and a prompt group-aware training framework is introduced for robust text-guided nuclei segmentation. The approach combines (i) a quality-guided group regularization that leverages segmentation loss as an implicit ranking signal, and (ii) a logit-level consistency constraint with a stop-gradient strategy to align predictions within each group. The method requires no architectural modification and leaves inference unchanged. Extensive experiments on multi-dataset nuclei benchmarks show consistent gains under textual prompting and markedly reduced performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, our method improves Dice by an average of 2.16 points. These results demonstrate improved robustness and generalization for vision-language segmentation in computational pathology.
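
To make the notion of prompt sensitivity concrete, here is a minimal sketch of the Dice coefficient evaluated over one prompt group. It is not the paper's training method: the masks, the prompt strings, and the spread statistic are all hypothetical, chosen only to show how prompts that share one ground-truth mask can still score very differently.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as flat lists of 0/1."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One ground-truth mask shared by every prompt in the group (toy 8-pixel image).
target = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical predictions for three semantically related prompts.
group_preds = {
    "nuclei":           [1, 1, 1, 1, 0, 0, 0, 0],
    "cell nuclei":      [1, 1, 1, 0, 0, 0, 0, 0],
    "dark round blobs": [1, 1, 0, 0, 1, 1, 0, 0],
}

scores = {prompt: dice(mask, target) for prompt, mask in group_preds.items()}
spread = max(scores.values()) - min(scores.values())

for prompt, score in scores.items():
    print(f"{prompt!r}: Dice = {score:.3f}")
print(f"within-group spread: {spread:.3f}")
```

A large within-group spread like this one is the kind of inconsistency that a group-wise consistency objective would aim to shrink.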


[109] REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation cs.CVPDF

Maëlic Neau, Zoe Falomir

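TL;DR: This paper proposes REACT++, a model for real-time scene graph generation that uses efficient feature extraction and subject-to-object cross-attention within the prototype space to improve relation prediction accuracy and substantially reduce inference latency while maintaining object detection performance.

Details

Motivation: Existing scene graph generation methods tend to pursue only one of relation prediction accuracy, object detection accuracy, or latency reduction. REACT++ aims to balance all three and reach the trade-off between performance and speed that real-time applications require.

Result: REACT++ achieves the highest inference speed among existing SGG models. Compared with the previous REACT version, it is 20% faster with an average gain of 10% in relation prediction accuracy, setting a new state of the art for real-time SGG.

Insight: The novelty is combining efficient feature extraction with subject-to-object cross-attention within the prototype space, a design that improves computational efficiency while retaining representational power and offers a reusable architectural idea for real-time visual relationship modeling.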

Abstract: Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at https://github.com/Maelic/SGG-Benchmark.


[110] DiffInf: Influence-Guided Diffusion for Supervision Alignment in Facial Attribute Learning cs.CVPDF

Basudha Pal, Rama Chellappa

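TL;DR: This paper proposes DiffInf, a framework that uses self-influence-guided diffusion to mitigate supervision errors caused by inconsistent annotation in facial attribute classification. It first computes self-influence scores for training samples to identify those that disproportionately destabilize optimization, then applies generative correction to these samples with a latent diffusion autoencoder so that their visual content better matches their labels while preserving identity and realism. The corrected samples replace the originals, forming an influence-refined dataset that improves downstream classification.

Details

Motivation: Facial attribute classification depends on large annotated datasets, but attributes such as age and expression are inherently continuous and ambiguous. Once they are discretized into categorical labels, annotator subjectivity and visual confounders (such as pose, illumination, and demographic variation) lead to inconsistent annotations and image-label mismatch, which impairs representation learning and degrades prediction.

Result: In multi-class facial attribute classification, DiffInf consistently improves generalization over standard noisy-label training, robust optimization baselines, and influence-based filtering, indicating that repairing influential annotation inconsistencies at the image level strengthens downstream classification without sacrificing distributional coverage.

Insight: The novelty is combining self-influence estimation with diffusion-based generation: high-influence samples are identified and generatively corrected rather than discarded, which preserves dataset size while improving supervision alignment. A lightweight predictor of high-influence membership is also trained as a surrogate influence regularizer, providing differentiable guidance during correction.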

Abstract: Facial attribute classification relies on large-scale annotated datasets in which many traits, such as age and expression, are inherently ambiguous and continuous but are discretized into categorical labels. Annotation inconsistencies arise from subjectivity and visual confounders such as pose, illumination, expression, and demographic variation, creating mismatch between images and assigned labels. These inconsistencies introduce supervision errors that impair representation learning and degrade downstream prediction. We introduce DiffInf, a self-influence–guided diffusion framework for mitigating annotation inconsistencies in facial attribute learning. We first train a baseline classifier and compute sample-wise self-influence scores using a practical first-order approximation to identify training instances that disproportionately destabilize optimization. Instead of discarding these influential samples, we apply targeted generative correction via a latent diffusion autoencoder to better align visual content with assigned labels while preserving identity and realism. To enable differentiable guidance during correction, we train a lightweight predictor of high-influence membership and use it as a surrogate influence regularizer. The edited samples replace the originals, yielding an influence-refined dataset of unchanged size. Across multi-class facial attribute classification, DiffInf consistently improves generalization compared with standard noisy-label training, robust optimization baselines, and influence-based filtering. Our results demonstrate that repairing influential annotation inconsistencies at the image level enhances downstream facial attribute classification without sacrificing distributional coverage.


[111] Locating and Editing Figure-Ground Organization in Vision Transformers cs.CVPDF

Stefan Arnold, René Gröbner

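TL;DR: This paper studies how a Vision Transformer (BEiT) resolves perceptual ambiguity in figure-ground organization, focusing on the preference for convex completion. Using a controlled perceptual conflict built from synthetic dart shapes, it finds that BEiT favors convex completion, and it uses logit attribution to localize specific functional units within transformer substructures (such as attention head L0H9) that govern this preference.

Details

Motivation: Vision Transformers face perceptual ambiguity when local geometric evidence conflicts with global organizational priors such as the Gestalt convexity prior. The paper aims to locate the internal mechanism that realizes the convexity preference in BEiT.

Result: BEiT reliably favors convex completion in the dart-shape conflict. Analysis of internal activations shows that figure-ground organization remains ambiguous through early and intermediate layers and resolves abruptly in later layers. Attention head L0H9 is identified as an early seed that introduces a weak bias toward convexity; scaling this single head down shifts the decision boundary so that concave evidence guides completion.

Insight: The novelty is linking a Gestalt perceptual principle (the convexity prior) to the internal mechanisms of a Transformer, using logit attribution and attention-head decomposition to localize the functional units that shape figure-ground organization. This opens a route to model interpretability and controllable editing.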

Abstract: Vision Transformers must resolve figure-ground organization by choosing between completions driven by local geometric evidence and those favored by global organizational priors, giving rise to a characteristic perceptual ambiguity. We aim to locate where the canonical Gestalt prior convexity is realized within the internal components of BEiT. Using a controlled perceptual conflict based on synthetic shapes of darts, we systematically mask regions that equally admit either a concave completion or a convex completion. We show that BEiT reliably favors convex completion under this competition. Projecting internal activations into the model’s discrete visual codebook space via logit attribution reveals that this preference is governed by identifiable functional units within transformer substructures. Specifically, we find that figure-ground organization is ambiguous through early and intermediate layers and resolves abruptly in later layers. By decomposing the direct effect of attention heads, we identify head L0H9 acting as an early seed, introducing a weak bias toward convexity. Downscaling this single attention head shifts the distributional mass of the perceptual conflict across a continuous decision boundary, allowing concave evidence to guide completion.


[112] Physical Simulator In-the-Loop Video Generation cs.CV | cs.AI | cs.GRPDF

Lin Geng Foo, Mark He Huang, Alexandros Lattas, Stylianos Moschoglou, Thabo Beeler

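TL;DR: This paper proposes Physical Simulator In-the-loop Video Generation (PSIVG), a framework that addresses the difficulty existing diffusion video generators have in obeying basic physical laws such as gravity, inertia, and collision. It integrates a physical simulator into the video diffusion process: starting from a template video generated by a pre-trained diffusion model, it reconstructs the 4D scene and foreground object meshes, initializes them in a physical simulator, and generates physically consistent trajectories that guide the video generator toward spatio-temporally physically coherent motion. It also proposes Test-Time Texture Consistency Optimization (TTCO), which adapts text and feature embeddings based on pixel correspondences from the simulator to improve texture consistency during object motion.

Details

Motivation: Diffusion-based video generation has made remarkable progress in visual realism but often violates basic physical laws. Generated objects move inconsistently across frames, exhibit implausible dynamics, or break physical constraints, which limits the realism and reliability of AI-generated videos.

Result: Comprehensive experiments show that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity.

Insight: The main novelty is bringing a physical simulator into the video generation pipeline and using simulated trajectories to guide the diffusion model, which strengthens physical consistency. The proposed TTCO technique further improves texture consistency by optimizing embeddings. Overall, it is a cross-disciplinary approach that combines physical simulation with generative models.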

Abstract: Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: https://vcai.mpi-inf.mpg.de/projects/PSIVG/


[113] Non-invasive Growth Monitoring of Small Freshwater Fish in Home Aquariums via Stereo Vision cs.CVPDF

Clemens Seibold, Anna Hilsmann, Peter Eisert

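TL;DR: This paper proposes a non-invasive, stereo vision-based method for monitoring the growth of small freshwater fish. A YOLOv11-Pose network detects fish and predicts keypoints; a refraction-aware epipolar constraint that accounts for the air-glass-water interfaces enables robust matching; and refraction-aware 3D triangulation recovers 3D keypoints from which fish length is estimated.

Details

Motivation: In aquaculture and home aquariums, monitoring fish growth is important for assessing health, but the fish are small and aquarium environments introduce strong refractive distortions, which makes size monitoring difficult. A practical, non-invasive, image-based measurement method that can be used frequently is needed.

Result: The method is validated on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions. The results show that filtering low-quality detections is essential for accurate length estimation.

Insight: The novelty is a refraction-aware stereo vision system that combines a refraction-corrected epipolar constraint, 3D triangulation, and detection filtering based on a learned quality score, giving a simple and practical solution for non-invasive growth monitoring in home aquariums.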

Abstract: Monitoring fish growth behavior provides relevant information about fish health in aquaculture and home aquariums. Yet, monitoring fish sizes poses different challenges, as fish are small and subject to strong refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative that allows frequent monitoring without disturbing the fish. In this paper, we propose a non-invasive refraction-aware stereo vision method to estimate fish length in aquariums. Our approach uses a YOLOv11-Pose network to detect fish and predict anatomical keypoints on the fish in each stereo image. A refraction-aware epipolar constraint accounting for the air-glass-water interfaces enables robust matching, and unreliable detections are removed using a learned quality score. A subsequent refraction-aware 3D triangulation recovers 3D keypoints, from which fish length is measured. We validate our approach on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions and demonstrate that filtering low-quality detections is essential for accurate length estimation. The proposed system offers a simple and practical solution for non-invasive growth monitoring and can be easily applied in home aquariums.
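
The refraction at the air-glass-water interfaces mentioned above follows Snell's law, $n_1 \sin\theta_1 = n_2 \sin\theta_2$. The sketch below traces a single ray's angle through flat, parallel interfaces. It is only an illustration of the underlying physics, not the paper's epipolar or triangulation code; the refractive indices are typical textbook values assumed here, and the function name is hypothetical.

```python
import math


def refract_angle(theta_in: float, n_in: float, n_out: float) -> float:
    """Angle (radians, from the surface normal) after refraction by Snell's law."""
    s = n_in / n_out * math.sin(theta_in)
    if abs(s) > 1.0:
        raise ValueError("total internal reflection: no refracted ray")
    return math.asin(s)


# Assumed typical refractive indices (not taken from the paper).
N_AIR, N_GLASS, N_WATER = 1.000, 1.52, 1.333

theta_air = math.radians(30.0)                               # camera ray hits the glass
theta_glass = refract_angle(theta_air, N_AIR, N_GLASS)       # inside the glass pane
theta_water = refract_angle(theta_glass, N_GLASS, N_WATER)   # inside the water

# With parallel flat interfaces, the glass changes the ray's lateral offset
# but not its final direction, so this matches a direct air-to-water refraction.
theta_direct = refract_angle(theta_air, N_AIR, N_WATER)

print(f"in glass: {math.degrees(theta_glass):.2f} deg")
print(f"in water: {math.degrees(theta_water):.2f} deg")
```

Because the ray bends toward the normal when entering water, a pinhole model that ignores refraction would place the fish at the wrong depth and scale, which is why the matching and triangulation steps need to model the interfaces.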


[114] What if? Emulative Simulation with World Models for Situated Reasoning cs.CVPDF

Ruiping Liu, Yufan Chen, Yuheng Zhang, Junwei Zheng, Kunyu Peng

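TL;DR: The paper introduces the WanderDream dataset to support emulative simulation with world models, targeting situated reasoning when active exploration is infeasible due to physical or safety constraints. The dataset comprises WanderDream-Gen (panoramic video trajectories) and WanderDream-QA (question-answer pairs). Experiments confirm the importance of mental simulation for situated reasoning and the effectiveness of world models on this task.

Details

Motivation: In real-world scenarios where active exploration is infeasible, such as physical constraints of robots or safety concerns of visually impaired users, an agent must mentally simulate from only a limited observation and answer spatial what-if questions.

Result: Experiments show that mental simulation is essential for situated reasoning, that world models achieve compelling performance on WanderDream-Gen, that imagination substantially facilitates reasoning on WanderDream-QA, and that the data transfer well to real-world scenarios.

Insight: The contributions are the first large-scale dataset for emulative simulation of mental exploration (WanderDream), the application of world models to situated reasoning in constrained environments, and a joint framework of trajectory generation and question-answer evaluation that validates the benefit of mental simulation.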

Abstract: Situated reasoning often relies on active exploration, yet in many real-world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what-if questions? We introduce WanderDream, the first large-scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream-QA contains 158K question-answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration-based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream-Gen, (3) that imagination substantially facilitates reasoning on WanderDream-QA, and (4) that WanderDream data exhibit remarkable transferability to real-world scenarios. The source code and all data will be released.


[115] CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization cs.CVPDF

Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang

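TL;DR: This paper presents CaTok, a 1D causal image tokenizer. Paired with a MeanFlow decoder, it learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, and it introduces the REPA-A regularization to stabilize and accelerate training.

Details

Motivation: Current visual tokenizers fall short when extending the causal tokenization paradigm of autoregressive language models to vision: non-causal sequences or heuristic orderings misalign with the “next-token prediction” pattern, while diffusion autoencoders either lack causality or introduce imbalance.

Result: On ImageNet reconstruction, CaTok achieves state-of-the-art results (0.75 FID, 22.53 PSNR, 0.674 SSIM) with fewer training epochs, and its autoregressive model performs comparably to leading approaches.

Insight: The contributions are learning causal 1D representations by selecting tokens over time intervals and binding them to the MeanFlow objective, and the REPA-A regularization, which aligns encoder features with vision foundation models to improve training stability and efficiency.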

Abstract: Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the “next-token prediction” pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.


[116] Pinterest Canvas: Large-Scale Image Generation at Pinterest cs.CVPDF

Yu Wang, Eric Tzeng, Raymond Shiau, Jie Yang, Dmitry Kislyuk

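TL;DR: This paper presents Pinterest Canvas, a large-scale image generation system built for Pinterest to support image editing and enhancement use cases. A foundational diffusion model is first trained on a diverse multimodal dataset and then rapidly fine-tuned into specialized models for specific downstream tasks, rather than relying on a single generic model. Case studies on background enhancement and aspect-ratio outpainting showcase the task-specific variants, and online A/B experiments and human evaluation report positive results.

Details

Motivation: Existing image generation models are flexible but hard to control precisely through prompting or simple inference adaptation, so they cannot meet strict product requirements. The goal is a controllable, specialized image generation system for Pinterest's concrete use cases.

Result: Online A/B experiments show significant engagement lifts of 18.0% and 12.5%, respectively, for the enhanced images. Comparisons with human raters further confirm that the models outperform third-party models on these tasks.

Insight: The key idea is a two-stage "foundation model plus task-specific fine-tuning" approach, instead of one generic model, to gain better control over specific product requirements. The approach extends to other downstream tasks such as multi-image scene synthesis and image-to-video generation, and illustrates effective practice for deploying specialized models in a large-scale production system.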

Abstract: While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.


[117] Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement cs.CV | cs.AIPDF

Yakov Pyotr Shkolnikov

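TL;DR: The paper investigates whether vision-language foundation models (such as CLIP) encode continuous geometric information. It finds that their frozen features contain rich geometric knowledge (for example, hand joint angles) that the text output fails to express. A linear probe with only 6,000 parameters extracts joint angles from frozen features at 6.1 degrees MAE, whereas the best text output reaches 20.0 degrees, a 3.3x bottleneck. LoRA fine-tuning narrows the gap, indicating that the training objective, not representational capacity, limits performance. The study also finds that different encoders converge functionally and that autoregressive generation damages geometric fidelity, with the damage originating in the generation process rather than in language alignment.

Details

Motivation: The goal is to determine whether foundation models, especially vision-language models, encode continuous physical geometry (such as hand joint angles) in their frozen features, and to expose the performance gap between their text output and their internal representations, in order to understand the capabilities and limits of these models in geometric perception.

Result: On the geometric measurement task, a linear probe on frozen features reaches 6.1 degrees MAE, compared with 20.0 degrees MAE for text output; LoRA fine-tuning reduces the text-pathway error to 6.5 degrees MAE. Encoders from self-supervised, contrastive, and hybrid paradigms converge functionally, with R^2 of approximately 0.55 (statistically equivalent). The LLM layers of Qwen2.5-VL even improve probe accuracy.

Insight: The contributions include: revealing a text-output bottleneck for geometric representations in foundation models; showing that the training objective affects geometric accuracy more than the architecture; finding that functional convergence does not require representational convergence; showing that autoregressive generation damages geometric fidelity, with the damage originating in the generation process; and a layer-wise analysis showing that attention heads in mid-network layers (18-22) carry key geometric signal. As a result, a frozen backbone can serve as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.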

Abstract: Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees – a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity – functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL’s LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
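
As a rough illustration of what probing frozen features means in practice, the sketch below fits a linear probe by closed-form ridge regression and scores it with mean absolute error. It is a minimal sketch under stated assumptions, not the paper's protocol: the function names, the ridge penalty, and the synthetic stand-in features are all hypothetical. The last line checks the quoted ratio: 20.0 / 6.1 is about 3.28, which rounds to the 3.3x reported in the abstract.

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a linear map (with bias) from frozen features to targets via ridge regression."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def probe_predict(weights, features):
    """Apply a fitted probe to new features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights

def mean_absolute_error(pred, target):
    """Mean absolute error, the metric quoted in the abstract (in degrees there)."""
    return float(np.mean(np.abs(pred - target)))

# Synthetic stand-in for frozen encoder features and joint-angle labels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 8))
true_w = rng.standard_normal((8, 3))
angles = feats @ true_w + 0.5

probe = fit_linear_probe(feats, angles)
mae = mean_absolute_error(probe_predict(probe, feats), angles)

# Ratio between the text-output error and the probe error reported in the abstract.
bottleneck = 20.0 / 6.1
```

The encoder is never updated in this setup; only the small weight matrix is fit, which is why a probe result speaks to what the features already contain.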


[118] GreenRFM: Toward a resource-efficient radiology foundation model cs.CVPDF

Yingtai Li, Shuai Ming, Mingyue Zhao, Haoran Lai, Rongsheng Wang

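TL;DR: This paper proposes GreenRFM, a resource-efficient pre-training framework for radiology foundation models (RFMs), addressing the reliance of existing methods on brute-force scaling, their high computational cost, and their poor generalization. Through its MUST supervision design, the framework makes maximal use of supervisory signals instead of simply piling up data, achieving state-of-the-art performance at a far lower computational cost.

Details

Motivation: Development of radiology foundation models relies too heavily on brute-force scaling and on methods transferred directly from natural images, which leads to brittle and expensive models in clinical practice. The paper aims to build a resource-efficient pre-training framework that generalizes well.

Result: Extensive experiments use over 200,000 images from four institutions and two modalities. On chest and abdominal CT datasets, on both public and private benchmarks, GreenRFM surpasses a range of baseline models and reaches state-of-the-art performance. The performant configuration sets a new state of the art with a single 24GB GPU within 24 hours, and the lightweight configuration matches existing benchmarks with 6GB of VRAM in 4 hours.

Insight: The core contribution is the MUST supervision principle (More distilled, Ubiquitous, Semantic-enforcing, Task-aligning), which improves both performance and efficiency by optimizing the quality of supervisory signals instead of their quantity. This challenges the “scale is all you need” dogma, makes it feasible for clinicians to develop state-of-the-art RFMs with limited resources, and shows that the supervision principles transfer across modalities.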

Abstract: The development of radiology foundation models (RFMs) is hindered by a reliance on brute-force scaling. Existing approaches often directly translate methods for natural images, which prioritize scale over precision and hence lead to brittle and expensive models in clinical practice. To address this, we present a resource-efficient pre-training framework, GreenRFM, that achieves state-of-the-art performance. Our framework ensures robust generalization across diverse patient populations and imaging protocols, reducing computational requirements by orders of magnitude while surpassing complex, parameter-heavy models. These capabilities stem from principled supervision design that aims to maximally utilize supervisory signals via More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) supervision, rather than simply piling up the quantity of training data. We offer two GreenRFM configurations: (i) a performant model that establishes a new state-of-the-art using a single 24GB GPU within 24 hours, and (ii) a lightweight model that matches existing benchmarks with 6GB VRAM in 4 hours. We conduct extensive experiments using over 200,000 images from four institutions and two modalities. GreenRFMs achieve superior performance on chest and abdominal CT datasets, on both public and private benchmarks, surpassing a range of baseline models. In addition, the results on internal musculoskeletal MRI images show that the same supervision principles transfer between different modalities. Our performance and efficiency challenge the “scale is all you need” dogma and democratize the equitable development of state-of-the-art RFMs for clinicians even on a laptop.


[119] Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching cs.CVPDF

Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony

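TL;DR: This paper presents Match4Annotate, a lightweight framework for propagating sparse point and mask annotations within and across videos. At test time, the method fits a SIREN-based implicit neural representation to DINOv3 features to produce a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching.

Details

Motivation: In specialized domains such as medical imaging, acquiring per-frame video annotations is a primary bottleneck for deploying computer vision, because expert labeling is slow and costly. Existing label propagation methods have limitations: video trackers and segmentation models propagate only within a single sequence and cannot generalize across videos; classic correspondence pipelines struggle in low-texture scenes; and dense feature matching and one-shot segmentation methods lack spatiotemporal smoothness and unified support for both point and mask annotations.

Result: The method is evaluated on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation.

Insight: The novelty lies in combining an implicit neural representation (SIREN) with pretrained visual features (DINOv3) to build a continuous spatiotemporal feature field, and in introducing an implicit deformation field to make matching smoother. The core insight is that lightweight, test-time-optimized feature matching pipelines can offer an efficient and accessible solution for scalable annotation workflows.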

Abstract: Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.


[120] Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis cs.CVPDF

Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja

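TL;DR: This paper proposes Self-Flow, a self-supervised flow matching paradigm that integrates representation learning into the generative framework, addressing the separate training, misaligned objectives, and unpredictable scaling behavior that come with relying on external models. Its Dual-Timestep Scheduling mechanism applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This yields strong semantic representations alongside generative capability without external supervision. The method generalizes across modalities (image, video, audio), follows expected scaling laws, and achieves superior generation quality.

Details

Motivation: Existing diffusion and flow models rely on external models to obtain strong semantic representations, which leads to separate training, misaligned objectives, and unpredictable scaling behavior. The root cause is that the training objective (a denoising task) gives the model little incentive to learn semantic representations.

Result: The method achieves superior generation quality on image, video, and audio generation, follows expected scaling laws, and generalizes across modalities.

Insight: The contribution is a self-supervised flow matching paradigm in which Dual-Timestep Scheduling creates an information asymmetry, making representation learning intrinsic to the generative process. Strong semantic representations are learned without external models, and the approach enables multi-modal training with predictable scaling behavior.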

Abstract: Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model’s training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
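
The abstract states only that Dual-Timestep Scheduling applies heterogeneous noise levels across tokens. The sketch below shows one way such a corruption step could look; it is an illustration under loudly stated assumptions, not the authors' method. The choice of exactly two timesteps per sample, the random token split (`split_ratio`), and the linear interpolation path with `t = 1` meaning clean data are all assumptions made here.

```python
import numpy as np

def dual_timestep_corrupt(tokens, rng, split_ratio=0.5):
    """Corrupt tokens with two different noise levels (hypothetical sketch).

    tokens: array of shape (n_tokens, dim) holding clean token features.
    Returns the noisy tokens, the per-token timestep, and the velocity target
    of a linear flow-matching path x_t = (1 - t) * noise + t * tokens.
    """
    n_tokens = tokens.shape[0]
    t_a, t_b = rng.uniform(0.0, 1.0, size=2)          # two timesteps per sample (assumption)
    use_a = rng.random(n_tokens) < split_ratio        # which tokens get timestep t_a
    t = np.where(use_a, t_a, t_b)[:, None]            # per-token timestep, shape (n_tokens, 1)
    noise = rng.standard_normal(tokens.shape)
    x_t = (1.0 - t) * noise + t * tokens              # less noisy tokens can inform noisier ones
    velocity_target = tokens - noise                  # d x_t / d t along the linear path
    return x_t, t[:, 0], velocity_target

rng = np.random.default_rng(0)
clean = rng.standard_normal((16, 4))
noisy, timesteps, target = dual_timestep_corrupt(clean, rng)
```

Because some tokens are cleaner than others, a model trained on such inputs has to use the cleaner tokens to predict the noisier ones, which is the information asymmetry the abstract describes.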


[121] SCAN: Visual Explanations with Self-Confidence and Analysis Networks cs.CVPDF

Gwanghee Lee, Sungyoon Jeong, Kyoungson Jhang

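TL;DR: This paper proposes SCAN (Self-Confidence and Analysis Networks), a universal explainable-AI framework that produces high-fidelity, high-resolution visual explanations for deep learning models of different architectures, including convolutional neural networks and Transformers. Built on an autoencoder and the Information Bottleneck principle, it reconstructs intermediate-layer features to generate a Self-Confidence Map that identifies information-rich regions.

Details

Motivation: Current visual explanation methods face a critical trade-off: architecture-specific methods offer high fidelity but poor generality, while universal methods apply broadly but give abstract or fragmented explanations. This makes it hard to compare explanatory power across model families such as CNNs and Transformers.

Result: Extensive experiments on diverse architectures and datasets show that SCAN consistently performs well on quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitative analysis shows that its explanations are clearer and more focused on the target object than those of existing methods.

Insight: The contribution is a unified, architecture-agnostic framework that combines autoencoder reconstruction with the Information Bottleneck principle to generate high-resolution Self-Confidence Maps. This addresses the trade-off between fidelity and generality in existing methods and offers a more reliable tool for understanding the decisions of complex neural networks.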

Abstract: Explainable AI (XAI) has become essential in computer vision to make the decision-making processes of deep learning models transparent. However, current visual explanation (XAI) methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often results in abstract or fragmented explanations and makes it difficult to compare explanatory power across diverse model families, such as CNNs and Transformers. This paper introduces the Self-Confidence and Analysis Networks (SCAN), a novel universal framework that overcomes these limitations for both convolutional neural network and transformer architectures. SCAN utilizes an AutoEncoder-based approach to reconstruct features from a model’s intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. Extensive experiments on diverse architectures and datasets demonstrate that SCAN consistently achieves outstanding performance on various quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitatively, it produces significantly clearer, object-focused explanations than existing methods. By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.


[122] AV-Unified: A Unified Framework for Audio-visual Scene Understanding cs.CVPDF

Guangyao Li, Xin Wang, Wenwu Zhu

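TL;DR: This paper proposes AV-Unified, a unified framework for audio-visual scene understanding that standardizes the inputs and outputs of different tasks as sequences of discrete tokens and uses a multi-scale spatiotemporal perception network to learn multiple tasks jointly.

Details

Motivation: Audio-visual tasks such as event localization, parsing, segmentation, and question answering are mostly studied in isolation, which makes it hard to understand complex scenes comprehensively and to explore inter-task relationships. A unified framework capable of joint learning is therefore needed.

Result: Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS, and AVS) show that AV-Unified is effective across temporal, spatial, and spatiotemporal tasks.

Insight: The contributions include: unifying the inputs and outputs of heterogeneous tasks as discrete token sequences to obtain a shared representation; a multi-scale temporal perception module that captures events at different granularities; a cross-modal guidance-based spatial perception module that models spatial associations; and task-specific text prompts that improve adaptability and task-awareness.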

Abstract: When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as event localization, parsing, segmentation and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model’s adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.


[123] NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion cs.CVPDF

Taewon Kang, Ming C. Lin

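TL;DR: This paper proposes NEGATE, which models linguistic negation as a structured feasibility constraint on semantic guidance within diffusion dynamics. Negation is enforced by projecting the classifier-free guidance update onto a convex constraint set derived from linguistic structure. The result is a training-free, unified framework compatible with pretrained diffusion backbones that handles object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation, and that extends to video generation.

Details

Motivation: Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems, and existing methods lack a formal treatment of it.

Result: Experiments show that the method achieves robust negation compliance while preserving visual fidelity and structural coherence; its effectiveness is validated on the structured negation benchmark introduced by the authors.

Insight: The core contribution is to formalize linguistic negation as a convex constraint on semantic guidance in the diffusion process, giving a unified, projection-based framework that requires no retraining of model parameters, together with a dedicated negation benchmark suite to advance research in this area.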

Abstract: Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
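
To make the idea of projecting a guidance update onto a convex set concrete, here is a minimal sketch that uses the simplest convex set, a half-space. The abstract does not specify the constraint set the authors derive from linguistic structure, so the half-space, the `negated_dir` vector (a hypothetical direction toward the negated concept), and the function names are all assumptions for illustration.

```python
import numpy as np

def cfg_update(eps_uncond, eps_cond, scale):
    """Classifier-free guidance viewed as a semantic update direction."""
    return scale * (eps_cond - eps_uncond)

def project_to_halfspace(update, negated_dir):
    """Euclidean projection of `update` onto {u : <u, negated_dir> <= 0}.

    Any component of the update that moves toward the negated concept is
    removed; updates that already satisfy the constraint are left unchanged.
    """
    n = negated_dir.ravel()
    u = update.ravel()
    norm_sq = float(n @ n)
    if norm_sq == 0.0:
        return update.copy()                    # no constraint direction given
    violation = max(0.0, float(u @ n))          # positive only when infeasible
    return (u - violation / norm_sq * n).reshape(update.shape)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 4))
eps_cond = rng.standard_normal((4, 4))
negated_dir = rng.standard_normal((4, 4))       # hypothetical "toward the negated object" direction

raw = cfg_update(eps_uncond, eps_cond, scale=7.5)
constrained = project_to_halfspace(raw, negated_dir)
```

The projection changes the update as little as possible in the Euclidean sense, which is consistent with the abstract's claim of preserving fidelity while enforcing the constraint; an intersection of several half-spaces would be one possible way to compose multiple negations.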


[124] SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference cs.CVPDF

Ashkan Shahbazi, Elaheh Akbari, Kyvia Pereira, Jon S. Heiselman, Annie C. Benson

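TL;DR: This paper presents SurgFormer, a multiresolution gated Transformer for data-driven soft tissue simulation on volumetric meshes. Trained on solver-generated data, it predicts nodewise displacement fields at near-real-time rates and supports topology-altering, resection-conditioned simulation through a learned cut embedding.

Details

Motivation: High-fidelity biomechanical solvers are usually too costly for interactive use, so an efficient data-driven surrogate is needed to predict soft tissue deformation and to handle both standard deformation and topology-altering resection cases in one model.

Result: On cholecystectomy and appendectomy simulation datasets generated under a unified protocol, SurgFormer achieves strong accuracy with favorable efficiency across diverse baselines, making it a practical backbone for both tasks.

Insight: The contributions are a fixed mesh hierarchy with multibranch blocks that combine local message passing, coarse global self-attention, and pointwise feedforward updates; learned gates that adaptively fuse local and long-range information while remaining scalable on large meshes; the first study of XFEM-supervised, cut-conditioned deformation within a volumetric pipeline; and a learned cut embedding that unifies standard and topology-altering cases.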

Abstract: We introduce SurgFormer, a multiresolution gated transformer for data-driven soft tissue simulation on volumetric meshes. High-fidelity biomechanical solvers are often too costly for interactive use, so we train SurgFormer on solver-generated data to predict nodewise displacement fields at near-real-time rates. SurgFormer builds a fixed mesh hierarchy and applies repeated multibranch blocks that combine local message passing, coarse global self-attention, and pointwise feedforward updates, fused by learned per-node, per-channel gates to adaptively integrate local and long-range information while remaining scalable on large meshes. For cut-conditioned simulation, resection information is encoded as a learned cut embedding and provided as an additional input, enabling a unified model for both standard deformation prediction and topology-altering cases. We also introduce two surgical simulation datasets generated under a unified protocol with XFEM-based supervision: a cholecystectomy resection dataset and an appendectomy manipulation and resection dataset with cut and uncut cases. To our knowledge, this is the first learned volumetric surrogate setting to study XFEM-supervised cut-conditioned deformation within the same volumetric pipeline as standard deformation prediction. Across diverse baselines, SurgFormer achieves strong accuracy with favorable efficiency, making it a practical backbone for both tasks. Code, data, and project page: https://mint-vu.github.io/SurgFormer/
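
The per-node, per-channel gating that fuses the branch outputs can be sketched as follows. This is a minimal sketch, not the SurgFormer code: the abstract does not say how the gates are normalized, so the softmax over branches used here is an assumption, and the gate logits are random stand-ins for what a learned layer would produce.

```python
import numpy as np

def gated_fusion(branch_outputs, gate_logits):
    """Fuse branch outputs with per-node, per-channel gates.

    branch_outputs: (n_branches, n_nodes, n_channels), e.g. local message
        passing, coarse global attention, and pointwise feedforward outputs.
    gate_logits: same shape; in a real model these would come from a learned layer.
    Returns the fused features (n_nodes, n_channels) and the gate weights.
    """
    shifted = gate_logits - gate_logits.max(axis=0, keepdims=True)  # numerical stability
    gates = np.exp(shifted)
    gates /= gates.sum(axis=0, keepdims=True)                       # softmax over branches (assumption)
    fused = (gates * branch_outputs).sum(axis=0)
    return fused, gates

rng = np.random.default_rng(0)
branches = rng.standard_normal((3, 5, 4))     # 3 branches, 5 mesh nodes, 4 channels
logits = rng.standard_normal((3, 5, 4))
fused, gates = gated_fusion(branches, logits)
```

With gates of this shape, each node can weight local and long-range information differently for every feature channel, which matches the adaptive integration described in the abstract.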


[125] Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving cs.CVPDF

Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha

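TL;DR: This paper focuses on redundancy in multisource multimodal data for autonomous vehicles (AVs) and proposes a way to model and measure it. Experiments on the nuScenes and Argoverse 2 datasets show that selectively removing redundant labels (such as image labels from cameras with shared fields of view) can improve YOLOv8 object detection, and reveal substantial redundancy between image and LiDAR data.

Details

Motivation: AV research has largely prioritized algorithm design over data quality (DQ) analysis, in particular the widespread but underexplored redundancy in multisource multimodal data, which affects the efficiency and accuracy of real-time decision-making.

Result: On nuScenes, removing redundant multisource image object labels raises mAP50 from 0.66 to 0.70, from 0.64 to 0.67, and from 0.53 to 0.55 on three representative overlap regions. On Argoverse 2, 4.1%-8.6% of labels are removed and mAP50 stays near the 0.64 baseline. The experiments show that redundancy is a measurable and actionable data quality factor.

Insight: The contribution is to model and quantify redundancy systematically as a core data quality factor and to demonstrate its impact on AV perception empirically. This motivates a data-centric perspective for evaluating and improving datasets and offers a new angle for improving AV system performance.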

Abstract: Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects the YOLOv8 object detection task. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view improves detection. In nuScenes, mAP$_{50}$ improves from $0.66$ to $0.70$, from $0.64$ to $0.67$, and from $0.53$ to $0.55$ on three representative overlap regions, while detection on other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, $4.1$-$8.6\%$ of labels are removed, and mAP$_{50}$ stays near the $0.64$ baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV performance. This work highlights the role of redundancy as a data quality factor in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets. Code, data, and implementation details are publicly available at: https://github.com/yhZHOU515/RedundancyAD


[126] EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking cs.CVPDF

Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong

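TL;DR: This paper proposes EgoReasoner, a two-stage framework for 4D reasoning over egocentric video. It uses task-adaptive structured thinking templates and task-aware reward functions to target fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization. A 3B-parameter model reaches 37.5% average accuracy on the HD-EPIC benchmark.

Details

Motivation: Egocentric video understanding requires reasoning about a dynamic 4D environment in which the camera moves and objects are displaced. Task-agnostic methods, such as generic Chain-of-Thought and uniform reinforcement learning, cannot adapt to the structural differences among the required cognitive operations (spatial anchoring, temporal tracking, and duration reasoning).

Result: On the HD-EPIC benchmark, the 3B-parameter model trained on only 16K samples achieves 37.5% average accuracy, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.

Insight: The key idea is to align both the reasoning scaffold and the reward signal with each task's cognitive structure. Task-adaptive thinking templates guide the synthesis of structured CoT traces for supervised fine-tuning, followed by reinforcement fine-tuning with GRPO using task-aware reward functions that verify entity grounding, temporal alignment, and logical consistency, which enables sample-efficient learning.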

Abstract: Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task’s cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
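
The second stage relies on GRPO, whose defining step is to normalize each sampled response's reward against the other responses drawn for the same prompt. The sketch below shows that normalization together with a hypothetical task-aware reward. The three reward terms follow the abstract (entity grounding, temporal alignment, logical consistency), but their scoring in [0, 1], the equal weights, and the function names are assumptions, not the authors' reward design.

```python
import numpy as np

def task_reward(grounding, temporal, logic, weights=(1.0, 1.0, 1.0)):
    """Combine per-check scores in [0, 1] into one scalar reward (hypothetical weighting)."""
    w = np.asarray(weights, dtype=float)
    scores = np.array([grounding, temporal, logic], dtype=float)
    return float(w @ scores / w.sum())

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of samples."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled responses to the same question, scored by the three checks.
group_rewards = [
    task_reward(1.0, 0.0, 0.0),
    task_reward(1.0, 1.0, 0.0),
    task_reward(1.0, 1.0, 1.0),
    task_reward(0.0, 1.0, 0.0),
]
advantages = group_relative_advantages(group_rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced; no separate value network is needed, which is the main practical appeal of the group-relative formulation.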


[127] Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders cs.CVPDF

Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu

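TL;DR: This paper presents Penguin-VL, an efficient vision-language model that replaces the conventional contrastively pretrained vision encoder (e.g., CLIP/SigLIP) with a vision encoder initialized from a text-only LLM. The aim is to avoid the loss of fine-grained visual information caused by reliance on large-scale model scaling and contrastive learning, and thereby achieve high performance at a compact model size.

Details

Motivation: VLM development relies too heavily on model scaling, which hinders deployment on compute-constrained mobile and edge devices. In addition, contrastively trained vision encoders (such as CLIP) are optimized for discrimination and enforce category-level invariances, suppressing the fine-grained visual cues needed for dense captioning and complex reasoning.

Result: Across multiple image and video benchmarks, Penguin-VL (e.g., at 2B and 8B parameters) matches leading VLMs (such as Qwen3-VL) on mathematical reasoning and surpasses them on tasks such as document understanding, visual knowledge, and multi-perspective video understanding, while keeping a lightweight architecture.

Insight: The contribution is a vision encoder initialized from a text-only LLM (Penguin-Encoder) as an alternative to contrastively pretrained encoders. It better preserves the fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning, yielding higher visual fidelity and data efficiency at smaller model sizes, and it shows that improved visual representation, not model scaling alone, is the key driver of performance.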

Abstract: Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL


[128] SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning cs.CV | cs.AIPDF

Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye

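TL;DR: This paper introduces SUREON, a large-scale video question-answering dataset harvested from surgical lecture videos for training and evaluating surgical reasoning. It defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to produce 206.8K QA pairs from 134.7K video clips. The authors also develop two models, SureonVLM and SureonVLM-R1, which exceed 84% accuracy on the SUREON benchmark and outperform general-domain models on standard surgical perception tasks.

Details

Motivation: Current surgical AI lacks a deeper understanding of the surgical scene (why an instrument was chosen, what risk it poses, and what comes next), because reasoning data is extremely difficult to annotate at scale. Expert narration in surgical lecture videos naturally contains this reasoning, although it is noisy and unstructured.

Result: On the SUREON benchmark (354 expert-validated examples), SureonVLM and SureonVLM-R1 exceed 84% accuracy, substantially outperforming larger general-domain models, and they also perform better on standard surgical perception tasks.

Insight: The contribution is to build a large-scale surgical reasoning dataset from existing surgical lecture videos instead of manual annotation, using a multi-agent pipeline to extract structured supervision automatically. Training with Group Relative Policy Optimization leads the model to show explicit reasoning behavior, such as inferring operative intent from visual context.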

Abstract: Surgeons don’t just see – they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this – explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.


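[129] SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation cs.CV | cs.LG

Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum

TL;DR: This paper proposes SCOPE (Scene-COntextualised Prototype Enrichment), a framework for incremental few-shot segmentation of 3D point clouds. It is a plug-and-play, background-guided prototype enrichment method that integrates with any prototype-based 3D segmentation method. It extracts high-confidence pseudo-instances from unlabelled background regions of base-training scenes to build a prototype pool; when novel classes arrive, it retrieves relevant background prototypes and fuses them with the few-shot prototypes to form enriched representations, without retraining the backbone or adding parameters.

Details

Motivation: Existing methods for incremental few-shot 3D point cloud segmentation suffer from catastrophic forgetting, struggle to learn discriminative prototypes under sparse supervision, and overlook a key cue: novel categories often appear as unlabelled background in base-training scenes.

Result: On ScanNet and S3DIS, SCOPE achieves state-of-the-art performance, improving novel-class IoU by up to 6.98% and 3.61% and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting.

Insight: The method exploits scene context: a class-agnostic segmentation model mines pseudo-instances from the background to build a prototype pool, and background-guided prototype retrieval and fusion enrich the few-shot representations. This is an efficient enrichment strategy that requires no retraining and no additional parameters.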

Abstract: Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base-training scenes. We introduce SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play background-guided prototype enrichment framework that integrates with any prototype-based 3D segmentation method. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available at https://github.com/Surrey-UP-Lab/SCOPE.
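
The retrieve-and-fuse step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity retrieval, the similarity-weighted averaging, and the names `enrich_prototype`, `top_k`, and `alpha` are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def enrich_prototype(few_shot_proto, pool, top_k=2, alpha=0.5):
    """Retrieve the top_k most similar background prototypes and fuse them.

    alpha is a hypothetical mixing weight: 1.0 keeps only the few-shot prototype.
    """
    ranked = sorted(pool, key=lambda p: cosine(few_shot_proto, p), reverse=True)[:top_k]
    # Negative similarities are clipped so dissimilar prototypes get no weight.
    weights = [max(cosine(few_shot_proto, p), 0.0) for p in ranked]
    total = sum(weights)
    if total == 0.0:
        return list(few_shot_proto)
    background = [
        sum(w * p[i] for w, p in zip(weights, ranked)) / total
        for i in range(len(few_shot_proto))
    ]
    return [alpha * f + (1 - alpha) * b for f, b in zip(few_shot_proto, background)]

# Toy 2-D prototype pool standing in for mined background pseudo-instances.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
enriched = enrich_prototype([1.0, 0.2], pool)
```

No parameters are trained here, which mirrors the abstract's point that enrichment happens without retraining the backbone.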


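[130] BEVLM: Distilling Semantic Knowledge from LLMs into Bird’s-Eye View Representations cs.CV | cs.AI | cs.LG | cs.RO

Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding

TL;DR: BEVLM is a framework that distills semantic knowledge from large language models (LLMs) into Bird’s-Eye View (BEV) representations to improve 3D spatial reasoning and semantic understanding in autonomous driving. Feeding BEV features to the LLM as unified inputs improves reasoning accuracy in cross-view scenes, and distilling the LLM's semantic knowledge back into the BEV representation significantly improves end-to-end driving performance.

Details

Motivation: Existing methods feed multi-view, multi-frame images to the LLM independently, which causes redundant computation and limited spatial consistency and hinders accurate 3D spatial reasoning and geometric coherence. Conventional BEV representations provide spatial structure but lack the semantic richness of foundation vision encoders. BEVLM aims to bridge this gap.

Result: Using BEV features as unified inputs improves LLM reasoning accuracy in cross-view driving scenes by 46%. Distilling semantic knowledge from LLMs into BEV representations improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.

Insight: The novelty is combining a spatially consistent BEV representation with the semantic capabilities of LLMs, transferring information in both directions (BEV features into the LLM and LLM semantics into BEV) to improve reasoning accuracy and driving performance together. This addresses the separation of spatial and semantic information in autonomous driving.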

Abstract: The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird’s-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.


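[131] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion cs.CV

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao

TL;DR: Omni-Diffusion proposes a multimodal large language model architecture built on mask-based discrete diffusion that unifies cross-modal understanding and generation over text, speech, and images, as an alternative to the conventional autoregressive backbone.

Details

Motivation: Current multimodal LLMs rely mainly on autoregressive architectures, leaving room to explore more efficient alternatives. Discrete diffusion models have shown promise in areas such as visual understanding and image generation and could serve as the backbone of multimodal systems.

Result: Across diverse benchmarks, the method outperforms or performs on par with existing multimodal systems that process two or more modalities, which highlights the promise of diffusion models as a foundation for next-generation multimodal models.

Insight: A single mask-based discrete diffusion model serves as the unified backbone and directly models the joint distribution over discrete multimodal tokens. It supports any-to-any conversion, from bimodal tasks to more complex multi-modality scenarios, and suggests a new direction for multimodal architecture design.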

Abstract: While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.


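[132] Multimodal Large Language Models as Image Classifiers cs.CV

Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas

TL;DR: This paper systematically analyzes how multimodal large language models (MLLMs) are evaluated as image classifiers and identifies evaluation protocol and annotation quality as the key factors behind conflicting conclusions about their performance relative to supervised models. The authors identify and fix key protocol issues (model outputs that fall outside the class list, inflated results from weak distractors, and poor output mapping) and quantify the effect of design choices such as batch size, image ordering, and text encoder selection. Evaluation on ReGT, a multilabel reannotation of 625 ImageNet-1k classes, shows that corrected labels substantially improve MLLM performance (up to +10.8%) and greatly narrow the perceived gap with supervised models. The study concludes that much of the reported MLLM underperformance on classification is an artifact of noisy labels and flawed evaluation protocols rather than a genuine model deficiency.

Details

Motivation: Prior studies comparing MLLMs with supervised and vision-language models on image classification report conflicting conclusions. The paper investigates whether these conflicts stem from flawed evaluation protocols and label noise rather than a real gap in model capability.

Result: On ReGT, corrected annotations improve MLLM performance by up to 10.8% and substantially narrow the gap with supervised models. The study also shows that batch size, image ordering, and text encoder selection substantially affect accuracy.

Insight: The core contribution is to systematically identify and fix key protocol flaws in MLLM image classification evaluation (handling of out-of-list outputs, distractor design, and output mapping) and to show how strongly annotation quality affects measured performance, especially for models that rely less on supervised training signals. This offers methodological guidance for fairer, more reliable MLLM evaluation. The paper also shows that MLLMs can assist human annotators in large-scale dataset curation.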

Abstract: The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocol rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
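
Two of the protocol points in this abstract, mapping out-of-list outputs instead of discarding them and scoring against multilabel ground truth, can be illustrated with a small sketch. It is not the paper's protocol: the string-similarity mapping via `difflib`, the `cutoff` value, and the toy class names are assumptions made for illustration.

```python
import difflib

def map_to_class(output, class_names, cutoff=0.6):
    """Map a free-form model output onto the closest class name, or None."""
    cleaned = output.strip().lower()
    lowered = {name.lower(): name for name in class_names}
    if cleaned in lowered:
        return lowered[cleaned]
    match = difflib.get_close_matches(cleaned, list(lowered), n=1, cutoff=cutoff)
    return lowered[match[0]] if match else None

def accuracy(outputs, label_sets, class_names):
    """Count a prediction as correct if it maps to any label of the sample.

    Unmappable outputs count as errors rather than being dropped.
    """
    correct = 0
    for output, labels in zip(outputs, label_sets):
        predicted = map_to_class(output, class_names)
        if predicted is not None and predicted in labels:
            correct += 1
    return correct / len(outputs)

# Hypothetical toy data: the first image shows an object with two valid names.
classes = ["laptop", "notebook", "desk", "tabby cat"]
outputs = ["Notebook", "a desk", "tabby cats", "unicorn"]
single_label = [{"laptop"}, {"desk"}, {"tabby cat"}, {"desk"}]
multi_label = [{"laptop", "notebook"}, {"desk"}, {"tabby cat"}, {"desk"}]

acc_single = accuracy(outputs, single_label, classes)
acc_multi = accuracy(outputs, multi_label, classes)
```

In this toy case the same predictions score higher against the multilabel ground truth, which is the kind of gap the abstract attributes to annotation quality.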


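cs.IR

[133] CBR-to-SQL: Rethinking Retrieval-based Text-to-SQL using Case-based Reasoning in the Healthcare Domain cs.IR | cs.AI | cs.CL

Hung Nguyen, Hans Moen, Pekka Marttinen

TL;DR: This paper proposes CBR-to-SQL, a framework that brings Case-Based Reasoning (CBR) to text-to-SQL in the healthcare domain. A two-stage retrieval process (logical structure first, then entities) handles the variability and noise of medical terminology. On the MIMICSQL benchmark it achieves state-of-the-art logical form accuracy and competitive execution accuracy, with higher sample efficiency and robustness than standard Retrieval-Augmented Generation (RAG).

Details

Motivation: Translating natural language into SQL is a barrier to extracting information from Electronic Health Record (EHR) databases. In the medical domain, standard RAG suffers from insufficient retrieval coverage, added noise, and scalability problems caused by variable and noisy terminology.

Result: On MIMICSQL, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy, and shows higher sample efficiency and robustness than standard RAG under data scarcity and retrieval perturbations.

Insight: The novelty is applying CBR to text-to-SQL: abstract case templates and two-stage retrieval (logical structure first, entity resolution second) improve adaptation to variation in domain terminology. This offers a structured approach that other specialized domains could borrow for retrieval-augmented generation.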

Abstract: Extracting insights from Electronic Health Record (EHR) databases often requires SQL expertise, creating a barrier for healthcare decision-making and research. While a promising approach is to use Large Language Models (LLMs) to translate natural language questions to SQL via Retrieval-Augmented Generation (RAG), adapting this approach to the medical domain is non-trivial. Standard RAG relies on single-step retrieval from a static pool of examples, which struggles with the variability and noise of medical terminology and jargon. This often leads to anti-patterns such as expanding the task demonstration pool to improve coverage, which in turn introduces noise and scalability problems. To address this, we introduce CBR-to-SQL, a framework inspired by Case-Based Reasoning (CBR). It represents question-SQL pairs as reusable, abstract case templates and utilizes a two-stage retrieval process that first captures logical structure and then resolves relevant entities. Evaluated on MIMICSQL, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy. More importantly, it demonstrates higher sample efficiency and robustness than standard RAG approaches, particularly under data scarcity and retrieval perturbations.
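
The two-stage idea in this abstract (match on logical structure, then resolve entities) can be sketched as follows. This is a toy illustration under stated assumptions, not the CBR-to-SQL pipeline: the case base, table and column names, quoted-entity convention, and Jaccard matching are all hypothetical, and the naive string substitution does no SQL escaping or terminology normalization.

```python
import re

# Hypothetical case base of abstract templates; the schema names are made up.
CASE_BASE = [
    {"template": "how many patients were diagnosed with <VALUE>",
     "sql": "SELECT COUNT(DISTINCT subject_id) FROM diagnoses WHERE short_title = '<VALUE>'"},
    {"template": "what is the age of patient <VALUE>",
     "sql": "SELECT age FROM demographic WHERE name = '<VALUE>'"},
]

def abstract_question(question):
    """Replace quoted entity mentions with a placeholder to expose the structure."""
    entities = re.findall(r'"([^"]+)"', question)
    template = re.sub(r'"[^"]+"', "<VALUE>", question.lower()).rstrip("?").strip()
    return template, entities

def jaccard(a, b):
    """Token-level Jaccard similarity between two templates."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def retrieve_case(template):
    """Stage 1: pick the case whose abstract template is structurally closest."""
    return max(CASE_BASE, key=lambda case: jaccard(template, case["template"]))

def to_sql(question):
    template, entities = abstract_question(question)
    sql = retrieve_case(template)["sql"]
    for entity in entities:
        # Stage 2: fill entity slots (illustration only, no escaping).
        sql = sql.replace("<VALUE>", entity, 1)
    return sql

query = to_sql('How many patients were diagnosed with "hypertension"?')
```

Because retrieval operates on the entity-free template, an unseen diagnosis name does not change which case is retrieved.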


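[134] AutoThinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction cs.IR | cs.CV

Jiashu Yang, Chi Zhang, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu

TL;DR: This paper proposes AutoThinkRAG, a framework that addresses the loss of reasoning precision in Vision-Language Models (VLMs) caused by long contexts and information overload in information-intensive document question answering. A Query Complexity Router assigns reasoning paths to queries of different difficulty, and a functionally decoupled architecture uses a small VLM as a visual interpreter to extract textual cues that an LLM then reasons over, which lowers inference cost while improving performance.

Details

Motivation: Existing multimodal GraphRAG methods face two challenges in complex document QA: they need large-scale models to handle queries of varying complexity, and end-to-end VLMs have inherent reasoning bottlenecks. The paper aims to improve understanding of complex documents by combining the capabilities of multiple models.

Result: Extensive experiments on DocBench and MMLongBench show that AutoThinkRAG significantly reduces inference cost while achieving new state-of-the-art performance. Ablation studies verify the effectiveness of the method.

Insight: The main innovations are: 1) a routing mechanism based on query complexity analysis that allocates reasoning resources on demand; and 2) a functionally decoupled architecture that separates visual interpretation from logical reasoning, with a small VLM extracting high-fidelity textual cues and an LLM performing the deeper reasoning. This suggests a way to reduce the computational overhead of multimodal systems.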

Abstract: Information-intensive Document Question Answering (DocQA) is often constrained by long contexts and information overload, which hinders Vision-Language Models (VLMs) from performing precise direct reasoning. Although multimodal GraphRAG has achieved preliminary breakthroughs, existing frameworks still face dual challenges: (1) the necessity of large-scale models for handling queries of diverse complexities and (2) the inherent reasoning bottlenecks of end-to-end VLMs. To address these issues, we propose AutoThinkRAG, a framework that enhances the understanding of complex documents by synergizing the capabilities of multiple models. Specifically, we introduce a Query Complexity Router to allocate reasoning paths based on the analysis of query difficulty. Furthermore, to overcome the reasoning boundaries of VLMs, we propose a functional decoupling architecture: a small-scale VLM serves as a high-fidelity visual interpreter to transform query-relevant visual cues into textual representations, which are subsequently processed by an LLM for logical deduction and synthesis. Extensive experiments on DocBench and MMLongBench demonstrate that AutoThinkRAG significantly reduces inference costs while achieving new state-of-the-art performance. Further ablation studies verify the effectiveness of our proposed method.


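cs.HC

[135] CoEditor++: Instruction-based Visual Editing via Cognitive Reasoning cs.HC | cs.CV

Minheng Ni, Yutao Fan, Zhengyuan Yang, Yeli Shen, Yuxiang Wei

TL;DR: CoEditor++ is a training-free, instruction-based visual editing framework built on cognitive reasoning. It decomposes editing into two cognitive stages, “what to edit” and “how to edit”, and adds a reflective self-selection mechanism to achieve robust, fine-grained, and interpretable image editing. Built entirely from open-source components, it reaches state-of-the-art performance among open-source models on both general and responsible editing tasks and substantially outperforms closed-source models in visual consistency.

Details

Motivation: Existing instruction-based image editing methods fall short in high-level semantic reasoning and visual consistency, particularly under ambiguous or complex instructions.

Result: On SmartEdit (general editing) and AltBear (responsible editing), CoEditor++ achieves state-of-the-art performance compared with open-source models that require training on specialized datasets, while maintaining significantly higher visual consistency. Compared with closed-source models such as Nano Banana Pro or GPT-4o, it offers comparable instruction following and substantially better visual consistency.

Insight: The novelty is structuring editing as a cognitive reasoning process, using two-stage decomposition and reflective self-selection to improve robustness and interpretability. The framework suggests that a cognitively structured approach built entirely from open-source components, with no additional training, can improve both the performance and the consistency of instruction-based editing.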

Abstract: Recent advances in large multimodal models (LMMs) have enabled instruction-based image editing, allowing users to modify visual content via natural language descriptions. However, existing approaches often struggle with high-level semantic reasoning and visual consistency, particularly under ambiguous or complex instructions. To address these challenges, we propose CoEditor++, a cognitively structured, training-free framework that decomposes editing into “what to edit” and “how to edit” through two cognitive stages with a reflective self-selection mechanism, enabling robust, fine-grained, and interpretable editing. Built entirely from open-sourced components, CoEditor++ requires no additional training or fine-tuning, ensuring transparency and cross-domain applicability. We evaluate CoEditor++ on SmartEdit, a widely used benchmark for general editing, and AltBear, a privacy and compliance-oriented benchmark. Experimental results show that CoEditor++ achieves state-of-the-art performance in both general editing and responsible editing tasks compared with open-sourced models that require training on specialized editing datasets, while maintaining significantly higher visual consistency. When compared with closed-source models such as Nano Banana Pro or GPT-4o, CoEditor++ preserves comparable instruction following while still substantially outperforming them in visual consistency. Extensive ablation studies confirm that the effectiveness of CoEditor++ stems from its structured cognitive design rather than any specific model component. Our findings suggest the potential of cognitive-centric instruction-based image editing.


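eess.IV

[136] Clinical-Injection Transformer with Domain-Adapted MAE for Lupus Nephritis Prognosis Prediction eess.IV | cs.CV | cs.LG

Yuewen Huang, Zhitao Ye, Guangnan Feng, Fudan Zheng, Xia Gao

TL;DR: This paper proposes the first multimodal computational pathology framework for predicting three treatment responses (complete remission, partial response, and no response) in pediatric lupus nephritis (LN), using only routine PAS-stained biopsy slides and structured clinical data. It introduces two key methodological innovations: a Clinical-Injection Transformer (CIT) that embeds clinical features as condition tokens into patch-level self-attention to enable cross-modal interaction, and a decoupled representation-knowledge adaptation strategy based on a domain-adapted Masked Autoencoder (MAE) that separates morphological feature learning from pathological knowledge extraction.

Details

Motivation: Prognosis prediction for pediatric LN is unexplored in computational pathology despite an urgent clinical need. The only existing histopathology-based approach relies on multiple costly stains and does not integrate clinical data.

Result: On a cohort of 71 pediatric LN patients with KDIGO-standardized labels, the method achieves 90.1% three-class accuracy and 89.4% AUC, indicating high accuracy at low cost.

Insight: The innovations are: the Clinical-Injection Transformer, which enables implicit, bidirectional cross-modal interaction within a unified attention space; the decoupled representation-knowledge adaptation strategy, which explicitly separates self-supervised morphological feature learning from pathological knowledge extraction; and a multi-granularity morphological type injection mechanism that bridges classification knowledge and prognostic prediction at both the instance and patient levels.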

Abstract: Lupus nephritis (LN) is a severe complication of systemic lupus erythematosus that affects pediatric patients with significantly greater severity and worse renal outcomes compared to adults. Despite the urgent clinical need, predicting pediatric LN prognosis remains unexplored in computational pathology. Furthermore, the only existing histopathology-based approach for LN relies on multiple costly staining protocols and fails to integrate complementary clinical data. To address these gaps, we propose the first multimodal computational pathology framework for three-class treatment response prediction (complete remission, partial response, and no response) in pediatric LN, utilizing only routine PAS-stained biopsies and structured clinical data. Our framework introduces two key methodological innovations. First, a Clinical-Injection Transformer (CIT) embeds clinical features as condition tokens into patch-level self-attention, facilitating implicit and bidirectional cross-modal interactions within a unified attention space. Second, we design a decoupled representation-knowledge adaptation strategy using a domain-adapted Masked Autoencoder (MAE). This strategy explicitly separates self-supervised morphological feature learning from pathological knowledge extraction. Additionally, we introduce a multi-granularity morphological type injection mechanism to bridge distilled classification knowledge with downstream prognostic predictions at both the instance and patient levels. Evaluated on a cohort of 71 pediatric LN patients with KDIGO-standardized labels, our method achieves a three-class accuracy of 90.1% and an AUC of 89.4%, demonstrating its potential as a highly accurate and cost-effective prognostic tool.


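[137] Uni-LVC: A Unified Method for Intra- and Inter-Mode Learned Video Compression eess.IV | cs.CV

Yichi Zhang, Ruoyu Yang, Fengqing Zhu

TL;DR: This paper proposes Uni-LVC, a unified learned video compression method that supports both intra and inter coding, including low-delay and random-access modes, in a single model. Building on a strong intra codec, it integrates temporal information through a cross-attention adaptation module and a reliability-aware classifier, and is trained with a multistage strategy. Experiments show strong compression efficiency with comparable computational efficiency.

Details

Motivation: Existing learned video compression methods usually require separate models for intra and inter coding, and their performance degrades when temporal references are unreliable. The paper addresses both issues with one unified model that supports multiple coding modes.

Result: Extensive experiments show that Uni-LVC achieves superior rate-distortion performance in both intra and inter configurations while maintaining comparable computational efficiency.

Insight: The innovations are: formulating inter coding as intra coding conditioned on temporal information; an efficient cross-attention adaptation module that integrates temporal cues; a reliability-aware classifier that selectively scales temporal information so the model behaves more like intra coding when references are unreliable; and a multistage training strategy that adapts to the different coding modes.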

Abstract: Recent advances in learned video compression (LVC) have led to significant performance gains, with codecs such as DCVC-RT surpassing the H.266/VVC low-delay mode in compression efficiency. However, existing LVCs still exhibit key limitations: they often require separate models for intra and inter coding modes, and their performance degrades when temporal references are unreliable. To address this, we introduce Uni-LVC, a unified LVC method that supports both intra and inter coding with low-delay and random-access in a single model. Building on a strong intra-codec, Uni-LVC formulates inter-coding as intra-coding conditioned on temporal information extracted from reference frames. We design an efficient cross-attention adaptation module that integrates temporal cues, enabling seamless support for both unidirectional (low-delay) and bidirectional (random-access) prediction modes. A reliability-aware classifier is proposed to selectively scale the temporal cues, making Uni-LVC behave closer to intra coding when references are unreliable. We further propose a multistage training strategy to facilitate adaptive learning across various coding modes. Extensive experiments demonstrate that Uni-LVC achieves superior rate-distortion performance in intra and inter configurations while maintaining comparable computational efficiency.


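[138] Enhancing Neural Video Compression of Static Scenes with Positive-Incentive Noise eess.IV | cs.CV

Cheng Yuan, Zhenyu Jia, Jiawei Shao, Xuelong Li

TL;DR: This paper improves neural video compression of static scenes by introducing positive-incentive noise: short-term temporal changes are reinterpreted as noise that facilitates model finetuning. This separates transient variations from the persistent background and internalizes structured prior information in the compression model. At inference the invariant component needs only minimal signaling, which substantially reduces data transmission while maintaining pixel-level fidelity.

Details

Motivation: Both traditional standardized codecs and neural video compression methods encode static-scene video inefficiently: the former make inadequate use of temporal redundancy, and the latter suffer from a severe distribution gap between training and test data. Recent generative compression methods improve perceptual quality but introduce hallucinated details, which makes them unsuitable for authenticity-critical applications.

Result: Preliminary experiments show a 73% Bjøntegaard delta (BD) rate saving compared with general neural video compression models.

Insight: The novelty is reconceptualizing short-term temporal changes as positive-incentive noise for model finetuning, which disentangles the background from the changes and trades computation for bandwidth. This supports robust video transmission under adverse network conditions and economical long-term storage of surveillance footage.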

Abstract: Static scene videos, such as surveillance feeds and videotelephony streams, constitute a dominant share of storage consumption and network traffic. However, both traditional standardized codecs and neural video compression (NVC) methods struggle to encode these videos efficiently due to inadequate usage of temporal redundancy and severe distribution gaps between training and test data, respectively. While recent generative compression methods improve perceptual quality, they introduce hallucinated details that are unacceptable in authenticity-critical applications. To overcome these limitations, we propose to incorporate positive-incentive noise into NVC for static scene videos, where short-term temporal changes are reinterpreted as positive-incentive noise to facilitate model finetuning. By disentangling transient variations from the persistent background, structured prior information is internalized in the compression model. During inference, the invariant component requires minimal signaling, thus reducing data transmission while maintaining pixel-level fidelity. Preliminary experiments demonstrate a 73% Bjøntegaard delta (BD) rate saving compared to general NVC models. Our method provides an effective solution to trade computation for bandwidth, enabling robust video transmission under adverse network conditions and economic long-term retention of surveillance footage.


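cs.AI

[139] The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI cs.AI | cs.CL

Giovanni Servedio, Potito Aghilar, Alessio Mattiace, Gianni Carmosino, Francesco Musicco

TL;DR: This paper proposes EpisTwin, a knowledge-graph-grounded neuro-symbolic architecture for personal AI. The framework uses multimodal language models to convert heterogeneous user data into semantic triples that form a Personal Knowledge Graph. At inference, an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement performs complex reasoning over the graph.

Details

Motivation: Personal AI is currently limited by user data fragmented across isolated silos. Existing Retrieval-Augmented Generation methods rely on unstructured vector similarity and cannot capture the latent semantic topology and temporal dependencies that holistic sensemaking requires.

Result: On the proposed synthetic benchmark PersonalQA-71-100, EpisTwin shows robust results across a suite of state-of-the-art judge models.

Insight: The novelty is grounding generative reasoning in a verifiable, user-centric Personal Knowledge Graph and, through the neuro-symbolic combination, dynamically re-grounding symbolic entities in their raw visual context, which offers a direction for trustworthy personal AI.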

Abstract: Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation offers a partial remedy, its reliance on unstructured vector similarity fails to capture the latent semantic topology and temporal dependencies essential for holistic sensemaking. We introduce EpisTwin, a neuro-symbolic framework that grounds generative reasoning in a verifiable, user-centric Personal Knowledge Graph. EpisTwin leverages Multimodal Language Models to lift heterogeneous, cross-application data into semantic triples. At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities in their raw visual context. We also introduce PersonalQA-71-100, a synthetic benchmark designed to simulate a realistic user’s digital footprint and evaluate EpisTwin performance. Our framework demonstrates robust results across a suite of state-of-the-art judge models, offering a promising direction for trustworthy Personal AI.


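[140] SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement cs.AI | cs.CL | cs.LG

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

TL;DR: The paper proposes SAHOO, a framework for monitoring and controlling alignment drift during recursive self-improvement. It uses three safeguards (the Goal Drift Index, constraint preservation checks, and regression-risk quantification) to keep the improvement process safe, and reports substantial quality gains across 189 tasks in code generation, mathematical reasoning, and truthfulness while preserving key constraints.

Details

Motivation: Iterative self-modification during recursive self-improvement risks alignment drift. The goal is to ensure that a model does not depart from its safety and alignment objectives as it improves itself.

Result: Quality improves by 18.3% on code tasks and 16.8% on reasoning tasks, with constraints preserved in two domains and low violation rates on truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles.

Insight: The novelty is turning alignment safeguards into measurable metrics (such as the GDI) and systematic checks, which makes alignment preservation during recursive self-improvement quantifiable, deployable, and verifiable at scale. The paper also maps the trade-off between capability and alignment, for example the tension between fluency and factuality.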

Abstract: Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including 18.3 percent improvement in code tasks and 16.8 percent in reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
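
How the three safeguards named in this abstract might gate a single improvement cycle can be sketched as below. This is only an illustrative control-flow sketch, not SAHOO itself: the `CycleReport` fields, the threshold values, and the order of the checks are assumptions, and the drift index is taken as a precomputed number rather than the learned multi-signal detector the abstract describes.

```python
from dataclasses import dataclass

@dataclass
class CycleReport:
    drift_index: float    # hypothetical GDI-like score, higher means more drift
    constraints_ok: bool  # e.g. output still parses, no flagged hallucination
    score: float          # task quality after this improvement cycle

def accept_cycle(report, best_score_so_far, drift_threshold=0.3, regression_tolerance=0.0):
    """Decide whether to keep the output of one self-improvement cycle.

    Returns (accepted, reason). Threshold defaults are placeholders; the
    abstract says thresholds are calibrated on a small validation set.
    """
    if not report.constraints_ok:
        return False, "constraint violation"
    if report.drift_index > drift_threshold:
        return False, "goal drift"
    if report.score < best_score_so_far - regression_tolerance:
        return False, "regression"
    return True, "accepted"

decision = accept_cycle(CycleReport(drift_index=0.1, constraints_ok=True, score=0.8), best_score_so_far=0.7)
```

Checking constraints first means a cycle that breaks a safety-critical invariant is rejected regardless of how much its score improved.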


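[141] RoboLayout: Differentiable 3D Scene Generation for Embodied Agents cs.AI | cs.CV | cs.LG | cs.RO

Ali Shamsaddinlou

TL;DR: RoboLayout extends LayoutVLM with explicit reachability constraints and a local refinement stage to generate 3D indoor scene layouts that embodied agents can interact with.

Details

Motivation: Vision language models still struggle to generate 3D scene layouts that are both semantically coherent and feasible for embodied agents to interact with, particularly in physically constrained indoor environments.

Result: Experiments show that RoboLayout preserves LayoutVLM's strong semantic alignment and physical plausibility while improving applicability to agent-centric indoor scene generation.

Insight: The innovations are: integrating explicit reachability constraints into differentiable layout optimization; an agent abstraction that can represent diverse embodied agents (such as robots, humans, and animals); and a local refinement stage that improves convergence efficiency.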

Abstract: Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.


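[142] Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery cs.AI | cs.CV

Lin Fan, Pengyu Dai, Zhipeng Deng, Haolin Wang, Xun Gong

TL;DR: This paper proposes MACRO, a self-evolving medical imaging agent that addresses the static tool-invocation strategies of existing LLM-based agents. From verified execution trajectories, it autonomously discovers effective multi-step tool sequences and synthesizes them into reusable composite tools, continuously expanding its behavioral repertoire. Combined with a lightweight image-feature memory and a GRPO-like training loop, this enables closed-loop self-improvement and improves multi-step orchestration accuracy and cross-domain generalization.

Details

Motivation: Existing LLM-based medical imaging agents treat tool sets and invocation strategies as static. This is brittle under real-world domain shift, across tasks, and under evolving diagnostic requirements, where predefined tool chains frequently degrade and require costly manual redesign.

Result: Extensive experiments across diverse medical imaging datasets and tasks show that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and recent state-of-the-art agentic methods.

Insight: The core innovation is the shift from static tool composition to experience-driven tool discovery: verified, effective multi-step sequences are synthesized into new high-level primitives and combined with a visual-clinical context memory and reinforcement learning. This yields closed-loop self-evolution and a path toward adaptive, context-aware clinical AI assistants.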

Abstract: Clinical image interpretation is inherently multi-step and tool-centric: clinicians iteratively combine visual evidence with patient context, quantify findings, and refine their decisions through a sequence of specialized procedures. While LLM-based agents promise to orchestrate such heterogeneous medical tools, existing systems treat tool sets and invocation strategies as static after deployment. This design is brittle under real-world domain shifts, across tasks, and evolving diagnostic requirements, where predefined tool chains frequently degrade and demand costly manual re-design. We propose MACRO, a self-evolving, experience-augmented medical agent that shifts from static tool composition to experience-driven tool discovery. From verified execution trajectories, the agent autonomously identifies recurring effective multi-step tool sequences, synthesizes them into reusable composite tools, and registers these as new high-level primitives that continuously expand its behavioral repertoire. A lightweight image-feature memory grounds tool selection in a visual-clinical context, while a GRPO-like training loop reinforces reliable invocation of discovered composites, enabling closed-loop self-improvement with minimal supervision. Extensive experiments across diverse medical imaging datasets and tasks demonstrate that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and recent state-of-the-art agentic methods, bridging the gap between brittle static tool use and adaptive, context-aware clinical AI assistance. Code will be available upon acceptance.


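cs.LG

[143] Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment cs.LG | cs.CL

Xiang Ma, Lexin Fang, Litian Xu, Caiming Zhang

TL;DR: This paper proposes CDDS (Constrained Decoupling and Distribution Sampling), a cross-modal alignment algorithm that targets true semantic alignment between vision and language. A dual-path UNet adaptively decouples the embeddings, with multiple constraints ensuring that semantic and modality information are effectively separated. A distribution sampling method bridges the modality gap and keeps the alignment process sound.

Details

Motivation: Traditional cross-modal alignment algorithms pursue embedding consistency as a proxy for semantic consistency and ignore the non-semantic information in the embeddings. Decoupling embeddings into semantic and modality components and aligning only the former faces two challenges: there is no established standard for distinguishing semantic from modality information, and the modality gap can cause alignment deviation or information loss.

Result: Extensive experiments on multiple benchmarks and model backbones show that CDDS outperforms state-of-the-art methods by 6.6% to 14.2%.

Insight: The core idea is to align true semantics through constrained decoupling and distribution sampling: 1) a dual-path UNet performs adaptive decoupling under multiple constraints that ensure effective separation; 2) a distribution sampling method bridges the modality gap. Together these give a systematic framework for handling mixed semantic and non-semantic information in embeddings and the modality gap.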

Abstract: Cross-modal alignment is a crucial task in multimodal learning aimed at achieving semantic consistency between vision and language. This requires that image-text pairs exhibit similar semantics. Traditional algorithms pursue embedding consistency to achieve semantic consistency, ignoring the non-semantic information present in the embedding. An intuitive approach is to decouple the embeddings into semantic and modality components, aligning only the semantic component. However, this introduces two main challenges: (1) There is no established standard for distinguishing semantic and modal information. (2) The modality gap can cause semantic alignment deviation or information loss. To align the true semantics, we propose a novel cross-modal alignment algorithm via Constrained Decoupling and Distribution Sampling (CDDS). Specifically, (1) A dual-path UNet is introduced to adaptively decouple the embeddings, applying multiple constraints to ensure effective separation. (2) A distribution sampling method is proposed to bridge the modality gap, ensuring the rationality of the alignment process. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of CDDS, outperforming state-of-the-art methods by 6.6% to 14.2%.


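[144] Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls cs.LG | cs.CL

Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Changran Hu

TL;DR: This empirical study examines test-time adaptation of large language models through many-shot prompting. It analyzes how performance, reliability, and limitations vary across tasks and model backbones, and compares alternative strategies such as Dynamic and Reinforced in-context learning (ICL).

Details

Motivation: The study aims to understand the effectiveness, reliability, and limits of many-shot prompting as a test-time update mechanism, particularly for open-source models.

Result: Many-shot prompting is effective for structured tasks but highly sensitive to the selection strategy, and often shows limited benefit for open-ended generation. Dynamic and Reinforced ICL are studied as alternatives that control which information is injected and how it constrains model behavior.

Insight: The contribution is a systematic characterization of the practical limits of prompt-based test-time adaptation, showing when input-space updates help and when they hurt across tasks, which gives empirical grounding for tuning in-context learning strategies.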

Abstract: Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.
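
The three factors this abstract varies (update magnitude, selection policy, and example ordering) correspond to simple knobs in prompt construction. The sketch below shows one way to expose them; the function names, the policy names, the `Input:`/`Output:` template, and the word-overlap heuristic are all hypothetical and are not taken from the paper.

```python
import random

def select_shots(pool, query, n_shots, policy="first", seed=0):
    """Pick demonstrations from (input, output) pairs under a selection policy."""
    n = min(n_shots, len(pool))  # n_shots plays the role of update magnitude
    if policy == "first":
        return pool[:n]
    if policy == "random":
        return random.Random(seed).sample(pool, n)
    if policy == "overlap":
        # Rank demonstrations by word overlap with the query (ties keep pool order).
        query_words = set(query.lower().split())
        ranked = sorted(
            pool,
            key=lambda ex: len(query_words & set(ex[0].lower().split())),
            reverse=True,
        )
        return ranked[:n]
    raise ValueError(f"unknown policy: {policy}")

def build_prompt(shots, query, reverse_order=False):
    """Concatenate demonstrations and the query into one prompt string."""
    ordered = list(reversed(shots)) if reverse_order else list(shots)
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in ordered]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

pool = [("2 + 2", "4"), ("capital of France", "Paris"), ("3 + 5", "8")]
shots = select_shots(pool, "7 + 1", n_shots=2, policy="overlap")
prompt = build_prompt(shots, "7 + 1")
```

Sweeping `n_shots`, `policy`, and `reverse_order` while holding the model fixed is one way to reproduce the kind of sensitivity analysis the abstract describes.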


[145] DC-Merge: Improving Model Merging with Directional Consistency cs.LG | cs.CVPDF

Han-Chen Zhang, Zi-Hao Zhou, Mao-Lin Luo, Shimin Di, Min-Ling Zhang

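TL;DR: This paper proposes DC-Merge, which improves model merging by preserving the directional consistency of singular spaces. It addresses imbalanced energy distribution and geometric inconsistency in task vectors, and achieves state-of-the-art performance on vision and vision-language benchmarks.

Details

Motivation: The key to model merging is keeping the singular-space directions of the merged multi-task vector consistent with those of the individual task vectors. Existing methods often break this consistency because of imbalanced energy distribution within task vectors and geometric inconsistency in parameter space.

Result: On vision and vision-language benchmarks, DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings.

Insight: The method balances the energy distribution of each task vector by smoothing its singular values, then projects the energy-balanced vectors onto a shared orthogonal subspace to align their directional geometry with minimal reconstruction error, so that all knowledge components are adequately retained.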

Abstract: Model merging aims to integrate multiple task-adapted models into a unified model that preserves the knowledge of each task. In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between the merged multi-task vector and the individual task vectors. However, this consistency is frequently compromised by two issues: i) an imbalanced energy distribution within task vectors, where a small fraction of singular values dominate the total energy, leading to the neglect of semantically important but weaker components upon merging, and ii) the geometric inconsistency of task vectors in parameter space, which causes direct merging to distort their underlying directional geometry. To address these challenges, we propose DC-Merge, a method for directional-consistent model merging. It first balances the energy distribution of each task vector by smoothing its singular values, ensuring all knowledge components are adequately represented. These energy-balanced vectors are then projected onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error. Finally, the aligned vectors are aggregated in the shared orthogonal subspace and projected back to the original parameter space. Extensive experiments on vision and vision-language benchmarks show that DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings. The implementation code is available at https://github.com/Tobeginwith/DC-Merge.
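
The three steps in the abstract (smooth singular values, project onto a shared orthogonal subspace, aggregate and project back) can be sketched roughly as follows. This is not the authors' implementation: the smoothing rule (interpolating toward the mean singular value), the way the shared bases are built (SVD of the concatenated task vectors), and the names `smooth_singular_values`, `merge_task_vectors`, `alpha`, and `rank` are all assumptions made for illustration.

```python
import numpy as np

def smooth_singular_values(s, alpha=0.5):
    """Pull singular values toward their mean (assumed smoothing rule)."""
    return (1.0 - alpha) * s + alpha * s.mean()

def merge_task_vectors(task_vectors, rank, alpha=0.5):
    """Illustrative three-step merge of 2-D task vectors of equal shape."""
    # Step 1: balance the energy of each task vector via its SVD.
    balanced = []
    for delta in task_vectors:
        u, s, vt = np.linalg.svd(delta, full_matrices=False)
        balanced.append((u * smooth_singular_values(s, alpha)) @ vt)

    # Step 2: shared orthonormal bases from the concatenated column spaces
    # (left basis) and row spaces (right basis); this construction is assumed.
    u_shared, _, _ = np.linalg.svd(np.hstack(balanced), full_matrices=False)
    _, _, vt_shared = np.linalg.svd(np.vstack(balanced), full_matrices=False)
    u_k, v_k = u_shared[:, :rank], vt_shared[:rank].T

    # Step 3: aggregate in the shared subspace, then map back.
    core = sum(u_k.T @ b @ v_k for b in balanced)
    return u_k @ core @ v_k.T
```

With `alpha=0` and a full-rank shared basis on square matrices, this sketch reduces to plain task-vector addition, which makes it easy to see what the smoothing and projection steps change.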


cs.CR [Back]

[146] Proof-of-Guardrail in AI Agents and What (Not) to Trust from It cs.CR | cs.AI | cs.CLPDF

Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen

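TL;DR: This paper proposes proof-of-guardrail, a system that addresses the risk of falsely advertised safety measures in AI agent services. It lets developers use a Trusted Execution Environment (TEE) to provide cryptographic proof that a response was generated after a specific open-source guardrail was executed, ensuring the integrity of guardrail execution while keeping the developer's agent private.

Details

Motivation: As AI agents are widely deployed, users rely on developers' claims about safety measures. This creates a threat in which safety measures are falsely advertised and users cannot verify which guardrail was actually executed.

Result: The paper implements proof-of-guardrail for OpenClaw agents and evaluates latency overhead and deployment cost. The abstract reports no specific quantitative results or benchmarks and does not state whether the system reaches SOTA.

Insight: The contribution is combining a TEE with cryptographic proof to verify the integrity of guardrail execution in AI agents while keeping the agent private. The paper also notes that risks remain, for example malicious developers actively jailbreaking the guardrail, so the proof should be trusted with caution.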

Abstract: As AI agents become widely deployed as online services, users often rely on an agent developer’s claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer’s agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard


cs.CE [Back]

[147] Autonomous Algorithm Discovery for Ptychography via Evolutionary LLM Reasoning cs.CE | cs.AI | cs.CL | math.NAPDF

Xiangyu Yin, Ming Du, Junjing Deng, Zhi Yang, Yimo Han

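TL;DR: This paper proposes Ptychi-Evolve, a framework that combines large language models with evolutionary mechanisms to automatically discover and evolve regularization algorithms for ptychography, achieving better reconstruction quality than conventional methods on several challenging datasets.

Details

Motivation: High-quality ptychographic reconstruction usually depends on manually designed regularization functions. This work aims to remove that bottleneck by discovering better regularization algorithms automatically.

Result: On three datasets (X-ray integrated circuits, low-dose electron microscopy of apoferritin, and multislice imaging with crosstalk artifacts), the discovered regularizers improve over conventional reconstructions by up to 0.26 in SSIM and up to 8.3 dB in PSNR.

Insight: The contribution is combining LLM-driven code generation with evolutionary mechanisms such as semantically guided crossover and mutation to discover and evolve algorithms autonomously, while recording algorithm lineage and evolution metadata to support interpretable and reproducible analysis.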

Abstract: Ptychography is a computational imaging technique widely used for high-resolution materials characterization, but high-quality reconstructions often require the use of regularization functions that largely remain manually designed. We introduce Ptychi-Evolve, an autonomous framework that uses large language models (LLMs) to discover and evolve novel regularization algorithms. The framework combines LLM-driven code generation with evolutionary mechanisms, including semantically-guided crossover and mutation. Experiments on three challenging datasets (X-ray integrated circuits, low-dose electron microscopy of apoferritin, and multislice imaging with crosstalk artifacts) demonstrate that discovered regularizers outperform conventional reconstructions, achieving up to +0.26 SSIM and +8.3 dB PSNR improvements. In addition, Ptychi-Evolve records algorithm lineage and evolution metadata, enabling interpretable and reproducible analysis of discovered regularizers.
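
To put the reported PSNR gain in context, the sketch below computes PSNR with the standard definition, 10·log10(MAX² / MSE), and converts a dB gain into the implied reduction in mean squared error. It assumes real-valued images (for example, a reconstructed phase map) and a known `data_range`; the function names are illustrative and are not taken from the Ptychi-Evolve code.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for real-valued images."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

def mse_ratio_from_psnr_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain (same data range)."""
    return 10.0 ** (gain_db / 10.0)
```

Under this definition, a gain of 8.3 dB corresponds to a mean squared error roughly 6.8 times smaller, provided both reconstructions are scored against the same reference and data range.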


cs.RO [Back]

[148] RACAS: Controlling Diverse Robots With a Single Agentic System cs.RO | cs.AI | cs.CL | cs.LG | cs.MAPDF

Dylan R. Ashley, Jan Przepióra, Yimeng Chen, Ali Abualsaud, Nurzhan Yesmagambet

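TL;DR: RACAS is a robot-agnostic agentic control system built on LLMs/VLMs. Three modules that interact through natural language (Monitors, a Controller, and a Memory Curator) provide closed-loop robot control. Deployment on a new platform requires only a robot description, a definition of available actions, and a task specification, with no changes to code or model weights.

Details

Motivation: Moving from low-level robot APIs to high-level autonomous behavior is complex. Existing approaches either require retraining for every new robot or have only been validated on structurally similar platforms. RACAS aims to lower the barrier to prototyping robotic solutions.

Result: RACAS completed all assigned tasks on three very different platforms (a wheeled ground robot, a novel multi-jointed robotic limb, and an underwater vehicle), demonstrating its ability to control robots across platforms.

Insight: The contribution is a modular agentic architecture that communicates purely in natural language and achieves platform independence through robot descriptions rather than code adaptation, offering a scalable, general approach to robot control.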

Abstract: Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either require retraining for every new embodiment or have only been validated across structurally similar platforms. We introduce RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture in which three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) communicate exclusively through natural language to provide closed-loop robot control. RACAS requires only a natural language description of the robot, a definition of available actions, and a task specification; no source code, model weights, or reward functions need to be modified to move between platforms. We evaluate RACAS on several tasks using a wheeled ground robot, a recently published novel multi-jointed robotic limb, and an underwater vehicle. RACAS consistently solved all assigned tasks across these radically different platforms, demonstrating the potential of agentic AI to substantially reduce the barrier to prototyping robotic solutions.


[149] HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild cs.RO | cs.CVPDF

Ziyang Zhao, Shuheng Wang, Zhonghua Miao, Ya Xiong

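TL;DR: This paper presents the HarvestFlex system and the first study of transferring vision-language-action (VLA) policies to real greenhouse tabletop strawberry harvesting. The system builds an end-to-end closed loop on three-view RGB sensing, fine-tunes on a small amount of data collected through VR teleoperation, and achieves non-trivial closed-loop picking performance in a real greenhouse.

Details

Motivation: To transfer VLA policies to real greenhouse strawberry harvesting, a long-horizon, unstructured task made difficult by occlusion and specular reflections, while avoiding depth point clouds and explicit geometric calibration.

Result: Under a unified protocol of 50 real-greenhouse trials, the fully fine-tuned pi_0.5 model achieved a 74.0% success rate, 32.6 s per pick, and a 4.1% damage rate. Asynchronous inference-control decoupling further improved performance. The results show that non-trivial closed-loop picking is achievable with fewer than four hours of real data.

Insight: The contribution is the first transfer of VLA policies to real greenhouse strawberry harvesting, showing that closed-loop operation is feasible with only a small amount of real data and RGB sensing (no depth or calibration). The three-view sensing and asynchronous decoupling in the system design are useful reference points.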

Abstract: This work presents the first study on transferring vision-language-action (VLA) policies to real greenhouse tabletop strawberry harvesting, a long-horizon, unstructured task challenged by occlusion and specular reflections. We built an end-to-end closed-loop system on the HarvestFlex platform using three-view RGB sensing (two fixed scene views plus a wrist-mounted view) and intentionally avoided depth clouds and explicit geometric calibration. We collected 3.71 h of VR teleoperated demonstrations (227 episodes) and fine-tuned pi_0, pi_0.5, and WALL-OSS with full fine-tuning and LoRA. Under a unified 50-trial real-greenhouse protocol and metrics spanning completion, pi_0.5 with full fine-tuning achieved a success rate of 74.0% with 32.6 s/pick and a damage rate of 4.1%. Asynchronous inference-control decoupling further improved performance over synchronous deployment. Results showed non-trivial closed-loop picking with fewer than four hours of real data, while remaining limited by close-range observability loss and contact-dynamics mismatch. A demonstration video is available at: https://youtu.be/bN8ZowZKPMI.


[150] Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration cs.RO | cs.AI | cs.CVPDF

Ninghao Zhang, Bin Zhu, Shijie Zhou, Jingjing Chen

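TL;DR: This paper shows that vision-language-action (VLA) models suffer from "linguistic blindness" under out-of-distribution (OOD) instructions: the model ignores an instruction that contradicts the scene and executes a visually plausible action instead. The authors introduce ICBench, a diagnostic benchmark for evaluating this problem systematically, and design IGAR, a training-free inference-time attention recalibration mechanism that restores the influence of language instructions on action generation. They validate it on simulated and real-robot tasks.

Details

Motivation: To address the poor reliability of VLA models under OOD instructions, in particular their tendency to rely on visual priors and ignore semantically contradictory instructions, a phenomenon the authors call linguistic blindness.

Result: On ICBench, built from the LIBERO dataset, VLA architectures including Pi0, Pi0.5, and OpenVLA OFT still frequently succeed under logically impossible instructions, revealing a strong visual bias. Across 30 LIBERO tasks, IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. On a real Franka robotic arm, it effectively prevents manipulation triggered by inconsistent instructions.

Insight: The contribution is identifying linguistic blindness, a decoupling of language from action in VLA models, and proposing IGAR, an inference-time attention recalibration mechanism that needs no retraining or architectural change. By rebalancing attention distributions to strengthen the influence of language instructions, it offers a lightweight way to improve the robustness and linguistic grounding of VLA models.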

Abstract: Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.
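
The abstract says only that IGAR "rebalances attention distributions" at inference time and does not give the rule. As a rough illustration of what such a rebalancing step could look like, the sketch below scales the attention mass on language tokens and renormalizes each row. The multiplicative rule and the names `recalibrate_attention`, `language_mask`, and `gain` are assumptions, not the IGAR mechanism itself.

```python
import numpy as np

def recalibrate_attention(attn, language_mask, gain=2.0):
    """Boost attention on language tokens, then renormalize each row.

    attn: array of shape (..., num_keys) whose rows sum to 1.
    language_mask: boolean array of shape (num_keys,), True for
        instruction tokens (hypothetical layout).
    gain: assumed multiplicative boost; gain=1.0 leaves attn unchanged.
    """
    attn = np.asarray(attn, dtype=float)
    scale = np.where(np.asarray(language_mask, dtype=bool), gain, 1.0)
    boosted = attn * scale
    return boosted / boosted.sum(axis=-1, keepdims=True)
```

For a row `[0.8, 0.1, 0.1]` in which the last two keys are instruction tokens, `gain=2.0` raises the language share from 0.2 to about 0.33 without touching any model weights, which is the sense in which such a step is train-free.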


[151] SG-DOR: Learning Scene Graphs with Direction-Conditioned Occlusion Reasoning for Pepper Plants cs.RO | cs.CVPDF

Rohit Menon, Niklas Mueller-Goldingen, Sicong Pan, Gokul Krishna Chenchani, Maren Bennewitz

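TL;DR: This paper proposes SG-DOR, a framework that infers a scene graph from instance-segmented plant-organ point clouds, encoding physical attachment and direction-conditioned occlusion relations, to provide a structured relational signal for robotic harvesting in dense canopies.

Details

Motivation: Robotic harvesting in dense crop canopies needs more than geometry. It also needs explicit, direction-conditioned occlusion relations that identify which organs obstruct a target fruit.

Result: On a multi-plant synthetic pepper dataset, the method improves over strong ablations in occlusion prediction (F1=0.73, NDCG@3=0.85) and attachment inference (edge F1=0.83).

Insight: The contribution is a direction-conditioned occlusion reasoning task and a matching direction-aware graph neural architecture, which uses per-fruit leaf-set attention and union-level aggregation to learn the scene graph.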

Abstract: Robotic harvesting in dense crop canopies requires effective interventions that depend not only on geometry, but also on explicit, direction-conditioned relations identifying which organs obstruct a target fruit. We present SG-DOR (Scene Graphs with Direction-Conditioned Occlusion Reasoning), a relational framework that, given instance-segmented organ point clouds, infers a scene graph encoding physical attachments and direction-conditioned occlusion. We introduce an occlusion ranking task for retrieving and ranking candidate leaves for a target fruit and approach direction, and propose a direction-aware graph neural architecture with per-fruit leaf-set attention and union-level aggregation. Experiments on a multi-plant synthetic pepper dataset show improved occlusion prediction (F1=0.73, NDCG@3=0.85) and attachment inference (edge F1=0.83) over strong ablations, yielding a structured relational signal for downstream intervention planning.
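
For readers unfamiliar with the reported metrics, here is a minimal sketch of NDCG@k and F1 using their standard definitions. It assumes linear gain for NDCG and plain count-based F1; the paper's exact evaluation protocol (for example, how relevance is assigned to candidate leaves) is not stated in the abstract, so the example inputs are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the predicted ranking divided by the ideal DCG."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

In an occlusion-ranking setting, `ranked_relevances` would hold the ground-truth relevance of each candidate leaf in the order the model ranked them, so `ndcg_at_k([1, 1, 0], 3)` scores a ranking that places both occluding leaves first.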


cs.MM [Back]

[152] Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder cs.MM | cs.AI | cs.CL | cs.CV | cs.SD | eess.ASPDF

Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão

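TL;DR: This paper proposes Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns shared representations across heterogeneous modalities such as images, audio, and text through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, it mitigates inter-modality conflicts without a Mixture-of-Experts (MoE) architecture, paired supervision, or routing.

Details

Motivation: Existing multimodal systems usually rely on separate expert encoders per modality, so computational complexity and overhead grow linearly as modalities are added. Unified Omni-models address part of this with MoE architectures but still inflate parameter counts and add routing overhead. This paper aims for a more efficient single-encoder architecture with more thorough parameter sharing.

Result: Omni-C achieves performance comparable to expert models on unimodal and cross-modal tasks. Zero-shot performance on audio and text degrades slightly but is largely recovered through lightweight linear probing or parameter-efficient fine-tuning. Compared with multi-encoder baselines, the unified architecture substantially reduces inference memory usage.

Insight: The contribution is a single dense encoder that needs neither MoE nor paired data and learns shared cross-modal representations by maximizing backbone parameter sharing with lightweight modality-specific projections. This yields efficient, scalable multimodal learning and supports sequential modality processing and low-memory inference on memory-constrained systems.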

Abstract: Recent multimodal systems often rely on separate expert modality encoders, which cause complexity and computational overhead to scale linearly as modalities are added. While unified Omni-models address this via Mixture-of-Experts (MoE) architectures with specialized experts and routing, they still inflate parameter counts and introduce routing overhead. In this paper, we propose Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities (images, audio, and text) through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, Omni-C effectively mitigates inter-modality conflicts without requiring MoE, paired supervision, or routing. This design supports efficient deployment on memory-constrained systems via sequential modality processing and low-memory inference, eliminating the need for parallel expert loading or specialized hardware. Experiments show Omni-C achieves performance comparable to expert models in unimodal and cross-modal tasks, with modest zero-shot degradation on audio and text that is largely recovered through lightweight linear probing or parameter-efficient fine-tuning. The unified architecture substantially reduces inference memory usage compared to multi-encoder baselines, advancing efficient and scalable multimodal learning.


cs.SD [Back]

[153] Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR cs.SD | cs.AI | cs.CLPDF

Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss

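TL;DR: This paper builds RAPTOR, a representation-aware pairwise-gated Transformer framework for out-of-domain recognition, and uses it for a systematic comparison of compact self-supervised learning (SSL) audio backbones such as HuBERT and WavLM in audio deepfake detection. It finds that multilingual HuBERT pre-training is the key to cross-domain robustness, with 100M-parameter models matching larger and commercial systems. It also introduces a perturbation-based test-time augmentation protocol for assessing calibration, which shows that WavLM variants are overconfident and miscalibrated under perturbation while iterative multilingual HuBERT remains stable.

Details

Motivation: Most audio deepfake detection research centers on a single large wav2vec2-XLSR backbone, and compact SSL backbones such as HuBERT and WavLM remain understudied for cross-domain detection. This paper uses controlled experiments to examine how effective and how robust across domains compact SSL backbones are.

Result: Across 14 cross-domain benchmarks, the 100M-parameter model with multilingual HuBERT pre-training performs on par with larger and commercial systems. Under the new perturbation-based test-time augmentation protocol, WavLM variants show overconfident miscalibration under perturbation (EER of 0.0% but high calibration error), while the iterative multilingual HuBERT model remains stable.

Insight: The contributions are: 1) the RAPTOR framework, which unifies the comparison of compact SSL backbones; 2) the finding that multilingual pre-training, not model scale, is the main driver of cross-domain robustness; and 3) a perturbation-based test-time augmentation protocol that exposes calibration differences hidden from standard metrics such as EER. More broadly, the study suggests that in audio deepfake detection the pre-training strategy (multilingual data, iterative refinement) matters more than simply scaling up the model, and it offers a new angle on evaluating model reliability.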

Abstract: Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
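
Since the abstract contrasts EER with calibration, a minimal sketch of how EER is typically computed may help. It assumes that higher scores mean "more likely bona fide", that both classes are present, and that the operating point is found by a simple threshold sweep without interpolation; the function name and label convention are illustrative, not taken from the RAPTOR evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    labels: 1 = bona fide, 0 = spoof (assumed convention).
    A trial is accepted as bona fide when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Include +inf so the "reject everything" operating point is covered.
    thresholds = np.append(np.unique(scores), np.inf)
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        accept = scores >= t
        far = np.mean(accept[labels == 0])    # spoof accepted
        frr = np.mean(~accept[labels == 1])   # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

EER depends only on how the scores rank the two classes, so rescaling every score toward 0 or 1 leaves it unchanged. That is why a separate calibration analysis, such as the perturbation-based protocol the paper introduces, can reveal differences that EER cannot.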