Table of Contents

cs.CL

[1] VeriBound: PAC-Bayesian Generalization Bounds for Process Reward Models Trained with Formal Verification Tools cs.CL | cs.LG

Amirul Rahman, Mohammed Sabih Alsharari

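TL;DR: This paper proposes VeriBound, a theoretical framework that provides PAC-Bayesian generalization bounds for process reward models (PRMs) trained on labels annotated automatically by formal verification tools such as Z3 and Isabelle. The framework establishes four core results: a generalization bound, a sample complexity result, a convergence rate analysis, and an error propagation bound from step-level verification error to Best-of-K performance degradation. Together they give a theoretical explanation of, and guarantees for, PRM generalization to unseen reasoning tasks.

Motivation: PRMs provide step-level verification for large language model (LLM) reasoning, but acquiring their training data is a bottleneck: human annotation is costly and Monte Carlo roll-out estimates are noisy. The existing approach FOVER annotates PRM training data automatically with formal verification tools and observes cross-task generalization from symbolic tasks to diverse reasoning benchmarks, but it lacks a theoretical explanation and formal bounds.

Result: The paper establishes four theoretical results: (1) a PAC-Bayesian generalization bound relating the empirical verification error on formal-verification-annotated training data to the expected error on unseen reasoning tasks; (2) a sample complexity of $O(d \log(d/\delta) / \varepsilon^2)$, where $d$ is the complexity of the PRM hypothesis class; (3) linear-rate convergence of PRM training under $L$-smoothness and bounded variance conditions; and (4) an error propagation bound from step-level verification error to Best-of-K performance degradation. These results give theoretical guarantees for PRM generalization; no quantitative state-of-the-art results on specific benchmarks are reported.

Insight: The novelty is the first systematic PAC-Bayesian framework for PRMs trained with formal verification tools, filling the gap left by the absence of a theoretical explanation in this area. Viewed objectively, the work combines formal verification with machine learning generalization theory and gives quantifiable bounds on annotation reliability, sample efficiency, and downstream performance impact, which is useful for work on verifiable AI and reasoning models.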

Abstract: Process Reward Models (PRMs) provide step-level verification for Large Language Model (LLM) reasoning, yet their training data acquisition remains a bottleneck: human annotation is costly and Monte Carlo roll-out estimates are noisy. A recent approach, FOVER, trains PRMs on step-level error labels automatically annotated by formal verification tools such as Z3 and Isabelle, and empirically observes cross-task generalization from symbolic tasks to diverse reasoning benchmarks. However, this generalization phenomenon lacks any theoretical explanation, and no formal bounds exist on the generalization error, sample complexity, convergence rate, or downstream Best-of-K performance of such PRMs. We propose VeriBound, a theoretical framework that provides PAC-Bayesian generalization bounds for PRMs trained with formal verification tools. We establish four main results: (i) a PAC-Bayesian generalization bound that relates the empirical verification error on formal-verification-annotated training data to the expected error on unseen reasoning tasks, with the bound depending on the formal verification accuracy and the divergence between training and test task distributions; (ii) a sample complexity result showing that $O(d \log(d/\delta) / \varepsilon^2)$ formal-verification-annotated examples suffice to achieve generalization error $\varepsilon$ with probability $1-\delta$, where $d$ is the complexity of the PRM hypothesis class; (iii) a convergence analysis proving that PRM training with formal verification labels converges at a linear rate under $L$-smoothness and bounded variance conditions; and (iv) an error propagation bound that relates step-level verification error to Best-of-K performance degradation.
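
To get a feel for how the sample complexity in result (ii) scales, the rate can be evaluated numerically. This is a minimal sketch: the abstract gives only the big-O form, so the leading constant `c` and the function name `sample_complexity` are assumptions made for illustration, not values from the paper.

```python
import math


def sample_complexity(d: float, delta: float, eps: float, c: float = 1.0) -> int:
    """Evaluate c * d * log(d / delta) / eps**2, rounded up.

    c is a hypothetical constant: the abstract states only the O(.) rate,
    so the absolute numbers are illustrative and only the scaling matters.
    """
    if d < 1 or not (0.0 < delta < 1.0) or eps <= 0.0:
        raise ValueError("need d >= 1, 0 < delta < 1 and eps > 0")
    return math.ceil(c * d * math.log(d / delta) / eps ** 2)


# Halving the target error roughly quadruples the number of examples.
n_coarse = sample_complexity(d=100, delta=0.05, eps=0.10)
n_fine = sample_complexity(d=100, delta=0.05, eps=0.05)
```

The dependence on $\delta$ is only logarithmic, so tightening the confidence level is cheap compared with tightening $\varepsilon$.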


[2] Post-Training Recipe, More Than Model Family, Shapes Multi-Agent LLM Conversational Behavior cs.CL | cs.AI | cs.CY

Luyang Zhang, Jialu Wang, Fei Xue, Yi-Yun Chu

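TL;DR: This paper studies where conversational behavior diversity in multi-LLM systems comes from. Large-scale experiments show that the post-training recipe (for example, reasoning distillation) can influence model behavior more than the model family itself, challenging the conventional view that family labels alone are enough to ensure behavioral diversity.

Motivation: The paper addresses how to choose models for a multi-LLM system so that they produce measurably different conversational behaviors, and tests whether the model family label remains a reliable predictor of behavioral diversity in interactive multi-LLM systems.

Result: On a 940,000-chain, 11-checkpoint corpus and a 1.6M-chain Llama factorial experiment, using the validated headline metric (hedging), a reasoning-distilled Llama checkpoint shifts by 18% depending on its conversational partner, more than any cross-family gap in the controlled subset.

Insight: The novelty is establishing post-training recipe as a first-class axis for multi-LLM panel composition and showing that model family alone is an incomplete proxy. Viewed objectively, the analysis indicates that behavioral diversity depends more on the specific training process than on the family label, which offers a new perspective for system design.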

Abstract: Multi-LLM systems use multiple language models to deliberate, judge each other’s outputs, or coordinate as agents. Their value depends on the models producing measurably different conversational behaviors when given the same input. Prior offline studies recommend drawing one model per family for behavioral diversity, because LLMs prefer outputs from their own family when rating one another in isolation. Whether the same family label predicts behavior in interactive multi-LLM systems, the setting that real deployed systems use, has not been tested. We study this with a 940,000-chain 11-checkpoint corpus and a 1.6M-chain same-base Llama factorial. On our validated headline metric, hedging, a reasoning-distilled Llama checkpoint shifts by 18% depending on which same-base partner it replies to, more than any cross-family hedging gap in the controlled subset. Qwen, closed-API, and runtime checks suggest the pattern is not isolated, while repair and challenge analyses remain exploratory because their surface-cue detectors are weaker. Overall, the results identify post-training recipe as a first-class axis for multi-LLM panel composition and show that model family alone is an incomplete proxy for conversational diversity.


[3] Beyond ‘One Language, One Script’: Quantifying Orthographic Bias in Multilingual VLMs with PuMVR cs.CL | cs.AI | cs.LG

Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina

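TL;DR: This paper introduces PuMVR, the first benchmark for quantifying orthographic bias in multilingual vision-language models (VLMs), focusing on Punjabi across its Gurmukhi, Shahmukhi, and Roman scripts. Evaluating 10 state-of-the-art VLMs, it reveals a substantial "Script Gap": performance on the same visual reasoning task differs sharply across scripts, and visual input does not remove this relative bias.

Motivation: Current multilingual VLMs implicitly rest on the flawed assumption that one language corresponds to one script. This overlooks billions of users of multi-script languages (such as Punjabi and Serbian), for whom model capability may be fractured by orthographic bias.

Result: On the PuMVR benchmark (375 culturally grounded image-reasoning tasks), accuracy differences across scripts reach 16% and the Script Consistency Rate (SCR) falls as low as 24.8%. Visual input raises absolute performance but does not close the relative gap.

Insight: The novelty is the first systematic quantification of script-dependent bias in VLMs. The paper proposes SCR as a core metric for script-agnostic evaluation and finds that reasoning patterns and Chain-of-Thought pathways diverge by script, providing a foundation for fairer AI evaluation frameworks.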

Abstract: Current Vision-Language Models (VLMs) are celebrated for their multilingual capabilities, yet they operate under a flawed assumption: that one language corresponds to a single writing system. This overlooks billions of users of multi-script languages such as Punjabi, Serbian, Hindi-Urdu, and Kurdish, among many others, for whom a model’s capability may be fractured by orthographic bias. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), the first benchmark designed to quantify script-dependent bias through 375 culturally grounded image-reasoning tasks across Punjabi’s three active scripts (Gurmukhi, Shahmukhi, Roman). Evaluating 10 state-of-the-art VLMs, we expose a substantial Script Gap: models frequently solve visual puzzles in one script while failing identical tasks in another, with accuracy deltas reaching 16% and Script Consistency Rates (SCR) as low as 24.8%. Crucially, visual input boosts absolute performance but does not close this gap: the relative bias persists. Our analysis suggests reasoning patterns show limited cross-script transferability, and Chain-of-Thought pathways diverge based on script alone. We propose SCR as a core metric for script-agnostic evaluation, challenging current multilingual assessment paradigms and providing a framework for equitable AI.
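
The two headline numbers, the accuracy delta and the Script Consistency Rate, can be illustrated on toy data. The abstract does not define SCR, so the sketch below assumes one plausible reading (the share of tasks answered correctly in every script); the function name `script_metrics` and the data are hypothetical.

```python
def script_metrics(results):
    """Compute per-script accuracy, the largest accuracy delta, and an assumed SCR.

    results maps a script name to a list of booleans, one per task, aligned
    by task index across scripts.
    """
    scripts = list(results)
    n_tasks = len(results[scripts[0]])
    if any(len(results[s]) != n_tasks for s in scripts):
        raise ValueError("every script needs one result per task")

    accuracy = {s: sum(results[s]) / n_tasks for s in scripts}
    max_delta = max(accuracy.values()) - min(accuracy.values())

    # ASSUMPTION: SCR = share of tasks solved in all scripts. The paper may
    # define it differently (for example, as agreement of outcomes).
    consistent = sum(all(results[s][i] for s in scripts) for i in range(n_tasks))
    return accuracy, max_delta, consistent / n_tasks


toy = {
    "gurmukhi": [True, True, False, True],
    "shahmukhi": [True, False, False, True],
    "roman": [True, True, True, False],
}
accuracy, max_delta, scr = script_metrics(toy)
```

On this toy input each script looks moderately accurate on its own, while only one task in four is solved in all three scripts, which is the kind of gap a per-script accuracy table hides.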


[4] SciLens: Multi-modal Scientific Claim Verification with Agentic Entailment and Grounding cs.CL

Yueming Wang, Tianshi Zheng, Jiaxin Bai, Yangqiu Song, Ginny Wong

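TL;DR: SciLens is a framework for multimodal scientific claim verification. It decomposes a claim into atomic propositions, grounds them in specific evidence from tables and figures, and verifies the claim with an atom-level entailment rule, which improves evidence sensitivity and interpretability.

Motivation: Scientific discovery increasingly relies on automated systems, but asking a vision-language model for a direct binary judgment does not verify complex scientific claims well. Claims often combine numerical results, comparisons, and qualifiers, while the evidence is encoded in tables and figures with different grounding structures.

Result: On the SciClaimEval development set, SciLens achieves 79.2% macro-F1 and 63.1% pair accuracy, indicating that its structured agentic validation improves performance.

Insight: The novelty is an evidence-conditioned atomic entailment framework that decomposes claims into central empirical atoms and background atoms. Fine-grained, modality-specific grounding (rows, columns, cells, and arithmetic relations for tables; axes, legends, and visual encodings for figures) yields a unified structured verification procedure and strengthens the model's understanding of, and reasoning over, complex evidence.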

Abstract: Scientific discovery increasingly relies on automated systems that generate hypotheses, inspect multimodal evidence, and validate claims at scale. Yet scientific claim verification is not well served by asking a vision-language model for a direct binary judgment: claims often combine numerical results, comparisons, scope qualifiers, and explanatory context, while evidence is encoded in tables and figures with distinct grounding structures. We present SciLens, an evidence-conditioned atomic entailment framework for multimodal scientific claim verification. SciLens decomposes each claim into central empirical atoms and background atoms, grounds the central atoms to modality-specific evidence witnesses, and predicts the final label with an atom-level entailment rule. For tables, atoms are grounded to rows, columns, cells, arithmetic relations, and table scope; for figures, they are grounded through panels, axes, legends, visual encodings, categories, trends, ranks, and qualifier checks. This yields a unified validation procedure in which a claim is supported only if every central empirical atom is entailed by the current evidence. On the SciClaimEval development set, SciLens achieves 79.2% macro-F1 and 63.1% pair accuracy, showing that structured agentic validation improves both evidence sensitivity and interpretability.
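
The atom-level entailment rule stated in the abstract (a claim is supported only if every central empirical atom is entailed) can be sketched in a few lines. The `Atom` class, the label strings, and the handling of a claim with no central atoms are assumptions made for this example; the decomposition and grounding steps are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    text: str
    central: bool   # central empirical atom vs. background atom
    entailed: bool  # outcome of grounding the atom against the evidence


def claim_label(atoms):
    """Apply the all-central-atoms-entailed rule to one decomposed claim."""
    central = [a for a in atoms if a.central]
    # ASSUMPTION: a claim without any central atom is treated as unsupported.
    if not central:
        return "not_supported"
    return "supported" if all(a.entailed for a in central) else "not_supported"


atoms = [
    Atom("Model A reaches 91.2 F1 on the test set", central=True, entailed=True),
    Atom("Model A outperforms Model B", central=True, entailed=False),
    Atom("F1 is a standard metric for this task", central=False, entailed=False),
]
label = claim_label(atoms)
```

Background atoms never affect the label here, so a failed background check cannot flip a verdict, whereas a single unentailed central atom does.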


[5] MindAlign: Decoding Inner Speech from fMRI Signals via Multimodal Embedding Alignment under Limited Data cs.CL | cs.AI | eess.AS

Muxuan Liu, Ichiro Kobayashi, Satoshi Nishida

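TL;DR: MindAlign proposes a decoupled two-stage brain-to-language framework for decoding inner speech from fMRI signals. The first stage learns a subject-specific neural-semantic alignment that maps fMRI activity into a shared multimodal semantic space. The second stage uses the extracted semantic sketch and visual context to prompt a frozen multimodal language model for free-form text generation. The method outperforms fMRI-only and random baselines on a silent image description task and shows cross-subject generalization.

Motivation: Decoding inner speech from non-invasive brain signals is difficult because there is no overt linguistic output, training data is limited, and inter-subject variability is large. Existing methods rely on task-specific decoder fine-tuning, which limits scalability and adaptation to new participants.

Result: In experiments on fMRI data from silent image description, the method consistently outperforms fMRI-only and random baselines. The learned semantic-to-language projection generalizes across subjects and enables effective decoding when paired with subject-specific neural alignment.

Insight: The novelty is the decoupled neural-semantic alignment combined with a frozen language model, which avoids modifying the language model and improves scalability and modularity. The shared semantic space and the integration of visual context strengthen the decoding of semantic content from limited data, supporting a new direction for brain-to-text decoding.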

Abstract: Decoding inner speech from non-invasive brain signals remains a fundamental challenge due to the absence of overt linguistic output, limited training data, and large inter-subject variability. Existing brain-to-text approaches often rely on task-specific decoder fine-tuning, which restricts scalability and complicates adaptation to new participants. We propose MindAlign, a decoupled two-stage brain-to-language framework that enables open-ended text generation from fMRI signals without modifying the underlying language model. The first stage learns a subject-specific neural-semantic alignment that maps fMRI activity into a shared multimodal semantic space, extracting a latent semantic sketch of the internally generated sentence. The second stage integrates this sketch with visual context to prompt a frozen multimodal language model for free-form generation. Experiments on fMRI data collected during silent image description demonstrate that the proposed approach consistently outperforms fMRI-only and random baselines. We further show that the learned semantic-to-language projection can generalize across subjects, enabling effective decoding when paired with subject-specific neural alignment. These results indicate that neural signals modulate semantic content beyond image-driven priors, supporting a scalable and modular direction for brain-to-text decoding.


[6] FirstPass: Grounding AI Scientific Judgment in Multi-Round Editorial Outcomes cs.CL | cs.AI | cs.LG

Prabhjot Singh, Somnath Luitel, Manmeet Singh, Josh Durkee

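TL;DR: This paper introduces FirstPass, a dataset and fine-tuned model for multi-round peer-review dialogues. It targets three shortcomings of existing AI peer-review systems: reliance on CS/ML data alone, neglect of the iterative dialogue that validates science, and evaluation that favors stylistic mimicry over real editorial judgment. The model is built on Qwen2.5-7B-Instruct, fine-tuned with LoRA on three tasks, and significantly outperforms baselines at predicting editorial outcomes and generating reviews.

Motivation: Existing AI peer-review systems have three problems: their training data is limited to computer science and machine learning, they ignore the iterative dialogue that validates science, and they are evaluated on stylistic mimicry rather than real editorial judgment. The paper addresses these by building a cross-disciplinary multi-round review dataset and a fine-tuned model.

Result: On predicting editorial outcomes (Standard vs. Extended revision cycles), FirstPass reaches 80.5% accuracy and 78.2% F1-macro, significantly outperforming zero-shot Gemini-3.1-flash-lite-preview (by 10.4 percentage points) and all baselines. On review generation, its reviews average 1,187 words, closer to human references (2,155 words), with a ROUGE-L of 0.154, significantly better than zero-shot Qwen and DeepSeek.

Insight: The core novelty is the first complete multi-round peer-review dialogue dataset spanning five scientific domains (biology, chemistry, neuroscience, physics, and earth science), together with evidence that response-only loss masking is a prerequisite for an effective model rather than an optimization trick. The model can serve in the pre-submission stage as an "anticipatory scientific co-author" that simulates expert critique and predicts revision cycles, giving authors the kind of judgment a trusted colleague would provide, with consistent performance across the five disciplines.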

Abstract: AI systems for peer review fail on three fronts: they train on Computer Science and Machine Learning venues alone, ignore the iterative dialogue that validates science, and evaluate on stylistic mimicry rather than real editorial judgment. We introduce FirstPass, a dataset and fine-tuned model that addresses all three. Curating 3,668 complete multi-round peer-review dialogues from Nature Communications across five scientific domains (biology, chemistry, neuroscience, physics, and earth science), we exploit mandatory transparent peer review (instituted November 2022) and verify 100% content integrity by automated audit. We fine-tune Qwen2.5-7B-Instruct via Low-Rank Adaptation (LoRA) on three tasks: review generation, reviewer updating, and revision-cycle prediction. Our key finding is that response-only loss masking is a prerequisite, not an optimization: without it, accuracy is 62.0%, below the majority baseline; with it, FirstPass achieves 80.5% accuracy and F1-macro 78.2% on predicting editorial outcomes (Standard vs. Extended revision cycles), outperforming Gemini-3.1-flash-lite-preview zero-shot by 10.4 percentage points and all baselines with statistical significance (McNemar p < 0.001). On generation, FirstPass produces reviews averaging 1,187 words, substantially closer to human references (2,155 words) than any baseline, achieving ROUGE-L 0.154 with significant gains over Qwen and DeepSeek zero-shot (p < 0.001). Deployed in the pre-submission loop as an anticipatory scientific co-author, FirstPass simulates expert critique and predicts revision cycle outcomes before submission, giving authors the judgment a trusted colleague would provide, with consistent cross-domain performance across five disciplines.
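
Response-only loss masking, which the abstract identifies as a prerequisite, is commonly implemented by replacing the label of every prompt token with an ignore index so that only response tokens contribute to the loss. The sketch below shows that generic technique, not the authors' training code; the helper name and token ids are hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, response_start):
    """Return labels in which every prompt position is excluded from the loss.

    input_ids: token ids of prompt + response, concatenated.
    response_start: index of the first response token.
    """
    if not 0 <= response_start <= len(input_ids):
        raise ValueError("response_start must lie within the sequence")
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])


# Hypothetical token ids: 4 prompt tokens followed by 3 response tokens.
input_ids = [101, 7592, 2088, 102, 2023, 2003, 1012]
labels = mask_prompt_labels(input_ids, response_start=4)
```

Without the mask the model is also trained to reproduce the manuscript and reviewer text in the prompt, which can dominate the loss when prompts are much longer than responses.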


[7] Scaling Diverse Language Generation for 3D Visual Grounding cs.CL | cs.CV

Austin T. Wang, Dongchen Yang, Angel X. Chang

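TL;DR: This paper proposes ViGiL3D++, a scalable, scene-agnostic method for generating diverse 3D visual grounding queries. By combining constraint sampling in scene graphs with the language generation of large language models, it targets the narrow linguistic patterns and limited diversity of existing datasets, which restrict model generalization.

Motivation: Current 3D visual grounding models lack large-scale, diverse language descriptions and so struggle to generalize to complex linguistic patterns. Existing methods fall short in constraint types and language diversity, and captioning methods cannot precisely contrast objects, which is essential for visual grounding.

Result: The generated queries are more diverse than existing scaled datasets and improve model performance on several 3DVG benchmarks, while also revealing notable limitations of vision-language models.

Insight: The core novelty is combining structured constraint sampling over scene graphs with open-ended LLM generation to produce diverse yet precise grounding queries at scale. This offers a new approach to data augmentation and evaluation for building more robust 3D vision-language models.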

Abstract: Developing robust models for 3D visual grounding (3DVG), the localization of entities in a 3D scene described in natural language, is important for enabling agents to correspond spatial language with objects in the physical world. However, the lack of diverse descriptions at scale prevents models from generalizing beyond simple linguistic patterns. Recent attempts to scale such descriptions lack diversity in the constraint types and language used to ground objects. Captioning methods cannot precisely contrast objects, which is important for visual grounding. We therefore propose ViGiL3D++, a scalable, scene-agnostic method that generates diverse visual grounding queries by combining constraint sampling in scene graphs with the language generation of LLMs. We show that it has greater diversity than existing scaled datasets and improves model performance on several 3DVG benchmarks, but also illuminates outstanding limitations of VLMs.


[8] Event Ontology Expansion via LLM-Based Conceptualization cs.CL

Weicheng Ren, Zixuan Li, Long Bai, Xiaolong Jin, Jiafeng Guo

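TL;DR: This paper proposes ConceptE, a framework that enhances event ontology expansion with LLM-generated concept-level semantics. It prompts a large language model with the sentence and event trigger to obtain a concise concept name and a natural-language description, then builds concept-enhanced representations that improve event clustering and hierarchy expansion. Experiments on ACE, ERE, and MAVEN show that ConceptE outperforms existing methods on all subtasks of event ontology expansion.

Motivation: Existing event ontology expansion methods typically cluster contextualized trigger representations by instance-level similarity. Such representations often conflate concept-level semantics with surface contextual variation, which leads to unstable clustering and unreliable hierarchy expansion.

Result: On the ACE, ERE, and MAVEN benchmarks, ConceptE improves BCubed-F1 for event clustering by up to 12.37% and Taxo_F1 for hierarchy expansion by 6.48%, reaching state-of-the-art performance on both.

Insight: The novelty is using an LLM for conceptualization: extracting concept-level semantics from triggers and building enhanced representations aligned with ontology-level reasoning, which improves the coherence of event clustering and the reliability of hierarchy expansion.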

Abstract: Event ontology expansion aims to discover emerging event types from data and extend them to appropriate positions in the existing event ontology. Existing methods typically cluster contextualized trigger representations and attach induced clusters to the ontology based on instance-level similarity. However, ontology expansion requires concept-level semantics that characterize event types, whereas contextualized trigger representations often conflate these semantics with surface contextual variation, leading to unstable clustering and unreliable hierarchy expansion. To address this issue, we propose ConceptE, a conceptualization-enhanced framework for event ontology expansion. ConceptE first derives concept-level semantics by prompting an LLM with the sentence and event trigger, producing a concise concept name and a natural-language description. It then jointly encodes these semantics with trigger information to build concept-enhanced representations aligned with ontology-level reasoning. This representation design supports more coherent event clustering, more reliable hierarchy expansion, and ontology-consistent type naming. Experiments on ACE, ERE, and MAVEN demonstrate that ConceptE consistently outperforms state-of-the-art approaches across all subtasks of event ontology expansion. In particular, it achieves improvements of up to 12.37% in BCubed-F1 for event clustering and 6.48% in Taxo_F1 for hierarchy expansion, demonstrating the effectiveness of the proposed ConceptE method.
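
For readers unfamiliar with the clustering metric, BCubed precision and recall are computed per item and then averaged. The sketch below uses the common variant that takes the harmonic mean of the averaged precision and recall; the paper's exact evaluation script is not shown in the abstract and may differ, and the cluster ids and type names are made up.

```python
from collections import Counter


def bcubed_f1(pred, gold):
    """BCubed F1 for a flat clustering.

    pred[i] is the predicted cluster id of item i, gold[i] its gold event type.
    """
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and aligned")
    n = len(pred)
    pred_size = Counter(pred)            # items per predicted cluster
    gold_size = Counter(gold)            # items per gold type
    overlap = Counter(zip(pred, gold))   # items sharing both cluster and type

    # Per-item precision: fraction of the item's cluster with its gold type.
    precision = sum(overlap[(p, g)] / pred_size[p] for p, g in zip(pred, gold)) / n
    # Per-item recall: fraction of the item's gold type found in its cluster.
    recall = sum(overlap[(p, g)] / gold_size[g] for p, g in zip(pred, gold)) / n
    return 2 * precision * recall / (precision + recall)


score = bcubed_f1(pred=[0, 0, 1, 1], gold=["attack", "attack", "attack", "arrest"])
```

In the toy call, precision is 0.75 and recall is 2/3, so the score is 12/17, about 0.706.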


[9] Learning What Not to Forget: Long-Horizon Agent Memory from a Few Kilobytes of Learning cs.CL | cs.AI

Nusrat Jahan Lia, Aritra Mazumder

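TL;DR: This paper proposes LRE (Learned Relevance Eviction), a lightweight history-eviction policy for long-horizon language-model systems whose interaction history outgrows the context window. LRE learns to identify load-bearing information in the history (such as access tokens and paths) and keeps it by verbatim extraction, with no language model involved. It occupies only a few kilobytes and runs on CPU alone.

Motivation: In long-running LLM agent systems, the limited context window forces history to be evicted. Dropping a key detail by mistake, such as a login token or a path that a later call needs, causes the task to fail.

Result: Under a matched-budget comparison, no baseline dominates LRE on the accuracy-cost plane. On agent tasks, LRE's overall accuracy matches that of keeping the entire history; on simple tasks it exceeds the no-eviction baseline by 27%, while reducing peak context size by up to 52% and requiring no compressor calls. On conversational memory, LRE outperforms dense and token-pruning encoders at no neural cost. In downstream evaluation, LRE gives the best budgeted answer quality on LoCoMo while reading 68% fewer tokens.

Insight: The novelty is treating memory eviction in LLM agents as a fidelity problem and proposing a deployable, proactive policy (one that works without knowing the future query) for settings where exact state is decisive. The core insight is that cheap learned relevance is effective enough, and that its supervision can be annotation-free: training only on the system's own behavior recovers 95% of the supervised scorer's effectiveness.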

Abstract: Long-running language-model systems accumulate interaction history that outgrows the context window, so they must continually evict. When an eviction policy drops a load-bearing detail, for example an access token issued at login or a path the next call needs, the action fails. We present LRE (Learned Relevance Eviction), a few-kilobyte, CPU-only, language-model-free scorer that learns which units of history are load-bearing and keeps them by verbatim extraction. Under a matched-budget comparison, in our experiment, no baseline dominates LRE on the accuracy-cost plane. On agents, LRE matches the accuracy of keeping the entire history overall. On the simplest tasks, it exceeds that no-eviction baseline by 27%, while requiring zero compressor calls and reducing peak context size by up to 52%. A controlled study trace shows LRE completes tasks where the others loop, finishing one such task in 37% fewer calls than keeping everything and solving 14 tasks where no other run policy does. On conversational memory, LRE outranks dense and token-pruning encoders at zero neural cost. In downstream evaluation, LRE gives the best budgeted answer quality on LoCoMo while reading 68% fewer tokens. Its supervision can also be annotation-free: training only on the system’s own behavior recovers 95% of the supervised scorer’s effectiveness. We argue that, because memory eviction in LLM agents is a fidelity problem, it requires a deployable proactive policy where the future query is unavailable and exact state is decisive, and that cheap learned relevance can be sufficient.
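
The general shape of score-based eviction with verbatim retention can be sketched as follows. This is not the LRE scorer: the relevance scores are supplied by hand, the greedy budget rule is an assumption, and the names `evict` and `word_count` are hypothetical. It only illustrates keeping high-relevance units unchanged under a fixed budget.

```python
def word_count(unit):
    """Toy cost model: whitespace-separated tokens in a history unit."""
    return len(unit.split())


def evict(units, scores, budget, cost=word_count):
    """Greedily keep the highest-scoring units that fit within the budget.

    Kept units are returned verbatim and in their original order, so exact
    strings such as tokens or paths survive unchanged.
    """
    by_score = sorted(range(len(units)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in by_score:
        unit_cost = cost(units[i])
        if used + unit_cost <= budget:
            kept.add(i)
            used += unit_cost
    return [units[i] for i in sorted(kept)]


history = [
    "login ok token=abc123",
    "assistant chatted about weather",
    "path=/data/run7",
    "thanks",
]
# Hypothetical relevance scores; in LRE these would come from the learned scorer.
kept = evict(history, scores=[0.9, 0.1, 0.8, 0.2], budget=5)
```

Because retained units are copied rather than summarized, the access token and path in the toy history are preserved exactly, which is the property the abstract describes as decisive for agents.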


[10] FiLM-Coordinated Dual-Branch Transformer for Global-Local Dependency Modeling in Language Modeling cs.CL | cs.AI | cs.LG

Zhiqiang Zhou, Xu Ling, Junliang Dai

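TL;DR: This paper proposes a FiLM-coordinated dual-branch Transformer for language modeling. It targets the inherent conflict in standard Transformers, where a single self-attention pathway models both global dependencies and local patterns. Each layer explicitly separates a global branch from a local branch, and feature-wise linear modulation (FiLM) provides dynamic cross-branch coordination in place of simple concatenation or static addition. Experiments show that the structure outperforms same-width single-branch baselines on multiple small-scale language modeling tasks.

Motivation: Standard Transformers use a single self-attention pathway to model both global dependencies and local patterns, which creates tension between long-range structural reasoning and fine-grained local representation learning. The paper aims to ease this conflict by explicitly separating global and local modeling branches and introducing a dynamic coordination mechanism.

Result: In small-scale language modeling settings such as TinyShakespeare and a 1M-character subset of WikiText-2, the full dual-branch FiLM model achieves the best results among same-width structural baselines. Multi-seed experiments support the stability of the gains.

Insight: The novelty is an explicit global-local dual-branch design with a bidirectional FiLM module for channel-wise dynamic calibration, in place of conventional token-level interaction or static fusion. Mechanistic analyses show that FiLM learns input-dependent, layer-dependent, and channel-selective modulation patterns, but parameter-matched widened single-branch baselines show that the current design still leaves room for improvement in parameter efficiency.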

Abstract: Standard Transformers use a single self-attention pathway to model both global dependencies and local patterns, creating tension between long-range structural reasoning and fine-grained local representation learning. We propose a FiLM-coordinated dual-branch Transformer for language modeling, where each layer explicitly contains a global branch and a local branch, and feature-wise linear modulation (FiLM) is used for dynamic cross-branch coordination instead of simple concatenation or static addition. The key idea is that the two branches represent different dependency views of the same input, making channel-wise calibration more suitable than heavy token-level interaction. We therefore design a bidirectional FiLM module in which each branch generates per-channel scaling and shifting parameters to condition the other. Experiments on multiple small-scale language modeling settings show that the proposed structure consistently outperforms same-width single-branch baselines and weakened dual-branch variants under a fixed lightweight configuration. On TinyShakespeare and a 1M-character subset of WikiText-2, the full dual-branch FiLM model achieves the best results among same-width structural baselines. Multi-seed results support the stability of the gains, while mechanistic analyses show that FiLM learns input-dependent, layer-dependent, and channel-selective modulation patterns rather than static scaling. Parameter-matched widened single-branch baselines also indicate that the current design still leaves room for improvement in parameter efficiency.
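
The bidirectional FiLM coordination described in the abstract can be written compactly. The notation here is an assumption made for illustration: $h_g$ and $h_\ell$ are the global- and local-branch features, $f_{\ell \to g}$ and $f_{g \to \ell}$ are hypothetical learned maps that output per-channel scale and shift vectors, and $\odot$ is channel-wise multiplication. Whether the conditioning is computed per token or from pooled features is not stated in the abstract.

```latex
\begin{aligned}
(\gamma_{\ell \to g},\ \beta_{\ell \to g}) &= f_{\ell \to g}(h_\ell), &
\tilde{h}_g &= \gamma_{\ell \to g} \odot h_g + \beta_{\ell \to g}, \\
(\gamma_{g \to \ell},\ \beta_{g \to \ell}) &= f_{g \to \ell}(h_g), &
\tilde{h}_\ell &= \gamma_{g \to \ell} \odot h_\ell + \beta_{g \to \ell}.
\end{aligned}
```

Static addition corresponds to the special case $\gamma = 1$ with $\beta$ independent of the input, which is the behavior the mechanistic analyses report FiLM moving away from.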


[11] Quality and Agreement in Multilabel Emotion Annotation: A Case Study and Evaluation Framework cs.CL

Emily Öhman, Anna Koufakou

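TL;DR: Through a multilabel emotion annotation case study, this paper examines how annotator behavior and label aggregation affect agreement estimates and downstream emotion classifier performance. It represents targets as soft vote-share labels (including an intensity-weighted variant) rather than collapsing disagreement into a single hard label, and evaluates with both thresholded metrics (macro-/micro-F1) and probabilistic alignment (SoftBCE). The results show that annotation disagreement is structured and leaves measurable traces in model behavior: hard labels may maximize F1, while soft supervision yields predictions that better reflect actual annotator variance and uncertainty.

Motivation: Emotion annotation is inherently subjective, yet most NLP pipelines still assume "gold" labels produced by majority voting and treat annotator variation as noise. The paper studies how annotator behavior and aggregation choices affect agreement estimates and downstream emotion classifiers, in order to give practical guidance for designing, aggregating, and evaluating multilabel emotion datasets.

Result: Across annotation regimes, disagreement is structured and has a measurable effect on model behavior. Hard labels may score best on F1, while soft supervision makes predictions better reflect empirical annotator variance and uncertainty. The evaluation combines thresholded metrics (macro-/micro-F1) with probabilistic alignment (SoftBCE) and provides data-derived disagreement diagnostics.

Insight: The novelty is using soft vote-share labels (including an intensity-weighted variant) to preserve disagreement information instead of adopting majority-vote hard labels, which lets models better capture the subjectivity and uncertainty in emotion annotation. Viewed objectively, the approach offers a finer-grained framework for handling annotator variation in multilabel emotion data and stresses the importance of probabilistic alignment in evaluation, which helps in building more robust emotion analysis systems.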

Abstract: Emotion annotation is inherently subjective, yet most NLP pipelines still assume “gold” labels, typically produced by majority voting, and treat annotator variation as noise. In this paper, we present a multilabel emotion annotation case study and use it to examine how annotator behavior and aggregation choices affect both agreement estimates and downstream emotion classifiers. Rather than collapsing disagreement into a single label, we represent targets as soft vote-share labels (including an intensity-weighted variant) and evaluate models using both thresholded metrics (macro-/micro-F1) and probabilistic alignment (Bernoulli cross-entropy, SoftBCE), alongside data-derived disagreement diagnostics. Across annotation regimes, we show that disagreement is structured and leaves measurable traces in model behavior: hard labels may maximize F1 metrics, while soft supervision yields predictions that better reflect empirical annotator variance and uncertainty. Our results provide practical guidance for designing, aggregating, and evaluating multilabel emotion datasets when multiple interpretations are plausible.
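
The contrast between majority-vote labels and soft vote-share labels, and the Bernoulli cross-entropy used to score probabilistic alignment, can be shown on a toy item. This is a minimal sketch: the function names are hypothetical, the intensity-weighted variant is not modeled, and the exact SoftBCE formulation in the paper may differ in averaging details.

```python
import math


def vote_share(annotations):
    """Per-label share of annotators who assigned each label to one item."""
    n = len(annotations)
    return [sum(column) / n for column in zip(*annotations)]


def majority_vote(shares, threshold=0.5):
    """Collapse vote shares into hard labels."""
    return [int(s > threshold) for s in shares]


def soft_bce(targets, probs, eps=1e-12):
    """Mean Bernoulli cross-entropy between soft targets and predictions."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logs finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


# Three annotators, three emotion labels (for example: joy, anger, fear).
annotations = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
soft = vote_share(annotations)   # [1.0, 1/3, 1/3]
hard = majority_vote(soft)       # [1, 0, 0]

calibrated = soft_bce(soft, soft)
overconfident = soft_bce(soft, [0.99, 0.01, 0.01])
```

A prediction that mirrors the annotator split scores better on soft BCE than an overconfident one, even though both produce the same hard labels after thresholding, which is the trade-off between F1 and alignment that the abstract reports.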


[12] Demographic Metadata as Construct-Irrelevant Noise in DistilBERT-Based Automated Essay Scoring cs.CL

Teik Peng Ch’ng, Hui Na Chua

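TL;DR: This study examines the effect of combining demographic metadata with text input through a naive multimodal fusion strategy (early concatenation) in a DistilBERT-based automated essay scoring system. It finds that this fusion significantly lowers the model's predictive accuracy, raises validation loss, and exacerbates scoring bias.

Motivation: Automated essay scoring (AES) systems aim to help teachers reduce grading workloads, but human grading is often influenced by students' demographic characteristics. The effectiveness of different strategies for combining demographic metadata with text input when training AES models remains unclear, and this study explores the impact of one naive multimodal fusion strategy.

Result: On the ASAP 2.0 dataset, the baseline model reaches a Quadratic Weighted Kappa of 0.727, while the experimental model with fused metadata drops to 0.656, and validation loss rises from 1.25 to 1.29. Across 19 tests, score parity instances fall from 15 to 12, indicating that fusing metadata degrades performance and exacerbates bias.

Insight: The novelty is an empirical analysis of the negative effect of demographic metadata as construct-irrelevant noise in AES models, showing that a naive early-fusion strategy can harm performance. Viewed objectively, this underlines the need to handle metadata carefully when building fair AES systems, rather than fusing multimodal information indiscriminately.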

Abstract: Automated Essay Scoring (AES) systems are increasingly used to support teachers in managing grading workloads and to provide a supplementary rater in large-scale assessments. While human grading is frequently influenced by students’ demographic characteristics, the efficacy of different strategies for integrating demographic metadata with textual input used to train AES models remains underexplored. This study investigates the impact of a specific multimodal fusion strategy - naive metadata concatenation - on the predictive accuracy, training convergence, and score parity of a DistilBERT-based AES model. A comparative analysis was conducted using the ASAP 2.0 dataset to evaluate a baseline model against an experimental model trained with input that concatenates tokenised text and demographic metadata using a naive multimodal fusion strategy. Evaluated via 10-fold cross-validation, the findings reveal that the early fusion of demographic metadata and the input significantly degrades the model’s overall predictive accuracy. The baseline model achieved a Quadratic Weighted Kappa (QWK) of 0.727, which dropped to 0.656 upon integrating metadata. Furthermore, the experimental model exhibited higher validation loss (1.29) compared to the baseline model (1.25). The experimental model also displayed exacerbated scoring bias, reducing score parity instances from 15 to 12 out of 19 tests.
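
Quadratic Weighted Kappa, the headline metric here, compares the observed disagreement between two raters with the disagreement expected by chance, weighting each disagreement by the squared distance between ratings. The following is a minimal pure-Python sketch of the standard definition, not the study's evaluation code; the function name and toy ratings are assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic Weighted Kappa between two aligned lists of integer ratings."""
    k = max_rating - min_rating + 1
    n = len(rater_a)
    if k < 2 or n == 0 or n != len(rater_b):
        raise ValueError("need aligned ratings and at least two rating levels")

    # Confusion matrix of observed rating pairs.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0.0:  # both raters used one identical rating
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
```

Because the weights grow quadratically, a model that is off by two score points is penalized four times as heavily as one that is off by one, so the drop from 0.727 to 0.656 reflects larger or more frequent score misses rather than a simple change in exact-match accuracy.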


[13] Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations cs.CL | cs.AI | cs.CR

Chenhui Hu, Muhammed Salih, Sudipto Guha, Subramanian Srinivasan

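TL;DR: This paper proposes a scalable hierarchical attention Transformer for detecting multi-turn jailbreaks in long conversations. It treats multi-turn jailbreak detection as a conversation-level classification problem and uses a hierarchical structure to encode individual turns efficiently and capture dialogue dynamics, avoiding expensive long-context concatenation.

Motivation: Multi-turn jailbreak attacks spread unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation, and so evade turn-level moderation. The paper addresses the detection of such multi-turn jailbreaks in long conversations.

Result: On a challenging evaluation benchmark of 14,038 conversations, the method achieves an F1 of 0.9394, which is 0.07 higher than the strongest baseline, Claude Opus 4.7, while halving its false-positive rate. Ablation studies confirm that each architectural component contributes.

Insight: The core novelty is an efficient hierarchical detector that combines compact per-turn representations with a lightweight conversation module. The module fuses cross-attention and self-attention, attends selectively to fine-grained evidence, and retains cross-turn reasoning while avoiding the cost of long-context processing.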

Abstract: Multi-turn jailbreaks can evade turn-level moderation by spreading unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation. We address multi-turn jailbreak detection as a conversation-level classification problem and introduce an efficient hierarchical detector that avoids expensive long-context concatenation while retaining cross-turn reasoning. The model encodes individual turns to form compact turn representations and applies a lightweight conversation module that captures dialogue dynamics and selectively attends to fine-grained evidence when needed. On a challenging evaluation benchmark of 14,038 conversations, our approach achieves an F1 of 0.9394, outperforming Claude Opus 4.7, the strongest competing baseline, by 0.07 while halving its false-positive rate. Ablation studies confirm that each architectural component contributes meaningfully, with combining cross-attention and self-attention in the conversation module yielding a 2.26 percentage point reduction in false-positive rate over the self-attention-only variant.


[14] A Multi-Agent Audit Framework for High-Stakes Reasoning: Evaluation and Interpretability in Clinical Mental Health Screening cs.CL

Jingchen Ye, Yanpei Yu, Luyao Zhang

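TL;DR: This paper proposes a multi-agent audit framework for high-stakes reasoning tasks, with a focus on clinical mental health screening. By simulating a collaborative, multi-step verification process, the framework decomposes reasoning into perception, knowledge retrieval-augmented generation, chain-of-thought clinical inference, and audit verification modules to improve transparency and interpretability. Experiments on the DAIC-WOZ dataset show that it significantly outperforms single-agent baselines and lowers prediction error.

Motivation: Conventional single-model large language models suffer from hallucination and low interpretability under zero-shot paradigms in high-stakes reasoning tasks. The paper aims to make workflows more transparent and verifiable through collaborative multi-agent verification.

Result: Evaluated on the DAIC-WOZ dataset with locally deployed open-source models, the multi-agent pipeline reduces the Mean Absolute Error for PHQ-8 depression severity prediction from 5.35 to 5.02, significantly outperforming single-agent baselines.

Insight: The novelty is decomposing high-stakes reasoning into a modular multi-agent audit workflow. Cross-agent validation traces mitigate reasoning drift and provide highly interpretable diagnostic rationales, offering a generalizable paradigm for reliable AI-assisted decision support beyond isolated model scaling.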

Abstract: High-stakes reasoning tasks necessitate transparent and verifiable workflows, yet conventional single-model large language models (LLMs) often struggle with hallucination and low interpretability under zero-shot paradigms. To address this general AI challenge, we propose a Multi-Agent Audit Framework that simulates a collaborative, multi-step verification process. We empirically validate this architecture in the sensitive domain of clinical mental health screening using a modular LangChain workflow. Our framework decomposes the reasoning process into a Perception Agent, Knowledge Retrieval-Augmented Generation (RAG), Chain-of-Thought (CoT) clinical inference, and a critical Audit verification stage. We evaluated this framework on the DAIC-WOZ dataset using locally deployed open-source models. Experimental results demonstrate that our multi-agent pipeline significantly outperforms single-agent baselines, reducing the Mean Absolute Error (MAE) for PHQ-8 depression severity prediction from 5.35 to 5.02. By exposing cross-agent validation traces, the framework mitigates reasoning drift and provides highly interpretable diagnostic rationales, offering a generalizable paradigm for reliable AI-assisted decision support beyond isolated model scaling. We make data and code open access on GitHub for replicability.


[15] Dementia-Agents: A Multi-Modal Multi-Agent System for Dementia Staging and Phenotyping cs.CL

Yaling Shen, Maja Christensen, Yiwen Jiang, Jenna Dennison, David Darby

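TL;DR: This paper proposes Dementia-Agents, a multi-modal multi-agent system for dementia staging and phenotyping in real-world clinical settings. Through a three-step workflow of a data agent, domain expert agents, and a coordinator agent, the system integrates incomplete, heterogeneous clinical data and moves beyond the conventional Alzheimer's-centric binary or three-stage modeling paradigm.

Motivation: Most existing AI methods are limited to Alzheimer's disease and operate on well-curated research data. They overlook the fact that dementia is a syndrome spanning multiple stages, phenotypes, and etiologies, and they struggle with the incomplete, heterogeneous multi-modal assessment data found in real clinical practice.

Result: Evaluated on a real-world clinical cohort of 1,066 patients from two cognitive neurology services, the method achieves consistent improvements in diagnostic performance for syndrome-level dementia staging and phenotyping over monolithic multi-modal large language models and prior medical multi-agent systems.

Insight: The novelty is a multi-agent framework aligned with the clinical workflow. A data agent preserves missing-data signals and produces semantically faithful textual representations, after which domain expert agents and a coordinator agent that performs probabilistic aggregation make the decisions, improving diagnostic performance while preserving domain-level interpretability.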

Abstract: Dementia diagnosis requires integrating multi-modal clinical assessments from diverse informants and clinicians under incomplete and heterogeneous data conditions. Yet most AI-driven approaches remain Alzheimer’s disease (AD)-centric, framing the problem as binary AD detection or three-stage AD progression modeling within well-curated research settings. This pathology-driven paradigm overlooks the broader, syndrome-level nature of dementia, which spans multiple stages, phenotypes, and etiologies. In this paper, we propose Dementia-Agents, a clinically aligned multi-agent framework for real-world dementia staging and phenotyping. The framework follows a three-step workflow: (1) a data agent translates structured clinical records into semantically faithful textual representations that preserve missing-data signals and routes them to domain-aligned experts; (2) five fine-tuned expert agents generate domain-level predictions; and (3) a coordinator agent performs probabilistic aggregation to produce final staging and phenotyping decisions. We develop and evaluate Dementia-Agents on a real-world clinical cohort of 1,066 patients from two cognitive neurology services. Compared with monolithic multi-modal large language models (MLLMs) and prior medical multi-agent systems, our approach achieves consistent improvements in diagnostic performance for real-world syndrome-level dementia staging and phenotyping, while preserving domain-level interpretability.


[16] Precision Recall Controllable Radiology Report Generation via Hybrid Natural Language and Clinical Reward Learning cs.CL | cs.CV

Ling Chen, Ruinan Jin, Jun Luo, Hanliang Chen, Quirin Strotzer

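TL;DR: This paper proposes a reinforcement learning framework for precision-recall controllable radiology report generation. By introducing a clinical reward and a group-relative training strategy, it allows explicit control over clinical precision and recall while preserving language fluency.

Motivation: Existing radiology report generation methods mainly optimize natural language generation metrics and offer little control over key factors such as clinical precision and recall, so generated reports may not meet differing clinical needs.

Result: Experiments on the MIMIC-CXR dataset show that the method outperforms existing state-of-the-art approaches on both natural language generation and clinical efficacy metrics, and reliably controls the trade-off between clinical precision and recall.

Insight: The novelties are an inference mechanism in which a control parameter explicitly adjusts the precision-recall trade-off, a clinical reward that improves clinical correctness, and a group-relative training strategy that reduces reward variance and improves training stability.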

Abstract: Automated radiology report generation (RRG) has gained increasing attention because it can reduce the heavy workload of clinical report writing. However, most existing methods mainly optimize for natural language generation (NLG) metrics that focus on language fluency, while providing little control over clinically important factors such as precision and recall. As a consequence, generated reports may be fluent but not well aligned with different clinical needs. To address this challenge, we propose a reinforcement learning framework for precision recall controllable RRG, where a control parameter explicitly adjusts the trade-off between clinical precision and recall during inference. This design allows the model to flexibly generate reports according to different clinical requirements. To ensure clinical correctness, we introduce a clinical reward into the training objective, which helps improve clinical efficacy (CE) beyond standard language-based optimization. In addition, we apply a group-relative training strategy that normalizes rewards within each training group, reducing reward variance and improving training stability. Extensive experiments on the MIMIC-CXR dataset show that our method consistently outperforms state-of-the-art approaches in both NLG and CE evaluation metrics, while providing reliable control over the CE precision recall trade-off.
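
The group-relative step, normalizing rewards within each training group, can be illustrated with a short sketch. The abstract does not give the formula, so subtracting the group mean and dividing by the group standard deviation is an assumption here (it is the form commonly used in group-relative policy optimization); the function name and reward values are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of sampled reports.

    ASSUMPTION: normalization is (reward - group mean) / (group std + eps);
    the paper states only that rewards are normalized within each group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical combined NLG + clinical rewards for three reports sampled
# for the same chest X-ray.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Since each report is scored relative to its siblings for the same image, an image that is easy or hard for every sample contributes no net signal, which is one way such normalization reduces reward variance.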


[17] Dissecting Agentic RAG: A Component Ablation for Multi-Hop QA with a Local 7B Model cs.CL | cs.IR

Sheroz Shaikh

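TL;DR: Through ablation experiments, this paper dissects the component contributions of an agentic retrieval-augmented generation (Agentic RAG) system built on a local 7B-parameter model for multi-hop question answering. It finds that, under resource constraints, simple fixed hybrid retrieval and a short retrieval loop (two iterations) capture most of the gains, while adaptive routing and deeper loops add little.

Details

Motivation: To examine the actual contribution of each component of an agentic RAG system (iterative reasoning loops, query decomposition, adaptive retrieval), particularly in resource-constrained settings that use only local language models, and to test whether the added complexity really helps.

Result: Evaluated on 5,000 questions from the HotpotQA distractor development set, the full agentic RAG pipeline (using Qwen2.5-7B-Instruct) reaches EM=53.2% and F1=61.6%, outperforming the single-pass dense-retrieval baseline (EM=43.1%, F1=54.0%). Ablations show that fixed hybrid retrieval outperforms rule-based adaptive routing, that two retrieval iterations capture 95% of the gains of five, and that query decomposition and cross-encoder reranking each contribute statistically significant but smaller gains.

Insight: On a fixed local-model budget, simple fixed design choices (such as fixed hybrid retrieval and a short retrieval loop) can be competitive with or better than complex adaptive versions. Most of the gain comes from a short retrieval loop rather than from adaptive routing or many iterations, which gives practical guidance for designing efficient RAG systems under resource constraints.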

Abstract: Agentic retrieval-augmented generation (RAG) systems combine iterative reasoning loops, query decomposition, and adaptive retrieval to tackle multi-hop question answering. However, the contribution of each component remains poorly understood, particularly under resource-constrained settings using only local language models. Many agentic designs add adaptive retrieval routing and deeper retrieval loops on the assumption that the added complexity helps. To test whether it does, we run a controlled ablation study of a full agentic RAG pipeline evaluated on 5,000 questions from the HotpotQA distractor development set using a local 7B parameter model (Qwen2.5-7B-Instruct). Our full pipeline achieves EM=53.2% and F1=61.6%, compared to a single-pass dense-retrieval baseline of EM=43.1% and F1=54.0%. Across eight ablation conditions, we find that: (1) fixed hybrid retrieval via reciprocal rank fusion consistently outperforms rule-based adaptive routing (+1.8 EM, +1.9 F1), as the routing heuristic over-routes to BM25 by firing on named entities present in nearly all multi-hop sub-questions; (2) two retrieval iterations over the decomposed sub-questions capture 95% of the gains of five, with no meaningful benefit from deeper loops; and (3) query decomposition and cross-encoder reranking each contribute statistically significant but smaller gains (p<0.01 and p<0.001 respectively). Taken together, on a fixed local-model budget, the simpler and fixed choices turn out to be competitive with or better than their adaptive versions: most of the gain comes from running a short retrieval loop, not from adaptive routing or from many iterations. We use no proprietary APIs or large-scale compute.
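
The fixed hybrid retrieval in finding (1) relies on reciprocal rank fusion. A minimal sketch of that fusion step follows; the constant `k=60` is the value commonly used for RRF and is an assumption here, since the abstract does not state the paper's setting, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a dense retriever and BM25 for one sub-question.
dense = ["d1", "d2", "d3"]
bm25 = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense, bm25])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Fusing ranks rather than raw scores avoids having to calibrate BM25 scores against dense similarity scores, and it needs no routing decision per query.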


[18] LLM and Human Modes of Representation cs.CL

Shalom Lappin

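TL;DR: This paper examines similarities and differences between how large language models (LLMs) and humans process and represent information, focusing on two domains: the representation of linguistic knowledge, and real-world reasoning and planning. It finds that although LLMs perform impressively on linguistic applications, they process content differently from humans and are generally less efficient than humans at learning and generalizing on reasoning tasks.

Details

Motivation: To compare the information-processing mechanisms of LLMs and humans on cognitive tasks, assessing whether LLMs can match or surpass human performance and exploring where the two systems converge and diverge.

Result: On linguistic applications, LLMs often achieve impressive fluency, but on reasoning tasks they are, for the most part, less efficient than humans at learning and generalization.

Insight: The paper highlights fundamental differences between LLMs and humans in language processing and reasoning, emphasizing LLMs' strength in linguistic applications and their weaker reasoning efficiency, and offers a useful reference for the design of future AI systems.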

Abstract: Much work on the cognitive foundations of AI has focussed on comparisons between the ways in which Large Language Models (LLMs) and humans process information and represent it. One aspect of this comparison involves determining the extent to which LLMs can achieve or surpass human performance on a variety of cognitively interesting tasks. A second explores points of convergence and divergence between LLM and human systems for processing information. Here, I consider some recent research that has addressed both issues in two informational domains. The first is the representation of linguistic knowledge. The second is real world reasoning and planning. While LLMs frequently achieve impressive levels of performance and fluency on linguistic applications, they tend to handle linguistic content in ways that are distinct from human processing. They are also, for the most part, less efficient than humans in learning and generalisation for reasoning tasks.


[19] CulMind: Benchmarking Multimodal Understanding and Reasoning in Chinese Cultural Heritage cs.CL

Zhangwei Cao, Shuhan Fan, Yuting Wei, Jiajun Zhang, Yihang Peng

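TL;DR: The paper introduces the CulMind and CulMind-R benchmarks for evaluating the fine-grained understanding and reasoning abilities of multimodal large language models (MLLMs) in Chinese cultural heritage. CulMind covers 50 tasks drawn from more than 100 museums, and CulMind-R is its 24-task reasoning subset. The paper also proposes ReaScore, a metric that assesses reasoning quality by automatically weighting task-relevant dimensions. Experiments on 14 leading MLLMs reveal a substantial gap between answer accuracy and reasoning quality.

Details

Motivation: Existing Chinese cultural heritage benchmarks focus mainly on final-answer accuracy, while the accuracy and completeness of the reasoning process remain underexplored. Filling this gap requires a new benchmark and metric that evaluate multimodal understanding and reasoning in this domain more comprehensively.

Result: Experiments on 14 leading MLLMs show a substantial gap between answer accuracy and reasoning quality, especially on challenging tasks. Further analysis shows that task-adaptive dimension selection and weighting align evaluation results better with expert judgments.

Insight: The contributions are a high-quality, fine-grained benchmark for multimodal understanding and reasoning in Chinese cultural heritage, and ReaScore, a task-adaptive metric that evaluates the reasoning process more accurately by automatically weighting task-relevant dimensions, offering a transferable reference framework for cultural heritage evaluation.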

Abstract: Evaluating Multimodal Large Language Models (MLLMs) in Chinese Cultural Heritage (CCH) requires fine-grained reasoning over visual, textual, stylistic, and historical clues. However, existing CCH benchmarks mainly emphasize final-answer accuracy, while the accuracy and completeness of reasoning processes remain underexplored. To address this gap, we introduce CulMind and CulMind-R: a high-quality benchmark for multimodal CCH covering 50 tasks from collections of more than 100 museums, and a 24-task reasoning subset that adaptively defines task-specific dimensions for reasoning process evaluation. To evaluate reasoning quality, we propose ReaScore, a task-adaptive metric that evaluates reasoning by automatically weighting task-relevant dimensions. Experiments on 14 leading MLLMs reveal a substantial gap between answers and reasoning, especially on challenging tasks. Further analysis shows that task-adaptive dimension selection and weighting better align evaluation results with expert judgments. Overall, our benchmark and metric support a more expert-aligned assessment of CCH understanding and offer a transferable reference for broader evaluations of cultural heritage. We publicly release the data, code, and evaluation scripts at https://github.com/ZevTsao/CulMind to facilitate reproducible research.


[20] TACO: Task-Aware Column Description Generation Using LLMs cs.CL | cs.AI | cs.DB

Ting Cai, Rakesh R. Menon, Yiru Chen, Zifan Liu, Yuan Tian

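TL;DR: TACO is a task-aware framework that uses large language models (LLMs) to automatically generate column descriptions for tabular data. Its three-step pipeline (abbreviation expansion, description generation, and description revision) addresses the weaknesses of existing single-prompt LLM approaches in abbreviation handling, description completeness, and redundancy, and improves downstream tasks such as entity linking and schema enrichment.

Details

Motivation: Real-world datasets often lack clear documentation because of abbreviated column names or domain-specific jargon, which hurts downstream NLP tasks such as NL2SQL, table question answering, and entity linking. Existing single-prompt LLM approaches handle abbreviations inconsistently and produce incomplete or redundant descriptions.

Result: Extensive experiments on public and proprietary datasets show that TACO consistently outperforms existing methods, improving downstream task performance by up to 32%. The authors also release new evaluation datasets for entity linking and schema enrichment.

Insight: Contributions include the task-aware three-step pipeline (abbreviation expansion, generation, and revision) and a mechanism that revises descriptions using simulated downstream tasks. Viewed objectively, the work combines structured data processing with LLMs and improves the usefulness and accuracy of descriptions through task-oriented iterative refinement.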

Abstract: Generating accurate and informative column descriptions (e.g. “membership status of customers” for the column name “cust_mem”) is essential for a wide range of downstream NLP tasks on tabular data, including NL2SQL, table question answering, and entity linking. This problem arises in enterprises, domain sciences, government data portals, and so on. Despite its importance, most real-world datasets suffer from missing or cryptic documentation, often due to abbreviated column names or domain-specific jargon. Existing approaches largely rely on single-prompt large language models (LLMs), which struggle with three key issues: (i) inconsistent or incorrect handling of abbreviations, (ii) hallucinated or incomplete descriptions, and (iii) redundancy or vagueness that hinders downstream performance. We present TACO, a task-aware framework for automatic column description generation using LLMs. TACO introduces a three-step pipeline: (1) abbreviation expansion, which standardizes column names; (2) description generation, which produces initial semantic descriptions enriched with synonyms and search-oriented keywords; and (3) description revision, which refines these outputs using simulated downstream tasks. In addition, we investigate human-in-the-loop extensions and release new evaluation datasets for entity linking and schema enrichment. Extensive experiments across public and proprietary datasets show that TACO consistently outperforms existing methods, improving downstream task performance by up to 32%.


[21] When Compression Helps and When It Hurts: Condition-Aware Analysis of Chain-of-Thought Distillation cs.CL

Siyang Lyu, Zhijing Sun, Xinghao Chen, Tong Liu, Dawei Zhu

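TL;DR: This paper systematically studies compression methods for chain-of-thought (CoT) distillation. It decouples three dimensions (importance criterion, restructuring level, and compression budget) and runs comprehensive experiments across model families, Math and General domains, and Long-/Short-CoT regimes. It finds that the usefulness of an importance criterion is strictly governed by granularity, that the effect of restructuring level reverses across domains, and that training-time compression does not always translate into inference-time savings.

Details

Motivation: In existing CoT compression methods (selective pruning and generative rewriting), key factors such as granularity, importance criterion, and restructuring level are entangled, and compression budgets have not been evaluated systematically across domains or regimes, leaving the effect of compression unclear.

Result: Experiments across Math and General domains, Long-/Short-CoT regimes, and two model families show that: step-level importance criteria converge on a shared reasoning backbone, while symbol-aware signals are essential for token-level pruning; Math performance degrades monotonically with structural disruption, while aggressive rewriting can act as a denoiser on General tasks; and Long-CoT students retain verbose habits despite concise supervision, making the training compression ratio an optimistic lower bound on deployment cost.

Insight: The contribution is to decouple CoT compression into three independently analyzable dimensions and to show that compression effects depend heavily on conditions (domain, CoT length, and granularity), yielding condition-aware guidelines for matching the compression method to the deployment context.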

Abstract: Chain-of-Thought (CoT) distillation transfers multi-step reasoning from large reasoning models to smaller students, but verbose teacher traces inflate both training and inference cost. Existing CoT compression methods fall into two families, selective pruning and generative rewriting, yet prior studies have left key factors entangled: granularity is confounded with importance criteria in pruning, restructuring level is rarely isolated in rewriting, and compression budgets are not systematically evaluated across domains or regimes. We recast CoT compression along three dimensions: importance criterion, restructuring level, and compression budget. Sweeping these across two model families, Math and General domains, and Long-/Short-CoT regimes, we find that (i) importance criterion utility is strictly governed by granularity: step-level criteria converge on a shared reasoning backbone, while token-level pruning requires symbol-aware signals to preserve the logical core; (ii) restructuring level inverts across domains: Math degrades monotonically with structural disruption, while aggressive rewriting acts as a denoiser on General tasks; (iii) training-time compression does not necessarily translate to inference-time savings: Long-CoT students retain verbose habits despite concise supervision, making the training ratio an optimistic lower bound on deployment cost. These findings yield condition-aware guidelines for matching compression to deployment context.


[22] Denoising Iterative Self-Correction: Structured Verification Loops for Reliable LLM Reasoning cs.CL | cs.AI

Shen Yin, David Ken, Joel Stremmel

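TL;DR: This paper proposes Denoising Iterative Self-Correction (DISC), a test-time method for improving the reliability of large language models in multi-step reasoning. It treats verification-question outputs as noisy measurements of where a solution may be corrupted and progressively reduces errors over multiple verify-judge-correct passes, analogous to traditional iterative denoising.

Details

Motivation: Large language models produce fluent but often incorrect multi-step reasoning, and naive correction methods can damage answers that were already correct. A reliable method is therefore needed that controls correction precisely and avoids degradation.

Result: Across three benchmarks (BIG-Bench Mistake, HotpotQA, GPQA Diamond) and four models, DISC outperforms Chain-of-Verification and Self-Refine on the precision-recall trade-off. For example, on BIG-Bench Mistake (Sonnet 4.5), DISC reaches 81.6% accuracy, with an improvement-to-degradation ratio 13x that of Chain-of-Verification and 5x that of Self-Refine.

Insight: Contributions include: (1) a binary judgment gate that controls correction precision by blocking rewrites that would damage correct answers; (2) two paired diagnostics, an improvement-to-degradation ratio (precision) and a repair rate (recall), for evaluating the trade-off; and (3) evidence that cross-model role allocation (assigning verification and judgment to a model different from the generator) mitigates self-confirmation bias.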

Abstract: Large language models produce fluent but often incorrect multi-step reasoning, and naive correction methods risk degrading already-correct answers. We introduce Denoising Iterative Self-Correction (DISC), a test-time procedure that treats verification question outputs as noisy measurements of where a solution may be corrupted. Using these signals, DISC progressively reduces errors across multiple verify-judge-correct passes, analogous to traditional iterative denoising. A binary judgment gate controls correction precision by blocking rewrites that would damage already-correct answers while the verifier and corrector together repair errors. We evaluate this trade-off using two paired diagnostics: an improvement-to-degradation ratio (precision) and a repair rate (recall). Across three benchmarks (BIG-Bench Mistake, HotpotQA, GPQA Diamond) and four models, DISC dominates Chain-of-Verification and Self-Refine on the precision-recall trade-off, reaching 81.6% accuracy with 13x more improvements per degradation than Chain-of-Verification and 5x more than Self-Refine on BIG-Bench Mistake (Sonnet 4.5). On GPQA Diamond, we identify a capability floor below which judges acknowledge contradictions in evidence but cannot translate that recognition into a correction. We further show that cross-model role allocation – assigning verification and judgment to a model different from the generator – mitigates self-confirmation bias.


[23] CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks cs.CL | cs.AI

Ashwin Vinod, Ying Ding, Elias Stengel-Eskin

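TL;DR: The paper proposes CalVerT, a method that strengthens the state awareness of LLM agents in knowledge-intensive question answering by supplying calibrated verifier telemetry: a calibrated self-confidence score and a grounding verifier score. It aims to reduce two failure modes caused by incomplete knowledge: confident but unsupported answers (which hurt accuracy) and over-retrieval (which wastes compute).

Details

Motivation: To address two failure modes of LLM agents in knowledge-intensive question answering that arise because agents do not know whether their current answer is uncertain, unsupported, or already complete: confident but unsupported answers (lowering accuracy) and over-retrieval (wasting compute).

Result: On four QA benchmarks, CalVerT raises F1 by triggering retrieval when agents over-rely on parametric knowledge and by cutting redundant retrieval when sufficient context is already available. It augments existing QA frameworks without training and, after reinforcement learning, also yields gains over an identically trained agent without CalVerT telemetry.

Insight: The core idea is calibrated verifier telemetry (CalVerT), which adds self-assessment and evidence-verification signals to the agent's state so that it has a more complete picture of the state space it operates in. This offers a general and effective way to improve agent decisions (such as when to trigger retrieval) and learning, applicable both in training-free settings and to trained systems.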

Abstract: LLM agents in knowledge-intensive question answering take retrieval and reasoning actions with incomplete knowledge about whether their current answer is uncertain, unsupported, or already complete. This produces two failure modes: committing to confident but unsupported answers, which hurts accuracy, and over-retrieving when the evidence in hand already suffices, resulting in wasted compute. To give agents a more complete picture of the state space they are operating in, we introduce calibrated verifier telemetry (CalVerT), which augments the agent’s state with additional telemetry: a calibrated self-confidence score and a grounding verifier score. We show that CalVerT can improve agents in both training-free and training-based settings. On four QA benchmarks, we find that CalVerT raises F1 by triggering retrieval in cases where agents over-rely on parametric knowledge, while cutting redundant retrieval in cases where agents have sufficient context to answer. We show that CalVerT can augment existing QA frameworks without training. Moreover, CalVerT also improves trained systems: by simply augmenting an agent’s state with telemetry, we observe improvements after reinforcement learning, as compared to an agent with identical training but no CalVerT telemetry.


[24] Keyless Attention: Value-Space Routing and Value-Only Caching for Efficient Transformers cs.CL | cs.AI

Xin Gao

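TL;DR: This paper proposes Keyless Attention, an attention mechanism that eliminates the key projection entirely and computes attention from queries and values only. The resulting Value-Only Cache reduces KV cache memory and access overhead by exactly 50%, while decode throughput matches or exceeds that of standard attention. The paper also introduces Depth-$m$ Attention Factorization, in which Keyless Attention realizes the depth-3 instance, replacing the key projection with a value-space routing matrix and introducing a coupling between routing and retrieval.

Details

Motivation: To address the memory and access overhead of the KV cache in standard Transformer attention by designing a more efficient attention variant.

Result: Experiments across five models and four architectures (GPT-2 280M, GPT-2 557M, Pythia 410M, Qwen2 1.5B, Llama 3.2 1B) show that Keyless Attention matches or outperforms standard QKV attention on perplexity in 4 of 5 models. On downstream zero-shot evaluation (GPT-2 557M), it performs better on 4 of 5 commonsense reasoning benchmarks, while achieving a 50% KV cache reduction throughout.

Insight: The core contribution is removing the key projection entirely and computing attention from queries and values alone, which yields the Value-Only Cache and substantial KV cache savings. A second contribution is the Depth-$m$ Attention Factorization framework: as the depth-3 instance, Keyless Attention introduces a value-space routing matrix that couples routing and retrieval while keeping the number of projection matrices unchanged.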

Abstract: We propose Keyless Attention, an attention mechanism that eliminates the key projection entirely, operating over queries and values only. This yields a Value-Only Cache that reduces KV cache memory and access overhead by exactly 50% over standard attention, while matching or exceeding standard attention’s decode throughput. Beyond efficiency, we introduce Depth-$m$ Attention Factorization: standard attention computes a depth-2 factorization of the attention bilinear form, while Keyless Attention realizes a depth-$m$ instance of this family. At $m=3$, Keyless Attention matches the projection matrix count of standard attention via a value-space routing matrix that replaces the key projection and introduces a coupling between routing and retrieval. Experiments across five models and four architectures (GPT-2 280M, GPT-2 557M, Pythia 410M, Qwen2 1.5B, and Llama 3.2 1B) show that Keyless Attention matches or outperforms standard QKV attention on perplexity in 4 out of 5 models. On downstream zero-shot evaluation (GPT-2 557M), Keyless Attention outperforms on 4 out of 5 commonsense reasoning benchmarks, while achieving 50% KV cache reduction throughout.
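
The "exactly 50%" figure follows from simple arithmetic: a standard cache stores two tensors per token (keys and values), and a value-only cache stores one. The sketch below works this through for a single sequence. The model configuration is hypothetical and not taken from the paper, and the formula assumes keys and values have the same shape and precision.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value, cached_tensors):
    """Cache size in bytes for one sequence.

    cached_tensors is 2 for a standard key+value cache and 1 for a
    value-only cache. Assumes keys and values share shape and precision.
    """
    per_token = num_layers * num_kv_heads * head_dim * bytes_per_value
    return cached_tensors * per_token * seq_len


# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=24, num_kv_heads=16, head_dim=64,
              seq_len=2048, bytes_per_value=2)  # 2 bytes per value = fp16

standard = kv_cache_bytes(cached_tensors=2, **config)
value_only = kv_cache_bytes(cached_tensors=1, **config)
print(standard / 2**20, value_only / 2**20)  # 192.0 96.0 (MiB)
```

The ratio is one half for any configuration, since every other factor in the product is shared between the two cases.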


[25] Which Review Aspect Has a Greater Impact on the Duration of Open Peer Review in Multiple Rounds? – Evidence from Nature Communications cs.CL | cs.DL | cs.HC | cs.IR

Haomin Zhou, Ruxue Han, Jiangtao Zhong, Chengzhi Zhang

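TL;DR: This study examines the relationship between aspect-specific sentiment in peer review reports and review duration, and analyzes how that relationship varies across disciplines and review rounds. It finds a weak negative correlation between review sentiment and review duration, with sentiment about "Evaluation and Results" and "Impact and Research Value" correlating more strongly with duration.

Details

Motivation: As submission volumes grow, peer review places increasing pressure on reviewers and editors. The study aims to support targeted manuscript revision and improve review efficiency by analyzing how the content of review reports relates to review duration.

Result: Review sentiment has a weak but statistically significant negative correlation with review duration: more positive reviews tend to be associated with shorter review periods. On Nature Communications data, sentiment about "Evaluation and Results" and "Impact and Research Value" correlates relatively more strongly with review duration, and the relationship differs significantly across review rounds.

Insight: The contribution is to connect the textual content of peer review reports with the temporal characteristics of the review process, using fine-grained aspect-level sentiment analysis to identify the aspects most closely associated with review duration. This provides empirical evidence that may help authors prioritize revisions and help reviewers and editors work more efficiently, easing the burden of peer review and accelerating scholarly communication.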

Abstract: Purpose: Peer review is essential to scientific publishing, but increasing submission volumes have placed growing pressure on reviewers and editors. This study examines the relationship between sentiment toward specific review aspects and peer review duration. It also investigates how this relationship varies across disciplines and review rounds, with the aim of supporting targeted manuscript revision and improving review efficiency. Design/methodology/approach: We adopt a two-stage approach. First, fine-grained aspects are extracted from peer review reports, and a sentiment classification model is used to determine the sentiment associated with each aspect. Second, correlations between aspect-level sentiment and peer review duration are analyzed. Sentiment scores are also calculated for different review rounds to determine whether these relationships change over successive rounds. Findings: Review sentiment has a weak but statistically significant negative correlation with peer review duration, indicating that more positive reviews tend to be associated with shorter review periods. Aspects concerning Evaluation and Results and Impact and Research Value show relatively stronger correlations with review duration. The relationships between aspect-level sentiment and review duration also differ significantly across review rounds. Originality/value: This study connects the textual content of peer review reports with the temporal characteristics of the review process. By identifying review aspects that are more closely associated with review duration, it provides evidence that may help authors prioritize revisions and assist reviewers and editors in improving review efficiency. The findings contribute to reducing the burden of peer review and accelerating scholarly communication and knowledge dissemination.


[26] Pre-Generation Hallucination Detection in Large Language Models via Soft-Target Attention Probing cs.CL | cs.LG

Amina Miftakhova, Alexey Zaytsev

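TL;DR: The paper proposes a method for detecting hallucination risk in large language models before generation, using soft-target supervision and attention probing, and validates it on three question-answering benchmarks and five models.

Details

Motivation: Existing methods treat hallucination detection as binary classification over a single decoded output. The paper instead formulates it as a risk-estimation problem, so that risk can be estimated before generation to support abstention, retrieval augmentation, and routing decisions.

Result: Across three question-answering benchmarks and five models, attention probing outperforms linear probing on short-answer tasks, and soft-target supervision further and consistently improves detection quality.

Insight: The contribution is to recast hallucination detection as risk estimation and to introduce soft-target supervision based on the empirical error rate over stochastically sampled outputs, which the authors prove is the unique unbiased minimum-variance estimator of the per-prompt error probability. Attention probing is also adapted to the pre-generation setting to selectively aggregate hallucination-relevant prompt representations.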

Abstract: Detecting hallucination risk before generation enables abstention, retrieval augmentation, and routing decisions without incurring the cost of decoding. While prior work has shown that such risk can be estimated from a model’s internal representations, existing approaches treat this as binary classification over a single decoded output. We instead formulate it as a risk-estimation problem. Under this formulation, we introduce soft-target supervision based on the empirical answer error rate over stochastically sampled outputs - an estimator we prove to be the unique unbiased minimum-variance estimator of the model’s per-prompt error probability under its sampling distribution. We further adapt attention probing to the pre-generation setting, enabling the detector to selectively aggregate hallucination-relevant prompt representations. Across three question-answering benchmarks and five models, attention probing outperforms linear probing on short-answer tasks. Replacing binary labels with soft-target supervision further and consistently improves detection quality.
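
The soft target described above is the fraction of sampled answers that are wrong for a given prompt. A minimal sketch of computing it, and of scoring a probe's prediction against it, is shown below. The binary cross-entropy loss is an assumption (the abstract does not name the training loss), and the function names are hypothetical.

```python
import math


def soft_target(sampled_correct):
    """Empirical error rate over sampled outputs for one prompt.

    sampled_correct is a list of booleans, one per sampled answer.
    """
    if not sampled_correct:
        raise ValueError("need at least one sampled output")
    wrong = sum(1 for ok in sampled_correct if not ok)
    return wrong / len(sampled_correct)


def soft_bce(predicted_risk, target, eps=1e-12):
    """Binary cross-entropy against a soft target in [0, 1].

    Assumed loss for illustration; the paper's objective may differ.
    """
    p = min(max(predicted_risk, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Four hypothetical samples for one prompt, two of them wrong.
target = soft_target([True, False, False, True])  # 0.5
loss = soft_bce(predicted_risk=0.5, target=target)
```

A binary label taken from a single decoded output would collapse this prompt to 0 or 1 depending on which sample happened to be drawn, whereas the soft target retains the estimated error probability.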


[27] Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding cs.CL

Xuanming Zhang, Sining Zhoubian, Yuxuan Chen, Tianyi Tang, An Yang

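TL;DR: This paper challenges the assumption in conventional autoregressive generation that decoding should use only the final layer. It reveals a Guess-Refine-Perturb dynamic in LLM decoding and proposes a training-free confident decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search.

Details

Motivation: Conventional LLM decoding assumes that deeper representations yield more reliable next-token predictions, but the authors find that the final layers can introduce alignment-preference perturbations that harm reasoning performance. The paper aims to mitigate this "alignment tax".

Result: On challenging reasoning benchmarks including GPQA-Diamond, Omni-MATH, and HLE, the method improves performance for both dense and MoE LLMs, with zero memory overhead and a latency increase of less than 2%.

Insight: The core contribution is to model layer selection as an optimal stopping problem, with a theoretical framework and a training-free entropy-guided search rule that dynamically bypasses final-layer perturbations and thereby unlocks stronger reasoning from aligned LLMs.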

Abstract: Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.


[28] Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts cs.CL

Jingting Zheng, Yuqi Ren, Linhao Yu, Yongqi Leng, Deyi Xiong

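TL;DR: This paper proposes a symmetric human-LLM evaluation framework, grounded in Q methodology, for measuring how well large language models align with humans in value structure. Humans and models force-sort 140 moral statements, and structural alignment is quantified with Procrustes similarity and an RSA-based correlation, revealing heterogeneity in models' value-priority structures that conventional item-level benchmarks cannot capture.

Details

Motivation: Existing evaluations rely mainly on item-level behavioral metrics, which cannot capture how a model structurally prioritizes competing values as a coherent system. A method that measures value-structure alignment is therefore needed.

Result: Evaluation of 12 LLMs shows significant heterogeneity across model families, sensitivity to generation stochasticity, and cases where high global scores mask localized misalignment. Rank-based and bucket-based analyses are highly consistent, but prompt phrasing introduces notable variance.

Insight: The contribution is to bring Q methodology to LLM value evaluation, with a symmetric human-LLM framework and structural alignment metrics (Procrustes similarity and RSA-based correlation) that provide a crucial structural complement to conventional moral benchmarks.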

Abstract: Large Language Models (LLMs) are increasingly deployed in contexts requiring complex moral reasoning and value trade-offs. However, existing evaluations typically rely on item-level behavioral metrics, which fail to capture how models structurally prioritize competing values as a cohesive system. To address this, we propose a symmetric human-LLM evaluation framework, grounded in Q methodology, to measure value-structure alignment. Under our protocol, humans and models sort an identical 140-item moral statement set into a shared nine-column forced distribution; for LLMs, we elicit strict rankings and deterministically map them to Q-sort buckets. Using a human reference sample ($N=35$), we establish a stable three-factor reference geometry specific to this instrument and sample. We evaluate 12 LLMs across four model families via 240 replicated Q-sorts at two temperature settings, quantifying structural alignment via Procrustes similarity ($φ$) and RSA-based Spearman correlation ($ρ$). Our results reveal significant cross-family heterogeneity, model-specific sensitivity to generation stochasticity and localized misalignment, which demonstrate that favorable global scores can obscure underlying regional distortions. While rank- and bucket-based analyses remain highly consistent, prompt phrasing introduces notable variance. Ultimately, assessing value-structure alignment provides a crucial structural complement to traditional itemwise moral benchmarks.
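
The step of deterministically mapping a strict ranking to Q-sort buckets can be sketched as follows. The nine column sizes are hypothetical: the abstract specifies 140 items and nine columns but not how many items each column holds, so a symmetric distribution summing to 140 is assumed here, and the function name is also an assumption.

```python
def ranking_to_qsort(ranking, column_sizes):
    """Map a strict ranking (most agreed first) to forced-distribution columns.

    Returns {item: column score}, where scores run from +h down to -h for
    an odd number of columns. Column sizes must sum to the number of items.
    """
    if len(set(ranking)) != len(ranking):
        raise ValueError("ranking must be strict (no repeated items)")
    if sum(column_sizes) != len(ranking):
        raise ValueError("column sizes must sum to the number of items")
    highest = (len(column_sizes) - 1) // 2  # 4 for nine columns
    buckets = {}
    position = 0
    for column_index, size in enumerate(column_sizes):
        score = highest - column_index
        # The next `size` items in rank order fill this column.
        for item in ranking[position:position + size]:
            buckets[item] = score
        position += size
    return buckets


# Hypothetical symmetric nine-column distribution over 140 statements.
COLUMN_SIZES = [6, 10, 16, 22, 32, 22, 16, 10, 6]
buckets = ranking_to_qsort(list(range(140)), COLUMN_SIZES)
```

Because the mapping depends only on rank position, two runs that produce the same ranking always yield the same Q-sort, which keeps the model's sort comparable with the human forced distribution.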


[29] Olfactory-Inspired Sparse Combinatorial Coding for Low-Resource Named Entity Recognition cs.CL

Bhushan Deshpande

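TL;DR: This paper proposes a sparse combinatorial coding architecture inspired by biological olfaction for named entity recognition in low-resource languages. It inserts a receptor-glomerular bottleneck layer between standard token embeddings and a BiLSTM-CRF sequence model, with the aim of learning robust representations from limited supervision. Experiments show that the architecture improves F1 under strict low-resource conditions (for example, only 1k sentences), with especially clear advantages in languages such as Bangla.

Details

Motivation: Named entity recognition in low-resource languages suffers from limited supervision and a lack of high-quality pretrained embeddings. The author draws on the way biological olfaction performs sparse combinatorial coding through receptor and glomerular organization, which offers a robust paradigm for representation learning under uncertainty.

Result: In evaluations on six multilingual datasets trained from scratch (without pretrained embeddings), at least one olfactory-inspired configuration achieves the highest mean F1 on all six datasets under the strict 1k-sentence low-resource control. The architecture leads clearly on Bangla (+6.23% F1 over the standard baseline and +8.47% F1 over the best control baseline), and also improves the ultra-low-resource Telugu setting at full scale (+4.43% F1).

Insight: The core contribution is a structured sparse coding architecture inspired by biological olfaction (the receptor-glomerular bottleneck), which acts as an effective inductive bias and regularizer, particularly when representations must be learned from limited or noisy supervision. Viewed objectively, it turns a neuroscience principle (sparse combinatorial coding) into a machine learning model component, offering a new architectural direction for low-resource NLP.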

Abstract: Named Entity Recognition (NER) in low-resource languages suffers from limited supervision and a lack of high-quality pretrained embeddings. Biological olfaction, which relies on sparse combinatorial coding through receptor and glomerular organization, offers a compelling paradigm for learning robust representations under uncertainty. In this paper, we introduce a receptor-glomerular bottleneck - a novel, biologically-inspired olfactory architecture - between standard token embeddings and a BiLSTM-CRF sequence model. We evaluate our architecture across six multilingual datasets trained entirely from scratch (without pre-trained embeddings) under varied data-scale conditions, including a strict 1k-sentence low-resource control. Our results demonstrate that introducing a representation bottleneck yields F1 score improvements under severe data scarcity, primarily by acting as a powerful regularizer. Under the 1k capped training condition, at least one olfactory-inspired configuration achieves the highest mean F1 score across all six datasets. While these improvements represent near-ties with generic bottleneck controls for most languages, the olfactory architecture provides a significant advantage in languages like Bangla (+6.23% F1 over standard baseline and +8.47% F1 over the best control baseline) where generic bottlenecks degrade performance. We also observe improvements in the ultra-low-resource Telugu setting (+4.43% F1) at full-scale, and find that sparse specialization naturally emerges within the receptor layer. Our findings suggest that structured sparse coding inspired by olfactory networks serves as an effective inductive bias and regularizer when representations must be learned from limited or noisy supervision.


[30] From Recognition to Understanding: Unlocking Cognitive Time Series Reasoning with LLMs cs.CL

Xin Qiu, Junlong Tong, Yao Zhang, Yunpu Ma, Wei Zhang

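TL;DR: To address the limitations of existing combinations of time series analysis and large language models, this paper proposes a new multimodal benchmark, TSCognition, and a unified framework, TSAlign. TSCognition contains approximately 41K QA samples covering five cognitive reasoning tasks and is designed to evaluate deep understanding of time series. TSAlign encodes time series into patch-level representations and aligns them with the LLM's semantic space via gated residual injection and multivariate fusion, improving reasoning performance while reducing computational cost.

Details

Motivation: Existing approaches reduce time series understanding to curve fitting and focus on low-level prediction, ignoring the semantic, contextual, and reasoning-intensive nature of real-world temporal decision-making. As a result, the reasoning and world-knowledge capabilities of LLMs are underused.

Result: Experiments show that TSAlign outperforms existing LLM, VLM, and time series QA baselines on both the proposed TSCognition benchmark and the public TimerBed benchmark, while substantially reducing computational cost.

Insight: The core contribution is to lift time series analysis from conventional recognition and prediction tasks to a level that requires semantic understanding and cognitive reasoning, with a dedicated multimodal benchmark to evaluate this ability. TSAlign's alignment mechanism (gated residual injection and multivariate fusion) effectively fuses the time series and language modalities, which is the key technical insight behind the performance gains.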

Abstract: Time series analysis has recently been coupled with Large Language Models (LLMs) to leverage their reasoning and world knowledge capabilities, yet gains remain limited. We attribute this to a fundamental mismatch between existing task formulations and LLM strengths: most settings reduce time series understanding to curve-fitting systems, focusing on low-level prediction while ignoring the semantic, contextual, and reasoning-intensive nature of real-world temporal decision-making. To address these limitations, we introduce TSCognition, a multimodal benchmark for multi-dimensional time series reasoning. It collects real-world time series and textual information from 15 public sources and constructs approximately 41K QA samples around five cognitive reasoning tasks: Decoding, Grounding, Inferring, Extrapolating, and Acting. Building on this, we further propose TSAlign, a unified framework that encodes time series into compact patch-level representations and aligns them with semantic directions in the LLM embedding space via gated residual injection and multivariate fusion. Experiments show that TSAlign outperforms existing LLM, VLM, and time series QA baselines on TSCognition and the publicly available TimerBed benchmark while substantially reducing computational cost. Code is available at: https://github.com/EIT-NLP/CognitiveTSR


[31] BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language cs.CL | cs.AI | cs.LG | q-bio.BM

Qizhi Pei, Zhimeng Zhou, Yi Duan, Yiyang Zhao, Wei Li

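TL;DR: BioMatrix is the first multimodal foundation model to natively integrate sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. A unified tokenization scheme maps all modalities into a shared discrete token space, and the model is trained and generates under a single next-token prediction objective. Across downstream applications covering 80 tasks in 6 categories, it achieves SOTA or competitive performance on 77 tasks.

Details

Motivation: Existing biological foundation models pursue native multimodality and broad entity coverage separately: they either fuse multiple modalities but remain confined to a single entity type, or cover multiple entity types but lack explicit structural modeling or rely on adapter-based designs that cannot natively generate the modalities they read.

Result: After fine-tuning on 80 downstream tasks across 6 categories, covering single-entity and multi-entity, cross-modal and within-modality understanding and generation, BioMatrix achieves state-of-the-art (SOTA) or competitive performance on 77 tasks.

Insight: The core contribution is a unified tokenization scheme that natively maps molecular sequences (SMILES/SELFIES), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space, so that a single decoder-only architecture consumes and generates all modalities uniformly, without external encoders, projection adapters, or modality-specific output heads.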

Abstract: We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (supporting both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective – without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories – encompassing single-entity and multi-entity understanding and generation tasks across and within modalities – BioMatrix achieves state-of-the-art or competitive performance on 77 out of 80 tasks, demonstrating that a single, natively multimodal generalist model can effectively match or surpass specialized approaches across a wide range of biological tasks.


[32] From Speech to Text Corpora: Evaluating ASR-Based Data Acquisition for Low-Resource Fongbe and Hausa cs.CL | cs.AI | cs.LG

Mahounan Pericles Adjovi, Victor Olufemi, Roald Eiselen, Prasenjit Mitra

TL;DR: 本文研究了利用自动语音识别(ASR)技术为低资源非洲语言(丰语和豪萨语)构建文本语料库的可行性。通过微调MMS-300M模型在丰语上显著降低了词错误率,并利用Whisper模型处理豪萨语,从YouTube视频中转录了大量语音数据,并进行了人工质量评估。

Details

Motivation: 低资源的非洲语言缺乏用于语言模型训练的文本语料库,研究旨在探索ASR技术能否有效扩展两种西非语言(丰语和豪萨语)的文本资源。

Result: 在ALFFA基准测试中,微调后的MMS-300M模型将丰语的词错误率从44.04%降至9.48%(相对降低78%),并保留了关键的音调符号。豪萨语转录质量平均得分为57.4/100,接近可接受水平,而丰语为36.5/100,需要后处理。

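Insight: The contribution lies in fine-tuning an ASR model for a tone-rich low-resource language with special characters (Fongbe), which improves recognition accuracy while preserving the language's distinctive features. The systematic collection and evaluation of YouTube video data also provides a reproducible pipeline and resources for low-resource corpus construction.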

Abstract: Low-resource African languages lack text corpora needed for language model training. We investigate whether ASR pipelines can extend text resources for two typologically distinct West African languages: Fongbe (tonal, diacritic-rich) and Hausa (non-tonal). We fine-tune MMS-300M on a curated 12.3-hour Fongbe dataset, achieving 9.48% WER on the ALFFA benchmark - a 78% relative reduction from the prior 44.04% baseline - while preserving tonal diacritics critical to the language. For Hausa, we apply an existing fine-tuned Whisper-Small model. We catalog 1,553 YouTube videos (236 hours) and process a subset of 424 videos (45.49 hours) selected to balance domain diversity with available computational resources, producing 6,770 transcribed segments. Human evaluation on 50 randomly sampled segments per language shows mean quality scores of 57.4/100 for Hausa and 36.5/100 for Fongbe, indicating that while Hausa transcriptions approach acceptable quality for corpus construction, Fongbe transcriptions require post-processing or improved models for production use. We release the curated dataset, fine-tuned model, transcribed corpus, and full video catalog following platform terms and ethical guidelines.


[33] Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning cs.CL

Zicheng Xu, Ruixuan Zhang, Yu-Neng Chuang, Xiuyi Lou, Hoang Anh Duy Le

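TL;DR: This paper proposes Adaptive Data Scheduling (ADS), a dual-level data scheduling framework for reinforcement learning (RL) post-training of large language models (LLMs). It replaces conventional uniform sampling with semantic clustering and policy-boundary sample selection, so that training adapts to both the semantic structure of the data and the changing capability of the policy. Experiments show that ADS improves performance across multiple LLMs and reasoning benchmarks.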

Details

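Motivation: Existing RL post-training for LLMs typically samples data uniformly, which ignores the semantic structure of the training data and the changing capability of the training policy, and therefore limits learning efficiency.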

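Result: Across three LLMs and seven reasoning benchmarks, ADS improves average accuracy by 5.2% over Group Relative Policy Optimization (GRPO) and consistently improves RL methods with different objective designs, indicating its potential as a general data scheduling strategy.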

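Insight: The contribution is an adaptive data scheduling framework that combines semantic clustering (cluster-level, macro scheduling) with policy-boundary sample selection (sample-level, micro scheduling). By concentrating on samples that yield informative relative advantages, it guides RL training more effectively and offers a general data scheduling strategy for LLM RL post-training that does not depend on a specific objective design.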

Abstract: Large Language Models (LLMs) achieve remarkable reasoning capabilities through reinforcement learning (RL) post-training. However, existing RL post-training commonly relies on uniform data sampling, which ignores the semantic structure of the training data and the changing capability of the training policy. To address these limitations, we propose Adaptive Data Scheduling (ADS), a dual-level data scheduling framework for pacing RL post-training that replaces uniform sampling with an adaptive distribution over semantic clusters and policy-boundary sample selection. At the cluster level, ADS organizes samples according to semantic patterns and maintains an adaptive inter-cluster distribution to solidify current training progress. At the sample level, ADS performs intra-cluster scheduling to continuously sample policy-boundary samples, which provides informative relative advantages. Experimental results across three LLMs and seven reasoning benchmarks demonstrate that ADS improves average accuracy by 5.2% over Group Relative Policy Optimization (GRPO). Notably, ADS consistently improves RL methods with different objective designs, highlighting its potential as a general data scheduling strategy for LLM RL post-training. The source code is available at: https://github.com/Richard-zrx/ADS.
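To make "informative relative advantages" concrete, here is a minimal sketch of the standard group-normalized advantage used in GRPO-style training. It is not the ADS implementation: the function name, the toy reward groups, and the reading of "policy-boundary" as "prompts with mixed outcomes" are illustrative assumptions. The point it shows is that a prompt whose rollouts all succeed or all fail yields zero advantages, so it carries no learning signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward groups: 1 = verified correct, 0 = incorrect.
too_easy = [1, 1, 1, 1]   # every rollout succeeds
too_hard = [0, 0, 0, 0]   # every rollout fails
boundary = [1, 0, 1, 0]   # mixed outcomes (assumed "policy-boundary" case)

for name, group in [("easy", too_easy), ("hard", too_hard), ("boundary", boundary)]:
    print(name, [round(a, 3) for a in group_relative_advantages(group)])
```

Only the mixed-outcome group produces non-zero advantages, which is one plausible reason a scheduler would prefer such samples over uniformly drawn ones.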


[34] How Does Research Evolve? Tracing Cross-Domain Trajectories in NLP, ML, and CV with Claim-Grounded Typed Citations cs.CL

Abdul Muntakim, Md Abdullah Al Hafiz Khan, Sadid Hasan, Yong Pei

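TL;DR: This paper introduces the SciTraj corpus, the first claim-grounded typed citation graph. Each citation edge is linked to the specific claim sentence that motivates it, and edges are divided into six relation types. The corpus covers 32,559 NLP, ML, and CV papers from 2015 to 2024, connected by 573,126 directed edges, and is used to analyze cross-domain research trajectories, topic emergence, and typed citation flow.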

Details

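Motivation: Existing citation graphs usually collapse citation relations into a single homogeneous edge type, which limits the analysis of scientific progress. The paper builds a fine-grained, claim-driven citation graph to reveal how research evolves by extending methods, addressing limitations, realizing proposed directions, or disputing earlier claims.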

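Result: SciTraj contains 287 million typed trajectories of length ≥3, which cover 72.8% of papers, and it supports a temporally split typed link-prediction benchmark. A year-shuffle falsifiability test separates temporal structure from year-correlated content, and a 3-annotator pilot reports κ = 0.74 with 79.9% precision.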

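Insight: The contribution is to refine citation relations into NLI-verified claim-driven relations (extending, addressing, realizing, disputing) plus similarity-only relations, yielding a structured corpus in which cross-domain research trajectories can be traced. This provides fine-grained data for forecasting scientific evolution and reveals topic emergence concentrated in Vision and LLM-related work.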

Abstract: How does research evolve, and what substrate would let us forecast where it goes next? Scientific progress is not simply a uniform accumulation of facts: ideas extend prior methods, address known limitations, realize proposed future directions, and sometimes dispute earlier claims. Existing citation graphs usually collapse these roles into a single homogeneous edge type, limiting how we can analyze scientific progress. We address this gap by proposing the SciTraj corpus, the first claim-grounded typed citation graph in which each edge is linked to the specific claim sentence that motivates it. Claim-bearing sentences are extracted from paper sections; four claim-driven relations are verified by NLI entailment against in-paper context, while two similarity-only relations are gated by abstract cosine and year-gap rules. SciTraj contains 32,559 papers from NLP, ML, and Vision (2015–2024), connected by 573,126 directed edges across six relation types, with NLI-verified claim seeds. Using SciTraj, we identify disciplinary siloing in typed citation flow and topic emergence concentrated in Vision and LLM-related work. The corpus also contains 287M typed trajectories of length $\geq 3$, covering 72.8% of papers, and supports a temporally split typed link-prediction benchmark. A year-shuffle falsifiability test separates temporal structure from year-correlated content, and a 3-annotator pilot reports $κ= 0.74$ with 79.9% precision.


[35] Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do cs.CL | cs.AI | cs.CV

Zhuoran Jin, Kejian Zhu, Hongbang Yuan, Yupu Hao, Pengfei Cao

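TL;DR: This paper systematically studies the capabilities and limits of multimodal chain-of-thought (CoT) reasoning on 12 perception and reasoning tasks. It finds that CoT is not a universal remedy and should be applied selectively by task type, that existing open-source multimodal reasoning models bring only limited overall gains, and that visual reasoning remains a key bottleneck.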

Details

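Motivation: The paper investigates when multimodal CoT reasoning is effective, where it applies, and why it fails, in order to clarify its actual role and limits in multimodal tasks.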

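Result: Evaluating 22 models on 12 multimodal tasks shows that CoT helps on mathematical, scientific, and multi-image reasoning tasks but can reduce performance on perception tasks such as visual grounding and object counting. Existing open-source reasoning models yield only marginal improvements over their original models.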

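Insight: The paper reveals a "Look Light, Think Heavy" pattern in multimodal CoT: verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes, indicating that current methods cannot sustain deep visual introspection. The study stresses that CoT should be applied with care according to task characteristics.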

Abstract: Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness in multimodal tasks remains unclear. In this paper, we aim to systematically investigate the key question: What can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? To this end, we evaluate 12 multimodal tasks across perception and reasoning categories using both 14 non-reasoning models and 8 reasoning models. Our analysis reveals several important findings: (1) CoT is not a free lunch and should be used selectively depending on the specific requirements of each task. For perception tasks, CoT can lead to undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective for reasoning tasks involving mathematical, scientific, and multi-image reasoning; (2) Compared to original models, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly due to an overemphasis on mathematical reasoning at the expense of broader capabilities; (3) Visual reasoning remains a key bottleneck for current multimodal CoT, as models exhibit a Look Light, Think Heavy pattern where verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes. These findings suggest that while multimodal CoT handles verbal reflection relatively well, it lacks the ability to maintain deep visual introspection throughout the reasoning process.


[36] Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding cs.CL | stat.ML

Yuanhao Ding, Meimingwei Li, Esteban Garces Arias, Matthias Aßenmacher, Christian Heumann

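TL;DR: This paper proposes Variance-Calibrated Modulation (VCM), a training-free pre-decoding intervention that addresses the "likelihood trap" in open-ended generation, where LLM output becomes repetitive and lexically dull and diverges from human-written text. VCM reshapes the probability distribution before truncation through two dynamic mechanisms, a PMI-based Contextual Searchlight and Adaptive Self-Debiasing, to improve diversity, coherence, and reasoning accuracy.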

Details

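Motivation: Existing post-hoc truncation methods (e.g., Top-p, Min-p) and fixed scalar repetition penalties both have limitations. The former can over-sample from the uncalibrated head of the distribution, and the latter ignore variation in logit scale across inference steps. Either can leave generated text misaligned with human lexical preferences or disrupt semantic coherence.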

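Result: Across open-ended generation, factual QA, and mathematical reasoning, VCM consistently mitigates the likelihood trap. It improves diversity and coherence and, particularly at higher decoding temperatures, reasoning accuracy, with negligible computational overhead.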

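Insight: The contribution is a training-free pre-decoding intervention framework that dynamically combines contextual information (PMI) with the real-time logit standard deviation for scale-invariant penalization. It adaptively calibrates the probability distribution to align better with human preferences and can be integrated with existing decoding strategies.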

Abstract: In open-ended generation, LLMs frequently fall into the “likelihood trap”, marked by repetitive degeneration and vocabulary dullness, creating a discrepancy between machine-generated and human-written text. While post-hoc tail truncation (e.g., Top-$p$, Min-$p$) avoids sampling from the unreliable tail, it can over-sample from the uncalibrated head and misalign generation with human lexical preferences; fixed scalar repetition penalties likewise ignore variation in logit scale across inference steps, potentially disrupting semantic coherence. To address both limitations, we propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding intervention that reshapes the probability distribution before truncation through two dynamic mechanisms: (1) Contextual Searchlight via PMI, which suppresses global stopwords while elevating context-evoked tokens, and (2) Adaptive Self-Debiasing, which uses real-time logit standard deviation for scale-invariant penalization. Across open-ended generation, factual QA, and mathematical reasoning, VCM consistently mitigates the likelihood trap. With negligible computational overhead, VCM integrates with existing decoding strategies, improving diversity, coherence, and, particularly at higher decoding temperatures, reasoning accuracy.


[37] What are Key Factors for Updates in RL for LLM Reasoning? cs.CL

Peidong Wang, Demi Wang, Xufang Luo, Jiahang Xu, Xiaocui Yang

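TL;DR: This paper gives a theoretical analysis of the key factors behind updates in Reinforcement Learning from Verifiable Rewards (RLVR). It shows that the off-policy degree (the number of gradient steps per rollout) changes the distribution of importance sampling ratios and their clipping behavior, and thereby changes which tokens dominate the update. Building on this, the authors propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries for different token groups according to the variance of their importance sampling ratios and outperforms existing baselines on several reasoning benchmarks.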

Details

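Motivation: Much existing RLVR work is guided by heuristic intuition, which leads to divergent and even contradictory algorithmic choices that nevertheless all report empirical gains. To understand this phenomenon, the paper analyzes RLVR updates theoretically and aims to guide more principled method design.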

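Result: On 3B and 7B models, across reasoning benchmarks covering mathematical problem solving, tabular QA, and logic puzzles, ACPO outperforms strong baselines such as DAPO and CISPO, supporting the claim that analysis-driven approaches yield more robust and effective RLVR methods.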

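Insight: The contribution is to identify gradient expectation as the central quantity governing update dynamics and to analyze the roles of token probability, advantage, and importance sampling ratio. ACPO adjusts clipping boundaries adaptively according to empirical variance, a principled refinement of the standard clipping mechanism that helps stabilize training and improve performance.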

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning ability of large language models. However, much of the existing work is guided by heuristic intuition, leading to divergent algorithmic choices, even contradictory ones that nevertheless report empirical gains. To better understand this phenomenon, we conduct a theoretical analysis of RLVR updates. Our study reveals that differences in off-policy degree, determined by the number of gradient steps per rollout, substantially affect the distribution of importance sampling ratios and their clipping behavior, thereby altering which tokens dominate the update. Building on this insight, we characterize gradient expectation as the central quantity governing update dynamics and analyze the roles of token probability, advantage, and importance sampling ratio. Motivated by these findings, we propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries across token groups according to the empirical variance of their importance sampling ratios. Experiments on 3B and 7B models across diverse reasoning benchmarks, spanning mathematical problem solving, tabular QA, and logic puzzles, demonstrate that ACPO outperforms strong baselines such as DAPO and CISPO. These results demonstrate that principled, analysis-driven approaches yield more robust and effective RLVR methods. Code is available at: https://github.com/Control-derek/ACPO
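The link between off-policy degree, importance sampling ratios, and clipping can be illustrated with the standard PPO-style clipped surrogate. This is a sketch only: it does not reproduce ACPO's adaptive boundary rule, which the abstract does not specify. The ratio values, the fixed clip range of 0.2, and the helper names are assumptions made for illustration.

```python
from statistics import pvariance


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Standard PPO-style per-token objective with a clipped importance ratio."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def clip_is_active(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """True when clipping removes this token's gradient through the ratio."""
    if advantage > 0:
        return ratio > 1.0 + eps_high
    if advantage < 0:
        return ratio < 1.0 - eps_low
    return False


# Hypothetical ratios pi_new / pi_old for one token group, all with advantage +1.
near_on_policy = [0.98, 1.01, 1.03, 0.99]  # few gradient steps per rollout
more_off_policy = [0.60, 1.40, 0.90, 1.25]  # many gradient steps per rollout

for name, ratios in [("near on-policy", near_on_policy), ("off-policy", more_off_policy)]:
    n_clipped = sum(clip_is_active(r, 1.0) for r in ratios)
    print(f"{name}: ratio variance={pvariance(ratios):.4f}, clipped tokens={n_clipped}")
```

In this toy setting the more off-policy group has a larger ratio variance and more clipped tokens, which is the kind of per-group statistic a variance-aware clipping rule could key on.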


[38] Only Ask What You Don’t Know: Grounded Delta Planning for Efficient Multi-step RAG cs.CL | cs.AI

Wei-Chieh Chou, Xuanjun Chen, Jian-Ren Lin, Claire Lin, Hung-yi Lee

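TL;DR: This paper proposes GDP-RAG (Grounded Delta Planning RAG), a framework that targets both the efficiency and the accuracy of RAG for multi-hop question answering. It rests on three design choices: preliminary retrieval to ground planning, a gap-conditioned planning prompt that asks only for missing information, and a skeletal trajectory that carries an evidence chain. The method focuses computation on unresolved information gaps and produces concise, reliable reasoning trajectories.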

Details

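Motivation: Existing RAG methods struggle with multi-hop question answering. They either propagate errors across iterative retrieval rounds or over-generate reasoning steps, which raises cost without improving accuracy. The paper aims for an efficient and accurate planning framework that removes redundancy and error accumulation in retrieval and reasoning.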

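Result: On HotpotQA, 2WikiMultiHopQA, and MuSiQue, GDP-RAG achieves the highest accuracy among compared systems (60.63%) while keeping cost-of-pass at 0.51, which is 22% lower than PAR-RAG (0.65) and 68% lower than KnowTrace (1.57). No compared method achieves both higher accuracy and lower cost.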

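Insight: The contribution is a planning method based on the information delta: preliminary retrieval and a gap-conditioned planning prompt ensure that queries are issued only for missing information, avoiding redundant computation. The skeletal trajectory carries the evidence chain through the whole reasoning process, which improves interpretability and reliability and offers an efficient, scalable approach to multi-step RAG.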

Abstract: Multi-hop question answering remains challenging for Retrieval-Augmented Generation (RAG) because existing approaches either propagate errors across iterative retrieval rounds or over-generate reasoning steps, increasing cost without improving accuracy. We propose Grounded Delta Planning RAG (GDP-RAG), a plan-based framework that targets only the information delta based on three simple design choices: (1) preliminary retrieval to ground planning before execution, (2) a gap-conditioned planning prompt that asks only for missing information, and (3) a skeletal trajectory that pairs each subquery with a Thought capturing evidence from preliminary retrieval and carrying it through to the final answer. GDP-RAG focuses computation on unresolved gaps, yielding concise, reliable reasoning trajectories. Extensive experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue show that GDP-RAG achieves the highest accuracy (60.63%) among all compared systems while maintaining a cost-of-pass of 0.51, 22% lower than PAR-RAG (0.65) and 68% lower than KnowTrace (1.57), with no method achieving both higher accuracy and lower cost.
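The relative cost figures quoted above follow directly from the three cost-of-pass values in the abstract. A short worked check (the helper name is an illustrative assumption; the numbers are the reported ones):

```python
def relative_saving(reference: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `reference`."""
    return (reference - ours) / reference * 100.0


gdp_rag = 0.51      # cost-of-pass reported for GDP-RAG
par_rag = 0.65      # cost-of-pass reported for PAR-RAG
knowtrace = 1.57    # cost-of-pass reported for KnowTrace

print(f"vs PAR-RAG:   {relative_saving(par_rag, gdp_rag):.1f}% lower")
print(f"vs KnowTrace: {relative_saving(knowtrace, gdp_rag):.1f}% lower")
```

The unrounded savings are about 21.5% and 67.5%, which round to the 22% and 68% stated in the abstract.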


[39] BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams cs.CL

João Guilherme Alves Santos, Giovana Kerche Bonás, Thiago Laitz, Thales Sales Almeida, Helio Pedrini

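TL;DR: This paper introduces BLUEX v2, a Portuguese open-ended question answering benchmark built from the second-phase entrance exams (2022-2025) of two leading Brazilian universities, UNICAMP and USP. It evaluates LLMs on free-response tasks that demand deeper reasoning and generation. The dataset contains 395 questions (919 graded subquestions), 55.7% of which include associated images, annotated with subject area, reference answers, rubric criteria, and cognitive capability tags. Twenty-one state-of-the-art LLMs are evaluated with an LLM-as-a-judge protocol.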

Details

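Motivation: Although LLMs perform well on many tasks, their evaluation in Portuguese has received less attention, especially for open-ended, discursive tasks that require deeper reasoning and generation. The original BLUEX benchmark addressed the scarcity of Portuguese evaluation datasets with multiple-choice questions from Brazilian university entrance exams, but it did not cover the more challenging second-phase exams, which require free-form written responses.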

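Result: On a 0-10 scale, performance across the 21 models spans 4.92 points (from 4.18 to 9.10). Mathematical Reasoning and Image Understanding emerge as the hardest capability dimensions.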

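Insight: The main contribution is a comprehensive benchmark (BLUEX v2) focused on open-ended question answering in Portuguese, filling a gap in the evaluation of deeper reasoning and generation for that language. The dataset includes many questions that combine text and images, provides detailed annotations (official answers, LLM-generated rubric criteria, cognitive capability tags), and adopts an LLM-as-a-judge protocol, offering a new resource for evaluating multimodal and complex language capabilities.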

Abstract: Although Large Language Models (LLMs) excel in many tasks, their assessment in Portuguese has received less attention, particularly for open-ended, discursive tasks that demand deeper reasoning and generation capabilities. While the original BLUEX benchmark addressed the scarcity of Portuguese evaluation datasets through multiple-choice questions from Brazilian university entrance exams, it did not cover the more challenging second-phase examinations, which require free-form written responses. In this work, we introduce BLUEX v2, a benchmark derived from the second-phase entrance exams of Brazil’s two leading universities: UNICAMP (Comvest) and USP (Fuvest), spanning exam years 2022-2025. Our dataset comprises 395 questions unfolding into 919 graded subquestions, with 55.7% of questions containing associated images. Each question is annotated with subject area, official reference answers, LLM-generated rubric criteria, and six cognitive capability tags. We evaluate 21 state-of-the-art LLMs using an LLM-as-a-judge protocol. Results reveal a 4.92-point performance spread across models (4.18-9.10 on a 0-10 scale), with Mathematical Reasoning and Image Understanding emerging as the hardest capability dimensions. The dataset, evaluation code, and model outputs are publicly available at https://anonymous.4open.science/r/BLUEXv2.


[40] Does the Same Token Mean the Same State? MoE Routing as Signal for Reasoning Control cs.CL

Kang Chen, Minshen Yu, Junjie Nian, Yaoning Wang, Yixin Cao

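TL;DR: This paper asks whether, in sparse Mixture-of-Experts language models, the same token id implies the same router state and the same set of experts. It finds that it does not: even with the emitted token held fixed, expert routing still separates task context, trajectory history, and reasoning mode. Building on this, the authors propose Routing Agreement Decoding (RAD), which selects a rollout by the density of routing states in an anchor window, without parsing or voting over answer strings.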

Details

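Motivation: The paper examines how consistent routing is in sparse MoE models and uses routing states as a signal for reasoning control, addressing settings such as code generation where string-based voting does not apply.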

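Result: Across 10 sparse-MoE configurations and 6 datasets spanning math, GPQA, and code, RAD is on par with majority voting where string voting is well-posed. On code, where exact-string voting is ill-defined, the same selector yields pass@1 directly, and on SWE-bench Verified it improves best-of-16 patch selection over random choice.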

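Insight: The contribution is to show that MoE routing states carry rich information for reasoning control and to propose a selector based on routing density rather than answer strings that works across tasks, providing a new interface for rollout selection when no answer string is available.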

Abstract: In sparse Mixture-of-Experts language models, does the same token id imply the same router state and the same experts producing it? Holding the emitted token id fixed at repeated anchors, we find it does not: the experts that produce it still separate task context, trajectory history, and reasoning-effort mode. This residual structure supports test-time control: near boundary anchors (the final-response transition) and delimiter anchors (which open the answer, e.g. \boxed{} or code fences), routing neighborhoods already align with final-answer basins at a marker-only readout and strongest when the routing is read at the answer opening. We operationalize this as RAD (Routing Agreement Decoding), an answer-string-free multi-rollout selector: it locates a fixed anchor, represents each rollout by its anchor-window MoE routing states, and returns the densest Weighted-Jaccard $K$-NN route-basin center, without parsing, normalizing, executing, or voting over answer strings. Across 10 sparse-MoE configurations (gpt-oss, Qwen3-MoE) and 6 datasets spanning math, GPQA, and code, RAD is on par with Majority where string voting is well-posed, with small positive paired deltas (RAD $73.9$ / RAD+DC $74.2$ vs. Majority $73.6$). Like majority voting, RAD is not a verifier: a dense wrong basin can still win. Its value is the interface: the same selector gives direct pass@1 on code, where exact-string voting is ill-defined, and the same routing-density principle, re-anchored to the agentic boundary, improves best-of-16 patch selection on SWE-bench Verified over random, where patches have no answer string to vote on.
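A rough sketch of what a "densest Weighted-Jaccard K-NN center" selection could look like follows. This is not the RAD implementation: representing each rollout as a dictionary of per-expert routing mass, defining density as the mean similarity to the K nearest neighbors, and the toy data are all assumptions. Only the weighted Jaccard formula itself (sum of element-wise minima over sum of element-wise maxima) is the standard definition.

```python
def weighted_jaccard(a: dict, b: dict) -> float:
    """Weighted Jaccard similarity between two non-negative weight maps."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def densest_rollout(routes: list, k: int = 2) -> int:
    """Return the index of the rollout with the highest mean K-NN similarity."""
    best_idx, best_density = -1, float("-inf")
    for i, route in enumerate(routes):
        sims = sorted(
            (weighted_jaccard(route, other) for j, other in enumerate(routes) if j != i),
            reverse=True,
        )
        top = sims[:k]
        density = sum(top) / len(top) if top else 0.0
        if density > best_density:
            best_idx, best_density = i, density
    return best_idx


# Hypothetical anchor-window routing summaries: expert id -> routing mass.
routes = [
    {"e1": 0.60, "e2": 0.40},
    {"e1": 0.50, "e2": 0.50},
    {"e1": 0.55, "e2": 0.45},
    {"e3": 1.00},  # an outlier rollout routed to a different expert
]
print(densest_rollout(routes, k=2))
```

Note that the selector never inspects an answer string, which is why the same procedure can be applied to code or patches. As the abstract cautions, a dense cluster of wrong rollouts would still be selected.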


[41] StatABench: Dataset and Framework for Evaluating Statistical Analysis Capabilities of LLMs cs.CL | cs.AI

Youxin Zhu, Yixuan Ding, Peng Lai, Longyue Wang, Bingyi Jing

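TL;DR: This paper presents StatABench, a benchmark for systematically evaluating the statistical analysis capabilities of large language models (LLMs). It has two parts: Stat-Closed (404 questions across 18 topics) and Stat-Open (30 complex open-ended modeling tasks). Even GPT-5.1 reaches only 68.6% accuracy on Stat-Closed, showing that current LLMs remain well short of reliable statistical analysis.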

Details

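Motivation: Existing benchmarks for LLM statistical analysis are limited in scope and format and cannot fully measure the domain knowledge and tool proficiency this complex field requires.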

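Result: On Stat-Closed, GPT-5.1 reaches 68.6% accuracy and the best open-source model reaches 60.6%. On Stat-Open, the top agent framework scores 61.86 on average. These results show a gap between current models and reliable statistical analysis.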

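Insight: The contribution is a comprehensive benchmark that includes both closed-form and open-ended tasks, evaluated with the LangChain MCP framework and an LLM-as-Judge protocol. It systematically exposes persistent challenges for LLMs in tool-grounded reasoning, methodological decision-making, and end-to-end modeling.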

Abstract: Statistical analysis is a broad, complex field requiring both domain knowledge and tool proficiency. While prior work has evaluated large language models (LLMs) in this domain, existing benchmarks remain limited in scope and format. To bridge this gap, we introduce StatABench (Statistical Analysis Benchmark), a benchmark designed to systematically assess LLMs’ statistical analysis capabilities. StatABench comprises two complementary components: Stat-Closed, containing 404 questions across 18 statistical topics in multiple formats (multiple-choice, fill-in-the-blank, decision-making, and practical application), and Stat-Open, featuring 30 complex open-ended modeling tasks adapted from professional competitions. We evaluate diverse LLMs using the LangChain MCP framework and multiple data science agents, and assess Stat-Open solutions via a validated LLM-as-Judge protocol. Experiments show that even GPT-5.1 achieves only 68.6% on Stat-Closed, while the best open-source model reaches 60.6%. On Stat-Open, the top agent framework scores 61.86 on average. These results reveal the gap between current LLMs and reliable statistical analysis, highlighting persistent challenges in tool-grounded reasoning, methodological decision-making, and end-to-end statistical modeling.


[42] PIVOTSBench: Evaluating Fine-Grained Interpersonal Relationship Reasoning in Multimodal Large Language Models cs.CL

Shuxiang Zhang, Yiting Yin, Wenxuan Song, Yuhang Wu, Miao Liu

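TL;DR: This paper presents PIVOTSBench, the first benchmark for evaluating fine-grained interpersonal relationship reasoning in multimodal large language models (MLLMs). Grounded in psychology research and built from Social-IQ 2.0 and YouTube data, it evaluates a model's ability to predict bidirectional interpersonal relationship dimensions (such as power and intimacy) and includes auxiliary tasks on identifying critical visual cues.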

Details

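Motivation: Fine-grained interpersonal relationship reasoning is central to everyday social interaction and inherently multimodal, yet it remains largely unexplored by existing MLLMs. A dedicated benchmark is needed to evaluate and advance this capability.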

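Result: The paper evaluates a range of proprietary and open-source MLLMs and runs detailed ablations on the effects of visual modalities and of explicit social role information in conversational utterances. It also examines how joint and pairwise prediction settings help models score bidirectional PIVOTS dimensions.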

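Insight: The main contribution is the first multimodal benchmark that is grounded in psychological theory and focused on fine-grained, bidirectional interpersonal relationship reasoning, together with auxiliary tasks that test whether models can use critical visual cues. It offers a new evaluation framework and analysis dimensions for understanding the social intelligence of MLLMs.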

Abstract: Humans possess an innate ability to understand fine-grained interpersonal relationships, which is central to everyday social interactions. Although such reasoning is inherently multimodal, it remains largely unexplored by existing multimodal large language models (MLLMs). To address this gap, we introduce PIVOTS, the first benchmark built from Social-IQ 2.0 and YouTube data to evaluate MLLMs’ ability to predict bidirectional interpersonal relationship dimensions grounded in established psychology research. In addition, PIVOTS includes auxiliary tasks that assess models’ ability to identify and leverage the critical visual cues underlying such predictions. We evaluate both proprietary and open-source MLLMs and conduct detailed ablation studies to analyze the effects of visual modalities and explicit social role information in conversational utterances. We further examine how joint and pairwise prediction settings benefit MLLMs in scoring bidirectional PIVOTS dimensions. Project page and resources: https://flynnzhangsx.github.io/PIVOTSBench/ .


[43] A Dual-Track Framework for Template-Constrained LaTeX Conversion cs.CL

Chung Cheuk Hei, Liu Li

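TL;DR: This paper proposes a dual-track framework for converting structured Markdown drafts into template-compliant LaTeX. It decouples template formatting from document processing: an offline track extracts template constraints into a reusable manifest, and an online track runs a hybrid execution pipeline that combines the strengths of rule-based engines and large language models (LLMs).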

Details

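Motivation: Existing approaches rely mainly on rule-based converters or on pure end-to-end LLM generation. The former fail to handle asset insertion and template-specific constraints correctly, while the latter tend to cause semantic drift and hallucinations that are difficult to debug.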

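Result: An empirical evaluation on 7 LaTeX templates and 56 published research papers shows that, compared with previous baselines, the method preserves better structural fidelity, satisfies diverse layout constraints, and achieves a higher compilation success rate.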

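Insight: The core idea is to systematically decouple template formatting from document processing and to use a hybrid execution pipeline. LLM usage is confined to components that need reasoning (semantic metadata, bibliographic references, and complex visual or tabular layouts), while deterministic processing is delegated to rule-based engines, balancing flexibility and reliability.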

Abstract: With the increasing demands for advanced document conversion, mapping structured Markdown drafts into template-compliant formats like LaTeX remains a challenge. Existing approaches largely depend on either deterministic rule-based converters or pure end-to-end Large Language Model (LLM) generation. The former fails to correctly handle asset insertions and template-specific constraints, while the latter tends to induce semantic drift, leading to hallucinations that are difficult to debug. To address these limitations, we introduce a robust Dual-Track Framework that systematically decouples template formatting from document processing: an offline track extracts template constraints into a reusable manifest, while an online track implements a hybrid execution pipeline. This pipeline confines LLM usage exclusively to reasoning-intensive components (e.g., semantic metadata, bibliographic references, and complex visual/tabular layouts) while delegating rule-based engines for deterministic processing. Empirical evaluation across 7 LaTeX templates and 56 published research papers demonstrates that our method preserves better structural fidelity, satisfies diverse layout constraints, and achieves a higher compilation success rate compared to the previous baselines.


[44] Predicate Importance Estimation and Decoupled Rationale-Score Distillation for Entity Alignment cs.CL

Keunha Kim, Yoonjin Jang, Hyeon-gu Lee, Sihyung Kim, Youngjoong Ko

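TL;DR: This paper proposes a new method for knowledge graph entity alignment with two complementary modules: Predicate Importance Estimation (PIE) and Decoupled Rationale-Score Distillation (DRSD). It targets the difficulty of aligning entities when integrating heterogeneous knowledge graphs, where predicate names vary and local neighborhoods are incomplete. It improves alignment by building predicate-aware entity embeddings and by training a small language model that separates decision rationales from confidence.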

Details

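Motivation: Industrial KG-RAG systems need to integrate public and domain-specific knowledge graphs built from heterogeneous databases. Existing entity alignment that relies on lexical matching alone performs poorly under predicate-name variation and incomplete local neighborhoods.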

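Result: Experiments show that the PIE and DRSD modules improve entity alignment classification. Because DRSD separates confidence-score estimation from the decision, a disagreement between the two flags an uncertain prediction for human review, giving a practical split between automatic acceptance and human-in-the-loop verification.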

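Insight: The contributions are: 1) a compact, embedding-based predicate importance estimation method that removes subject information and aggregates the resulting subjectless triples to build predicate-aware entity embeddings; and 2) decoupled rationale-score distillation, which converts binary labels into text-based supervision and separates confidence-score estimation from label-consistent rationales, so that a small language model learns task-specific reasoning while retaining a less label-biased confidence signal.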

Abstract: Knowledge graphs (KGs) are increasingly used as structured context for Large Language Models (LLMs), but industrial KG-RAG systems often need to integrate public and domain-specific KGs constructed from heterogeneous databases. This integration relies on Entity Alignment (EA), where lexical matching alone is insufficient under predicate-name variation and incomplete local neighborhoods. We address EA for KG integration by constructing a pairwise EA dataset and proposing two complementary modules: Predicate Importance Estimation (PIE) and Decoupled Rationale-Score Distillation (DRSD). PIE is a compact embedding-based approach that removes the subject information from each 1-hop triple, encodes the resulting subjectless triples, and aggregates them with learnable predicate-importance weights to build predicate-aware entity embeddings. DRSD trains a distilled small language model (SLM) with pseudo-answers produced by a teacher LLM through distinct prompts. By converting binary EA labels into text-based supervision and decoupling confidence-score estimation from label-consistent rationales, DRSD enables the SLM to learn task-specific reasoning while retaining a less label-biased confidence signal. Experiments show that PIE and DRSD improve EA classification. Moreover, because DRSD decouples confidence-score estimation from the decision, a discrepancy between the two flags an uncertain prediction for human review, thereby enabling a practical discrepancy between automatic acceptance and human-in-the-loop verification.


[45] PRIDE: Privileged Information-enhanced Distillation for Empathetic Dialogue Generation cs.CL | cs.AI

Jiaqiang Wu, Zhouan Zhu, Shangfei Wang

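TL;DR: This paper proposes PRIDE, a privileged information-enhanced knowledge distillation method for empathetic dialogue generation in resource-constrained environments. It uses privileged information that is available only during training, such as expert psychological annotations or future event summaries. Through an empathy-reasoning prompt, a multi-source attention mechanism, and a dual-alignment loss, it transfers the empathetic reasoning of a large teacher model to a small student model without requiring extra inputs at inference time.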

Details

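Motivation: Large language models generate diverse, context-aware empathetic responses, but their computational demands limit deployment in resource-constrained environments. Existing knowledge distillation methods often overlook the implicit contextual cues that guide human connection and therefore struggle to transfer the nuanced understanding that empathy requires.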

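Result: Experiments on multimodal and text-only datasets show that the method achieves competitive performance and, in some cases, matches or even surpasses larger teacher models in accuracy and semantic relevance.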

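Insight: The contribution is to strengthen knowledge distillation with privileged information that exists only at training time. Specifically, it uses an empathy-reasoning prompt that explicitly decomposes the empathetic process, a multi-source attention mechanism that integrates privileged information, and a dual-alignment loss that combines reversed KL divergence with maximum mean discrepancy for robust knowledge transfer at both the logit and feature levels.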

Abstract: Large language models have demonstrated significant capabilities in generating diverse and context-aware responses for empathetic dialogue. However, their computational demands severely limit their deployment in resource-constrained environments. While knowledge distillation offers a promising compression solution, it often fails to transfer the nuanced understanding essential for empathy, as it overlooks the implicit contextual cues that guide human connection. To bridge this gap, we propose a privileged information-enhanced knowledge distillation method for empathetic dialogue generation (PRIDE). Our method leverages privileged information, such as expert psychological annotations or future event summaries, which is available exclusively during training but unavailable at inference time. This allows us to transfer the teacher model’s empathetic reasoning to smaller models without relying on extra inputs during deployment. Specifically, PRIDE has three key components: (1) An empathy-reasoning prompt that guides the teacher to explicitly decompose the empathetic process into understanding feelings and analyzing situations step-by-step; (2) A multi-source attention mechanism that directs the student to effectively integrate privileged information; (3) A dual-alignment loss that combines reversed Kullback-Leibler divergence and maximum mean discrepancy to ensure robust knowledge transfer at both logit and feature levels. Experiments on multi-modal and text-only datasets demonstrate that our method achieves competitive performance, and in some cases matches or even surpasses larger teacher models in terms of accuracy and semantic relevance.
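As an illustration of component (3), the sketch below combines a reverse KL term on logits with a squared maximum mean discrepancy (MMD) term on features. It is a toy, dependency-free version under stated assumptions: the RBF kernel, the biased MMD estimator, the weighting factor `lam`, and all function names are choices made here, not details confirmed by the abstract.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the expectation is taken under the student."""
    q, p = softmax(student_logits), softmax(teacher_logits)
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def dual_alignment_loss(s_logits, t_logits, s_feats, t_feats, lam=1.0):
    # lam is a hypothetical weighting between the logit and feature terms.
    return reverse_kl(s_logits, t_logits) + lam * mmd2(s_feats, t_feats)


student_feats = [[0.1, 0.2], [0.3, 0.1]]
teacher_feats = [[0.9, 0.8], [1.0, 0.7]]
print(dual_alignment_loss([2.0, 0.5, 0.1], [1.0, 1.5, 0.2], student_feats, teacher_feats))
```

Reverse KL is mode-seeking: it penalizes the student for placing mass where the teacher has little, which is a common motivation for preferring it over forward KL when the student is much smaller than the teacher.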


[46] When Does Intrinsic Self-Correction Help? A Task-Sensitive Analysis cs.CL | cs.AI

Elroy Stav, Dvir Berlowitz, Maayan Orner, Sarit Kraus

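TL;DR: This paper gives a task-sensitive analysis of intrinsic self-correction (SC) in large language models. Rather than assessing its reliability in general, it examines how effective SC is under different task structures. SC yields consistent gains when the task structure supports verifying explicit constraints, revisiting a complex reasoning process, or giving a second opinion over competing strategies.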

Details

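Motivation: Recent studies question the reliability of intrinsic self-correction, showing that models struggle to judge whether their initial answers are correct. This paper takes a task-sensitive view and asks under which task-specific mechanisms SC can be effective.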

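Result: Experiments across multiple benchmarks and models show that SC yields consistent performance gains when the task structure facilitates revision by verifying constraints, revisiting reasoning, or evaluating competing strategies, but its effect depends heavily on task characteristics.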

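Insight: The contribution is to reframe SC as a task-dependent inference-time strategy whose usefulness depends on the role the revision stage can play in a given task, rather than as a uniformly reliable improvement method. This gives a finer-grained framework for understanding and using SC.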

Abstract: Intrinsic self-correction (SC) aims to improve large language model outputs by prompting a model to revisit its own initial answer without external feedback. Recent studies have questioned the reliability of this approach, showing that models often struggle to judge whether their initial responses are correct. In this work, we take a task-sensitive view of SC. Rather than asking whether it works in general, we examine settings where SC may operate through different mechanisms: verifying explicit constraints, revisiting a complex reasoning process, or providing a second opinion over competing strategies in word-game tasks. Across multiple benchmarks and models, we find that SC can yield consistent performance gains when the underlying task structure facilitates these modes of revision. These results suggest that SC is best understood as a task-dependent inference-time strategy whose usefulness depends on the role the revision stage can play in a given task, rather than as a uniformly reliable method for improving initial model outputs.


[47] The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery cs.CL | cs.LG

Ivan Novosad

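TL;DR: This paper studies the limits of CTC-internal scoring for N-best hypothesis selection and locates the information bottleneck in the separation between acoustic confidence and linguistic plausibility. CTC-internal scoring strategies do not significantly improve recognition, whereas MBR decoding with an external language model (such as RoBERTa) breaks through the bottleneck and lowers word error rate.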

Details

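Motivation: In N-best hypothesis selection, CTC models saturate on acoustic information and cannot exploit linguistic information effectively. The paper investigates the limits of CTC-internal scoring and the complementary role of external linguistic information.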

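Result: On LibriSpeech dev-other, CTC-internal scoring strategies (G=16) give no statistically significant WER improvement. MBR-CER decoding with a RoBERTa pseudo-log-likelihood posterior (G=128) reaches 5.42% WER on test-other, a 9.0% relative reduction from greedy decoding (5.96%), and the recipe gives significant gains in 11 of 13 conditions across architectures, domains, and noise levels.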

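Insight: The contribution is to show that the discriminative capacity of CTC-internal representations saturates (driven by blank-path proliferation) and that injecting linguistic information through an external language model breaks the bottleneck. The paper also notes that standard MWER training can fail at near-converged checkpoints because the training reward signal is too weak.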

Abstract: We study the limits of CTC-internal scoring for N-best hypothesis selection and locate the information bottleneck separating acoustic confidence from linguistic plausibility. Eleven CTC-internal and acoustic-feature scoring strategies produce no statistically significant WER improvement over greedy decoding on LibriSpeech dev-other at G=16 (all p > 0.05). The exhaustion is systematic: CTC’s Spearman $ρ$ between hypothesis score and per-utterance WER degrades from -0.574 at G=4 to -0.270 at G=128, a 53% loss driven by blank-path proliferation. This establishes that the discriminative capacity of CTC-internal representations is saturated: no recombination of acoustic signals can close the oracle gap. Confirming that the bottleneck is linguistic, not acoustic, external linguistic information introduced via MBR decoding breaks through it. MBR-CER decoding with a RoBERTa pseudo-log-likelihood (PLL) posterior ($τ$=10, G=128) achieves 5.42% WER on held-out LibriSpeech test-other (greedy 5.96%, $Δ$=-0.535 pp, p<0.0001, 9.0% relative). RoBERTa PLL $ρ$ degrades only 21% over the same range, retaining discriminating power where CTC loses it. Applied without retuning across two Zipformer architectures, three domains (LibriSpeech, TED-LIUM 3, VoxPopuli), and four MUSAN noise levels, the recipe gives significant gains in 11 of 13 conditions. On the training side, standard MWER training via the CTC forward-backward algorithm implements Rao-Blackwellized REINFORCE at the output projection (variance about 3x below Viterbi). Yet sequence-level fine-tuning fails at near-converged checkpoints: all four MWER configurations on CR-CTC collapse (+6.18 to +8.90 pp WER), as a training oracle gap of 0.007 pp provides no usable reward signal.
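The MBR-CER selection step can be sketched as follows: each hypothesis is scored by its expected character error rate against the other hypotheses, weighted by a posterior derived from external language-model scores. This is a simplified illustration, not the paper's recipe. How $τ$ enters (here the score is divided by it inside a softmax), the toy hypotheses, and the scores are assumptions.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    return edit_distance(hyp, ref) / max(len(ref), 1)


def mbr_select(hyps, scores, tau=10.0):
    """Pick the hypothesis with minimum expected CER under softmax(scores / tau)."""
    m = max(scores)
    weights = [math.exp((s - m) / tau) for s in scores]
    z = sum(weights)
    posterior = [w / z for w in weights]
    risks = [sum(p * cer(h, r) for r, p in zip(hyps, posterior)) for h in hyps]
    return min(range(len(hyps)), key=risks.__getitem__)


# Hypothetical N-best list with external LM scores (higher is better).
hyps = ["the cat sat", "the cat sad", "the cat sat", "a cat sat"]
scores = [-12.0, -11.0, -12.5, -15.0]
best = mbr_select(hyps, scores)
print(hyps[best])
```

In this toy list the single highest-scoring hypothesis is "the cat sad", but MBR returns "the cat sat" because it is closest, in expectation, to the rest of the list. That consensus effect is what distinguishes MBR from picking the top-scoring hypothesis.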


[48] Do LLM Embedding Spaces Recover Expert Structure? cs.CL

Yixuan Zhu, Zhenke Duan, Fanghen Li

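TL;DR: This paper studies whether large language model (LLM) embedding spaces recover expert-defined structure, focusing on mental-health-related language. Using data from 28 Reddit communities, it compares pretrained and fine-tuned Qwen3 embedding spaces (0.6B and 4B parameters). Pretrained embeddings already show measurable alignment with the expert symptom structure within the mental-health subset, fine-tuning strengthens that alignment, and larger scale improves both zero-shot alignment and supervision-induced gains.

Details

Motivation: Pretrained text embeddings are used as representational maps, but high category separability does not imply that their geometry recovers expert-defined structure. The paper examines this in mental-health language, where symptom relations provide an external reference and online communities introduce domain, affective, stylistic, and discourse confounds.

Result: Within the mental-health subset, pretrained embeddings show measurable alignment with expert structure; fine-tuning strengthens this alignment most at the finest category level; and larger scale improves both zero-shot alignment and supervision-induced gains. Residual alignment remains substantial after controlling for VAD, LIWC, lexical style, and topic-distribution structure.

Insight: The contribution is a systematic evaluation of how well LLM embedding spaces recover expert structure. The paper stresses that this recovery is level-dependent and should be tested against explicit confounds rather than inferred from classification performance alone. Its combination of prototype construction, representational similarity analysis, and multi-baseline confound controls offers a rigorous framework for assessing the semantic structure of embedding spaces.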

Abstract: Pretrained text embeddings are increasingly used as representational maps, yet high category separability does not imply that their geometry recovers expert-defined structure. We study this problem in mental-health-related language, where symptom relations provide an external reference and online communities introduce strong domain, affective, stylistic, and discourse confounds. Using 28 Reddit communities, we compare pretrained and supervised fine-tuned Qwen3 embedding spaces at two scales (0.6B and 4B). We construct category prototypes, evaluate their representational dissimilarity matrices against an expert symptom matrix with representational similarity analysis, and complement this global test with prototype-based typicality and multi-baseline confound controls. Pretrained embeddings show measurable alignment with expert structure within the mental-health subset; fine-tuning strengthens this alignment most at the finest category level; and larger scale improves both zero-shot alignment and supervision-induced gains. Residual alignment remains substantial after controlling for VAD, LIWC, lexical style, and topic-distribution structure. These results suggest that LLM embeddings can recover expert-relevant category geometry, but this recovery is level-dependent and should be tested against explicit confounds rather than inferred from classification alone.


[49] UnBias-Plus: Detect, Explain, and Rewrite Bias cs.CL | cs.AI | cs.SE

Ahmed Y. Radwan, Ahmed ElKady, Sindhuja Chaduvula, Mohamed Hafez, Amrit Krishnan

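TL;DR: UnBias-Plus is an open-source toolkit for addressing bias in natural language. It integrates fine-grained bias detection, span localization, explanation, and neutral text rewriting, and is accessible through several interfaces.

Details

Motivation: Existing bias detection methods typically identify only the presence of bias and offer limited support for granular detection, interpretable explanations, neutral rewriting, and openly available trained models. UnBias-Plus aims to unify these capabilities.

Result: The paper presents the UnBias-Plus toolkit together with its publicly available source code, models, datasets, and documentation; the abstract reports no specific benchmarks or quantitative results.

Insight: The contribution is unifying multi-class bias classification, biased span localization, neutral text rewriting, and per-decision reasoning in a single toolkit, which gives a more complete and interpretable treatment of bias. The open-source release also supports accessibility and further research.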

Abstract: Bias in natural language remains a persistent challenge in both human-written and AI-generated content, affecting domains such as journalism, education, and AI research. Most existing detection methods identify only the presence of bias, with limited support for granular detection, interpretable explanations, neutral rewriting, and openly available trained models. We present UnBias-Plus, an open-source toolkit unifying (1) segment-level multi-class bias classification, (2) biased span localization, (3) neutral text rewriting, and (4) reasoning for each decision. Available via Python, CLI, REST API, and web interfaces, UnBias-Plus supports accessible bias analysis. The toolkit, source code, models, datasets, and documentation are publicly available.


[50] ReasoningLens: Hierarchical Visualization and Diagnostic Auditing for Large Reasoning Models cs.CL | cs.AI

Jun Zhang, Jiasheng Zheng, Boxi Cao, Yaojie Lu, Hongyu Lin

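TL;DR: The paper presents ReasoningLens, an open-source framework for large reasoning models that addresses the transparency problem created by long Chain-of-Thought traces through hierarchical visualization and diagnostic auditing. It turns unstructured text traces into interactive hierarchies and uses an agentic auditor for automated error detection and tool-augmented verification, providing a modular foundation for interpreting, debugging, and optimizing reasoning models.

Details

Motivation: Large reasoning models produce exceptionally long Chain-of-Thought traces, so critical logic is buried under massive procedural text, which creates a transparency burden.

Result: The abstract reports no quantitative benchmark results or SOTA comparisons. It claims that the framework reveals model-specific blind spots by structuring traces, detecting errors automatically, and synthesizing systemic reasoning profiles.

Insight: The contribution is a hierarchical visualization of long reasoning chains that separates high-level strategy from low-level execution, combined with an agentic auditor for automated diagnosis. Together these give a systematic tool for making complex reasoning processes transparent and auditable.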

Abstract: The emergence of Large Reasoning Models has introduced exceptionally long Chain-of-Thought traces, creating a transparency burden where critical logic is often buried under massive procedural text. To address this, we present ReasoningLens, an open-source framework designed for the hierarchical visualization and diagnostic auditing of complex reasoning chains. ReasoningLens addresses information necropsy by: (1) structuring traces into interactive hierarchies that separate high-level strategy from low-level execution; (2) leveraging an agentic auditor for automated error detection and tool-augmented verification; and (3) synthesizing systemic reasoning profiles to reveal model-specific blind spots. By transforming unstructured walls of text into actionable insights, ReasoningLens provides a modular foundation for interpreting, debugging, and optimizing the next generation of reasoning-centric AI.


[51] TriggerBench: Investigating Prospective Memory for Large Language Models cs.CL

Tianhua Zhang, Xinjiang Wang, Qianxi Zhang, Qi Chen, Kun Li

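TL;DR: This paper introduces TriggerBench, a comprehensive benchmark for evaluating prospective memory (PM) in large language models. It spans five dimensions across daily-assistant and professional-workflow settings, and uses paired scenarios, contrastive variants, and overloaded triggers to measure proactive recall, false-alarm rate, and attentional robustness at fine granularity. The study finds that PM exhibits a precision-recall trade-off and attentional fragility, is harder than retrospective memory (RM), and can serve as a probe of a model's spare reasoning capacity.

Details

Motivation: Existing evaluations focus mainly on retrospective memory in long interactions, tested through explicit queries. Prospective memory, the ability to spontaneously recall and act on latent constraints without direct prompts, remains largely unevaluated.

Result: Evaluation on TriggerBench shows that PM exhibits a precision-recall trade-off and attentional fragility: enhanced reasoning significantly improves proactive recall but can lead models to overfit to an "always-remind" heuristic. PM accuracy drops substantially under implicit constraints or when triggers are overloaded by concurrent requests. PM is much harder than RM: on identical contexts, RM nearly saturates up to 100K tokens, while PM decays sharply as context length grows. PM can also act as a probe of spare reasoning capacity: when PM scenarios are paired with AIME-2025 math problems, successful trajectories show higher PM accuracy than failed ones at the same context length.

Insight: The contribution is TriggerBench, described as the first benchmark to systematically evaluate prospective memory in LLMs, with a fine-grained protocol built on matched RM controls, contrastive variants, and overloaded triggers. The study exposes the performance gap between PM and RM and the sensitivity of PM to context length, and it links PM accuracy to spare reasoning capacity, which offers a new view of cognitive load in long contexts.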

Abstract: While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical ability to spontaneously recall and act on latent constraints without direct prompts, remains largely unevaluated. We introduce TriggerBench, a comprehensive PM benchmark spanning five dimensions across both daily assistants and professional workflows. TriggerBench pairs scenarios with matched RM controls, contrastive positive/negative variants, and overloaded triggers, enabling fine-grained measurement of proactive recall, false-alarm rate, and attentional robustness under a single protocol. Our evaluation yields three key findings. (i) PM shows a precision-recall trade-off and attentional fragility. Though enhanced reasoning significantly improves proactive recall, models may overfit to an “always-remind” heuristic. Furthermore, PM accuracy degrades substantially under implicit constraints or triggers overloaded by concurrent user requests, indicating that robust PM remains an open challenge. (ii) PM is notably harder than RM: on identical contexts, RM near-saturates up to 100K tokens, while PM decays sharply as context length scales. (iii) PM may serve as a behavioral probe of spare reasoning capacity. Pairing PM scenarios with AIME-2025 math problems reveals that successful trajectories yield higher PM accuracy than failed ones at the same context length, showing PM tracks spare reasoning budget that token count obscures. Project page: https://github.com/KristenZHANG/TriggerBench-Official.


[52] Self-Compacting Language Model Agents cs.CL

Tianjian Li, Jingyu Zhang, William Jurayj, Xi Wang, Chuanyang Jin

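TL;DR: This paper proposes SelfCompact, a scaffold that lets a language model decide for itself when and how to compact an overlong context trajectory. It pairs a compaction tool the model can invoke with a lightweight trigger rubric, achieving adaptive compaction without fine-tuning. The method substantially lowers compute cost and improves performance on math reasoning and agentic search tasks.

Details

Motivation: Existing agent scaffolds compact context at fixed intervals, which ignores the structure of the task trajectory and can discard key intermediate results or interrupt a search in progress. A more adaptive compaction mechanism is needed.

Result: Experiments on six benchmarks (math reasoning and agentic search) and seven models show that SelfCompact matches or exceeds fixed-interval compaction at lower token cost. Relative to a no-compaction baseline, it improves by up to 18.1 points on math and 5-9 points on search, while reducing per-question cost by 30-70%.

Insight: The contribution is handing the compaction decision to the model itself and using a lightweight rubric to cover the meta-cognitive gap that models cannot reliably tell when their own context has gone stale. This training-free scaffold design suggests a new approach to long-context management in agent systems.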

Abstract: Long agent traces composed of chains of thought and tool calls accumulate stale content that anchors subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate this with fixed-interval compaction triggered at a token threshold. Such triggers pay no heed to trajectory structure, risking discard of partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that allows the model itself to decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is unevenly used across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. Our results show that SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap, reframing when to compact as a capability that scaffolds can supply without training.


[53] Can LLMs Reliably Self-Report Adversarial Prefills, and How? cs.CL

Quang Minh Nguyen, Uzair Ahmed, Taegyoon Kim

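TL;DR: This paper studies the introspective capability of large language models (LLMs) in adversarial safety settings, specifically whether a model can reliably recognize that its own output was elicited by an adversarial prefill attack. Across several open-weight instruction-tuned models and benchmarks, models cannot reliably identify their own compromised outputs, and the introspective signal stems mainly from safety- and refusal-related reasoning. Interventions through weight orthogonalization, different question framings, and LoRA fine-tuning reveal the mechanism behind the signal and its unreliability, and can unexpectedly raise attack success rate.

Details

Motivation: Prior work shows that LLMs exhibit introspective capability on benign tasks. This paper asks how reliable that capability is in adversarial safety settings, that is, whether a model can recognize that its own output was elicited by an adversarial prefill attack.

Result: Across ten open-weight instruction-tuned models (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs; models claim intent on prefilled outputs at an average rate of 27.3%. After interventions such as LoRA fine-tuning, the intention-probe gap widens, but attack success rate rises on most models.

Insight: The contribution is extending introspection research to adversarial safety settings and showing that the introspective signal stems mainly from safety and refusal reasoning, is sensitive to question framing, and that interventions can introduce unexpected risks. The study underlines the unreliability of LLM self-reports in safety contexts and offers a new view of the underlying mechanisms.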

Abstract: Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of $27.3\%$. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models’ weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. We test three LoRA finetuning methods (SFT, GRPO, DPO) on eight models from 3B to 27B; all three widen the intention-probe gap on every model from 8B to 27B, with method ranking varying by model. The intervention does not transfer to the tampering probe and counterintuitively raises attack success rate under adversarial prefill on most models, amounting to a partial mitigation. These findings outline mechanisms underpinning the observed introspective signals in safety contexts and highlight risks in the reliability of LLM self-reports.


[54] Randomized YaRN Improves Length Generalization for Long-Context Reasoning cs.CL

Manas Mehta, Fangcong Yin, Greg Durrett

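TL;DR: This paper proposes Randomized YaRN, a training method intended to improve the length generalization of large language models on long-context reasoning tasks. It combines YaRN positional extrapolation, randomized positional encoding, and a length curriculum: on short-context training data, tokens receive positional encodings sampled from a larger position range, which helps the model generalize to very long sequences.

Details

Motivation: Large language models are typically pretrained on short sequences and extended to longer ones with additional training, yet they still struggle to generalize to very long sequences. The paper targets the challenge of generalizing from short-context training data to sequences far beyond the training distribution in long-context reasoning.

Result: On two challenging long-context reasoning benchmarks, BABILong and Multi-Round Coreference Resolution, when trained on data with context length below 8K, Randomized YaRN consistently improves reasoning performance at context lengths from 16K to 128K and outperforms standard fine-tuning, with the largest gains at lengths far outside the training distribution.

Insight: The contribution is combining YaRN positional extrapolation with randomized positional encoding and a length curriculum, so that training progressively exposes the model to out-of-distribution positional representations and length generalization improves. The result is an effective training recipe for generalizable long-context reasoning.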

Abstract: Large language models (LLMs) are typically pretrained on short sequences and then extended to work on longer sequences with additional training. However, such LLMs still struggle to further generalize to very long sequences. We propose Randomized YaRN, a training method that improves length generalization by combining YaRN-based positional extrapolation with randomized positional encoding and a length curriculum. During training on short context data, tokens are assigned YaRN positional encodings sampled from a larger position range, exposing the model to out-of-distribution positional representations even on short-context inputs. We evaluate Randomized YaRN on two challenging long-context reasoning benchmarks, BABILong and Multi-Round Coreference Resolution (MRCR). When training on data with <8K context, Randomized YaRN consistently improves reasoning performance on context lengths from 16K to 128K and outperforms standard fine-tuning, with the largest gains appearing at far out-of-distribution lengths. Our results suggest that progressively exposing models to OOD positional distributions provides an effective recipe for generalizable long-context reasoning.
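
The idea of assigning short inputs positions drawn from a larger range can be sketched as follows. This is a generic illustration of randomized position sampling, assuming positions are drawn without replacement and sorted so that token order is preserved; the abstract does not specify the paper's exact sampling scheme, its interaction with YaRN scaling, or the curriculum. The name `sample_position_ids` is hypothetical.

```python
import random


def sample_position_ids(seq_len, max_position, rng=None):
    """Draw seq_len distinct position ids from [0, max_position) in increasing order.

    A short sequence then carries positions that a standard run would only
    see on much longer inputs, while relative token order is kept intact.
    """
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    rng = rng or random.Random()
    return sorted(rng.sample(range(max_position), seq_len))


# Toy usage: an 8-token input assigned positions from a 128K range.
print(sample_position_ids(8, 131072, random.Random(0)))
```

In an actual training loop these ids would replace the default `0..seq_len-1` positions passed to the rotary embedding; how that hooks into a given model is left out here because it depends on the framework.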


cs.CV [Back]

[55] A Projection-Based Surrogate Gradient Interpretation for Neural Codec Wrappers cs.CV | cs.AI | eess.SP

Esteban Pesnel, Julien Le Tanou, Michael Ropert, Aline Roumy, Thomas Maugey

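TL;DR: This paper proposes a projection-based interpretation of surrogate gradients for training neural codec wrappers. It interprets the existing SCALED surrogate gradient as a first-order local approximation of the video codec, which improves its interpretability, and shows that the approach works not only for learned downscaling but also for the harder task of full neural wrapping, generalizing well across codecs, quality factors, and tasks.

Details

Motivation: Neural wrappers aim to improve conventional video codecs, but training them is difficult because codecs involve discrete decisions and are therefore non-differentiable. Surrogate gradients such as SCALED enable end-to-end learning and improve compression performance, but SCALED was originally introduced as a reparameterization trick, which limits its interpretability.

Result: Relative to standard resampling baselines, the method achieves BD-Rate (PSNR) reductions of up to -23.59% on x264 and -20.07% on VVenC, indicating a substantial gain in compression efficiency.

Insight: The core contribution is a projection interpretation of the SCALED surrogate gradient based on a first-order local approximation, which strengthens its theoretical interpretability. The paper also extends surrogate gradients from training a downscaler alone to training full pre- and post-processing wrapper networks, and verifies strong generalization across codecs, quality factors, and tasks.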

Abstract: Neural wrappers are learned pre- and post-processing networks designed to enhance the performance of conventional video codecs. Although these approaches can significantly improve compression efficiency, training them remains challenging due to the non-differentiability of video codecs, which arises from the multiple discrete decisions involved in the encoding process. Surrogate gradients have recently emerged as an effective solution for enabling end-to-end learning with conventional codecs. They offer two main advantages: they avoid training an additional network to mimic the codec, and they can improve compression performance. In particular, the recently proposed SCALED method, which leverages the true compression error, has shown strong results for training neural pre-processors such as downscalers. However, this SCALED gradient was originally introduced as a reparameterization trick, which limits its interpretability. In this paper, we show that this surrogate gradient can be interpreted as a first-order local approximation of the video codec, providing insight into its effectiveness. We further demonstrate that it is effective not only for learning downscaling operations, but also for the more challenging task of full neural wrapping with pre- and post-processing networks. Finally, we show that the approach generalizes well across different video codecs, quality factors, and tasks, including multiple downscaling ratios, yielding BD-Rate (PSNR) reductions of up to -23.59% on x264 and -20.07% on VVenC relative to standard resampling baselines.


[56] A UAV-Based Multi-Modal Vision System for Automated Sideslope Deformation Monitoring and Hazard Detection cs.CV | cs.LG

Jingfeng Zhang, Yi Li, Xianchong Liang, Huan Yang

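TL;DR: This study develops a highly automated workflow for slope hazard detection using UAV-borne LiDAR. The workflow consists of a shared data-acquisition and ground-surface extraction stage, a single-observation hazard-screening branch based on RandLA-Net, and a multi-epoch deformation-monitoring branch based on grid-wise elevation differencing. Validated through multiple flights over real expressway slopes, the system extracts usable ground-surface point clouds under vegetation cover, identifies potential hazard zones from single-observation point clouds, and quantifies centimeter-level elevation changes with multi-epoch grid differencing.

Details

Motivation: Slope hazards are a major safety threat to expressway infrastructure, and they typically evolve as slow surface deformation. Conventional manual inspection is inefficient and unsafe, especially on severely deteriorated slopes, so there is an urgent need for a high-precision automated solution capable of large-area slope observation and analysis.

Result: Experiments in real expressway slope environments show that the workflow can extract usable ground-surface point clouds under vegetation cover, identify potential hazard zones from single-observation point clouds, and quantify centimeter-level elevation changes through multi-epoch grid differencing. Controlled experiments, field tests, and simulation-based validation demonstrate the feasibility of the end-to-end workflow.

Insight: The contribution is an end-to-end automated workflow that integrates data acquisition, ground-surface extraction, single-observation hazard screening, and multi-epoch deformation monitoring. At its core, it applies a deep point-cloud segmentation network (RandLA-Net) to single-observation screening and pairs it with grid-wise elevation differencing for precise multi-epoch deformation monitoring, giving an implementable solution for automated slope-hazard monitoring and early warning.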

Abstract: Slope hazards constitute a major safety threat to expressway infrastructure, and their evolution is typically manifested as slow surface deformation. Conventional manual inspection suffers from low efficiency and inadequate operational safety, especially on severely deteriorated slopes. Accordingly, there is an urgent need for an automated, high-precision solution capable of large-area slope observation and analysis. This study aims to develop a highly automated workflow for slope hazard detection using Unmanned Aerial Vehicle (UAV)-borne Light Detection and Ranging (LiDAR). The proposed workflow consists of a shared data-acquisition and ground-surface extraction stage, a single-observation hazard-screening branch based on RandLA-Net, and a multi-epoch deformation-monitoring branch based on grid-wise elevation differencing. To validate the effectiveness of the proposed system, we conducted multiple UAV-borne LiDAR data-acquisition flights in real expressway slope environments. The results show that the workflow can extract usable ground-surface point clouds under vegetation cover, identify potential hazard zones from single-observation point clouds, and quantify centimeter-level elevation changes using multi-epoch grid differencing. This study establishes an end-to-end UAV-borne LiDAR-based workflow for slope inspection and demonstrates its feasibility through controlled experiments, field tests, and simulation-based validation, thereby providing an implementable solution for automated slope-hazard monitoring and intelligent early warning.
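
Grid-wise elevation differencing between two survey epochs can be sketched in a few lines. The version below assumes ground points are already extracted and co-registered, bins them into square cells, takes the mean elevation per cell, and differences only cells observed in both epochs. The cell size, the use of the mean, and all function names are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict


def grid_mean_elevation(points, cell_size):
    """Map each (col, row) grid cell to the mean z of the points falling in it."""
    acc = defaultdict(lambda: [0.0, 0])  # cell -> [sum of z, point count]
    for x, y, z in points:
        cell = (int(x // cell_size), int(y // cell_size))
        acc[cell][0] += z
        acc[cell][1] += 1
    return {cell: total / count for cell, (total, count) in acc.items()}


def elevation_change(epoch_a, epoch_b, cell_size=1.0):
    """Per-cell elevation difference (epoch_b minus epoch_a) on shared cells."""
    grid_a = grid_mean_elevation(epoch_a, cell_size)
    grid_b = grid_mean_elevation(epoch_b, cell_size)
    return {cell: grid_b[cell] - grid_a[cell] for cell in grid_a.keys() & grid_b.keys()}


# Toy usage with made-up points given as (x, y, z) in meters.
epoch_a = [(0.2, 0.3, 10.0), (0.8, 0.6, 10.2), (1.5, 0.5, 12.0)]
epoch_b = [(0.4, 0.4, 10.05), (1.2, 0.7, 11.9), (5.0, 5.0, 3.0)]
for cell, dz in sorted(elevation_change(epoch_a, epoch_b).items()):
    print(cell, round(dz, 3))
```

Cells seen in only one epoch are dropped rather than interpolated, which keeps the difference map conservative where vegetation or occlusion leaves gaps.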


[57] Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity cs.CV | cs.AI | cs.CL

Daniel Lee, Harsh Sharma, Eunkyu Park, Pranav Narayanan Venkit, Jeonghwan Kim

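TL;DR: This paper studies calibration and orientation failures of multimodal large language models (MLLMs) acting as judges under cultural ambiguity. The authors build the VOIR DIRE benchmark of 626 image-prompt pairs spanning U.S. and Chinese cultural contexts, and find that human annotators agree within each pool but diverge across cultures. Experiments on six MLLMs show that model bias decomposes into a positivity-floor calibration failure (compressed scale use) and an orientation failure (defaulting to one cultural norm).

Details

Motivation: MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but that metric becomes ill-defined when the annotator pool is culturally heterogeneous. The paper investigates the systematic biases of MLLM judges in cross-culturally ambiguous settings.

Result: Six MLLMs are tested on VOIR DIRE. Model bias appears as a calibration failure (ratings compressed at the positivity floor) and an orientation failure (a tilt toward the Chinese reading). Persona prompting partially recovers calibration, but the orientation residual persists. Model origin (for example, U.S. versus Chinese) adds a small additive tilt (about 0.10 MAE) that stays roughly unchanged under in-context demonstrations.

Insight: The contribution is identifying and separating two concrete failure modes of MLLM judges under cultural ambiguity: calibration failure and orientation failure. The study argues that evaluations should report alignment against each cultural reference pool separately and treat cross-cultural divergence as a property of the judge model, which is relevant to building fair cross-cultural AI evaluation systems.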

Abstract: MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but this metric is undefined when the human pool is culturally heterogeneous. We introduce VOIR DIRE, a multimodal benchmark of 626 culturally paired image–prompt artifacts spanning U.S. and mainland Chinese contexts across food, fashion, and architecture, with annotator pools that are within-pool reliable (a = 0.86/0.74) but cross-pool divergent on evaluation (Q1 r = -0.12). Across six MLLMs, the bias decomposes into two failures: a positivity-floor calibration failure (compressed scale use) and an orientation failure (default to one cultural norm). On this corpus, where contested items are sampled to split the two pools, the floor mechanically validates the more-permissive Chinese reading; persona prompting partially recovers calibration, but the orientation residual survives, evidence the tilt is not reducible to scale compression. Reference-pool in-context demonstrations deepen the orientation residual and inflate the high end rather than restoring use of the low end. Model origin adds a small additive tilt (~0.10 MAE) that is approximately invariant under demonstration. We recommend reporting alignment against each reference pool separately and treating cross-pool divergence as a judge property.


[58] Spatio-Temporal Wildfire Spread Prediction in Canada using a Video Swin-Hybrid-U-Net and Satellite Imagery cs.CV | cs.LG

Maulik Srivastava, Esha Saha, Hao Wang

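TL;DR: This paper proposes a U-Net architecture that combines a Video Swin Transformer encoder with a convolutional decoder to predict the spatio-temporal spread of wildfires in Canada. Using public satellite imagery and environmental data, the model takes three-day sequences of meteorological and environmental variables and produces next-day fire incidence maps.

Details

Motivation: Wildfires pose a growing threat to ecosystems and communities in Canada, and existing forecasting models often lack scalability or fail to capture temporal dynamics effectively. The study aims to build a deep learning framework tailored to Canadian wildfire spread that captures spatio-temporal patterns in environmental data.

Result: On a dataset of major Canadian wildfire events from 2014 to 2023, the model reports strong performance in forecasting next-day fire incidence maps by leveraging spatio-temporal attention. The abstract gives no quantitative metrics.

Insight: The main contribution is integrating a Video Swin Transformer as the encoder of a U-Net to model the spatio-temporal dependencies of wildfire spread. Because the framework relies entirely on publicly available data, it offers a scalable path toward operational research built on open datasets.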

Abstract: Background: Wildfires in Canada present increasing threats to ecosystems, communities, and infrastructure, demanding accurate forecasting tools to aid mitigation efforts. Existing models often lack scalability or fail to capture temporal dynamics effectively. Aims: This study aims to develop a deep learning framework tailored to Canadian wildfire spread prediction that captures spatio-temporal patterns in environmental data. Methods: We propose a U-Net architecture integrating a Video Swin Transformer encoder with a convolutional decoder to model three-day sequences of meteorological and environmental variables. Data are exclusively sourced from public repositories via Google Earth Engine, ensuring transparency and scalability. The model is trained and tested on a curated dataset of major Canadian wildfire events from 2014 to 2023. Key results: Our approach achieves strong predictive performance by effectively leveraging spatio-temporal attention to forecast next-day fire incidence maps. Conclusions: The model successfully captures complex wildfire dynamics unique to Canada’s landscape and temporal variability. Implications: This framework paves the way for advanced spatio-temporal wildfire forecasting research and operational applications using publicly accessible datasets.


[59] Beyond Templates: Revisiting Zero-Shot Remote Sensing through Meta-Prompting cs.CV | cs.AI | cs.LG

Eirini Baltzi, Dionysis Christopoulos, Sotiris Spanos, Valsamis Ntouskos, Konstantinos Karantzalos

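TL;DR: This paper studies vision-language models on zero-shot remote sensing tasks and finds that performance is highly sensitive to the design of the text descriptions. Semantically rich LLM-generated descriptions do not consistently improve results and can introduce noise. Based on a text log-likelihood analysis in CLIP feature space, the authors study lightweight query embedding calibration, which improves zero-shot classification and retrieval.

Details

Motivation: The paper examines how sensitive zero-shot remote sensing performance is to textual design choices, in particular the trade-off between LLM-generated class descriptions and simple template descriptions, with the goal of understanding the balance between semantic richness and robustness.

Result: Evaluated across 17 VLM variants and 12 remote sensing datasets, lightweight query embedding calibration consistently improves zero-shot classification and retrieval, giving a practical tool for better performance.

Insight: The contribution is using a text log-likelihood analysis to show that LLM descriptions introduce noise in the feature space, and identifying embedding calibration as a simple and effective remedy. The work highlights the importance of balancing semantic expressiveness against robustness in zero-shot remote sensing.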

Abstract: Vision-language models (VLMs) have sparked growing interest in zero-shot Earth Observation (EO) downstream tasks, with further gains enabled by remote-sensing-adapted models. We examine this setting across 17 VLM variants and 12 remote sensing (RS) datasets under Meta-Prompting for Visual Recognition (MPVR), and show that zero-shot performance remains highly sensitive to textual design choices, from the meta-prompts used to guide the LLM in generating class descriptions to the descriptions themselves. We explore why semantically rich LLM-generated class descriptions do not translate into consistent gains over simple domain-adapted CLIP-style descriptions. While LLM descriptions are more semantically expressive, they can also introduce noise in the text embedding space, reducing robustness in downstream tasks. We support this observation through a text log-likelihood analysis in the whitened CLIP feature space, comparing LLM-generated and template-based descriptions. Building on this finding, we study query embedding calibration and show that lightweight calibration of the query space consistently yields strong improvements in zero-shot classification and retrieval. Overall, our results provide practical insight into the trade-off between semantic richness and robustness, and identify embedding calibration as a simple and effective tool for improving zero-shot remote sensing performance.


[60] NeoJaundice-AI: Smartphone-Based Neonatal Jaundice Detection Using Dual-Input Deep Learning and Synthetic Augmentation cs.CV | cs.LG

Rahul Patel, Nirjala Jarpula

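TL;DR: This paper presents NeoJaundice-AI, a smartphone-based screening system for neonatal jaundice. From photographs of a baby's skin and sclera, it estimates jaundice severity and predicts serum bilirubin in under three seconds without internet connectivity. The system uses a dual-branch EfficientNet-B0 architecture, fuses deep features with handcrafted YCbCr color statistics, and adds a synthetic jaundice generation method and a skin-tone normalization module to address data scarcity and skin-tone variation.

Details

Motivation: Neonatal jaundice is a common condition in newborns worldwide, and early detection is critical. Standard diagnosis requires blood tests, which are often impractical in rural clinics with limited laboratory facilities, so a convenient, fast screening tool that works offline is needed.

Result: The system reaches an overall classification accuracy of 91.8%, a clinical sensitivity of 93.5%, and a bilirubin mean absolute error of 1.4 mg/dL. After INT8 quantization and ONNX conversion, the model shrinks to 8.3 MB while inference stays under three seconds on standard Android devices.

Insight: The contributions are: 1) a single framework that combines multimodal image fusion (skin and sclera), skin-tone adaptation, synthetic data augmentation, and fully offline mobile deployment; and 2) a synthetic jaundice generation method that simulates bilirubin-induced yellowing through controlled YCbCr channel modifications, addressing data scarcity, particularly for severe jaundice cases and darker Indian skin tones (Fitzpatrick Types IV to VI).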

Abstract: Neonatal jaundice (hyperbilirubinemia) is one of the most common conditions affecting newborns worldwide, with India alone recording roughly 15 million cases per year. Early detection is critical, yet standard diagnosis requires blood tests that are often impractical in rural clinics where laboratory facilities are limited. This paper presents NeoJaundice-AI, a smartphone-based screening system that uses photographs of a baby’s skin and sclera (eye white) to estimate jaundice severity and predict serum bilirubin levels in under three seconds without requiring internet connectivity. The proposed system is built on a dual-branch EfficientNet-B0 architecture that independently processes skin and sclera images. Deep features are fused with handcrafted YCbCr color statistics to jointly perform four-class severity classification and continuous bilirubin regression. A key contribution is a synthetic jaundice generation method that simulates bilirubin-induced yellowing through controlled YCbCr channel modifications on normal neonatal skin images. This approach addresses data scarcity, particularly for severe jaundice cases and darker Indian skin tones (Fitzpatrick Types IV to VI). In addition, a skin-tone normalization module improves prediction consistency across diverse neonatal complexions. Experimental results demonstrate an overall classification accuracy of 91.8 percent, a clinical sensitivity of 93.5 percent, and a bilirubin mean absolute error of 1.4 mg/dL. After INT8 quantization and ONNX conversion, the model size is reduced to 8.3 MB while maintaining inference times below three seconds on standard Android devices. To the best of our knowledge, this is the first India-focused neonatal jaundice AI system that combines multimodal image fusion, skin-tone adaptation, synthetic data augmentation, and fully offline mobile deployment within a single framework.
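
To make the idea of yellowing through YCbCr channel modification concrete, here is a minimal per-pixel sketch. It uses the standard full-range BT.601 (JFIF) conversion and lowers the blue-difference channel Cb, which reduces blue and shifts the pixel toward yellow. The shift value and the choice to modify only Cb are assumptions made for illustration; the abstract does not state which channels are changed or by how much, and the function `add_yellow_tint` is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) RGB to YCbCr for values in [0, 255]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion, clamped to the valid [0, 255] range."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return tuple(min(255.0, max(0.0, c)) for c in (r, g, b))


def add_yellow_tint(pixel, cb_shift=-20.0):
    """Shift Cb downward (assumed magnitude) while keeping luma and Cr fixed."""
    y, cb, cr = rgb_to_ycbcr(*pixel)
    return ycbcr_to_rgb(y, cb + cb_shift, cr)


# Toy usage on a single made-up skin-like pixel.
print(add_yellow_tint((200.0, 170.0, 150.0)))
```

Because luma is left untouched, brightness is preserved and only the chroma moves, which is one reason to operate in YCbCr rather than directly on RGB.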


[61] AEF-Econ: Toward Plug-and-Play Socioeconomic Foundation Embeddings from AlphaEarth for Urban Remote Sensing cs.CV

Shuyang Hou, Ziqi Liu, Haoyue Jiao, Lutong Xie, Yaxian Qing

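TL;DR: This paper presents AEF-Econ, a socioeconomic foundation embedding for urban remote sensing. To address the limitation that AlphaEarth Foundations (AEF) embeddings are pretrained on physical land-surface signals, the authors integrate multi-source heterogeneous data, build the CHN-Econ benchmark, and use five-axis ablations together with the proposed Capacity-Adaptive Reconstruction (CAR) method to substantially improve cross-region and cross-tier prediction on socioeconomic tasks. The resulting socioeconomic embeddings complement the AEF physical embeddings.

Details

Motivation: Pretraining of the AEF remote sensing foundation embeddings focuses on physical land-surface signals, which limits plug-and-play use in socioeconomic tasks. Addressing this requires foundation embeddings that fuse multi-source heterogeneous data and capture socioeconomic information.

Result: On the CHN-Econ benchmark (16 labels), a linear probe on AEF embeddings alone reaches R² of only 0.301 (cross-region) and 0.160 (cross-tier). The backbone refined through five-axis ablation raises these to 0.832 and 0.671. CAR raises them further to 0.848 and 0.693 and restores collapsed labels to a stable range.

Insight: The core contribution is Capacity-Adaptive Reconstruction (CAR), which uses a separate decoder and a stream-level loss for each data stream to reduce the suppression of low-dimensional semantic streams by high-dimensional streams in multi-stream fusion. This is an effective fusion strategy for multimodal, multi-task foundation embeddings, and it yields released socioeconomic embeddings (AEF-Econ) that capture cross-city hierarchies and intra-urban spatial organization without supervision.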

Abstract: AlphaEarth Foundations (AEF) unify global remote sensing foundation embeddings through multimodal self-supervised learning, but their pretraining focuses on physical land-surface signals, limiting plug-and-play use in socioeconomic tasks. We integrate seven heterogeneous data streams across 36 Chinese cities over eight years (AEF embeddings, population, nighttime lights, remote sensing indices, points of interest (POIs), urban morphology, and cross-lingual text) and construct CHN-Econ, a socioeconomic benchmark with 16 labels in three categories. We conduct 31 controlled experiments along five axes: fusion architecture, self-supervised objective, text integration, embedding dimensionality, and normalization. Used alone as a linear probe, AEF achieves $R^2$ values of only 0.301 for cross-region and 0.160 for cross-tier evaluation. The five-axis ablated backbone improves these scores to 0.832 and 0.671, respectively, but reveals that low-dimensional semantic streams are consistently suppressed by high-dimensional streams under shared reconstruction. To address this bottleneck, we propose Capacity-Adaptive Reconstruction (CAR), replacing shared reconstruction with per-stream decoders and stream-level losses to mitigate inter-stream capacity competition. CAR further raises cross-region and cross-tier $R^2$ to 0.848 and 0.693, and restores collapsed labels from negative $R^2$ to a stable range. Using CAR, we infer 14.4 million pixels across 36 cities and eight years and release AEF-Econ, including 128d and 64d compressed versions. Self-diagnostics and case studies show that AEF-Econ captures cross-city hierarchies and intra-urban spatial organization under unsupervised settings, providing a socioeconomic remote sensing foundation embedding complementary to AEF physical embeddings.
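
The capacity competition that motivates stream-level losses can be shown with simple arithmetic. The sketch below compares a shared reconstruction loss, computed as one mean squared error over all concatenated dimensions, with an average of per-stream mean squared errors. It covers only the loss aggregation; the per-stream decoders themselves are learned networks and are not modeled here. The stream names, the dimensions (64 and 2), and the equal stream weighting are illustrative assumptions.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def shared_reconstruction_loss(preds, targets):
    """One MSE over all streams concatenated: each dimension counts equally."""
    flat_pred = [v for name in targets for v in preds[name]]
    flat_target = [v for name in targets for v in targets[name]]
    return mse(flat_pred, flat_target)


def per_stream_loss(preds, targets):
    """Average of per-stream MSEs: each stream counts equally."""
    return sum(mse(preds[name], targets[name]) for name in targets) / len(targets)


# Toy case: a 64-d stream reconstructed perfectly, a 2-d stream reconstructed badly.
targets = {"aef": [0.0] * 64, "poi": [0.0] * 2}
preds = {"aef": [0.0] * 64, "poi": [1.0] * 2}
print(shared_reconstruction_loss(preds, targets))  # 2/66: the poi error is diluted
print(per_stream_loss(preds, targets))             # 0.5: the poi error stays visible
```

Under the shared loss the low-dimensional stream contributes little gradient even when it is reconstructed poorly, which is consistent with the suppression the abstract describes.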


[62] MotionPyramid: Hierarchical Motion Representation and Residual Interfaces cs.CV | cs.AI | cs.RO

Gao Zhu, Zaishuo Xia, Yubei Chen

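TL;DR: This paper presents MotionPyramid, a hierarchical action representation for humanoid control. Learned from motion data, it builds a recursive stack of latent decoders in which low-level latents decode to immediate full-body motor commands and higher-level latents unfold through the lower levels into temporally extended motion programs. After pretraining, the hierarchy is frozen and reused by downstream reinforcement learning policies as a family of action interfaces at different control resolutions. The paper also introduces Residual Interfaces, which let a policy maintain coarse, segment-level, and frame-level residual commands through the frozen hierarchy, so that coarse motion programs and fine residual corrections coexist.

Details

Motivation: The paper asks whether motion can be given a hierarchical representation like the one seen in perception, which runs from local primitives such as edges to higher-level structures such as parts and objects. In humanoid control, low-level actions specify immediate motor commands, while meaningful behavior is organized over longer time scales, including contacts, gait fragments, balance recovery, reaching, and whole-body skills.

Result: Experiments show that the learned levels form a motion hierarchy: coarser interfaces improve early learning and motion regularity by constraining exploration to structured segments, while finer interfaces preserve feedback control and final task precision. Representation probes show that the hierarchy supports traversal, interpolation, transition, and qualitative composition, exposing editable control handles across time scales.

Insight: The core contribution is bringing hierarchical representation from perception into motion control through the reusable multi-level representation MotionPyramid. Residual Interfaces, analogous to residual or skip connections in deep networks, unify coarse planning with fine correction and provide structured abstraction without sacrificing controllability.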

Abstract: We ask whether the representational hierarchy seen in perception, from local primitives such as edges to higher level structures such as parts and objects, can be established for motion. In humanoid control, low level actions specify immediate motor commands, while meaningful behavior is organized over longer temporal scales, including contacts, gait fragments, balance recovery, reaching, and whole body skills. We introduce MotionPyramid, a hierarchical action representation that learns such structure from motion data. Starting from a motion tracking teacher, it trains a recursive stack of latent decoders: low level latents decode to immediate full body motor commands, while higher level latents unfold through lower levels into temporally extended motion programs. After pretraining, the hierarchy is frozen and reused by downstream reinforcement learning policies as a family of action interfaces at different control resolutions. Experiments show the learned levels form a motion hierarchy: coarser interfaces improve early learning and motion regularity by constraining exploration to structured segments, while finer interfaces preserve feedback control and final task precision. Representation probes show the hierarchy supports traversal, interpolation, transition, and qualitative composition, exposing editable control handles across temporal scales. Finally, we introduce Residual Interfaces, letting a downstream policy maintain coarse, segment level, and frame level residual commands through the frozen hierarchy. Analogous to residual or skip connections in deep networks, this allows coarse motion programs and fine residual corrections to coexist within one controller. MotionPyramid shows that motion, like perception, can be organized into a reusable multi level representation, providing structured abstraction without sacrificing controllability.


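[63] Video2Code: Generating Interactive Webpages from UI Videos via Action-Aware Revisit cs.CV | cs.AI

Mingde Xu, Zhen Yang, Yan Wang, Yu Wang, Xijun Liu

TL;DR: This paper proposes Video2Code, a method for generating interactive webpage code from UI videos. Using an action-aware revisit mechanism, it first locates the action-critical regions of a video and then analyzes those regions at higher temporal resolution, recovering executable UI state transitions.

Details

Motivation: Existing video understanding models typically rely on sparse sampling or compressed temporal representations, which can miss short action boundaries and break the state-action-state transitions needed to implement webpage behavior. The paper calls this failure mode state-transition misalignment.

Result: Evaluated under both visual and functional criteria, Video2Code substantially strengthens the underlying open-source model on UI video-to-code generation and improves functional correctness over direct video observation, especially on dense multi-step interactions.

Insight: The novelty lies in formulating UI video-to-code generation as the recovery of executable state transitions from interaction videos, and in the action-aware revisit mechanism: coarse understanding locates the critical regions, which are then re-analyzed at high temporal resolution to address state-transition misalignment.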

Abstract: UI videos provide a natural input for generating interactive webpages, as they capture both webpage appearance and action-triggered state transitions. However, directly applying video-capable vision-language models to this task remains insufficient. Existing models typically rely on sparse sampling or compressed temporal representations, which may miss short action boundaries and break the state-action-state transitions needed to implement webpage behavior. We formulate UI video-to-code generation as executable state-transition recovery from interaction videos, and identify this failure mode as state-transition misalignment. We introduce Video2Code, an action-aware video-to-code approach for recovering executable UI state transitions. Rather than allocating the visual budget uniformly across the video, Video2Code first performs coarse video understanding to locate action-critical regions, then invokes a temporal clipping tool to revisit these regions at higher temporal resolution before generating HTML/CSS/JavaScript code. We instantiate Video2Code with action-aligned video-code supervision and evaluate it under both visual and functional criteria. Experiments show that Video2Code substantially strengthens the underlying open-source model for UI video-to-code generation, improving functional correctness over direct video observation, especially on dense multi-step interactions.
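The coarse-then-revisit idea in the abstract can be illustrated with a frame-sampling schedule. The following is a minimal sketch, not the authors' implementation: the function names, the sampling rates, and the assumption that action-critical windows are already available as `(start, end)` pairs in seconds are all hypothetical.

```python
def sample_times(start, end, fps):
    """Evenly spaced timestamps in [start, end) at the given rate (frames per second)."""
    n = int(round((end - start) * fps))
    return [start + i / fps for i in range(n)]


def revisit_schedule(duration_s, action_windows, coarse_fps=0.5, fine_fps=4.0):
    """Build a frame schedule: sparse over the whole video, dense inside action windows.

    action_windows is a list of (start_s, end_s) pairs. How these windows are found
    (the coarse video-understanding pass) is outside this sketch.
    """
    coarse = sample_times(0.0, duration_s, coarse_fps)
    fine = []
    for start_s, end_s in action_windows:
        # Clip each window to the video bounds before sampling densely.
        fine.extend(sample_times(max(0.0, start_s), min(duration_s, end_s), fine_fps))
    # Merge both passes, dropping duplicate timestamps.
    return sorted(set(coarse) | set(fine))


schedule = revisit_schedule(10.0, [(2.0, 3.0)])
print(schedule)
```

The point of the design is that the visual budget is spent unevenly: most of the video gets a sparse pass, and only the short spans around actions are resampled densely.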


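[64] TeleStyle V2: Beyond Content-Preserving Style Transfer with Self-Distillation and Distribution-Matching-Distillation cs.CV

Shiwen Zhang, Yifan Xu, Haibin Huang, Chi Zhang, Xuelong Li

TL;DR: This paper presents TeleStyle V2, a model that goes beyond conventional content-preserving style transfer. A self-distillation data synthesis strategy lets it support multiple content-style reference combinations, including realistic-to-stylized and stylized-to-realistic, and Distribution Matching Distillation preserves the foundation model's general image editing capability.

Details

Motivation: TeleStyle V1 can only handle photorealistic content references paired with artistic style references. The work removes this limitation so that the model can also handle artistic content references and realistic style references, and it addresses confusion over the order of the content and style references.

Result: In quantitative evaluations, TeleStyleV2-QIE-2509-DMD performs at least on par with Qwen-Image-Edit-2509-DMD, and on style transfer it reaches performance comparable to the state-of-the-art commercial model gemini-3-pro-image-preview.

Insight: The novelty lies in the self-distillation data synthesis strategy, which constructs training triplets that extend the model's capabilities, and in Distribution Matching Distillation, which preserves the foundation model's general editing ability during fine-tuning and mitigates content-consistency degradation. Using a vision-language model to generate content and style prompts automatically also makes the model easier to use.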

Abstract: Given a content reference and a style reference, content-preserving style transfer requires the model to generate stylized outputs with content and style consistency. We introduced TeleStyle V1 to tackle this problem. However, TeleStyle V1 is trained with photorealistic content references and artistic style references, which makes it incapable of coping with artistic content references and realistic style references in most cases. In this paper, we designed a Self-Distillation data synthesis strategy to construct such triplets from TeleStyle V1. Trained with such self-distilled triplets, our TeleStyle V2 supports Content-Style references in the forms of Realistic-and-Realistic (RnR), Realistic-and-Stylized (RnS), Stylized-and-Realistic (SnR), and Stylized-and-Stylized (SnS). In addition, we found that Distribution Matching Distillation could preserve the general text-guided image editing capability of the foundation model and fix the content consistency degradation caused by the SFT process. Through quantitative evaluations, our TeleStyleV2-QIE-2509-DMD performs at least on par with Qwen-Image-Edit-2509-DMD, demonstrating strong general image editing skills beyond content-preserving style transfer. We observed the content/style reference order confusion problem in TeleStyle V1 and further introduced a prompt enhancer to solve it. TeleStyle V2 uses Qwen-Image-Edit’s VLM encoder, Qwen2.5-VL-7B, to generate the content prompt and style prompt for free. TeleStyle V2 could achieve style transfer performance comparable with the state-of-the-art commercial model, gemini-3-pro-image-preview.


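[65] GEOPHYS: The Geometry of Physical Plausibility cs.CV | cs.AI

Christian Internò, Alexander Pondaven, Habon Issa, Fabio Pizzati, Francesco Pinto

TL;DR: The paper proposes GEOPHYS, a method for quickly detecting physical implausibility in videos. It is based on five geometric properties of the per-frame embeddings produced by frozen image encoders, and it requires neither an external multimodal LLM nor modifications to training. Experiments show that GEOPHYS achieves state-of-the-art physics-violation detection on several benchmarks and efficiently improves the physical alignment of video generation models.

Details

Motivation: Existing machine learning approaches to detecting physically implausible events are slow and expensive: they either rely on external multimodal-LLM judges or require ad-hoc changes to the training procedure.

Result: GEOPHYS reaches 98.3% on LikePhys and 93.3% on IntPhys2, a state-of-the-art result, whereas V-JEPA 2, GPT-4o, Gemini, and twelve modern video diffusion models perform near chance. Used as a best-of-N verifier, it lifts MAGI-1 24B from 50.01% to 64.50% on PhysicsIQ at 1.5x lower wall-clock time and 4.65x lower memory than the V-JEPA 2 world-model verifier.

Insight: The core innovation is to discover and exploit geometric properties that emerge in the temporal features of frozen image encoders as a signal of physical plausibility. This gives a lightweight, efficient approach that needs no additional training and challenges the conventional reliance on complex models or specialized training.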

Abstract: While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encoders. In aggregate, we call them GEOPHYS. First, we show that these signals correlate with human EEG responses to two forms of object-permanence violations. Second, GEOPHYS robustly discriminates physically implausible videos from realistic ones, achieving state-of-the-art physics-violation detection: 98.3% on LikePhys and 93.3% on IntPhys2, whereas V-JEPA 2, GPT-4o, Gemini, and twelve modern video diffusion models perform near chance. Third, used as a best-of-N verifier for physical alignment during video generation, GEOPHYS lifts MAGI-1 24B from 50.01% to 64.50% on PhysicsIQ at 1.5x lower wall-clock and 4.65x lower memory than the V-JEPA 2 world-model verifier. Ultimately, GEOPHYS demonstrates that physical plausibility in videos can be assessed by leveraging the emergent geometric properties of temporal features extracted from image encoders.


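[66] CDER-SME: A Cross-Device Event-RGB Micro-Expression Dataset under Multi-Level Stress Induction cs.CV

Jingting Li, Hui Sha, Su-Jing Wang

TL;DR: This paper introduces CDER-SME, a cross-device event-camera and conventional RGB micro-expression dataset collected under multi-level stress induction, aimed at the limited ecological validity of existing benchmarks in realistic scenarios. The dataset contains 1,963 expert-annotated samples from 92 subjects and comes with a hardware-agnostic spatiotemporal alignment pipeline.

Details

Motivation: Most existing micro-expression recognition benchmarks are confined to laboratory-controlled settings and hardware-coupled sensing, which makes it hard to meet the temporal sensitivity and ecological validity that realistic scenarios demand.

Result: The paper reports a reproducible multimodal baseline in which cross-modal fusion outperforms the single-modality counterparts, supporting the complementarity of event dynamics and RGB cues.

Insight: The novelty lies in a cross-device alignment pipeline that needs no coaxial calibration, and in a hardware-decoupled Event-RGB dataset collected under real stress induction, which together provide a practical benchmark for deployable micro-expression recognition.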

Abstract: Micro-expression recognition (MER) in realistic scenarios demands high temporal sensitivity and ecological validity, yet existing benchmarks are largely constrained to laboratory-controlled settings and rigid hardware-coupled sensing. We introduce CDER-SME, a cross-device Event-RGB dataset collected under a multi-level stress induction framework (cognitive and social) to elicit spontaneous emotional leakage. To enable reproducible acquisition with independent, decoupled sensors, we provide a hardware-agnostic alignment pipeline for temporal synchronization and landmark-guided spatial registration. CDER-SME adopts a three-tier structure with 92 subjects and 1,963 expert-annotated samples (Action Units and emotions), including 790 Event-RGB pairs and 210 high-fidelity aligned pairs. We further report a reproducible multimodal baseline, where cross-modal fusion improves performance over single-modality counterparts, supporting the complementarity of event dynamics and RGB cues. By removing the need for coaxial calibration, CDER-SME offers a practical benchmark for cross-device alignment and deployable Event-RGB MER in real-world affective intelligence.


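[67] MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents cs.CV | cs.AI | cs.CR

Xuelong Dai, Jianyu Ma, Boyang Ma, Biwei Yan, Yijun Yang

TL;DR: This paper proposes MIRAGE, a new visual indirect prompt injection framework targeting web agents built on multimodal large language models (MLLMs). Under the strict spatial constraints of a trusted web platform (for example, an ad slot), it uses diffusion models to generate perceptually benign adversarial images that achieve targeted next-action hijacking.

Details

Motivation: Existing adversarial evaluations of MLLM web agents usually assume permissive threat models and visually conspicuous artifacts. The paper studies a more realistic, constrained vulnerability-detection setting in which the attacker is only an unprivileged third party, such as an advertiser, who controls a semantically legitimate and spatially constrained region.

Result: Comprehensive evaluations against two prominent MLLM web agent frameworks, SeeAct and OpenClaw, indicate that MIRAGE attacks are potent, realistic, and stealthy.

Insight: The novelty is a framework for visual prompt injection under strict spatial constraints such as an ad slot, together with a robust optimization technique that combines curvature-aware adversarial diffusion guidance with sparse dark-pixel residual perturbations to maximize attack efficacy in the constrained setting. This provides a more realistic threat model for evaluating web agent security.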

Abstract: Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive threat models and visually conspicuous artifacts. In this paper, we investigate a constrained vulnerability detection setting: a trusted web platform where the evaluator acts solely as an unprivileged third party, such as a merchant or advertiser, controlling only a semantically legitimate, spatially constrained region, such as an ad slot, a sponsored card, or a localized widget. Operating under these realistic constraints, we propose MIRAGE, a novel visual indirect prompt injection framework for targeted next-action hijacking. Our approach leverages diffusion models to generate perceptually benign adversarial images strictly confined to the attacker-controlled boundaries permitted by the trusted service provider. To maximize attack efficacy within such a restrictive setting, we introduce a robust optimization technique combining curvature-aware adversarial diffusion guidance with sparse, dark-pixel residual perturbations. Comprehensive evaluations against prominent MLLM web agent frameworks, specifically SeeAct and OpenClaw, empirically demonstrate the potency, realism, and stealth of our proposed MIRAGE.


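[68] Evaluation of Medical Vision Language Models HuluMed and MedGemma, and general purpose chatbots Gemma 3, ChatGPT Plus, and Claude Pro on real previously unseen wound images cs.CV

Yunzhe Xue, Mohammed Saim Ahmed Quadri, Neal Panse, Justin W. Ady, Usman Roshan

TL;DR: This study evaluates general-purpose and medically specialized vision-language models (VLMs) on clinical assessment of real, previously unseen wound images. Using an expanded dataset of 20 wounds with diverse etiologies and a framework of 12 structured clinical questions, it assesses six VLMs on wound classification, infection risk, vascular intervention recommendations, debridement urgency, wound therapy selection, and advanced management planning. The general-purpose models ChatGPT and Claude perform best; HuluMed is the strongest of the open-source and medically specialized models but still trails the general-purpose systems.

Details

Motivation: Chronic wound assessment remains clinically challenging because it requires accurate interpretation of wound morphology, tissue composition, vascular characteristics, and infection risk. The study evaluates general-purpose and medically specialized VLMs on automated multimodal wound analysis to explore their potential for clinical decision support.

Result: Across 240 clinician-graded decisions over 20 wound cases, ChatGPT has the highest accuracy (72.50%), followed by Claude (62.08%). Among the open-source and medically specialized models, HuluMed performs best (40.00%), followed by Gemma 3 (33.75%), MedGemma 4B (25.83%), and MedGemma 27B (17.50%). The results suggest that frontier general-purpose multimodal systems currently show stronger wound-analysis performance than medically specialized alternatives.

Insight: The contribution claimed in the abstract is a systematic comparison of general-purpose and medically specialized VLMs on an expanded, diverse set of real wounds using a structured clinical question framework. Viewed objectively, the key takeaway is that broad multimodal reasoning ability may matter more for wound analysis than domain-specific medical knowledge alone, which challenges the assumption that medically specialized models are necessarily better and informs the direction of future medical AI development.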

Abstract: Chronic wound assessment remains a clinically challenging task that requires accurate interpretation of wound morphology, tissue composition, vascular characteristics, and infection risk. Recent advances in Vision-Language Models (VLMs) have introduced the possibility of automated multimodal wound analysis through image understanding combined with clinical reasoning. This study evaluates the performance of several general-purpose and medically specialized open-source and proprietary VLMs for clinical wound assessment using an expanded, curated dataset of 20 clinically diverse wounds spanning vascular, surgical, ischemic, venous, lymphedema, and amputation-related etiologies. Six VLMs were evaluated using a structured twelve-question clinical framework covering wound classification, infection risk, vascular intervention recommendations, debridement urgency, wound therapy selection, and advanced management planning. Across 20 wound cases and 240 clinician-graded wound-analysis decisions, ChatGPT achieved the highest overall performance with 174/240 correct responses (72.50%), followed by Claude with 149/240 (62.08%). Among the open-source and medically specialized models, HuluMed achieved the strongest performance with 96/240 correct responses (40.00%), followed by Gemma 3 (81/240, 33.75%), MedGemma 4B (62/240, 25.83%), and MedGemma 27B (42/240, 17.50%). The findings suggest that frontier general-purpose multimodal systems currently demonstrate substantially stronger wound-analysis performance than medically specialized alternatives, highlighting the continued importance of broad multimodal reasoning capabilities alongside domain-specific medical knowledge. Although current VLMs demonstrate promising potential for clinical decision support, substantial limitations remain in advanced wound-management reasoning, procedural planning, and autonomous clinical reliability.
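The reported percentages follow directly from the raw counts over 240 graded decisions (20 cases × 12 questions). The short check below recomputes them from the counts quoted in the abstract; the dictionary layout and function name are illustrative only.

```python
# Number of graded decisions: 20 wound cases x 12 structured questions.
TOTAL = 20 * 12

# Correct-response counts as quoted in the abstract.
correct = {
    "ChatGPT": 174,
    "Claude": 149,
    "HuluMed": 96,
    "Gemma 3": 81,
    "MedGemma 4B": 62,
    "MedGemma 27B": 42,
}


def accuracy_pct(n_correct, total=TOTAL):
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100.0 * n_correct / total, 2)


table = {model: accuracy_pct(n) for model, n in correct.items()}
for model, pct in table.items():
    print(f"{model:<13} {correct[model]:>3}/{TOTAL}  {pct:.2f}%")
```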


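[69] VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers cs.CV | cs.CL

Jinchao Ge, Lingqiao Liu, Shuwen Zhao, Lei Wang

TL;DR: This paper proposes VTOS, a framework for adaptive vision tool orchestration that jointly searches for solution programs and observer programs. It composes basic vision tools (such as Grounding DINO, SAM, and NMS) into executable programs while also searching for observer programs that diagnose failures and generate feedback, with a shared VisionThoughts knowledge base guiding the search.

Details

Motivation: Existing visual-programming agents usually generate a fixed solution pipeline, which is brittle under challenging visual conditions such as dense objects, occlusion, small targets, and domain shift. An approach that orchestrates vision tools adaptively is therefore needed.

Result: In two case studies, dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD, VTOS outperforms both static tool pipelines and agentic visual-programming baselines.

Insight: The novelty is to run the solution search and the observer-program search together, accumulating knowledge through dynamic diagnosis and feedback so that vision tools are composed and their parameters tuned adaptively, which improves robustness on challenging vision tasks.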

Abstract: Vision foundation tools such as open-vocabulary detectors, segmentation models, and post-processing operators are powerful building blocks for computer vision, but their effectiveness depends heavily on how they are orchestrated: which tools are used, in what order, with what parameters, and under what visual conditions. Existing visual-programming agents typically generate a fixed solution pipeline, making them brittle under dense objects, occlusion, small targets, and domain shift. We introduce VTOS (Vision Tools Orchestration Search), a framework for adaptive visual tool orchestration through joint solution–observer search. VTOS co-searches executable solution programs that compose vision tools such as Grounding DINO, SAM, NMS, and slice-and-detect, together with observer programs that diagnose candidate solutions, identify failure modes, and generate actionable feedback. These observations are accumulated in a shared VisionThoughts knowledge base to guide subsequent search. We evaluate VTOS through two case studies: dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD, which stress different orchestration challenges including threshold calibration, NMS, slicing, mask refinement, and domain generalization. Across both tasks, VTOS outperforms static tool pipelines and agentic visual-programming baselines, showing that co-searching solutions and observers is an effective strategy for adapting vision tools to challenging computer vision tasks.


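[70] XmoPipe: A Pipeline for Large-Scale In-the-Wild Human Motion Dataset Construction cs.CV | cs.AI

Nathan Salazar, Emmanuel Dellandréa, Mathieu Lefort, Alexandre Meyer

TL;DR: This paper presents XmoPipe, a scalable pipeline for building large-scale in-the-wild human motion datasets from unconstrained online videos. The system retrieves videos from keywords, extracts 3D body and facial motion, and generates high-level textual descriptions, enabling targeted collection of various motions, multi-person interactions, or expressive behaviors.

Details

Motivation: Marker-based motion capture data is costly and limited in scale and diversity. The work draws on recent advances in monocular motion capture and video-language understanding to extract plausible motion from online videos and support the training of robust motion models.

Result: Quality is validated by training motion reconstruction and motion generation models, which perform comparably to models trained on traditional motion capture datasets and show strong cross-dataset generalization.

Insight: The novelty is a flexible, scalable end-to-end pipeline that automatically builds large, diverse in-the-wild human motion datasets from the vast pool of online video, offering a new data source for motion analysis, synthesis, and understanding.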

Abstract: Large-scale human motion datasets are essential for training robust motion models for analysis, synthesis, and understanding. While marker-based motion capture provides precise data, it is costly and limited in scale and diversity. Recent advances in monocular motion capture and video-language understanding open the way to extract plausible motion from unconstrained online videos. We present a scalable pipeline for constructing in-the-wild human motion datasets. From a few keywords, the system retrieves videos, extracts 3D body and facial motion, and generates high-level textual descriptions. The pipeline is flexible, enabling targeted collection of various motions, multi-person interactions, or expressive behaviors. We demonstrate its quality by training motion reconstruction and motion generation models, showing performance comparable to models trained on traditional motion capture datasets and strong cross-dataset generalization.


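[71] REKEY: Metadata-Grounded Visual-Key Regeneration for Contamination-Resilient VQA Evaluation cs.CV

Tengjie Lin, Yutao Sun, Jingwei Ni, Shuhan Ge, Hao-Xuan Ma

TL;DR: This paper proposes ReKey, a dynamic visual question answering (VQA) benchmark protocol that addresses a weakness of static VQA benchmarks: once their items leak into training data, scores reflect memorization rather than genuine visual ability. At evaluation time, ReKey randomly regenerates the answer-bearing local detail (the visual key) in real images, creating new instances with new answers and controlled visual-search difficulty so that evaluation stays fresh.

Details

Motivation: Static VQA benchmarks such as V*Bench age quickly. Once their content leaks into model training data, scores may reflect memorization rather than genuine visual understanding, obscuring real progress. A continually renewable, contamination-resistant evaluation method is therefore needed.

Result: On V*Bench, the ReKey-regenerated benchmark shows that eight frontier vision-language models (VLMs) score 9.5 to 18.8 percentage points higher on the original items than on the regenerated variants, suggesting that the models may rely heavily on memorization.

Insight: The core innovation is a protocol that dynamically and randomly regenerates the key visual details (visual keys) in an image, grounded in metadata such as human-validated edit slots. This yields a renewable, contamination-resistant live benchmark that separates memorization from generalization and visual understanding.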

Abstract: Static visual question answering (VQA) benchmarks age quickly: Once the items leak into training corpora, scores can reflect memorization rather than genuine visual ability, thus obscuring real progress. Rebuilding high-quality benchmarks such as V*Bench requires substantial human annotation, yet each static release can quickly become another leaked artifact. We propose ReKey, a live benchmark protocol that randomly regenerates the answer-bearing local detail, or visual key, in real images at evaluation time. Using human-validated edit slots, ReKey samples fresh instances with new answers, construction-grounded labels, and controlled visual-search difficulty. On V*Bench, the ReKey regenerated benchmark reveals a sharp score jump across eight frontier vision-language models (VLMs): The original items score 9.5–18.8 percentage points higher than the regenerated variants. By making the visual key renewable, ReKey keeps evaluation fresh as models and training data evolve.


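[72] How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding cs.CV

Yixian Tian

TL;DR: This paper proposes a compact empirical model that quantifies how answer accuracy in long video understanding degrades with frame budget B and temporal distance D. Fitted on about 155,000 binary predictions, the model yields a law in which logit-accuracy is linear in log-budget, with an exponent that decays log-linearly with distance. Budget efficiency at long distance differs markedly across models: at D = 1000 s, STREAMINGVLM has a budget exponent α(1000) = 1.26, far above the 0.17 of the best Qwen3-VL base model.

Details

Motivation: Long video understanding models usually operate under strict frame budgets, yet no existing framework predicts how accuracy degrades as the budget shrinks or as events recede in time. A quantitative way to measure this memory-budget trade-off is needed.

Result: Fitted across ten models and three sampling strategies, the law achieves a cell-level weighted R² between 0.05 and 0.75. At D = 1000 s, budget effectiveness differs by about 7.4× between the best streaming and base models. STREAMINGVLM's α(1000) = 1.26 means a tenfold budget increase substantially improves long-distance accuracy (+29 percentage points), whereas the base model gains only +4 percentage points.

Insight: The novelty is an empirical law that quantifies the memory-budget trade-off of video models, with the budget exponent α(D) introduced as a diagnostic metric for streaming video models. The exponent can guide budget allocation and reveals that model rankings may reverse at long distance.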

Abstract: We introduce a compact empirical model that quantifies how answer accuracy degrades as a function of frame budget B and temporal distance D in long video understanding – analyzing performance when recalling content from D seconds in the past using a fraction B of total frames. Long-form models operate under strict budgets, yet no prior framework predicts how accuracy degrades as B shrinks and events recede. We fit a weighted least-squares model on ~155,000 binary predictions across ten models and three sampling strategies, deriving a law where logit-accuracy scales linearly in log-budget with a distance-dependent exponent that decays log-linearly with distance. This budget exponent α(D) captures the marginal value of extra frames at distance D. The law achieves cell-level weighted R² = 0.05-0.75 across models. Notably, budget effectiveness at D = 1000 s differs by ≈ 7.4× between the best streaming and base models. STREAMINGVLM achieves α(1000) = 1.26 (95% CI: [1.06, 1.58]), meaning a tenfold budget increase substantially improves long-distance accuracy, while the best Qwen3-VL base model reaches only α(1000) = 0.17 (CI: [0.04, 0.34]). In accuracy space, a 10× budget increase at D = 1000 s yields +29 percentage points for STREAMINGVLM versus +4 pp for the base model. Sampling strategies show model-dependent trade-offs: random sampling yields higher base sensitivity but steeper distance decay. We demonstrate how α(D) enables principled budget allocation, including a model-ranking reversal at long distance, and propose it as a diagnostic metric for streaming video models.
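The functional form described in the abstract (logit-accuracy linear in log-budget, with an exponent that decays log-linearly in distance) can be sketched as follows. This is one reading of the abstract under stated assumptions, not the paper's fitted model: the base-10 logarithms, the parameter names `a0` and `a1`, and the chance-level (0.5) starting accuracy are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def alpha(distance_s, a0, a1):
    """Budget exponent decaying log-linearly with distance (assumed base-10 log)."""
    return a0 - a1 * math.log10(distance_s)


def accuracy(budget, alpha_d, base_acc, base_budget):
    """Assumed law: logit(acc) = logit(base_acc) + alpha(D) * log10(budget / base_budget)."""
    return sigmoid(logit(base_acc) + alpha_d * math.log10(budget / base_budget))


def gain_from_10x(alpha_d, base_acc=0.5):
    """Accuracy gain from a tenfold budget increase, starting from base_acc."""
    return accuracy(10.0, alpha_d, base_acc, 1.0) - base_acc


# Exponents quoted in the abstract at D = 1000 s.
print(f"alpha = 1.26: +{100 * gain_from_10x(1.26):.1f} pp")
print(f"alpha = 0.17: +{100 * gain_from_10x(0.17):.1f} pp")
```

Under these assumptions the two quoted exponents give gains of roughly +28 pp and +4 pp, which is in the neighborhood of the +29 pp and +4 pp reported in the abstract. The exact figures depend on the fitted intercepts and starting accuracy, which the abstract does not give.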


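[73] One Image is All You Need: Agentic One-Shot Image Generation via Text-Based World Models for Long-Tail Spatial Perception cs.CV | cs.AI | cs.GR | cs.LG

Keqin Zeng, Shuting Su, Shihao Lin, Ziyue Li, Rui Zhao

TL;DR: This paper proposes WMGen-v1, an agentic framework based on a text-based world model for generating long-tail spatial perception data. A large vision-language model builds a structured scene representation from a single reference image, a large language model guides scene expansion under physical plausibility and commonsense constraints, and a diffusion model then generates diverse, physically grounded long-tail training data.

Details

Motivation: Real-world spatiotemporal data, as in autonomous driving and maritime surveillance, exhibits severe heterogeneity and extreme long-tail distributions. Existing generative approaches such as diffusion models and GANs lack explicit spatial grounding and structural constraints, so generated scenes show spatial and physical inconsistencies.

Result: Experiments on internal industrial datasets and on the ROADWork and LaRS benchmarks show that WMGen-v1 outperforms baseline approaches. Detectors trained solely on WMGen-v1 synthetic data approach real-only performance on aggregate dataset-level metrics.

Insight: The novelty is to combine an LVLM, an LLM, and a diffusion model so that structured semantic representations and reasoning-based scene expansion yield physically grounded, diverse long-tail data, offering a new way to ease data scarcity in downstream spatial perception tasks.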

Abstract: Reliable spatial decision automation, such as autonomous driving and maritime surveillance, critically depends on robust visual perception. However, real-world spatiotemporal data exhibits severe heterogeneity, often manifesting as extreme long-tail distributions for safety-critical scenarios. This data scarcity induces dataset shift that degrades detection performance and poses safety risks. While synthetic data generation offers a potential solution, existing generative approaches, such as diffusion models and Generative Adversarial Networks (GANs), often lack explicit spatial grounding and structural constraints, resulting in spatial and physical inconsistencies in generated scenes. To address these challenges, we introduce WMGen-v1, an agentic text-based world model framework for long-tail spatial data generation. WMGen-v1 employs a Large Vision-Language Model (LVLM) to construct a structured scene representation from a single reference image, while a Large Language Model (LLM) performs guidance-based scene expansion under physical plausibility and commonsense constraints. Subsequently, conditioned on the structured semantic representations produced by this reasoning process, a diffusion model generates diverse and physically grounded long-tail training data. Experiments on internal industrial datasets, ROADWork, and LaRS benchmarks demonstrate that WMGen-v1 outperforms baseline approaches. Notably, detectors trained solely on WMGen-v1 synthetic data approach real-only performance on aggregate dataset-level metrics, highlighting its potential to alleviate long-tail data scarcity for downstream spatial perception.


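[74] Robust Zero-Shot Generalization for Open-Vocabulary Action Recognition via Task Arithmetic cs.CV | cs.AI

Francesca Morandi, Omayma Moussadek, Federico Venturini, Mauro Suardi, Alessandro Banzatti

TL;DR: This paper proposes a new paradigm for open-vocabulary action recognition (OVAR) that requires no target-domain training. Using model merging and task arithmetic, it extracts and combines task vectors from models fine-tuned on several public OVAR datasets to strengthen zero-shot generalization in out-of-distribution settings.

Details

Motivation: Conventional OVAR methods usually need domain-specific fine-tuning to perform robustly in real-world scenarios, which is costly and can raise privacy and regulatory concerns. The paper explores an alternative that recombines existing knowledge without target-domain training.

Result: In out-of-distribution settings, the merged model achieves better zero-shot generalization than the pre-trained base model. The abstract does not name specific benchmarks or give quantitative results.

Insight: The novelty is to use task arithmetic to fuse model knowledge and avoid direct training on the new domain, giving an efficient, privacy-friendly way to improve generalization in open-vocabulary recognition.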

Abstract: Open Vocabulary Action Recognition (OVAR) enables the recognition of novel actions by leveraging vision-language representations, overcoming the limitations of traditional closed-set approaches. However, achieving robust performance in real-world scenarios typically requires domain-specific fine-tuning, which is often costly and raises privacy and regulatory concerns. In this work, we propose an alternative paradigm that bypasses target-domain training and recombines knowledge from existing datasets and models. Leveraging model merging and task arithmetic, we extract and combine task vectors from models fine-tuned on diverse public OVAR datasets. We show that, in out-of-distribution settings, the resulting merged model achieves superior zero-shot generalization to the pre-trained base model. Code is available at https://github.com/omaymaMoussadek/robust-ovar
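Task arithmetic in its basic form treats the difference between fine-tuned and base weights as a task vector and adds a scaled sum of such vectors back onto the base model. The sketch below shows that basic additive form on toy weights stored as plain Python lists. The abstract does not say which merging variant, scaling coefficient, or architecture the authors use, so the function names, the `scale` parameter, and the toy values are all illustrative assumptions.

```python
def task_vector(base, finetuned):
    """Task vector: per-parameter difference between fine-tuned and base weights."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])] for name in base}


def merge(base, task_vectors, scale=1.0):
    """Add the scaled sum of task vectors to the base weights (plain task arithmetic)."""
    merged = {name: list(values) for name, values in base.items()}
    for tv in task_vectors:
        for name in merged:
            merged[name] = [m + scale * d for m, d in zip(merged[name], tv[name])]
    return merged


# Toy "models": one weight tensor and one bias, flattened to lists.
base = {"w": [1.0, 2.0], "b": [0.0]}
finetuned_a = {"w": [1.5, 2.0], "b": [0.25]}  # hypothetical model tuned on dataset A
finetuned_b = {"w": [1.0, 1.0], "b": [0.5]}   # hypothetical model tuned on dataset B

vectors = [task_vector(base, finetuned_a), task_vector(base, finetuned_b)]
merged = merge(base, vectors, scale=0.5)
print(merged)
```

No gradient step on target-domain data appears anywhere in this procedure, which is the property the abstract emphasizes.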


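[75] GroundShot: Visually Consistent Multi-Shot Long Video Generation via Entity-Grounded Shot Scheduling cs.CV | cs.AI

Yixuan Lai, Tianjia Shao, Kun Zhou, Weijia Dou, Siyu Zhu

TL;DR: This paper proposes GroundShot, a training-free, model-agnostic agentic framework for visual consistency in multi-shot long video generation. It builds an entity-level visual memory online, schedules the order in which shots are generated, and generates and verifies shots against entity references so that entities such as characters, objects, and locations stay consistent across shots. The paper also introduces the GroundBench benchmark for evaluating entity-level visual consistency.

Details

Motivation: As the number of shots in a multi-shot video grows, entities that reappear across shots (characters, objects, locations) tend to become visually inconsistent and drift in appearance. The authors observe that viewers judge consistency by comparing each later appearance of an entity with its first clear appearance, which motivates an entity-centered framework for improving coherence.

Result: Experiments show that GroundShot improves multi-shot visual consistency over existing methods without additional training or model modification. Entity-level consistency is measured on the proposed GroundBench diagnostic benchmark.

Insight: The novelty is an online scheduling framework built on entity memory, which actively manages the generation, verification, and retrieval of entity references to keep entities consistent across shots. Viewed objectively, the method recasts video generation as an entity-level agentic decision process, giving a scalable, model-agnostic approach to long video generation.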

Abstract: Generating visually consistent multi-shot videos remains an open challenge. As videos span more shots, inconsistencies can accumulate across shots, causing entities that reappear across shots – characters, objects, and locations – to drift away from how they first appear. We observe that viewers judge consistency by comparing each later appearance of an entity with its first clear appearance; the visual quality of this initial appearance sets the consistency ceiling for all that follows. Motivated by this, we present GroundShot, a training-free, model-agnostic agentic framework for entity-grounded multi-shot generation. GroundShot builds an entity-level visual memory online from accepted generated shots: it schedules shots’ generation order by their expected usefulness as entity references, grounds entities from generated videos, verifies their reliability before adding them to memory, and retrieves suitable entity references from memory before each shot is generated. To evaluate this entity-centered view of consistency, we further introduce GroundBench, a diagnostic benchmark that measures consistency at the entity level while isolating controlled challenge dimensions. Experiments show that GroundShot improves multi-shot consistency over existing methods while requiring no additional training or model modification.


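[76] An approach with Visual and Tabular Mamba to multimodal medical data using Mixed Fusion cs.CV | cs.AI

Matheus B. Rocha, Gustavo B. Dettogni, Renato A. Krohling

TL;DR: This paper proposes a mixed multimodal fusion approach based on the Mamba architecture for combining medical image and tabular data in cancer classification. A visual Mamba processes the lesion image and produces class probabilities, and a tabular Mamba combines those probabilities with clinical or sociodemographic data to produce the final diagnosis. On two medical datasets, the approach outperforms Transformer-based methods on NDB-UFES and yields substantial gains in recall, while SHAP improves the interpretability of the results.

Details

Motivation: The work addresses the effective fusion of multimodal medical data, such as images and clinical tabular data, in cancer classification, with the aim of improving the sensitivity and interpretability of the diagnostic decision process.

Result: On the PAD-UFES-20 skin lesion dataset, balanced accuracy is slightly lower than that of Transformer-based approaches. On the NDB-UFES oral cancer dataset the approach performs better, and substantial gains are observed in recall.

Insight: The novelty is to adapt the Mamba state space model separately to the visual and tabular modalities and to design a mixed fusion architecture around them. The approach is particularly suited to medical scenarios where sensitivity matters, and the architecture allows SHAP to be applied for interpretability.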

Abstract: This article presents a complementary approach for integrating multimodal medical data in cancer classification, based on state space models represented by the Mamba architecture. To this end, a mixed multimodal fusion architecture, called Mixed Fusion, was employed and developed to enhance the interpretability of the decision-making process. The proposed approach explores two variants of Mamba: one dedicated to visual processing, responsible for classifying the lesion image and generating probabilities associated with the target classes, and another focused on tabular processing, which uses these probabilities together with clinical and/or sociodemographic data to produce the final diagnosis. The experiments were conducted on two medical datasets: PAD-UFES-20, composed of clinical images and information associated with skin lesions, and NDB-UFES, consisting of histopathological images and sociodemographic data related to oral cancer. The results indicate slightly lower performance in balanced accuracy, compared with Transformer-based approaches, on PAD-UFES-20, and superior performance on NDB-UFES. Additionally, substantial gains were observed in the recall metric. Furthermore, the adoption of the Mixed Fusion architecture enables the application of the Shapley Additive Explanations (SHAP) method, increasing the interpretability of the results. These findings indicate that Mamba-based models constitute a suitable alternative for multimodal classification in medical data, especially in scenarios in which sensitivity is a relevant requirement.


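[77] TriMotion: Modality-Agnostic Camera Control for Video Generation cs.CV | cs.AI | cs.RO

Seunghyun Shin, Jifei Song, Wooseok Jeon, Hae-Gon Jeon, Jiankang Deng

TL;DR: TriMotion is a modality-agnostic framework for camera-controlled video generation. It maps inputs from three modalities (video, pose, and text) into a shared motion embedding space, so that heterogeneous user inputs can control the camera trajectory of the generated video.

Details

Motivation: Existing methods usually condition generation on a single specific modality, such as explicit pose trajectories or reference videos, which limits their ability to support heterogeneous user inputs. The paper sets out to remove this limitation.

Result: Extensive experiments show that TriMotion generates high-quality videos that accurately follow the target camera trajectories specified in any of the three modalities.

Insight: The core innovation is a shared cross-modal motion embedding space, plus a consistency objective that uses this space to constrain the generated video to the target trajectory directly in latent space. This avoids the cost of pixel-space decoding and supports flexible applications such as sequential motion composition and cross-modal motion interpolation.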

Abstract: Camera motion control is essential for directing viewpoint changes in generative systems. However, existing methods typically condition the generation process on a single specific modality, such as explicit pose trajectories or reference videos, limiting their ability to support heterogeneous user inputs. To address this limitation, we present TriMotion, a modality-agnostic framework for camera-controlled video generation that maps video, pose, and text inputs describing the same camera trajectory into a shared motion embedding space. Learning such a space requires synchronized supervision across modalities. Therefore, we build the Motion Triplet Dataset by extending a Multi-Cam Video Dataset with geometry-grounded motion descriptions derived from camera extrinsics. We further introduce a latent motion consistency objective that leverages the motion embedding space to encourage the generated video to follow the target camera trajectory directly in latent space, avoiding the cost of pixel-space decoding. Extensive experiments show that TriMotion generates high-quality videos that accurately follow the target camera trajectories across all three modalities. Beyond standard generation, the shared motion embedding space also enables flexible applications such as sequential motion composition and cross-modal motion interpolation.


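[78] FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation cs.CV | cs.AI

Duc Minh Nguyen, Nghiem Tuong Diep, Binh Gia Nguyen, Trong-Bao Ho, Doanh Le

TL;DR: This paper proposes FOCA, a future-oriented conditioning framework for data-efficient adaptation of vision-language-action (VLA) models. It combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations, enabling long-horizon reasoning in latent space without pixel-level prediction. Experiments show that FOCA substantially raises success rates on several robot tasks when only a few demonstrations are available.

Details

Motivation: The performance of existing VLA models drops sharply in few-shot imitation learning, which exposes a key weakness of current adaptation strategies and motivates work on data-efficient adaptation.

Result: FOCA reaches a 95.7% success rate with 20 demonstrations on the LIBERO benchmark, improves by 7-12% on RoboCasa, and delivers absolute gains of up to 26% on real robots, establishing a new state of the art in few-shot VLA adaptation.

Insight: The novelty is to combine future-oriented conditioning with latent-space reasoning. The formulation supports action-free co-training with synthetic videos from video world models and can be interpreted as learning a future-conditioned, value-like representation.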

Abstract: Vision-Language-Action (VLA) models enable general-purpose robotic control via large-scale multimodal pretraining, yet their effectiveness under few-shot imitation learning remains limited. We conduct a systematic stress test of state-of-the-art VLA models and show that performance degrades sharply as demonstrations are reduced, revealing a key weakness of existing adaptation strategies. To address this, we introduce FOCA, a future-oriented conditioning framework for data-efficient VLA adaptation. FOCA combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations, enabling long-horizon reasoning in latent space without pixel-level prediction. This formulation naturally supports action-free co-training with synthetic videos from video world models and can be interpreted as learning a future-conditioned value-like representation. Extensive experiments demonstrate FOCA achieves 95.7% success with 20 demonstrations on LIBERO, improves 7-12% on RoboCasa, and delivers up to 26% absolute gains on real robots, establishing a new state of the art in few-shot VLA adaptation.


[79] Stochastic Signed Distance Processes cs.CV | cs.GR | cs.LGPDF

Hiroki Sakuma, Masatoshi Okutomi

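TL;DR: This paper proposes Stochastic Signed Distance Processes (SSDP), a new method for multi-view surface reconstruction. It reformulates volume rendering based on signed distance fields (SDFs) as probabilistic surface rendering, modeling each pixel color as a mixture distribution induced by the random first ray-surface intersection. Experiments show that the method outperforms baselines in both surface reconstruction and uncertainty quantification.

Details

Motivation: SDF-based multi-view surface reconstruction methods typically rely on volume rendering to obtain a differentiable relaxation of visibility, which reduces their reliance on silhouette supervision. This paper aims to treat the random first ray-surface intersection more precisely through probabilistic modeling.

Result: Experiments on the DTU and MobileBrick datasets show that the method outperforms baselines in both surface reconstruction and uncertainty quantification, supporting the effectiveness of its first-passage formulation.

Insight: The main novelty is modeling the SDF along each ray as a stochastic process, which yields a first-passage-time distribution and recovers the existing NeuS method as a special case. This provides a more rigorous probabilistic framework for understanding and improving SDF-based volume rendering.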

Abstract: Multi-view surface reconstruction is a core problem in computer vision. One prominent line of work represents the surface implicitly as a signed distance field (SDF), optimizing it based on the photometric loss between rendered and observed pixel colors. These approaches typically employ SDF-based volume rendering to obtain a differentiable relaxation of discontinuous visibility along rays, thereby reducing reliance on silhouette supervision. In this paper, we reformulate SDF-based volume rendering as probabilistic surface rendering, where each pixel color is modeled as a mixture distribution induced by the random first ray-surface intersection. To this end, we introduce Stochastic Signed Distance Processes (SSDP), which model the SDF along each ray as a stochastic process, inducing a first-passage-time distribution for each ray. We then derive the first-passage probability for each sampling interval based on Bayesian filtering, together with its practical approximation for parallel rendering. We further show that NeuS, an existing SDF-based volume rendering method, arises as a special case of our formulation. Experiments on the DTU and MobileBrick datasets demonstrate that our method outperforms baselines in both surface reconstruction and uncertainty quantification, supporting the effectiveness of our first-passage formulation. Our code is available at https://github.com/skmhrk1209/SSDP.
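As a rough illustration of the first-passage idea, the sketch below converts per-interval hit probabilities along one ray into the probability that the first ray-surface intersection falls in each interval, then renders the pixel color as the corresponding mixture. This is the generic discrete first-hit construction used by alpha-compositing renderers, not the Bayesian-filtering derivation of SSDP; the function names and sample values are hypothetical.

```python
def first_hit_weights(hit_probs):
    """Return (weights, miss), where weights[i] = p_i * prod_{j<i} (1 - p_j).

    hit_probs[i] is the assumed probability that the ray hits the surface
    inside sampling interval i, given that it has not hit anything earlier.
    """
    weights = []
    survive = 1.0  # probability that the ray has not hit a surface yet
    for p in hit_probs:
        weights.append(survive * p)
        survive *= 1.0 - p
    return weights, survive  # survive is now the probability of no hit at all


def render_color(hit_probs, colors, background=(0.0, 0.0, 0.0)):
    """Mix per-interval RGB colors by their first-hit probabilities."""
    weights, miss = first_hit_weights(hit_probs)
    pixel = [miss * channel for channel in background]
    for weight, color in zip(weights, colors):
        for k in range(3):
            pixel[k] += weight * color[k]
    return pixel
```

The weights and the miss probability always sum to one, so the rendered color is a proper mixture over "first hit in interval i" and "no hit".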


[80] Translating Inference-Time Control to Radiology Vision-Language Models: Activation Steering for Pneumonia Classification on Chest X-rays cs.CV | cs.AIPDF

Eduardo Moreno Judice de Mattos Farina, Mateus A. Esmeraldo, Felipe Akio Matsuoka, Paulo Eduardo de Aguiar Kuriki, Felipe Campos Kitamura

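TL;DR: This paper studies whether inference-time activation steering (Contrastive Activation Addition, CAA) can improve medical vision-language models (VLMs) on chest X-ray pneumonia classification without fine-tuning. Three frozen chest X-ray VLMs were evaluated. Activation steering substantially altered prediction score distributions and operating characteristics, but meaningful performance gains were observed in only one of the models.

Details

Motivation: The goal is to explore whether inference-time engineering, which does not update model weights, can improve the diagnostic performance of medical VLMs, and specifically whether CAA improves pneumonia classification on chest X-rays.

Result: On the public Kermany pneumonia test set, the calibrated F1 score of NV-Reason-CXR-3B improved from 0.7692 in the zero-shot setting to 0.8619 with pneumonia-text steering and 0.8727 with image-conditioned steering. CheXOne-3B also improved slightly, but the gains were not consistently significant across the evaluated models.

Insight: The novelty is applying inference-time activation steering, a lightweight way to adapt model behavior, to medical VLMs. It shows that the diagnostic performance of specific models can potentially be improved without fine-tuning, which suggests a route to rapid adaptation of medical AI models.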

Abstract: Inference-time engineering can alter model behavior without fine-tuning. However, its utility for improving diagnostic performance in medical vision-language models (VLMs) remains unclear. We aim to evaluate whether Contrastive Activation Addition (CAA) can improve pneumonia classification in chest radiograph VLMs without updating model weights. Three frozen chest radiograph VLMs (MedGemma-4B-IT, NV-Reason-CXR-3B, and CheXOne-3B) were evaluated on the public Kermany pneumonia test set. Classification was based on the logits of the tokens Yes and No under a binary prompt. Steering vectors included a 30-pair answer-bias control, a 30-pair pneumonia text contrast, and an image-conditioned contrast derived from 30 pneumonia and 30 normal development images. A deterministic 200-image development set was used for layer and scale selection (100 images) and threshold calibration (100 images). Performance was assessed using ROC-AUC, PR-AUC, F1 score, threshold analyses, reverse-vector controls, random-vector controls, and conditional bootstrap confidence intervals. Fixed-threshold F1 improvements were frequently observed but did not consistently indicate improved diagnostic performance. For MedGemma-4B-IT. NV-Reason-CXR-3B showed the strongest benefit: calibrated F1 improved from 0.7692 in the zero-shot setting to 0.8619 with pneumonia-text steering and to 0.8727 with image-conditioned steering. For CheXOne-3B, pneumonia-text steering increased calibrated F1 from 0.8528 to 0.8666, although the confidence interval crossed zero. On this public pneumonia benchmark, CAA substantially altered prediction score distributions and operating characteristics without fine-tuning. Meaningful performance gains were observed in one of three evaluated VLMs, suggesting that activation steering may serve as a lightweight approach for adapting medical VLM behavior.
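The steering step can be sketched in a few lines, assuming the common description of Contrastive Activation Addition: the steering vector is the difference between mean activations of contrastive examples at one layer, and it is added to the hidden state with a scale factor. The layer choice, scale, and toy activations below are hypothetical, and a real setup would apply this inside the model's forward pass.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations: positive examples minus negative ones."""
    mean_pos = mean_vector(positive_acts)
    mean_neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def steer(hidden, vector, scale):
    """Add the scaled steering vector to one hidden state.

    A negative scale gives the reverse-vector control mentioned in the abstract.
    """
    return [h + scale * v for h, v in zip(hidden, vector)]
```

With 30 contrast pairs, `positive_acts` and `negative_acts` would each hold 30 vectors. The scale and layer would be chosen on a development set, as the abstract describes.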


[81] Fine-grained Human Motion Understanding with Language Models cs.CVPDF

Thomas Markhorst, Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang

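TL;DR: This paper proposes a model based on a large language model (LLM) for fine-grained human motion understanding, representing motion as a sequence of skeletal poses with explicit timestamps. Trained on a diverse mixture of pose captioning, pose question answering, motion captioning, and motion question answering data, the model can reason about motion order, duration, and rhythm. It achieves state-of-the-art performance on multiple benchmarks, demonstrating the effectiveness of explicit temporal encoding and diverse supervision.

Details

Motivation: The paper addresses fine-grained human motion understanding: enabling a model to reason about details such as the order, duration, and rhythm of motion, and investigating what kinds of supervision this requires.

Result: The method achieves state-of-the-art performance on BABEL-QA, HuMMan-QA, CompMo, NTU-RGB+D, and QEVD-Coach. Notably, even with only 2D skeletal input it surpasses previous 3D-based methods.

Insight: The main novelties are the explicit timestamp encoding and a training mixture that combines diverse pose-level and motion-level supervision. The unified pose encoder handles both 2D and 3D skeletal representations and can optionally incorporate video context, which broadens the method's generality and practicality.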

Abstract: In this work, we propose an LLM-based model for fine-grained human motion understanding that represents motion as a sequence of skeletal poses with explicit timestamps for each pose. Each pose encodes body joint positions and is temporally grounded with timestamp tokens, allowing the model to reason about motion order, duration, and rhythm. To study what supervision is needed for motion-language reasoning, we construct a diverse training mixture spanning pose captioning, pose question answering, motion captioning, and motion question answering. Our ablations show that the primary gains come from the diversity of pose- and motion-level supervision, while staged training provides a smaller additional benefit. Different from previous works that rely on ground-truth 3D motion capture, our approach supports both 2D and 3D skeletal motion representations through a unified pose encoder, and can optionally incorporate video to provide contextual information. Extensive experiments on BABEL-QA, HuMMan-QA, CompMo, NTU-RGB+D, and QEVD-Coach demonstrate that our method achieves state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of explicit temporal encoding and diverse pose- and motion-level supervision for fine-grained human motion understanding. Notably, even when using only 2D skeletal input, our approach surpasses previous 3D-based methods.


[82] PROTON: Prototype-Based Test-Time Online OOD Detection for Medical VLMs cs.CV | cs.AI | cs.LGPDF

Abhijit Das, Nichula Wasalathilaka, Yifan Lu, Adinath Dukre, Dwarikanath Mahapatra

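TL;DR: This paper proposes PROTON for online test-time out-of-distribution (OOD) detection in medical vision-language models (VLMs). It maintains an online prototype bank and adaptively fuses prototype distance with Maximum Concept Matching (MCM) scoring, requiring no model modification, training data, or prompt engineering. On the ophthalmology benchmark FLAIR + FIVES it markedly improves detection under several types of distribution shift.

Details

Motivation: Medical VLMs perform well in zero-shot clinical image classification, but reliably detecting OOD inputs at deployment remains an open problem. Static scoring methods such as MCM perform poorly under covariate shift (only 42.4% AUROC), because covariate-shifted samples are hard to distinguish from in-distribution samples in softmax space even though they occupy distinct regions in the embedding space.

Result: On FLAIR + FIVES, PROTON improves MCM by 23.9 AUROC points on covariate shift, 8.8 on semantic shift, and 8.1 on far-OOD, making it the only zero-shot method that improves all three shift types without hierarchical prompts or labeled data.

Insight: The novelty is exploiting an untapped signal in the VLM embedding space: an online prototype bank captures the distribution of the test data, and prototype distance is adaptively fused with MCM scoring based on stream-level variance statistics. The result is a lightweight post-hoc OOD detector that needs no extra training or model modification and addresses the limitations of static scoring under multiple distribution shifts.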

Abstract: Medical vision-language models (VLMs) enable zero-shot clinical image classification, yet reliably detecting out-of-distribution (OOD) inputs at deployment remains an open problem. No static scoring method works across all shift types: Maximum Concept Matching (MCM) on FLAIR achieves 76.4% AUROC for far-OOD but only 42.4% for covariate shifts such as ultra-wide-field fundus images, effectively random. We trace this to a structural mismatch: covariate-shifted inputs are indistinguishable from in-distribution samples in softmax space, yet occupy distinct regions in the VLM embedding space. To exploit this untapped signal, we propose PROTON (PROtotype-based Test-time ONline OOD detection), a lightweight post-hoc module that maintains an online prototype bank from high-confidence test predictions and adaptively fuses prototype distance with MCM scoring via stream-level variance statistics, requiring no model modification, training data, or prompt engineering. On the ophthalmology benchmark FLAIR + FIVES, PROTON improves MCM by +23.9 AUROC on covariate shift, +8.8 on semantic shift, and +8.1 on far-OOD, making it the only zero-shot method to improve all three without hierarchical prompts or labeled data. Code is available at https://github.com/GenMI-Lab/PROTON, and the project page is available at https://genmi-lab.github.io/PROTON.
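To make the two signals concrete, the sketch below computes an MCM-style score (the maximum softmax probability over image-text cosine similarities) and a prototype similarity in embedding space, then blends them. The fixed blending weight is a placeholder assumption: the abstract says PROTON fuses the two adaptively from stream-level variance statistics, and that rule is not reproduced here. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mcm_score(image_emb, text_embs, temperature=1.0):
    """Maximum softmax probability over temperature-scaled cosine similarities."""
    sims = [cosine(image_emb, t) / temperature for t in text_embs]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]  # subtract max for numerical stability
    return max(exps) / sum(exps)


def prototype_score(image_emb, prototypes):
    """Similarity to the nearest prototype in embedding space."""
    return max(cosine(image_emb, p) for p in prototypes)


def fused_score(image_emb, text_embs, prototypes, weight=0.5):
    """Fixed-weight blend (an assumption, not PROTON's adaptive fusion)."""
    return ((1.0 - weight) * mcm_score(image_emb, text_embs)
            + weight * prototype_score(image_emb, prototypes))
```

Higher scores indicate more in-distribution inputs under both signals, so a single threshold on the fused score can flag OOD samples.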


[83] Go-with-the-Track: Video Compositing and Motion Control with Point Tracking cs.CV | cs.LGPDF

Koichi Namekata, Yash Kant, Zhizheng Liu, Ryan D Burgert, Yuancheng Xu

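TL;DR: This paper proposes Go-with-the-Track, which unifies video compositing and motion control by jointly conditioning on multiple reference images and reference-anchored point-tracks. It extends conventional point-tracks to establish explicit correspondences between generated video frames and reference images, enabling precise compositing and motion control throughout the video.

Details

Motivation: Existing methods treat precise motion control and reference image compositing as separate tasks. Point-track-conditioned image-to-video models can insert content only in the first frame, while reference-to-video models lack fine-grained spatial-temporal control over how reference content integrates across frames.

Result: Experiments show that Go-with-the-Track achieves superior motion and reference control in a single model, and supports multi-reference conditioned video generation with point-track-driven compositing as well as camera control for both static and dynamic scenes.

Insight: The novelty is the spatially-aware point-track embedding, which encodes the full sequence of point-track coordinates with a coordinate-wise MLP followed by temporal pooling, so that the spatial characteristics of each track serve as a unique identifier. The embeddings are injected into a video diffusion transformer through a lightweight adapter, which resolves the pixel-to-patch resolution mismatch and avoids losing motion detail. A hybrid training strategy improves motion controllability.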

Abstract: Filmmaking demands precise motion control and reference image compositing – capabilities that existing methods treat separately. Point-track-conditioned image-to-video models restrict content insertion to the first frame, while reference-to-video models lack fine-grained spatial-temporal control over how reference content integrates across frames. We present Go-with-the-Track, which unifies both capabilities by jointly conditioning on multiple reference images and reference-anchored point-tracks – extending conventional point-tracks to explicitly establish correspondences between generated frames and reference images, thus enabling precise compositing and motion control throughout the video. To achieve this, we introduce spatially-aware point-track embeddings that encode the full sequence of point-track coordinates using a coordinate-wise MLP followed by temporal pooling. This representation captures the spatial characteristics of each point-track (serving as a unique identifier), while the embedding similarity correlates directly with spatial proximity, enhancing the model’s ability to distinguish and associate point-tracks. We inject these point-track embeddings into a video diffusion transformer via a lightweight adapter, resolving the pixel-to-patch resolution mismatch while avoiding the substantial motion detail loss inherent in naive point-track subsampling. We use a hybrid training strategy to train jointly on dynamic, static, and synthetic scene video datasets to boost motion controllability. Experiments demonstrate that Go-with-the-Track achieves superior motion and reference control in a single model and enables new capabilities: multi-reference conditioned video generation with point-track driven compositing, as well as camera control for both static and dynamic scenes. Project Page: https://eyeline-labs.github.io/Go-with-the-Track/


[84] Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City cs.CV | cs.AI | cs.ROPDF

Adrian Cespedes, Marcelo Chincha, Dunant Cusipuma, Victor Flores-Benites, David Ortega

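TL;DR: This paper evaluates how well vision-language models (VLMs) generalize to new geographies by comparing human drivers from Lima and New York City with VLMs on visual question answering tasks for autonomous driving. Humans and VLMs diverge in their responses, but human answers do not depend on where the respondents are from, and geography only weakly modulates the answers, likely because the scenarios are highly out-of-distribution.

Details

Motivation: The paper studies how self-driving cars that use multimodal systems such as VLMs as a cognitive backbone generalize to new geographies, in particular out-of-distribution edge cases, which bears on their reliability as they expand internationally.

Result: Visual question answering tests were run on dashcam footage collected in Lima and New York City, covering four question categories: Factual, Ratings, Counterfactual, and Reasoning. Humans and VLMs diverge in their responses, human answers are independent of the respondents' origin, and geography does not strongly modulate the answers, likely because the scenarios are highly out-of-distribution.

Insight: The novelty is a full factorial analysis that compares humans and VLMs in challenging driving environments, revealing the limits of VLM generalization and the consistency of human responses. The approach offers a new benchmark for assessing the geographic adaptability of autonomous driving systems.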

Abstract: As self-driving cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their action models, how well will these systems generalize in new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis with human drivers from Lima, human drivers from New York City, and VLMs, showing them dashcam footage collected from Lima and New York City and prompting them with a variety of questions under a Visual Question Answering (VQA) paradigm. In particular, we pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span 4 categories: Factual, Ratings, Counterfactual and Reasoning. We find that humans and VLMs diverge in their responses, though this is modulated by the type of question asked, and that humans answer similarly regardless of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely due to the highly out-of-distribution nature of the scenarios. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2


[85] UniSLAD: A Unified Framework for Structural and Logical Industrial Visual Anomaly Detection cs.CV | cs.AI | eess.IVPDF

Changyi Li, Chao Yang, Yu Xiao, Kari Tammi

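TL;DR: This paper proposes UniSLAD, a unified framework for detecting both structural and logical anomalies in industrial vision. It combines a dual-feature extractor (CNN plus Transformer) with dual-granularity feature representation modules (patch-level memory banks based on the Mahalanobis Transform, and image-level distribution maps aggregated with LUM and PMP), and handles both anomaly types without additional training. On two industrial benchmarks, UniSLAD achieves 99.4% and 93.1%, respectively.

Details

Motivation: Existing industrial visual anomaly detection methods focus mainly on structural defects, while logical anomaly detection remains comparatively underexplored, even though the two kinds of anomaly often co-occur in real industrial workflows. A solution that detects both in a unified way is therefore needed to advance comprehensive anomaly detection.

Result: On two industrial benchmarks, UniSLAD achieves competitive performance in comprehensive anomaly detection, reaching 99.4% and 93.1%, respectively. Ablation studies verify the effectiveness and contribution of each component, such as the dual-feature extractor and the dual-granularity modules.

Insight: The novelty is a unified framework that needs no additional training. It extracts dual features by combining the local texture perception of a CNN with the global contextual reasoning of a Transformer, and builds a dual-granularity representation from memory banks enhanced by the Mahalanobis Transform and distribution maps aggregated with LUM and PMP. This lets it handle structural and logical anomalies together and offers a practical solution for dynamic industrial environments.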

Abstract: Visual anomaly detection is a fundamental task in industrial automation. While existing approaches have achieved notable progress in identifying structural defects, the detection of logical anomalies remains relatively underexplored. In practice, structural and logical anomalies frequently co-occur in industrial workflows. Therefore, a solution capable of detecting both structural and logical anomalies is crucial for advancing comprehensive anomaly detection research. To address this limitation, we propose a unified framework, termed UniSLAD, which jointly addresses logical and structural anomalies without additional training, enabling a practical solution for dynamic industrial environments. First, we introduce a dual-feature extractor that synergistically integrates a Convolutional Neural Network (CNN) backbone for local texture perception with a Transformer backbone for global contextual reasoning, yielding richer and more comprehensive representations. Building on this foundation, we design dual-granularity feature representation modules. At the patch level, memory banks enhanced by the Mahalanobis Transform (MT) preserve representative features and support more discriminative anomaly scoring. At the image level, distribution maps are aggregated using Lower-Upper Mean (LUM) and Power Mean Pooling (PMP), yielding a more robust global representation than conventional average pooling. Extensive experiments on the two industrial benchmarks demonstrate that UniSLAD achieves competitive performance in comprehensive anomaly detection, achieving 99.4% and 93.1%, respectively. Furthermore, ablation studies verify the individual contributions and effectiveness of each proposed component.
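To show why a power mean can be more informative than average pooling for anomaly maps, the sketch below implements the textbook generalized mean, (mean(x^p))^(1/p), over non-negative scores. Whether UniSLAD's Power Mean Pooling uses exactly this form, and how Lower-Upper Mean is defined, is not stated in the abstract, so treat the function and the exponent as assumptions.

```python
def power_mean_pool(values, p=3.0):
    """Generalized (power) mean of non-negative scores.

    p = 1 gives the ordinary average; larger p moves the result toward the
    maximum, so a few strongly anomalous locations are not averaged away.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v < 0 for v in values):
        raise ValueError("power mean here assumes non-negative scores")
    if p <= 0:
        raise ValueError("this sketch assumes p > 0")
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)
```

For a map with one high score among many low ones, raising `p` increases the pooled value, which is the behavior that makes it attractive as an image-level score.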


[86] GIM-ENDO: A Multimodal Endoscopic Image and Video Dataset for Gastric Intestinal Metaplasia Morphology and Pathology cs.CVPDF

Mojgan Forootan, Mahziar Setayeshfar, Ali Darvishi, Mohammad Tashakoripour, Hamidreza Bolhasani

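TL;DR: This paper introduces GIM-ENDO, a public multimodal endoscopic image and video dataset for studying the morphology and pathology of gastric intestinal metaplasia (GIM). It is intended to fill the gap left by the absence of public, histopathologically validated datasets that include detailed endoscopic annotations, histological subtypes, standardized grading systems, and normal mucosal patterns.

Details

Motivation: GIM is a precancerous lesion of gastric cancer, and its early detection is crucial. The development of reliable AI models has been held back by the lack of public, high-quality, multimodal, pathology-validated datasets.

Result: The dataset is publicly available. It contains demographic data, endoscopic findings, histopathological results, and H. pylori status from 24 patients (22 GIM-positive, 2 normal controls), with detailed annotations of image-enhanced endoscopy (IEE) features, GIM subtypes, and OLGA/OLGIM staging.

Insight: The main contribution is building and openly releasing a high-quality, multimodal, histopathologically validated endoscopic dataset. Its annotation scheme is comprehensive, covering key IEE features, GIM subtypes, and standardized staging, and it could substantially advance the development of AI models for real-time GIM detection and characterization.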

Abstract: Gastric intestinal metaplasia (GIM) is a precursor lesion to gastric dysplasia and adenocarcinoma whose early detection is crucial for intervening in the carcinogenesis cascade. Artificial intelligence (AI) holds considerable promise for real-time endoscopic detection and characterization of GIM. However, development of reliable AI models has been constrained by the absence of publicly available, histopathologically validated datasets that combine detailed endoscopic annotations, histological subtype (complete and incomplete), standardized grading systems, and normal mucosal patterns. GIM-ENDO was designed to fill this gap. The dataset comprises demographic data, endoscopic findings, histopathological results, and H. pylori status acquired using the Olympus EVIS X1 system with white-light endoscopy (WLE) and image-enhanced endoscopy (IEE), including narrow-band imaging (NBI) and magnifying NBI (M-NBI), along with images and video clips from 24 patients (22 GIM-positive, 2 normal controls). Annotations cover six primary IEE endoscopic signs – light blue crest (LBC), marginal turbid band (MTB), white opaque substance (WOS), TV pattern (Fusion), atrophy, and map-like erythema (MLE) – plus two additional endoscopic findings (AHP and GA) recorded where present. GIM subtypes (complete and incomplete) are annotated for all GIM-positive cases; OLGA and OLGIM staging are provided where complete histological sampling was available. The dataset is publicly accessible at https://doi.org/10.5281/zenodo.20707267. For the latest updates and further information regarding this dataset, readers are referred to the DataBioX website: https://databiox.com. A short version of this work has been submitted to the MICCAI 2026 Open Data Track.


[87] CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays cs.CV | cs.AIPDF

Geon Choi, Hangyul Yoon, Nalee Kim, Jeong Yun Jang, Hyunju Shin

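TL;DR: This paper presents CheXpercept, a multi-level benchmark for evaluating expert-level lesion perception of vision-language models (VLMs) in chest X-ray analysis. The benchmark mirrors a radiologist's cognitive workflow across three levels: coarse-level detection, fine-level contour evaluation and revision, and semantic-level attribute extraction. The authors build a dataset of 10,400 QA items with a semi-automated pipeline and evaluate 14 general and medical VLMs. Models perform adequately only on the coarse-level task, degrade sharply on deeper visual tasks, and medical VLMs show almost no perceptual advantage over general-domain models.

Details

Motivation: Current evaluation of VLMs for chest X-ray analysis is largely limited to disease-presence classification without visual grounding, so it cannot verify the expert-level lesion perception needed for clinical reliability.

Result: Of the 14 general and medical VLMs evaluated on CheXpercept, performance is adequate only at the coarse detection level, and accuracy drops sharply on deeper visual tasks such as fine-level contour evaluation and semantic attribute extraction. Medical VLMs show almost no perceptual advantage over general-domain models.

Insight: The novelty is a multi-level lesion perception benchmark that mirrors a radiologist's cognitive workflow, built at scale with high clinical fidelity through a semi-automated pipeline and expert review. The results point to a systemic flaw in current domain adaptation: medical VLMs have not effectively learned key visual perception skills.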

Abstract: The evaluation of vision-language models (VLMs) for chest X-ray (CXR) analysis has largely been limited to disease-presence classification without visual grounding. Such evaluations fail to verify the expert-level lesion perception necessary to ensure the clinical reliability of VLMs. To address these limitations, we introduce CheXpercept, a sequential, multi-level perception benchmark that mirrors a radiologist’s cognitive workflow across coarse-level detection, fine-level contour evaluation and revision, and semantic-level attribute extraction. To ensure high clinical fidelity at scale, we construct the dataset using a semi-automated generation pipeline paired with a review by six medical experts. CheXpercept contains 10,400 QA items derived from 2,100 CXRs, covering seven clinically critical pulmonary and cardiac lesions. To demonstrate the current landscape of VLM perception, we benchmark 14 general and medical VLMs on CheXpercept. The models achieve adequate performance only at the coarse level, with accuracy degrading precipitously on deeper visual tasks. Notably, medical VLMs show almost no perceptual advantage over their general-domain counterparts, highlighting a systemic flaw in current domain adaptation. The code and dataset will be publicly available.


[88] Self-Supervised Dual-Frequency Phase Decomposition for Single-Shot Composite Fringe Projection Profilometry cs.CVPDF

Jin-Hyuk Seok, Yatong An, Jae-Sang Hyun

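TL;DR: This paper proposes a self-supervised dual-frequency phase refinement framework for single-shot composite fringe projection profilometry. It exploits the scale and direction relationships between low- and high-frequency phase gradients to make phase separation more reliable without phase or depth labels, and introduces a soft edge consistency loss to preserve object boundaries and fine geometric structures.

Details

Motivation: Single-shot fringe projection profilometry is important for real-time measurement, dynamic object reconstruction, and motion-sensitive environments. Existing methods, whether based on the Fourier transform or on supervised deep learning, suffer from limited accuracy, degraded performance in complex regions, or the need for costly dense labels.

Result: The method achieves MAE_z and RMSE_z of 0.367 mm and 1.804 mm, respectively, outperforming the best transform-based baseline (0.402 mm and 2.785 mm), and raises the valid-pixel ratio from 84.75% to 95.07%, giving reliable single-shot 3D reconstruction without ground-truth label supervision.

Insight: The novelty is a self-supervised dual-frequency phase decomposition and refinement framework that separates phases using the relationship between dual-frequency phase gradients and preserves geometric detail with a soft edge consistency loss, offering a new approach to label-free single-shot 3D reconstruction.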

Abstract: Single-shot fringe projection profilometry (FPP) has been actively studied for real-time measurement, dynamic object reconstruction, and motion-sensitive environments. Composite fringe patterns are advantageous in single-shot FPP because multiple frequency components can be encoded in a single pattern, enabling phase ambiguity resolution. Existing approaches mainly rely on Fourier transform-based methods or supervised deep learning methods. However, Fourier transform-based methods often suffer from limited accuracy and degraded performance in complex regions, while supervised methods require dense phase or depth labels, which are costly to obtain. In this work, we propose a self-supervised phase refinement framework for single-shot composite fringe patterns without requiring phase or depth labels. The proposed method exploits the scale and direction relationships between low- and high-frequency phase gradients, improving the reliability of phase separation. We also introduce a soft edge consistency loss to preserve object boundaries and fine geometric structures. Experimental results show that the proposed method achieves MAE_z and RMSE_z of 0.367 mm and 1.804 mm, respectively, outperforming the best-performing transform-based baseline, which obtains 0.402 mm and 2.785 mm. The proposed method also improves the valid-pixel ratio from 84.75% to 95.07%. These results demonstrate the effectiveness of self-supervised dual-frequency phase refinement for reliable single-shot 3D reconstruction without ground-truth label supervision.
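The scale relationship between the two frequency components is what makes phase ambiguity resolution possible. The sketch below shows the classical dual-frequency unwrapping rule for one pixel, assuming the low-frequency phase is already unambiguous: the scaled low-frequency phase predicts the fringe order of the wrapped high-frequency phase. This is the textbook relation, not the self-supervised refinement network from the paper, and the function name is hypothetical.

```python
import math


def unwrap_high_frequency(phi_high_wrapped, phi_low_abs, freq_ratio):
    """Recover the absolute high-frequency phase at one pixel.

    phi_high_wrapped: wrapped high-frequency phase in radians
    phi_low_abs:      unambiguous low-frequency phase in radians (assumed known)
    freq_ratio:       high frequency divided by low frequency
    """
    two_pi = 2.0 * math.pi
    # The scaled low-frequency phase approximates the absolute high-frequency
    # phase, so their difference tells us how many 2*pi jumps were lost.
    fringe_order = round((freq_ratio * phi_low_abs - phi_high_wrapped) / two_pi)
    return phi_high_wrapped + two_pi * fringe_order
```

The rule tolerates noise in the low-frequency phase as long as the scaled error stays below pi, which is why accurate separation of the two components matters.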


[89] MS-rPPG: Multi-spectral State Space Model for Remote Photoplethysmography in Driver Monitoring Systems cs.CV | cs.AI | eess.IVPDF

Jiho Choi, Sang Jun Lee

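TL;DR: This paper proposes MS-rPPG, a multi-spectral remote photoplethysmography (rPPG) framework for driver health monitoring. It combines RGB and near-infrared (NIR) face video to cope with the illumination changes and head movements of driving environments. With a cross-spectral linear modulation (CSLM) strategy and a new state space model, MS-Mamba, the method aims to model long-range temporal dependencies and capture cross-channel interactions between multi-spectral features.

Details

Motivation: Driving is an uncontrolled setting in which video is affected by changing illumination and frequent head movements, which makes camera-based remote heart rate estimation very challenging. The paper aims to build a more robust driver health monitoring system by fusing multi-spectral information to improve heart rate estimation under difficult driving conditions.

Result: The method was evaluated on the MR-NIRP Car dataset and on MS-Drive, a real-world dataset collected by the authors. The results indicate that MS-rPPG is more accurate and more robust in heart rate estimation than previous methods.

Insight: The main novelty is a multi-spectral framework that fuses RGB and NIR video, together with a CSLM strategy based on frequency-domain analysis for combining complementary features. The purpose-built MS-Mamba state space model captures long-range temporal dependencies and cross-channel interactions, which is essential for dynamic driving video sequences.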

Abstract: Remote photoplethysmography (rPPG) is a camera-based technique for measuring physiological signals, particularly cardiac activity. From the remotely measured signals, heart rate can be estimated, which is crucial for health monitoring. In this study, we investigate a driver health monitoring system based on remote heart rate estimation. However, driving environments represent uncontrolled settings where videos are subject to varying illumination conditions and frequent head movements. We introduce MS-rPPG, a multi-spectral framework that combines RGB with near-infrared (NIR) face video to make rPPG estimation more robust under challenging driving conditions. To combine the complementary features from the two spectral videos, we propose a cross-spectral linear modulation (CSLM) strategy based on frequency-domain analysis. Moreover, we introduce MS-Mamba, a novel state space model designed to effectively model long-range temporal dependencies while jointly capturing cross-channel interactions between multi-spectral features. We collected a real-world dataset called MS-Drive, which was recorded from 50 participants while they were driving. The proposed method was evaluated on the MR-NIRP Car and MS-Drive datasets. The experimental results indicate that MS-rPPG shows better robustness and heart rate estimation accuracy than previous methods, highlighting its promise for driver health monitoring. The code is available at github.com/ziiho08/MS-rPPG.
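For readers new to rPPG, the last step of most pipelines is reading the heart rate off the dominant spectral peak of the recovered pulse signal. The sketch below does this with a plain discrete Fourier transform restricted to a plausible heart-rate band. It is a generic post-processing step, not the MS-rPPG model, and the band limits of 42 to 240 beats per minute are an assumption.

```python
import math


def estimate_heart_rate_bpm(signal, fs, lo_bpm=42.0, hi_bpm=240.0):
    """Return the frequency (in beats per minute) with the largest DFT power.

    signal: pulse waveform samples; fs: sampling rate in Hz (video frame rate).
    Returns None if no DFT bin falls inside the band.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    best_bpm, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        bpm = 60.0 * k * fs / n
        if not lo_bpm <= bpm <= hi_bpm:
            continue
        angle = 2.0 * math.pi * k / n
        re = sum(x * math.cos(angle * t) for t, x in enumerate(centered))
        im = sum(x * math.sin(angle * t) for t, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm
```

With a 10-second window the frequency resolution is 0.1 Hz, or 6 beats per minute, which is one reason longer temporal context helps.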


[90] MammoExpert: Benchmarking Chain-of-Thought Reasoning in Mammography Diagnosis cs.CV | cs.AIPDF

Di Dai, Bo Liu, Youcheng Li, Haojun Yu, Zhouhang Bian

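TL;DR: This paper presents MammoExpert, the first mammography dataset with Chain-of-Thought (CoT) reasoning annotations, covering 67 WHO-classified histopathology subtypes and three diagnostic phases. The dataset is used to evaluate AI models on breast lesion classification. Combining the public CBIS-DDSM dataset with MammoExpert improves classification accuracy by 7.1%, and training the model to learn CoT reasoning yields a further 4% gain on the MammoExpert test set.

Details

Motivation: Publicly available high-quality mammography datasets are limited in scale and annotation richness, particularly in pathological subtype coverage and structured diagnostic reasoning annotations, which constrains AI development for breast cancer detection.

Result: On breast lesion classification, combining CBIS-DDSM with MammoExpert improves accuracy by 7.1%, and training the model to learn CoT reasoning adds another 4% on the MammoExpert test set. On the INBreast and Vindr datasets, the full approach yields accuracy gains of 6.9% and 6.7%, respectively, reaching state-of-the-art performance.

Insight: The novelty is the first mammography dataset with CoT reasoning annotations. By mirroring a radiologist's diagnostic process, from observation through assessment to diagnostic synthesis, it improves model interpretability and accuracy and provides a new benchmark for interpretable medical image diagnosis.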

Abstract: Mammography is an essential tool for breast cancer detection, with millions of examinations conducted annually. However, publicly available high-quality mammography datasets for AI development remain limited in both scale and annotation richness, particularly regarding pathological subtype coverage and structured diagnostic reasoning annotations. In this paper, we present MammoExpert, the first mammography dataset with Chain-of-Thought reasoning annotations across three diagnostic phases: (i) primal observation, (ii) factual assessment, and (iii) diagnostic synthesis. The dataset comprises 2,379 mammography images covering 67 WHO-classified histopathology subtypes, and each exam provides 42 radiographic features annotated by nine senior radiologists. We evaluate its performance on the breast lesion classification task, demonstrating superior accuracy and reasonability compared to existing classification models. Combining the public CBIS-DDSM dataset with MammoExpert yields a 7.1% improvement in classification accuracy, while training the model to learn CoT reasoning achieves a further 4% gain on the MammoExpert test set. Similar improvements are observed on the INBreast and Vindr datasets, where the full approach yields accuracy gains of 6.9% and 6.7%, respectively. MammoExpert can serve as a benchmark for interpretable breast lesion diagnosis through explicit CoT reasoning.


[91] Neural Architecture Distributions: A New Paradigm for Stochastic Segmentation cs.CVPDF

Conghui Li, Junhao Huang, Chern Hong Lim, Bing Xue, Mengjie Zhang

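TL;DR: This paper proposes a new paradigm for stochastic segmentation, neural architecture distributions. Multiple plausible segmentation masks are produced by sampling discrete architectures from a learned distribution over operator choices at multiple searchable positions in a segmentation backbone. Output diversity comes from architecture sampling, training uses set-level supervision, and the candidate bank is built with evolutionary search. The method achieves state-of-the-art distribution matching and hypothesis coverage on LIDC-IDRI.

Details

Motivation: Existing stochastic segmentation methods usually introduce randomness through continuous latent variables or iterative denoising trajectories, whose stochastic sources are difficult to search or audit directly. This paper proposes architecture distributions as a new stochastic source to make stochastic segmentation more interpretable and traceable.

Result: On LIDC-IDRI, the method reaches state-of-the-art distribution matching and hypothesis coverage, and it remains effective on two extension tasks.

Insight: The paper is the first to formulate stochastic segmentation as learning an architecture distribution, with output diversity realized through architecture sampling. It supports architectural provenance, since each output corresponds to a specific architecture configuration, which improves interpretability. Set-level supervision with an IoU-based energy-distance surrogate discourages collapse toward averaged masks, and evolutionary search optimizes the support of the stochastic source, adding flexibility.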

Abstract: Stochastic segmentation seeks to represent multiple plausible masks for a single image, which is essential in safety- and quality-critical applications such as medical imaging or building defect inspection. Most existing methods introduce stochasticity by injecting continuous latent variables or by iterative denoising trajectories, whose stochastic sources are difficult to search or audit directly. We propose architecture distributions as a new stochastic source for segmentation: instead of sampling a latent variable or noise, we sample a discrete architecture from a learned distribution over operator choices at multiple searchable positions in a segmentation backbone. Each sampled architecture yields one mask through the selected active path, so inference depends on the executed subnet rather than the complete candidate bank. This approach also supports architectural provenance, since each output corresponds to a specific architecture configuration. To reduce collapse toward averaged masks, we train with set-level supervision by matching a set of architecture-sampled predictions to the annotation set using an IoU-based energy-distance surrogate. We further construct the candidate bank with evolutionary search, making the support of the stochastic source optimizable before distribution learning. The proposed method achieves state-of-the-art distribution matching and hypothesis coverage on LIDC-IDRI, and remains effective on two extension tasks. To the best of our knowledge, this is the first work to formulate stochastic segmentation as learning an architecture distribution and realizing output diversity through architecture sampling.
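The set-level objective can be illustrated with the generalized energy distance commonly used to evaluate stochastic segmentation, with 1 - IoU as the distance between masks. The paper trains with an IoU-based surrogate whose exact form is not given in the abstract, so the sketch below shows the standard non-differentiable quantity only; masks are flat lists of 0/1 values, and two empty masks are treated as identical by convention.

```python
def iou_distance(mask_a, mask_b):
    """1 - IoU between two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return 0.0 if union == 0 else 1.0 - intersection / union


def mean_pairwise(masks_x, masks_y):
    """Average distance over all pairs drawn from the two sets."""
    total = sum(iou_distance(x, y) for x in masks_x for y in masks_y)
    return total / (len(masks_x) * len(masks_y))


def energy_distance(predictions, annotations):
    """2 E[d(pred, annot)] - E[d(pred, pred')] - E[d(annot, annot')]."""
    return (2.0 * mean_pairwise(predictions, annotations)
            - mean_pairwise(predictions, predictions)
            - mean_pairwise(annotations, annotations))
```

A prediction set that collapses to one averaged mask is penalized because its within-set diversity term is zero while the annotation set's is not, which is the failure mode the set-level supervision targets.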


[92] Odoriko: A Shape-Aware Multimodal Diffusion Framework for Human Motion cs.CV | cs.GR | cs.ROPDF

Dongseok Shim, Julian Tanke, Kengo Uchida, Christian Simon, Koichi Saito

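TL;DR: This paper presents Odoriko, the first unified multimodal human motion generation framework that reflects subject bio-morphological information, such as gender and body shape, directly in motion generated under text, music, and video conditions. When morphological information is given, the generated motion is consistent with it; when it is missing, the framework recovers subject morphology and motion together, unifying estimation and generation in one model.

Details

Motivation: Existing unified multimodal motion generation frameworks treat all subjects as morphologically equivalent, ignoring the fact that morphological factors such as gender and body shape produce distinct kinematic signatures. The paper aims to generate motion that matches not only the action instruction but also the morphology of the subject performing it.

Result: Extensive experiments on text-to-motion, music-to-dance, and video-to-motion benchmarks show that Odoriko matches or exceeds prior specialized models on standard metrics.

Insight: The main novelty is integrating subject bio-morphological information into a unified multimodal motion generation framework for the first time, enabling morphology-consistent generation. The framework also jointly estimates morphology and motion when morphological information is unavailable, unifying generation and estimation in a single architecture.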

Abstract: Human motion generation has been widely studied across diverse input modalities, text, music, and video, and recent efforts have unified these into single multimodal frameworks. However, while morphological factors such as gender and body shape are known to produce distinct kinematic signatures, no existing unified framework incorporates this into generation, treating all subjects as morphologically equivalent. We present Odoriko, the first unified multimodal motion generation framework that reflects subject bio-morphological information directly in synthesized motion output. Rather than averaging over subject variation, Odoriko generates motion that is consistent with who is moving, not just what they are asked to do, across text, music, and video conditions within a single model. When explicit morphological information is unavailable, Odoriko additionally recovers subject morphology alongside motion, unifying estimation and generation in one framework. Extensive experiments across text-to-motion, music-to-dance, and video-to-motion benchmarks demonstrate that Odoriko matches or exceeds prior specialized models on standard metrics, while enabling morphology-consistent generation that no existing unified framework supports.


[93] ConnectomeBench2: A Unified Benchmark for Automated Connectomic Proofreading cs.CV | cs.AIPDF

Jeff Brown, Tim Farkas, Gleb Razgar, Edward S. Boyden

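TL;DR: This paper releases ConnectomeBench2, a unified multi-species dataset of over 716,485 expert-labeled proofreading decisions and more than 4.5 million associated images, covering split and merge error correction in four major open connectomes (mouse, human, zebrafish, fly). A Vision Transformer with shared encoders trained on the dataset reaches human-level accuracy across species on split error correction and merge error identification, with performance improving as data size and modalities increase. The study also shows that the model is well calibrated, that measures of distribution distance predict performance degradation, and that connectomics-specific pretraining and active learning could reduce labeling effort.

Details

Motivation: In synapse-resolution connectomics, proofreading (correcting segmentation errors in 3D brain reconstructions) is the rate-limiting step. Existing work lacks a unified multi-species benchmark for training and evaluating automated proofreading models.

Result: On ConnectomeBench2, a Vision Transformer with shared encoders reaches human-level accuracy across species on split error correction and also performs strongly on merge error identification. Performance improves with data size and modality, and the model's calibration and out-of-distribution behavior are analyzed.

Insight: The main contributions are: 1) the first unified multi-species benchmark dataset for connectomic proofreading; 2) evidence that a single Vision Transformer with shared encoders for mesh geometry and electron microscopy images can reach human-level performance across species; 3) connectomics-specific pretraining and active learning strategies that could substantially reduce the labeling effort needed to extend to new species and brain regions.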

Abstract: Proofreading, the correction of segmentation errors in 3D brain reconstructions, is the rate-limiting step in synapse-resolution connectomics. We release ConnectomeBench2, a unified multi-species dataset of over 716,485 expert-labeled proofreading decisions with more than 4,500,000 associated images from four major open connectomes (mouse, human, zebrafish, fly), covering both split and merge error correction. Trained on this dataset, a single Vision Transformer with shared encoders for mesh geometry and electron microscopy reaches human-level accuracy across species for split error correction and merge error identification, with performance scaling with data size and modality. Beyond accuracy, we show that the model is well-calibrated within distribution, that measures of distribution distance predict where calibration and accuracy will degrade on unseen data, and that connectomics-specific pretraining and active learning-based sample selection show potential to substantially reduce the labeling effort needed to extend to new species and brain regions. The benchmark provides the infrastructure to train and evaluate increasingly capable vision models for connectomic proofreading. Data and code availability: The ConnectomeBench2 dataset is released on Hugging Face at https://huggingface.co/datasets/jeffbbrown2/ConnectomeBench2. The accompanying codebase is available on GitHub at https://github.com/timfarkas/ConnectomeBench2.
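Calibration claims of this kind are often summarized with expected calibration error (ECE), which compares confidence and accuracy within confidence bins. The abstract does not say which calibration metric the authors report, so the sketch below is only one common choice, with equal-width bins as an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins.

    confidences: predicted probabilities in [0, 1] for the chosen decisions
    correct:     booleans, True where the decision matched the expert label
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

Computing the same quantity separately on in-distribution data and on a held-out species or brain region is one way to see where calibration degrades.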


[94] ChronoLock: Protecting Videos from Unauthorized Text-to-Video Personalization cs.CVPDF

Jiaming He, Jiashu Zhang, Guanyu Hou, Shuhan Ye, Hanwei Zhu

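TL;DR: This paper proposes ChronoLock, the first proactive protection framework against unauthorized personalization of text-to-video (T2V) diffusion models. The method optimizes bounded perturbations over temporal denoising trajectories to directly disrupt motion learning from a video, both by disturbing intra-chunk temporal adaptation and by enlarging inter-chunk boundary mismatch, so that videos shared online are hard to exploit for unauthorized fine-tuning.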

Details

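Motivation: As T2V diffusion models and personalization techniques advance, online videos can easily be collected and used for unauthorized model fine-tuning. Existing protection methods mainly target image recognition or text-to-image personalization and cannot effectively disrupt the temporal denoising dynamics that video personalization relies on, so a protection scheme designed around the temporal properties of video is needed.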

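Result: Experiments on UCF Sports and HMDB51 with popular T2V backbones and personalization schemes show that ChronoLock effectively reduces motion imitation under both automatic metrics and human evaluation, supporting its protective capability.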

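Insight: The novelty lies in being the first to address protection against video personalization: it disrupts motion learning by directly optimizing perturbations over temporal denoising trajectories, combines disruption of intra-chunk adaptation with disruption of inter-chunk continuity, and uses transformation-sampled updates to improve robustness to common preprocessing operations.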

Abstract: Text-to-video (T2V) diffusion models have made it increasingly easy to synthesize realistic and temporally coherent videos, while recent personalization techniques allow such models to imitate a specific subject, style, or motion pattern from only a few reference clips. This capability creates a new data-misuse risk: videos shared online can be collected and used for unauthorized T2V fine-tuning. Existing protective perturbations are mainly designed for image recognition or text-to-image personalization, and therefore focus on corrupting static appearance cues rather than the temporal denoising dynamics that make video personalization possible. To address this gap, we introduce ChronoLock, the first proactive protection framework that makes released videos difficult to exploit for unauthorized T2V personalization. ChronoLock targets the motion-learning process directly by optimizing bounded perturbations over temporal denoising trajectories. It first disrupts intra-chunk temporal adaptation with a diffusion objective that combines fitting error, frame-relative denoising relations, and adjacent-frame variation, and then enlarges inter-chunk boundary mismatch to weaken long-range motion continuity. Transformation-sampled updates further improve robustness to common preprocessing operations. Experiments on UCF Sports and HMDB51 with popular T2V backbones and personalization schemes show that ChronoLock effectively reduces motion imitation under automatic metrics and human evaluation.


[95] BadDreamer: Transferable Backdoor Attacks against Video World Models for Autonomous Driving cs.CVPDF

Zhe Shuai, Xiaopeng Xie, Yikun Zeng

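TL;DR: BadDreamer is a transferable spatio-temporal backdoor attack against video world models for autonomous driving. It poisons the future dynamics learned by the video world model: the trigger, a yellow delivery rider, is kept in the observed frames but erased from the future frames, so the model learns to hallucinate the rider's disappearance whenever the trigger appears. This corrupted future-aware representation can transfer to the downstream action module and lead to unsafe, non-evasive waypoint predictions.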

Details

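Motivation: Video world models are widely used in autonomous driving to forecast future scene evolution and to provide future-aware spatio-temporal representations for downstream action prediction. Their training-time security risks, however, remain largely unexplored. This paper aims to expose the risk of transferable backdoor attacks on the perception side.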

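Result: The attack is instantiated on a representative open-source perception-to-action pipeline, demonstrating its effectiveness and revealing a representation-level safety risk in autonomous-driving video world models.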

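Insight: The novelty is that the attack target shifts from conventional image labels, prompt outputs, or action supervision to the transition dynamics learned by the video world model itself. The paper shows that such a representation-level backdoor can transfer to the downstream action module without directly modifying trajectory labels, which offers a new perspective for model safety validation.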

Abstract: Video world models are increasingly used in autonomous driving to forecast future scene evolution and provide future-aware spatio-temporal representations for downstream action prediction. In perception-to-action pipelines, these representations can directly influence ego-vehicle waypoint planning, making the learned future dynamics a critical security-sensitive component. Despite their promise, the training-time security risks of autonomous-driving video world models remain largely unexplored. We present BadDreamer, a transferable spatio-temporal backdoor attack that targets the perception side of this pipeline. Unlike conventional backdoors that manipulate image labels, prompt outputs, or action supervision, BadDreamer poisons the learned transition dynamics of a video world model. It constructs trigger-erasure sequences in which an oncoming yellow delivery rider is visible in the observed context frames but erased from the future frames. After fine-tuning on a small fraction of such sequences, the compromised world model learns a hidden conditional association: when the physical trigger appears, it hallucinates a future where the rider disappears and the road appears clear. We further show that this corrupted future-aware representation can transfer to the downstream action module without directly modifying ego-trajectory labels, inducing unsafe non-evasive waypoint predictions. Our experiments instantiate this attack on a representative open-source perception-to-action pipeline, revealing a representation-level safety risk in autonomous-driving video world models and highlighting the need for backdoor-aware validation beyond clean generation quality.


[96] SEED: Simple ViT and Evolving Harness for Explainable Text Forgery Detection cs.CVPDF

Kahim Wong, Kemou Li, Yiming Chen, Haiwei Wu, Jiantao Zhou

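TL;DR: This paper presents SEED, a system for explainable text forgery detection with three modules: a similarity-guided data augmentation pipeline, a single ViT built on DINOv3 with LoRA that jointly performs detection and pixel-level localization, and an evolving harness that generates a complete forensic report via an MLLM. The system ranked third in the GenText-Forensics Challenge at ACM MM 2026.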

Details

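Motivation: The work responds to the threat that AI-assisted image editing poses to trust in financial, legal, and identity records, and to the GenText-Forensics Challenge requirement of structured forensic reports that integrate detection, pixel-level localization, and natural language explanation for multilingual text-centric forgery images.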

Result: SEED ranked third in the GenText-Forensics Challenge.

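Insight: Contributions include: 1) a similarity-guided pipeline that augments training with synthetic forgeries; 2) a single ViT that builds on DINOv3 pre-trained priors with LoRA adaptation and jointly handles detection and localization with very few trainable parameters; 3) an evolving harness that uses an MLLM to generate complete forensic reports and iteratively improves report quality through a proposer-evaluator loop.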

Abstract: AI-assisted image editing threatens trust in financial, legal, and identity records. The GenText-Forensics Challenge at ACM MM 2026 addresses this by requiring structured forensic reports that integrate detection, pixel-level localization, and natural language explanation for multilingual text-centric forgery images. We present SEED, a modular system with three components. First, a similarity-guided pipeline augments training with diverse synthetic forgeries. Second, a single ViT, built on DINOv3 with LoRA adaptation, jointly performs detection and pixel-level localization while preserving pre-trained priors with minimal trainable parameters. Third, an evolving harness takes the detector’s predictions and generates a complete forensic report via an MLLM, iteratively improved through a proposer-evaluator loop optimizing report quality. SEED ranked 3rd in the GenText-Forensics Challenge. Code and data are available at https://github.com/KahimWong/GenText-Forensics-3rd-Place.


[97] CogniRoute: Learning to Route Social Evidence in Omni-Modal Models cs.CVPDF

Yifan Shen, Pei Tian, Xinzhuo Li, Bowen Fang, Shujun Xia

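TL;DR: This paper proposes CogniRoute, a schema-guided Mixture-of-Experts framework for social omni-modal reasoning. It uses a training-only cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope, and aligns global routing signatures with this structure during supervised fine-tuning. It also introduces route-aware reinforcement learning, which jointly optimizes token generation and expert allocation. To support training and evaluation, the authors construct OmniSocialBench, a diagnostic dataset.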

Details

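Motivation: Omni-modal models can process video, audio, and text, but unified access to multiple modalities does not guarantee that a model uses the right evidence. The gap is especially pronounced in social video question answering, where the answer may hinge on a gesture, vocal tone, temporal cue, or a mismatch between what is said and what is shown.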

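Result: CogniRoute achieves 59.38% average accuracy on OmniSocialBench, 15.33 percentage points above the strongest proprietary baseline and 26.77 points above the strongest open-source omni-modal baseline, with the largest gains on questions requiring audio-visual coordination, conflict resolution, and temporally grounded social inference.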

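Insight: The novelty lies in a routing framework guided by a structured cognitive schema, together with route-aware reinforcement learning whose rewards combine answer correctness, modality-consistent reasoning, and cognitive temporal grounding. This gives a new, interpretable mechanism for how a model selects and integrates multimodal evidence, particularly in complex social reasoning tasks.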

Abstract: Omni-modal models can ingest video, audio, and text, but unified access to multiple modalities does not guarantee that a model uses the right evidence. This gap is especially pronounced in social video question answering, where the answer may hinge on a gesture, vocal tone, temporal cue, or mismatch between what is said and what is visually expressed. We introduce CogniRoute, a schema-guided Mixture-of-Experts framework for social omni reasoning. CogniRoute uses a training-only cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope, and aligns global routing signatures with this structure during supervised fine-tuning. We further introduce route-aware reinforcement learning, which jointly optimizes token generation and expert allocation using rewards for answer correctness, modality-consistent reasoning, and cognitive temporal grounding. To support training and evaluation, we construct OmniSocialBench, a diagnostic social video QA resource with 118K structured training examples, grounded reasoning traces, schema labels, temporal evidence spans, and a manually verified evaluation split. CogniRoute achieves 59.38% average accuracy on OmniSocialBench, improving over the strongest proprietary baseline by 15.33 percentage points and the strongest open-source omni baseline by 26.77 points, with the largest gains on questions requiring audio-visual coordination, conflict resolution, and temporally grounded social inference.


[98] MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models cs.CV | cs.AI | cs.CLPDF

Han Jang, Junhyeok Lee, Songsoo Kim, Chae Young Lim, Hyeonjin Goh

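TL;DR: This paper introduces MedLayXPlain, the first large-scale multimodal benchmark and evaluation framework for assessing whether medical vision-language models can generate patient-accessible language. The benchmark contains 122,789 samples across 8 imaging modalities, each with paired expert and lay captions. The study also finds a systematic gap between expert and lay captioning in existing models.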

Details

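Motivation: With the 21st Century Cures Act requiring immediate patient access to diagnostic imaging results, evaluating whether medical vision-language models can bridge the language gap between experts and patients is essential for patient education and shared decision-making.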

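Result: Benchmarking 33 vision-language models on MedLayXPlain-122K shows that medical VLMs perform well on expert captions but degrade significantly on lay captions, while general-purpose VLMs produce more accessible language but lack clinical precision. Neither current paradigm adequately serves patient-facing communication.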

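Insight: The novelty lies in building the first large-scale benchmark for medical lay language generation anchored in a hierarchical ontology (UMLS), in the HOVER pipeline for producing semantically equivalent lay captions, and in MedLayEval, a lightweight evaluator that replaces standard NLG metrics to better reflect clinical relevance.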

Abstract: Medical Vision-Language Models (Med-VLMs) achieve strong expert-level performance, yet their ability to generate patient-accessible descriptions remains underexplored. With the 21st Century Cures Act now mandating immediate patient access to diagnostic imaging results, evaluating whether Med-VLMs can bridge this Expert-Lay Gap is both urgent and clinically consequential for patient education and shared decision-making. To this end, we introduce MedLayXPlain, the first large-scale multimodal benchmark and evaluation framework for Medical Lay Language Generation (MLLG). MedLayXPlain-122K provides 122,789 region-grounded samples across 8 imaging modalities from 12 publicly available source datasets, each comprising a medical image with paired expert and lay captions anchored in a three-level Unified Medical Language System (UMLS) ontology hierarchy spanning 7 semantic groups, 43 semantic types, and 2,411 medical concepts. Lay captions are constructed via Hierarchical Ontology-Verified Refinement (HOVER), a three-step pipeline combining patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification to enforce semantic equivalence while preventing hallucination. We further introduce MedLayEval, a lightweight 3B evaluator distilled from a 27B verifier that scores expert-lay alignment across five clinically grounded attributes, addressing the poor correlation between standard NLG metrics and clinical judgment. Benchmarking 33 VLMs on MedLayXPlain-122K reveals a systematic Expert-Lay Gap: medical VLMs achieve strong expert captioning but suffer significant lay-register degradation, while general-purpose VLMs produce more accessible language yet lack clinical precision, confirming that neither current paradigm adequately serves patient-facing communication.


[99] HERO: Hypothesis-Driven Evidence Retrieval from Omics for Multi-Task Breast Cancer Analysis cs.CV | q-bio.GNPDF

Xiangyu Li, Ran Su

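TL;DR: HERO is a hypothesis-driven evidence retrieval framework for multi-task breast cancer analysis. It maps multi-omics data (DNA methylation and miRNA) to a sparse morphology hypothesis vector, selects relevant pathology regions via TF-IDF retrieval, and uses a cosine gate to trigger deterministic repair. This closed-loop design bounds the number of vision-language model calls and makes retrieval more interpretable.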

Details

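Motivation: Existing methods typically use multi-omics data as a parallel feature stream or textual context rather than as an explicit retrieval constraint. HERO instead turns observed multi-omics data into a testable morphology hypothesis to integrate multimodal information more effectively for breast cancer analysis.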

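Result: On TCGA-BRCA (930 whole-slide images, patient-level 5-fold cross-validation), HERO sets a new state of the art on ER, PR, HER2, subtype, and risk prediction, outperforming both multimodal fusion and VLM-based baselines.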

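Insight: The novelty lies in turning multi-omics data into a sparse hypothesis vector that serves as a retrieval constraint and combining it with TF-IDF retrieval and a cosine-gated closed loop, which makes retrieval auditable and reduces reliance on embedding-based semantic matching.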

Abstract: Matched multi-omics can improve WSI-based biomarker and prognosis prediction, but most existing pipelines use omics as a parallel feature stream or textual context rather than as an explicit retrieval constraint. HERO asks whether observed omics can be a testable morphology hypothesis: a sparse pathway-to-morphology prior maps DNA methylation and miRNA into a K-dimensional intent vector m (K=16), TF-IDF retrieval over structured captions selects endpoint-relevant regions, and a cosine gate c = cos(m, v) triggers deterministic deficit-driven repair when c < τ_c. This closed-loop design bounds VLM calls, reduces reliance on embedding-based semantic matching, and makes every retrieval and verification step lexically auditable. On TCGA-BRCA (930 WSIs, patient-level 5-fold CV), HERO sets a new state of the art across ER, PR, HER2, subtype, and risk prediction, outperforming both multimodal fusion and VLM-based baselines.
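
The cosine gate described in the abstract can be sketched in a few lines of Python. This is only an illustration: the abstract does not define v, so it is assumed here to be a K-dimensional evidence vector derived from the retrieved regions, and the threshold value `tau_c=0.5` and the toy vectors are hypothetical.

```python
import math

def cosine(m, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(m, v))
    norm = math.sqrt(sum(a * a for a in m)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def cosine_gate(m, v, tau_c=0.5):
    """Return (c, repair), where repair is True when c < tau_c.

    tau_c is an assumed threshold; the abstract does not give its value.
    """
    c = cosine(m, v)
    return c, c < tau_c

K = 16  # dimensionality of the intent vector, as stated in the abstract
# Hypothetical sparse intent vector and two hypothetical evidence vectors.
m = [1.0 if i in (2, 5, 11) else 0.0 for i in range(K)]
v_match = [0.9 if i in (2, 5, 11) else 0.0 for i in range(K)]
v_miss = [1.0 if i in (0, 1) else 0.0 for i in range(K)]

print(cosine_gate(m, v_match))  # evidence agrees with the hypothesis: no repair
print(cosine_gate(m, v_miss))   # evidence is orthogonal: repair is triggered
```

Because the gate is a single scalar comparison, each repair decision can be logged alongside c and τ_c, which fits the abstract's emphasis on auditable verification steps.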


[100] Contrastive and Adaptive Multi-modal Masked Autoencoder for Spatial Transcriptomics cs.CV | cs.AIPDF

Joohyeok Kim, Taejin Jeong, Jinyeong Kim, Seong Jae Hwang

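TL;DR: This paper proposes CAMMST, a contrastive and adaptive multi-modal masked autoencoder for predicting spatial transcriptomics (ST) gene expression from H&E histology images. It treats prediction as a spatial imputation problem and uses a small fraction of gene expression as "genetic anchors", inferring whole-slide gene expression profiles within a Masked Autoencoder (MAE) framework. The core contributions are a bio-saliency score and a learning-to-rank strategy that adaptively select the most informative tissue spots as anchors, and a cross-modal joint encoder that aligns visual and genetic features via contrastive learning to produce robust joint representations for accurate prediction.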

Details

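Motivation: Spatial transcriptomics (ST) is expensive, and existing studies try to predict gene expression from H&E histology images alone, but tissue morphology by itself does not carry enough information to fully resolve the underlying gene expression. Recent work has begun to use partial gene expression data to guide prediction. Building on this, the paper frames prediction as a spatial imputation problem and aims to use a limited set of genetic anchors more effectively to improve whole-slide gene expression prediction.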

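Result: The method consistently surpasses existing methods in both histology-only prediction and spatial imputation. It achieves superior accuracy even without any genetic anchors and improves further with as little as 10% transcriptomic coverage.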

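Insight: The main contributions are: 1) a bio-saliency score and a learning-to-rank strategy that adaptively identify the most informative spots in the tissue as genetic anchors, with contiguous regions selected so that they suit real-world ST profiling hardware; 2) a cross-modal joint encoder that aligns the selected genetic anchors with their corresponding visual features via contrastive learning to produce robust cross-modal joint representations. Viewed objectively, the method applies the MAE framework to ST spatial imputation and fuses heterogeneous data sources through multimodal contrastive learning, suggesting a route to lower-cost, more accurate gene expression prediction.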

Abstract: The high cost of spatial transcriptomics (ST) has driven extensive studies into predicting gene expression directly from H&E histology images. However, this prediction task faces an inherent limitation, as tissue morphology alone provides insufficient information to fully resolve underlying gene expression. To address this limitation, a recent study leverages partial gene expression to guide the prediction process alongside histology images. Building on this paradigm, we approach the prediction task as a spatial imputation problem, employing a Masked Autoencoder (MAE) to utilize a small fraction of gene expression as genetic anchors for inferring whole-slide gene expression profiles. Specifically, we propose a bio-saliency score and a learning-to-rank strategy to adaptively identify the most informative spots within the tissue. Based on these identified spots, our framework selects contiguous regions as genetic anchors to ensure suitability for real-world ST profiling hardware. To effectively leverage these anchors, we design a cross-modal joint encoder that integrates visual and genetic modalities. By aligning the selected anchors with their corresponding visual features via contrastive learning, the encoder generates robust joint representations to accurately predict gene expression across the whole slide. Notably, our framework consistently surpasses existing methods in both histology-only prediction and spatial imputation, achieving superior accuracy even without genetic anchors and further excelling with as little as 10% transcriptomic coverage. Our code is available at https://github.com/Kyyle2114/CAMMST.


[101] Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders cs.CV | cs.AI | cs.LGPDF

Sergio Lanza, Jae Hee Lee, Stefan Wermter

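TL;DR: This paper proposes a framework based on Sparse Autoencoders (SAEs) for extracting and analyzing visual, textual, and multimodal concepts from Vision Language Models (VLMs). The method computes the alignment between a candidate concept and dataset samples using cosine similarity. Experiments on a VQA dataset (LLaVA-NeXT) show that it improves visual concept quality by up to 45% over existing SAE-based methods while maintaining high textual concept quality and enabling systematic identification of multimodal concepts.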

Details

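Motivation: Existing SAE-based interpretability methods for VLMs usually focus on concepts from a single modality (text or vision) and ignore multimodal concepts, and their visual concept descriptions are often low quality, vague, or incomplete, which hinders a comprehensive understanding of VLM internals.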

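Result: Experiments on a VQA dataset (LLaVA-NeXT) show that the framework improves visual concept quality by up to 45% while maintaining high textual concept quality and enabling systematic identification of multimodal concepts.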

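Insight: The novelty lies in a unified SAE framework that extracts and analyzes visual, textual, and multimodal concepts together, using cosine similarity alignment to produce higher-quality, more interpretable concept descriptions, which provides a structured way to understand the concept space of VLMs.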

Abstract: Vision Language Models (VLMs) have demonstrated impressive performance in tasks requiring joint understanding of images and text, such as image captioning and Visual Question Answering (VQA), but our understanding of their internal processes remains limited. Recently, Sparse Autoencoders (SAEs) have emerged as a promising tool to support the interpretation of concepts encoded in VLMs. However, most SAE-based approaches focus only on textual or visual concepts separately, ignoring multimodal concepts. This limitation hinders a comprehensive understanding of VLMs, since concepts that integrate both modalities can be misclassified. Moreover, previous visual approaches often produce low-quality visual concept descriptions that are vague or incomplete, limiting their usefulness for understanding model reasoning. We propose a framework based on SAEs to extract and analyze visual, textual, and multimodal concepts from VLMs. For each neuron, we propose a candidate human-interpretable concept and compute the alignment between the concept and the dataset samples using cosine similarity scores. Experiments on a VQA dataset (LLaVA-NeXT) demonstrate that our framework improves visual concept quality by up to 45% compared to existing SAE-based methods, while maintaining high textual concept quality and enabling systematic identification of multimodal concepts. This work contributes new insights into the conceptual space of VLMs, providing a structured approach to distinguish between visual, textual, and multimodal concepts. The code is available at https://github.com/PHDLanza/Multidata_SAE


[102] A Neurosymbolic Framework for Interpretable Skeleton-Based Seizure Detection via Concept-Driven Logical Reasoning cs.CVPDF

Talha Ilyas, Deval Mehta, Zongyuan Ge

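TL;DR: This paper proposes a neurosymbolic framework for video-based seizure detection that makes skeleton-sequence analysis interpretable through concept-driven logical reasoning. The method extracts patient skeleton sequences from epilepsy monitoring unit videos, predicts spatio-temporal concept activations grounded in clinical motor semiology, and composes them via differentiable logic into auditable Boolean rules.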

Details

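Motivation: Existing deep learning methods for video-based seizure detection are not inherently interpretable, which limits their clinical translation. This paper aims to close that gap.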

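Result: On two public seizure video benchmarks, the method achieves 89.78% sensitivity with 0.06 false detections per hour on SAHZU and 85.27% sensitivity with 0.09 false detections per hour on IEEE, while providing three-level interpretability.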

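Insight: The novelty lies in bringing a neurosymbolic approach to video-based seizure detection, using differentiable logic rules to make the decision process clinically auditable, and reducing false positives by sub-classifying non-seizure activity at a fine-grained level.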

Abstract: Video-based seizure detection is essential for the management of epilepsy patients, offering a non-invasive complement to electroencephalography. While several deep learning approaches have been developed for video-based seizure detection, none are inherently interpretable, limiting their adoption and translation into clinical practice. We present, to our knowledge, the first exploration of a neurosymbolic framework for video-based seizure detection that directly addresses this gap. Our approach (1) extracts patient-centric skeleton sequences from epilepsy monitoring units via a prompt-guided foundation model, (2) predicts binary spatio-temporal concept activations grounded in clinical motor semiology guidelines, and (3) composes them via differentiable logic into interpretable Boolean rules with auditable contributions. Furthermore, to mitigate false positives arising from the traditional binary formulation (seizure vs. non-seizure), we sub-classify non-seizure segments into clinically relevant normal activities, providing the model with fine-grained discriminative supervision. Evaluated on two public seizure video benchmarks, our framework achieves 89.78% sensitivity with 0.06 false detections per hour on SAHZU and 85.27% sensitivity with 0.09 false detections per hour on IEEE, while producing complete three-level interpretability: every prediction decomposes into which motor primitives were detected, how they were logically composed, and how much each rule contributed to the clinical decision. We publicly release all annotations, extracted pose sequences, our data pipeline, and code at https://github.com/Mr-TalhaIlyas/CDSD/.
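
To make step (3) concrete, here is a minimal sketch of how concept activations can be composed with differentiable logic. The abstract does not say which relaxation the authors use; the product t-norm for AND and the probabilistic sum for OR below are assumptions, and the concept names and rules are hypothetical.

```python
def soft_and(values):
    """Product t-norm: equals Boolean AND when all inputs are 0 or 1."""
    result = 1.0
    for x in values:
        result *= x
    return result

def soft_or(values):
    """Probabilistic sum: equals Boolean OR when all inputs are 0 or 1."""
    none_true = 1.0
    for x in values:
        none_true *= (1.0 - x)
    return 1.0 - none_true

def evaluate_rules(concepts, rules):
    """Score a disjunction of conjunctions and report each clause's score.

    concepts: dict mapping concept name -> activation in [0, 1]
    rules: list of clauses, each a list of concept names joined by AND
    """
    clause_scores = {
        " AND ".join(clause): soft_and([concepts[name] for name in clause])
        for clause in rules
    }
    return soft_or(list(clause_scores.values())), clause_scores

# Hypothetical concept activations and rules, for illustration only.
concepts = {"rhythmic_arm_jerk": 0.9, "sustained_stiffening": 0.2, "head_turning": 0.8}
rules = [["rhythmic_arm_jerk", "head_turning"], ["sustained_stiffening"]]

score, per_clause = evaluate_rules(concepts, rules)
print(round(score, 3), per_clause)
```

With hard 0/1 activations the same functions reduce to ordinary Boolean logic, and the per-clause scores show which rule drove a prediction.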


[103] Few-Shot Hyperspectral Aphid Detection via FastGAN Synthetic Data Generation, Transformer-Based Classification and Explainable AI cs.CV | cs.AIPDF

Ali Saeidan

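TL;DR: This study proposes a data augmentation method based on the FastGAN generative adversarial network to address small dataset sizes in hyperspectral aphid detection. It expands the dataset with synthetic hyperspectral images and compares four classification architectures: VGG16, ResNet-50, EfficientNet, and Vision Transformer (ViT). Augmentation significantly improved classification robustness, and the ViT model achieved the best accuracy and F1-scores.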

Details

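Motivation: Early detection of aphid infestation in crops is essential for preventing yield loss and reducing unnecessary pesticide use. Deep learning methods applied to hyperspectral data, however, are often limited by small dataset sizes, so effective data augmentation strategies are needed to improve model performance.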

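Result: On the augmented dataset, the ViT model achieved the highest accuracy and F1-scores, EfficientNet provided strong balanced performance, and ResNet-50 showed moderate improvements over VGG16. Generated image quality was evaluated with FID, which showed stable convergence and realistic reconstruction of leaf morphology and infestation patterns.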

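Insight: The novelty lies in using FastGAN for hyperspectral data augmentation and combining it with a transformer-based model (ViT) for classification, enabling reliable detection with few samples. Reusable ideas include using generative adversarial networks to address hyperspectral image scarcity and the evidence that transformer architectures perform best in this plant disease classification setting.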

Abstract: Early detection of aphid infestation in crops is essential for preventing yield loss and reducing unnecessary pesticide use. Hyperspectral imaging combined with Spectral Information Divergence (SID) analysis offers a non-destructive approach for monitoring plant health; however, deep learning methods applied to hyperspectral data are often limited by small dataset sizes. In this study, a data-efficient generative adversarial network (FastGAN) was employed to augment a hyperspectral SID dataset of faba bean leaves containing healthy and aphid-infested samples. The trained generator produced 10,000 synthetic images preserving structural and spectral characteristics of real samples. Image quality was evaluated using Frechet Inception Distance (FID), demonstrating stable convergence and realistic reconstruction of leaf morphology and infestation patterns. The augmented dataset was used to train four classification architectures: VGG16, ResNet-50, EfficientNet, and Vision Transformer (ViT). Results showed that dataset augmentation significantly improved classification robustness, with performance progressively increasing from classical convolutional networks to transformer-based models. The ViT model achieved the highest accuracy and F1-scores, while EfficientNet provided strong balanced performance and ResNet-50 showed moderate improvements over VGG16. Confusion matrix analysis confirmed reduced false negatives and improved disease detection when using advanced architectures. The findings demonstrate that FastGAN-based augmentation effectively enhances hyperspectral plant disease classification and that transformer-based models provide the most reliable discrimination between healthy and infested leaves.


[104] Real-time pedestrian attribute recognition with YOLOv8 and ResNet18 cs.CV | cs.LGPDF

Houssam El Mir

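TL;DR: This paper proposes a two-stage real-time pedestrian attribute recognition framework based on YOLOv8 and ResNet18. YOLOv8n first detects pedestrians, then ResNet18-based classifiers perform gender classification, apparent-age estimation, and prediction of 61 binary attributes on each pedestrian crop. PETA and PA-100K are combined into a unified training corpus, and the system runs in real time at 25-30 FPS on an NVIDIA RTX 5060 GPU.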

Details

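Motivation: Pedestrian attribute recognition is useful in surveillance, video retrieval, and human-centered graphics applications. This paper aims to design a lightweight two-stage framework that supports real-time, multi-attribute pedestrian recognition.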

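Result: On the reported test splits, the system obtains 99.89% gender classification accuracy, a 4.23-year apparent-age mean absolute error, and 89.96% multi-attribute accuracy (36.32% macro F1-score, 58.80% micro F1-score). It runs at 25-30 FPS on an NVIDIA RTX 5060 GPU.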

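Insight: The contribution claimed in the abstract is a lightweight real-time two-stage framework that pairs a YOLOv8 detector with ResNet18 classifiers and merges two mainstream datasets (PETA and PA-100K) through semantic attribute mapping to enlarge the training data. Viewed objectively, its engineering value is in confirming that a lightweight model combination is feasible for real-time PAR, while the low macro F1-score shows that rare attributes remain a challenge for this framework.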

Abstract: Pedestrian attribute recognition (PAR) assigns semantic labels to detected pedestrians and is useful in surveillance, video retrieval, and human-centered graphics applications. This paper presents a two-stage framework in which YOLOv8n detects pedestrians and ResNet18-based models classify gender, estimate apparent age, and predict 61 binary attributes from each pedestrian crop. PETA and PA-100K are combined through semantic attribute mapping, producing a unified training corpus of more than 100,000 pedestrian images while retaining the PETA attribute space. On the reported test splits, the system obtains 99.89% gender classification accuracy, a 4.23-year apparent-age mean absolute error, and 89.96% multi-attribute accuracy with a 36.32% macro F1-score and 58.80% micro F1-score. Runtime measurements indicate 25-30 FPS on an NVIDIA RTX 5060 GPU. The results show that a lightweight detector-classifier pipeline can support real-time PAR, while low macro F1 indicates that rare attributes remain challenging.
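
The gap between the macro and micro F1-scores reported above is easier to read with a small worked example. The sketch below uses hypothetical per-attribute counts (not the paper's data): macro F1 averages per-attribute scores so a rare, poorly predicted attribute pulls it down, while micro F1 pools all counts and is dominated by frequent attributes.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per attribute."""
    per_attribute = [f1(tp, fp, fn) for tp, fp, fn in counts]
    macro = sum(per_attribute) / len(per_attribute)
    total_tp = sum(c[0] for c in counts)
    total_fp = sum(c[1] for c in counts)
    total_fn = sum(c[2] for c in counts)
    micro = f1(total_tp, total_fp, total_fn)
    return macro, micro

# Hypothetical counts: one frequent attribute predicted well,
# one rare attribute predicted poorly.
counts = [
    (90, 10, 10),  # frequent attribute
    (1, 4, 5),     # rare attribute
]
macro, micro = macro_micro_f1(counts)
print(round(macro, 4), round(micro, 4))
```

In this toy case the frequent attribute scores 0.9 and the rare one about 0.18, so the macro average sits well below the pooled micro score, the same direction as the reported gap.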


[105] Unsupervised Domain Adaptation for Sim-to-Real Object Pose Estimation with Contrastive Alignment and Pseudo-Label Refinement cs.CVPDF

Nidhal Eddine Chenni, Arunkumar Rathinam, Djamila Aouada

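TL;DR: This paper proposes CAPLR, an unsupervised domain adaptation method for sim-to-real object pose estimation. It focuses on adapting pose-sensitive features so that geometric cues are preserved, using an efficient cross-domain pairing strategy, contrastive alignment in localized regions, and consistency-based pseudo-label refinement. Experiments show that CAPLR achieves state-of-the-art performance on multiple well-known benchmarks.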

Details

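Motivation: Existing unsupervised domain adaptation methods often rely on global feature matching, multi-stage frameworks, or image translation pipelines, and tend to overlook the pose-specific information embedded in feature representations. This paper addresses that limitation by adapting pose-sensitive features in localized regions, so that domain alignment preserves the geometric cues essential for accurate pose estimation.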

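Result: Extensive experiments on multiple well-known object pose estimation benchmarks with diverse and challenging scenarios show that CAPLR achieves state-of-the-art performance.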

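Insight: The novelty lies in a framework that adapts localized pose-sensitive features across domains, comprising an unsupervised cross-domain pairing strategy, contrastive alignment in localized regions, and a consistency-based pseudo-label refinement mechanism, all designed to preserve geometric information and improve sim-to-real adaptation.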

Abstract: Unsupervised domain adaptation (UDA) enables robust transfer of knowledge from simulated to real environments while exploiting a subset of unlabeled target data to improve real-world performance. Existing UDA methods for object pose estimation often rely on global feature matching, larger multi-stage frameworks, or image translation pipelines, which tend to overlook the pose-specific information embedded in feature representations. To address this limitation, we introduce CAPLR, which targets the adaptation of pose-sensitive features in localized regions, ensuring that domain alignment preserves the geometric cues essential for accurate pose estimation. CAPLR achieves UDA with three key components: (1) an Efficient Cross-Domain Pairing strategy that leverages intermediate features to identify pose-similar image pairs across domains without supervision; (2) Contrastive Alignment, which aligns features at localized regions in both intermediate and task-specific representations; and (3) Consistency-Based Pseudo-Label Refinement, which improves reliability by encouraging stable target predictions. Extensive experiments demonstrate that CAPLR achieves state-of-the-art performance across multiple well-known object pose estimation benchmarks featuring diverse and challenging scenarios.


[106] A Test-time Actor-Critic Approach to News Images Generation cs.CVPDF

Damianos Galanopoulos, Vasileios Mezaris

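TL;DR: This paper describes the CERTH-ITI team's solution for the MediaEval NewsImages 2026 challenge, which focuses on generating images related to news headlines. Inspired by the Actor-Critic paradigm in reinforcement learning, the authors propose a test-time, model-agnostic Actor-Critic Image Generation approach (ACIG). It runs a feedback loop that generates image prompts, creates images, evaluates the results, and refines the prompts when needed.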

Details

Motivation: The task is to generate high-quality, relevant images from news headlines, the core problem of the MediaEval NewsImages 2026 challenge.

Result: ACIG achieved the best results on the NewsImages 2026 challenge leaderboard.

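Insight: The work applies the Actor-Critic paradigm from reinforcement learning to prompt engineering and optimization for image generation, yielding a test-time, model-agnostic feedback loop that evaluates generated results and iteratively improves them.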

Abstract: This paper introduces the CERTH-ITI solution for the MediaEval NewsImages 2026 challenge, which focuses on generating images related to news headlines. Inspired by the Actor-Critic paradigm in reinforcement learning, we present a test-time, model-agnostic Actor-Critic Image Generation approach (ACIG). ACIG generates prompts for image creation, produces the images, evaluates the generated results, and if needed refines the image generation prompts accordingly in a feedback loop. ACIG achieved the best results in the NewsImages 2026 challenge, according to the challenge’s leaderboard.
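
The feedback loop described in the abstract can be outlined as follows. This is a sketch under stated assumptions, not the authors' implementation: the function names, the stopping threshold, the round limit, and the toy stand-ins for the prompt writer, image generator, and critic are all hypothetical.

```python
def acig_loop(headline, make_prompt, generate, critique, max_rounds=3, threshold=0.8):
    """Generate, critique, and refine until the critic score reaches the threshold.

    make_prompt(headline, feedback) -> prompt string (the "actor")
    generate(prompt) -> image (any object)
    critique(headline, image) -> (score in [0, 1], textual feedback) (the "critic")
    """
    prompt = make_prompt(headline, None)
    best = None
    for _ in range(max_rounds):
        image = generate(prompt)
        score, feedback = critique(headline, image)
        if best is None or score > best["score"]:
            best = {"score": score, "image": image, "prompt": prompt}
        if score >= threshold:
            break  # good enough: no refinement needed
        prompt = make_prompt(headline, feedback)
    return best

# Toy stand-ins so the loop can run without any model.
def toy_make_prompt(headline, feedback):
    return headline if feedback is None else f"{headline} | {feedback}"

def toy_generate(prompt):
    return f"img({prompt})"

def toy_critique(headline, image):
    if "add context" in image:
        return 0.9, ""
    return 0.4, "add context"

result = acig_loop("Flood hits city", toy_make_prompt, toy_generate, toy_critique)
print(result["score"], result["prompt"])
```

Because the loop only calls the three functions it is given, any generator or critic can be plugged in without retraining, which is what test-time and model-agnostic mean in this setting.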


[107] LEViL: Label-Efficient Video Learning via Zero-Shot Distillation over VLM-Generated Pseudo-Label Spaces cs.CVPDF

Aslı Çelik

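TL;DR: This paper proposes LEViL, a label-efficient video learning framework that reduces dependence on large-scale labeled source datasets. It combines annotation-free video pretraining with target-label-set-aware fine-tuning. During pretraining, a vision-language model generates textual descriptions of unlabeled videos, from which an interpretable semantic pseudo-label space is built; a frozen video-language model then produces zero-shot soft target distributions over this space, so the video encoder can learn semantically rich representations. During downstream adaptation, supervised learning on labeled target videos is combined with zero-shot distillation over the actual target label set, preserving semantic guidance while adapting to the target task.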

Details

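Motivation: Supervised video pretraining requires large-scale labeled source datasets, and its effectiveness depends on the similarity between source and target domains. Building labeled pretraining datasets for each target domain is also costly and hard to scale. The framework is designed to avoid these problems.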

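Result: Experiments on UCF101 and HMDB51 show that the framework outperforms the compared semi-supervised video action recognition methods across all evaluated limited-label regimes. The annotation-free pretraining stage also learns transferable representations that provide an effective initialization for full-data fine-tuning, despite relying on a comparatively modest unlabeled pretraining pool.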

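Insight: The main novelty is an interpretable, VLM-generated semantic pseudo-label space combined with zero-shot distillation from a frozen video-language model, which enables semantically rich video representation learning without manual source annotations. Viewed objectively, the method combines zero-shot learning with pseudo-labeling to offer an efficient, scalable approach to video learning when labels are scarce.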

Abstract: Supervised video pretraining is a common transfer learning practice for improving downstream action recognition performance. However, it requires large-scale labeled source datasets, and the effectiveness of the learned initialization is influenced by the similarity between the source and target domains. Constructing such labeled pretraining datasets for different target domains is costly and difficult to scale. To address these limitations, this study proposes a label-efficient video learning framework that combines annotation-free video pretraining with target-label-set-aware fine-tuning. During pretraining, a vision-language model (VLM) generates textual descriptions of unlabeled videos, which are processed to construct an interpretable semantic pseudo-label space. A frozen video-language model then produces zero-shot soft target distributions over this space, allowing a student video encoder to learn semantically rich representations without manual source annotations. During downstream adaptation, target-label-set-aware fine-tuning combines supervised learning from labeled target videos with zero-shot distillation over the actual target label set, helping preserve VLM-derived semantic guidance while adapting the pretrained encoder to the target task. Experiments on UCF101 and HMDB51 show that the proposed framework outperforms the compared semi-supervised video action recognition methods across all evaluated limited-label regimes. Moreover, the annotation-free pretraining stage learns transferable representations that provide an effective initialization for full-data fine-tuning, despite relying on a comparatively modest unlabeled pretraining pool.
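
The two training signals in the abstract, zero-shot soft targets from a frozen teacher and supervised labels during fine-tuning, can be sketched as loss functions. The abstract does not state the exact objective; the KL divergence, the temperature, and the mixing weight `alpha` below are assumptions, and the logits are made-up numbers.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Match the student distribution to the frozen teacher's soft targets."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)

def fine_tune_loss(teacher_logits, student_logits, label, alpha=0.5, temperature=1.0):
    """Weighted sum of supervised cross-entropy and zero-shot distillation."""
    cross_entropy = -math.log(softmax(student_logits)[label])
    distill = distillation_loss(teacher_logits, student_logits, temperature)
    return alpha * cross_entropy + (1.0 - alpha) * distill

# Hypothetical teacher similarities and student logits over three labels.
teacher_logits = [2.0, 0.5, 0.1]
student_logits = [0.1, 0.5, 2.0]
print(distillation_loss(teacher_logits, student_logits))
print(fine_tune_loss(teacher_logits, student_logits, label=0))
```

During pretraining only the distillation term would apply, over the pseudo-label space; during fine-tuning the same term is computed over the actual target label set and mixed with the supervised loss.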


[108] SCOPE: Scale-Consistent One-Pass Estimation of 3D Geometry cs.CVPDF

Zheng Zhang, Lihe Yang, Tianyu Yang, Chaohui Yu, Yixing Lao

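TL;DR: SCOPE is a new approach for estimating 3D geometry from monocular video sequences, aimed at the difficulty existing methods have in maintaining both geometric accuracy and temporal consistency over long sequences. It generates affine-invariant 3D point maps with parameters shared across the sequence, giving a scale-consistent representation, and introduces three key innovations: viewpoint-invariant geometry alignment, appearance-invariant learning, and frequency-modulated positioning.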

Details

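Motivation: Existing methods struggle to balance geometric accuracy and temporal consistency on extended monocular video sequences of hundreds of frames. SCOPE is designed to address this challenge.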

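Result: Experiments across diverse datasets show that, compared with state-of-the-art methods, SCOPE reduces relative point map error by 24.2% and temporal alignment error by 34.9% on ScanNet.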

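Insight: The novelty lies in viewpoint-invariant geometry alignment, appearance-invariant learning, and frequency-modulated positioning, which let the model handle sequences far longer than those seen in training and stay robust under complex camera trajectories and lighting variations.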

Abstract: We present SCOPE (Scale-Consistent One-Pass Estimation of 3D Geometry), a novel approach for estimating 3D geometry from extended monocular video sequences, where existing methods struggle to maintain both geometric accuracy and temporal consistency across hundreds of frames. Our approach generates affine-invariant 3D point maps with shared parameters across entire sequences, enabling consistent scale-invariant representations. We introduce three key innovations: viewpoint-invariant geometry aligning multi-perspective points in a unified reference frame; appearance-invariant learning enforcing consistency across exponential timescales; and frequency-modulated positioning enabling extrapolation to sequences vastly exceeding training length. Experiments across diverse datasets demonstrate significant improvements, reducing relative point map error by 24.2% and temporal alignment error by 34.9% on ScanNet compared to state-of-the-art methods. Our approach handles challenging scenarios with complex camera trajectories and lighting variations while efficiently processing extended sequences in a single pass. Project page: https://scope3d.github.io/.


[109] FLM-Occ: Feed-forward Likelihood Maximization for Efficient Indoor Occupancy Prediction cs.CV | cs.ROPDF

Guangcheng Chen, Lihuang Fang, Huaqi Tao, Yicheng He, Li He

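TL;DR: This paper proposes FLM-Occ, an efficient indoor occupancy prediction method built on the Feed-forward Likelihood Maximization (FLM) framework. FLM reformulates occupancy prediction as voxel distribution estimation: a network is trained to predict a mixture model that maximizes the likelihood over ground-truth occupied voxels, which avoids the spurious primitives that conventional methods produce in empty regions.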

Details

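Motivation: Existing indoor occupancy prediction methods based on Gaussian primitives are trained with voxel classification, which imposes only local constraints and provides no global supervision on the distribution of primitives. As a result, they predict spurious primitives in empty regions, hurting both representational and computational efficiency.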

Result: On the Occ-ScanNet benchmark, FLM-Occ achieves higher accuracy using only 32 superquadrics (2.7% of the number required by the prior SOTA method) while running 3.7 times faster.

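Insight: The main novelty is reformulating occupancy prediction as voxel distribution estimation through the FLM framework. Mixture weights are defined as normalized primitive volumes, which implicitly enforces the simplex constraint, and new voxelization formulas are derived. Together these enable end-to-end network training and voxelization of a standard mixture model, and allow randomly initialized primitives to relocate over long distances to model a scene.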

Abstract: Recent indoor occupancy prediction methods adopt Gaussian primitives as a sparse 3D representation for computational efficiency. However, their training relies on voxel classification, which imposes only local constraints and lacks global supervision on the distribution of the primitives. Therefore, they inevitably predict spurious primitives in empty regions, undermining both representational and computational efficiency. To address this, we propose Feed-forward Likelihood Maximization (FLM), a novel framework that reformulates occupancy prediction as voxel distribution estimation. In FLM, a network is trained to predict a mixture model that maximizes the likelihood over ground-truth occupied voxels in a feed-forward manner. To enable end-to-end training of networks and voxelization of a standard mixture model, we define mixture weights as normalized primitive volumes to implicitly enforce simplex constraints and derive novel voxelization formulas. Based on FLM, we present FLM-Occ, a novel method capable of relocating randomly initialized primitives over long distances to model a scene. On Occ-ScanNet, FLM-Occ achieves superior accuracy using only 32 superquadrics, 2.7% of the prior SoTA, while running 3.7 times faster.
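
A toy version of the likelihood objective helps show why normalized volumes enforce the simplex constraint and why the loss pulls primitives toward occupied voxels. This sketch swaps in isotropic Gaussians for the paper's superquadrics and takes each primitive's volume to be proportional to the cube of its scale; both choices are simplifying assumptions, and the coordinates are made up.

```python
import math

def mixture_weights(volumes):
    """Normalize primitive volumes so the weights are positive and sum to 1."""
    total = sum(volumes)
    return [v / total for v in volumes]

def gaussian_pdf(x, mean, sigma):
    """Density of an isotropic 3D Gaussian at point x."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** -1.5
    return norm * math.exp(-dist_sq / (2.0 * sigma ** 2))

def neg_log_likelihood(voxels, means, sigmas):
    """Mean negative log-likelihood of occupied voxel centers under the mixture."""
    weights = mixture_weights([s ** 3 for s in sigmas])  # assumed volume proxy
    total = 0.0
    for x in voxels:
        density = sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))
        total -= math.log(density + 1e-300)  # guard against log(0)
    return total / len(voxels)

# Hypothetical occupied voxel centers and two placements of two primitives.
voxels = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sigmas = [0.5, 0.5]
near = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
far = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)]

print(neg_log_likelihood(voxels, near, sigmas))
print(neg_log_likelihood(voxels, far, sigmas))
```

The loss is lower when primitives sit on occupied voxels than when they sit in empty space, and because every voxel sees every primitive, a misplaced primitive still receives a signal, which is the global supervision that per-voxel classification lacks.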


[110] OSOG: A Differentiable, Physics-Informed Synthetic Data Engine for Micro-Optical Environments cs.CV | cs.GR | physics.opticsPDF

Caio Silva

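TL;DR: This paper presents OSOG (Optical Synthetic Object Generator), a high-performance, fully differentiable forward-modeling engine that addresses the scarcity of deep learning training data in computational microscopy. Built on physical models of diffraction and phase retardation, it achieves fast, scalable synthetic data generation through an optimized PyTorch-native Structure-of-Arrays (SoA) architecture.

Details

Motivation: Deep learning in computational microscopy is limited by the scarcity of densely annotated datasets. Traditional graphics engines rely on geometric ray tracing and cannot capture the micro-optical phenomena that microscopy requires, while existing wave-optics formulations are too computationally expensive at the scale deep learning needs.

Result: The framework is validated along three axes: 1) a YOLOv11-OBB model trained only on OSOG-generated data achieves robust zero-shot transfer to real-world, highly occluded lysozyme micrographs; 2) DiffOSOG demonstrates exact recovery of continuous optical parameters via curriculum-guided inverse rendering; 3) OSOG bypasses the O(N) bottleneck of sequential ray tracing, synthesizing 40,000 complex wave-optics particles in under 50 milliseconds (>20 FPS) and demonstrating sub-linear scaling.

Insight: The novelty lies in mapping continuous optical path difference calculations onto a highly optimized PyTorch-native Structure-of-Arrays architecture, yielding a physics-informed, fully differentiable, high-performance synthetic data engine and a scalable tensor pipeline for real-time, on-demand dataset generation.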

Abstract: Deep learning in computational microscopy is severely constrained by the scarcity of densely annotated datasets. While synthetic data generation has bridged this gap in macroscopic computer vision, traditional graphics engines rely on geometric ray-tracing, failing to capture the micro-optical phenomena required for microscopy. Conversely, while wave-optics formulations exist, rendering them computationally tractable at the scale required for deep learning remains a massive systems challenge. To address this, we introduce the Optical Synthetic Object Generator (OSOG), a high-performance, fully differentiable forward-modeling engine. Drawing on established physical models of diffraction and phase retardation, OSOG maps continuous Optical Path Difference (OPD) calculations into a highly optimized, PyTorch-native Structure-of-Arrays (SoA) architecture. We validate this computational framework across three axes: First, object detection models (YOLOv11-OBB) trained purely on OSOG-generated data achieve robust zero-shot transfer to real-world highly occluded Lysozyme micrographs. Second, we introduce DiffOSOG, demonstrating that the engine’s end-to-end differentiability allows for the exact recovery of continuous optical parameters via curriculum-guided inverse rendering. Finally, OSOG bypasses the $\mathcal{O}(N)$ bottlenecks of sequential ray-tracing, demonstrating sub-linear scaling by synthesizing 40,000 complex wave-optic particles in under 50 milliseconds (>20 FPS). By providing a fast, scalable, and physically grounded tensor pipeline, OSOG enables true real-time, on-the-fly dataset generation.


[111] MIRCaps: A Large-Scale Mixed-Domain Dataset with Image-Level and Region-Level Captions for Fine-Grained Vision-Language Learning cs.CV | cs.AIPDF

Arlindo Luciano Tulumba Roberto, Hyungjoon Kim

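TL;DR: This paper introduces MIRCaps, a large-scale multimodal dataset containing about 141K images, 981K image-level captions, 1.742M region-level captions, and 1.391M bounding box annotations. By providing complementary image-level and region-level captions, the dataset is designed to help vision-language models learn fine-grained visual attributes. The authors validate it on two downstream tasks, image captioning and object detection, and release the dataset and code.

Details

Motivation: Current vision-language models (VLMs) lack mixed-domain image-caption datasets that serve both general-purpose use and CCTV-based video surveillance systems, which limits their ability to learn fine-grained visual attributes.

Result: Experiments show that lightweight VLMs, including SmolVLM-256M-Instruct, BLIP, BLIP2, and Qwen2.5-VL 3B-Instruct, can be effectively fine-tuned on the dataset for image captioning and object detection.

Insight: The novelty is a large-scale, mixed-domain, fine-grained dataset with both image-level and region-level captions. Its complementary annotation structure (an average of seven global captions per image and seven region captions per bounding box) guides models toward learning multiple attribute dimensions, including object category, size, color, action, state, and environmental context.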

Abstract: Despite recent progress in Vision-Language Models (VLMs), mixed-domain image-caption datasets for both general-purpose and CCTV-based video surveillance systems remain limited. To address this gap, we introduce a large-scale multimodal dataset comprising 141,364 images, 981,947 image-level captions, 1,742,264 region-level captions, and 1,391,779 bounding box annotations. Each image is associated with an average of seven image-level captions describing different aspects of the overall scene, as well as seven region-level captions for each annotated bounding box. These complementary caption types are designed to help VLMs learn fine-grained visual attributes, including object categories, estimated sizes, colors, actions, states, and surrounding environmental context. We demonstrate the effectiveness of the dataset on two important downstream tasks: image captioning and object detection. Experimental results show that lightweight VLMs, including SmolVLM-256M-Instruct, BLIP, BLIP2, and Qwen2.5-VL 3B-Instruct, can be effectively fine-tuned using our dataset. Our dataset and code are publicly available at https://zenodo.org/records/20418601.


[112] Lightweight 3D Feature Pretraining by Bayesian Inversion of 2D Foundation Models cs.CVPDF

Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera

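TL;DR: This paper presents Casper3D, a lightweight probabilistic framework that converts noisy multi-view 2D foundation-model embeddings into a latent 3D semantic representation. The method models view-level semantic features as noisy observations of an underlying 3D semantic state and infers that state with a set-based variational model that incorporates relative pose during multi-view reasoning.

Details

Motivation: The goal is to extract a unified 3D semantic representation efficiently and robustly from noisy multi-view 2D foundation-model features, in support of open-vocabulary 3D understanding.

Result: Experiments show that, compared with simple multi-view pooling, Casper3D produces more stable 3D semantics in ambiguous and noisy settings, particularly for open-vocabulary 3D understanding.

Insight: The novelty is a lightweight, backbone-agnostic probabilistic framework that uses Bayesian inversion and variational inference to combine the semantic embeddings of 2D foundation models with 3D geometry (relative pose), enabling robust inference of the 3D semantic state.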

Abstract: We present Casper3D, a lightweight probabilistic framework for converting noisy multi-view 2D foundation-model embeddings into a latent 3D semantic representation. We model view-level semantic features as noisy observations of an underlying 3D semantic state and infer this state with a set-based variational model that incorporates relative pose during multi-view reasoning. Casper3D is trained by predicting held-out semantic observations from novel viewpoints, while remaining aligned with visual and text semantic spaces for open-vocabulary 3D understanding. The framework is backbone-agnostic and applies to both language-aligned and self-supervised embeddings. Experiments show that Casper3D produces more stable 3D semantics than simple multi-view pooling, especially in ambiguous and noisy settings.


[113] EnTrust: Modeling Inter-Modal Conflict for Trustworthy Multimodal Medical Image Analysis cs.CV | cs.AIPDF

Dwarikanath Mahapatra, Abhijit Das, Behzad Bozorgtabar, Zongyuan Ge, Sudipta Roy

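TL;DR: EnTrust is a framework for trustworthy multimodal medical image analysis that models inter-modal conflict as the primary source of predictive uncertainty. It disentangles multimodal features into shared anatomical consensus, modality-specific cues, and spatially localized conflict signals, uses a diffusion model to generate segmentation hypotheses, and converts hypothesis divergence into calibrated pixel-wise uncertainty through TrustMap, helping clinicians understand why a prediction is unreliable.

Details

Motivation: Current multimodal medical image segmentation models handle inter-modal inconsistency poorly. They either use deterministic fusion that averages away disagreement or rely on post-hoc uncertainty estimation decoupled from the fusion process. Both obscure the root cause of unreliable predictions and cannot answer the clinically critical question: why is this prediction unreliable?

Result: Across four benchmarks spanning brain, cardiac, lesion, and oncology domains, EnTrust achieves state-of-the-art segmentation accuracy while reducing calibration error by 40% compared with the strongest baseline. Notably, it outperforms 5x deep ensembles using a single model at roughly half the memory footprint.

Insight: The novelty lies in explicitly modeling inter-modal conflict as the main source of uncertainty. Feature disentanglement (the EnFuse module), a diffusion-based generative segmentation model (SegDiff), and TrustMap, which combines ensemble entropy, conflict-guided perturbation probing, and a learned calibration head, together yield interpretable and calibrated uncertainty estimates that indicate not only where the uncertainty lies but also why.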

Abstract: Multimodal medical imaging fuses complementary anatomical and functional information, yet modalities frequently disagree in pathologically heterogeneous regions. Current segmentation models handle this in one of two inadequate ways: deterministic fusion that averages away disagreement, or post-hoc uncertainty estimation decoupled from the fusion process that produces it. Both obscure the clinically critical question: why is this prediction unreliable? We present EnTrust, a framework that treats inter-modal conflict as the primary source of predictive uncertainty. Our EnFuse module decomposes multimodal features into three disentangled components: shared anatomical consensus (F_c), modality-specific cues (F_{u,m}), and spatially localized conflict signals (F_{cf}), with independence enforced via a cross-covariance objective. This structured decomposition conditions SegDiff, a diffusion-based generative segmentation model whose sampled hypotheses diverge specifically in regions of modal disagreement. TrustMap then translates this hypothesis divergence into calibrated, pixel-wise uncertainty using ensemble entropy, conflict-guided perturbation probing, and a learned calibration head, enabling clinicians to understand not only where predictions are uncertain, but why. Across four benchmarks spanning brain, cardiac, lesion, and oncology domains, EnTrust achieves state-of-the-art segmentation accuracy while reducing calibration error by 40% compared to the strongest baseline. Notably, it outperforms 5x deep ensembles using a single model at roughly half the memory footprint. Code and checkpoints are available at https://github.com/GenMI-Lab/EnTrust.git.


[114] Semi-Supervised Vision-Language-Action Model cs.CV | cs.ETPDF

Hongyang He, Jiuming Liu, Victor Sanchez

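TL;DR: This paper proposes SemiVLA, a semi-supervised vision-language-action model that reduces the dependence on costly action-labeled data in robotic tasks. The method uses a self-distilled teacher-student framework to learn reliable pseudo-actions from unlabeled trajectories and introduces a reliability controller that assesses vision-language alignment, action feasibility, and temporal consistency.

Details

Motivation: Existing vision-language-action (VLA) models depend heavily on costly action-labeled demonstrations when adapting to new environments, which limits their scalability. This paper studies how to adapt VLA models effectively in a semi-supervised setting where only a small portion of trajectories carry robot action labels and most provide only unlabeled vision-language observations.

Result: On the LIBERO and CALVIN benchmarks, SemiVLA consistently improves multiple parameter-efficient fine-tuning strategies. With only 10% labeled trajectories, SemiVLA with Selective LoRA reaches an 89.0% average success rate on LIBERO, 8.0 percentage points higher than supervised LoRA, with no added inference cost.

Insight: The core novelty is a semi-supervised learning framework tailored to VLA tasks. Its key elements are a VLA-specific reliability controller that produces high-quality pseudo-action labels and a Bottleneck-Projected Alignment Update that keeps noisy feedback from contaminating the teacher model. This offers a new approach to efficiently fine-tuning large models in embodied settings where labeled data is scarce.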

Abstract: Vision-Language-Action (VLA) models enable robots to predict actions directly from visual observations and language instructions, but adapting them to new environments still depends on costly action-labeled demonstrations. To reduce this dependence, we study semi-supervised VLA adaptation under limited supervision signals, where only a small portion of trajectories contain robot actions and the remaining trajectories provide action-unlabeled vision-language observations. Unlike standard semi-supervised learning, the missing supervision is an embodied action signal that must be visually grounded, language-consistent, physically feasible, and temporally stable. To address this problem, we propose SemiVLA, a self-distilled teacher-student framework that learns from reliable pseudo-actions on unlabeled trajectories. SemiVLA introduces a VLA-specific reliability controller to assess vision-language alignment, action feasibility, and temporal transition consistency, and further updates the teacher through a Bottleneck-Projected Alignment Update to avoid noisy feedback contamination. With OpenVLA as the backbone, SemiVLA consistently improves multiple PEFT strategies across LIBERO and CALVIN. Under 10% labeled trajectories, SemiVLA with Selective LoRA achieves 89.0% average success on LIBERO, outperforming supervised LoRA by 8.0 points without extra inference cost.


[115] Compressing Observation History into Agent Memory: Distilling Transformers into Recurrent Transformers cs.CV | cs.LGPDF

Philippe Weinzaepfel, Christian Wolf, Bülent Mert Sariyildiz, Guillaume Bono, Gianluca Monaci

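TL;DR: This paper proposes a distillation method that transfers the compression strategy of a full-history Transformer to a recurrent Transformer, addressing the weaker performance of recurrent models on long sequences. A teacher model is designed to explicitly compress the observation history into a fixed-size bottleneck representation, which then supervises the student's memory, substantially narrowing the performance gap between recurrent models and full-history Transformers.

Details

Motivation: In long-horizon streaming vision and robotics applications such as map-free pose estimation, storing and maintaining an observation history is impractical. Recurrent Transformers handle this with fixed-size memory, but their performance lags behind full-history Transformers. The authors argue that this gap stems from differences in how the models learn to compress past information rather than from architectural limitations.

Result: The method trains a recurrent latent robotic memory with linear-time complexity and substantially narrows the performance gap to full-history Transformers. No specific benchmarks or SOTA comparisons are mentioned.

Insight: The novelty is aligning the compression mechanisms of teacher and student through distillation, so that the recurrent model learns a more effective memory strategy. Viewed objectively, this offers a new route to integrating the sequence-modeling capability of Transformers into recurrent architectures efficiently, which suits resource-constrained real-time applications.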

Abstract: Transformers are AI’s workhorse with strong performance in modeling sequential data, but their computational cost becomes prohibitive when processing long sequences. We target long-horizon streaming vision and robotics applications like map-free pose estimation, where it is particularly impractical to store and maintain a history of observations. Recurrent Transformers address this limitation by maintaining fixed-size memory, but their performance lags behind that of transformers operating over the full observation history. We argue that this gap does not stem from architectural limitations, but from differences in how these models learn to compress past information. Without access to an observation history, recurrent models must explicitly decide what to retain in memory at each step, a significantly harder learning problem. In this work, we propose a distillation approach that transfers the compression strategy of a classical full-history transformer to a recurrent variant. We enable this by designing a teacher model that explicitly compresses its observation history into a fixed-size bottleneck representation. By directly supervising the student’s memory with this bottleneck representation, we align the two compression mechanisms. We show that this approach makes it possible to train a recurrent latent robotic memory with linear-time complexity while substantially narrowing the performance gap to full-history transformers.


[116] WildBox: A Dataset and Benchmark for Aerial Monocular 3D Detection of African Savanna Wildlife cs.CVPDF

Vandita Shukla, Kilian Meier, Lucie Laporte-Devylder, Camille Rondeau Saint-Jean, Jenna M. Kline

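TL;DR: This paper introduces WildBox, a dataset and benchmark for monocular 3D detection of wildlife from drone video, comprising 237,505 3D bounding box annotations across seven African savanna species grouped into six benchmark classes. The authors evaluate two open-vocabulary monocular 3D architectures (OVMono3D-LIFT and DetAny3D) under zero-shot, ground-truth 2D box prompt, and supervised fine-tuning protocols. Zero-shot 3D detection fails completely, fine-tuning recovers some performance, and monocular aerial depth estimation is identified as the main challenge.

Details

Motivation: Monocular 3D detection of wildlife from drone video lacks a dedicated dataset and benchmark; this work fills that gap to enable progress in the area.

Result: On WildBox, zero-shot 3D detection yields 0.00 AP. Fine-tuning recovers performance to 8.68 +/- 0.47 AP-BEV@0.50 and 13.17 +/- 0.69 AP3D macro. Depth error is the dominant contributor to normalised Hausdorff distance (84% after fine-tuning, over 99% in zero-shot). A coarse-to-fine curriculum (pretraining on a merged zebra class, then fine-tuning on the zebra subclasses) improves macro 3D performance with less total compute.

Insight: The novelty is the release of WildBox, which the summary describes as the first dataset and benchmark dedicated to monocular 3D wildlife detection from drone video. The analysis shows that the zero-shot 3D detection ability of existing open-vocabulary models collapses entirely in this domain and identifies monocular aerial depth estimation as the key bottleneck. The proposed coarse-to-fine curriculum is also an effective training strategy.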

Abstract: We introduce WildBox, a dataset and benchmark for monocular 3D detection of wildlife from drone video, comprising 237,505 3D bounding box annotations across seven African savanna species grouped into six benchmark classes. Annotations follow a KITTI/Omni3D-compatible format in a per-segment scale-normalised camera frame, with instance identities maintained across each segment. We evaluate two open-vocabulary monocular 3D architectures, OVMono3D-LIFT and DetAny3D, under zero-shot, ground-truth 2D box prompt, and supervised fine-tuning protocols. Open-vocabulary 2D foundation models provide usable zero-shot wildlife localisation (50.55 AP@50), but zero-shot 3D detection collapses to 0.00 AP across both architectures and every 2D-input condition tested, including ground-truth 2D box prompts, thus isolating the failure to the 3D stage. Fine-tuning on WildBox recovers performance to 8.68 +/- 0.47 AP-BEV@0.50 and 13.17 +/- 0.69 AP3D macro. Depth contributes 84% of normalised Hausdorff distance after fine-tuning and over 99% in zero-shot, identifying monocular aerial depth as the dominant open problem in this regime. A coarse-to-fine curriculum, i.e. pretraining on a merged zebra class before fine-tuning on the Grevy’s/plains split, improves macro 3D performance with less total compute, with the largest gains on the two zebra subclasses. WildBox is released with video-level splits, evaluation code, and baseline checkpoints to enable progress in 3D wildlife perception from drone video.


[117] $φ$-Scene: Physically Grounded Image-to-3D Scene Reconstruction cs.CVPDF

Haodong Li, Lulu Shao, Haolin Lu, Yu Fu, Yen-Ru Chen

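TL;DR: This paper proposes $φ$-Scene, a physically grounded method for open-vocabulary, compositional 3D scene reconstruction from a single image. It treats reconstruction as a topology-driven physical assembly process: it infers support relations between objects, orders them accordingly, and progressively settles each object under physical constraints to produce a physically plausible scene.

Details

Motivation: Existing methods mostly treat scene reconstruction as a visual and geometric prediction problem. Their outputs often contain artifacts such as floating objects, interpenetrations, or unstable contacts, which limits physical validity and usability in downstream applications such as simulation and robotics.

Result: Experiments on the 3D-Front dataset show that $φ$-Scene achieves the strongest overall performance among out-of-domain methods and remains highly competitive with in-domain baselines. Human and vision-language model evaluations both show a strong preference for it in visual quality, reference alignment, and physical plausibility. Dedicated physical plausibility metrics confirm that it substantially reduces penetration artifacts and yields lower post-simulation drift.

Insight: The core novelty is redefining scene reconstruction as a topology-driven physical assembly problem, in which the scene is treated as a stable physical system. By combining signed-distance-field-based optimization with rigid-body simulation, it resolves penetrations and establishes stable contacts object by object in topological order, producing more physically credible 3D scenes.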

Abstract: Reconstructing compositional 3D scenes from a single image is a fundamental challenge in 3D world modeling. Recent methods can recover high-fidelity, complete 3D objects and predict plausible scene arrangements, but most still treat scene reconstruction primarily as a visual and geometric prediction problem. Their outputs may therefore contain floating objects, interpenetrations, or unstable-contact artifacts, limiting their physical validity and downstream usability in simulation, robotics, and interactive environments. We present $φ$-Scene, a physically grounded approach to open-vocabulary and compositional image-to-3D scene reconstruction. The key premise is that a reconstructed scene should not be treated merely as a set of objects with predicted poses, but as a stable physical system. Accordingly, $φ$-Scene formulates reconstruction as topology-driven physical assembly: it infers how objects support one another, orders them accordingly, and progressively settles each object against its already stabilized support context. For each object in topological order, SDF-based optimization first resolves penetrations against the pre-settled support context, and rigid-body simulation then settles the object into a stable contact configuration under real-world physical constraints. Experiments on 3D-Front show that $φ$-Scene achieves the strongest overall performance among out-of-domain methods and remains highly competitive with in-domain baselines. Human and VLM evaluations further show strong preference for $φ$-Scene in visual quality, reference alignment, and physical plausibility. Finally, dedicated physical plausibility metrics covering static contact quality and dynamic stability demonstrate that $φ$-Scene substantially reduces penetration artifacts while producing much lower post-simulation drift, indicating more stable and physically grounded 3D scenes.


[118] Synergistic Dual-Branch Adaptation for Multi-modal Generalized Category Discovery cs.CVPDF

Yuxun Qu, Minyu Zhou, Yongqiang Tang, Chenyang Zhang, Wensheng Zhang

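TL;DR: This paper proposes the Synergistic Dual-Branch Adaptation (SDBA) framework to strengthen multi-modal Generalized Category Discovery (GCD). A cross-modal synergistic adapter injects visual information during encoding to improve text feature learning, and a neighborhood mutual learning module provides fine-grained relational supervision, improving existing dual-branch methods.

Details

Motivation: Cross-modal synergy in existing dual-branch multi-modal GCD methods is coarse and incomplete: the two modalities are encoded independently, the bias and noise in the text are left unaddressed, and mutual learning strategies operate only on global class-level anchors, lacking fine-grained supervision.

Result: Extensive experiments on six benchmarks show state-of-the-art performance, with consistent gains on different baselines such as GET and TextGCD, supporting the framework's broad applicability.

Insight: The main novelty is the plug-and-play SDBA framework. Its core components are a cross-modal synergistic adapter that injects visual information at the encoder layers to enhance text features, and neighborhood mutual learning based on bidirectional KL divergence that provides fine-grained local relational supervision, achieving more effective synergy between the visual and text modalities.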

Abstract: Generalized Category Discovery (GCD) aims to classify old categories and discover new ones from unlabeled data. Recent multi-modal approaches introduce retrieved or synthesized texts into a dual-branch architecture to provide semantic cues complementary to visual features. However, the cross-modal synergy in existing dual-branch methods remains coarse and incomplete: the two modalities are encoded independently with the bias and noise in the derived text left unaddressed during encoding, and existing mutual learning strategies operate only on global class-level anchors, lacking fine-grained relational supervision. To address these limitations, we propose the Synergistic Dual-Branch Adaptation (SDBA) framework, which serves as a plug-and-play enhancement compatible with existing dual-branch methods such as GET and TextGCD. SDBA comprises two components: the cross-modal synergistic adapter inserts lightweight adapters into both branches and further injects visual information into the text adapter at each encoder layer to enhance text feature learning during encoding; the neighborhood mutual learning module enforces consistent local neighborhood distributions between the two branches via bidirectional KL divergence, providing fine-grained relational supervision for both old and new classes. Extensive experiments on six benchmarks demonstrate state-of-the-art performance, and consistent improvements on different baselines validate the broad scalability of the proposed framework.
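
The neighborhood mutual learning term can be sketched as a bidirectional KL divergence between two neighborhood distributions. The sketch below is an illustration under assumptions, not the paper's loss: each branch is assumed to supply similarity scores between one sample and the same set of neighbors, a temperature-scaled softmax turns those scores into a distribution, and the temperature value and function names are hypothetical.

```python
import math

def softmax(scores, temperature=0.1):
    """Turn similarity scores into a probability distribution over neighbors."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bidirectional_kl(visual_sims, text_sims, temperature=0.1):
    """Symmetric consistency penalty between the two branches' neighborhoods."""
    p = softmax(visual_sims, temperature)
    q = softmax(text_sims, temperature)
    return kl(p, q) + kl(q, p)

agree = bidirectional_kl([0.9, 0.1, 0.2], [0.9, 0.1, 0.2])
disagree = bidirectional_kl([0.9, 0.1, 0.2], [0.1, 0.9, 0.2])
```

The penalty vanishes when both branches rank the neighbors identically and grows as they disagree. Because it needs no class labels, a term of this shape can apply to old and new classes alike, which matches the abstract's description.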


[119] T-MOR: Learning Motion-Aware Skeleton Representations for Human Action Recognition cs.CVPDF

Di Yang, Mahmoud Ali, Quan Kong, Gianpiero Francesca, Francois Bremond

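TL;DR: This paper proposes the T-MOR framework, which uses multi-modal contrastive learning to align skeleton sequences with video and text representations, addressing the lack of explicit motion modeling in existing vision-language models for fine-grained, human-centric action recognition. The framework is pre-trained on the newly constructed large-scale PoseCap-1M dataset, needs only lightweight skeleton input at inference, and improves performance on multiple action recognition benchmarks.

Details

Motivation: Existing vision-language models such as CLIP rely mainly on appearance-level supervision and do not explicitly model human motion, whereas action recognition is fundamentally defined by temporal structure and body-grounded movement. Transferable, motion-aware representations are therefore needed.

Result: On multiple action recognition benchmarks, including Toyota Smarthome, Penn Action, UAV-Human, TSU, and Charades (covering action classification and frame-wise temporal detection), T-MOR consistently improves performance and shows strong generalization in few-shot and zero-shot settings.

Insight: The novelty is a transferable representation learning framework centered on skeleton motion, which aligns motion with visual and textual information through multi-modal contrastive learning, together with PoseCap-1M, a large-scale dataset of synchronized video-skeleton-text triplets for pre-training. The work highlights the effectiveness of embodied representations for transferable action understanding.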

Abstract: Vision-language models such as CLIP have recently achieved strong performance on a wide range of visual understanding tasks. However, most existing models rely primarily on appearance-level supervision from images or videos and do not explicitly model human motion, which is essential for fine-grained, human-centric action recognition, as actions are defined by temporally structured and physically grounded body movements. To address this problem, we propose Transferable skeleton MOtion Representation (T-MOR), a motion-aware framework that learns transferable action representations from skeleton sequences with the aid of video and language supervision during training. T-MOR adopts a multi-modal contrastive learning scheme that aligns skeleton motion with visual and textual representations, while performing inference using only lightweight skeleton inputs. To support large-scale pre-training, we construct PoseCap-1M, a new dataset that contains over one million synchronized video, skeleton, and text triplets covering diverse human activities. We evaluate T-MOR on a range of human-centric action recognition benchmarks, including action classification and frame-wise temporal detection. Experimental results show that T-MOR consistently improves performance across multiple datasets, such as Toyota Smarthome, Penn Action, UAV-Human, TSU, and Charades. In addition, T-MOR demonstrates strong generalization ability in few-shot and zero-shot settings, highlighting the effectiveness of motion-centric and embodied representations for transferable action understanding.


[120] Cross-Modal Corroboration for Annotation-Free Wildlife Monitoring cs.CV | cs.AIPDF

Bharath Pillai, Varun Viswapriyan, Christopher Stewart, Tanya Berger-Wolf, Jenna Kline

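TL;DR: This paper proposes using cross-modal agreement for annotation-free wildlife monitoring. Independently derived activity curves from the visual and acoustic modalities are aligned with each other and with known behavioral priors, and this three-way alignment serves as annotation-free validation. Validated on a breeding herd of Milu deer, the method recovers species activity patterns consistent with known behavioral ecology.

Details

Motivation: To address the scarcity of annotated data in wildlife monitoring, expert knowledge of species activity patterns is used as an annotation-free validation signal, supporting automated analysis in conservation-scale deployments.

Result: On a breeding herd of Milu deer, the visual and acoustic modalities each independently recover activity patterns consistent with known deer behavioral ecology, with minimal manual annotation.

Insight: The novelty is proposing cross-modal agreement as an annotation-free validation signal. Three-way alignment (agreement between modalities and alignment with behavioral priors) rules out shared-data confounds and offers a practical path toward self-validating monitoring pipelines at conservation scale.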

Abstract: Scaling wildlife monitoring for real-world conservation deployments requires automated analysis of smart sensors that operate under severe annotation scarcity. We propose leveraging expert knowledge of species activity patterns as an annotation-free validation signal for multimodal monitoring pipelines. We operationalize agreement as the alignment of independently derived hourly activity curves both with each other and with published behavioral priors, a three-way convergence that rules out shared-data confounds and dataset-internal correlation as alternative explanations. Our vision pipeline combines zero-shot species detection via BioCLIP 2, sliced inference to handle deployment-constrained camera positioning, and geometry-based geographic localization from camera trap imagery. Our acoustic pipeline detects species vocalizations via a fine-tuned classifier. We validate the pipeline on a breeding herd of Milu deer and demonstrate that both modalities independently recover activity patterns consistent with known deer behavioral ecology with minimal manual annotation. The framework applies to species detectable in both visual and acoustic modalities for which behavioral priors are documented in the literature, suggesting a practical path toward self-validating wildlife-monitoring pipelines at conservation scale.
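
One way to picture the three-way convergence check is as pairwise agreement between three hourly activity curves. The abstract does not say which agreement measure is used, so the sketch below assumes Pearson correlation and takes the weakest of the three pairwise scores. The counts and function names are invented for illustration only.

```python
import math

def normalize(counts):
    """Convert hourly detection counts into a distribution over the 24 hours."""
    total = sum(counts)
    return [c / total for c in counts]

def pearson(a, b):
    """Pearson correlation of two equal-length curves (assumes neither is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def three_way_agreement(visual, acoustic, prior):
    """Weakest pairwise correlation among the two modalities and the published prior."""
    v, a, p = normalize(visual), normalize(acoustic), normalize(prior)
    return min(pearson(v, a), pearson(v, p), pearson(a, p))

# Illustrative hourly counts with dawn and dusk peaks (made-up numbers).
visual = [1, 1, 1, 1, 2, 5, 9, 7, 3, 2, 1, 1, 1, 1, 1, 2, 4, 8, 9, 5, 2, 1, 1, 1]
acoustic = [2 * c for c in visual]       # same shape, different detection rate
prior = [c + 1 for c in visual]          # same shape with a constant offset
inverted = [10 - c for c in visual]      # a prior that contradicts both modalities
```

Taking the minimum means a high score requires all three curves to agree, so two pipelines that merely share an artifact do not pass unless the independent behavioral prior also matches.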


[121] $μ$Match: Foundation Models for Semi-supervised Learning and Domain Adaptation in EM cs.CVPDF

Marei Freitag, Olesia Korchevaia, Luca Freckmann, Anwai Archit, Constantin Pape

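TL;DR: This paper proposes $μ$Match, a framework that uses vision foundation models (such as SAM and DINOv2/v3) for semi-supervised learning and domain adaptation, applied to mitochondrion, nucleus, and neurite segmentation in electron microscopy (EM) images to reduce annotation requirements and improve performance.

Details

Motivation: Electron microscopy is a key modality for analyzing cellular ultrastructure, but EM segmentation typically relies on supervised learning and extensive manual annotation. The use of foundation models in EM is limited by the diversity of tasks and the scarcity of annotated data, so semi-supervised and domain adaptation methods are needed to lower annotation cost.

Result: On challenging EM segmentation tasks (mitochondrion, nucleus, and neurite segmentation), $μ$Match combines student-teacher methods with several foundation models and achieves consistent improvements over strong baselines, showing potential to reduce annotation effort.

Insight: The novelty is integrating current foundation models into a semi-supervised learning and domain adaptation framework for multi-task EM segmentation and systematically evaluating different models (including SAM, $μ$SAM, and DINOv2/v3), providing a scalable approach to EM ultrastructure analysis.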

Abstract: Vision foundation models have substantially advanced computer vision, enabling state-of-the-art performance in zero- and few-shot settings. They have been successfully applied to biomedical imaging tasks ranging from organ segmentation in computed tomography to cell segmentation in light microscopy. Electron microscopy (EM) is a central modality for analyzing cellular ultrastructure due to its nanometer-scale resolution. However, the application of foundation models in EM has so far been limited to specific organelles, such as mitochondria, largely due to the diversity of segmentation tasks and the scarcity of comprehensively annotated data. As a result, EM segmentation still predominantly relies on supervised learning, requiring extensive manual annotation and limiting ultrastructural analysis. To address this gap, we propose $μ$Match, a framework for semi-supervised learning and domain adaptation that leverages foundation models. We implement state-of-the-art student-teacher-based methods and evaluate multiple foundation models (SAM, SAM2, $μ$SAM, DINOv2/v3) on challenging EM tasks, including mitochondrion, nucleus, and neurite segmentation. Our results demonstrate consistent improvements over strong baselines and highlight a path toward substantially reducing the annotation effort in EM.


[122] The Unreasonable Effectiveness of VLMs for Zero-shot Procedural Mistake Detection cs.CV | cs.AIPDF

Serdar Ozsoy, Lars Doorenbos, Federico Spurio, Gianpiero Francesca, Juergen Gall

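TL;DR: This paper proposes ZeProM, a zero-shot procedural mistake detection framework that uses a pre-trained video-language model (VLM) to solve procedural mistake detection and temporal action segmentation jointly, with no task-specific training data. On two benchmarks (EgoPER and CaptainCook4D), the method approaches or even surpasses fully supervised methods in the zero-shot setting.

Details

Motivation: Existing procedural mistake detection methods depend on multi-stage pipelines and task-specific supervised training, which limits their wider applicability. This paper explores the potential of pre-trained VLMs to solve the problem zero-shot, moving away from complex pipelines toward a more general solution.

Result: On the EgoPER and CaptainCook4D benchmarks, ZeProM approaches or surpasses fully supervised methods in the zero-shot setting. For example, averaged over all five EgoPER tasks, it improves EDA by 4.4 points and F1@.5 by 2.0 points.

Insight: The novelty is a unified zero-shot framework in which a single pre-trained VLM jointly handles mistake detection and action segmentation, avoiding complex pipelines and task-specific training. This demonstrates the strong generalization of pre-trained VLMs on procedural understanding tasks and points the field toward simpler, more general solutions.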

Abstract: Procedural mistake detection is important for quality control and user assistance across many disciplines. Recent work in this field has achieved significant gains by using the reasoning capabilities of Video-Language Models (VLMs) as components within multi-stage pipelines, which consist of separate modules for supervised temporal action segmentation, error detection, and explainability. Consequently, they remain dependent on tailored training datasets and require task-specific training, limiting their wider applicability. To remedy this, we introduce zero-shot procedural mistake detection and propose a unified Zero-shot Procedural Mistake detection (ZeProM) framework that jointly solves procedural mistake detection and temporal action segmentation with a single pre-trained VLM. By evaluating our framework on two canonical mistake detection benchmarks, EgoPER and CaptainCook4D, we find that ZeProM can perform these tasks successfully, while approaching, or even outperforming, the performance of fully supervised methods. For instance, we achieve a 4.4 point improvement in EDA and a 2.0 point improvement in F1@.5 on average over all five EgoPER tasks compared to the strongest supervised methods. Overall, our results show the potential of unified methods for procedural mistake detection, and we hope this will steer the field away from highly complex pipelines and toward more generally applicable solutions.


[123] Enlight: Fast Low-Light Image Enhancement via Multi-Objective Optimization and Shadow-Aware Refinement cs.CVPDF

Nirjhor Datta, M. Sohel Rahman

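TL;DR: This paper presents ENLIGHT, a fast, training-free low-light image enhancement framework based on direct optimization of a perceptual objective. It uses a two-stage global-to-local strategy: the first stage adjusts global illumination to improve visibility, and the second selectively enhances low-intensity regions through shadow-aware local optimization. The framework offers Fast and Ultrafast modes to balance quality and efficiency and achieves competitive perceptual quality on multiple benchmark datasets with substantially lower inference time.

Details

Motivation: Existing deep learning methods require large-scale training data and supervision. This work proposes a zero-shot, training-free, fast low-light enhancement method that optimizes image quality directly at inference time.

Result: Experiments on multiple benchmark datasets, including BAID, Backlit300, LIME, MEF, NPE, and DICM, show that ENLIGHT achieves competitive results on perceptual quality metrics (MUSIQ, NIQE, BRISQUE) with substantially lower inference time than learning-based methods.

Insight: The novelty is a two-stage zero-shot optimization framework based on multi-objective optimization (combining entropy, gradient preservation, and noise regularization) and shadow-aware refinement. It delivers fast, interpretable low-light enhancement without training and offers a practical alternative to learning-based methods.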

Abstract: We present ENLIGHT, a fast and training-free framework for low-light image enhancement based on direct optimization of a perceptual objective. Unlike deep learning approaches that require large-scale training data and supervision, ENLIGHT operates in a zero-shot manner by optimizing image quality at inference time. The method employs a two-stage global-to-local optimization strategy. In the first stage, ENLIGHT performs global illumination adjustment to improve visibility while maintaining structural consistency and avoiding excessive noise enhancement. In the second stage, a shadow-aware refinement selectively improves low-intensity regions through masked local optimization, enhancing visibility without overexposure. To balance quality and efficiency, we introduce two modes: Fast, which uses a multi-objective formulation combining entropy, gradient preservation, and noise regularization, and Ultrafast, which reduces computational cost via a lightweight approximation of the same objective. The framework is optimizer-agnostic and supports both evolutionary and lightweight local search methods. Experiments on BAID, Backlit300, LIME, MEF, NPE, and DICM demonstrate that ENLIGHT achieves competitive perceptual quality (MUSIQ, NIQE, BRISQUE) with significantly lower inference time. Qualitative results further show improved contrast, preserved structural details, and controlled noise amplification, making ENLIGHT a practical and interpretable alternative to learning-based methods.
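
To make the notion of training-free, inference-time optimization concrete, here is a toy sketch of a global illumination stage. It is not ENLIGHT's objective: the abstract names entropy, gradient preservation, and noise regularization without giving formulas, so this sketch assumes a global gamma curve, a histogram-entropy reward, and a single penalty on gradient amplification standing in for the remaining terms. The candidate gammas, the weight, and all function names are hypothetical.

```python
import math

def apply_gamma(pixels, gamma):
    """Global illumination adjustment on intensities in [0, 1]; gamma < 1 brightens."""
    return [p ** gamma for p in pixels]

def histogram_entropy(pixels, bins=16):
    """Shannon entropy (bits) of the intensity histogram."""
    counts = [0] * bins
    for p in pixels:
        counts[min(int(p * bins), bins - 1)] += 1
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def mean_abs_gradient(pixels):
    """Mean absolute difference between neighbouring pixels of a 1-D scanline."""
    return sum(abs(b - a) for a, b in zip(pixels, pixels[1:])) / (len(pixels) - 1)

def score(original, enhanced, noise_weight=0.1):
    """Toy scalarized objective: reward entropy, penalize gradient amplification."""
    gain = mean_abs_gradient(enhanced) / (mean_abs_gradient(original) + 1e-8)
    return histogram_entropy(enhanced) - noise_weight * gain

def best_gamma(pixels, candidates=(1.0, 0.8, 0.6, 0.4)):
    """Exhaustive search over a few candidate gammas; no training involved."""
    return max(candidates, key=lambda g: score(pixels, apply_gamma(pixels, g)))

dark_scanline = [0.02, 0.05, 0.03, 0.08, 0.04, 0.06, 0.01, 0.07]
gamma = best_gamma(dark_scanline)
enhanced = apply_gamma(dark_scanline, gamma)
```

The grid search could be replaced by an evolutionary or local search method, in line with the abstract's remark that the framework is optimizer-agnostic. A shadow-aware second stage would apply a similar search only to masked low-intensity regions.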


[124] UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating cs.CVPDF

Jiehui Huang, Yuechen Zhang, Bin Xia, Jiahao Wang, Xu He

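TL;DR: UnityShots is a memory-driven multi-shot audio-video generation system built on LTX-2.3. A boundary-aware gating mechanism maintains fixed-size long-term and short-term memory slots to preserve cross-shot consistency, and a reference speaker token preserves vocal timbre. The paper also releases a benchmark of multi-shot sequences spanning multiple cultures and languages.

Details

Motivation: Existing approaches to coherent multi-shot video generation are limited: end-to-end methods cannot scale, shot-by-shot methods use memory that grows linearly, and LLM-planner-based methods lack a multi-shot-aware backbone. A new method is needed that manages cross-shot memory effectively and keeps subject appearance, scene context, and speaker identity consistent.

Result: Evaluated across the I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot metrics.

Insight: The novelty is a boundary-aware gating mechanism that updates fixed-size long-term and short-term visual memory slots, plus injecting a reference speaker token instead of a sliding audio bank to keep audio consistent. In addition, a discrete cut-type prior learned through AdaLN serves as an inference-time control knob over transition strength.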

Abstract: Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3, trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots, a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of $200$ multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.
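
A gated update of a fixed-size memory slot can be sketched generically as below. This is only an illustration of gating under assumptions, not the UnityShots architecture: the gate is assumed to be a logistic function of a visual cut probability and a beat-tracker strength, the update is assumed to be a convex blend of old slot contents and new features, and the weights, bias, and function names are hypothetical.

```python
import math

def boundary_gate(cut_prob, beat_strength, w_cut=4.0, w_beat=2.0, bias=-3.0):
    """Fuse two boundary signals into a gate in (0, 1); weights are made-up values."""
    z = w_cut * cut_prob + w_beat * beat_strength + bias
    return 1.0 / (1.0 + math.exp(-z))

def update_slot(slot, new_features, gate):
    """Convex blend: the slot keeps its size no matter how many shots are seen."""
    return [(1.0 - gate) * s + gate * n for s, n in zip(slot, new_features)]

short_term = [0.0, 0.0, 0.0, 0.0]
tail_features = [1.0, 1.0, 1.0, 1.0]

strong_cut = boundary_gate(cut_prob=1.0, beat_strength=1.0)
no_cut = boundary_gate(cut_prob=0.0, beat_strength=0.0)

after_strong_cut = update_slot(short_term, tail_features, strong_cut)
after_no_cut = update_slot(short_term, tail_features, no_cut)
```

Because each update overwrites the slot in place, memory cost stays constant in the number of shots, in contrast to the linearly growing memory banks the abstract attributes to shot-by-shot methods.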


[125] HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning cs.CV | cs.AIPDF

Awais Rauf, Ahmed Hasssan, Greg Slabaugh

TL;DR: 本文提出了一种名为分层程序化探测(HPP)的框架,用于解决长视频理解任务。该框架将语义感知与高阶时序推理解耦,通过将长视频理解重新定义为对分层分割视频的迭代式、程序化探索来实现。具体而言,一个具备编码能力的LLM在交互式编码环境中规划并执行多步策略,按需探测视频信息并调用VLM进行局部感知。

Details

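Motivation: Current vision-language models (VLMs) handle perception and multi-step planning together within a single forward pass. This coupled paradigm is limited by the LLM's capacity to discover and execute multi-step strategies in its latent representations, which becomes the bottleneck for long video understanding.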

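Result: On the LongVideoBench benchmark, decoupling perception from reasoning yields substantial gains. Further results on other long-video benchmarks, including EgoSchema, VideoMME, and MLVU, also support the effectiveness of the approach.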

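Insight: The core idea is to decouple complex video understanding into perception and reasoning, and to carry it out through a programmatic, interactive, multi-step probing strategy. The specific technical contributions are information-density-aware hierarchical video segmentation, late-interaction semantic retrieval, and structured probing functions for coarse-to-fine temporal localization.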

Abstract: Understanding long videos requires fine-grained perception and multi-step, higher-order reasoning over complex, long-range spatio-temporal dynamics. Vision-language models (VLMs) encode video frames into visual tokens and attempt to perform both perception and multi-step planning latently, within a single forward pass. This coupled formulation, however, is bottlenecked by the LLM’s limited capacity to discover and execute multi-step strategies in its latent representations. To address this bottleneck, we propose Hierarchical Programmatic Probing (HPP), a framework that decouples semantic perception from higher-order temporal reasoning by reformulating long video understanding as iterative, programmatic exploration of a hierarchically segmented video. Specifically, a coding-capable LLM plans and executes a multi-step strategy in an interactive coding environment, probing the video for information and invoking a VLM for localized perception on demand. To make probing tractable over long videos, we introduce three components: information-density-aware hierarchical segmentation, late-interaction semantic retrieval, and structured probing functions for coarse-to-fine temporal localization. We validate HPP on LongVideoBench, which requires both fine-grained perception and long-range relational reasoning, and show that decoupling the two via iterative programmatic probing yields substantial gains. Further results on EgoSchema, VideoMME, and MLVU demonstrate the effectiveness of our approach across diverse long-video benchmarks.


[126] Chehre: An Emoji-Prompted Video Dataset for Perceptually Diverse Facial Expression Recognition cs.CV | cs.CLPDF

Bita Azari, Zoe Stanley, Avneet Batra, Poorvi Bhatia, Hali Kil

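TL;DR: This paper introduces Chehre, an emoji-prompted video dataset for analyzing dynamic facial expressions and exploring inter-individual perceptual diversity. The dataset contains 2,111 high-quality anonymized videos from 203 performers, validated by 902 annotators, and defines two benchmark tasks: dominant expression recognition and distributional expression recognition. The study evaluates recent vision-language models and finds both tasks challenging: the best model reaches only 32.5% Top-1 accuracy on dominant expression recognition, and its Spread Ratio on distributional recognition is well below the human reference.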

Details

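Motivation: Existing facial expression recognition datasets usually focus on static images, basic emotion categories, or single deterministic annotations, and do not explore dynamic expressions or perceptual diversity across individuals. This paper addresses the gap by introducing an emoji-prompted video dataset for analyzing dynamic facial expressions over a wide range of expressions and for studying differences in how people perceive them.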

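Result: Recent vision-language models are evaluated on Chehre, using random sampling and persona prompting to generate multiple predictions per video. The best model achieves only 32.5% Top-1 accuracy on dominant expression recognition, and its Spread Ratio on distributional expression recognition is well below the human reference, indicating that both tasks are challenging for existing models.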

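Insight: The novelty lies in an emoji-prompted video dataset that emphasizes dynamic expressions and perceptual diversity, together with two new benchmark tasks (dominant and distributional expression recognition). Viewed objectively, the dataset protects privacy through anonymization and uses crowdsourced annotation to capture the diversity of human responses, giving facial expression recognition research an evaluation benchmark closer to real social settings.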

Abstract: Facial expressions are nonverbal social signals used in human interaction, but facial expression recognition datasets often focus on static images, basic emotion categories, or single deterministic annotations. We introduce Chehre, an emoji-prompted video dataset for analyzing dynamic facial expressions across a wide range of expressions and for exploring inter-individual perceptual diversity. In Chehre, participants were prompted to express and record 40 facial emojis. Later, their facial motions were transferred onto synthetic faces to preserve privacy. A separate group of annotators analyzed the anonymized videos using emoji and label annotations, resulting in 2,111 high-quality videos collected from 203 performers and validated by 902 annotators. We define two benchmark tasks: dominant expression recognition, which tests whether models recover the top human-rated labels, and distributional expression recognition, which tests whether models capture the diversity of human responses. We benchmark recent vision-language models using random sampling and persona prompting to generate multiple predictions per video. Results show that both tasks are challenging: among the models evaluated, the best-performing model achieves only 32.5% Top-1 accuracy on dominant expression recognition and a Spread Ratio well below the human reference on distributional recognition. Chehre provides a benchmark for evaluating diverse, dynamic, and distributional facial expression recognition.


[127] Motion-Aware Reinforcement Learning For Object Localization cs.CVPDF

Prithvi Raj Singh, Satyendra Singh

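TL;DR: This paper presents MARLNet (Motion-Aware Reinforcement Learning Network), a PPO-based bounding-box refinement agent that improves object localization by incorporating a constant-velocity motion prior into the observation state and an action smoothness penalty into the reward function. Evaluated on Pascal VOC 2012 and VisDrone 2019, the method trains stably and improves the detection success rate, and reward design ablations are used to analyze a reward interference problem.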

Details

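Motivation: To address the unstable or overshooting bounding-box adjustments that arise in conventional reinforcement learning for object localization when motion priors and action smoothness constraints are missing, and thereby improve localization accuracy and training stability.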

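Result: On Pascal VOC 2012, MARLNet improves the detection success rate at IoU≥0.5 by up to +0.011 (λ_phys=0.10), where the motion prior prevents the overshooting seen with plain PPO; on VisDrone 2019 the gain is +0.007 (λ_phys=0.70). Ablations confirm the reward interference problem and show that the action smoothness penalty resolves trigger collapse.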

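Insight: The contributions include integrating a constant-velocity motion prior into the reinforcement learning observation and designing an action smoothness penalty that avoids training instability. Viewed objectively, combining a motion model with reward engineering improves the robustness and performance of the localization agent, and the work also exposes the representational bottleneck of sharing a backbone with the base detector.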

Abstract: We present MARLNet (Motion-Aware Reinforcement Learning Network), a PPO-based bounding-box refinement agent that incorporates a constant-velocity motion prior into the observation state and an action smoothness penalty into the reward function. The agent operates on 268-dimensional observations encoding the current proposal, a kinematic prediction, the previous action, and a 256-dimensional EfficientNet-B0 crop feature, and learns a five-dimensional policy controlling coordinate adjustments and a binary termination trigger. Evaluated on Pascal VOC 2012 and VisDrone 2019, MARLNet trains stably across all regularization strengths tested and achieves consistent gains in detection success rate at $\text{IoU} \geq 0.5$: up to $+0.011$ on VOC ($λ_\text{phys}{=}0.10$), where the motion prior prevents the overshooting that causes plain PPO to regress on this metric, and $+0.007$ on VisDrone ($λ_\text{phys}{=}0.70$), where unconstrained PPO achieves a larger gain ($+0.025$) owing to the weaker base detector. Through reward design ablations and training dynamics analysis, we identify a reward interference in which combining a constant-velocity deviation penalty with an absolute IoU term causes trigger collapse, and show that replacing it with the action smoothness penalty resolves this failure. We further characterize a representational ceiling facing crop-feature refinement agents that share a backbone with their base detector, confirmed through a global-plus-local observation ablation. Project page: https://prithviraj97.github.io/marl-net


[128] A DVDrive Approach for doScenes Instructed Driving Challenge cs.CV | cs.AIPDF

Zijian Fu, Xiangyang Chu, Mengshi Qi, Huadong Ma, Guanghao Zhang

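TL;DR: This paper presents an OmniDrive-based method for instruction-conditioned trajectory prediction, submitted to the doScenes Instructed Driving Challenge. It introduces a DVPE-style divided-view perception module to improve multi-view visual grounding, so that language instructions are better aligned with local driving-relevant visual evidence when predicting the ego vehicle's trajectory over the next 6 seconds.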

Details

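Motivation: To address instruction-conditioned trajectory prediction in autonomous driving, where a model must predict the future ego trajectory from visual scene context, historical motion, and a natural-language maneuver instruction.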

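Result: The model is trained on instruction-annotated nuScenes scenes and generates a 6-second ego trajectory represented by 12 future waypoints. The abstract does not report specific quantitative results.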

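Insight: The novelty is integrating a divided-view perception module into the OmniDrive perception head. By grouping query features and image tokens into local view spaces and performing visibility-aware cross-attention within each view, it reduces irrelevant cross-view interference and improves vision-language alignment.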

Abstract: Instruction-conditioned trajectory prediction is an emerging problem in autonomous driving, where a model predicts the future ego trajectory not only from visual scene context and historical motion, but also from a natural-language maneuver instruction. This paper presents our submission to the doScenes Instructed Driving Challenge, built upon OmniDrive, a vision-language-action driving agent with 3D perception, reasoning, and planning capabilities. We adapt OmniDrive to the doScenes setting by training it on instruction-annotated nuScenes scenes and generating a 6-second ego trajectory represented by 12 future waypoints. To improve multi-view visual grounding, we further introduce a DVPE-style divided-view perception module into the OmniDrive perception head. Instead of attending globally to all camera features, the proposed module groups query features and image tokens into divided local view spaces and performs visibility-aware cross-attention within each view. This design reduces irrelevant cross-view interference and helps the model better align language instructions with local driving-relevant visual evidence. The code is publicly available at: https://github.com/feel12348/doscenes-omnidrive.
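
The output format described above (a 6-second trajectory as 12 waypoints) can be illustrated with a small sketch. It assumes the waypoints are uniformly spaced in time and end at the horizon, which the abstract does not state explicitly; `waypoint_times` is a hypothetical helper, not part of OmniDrive or the released code.

```python
def waypoint_times(horizon_s: float, num_waypoints: int) -> list[float]:
    """Timestamps in seconds for uniformly spaced future waypoints.

    Assumption: equal spacing, with the last waypoint at the horizon.
    """
    step = horizon_s / num_waypoints
    return [step * (i + 1) for i in range(num_waypoints)]


# 6 s horizon with 12 waypoints gives one waypoint every 0.5 s (2 Hz).
times = waypoint_times(6.0, 12)
```

Under this assumption the waypoint rate matches the 2 Hz keyframe rate of nuScenes annotations.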


[129] RAPID: A Reproducible Multi-Agent Pipeline for Interpretable Disaster Damage Assessment from Satellite and Street-View Imagery cs.CVPDF

Yifan Yang, Wenjing Gong, Kaili Zhang, Lei Zou, Zhengzhong Tu

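TL;DR: This paper presents RAPID, a reproducible multi-agent pipeline for interpretable disaster damage assessment from satellite and street-view imagery. By coordinating several specialized agents, and without task-specific fine-tuning, the system performs cross-view understanding, image restoration, structured damage recognition, and geographical reasoning, supporting zero-shot damage assessment and producing fine-grained, interpretable disaster reports.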

Details

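Motivation: As extreme climate events become more frequent and intense, there is a pressing need for intelligent, scalable, and autonomous approaches to disaster damage assessment. Existing methods based on supervised learning and task-specific fine-tuning struggle with domain shift, long-tailed data distributions, and heterogeneous geospatial data sources, and they lack the ability to integrate and reason over multimodal geospatial information such as satellite and street-view images.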

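Result: RAPID is evaluated on hurricanes, floods, wildfires, and earthquakes using several cross-view inputs, including pre- and post-disaster street-view images, post-disaster remote sensing imagery, and street-view image pairs. It achieves 0.92 overall accuracy on multi-disaster type classification and up to 0.627 on cross-view damage severity prediction, showing its potential as a foundational framework for autonomous disaster intelligence.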

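Insight: The novelty is a reproducible multi-agent pipeline that needs no task-specific fine-tuning and supports zero-shot assessment, coordinating multiple agents to reason jointly over heterogeneous multimodal data. Viewed objectively, it combines the agent paradigm of large language and multimodal models with the cross-view, multi-task needs of disaster assessment, yielding a modular, interpretable, and extensible framework and a new angle on domain generalization and data heterogeneity in disaster scenarios.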

Abstract: Due to the increasing frequency and intensity of extreme climate events, there is a clear demand for intelligent, scalable, and autonomous approaches to disaster damage assessment. Existing methods, largely based on supervised learning and task-specific fine-tuning, struggle to generalize under domain shifts, long-tailed data distributions, and heterogeneous geospatial data sources, especially in disaster scenarios. They also often lack the ability to integrate and reason across multimodal geospatial information, such as satellite images and street-view images. In this paper, we introduce RAPID, a reproducible multi-agent pipeline for interpretable disaster damage assessment, including damage-level assessment, damage-type interpretation, and actionable suggestions for response, remediation, and recovery. RAPID coordinates specialized agents to perform cross-view understanding, image restoration, structured damage recognition, and geographical reasoning across heterogeneous data modalities. Without task-specific fine-tuning, RAPID supports zero-shot damage assessment by jointly using complementary information from remote sensing and ground-level perspectives. The system produces fine-grained, interpretable assessments and automatically generates location-specific, decision-relevant disaster reports to support early-stage emergency response. We evaluate RAPID across hurricanes, floods, wildfires, and earthquakes using multiple cross-view imagery inputs, including pre- and post-disaster street-view images, post-disaster remote sensing imagery, and street-view image pairs. Experiments show that RAPID achieves 0.92 overall accuracy for multi-disaster type classification and up to 0.627 for cross-view damage severity prediction, highlighting its potential as a foundational framework for autonomous disaster intelligence.


[130] Beyond Flat Labels: Level-Restricted Contrastive Learning for Hierarchical Fine-Grained Vision Classification cs.CV | cs.LGPDF

Zhiyuan Tao, Srikumar Sastry, Matthew J Thompson, Elizabeth G Campolongo, Net Zhang

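TL;DR: This paper proposes level-restricted contrastive learning for hierarchical fine-grained visual classification, targeting the inconsistent predictions that existing multimodal contrastive learning produces in hierarchical label spaces. The method restricts contrastive comparisons to categories within the same taxonomic level and uses a group-balanced design so that every level is adequately optimized, improving both classification accuracy from coarse to fine granularity and hierarchical consistency.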

Details

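Motivation: Existing zero-shot visual classification methods based on multimodal contrastive learning often produce predictions that are inconsistent across taxonomic levels in hierarchical label spaces, for example a fine-grained category whose parent contradicts the predicted higher-level label. The cause is false negative labels that arise when contrastive comparison spans multiple levels.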

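Result: The model is built on BioCLIP, trained on TreeOfLife-10M, and evaluated on multiple hierarchical classification benchmarks, where it significantly improves hierarchical consistency in both Euclidean and hyperbolic spaces. On iNaturalist 2021, the method improves average accuracy across levels by 30.47% over the baseline, demonstrating its effectiveness for hierarchical zero-shot classification.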

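Insight: The novelty is level-restricted contrastive learning, which removes false negatives by avoiding cross-level comparisons, combined with a group-balanced design that optimizes the representation at every level. Viewed objectively, the method offers a structure-aware approach to contrastive learning in hierarchical label spaces and can improve the logical consistency of fine-grained classification.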

Abstract: Multimodal contrastive learning has enabled zero-shot visual classification by aligning images with textual categories. However, in hierarchically structured label spaces, existing methods often produce predictions that are inconsistent across taxonomic levels. For example, a model may predict a fine-grained category whose parent category contradicts its simultaneously predicted higher-level label. By analysis, the issue originates from false negative labels when contrastive comparison involves multiple taxonomic levels. To this end, we propose to restrict contrastive comparisons to categories within the same taxonomic level. In addition, we adopt a group-balanced design, ensuring each taxonomic level receives adequate optimization. As a result, the proposed framework improves both hierarchical consistency and classification accuracy from coarse to fine granularity. We train our model with TreeOfLife-10M based on BioCLIP and evaluate it across multiple hierarchical classification benchmarks, where the model demonstrates significantly improved hierarchical consistency in both Euclidean and hyperbolic spaces. Notably, on iNaturalist 2021 (iNat21), our method improves average accuracy across levels by 30.47% over the baseline, highlighting its effectiveness for hierarchical zero-shot classification.
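
The idea of restricting contrastive comparisons to a single taxonomic level can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy embeddings, the temperature, and the reading of "group-balanced" as an equal-weight average over levels are all assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_ce(logits, target_pos):
    """Cross-entropy of a softmax over logits, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_pos]


def contrastive_loss(image_emb, text_embs, levels, target_idx,
                     restrict=True, temperature=0.07):
    """Image-to-text contrastive loss for one image and one target label.

    With restrict=True, only categories at the target's own taxonomic level
    act as negatives, so the image's labels at other levels are never
    pushed away as false negatives.
    """
    if restrict:
        candidates = [i for i, lvl in enumerate(levels)
                      if lvl == levels[target_idx]]
    else:
        candidates = list(range(len(text_embs)))
    logits = [dot(image_emb, text_embs[i]) / temperature for i in candidates]
    return softmax_ce(logits, candidates.index(target_idx))


def balanced_loss(image_emb, text_embs, levels, targets_per_level):
    """Assumed group-balanced form: equal weight for each level's loss."""
    losses = [contrastive_loss(image_emb, text_embs, levels, t)
              for t in targets_per_level]
    return sum(losses) / len(losses)


# Toy label space: two classes and two species (hypothetical embeddings).
levels = ["class", "class", "species", "species"]
text_embs = [[1.0, 0.2], [-1.0, 0.1], [0.9, 0.4], [-0.8, 0.5]]
image = [0.9, 0.3]  # an image whose labels are index 0 (class) and 2 (species)

restricted = contrastive_loss(image, text_embs, levels, 2, restrict=True)
flat = contrastive_loss(image, text_embs, levels, 2, restrict=False)
balanced = balanced_loss(image, text_embs, levels, [0, 2])
```

In the flat variant the image's own class label (index 0) sits in the denominator of the species-level loss and is penalized as a negative; the restricted variant drops it, so `restricted` is strictly smaller than `flat` here.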


[131] Zero-Shot Vision-Language Models for Classroom Engagement Recognition: A Benchmark Study of Prompt Sensitivity and Cross-Dataset Generalization cs.CVPDF

Aman Goyal, Kshama Nitin Shah, Kemmannu Vineet Venkatesh Rao

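TL;DR: This paper systematically evaluates five mainstream vision-language models (CLIP, BLIP-VQA, GPT-4o, LLaVA-1.5-7B, and Qwen2.5VL-7B-Instruct) for zero-shot classroom engagement recognition. It uses two educational datasets, DAiSEE (individual-student videos) and SCB (classroom scene images), and tests three prompt variants. Three major failure modes of zero-shot VLMs emerge: near-random performance on individual students, severe class collapse (predictions skewed to a single class), and extreme prompt sensitivity.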

Details

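Motivation: Automated classroom engagement recognition is valuable for scalable learning analytics, but the suitability of modern vision-language models for this task under zero-shot conditions has not been adequately explored.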

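Result: On DAiSEE, Cohen's kappa never exceeds 0.10 for any model, which is close to chance. On SCB, CLIP and GPT-4o reach a kappa of about 0.60 when prompted with behaviorally grounded rubrics. The models are extremely prompt-sensitive: on identical images, accuracy swings by up to 32 percentage points depending on the prompt alone.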

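Insight: The study exposes three key failure modes of zero-shot VLMs in educational settings, an important warning for real deployments. It also finds that scene-level classification is more tractable than recognizing individual students and that rubric-anchored prompts substantially improve performance, which gives concrete guidance for designing educational observation systems.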

Abstract: Automated classroom engagement recognition holds substantial promise for scalable learning analytics, yet the suitability of modern Vision-Language Models (VLMs) for this task under zero-shot conditions remains largely unexplored. We present a systematic benchmark that evaluates five widely-used VLMs: CLIP, BLIP-VQA, GPT-4o, LLaVA-1.5-7B, and Qwen2.5VL-7B-Instruct across two complementary educational datasets: DAiSEE, an individual-student video dataset (300 sampled test clips), and the Student Classroom Behaviour dataset (SCB, 1,168 scene-level images). Each model is probed with three prompt variants spanning minimal, rubric-anchored, and chain-of-thought designs. Our experiments reveal three primary failure modes of zero-shot VLMs for engagement recognition: (1) near-random performance on individual students, with Cohen’s kappa never exceeding 0.10 on DAiSEE; (2) severe class collapse, where models assign 85-100% of predictions to a single engagement level regardless of visual content; and (3) extreme prompt sensitivity, with accuracy swings of up to 32 percentage points on identical images depending solely on prompt phrasing. Remarkably, scene-level classification on SCB is substantially more tractable: CLIP and GPT-4o achieve kappa approximately 0.60 when prompted with behaviorally-grounded rubrics. We also document a practical barrier for deployment: GPT-4o’s safety filters reject 98% of chain-of-thought requests involving individual student faces. Our findings provide a calibrated baseline and surface critical design considerations for the use of VLMs in educational observation systems.
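
Why class collapse drives Cohen's kappa toward zero even when raw accuracy looks acceptable can be seen from the definition $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is chance agreement. The sketch below uses a made-up label distribution, not DAiSEE data, and the zero returned for the degenerate $p_e = 1$ case is a convention chosen for this example.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(y_true) | set(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if p_e == 1.0:  # both raters use one identical class; kappa is undefined
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical 4-level engagement labels, skewed toward level 2.
y_true = [0] * 10 + [1] * 20 + [2] * 50 + [3] * 20

# A collapsed model that always predicts level 2.
collapsed = [2] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, collapsed)) / len(y_true)
kappa_collapsed = cohens_kappa(y_true, collapsed)
kappa_perfect = cohens_kappa(y_true, y_true)
```

Here the collapsed predictor is right half the time, yet its kappa is zero because that agreement is exactly what chance predicts for a constant output.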


[132] Prompt-Calibrated SAM 3 for Open-Vocabulary Remote Sensing Semantic Segmentation cs.CVPDF

Yanghui Song, Nanqing Liu, Haonan Yin, Yingjie Gao, Chengfu Yang

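TL;DR: This paper proposes ProC-SAM3 to address three key issues in SAM 3-based training-free open-vocabulary semantic segmentation (OVSS) of remote sensing images: insufficient semantic coverage from a single class-name prompt, repeated text-encoding overhead from multi-prompt expansion, and noise introduced by directly aggregating multi-prompt responses. The method calibrates SAM 3's prompt interface by building an offline prompt pool, caching text embeddings, and introducing Presence-Guided Residual Fusion with peak-preserving class aggregation.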

Details

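Motivation: SAM 3-based training-free methods for open-vocabulary remote sensing segmentation face three key issues: a single class-name prompt lacks semantic coverage for complex remote sensing categories, multi-prompt expansion adds redundant online text-encoding overhead, and directly aggregating multi-prompt responses propagates noisy activations into the final prediction.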

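Result: Experiments on eight benchmarks show that ProC-SAM3 achieves an average mIoU of 56.1%, outperforming the previous best training-free method by 3.9 percentage points and setting a new state of the art.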

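Insight: The novelty is a systematic calibration of SAM 3's prompt interface from three complementary aspects: (1) building and refining an offline prompt pool from MLLM-generated candidates using category-specific prior knowledge; (2) caching and reusing text embeddings to eliminate repeated encoding; and (3) Presence-Guided Residual Fusion and peak-preserving class aggregation, which gate unreliable outputs and retain fine-grained activations for small and sparse objects. This offers a reusable prompt-engineering and post-processing recipe for improving the zero-shot performance of foundation models on domain-specific open-vocabulary tasks.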

Abstract: Open-vocabulary semantic segmentation (OVSS) in remote sensing images aims to segment categories beyond a fixed label space. Recent SAM 3-based methods provide a promising training-free foundation, yet three key issues remain: (1) a single class-name prompt lacks sufficient semantic coverage for complex remote sensing categories; (2) expanding each category into multiple prompts introduces redundant online text encoding; and (3) directly aggregating multiple prompt responses propagates noisy activations into the final prediction. To address these issues, we propose ProC-SAM3, which calibrates SAM 3’s prompt interface for remote sensing OVSS from three complementary aspects. First, we construct an offline prompt pool where a Category Matcher groups MLLM-generated candidates into per-category sets, and Expansion Constraints further refine each set using category-specific prior knowledge. Second, the resulting text embeddings are cached and reused across all test images, eliminating repeated text encoding. Third, we introduce Presence-Guided Residual Fusion to gate unreliable decoder outputs by prompt presence and confidence, followed by peak-preserving class aggregation that retains fine-grained activations for small and sparse objects. Experiments on eight benchmarks show that ProC-SAM3 achieves an average mIoU of 56.1%, outperforming the previous best training-free method by 3.9 percentage points. Code will be available at https://github.com/YanghuiSong/ProC-SAM3.
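
Two of the ideas above, reusing cached prompt embeddings across images and aggregating multi-prompt responses without flattening peaks, can be sketched as follows. Everything here is hypothetical: `fake_encoder` stands in for a real text encoder, and a pixel-wise maximum is only one plausible reading of "peak-preserving"; the abstract does not specify the actual operator or the SAM 3 API.

```python
class PromptEmbeddingCache:
    """Encode each prompt once and reuse the embedding for every image."""

    def __init__(self, encode_fn):
        self._encode = encode_fn
        self._store = {}
        self.encode_calls = 0

    def get(self, prompt):
        if prompt not in self._store:
            self._store[prompt] = self._encode(prompt)
            self.encode_calls += 1
        return self._store[prompt]


def aggregate_class_scores(prompt_maps, mode="max"):
    """Combine per-prompt score maps (flattened pixel lists) for one class."""
    if mode == "max":  # assumed peak-preserving choice
        return [max(vals) for vals in zip(*prompt_maps)]
    return [sum(vals) / len(vals) for vals in zip(*prompt_maps)]


def fake_encoder(prompt):
    # Hypothetical stand-in for a text encoder; returns a toy 2-D vector.
    return [float(len(prompt)), float(sum(map(ord, prompt)) % 97)]


prompts = ["ship", "cargo vessel", "boat on water"]
cache = PromptEmbeddingCache(fake_encoder)
for _image in range(3):  # three test images share the same prompt pool
    embeddings = [cache.get(p) for p in prompts]

# Only one prompt fires strongly on the second pixel (a small object).
maps = [[0.1, 0.9, 0.2], [0.1, 0.1, 0.2], [0.1, 0.2, 0.2]]
peak_kept = aggregate_class_scores(maps, mode="max")
averaged = aggregate_class_scores(maps, mode="mean")
```

In this toy case the mean dilutes the single strong response on the second pixel to 0.4, while the maximum keeps it at 0.9, which is the behavior one would want for small and sparse objects.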


[133] ScalePredictor: Instance-aware Scale Learning for Accurate Quantization of Vision Transformers cs.CV | cs.AIPDF

Changjun Li, Runqing Jiang, Lian Xu, Ye Zhang, Qingyong Hu

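TL;DR: This paper proposes ScalePredictor, a dynamic quantization framework for Vision Transformers (ViTs) that aims for more accurate post-training quantization (PTQ) through instance-aware scale learning. It exploits a hidden correlation between the distribution range of shallow-layer activations and the optimal quantization scales of deeper layers, and builds an efficient scale learning mechanism from a range extraction step and a polynomial scale projection module, adding very little computation while avoiding just-in-time calibration.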

Details

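Motivation: Existing PTQ methods usually follow a static paradigm that applies the same quantization scales to all instances. Because natural images are diverse, activation distributions vary significantly across samples, which makes static quantization inherently suboptimal. A dynamic framework that adapts to per-instance distributions is therefore needed to improve the accuracy and efficiency of ViT quantization.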

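Result: Extensive experiments on ImageNet show that ScalePredictor consistently outperforms prior PTQ methods and achieves a more favorable accuracy-efficiency trade-off.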

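Insight: The novelty is the observed correlation between the distribution range of shallow-layer activations and the optimal quantization scales of deeper layers, and an efficient polynomial scale projection module built on it that generates dynamic, instance-aware quantization scales at low computational cost.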

Abstract: Vision Transformers have achieved remarkable success in many fields, yet their deployment on edge devices remains challenging due to their substantial computational demands. Post-Training Quantization (PTQ) offers an attractive solution by compressing models using a small calibration set with minimal training overhead. However, most existing PTQ works adopt a static quantization paradigm that is uniformly applied to all instances. Given the substantial diversity of natural images, the activation distributions vary significantly across samples, making these methods inherently suboptimal. In this paper, we propose ScalePredictor, a dynamic quantization framework for accurate and efficient quantization scale learning of ViTs. We first reveal a hidden correlation between the distribution range of shallow-layer activations and the optimal scales of deeper layers. Based on this, we develop a scale learning mechanism that integrates an efficient range extraction approach to capture robust range statistics at the shallow stage, which are then fed into a Taylor-motivated polynomial scale projection module to generate all quantization scales simultaneously. With the efficiency of polynomial approximation, ScalePredictor introduces insignificant computational overhead while avoiding costly just-in-time calibration. Extensive experiments on ImageNet demonstrate that ScalePredictor consistently outperforms prior PTQ methods, achieving a more favorable accuracy-efficiency trade-off. Code and additional results are shown in the supplementary materials.
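
The motivation for instance-aware scales can be illustrated with plain symmetric uniform quantization. This is a sketch of the general argument only: the activation values, the 4-bit setting, and the polynomial coefficients are hypothetical, and `poly_project` merely shows how a polynomial could map a shallow-layer range statistic to a scale, not how ScalePredictor actually does it.

```python
def range_scale(x, bits=8):
    """Symmetric scale that maps the largest magnitude to the top level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in x) / qmax


def quantize(x, scale, bits=8):
    """Quantize to signed integers, clip, then map back to real values."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in x]


def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def poly_project(range_stat, coeffs):
    """Evaluate c0 + c1*r + c2*r^2 + ... with Horner's rule."""
    out = 0.0
    for c in reversed(coeffs):
        out = out * range_stat + c
    return out


# A wide-range calibration sample fixes the static scale ...
calibration = [4.0, -3.5, 2.0, 1.0]
static_scale = range_scale(calibration, bits=4)

# ... but a narrow-range test instance is then quantized too coarsely.
instance = [0.1, -0.2, 0.05, 0.3]
dynamic_scale = range_scale(instance, bits=4)

mse_static = mse(instance, quantize(instance, static_scale, bits=4))
mse_dynamic = mse(instance, quantize(instance, dynamic_scale, bits=4))

# Hypothetical coefficients: 1.0 + 0.5*r + 0.25*r^2 evaluated at r = 2.
projected = poly_project(2.0, [1.0, 0.5, 0.25])
```

With the static scale most of the narrow instance rounds to zero, whereas the per-instance scale uses all available levels, so `mse_dynamic` is far smaller than `mse_static`.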


[134] Rethinking the Adaptation of Vision Foundation Models for Efficient Cell Segmentation cs.CV | cs.AIPDF

Qing Xu, Xiangjian He, Wenting Duan, Jiebo Luo, Zhen Chen

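TL;DR: This paper proposes EffiCell-Seg, a framework for efficient cell segmentation whose key property is that the encoder of the vision foundation model (VFM) is never retrained. It exploits the complementary structural priors already present in pretrained VFMs (global saliency for localizing cells and local morphological patterns for delineating structures), using a Cell Structure Prompt Encoder (CSP-Encoder) to synthesize structural prior maps and a Synergistic Mask Decoder (SM-Decoder) to predict the segmentation jointly.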

Details

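Motivation: Current cell segmentation methods based on vision foundation models (VFMs) are resource-intensive and depend on large-scale annotations. The goal is an efficient adaptation paradigm that does not fine-tune the visual encoder.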

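Result: Across diverse cell imaging modalities, EffiCell-Seg outperforms state-of-the-art methods while requiring only about 5M trainable parameters, over 130x fewer than fully fine-tuned VFM counterparts.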

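Insight: The novelty is showing that the complementary structural priors in pretrained VFMs (global saliency and local morphology) can be exploited, and designing the CSP-Encoder and SM-Decoder to extract and jointly use them explicitly, which yields parameter-efficient cell segmentation with strong performance.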

Abstract: Cell segmentation is critical for computational pathology and biomedical discovery. While recent Vision Foundation Models (VFMs) have demonstrated remarkable universal feature representations, unlocking their full potential for cellular imaging is currently bottlenecked by resource-intensive adaptation paradigms. Existing methods typically rely on fine-tuning heavy visual encoders, leading to extensive computational overhead and a dependency on large-scale annotations. To address this, we propose the EffiCell-Seg framework for highly efficient cell segmentation without re-training the visual encoder. Our core insight is that pretrained VFMs intrinsically encode complementary structural priors: global saliency for localizing potential cells, and local morphological patterns for delineating cellular structures. To harness these priors, we devise a Cell Structure Prompt Encoder (CSP-Encoder) that synthesizes semantic-aware saliency and principal morphological features from frozen VFM representations into explicit structural prior maps. Moreover, we propose a Synergistic Mask Decoder (SM-Decoder) that enforces contextual consistency by jointly predicting geometric distance fields and semantic maps via mutual cross-guidance. Extensive experiments demonstrate that EffiCell-Seg outperforms state-of-the-art methods across diverse cell imaging modalities while requiring only ~5M trainable parameters, over 130x fewer than fully fine-tuned VFM counterparts. The code is available at https://github.com/xq141839/EffiCell-Seg.


[135] Look Before You Zoom: Adaptive Routing for the Resolution-Context Trade-off in Visual RAG cs.CV | cs.CLPDF

Oanh N. Tran, Thanh Quoc Hung Le, Oscar Chew, Kuan-Hao Huang, Khoa D. Doan

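TL;DR: This paper proposes ViRGo, a framework that resolves the resolution-context trade-off in visual retrieval-augmented generation (Visual RAG) through adaptive routing. During the VLM forward pass it uses localization heads and semantic confidence to choose dynamically among global perception, patch-based retrieval, and attention-based retrieval, balancing detail recovery against context preservation.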

Details

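Motivation: Vision-language models degrade as target objects get smaller. Existing training-free methods retrieve indiscriminately, which either destroys the global spatial context of large objects or fails to capture tiny details reliably, and unnecessary retrieval adds computational overhead.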

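Result: Experiments across multiple VQA benchmarks and object-size groups show that ViRGo improves the accuracy-efficiency trade-off: it matches patch retrieval on small details, uses attention-based retrieval for large objects, and reduces inference time by routing to the global baseline when zooming is unnecessary.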

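Insight: The novelty is formulating visual retrieval as an adaptive routing problem and making a lightweight decision from the VLM's intrinsic localization ability and semantic confidence. This gives objects of different scales different handling strategies and improves the overall robustness and efficiency of the system.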

Abstract: Vision-Language Models (VLMs) struggle as query-relevant objects become smaller. To address this, recent training-free approaches dynamically retrieve and zoom into local image regions. However, we show that indiscriminately applying retrieval ignores a critical vulnerability: the resolution-context trade-off. Patch-based zooming recovers details for small targets, but can split large objects and destroy global spatial context; attention-based retrieval better preserves large objects, but remains less reliable on tiny details; and global perception is often fastest when retrieval is unnecessary. Motivated by these failure modes, we introduce ViRGo (Visual Retrieval or Global Perception), a lightweight framework that formulates visual retrieval as an adaptive routing problem. ViRGo estimates object scale from the VLM’s intrinsic localization heads during the initial forward pass and combines it with semantic token confidence to select between global perception, patch-based retrieval, and attention-based retrieval with minimal additional computation. Experiments across multiple VQA benchmarks and object-size groups show that ViRGo improves the accuracy-efficiency trade-off: it matches patch retrieval on small details, leverages attention-based retrieval for larger objects, and reduces inference time by routing to the global baseline when zooming is unnecessary.
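
The three-way routing described above could take a form like the sketch below. The decision rule, the threshold values, and the input names (`object_area_ratio`, `token_confidence`) are assumptions made for illustration; the abstract only states that an object-scale estimate and semantic token confidence are combined to choose among the three paths.

```python
def route(object_area_ratio: float, token_confidence: float,
          small_thresh: float = 0.05, conf_thresh: float = 0.8) -> str:
    """Pick a perception path for one query (hypothetical rule).

    object_area_ratio: estimated fraction of the image covered by the
        query-relevant object, e.g. derived from localization heads.
    token_confidence: confidence of the answer tokens from the first pass.
    """
    if token_confidence >= conf_thresh:
        return "global"      # first pass already confident: skip retrieval
    if object_area_ratio < small_thresh:
        return "patch"       # tiny target: zoom into local patches
    return "attention"       # larger target: keep it intact


decisions = [
    route(0.01, 0.30),  # small object, uncertain answer
    route(0.40, 0.30),  # large object, uncertain answer
    route(0.40, 0.95),  # confident answer
]
```

Checking confidence first reflects the efficiency argument in the abstract: when zooming is unnecessary, the cheapest path is to stay with the global pass.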


[136] CapRiCorn-1K: A Comprehensive Benchmark for Video Captioning and Subject Referential Consistency Across Temporal Scales cs.CV | cs.CLPDF

Xinlong Chen, Jiafu Tang, Yue Ding, Yizhuo Jia, Bozhou Li

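TL;DR: This paper proposes the CapRiCorn-1K benchmark for comprehensively evaluating video captioning quality and subject referential consistency across temporal scales and video domains. The benchmark supports both audiovisual and visual-only settings. Experiments show that existing models struggle to produce accurate, comprehensive captions for long videos while keeping references consistent, and that performance declines as video duration increases.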

Details

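Motivation: Existing benchmarks cannot objectively and comprehensively evaluate the accuracy and subject referential consistency of video captions across diverse durations and scenarios, which hinders progress on video captioning models.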

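Result: Extensive experiments on CapRiCorn-1K show that current models generally struggle to generate accurate and comprehensive captions while maintaining consistent subject references, and that performance declines as video duration increases. The evaluation metrics correlate strongly with the performance of downstream understanding and generation tasks conditioned on the generated captions.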

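Insight: The novelty is a comprehensive benchmark dedicated to subject referential consistency in long-horizon, cross-domain video captioning, with metrics validated against downstream tasks, which gives model development a clear direction.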

Abstract: Accurate and comprehensive video captions with consistent subject references are critical for downstream understanding and generation tasks. However, few existing benchmarks can objectively and comprehensively evaluate these properties across diverse durations and scenarios, thereby hindering the advancement of video captioning models. To bridge this gap, we propose CapRiCorn-1K, a comprehensive benchmark designed to evaluate both video captioning quality and subject referential consistency across long temporal horizons and diverse video domains. To accommodate varied evaluation needs, our benchmark supports both audiovisual and visual-only settings. Extensive experiments on CapRiCorn-1K reveal that current models generally struggle to generate accurate and comprehensive captions while maintaining consistent subject references. Moreover, as video duration increases, both the overall caption quality and subject referential consistency decline. Notably, our evaluation metrics exhibit strong correlations with the performance of downstream understanding and generation tasks conditioned on the generated captions, further validating their effectiveness. The project is available at https://github.com/xlchen0205/CapRiCorn-1K .


[137] CoDMD: Copula-aware Distribution Matching Distillation for Fast Video Generation cs.CVPDF

Wenhu Zhang, Kun Cheng, Changyuan Wang, Shiyao Li, Yuechen Zhang

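TL;DR: This paper proposes CoDMD (Copula-aware Distribution Matching Distillation) for few-step distillation of fast video generation models. It adds a lightweight relational regularizer that explicitly constrains the relational structure across batch samples and temporal frames, addressing the layout instability, oversaturation, and broken motion that standard DMD exhibits under limited NFE budgets.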

Details

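Motivation: The standard Distribution Matching Distillation (DMD) paradigm degrades in few-step distillation, showing layout instability, oversaturation, and broken motion in generated videos. The authors attribute this to DMD being an intra-sample distribution-matching objective with no explicit constraint on the relational structure across batch elements or temporal frames, which makes it prone to local optima in the few-step regime.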

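Result: On the Wan-2.1-T2V 1.3B and 14B models, CoDMD distills 50-step teachers into 4-step students for an approximately 25x speed-up, attaining VBench scores of 84.46 and 84.87 respectively and outperforming the prior trajectory-based method (rCM) and distribution-based method (standard DMD).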

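Insight: The core novelty is a copula-aware relational regularizer. It reuses score estimates already produced by the frozen teacher and the online fake model to build pairwise relation matrices across samples and frames, and matches them through a supplementary distributional objective that needs no additional networks, datasets, or sampling trajectories, which stabilizes few-step distillation.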

Abstract: Few-step distillation for video diffusion models has attracted significant attention, driven by the urgent demand for efficient deployment in real-world scenarios. However, Distribution Matching Distillation (DMD), a leading paradigm, tends to degrade under limited NFE budgets, manifesting in video generation as layout instability, oversaturation, and broken motion dynamics. We trace this failure to a structural limitation: standard DMD is an intra-sample distribution-matching objective with coordinate-wise gradients, and thus imposes no explicit constraint on the relational geometry across batch elements or temporal frames, leaving the underlying copula largely unregulated. Combined with the mode-seeking tendency of its reverse-KL objective, this absence of relational guidance makes DMD prone to collapsing into local optima in the few-step regime. Motivated by this insight, we propose Copula-aware DMD (CoDMD), a lightweight relational regularizer that reuses score estimates already produced by the frozen teacher and the online fake model to construct pairwise relation matrices across samples and frames. These are matched through a supplementary distributional objective that requires no additional networks, datasets, or sampling trajectories. On the Wan-2.1-T2V model series at 1.3B & 14B scales, CoDMD distills 50-step teachers into 4-step students, achieving an approximate 25$\times$ speed-up while attaining VBench scores of 84.46 & 84.87, outperforming prior trajectory-based (rCM 82.81 & 84.05) and distribution-based (DMD 83.38 & 83.81) methods.


[138] Artic-O: End-to-End Articulated Object Reconstruction via Latent Geometry Learning cs.CVPDF

Xuyang Wang, Zhenyu Li, Jian Ding, Habib Slim, Peter Wonka

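TL;DR: This paper proposes Artic-O, an end-to-end, feed-forward framework for articulated object reconstruction via latent geometry learning. It maps sparse multi-state observations into a pretrained latent geometry space, where a frozen flow-matching decoder supplies a complete-shape prior for recovering visible and occluded structures. Visual tokens, geometry latents, and decoder features are fused in an image-grounded part-reasoning module that performs active-part segmentation and articulation prediction, connecting geometry with articulation.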

Details

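Motivation: Existing methods usually split geometry reconstruction, part reasoning, and articulation estimation into separate stages, which weakens consistency among shape, active parts, and motion and incurs high inference cost. This paper addresses these problems with an end-to-end framework.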

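Result: On the PartNet-Mobility benchmark, Artic-O achieves strong reconstruction quality while being substantially more efficient than LARM, a strong prior method. It reduces Chamfer Distance, improves F-score, and achieves comparable or better articulation accuracy on most joint metrics, while cutting inference time from 9 minutes to about 0.3 seconds per object.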

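Insight: The contributions include mapping sparse observations into a latent geometry space to exploit a complete-shape prior, connecting geometry and articulation end to end by fusing multimodal features, and balancing reconstruction and part-level supervision with a geometry-to-articulation curriculum and a decoupled two-pass strategy. Viewed objectively, the method offers clear improvements in efficiency and consistency.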

Abstract: Reconstructing articulated objects from sparse images requires recovering complete geometry, movable parts, and motion parameters. Recent methods typically separate geometry reconstruction, part reasoning, and articulation estimation into different stages. This separation can weaken consistency between shape, active parts, and motion, while also incurring substantial inference cost. We introduce Artic-O, an end-to-end, feed-forward framework for articulated object reconstruction via latent geometry learning. Instead of fitting geometry in image or view space, Artic-O maps sparse multi-state observations into a pretrained latent geometry space, where a frozen flow-matching decoder provides a complete-shape prior for recovering visible and occluded structures. To connect geometry with articulation, Artic-O fuses visual tokens, geometry latents, and point-wise decoder features in an image-grounded part-reasoning module for active-part segmentation and articulation prediction. We further train the model with a geometry-to-articulation curriculum and a decoupled two-pass strategy to balance reconstruction and part-level supervision. On PartNet-Mobility, Artic-O achieves strong reconstruction quality while being substantially more efficient than LARM, a strong prior method. It reduces Chamfer Distance, improves F-score, and achieves comparable or better articulation accuracy across most joint metrics, while reducing inference time from 9 minutes to about 0.3 seconds per object.
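
For readers unfamiliar with the two reconstruction metrics named above, here is a brute-force sketch. Conventions differ between papers (squared versus unsquared distances, sum versus mean, the threshold value), and the abstract does not say which variant is used, so the choices below are assumptions; the point clouds are toy data.

```python
import math


def nearest_dists(src, dst):
    """Distance from each point in src to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance both ways."""
    d_ab = nearest_dists(a, b)
    d_ba = nearest_dists(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d <= tau for d in nearest_dists(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nearest_dists(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]  # ground truth shifted by 0.1 in z

cd = chamfer_distance(pred, gt)
f_tight = f_score(pred, gt, tau=0.05)
f_loose = f_score(pred, gt, tau=0.2)
```

Lower Chamfer distance and higher F-score both indicate a closer match, and the F-score depends strongly on the threshold: the same shifted prediction scores 0 at `tau=0.05` and 1 at `tau=0.2`.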


[139] IDAG-Edit: Multi-Object Video Editing via Instance-Decoupled Attention and Guidance cs.CVPDF

Yuan-Zhih Lin, Huu-Thang Nguyen, Huu-Phu Do, Hong-Han Shuai, Ching-Chun Huang

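TL;DR: This paper proposes IDAG-Edit, a training-free framework for fine-grained, temporally consistent multi-object video editing. Through layout-guided attention modulation and instance-level masks, it addresses the attention leakage, identity drift, and temporal instability of existing methods and achieves precise object-level control in multi-object scenes.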

Details

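Motivation: In multi-object scenes, diffusion-based video editing struggles to achieve precise and temporally consistent object-level control because of attention leakage, identity drift, and unstable temporal dynamics. This paper aims to address that challenge.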

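Result: Extensive qualitative and quantitative evaluations show that the method improves on state-of-the-art video editing approaches in temporal stability and multi-object controllability.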

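Insight: The novelty lies in Layout-guided Attention Modulation, which facilitates coherent multi-object editing, and Instance-level Masks, which preserve each object's identity and confine attention to its own region, together enabling fine-grained object-level editing.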

Abstract: Diffusion-based video editing has made significant progress; however, achieving precise and temporally consistent object-level control, especially in multi-object scenarios, remains challenging due to attention leakage, identity drift, and unstable temporal dynamics. In this work, we propose IDAG-Edit, a training-free framework for fine-grained multi-object video editing with strong temporal consistency. The framework adopts Layout-guided Attention Modulation to facilitate coherent multi-object editing, while Instance-level Masks are introduced to preserve individual object identity and enforce localized attention within each object region, thereby enabling fine-grained, object-level editing. Extensive qualitative and quantitative evaluations demonstrate that our method improves temporal stability and multi-object controllability over state-of-the-art video editing approaches.


[140] GTA-Net: Cooperative Game Theory for Vision-Language Alignment in Chest X-Ray Report Generation cs.CVPDF

Saif ur Rehman Khan, Imad Ahmed Waqar, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

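TL;DR: This paper proposes GTA-Net, a vision-language alignment network based on cooperative game theory for automated chest X-ray report generation. The model uses a BinaryGameAligner to model interactions between image regions and text tokens, and a Disease-Aware Ternary Aligner that brings in disease concepts to strengthen clinical semantic consistency. Experiments on CheXpertPlus and IU-XRay show state-of-the-art results on standard generation metrics and improved clinical consistency.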

Details

Motivation: 现有视觉语言模型依赖隐式注意力机制,无法保证显式的区域-词语对应和疾病级别的一致性,导致生成的胸部X光报告临床可靠性不足。

Result: 在CheXpertPlus和IU-XRay基准测试中,GTA-Net在标准生成指标上取得了最先进的性能,并显示出改进的临床一致性。

Insight: 创新点在于将报告生成建模为合作博弈论对齐问题,通过基于相似性的收益矩阵和Shapley启发的权重显式对齐视觉与语言模态;同时,引入三元对齐器整合结构化疾病概念,增强了医学语义的捕获能力。

Abstract: Automated chest X-ray report generation requires precise cross-modal grounding to ensure clinically reliable descriptions. However, existing vision-language models rely on implicit attention mechanisms that fail to enforce explicit region-word correspondence and disease-level consistency. We propose Game-Theoretic Alignment Network (GTA-Net), a vision-language framework that formulates report generation as a cooperative game-theoretic alignment problem. The model introduces a BinaryGameAligner that models interactions between image regions and text tokens using similarity-based payoff matrices with Shapley-inspired importance weighting. To enforce clinical semantics, we further develop a Disease-Aware Ternary Aligner, which captures joint interactions among images, reports, and structured disease concepts. GTA-Net combines a Swin-based visual encoder with a LoRA-adapted large language model and is trained with a unified objective for generation and alignment. Experiments on CheXpertPlus and IU-XRay demonstrate state-of-the-art performance across standard generation metrics and improved clinical consistency, highlighting the effectiveness of explicit game-theoretic alignment for medical vision-language generation.


[141] Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation cs.CVPDF

Jiahong Chen, Jinghao Wang, Ziwen Wang, Zi Wang, Banglei Guan

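TL;DR: This paper proposes a single-reference 6D pose estimation method for unseen objects built around an early cross-view semantic prior. A cross-view semantic interaction (CVSI) module exchanges dense visual-semantic cues before geometric decoding, and two training-time constraints (the IVSP and RAGC losses) keep the prior reliable and the 3D representation consistent, improving the discriminability of correspondence estimation. The method achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing correspondence-based single-reference pipelines for unseen-object pose estimation typically use vision foundation model (VFM) features only as intra-view descriptors. Dense appearance, structure, and context cues are therefore insufficiently exchanged across views, the decoded point features lack joint semantic and geometric discriminability, and correspondence estimation remains difficult in challenging cases.

Result: Extensive experiments use a challenging view-pair protocol built from the BOP Challenge datasets YCB-V and TUD-L. Across six benchmarks under different view-pair settings, the method achieves state-of-the-art (SOTA) performance while maintaining comparable inference speed.

Insight: The core contribution is a correspondence pipeline built around an early cross-view semantic prior, with the CVSI module performing dense cross-view semantic interaction early in feature extraction. To make the prior reliable for 3D correspondence learning, two complementary training-time constraints are proposed: the IVSP loss preserves the intra-view token affinity structure, and the RAGC loss enforces spatial representation consistency of the decoded point features, steering the semantic prior into geometric decoding.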

Abstract: Single-reference unseen object 6D pose estimation reduces object onboarding by estimating poses of arbitrary novel objects from only one reference view. Recent correspondence-based pipelines have achieved robust performance with vision foundation model (VFM) features. However, they typically treat these features as intra-view descriptors, leaving dense visual-semantic cues, including appearance, structure, and context, insufficiently exchanged across views before geometric decoding. Consequently, the decoded point features may lack joint semantic and geometric discriminability, making correspondence estimation still difficult in challenging cases. Instead of processing features independently, we build the correspondence pipeline around an early cross-view semantic prior. Specifically, cross-view semantic interaction (CVSI) enables dense query and reference VFM tokens to exchange semantic context and form a cross-view prior. Nevertheless, direct CVSI may disturb the VFM token structure, while the resulting semantic prior still needs 3D representation consistency for rigid correspondence. To make this CVSI prior reliable for 3D correspondence learning, we introduce two complementary training-time constraints: the intra-view structure preservation (IVSP) loss preserves the original intra-view token affinity structure during interaction, while the reference-anchored geometric consistency (RAGC) loss enforces spatial representation consistency of decoded point features. The final pose is recovered from learned correspondences through weighted SVD. We further construct a challenging view-pair protocol from the BOP Challenge datasets YCB-V and TUD-L to evaluate robustness in difficult matching scenarios. Extensive experiments on six benchmarks under different view-pair settings show that our method achieves state-of-the-art performance while maintaining comparable inference speed.
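
The abstract states that the final pose is recovered from learned correspondences through weighted SVD, but gives no further detail. The following is a minimal sketch of the standard weighted Kabsch/Procrustes solution, not the authors' implementation; the function name `weighted_rigid_transform`, the synthetic points, and the confidence weights are all assumptions made for illustration.

```python
import numpy as np


def weighted_rigid_transform(src, dst, weights):
    """Return (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    weights:  (N,) non-negative correspondence confidences (hypothetical input).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    if w.sum() <= 0:
        raise ValueError("weights must have a positive sum")
    w = w / w.sum()

    # Weighted centroids of both point sets.
    src_c = (w[:, None] * src).sum(axis=0)
    dst_c = (w[:, None] * dst).sum(axis=0)

    # Weighted 3x3 cross-covariance between the centered point sets.
    H = (src - src_c).T @ (w[:, None] * (dst - dst_c))
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Synthetic check: rotate 30 degrees about z, then translate.
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
src_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
dst_pts = src_pts @ R_true.T + t_true
conf = np.array([0.9, 0.8, 1.0, 0.6, 0.7])
R_est, t_est = weighted_rigid_transform(src_pts, dst_pts, conf)
```

With noise-free correspondences the weights do not change the solution; they matter when some matches are unreliable, since low-confidence pairs then contribute less to the cross-covariance.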


[142] Morphology-Aware Multimodal Representation Learning for Insect Phylogenetic Reconstruction cs.CVPDF

Zixuan Liu, Kaijie Yu, Chun He, Xiaoxu Cai, Xinhai Ye

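TL;DR: This paper proposes a morphology-aware multimodal alignment framework for insect phylogenetic reconstruction. It combines specimen images with morphological descriptions, adapts a vision transformer through parameter-efficient fine-tuning and supervised contrastive learning, aligns images and text in a shared latent space, and uses the learned image embeddings as continuous traits for Bayesian phylogenetic reconstruction.

Details

Motivation: Existing image-based deep learning methods rely mainly on single-modality visual representations and do not explicitly incorporate morphological semantics, which motivates a framework that fuses multimodal data to improve phylogenetic reconstruction accuracy.

Result: On the public Rove-Tree-11 dataset, comparative and ablation experiments across multiple visual backbones and feature adaptation strategies show that multimodal alignment improves topological agreement with the reference phylogeny.

Insight: The novelty lies in aligning morphological description text with specimen images to learn morphology-aware visual features, giving computational phylogenetics a continuous trait representation that incorporates semantic information.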

Abstract: Morphological traits provide important evidence for phylogenetic reconstruction and evolutionary relationship analysis. Recent image-based approaches have introduced deep learning, particularly convolutional models, to derive morphological features from specimen images, but these methods generally rely on single-modality visual representations and do not explicitly incorporate morphological semantics. This study proposes a morphology-aware multimodal alignment framework for insect phylogenetic reconstruction. The framework combines specimen images with curated morphological descriptions by adapting a vision transformer through parameter-efficient fine-tuning and supervised contrastive learning, followed by image-text alignment in a shared latent space. The learned image embeddings are then used as continuous traits for Bayesian phylogenetic reconstruction. On the public Rove-Tree-11 dataset, comparative and ablation experiments across multiple visual backbones and feature adaptation strategies demonstrate that multimodal alignment improves topological agreement with the reference phylogeny. The results indicate that the proposed framework can derive morphology-aware visual traits for computational phylogenetic reconstruction.


[143] BAC-JEPA: Label-Efficient Breast Arterial Calcification Segmentation via Synthetic Mammography-Guided Supervision cs.CVPDF

Scott Chase Waggener, Lakshman Tamil

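TL;DR: This paper proposes BAC-JEPA, a label-efficient framework for breast arterial calcification (BAC) segmentation. It inserts procedurally generated arterial calcification into real mammographic backgrounds and trains on the resulting exact masks, reducing reliance on expert pixel-level annotation. The method reaches IoU 0.5325 and Dice 0.6357 on held-out synthetic validation data and an image-level AUROC of 0.8719 on BacSeg, with four-view inference taking 110.68–213.63 ms.

Details

Motivation: Arterial calcification on screening mammograms is an emerging cardiovascular risk biomarker, but quantitative use requires reproducible segmentation, and expert pixel-level annotation is costly, so label-efficient segmentation methods are needed.

Result: On held-out synthetic validation data, the larger backbone achieves IoU 0.5325 and Dice 0.6357. On BacSeg, image-level classification from segmentation probability maps reaches AUROC 0.8719 (0.8547 for the smaller backbone). Four-view inference takes 110.68–213.63 ms on an RTX 5090 GPU.

Insight: The novelty lies in combining procedurally generated synthetic calcification with real mammographic backgrounds to provide exact supervision, reducing reliance on manual annotation, and in pairing self-supervised Vision Transformer encoders with a high-resolution convolutional decoder for efficient segmentation. The work shows the potential of synthetic supervision in medical image segmentation, although clinical validation still requires expert-reviewed real data.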

Abstract: Breast arterial calcification (BAC) on screening mammograms is an emerging cardiovascular risk biomarker, but quantitative use requires reproducible segmentation and expert pixel-level labels are costly. We present BAC-JEPA, a label-efficient segmentation framework trained on procedurally generated arterial calcification inserted into real mammographic backgrounds with exact masks. Candidate backgrounds were selected from model-screened mammograms with low predicted BAC response; the generator samples arterial structure, disease burden, radiographic appearance, and hard-negative distractors including nonarterial calcifications and metallic objects. Synthetic masks are paired with mammography self-supervised Vision Transformer encoders and a high-resolution convolutional decoder to produce full-resolution segmentation maps. The study used 75,472 mammography studies from 34,956 patients for background selection and representation learning, trained on synthetic images from 10,000 backgrounds, selected checkpoints with 1,000 development backgrounds, and evaluated transfer on all 1,000 human-labeled BacSeg synthetic 2D mammograms. On held-out synthetic validation data, the larger backbone achieved IoU 0.5325 and Dice 0.6357. On BacSeg, image-level classification from segmentation probability maps reached AUROC 0.8719, with 0.8547 for the smaller backbone. Four-view inference required 110.68–213.63 ms on an RTX 5090 GPU, and severe-preset synthetic image generation averaged 2.7071 s per image on a multicore workstation. These results indicate that BAC-specific synthetic supervision can produce useful image-level transfer without human pixel-level training masks, while expert-reviewed real-mammogram segmentation remains necessary for clinical validation and calibration.
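
For readers unfamiliar with the two segmentation metrics reported above, the sketch below computes IoU and Dice for a pair of binary masks. It is an illustration only: the function name `iou_and_dice`, the nested-list mask format, and the empty-mask convention are assumptions, and the abstract does not say how the authors aggregate the metrics across images.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks given as equal-shaped nested lists."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(1 for a, b in zip(p, t) if a and b)
    pred_area, target_area = sum(p), sum(t)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0
    return intersection / union, 2 * intersection / (pred_area + target_area)


# Tiny example: one overlapping pixel out of three labeled pixels in total.
iou, dice = iou_and_dice([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). The reported pair (IoU 0.5325, Dice 0.6357) does not satisfy that identity exactly, which would be consistent with scores averaged over images rather than computed on pooled pixels, though the abstract does not confirm this.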


[144] Feed-forward Motion In-betweening for Any 4D cs.CVPDF

Hiroki Nishizawa, Hubert P. H. Shum, Yoshihiro Fukuhara, Hirokatsu Kataoka, Shigeo Morishima

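TL;DR: This paper proposes a feed-forward in-betweening framework for generating arbitrary 4D meshes (3D geometry evolving over time). Building on universal mesh-animation latents, a frame-wise mesh VAE encodes each frame into topology-agnostic latent tokens, and a keyframe-conditioned rectified flow model with an MMDiT backbone synthesizes non-keyframe frames conditioned on sparse keyframes. Experiments show strong performance and improved controllability on the DyMesh16 and DyMesh32 benchmarks.

Details

Motivation: Large-scale, long-horizon 4D mesh data is scarce, so 4D generation for world modeling (e.g., animation and games) has relied on distillation or test-time optimization, which makes inference slow. Existing feed-forward generators offer limited spatiotemporal controllability, and short-horizon generation accumulates error over long sequences.

Result: The method shows strong performance on the DyMesh16 and DyMesh32 benchmarks with improved spatiotemporal controllability, and supports long-horizon 4D mesh generation conditioned on sparse keyframes with reduced error accumulation.

Insight: The novelties are a frame-wise mesh VAE that encodes frames into topology-agnostic latent tokens anchored by a reference mesh, and the combination of a keyframe-conditioned rectified flow model with an MMDiT backbone. Together they improve the efficiency and controllability of 4D mesh generation and offer a scalable approach to modeling the physical world.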

Abstract: 4D dynamics (3D geometry evolving over time) is a fundamental representation of the physical world and plays a crucial role in world modeling (e.g., animation and games). Owing to the scarcity of large-scale, long-horizon 4D mesh data with arbitrary shapes, early text-to-4D methods rely on distillation or test-time optimization from video diffusion priors, making inference prohibitively slow. Recent feed-forward generators greatly reduce inference cost but offer limited spatiotemporal controllability, and short-horizon generation often leads to error accumulation in long-horizon sequences. We propose a novel feed-forward in-betweening framework for arbitrary 4D meshes with keyframe conditioning. Building on universal mesh-animation latents, we introduce a frame-wise mesh VAE that encodes each frame into topology-agnostic latent tokens anchored by a reference mesh for keyframe conditioning. We further introduce a keyframe-conditioned rectified flow model with an MMDiT backbone that synthesizes non-keyframe frames conditioned on sparse keyframes. Experiments show strong performance and improved controllability on both DyMesh16 and DyMesh32 benchmarks.


[145] SAGE: An Expert-Annotated South Asian GI Endoscopy Dataset for Multimodal Learning and Hallucination Analysis cs.CV | cs.AIPDF

Niyoj Oli, Sachin Acharya, Sandesh Pokhrel, Sanjay Bhandari, Ramesh Rana

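TL;DR: This paper introduces SAGE, a multimodal learning dataset of gastrointestinal (GI) endoscopy images from South Asia, containing 1,300 images with captions, multi-label classification labels, and question-answer pairs. The dataset targets the population bias that can arise because existing public datasets lack geographic diversity, and it supports image captioning, multi-label classification, and visual question answering.

Details

Motivation: The burden of GI cancers in South Asia is growing, but early diagnosis is hampered by shortages of equipment, funding, and specialists. Existing public datasets come mainly from Europe and lack geographic diversity, which makes it hard to assess whether models carry population bias and limits the development of inclusive AI diagnostic tools.

Result: Evaluation on SAGE shows that multi-class classification models suffer an average performance drop of 58% on South Asian data. For contemporary large multimodal models, the average GREEN score falls to 0.308 for anatomical landmark detection and 0.410 for abnormality detection, a substantial degradation.

Insight: The novelty is an expert-annotated multimodal GI endoscopy dataset for South Asia, presented as the first of its kind, that supports multiple tasks and can be used to analyze model hallucination. Viewed objectively, the dataset helps expose and mitigate geographic bias of AI models in medical imaging.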

Abstract: Gastrointestinal (GI) cancers represent a growing health burden in the South Asian region, driven largely by rapid changes in socio-economic conditions and lifestyle habits. However, early diagnosis of such malignancies remains a significant challenge, largely due to a lack of modern equipment, a lack of financial support, and a scarcity of GI experts. AI-assisted diagnosis and report generation show great promise in alleviating this problem by giving lower-skilled personnel the technical expertise to perform diagnosis. However, almost all open-source, publicly available datasets are collected from the European region, with no representation from the South Asian region. The lack of open-source GI datasets from diverse geographic regions has made it difficult to assess whether population bias is present in existing models, and to develop geographically inclusive AI tools for automated GI diagnosis. To address this gap, we introduce SAGE: An Expert-Annotated South Asian GI Endoscopy dataset for image captioning, multi-label classification, and visual question answering (VQA) tasks. It consists of 1,300 images, their captions along with hallucination tags, 18 labels, and 14,726 question-answer pairs, making it well suited to a diverse range of tasks including classification, benchmarking, and fine-tuning large multimodal models (LMMs). We further benchmarked multi-class classifiers on the effect of population shift in GI imaging AI tasks, and contemporary LMMs on their performance. Our study reveals that task-specific models, such as multi-class classification models, suffer the most, with an average performance drop of 58% when evaluated on the South Asian dataset. For contemporary LMMs, benchmarking reveals a substantial drop in the average GREEN score for anatomical landmark detection (0.308) and abnormality detection (0.410).


[146] Improving Reasoning in Vision-Language Models via Perception Verified Self-Training cs.CVPDF

Sourabh Sharma, Sonam Gupta, Sadbhawna Thakur

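TL;DR: This paper proposes a perception-verified self-training framework to improve reasoning in vision-language models (VLMs). It uses a chain-of-thought template that decouples perception from reasoning, introduces the unsupervised PerceptEval method to assess caption quality, and combines caption verification with answer correctness to grade the data. On that basis, a two-stage curriculum learning strategy uses verified captions to guide reasoning enhancement, reducing visual hallucinations and language shortcuts.

Details

Motivation: Existing approaches rely on human annotators or proprietary models to generate chain-of-thought rationales, which is costly and hard to scale. Self-training methods typically filter only by answer correctness and do not verify visual perception, so they often suffer from visual hallucinations and language shortcuts.

Result: The abstract reports no quantitative results or benchmark information. It presents a new framework intended to improve reasoning performance through perception verification and curriculum learning.

Insight: The novelties are: 1) a chain-of-thought template that decouples perception from reasoning so visual understanding can be verified independently; 2) PerceptEval, an unsupervised method for assessing caption quality; and 3) a curriculum learning strategy based on data grading that regenerates reasoning conditioned on verified captions, strengthening visually grounded reasoning.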

Abstract: Achieving human-like reasoning in Vision-Language Models (VLMs) remains a long-standing challenge. Recent approaches leverage Chain-of-Thought (CoT) rationales generated by human annotators or proprietary models to improve reasoning, which is costly and difficult to scale. Self-training offers a promising alternative by using a model's own outputs as supervision. However, existing methods often suffer from visual hallucinations, where rationales describe non-existent visual content, and language shortcuts, where predictions rely on textual priors rather than true visual grounding, because rationales are typically filtered only by answer correctness without verifying visual perception. To address this limitation, we propose a perception-verified self-training framework that enforces visually grounded reasoning. First, our method employs a CoT template (caption-reasoning-conclusion) that disentangles perception from reasoning, enabling independent verification of visual understanding. To compensate for the absence of ground-truth captions, we propose PerceptEval, an unsupervised method that evaluates caption quality based on its alignment with visual and textual elements present in the image. Using caption verification together with answer correctness, we partition the data into three subsets: easy (correct caption and conclusion), medium (correct caption but incorrect conclusion), and hard (incorrect caption). Building on this partitioning, we design a two-stage curriculum learning strategy. In Stage 1, the model is trained on easy examples; in Stage 2, medium samples are incorporated through a caption-guided reasoning enhancement procedure that regenerates reasoning conditioned on verified captions. Only regenerated samples with the correct conclusions are retained.
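
The three-way data partition described in the abstract can be sketched as a small routing function. This is an illustration under stated assumptions: the field names `caption_ok` and `answer_ok` are hypothetical stand-ins for the PerceptEval caption verdict and the answer-correctness check, neither of which is specified in the abstract.

```python
def partition_samples(samples):
    """Split self-generated rationales into easy / medium / hard buckets.

    Each sample is a dict with two hypothetical boolean fields:
      caption_ok: the caption passed perception verification
      answer_ok:  the conclusion matched the reference answer
    """
    buckets = {"easy": [], "medium": [], "hard": []}
    for sample in samples:
        if not sample["caption_ok"]:
            # Incorrect caption: hard, regardless of the conclusion.
            buckets["hard"].append(sample)
        elif sample["answer_ok"]:
            # Correct caption and correct conclusion.
            buckets["easy"].append(sample)
        else:
            # Correct caption but incorrect conclusion.
            buckets["medium"].append(sample)
    return buckets


demo_samples = [
    {"id": 1, "caption_ok": True, "answer_ok": True},
    {"id": 2, "caption_ok": True, "answer_ok": False},
    {"id": 3, "caption_ok": False, "answer_ok": True},
    {"id": 4, "caption_ok": False, "answer_ok": False},
]
demo_buckets = partition_samples(demo_samples)
```

Note that a sample with a wrong caption but a correct answer (id 3) lands in the hard bucket: under this scheme a right answer reached without correct perception is treated as a possible language shortcut, not as a success.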


[147] Accurate identification and measurement of the precipitate area by two-stage deep neural networks in novel chromium-based alloys cs.CV | cond-mat.mtrl-sci | cs.LG | physics.chem-phPDF

Zeyu Xia, Kan Ma, Sibo Cheng, Thomas Blackburn, Ziling Peng

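TL;DR: This paper proposes DT-SegNet, a two-stage deep learning scheme for automatically identifying and measuring precipitate area in chromium-based alloys from electron microscopy images. The method combines YOLOv5 for object detection with SegFormer for segmentation, and aims to overcome the sensitivity to background noise, poor generalization, and heavy manual measurement effort of traditional fixed-threshold image processing.

Details

Motivation: Traditional fixed-threshold image processing of electron microscopy images is sensitive to background noise, generalizes poorly across materials, and requires substantial manual measurement, which makes it a bottleneck for efficiently measuring precipitate volume fraction and size distribution during alloy development.

Result: Numerical experiments show that DT-SegNet substantially outperforms the state-of-the-art segmentation tools offered by Weka and ilastik across metrics including accuracy, precision, recall, and F1-score.

Insight: The novelty is an end-to-end two-stage deep learning framework that combines the training efficiency of convolutional neural networks (CNNs) with the segmentation accuracy of a Vision Transformer. Viewed objectively, this hybrid design offers an effective and transferable pattern for specific, challenging image analysis tasks in materials science.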

Abstract: The performance of advanced materials for extreme environments is underpinned by their microstructure, including the size and distribution of reinforcing phases. Chromium-based superalloys are a recently proposed alternative to conventional face-centred-cubic superalloys for high-temperature applications, such as Concentrated Solar Power, and their development requires efficient measurement of precipitate volume fraction and size distribution from electron microscopy images. Traditional fixed-threshold image processing is sensitive to background noise, generalises poorly across materials, and requires substantial manual measurement effort. To address these bottlenecks, this study proposes DT-SegNet, an end-to-end two-stage deep learning scheme based on YOLOv5 and SegFormer for object detection and segmentation in electron microscopy images. The approach combines the training efficiency of convolutional neural networks at the detection stage with the segmentation accuracy of a Vision Transformer. Numerical experiments show that DT-SegNet substantially outperforms state-of-the-art segmentation tools offered by Weka and ilastik across metrics including accuracy, precision, recall, and F1-score. The model provides a useful tool for alloy-development microstructure examinations and helps address the large datasets associated with high-throughput alloy development.


[148] Resolving Multi-Target Association in OFDM-based ISAC via Vision-aided Multi-Modal Learning cs.CV | eess.SPPDF

Meng Hua, Chenghong Bian, Deniz Gunduz

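TL;DR: This paper proposes a vision-assisted OFDM-ISAC framework that fuses wireless and visual modalities to resolve the association ambiguity and resolution limits that arise when target parameters are extracted from delay-Doppler maps in multi-target scenarios. A street-view image is transmitted with deep joint source-channel coding, and at the receiver, object detections from the reconstructed image are combined with the wireless sensing data in a multi-modal network for accurate target association and parameter estimation.

Details

Motivation: In OFDM-based integrated sensing and communication systems, extracting target parameters from a delay-Doppler map built from reflected pilots is ambiguous in multi-target scenarios: peaks cannot be attributed to physical targets, and multiple targets within the same resolution cell cannot be separated.

Result: On a Blender-rendered vehicular testbed, the framework achieves a localization root mean square error of 16 cm and a delay root mean square error of 10.8 ns. An ablation study shows that removing the visual modality degrades localization by a factor of 60.

Insight: The novelty lies in bringing the visual modality into ISAC and fusing wireless sensing with visual information through multi-modal learning, which addresses the data-association and resolution limits of single-modality ISAC. Technically, the image is transmitted with deep joint source-channel coding, and a KL-divergence loss is designed for the high-dimensional delay and Doppler classifiers to stabilize training.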

Abstract: Orthogonal frequency division multiplexing (OFDM)-based integrated sensing and communication (ISAC) systems commonly extract target parameters by peak-searching a delay-Doppler map (DDM) constructed from reflected pilots. In multi-target scenarios, this results in ambiguity: the DDM does not reveal which physical target produced which peak, and two targets within the same delay-Doppler resolution cell cannot be separated. We propose a vision-assisted OFDM-ISAC framework that resolves both limitations by fusing wireless and visual modalities. The transmitter encodes an onboard street-view image with deep joint source-channel coding (DeepJSCC) and transmits it over the same OFDM waveform used for sensing; the receiver reconstructs the image, runs a fine-tuned YOLOv5 detector, and fuses the resulting per-target features (bounding-box coordinates and class labels) with the DDM and transmitter-receiver geometry through a learned multi-modal network. To stabilize training of the high-dimensional delay and Doppler classifiers, we introduce a Kullback-Leibler loss against triangular soft labels centered on the ground-truth bin. On a Blender-rendered vehicular testbed, the proposed framework achieves a 16 cm localization root mean square error (RMSE) and a 10.8 ns delay RMSE. An ablation study confirms that removing the visual modality causes a 60x degradation in localization. These results highlight the potential of vision to overcome the data-association and resolution limits of single-modality ISAC.
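
The Kullback-Leibler loss against triangular soft labels mentioned in the abstract can be illustrated as follows. This is a sketch under assumptions: the helper names, the `half_width` parameter that controls how far the triangle extends, and the direction KL(target || prediction) are all choices made here, since the abstract does not specify them.

```python
import math


def triangular_soft_labels(num_bins, true_bin, half_width=2):
    """Triangle-shaped target distribution peaked at true_bin.

    Mass decays linearly with distance from true_bin and reaches zero at
    distance half_width + 1 (half_width is a hypothetical parameter).
    """
    raw = [max(0.0, 1.0 - abs(k - true_bin) / (half_width + 1))
           for k in range(num_bins)]
    total = sum(raw)
    return [r / total for r in raw]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred); bins with zero target mass contribute nothing."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, pred) if t > 0)


# Ten delay bins, ground truth in bin 5, classifier initially uniform.
labels = triangular_soft_labels(num_bins=10, true_bin=5, half_width=2)
uniform_pred = softmax([0.0] * 10)
loss_uniform = kl_divergence(labels, uniform_pred)
```

Compared with a one-hot target, the soft label gives partial credit to predictions that land in a neighboring bin, which is one plausible reason such a loss would stabilize training when the number of bins is large.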


[149] MultiMem: Measuring and Mitigating Memorization in Multi-Modal Contrastive Learning cs.CV | cs.AI | cs.LGPDF

Wenhao Wang, Franziska Boenisch, Michael Backes, Adam Dziedzic

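TL;DR: This paper proposes MultiMem, the first metric for quantifying memorization in multi-modal contrastive learning, and systematically analyzes how cross-modal semantic misalignment affects memorization, finding that text is the dominant modality driving it. The study also shows that targeted augmentations applied across all modalities reduce memorization and improve model performance.

Details

Motivation: Memorization can improve performance on rare in-distribution samples, but it also causes harmful retention of noise and outliers, which degrades generalization. Memorization has been studied extensively for supervised and self-supervised learning in the vision domain but remains unexplored in multi-modal contrastive learning. This paper aims to fill that gap.

Result: The systematic analysis shows that cross-modal semantic misalignment has the strongest influence on memorization, with text the dominant modality driving memorization, followed by video, image, and audio. Targeted augmentations applied across all modalities reduce memorization as measured by MultiMem and improve model performance.

Insight: The novelty lies in establishing the first framework for measuring and mitigating memorization in multi-modal contrastive learning, revealing the strong link between cross-modal semantic misalignment and memorization, and showing that multi-modal targeted augmentation is an effective mitigation, which helps prevent harmful data retention and build higher-performing models.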

Abstract: Memorization in machine learning models enables high performance on rare in-distribution samples by capturing their atypical patterns. However, it also causes harmful retention of noise and outliers, degrading generalization. While memorization has been extensively studied in both supervised and self-supervised learning in the vision domain, it remains unexplored in multi-modal contrastive learning. We address this gap by introducing MultiMem, the first metric designed to quantify memorization in multi-modal contrastive learning. Through our systematic analysis, we demonstrate that cross-modal semantic misalignment has the strongest influence on memorization, with text being the dominant modality driving memorization, followed by video, image, and audio. We show that targeted augmentations applied across all modalities effectively reduce memorization as measured by our MultiMem metric and improve model performance. Overall, this work establishes the first framework for measuring and mitigating memorization in multi-modal contrastive learning, preventing harmful data retention and contributing to higher-performing models.


[150] T-IMPACT: A Severity-Aware Benchmark for Contextual Image-Text Manipulation cs.CVPDF

Gagandeep Singh, Aaditya Yadav, Priyanka Singh

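TL;DR: This paper introduces T-IMPACT, a first-release severity-aware benchmark for manipulated news-style image-text pairs. The dataset contains 98,786 examples spanning pristine, image-only, text-only, and joint manipulations, with a calibrated continuous severity score, coarse severity band labels, and supporting grounding metadata. Experiments show that existing models recover some authenticity signal, but severity prediction is substantially harder and only weakly aligned with human judgment.

Details

Motivation: Existing datasets focus mainly on authenticity, out-of-context mismatch, or manipulation type, and rarely capture how strongly an edit changes the likely interpretation of a post. A benchmark is therefore needed that assesses the severity of the contextual impact of multimodal manipulation.

Result: Experiments show that current models can recover some authenticity signal, but severity prediction is substantially harder and only weakly aligned with human judgment. T-IMPACT provides an initial benchmark for studying multimodal manipulation beyond binary real/fake classification, toward graded contextual impact.

Insight: The novelty is a multimodal manipulation benchmark with a calibrated continuous severity signal, which shifts evaluation from binary classification to graded impact. Its data generation pipeline (extracting semantic anchors, grounding them spatially, performing localized image edits and constrained caption rewrites) and its calibration with limited human ratings offer a useful template for building fine-grained evaluation datasets.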

Abstract: Recent advances in vision-language models and generative editing systems have made it increasingly easy to produce persuasive multimodal misinformation by altering images, text, or both jointly. However, existing datasets focus mainly on authenticity, out-of-context mismatch, or manipulation type, and rarely capture how strongly an edit changes the likely interpretation of a post. We introduce T-IMPACT, a first-release severity-aware benchmark for manipulated news-style image-text pairs. T-IMPACT contains 98,786 examples spanning pristine, image-only, text-only, and joint manipulations, with a calibrated continuous severity signal, coarse low/medium/high labels, and supporting grounding metadata. Starting from a news image-text pair, the pipeline extracts semantic anchors, grounds them spatially, performs localized image edits and constrained caption rewrites, and calibrates contextual-impact scores using limited human ratings. In this release, the calibrated continuous score is the primary severity target, while the low/medium/high bands should be interpreted as coarse operating buckets rather than balanced classes. Experiments show that current models recover some authenticity signal, but severity prediction remains substantially harder and only weakly aligned with human judgment. T-IMPACT provides an initial benchmark for studying multimodal manipulation beyond binary real/fake classification toward graded contextual impact.


[151] Dual-Stream EEG Decoding for 3D Visual Perception cs.CV | cs.AIPDF

Ninon Lizé Masclef, Taisija Demcenko, Antonella Catanzaro, Nataliya Kosmyna

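TL;DR: This paper proposes a dual-stream electroencephalography (EEG) decoding model for 3D visual perception, inspired by biological vision. Mirroring the ventral and dorsal pathways, the model decodes object identity and spatial orientation separately and uses EEG signals for 3D reconstruction. Experiments show that both object identity and orientation can be decoded from EEG, and interpretability analyses reveal the dynamic involvement of ventral, dorsal, and motor-related channels in decoding.

Details

Motivation: The paper addresses the challenge of decoding 3D visual perception, including object identity and spatial orientation, from EEG signals. It is motivated by the separate processing in the ventral and dorsal pathways of the biological visual system, with the goal of modeling more faithfully how the brain perceives 3D shapes under continuous rotation.

Result: The method predicts object identity and spatial orientation from EEG and performs 3D reconstruction with an EEG-conditioned multiview diffusion model. Interpretability analyses show that decoding involves temporally structured participation of ventral, dorsal, and motor-related channels rather than a static ventral dominance.

Insight: The novelties are a bio-inspired dual-stream architecture that handles object identity and spatial orientation separately, circular regression for angle prediction, and an EEG-conditioned multiview diffusion model for 3D reconstruction. Viewed objectively, the study combines neural decoding with 3D visual generation and offers interpretable insight into the neural mechanisms behind decoding.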

Abstract: This paper explores a novel brain decoding model for 3D shape perception through a dual pathway architecture mirroring biological vision. Our bio-inspired approach implements separate decoding modules for object identity and spatial orientation, inspired by ventral and dorsal pathways, during continuous rotations. We employ circular regression for angle prediction and develop EEG-conditioned multiview diffusion for 3D reconstruction. Our approach successfully decodes both object identity and spatial orientation from EEG signals and enables 3D reconstruction from neural activity, with interpretability analyses revealing temporally structured involvement of ventral, dorsal, and motor-related channels rather than a static ventral dominance in supporting object and angle decoding.
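
The abstract mentions circular regression for angle prediction without describing the parameterization. A common approach, sketched below as an assumption and not as the authors' method, is to regress the (sin, cos) pair of the angle and score predictions with a loss and an error that both respect the wrap-around at 360 degrees. All function names here are hypothetical.

```python
import math


def encode_angle(deg):
    """Map an angle in degrees to its (sin, cos) regression target."""
    rad = math.radians(deg)
    return math.sin(rad), math.cos(rad)


def decode_angle(sin_value, cos_value):
    """Recover an angle in [0, 360) from a possibly unnormalized (sin, cos) pair."""
    return math.degrees(math.atan2(sin_value, cos_value)) % 360.0


def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def circular_loss(pred_sin_cos, true_deg):
    """1 - cos(angle difference): 0 for a perfect match, 2 for opposite directions."""
    s, c = pred_sin_cos
    norm = math.hypot(s, c)
    if norm == 0.0:
        raise ValueError("prediction has zero length, so its angle is undefined")
    true_s, true_c = encode_angle(true_deg)
    return 1.0 - (s * true_s + c * true_c) / norm
```

The point of the (sin, cos) encoding is that 359 and 1 degree are treated as 2 degrees apart, whereas plain regression on the raw angle would penalize that pair as if it were 358 degrees apart.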


[152] Customizing Video Portraits via Identity-Action Decoupling cs.CVPDF

Junxiong Lin, Haoran Wang, Xinji Mai, Zeng Tao, Xuan Tong

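TL;DR: This paper proposes an Identity-Action Decoupling (IaD) framework for identity-preserving text-to-video generation (IPT2V). The method aims to synthesize temporally coherent video from a reference image and a text description while preserving the subject's identity and allowing fine-grained control over facial dynamics. The core idea is to decouple identity-relevant from identity-irrelevant information in the facial embedding, producing richer facial motion that follows the text prompt more closely.

Details

Motivation: Existing methods such as ID-Animator and ConsisID inject identity features at inference time but ignore the identity-irrelevant information contained in facial embeddings, so the generated facial motion is monotonous or inaccurate and follows the text prompt poorly.

Result: Without any subject-specific fine-tuning, the IaD framework generates videos that (1) maintain cross-temporal identity consistency and (2) exhibit rich, controllable expressions and scene variations that closely match the input text.

Insight: The main novelty is the identity-action decoupling framework together with two loss functions, an identity decoupling loss and a text alignment loss. Decoupling identity information from action information in the facial embedding addresses the poor facial motion of existing methods and achieves a better balance between identity preservation and motion flexibility.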

Abstract: Identity-Preserving Text-to-Video Generation (IPT2V) seeks to synthesize a temporally coherent video from a reference image and a textual description, while simultaneously preserving the subject’s identity and allowing fine-grained control over facial dynamics. Recent methods such as ID-Animator and ConsisID inject identity features only at inference time, but they ignore the identity-irrelevant information contained in facial embeddings, leading to monotonous or inaccurate facial movements that poorly follow the prompt. We introduce the Identity-Action Decoupling (IaD) framework, together with two loss functions, the Identity Decoupling Loss and the Text Alignment Loss, to solve this problem. Without any subject-specific fine-tuning, IaD yields videos that (1) maintain cross-temporal identity consistency and (2) exhibit rich, controllable expressions and scene variations that closely match the input text.


[153] Towards Accurate and Robust Surveillance Roadside IVD via Trackletized Audio-Visual Reasoning cs.CV | eess.ASPDF

Xiwen Li, Xiaoya Tang, Bodong Zhang, Tolga Tasdizen

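TL;DR: This paper proposes TAVR-IVD, an audio-visual framework for idling vehicle detection (IVD) in roadside surveillance. Guided by multi-object tracking, the method links vehicle detections into tracklets and classifies each vehicle at the tracklet level, which raises the signal-to-noise ratio, stabilizes temporal decisions, and improves cross-domain adaptability.

Details

Motivation: Existing full-image, clip-level fusion methods for idling vehicle detection tend to overfit the scene background, produce unstable temporal decisions, and lack an explicit spatial prior to align vehicles with microphones, which makes them brittle under domain shift and data inefficient.

Result: The method is evaluated on two evaluation extensions, AVIVD-LT and AVIVD-M, which cover inter-day and cross-site domain shifts to assess deployment robustness. It adapts across domains with limited calibration annotations while remaining detector agnostic and efficient.

Insight: The novelty lies in trackletized audio-visual reasoning based on multi-object tracking: operating on tracklets raises the signal-to-noise ratio and stabilizes decisions, and an enforced spatial prior aligns vehicles with microphones. Viewed objectively, tracklet-level classification and the cross-domain adaptation mechanism improve the robustness and data efficiency of the IVD task.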

Abstract: Idling Vehicle Detection (IVD) seeks to determine, at the final frame of a video clip, whether any vehicle is idling, meaning the vehicle is stationary with its engine running, using synchronized video from a remote surveillance camera and multichannel audio captured by spatially distributed wireless microphones along the roadside. Prior full-image, clip-level fusion approaches tend to overfit scene background and full-frame context, produce unstable temporal decisions, and lack an explicit spatial prior to align vehicles with microphones, which makes them brittle under domain shift and data inefficient. Instead, we introduce TAVR-IVD, an audio-visual framework guided by multi-object tracking. Our method detects vehicles, links detections into tracklets, and classifies each vehicle by operating on its tracklet. This design raises the effective signal-to-noise ratio, stabilizes temporal decisions through tracklets, enforces an explicit spatial prior to align vehicles with microphones, and adapts across domains with limited calibration annotations while remaining detector agnostic and efficient. To evaluate deployment robustness, we further curate two evaluation extensions, AVIVD-LT and AVIVD-M, covering inter-day and cross-site shifts.


[154] Towards Error-Free Long Video Generation cs.CVPDF

Shuning Chang, Weihua Chen, Jiasheng Tang, Hao Xu, Zeyu Zhang

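TL;DR: This paper proposes an infinite-length video generation framework that targets error accumulation, attribute drift, and data scarcity in long video generation. A diffusion model is fine-tuned on large-scale short video data as a video extension model, and, borrowing the autoregressive mechanism of large language models, causal attention between clips is used to maintain long-term consistency.

Details

Motivation: Current video generation can synthesize minute-level videos, but long video generation still faces error accumulation, attribute drift, and limited long video data. A method is needed that generates high-quality, dynamic, and identity-consistent long videos.

Result: Experimental results indicate that the method sets a new benchmark for realistic and coherent minute-level video generation and effectively mitigates error accumulation and attribute drift.

Insight: The novelties include combining a diffusion model with causal attention to preserve long-term context, using KV caching for memory-efficient inference, and introducing a truncation-rectified flow technique to further suppress error accumulation, together offering a scalable approach to long video generation.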

Abstract: Recent advances in video generation have made minute-level synthesis possible; however, generating long videos remains challenging due to error accumulation, attribute drift, and the limited availability of long video data. In this paper, we introduce an infinite-length video generation framework that focuses on addressing these issues and produces high-quality, dynamic, and identity-consistent single-shot long videos. We first finetune a diffusion model as a video extension model on large-scale short video data to autoregressively generate temporally coherent clips. Inspired by the success of large language models (LLMs), we adopt causal attention computation between clips to further finetune this model on long video data. In this way, the tokens in one clip (short video) are computed by bidirectional attention while tokens among clips are computed by unidirectional attention. This design leverages the strengths of modern diffusion models while preserving long-term context information, effectively mitigating error accumulation and attribute drift. To achieve memory efficiency during inference, we adopt a key-value (KV) caching mechanism to maintain a constant KV memory. Furthermore, we introduce a truncation-rectified flow (T-RFlow) technique to further suppress error accumulation. Experimental results demonstrate the effectiveness of our method. Our framework establishes a new benchmark for realistic and coherent minute-level video synthesis.
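
The attention pattern described in the abstract, bidirectional within a clip and unidirectional between clips, corresponds to a block-causal mask. The sketch below builds such a mask in plain Python; the function name and the assumption that every clip has the same number of tokens are illustrative choices, not details taken from the paper.

```python
def block_causal_mask(num_clips, tokens_per_clip):
    """Return a square boolean mask where mask[q][k] is True if query q may attend to key k.

    Tokens attend to every token in their own clip (bidirectional) and to
    all tokens in earlier clips, but never to tokens in later clips.
    """
    total = num_clips * tokens_per_clip
    return [
        [(k // tokens_per_clip) <= (q // tokens_per_clip) for k in range(total)]
        for q in range(total)
    ]


# Three clips of two tokens each give a 6 x 6 mask.
mask = block_causal_mask(num_clips=3, tokens_per_clip=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no token ever attends to a later clip, keys and values of finished clips do not change once computed, which is what makes KV caching applicable. Keeping that cache at a constant size, as the abstract states, would additionally require bounding how many past clips are retained, a detail the abstract does not spell out.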


[155] Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers cs.CV | cs.AI | cs.LGPDF

Edwin Kwadwo Tenagyei, Lei Wang, Ugochukwu Ejike Akpudo, Jun Zhou, Yongsheng Gao

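TL;DR: This paper proposes HyperAdapter, a new parameter-efficient fine-tuning method for vision transformers. It moves adaptation from the space of independent tokens to hyperedge space and uses a hypergraph to model structured relationships among tokens, enabling group-aware adaptation that improves fine-tuning performance while remaining parameter efficient.

Details

Motivation: Existing adapter-based parameter-efficient fine-tuning methods process each token independently and overlook the structured relationships among tokens that arise naturally in visual scenes, which can lead to redundant updates and spatially inconsistent feature refinement. This paper aims to bring a structural inductive bias into adapter design.

Result: Extensive experiments across diverse visual benchmarks show that, under comparable parameter budgets, structured hyperedge adaptation consistently outperforms strong parameter-efficient fine-tuning baselines, with particularly pronounced gains on tasks requiring structured reasoning.

Insight: The core idea is to move the adaptation space from tokens to hyperedges: a soft hypergraph is built through prototype-based soft assignment, lightweight bottleneck adaptation is applied at the hyperedge level, and the updates are diffused back to tokens through the hypergraph incidence structure. This highlights the adaptation space as a new and important design dimension in parameter-efficient transfer for ViTs.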

Abstract: Parameter-efficient fine-tuning (PEFT) has become a practical solution for adapting large pretrained vision transformers (ViTs) to downstream tasks while updating only a small subset of parameters. However, existing adapter-based methods perform adaptation independently for each token, implicitly assuming that token refinements should be learned in isolation. This token-wise formulation overlooks the structured relationships among tokens that naturally arise in visual scenes, potentially leading to redundant updates and spatially inconsistent feature refinement. In this work, we revisit the design of parameter-efficient adapters and propose to perform adaptation in hyperedge space rather than token space. We introduce HyperAdapter, a hypergraph-based adapter architecture that enables structured, group-aware adaptation through soft token routing. HyperAdapter constructs a soft hypergraph over ViT tokens using prototype-based assignments, aggregates token features into latent hyperedge representations, applies lightweight bottleneck adaptation at the hyperedge level, and diffuses the resulting updates back to tokens via the hypergraph incidence structure. This design injects an explicit structural inductive bias into PEFT while preserving the modularity and efficiency of standard adapters. Extensive experiments across diverse visual benchmarks demonstrate that structured hyperedge adaptation consistently outperforms strong PEFT baselines under comparable parameter budgets, with particularly pronounced gains on tasks requiring structured reasoning. Our results suggest that the choice of adaptation space is a critical yet underexplored dimension in parameter-efficient transfer for ViTs.
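
The four steps listed in the abstract (soft assignment to hyperedges, aggregation, bottleneck adaptation, diffusion back to tokens) can be sketched in a few lines of NumPy. This is a rough illustration under assumptions and not the HyperAdapter implementation: the softmax-over-prototypes assignment, the ReLU bottleneck, the residual connection, the zero-initialized up-projection, and every name and shape below are choices made here for concreteness.

```python
import numpy as np


def hyperedge_adapter(tokens, prototypes, w_down, w_up, temperature=1.0):
    """Adapt tokens in hyperedge space and return the refined tokens.

    tokens:     (N, D) token features
    prototypes: (E, D) one learnable prototype per hyperedge (hypothetical)
    w_down:     (D, r) bottleneck down-projection
    w_up:       (r, D) bottleneck up-projection
    """
    # 1. Soft incidence matrix: each token is softly assigned to hyperedges.
    logits = tokens @ prototypes.T / temperature          # (N, E)
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)           # rows sum to 1

    # 2. Aggregate tokens into hyperedge features (assignment-weighted mean).
    edge_weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
    edge_feats = edge_weights.T @ tokens                  # (E, D)

    # 3. Lightweight bottleneck adaptation at the hyperedge level.
    edge_update = np.maximum(edge_feats @ w_down, 0.0) @ w_up  # (E, D)

    # 4. Diffuse the updates back to tokens through the incidence matrix.
    return tokens + assign @ edge_update                  # (N, D)


rng = np.random.default_rng(0)
num_tokens, dim, num_edges, rank = 6, 8, 3, 2
tokens = rng.normal(size=(num_tokens, dim))
prototypes = rng.normal(size=(num_edges, dim))
w_down = rng.normal(size=(dim, rank)) * 0.1

# With a zero up-projection the adapter starts as the identity mapping.
out_identity = hyperedge_adapter(tokens, prototypes, w_down, np.zeros((rank, dim)))
out = hyperedge_adapter(tokens, prototypes, w_down, rng.normal(size=(rank, dim)) * 0.1)
```

In this sketch the bottleneck runs on E hyperedge features instead of N token features, and tokens assigned to the same hyperedge receive correlated updates, which is one way to read the abstract's claim of group-aware, spatially consistent refinement.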


[156] Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding cs.CV | cs.AIPDF

Haodi Liu, Xinhang Yang, Kunda Yan, Sen Cui, Zeyu Zhang

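TL;DR: This paper proposes the Gold Points Sniper (GPS) framework, which strengthens the self-guided multimodal reasoning of lightweight vision-language models (VLMs) for fine-grained human action understanding in domestic robotics. Three modules (extraction of key action details, selective self-questioning for verification, and semantic entailment evaluation) let the model generate descriptions that are both informative and factually accurate.

Details

Motivation: Current systems fall short in understanding fine-grained human actions, intentions, and contextual cues when people occupy only small regions of a wide field of view. Open-vocabulary action recognition methods are limited to predefined labels, and VLMs face an inherent trade-off between informational richness and factual fidelity, so neither provides the deep semantic interpretation that reliable human-robot interaction requires.

Result: Extensive experiments on an instruction-tuning dataset built on the CAP benchmark show that GPS-enhanced lightweight VLMs achieve substantial performance gains, with some models reaching performance comparable to the proprietary GPT-4o while maintaining better factual accuracy.

Insight: The novelty is a self-guided multimodal reasoning framework that trains the model to identify key details, verify them through selective self-questioning, and assess factual consistency with semantic entailment. This yields information-dense yet factually reliable descriptions of fine-grained actions and a reliable basis for action understanding in domestic robots.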

Abstract: Robots operating in everyday environments must understand fine-grained human actions, intentions, and contextual cues from broad views where people occupy only small regions, a capability unmet by current systems. While open-vocabulary action recognition methods remain limited to assigning predefined labels, and vision-language models (VLMs) face an inherent trade-off between informational richness and factual fidelity in their outputs, neither approach achieves the deep semantic interpretation required for reliable human-robot interaction. We propose Gold Points Sniper (GPS), a novel framework that empowers lightweight VLMs with self-guided multimodal reasoning capabilities for fine-grained human action understanding. Our approach comprises three key modules: Gold Points Extractor trains VLMs to identify critical action-relevant details, Selective Socratic Questioner validates and refines these details through selective self-questioning, and Semantic Entailment Evaluator quantitatively assesses factual consistency using semantic entailment classification. Extensive experiments on our curated instruction-tuning dataset based on the CAP benchmark demonstrate that GPS-enhanced lightweight VLMs achieve substantial performance improvements, with some models reaching performance comparable to proprietary GPT-4o while maintaining superior factual accuracy. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, enabling robots to safely interpret human behavior through information-dense yet factually grounded descriptions. Source code, training configurations, annotation prompts, and dataset details are released at https://github.com/Haodi-Liu/GPS-Gold-Point-Sniper.


[157] Gen2Balance: Generative Balancing for Long-Tailed Video Action Recognition cs.CVPDF

Prajwal Gatti, Simon Jenni, Fabian Caba Heilbron, Dima Damen

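TL;DR: This paper proposes Gen2Balance to address training-data imbalance in long-tailed video action recognition. It uses a text-to-video generative model, conditioned on action descriptions and training exemplars, to generate diverse clips, converting the imbalanced training set into a balanced combination of real and generated videos, and adopts a two-stage training strategy to mitigate domain shift.

Details

Motivation: To address the performance drop in long-tailed video action recognition caused by a highly imbalanced class distribution in the training data, where a few classes have abundant samples and most classes have very few.

Result: On UCF-LT and K100-LT, the long-tailed versions of standard benchmarks, accuracy improves over the strongest long-tailed learning baselines by 5.1% and 7.0%, respectively; on rare actions from the RareAct dataset (e.g., "cut keyboard"), accuracy improves by 31.9%. Experiments also show that balancing with only part of the synthetic data achieves 79% of the performance gain at 27% of the compute cost on K100-LT.

Insight: The core innovation is using a text-to-video generative model for data augmentation to balance the long-tailed distribution, together with a two-stage training strategy that makes effective use of synthetic data. Viewed objectively, the work combines generative AI with long-tailed learning and demonstrates scalability through a compute-efficiency analysis, offering a new approach to data imbalance.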

Abstract: We address the problem of training on long-tailed data for video action recognition. We propose to augment the training set using a text-to-video generative model, conditioned on diverse text prompts grounded in action profiles and training exemplars. Our approach, called Gen2Balance, converts an imbalanced training set into a balanced combination of real and generated video clips. To effectively learn from such data, we employ a two-stage training strategy that mitigates domain shift and yields significant improvements. We evaluate on long-tailed versions of standard benchmarks: UCF-101 (UCF-LT) and a 100-class subset of Kinetics (K100-LT) selected to prioritise temporally challenging actions. Gen2Balance improves accuracy over the strongest baselines for long-tailed learning by 5.1% and 7.0% on the respective datasets. On rare actions from the RareAct dataset (e.g., cut keyboard), Gen2Balance improves accuracy by 31.9%, demonstrating effectiveness for scarce actions. By varying the amount of synthetic data added, we show that partial balancing already achieves 79% of the performance gains at 27% of the compute cost on K100-LT, highlighting the practical scalability of Gen2Balance.
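The balancing step described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `synthetic_quota`, the `fraction` parameter, the choice of the largest class as the balancing target, and the class counts are all assumptions made for illustration.

```python
def synthetic_quota(real_counts, fraction=1.0):
    """Number of generated clips to add per class (illustrative assumption).

    real_counts: dict mapping class name -> number of real training clips.
    fraction: 1.0 fills every class up to the size of the largest class
              (full balancing); smaller values add only that share of the
              gap (partial balancing).
    """
    target = max(real_counts.values())
    return {
        name: round(fraction * (target - count))
        for name, count in real_counts.items()
    }


# Hypothetical long-tailed class counts, for illustration only.
counts = {"walk": 400, "juggle": 40, "cut keyboard": 4}
full = synthetic_quota(counts)                  # fill the whole gap
partial = synthetic_quota(counts, fraction=0.25)  # fill a quarter of the gap
```

With `fraction=1.0` every class ends up with the same number of real plus generated clips; lowering `fraction` corresponds to the partial balancing the abstract evaluates for its compute trade-off.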


[158] Curvature-Adaptive Consistency Flow Matching: Autonomous Trajectory Optimization via Reinforcement Learning cs.CVPDF

Songtao Tian, Guhan Chen, Bohan Li, Jingyi Ma, Zixiong Yu

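TL;DR: This paper proposes Curvature-Adaptive Consistency Flow Matching (CACFM), which optimizes trajectories autonomously through reinforcement learning to address the optimization bottlenecks of conventional consistency distillation at the boundary stages (initialization and final refinement). Combined with a novel Flow Distribution Matching Distillation objective, it achieves new state-of-the-art results on large-scale models such as FLUX and SDXL, mitigating structural deformities and preserving high-frequency details in extreme few-step regimes.

Details

Motivation: The work starts from the observation that consistency distillation exhibits an asymmetric difficulty profile (e.g., U-shaped) that static sampling cannot adapt to as learning requirements evolve, with the main optimization bottlenecks lying at the boundary stages (initialization and final refinement).

Result: The method achieves new state-of-the-art results on large-scale models such as FLUX and SDXL; in extreme few-step regimes it mitigates structural deformities and preserves high-frequency details, reaching what the authors describe as unprecedented visual fidelity.

Insight: The novelty lies in modeling distillation as a dynamic decision process: a lightweight reinforcement learning agent actively probes Probability Flow ODE trajectories and automatically constructs an efficiency-oriented curriculum without manual scheduling, combined with a Flow Distribution Matching Distillation objective.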

Abstract: Consistency distillation has significantly accelerated the inference of diffusion models. In this work, we reveal an intriguing asymmetry: while Logit-Normal sampling priors are highly efficacious for standard iterative generation, consistency distillation exhibits a distinctly different difficulty profile (e.g., U-shaped). We identify that the primary optimization bottlenecks reside at the boundary stages (initialization or final refinement) rather than the intermediate steps. To address the limitations of static sampling in accommodating evolving learning requirements, we propose Curvature-Adaptive Consistency Flow Matching (CACFM). By formulating distillation as a dynamic decision process, CACFM employs a lightweight Reinforcement Learning agent to actively probe Probability Flow ODE trajectories, automatically constructing an efficiency-oriented curriculum that prioritizes critical regions without manual scheduling. Integrated with a novel Flow Distribution Matching Distillation (DMD) objective, our approach achieves new state-of-the-art results on large-scale models such as FLUX and SDXL. It effectively mitigates structural deformities and preserves high-frequency details in extreme few-step regimes, achieving unprecedented visual fidelity.


[159] FlowDec: Temporal Conditional Flow Decorruptor for Robust Continuous Vision-Language Navigation cs.CVPDF

Yufei Zhang, Changhao Chen

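TL;DR: This paper proposes FlowDec, a new image restoration framework for large-model-based Vision-and-Language Navigation in Continuous Environments (VLN-CE). It combines a hybrid temporal conditioning strategy, which aligns the generative flow path with historical context, with action-centroid guided filtering, aiming to counter the severe drop in navigation performance caused by real-world visual corruptions.

Details

Motivation: Although large models have advanced VLN-CE, their performance degrades severely under real-world visual corruptions, a critical yet underexplored domain constraint.

Result: Extensive experiments show that FlowDec outperforms state-of-the-art decorruption methods in both navigation accuracy and generation latency, establishing an efficient paradigm for robust embodied navigation in unpredictable real-world conditions.

Insight: The novelty is an image restoration framework tailored to LM-based VLN-CE, built around a hybrid temporal conditioning strategy and an action-centroid guided mechanism that dynamically assesses and integrates outputs, improving alignment with historical context and output quality.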

Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions in unseen scenes. While Large Models (LMs) have advanced VLN-CE, their performance remains severely degraded by real-world visual corruptions, a critical yet underexplored domain constraint. We introduce Temporal Conditional Flow Decorruptor (FlowDec), a novel image restoration framework tailored for LM-based VLN-CE. FlowDec integrates a hybrid temporal conditioning strategy to align the generative flow path with historical context and employs action-centroid guided filtering to dynamically assess and integrate outputs. Extensive experiments demonstrate that FlowDec outperforms state-of-the-art decorruption methods in both navigation accuracy and generation latency. Our approach establishes a robust, efficient paradigm for resilient embodied navigation in unpredictable real-world conditions.


[160] MMGist: A Comprehensive Multimodal Benchmark for 2027 cs.CV | cs.AIPDF

Wenzhen Yuan, Jiacheng Ruan, Wutao Xiong, Chengping Zhao, Ting Liu

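TL;DR: This paper systematically analyzes 18 widely used vision-language benchmarks and identifies three major issues: insufficient visual dependency, performance saturation, and anomalous items. In response, the authors propose the MMGist benchmark, built with a three-stage filtering pipeline, covering seven capability dimensions and containing 7,262 items, to provide higher-quality multimodal evaluation.

Details

Motivation: In existing vision-language benchmarks, many items do not depend on visual cues, performance is close to saturation, and anomalous items undermine reliability, which limits how well these benchmarks measure multimodal understanding and discriminate between models.

Result: Experiments on 27 leading large vision-language models show that MMGist reduces evaluation items by 69% while preserving model rankings with high fidelity (Spearman ρ = 0.98) and improving cross-model discrimination by 78%.

Insight: The novelty is a systematic benchmark construction pipeline (text-ablation filtering, cross-model saturation filtering, and anomaly detection filtering) that improves visual dependency, discriminative power, and reliability. The study also finds that Visual Logic is a systematic weakness of current models, while knowledge-intensive dimensions are key to distinguishing closed-source from open-source models. High-quality evaluation should therefore prioritize these core properties over benchmark scale alone.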

Abstract: We conduct a systematic study of 18 widely used vision-language benchmarks and identify three major issues: 1) many items do not rely on visual cues and therefore fail to effectively measure multimodal understanding; 2) many items are already close to performance saturation for current LVLMs, which limits their discriminative power; 3) a small number of anomalous items affect the reliability of evaluation results. To this end, we propose MMGist, a curated benchmark that covers seven capability dimensions and contains 7,262 items. MMGist is constructed through a three-stage pipeline, which sequentially combines text-ablation filtering, cross-model saturation filtering, and anomaly detection filtering. We conduct extensive experiments on 27 leading LVLMs and compare MMGist with the raw pool of 23,250 items. The results show that MMGist preserves model rankings with high fidelity, with Spearman $ρ = 0.98$, while reducing evaluation items by 69% and improving cross-model discrimination by 78%. Further results indicate that Visual Logic remains a systematic weakness of current LVLMs, while knowledge-intensive dimensions such as Expert Knowledge remain important factors for distinguishing closed-source models from open-source models. These findings suggest that high-quality evaluation should prioritize visual dependency, discriminative power, and reliability, rather than simply pursuing benchmark scale.
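The ranking-fidelity figure quoted above is a Spearman rank correlation between model scores on the raw pool and on the filtered benchmark. A minimal standard-library sketch of that statistic follows; the helper names and the five model scores are hypothetical and are not data from the paper.

```python
def ranks(scores):
    """Rank values from 1 (lowest) upward, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical accuracies of five models on a raw pool and a filtered subset.
raw_pool = [71.2, 68.5, 80.1, 55.0, 62.3]
filtered = [48.9, 50.2, 66.7, 30.1, 41.5]
rho = spearman_rho(raw_pool, filtered)  # two adjacent models swap places
```

A value near 1 means the smaller benchmark orders the models almost exactly as the full pool does, even though absolute scores drop.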


[161] Curvature-aware 3D length estimation of greenhouse cucumbers using RGB-D imaging and cubic spline arc-length integration cs.CV | cs.ROPDF

Manveen Kaur, Rajmeet Singh, Saeed Mozaffri, Shahpour Alirezaee

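TL;DR: This paper presents CucumberVision, a non-contact framework for 3D length estimation of greenhouse cucumbers based on RGB-D imaging and cubic spline arc-length integration. It uses YOLO26n for instance segmentation and SAM for mask refinement, and evaluates five length estimation methods, of which the proposed medial arc spline method (M5) is significantly more accurate than the others.

Details

Motivation: Commercial greenhouse cucumber production is graded by fruit length, but manual measurement is infeasible at scale, so an accurate and scalable non-contact automated measurement method is needed.

Result: On a benchmark of 48 captures across three size categories, the proposed medial arc spline method (M5) achieves a mean absolute percentage error (MAPE) of 4.13%, significantly outperforming the other methods at the Bonferroni-corrected significance level and giving the best performance.

Insight: The main novelty is the first application of cubic spline fitting with arc-length integration to 3D length measurement of elongated vegetables. The work also identifies a key error source: using depth-stream rather than colour-stream intrinsics causes a 12-18% length underestimation.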

Abstract: Commercial greenhouse cucumber production is graded by fruit length, which drives harvest scheduling, labour allocation, and logistics. Manual measurement with thread or caliper is accurate but infeasible at commercial scale. This paper presents CucumberVision, a non-contact length estimation framework using an Intel RealSense D435 RGB-D camera. A YOLO26n instance segmentation model locates cucumbers, and SAM (ViT-B backbone) refines each detection to a pixel-precise mask. Five methods are evaluated under matched conditions: (M1) a dominant-axis skeleton scan-line baseline; (M2) PCA on the bounding-box depth point cloud; (M3) SAM mask with medial-axis skeletonisation; (M4) a hybrid keypoint-guided approach using a YOLO26-pose model predicting five anatomical landmarks (KP0–KP4) with piecewise 3D arc-length; and (M5) a novel medial arc spline method fitting a cubic spline through the 3D medial axis of the SAM mask and computing arc length by trapezoidal integration – the first such application to elongated vegetable measurement. All methods share five-frame burst depth averaging, colour-stream intrinsic alignment, and adaptive method selection with cascading fallbacks ensuring 100% coverage. A benchmark of 48 captures across seven cucumbers in three size categories (small ~8 cm, medium ~13 cm, large ~25 cm) with thread-based ground truth establishes a significant accuracy hierarchy: M1 (MAPE 9.68%) > M2 (5.31%) > M4 (5.51%) > M3 (5.82%) > M5 (4.13%). M5 significantly outperforms all competitors at Bonferroni-corrected alpha=0.0125. A secondary contribution is identifying a 12–18% length underestimation caused by using depth-stream rather than colour-stream intrinsics after rs.align(rs.stream.color) – an under-reported error source. The complete system is released open source and runs in real time on a single consumer-grade GPU.
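The idea behind M5 (fit a cubic spline through 3D medial-axis points, then integrate the speed along the curve with the trapezoidal rule) can be sketched as follows. The abstract does not say which cubic spline is used, so this sketch assumes a uniform Catmull-Rom spline as a stand-in; the function names and sample points are hypothetical and this is not the released CucumberVision code.

```python
import math


def hermite_derivative(p0, p1, m0, m1, t):
    """Derivative of the cubic Hermite segment from p0 to p1 at t in [0, 1]."""
    a = 6 * t * t - 6 * t        # d/dt of h00 = 2t^3 - 3t^2 + 1
    b = 3 * t * t - 4 * t + 1    # d/dt of h10 = t^3 - 2t^2 + t
    c = -a                       # d/dt of h01 = -2t^3 + 3t^2
    d = 3 * t * t - 2 * t        # d/dt of h11 = t^3 - t^2
    return [a * x0 + b * v0 + c * x1 + d * v1
            for x0, x1, v0, v1 in zip(p0, p1, m0, m1)]


def spline_arc_length(points, samples_per_segment=50):
    """Arc length of a Catmull-Rom cubic spline through ordered 3D points.

    The speed |r'(t)| is integrated over each segment with the trapezoidal rule.
    """
    n = len(points)
    # Catmull-Rom tangents: half the difference of the two neighbours; at the
    # ends the missing neighbour is replaced by the endpoint itself.
    tangents = []
    for i in range(n):
        prev_pt, next_pt = points[max(i - 1, 0)], points[min(i + 1, n - 1)]
        tangents.append([(q - p) / 2 for p, q in zip(prev_pt, next_pt)])
    total = 0.0
    h = 1.0 / samples_per_segment
    for i in range(n - 1):
        speeds = []
        for k in range(samples_per_segment + 1):
            deriv = hermite_derivative(points[i], points[i + 1],
                                       tangents[i], tangents[i + 1], k * h)
            speeds.append(math.sqrt(sum(c * c for c in deriv)))
        # Trapezoidal rule: interior samples weigh 1, the two end samples 1/2.
        total += h * (sum(speeds) - 0.5 * (speeds[0] + speeds[-1]))
    return total


# Hypothetical medial-axis points (metres) of a slightly curved fruit.
axis_points = [(0.00, 0.000, 0.50), (0.04, 0.005, 0.50),
               (0.08, 0.015, 0.51), (0.12, 0.030, 0.51)]
length_m = spline_arc_length(axis_points)
```

For a curved fruit the integrated arc length exceeds the straight-line distance between the two tips, which is the bias a curvature-aware method is meant to remove.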


[162] Surgical Anatomy Recognition with Context Learning using Foundation Representations cs.CVPDF

Ronald L. P. D. de Jong, Tim J. M. Jaspers, Raf A. H. Vervoort, Aron F. H. A. Bakker, Yiping Li

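TL;DR: This paper presents ATLAS-120k, a large-scale semantic segmentation dataset of surgical videos, and the ATLAS model framework, to advance anatomical structure recognition in minimally invasive surgery. ATLAS-120k contains more than 120,000 annotated frames covering 14 procedures; the ATLAS model uses foundation-model embeddings and lightweight temporal reasoning to incorporate contextual information such as procedure type and surgical phase, enabling real-time, accurate video semantic segmentation.

Details

Motivation: Accurate recognition of anatomical structures is essential for safe and effective minimally invasive surgery, yet the problem remains underexplored in surgical computer vision because annotated data are limited and existing methods are designed mainly for natural scenes.

Result: The proposed ATLAS model produces temporally consistent and accurate predictions on ATLAS-120k while remaining feasible in real time, providing a practical foundation for robust surgical scene understanding.

Insight: The novelty is a large-scale, high-quality surgical video segmentation dataset together with a model designed specifically for surgical anatomy recognition, which combines foundation-model embeddings with lightweight temporal reasoning to exploit surgical context rather than relying on conventional object tracking.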

Abstract: Accurate recognition of anatomical structures is essential for safe and effective minimally invasive surgery (MIS), yet it remains underexplored in surgical computer vision due to limited annotated data and methods tailored primarily to natural scenes. In this work, we present a combined dataset and model framework to advance anatomy-aware perception in MIS. First, we introduce ATLAS-120k, a large-scale clip-level semantic segmentation dataset comprising over 120,000 annotated frames from 100 surgical videos spanning 14 procedures and multiple modalities, including laparoscopic and robot-assisted surgery. The dataset captures substantial procedural variability and was created using a scalable annotation pipeline that integrates expert manual labeling, automated propagation, iterative refinement, and surgeon verification to ensure high-quality annotations. Second, we propose ATLAS (Anatomy Recognition with Context Learning using Foundation Representations), a video semantic segmentation model specifically designed for surgical anatomy recognition. Unlike conventional approaches that emphasize object tracking, ATLAS leverages foundation-model embeddings together with lightweight temporal reasoning to incorporate contextual cues such as procedure type, surgical phase, and short-term visual memory. This design enables temporally consistent and accurate predictions while maintaining real-time feasibility. Together, the dataset and model establish a practical foundation for robust surgical scene understanding and support the development of clinically applicable guidance systems for minimally invasive surgery. The models, dataset annotations and annotation platform are publicly available at: https://github.com/TimJaspers0801/ATLAS.


[163] DreamUV: Unwrap Artist-like UV by End-to-End Flow Matching cs.CV | cs.AIPDF

Quanyuan Ruan, Jiabao Lei, Xingyi Du, Xifeng Gao

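TL;DR: DreamUV is an end-to-end learning framework that formulates UV unwrapping as a generative flow matching problem, aiming to produce UV layouts that match artists' stylistic preferences. A mesh-conditioned transport process maps noise samples to a distribution of artist-like UV layouts, and a boundary-aware training strategy and a Model-in-the-Loop Finetuning scheme refine the results.

Details

Motivation: Traditional UV parameterization methods leave a gap between geometric distortion objectives and the stylistic preferences of professional artists, and struggle to model explicitly the structural patterns of artist-authored UVs (straightened seams, axis-aligned islands, and flexible interior deformation).

Result: Evaluated on a large-scale dataset of professionally authored UV layouts, DreamUV produces significantly straighter boundaries and tighter axis-aligned islands than classical and learning-based baselines while maintaining competitive distortion metrics; qualitative results and a user study confirm that its layouts meet practical production requirements.

Insight: The novelty is treating UV unwrapping as a generative flow matching problem and capturing artist style through end-to-end learning; boundary-aware training and Model-in-the-Loop Finetuning handle seam geometry and discretization errors, improving the practicality and stability of the generated UVs.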

Abstract: UV parameterization is a fundamental step in 3D content creation, yet producing production-ready UV layouts remains challenging due to the gap between geometric distortion objectives and the stylistic preferences of professional artists. While classical methods optimize handcrafted energy functions, artist-authored UVs exhibit structural patterns such as straightened seams, axis-aligned islands, and flexible interior deformation, properties that are difficult to explicitly formulate. In this work, we present DreamUV, an end-to-end learning framework that formulates UV unwrapping as a generative Flow Matching problem. Rather than predicting a single optimal parameterization, DreamUV learns a mesh-conditioned transport process that maps noise samples to a distribution of artist-like UV layouts. To reflect real-world authoring practices, we introduce a boundary-aware training strategy that prioritizes seam geometry, and a Model-in-the-Loop Finetuning (MITL) scheme that explicitly accounts for discretization errors during sampling and stabilizes transport dynamics under heterogeneous supervision. We evaluate DreamUV on a large-scale dataset of professionally authored UV layouts. Experiments demonstrate that our method produces significantly straighter boundaries and tighter axis-aligned islands than both classical and learning-based baselines, while maintaining competitive distortion metrics. Qualitative results and a user study with professional artists further confirm that DreamUV generates UV layouts that are not only valid, but aligned with practical production requirements.


[164] CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming cs.CVPDF

Ruixun Liu, Lingyu Zhang, Lanxuan Xue, Kaiyu Li, Bowen Fu

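TL;DR: This paper introduces CVSBench, a large-scale benchmark for evaluating cross-view spatial reasoning, comprising satellite-street image pairs, multiple tasks (cross-view VQA, cross-view grounding, and viewpoint identification), and extensive annotations. It finds that advanced vision-language models struggle to maintain object-level and layout consistency under drastic viewpoint changes, and that visual spatial imagination via a 3D scene imagination pipeline substantially improves cross-view reasoning.

Details

Motivation: Humans reason about scenes across viewpoints with ease, but it is unclear whether vision-language models have comparable cross-view spatial abilities. Satellite-street image pairs, with their complex contexts and extreme viewpoint variations, provide an ideal testbed.

Result: Extensive evaluations on CVSBench show that advanced VLMs struggle to maintain object-level and layout consistency under drastic viewpoint changes; language-only reasoning brings only marginal improvements, whereas integrating visual spatial imagination through a 3D scene imagination pipeline substantially improves cross-view reasoning.

Insight: The novelty is CVSBench, presented as the first large-scale comprehensive benchmark focused on cross-view spatial reasoning, and the finding that explicit visual-spatial representations (such as 3D scene imagination) are necessary for robust spatial cognition in VLMs, which points to a new direction for model design.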

Abstract: Humans can effortlessly reason about scenes across different viewpoints, yet it remains unclear whether Vision-Language Models (VLMs) possess similar cross-view spatial abilities. Satellite-street scene pairs, with their complex contexts and extreme viewpoint variations, provide an ideal testbed. Motivated by this, we introduce CVSBench, a large-scale benchmark for evaluating cross-view spatial reasoning through satellite-street pairs. This benchmark supports multiple tasks, including cross-view VQA, cross-view grounding, and viewpoint identification. CVSBench comprises 3,297 cross-view image groups with 9,468 object-level annotations and 40,679 question-answer (QA) pairs, enabling systematic and controlled evaluation of cross-view spatial reasoning. Extensive evaluations reveal that advanced VLMs struggle to maintain object-level and layout consistency under drastic viewpoint changes. To bridge this gap towards human-like spatial cognition, we investigate two categories of approaches: spatially grounded reasoning and the incorporation of cognitive map inputs. Our findings demonstrate that language-only reasoning yields marginal improvements, while incorporating visual spatial imagination via a 3D scene imagination pipeline substantially improves cross-view reasoning. These results highlight the necessity of explicit visual-spatial representations for robust spatial cognition in VLMs. Our data and code are released at https://huggingface.co/datasets/zlyzlyzly/CVSBench.


[165] Human and AI collaboration for pulmonary nodule segmentation cs.CV | cs.AI | cs.HCPDF

Hongqiao Dong, Wenhao Chi, Ruobing Liang, Xiaokui Yang, Wenhua Liang

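TL;DR: This paper presents Hi-Seg, a human-AI collaborative framework for pulmonary nodule segmentation built on the Segment Anything Model (SAM). Humans iteratively refine prompts to guide SAM toward high-quality segmentation masks, and the framework undergoes large-scale external validation on a multi-center CT dataset of 1,179 patients. The results show that Hi-Seg outperforms existing deep learning methods and SAM variants in segmentation accuracy and annotation efficiency.

Details

Motivation: Medical expert annotators are scarce, and relying entirely on AI can be misleading, so methods are needed in which humans (particularly junior medical trainees or even non-medical personnel) collaborate with AI to achieve robust medical segmentation. Although SAM shows promise for general image segmentation, its performance in human-AI collaboration on specialized medical tasks has not been thoroughly evaluated.

Result: Across all annotator groups, Hi-Seg achieves a mean Dice score of almost 85%, outperforming five state-of-the-art deep learning models (by 10-22%) and 13 SAM variants (by 1-29%). It improves segmentation accuracy and reduces annotation time for medical annotators, and briefly trained non-medical annotators reach performance comparable to that of a junior medical student.

Insight: The novelty is a SAM-based human-in-the-loop segmentation framework (Hi-Seg) in which humans iteratively refine prompts through trial-and-error learning and semantic reasoning, improving segmentation on a specialized medical task. Its main value is showing that human-AI collaboration can reduce clinician workload, enable scalable crowdsourced annotation, and support the safe and efficient integration of foundation models into routine clinical practice.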

Abstract: Medical expert annotators are scarce, and blind reliance on artificial intelligence (AI) can be misleading, motivating approaches in which humans, particularly junior medical trainees or even non-medical personnel, collaborate with AI to achieve robust medical segmentation. Although the Segment Anything Model (SAM) shows promise for general-purpose image segmentation, its performance in human-AI collaboration for specialized medical tasks has not been thoroughly evaluated. Here we present Hi-Seg, a human-in-the-loop segmentation framework for pulmonary nodules built on SAM. Humans iteratively refine prompts through trial-and-error learning and semantic reasoning, progressively guiding SAM toward higher-quality masks. Using chest CT scans from 1,179 patients across 12 centers, we conducted the first large-scale external validation of collaborative human-SAM segmentation. Across all annotator groups, Hi-Seg achieved a mean Dice score of almost 85%, outperforming five state-of-the-art deep learning models by 10-22% and 13 SAM variants by 1-29%. Hi-Seg improved segmentation accuracy while reducing annotation time for medical annotators, and briefly trained non-medical annotators achieved performance comparable to that of the junior medical student. These findings suggest that human-in-the-loop segmentation can reduce clinician workload, enable scalable crowdsourced annotation, and transform clinical workflows by facilitating the safe and efficient integration of foundation models into routine clinical practice.
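The Dice score reported above measures overlap between a predicted mask and a reference mask. A minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), is given below; the function name, the empty-mask convention, and the toy masks are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks of equal shape."""
    inter = size_pred = size_truth = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            size_pred += 1 if p else 0
            size_truth += 1 if t else 0
    if size_pred + size_truth == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2 * inter / (size_pred + size_truth)


# Toy 3x4 masks (1 = nodule pixel), for illustration only.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]
truth = [[0, 1, 1, 0],
         [0, 1, 1, 1],
         [0, 0, 0, 0]]
score = dice_score(pred, truth)  # 4 shared pixels, 5 pixels in each mask
```

Because small nodules cover few pixels, a single misplaced pixel moves the Dice score noticeably, which is one reason iterative prompt refinement can matter for this task.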


[166] FetSelect: Task-Specific Architectures and Self-Supervised Learning for Automated Fetal Ultrasound Frame Selection cs.CVPDF

Mahmood Alzubaidi, Raden Muaz, Uzair Shah, Mohammed Ammar, Khalid Alyafei

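TL;DR: This paper presents FetSelect, a framework for task-specific frame selection in fetal ultrasound. It combines a frozen vision foundation backbone, a task-gated classification head, and a detection-derived quality head, and adapts the backbone through self-supervised pretraining on unlabeled data. Its effectiveness is validated on an expert-labeled dataset covering four fetal biometry targets (CRL, NT, NB, and Scalebar), and task-specific discrimination is shown on external clinical data.

Details

Motivation: To address automated frame selection for fetal ultrasound biometry: existing methods mostly target generic quality assessment or assume that suitable frames are already available, and a dedicated task-specific framework is lacking.

Result: On a test set of 974 frames, FetSelect achieves a mean AUROC of 0.956 and a mean correlation of 0.818 with expert quality annotations; ablations show that hybrid fusion outperforms single-head variants and that ultrasound-specific self-supervision yields consistent gains; task-specific discrimination is verified on external clinical videos and 509 CRL images.

Insight: Novel aspects include the task-specific architecture (combining classification and quality heads), BYOL-based self-supervised pretraining to adapt to the ultrasound domain, and a hybrid fusion mechanism, offering a reusable end-to-end solution for frame selection in medical image analysis.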

Abstract: Automated frame selection for fetal biometry remains under-addressed, with most prior work targeting generic quality assessment or downstream measurement pipelines that assume suitable frames are available. We introduce FetSelect, a task-specific framework that pairs a frozen vision foundation backbone with a hybrid multi-head design: a Task-Gated classification head and a Detection-derived quality head combined via learned fusion. We curate 6,486 expert-labeled frames across four targets: Crown-Rump Length (CRL), Nuchal Translucency (NT), Nasal Bone (NB), and Scalebar, and adapt the backbone with BYOL pretraining on 19,019 unlabeled images. On a held-out test set (974 frames), FetSelect achieves mean AUROC 0.956 and mean correlation 0.818 with expert quality annotations. Ablations confirm that hybrid fusion surpasses single-head variants, and ultrasound-specific self-supervision yields consistent gains. Evaluation on external clinical videos and 509 external CRL images demonstrates task-specific discrimination.


[167] Benchmarking Vision-Language Models for Microscopic Plant Image Understanding cs.CVPDF

Tianqi Wei, Xin Yu, Zhi Chen, Scott Chapman, Zi Huang

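TL;DR: This paper presents PlantMicro, a comprehensive benchmark for evaluating how well vision-language models understand microscopic plant images. It integrates more than 5,000 images and 9,000 VQA pairs spanning diverse hosts, biological domains, and imaging modalities. Experiments show that current VLMs fall well short in fine-grained recognition and biological reasoning.

Details

Motivation: Existing VLM benchmarks focus mainly on macroscopic plant imagery while the microscopic domain is underexplored, so a dedicated benchmark is needed to fill this gap.

Result: Current VLMs perform poorly on PlantMicro; for example, GPT-5 reaches only 34.93% accuracy on the pathogen classification task, modestly above the random-guessing baseline.

Insight: The novelty is a VLM benchmark, presented as the first, that focuses on microscopic plant images; its diverse images and systematically designed VQA tasks expose the limitations of VLMs in this domain and provide a standardized basis for evaluating future models.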

Abstract: Microscopic imaging provides essential visual evidence for studying plant biology and pathology at the cellular and subcellular levels. However, existing benchmarks on vision-language models primarily focus on macroscopic plant imagery, while the microscopic domain remains underexplored. To address this gap, we present PlantMicro, a comprehensive benchmark for evaluating vision-language models (VLMs) in microscopic plant imagery. PlantMicro integrates more than 5,000 images collected across diverse hosts, biological domains, and imaging modalities. Building on this diversity, we design a set of complementary tasks that capture different facets of microscopic image understanding. To support these tasks, we construct over 9,000 VQA pairs that systematically evaluate the capabilities of VLMs. Experiments on PlantMicro show that current VLMs struggle with fine-grained recognition and biologically grounded reasoning. For example, GPT-5 achieves 34.93% accuracy on the pathogen classification task, which is only modestly above the random-guessing baseline. The results highlight a significant gap in current VLMs’ ability to comprehend plant microscopic images. PlantMicro provides a standardized foundation for advancing VLMs toward reliable and comprehensive microscopy-level plant understanding.


[168] NegAS: Negative Label Guided Attention and Scoring for Out-of-Distribution Object Detection with Vision-Language Models cs.CVPDF

Yingjie Zhang, Shuai Li, Peng Wang

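TL;DR: This paper proposes NegAS, presented as the first OOD detection method targeting object detectors based on vision-language models (VLMs). It introduces a negative label guided attention module (NegA) that directs attention to potential OOD background regions and a new sigmoid-based OOD scoring function (NegS), substantially improving OOD detection while maintaining ID detection accuracy.

Details

Motivation: Prior work focuses mainly on uni-modal detectors or VLM-based classifiers, while the potential of VLM-based object detectors in OOD scenarios remains underexplored. The paper addresses two challenges specific to VLM detectors: text-guided attention handles background inadequately, and their multi-label outputs are incompatible with softmax-based OOD scores.

Result: Extensive experiments on COCO and OpenImages show that the method substantially improves OOD detection, for example reducing FPR95 by 11.4% on COCO and 25.5% on OpenImages relative to the baseline model, while maintaining ID detection accuracy. It yields significant improvements on different VLM detectors, including YOLO-World and Grounding DINO, demonstrating its generalizability.

Insight: The core innovation is using LLM-generated negative labels that are visually similar but semantically different to guide attention and thereby separate ID from OOD instances, together with a sigmoid-based scoring function compatible with VLM probabilistic outputs. This offers a new route to applying VLMs to robust object detection in safety-critical settings.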

Abstract: Out-of-Distribution (OOD) detection is essential for ensuring the robustness and reliability of object detection systems deployed in safety-critical applications. While prior research has mainly focused on uni-modal detectors or vision-language model (VLM) based classifiers, the potential of VLM-based object detectors in OOD scenarios remains underexplored. In this work, we take the first step toward building OOD object detection methods upon VLMs. We identify two challenges specific to VLM detectors: (i) their text-guided attention enhances foreground with ID labels but treats background uniformly, leaving potential OOD regions unexploited for separating in-distribution (ID) from OOD instances; and (ii) their sigmoid-based multi-label outputs are incompatible with softmax-based OOD scores, calling for scoring functions consistent with VLM probabilistic outputs. Hence, we introduce Negative Label Guided Attention and Scoring (NegAS). To address (i), we propose a negative label guided attention module (NegA), where LLM-generated, visually-similar but semantically-different negative labels are used to guide attention toward potential OOD background regions. To address (ii), we introduce a novel sigmoid-based OOD scoring function (NegS) that leverages both ID and negative labels, producing strong responses for ID instances and suppressed responses for OOD ones. Extensive experiments demonstrate that our approach improves OOD detection performance by a large margin while maintaining ID accuracy, e.g., reducing the FPR95 by 11.4% on the COCO dataset and 25.5% on the OpenImages dataset compared to the baseline model. While initially designed for dense VLM detectors like YOLO-World, we successfully adapt NegAS to Grounding DINO, a query-based VLM transformer and achieve significant improvements, demonstrating the generalizability of our framework.
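FPR95, the metric quoted above, is the false positive rate on OOD samples at the score threshold that still accepts 95% of ID samples. The sketch below shows that metric only; it does not reproduce the paper's NegS scoring function. It assumes that a higher score means "more ID-like", and the function name and scores are hypothetical.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: share of OOD samples accepted as ID at the threshold that
    accepts at least 95% of the ID samples (higher score = more ID-like)."""
    ranked = sorted(id_scores, reverse=True)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding issues.
    keep = (95 * len(ranked) + 99) // 100
    threshold = ranked[keep - 1]
    false_pos = sum(1 for s in ood_scores if s >= threshold)
    return false_pos / len(ood_scores)


# Hypothetical detector scores, for illustration only.
id_scores = [i / 20 for i in range(1, 21)]   # 0.05, 0.10, ..., 1.00
ood_scores = [0.02, 0.07, 0.12, 0.50]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)  # threshold lands at 0.10
```

Lower is better: a reduction in FPR95 means fewer OOD objects slip through when the detector is tuned to keep nearly all ID objects.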


[169] PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models cs.CVPDF

Xianghui Wang, Feng Chen, Wenbo Zhang, Hua Yan, Zixuan Wang

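TL;DR: This paper proposes PolicyTrim, a reinforcement learning-based post-training framework that improves the intrinsic policy efficiency of Vision-Language-Action (VLA) models. It extends the reliable action chunk length through a dynamic exploration strategy and reduces redundant physical steps through a redundancy-aware reward, lowering the number of forward inference calls during execution and delivering a substantial end-to-end deployment speedup.

Details

Motivation: Real-world deployment of VLA models in robotic manipulation is often bottlenecked by execution efficiency. Existing work focuses mainly on compute-centric efficiency (such as per-step inference latency), while the intrinsic policy efficiency of these models (the effective executable length of predicted action chunks and the total physical steps needed to complete a task) remains underexplored. Current VLA policies suffer from unreliable planning and action redundancy, leading to degraded predictions and redundant steps.

Result: Extensive experiments across three benchmarks and three VLA models show that PolicyTrim improves action chunk utilization by 3x, reduces physical execution steps by 51.4%, and delivers up to a 5.83x end-to-end deployment speedup without compromising task success rates.

Insight: The novelty is a systematic focus on the intrinsic policy efficiency of VLA models rather than compute efficiency alone. The reinforcement learning post-training framework combines a dynamic exploration reward (extending reliable action chunks) with a redundancy-aware reward (reducing physical steps), addressing unreliable planning and action redundancy and offering a new direction for real-world VLA deployment.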

Abstract: Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors, namely the effective executable length of predicted action chunks and the total physical steps required to complete a task. These two factors jointly determine the total number of forward inference calls during execution. We observe that current VLA policies struggle with planning unreliability and action redundancy, suffering from severe prediction degradation at the tail of action chunks and tending to generate unnecessarily redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps. For reliable chunk extension, we employ a dynamic exploration strategy that explicitly rewards the successful completion of longer executable lengths, progressively pushing the trustworthy prediction horizon to its empirical limit. For step efficiency, we design a redundancy-aware reward that directly favors successful task completions with fewer steps while penalizing unreproducible shortcuts, effectively eliminating redundant physical actions. Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization by 3$\times$ and reduces physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83$\times$ end-to-end deployment speedup without compromising task success rates.
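The abstract states that executable chunk length and total physical steps jointly determine the number of forward inference calls. A simple way to see this, assuming the policy is queried once per executed chunk, is sketched below. The function name and all numbers are hypothetical and are not measurements from the paper.

```python
import math


def forward_calls(physical_steps, executed_chunk_len):
    """Forward passes needed when each pass yields `executed_chunk_len`
    actions that are executed before the policy is queried again
    (an assumed execution model, for illustration)."""
    return math.ceil(physical_steps / executed_chunk_len)


# Hypothetical episode: fewer steps and longer trusted chunks both cut calls.
baseline = forward_calls(physical_steps=240, executed_chunk_len=4)
trimmed = forward_calls(physical_steps=120, executed_chunk_len=12)
call_reduction = baseline / trimmed
```

The two effects multiply: halving the steps and tripling the executed chunk length reduces calls sixfold in this toy setting, independent of how fast a single forward pass is.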


[170] MAPS: Multi-Anchor Projection Similarity for Joint Vision-Language Geo-Localization cs.CVPDF

Yutong Hu, Siyuan Tan, Shaocheng Yan, Pengcheng Shi, Qingwu Hu

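TL;DR: This paper proposes a unified framework for joint vision-language geo-localization (VLGL) that models joint image-text queries as a multi-anchor geometric alignment problem and introduces the Multi-Anchor Projection Similarity (MAPS) metric. MAPS constructs an anchor plane from the query features in a high-dimensional space and measures similarity by the projection length of the target feature onto that plane, better capturing the geometric consistency between the target feature and the joint query subspace.

Details

Motivation: Existing geo-localization models are built mainly on point-to-point alignment, which is insufficient for joint queries in which visual and textual cues together define a semantic subspace; a method that integrates multimodal information for joint reasoning is therefore needed.

Result: The proposed framework, similarity metric, and training objective achieve state-of-the-art performance on VLGL.

Insight: The novelty is formalizing joint queries as a multi-anchor geometric alignment problem and proposing MAPS as a replacement for conventional cosine similarity: it evaluates similarity by projection onto the joint query subspace, which better matches the semantic structure of multimodal queries. A MAPS-based contrastive loss keeps the learned representation consistent with this geometry.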

Abstract: Humans localize places by integrating perceptual cues from vision with semantic reasoning from language, forming a scene understanding that is both intuitive and structured. Although existing geo-localization models have made substantial progress in cross-view and cross-modal settings, they are largely built upon point-to-point alignment, which is insufficient for joint vision-language queries. In such queries, visual and textual cues do not simply act as independent references, but jointly define a semantic subspace for locating the target. In this paper, we formulate vision-language geo-localization (VLGL) with joint image-text queries as a multi-anchor geometric alignment problem and propose a unified framework for this setting. To realize this formulation, we propose Multi-Anchor Projection Similarity (MAPS), a new metric which constructs an anchor plane from visual and textual query features in a high-dimensional space and measures similarity by the projection length of the target feature onto this plane. Unlike cosine similarity which evaluates isolated pairwise relations, MAPS captures the geometric consistency between the target feature and the joint query subspace, providing a more discriminative ranking criterion during retrieval. To make the learned representation consistent with this geometry, we further introduce a MAPS-based contrastive loss that drives target features toward the corresponding anchor plane. The proposed framework, similarity metric, and training objective jointly yield state-of-the-art performance in VLGL.
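The projection idea can be made concrete with a small sketch: build an orthonormal basis of the plane spanned by the visual and textual query features, then take the length of the target's projection onto that plane. The exact MAPS formulation is not given in the abstract, so the normalization choices, the function names, and the toy vectors below are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def maps_similarity(target, visual_anchor, text_anchor):
    """Length of the projection of the unit-normalized target onto the
    plane spanned by the two anchors (assumes the anchors are not parallel)."""
    e1 = normalize(visual_anchor)
    # Gram-Schmidt: remove the e1 component from the text anchor.
    residual = [t - dot(text_anchor, e1) * a for t, a in zip(text_anchor, e1)]
    e2 = normalize(residual)
    x = normalize(target)
    return math.sqrt(dot(x, e1) ** 2 + dot(x, e2) ** 2)


# Toy 3D features: the anchors span the xy-plane.
visual = [1.0, 0.0, 0.0]
text = [1.0, 1.0, 0.0]
in_plane = maps_similarity([0.0, 1.0, 0.0], visual, text)
off_plane = maps_similarity([0.0, 0.0, 1.0], visual, text)
```

In this toy case the in-plane target has cosine similarity 0 with the visual anchor and about 0.71 with the text anchor, yet its projection length is 1 because it lies fully inside the joint query subspace; the off-plane target scores 0.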


[171] Training-Free Semantic Correction for Autoregressive Visual Models cs.CV | cs.AI | cs.CL | cs.MM

Junhao Chen, Chanyu Zhu, Zheqi Lv, Keting Yin, Shengyu Zhang

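TL;DR: This paper proposes Gazer, a training-free semantic correction framework that improves the generation quality of autoregressive visual models (AVMs). Gazer integrates feedback from a multimodal large language model into the AVM sampling loop to diagnose semantic errors and correct the generation trajectory, improving semantic alignment with the target prompt and compositional accuracy.

Details

Motivation: In image and video synthesis, the multi-scale discrete generation process of AVMs makes semantic errors hard to identify and correct, and existing training-free methods ignore intermediate generation states, so errors accumulate into the final output.

Result: On compositional image and video benchmarks, Gazer improves semantic alignment and compositional accuracy across multiple AVMs without additional training.

Insight: The novelty is integrating multimodal LLM feedback into the AVM sampling loop. Two cooperating stages, Reflective Diagnosis and Semantic Correction, diagnose and rectify semantic errors in intermediate generation states, making Gazer a training-free, in-generation correction method.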

Abstract: Autoregressive visual models (AVMs) based on next-scale prediction have emerged as a prominent paradigm for image and video synthesis. However, decomposing the generation process into discrete scales with varying granularities in AVM makes semantic errors difficult to identify and correct, thereby undermining the quality of the final output. Prior efforts to enhance AVM can be categorized into training-based and training-free approaches. Although training-based efforts to enhance AVM generation quality come at substantial computational cost, existing training-free methods neglect intermediate generation states, leaving semantic errors undiagnosed and allowing them to accumulate into the final output. In this paper, we focus on training-free paradigms and propose Gazer, a framework that integrates multimodal large language model feedback into the AVM sampling loop for in-generation semantic correction. Concretely, Gazer operates via two cooperating stages: the Reflective Diagnosis stage diagnoses semantic errors from intermediate states, while the Semantic Correction stage rewinds and rectifies the generation trajectory to realign with the target prompt. Experiments on compositional image and video benchmarks demonstrate that Gazer improves semantic alignment and compositional accuracy across multiple AVMs without additional training.


[172] The Power of Light: Improving Synthetic-to-Real Domain Adaptation through Physically-Based Indirect Illumination cs.CV | cs.AI

Hooman Tavakoli Ghinani, Tatjana Legler, Martin Ruskowski

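TL;DR: This paper systematically studies how lighting configuration and background complexity affect object detection performance in synthetic-to-real domain adaptation. It proposes SmartSDG, an automated synthetic data generation pipeline based on physically based rendering, and builds ILLUM_INTRUCK, a new industrial benchmark dataset. Across 18 controlled experiments, complex indirect lighting configurations combined with domain-relevant background variability significantly increase visual cue richness, reduce the domain gap, and accelerate model convergence.

Details

Motivation: To address the synthetic-to-real domain gap caused by rendering variables such as lighting and background in synthetic data generation, and thereby improve the robustness of object detection in industrial automation.

Result: Experiments with the YOLOv12 framework show that physically based indirect lighting configurations preserve crucial surface textures, reduce false positives, and speed up convergence compared with conventional direct-light synthetic data. Effectiveness is validated on the ILLUM_INTRUCK dataset.

Insight: The contribution is a set of actionable virtual scene design guidelines that emphasize the importance of indirect lighting and background variability for domain adaptation. Together with the automated SmartSDG pipeline and the ILLUM_INTRUCK benchmark, it offers a systematic approach to optimizing synthetic data for industrial applications.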

Abstract: While synthetic data generation resolves the manual labeling bottleneck in computer vision, minimizing the syn-to-real domain gap requires optimizing rendering variables. This paper presents a systematic study analyzing the impact of lighting configurations and background complexity on object detection performance. We introduce SmartSDG, an automated, reproducible pipeline built on NVIDIA Isaac Sim using Physically-Based Shading (PBS), alongside ILLUM_INTRUCK, a new multi-object industrial benchmark dataset. Through 18 controlled experiments utilizing a state-of-the-art YOLOv12 framework, we demonstrate that complex, indirect lighting configurations paired with domain-relevant background variability significantly increase visual cue richness. Our quantitative findings show that avoiding direct specular peaks preserves crucial surface textures, mitigates the domain gap, reduces false positives, and accelerates model convergence compared to using conventional direct-light synthetic data. Ultimately, we provide actionable virtual scene design guidelines to maximize object detection robustness in industrial automation.


[173] Automated sign detection across the Electronic Babylonian Library: A large-scale dataset and end-to-end cuneiform OCR pipeline cs.CV | cs.CL

Wentao Che, Esteban Garcés Arias, Asim Niaz, Andreas Bender, Enrique Jiménez

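TL;DR: This paper presents an automated sign detection system for cuneiform tablets. Using the largest annotated dataset to date, it builds an end-to-end OCR pipeline on a Deformable Detection Transformer (DETR) and applies it to the large-scale Electronic Babylonian Library (eBL) corpus.

Details

Motivation: Reading cuneiform tablets is extremely difficult, and a large share of excavated tablets remains unanalysed. Computer vision offers a route to decipherment but requires large, densely annotated datasets, which existing resources do not provide.

Result: Under two class granularities (173 and 106 classes), the method improves on prior work by up to 28-37% on COCO-style detection metrics. At inference, it is applied to 87,668 tablet fragments from the eBL corpus, producing nearly 2.9 million sign detections.

Insight: The novelty is integrating automatic tablet-side extraction, heuristic line grouping, and n-gram-based textual similarity evaluation into the visual sign detection pipeline, yielding a scalable, interpretable foundation for corpus-wide analysis that does not depend on linguistic priors.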

Abstract: Learning to read cuneiform tablets is an extremely demanding task; consequently, of the roughly half million excavated tablets, only a small fraction has been analysed by Assyriologists. Computer vision offers a promising avenue for decipherment but requires large, densely annotated datasets. To address this limitation, the largest annotated cuneiform sign dataset to date is used, and a Deformable Detection Transformer (DETR)-based object detection model is evaluated under two class granularities of 173 and 106 classes. The proposed system integrates automatic tablet-side extraction, heuristic line grouping, and n-gram-based textual similarity evaluation to bridge visual sign detection and textual structure, and achieves consistent improvements of up to 28-37% over prior work on COCO-style detection metrics. At inference, the method is applied to 87,668 tablet fragments from the Electronic Babylonian Library (eBL) corpus, producing nearly 2.9 million sign detections. Although the approach operates without linguistic priors and remains sensitive to tablet damage and layout variability, it provides a scalable and interpretable foundation for corpus-wide cuneiform analysis and supports future integration with multimodal and linguistic modelling frameworks.


[174] OmniSpace: Efficient Geometry Awareness for Autonomous Vehicles MLLMs cs.CV

Hao Vo, Phu Loc Nguyen, Khoa Vo, Sieu Tran, Duc Minh Nguyen

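TL;DR: This paper proposes OmniSpace, a geometry-aware spatial reasoning method for autonomous-driving multimodal large language models. Through a Camera Pose Injector, a Multi-view Epipolar Attention module, and a 3D Geometric Distillation objective, it learns geometric knowledge from purely 2D observations without external 3D models, improving performance on planning, risk detection, and generalization tasks.

Details

Motivation: Existing geometry-aware MLLMs rely on auxiliary 3D models at inference time, which complicates the pipeline and risks cascading failures, while current MLLMs are bottlenecked by weak cross-view correspondence and depth estimation. A purely 2D geometry-aware solution that needs no external 3D model is therefore required.

Result: OmniSpace surpasses existing methods on planning benchmarks (nuScenes, Bench2Drive), nuInstruct risk detection, Omnidrive language tasks, and DriveBench generalization tests.

Insight: The novelty is embedding geometric knowledge directly into the model via camera pose injection, multi-view epipolar attention, and 3D geometric distillation. This achieves end-to-end geometry awareness, removes the dependence on external 3D models, and improves efficiency and robustness.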

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance on 2D visual tasks, yet enhancing their spatial intelligence for real-world applications such as Autonomous Vehicles (AV) remains an open challenge. Existing geometry-aware MLLMs typically rely on auxiliary 3D models at inference time, introducing pipeline complexity and the risk of cascading failures. In this paper, we present OmniSpace, a simple yet effective plug-and-play paradigm for geometry-aware spatial reasoning from purely 2D observations. Motivated by our finding that current MLLMs are bottlenecked by weak cross-view correspondence and depth estimation, OmniSpace introduces a Camera Pose Injector, a Multi-view Epipolar Attention module, and a 3D Geometric Distillation objective that jointly address these two limitations by transferring geometric knowledge into the model. Extensive experiments show that OmniSpace surpasses existing methods on planning benchmarks (nuScenes, Bench2Drive), risk detection (nuInstruct), language (Omnidrive), and generalization (DriveBench).
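
The abstract does not describe how the Multi-view Epipolar Attention module is built. As background only, the sketch below shows the standard epipolar constraint that such a module could exploit: a pixel in one view maps to a line in another view, and attention could be restricted to keys near that line. The masking rule, the threshold `tau`, and all function names are assumptions for illustration, not the authors' design.

```python
import math


def epipolar_line(F, point):
    """Line (a, b, c) in view 2 for pixel (u, v) in view 1: l' = F [u, v, 1]^T."""
    x = (point[0], point[1], 1.0)
    return tuple(sum(F[i][j] * x[j] for j in range(3)) for i in range(3))


def point_line_distance(line, point):
    a, b, c = line
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * point[0] + b * point[1] + c) / norm


def epipolar_mask(F, query, keys, tau=1.5):
    """True for key pixels within tau pixels of the query's epipolar line."""
    line = epipolar_line(F, query)
    return [point_line_distance(line, k) <= tau for k in keys]


# Fundamental matrix for a pure sideways translation with identity intrinsics,
# i.e. the cross-product matrix of t = (1, 0, 0). Epipolar lines are horizontal.
F_STEREO = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
```

For the rectified-stereo matrix above, a query at row 20 only admits keys whose row is within `tau` of 20, regardless of column.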


[175] MapReason-OSM: Can Vision-Language Models Make Graph-Verifiable Mobility Decisions from Street Maps? cs.CV

Srinivas Venkatanarayanan, Clement Pakkam Isaac

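TL;DR: This paper presents MapReason-OSM, a benchmark and evaluation harness for assessing whether vision-language models (VLMs) can make mobility decisions from street maps that are verifiable against the underlying road graph. The benchmark comprises 12 tasks covering routing, facility location, and visual disambiguation, with 6,000 instances in total. Current VLMs handle simple map reading and routing but perform poorly on tasks that require graph-cost reasoning, and they are often inconsistent across zoom scales.

Details

Motivation: Most map benchmarks grade free-text or multiple-choice answers that cannot be verified against the underlying road graph. A new benchmark is needed to evaluate whether VLMs can make verifiable decisions that respect road-network constraints in applications such as logistics, delivery, and navigation.

Result: Across seven VLMs, models handle simple map reading and routing but are near chance on tasks requiring graph-cost reasoning, such as single-facility pin placement, and are frequently inconsistent across zoom scales.

Insight: The novelty is a benchmark built on rendered OpenStreetMap panels with hidden street graphs and exact oracles, which allows structured decisions to be scored precisely for validity, legality, optimality, constraint satisfaction, and cross-zoom consistency. This gives a more rigorous framework for evaluating the graph reasoning and spatial decision-making of VLMs.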

Abstract: Vision-language models (VLMs) are increasingly used to read maps for logistics, delivery, and accessible navigation, where the output is an actionable decision (a route, a pin, a parking choice) that must respect the road network. Yet most map benchmarks grade free-text or multiple-choice answers that cannot be verified against the underlying graph. We present MapReason-OSM, a benchmark and evaluation harness for graph-verifiable mobility decisions on self-rendered OpenStreetMap panels. We render fixed-style maps for ten U.S. downtowns at two aligned zoom scales, overlay a consistent marker grammar, and pair each panel with a hidden street graph and exact oracles, yielding 6,000 instances (12,000 panels across the two zooms) over 12 routing, facility-location, and visual-disambiguation tasks. Models return structured decisions that we snap back to the graph and score for validity, legality, optimality, and constraint satisfaction, plus cross-zoom consistency. Across seven VLMs, models read maps and route simply but fail at graph-cost reasoning (single-facility pin placement is near chance even for frontier reasoning models), and are frequently scale-inconsistent. We release the benchmark, harness, and deterministic generator.
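
To make "graph-verifiable" concrete, here is a minimal sketch of how a returned route might be checked against a hidden street graph and compared with a shortest-path oracle. The graph encoding, the `score_route` name, and the optimality ratio are assumptions for illustration; the benchmark's released harness may score decisions differently.

```python
import heapq


def shortest_cost(graph, src, dst):
    """Dijkstra over a dict-of-dicts graph with positive edge costs."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


def score_route(graph, route):
    """Validity: every hop is a directed edge. Optimality: oracle cost / route cost."""
    if len(route) < 2:
        return {"valid": False, "optimality": 0.0}
    cost = 0.0
    for u, v in zip(route, route[1:]):
        if v not in graph.get(u, {}):
            return {"valid": False, "optimality": 0.0}
        cost += graph[u][v]
    best = shortest_cost(graph, route[0], route[-1])
    return {"valid": True, "optimality": best / cost}


# Hypothetical three-node street graph with directed, weighted edges.
GRAPH = {"A": {"B": 1.0, "C": 5.0}, "B": {"C": 1.0}, "C": {}}
```

Under this scoring, the direct route `A -> C` is valid but has optimality 0.4, because the oracle path through `B` costs 2 instead of 5, while a route that travels against a one-way edge is rejected outright.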


[176] DR-Mamba: Automatic Inference-Time Domain Adaptation for Document Image Binarization via Sample-Conditioned Detail-Background Suppression cs.CV

Sheng-Wei Chan, Jen-Shiun Chiang

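TL;DR: This paper proposes DR-Mamba, a sample-conditioned detail-background suppression framework for document image binarization that performs automatic domain adaptation at inference time. Input-dependent gates adapt the model to each input document within a single forward pass, with no target-domain labels, fine-tuning, or test-time parameter updates. Evaluated under a leave-one-year-out protocol on DIBCO-style benchmarks, DR-Mamba shows strong cross-domain robustness on severely degraded domains.

Details

Motivation: Degraded document image binarization is sensitive to domain shifts such as paper aging, bleed-through, stains, shadows, and uneven illumination, and the foreground-background separation of existing learning-based methods becomes unstable on unseen degradation domains.

Result: Under a leave-one-year-out protocol on DIBCO-style benchmarks, DR-Mamba performs particularly well on the most severely degraded held-out fold, and per-document, per-location subtractive suppression improves cross-domain robustness.

Insight: The novelty includes reinterpreting Mamba-style selective scanning as fast-slow route modeling (a fast detail route captures local stroke structures, while a slow background route accumulates spatially persistent degradation responses) and explicitly suppressing background interference through an input-dependent subtractive gate rather than fusing features by addition or concatenation. Full-resolution detail-guided reconstruction and thin-stroke-aware supervision are also added to recover fine strokes lost during downsampling.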

Abstract: Degraded document image binarization is sensitive to domain shifts caused by paper aging, bleed-through, stains, shadows, and uneven illumination, and the foreground-background separation of recent learning-based methods can become unstable on unseen degradation domains. We propose DR-Mamba, a sample-conditioned detail-background suppression framework that performs automatic inference-time domain adaptation for document image binarization. Unlike test-time adaptation methods that require gradient updates or auxiliary data at inference, DR-Mamba adapts to each input document through input-dependent gates within a single forward pass, requiring no target-domain labels, no fine-tuning, and no test-time parameter updates. Instead of using Mamba-style selective scanning as a single generic feature path, DR-Mamba reinterprets it as fast-slow route modeling: a fast detail route captures local stroke structures, while a slow background route accumulates spatially persistent degradation responses. The two routes are integrated through an input-dependent subtractive gate that explicitly suppresses background interference rather than fusing features by addition or concatenation. We further add full-resolution detail-guided reconstruction and thin-stroke-aware supervision to recover fine strokes lost during downsampling. Evaluated under a leave-one-year-out protocol on DIBCO-style benchmarks, where each held-out year is treated as an unseen degradation domain, DR-Mamba shows that per-document, per-location subtractive suppression improves cross-domain robustness, with particularly strong performance on the most severely degraded held-out fold.
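
The difference between subtractive gating and additive fusion can be shown with a small sketch. It assumes, hypothetically, that the gate is a per-location sigmoid applied to the background response before subtraction; the abstract does not give the exact formulation, and the names below are illustrative.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def subtractive_gate(detail, background, gate_logits):
    """Per-location output: detail - sigmoid(gate) * background.

    Assumption: gate_logits are predicted from the input itself, so the
    amount of suppression varies per document and per location.
    """
    return [d - sigmoid(g) * b for d, b, g in zip(detail, background, gate_logits)]
```

A strongly negative gate logit leaves the detail response nearly untouched, while a large positive one removes almost the whole background response at that location. Additive or concatenation-based fusion has no such explicit suppression term.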


[177] 4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking cs.CV

Chaoyue Li, Boxue Yang, Shengyao Zhou, Haoyang Wu, Rui Qian

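TL;DR: This paper introduces the 4DVLT task, which targets 4D dynamic scene understanding through worldline-centered vision-language tracking, together with the Instruct-4D benchmark and the 4DTrack method. 4DTrack casts instruction-conditioned tracking as graph-conditioned worldline inference over an object-centric 4D state graph, combining metric-guided routing, bidirectional decoding, and kinematic calibration.

Details

Motivation: Existing approaches to 4D dynamic scene understanding are limited: large multimodal models rarely preserve metric topology, while vision-language tracking is confined to fragmented 2D/3D outputs and local continuation. The paper addresses the full problem of grounding language to a persistent worldline that binds identity, metric 3D motion, and synchronized multi-view 2D projections.

Result: On the proposed Instruct-4D benchmark (129.4K question-answer pairs, 64.7K target entities, 851 scenes, and 9 reasoning-oriented query types), 4DTrack-Qwen3.5-9B reaches a Top-1 TGA of 62.68, surpassing the best adapted VLT baseline by 19.62 points.

Insight: The core novelty is a worldline-centered task definition and modeling paradigm that formalizes instruction-conditioned tracking as graph-conditioned worldline inference. Technical contributions include the object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration, which together improve target grounding and the quality of the recovered worldline.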

Abstract: 4D dynamic scene understanding requires grounding language to a persistent worldline that binds identity, metric 3D motion, and synchronized multi-view 2D projections. Existing paradigms capture only part of this structure: large multimodal models reason over rich visual evidence but rarely preserve metric topology, while vision-language tracking remains tied to fragmented 2D or 3D outputs and local continuation. We therefore introduce 4DVLT, a worldline-centered task for instruction-conditioned 4D dynamic scene understanding in fully observed multi-view video, and Instruct-4D, a benchmark with 129.4K question-answer pairs, 64.7K target entities, 851 scenes, and 9 reasoning-oriented query types. To address this setting, we present 4DTrack, which casts instruction-conditioned tracking as graph-conditioned worldline inference through an object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration. On Instruct-4D, 4DTrack-Qwen3.5-9B reaches 62.68 $\mathrm{TGA}_{\mathrm{Top1}}$ and surpasses the best adapted VLT baseline by 19.62 points. These results show that worldline-centered modeling improves both target grounding and recovered worldline quality. The project page is available at https://github.com/mikubaka88/4DVLT.


[178] SATURN: Symbolic Spatial Reasoning for Multi-Perspective Grounding cs.CV | cs.SC

Danial Kamali, Tanawan Premsri, Shreya Rajpal, Amir Zadeh, Chuan Li

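TL;DR: SATURN is a neuro-symbolic framework for perspective-aware compositional spatial reasoning. It reconstructs an approximate 3D scene, derives soft perspective-aware spatial predicates, and composes them with a training-free Pythonic symbolic executor, separating perception from reasoning while preserving uncertainty through multi-hop inference.

Details

Motivation: Existing vision-language models are unreliable when spatial reasoning requires composing relations that depend on frames of reference, and existing neuro-symbolic methods depend on brittle geometric procedures and hard decisions over noisy perception.

Result: On the diagnostic benchmark 3D FORCE, VLMs and spatially trained models degrade sharply as reasoning depth and perspective complexity increase, whereas SATURN remains stable and outperforms strong baselines. On the real-world MindCube benchmark, SATURN achieves 78.57% overall accuracy, 14 percentage points above the strongest baseline.

Insight: The novelty is separating perception from reasoning. Approximate 3D reconstruction and soft predicates preserve uncertainty, and a training-free symbolic executor performs the compositional reasoning, which improves robustness to viewpoint changes and compositional complexity.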

Abstract: Vision-Language Models (VLMs) remain unreliable when spatial reasoning requires composing relations whose meanings depend on frames of reference. Existing neuro-symbolic methods make reasoning more explicit, but often depend on brittle geometric procedures and hard decisions over noisy perception. We propose SATURN, a neuro-symbolic framework for perspective-aware compositional spatial reasoning. SATURN reconstructs an approximate 3D scene, derives soft perspective-aware spatial predicates, and composes them with a training-free Pythonic symbolic executor, separating perception from reasoning while preserving uncertainty through multi-hop inference. We also introduce 3D FORCE, a diagnostic benchmark that controls reasoning depth, view, and perspective composition across spatial arrangement grounding (SAG) and referring expression grounding (REF). On 3D FORCE, VLMs and spatially trained models degrade sharply as depth and perspective complexity increase, whereas SATURN remains stable and outperforms strong baselines. On the real-world MindCube benchmark, SATURN achieves 78.57% overall accuracy, outperforming the strongest baseline by 14 pp.
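
A rough sketch of what a soft, perspective-aware predicate and its composition might look like follows. The sigmoid form, the `temperature` parameter, the product as soft conjunction, and every function name are assumptions made for illustration; the abstract does not specify SATURN's actual predicates or executor.

```python
import math


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def right_coordinate(point, viewer_pos, viewer_yaw):
    """Signed offset of a ground-plane point along the viewer's right-hand axis.

    viewer_yaw is the heading of the viewer's forward direction, measured
    counterclockwise from the world +x axis (radians).
    """
    dx = point[0] - viewer_pos[0]
    dy = point[1] - viewer_pos[1]
    return dx * math.sin(viewer_yaw) - dy * math.cos(viewer_yaw)


def soft_left_of(a, b, viewer_pos, viewer_yaw, temperature=0.5):
    """Probability-like score that a is left of b as seen by the viewer."""
    xa = right_coordinate(a, viewer_pos, viewer_yaw)
    xb = right_coordinate(b, viewer_pos, viewer_yaw)
    return sigmoid((xb - xa) / temperature)


def soft_and(*scores):
    """Soft conjunction: uncertainty compounds across reasoning hops."""
    out = 1.0
    for s in scores:
        out *= s
    return out
```

The same pair of objects yields a high `soft_left_of` score from one side of the scene and a low score from the opposite side. A two-hop chain scores lower than either hop alone, which is one simple way to carry uncertainty through multi-hop inference instead of thresholding early.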


[179] Prompting Diffusion Models for Zero-Shot Instance Segmentation cs.CV

Irem Zeynep Alagöz, Nils Morbitzer, Andrea Ramazzina, Nassir Navab, Federico Tombari

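TL;DR: This paper proposes Prompt2Seg, a spatial conditioning framework for diffusion-based zero-shot instance segmentation. It augments a frozen diffusion segmentation model with spatial prompts (2D Gaussians or confidence maps) as explicit input signals so that the model responds directly to user intent. After fine-tuning on a limited set of object categories, the model generalizes zero-shot to many unseen object types and visual domains.

Details

Motivation: Existing promptable segmentation foundation models produce false positives and over-segmentation when segmenting object regions, and early methods that leverage generative priors use prompts only in post-processing, yielding suboptimal segments. The paper aims to improve the accuracy and generalization of interactive segmentation through spatial conditioning.

Result: Evaluated on seven datasets ranging from standard benchmarks to more challenging domains (paintings, egocentric views, and X-ray data), Prompt2Seg consistently outperforms its underlying diffusion segmentation backbone on all benchmarks and generalizes zero-shot to a wide range of unseen object types and visual domains.

Insight: The novelty is integrating spatial prompts as explicit input signals into a diffusion segmentation model so that it responds directly to user intent. Combining the rich priors from generative pretraining with principled spatial conditioning offers a path toward broadly generalizing interactive segmentation without large-scale mask supervision.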

Abstract: Several disruptive research directions have recently emerged in computer vision, including foundation models achieving previously unseen zero-shot performance in scene understanding, even interactively, and generative models that synthesize extremely realistic images. The latter have also been shown to be highly effective in scene understanding tasks thanks to their rich priors. However, for promptable segmentation, foundation models struggle with accurately segmenting an object’s region, leading to false positives and over-segmentation. Notably, early attempts that leverage generative priors use prompts only during post-processing, yielding suboptimal segments because the process is agnostic to the user input. In this paper, we target these limitations with Prompt2Seg, a spatial conditioning framework for diffusion-based segmentation. Prompt2Seg augments a frozen diffusion segmentation model with a conditioning branch. Our approach takes spatial prompts, represented as 2D Gaussians or confidence maps, as explicit input signals, training the model to respond directly to user intent. Fine-tuned on a deliberately constrained set of object categories drawn from Hypersim and Virtual KITTI 2, Prompt2Seg generalizes zero-shot to a wide range of unseen object types and visual domains. We evaluate on seven datasets ranging from standard benchmarks to more challenging domains, including paintings, egocentric views, and X-ray data. Furthermore, we demonstrate that Prompt2Seg consistently outperforms the underlying diffusion segmentation backbone across all benchmarks. Our results suggest that the rich priors encoded in generative pretraining, combined with principled spatial conditioning, offer a compelling path toward broadly generalizing interactive segmentation without large-scale mask supervision.


[180] Catching Lies Without Sending the Video: Privacy-Preserving Multimodal Deception Detection cs.CV | cs.MM

Nikita Sharma, Pranav Sara, Karan Singla

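TL;DR: This paper proposes a privacy-preserving multimodal deception detection method. A compact digest of speech and vision features (transcript, emotion, intent, facial behaviour, and others) is extracted on-device, so deception detection performance comparable to frontier multimodal models is reached without transmitting raw video.

Details

Motivation: Frontier multimodal models require streaming raw face and voice video to a third party for deception detection, which creates a privacy risk. The paper asks whether raw media is needed at all and designs a privacy-preserving alternative.

Result: Under speaker-independent evaluation on the Real-life Trial Deception dataset, a small classifier on the digest reaches AUC 0.741, matching Gemini 2.5 Pro on full video. Feeding the digest to a frontier LLM (Claude Opus 4.8) raises AUC to 0.755 with 7.8x fewer input tokens and no media leaving the device.

Insight: The novelty is an on-device multimodal digest extraction framework that preserves privacy while maintaining detection performance. Replacing raw media with a structured feature digest substantially reduces data transmission and offers a workable option for sensitive applications in edge computing.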

Abstract: Frontier multimodal models can guess whether a person is lying from a testimony video. To do so, they stream the raw face and voice to a third-party model. We ask whether the heavy media is needed at all. On the Real-life Trial Deception dataset, the Whissle on-device speech and vision stack extracts a compact digest: transcript, emotion, age, gender, intent distributions, a deception intent filter, fluency and rhythm, per-frame facial behaviour, and prosody. Under speaker-independent evaluation, we report three findings. A small classifier on this digest reaches AUC 0.741, matching Gemini 2.5 Pro on full video. Handing the digest to a frontier LLM reaches AUC 0.755 with Claude Opus 4.8 at 7.8x fewer input tokens, with no media leaving the device. The reported 75% accuracy is a speaker-leakage artifact. We release code and experiments.


[181] Generative Relightable Avatars cs.CV

Kunwar Maheep Singh, Christian Theobalt, Rishabh Dabral

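TL;DR: This paper presents Generative Relightable Avatars (GRA), a person-specific method for photorealistic free-view rendering and environment-map relighting of full-body humans. It follows a hybrid strategy that combines controllable, physics-grounded relighting with probabilistic refinement: material parameters are optimized in UV space, a feed-forward model refines the textures, and a fine-tuned video-to-video diffusion model produces temporally coherent, high-detail videos.

Details

Motivation: Existing relightable avatar methods fall short in modeling fine-grained appearance details such as pose-dependent texture dynamics and complex illumination effects. The authors regard this as a one-to-many problem and introduce a generative model to improve realism.

Result: Experimental evaluation shows that the method improves perceptual quality over prior relightable avatar baselines.

Insight: The novelty is combining controllable physically based rendering with generative models, in particular a video diffusion model, and using an error-recycling strategy to generate long videos, which achieves high realism and temporal coherence while preserving 3D control.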

Abstract: We present Generative Relightable Avatars (GRA), a person-specific method for photorealistic free-view rendering and environment-map relighting of full-body humans. We postulate that modeling fine-grained appearance details is inherently a one-to-many problem that can benefit from a generative formulation. In contrast to fully regressive relightable avatar methods, GRA follows a hybrid approach that combines controllable, physics-grounded relighting with probabilistic refinement. Starting from a tracked animated mesh, we optimize material parameters in UV-space and render a coarse relit appearance under a target HDR environment map. Next, we refine the textures with a feed-forward model to capture pose-dependent texture dynamics and illumination effects beyond simplified reflectance assumptions. Finally, a fine-tuned video-to-video diffusion model transforms the physically grounded renderings into temporally coherent, high-detail videos while preserving 3D control, with an error-recycling strategy for generating long videos. Experimental evaluations demonstrate our method’s improved perceptual quality over prior relightable avatar baselines. Project Page: https://vcai.mpi-inf.mpg.de/projects/GRA/


[182] RaysUp: Ultra-light Universal Feature Upsampling via Geometry-Aware Ray Representation cs.CV

Yuchuan Ding, Linfei Li, Lin Zhang, Ying Shen

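TL;DR: This paper proposes RaysUp, an ultra-lightweight, task-agnostic, and vision-foundation-model-agnostic feature upsampling framework that reconstructs high-resolution feature maps at arbitrary resolutions through a geometry-aware ray representation. Using a Spatially Decoupled Guidance Encoder, an Any-Resolution Cross-Attention mechanism, and Ray Positional Encoding, it achieves state-of-the-art performance on a range of dense prediction tasks while substantially reducing parameter count and inference time.

Details

Motivation: The patchified or pooled outputs of pre-trained vision foundation models (VFMs) are inherently low-resolution, which limits their effectiveness on tasks requiring fine-grained, pixel-level reasoning. Existing feature upsampling methods either degrade semantic fidelity or rely on VFM-specific retraining and heavy architectures, hurting efficiency and scalability.

Result: Extensive experiments across diverse dense prediction tasks show that RaysUp achieves state-of-the-art performance while using only 16% of the parameters of AnyUp and delivering approximately 7x faster inference.

Insight: The novelty is lifting feature reconstruction into a geometry-aware ray domain and introducing a Spatially Decoupled Guidance Encoder, an Any-Resolution Cross-Attention mechanism, and a Ray Positional Encoding based on 6D Plucker ray coordinates. Content-adaptive bilateral aggregation that preserves geometric consistency then yields a markedly better accuracy-efficiency trade-off.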

Abstract: Pre-trained Vision Foundation Models (VFMs) have become central to modern computer vision due to their powerful semantic representations and strong generalization ability. However, their patchified or pooled outputs are inherently low-resolution, limiting their effectiveness in tasks requiring fine-grained, pixel-level reasoning. Existing feature upsampling approaches either degrade semantic fidelity or rely on VFM-specific retraining and heavy architectures, hindering efficiency and scalability. To address these challenges, we propose RaysUp, an ultra-lightweight, task-agnostic, and VFM-agnostic feature upsampling framework that reconstructs high-resolution feature maps at arbitrary resolutions. Unlike conventional 2D interpolation or attention-based schemes, RaysUp lifts feature reconstruction into a geometry-aware ray domain. Specifically, we introduce a Spatially Decoupled Guidance Encoder for direction-aware guidance encoding, an Any-Resolution Cross-Attention mechanism for resolution-flexible reconstruction, and a novel Ray Positional Encoding (RayPE) that injects implicit 3D geometric priors via 6D Plucker ray coordinates. Finally, a Geometry-Aware Neighborhood Attention module further ensures content-adaptive bilateral aggregation while preserving geometric consistency. Extensive experiments across diverse dense prediction tasks demonstrate that RaysUp achieves state-of-the-art performance while using only 16% of the parameters of AnyUp and delivering approximately 7x faster inference. These results highlight a substantially improved accuracy-efficiency trade-off and establish RaysUp as a practical and scalable solution for universal feature upsampling. Code is available at https://github.com/MAP-RaysUp/RaysUp.
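
For readers unfamiliar with the 6D Plucker coordinates mentioned above, the sketch below computes them for a camera ray using the standard definition (unit direction `d`, moment `o x d`). The pinhole intrinsics and the helper names are assumptions for illustration. How RaysUp assigns rays to feature locations and turns them into a positional encoding is not stated in the abstract.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def pixel_direction(u, v, fx, fy, cx, cy):
    """Back-project a pixel through a pinhole camera (camera frame, z forward)."""
    return [(u - cx) / fx, (v - cy) / fy, 1.0]


def plucker_ray(origin, direction):
    """6D Plucker coordinates [d, m] with unit direction d and moment m = o x d."""
    d = normalize(direction)
    m = cross(origin, d)
    return d + m
```

The moment is always orthogonal to the direction, and it does not change when the origin slides along the ray, so the six numbers identify the line itself rather than a particular point on it.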


[183] READ More than What You See: Reinforcement Learning for Accurate and Coherent Audio Description Generations cs.CV

Bo Fang, Xinyao Zhang, Yuxin Song, Hui Zhang, Hang Zhou

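TL;DR: The paper presents READ, the first reinforcement-learning framework for audio description (AD) generation. Sequence-level optimization and a context-aware coherence reward substantially improve the accuracy and coherence of the descriptions.

Details

Motivation: Existing methods either prompt off-the-shelf multimodal models, whose output often mismatches AD style, or train only with next-token prediction, which under-explores model capacity and biases generation toward generic expressions. A more effective training paradigm is needed.

Result: On the MAD-Eval, CMD-AD, and TV-AD benchmarks, READ substantially outperforms prior methods across diverse evaluation metrics.

Insight: The novelty is bringing reinforcement learning to AD generation with a multi-objective optimization framework comprising reference-matching, length, format, and coherence rewards. The context-aware coherence reward in particular improves narrative coherence.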

Abstract: Audio Description (AD) aims to generate concise narrations of essential visual content in audio-visual media for blind and low-vision audiences. Existing methods either rely on prompting off-the-shelf multimodal models, which often mismatch AD style, or partially optimize training-based systems with next-token prediction, which under-explores model capacity and biases generation toward generic expressions. We present READ, the first reinforcement-learning (RL) framework for training-based AD generation. READ formulates AD as sequence-level optimization with reference-matching, length, and format rewards, and further introduces a dedicated coherence reward under context-aware supervision to promote narratively coherent descriptions. Experiments on MAD-Eval, CMD-AD, and TV-AD show that READ substantially outperforms prior methods across diverse evaluation metrics. Our results highlight RL as a promising paradigm for accurate and coherent AD generation. Our code, models, and benchmark results will be publicly available.


[184] Visual Geometry Transformer in the Wild: Distractor-Free 3D Reconstruction cs.CV

Tianbo Pan, Xingyi Yang, Shizun Wang, Xinchao Wang

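TL;DR: This paper proposes VGTW, an end-to-end multi-view 3D reconstruction framework for real-world scenes in which transient distractors and occlusions cause existing methods to fail. It isolates and suppresses distractor-affected regions in the attention mechanism while preserving features that are consistent across views, and directly outputs clean, distractor-free point clouds.

Details

Motivation: Current end-to-end multi-view 3D reconstruction methods assume idealized static, distractor-free inputs and therefore fail in real-world scenes with transient distractors and occlusions. The paper aims to develop a framework for robust reconstruction from inconsistent views.

Result: Extensive experiments in diverse real-world scenarios show state-of-the-art performance and strong generalization.

Insight: The novelty includes a Distractor-aware Training strategy that separates clean features from distractor-contaminated ones in the attention mechanism and enforces feature consistency across images. The model is trained with an auxiliary mask prediction head on a newly collected dataset with pixel-level distractor masks. It needs no additional 3D supervision, is computationally efficient, and is compatible with existing pipelines.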

Abstract: Current end-to-end multi-view 3D reconstruction methods achieve impressive results, but rely on a restrictive static assumption: the scene is entirely distractor-free with perfect cross-view geometry. This reliance on idealized inputs causes even the most advanced methods to fail in real-world settings, where transient distractors and occlusions are present. To address this, we propose Visual Geometry Transformer in the Wild (VGTW), an end-to-end framework for robust reconstruction from inconsistent views. At its core, we isolate and suppress distractor-affected regions while preserving the consistent components across views. Specifically, we introduce a Distractor-aware Training (DAT) strategy that separates clean features from distractor-contaminated ones in the attention mechanism while enforcing feature consistency across images. To enable this, we train the model with an auxiliary mask prediction head, using supervision from a new dataset we collected with pixel-level distractor masks. The resulting VGTW model is a feed-forward network that directly outputs clean, distractor-free point clouds. Remarkably, it requires no additional 3D supervision, remains computationally efficient, and is compatible with existing pipelines. Extensive experiments validate our approach, demonstrating state-of-the-art performance and robust generalization in diverse, real-world scenarios.


[185] CoVStream: Edge-Cloud Collaboration for Understanding of Long Video Streams cs.CV

Xu Liu, Guikun Chen, Zihao Yan, Kanzhi Wu, Wenguan Wang

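TL;DR: The paper proposes CoVStream, the first edge-cloud collaborative framework for understanding long video streams. The edge node distills raw video streams into compact visual features and semantic captions to reduce transmission bandwidth. The cloud integrates this data into an entity graph and global visual context and activates the heavy reasoning model only when a user query arrives.

Details

Motivation: Existing approaches to long video (sample-encode-reason) usually overlook a key deployment fact: video streams are often produced by computationally constrained devices. This forces a difficult trade-off between cloud offloading (high bandwidth overhead) and on-device processing (limited by edge hardware).

Result: Experiments on VideoMME-Long, LVBench, and RTV-Bench show that CoVStream reduces bandwidth usage by 87.6% while retaining 99.2% of the cloud baseline accuracy on LVBench.

Insight: The novelty is the first edge-cloud collaborative framework for long video streams. Edge-side distillation into features and captions greatly reduces bandwidth, and the cloud activates the heavy model only on demand, which balances efficiency and performance and reconciles deployability with reasoning capability.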

Abstract: Long, continuous video streams are an increasingly critical driver of multimedia intelligence. Existing efforts often handle long videos with a sample-encode-reason approach using large models. However, they overlook a crucial deployment fact: the stream is often produced by computationally constrained devices. This forces an untenable compromise: cloud offloading unlocks strong reasoning but incurs prohibitive bandwidth overhead, while on-device processing remains limited by edge hardware capacity. Therefore, we propose CoVStream, the first edge-cloud collaborative framework for understanding long video streams. The edge node distills raw video streams into compact visual features and semantic captions for transmission to the cloud, minimizing bandwidth costs, while the cloud server integrates this data into an entity graph and global visual context, activating the heavy reasoning model only when a user query arrives. Experiments on VideoMME-Long, LVBench, and RTV-Bench show that CoVStream reduces bandwidth usage by 87.6% while retaining 99.2% of the cloud baseline accuracy on LVBench.


[186] LoCC: Detection and Localization of Lip-Syncing Deepfakes via Counterfactual Frame Consistency cs.CV

Soumyya Kanti Datta, Shan Jia, Siwei Lyu

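TL;DR: This paper proposes LoCC, a framework for detecting and localizing lip-syncing deepfake videos. It analyzes the consistency between each video frame and a counterfactual estimate generated from its temporal neighbors, enabling fine-grained detection at segment and frame levels. Experiments show that LoCC outperforms existing methods on multiple benchmark datasets and generalizes well.

Details

Motivation: Lip-syncing deepfakes are hard to detect because their artifacts are confined to the mouth region and change dynamically over time, so detection requires precise spatio-temporal modeling.

Result: LoCC outperforms state-of-the-art methods on multiple lip-syncing deepfake benchmarks, including LAV-DF, AVDF1M, FakeAVCeleb, and KODF, and generalizes well across compression levels and datasets.

Insight: The novelty is counterfactual frame consistency analysis combined with a teacher-student learning paradigm. Forgery is detected by evaluating the local consistency between each frame and an estimate from its temporal neighbors, which enables fine-grained frame-level detection and localization.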

Abstract: Lip-syncing deepfakes are among the most challenging forms of manipulated media because their artifacts are localized almost exclusively to the mouth region and evolve dynamically over time. Detecting such deepfakes requires precise temporal and spatial modeling of lip motion. In this paper, we propose LoCC, a novel detection framework that performs fine-grained detection and localization of lip-syncing deepfakes at both segment and frame levels. Unlike prior approaches that analyze videos holistically, our method evaluates whether each frame aligns with a counterfactual estimate generated from its temporal neighbors. Real videos exhibit strong and stable consistency, whereas lip-sync deepfakes introduce localized inconsistencies. Following a teacher-student learning paradigm, our model effectively captures these frame-level discrepancies and achieves superior performance over state-of-the-art methods on multiple benchmark lip-syncing deepfake datasets, including LAV-DF, AVDF1M, FakeAVCeleb, and KODF, and generalizes well across compression levels and datasets.
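
The consistency check described above can be caricatured in a few lines. In this sketch the counterfactual estimate is simply the mean of the two neighboring frame features, a stand-in chosen for illustration. The paper uses a learned teacher-student model, and the feature representation, threshold, and function names here are all assumptions.

```python
import math


def counterfactual_residuals(features):
    """Distance between each interior frame and the mean of its two neighbors.

    `features` is a list of per-frame feature vectors (lists of floats).
    Boundary frames have no two-sided estimate and get residual 0.0.
    """
    residuals = [0.0] * len(features)
    for t in range(1, len(features) - 1):
        prev_f, next_f = features[t - 1], features[t + 1]
        estimate = [(p + n) / 2.0 for p, n in zip(prev_f, next_f)]
        residuals[t] = math.sqrt(
            sum((x - e) ** 2 for x, e in zip(features[t], estimate))
        )
    return residuals


def flag_frames(features, threshold):
    """Indices of frames whose residual exceeds the threshold."""
    residuals = counterfactual_residuals(features)
    return [t for t, r in enumerate(residuals) if r > threshold]
```

A smoothly varying sequence produces zero residuals, while a single out-of-place frame stands out with the largest residual. Because each frame gets its own score, the output localizes the manipulation instead of giving only a video-level verdict.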


[187] Policy-as-Data: Learning Generalizable HOI Diffusion Models from Simulated Physics cs.CV

Shujia Li, Jianshu Hu, Haiyu Zhang, Yunpeng Jiang, Haoyuan Jin

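TL;DR: This paper proposes Policy-as-Data, a framework that uses a physics simulator to generate synthetic data and overcome the data-scarcity bottleneck in human-object interaction (HOI) generation. It includes a scalable pipeline that uses policies trained with reinforcement learning for task-oriented data generation and trains a generative model on the augmented dataset for generalizable HOI generation.

Details

Motivation: Current data-driven methods based on motion capture data are expensive and limited in functional diversity, so models generalize poorly to unseen objects and fail to maintain physical consistency over long horizons. The paper uses a physics simulator to address data scarcity in HOI generation.

Result: Comprehensive experiments show enhanced generalization to unseen objects and long-horizon generation capability, together with greater dynamic diversity and physical plausibility.

Insight: The novelty is a framework for learning generalizable HOI diffusion models from simulated physics. A coarse-to-fine retargeting process bridges the representation gap between the simplified model used in the physics simulator and the standard parametric body models required for generative training, so the synthetic data can be used effectively.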

Abstract: Synthesizing realistic Human-Object Interactions (HOI) is critical for creating embodied avatars and functional virtual environments. However, current data-driven approaches primarily rely on motion capture datasets, which are expensive to scale and limited in functional diversity. Models trained with these datasets fail to generalize to unseen objects and to maintain physical consistency over long horizons. In this paper, we propose a novel framework that leverages a physics simulator to overcome the data-scarcity bottleneck in HOI generation. Specifically, we propose a scalable pipeline, called Policy-as-Data, which leverages policies trained with reinforcement learning in a physics simulator for task-oriented data generation and trains a generative model on the augmented dataset for generalizable HOI generation. To seamlessly utilize the synthetic data, we introduce a coarse-to-fine retargeting process that bridges the representation gap between the simplified model used in the physics simulator and the standard parametric body models required for generative training. Validated through comprehensive experiments, our method demonstrates enhanced generalization to unseen objects and the capability of long-horizon generation, while exhibiting greater dynamic diversity and physical plausibility.


[188] DBT-Bleed: Dual-Branch Temporal Modeling with Key-Frame Selection for Surgical Bleeding Detection cs.CV | cs.AI

Sudhanshu Mishra, Jialang Xu, Jensen Ang, Evangelos B. Mazomenos, Beng Ti Ang

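TL;DR: This paper proposes DBT-Bleed, a dual-branch multi-scale temporal modeling framework for surgical bleeding detection. It processes long surgical videos efficiently with a Hierarchical Entropy-Driven (HiRED) key-frame selection strategy and uses layer-wise temporal adapters to disentangle bleeding and normal representations, modeling both short- and long-term bleeding progression.

Details

Motivation: Because of limited temporal reasoning, existing methods struggle to distinguish bleeding intraoperative adverse events (IAEs) from visually similar residual blood, and modeling long surgical videos while preserving fine-grained temporal dynamics is computationally challenging.

Result: On the MultiBypass dataset, the method improves F1, recall, and MCC for bleeding IAE detection by 6.53%, 5.62%, and 9%, respectively, consistently outperforming video-level baselines. In a zero-shot cross-procedure setting on the newly introduced EndoPit-IAE dataset, it gains 6% in F1 and 8% in MCC, showing robust transferability.

Insight: The core contributions are the dual-branch multi-scale temporal modeling framework and the HiRED key-frame selection strategy. The former disentangles representations through layer-wise temporal adapters to strengthen temporal reasoning; the latter retains informative frames while removing redundancy. The paper also contributes EndoPit-IAE, the first IAE-annotated dataset in neurosurgery, which supports cross-procedure evaluation.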

Abstract: Detection of Intraoperative Adverse Events (IAEs) is critical for improving surgical safety, with bleeding being among the most frequent events across many surgery types. Existing methods struggle to distinguish bleeding IAEs from visually similar residual blood due to limited temporal reasoning. Moreover, modeling long surgical videos while preserving fine-grained temporal dynamics remains computationally challenging. We propose DBT-Bleed, a dual-branch multi-scale temporal modeling framework that disentangles bleeding and normal representations using layer-wise temporal adapters for short- and long-term bleeding progression. To efficiently process long surgical videos without sacrificing fine-grained temporal information, we introduce HiRED, a Hierarchical Entropy-Driven frame selection strategy that retains temporally informative segments while removing redundancy. Experiments on the MultiBypass dataset demonstrate gains of 6.53% in F1, 5.62% in Recall and 9% in MCC for bleeding IAE detection, consistently outperforming video-level baselines. Additionally, we evaluate cross-procedure generalization on a newly curated dataset from a different surgical procedure type, where DBT-Bleed demonstrates robust transferability by achieving gains of 6% in F1 and 8% in MCC in a zero-shot setting. To support this evaluation, we introduce EndoPit-IAE, an Endonasal Pituitary Surgery dataset annotated for IAEs, representing the first IAE-annotated dataset in neurosurgery. Code will be made publicly available upon acceptance.
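
The abstract does not spell out how HiRED scores frames, so the following is only a rough illustration of entropy-driven selection in general, not the authors' algorithm. It assumes frames are flat lists of quantised intensities, a fixed segment length, and a fixed number of frames kept per segment; `shannon_entropy`, `select_key_frames`, and both parameters are hypothetical names.

```python
import math
from collections import Counter

def shannon_entropy(frame):
    """Shannon entropy (in bits) of the intensity values in one frame."""
    counts = Counter(frame)
    total = len(frame)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_key_frames(frames, segment_len, keep_per_segment):
    """Hypothetical two-level selection: split the video into fixed-length
    segments, then keep the highest-entropy frames inside each segment."""
    kept = []
    for start in range(0, len(frames), segment_len):
        segment = range(start, min(start + segment_len, len(frames)))
        ranked = sorted(segment, key=lambda i: shannon_entropy(frames[i]), reverse=True)
        kept.extend(sorted(ranked[:keep_per_segment]))  # restore temporal order
    return kept

# Toy "frames": flat lists of quantised pixel intensities.
frames = [
    [0, 0, 0, 0],  # uniform frame, entropy 0
    [0, 1, 2, 3],  # four distinct values, entropy 2 bits
    [0, 0, 1, 1],  # entropy 1 bit
    [5, 5, 5, 5],  # uniform frame, entropy 0
    [1, 1, 1, 2],  # entropy of roughly 0.81 bits
    [7, 7, 7, 7],  # uniform frame, entropy 0
]
key_frames = select_key_frames(frames, segment_len=3, keep_per_segment=1)
```

Selecting within segments rather than globally keeps at least some frames from every part of the video, which is one plausible reading of "hierarchical" here.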


[189] Homographic Navigation: Geometry-Driven Camera Guidance for Deterministic Planar Capture cs.CV

Dominik Kroupa, Marek Vaško, Muh Yuzril Ihza Baharuddin, Adam Herout

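TL;DR: This paper presents homographic navigation, a geometry-centric framework for guiding a camera toward precise capture of planar regions. Homography serves as the organizing variable that unifies learning, alignment, and evaluation. From a single annotated reference image the framework generates unlimited synthetic training data and trains a single-shot model that jointly recognizes and localizes multiple physical objects with rectangular planar targets through sparse keypoint prediction. To address precision under limited input resolution, it introduces a two-pass inference scheme (global detection followed by localized refinement) and a Stable Warp training strategy that markedly improves accuracy in the high-precision regime.

Details

Motivation: The work targets camera guidance for precise capture of planar regions (such as physical rectangular targets), aiming for geometry-driven, deterministic acquisition from minimal supervision (a single annotated image).

Result: Experiments show that accurate planar alignment can be achieved from minimal supervision, laying a foundation for geometry-driven camera guidance and future learning from in-the-wild video data. The abstract does not report specific benchmarks or quantitative comparisons with the state of the art.

Insight: The novelty lies in treating homography as the core organizing variable rather than an output, and in the two-pass inference and Stable Warp training strategies that improve precision. Viewed objectively, the geometry-centric approach to data generation and model design offers a reusable template for few-shot, high-precision visual localization tasks.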

Abstract: We present homographic navigation, a geometry-centric framework for guiding camera acquisition toward precise capture of planar regions. Rather than treating homography as an output, we use it as an organizing variable that unifies learning, alignment, and evaluation. From a single annotated reference image, we generate unlimited synthetic training data via homographic augmentation and train a single-shot model for joint recognition and localization of multiple artifacts (physical objects with a rectangular planar target) through sparse keypoint prediction. To address precision under limited model input resolution, we introduce a two-pass inference scheme with global detection followed by localized refinement, and a Stable Warp training strategy that significantly improves accuracy, particularly in the high-precision regime. The model also predicts confidence estimates for each predicted keypoint and for the whole sample. Experimental results demonstrate that accurate planar alignment can be achieved from minimal supervision, providing a foundation for geometry-driven camera guidance and future learning from in-the-wild video data.
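
As a small illustration of why one annotated image can yield many labelled views: a homography maps the annotated corner keypoints of a planar target to their positions in any warped view, so the labels come for free. The sketch below shows only that standard mapping; the matrix `H` and the corner coordinates are made-up values, not taken from the paper.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to return to image coordinates

# Hypothetical homography: scale by 2, translate by (10, 5), mild perspective in x.
H = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 5.0],
    [0.001, 0.0, 1.0],
]

# Annotated corners of a rectangular planar target in the reference image.
reference_corners = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]

# Keypoint labels for the synthetic (warped) view.
warped_corners = [apply_homography(H, p) for p in reference_corners]
```

Sampling many such matrices and warping both the image and its keypoints is one common way to implement this kind of augmentation; the paper's exact sampling scheme is not described in the abstract.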


[190] OrthoMotion: Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention cs.CV | cs.AI

Zijie Meng

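TL;DR: This paper proposes OrthoMotion, which decouples camera motion from subject motion in controllable video generation through geometry-semantics orthogonal attention. Camera motion is encoded as a norm-preserving rotation of the rotary position embedding (RoPE) phase (the geometric channel), subject motion as a gated value injection in cross-attention (the semantic channel), and a lightweight decoupling regularizer keeps the response subspaces of the two control channels orthogonal so that they do not interfere.

Details

Motivation: In existing controllable video generation methods based on 2D conditioning, camera motion and object motion are entangled in optical flow because they share the same inverse-depth (1/Z) scaling and cannot be separated from image evidence alone. This is a non-identifiable inverse problem at the representation level, so decoupling has to be addressed through operator design.

Result: OrthoMotion reaches state-of-the-art camera and subject control accuracy at the same time and substantially reduces cross-talk. Measured with the authors' new Cross-Talk Error (CTE) metric, it cuts cross-talk by more than 2.4x with no loss in fidelity and generalizes across backbones.

Insight: The core idea is to reframe decoupling as operator design, with an algebraically complementary geometric channel (a rotation) and semantic channel (a translation), and a regularizer with theoretical guarantees that achieves disentanglement by construction rather than relying on it to emerge. The authors describe it as, to their knowledge, the first method to guarantee disentanglement rather than hope for it.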

Abstract: Controllable video generation demands independent command of the camera and the subject, yet 2D conditioning entangles them: camera- and object-induced optical flow share the same inverse-depth (1/Z) scaling and cannot be separated from image evidence alone. We first prove that this entanglement is representational, not architectural – the 2D camera/object split is a non-identifiable inverse problem – and therefore reframe decoupling as a question of operator design. We resolve it at the level of the attention operator. OrthoMotion routes camera motion into a geometric channel, a norm-preserving rotation of the rotary position embedding (RoPE) phase, and subject motion into a semantic channel, a gated value injection in cross-attention. Because these sub-operators are algebraically complementary – a rotation versus a translation of the affine action on tokens – a lightweight decoupling regularizer provably drives their response subspaces to orthogonality, so the two controls stop interfering. To our knowledge OrthoMotion is the first method to guarantee disentanglement by construction rather than hope for it to emerge. It attains state-of-the-art camera and subject accuracy at once while minimizing cross-talk, which we quantify with a new Cross-Talk Error (CTE) metric, cutting cross-talk by more than 2.4x with no loss in fidelity and generalizing across backbones.
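
To make the "rotation versus translation" contrast concrete, the toy sketch below rotates a token's coordinate pairs by phase offsets (as RoPE does) and, separately, adds a vector to the token. Only the rotation preserves the norm. The phase values and the additive signal are arbitrary placeholders, and nothing here reproduces OrthoMotion's actual attention operator or regularizer.

```python
import math

def rotate_pairs(vec, phase_offsets):
    """Rotate consecutive (even, odd) coordinate pairs by the given angles."""
    out = []
    for k, theta in enumerate(phase_offsets):
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in vec))

token = [1.0, 2.0, -0.5, 3.0]

# Geometric channel (illustrative): a phase rotation leaves the norm unchanged.
camera_phase = [0.3, -1.2]  # hypothetical camera-dependent phase shifts
rotated = rotate_pairs(token, camera_phase)

# Semantic channel (illustrative): an additive injection generally changes the norm.
value_injection = [0.1, 0.0, 0.0, -0.2]  # hypothetical subject-motion signal
translated = [t + v for t, v in zip(token, value_injection)]
```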


[191] Chains That See, Answers That Don’t: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME cs.CV | cs.LG

Zhichao Fan, Yanhang Li, Zexin Zhuang

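TL;DR: This paper proposes a three-probe evaluation recipe to test whether forced chain-of-thought (CoT) really makes vision-language models more reliable on video question answering. Applied to Qwen2.5-VL on Video-MME, it shows that although the CoT chains depend strongly on the video content, forced CoT does not improve multiple-choice accuracy and even causes a small but statistically supported drop on the smaller model.

Details

Motivation: To test a widely held assumption: that forced chain-of-thought (CoT) makes vision-language models more reliable on video question answering.

Result: Evaluated with Qwen2.5-VL on Video-MME subsets, forced CoT does not improve multiple-choice question (MCQ) accuracy; on the smaller 7B model it produces a small but statistically supported drop under a primary scorer chosen post hoc.

Insight: The novelty is an evaluation recipe with three diagnostic probes (paired accuracy, counterfactual video swap, and a visual-degradation ladder) that disentangles and measures the actual effect of CoT on model reliability, revealing a disconnect between how strongly the reasoning depends on the video and whether the final answer becomes more accurate.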

Abstract: Forced chain-of-thought (CoT) is widely assumed to make vision-language models more reliable on video question answering. We propose a small three-probe evaluation recipe to test that assumption: paired accuracy across direct, CoT, answer-first, and no-video conditions; a counterfactual video-swap diagnostic over the CoT chains; and a four-rung visual-degradation ladder. Each probe is reported under both a strict and a permissive regex scorer, with multiplicity correction over a manuscript-declared primary family. Applied to Qwen2.5-VL on Video-MME subsets, the recipe returns a two-part finding. The CoT chains are strongly video-conditioned: swapping the input video collapses chain overlap and flips most final letters, the opposite of what a “boilerplate-chain” null would predict. Yet on the same data, forced CoT does not improve MCQ accuracy, and on the smaller 7B model it produces a small but statistically supported drop under a post-hoc primary scorer choice. We do not claim this generalizes beyond the Qwen2.5-VL / Video-MME instantiation; the raw responses and a single recomputation script will be released with the supplementary material so every number can be re-derived.
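
The abstract mentions paired accuracy comparisons with multiplicity correction but does not name the statistical procedures. As one plausible setup, and purely as an assumption, the sketch below uses an exact McNemar test on per-question correctness for two conditions and a Holm step-down correction across a family of comparisons. The correctness vectors and p-values are invented.

```python
from math import comb

def mcnemar_exact(direct_correct, cot_correct):
    """Two-sided exact McNemar test on paired per-question correctness."""
    pairs = list(zip(direct_correct, cot_correct))
    only_direct = sum(1 for d, c in pairs if d and not c)
    only_cot = sum(1 for d, c in pairs if c and not d)
    n = only_direct + only_cot  # only discordant pairs carry information
    if n == 0:
        return 1.0
    k = min(only_direct, only_cot)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    adjusted = [0.0] * len(p_values)
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (len(p_values) - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Invented example: 10 questions, direct answering vs forced CoT.
direct_correct = [True] * 8 + [False] * 2
cot_correct = [True] * 2 + [False] * 8
p_value = mcnemar_exact(direct_correct, cot_correct)

# Invented family of three primary comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A paired test is the natural choice here because both conditions are scored on the same questions, so only the questions where the two conditions disagree affect the result.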


[192] VideoLatent: Video-Language Learning via Latent Self-Forcing cs.CV | cs.AI

Zi-Yuan Hu, Zicong Tang, Shijia Huang, Yanyang Li, Michael R. Lyu

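TL;DR: This paper proposes VideoLatent, a multimodal large language model (MLLM) that performs video understanding and reasoning through a latent self-forcing training paradigm. It is trained only on standard video-question-answer triplets, with no additional supervision signals or dense annotations, outperforms existing standard and latent MLLMs across 14 benchmarks, and substantially improves computational efficiency.

Details

Motivation: Existing CoT-based MLLMs require extensive manual annotation and incur heavy computational overhead, while visual latent reasoning methods mainly target image tasks and depend on additional supervision, which makes them hard to extend to video. The paper aims for an efficient, scalable video-language learning framework that needs no extra supervision.

Result: Across 14 benchmarks, VideoLatent outperforms existing standard and latent MLLMs on general video understanding and complex video reasoning. Compared with Video-R1, training and inference overhead fall by roughly 6x and 68x, respectively, and the method generalizes well across MLLM backbones and model scales.

Insight: The novelty is the latent self-forcing training paradigm, which combines latent alignment and latent diversity objectives and learns visual latent reasoning from standard video-question-answer triplets alone. Viewed objectively, learning latent representations implicitly removes the dependence on external supervision, improving scalability and computational efficiency.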

Abstract: Recent advancements in chain-of-thought (CoT) reasoning have shown promise in enhancing video understanding and reasoning capabilities of multimodal large language models (MLLMs). However, existing CoT-based MLLMs require labor-intensive CoT annotations and incur substantial training and inference overhead. While visual latent reasoning has emerged as a more efficient alternative, existing methods primarily focus on image tasks and heavily rely on additional supervision signals for visual latent generation (e.g., CoT traces, auxiliary images, or fine-grained annotations), limiting their scalability and transferability to video tasks. To bridge this gap, we introduce VideoLatent, a novel MLLM equipped with a latent injection module tailored for video understanding and reasoning. Specifically, VideoLatent learns to perform visual latent reasoning using a new latent self-forcing training paradigm, which comprises latent alignment and latent diversity objectives, and relies solely on standard video-question-answer triplets. Extensive experiments across 14 benchmarks demonstrate that our model consistently outperforms existing standard and latent MLLMs on general video understanding and complex video reasoning. Compared with Video-R1, our VideoLatent achieves superior computational efficiency, reducing training/inference overhead by ~6×/~68×. Moreover, experiments demonstrate that our method has strong generalizability to different MLLM backbones and different model scales.


[193] Fursee: Hybrid YOLO-DINOv3 Framework for Fursuit Identity Retrieval and Clustering cs.CV

Jundi Wu

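TL;DR: This paper proposes Fursee, a three-stage hybrid framework for identity retrieval and clustering of the large volume of fursuit photographs produced at furry conventions worldwide. YOLO first detects and crops high-resolution fursuit head regions; ArcFace then optimizes DINOv3 embeddings to increase the separation between identities in feature space; finally DBSCAN performs unsupervised clustering, with the silhouette coefficient used to select hyperparameters automatically. Experiments show that the method outperforms mainstream multimodal models on fursuit head retrieval and grouping.

Details

Motivation: Furry conventions worldwide produce large numbers of fursuit photographs, and sorting them by hand is costly, so automatic identity retrieval and clustering is needed. General-purpose multimodal models are not specifically optimized for complex fursuit scenes, and no public benchmark dataset exists for the task.

Result: On the purpose-built fursuit image dataset, Fursee outperforms mainstream multimodal models including GPT5.5, Claude Opus 4.8, and Qwen3.7-Plus on all retrieval and clustering metrics, achieving competitive performance on fursuit head retrieval and grouping.

Insight: The contributions are: 1) the first dataset dedicated to fursuit identity retrieval; 2) a combination of YOLO detection with DINOv3 feature extraction, with an ArcFace loss that enlarges angular separation in feature space; 3) a silhouette-coefficient-driven automatic hyperparameter search that replaces manual settings and makes DBSCAN clustering more robust.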

Abstract: Global furry conventions produce massive numbers of fursuit photographs, and manual sorting carries heavy labor costs, which calls for automatic identity retrieval and clustering solutions. General multimodal models lack dedicated optimization for complex fursuit scenes, and no public benchmark dataset exists for this task. To fill this gap, we build a specialized fursuit image dataset and present Fursee, a three-stage hybrid pipeline for fursuit identity retrieval and clustering. First, YOLO detects and crops high-resolution fursuit head patches to improve localization of small and overlapping targets. Second, ArcFace optimizes DINOv3 embeddings to enlarge angular separation between different identities on the feature hypersphere. Third, DBSCAN performs unsupervised clustering, with a silhouette-coefficient-driven search automatically selecting optimal hyperparameters rather than relying on a fixed, manually chosen radius. Retrieval and clustering experiments verify that our pipeline outperforms mainstream multimodal models including GPT5.5, Claude Opus 4.8 and Qwen3.7-Plus on all evaluation metrics, achieving competitive performance for fursuit head retrieval and grouping.
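
For readers unfamiliar with the second stage, the sketch below shows the general ArcFace idea of adding an angular margin to the target-class angle before scaling. It is a simplified illustration on toy 2D vectors: the margin and scale values are arbitrary, the edge case where the angle plus margin exceeds pi is ignored, and it says nothing about how Fursee configures the loss.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec]

def arcface_logits(embedding, class_centers, target, margin=0.5, scale=30.0):
    """ArcFace-style logits: add an angular margin to the target-class angle only."""
    e = l2_normalize(embedding)
    logits = []
    for idx, center in enumerate(class_centers):
        w = l2_normalize(center)
        cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(e, w))))
        theta = math.acos(cos_theta)
        if idx == target:
            theta = theta + margin  # penalise the true class to force tighter clusters
        logits.append(scale * math.cos(theta))
    return logits

# Toy example: an embedding perfectly aligned with identity 0.
class_centers = [[1.0, 0.0], [0.0, 1.0]]
logits = arcface_logits([1.0, 0.0], class_centers, target=0)
```

Because the margin lowers the target logit even for a perfectly aligned embedding, training has to pull same-identity embeddings closer together than a plain softmax would require, which is what enlarges the angular gap between identities.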


[194] SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning cs.CV | cs.CL

SingGuard Team

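TL;DR: This paper proposes SingGuard, a policy-adaptive multimodal LLM guardrail for assessing the safety of multimodal conversations. It takes the active safety policy as a runtime input, checks content against natural-language rules one by one, and predicts both the safety label and the triggered rule. It supports inference regimes ranging from fast to slow, including a hybrid mode, and is optimized with fast-slow decoupled reinforcement learning.

Details

Motivation: As vision-language models (VLMs) are deployed widely in consumer, medical, financial, and enterprise applications, the safety surface expands to cover multimodal question answering, assistant responses, and cross-modal composition, while moderation policies can vary by product, region, and deployment stage. Most existing guardrails rely on fixed taxonomies or target only a narrow set of interaction settings, which limits their adaptability when safety rules change at deployment time.

Result: Across six benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family. Dynamic-rule evaluation further shows policy-following accuracy rising from 0.6465 to 0.7415 under runtime policy shifts.

Insight: The novelty lies in treating the safety policy as a dynamic, interpretable input, supporting flexible inference regimes for different efficiency needs, and optimizing with fast-slow decoupled reinforcement learning. The paper also builds SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples covering 80+ fine-grained risk types, including cross-modal joint-risk cases.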

Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering, assistant responses, and cross-modal composition, while moderation policies may vary across products, regions, and deployment stages. Most existing guardrails either rely on fixed taxonomies or target only a narrow set of interaction settings, which limits their adaptability when safety rules change at deployment time. We present SingGuard, a policy-adaptive multimodal guardrail model family for safety assessment in multimodal conversations. SingGuard treats the active policy as a runtime input: given natural-language rules, it checks the target content against the active policy rule by rule and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes along a fast-to-slow reasoning spectrum, ranging from direct safety judgments to policy-grounded deliberation. We further optimize this behavior with fast–slow decoupled reinforcement learning. We also introduce SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples spanning 80+ fine-grained risk types across multimodal QA, adversarial attack, and dynamic-rule evaluation settings, including cross-modal joint-risk cases where each modality is harmless in isolation but their composition implies unsafe intent. Across six benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family. Dynamic-rule evaluation further shows improved policy-following accuracy from 0.6465 to 0.7415 under runtime policy shifts. Our code is available at https://github.com/inclusionAI/Sing-Guard.


[195] InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars cs.CV

Quanyue Song, Yishan He, Yanfei Zhang, Shihao Cheng, Zhixiang He

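TL;DR: This paper proposes InteractiveAvatar, a real-time infinite-streaming video generation framework for visually consistent, intent-aware avatars. It achieves long-duration real-time generation through autoregressive distillation, introduces a long-short visual memory mechanism to maintain consistency, and adds a reasoning-reaction module that perceives user intent and generates aligned speech and actions.

Details

Motivation: Existing diffusion-based methods for real-time audio-driven avatar generation struggle to maintain visual temporal consistency in complex interactive streaming scenarios and cannot explicitly perceive user intent.

Result: Extensive experiments over diverse scenarios show that the method achieves state-of-the-art visual consistency in long-duration generation and supports complex user-avatar interaction in real time.

Insight: The contributions include the Long-Short Visual Memory mechanism for visual consistency and the Reasoning-Reaction Module, with its State-Cycling strategy and Cache-Switching mechanism, for intent awareness and interaction, together enabling real-time, consistent, intent-aligned avatar generation.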

Abstract: Recent diffusion-based models have enabled realistic audio-driven avatar generation in real-time streaming. However, existing approaches struggle to maintain visual temporal consistency and fail to explicitly perceive user intent in complex interactive streaming scenarios. To address these challenges, we propose InteractiveAvatar, a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, InteractiveAvatar achieves real-time streaming generation of human avatars over arbitrarily long durations. For visual consistency, we introduce a Long-Short Visual Memory (LSVM) mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars whose speech and actions align with user intent, we propose a Reasoning-Reaction Module (RRM), which incorporates a State-Cycling strategy and a Cache-Switching mechanism. Extensive experimental results over diverse scenarios demonstrate that our method achieves state-of-the-art visual consistency in long-duration generation, while enabling complex user-avatar interaction in real time.


[196] Intend, Reflect, Refine: An Adaptive Multimodal Reflection Framework for Autonomous Driving cs.CV | cs.AI

Zisheng Chen, Yuping Qiu, Jianhua Han, Tao Tang, Xiuwei Chen

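TL;DR: This paper proposes IRR-Drive, an adaptive multimodal reflection framework for autonomous driving. Following an "intend, reflect, refine" process, it predicts future interactions and corrects itself in a dual-modality reflection space made up of text and bird's-eye-view (BEV) representations before producing the final trajectory, which improves reliability in complex, dynamic environments.

Details

Motivation: Existing end-to-end autonomous driving models usually generate the final trajectory directly without explicitly examining its future consequences, which limits their reliability in complex, dynamic environments. The paper addresses this limitation by introducing an explicit reflection mechanism.

Result: The method achieves state-of-the-art performance on the NAVSIM benchmark in both PDMS and EPDMS. Extensive experiments validate the multimodal reflection framework and the adaptive reflection strategy.

Insight: The core idea is an explicit dual-modality (text + BEV) reflection space that models anticipated scene evolution, together with an adaptive reflection reward that lets the model choose its reasoning mode according to scene complexity. Reflection is integrated directly into planning as decision-aware trajectory correction rather than serving only as an auxiliary explanation.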

Abstract: Recent Vision-Language-Action (VLA) models have advanced end-to-end autonomous driving by incorporating reasoning for better interpretability and planning quality. However, most existing approaches directly generate the final trajectory without explicitly examining its future consequences, which limits their reliability in complex and dynamic environments. To address this limitation, we propose IRR-Drive (Intend, Reflect, Refine), an adaptive multimodal reflection framework for autonomous driving. Specifically, to tightly couple high-level reasoning with physical constraints, IRR-Drive first generates a preliminary textual intention and anticipates potential interactions by predicting future semantic bird’s-eye view (BEV) representations. This dual-modality (Text + BEV) reflection space explicitly models anticipated scene evolution, enabling the model to rigorously self-correct and refine its initial intent before generating the final trajectory. Furthermore, to balance planning performance and computational efficiency, we construct reflection-oriented training data and design an adaptive reflection reward, enabling the model to adaptively select its reasoning mode according to scene complexity. Instead of using reasoning primarily as an auxiliary interpretation, IRR-Drive directly integrates an adaptive reflection mechanism into the planning framework, enabling grounded, decision-aware trajectory correction that is driven by scene complexity. Our method achieves state-of-the-art performance on the NAVSIM benchmark in both PDMS and EPDMS. Extensive experiments demonstrate the effectiveness of our multimodal reflection framework and validate the efficacy of the proposed adaptive reflection strategy.


[197] PHOEBI: An Open-World Benchmark for Bacterial Identification in Phase-Contrast Microscopy cs.CV

Aaditya Baranwal, Md Jahid Hasan, Shruti Vyas

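TL;DR: This paper introduces PHOEBI, an open-world benchmark dataset for evaluating multi-label bacterial species identification in phase-contrast microscopy. It contains 120,000 images covering 40 combinations of six rod-shaped species and comes with a leave-combinations-out (LCO) evaluation protocol that simulates encountering bacterial combinations never seen in training. The study finds that existing gradient-trained per-image aggregators fail systematically at open-world recognition, whereas the proposed lightweight anchor-based decoders perform better on unseen combinations.

Details

Motivation: Phase-contrast microscopy (PCM) is widely used in clinical, environmental, and industrial microbiology for label-free imaging of live bacteria, but real samples are usually polymicrobial and may contain species unseen during training. No computer-vision benchmark exists for multi-label species identification in such mixtures, so an open-world benchmark is needed to evaluate generalization.

Result: Under LCO evaluation, every gradient-trained per-image aggregator tested loses 0.39 to 0.57 F1 from the in-distribution split to the held-out split, indicating a systematic open-world recognition failure in the aggregator. The three proposed lightweight anchor-based decoders score higher on unseen combinations than on in-distribution validation.

Insight: The contribution is PHOEBI, presented as the first open-world benchmark for multi-label bacterial identification in phase-contrast microscopy, together with the LCO protocol that mirrors practical use. The anchor-based decoders capture per-species presence geometrically over a shared frozen feature pool, improving recognition of unseen bacterial combinations.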

Abstract: Optical microscopy enables rapid, label-free imaging of live bacteria and is the standard instrument for species identification across clinical, environmental, and industrial microbiology. Yet field samples are routinely polymicrobial and may contain organisms that were never seen during system training, and no computer-vision benchmark tests multi-label species identification from phase-contrast microscopy (PCM) of such mixtures. We introduce Phase-contrast Optical bEnchmark for Bacterial Identification (PHOEBI), a wet-lab-prepared dataset of 120,000 PCM images covering 40 combinations of six rod-shaped species, paired with a leave-combinations-out (LCO) evaluation protocol that holds out entire species combinations to mirror the practical scenario of a model trained on catalogued mixtures that must generalise to unseen ones. On LCO, every gradient-trained per-image aggregator we test drops 0.39 to 0.57 F1 from the in-distribution to the held-out split, a systematic open-world recognition failure in the aggregator, not the visual representation. A linear probe of thirteen different encoders over the same features spreads only about six percentage points of F1 across general-purpose and biomedical pretraining objectives, confirming the representation is sound. We propose three lightweight anchor-based decoders that capture per-species presence geometrically over a shared frozen tile-feature pool, scoring higher on held-out combinations than on in-distribution validation.
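
The leave-combinations-out idea can be sketched as a split keyed on the set of species present in an image, so that a held-out mixture never appears in training even when each of its species does. The species names, sample tuples, and held-out choices below are placeholders; the abstract does not say which 40 of the possible combinations the dataset uses or which are held out.

```python
from itertools import combinations

SPECIES = ["sp_a", "sp_b", "sp_c", "sp_d", "sp_e", "sp_f"]  # placeholder names

def all_combinations(species):
    """Every non-empty subset of the species list, as tuples."""
    combos = []
    for size in range(1, len(species) + 1):
        combos.extend(combinations(species, size))
    return combos

def leave_combinations_out(samples, held_out):
    """Split (combination, image_id) samples so that held-out combinations
    never appear in the training set."""
    held = {tuple(sorted(c)) for c in held_out}
    train = [s for s in samples if tuple(sorted(s[0])) not in held]
    test = [s for s in samples if tuple(sorted(s[0])) in held]
    return train, test

combos = all_combinations(SPECIES)  # 63 possible non-empty mixtures of six species
samples = [(c, f"img_{i}") for i, c in enumerate(combos[:10])]  # toy sample list
held_out = [("sp_a", "sp_b"), ("sp_d",)]
train, test = leave_combinations_out(samples, held_out)
```

In this toy split `sp_a` and `sp_d` still occur in training through other mixtures, which is what distinguishes holding out combinations from holding out species.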


[198] Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation cs.CV | cs.GT

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong

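TL;DR: This paper proposes JudgeFit, a method that automatically discovers a separate physical-video evaluation taxonomy for each vision-language model (VLM). It iterates through initial taxonomy construction, calibration and diagnosis against human physical-commonsense ratings, and LLM-based repair, ending with an evaluation schema tailored to each VLM. Validated on 16 VLMs, it improves on the global-schema baseline by about 32% on average in relative terms and reveals model-specific blind spots.

Details

Motivation: Existing work relies on VLMs as automated judges of physical consistency in video generators and world models, but VLMs differ widely in training data and architecture. A single global evaluation schema cannot reflect what each VLM can actually perceive, so each VLM needs its own evaluation taxonomy.

Result: Applied to 16 VLMs spanning eight model families, JudgeFit's tailored taxonomies outperform the global-schema baseline on held-out videos for every VLM tested, with a mean relative improvement of about 32%, and expose model-specific reliability patterns and blind spots.

Insight: The novelty is an iterative, per-VLM procedure for discovering evaluation taxonomies (JudgeFit). By combining VLM self-enumeration, calibration to human ratings, and LLM-assisted repair, it tailors the evaluation schema to each model and so reflects each VLM's physical perception ability more accurately than a one-size-fits-all global standard.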

Abstract: Maintaining physical consistency in video generators and world models increasingly relies on vision-language models (VLMs) as automated judges that provide reward signals, ranking decisions, and data-filtering criteria. Yet VLMs differ substantially in training data and architecture, encoding physical phenomena through distinct internal representations. A single global evaluation schema therefore gives every VLM the same axes of competence, regardless of what each can actually perceive. We propose JudgeFit, an iterative refinement procedure that discovers a per-VLM evaluation taxonomy. An initial taxonomy is constructed by prompting the target VLM to enumerate physics errors on a small set of videos and clustering the resulting descriptions. The taxonomy is then refined through a diagnostic step: we calibrate the VLM’s per-dimension scores to human physical-commonsense ratings, diagnose which dimensions it scores unreliably or redundantly, and prompt an LLM to repair them, iterating until convergence. We further instantiate this procedure as a benchmark and apply it to 16 VLMs spanning eight model families. The refined taxonomy outperforms the global-schema baseline on held-out videos for every VLM tested, with a mean relative improvement of approximately 32%. Beyond aggregate accuracy, the per-VLM profiles expose model-specific blind spots that overall rankings cannot anticipate, with reliability patterns differing markedly across model families.


[199] BEV-Denoise: Learning Intrinsic Noise for Accurate Bird’s-Eye-View Semantic Segmentation cs.CV | cs.AI

Dooseop Choi, Kyounghwan An, Kyoung-Wook Min

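TL;DR: This paper proposes BEV-Denoise, a framework that improves bird's-eye-view (BEV) semantic segmentation accuracy by estimating and removing intrinsic noise in learned BEV features. Inspired by denoising diffusion probabilistic models, it uses a UNet-based noise estimation module, subtracts the estimated noise from the BEV features, and feeds the result to the BEV map decoder for the final prediction.

Details

Motivation: To address the loss of segmentation accuracy in BEV semantic segmentation caused by intrinsic noise introduced during view transformation.

Result: Experiments on the large-scale real-world nuScenes dataset show that the framework is effective when applied to four existing models covering the three major view-transformation paradigms, improving BEV semantic segmentation performance.

Insight: The novelty is bringing the noise-estimation idea of denoising diffusion models to BEV feature processing and supervising the noise estimation module through a sequential learning paradigm called Task Decomposition. Viewed objectively, it is a new way of applying generative-model techniques to feature denoising in a perception task.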

Abstract: In this paper, we present a framework dubbed BEV-Denoise that estimates and removes intrinsic noise from learned Bird’s-Eye-View (BEV) features to achieve accurate BEV semantic segmentation. Inspired by the noise estimation capability of Denoising Diffusion Probabilistic Models (DDPM), we design a UNet-based noise estimation module that learns to estimate the noise from the learned BEV features. The estimated noise is then subtracted from the BEV features, and the result is fed to BEV map decoders for the final prediction. To facilitate supervision for the noise estimation module, we follow a sequential learning paradigm called Task Decomposition (TD) where a pre-trained BEV map autoencoder is employed to train a view transformation (VT) encoder. We share three key insights learned from our intensive experiments that are critical for improved performance. We apply our framework to four existing models, encompassing the three major VT paradigms. Experimental results on a large-scale real-world dataset, nuScenes, demonstrate the effectiveness of our framework.


[200] Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation of SAM3 in Medical Image Segmentation cs.CV

Yubo Zhou, Jianghao Wu, Ping Ye, Shaoting Zhang, Guotai Wang

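TL;DR: This paper proposes CM-TTA, a test-time adaptation framework that improves the performance of Segment Anything Model 3 (SAM3) on medical image segmentation. It evaluates prediction quality with a Concept Alignment Contrast metric, balances rapid and stable adaptation with a Long-Short Prompt Memory module, and updates the prompt embeddings through a Densely Supervised Prompt Update strategy.

Details

Motivation: Concept segmentation models such as SAM3 generalize well on natural images but degrade on medical images because imaging principles and styles differ. Existing test-time adaptation methods rely mainly on image-level uncertainty minimization, do not account for region-level semantic correctness, and lack stability in continual one-pass adaptation.

Result: Extensive experiments on prostate and skin lesion segmentation datasets show that CM-TTA significantly outperforms existing test-time adaptation methods for SAM3.

Insight: The novelty lies in the Concept Alignment Contrast metric, which uses textual-visual semantic consistency to evaluate prediction quality, and the Long-Short Prompt Memory module, which balances agile local adaptation with stable global prompt generation. Viewed objectively, combining dense supervision with prompt learning gives a robust adaptation strategy for medical image segmentation without annotations.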

Abstract: Concept segmentation models like Segment Anything Model 3 (SAM3) show strong generalization on natural images, yet their performance degrades in medical imaging due to the domain gap caused by different imaging principles and styles. Test-Time Adaptation (TTA) is essential for improving the testing performance by updating the model on the fly without annotations. However, existing vision-language TTA methods are mainly driven by image-level uncertainty minimization, which does not necessarily reflect region-level semantic correctness in medical segmentation. Moreover, they often lack mechanisms to maintain stability in continual one-pass adaptation, leading to limited performance when reliable dense supervision is missing for segmentation. To address these issues, we propose Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation (CM-TTA) of SAM3 for medical images. First, for a test sample with multiple augmentations, we introduce a novel Concept Alignment Contrast (CAC) metric, which leverages textual-visual semantic consistency to robustly evaluate prediction quality and select the best augmented view as the supervision. Second, to balance rapid and stable adaptation, we design a Long-Short Prompt Memory (LSPM) module. The short memory dynamically fuses recent prompts based on CAC scores for agile local adaptation, while the long memory maintains a stable global prompt to generate enhanced pseudo-labels. Finally, a Densely Supervised Prompt Update (DSPU) strategy is proposed to optimize the prompt embeddings with enhanced pseudo-labels as dense supervision. Extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods for TTA of SAM3.
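
The abstract describes the memory only at a high level, so the sketch below is a guess at one simple way such a module could behave, not the authors' LSPM: the short memory is a bounded queue fused with softmax weights over quality scores, and the long memory is an exponential moving average. The class name, `short_size`, `momentum`, and the use of softmax and EMA are all assumptions.

```python
import math
from collections import deque

class LongShortPromptMemory:
    """Hypothetical memory: score-weighted short-term fusion plus a slow EMA."""

    def __init__(self, dim, short_size=3, momentum=0.9):
        self.short = deque(maxlen=short_size)  # recent (prompt, score) pairs
        self.long = [0.0] * dim                # slowly changing global prompt
        self.momentum = momentum
        self.initialised = False

    def update(self, prompt, score):
        """Store the newest prompt and blend it into the long-term prompt."""
        self.short.append((list(prompt), score))
        if not self.initialised:
            self.long = list(prompt)
            self.initialised = True
        else:
            m = self.momentum
            self.long = [m * l + (1 - m) * p for l, p in zip(self.long, prompt)]

    def short_prompt(self):
        """Softmax-weighted average of recent prompts; higher score, larger weight."""
        top = max(score for _, score in self.short)
        weights = [math.exp(score - top) for _, score in self.short]
        total = sum(weights)
        dim = len(self.long)
        return [
            sum(w * p[i] for w, (p, _) in zip(weights, self.short)) / total
            for i in range(dim)
        ]

memory = LongShortPromptMemory(dim=2)
memory.update([1.0, 0.0], score=0.0)
memory.update([0.0, 1.0], score=math.log(3.0))  # three times the softmax weight
fused = memory.short_prompt()
```

The short prompt reacts quickly to a high-scoring recent sample, while the long prompt moves only a little per update, which mirrors the fast-versus-stable trade-off the abstract describes.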


[201] Can Single-View Mesh Reconstruction Generalize to Robot Camera Rotation? cs.CV | cs.RO

Yu Zhan, Guangcheng Chen, Hanjing Ye, Zhiqin Cheng, Zanjia Tong

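TL;DR: This paper studies whether single-view mesh reconstruction generalizes to robot camera rotation and finds that existing methods perform poorly because they depend on view-dependent priors. The authors propose an evaluation protocol with controlled roll, pitch, and yaw sweeps to trace errors in depth estimation, mesh reconstruction, spatial layout, and physical plausibility. Experiments show that camera rotation causes depth distortion, layout drift, and collision penetration, and that a refinement step with explicit gravity cues markedly reduces layout-orientation error.

Details

Motivation: Robot-mounted cameras rotate naturally during manipulation and navigation, while existing single-view reconstruction models often rely on view-dependent priors and may generalize poorly to out-of-distribution camera rotations, leading to 3D inconsistencies, incorrect layouts, and violations of physical constraints. This failure mode has not been adequately evaluated.

Result: On the Aria Digital Twin dataset and a real Franka wrist-camera sequence, camera rotation induces monocular depth estimation (MDE) distortion, layout drift, and collision penetration, while canonical mesh predictions remain relatively stable. The two-stage SAM3D+FoundationPose pipeline is more robust than one-stage feed-forward layout prediction, and the proposed Gravity-Aware Refinement reduces one-stage pairwise ICP-based layout-orientation error by 47.1%.

Insight: The contribution is a systematic evaluation of how single-view mesh reconstruction fails under robot camera rotation, with an evaluation protocol built on axis-wise rotation sweeps. The key takeaway is that explicit gravity cues matter for reliable robotic single-view reconstruction, which points to a way of making robot spatial reasoning more robust.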

Abstract: Single-view mesh reconstruction predicts object meshes and spatial layouts from a single observation, making it attractive for fast robot spatial reasoning and real-to-sim digital twins. However, robot-mounted cameras naturally rotate during manipulation and navigation, while learned single-view reconstruction models often rely on view-dependent priors and may generalize poorly to out-of-distribution camera rotations. Such rotations can introduce 3D inconsistencies, incorrect layouts, and violations of physical constraints, but this failure mode remains under-evaluated. We introduce an evaluation protocol with controlled axis-wise roll, pitch, and yaw sweeps to trace errors in monocular depth estimation (MDE), canonical object meshes, camera-space layout, and physical plausibility within a representative SAM3D-style pipeline. On the Aria Digital Twin dataset and a real Franka wrist-camera sequence, camera rotations induce MDE distortion, layout drift, and collision penetration, while canonical mesh predictions remain relatively stable. A two-stage SAM3D+FoundationPose pipeline is more robust than one-stage feed-forward layout prediction, and our Gravity-Aware Refinement reduces one-stage pairwise ICP-based layout-orientation error by 47.1%. Our evaluation reveals that current single-view mesh reconstruction methods generalize poorly to robot camera rotation, and suggests that explicit gravity cues are important for reliable robotic single-view mesh reconstruction.
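
An axis-wise sweep of the kind described can be illustrated with plain rotation matrices and a generic orientation error, here the geodesic angle between two rotations. This is not the paper's pairwise ICP-based metric, and which of x, y, z corresponds to roll, pitch, or yaw depends on the camera convention, which the abstract does not state; the sweep angles are arbitrary.

```python
import math

def rotation(axis, degrees):
    """3x3 rotation matrix about the x, y, or z axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    if axis == "x":
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == "y":
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    if axis == "z":
        return [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    raise ValueError(f"unknown axis: {axis}")

def matmul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(A):
    """Transpose of a 3x3 matrix."""
    return [list(row) for row in zip(*A)]

def orientation_error_deg(R_pred, R_true):
    """Geodesic angle between two rotations, in degrees."""
    R_rel = matmul(transpose(R_pred), R_true)
    trace = R_rel[0][0] + R_rel[1][1] + R_rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

# Sweep one axis: error of a predictor that ignores the rotation entirely.
identity = rotation("z", 0.0)
sweep = {deg: orientation_error_deg(identity, rotation("z", deg)) for deg in (0, 15, 30, 45)}
```

A model that is fully robust to camera rotation would keep this error flat across the sweep, so plotting it per axis is a simple way to expose view-dependent priors.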


[202] Black-Box Continual Learning for Vision-Language Models cs.CV

Yuting Li, Weihang Fang, Haoyuan Gao, Linghe Kong, Yexin Li

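TL;DR: This paper introduces Black-CL, a more realistic black-box continual learning benchmark for vision-language models (VLMs) in which the learner has no access to weights or architecture, computation is constrained, and inference is task-agnostic. For this setting the authors propose BETA, a simple yet effective baseline that handles continual learning by optimizing only textual prototypes and performs strongly across multiple datasets and backbones.

Details

Motivation: In practice, cloud-hosted VLMs are usually offered as black boxes that allow neither gradient backpropagation nor architectural changes, which invalidates traditional continual learning methods built on white-box assumptions. A more realistic benchmark and matching methods are needed to handle three challenges: inaccessible weights, constrained computation, and task-agnostic inference.

Result: Extensive experiments on ten diverse datasets and several backbones show that BETA significantly outperforms existing black-box tuners. With only 0.05M trainable parameters (180–3000x fewer than competing methods), BETA matches or even exceeds white-box continual learning methods.

Insight: The core contributions are the Black-CL benchmark and BETA, an efficient baseline based on optimizing textual prototypes. BETA comprises Semantic Projection Accumulation (SPA) for incremental knowledge acquisition, Latent Distribution Replay (LDR) to counter catastrophic forgetting, and Test-Time Prototype Adaptation (TTPA) for dynamic boundary refinement. The key insight is that optimizing textual prototypes alone is enough to handle the complexities of black-box continual learning, which provides a basis for moving continual learning from academia into real systems.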

Abstract: The rapid deployment of Vision-Language Models (VLMs) in dynamic environments necessitates the ability to learn continuously without forgetting. However, traditional continual learning (CL) settings often rely on white-box paradigms, an assumption increasingly invalidated by the shift toward cloud-hosted models. In this paper, we introduce Black-CL, a more realistic benchmark for VLMs that enforces three primary real-world challenges: weight and architecture inaccessibility, constrained computation, and task-agnostic inference. The learner can query only output embeddings or logits, with no gradient flow through or structural modification of the backbone. Current CL methodologies, which rely on backbone backpropagation or complex parameter expansion, are fundamentally incompatible with these constraints. Under this setting, we propose BETA, a simple yet effective baseline built on the key insight that solely optimizing textual prototypes can navigate the complexities of CL. BETA integrates three core components: Semantic Projection Accumulation (SPA) for incremental knowledge acquisition, Latent Distribution Replay (LDR) for anchoring the embedding space against catastrophic forgetting, and Test-Time Prototype Adaptation (TTPA) for dynamic, instance-aware boundary refinement. Extensive experiments across ten diverse datasets and various backbones demonstrate that BETA significantly outperforms existing black-box tuners. Remarkably, with only 0.05M trainable parameters, a 180–3000× reduction compared to competitive methods, BETA achieves performance on par with or even exceeding white-box CL methods. We believe Black-CL and BETA provide a foundational framework for future advancements in continual learning and accelerate its transition from academia to real-world systems.


[203] MotionMAR: Multi-scale Auto-Regressive Human Motion Reconstruction from Sparse Observations cs.CVPDF

Yuhua Luo, Junsheng Zhang, Mengyin Liu, Xincheng Lin, Ming Yan

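TL;DR: MotionMAR is a multi-scale autoregressive framework for reconstructing human motion from sparse observations. It works coarse to fine, first estimating the global trajectory and then progressively refining temporal details. It has four components: the TMT VQ-VAE for multi-scale temporal encoding, MAN for cross-scale motion prediction in the latent space, the SAC module for integrating sparse tracking data, and MRN for smoothing poses and removing quantization artifacts.

Details

Motivation: Human motion has a temporal hierarchy that runs from low-frequency global trajectories to high-frequency details. Inspired by multi-level autoregressive models in computer vision, the paper aims to reconstruct complete motion from sparse observations.

Result: The method achieves state-of-the-art accuracy on the AMASS dataset, delivering reliable, structure-aware motion reconstruction.

Insight: The novelty lies in combining multi-scale autoregressive modeling with sparse control: a hierarchical latent space separates motion semantics from jitter, and scale-aware control keeps the generated output aligned with observations. Viewed objectively, the core idea borrows multi-level generation from vision to build a coarse-to-fine prediction pipeline suited to motion data.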

Abstract: Human motion follows a temporal hierarchical structure, transitioning from low-frequency global trajectories to high-frequency details. Inspired by the success of multi-level autoregressive models in computer vision, we propose MotionMAR, a coarse-to-fine framework for motion reconstruction from sparse observations. It first estimates the global trajectory of human motion and then gradually refines the temporal details. This architecture consists of four integrated components. The Temporal Multi-scale Tokenization (TMT) VQ-VAE encodes the data at multiple temporal resolutions, separating semantic motion from minor jitters. The Motion Autoregressive Network (MAN) operates in this latent space, predicting motion across scales. It first establishes the global structure through coarse indices and then generates finer indices to recover specific details. Meanwhile, the Scale-Aware Control (SAC) module integrates sparse tracking data to ensure the generated output aligns with actual observations. The Motion Refinement Network (MRN) subsequently smooths consecutive poses and eliminates quantization artifacts. Experiments show that MotionMAR achieves state-of-the-art accuracy on the AMASS dataset, providing a reliable and structure-aware approach for motion reconstruction. The source code is publicly available at http://www.lidarhumanmotion.net/motionmar/.
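The coarse-to-fine tokenization idea can be illustrated with a toy residual quantizer on a 1D signal. This is only a sketch under loud assumptions: a uniform scalar quantizer stands in for the learned VQ-VAE codebook, the function names and the `factors` and `step` parameters are hypothetical, and the signal length is assumed to be divisible by every pooling factor. It is not the TMT VQ-VAE itself.

```python
def avg_pool(seq, factor):
    """Average consecutive groups of `factor` samples (coarser temporal scale)."""
    return [sum(seq[i:i + factor]) / factor for i in range(0, len(seq), factor)]


def upsample(seq, factor):
    """Repeat every sample `factor` times to return to full resolution."""
    return [v for v in seq for _ in range(factor)]


def encode_multiscale(signal, factors=(4, 2, 1), step=0.5):
    """Quantize the residual at each scale, coarsest first.

    Assumption: rounding to a grid of width `step` replaces a learned codebook.
    """
    residual = list(signal)
    tokens = []
    for f in factors:
        idx = [round(v / step) for v in avg_pool(residual, f)]
        recon = upsample([i * step for i in idx], f)
        residual = [r - q for r, q in zip(residual, recon)]
        tokens.append(idx)
    return tokens


def decode_multiscale(tokens, factors=(4, 2, 1), step=0.5):
    """Sum the upsampled reconstructions of all scales."""
    n = len(tokens[-1]) * factors[-1]
    out = [0.0] * n
    for idx, f in zip(tokens, factors):
        for i, v in enumerate(upsample([k * step for k in idx], f)):
            out[i] += v
    return out


signal = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 4.0, 4.0]
tokens = encode_multiscale(signal)
reconstruction = decode_multiscale(tokens)
```

In this toy setting the coarsest tokens carry the overall level of each half of the sequence, and the finer tokens only encode what the coarser scales missed, which mirrors the "global structure first, details later" ordering described in the abstract.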


[204] ScalingAttention: Discovering Intrinsic Sparse Attention Topology for Video Diffusion Transformers cs.CV | cs.AIPDF

Ruiliang Zhou, Xuecheng Wu, Kang He, Guangyun Han, Bin Liu

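TL;DR: This paper proposes ScalingAttention, a training-free acceleration framework for video diffusion Transformers. Its core idea is to discover and exploit the intrinsic sparse topology of attention heads: WEST extracts a weight-encoded sparse mask offline, FAST adapts sparsity to fidelity requirements, and a hardware-aligned block-sparse kernel turns this into a substantial reduction in the cost of 3D full attention.

Details

Motivation: Video Diffusion Transformers (DiTs) rely on 3D full attention, which creates a quadratic computational bottleneck. Existing sparse methods face a dilemma: dynamic pruning incurs prohibitive runtime overhead and severe memory fragmentation, while static heuristics fail to capture fine-grained dependencies.

Result: Experiments on Wan2.1 show up to 1.90× end-to-end speedup while maintaining superior generation fidelity, surpassing state-of-the-art baselines and establishing a new Pareto frontier.

Insight: The novelty is a key inductive bias: although individual activations are input-dependent, the high-mass attention regions of each head rapidly converge to a stable, prompt-agnostic intrinsic sparse topology. This topology is weight-encoded and scale-invariant and can be extracted efficiently offline, which decouples topology discovery from sparsity control and enables training-free acceleration.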

Abstract: While Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, their reliance on 3D full attention creates a quadratic computational bottleneck. Existing sparse methods face a dilemma: dynamic pruning suffers from prohibitive runtime overhead and memory fragmentation, while static heuristics fail to capture fine-grained dependencies. In this work, we propose ScalingAttention, a training-free framework grounded in a key inductive bias: while individual activations are input-dependent, the high-mass attention regions for each head rapidly converge to a stable, prompt-agnostic Intrinsic Sparse Topology. This topology is weight-encoded, scale-invariant, and efficient to extract. ScalingAttention decouples topology discovery from sparsity control via: (1) WEST (Weight-Encoded Sparse Topology), which extracts a robust block-sparse prior mask offline to eliminate runtime search; (2) FAST (Fidelity-Aware Sensitivity Tuning), which adaptively tunes head-wise sparsity based on diffusion fidelity requirements. To ensure practical acceleration, we co-design a hardware-aligned bit-wise block-sparse kernel. Experiments on Wan2.1 show up to 1.90X end-to-end speedup with superior fidelity, establishing a new Pareto frontier over state-of-the-art baselines.
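To make the notion of a block-sparse mask over high-mass attention regions concrete, here is a minimal sketch. It assumes an averaged attention map for one head is already available and keeps the highest-mass tiles until a target share of the total mass is covered. The function names and the `keep_fraction` parameter are hypothetical, and this is not the WEST procedure, which the abstract describes as extracting the prior from weights offline.

```python
def block_mass(attn, block):
    """Sum attention probability inside each (block x block) tile."""
    nb = len(attn) // block
    mass = [[0.0] * nb for _ in range(nb)]
    for i in range(nb * block):
        for j in range(nb * block):
            mass[i // block][j // block] += attn[i][j]
    return mass


def top_mass_block_mask(attn, block, keep_fraction):
    """Mark the highest-mass tiles until keep_fraction of total mass is covered."""
    mass = block_mass(attn, block)
    nb = len(mass)
    tiles = sorted(
        ((mass[r][c], r, c) for r in range(nb) for c in range(nb)),
        reverse=True,
    )
    total = sum(m for m, _, _ in tiles)
    mask = [[False] * nb for _ in range(nb)]
    covered = 0.0
    for m, r, c in tiles:
        if covered >= keep_fraction * total:
            break
        mask[r][c] = True
        covered += m
    return mask


# Toy 4x4 attention map whose mass concentrates in the two diagonal 2x2 tiles.
attn = [
    [0.375, 0.375, 0.125, 0.125],
    [0.375, 0.375, 0.125, 0.125],
    [0.125, 0.125, 0.375, 0.375],
    [0.125, 0.125, 0.375, 0.375],
]
mask = top_mass_block_mask(attn, block=2, keep_fraction=0.7)
```

A block-sparse kernel would then skip every tile whose mask entry is `False`; a per-head `keep_fraction` is one simple way to picture head-wise sparsity tuning.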


[205] Boosting Neural Video Codec via Scale-Driven Online Flow Refinement cs.CVPDF

Tiange Zhang, Rongqun Lin, Haocheng Tang, Xiandong Meng, Weijia Jiang

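TL;DR: This paper proposes a training-free, plug-and-play scale-driven online flow refinement method to address the limited generalization of neural video codecs on complex motion patterns unseen during training. It fuses coarse- and fine-scale motion information and adjusts the fusion dynamically according to warping accuracy, correcting motion estimation errors at negligible computational cost.

Details

Motivation: State-of-the-art neural video codecs degrade on complex motion patterns absent from their training data because of limited generalization. The paper aims to bridge this domain gap with an online refinement module that works on a trained codec as-is, avoiding the expense of online fine-tuning.

Result: Extensive experiments on the USTC-TD dataset verify the method's effectiveness and generalization across several neural video codec frameworks, including DCVC-SDD, DCVC-FM, and EHVC. On DCVC-FM in particular, it saves an average of 2.84% and 4.05% bitrate in terms of PSNR and MS-SSIM, respectively, with a negligible increase in coding time.

Insight: The core novelty is a training-free, scale-based online flow refinement module that corrects errors by dynamically fusing multi-scale motion information. The paper also designs a rate-aware strategy that selects different fusion strategies by bitrate mode, plus a warping-error-based reliability check for robustness, making this an efficient and general way to improve performance.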

Abstract: Although state-of-the-art neural video codecs (NVCs) have achieved remarkable performance, they suffer from limited generalization when encountering complex motion patterns unseen during training. To bridge this domain gap without the expensive cost of online fine-tuning, we propose a Training-Free Scale-Driven Online Flow Refinement (SOFR) method. Serving as a plug-and-play module, SOFR integrates motion information from coarse and fine scales and dynamically fuses them according to warping accuracy, effectively rectifying motion estimation errors with negligible computational overhead. Furthermore, we design a rate-aware strategy that selects different dynamic fusion strategies according to bitrate modes, and employs a reliability check based on warping error to ensure robustness. Extensive experiments on the USTC-TD dataset verify the effectiveness and generalization of SOFR across various NVC frameworks, including DCVC-SDD, DCVC-FM, and EHVC. Notably, it brings an average of 2.84% and 4.05% bitrate savings in terms of PSNR and MS-SSIM, respectively, to DCVC-FM with negligible coding time increase. Our code is available at https://github.com/SunnyMass/SOFR.
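Fusing two candidate flows "according to warping accuracy" can be pictured with a toy 1D example. The sketch below assumes integer displacements and a hard per-position choice between a coarse and a fine candidate. The abstract does not spell out SOFR's fusion rule, so the rule and the names `warp_1d` and `fuse_flows` are illustrative assumptions only.

```python
def warp_1d(ref, flow):
    """Backward-warp: sample ref at x + flow[x], clamped to the valid range."""
    n = len(ref)
    return [ref[min(max(x + flow[x], 0), n - 1)] for x in range(n)]


def fuse_flows(cur, ref, coarse, fine):
    """Per position, keep the candidate flow with the lower warping error."""
    warped_coarse = warp_1d(ref, coarse)
    warped_fine = warp_1d(ref, fine)
    fused = []
    for x in range(len(cur)):
        err_coarse = abs(cur[x] - warped_coarse[x])
        err_fine = abs(cur[x] - warped_fine[x])
        fused.append(fine[x] if err_fine <= err_coarse else coarse[x])
    return fused


ref = [0, 10, 20, 30, 40]
cur = [10, 20, 30, 40, 40]   # ref shifted by one sample
coarse = [1, 1, 0, 0, 0]     # correct on the left, wrong on the right
fine = [0, 0, 1, 1, 1]       # wrong on the left, correct on the right
fused = fuse_flows(cur, ref, coarse, fine)
```

Because the choice uses only the reference frame and the two flow candidates, no network weights are updated, which is the sense in which such a refinement step is training-free.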


[206] Physics-Guided Spatiotemporal State Space Modeling for Lookahead Molten Pool Segmentation in Laser Wire-Feed Welding cs.CV | cs.AIPDF

Sen Li, Haichao Cui, Changhao Yin, Chendong Shao, Yaqi Wang

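TL;DR: This paper proposes a physics-guided spatiotemporal state space network (WeldMamba) for lookahead molten pool segmentation in laser wire-feed welding. The model uses historical coaxial grayscale images, welding process parameters, and aligned wire-state electrical signals to predict the future semantic layout of three physically meaningful regions: keyhole, wire, and molten pool.

Details

Motivation: The paper addresses the unavoidable delay introduced by sensing, computation, and actuator response in laser wire-feed welding, where real-time weld-pool perception is critical for closed-loop control.

Result: Experiments on a 43-sequence laser welding dataset show that WeldMamba reaches 74.63% mIoU at a 500 ms lookahead. Ablation studies further show that temporal history, patch-level state space modeling, and keyhole motion awareness are the main contributors to robust future segmentation.

Insight: The novelty lies in combining physical priors (process parameters, electrical signals) with deep learning (spatiotemporal state space modeling), and in adding several auxiliary losses (signed-distance-function supervision, temporal consistency, feature distillation, and a fine-grained keyhole loss) that constrain predicted geometry and local motion, yielding accurate prediction of the future welding state.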

Abstract: Real-time weld-pool perception is critical for closed-loop control in laser wire-feed welding, where sensing, computation, and actuator response introduce unavoidable delay. This paper presents a physics-guided spatiotemporal state space network for lookahead weld-pool segmentation. The model uses historical coaxial grayscale images, welding process parameters, and aligned wire-state electrical signals to predict the future semantic layout of three physically meaningful regions: keyhole, wire, and molten pool. It combines a visual encoder, process- and sensor-conditioned feature normalization, patch-level temporal state space modeling, horizon-conditioned latent prediction, dense future feature prediction, and a motion-aware mask decoder. Auxiliary signed-distance-function supervision, temporal consistency, feature distillation, and fine-grained keyhole losses further constrain the predicted geometry and local motion. Experiments on a 43-sequence laser welding dataset show that the proposed WeldMamba reaches 74.63% mIoU at a 500 ms lookahead. Ablation studies further show that temporal history, patch-level state space modeling, and keyhole motion awareness are the main contributors to robust future segmentation.


[207] SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models cs.CVPDF

Hongxiang Li, Hongxu Chen, Chenyang Zhu, Xiaoshuang Huang, Jiayin Cai

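TL;DR: This paper proposes SPAR, a unified multimodal framework that targets the limitations of multimodal large language models (MLLMs) in visual generation. Through semantic-pixel self-alignment and adaptive routing, it bridges the feature discrepancy between semantic perception and pixel-level reconstruction, achieving strong generation and reconstruction quality while preserving foundational visual understanding.

Details

Motivation: MLLMs have achieved remarkable success in visual understanding but remain constrained in visual generation because of a fundamental feature discrepancy between semantic perception and pixel-level reconstruction. The paper aims to overcome this core challenge by giving semantic encoders high-fidelity reconstruction capability and by aligning generative models with the semantic space without external teacher models.

Result: Extensive experiments show that SPAR reaches the state of the art among unified architectures, achieving excellent generation and reconstruction quality while preserving foundational visual understanding.

Insight: The main contributions are: 1) an asymmetric dual-stream unified tokenizer in which a lightweight semantic stream anchors discriminative features and a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space, reconciling semantic perception with pixel-level reconstruction; 2) a self-aligned generation paradigm that uses the optimized tokenizer as an internal alignment teacher for the diffusion model, removing external dependencies; 3) a dynamic token routing mechanism that lets each token adaptively aggregate multi-layer MLLM features according to its own semantic demands, enabling flexible multimodal interaction within the unified space.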

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: endowing semantic encoders with high-fidelity reconstruction capabilities, and effectively aligning generative models with semantic spaces without relying on external teachers. To this end, we propose a novel unified multimodal framework featuring Semantic-Pixel self-alignment and Adaptive Routing (SPAR). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, to facilitate flexible multimodal interaction within this unified space, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.


[208] UECP: Uncertainty-Enhanced Collaborative Perception cs.CVPDF

Kang Yang, Tianci Bu, Peng Wang, Deying Li, Wen Jie

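TL;DR: This paper proposes an uncertainty-enhanced collaborative perception framework (UECP). It introduces an uncertainty map supervised directly by sensor signals (such as LiDAR point density) to quantify each agent's perception quality, and builds the Uncertainty-Aware Pyramid Fusion (UAPF) module on top of it for more robust feature fusion. Experiments on real-world datasets show that the method surpasses existing state-of-the-art approaches in effectiveness and robustness.

Details

Motivation: Existing collaborative perception methods usually weight agent contributions with a confidence map co-trained with the detection head, but that map is inherently correlated with the detection results and cannot provide unbiased physical evidence. How to integrate evidence deeply into the fusion process also remains an open question.

Result: Extensive experiments on real-world datasets show that, by embedding the uncertainty map into fusion, UECP outperforms state-of-the-art (SOTA) methods in both effectiveness and robustness.

Insight: The novelty is an uncertainty map that is physically grounded and decoupled from detection noise as a measure of perception quality, together with a coarse-to-fine fusion strategy comprising uncertainty-weighted downsampling and uncertainty-guided residual fusion, which gives collaborative perception a more reliable mechanism for evidence weighting and deep fusion.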

Abstract: Collaborative perception serves as a pivotal solution to enhance the perception capability of individual agents in autonomous driving, where a core challenge lies in seeking reliable evidence to quantify and weight the contribution of each participating agent. Existing methods typically rely on a confidence map, which is co-trained with the detection head, but it is inherently correlated with the detection results and thus fails to provide unbiased physical evidence. Furthermore, how to deeply integrate evidence into the cooperative fusion process remains an open question. To address these issues, this paper first proposes an uncertainty map, a physically grounded and unambiguous metric for evaluating perception quality. This map is directly supervised by real-time sensor signals, i.e., LiDAR point density, ensuring decoupling from detection noise and thereby providing physical scenario-aware evidence for weighting agent contribution. Based on this map, we develop the Uncertainty-Enhanced Collaborative Perception (UECP) framework, centered on the Uncertainty-Aware Pyramid Fusion (UAPF) module. UAPF uses a coarse-to-fine strategy, with two key components: Uncertainty-Weighted Downsampling (UWD) for high-fidelity feature preservation, and Uncertainty-Guided Residual Fusion (UGRF) to reinforce ego features, suppressing noise and ensuring robust fusion. Extensive experiments on real-world datasets show UECP outperforms state-of-the-art methods in effectiveness and robustness by embedding the uncertainty map into fusion. Code will be publicly available.
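Weighting each agent's contribution by a per-cell uncertainty can be sketched as a confidence-weighted average. The example assumes uncertainties in [0, 1] and uses `1 - uncertainty` as the weight. That weighting rule and the name `fuse_cell` are hypothetical simplifications, not the UAPF module described in the abstract.

```python
def fuse_cell(features, uncertainties):
    """Fuse one spatial cell across agents, weighting by (1 - uncertainty).

    features[a] is agent a's feature vector for this cell;
    uncertainties[a] is that agent's uncertainty in [0, 1] for the same cell.
    """
    confidences = [1.0 - u for u in uncertainties]
    total = sum(confidences)
    dim = len(features[0])
    if total == 0.0:
        # Every agent is maximally uncertain: fall back to a plain average.
        return [sum(f[k] for f in features) / len(features) for k in range(dim)]
    return [
        sum(c * f[k] for c, f in zip(confidences, features)) / total
        for k in range(dim)
    ]


# Two agents observing the same cell; the first one is more certain.
fused = fuse_cell(features=[[1.0, 0.0], [0.0, 1.0]], uncertainties=[0.25, 0.75])
```

The point of supervising the uncertainty with a sensor signal such as LiDAR point density, as the abstract describes, is that these weights then do not depend on the detector's own output.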


[209] Three-Step Hierarchical Transformer for Multi-Pedestrian Trajectory Prediction cs.CVPDF

Raphaël Delécluse, Hazem Wannous, Laurent Grisoni, Laurent Guimas

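TL;DR: This paper proposes a three-step hierarchical Transformer for multi-pedestrian trajectory prediction. The model explicitly separates temporal encoding, multimodal fusion, and scene-level interaction reasoning, uses lightweight GRU summaries for efficient cross-modal attention, and captures social interactions among pedestrians at manageable cost.

Details

Motivation: Existing methods tend to treat temporal dynamics, multimodal cues, and social interactions separately or to entangle them in costly attention blocks, which limits scalability, flexibility, and interpretability. The paper addresses these issues with a hierarchical structure.

Result: Experiments on JTA, JRDB, and the Pedestrians and Cyclists in Road Traffic dataset show state-of-the-art performance on the real-world datasets (JRDB, Urban) and competitive results on JTA. Ablation and qualitative analyses confirm the contribution of each stage.

Insight: The novelty is decomposing trajectory prediction into three explicit, interpretable steps and using lightweight GRU summaries to cut the cost of cross-modal attention. The method anticipates complex behaviors such as early turning and improves both efficiency and interpretability.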

Abstract: Pedestrian trajectory prediction requires modeling temporal dynamics, multimodal cues, and social interactions in crowded environments. Existing methods often address these factors separately or entangle them in costly attention blocks, limiting scalability, flexibility, and interpretability. We propose a three-step hierarchical Transformer that explicitly separates temporal encoding, multimodal fusion, and scene-level interaction reasoning. Lightweight GRU summaries enable efficient cross-modal attention, while social attention over time–agent tokens captures inter-pedestrian influences at manageable cost. Experiments on JTA, JRDB, and the Pedestrians and Cyclists in Road Traffic dataset show state-of-the-art performance on real-world datasets (JRDB, Urban) and competitive results on JTA. Ablation and qualitative analyses confirm the contribution of each stage and the model’s ability to anticipate complex behaviors such as early turning.


[210] MotionHalluc: Diagnosing Kinematic Hallucinations in Fine-Grained Motion Reasoning cs.CV | cs.AIPDF

Weile Guo, Shenghong He, Danying Mo, Chengdong Xu, Xuexun Liu

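TL;DR: This paper introduces MotionHalluc, a benchmark for systematically evaluating motion hallucinations produced by motion instruction generation models in cross-video comparison. It contains 1540 fine-grained questions covering three core dimensions: directional, attributional, and temporal. The authors also propose a training-free Perceive-Parse-Verify (PPV) baseline that markedly reduces hallucinations by injecting explicit motion measurements.

Details

Motivation: Existing motion instruction generation models for cross-video comparison often hallucinate motion, producing instructions that do not reflect the true kinematic differences between the paired videos. The paper sets out to diagnose and quantify this problem systematically.

Result: Evaluating several state-of-the-art large models on MotionHalluc shows that they are broadly susceptible to motion hallucinations. The proposed PPV baseline, which injects motion measurements, yields an average 10.6% performance gain across models.

Insight: The novelty is the first fine-grained benchmark dedicated to motion hallucination, along with the finding that explicit quantitative motion measurements are a key factor in reducing hallucinations in cross-video comparison. PPV offers a training-free, interpretable framework for measurement extraction and verification.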

Abstract: Motion instruction generation in cross-video comparison aims to produce corrective feedback that describes the differences between a query and a reference motion. However, existing models often generate instructions that exhibit motion hallucinations, failing to reflect actual kinematic differences between paired videos. To systematically investigate these hallucinations, we introduce MotionHalluc, a dedicated benchmark for evaluating motion hallucinations in paired-video comparison. MotionHalluc comprises 1540 fine-grained questions over 553 video pairs, evaluating hallucinations along three core dimensions: (1) directional hallucination, (2) attributional hallucination, and (3) temporal hallucination. Extensive evaluations of state-of-the-art large multimodal models demonstrate high susceptibility to these hallucinations. Furthermore, we provide Perceive-Parse-Verify (PPV) as a training-free measurement extraction and verification baseline that converts candidate instructions into executable measurement queries and supplies kinematic measurements at inference time. Our results show that this simple measurement injection yields an average 10.6% performance gain across models, suggesting that motion reasoning with explicit quantitative measurements is a key factor in reducing hallucinations in cross-video comparison. Our code and dataset will be made publicly available upon acceptance.


[211] Compression and Retrieval: Implicit Memory Retrieval for Video World Models cs.CVPDF

Zhan Peng, Jie Ma, Huiqiang Sun, Chong Gao, Zhijie Xue

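TL;DR: This paper proposes Compression and Retrieval (CaR), an attention-driven implicit memory retrieval mechanism that addresses the challenge of keeping long-term memory consistent in video world models under complex camera trajectories. It injects viewpoint information via positional encoding, performs flexible memory retrieval through attention computation, and adds a lightweight context compression network to handle long sequences efficiently. The paper also builds SceneFly, a large-scale synthetic dataset for training and evaluation. Experiments show state-of-the-art results on several benchmarks and strong generalization to open-domain scenes.

Details

Motivation: Existing video world model methods typically rely on computationally expensive context scaling or rigid heuristic retrieval mechanisms, which generalize poorly to varying camera trajectories and environments. A more flexible and efficient long-term memory retrieval scheme is needed.

Result: The method achieves state-of-the-art (SOTA) results on several benchmarks and shows strong generalization to open-domain scenes on the newly built SceneFly dataset.

Insight: The novelty is implicit memory retrieval through attention combined with positional encoding, plus a lightweight compression network that lowers the cost of long-context processing. Viewed objectively, the large-scale synthetic dataset also provides an important benchmark for research on long-horizon video world models.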

Abstract: Video world models hold promise for simulating interactive environments, yet maintaining consistent long-term memory across complex camera trajectories remains a critical challenge. Existing methods typically rely on computationally expensive context scaling or rigid heuristic retrieval mechanisms, which generalize poorly to varying camera trajectories and environments. In this paper, we propose Compression and Retrieval (CaR), an attention-driven implicit memory retrieval mechanism to overcome these limitations. By injecting viewpoint information via positional encoding, our method performs flexible memory retrieval through attention computation. To efficiently process extended contexts with minimal computational overhead, we further introduce a lightweight context compression network. Furthermore, we construct SceneFly, a large-scale synthetic dataset featuring realistic camera trajectories and frame-level annotations to train and evaluate long-horizon video world models. Extensive experiments demonstrate that our approach achieves state-of-the-art results on established benchmarks and exhibits strong generalization to open-domain scenes.


[212] Technical Report for the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Pretraining-Diverse Ensemble of Foundation Vision Encoders for Robust Outdoor Scene Understanding cs.CVPDF

Boyan Wang, Yongxi Huang, Wenjing Li, Tianrui Hui, Shaofei Huang

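TL;DR: This report presents a solution for the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge. It ensembles foundation vision encoders with complementary pretraining objectives (DINOv3, SigLIP2, InternImage) paired with a Mask2Former decoder, and trains them with a strong recipe that includes long training schedules, exponential moving average, a large crop size, and multi-scale plus flip test-time augmentation. The submission reaches 75.40% composite mIoU on the official test set and takes second place in the challenge.

Details

Motivation: The challenge requires parsing unstructured outdoor scenes from four camera platforms into 56 fine-grained categories, with the goal of robust outdoor scene understanding.

Result: On the official GOOSE test set, the submission achieves 75.40% composite mIoU and takes second place in the challenge.

Insight: The novelty is a "pretraining-diverse ensemble" that combines encoders with different pretraining objectives through weighting based on per-class validation IoU. The study further shows that, for accuracy on this benchmark, the encoder's pretraining recipe is the dominant factor rather than its parameter count or the decoder design.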

Abstract: This report presents our solution for the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which requires parsing unstructured outdoor scenes from four camera platforms into 56 fine-grained categories. Our approach pairs foundation vision encoders (including DINOv3, SigLIP2, and InternImage) with a Mask2Former decoder, and trains them with a strong recipe including long training schedules, exponential moving average, a larger crop size, and multi-scale plus flip test-time augmentation. The three encoders, chosen for their complementary pretraining objectives, are combined into a pretraining-diverse ensemble through per-class validation-IoU weighting. Evaluated on the official GOOSE test set, our submission achieves 75.40% composite mIoU and takes second place in the challenge. Our study further shows that the encoder’s pretraining recipe, rather than its parameter count or the decoder design, is the dominant factor for accuracy on this benchmark.
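One plausible reading of "per-class validation-IoU weighting" is that each model's probability for a class is weighted by how well that model segmented that class on the validation set. The sketch below follows that reading for a single pixel. The normalization and the name `ensemble_pixel` are assumptions, since the abstract does not give the exact formula.

```python
def ensemble_pixel(probs_per_model, val_iou_per_model):
    """Combine per-model class probabilities for one pixel.

    probs_per_model[m][c]: model m's probability of class c at this pixel.
    val_iou_per_model[m][c]: model m's validation IoU on class c (assumed given).
    Returns the winning class index and the per-class fused scores.
    """
    n_classes = len(probs_per_model[0])
    scores = []
    for c in range(n_classes):
        weight_sum = sum(iou[c] for iou in val_iou_per_model)
        if weight_sum == 0.0:
            scores.append(0.0)
            continue
        weighted = sum(
            iou[c] * probs[c]
            for probs, iou in zip(probs_per_model, val_iou_per_model)
        )
        scores.append(weighted / weight_sum)
    label = max(range(n_classes), key=scores.__getitem__)
    return label, scores


# Two hypothetical models and two classes.
label, scores = ensemble_pixel(
    probs_per_model=[[0.6, 0.4], [0.3, 0.7]],
    val_iou_per_model=[[0.2, 0.9], [0.8, 0.5]],
)
```

Here the first model's vote for class 0 counts for little because its validation IoU on that class is low, so class 1 wins even though a plain average of the probabilities would be closer.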


[213] Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs cs.CV | cs.AIPDF

Chuangxin Zhao, Canran Xiao, Siyuan Ma, Mengyao Lyu, Yanbiao Ma

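TL;DR: This paper proposes Attention-Spectrum Regularization (ASR), a replay-free continual learning framework that tackles catastrophic forgetting in multimodal large language models during continual fine-tuning. It treats cross-modal attention maps as two-dimensional signals, extracts and stores compact spectral statistics as skill-wise prototype distributions, and in later stages applies a phase-invariant spectral regularizer that constrains drift of these prototypes, protecting the cross-modal attention structure that supports old skills.

Details

Motivation: Multimodal large language models must adapt to changing visual domains, question types, and user instructions, but continual fine-tuning often causes severe forgetting of previously acquired multimodal skills. Existing methods control forgetting mainly by preserving outputs, replaying data, regularizing embedding geometry, or allocating task-specific parameters, and offer little control over how the internal cross-modal attention patterns supporting old skills drift during adaptation.

Result: Experiments on continual VQA and multimodal instruction-tuning benchmarks (VQA v2, VQACL, CLT-VQA, CoIN, and UCIT) show that ASR consistently outperforms strong replay-, regularization-, and adapter-based baselines in final performance and in reducing forgetting.

Insight: The core novelty is spectral analysis of cross-modal attention maps combined with a constraint on skill-level prototype distributions of spectral statistics, a lightweight mechanism for protecting the attention structure of old skills. Viewed objectively, the method offers a new, replay-free perspective on regularizing internal representations. Its theoretical analysis links spectral drift to forgetting and exploits the stability of Fourier power spectra under spatial translations and bounded perturbations.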

Abstract: Multimodal large language models (MLLMs) are increasingly required to adapt to non-stationary streams of visual domains, question types, and user instructions, yet continual fine-tuning often causes severe forgetting of previously acquired multimodal skills. Existing continual vision-language methods mainly preserve outputs, replay data or pseudo-data, regularize embedding geometry, or allocate task-specific parameters, but they provide limited control over how internal cross-modal attention patterns supporting old skills drift during adaptation. We propose Attention-Spectrum Regularization (ASR), a replay-free continual learning framework that preserves skill-conditioned structures of cross-modal attention. ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions instead of replaying past image-question pairs, generated pseudo-examples, or old-stage teacher snapshots. In later stages, a phase-invariant spectral regularizer constrains harmful drift of these prototypes while allowing instance-level attention to adapt to new tasks. We provide theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, and that Fourier power spectra are stable to spatial translations and bounded perturbations. Experiments on continual VQA and multimodal instruction-tuning benchmarks, including VQA v2, VQACL, CLT-VQA, CoIN, and UCIT, show that ASR consistently improves final performance and reduces forgetting over strong replay-, regularization-, and adapter-based baselines. Preserving skill-level attention structure is an effective and lightweight mechanism for continual MLLMs. Code is available at https://github.com/Creative-zcx/attention-spectrum-replay
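The stability of Fourier power spectra under translation can be checked numerically on a small map. The sketch below computes a 2D discrete power spectrum with a naive DFT and compares it before and after a circular shift. The circular (wrap-around) shift is an assumption that makes the invariance exact, and the helper names are illustrative rather than part of ASR.

```python
import cmath


def power_spectrum(grid):
    """Naive 2D DFT power spectrum |F(u, v)|^2 of a small real-valued map."""
    h, w = len(grid), len(grid[0])
    spectrum = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * cmath.pi * (u * y / h + v * x / w)
                    acc += grid[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = abs(acc) ** 2
    return spectrum


def circular_shift(grid, dy, dx):
    """Translate the map with wrap-around at the borders."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


attention_map = [
    [0.10, 0.40, 0.05, 0.00],
    [0.00, 0.20, 0.10, 0.05],
    [0.05, 0.00, 0.03, 0.02],
]
original = power_spectrum(attention_map)
shifted = power_spectrum(circular_shift(attention_map, dy=1, dx=2))
max_gap = max(
    abs(a - b) for row_a, row_b in zip(original, shifted) for a, b in zip(row_a, row_b)
)
```

A shift only multiplies each Fourier coefficient by a unit-magnitude phase factor, so the squared magnitudes agree up to floating-point error. This is why a regularizer built on power spectra can be described as phase-invariant.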


[214] Expert Consensus on Criteria for the Automated Assessment of Laparoscopic Camera Navigation cs.CVPDF

Amir Ebrahimzadeh, Nazila Esmaeili, Michael Ghadimi, Jannis Hagenah

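TL;DR: This study aims to establish criteria for the automated assessment of laparoscopic camera navigation (LCN) skills. It develops a detailed taxonomy of 14 key aspects and combines surgeons' ratings of clinical importance with an analysis of computer vision technological readiness to identify high-priority targets for automated assessment, providing a clear roadmap for AI-driven skill assessment tools.

Details

Motivation: Assessment of LCN skills currently relies mainly on manual rating systems that are time-consuming and hard to scale. Automated feedback could provide immediate, standardized metrics and significantly improve surgical training.

Result: A survey of 23 surgeons found that foundational aspects such as field of view, focus, and centering were rated most important. The study presents a "Clinical Importance vs. CV Technological Readiness" matrix that identifies high-priority development targets that are both clinically crucial and technologically ready to measure.

Insight: The novelty is the systematic alignment of clinical needs (surgeons' priorities) with computer vision capabilities, which gives automated skill assessment a structured framework and a clear development roadmap. This helps guide the development of AI-assisted tools that can accelerate the learning curve and improve surgical safety and efficiency.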

Abstract: Background: Laparoscopic camera navigation (LCN) is a critical skill, yet its current assessment typically relies on manual rating systems which are time-consuming and difficult to scale. Automated feedback could significantly enhance surgical training by providing immediate, standardized metrics. This study aims to define a set of approaches for LCN assessment, evaluate their clinical relevance, and establish their technical readiness. Methods: We developed a detailed taxonomy of 14 key aspects of camera navigation, categorized into Framing & Composition, Visibility & Clarity, Orientation & Stability, Motion & Dynamics, and Safety & Awareness. For each aspect, we assessed the technological readiness of automated measurement based on the current state of the art (SoTA) in computer vision (CV). To establish clinical relevance, we designed a survey for practicing laparoscopic surgeons to rate the importance of each aspect on a 5-point Likert scale and to select the five most critical skills. Results: 23 surgeons participated in the survey. Foundational aspects like Field of View, Focus, and Centering were rated as most important by surgeons. We present a “Clinical Importance vs. CV Technological Readiness” matrix, identifying high-priority targets for development: aspects that are both clinically crucial and technologically ready to measure. Conclusion: This work establishes a foundational framework for quantifying LCN skills. By aligning surgeon priorities with CV capabilities, we provide a clear roadmap for automatic skill assessment. This foundation enables the development of AI-driven assistance tools that can accelerate the learning curve for surgical assistants and potentially improve surgical safety and efficiency.


[215] LUMINA-26: Low-Light Understanding for Modeling and Interpreting Night-time Actions cs.CVPDF

Aman Kumar Pandey, Anil Singh Parihar

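TL;DR: This paper introduces the LUMINA-26 dataset and the Illumi-Net model to address human action recognition under low-light conditions. LUMINA-26 is a new dataset with 26 action classes captured in real low-light scenes. Illumi-Net is a mixture-of-experts network that uses illumination cues for adaptive enhancement and spatio-temporal feature extraction, and it achieves state-of-the-art performance on the ELLAR and LUMINA-26 benchmarks.

Details

Motivation: Existing low-light action recognition datasets fall short in action diversity, realism, and class balance, which limits the development of robust models. Poor illumination, amplified noise, motion blur, and scene diversity make the task highly challenging.

Result: On the ELLAR benchmark, the method achieves 55.13% Top-1 and 78.87% Top-5 accuracy, surpassing the previous state of the art. On the newly built LUMINA-26 dataset, it establishes a strong baseline of 75.95% Top-1 and 93.58% Top-5 accuracy.

Insight: The novelty lies in a more realistic and diverse low-light action recognition dataset (LUMINA-26) and in Illumi-Net, which uses video-level illumination cues to guide adaptive enhancement and feature extraction. Its mixture-of-experts architecture and expert-conditioned decision fusion are the core design elements.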

Abstract: Low-light human action recognition remains a challenging problem due to poor illumination, amplified noise, motion ambiguity, and diverse real-world scenes. Existing low-light datasets often lack sufficient action diversity, capture realism, or balanced class distribution, limiting the development of robust models. To address this, we introduce LUMINA-26: Low-Light Understanding for Modeling and Interpreting Night-time Actions, comprising 6,784 clips across 26 action classes, recorded from 22 subjects across 20 indoor and outdoor locations under naturally occurring low-light conditions. We also propose Illumi-Net: An Illumination-Adaptive Mixture-of-Experts Network, which leverages video-level illumination cues to guide adaptive enhancement and transformer-based spatio-temporal feature extraction, with expert-conditioned decision fusion. Our method surpasses previous state-of-the-art performance on ELLAR (Top-1: 55.13%, Top-5: 78.87%) and establishes a strong baseline on LUMINA-26 (Top-1: 75.95%, Top-5: 93.58%), offering a practical benchmark for future low-light action recognition research.


[216] CFPO: Counterfactual Policy Optimization for Multimodal Reasoning cs.CV | cs.CLPDF

Zhangyuan Yu, Wanran Sun, Guangjing Yang, Xiaohu Wu, Qicheng Lao

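TL;DR: This paper proposes Counterfactual Policy Optimization (CFPO), a framework that targets severe grounding failures of large vision-language models in multimodal reasoning, such as ignoring visual evidence or drifting into hallucination. It enforces causal consistency between visual perception and textual reasoning and regularizes the policy with a cross-modal counterfactual enhancement mechanism, without external reward models or additional supervision. Experiments show that CFPO clearly outperforms standard reinforcement learning baselines and the state-of-the-art perception-aware method in reasoning fidelity.

Details

Motivation: Prevailing reinforcement learning paradigms lack explicit counterfactual enhancement and causal learning mechanisms, so models tend to ignore visual evidence or drift into hallucination during multimodal reasoning. A method is needed that enforces causal consistency between vision and text.

Result: Across extensive experiments, CFPO delivers consistent gains of 3.17%-6.25% over standard reinforcement learning baselines (such as GRPO and DAPO) and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO), significantly improving reasoning fidelity.

Insight: The novelty is a cross-modal counterfactual enhancement mechanism that regularizes the policy by maximizing the discrepancy between the model's predictions and those from a counterfactual state in which critical visual cues are suppressed. This offers a form of causal learning that needs no external reward and can make multimodal reasoning more reliable.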

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal reasoning. However, prevailing reinforcement learning (RL) paradigms lack explicit counterfactual enhancement and causal learning mechanisms. This fundamental deficiency results in severe grounding failures, manifesting as a tendency to ignore visual evidence in favor of language priors or exhibiting hallucination drift during long chain-of-thought reasoning. To address this root cause, we propose CounterFactual Policy Optimization (CFPO), a novel framework that enforces causal consistency between visual perception and textual reasoning. CFPO introduces a cross-modal counterfactual enhancement mechanism, which regularizes the policy by maximizing the discrepancy between the model’s predictions and those from a counterfactual state where critical visual cues are suppressed. This approach seamlessly integrates with standard algorithms like GRPO and DAPO without requiring external reward models or additional supervision. Extensive experiments demonstrate that CFPO significantly improves reasoning fidelity, achieving consistent gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO). Code is available at https://github.com/Raven-July/CFPO.
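The discrepancy between predictions made with the full image and predictions made with critical visual cues suppressed can be illustrated with a KL divergence between two next-token distributions. KL is an assumed choice here, since the abstract does not name the discrepancy measure, and `counterfactual_discrepancy` is a hypothetical helper rather than the CFPO objective.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def counterfactual_discrepancy(p_full, p_masked):
    """Discrepancy between predictions with and without critical visual cues.

    A value near zero suggests the answer barely depends on the image;
    a larger value suggests the visual evidence is actually being used.
    """
    return kl_divergence(p_full, p_masked)


# The model's answer distribution with the image vs. with key regions masked.
grounded = counterfactual_discrepancy([0.9, 0.1], [0.5, 0.5])
ungrounded = counterfactual_discrepancy([0.5, 0.5], [0.5, 0.5])
```

Used as a regularizer, a term of this kind would reward policies whose outputs change when the visual evidence is removed, which is the behavior the abstract describes as causal consistency between perception and reasoning.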


[217] T-VSS: Test-Time Visual Subspace Steering for Adversarial Robustness of Vision-Language Models cs.CVPDF

Jaehyuk Jang, Minseok Seo, Seungju Cho, Kangwook Ko, Changick Kim

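TL;DR: This paper proposes T-VSS, a lightweight defense that improves the robustness of vision-language models under adversarial attack. It adapts at test time directly in the visual feature space: it builds a sample-specific low-rank subspace and learns a shared feature correction with reliability-weighted entropy minimization, steering attacked features toward more stable and discriminative predictions.

Details

Motivation: Vision-language models excel at zero-shot recognition but are highly vulnerable to adversarial perturbations. Existing test-time adaptation methods, such as prompt-based or input-space optimization, do not directly adjust the corrupted visual representation itself, and their optimization path is indirect and expensive.

Result: Experiments on fine-grained classification, ImageNet, and ImageNet-OOD benchmarks show that T-VSS improves adversarial robustness while maintaining competitive clean accuracy and is more efficient than prior test-time adaptation methods.

Insight: The novelty is lightweight, sample-specific low-rank subspace adaptation performed directly in the visual feature space. Constraining updates to a compact visual geometry avoids noisy full-space updates and corrects attacked feature representations more directly and efficiently.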

Abstract: Vision-language models (VLMs) achieve strong zero-shot recognition, but they remain highly vulnerable to adversarial perturbations. Recent test-time adaptations improve robustness without retraining, but they do not directly adapt the corrupted visual representation itself. Prompt-based methods adapt the learnable text prompts, while input-space methods optimize pixels or padding at test time. These approaches can improve predictions, but they do so through an indirect and expensive optimization path. We propose Test-time Visual Subspace Steering (T-VSS), a lightweight defense that performs test-time adaptation directly in the visual feature space. T-VSS first builds a sample-specific low-rank subspace from multi-view feature residuals anchored at the attacked image. It then learns a shared feature correction within this subspace using reliability-weighted entropy minimization. By constraining adaptation to a compact visual geometry, T-VSS steers attacked features toward more stable and discriminative predictions while avoiding noisy full-space updates. Experiments on fine-grained, ImageNet, and ImageNet-OOD benchmarks show that T-VSS improves adversarial robustness while maintaining competitive clean accuracy and better efficiency than prior test-time adaptations.


[218] StreamPPG: Low-Latency rPPG Estimation via Consistent Privileged Learning cs.CVPDF

Yiming Li, Yihan Yang, Yuguang Chu, Yuanhui Hu, Si-Yuan Cao

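TL;DR: This paper proposes StreamPPG, a unified architecture for remote photoplethysmography (rPPG). Using a consistent privileged learning strategy, it provides low-latency frame-wise physiological signal estimation with accuracy comparable to clip-wise methods while maintaining real-time throughput.

Details

Motivation: Clip-wise rPPG methods must capture over one hundred frames before inference, which introduces several seconds of delay and hinders real-time use. Frame-wise methods struggle to capture the long-range temporal and periodic features of physiological rhythms, which lowers accuracy. The paper addresses both problems.

Result: Extensive experiments on multiple datasets show that StreamPPG achieves state-of-the-art accuracy while maintaining real-time throughput on edge devices.

Insight: The novelty is a consistent privileged learning strategy that uses ground-truth rPPG signals as privileged information to strengthen the model's representations, bringing long-range temporal features into a low-latency frame-wise estimation framework and balancing accuracy against latency.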

Abstract: Remote photoplethysmography (rPPG) estimates the blood volume pulse (BVP) signal from facial videos, enabling contact-free health monitoring. Conventional clip-wise approaches, which use video clips as input, require capturing over one hundred frames before inference, thus introducing several seconds of delay and hindering real-time use. Meanwhile, frame-wise approaches struggle to capture long-range temporal and periodic features of physiological rhythms, and therefore lead to reduced estimation accuracy. To overcome these issues, we propose StreamPPG, a unified architecture that enables low-latency frame-wise physiological signal estimation while achieving competitive accuracy compared with clip-wise approaches. StreamPPG is trained under a consistent privileged learning (CPL) strategy, which leverages ground-truth rPPG signals as privileged information to enhance the model’s representation capability. Extensive experiments demonstrate that StreamPPG achieves state-of-the-art accuracy across multiple datasets while maintaining real-time throughput on edge devices.


[219] Temporally Aware Densification for Dynamic 3D Gaussian Splatting cs.CVPDF

Vikram Sandu, Mayurdeep Pathak, Rajiv Soundararajan

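TL;DR: This paper proposes a temporally aware densification strategy to improve dynamic 3D Gaussian Splatting (3DGS). With a Visibility-Aware Densification (VAD) framework, a Temporally-Adaptive Thresholding (TAT) mechanism, and a Temporal Offset Warping (TOW) design, the method handles dynamic scenes better, improves the visual quality of dynamic regions, and outperforms existing methods on several dynamic multi-view benchmark datasets.

Details

Motivation: Existing dynamic 3DGS methods model temporal motion but still use a static densification strategy. As a result, dynamic regions are under-reconstructed and blurry, because short-lived Gaussians receive sparse supervision.

Result: On three dynamic multi-view benchmark datasets, the method substantially improves the visual quality of dynamic regions and outperforms existing methods. The VAD module also works as a plug-and-play component that generalizes across dynamic 3DGS methods and consistently improves dynamic reconstruction.

Insight: The contributions are integrating temporal visibility into densification, adapting each Gaussian's densification threshold to its temporal lifespan, and strengthening deformation capacity through temporal offset warping. Viewed objectively, these designs specifically address the imbalance in Gaussian lifespans in dynamic scenes and improve the generality and effectiveness of the approach.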

Abstract: Despite modeling temporal motion, dynamic 3D Gaussian Splatting (3DGS) methods still inherit a static densification strategy that is ill-suited for dynamic scenes. This neglect of temporal behavior leads to under-reconstructed and blurry dynamic regions, as short-lived Gaussians receive sparse supervision and fail to densify effectively. We propose a Visibility-Aware Densification (VAD) framework that integrates temporal visibility into the densification process, ensuring that Gaussians are refined based on their actual temporal presence. A Temporally-Adaptive Thresholding (TAT) mechanism further adjusts each Gaussian’s densification threshold according to its temporal lifespan, promoting balanced refinement of both static and dynamic regions. Finally, a Temporal Offset Warping (TOW) design enhances deformation capacity around temporal centers, extending the lifespan of highly dynamic Gaussians and facilitating more effective densification. Our approach achieves substantial improvements in the visual quality of dynamic regions, outperforming existing methods across three dynamic multi-view benchmark datasets. Moreover, the proposed VAD module generalizes across diverse dynamic 3DGS methods, consistently improving dynamic reconstruction as a plug-and-play component.


[220] RS-Gen: A Multi-Stage Agentic Framework for Reasoning and Search-Augmented Image Generation cs.CV | cs.AIPDF

Feifei Bian, Zhimin Zheng, Wei Deng, Daiguo Zhou, Jian Luan

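TL;DR: This paper proposes RS-Gen, a training-free, plug-and-play, multi-stage agentic framework that introduces a "Questioning-and-Solving" closed-loop mechanism to strengthen image generation and editing models when they face ambiguous intentions, logical reasoning, and out-of-distribution knowledge.

Details

Motivation: Existing image generation models perform poorly on ambiguous intentions, logical reasoning, and out-of-distribution knowledge, mainly because of their limited intrinsic reasoning ability and static knowledge. Unified understanding-and-generation models are likewise constrained by parameter scale and static knowledge.

Result: On the WISE Verified and RISEBench benchmarks, RS-Gen significantly improves Qwen-Image and Qwen-Image-Edit-2511, with absolute gains of 0.313 and 19.70 respectively, bringing both to the SOTA level among open-source models.

Insight: The novelty is a training-free agentic framework whose closed-loop "Questioning-and-Solving" mechanism autonomously identifies logical issues and knowledge gaps, then plans actions for deep reasoning and information completion, extending the capability boundary of the base models.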

Abstract: Recent years have witnessed remarkable progress in image generation and editing, particularly regarding instruction following and visual fidelity. However, when handling ambiguous intentions, logical reasoning, and Out-of-Distribution (OOD) knowledge, existing image models often yield sub-optimal results due to a lack of deep reasoning capabilities and real-time external information. Although emerging unified understanding-and-generation models attempt to bridge this gap, they remain constrained by their intrinsic parameter scales and static knowledge gaps. Inspired by agentic paradigms, we propose RS-Gen: a plug-and-play, training-free, multi-stage image agentic framework. RS-Gen innovatively introduces a “Questioning-and-Solving” closed-loop mechanism to accurately identify logical issues and knowledge gaps, autonomously planning actions to bridge information deficits and execute deep logical reasoning. Extensive experiments demonstrate that RS-Gen significantly expands the capability boundaries of foundational image generation and editing models. Specifically, on the WISE Verified and RISEBench benchmarks, RS-Gen yields substantial absolute performance gains of 0.313 for Qwen-Image and 19.70 for Qwen-Image-Edit-2511, respectively, successfully elevating both to the state-of-the-art (SOTA) level among open-source models.


[221] PhysFlow: Frequency Decoupled with Dual-Field Rectified Flow for Remote Photoplethysmography cs.CVPDF

Zixu Li, Jianjun Qian, Hang Shao, Lei Luo, Jian Yang

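TL;DR: This paper proposes PhysFlow, a frequency-decoupled dual-field rectified flow framework for robust remote photoplethysmography (rPPG) estimation from facial videos. The method decomposes the rPPG signal into trend and amplitude components and learns a separate conditional velocity field for each, which reduces interference between components and allows efficient waveform reconstruction with only a few ODE integration steps.

Details

Motivation: Existing deep learning methods struggle to recover rPPG signals stably under complex disturbances such as illumination changes, facial expressions, and head movements. A main cause is that signal components are coupled during reconstruction, so weak pulse-related variations are masked by strong interference.

Result: Experiments on multiple benchmark datasets show that PhysFlow outperforms state-of-the-art methods in both heart-rate estimation and rPPG waveform reconstruction across diverse challenging scenarios.

Insight: The novelty is the frequency decoupling of the rPPG signal into trend and amplitude components that are supervised and modeled independently, combined with a rectified flow formulation for efficient ODE integration. This improves robustness under disturbance and offers a new approach to separating physiological signals.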

Abstract: Remote Photoplethysmography (rPPG) enables contactless pulse estimation from facial videos, serving as a vital tool for health monitoring. However, current deep learning methods often struggle under complex disturbances, particularly varying illumination, facial expressions, and unconstrained head movements. In such scenarios, subtle physiological signals are easily dominated by external interference, making the recovered rPPG waveform unstable and unreliable. One important reason is that most existing methods directly model the rPPG signal in a unified manner, where different signal components are coupled during reconstruction. This makes it difficult to preserve weak pulse-related variations when strong disturbance-induced changes are present. To address this challenge, we propose PhysFlow, a frequency-decoupled dual-field rectified flow framework tailored for robust rPPG estimation. Specifically, the ground-truth rPPG signal is decomposed into trend and amplitude components, which are used as separate supervisory targets. Based on the extracted facial features, PhysFlow learns two component-specific conditional velocity fields to model the two components separately. This design reduces mutual interference between different components and improves the robustness of rPPG reconstruction under complex disturbances. Moreover, the rectified flow formulation enables efficient waveform reconstruction with only a few ordinary differential equation (ODE) integration steps. Extensive experiments on multiple benchmark datasets demonstrate that PhysFlow outperforms state-of-the-art methods in both heart-rate estimation and rPPG waveform reconstruction across diverse challenging scenarios.
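
As a rough illustration of why a rectified flow can reconstruct a waveform in only a few ODE steps, the sketch below integrates a velocity field with fixed-step Euler. It is not PhysFlow's implementation: the function names, the toy `noise` and `target` vectors, and the constant velocity field are all assumptions made for illustration, and the abstract does not specify how the trend and amplitude fields are conditioned or recombined.

```python
from typing import Callable, List

Vector = List[float]


def euler_integrate(velocity: Callable[[Vector, float], Vector],
                    x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy data: a starting sample and one target signal component.
noise = [0.0, 0.0, 0.0]
target = [0.5, -1.0, 2.0]


def straight_velocity(x: Vector, t: float) -> Vector:
    # Idealized case: a perfectly straight flow has constant velocity
    # (target - start), independent of x and t.
    return [b - a for a, b in zip(noise, target)]


one_step = euler_integrate(straight_velocity, noise, num_steps=1)
four_steps = euler_integrate(straight_velocity, noise, num_steps=4)
```

In this idealized straight-path case a single Euler step already lands on the target, which is the intuition behind few-step sampling; a learned velocity field is only approximately straight, so a handful of steps is typically used.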


[222] SteerVTE: Seamless Video Text Editing with Style and Glyph Control cs.CV | cs.AIPDF

Kai Zeng, Moran Li, Zhengwei Wang, Yingchen Yu, Yiheng Lin

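TL;DR: This paper proposes SteerVTE, a unified framework for video text editing. It steers a frozen video diffusion model through style and glyph control to modify text in videos precisely while preserving stylistic consistency and temporal coherence. The method comprises a lightweight text context adapter, a glyph-aware loss, and a progressive training strategy, and is validated on a purpose-built million-scale dataset.

Details

Motivation: Visual text editing aims to precisely modify text in images and videos while preserving stylistic consistency and visual realism. Text editing in images has advanced considerably, but video text editing remains difficult: it is a localized task that demands stroke-level precision within small text regions while also handling cross-frame accuracy, temporal coherence, and stylistic fidelity.

Result: Extensive experiments show that SteerVTE substantially outperforms existing video editing baselines in text accuracy, style consistency, and temporal coherence.

Insight: The novelty is a unified framework built on a frozen video diffusion model, with precise control through a style encoder and dual-granularity glyph encoders. A glyph-aware spatial-focal loss and a three-stage progressive training curriculum compensate for the weak text rendering priors of video foundation models, and the million-scale synthetic dataset supports large-scale training.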

Abstract: Visual text editing aims to precisely modify text in images and videos while preserving stylistic consistency and visual realism. Despite significant advances in the image domain, video text editing remains largely unexplored: it is a localized task demanding stroke-level precision within small text regions, which compounds the challenges of cross-frame accuracy, temporal coherence, and stylistic fidelity. We introduce SteerVTE, a unified framework that steers a frozen video diffusion model to perform precise Video Text Editing through style and glyph control. Built on a frozen diffusion transformer, SteerVTE attaches a lightweight text context adapter with two complementary modules: a style encoder capturing the original text’s visual attributes, and dual-granularity glyph encoders encoding the target text at both the line and character levels. To overcome the inherently weak text rendering priors of video foundation models, we further propose a glyph-aware spatial-focal loss and a three-stage progressive training curriculum that scales from image to video data. To support large-scale training, we also develop an automatic synthesis pipeline and construct SteerVTE-1M, a dataset of one million triplets spanning diverse scenes, fonts, and stylistic effects. Extensive experiments demonstrate that SteerVTE substantially outperforms existing video editing baselines across text accuracy, style consistency, and temporal coherence.


[223] BoxCtrl: 3D-Aware Visual Prompting for Geometric Image Editing cs.CVPDF

Feifei Wang, Shiyuan Yang, Xiaoyu Li, Jing Liao

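TL;DR: This paper proposes BoxCtrl, a 3D-aware visual prompting framework that addresses the lack of precision and consistency of instruction-based image editing on geometric transformations such as translation, scaling, and rotation. The method uses projections of RGB-colored 3D bounding boxes as visual prompts and a two-stage training scheme (supervised fine-tuning followed by reinforcement learning), reaching SOTA performance on several geometric editing tasks.

Details

Motivation: Existing image editing methods based on instructions or multimodal large language models struggle to perform precise and consistent geometric transformations in 3D space, such as translation, scaling, and rotation.

Result: Extensive experiments on translation, rotation, scaling, and composite editing tasks show that BoxCtrl achieves state-of-the-art performance.

Insight: The novelty is using projections of RGB color-coded 3D bounding boxes as visual prompts, which decouples geometric control from appearance control. The two-stage SFT-RL training strategy combines synthetic data with unpaired real-world data and improves both geometric precision and visual fidelity.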

Abstract: As instruction-based editing models and multimodal large language models advance, diverse image editing tasks have become feasible. However, achieving precise and consistent geometric image editing, such as translating, scaling, and rotating in 3D space, remains a major challenge. In this work, we introduce BoxCtrl, a 3D-aware visual prompting framework. Unlike text-only or coarse 2D-guided approaches, our method introduces informative RGB 3D bounding boxes projected onto 2D images as visual prompts. The three orthogonal faces of each box are painted with distinct RGB colors, simultaneously encoding position, size, and orientation to provide a compact, intuitive in-context visual example. The key to BoxCtrl’s success lies in these well-designed bounding boxes, which decouple geometric control from appearance control. This enables the model to learn consistent correspondences between faces of the same color in the latent space, leading to a precise understanding of geometric intentions and accurate editing results. We introduce a two-stage training paradigm: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). To address paired data scarcity, we construct a large-scale synthetic dataset for SFT, equipping the model with fundamental editing capabilities. To bridge the synthetic-to-real domain gap, we incorporate an online RL stage leveraging unpaired real-world data. Guided by a reward function evaluating geometric accuracy and visual fidelity, our SFT-RL strategy significantly enhances geometric precision while maintaining photorealistic quality. Extensive experiments demonstrate that BoxCtrl achieves state-of-the-art performance across translation, rotation, scaling, and composite editing tasks.


[224] Ocean4D: Generative Underwater 4D Reconstruction via Medium-Aware Video Diffusion cs.CVPDF

Yuqiang Huang, Yuxi Wang, Junyu Dong, Zhaoxiang Zhang

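TL;DR: This paper proposes Ocean4D, a generative framework for underwater 4D reconstruction. By combining geometrically consistent conditioning with medium-aware denoising, it generates videos along target camera trajectories from a monocular video and addresses the reconstruction challenges caused by underwater medium degradation and dynamic variations.

Details

Motivation: Most existing methods are developed under in-air assumptions and do not explicitly account for underwater absorption and backscatter. Their near-static assumptions also make them sensitive to drifting particles and dynamic distractors, which leads to unstable geometry and inconsistent cross-view results.

Result: Extensive experiments on dynamic and static underwater benchmarks show that the method achieves state-of-the-art performance on underwater reconstruction.

Insight: The novelty is combining 4D geometrically consistent conditioning (4D-GCC) with an implicit medium-aware denoising block (the Medium-Aware Block), which stabilizes underwater appearance within the latent diffusion process and so improves cross-view consistency and preservation of global structure.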

Abstract: Underwater 4D reconstruction remains challenging due to the coupling between degraded light transport in participating media and dynamic water variations. Most existing methods are developed under in-air assumptions and do not explicitly account for underwater absorption and backscatter. Additionally, near-static assumptions make these approaches sensitive to drifting particles and dynamic distractors, leading to unstable geometry and inconsistent cross-view results. To address these issues, we propose a generative framework for underwater 4D reconstruction, named Ocean4D, which is built on two complementary components. Specifically, 4D-GCC constructs 4D geometrically consistent conditioning with improved cross-frame coverage, while the Medium-Aware Block performs implicit medium-aware denoising in the latent diffusion process to stabilize underwater appearance under absorption and scattering. Given a monocular video and target cameras, our method generates videos along the target trajectories while preserving global structure and cross-view consistency. Extensive experiments on both dynamic and static underwater benchmarks demonstrate state-of-the-art performance on underwater reconstruction.


[225] P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture cs.CV | cs.AIPDF

Felix Tristram, Stefano Gasperini, Benjamin Killeen, Marcel Walch, Christian Benz

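TL;DR: This paper proposes P-JEPA, a backbone-agnostic method for learning representations of long procedural videos. By reducing the problem to a dense, frame-aligned action space and predicting pooled masked latent vectors, it can ingest videos longer than 30 minutes and supports effective long-form understanding of procedural steps.

Details

Motivation: As embodied AI platforms mature, procedural video representation learning is increasingly important for intelligent assistance systems that support complex multi-step tasks. Because of the quadratic complexity of self-attention, existing video foundation models struggle to capture the long-range action dependencies in procedural videos, for example actions that look similar but occur at different points in the procedure, such as turning a stove on versus off.

Result: Evaluated on the EgoExo4D, EgoProceL, and Assembly101 datasets with features extracted by VJEPA2.1, TSM, and I3D, P-JEPA consistently improves linear separability, streaming inference, and temporal action segmentation performance. It achieves state-of-the-art (SOTA) results on EgoExo4D fine-grained action classification while using an order of magnitude fewer parameters than LLM-based methods and running in real time.

Insight: The core novelty is a backbone-agnostic architecture that recasts long-video modeling as prediction over a dense, frame-aligned action space, which addresses the difficulty of modeling long-range dependencies. The method keeps performance high while substantially reducing model complexity and compute cost, enabling real-time processing of long videos.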

Abstract: The increasing maturity of embodied AI platforms has driven a growing interest in procedural video representation learning to support intelligent assistance systems for complex, multi-step tasks. Leveraging large-scale latent predictive training, video foundation models capture video dynamics, enabling downstream tasks such as activity understanding, spatiotemporal localization, and predictive control. However, procedural videos include actions with long-range dependencies that these models do not support, due to the quadratic complexity of self-attention. Distinct actions, for example, may be visually similar despite appearing at different points in the procedure, such as turning the stove on versus off. Here, we propose a backbone-agnostic approach that learns long-duration video representations by reducing the problem to a dense, frame-aligned action space and predicting pooled masked latent vectors. This approach allows our Procedural Joint Embedding Predictive Architecture (P-JEPA) to ingest videos over 30 minutes long, enabling effective long-form understanding of procedural steps. We evaluate P-JEPA using features extracted with VJEPA2.1, TSM, and I3D over the EgoExo4D, EgoProceL, and Assembly101 datasets, finding that it consistently improves linear separability, streaming inference, and temporal action segmentation performance, achieving state-of-the-art results on EgoExo4D fine-grained action classification while using an order of magnitude fewer parameters than LLM-based methods and running in real time.


[226] Flow6D: Discrete-to-Continuous Flow Matching for Efficient and Accurate Category-Level 6D Pose Estimation cs.CV | cs.ROPDF

Mingyu Mei, Li Zhang, Zibo Dai, Han Sun, Xinyue Zhao

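TL;DR: Flow6D proposes a hierarchical flow matching framework for category-level 6D pose estimation with a two-stage strategy: discrete latent-space localization followed by continuous pose regression. Rotation and translation parameters are first discretized into bins, and a discrete flow matching model locks onto the latent space around the true pose to reduce search complexity. A continuous flow matching model then predicts local pose residuals to refine the estimate and regress an accurate pose. The framework extends naturally to articulated objects, outperforms state-of-the-art methods on synthetic and real datasets, and runs in real time at 70 FPS.

Details

Motivation: Existing methods regress directly in a high-dimensional continuous space and face two key challenges in category-level pose estimation: limited accuracy due to noise and local optima, and inefficient search over an infinite space that hinders real-time performance.

Result: The method outperforms state-of-the-art approaches on synthetic and real datasets and runs in real time at 70 FPS.

Insight: The novelty is a hierarchical strategy that combines discretization with continuous flow matching: discrete flow matching narrows the search space and continuous flow matching performs fine refinement, which improves efficiency considerably while preserving accuracy and extends to articulated objects.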

Abstract: 6D pose estimation is a key task in computer vision and embodied AI, widely used in robotic manipulation, augmented reality, etc. Existing methods directly regress in a high-dimensional continuous space, facing two key challenges in category-level pose estimation: limited accuracy due to noise and local optima, and inefficient search over an infinite space that hinders real-time performance. This paper proposes Flow6D, a hierarchical flow matching framework with a two-stage discrete latent space localization-continuous pose regression strategy. Rotation and translation parameters are first discretized into bins, with a discrete flow matching model locking the latent space around the true pose to reduce search complexity. Then, by sampling in the latent space, a continuous flow matching model predicts local pose residuals to optimize the estimate and regress to an accurate pose. The framework also naturally extends to articulated objects, outperforming state-of-the-art methods on synthetic and real datasets with real-time inference at 70 FPS. Project website: https://flow6d.github.io/.
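
The coarse-to-fine idea in the abstract (pick a bin first, then regress a small residual inside it) can be sketched for a single rotation angle as follows. This only illustrates the discretize-then-refine parameterization, not Flow6D's flow matching models; the bin count, the function names, and the use of one angle instead of a full 6D pose are assumptions made for the example.

```python
def angle_to_bin(angle_deg: float, num_bins: int) -> int:
    """Map an angle in degrees to the index of the bin that contains it."""
    width = 360.0 / num_bins
    return int((angle_deg % 360.0) // width)


def bin_center(bin_index: int, num_bins: int) -> float:
    """Return the center angle (degrees) of a bin."""
    width = 360.0 / num_bins
    return (bin_index + 0.5) * width


def decode(bin_index: int, residual_deg: float, num_bins: int) -> float:
    """Combine a coarse bin with a local residual into a continuous angle."""
    return (bin_center(bin_index, num_bins) + residual_deg) % 360.0


num_bins = 36                # hypothetical choice: 10-degree bins
true_angle = 123.4           # hypothetical ground-truth angle

# Stage 1 target: the discrete bin around the true pose.
coarse = angle_to_bin(true_angle, num_bins)
# Stage 2 target: a residual bounded by half a bin width.
residual = true_angle - bin_center(coarse, num_bins)
recovered = decode(coarse, residual, num_bins)
```

Once the bin is fixed, the residual is bounded by half a bin width, so the continuous stage searches a small interval instead of the whole pose space.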


[227] VideoAgent: All-in-One Framework for Video Understanding and Editing cs.CV | cs.AIPDF

Hengji Zhou, Lingxuan Huang, Jian Wang, Bing Zhou, Si Wu

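TL;DR: This paper proposes VideoAgent, an all-in-one agentic framework for video understanding and editing. Through shot planning agents and a multi-agent orchestration framework, it addresses two limitations of existing systems: the inability to handle diverse video comprehension and editing operations, and the lack of long-video understanding for coherent narratives.

Details

Motivation: Existing automated video editing systems are restricted to short segments and domain-specific tasks. They cannot handle diverse video comprehension and editing operations, and they lack the long-video understanding needed for coherent narrative creation.

Result: Extensive experiments on the newly proposed VideoEdit benchmark and public datasets show that VideoAgent outperforms existing multimodal LLMs and agentic systems, with orchestration success rates of 87-95% and a 60% reduction in API costs. In human evaluation across six video categories, its output approaches human-level quality, with ratings only 4% below human-created videos.

Insight: The main contributions are: 1) automated, narratively coherent video shot creation through shot planning agents and cross-modal retrieval; 2) a multi-agent orchestration framework integrating over thirty specialized editing agents, which uses intent parsing and textual-gradient graph optimization to assemble complex editing pipelines. The framework delivers high-quality edits while reducing API costs substantially.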

Abstract: Video editing has become essential in digital media creation, yet existing automated systems are restricted to short segment processing and domain-specific tasks. They face two critical limitations: i) inability to handle diverse video comprehension and editing operations, and ii) lack of long-video understanding for coherent narrative creation. We propose VideoAgent, an all-in-one agentic framework addressing these challenges through two key innovations. First, we develop automated video shot creation with shot planning agents for coherent narratives and cross-modal retrieval for aligned visual content. Second, we design a multi-agent orchestration framework integrating over thirty specialized editing agents. Intent parsing filters relevant tools while textual-gradient graph optimization assembles complex editing pipelines. Extensive experiments on our newly-proposed VideoEdit benchmark and public datasets demonstrate VideoAgent’s superiority over existing multimodal LLMs and agentic systems. VideoAgent achieves 87-95% orchestration success rates while reducing API costs by 60%. Human evaluation across six video categories shows VideoAgent produces professional-quality content approaching human-level performance, with ratings only 4% below human-created videos. We release our code at https://github.com/HKUDS/VideoAgent.


[228] Faithful Grounded Visual Reasoning via Learned Proxy-Tokens cs.CVPDF

Tom Hodemon, Mohamed Chaouch, Aboubacar Tuo, Angelique Loesch

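TL;DR: This paper proposes Composer, a multimodal large language model with a novel visual grounding mechanism based on learned proxy-tokens. It targets the semantic-spatial gap in coordinate-based grounding methods: discrete symbolic pointers index the image latent space, improving the faithfulness and interpretability of visual reasoning.

Details

Motivation: Existing MLLMs perform well on visual question answering, but their "black-box" nature hinders deployment in critical domains. Coordinate-based visual grounding lacks a learnable semantic link to the visual features, so the model may hallucinate coordinates that do not match the image evidence.

Result: Evaluated on the synthesized ComposerGCoT dataset, Composer matches the coordinate-based baseline in final answer accuracy while improving visual grounding accuracy by 9.0 points.

Insight: The novelty is the introduction of learnable discrete proxy-tokens as symbolic pointers to visual regions, which turns those regions into addressable, semantically manipulable sets and captures spatial semantics more effectively. This offers a promising path toward trustworthy MLLMs.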

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in Visual Question Answering (VQA), yet their “black-box” nature hinders deployment in critical domains. Grounded Visual Reasoning (GVR) approaches attempt to improve interpretability by explicitly coupling textual rationales with visual grounding information, which typically takes the form of textual coordinates. This mechanism lacks a learnable semantic link to the visual features, often resulting in a semantic-spatial gap where the model hallucinates coordinates that do not correspond to image evidence. In this work, we introduce Composer, an MLLM that leverages a novel visual grounding mechanism based on learned proxy-tokens to promote faithful interpretability. These discrete symbolic pointers explicitly index the image latent space, allowing the model to manipulate visual regions as addressable, semantically manipulable sets. To rigorously validate our novel grounding mechanism, we constructed ComposerGCoT, a dataset synthesized to enable holistic assessment of reasoning consistency and grounding accuracy. Experimental results indicate that Composer achieves performance parity with its coordinate-based counterpart in final answer accuracy, while improving visual grounding accuracy by +9.0 points. By demonstrating that discrete proxy-tokens capture spatial semantics more effectively than typical textual coordinates, we establish that visual grounding mechanisms with learnable semantic links represent a promising path toward trustworthy and reliable MLLMs.


[229] Brain-Adapter: A Dual-Stream Vision-Language MIL Framework for Comprehensive 3D CT Diagnosis of Acute Intracranial Pathologies cs.CVPDF

Zhenyu Yi, Zhiyun Song, Yusong Sun, Zelin Liu, Manman Fei

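TL;DR: This paper proposes Brain-Adapter, a novel dual-stream multiple instance learning (MIL) framework for automated multi-label diagnosis of 3D brain CT scans. It uses pre-trained 2D biomedical vision-language models and raw diagnostic reports, dynamically aligning visual features with disease concepts through a text-conditioned attention mechanism and a parallel visual stream. An uncertainty-aware module fuses the two streams' predictions, giving robust scan-level classification without dense annotations.

Details

Motivation: Automated diagnosis of 3D brain CT scans is challenging because it relies heavily on manual annotations and conventional models have limited semantic understanding. How to transfer the strong generalization of 2D foundation vision-language models to 3D volumes also remains an open problem.

Result: On 3D acute intracranial pathology analysis, extensive experiments show that the method significantly outperforms state-of-the-art 3D models and standard MIL approaches.

Insight: The main contributions are: 1) a text-conditioned attention mechanism that uses raw diagnostic sentences as semantic queries to dynamically align visual cues; 2) a dual-stream MIL framework that pairs a text-guided semantic stream with a global visual stream, made more robust by a consistency constraint and an uncertainty-aware fusion module; 3) supervision from structured labels that an LLM extracts from reports, which removes the dependence on dense pixel-level annotations and yields a highly scalable clinical solution.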

Abstract: Automated diagnosis of 3D brain CT scans is essential for critical care, yet it remains challenging due to the heavy reliance on manual annotations and the limited semantic understanding of conventional models. While 2D foundation vision-language models (VLMs) have shown remarkable generalization, effectively transferring their representational power to 3D volumes remains an open problem. In this paper, we propose Brain-Adapter, a novel dual-stream multiple instance learning (MIL) framework that leverages pre-trained 2D biomedical VLMs and raw diagnostic reports for robust scan-level multi-label classification. Specifically, we introduce a Text-Conditioned Attention (TCA) mechanism, utilizing raw diagnostic sentences as semantic queries to dynamically align visual cues with specific disease concepts. Concurrently, a parallel visual MIL stream captures global scan characteristics, supervised by structured labels extracted via a Large Language Model (LLM). To ensure representation coherence, a consistency constraint enforces synergy between the two streams. During inference, an Uncertainty-Aware Refinement (UAR) module dynamically calibrates and fuses these dual-stream predictions to resolve ambiguous cases. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art 3D models and standard MIL approaches. By eliminating the reliance on dense annotations, Brain-Adapter provides a highly scalable and clinically viable solution for 3D acute intracranial pathology analysis.


[230] Rethinking Object-Centric Representations for Video Dynamics Modeling cs.CV | cs.AI | cs.LGPDF

Amaury Wei, Ismail Nejjar, Olga Fink

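TL;DR: This paper proposes STAITUS, a framework for unsupervised video object tracking that explicitly disentangles each object slot into appearance and geometric pose (position/scale). This addresses the inconsistent identities and fragmented segmentation that arise in existing methods when appearance and pose are entangled. The framework enforces spatial separation within each frame, applies temporal alignment only in the disentangled appearance space, and uses an adaptive gating mechanism to match the number of active slots to scene complexity, yielding sharper masks and more persistent identities under motion, occlusion, and object entry/exit.

Details

Motivation: Existing slot-based methods for unsupervised video object tracking preserve object identity by enforcing temporal consistency on slot embeddings. When appearance and pose are entangled, this consistency objective conflicts with object motion and viewpoint changes, so slots tend to lock onto static regions such as the background, while foreground objects become fragmented or frequently swap identities.

Result: Extensive experiments on synthetic and real-world benchmarks show that STAITUS substantially outperforms state-of-the-art baselines in segmentation quality and tracking stability.

Insight: The core novelty is the explicit disentanglement of each slot into separate appearance and geometric pose parts, with temporal alignment applied only in appearance space, plus an adaptive gating mechanism that adjusts the number of slots dynamically. This offers a new way to handle the tracking problems caused by appearance-pose entanglement, and both the disentanglement strategy and the dynamic adjustment are useful ideas for more robust unsupervised video object representations.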

Abstract: Unsupervised video object tracking aims to decompose dynamic scenes into persistent, object-centric entities without manual annotations. Many recent approaches rely on slot-based representations, where a fixed set of latent variables (“slots”) represent individual objects across frames. To preserve object identity, these models enforce temporal consistency on slot embeddings. However, when appearance and pose are entangled, this consistency objective conflicts with object motion and viewpoint changes. As a result, slots tend to lock onto static regions (e.g., background) to satisfy the consistency objective, while foreground objects become fragmented across multiple slots or frequently swap identities. To address these limitations, we propose STAITUS, a unified framework that explicitly disentangles each slot into appearance and geometric pose (position/scale). Leveraging this disentanglement, STAITUS enforces within-frame spatial separation and applies temporal alignment only in appearance space, yielding sharper masks and more persistent identities under motion, occlusion, and object entry/exit. Furthermore, to mitigate over-segmentation, we introduce an adaptive gating mechanism that dynamically adjusts the number of active slots to match scene complexity. Extensive experiments on synthetic and real-world benchmarks demonstrate that STAITUS substantially outperforms state-of-the-art baselines in segmentation quality and tracking stability.


[231] AwakeForest: An Interactive Geospatial Platform for Large-Scale Forest Imagery cs.CV | cs.SEPDF

Suraj Prasai, Kangning Cui, Rongkun Zhu, Sarra Alqahtani, Ying Zhang

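TL;DR: AwakeForest is an interactive end-to-end geospatial platform designed for large-scale forest imagery analysis. It integrates model-assisted inference, automatic annotation, and human-in-the-loop refinement, and scales from standard aerial scenes to large orthomosaics of hundreds of gigabytes.

Details

Motivation: Forest imagery analysis involves tightly coupled tasks under large variation in geographic regions and acquisition conditions, and existing tools lack a unified solution that is geospatial-native, cloud-optimized, and ML-integrated. The platform aims to provide an end-to-end workflow covering annotation, prediction, visualization, and downstream analysis.

Result: The system is demonstrated on the PALMS dataset, illustrating that the platform supports a complete workflow for practical forest management and analysis. No quantitative metrics or comparisons with existing methods are reported.

Insight: The novelty is the tight integration of geospatial-native processing, a cloud-optimized architecture, and machine learning workflows, which enables plug-and-play models, human-in-the-loop annotation, and scalable interaction with large imagery, providing an integrated platform for domain-specific large-scale vision tasks.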

Abstract: Forest imagery analysis often involves multiple tightly coupled vision tasks, which must be performed under substantial variation in geographic regions, sensors, and acquisition conditions. However, practitioners often lack a unified tool that is geospatial-native, cloud-optimized, and ML-integrated for end-to-end workflows spanning annotation, prediction, visualization, and downstream analysis at scale. We present AwakeForest, an interactive end-to-end platform designed for large-scale forest imagery that integrates model-assisted inference, automatic annotation, and human-in-the-loop refinement within a single workflow. Our platform supports plug-and-play integration of pretrained models and enables scalable interaction with forest imagery ranging from standard aerial scenes to large orthomosaics that can span several gigabytes to hundreds of gigabytes. AwakeForest produces analysis-ready outputs that can be directly used for downstream analysis and to support iterative model and annotation updates on new scenes. We demonstrate the system on the PALMS dataset and illustrate how AwakeForest supports an end-to-end workflow for practical forest management and analysis.


[232] Changing Modalities: Adapting Remote Sensing Models to New Satellites and Sensors cs.CV | cs.LGPDF

Tim G. Zhou, Anthony Fuller, Geoff Pleiss, Evan Shelhamer

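TL;DR: This paper studies the modality changes that remote sensing machine learning models face when satellite sensors are updated. It identifies three modality-change scenarios (substitution, superset, subset) and designs an end-to-end architecture called DeluluNet. Using modality hallucination, the architecture learns from a unimodal teacher model and unlabeled multimodal data, so it can adapt to changes in input modalities without extensive re-labeling and re-training.

Details

Motivation: When satellite sensors are updated or replaced, existing remote sensing models face changed input modalities and would otherwise require extensive re-labeling and re-training. The goal is to adapt within realistic data availability and computational constraints.

Result: The paper proposes the DeluluNet architecture, which predicts missing modality representations through modality hallucination so that the model can keep predicting after modalities change, offering a practical deployment option that avoids full re-training.

Insight: The novelty is a systematic definition of three modality-change scenarios and a unified end-to-end learning framework based on modality hallucination, which transfers knowledge from a pre-trained unimodal model using unlabeled multimodal data and improves adaptability and practicality in changing environments.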

Abstract: Machine learning models for remote sensing are trained and deployed on a static set of modalities. However, as we equip newer satellites with novel sensors and retire old ones, practitioners may wish to deploy an existing model on a substitution, superset, or subset of modalities with minimal retraining given data availability or practical computational constraints. We study the setting of updating existing models to changing modalities and identify three main scenarios: Modality Transfer (substitution), Addition (superset), and Peeking (subset). We propose DeluluNet, an architecture with modular components for all three changing modality scenarios. DeluluNet is trained end-to-end, learning a multi-modal model from a unimodal teacher and unlabeled multimodal data via modality hallucination, that is, predicting missing modality representations from those that are present. As a result, DeluluNet can keep predicting even when input modalities change, providing a practical alternative to re-labeling and re-training in a changing world.


[233] LightSTAR: Efficient Visual Document Retrieval via Lightweight Selection with Vision-Adaptive Refinement cs.CVPDF

Tongkun Guan, Haocheng Wang, Wei Shen, Xiaokang Yang

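TL;DR: LightSTAR is an efficient visual document retrieval framework that decomposes retrieval into two stages: an LLM-free visual selection step first produces a high-recall set of candidate pages quickly, and vision-adaptive semantic refinement is then applied only to those candidates, balancing retrieval accuracy and efficiency.

Details

Motivation: Existing visual document retrieval methods based on multimodal large language models (MLLMs) are accurate but computationally expensive, because every page must be densely encoded. The authors observe that user queries usually contain keywords that are likely to appear directly in the visible text of relevant pages, which is an efficient cue for quickly narrowing down candidates.

Result: Experiments show that LightSTAR achieves state-of-the-art (SOTA) retrieval accuracy on visual document retrieval while reducing end-to-end latency several-fold.

Insight: The core novelty is decomposing retrieval into lightweight selection followed by fine-grained matching. The paper proposes content-grounded query encoding, LLM-free visual embeddings, and vision-adaptive semantic refinement that combines textual and layout cues through adaptive region-wise feature fusion, optimized with a hardness-aware contrastive objective, to resolve the accuracy-efficiency trade-off.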

Abstract: Visual document retrieval requires rapidly locating relevant pages from large multi-modal corpora in response to user queries. While recent methods powered by Multi-modal Large Language Models (MLLMs) show competitive accuracy, they suffer from prohibitive computational costs by applying intensive MLLM encoding to every single page. Meanwhile, we observe that user queries are typically keyword-anchored, containing semantically rich words that are expected to appear directly in the visible text of relevant pages, offering an efficient cue for quickly narrowing down candidate pages. Building on this insight, we propose LightSTAR, an efficient framework that decomposes visual document retrieval into: 1) LLM-free Visual Selection, which utilizes content-grounded query encoding to focus on informative words and employs LLM-free visual embeddings to produce a high-recall candidate set; and 2) Vision-adaptive Semantic Refinement, which further performs fine-grained semantic matching exclusively on these top candidates via adaptive region-wise feature fusion to effectively combine textual and layout cues, optimized through a hardness-aware contrastive objective. Experimental results demonstrate that LightSTAR achieves state-of-the-art retrieval accuracy while reducing end-to-end latency by several-fold, offering a highly practical solution to the accuracy-efficiency trade-off in visual document retrieval. Code is available at https://github.com/bokufa/LightSTAR.
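
The select-then-refine decomposition described above follows a general two-stage retrieval pattern, sketched below. This is not LightSTAR's code: the keyword-overlap scorer stands in for its LLM-free visual selection, `mock_refiner` is a placeholder for the expensive refinement model, and the page texts and all names are hypothetical.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def two_stage_retrieve(query: str, pages: Dict[str, str],
                       cheap_score: Scorer, costly_score: Scorer,
                       shortlist_size: int, top_k: int) -> List[str]:
    """Shortlist with a cheap scorer, then rerank only the shortlist."""
    shortlist = sorted(pages, key=lambda pid: cheap_score(query, pages[pid]),
                       reverse=True)[:shortlist_size]
    reranked = sorted(shortlist, key=lambda pid: costly_score(query, pages[pid]),
                      reverse=True)
    return reranked[:top_k]


def keyword_overlap(query: str, text: str) -> float:
    # Cheap stage-1 proxy: number of query words that appear on the page.
    return float(len(set(query.lower().split()) & set(text.lower().split())))


refiner_calls: List[str] = []


def mock_refiner(query: str, text: str) -> float:
    # Placeholder for an expensive model; records how often it is invoked.
    refiner_calls.append(text)
    return 1.0 if query.lower() in text.lower() else 0.0


pages = {
    "p1": "quarterly revenue table for 2023",
    "p2": "revenue grew while quarterly costs fell",
    "p3": "employee handbook and travel policy",
    "p4": "appendix with contact details",
}
result = two_stage_retrieve("quarterly revenue", pages, keyword_overlap,
                            mock_refiner, shortlist_size=2, top_k=1)
```

The expensive scorer runs on two pages rather than four here; on a large corpus that gap is what drives the latency savings, provided the first stage keeps recall high.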


[234] Vera: A Layered Diffusion Model for Content-Preserving Video Editing cs.CVPDF

Hongkai Zheng, Ta-Ying Cheng, Benjamin Klein, Yisong Yue, Zhuoning Yuan

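TL;DR: Vera is a layered diffusion model for content-preserving video editing. Instead of regenerating every pixel, it generates an edit layer and an alpha matte that are composited with the source video, which preserves the original content better during editing. The method uses a Mixture-of-Transformers architecture, is trained on a high-quality layered dataset, and outperforms existing open-source video editing models in content preservation.

Details

Motivation: Existing video diffusion models often alter content that should remain unchanged during editing, such as characters or backgrounds, so content preservation is a core challenge. Vera addresses this by separating creative editing from content preservation through a layered design.

Result: In a quantitative benchmark and a human preference study, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, using 486K frames of layered training data.

Insight: The novelty is the layered diffusion framework and the Mixture-of-Transformers architecture: content is preserved by compositing an edit layer with the source through an alpha matte, and a high-quality layered dataset supports training. Decomposing video editing into layered generation plus compositing is a novel and effective design.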

Abstract: Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera, a layered diffusion framework for content-preserving video editing. Instead of regenerating the entire video, Vera generates an edit layer along with an alpha matte for compositing with the source video, separating creative editing from content preservation by design. To encourage coherent composition with the source video, we extend the text-to-video DiT into a Mixture-of-Transformers (MoT) architecture, with separate DiTs for each layer that interact through joint self-attention. To support the training of Vera, we further construct a high-quality layered dataset with accurate alpha mattes, diverse scenes and dynamics, and visual effects. Across our quantitative benchmark and human preference study, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, using 486K frames of layered training data.
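
To make the compositing step concrete, the sketch below applies a standard per-pixel alpha blend of an edit layer onto a source frame. The abstract does not state Vera's exact compositing formula, so the blend used here (`alpha * edit + (1 - alpha) * source`) is an assumption, and the flat lists of grayscale values are toy data.

```python
from typing import List


def composite(source: List[float], edit: List[float],
              alpha: List[float]) -> List[float]:
    """Blend an edit layer over a source frame using a per-pixel alpha matte.

    alpha = 0 keeps the source pixel unchanged; alpha = 1 takes the edit pixel.
    """
    return [a * e + (1.0 - a) * s for s, e, a in zip(source, edit, alpha)]


# Hypothetical 4-pixel grayscale frame, edit layer, and matte.
source = [0.2, 0.4, 0.6, 0.8]
edit = [1.0, 1.0, 0.0, 0.0]
alpha = [0.0, 1.0, 0.5, 0.0]

out = composite(source, edit, alpha)
```

Wherever the matte is zero the output equals the source exactly, which is how a layered formulation can guarantee that untouched regions are preserved by construction.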


[235] UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation cs.CVPDF

Yohann Perron, Guillaume Astruc, Nicolas Gonthier, Clement Mallet, Loic Landrieu

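TL;DR: This paper proposes UniverSat, a Vision Transformer (ViT)-style backbone for Earth Observation (EO). At its core is a Universal Patch Encoder that maps patches of arbitrary spatial, spectral, and temporal resolution, from both optical and non-optical sensors, into a shared embedding space with shared weights. This allows a single model to be trained with self-supervision on heterogeneous multimodal corpora, yielding robust, sensor-agnostic spatial features. The method obtains strong classification and segmentation results on standard EO benchmarks from GeoBench, PANGEABench, and SpectralEarth.

Details

Motivation: Standard Vision Transformers (ViT) rely on rigid patch projectors, which hinders their use in Earth Observation (EO), where input modalities, scales, and resolutions vary widely.

Result: The method achieves strong results on classification and segmentation tasks across standard Earth Observation benchmarks from GeoBench, PANGEABench, and SpectralEarth.

Insight: The core novelty is the Universal Patch Encoder, which handles inputs of arbitrary spatial, spectral, and temporal resolution from different sensors (optical and non-optical) and maps them into a unified embedding space, enabling a single model to be trained with self-supervision on heterogeneous multimodal EO data and to learn sensor-agnostic features.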

Abstract: Vision Transformers (ViT) dominate computer vision. However, their reliance on rigid patch projectors hinders transfer to Earth Observation (EO), where input modalities, scales, and resolutions vary widely. We introduce UniverSat, a ViT-style backbone built around a Universal Patch Encoder that maps patches from arbitrary spatial, spectral, and temporal resolutions, and from both optical and non-optical sensors, into a shared embedding space with a shared set of weights. This enables training a single model on heterogeneous multimodal corpora via self-supervision, yielding robust, sensor-agnostic spatial features. We validate this approach with strong results across classification and segmentation on standard EO benchmarks from GeoBench, PANGEABench, and SpectralEarth. Our code and models are available at https://github.com/gastruc/UniverSat.


[236] Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views cs.CVPDF

Jiho Choi, Seonho Lee, Seojeong Park, Hyunjung Shim

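TL;DR: This paper presents DR-MV3D, a dense-reward learning framework for multi-view 3D visual question answering (MV3D-VQA). It decomposes the task into global map construction, question-conditioned view-trajectory planning, and egocentric answer prediction, and supervises the whole reasoning process densely through a global consistency reward and a local trajectory reward.

Details

Motivation: Current multimodal LLMs are typically trained for MV3D-VQA with sparse, answer-level supervision, which leads to inconsistent cross-view reasoning and brittle view selection. The paper addresses this by supervising the full reasoning process with dense, verifiable rewards.

Result: Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.

Insight: The main contribution is decomposing MV3D-VQA into learnable stages and introducing dense rewards that need no manual annotation (a global consistency reward and a local trajectory reward) to supervise intermediate reasoning steps. Also worth noting are the use of frozen 3D vision foundation models (e.g., VGGT + SAM3) to generate pseudo targets and the use of trajectory-level policy optimization (GRPO) to optimize the full pipeline jointly.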

Abstract: Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
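
GRPO is commonly described as scoring each sampled trajectory against the other trajectories drawn for the same prompt. The sketch below shows only that group-relative normalization, assuming the usual mean/standard-deviation form; the reward values are hypothetical, and how DR-MV3D combines its global and local rewards into a scalar is not specified in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize trajectory rewards within one sampled group.

    Each advantage is (reward - group mean) / (group std + eps),
    so trajectories are compared only against their own group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for four trajectories of one question.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Trajectories that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed for the baseline.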


[237] Data Selection Through Iterative Self-Filtering for Vision-Language Settings cs.CV | cs.AI | cs.LGPDF

Andrei Liviu Nicolicioiu, Sarvjeet Singh Ghotra, Morgane M. Moss, Aaron Courville

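TL;DR: This paper proposes Self-Filtering, a bootstrapped data selection method that automatically cleans large-scale noisy data for vision-language model training. It iterates between training the model and selecting data, building an evolving dataset that balances high-confidence clean samples with diverse samples from the whole distribution. Training CLIP models on datasets filtered this way improves downstream performance without additional data or pre-trained models.

Details

Motivation: Large-scale vision-language datasets are collected without manual oversight and therefore contain substantial noise, which hurts model performance. Existing remedies rely on heuristics, curated reference datasets, or pre-trained models; this paper aims for a more self-contained data-cleaning scheme.

Result: Models trained on vision-language datasets filtered by the proposed method achieve better downstream performance. The abstract names no specific benchmarks or state-of-the-art comparisons, but states that the gains require neither additional data nor pre-trained models.

Insight: The novelty is an iterative self-filtering mechanism in which the model itself selects data during training, balancing cleanliness against diversity. Viewed objectively, this bootstrapped, end-to-end cleaning loop reduces dependence on hand-written rules or external models and offers a new approach to handling noisy data at scale.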

Abstract: The availability of large amounts of clean data is paramount to training neural networks. However, at large scales, manual oversight is impractical, resulting in sizeable datasets that can be very noisy. Attempts to mitigate this obstacle to producing performant vision-language models have so far involved heuristics, curated reference datasets, and using pre-trained models. Here we propose a novel, bootstrapped method in which a CLIP model is trained on an evolving, self-selected dataset. This evolving dataset constitutes a balance of filtered, highly probable clean samples as well as diverse samples from the entire distribution. Our proposed Self-Filtering method iterates between training the model and selecting a subsequently improved data mixture. Training on vision-language datasets filtered by the proposed approach improves downstream performance without the need for additional data or pre-trained models.
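
One selection round of such a loop might look like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: `scores` stands in for per-sample quality scores from the current model (for example, image-text similarity), and the split into a top-scoring part and a uniformly sampled part, along with all names and numbers, is hypothetical.

```python
import random


def select_mixture(scores, n_clean, n_diverse, seed=0):
    """Pick sample indices for the next training round.

    Keeps the n_clean highest-scoring samples and adds n_diverse
    samples drawn uniformly from the remainder for diversity.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    clean = ranked[:n_clean]
    rest = ranked[n_clean:]
    rng = random.Random(seed)
    diverse = rng.sample(rest, min(n_diverse, len(rest)))
    return sorted(clean + diverse)


# Hypothetical scores for six image-text pairs.
scores = [0.91, 0.12, 0.77, 0.05, 0.64, 0.33]
subset = select_mixture(scores, n_clean=2, n_diverse=2)
```

In a full loop, the model would be retrained on `subset`, the scores recomputed, and the selection repeated.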


[238] AIR: Adaptive Interleaved Reasoning with Code in MLLMs cs.CV | cs.AIPDF

Cong Han, Xiaohan Lan, Haibo Qiu, Yujie Zhong

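TL;DR: This paper proposes AIR, an adaptive interleaved reasoning method that strengthens code-based reasoning in multimodal large language models (MLLMs) on complex numerical computation tasks. The model is trained with reinforcement learning using a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy built on a group-constrained reward function, addressing the inability of existing visual-operation approaches to handle numerical computation. After training, performance improves by an average of 6.1 percentage points on the evaluation benchmarks, accuracy on interleaved reasoning samples rises by 9.9 percentage points, and the overall tool-use success rate exceeds 95%.

Details

Motivation: Existing work focuses on tool use in vision-perception tasks and relies on predefined heuristics for visual manipulation, so it cannot handle numerical computation problems. This paper uses reinforcement learning to give MLLMs adaptive interleaved reasoning for code-augmented complex numerical computation tasks.

Result: After reinforcement learning training, performance improves by an average of 6.1 percentage points on the evaluation benchmarks, accuracy on interleaved reasoning samples rises by 9.9 percentage points, and the overall tool-use success rate exceeds 95%.

Insight: The contributions are a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy for interleaved reasoning trajectories based on a group-constrained reward function. Together they extend MLLMs to numerical computation tasks, beyond the limits of purely visual operations.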

Abstract: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers MLLMs with adaptive interleaved reasoning capabilities through extended reinforcement learning training on code-augmented complex numerical computation tasks. To this end, we propose a comprehensive three-component solution consisting of: a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function for interleaved reasoning trajectories. Extensive experiments demonstrate that after Reinforcement Learning training with the group-constrained reward function, performance improves by an average of 6.1 percentage points (pp) on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 pp, and the overall success rate of tool-use exceeds 95%. Our data and code are available at: https://github.com/CongHan0808/AIR.git.


[239] Semantic Browsing: Controllable Diversity for Image Generation cs.CV | cs.AI | cs.GR | cs.LGPDF

Sara Dorfman, Maya Vishnevsky, Omer Dahary, Or Patashnik, Daniel Cohen-Or

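TL;DR: This paper proposes Semantic Browsing, a method for controllable diversity in text-to-image generation. Instead of relying on stochasticity inside the generative model, it uses a vision-language model to introduce structured, interpretable semantic variation directly at the text level, letting users browse image galleries organized along explicit semantic axes.

Details

Motivation: Text-to-image models that follow prompts faithfully tend to produce samples with little diversity, and existing diversity methods mostly yield outputs driven by incidental variation rather than meaningful design choices. This motivates a new variant of the diversity task in which structure is enforced on the generated samples.

Result: The paper shows that the method produces diverse, navigable design spaces in which every variation corresponds to a specific, user-understandable semantic decision.

Insight: The core idea is a paradigm shift: diversity is induced at the text level rather than inside the image generator, exploiting the fact that text-to-image models trained on elaborated captions decouple semantic decisions from pixel generation. An agentic workflow drives the vision-language model to enforce structured variation, attuned to the original prompt, over the full scene context, which overcomes the generic outputs typical of standard VLMs.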

Abstract: Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rather than meaningful design choices. This motivates a new variant of the diversity task where structure is enforced on the generated samples. We introduce a method for controlled diversity that enables Semantic Browsing, where users can navigate structured image galleries and experience creative exploration through a systematic traversal of meaningful, interpretable axes of variation. Achieving this level of semantic control requires a deep understanding of the scene. We exploit the fact that recent text-to-image models are trained on elaborated captions, effectively decoupling semantic decision-making from pixel generation. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level. By leveraging rich textual representations, we allow a Vision Language Model (VLM) to operate on the full scene context. To overcome the generic outputs typical of standard VLMs, we employ an agentic workflow that explicitly enforces structured variation attuned to the original prompt. We demonstrate that our method produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.


[240] Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild cs.CVPDF

Yehonathan Litman, Xiaoxuan Ma, Manan Shah, Nicolas Ugrinovic, Kris Kitani

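TL;DR: Lift4D is a 4D reconstruction framework for dynamic non-rigid objects in monocular video. It adapts an existing single-view 3D model through causal latent conditioning to obtain temporally consistent per-frame predictions, which initialize a deformable 3D Gaussian Splatting representation. An occlusion-aware optimization with a view-conditioned diffusion prior then "sculpts" this representation to match the input video, improving reconstruction in complex in-the-wild scenes.

Details

Motivation: Existing methods are either limited by the scarcity of 4D training data, or use priors only for initialization and rely solely on video supervision afterwards. Neither handles complex in-the-wild scenes with large deformations and severe occlusions well.

Result: The paper shows that Lift4D clearly improves over prior 4D reconstruction methods on challenging in-the-wild sequences with severe occlusions and non-rigid motion.

Insight: The novelty lies in applying causal latent conditioning to a single-view 3D model for temporally consistent predictions, and in combining a deformable 3D Gaussian Splatting representation with occlusion-aware optimization that uses a view-conditioned diffusion prior to complete unobserved regions. Data-driven priors and video evidence are thus integrated during test-time optimization.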

Abstract: Reconstructing dynamic non-rigid objects from monocular video requires integrating visual cues from direct observations with data-driven priors over geometry and appearance. Prior approaches either learn to directly predict 4D representations from visual input or initialize a 3D representation that is subsequently deformed and refined based on video evidence. However, the former are constrained by the scarcity of 4D training data, while the latter leverage priors only for the initial reconstruction and rely solely on video supervision thereafter; neither handles complex in-the-wild scenarios with large deformations and occlusions well. We present Lift4D, a test-time optimization framework that addresses both limitations. First, we adapt an existing single-view 3D reconstruction model to yield temporally consistent per-frame predictions via causal latent conditioning, providing a coherent initialization for a deformable 3D Gaussian Splatting representation. We then "sculpt" this representation to match the input video through an occlusion-aware optimization that faithfully recovers visible surface details while completing unobserved regions using a view-conditioned diffusion prior. We demonstrate that Lift4D clearly improves over prior 4D reconstruction methods, particularly on challenging in-the-wild sequences with severe occlusions and non-rigid motion.


[241] Pose Anything Anywhere: Model-free Object Poses from Arbitrary References cs.CVPDF

Hongli Xu, Jiaqi Hu, Junwen Huang, Boyang Zhong, Peter KT Yu

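TL;DR: This paper presents PANY, a unified model-free framework for estimating the 6D pose of unseen objects from arbitrary reference images. It supports both RGB and RGB-D inputs and works with one or a few pose-free reference views. A multi-view transformer geometry backbone learns view-consistent geometry and cross-view alignment cues, giving stable matching under wide baselines and low overlap. When additional unposed assist views are available, they are aggregated through pose-graph canonical registration to increase geometric coverage and final pose accuracy.

Details

Motivation: Estimating the 6D pose of unseen objects is a hard problem for open-world robotics and embodied perception. Model-based methods depend on CAD assets or heavy onboarding, while most model-free methods are limited to pairwise single-anchor matching and fail under occlusion, large viewpoint changes, and low query-reference overlap.

Result: PANY achieves state-of-the-art performance across multiple benchmarks and substantially outperforms existing model-free methods, improving pose accuracy by +12% on YCB-V and by over +20% on LM-O. It performs consistently well in both single-reference and sparse-reference settings and shows strong robustness in real-world environments.

Insight: The method moves beyond pairwise matching: a multi-view transformer learns view-consistent geometric representations and cross-view alignment cues that stay stable under wide baselines and limited overlap. Aggregating unposed assist views through pose-graph canonical registration to increase geometric coverage is a further idea that other model-free pose estimators could borrow.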

Abstract: Estimating the 6D pose of unseen objects is a fundamental yet challenging problem for open-world robotics and embodied perception. Model-based methods are accurate but depend on CAD assets or heavy onboarding, while most model-free approaches are still limited to pairwise single-anchor matching and thus fail under occlusion and large viewpoint changes with low query-reference overlap. Therefore, we present PANY, a unified model-free framework that seamlessly supports both RGB and RGB-D inputs, operates on one or sparse pose-free reference views, and generalizes effectively to novel objects. Built on a multi-view transformer geometry backbone, PANY moves beyond pairwise matching by learning view-consistent geometry and cross-view alignment cues that remain stable under wide baselines and limited overlap. When additional unposed assist views are available, PANY aggregates them via pose-graph canonical registration to increase geometric coverage and reinforce the final pose. Extensive experiments show that PANY achieves state-of-the-art performance across multiple benchmarks, substantially outperforming existing model-free methods, improving pose accuracy by +12% on YCB-V and over +20% on LM-O. Furthermore, PANY consistently performs well under both single-reference and sparse-reference settings, demonstrating strong robustness in real-world environments.


cs.GR [Back]

[242] Multimodal Image Colorization: Quantifying the Impact of Text-Conditioned Guidance on Grayscale-to-Color Translation cs.GR | cs.CL | cs.CV | cs.LGPDF

Colten Reissmann, Hugo Garrido-Lestache Belinchon

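TL;DR: This paper quantifies the effect of text conditioning on grayscale image colorization models. Comparing two architectures, a U-Net and Stable Diffusion 1.5, with and without CLIP text conditioning, it finds that text conditioning improves PSNR, SSIM, and colorfulness and reduces LPIPS, supporting the value of text guidance in multimodal colorization.

Details

Motivation: Grayscale colorization is ambiguous: many plausible colorizations can correspond to the same grayscale input. The study quantifies how text conditioning affects pixel-level and perceptual metrics of colorization models as a way to address that ambiguity.

Result: In the U-Net tier, text conditioning improves PSNR by 5.6%, SSIM by 1.2%, and colorfulness by 36.6%, and reduces LPIPS by 7.6%. In the Stable Diffusion tier, it improves PSNR by 5.8%, SSIM by 1.5%, and colorfulness by 0.6%, and reduces LPIPS by 11.3%. Text conditioning thus improves colorization quality consistently across both architecture scales.

Insight: The contribution is the first systematic quantification of how text conditioning affects multimodal image colorization, with controlled experiments showing that the benefit holds across architectures. This provides measurable evidence for using language models to improve vision tasks.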

Abstract: Grayscale images are commonly found in historical photography restoration, medical imaging, and artistic media. However, automatically applying color to these images remains a significant challenge in computer vision because many plausible colorizations can correspond to the same grayscale input. In this work, we quantify the effect of text conditioning on pixel-level and perceptual metrics for grayscale-to-color image models. Specifically, we compare two architectures, a U-Net and Stable Diffusion 1.5, each tested with and without CLIP text conditioning while holding all other variables constant. Our results show that text conditioning improves PSNR by 5.6%, SSIM by 1.2%, and colorfulness by 36.6%, while reducing LPIPS by 7.6% in the U-Net tier. In the Stable Diffusion tier, text conditioning improves PSNR by 5.8%, SSIM by 1.5%, and colorfulness by 0.6%, while reducing LPIPS by 11.3%. These results indicate that text conditioning provides consistent, measurable improvements to colorization quality across both architecture scales.
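
For readers unfamiliar with the pixel-level metric, PSNR is defined as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it on flat pixel lists and then a relative change in percent, assuming the reported gains are relative changes (the abstract does not say how the percentages are defined). The pixel data and the PSNR values 20.0 and 21.12 are hypothetical, chosen only to reproduce a 5.6% change.

```python
from math import log10


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally long pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * log10(max_value ** 2 / mse)


def relative_change_percent(baseline, conditioned):
    """Relative change of a metric, in percent of the baseline value."""
    return 100.0 * (conditioned - baseline) / baseline


# Hypothetical pixels: every pixel is off by 0.1 on a [0, 1] scale.
score = psnr([0.0, 0.0], [0.1, 0.1])

# Hypothetical PSNR values (dB) without and with text conditioning.
gain = relative_change_percent(20.0, 21.12)
```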


[243] MeshFlow: Mesh Generation with Equivariant Flow Matching cs.GR | cs.CVPDF

Qi Sun, Kiyohiro Nakayama, Jing Nathan Yan, Qixing Huang, Alexander Rush

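TL;DR: MeshFlow generates triangle meshes directly. It uses equivariant optimal-transport flow matching models that respect the symmetries of meshes, avoiding the need to serialize meshes into long autoregressive sequences. The method matches the mesh quality of autoregressive generators while being much faster at inference.

Details

Motivation: Generating meshes directly is hard because the representation contains important symmetries, such as permutation invariance of faces and vertices. MeshFlow is designed to respect them.

Result: MeshFlow achieves mesh quality comparable to state-of-the-art autoregressive mesh generators with about an 18x inference speedup.

Insight: The contributions are equivariant optimal-transport flow matching models that respect mesh symmetries, a simple and effective modification of the Diffusion Transformer architecture that models a velocity field while staying equivariant, and an optimal-transport-based training objective that improves convergence by eliminating supervision signals that violate the symmetries.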

Abstract: Meshes are among the most common 3D scene representations, but directly generating meshes is challenging because the representation contains important symmetries, including permutation invariance of faces and vertices. MeshFlow learns to generate triangle meshes directly as triangle soups, avoiding the need to serialize meshes into long autoregressive sequences. We adopt equivariant optimal-transport flow matching models that respect the key symmetries of triangle soups: arbitrary permutations of faces and permutations of the vertices within each face. Toward this goal, we propose a simple yet effective modification to the Diffusion Transformer architecture, resulting in a scalable network capable of modeling a velocity field while maintaining the desired equivariance. We further introduce an optimal-transport-based training objective that improves convergence by eliminating supervision signals that violate these symmetries. MeshFlow achieves mesh quality comparable to state-of-the-art autoregressive mesh generators while providing about an 18$\times$ speedup during inference. Project page is at https://qiisun.github.io/MeshFlow/.
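
The two symmetries named in the abstract can be made concrete with a small check: reordering the faces, or the vertices inside a face, should not change which triangle soup is represented. The sketch below uses a hypothetical canonicalization (sort vertices within each face, then sort faces) purely to test that property; it is not part of MeshFlow, and it treats every vertex permutation within a face as equivalent, which ignores orientation.

```python
def canonical_soup(faces):
    """Return an order-independent form of a triangle soup.

    faces: list of triangles, each a tuple of three (x, y, z) vertices.
    Vertices are sorted within each face, then the faces are sorted.
    """
    return sorted(tuple(sorted(face)) for face in faces)


# Two triangles (hypothetical coordinates).
soup_a = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
    ((1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)),
]

# Same soup with the faces swapped and vertices permuted inside each face.
soup_b = [
    ((0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)),
    ((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
]

same_mesh = canonical_soup(soup_a) == canonical_soup(soup_b)
```

An equivariant generator sidesteps this ambiguity inside the network rather than by sorting, so no arbitrary ordering has to be imposed on the training targets.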


eess.IV [Back]

[244] MoECodec: Image Compression for joint human and machine perception via Mixture-of-Experts eess.IV | cs.AI | cs.CVPDF

Jiancheng Zhao, Xiang Ji, Yifan Zhan, Zunian Wan, Yinqiang Zheng

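TL;DR: MoECodec is an image compression framework for machine perception that integrates a Mixture-of-Experts (MoE) mechanism to support multiple downstream vision tasks within a single model. It uses a transformer-based architecture in which MoE layers replace the feed-forward networks, enabling dynamic, token-level computation conditioned on the input content and the task objective.

Details

Motivation: Existing machine-oriented compression methods are either task-specific end-to-end designs, which raise parameter and deployment overhead, or transfer-based adaptations, which remain externally attached and rely on heuristic task design. Both share a largely static computation pattern that does not treat image regions differently even though their semantic importance and complexity for machine perception differ markedly.

Result: Extensive experiments on conventional image reconstruction and machine tasks show consistent improvement over the baselines.

Insight: The main contributions are: 1) introducing a Mixture-of-Experts (MoE) mechanism into a transformer-based compression model for token-aware dynamic computation; 2) a stable routing strategy that combines expert-choice routing with spatial total variation regularization to encourage spatially coherent expert assignments; 3) a lightweight expert architecture, Group Shuffle MLP (GShMLP), to control parameter growth.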

Abstract: Image compression for machines calls for a unified codec that serves multiple downstream vision tasks. Existing approaches either adopt task-specific end-to-end designs, raising parameter and deployment overhead, or rely on transfer-based adaptations that remain externally attached and depend on heuristic task design. A key limitation shared by both lines of work is their largely static computation pattern, which applies similar transformations across tokens despite the fact that different image regions exhibit markedly different semantic importance and complexity for machine perception. We propose MoECodec, a token-aware image compression framework that supports multiple downstream tasks within a single model. MoECodec replaces the FFN layers in a transformer-based compression model with token-wise Mixture-of-Experts (MoE), enabling dynamic, token-level computation conditioned on the input content and task objective. To make MoE effective in a compression model, we introduce a stable routing strategy that combines expert-choice routing with spatial total variation regularization to encourage spatially coherent assignments, and we propose a lightweight expert architecture, Group Shuffle MLP (GShMLP), to control parameter growth. Extensive experiments show consistent improvement against baselines on both conventional image reconstruction and machine tasks.
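
In expert-choice routing, as the term is generally used, each expert selects the tokens it scores highest, rather than each token selecting experts. The sketch below shows that selection rule only; the affinity matrix and the capacity are hypothetical, and the spatial total variation regularizer mentioned in the abstract is not modeled.

```python
def expert_choice_route(affinity, capacity):
    """Let each expert pick its top-`capacity` tokens.

    affinity[t][e] is the router score of token t for expert e.
    Returns a dict mapping expert index to a sorted list of token indices.
    """
    num_tokens = len(affinity)
    num_experts = len(affinity[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: affinity[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment


# Hypothetical scores for four tokens and two experts.
affinity = [
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
    [0.1, 0.6],
]
routes = expert_choice_route(affinity, capacity=2)
```

Because every expert processes exactly `capacity` tokens, the load is balanced by construction, which is one reason this routing style is considered stable.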


[245] A Skin-Tone-Aware Dual-Representation Remote Photoplethysmography Framework for Contactless Respiratory Rate Estimation eess.IV | cs.CVPDF

Trishna Saikia, Anup Kumar Gupta, Puneet Gupta, Pasi Liljeberg

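TL;DR: This paper proposes a skin-tone-aware, dual-representation remote photoplethysmography framework for contactless respiratory rate estimation from facial video. It combines a skin-tone-aware dynamic RGB signal projection with a denoising network for motion-based signals, and introduces a phase-independent contrastive loss so that the Eulerian and Lagrangian representations learn respiratory rate information jointly.

Details

Motivation: Conventional respiratory rate estimation is usually intrusive because it requires contact. Existing remote photoplethysmography methods are designed mainly for heart rate, leaving respiratory rate underexplored, and they typically use fixed or empirically selected RGB projections that capture respiratory dynamics only partially.

Result: Evaluated on RR-rPPG and the public COHFACE dataset, the method consistently outperforms the comparison methods and reduces mean absolute error by up to 42.1% across the evaluated settings.

Insight: The contributions are the skin-tone-aware dynamic RGB signal projection, a denoising network for the Lagrangian representation, and the phase-independent contrastive loss. The paper also contributes RR-rPPG, a respiratory-rate facial video dataset with Indian demographic representation, as a diverse benchmark resource for future work on remote respiratory monitoring.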

Abstract: Respiratory rate is a vital indicator of pulmonary and cardiovascular health, yet conventional methods for estimating respiratory rate are often intrusive due to their contact-based nature. Remote photoplethysmography offers a promising non-contact alternative and has been widely used for heart rate estimation; however, its potential for respiratory rate estimation remains underexplored. Existing methods typically adapt green and chrominance-based projections originally designed for heart rate estimation, which only partially capture respiratory dynamics. Most prior work focuses on the Eulerian representation with fixed or empirically selected RGB projections. To address these gaps, we propose a skin-tone-aware dynamic RGB signal projection that captures respiratory information. To mitigate the sensitivity of the Lagrangian representation to non-respiratory motion, we introduce a denoising network for motion-based remote photoplethysmography signals. We further design a phase-independent contrastive loss that enables Eulerian and Lagrangian representations to collaboratively learn respiratory rate information. We also introduce RR-rPPG, a respiratory-rate facial video dataset with Indian demographic representation. We evaluate the method on RR-rPPG and the publicly available COHFACE dataset, where it consistently outperforms comparison methods and achieves up to a 42.1% reduction in mean absolute error across the evaluated settings. The proposed framework demonstrates the effectiveness of jointly leveraging skin-tone-aware Eulerian and denoised Lagrangian representations for contactless respiratory rate estimation from facial videos. In addition, RR-rPPG contributes a diverse benchmark resource for future research in remote respiratory monitoring. The code and dataset will be made publicly available upon paper acceptance.


[246] Specificity- and Calibration-Aware Breast Ultrasound Segmentation via Entropy-Guided Boundary Supervision eess.IV | cs.CV | cs.LGPDF

Manar Alsaid, Mandip Shrestha, Mohammad Abbas

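TL;DR: This paper proposes an entropy-guided boundary supervision loss for lesion segmentation in breast ultrasound. The loss scales boundary penalties by per-pixel predictive entropy and the ground-truth boundary map, concentrating gradients on lesion margins where the network is uncertain. This addresses both boundary leakage in images with lesions and false-positive activations in images without lesions.

Details

Motivation: Breast ultrasound lesion segmentation faces two related problems. In images with lesions, speckle noise, low tissue contrast, and posterior acoustic shadowing cause boundary leakage and incomplete contours. In images without lesions, the same artifacts produce false-positive activations in regions that resemble solid lesion tissue.

Result: The loss was evaluated on BUSI against two baselines: no boundary supervision, and uniformly weighted boundary binary cross-entropy. On 97 lesion-containing test images, mean Dice did not differ significantly between the proposed method and the no-boundary baseline (0.7624 vs. 0.7616). On 20 no-lesion test images, false-positive activations fell from 14/20 and 19/20 for the two baselines to 5/20. A post-hoc spatial temperature scaling step further reduced expected calibration error from 0.0201 to 0.0095.

Insight: The novelty is combining per-pixel predictive entropy with the ground-truth boundary map to weight boundary supervision dynamically, focusing training on boundary regions where the network is uncertain. Entropy-guided boundary supervision (training level) and spatial calibration (inference level) act as complementary refinements that improve specificity (fewer false positives) and probability reliability (better calibration) within a U-Net framework.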

Abstract: Lesion segmentation in breast ultrasound involves two related challenges. In images with lesions, speckle noise, low tissue contrast, and posterior acoustic shadowing cause boundary leakage and incomplete contour delineation. In images without lesions, those same artifacts generate false-positive activations in regions resembling solid lesion tissue. This study addresses both failure modes through a single modification to the training objective. Rather than weighting every boundary pixel equally, the proposed loss scales contour penalties by per-pixel predictive entropy and the ground-truth boundary map, concentrating gradient emphasis on lesion margin locations where the network remains uncertain. The loss was evaluated on the BUSI dataset through a controlled ablation against two baselines: a model without boundary supervision and a model with uniformly weighted boundary binary cross-entropy. Across 97 lesion-containing test images, mean Dice scores were statistically indistinguishable between the proposed method and the no-boundary baseline (0.7624 versus 0.7616, paired Wilcoxon p = 0.27), confirming that lesion segmentation quality is preserved. The primary effect appears in specificity. False-positive activations on 20 no-lesion test images fell from 14 of 20 and 19 of 20 for the two baselines to 5 of 20 with the proposed approach (McNemar p = 0.012 and 0.0005). Non-overlapping Wilson 95% confidence intervals confirm the difference is both statistically significant and practically substantial. A post-hoc spatial temperature scaling step further reduced expected calibration error from 0.0201 to 0.0095 without altering segmentation masks. Entropy-guided boundary supervision and spatial calibration thus function as complementary training-level and inference-level refinements that improve specificity and probability reliability within a U-Net framework.
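
The statement about non-overlapping Wilson intervals can be checked from the reported counts alone. The sketch below assumes the standard Wilson score interval without continuity correction and $z \approx 1.96$; the abstract does not say which variant the authors used, and the variable names are hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.959964):
    """95% Wilson score interval for a binomial proportion (no continuity correction)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# False-positive counts on the 20 no-lesion test images, as reported.
proposed = wilson_interval(5, 20)     # roughly (0.112, 0.469)
baseline_a = wilson_interval(14, 20)  # roughly (0.481, 0.855)
baseline_b = wilson_interval(19, 20)  # roughly (0.764, 0.991)

non_overlapping = proposed[1] < baseline_a[0] and proposed[1] < baseline_b[0]
```

Under these assumptions the upper bound for the proposed method (about 0.47) sits just below the lower bound of the nearer baseline (about 0.48), consistent with the abstract's claim, although the margin against the 14/20 baseline is narrow.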


[247] ZeroGVC: Zero-Shot Generative Video Compression with Autoregressive Diffusion Priors eess.IV | cs.CVPDF

Yixin Gao, Xiaohan Pan, Lin Liu, Xin Li, Zhibo Chen

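TL;DR: This paper proposes ZeroGVC, a zero-shot generative video compression framework that uses pretrained autoregressive diffusion priors for low-delay video reconstruction. The first frame of each GOP is encoded with an image codec, and subsequent P-frames are represented through Codebook-Guided Autoregressive Latent Compression. The method achieves superior perceptual reconstruction quality at ultra-low bitrates without any additional training.

Details

Motivation: Existing generative video compression methods usually need additional training to adapt the generative model to compact representations. This paper instead uses pretrained diffusion priors for training-free, zero-shot compression, lowering computational cost and improving practicality.

Result: On standard video compression benchmarks, ZeroGVC achieves superior perceptual reconstruction quality at ultra-low bitrates without any additional training.

Insight: The novelty is applying the compression scheme of denoising diffusion codebook models to few-step consistency sampling: compact combinations of reproducible codebook noise vectors steer the latent denoising trajectory. An optional bidirectional reference mode also uses the context of the next I-frame to mitigate error propagation without any additional bitrate overhead.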

Abstract: Recent generative video compression methods leverage powerful generative priors to achieve perceptually pleasing reconstructions. However, most existing approaches require additional training to adapt generative models to produce realistic reconstructions from compact representations. In this paper, we propose ZeroGVC, a zero-shot generative video compression framework that leverages pretrained autoregressive diffusion priors for low-delay video reconstruction. ZeroGVC encodes the first frame of each group of pictures (GOP) with an image codec and represents subsequent P-frames through Codebook-Guided Autoregressive Latent Compression. This design is motivated by our observation that the compression scheme of denoising diffusion codebook models is effective in few-step consistency sampling. By selecting compact combinations of reproducible codebook noise vectors, ZeroGVC steers the latent denoising trajectory toward the target P-frame while allowing the decoder to reproduce the same trajectory in only a few denoising steps. In addition, we design an optional bidirectional reference mode that mitigates error propagation by leveraging the next I-frame context without introducing any additional bitrate overhead. Extensive experiments on standard video compression benchmarks demonstrate that ZeroGVC achieves superior perceptual reconstruction quality at ultra-low bitrates without any additional training.


[248] IViT: A Novel Interpretable Visual Transformer for Skin Disease Detection eess.IV | cs.CVPDF

Haibiao Li, Di Lin, Xue Jiang, Weiwei Wu, Yanxi Li

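TL;DR: This paper proposes IViT, an interpretable Vision Transformer for skin disease detection. It uses pre-trained transfer learning to adapt to medical few-shot scenarios and builds a discrete feature selection framework based on Quadratic Programming to screen generic and discriminative features consistent with clinical diagnostic logic. Experiments on six standard skin disease datasets show that IViT balances high accuracy with strong interpretability.

Details

Motivation: Clinical diagnosis of skin diseases is affected by inter-class similarity of lesions and by subjective bias from over-reliance on clinicians' experience. Existing ViT-based deep learning diagnosis aids suffer from black-box opacity and adapt poorly to medical few-shot scenarios, and mainstream explainable algorithms usually lose significant accuracy when they improve interpretability.

Result: On six standard skin disease datasets, IViT reaches 93.80% accuracy, only 0.21% below the baseline, while reducing feature redundancy by 29.5%. Its core activation regions are consistent with the lesion areas that clinicians attend to.

Insight: The novelty is an interpretable ViT framework constrained by Quadratic Programming: discrete QP feature selection screens features consistent with clinical logic, and a multi-objective loss reduces feature redundancy and optimizes the activation distribution while preserving classification performance. The result is an effective balance of accuracy and interpretability in few-shot scenarios.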

Abstract: The clinical diagnosis of skin diseases is susceptible to interference from inter-class similarity of skin lesions, and over-reliance on clinicians' experience easily leads to subjective bias. Although existing deep learning aided diagnosis methods achieve competitive accuracy, they suffer from the black-box opacity of Vision Transformer (ViT) and poor adaptability to medical few-shot scenarios. Moreover, mainstream explainable algorithms generally face the bottleneck of significant accuracy degradation when improving interpretability. This paper proposes an interpretable ViT (IViT) constrained by Quadratic Programming (QP). The introduced pre-trained transfer learning adapts to few-shot feature extraction. A discrete QP feature selection framework is constructed to screen generic and discriminative features consistent with clinical diagnostic logic. A multi-objective loss function is designed to reduce feature redundancy and optimize activation distribution while preserving classification performance. Experimental results on six standard skin disease datasets show that IViT achieves an accuracy of 93.80%, only 0.21% lower than the baseline, with feature redundancy reduced by 29.5%. Its core activation regions are consistent with clinically concerned lesion areas. The proposed model balances accuracy and interpretability, providing a reliable solution for the clinical deployment of few-shot intelligent skin disease diagnosis.


[249] NGPS: Structure-Preserving Self-Supervised Denoising via Neighbor-Guided Patch Sampling eess.IV | cs.CVPDF

Jaehyun Cho, YoungJoon Yoo

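TL;DR: NGPS is a lightweight self-supervised denoising framework that uses neighbor-guided patch sampling to handle inter-slice misalignment in volumetric medical images without explicit registration. It decouples structure matching from signal retrieval: structural similarity is matched on a noise-attenuated guide image, while the supervision signal is retrieved from the raw noisy neighboring slice, yielding local pseudo targets.

Details

Motivation: In volumetric medical imaging, misalignment between adjacent slices breaks anatomical correspondence, so using neighboring slices directly as supervision targets causes ghosting and blurred margins. Existing methods usually mask discrepant regions to avoid learning from misleading targets, but this leaves a non-trivial portion of neighboring evidence unused, particularly around high-frequency anatomical boundaries.

Result: On the evaluated CT and synthetic-Rician MRI settings, NGPS improves both fidelity and structure-sensitive metrics.

Insight: The novelty is decoupling structure matching (on a noise-attenuated guide image) from signal retrieval (from the raw noisy neighbor). This avoids a learned registration module while exploiting more of the neighboring evidence, especially around high-frequency anatomical boundaries.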

Abstract: Neighboring-slice self-supervised denoising is attractive for volumetric medical imaging, yet inter-slice misalignment breaks anatomical correspondence and often yields ghosting and blurred margins when adjacent slices are used naively as targets. We propose Neighbor-Guided Patch Sampling (NGPS), a lightweight framework that constructs neighboring supervision under local inter-slice misalignment without explicit registration. To avoid learning from misleading targets, prior methods commonly mask discrepant regions, but this stabilizes training at the cost of leaving a non-trivial portion of neighboring evidence unexploited, particularly around high-frequency anatomical boundaries. NGPS addresses this by decoupling structure matching from signal retrieval: for each masked location, it searches a local neighborhood for structurally similar candidate patches using a simple guide image (e.g., fast bilateral filtering), while retrieving the supervision signal directly from the raw noisy neighbor at the matched coordinates. By matching on a noise-attenuated guide while retrieving raw values from neighboring slices, NGPS constructs local pseudo targets without a learned registration module. Across the evaluated CT and synthetic-Rician MRI settings, NGPS improves fidelity and structure-sensitive metrics. Code is available at https://github.com/cv-cho/NGPS .
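
The decoupling of matching from retrieval can be sketched in one dimension. This is an illustrative toy under stated assumptions, not the NGPS implementation: the guide signals are taken as given (the abstract suggests something like fast bilateral filtering), patches are compared by sum of squared differences, and all names and values are hypothetical.

```python
def retrieve_pseudo_target(guide_cur, guide_nb, raw_nb, i, radius=1, search=3):
    """Match on the guide signals, then read the raw neighbor value.

    guide_cur, guide_nb: noise-attenuated guides of the current and neighbor slice.
    raw_nb: raw noisy neighbor slice, aligned index-wise with guide_nb.
    Requires radius <= i < len(guide_cur) - radius.
    """
    n = len(guide_cur)
    ref = guide_cur[i - radius:i + radius + 1]
    best_j, best_cost = None, float("inf")
    for j in range(max(radius, i - search), min(n - radius, i + search + 1)):
        cand = guide_nb[j - radius:j + radius + 1]
        cost = sum((a - b) ** 2 for a, b in zip(ref, cand))
        if cost < best_cost:
            best_j, best_cost = j, cost
    # Supervision comes from the raw neighbor, not from the smoothed guide.
    return best_j, raw_nb[best_j]


# The neighbor's structure is shifted right by two samples (hypothetical data).
guide_cur = [0, 0, 0, 1, 5, 1, 0, 0, 0, 0]
guide_nb = [0, 0, 0, 0, 0, 1, 5, 1, 0, 0]
raw_nb = [0.3, -0.2, 0.1, 0.4, -0.1, 1.2, 4.7, 0.8, 0.2, -0.3]

matched_index, pseudo_target = retrieve_pseudo_target(guide_cur, guide_nb, raw_nb, i=4)
```

Matching on the smoothed guide keeps noise from corrupting the correspondence search, while reading the target from the raw slice preserves the noise statistics that neighbor-based self-supervision relies on.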


cs.AI [Back]

[250] PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate cs.AI | cs.CL | cs.LG | cs.MA | stat.MLPDF

Yang Feng, Ziwei Xu, Xia Hu, Fengxiang He

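TL;DR: This paper proposes PEAR (Permutation-Equivariant Adaptive Routing Multi-Agent Debate), an inference-time protocol that addresses problems caused by fixed topologies in multi-agent debate: persistent positional bias, amplification of unreliable agents, and high sensitivity to role assignments. It dynamically reconfigures communication roles and sparse topologies across consecutive debate rounds based on agent states, so that no agent permanently occupies a privileged network position and influence is distributed more evenly.

Details

Motivation: Fixed topologies in conventional multi-agent debate introduce persistent positional bias, amplify the influence of unreliable agents, and are overly sensitive to role assignments, which limits the reliability and fairness of the debate.

Result: Comprehensive evaluations across four reasoning benchmarks and six different LLM backbones show that PEAR significantly improves average accuracy over the strongest debate baselines.

Insight: The novelty is a theoretical characterization of PEAR as an equivariant sparse router: it preserves accuracy under agent relabeling while reducing routing complexity and improving generalization, with dynamic role assignment and topology reconfiguration to improve the debate process.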

Abstract: Multi-agent debate improves the reliability of large language models (LLMs) through iterative peer critiques. However, fixed topologies often introduce persistent positional biases, amplify unreliable agents, and cause high sensitivity to role assignments. We introduce Permutation-Equivariant Adaptive Routing Multi-Agent Debate (PEAR), an inference-time protocol that dynamically reconfigures communication roles and sparse topologies across consecutive debate rounds. By strategically switching agent-to-role assignments based on evolving agent states, PEAR prevents any agent from permanently occupying a privileged network position and distributes influence more evenly across the debate. We theoretically characterize PEAR as an equivariant sparse router: it preserves accuracy under agent relabeling while reducing routing complexity and improving generalization. Comprehensive empirical evaluations across four reasoning benchmarks and six diverse LLM backbones demonstrate that PEAR significantly improves average accuracy over the strongest debate baselines. The code is at https://github.com/EVIEHub/PEAR.
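
Permutation equivariance means that relabeling the agents relabels the output in the same way and changes nothing else. The sketch below checks this property for a hypothetical router that assigns roles by ranking per-agent state scores; PEAR's actual routing rule is not given in the abstract, so the ranking rule, names, and values are assumptions.

```python
def assign_roles(scores):
    """Assign role 0 to the highest-scoring agent, role 1 to the next, and so on."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    roles = [0] * len(scores)
    for rank, agent in enumerate(order):
        roles[agent] = rank
    return roles


def permute(values, perm):
    """Relabel agents: position i of the result holds values[perm[i]]."""
    return [values[p] for p in perm]


scores = [0.2, 0.9, 0.5]  # hypothetical per-agent state scores
perm = [2, 0, 1]          # an arbitrary relabeling of the three agents

# Equivariance: routing the relabeled agents equals relabeling the routed roles.
lhs = assign_roles(permute(scores, perm))
rhs = permute(assign_roles(scores), perm)
```

Because roles depend only on agent states and not on agent indices, no agent gains an advantage from its label, which is the sense in which such a router avoids positional bias.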


[251] In LLM Reasoning, there is Irrationality on top of Value Misalignment cs.AI | cs.CL | cs.LG | stat.MLPDF

Kejiang Qian, Fengxiang He

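TL;DR: This paper argues that even when a large language model (LLM) has been aligned with a target value function in post-training, it may still fail to maximise that value during reasoning. The authors formalise this gap as "rational value risk" and show experimentally that the risk is widespread and sensitive to the reasoning strategy.

Details

Motivation: The paper sets out to expose and quantify a basic problem: after value alignment, an LLM's reasoning behaviour can still deviate from the optimal (rational) decision. This deviation is the "rational value risk".

Result: Extensive experiments on the Llama-3.1, Qwen-2.5, and Tülu-3 families, GPT-5.2, GPT-5.5, and DeepSeek-V4, across benchmarks including UltraFeedback and GSM8K, confirm that rational value risk is widespread. Value alignment can reduce but not eliminate it, and both the inference-time reasoning strategy and the reasoning length affect it markedly.

Insight: The contribution is formalising and empirically demonstrating rational value risk, which separates suboptimal behaviour during reasoning from the value alignment problem, and decomposing the sources of its estimation error. This offers a new perspective for understanding and improving the rationality of LLM reasoning.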

Abstract: Significant progress has been made in aligning LLMs with target value functions. We argue that, even when an LLM has been well aligned in (post-)training, it may still fail to maximise the aligned value in reasoning. We mathematically formalise this gap as rational value risk: the utility discrepancy between a model’s deployed reasoning strategy and its rational counterpart, which is defined to be the responses that maximise expected utility in the steepest direction. The estimation error of rational value risk is further decomposed into three components from finite candidates, finite prompts, and imperfect verifiers. Extensive experiments are conducted, covering the Llama-3.1, Qwen-2.5, and Tülu-3 model families (7B-72B), GPT-5.2, GPT-5.5, and DeepSeek-V4, and the benchmarks UltraFeedback, AlpacaEval, GSM8K, MATH, HumanEval, and MathArena. The results validate that (1) rational value risk is widespread; (2) value alignment can reduce, but cannot eliminate, it; (3) the risk is highly sensitive to inference-time reasoning strategy; and (4) longer reasoning improves rationality with diminishing returns. The code is at https://github.com/EVIEHub/LLM-Rationality.


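[252] AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents cs.AI | cs.CL | cs.LG

Hang Yu, Zifan Zheng, Jeff Z. Pan, Tongliang Liu, Zhiyong Wang

TL;DR: AlphaMemo is a self-evolving alpha-mining agent with structured search-process memory, designed to address the combinatorial search space, noisy non-stationary feedback, redundant discoveries, and overfitting that LLM agents face in alpha mining. It records reusable evidence about which edit motifs work or fail under specific parent-factor contexts, improving out-of-sample performance and fixed-budget discovery efficiency.

Details

Motivation: LLM agents in alpha mining face a combinatorial search space, noisy non-stationary feedback, redundant discoveries, and overfitting. They need a way to record and exploit search experience without the bias introduced by naively reusing past successes.

Result: Experiments on CSI 500 and S&P 500 show improved out-of-sample performance and fixed-budget discovery efficiency; ablations validate the roles of residual learning, confidence gating, AST-diff motifs, and veto memory.

Insight: The novelty is the design of structured search-process memory: extracting edit motifs from AST differences, applying confidence-gated residual memory, and using asymmetric veto control to suppress high-confidence failure patterns, which reduces redundancy and overfitting risk during search.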

Abstract: LLM agents are promising for alpha mining via combining financial priors, symbolic reasoning, executable factor generation, and feedback-driven refinement. Yet, they face a combinatorial search space, noisy non-stationary feedback, redundant discoveries, and overfitting risks from naively reusing past successes. To address these challenges, we propose AlphaMemo, a self-evolving alpha mining agent with Structured Search-Process Memory. Rather than memorizing only final factors or full trajectories, AlphaMemo records reusable evidence about which edit motifs work or fail under specific parent-factor contexts. It extracts motifs from Abstract Syntax Tree (AST) differences, applies confidence-gated residual memory on top of a search-ledger prior, and uses asymmetric veto control to suppress high-confidence failure patterns. Experiments on CSI 500 and S&P 500 show improved out-of-sample performance and fixed-budget discovery efficiency, with ablations validating the roles of residual learning, confidence gating, AST-diff motifs, and veto memory. Code is at https://github.com/jarrettyu/AlphaMemo.


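[253] SkillHarness: Harnessing Safe Skills for Computer-Use Agents cs.AI | cs.CL | cs.CR | cs.LG

Yurun Chen, Biao Yi, Keting Yin, Shengyu Zhang

TL;DR: The paper proposes SkillHarness, a framework that lets computer-use agents (CUAs) learn and use skills safely in dynamic interactive environments. It models skill learning and use as a safety-constrained interaction process, introduces a skill boundary to identify safe skills, and uses selective skill reuse to guide task decomposition and completion.

Details

Motivation: Existing skill-learning methods largely assume static and safe environments and overlook the risks from adversarial interactions and environmental dynamics. This can lead to risky skill learning and brittle execution, undermining the reliability of CUAs.

Result: Experiments show that SkillHarness reduces the unsafe rate of learned skills by 57.1% and consistently improves execution stability under dynamic environmental changes, outperforming existing baselines.

Insight: The novelty is in modeling skill learning and use as a safety-constrained interaction process, introducing the "skill boundary" that uses multi-source supervision signals to identify safe skills, and adding a mechanism that decomposes tasks according to context and selectively activates subsets of skills.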

Abstract: Computer-Use Agents (CUAs) are increasingly deployed in dynamic interactive environments, creating a growing need for continual skill learning during interaction. Recent approaches address this challenge by learning reusable skills from successful trajectories. However, these skill learning methods largely assume static and safe environments, overlooking risks from adversarial interactions (e.g., prompt injections) and environmental dynamics (e.g., pop-ups). In dynamic settings, such assumptions can lead to risky skill learning and brittle execution, undermining the reliability of CUAs. This raises the question: how can CUAs learn and use skills safely in dynamic environments? To address this problem, we propose SkillHarness, a framework for safe skill harnessing in dynamic environments. SkillHarness moves beyond static skill abstractions by modeling skill learning and utilization as a safety-constrained interaction process. Specifically, we introduce the skill boundary that leverages multi-source supervision signals to identify safe skills from interaction trajectories, and construct self-improving safety constraints throughout the skill lifecycle. In addition, SkillHarness introduces selective skill reuse, where tasks are guided to decompose according to context and completed through the selective activation of skill subsets. Our experiments demonstrate that SkillHarness significantly reduces the unsafe rate of learned skills by 57.1% and consistently improves execution stability under dynamic environmental changes, outperforming existing baselines.


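[254] From Knowing to Acting: Benchmarking Self-Awareness Capability of LLM Agents cs.AI | cs.CL | cs.LG

Yifan Li, Shengbin Yue, Boyu Feng, Jinhu Qi, Bo Ke

TL;DR: The paper proposes the KAPRO framework and the KAware dataset for evaluating the self-awareness capability of LLM agents, i.e., the ability to judge whether a problem requires external tools or can be solved with internal parametric knowledge alone. Experiments show that self-awareness correlates strongly with task success but degrades sharply in internal-capability settings, and that models differ markedly in how much they overuse tools.

Details

Motivation: Current LLM agent benchmarks focus on execution success and neglect self-awareness capability: the key cognitive ability to judge whether a problem requires external resources.

Result: Extensive experiments on KAware show that self-awareness capability is strongly correlated with task success but degrades sharply in internal-capability settings. Open-source and instruction-following models exhibit stronger tool overuse due to shallow pattern matching, whereas proprietary and reasoning-oriented models show more reliable cognitive gating.

Insight: The core novelty is decoupling an agent's metacognitive judgment (Knowing) from its spontaneous execution (Acting) in order to evaluate cognitive-behavioral alignment, together with a dataset that systematically partitions the task space into external, internal, and hybrid subspaces to probe knowledge boundaries. This offers a new angle for evaluating the reliability of agent cognitive architectures.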

Abstract: The integration of external tools has transitioned LLM agents from passive responders to autonomous systems. However, current benchmarks prioritize execution success, neglecting self-awareness capability, the ability to discern whether a problem requires necessary external resources or can be solved via internal parametric knowledge. To address this, we introduce KAPRO (Knowing-Acting Quadrant PRObe), a framework that evaluates cognitive-behavioral alignment by decoupling an agent’s metacognitive judgment (Knowing) from its spontaneous execution (Acting). We further construct KAware, a dataset rigorously partitioning tasks into external, internal, and hybrid subspaces to systematically probe these epistemic boundaries. Extensive experiments across diverse agent architectures show that self-awareness capability is strongly correlated with task success but degrades sharply in internal-capability settings. Moreover, open-source and instruction-following models exhibit stronger tool overuse due to shallow pattern matching, while proprietary and reasoning-oriented models demonstrate more reliable cognitive gating. Benchmark and codes are available at https://github.com/AI-Santiago/KAware.


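[255] Building Agent Harnesses for Scientific Curation from Multimodal Sources cs.AI | cs.CL

Sheng Zhang, Qin Liu, Renqian Luo, Shufang Xie, Reuben Tan

TL;DR: The paper proposes Beaver, an agent harness for extracting structured information from multimodal scientific literature. By combining a frontier agent with multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch, it turns scientific curation into a staged, auditable workflow and supports an iterative evaluate-diagnose-revise loop. Beaver reaches 81.0 on GRAS, an attribute-level agreement metric, substantially outperforming frontier agents.

Details

Motivation: Current agents struggle with scientific curation because the key evidence is scattered across long text, dense tables, and figures, and the final records require reasoning across multiple evidence fragments rather than copying a single text span.

Result: Beaver reaches 81.0 on the Gold-Referenced Attribute Score (GRAS), more than 23 absolute points above frontier agents. Ablations show that task scaffolding, multimodal evidence tooling, and provenance traces each contribute meaningfully, with the largest gains on high-value attributes that require cross-modal reasoning and normalization.

Insight: The novelty is an agent harness that integrates multimodal evidence handling, task scaffolding, and an auditable workflow, turning curation into a staged process that can be improved iteratively. The central insight is that, for scientific curation from multimodal evidence, harness design (rather than the capability of a single model) is a key determinant of agent performance.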

Abstract: Scientific discovery workflows often depend on structured curation from the literature. This is difficult for current agents because the key evidence is scattered across long text, dense tables, and figures, and the final records often require reasoning across multiple evidence fragments rather than copying a single span. We study scientific curation from multimodal sources and introduce Beaver, an agent harness that extracts structured information from scientific papers while preserving provenance to the supporting evidence. Beaver combines a frontier agent with multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. These components turn curation into a staged, auditable workflow and enable an iterative evaluate–diagnose–revise loop, where persistent run artifacts expose stage-localized failures and guide harness updates. Experiments show that Beaver reaches 81.0 on Gold-Referenced Attribute Score (GRAS), an attribute-level measure of agreement with gold curated records, outperforming frontier agents by over 23 absolute points. Ablations show that task scaffolding, multimodal evidence tooling, and provenance traces each contribute meaningfully to performance, while attribute-level analysis shows the largest gains on high-value attributes that require cross-modal reasoning and normalization. These results show that, for scientific curation from papers with multimodal evidence, harness design is a central determinant of agent performance.


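[256] Answer Engineering: Local Trajectory Editing for Protocol-Constrained Decision Making in Large Language Models cs.AI | cs.CL

Victor Lavrenko, Anastasiia Molodnitskaia

TL;DR: The paper proposes Answer Engineering, a runtime and authoring layer that locally edits the reasoning trajectory during large language model generation so that answers comply with protocol in protocol-constrained decision settings. It corrects the trajectory through localized rule-guided interventions, without retraining, modifying model weights, or performing global search.

Details

Motivation: In domains where procedural compliance is critical, such as clinical decision making, large language models may produce confident but protocol-invalid answers, which creates risk. The paper asks how external intervention can make model outputs follow a domain protocol without changing the model itself.

Result: On a controlled clinical benchmark for sudden sensorineural hearing loss (SSNHL), step-by-step reasoning alone lowered SSNHL compliance from 54.5% to 25.1%. Local trajectory editing raised SSNHL compliance to 83.5% and conductive-case adherence to 77.9%, lifting balanced accuracy from 42.0% under reasoning-only generation to 80.7%.

Insight: The novelty is a system-level, auditable runtime control layer that enforces protocol compliance by locally editing the reasoning trajectory, offering a way to deploy large language models more safely in critical domains without modifying the model. The method decouples compliance logic from the generation process and improves transparency and controllability, but rule coverage, trigger reliability, and the model's persistent "diagnosis-first" generation dynamics remain limitations.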

Abstract: Large language models can produce confident but protocol-invalid answers in domains where procedural compliance is critical. This paper presents Answer Engineering, a deterministic runtime and authoring layer that applies localized rule-guided interventions to the visible reasoning trajectory during standard autoregressive generation, without retraining, modifying model weights, or performing global search. The method is evaluated on a controlled clinical benchmark for sudden sensorineural hearing loss (SSNHL), where correct management depends on protocol-consistent interpretation of symptom timing, Weber/Rinne tuning-fork findings, and otoscopic findings. In the benchmark, step-by-step reasoning shifted rather than eliminated errors: compliant outcomes for SSNHL decreased from 54.5% under unguided generation to 25.1%, while acceptance on the conductive contrast condition increased from 1.6% to 58.9%. Local trajectory editing increased SSNHL compliance to 83.5% and conductive-case adherence to 77.9%, raising balanced accuracy from 42.0% under reasoning-only generation to 80.7%. The results support a systems-level view in which protocol adherence can be improved through auditable runtime control of reasoning trajectories, while also identifying limitations caused by rule coverage, trigger reliability, and persistent diagnosis-first generation dynamics.
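
The balanced-accuracy figures above can be cross-checked against the per-condition rates. The sketch below assumes that balanced accuracy here is the unweighted mean of the SSNHL compliance rate and the conductive-case rate; the abstract does not state the formula, but the reported numbers are consistent with that reading.

```python
def balanced_accuracy(per_condition_rates):
    """Unweighted mean of per-condition rates (assumed definition)."""
    return sum(per_condition_rates) / len(per_condition_rates)


# Rates in percent, taken from the abstract.
reasoning_only = balanced_accuracy([25.1, 58.9])    # SSNHL, conductive
trajectory_edited = balanced_accuracy([83.5, 77.9])  # SSNHL, conductive

print(round(reasoning_only, 1), round(trajectory_edited, 1))  # prints 42.0 80.7
```

Under this reading, step-by-step reasoning trades SSNHL compliance for conductive-case acceptance, while trajectory editing raises both rates.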


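[257] ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents cs.AI | cs.CL

Zihang Tian, Jingsen Zhang, Rui Li, Xiaohe Bo, Yuanzi Li

TL;DR: The paper proposes ARCO (Adaptive Rubric CO-evolution), a framework that addresses ambiguous reward signals and the difficulty of step-level credit assignment in reinforcement learning for multi-step LLM agents. It uses a two-headed model with a shared backbone: one head generates step-level rubrics and the other predicts rubric-conditioned step-level rewards. A trajectory decomposition constraint ties the sum of step rewards to the final outcome, so that rubric content and the scoring function co-evolve at the parameter level.

Details

Motivation: Reinforcement learning for multi-step LLM agents usually relies on scalar success rewards that cannot explain why a trajectory is good or bad. Rubric-based rewards improve interpretability, but existing methods score at the trajectory level with a static, closed-source judge, leaving step-level credit assignment unresolved.

Result: On three multi-hop question answering benchmarks (HotpotQA, 2WikiMultiHopQA, and MuSiQue) with two open-source backbones, ARCO achieves the best exact-match (EM) score in every setting, ahead of strong outcome-, rubric-, and process-reward baselines.

Insight: The novelty is a framework in which rubric and policy co-evolve: a shared-backbone two-headed model generates and scores step-level rubrics, and the trajectory decomposition constraint enables credit assignment without step-level labels. The resulting rubrics are step-specific, robust to design choices, and useful for diagnosing agent behavior.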

Abstract: Reinforcement learning for multi-step LLM agents often relies on scalar rewards that indicate success but cannot explain why a trajectory is good or bad. Rubric-based rewards improve interpretability through natural-language criteria, but existing methods score at the trajectory level and freeze the scorer behind a closed-source judge, leaving step-level credit assignment unresolved and the judge itself static. We propose ARCO (Adaptive Rubric CO-evolution), a rubric framework in which a same-scale model $μ$ shares a backbone with two heads: a generation head that produces per-step criteria, and a score head that predicts rubric-conditioned step-level rewards. A trajectory decomposition constraint ties the sum of step rewards to the terminal outcome, enabling credit assignment without step-level labels, while $μ$ and the policy $π$ are jointly updated on on-policy data so that the rubric content and the scoring function co-evolve at the parameter level. Across HotpotQA, 2WikiMultiHopQA, and MuSiQue with two open-source backbones, ARCO improves the best EM in every setting over strong outcome-, rubric-, and process-reward baselines, and analyses show that its rubrics are step-specific, robust to design choices, and useful for diagnosing agent behavior. Codes and data are available at https://github.com/zihangtian/ARCO.
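
One way to read the trajectory decomposition constraint is as a penalty on the gap between the summed step rewards and the terminal outcome. The sketch below is an illustration under that assumption only: the squared-error form and the name `decomposition_penalty` are hypothetical, and the abstract does not specify how ARCO actually enforces the constraint.

```python
from typing import Sequence


def decomposition_penalty(step_rewards: Sequence[float], outcome: float) -> float:
    """Squared gap between the sum of step rewards and the terminal outcome.

    Assumed form for illustration; minimizing it pushes the per-step
    rewards to add up to the trajectory-level result, which distributes
    credit across steps without any step-level labels.
    """
    return (sum(step_rewards) - outcome) ** 2


# A successful trajectory (outcome 1.0) whose step rewards already sum to 1.0
# incurs no penalty; one whose step rewards sum to 0.2 is penalized.
print(decomposition_penalty([0.2, 0.3, 0.5], 1.0))
print(decomposition_penalty([0.1, 0.1], 1.0))
```

Only the trajectory-level outcome is supervised; how the total is split across steps is left to the score head.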


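[258] Hallucination as Context Drift: Synchronization Protocols for Multi-Agent LLM Systems cs.AI | cs.CL | cs.MA

Carson Rodrigues

TL;DR: The paper argues that hallucination in multi-agent LLM systems stems from context drift, the divergence of internal knowledge states between concurrent agents. The author defines the Context Divergence Score (CDS) to quantify this discrepancy and designs the Shared State Verification Protocol (SSVP), in which agents periodically exchange compressed state summaries to detect high-divergence conditions. In experiments, SSVP reduces hallucination while using far fewer API calls than full-broadcast synchronization.

Details

Motivation: To address hallucination in multi-agent LLM systems that is caused by context drift (inconsistent knowledge states across agents) rather than by deficiencies of the model itself.

Result: On the travel planning task, SSVP lowers the hallucination rate from 0.492 for the no-sync baseline to 0.463, and is significantly lower than full-broadcast synchronization (hallucination rate 0.658, p=0.0005) while using 58% fewer API calls. On the software planning task, all conditions reach a low hallucination rate (<0.2).

Insight: The paper reframes hallucination mitigation as a distributed systems problem and proposes context synchronization as a first-class primitive in multi-agent LLM design. A lightweight state verification protocol, rather than full broadcast, avoids contaminating agents with erroneous propagated state.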

Abstract: Multi-agent LLM systems routinely produce hallucinated outputs that cannot be explained by model deficiencies alone. A significant class of these failures arises not from model incapacity but from context drift: the divergence of internal knowledge states between concurrent agents. When agents enter a collaborative task with mismatched or stale representations of shared world state, their joint reasoning produces contradictions that manifest as hallucination. We define the Context Divergence Score (CDS), a lightweight scalar metric quantifying knowledge-state discrepancy between agent pairs across spatial, temporal, and task dimensions, and propose the Shared State Verification Protocol (SSVP), which lets agents periodically exchange compressed state summaries and flag high-divergence conditions before joint reasoning. We evaluate SSVP across two domains (multi-agent travel and software project planning) using Claude Haiku. In controlled experiments (n=30 per condition, travel; n=10, software) across 8 scenarios, naive full-broadcast synchronization increases hallucination rate by 34% above the no-sync baseline (HR: 0.658 vs. 0.492, p=0.0022, d=1.18), a contamination effect from propagating erroneous agent states. SSVP avoids this failure mode while showing modest, consistent reduction (HR: 0.463, d=0.30) and achieves significantly lower hallucination than full-broadcast (p=0.0005, d=1.47) using 58% fewer API calls. The contamination effect does not replicate in the software domain, where all conditions converge to low HR (<0.2), confirming it is specific to tasks where one erroneous shared belief cascades across evaluation dimensions. Our results reframe hallucination mitigation as a distributed systems problem and establish context synchronization as a first-class primitive in multi-agent LLM design.
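
The percentages quoted above follow from the reported hallucination rates. A small worked check, using only the three travel-domain rates from the abstract (the helper name `relative_change` is ours):

```python
def relative_change(new: float, baseline: float) -> float:
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# Hallucination rates (HR) reported for the travel domain.
no_sync, full_broadcast, ssvp = 0.492, 0.658, 0.463

print(f"{relative_change(full_broadcast, no_sync):+.1%}")  # +33.7%, the "34%" increase
print(f"{relative_change(ssvp, no_sync):+.1%}")            # -5.9%, the modest reduction
print(f"{relative_change(ssvp, full_broadcast):+.1%}")     # -29.6% versus full broadcast
```

The 34% figure is therefore a relative increase over the no-sync baseline, not a 34-point absolute change in hallucination rate.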


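[259] ForEx: A Formal Verification Framework for Explainable Reasoning in Logical Fallacy Detection and Annotation cs.AI | cs.CL | cs.SC

Pei-Cing Huang, Chienyu Liu, Chan Hsu, Ci-Siang Chen, Pei-Ju Lee

TL;DR: The paper proposes ForEx, a framework for formally verifying the reasoning in explanations that large language models generate for logical fallacy detection. It translates model-generated explanations into Lean4 and checks whether the translated reasoning chain is formally derivable from the encoded premises, rather than assessing the logical validity of the original natural-language argument.

Details

Motivation: Current evaluations of large language models on logical fallacy detection look only at predicted labels and cannot confirm whether the reasoning a model provides supports its label. The paper fills this gap with an evaluation method that verifies whether the reasoning behind an explanation is formally derivable.

Result: Experiments on LOGIC-Climate show that over 90% of LLM outputs can be translated into reasoning chains that pass formal verification, while agreement between predicted labels and human annotations is only around 20%. This exposes a systematic gap between formal derivability and label agreement.

Insight: The core novelty is a framework (ForEx) that formalizes natural-language explanations and verifies their reasoning chains, together with the LLM Argument Verification Matrix, which separates label consistency from formal verification status. This shifts evaluation from label correctness toward machine-checkable analysis of formalized reasoning.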

Abstract: Current evaluations of Large Language Models (LLMs) on logical fallacy detection focus on predicted labels, but do not establish whether those labels are supported by the reasoning the models provide. We propose ForEx (Formal Verification for Explainable Reasoning), a framework that translates LLM-generated explanations into Lean4 and verifies whether the translated rationale is derivable under encoded premises, not the logical validity of the original natural language argument. To distinguish prediction outcomes from the formal status of the supporting reasoning, we introduce the LLM Argument Verification Matrix, which separates label consistency from formal verification status. Experiments on LOGIC-Climate show that over 90% of LLM outputs can be translated into formal reasoning chains that pass verification, while agreement with human annotations remains around 20%. These results expose a systematic gap between formal derivability and label agreement, a distinction invisible to prediction-based metrics. ForEx moves LLM evaluation beyond label correctness toward machine-checkable analysis of formalized reasoning chains.


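[260] Learning the ARTS of Search for Automated Discovery cs.AI | cs.CL

Gurusha Juneja, Arnav Kumar Jain, Deepak Nathani, William Yang Wang, Xin Eric Wang

TL;DR: The paper proposes ARTS (Agentic Reasoning for Tree Search), a method for automated scientific discovery. It uses a reasoning language model to navigate the space of hypotheses and experiments, inspecting prior execution logs to diagnose the cause of failures and select the next hypothesis. To cope with context-length limits, ARTS uses test-time training to internalize knowledge of the search tree in the model weights.

Details

Motivation: Existing methods such as MCTS conflate the merit of a hypothesis with the quality of its experimental execution, so promising hypotheses with only preliminary execution are undervalued. In addition, search history is often pruned because of context-window limits, which can discard important information.

Result: Across 22 tasks from MLGym and MLEBench, ARTS achieves over 15.3% relative improvement in normalized score over leading algorithms. With test-time training, a Qwen3-4B agent matches closed-source frontier models such as Gemini-3 Pro and GPT o3-reasoning at up to 5x lower inference cost. On partially observable RL tasks, the test-time-trained Qwen3-4B scientist even surpasses ARTS with the o3 scientist by rediscovering the human-best recurrent-memory solution that heuristic methods prune away.

Insight: The core novelty is using a reasoning language model as the agent that drives search, so that it can diagnose the root cause of failures (a bad hypothesis versus a faulty implementation) and make better-informed decisions. Test-time training compresses knowledge of the evolving search tree into the model parameters, easing the dependence on long context and letting a small model approach large-model performance at low cost.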

Abstract: Scientific discovery can be formulated as an iterative search process over the space of hypotheses and experiments. Contemporary methods navigate this space using heuristics such as MCTS. These algorithms conflate the merit of a hypothesis with the quality of its experimental execution. A promising hypothesis with preliminary execution is therefore ranked below a modest hypothesis whose execution is refined. Moreover, prior methods prune the search logs as the search progresses because the accumulated history outgrows the context window. We propose Agentic Reasoning for Tree Search (ARTS), where we deploy a reasoning language model to navigate this space. The model inspects prior execution logs, diagnoses whether earlier failures arose from faulty implementations or bad hypotheses, and selects the hypothesis to build on next. To mitigate challenges with context length, ARTS uses test-time training to instill the knowledge of the search tree in the model weights. Across 22 tasks from MLGym and MLEBench, we show that ARTS outperforms leading algorithms, with over 15.3% relative improvement in the normalized score. With test-time training we show that a Qwen3-4B agent can match the performance of closed-source frontier models like Gemini-3 Pro and GPT o3-reasoning with up to 5x lower inference cost. We further observe that on partially observable RL tasks, the test-time trained Qwen3-4B scientist surpasses ARTS with the o3 scientist by rediscovering the human-best recurrent-memory solution that heuristic methods prune away.


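[261] Can Reasoning Models Detect Changes to their Chains of Thought? cs.AI | cs.CL | cs.LG

Sathvik Napa, Utkarsh Singh, Chengyuan Xue, Miriam Wanner, William Walden

TL;DR: The paper studies whether reasoning models can detect modifications to their chains of thought, both during and after reasoning, and when prefilled with their own chains of thought or those of other models. Detection accuracy is low, models struggle to identify how the chain of thought was modified, and they detect changes to their own and to other models' chains of thought about equally well.

Details

Motivation: Editing a model's chain of thought (for example, prefilling it with reasoning from a stronger model or removing unsafe steps) plausibly works only if the model does not notice the intervention, because a model that suspects tampering may change its behavior.

Result: Across conditions (during or after reasoning, prefilled with its own or another model's chain of thought), models show only modest detection accuracy, cannot reliably identify how the chain of thought was modified, and are about as good at detecting changes to their own chains of thought as to those of other models.

Insight: Current reasoning models are weak at detecting modifications to their chains of thought. This suggests an opening for safety interventions on model reasoning, though the risk of behavioral change still needs careful evaluation.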

Abstract: There are many reasons one may want to edit a model’s chain of thought (CoT) – e.g., to prefill it with reasoning from a stronger model or to remove steps that may yield unsafe outputs. The success of these interventions plausibly depends on a model’s inability to notice them, as the model may alter its behavior if it suspects tampering. In this work, we study whether recent reasoning models are able to detect such interventions on their CoTs under a variety of conditions: both during reasoning and after it, and when prefilled both with their own CoTs and with those of other models. Broadly, we find that (i) models exhibit only very modest detection accuracy; (ii) models struggle to identify how their CoT was modified; and (iii) models are about as good at detecting changes to their own CoTs as to those of other models.


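[262] VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows cs.AI | cs.CL | cs.DB | cs.LO

Teodoro Baldazzi, Luigi Bellomarini, Andrea Coletta, Michela Iezzi, Carsten Maple

TL;DR: The paper proposes VADAOrchestra, a neurosymbolic framework that models complex workflows as evolving reasoning processes. It combines an LLM-based orchestrator for incremental planning and adaptation with Datalog+/- logic programs for symbolic reasoning, aiming to unite the rigor of business process management with the flexibility of agentic systems.

Details

Motivation: In real-world decision making, traditional business process management systems lack runtime adaptability, while LLM-based agentic systems are opaque, unreliable, and limited in scalability over large datasets.

Result: Evaluated on real-world financial use cases, VADAOrchestra shows advantages in faithfulness, scalability, and explainability over standard agentic architectures.

Insight: The novelty is a hybrid neurosymbolic approach that decouples high-level LLM orchestration from inference in a Datalog+/- symbolic engine, producing verifiable reasoning traces and supporting complex reasoning over large datasets through targeted data querying.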

Abstract: Decision-making in real-world settings rarely follows a fixed script. Instead, it unfolds as a dynamic reasoning process in which the appropriate course of action evolves as new context and data become available. Traditional Business Process Management systems provide rigor, determinism, and auditability, yet they generally struggle to adapt their execution at runtime. Conversely, agentic systems based on Large Language Models (LLMs) bring flexibility to decision-making, but they are inherently opaque, often unreliable, and suffer from significant scalability constraints when operating over large datasets. To combine these complementary paradigms, we introduce VADAOrchestra, a neurosymbolic framework that models complex workflows as evolving reasoning processes. The framework adopts a hybrid approach: given a user query and a collection of data sources, an LLM-based orchestrator incrementally plans and adapts the workflow. This is encoded as a logic program in a fragment of Datalog+/- where predicates correspond to tool invocations and rules represent both predefined domain dependencies and logic constructs synthesized on demand to manipulate intermediate results. All logical inference tasks are then executed by a state-of-the-art Datalog+/- symbolic engine. This approach provides a verifiable reasoning trace, supporting the auditability and reproducibility of the entire process. Furthermore, by decoupling high-level orchestration from symbolic inference, it addresses scalability concerns, enabling complex reasoning over large datasets through targeted data querying. We evaluate VADAOrchestra on real-world financial use cases, demonstrating faithfulness, scalability, and explainability compared to standard agentic architectures.


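[263] Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards cs.AI | cs.CL

Jungseob Lee, Seungyoon Lee, Seongtae Hong, Minhyuk Kim, Chanjun Park

TL;DR: The paper proposes ACOER (Adaptive Correct-Only Efficiency Reward), a method for stabilizing efficiency training in large reasoning models. It gives brevity bonuses only to correct answers and combines this with dynamic budget normalization and control-loop penalty adjustments, avoiding the reward collapse common in conventional length-penalty methods. On several mathematical reasoning benchmarks, ACOER reduces generated tokens by over 60% while improving accuracy over the base model.

Details

Motivation: Integrating length-penalizing rewards into Group Relative Policy Optimization (GRPO) to reduce verbosity frequently triggers reward collapse, which severely degrades reasoning capability. The paper aims to resolve this structural problem and achieve stable, efficient reasoning optimization.

Result: Evaluated on several mathematical reasoning benchmarks, ACOER improves overall accuracy over the base model while reducing token generation by over 60%.

Insight: The core novelty is identifying the root mechanism by which penalizing the length of incorrect answers leads to structural collapse, and proposing a stable alternative that rewards brevity only for correct answers. By isolating the reward and adjusting it dynamically, ACOER guards against both main failure modes (structural collapse and stochastic over-compression collapse), providing a robust basis for efficiency-aware optimization.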

Abstract: Training large language models to reason efficiently is a critical challenge. While integrating length-penalizing rewards into Group Relative Policy Optimization (GRPO) aims to reduce verbosity, it frequently triggers reward collapse, severely degrading reasoning capabilities. Through a systematic evaluation of various reward configurations, we identify the root mechanism: GRPO’s group normalization creates divergent advantages when incorrect answers receive continuous length penalties. Consequently, methods penalizing the length of incorrect answers are structurally prone to collapse under sustained optimization. Furthermore, restricting penalties exclusively to correct answers avoids this primary failure, but leaves the model susceptible to a stochastic collapse driven by response over-compression. To robustly prevent both failure modes, we propose ACOER (Adaptive Correct-Only Efficiency Reward). ACOER eliminates the structural penalty loop by isolating brevity bonuses to correct completions and prevents stochastic compression via dynamic budget normalization and control-loop penalty adjustments. Evaluated across diverse mathematical reasoning benchmarks, ACOER improves overall accuracy compared to the base model while reducing token generation by over 60%, establishing a fundamentally stable approach for efficiency-aware optimization.
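
The correct-only idea can be sketched on top of group-normalized advantages. The code below is a simplified illustration, not ACOER itself: the reward formula, the `budget` and `bonus` parameters, and the use of the population standard deviation are assumptions, and the dynamic budget normalization and control-loop adjustments described in the abstract are omitted.

```python
from statistics import mean, pstdev
from typing import List


def correct_only_reward(correct: bool, length: int, budget: int,
                        bonus: float = 0.5) -> float:
    """Reward with a brevity bonus applied only to correct completions.

    Incorrect answers always receive 0.0, so their length never enters
    the reward (hypothetical formula for illustration).
    """
    if not correct:
        return 0.0
    return 1.0 + bonus * max(0.0, 1.0 - length / budget)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style normalization: center and scale rewards within one group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One sampled group for a single prompt: (is_correct, token_length).
group = [(True, 200), (True, 900), (False, 150), (False, 3000)]
rewards = [correct_only_reward(c, n, budget=1000) for c, n in group]
print(rewards)
print(group_advantages(rewards))
```

Because both incorrect completions receive the same reward, they also receive the same advantage regardless of length, so group normalization cannot create length-dependent advantages among wrong answers in this sketch.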


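[264] Plans Don’t Persist: Why Context Management Is Load Bearing for LLM Agents cs.AI | cs.CL

Aman Mehta, Anupam Datta

TL;DR: The paper introduces "replay pairing", a diagnostic showing that standard LLM agents on long-horizon tasks do not keep plans as persistent state and instead depend on the plan remaining in the context window. The plan signal decays rapidly after an action, and the <think> traces of reasoning models confound the measurement, which the authors fix with strict stripping. Experiments show that evicting plan information significantly degrades task performance and that probe-gated re-surfacing does not recover it, demonstrating that context management is critical for agents.

Details

Motivation: Long-horizon agents rely on context management (compressing, summarizing, and evicting old tokens), and tasks can fail when key information such as an early plan is evicted too soon. The paper investigates whether plan information is internalized as persistent state.

Result: On Llama-3.1-70B, the plan signal peaks at 0.453 one step after the plan and then falls 4.1x within a single action-observation step; on HotpotQA it falls 12.4x. Strict stripping recovers +163% of the signal in-sample and +153% held out. On ALFWorld, naive plan eviction cuts the success rate by 34.7 percentage points.

Insight: The novelty is the "replay pairing" diagnostic framework and the notion of a "reasoning-trace confound", with "strict stripping" correcting the resulting measurement bias. The core insight is that agent-critical information may be merely context-resident rather than persistent, which underlines the foundational role of context management; protecting the plan alone is not enough to preserve performance.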

Abstract: Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so tasks can continue beyond finite windows. That is safe only when dropped information is no longer needed or has been internalized. Plans are the stress case: they are written early, used for many steps, and first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in history and measures hidden-state cosine distance. On Llama-3.1-70B, plan signal spikes to 0.453 one step after the plan, then falls 4.1x in a single action-observation step; HotpotQA falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state, and instead depend on the plan remaining in context. A layer-L32 probe detects this decay as a diagnostic, not as proof that it reads plan content itself. Reasoning models add a measurement confound: their <think> traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior <think> blocks from the stripped run only. It recovers +163% of the step+1 signal in-sample and +153% held out, while not meaningfully changing non-reasoning Llama (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p=6e-4), while R1-specific probes reach 1.000, suggesting R1 encodes plan signal in a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7pp, while probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load bearing, but plan protection alone is not enough.
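
The measurement behind replay pairing reduces to a per-step cosine distance between two sets of hidden states. The sketch below assumes the hidden states have already been extracted as plain vectors; the function names and toy vectors are hypothetical, and extracting real hidden states from a model is outside its scope.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def plan_signal(with_plan: List[Sequence[float]],
                without_plan: List[Sequence[float]]) -> List[float]:
    """Per-step distance between paired runs of the same trajectory.

    with_plan[t] and without_plan[t] are hidden states at step t from the
    run that keeps the plan in history and the run that strips it.
    """
    return [cosine_distance(a, b) for a, b in zip(with_plan, without_plan)]


# Toy vectors: the two runs differ right after the plan and converge later.
with_plan = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.7]]
without_plan = [[0.0, 1.0], [0.8, 0.3], [0.7, 0.7]]
print(plan_signal(with_plan, without_plan))
```

A signal that drops toward zero within a step or two, as in the toy data, is the pattern the abstract reads as evidence that the plan is context-resident rather than carried forward as persistent state.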


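[265] DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models cs.AI | cs.CL

Jungseob Lee, Seongtae Hong, Seungjun Lee, Jaehyung Seo, Junyoung Son

TL;DR: DART is a training-free routing framework for hybrid reasoning models. It samples two cheap no-think drafts, answers directly when the drafts agree, and predicts a thinking budget from draft entropy when they disagree, allocating reasoning resources dynamically.

Details

Motivation: Existing routers typically require labeled training data or fix the thinking budget up front, ignoring answer-level evidence from the model itself. DART aims for training-free adaptive routing, so that easy problems avoid unnecessary reasoning and hard problems receive enough budget.

Result: On math reasoning, DART improves accuracy by up to 9.0 points on Olympiad-level problems while reducing thinking tokens by 15-69%. On code reasoning under execution-based equivalence, accuracy improves by up to 22.5 points while thinking tokens drop 51-63%, with no labeled data or gradient updates.

Insight: The novelty is using agreement between the model's own drafts as the routing signal and predicting a dynamic thinking budget from draft entropy, which yields training-free adaptive allocation of reasoning resources across model scales and API-only hosted settings.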

Abstract: Hybrid reasoning models can answer directly or spend extra tokens on extended thinking. A practical router should choose between these modes for each query, so easy problems avoid unnecessary reasoning and hard problems receive enough budget to finish the answer. Existing routers move in this direction, but they typically require labeled training data or fix thinking budgets up front, ignoring answer-level evidence from the model itself. We introduce DART, a training-free routing framework that samples two cheap no-think drafts, accepts direct answering when the drafts agree, and predicts a thinking budget from draft entropy when they disagree. Across the main comparisons, DART preserves or improves always-thinking accuracy in most settings while reducing thinking-token use. On math reasoning, accuracy improves by up to +9.0 points on Olympiad-level problems while thinking tokens drop 15-69%. On code reasoning under execution-based equivalence, accuracy improves by up to +22.5 points while thinking tokens drop 51-63%. The Stage 1 signal extends across model scales (0.6B-32B), model families, and API-only hosted settings, with no labeled data and no gradient updates required.
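
The two-stage routing rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not DART's implementation: the `Draft` type, the use of mean token entropy as the "draft entropy", the linear entropy-to-budget mapping, and all numeric defaults are hypothetical, since the abstract does not specify how the budget is predicted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Draft:
    answer: str
    mean_token_entropy: float  # hypothetical uncertainty summary of the draft


def route(d1: Draft, d2: Draft, min_budget: int = 256, max_budget: int = 4096,
          entropy_cap: float = 2.0) -> Tuple[str, int]:
    """Return (mode, thinking_token_budget) for one query.

    Stage 1: if the two no-think drafts agree, answer directly (budget 0).
    Stage 2: otherwise scale the thinking budget with the drafts' average
    entropy, clipped to [min_budget, max_budget] (assumed linear mapping).
    """
    if d1.answer.strip() == d2.answer.strip():
        return ("direct", 0)
    entropy = (d1.mean_token_entropy + d2.mean_token_entropy) / 2
    frac = min(max(entropy / entropy_cap, 0.0), 1.0)
    return ("think", int(min_budget + frac * (max_budget - min_budget)))


print(route(Draft("42", 0.3), Draft("42", 0.4)))   # drafts agree
print(route(Draft("42", 1.0), Draft("41", 1.0)))   # drafts disagree
```

No labels or gradient updates are involved: the only inputs are two cheap samples from the model being routed, which is what makes the scheme training-free.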


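[266] VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct cs.AI | cs.CL | cs.CV | cs.LG

Haoling Li, Kai Zheng, Jie Wu, Can Xu, Qingfeng Sun

TL;DR: The paper proposes VeriEvol, a framework for scaling reinforcement learning data for multimodal mathematical reasoning. Using verifiable evolved instructions, it decomposes data scaling into two independently controllable axes: prompt difficulty and answer reliability. It consists of a type-aware evolution module that generates harder image-question prompts and a hypothesis-testing verifier that ensures answer reliability.

Details

Motivation: Scaling data for reinforcement learning on visual mathematical reasoning is hard: simply generating harder questions makes reward labels unreliable, and existing methods either trust the labeller or assume the underlying answers are already correct. The paper addresses the tension between raising prompt difficulty and keeping answers reliable as data scales.

Result: On a five-benchmark visual-math suite, scaling evolved supervised fine-tuning (SFT) data from 10K to 250K samples raises mean accuracy from 35.42 to 54.73. With the backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier.

Insight: The novelty is decoupling data scaling into two independently verifiable and controllable axes (prompt difficulty and answer reliability) within an extensible iterative framework. Its core components are type-aware evolution operators and an offline verifier based on hypothesis testing against multi-source counter-evidence, offering a template for building high-quality, auditable, large-scale multimodal reasoning datasets.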

Abstract: Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.


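[267] SPARC: A Multi-Agent System for Electrical Circuit Question Answering cs.AI | cs.CV

Mushtari Sadia, Zhenning Yang, Umme Habiba Lamia, Nishat Shawrin, Ang Chen

TL;DR: This paper presents SPARC, a multi-agent system for question answering over electrical circuit diagrams. It addresses the complex mathematical reasoning these tasks require by grounding reasoning in executable, physics-based simulations. SPARC uses LLM agents to synthesize, execute, and analyze simulation programs, improving accuracy and reliability by design while enabling systematic error diagnosis.

Details

Motivation: Circuit-diagram QA requires complex mathematical reasoning, which remains challenging for multimodal LLMs.

Result: SPARC achieves 83% accuracy on circuit-diagram QA, with up to a 58% absolute improvement over baselines, and supports systematic error diagnosis.

Insight: The core contribution is combining a multi-agent architecture with executable physics-based simulation, which turns complex mathematical reasoning into a program synthesis and execution problem and improves accuracy and interpretability in a specialized domain.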

Abstract: Electrical circuit diagram QA tasks require complex mathematical reasoning, which remains challenging for multimodal LLMs. We present SPARC, a multi-agent system that answers questions over circuit diagrams by grounding reasoning in executable physics-based simulations. SPARC uses LLM agents to synthesize, execute, and analyze simulation programs, improving accuracy and reliability by design. It achieves 83% accuracy, with up to a 58% absolute improvement over baselines, while enabling systematic error diagnosis.


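[268] When Does a Video-Language Model Stop Watching? Reward Strength Controls the Formation and Reversal of Visual Shortcuts in Multimodal RLVR cs.AI | cs.CV | cs.LG

Zekun Xu

TL;DR: This paper studies how visual shortcuts (the model stops attending to the video and exploits linguistic priors instead) form and reverse during multimodal RLVR (reinforcement learning with verifiable rewards) training. Treating the grounding-penalty strength λ as a control variable, it shows that shortcut emergence along the training timeline is abrupt, dose-dependent, and sensitive to the intervention window.

Details

Motivation: The goal is to understand how visual shortcuts (a form of perception bypass) form during RLVR training, whether they can be reversed, and when intervention is still effective, so as to address models that ignore video content under outcome-only optimization.

Result: Experiments on a held-out, out-of-distribution diagnostic set show three things. Shortcut reliance emerges abruptly within a narrow window of optimization steps. Increasing the penalty strength λ progressively suppresses the shortcut, and at an intermediate strength the shortcut first forms and then reverses, exposing a hysteresis-like asymmetry. Applying the penalty before onset prevents shortcut formation, whereas applying it after consolidation is markedly less effective.

Insight: The paper recasts visual-shortcut collapse as a controllable, time-dependent, and asymmetric dynamic process rather than a binary defect, which gives direct guidance on when and how strongly to regularize multimodal RLVR.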

Abstract: Reinforcement learning with verifiable rewards (RLVR) is increasingly applied to large vision-language models (LVLMs), yet outcome-only optimization can drive a model to stop attending to the video and instead exploit linguistic priors – a failure we call a visual shortcut. While the existence of such perception bypass is by now documented, how it forms, whether it can be undone, and when intervention still helps remain open. We treat the strength of a grounding penalty, lambda, as a control knob and characterize the formation-reversal dynamics of visual shortcuts along the training time axis. On a held-out, out-of-distribution diagnostic set, we find: (i) a sharp onset – shortcut reliance emerges abruptly over a narrow window of optimization steps and is robust across random seeds; (ii) a monotone dose-response – increasing lambda progressively suppresses the shortcut, and at an intermediate dose the trajectory first forms and then reverses the shortcut, exposing a hysteresis-like asymmetry between acquiring and removing it; and (iii) a critical intervention window – applying the penalty before onset arrests shortcut formation, whereas the same penalty applied after consolidation is markedly less effective. Together these results recast visual-shortcut collapse not as a binary defect but as a controllable, time-dependent, and asymmetric process, with direct implications for when and how strongly to regularize multimodal RLVR.


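[269] ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents cs.AI | cs.CV

Yincheng Zhou, Athena Zhuoming Zhong, Shijie Zhang, Kevin Zhang, Teresa Xiaotao Shang

TL;DR: This paper proposes ENVS (Environment-Native Verified Search), a training-time search-and-filter pipeline for long-horizon GUI agents. Before policy optimization, it branches over behaviorally distinct GUI actions in live OSWorld VMs, verifies successful paths, and trains from globally balanced step-level supervision, addressing the difficulty of discovering successful trajectories in real desktop environments.

Details

Motivation: As multimodal agents move from interface understanding to real software control, discovering successful trajectories in live desktop environments becomes a key challenge. GUI tasks require long sequences of precise mouse and keyboard actions, while feedback obtained through VM rollouts is sparse, delayed, and costly.

Result: On the 300-task OSWorld pool, ENVS reaches 30.3 pass@8 on the original evaluations and 29.0 pass@8 on the newly introduced OSWorld-Noisy dynamic benchmark (which simulates recoverable desktop interruptions), outperforming matched ARPO-style online RL while reducing compute from 184-192 to 138-153 GPU-hours. Even with only 30% of its search data, ENVS reaches 27.0 pass@8, exceeding ARPO trained from the base model.

Insight: The core idea is to use the environment itself at training time to construct verified supervision: branching search and verification of successful paths yield high-quality step-level supervision, instead of relying on costly and sparse online RL feedback. The OSWorld-Noisy benchmark, which evaluates robustness and recovery under realistic desktop interruptions, is a useful extension to evaluation.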

Abstract: As multimodal agents move from interface understanding to real software control, successful trajectory discovery in live desktop environments becomes a key challenge. GUI tasks require long-horizon sequences of precise mouse and keyboard actions, while feedback is sparse, delayed, and costly to obtain through VM rollouts. We propose Environment-Native Verified Search (ENVS), a training-time search-and-filter pipeline that uses the environment to construct verified supervision before policy optimization: it branches over behaviorally distinct GUI actions in live OSWorld VMs, verifies successful leaves, and trains from globally balanced step-level supervision. To evaluate robustness under realistic desktop interruptions, we also introduce OSWorld-Noisy, a dynamic benchmark for recoverable desktop interruptions that preserves the original tasks while testing whether agents can refocus, dismiss, wait, or recover under live perturbations. On the 300-task OSWorld pool, ENVS reaches 30.3 pass@8 on original evaluations and 29.0 on OSWorld-Noisy, outperforming matched ARPO-style online RL while reducing compute from 184-192 to 138-153 GPU-hours; even with only 30% of its search data, ENVS reaches 27.0 pass@8, exceeding ARPO from the base model. Training from noisy environments also better preserves visual-reasoning abilities on auxiliary benchmarks, including OSWorld-G Refusal (16.7 vs. 1.9) and BLINK Functional Correspondence (26.2 vs. 23.1).
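
The abstract reports results as pass@8 but does not state how the metric is computed. A minimal sketch, assuming the unbiased pass@k estimator commonly used in code-generation evaluation (n attempts per task, c of them successful); the function names and the per-task averaging are illustrative assumptions, not the authors' evaluation code:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled attempts succeeds,
    given n recorded attempts of which c succeeded (assumed estimator)."""
    if k > n:
        raise ValueError("k must not exceed the number of attempts n")
    # comb(n - c, k) counts the k-subsets that contain no successful attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over tasks, as a percentage.

    results: list of (n, c) pairs, one per task (hypothetical input format).
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When n equals k, as with eight rollouts per task and pass@8, the estimator reduces to checking whether any rollout succeeded.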


cs.CR

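[270] Honeyquest for LLMs: Rethinking Cyber Deception for AI Attackers cs.CR | cs.CL

Kerri Prinos, Lilianne Brush, Cameron Denton

TL;DR: This paper introduces an automated evaluation framework, adapted from the Honeyquest instrument, for assessing at scale the judgment of large language models acting as cyber attackers. It finds that LLM attackers differ markedly from human attackers under cyber deception: LLMs fall for deceptive traps more often, show no sign of the defensive attention-diversion effect seen in humans, and exhibit a large gap between recognizing a trap and acting on that recognition.

Details

Motivation: The empirical foundation of cyber deception rests on human-centered hypotheses, but the rapid emergence of autonomous AI attackers calls into question whether that foundation transfers to AI agents.

Result: Twenty-one LLMs (10 providers, parameter scales from 8B to over 1T) were evaluated against a 47-participant human baseline on an identical set of 174 reconnaissance queries. Every LLM fell for deceptive traps at a significantly higher rate than humans, the defensive attention-diversion effect observed in humans was statistically absent in the LLM cohort, and LLMs that articulated trap recognition in their reasoning still exploited the deceptive elements 73.4% of the time.

Insight: The study establishes LLMs as a distinct attacker class and shows that human-centered deception hypotheses do not reliably transfer to AI attackers, which underlines the need for AI-native active defense frameworks. The automated evaluation framework and the observed behavioral gaps point to concrete directions for future AI security research.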

Abstract: The empirical foundation of cyber deception relies on human-centered hypotheses, but the rapid emergence of autonomous, AI-enabled attackers challenges whether this foundation transfers to AI agents. To address this, we introduce an automated evaluation framework adapted from the Honeyquest instrument to assess LLM attacker judgment at scale. Our 21-LLM cohort spanned 10 providers, diverse architectures and specializations, open- and closed-weight models, and parameter scales from 8B to over 1T. We evaluated the performance of this LLM cohort (yielding 10,962 responses) against the 47-participant human baseline across an identical set of 174 reconnaissance queries. Our empirical evaluation reveals three key findings that establish LLMs as a distinct attacker class: (1) every model in our cohort falls for deceptive traps at a significantly higher rate than human attackers; (2) the defensive attention-diversion effect observed in humans is statistically absent in our LLM cohort; and (3) a critical recognition-action gap, where LLMs successfully articulate trap recognition in their reasoning but exploit the deceptive elements anyway 73.4% of the time. Across the 21 models, trap recognition in reasoning text did not predict fell-for-trap behavior (Spearman $r = +0.08$, $p = 0.73$). Ultimately, these findings demonstrate that human-centered deception hypotheses do not reliably transfer to AI attackers, highlighting the critical need for new research into AI-native active defense frameworks.
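
The abstract summarizes the recognition-action gap with a Spearman rank correlation across the 21 models. A minimal sketch of how such a coefficient can be computed from two per-model score lists, using average ranks for ties; the function names are illustrative and no data from the paper is reproduced:

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold each model's trap-recognition rate and `y` its fell-for-trap rate; a coefficient near zero, as the abstract reports, means that ranking models by recognition says little about their ranking by behavior.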


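[271] A Hybrid, Multi-Layered Pipeline for Phishing and Threat Classification: Independently Validated URL and NLP Engines with a Calibrated Multi-Channel Fusion Stage cs.CR | cs.CL | cs.LG

Saifelden M. Ismail, Aser O. Ibrahim, Omar A. Mahmoud

TL;DR: This paper presents a hybrid, multi-layered pipeline for phishing and threat classification, comprising an independent URL engine, an NLP classifier, and a threat-intelligence synchronizer, whose outputs are combined by a calibrated multi-channel fusion stage. On a 10,677-email whole-system benchmark the pipeline reaches F1 = 0.914 and cuts held-out real-spam false positives to 3.6%.

Details

Motivation: Phishing is a multi-modal threat, so detection must handle the different modalities (URLs, text, and threat intelligence) together and address the weak generalization of existing methods.

Result: In independent benchmarks, the NLP classifier's held-out real-phishing recall rises from 0.8% to 87.3%. The whole-system fusion stage reaches F1 = 0.914 on the 10,677-email benchmark, but because that benchmark uses proxy URL and header channels and an operating point that still needs recalibration, it is presented as a preliminary integrated result.

Insight: Contributions include scoring each modality with its own engine and combining the scores through decision-level fusion; a generalization-hardened DistilBERT classifier with substantially higher recall; end-to-end OpenTelemetry instrumentation confirming 1:1 message conservation; and the observation that the binding constraint for deployable detection is generalization rather than same-distribution accuracy.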

Abstract: Phishing is a multi-modal threat. We present a hybrid pipeline that scores each modality with its own engine and fuses the results. Three engines are built, deployed, and independently benchmarked: a four-stage URL stack (Domain Guard, lexical model, threat intelligence, and an asymmetric L2 fusion sidecar); a generalization-hardened DistilBERT NLP classifier whose held-out real-phishing recall rises from 0.8% to 87.3%; and a threat-intelligence synchronizer with end-to-end OpenTelemetry instrumentation confirming 1:1 message conservation. A decision-level fusion stage, characterized on a 10,677-email whole-system benchmark, reaches F1 = 0.914 with a calibrated probabilistic-OR over URL, header, and phishing-probability channels while cutting held-out real-spam false positives to 3.6%. Because that benchmark uses proxy URL and header channels and an operating point still needing recalibration, we present it as a preliminary integrated result. The binding constraint for deployable detection is generalization rather than same-distribution accuracy.
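
The abstract describes the fusion stage as a calibrated probabilistic-OR over URL, header, and phishing-probability channels. A minimal sketch of a probabilistic-OR, assuming independent channels and using hypothetical per-channel weights in [0, 1] as a stand-in for calibration (the paper's actual calibration procedure is not given in the abstract):

```python
def probabilistic_or(channel_probs, weights=None):
    """Fuse per-channel phishing probabilities with a probabilistic OR.

    The fused score is 1 - prod(1 - w_i * p_i): the probability that at least
    one channel fires, under an independence assumption. The weights are a
    hypothetical calibration knob, not taken from the paper.
    """
    if weights is None:
        weights = [1.0] * len(channel_probs)
    if len(weights) != len(channel_probs):
        raise ValueError("one weight per channel is required")
    miss = 1.0  # probability that every channel misses
    for p, w in zip(channel_probs, weights):
        if not (0.0 <= p <= 1.0 and 0.0 <= w <= 1.0):
            raise ValueError("probabilities and weights must lie in [0, 1]")
        miss *= 1.0 - w * p
    return 1.0 - miss
```

With unit weights, two channels at 0.5 fuse to 0.75, and any single channel at 1.0 drives the fused score to 1.0. That is why down-weighting or recalibrating a noisy channel matters for false positives.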


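[272] DIPBox: A Multi-scale Testing Framework for Tracking Dataset Regeneration cs.CR | cs.CV

Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Yuling Chen

TL;DR: DIPBox is a multi-scale testing framework for tracking dataset regeneration. It analyzes sample-, set-, and distribution-level features, designs four similarity metrics to identify regenerated datasets accurately, and performs multi-scale similarity testing across a range of defender access settings.

Details

Motivation: Training datasets have tremendous proprietary value and are vulnerable to unauthorized copying. Existing defenses mainly track individual data points and pay little attention to dataset regeneration. A measurement study of public tumor datasets finds substantial real-world partial-dataset replication, raising concerns about potential license noncompliance.

Result: Extensive experiments on 16 vision and text base datasets, 320 regenerated datasets, and 590 derived models show that DIPBox outperforms previous solutions, and characterize its robustness and limits under three adaptive attacks.

Insight: To the authors' knowledge, DIPBox is the first multi-scale similarity testing framework for tracking previously unknown adversarial dataset regeneration. By analyzing features at the sample, set, and distribution levels and designing matching metrics, it exposes an inherent utility-divergence trade-off in regeneration, providing both a theoretical basis and a practical tool for dataset protection.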

Abstract: Training datasets have tremendous proprietary value and are vulnerable to unauthorized copying. Existing defenses mainly focus on tracking individual data points, but pay little attention to the threat of dataset regeneration. Through a measurement study of public tumor datasets, we identify substantial real-world partial-dataset replication, raising concerns about potential license noncompliance. To counter the challenge of tracking previously unknown adversarial regeneration, our key insight is that regeneration that preserves model utility inevitably preserves measurable signals across multiple feature scales. We categorize these dataset features into sample-, set-, and distribution-level features and design four similarity metrics to accurately identify regeneration. Based on these metrics, we develop DIPBox, which to our knowledge is the first testing framework that tracks regeneration suspects via multi-scale similarity testing across a spectrum of defender access settings, from limited to full information. We further provide a learning-theoretic analysis that justifies these multi-scale metrics and formalizes an inherent utility–divergence trade-off, implying fundamental limits on evasive regeneration. Extensive experiments on 16 vision and text base datasets, 320 regenerated datasets, and 590 derived models validate that DIPBox outperforms previous solutions while characterizing its robustness and limits under three adaptive attacks.


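[273] DE-FIVE: Detecting Malicious Image Prompts via Fourier Features and Image Vector Embeddings cs.CR | cs.CV

Xingwei Zhong, Varun Sharma, Kar Wai Fok, Vrizlynn L. L. Thing

TL;DR: This paper proposes DE-FIVE, a training-free framework for detecting malicious image prompts (such as adversarial perturbations and indirect prompt injection) aimed at vision language models (VLMs). It combines a black-box detector based on Fourier features with a white-box detector based on the hidden-state representations of the visual encoder (image vector embeddings). Experiments show that the framework outperforms state-of-the-art methods at detecting malicious image prompts.

Details

Motivation: Adding a visual modality expands the attack surface of VLMs, making them susceptible to threats such as adversarial perturbations and indirect prompt injection. Existing defenses typically require extensive data for retraining or the deployment of complex classifiers, and defenses that specifically target indirect prompt injection are severely lacking.

Result: Extensive experiments show that DE-FIVE consistently outperforms state-of-the-art baselines at detecting malicious image prompts.

Insight: The main contribution is a training-free framework with a hybrid detection strategy that combines black-box detection on Fourier-domain features with white-box detection on image vector embeddings derived from only a few-shot malicious set, which addresses indirect prompt injection in particular.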

Abstract: Vision language models (VLMs) employ both visual and textual modalities to enable advanced vision-language inference. However, incorporating visual modalities expands the attack surface of VLMs, making them more susceptible to security threats such as adversarial perturbations and indirect prompt injection, wherein crafted malicious image prompts can elicit unintended model outputs. Existing defense methods against malicious image prompts remain insufficient as they typically demand extensive datasets for retraining or the deployment of additional, complex classifiers. Most critically, there is a profound lack of specialized defense mechanisms specifically targeting indirect prompt injections, a gap that serves as a primary motivation for this work. To address these limitations, we introduce DE-FIVE, a novel training-free framework for detecting malicious image prompts by leveraging Fourier features and the hidden state representations of the visual encoder (image vector embeddings) across perturbations. Specifically, we develop a hybrid detection strategy consisting of a black-box detector that operates on Fourier-domain features and a white-box detector that exploits image vector embeddings derived from only a few-shot malicious set. Extensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art baselines against malicious image prompts.


cs.SE

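[274] ATLAS: Agentic Taxonomy of Large-Scale Software Ecosystems cs.SE | cs.CL | cs.IR

Junyi Lu, Mengyao Lyu, Jiahui Wu, Lei Yu, Chengwei Liu

TL;DR: ATLAS is the first framework that automatically constructs a hierarchical taxonomy of software repositories and classifies projects into it end-to-end. It combines LLM global knowledge with real repository distributions to propose meaningful splitting dimensions and iteratively correct them. It substantially outperforms existing baselines on GitHub repository classification and proves highly practical on downstream tasks.

Details

Motivation: The open-source ecosystem on GitHub lacks a systematic hierarchical taxonomy. The existing GitHub Topics mechanism is flat, inconsistent, and limited in coverage, so it cannot organize projects or support discovery effectively.

Result: ATLAS is evaluated on 54,387 GitHub repositories. On a stratified benchmark it achieves a Taxonomy Quality F-score of 83.13%, 15 percentage points above the best baseline. On downstream tasks, alternative discovery reaches P@1 = 85.71%, surpassing human-curated lists.

Insight: The key design is a self-corrective loop between a Designer Agent and a Classifier Agent, which combines LLM global knowledge with real distributions to refine the splitting dimensions iteratively. ATLAS is the only evaluated method to achieve high structural quality and high practical applicability at the same time, and its hierarchy reveals ecosystem trends such as the shift toward AI/ML applications.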

Abstract: The open-source ecosystem on GitHub lacks a systematic hierarchical taxonomy of software repositories. GitHub Topics, the dominant organizational mechanism, is flat, inconsistent, and covers only 67% of projects. We present ATLAS, the first framework that automatically constructs a hierarchical taxonomy for software repositories and classifies projects into it end-to-end. By combining LLM global knowledge with real repository distributions, ATLAS proposes meaningful splitting dimensions and iteratively corrects those that fail to accommodate real projects. A Designer Agent proposes splitting dimensions while a Classifier Agent assigns repositories; a self-corrective refinement loop uses classification failures to drive dimension revision through escalating strategies. We evaluate ATLAS on 54,387 GitHub repositories against six baselines spanning four paradigms, two downstream tasks, and three model families. On a stratified 2,001-repository benchmark, ATLAS achieves a Taxonomy Quality F-score (TQF) of 83.13%, outperforming the best baseline by 15 percentage points (on the full 54k corpus the approximate TQF is 73.0%, a gap driven by Path Granularity’s all-or-nothing scoring on longer paths rather than lower classification accuracy). It is the only method to simultaneously achieve high structural quality and high practical applicability. On downstream tasks, ATLAS enables alternative discovery with P@1 = 85.71%, surpassing even human-curated lists (62.34%), and achieves the highest P@1 for repository retrieval. The taxonomy further reveals structural ecosystem trends that are difficult to obtain from flat tags or similarity methods: the shift from libraries to AI/ML applications (now 61% of newly community-adopted projects) becomes visible only through hierarchical, type-based categorization. An interactive taxonomy explorer is available at https://atlas-taxonomy.netlify.app/


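[275] Reinforcement learning to improve large language model-based automated code compliance systems cs.SE | cs.AI | cs.CL | cs.LG

Jack Wei Lun Shi, Minghao Dang, Wawan Solihin, Leong Hien Poh, Justin K. W. Yeoh

TL;DR: This paper proposes P4IR, a two-stage framework for improving the accuracy of large language model (LLM)-based automated code compliance (ACC) systems for building regulations. It first uses supervised fine-tuning (SFT) to instill domain knowledge in the LLM, then applies Group Relative Policy Optimization (GRPO) to improve the generated intermediate representations, which take the form of high-level code skeletons.

Details

Motivation: Current LLM-based approaches to automated code compliance for building regulations are prone to generating incorrect and hallucinated computer-processable rules. The work aims to improve the accuracy and reliability of what these systems generate.

Result: Relative to the SFT baselines, the framework reduces tree edit distance and token-level Levenshtein distance by up to 23.8% and 38.6%, respectively. In a zero-shot setting it outperforms leading LLMs evaluated with few-shot prompting (Claude Opus and Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7) in both code structure and semantics. The GRPO stage also yields a small yet statistically significant reduction in false positives.

Insight: The paper's stated contribution is combining SFT with GRPO to optimize directly for domain-specific objectives. Viewed objectively, the core idea is a two-stage framework (SFT followed by GRPO) that optimizes the intermediate representation (code skeletons) generated by the LLM, offering a path toward more accurate and reliable LLM-based automated systems.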

Abstract: Large language model (LLM)-based approaches for automated code compliance (ACC) of building regulations are prone to generating incorrect and hallucinated computer-processable rules. This paper introduces P4IR, a two-stage framework that uses supervised fine-tuning (SFT) to instill domain knowledge in an LLM, followed by Group Relative Policy Optimization (GRPO) to improve the accuracy of the generated intermediate representations in the form of high-level code skeletons. The framework achieved reductions of up to 23.8% and 38.6% in tree edit distance and token-level Levenshtein distance respectively, relative to the SFT baselines. Comparative analysis demonstrates that this approach in a zero-shot setting outperforms leading LLMs in both code structure and semantics, specifically Claude Opus and Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7, evaluated via few-shot prompting. Additionally, the GRPO stage produced a small yet statistically significant reduction in false positives. By combining SFT with GRPO to optimize directly for domain-specific objectives, this approach offers a path toward more accurate and reliable LLM-based ACC systems.
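
One of the two reported metrics is token-level Levenshtein distance between a generated code skeleton and its reference. A minimal sketch of that distance using the standard dynamic program over token sequences; the whitespace tokenization in the usage note and the example rule are illustrative assumptions, not the paper's tokenizer or data:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists).

    Counts the minimum number of insertions, deletions, and substitutions
    needed to turn a into b, keeping only one row of the DP table at a time.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

For example, comparing the hypothetical skeletons `"if area > 50 : fail".split()` and `"if area >= 50 : fail".split()` gives a distance of 1, because a single token differs.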


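[276] From Driving Videos to Simulatable Scenarios cs.SE | cs.CV

Alexandre Levy, Ernest Valveny Llobet, Antonio Manuel López

TL;DR: This paper proposes D-V2S, a two-stage framework that automatically generates simulatable driving scenarios from driving videos. The first stage uses a vision language model to analyze the video and produce natural-language descriptions of road layouts and dynamic traffic interactions; the second stage uses a large language model to translate those descriptions into executable simulation scenarios. Experiments show that the generated scenarios contain 90% of the relevant semantic elements of the videos and that scenario generation quality exceeds existing methods.

Details

Motivation: Autonomous vehicles must be assessed for safety across scenarios ranging from routine traffic to rare events, and reproducing these real scenarios in simulation in a controllable, repeatable, and scalable way is crucial. Existing methods cannot automatically generate high-quality simulatable scenarios from real driving videos.

Result: Quantitative evaluation shows that scenarios generated by D-V2S contain 90% of the relevant semantic elements of the input videos. In the Scenario Generator (SG) comparison, its outputs achieve a 75% preference rate in human evaluation over other state-of-the-art methods.

Insight: The contribution is a two-stage automated pipeline that combines a vision language model (VLM) and a large language model (LLM), parsing real videos into semantic descriptions and then translating them into executable scenarios. The designed prompt and conditioning context are important for guiding the models to understand complex traffic interactions accurately, and the modular design makes ablation of individual components straightforward.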

Abstract: Autonomous vehicles (AVs) face driving scenarios ranging from routine traffic to rare events. To assess safety it is crucial to reproduce these scenarios in a controllable, repeatable, and scalable manner, with simulation playing a key role. This paper introduces D-V2S, a novel framework that automatically generates simulatable driving scenarios from driving videos. D-V2S operates in two stages: a Driving Record Analyzer (DRA) uses a vision language model (VLM) with our designed prompt to produce natural-language descriptions from input videos, capturing road layouts and dynamic traffic interactions; subsequently, a Scenario Generator (SG) uses a large language model (LLM) and our conditioning context to translate these descriptions into executable scenarios. Using simulations, we show that D-V2S generates scenarios where 90% of the relevant semantic elements of the videos are present. We also provide qualitative results demonstrating D-V2S’s capability to transform real-world driving videos into simulatable scenarios. Moreover, we provide both semantic and human driven ablative analyses of D-V2S’s modules. In particular, we show how the VLM choice matters for DRA, and how our SG achieves a 75% preference rate over other state-of-the-art methods.


cs.HC

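[277] Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems cs.HC | cs.CL | cs.CV | eess.AS

Jingjing Jiang, Atsumoto Ohashi, Ryuichiro Higashinaka

TL;DR: This paper presents Moshi-Face, the first full-duplex dialogue model that jointly processes the user's audio and facial input while generating synchronized speech and facial motion in real time. At its core is a VQ-VAE face codec that encodes 3D head meshes into discrete face tokens, together with an extension of Moshi in which a Face Transformer module generates these tokens non-autoregressively.

Details

Motivation: Existing full-duplex spoken dialogue models such as Moshi are limited to the audio modality and lack the facial expressions that are integral to human communication, which limits how natural and rich the interaction can be.

Result: Experiments show that Moshi-Face achieves audiovisual alignment at low latency while preserving the dialogue quality of the original audio-only model.

Insight: The contribution is the first integration of facial generation into a full-duplex dialogue system, with a VQ-VAE-based face token representation and a non-autoregressive generation mechanism that enable real-time, synchronized multimodal output.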

Abstract: Full-duplex spoken dialogue models, such as Moshi, enable natural, low-latency voice conversations. However, they remain limited to the audio modality, lacking the facial expressions that are integral to human communication. We present Moshi-Face, the first full-duplex dialogue model that jointly processes the user’s audio and facial input while simultaneously generating speech and facial motion. We first construct a vector-quantized variational autoencoder (VQ-VAE) as a face codec that encodes 3D head meshes extracted from facial videos into compact discrete tokens, referred to as face tokens, and conversely reconstructs 3D meshes from these tokens. We then extend Moshi with a Face Transformer module that generates face tokens non-autoregressively, enabling Moshi-Face to produce synchronized audio and face tokens in real time. Experiments show that Moshi-Face achieves audiovisual alignment at low latency while preserving the dialogue quality of the original audio-only model.


cs.CY

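[278] Latent Confidence Alignment for LLM Self-Assessment cs.CY | cs.AI | cs.CL

Ting-Yu Chen, Tingting Yu, Pei-Cing Huang, Chan Hsu, Ming-Yen Lin

TL;DR: This paper proposes Latent Confidence Alignment Error (LCAE), a Rasch model-based framework for evaluating the quality of confidence calibration in large language models (LLMs). By modeling item difficulty and the model's latent ability, it measures how consistent a model's self-assessment is with the implied error probability, and it is validated on a medical-domain dataset.

Details

Motivation: Existing calibration evaluations for LLMs typically compare predicted confidence with observed accuracy directly and do not model item difficulty. This makes discrepancies hard to interpret and makes it hard to tell whether confidence reflects genuine self-assessment or is merely a byproduct of response generation.

Result: Experiments on a medical-domain dataset with 20 models show that the proposed approach improves self-assessment quality without affecting model ability, and reveal an association between reliability and inference cost.

Insight: The contribution is a Rasch model-based latent ability framework combined with a metacognitive perspective, with the LCAE metric quantifying how well self-assessment aligns with the latent error probability. Incorporating item difficulty as an external signal with a reasoning mechanism gives a more interpretable framework for evaluating LLM confidence calibration.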

Abstract: Confidence calibration in large language models (LLMs) is commonly evaluated by comparing predicted confidence with observed accuracy. However, such approaches do not model item difficulty, making it difficult to interpret discrepancies and to determine whether model confidence reflects genuine self-assessment or is merely a byproduct of the response generation process. To address this, we adopt a Rasch model-based latent ability framework and a metacognitive perspective, and propose Latent Confidence Alignment Error (LCAE) to measure the consistency between model self-assessment and the latent error probability implied by model ability and item difficulty. We further incorporate item difficulty as an external signal with a reasoning mechanism. Experiments on a medical-domain dataset with 20 models show that the proposed approach improves self-assessment quality without affecting model ability, and reveals an association between reliability and inference cost.
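
The abstract defines LCAE only in words, as the consistency between a model's self-assessment and the latent error probability implied by ability and item difficulty. A minimal sketch under stated assumptions: the Rasch success probability is the standard logistic form, while the mean absolute gap used below is an illustrative stand-in, since the exact LCAE formula is not given in the abstract:

```python
import math


def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch model: probability of a correct response, sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def alignment_error(confidences, ability, difficulties):
    """Mean absolute gap between stated confidence and Rasch success probability.

    Equivalently, the gap between the self-assessed error (1 - confidence) and
    the latent error probability (1 - p_correct). This aggregation is an
    assumption for illustration, not the paper's LCAE definition.
    """
    gaps = [
        abs(c - rasch_p_correct(ability, b))
        for c, b in zip(confidences, difficulties)
    ]
    return sum(gaps) / len(gaps)
```

Under this sketch, a model with ability 0 answering an item of difficulty 0 is expected to succeed half the time, so a stated confidence of 1.0 on that item contributes a gap of 0.5.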


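[279] CourseBlueprint: A Structured Pipeline for Adaptive Pedagogical Video Generation Grounded in Course Corpora cs.CY | cs.AI | cs.CV

Md Zabirul Islam, Md Motaleb Hossen Manik, Ge Wang

TL;DR: This paper presents CourseBlueprint, a structured, course-grounded pipeline for adaptive pedagogical video generation. The system generates a teaching blueprint in a single forward pass, using a concept-graph scaffolding module, an adaptive controller, and an engagement generator, and grounds the rendered video with a deterministic slide-image override.

Details

Motivation: Existing text-to-video systems rarely encode the pedagogical content knowledge (PCK) needed for effective instruction, such as prerequisite-aware sequencing, learner-adaptive depth, and sustained cognitive engagement, so they cannot produce high-quality instructional videos.

Result: In a five-topic ablation, removing the engagement contract reduces the engagement score from 5.00 to 1.20, the adaptive score from 4.80 to 3.40, and Flesch readability from 38.0 to 19.8. The slide-image override converts a 0/9 corpus-grounding failure into 9/10 successful slide matches.

Insight: The contribution is a structured pipeline built on typed intermediate representations with validation, including a concept graph with deterministic cycle removal, adaptive style specifications, and a fixed narration contract. The paper argues that pedagogical video quality depends on auditable instructional contracts rather than surface fluency.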

Abstract: Generative text-to-video systems can produce visually fluent educational clips, but they rarely encode the pedagogical content knowledge (PCK) needed for effective instruction, including prerequisite-aware sequencing, learner-adaptive depth, and sustained cognitive engagement. We present CourseBlueprint, a course-grounded pipeline for adaptive pedagogical video generation. Given a topic and learner persona, the system generates a structured teaching blueprint in a single forward pass over an undergraduate biomedical-imaging corpus (BMED 2300; twenty-three lectures, 1,116 slides). Instead of ad-hoc prompt chaining, the pipeline uses typed intermediate representations with validation: a scaffolding module builds a stage-labeled prerequisite concept graph with deterministic cycle removal, an adaptive controller assigns per-concept style specifications, and an engagement generator produces narration following a fixed hook->retrieval->core->analogy->forward contract. A deterministic slide-image override further grounds the rendered video by reusing instructor slides whenever retrieval confidence is high. We also release a reusable benchmark corpus and an evaluation harness combining repeated LLM-judge scoring with regex-grounded objective metrics. In a five-topic ablation, removing the engagement contract reduces the engagement score from 5.00 to 1.20, the adaptive score from 4.80 to 3.40, Flesch readability from 38.0 to 19.8, and analogy and retrieval-prompt counts to near zero. The slide-image override converts a 0/9 corpus-grounding failure into 9/10 successful slide matches on the same topic. These results show that pedagogical video quality depends less on surface fluency than on explicit, typed instructional contracts that make scaffolding, adaptation, engagement, and grounding auditable.
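
The abstract mentions a prerequisite concept graph with deterministic cycle removal but does not say how cycles are removed. A minimal sketch of one deterministic approach, assuming edges are (prerequisite, concept) pairs and that dropping the back edges found by a depth-first search in sorted order is acceptable; this is an illustrative technique, not the authors' procedure, and the concept names in the usage note are hypothetical:

```python
def remove_cycles(edges):
    """Split edges into (kept, dropped) so that the kept edges form a DAG.

    Nodes and neighbours are visited in sorted order, so the same input always
    yields the same result regardless of the order in which edges were given.
    """
    graph = {}
    for src, dst in sorted(set(edges)):
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}
    kept, dropped = [], []

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Edge back into the current DFS path: it closes a cycle.
                dropped.append((node, nxt))
            else:
                kept.append((node, nxt))
                if color[nxt] == WHITE:
                    visit(nxt)
        color[node] = BLACK

    for node in sorted(graph):
        if color[node] == WHITE:
            visit(node)
    return kept, dropped
```

For instance, with the edges `("fourier", "k-space")`, `("k-space", "mri")`, and `("mri", "fourier")`, the search starts at `"fourier"` and drops `("mri", "fourier")`, leaving a linear prerequisite chain. The recursion is fine for lecture-sized graphs; a very deep graph would need an iterative traversal.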


cs.DC

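[280] Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse cs.DC | cs.AI | cs.CV

Bole Ma, Jan Eitzinger, Harald Koestler, Gerhard Wellein

TL;DR: This paper proposes Kamera, a unified position-invariant multimodal KV cache for training-free reuse. It stores a low-rank conditioning patch that restores cross-chunk binding, which avoids re-encoding and supports reordering of cached chunks, sliding-window survival, and recall.

Details

Motivation: Multimodal agents repeatedly re-examine the same video frames, UI screenshots, and rendered artifacts, but prefix caches support reuse only at a fixed leading position, so every look-back re-encodes from scratch. The work aims to eliminate this recomputation.

Result: On cross-chunk-binding benchmarks (MM-NIAH and two-page doc-QA), a low-rank patch recovers full task accuracy at a fraction of the KV footprint, and re-prefill KV is reconstructed to within bf16 rounding in a production SGLang kernel across six backbones.

Insight: The contribution is identifying the cross-chunk conditioning that naive KV reuse loses and repairing it with a low-rank patch. A single operator covers MLA, GQA, and MHA attention, combining exact RoPE re-rotation with restoration of cross-chunk binding, and it is most effective on redundant vision and video streams.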

Abstract: Multimodal agents repeatedly re-examine the same video frames, UI screenshots, and rendered artifacts as their context window slides and reasoning iterates, yet every look-back re-encodes from scratch, because prefix caches serve reuse only at a fixed leading position. We show this recompute is avoidable, and identify exactly what naive KV reuse loses: the cross-chunk conditioning a chunk absorbs from its neighbours. This loss is asymmetric. The direct readout of a cached chunk is recovered exactly and for free by the standard state-merge. What remains is a diffuse, low-rank residue concentrated in deep layers, invisible to single-hop retrieval but precisely what multi-hop reasoning binds on. Blind reuse therefore leaves single-hop recall intact while halving multi-hop accuracy; this is the failure mode prior position-independent caches, designed for single-context or single-image reuse, do not address. We repair it with a small, training-free low-rank conditioning patch stored alongside each position-free chunk. Reuse reduces to one operator across MLA, GQA, and MHA: exact RoPE re-rotation to any target position, plus the patch that restores cross-chunk binding. This makes three window operations cheap: reorder (one patch serves every ordering of a cached set), sliding-window survival (surviving chunks relocate via rotation only, zero re-encode), and recall (an evicted chunk is rehydrated by its patch, never re-encoded). A rank-m patch recovers full task accuracy on cross-chunk-binding benchmarks, MM-NIAH across two attention families and two-page doc-QA, at a fraction of the KV footprint, and reconstructs re-prefill KV to within bf16 rounding in a production SGLang kernel across six backbones. The conditioning signal is strongest in redundant vision and video streams, making our solution most impactful where multimodal agents spend their recompute budget.
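
The abstract's "exact RoPE re-rotation to any target position" rests on the fact that rotary embeddings compose: rotating a key by position p and then by (q - p) equals rotating it by q directly. A minimal sketch of that identity in pure Python, assuming the interleaved-pair RoPE convention and the usual base of 10000; it illustrates only the rotation step, not the low-rank conditioning patch or the paper's kernel:

```python
import cmath


def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to an even-length vector.

    Each consecutive pair (vec[2i], vec[2i+1]) is treated as a complex number
    and rotated by the angle position * base ** (-2i / d), where d = len(vec).
    """
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * position * theta)
        out.extend([z.real, z.imag])
    return out


def relocate(cached_key, old_position, new_position, base=10000.0):
    """Move an already-rotated cached key to a new position without re-encoding."""
    return rope_rotate(cached_key, new_position - old_position, base)
```

Because the operation is a pure rotation, relocating a key cached at position 3 to position 11 matches encoding it at position 11 directly, up to floating-point rounding. Real implementations may use a different pairing of dimensions, so the convention here is an assumption.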


cs.RO

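[281] MemoryVAM: Integrating Memory into Video Action Model for Robot Manipulation cs.RO | cs.AI | cs.CV

Yuxin Jiang, Chang Yu, Yunuo Chen, Xiang Feng, Yin Yang

TL;DR: This paper presents MemoryVAM, an episodic memory mechanism for video-world-model policies that addresses the non-Markovian nature of long-horizon robot manipulation tasks whose correct actions depend on history. Its Recap-Cue module compresses the CLIP embeddings of past frames into memory tokens and injects them into both the video backbone and the action decoder, so that the policy's imagination and decisions are conditioned on the episode so far.

Details

Motivation: Existing video-world-model policies condition on only a short observation window. When the correct action depends on earlier events that are no longer visible, long-horizon manipulation becomes non-Markovian, so a mechanism for integrating history is needed.

Result: On the LIBERO-Mem benchmark, the model improves average success from 5% to 42.5%. On real robots, it achieves 78.3% success on counting tasks, 80.0% on spatial recall, and 75.0% on sequential tracking.

Insight: The contribution is a memory module (Recap-Cue) trained with video prediction, a delta-reconstruction auxiliary loss, and episode-boundary supervision, requiring no per-frame progress labels. By changing only the cross-attention injection interface, the same mechanism applies to UNet and Diffusion Transformer backbones, aligning policy imagination with task progress and conditioning actions on history.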

Abstract: Video-world-model policies learn action-relevant representations by predicting future observations. However, they condition on only a short observation window, which renders long-horizon manipulation non-Markovian when the correct action depends on earlier events that are no longer visible. We present MemoryVAM, an episodic memory mechanism for video-world-model policies. We employ a Recap-Cue (RC) module, in which a Perceiver-based Recap Compressor maps per-frame CLIP embeddings into compact memory tokens, and a lightweight Cue Gate estimates task completion from memory and language. These tokens are injected into both the video backbone and the action decoder, aligning policy imagination with episode progress and conditioning actions on history. Our model trains the memory module with video prediction, a delta-reconstruction auxiliary loss, and episode-boundary supervision, requiring no per-frame progress labels. The same mechanism applies to UNet and Diffusion Transformer (DiT) backbones by changing only the cross-attention injection interface. On LIBERO-Mem, our model improves average success from 5% to 42.5%. On real robots, it achieves 78.3% success on counting tasks, 80.0% on spatial recall, and 75.0% on sequential tracking. Project page: https://MemoryVAM.github.io/


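[282] World Action Models: A Survey cs.RO | cs.CV

Qiuhong Shen, Shihua Zhang, Yue Liao, Qi Li, Zhenxiong Tan

TL;DR: This survey gives a systematic account of World Action Models (WAMs), embodied predictive-action models that forecast the future to guide action. It first clarifies the blurred boundaries between WAMs and broad world models, video generation models, action-grounded video world models, and Vision-Language-Action policies. It then organizes existing work through two complementary views: what each method generates (rendered futures, latent futures, or video-generation-free action reasoning) and how it is built (predictive substrate, backbone, action coupling, and deployment regime). It also discusses interactability, causality, persistence, physical plausibility, and generalization, along with data, evaluation, and open challenges.

Details

Motivation: The WAM field has expanded rapidly, with methods built on large video generation models or on language and vision-language backbones. This growth has blurred the boundary between WAMs and related concepts such as broad world models and video generation models, and a unified account has been missing.

Result: The survey reports no quantitative experiments. Its systematic analysis identifies a consistent design pattern: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost.

Insight: The survey gives the WAM field a clear common framework, dissects existing methods from two views (generated content and architectural decomposition), and distills the key design trade-off: the field is moving toward methods that generate less of the future while preserving what control requires, which can guide future model design.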

Abstract: World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world-action-models.github.io/.


[283] Decoupling the Declarative from the Procedural in Vision-Language-Action Models cs.RO | cs.AI | cs.CV | cs.LGPDF

Nikolaos Tsagkas, Andreas Sochopoulos, Chris Xiaoxuan Lu, Oisin Mac Aodha, Alexandros Kouris

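TL;DR: This paper proposes w²VLA, a new Vision-Language-Action model that targets the brittleness of existing VLA models when transferring skills zero-shot to novel objects. By restructuring information flow to decouple declarative knowledge from procedural knowledge, it achieves robust behavior cloning and zero-shot skill transfer to unseen objects.

Details

Motivation: VLA models fine-tuned from large pre-trained vision-language models reach state-of-the-art performance on in-distribution tasks but remain brittle to minor spatial, semantic, and task variations. The underlying bottleneck is that declarative and procedural knowledge are coupled in the model parameters, which hinders zero-shot skill transfer.

Result: Experiments show that the proposed modular approach decouples knowledge representations, delivers robust performance on behavior cloning, and demonstrates what the authors describe as unprecedented zero-shot skill transfer across dissimilar, unseen objects.

Insight: The core idea is to modulate the robot state sequence with visual, spatial, and skill information in a compositional and interpretable manner, rather than feeding all multimodal tokens into an opaque action expert. This offers a new architectural direction for transferable generalist robot policies.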

Abstract: Deploying generalist robotic agents in the real world requires transferable skills. Specifically, a policy trained to clone a behavior from object-specific demonstrations must generalize beyond that object, otherwise data collection requirements become intractable. Recently, fine-tuning of pre-trained billion-parameter Vision-Language Models (VLMs), initially on large-scale robot datasets and then on fewer scenario-specific demonstrations, has emerged as the predominant paradigm for designing Vision-Language-Action (VLA) models. While these policies achieve state-of-the-art manipulation performance in-distribution, they remain brittle to minor spatial, semantic, and task variations. In this work, we address the inability of current models to decouple the declarative (i.e., concepts and entity semantics) from the procedural knowledge (i.e., how to do something) encoded in their parameters, which is a fundamental bottleneck for zero-shot skill transfer to novel objects. To address this, we propose w$^{2}$VLA, a new VLA model with restructured information flow. Rather than feeding all multimodal tokens from the VLM encoder into a large, opaque transformer-based action expert, our approach modulates the robot state sequence with visual, spatial, and skill information in a compositional and interpretable manner. Unlike popular, state-of-the-art VLAs, we show that our modular approach successfully decouples knowledge representations, enabling robust behavior cloning and unprecedented zero-shot skill transfer capabilities across dissimilar, unseen objects.


[284] Robot Self-Improvement via Human-Video Dynamics Models cs.RO | cs.CV

Hanzhi Chen, Anran Zhang, Simon Schaefer, Kejia Chen, Shi Chen

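TL;DR: This paper proposes DGAC (Dynamics-Guided Action Correction), a method that uses embodiment-agnostic action, dynamics, and value representations learned from human videos so that robots can autonomously improve their policies from their own attempts and failures. Across seven real-world manipulation tasks, it raises success rates from 40% to 81%, demonstrating cross-embodiment robot self-improvement.

Details

Motivation: The paper asks how robots can acquire skills from the kinds of data humans learn from (passive observation, embodied practice, and the experience of failure), and in particular whether priors from human videos can support a robot in evaluating, correcting, and improving its own attempts.

Result: Across seven real-world manipulation tasks spanning a mobile manipulator and a static manipulator arm, the method improves success rates from 40% to 81% across multiple policy backbones, achieving cross-embodiment robot self-improvement.

Insight: The novelty lies in learning embodiment-agnostic representations from human videos and in DGAC, a training-free approach that uses these models to repair failed states, turning failures into supervision for autonomous policy improvement.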

Abstract: A central question in robot learning is how to acquire skills from the kinds of data that humans learn from: passive observation, embodied practice, and the experience of failure. Human videos provide the first of these in abundance, and prior work has shown they can initialize useful policies. Far less clear is whether they can support the second and third: whether priors extracted from human videos can ground a robot’s own attempts well enough to evaluate them, correct them, and improve from them. In this work, we show that human videos can be used to learn embodiment-agnostic action, dynamics, and value representations that transfer across robot embodiments, providing the predictive foundation required for robots to autonomously improve from their own rollouts and failures. We introduce Dynamics-Guided Action Correction (DGAC), a training-free approach that leverages these adapted models to repair failed states: each failure becomes a query for which the learned models propose and rank corrective actions, turning failures into supervision for the next policy update. Across seven real-world manipulation tasks spanning both a mobile manipulator and a static manipulator arm, our approach improves success rates from 40% to 81% across multiple policy backbones, demonstrating cross-embodiment robot self-improvement from human-video priors. These results show that human priors and robot failures can be combined to enable scalable autonomous policy improvement. Project page: https://ethz-mrl.github.io/robot-self-improvement-website/.


[285] ASCII Art Turns LLMs into VLA Controllers cs.RO | cs.CV | cs.LG

Yitao Jiang, Roy Xing, Luyang Zhao, Brian Plancher, Muhao Chen

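TL;DR: This paper adapts a text-only large language model (LLM) into a Vision-Language-Action (VLA) controller, using ASCII art as the bridge from the visual modality to text. With visual observations rendered as ASCII text, the LLM can process visual information directly and follow natural-language instructions to output executable action sequences. The approach is validated on a 2D manipulation benchmark, in simulation and on a physical manipulator, suggesting that ASCII rendering can serve as a lightweight, interpretable modality bridge.

Details

Motivation: VLA controllers are typically built on multimodal backbones with large data and compute requirements. The paper explores a lighter alternative that reuses existing training and deployment stacks for text-only LLMs to condition on visual state and produce actions, lowering the barrier to building VLA systems.

Result: On a 2D manipulation benchmark, the fine-tuned LLM controllers identify task-relevant entities and plan feasible action sequences in simulation and on a physical manipulator. The paper compares multiple LLMs and VLMs across model families and scales, and the results indicate that ASCII rendering can support VLA-style control.

Insight: The main contribution is ASCII art as a lightweight, interpretable vision-to-text bridge that lets text-only LLMs handle visual information without a complex vision encoder. This opens a direction for VLA research on mature text-model stacks and complements conventional multimodal VLA pipelines.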

Abstract: Vision–Language–Action (VLA) controllers are often built by extending vision–language models (VLMs) with action supervision, relying on multimodal backbones with large data and compute requirements. We demonstrate that a text-only large language model (LLM) can be adapted into a VLA-style controller when visual observations are rendered into a text input using an ASCII representation. This ASCII-as-vision interface enables existing training and deployment stacks for LLMs to efficiently condition on visual state, follow natural-language instructions, and produce constrained, executable actions. We fine-tune and compare multiple LLMs and VLMs across model families and scales, using both expert demonstrations from a planning-based teacher, as well as DAgger for iterative improvement. In a 2D manipulation benchmark, in both simulation and on a physical manipulator, the resulting controllers can identify task-relevant entities and plan feasible action sequences. Our results suggest that ASCII rendering can serve as a lightweight, interpretable modality bridge from images to text, complementing conventional VLA pipelines, and opening directions for VLA research with text-only backbones.
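
The ASCII-as-vision interface described above can be illustrated with a minimal sketch. The glyph set, grid layout, prompt template, and the function names `render_ascii` and `build_prompt` are all hypothetical; the abstract does not specify how the authors encode scenes or prompts.

```python
from typing import Dict, List, Tuple

# Hypothetical glyph vocabulary; the abstract does not specify the symbols used.
GLYPHS: Dict[str, str] = {"gripper": "G", "block": "B", "target": "T"}
EMPTY = "."


def render_ascii(width: int, height: int,
                 entities: Dict[str, Tuple[int, int]]) -> str:
    """Render named entities at (x, y) cells into a text grid, row 0 at the top."""
    grid: List[List[str]] = [[EMPTY] * width for _ in range(height)]
    for name, (x, y) in entities.items():
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"{name} at {(x, y)} lies outside the grid")
        grid[y][x] = GLYPHS[name]
    return "\n".join("".join(row) for row in grid)


def build_prompt(instruction: str, ascii_state: str) -> str:
    """Combine the instruction and the rendered state into one text-only prompt."""
    legend = ", ".join(f"{glyph}={name}" for name, glyph in GLYPHS.items())
    return (f"Legend: {legend}, {EMPTY}=empty\n"
            f"State:\n{ascii_state}\n"
            f"Instruction: {instruction}\n"
            f"Action:")


state = render_ascii(4, 2, {"gripper": (0, 0), "target": (3, 1)})
prompt = build_prompt("move the gripper to the target", state)
```

A text-only model would then be asked to complete the prompt after `Action:` with one of a constrained set of actions; that action vocabulary is left out here because the abstract does not describe it.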


[286] Rotation-Aware Point-Cloud Embeddings for Vision-Based In-Hand Reorientation cs.RO | cs.CV

Yashom Dighe, Karthik Dantu

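TL;DR: This paper proposes a rotation-aware point-cloud embedding for vision-based dexterous in-hand reorientation. It calibrates the Euclidean latent distance between the current and goal point clouds to the SO(3) geodesic error between object orientations, turning point-cloud goals into a smooth control signal. A model-free reinforcement learning policy can then learn from point-cloud embeddings, proprioception, and centroid metadata alone, without object pose, relative pose, dense flow, or teacher supervision.

Details

Motivation: Raw point-cloud goal conditioning is poorly conditioned for policy learning: point clouds are unordered, independently sampled, and visibility-dependent, so their discrepancy entangles object rotation with permutation, resampling, and unstable correspondence.

Result: In in-hand reorientation experiments, the method matches privileged-state and distillation-based baselines while avoiding brittle test-time computation of structured pose or flow inputs.

Insight: The novelty is a rotation-aware point-cloud embedding whose Euclidean latent distance is calibrated directly to the SO(3) geodesic error of object rotation, so the task-relevant rotational geometry is encoded in the representation itself rather than in an external module. The paper also presents evidence that generic visual point-cloud pretraining is insufficient for current-goal comparison, because it discards task-relevant state and preserves only shape features.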

Abstract: Point-cloud goals provide a direct way to specify dexterous in-hand reorientation: instead of defining an object-specific pose frame or estimating 6D pose at test time, the policy is given the desired 3D geometry of the object. Yet raw point-cloud goal conditioning is poorly conditioned for policy learning. Current and goal clouds are unordered, independently sampled, and often visibility-dependent, so their discrepancy entangles object rotation with permutation, resampling, and unstable correspondence structure. For this reason, prior point-cloud manipulation methods typically add structure outside the representation itself, such as explicit pose or relative-pose inputs, dense flow features, or distillation from privileged teachers. We close this gap by learning a rotation-aware point-cloud embedding whose Euclidean latent distance is calibrated to the SO(3) geodesic error between object orientations. The resulting representation turns current-goal comparison into a smooth control signal, allowing a model-free RL policy to act from current and goal point-cloud embeddings, proprioception, and centroid metadata, without object pose, relative pose, dense flow, or teacher-action supervision. In in-hand reorientation experiments, this interface matches privileged-state and distillation-based baselines while avoiding brittle test-time computation of structured pose or flow inputs. These results suggest that point-cloud goals become practical for this task when the representation, rather than an external module, encodes the task-relevant geometry of rotation. We also show evidence that generic visual point-cloud pretraining is insufficient for such a current-goal comparison because it discards the task-relevant state and preserves only shape features.
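
To make the calibration target concrete, the sketch below computes the standard SO(3) geodesic angle between two rotation matrices, theta = arccos((tr(R1^T R2) - 1) / 2), and a squared-error penalty between a latent Euclidean distance and that angle. The geodesic formula is standard; the loss form, the `scale` parameter, and the function names are assumptions, since the abstract does not give the training objective.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def relative_rotation(r_current: Matrix, r_goal: Matrix) -> Matrix:
    """Return R_current^T @ R_goal for two 3x3 rotation matrices."""
    return [[sum(r_current[k][i] * r_goal[k][j] for k in range(3))
             for j in range(3)] for i in range(3)]


def geodesic_angle(r_current: Matrix, r_goal: Matrix) -> float:
    """SO(3) geodesic distance in radians, in the range [0, pi]."""
    rel = relative_rotation(r_current, r_goal)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)


def calibration_loss(z_current: Sequence[float], z_goal: Sequence[float],
                     r_current: Matrix, r_goal: Matrix,
                     scale: float = 1.0) -> float:
    """Assumed objective: squared gap between latent distance and scaled angle."""
    latent_distance = math.dist(z_current, z_goal)
    return (latent_distance - scale * geodesic_angle(r_current, r_goal)) ** 2


def rot_z(theta: float) -> Matrix:
    """Rotation by theta radians about the z axis (helper for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


angle = geodesic_angle(rot_z(0.3), rot_z(1.0))  # rotations differ by 0.7 rad
```

In this picture, an encoder would be trained so that embeddings of two observed clouds sit at a latent distance proportional to `angle`; how the embeddings are produced from unordered points is outside the scope of this sketch.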


[287] EmbodiedUS-FS: Fast Slow Intelligence for Ultrasound Robotics cs.RO | cs.CV

Fangzhuo Zhang, Xinyu Wang, Xiao Yang, Jinchang Zhang

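TL;DR: This paper proposes EmbodiedUS-FS, a fast-slow hierarchical system for safe and interpretable robotic ultrasound assistance. A Slow Brain handles intent parsing and task planning, a Fast Brain fuses multimodal feedback to refine local actions, and a safety mechanism is integrated alongside them. Experiments show that the hierarchical design improves task success rates and reduces safety violations.

Details

Motivation: Robotic ultrasound scanning in real clinical environments poses a dual challenge: understanding physicians' natural-language instructions, which carry implicit anatomical targets, procedural logic, and safety constraints, and executing in closed loop under dynamic perturbations such as patient motion, contact variations, and target drift.

Result: In experiments on planning evaluation, closed-loop execution under dynamic perturbations, and safety-mechanism validation, the proposed hierarchical design improves task success rates while reducing safety violations.

Insight: The novelty is a hierarchical architecture that decouples clinical reasoning (Slow Brain) from real-time closed-loop control (Fast Brain), combined with knowledge augmentation from an API and handbook corpus, structured plan verification, multimodal feedback fusion, and a hierarchical safety escalation policy, to deliver safe and interpretable robot-assisted ultrasound.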

Abstract: Robotic ultrasound scanning in real clinical environments requires both high-level clinical workflow reasoning and low-level closed-loop execution. Physicians' natural-language instructions often contain implicit anatomical targets, procedural logic, image-quality requirements, and safety constraints, while execution is affected by patient motion, contact variations, and target drift. We propose a fast and slow hierarchical embodied ultrasound system for safe and interpretable robotic ultrasound assistance. The Slow Brain performs intent parsing and stage-wise task planning with knowledge augmentation from an API and handbook corpus, and generates executable plans through task-graph construction and structured plan verification. The Fast Brain fuses multimodal feedback, including ultrasound images, robot pose and force states, and patient-motion information, to refine local actions and perform image-quality-guided recovery behaviors. The system further integrates a Safety Shield and a hierarchical escalation policy to constrain risky actions and trigger replanning or human confirmation under persistent failures or safety-bound violations. Experiments on planning evaluation, closed-loop execution under dynamic perturbations, and safety-mechanism validation demonstrate that the proposed hierarchical design improves task success rates while reducing safety violations.


[288] HERCULES: An Open-Source Simulation Framework for Heterogeneous Multi-Robot SLAM, Collaborative Perception, and Exploration cs.RO | cs.CV | cs.MA | eess.SY

Sandilya Sai Garimella, Daniel Chase Butterfield, Sean Wilson, Lu Gan

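TL;DR: HERCULES is an open-source simulation framework for heterogeneous multi-robot autonomy built on Unreal Engine 5. It resolves key architectural limitations of prior frameworks to support concurrent UAV and ground-vehicle operation in large-scale, photorealistic, dynamic environments. It provides a new waypoint-tracking controller, a shared navigation stack, and an expanded sensor suite (including physics-based long-wave infrared cameras), and integrates intelligent agents and high-fidelity dynamic phenomena. The framework supports both passive data generation and active closed-loop planning, and its code, documentation, and datasets are publicly released, including a heterogeneous multi-robot SLAM benchmark.

Details

Motivation: Existing simulation frameworks have architectural limitations in supporting concurrent operation of heterogeneous robots (UAVs and ground vehicles), simulating large-scale photorealistic dynamic environments, and synchronizing sensors. HERCULES aims to provide a high-quality simulation and data-collection platform for research on heterogeneous multi-robot SLAM, collaborative perception, and exploration.

Result: Experiments using HERCULES-generated data and active closed-loop execution demonstrate its utility for heterogeneous multi-robot SLAM, collaborative perception, and exploration. The released datasets include a heterogeneous multi-robot SLAM benchmark collected with two UAVs and two UGVs across kilometer-scale desert, forest, and city environments.

Insight: Contributions include: 1) a waypoint-tracking UGV controller that unifies the control interface across heterogeneous platforms; 2) a shared navigation stack across heterogeneous platforms; 3) physics-based long-wave infrared cameras and configurable night-vision modes; 4) rigorous time synchronization and ROS 2 wrappers; 5) integrated intelligent agents (pedestrians, traffic, wildlife) and high-fidelity dynamic phenomena (fire, flooding, crop disease). The framework brings state-of-the-art game-engine capabilities into robotics simulation, giving heterogeneous multi-robot research reproducible multi-modal dataset generation and closed-loop testing.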

Abstract: We present HERCULES, an open-source simulator and data-collection pipeline for heterogeneous multi-robot autonomy. Built upon the Unreal Engine 5 (UE5)-based simulators AirSim and Cosys-AirSim, HERCULES resolves key architectural limitations of prior frameworks to enable concurrent unmanned aerial and ground vehicle (UAV-UGV) operation in large-scale, photorealistic, dynamic environments. It introduces a new waypoint-tracking UGV controller that mirrors existing UAV control interfaces, and provides a shared navigation stack for mapping, traversability analysis, planning, and control across heterogeneous platforms. Expanding inherited sensor suites, it adds physics-based long-wave infrared (LWIR) cameras and configurable night-vision modes for degraded visual environments. HERCULES provides lightweight APIs, ROS 2 wrappers, and rigorous time synchronization across sensors and platforms, and brings state-of-the-art game-engine capabilities into robotics simulation, integrating intelligent agents such as pedestrians, traffic, and wildlife with high-fidelity dynamic phenomena, including fire, flooding, and crop disease spread. HERCULES runs in two modes: passively, replaying offline-designed trajectories to generate reproducible multi-modal datasets, and actively, running an online planner in closed loop from live observations. Our experiments in heterogeneous multi-robot SLAM, collaborative perception, and exploration, using both HERCULES-generated data and active closed-loop execution, demonstrate its utility for advancing heterogeneous multi-robot autonomy. We publicly release our source code, experiment code, documentation, and datasets, including a heterogeneous multi-robot SLAM benchmark collected with two UAVs and two UGVs across kilometer-scale desert, forest, and city environments, at https://lunarlab-gatech.github.io/HERCULES-website.


[289] Improving Robotic Imitation Learning via Trajectory Standardization cs.RO | cs.CV

Licheng Yang, Lingfeng Qian, Fei Zheng, Yonghao He, Wei Sui

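TL;DR: This paper proposes Information-Standardized Trajectory Resampling (ISR), an offline preprocessing method that addresses data-quality problems in robotic imitation learning caused by noise and temporal irregularity in human demonstrations (variable operator speed, intermittent pauses, and inconsistent action density). ISR maps trajectories onto an information-modulated Riemannian manifold and performs geodesic-equidistant parameterization so that adjacent points are separated by approximately equal information distance, which removes redundant pauses while preserving high-curvature, fine-manipulation phases.

Details

Motivation: Robotic imitation learning relies on large sets of human demonstration trajectories, which are often noisy and temporally irregular because of variable operator speed, intermittent pauses, and inconsistent action density. The common preprocessing strategy of time-uniform downsampling cannot remove speed-induced non-uniformity or redundant pauses, and this mismatch degrades data quality and hinders policy learning.

Result: Evaluated on three real-world manipulation tasks with mainstream imitation learning policies, ISR improves task success rates by about 25% over the baseline of time-uniform 3x downsampling, remains robust across datasets collected by different operators, and reduces both dataset size and training cost.

Insight: The novelty is a trajectory resampling method based on an information-intensity field (built from velocity and acceleration norms) that recasts trajectory standardization as geodesic-equidistant parameterization on a Riemannian manifold, preserving key manipulation information while removing redundancy. Viewed objectively, it offers a general preprocessing framework that ties a trajectory's physical motion properties (velocity, acceleration) to an information measure and can improve the data efficiency of imitation learning.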

Abstract: Imitation learning for robotic manipulation relies on large sets of human demonstration trajectories, which are often noisy and temporally irregular due to variable operator speed, intermittent pauses, and inconsistent action density. A common preprocessing strategy is time-uniform downsampling to shorten sequences, but it cannot effectively remove speed-induced non-uniformity or redundant pauses. This mismatch degrades data quality and hinders policy learning. To address this issue, we propose Information-Standardized Trajectory Resampling (ISR), an offline preprocessing method for effective imitation learning. ISR resamples each trajectory by enforcing approximately equal information distance between adjacent points. Specifically, we map trajectories onto an information-modulated Riemannian manifold and perform geodesic-equidistant parameterization. We construct an information-intensity field from velocity and acceleration norms: the velocity term removes small-motion redundancy, while the acceleration term preserves high-curvature and fine-manipulation phases. We evaluate ISR on three real-world manipulation tasks with mainstream imitation learning policies. Compared with the baseline time-uniform 3x downsampling, ISR improves task success rates by about 25%, remains robust across datasets collected from different operators, and reduces both dataset size and training cost. The code and videos are publicly available at https://d-robotics-ai-lab.github.io/isr.page.
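
The idea of spacing samples by equal information distance rather than equal time can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the intensity is taken to be a weighted sum `alpha * speed + beta * acceleration magnitude` estimated by finite differences, and the names `information_resample`, `alpha`, and `beta` are hypothetical. The abstract states only that the field is built from velocity and acceleration norms.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def information_resample(points: Sequence[Point], dt: float, num_samples: int,
                         alpha: float = 1.0, beta: float = 1.0) -> List[Point]:
    """Resample so that neighbours are roughly equally spaced in cumulative
    'information distance' instead of in time."""
    n = len(points)
    if n < 2 or num_samples < 2:
        raise ValueError("need at least two input points and two output samples")
    # Finite-difference velocity on each of the n - 1 segments.
    vel = [[(q - p) / dt for p, q in zip(points[i], points[i + 1])]
           for i in range(n - 1)]
    # Acceleration magnitude between consecutive segments, padded at the end
    # so that every segment has a value.
    acc = [_norm([(b - a) / dt for a, b in zip(vel[i], vel[i + 1])])
           for i in range(n - 2)]
    acc.append(acc[-1] if acc else 0.0)
    # Assumed intensity: weighted sum of speed and acceleration magnitude.
    intensity = [alpha * _norm(vel[i]) + beta * acc[i] for i in range(n - 1)]
    # Cumulative information distance at each original point.
    cum = [0.0]
    for w in intensity:
        cum.append(cum[-1] + w * dt)
    total = cum[-1]
    resampled: List[Point] = []
    j = 0
    for k in range(num_samples):
        target = total * k / (num_samples - 1)
        # Advance to the segment whose cumulative range contains the target.
        while j < n - 2 and cum[j + 1] < target:
            j += 1
        span = cum[j + 1] - cum[j]
        frac = (target - cum[j]) / span if span > 0 else 0.0
        resampled.append(tuple(p + frac * (q - p)
                               for p, q in zip(points[j], points[j + 1])))
    return resampled


# A 1-D demonstration that pauses for two steps before moving.
demo = [(0.0,), (0.0,), (0.0,), (1.0,), (2.0,)]
out = information_resample(demo, dt=1.0, num_samples=3, alpha=1.0, beta=0.0)
```

With `beta=0.0` the pause contributes no information distance, so the three output samples fall on the moving part of the trajectory, whereas time-uniform downsampling would spend samples on the pause. A nonzero `beta` gives extra weight to segments where the velocity changes.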


[290] Real-Time Multimodal Activity-Aware Error Detection in Robot-Assisted Surgery cs.RO | cs.CV

Seyed Hamid Reza Roodabeh, Zongyu Li, Homa Alemzadeh

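TL;DR: This paper proposes a unified framework for real-time, multimodal, activity-aware error detection in robot-assisted surgery. The framework integrates video, kinematic data, and descriptive textual prompts, and significantly improves error detection performance through activity prompting and activity-aware visual embeddings.

Details

Motivation: Robot-assisted minimally invasive surgery improves precision but adds complexity. Existing video-based error detection methods often overlook fine-grained descriptions of activities and error types within the hierarchical structure of surgical procedures, and they under-utilize complementary multimodal information.

Result: On the JIGSAWS and SAR-RARP50 datasets, the framework achieves F1 score improvements of up to 5% and 16.6%, respectively, over state-of-the-art baselines.

Insight: The novelty lies in activity prompting, which integrates descriptive language about gesture-level activities, instrument-object interactions, and error definitions, and in activity-aware embeddings produced by vision encoders pretrained on surgical activity labels. Kinematic data are integrated with the video and text modalities, improving the effectiveness of multimodal fusion.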

Abstract: Robot-assisted minimally invasive surgery improves surgical precision but introduces complexity, making technical error detection essential for ensuring patient safety. Current executional error detection methods using video data often overlook fine-grained contextual descriptions of activities and error types within the hierarchical structure of surgical procedures. They also under-utilize complementary multimodal information. We propose a unified framework for executional error detection that leverages multimodal input, including video, kinematics, and descriptive textual prompts. Through activity prompting, we integrate descriptive language in gesture-level activities, instrument-object interactions, and error definitions. We also introduce activity-aware visual embeddings derived from vision encoders pretrained on surgical activity labels to compare the effectiveness of contrastive language-image embeddings with traditional image-based embeddings for error detection. By seamlessly integrating kinematic data with video and textual modalities, our framework significantly improves error detection performance. Achieving up to 5% and 16.6% F1 score improvements over state-of-the-art baselines on the JIGSAWS and SAR-RARP50 datasets, respectively, we demonstrate the value of combining curated textual prompts with multimodal data for accurate error detection.


cs.LG

[291] Decodable but Not Faithful: Coupling Natural-Language Rationales to Programmatic Verifiers cs.LG | cs.AI | cs.CL

Vatsal Ananthula, Adarsh Kumarappan

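TL;DR: This paper proposes 'verifier-coupled reasoning', a framework that inserts inline claims into reasoning traces and trains an auxiliary consistency head to predict the outputs of programmatic verifiers. The central finding is a gap between decodability and faithfulness: consistency training reliably makes verifier information decodable from rationale representations, but decodability does not guarantee that the generated explanations are faithful.

Details

Motivation: Language models can generate plausible rationales for their predictions, but these explanations may not faithfully reflect the model's internal reasoning. The work therefore uses programmatic verifiers to assess and shape the faithfulness of model-generated explanations.

Result: In LeanCheck (formal theorem proving), rationale-only and proof-only pooling achieve perfect directional separation under counterfactual conflict. In KataGo (Go engine), commentary spans encode 10-way win-rate buckets at 81% accuracy. In a code setting, the model achieves 98.6% coupling while its generated explanations remain unfaithful. Synthetic activation patching confirms causal influence (73-89% vs. a 31% baseline), and FEVER shows that evidence-only pooling isolates genuine evidence sensitivity at the cost of raw accuracy.

Insight: The novelty is coupling programmatic verifiers to a language model's reasoning in order to diagnose and shape representations. The work exposes a key gap between 'decodability' and 'faithfulness': consistency losses are effective diagnostic tools but are not sufficient to ensure faithful reasoning. This offers a new perspective on assessing the reliability of explanations.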

Abstract: Language models can generate plausible rationales for their predictions, but these explanations may not faithfully represent the model’s internal reasoning. We propose verifier-coupled reasoning, a framework that inserts inline claims into reasoning traces and trains an auxiliary consistency head to predict programmatic verifier outputs from rationale-span hidden states. The central finding is a gap between decodability and faithfulness: consistency training reliably makes verifier information decodable from rationale representations, but decodability does not guarantee faithful generation. In LeanCheck (formal theorem proving), rationale-only and proof-only pooling achieve perfect directional separation under counterfactual conflict. In KataGo (Go engine), commentary spans encode 10-way win-rate buckets at 81% accuracy. Yet in a code setting, the model achieves 98.6% coupling while its generated explanations remain unfaithful: fluent prose with correct structured claims, but describing unrelated algorithms; a controlled pretrained-vs-from-scratch comparison shows the gap is not capacity-driven. Synthetic activation patching confirms causal influence (73-89% vs. 31% baseline), FEVER reveals that evidence-only pooling isolates genuine evidence sensitivity at the cost of raw accuracy, and per-claim analysis shows that consistency loss disproportionately benefits fine-grained claims over binary ones. These results establish that consistency losses are effective diagnostics and representation-shaping tools, but not sufficient conditions for faithful reasoning.


[292] Local Causal Attribution of Chain-of-Thought Reasoning cs.LG | cs.CL

Dennis Wei, Yannis Belkhiter, Erik Miehling, Radu Marinescu

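TL;DR: This paper proposes AttriCoT, a local causal attribution method for analyzing the causal relationships among the units of a language model's chain-of-thought reasoning. It constructs a structural causal model and estimates each unit's importance to subsequent outputs using only O(U) forward passes. Evaluation on 5 datasets and 4 reasoning models shows that it is more faithful to model behavior than existing methods.

Details

Motivation: Understanding the causal structure of a language model's reasoning matters for the transparency and safety of its thought process. The paper takes a local approach, performing causal analysis on the individual units of a specific chain-of-thought trace.

Result: Evaluation of perturbation curves across 5 datasets and 4 reasoning models shows that AttriCoT produces attributions that are more faithful to model behavior than alternative methods, and reveals notable differences in thought structure between models and domains.

Insight: The novelty is applying a structural causal model to local, unit-level attribution of chain-of-thought, yielding an efficient black-box method (requiring only a linear number of forward passes) and a new tool for interpreting model reasoning.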

Abstract: Understanding the causal structure of a language model’s thought process is a problem of significant importance for both transparency and safety. In this work, we take a local approach toward this goal by analyzing the causal relationships among individual components, termed units, of a given, specific chain-of-thought trace. We construct a structural causal model on these units and relate each unit to the log probability of generating (subsequent) output units. Our algorithm, termed AttriCoT, is a black-box method that performs attribution by estimating importance parameters in the structural causal model using $O(U)$ forward passes through the model, where $U$ is the number of units. Evaluation of perturbation curves across 5 datasets and 4 reasoning models shows that AttriCoT produces attributions that are more faithful to the model’s behavior than alternative methods. The attribution results also reveal notable differences in thought structure between models and domains.


[293] A Verifiable Search Is Not a Learnable Chain-of-Thought cs.LG | cs.AI | cs.CL

Harsh Patel

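TL;DR: This paper challenges the assumption that 'any task solvable by a short program can be taught to a model as its chain-of-thought'. Using nine reasoning tasks created by deterministic generators as a testbed, it finds that for tasks requiring search over information-free structure (such as cryptarithms), chain-of-thought distillation fails to teach an effective search procedure even when the model performs the arithmetic correctly; the model learns only to memorize and verify.

Details

Motivation: The paper tests a common assumption: that any task solvable by a short program can be taught to a model via chain-of-thought (write out the steps, fine-tune). It aims to identify the class of tasks for which this assumption fails, in particular procedures that depend on search over information-free structure.

Result: Among the nine reasoning tasks, forward-computable tasks transfer (lookup/arithmetic at accuracy >= 0.99 and an 8-bit boolean task at 0.68), but cryptarithm fails: although a search solver answers 71% of instances, chain-of-thought distillation reaches only 0.01-0.07 accuracy. The model does the arithmetic correctly on 97-100% of lines and ranks the correct cipher in its top eight on 71% of instances, yet it cannot carry the search forward as a left-to-right derivation. Precomputing the combinatorial core and reducing the trace to recall plus verification reaches 0.92 on Private LB.

Insight: The contribution is to expose a limitation of chain-of-thought distillation: for tasks that require search over information-free structure (such as cryptarithms), no faithful forward derivation can be learned, only memorization and verification. This challenges the view of chain-of-thought as a universal learning paradigm and indicates that such a task becomes learnable only when the search is removed and precomputed into a catalog. Viewed objectively, the paper isolates the cause through a controlled intervention (revealing the cipher key), giving insight into the boundaries of model reasoning.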

Abstract: It is tempting to assume any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deterministic generator; public and hidden splits share generators, so held-out data proxies test accuracy. I reverse-engineer the generators into Python solvers, render them as chain-of-thought, and distill into a rank-<= 32 LoRA over a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: lookup/arithmetic and an 8-bit boolean task transfer (>= 0.99 and 0.68). Cryptarithm does not: distilling its backtracking search holds at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%; it cannot carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step while its verdicts become unconditional templates, correct only 16-57% of the time (“verdict-as-token”). The ceiling holds across backbones from 3B to 671B and across fine-tuning and prompting; a controlled intervention isolates the cause: revealing the cipher key, which turns the derivation forward, lifts the same instances from 0.03 to 0.57. When a procedure’s only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.


[294] Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation cs.LG | cs.AI | cs.CL | cs.GR

Na Sang, Ding Ma, Rui Sang, Yuxuan Liu

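TL;DR: This paper proposes Concept-Constrained Prompt Learning (CCPL), a lightweight regularization framework for few-shot CLIP adaptation. It anchors learnable class prompts to frozen concept-level text prototypes and adds a text-space cosine consistency objective and concept dropout, aiming to reduce overfitting during class-prompt optimization and improve generalization to unseen classes.

Details

Motivation: In few-shot prompt learning, optimizing class prompts alone tends to overfit base-class supervision and weaken transfer to unseen classes. The paper regularizes prompt learning with concept-level constraints.

Result: Under identical automatically-generated fallback splits, CCPL improves the base-to-new harmonic mean over CoOp by 0.6 percentage points on DTD and 2.9 percentage points on EuroSAT, and is near-neutral on OxfordPets (-0.1). Ablations indicate that text-space concept regularization is consistently beneficial, while the best concept-guided inference strength varies by dataset and protocol.

Insight: The novelty is a lightweight concept-constrained framework that does not update the CLIP encoders and regularizes prompt learning through text-space alignment and concept dropout. Viewed objectively, the key insight is to use frozen concept prototypes as semantic anchors that guide and stabilize prompt learning; its effectiveness depends heavily on how naturally the concept prototypes align with dataset semantics, and fine-grained categories mark a current boundary condition.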

Abstract: Few-shot prompt learning is an effective strategy for adapting CLIP to downstream tasks, but class-only prompt optimization can overfit base-class supervision and weaken transfer to unseen classes. We propose Concept-Constrained Prompt Learning (CCPL), a lightweight regularization framework that anchors learnable class prompts to frozen concept-level text prototypes without updating CLIP encoders. CCPL learns a set of shared context tokens, instantiates class prompts by appending class names, and constructs frozen concept prototypes from a class-level concept bank. During training, a text-space cosine consistency objective aligns learnable class-prompt embeddings with frozen concept prototypes; concept dropout provides additional regularization against over-reliance on fixed concept lists. At inference, CCPL optionally fuses class-prompt logits with concept-prototype logits using a controllable ensemble weight alpha. Our default configuration uses text-space concept regularization lambda = 0.5, concept dropout p = 0.3 and weak concept-guided fusion (alpha = 0.1), with no KL-based prediction consistency term. Experiments under identical automatically-generated fallback splits show that CCPL improves the base-to-new harmonic mean on DTD (+0.6) and EuroSAT (+2.9) compared with CoOp, while remaining near-neutral on OxfordPets (-0.1). Ablations indicate that text-space concept regularization is consistently beneficial, while the best concept-guided inference strength is dataset- and protocol-sensitive. These results suggest concept constraints are most effective when concept prototypes align naturally with dataset semantics, and identify fine-grained categories as a current boundary condition. The code is released at: https://github.com/richael-sang/concept-constrained-prompt-learning.
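
The three ingredients named in the abstract (a text-space cosine consistency term, concept dropout, and alpha-weighted logit fusion) can be sketched on plain Python lists. The exact forms here are assumptions: the loss is taken as the mean of `1 - cosine`, the prototype as the mean of the surviving concept embeddings, and the fusion as a convex combination. The released repository is the authority on the actual definitions.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def concept_prototype(concept_embs: Sequence[Vector], drop_p: float,
                      rng: random.Random) -> List[float]:
    """Average the concept embeddings that survive dropout (assumed form)."""
    kept = [e for e in concept_embs if rng.random() >= drop_p]
    if not kept:  # always keep at least one concept
        kept = [rng.choice(list(concept_embs))]
    dim = len(kept[0])
    return [sum(e[d] for e in kept) / len(kept) for d in range(dim)]


def consistency_loss(class_embs: Sequence[Vector],
                     prototypes: Sequence[Vector]) -> float:
    """Mean (1 - cosine) between each class-prompt embedding and its prototype."""
    pairs = list(zip(class_embs, prototypes))
    return sum(1.0 - cosine(c, p) for c, p in pairs) / len(pairs)


def fuse_logits(class_logits: Vector, concept_logits: Vector,
                alpha: float = 0.1) -> List[float]:
    """Assumed convex combination controlled by the ensemble weight alpha."""
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(class_logits, concept_logits)]


rng = random.Random(0)
proto = concept_prototype([[1.0, 0.0], [0.0, 1.0]], drop_p=0.3, rng=rng)
reg = consistency_loss([[1.0, 0.0]], [proto])
# With the stated default lambda = 0.5, training would minimise something
# like task_loss + 0.5 * reg (task_loss is not modelled in this sketch).
fused = fuse_logits([1.0, 0.0], [0.0, 1.0], alpha=0.1)
```

Only the class-prompt embeddings would receive gradients in a real implementation; the prototypes stay frozen, which is what makes them act as anchors.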


[295] Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning cs.LG | cs.AI | cs.CL

Yunan Wang, Minghui Song, Zihan Zhang, Shaohan Huang, Haizhen Huang

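TL;DR: This paper proposes Group-Graph Policy Optimization (G2PO), a group-based reinforcement learning algorithm designed for the reward sparsity and delay of long-horizon agentic reinforcement learning. It converts linear interaction trajectories into a global state-transition graph and introduces group-aggregation state-value estimation and an edge-centric advantage estimation strategy to improve credit assignment. On WebShop, ALFWorld, and AppWorld, G2PO substantially outperforms existing methods.

Details

Motivation: Long-horizon agentic reinforcement learning suffers from coarse credit assignment, high-variance state-value estimation, and myopic exploration caused by sparse and delayed rewards. Existing step-level training frameworks refine training granularity but still treat agent exploration as isolated, linear trajectories, ignoring the inherent graph structure of state transitions.

Result: On the representative long-horizon benchmarks WebShop, ALFWorld, and AppWorld, G2PO substantially outperforms state-of-the-art prompt-based and reinforcement learning baselines, with success rate improvements of up to 22.2% over GRPO.

Insight: The core novelty is to model linear trajectories explicitly as a global state-transition graph, together with group-aggregation state-value estimation and an edge-centric advantage estimation strategy based on TD errors standardized globally across the graph. This reduces variance and trajectory-dependent bias, and identifies and prioritizes the critical state transitions that drive task progress.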

Abstract: Group-based Reinforcement Learning (RL) has significantly enhanced Large Language Models (LLMs) in agentic scenarios. To achieve finer-grained policy updates, recent agentic RL frameworks have shifted from trajectory-level to step-level training. However, long-horizon agentic RL suffers from severe reward sparsity and delay, as feedback is often deferred for dozens of interaction steps. While existing step-level frameworks refine training granularity, their credit assignment remains coarse-grained and still treats agent exploration as isolated, linear trajectories. This oversimplified perspective ignores the inherent graph structure of state transitions, leading to high-variance state-value estimation and myopic, localized credit assignment. To overcome these critical bottlenecks, we propose Group-Graph Policy Optimization (G2PO), a novel group-based RL algorithm tailored for multi-turn agentic tasks. G2PO explicitly transforms linear interaction trajectories into a global state-transition graph. By aggregating identical observations across different trajectories, we introduce group-aggregation state-value estimation that reduces sampling variance and trajectory-dependent bias. Furthermore, we redefine agent actions as transitions between state nodes and propose an edge-centric advantage estimation strategy. By globally standardizing Temporal Difference (TD) errors across the entire graph, G2PO explicitly identifies and prioritizes critical transitions that drive absolute task progress. Extensive experiments on representative long-horizon benchmarks (WebShop, ALFWorld, and AppWorld) demonstrate that G2PO substantially outperforms state-of-the-art prompt-based and RL baselines, achieving remarkable success rate improvements of up to 22.2% over GRPO.
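
A rough sketch of the two estimation steps follows, under simplifying assumptions that are not taken from the paper: each trajectory is a list of hashable observations with a single terminal reward, a state's value is the mean discounted return over every visit in the group, and TD errors are standardized per transition across the whole group rather than merged per graph edge. The function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

# (observations visited before each action, reward received at termination)
Trajectory = Tuple[Sequence[Hashable], float]


def group_state_values(trajs: Sequence[Trajectory],
                       gamma: float = 1.0) -> Dict[Hashable, float]:
    """Pool returns of identical observations across all trajectories."""
    returns: Dict[Hashable, List[float]] = defaultdict(list)
    for obs, reward in trajs:
        horizon = len(obs)
        for t, o in enumerate(obs):
            returns[o].append(gamma ** (horizon - 1 - t) * reward)
    return {o: sum(g) / len(g) for o, g in returns.items()}


def standardized_td_advantages(trajs: Sequence[Trajectory], gamma: float = 1.0,
                               eps: float = 1e-8) -> List[List[float]]:
    """TD error per transition, standardized over every transition in the group."""
    values = group_state_values(trajs, gamma)
    td: List[List[float]] = []
    for obs, reward in trajs:
        errors = []
        for t, o in enumerate(obs):
            if t + 1 < len(obs):
                # Intermediate step: no reward, bootstrap from the next state.
                errors.append(gamma * values[obs[t + 1]] - values[o])
            else:
                # Final step: the terminal reward arrives, no successor state.
                errors.append(reward - values[o])
        td.append(errors)
    flat = [e for errors in td for e in errors]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((e - mean) ** 2 for e in flat) / len(flat))
    return [[(e - mean) / (std + eps) for e in errors] for errors in td]


# Two rollouts share the start state "a" and then diverge.
group = [(["a", "b"], 1.0), (["a", "c"], 0.0)]
adv = standardized_td_advantages(group)
```

In this toy group the shared state "a" gets value 0.5, so the transition into "b" receives a positive advantage and the transition into "c" a negative one, while the final steps, whose outcome is already determined, receive zero.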


[296] VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models cs.LG | cs.CV

Florian Seligmann, Emiliyan Gospodinov, Enes Ulas Dincer, Gerhard Neumann

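TL;DR: This paper proposes VLA-FAIL, an efficient task failure detection framework for fine-tuned Vision-Language-Action models. It combines two new detectors that require no failure data: an out-of-distribution state detector based on the Mahalanobis distance of last-layer features, and an action chunk consistency detector that exploits temporal overlap. Experiments show that the two detectors capture complementary failure modes and together enable reliable, early failure detection.

Details

Motivation: Vision-Language-Action models can behave unpredictably in out-of-distribution scenarios during real-world deployment, so runtime failure detection is essential. Existing detectors are computationally expensive, rest on restrictive architectural assumptions, or require failure data.

Result: In extensive real-world and simulation experiments, the method achieves reliable and early failure detection across diverse tasks, frequently outperforming baselines that are significantly more computationally expensive.

Insight: The novelty lies in two lightweight, complementary detectors that need no failure data, and in AUCPDT, a threshold-independent metric that jointly evaluates precision, recall, and detection time to capture the trade-off between detection accuracy and latency.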

Abstract: Vision-language-action models (VLAs) achieve state-of-the-art performance on many robotic manipulation tasks, yet they can still behave unpredictably in out-of-distribution scenarios. Runtime failure detection is therefore essential for the safe real-world deployment of VLAs. However, existing task failure detectors require computationally expensive action sampling, are based on architectural assumptions that limit their applicability to VLAs, or need access to failure rollouts. We propose VLA-FAIL, a lightweight and broadly applicable failure detection framework for VLAs that combines two novel failure detectors with minimal overhead, without requiring failure data. The first, last-layer Mahalanobis distance (LLMD), detects out-of-distribution states by measuring token-wise deviations in last-layer features relative to the training data. The second, action chunk consistency (ACC), exploits the temporal overlap induced by receding-horizon control and detects failures when consecutive action chunks become inconsistent. To capture the trade-off between detection accuracy and detection latency, we introduce AUCPDT, a threshold-independent metric that jointly evaluates precision, recall, and detection time. Through extensive real-world and simulation experiments, we demonstrate that LLMD and ACC capture complementary failure modes whose combination enables reliable and early failure detection across diverse tasks, frequently outperforming significantly more expensive baseline methods.
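
The two detector scores can be approximated in a few lines. This sketch makes several simplifications that are assumptions rather than the paper's method: the Mahalanobis distance uses a diagonal covariance fitted on a small list of feature vectors (the paper works token-wise on last-layer features), chunk inconsistency is the mean Euclidean gap on overlapping timesteps, and the two scores are combined with a simple OR over hypothetical thresholds.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_diagonal_gaussian(features: Sequence[Vector]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population variance of training features."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def mahalanobis_diag(x: Vector, mean: Vector, var: Vector,
                     eps: float = 1e-6) -> float:
    """Mahalanobis distance under a diagonal-covariance simplification."""
    return math.sqrt(sum((a - m) ** 2 / (v + eps)
                         for a, m, v in zip(x, mean, var)))


def chunk_inconsistency(prev_chunk: Sequence[Vector], next_chunk: Sequence[Vector],
                        stride: int) -> float:
    """Mean gap between two predictions for the same future timesteps.

    prev_chunk was predicted at step t and next_chunk at step t + stride, so
    prev_chunk[stride:] and the start of next_chunk cover the same timesteps.
    """
    overlap_prev = prev_chunk[stride:]
    overlap_next = next_chunk[:len(overlap_prev)]
    if not overlap_prev or len(overlap_next) != len(overlap_prev):
        raise ValueError("chunks do not overlap for this stride")
    return sum(math.dist(a, b)
               for a, b in zip(overlap_prev, overlap_next)) / len(overlap_prev)


def is_failure(ood_score: float, acc_score: float,
               ood_threshold: float, acc_threshold: float) -> bool:
    """Assumed combination rule: flag when either score exceeds its threshold."""
    return ood_score > ood_threshold or acc_score > acc_threshold


mean, var = fit_diagonal_gaussian([[0.0, 0.0], [2.0, 2.0]])
ood = mahalanobis_diag([1.0, 4.0], mean, var)
prev = [[0.0], [1.0], [2.0], [3.0]]
consistent = chunk_inconsistency(prev, [[2.0], [3.0], [4.0], [5.0]], stride=2)
drifting = chunk_inconsistency(prev, [[3.0], [3.0], [4.0], [5.0]], stride=2)
```

Both scores need only nominal (successful) rollouts to calibrate: the Gaussian statistics come from training features, and a threshold on chunk inconsistency can be chosen from its distribution on successful episodes.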


[297] OphthaDT: Generative Digital Twins for Forecasting Visual Acuity Trajectories in Ophthalmology cs.LG | cs.CV

Pietro Belligoli, Nikita Makarov, Sayedali Shetab Boushehri, Fabian Schmich, Raul Rodriguez-Esteban

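TL;DR: This paper proposes OphthaDT, an LLM-based digital twin for ophthalmology that forecasts patients' best corrected visual acuity trajectories. It serializes longitudinal multimodal clinical data from 3,220 patients into structured narratives. In benchmarks spanning up to 100 weeks on two diseases, neovascular age-related macular degeneration and diabetic macular edema, it achieves the lowest prediction error on the former and competitive performance on the latter.

Details

Motivation: Precision medicine in ophthalmology requires accurate longitudinal predictions, but the fragmented nature of multimodal clinical data is a major barrier to forecasting. The paper addresses this by building a digital twin model that makes better use of patient histories to forecast visual prognosis.

Result: In benchmarks spanning up to 100 weeks, OphthaDT achieves the lowest prediction error on nAMD, with an average mean absolute error (MAE) reduction of 6.0% compared with all baselines. On DME it performs competitively, with average MAE reductions of 2.6% and 6.9% relative to Random Forest and XGBoost, respectively.

Insight: The novelty is using an LLM to build a clinical digital twin that serializes longitudinal patient data into structured narratives for forecasting. Viewed objectively, the method handles irregularly sampled data without imputation, and its predictive advantage grows with trajectory complexity, suggesting a methodology that could reduce patient burden and accelerate drug development.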

Abstract: Precision medicine in ophthalmology requires accurate longitudinal predictions, but the fragmented nature of multimodal clinical data remains a barrier to forecasting. We introduce OphthaDT, an LLM-based digital twin for ophthalmology that serializes longitudinal patient histories from 3,220 patients across four Phase III clinical trials into structured narratives to forecast best corrected visual acuity (BCVA). In benchmarks spanning up to 100 weeks, OphthaDT demonstrated the lowest prediction error in neovascular age-related macular degeneration (nAMD), achieving an average mean absolute error (MAE) reduction of 6.0% compared to all baselines. In diabetic macular edema (DME), OphthaDT demonstrated competitive performance against all baselines while outperforming Random Forest and XGBoost by an average MAE reduction of 2.6% and 6.9%, respectively. Results reveal that OphthaDT’s predictive advantage scales with trajectory complexity: whereas linear models remain effective for the more stable treatment responses of DME, OphthaDT’s capacity is better suited for capturing the high longitudinal variability of nAMD. Finally, OphthaDT handles irregular sampling without imputation, positioning LLM-based clinical trajectory modeling as a methodology that could reduce patient burden and accelerate drug development.


[298] Discovering Latent Groups for Robust Classification cs.LG | cs.AI | cs.CV

Ankur Garg, Ulrich Aïvodji, Samira Ebrahimi Kahou, Vincent Michalski

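TL;DR: This paper proposes neural classification trees (NCT), a framework that addresses the poor performance of machine learning models on minority subgroups caused by reliance on spurious correlations. NCT encodes subgroup structure in a tree-shaped architecture, routes each sample to an 'easy' or 'hard' node based on prediction correctness, and reuses these routes as pseudo-labels for iterative training, disentangling conflicting subgroups without subgroup annotations.

Details

Motivation: Existing methods typically rely on subgroup annotations or inferred pseudo-group labels to adjust network parameters, but at inference they output only a class prediction and reveal nothing about a sample's latent subgroup. The paper aims for a framework that needs no subgroup supervision and provides both robust classification and an interpretable subgroup structure.

Result: On five benchmarks spanning binary and multi-class spurious correlations, NCT achieves robustness competitive with state-of-the-art methods, while its learned tree topology consistently isolates minority subgroups, giving a transparent mapping between the model architecture and the data's latent group structure.

Insight: The novelty is embedding subgroup discovery directly in a tree-shaped model architecture: a routing mechanism based on prediction correctness generates pseudo-labels for self-supervised learning, achieving robust classification and interpretable subgroup separation without extra annotation. This offers a new approach to building transparent and robust classifiers.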

Abstract: Machine learning models exploit spurious correlations, achieving high average accuracy but failing disproportionately on underrepresented subgroups. Existing methods address this by adjusting network parameters, guided either by subgroup annotations or inferred pseudo-group labels. Yet at inference, these methods produce only a class prediction, with no insight into a sample’s latent subgroup. We propose neural classification trees (NCT), a framework that achieves robustness by encoding subgroup structure in its tree-shaped architecture. By routing each sample to an “easy” or “hard” node of this tree – based on prediction correctness – and reusing these routes as pseudo-labels for the next iteration, NCT disentangles conflicting subgroups, without requiring subgroup supervision. We evaluate NCT on five benchmarks spanning binary and multi-class spurious correlations. Our experiments show that the learned tree topology provides strong interpretability by consistently isolating minority subgroups, which provides a transparent mapping between the model architecture and the data’s latent group structure, while yielding competitive robustness with state-of-the-art methods.