cs.CL

[1] MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision cs.CL | cs.HC | cs.MA

Ye Jin, Yangyang Xu, Jun Zhu, Yibo Yang

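TL;DR: MemSlides is a hierarchical memory-driven agent framework for personalized presentation generation. It separates long-term memory from working memory and further splits long-term memory into user profile memory and tool memory, so that user preferences and constraints persist across multi-turn local revision. The framework pairs this with scoped local revision, so each edit acts only on the smallest affected region instead of regenerating the whole deck.

Motivation: In personalized presentation generation, an agent must keep stable user preferences across tasks, retain preferences and constraints newly introduced during multi-turn revision, and carry out local edits reliably.

Result: In controlled experiments, user profile memory improves persona-alignment judgments on a multi-persona, multi-intent profile bank; tool-memory injection improves closed-loop modify behavior in diagnostic matched-pair settings; and qualitative cases illustrate how working memory carries preferences over between rounds.

Insight: The contribution is a layered memory design that separates persistent user profiles, session-level working memory, and reusable execution experience, combined with scoped local revision. It offers an architectural template for generation tasks that need long-term preference management and iterative editing.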

Abstract: Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a hierarchical memory framework for personalized presentation agents that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory stores intent-conditioned profiles for round-0 personalization, working memory carries active preferences and session constraints across revision rounds, and tool memory stores reusable execution experience for reliable localized editing. MemSlides pairs this memory design with scoped slide-local revision, so targeted updates act on the smallest affected region instead of repeatedly regenerating the full deck. In controlled experiments, user profile memory improves persona-alignment judgments on a multi-persona, multi-intent profile bank, tool-memory injection improves closed-loop modify behavior in diagnostic matched-pair settings, and qualitative cases illustrate working memory’s ability to carry over preferences. Taken together, these results suggest that effective personalization in presentation authoring depends on separating persistent user profiles, session-level working memory, and reusable execution experience across generation and localized revision.
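
The memory separation described in this abstract can be illustrated with a minimal sketch. Every class, field, and function name below is hypothetical, and the rule that session preferences override the stored profile is an assumption, not something the abstract states.

```python
from dataclasses import dataclass, field


@dataclass
class LongTermMemory:
    # Hypothetical layout: profiles keyed by (user, intent), tool notes keyed by tool name.
    user_profiles: dict = field(default_factory=dict)
    tool_experience: dict = field(default_factory=dict)


@dataclass
class WorkingMemory:
    # Session-scoped preferences and constraints gathered during revision rounds.
    active_preferences: dict = field(default_factory=dict)


def build_context(ltm, wm, user, intent, tool):
    """Merge the stores for one revision round (assumed: session preferences win)."""
    profile = dict(ltm.user_profiles.get((user, intent), {}))
    profile.update(wm.active_preferences)
    return {"preferences": profile, "tool_notes": ltm.tool_experience.get(tool, [])}


def revise_slide(deck, index, new_text):
    """Scoped revision: replace one slide and leave the rest of the deck untouched."""
    return deck[:index] + [new_text] + deck[index + 1:]
```

Keeping the profile in a separate store means a one-off session constraint never overwrites the persistent profile; it only shadows it while the session's working memory is alive.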


[2] Self-Generated Error Training for Token Editing in Diffusion Language Models cs.CL

Lin Yao

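TL;DR: This paper studies the mismatch between training and inference data distributions in token-to-token (T2T) editing for LLaDA2.1 under block-diffusion decoding, and proposes self-generated error training. The method runs a no-gradient draft pass to reproduce inference-time errors, then supervises recovery from those self-generated errors to improve editing.

Motivation: The T2T editor in LLaDA2.1 is trained on random vocabulary corruptions, but at inference it faces the model's own fluent, high-confidence draft errors.

Result: The method is implemented as LoRA continued pretraining on LLaDA2.1-mini and evaluated on several benchmarks under the official Q-Mode T2T procedure with unchanged inference parameters. It generally improves accuracy while reducing T2T edit intensity, and mitigates failure modes such as final-digit transcription errors after otherwise correct reasoning and excessive self-correction before short factual answers.

Insight: The key idea is a self-generated error training paradigm: training data is built from the model's own inference-time error distribution, which aligns training with inference. It is a practical approach to the training-inference mismatch in diffusion language models.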

Abstract: Token-to-token (T2T) editing lets LLaDA2.1 revise committed tokens during block-diffusion decoding. The released recipe trains this editor on random vocabulary corruptions, but at inference the editor sees the model’s own fluent, high-confidence draft errors instead. We study this training-inference mismatch and propose self-generated T2T, which performs a no-gradient draft pass, fills masked positions with predicted tokens, and supervises recovery in a second pass under these self-generated corruptions. We implement the update as a short LoRA continued-pretraining pass on LLaDA2.1-mini and evaluate on several benchmarks under the official Q-Mode T2T procedure with unchanged inference parameters. The method generally improves accuracy while reducing T2T edit intensity, mitigating failure modes such as final-digit transcription errors after otherwise correct reasoning and excessive self-correction before short factual answers.
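
To make the contrast between the two corruption schemes concrete, here is a toy sketch of how the training pairs might be built. The function names and the stand-in draft model are hypothetical; the real method runs a no-gradient forward pass of LLaDA2.1, which is not reproduced here.

```python
import random


def random_corruption(tokens, positions, vocab, rng):
    """Random-vocabulary corruption: overwrite the chosen positions with random tokens."""
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = rng.choice(vocab)
    return corrupted


def self_generated_corruption(tokens, positions, draft_fn, mask="<mask>"):
    """Mask the positions, run a draft pass, and fill them with the model's own predictions."""
    masked = [mask if i in positions else t for i, t in enumerate(tokens)]
    predictions = draft_fn(masked)  # stands in for the no-gradient draft pass
    return [predictions[i] if i in positions else t for i, t in enumerate(tokens)]


def toy_draft(masked):
    # Hypothetical stand-in for a model: always guesses "7" at masked positions.
    return ["7" if t == "<mask>" else t for t in masked]


clean = ["2", "+", "3", "=", "5"]
corrupted = self_generated_corruption(clean, {4}, toy_draft)
training_pair = (corrupted, clean)  # the second pass is supervised to recover `clean`
```

In the toy run the corrupted sequence ends in a plausible but wrong digit, which is closer to the final-digit errors mentioned above than a random vocabulary token would be.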


[3] Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors cs.CL | cs.CV

Vanshali Sharma, Andrea M. Bejar, Halil Ertugrul Aktas, Quoc-Huy Trinh, Debesh Jha

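TL;DR: This study examines the challenges of efficiently adapting large language models (LLMs) to the medical domain, specifically 3D CT report generation: high computational complexity, and overfitting and clinical hallucination caused by limited data. It proposes RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that combines image embeddings with multi-label diagnostic classification logits and, with the LLM frozen, bridges the semantic gap between visual features and clinical terminology using very few trainable parameters.

Motivation: Extending multimodal learning (LLMs and VLMs in particular) from natural images to 3D medical images runs into computational complexity, volumetric dependencies, and a semantic gap between visual features and clinical terminology. Fine-tuning an LLM on limited medical data also tends to cause overfitting and clinical hallucination.

Result: Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and shows strong out-of-domain generalization. A systematic study of LLMs from 96.1M to 1.6B parameters finds that fine-tuning helps most for smaller LLMs, whereas for larger (~1B+) LLMs, freezing the model and training only lightweight projection layers gives a better trade-off between performance, generalization, and computational efficiency.

Insight: The main contribution is RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that uses diagnostic classification information to guide report generation, bridging the semantic gap while preserving key clinical details. The core insight is that, for 3D medical imaging tasks, freezing the backbone of a larger LLM and adapting only lightweight projection layers offers a better trade-off than full fine-tuning, which gives resource-constrained, domain-specific applications a workable adaptation recipe.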

Abstract: Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.


[4] Are you speaking my languages? On spoken language adherence in multimodal LLMs cs.CL | cs.SD | eess.AS

Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic

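TL;DR: This paper addresses language adherence in LLM-based automatic speech recognition (ASR): models often misidentify the output language, which hurts transcription fidelity and downstream application quality. The authors propose a soft prompting approach that hints at the likely spoken languages without strictly constraining the output, formally define the language adherence problem, introduce a new metric to quantify violations, and evaluate three mitigation strategies: zero-shot prompting, supervised fine-tuning (SFT), and Chain-of-Thought (CoT) reasoning.

Motivation: Reduce the loss of transcription fidelity and downstream quality caused by output-language errors in multimodal LLM-based ASR, while preserving the model's flexibility and code-switching ability.

Result: The three mitigation strategies are compared across multiple languages for how well they reduce language violations while maintaining overall ASR performance, and the trade-offs for choosing a strategy under different compute constraints are discussed.

Insight: The contribution is to formalize the language adherence problem and introduce a metric for it, to propose soft prompting as a balance between language constraints and flexibility, and to give deployment guidance through a systematic comparison of zero-shot prompting, SFT, and CoT reasoning.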

Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing language violations while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.
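
The abstract does not define its adherence metric, so the following is only one plausible reading, offered as an assumption: count a violation whenever the detected language of a transcript falls outside the hinted set. The function name and the pre-computed language labels are hypothetical.

```python
def language_violation_rate(detected_languages, hinted_languages):
    """Fraction of transcripts whose detected language falls outside the hinted set."""
    if not detected_languages:
        return 0.0
    hinted = set(hinted_languages)
    violations = sum(1 for lang in detected_languages if lang not in hinted)
    return violations / len(detected_languages)


# Four transcripts, hints allow English and German; only the French one violates.
rate = language_violation_rate(["en", "de", "fr", "en"], ["en", "de"])
```

Because the hint is a set rather than a single language, code-switched output between hinted languages is not penalized under this reading.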


[5] Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication cs.CL | cs.AI

Peter Zeng, Amie J. Paige, Weiling Li, Susan E. Brennan, Owen Rambow

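TL;DR: Two recent studies, Jones et al. (2026) and Zeng et al. (2026), reach contradictory conclusions about whether large vision-language models (LVLMs) can coordinate on efficient referring expressions. This paper controls for task differences and directly compares their prompting strategies. Models do coordinate on efficient referring expressions when explicitly prompted, which suggests task differences are not what caused the divergent results. With a more implicit prompt, however, the same models fail to infer the need for communicative efficiency, highlighting a key difference between how humans and AI systems communicate.

Motivation: Resolve the contradictory conclusions of the two studies about whether LVLMs can coordinate on efficient expressions in referential communication, by controlling for task variables and isolating the effect of explicit versus implicit prompting.

Result: Under explicit prompting, models coordinate on efficient referring expressions, replicating the earlier finding; under implicit prompting, the same models fail to infer the need for communicative efficiency. No specific benchmark or SOTA comparison is reported.

Insight: The paper shows that prompting strategy (explicit vs. implicit) has a decisive effect on LVLM communicative behavior. This points to a basic difference between AI systems and humans in implicit inference and situational understanding, and is relevant to designing more natural AI interaction.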

Abstract: Two recent studies (Jones et al. (2026); Zeng et al. (2026)) reach apparently contradictory conclusions about whether LVLMs can coordinate on efficient referring expressions. We control for task differences between the studies while directly comparing their prompting styles. We replicate the finding that models can coordinate efficient referring expressions when explicitly prompted to do so, suggesting that other task differences are not responsible for divergent results. However, we also find that the same models fail to infer the need for communicative efficiency from a more implicit prompt, highlighting critical differences between how humans and AI systems communicate.


[6] NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama cs.CL | cs.AI | cs.LG

Logan Mann, Abdur Rahman, Mohammad Saifullah, Taaha Kazi, Vasu Sharma

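TL;DR: This paper presents NarrativeWorldBench, a benchmark for evaluating the structural narrative consistency of long-running serialized audio drama, covering nine narrative metrics and four Indic languages. It also introduces N-VSSM, a Narrative Variational State-Space Model built on a Mamba-2 backbone that maintains a structured latent world state over more than 200 episodes and achieves better narrative consistency and cross-language fidelity at much lower compute than closed-frontier models.

Motivation: Frontier LLMs lose structural narrative consistency in long-form audio drama with arcs of 200 to 800 episodes; existing closed-frontier models degrade markedly at long horizons.

Result: 21 models are evaluated on NarrativeWorldBench. All closed-frontier models saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute, and cross-language fidelity rises by +0.20 to +0.23 Likert points. In a study with professional authors, N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and is rated +1.3 Likert points higher on controllability.

Insight: The contribution is a frontier-saturated benchmark aimed specifically at long-horizon narrative structure (NarrativeWorldBench) and a narrative variational state-space model (N-VSSM) that combines an event-conditioned posterior with a maintained structured latent state. The core insight is that explicitly modeling and maintaining a low-dimensional, structured latent world state, rather than relying on autoregressive generation alone, addresses the collapse of narrative consistency at long horizons. A learned Cultural Transfer Function is added to improve cross-language narrative fidelity.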

Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.


[7] MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation cs.CL | cs.AI | cs.CV | cs.LG | cs.MM

Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

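TL;DR: This paper proposes MODE-RAG, a multi-agent system driven by Variational Free Energy (VFE) and internal attention states that dynamically gates interventions against hallucination in multimodal retrieval-augmented generation. High-risk queries are routed to five stage-specific agents, which integrate Monte Carlo Tree Search (MCTS) for causal derivation and logit perturbations to penalize sycophancy, while Correction and Overseer agents ensure formatting stability and post-hoc factual verification.

Motivation: Multimodal RAG systems are susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Existing mitigation pipelines face an intervention paradox: static rules unnecessarily disrupt accurate generations, whereas leaving multimodal reasoning completely unguided lets mismatches cascade into severe logical fabrications.

Result: Extensive experiments on ModeVent, a challenging subset derived from the MultiVent dataset, indicate that the system reduces hallucination rates and logical fabrication and significantly improves the robustness of M-RAG systems.

Insight: The novelty lies in using VFE as the gating mechanism for dynamic intervention, building a collaborative multi-agent framework, strengthening causal reasoning with MCTS, and applying logit perturbations specifically to penalize sycophancy, which together give fine-grained control over the multimodal generation process.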

Abstract: While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.


[8] Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing cs.CL | cs.AI

Kexin Chen, Yi Liu, Haonan Zhang, Yanhui Li, Xinyu Deng

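TL;DR: This paper presents STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads the hidden states of the target LLM, then answers natural-language queries or emits structured reports that expose deceptive behavior in the model's reasoning. Evaluated on two target reasoning LLMs across seven deception datasets, it outperforms existing monitors on detection and additionally returns inspectable evidence.

Motivation: As LLM reasoning grows stronger, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score the visible transcript or derive scalar probe scores from representation vectors, and provide little inspectable evidence for why a response is suspicious.

Result: On two target reasoning LLMs across seven deception datasets, STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline. Combined with existing monitors in simple threshold ensembles, it reduces missed deceptive examples.

Insight: The core contribution is an interpretable activation-explainer framework that goes beyond a scalar detection score: it returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. The authors see it as a potential building block for broader interpretability and alignment tools.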

Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model’s hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
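
The abstract mentions "simple threshold ensembles" without specifying the rule. One common form, assumed here, is an OR rule that flags an example when any monitor crosses its threshold. The monitor names, scores, and thresholds below are made up for illustration and are not results from the paper.

```python
def threshold_ensemble(scores_by_monitor, thresholds):
    """Flag an example when any monitor's score meets its threshold (an OR rule)."""
    n = len(next(iter(scores_by_monitor.values())))
    return [
        any(scores_by_monitor[m][i] >= thresholds[m] for m in scores_by_monitor)
        for i in range(n)
    ]


def missed(flags, labels):
    """Count deceptive examples (label 1) that were not flagged."""
    return sum(1 for f, y in zip(flags, labels) if y == 1 and not f)


labels = [1, 1, 1, 0]
scores = {
    "text_monitor": [0.9, 0.2, 0.3, 0.1],  # made-up scores
    "explainer": [0.4, 0.8, 0.3, 0.2],     # made-up scores
}
thresholds = {"text_monitor": 0.5, "explainer": 0.5}
alone = threshold_ensemble({"text_monitor": scores["text_monitor"]}, thresholds)
combined = threshold_ensemble(scores, thresholds)
```

In this toy data the text monitor alone misses two deceptive examples and the ensemble misses one. An OR rule can only lower the miss count, at the cost of possibly raising false positives.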


[9] AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows cs.CL | cs.AI

Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He

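TL;DR: This paper presents AIPatient Arena, an evaluation framework grounded in electronic health records (EHRs) for assessing the clinical competence of large language models (LLMs) in end-to-end consultation workflows through multi-turn physician-patient interaction. The framework turns EHR data into patient-specific knowledge graphs and evaluates LLMs on eight dimensions, using a primary cohort of 437 patients and two external validation cohorts.

Motivation: Most existing medical evaluations are static, single-turn, or narrowly outcome-based, and cannot reflect the sequential, uncertain, and interactive nature of real-world consultation. An evaluation that follows the actual workflow is needed to measure the clinical utility of LLMs.

Result: Evaluated on the primary cohort of 437 patients and two external validation cohorts of 119 and 67 patients. LLMs perform well on medical interview questioning skills (mean scores 4.43-4.99/5), ethical and professional conduct, and clarity of clinical explanations, but show persistent weaknesses in handling ambiguous patient responses, information coverage, and diagnostic accuracy and reasoning (mean scores ranging from 2.08 to 3.55/5 across these dimensions). Process-based evaluation reveals interaction failures such as repetitive questioning and omission of past medical history.

Insight: The contribution is an EHR-grounded, multi-turn framework for evaluating end-to-end clinical workflows that emphasizes process evaluation rather than final-answer accuracy alone. The study exposes clear LLM shortfalls at key steps of clinical reasoning, such as handling uncertainty and integrating information, and offers a method and concrete dimensions for pre-deployment evaluation of medical LLMs.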

Abstract: Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observe that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.


[10] Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement cs.CL

Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha

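TL;DR: This paper introduces "second-order bias": the social bias an LLM exhibits when it evaluates biased content. Drawing on entitlement epistemology, the authors design a logical reasoning task in which models judge to whom a biased text is acceptable or non-acceptable, and develop two simple metrics for how biased the models are when inferring demographics for acceptability. The task evades safety guardrails and surfaces bias in model judgment, and the bias varies systematically across target groups, reflecting implicit social maps.

Motivation: Existing evaluations of social bias in LLMs focus mainly on whether models generate or imply biased content. As models are increasingly used as judges of bias, they may exhibit social bias in subtler ways when evaluating biased content, which current methods do not capture systematically.

Result: Evaluating open and closed models, the authors find that their task evades safety guardrails and surfaces bias in model judgment. The bias varies systematically across target groups, reflects implicit social maps, and shows that models are still triggered by demographic labels.

Insight: The contribution is the new concept of second-order bias and a novel logical reasoning task, grounded in entitlement epistemology, for evaluating it. This opens a new direction for LLM bias evaluation, stresses the importance of evaluating bias in judgment tasks, and argues for more theoretically grounded approaches to bias evaluation in NLP.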

Abstract: Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM’s judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent’s rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. It varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.


[11] OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation cs.CL

Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou

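TL;DR: This paper presents OPD-Evolver, a slow-fast co-evolution framework that cultivates an agent evolver through on-policy self-distillation. A fast loop interacts with a four-level memory hierarchy for rapid test-time evolution, and a slow loop uses outcome-calibrated memory attribution and privileged hindsight to distill four abilities into the deployable policy.

Motivation: Existing memory-based agents often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. The paper aims to close this gap and cultivate a genuine agent evolver.

Result: Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5% and training-based methods such as Skill0 by about 5.8%. Further analysis shows that OPD-Evolver-9B can challenge much larger models such as Qwen3.5-397B-A17B and Step-3.5-Flash.

Insight: The core contribution is the slow-fast co-evolution framework and the four-level memory hierarchy. On-policy self-distillation internalizes four key abilities into the policy (selecting experience, using it, writing knowledge, and maintaining the repository), which goes beyond simple memory augmentation toward genuine agent evolution.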

Abstract: Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.


[12] Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings cs.CL

Ryo Fukuda, Takatomo Kano, Siddhant Arora, Marc Delcroix, Naohiro Tawara

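TL;DR: This paper studies the turn-taking abilities of large language models (LLMs) in multimodal multi-party conversation, building an evaluation framework with three tasks: addressee detection, turn-change prediction, and next speaker prediction. Comparing supervised models, text-based LLMs, a multimodal LLM, and human subjects, it finds that LLMs beat both supervised models and humans on next speaker prediction, while the multimodal LLM beats text-based LLMs on addressee detection and turn-change prediction but still trails humans.

Motivation: Assess how well LLMs handle turn-taking tasks in multimodal multi-party conversation, given that they are not trained on the target domain and that text-based LLMs have no access to audio or visual information.

Result: On the AMI corpus, LLMs outperform supervised models and humans on next speaker prediction; the multimodal LLM outperforms text-based LLMs on addressee detection and turn-change prediction but remains below human performance.

Insight: The contribution is an evaluation framework for turn-taking in multi-party conversation, together with the finding that conversational context is critical for these predictions, especially next speaker prediction. Human and LLM prediction patterns are similar, and intervals with frequent turn changes are hard for both.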

Abstract: We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.


[13] From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning cs.CL

Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo

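TL;DR: This paper proposes the LLM-as-Environment-Engineer framework, in which the current policy model analyzes failure trajectories and environment context and automatically designs the environment configuration for the next stage of reinforcement learning (RL), replacing manual environment redesign. It also introduces MAPF-FrozenLake, a controllable testbed, on which the framework achieves the strongest aggregate performance.

Motivation: RL training pipelines rely on manually redesigned environments between stages, with practitioners heuristically inferring the best configuration. The goal is to automate the design of training environment configurations.

Result: On the MAPF-FrozenLake testbed, the framework with Qwen3-4B as backbone achieves the strongest aggregate performance, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines.

Insight: The core idea is to use the LLM as an environment engineer that generates configurations from structured context: summaries of policy behavior, failure cases, and environment statistics. Notably, the current RL checkpoint is a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its own weaknesses.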

Abstract: Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model’s ability to diagnose its remaining weaknesses.


[14] SuCo: Sufficiency-guided Continuous Adaptive Reasoning cs.CL | cs.AI

Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu

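TL;DR: This paper proposes SuCo. It defines the Minimal Sufficient CoT (MSC) as the shortest reasoning prefix adequate for producing the correct answer, and designs a two-stage training method (MFT and SAPO) that lets large reasoning models adaptively control reasoning length, improving accuracy while substantially reducing computation.

Motivation: Large reasoning models still generate overly long chains of thought for simple queries, which wastes computation, and there is no principled criterion for when reasoning is sufficient.

Result: Extensive experiments on mathematics, code, and science benchmarks show that SuCo consistently improves both accuracy and reasoning efficiency.

Insight: The contribution is the MSC concept as a guiding principle for continuous adaptive reasoning, plus a two-stage training framework: fine-tuning with problem-adaptive sufficiency thresholds, followed by reinforcement-learning-based Sufficiency-Aware Policy Optimization that penalizes both over-thinking and under-thinking.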

Abstract: Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.
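
The MSC definition (shortest prefix of a CoT trajectory adequate for the correct answer) can be sketched as a linear scan over prefixes. This simplifies the paper's setup: it assumes a deterministic answerer, whereas the abstract describes problem-adaptive sufficiency thresholds. The function names and the toy answerer are hypothetical.

```python
def minimal_sufficient_prefix(steps, answer_fn, gold):
    """Return the shortest prefix of `steps` from which `answer_fn` yields `gold`, or None."""
    for k in range(len(steps) + 1):
        if answer_fn(steps[:k]) == gold:
            return steps[:k]
    return None


def toy_answer(prefix):
    # Hypothetical answerer: reports the last number that ends a step, if any.
    numbers = [s.split()[-1] for s in prefix if s.split()[-1].isdigit()]
    return numbers[-1] if numbers else None


cot = ["12 * 4 = 48", "48 + 2 = 50", "double-check: 50", "so the answer is 50"]
msc = minimal_sufficient_prefix(cot, toy_answer, "50")
```

Here the first two steps already suffice, so the remaining two are the over-thinking that this kind of training targets.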


[15] The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports cs.CL | cs.CV

Samar Ansari

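TL;DR: This paper measures the information degradation that occurs when AI-assisted clinical documentation tools (such as large language models) rewrite radiology reports. Across three realistic tasks (EHR summarization, standardized rewriting, and teaching case preparation), it measures entity erosion, loss of uncertainty language, and cross-modal alignment degradation. The central finding is a dissociation between information loss and cross-modal fidelity: EHR summarization is the most destructive at the content level yet preserves image-text alignment almost entirely, whereas the tasks meant to produce cleaner training data do the reverse, preserving more entities but causing much larger alignment drops.

Motivation: AI-assisted clinical documentation tools are increasingly common, and using LLMs to summarize, standardize, and reformat radiology reports may degrade information, but the effect has not been measured systematically. The paper quantifies this degradation, focusing on the loss of clinical entities, uncertainty language, and cross-modal alignment.

Result: On 450 chest X-ray reports from the Indiana University dataset, EHR summarization erodes 51.4% of clinical entities and 43.7% of hedging language, yet image-text alignment drops by only 2.5%. Standardized rewriting and teaching case preparation erode 26.8% and 29.3% of entities but cause alignment drops of 14.9-16.5%, six to seven times those of EHR summarization. Rare and common pathologies show no significant difference in degradation.

Insight: The paper names the "slop paradox": rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image content. Degradation is determined mainly by the type of AI rewriting task rather than the clinical content, which has implications for multimodal medical AI dataset construction and for the governance of AI-assisted clinical documentation.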

Abstract: AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
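
The abstract does not give formulas for its metrics, so the sketch below assumes the simplest set-based and ratio-based definitions; the paper's actual NER and BiomedCLIP pipeline is not reproduced. The example entities are invented, and only the last line uses figures reported above.

```python
def erosion_rate(original_items, rewritten_items):
    """Share of distinct original items that no longer appear after rewriting."""
    original = set(original_items)
    if not original:
        return 0.0
    return len(original - set(rewritten_items)) / len(original)


def relative_drop(before, after):
    """Relative decrease of a similarity score, e.g. an image-text similarity."""
    return (before - after) / before


entities_before = ["cardiomegaly", "effusion", "atelectasis", "pneumothorax"]
entities_after = ["cardiomegaly", "effusion"]
rate = erosion_rate(entities_before, entities_after)  # 2 of 4 invented entities lost
ratios = (14.9 / 2.5, 16.5 / 2.5)  # reported alignment drops relative to EHR summarization
```

The two ratios come out at roughly 6.0 and 6.6, which matches the "six to seven times" wording in the abstract.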


[16] Environment-Grounded Automated Prompt Optimization for LLM Game Agents cs.CL

Rean Clive Fernandes, Lukas Fehring, Theresa Eimer, Marius Lindauer, Matthias Feurer

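TL;DR: This paper presents an automated prompt optimization framework for LLM game agents. It decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM-driven evolutionary loop to improve agent performance in interactive environments.

Motivation: LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains manual and task-specific. The goal is automated prompt optimization that needs neither model weight updates nor extensive human supervision.

Result: On all five BabyAI tasks in the BALROG benchmark, the framework consistently improves performance under both plain and guided prompt initializations. On PutNext, a multi-step coordination task, optimized prompts raise the success rate from 0% for RobustCoTAgent to as much as 72.5%.

Insight: The contribution is to split the agent into descriptor and action selection modules and to add a behavior analyzer and a mutator that produce targeted prompt revisions. Combining a multi-agent framework with automated evolutionary optimization gives a scalable route to prompt engineering without fine-tuning.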

Abstract: LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module’s prompt through an LLM-driven evolutionary loop guided by environment returns. We propose a behavior analyzer to attribute episode outcomes to specific prompt components, and a mutator to propose targeted revisions to the prompt, before validating them through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG’s RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions, without requiring updates to the model weights. On PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches up to 72.5% success rate using the same underlying LLM with optimized prompts. These results suggest that a multi-agent framework, combined with automatic prompt optimization, enhances LLMs without the need for fine-tuning or extensive human supervision.
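
The evolutionary loop can be sketched as a greedy hill-climb over prompts. This is a simplification under stated assumptions: the paper's behavior analyzer is omitted, the mutator and evaluator below are toy stand-ins for LLM calls and environment rollouts, and all names are hypothetical.

```python
import random


def optimize_prompt(seed_prompt, mutate, evaluate, generations=5, children=4, rng=None):
    """Keep the best prompt seen so far; each generation proposes mutated children."""
    rng = rng or random.Random(0)
    best, best_score = seed_prompt, evaluate(seed_prompt)
    for _ in range(generations):
        for _ in range(children):
            candidate = mutate(best, rng)
            score = evaluate(candidate)  # stands in for environment rollouts
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


HINTS = ["Describe objects near the goal.", "Pick up before placing.", "Avoid repeating actions."]


def toy_mutate(prompt, rng):
    # Hypothetical mutator: append one hint (in the paper an LLM proposes revisions).
    return prompt + " " + rng.choice(HINTS)


def toy_evaluate(prompt):
    # Hypothetical return signal: number of distinct hints present in the prompt.
    return sum(1 for h in HINTS if h in prompt)


best, score = optimize_prompt("You are an agent.", toy_mutate, toy_evaluate)
```

Only the prompt text changes between iterations, which mirrors the abstract's point that the gains come without updating model weights.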


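[17] GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? cs.CL [PDF]

Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang

TL;DR: The paper introduces GameCraft-Bench, a benchmark for evaluating agents' ability to generate playable games end-to-end in a real game engine (Godot). It comprises 140 tasks across 15 game families and uses an evaluation framework based on interaction replay and multimodal judging. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the best agent scores only 41.46%.

Details

Motivation: The paper addresses the emerging task of having agents generate complete, playable games end-to-end in a game engine from natural-language descriptions. Traditional coding-task evaluation is insufficient for game generation, because a game must integrate scripts, scenes, assets, rendering, and runtime interaction within the engine to produce coherent gameplay.

Result: Frontier coding agents were evaluated on GameCraft-Bench. The task proves highly challenging: the strongest agent reaches only 41.46%, and most agents score below 40%. Analysis shows that agents often implement recognizable game mechanics but struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation.

Insight: The novelty lies in formalizing end-to-end game generation as an evaluation problem with three requirements (Engine Grounding, Artifact Completeness, and Interactive Verification) and in proposing a multimodal judging framework based on interaction replay and rubric guidance. This offers a new, more comprehensive benchmark and methodological perspective for evaluating complex multimodal generative AI systems.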

Abstract: Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.


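[18] Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models cs.CL [PDF]

Zihao Wei, Wenjie Shi, Liang Pang, Jingcheng Deng, Shicheng Xu

TL;DR: This paper studies "overthinking" in RL-trained reasoning models, where a model keeps generating unnecessary reasoning steps after reaching the correct answer. The authors propose Dynamic Rollout Editing (DRE), a training-time intervention that suppresses overthinking by editing the portion of successful trajectories that follows answer emergence, thereby improving model performance.

Details

Motivation: The work targets the overthinking that is common in long chain-of-thought reasoning models. It treats the phenomenon as a training-time credit-assignment problem rather than merely a decoding-time stopping problem, and aims to prevent the feedback loop that sequence-level credit assignment amplifies during RL training.

Result: Experiments across multiple tasks demonstrate the effectiveness of DRE, indicating that it reduces overthinking and thereby improves reasoning performance on complex tasks.

Insight: The novelty lies in reframing overthinking as a training-time credit-assignment challenge and proposing DRE, a training-time intervention that edits successful trajectories to weaken the preference signal for unnecessary reasoning without penalizing the reasoning needed to reach the answer. This offers a new optimization approach for RL-trained reasoning models.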

Abstract: Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
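The credit-assignment point can be illustrated with a minimal sketch. It assumes the commonly used group-normalized GRPO advantage (reward minus group mean, divided by group standard deviation); the function names, rewards, and lengths are illustrative and not taken from the paper. Because each trajectory receives one scalar, every token in a successful rollout, including any continuation after the answer, gets the same positive signal.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: one scalar per trajectory (assumed GRPO form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def per_token_advantages(rewards, lengths):
    """Broadcast each trajectory's scalar advantage to all of its tokens."""
    return [[a] * n for a, n in zip(group_advantages(rewards), lengths)]

# Toy group: two successful rollouts (reward 1) and two failed ones (reward 0).
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [5, 9, 4, 6]  # the second success is longer, e.g. it kept "thinking"
token_adv = per_token_advantages(rewards, lengths)
```

In this toy group the 9-token success carries the same positive advantage on all nine tokens as the 5-token success does on its five, which is the imbalance the abstract describes.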


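[19] ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions cs.CL [PDF]

Peixian Zhou, Yuxu Chen, Chaorui Zhang, Wei Han, Bo Bai

TL;DR: This paper introduces ChLogic, an English–Chinese aligned benchmark for evaluating the cross-lingual robustness of LLM logical reasoning. It contains three data sets: the General aligned set, the Difficult aligned set, and the Chinese-only set. Each aligned item pairs one English expression with five Chinese realizations. Experiments show a persistent gap between Chinese and English logical reasoning performance, and the effect of methods such as back-translation varies by model and data set.

Details

Motivation: Although LLMs perform increasingly well on standard logical reasoning benchmarks, it is unclear whether this ability is equally robust beyond English. The paper builds an English–Chinese aligned benchmark to test whether models maintain consistent logical reasoning performance across linguistic realizations.

Result: Experiments on Qwen3, Ministral, and GLM models reveal a persistent English–Chinese performance gap. On the General aligned set, back-translating standard Chinese into English usually improves performance, but effects are mixed on the Difficult aligned set, where, for example, Qwen3-32B and GLM-5.1 perform worse after back-translation.

Insight: The novelty lies in building an English–Chinese aligned benchmark dedicated to evaluating the robustness of multilingual logical reasoning, and in showing that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. This provides a new stress test for understanding models' cross-lingual capabilities.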

Abstract: Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English–Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English–Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.


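[20] Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue cs.CL [PDF]

Olivier Tieleman, Ziyi Zhu, Ting Su, Samuel J. Bell, Thomas D. Hull

TL;DR: The paper proposes fine-tuning an LLM (Qwen3.5-27B) to predict depression severity (PHQ-9 total score) passively, directly from the text of users' conversations with an AI mental health application, without requiring users to fill in questionnaires.

Details

Motivation: Depression is a leading cause of disability worldwide, but traditional self-report questionnaires such as PHQ-9 have low completion rates, which causes response bias and missing data. The paper aims to use routine AI conversation data for passive, continuous monitoring of depressive symptoms to close this gap.

Result: On a held-out test set of 842 users, the best model achieves MAE = 2.6, RMSE = 4.0, and Pearson r = 0.80, with AUC = 0.91 at the PHQ-9 >= 10 clinical threshold and AUC > 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24, indicating that the model captures depression severity across the full clinical spectrum.

Insight: The novelty lies in using AI conversation text as the sole input for passive depression assessment, together with a data augmentation strategy that combines pseudolabels generated by a reasoning model (Claude Opus) with iteratively trained intermediate models. This substantially expands the training data (from 3,111 ground-truth labels to data from 6,283 users) and thereby improves model performance. It offers a feasible path toward continuous symptom monitoring on AI mental health platforms without active user self-report.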

Abstract: Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, introducing response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We address this by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head and augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and by iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold. We also find AUC > 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24, demonstrating that the model captures depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.


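[21] VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination cs.CL [PDF]

Chunyu Liu, Zhengyang Fan, Kaisen Yang, Alex Lamb

TL;DR: This paper proposes VoidPadding to address the dual role of the [EOS] token in masked diffusion language models (MDLMs) as both semantic terminator and padding token. It introduces a dedicated padding token, [VOID], to take over padding so that [EOS] can focus on semantic termination, which mitigates [EOS] overflow under large-block decoding and improves generation efficiency.

Details

Motivation: Existing MDLMs inherit the autoregressive convention of padding with repeated [EOS] tokens during instruction tuning, so [EOS] serves as both semantic terminator and padding token. This causes [EOS] overflow under large-block decoding.

Result: On Dream-7B-Instruct, VoidPadding improves the four-task mean across mathematical reasoning and code generation benchmarks by 17.84 points over the original model and by 6.95 points over RainbowPadding, while reducing decoding NFE (number of function evaluations) by 55.7% on average.

Insight: The core innovation is role decoupling: a dedicated padding token, [VOID], separates padding from termination, so [EOS] learns a cleaner semantic termination signal while [VOID] learns to guide adaptive response canvas expansion. According to the summary, this design significantly improves generation quality and efficiency.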

Abstract: Masked diffusion language models (MDLMs) generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at https://github.com/Haru-LCY/VoidPadding.
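The difference between the two padding conventions can be sketched on a toy response canvas. This is an illustration only: the helper `pad_response`, the canvas length, and the choice to emit exactly one terminating [EOS] before the padding are assumptions, not details confirmed by the abstract.

```python
def pad_response(tokens, canvas_len, pad_token):
    """Terminate with one [EOS], then fill the rest of the canvas with pad_token."""
    body = tokens + ["[EOS]"]
    if len(body) > canvas_len:
        raise ValueError("response does not fit the canvas")
    return body + [pad_token] * (canvas_len - len(body))

response = ["The", "answer", "is", "4"]

# Inherited convention: [EOS] is both the terminator and the padding token.
eos_padded = pad_response(response, 8, "[EOS]")

# VoidPadding-style targets: [EOS] marks termination only, [VOID] fills the rest.
void_padded = pad_response(response, 8, "[VOID]")
```

Under the inherited convention the training target contains four [EOS] tokens, so the token's meaning is mixed; with a separate padding token it appears once, at the point of semantic termination.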


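[22] Learning from the Self-future: On-policy Self-distillation for dLLMs cs.CL [PDF]

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang

TL;DR: This paper proposes d-OPSD, the first on-policy self-distillation framework for diffusion LLMs (dLLMs). By using self-generated answers as suffix conditioning and applying step-level rather than token-level supervision, the method lets the model learn from "self future-experience", which suits the arbitrary-order generation of dLLMs. Experiments on four reasoning benchmarks show that d-OPSD is more sample-efficient than RLVR and SFT baselines, needing only about 10% of RLVR's optimization steps.

Details

Motivation: Existing on-policy self-distillation methods are inherently designed for autoregressive models. Their left-to-right prefix conditioning and token-level divergence supervision fundamentally conflict with the arbitrary-order generation of dLLMs, so an effective post-training method tailored to dLLMs is needed.

Result: Experiments on four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps that RLVR needs.

Insight: The core innovation is reframing self-teacher construction (using self-generated answers as suffix conditioning to simulate "self future-experience") and shifting supervision granularity from the token level to the step level. This aligns better with the iterative denoising process of dLLMs and opens a new path for dLLM post-training.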

Abstract: On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from “self future-experience” rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps needed by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.


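[23] Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients cs.CL [PDF]

Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan

TL;DR: This paper proposes Zone of Proximal Policy Optimization (ZPPO), a new method for the challenges that small student models face in knowledge distillation and reinforcement learning. It places teacher guidance in the prompt rather than in the gradient, constructing Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student learn hard questions within its own zone of proximal development, and uses a prompt replay buffer to recirculate those questions during training.

Details

Motivation: When transferring knowledge from a large teacher to a small student, traditional knowledge distillation forces the student to imitate teacher logits, which concentrates it on the teacher's sharpest modes and hurts generalization. Reinforcement learning, in turn, yields no useful gradient on questions where every rollout fails. A method is needed that uses teacher knowledge to guide small student models without violating the on-policy assumption.

Result: On the Qwen3.5 family (student scales 0.8B-9B, 27B teacher), post-trained as vision-language models and evaluated on a suite of 31 benchmarks (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with larger gains at smaller model scales.

Insight: The core innovation is moving teacher guidance from gradient space to prompt space, which avoids policy-gradient drift, and proposing two prompt reformulation techniques, BCQ and NCQ, that focus on the student's zone of proximal development. This offers a new paradigm for efficiently training small models: guiding learning through carefully designed prompts rather than direct gradient intervention.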

Abstract: Knowledge distillation transfers a teacher’s competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher’s sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student’s own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher’s response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky’s zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student’s wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student’s mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student’s current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
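The replay-buffer behavior described above (graduation at a mean rollout accuracy of one half, FIFO eviction under finite capacity) can be sketched as follows. The class `PromptReplayBuffer`, its method names, and the toy question ids are hypothetical; the abstract does not give the authors' implementation.

```python
from collections import OrderedDict

class PromptReplayBuffer:
    """Hypothetical finite-capacity buffer of hard questions, oldest first."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items = OrderedDict()  # question id -> reformulated prompt

    def add(self, qid, prompt):
        """Insert a hard question; return the id evicted to make room, if any."""
        if qid in self._items:
            return None
        evicted = None
        if len(self._items) >= self.capacity:
            evicted, _ = self._items.popitem(last=False)  # FIFO eviction
        self._items[qid] = prompt
        return evicted

    def report(self, qid, rollout_correct):
        """Graduate (remove) qid once mean rollout accuracy reaches the threshold."""
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if qid in self._items and accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def pending(self):
        return list(self._items)

buf = PromptReplayBuffer(capacity=2)
buf.add("q1", "BCQ prompt for q1")
buf.add("q2", "NCQ prompt for q2")
evicted = buf.add("q3", "BCQ prompt for q3")             # buffer full: q1 leaves
graduated = buf.report("q2", [True, False, True, False])  # accuracy 0.5
kept = buf.report("q3", [False, False, False, True])      # accuracy 0.25
```

After these calls only `q3` remains in the buffer: `q1` was evicted for capacity and `q2` graduated.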


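[24] ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues cs.CL | cs.AI | cs.LG [PDF]

Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah

TL;DR: ReproRepo is a scalable framework for evaluating LLM agents on reproducibility audits of machine learning paper code. It uses real, human-raised issues in GitHub repositories as supervision for identifying reproduction blockers. The study instantiates the framework on 1,149 recent machine learning conference papers and evaluates four frontier model-agent configurations, finding that the best agent (Codex with GPT-5.5) identifies at least one semantically related human-reported blocker for about 90% of papers.

Details

Motivation: Existing benchmarks for evaluating whether LLM agents can assist with research reproducibility are hard to scale because they rely on substantial manual effort for data curation and evaluation. This work addresses the scalability problem by building a reproducibility evaluation framework that uses GitHub issues as naturally occurring supervision.

Result: Evaluation on 1,149 machine learning papers shows that the best LLM agent (Codex with GPT-5.5), even without executing code, identifies at least one semantically related human-reported blocker for about 90% of papers. Agents are particularly effective at surfacing visible failures and identifying the right semantic region, but may still fall short in exact localization.

Insight: The novelty lies in ReproRepo, a scalable framework that uses GitHub issues as natural supervision for evaluating automated reproducibility audits. Viewed objectively, the approach turns real-world human feedback (issues) into an evaluation signal, lowers manual annotation cost, and provides a reusable, scalable benchmark for future evaluation of LLM agents on reproducibility tasks.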

Abstract: Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.


cs.CV [Back]

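[25] Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation cs.CV | cs.CL [PDF]

Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

TL;DR: The paper argues that current multilingual evaluation of vision-language models (VLMs) is flawed because it overlooks users of multi-script languages, and introduces the PuMVR benchmark to evaluate models on Punjabi's three scripts (Gurmukhi, Shahmukhi, and Roman). It finds a substantial "Script Gap" in existing VLMs: performance differences across scripts reach 16%, and visual input does not eliminate the gap, indicating that the models are not truly multi-script.

Details

Motivation: Current multilingual VLM evaluation assumes a one-to-one mapping between language and script, overlooking billions of users of multi-script languages. This can make evaluation incomplete and inequitable.

Result: Evaluating 10 SOTA VLMs on PuMVR shows accuracy differences across scripts of up to 16%. Visual input raises absolute performance but does not narrow the script gap. The proposed Script Consistency Rate (SCR) falls as low as 24.8% on the benchmark.

Insight: The novelty lies in exposing a systematic script gap in VLMs and proposing the PuMVR benchmark and the SCR metric, stressing that multilingual evaluation must account for script consistency to ensure equitable AI. Viewed objectively, this adds a new dimension to VLM evaluation and helps advance more comprehensive multilingual models.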

Abstract: Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi’s three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current “multilingual” VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.


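[26] GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence cs.CV | cs.MA [PDF]

Maram Hasan, Aman Verma, Savitra Roy, Hariseetharam Gunduboina, Daksh Jain

TL;DR: This paper presents GeoDisaster, a new benchmark for evaluating operational geo-intelligence agents, with 2,921 instances across five disaster-related task families. It also proposes an orchestrated multi-agent framework aligned via Role-Contract Expectation Alignment (RCEA) to improve tool use and decision generation.

Details

Motivation: Existing remote-sensing vision-language models (RS-VLMs) have advanced in visual interpretation and instruction following, but they lack operational geo-intelligence, which requires tool-grounded spatial reasoning and decisions backed by structured evidence.

Result: Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while the proposed RCEA improves tool use, evidence grounding, state consistency, and decision generation.

Insight: The novelty lies in building a benchmark whose ground truth comes from deterministic geospatial workflows and consistency checks, which avoids language-model annotation, and in proposing an agent alignment method (RCEA) that combines supervised fine-tuning with contract-grounded reinforcement learning to coordinate role-specialized agents.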

Abstract: Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers), spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.


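[27] Landsat-Sentinel-2 Algal Bloom Mapping Using Vision Transformers: Model Description, Implementation, and Examples cs.CV [PDF]

Thainara Lima, Vitor Martins

TL;DR: This study presents the first successful implementation of Vision Transformer-based coastal algal bloom mapping with Landsat-Sentinel-2 imagery. It builds a globally distributed bloom patch dataset and compares four transformer architectures against a convolutional baseline for fine-scale bloom detection.

Details

Motivation: Coastal algal bloom monitoring with traditional methods is challenged by limited spectral coverage, the lack of harmonized reflectance products, and interference from clouds and glint. Deep learning offers a data-driven alternative.

Result: All deep learning models show strong capability in detecting floating bloom areas, with omission and commission errors of 8-65%. Under cloud and glint interference in a time series, the Swin Transformer outperforms traditional spectral-index approaches and effectively avoids affected pixels. Comparison with MODIS products highlights the benefit of higher spatial resolution for detecting fragmented blooms.

Insight: The novelty lies in the first application of Vision Transformers to medium-resolution Landsat-Sentinel-2 algal bloom mapping, with a global dataset used to validate reliability in dynamic coastal environments and robustness to cloud and glint interference. This provides a new deep learning tool for bloom monitoring.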

Abstract: Coastal algal bloom monitoring requires frequent, spatially detailed, and globally consistent observations, provided by Landsat-8/9 and Sentinel-2 A/B/C. Together, these missions offer over a decade of medium-resolution multispectral imagery with near-global coverage every 2-3 days, enabling the detection of fragmented bloom structures not resolvable by coarse ocean-color sensors. However, their use in aquatic environments remains challenging due to limited spectral coverage and a lack of harmonized reflectance products. As an alternative to traditional bio-optical methods, deep learning-based image classification offers a data-driven approach that can overcome many of these limitations. This study presents the first successful implementation of vision transformer-based coastal algal bloom mapping using 30-m Landsat-Sentinel-2 images. A globally distributed bloom patch dataset was generated across bloom-prone coastal hotspots worldwide. Four transformer-based architectures were compared against a standard convolutional baseline for fine-scale bloom detection, and assessed under different optical water types and atmospheric and surface conditions. All deep learning models showed strong capabilities in detecting floating bloom areas, with omission and commission errors of 8-65%. Under cloud and glint stress in a time series, the Swin Transformer outperformed traditional spectral-index approaches, which produced widespread false positives, effectively avoiding cloud- and glint-affected pixels. Comparisons with MODIS-derived products further highlighted the benefits of higher spatial resolution in detecting fragmented and irregularly affected blooms. Our findings support deep learning as a reliable tool for medium-resolution, consistent monitoring of floating algal blooms in dynamic coastal environments.


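[28] Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception cs.CV | cs.RO | eess.SY [PDF]

Aditya Mishra, Haroon Lone

TL;DR: This paper presents Edge-TSR, a deployment-oriented continuous edge inference system for sustained fine-grained roadside perception on the NVIDIA Jetson Orin Nano. The system integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism. It targets deployment effects that conventional benchmarks do not reflect, such as temporal instability in video streams, thermal throttling, and workload-dependent performance variability.

Details

Motivation: Conventional benchmarks are limited for evaluating continuous AI inference on resource-constrained edge hardware. They miss real deployment effects such as temporal instability, thermal throttling, and performance fluctuation, and therefore overstate deployed performance.

Result: When moving from static-image evaluation to real-world streaming deployment, three SOTA baselines show 20-30% relative degradation. Through temporal inference stabilization, Edge-TSR recovers up to 10.16% classification accuracy over per-frame inference baselines. In a 55-minute, 26 km vehicular deployment it sustains real-time performance at 16.18 FPS on a single embedded device within safe thermal limits.

Insight: The core innovation is stressing the need for deployment-aware evaluation and proposing a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency. Viewed objectively, integrating temporal stabilization with the full detection-tracking-classification pipeline and jointly evaluating quality, latency, throughput, and thermal behavior under real, sustained deployment provides a useful methodological and system-design reference for deploying edge AI systems.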

Abstract: Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.
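One simple way to stabilize per-frame labels along a track is a sliding-window majority vote, sketched below. This is a generic heuristic swapped in for illustration: the abstract does not specify how Edge-TSR's track-aware stabilization works, and the class name, window size, and labels are assumptions.

```python
from collections import Counter, defaultdict, deque

class TrackLabelSmoother:
    """Majority vote over the last `window` per-frame labels of each track."""

    def __init__(self, window=5):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, track_id, label):
        """Record this frame's label and return the track's stabilized label."""
        labels = self.history[track_id]
        labels.append(label)
        return Counter(labels).most_common(1)[0][0]

smoother = TrackLabelSmoother(window=5)
per_frame = ["stop", "stop", "yield", "stop", "stop"]  # one flickering frame
smoothed = [smoother.update(track_id=7, label=lab) for lab in per_frame]
```

The single misclassified frame is outvoted by its neighbors in the same track, so the stabilized output stays on one label; the cost is a counter update per detection.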


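[29] Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering cs.CV | cs.AI [PDF]

Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

TL;DR: This paper proposes REINS (REpresentation-space INference-time Safety steering), a training-free method that achieves safety alignment at inference time by steering the internal representations of video diffusion models. It finds that safety-relevant information is linearly encoded in the hidden-state activations of video diffusion transformers, and that a single direction found via Supervised PCA on binary safety labels separates safe from unsafe generation trajectories. At inference, adding this direction to the hidden states at an intermediate transformer layer steers generation from harmful content toward semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead.

Details

Motivation: Open-weight video diffusion models can generate photorealistic unsafe content such as violence and misinformation. Existing defenses either require expensive safety fine-tuning, which degrades general capability, or use external filters, which adversarial prompts easily bypass.

Result: REINS is evaluated across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation, which the authors describe as the broadest safety evaluation suite in the video generation literature.

Insight: The novelty lies in finding that safety structure is linearly encoded in hidden states and in extracting a single safety direction via Supervised PCA for inference-time steering. The analysis shows that safety information accumulates monotonically with transformer depth, whereas steering effectiveness peaks at intermediate layers (about 50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity.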

Abstract: Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.
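The steering step itself (adding a fixed direction to a hidden state) can be sketched on toy vectors. Note the substitution: the direction below is a normalized difference of class means, a simple stand-in for the Supervised PCA the abstract names, and the function names, 2-dimensional states, and steering strength are all illustrative assumptions.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def safety_direction(safe_states, unsafe_states):
    """Unit vector from the unsafe mean to the safe mean (stand-in, not Supervised PCA)."""
    diff = [s - u for s, u in zip(mean_vec(safe_states), mean_vec(unsafe_states))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def steer(hidden, direction, strength):
    """Inference-time steering: shift a hidden state along the direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]

# Toy hidden states labeled safe / unsafe; only the first coordinate separates them.
safe = [[1.0, 0.0], [3.0, 0.0]]
unsafe = [[-1.0, 0.0], [-3.0, 0.0]]
v = safety_direction(safe, unsafe)
steered = steer([-2.0, 0.5], v, strength=4.0)
```

Here the steered state moves from the unsafe side to the safe side along the first coordinate while the second coordinate is untouched, with no change to any model weights.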


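[30] Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA cs.CV [PDF]

Yiqing Shen, Han Zhang, Mathias Unberath

TL;DR: This paper proposes a reinforcement learning framework for training large language models (LLMs) to reason in surgical video question answering. The method decouples perception from reasoning through digital twin representations and introduces hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. The authors also design a new reward that combines format validation with clinical plausibility evaluation. To validate the method, they introduce a new benchmark, REAL-Colon-Reason, and report state-of-the-art performance on several surgical VideoQA benchmarks.

Details

Motivation: Existing methods compress videos into discrete token representations and couple visual perception with reasoning, which fragments continuous spatial-temporal relationships and limits multi-step reasoning. A new method is therefore needed that decouples perception from reasoning and handles the complex spatial, temporal, and semantic relationships in surgical video.

Result: The method achieves state-of-the-art (SOTA) performance on the authors' new colonoscopic benchmark REAL-Colon-Reason (2000 question-answer pairs) and on two existing surgical VideoQA benchmarks, REAL-Colon-VQA and EndoVis18-VQA.

Insight: The main innovations are: 1) using reinforcement learning to train LLMs to reason over digital twin representations built from surgical foundation models, which decouples perception from reasoning; 2) hierarchical representations (frame, temporal window, procedure) with probabilistic uncertainty estimates; 3) a new reward for RL training that combines format validation, clinical plausibility evaluation, and uncertainty-aware calibration.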

Abstract: Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and two existing surgical VideoQA benchmarks REAL-Colon-VQA and EndoVis18-VQA.


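[31] Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration cs.CV [PDF]

Xiwen Wei, Mark Nutter, Madhusudhanan Srinivasan, Radu Marculescu

TL;DR: To address modality imbalance in instruction tuning of unified multimodal models, this paper proposes Pareto LoRA, a gradient integration strategy. It reformulates multimodal instruction tuning as a bi-objective optimization problem and balances the text and image objectives by modulating gradient direction and strength, which mitigates the loss of image generation quality caused by language gradients dominating optimization.

Details

Motivation: Unified multimodal models often show pronounced modality imbalance during multimodal instruction tuning: language gradients dominate optimization and severely degrade image generation quality, especially under parameter-efficient fine-tuning such as LoRA. The paper systematically analyzes and addresses this imbalance in LoRA fine-tuning.

Result: Experiments on the CoMM benchmark with Emu2 show that Pareto LoRA consistently improves multimodal generation balance, with gains of up to 44.9% in perceptual image quality over vanilla LoRA while maintaining comparable text performance.

Insight: The novelty lies in formalizing multimodal instruction tuning as a bi-objective optimization problem and proposing a Pareto-optimal gradient integration strategy to balance the modality objectives. Viewed objectively, the method offers a novel and effective optimization perspective on the modality imbalance common in multimodal learning, particularly for parameter-efficient fine-tuning.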

Abstract: Unified multimodal models (UMMs) have recently emerged as a promising paradigm for integrating multimodal understanding and generation within a single autoregressive transformer. However, during multimodal instruction tuning, these models often exhibit pronounced modality imbalance: language gradients dominate optimization, thus leading to lower image generation quality, especially under parameter-efficient fine-tuning such as LoRA. In this work, we systematically analyze modality imbalance in LoRA-based fine-tuning of UMMs for interleaved text-image generation. We show that vision modality performance degrades substantially more than text modality performance when compared to unimodal counterparts, and that modality-specific gradients can differ by orders of magnitude across various tasks and layers. Motivated by this observation, we reformulate the multimodal instruction tuning as a bi-objective optimization problem and propose Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating the gradient direction and strength. Experiments on the CoMM benchmark with Emu2 demonstrate that Pareto LoRA consistently improves multimodal generation balance, achieving up to 44.9% gains in perceptual image quality over vanilla LoRA while maintaining comparable text performance.
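To make the bi-objective framing concrete, the sketch below applies the standard two-objective min-norm rule (as used in MGDA-style multi-task optimization) to a pair of toy gradients whose magnitudes differ by an order of magnitude. This is a well-known technique used here for illustration only; the abstract does not state that Pareto LoRA uses this exact update, and all names and numbers are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def min_norm_combination(g1, g2):
    """Min-norm point of the segment between g1 and g2 (two-objective closed form).

    Returns (alpha, d) with d = alpha * g1 + (1 - alpha) * g2 and alpha in [0, 1].
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:  # identical gradients: any alpha gives the same direction
        return 1.0, list(g1)
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = min(1.0, max(0.0, alpha))
    return alpha, [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]

g_text = [10.0, 0.0]   # large language gradient
g_image = [0.0, 1.0]   # much smaller image gradient
plain_sum = [a + b for a, b in zip(g_text, g_image)]
alpha, d = min_norm_combination(g_text, g_image)
```

The plain sum points almost entirely along the text gradient, whereas the min-norm combination down-weights it (alpha is about 0.01) and has equal positive inner product with both gradients, so a step along it improves both objectives to first order.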


[32] Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins cs.CV

Yiqing Shen, Hao Ding, Mathias Unberath

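TL;DR: This paper proposes OR3, a text-to-video retrieval method for operating-room scenes. It represents video clips as action-driven digital twins (ActDTs), uses a large language model to generate hypothetical ActDTs from queries for imagination-based retrieval, and then applies evidence-grounded refinement to improve the matches.

Details

Motivation: Addresses text-to-video retrieval of safety-critical events in the operating room. Such events often do not follow the common structure, so retrieval must handle implicit queries that require reasoning (e.g., "the step right before clipping"), which existing methods based on global embeddings cannot do.

Result: On a benchmark built from the MM-OR dataset of robotic knee procedures (276 implicit queries over 386 video clips), OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline.

Insight: The novelty lies in introducing action-driven digital twins (ActDTs) for fine-grained temporal action reasoning, together with LLM-based imagination retrieval (generating hypothetical ActDTs) and an evidence-grounded refinement mechanism, which enables discrimination between visually similar surgical video clips.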

Abstract: Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, text-to-video retrieval must handle implicit queries that require reasoning to identify the right video (e.g., the step right before clipping) in order to unlock its full potential. Existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
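For readers unfamiliar with the R@1 and R@5 figures, the sketch below shows how Recall@K is commonly computed for retrieval. It assumes exactly one ground-truth clip per query, which the abstract does not state; the function name and clip IDs are hypothetical.

```python
def recall_at_k(ranked_ids_per_query, target_ids, k):
    """Percentage of queries whose ground-truth clip appears in the top k.

    Assumption: one correct clip per query; the benchmark's exact
    protocol is not described in the abstract.
    """
    if not target_ids:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for ranked, target in zip(ranked_ids_per_query, target_ids)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


# Two toy queries, each with a ranked list of candidate clip IDs.
rankings = [["c3", "c1", "c2"], ["c2", "c3", "c1"]]
targets = ["c3", "c1"]
print(recall_at_k(rankings, targets, 1), recall_at_k(rankings, targets, 3))
```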


[33] SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues cs.CV

Suttisak Wizadwongsa, Hyelin Nam, Supasorn Suwajanakorn, Jeong Joon Park

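TL;DR: SierpinskiCam is a video retaking method that augments geometry-based guidance with Sierpinski dome texture cues, addressing the problem that newly revealed regions become sparse or missing in existing methods when the target camera trajectory departs from the source video. It also introduces a reference video conditioning mechanism that separates the source-video and target-video token sequences with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Experiments show significant gains in camera controllability, geometric consistency, and video quality.

Details

Motivation: Addresses video retaking, i.e., generating novel renderings along user-defined camera trajectories from a monocular video. The guidance in existing geometry-guided methods degrades when the target camera departs from the source trajectory, leaving newly revealed regions with insufficient information.

Result: Across diverse and challenging retaking scenarios, experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality.

Insight: Novel elements include using Sierpinski triangle pattern texture cues to strengthen geometric guidance under large viewpoint changes, and a reference video conditioning mechanism that separates token streams with negative RoPE indices for efficient appearance grounding. These improve the robustness and quality of video retaking without additional architectural changes.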

Abstract: Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.


[34] FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection cs.CV

Md Tawheedul Islam Bhuian, Kyoung-Don Kang

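TL;DR: This paper proposes FATE, a framework that addresses the incompatibility between the sparse, asynchronous event streams of event cameras and standard deep learning architectures in object detection. Its novel Pillar Encoding (PE) avoids conventional temporal sub-binning: it organizes events into spatial pillars and approximates their intra-window evolution with a continuous-time orthogonal polynomial basis, preserving fine temporal dynamics. It also introduces Frequency-Aware Training (FAT), a curriculum learning strategy that generates temporally dense pseudo-labels to bridge the gap between low-frequency supervision and high-frequency inference.

Details

Motivation: Event cameras have advantages in high-speed and high-dynamic-range scenes, but their sparse, asynchronous event streams are incompatible with standard deep learning models. Existing methods typically partition the accumulation window into fixed temporal sub-bins, which discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision.

Result: In extensive experiments, FATE generalizes well across architectural paradigms and consistently outperforms strong baselines. It supports robust object detection at temporal resolutions up to 200 Hz, with minimal overhead in parameter count and inference latency.

Insight: Novel elements include Pillar Encoding, which approximates event evolution with a continuous-time orthogonal polynomial basis and yields an L2-optimal representation that retains rich temporal dynamics, and Frequency-Aware Training, a soft mean-teacher curriculum that generates temporally dense pseudo-labels to resolve the mismatch between low-frequency supervision and high-frequency inference. Viewed objectively, avoiding internal temporal discretization markedly improves how efficiently event data is used under high-frequency inference.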

Abstract: Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency.
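The idea of projecting a pillar's events onto a continuous-time orthogonal polynomial basis can be sketched as follows. The abstract does not name the basis or the normalization, so this example assumes Legendre polynomials on a window rescaled to [-1, 1] and treats each event as a signed impulse; `legendre_values` and `pillar_coefficients` are hypothetical names, not the authors' implementation.

```python
def legendre_values(x, degree):
    """Evaluate Legendre polynomials P_0..P_degree at x via Bonnet's recurrence."""
    values = [1.0, x]
    for n in range(1, degree):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: degree + 1]


def pillar_coefficients(timestamps, polarities, t_start, t_end, degree=3):
    """Project one pillar's event train onto Legendre polynomials.

    Events are modelled as impulses p_i * delta(s - s_i) with s in [-1, 1].
    The coefficient c_n = (2n + 1) / 2 * sum_i p_i * P_n(s_i) is the
    least-squares (L2) coefficient of the truncated Legendre expansion.
    Assumption: the basis and scaling are illustrative choices only.
    """
    coeffs = [0.0] * (degree + 1)
    for t, p in zip(timestamps, polarities):
        s = 2.0 * (t - t_start) / (t_end - t_start) - 1.0  # rescale to [-1, 1]
        for n, value in enumerate(legendre_values(s, degree)):
            coeffs[n] += 0.5 * (2 * n + 1) * p * value
    return coeffs


# Three positive events at the start, middle, and end of a unit window.
print(pillar_coefficients([0.0, 0.5, 1.0], [1, 1, 1], 0.0, 1.0, degree=2))
```

The degree-0 coefficient is just a scaled event count, which is all a single accumulation bin would keep; the higher-order coefficients carry the timing information that fixed sub-binning would otherwise quantize.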


[35] Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation cs.CV | cs.AI

Hongchao Shu, Roger D. Soberanis-Mukul, Hao Ding, Morgan Ringel, Mali Shen

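TL;DR: This paper proposes a unified framework for image-guided navigation in monocular endoscopy that learns geometry-consistent and domain-robust image representations. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to LoRA that selectively inserts low-rank adapters and couples them with layer-wise training objectives to strengthen geometric correspondence in intermediate features and semantic consistency in deeper features.

Details

Motivation: Addresses the difficulty of pose estimation, depth prediction, and image-to-anatomy alignment in monocular endoscopic navigation, caused by limited depth cues, weak tissue texture, non-rigid deformation, and large appearance variation across domains. It also targets a limitation of existing vision foundation models: their learned representations are insufficiently geometry-consistent, which undermines the stability of downstream navigation tasks.

Result: Experiments on public and proprietary datasets show that the method improves geometric and semantic representation quality and achieves better performance on downstream navigation tasks such as pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision.

Insight: The novelty lies in Hierarchy-Aware Geometry-Semantic Adaptation, a structured adaptation method that designs low-rank adapters and training objectives per layer to jointly optimize geometric correspondence and semantic consistency. Viewed objectively, combining synthetic-data supervision with hierarchical fine-tuning offers a scalable and effective route to representation learning for specialized domains such as endoscopy.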

Abstract: Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.


[36] DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models cs.CV | cs.AI | cs.LG | cs.RO

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang

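TL;DR: This paper proposes DriveJudge, an autonomous driving evaluation agent that combines rule-grounded evaluation with vision-language model (VLM) reasoning to evaluate driving policies in an interpretable, context-aware way. It is trained on a large-scale human-annotated dataset and outperforms existing methods on two benchmark tasks: driving quality classification and trajectory preference selection.

Details

Motivation: In current autonomous driving evaluation, rule-based metrics such as EPDMS are interpretable but lack context-awareness, while VLM-based evaluations are context-aware but produce ambiguous outputs with weak physical grounding. An evaluation method is needed that is both interpretable and aware of the driving context.

Result: On driving quality classification, DriveJudge improves on EPDMS by 21.23 AUC; on trajectory preference selection, it improves on the VLM-based DriveCritic by 6.5%, setting a new standard for interpretable and precise driving evaluation.

Insight: The novelty lies in combining rule-grounded evaluation with VLM reasoning and selectively invoking physically grounded deterministic rule functions, which strengthens context-awareness while preserving interpretability. This gives end-to-end policy learning a more reliable evaluation framework that is better aligned with human judgment.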

Abstract: Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.


[37] Improving and Evaluating Hand-Object Interaction Detection cs.CV

Ahmad Darkhalil, Dima Damen, David Fouhey

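TL;DR: This paper proposes HOI-DETR, a framework that introduces hand-object and object-object interactions into the Co-DETR architecture to produce a state-of-the-art hand-object interaction detection method. The authors also build a comprehensive HOI evaluation suite of 4 diverse datasets and release a trained model that significantly improves performance on multiple benchmarks.

Details

Motivation: Understanding hands and the objects they interact with, directly or through tools, is essential for tasks ranging from action perception to 3D reconstruction and robotics. The paper aims to advance hand-object interaction understanding.

Result: The method delivers significant gains on Hands23, HOIST, FineBio, and HD-EPIC, including mAP gains of over 20 percentage points on Hands23 and FineBio, reaching the state of the art.

Insight: Main novel elements: the HOI-DETR framework, which brings hand-object and object-object interaction modeling into the Co-DETR architecture; a comprehensive multi-dataset evaluation suite that includes a video benchmark and improved annotations; and ablations that confirm the contribution of each model component.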

Abstract: Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. Our paper provides several contributions to the Hand-Object Interaction (HOI) understanding literature: (1) HOI-DETR, a new framework that introduces hand-object and object-object interactions to the Co-DETR architecture to produce a state-of-the-art method; (2) a comprehensive HOI evaluation suite of 4 diverse datasets, including a video benchmark derived from the HD-EPIC dataset and fresh annotations that improve the Hands23 benchmark; and (3) a trained checkpoint that significantly improves the state of the art across Hands23, HOIST, FineBio, and HD-EPIC, including mAP gains of over 20 percentage points on Hands23 and FineBio. Our ablations confirm the contributions of each model component.


[38] TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations cs.CV | cs.AI | cs.RO

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde

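TL;DR: The paper proposes TerraTransfer, an end-to-end autonomous driving method that decouples learning to drive from learning to see. It pretrains a policy by self-play in a vectorized simulator, then aligns the policy's latent space with a pretrained vision backbone, so a high-performing driving policy can be trained without expert demonstrations.

Details

Motivation: Standard end-to-end driving training is expensive: it involves large-scale data collection and labeling, and closed-loop reinforcement learning incurs costly rendering and forward passes through a vision backbone. The paper aims to exploit the efficiency and rich state distribution (collisions, near-misses, and recoveries) of self-play in vectorized simulators to cut training cost and avoid reliance on expert demonstrations.

Result: On photorealistic closed-loop scenarios based on 3D Gaussian splatting, the resulting end-to-end policy matches or exceeds prior end-to-end methods.

Insight: The novelty lies in decoupling driving policy learning from visual representation learning: the self-play pretrained policy supplies the action target, and latent spaces are aligned through an action KL divergence and a batch-relational low-rank structural loss. A paired dataset of (image, scene-state) frames therefore suffices, with no supervision from expert trajectories, yielding efficient training of a high-performing end-to-end driving policy.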

Abstract: End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.


[39] Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LG

Logan Mann, Yi Xia, Ajit Saravanan, Ishan Dave, Saadullah Ismail

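TL;DR: This paper challenges the common assumption in vision-language models (VLMs) that spatial attention is directly tied to answer reliability (the Attention-Confidence Assumption). Using the VLM Reliability Probe (VRP) and structural-attention metrics, the study finds that spatial attention is almost unrelated to model accuracy, whereas self-consistency across reasoning paths is the dominant predictor of reliability. It also reveals key differences between model architectures in how reliability signals are distributed.

Details

Motivation: The aim is to question and test a common intuition about VLMs: that tight attention on relevant visual regions ("structural" visual perception) should signal a reliable answer. The authors systematically investigate where reliability signals actually come from, which matters for using VLMs as dependable reasoning agents.

Result: Spatial attention metrics (such as cluster count C_k and spatial entropy H_s) have near-zero correlation with model accuracy (R ≈ 0.001). In contrast, self-consistency (the agreement rate across sampled reasoning paths) is the dominant predictor of truth (R = 0.429). Causal intervention experiments show that architectures such as LLaVA, PaliGemma, and Qwen2-VL differ markedly in the distribution and robustness of their reliability signals.

Insight: The core contribution is exposing "Symbolic Detachment" in VLMs (visual features lock early, then attention diffuses later) and "Cluster Failure", i.e., spatial attention maps are not a valid indicator of reliability. The key insight is that reliability stems mainly from generation dynamics and internal-state distributions, especially self-consistency, which points to new directions for evaluating and improving VLM reliability.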

Abstract: Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from “structural” visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts (C_k) and spatial entropy (H_s), to quantify the visual encoder’s gaze, and track its evolution (ΔH_s) across layers. This reveals a “Symbolic Detachment”: models often “Early Lock” visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a “Cluster Failure”: spatial attention has near-zero correlation (R ≈ 0.001) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth (R = 0.429). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
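The two kinds of signal contrasted in this abstract can be illustrated with a minimal sketch. The abstract does not define H_s or the agreement rate precisely, so the code assumes spatial entropy is the Shannon entropy of the normalized attention map and self-consistency is the share of sampled answers that match the majority answer. Both function names are hypothetical.

```python
import math
from collections import Counter


def spatial_entropy(attention_map):
    """Shannon entropy (in nats) of a 2-D attention map after normalisation.

    Assumption: a plausible reading of H_s, not the paper's exact metric.
    Low values mean tightly focused attention; high values mean diffuse attention.
    """
    flat = [w for row in attention_map for w in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    probs = [w / total for w in flat if w > 0]
    return -sum(p * math.log(p) for p in probs)


def self_consistency(answers):
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


focused = [[0.0, 0.0], [0.0, 1.0]]
diffuse = [[1.0, 1.0], [1.0, 1.0]]
print(spatial_entropy(focused), spatial_entropy(diffuse))
print(self_consistency(["A", "A", "B", "A"]))
```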


[40] Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation cs.CV | cs.AI

Marina Chagas Bulach Gapski, Vinicius Atsushi Sato Kawai, Gustavo Rosseto Leticio, Lucas Pascotti Valem, Daniel Carlos Guimarães Pedronette

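TL;DR: This paper proposes a novel graph neural network approach for semi-supervised image classification that improves performance by aggregating multiple feature and graph representations from different feature extractors. Experiments show that strategically combining feature and graph representations, together with manifold learning for graph processing, significantly improves classification accuracy under most experimental conditions. Using rank aggregation to integrate features from different extractors is also shown to improve accuracy.

Details

Motivation: Addresses how to effectively integrate the complementary information provided by different feature extractors (such as CNNs and ViTs) to improve model performance in semi-supervised image classification where labeled data is scarce.

Result: Experiments show that the method significantly improves classification accuracy under most conditions; the size of the gain depends on the combination strategy for feature and graph representations and on the use of manifold learning.

Insight: The novelty lies in strategically combining multiple feature and graph representations and introducing manifold learning for graph processing. Viewed objectively, aggregating heterogeneous features and applying rank aggregation makes more effective use of available information in semi-supervised learning.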

Abstract: Feature extraction involves the identification and extraction of salient characteristics or patterns, including edges, textures, shapes, and color attributes. Contemporary feature extractors predominantly leverage deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The availability of diverse feature extractors in the literature provides a wide range of feature representations. Features extracted from an image depend on the specific application, the chosen extractor, and its configuration. Therefore, integrating complementary information by combining distinct extractors offers a promising way to enhance performance. Graph Neural Networks (GNNs), particularly Graph Convolutional Networks (GCNs), have emerged as powerful and widely adopted approaches for semi-supervised image classification, as they effectively leverage both labeled and unlabeled data while exploiting the underlying graph structures that capture relationships among samples. This study proposes a novel approach for GNNs in scenarios where labeled data is scarce, by integrating diverse sets of feature and graph representations derived from various extractors in classification scenarios. Experimental investigations were conducted, encompassing combinations of distinct feature and graph extractors, as well as rank aggregation strategies. The primary contributions of this work are underscored by the experimental findings, which demonstrate that the strategic combination of feature and graph representations, coupled with the application of manifold learning for graph processing, leads to significant improvements in classification accuracy across the majority of experimental conditions. Furthermore, the utilization of rank aggregation techniques to integrate features from different extractors was shown to enhance classification accuracy.
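Rank aggregation across extractors can be sketched with a simple Borda count. The abstract does not say which rank aggregation strategies the study uses, so Borda is only a stand-in here, and the extractor names and image IDs are hypothetical.

```python
def borda_aggregate(rankings):
    """Fuse several neighbour rankings of the same items with a Borda count.

    Each ranking lists item IDs from most to least similar. An item at
    position i in a list of length n earns n - i points; ties in the
    total are broken alphabetically so the output is deterministic.
    Assumption: Borda is illustrative, not the paper's stated method.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - position)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Neighbour lists for one query image from three hypothetical extractors.
cnn_ranking = ["a", "b", "c"]
vit_ranking = ["b", "a", "c"]
other_ranking = ["b", "c", "a"]
print(borda_aggregate([cnn_ranking, vit_ranking, other_ranking]))
```

A fused ranking of this kind could then define the nearest-neighbour graph that a GCN operates on, although the abstract does not describe how the study builds its graphs.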


[41] Enhancing Pathological VLMs with Cross-scale Reasoning cs.CV | cs.AI

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu

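TL;DR: This paper proposes the first cross-scale training and evaluation paradigm, which formulates pathology image interpretation as a multi-magnification reasoning task, and builds Scale-VQA, a high-quality benchmark. To counter the text-only shortcuts that multi-image visual question answering is prone to, it designs a leakage-aware curation pipeline and trains ScaleReasoner-R1 with reinforcement learning. The model reaches SOTA on the cross-scale reasoning benchmark and on single-scale benchmarks, indicating that limited cross-scale supervision can significantly improve pathological understanding.

Details

Motivation: Existing datasets for pathology vision-language models (VLMs) include multi-scale images but lack an explicit cross-scale reasoning objective, which prevents models from capturing essential cross-scale representations and learning evidence-based reasoning.

Result: ScaleReasoner-R1 achieves SOTA performance on the authors' cross-scale reasoning benchmark Scale-VQA (4,685 multiple-choice questions grounded in 2,537 multi-magnification pathology images) and also achieves SOTA results on established single-scale benchmarks.

Insight: The novelty lies in formalizing pathology interpretation as a cross-scale reasoning task for the first time, and in a leakage-aware curation pipeline (combining adversarial text-only screening with constraint-guided question design) that mitigates text shortcuts in multi-image VQA. The study confirms that even limited cross-scale supervision can significantly improve a model's understanding of pathology images.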

Abstract: Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Our findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.


[42] Attention Alignment Between Humans and Vision-Language Models cs.CV

Isaac R. Christian, Udith Haputhanthrige, Hanna Hornfeld, Declan Campbell, Samuel Nastase

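TL;DR: The study compares the spatial attention alignment of six vision-language models (VLMs) with humans on two visual tasks. Decoder architecture (LSTM vs. Transformer) is the main factor: LSTM decoders align better but produce spatially diffuse attention maps, while Transformer decoders align less well but produce more concentrated attention with stronger task differentiation.

Details

Motivation: The study investigates how top-down (goal-driven) and bottom-up (sensory-driven) mechanisms of visual perception are reflected in vision-language models, comparing model and human attention maps to separate and test how each component drives attention alignment.

Result: On two tasks over 200 images (general description and social captioning), LSTM decoders raised attention alignment by 40–50 percentage points over Transformer decoders (80–87% vs. 40–59% of the human noise ceiling), while the encoder (CNN vs. ViT) had a secondary effect (5–20 percentage points). The CNN-LSTM combination was the most aligned overall (85–87%).

Insight: Decoder architecture is the dominant factor in VLM-human attention alignment, but there is a trade-off: the highly aligned LSTM decoders produce spatially diffuse attention maps with little task differentiation, while the less aligned Transformer decoders show more concentrated, more task-specific attention. In addition, attention alignment may be decoupled from the ability to predict synthetic neural activity (e.g., in early visual cortex).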

Abstract: Visual perception depends on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, allowing us to treat each component as a separable hypothesis about what drives where we look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). The six models spanned a 2×2 factorial of CNN vs. ViT encoders crossed with LSTM vs. Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We found that both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM vs. Transformer decoders increased alignment by 40–50 percentage points (80–87% vs. 40–59% of the human noise ceiling). In contrast, CNN vs. ViT encoders contributed a secondary 5–20 point advantage depending on decoder family, with CNN-LSTM the most aligned model overall (85–87%). Despite their alignment advantage, LSTM-decoder attention maps were spatially diffuse and minimally task-differentiated; ViT-Transformer, the weakest in alignment, showed the sharpest spatial concentration and strongest task differentiation. A hemispatial-neglect simulation confirmed that ablating attention impacted LSTM decoders more than Transformer decoders. In an exploratory extension using TRIBE-simulated synthetic neural responses, fixation alignment and neural relevance dissociate: CNN-Transformer attention maps better predicted synthetic brain activity despite lower fixation alignment, with attention maps best predicting early visual cortex. Together, top-down and bottom-up components trade off what they predict in behavioral and synthetic neural data.


[43] UoU: A Universal Fingerprint Foundation Model Based on Large-Scale Unsupervised Learning cs.CV

Xiongjun Guan, Jianjiang Feng, Jie Zhou

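TL;DR: This paper proposes UoU, a universal fingerprint foundation model based on large-scale unsupervised learning, to overcome the limitations of traditional task-specific fingerprint recognition pipelines. The model learns general fingerprint representations through a multi-level representation architecture and a training recipe that combines a supervised cold start, large-scale weakly supervised refinement, and unsupervised consolidation.

Details

Motivation: Traditional fingerprint recognition uses task-specific pipelines optimized in isolation, which limits representation reuse across sensors, image qualities, and downstream applications. A universal fingerprint foundation model is therefore needed to unify feature extraction.

Result: The paper presents the technical motivation, system design, and validation protocol of UoU and releases part of the baseline implementation. The abstract reports no specific quantitative results or benchmark performance.

Insight: The novelty lies in reframing fingerprint feature extraction as a domain-specific foundation-model problem, exploiting fingerprint-specific symmetries and intermediate structure (such as orientation flow and periodic ridge patterns), and adopting an architecture-agnostic design that supports multi-task learning and downstream specialization.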

Abstract: Fingerprint recognition is still dominated by task-specific pipelines, where enhancement, structural parsing, alignment, and matching are optimized in isolation. Although effective in narrow settings, this design limits representation reuse across sensors, qualities, and downstream applications. We therefore present UoU, short for “a Universal fingerprint foundation model based on large-scale Unsupervised learning,” which reframes fingerprint feature extraction as a domain-specific foundation-model problem. UoU is organized around a multi-level representation hierarchy spanning image restoration, structural fields, semantic tokens, point-level biometric entities, and compact global descriptors. Its training recipe combines a supervised cold start on precise annotations, large-scale weakly supervised refinement, and large-scale unsupervised consolidation, with the latter two stages iterated during large-scale training so that weak supervision broadens semantic coverage while unsupervised learning stabilizes correspondences, invariances, and representation geometry. Rather than treating fingerprint imagery as generic texture, UoU exploits domain-specific symmetries and intermediate structure, including orientation flow, periodic ridge patterns, sparse biometric entities, and spatial equivariance. The framework is intentionally architecture-agnostic: while the present study includes an initial transformer-based structured-prediction instantiation, the broader design supports multi-task learning, scalable model configurations, and downstream specialization for matching, alignment, enhancement, registration, and related fingerprint applications. This paper presents the technical motivation, system design, and validation protocol of UoU, and part of the baseline implementation is publicly available at https://github.com/XiongjunGuan/UoU.


[44] CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation cs.CV

Trinh Thi Thu Hien, Trung-Nghia Le

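TL;DR: This paper proposes CIAN (Contextual Image-Article Narrator), a multi-stage framework for event-enriched image captioning. It retrieves relevant articles with SigLIP, generates narratives with a LoRA-fine-tuned Qwen model, and applies N-Gram-based refinement to improve fluency and coherence.

Details

Motivation: Most existing image captioning models are limited to pixel content and cannot describe the broader context of an event (such as timing, location, and participants). The paper uses retrieval-augmented generation to inject external narrative information into captions.

Result: On the OpenEvents-V1 benchmark, CIAN achieves high retrieval performance (mAP 0.979) and raises the caption quality metric CIDEr from 0.030 to 0.094.

Insight: The novelty lies in combining retrieval-augmented generation (RAG) with multi-stage processing (retrieval, summarization, generation, refinement) to supply rich contextual event information for captions. The approach combines vision-text retrieval, LLM fine-tuning, and post-processing refinement, and serves as an effective example of generating context-aware, human-like captions.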

Abstract: Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Image-Article Narrator (CIAN), a multi-stage framework that enriches captions with external narratives. CIAN retrieves relevant articles using SigLIP, summarizes them to guide a Narrative Generation stage with a LoRA-fine-tuned Qwen model, and applies N-Gram-based Refinement for fluency and coherence. On the OpenEvents-V1 benchmark, CIAN achieves high retrieval performance (mAP 0.979) and improves caption quality, increasing CIDEr from 0.030 to 0.094. These results highlight the effectiveness of retrieval-augmented reasoning combined with linguistic refinement for generating context-aware, human-like captions.


[45] LADBench: A Benchmark for Logical Fault Detection in Images cs.CV

Sahasra Kondapalli, Lara Radovanovic, Aadi Palnitkar, Mingyang Mao, Xiaomin Lin

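TL;DR: The paper introduces LAD-Bench, a benchmark for evaluating how well large vision-language models detect logical faults in images. It contains more than 1,000 curated synthetic images across four domains (Residential, Urban, Collaborative, and Nature) and introduces a Tiered Prompting Protocol based on progressive disclosure to quantify how much assistance a model needs to localize and reason about a logical fault.

Details

Motivation: Existing anomaly detection benchmarks focus on visual errors or direct prompting and neglect the physical and social common-sense reasoning needed for open-world deployment. The study fills this gap by systematically evaluating VLMs' capacity for autonomous logical reasoning.

Result: Evaluating leading foundation models reveals substantial weaknesses: even the best model reaches only 70.11% overall accuracy. Models often fail to identify anomalies even after receiving explicit hints in deeper tiers, showing that implicit logical fault detection remains unsolved.

Insight: The novelty lies in building a benchmark focused on logical rather than visual anomalies and in designing a tiered prompting protocol to diagnose the fragility of model reasoning. This provides a rigorous evaluation framework for advancing the safety, reliability, and cognitive alignment of autonomous visual systems.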

Abstract: Large Vision Language Models (VLMs) excel at visual question answering and semantic grounding, but their capacity for autonomous logical reasoning remains underexplored. Existing anomaly benchmarks emphasize visual errors or direct prompting rather than the physical and social common sense needed for open-world deployment. To address this, we introduce LAD-Bench, a benchmark of more than 1,000 curated synthetic images with logical anomalies across four domains: Residential, Urban, Collaborative, and Nature. We further propose a Tiered Prompting Protocol based on progressive disclosure, which measures how much explicit assistance a model needs to localize and reason about a logical fault. Evaluating leading foundation models reveals substantial weaknesses: even the best achieves only 70.11% overall accuracy, showing that implicit logical fault detection remains unsolved. Crucially, models often fail to identify anomalies even after receiving explicit hints in deeper tiers. By surfacing these limitations in sequential multimodal reasoning, LAD-Bench offers a rigorous framework for advancing the safety, reliability, and cognitive alignment of autonomous visual systems. Dataset and Code: https://huggingface.co/datasets/SahasraK/LADBench


[46] Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos cs.CV | cs.AI

Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu

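TL;DR: For standard view classification of echocardiograms, this paper proposes a Spatio-Temporal Fusion Model (STFM) and releases EV9V, to the authors' knowledge the largest publicly available echocardiography video dataset. The model uses a dual-stream CNN-LSTM framework with uncertainty-aware learning to jointly capture spatial anatomical structures and temporal cardiac dynamics, improving robustness to variations in frame quality.

Details

Motivation: Addresses three challenges in automated classification of standard echocardiographic views: public datasets are scarce and limited in scale; the performance of modern video-level architectures is underexplored; and some view categories look highly similar spatially, so single-frame features cannot separate them, while uneven frame quality complicates robust temporal fusion.

Result: Extensive experiments on the proposed EV9V dataset show that the method achieves competitive performance among diverse video classification models, validating uncertainty-aware spatio-temporal learning for echocardiographic view classification.

Insight: Main novel elements: (1) release of the large-scale public dataset EV9V; (2) an efficient dual-stream CNN-LSTM spatio-temporal fusion framework; (3) uncertainty-aware learning that preferentially samples representative video segments during training and uses evidence-based fusion during inference to cope with variations in frame quality. Viewed objectively, combining uncertainty modeling with spatio-temporal feature learning is an effective technical route for the uneven quality of medical video.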

Abstract: Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.


[47] Contact-Based Fringe Projection Profilometry for High-Resolution 3-D Surface Measurement of Reflective and Transparent Objects cs.CV [PDF]

Ingu Yeo, Hyung-Gun Chi, Jae-Sang Hyun

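TL;DR: This paper proposes a contact-based 3-D surface measurement method built on digital fringe projection, aiming to address the limitations of existing vision-based tactile sensors such as GelSight in measuring absolute depth, calibrating over large areas, and handling highly reflective or transparent objects. The method performs triangulation-based reconstruction on a coated silicone contact surface, yielding dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region.

Details

Motivation: Existing GelSight sensors based on photometric stereo cannot measure absolute depth directly, and they face accuracy challenges when calibrating over large areas and when measuring highly reflective or transparent objects.

Result: Through a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, the experiments confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurement and can reliably reconstruct objects with diverse optical properties.

Insight: Integrating high-accuracy digital fringe projection into a contact sensor simplifies calibration over large areas and improves depth precision on complex surfaces, especially highly reflective or transparent objects; this is an innovative improvement on the conventional vision-based tactile sensing paradigm.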

Abstract: This paper presents a contact-based 3-D surface measurement method based on a Digital Fringe Projection (DFP) system, belonging to the vision-based tactile sensing family pioneered by the commercially successful GelSight sensor. Such sensors have proven effective for robotic fingertip manipulation and contact sensing. However, because GelSight employs photometric stereo with RGB LEDs, it does not measure absolute depth directly but instead infers it by integrating estimated surface gradients, which can accumulate reconstruction errors; in addition, it becomes increasingly difficult to calibrate as the sensing area grows, and its depth accuracy is challenged on highly reflective or transparent objects. To overcome these drawbacks, we propose a fringe-projection-based contact measurement technique that performs triangulation-based 3-D reconstruction on a coated silicone contact surface, providing dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region. By integrating high-accuracy digital fringe projection into the sensor, our approach simplifies calibration over larger areas and enhances depth precision for complex surfaces. Experimental results, including a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurements, allowing reliable reconstruction of objects with diverse optical properties.
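
The abstract mentions a sphere-fitting accuracy evaluation without giving details. As a rough illustration of what such an evaluation can look like, the sketch below fits a sphere to a 3-D point cloud by algebraic least squares and reports the radial RMS error. The function names, the synthetic data, and the choice of an algebraic fit are assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def fit_sphere(points):
    """Algebraic least-squares sphere fit. Returns (center, radius)."""
    p = np.asarray(points, dtype=float)
    # |p|^2 = 2 c.p + (r^2 - |c|^2) is linear in the unknowns (c, d).
    A = np.hstack([2.0 * p, np.ones((len(p), 1))])
    b = np.sum(p * p, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = np.sqrt(sol[3] + center @ center)
    return center, radius

def radial_rmse(points, center, radius):
    """RMS deviation of the point-to-center distances from the radius."""
    d = np.linalg.norm(np.asarray(points, dtype=float) - center, axis=1)
    return float(np.sqrt(np.mean((d - radius) ** 2)))

# Synthetic, noise-free points on a hypothetical 5 mm sphere.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(500, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
true_center = np.array([1.0, -2.0, 30.0])
true_radius = 5.0
pts = true_center + true_radius * dirs

center, radius = fit_sphere(pts)
rmse = radial_rmse(pts, center, radius)
```

With measured data, `pts` would be the reconstructed contact-surface points of a reference sphere, and `rmse` together with the radius error would summarize the reconstruction accuracy.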


[48] WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation cs.CV | cs.RO [PDF]

Shoujing Zhu, Zhenyang Liu, Fungmiu Wang, Jiafeng Wang, Bo Yue

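TL;DR: This paper proposes WeaveLA, a method for improving vision-language-action (VLA) policies on repetitive robot manipulation tasks. On top of a frozen VLA backbone, it adds a cross-subtask latent memory weaving interface that is triggered at sub-goal completion events, compresses the visual information of the completed segment into latent tokens, and routes them directly into the action-generation path of the next sub-task.

Details

Motivation: Existing VLA policies perform well on single-step manipulation but are brittle on sequential tasks that require passing information across sub-tasks; the core structural problem is the lack of an explicit channel for passing information across sub-task boundaries.

Result: On the RoboMME benchmark with a π_{0.5} backbone, WeaveLA raises the success rate on the hardest repetition slice (SwingXtimes, N=3) from 0% to 47.8%, while performance on single-execution tasks remains unchanged.

Insight: The novelty lies in treating the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off and in designing an event-triggered, action-side, lightweight memory interface that compresses information via query-driven attention pooling, preserving the base policy's short-window interface while adding cross-subtask information transfer.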

Abstract: Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for routing information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy’s short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a $\pi_{0.5}$ backbone, WeaveLA’s gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, $N{=}3$), success rises from $0\%$ to $47.8\%$, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.
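
To make the compression step concrete, here is a minimal sketch of query-driven attention pooling: a small set of query vectors attends over the features of a completed segment and returns one latent token per query. The shapes, the random stand-in for learned queries, and the absence of key/value projections are simplifying assumptions; the abstract does not specify WeaveLA's actual layer.

```python
import numpy as np

def attention_pool(features, queries):
    """Pool (T, d) segment features into (K, d) latent tokens.

    features: per-frame features of one completed sub-task segment.
    queries:  K query vectors (learned in practice, random here).
    """
    d = features.shape[1]
    scores = queries @ features.T / np.sqrt(d)       # (K, T) similarities
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over time
    return weights @ features, weights

rng = np.random.default_rng(0)
segment = rng.normal(size=(120, 64))   # hypothetical: 120 frames, 64-dim features
queries = rng.normal(size=(4, 64))     # hypothetical: 4 latent memory tokens
latents, weights = attention_pool(segment, queries)
```

Whatever the segment length, the output has a fixed number of tokens, which is what lets a hand-off of this kind stay lightweight.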


[49] Theoretical Grounding of Out-Of-Distribution Detection With Reinforcement Learning Optimizer cs.CV | cs.LG [PDF]

Salimeh Sekeh, Xin Zhang

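TL;DR: This paper establishes a theoretical framework for out-of-distribution (OOD) detection in dynamic open-world environments and proposes a reinforcement-learning-based optimizer that adds a correction term to explicitly reduce the semantic OOD false positive rate over time, improving future-domain generalization and semantic OOD rejection.

Details

Motivation: Existing OOD detection methods typically optimize only the current objective and do not explicitly account for how post-deployment environment changes affect future OOD behavior. A theoretical foundation and method are therefore needed that can continually adapt to evolving data distributions, generalize to covariate-shifted inputs, and reject semantic-shifted OOD samples.

Result: The paper develops a novel augmented optimizer that applies an RL-guided correction term on top of standard gradient descent. The analysis shows that it outperforms standard gradient descent in future-domain generalization and semantic OOD rejection, and a theoretical framework compares the generalization errors under the two optimizers.

Insight: The core novelty is casting OOD detection as a dynamic optimization problem and, for the first time from a theoretical perspective, using an RL optimizer to guide model updates so as to minimize long-term semantic OOD error. The proposed temporal error decomposition and theoretical comparison framework offer a new perspective for analyzing model robustness in dynamic environments.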

Abstract: Out-of-distribution (OOD) detection in dynamic open-world environments requires a model to continually adapt to evolving data distributions while generalizing to covariate-shifted inputs and rejecting semantic-shifted OOD examples. Most existing OOD detection methods optimize only the current-step objective and do not explicitly account for how post-deployment environment changes affect future OOD behavior. In this paper, we establish a theoretical grounding for dynamic OOD detection using a reinforcement learning (RL)-guided optimizer that explicitly favors updates that reduce the semantic OOD false positive rate over time. We develop a novel augmented optimizer that uses an RL-guided correction term on top of standard gradient descent (GD) and show its improvement over both future-domain generalization and semantic-OOD rejection. We analyze temporal error decomposition in terms of model-change and environment-change generalization errors and develop a new theoretical framework for comparing the generalization errors under both GD and RL-guided optimizers.


[50] GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning cs.CV | cs.RO [PDF]

Haoyu Wang, Guoqing Ma, Zeyu Zhang, Yandong Guo, Boxin Shi

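TL;DR: This paper proposes GeneralVLA-2, a generalist vision-language-action system for robot planning. It introduces the GeoFuse-MV3D branch to improve multi-view 3D object reconstruction and upgrades KnowledgeBank into a governed long-term memory system, improving the reliability of robot trajectory planning and the reuse of experience.

Details

Motivation: Existing generalist vision-language-action systems such as GeneralVLA face two bottlenecks when converting language and RGB-D observations into 3D end-effector paths: monocular SAM3D-style object reconstruction hallucinates pose and geometry, and the original knowledge base makes it hard to control memory quality, conflicts, and geometric relevance.

Result: On GSO-30, GeoFuse-MV3D reduces CD and LPIPS by 2.20% and 2.02% and raises PSNR and SSIM by 2.36% and 1.03% relative to the MV-SAM3D baseline. On Terminal-Bench 2.0 and SWE-Bench Verified, the upgraded KnowledgeBank improves over ReasoningBank by 4.53% in success rate (SR) and 3.73% in resolve rate, respectively, while reducing average steps (AS) by 4.95% and 5.65%.

Insight: The main contributions are: (1) GeoFuse-MV3D, a 3D reconstruction branch guided by geometry priors that fuses multi-view information and restricts fusion to geometry; (2) a governed long-term memory system with explicit metadata management and precision-oriented retrieval, which improves the quality and controllability of knowledge management.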

Abstract: Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.


[51] SPHINX: First Explain, Then Explore cs.CV [PDF]

Nguyen Do, Tue M. Cao, Tien Van Do, András Hajdu, Tamás Bérczes

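TL;DR: This paper proposes SPHINX, a framework for generating adversarial driving scenarios that target the weaknesses of autonomous driving decision systems. Its core principle is "first explain, then explore": it first uses explainable AI methods to analyze the failure modes of the driving policy (e.g., indecisiveness, multi-frame inconsistency), and then, based on this interpretable evidence, uses a vision-language model to generate targeted adversarial scenarios for policy retraining and improvement.

Details

Motivation: Existing methods such as ChatScene and LLM-Attacker rely mainly on the prior knowledge of large language models and vision-language models to generate scenarios procedurally. The authors argue that adversarial scenarios should instead be generated from a failure diagnosis of the driving policy so that they target the policy's weaknesses rather than relying on prior assumptions.

Result: Experiments show that SPHINX provides an interpretable account of policy failures, which other adversarial scenario generation methods do not. Across the evaluated benchmarks and test suites, SPHINX applies to a variety of state-of-the-art autonomous driving architectures and yields consistent robustness improvements over existing scenario-generation methods.

Insight: The novelty lies in tightly coupling explainability analysis (XAI) with adversarial scenario generation to form a closed-loop "diagnose, critique, generate" framework. This replaces blind exploration of the scenario space: adversarial scenarios are generated directly against interpretable weakness evidence extracted from the policy's own decision process, which improves policy robustness more efficiently.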

Abstract: Generating adversarial driving scenarios is critical for evaluating and improving autonomous vehicle decision-making systems in simulation. Recent approaches, such as ChatScene and LLM-Attacker, rely primarily on the prior knowledge of Large Language Models and Vision-Language Models to generate driving scenarios procedurally. We argue that adversarial scenes should be generated based on the failure diagnosis (e.g., indecisiveness, multi-frame inconsistency) of the driving policy to specifically address the policy’s weaknesses instead of relying on prior assumptions. In this paper, we propose SPHINX, a closed-loop framework for adversarial scenario synthesis guided by a simple principle: first explain, then explore. Rather than blindly exploring the scenario space, SPHINX leverages explainable artificial intelligence methods to analyze the policy, identifying key visual concepts and their influence on policy outputs, as well as the uncertainty of its decisions. Given the interpretable evidence extracted from the policy’s own decision process, we use a vision language model to rationalize and critique failure modes of the current policy. These critiques are then used to generate targeted adversarial scenarios for policy retraining and improvement. We demonstrate that SPHINX can provide an interpretable account of policy failures, whereas other adversarial scene generation methods cannot. Across the evaluated benchmarks and test suites, SPHINX can be applied to diverse state-of-the-art autonomous vehicle architectures and yields consistent robustness improvements over existing scenario-generation methods.


[52] Reinforcing Dual-Path Reasoning in Spatial Vision Language Models cs.CV | cs.AI [PDF]

Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang

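TL;DR: This paper proposes SR-REAL, a unified framework that uses reinforcement learning to strengthen dual-path reasoning in spatial vision-language models (spatial VLMs). The framework comprises a Language-Only Reasoning (LOR) path, which performs step-by-step linguistic deduction, and a Detect-Then-Reason (DTR) path, which first detects 3D geometric cues and then performs explicit geometric inference. Trained in two stages, supervised fine-tuning followed by reinforcement learning, the model significantly outperforms baselines on a variety of spatial reasoning tasks.

Details

Motivation: Existing spatial VLMs have made progress in geometric perception but still struggle with complex spatial tasks that require multi-step reasoning, and different spatial queries call for different strategies: some suit purely linguistic deduction, while others require 3D grounding before reasoning.

Result: Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines. A single model supports both reasoning paths, with DTR excelling at region-aware tasks and LOR strengthening general spatial reasoning, and the model generalizes across datasets and domains without per-task tuning.

Insight: The novelty is a unified dual-path reasoning framework (LOR and DTR) whose paths are jointly optimized with reinforcement learning so that they reinforce each other. High-quality cold-start data and a discrete center-based detection reward are critical for stable training and geometric alignment, and the results demonstrate positive transfer between the two paths.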

Abstract: Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.


[53] OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation cs.CV | cs.AI [PDF]

Zijie Meng, Yufei Liu, Chengqian Ma, Zhiyu Li, Jiyuan Liu

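TL;DR: This paper proposes DRIVE-CHOREO, an LLM-choreographed multi-agent world model that addresses controllable multi-view video generation for autonomous driving. The method introduces a shared symbolic interlingua that aligns language, geometry, and pixels at the latent-token level, and uses a unified latent co-compression mechanism to encode global 3D geometry.

Details

Motivation: Existing generative world models for autonomous driving face two core tensions: heterogeneous control signals (free-form language, HD maps, trajectories, and camera poses) live in incompatible representation spaces, and post-hoc cross-view fusion leaves per-camera latents unable to encode global 3D geometry.

Result: On nuScenes, DRIVE-CHOREO sets new state-of-the-art (SOTA) results in multi-view consistency and bird's-eye-view mean average precision (BEV mAP of 21.6), with a competitive Fréchet Video Distance (FVD of 45.7). A detector trained only on its synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.

Insight: The novelty is an LLM-choreographed multi-agent architecture (Director, Cartographer, Auditor) that jointly authors a position-aware token sequence serving as a unified symbolic interlingua. In addition, a view-time permutation enforces inter-camera geometric constraints within the convolutional receptive field of the 3D VAE, co-compressing the multi-view video with the token sequence and thereby unifying language, geometry, and visual information at the latent level.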

Abstract: Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.


[54] RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting cs.CV [PDF]

Hao-Yuan Ma, Li Zhang, Zhiwei Zhu, Jie Gao

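TL;DR: This paper proposes RT-Counter, a real-time text-guided open-vocabulary object counting framework that addresses the shortcomings of existing methods in fine-grained spatial understanding and real-time inference. With a novel Visual Prototype Textualization module and Weaving Transformer layers, the framework delivers a substantial gain in computational efficiency while maintaining high counting accuracy.

Details

Motivation: Existing text-guided open-vocabulary object counting methods built on vision-language pre-trained models struggle to combine fine-grained spatial understanding with real-time inference in counting scenarios, so a new method that balances high accuracy with real-time performance is needed.

Result: Extensive experiments on three public datasets show that RT-Counter breaks the accuracy-speed trade-off in TOOC. It achieves a competitive mean absolute error of 13.30 on FSC147 while running at 112.48 FPS, making it 7.4x faster and more than 4x more parameter-efficient than the leading existing methods.

Insight: The novelty lies in the Visual Prototype Textualization module, which projects visual features into the text feature space to fuse abstract and detailed information, and in Weaving Transformer layers with a hybrid attention mechanism that efficiently integrate local and global visual features, increasing descriptive power while substantially reducing computational cost.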

Abstract: Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and real-time inference requirements in counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves not only good counting accuracy but also high computational efficiency. RT-Counter designs a novel Visual Prototype Textualization (VPT) module that can project learned visual features into a text feature space and then generate features containing the abstract information that is hard to capture with visual prototypes and the detailed prototype information that is difficult to describe in text, enhancing the object-level visual-language model’s counting capabilities. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, maintaining high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that can efficiently weave together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4$\times$ faster and over 4$\times$ more parameter-efficient than the existing leading methods in TOOC. Our work aims at balancing high accuracy and real-time performance in TOOC. Code is available at: https://github.com/Jason-Mar1/RT-Counter.


[55] Universal Image Restoration via Internalized Chain-of-Thought Reasoning cs.CV [PDF]

Yu Guo, Zhengru Fang, Shengfeng He, Senkang Hu, Yihang Tao

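TL;DR: This paper proposes CoTIR, a universal image restoration framework that internalizes chain-of-thought (CoT) reasoning. It treats image restoration as a subtask of image editing, starts from a large-scale pre-trained editing model, and encodes structured CoT reasoning into the learning objective through a differentiable formulation inspired by Lagrangian optimization, achieving holistic restoration within a single model.

Details

Motivation: Image restoration is ill-posed under complex, mixed degradations. Existing unified models degrade as degradation complexity grows, while multi-round CoT reasoning methods suffer from high computational cost and weak modeling of interactions between degradations.

Result: Extensive experiments on the proposed CoTIR-Bench benchmark, which contains 5.2 million samples, and on a broad range of real composite degradation scenes show that CoTIR outperforms both unified models and multi-round restoration methods in perceptual quality and fidelity.

Insight: The novelty lies in recasting image restoration as a subtask of image editing and using a pre-trained editing model as a better optimization starting point. A differentiable formulation inspired by Lagrangian optimization internalizes structured CoT reasoning into a single model's learning objective, which avoids chaining specialized restorers and enables more efficient joint reasoning.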

Abstract: Image restoration seeks to recover high-quality images from degraded inputs but becomes highly ill-posed under complex, mixed degradations. While unified all-in-one models are common, their performance declines as degradation complexity increases. Recent works adopt Chain-of-Thought (CoT) reasoning for multi-round restoration using specialized modules. However, this approach faces two key limitations: (i) increased computational cost due to multi-step processing, and (ii) weak modeling of interactions between degradations during stepwise inference. We introduce CoTIR, a universal image restoration framework that internalizes CoT reasoning within a single model. Concretely, we view image restoration as a specialized subtask of image editing, which implies that a large-scale pre-trained editing model provides a more favorable optimization starting point. Building on this, we fine-tune the model for restoration and further encode structured CoT-style reasoning into the learning objective via a differentiable formulation inspired by Lagrangian optimization, enabling holistic restoration without chaining specialized restorers. To facilitate training and evaluation, we further present CoTIR-Bench, a large-scale benchmark comprising 5.2 million samples with CoT-style reasoning traces. Extensive experiments on CoTIR-Bench and broad real composite degradation scenes show that CoTIR achieves stronger perceptual quality and more competitive fidelity than both all-in-one models and multi-round restoration methods. The source code is available at https://github.com/gy65896/CoTIR.


[56] TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization cs.CV [PDF]

Weiliang Chen, Yuanhui Huang, Xuebo Wang, Yueqi Duan

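TL;DR: This paper proposes TivTok (Time-Invariant Tokenizer), a tokenization method for scalable video generation. By factorizing a video into Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals, it substantially reduces the number of tokens needed to model long videos. The method introduces Scope-Induced Factorization (SIF) and Invariant Broadcasting (IB), which enable efficient reuse of persistent content and parallel reconstruction.

Details

Motivation: Existing video tokenizers improve scalability mainly by compressing videos into fewer tokens, yet they still represent persistent content such as static backgrounds repeatedly across frames and chunks, which is inefficient. This paper aims to design a tokenizer that reuses information persisting over time so that long videos can be modeled more efficiently.

Result: On the standard 16-frame 256x256 benchmark, TivTok achieves an rFVD of 12.65. For 128-frame videos, it improves compression efficiency by 2.91x over the evaluated baselines and needs only 1.1% of the tokens required by downsample-based tokenizers.

Insight: The core novelty is explicitly factorizing video information into time-invariant and time-variant parts, realized through SIF's separation of attention scopes and IB's reuse-and-broadcast mechanism. This gives video generation models an efficient, scalable representation and substantially lowers the computational cost of modeling long videos.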

Abstract: Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard $16{\times}256{\times}256$ benchmark and improves compression efficiency by 2.91$\times$ for 128-frame videos compared with the evaluated baselines, while using only 1.1% of the tokens required by downsample-based tokenizers in our evaluation.
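
The attention scopes described for SIF can be written down as a boolean mask. The sketch below builds one under an assumed token layout: queries are ordered as TIV tokens followed by the TV tokens of each frame, and keys as TIV tokens followed by the patch tokens of each frame. The layout, the function name, and the decision to let TIV tokens see each other are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sif_mask(n_frames, patches_per_frame, n_tiv, tv_per_frame):
    """Boolean mask[q, k]: True where query token q may attend to key token k.

    Queries: [TIV tokens | TV tokens of frame 0 | TV tokens of frame 1 | ...]
    Keys:    [TIV tokens | patches of frame 0   | patches of frame 1   | ...]
    """
    n_q = n_tiv + n_frames * tv_per_frame
    n_k = n_tiv + n_frames * patches_per_frame
    mask = np.zeros((n_q, n_k), dtype=bool)
    mask[:n_tiv, :] = True            # TIV tokens attend to the full clip
    mask[n_tiv:, :n_tiv] = True       # every TV token sees the TIV tokens
    for f in range(n_frames):
        q0 = n_tiv + f * tv_per_frame
        k0 = n_tiv + f * patches_per_frame
        # TV tokens of frame f see only the patches of frame f
        mask[q0:q0 + tv_per_frame, k0:k0 + patches_per_frame] = True
    return mask

mask = sif_mask(n_frames=3, patches_per_frame=4, n_tiv=2, tv_per_frame=1)
```

Because TV tokens never see other frames, anything shared across frames can only be carried by the TIV tokens, which is the intuition behind the factorization.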


[57] Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting cs.CV [PDF]

Hao-Yuan Ma, Yuda Zou, Li Zhang, Yongchao Xu

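TL;DR: This paper addresses the robustness of text-guided open-vocabulary object counting (TOOC) under real-world image degradation. It introduces Robust-TOOC, the first evaluation benchmark for this setting, and designs Dual-TTT, a dual-architecture test-time training framework. At test time the framework updates only a lightweight text-guided denoising module (TL-Denoiser) that removes noise from degraded image features, while the original counting network stays frozen, so existing TOOC models improve under multiple degradation types (such as rain, fog, darkness, and noise) without additional annotations.

Details

Motivation: Existing TOOC methods are developed and evaluated mainly on ideal images, but real scenes are often affected by adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. The robustness of TOOC under degraded conditions therefore needs to improve.

Result: Extensive experiments with several recent TOOC baselines on the proposed Robust-TOOC benchmark, which covers six representative degradation types, show that Dual-TTT effectively improves robustness.

Insight: The novelty lies in building the first multi-degradation robustness benchmark for TOOC and in proposing a test-time training framework centered on a pluggable, lightweight, text-guided denoising module inspired by diffusion models. The design improves robustness without annotations and without modifying the original model architecture.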

Abstract: Text-guided Open-vocabulary Object Counting (TOOC) enables counting arbitrary object categories specified by text prompts, offering substantially greater flexibility than conventional closed-set counting. However, existing TOOC methods are developed and evaluated primarily on ideal images, while real-world scenes often suffer from adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. To bridge this gap, we introduce Robust-TOOC, the first benchmark for evaluating TOOC under diverse corruption conditions, which covers six representative degradation types: rain, fog, darkness, Gaussian noise, salt-and-pepper noise, and mixed corruption. To improve robustness while preserving the original counting architecture, we propose Dual-TTT, a dual-architecture test-time training framework for TOOC. Specifically, during test-time training, Dual-TTT updates only the Text-guided Lightweight Denoising module (TL-Denoiser), while keeping the original counting network frozen. Inspired by diffusion models, the TL-Denoiser is optimized to remove corruption-aware noise from image representations under degraded conditions. Since only the TL-Denoiser is trained at test time, Dual-TTT is annotation-free and can be seamlessly integrated into existing TOOC models without modifying their original architecture. Extensive experiments on multiple recent TOOC baselines demonstrate the effectiveness of our method.


[58] SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation cs.CV | cs.AI [PDF]

Edoardo Bianchi, Antonio Liotta

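TL;DR: SkillMoV is a parameter-efficient framework for unified estimation of human skill proficiency from synchronized multi-view video. Its core is the Mixture-of-View Projector (MoVP), which adaptively aggregates heterogeneous view features through soft routing without camera-identity supervision, cross-view attention, learnable prototype anchoring, and a prototype-conditioned gated projection. Evaluated on six skill domains of the EgoExo4D dataset, the model reaches 50.17% overall accuracy in the Exos setting, surpassing the best existing method.

Details

Motivation: Existing methods usually target specific scenarios or rely on simple multi-view aggregation, and they adapt poorly to heterogeneous camera viewpoints and different activity domains. This paper aims to develop a unified, parameter-efficient framework for multi-scenario skill assessment from multi-view video.

Result: In the Exos setting of EgoExo4D, SkillMoV reaches 50.17% overall accuracy with a single model trained jointly across all scenarios, 3.57 percentage points above the strongest reported result among the compared methods. In the Ego+Exos setting, its accuracy (47.63%) is close to the best reported result for that setting (48.20%). Ablations validate the contribution of each component, including MoV routing and cross-view attention.

Insight: The novelty lies in adapting the mixture-of-experts (MoE) paradigm to view-specific features, proposing a mixture-of-view soft routing mechanism that needs no camera-identity supervision, and combining it with learnable prototype anchoring to condition the representation. Viewed objectively, prototype-conditioned gating and parameter-efficient LoRA adaptation let the framework handle heterogeneous multi-view data in a unified and flexible way, offering a new approach to multi-view video understanding.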

Abstract: Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
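
The first MoVP stage, a soft router over twelve expert MLPs, can be illustrated with a generic soft mixture-of-experts layer. In the sketch below each view feature receives a softmax distribution over experts computed from the feature alone (no camera identity is used), and the expert outputs are blended by those weights. The class name, dimensions, two-layer ReLU experts, and random weights are assumptions for illustration; SkillMoV's actual router may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SoftViewRouter:
    """Generic soft mixture-of-experts over per-view features (illustrative)."""

    def __init__(self, dim, n_experts=12, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.gate = rng.normal(scale=0.02, size=(dim, n_experts))
        self.w1 = rng.normal(scale=0.02, size=(n_experts, dim, hidden))
        self.w2 = rng.normal(scale=0.02, size=(n_experts, hidden, dim))

    def __call__(self, views):
        # views: (V, dim), one feature vector per synchronized camera
        probs = softmax(views @ self.gate)                            # (V, E)
        h = np.maximum(np.einsum("vd,edh->veh", views, self.w1), 0)   # ReLU
        out = np.einsum("veh,ehd->ved", h, self.w2)                   # (V, E, dim)
        mixed = np.einsum("ve,ved->vd", probs, out)                   # blend experts
        return mixed, probs

rng = np.random.default_rng(1)
views = rng.normal(size=(4, 16))      # hypothetical: 4 exo cameras, 16-dim features
router = SoftViewRouter(dim=16)
mixed, probs = router(views)
```

Soft routing keeps every expert differentiable for every view, so view-dependent preferences can emerge from training rather than from explicit camera labels.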


[59] Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition cs.CV | cs.AI [PDF]

Alessandro Sottovia, Alessandro Torcinovich, Oswald Lanz

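TL;DR: This paper proposes Divide, Deliberate, Decide, a zero-shot multi-agent framework for fine-grained action recognition in egocentric video. A vision-language model (VLM) orchestrator segments the video and generates candidate labels, a set of heterogeneous VLM specialists then carries out a structured deliberation, and the results are aggregated with a Borda count. The whole pipeline requires no fine-tuning and runs entirely locally.

Details

Motivation: The paper addresses the difficulty VLMs have with fine-grained action recognition in egocentric video, where actions differ only in subtle visual cues and a single model tends to be biased toward some of those cues.

Result: Experiments show that the method outperforms the baseline in zero-shot action recognition, with the gain stemming mainly from the decorrelated model priors introduced by the heterogeneous deliberation step rather than from additional compute.

Insight: The novelty is a heterogeneous multi-agent deliberation mechanism that reduces model bias through structured negotiation (including a peer-consultation round) and aggregates rankings with a Borda count, improving fine-grained recognition locally and without fine-tuning.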

Abstract: Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully-local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists’ evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method improves zero-shot action recognition performance over the baseline and highlight the influence of the heterogeneous deliberation step: the gain stems from decorrelated model priors rather than from additional compute.
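
The aggregation in step (iii) uses a Borda count, which is simple to sketch: each agent ranks the candidate labels, a label at position `pos` in a list of `k` earns `k - 1 - pos` points, and points are summed across agents. The label strings, the alphabetical tie-break, and the function name below are made up for illustration; the paper's exact scoring variant is not given in the abstract.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate ranked label lists with a Borda count.

    rankings: one list per agent, best label first.
    Returns (labels sorted by total score, score dictionary).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        k = len(ranking)
        for pos, label in enumerate(ranking):
            scores[label] += k - 1 - pos      # top of a k-list earns k-1 points
    # Sort by descending score; break ties alphabetically (an arbitrary choice).
    order = sorted(scores, key=lambda label: (-scores[label], label))
    return order, dict(scores)

# Hypothetical rankings from three specialist agents for one video segment.
rankings = [
    ["cut onion", "peel onion", "wash onion"],
    ["peel onion", "cut onion", "wash onion"],
    ["cut onion", "wash onion", "peel onion"],
]
order, scores = borda_aggregate(rankings)
print(order[0], scores)
```

Here "cut onion" collects 2 + 1 + 2 = 5 points, "peel onion" 3, and "wash onion" 1, so a label that is consistently near the top wins even when agents disagree on first place.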


[60] See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL cs.CV | cs.AI [PDF]

Yilian Liu, Sicong Leng, Guoshun Nan, Junyi Zhu, Jiayu Huang

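TL;DR: This paper proposes Visual Evidence Pre-Alignment (VEPA), an intermediate training stage that addresses the under-use of visual evidence by multimodal large language models (MLLMs) during inference, which leads to answers inconsistent with image content. Between pretraining and post-training, VEPA inserts a reinforcement learning stage with a sufficiency-driven objective, using Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions and thereby strengthen the model's visual grounding.

Details

Motivation: The current MLLM training paradigm (large-scale caption-based pretraining, supervised fine-tuning, and reinforcement learning) provides only weak visual grounding. Coarse captions bias models toward salient objects while neglecting fine-grained visual evidence, so model responses can be inconsistent with the underlying image.

Result: Extensive experiments across multiple benchmarks show that VEPA consistently improves performance on visually demanding evaluations and complements standard supervised post-training. Analysis indicates that the gains come from strengthened, transferable visual grounding rather than from additional task-specific training.

Insight: The core novelty is an intermediate "visual evidence pre-alignment" stage between pretraining and post-training, together with a novel sufficiency-driven objective optimized with the GRPO reinforcement learning algorithm over visual evidence descriptions. This offers a new, transferable training approach for strengthening the visual grounding of MLLMs, rather than merely adding task data.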

Abstract: Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gains stem from strengthened, transferable visual grounding, rather than from additional task-specific training.
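
GRPO scores each sampled response relative to the other responses drawn for the same prompt, which removes the need for a learned value function. The sketch below shows the usual group-normalized advantage, (reward minus group mean) divided by group standard deviation. The binary "sufficiency" rewards are a placeholder assumption: the abstract does not define how VEPA computes its reward, so the numbers only illustrate the normalization step.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group: four evidence descriptions sampled for one (image, question)
# pair, rewarded 1.0 if judged sufficient to answer the question and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
```

Descriptions that beat the group average get positive advantages and are reinforced, while the rest are pushed down; if every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.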


[61] MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block cs.CV | cs.CL [PDF]

Hao-Yuan Ma, Li Zhang, Minjie Qiang, Jie Gao

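TL;DR: MambaCount is an efficient text-guided open-vocabulary object counting framework that estimates object counts from text prompts, a task that is particularly challenging in dense scenes with large scale variations. Built on the Spatial Sparse State Space Duality block, it reconstructs the decay dynamics of Mamba's hidden states to relax the dependency constraints imposed by causal modeling, introduces a Spatial Token Selection sub-block to reduce the high entropy of spatial token responses, and designs Multi-Granularity Prototypes to improve cross-modal alignment and interpretability.

Details

Motivation: Existing text-guided open-vocabulary object counting methods rely mainly on Transformers, whose quadratic complexity limits scalability with image resolution. Mamba-based methods, in turn, are constrained in modeling bidirectional spatial dependencies and suffer from high entropy in spatial token responses, which weakens local details and high-frequency cues.

Result: Extensive experiments on FSC-147 show that MambaCount achieves state-of-the-art performance among methods without secondary querying, with a test MAE of 12.23, while retaining linear complexity.

Insight: The contributions are: reconstructing the decay dynamics of Mamba's hidden states to relax causal constraints; introducing a Spatial Token Selection sub-block to reduce high entropy; and designing Multi-Granularity Prototypes to improve cross-modal alignment. These offer new ideas for bidirectional dependency modeling and detail preservation in linear-complexity models for vision tasks.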

Abstract: Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.


[62] Vision-language models for chest radiography do not always need the image cs.CV | cs.AI | cs.CL | cs.LG

Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier

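TL;DR: This paper argues that the high accuracy reported by medical vision-language models on chest radiograph diagnosis may come from text priors rather than from genuine use of the image. The authors propose a causal audit that intervenes on the image (occluding the relevant region, occluding an irrelevant region, and swapping in another same-label image) and combines three behavioral metrics to test whether a model actually depends on the image. A text-only model comes close to the best multimodal model in accuracy, and some multimodal models are statistically indistinguishable from a text-only baseline. The audit sorts models into three groups: those that ignore the image, those that are unstable, and those that use the image selectively for a subset of findings.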

Details

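Motivation: The high accuracy that medical vision-language models report on chest radiograph diagnosis is often read as evidence that they use the image, but that inference is risky: a model can rely on text priors such as finding names without reading the image. No standard benchmark separates the two cases, which creates a safety risk for clinical deployment.

Result: Across nine systems, a text-only model with no image access comes within 5.7 accuracy points of the best multimodal model, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion-parameter text-only baseline. The audit finds three models that ignore the image, one that is unstable, and five that use the image selectively for a subset of findings. Compared with radiologists, the text-only model is statistically indistinguishable in accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates.

Insight: The contribution is a causal audit framework that combines systematic image interventions with behavioral metrics to assess objectively whether a vision-language model depends on the image, exposing models that exploit text priors instead of visual understanding. It argues that grounding audits, not accuracy alone, should precede clinical deployment, and offers a new method for assessing model trustworthiness.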

Abstract: Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient’s same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist’s accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.
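
The three interventions described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `Image` type, the `occlude` helper, the case fields, and the three rates are hypothetical stand-ins and are not the paper's metrics or code.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]        # toy grayscale image (hypothetical stand-in for a scan)
Box = Tuple[int, int, int, int]  # (row0, row1, col0, col1), end-exclusive

def occlude(image: Image, box: Box) -> Image:
    """Return a copy of the image with the given box set to zero."""
    r0, r1, c0, c1 = box
    return [[0.0 if (r0 <= r < r1 and c0 <= c < c1) else v
             for c, v in enumerate(row)]
            for r, row in enumerate(image)]

def audit(model: Callable[[Image], bool], cases: List[dict]) -> Dict[str, float]:
    """Score image dependence on the cases the model answers correctly."""
    correct = [c for c in cases if model(c["image"]) == c["label"]]
    n = len(correct)
    if n == 0:
        return {"flip_relevant": 0.0, "flip_irrelevant": 0.0, "stable_swap": 0.0}
    # Answer should change when the relevant region is hidden.
    flip_rel = sum(model(occlude(c["image"], c["relevant_box"])) != c["label"]
                   for c in correct)
    # Answer should not change when an irrelevant region is hidden.
    flip_irr = sum(model(occlude(c["image"], c["irrelevant_box"])) != c["label"]
                   for c in correct)
    # Answer should survive a swap to another same-label image.
    stable = sum(model(c["swap_image"]) == c["label"] for c in correct)
    return {"flip_relevant": flip_rel / n,
            "flip_irrelevant": flip_irr / n,
            "stable_swap": stable / n}

def toy_image(box: Box) -> Image:
    """4x4 image that is 1.0 inside the box and 0.0 elsewhere."""
    r0, r1, c0, c1 = box
    return [[1.0 if (r0 <= r < r1 and c0 <= c < c1) else 0.0 for c in range(4)]
            for r in range(4)]

CASES = [{"image": toy_image((0, 2, 0, 2)), "label": True,
          "relevant_box": (0, 2, 0, 2), "irrelevant_box": (2, 4, 2, 4),
          "swap_image": toy_image((0, 2, 2, 4))}]

def grounded_model(image: Image) -> bool:
    """Toy model that reads pixels."""
    return any(v > 0.5 for row in image for v in row)

def text_only_model(image: Image) -> bool:
    """Toy model that ignores the image and answers from a prior."""
    return True

grounded_report = audit(grounded_model, CASES)
text_only_report = audit(text_only_model, CASES)
```

Both toy models are equally accurate on `CASES`, but only the pixel-reading one changes its answer when the relevant region is occluded, which is the kind of separation an accuracy-only benchmark cannot show.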


[63] ActWorld: From Explorable to Interactive World Model via Action-Aware Memory cs.CV

Zhexiao Xiong, Yizhi Song, Hao Kang, Qing Yan, Liming Jiang

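TL;DR: This paper presents ActWorld, an interactive world model that uses action-aware memory to extend conventional navigation-centric world models with real-time object interaction. It addresses the fact that existing interactive world models are largely limited to navigation actions and lack object interaction: by building a large-scale interaction video dataset and introducing a hierarchical action-aware memory mechanism, it supports both flexible navigation and rich object interaction in a single model.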

Details

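Motivation: The action vocabulary of existing interactive world models is largely confined to navigation (e.g., walking, turning) and lacks interaction with objects in the scene (e.g., picking things up, opening doors), so the generated worlds can be explored visually but not truly acted upon. The paper aims to close this gap between navigation and interaction and build a genuinely actionable interactive world model.

Result: Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control.

Insight: The paper identifies and addresses two bottlenecks behind the navigation-interaction gap: a data bottleneck (a lack of human-object interaction data with dense labels) and a memory bottleneck (recency-biased history compression in existing models discards the event-transition frames that determine subsequent object states). The solution builds a large-scale interaction video dataset annotated via chain-of-thought reasoning and introduces a hierarchical action-aware memory that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts.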

Abstract: Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.


[64] BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics cs.CV | q-bio.NC

Junfeng Xia, Wenhao Ye, Junxiang Zhang, Xuanye Pan, Mo Wang

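TL;DR: This paper presents BrainWorld, a structural-prior-conditioned generative model for whole-brain 4D fMRI dynamics. It uses structural MRI (sMRI) as subject-level anatomical context to guide future fMRI generation and integrates the structural information into the denoising process rather than treating it as a parallel modality. Evaluated on 22 datasets spanning diverse cohorts and brain states, BrainWorld generates stable 4D fMRI trajectories of up to 400 frames, improves downstream performance through generated-example augmentation, and learns transferable multimodal representations that outperform baselines.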

Details

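Motivation: Existing fMRI foundation models focus mainly on representation learning and downstream prediction rather than conditional predictive generation, so a model capable of conditional predictive generation is needed to model functional brain dynamics.

Result: On 22 datasets, BrainWorld generates stable 4D fMRI trajectories of up to 400 frames, improves downstream task performance through generated-example augmentation, and learns multimodal representations that outperform baselines.

Insight: The contribution is integrating sMRI into the denoising process as an anatomical prior rather than processing it in parallel, which yields a condition-aware generative framework for long-horizon brain dynamics modeling and multimodal representation learning.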

Abstract: Whole-brain 4D fMRI generation is valuable for modeling functional brain dynamics, yet existing fMRI foundation models mainly target representation learning and downstream prediction rather than conditional predictive generation. We introduce BrainWorld, a structural-prior-conditioned generative model for whole-brain 4D fMRI dynamics. BrainWorld uses sMRI as subject-level anatomical context to guide future fMRI generation, integrating structural information into the denoising process rather than treating it as a parallel modality. Evaluated on 22 datasets spanning diverse cohorts and brain states, BrainWorld generates stable 4D fMRI trajectories up to 400 frames, improves downstream performance through generated-example augmentation, and learns transferable multimodal representations that outperform baselines. Together, these results establish BrainWorld as a condition-aware generative framework for long-horizon brain dynamics modeling and multimodal representation learning.


[65] MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model cs.CV

Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong

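TL;DR: This paper presents MaineCoon, a real-time audio-visual autoregressive model for social worlds. With 22B parameters, it achieves streaming generation at up to 47.5 FPS and sub-second interaction on a single GPU, aims to simulate human-centric social dynamics, and points toward next-generation AI-native social platforms.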

Details

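Motivation: Existing world models mainly simulate physical environments or game worlds and do not model human-centric social dynamics. As video content on social platforms grows rapidly, video generation models built for social worlds are important, yet previous studies have largely overlooked them.

Result: The authors report that MaineCoon sets a new state-of-the-art benchmark for high-quality, low-latency, long-horizon audio-visual autoregressive models, with streaming generation at up to 47.5 FPS and support for thousand-second-scale or even longer generation.

Insight: The contributions include: defining the position of social world models and building a first prototype; introducing training techniques including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD); and designing what the authors describe as the first agentic streaming inference framework that supports long-horizon generation while mitigating drift, using agentic cache management and prompt planning.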

Abstract: As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.


[66] LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams cs.CV | cs.AI

Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu

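TL;DR: LiveStarPro is a proactive streaming video understanding assistant for long-horizon video streams. It uses Streaming Verification Decoding (SVeD) to determine response timing in a single inference pass, the Streaming Causal Attention Masks (SCAM) training strategy to strengthen video-language alignment, and the Tree-Structured Hierarchical Memory (TSHM) architecture to organize historical information, addressing the difficulties that existing Video-LLMs have with continuous stream processing, autonomous response decisions, and long-horizon contextual memory.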

Details

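Motivation: Video-LLMs with current online architectures struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory, which undermines real-time responsiveness and causes severe forgetting in prolonged interactions.

Result: On OmniStarPro, a large-scale benchmark spanning 15 diverse real-world scenarios and extending to hour-scale streams, LiveStarPro improves semantic correctness by 28.9% and reduces timing error by 18.2%, and its streaming key-value cache yields a further 1.58x inference speedup.

Insight: The contributions are: SVeD, which removes the dependency on explicit silence tokens through single-pass perplexity verification; SCAM, a training strategy that enforces incremental video-language alignment over variable-length streams; and TSHM, a recursive memory architecture that organizes evicted historical information into event chains and enables efficient retrieval from effectively unbounded video streams.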

Abstract: Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.


[67] Million-scale multimodal pollen microscopy with expert-guided foundation models cs.CV

András Biricz, Björn Gedda, Donát Magyar, Antonio Spanu, János Fillinger

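TL;DR: This paper presents Pollen AI Atlas, a million-scale multimodal pollen microscopy dataset that targets a bottleneck in automated pollen identification: generalizing across specimen preparation, scanner settings, and geographic origins while retaining palynological interpretability. The dataset contains more than 1.51 million pollen grain detections paired with morphological captions generated by expert-guided vision-language models, and provides a benchmark for pollen recognition, cross-regional domain adaptation, and domain-specific multimodal microscopy learning.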

Details

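Motivation: Automated pollen identification from microscopy remains a bottleneck in aerobiology, palaeoecology, and biodiversity monitoring, because existing systems struggle to generalize across specimen preparation, scanner settings, and geographic origins while retaining palynological interpretability.

Result: Grain detection reaches 99.6% proposal precision in expert-curated test regions; baselines with frozen visual features reach 88.16% top-1 accuracy; and cross-regional retrieval shows that caption-derived text embeddings remain robust when image similarity degrades (mAP@20 of 0.811 versus 0.262).

Insight: The contribution is a large-scale, multimodal, expert-guided pollen microscopy resource that pairs visual detections with captions generated by vision-language models under the guidance of expert-verified palynological anchors, yielding structured descriptions of aperture systems, wall ornamentation, shape, and size, and offering a template for domain-specific multimodal learning.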

Abstract: Automated pollen identification from microscopy remains a bottleneck in aerobiology, palaeoecology and biodiversity monitoring, because scalable systems must generalise across specimen preparation, scanner settings and geographic origins while retaining palynological interpretability. To address this gap, we present a million-scale multimodal pollen microscopy resource, Pollen AI Atlas, assembled from pure-species whole-slide bright-field images spanning four geographic origins, four scanner settings and 46 taxon labels across 31 botanical families. Seeded by one manually selected exemplar per source slide, token-level mining and filtering produced 1,511,390 released grain detections with 99.6% proposal precision in expert-curated test regions. Each detection was paired with machine-generated grain-level morphological captions from five open-weight vision-language models, guided by expert-verified palynological anchors, yielding structured descriptions of aperture systems, wall ornamentation, shape and size. Among the evaluated models, Gemma4 provided the most controlled primary caption set, combining tight length control, no leakage and the strongest text-retrieval performance. Baseline benchmarks with frozen visual features reached 88.16% top-1 accuracy, while cross-regional retrieval showed that caption-derived text embeddings remained robust when image similarity degraded (mAP@20 0.811 versus 0.262). Released data, annotations, captions, splits, code, and weights provide a benchmark for pollen recognition, cross-regional domain adaptation and domain-specific multimodal microscopy learning.
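
For readers unfamiliar with the retrieval metric quoted above, here is a minimal sketch of mAP@k. It assumes one common convention (average the precision at each relevant rank within the top k, normalized by the number of relevant items found there); the paper's exact definition is not given in the abstract and may differ.

```python
from typing import Hashable, List, Sequence, Tuple

def average_precision_at_k(ranked_labels: Sequence[Hashable],
                           query_label: Hashable, k: int = 20) -> float:
    """AP@k for one query, given the labels of its retrieved items in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_ap_at_k(queries: List[Tuple[Hashable, Sequence[Hashable]]],
                 k: int = 20) -> float:
    """Mean of AP@k over (query_label, ranked_labels) pairs."""
    return sum(average_precision_at_k(ranked, label, k)
               for label, ranked in queries) / len(queries)

# Hypothetical taxon labels, used only to exercise the functions.
example = [("Betula", ["Betula", "Alnus", "Betula"]),
           ("Alnus", ["Betula", "Betula", "Betula"])]
example_map = mean_ap_at_k(example, k=3)
```

In the first query the relevant items sit at ranks 1 and 3, giving precisions 1/1 and 2/3; the second query retrieves nothing relevant and contributes zero.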


[68] High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach cs.CV | cs.AI | cs.CG | cs.GR

Hui Wang, Xiaowei Li, Chenxin Zhang, Yifan Feng, Jianwei Zuo

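TL;DR: This study proposes a hybrid deformable shape modeling framework for reconstructing high-fidelity 3D geometric models of pelvic organs (bladder, uterus, rectum) from MRI. It combines deep learning prediction with iterative optimization, balancing global shape capture and local surface refinement through a geometry-aware multi-level deep learning architecture, a two-stage amortized optimization training strategy, and a synergy mechanism spanning training and inference.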

Details

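Motivation: Patient-specific 3D reconstruction of pelvic organs from MRI is important for pelvic floor modeling and downstream analysis, but reconstructing high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. This study targets that gap.

Result: The framework clearly outperforms current mainstream deep learning-based organ reconstruction models in geometric fidelity, achieving lower Chamfer Distance and higher Dice Similarity Coefficient for the bladder, rectum, and uterus, while maintaining high computational efficiency and producing better overall volumetric mesh quality. At the patient level, its mean minSICN and minSIGE values over the 10 worst elements are also higher than those of traditional geometric post-processing algorithms.

Insight: The contribution is a hybrid design that tightly couples deep learning with iterative optimization: during training, iterative optimization supervises the network; during inference, the network quickly predicts the global morphology and iterative optimization then refines local surfaces. A two-stage amortized optimization training strategy balances global shape and local detail.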

Abstract: Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. This study introduces a hybrid deformable shape modeling framework that integrates deep learning prediction with iterative optimization for the reconstruction of the bladder, uterus, and rectum. The framework consists of three core components: a geometry-aware multi-level deep learning architecture that preserves topological consistency of pelvic organs; a two-stage amortized optimization training strategy that balances global shape capture and local surface refinement; and a holistic synergy mechanism in which iterative optimization provides supervision for deep learning during the training phase, and during inference, deep learning rapidly predicts the global organ morphology, followed by iterative optimization to refine local surfaces and mesh quality. This framework demonstrated marked superiority in geometric fidelity over current mainstream deep learning-based organ reconstruction models. For individual anatomical structures, the reconstructed 3D geometries for the bladder, rectum, and uterus achieved significantly lower Chamfer Distance values and higher Dice Similarity Coefficient scores. In addition, while maintaining high computational efficiency, the proposed architecture yielded superior overall volumetric mesh quality. At the patient level, the framework achieved higher mean values for the 10 worst elements for both minSICN and minSIGE compared to traditional geometric post-processing algorithms.
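
The two fidelity metrics named above can be sketched in a few lines. This assumes one common convention for each (Chamfer Distance as the sum of the two directed mean nearest-neighbor Euclidean distances, and Dice computed on sets of occupied voxel coordinates); the paper may use squared distances or a different normalization.

```python
import math
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float, float]

def chamfer_distance(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance between two non-empty point sets (brute force)."""
    def one_way(a: Sequence[Point], b: Sequence[Point]) -> float:
        # Mean distance from each point in a to its nearest neighbor in b.
        return sum(min(math.dist(x, y) for y in b) for x in a) / len(a)
    return one_way(p, q) + one_way(q, p)

def dice_coefficient(a: Iterable[Tuple[int, int, int]],
                     b: Iterable[Tuple[int, int, int]]) -> float:
    """Dice Similarity Coefficient between two sets of occupied voxels."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty masks are treated as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

cd = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
dsc = dice_coefficient([(0, 0, 0), (1, 0, 0)], [(1, 0, 0), (2, 0, 0)])
```

Lower Chamfer Distance means the predicted surface lies closer to the reference surface, while higher Dice means greater volumetric overlap, which is why the abstract reports improvements in opposite directions for the two.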


[69] A Quantitative Analysis of Multimodal Biomarkers in Alzheimer’s Disease cs.CV | cs.AI

Antonio Scardace, Daniele Ravì

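TL;DR: This study presents a quantitative analysis of multimodal biomarkers in Alzheimer's disease (AD), integrating tau-PET, structural MRI, cognitive scores, and APOE4 data from the ADNI dataset. It systematically characterizes relationships among modalities by quantifying cross-modal mutual information and explained variance, analyzing associations between tau deposition and structural atrophy, decomposing the tau-cognition association, and identifying a neurodegenerative trajectory that aligns with cognitive decline.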

Details

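Motivation: Although multimodal approaches that integrate molecular, structural, clinical, and genetic biomarkers are increasingly common in AD research, the relationships among these modalities remain poorly understood. A systematic analysis of their interactions is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs.

Result: Using data from 789 ADNI subjects, the analysis quantifies cross-modal redundancy and predictive dependencies, identifies informative brain ROIs, and decomposes the tau-cognition association into its components. It provides a systematic characterization of biomarker relationships that improves the interpretability and selection of AD biomarkers.

Insight: The contribution is a systematic quantitative analysis of interactions among multimodal AD biomarkers, in particular quantifying redundancy through mutual information and explained variance and decomposing the tau-cognition association into atrophy-related and atrophy-independent components, which offers new insight into disease mechanisms and into streamlining clinical assessment.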

Abstract: Despite increasing adoption of multimodal approaches in Alzheimer’s Disease (AD) research – aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization – the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.


[70] Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model cs.CV | cs.AI

Jinghan Wu, Jing Li, Ivor W. Tsang, Xuetao Zhang

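TL;DR: This paper proposes a plug-and-adapt method for multimodal coreference resolution that adapts a pre-trained alignment model directly to the target task, with no training on the target dataset. It pre-trains a fine-grained alignment model on vision-language alignment datasets and performs similarity aggregation by fusing visual and categorical cues with evidence theory. On the CIN benchmark, it improves CoNLL F1 by 5.31% over SOTA dedicated methods and by 2.12% over popular VLLMs.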

Details

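Motivation: Existing multimodal coreference resolution methods must be trained on annotated data from the target dataset, which limits direct usability and generalization. Large vision-language models offer zero-shot capability but have huge parameter counts, are hard to deploy, and are often accessible only through paid APIs.

Result: On the Coreference Image Narratives benchmark, CoNLL F1 improves by 5.31% over SOTA dedicated methods and by 2.12% over popular VLLMs. Tests on a masked CIN dataset and a specially constructed VCR-MCR dataset confirm robustness and generalization.

Insight: The contribution is repurposing a pre-trained fine-grained alignment model for multimodal coreference resolution through similarity aggregation and evidence-theory fusion, which enables plug-and-adapt zero-shot use and avoids both the dependence on annotated data and the computational burden of large models.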

Abstract: Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct usability and raising concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, they remain largely inaccessible. Their massive size limits deployability, and many are only accessible through paid APIs. In this paper, we propose a plug-and-adapt method that strategically adapts a carefully pre-trained alignment model for immediate use in MCR tasks, designed to eliminate the need for training on scarce benchmark datasets or relying on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model for MCR through similarity aggregation by fusing visual and categorical cues with evidence theory, thereby enhancing effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset demonstrate the effectiveness of our method, achieving a 5.31% and 2.12% improvement in CoNLL F1 over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset for robustness testing and on a specially constructed VCR-MCR dataset for generalization assessment, with results confirming both capabilities.


[71] MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias cs.CV

Xingming Li, Ao Cheng, Qiyao Sun, Xixiang He, Xuanyu Ji

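TL;DR: This paper finds that multimodal large language models (MLLMs) exhibit a "late-layer textual override" bias when vision and text conflict: the model forms a correct vision-based prediction in intermediate layers but favors text in the final output. The authors propose CALRD, a training-free inference-time intervention that corrects the bias by detecting and restoring suppressed visual predictions.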

Details

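Motivation: MLLMs over-rely on text and ignore visual evidence when the two conflict. This bias poses risks for applications that require visual grounding, yet its cause has been unclear.

Result: Experiments on five MLLMs of varying architectures show that CALRD yields up to 9.4% absolute improvement on conflict benchmarks while largely preserving standard performance, without training or external knowledge.

Insight: The core contribution is exposing the late-layer textual override phenomenon and its directional signature (failed predictions shift toward text, successful ones toward vision), and using it to design a simple, training-free decoding intervention that recovers visual information the model had encoded but failed to preserve.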

Abstract: When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this “late-layer textual override”. The visual information is encoded; it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they’re correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. It recovers what the model already knew but failed to preserve.
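
The "detect a confident visual prediction being suppressed, then restore it" idea can be illustrated with a toy sketch. This is not CALRD itself: the per-layer answer distributions, the assumption that the text-suggested answer is known, the function name, and the confidence threshold are all hypothetical simplifications introduced here.

```python
from typing import Dict, List

def override_aware_decode(layer_probs: List[Dict[str, float]],
                          text_answer: str,
                          threshold: float = 0.8) -> str:
    """Toy decoding rule over per-layer answer distributions.

    layer_probs[i] maps each candidate answer to its probability when layer i
    is decoded on its own; the last entry is the model's normal output.
    """
    final_probs = layer_probs[-1]
    final_answer = max(final_probs, key=final_probs.get)
    if final_answer != text_answer:
        return final_answer  # output did not follow the text, nothing to restore
    # Look for the most confident intermediate prediction that disagrees with text.
    best_answer, best_conf = final_answer, 0.0
    for probs in layer_probs[:-1]:
        answer = max(probs, key=probs.get)
        if answer != text_answer and probs[answer] > best_conf:
            best_answer, best_conf = answer, probs[answer]
    # Restore it only if it was confident enough (threshold is an assumption).
    return best_answer if best_conf >= threshold else final_answer

# Image shows a red object; the prompt text claims it is blue.
layers = [{"red": 0.6, "blue": 0.4},
          {"red": 0.9, "blue": 0.1},   # confident visual prediction mid-network
          {"red": 0.3, "blue": 0.7}]   # final layer flips to the text answer
restored = override_aware_decode(layers, text_answer="blue")
```

With these toy numbers the middle layer's confident "red" is restored; raising the threshold above 0.9 would leave the final "blue" in place, showing that the intervention only fires on confident intermediate predictions.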


[72] Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation cs.CV | cs.LG

Yuming Chen, Yuxin Xie, Tao Zhou, Yi Zhou

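TL;DR: This paper proposes CERS (CoT-Enhanced Reasoning Segmentation), a semi-supervised medical image segmentation framework that addresses existing methods' over-reliance on visual pattern matching and their difficulty with clinical cases of visual-semantic mismatch. It integrates Chain-of-Thought (CoT) reasoning by building a knowledge pool from linguistic reasoning descriptions generated by large language models, and introduces a semantic-aware reference selection strategy and a multi-scale coordinate attention module to fuse the reasoning context, better capturing expert diagnostic logic.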

Details

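Motivation: Existing semi-supervised medical image segmentation methods operate mainly through visual pattern matching and rely heavily on pixel-level similarity. They tend to fail in clinical cases of visual-semantic mismatch, where visually similar lesions warrant different diagnostic conclusions, and so do not capture the underlying diagnostic logic that experts use.

Result: Extensive experiments show that CERS outperforms state-of-the-art methods on multiple benchmark datasets, particularly in resolving boundary ambiguities and semantic inconsistencies.

Insight: The core contribution is bringing Chain-of-Thought (CoT) reasoning into semi-supervised medical image segmentation, going beyond purely visual cues by modeling diagnostic logic through linguistic reasoning descriptions and semantic-aware reference selection. The multi-scale coordinate attention module (MCAM) fuses the reasoning context into decoding, offering a new way to handle visual-semantic mismatch.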

Abstract: Semi-supervised medical image segmentation has emerged as a dominant research problem in medical image analysis, mitigating annotation scarcity by leveraging consistency regularization on unlabeled data. However, existing approaches operate predominantly via visual pattern matching, relying heavily on pixel-level similarities. This visual-centric dependency often falters in clinical scenarios characterized by the visual-semantic mismatch, where visually similar lesions warrant distinct diagnostic conclusions, thus failing to capture the underlying diagnostic logic used by experts. To address this, we move beyond visual cues and propose CERS (CoT-Enhanced Reasoning Segmentation), a framework that integrates Chain-of-Thought (CoT) reasoning to distinguish pathologically distinct cases. Specifically, we construct a knowledge pool enriched with linguistic reasoning descriptions generated by large language models (LLMs). A semantic-aware reference selection strategy is introduced to identify historical evidence, filtering candidates first by morphology, and then refining them via CoT consistency to eliminate hard negatives. Furthermore, a multi-scale coordinate attention module (MCAM) is designed to effectively fuse this reasoning-derived context into the decoding process. Extensive experiments demonstrate the superiority of CERS against state-of-the-art approaches, particularly in resolving boundary ambiguities and semantic inconsistencies. The code is available at https://github.com/cymasuna/CERS.


[73] Gaussian Light Field Splatting: A Physical Prior-Driven Vision Transformer for Unsupervised Low-Light Image Enhancement cs.CV

Yuhan Chen, Wenxuan Yu, Guofa Li, Fuchen Li, Kunyang Huang

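TL;DR: This paper proposes GLFS, an unsupervised low-light image enhancement method that integrates the continuous physical illumination modeling of Gaussian light field splatting into a Vision Transformer, addressing the local exposure imbalance and color distortion that existing methods show under complex non-uniform illumination.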

Details

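Motivation: Existing unsupervised low-light enhancement methods often show local exposure imbalance and color distortion under complex non-uniform illumination, and most Vision Transformers lack an explicit mechanism for modeling the physical priors of illumination degradation.

Result: Extensive ablation studies and quantitative evaluations show that GLFS has clear advantages in illumination correction and detail preservation and achieves state-of-the-art performance on low-light image enhancement.

Insight: The contribution is bringing the continuous physical illumination representation of Gaussian splatting (a superposition of anisotropic Gaussian basis functions) into the Transformer, adaptively inferring a spatial gain field through physics-guided attention biases, and designing a color-vector angular loss and a luminance-edge loss to reduce color bias and structural degradation.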

Abstract: Existing unsupervised low-light image enhancement methods often encounter local exposure imbalance and color distortion under complex non-uniform illumination. In addition, most Vision Transformers lack an explicit mechanism for modeling the physical priors of illumination degradation. To address these limitations, we propose GLFS, a Gaussian light field splatting-based Vision Transformer that integrates continuous physical illumination modeling from Gaussian splatting into the Transformer architecture. In GLFS, scene illumination is represented by a superposition of anisotropic Gaussian basis functions. Physics-guided biases are introduced into self-attention to adaptively infer a spatial gain field, enabling accurate and uniform restoration under complex illumination. To reduce color bias and structural degradation during enhancement, a color-vector angular loss and a luminance-edge loss are further developed. These losses enforce hue consistency and improve the structural fidelity of local details. Extensive ablation studies and quantitative evaluations show that GLFS provides clear advantages in illumination correction and detail preservation. It achieves state-of-the-art performance and offers a new representation paradigm for low-light image enhancement.
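
Two ingredients named above, an illumination field built from anisotropic Gaussians and an angular color loss, can be sketched as follows. This is a generic illustration under stated assumptions: the unnormalized Gaussian form, the `(weight, mean, covariance)` parameterization, and the use of the angle between RGB vectors are choices made here, not details taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Cov = Tuple[Tuple[float, float], Tuple[float, float]]  # 2x2 covariance matrix

def gaussian_2d(x: float, y: float, mu: Tuple[float, float], cov: Cov) -> float:
    """Unnormalized anisotropic 2D Gaussian exp(-0.5 * d^T cov^-1 d)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x - mu[0], y - mu[1]
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q)

def illumination(x: float, y: float,
                 components: List[Tuple[float, Tuple[float, float], Cov]]) -> float:
    """Illumination at (x, y) as a weighted superposition of Gaussians."""
    return sum(w * gaussian_2d(x, y, mu, cov) for w, mu, cov in components)

def color_angle(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between two non-zero RGB vectors; 0 means same hue direction."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# One light source that spreads four times more (in variance) along y than along x.
field = [(2.0, (0.0, 0.0), ((1.0, 0.0), (0.0, 4.0)))]
at_center = illumination(0.0, 0.0, field)
along_x = illumination(1.0, 0.0, field)
along_y = illumination(0.0, 1.0, field)
```

Because the covariance is larger along y, the field decays more slowly in that direction (`along_y > along_x`), which is what "anisotropic" buys over an isotropic blob. The angular loss ignores brightness: scaling an RGB vector leaves the angle at zero.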


[74] PhaseWin: An Efficient Search Algorithm for Faithful Visual Attribution cs.CV

Zihan Gu, Ruoyu Chen, Junchi Zhang, Li Liu, Xiaochun Cao

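TL;DR: This paper proposes PhaseWin, an efficient search algorithm for visual attribution that explains the decisions of vision and vision-language models. It reorganizes greedy region selection into a phased window-search procedure that uses global candidate screening, adaptive pruning, and localized window refinement, substantially reducing computational cost while preserving the region-ranking behavior of greedy search.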

Details

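Motivation: Visual attribution is an important tool for interpreting the decisions of modern vision and vision-language models, but existing search methods are inefficient: exhaustive search has exponential cost, and the widely used greedy search still requires a quadratic number of model evaluations because every selection step rescores all remaining candidate regions.

Result: Extensive experiments on image classification, object detection, visual grounding, and image captioning show that, among all compared attribution methods, PhaseWin reaches high faithfulness with the fewest forward passes, empirically realizing the reduction from O(n^2) to O(n).

Insight: PhaseWin reorganizes greedy search into a phased window search that alternates global screening, adaptive pruning, and local refinement. Under monotone evidence-accumulation conditions and feature-level structural assumptions, it attains controllable linear evaluation complexity together with near-greedy faithfulness guarantees, offering a new route to efficient model explanation.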

Abstract: Visual attribution is a fundamental tool for interpreting modern vision and vision-language models, particularly when their decisions must be inspected, diagnosed, or audited. Its goal is to explain how a model’s decision depends on local regions of the visual input, typically by assigning an importance ordering over candidate image regions. Given an image partitioned into $n$ regions, faithful attribution can be cast as an ordered subset-search problem, in which progressively inserting the selected regions should recover the target model response as early as possible. Exhaustive search over region subsets incurs exponential cost, while the widely used greedy search still requires a quadratic number of model evaluations, because every selection step rescores all remaining candidates. We propose PhaseWin, an efficient subset-search algorithm for faithful visual attribution. PhaseWin reorganizes greedy region selection into a phased window-search procedure: rather than re-evaluating the full candidate set at every step, it alternates between global candidate screening, adaptive pruning, and localized window refinement, while preserving the essential region-ranking behavior of greedy search. We analyze PhaseWin under monotone evidence-accumulation conditions and show that, under feature-level structural assumptions, it attains controllable linear evaluation complexity together with near-greedy faithfulness guarantees. Extensive experiments on image classification, object detection, visual grounding, and image captioning show that, among all compared attribution methods, PhaseWin reaches high faithfulness with the fewest forward passes, empirically realizing the predicted reduction from $O(n^2)$ to $O(n)$. The code is available at https://github.com/Qihuai27/phasewin-va.
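
To make the quadratic cost of the greedy baseline concrete, the sketch below implements plain greedy insertion search and counts model evaluations. It shows the baseline the abstract describes, not PhaseWin; the `score` callable and the toy additive scorer are hypothetical stand-ins for a forward pass on an image with only the selected regions visible.

```python
from typing import Callable, Dict, List, Tuple

def greedy_attribution(regions: List[str],
                       score: Callable[[List[str]], float]) -> Tuple[List[str], int]:
    """Order regions by greedy insertion and count calls to score()."""
    selected: List[str] = []
    remaining = list(regions)
    evaluations = 0
    while remaining:
        best_region, best_score = remaining[0], float("-inf")
        # Every step rescores all remaining candidates: this is the quadratic part.
        for region in remaining:
            s = score(selected + [region])
            evaluations += 1
            if s > best_score:
                best_region, best_score = region, s
        selected.append(best_region)
        remaining.remove(best_region)
    return selected, evaluations

# Toy scorer: each region contributes a fixed amount of evidence.
WEIGHTS: Dict[str, float] = {"r0": 0.1, "r1": 0.5, "r2": 0.05, "r3": 0.3, "r4": 0.2}

def toy_score(visible: List[str]) -> float:
    return sum(WEIGHTS[r] for r in visible)

ordering, n_evals = greedy_attribution(list(WEIGHTS), toy_score)
```

For n regions the loop makes n + (n - 1) + ... + 1 = n(n + 1)/2 evaluations, so 5 regions cost 15 forward passes and 100 regions would cost 5,050. That growth is what a linear-evaluation search aims to avoid.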


[75] When LLMs Analyze Scars: From Images to Clinically-Meaningful Features cs.CV | cs.AI | cs.LG

Ruman Wang, Hangting Ye

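TL;DR: This paper proposes ScaFE (Scar Feature Engineering), a paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. It uses the medical knowledge encoded in LLMs by prompting them to generate executable Python code that extracts low-dimensional, interpretable features aligned with clinical scoring systems (such as the Vancouver Scar Scale) from scar images, addressing data scarcity in pathological scar classification.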

Details

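Motivation: Medical image classification faces severe data scarcity, especially in pathological scar classification (e.g., keloids versus hypertrophic scars): annotation costs, privacy constraints, and disease rarity leave very few labeled images, while distinguishing these scars requires subtle expert knowledge.

Result: In extensive scar classification experiments under limited data, the method consistently outperforms both end-to-end deep learning baselines and the use of LLMs as black-box classifiers, showing the potential of integrating LLMs into data-efficient and clinically transparent medical AI systems.

Insight: The contribution is using LLMs as knowledge-driven feature engineering tools that generate deterministic code to extract clinically interpretable features. This brings data efficiency, privacy preservation (raw images are processed locally), and interpretability, and points to a way of combining prior knowledge with statistical learning in medical AI.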

Abstract: Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that our method consistently outperforms both end-to-end deep learning baselines and the use of LLMs as black-box classifiers under limited data conditions, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.


[76] Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology cs.CV

Tianyu Liu, Ziqing Wang, Zhaokang Liang, Tong Ding, Peter Humphrey

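TL;DR: This paper presents MixTIME, a multimodal pathology foundation model based on a mixture-of-experts (MoE) architecture that predicts multiplex immunofluorescence protein expression related to the tumor immune microenvironment from H&E whole-slide images. It integrates pretrained models from three modalities (image, image-text, and image-transcriptomics), weights expert contributions dynamically with a learnable router, and achieves SOTA performance in predicting 17 protein markers on two datasets of different scales.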

Details

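Motivation: Most existing methods for predicting immune biomarkers associated with the tumor immune microenvironment are limited to a single image modality, suffer from insufficient resolution, and do not fully use complementary clinical and biological information, which holds back precision oncology.

Result: Benchmarked on two datasets of different scales, MixTIME achieves state-of-the-art performance across 17 protein markers as measured by correlation metrics.

Insight: The contribution is an MoE architecture that integrates pathology foundation models from several modalities, with a learnable router for dynamic weighting and a distribution- and tendency-aware loss function. This provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.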

Abstract: Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and biological information. Here we introduce MixTIME, a multimodal foundation model that leverages a mixture-of-experts (MoE) architecture to integrate pathology foundation models trained across distinct modalities: image only (UNIv2), image text (CONCHv1.5), and image transcriptomic (STPath) representations for pixel-level and slide-level prediction of multiplex immunofluorescence (mIF) protein expression from hematoxylin and eosin (HE) whole-slide images. MixTIME employs a learnable router to dynamically weight expert contributions and is trained with a distribution- and tendency-aware loss function. Benchmarked on two datasets of different scales, MixTIME achieves state-of-the-art performance across 17 protein markers as measured by correlation metrics. The predicted mIF profiles substantially enhance downstream tasks, including spatial domain identification, survival prediction, and AI-assisted pathology report generation validated by expert pathologists from multiple institutes across the world. Furthermore, MixTIME enables longitudinal tracking of protein expression dynamics across clinical time points and reveals protein gene interaction patterns linked to drug resistance and immune suppression in tumor microenvironments. Collectively, MixTIME provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.
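The learnable router can be pictured as a softmax gate over the three expert models. The following is a minimal sketch under that assumption; the function names, the toy feature vectors, and the fixed gate logits are hypothetical, and the abstract does not specify the actual router design.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_outputs, gate_logits):
    """Fuse expert feature vectors with softmax gate weights.

    expert_outputs: equal-length feature vectors, one per expert
    gate_logits: one router score per expert (assumed to come from a learned gate)
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return weights, fused


# Toy features standing in for the image-only, image-text,
# and image-transcriptomic experts.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = mix_experts(experts, gate_logits=[0.0, 0.0, 0.0])
```

With equal logits every expert gets the same weight; in training, the gate scores would be learned so that the weighting adapts to each input.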


[77] Neural Tree Reconstruction for the Open Forest Observatory cs.CV

Marissa Ramirez de Chanlatte, Arjun Rewari, Trevor Darrell, Derek J. N. Young

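TL;DR: This paper explores how to integrate advanced 3D reconstruction techniques such as Neural Radiance Fields (NeRF) into the forest map database of the Open Forest Observatory (OFO). The goal is to address the artifacts, missing detail, and poor robustness of classical structure-from-motion (SfM) methods on the forest floor, and to provide higher-quality 3D tree maps for ecology, land management, and climate applications.

Details

Motivation: The OFO aims to provide low-cost forest mapping to ecologists, land managers, and the public, but its current SfM-based 3D tree map reconstruction is prone to artifacts, lacks detail, and performs especially poorly on the forest floor, where the input data (overhead imagery) has limited visibility. These errors can affect downstream scientific tasks such as wildfire simulation.

Result: The abstract reports no quantitative results or benchmark comparisons. It states that advanced methods such as NeRF produce higher-quality reconstructions that are more robust to sparse views and support data-driven priors, and it outlines future work to support more advanced 3D vision models.

Insight: The contribution is bringing neural 3D reconstruction techniques such as NeRF into forest mapping to improve reconstruction quality in complex settings such as the forest floor. This offers a more reliable data foundation for ecological and climate applications and shows the potential value of advanced computer vision models in environmental science.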

Abstract: The Open Forest Observatory (OFO) is a collaboration across universities and other partners to make low-cost forest mapping accessible to ecologists, land managers, and the general public. The OFO is building both a database of geospatial forest data and open-source methods and tools for forest mapping by uncrewed aerial vehicle. Such data are useful for a variety of climate applications including prioritizing reforestation efforts, informing wildfire hazard reduction, and monitoring carbon sequestration. In the current iteration of the OFO’s forest map database, 3D tree maps are created using classical structure-from-motion techniques. This approach is prone to artifacts, lacks detail, and has particular difficulty on the forest floor where the input data (overhead imagery) has limited visibility. These reconstruction errors can potentially propagate to the downstream scientific tasks (e.g., a wildfire simulation). Advances in 3D reconstruction, including methods like Neural Radiance Fields (NeRF), produce higher quality results that are more robust to sparse views and support data-driven priors. We explore ways to incorporate NeRFs into the OFO dataset, outline future work to support even more state-of-the-art 3D vision models, and describe the importance of high-quality 3D reconstructions for forestry applications.


[78] EventDrive: Event Cameras for Vision-Language Driving Intelligence cs.CV

Dongyue Lu, Rong Li, Ao Liang, Lingdong Kong, Wei Yin

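TL;DR: This paper presents EventDrive, a large-scale benchmark and model suite that unifies event-camera data, RGB frames, and language supervision across four core dimensions: perception, understanding, prediction, and planning. Building on it, the EventDrive-VLM model introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous events with frame-based information for downstream reasoning.

Details

Motivation: Event cameras offer microsecond latency and high dynamic range, making them a strong complement to RGB data under blur, glare, and rapid motion, where conventional frame sensors become unreliable. However, existing event-based vision-language models are limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop.

Result: In a comprehensive evaluation covering captioning, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning, event streams bring substantial gains in temporal precision, motion awareness, and robustness.

Insight: The novelty lies in building the first large-scale benchmark that unifies event streams, RGB frames, and language across the full driving loop (perception, understanding, prediction, planning), and in proposing a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to fuse asynchronous events with frame data, placing event sensing at the center of driving intelligence.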

Abstract: Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.
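One way to read "multi-horizon event pyramid" is as a set of summaries of the same event stream over several look-back windows. The sketch below illustrates that reading only; the event layout `(timestamp_us, polarity)`, the horizon lengths, and the summary statistics are assumptions, and the actual EventDrive-VLM encoder is not described in the abstract.

```python
def event_pyramid(events, now_us, horizons_us=(1_000, 10_000, 100_000)):
    """Summarize recent events over several look-back horizons.

    events: list of (timestamp_us, polarity) with polarity in {-1, +1}
    now_us: current time in microseconds
    Returns one summary per horizon, from shortest to longest window.
    """
    pyramid = {}
    for horizon in horizons_us:
        # Keep events inside the half-open window (now - horizon, now].
        window = [p for t, p in events if now_us - horizon < t <= now_us]
        pyramid[horizon] = {"count": len(window), "net_polarity": sum(window)}
    return pyramid


events = [(99_500, 1), (95_000, -1), (50_000, 1), (0, 1)]
pyramid = event_pyramid(events, now_us=100_000)
```

Short horizons capture fast motion with little context, while long horizons accumulate slower structure; a downstream module could then weight the levels per task.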


[79] EgoCS-400K: An Egocentric Gameplay Dataset for World Models cs.CV

Rongjin Guo, Dong Liang, Yuhao Liu, Fang Liu, Tianyu Huang

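TL;DR: This paper introduces EgoCS-400K, a large-scale, replay-grounded egocentric Counter-Strike gameplay dataset for world-model research. Built from public professional match demos, it contains over 400,000 first-person videos and 10,000 hours of gameplay, with aligned multimodal trajectories covering player states, actions, view-angle changes, and game events.

Details

Motivation: The shift from video generation to interactive world modeling requires large-scale data with temporally aligned video-action-language trajectories, grounded in the actions, camera motion, states, and events that drive future scene changes. Existing sources (web video, robot data, simulators) fall short in executable actions, state supervision, or large-scale human interaction trajectories.

Result: The paper builds the EgoCS-400K dataset, with over 400,000 first-person videos and 10,000 hours of gameplay, covering 13 maps and 10 player viewpoints per round. The dataset supports a range of interactive visual modeling tasks, such as action-conditioned future prediction and state- and event-aware scene rollout.

Insight: The novelty lies in using public professional match demos to build, at low cost, a large-scale dataset of multimodally aligned interaction trajectories, bridging the gap between passive web video, controllable game simulation, and costly embodied data, and providing a valuable resource for training world models.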

Abstract: The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.
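The headline figures in the abstract are mutually consistent, which a quick back-of-the-envelope check shows. The numbers below are the rounded lower bounds quoted in the abstract ("over 400,000", "more than 40,000 rounds"), so the derived average clip length is only an approximation.

```python
# Rounded figures quoted in the abstract (treated here as exact for the check).
rounds = 40_000
viewpoints_per_round = 10
total_hours = 10_000

# Each round is rendered from every player's viewpoint.
videos = rounds * viewpoints_per_round

# Approximate average clip length in seconds.
avg_seconds_per_video = total_hours * 3600 / videos
```

Under these figures, the video count matches the 400,000 in the dataset name and the average clip runs about a minute and a half.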


[80] Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification cs.CV

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang

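TL;DR: This paper proposes UniAR, a unified multimodal autoregressive modeling framework that integrates visual understanding and generation in a single system through one shared context-visual tokenizer. The framework adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, and uses parallel bitwise prediction to jointly predict spatially grouped, multi-level visual codes, which shortens the visual sequence and speeds up generation. A diffusion-based visual decoder then decodes high-fidelity images from the discrete visual tokens.

Details

Motivation: Existing unified multimodal modeling approaches typically rely on two different visual tokenizers, which splits the representation space and hinders truly unified modeling. This paper aims to solve that problem and integrate visual understanding and generation seamlessly within a shared context.

Result: Through large-scale pre-training, supervised fine-tuning, and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks.

Insight: The core contribution is using a single discrete visual tokenizer as the key bridge between understanding and generation, yielding a shared context with no re-encoding. In addition, multi-level feature fusion with bitwise quantization preserves both high-level and low-level visual information, and parallel bitwise prediction markedly improves generation efficiency.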

Abstract: Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
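A lookup-free bitwise quantizer can be illustrated by thresholding each latent dimension at zero, so that the token is defined by its bit pattern rather than by a nearest-neighbor search in a codebook. This is a minimal sketch of that general idea; the sign-based rule and the function name are assumptions, and UniAR's exact scheme is not given in the abstract.

```python
def bitwise_quantize(latent):
    """Quantize a latent vector to one bit per dimension by its sign.

    Returns the bits, the quantized values in {-1, +1}, and the integer
    token index obtained by reading the bits as a binary number
    (dimension 0 is the least significant bit).
    """
    bits = [1 if x > 0 else 0 for x in latent]
    codes = [1.0 if b else -1.0 for b in bits]
    index = sum(b << i for i, b in enumerate(bits))
    return bits, codes, index


bits, codes, index = bitwise_quantize([0.3, -1.2, 0.8, -0.1])
vocab_size = 2 ** len(bits)  # effective vocabulary grows as 2^d with no stored codebook
```

Adding one latent dimension doubles the effective vocabulary without adding any codebook entries, which is one way to read "scaling the effective visual vocabulary at minimal cost". Predicting the bits in parallel rather than a single index is what the abstract calls parallel bitwise prediction.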


[81] Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion cs.CV

Nils Morbitzer, Jonathan Evers, Artem Savkin, Thomas Stauner, Nassir Navab

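TL;DR: This paper proposes FR3D, a world model for future dynamic 3D reconstruction. By decoupling the 3D evolution of the scene from the agent trajectory, it predicts a persistent 3D latent representation and addresses the physical inconsistencies (such as morphing or vanishing objects) that arise when existing 2D video generation models mix ego-motion with environmental dynamics.

Details

Motivation: Existing 2D video synthesis methods based on generative world models mix ego-motion and environmental dynamics within the image plane, which causes physical inconsistencies such as morphing or vanishing objects in long-horizon prediction. This paper aims to build a 3D world model that makes geometrically consistent predictions of future dynamic environments.

Result: Extensive experiments on multiple datasets show that FR3D performs strongly on future dynamic 3D reconstruction from monocular observations, even when predicting the scene 2 seconds into the future.

Insight: The core contribution is explicitly decoupling 3D scene evolution from the agent trajectory (ego-motion) and treating the inferred ego-motion as a latent proxy for action, which resolves the ambiguity between self-motion and world motion. In addition, teacher-student distillation that draws on the spatial “common sense” of off-the-shelf foundation models yields robust zero-shot generalization.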

Abstract: Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent’s trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial “common sense” of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D’s strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: https://fr3d-wm.github.io.


cs.AI

[82] Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models cs.AI | cs.CL | cs.CY

Jasmine Brazilek, Oliver Tulio, Joel Christoph, Miles Tidmarsh, Carol Kline

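TL;DR: This paper presents TAC, the first benchmark that evaluates whether AI agents avoid animal exploitation in their agentic behavior. It tests seven frontier models on 12 travel booking scenarios, finds that every model scores below chance level, and shows that adding an animal-welfare sentence to the system prompt substantially improves the performance of some models.

Details

Motivation: Existing benchmarks for AI and animal welfare only evaluate model text responses to question-answer prompts, so they cannot determine whether that welfare reasoning transfers to agentic deployment, where the model must act through tools.

Result: On the TAC benchmark, all seven frontier models score below the 64% chance level, and the best model, Claude Opus 4.7, reaches only 53%. After an animal-welfare sentence is added to the system prompt, Claude and GPT-5.5 gain 47 to 63 percentage points, GPT-5.2 gains 26 points, and DeepSeek and Gemini gain less than 12 points.

Insight: The study builds the first animal-welfare benchmark targeting agentic behavior, exposes the performance gap between text-response benchmarks and agentic-action benchmarks, and shows that a simple system-prompt change can steer models to attend to animal welfare in agentic tasks, offering a new perspective for assessing the systemic risk of AI systems.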

Abstract: AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.


[83] ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents cs.AI | cs.CL | cs.MA

Ander Alvarez, Santhiya Rajan, Samuel Mugel, Román Orús

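TL;DR: This paper proposes ProvenanceGuard, a source-aware factuality verification system for LLM agents built on the Model Context Protocol (MCP). The system analyzes MCP execution traces, decomposes the agent answer into atomic claims, and uses natural language inference and related techniques to verify that each claim is supported by evidence from the correct source, thereby detecting and repairing cross-source conflation errors.

Details

Motivation: When MCP-based LLM agents combine information from heterogeneous evidence sources such as search, APIs, and databases, standard factuality metrics only check whether the answer is supported by the pooled evidence. They miss a provenance-sensitive failure mode in which a claim is supported by one source but attributed to another, an error called cross-source conflation.

Result: The system is evaluated on a dataset of 281 medical-domain MCP-agent traces. On the 40-trace held-out test split, ProvenanceGuard achieves a block F1 of 0.802 and a source accuracy of 0.858 over 260 source-eligible claims, outperforming baselines that ignore source IDs. On a harder multi-source benchmark, its block F1 reaches 0.846, but source-plus-relation accuracy falls to 0.229. The repair-and-reverify pipeline resolves all blocked answers in the full trace set. In 50 controlled clinical conflation probes, the system detects every injected attribution swap.

Insight: The core contribution is establishing source attribution as an independent dimension of factuality verification for MCP-based agents and building an end-to-end verification framework. Reusable ideas include: (1) capturing MCP traces with stable tool IDs, source IDs, and raw outputs, which gives fine-grained verification a structured data foundation; (2) decomposing answers into atomic claims and routing each to source-specific evidence; (3) combining NLI with a token-alignment proxy for verification; and (4) a closed repair-and-reverify loop that makes the system more practical.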

Abstract: Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents.
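The per-claim verdict and answer-level decision described above can be sketched as a small data structure plus two functions. Everything named here (`Claim`, the verdict labels, and the "block if any claim fails" rule) is a hypothetical simplification; the support flag stands in for the NLI and token-alignment checks, which are not reproduced.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    stated_source: str  # source ID the answer attributes the claim to
    routed_source: str  # source ID whose evidence the verifier matched
    supported: bool     # outcome of the support check, assumed given


def verdict(claim):
    """Classify one atomic claim."""
    if not claim.supported:
        return "unsupported"
    if claim.stated_source != claim.routed_source:
        # Supported somewhere, but credited to the wrong source.
        return "cross_source_conflation"
    return "ok"


def decide(claims):
    """Return per-claim verdicts and an answer-level allow/block decision."""
    verdicts = [verdict(c) for c in claims]
    decision = "allow" if all(v == "ok" for v in verdicts) else "block"
    return verdicts, decision


claims = [
    Claim("Drug A is on the formulary.", "formulary", "formulary", True),
    Claim("The patient is allergic to drug B.", "search", "clinical_record", True),
]
verdicts, decision = decide(claims)
```

The second claim is true according to the evidence pool, so a source-blind metric would accept it; only the comparison of stated and routed source IDs exposes the conflation.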


cs.HC

[84] Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars cs.HC | cs.CL

Pascal Riachi, Sofie Kamber, Stella Brogna, Andrew Gloster, Rafael Wampfler

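TL;DR: This paper presents an AI-driven interactive virtual patient system to support psychotherapist training in Acceptance and Commitment Therapy (ACT). The system uses large language models to simulate patient behavior based on real therapy sessions and configurable clinical scenarios, and an automated evaluator gives turn-by-turn feedback against ACT fidelity criteria, with the aim of enabling deliberate practice in a low-risk setting.

Details

Motivation: The work addresses the ethical, logistical, and resource constraints that psychotherapists face when training in evidence-based interventions such as ACT, by offering a safe, standardized simulated practice environment that compensates for the scarcity of traditional training opportunities.

Result: Quantitative evaluation on 49 therapy transcripts found GPT-4o-mini to be the best feedback model, with the lowest mean absolute error (MAE = 6.12) in reproducing human supervisor ACT fidelity ratings and statistically significant agreement. Expert evaluation confirmed the high realism of patient behavior and showed that immediate feedback raised therapist awareness of intervention choices.

Insight: The novelty lies in combining large language models with configurable clinical scenarios to build a fidelity-aware simulated patient system that serves as a scalable complement to psychotherapy training rather than a replacement for supervision, emphasizing a design that supports experimentation and reflection in low-risk settings.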

Abstract: Training psychotherapists in evidence-based interventions such as Acceptance and Commitment Therapy (ACT) requires repeated practice with meaningful feedback, yet opportunities for safe, standardized training are limited by ethical, logistical, and resource constraints. We introduce a system designed to support ACT-oriented psychotherapy training through spoken dialogue with an embodied virtual patient. The system uses large language models to simulate patient behavior conditioned on profiles derived from real therapy sessions and configurable clinical scenarios, while a separate automated evaluator provides turn-by-turn feedback on therapist responses based on established ACT fidelity criteria. Rather than aiming to replace supervision, the system is intended to support deliberate practice by enabling experimentation, reflection, and immediate feedback in low-risk settings. Expert evaluation with practicing psychologists confirmed high realism in patient behavior and demonstrated that immediate turn-by-turn ACT feedback increased therapists’ awareness of intervention choices and enabled effective experimentation with alternative responses. Quantitative evaluation across 49 therapy transcripts identified GPT-4o-mini as the optimal feedback model, achieving the lowest mean absolute error (MAE = 6.12) in replicating human supervisor ACT fidelity ratings with statistically significant agreement. This work demonstrates the potential of fidelity-aware simulated patients as a scalable complement to psychotherapy training.
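The mean absolute error used to compare feedback models is the average absolute gap between model and supervisor ratings. A minimal sketch follows; the scores are made-up illustrative values, not data from the study, and the rating scale is not stated in the abstract.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over paired ratings."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical fidelity ratings for three transcripts.
model_scores = [70.0, 55.0, 82.0]
supervisor_scores = [64.0, 60.0, 80.0]
mae = mean_absolute_error(model_scores, supervisor_scores)
```

A lower value means the automated evaluator tracks the human supervisor more closely, which is the sense in which the reported MAE of 6.12 ranks GPT-4o-mini first.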


cs.CY

Michèle Finck

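TL;DR: This paper argues that current legal-AI evaluation has a measurement gap: there is no benchmark that can assess whether large language models perform doctrinal legal reasoning, which is the interpretive core of legal work rather than an ancillary task. The gap is legal as well as methodological, because the “appropriate accuracy” requirement of the EU AI Act for high-risk AI in the judicial domain cannot acquire operational content without such a benchmark.

Details

Motivation: The paper addresses a limitation of current legal-AI evaluation: existing benchmarks mainly measure ancillary legal tasks and cannot assess doctrinal legal reasoning, the core of legal work, so the regulatory requirement of “appropriate accuracy” that the EU AI Act places on judicial AI cannot be met.

Result: The abstract reports no quantitative results or benchmark rankings. It stresses that the lack of a benchmark for doctrinal legal reasoning is a key obstacle to meeting the regulatory requirements of the EU AI Act.

Insight: The contribution is identifying and emphasizing the “doctrinal-reasoning measurement gap” in legal-AI evaluation and tying it directly to the regulatory requirements of the EU AI Act, which gives the development of new legal-AI benchmarks an urgent theoretical and legal basis.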

Abstract: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes “appropriate accuracy” a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.


cs.LG

[86] Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs cs.LG | cs.AI | cs.CL

Tingchao Fu, Wenkai Wang, Fanxiao Li, Huadong Zhang, Jinhong Zhang

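TL;DR: This paper finds that current knowledge editing methods for multimodal large language models suffer from editing decoupling failure: the model knowledge is successfully updated under multimodal input but reverts to the outdated pre-edit knowledge under unimodal input. The authors propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups to make targeted knowledge updates and so mitigates editing decoupling failure.

Details

Motivation: To address editing decoupling failure in knowledge editing for multimodal large language models, where knowledge updates take effect under multimodal input but fail under unimodal input.

Result: Extensive experiments show that DECODE achieves effective knowledge updates under different modality triggers, mitigating editing decoupling failure.

Insight: The novelty lies in revealing that entity knowledge in MLLMs is not stored as a unified representation but is distributed across disentangled modality-specific pathways, and in proposing to explicitly disentangle and localize modality-specific neuron groups for more robust knowledge editing.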

Abstract: Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue: editing decoupling failure, where entity-related knowledge can be updated when the model is triggered by multimodal inputs (text–image query pairs) but often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge updates. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.


[87] Rethinking Groups in Critic-Free RLVR cs.LG | cs.CL

Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang

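TL;DR: This paper revisits the role of the “group” in critic-free reinforcement learning (critic-free RL) and argues that its core function is not merely to estimate a baseline but to prevent false penalties on negative samples. Based on this, the authors propose a “negative token filtering” strategy that enables stable single-rollout training and matches or exceeds group-based methods on reasoning and agentic tasks.

Details

Motivation: Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate a value baseline for advantage computation, but this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts.

Result: Performance is comparable to group-based RL techniques on reasoning tasks and stronger on agentic tasks.

Insight: The novelty lies in revealing the key role the “group” plays in preventing false penalties on negative samples and in proposing a simple, effective negative token filtering strategy that enables single-rollout training and improves data efficiency and flexibility.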

Abstract: Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the “group” and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
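For context, the group-based baseline that the abstract revisits can be sketched as follows: sample several rollouts for one question and subtract the group mean reward from each rollout's reward. This shows only the conventional design, using a plain mean baseline as a simplifying assumption; it is not the proposed negative token filtering, whose details the abstract does not give.

```python
def group_advantages(rewards):
    """Advantage of each rollout relative to the mean reward of its group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Four rollouts for the same question with binary verifiable rewards.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout in the group fails, the baseline equals the reward
# and no rollout receives a penalty.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case hints at the abstract's point: grouping shields uniformly failed rollouts from penalties, a protection that single-rollout training has to obtain some other way.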


[88] EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning cs.LG | cs.CL

Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang

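TL;DR: This paper proposes the EnvRL framework, which improves reinforcement learning by exploiting the environment dynamics information contained in agent-environment interaction trajectories. Beyond the primary RL objective, it adds two auxiliary learning objectives, state prediction and inverse dynamics, to help the agent internalize environment dynamics. Experiments on two long-horizon agentic benchmarks (ALFWorld and WebShop) show that EnvRL significantly improves success rates.

Details

Motivation: Conventional RL methods for long-horizon agentic tasks are often limited by sparse outcome rewards and overlook the rich environment dynamics information in interaction trajectories. The paper argues that this interaction experience can itself serve as an implicit supervision signal that reveals the underlying transition mechanisms of the environment and helps the agent build a more accurate internal model of it.

Result: On the ALFWorld and WebShop long-horizon agentic benchmarks, EnvRL achieves significantly higher success rates than RL-only baselines. For example, when trained with GRPO, it lifts Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.

Insight: The claimed contribution is integrating environment dynamics learning (through state prediction and inverse dynamics tasks) into the agentic RL framework as auxiliary objectives, so as to exploit the implicit supervision signal in interaction trajectories. Viewed objectively, this is an effective way to combine model learning with policy learning to improve sample efficiency and policy performance, particularly in long-horizon tasks with sparse rewards.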

Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rates over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
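Joint optimization with two auxiliary objectives is commonly written as a weighted sum of losses. The sketch below assumes that form; the weights `w_state` and `w_inverse` and their default values are hypothetical, since the abstract does not say how the objectives are combined. The last lines restate the reported gains in percentage points.

```python
def joint_loss(rl_loss, state_pred_loss, inverse_dyn_loss,
               w_state=1.0, w_inverse=1.0):
    """Primary RL loss plus weighted auxiliary dynamics losses.

    state_pred_loss: error predicting the next state from (state, action)
    inverse_dyn_loss: error predicting the action from (state, next state)
    """
    return rl_loss + w_state * state_pred_loss + w_inverse * inverse_dyn_loss


total = joint_loss(rl_loss=0.8, state_pred_loss=0.3, inverse_dyn_loss=0.2,
                   w_state=0.5, w_inverse=0.5)

# Success-rate gains reported in the abstract, in percentage points.
gains = {
    "ALFWorld": round(77.4 - 72.8, 1),
    "WebShop": round(67.0 - 56.8, 1),
}
```

Both auxiliary terms are computed from the same rollouts that produce the RL signal, so they add supervision without requiring extra environment interaction.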


[89] ProCUA-SFT Technical Report cs.LG | cs.CV

Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang

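TL;DR: This paper introduces ProCUA-SFT, a dataset of 3.1 million step-level supervised fine-tuning samples for training computer-use agents. The dataset is distilled by a fully automated pipeline from 93,000 synthetic trajectories spanning 2,484 application combinations. Fine-tuning UI-TARS 7B on it for one epoch raises the OSWorld success rate from 26.3% to 45.0%.

Details

Motivation: The largest existing public resource, AgentNet (22,500 human trajectories), causes negative transfer when used for supervised fine-tuning and significantly degrades model performance. Large-scale, diverse, high-quality trajectory data is therefore needed to train computer-use agents that interact with graphical desktops effectively.

Result: On OSWorld, fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch reaches a 45.0% success rate, 18.7 percentage points above the base model and more than 35% above models trained on AgentNet. Part of the dataset has been used to train the Nemotron 3 Nano Omni model, strengthening its computer-use capabilities.

Insight: The main contribution is a fully automated pipeline for synthetic data generation and verification in which a single vision-language model (Kimi-K2.5) plays all three roles of goal generator, precondition judge, and trajectory executor, eliminating the capability gap between planner and actor. Binary precondition checking ensures task feasibility, and each trajectory is expanded into step-prefix samples that exactly match the inference-time context layout, which improves data quality and model performance.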

Abstract: Training computer-use agents (CUAs) – models that interact with graphical desktops through screenshots and keyboard/mouse actions – requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content – 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs – and (ii) verifies each task’s feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld – an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
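Expanding a trajectory into step-prefix samples means emitting one training sample per step, each containing the history up to that step and the action to predict. The sketch below illustrates that expansion; the sample fields and the placeholder observation and action strings are hypothetical, and the real context layout is not specified in the abstract.

```python
def step_prefix_samples(task, steps):
    """Expand one trajectory into one supervised sample per step.

    steps: list of (observation, action) pairs in execution order
    Each sample holds the prior steps as history, the current observation,
    and the action taken at that step as the training target.
    """
    samples = []
    for t, (observation, action) in enumerate(steps):
        samples.append({
            "task": task,
            "history": steps[:t],  # everything the agent saw and did before step t
            "observation": observation,
            "target_action": action,
        })
    return samples


trajectory = [
    ("screenshot_0", "click(File)"),
    ("screenshot_1", "click(Save As)"),
    ("screenshot_2", "type(report.xlsx)"),
]
samples = step_prefix_samples("Save the spreadsheet as report.xlsx", trajectory)

# Reported dataset scale: 3.1M samples from 93K trajectories.
avg_samples_per_trajectory = 3_100_000 / 93_000
```

A trajectory of T steps yields T samples, so the reported totals imply roughly 33 steps per trajectory on average.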


cs.GR

[90] Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting cs.GR | cs.CV

Duy-Dat Tran, Trung-Nghia Le

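TL;DR: This paper proposes Edit3DGS, a unified framework for dynamic 3D head editing. It combines a 2D instruction-guided diffusion model with 3D Gaussian splatting to make fine-grained, text-instructed edits to faces in an input video (such as expression transformation and attribute modification) and to generate a temporally consistent, high-fidelity 3D avatar.

Details

Motivation: Existing methods usually handle frame-based editing or static 3D reconstruction separately and cannot combine semantic controllability in the image domain with photorealistic, temporally consistent 3D representations. This paper aims to unify semantic control and spatiotemporal consistency in dynamic 3D head editing.

Result: Experiments show that the framework enables controllable, artifact-free head editing with smooth temporal transitions.

Insight: The core contribution is coupling 2D instruction-guided diffusion editing with 3D Gaussian splatting reconstruction and enforcing temporal consistency through multi-view batch editing and a lightweight inpainting strategy, providing a practical dynamic editing solution for applications such as virtual avatars and immersive communication.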

Abstract: We present Edit3DGS, a unified framework for dynamic 3D head editing that integrates 2D instruction-guided diffusion with 3D Gaussian splatting. Unlike prior approaches that separately address frame-based edits or static 3D reconstruction, our method couples semantic controllability in the image domain with photorealistic, temporally consistent 3D representations. Given an input video, editable facial regions are masked and modified using a text-conditioned diffusion model to support fine-grained operations such as expression transformation, attribute modification, and appearance refinement. The edited frames are then aggregated through 3D Gaussian splatting to produce a coherent, high-fidelity avatar that preserves both identity and motion dynamics. To enforce consistency, Edit3DGS incorporates multi-view batch editing and lightweight inpainting strategies that recover lost expressions across timesteps. Experimental results demonstrate that our framework enables controllable, artifact-free head editing with smooth temporal transitions, offering practical applications in virtual avatars, immersive communication, film production, and interactive media.


cs.RO

[91] Contrastive Action-Image Pre-training for Visuomotor Control cs.RO | cs.CV

Yuvan Sharma, Dantong Niu, Anirudh Pai, Zekai Wang, Zhuoyang Liu

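TL;DR: This paper proposes CAIP (Contrastive Action-Image Pre-training), a vision encoder designed to overcome the data scarcity that makes large-scale pre-training of robot vision encoders difficult. The method uses 3D hand keypoints extracted from large-scale egocentric video as a proxy for robot end-effector actions and learns a unified action-image representation through a contrastive objective. Experiments show that, with only a small amount of robot data (88 hours) and a large amount of human video (32,041 hours), CAIP significantly outperforms existing vision encoders on dexterous manipulation tasks.

Details

Motivation: Existing robot vision encoders are bottlenecked by insufficient data scale, while internet image and language data and human video lack paired action information. Robot trajectory data is the most direct source but is limited in scale, so action signals need to be extracted from abundant human video as a proxy.

Result: In real-world manipulation evaluation on the Dexmate Vega and Sharpa Wave dexterous hand platforms, CAIP improves performance by more than 30% over state-of-the-art vision encoders such as DINOv2, SigLIP, MVP, and R3M on folding, pouring, and fine-grained manipulation tasks.

Insight: The novelty lies in using 3D hand keypoints from human video as a proxy representation for robot actions and jointly pre-training on actions and images through contrastive learning, offering a scalable and efficient approach to robot visual representation learning.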

Abstract: Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.
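A contrastive objective over paired image and action embeddings can be sketched with a symmetric InfoNCE loss, as popularized by image-text pre-training. Using that specific loss here is an assumption, since the abstract only says "a contrastive objective"; the toy 2D embeddings and the temperature value are likewise illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchor i to positive i among all positives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, pos) / temperature for pos in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


def symmetric_contrastive_loss(image_emb, action_emb, temperature=0.1):
    """Average of image-to-action and action-to-image InfoNCE losses."""
    return 0.5 * (info_nce(image_emb, action_emb, temperature)
                  + info_nce(action_emb, image_emb, temperature))


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each frame embedding sits next to the embedding of its own hand-pose action and large when the pairs are swapped, which is the pressure that pulls vision features toward action-relevant structure.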


[92] Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework cs.RO | cs.CV

Milind Rampure, Shadman Sakib, Haley Patel, Zahid Hasan, Nirmalya Roy

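TL;DR: This paper proposes a multimodal contactless respiratory monitoring framework for heterogeneous mobile robots. It combines brightness-adaptive sensor selection, keypoint-guided chest ROI extraction, and signal-quality-index (SQI)-based filtering to estimate respiratory rate robustly under variable illumination, posture changes, and platform heterogeneity.

Details

Motivation: In emergency response, disaster recovery, and infectious-disease scenarios, contactless respiratory-rate monitoring is critical for remote triage and victim assessment, but field deployment is hampered by variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments.

Result: Experiments on three robotic platforms show that the framework generalizes across platforms without per-platform algorithmic retuning. RGB provides the broadest coverage, up to 8 m; NIR remains effective up to 6 m; thermal is reliable only at short range; and low-light sensing supports monitoring in complete darkness up to 8 m.

Insight: Contributions include adaptive multimodal sensor selection, keypoint-based posture-robust ROI extraction, and SQI filtering, which together provide a feasible technical foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.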

Abstract: Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8m, NIR remains effective up to 6m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
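
The abstract does not define its signal-quality index or its estimator, so the following is only a sketch of how SQI-gated respiratory-rate estimation could work on a 1-D chest-ROI intensity signal. The SQI used here (the fraction of in-band spectral power held by the dominant peak), the 0.1 to 0.7 Hz breathing band, the 0.5 threshold, and the name `estimate_rr` are all assumptions made for illustration.

```python
import math


def estimate_rr(signal, fs, band=(0.1, 0.7), sqi_threshold=0.5):
    """Return (breaths_per_minute, sqi); the rate is None if the window is rejected.

    signal: chest-ROI samples; fs: sampling rate in Hz.
    band, sqi_threshold, and the SQI definition are illustrative assumptions.
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]  # remove the DC offset

    best_power, best_freq, band_power = 0.0, None, 0.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if not (band[0] <= freq <= band[1]):
            continue
        # Direct DFT at bin k (fine for short windows; an FFT would be faster)
        re = sum(v * math.cos(2 * math.pi * k * t / n) for t, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * k * t / n) for t, v in enumerate(x))
        power = re * re + im * im
        band_power += power
        if power > best_power:
            best_power, best_freq = power, freq

    if best_freq is None or band_power == 0.0:
        return None, 0.0
    sqi = best_power / band_power  # peak dominance within the breathing band
    if sqi < sqi_threshold:
        return None, sqi  # low-quality window: filtered out
    return best_freq * 60.0, sqi
```

For a clean 0.25 Hz oscillation sampled at 10 Hz for 40 s, the dominant bin sits at 0.25 Hz, which corresponds to 15 breaths per minute; a window whose in-band power is spread over several frequencies is rejected instead of producing an unreliable rate.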


[93] HRDX: A Large-Scale Vector HD-Map Dataset cs.RO | cs.AI | cs.CV

Sahith Reddy Chada, Isht Dwivedi, Nirav Savaliya

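TL;DR: This paper introduces HRDX, a large-scale dataset for vector HD-map construction that covers about 40 hours (1,400 km) of driving, several times larger than prior public HD map datasets. It contains synchronized data from six surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes with over 20 semantic and topological attributes, and a Composite Score (CS) is introduced to jointly assess geometric fidelity and attribute correctness.

Details

Motivation: Existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack multimodal data such as aerial imagery. This falls short of the geometrically accurate, semantically rich, and scalable vector HD maps that autonomous driving requires, and it limits exploration of new research directions such as multimodal BEV fusion.

Result: Benchmark experiments show that HRDX's scale improves online vector-map construction and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, and aerial-augmented teacher models can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements.

Insight: The main contributions are the release of HRDX, which the summary describes as the largest and most attribute-rich public vector HD-map dataset to date, and the new Composite Score (CS) metric. The finding that aerial imagery can serve as training-time privileged information, with part of the gain distilled into a lightweight camera-only model, offers a useful idea for multimodal perception and model efficiency.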

Abstract: Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX’s scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. HRDX dataset and benchmarks are available at https://github.com/honda-research-institute/HRDX


[94] MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation cs.RO | cs.CV

Xingyuming Liu, Ruichun Ma, Heyu Guo, Qixiu Li, Qingwen Yang

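TL;DR: MuseVLA is an adaptive multimodal sensing vision-language-action model for robotic manipulation. It generates a sensor token and a target description to invoke novel sensors (such as temperature, sound, and radar) on demand, and converts sensor measurements into a unified intermediate representation, the grounded sensor image, for multimodal fusion and action generation. It also introduces a data synthesis pipeline that generates sensor images from existing RGB video datasets, reducing the need for expensive multisensory data.

Details

Motivation: Most current vision-language-action models for robotics rely solely on RGB observations and struggle to perceive physical properties such as temperature, sound, or radar response, which limits their ability to interact with the physical world.

Result: On dexterous hand manipulation tasks that require multimodal sensing (temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval), MuseVLA achieves an average success rate of 80.6% on a real robot, significantly outperforming RGB-only and multisensory baselines, and shows strong zero-shot capabilities on unseen tasks.

Insight: Contributions include treating sensors as on-demand tools (via sensor tokens and target descriptions); introducing a unified intermediate representation (the grounded sensor image) that encodes heterogeneous sensor readings, which decouples sensor-specific processing from the backbone and enables efficient fusion; and a data synthesis pipeline that leverages existing datasets to generalize to unseen sensor-guided tasks, lowering data collection costs.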

Abstract: Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.


[95] AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation cs.RO | cs.CV

Haoran Lu, Mutian Shen, Shuyang Yu, Yu Xiao, Songling Liu

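TL;DR: This paper proposes AnnotateAnything, a general automatic annotation framework that converts raw 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. The framework comprises two complementary pipelines: a unified visual-language annotation pipeline that uses vision-language reasoning to infer object semantics and interaction constraints, and a fully automatic, massively parallel physics annotation pipeline that produces executable action annotations through candidate generation, geometry optimization, and trajectory generation. Using the generated annotations, the authors further build an asynchronous parallel simulation data-collection system.

Details

Motivation: Simulation enables scalable robot data collection, but raw 3D assets provide only geometry and lack the semantic, interactive, and physical knowledge that manipulation requires, so they cannot specify where and how a robot should act.

Result: Experiments show that AnnotateAnything outperforms existing annotation and data-generation pipelines in annotation efficiency, data-collection efficiency, and task success rate, and supports downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning.

Insight: The novelty lies in combining vision-language reasoning with massively parallel physics simulation to automatically generate rich, executable manipulation annotations for 3D assets, enabling efficient construction of large-scale simulation datasets that support learning and evaluation across a variety of robot manipulation tasks.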

Abstract: Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset’s geometry and physical constraints through candidate generation, geometry optimization, and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials. Project page: https://tourmaline-caramel-169490.netlify.app.


[96] GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments cs.RO | cs.CV

Jiawei Zhang, Yiming Yan, Chao Liang, Nuo Xu, Seson Sun

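TL;DR: GASE is a highly automated system based on 3D Gaussian splatting for rapidly constructing high-fidelity embodied simulation environments. It captures multi-view video streams with panoramic camera arrays, uses a camera-pose-based strategy to robustly extract foreground objects in the 2D domain, and performs high-quality scene inpainting. Foreground objects and the static background are then reconstructed independently and imported seamlessly into physics simulators for policy training.

Details

Motivation: Training embodied agents in the real world requires skilled operators and expensive hardware, whereas simulation offers large-scale, low-cost data augmentation. Rapidly constructing high-fidelity simulation scenes with a small sim-to-real gap has therefore become a key objective in robot learning. Existing reconstruction-based methods deliver high visual quality but are held back by inefficient data acquisition and poor foreground object extraction.

Result: Extensive experiments show that GASE exceeds existing 3D Gaussian-based methods in segmentation accuracy by more than 10% and achieves state-of-the-art inpainting quality. In real-robot deployments on manipulation and navigation tasks, the performance gap relative to policies trained purely on real-world data stays below 10%.

Insight: The claimed contribution is a highly automated system for simulation scene construction, centered on a camera-pose-based strategy that robustly extracts objects across frames and on high-quality scene inpainting. Viewed objectively, the system combines rapid environment scanning, robust 2D object extraction, and high-quality 3D reconstruction into an efficient end-to-end approach to narrowing the sim-to-real gap.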

Abstract: Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.


[97] MagicSim: A Unified Infrastructure for Executable Embodied Interaction cs.RO | cs.AI | cs.CV

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen

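TL;DR: MagicSim is a unified infrastructure for embodied interaction built around one deterministic batched runtime and a shared Markov decision process (MDP). YAML-first specifications decouple contents, placement, behavior, and agent exposure; from them MagicSim constructs diverse executable worlds in a single reset-and-step loop and supports benchmark evaluation, automatic trajectory collection, and agent-facing interaction.

Details

Motivation: Existing simulation systems for robot learning and embodied agents typically separate the control, skill, and planning layers, relying on “magic” actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. MagicSim aims to resolve this fragmentation by providing a unified execution substrate.

Result: The abstract reports no quantitative benchmark results or SOTA comparisons. It emphasizes instead that MagicSim supports task evaluation, automatic trajectory generation, and interactive interfaces, and that successful rollouts are saved as structured multimodal trajectories that align language supervision, action representations, and visual/geometric representations.

Insight: Contributions include: 1) YAML specifications that decouple the elements of world construction, enabling flexible generation of executable worlds; 2) a common execution interface that grounds high-level commands through controllers, atomic skills, and planner primitives as robot actions rather than simulator-side state edits; and 3) a planner-in-the-loop runtime that unifies world construction, embodied execution, task evaluation, and automatic rollout generation.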

Abstract: Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with “magic” actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.


[98] ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI cs.RO | cs.CV

Hong Yang, Basura Fernando

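TL;DR: ERQA-Plus is a diagnostic benchmark for evaluating reasoning in embodied AI. It contains 1,766 question-answer pairs grounded in 711 robot-centric images and spans perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is built with a multi-stage generation and validation pipeline, and benchmarking of several general-purpose vision-language models and embodied models reveals persistent weaknesses in areas such as spatial and procedural reasoning.

Details

Motivation: Existing visual and embodied question answering benchmarks offer limited control over the reasoning dependencies being tested, making it hard to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. A new diagnostic benchmark is therefore needed to evaluate embodied agents' reasoning at fine granularity.

Result: The strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and an SBERT score of 61.4, but category-level results show persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference.

Insight: The contribution is ERQA-Plus, a structured, fine-grained diagnostic benchmark whose explicit taxonomy and multi-stage construction pipeline allow model ability to be evaluated and diagnosed per reasoning type rather than by overall accuracy alone. This gives a finer-grained framework for understanding and improving embodied reasoning.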

Abstract: Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page at https://github.com/LUNAProject22/erqa-plus.
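
The point that overall accuracy can hide category-level weaknesses is easy to make concrete. The sketch below, with a hypothetical helper name and a hypothetical record format of `(category, is_correct)` pairs, computes overall and per-category accuracy side by side; it is not the benchmark's official scoring code, and it does not cover the SBERT score.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute overall and per-category accuracy.

    records: iterable of (category, is_correct) pairs, a hypothetical
    format chosen for this sketch.
    Returns (overall_accuracy, {category: accuracy}).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(bool(is_correct))
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_category
```

With made-up toy records in which a model answers every perceptual question correctly but only half of the spatial ones, the overall figure still looks respectable while the per-category dictionary exposes the gap, which is the kind of diagnosis the benchmark is designed to support.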


[99] Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models cs.RO | cs.CV | cs.LG

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li

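TL;DR: This technical report introduces Qwen-RobotManip, a generalizable vision-language-action foundation model built on Qwen-VL. It uses a unified alignment framework to address the heterogeneity, high collection cost, and limited diversity of robotic manipulation data. Through a human-to-robot synthesis pipeline and a rigorous curation pipeline, the model builds a pretraining corpus of about 38,100 hours from open-source datasets and human videos, and it exhibits emergent generalization capabilities including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer.

Details

Motivation: The report investigates whether the recipe of alignment at scale used by language and multimodal foundation models carries over to robotic manipulation, where data is heterogeneous, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult.

Result: Across OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE, Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5. It ranks 1st in RoboChallenge with a 20% relative improvement and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

Insight: The novelty is a unified alignment framework across representation, motion, and behavior that makes large-scale multi-source training coherent rather than conflicting, allowing the model to absorb data at a scale that prior training regimes could not sustain. In addition, the human-to-robot synthesis pipeline and rigorous curation pipeline build a large pretraining corpus from open-source resources, yielding efficient data use and strong generalization.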

Abstract: Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi_{0.5}$, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.


[100] Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System cs.RO | cs.CV

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei

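TL;DR: This paper presents Qwen-RobotNav, a scalable navigation model intended as the base model for agentic navigation systems. Its core contribution is a parameterised interface that lets the visual-processing strategy be adjusted at inference time by configuring the task mode (e.g., instruction following, object search) and observation parameters (e.g., token budget, per-camera weights), so that one model adapts to different navigation tasks. The model is trained on 15.6M samples, with vision-language co-training preventing behavioural collapse; it sets new state-of-the-art results on several navigation benchmarks and shows favourable parameter scaling and zero-shot generalisation to real robots.

Details

Motivation: Instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. Resolving this tension requires a base navigation model whose observation strategy can be externally reconfigured at inference time.

Result: Qwen-RobotNav sets new state-of-the-art (SOTA) results across major navigation benchmarks. Performance scales favourably from 2B to 8B parameters, and the model demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

Insight: The main contribution is a parameterised interface that decouples task-mode selection from observation-parameter control, so a single model adapts to multiple navigation tasks through external configuration without modifying the backbone architecture. Viewed objectively, externalising strategy selection and randomising all parameters during training for robustness makes the model a natural building block for composable agentic systems driven by an upper-level planner, and it promotes knowledge transfer across task families.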

Abstract: Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses this need through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav’s task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.
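
The two-dimensional interface described above (a task mode plus controllable observation parameters, randomised during training) can be pictured as a small configuration object. Everything in this sketch is hypothetical: the class and field names, the candidate token budgets, the number of cameras, and the proportional budget split are illustrative assumptions, not the model's actual API.

```python
import random
from dataclasses import dataclass
from enum import Enum


class TaskMode(Enum):
    # The four task families named in the abstract
    INSTRUCTION_FOLLOWING = "instruction_following"
    OBJECT_SEARCH = "object_search"
    TARGET_TRACKING = "target_tracking"
    AUTONOMOUS_DRIVING = "autonomous_driving"


@dataclass(frozen=True)
class ObservationConfig:
    """Hypothetical inference-time configuration passed alongside each call."""
    task_mode: TaskMode
    token_budget: int        # total visual tokens allowed for the history
    camera_weights: tuple    # one normalised weight per camera


def sample_training_config(rng, num_cameras=4, budgets=(256, 512, 1024)):
    """Draw a random configuration, mimicking training-time randomisation."""
    raw = [rng.random() + 1e-6 for _ in range(num_cameras)]
    total = sum(raw)
    return ObservationConfig(
        task_mode=rng.choice(list(TaskMode)),
        token_budget=rng.choice(budgets),
        camera_weights=tuple(w / total for w in raw),
    )


def tokens_per_camera(config):
    """Split the token budget across cameras in proportion to their weights."""
    return [int(config.token_budget * w) for w in config.camera_weights]
```

Under this reading, an upper-level planner would switch behaviour mid-episode simply by constructing a new `ObservationConfig` for the next call, leaving the model weights and architecture untouched.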


cs.CR [Back]

[101] Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners cs.CR | cs.CV

Xiaojun Jia, Jie Liao, Simeng Qin, Ke Ma, Wenbo Guo

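TL;DR: This paper proposes SkillCamo, a multimodal hidden-instruction attack on LLM agent skill scanners that bypasses existing scanners by hiding malicious instructions in images and rewriting the accompanying documentation. It also proposes a defense, ExecScan, an execution-grounded multimodal scanning module that jointly analyzes documentation, code, and visual content to detect such attacks. Experiments show that existing scanners struggle with image-hidden malicious instructions, while ExecScan improves detection performance.

Details

Motivation: Existing agent skill scanners rely mainly on textual descriptions, manifests, and source code for security analysis and insufficiently examine malicious intent conveyed visually. Harmful operational instructions hidden in other modalities, such as images, can therefore be recovered and executed by multimodal agents at deployment, creating a security blind spot.

Result: Extensive experiments show that image-hidden malicious instructions challenge existing skill scanners, while the proposed ExecScan module improves skill-scanning performance.

Insight: The novelty lies in exposing visual content in multimodal agent skills as a new attack vector (the SkillCamo attack) and in proposing ExecScan, a defense framework based on execution simulation and joint multimodal analysis that identifies downstream risks through intent extraction, behavior reconstruction, and abuse assessment.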

Abstract: Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually conveyed malicious intent insufficiently examined. This creates a practical blind spot: harmful operational instructions hidden in images may bypass scanning while still being recoverable by multimodal agents during deployment. To systematically investigate this threat, we propose SkillCamo, a document-mediated multimodal instruction attack that conceals malicious instructions within images bundled with a skill while rewriting the surrounding documentation to naturally reference those images as part of the normal workflow. Thus, the attack does not rely on the image alone, but on the joint interpretation of textual guidance and visual payload at execution time. To defend against such attacks, we further propose ExecScan, an execution-grounded multimodal scanning module that performs intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation over skill artifacts. ExecScan jointly analyzes documentation, code, referenced resources, and visual content to recover hidden instructions, reconstruct executable behavior chains, and identify downstream risks such as exfiltration, destruction, persistence, deception, and privilege escalation. Extensive experiments show that image-hidden malicious instructions challenge existing skill scanners, while ExecScan can improve the skill scanning performance.