cs.CL

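[1] CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation cs.CL

Andrew Bouras, OMS-II Research Fellow

TL;DR: This paper introduces CrossTrace, a cross-domain (biomedical, AI/ML, computer science) dataset of 1,389 grounded scientific reasoning traces for training and evaluating scientific hypothesis generation models. The dataset extends HypoGen’s Bit-Flip-Spark framework with step-level verification and a taxonomy of eight discovery patterns. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace yields substantial gains on several metrics and provides evidence that cross-domain training is effective.

Details

Motivation: Existing datasets for training and evaluating scientific hypothesis generation models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions, even though hypothesis generation is a key bottleneck in accelerating research.

Result: Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA raises IAScore (GPT-4o judge) from 0.828 to 0.968 over the untuned baseline, structural compliance from 0% to 100%, and spark cosine similarity from 0.221 to 0.620. Balanced cross-domain training outperforms single-domain training. Human validation of 150 records shows 99.7% step-level grounding accuracy and a 0.0% fabrication rate.

Insight: The main contribution is the first large-scale, cross-domain hypothesis generation dataset with step-level grounded reasoning traces, together with an Input/Trace/Output schema that includes step verification and discovery-pattern classification. Viewed objectively, such explicit, grounded reasoning traces are an effective training signal that is at least partially domain-general.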

Abstract: Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross-domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit-Flip-Spark framework of HypoGen with step-level verification, a taxonomy of eight discovery patterns, and multi-domain coverage. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves from 0% to 100%, and spark cosine similarity increases from 0.221 to 0.620. Balanced cross-domain training (biomedical + AI/ML + CS) outperforms single-domain training, providing evidence that scientific reasoning patterns transfer across disciplines. Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate. To my knowledge, CrossTrace is the first large-scale, cross-domain dataset with step-level grounded reasoning traces for hypothesis generation, and my results demonstrate that such traces are an effective training signal whose benefits are at least partially domain-general.


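[2] Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction cs.CL | cs.AI

Diego C. Lerma-Torres

TL;DR: This paper proposes a neuroscience-inspired lifelong memory architecture to address the lack of persistent, structured memory in large language models for long-term interaction and context-sensitive retrieval. The framework draws on complementary learning systems theory, the belief hierarchy of cognitive behavioral therapy, dual-process cognition, and fuzzy-trace theory, and is organized around three core principles: memory has emotional valence, retrieval defaults to System 1 and escalates to System 2, and encoding is active and feedback-dependent.

Details

Motivation: Large language models lack persistent, structured memory for long-term interaction, and simply expanding the context window degrades reasoning (by up to 85%, even with perfect retrieval). An architecture closer to human memory mechanisms is therefore needed to support unbounded interaction.

Result: The abstract reports no quantitative experiments or benchmarks. It specifies seven functional properties that any implementation must satisfy and claims that the system gradually converges toward System 1 processing as experience accumulates (a computational analog of clinical expertise), so that interaction cost falls rather than rises over time.

Insight: Contributions include: (1) introducing emotional valence (valence vectors) and a belief hierarchy into memory representation for instant orientation; (2) a hybrid retrieval mechanism that defaults to System 1 (automatic spreading activation) and escalates to System 2 (deliberate retrieval) on demand, with graded epistemic states that address hallucination structurally; (3) active, feedback-dependent encoding through a thalamic gateway and curiosity-driven executive control, forming gists rather than relying on passive exposure. Together these designs model key properties of human lifelong memory computationally.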

Abstract: Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85% - even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy’s belief hierarchy, dual-process cognition, and fuzzy-trace theory, organized around three principles: (1) Memory has valence, not just content - pre-computed emotional-associative summaries (valence vectors) organized in an emergent belief hierarchy inspired by Beck’s cognitive model enable instant orientation before deliberation; (2) Retrieval defaults to System 1 with System 2 escalation - automatic spreading activation and passive priming as default, with deliberate retrieval only when needed, and graded epistemic states that address hallucination structurally; and (3) Encoding is active, present, and feedback-dependent - a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation, not passive exposure. Seven functional properties specify what any implementation must satisfy. Over time, the system converges toward System 1 processing - the computational analog of clinical expertise - producing interactions that become cheaper, not more expensive, with experience.


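[3] The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning cs.CL | cs.AI

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

TL;DR: This paper studies how large language models systematically fail in reasoning when a salient surface heuristic cue conflicts with an unstated feasibility constraint. Using a diagnose-measure-bridge-treat framework, the authors analyze tasks such as the "car wash problem" and find that models rely on approximately context-independent sigmoid heuristics, with the distance cue exerting 8.7 to 38 times more influence than the goal. They build the Heuristic Override Benchmark, covering 4 heuristic and 5 constraint families, and test 14 models; under strict evaluation no model exceeds 75% accuracy. The results indicate that failures stem from constraint inference rather than missing knowledge and reveal a conservative bias. Methods such as goal-decomposition prompting partially mitigate the problem.

Details

Motivation: When salient surface cues conflict with implicit feasibility constraints, large language models systematically ignore the constraints and follow the heuristic cues. The paper aims to expose this reasoning vulnerability.

Result: On the proposed Heuristic Override Benchmark, no model exceeds 75% under strict evaluation (10/10 correct), and presence constraints are hardest (44%). A minimal hint recovers 15 percentage points on average, and goal-decomposition prompting recovers 6 to 9 percentage points.

Insight: The paper systematically diagnoses and quantifies the "heuristic override" vulnerability in LLM reasoning, builds a comprehensive benchmark to measure it, and uses causal-behavioral analysis and parametric probes to show that models rely on surface cues rather than compositional reasoning. This points to concrete directions for improvement, such as prompts that force the model to enumerate preconditions.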

Abstract: Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the “car wash problem” across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) – 500 instances spanning 4 heuristic by 5 constraint families with minimal pairs and explicitness gradients – demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.
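
The strict criterion mentioned in the abstract (an instance counts only if all 10 of 10 trials are correct) can be illustrated with a small sketch. The data layout (a dict from a hypothetical instance id to a list of per-trial booleans) and the function names are assumptions made for illustration; the abstract does not describe the benchmark's actual scoring code.

```python
from typing import Dict, List


def strict_accuracy(trials: Dict[str, List[bool]], required: int = 10) -> float:
    """Fraction of instances that are correct on every one of `required` trials.

    An instance with a different number of trials is counted as failed.
    """
    if not trials:
        raise ValueError("no instances to score")
    passed = sum(
        1
        for outcomes in trials.values()
        if len(outcomes) == required and all(outcomes)
    )
    return passed / len(trials)


def per_trial_accuracy(trials: Dict[str, List[bool]]) -> float:
    """Plain mean over all individual trials, for comparison."""
    total = sum(len(outcomes) for outcomes in trials.values())
    correct = sum(sum(outcomes) for outcomes in trials.values())
    return correct / total


# Hypothetical outcomes for four instances, ten trials each.
example = {
    "inst-1": [True] * 10,
    "inst-2": [True] * 9 + [False],  # a single miss fails the strict check
    "inst-3": [False] * 10,
    "inst-4": [True] * 10,
}

print(strict_accuracy(example))     # 0.5
print(per_trial_accuracy(example))  # 0.725
```

The gap between the two numbers shows why a strict score can sit well below an average per-trial score: one inconsistent answer is enough to fail an instance.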


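[4] Concept Training for Human-Aligned Language Models cs.CL

Christine Zhang, Dan Jurafsky, Chen Shani

TL;DR: This paper proposes a concept training framework for improving the alignment of language models. The conventional next-token prediction (NTP) objective treats semantically similar tokens as mutually exclusive targets, whereas this framework predicts concepts, approximated as sets of semantically related tokens. Experiments show that models trained with concept supervision align better with human semantic similarity judgments on multiple lexical benchmarks and have lower perplexity on semantically meaningful words, at the cost of a modest increase in global token-level perplexity.

Details

Motivation: The standard NTP objective treats valid continuations that are semantically similar but differ in surface form as mutually exclusive targets. The work aims to improve the semantic-level alignment of language models.

Result: On multiple lexical benchmarks, the models align more strongly with human semantic similarity judgments. Perplexity falls on semantically meaningful words, while global token-level perplexity rises modestly. The results suggest that the approach improves semantic alignment while maintaining competitive language modeling performance.

Insight: The novelty is to extend the prediction target from a single token to a concept (a set of semantically related tokens), which introduces semantic-level supervision into training. This offers a new route to better semantic understanding and alignment, trading off standard NTP optimization against concept-level supervision.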

Abstract: The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence “this website is safe to browse” could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.
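
To make the contrast between a single-token target and a concept target concrete, here is a minimal sketch. It assumes one plausible reading of "predicting a concept": the loss is the negative log of the total probability mass the model assigns to any token in the concept set. The abstract does not specify the authors' actual objective, and the probabilities below are invented for illustration.

```python
import math
from typing import Dict, Set


def ntp_loss(probs: Dict[str, float], target: str) -> float:
    """Standard next-token loss: negative log-probability of one target token."""
    return -math.log(probs[target])


def concept_loss(probs: Dict[str, float], concept: Set[str]) -> float:
    """Assumed concept-level loss: negative log of the summed probability
    of every token in the concept set (tokens missing from `probs` count as 0)."""
    mass = sum(probs[token] for token in concept if token in probs)
    if mass <= 0.0:
        raise ValueError("model assigns no probability to the concept")
    return -math.log(mass)


# Hypothetical next-token distribution after "this website is safe to".
probs = {"browse": 0.30, "visit": 0.25, "surf": 0.10, "navigate": 0.05, "eat": 0.30}
concept = {"browse", "search", "visit", "surf", "navigate"}

print(round(ntp_loss(probs, "browse"), 3))     # 1.204
print(round(concept_loss(probs, concept), 3))  # 0.357
```

Under this reading, probability placed on "visit" or "surf" is no longer penalized when the reference token is "browse", which is the behavior the abstract motivates.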


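[5] SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation cs.CL | cs.AI | cs.CV | cs.HC

Mohammad Amer Khalil, Raghad Nahas, Ahmad Nassar, Khloud Al Jallad

TL;DR: This paper introduces SyriSign, a parallel corpus for text-to-sign translation into Syrian Arabic Sign Language (SyArSL), comprising 1,500 video samples across 150 unique lexical signs. It addresses the absence of public SyArSL data and evaluates three deep learning models on the dataset: MotionCLIP, T2M-GPT, and SignCLIP.

Details

Motivation: SyArSL is a low-resource language with no publicly available dataset. The work aims to reduce the communication barriers faced by the deaf community in Syria, where most news is delivered in spoken or written Arabic.

Result: Evaluation with the MotionCLIP, T2M-GPT, and SignCLIP architectures indicates that generative approaches show potential for sign representation, but the limited dataset size constrains generalization. The dataset will be released as an initial benchmark.

Insight: The work builds and releases the first public SyArSL dataset, filling a gap in Arabic sign language resources, and provides a benchmark for low-resource sign language translation by evaluating several deep learning models. Viewed objectively, it highlights how data scarcity limits model generalization and offers an important data foundation for follow-up research.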

Abstract: Sign language is the primary means of communication for the Deaf and Hard-of-Hearing (DHH) community. While there are numerous benchmarks for high-resource sign languages, low-resource languages like Arabic remain underrepresented. Currently, there is no publicly available dataset for Syrian Arabic Sign Language (SyArSL). To close this gap, we introduce SyriSign, a dataset comprising 1,500 video samples across 150 unique lexical signs, designed for text-to-SyArSL translation tasks. This work aims to reduce communication barriers in Syria, as most news is delivered in spoken or written Arabic, which is often inaccessible to the deaf community. We evaluated SyriSign using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment. Experimental results indicate that while generative approaches show strong potential for sign representation, the limited dataset size constrains generalization performance. We will release SyriSign publicly, hoping it serves as an initial benchmark.


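[6] Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs cs.CL | cs.AI | cs.LG

Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang

TL;DR: The paper proposes LiteCoST, a framework that uses Chain-of-Structured-Thought (CoST) to guide a large language model to produce structured outputs and reasoning traces, then uses this data for two-stage fine-tuning of small language models (SLMs), achieving high accuracy and low latency on long-document QA.

Details

Motivation: Direct LLM reasoning over long, noisy documents is brittle and error-prone. The work consolidates dispersed evidence into structured outputs to support reliable, verifiable QA.

Result: On long-document QA, 3B/7B SLMs reach quality comparable to LLMs with 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B).

Insight: The work combines a CoST template that guides an LLM to generate auditable supervision data with two-stage fine-tuning (supervised fine-tuning followed by GRPO) that distills "structure-first" behavior into SLMs, balancing accuracy and efficiency.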

Abstract: Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. The compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure-first behavior into SLMs, this approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
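
As a rough illustration of the GRPO stage, the sketch below computes group-relative advantages: several outputs are sampled for the same prompt, each receives a scalar reward, and each reward is standardized against its own group. The way the three reward terms (answer, format, process consistency) are combined here, an equally weighted sum, is an assumption made for illustration, as is the use of the population standard deviation; the abstract gives neither detail.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def combined_reward(answer: float, fmt: float, process: float,
                    weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of three reward terms (weights are hypothetical)."""
    w_answer, w_fmt, w_process = weights
    return w_answer * answer + w_fmt * fmt + w_process * process


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled outputs for one question.
rewards = [
    combined_reward(1.0, 1.0, 1.0),  # correct, well-formed, consistent trace
    combined_reward(0.0, 1.0, 0.0),  # well-formed but wrong
    combined_reward(1.0, 0.0, 1.0),  # correct but malformed output
    combined_reward(0.0, 0.0, 0.0),  # fails on every term
]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because the baseline is the group mean, no separate value network is needed: outputs better than their siblings get positive advantages and worse ones get negative advantages.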


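[7] The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages cs.CL | cs.LG

Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni

TL;DR: This paper introduces the Thiomi Dataset, a large-scale multimodal corpus for low-resource African languages that covers 10 languages and contains over 601,000 sentence-level text annotations and over 385,000 audio recordings. Data were collected through a community platform, with a multi-tier quality assurance pipeline to ensure quality. The team trained automatic speech recognition, machine translation, and text-to-speech models and reached a 3.24% word error rate on Swahili, well below the prior academic SOTA.

Details

Motivation: African languages are under-resourced in natural language processing. The goal is a large-scale, high-quality multimodal dataset to support technologies such as ASR, MT, and TTS.

Result: On Swahili (Common Voice), the best ASR system reaches 3.24% WER, down from the prior academic SOTA of 8.3% (5.1 percentage points absolute, 61% relative). It reaches 4.3% WER on Somali. Baselines are established for all ten languages.

Insight: Contributions include: efficient construction of low-resource language datasets through a community-driven collection platform and multi-tier quality assurance; first ASR, MT, and TTS baselines for several African languages; and a substantial improvement in Swahili ASR that demonstrates the dataset’s utility.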

Abstract: We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset’s utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
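
The reported reduction figures can be checked with a few lines of arithmetic. The sketch below also includes the standard word-level definition of WER (word edit distance divided by reference length) for readers unfamiliar with the metric; the function names and the toy sentences are illustrative and not taken from the paper.

```python
from typing import Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(prior: float, new: float) -> Tuple[float, float]:
    """Absolute reduction (same units as the inputs) and relative reduction."""
    absolute = prior - new
    return absolute, absolute / prior


absolute_pp, relative = wer_reduction(8.3, 3.24)
print(round(absolute_pp, 1))   # 5.1 percentage points
print(round(relative * 100))   # 61 percent
print(word_error_rate("a b c d", "a x c"))  # 0.5 (one substitution, one deletion)
```

The exact absolute difference is 5.06 points, which rounds to the 5.1 stated in the abstract, and 5.06 / 8.3 is roughly 0.61.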


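[8] MemRerank: Preference Memory for Personalized Product Reranking cs.CL | cs.AI | cs.LG

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yi Gong

TL;DR: This paper proposes MemRerank, a preference memory framework for personalized product reranking. It distills user purchase history into concise, query-independent signals, addressing the noise, length, and relevance mismatch that arise when raw history is appended directly to the prompt. The authors build an end-to-end benchmark and evaluation framework and train the memory extractor with reinforcement learning. Experiments show that MemRerank clearly outperforms no-memory, raw-history, and off-the-shelf memory baselines in LLM-based reranking.

Details

Motivation: LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, but appending raw history directly to the prompt is often ineffective because of noise, length, and relevance mismatch.

Result: Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines on the 1-in-5 selection task, with absolute accuracy gains of up to +10.61 points, the best result among the compared baselines.

Insight: The novelty is a preference memory framework that distills user history into compact, query-independent signals and trains the memory extractor with reinforcement learning, using reranking performance as supervision. It offers a practical and effective building block for personalization in agentic e-commerce systems.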

Abstract: LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.


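[9] Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations cs.CL

Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie Zhu

TL;DR: This paper studies how large language models perform in medical consultations when faced with challenging patient behaviors. It proposes the CPB-Bench benchmark, covering four behavior categories (information contradiction, factual inaccuracy, self-diagnosis, and care resistance), evaluates a range of LLMs, and finds behavior-specific failure patterns when models handle contradictory or implausible information.

Details

Motivation: Existing medical LLM evaluations usually assume idealized patient questions and do not account for the challenging patient behaviors common in practice, which limits the realism and safety relevance of those evaluations.

Result: Open- and closed-source LLMs were evaluated on CPB-Bench. Models perform well overall but show consistent, behavior-specific failure patterns when handling contradictory or medically implausible patient information. Four intervention strategies yield inconsistent improvements and can introduce unnecessary corrections.

Insight: The work defines four clinically relevant categories of challenging patient behavior and builds the bilingual benchmark CPB-Bench, exposing safety gaps of LLMs in non-idealized medical dialogue and underscoring the importance of realistic evaluation beyond idealized assumptions.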

Abstract: Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.


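[10] CounselReflect: A Toolkit for Auditing Mental-Health Dialogues cs.CL

Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng

TL;DR: CounselReflect is an end-to-end toolkit for auditing mental-health dialogues. Rather than giving a single opaque score, it assesses dialogue quality transparently through structured, multi-dimensional reports (session-level summaries, turn-level scores, and evidence excerpts). The system integrates 12 model-based metrics from task-specific predictors with rubric-based metrics drawn from a literature-derived library (69 metrics) and user-defined metrics, operationalized with configurable LLM judges. The toolkit is available as a web application, browser extension, and command-line interface, supporting real-time and large-scale use.

Details

Motivation: As conversational systems (such as LLM-based tools) increasingly mediate mental-health support, users lack structured ways to audit the quality and potential risks of the support they receive, so transparent, understandable auditing tools are needed.

Result: Human evaluation, comprising a user study with 20 participants and a review by 6 mental-health professionals, suggests that CounselReflect supports understandable, usable, and trustworthy auditing.

Insight: The toolkit provides structured, multi-dimensional, transparent audit reports instead of a single score. It broadens coverage and flexibility by combining model-based metrics with extensible rubric-based metrics (a literature-derived library plus user-defined metrics) operationalized through configurable LLM judges, and its multi-platform implementation (web application, browser extension, CLI) supports varied usage scenarios.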

Abstract: Mental-health support is increasingly mediated by conversational systems (e.g., LLM-based tools), but users often lack structured ways to audit the quality and potential risks of the support they receive. We introduce CounselReflect, an end-to-end toolkit for auditing mental-health support dialogues. Rather than producing a single opaque quality score, CounselReflect provides structured, multi-dimensional reports with session-level summaries, turn-level scores, and evidence-linked excerpts to support transparent inspection. The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined custom metrics, operationalized with configurable LLM judges. CounselReflect is available as a web application, browser extension, and command-line interface (CLI), enabling use in real-time settings as well as at scale. Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing. A demo video and full source code are also provided.


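[11] M-MiniGPT4: Multilingual VLLM Alignment via Translated Data cs.CL | cs.AI

Seung Hun Han, Youssef Mohamed, Mohamed Elhoseiny

TL;DR: This paper presents M-MiniGPT4, a multilingual vision large language model with strong vision-language understanding across 11 languages. It improves the multilingual performance of the MiniGPT4 architecture by mixing native multilingual data with translated data, and adds a multilingual alignment training stage that uses parallel text corpora to further strengthen the model.

Details

Motivation: Existing vision large language models are weak in multilingual settings, particularly in vision-language understanding for low-resource languages.

Result: The model reaches 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after most of this work was completed.

Insight: The novelty lies in a data strategy that mixes native and translated data and in a dedicated multilingual alignment training stage. Viewed objectively, the data mixing and alignment approach offers a reusable path to better multilingual VLLM performance, and the open-sourced models and datasets should help advance low-resource multilingual research.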

Abstract: This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.


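[12] Calibrated Confidence Expression for Radiology Report Generation cs.CL

David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann

TL;DR: This paper proposes ConRad, a framework that fine-tunes medical large vision-language models with reinforcement learning so that they output calibrated confidence expressions alongside generated radiology reports, supporting selective clinical review and reducing the risk from hallucinated findings.

Details

Motivation: Large vision-language models are overconfident in radiology report generation. Clinically interpretable confidence indicators are needed to enable selective human review and safe deployment.

Result: In experiments, ConRad substantially improves calibration and outperforms existing methods. A clinical evaluation shows that its report-level confidence scores align well with clinicians’ judgment.

Insight: The novelty is combining reinforcement learning with a reward function based on the logarithmic scoring rule for confidence calibration in a multimodal setting, yielding interpretable confidence at both the report level and the sentence level and supporting targeted clinical review.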

Abstract: Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad’s report-level scores are well aligned with clinicians’ judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI assistance for report generation.
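
The claim that a logarithmic scoring rule rewards truthful self-assessment can be illustrated numerically. The sketch below uses the textbook binary form of the rule (reward log c if the claim is correct, log(1 - c) otherwise); the clipping constant and function names are assumptions, and the paper's exact reward may include additional terms or scaling. A grid search shows that expected reward peaks when the stated confidence equals the true probability of being correct.

```python
import math


def log_score_reward(confidence: float, correct: bool, eps: float = 1e-6) -> float:
    """Binary logarithmic scoring rule, clipped to avoid log(0)."""
    c = min(max(confidence, eps), 1.0 - eps)
    return math.log(c) if correct else math.log(1.0 - c)


def expected_reward(confidence: float, p_correct: float) -> float:
    """Expected reward when the claim is actually correct with prob. p_correct."""
    return (p_correct * log_score_reward(confidence, True)
            + (1.0 - p_correct) * log_score_reward(confidence, False))


# If the model is right 70% of the time, which stated confidence pays best?
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda c: expected_reward(c, p_correct=0.7))
print(best)  # 0.7

# Overconfidence on a wrong claim is punished much harder than modest doubt.
print(round(log_score_reward(0.99, False), 2))  # -4.61
print(round(log_score_reward(0.60, False), 2))  # -0.92
```

This is the property that makes the rule "proper": a policy maximizing expected reward has no incentive to report anything other than its actual probability of being right.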


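[13] MemFactory: Unified Inference & Training Framework for Agent Memory cs.CL | cs.AI

Ziliang Guo, Ziheng Li, Zhiyu Li

TL;DR: This paper presents MemFactory, a unified, highly modular training and inference framework designed for memory-augmented agents. It abstracts the memory lifecycle into atomic, plug-and-play components so that researchers can assemble custom memory agents like building blocks, and it natively integrates the GRPO algorithm to optimize internal memory management policies driven by multi-dimensional environmental rewards.

Details

Motivation: Existing implementations that use reinforcement learning to optimize memory operations (such as extraction, updating, and retrieval) are highly fragmented and task-specific, and there is no unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines.

Result: MemFactory is validated empirically on the open-source MemAgent architecture using its publicly available training and evaluation data. On both in-domain and out-of-distribution evaluation sets, it consistently improves on the corresponding base models, with relative gains of up to 14.8%.

Insight: The main contribution is a standardized, extensible, easy-to-use unified framework that abstracts the memory lifecycle, supports modular composition, and natively integrates GRPO for policy optimization, substantially lowering the barrier to research on memory-driven AI agents.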

Abstract: Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a “Lego-like” architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.


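[14] Baby Scale: Investigating Models Trained on Individual Children’s Language Input cs.CL | cs.AI | cs.LG

Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank

TL;DR: Using video transcripts from children aged 6-36 months (the BabyView dataset), this study trains language models to examine how they perform at child-scale data volumes, how performance varies across data from different children and which linguistic features predict it, and how model learning relates to child language acquisition.

Details

Motivation: Modern language models need far more language data than children receive before they behave usefully. The study benchmarks language models on human-scale datasets to understand how linguistic knowledge emerges from children’s natural training data, and thereby probes the nature and origins of this "data gap".

Result: Language models trained on child data scale acceptably on grammar tasks but scale less well on semantic and world knowledge tasks than models trained on synthetic data. Performance also varies substantially across data from different children. Model likelihoods for individual words correlate with children’s learning of those words.

Insight: The work systematically evaluates language models on child-scale, real language input and shows that data quality (distributional and interactional linguistic features) strongly affects model performance. This offers insight for building more efficient small-scale language models and may shed light on the mechanisms of human language acquisition.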

Abstract: Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this “data gap” requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children’s natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children’s experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children’s learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.


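[15] Learning Diagnostic Reasoning for Decision Support in Toxicology cs.CL

Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer

TL;DR: This paper presents DeToxR, a diagnostic decision-support system for acute poly-substance intoxication. It fine-tunes a large language model with reinforcement learning (GRPO), fusing unstructured clinical narratives with structured medical data to optimize prediction across 14 substance classes.

Details

Motivation: In emergency care for acute intoxication, where information is incomplete and uncertainty is high, existing large language models struggle to fuse heterogeneous data and reason accurately toward a diagnosis.

Result: The model significantly outperforms its unadapted base LLM and supervised baselines. In a clinical validation study, it identifies the correct poisons better than an expert toxicologist (Micro-F1: 0.644 vs. 0.473).

Insight: The work is the first application of reinforcement learning to emergency toxicology. By using a multi-label agreement metric oriented toward clinical performance as the reward signal, it directly optimizes the model’s diagnostic reasoning and penalizes both missed substances and hallucinated ones.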

Abstract: Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g., paramedic scene descriptions and unreliable patient self-reports or known histories) with structured medical data such as vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM fine-tuned with Group Relative Policy Optimization (GRPO). We optimize the model’s reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.
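
For readers unfamiliar with the reported metric, here is a minimal sketch of micro-averaged F1 for multi-label prediction, which is the evaluation figure quoted in the abstract (0.644 vs. 0.473). The abstract does not say which agreement metric serves as the training reward, so this should not be read as the authors' reward function; the substance class names and cases are hypothetical.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    over all cases before computing a single F1 score."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)  # correctly identified substances
        fp += len(pred_labels - gold_labels)  # hallucinated (absent) substances
        fn += len(gold_labels - pred_labels)  # missed co-ingested substances
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Two hypothetical cases with made-up class names.
gold = [{"opioids", "benzodiazepines"}, {"ethanol"}]
predicted = [{"opioids"}, {"ethanol", "cocaine"}]

print(round(micro_f1(gold, predicted), 3))  # 0.667
```

Both error types lower the score: the first case misses a co-ingested substance (a false negative) and the second names a substance that was not present (a false positive).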


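[16] SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models cs.CL

Adar Avsian, Larry Heck

TL;DR: This paper introduces the SNEAK benchmark for evaluating the strategic communication ability of large language models under asymmetric information, where a model must convey information to allies while preventing an adversary from inferring a secret.

Details

Motivation: Existing LLM benchmarks mainly evaluate reasoning, factual knowledge, or instruction following and do not directly measure strategic communication under asymmetric information, so a new evaluation framework is needed.

Result: On SNEAK, balancing informativeness and secrecy remains challenging for modern language models. Human participants outperform all evaluated models by a large margin, with scores up to four times higher.

Insight: The work proposes an evaluation framework with two simulated agents (an ally and a chameleon) and defines two complementary metrics, utility and leakage, providing a quantitative benchmark for strategic communication in LLMs.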

Abstract: Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade-off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.


[17] Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives cs.CL | cs.AIPDF

Mohammadhossein Khojasteh, Yifan Jiang, Stefano De Giorgis, Frank van Harmelen, Filip Ilievski

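TL;DR: This paper proposes YARN, a modular framework that strengthens structural mapping with abstractions extracted from narratives by large language models (LLMs), improving machine analogical reasoning over narratives. The framework decomposes narratives into units, abstracts those units, and then uses a mapping component to align elements across stories and carry out the reasoning.

Details

Motivation: Analogical reasoning between narrative structures remains hard for machines. Existing cognitive engines assume pre-extracted entities, while end-to-end LLMs are sensitive to prompt format and to the surface similarity between narratives. The work therefore asks how enhancing structural mapping with LLM-derived abstractions affects analogical reasoning ability.

Result: Experiments show that adding abstractions consistently improves model performance, matching or exceeding end-to-end LLM baselines.

Insight: The novelty is the definition and operationalization of four levels of abstraction that capture both the general meaning of units and their roles in the story, grounded in prior work on framing. YARN supports systematic variation of experimental settings to analyze component contributions, and its code is openly available to support future work.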

Abstract: Analogical reasoning is a key driver of human generalization in problem-solving and argumentation. Yet, analogies between narrative structures remain challenging for machines. Cognitive engines for structural mapping are not directly applicable, as they assume pre-extracted entities, whereas LLMs’ performance is sensitive to prompt format and the degree of surface similarity between narratives. This gap motivates a key question: What is the impact of enhancing structural mapping with LLM-derived abstractions on their analogical reasoning ability in narratives? To that end, we propose a modular framework named YARN (Yielding Abstractions for Reasoning in Narratives), which uses LLMs to decompose narratives into units, abstract these units, and then passes them to a mapping component that aligns elements across stories to perform analogical reasoning. We define and operationalize four levels of abstraction that capture both the general meaning of units and their roles in the story, grounded in prior work on framing. Our experiments reveal that abstractions consistently improve model performance, resulting in competitive or better performance than end-to-end LLM baselines. Closer error analysis reveals the remaining challenges in abstraction at the right level, in incorporating implicit causality, and an emerging categorization of analogical patterns in narratives. YARN enables systematic variation of experimental settings to analyze component contributions, and to support future work, we make the code for YARN openly available.


cs.CV [Back]

[18] DF-ACBlurGAN: Structure-Aware Conditional Generation of Internally Repeated Patterns for Biomaterial Microtopography Design cs.CV | cs.AI | cs.LGPDF

Rongjun Dong, Xin Chen, Morgan R Alexander, Karthikeyan Sivakumar, Reza Omdivar

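TL;DR: This paper proposes DF-ACBlurGAN, a structure-aware conditional generative adversarial network for generating images with internally repeated, periodic structure under weak supervision and class imbalance, with biomaterial microtopography design as the use case. It integrates frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to balance sharp local features against stable global periodicity.

Details

Motivation: Existing machine learning models are limited when generating images with internal repetition and periodic structure, because they typically optimise local texture statistics and semantic realism rather than global structural consistency. The limitation is most pronounced in applications that need strict control over repetition scale, spacing, and boundary coherence, such as biomaterial microtopography design.

Result: Evaluation on multiple biomaterial datasets shows improved repetition consistency and controllable structural variation compared with conventional generative approaches.

Insight: The novelty is explicit reasoning about long-range repetition during training: combining frequency-domain analysis, adaptive blurring, and unit-cell reconstruction gives explicit control over global periodic structure, and conditioning on experimentally derived biological response labels lets the model synthesise designs aligned with target functional outcomes.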

Abstract: Learning to generate images with internally repeated and periodic structures poses a fundamental challenge for machine learning and computer vision models, which are typically optimised for local texture statistics and semantic realism rather than global structural consistency. This limitation is particularly pronounced in applications requiring strict control over repetition scale, spacing, and boundary coherence, such as microtopographical biomaterial surfaces. In this work, biomaterial design serves as a use case to study conditional generation of repeated patterns under weak supervision and class imbalance. We propose DF-ACBlurGAN, a structure-aware conditional generative adversarial network that explicitly reasons about long-range repetition during training. The approach integrates frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to balance sharp local features with stable global periodicity. Conditioning on experimentally derived biological response labels, the model synthesises designs aligned with target functional outcomes. Evaluation across multiple biomaterial datasets demonstrates improved repetition consistency and controllable structural variation compared to conventional generative approaches.
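
The idea of frequency-domain repetition scale estimation can be sketched on a toy 1D intensity profile. This is not the paper's estimator, whose details the abstract does not give; it is a generic illustration that assumes the repetition period can be read off the strongest non-zero Fourier component. The function name `dominant_period` is hypothetical, and a naive DFT is used to keep the sketch dependency-free.

```python
import cmath
import math
from typing import Sequence


def dominant_period(signal: Sequence[float]) -> float:
    """Estimate the repetition period (in samples) of a 1D profile.

    Subtracts the mean, evaluates a naive DFT at every non-zero frequency
    up to Nyquist, and returns n / k for the frequency index k that
    carries the most power.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


# A square-wave "surface profile" that repeats every 8 samples.
profile = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0] * 8
print(dominant_period(profile))  # 8.0
```

In a 2D setting the same reasoning would apply to the peak of the 2D power spectrum; an estimated scale of this kind could then set the width of a scale-adaptive blur, although how the paper couples the two is an assumption here.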


[19] Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas cs.CVPDF

Felix Wimbauer, Fabian Manhardt, Michael Oechsle, Nikolai Kalischek, Christian Rupprecht

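TL;DR: Stepper is a unified framework for text-driven immersive 3D scene synthesis that sidesteps the trade-off between visual fidelity and explorability in existing methods through stepwise panoramic scene expansion. It uses a novel multi-view 360° diffusion model for consistent, high-resolution expansion, combined with a geometry reconstruction pipeline that enforces geometric coherence.

Details

Motivation: Existing panorama-based scene initialization approaches trade visual fidelity against explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution.

Result: Trained on a large-scale multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency in immersive scene generation, outperforming prior approaches.

Insight: The novelty is a unified framework for stepwise panoramic scene expansion, which pairs a multi-view 360° diffusion model for consistent high-resolution expansion with a geometry reconstruction pipeline for geometric coherence.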

Abstract: The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.


[20] MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation cs.CV | cs.AIPDF

Bharath Krishnamurthy, Ajita Rattani

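TL;DR: This paper proposes MMFace-DiT, a dual-stream diffusion transformer for high-fidelity multimodal face generation. A novel dual-stream transformer block processes spatial information (masks, sketches) and semantic information (text) in parallel and fuses them deeply through a shared RoPE attention mechanism, addressing the modality conflicts and latent-space mismatches of existing methods and achieving strong spatial-semantic consistency.

Details

Motivation: Existing multimodal face generation models usually obtain spatial control by extending pre-trained text-to-image pipelines, for example by appending auxiliary control modules or stitching together uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and tend to fail under conflicting modalities or mismatched latent spaces, which limits synergistic fusion across semantic and spatial domains.

Result: MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models.

Insight: The core novelty is a unified dual-stream diffusion transformer whose dual-stream block processes spatial and semantic tokens in parallel and fuses them through shared RoPE attention, which prevents modal dominance and ensures strong adherence to both text and structural priors. In addition, a novel Modality Embedder lets a single model adapt dynamically to different spatial conditions without retraining, offering a flexible paradigm for end-to-end controllable generative modeling.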

Abstract: Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace-DiT/


[21] The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations cs.CV | eess.IVPDF

Kushal Vyas, Alper Kayabasi, Daniel Kim, Vishwanath Saragadam, Ashok Veeraraghavan

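TL;DR: This paper studies how noise pretraining affects implicit neural representations (INRs). Pretraining on unstructured noise (uniform, Gaussian) markedly improves signal fitting capacity but yields poor deep image priors for denoising, whereas pretraining on noise with the 1/|f|^α spectral structure typical of natural images balances signal fitting and inverse imaging well, performing on par with data-driven initialization methods.

Details

Motivation: To examine how sensitive INR performance is to parameter initialization, and to understand why data-driven initialization methods succeed: whether they encode classical statistical signal priors or more complex features.

Result: Experiments on image and video data show that pretraining on unstructured noise beats all other baselines in signal fitting capacity but performs poorly at denoising. Pretraining on noise with the spectral structure of natural images performs well on both signal fitting and inverse imaging (denoising), on par with the best data-driven initialization methods.

Insight: The novelty is showing that noise pretraining is unexpectedly effective for INRs and that the spectral structure of the noise is the key factor. The study points to structured noise pretraining as a substitute for data-driven initialization, enabling more efficient INR training when domain-specific data is scarce.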

Abstract: The approximation and convergence properties of implicit neural representations (INRs) are known to be highly sensitive to parameter initialization strategies. While several data-driven initialization methods demonstrate significant improvements over standard random sampling, the reasons for their success – specifically, whether they encode classical statistical signal priors or more complex features – remain poorly understood. In this study, we explore this phenomenon through a series of experimental analyses leveraging noise pretraining. We pretrain INRs on diverse noise classes (e.g., Gaussian, Dead Leaves, Spectral) and measure their ability to both fit unseen signals and encode priors for an inverse imaging task (denoising). Our analyses on image and video data reveal a surprising finding: simply pretraining on unstructured noise (Uniform, Gaussian) dramatically improves signal fitting capacity compared to all other baselines. However, unstructured noise also yields poor deep image priors for denoising. In contrast, we also find that noise with the classic $1/|f^α|$ spectral structure of natural images achieves an excellent balance of signal fitting and inverse imaging capabilities, performing on par with the best data-driven initialization methods. This finding enables more efficient INR training in applications lacking sufficient prior domain-specific data. For more details, visit project page at https://kushalvyas.github.io/noisepretraining.html
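
To make the spectral structure concrete, the sketch below synthesizes a 1D noise signal whose amplitude spectrum falls off as 1/f^α with random phases. It is an illustration only: the paper works with images and video, and whether its exponent applies to the amplitude or the power spectrum is not stated in the abstract, so the amplitude convention here is an assumption. The names `spectral_noise` and `dft_magnitude` are hypothetical.

```python
import math
import random
from typing import List


def spectral_noise(n: int, alpha: float, seed: int = 0) -> List[float]:
    """Real-valued noise of length n with amplitude spectrum k ** -alpha.

    Built as a sum of cosines at frequencies k = 1 .. n/2 - 1, each with
    amplitude k ** -alpha and an independent uniformly random phase.
    """
    rng = random.Random(seed)
    phases = {k: rng.uniform(0.0, 2.0 * math.pi) for k in range(1, n // 2)}
    return [
        sum(
            k ** -alpha * math.cos(2.0 * math.pi * k * t / n + phase)
            for k, phase in phases.items()
        )
        for t in range(n)
    ]


def dft_magnitude(signal: List[float], k: int) -> float:
    """Magnitude of the k-th DFT coefficient, computed directly."""
    n = len(signal)
    re = sum(x * math.cos(2.0 * math.pi * k * t / n) for t, x in enumerate(signal))
    im = sum(x * math.sin(2.0 * math.pi * k * t / n) for t, x in enumerate(signal))
    return math.hypot(re, im)


noise = spectral_noise(n=64, alpha=1.0)
# With alpha = 1 the magnitude at k = 1 is four times the magnitude at k = 4.
ratio = dft_magnitude(noise, 1) / dft_magnitude(noise, 4)
```

Setting `alpha` to 0 gives a flat (white) spectrum, which corresponds to the unstructured case discussed in the abstract; larger values concentrate energy at low frequencies.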


[22] Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos cs.CVPDF

Yujin Ham, Junho Kim, Vivek Boominathan, Guha Balakrishnan

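TL;DR: This paper proposes a generative algorithm that realistically removes humans and their shadows from egocentric walking tour videos to produce human-free environment walkthroughs. The key is a semi-synthetic dataset used to fine-tune the Casper video diffusion model; the fine-tuned model outperforms the original at removing many humans against complex backgrounds, and the resulting videos can be used to build 3D/4D models of urban locations.

Details

Motivation: Humans appear frequently in egocentric "walking tour" videos because of crowds and the eye-level viewpoint, which limits the usefulness of these videos for environment modeling, so an algorithm is needed to remove humans and their shadows.

Result: On walking tour clips with significant human presence and complex backgrounds, the fine-tuned model performs far better than the Casper video diffusion model both qualitatively and quantitatively, and its outputs can be used to build 3D/4D models of urban locations.

Insight: The novelty includes a semi-synthetic dataset built from real videos to preserve visual diversity, and fine-tuning a state-of-the-art video diffusion model for object and effects inpainting, which addresses human removal and makes environment modeling more feasible.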

Abstract: Egocentric “walking tour” videos provide a rich source of image data for developing diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives limits their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking tour videos around the world to maintain visual diversity. We then used this dataset to fine-tune the state-of-the-art Casper video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the resulting generated clips can be used to build successful 3D/4D models of urban locations.


[23] Is the Modality Gap a Bug or a Feature? A Robustness Perspective cs.CV | cs.LGPDF

Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss

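TL;DR: This paper examines the modality gap in multi-modal models and shows that, under certain conditions, minimizing the contrastive loss yields embeddings in which the two modalities are separated by a global gap vector orthogonal to their embeddings. The gap is monotonically related to robustness: reducing it does not change clean accuracy but makes the model more robust to embedding perturbations. Experiments show that a simple post-processing step, which moves the embeddings of one modality towards the mean of the other, significantly improves the robustness of real-world vision-language models.

Details

Motivation: The modality gap is pervasive in multi-modal models such as CLIP. The paper investigates why it exists and whether closing it in post-processing helps downstream performance, analysing its effect from a robustness perspective.

Result: For many real-world vision-language models, shifting the embeddings of one modality in post-processing significantly increases robustness to embedding perturbations without any loss of clean accuracy. The abstract does not name specific benchmarks or comparisons with the state of the art.

Insight: The novelty is a theoretical result that, under certain conditions, the modality gap is monotonically related to robustness, together with a simple post-processing method that improves robustness. On the question in the title, the robustness analysis favours reading the gap as a defect: shrinking it leaves clean accuracy unchanged and makes outputs less likely to change under perturbation.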

Abstract: Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
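
The post-processing step described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: the gap is taken to be the difference between the two modality means, the text side is the one being moved, and no re-normalization is applied afterwards (the abstract does not say whether the paper re-normalizes). The names `close_gap`, `gap_norm`, and the `step` parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def gap_norm(text_emb: List[Vector], image_emb: List[Vector]) -> float:
    """Euclidean norm of the difference between the two modality means."""
    diff = [i - t for i, t in zip(mean_vector(image_emb), mean_vector(text_emb))]
    return math.sqrt(sum(d * d for d in diff))


def close_gap(text_emb: List[Vector], image_emb: List[Vector],
              step: float = 1.0) -> List[Vector]:
    """Shift every text embedding towards the image mean.

    step = 1.0 makes the two means coincide; step = 0.0 leaves the text
    embeddings unchanged. The same offset is added to every vector, so
    distances between text embeddings are preserved.
    """
    gap = [i - t for i, t in zip(mean_vector(image_emb), mean_vector(text_emb))]
    return [[x + step * g for x, g in zip(vec, gap)] for vec in text_emb]


text = [[1.0, 0.0], [0.8, 0.2]]
image = [[0.0, 1.0], [0.2, 0.8]]
shifted = close_gap(text, image, step=1.0)
# gap_norm(shifted, image) is now zero up to floating-point error.
```

Because one constant vector is added to every text embedding, the ranking of texts for a given image under dot-product similarity is unchanged, which is one intuition for why clean accuracy need not drop; the paper's formal conditions are not reproduced here.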


[24] WorldFlow3D: Flowing Through 3D Distributions for Unbounded World Generation cs.CV | cs.AI | cs.GRPDF

Amogh Joshi, Julian Ost, Felix Heide

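TL;DR: WorldFlow3D is a new method for generating unbounded 3D worlds that models 3D generation as flowing through 3D data distributions rather than restricting it to conditional denoising. It first generates accurate 3D structure as an intermediate distribution, then uses it to guide the generation of more complex structure and high-quality texture, while converging faster and supporting control through vectorized layouts and scene attributes.

Details

Motivation: Unbounded 3D world generation is a foundational task for computer vision, graphics, and robotics. The work aims to go beyond the conventional conditional denoising framework towards more general and controllable 3D scene generation.

Result: The method is validated on real outdoor driving scenes and synthetic indoor scenes, demonstrating cross-domain generalizability and high-quality generation on real data distributions, and it compares favorably with existing approaches in all tested unbounded scene generation settings.

Insight: The core novelty is generalizing flow matching to 3D generation, framed as flowing between 3D distributions, and proposing a latent-free flow approach that first generates causal and accurate structure as a guiding intermediate distribution. This yields fast convergence, high-quality texture generation, and flexible control over geometry and texture.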

Abstract: Unbounded 3D world generation is emerging as a foundational task for scene modeling in computer vision, graphics, and robotics. In this work, we present WorldFlow3D, a novel method capable of generating unbounded 3D worlds. Building upon a foundational property of flow matching - namely, defining a path of transport between two data distributions - we model 3D generation more generally as a problem of flowing through 3D data distributions, not limited to conditional denoising. We find that our latent-free flow approach generates causal and accurate 3D structure, and can use this as an intermediate distribution to guide the generation of more complex structure and high-quality texture - all while converging more rapidly than existing methods. We enable controllability over generated scenes with vectorized scene layout conditions for geometric structure control and visual texture control through scene attributes. We confirm the effectiveness of WorldFlow3D on both real outdoor driving scenes and synthetic indoor scenes, validating cross-domain generalizability and high-quality generation on real data distributions. We confirm favorable scene generation fidelity over approaches in all tested settings for unbounded scene generation. For more, see https://light.princeton.edu/worldflow3d.


[25] TrajectoryMover: Generative Movement of Object Trajectories in Videos cs.CVPDF

Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang

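TL;DR: This paper proposes TrajectoryMover, a method for generative movement of object trajectories in videos. A new data generation pipeline, TrajectoryAtlas, creates large-scale synthetic paired video data that is used to fine-tune a video generator, so that an object's trajectory can be moved while its relative 3D motion is preserved.

Details

Motivation: Existing video editing methods focus on prescribing an object's 3D or 2D motion trajectory or on changing the appearance of an object or scene. A method for moving an object's 3D motion trajectory, that is, moving the object while preserving its relative 3D motion, is still missing, and the main challenge is obtaining paired video data for this scenario.

Result: The paper shows that TrajectoryMover successfully enables generative movement of object trajectories. The abstract does not report quantitative results or benchmark comparisons with the state of the art.

Insight: The novelty is the TrajectoryAtlas data generation pipeline, which addresses the lack of paired training data and makes it possible to train a video generator that handles complex trajectory moves, a new generative task in video editing.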

Abstract: Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object’s 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video’s plausibility and identity. Yet a method to move an object’s 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is currently still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair can not easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data and a video generator TrajectoryMover fine-tuned with this data. We show that this successfully enables generative movement of object trajectories. Project page: https://chhatrekiran.github.io/trajectorymover


[26] Enhancing Box and Block Test with Computer Vision for Post-Stroke Upper Extremity Motor Evaluation cs.CVPDF

David Robinson, Animesh Gupta, Elizabeth Clark, Olga Melnik, Qiushi Fu

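TL;DR: This paper presents a computer vision framework for analysing the Box and Block Test (BBT) used to assess upper-extremity motor function after stroke. From monocular video it extracts world-aligned joint angles of the fingers, arm, and trunk without depth sensors or calibration objects, capturing information about movement quality.

Details

Motivation: Standard clinical assessments of upper-extremity motor function after stroke rely either on ordinal scoring, which lacks sensitivity, or on time-based task metrics, which do not capture movement quality. The paper uses computer vision to provide a finer-grained movement analysis.

Result: On a dataset of 136 BBT recordings from 48 healthy individuals and 7 individuals post stroke, unsupervised dimensionality reduction of joint-angle features shows separation between healthy movement patterns and stroke-related deviations. Some patients with identical BBT scores can be told apart by their postural patterns.

Insight: The novelty is quantifying movement quality from monocular video and world-aligned joint angles without changing the existing clinical routine or adding equipment. The result is a calibration-free, camera-based framework that captures meaningful information beyond conventional time-based scores.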

Abstract: Standard clinical assessments of upper-extremity motor function after stroke either rely on ordinal scoring, which lacks sensitivity, or time-based task metrics, which do not capture movement quality. In this work, we present a computer vision-based framework for analysis of upper-extremity movement during the Box and Block Test (BBT) through world-aligned joint angles of fingers, arm, and trunk without depth sensors or calibration objects. We apply this framework to a dataset of 136 BBT recordings collected from 48 healthy individuals and 7 individuals post stroke. Using unsupervised dimensionality reduction of joint-angle features, we analyze movement patterns without relying on expert clinical labels. The resulting embeddings show separation between healthy movement patterns and stroke-related movement deviations. Importantly, some patients with the same BBT scores can be separated with different postural patterns. These results show that world-aligned joint angles can capture meaningful information of upper-extremity functions beyond standard time-based BBT scores, with no effort from the clinician other than monocular video recordings of the patient using a phone or camera. This work highlights the potential of a camera-based, calibration-free framework to measure movement quality in clinical assessments without changing the widely adopted clinical routine.


[27] SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving cs.CVPDF

Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Xiang Li

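TL;DR: SparseDriveV2 is a scoring-based planning method for end-to-end autonomous driving. By factorizing trajectories into geometric paths and velocity profiles and adopting a hierarchical scoring strategy, it substantially improves the performance of static trajectory vocabularies.

Details

Motivation: Among scoring-based planners, methods that score dynamically generated proposals have outperformed those that score a static vocabulary, but it is unclear whether static vocabularies are limited only by coarse discretization. The paper investigates whether a sufficiently dense static vocabulary can reach comparable performance.

Result: SparseDriveV2 achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, and 89.15 Driving Score with 70.00 Success Rate on Bench2Drive, using a lightweight ResNet-34 backbone.

Insight: The novelty is a scalable factorized trajectory vocabulary and a hierarchical scoring strategy. The results indicate that, with dense coverage of the action space, static-vocabulary methods can match dynamically generated proposals, offering an efficient alternative for end-to-end planning.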

Abstract: End-to-end multi-modal planning has been widely adopted to model the uncertainty of driving behavior, typically by scoring candidate trajectories and selecting the optimal one. Existing approaches generally fall into two categories: scoring a large static trajectory vocabulary, or scoring a small set of dynamically generated proposals. While static vocabularies often suffer from coarse discretization of the action space, dynamic proposals provide finer-grained precision and have shown stronger empirical performance on existing benchmarks. However, it remains unclear whether dynamic generation is fundamentally necessary, or whether static vocabularies can already achieve comparable performance when they are sufficiently dense to cover the action space. In this work, we start with a systematic scaling study of Hydra-MDP, a representative scoring-based method, revealing that performance consistently improves as trajectory anchors become denser, without exhibiting saturation before computational constraints are reached. Motivated by this observation, we propose SparseDriveV2 to push the performance boundary of scoring-based planning through two complementary innovations: (1) a scalable vocabulary representation with a factorized structure that decomposes trajectories into geometric paths and velocity profiles, enabling combinatorial coverage of the action space, and (2) a scalable scoring strategy with coarse factorized scoring over paths and velocity profiles followed by fine-grained scoring on a small set of composed trajectories. By combining these two techniques, SparseDriveV2 achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, with 89.15 Driving Score and 70.00 Success Rate on Bench2Drive with a lightweight ResNet-34 as backbone. Code and model are released at https://github.com/swc-17/SparseDriveV2.
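
The factorized vocabulary and the coarse-to-fine scoring described above can be sketched as follows. This is a toy illustration, not the released implementation: the three scoring callables stand in for learned scorers, and the function name `coarse_to_fine` and the parameter `k` are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence, Tuple, TypeVar

P = TypeVar("P")  # a geometric path
V = TypeVar("V")  # a velocity profile


def coarse_to_fine(
    paths: Sequence[P],
    speeds: Sequence[V],
    path_score: Callable[[P], float],
    speed_score: Callable[[V], float],
    joint_score: Callable[[P, V], float],
    k: int = 2,
) -> Tuple[Tuple[P, V], int]:
    """Pick a trajectory from a factorized vocabulary.

    Coarse stage: score paths and velocity profiles independently and keep
    the top k of each. Fine stage: compose the k * k survivors into full
    trajectories and score only those jointly. Returns the best pair and
    the number of composed trajectories that were finely scored.
    """
    top_paths = sorted(paths, key=path_score, reverse=True)[:k]
    top_speeds = sorted(speeds, key=speed_score, reverse=True)[:k]
    candidates = list(product(top_paths, top_speeds))
    best = max(candidates, key=lambda pair: joint_score(*pair))
    return best, len(candidates)


paths = ["left", "straight", "right"]
speeds = [2.0, 5.0, 7.0]
best, n_fine = coarse_to_fine(
    paths,
    speeds,
    path_score={"left": 0.1, "straight": 0.9, "right": 0.6}.get,
    speed_score=lambda v: -abs(v - 5.0),
    joint_score=lambda p, v: (p == "right") + (v == 7.0),
)
print(best, n_fine)  # ('right', 7.0) 4
```

The point of the factorization is combinatorial coverage: storing |paths| + |speeds| items represents |paths| × |speeds| trajectories, while the expensive joint scorer runs on only k² of them.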


[28] LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning cs.CV | cs.AI | cs.ROPDF

Haihong Hao, Lei Chen, Mingfei Han, Changlin Li, Dong An

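TL;DR: This paper proposes LatentPilot, a new paradigm for vision-and-language navigation (VLN) in which latent visual reasoning lets the agent "dream ahead" about future visual dynamics during training, learning the causal relationship between actions and visual change and making navigation decisions more robust. It uses a flywheel-style training mechanism, needs no access to future frames at inference, and achieves state-of-the-art results on several VLN benchmarks.

Details

Motivation: Existing VLN models reason mainly over past and current observations and ignore the future visual dynamics induced by actions, so they understand the causal link between actions and visual change poorly, which limits robust decision-making. Inspired by the human ability to imagine the near future from action-dynamics causality, the paper has the agent learn action-conditioned visual dynamics to improve environmental understanding and navigation choices.

Result: LatentPilot achieves new state-of-the-art results on the R2R-CE, RxR-CE, and R2R-PE benchmarks, and real-robot tests across diverse environments show a superior understanding of environment-action dynamics.

Insight: The novelty includes a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to match the agent's behavior distribution, with an expert takeover when the agent deviates excessively. It also includes visual latent tokens learned without explicit supervision: they attend globally in a continuous latent space and are carried across steps, letting the agent dream ahead and reason about how actions affect subsequent observations.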

Abstract: Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent’s behavior distribution, with an expert takeover triggered when the agent deviates excessively. LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to dream ahead and reason about how actions will affect subsequent observations. Experiments on R2R-CE, RxR-CE, and R2R-PE benchmarks achieve new SOTA results, and real-robot tests across diverse environments demonstrate LatentPilot’s superior understanding of environment-action dynamics in the scene. Project page: https://abdd.top/latentpilot/


[29] CT-to-X-ray Distillation Under Tiny Paired Cohorts: An Evidence-Bounded Reproducible Pilot Study cs.CVPDF

Bo Ma, Jinsong Wu, Weiqi Yan, Hongjiang Wei

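TL;DR: This study asks whether, on a small patient-level paired chest imaging cohort, CT can serve as training-only supervision for a binary disease classifier that needs only X-rays at inference. It treats the problem as cross-modality teacher-student distillation and uses JDCNet as an executable pilot scaffold. On the original split, a stripped-down cross-modal logit-KD control attains the best mean result on the validation subset (0.875 accuracy, 0.714 macro-F1), while the full module-augmented JDCNet variant does worse (0.750 accuracy, 0.429 macro-F1). Across eight patient-level Monte Carlo resamples, with stronger mechanism controls based on attention transfer and feature hints and with imbalance-sensitive analyses, the cross-modal advantage is not robust: late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), and the plain cross-modal control drops to 0.500 balanced accuracy.

Details

Motivation: Chest X-ray and CT provide complementary information for diagnosing thoracic disease, yet most computer-aided diagnosis models are trained and deployed on a single modality. The study focuses on a concrete, deployment-oriented question: on a patient-level paired chest imaging cohort, can CT be used as training-only supervision, with no CT required at inference, to train a binary X-ray disease classifier?

Result: On the original patient-level paired split, the stripped-down cross-modal logit-KD control attains the highest mean result on the four-image validation subset (0.875 accuracy, 0.714 macro-F1). Under the eight-resample Monte Carlo protocol the cross-modal advantage is not robust: late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), and the plain cross-modal control drops to 0.500 mean balanced accuracy. Neither attention transfer nor feature hints recover a robust cross-modality advantage.

Insight: The contribution is not a validated CT-to-X-ray architecture but a reproducible, evidence-bounded pilot protocol that makes explicit the task definition, failure modes, ranking instability, and the minimum requirements for credible future CT-to-X-ray transfer claims. The study underlines that, for cross-modality distillation on tiny paired cohorts, a rigorous evaluation protocol and robust results matter more than headline performance numbers.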

Abstract: Chest X-ray and computed tomography (CT) provide complementary views of thoracic disease, yet most computer-aided diagnosis models are trained and deployed within a single imaging modality. The concrete question studied here is narrower and deployment-oriented: on a patient-level paired chest cohort, can CT act as training-only supervision for a binary disease versus non-disease X-ray classifier without requiring CT at inference time? We study this setting as a cross-modality teacher–student distillation problem and use JDCNet as an executable pilot scaffold rather than as a validated superior architecture. On the original patient-level paired split from a public paired chest imaging cohort, a stripped-down plain cross-modal logit-KD control attains the highest mean result on the four-image validation subset (0.875 accuracy and 0.714 macro-F1), whereas the full module-augmented JDCNet variant remains at 0.750 accuracy and 0.429 macro-F1. To test whether that ranking is a split artifact, we additionally run eight patient-level Monte Carlo resamples with same-case comparisons, stronger mechanism controls based on attention transfer and feature hints, and imbalance-sensitive analyses. Under this resampled protocol, late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), the plain cross-modal control drops to 0.500 mean balanced accuracy, and neither attention transfer nor feature hints recover a robust cross-modality advantage. The contribution of this study is therefore not a validated CT-to-X-ray architecture, but a reproducible and evidence-bounded pilot protocol that makes the exact task definition, failure modes, ranking instability, and the minimum requirements for future credible CT-to-X-ray transfer claims explicit.
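
The abstract reports accuracy, macro-F1, and balanced accuracy side by side, and the three can disagree sharply on tiny, imbalanced subsets. The sketch below implements the standard definitions in plain Python and applies them to a hypothetical four-image example in which a classifier predicts the majority class for every image. The labels are invented for illustration and are not the paper's data.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def _classes(y_true: Sequence[int]) -> List[int]:
    return sorted(set(y_true))


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in _classes(y_true):
        support = sum(t == label for t in y_true)
        hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1; a class with no TP scores 0."""
    scores = []
    for label in _classes(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical subset: three diseased images, one healthy, all predicted diseased.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]
print(accuracy(y_true, y_pred))            # 0.75
print(balanced_accuracy(y_true, y_pred))   # 0.5
print(round(macro_f1(y_true, y_pred), 3))  # 0.429
```

A constant predictor always scores 0.500 balanced accuracy on a binary task, whatever the class ratio, which is why that value is usually read as chance level. The toy numbers happen to coincide with figures quoted in the abstract, but the abstract does not say what the models actually predicted, so no such inference should be drawn from this example.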


[30] Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting cs.CVPDF

Huaqi Tao, Bingxi Liu, Guangcheng Chen, Fulin Tang, Li He

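TL;DR: This paper proposes SplatHLoc, a hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. To address the sparsity of database images, an adaptive viewpoint retrieval method synthesizes virtual candidate images whose viewpoints are closer to the query, improving the accuracy of the initial pose estimate. For feature matching, Gaussian-rendered features and features extracted directly from images have different strengths in the two matching stages, so a hybrid feature matching strategy is introduced for more accurate and efficient pose estimation.

Details

Motivation: Point-based hierarchical relocalization methods are limited by sparse image observations and weak feature matching. The work aims to improve the robustness and accuracy of visual relocalization.

Result: Extensive experiments on indoor and outdoor datasets show that SplatHLoc improves the robustness of visual relocalization and sets a new state of the art.

Insight: The novelty is bringing Feature Gaussian Splatting into a hierarchical relocalization framework as the scene representation, together with adaptive viewpoint retrieval that synthesizes virtual candidates and a hybrid matching strategy that combines Gaussian-rendered and image-extracted features, using each where it is strongest (the former in the coarse stage, the latter in the fine stage).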

Abstract: Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera’s pose when it revisits a previously known scene. While point-based hierarchical relocalization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. To address the sparsity of database images, we propose an adaptive viewpoint retrieval method that synthesizes virtual candidates with viewpoints more closely aligned with the query, thereby improving the accuracy of initial pose estimation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc enhances the robustness of visual relocalization, setting a new state-of-the-art.


[31] SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation cs.CV | cs.AIPDF

Ryosuke Matsuda, Keito Kudo, Haruto Yoshida, Nobuyuki Shimizu, Jun Suzuki

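TL;DR: This paper proposes SLVMEval, a synthetic benchmark for meta-evaluating evaluation systems for text-to-long-video generation. It assesses these systems on videos of up to about 3 hours by creating controlled "high-quality versus low-quality" video pairs and using crowdsourcing to keep only pairs whose degradation is clearly perceptible to humans, then testing how reliably existing evaluation systems rank those pairs.

Details

Motivation: Existing text-to-video evaluation systems may fall short when judging the quality of long videos (up to about 3 hours). The benchmark checks whether they can assess video quality accurately in settings that are easy for humans to judge.

Result: Human evaluators identify the better long video with 84.7%-96.8% accuracy, while in nine of the 10 aspects the accuracy of existing evaluation systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.

Insight: The novelty is a long-video meta-evaluation benchmark built from synthetic degradation and crowdsourced filtering. Its controlled pairwise comparison framework systematically exposes the limitations of existing evaluation systems on long videos and points to directions for improving them.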

Abstract: This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled “high-quality versus low-quality” pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, and in nine of the 10 aspects, the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
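
The pairwise meta-evaluation protocol reduces to a simple accuracy computation: an evaluation system is correct on a pair when it scores the original video above its degraded counterpart. The sketch below assumes the system outputs one scalar score per video and that ties count as errors; both are assumptions, and the function name `pairwise_accuracy` is hypothetical.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs ranked correctly by an evaluation system.

    Each pair is (score of the high-quality video, score of the degraded
    video). A pair counts as correct only when the first score is strictly
    greater; ties are treated as errors (an assumption of this sketch).
    """
    correct = sum(high > low for high, low in score_pairs)
    return correct / len(score_pairs)


# Hypothetical scores from one evaluation system on four video pairs.
pairs = [(0.9, 0.4), (0.5, 0.7), (0.8, 0.8), (0.6, 0.1)]
print(pairwise_accuracy(pairs))  # 0.5
```

Computing this separately for each of the 10 degradation aspects gives per-aspect accuracies that can be compared directly with the human range reported in the abstract.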


[32] Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention cs.CV | cs.AIPDF

Sunil Tiwari, Payal Fofadiya

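TL;DR: This paper proposes a multi-layer memory architecture for LLM agents that addresses semantic drift and unstable memory retention in long-horizon dialogue systems. The framework decomposes dialogue history into working, episodic, and semantic memory layers and uses adaptive retrieval gating and retention regularization to control cross-session drift while keeping context growth bounded and computation efficient.

Details

Motivation: Long-horizon dialogue systems suffer from semantic drift and unstable memory retention as sessions grow longer.

Result: On the LOCOMO, LOCCO, and LoCoMo benchmarks, the method achieves 46.85 Success Rate, 0.618 overall F1 (0.594 multi-hop F1), and 56.90% six-period retention, while reducing the false memory rate to 5.1% and context usage to 58.40%. The results indicate improved long-term retention and reasoning stability under constrained context budgets.

Insight: The novelty is layering memory into working, episodic, and semantic tiers and adding adaptive gating and regularization. This offers a reusable modular design for managing long contexts that balances memory capacity, accuracy, and computational cost.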

Abstract: Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across extended sessions. This paper presents a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization. The architecture controls cross-session drift while maintaining bounded context growth and computational efficiency. Experiments on LOCOMO, LOCCO, and LoCoMo show improved performance, achieving 46.85 Success Rate, 0.618 overall F1 with 0.594 multi-hop F1, and 56.90% six-period retention while reducing false memory rate to 5.1% and context usage to 58.40%. Results confirm enhanced long-term retention and reasoning stability under constrained context budgets.


[33] M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding cs.CVPDF

U. V. B. L. Udugama, George Vosselman, Francesco Nex

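TL;DR: This paper presents M2H-MX, a multi-task dense perception model for real-time monocular spatial understanding. It introduces register-gated global context and controlled cross-task interaction in a lightweight decoder so that depth and semantic predictions reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface.

Details

Motivation: Monocular cameras are attractive for robotic perception because they are cheap and easy to deploy, but reliable real-time spatial understanding from a single image stream remains challenging. Existing approaches struggle to turn advanced multi-task dense prediction models into stable monocular mapping systems.

Result: On NYUDv2, M2H-MX-L achieves state-of-the-art (SOTA) results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. Deployed in a real-time monocular mapping system on ScanNet, it reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline and produces cleaner metric-semantic maps.

Insight: The novelty is the register-gated global context and controlled cross-task interaction in a lightweight decoder, which let depth and semantic prediction cooperate under real-time constraints. Viewed objectively, the compact perception-to-mapping interface allows a high-performing multi-task dense prediction model to be integrated reliably into existing monocular SLAM systems, moving perception-to-mapping closer to practical deployment.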

Abstract: Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.


[34] Diffusion Mental Averages cs.CVPDF

Phonphrm Thawatdamrongkit, Sukit Seripanitkarn, Supasorn Suwajanakorn

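TL;DR: This paper proposes Diffusion Mental Averages (DMA), a method that produces a sharp, realistic “mental average” image of a concept from a diffusion model, rather than a blurry average over an existing image collection. It does so by optimizing trajectory alignment within the diffusion model’s semantic space, and it extends to multimodal concepts.

Details

Motivation: Existing data-centric methods yield blurry results when averaging diffusion samples generated from the same prompt because they ignore the generative process. The paper aims to produce, from inside the model, a sharp visual prototype that captures the essence of a concept.

Result: The method produces consistent, realistic average images, even for abstract concepts, that serve as concrete visual summaries and expose model biases and concept representations. The abstract reports no specific benchmarks or quantitative comparisons.

Insight: The core idea is to recast averaging as alignment of noise-latent denoising trajectories in the diffusion model’s semantic space, and to handle multimodal concepts by clustering in semantically rich spaces such as CLIP, which yields a model-centric notion of concept averaging.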

Abstract: Can a diffusion model produce its own “mental average” of a concept, one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model’s semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.


[35] Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism cs.CV | cs.AIPDF

Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang

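TL;DR: This paper proposes FlexMem, a training-free method that mimics the visual memory humans use when watching video so that multimodal large language models (MLLMs) can understand videos of unbounded length. It uses a dual-pathway compression design for memory transfer and writing and explores several memory reading strategies for different video understanding tasks. Experiments show that on a single 3090 GPU FlexMem processes more than 1,000 frames, outperforms existing efficient video understanding methods, and on some benchmarks lets base MLLMs match or exceed SOTA models such as GPT-4o and Gemini-1.5 Pro.

Details

Motivation: MLLMs that process all video information at once are capped by an input-length limit. The paper introduces a visual memory mechanism that mimics how humans watch video in order to remove that limit.

Result: Across five long-video tasks and one streaming-video task, FlexMem processes more than 1,000 frames on a single 3090 GPU, outperforms existing efficient video understanding methods, and helps base MLLMs reach results comparable to or better than SOTA models (such as GPT-4o and Gemini-1.5 Pro) on some benchmarks.

Insight: The novelty is a training-free visual memory mechanism (FlexMem) that transfers and writes memory efficiently through dual-pathway compression and adapts its memory reading strategy to the task. This removes the input-length limit of MLLMs for long video understanding and mirrors the way humans watch video.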

Abstract: Long video understanding is a key challenge that hinders the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanisms and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human video-watching behavior, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite length, unlike previous methods that process all video information at once and therefore have an upper limit on input length. Concretely, FlexMem first treats the visual KV caches as the memory source and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming setting. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames. It also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.


[36] Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding cs.CV | cs.AIPDF

Jingqi Xu

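TL;DR: This paper proposes Omni-NegCLIP, a CLIP model fine-tuned with a modified version of CLIP’s original InfoNCE contrastive loss to improve understanding of two kinds of negation: presence-based and absence-based. Based on the observation that the front transformer layers of CLIP’s text encoder learn negated text more readily, the method fine-tunes those front layers at each training step with purpose-built contrastive objectives. Experiments show clear gains on both negation tasks without hurting, and in fact improving, general image-text retrieval.

Details

Motivation: Existing vision-language models such as CLIP handle negation poorly, even though negation is common in natural language, which limits their use in real-world settings. The paper addresses CLIP’s weakness on two kinds of negation: presence-based and absence-based.

Result: Compared with pretrained CLIP, Omni-NegCLIP improves presence-based negation performance by up to 52.65% and absence-based negation performance by up to 12.50%. General image-text retrieval does not degrade and instead improves by up to 19.62%. Compared with prior work, the model shows a more comprehensive understanding of multiple negation types.

Insight: The contributions are (1) two targeted contrastive objectives (presence-based and absence-based) for fine-tuning CLIP, and (2) the observation, exploited through layer-selective fine-tuning, that the front transformer layers of CLIP’s text encoder learn negated text more effectively. Viewed objectively, fine-grained loss design combined with targeted layer selection improves the model’s handling of a difficult linguistic phenomenon (negation) while preserving general capability, making it an efficient and targeted fine-tuning strategy.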

Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs, such as CLIP, perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP’s understanding of two types of negation by modifying CLIP’s original InfoNCE contrastive loss. The two types are presence-based negation, which negates objects that are actually present in an image, and absence-based negation, which negates objects that may plausibly exist in an image but are in fact absent. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of the CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.
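To make the presence-based objective in this abstract concrete, here is a minimal sketch, assuming a two-way InfoNCE-style form: the image embedding is contrasted against its original caption (positive) and the presence-negated caption (negative). The exact loss used by Omni-NegCLIP is not given in the abstract, so the function name, the temperature value, and the two-candidate softmax are all assumptions.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def presence_negation_loss(
    image: Sequence[float],
    caption: Sequence[float],
    negated_caption: Sequence[float],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Cross-entropy over {caption, negated caption} with the caption as the target."""
    pos = cosine(image, caption) / temperature
    neg = cosine(image, negated_caption) / temperature
    # Numerically stable log-sum-exp over the two candidates.
    m = max(pos, neg)
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos


# Hypothetical 2-D embeddings, purely for illustration.
image_emb = [1.0, 0.0]
caption_emb = [0.9, 0.1]
negated_emb = [0.1, 0.9]
loss_aligned = presence_negation_loss(image_emb, caption_emb, negated_emb)
loss_swapped = presence_negation_loss(image_emb, negated_emb, caption_emb)
print(loss_aligned, loss_swapped)
```

The loss is small when the image sits closer to the original caption than to its negated version and grows when the two are swapped, which is the pull-and-push behavior the abstract describes.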


[37] ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation cs.CVPDF

Wenyang Chen, Zhanxuan Hu, Yaping Zhang, Hailong Ning, Yonghang Tai

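TL;DR: This paper proposes ConInfer, a context-aware inference framework for training-free open-vocabulary remote sensing segmentation (OVRSS). By jointly predicting over multiple spatial units and explicitly modeling their semantic dependencies, it addresses the mismatch between the independent patch-level prediction of existing vision-language model (VLM) approaches and the large-scale, strongly correlated nature of remote sensing data, improving segmentation consistency, robustness, and generalization.

Details

Motivation: Existing training-free OVRSS methods focus on better feature representations or on reducing modality discrepancies to improve patch-level accuracy. Independent prediction of this kind is at odds with the large-scale, strongly spatially and semantically correlated nature of remote sensing data, which limits segmentation accuracy in complex real-world scenes.

Result: Extensive experiments on multiple benchmark datasets show average improvements of 2.80% on open-vocabulary semantic segmentation and 6.13% on object extraction over state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV.

Insight: The main novelty is bringing context-aware inference to OVRSS: predictions are made jointly by modeling semantic dependencies between spatial units rather than by classifying patches in isolation. This offers an inference paradigm better matched to data with strong spatial correlation, such as remote sensing imagery.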

Abstract: Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/Dog-Yang/ConInfer


[38] PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models cs.CV | cs.AI | cs.ROPDF

Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil

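TL;DR: This paper introduces PRISM, a multi-view retail video supervised fine-tuning dataset of 270K samples designed for embodied vision-language models in real-world retail environments. It is built on a novel three-dimensional knowledge ontology covering spatial, temporal-physical, and embodied action knowledge, and includes 20+ capability probes across four evaluation dimensions. Fine-tuning on PRISM reduces the error rate of the pre-trained baseline by 66.6% and markedly improves embodied action understanding.

Details

Motivation: There is a critical gap between general-purpose physical AI models and the specialized perceptual demands of structured real-world deployment environments. Physical AI systems usually fail not because of poor visual recognition but because they understand space, physical dynamics, and embodied action too poorly to operate reliably in the real world.

Result: After fine-tuning on PRISM, the error rate across all 20+ capability probes falls by 66.6% relative to the pre-trained baseline, and accuracy on embodied action understanding improves by 36.4%.

Insight: The novelty is a large-scale, multi-view, domain-specific (retail) video SFT dataset built on a three-dimensional knowledge ontology (spatial, temporal-physical, embodied action). It is the first to instantiate all three knowledge dimensions within a single real-world deployment domain, and it suggests that ontology-structured, domain-specific SFT can strengthen embodied VLMs in real-world settings.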

Abstract: A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding where the accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism


[39] MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network cs.CV | cs.AIPDF

Guozhi Qiu, Zhiwei Chen, Zixu Li, Qinlei Huang, Zhiheng Fu

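TL;DR: This paper proposes the MELT network for composed image retrieval (CIR). It targets two problems: frequency bias that causes rare samples to be neglected, and similarity scores that are easily disturbed by hard negatives and noise. By paying more attention to rare modification semantics and applying diffusion-based denoising, MELT improves multimodal fusion and matching.

Details

Motivation: Existing CIR methods suffer from frequency bias that leads to neglect of rare samples, and their similarity scores are susceptible to interference from hard negative samples and noise.

Result: Extensive experiments on two CIR benchmarks validate the superior performance of MELT.

Insight: The novelty is the combination of asymmetric rare-semantic localization and diffusion-based denoising, which balances modification frequency against rarity and makes multimodal fusion more robust.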

Abstract: Composed Image Retrieval (CIR) uses a reference image and a modification text as a query to retrieve a target image satisfying the requirement of “modifying the reference image according to the text instructions”. However, existing CIR methods face two limitations: (1) frequency bias leading to “Rare Sample Neglect”, and (2) susceptibility of similarity scores to interference from hard negative samples and noise. To address these limitations, we confront two key challenges: asymmetric rare semantic localization and robust similarity estimation under hard negative samples. To solve these challenges, we propose the Modification frEquentation-rarity baLance neTwork (MELT). MELT assigns increased attention to rare modification semantics in multimodal contexts while applying diffusion-based denoising to hard negative samples with high similarity scores, enhancing multimodal fusion and matching. Extensive experiments on two CIR benchmarks validate the superior performance of MELT. Codes are available at https://github.com/luckylittlezhi/MELT.


[40] GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection cs.CVPDF

Yaning Zhang, Linlin Shen, Zitong Yu, Chunjie Ma, Zan Gao

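TL;DR: The paper proposes GazeCLIP, a new method for fine-grained deepfake attribution and detection. It combines CLIP with a gaze-guided mechanism and adaptive-enhanced fine-grained language prompts to improve generalization to unseen deepfake generation methods.

Details

Motivation: Current deepfake attribution and detection methods explore only the visual modality, which leads to poor generalization to novel generation methods, and they do not exploit the synergy between the attribution and detection tasks.

Result: On the authors’ fine-grained benchmark, the method exceeds the previous best method by 6.56% in average accuracy (ACC) and 5.32% in AUC under the attribution and detection settings, respectively, reaching SOTA.

Insight: The contributions include using the distribution difference between gaze vectors in real and forged images as a key cue, with a visual perception encoder and a gaze-aware image encoder designed to mine global forgery features. A language refinement encoder generates dynamically enhanced language embeddings for precise vision-language matching, which improves generalization.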

Abstract: Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to the limited exploration in visual modalities alone. They also tend to assess the attribution or detection performance of models on unseen advanced generators only coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we construct a novel, fine-grained benchmark to evaluate the DFAD performance of networks on novel generators like diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, which is devised to enhance the generalization to unseen face forgery attacks. Built upon the novel observation that there are significant distribution differences between pristine and forged gaze vectors, and that the preservation of the target gaze in facial images generated by GAN and diffusion varies significantly, we design a visual perception encoder to employ the inherent gaze differences to mine global forgery embeddings across appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted via a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) to generate dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Codes will be available on GitHub.


[41] MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting cs.CVPDF

Haoran Zhou, Gim Hee Lee

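TL;DR: MotionScale is a 4D Gaussian Splatting framework for dynamic scene reconstruction that efficiently recovers appearance, geometry, and motion from monocular video. It is suited to large scenes and long sequences and maintains high-fidelity structural and motion coherence.

Details

Motivation: Existing neural rendering methods struggle to recover accurate 3D geometry and temporally consistent motion in complex environments, so a method is needed that scales efficiently while preserving high-fidelity reconstruction.

Result: On challenging real-world benchmarks, MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability.

Insight: The contributions are a scalable motion field parameterized by cluster-centric basis transformations that captures diverse, evolving motion patterns, and a decoupled progressive optimization strategy with a background extension stage and a foreground propagation stage. The background stage adapts to newly visible regions and models transient shadows; the foreground stage enforces motion consistency through a three-stage refinement.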

Abstract: Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: https://hrzhou2.github.io/motion-scale-web/.


[42] Self-Consistency for LLM-Based Motion Trajectory Generation and Verification cs.CVPDF

Jiaju Ma, R. Kenny Jones, Jiajun Wu, Maneesh Agrawala

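TL;DR: This paper carries self-consistency over from natural language reasoning tasks to the visual domain to improve LLM-based motion trajectory generation and verification. It models the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations and uses clustering to identify consistent trajectories, improving both generation accuracy and verification precision.

Details

Motivation: Self-consistency is known to improve LLM performance on natural language reasoning tasks, but how to adapt it to visual domains, particularly motion trajectory generation and verification, remains unexplored.

Result: On motion trajectory generation, the method improves accuracy by 4-6%; on verification, it improves precision by 11% over vision-language model baselines.

Insight: The novelty is modeling a shape family as a prototype trajectory paired with a group of geometric transformations and recovering the shape family automatically from hierarchical relationships, which enables self-consistency reasoning in the visual domain within an unsupervised, lightweight framework.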

Abstract: Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains. Specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., “Move the circle in a spiral path”), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering. Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, and affine). Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using hierarchical relationships between a set of candidate transformation groups. Our approach improves the accuracy of LLM-based trajectory generation by 4-6%. We further extend our method to support verification, observing 11% precision gains over VLM baselines. Our code and dataset are available at https://majiaju.io/trajectory-self-consistency.
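The consistency test in this abstract (one trajectory can be mapped onto another by a transformation in the allowed group) can be illustrated with a small least-squares fit. The sketch below is not the authors' algorithm: it assumes 2D trajectories sampled at corresponding points, covers only the rigid and similarity groups (no reflections, no affine), and uses a hypothetical residual tolerance. Points are treated as complex numbers so that rotation plus uniform scale is a single complex multiplication.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def alignment_residual(src: Sequence[Point], dst: Sequence[Point], group: str = "similarity") -> float:
    """Relative squared error after the best rigid or similarity fit of src onto dst."""
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("trajectories need the same length (at least 2 points)")
    p: List[complex] = [complex(x, y) for x, y in src]
    q: List[complex] = [complex(x, y) for x, y in dst]
    # Removing the centroids handles the translation component.
    p_mean, q_mean = sum(p) / len(p), sum(q) / len(q)
    pc = [z - p_mean for z in p]
    qc = [z - q_mean for z in q]
    p_energy = sum(abs(z) ** 2 for z in pc)
    q_energy = sum(abs(z) ** 2 for z in qc)
    if p_energy == 0 or q_energy == 0:
        raise ValueError("degenerate trajectory (all points coincide)")
    # Least-squares complex factor a minimizing sum |a * p - q|^2 (rotation and scale).
    a = sum(z.conjugate() * w for z, w in zip(pc, qc)) / p_energy
    if group == "rigid":
        a = a / abs(a) if abs(a) > 0 else 1 + 0j  # keep rotation only, drop the scale
    elif group != "similarity":
        raise ValueError("group must be 'rigid' or 'similarity'")
    error = sum(abs(a * z - w) ** 2 for z, w in zip(pc, qc))
    return error / q_energy


def is_consistent(src: Sequence[Point], dst: Sequence[Point], group: str, tol: float = 1e-6) -> bool:
    """Hypothetical threshold test: consistent if the fit residual is below tol."""
    return alignment_residual(src, dst, group) < tol


# An L-shaped path, and the same path rotated 90 degrees, scaled by 2, and shifted.
l_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
l_path_warped = [(3.0, -1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
zigzag = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (3.0, 1.0)]

print(is_consistent(l_path, l_path_warped, "similarity"))
print(is_consistent(l_path, l_path_warped, "rigid"))
print(is_consistent(l_path, zigzag, "similarity"))
```

The rigid test is stricter than the similarity test, which mirrors the nesting of transformation groups mentioned in the abstract; pairwise results like these could then feed a clustering step.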


[43] StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision cs.CVPDF

Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, Liujuan Cao

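TL;DR: This paper proposes StereoVGGT, a training-free feature backbone designed for stereo vision tasks. Built on the pretrained visual geometry foundation model VGGT, it adds a training-free feature adjustment pipeline that mitigates the loss of geometric detail during feature extraction and exploits the latent camera calibration knowledge embedded in the model. A StereoVGGT-based stereo matching network ranks first among all published methods on the KITTI benchmark.

Details

Motivation: Current stereo vision backbones rely mainly on monocular depth estimation models or visual foundation models, which are pretrained without explicit supervision of camera poses. The resulting lack of geometric knowledge is a performance bottleneck. VGGT is pretrained with 3D priors including camera poses, but applying it directly to stereo vision performs poorly because its feature extraction significantly degrades geometric detail.

Result: A StereoVGGT-based stereo matching network ranks first among all published methods on the KITTI benchmark, achieving state-of-the-art performance.

Insight: The novelty is a training-free feature adjustment pipeline that exploits the geometric priors already present in the pretrained foundation model VGGT (such as camera calibration) and, in response to the binocular geometric consistency that stereo vision requires, mitigates geometric degradation during feature extraction. The result is an efficient and strong feature backbone for stereo vision.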

Abstract: Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) operates as a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers from significant degradation of geometric details during feature extraction. This characteristic conflicts with the requirements of binocular stereo vision, thereby constraining its efficacy for related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. A StereoVGGT-based stereo matching network ranks 1st among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.


[44] Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement cs.CVPDF

Fabian Kabus, Julia Hindel, Jelena Bratulić, Meropi Karakioulaki, Ayush Gupta

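TL;DR: This paper proposes TriDerm, a multimodal framework for evaluating and learning embeddings of chronic wounds in recessive dystrophic epidermolysis bullosa (RDEB). It integrates wound images, boundary masks, and expert reports, uses expert triplet comparisons (quickly collected ordinal judgments) to learn interpretable wound representations, and fuses visual and textual modalities to better capture clinical similarity knowledge.

Details

Motivation: For RDEB, a rare genetic skin disorder, existing foundation models do not reliably capture clinically meaningful features under the disease’s heterogeneity and long-tail distribution, and agreement with experts is hard to measure in a structured way. A method is therefore needed to evaluate and learn multimodal embeddings effectively.

Result: In the expert agreement evaluation, TriDerm with multimodal fusion reaches 73.5% agreement with experts, more than 5.6 percentage points above the best single-modality foundation model, indicating that multimodal fusion clearly improves performance.

Insight: The contributions include using expert triplet comparisons, as quickly collected ordinal judgments, to evaluate embedding spaces. The TriDerm framework combines adaptation of visual foundation models (wound-level attention pooling and non-contrastive representation learning) with comparison-query prompting of large language models and soft ordinal embeddings (SOE) to learn interpretable multimodal representations. The paper also shows that visual and textual modalities capture complementary aspects of wound phenotype.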

Abstract: Recessive dystrophic epidermolysis bullosa (RDEB) is a rare genetic skin disorder for which clinicians greatly benefit from finding similar cases using images and clinical text. However, off-the-shelf foundation models do not reliably capture clinically meaningful features for this heterogeneous, long-tail disease, and structured measurement of agreement with experts is challenging. To address these gaps, we propose evaluating embedding spaces with expert ordinal comparisons (triplet judgments), which are fast to collect and encode implicit clinical similarity knowledge. We further introduce TriDerm, a multimodal framework that learns interpretable wound representations from small cohorts by integrating wound imagery, boundary masks, and expert reports. On the vision side, TriDerm adapts visual foundation models to RDEB using wound-level attention pooling and non-contrastive representation learning. For text, we prompt large language models with comparison queries and recover medically meaningful representations via soft ordinal embeddings (SOE). We show that visual and textual modalities capture complementary aspects of wound phenotype, and that fusing both modalities yields 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. We make the expert annotation tool, model code and representative dataset samples publicly available.
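The expert-triplet evaluation described in this abstract can be sketched as an agreement rate: for each expert judgment "case A is more similar to the anchor than case B", check whether the embedding space places A closer to the anchor than B. The Euclidean distance, the function name, and the toy embeddings below are assumptions for illustration; the abstract does not specify the distance used.

```python
import math
from typing import Dict, Iterable, Sequence, Tuple

# (anchor, closer, farther): the expert judged `closer` more similar to `anchor`.
Triplet = Tuple[str, str, str]


def triplet_agreement(
    embeddings: Dict[str, Sequence[float]],
    triplets: Iterable[Triplet],
) -> float:
    """Fraction of expert triplets whose ordering the embedding space reproduces."""
    triplets = list(triplets)
    if not triplets:
        raise ValueError("at least one triplet is required")
    agree = 0
    for anchor, closer, farther in triplets:
        d_close = math.dist(embeddings[anchor], embeddings[closer])
        d_far = math.dist(embeddings[anchor], embeddings[farther])
        if d_close < d_far:
            agree += 1
    return agree / len(triplets)


# Hypothetical 2-D wound embeddings and expert judgments.
toy_embeddings = {"w1": (0.0, 0.0), "w2": (1.0, 0.0), "w3": (5.0, 0.0), "w4": (0.0, 4.0)}
toy_triplets = [
    ("w1", "w2", "w3"),  # embedding agrees: 1 < 5
    ("w1", "w4", "w3"),  # embedding agrees: 4 < 5
    ("w3", "w1", "w2"),  # embedding disagrees: 5 > 4
    ("w2", "w3", "w1"),  # embedding disagrees: 4 > 1
]
agreement = triplet_agreement(toy_embeddings, toy_triplets)
print(agreement)
```

Because the metric depends only on distance orderings, it can compare embedding spaces of different dimensionality or scale on the same set of expert judgments.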


[45] Hallucination-aware intermediate representation edit in large vision-language models cs.CV | cs.AIPDF

Wei Suo, Hanzu Zhang, Lijun Zhang, Ji Ma, Peng Wang

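TL;DR: This paper proposes HIRE, a framework that dynamically detects hallucination representations in large vision-language models and edits them to eliminate hallucinations. With minimal additional computational cost, it achieves state-of-the-art performance on existing benchmarks and addresses outputs that contradict visual facts.

Details

Motivation: Large vision-language models excel at multimodal reasoning and complex scene understanding but still hallucinate, producing outputs that contradict visual facts. Existing mitigations either require substantial training resources (retraining) or introduce dual inference overhead (contrastive decoding), which limits their practicality.

Result: The method achieves state-of-the-art performance on existing benchmarks. Extensive experiments validate its effectiveness and show efficient, robust hallucination elimination and strong controllability.

Insight: The novelty is a lightweight framework that dynamically detects and edits hallucinated intermediate representations, avoiding the high cost of retraining and the dual inference overhead of contrastive decoding while achieving efficient and controllable hallucination elimination.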

Abstract: Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination-eliminating edits on these representations. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Code is available at https://github.com/ASGO-MM/HIRE


[46] AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models cs.CV | cs.AI | cs.LGPDF

Yubo Cui, Xianchao Guan, Zijun Xiong, Zheng Zhang

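TL;DR: This paper proposes AGFT, an alignment-guided fine-tuning framework that improves the zero-shot adversarial robustness of vision-language models while preserving cross-modal semantic structure. It uses the original model’s probabilistic predictions for text-guided adversarial training and applies a distribution consistency calibration mechanism that adjusts the fine-tuned model’s outputs to match the pre-trained model’s predictive distribution.

Details

Motivation: Pre-trained vision-language models generalize well zero-shot but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt the pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance.

Result: Extensive experiments across multiple zero-shot benchmarks show that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.

Insight: The novelty is using the original model’s soft alignment distributions for text-guided adversarial training to preserve cross-modal semantic structure, and using distribution consistency calibration to correct structural discrepancies introduced by fine-tuning, so that adversarial robustness improves without harming zero-shot performance.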

Abstract: Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature-scaled version of the pre-trained model predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.
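The distribution consistency calibration in this abstract matches the robust model's output to a temperature-scaled version of the pre-trained model's predictions. A minimal sketch, assuming the mismatch is measured with a KL divergence and that the temperature is applied only to the pre-trained logits: both choices, along with the function names and the temperature value, are assumptions rather than details confirmed by the abstract.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def calibration_kl(
    pretrained_logits: Sequence[float],
    robust_logits: Sequence[float],
    temperature: float = 2.0,  # assumed value, not taken from the paper
) -> float:
    """KL(target || prediction), where target is the temperature-scaled pre-trained distribution."""
    target = softmax(pretrained_logits, temperature)
    prediction = softmax(robust_logits)
    return sum(t * math.log(t / p) for t, p in zip(target, prediction) if t > 0)


# Hypothetical image-to-text similarity logits over three class prompts.
pretrained = [2.0, 1.0, 0.0]
matched_robust = [1.0, 0.5, 0.0]     # equals pretrained / temperature, so the KL is ~0
mismatched_robust = [0.0, 1.0, 2.0]  # reversed ranking, so the KL is clearly positive

kl_matched = calibration_kl(pretrained, matched_robust)
kl_mismatched = calibration_kl(pretrained, mismatched_robust)
print(kl_matched, kl_mismatched)
```

A soft target of this kind keeps the relative ordering of all text prompts, not just the top label, which is the property the abstract contrasts with hard-label fine-tuning.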


[47] Adversarial Prompt Injection Attack on Multimodal Large Language Models cs.CV | cs.AIPDF

Meiwen Ding, Song Xia, Chenqi Kong, Xudong Jiang

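TL;DR: This paper proposes an imperceptible visual prompt injection attack on closed-source multimodal large language models (MLLMs). It adaptively embeds a bounded text overlay in the input image to provide semantic guidance and iteratively optimizes a visual perturbation so that the attacked image’s feature representation aligns with malicious visual and textual targets at both coarse and fine granularity, inducing the model to follow the malicious instruction.

Details

Motivation: Existing prompt injection methods rely mainly on textual prompts or on visual prompts that humans can perceive, while the instruction-following behavior of closed-source MLLMs leaves them open to attack. The paper studies imperceptible visual prompt injection against powerful closed-source MLLMs.

Result: Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs show superior performance compared to existing methods.

Insight: The novelty is embedding the malicious instruction in the visual modality, using a bounded text overlay for semantic guidance and a coarse-to-fine feature alignment strategy. The visual target (a text-rendered image) is instantiated dynamically and refined during optimization to improve semantic fidelity and transferability.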

Abstract: Although multimodal large language models (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.


[48] Multimodal Models Meet Presentation Attack Detection on ID Documents cs.CVPDF

Marina Villanueva, Juan M. Espin, Juan E. Tapia

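TL;DR: This study explores integrating pre-trained multimodal models (such as Paligemma, Llava, and Qwen) into Presentation Attack Detection (PAD) for ID documents, combining visual features with textual metadata (such as document type, issuer, and date) to improve detection. The experiments show, however, that these models perform poorly on ID document PAD and struggle to detect presentation attacks accurately.

Details

Motivation: Traditional PAD systems rely only on visual features and often miss sophisticated spoofing attacks. The study aims to strengthen presentation attack detection on ID documents by using multimodal models that combine the visual and textual modalities.

Result: The tested pre-trained multimodal models (such as Paligemma, Llava, and Qwen) perform poorly on ID document PAD and fail to detect attacks accurately.

Insight: The novelty is bringing multimodal models (vision combined with text) to ID document PAD. The practical difficulties observed suggest that current general-purpose multimodal models may be ill-suited to this task and need further customization or optimization.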

Abstract: The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre-trained multimodal models, such as Paligemma, Llava, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect presentation attacks on ID Documents.


[49] Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions cs.CVPDF

Xuesong Wang, Harry Wang

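TL;DR: This paper addresses the systematic bias that vision-language models show on classic optical-illusion images by proposing a tool-guided inference framework that requires no model training. The framework gives the model a set of generic image manipulation tools (line drawing, region cropping, side-by-side comparison, and channel isolation) together with an illusion-type-routing system prompt that directs it to invoke the appropriate tools for each question category, correcting the model's tendency to misjudge illusions as “real”.

Details

Motivation: Vision-language models show a systematic bias on optical-illusion images: they tend to predict the illusion as “real” even when the image has been counterfactually modified. The paper aims to fix this failure mode without any model training.

Result: In the DataCV 2026 Challenge (Tasks I and II), the method performed consistently on the validation set and on a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking), demonstrating strong cross-structural generalization.

Insight: The novelty is the generic-tool-plus-routing-prompt design, which generalizes across illusion structures instead of hard-coding illusion-specific modules. The paper also reports three empirical observations that warrant further study: a positive-detection bias, a dissociation between spatial reasoning and logical inference, and sensitivity to image compression artifacts.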

Abstract: Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as “real” regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools: line drawing, region cropping, side-by-side comparison, and channel isolation, together with an illusion-type-routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.


[50] SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering cs.CVPDF

Wenli Li, Kai Zhao, Haoran Jiang, Enquan Yang, Yi Su

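TL;DR: This paper proposes SeGPruner, a semantic-geometric visual token pruning framework for 3D question answering that targets the token redundancy introduced by multi-view image inputs. It retains semantically salient tokens through an attention mechanism and uses geometric distance to guide the selection of spatially diverse tokens, sharply reducing the number of visual tokens while preserving 3D reasoning performance.

Details

Motivation: Existing visual token pruning methods mainly target 2D inputs or rely on indirect geometric cues, so in 3D QA they struggle to both retain semantically critical objects and maintain sufficient spatial coverage, which leads to inefficient inference.

Result: On the ScanQA and OpenEQA benchmarks, SeGPruner reduces the visual token budget by 91% and inference latency by 86% while maintaining competitive 3D reasoning performance.

Insight: The novelty is combining semantic saliency with 3D geometric distance for token selection. The cooperation between the attention mechanism and geometric guidance balances object-level evidence and global scene coverage under aggressive pruning.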

Abstract: Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
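
The two-stage selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_tokens`, the weight `alpha`, and the greedy scoring rule are assumptions made only for illustration. It keeps the most salient tokens first, then adds tokens that score well on a mix of saliency and 3D distance to the tokens already kept.

```python
import math

def prune_tokens(saliency, positions, n_salient, n_diverse, alpha=0.5):
    """Return indices of kept tokens (hypothetical helper, not the paper's API).

    saliency  : per-token importance scores, e.g. derived from attention
    positions : per-token 3D coordinates (x, y, z)
    alpha     : assumed weight between semantic relevance and spatial spread
    """
    # Stage 1: keep the most salient tokens.
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    kept = order[:n_salient]
    remaining = order[n_salient:]

    # Stage 2: greedily add tokens that are relevant and far from kept ones.
    for _ in range(min(n_diverse, len(remaining))):
        def score(i):
            nearest = min(
                (math.dist(positions[i], positions[j]) for j in kept),
                default=0.0,
            )
            return alpha * saliency[i] + (1 - alpha) * nearest

        best = max(remaining, key=score)
        kept.append(best)
        remaining.remove(best)
    return kept


saliency = [0.9, 0.8, 0.1, 0.2]
positions = [(0, 0, 0), (0.1, 0, 0), (5, 5, 5), (0.2, 0, 0)]
print(prune_tokens(saliency, positions, n_salient=1, n_diverse=1))  # [0, 2]
```

In this toy input, token 1 is salient but sits next to token 0, so the second stage prefers the distant token 2, which is the coverage behaviour the abstract attributes to the geometry-guided selector.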


[51] Few-shot Writer Adaptation via Multimodal In-Context Learning cs.CV | cs.AIPDF

Tom Simon, Stephane Nicolas, Pierrick Tranouez, Clement Chatelain, Thierry Paquet

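TL;DR: This paper proposes a novel context-driven handwritten text recognition framework, inspired by multimodal in-context learning, that adapts to a writer at inference time using only a few examples from the target writer and without updating model parameters. The method combines context-driven and standard OCR training strategies and, on the IAM and RIMES datasets, outperforms all writer-independent models.

Details

Motivation: State-of-the-art handwritten text recognition models perform well on standard benchmarks, but their performance drops markedly for writers with highly specific styles that are underrepresented in the training data. Existing writer adaptation methods usually require offline fine-tuning or parameter updates at inference time, which involve gradient computation and backpropagation, raise computational cost, and demand careful hyperparameter tuning.

Result: Experiments on the IAM and RIMES datasets validate the method, which achieves character error rates of 3.92% and 2.34%, respectively, surpassing all writer-independent handwritten text recognition models without any parameter updates at inference time.

Insight: The core novelty is bringing multimodal in-context learning to handwritten text recognition, enabling few-shot, inference-time writer adaptation without gradient updates. The paper also designs a compact 8M-parameter CNN-Transformer to support few-shot in-context adaptation and shows that context-driven and standard OCR training strategies yield complementary improvements.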

Abstract: While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.


[52] Square Superpixel Generation and Representation Learning via Granular Ball Computing cs.CVPDF

Shuyin Xia, Meng Yang, Dawei Dai, Fan Chen, Shilin Zhao

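TL;DR: This paper proposes a square superpixel generation method based on granular-ball computing. It approximates superpixels with multi-scale square blocks to avoid the computational and implementation difficulties caused by the irregular shapes of conventional superpixels, enabling efficient parallel processing and learnable feature extraction. The resulting square superpixels integrate readily into graph neural networks or Vision Transformers, facilitating multi-scale information aggregation and structured visual representation.

Details

Motivation: Existing superpixel algorithms usually produce irregularly shaped regions that do not align with regular operators such as convolutions, so superpixels are often used as an offline preprocessing step, which limits parallel implementation and end-to-end optimization in deep learning pipelines.

Result: Experiments on downstream tasks show consistent performance improvements, validating the method's effectiveness.

Insight: The novelty is using the adaptive representation and coverage property of granular-ball computing to approximate superpixels with square blocks, removing the computational and implementation bottlenecks of irregular shapes so that superpixels integrate better with deep learning frameworks and support end-to-end optimization and multi-scale representation learning.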

Abstract: Superpixels provide a compact region-based representation that preserves object boundaries and local structures, and have therefore been widely used in a variety of vision tasks to reduce computational cost. However, most existing superpixel algorithms produce irregularly shaped regions, which are not well aligned with regular operators such as convolutions. Consequently, superpixels are often treated as an offline preprocessing step, limiting parallel implementation and hindering end-to-end optimization within deep learning pipelines. Motivated by the adaptive representation and coverage property of granular-ball computing, we develop a square superpixel generation approach. Specifically, we approximate superpixels using multi-scale square blocks to avoid the computational and implementation difficulties induced by irregular shapes, enabling efficient parallel processing and learnable feature extraction. For each block, a purity score is computed based on pixel-intensity similarity, and high-quality blocks are selected accordingly. The resulting square superpixels can be readily integrated as graph nodes in graph neural networks (GNNs) or as tokens in Vision Transformers (ViTs), facilitating multi-scale information aggregation and structured visual representation. Experimental results on downstream tasks demonstrate consistent performance improvements, validating the effectiveness of the proposed method.
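
As a rough illustration of multi-scale square blocks chosen by a purity score, the sketch below splits a block into four quadrants until it is pure enough. The abstract does not give the purity formula or the selection procedure, so the definition used here (fraction of pixels within `tol` of the block mean), the quadtree-style recursion, and all names are assumptions.

```python
def purity(img, r, c, size, tol):
    """Fraction of pixels within `tol` of the block mean (assumed definition)."""
    vals = [img[i][j] for i in range(r, r + size) for j in range(c, c + size)]
    mean = sum(vals) / len(vals)
    return sum(abs(v - mean) <= tol for v in vals) / len(vals)


def square_superpixels(img, r, c, size, tol=10, min_purity=0.9):
    """Return (row, col, size) blocks; `size` is assumed to be a power of two."""
    if size == 1 or purity(img, r, c, size, tol) >= min_purity:
        return [(r, c, size)]
    half = size // 2
    blocks = []
    # Split an impure block into four quadrants and recurse.
    for dr in (0, half):
        for dc in (0, half):
            blocks += square_superpixels(img, r + dr, c + dc, half, tol, min_purity)
    return blocks


# 4x4 grayscale image: dark top-left quadrant, bright elsewhere.
img = [
    [0, 0, 100, 100],
    [0, 0, 100, 100],
    [100, 100, 100, 100],
    [100, 100, 100, 100],
]
print(square_superpixels(img, 0, 0, 4))
# [(0, 0, 2), (0, 2, 2), (2, 0, 2), (2, 2, 2)]
```

Because every block is an axis-aligned square, each one maps directly to a fixed-shape tensor crop, which is the property the abstract relies on for parallel processing and for use as graph nodes or ViT tokens.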


[53] VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference cs.CVPDF

Anmin Liu, Ruixuan Yang, Huiqiang Jiang, Bin Lin, Minmin Sun

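TL;DR: This paper proposes VecAttention, a vector-wise sparse attention framework for accelerating Transformer inference in long-context video understanding and generation. Building on the vertical-vector sparse pattern present in video attention maps, it dynamically processes the informative vertical vectors through lightweight important-vector selection and an optimized vector sparse attention kernel, achieving a better accuracy-efficiency trade-off.

Details

Motivation: The quadratic complexity of self-attention poses a major computational challenge for Transformer-based video models on long-context tasks, and existing coarse-grained sparse attention methods incur redundant computation and suboptimal performance.

Result: Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with accuracy comparable to full attention.

Insight: The novelty is the observation that video attention maps exhibit a strong vertical-vector sparse pattern, together with evidence that it offers a better accuracy-sparsity trade-off than existing coarse-grained sparse patterns. On this basis the authors design a lightweight framework that dynamically selects and processes informative vectors, achieving its speedup by reducing memory access overhead and using an optimized kernel.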

Abstract: Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel framework of vector-wise sparse attention that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized kernel of vector sparse attention. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention. Our code is available at https://github.com/anminliu/VecAttention.
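
To make the idea of a vertical sparse pattern concrete, the following sketch treats each key position as one column of the attention map, scores columns using a subsample of queries, and then runs softmax attention over the selected columns only. It is a simplified, whole-column illustration under stated assumptions: the paper's actual vector granularity, selection rule, and kernel are not described in the abstract, and `select_columns`, `sparse_attention`, and `probe_every` are hypothetical names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_columns(Q, K, k, probe_every=2):
    """Pick the k key columns with the largest summed logits over probe queries."""
    probes = Q[::probe_every]  # cheap estimate: score with a subset of queries
    scores = [sum(dot(q, key) for q in probes) for key in K]
    top = sorted(range(len(K)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def sparse_attention(Q, K, V, cols):
    """Softmax attention restricted to the key/value positions in `cols`."""
    out = []
    for q in Q:
        logits = [dot(q, K[j]) / math.sqrt(len(q)) for j in cols]
        m = max(logits)  # subtract the max for numerical stability
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        out.append([
            sum(w[t] * V[j][d] for t, j in enumerate(cols)) / z
            for d in range(len(V[0]))
        ])
    return out


Q = [[1.0, 0.0], [1.0, 0.0]]
K = [[10.0, 0.0], [0.0, 0.0], [-10.0, 0.0]]
V = [[1.0], [2.0], [3.0]]
cols = select_columns(Q, K, k=1)
print(cols, sparse_attention(Q, K, V, cols))
```

Passing `cols = list(range(len(K)))` recovers ordinary full attention, so the sparsity level is controlled entirely by `k`.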


[54] Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization cs.CV | cs.CLPDF

Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon

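TL;DR: This paper proposes ViTAS, a multimodal radiology report summarization method that selectively attends to pathology-relevant visual regions instead of full images. It achieves SOTA performance on the MIMIC-CXR benchmark, demonstrating the advantage of a “less but more relevant” visual input strategy.

Details

Motivation: The paper challenges two prevailing assumptions about multimodal models: (1) that more visual input is always better, and (2) that multimodal models add limited value when the textual findings already contain rich image detail. It aims to address visual noise and improve summary quality in the FINDINGS-to-IMPRESSION transformation.

Result: On the MIMIC-CXR benchmark, ViTAS achieves SOTA results of 29.25% BLEU-4 and 69.83% ROUGE-L. Qualitative analysis shows improved factual alignment, and the method receives the highest expert-rated human evaluation scores.

Insight: The novelty is a “less but more relevant” visual attention mechanism: a multi-stage pipeline (MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization) that selectively focuses on high-importance regions, improving the performance and factual accuracy of multimodal summarization.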

Abstract: Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS → IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS, Visual-Text Attention Summarizer, a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.


[55] Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge cs.CV | cs.AIPDF

Sowmya Vajrala, Aakash Parmar, Prasanna R, Sravanth Kodavanti, Manjunath Arveti

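TL;DR: This paper presents a unified framework that treats LoRA weights as runtime inputs and pairs this with QUAD, a quantization-aware training strategy, to enable multi-task generative AI inference on edge devices with a single shared model, substantially reducing memory footprint and latency.

Details

Motivation: Deploying large vision models on resource-constrained edge devices is hampered by existing mobile deployment pipelines, which compile a separate model binary for each LoRA adapter plus a copy of the foundation model, causing redundant storage and increased runtime overhead.

Result: Experiments across multiple chipsets show up to a 6x reduction in memory footprint and up to a 4x improvement in latency while maintaining high visual quality across multiple generative AI tasks.

Insight: The novelties are treating LoRA weights as runtime inputs to enable dynamic task switching and introducing the QUAD quantization-aware training strategy to align multiple LoRA adapters. Viewed objectively, this unified framework addresses the efficiency and flexibility challenges of multi-task model deployment on edge devices.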

Abstract: Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile separate model binaries for each LoRA plus a copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to a 6x reduction in memory footprint and up to a 4x improvement in latency, while maintaining high visual quality across multiple GenAI tasks.
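
The idea of passing LoRA weights as runtime inputs, instead of baking them into the model, can be sketched for a single linear layer. This is a toy illustration, not the paper's runtime: the function `forward`, the `scale` parameter, and the adapter names are assumptions, and quantization is left out entirely. It uses the standard LoRA form y = Wx + scale · B(Ax).

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def forward(W, x, lora=None, scale=1.0):
    """Shared base layer; the adapter (A, B) is an argument, not part of W."""
    y = matvec(W, x)
    if lora is not None:
        A, B = lora  # A: rank x in_dim, B: out_dim x rank
        delta = matvec(B, matvec(A, x))
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y


W = [[1.0, 0.0], [0.0, 1.0]]  # one copy of the base weights
adapters = {                   # hypothetical task adapters, rank 1
    "object_removal": ([[1.0, 1.0]], [[1.0], [0.0]]),
    "style_edit": ([[0.0, 1.0]], [[0.0], [2.0]]),
}
x = [1.0, 2.0]
print(forward(W, x))                              # [1.0, 2.0]
print(forward(W, x, adapters["object_removal"]))  # [4.0, 2.0]
print(forward(W, x, adapters["style_edit"]))      # [1.0, 6.0]
```

Switching tasks only changes which `(A, B)` pair is passed in, so the base weights `W` are stored once, which is the storage saving the abstract describes.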


[56] Generating Key Postures of Bharatanatyam Adavus with Pose Estimation cs.CV | cs.AIPDF

Jagadish Kashinath Kamble, Jayanta Mukhopadhyay, Debaditya Roy, Partha Pratim Das

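TL;DR: This paper proposes a pose-aware generative framework with an integrated pose estimation module for generating precise key postures of the adavus of Bharatanatyam, a classical Indian dance. Keypoint-based loss and pose consistency constraints ensure the anatomical accuracy and stylistic integrity of the generated postures, and the benefit of pose supervision is validated in both conditional generative adversarial network (cGAN) and conditional diffusion settings.

Details

Motivation: In the digital era, preserving intangible-cultural-heritage dances governed by strict structural and symbolic rules, such as Bharatanatyam, is challenging. Accurately generating their key postures is essential for maintaining anatomical and stylistic integrity and for enabling effective documentation, analysis, and global dissemination.

Result: The experiments evaluate four configurations: standard cGAN, cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision. The results show that adding pose supervision significantly improves the quality, realism, and cultural fidelity of the generated Bharatanatyam postures.

Insight: The novelty is integrating a pose estimation module into the generative framework and using keypoint loss and pose consistency as supervisory signals, which keeps the generated outputs geometrically accurate and provides a scalable, high-fidelity generation approach for the digital preservation and teaching of traditional dance.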

Abstract: Preserving intangible cultural dances rooted in centuries of tradition and governed by strict structural and symbolic rules presents unique challenges in the digital era. Among these, Bharatanatyam, a classical Indian dance form, stands out for its emphasis on codified adavus and precise key postures. Accurately generating these postures is crucial not only for maintaining anatomical and stylistic integrity, but also for enabling effective documentation, analysis, and transmission to broader global audiences through digital means. We propose a pose-aware generative framework integrated with a pose estimation module, guided by keypoint-based loss and pose consistency constraints. These supervisory signals ensure anatomical accuracy and stylistic integrity in the synthesized outputs. We evaluate four configurations: standard conditional generative adversarial network (cGAN), cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision. Each model is conditioned on key posture class labels and optimized to maintain geometric structure. In both cGAN and conditional diffusion settings, the integrated pose guidance aligns generated poses with ground-truth keypoint structures, promoting cultural fidelity. Our results demonstrate that incorporating pose supervision significantly enhances the quality, realism, and authenticity of generated Bharatanatyam postures. This framework provides a scalable approach for the digital preservation, education, and dissemination of traditional dance forms, enabling high-fidelity generation without compromising cultural precision. Code is available at https://github.com/jagidsh/Generating-Key-Postures-of-Bharatanatyam-Adavus-with-Pose-Estimation.


[57] Video-Oasis: Rethinking Evaluation of Video Understanding cs.CVPDF

Geuntaek Lim, Minho Shim, Sungjune Park, Jaeyun Lee, Inwoong Lee

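TL;DR: This paper presents Video-Oasis, a diagnostic suite for systematically evaluating existing video understanding benchmarks, with the aim of re-examining the state of video understanding evaluation. It finds that 54% of existing benchmark samples can be solved without visual input or temporal context, and that on the remaining samples state-of-the-art models perform only slightly above random guessing.

Details

Motivation: The complexity of video understanding makes it hard to tell whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors, and existing benchmarks overlook the essential criteria that constitute video understanding, so the current evaluation landscape needs to be re-examined.

Result: The Video-Oasis analysis finds that 54% of existing benchmark samples can be solved without visual or temporal information. On the remaining samples, SOTA models perform only slightly above random guessing.

Insight: The novelty is a sustainable diagnostic suite that systematically evaluates existing benchmarks, exposes their shortcomings on spatio-temporal challenges, and offers practical guidelines for future benchmark construction and architecture evaluation.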

Abstract: The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To bridge this gap, we investigate which algorithmic design choices contribute to robust video understanding, providing practical guidelines for future research. We hope our work serves as a standard guideline for benchmark construction and the rigorous evaluation of architecture development. Code is available at https://github.com/sejong-rcv/Video-Oasis.


[58] Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis cs.CV | cs.MMPDF

Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng

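TL;DR: This paper presents Unify-Agent, a unified multimodal agent for world-grounded image synthesis. It reframes image generation as an agentic pipeline of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis, addressing the reliance of existing unified multimodal models on frozen parametric knowledge and their difficulty with long-tail and knowledge-intensive concepts.

Details

Motivation: Existing unified multimodal models rely mainly on frozen parametric knowledge and struggle with real-world image generation involving long-tail and knowledge-intensive concepts, so the paper explores agentic modeling to overcome this limitation.

Result: On the FactIP benchmark, which covers 12 categories of culturally significant and long-tail factual concepts, and on a range of real-world generation tasks, Unify-Agent substantially improves over its base unified model and approaches the world knowledge capabilities of the strongest closed-source models.

Insight: The novelty is reframing image generation as an agentic pipeline and building a tailored multimodal data pipeline with 143K high-quality agent trajectories for supervised training. Viewed objectively, its agent architecture, which tightly couples reasoning, searching, and generation, offers a new approach to reliable open-world image synthesis.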

Abstract: Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.


[59] BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation cs.CVPDF

Johann-Ludwig Herzog, Mathis Jürgen Adler, Leonard Hackel, Yan Shu, Angelos Zavras

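TL;DR: This paper introduces BigEarthNet.txt, a large-scale multi-sensor remote sensing image-text dataset containing about 464K co-registered Sentinel-1 SAR and Sentinel-2 multispectral images with 9.6M text annotations, intended to advance instruction-driven image-text learning in Earth observation.

Details

Motivation: Existing remote sensing image-text datasets are small and single-sensor, with short text annotations of limited types, which limits the performance of vision-language models in remote sensing. A large-scale, multi-sensor dataset with rich and diverse text annotations is therefore needed.

Result: Statistical analysis shows that BigEarthNet.txt surpasses existing remote sensing datasets in textual richness and annotation type variety. A benchmark built on it shows that existing vision-language models are limited on tasks involving complex land-use/land-cover classes, whereas fine-tuning on the dataset yields consistent performance gains across all considered tasks.

Insight: The novelty is building the first large-scale, multi-sensor (SAR and multispectral) remote sensing image-text dataset with rich text annotations of several kinds, including geographically anchored captions, visual question answering pairs, and referring expression detection instructions, providing a key resource for multi-task instruction learning in remote sensing.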

Abstract: Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery, with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464,044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning using BigEarthNet.txt results in consistent performance gains across all considered tasks.


[60] Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras cs.CV | cs.DC | cs.IRPDF

Sherif Abdelwahab

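TL;DR: This paper proposes a streaming retrieval architecture for edge cameras: an on-device novelty filter retains only semantically novel frames to build a denoised embedding index, and a cross-modal adapter and cloud re-ranker compensate for the compact encoder's weak alignment, improving cross-modal retrieval over continuous video streams.

Details

Motivation: When always-on edge cameras generate continuous video streams, redundant frames crowd correct results out of top-k search and degrade cross-modal retrieval performance.

Result: On two egocentric datasets, AEA and EPIC-KITCHENS, tested with eight vision-language models (8M-632M parameters), the single-pass streaming filter outperforms the offline alternatives (k-means, farthest-point, uniform, random). With the full architecture, an 8M-parameter on-device encoder reaches 45.6% Hit@5 on held-out data at an estimated power draw of 2.7 mW.

Insight: The novelty is an on-device epsilon-net filter that performs real-time novelty filtering to cut redundancy, combined with cross-modal adaptation and cloud re-ranking to offset the lightweight encoder's weak alignment, achieving efficient streaming cross-modal retrieval at low power.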

Abstract: Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder’s weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
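
A single-pass epsilon-net style filter can be sketched in a few lines: a frame is kept only if its embedding lies farther than `eps` from every embedding already kept. This is a minimal sketch under assumptions: the abstract does not state the distance measure, the threshold, or how the index is searched, so Euclidean distance, the linear scan, and the name `novelty_filter` are illustrative choices.

```python
import math

def novelty_filter(stream, eps):
    """Yield (index, embedding) for frames farther than eps from all kept frames."""
    kept = []
    for idx, emb in enumerate(stream):
        # One pass over the stream: compare only against frames already kept.
        if all(math.dist(emb, k) > eps for k in kept):
            kept.append(emb)
            yield idx, emb


frames = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (0.0, 1.0)]
print([idx for idx, _ in novelty_filter(frames, eps=0.5)])  # [0, 2, 4]
```

Frames 1 and 3 are near-duplicates of frames 0 and 2 and are dropped, so they can no longer crowd distinct frames out of a top-k search over the index.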


[61] CutClaw: Agentic Hours-Long Video Editing via Music Synchronization cs.CVPDF

Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun

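TL;DR: CutClaw is an autonomous multi-agent framework built on multimodal large language models (MLLMs) that automatically edits hours of raw footage into short videos that are synchronized with music, narratively coherent, and visually appealing. Its pipeline consists of hierarchical multimodal decomposition, narrative structuring orchestrated by a Playwriter agent, and collaborative refinement of the final cut by Editor and Reviewer agents.

Details

Motivation: Manual video editing is time-consuming and repetitive, particularly for creators and professionals who need to synchronize long footage with music to produce high-quality short videos.

Result: Experiments show that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos.

Insight: The main novelties are (1) hierarchical multimodal decomposition that captures both fine-grained details and the global structure of the audiovisual footage, and (2) a collaborative multi-agent system (Playwriter, Editor, Reviewer) responsible for narrative orchestration, content selection, and aesthetic refinement, respectively, yielding an automated, high-quality long-to-short video editing pipeline.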

Abstract: Editing video content in sync with audio has become a form of digital, human-made art on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that edits hours-long raw footage into meaningful short videos by leveraging multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos that are synchronized with music, follow instructions, and are visually appealing. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.


[62] CoRe-DA: Contrastive Regression for Unsupervised Domain Adaptation in Surgical Skill Assessment cs.CVPDF

Dimitrios Anastasiou, Razvan Caramalau, Jialang Xu, Runlong He, Freweini Tesfai

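TL;DR: This paper proposes CoRe-DA, a contrastive regression-based unsupervised domain adaptation framework that addresses the poor generalization of regression models to new surgical tasks and environments in surgical skill assessment (SSA). The method learns domain-invariant representations through relative-score supervision and target-domain self-training, and it outperforms existing methods on the first UDA benchmark for SSA regression, which spans dry-lab and clinical settings.

Details

Motivation: Vision-based surgical skill assessment is held back by the high cost and time demands of manually annotating skill scores and by the poor generalization of existing regression models to new surgical tasks and environments. Meanwhile, large volumes of unlabeled video are available, which motivates unsupervised domain adaptation methods for SSA.

Result: Comprehensive experiments across two UDA settings show that CoRe-DA outperforms state-of-the-art methods, achieving Spearman correlation coefficients of 0.46 and 0.41 on the dry-lab and clinical target datasets, respectively, without using any labeled target-domain data for training.

Insight: The novelty is establishing the first UDA benchmark for SSA regression and proposing a contrastive regression adaptation framework that combines relative-score supervision (for learning domain-invariant representations) with target-domain self-training, improving cross-domain generalization and offering a reliable path to scalable SSA.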

Abstract: Vision-based surgical skill assessment (SSA) enables objective and scalable evaluation of operative performance. Progress in this field is constrained by the high cost and time demands for manual annotation of quantitative skill scores, as well as the poor generalization of existing regression models to new surgical tasks and environments. Meanwhile, appreciable volumes of unlabeled video data are now available, motivating the development of unsupervised domain adaptation (UDA) methods for SSA. We introduce the first benchmark for UDA in SSA regression, spanning four datasets across dry-lab and clinical settings as well as open and robotic surgery. We evaluate eight representative models under challenging domain shifts and propose CoRe-DA, a novel contrastive regression-based adaptation framework. Our method learns domain-invariant representations through relative-score supervision and target-domain self-training. Comprehensive experiments across two UDA settings show that CoRe-DA is superior to state-of-the-art methods, achieving Spearman Correlation Coefficients of 0.46 and 0.41 on dry-lab and clinical target datasets, respectively, without using any labeled target data for training. Overall, CoRe-DA enables scalable SSA with reliable cross-domain generalization, where existing methods underperform. Our code and datasets will be released at https://github.com/anastadimi/CoRe-DA.


[63] Clinical DVH metrics as a loss function for 3D dose prediction in head and neck radiotherapy cs.CVPDF

Ruochen Gao, Marius Staring, Frank Dankers

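TL;DR: This paper proposes a clinical dose-volume histogram (DVH) metric loss (CDM loss) for 3D dose prediction in head and neck radiotherapy. By combining differentiable D-metrics and surrogate V-metrics with a lossless bit-mask region-of-interest (ROI) encoding, it directly optimizes the DVH metrics used clinically, bringing predictions into closer agreement with clinical plan evaluation criteria.

Details

Motivation: Existing deep-learning-based 3D dose prediction models are usually trained with voxel-wise regression losses, which are misaligned with clinical plan evaluation criteria based on DVH metrics, so a computationally efficient loss that directly optimizes clinical DVH metrics is needed.

Result: Evaluated on a dataset of 174 head and neck patients, CDM loss substantially improved target coverage and satisfied all clinical constraints compared with MAE- and DVH-curve-based losses. With a standard 3D U-Net, the PTV Score fell from 1.544 (MAE) to 0.491 (MAE + CDM) while organ-at-risk sparing remained comparable, and bit-mask encoding cut training time by 83% and lowered GPU memory usage.

Insight: The novelty is a loss function (CDM loss) that directly optimizes clinical DVH metrics by combining differentiable D-metrics and surrogate V-metrics with an efficient ROI bit-mask encoding, giving dose prediction a supervision framework that better matches clinical needs and improving both clinical relevance and computational efficiency.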

Abstract: Purpose: Deep-learning-based three-dimensional (3D) dose prediction is widely used in automated radiotherapy workflows. However, most existing models are trained with voxel-wise regression losses, which are poorly aligned with clinical plan evaluation criteria based on dose-volume histogram (DVH) metrics. This study aims to develop a clinically guided loss formulation that directly optimizes clinically used DVH metrics while remaining computationally efficient for head and neck (H&N) dose prediction. Methods: We propose a clinical DVH metric loss (CDM loss) that incorporates differentiable D-metrics and surrogate V-metrics, together with a lossless bit-mask region-of-interest (ROI) encoding to improve training efficiency. The method was evaluated on 174 H&N patients using a temporal split (137 training, 37 testing). Results: Compared with MAE- and DVH-curve-based losses, CDM loss substantially improved target coverage and satisfied all clinical constraints. Using a standard 3D U-Net, the PTV Score was reduced from 1.544 (MAE) to 0.491 (MAE + CDM), while OAR sparing remained comparable. Bit-mask encoding reduced training time by 83% and lowered GPU memory usage. Conclusion: Directly optimizing clinically used DVH metrics enables 3D dose predictions that are better aligned with clinical treatment planning criteria than conventional voxel-wise or DVH-curve-based supervision. The proposed CDM loss, combined with efficient ROI bit-mask encoding, provides a practical and scalable framework for H&N dose prediction.
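
One plausible reading of "lossless bit-mask ROI encoding" is packing several binary ROI masks into a single integer per voxel, one bit per structure, so that overlapping ROIs survive the round trip. The sketch below illustrates that reading only. The abstract does not specify the encoding, so the functions `encode_rois` and `decode_roi`, the flat-list volume layout, and the ROI names are assumptions.

```python
def encode_rois(masks):
    """Pack binary masks (dict: name -> flat list of 0/1) into one int per voxel."""
    names = list(masks)
    n_voxels = len(next(iter(masks.values())))
    packed = [0] * n_voxels
    for bit, name in enumerate(names):
        for i, inside in enumerate(masks[name]):
            if inside:
                packed[i] |= 1 << bit  # set this ROI's bit for the voxel
    return names, packed


def decode_roi(names, packed, name):
    """Recover one ROI's binary mask from the packed volume."""
    bit = names.index(name)
    return [(p >> bit) & 1 for p in packed]


masks = {
    "PTV": [1, 1, 0, 0],
    "SpinalCord": [0, 1, 1, 0],  # overlaps the PTV at voxel 1
}
names, packed = encode_rois(masks)
print(packed)                                   # [1, 3, 2, 0]
print(decode_roi(names, packed, "SpinalCord"))  # [0, 1, 1, 0]
```

Unlike a single label map, where each voxel can carry only one label, the packed form keeps overlaps intact while replacing many mask volumes with one integer volume.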


[64] SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition cs.CVPDF

Ning Wang, Tieyue Wu, Naeha Sharif, Farid Boussaid, Guangming Zhu

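TL;DR: This paper proposes SkeletonContext, a framework that enriches skeleton action representations with language-driven context prompts to address the difficulty of distinguishing visually similar actions in zero-shot skeleton-based action recognition, which stems from missing contextual cues such as the objects involved in an action. The framework comprises a Cross-Modal Context Prompt Module and a Key-Part Decoupling Module and achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing zero-shot skeleton-based action recognition methods typically align skeleton features with textual embeddings, but the absence of contextual cues such as the objects involved in an action leaves an inherent gap between skeleton and semantic representations, making visually similar actions hard to distinguish.

Result: Extensive experiments on multiple benchmarks (such as NTU RGB+D 60 and 120) show that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings.

Insight: The novelty is a Cross-Modal Context Prompt Module that uses a pretrained language model to reconstruct masked contextual prompts under LLM guidance, transferring linguistic context to the skeleton encoder for instance-level semantic grounding. A Key-Part Decoupling Module additionally decouples motion-relevant joint features, ensuring robust action understanding even without explicit object interactions.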

Abstract: Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.


[65] GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis cs.CVPDF

Thomas Tanay, Mohammed Brahimi, Michal Nazarczuk, Qingwen Zhang, Sibi Catley-Chandar

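TL;DR: This paper proposes GRVS, a generalizable and recurrent approach to synthesizing novel views from monocular videos of dynamic scenes. A recurrent loop enables unbounded, asynchronous mapping between input and target videos, and plane sweeps separate camera motion from scene motion, giving fine-grained six-degrees-of-freedom camera control.

Details

Motivation: Monocular novel view synthesis for dynamic scenes remains difficult. Scene-specific optimization methods with explicit motion priors tend to break down in highly dynamic regions, diffusion-based methods often produce geometric inconsistencies, and both families demand substantial compute. The authors therefore aim for a generalizable model that overcomes these limitations.

Result: On the UCSD dataset and the newly built Kubric-4D-dyn dataset, the model outperforms four Gaussian Splatting-based scene-specific methods and two diffusion-based methods at reconstructing fine-grained geometric details in both static and dynamic regions.

Insight: The contributions are a recurrent loop that enables unbounded asynchronous mapping and the use of plane sweeps to disentangle camera motion from scene motion, which together allow finer camera control in dynamic scenes. The result is a generalizable and efficient approach to dynamic novel view synthesis.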

Abstract: Synthesizing novel views from monocular videos of dynamic scenes remains a challenging problem. Scene-specific methods that optimize 4D representations with explicit motion priors often break down in highly dynamic regions where multi-view information is hard to exploit. Diffusion-based approaches that integrate camera control into large pre-trained models can produce visually plausible videos but frequently suffer from geometric inconsistencies across both static and dynamic areas. Both families of methods also require substantial computational resources. Building on the success of generalizable models for static novel view synthesis, we adapt the framework to dynamic inputs and propose a new model with two key components: (1) a recurrent loop that enables unbounded and asynchronous mapping between input and target videos and (2) an efficient use of plane sweeps over dynamic inputs to disentangle camera and scene motion, and achieve fine-grained, six-degrees-of-freedom camera controls. We train and evaluate our model on the UCSD dataset and on Kubric-4D-dyn, a new monocular dynamic dataset featuring longer, higher resolution sequences with more complex scene dynamics than existing alternatives. Our model outperforms four Gaussian Splatting-based scene-specific approaches, as well as two diffusion-based approaches in reconstructing fine-grained geometric details across both static and dynamic regions.


[66] TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios cs.CV | cs.AIPDF

Qiucheng Yu, Ruijie Xu, Mingang Chen, Xuequan Lu, Jianfeng Dong

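TL;DR: This paper introduces TSHA, a comprehensive benchmark for evaluating the trustworthiness of vision-language models (VLMs) in indoor safety hazard assessment. It contains 81,809 training samples and 1,707 challenging test samples drawn from multiple data sources and complex scenes. Experiments show that current VLMs fall short at safety hazard assessment, while training on TSHA markedly improves both performance and generalization.

Details

Motivation: Existing benchmarks rely heavily on synthetic data, oversimplify the tasks, and lack rigorous evaluation protocols. TSHA aims to narrow the domain gap with the real world and to improve model generalization in complex home safety scenarios.

Result: Extensive experiments on 23 popular VLMs show that current models lack robust safety hazard assessment capabilities. Models trained on the TSHA training set gain up to +18.3 points on the TSHA test set and generalize better on other benchmarks.

Insight: The novelty is a multi-source, realistic, and complex benchmark that combines existing datasets, internet images, AIGC images, and newly captured images, and adds videos and panoramic images to test robustness in complex scenes. It offers a tool for trustworthy evaluation of VLMs in the safety domain.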

Abstract: Recent advances in vision-language models (VLMs) have accelerated their application to indoor safety hazards assessment. However, existing benchmarks suffer from three fundamental limitations: (1) heavy reliance on synthetic datasets constructed via simulation software, creating a significant domain gap with real-world environments; (2) oversimplified safety tasks with artificial constraints on hazard and scene types, thereby limiting model generalization; and (3) absence of rigorous evaluation protocols to thoroughly assess model capabilities in complex home safety scenarios. To address these challenges, we introduce TSHA (Trustworthy Safety Hazards Assessment), a comprehensive benchmark comprising 81,809 carefully curated training samples drawn from four complementary sources: existing indoor datasets, internet images, AIGC images, and newly captured images. This benchmark set also includes a highly challenging test set with 1,707 samples, comprising not only a carefully selected subset from the training distribution but also newly added videos and panoramic images containing multiple safety hazards, used to evaluate the model’s robustness in complex safety scenarios. Extensive experiments on 23 popular VLMs demonstrate that current VLMs lack robust capabilities for safety hazard assessment. Importantly, models trained on the TSHA training set not only achieve a significant performance improvement of up to +18.3 points on the TSHA test set but also exhibit enhanced generalizability across other benchmarks, underscoring the substantial contribution and importance of the TSHA benchmark.


[67] Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration cs.CVPDF

Fengyang Xiao, Peng Hu, Lei Xu, XingE Guo, Guanyi Qin

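TL;DR: This paper proposes IQPIR, a framework for real-world image restoration that addresses the limit on perceptual quality caused by relying on imperfect ground-truth supervision. It introduces an Image Quality Prior (IQP) extracted from pretrained No-Reference Image Quality Assessment (NR-IQA) models and combines it with a learned codebook prior. A quality-conditioned Transformer, a dual-branch codebook structure, and a discrete representation-based quality optimization strategy guide restoration toward perceptually optimal outputs.

Details

Motivation: Existing real-world image restoration methods typically depend on ground-truth (GT) supervision, but the GT itself can contain images of inconsistent perceptual quality. Models then converge to the average quality level of the training data instead of reaching the highest attainable perceptual quality.

Result: Extensive experiments on real-world image restoration show that the method surpasses current state-of-the-art methods and can also serve as a general quality-guided enhancement strategy for existing methods.

Insight: The novelty is to use the quality prior extracted from NR-IQA models as an explicit conditioning signal that is optimized together with the codebook prior. The dual-branch codebook disentangles common and high-quality-specific features, and the discrete representation-based optimization mitigates over-optimization in continuous latent spaces, yielding a plug-and-play perceptual quality improvement that requires no structural modification.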

Abstract: Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP), extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models, to explicitly guide the restoration process toward perceptually optimal outputs. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms: (1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality, a design that provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; (2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code is available.
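
The "learned codebook prior" in the abstract above refers to a discrete set of feature vectors. As a minimal sketch, assuming the standard vector-quantization step (the abstract does not spell out IQPIR's actual lookup), a feature is replaced by its nearest codebook entry. The function name `quantize` and the toy codebook values are hypothetical.

```python
def quantize(feature, codebook):
    """Return (index, entry) of the codebook vector closest to `feature`.

    Generic nearest-neighbour vector quantization under squared Euclidean
    distance; an illustration only, not IQPIR's implementation.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))
    return best, codebook[best]


# Toy 2-D codebook (hypothetical values chosen for illustration).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
index, entry = quantize((0.9, 0.2), codebook)
```

Because the output is always one of a finite set of entries, any optimization over such a representation is confined to those entries, which is one plausible reading of why a discrete representation limits over-optimization.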


[68] From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety cs.CV | cs.AIPDF

Ganen Sethupathy, Lalit Dumka, Jan Schagen

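TL;DR: This paper presents a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for real-time video analysis in public safety settings. Deployed on an edge device, the system uses skeleton processing for privacy-aware monitoring with low computational overhead, and vision-language models for semantic scene understanding and zero-shot reasoning.

Details

Motivation: Detecting violent behaviour in public spaces (such as transport hubs, city centres, and event venues) must be timely, privacy-aware, and feasible under resource limits. Under edge-computing conditions in particular, existing automated video analysis systems are constrained by latency, privacy, and resources.

Result: The system is implemented on a GPU-enabled edge device and evaluated for latency, resource usage, and operational trade-offs in a demonstrator setup. The results show that motion-centric and semantic approaches have complementary strengths, and that a hybrid architecture can augment fast skeleton-based detection with higher-level semantic reasoning.

Insight: The contribution is a system-level comparison of skeleton analysis and vision-language models under realistic edge constraints, not a new recognition model. The hybrid architecture pairs low-overhead, privacy-aware monitoring with contextual understanding, providing a practical basis for real-time video analysis in public safety applications.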

Abstract: Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.


[69] MAPLE: Multi-Path Adaptive Propagation with Level-Aware Embeddings for Hierarchical Multi-Label Image Classification cs.CVPDF

Boshko Koloski, Marjan Stoimchev, Jurica Levatić, Dragi Kocev, Sašo Džeroski

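TL;DR: This paper proposes MAPLE, a framework for hierarchical multi-label classification of remote sensing images. It integrates hierarchical semantic initialization, structure encoding with graph convolutional networks, and adaptive multi-modal fusion to model structured label dependencies in multi-path settings.

Details

Motivation: Existing methods struggle to make full use of hierarchical information in multi-path settings, so performance suffers when an image activates multiple taxonomic branches. MAPLE is designed to address this.

Result: Evaluated on CORINE-aligned remote sensing datasets (AID, DFC-15, and MLRSNet), MAPLE improves performance by up to 42% in few-shot regimes while adding only 2.6% parameter overhead, giving efficient semantic modeling for Earth observation.

Insight: The contributions are hierarchical semantic initialization, adaptive multi-modal fusion, and level-aware loss selection. These mechanisms may carry over to other multi-modal tasks that involve structured label dependencies.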

Abstract: Hierarchical multi-label classification (HMLC) is essential for modeling structured label dependencies in remote sensing. Yet existing approaches struggle in multi-path settings, where images may activate multiple taxonomic branches, leading to underuse of hierarchical information. We propose MAPLE (Multi-Path Adaptive Propagation with Level-Aware Embeddings), a framework that integrates (i) hierarchical semantic initialization from graph-aware textual descriptions, (ii) graph-based structure encoding via graph convolutional networks (GCNs), and (iii) adaptive multi-modal fusion that dynamically balances semantic priors and visual evidence. An adaptive level-aware objective automatically selects appropriate losses per hierarchy level. Evaluations on CORINE-aligned remote sensing datasets (AID, DFC-15, and MLRSNet) show consistent improvements of up to +42% in few-shot regimes while adding only 2.6% parameter overhead, demonstrating that MAPLE effectively and efficiently models hierarchical semantics for Earth observation (EO).


[70] SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes cs.CVPDF

Léopold Maillard, Francis Engelmann, Tom Durand, Boxiao Pan, Yang You

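TL;DR: SceneTeract is a framework for verifying the functionality of 3D scenes. It decomposes complex activities into sequences of atomic actions and couples high-level semantic reasoning with low-level geometric checks to assess whether a scene is usable under the constraints of a specific embodied agent. The framework is used to evaluate functional failures in synthetic indoor environments and the ability of frontier vision-language models (VLMs) to predict functional affordances, and it can also serve as a reward engine for VLM post-training.

Details

Motivation: Assessing the functional affordances of 3D environments is a core challenge in embodied AI: one must verify that a scene supports meaningful activities for different users under agent-specific constraints.

Result: The evaluation of synthetic indoor environments uncovers frequent functional failures that prevent basic interactions. The evaluation of frontier VLMs shows a systematic mismatch between semantic confidence and physical feasibility, even for the strongest current models.

Insight: The core contribution is a grounded verification engine that combines semantic reasoning with geometric simulation and validates action sequences against physical constraints such as reachability, clearance, and navigability, conditioned on an agent profile. The framework can also act as a scalable reward engine that distills geometric constraints into reasoning models, helping to bridge perception and physical reality.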

Abstract: Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.


[71] Abstraction in Style cs.CVPDF

Min Lu, Yuanfeng He, Anthony Chen, Jianhuang He, Pu Wang

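TL;DR: This paper proposes Abstraction in Style (AiS), a generative framework that separates the structural abstraction in an artistic style from its visual stylization. Given a target image and a few style exemplars, AiS first produces an intermediate abstraction proxy that reinterprets the target's structure, then renders that proxy into the final stylized output, enabling broader, more controllable, and more expressive style transfer.

Details

Motivation: Conventional style transfer methods usually preserve the input geometry and therefore struggle to capture the deeper abstraction behavior that goes beyond surface appearance, especially for illustrative and non-photorealistic styles.

Result: The abstract reports no specific quantitative results or benchmarks. It claims that AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.

Insight: The core idea is to decouple structural abstraction from visual stylization and to treat abstraction as an explicit, transferable process. Both stages are learned through a shared image-space analogy without explicit geometric supervision, which helps capture the deeper abstraction logic of an artistic style.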

Abstract: Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target’s structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.


[72] Gloria: Consistent Character Video Generation via Content Anchors cs.CVPDF

Yuhang Yang, Fan Zhang, Huaijin Pi, Shuai Guo, Guowei Xu

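TL;DR: This paper proposes Gloria, which introduces content anchors (anchor frames) to generate long, multi-view consistent, and expressive character videos. It designs two mechanisms, Superset Content Anchoring and RoPE as Weak Condition, to address the copy-pasting and multi-reference conflict problems of reference-based video generation, and builds a scalable pipeline to extract anchors from massive video collections.

Details

Motivation: Existing character video generation methods either provide too little context to preserve identity or use non-character-centric information as memory, which leads to poor consistency. This paper targets identity and appearance consistency in long, multi-view character video generation.

Result: Experiments show that the method generates high-quality character videos longer than 10 minutes and surpasses existing methods in expressive identity and multi-view appearance consistency.

Insight: The contributions are representing a character's visual attributes with a compact set of anchor frames, and the Superset Content Anchoring and RoPE as Weak Condition mechanisms, which prevent duplication and distinguish multiple anchors. Viewing character generation as an outside-looking-in scenario and building a scalable anchor extraction pipeline are both ideas that other work could borrow.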

Abstract: Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing the character visual attributes through a compact set of anchor frames. This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.


[73] EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos cs.CVPDF

Fumihiko Tsuchiya, Taiki Miyanishi, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita

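TL;DR: This paper introduces EC-Bench, a benchmark that evaluates enumeration, counting, and temporal evidence grounding in ultra-long videos. It contains 152 videos longer than 30 minutes and 1,699 queries with explicit evidence spans. An evaluation of 22 multimodal large language models finds performance far below human level, revealing fundamental limitations of current models in quantitative reasoning over long videos.

Details

Motivation: Counting in long videos is a fundamental but underexplored challenge. Most existing video counting benchmarks focus on short clips and score only the final numerical answer, so they give little insight into what a model should count or whether it identifies the relevant instances consistently over time.

Result: Across 22 multimodal large language models, the best model reaches only 29.98% accuracy on Enumeration and 23.74% on Counting, whereas humans reach 78.57% and 82.97%, respectively. The analysis shows strong relationships between enumeration accuracy, temporal grounding, and counting performance.

Insight: The novelty is the joint evaluation of enumeration, counting, and temporal evidence grounding, in a quantitative reasoning benchmark dedicated to ultra-long videos (over 30 minutes). By annotating evidence spans explicitly, it offers a new way to analyze failure modes in long-range temporal reasoning and quantifies the relationship between model performance and temporal grounding ability.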

Abstract: Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.
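
Temporal evidence grounding compares predicted time spans with annotated evidence spans. The abstract above does not define EC-Bench's grounding metric, so the following is only a sketch of a common choice, temporal intersection-over-union (IoU) with a match threshold. The function names and the 0.5 threshold are assumptions for illustration.

```python
def temporal_iou(pred, gt):
    """IoU of two time spans given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_recall(pred_spans, gt_spans, thresh=0.5):
    """Fraction of ground-truth evidence spans matched by some prediction.

    A span counts as matched when its IoU with any predicted span reaches
    `thresh` (a hypothetical threshold, not taken from the paper).
    """
    if not gt_spans:
        return 0.0
    hits = sum(
        any(temporal_iou(p, g) >= thresh for p in pred_spans) for g in gt_spans
    )
    return hits / len(gt_spans)


# One of the two annotated events is localized; the other is missed.
recall = grounding_recall([(0.0, 10.0)], [(0.0, 10.0), (100.0, 110.0)])
```

A metric of this kind makes it possible to tell a correct count reached by finding the right instances from one reached by chance.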


[74] Detecting Unknown Objects via Energy-based Separation for Open World Object Detection cs.CVPDF

Jun-Woo Heo, Keonhee Park, Gyeong-Moon Park

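TL;DR: This paper proposes DEUS, a framework that addresses unknown object detection and catastrophic forgetting in Open World Object Detection (OWOD). It consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss, which aim to separate known and unknown object representations more cleanly and to reduce interference between old and new class knowledge during memory replay.

Details

Motivation: Existing OWOD methods rely heavily on known-class predictions to detect unknown objects, which makes it hard to learn and recognize unknown object representations. Memory replay mitigates forgetting of old classes, but often at the expense of knowledge about newly learned classes.

Result: On OWOD benchmarks, DEUS delivers substantial improvements in unknown object detection while maintaining competitive known-class detection performance.

Insight: The novelty is to use the geometric properties of ETFs to build orthogonal subspaces (EUS) and to use energies from both the known and the unknown space to distinguish unknown objects. The EKD loss enforces separation between the previous and current classifiers, reducing knowledge interference. Together these offer a new angle on representation separation and knowledge retention in open-world learning.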

Abstract: In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector’s known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.
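
Two building blocks named in the abstract above, equiangular tight frames and energy scores, have standard textbook forms. The sketch below builds a simplex ETF and computes the usual log-sum-exp energy of a logit vector. Both are assumptions about the general techniques: the abstract does not specify which ETF DEUS uses, how its orthogonal subspaces are formed, or how its energies are defined, and the function names are hypothetical.

```python
import math


def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex equiangular tight frame.

    Vector i is sqrt(k / (k - 1)) * (e_i - 1/k * ones); requires k >= 2.
    Every pair of distinct vectors has cosine similarity -1 / (k - 1).
    """
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def free_energy(logits):
    """Energy score E = -log(sum(exp(logit))), computed stably."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


frame = simplex_etf(4)
```

Under this definition, a confident prediction (one large logit) yields a lower energy than a flat logit vector, which is why energy is often used to flag inputs that fit no known class.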


[75] SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy cs.CVPDF

Shi Li, Vinkle Srivastav, Nicolas Chanel, Saurav Sharma, Nabani Banik

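TL;DR: This paper proposes SurgTEMP, a multimodal large language model for video question answering on laparoscopic cholecystectomy. A query-guided token selection module builds a hierarchical visual memory (spatial and temporal memory banks), and a Surgical Competency Progression training scheme is added, so that variable-length surgical videos are modeled effectively and diverse downstream assessment tasks are supported. The authors also release CholeVidQA-32K, a dataset with 32K QA pairs and about 128 hours of video segments.

Details

Motivation: Existing surgical visual question answering research focuses mainly on static frame analysis and overlooks rich temporal semantics. Surgical video is further challenging because of low visual contrast, a strongly knowledge-driven nature, and diverse analytical needs spread over scattered temporal windows.

Result: In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing video-based surgical VQA.

Insight: The contributions are: 1) a query-guided hierarchical visual memory mechanism that preserves procedure-relevant cues and temporal coherence; 2) a Surgical Competency Progression training scheme that better supports hierarchical tasks from basic perception to high-level intraoperative assessment; and 3) CholeVidQA-32K, a large-scale, hierarchically annotated surgical video QA dataset.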

Abstract: Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy (Perception, Assessment, and Reasoning) spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.


[76] Scaling Video Pretraining for Surgical Foundation Models cs.CVPDF

Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu

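TL;DR: This paper proposes SurgRec, a scalable and reproducible pretraining recipe for surgical video, with two variants: SurgRec-MAE and SurgRec-JEPA. The authors build a large multi-source surgical video corpus of 10,535 videos and 214.5 million frames covering several types of surgery. With a unified pretraining pipeline and balanced sampling, they establish a standardized, reproducible benchmark across 16 downstream datasets. Experiments show that SurgRec outperforms self-supervised learning and vision-language model baselines on multiple downstream tasks.

Details

Motivation: Existing surgical foundation models are held back by small data scale, limited procedural diversity, inconsistent evaluation standards, and the lack of a reproducible training pipeline, all of which impede progress in surgical video understanding.

Result: In extensive comparisons across 16 downstream datasets and four clinical domains, SurgRec consistently outperforms self-supervised learning baselines and vision-language models. Vision-language models prove unreliable for fine-grained temporal recognition, showing both performance gaps and sensitivity to prompt phrasing.

Insight: The contributions are a scalable, reproducible pretraining framework (SurgRec) and a large, diverse surgical video dataset. The key takeaways are that a unified pretraining pipeline and balanced sampling matter for building robust surgical video foundation models, and that vision-language models may be a poor fit for surgical scenarios that require precise temporal understanding.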

Abstract: Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.


[77] Learning Structural-Functional Brain Representations through Multi-Scale Adaptive Graph Attention for Cognitive Insight cs.CVPDF

Badhan Mazumder, Sir-Lord Wiafe, Aline Kotoski, Vince D. Calhoun, Dong Hye Ye

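TL;DR: This paper proposes MAGNet (Multi-scale Adaptive Graph Network), a Transformer-style graph neural network framework that adaptively learns interactions between brain structure and function. The model fuses source-based morphometry features from structural MRI with functional network connectivity from resting-state fMRI, integrates direct and indirect pathways through a hybrid graph, refines connectivity importance with local-global attention, and uses a joint loss to enforce cross-modal coherence while optimizing the prediction objective end-to-end.

Details

Motivation: The interaction between brain structure and function is key to understanding intelligence, but modeling the two jointly is difficult because the structural and functional connectomes capture complementary aspects of organization. This paper addresses the challenge with a framework that adaptively learns structure-function interactions.

Result: On the ABCD dataset, MAGNet outperforms relevant baselines, demonstrating effective multimodal integration for advancing the understanding of cognitive function.

Insight: The contributions are: a multimodal graph neural network framework that fuses structural morphological features with functional connectivity; a hybrid graph that integrates direct and indirect pathways; a local-global attention mechanism that refines connectivity importance; and a joint loss that optimizes cross-modal coherence and the prediction objective together, enabling end-to-end learning.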

Abstract: Understanding how brain structure and function interact is key to explaining intelligence, yet modeling them jointly is challenging, as the structural and functional connectomes capture complementary aspects of organization. We introduced Multi-scale Adaptive Graph Network (MAGNet), a Transformer-style graph neural network framework that adaptively learns structure-function interactions. MAGNet leverages source-based morphometry from structural MRI to extract inter-regional morphological features and fuses them with functional network connectivity from resting-state fMRI. A hybrid graph integrates direct and indirect pathways, while local-global attention refines connectivity importance and a joint loss simultaneously enforces cross-modal coherence and optimizes the prediction objective end-to-end. On the ABCD dataset, MAGNet outperformed relevant baselines, demonstrating effective multimodal integration for advancing our understanding of cognitive function.


[78] Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression, and MRI cs.CV | cs.AIPDF

Iain Swift, JingHua Ye

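TL;DR: This study explores the feasibility of integrating three modalities (histopathology, gene expression, and FLAIR MRI) for glioma survival prediction. Using the TCGA-GBMLGG cohort (664 patients), it evaluates unimodal, bimodal, and trimodal models under different fusion strategies. Trimodal early fusion yields a slight gain in an exploratory Composite Score, but because of the small sample size the result is not statistically significant.

Details

Motivation: Multimodal deep learning frameworks already integrate histopathology and genomic data to improve prognostic accuracy for brain tumours, but the contribution of volumetric MRI within a unified survival prediction framework has not been explored. This study adds FLAIR MRI as a third modality and evaluates its incremental value for predictive performance.

Result: On the restricted set of 19 test patients, trimodal early fusion obtains an exploratory Composite Score of CS = 0.854, a gain of ΔCS = +0.011 over the bimodal baseline, but the difference is not statistically significant (p = 0.250). MRI alone achieves reasonable predictive ability (CS = 0.755) but does not substantially improve bimodal pairs, and provides a measurable uplift only in the three-way combination. Because of the small sample, the bootstrap confidence intervals of all MRI-containing experiments are wide (e.g. [0.400, 1.000]), which precludes definitive conclusions.

Insight: The novelty is the first systematic evaluation, within one unified framework, of how trimodal fusion of histopathology, gene expression, and MRI contributes to glioma survival prediction. The analysis suggests that a third imaging modality may add prognostic value even with small samples, that effective multimodal integration requires sufficient context and sample size, and that early fusion shows promise in the trimodal setting.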

Abstract: Multimodal deep learning has improved prognostic accuracy for brain tumours by integrating histopathology and genomic data, yet the contribution of volumetric MRI within unified survival frameworks remains unexplored. This pilot study extends a bimodal framework by incorporating Fluid Attenuated Inversion Recovery (FLAIR) MRI from BraTS2021 as a third modality. Using the TCGA-GBMLGG cohort (664 patients), we evaluate three unimodal models, nine bimodal configurations, and three trimodal configurations across early, late, and joint fusion strategies. In this small cohort setting, trimodal early fusion achieves an exploratory Composite Score (CS = 0.854), with a controlled $\Delta$CS of +0.011 over the bimodal baseline on identical patients, though this difference is not statistically significant (p = 0.250, permutation test). MRI achieves reasonable unimodal discrimination (CS = 0.755) but does not substantially improve bimodal pairs, while providing measurable uplift in the three-way combination. All MRI-containing experiments are constrained to 19 test patients, yielding wide bootstrap confidence intervals (e.g. [0.400, 1.000]) that preclude definitive conclusions. These findings provide preliminary evidence that a third imaging modality may add prognostic value even with limited sample sizes, and that additional modalities require sufficient multimodal context to contribute effectively.
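
The wide confidence intervals reported above are what a percentile bootstrap tends to produce on 19 patients. The sketch below illustrates the effect with a generic percentile bootstrap. The paper's Composite Score is not defined in the abstract, so the mean of per-patient binary outcomes stands in for it, and the function names and toy data are hypothetical.

```python
import random


def bootstrap_ci(values, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `metric` over `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the metric, and sort the results.
    stats = sorted(metric(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def mean(xs):
    return sum(xs) / len(xs)


# Stand-in data: 19 "patients", 15 of them scored correctly (hypothetical).
small = [1.0] * 15 + [0.0] * 4
# Same proportions with 20 times as many patients, for comparison.
large = small * 20

lo_s, hi_s = bootstrap_ci(small, mean)
lo_l, hi_l = bootstrap_ci(large, mean)
```

With the same underlying proportion, the interval for `small` is several times wider than the one for `large`, which illustrates why a gain of +0.011 cannot be resolved on a 19-patient test set.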


[79] Benchmarking PhD-Level Coding in 3D Geometric Computer Vision cs.CVPDF

Wenyi Li, Renkai Luo, Yue Yu, Huan-ang Gao, Mingju Gao

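TL;DR: This paper introduces GeoCodeBench, a benchmark for evaluating PhD-level coding ability in 3D geometric computer vision. It consists of implementation tasks for core 3D geometric components selected from recent representative papers, with diverse edge-case unit tests generated for automatic scoring. The best current model, GPT-5, reaches a pass rate of only 36.6%, revealing a large gap between existing AI coding ability and dependable 3D scientific coding.

Details

Motivation: Current AI-assisted coding models still struggle to generate correct code for complex 3D geometric vision, which limits efficiency gains for the research community. Measuring progress toward reliable 3D geometric coding requires a rigorous benchmark.

Result: Eight representative open- and closed-source models are evaluated on GeoCodeBench, and the best model, GPT-5, reaches a pass rate of only 36.6%. Tasks fall into two levels: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing), with research-oriented tasks markedly harder. Context ablations show that providing only the text up to the Method section statistically outperforms providing the full paper.

Insight: The novelty is a code generation benchmark focused on 3D geometric vision at PhD-level difficulty, built rigorously by selecting core components from official repositories and generating edge-case tests. The analysis finds that long-context scientific comprehension remains an unresolved challenge and that research-level algorithm implementation lags well behind general geometric coding.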

Abstract: AI-assisted coding has rapidly reshaped software practice and research workflows, yet today’s models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that “more paper text” is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.


[80] Video Models Reason Early: Exploiting Plan Commitment for Maze Solving cs.CVPDF

Kaleb Newman, Tyler Zhu, Olga Russakovsky

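TL;DR: This paper studies the internal reasoning of video diffusion models when they solve mazes. It finds that the models commit to a high-level motion plan in the early denoising steps, and that maze difficulty depends mainly on path length, not obstacle density. Building on these findings, the authors propose ChEaP, which selects promising early plans and chains generations together, substantially improving performance on long-horizon maze tasks.

Details

Motivation: The goal is to understand how video diffusion models reason during generation, using tasks such as maze solving to expose their internal planning dynamics.

Result: ChEaP raises accuracy on long-horizon mazes from 7% to 67% and improves overall performance by 2.5x on hard tasks in Frozen Lake and VR-Bench, across Wan2.2-14B and HunyuanVideo-1.5.

Insight: The paper identifies early plan commitment in video diffusion models and a difficulty threshold dominated by path length, and proposes a chained generation strategy based on screening early plans, which draws out reasoning capability already latent in existing models.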

Abstract: Video diffusion models exhibit emergent reasoning capabilities like solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this and study the internal planning dynamics of video models using 2D maze solving as a controlled testbed. Our investigations reveal two findings. Our first finding is early plan commitment: video diffusion models commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. Our second finding is that path length, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. This means video models can only reason over long mazes by chaining together multiple sequential generations. To demonstrate the practical benefits of our findings, we introduce Chaining with Early Planning, or ChEaP, which only spends compute on seeds with promising early plans and chains them together to tackle complex mazes. This improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling.


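[81] OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation cs.CV

Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang

TL;DR: OmniRoam is a controllable panoramic video generation framework for long-horizon scene wandering. Through a preview stage and a refine stage, it generates high-resolution, long-sequence panoramic video from an input image or video, addressing the limited scene completeness and global consistency of existing perspective video models.

Details

Motivation: Most existing video generation models are based on perspective video and synthesize only limited observations of a scene, which causes problems with completeness and global consistency. This work uses the rich per-frame scene coverage and the inherent long-term spatial and temporal consistency of the panoramic representation to enable controllable long-horizon scene wandering.

Result: Experiments show that OmniRoam outperforms existing state-of-the-art methods in visual quality, controllability, and long-horizon scene consistency, both qualitatively and quantitatively.

Insight: The novelty lies in exploiting the inherent advantages of panoramic video for long-horizon generation and in achieving controllable, high-fidelity wandering through a two-stage (preview and refine) framework. The authors also build two panoramic video datasets containing synthetic and real-world captured videos to support training.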

Abstract: Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, leading to issues of completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, where a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, thus enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in terms of visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at https://github.com/yuhengliu02/OmniRoam.


cs.IR [Back]

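[82] UltRAG: a Universal Simple Scalable Recipe for Knowledge Graph RAG cs.IR | cs.CL | cs.LG

Dobrik Georgiev, Kheeran Naidu, Alberto Cattaneo, Federico Monti, Carlo Luschi

TL;DR: This paper proposes ULTRAG, a universal, simple, and scalable framework for knowledge graph retrieval-augmented generation. By equipping large language models with off-the-shelf neural query execution modules, it achieves state-of-the-art performance on knowledge graph question answering without any retraining, and it handles large knowledge graphs efficiently (for example Wikidata, with 116M entities and 1.6B relations).

Details

Motivation: Large language models generate factual errors (hallucinations), and conventional retrieval-augmented generation is hard to adapt to knowledge-graph-structured data, especially for graph queries that require multi-node/multi-hop reasoning.

Result: On knowledge graph question answering tasks, ULTRAG outperforms state-of-the-art KG-RAG solutions and handles Wikidata-scale graphs at comparable or lower cost.

Insight: The novelty is combining off-the-shelf neural query execution modules with an LLM to reach state-of-the-art performance without retraining, yielding a general and scalable knowledge graph RAG framework that goes beyond the limits of classical RAG on complex graph reasoning.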

Abstract: Large language models (LLMs) frequently generate confident yet factually incorrect content when used for language generation (a phenomenon often known as hallucination). Retrieval augmented generation (RAG) tries to reduce factual errors by identifying information in a knowledge corpus and putting it in the context window of the model. While this approach is well-established for document-structured data, it is non-trivial to adapt it for Knowledge Graphs (KGs), especially for queries that require multi-node/multi-hop reasoning on graphs. We introduce ULTRAG, a general framework for retrieving information from Knowledge Graphs that shifts away from classical RAG. By endowing LLMs with off-the-shelf neural query executing modules, we highlight how readily available language models can achieve state-of-the-art results on Knowledge Graph Question Answering (KGQA) tasks without any retraining of the LLM or executor involved. In our experiments, ULTRAG achieves better performance when compared to state-of-the-art KG-RAG solutions, and it enables language models to interface with Wikidata-scale graphs (116M entities, 1.6B relations) at comparable or lower costs.


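[83] Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE cs.IR | cs.CL

Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan

TL;DR: This paper studies how to improve Direct Preference Optimization (DPO) under implicit feedback for multimodal sequential recommendation. Systematic experiments show that replacing deterministic hard negatives with stochastic sampling from a dynamic top-K candidate pool consistently improves ranking performance. The resulting method, RoDPO, can optionally be combined with a sparse Mixture-of-Experts (MoE) encoder.

Details

Motivation: Existing preference-based alignment methods such as DPO face difficulties under implicit feedback, because unobserved items are not reliable negatives and can produce erroneous suppressive gradients.

Result: On three Amazon benchmarks, RoDPO achieves gains of up to 5.25% in NDCG@5 with nearly unchanged inference cost.

Insight: The novelty is replacing hard negatives with stochastic sampling from a dynamic top-K candidate pool, which reduces erroneous gradients caused by false negatives while retaining informative hard signals through controlled stochasticity. A sparse MoE encoder provides efficient capacity scaling.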

Abstract: Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems. Yet, existing work rarely examines how Direct Preference Optimization (DPO) behaves under implicit feedback, where unobserved items are not reliable negatives. We conduct systematic experiments on multimodal sequential recommendation to compare common negative-selection strategies and their interaction with DPO training. Our central finding is that a simple modification, replacing deterministic hard negatives with stochastic sampling from a dynamic top-K candidate pool, consistently improves ranking performance. We attribute its effectiveness to two factors: (1) reducing erroneous suppressive gradients caused by false negatives, and (2) retaining informative hard signals while smoothing optimization via controlled stochasticity. With an optional sparse Mixture-of-Experts encoder for efficient capacity scaling, RoDPO achieves up to 5.25% NDCG@5 on three Amazon benchmarks, with nearly unchanged inference cost.
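
The contrast between a deterministic hard negative and stochastic sampling from a top-K pool can be sketched in a few lines of Python. This is an illustrative sketch, not the RoDPO code: the function names, the toy scores, and the choice of uniform sampling within the pool are assumptions, and `dpo_loss` is the standard pairwise DPO objective rather than anything specific to this paper.

```python
import math
import random


def hardest_negative(scores, observed):
    """Deterministic baseline: the highest-scoring item the user has not interacted with."""
    candidates = [item for item in scores if item not in observed]
    return max(candidates, key=lambda item: scores[item])


def sample_topk_negative(scores, observed, k, rng):
    """Draw one negative uniformly from the k highest-scoring unobserved items.

    The pool is "dynamic" in the sense that `scores` would come from the
    current model, so it changes as training proceeds.
    """
    candidates = [item for item in scores if item not in observed]
    pool = sorted(candidates, key=lambda item: scores[item], reverse=True)[:k]
    return rng.choice(pool)


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair: -log sigmoid(margin)."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return math.log1p(math.exp(-margin))


# Toy example (hypothetical scores): item "a" was observed, "b" might be a false negative.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
observed = {"a"}
rng = random.Random(0)

print(hardest_negative(scores, observed))            # always "b"
print(sample_topk_negative(scores, observed, 2, rng))  # "b" or "c"
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

With the deterministic rule, an item such as "b" that the user would actually like is pushed down on every step; sampling from the pool spreads that gradient over several hard candidates, which is the intuition the abstract gives for factor (1).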


cs.AI [Back]

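[84] GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification cs.AI | cs.CL

Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu

TL;DR: GISTBench is a new benchmark for evaluating how well large language models (LLMs) understand users from their interaction histories in recommendation systems. Unlike traditional recommender benchmarks, which focus on item prediction accuracy, GISTBench evaluates the ability of LLMs to extract and verify user interests from engagement data. It introduces two new metric families, Interest Groundedness (IG) and Interest Specificity (IS), and releases a synthetic dataset built on real user interactions from a short-form video platform. An evaluation of eight open-weight LLMs spanning 7B to 120B parameters reveals bottlenecks in accurately counting and attributing signals across heterogeneous interaction types.

Details

Motivation: Traditional recommender benchmarks focus on item prediction accuracy and do not evaluate whether LLMs can understand users from their interaction histories, in particular by extracting and verifying user interests.

Result: Eight open-weight LLMs (7B to 120B parameters) are evaluated on the constructed synthetic dataset. The results show that current LLMs have limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.

Insight: The novelty is a new evaluation benchmark focused on LLM user understanding rather than item prediction, together with two new decomposable metric families (IG and IS) that quantify how well an LLM extracts and verifies user interests. The synthetic dataset, which contains implicit and explicit engagement signals and rich textual descriptions and whose fidelity is validated against user surveys, offers a useful tool for evaluating LLM understanding of complex, realistic user behavior data.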

Abstract: We introduce GISTBench, a benchmark for evaluating Large Language Models’ (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
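
To make the precision/recall decomposition of Interest Groundedness concrete, here is a minimal Python sketch. The abstract does not give the exact formulas, so the set-based definitions below are an assumption: `predicted` stands for the interest categories an LLM outputs, `verified` for the categories supported by evidence in the engagement data, and the function name `interest_groundedness` is hypothetical.

```python
def interest_groundedness(predicted, verified):
    """Return (precision, recall) of predicted interests against evidence-backed ones.

    Precision drops when the model hallucinates categories with no supporting
    evidence; recall drops when it misses categories the evidence supports.
    """
    predicted, verified = set(predicted), set(verified)
    hits = predicted & verified
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(verified) if verified else 0.0
    return precision, recall


# Toy example: "chess" is hallucinated; "music" and "gaming" are missed.
p, r = interest_groundedness(
    ["cooking", "travel", "chess"],
    ["cooking", "travel", "music", "gaming"],
)
print(p, r)
```

Reporting the two components separately, as the abstract describes, keeps a model that lists many unsupported interests from hiding behind high coverage.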


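[85] Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems cs.AI | cs.CL | cs.CV

Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang

TL;DR: This paper presents Xuanwu VL-2B as a case study in evolving general multimodal large models into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture that balances fine-grained visual perception, language-semantic alignment, and deployment cost within a budget of roughly 2B parameters. A data iteration and curation mechanism, together with a progressive three-stage pipeline of pre-training, mid-training, and post-training, balances business specialization against the retention of general capabilities.

Details

Motivation: Mainstream multimodal large models suffer degraded generalization and catastrophic forgetting in real-world content moderation and adversarial settings, owing to limited fine-grained visual perception and insufficient modeling of long-tail noise.

Result: Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%).

Insight: The novelty is a systematic approach that, under a limited parameter budget, combines a compact architecture, a data iteration and curation mechanism, and a three-stage progressive training pipeline to reach a practical balance among business alignment, visual perception, general capability retention, and deployment cost. It offers a reusable engineering path for customizing general models into industrial-grade foundation models.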

Abstract: In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.


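[86] Reasoning-Driven Synthetic Data Generation and Evaluation cs.AI | cs.CL | cs.LG

Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous

TL;DR: This paper proposes Simula, a reasoning-driven framework for generating and evaluating synthetic data at scale. It takes a seedless, agentic approach that lets users define the desired dataset characteristics through an explainable and controllable process, addressing the scarcity, cost, and inaccessibility of training data for multimodal models.

Details

Motivation: Training specialized multimodal models is hampered by scarce data and by human annotation that is expensive and error-prone, while existing synthetic data generation methods are limited in scalability, explainability, and control.

Result: The paper demonstrates the approach on a variety of datasets and rigorously tests both intrinsic properties and downstream performance, but the abstract reports no specific benchmarks or quantitative comparisons with state-of-the-art models.

Insight: The novelty is a seedless, agentic reasoning framework that enables explainable and controllable synthetic data generation at scale. It opens new opportunities for AI development in domains where data is scarce or privacy-sensitive, and it offers guidelines for synthetic data mechanism design.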

Abstract: Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution - limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.


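[87] Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis cs.AI | cs.CL

Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang

TL;DR: Owl-AuraID 1.0 is a software-hardware collaborative embodied agent system that operates scientific instruments through a GUI-native paradigm, enabling autonomous instrument operation and data analysis. Its skill-centric framework integrates GUI operation (Type-1) and data analysis (Type-2) skills into end-to-end workflows, covering ten categories of precision instruments and multimodal workflows, with supported modalities including FTIR, NMR, AFM, and TGA.

Details

Motivation: Automation of high-throughput scientific characterization is hindered by proprietary GUIs and by the limited generalizability of existing API-based systems. The system automates instrument operation by working through the same interfaces that human experts use.

Result: The system demonstrates broad coverage across ten categories of precision instruments and diverse workflows (such as multimodal spectral analysis, microscopic imaging, and crystallographic analysis), providing a practical and extensible foundation for autonomous laboratories.

Insight: The novelty is the use of a GUI-native paradigm instead of conventional APIs, together with a skill-centric framework that integrates operational and data analysis skills. This suggests a path toward evolving laboratory intelligence through reusable operational and analytical skills.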

Abstract: Scientific discovery increasingly depends on high-throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API-based systems. We present Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm to operate instruments through the same interfaces as human experts. Its skill-centric framework integrates Type-1 (GUI operation) and Type-2 (data analysis) skills into end-to-end workflows, connecting physical sample handling with scientific interpretation. Owl-AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl-AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code is available at https://github.com/OpenOwlab/AuraID.


eess.IV [Back]

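[88] Retinal Malady Classification using AI: A novel ViT-SVM combination architecture eess.IV | cs.CV

Shashwat Jha, Vishvaditya Luhach, Raju Poddar

TL;DR: This study proposes a hybrid architecture combining a Vision Transformer (ViT) with a Support Vector Machine (SVM), called ViT-SVM, for automatically classifying optical coherence tomography (OCT) images to support early detection of common sight-threatening eye conditions such as macular holes, central serous retinopathy, and diabetic retinopathy.

Details

Motivation: To meet the clinical need for early, automated detection of sight-threatening retinal conditions in order to prevent vision loss.

Result: The abstract reports no specific quantitative results or benchmarks; the paper analyzes the performance of the ViT-SVM architecture on OCT scan classification.

Insight: The novelty is combining the global feature extraction of ViT with the classification strengths of SVM in a hybrid model for medical image classification. The architectural idea could carry over to other fine-grained visual tasks.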

Abstract: Macular holes, central serous retinopathy, and diabetic retinopathy are among the most widespread maladies of the eye and are responsible for partial or complete vision loss, which makes early detection of these defects essential for the well-being of the patient. This study introduces a hybrid architecture based on a Vision Transformer and a Support Vector Machine (ViT-SVM) and analyses its performance in classifying optical coherence tomography (OCT) scans, with the aim of automating the early detection of these retinal defects.


cs.SE [Back]

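[89] Designing FSMs Specifications from Requirements with GPT 4.0 cs.SE | cs.AI | cs.CL | cs.FL

Omer Nguena Timo, Paul-Alexis Rodriguez, Florent Avellaneda

TL;DR: This paper proposes a GPT-4.0-based framework for automatically generating finite state machine (FSM) specifications from natural-language requirements documents. It also introduces an expert-centric FSM repair approach that uses mutation and test generation to improve FSM quality, in support of applications such as automated testing in model-driven engineering (MDE).

Details

Motivation: Manually designing high-quality FSMs from natural-language requirements is difficult, and low-quality FSMs let more faults survive the testing phase, raising the risk of system failure in production and potentially leading to catastrophic scenarios.

Result: The paper presents an experimental analysis on simulated data that evaluates the ability of LLMs (such as GPT-4.0) to perform the framework's tasks and to repair FSMs, but reports no specific benchmarks or comparisons with the state of the art.

Insight: The novelty is combining LLM-based FSM generation with an expert-centric repair approach based on mutation and test generation. This offers a new perspective on applying LLMs in MDE and supports further use of machine learning in systems engineering.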

Abstract: Finite state machines (FSM) are executable formal specifications of reactive systems. These machines are designed based on systems’ requirements. The requirements are often recorded in textual documents written in natural languages. FSMs play a crucial role in different phases of the model-driven system engineering (MDE). For example, they serve to automate testing activities. FSM quality is critical: the lower the quality of FSM, the higher the number of faults surviving the testing phase and the higher the risk of failure of the systems in production, which could lead to catastrophic scenarios. Therefore, this paper leverages recent advances in the domain of LLM to propose an LLM-based framework for designing FSMs from requirements. The framework also suggests an expert-centric approach based on FSM mutation and test generation for repairing the FSMs produced by LLMs. This paper also provides an experimental analysis and evaluation of LLM’s capacities in performing the tasks presented in the framework and FSM repair via various methods. The paper presents experimental results with simulated data. These results and methods bring a new analysis and vision of LLMs that are useful for further development of machine learning technology and its applications to MDE.


cs.GR [Back]

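[90] Bioinspired123D: Generative 3D Modeling System for Bioinspired Structures cs.GR | cs.CV

Rachel K. Luu, Markus J. Buehler

TL;DR: This paper presents Bioinspired123D, a lightweight, modular text-to-3D modeling system for generating bioinspired structures. At its core is Bioinspired3D, a compact finetuned language model that translates natural-language design prompts into Blender Python scripts, producing fabricable parametric 3D geometry. The system also includes a graph-based agentic framework that uses retrieval-augmented generation and a vision-language model critic to iteratively evaluate and repair the generated scripts.

Details

Motivation: Current text-to-3D generation for scientific design suffers from limited controllability and high computational cost, and existing methods based on meshes, voxels, or point clouds are costly to train and difficult to control.

Result: On a new benchmark for 3D geometry script generation, Bioinspired123D improves nearly fourfold over its non-finetuned base model and outperforms substantially larger state-of-the-art language models while using far fewer parameters and less compute.

Insight: The work adopts a code-as-geometry representation, creating 3D structures directly by generating parametric programs (Blender Python scripts) rather than dense visual representations, which makes text-to-3D generation compute-efficient, controllable, and interpretable. It also contributes a domain-specific dataset of over 4,000 bioinspired and geometric design scripts, an automated LLM-driven, Blender-based quality control pipeline, and a graph-based agentic framework that integrates retrieval-augmented generation and a vision-language model critic for iterative refinement.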

Abstract: Generative AI has made rapid progress in text, image, and video synthesis, yet text-to-3D modeling for scientific design remains particularly challenging due to limited controllability and high computational cost. Most existing 3D generative methods rely on meshes, voxels, or point clouds which can be costly to train and difficult to control. We introduce Bioinspired123D, a lightweight and modular code-as-geometry pipeline that generates fabricable 3D structures directly through parametric programs rather than dense visual representations. At the core of Bioinspired123D is Bioinspired3D, a compact language model finetuned to translate natural language design cues into Blender Python scripts encoding smooth, biologically inspired geometries. We curate a domain-specific dataset of over 4,000 bioinspired and geometric design scripts spanning helical, cellular, and tubular motifs with parametric variability. The dataset is expanded and validated through an automated LLM-driven, Blender-based quality control pipeline. Bioinspired3D is then embedded in a graph-based agentic framework that integrates multimodal retrieval-augmented generation and a vision-language model critic to iteratively evaluate, critique, and repair generated scripts. We evaluate performance on a new benchmark for 3D geometry script generation and show that Bioinspired123D demonstrates a near fourfold improvement over its non-finetuned base model, while also outperforming substantially larger state-of-the-art language models despite using far fewer parameters and compute. By prioritizing code-as-geometry representations, Bioinspired123D enables compute-efficient, controllable, and interpretable text-to-3D generation, lowering barriers to AI driven scientific discovery in materials and structural design.


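[91] IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection cs.GR | cs.AI | cs.CV | cs.MA

Fei Shen, Chengyu Xie, Lihong Wang, Zhanyi Zhang, Xin Jiang

TL;DR: This paper proposes IMAGAgent, a multi-turn image editing agent framework built on a “plan-execute-reflect” closed-loop mechanism. A constraint-aware planning module decomposes complex instructions, a tool-chain orchestration module dynamically schedules heterogeneous operation models, and a multi-expert collaborative reflection mechanism performs adaptive correction. Together these address the error accumulation and semantic drift that existing methods suffer in multi-turn editing because they lack context awareness and closed-loop feedback.

Details

Motivation: Existing multi-turn image editing paradigms are usually confined to isolated single-step execution. Lacking context awareness and closed-loop feedback, they are prone to error accumulation and semantic drift across turns, which leads to severe structural distortion in the generated images.

Result: Extensive experiments on the newly constructed MTEditBench and on the MagicBrush dataset show that IMAGAgent significantly outperforms existing methods in instruction consistency, editing precision, and overall quality.

Insight: The novelty is a unified “plan-execute-reflect” closed-loop framework that tightly couples instruction parsing, tool scheduling, and adaptive correction. Specifically, it uses a vision-language model (VLM) for constraint-aware planning, dynamically builds execution paths over heterogeneous operation models (retrieval, segmentation, detection, editing), and uses a multi-expert collaborative reflection mechanism coordinated by a large language model (LLM) for fine-grained self-correction and decision optimization.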

Abstract: Existing multi-turn image editing paradigms are often confined to isolated single-step execution. Due to a lack of context-awareness and closed-loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi-turn interactions, ultimately resulting in severe structural distortion of the generated images. To address this, we propose IMAGAgent, a multi-turn image editing agent framework based on a “plan-execute-reflect” closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Specifically, we first present a constraint-aware planning module that leverages a vision-language model (VLM) to precisely decompose complex natural language instructions into a series of executable sub-tasks, governed by target singularity, semantic atomicity, and visual perceptibility. Then, the tool-chain orchestration module dynamically constructs execution paths based on the current image, the current sub-task, and the historical context, enabling adaptive scheduling and collaborative operation among heterogeneous operation models covering image retrieval, segmentation, detection, and editing. Finally, we devise a multi-expert collaborative reflection mechanism where a central large language model (LLM) receives the image to be edited and synthesizes VLM critiques into holistic feedback, simultaneously triggering fine-grained self-correction and recording feedback outcomes to optimize future decisions. Extensive experiments on our constructed MTEditBench and the MagicBrush dataset demonstrate that IMAGAgent achieves performance significantly superior to existing methods in terms of instruction consistency, editing precision, and overall quality. The code is available at https://github.com/hackermmzz/IMAGAgent.git.


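[92] CADReasoner: Iterative Program Editing for CAD Reverse Engineering cs.GR | cs.CV | cs.HC

Soslan Kabisov, Vsevolod Kirichuk, Andrey Volkov, Gennadii Savrasov, Marina Barannikov

TL;DR: CADReasoner is an iterative program editing model for CAD reverse engineering. It fuses multi-view renders and point clouds, iteratively refines its prediction using the geometric discrepancy between the input shape and the predicted shape, and outputs a runnable CadQuery Python program.

Details

Motivation: Most existing AI systems are single-pass and miss fine geometric details, whereas human engineers refine a design by iteratively comparing and modifying it. Current agent-based methods rely on frozen vision-language models, and the weak 3D geometric understanding of foundation models limits their reliability and efficiency.

Result: Across the DeepCAD, Fusion 360, and MCB benchmarks, CADReasoner attains state-of-the-art results on both clean and scan-simulated data.

Insight: The contributions are an iterative program editing framework driven by geometric discrepancy feedback, multimodal fusion (multi-view renders plus point clouds) for stronger 3D understanding, and a scan-simulation protocol, applied in both training and evaluation, that narrows the realism gap.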

Abstract: Computer-Aided Design (CAD) powers modern engineering, yet producing high-quality parts still demands substantial expert effort. Many AI systems tackle CAD reverse engineering, but most are single-pass and miss fine geometric details. In contrast, human engineers compare the input shape with the reconstruction and iteratively modify the design based on remaining discrepancies. Agent-based methods mimic this loop with frozen VLMs, but weak 3D grounding of current foundation models limits reliability and efficiency. We introduce CADReasoner, a model trained to iteratively refine its prediction using geometric discrepancy between the input and the predicted shape. The model outputs a runnable CadQuery Python program whose rendered mesh is fed back at the next step. CADReasoner fuses multi-view renders and point clouds as complementary modalities. To bridge the realism gap, we propose a scan-simulation protocol applied during both training and evaluation. Across DeepCAD, Fusion 360, and MCB benchmarks, CADReasoner attains state-of-the-art results on clean and scan-sim tracks.


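[93] VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing cs.GR | cs.AI | cs.CV

Juan Rodriguez, Haotian Zhang, Abhay Puri, Tianyang Zhang, Rishav Pramanik

TL;DR: This paper presents VectorGym, a comprehensive multitask benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, complex editing, and visual understanding. It comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch), an SVG editing dataset with complex multi-step edits (VG-Edit), Text2SVG generation (VG-Text), and SVG captioning (VG-Cap). The paper also proposes a multi-task reinforcement learning approach with rendering-based rewards that, applied to a Qwen3-VL 8B model, achieves state-of-the-art performance among open-source models.

Details

Motivation: There is a lack of realistic, challenging SVG benchmarks aligned with professional design workflows.

Result: On VectorGym, a Qwen3-VL 8B model trained with GRPO and curriculum learning reaches state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o.

Insight: The contributions are: (1) a realistic multitask SVG benchmark with expert human annotations that emphasizes semantic understanding and design intent; (2) a multi-task reinforcement learning method that jointly optimizes all tasks with rendering-based rewards; and (3) a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies.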

Abstract: We introduce VectorGym, a comprehensive benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, complex editing, and visual understanding. VectorGym addresses the lack of realistic, challenging benchmarks aligned with professional design workflows. Our benchmark comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch); a new SVG editing dataset (VG-Edit) featuring complex, multi-step edits with higher-order primitives; Text2SVG generation (VG-Text); and SVG captioning (VG-Cap). Unlike prior benchmarks that rely on synthetic edits, VectorGym provides gold-standard human annotations that require semantic understanding and design intent. We also propose a multi-task reinforcement learning approach that jointly optimizes across all four tasks using rendering-based rewards. Our method, built on GRPO with curriculum learning, trains a Qwen3-VL 8B model that achieves state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o. We also introduce a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies. Our evaluation of frontier VLMs reveals significant performance gaps, positioning VectorGym as a rigorous framework for advancing visual code generation. VectorGym is publicly available on huggingface.co/datasets/ServiceNow/VectorGym.


cs.HC [Back]

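[94] Perfecting Human-AI Interaction at Clinical Scale. Turning Production Signals into Safer, More Human Conversations cs.HC | cs.AI | cs.CL | cs.MA

Subhabrata Mukherjee, Markel Sanz Ausin, Kriti Aggarwal, Debajyoti Datta, Shanil Puri

TL;DR: This paper presents a framework for optimizing healthcare conversational AI using real-time signals from production. Drawing on more than 115 million live patient-AI interactions, it treats interaction intelligence signals such as paralinguistics, turn-taking, and clarification triggers as core safety variables, and reports significant gains in clinical safety, task completion, and equity.

Details

Motivation: Healthcare conversational AI becomes unreliable in production because of real-world factors such as imperfect audio, indirect intent, and language that shifts mid-call. The work aims to move beyond optimizing only for benchmark accuracy.

Result: Deployed across more than 10 million real patient calls, the Polaris system attains a clinical safety score of 99.9% and an average patient rating of 8.95, and reduces ASR errors by 50% relative to enterprise ASR.

Insight: The work treats interaction intelligence (tone, pacing, empathy, and so on) as a first-class safety variable, and argues that healthcare-grade safety requires redundancy through governed multi-LLM orchestration, vertically integrated contextual ASR, and latency-aware architecture.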

Abstract: Healthcare conversational AI agents shouldn’t be optimized only for clean benchmark accuracy in a production-first regime; they must be optimized for the lived reality of patient conversations, where audio is imperfect, intent is indirect, language shifts mid-call, and compliance hinges on how guidance is delivered. We present a production-validated framework grounded in real-time signals from 115M+ live patient-AI interactions and clinician-led testing (7K+ licensed clinicians; 500K+ test calls). These in-the-wild cues – paralinguistics, turn-taking dynamics, clarification triggers, escalation markers, multilingual continuity, and workflow confirmations – reveal failure modes that curated data misses and provide actionable training and evaluation signals for safety and reliability. We further show why healthcare-grade safety cannot rely on a single LLM: long-horizon dialogue and limited attention demand redundancy via governed orchestration, independent checks, and verification. Many apparent “reasoning” errors originate upstream, motivating vertical integration across contextual ASR, clarification/repair, ambient speech handling, and latency-aware model/hardware choices. Treating interaction intelligence (tone, pacing, empathy, clarification, turn-taking) as first-class safety variables, we drive measurable gains in safety, documentation, task completion, and equity in building the safest generative AI solution for autonomous patient-facing care. Deployed across more than 10 million real patient calls, Polaris attains a clinical safety score of 99.9%, while significantly improving patient experience with an average patient rating of 8.95 and reducing ASR errors by 50% over enterprise ASR. These results establish real-world interaction intelligence as a critical – and previously underexplored – determinant of safety and reliability in patient-facing clinical AI systems.


cs.LG [Back]

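[95] A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models cs.LG | cs.CL | cs.CV

Lixin Xiu, Xufang Luo, Hideki Nakayama

TL;DR: This paper proposes a framework based on partial information decomposition (PID) for quantitatively analyzing the decision process of large vision-language models (LVLMs). By decomposing a model's decision-relevant information into redundant, unique, and synergistic components, it reveals how LVLMs process information across models, across tasks, across layers, and over the course of training.

Details

Motivation: LVLMs perform impressively, but their internal decision process is opaque, which makes it hard to tell whether their success comes from genuine multimodal fusion or from reliance on unimodal priors. A quantitative method is needed to close this attribution gap.

Result: An analysis of 26 LVLMs on four datasets finds two task regimes (synergy-driven vs. knowledge-driven) and two stable family-level strategies (fusion-centric vs. language-centric). It also reveals a consistent three-phase pattern in layer-wise processing and identifies visual instruction tuning as the key stage where fusion is learned.

Insight: The novelty is extending the PID framework to LVLMs, which gives a quantitative view beyond accuracy-only evaluation and helps in understanding and designing the next generation of LVLMs. The approach offers a systematic tool for studying model interpretability and multimodal fusion mechanisms.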

Abstract: Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the “information spectrum” of LVLMs – decomposing a model’s decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions – breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis .
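
For readers unfamiliar with partial information decomposition, the two-source form that the redundant/unique/synergistic split refers to can be written as follows. This is the standard textbook statement, not a formula quoted from the paper; the symbol choices are assumptions, with $X_1$ and $X_2$ standing for the visual and textual inputs, $Y$ for the model's decision, and $R$, $U_1$, $U_2$, $S$ for the redundant, unique (one per modality), and synergistic components.

```latex
\begin{aligned}
I(X_1, X_2; Y) &= R + U_1 + U_2 + S, \\
I(X_1; Y)      &= R + U_1, \\
I(X_2; Y)      &= R + U_2 .
\end{aligned}
```

Substituting the last two lines into the first gives $S = I(X_1, X_2; Y) - I(X_1; Y) - I(X_2; Y) + R$, so the three mutual-information terms alone do not determine the decomposition; a definition of redundancy (or an estimator, as the abstract mentions) is what fixes the remaining components.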


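[96] Training-Free Dynamic Upcycling of Expert Language Models cs.LG | cs.CL

Eros Fanì, Oğuzhan Ersoy

TL;DR: This paper proposes Dynamic Upcycling MoE (DUME), a training-free method that efficiently combines dense expert models pretrained on different domains into a unified Mixture-of-Experts architecture. The result is a single multitask model that preserves each expert's original capabilities while avoiding expensive retraining.

Details

Motivation: Training large language models is prohibitively expensive and the models often lack domain expertise, while existing remedies (expert finetuning or multitask training) lead to overspecialization, conflicting objectives, and catastrophic forgetting.

Result: On causal language modeling and reasoning tasks, DUME consistently outperforms baseline approaches. It retains up to 97.6% of the performance of a domain-specific dense expert, and in the reasoning setting it reaches 102.1% of the dense expert's performance.

Insight: The novelty is integrating pretrained expert models dynamically with no additional training, using the closed-form solution of ridge regression for efficient, scalable model construction, with optional finetuning afterwards to improve performance further.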

Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model. Our method builds a single multitask model that preserves the capabilities of the original dense experts without requiring additional training. DUME is both cost-efficient and scalable: by leveraging the closed-form solution of ridge regression, it eliminates the need for further optimization and enables experts to be added dynamically while maintaining the model’s original performance. We demonstrate that DUME consistently outperforms baseline approaches in both causal language modeling and reasoning settings. Finally, we also show that the DUME model can be fine-tuned to further improve performance. We show that, in the causal language modeling setting, DUME can retain up to 97.6% of a dense expert model specialized in one particular domain, and that it can also surpass it in the reasoning setting, where it can achieve 102.1% of the dense expert performance. Our code is available at: github.com/gensyn-ai/dume.
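
The closed-form ridge regression solution that the abstract refers to is the standard one below. It is stated generically: the abstract does not say what DUME uses as inputs and targets, so $X$ (a matrix of input features), $Y$ (a matrix of targets), $W$ (the fitted weights), and the regularization strength $\lambda$ are placeholder symbols rather than the paper's notation.

```latex
\begin{aligned}
W^{*} &= \arg\min_{W}\; \lVert XW - Y \rVert_F^{2} + \lambda \lVert W \rVert_F^{2} \\
      &= \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} Y, \qquad \lambda > 0 .
\end{aligned}
```

Because $X^{\top}X = \sum_i x_i x_i^{\top}$ and $X^{\top}Y = \sum_i x_i y_i^{\top}$ are sums over samples, both can be accumulated in a single pass and updated when new data arrives, with no gradient-based optimization. That property is consistent with the abstract's claim that experts can be added dynamically without further training, though the exact mechanism is described in the paper, not here.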


[97] HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling cs.LG | cs.CV | cs.ROPDF

Jaber Jaber, Osama Jaber

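TL;DR: This paper proposes HCLSM (Hierarchical Causal Latent State Machines), a world model architecture for predicting future states from video. The model rests on three core principles: object-centric decomposition via slot attention and spatial broadcast decoding; hierarchical temporal dynamics built from selective state space models, sparse transformers, and compressed transformers; and causal structure learning via graph neural networks. It uses a two-stage training protocol and is validated on the PushT robotic manipulation benchmark.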

Details

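Motivation: Existing world models use flat latent representations, which entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. HCLSM aims to address these problems through hierarchical and causal modeling, so that dynamics in complex environments are represented and predicted more accurately.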

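Result: On the PushT benchmark from the Open X-Embodiment dataset, a 68M-parameter HCLSM model achieves a next-state prediction MSE loss of 0.008 and shows emerging spatial decomposition (SBD loss: 0.0075) along with learned event boundaries. A custom Triton kernel for the SSM scan is 38x faster than a sequential PyTorch implementation.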

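Insight: The innovations are: 1) unifying object-centric decomposition, hierarchical temporal dynamics (continuous physics, discrete events, abstract goals), and causal structure learning in a single architecture; 2) a two-stage training protocol, spatial reconstruction first and dynamics prediction second, which forces slot specialization; 3) on the engineering side, custom kernels that substantially improve computational efficiency. Together these suggest a new direction for building more interpretable and more efficient world models.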

Abstract: World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture that operates on three interconnected principles: object-centric decomposition via slot attention with spatial broadcast decoding, hierarchical temporal dynamics through a three-level engine combining selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals, and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two-stage training protocol where spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M-parameter model on the PushT robotic manipulation benchmark from the Open X-Embodiment dataset, achieving 0.008 MSE next-state prediction loss with emerging spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers 38x speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: https://github.com/rightnow-ai/hclsm
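The abstract reports a speedup for the SSM scan but does not describe the kernel. As a hedged illustration of why such a scan can be parallelized at all, the sketch below treats a scalar linear recurrence h_t = a_t · h_{t-1} + b_t, first as a plain sequential loop and then as a recursive-doubling (Hillis-Steele) scan over composed affine maps. The scalar recurrence, the function names, and the choice of scan algorithm are assumptions for illustration; they are not taken from the HCLSM code.

```python
def sequential_scan(a, b, h0=0.0):
    """Reference implementation: one dependent step per time index."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine maps h -> a*h + b, with 'left' applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def doubling_scan(a, b, h0=0.0):
    """Inclusive scan by recursive doubling.

    Within each pass, every element is updated independently of the others,
    so a GPU could run a whole pass in parallel. Only about log2(T) passes
    are needed instead of T dependent steps.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # Each element now maps h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems]
```

The key design point is that composing affine maps is associative, which is what lets the prefix results be built up in any grouping. In this pure-Python form the doubling version is not faster; it only exposes the structure that a parallel kernel would exploit.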


astro-ph.IM [Back]

[98] STRADAViT: Towards a Foundational Model for Radio Astronomy through Self-Supervised Transfer astro-ph.IM | cs.CVPDF

Andrea DeMarco, Ian Fenech Conti, Hayley Camilleri, Ardiana Bushi, Simone Riggi

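TL;DR: STRADAViT is a self-supervised Vision Transformer (ViT) continued-pretraining framework for radio astronomy images, intended to produce transferable foundation encoders. By combining a mixed-telescope dataset, domain-specific view generation, and a staged continued-pretraining strategy, it improves transfer performance on several radio source morphology benchmarks.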

Details

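Motivation: Next-generation radio astronomy surveys are producing millions of resolved sources, but robust morphology analysis remains difficult because telescopes and imaging pipelines are heterogeneous. This work aims to develop a transferable, self-supervised vision foundation model that better handles radio astronomy images across instruments and datasets.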

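Result: Evaluation covers three morphology benchmarks: MiraBest, LoTSS DR2, and Radio Galaxy Zoo (RGZ) DR1. Relative to the initialization used for continued pretraining, the best two-stage STRADAViT models improve Macro-F1 in all reported linear-probe settings and in most fine-tuning settings, with the largest gain on RGZ DR1. Relative to strong DINOv2 baselines, gains are selective but remain positive on LoTSS DR2 and RGZ DR1 under linear probing, and on MiraBest and RGZ DR1 under fine-tuning.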

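Insight: The main innovations are: 1) pretraining on a mixed dataset drawn from multiple telescopes (such as MeerKAT, ASKAP, LOFAR/LoTSS, and SKA); 2) a view-generation method designed around the characteristics of radio astronomy data; 3) a controlled continued-pretraining framework with reconstruction-only, contrastive-only, and two-stage branches. The study shows that domain-aware view generation and staged continued pretraining give a stronger starting point for radio astronomy transfer than off-the-shelf Vision Transformers.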

Abstract: Next-generation radio astronomy surveys are producing millions of resolved sources, but robust morphology analysis remains difficult across heterogeneous telescopes and imaging pipelines. We present STRADAViT, a self-supervised Vision Transformer continued-pretraining framework for transferable radio astronomy image encoders. STRADAViT combines a mixed-survey pretraining dataset, radio astronomy-aware view generation, and controlled continued pretraining through reconstruction-only, contrastive-only, and two-stage branches. Pretraining uses 512x512 radio astronomy cutouts from MeerKAT, ASKAP, LOFAR/LoTSS, and SKA data. We evaluate transfer with linear probing and fine-tuning on three morphology benchmarks: MiraBest, LoTSS DR2, and Radio Galaxy Zoo. Relative to the initialization used for continued pretraining, the best two-stage STRADAViT models improve Macro-F1 in all reported linear-probe settings and in most fine-tuning settings, with the largest gain on RGZ DR1. Relative to strong DINOv2 baselines, gains are selective but remain positive on LoTSS DR2 and RGZ DR1 under linear probing, and on MiraBest and RGZ DR1 under fine-tuning. A targeted DINOv2-initialized HCL ablation further shows that the adaptation recipe is not specific to a single starting point. The released STRADAViT checkpoint remains the preferred model because it offers competitive transfer at lower token count and downstream cost than the DINOv2-based alternative. These results show that radio astronomy-aware view generation and staged continued pretraining provide a stronger starting point than out-of-the-box Vision Transformers for radio astronomy transfer.
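The results above are reported as Macro-F1, which averages per-class F1 scores with equal weight and so penalizes a model that ignores a rare class. A minimal sketch of the metric follows. The class names and the toy predictions are made up for illustration and are not taken from the paper's benchmarks.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced example: the classifier always predicts "FRI".
y_true = ["FRI", "FRI", "FRI", "FRII"]
y_pred = ["FRI", "FRI", "FRI", "FRI"]
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.75, but the majority class scores F1 = 6/7 and the missed minority class scores 0, so Macro-F1 is 3/7 (about 0.43).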


cs.CR [Back]

[99] Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning cs.CR | cs.AI | cs.CLPDF

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei

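TL;DR: The paper proposes Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers, so that a model's answers to dangerous queries (such as expert-level CBRN questions) go undetected, while the model keeps its original reasoning ability with less than 5% performance degradation.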

Details

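Motivation: Fine-tuning APIs offered by major AI providers can be exploited by attackers to bypass safety measures. The study investigates whether adversarial fine-tuning can evade LLM-based content classifiers.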

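Result: On models with 14B+ parameters, Trojan-Speak achieves a classifier evasion rate above 99% with less than 5% degradation on reasoning benchmarks, markedly better than prior methods, which incur more than 25% degradation.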

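Insight: The novelty lies in combining curriculum learning with GRPO-based hybrid reinforcement learning to teach a model a communication protocol that evades content classification. Viewed objectively, the work shows that relying on LLM-based content classifiers alone is insufficient to prevent disclosure of dangerous information, and that activation-level probes can substantially improve robustness to such attacks.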

Abstract: Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic’s Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic’s Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.


cs.RO [Back]

[100] AutoWorld: Scaling Multi-Agent Traffic Simulation with Self-Supervised World Models cs.RO | cs.AI | cs.CV | cs.LGPDF

Mozhgan Pourkeshavatz, Tianran Liu, Nicholas Rhinehart

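TL;DR: AutoWorld is a multi-agent traffic simulation framework that uses a self-supervised world model learned from unlabeled LiDAR occupancy representations. It generates diverse traffic trajectories through a coarse-to-fine predictive scene context and cascaded Determinantal Point Process sampling, improving simulation realism without additional labeling.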

Details

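Motivation: Existing data-driven traffic simulators rely heavily on labeled trajectories or semantic annotations, which are costly and hard to scale, while large amounts of unlabeled sensor data go largely unused. Methods that can exploit unlabeled data to improve simulation performance are therefore needed.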

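Result: On the WOSAC benchmark, AutoWorld ranks first on the primary Realism Meta Metric (RMM). Experiments show that adding unlabeled LiDAR data consistently improves simulation performance, and ablation studies examine the contribution of each component.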

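Insight: The innovations include a self-supervised world model built on unlabeled LiDAR occupancy representations, a coarse-to-fine predictive scene context, cascaded Determinantal Point Process sampling to promote diversity, and a motion-aware latent supervision objective that strengthens the representation of scene dynamics. Together they offer a path to scaling traffic simulation realism without additional annotation.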

Abstract: Multi-agent traffic simulation is central to developing and testing autonomous driving systems. Recent data-driven simulators have achieved promising results, but rely heavily on supervised learning from labeled trajectories or semantic annotations, making it costly to scale their performance. Meanwhile, large amounts of unlabeled sensor data can be collected at scale but remain largely unused by existing traffic simulation frameworks. This raises a key question: How can a method harness unlabeled data to improve traffic simulation performance? In this work, we propose AutoWorld, a traffic simulation framework that employs a world model learned from unlabeled occupancy representations of LiDAR data. Given world model samples, AutoWorld constructs a coarse-to-fine predictive scene context as input to a multi-agent motion generation model. To promote sample diversity, AutoWorld uses a cascaded Determinantal Point Process framework to guide the sampling processes of both the world model and the motion model. Furthermore, we designed a motion-aware latent supervision objective that enhances AutoWorld’s representation of scene dynamics. Experiments on the WOSAC benchmark show that AutoWorld ranks first on the leaderboard according to the primary Realism Meta Metric (RMM). We further show that simulation performance consistently improves with the inclusion of unlabeled LiDAR data, and study the efficacy of each component with ablations. Our method paves the way for scaling traffic simulation realism without additional labeling. Our project page contains additional visualizations and released code.
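The abstract says a cascaded Determinantal Point Process (DPP) framework guides sampling toward diverse outputs, without giving details. As a hedged, single-stage illustration of how a DPP favors diversity, the sketch below builds a kernel L with L_ij = q_i · sim(i, j) · q_j and greedily picks the subset with the largest determinant. The RBF similarity, the quality scores, the greedy selection rule, and all names are assumptions for illustration; the paper's cascade across the world model and motion model is not reproduced here.

```python
import math


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d  # a row swap flips the sign of the determinant
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return d


def dpp_kernel(points, quality, bandwidth=1.0):
    """L_ij = q_i * exp(-||x_i - x_j||^2 / (2 * bandwidth^2)) * q_j."""
    def sim(p, q):
        sq = sum((x - y) ** 2 for x, y in zip(p, q))
        return math.exp(-sq / (2 * bandwidth ** 2))

    n = len(points)
    return [
        [quality[i] * sim(points[i], points[j]) * quality[j] for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp(kernel, k):
    """Greedily add the item that maximizes det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            value = det(sub)
            if value > best_det:
                best, best_det = i, value
        if best is None:  # every remaining item is redundant
            break
        selected.append(best)
    return selected


# Hypothetical candidates: two near-duplicate pairs, e.g. trajectory endpoints.
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0), (3.1, 0.1)]
quality = [1.0, 0.9, 0.8, 0.7]
chosen = greedy_dpp(dpp_kernel(points, quality), k=2)
```

Here the second pick skips candidate 1, even though it has the second-highest quality, because it nearly duplicates candidate 0 and therefore adds almost nothing to the determinant. The selection is candidates 0 and 2.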


[101] DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA cs.RO | cs.AI | cs.CV | cs.LGPDF

Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge

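TL;DR: This paper proposes DIAL, a framework that bridges high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. It targets the training instability and degraded semantic representations that arise when existing end-to-end Vision-Language-Action models use the vision-language model only as an encoder.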

Details

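Motivation: Existing end-to-end VLA models mainly use the pre-trained vision-language model as a multimodal encoder and map vision-language features directly to low-level actions. This underuses the VLM's potential for high-level decision making and introduces training instability.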

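Result: On the RoboCasa GR1 Tabletop benchmark, DIAL sets a new state of the art, achieving better results with one tenth of the demonstration data used by prior methods. In real-world deployment on a humanoid robot, it shows zero-shot generalization to unseen objects and novel configurations.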

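Insight: The novelty lies in using latent world modeling to synthesize latent visual foresight that serves as a structural bottleneck and explicitly encodes intent, together with a two-stage training paradigm (decoupled warmup followed by end-to-end joint optimization) that keeps optimization stable. This allows action-aware gradients to refine the VLM backbone in a controlled way while preserving pre-trained knowledge.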

Abstract: The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM’s potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM’s native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.