Table of Contents
- cs.CL [Total: 29]
- cs.CV [Total: 172]
- cs.AI [Total: 6]
- eess.IV [Total: 4]
- eess.SP [Total: 2]
- cs.IR [Total: 2]
- cs.GR [Total: 1]
- astro-ph.GA [Total: 1]
- cs.LG [Total: 5]
- cs.RO [Total: 12]
- cs.DB [Total: 1]
- q-bio.NC [Total: 1]
cs.CL
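[1] Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages cs.CL | cs.LG
Swastik R
TL;DR: This paper presents the first cross-lingual visual reasoning audit for Indian languages. It translates 980 questions from MathVista, ScienceQA, and MMMU into six Indian languages and evaluates eight vision-language models, ranging from 7B open-source models to GPT-4o. Accuracy drops by 9.8-25 percentage points when switching from English to an Indian language, and Dravidian languages drop more than Indo-Aryan ones. Chain-of-thought prompting actually hurts performance in some languages, exposing English-centric reasoning chains.
Details
Motivation: Existing vision-language models perform well on mathematical, scientific, and spatial reasoning benchmarks, but these evaluations are overwhelmingly in English. A systematic audit of multilingual visual reasoning, particularly for Indian languages, has been missing.
Result: Eight VLMs were evaluated on seven languages (English plus six Indian languages), producing 68,600 inference records. Accuracy drops by 9.8-25 percentage points when switching from English to an Indian language. Dravidian languages (e.g., Tamil, Telugu) drop up to 13.2 percentage points more than Indo-Aryan languages (e.g., Hindi, Bengali). Chain-of-thought prompting lowers accuracy by 14.4 and 11.4 percentage points in Bengali and Kannada, respectively. Aya-Vision-8B, a model built for 23 languages, still drops 28.5 percentage points on Dravidian scripts.
Insight: The contribution is the first cross-lingual visual reasoning benchmark for Indian languages, together with a large-scale audit. The key findings are that multilingual VLMs show substantial cross-lingual inequality in reasoning ability and that multilingual pretraining alone does not transfer visual reasoning. The effectiveness of chain-of-thought prompting depends heavily on the language, which exposes how English-centric current reasoning mechanisms are. The study underlines the importance of systematic evaluation in non-English languages and releases the translated benchmark and all model outputs.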
Abstract: Vision-language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross-lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross-verification on 50 samples per language (inter-translator agreement 0.79-0.84). Eight VLMs, from 7B open-source models to GPT-4o, are evaluated across all seven languages, yielding 68,600 inference records that include text-only and chain-of-thought ablations. I find accuracy drops of 9.8-25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo-Aryan. Chain-of-thought prompting degrades Bengali (-14.4 pp) and Kannada (-11.4 pp) rather than helping, exposing English-centric reasoning chains. Aya-Vision-8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.
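[2] LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models cs.CL | cs.LG
Shaik Aman
TL;DR: This paper proposes LogicDiff, an inference-time method for masked diffusion language models (MDLMs) that improves reasoning through logic-role-guided denoising. A lightweight classification head predicts the logical role of each masked position, and tokens are unmasked in logical dependency order (premises, connectives, derived steps, conclusions). This addresses the tendency of standard confidence-based unmasking to defer high-entropy logical connectives.
Details
Motivation: The standard confidence-based unmasking strategy of MDLMs systematically defers high-entropy logical connectives, which are the critical branching points in reasoning chains, and this severely degrades reasoning performance.
Result: Without modifying the base model's parameters and without reinforcement learning or task-specific training, LogicDiff raises LLaDA-8B-Instruct accuracy on GSM8K from 22.0% to 60.7% (+38.7 percentage points) and on MATH-500 from 23.6% to 29.2% (+5.6 percentage points), with less than 6% speed overhead.
Insight: The paper shows that the reasoning deficit of MDLMs stems largely from a suboptimal unmasking order rather than from limitations of the model's learned representations. It corrects that order with a lightweight logic-role-guided unmasking scheduler that leaves the base model untouched, yielding a large improvement in reasoning performance.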
Abstract: Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence, offering parallel generation and bidirectional context. However, their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, the critical branching points in reasoning chains, leading to severely degraded reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the logical role of each masked position (premise, connective, derived step, conclusion, or filler) from the base model’s hidden states with 98.4% accuracy. A dependency-ordered scheduler then unmasks tokens in logical dependency order: premises first, then connectives, then derived steps, then conclusions. Without modifying a single parameter of the base model and without any reinforcement learning or task-specific training, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead. Our results demonstrate that a substantial portion of the reasoning deficit in MDLMs is attributable to suboptimal token unmasking order, not to limitations of the model’s learned representations.
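The difference between the two unmasking orders described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `MaskedPosition` type, the role labels, and the decision to place "filler" last are assumptions made for illustration, and the roles and confidences are hard-coded rather than predicted by a model.

```python
from dataclasses import dataclass

# Assumed priority. The abstract orders premise -> connective -> derived step
# -> conclusion; scheduling "filler" last is an assumption of this sketch.
ROLE_PRIORITY = {"premise": 0, "connective": 1, "derived": 2, "conclusion": 3, "filler": 4}


@dataclass
class MaskedPosition:
    index: int         # position in the sequence
    role: str          # role that a (hypothetical) classification head would predict
    confidence: float  # base model's top-1 probability at this position


def confidence_order(positions):
    """Baseline: unmask the most confident positions first."""
    return [p.index for p in sorted(positions, key=lambda p: -p.confidence)]


def logic_role_order(positions):
    """Unmask by logical role first, breaking ties by confidence."""
    ranked = sorted(positions, key=lambda p: (ROLE_PRIORITY[p.role], -p.confidence))
    return [p.index for p in ranked]


positions = [
    MaskedPosition(0, "premise", 0.90),
    MaskedPosition(1, "connective", 0.30),  # high entropy, hence low confidence
    MaskedPosition(2, "derived", 0.70),
    MaskedPosition(3, "conclusion", 0.80),
]

print(confidence_order(positions))  # the connective is deferred to the end
print(logic_role_order(positions))  # the connective is placed right after the premise
```

In this toy case the confidence-based order leaves the low-confidence connective for last, while the role-based order resolves it before the derived step and conclusion that depend on it.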
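[3] RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models cs.CL
Rahul Soni
TL;DR: This paper proposes RASPRef (Retrieval-Augmented Self-Supervised Prompt Refinement), a framework that automatically refines prompts for large reasoning models to improve performance on tasks such as mathematical reasoning, without human annotation or task-specific supervision. It retrieves relevant examples and past reasoning trajectories and uses signals such as multi-sample consistency, verifier feedback, and model-generated critiques to refine the prompt iteratively.
Details
Motivation: Reasoning-focused language models such as DeepSeek R1 and OpenAI o1 perform strongly on structured reasoning benchmarks, but their performance is highly sensitive to prompt formulation, and designing effective prompts by hand is tedious and hard to scale across tasks or domains.
Result: Experiments on GSM8K-style mathematical reasoning tasks show that retrieval-guided prompt refinement improves performance over a static prompting baseline.
Insight: The novelty lies in treating the prompt itself as the optimization target and improving it directly through an iterative, retrieval-guided refinement process, rather than focusing only on model outputs. This offers a scalable self-improving prompting strategy and highlights how critical prompt design remains for reasoning-oriented language models.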
Abstract: Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks. However, their performance remains highly sensitive to prompt formulation, and designing effective prompts is typically a manual and iterative process that does not scale well across tasks or domains. To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision. The approach retrieves relevant examples and previously generated reasoning trajectories, and leverages signals such as multi-sample consistency, verifier feedback, and model-generated critiques to iteratively refine the prompt. Unlike prior approaches that focus primarily on improving model outputs, RASPRef directly treats the prompt as the optimization target and improves it through an iterative retrieval-guided refinement process. Experiments on GSM8K-style mathematical reasoning tasks show that retrieval-guided prompting improves performance compared with a static prompting baseline. We further discuss how retrieval quality, trajectory selection, and self-supervised feedback signals may influence the effectiveness of prompt refinement. These findings suggest that prompt design remains a critical factor for reasoning-oriented language models, and that self-improving prompts offer a practical and scalable strategy for improving reasoning performance.
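[4] TAPS: Task Aware Proposal Distributions for Speculative Sampling cs.CL | cs.AI
Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
TL;DR: This paper studies how the training data distribution of the draft model affects speculative decoding quality. Lightweight HASS and EAGLE-2 drafters are trained on task-specific data (MathInstruct, ShareGPT, and mixtures) and evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Task-specific training yields clear specialization: drafters trained on mathematical reasoning data are strongest on reasoning benchmarks, while drafters trained on conversational data are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures are not always best across decoding temperatures. The paper also explores combining specialized drafters at inference time: confidence-based routing beats single-domain drafters, merged-tree verification achieves the highest acceptance length, and confidence is a better routing signal than entropy.
Details
Motivation: Speculative decoding usually accelerates autoregressive generation with draft models trained on broad generic corpora, yet it is unclear how the draft training distribution affects decoding quality. This paper investigates how the match between draft training data and the downstream task influences speculative decoding performance.
Result: On MT-Bench, GSM8K, MATH-500, and SVAMP, task-specific drafters perform best on their matching tasks (for example, MathInstruct-trained drafters are strongest on reasoning benchmarks). Mixed-data training improves robustness but does not dominate at every temperature. Among inference-time combination methods, confidence-based routing and merged-tree verification both surpass single-domain drafters and reach higher acceptance lengths.
Insight: The paper shows that speculative decoding quality depends not only on draft architecture but also strongly on the match between draft training data and the downstream task. It combines specialized drafters at inference time through confidence-based routing and merged-tree verification rather than merging them in weight space, which gives a data-driven and inference-strategy perspective on optimizing speculative decoding.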
Abstract: Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
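The routing signals and the acceptance-length metric mentioned in this abstract can be sketched as follows. This is a simplified illustration rather than the paper's pipeline: the function names are hypothetical, the distributions are toy values, routing uses the mean top-1 probability of each drafter, and acceptance length is computed by plain greedy prefix matching.

```python
import math


def top1_confidence(dist):
    """Confidence signal: probability of the most likely next token."""
    return max(dist)


def entropy(dist):
    """Entropy signal (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def route_by_confidence(drafter_dists):
    """Pick the drafter whose drafted tokens have the highest mean top-1 probability.

    drafter_dists maps a drafter name to a list of next-token distributions,
    one per drafted token.
    """
    def mean_conf(dists):
        return sum(top1_confidence(d) for d in dists) / len(dists)

    return max(drafter_dists, key=lambda name: mean_conf(drafter_dists[name]))


def acceptance_length(draft_tokens, target_tokens):
    """Number of leading draft tokens the target model agrees with (greedy check)."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


# Toy distributions for two hypothetical specialized drafters on a math prompt.
drafter_dists = {
    "math_drafter": [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]],
    "chat_drafter": [[0.50, 0.30, 0.20], [0.40, 0.40, 0.20]],
}
chosen = route_by_confidence(drafter_dists)
print(chosen)
print(acceptance_length([5, 7, 9, 2], [5, 7, 1, 2]))
```

Here the math drafter is selected because its distributions are more peaked, and the draft is accepted up to the first token where it disagrees with the target.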
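[5] Text Data Integration cs.CL | cs.IR
Md Ataur Rahman, Dimitris Sacharidis, Oscar Romero, Sergi Nadal
TL;DR: This chapter discusses the importance of textual data integration, pointing out that current data integration systems focus mainly on structured data and overlook the rich knowledge contained in unstructured text. It first makes the case for integrating textual data, then analyzes the challenges, the state of the art, and the open problems.
Details
Motivation: To address the challenges raised by data heterogeneity, in particular how to store and process unstructured textual data effectively, and thereby compensate for the reliance of current data integration systems on structured data.
Result: The chapter reports no quantitative experiments or benchmarks. It is a survey-style analysis that summarizes the challenges, existing techniques, and open problems of textual data integration.
Insight: The contribution is the emphasis on bringing unstructured textual data into the data integration pipeline, which points to future research directions such as more efficient methods for processing and integrating text so that the knowledge it contains can be fully used.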
Abstract: Data comes in many forms. From a shallow perspective, it can be viewed as being either in structured (e.g., as a relation, as key-value pairs) or unstructured (e.g., text, image) formats. So far, machines have been fairly good at processing and reasoning over structured data that follows a precise schema. However, the heterogeneity of data poses a significant challenge for how well diverse categories of data can be meaningfully stored and processed. Data Integration, a crucial part of the data engineering pipeline, addresses this by combining disparate data sources and providing unified data access to end-users. Until now, most data integration systems have focused on combining only structured data sources. Nevertheless, unstructured data (a.k.a. free text) also contains a plethora of knowledge waiting to be utilized. Thus, in this chapter, we first make the case for the integration of textual data, and then present its challenges, state of the art, and open problems.
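[6] Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning cs.CL | cs.AI
Hossein Salemi, Jitin Krishnan, Hemant Purohit
TL;DR: This paper proposes a scalable method that uses prompt knowledge tuning to reduce the social-factor bias of large language models in online behavior analytics. Using the context and goal of a social media message, it enriches instruction prompts with social-attribution knowledge, which improves model performance and lowers social-attribution bias on zero-shot classification tasks.
Details
Motivation: Large language models may ignore social attribution (personal and situational causality) in their reasoning, which leads to biased responses in social contexts. This study explores and mitigates that bias.
Result: On intent detection and theme detection for social media in the disaster domain, the method reduces the social-attribution bias of open-source LLMs such as Llama3, Mistral, and Gemma and improves their performance. The experiments account for variability in disaster types and in languages.
Insight: The novelty is to inject the user's goal (to infer personal causality) and the message context (to infer situational causality) into the prompt as knowledge. Prompt engineering is thus grounded in social attribution theory, achieving debiasing and performance gains in the zero-shot setting.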
Abstract: Attribution theory explains how individuals interpret and attribute others’ behavior in a social context by employing personal (dispositional) and impersonal (situational) causality. Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts. However, the extent to which LLMs utilize these causal attributions in their reasoning remains underexplored. Although using reasoning paradigms, such as Chain-of-Thought (CoT), has shown promising results in various tasks, ignoring social attribution in reasoning could lead to biased responses by LLMs in social contexts. In this study, we investigate the impact of incorporating a user’s goal as knowledge to infer dispositional causality and message context to infer situational causality on LLM performance. To this end, we introduce a scalable method to mitigate such biases by enriching the instruction prompts for LLMs with two prompt aids using social-attribution knowledge, based on the context and goal of a social media message. This method improves the model performance while reducing the social-attribution bias of the LLM in the reasoning on zero-shot classification tasks for behavior analytics applications. We empirically show the benefits of our method across two tasks (intent detection and theme detection on social media in the disaster domain) when considering the variability of disaster types and multiple languages of social media. Our experiments highlight the biases of three open-source LLMs: Llama3, Mistral, and Gemma, toward social attribution, and show the effectiveness of our mitigation strategies.
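[7] Story2Proposal: A Scaffold for Structured Scientific Paper Writing cs.CL
Zhuoyang Qian, Wei Shi, Xu Lin, Li Ling, Meng Luo
TL;DR: This paper proposes Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured scientific manuscript. By coordinating multiple agents under a persistent shared visual contract, the system addresses problems common in existing language-model generation pipelines: structural drift, missing figures or tables, and cross-section inconsistencies.
Details
Motivation: Existing language-model approaches to scientific paper generation usually validate only after generation, which tends to produce structural drift, missing figures or tables, and cross-section inconsistencies. This paper uses a contract-driven multi-agent framework to enforce consistency of structure and visual elements during generation.
Result: On tasks derived from the Jericho research corpus, with GPT, Claude, Gemini, and Qwen as backbones, Story2Proposal obtains an expert evaluation score of 6.145 versus 3.963 for DirectChat. Against the structured generation baseline Fars, Story2Proposal averages 5.705 versus 5.197, indicating improved structural consistency and visual alignment.
Insight: The novelty is the "contract" as the central coordination mechanism: it organizes architect, writer, refiner, and renderer agents and is updated dynamically in a generate-evaluate-adapt loop. This gives structured document generation a new, verifiable collaborative framework that keeps narrative, evidence, and visual elements aligned across the document lifecycle.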
Abstract: Generating scientific manuscripts requires maintaining alignment between narrative reasoning, experimental evidence, and visual artifacts across the document lifecycle. Existing language-model generation pipelines rely on unconstrained text synthesis with validation applied only after generation, often producing structural drift, missing figures or tables, and cross-section inconsistencies. We introduce Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract. The system organizes architect, writer, refiner, and renderer agents around a contract state that tracks section structure and registered visual elements, while evaluation agents supply feedback in a generate-evaluate-adapt loop that updates the contract during generation. Experiments on tasks derived from the Jericho research corpus show that Story2Proposal achieved an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182) across GPT, Claude, Gemini, and Qwen backbones. Compared with the structured generation baseline Fars, Story2Proposal obtained an average score of 5.705 versus 5.197, indicating improved structural consistency and visual alignment.
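[8] Learning to Predict Future-Aligned Research Proposals with Language Models cs.CL
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu
TL;DR: This paper proposes a new way to evaluate the quality of LLM-generated research proposals by reframing proposal generation as a time-sliced scientific forecasting problem and introducing the Future Alignment Score as an automatic metric. The authors build a time-consistent dataset and train models to identify research gaps and borrow inspiration. Experiments show that future-aligned tuning improves both future alignment and proposal quality, and a code agent is used to implement model-generated proposals as a practical validation.
Details
Motivation: To address the difficulty of automatically evaluating the novelty and soundness of LLM-generated research proposals, and to avoid relying on costly large-scale human evaluation.
Result: Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves the overall Future Alignment Score by up to 10.6% over unaligned baselines, and human evaluation by domain experts confirms the improvement in proposal quality. Two generated proposals implemented with a code agent yield a 4.17% accuracy gain on the MATH benchmark and consistent improvements from a novel model-merging method, respectively.
Insight: The paper reframes research proposal generation as a verifiable scientific forecasting problem and introduces the Future Alignment Score, based on retrieval and LLM semantic scoring, as an automatic metric. Training on synthesized reasoning traces teaches models gap identification and inspiration borrowing, which makes generated proposals more forward-looking and practical.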
Abstract: Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the cutoff. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining a 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
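[9] Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning cs.CL
Maximilian Mordig, Andreas Opedal, Weiyang Liu, Bernhard Schölkopf
TL;DR: This study systematically evaluates curriculum learning in LLM post-training for deductive reasoning tasks, using synthetic arithmetic and logical benchmarks. Difficulty-based training sequences show no advantage over random sampling in accuracy or response length, which challenges the practical utility of curriculum learning in post-training.
Details
Motivation: To examine the actual effect of curriculum learning on compositional reasoning tasks in LLM post-training, and to test whether the "easy-to-hard" training intuition holds for deductive reasoning.
Result: Across multiple model families and curriculum schedules, difficulty-based sequencing shows no robust advantage over random sampling under either supervised fine-tuning or reinforcement learning. The finding holds for both accuracy and response length.
Insight: The paper quantifies reasoning complexity directly through synthetic benchmarks rather than relying on surface-level proxies. Its analysis suggests that, for deductive reasoning, the role of the specific ordering of training examples in achieving compositional generalization may be overestimated, which provides useful empirical guidance for post-training strategies.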
Abstract: Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.
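The two data orderings compared in this abstract can be sketched as below. This is only an illustration of the comparison, not the study's setup: the `difficulty` field (taken here to be a count of reasoning steps), the toy examples, and both function names are assumptions.

```python
import random


def curriculum_order(examples):
    """Easy-to-hard schedule: sort training examples by increasing difficulty."""
    return sorted(examples, key=lambda ex: ex["difficulty"])


def random_order(examples, seed=0):
    """Baseline schedule: a seeded uniform shuffle of the same examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy examples; "difficulty" stands in for the number of reasoning steps (assumed).
examples = [
    {"prompt": "a+b+c+d", "difficulty": 3},
    {"prompt": "a+b", "difficulty": 1},
    {"prompt": "a+b+c", "difficulty": 2},
    {"prompt": "a+b+c+d+e", "difficulty": 4},
]

easy_to_hard = curriculum_order(examples)
baseline = random_order(examples)
print([ex["difficulty"] for ex in easy_to_hard])
print([ex["difficulty"] for ex in baseline])
```

Both schedules present exactly the same examples; only the order differs, which is the single variable the study isolates.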
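[10] SACRED: A Faithful Annotated Multimedia Multimodal Multilingual Dataset for Classifying Connectedness Types in Online Spirituality cs.CL | cs.MM
Qinghao Guan, Yuchen Pan, Donghao Li, Zishi Zhang, Yiyang Chen
TL;DR: This paper introduces SACRED, a high-quality multimodal, multimedia dataset for classifying connectedness types in online spirituality communication, and provides the first multimodal annotation of such communication. The study evaluates 13 popular LLMs and traditional methods on the dataset, finding that DeepSeek-V3 performs well on text classification while GPT-4o-mini leads on vision tasks.
Details
Motivation: Social science research struggles to analyze abstract concepts such as spirituality in depth because high-quality multimodal datasets that are available online are lacking. The paper aims to provide a reliable data foundation for this research.
Result: DeepSeek-V3 reaches 79.19% accuracy on the Quora test set. On the vision tasks, GPT-4o-mini achieves a 63.99% F1 score, surpassing the other models.
Insight: The contribution is SACRED, the first multimodal dataset focused on online spirituality communication with guaranteed annotation faithfulness. The study also identifies a new type of connectedness that is valuable for communication research, and it offers a new data and method baseline for multimodal analysis of abstract concepts.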
Abstract: In religion and theology studies, spirituality has garnered significant research attention because it not only transcends culture but also offers a unique experience to each individual. However, social scientists often rely on limited datasets, which are basically unavailable online. In this study, we collaborated with social scientists to develop a high-quality multimedia multi-modal dataset, SACRED, in which the faithfulness of classification is guaranteed. Using SACRED, we evaluated the performance of 13 popular LLMs as well as traditional rule-based and fine-tuned approaches. The results suggest that the DeepSeek-V3 model performs well in classifying such abstract concepts (i.e., 79.19% accuracy on the Quora test set), and the GPT-4o-mini model surpassed the other models in the vision tasks (63.99% F1 score). Purportedly, this is the first annotated multi-modal dataset from online spirituality communication. Our study also found a new type of connectedness which is valuable for communication science studies.
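[11] PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering cs.CL
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai
TL;DR: This paper proposes PubMed Reasoner, an agent for evidence-grounded biomedical question answering. It refines retrieval dynamically in three stages: self-critic query refinement to improve the PubMed search terms, reflective batch retrieval until sufficient evidence has been gathered, and generation of evidence-grounded answers with explicit citations. With a GPT-4o backbone, it achieves accuracy above human experts on the PubMedQA benchmark.
Details
Motivation: Existing retrieval-augmented QA systems lack a mechanism for refining queries iteratively, while self-reflection methods start only after full retrieval has finished. This makes it hard to guarantee accurate answers backed by current, verifiable evidence. The paper aims to build a trustworthy biomedical QA system that reasons dynamically and retrieves authoritative evidence iteratively.
Result: The system reaches 78.32% accuracy on PubMedQA, slightly above human expert level, and shows consistent gains on the MMLU Clinical Knowledge subset. LLM-as-judge evaluations also prefer its responses to the baselines on reasoning soundness, evidence grounding, clinical relevance, and trustworthiness.
Insight: The novelty is the tight coupling of retrieval with dynamic reasoning. Under a "retrieval-first" reasoning paradigm, self-critique and reflection are applied during retrieval rather than after it, so that queries and evidence collection are refined iteratively. This offers a reusable architecture for trustworthy, authoritative-source question answering in specialized domains with controlled compute cost.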
Abstract: Trustworthy biomedical question answering (QA) systems must not only provide accurate answers but also justify them with current, verifiable evidence. Retrieval-augmented approaches partially address this gap but lack mechanisms to iteratively refine poor queries, whereas self-reflection methods kick in only after full retrieval is completed. In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata) retrieval; reflective retrieval processes articles in batches until sufficient evidence is gathered; and evidence-grounded response generation produces answers with explicit citations. PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge. Moreover, LLM-as-judge evaluations prefer our responses across: reasoning soundness, evidence grounding, clinical relevance, and trustworthiness. By orchestrating retrieval-first reasoning over authoritative sources, our approach provides practical assistance to clinicians and biomedical researchers while controlling compute and token costs.
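[12] Improving Attributed Long-form Question Answering with Intent Awareness cs.CL | cs.AI
Xinran Zhao, Aakanksha Naik, Jay DeYoung, Joseph Chee Chang, Jena D. Hwang
TL;DR: This paper proposes improving the quality of generated knowledge-intensive long-form reports by strengthening LLMs' understanding of authors' writing intents. The authors develop structured, tag-based schemes to extract implicit intents to write or cite, and show that these intents both improve zero-shot generation in large models and enable high-quality synthetic data for fine-tuning small models.
Details
Motivation: Although LLMs are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in writing those documents. The authors hypothesize that strengthening a model's intent awareness can significantly improve the quality of generated long-form reports.
Result: Experiments show improved performance across a range of challenging scientific report generation tasks, with average absolute gains over baselines of +2.9 points for large models and +12.3 points for small models. The analysis also shows that intent awareness improves the model's citation usage and substantially improves report readability.
Insight: The core contribution is a structured tagging scheme that explicitly models and extracts the implicit intents behind text generation (such as writing purpose and citation motivation) and injects them into generation as additional information. Beyond offering intent awareness as a new route to better generation quality, it opens a path to using intent information to produce high-quality synthetic data for fine-tuning small models efficiently.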
Abstract: Large language models (LLMs) are increasingly being used to generate comprehensive, knowledge-intensive reports. However, while these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model’s intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit underlying implicit intents to write or cite. We demonstrate that these extracted intents enhance both zero-shot generation capabilities in LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments reveal improved performance across various challenging scientific report generation tasks, with an average improvement of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.
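[13] Multi-Agent Dialectical Refinement for Enhanced Argument Classification cs.CL | cs.AI
Jakub Bąba, Jarosław A. Chudziak
TL;DR: This paper proposes MAD-ACC, a multi-agent debate framework for argument component classification. By introducing a debate among a proponent, an opponent, and a judge, it addresses the structural ambiguity that LLMs face in argument mining and the sycophancy that arises in self-correction. It improves classification performance markedly without domain-specific training and yields an interpretable decision process.
Details
Motivation: Traditional argument mining methods depend on expensive domain-specific fine-tuning. LLMs are a training-free alternative, but they struggle with structural ambiguity (such as distinguishing claims from premises) and with sycophancy in single-agent self-correction.
Result: On the UKP Student Essays corpus, MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines without domain-specific training.
Insight: The core novelty is bringing a dialectical debate mechanism into a classification task: adversarial interaction among multiple agents (proponent, opponent, judge) exposes logical nuances and resolves uncertainty. This improves performance and, through human-readable debate transcripts, gives classification decisions a transparent and explainable basis. It is an example of combining structured reasoning with LLM capabilities.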
Abstract: Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike “black-box” classifiers, MAD-ACC’s dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.
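[14] Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models cs.CL | cs.CR | cs.LG
Duanyi Yao, Changyue Li, Zhicong Huang, Cheng Hong, Songze Li
TL;DR: This paper proposes "Hidden Ads", a new backdoor attack on vision-language models (VLMs). The attack uses the natural behavior of users seeking recommendations (uploading an image with particular semantic content and asking a question) as its trigger, so that the backdoored model gives a correct answer while seamlessly inserting an attacker-specified promotional slogan.
Details
Motivation: VLMs are increasingly deployed in consumer recommendation applications (products, dining, services). The paper exploits this trend to design a stealthy and practical backdoor attack that injects unauthorized advertisements, avoiding the unnaturalness of traditional backdoors triggered by artificial patterns such as pixel patches or special tokens.
Result: Experiments on three VLM architectures show that Hidden Ads achieves high injection efficacy with near-zero false positives and without harming accuracy on the original task. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. The evaluated defenses (instruction-based filtering and clean fine-tuning) both fail to remove the backdoor without significant degradation of model utility.
Insight: The core novelty is using "natural user behavior" (semantic content plus a recommendation-seeking question) as the backdoor trigger instead of an artificial pattern, which makes the attack stealthier and more practical. The multi-tier threat framework (hard prompt injection, soft prompt optimization, supervised fine-tuning) and the data generation pipeline, which uses teacher-VLM chain-of-thought reasoning to build natural trigger-slogan associations, offer a systematic way to study and realize semantic backdoor attacks of this kind.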
Abstract: Vision-Language Models (VLMs) are increasingly deployed in consumer applications where users seek recommendations about products, dining, and services. We introduce Hidden Ads, a new class of backdoor attacks that exploit this recommendation-seeking behavior to inject unauthorized advertisements. Unlike traditional pattern-triggered backdoors that rely on artificial triggers such as pixel patches or special tokens, Hidden Ads activates on natural user behaviors: when users upload images containing semantic content of interest (e.g., food, cars, animals) and ask recommendation-seeking questions, the backdoored model provides correct, helpful answers while seamlessly appending attacker-specified promotional slogans. This design preserves model utility and produces natural-sounding injections, making the attack practical for real-world deployment in consumer-facing recommendation services. We propose a multi-tier threat framework to systematically evaluate Hidden Ads across three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Our poisoned data generation pipeline uses teacher VLM-generated chain-of-thought reasoning to create natural trigger–slogan associations across multiple semantic domains. Experiments on three VLM architectures demonstrate that Hidden Ads achieves high injection efficacy with near-zero false positives while maintaining task accuracy. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. We evaluate defenses including instruction-based filtering and clean fine-tuning, finding that both fail to remove the backdoor without causing significant utility degradation.
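[15] Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents cs.CL | cs.AI
Rodney Jehu-Appiah
TL;DR: This paper proposes "Umwelt engineering" as a new layer in the design of linguistic agents, which shapes an agent's reasoning by designing its linguistic cognitive environment. Two experiments test the claim that changing the medium of reasoning (for example, through vocabulary constraints) changes cognition itself. Experiment 1 finds that the "No-Have" constraint significantly improves ethical reasoning, classification, and epistemic calibration, while the effect of the "E-Prime" constraint varies by model. Experiment 2 shows that although no single constrained agent outperforms the control, a three-agent ensemble (specifically one that includes the counterfactual agent) achieves 100% ground-truth coverage.
Details
Motivation: Current agent design focuses on prompt engineering and context engineering and overlooks the effect of the linguistic environment itself on cognition. The paper proposes "Umwelt engineering" as a third, upstream design layer that shapes an agent's reasoning more fundamentally by deliberately designing its linguistic cognitive environment.
Result: In Experiment 1, the No-Have constraint improves ethical reasoning by 19.1 percentage points, classification by 6.5 percentage points, and epistemic calibration by 7.4 percentage points, with 92.8% constraint compliance. The effect of the E-Prime constraint differs markedly across models (cross-model correlations reach r = -0.75). In Experiment 2, a three-agent ensemble of constrained agents achieves 100% ground-truth coverage on 17 debugging problems versus 88.2% for the control, and every successful ensemble contains the counterfactual agent.
Insight: The contribution is "Umwelt engineering" as a new design layer that restructures and diversifies an agent's cognition by constraining the linguistic medium (for example, restricting vocabulary). The experiments reveal two mechanisms, cognitive restructuring and cognitive diversification, suggesting environment design as a new route to improving AI system performance.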
Abstract: I propose Umwelt engineering – the deliberate design of the linguistic cognitive environment – as a third layer in the agent design stack, upstream of both prompt and context engineering. Two experiments test the thesis that altering the medium of reasoning alters cognition itself. In Experiment 1, three language models reason under two vocabulary constraints – No-Have (eliminating possessive “to have”) and E-Prime (eliminating “to be”) – across seven tasks (N=4,470 trials). No-Have improves ethical reasoning by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp, while achieving 92.8% constraint compliance. E-Prime shows dramatic but model-dependent effects: cross-model correlations reach r = -0.75. In Experiment 2, 16 linguistically constrained agents tackle 17 debugging problems. No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control. A permutation test confirms only 8% of random 3-agent subsets achieve full coverage, and every successful subset contains the counterfactual agent. Two mechanisms emerge: cognitive restructuring and cognitive diversification. The primary limitation is the absence of an active control matching constraint prompt elaborateness.
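The ensemble-coverage analysis in Experiment 2 can be illustrated with a small enumeration. The agent names and per-agent solved sets below are invented toy data, and `full_coverage_subsets` is a hypothetical helper, not the paper's code. Two pieces of arithmetic follow from the abstract's own numbers: 16 agents give C(16, 3) = 560 possible 3-agent subsets, and 88.2% coverage on 17 problems is consistent with 15 solved problems.

```python
from itertools import combinations
from math import comb


def full_coverage_subsets(solved, n_problems, k=3):
    """Return every k-agent subset whose union of solved problems covers all of them.

    solved maps an agent name to the set of problem indices that agent solved.
    """
    all_problems = set(range(n_problems))
    winners = []
    for combo in combinations(sorted(solved), k):
        covered = set().union(*(solved[agent] for agent in combo))
        if covered == all_problems:
            winners.append(combo)
    return winners


# Toy data (assumed): 4 agents and 5 problems; only one agent solves problem 4.
solved = {
    "control": {0, 1, 2, 3},
    "counterfactual": {4},
    "agent_a": {0, 1},
    "agent_b": {2, 3},
}
winners = full_coverage_subsets(solved, n_problems=5)
print(winners)
print(all("counterfactual" in combo for combo in winners))

# Arithmetic implied by the abstract's figures.
print(comb(16, 3))              # number of 3-agent subsets of 16 agents
print(round(15 / 17 * 100, 1))  # coverage percentage for 15 of 17 problems
```

In the toy data, an agent that is weak on its own still appears in every fully covering subset because it is the only one that solves the remaining problem, which mirrors the diversification effect the abstract reports.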
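[16] PRBench: End-to-end Paper Reproduction in Physics Research cs.CL | hep-lat | hep-ph | physics.comp-ph | physics.optics
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu
TL;DR: This paper proposes PRBench, a benchmark for evaluating how well AI agents can perform end-to-end paper reproduction in physics research. It contains 30 expert-designed tasks spanning 11 subfields of physics, each requiring the agent to understand a paper's methodology, implement the algorithms from scratch, and produce quantitative results that match the original. Among the coding agents evaluated, the best one (OpenAI Codex) reaches a mean score of only 34%; all agents perform poorly on data accuracy and code correctness, and the end-to-end reproduction success rate is zero for every agent.
Details
Motivation: Whether LLM-powered AI agents can reliably perform end-to-end reproduction from real scientific papers is an open question. The paper addresses it by providing a rigorous evaluation benchmark for autonomous scientific research.
Result: Among the coding agents evaluated on PRBench, the best one, OpenAI Codex (powered by GPT-5.3-Codex), achieves a mean overall score of only 34%. All agents have a zero end-to-end callback success rate, with particularly poor performance on data accuracy and code correctness.
Insight: The contribution is an end-to-end reproduction benchmark designed by domain experts from real papers, combined with an agentified assessment pipeline that analyzes scientific reasoning and execution systematically. The study exposes systematic failure modes of current AI agents on complex scientific tasks (such as formula implementation and debugging numerical simulations), which gives clear directions for improvement.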
Abstract: AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
[17] Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation? cs.CL
Yuxuan Gu, Lunjun Liu, Xiaocheng Feng, Kun Zhu, Weihong Zhong
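TL;DR: This paper proposes a benchmark built on the longitudinal research trajectories of 217 AI researchers to evaluate whether large language models (LLMs) can simulate human cognitive processes or merely imitate surface-level behaviors. The benchmark uses a cross-domain, temporal-shift generalization setting and introduces a multidimensional cognitive alignment metric to assess individual-level cognitive consistency. Through systematic evaluation of SOTA LLMs and various enhancement techniques, the paper presents a first empirical study of how well current LLMs simulate human cognition and how far that ability can be improved.
Details
Motivation: Addresses a central question in AI: whether LLMs can genuinely simulate human cognition rather than merely imitate surface-level behaviors. Existing datasets rely on synthetic reasoning traces or population-level aggregation and therefore fail to capture authentic individual cognitive patterns.
Result: A systematic evaluation of state-of-the-art LLMs and various enhancement techniques yields first-stage empirical results on how well LLMs simulate human cognition and measures how much existing techniques improve that ability.
Insight: The novelty lies in building a benchmark from real researchers' research trajectories (as an externalized representation of cognitive processes), designing a cross-domain, temporal-shift generalization setting that separates transfer of cognitive patterns from behavioral imitation, and proposing an individual-level multidimensional cognitive alignment metric.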
Abstract: An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors, while existing datasets suffer from either synthetic reasoning traces or population-level aggregation, failing to capture authentic individual cognitive patterns. We introduce a benchmark grounded in the longitudinal research trajectories of 217 researchers across diverse domains of artificial intelligence, where each author’s scientific publications serve as an externalized representation of their cognitive processes. To distinguish whether LLMs transfer cognitive patterns or merely imitate behaviors, our benchmark deliberately employs a cross-domain, temporal-shift generalization setting. A multidimensional cognitive alignment metric is further proposed to assess individual-level cognitive consistency. Through systematic evaluation of state-of-the-art LLMs and various enhancement techniques, we provide a first-stage empirical study on the questions: (1) How well do current LLMs simulate human cognition? and (2) How far can existing techniques enhance these capabilities?
[18] KAT-Coder-V2 Technical Report cs.CL | cs.LG
Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao
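TL;DR: KAT-Coder-V2 is an agentic coding model developed by the KwaiKAT team at Kuaishou. It follows a "Specialize-then-Unify" paradigm: agentic coding is decomposed into five expert domains (SWE, WebCoding, Terminal, WebSearch, and General), each trained separately with supervised fine-tuning and reinforcement learning, then consolidated into a single model via on-policy distillation. The work also builds KwaiEnv, an infrastructure supporting tens of thousands of concurrent sandbox instances, and proposes MCLA for stabilizing MoE reinforcement learning and Tree Training for accelerating training on tree-structured trajectories.
Details
Motivation: Addresses the difficulty existing coding agents have in combining specialization with generality across complex, diverse tasks (software engineering, web coding, terminal operation, and so on), aiming to build a unified, high-performance multi-domain coding assistant.
Result: Strong results on several benchmarks: 79.6% on SWE-bench Verified (close to Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), first place in all three frontend aesthetics scenarios, and strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9).
Insight: Contributions include: 1) the "Specialize-then-Unify" model-building paradigm, which balances specialization with a unified model; 2) the scalable KwaiEnv training infrastructure; 3) MCLA for stabilizing MoE reinforcement learning; 4) Tree Training, which removes redundant computation over tree-structured trajectories for up to 6.2x training speedup. These system and training innovations are instructive for building complex multi-task agents.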
Abstract: We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a “Specialize-then-Unify” paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.
[19] Understanding Teacher Revisions of Large Language Model-Generated Feedback cs.CL | cs.CY
Conrad Borchers, Luiz Rodrigues, Newarney Torrezão da Costa, Cleon Xavier, Rafael Ferreira Mello
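TL;DR: This study analyzes how teachers revise LLM-generated feedback. About 80% of AI feedback is accepted without modification; the feedback that teachers do edit tends to be longer and is then shortened by them. Machine learning models using only the AI feedback text predict the need for revision fairly well (AUC=0.75), and revisions often shift feedback from detailed explanation toward concise correction.
Details
Motivation: To examine how teachers revise LLM-generated feedback, in order to evaluate AI classroom tools in practice and improve feedback system design.
Result: On 1,349 feedback instances, models predicting revision reach an AUC of 0.75; qualitative analysis shows that revisions mainly simplify the feedback.
Insight: Reveals the heterogeneity of teachers' revision behavior and the degree to which AI feedback is accepted, providing empirical grounding for designing feedback systems that better fit teachers' needs.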
Abstract: Large language models (LLMs) increasingly generate formative feedback for students, yet little is known about how teachers revise this feedback before it reaches learners. Teachers’ revisions shape what students receive, making revision practices central to evaluating AI classroom tools. We analyze a dataset of 1,349 instances of AI-generated feedback and corresponding teacher-edited explanations from 117 teachers. We examine (i) textual characteristics associated with teacher revisions, (ii) whether revision decisions can be predicted from the AI feedback text, and (iii) how revisions change the pedagogical type of feedback delivered. First, we find that teachers accept AI feedback without modification in about 80% of cases, while edited feedback tends to be significantly longer and subsequently shortened by teachers. Editing behavior varies substantially across teachers: about 50% never edit AI feedback, and only about 10% edit more than two-thirds of feedback instances. Second, machine learning models trained only on the AI feedback text as input features, using sentence embeddings, achieve fair performance in identifying which feedback will be revised (AUC=0.75). Third, qualitative coding shows that when revisions occur, teachers often simplify AI-generated feedback, shifting it away from high-information explanations toward more concise, corrective forms. Together, these findings characterize how teachers engage with AI-generated feedback in practice and highlight opportunities to design feedback systems that better align with teacher priorities while reducing unnecessary editing effort.
[20] Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning cs.CL
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner
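TL;DR: This paper proposes a counterfactual multi-agent diagnostic framework inspired by clinician training. Counterfactual case editing and the Counterfactual Probability Gap quantify how strongly individual findings support a diagnosis and guide multi-round specialist discussion, improving the accuracy and interpretability of LLM-based clinical diagnostic systems.
Details
Motivation: Existing LLM-based diagnostic agents usually reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses, which limits the interpretability of their recommendations. The paper addresses this by emulating the counterfactual questioning used in clinician training.
Result: Across three diagnostic benchmarks and seven LLMs, the method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains on complex and ambiguous cases. Human evaluation further indicates that the framework produces more clinically useful, reliable, and coherent reasoning.
Insight: The core innovation is to bring counterfactual reasoning (counterfactual case editing and the Counterfactual Probability Gap) systematically into a multi-agent diagnostic framework, making hypothesis testing and evidence assessment explicit and quantifiable, which strengthens the reliability and interpretability of AI clinical decision support.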
Abstract: Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning (e.g., asking how a diagnosis would change if a key symptom were absent or altered) to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.
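The abstract describes the Counterfactual Probability Gap only in words, so the following is a minimal sketch under stated assumptions rather than the authors' definition. It assumes the gap for a finding is the model's confidence in a diagnosis on the full case minus its confidence after that finding is removed. The `toy_confidence` function and its weights are hypothetical stand-ins for an LLM confidence estimate.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical signature: (set of findings, diagnosis) -> confidence in [0, 1].
Confidence = Callable[[FrozenSet[str], str], float]


def counterfactual_probability_gap(
    findings: Iterable[str], diagnosis: str, confidence: Confidence
) -> Dict[str, float]:
    """For each finding, measure how much confidence drops when it is removed.

    Assumed definition: gap(f) = conf(all findings) - conf(all findings minus f).
    A large positive gap means the finding strongly supports the diagnosis.
    """
    full_case = frozenset(findings)
    base = confidence(full_case, diagnosis)
    gaps = {}
    for finding in full_case:
        edited_case = full_case - {finding}  # counterfactual case edit
        gaps[finding] = base - confidence(edited_case, diagnosis)
    return gaps


def toy_confidence(findings: FrozenSet[str], diagnosis: str) -> float:
    """Illustrative additive scorer; the weights are invented for this sketch."""
    weights = {
        "pneumonia": {"fever": 0.2, "productive cough": 0.3, "ankle swelling": 0.0}
    }
    score = 0.1 + sum(weights[diagnosis].get(f, 0.0) for f in findings)
    return min(score, 1.0)


if __name__ == "__main__":
    case = ["fever", "productive cough", "ankle swelling"]
    for finding, gap in counterfactual_probability_gap(
        case, "pneumonia", toy_confidence
    ).items():
        print(f"{finding}: {gap:+.2f}")
```

In a multi-agent setting, findings with a near-zero gap would be natural targets for agents to challenge a hypothesis as unsupported, which is the role the abstract assigns to these signals.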
[21] Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3 cs.CL
Natapong Nitarach
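TL;DR: Through experiments in the AIMO 3 competition, this paper finds that in mathematical reasoning, model capability is the dominant factor in performance, far outweighing inference-time optimization such as a diverse prompt mixer. Majority voting does improve performance, but every attempt to reduce error correlation by assigning different reasoning strategies failed: high-temperature sampling already decorrelates errors sufficiently, and weaker prompt strategies lower per-attempt accuracy.
Details
Motivation: Correlated errors limit the effective sample size in majority voting. The paper explores whether assigning different reasoning strategies to voters (a Diverse Prompt Mixer) can reduce error correlation and improve mathematical reasoning.
Result: Experiments on 50 IMO-level problems from the AIMO 3 competition, covering 3 models and 23+ experiments, show that every inference-time optimization intervention tried fails; the model capability gap (17 points) dominates performance, with an effect an order of magnitude larger than inference-time optimization.
Insight: The paper systematically tests a diverse prompt mixer for decorrelating errors, but its core insight is that model capability itself is the key determinant of performance and inference-time optimization yields limited gains. This argues for investing in a stronger model before complex optimization strategies when resources are limited.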
Abstract: Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO 3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Across a 17-point model capability gap and every inference-time optimization we tried, model capability dominates by an order of magnitude.
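The trade-off between per-attempt accuracy and error correlation can be illustrated with a toy model. This is a sketch under my own simplifying assumptions, not the paper's analysis: answers are treated as simply right or wrong, and with probability `rho` all voters share a single outcome while otherwise they vote independently. All numbers are illustrative.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent voters are correct.

    Each voter is correct with probability p; n is assumed to be odd.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def mixed_accuracy(p: float, n: int, rho: float) -> float:
    """Majority-vote accuracy under a simple two-regime correlation model.

    With probability rho every voter copies one shared draw (accuracy p);
    with probability 1 - rho the voters are independent.
    """
    return rho * p + (1 - rho) * majority_accuracy(p, n)


if __name__ == "__main__":
    n = 15
    baseline = mixed_accuracy(p=0.60, n=n, rho=0.5)
    # A weaker prompt strategy: less correlation, but lower per-attempt accuracy.
    diversified = mixed_accuracy(p=0.55, n=n, rho=0.3)
    print(f"baseline    : {baseline:.3f}")
    print(f"diversified : {diversified:.3f}")
```

Under these assumed numbers the diversified setting scores lower than the baseline, which mirrors the abstract's observation that weaker prompts can cost more in per-attempt accuracy than they gain from decorrelation.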
[22] What can LLMs tell us about the mechanisms behind polarity illusions in humans? Experiments across model scales and training steps cs.CL
Dario Paape
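TL;DR: This study uses the Pythia model suite to examine how two polarity illusions (the NPI illusion and the depth charge illusion) appear in large language models. As model size increases, the NPI illusion weakens and eventually disappears, while the depth charge illusion grows stronger. The findings bear on human sentence processing: "rational inference" mechanisms may not be needed to explain polarity illusions, and the paper proposes a theoretical synthesis based on construction grammar.
Details
Motivation: To investigate whether and how LLMs exhibit two polarity illusions known from human language processing, in order to understand the cognitive mechanisms behind them and test whether "rational inference" is required to explain human linguistic illusions.
Result: Experiments across model sizes and training steps of the Pythia suite show that the NPI illusion weakens and eventually disappears as model size increases, whereas the depth charge illusion strengthens.
Insight: The novelty lies in using model-scaling experiments to show that the two polarity illusions depend on model size in opposite ways, which challenges the necessity of "rational inference" mechanisms in human language processing; the paper proposes a construction-grammar-based framework to explain model and human behavior jointly.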
Abstract: I use the Pythia scaling suite (Biderman et al. 2023) to investigate if and how two well-known polarity illusions, the NPI illusion and the depth charge illusion, arise in LLMs. The NPI illusion becomes weaker and ultimately disappears as model size increases, while the depth charge illusion becomes stronger in larger models. The results have implications for human sentence processing: it may not be necessary to assume “rational inference” mechanisms that convert ill-formed sentences into well-formed ones to explain polarity illusions, given that LLMs cannot plausibly engage in this kind of reasoning, especially at the implicit level of next-token prediction. On the other hand, shallow, “good enough” processing and/or partial grammaticalization of prescriptively ungrammatical structures may both occur in LLMs. I propose a synthesis of different theoretical accounts that is rooted in the basic tenets of construction grammar.
[23] DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis cs.CL
Hua Li, Yingying Li, Xiaobin Feng, Xinyi Fu, Lifeng Dong
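TL;DR: This paper proposes DongYuan, an LLM-based framework for diagnosing spleen-stomach disorders in integrative Chinese and Western medicine. It fills the data gap with three high-quality datasets (SSDF-Syndrome, SSDF-Dialogue, SSDF-PD), develops SSDF-Core, a core diagnostic model that acquires integrative reasoning through two-stage training with supervised fine-tuning and direct preference optimization, and adds SSDF-Navigator, a pluggable consultation navigation model that optimizes clinical inquiry strategies. It also establishes SSDF-Bench, a comprehensive evaluation benchmark for integrative diagnosis of spleen-stomach disorders.
Details
Motivation: Addresses three challenges in diagnosing spleen-stomach disorders with integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models that effectively integrate the reasoning of TCM syndrome differentiation with Western disease diagnosis, and the lack of a standardized evaluation benchmark.
Result: Experiments show that the core diagnostic model SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench.
Insight: The novelty lies in building dedicated high-quality ICWM datasets and an evaluation benchmark for spleen-stomach disorders, and in a framework that pairs a two-stage (SFT+DPO) core diagnostic model with a pluggable consultation navigation model, aiming to address reasoning integration and clinical workflow optimization in ICWM diagnosis systematically.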
Abstract: The clinical burden of spleen-stomach disorders is substantial. While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable of effectively integrating the reasoning logic of traditional Chinese medicine (TCM) syndrome differentiation with that of Western medical (WM) disease diagnosis, and the shortage of a standardized evaluation benchmark. To address these interrelated challenges, we propose DongYuan, an ICWM spleen-stomach diagnostic framework. Specifically, three ICWM datasets (SSDF-Syndrome, SSDF-Dialogue, and SSDF-PD) were curated to fill the gap in high-quality data for spleen-stomach disorders. We then developed SSDF-Core, a core diagnostic LLM that acquires robust ICWM reasoning capabilities through a two-stage training regimen of supervised fine-tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies. Additionally, we established SSDF-Bench, a comprehensive evaluation benchmark focused on ICWM diagnosis of spleen-stomach disorders. Experimental results demonstrate that SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench. DongYuan lays a solid methodological foundation and provides practical technical references for the future development of intelligent ICWM diagnostic systems.
[24] Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization cs.CL | cs.LG
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai
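TL;DR: Kernel-Smith is a unified framework for generating high-performance GPU kernels and operators. It combines a stable, evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. The framework maintains a population of executable candidates and improves them iteratively using an archive of high-performing, diverse programs plus structured execution feedback on compilation, correctness, and speedup. On the Triton and MACA backends, its models achieve SOTA performance on KernelBench, and the workflow has been applied to production systems.
Details
Motivation: Addresses the limitations of conventional LLMs used as one-shot generators for high-performance GPU kernels, and the challenge of reliable, evolvable optimization across different hardware backends.
Result: On the NVIDIA Triton backend, Kernel-Smith-235B-RL achieves SOTA overall performance on KernelBench with the best average speedup ratio, outperforming frontier proprietary models such as Gemini-3.0-pro and Claude-4.6-opus. On the MetaX MACA backend, Kernel-Smith-MACA-30B surpasses large-scale models such as DeepSeek-V3.2-think and Qwen3-235B-2507-think.
Insight: The core innovation is the tight coupling of evolutionary search with LLM training: 1) a stable, evaluation-driven evolutionary agent that uses execution feedback for iterative optimization; 2) an evolution-oriented post-training recipe that converts long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals, so the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. This supports seamless adaptation across heterogeneous platforms (such as NVIDIA and MetaX GPUs) and lets optimization results transfer to production deployment.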
Abstract: We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
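The agent loop described above (a population of candidates, an archive of strong and diverse programs, and structured feedback on compilation, correctness, and speedup) can be sketched in miniature. This is a toy illustration under loud assumptions, not Kernel-Smith's implementation: candidates are plain integers standing in for a kernel block size, the archive keeps the best candidate per niche (a simple quality-diversity scheme I chose for illustration), and `toy_evaluate`, `toy_mutate`, and `toy_descriptor` are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Feedback:
    """Structured execution feedback for one candidate."""
    compiled: bool
    correct: bool
    speedup: float


def evolve(
    seed: int,
    mutate: Callable[[int, random.Random], int],
    evaluate: Callable[[int], Feedback],
    descriptor: Callable[[int], str],
    steps: int,
    rng: random.Random,
) -> Dict[str, Tuple[int, Feedback]]:
    """Keep the fastest correct candidate per niche and mutate archive members."""
    archive: Dict[str, Tuple[int, Feedback]] = {}

    def consider(candidate: int) -> None:
        feedback = evaluate(candidate)
        if not (feedback.compiled and feedback.correct):
            return  # only correctness-preserving candidates are retained
        niche = descriptor(candidate)
        if niche not in archive or feedback.speedup > archive[niche][1].speedup:
            archive[niche] = (candidate, feedback)

    consider(seed)
    for _ in range(steps):
        if not archive:
            break
        parent, _ = rng.choice(list(archive.values()))
        consider(mutate(parent, rng))
    return archive


def toy_evaluate(block: int) -> Feedback:
    """Invented scoring: block size 128 is 'fastest'; sizes above 1024 are 'incorrect'."""
    if block < 1:
        return Feedback(compiled=False, correct=False, speedup=0.0)
    speedup = 1.0 / (1.0 + abs(math.log2(block) - 7.0))
    return Feedback(compiled=True, correct=block <= 1024, speedup=speedup)


def toy_mutate(block: int, rng: random.Random) -> int:
    """Double or halve the block size."""
    return rng.choice([block * 2, block // 2])


def toy_descriptor(block: int) -> str:
    """Two niches so the archive retains some diversity."""
    return "small" if block < 128 else "large"


if __name__ == "__main__":
    result = evolve(16, toy_mutate, toy_evaluate, toy_descriptor, 50, random.Random(0))
    for niche, (block, fb) in sorted(result.items()):
        print(niche, block, round(fb.speedup, 3))
```

In the real system the mutation step would be an LLM proposing a revised kernel and the evaluation would be a backend-specific service; the sketch only shows how the archive and feedback gate which candidates survive.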
[25] Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design cs.CL | cs.AI
Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu
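TL;DR: This paper presents Marco DeepResearch, a deep research agent built around a verification-centric design. It integrates explicit verification at three levels (QA data synthesis, trajectory construction, and test-time scaling) to counter the error propagation and performance degradation that existing deep research agents suffer on long-horizon tasks for lack of reliable verification.
Details
Motivation: Existing deep research agents lack explicit verification mechanisms during training and inference, so errors propagate through data synthesis, trajectory construction, and test-time scaling, limiting their long-horizon reliability on open-ended research tasks.
Result: On challenging benchmarks such as BrowseComp and BrowseComp-ZH, Marco DeepResearch significantly outperforms 8B-scale deep research agents; under a maximum budget of 600 tool calls, it even surpasses or approaches 30B-scale agents such as Tongyi DeepResearch-30B.
Insight: The core innovation is integrating verification systematically across the whole agent pipeline: verification controls the difficulty and correctness of synthesized QA data, verification patterns are injected into training trajectories, and the agent itself serves as a verifier at inference time. This offers a reusable framework for building efficient, reliable autonomous research systems.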
Abstract: Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: We introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; (2) Trajectory Construction: We design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) Test-time scaling: We use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B.
[26] Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification cs.CL | cs.AI | cs.MA
Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan
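TL;DR: This paper proposes PROClaim, a courtroom-style multi-agent debate framework for verifying controversial claims. By combining Progressive RAG (P-RAG) with role-switching, it recasts verification as structured adversarial deliberation, targeting the hallucinations and shallow reasoning of LLMs in high-stakes claim verification.
Details
Motivation: LLMs are unreliable for high-stakes claim verification because of hallucinations and shallow reasoning. Existing retrieval-augmented generation (RAG) and multi-agent debate (MAD) methods are limited by one-pass retrieval and unstructured debate dynamics, so a more robust verification framework is needed.
Result: In zero-shot evaluation on the Check-COVID benchmark, PROClaim reaches 81.7% accuracy, 10.0 percentage points above standard multi-agent debate, with Progressive RAG (P-RAG) contributing the main gain (+7.5 percentage points).
Insight: The novelty lies in recasting claim verification as structured adversarial deliberation, combining Progressive RAG that dynamically expands the evidence pool, role-switching (e.g., Plaintiff, Defense, Judge), evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation. These design choices mitigate systematic biases and provide a robust basis for reliable claim verification.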
Abstract: Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
[27] EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces cs.CL
Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Boudin
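TL;DR: This paper introduces EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. The core idea is to use commented-out text in LaTeX to capture formulations that authors discarded or replaced while writing, then align it with the final text and apply LLM-based filtering to build genuine revision pairs. The dataset contains 578k validated revision pairs and comes with a human-annotated benchmark for revision detection, supporting research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
Details
Motivation: Scientific writing is an iterative process that generates rich revision traces, but publicly available resources usually expose only final or near-final versions of papers, which limits empirical study of revision behaviour and evaluation of LLMs for scientific writing.
Result: Starting from 1.28M candidate revision pairs, the pipeline yields 578k validated revision pairs, along with a human-annotated benchmark for revision detection.
Insight: The novelty lies in being the first to use commented-out LaTeX text as a source of early revision traces, a new way to obtain authentic early-stage revision data that complements existing resources focused on late-stage revisions or synthetic rewrites. Viewed objectively, the method makes clever use of a common practice in scientific writing (commenting out old text) to build a large-scale, high-quality dataset of value for studying the writing process and developing assistive tools.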
Abstract: Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
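The extraction idea (align commented-out segments with nearby final text, then filter) can be sketched as follows. This is a rough illustration, not the authors' pipeline: it pairs each run of `%`-prefixed lines with the paragraph that immediately follows, and it substitutes a `difflib` similarity band for the paper's LLM-based filtering. The `low` and `high` thresholds are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def is_comment(line: str) -> bool:
    """A full-line LaTeX comment starts with '%' after optional whitespace."""
    return line.lstrip().startswith("%")


def extract_candidate_pairs(
    tex: str, low: float = 0.3, high: float = 0.95
) -> List[Tuple[str, str, float]]:
    """Return (commented_text, following_text, similarity) candidate pairs.

    Pairs are kept only when similarity lies in [low, high]: below `low` the
    comment is probably unrelated (e.g. a TODO), above `high` it is nearly
    identical to the final text.
    """
    lines = tex.splitlines()
    pairs = []
    i = 0
    while i < len(lines):
        if not is_comment(lines[i]):
            i += 1
            continue
        # Collect a run of consecutive comment lines.
        block = []
        while i < len(lines) and is_comment(lines[i]):
            block.append(lines[i].lstrip().lstrip("%").strip())
            i += 1
        old = " ".join(part for part in block if part)
        # Collect the non-empty, non-comment lines that directly follow.
        final = []
        j = i
        while j < len(lines) and lines[j].strip() and not is_comment(lines[j]):
            final.append(lines[j].strip())
            j += 1
        new = " ".join(final)
        if old and new:
            ratio = SequenceMatcher(None, old, new).ratio()
            if low <= ratio <= high:
                pairs.append((old, new, ratio))
    return pairs


if __name__ == "__main__":
    sample = (
        "Intro text.\n\n"
        "% We propose a new method for parsing trees.\n"
        "We propose a novel method for parsing dependency trees.\n\n"
        "% TODO fix figure\n"
        "\\section{Results}\n"
    )
    for old, new, ratio in extract_candidate_pairs(sample):
        print(f"{ratio:.2f} | {old} -> {new}")
```

A character-level similarity band is a crude proxy; it cannot tell a genuine rewrite from a coincidentally similar sentence, which is presumably why the paper adds an LLM filter on top of alignment.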
[28] GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum cs.CL
Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng
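TL;DR: This paper proposes GraphWalker, a new agentic knowledge graph question answering framework that tackles training data scarcity and reasoning generalization through automated trajectory synthesis and stage-wise fine-tuning. It adopts a two-stage supervised fine-tuning paradigm: the agent is first trained on diverse trajectories synthesized from constrained random walks to establish a broad exploration prior, then fine-tuned on a small set of expert trajectories to develop reflection and error recovery; combined with lightweight reinforcement learning, it reaches state-of-the-art performance.
Details
Motivation: Existing agentic KGQA methods struggle with training data scarcity and reasoning generalization: prompting-based methods lack training for autonomous navigation, while existing training pipelines usually confine reasoning to predefined trajectories.
Result: State-of-the-art performance on the CWQ and WebQSP benchmarks; additional results on GrailQA and the authors' GraphWalkerBench indicate improved generalization to out-of-distribution reasoning paths.
Insight: The novelty lies in using automated trajectory synthesis to generate structurally diverse training data that establishes a broad exploration prior, and in a stage-wise fine-tuning paradigm (broad exploration first, expert refinement second) that unlocks a higher performance ceiling for lightweight reinforcement learning, improving the agent's autonomous navigation and generalizable reasoning.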
Abstract: Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes GraphWalker, a novel agentic KGQA framework that addresses these challenges through Automated Trajectory Synthesis and Stage-wise Fine-tuning. GraphWalker adopts a two-stage SFT training paradigm: first, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at https://github.com/XuShuwenn/GraphWalker
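A constrained random walk over a knowledge graph, the source of the first-stage trajectories, might look like the sketch below. The abstract does not specify the constraints, so the two used here (a hop limit and no revisiting of entities) are assumptions, and the tiny graph is invented for illustration.

```python
import random
from typing import Dict, List, Tuple

# Adjacency form: head entity -> list of (relation, tail entity).
KG = Dict[str, List[Tuple[str, str]]]
Triple = Tuple[str, str, str]


def constrained_random_walk(
    kg: KG, start: str, max_hops: int, rng: random.Random
) -> List[Triple]:
    """Sample one path of (head, relation, tail) triples starting at `start`.

    Assumed constraints: at most `max_hops` steps, and no entity is visited twice.
    The walk stops early when no unvisited neighbour remains.
    """
    path: List[Triple] = []
    visited = {start}
    current = start
    for _ in range(max_hops):
        candidates = [(r, t) for r, t in kg.get(current, []) if t not in visited]
        if not candidates:
            break
        relation, tail = rng.choice(candidates)
        path.append((current, relation, tail))
        visited.add(tail)
        current = tail
    return path


TOY_KG: KG = {
    "Paris": [("capital_of", "France"), ("located_on", "Seine")],
    "France": [("member_of", "EU"), ("has_capital", "Paris")],
    "EU": [("has_member", "France")],
}


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(constrained_random_walk(TOY_KG, "Paris", max_hops=3, rng=rng))
```

Each sampled path could then be turned into a question and an agent trajectory; how GraphWalker does that conversion is not described in the abstract and is not attempted here.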
[29] EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models cs.CL
Shuang Zhou, Kai Yu, Zaifu Zhan, Huixue Zhou, Min Zeng
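TL;DR: This paper proposes EpiScreen, a low-cost, effective approach that fine-tunes large language models on clinical notes from electronic health records for early epilepsy detection. It performs well on two datasets and markedly improves diagnostic accuracy in a clinician-AI collaboration setting.
Details
Motivation: Epilepsy and psychogenic non-epileptic seizures present similarly but require very different treatment strategies; misdiagnosis is common and leads to diagnostic delays and unnecessary treatment. Video-electroencephalography is the gold standard but is costly and hard to access, so a low-cost, timely screening tool is needed.
Result: AUC reaches up to 0.875 on the MIMIC-IV dataset and 0.980 on a private University of Minnesota cohort. In the clinician-AI collaboration setting, EpiScreen-assisted neurologists outperform unaided experts by up to 10.9%.
Insight: The novelty lies in using routinely collected EHR clinical notes as the data source, building a low-cost screening tool by fine-tuning LLMs, and validating its benefit in clinical collaboration. This offers a feasible early-diagnosis option for resource-limited regions and shows the potential of LLMs for specific medical text classification tasks.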
Abstract: Epilepsy and psychogenic non-epileptic seizures often present with similar seizure-like manifestations but require fundamentally different management strategies. Misdiagnosis is common and can lead to prolonged diagnostic delays, unnecessary treatments, and substantial patient morbidity. Although prolonged video-electroencephalography is the diagnostic gold standard, its high cost and limited accessibility hinder timely diagnosis. Here, we developed a low-cost, effective approach, EpiScreen, for early epilepsy detection by utilizing routinely collected clinical notes from electronic health records. Through fine-tuning large language models on labeled notes, EpiScreen achieved an AUC of up to 0.875 on the MIMIC-IV dataset and 0.980 on a private cohort of the University of Minnesota. In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%. Overall, this study demonstrates that EpiScreen supports early epilepsy detection, facilitating timely and cost-effective screening that may reduce diagnostic delays and avoid unnecessary interventions, particularly in resource-limited regions.
cs.CV [Back]
[30] An Annotation-to-Detection Framework for Autonomous and Robust Vine Trunk Localization in the Field by Mobile Agricultural Robots cs.CV | cs.RO
Dimitrios Chatziparaschis, Elia Scudiero, Brent Sams, Konstantinos Karydis
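TL;DR: This paper proposes an annotation-to-detection framework for training a robust multi-modal detector from limited, partially labeled training data. It uses cross-modal annotation transfer and an early-stage sensor fusion pipeline together with a multi-stage detection architecture, and is validated on vine trunk detection in new vineyard settings with varying lighting conditions and crop densities.
Details
Motivation: The dynamic, heterogeneous nature of agricultural environments makes object detection and localization difficult, especially for autonomous mobile robots that survey previously unseen unstructured environments; at the same time, real-time detection systems increasingly need to work without large-scale manually labeled real-world datasets.
Result: Integrated with a customized multi-modal LiDAR and Odometry Mapping (LOAM) algorithm and a tree association module, the system achieves high-performance trunk localization, identifying over 70% of trees in a single traversal with a mean distance error below 0.37 m.
Insight: The framework's novelty is achieving robust detection from limited initial annotations through multi-modal, incremental annotation and training. Viewed objectively, its cross-modal annotation transfer and early sensor fusion offer an effective way to improve multi-modal detection under limited data, showing potential for real-world and near-ground agricultural applications.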
Abstract: The dynamic and heterogeneous nature of agricultural fields presents significant challenges for object detection and localization, particularly for autonomous mobile robots that are tasked with surveying previously unseen unstructured environments. Concurrently, there is a growing need for real-time detection systems that do not depend on large-scale manually labeled real-world datasets. In this work, we introduce a comprehensive annotation-to-detection framework designed to train a robust multi-modal detector using limited and partially labeled training data. The proposed methodology incorporates cross-modal annotation transfer and an early-stage sensor fusion pipeline, which, in conjunction with a multi-stage detection architecture, effectively trains and enhances the system’s multi-modal detection capabilities. The effectiveness of the framework was demonstrated through vine trunk detection in novel vineyard settings that featured diverse lighting conditions and varying crop densities to validate performance. When integrated with a customized multi-modal LiDAR and Odometry Mapping (LOAM) algorithm and a tree association module, the system demonstrated high-performance trunk localization, successfully identifying over 70% of trees in a single traversal with a mean distance error of less than 0.37m. The results reveal that by leveraging multi-modal, incremental-stage annotation and training, the proposed framework achieves robust detection performance regardless of limited starting annotations, showcasing its potential for real-world and near-ground agricultural applications.
[31] A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data cs.CV | cs.AI
Aram Ansary Ogholbake, Hannah Choi, Spencer Brandenburg, Alyssa Antuna, Zahraa Al-Sharshahi
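TL;DR: This paper proposes AttentionMixer, a multimodal deep learning framework that combines head CT imaging with clinical metadata for brain edema classification. A self-supervised Vision Transformer autoencoder encodes the CT volumes, a cross-attention mechanism uses the clinical metadata as keys and values to dynamically modulate the imaging features, and a lightweight MLP-Mixer refines the features before classification.
Details
Motivation: Addresses how to fuse spatially rich head CT imaging with clinical metadata that carries complementary context (age, laboratory values, and so on) effectively and interpretably for brain edema detection, rather than ignoring the metadata or naively concatenating the sources as existing methods do.
Result: In five-fold cross-validation on a brain HCT dataset with expert annotations, AttentionMixer outperforms HCT-only, metadata-only, and prior multimodal baselines (accuracy 87.32%, precision 92.10%, F1-score 85.37%, AUC 94.14%), reaching SOTA level.
Insight: The novelty lies in a structured, interpretable multimodal fusion framework: 1) a self-supervised ViT-AE++ encodes CT images without requiring large labeled datasets; 2) a cross-attention module with clinical metadata as keys and values and imaging features as queries modulates features dynamically based on patient-specific context; 3) a lightweight MLP-Mixer models global dependencies with low parameter overhead; 4) a learnable embedding handles missing metadata, improving robustness to real-world clinical data quality.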
Abstract: We propose AttentionMixer, a unified deep learning framework for multimodal detection of brain edema that combines structural head CT (HCT) with routine clinical metadata. While HCT provides rich spatial information, clinical variables such as age, laboratory values, and scan timing capture complementary context that might be ignored or naively concatenated. AttentionMixer is designed to fuse these heterogeneous sources in a principled and efficient manner. HCT volumes are first encoded using a self-supervised Vision Transformer Autoencoder (ViT-AE++), without requiring large labeled datasets. Clinical metadata are mapped into the same feature space and used as keys and values in a cross-attention module, where HCT-derived feature vector serves as queries. This cross-attention fusion allows the network to dynamically modulate imaging features based on patient-specific context and provides an interpretable mechanism for multimodal integration. A lightweight MLP-Mixer then refines the fused representation before final classification, enabling global dependency modeling with substantially reduced parameter overhead. Missing or incomplete metadata are handled via a learnable embedding, promoting robustness to real-world clinical data quality. We evaluate AttentionMixer on a curated brain HCT cohort with expert edema annotations using five-fold cross-validation. Compared with strong HCT-only, metadata-only, and prior multimodal baselines, AttentionMixer achieves superior performance (accuracy 87.32%, precision 92.10%, F1-score 85.37%, AUC 94.14%). Ablation studies confirm the benefit of both cross-attention and MLP-Mixer refinement, and permutation-based metadata importance analysis highlights clinically meaningful variables driving predictions. These results demonstrate that structured, interpretable multimodal fusion can substantially improve edema detection in clinical practice.
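The cross-attention fusion step, with the imaging feature as the query and metadata embeddings as keys and values, can be shown with a small pure-Python sketch. It is single-head scaled dot-product attention without the learned projections, residual connections, or missing-value embedding a real model would have, and the vectors are invented; it illustrates the mechanism only, not AttentionMixer itself.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    query: Vector, keys: List[Vector], values: List[Vector]
) -> Tuple[Vector, List[float]]:
    """Single-head scaled dot-product attention for one query vector.

    query  : imaging feature vector (here standing in for the HCT embedding)
    keys   : one embedding per metadata field
    values : one embedding per metadata field (same order as keys)
    Returns the attended vector and the per-field attention weights.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    fused = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return fused, weights


if __name__ == "__main__":
    image_feature = [1.0, 0.0]
    metadata_keys = [[1.0, 0.0], [0.0, 1.0]]     # e.g. age, lab value (hypothetical)
    metadata_values = [[10.0, 0.0], [0.0, 10.0]]
    fused, weights = cross_attention(image_feature, metadata_keys, metadata_values)
    print("weights:", [round(w, 3) for w in weights])
    print("fused  :", [round(x, 3) for x in fused])
```

The attention weights are what make this kind of fusion inspectable: they show which metadata fields the imaging feature attended to for a given patient.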
[32] The Nonverbal Gap: Toward Affective Computer Vision for Safer and More Equitable Online Dating cs.CV | cs.AI
Ratna Kandala, Niva Manchanda, Akshata Kishore Moharir
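TL;DR: This paper argues that online dating platforms lack nonverbal cues (gaze, facial expression, posture, response timing), creating a communication gap that particularly affects women's safety, and calls on the computer vision community to use affective computing tools (facial action unit detection, gaze estimation, engagement modeling) to close it. The authors propose a fairness-first research agenda with four capability areas: real-time discomfort detection, engagement asymmetry modeling, consent-aware interaction design, and longitudinal interaction summarization. They stress the need for purpose-built datasets, fairness evaluation across groups, and on-device processing to protect privacy.
Details
Motivation: Online dating has become the dominant way romantic relationships begin, but current platforms strip away the nonverbal cues people rely on, creating a communication gap with disproportionate safety consequences for women. The computer vision community already has the relevant affective computing tools yet has neglected the dating domain, which is both a technical opportunity and a moral responsibility.
Result: The paper reports no quantitative results or benchmarks; it proposes a research agenda grounded in existing CV methods and social psychology, intended as a foundation for future research.
Insight: The novelty lies in systematically applying computer vision techniques (such as affect recognition and engagement modeling) to online dating safety, with emphasis on fairness first, cross-group evaluation, and on-device processing so that affective data is not misused as a surveillance tool; the paper calls on the WICV community to establish this domain as a research priority before commercial deployment.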
Abstract: Online dating has become the dominant way romantic relationships begin, yet current platforms strip the nonverbal cues (gaze, facial expression, body posture, response timing) that humans rely on to signal comfort, disinterest, and consent, creating a communication gap with disproportionate safety consequences for women. We argue that this gap represents both a technical opportunity and a moral responsibility for the computer vision community, which has developed the affective tools (facial action unit detection, gaze estimation, engagement modeling, and multimodal affect recognition) needed to begin addressing it, yet has largely ignored the dating domain as a research context. We propose a fairness-first research agenda organized around four capability areas: real-time discomfort detection, engagement asymmetry modeling between partners, consent-aware interaction design, and longitudinal interaction summarization, each grounded in established CV methodology and motivated by the social psychology of romantic communication. We argue that responsible pursuit of this agenda requires purpose-built datasets collected under dyadic consent protocols, fairness evaluation disaggregated across race, gender identity, neurotype, and cultural background, and architectural commitments to on-device processing that prevent affective data from becoming platform surveillance infrastructure. This vision paper calls on the WICV community, whose members are uniquely positioned to understand both the technical opportunity and the human stakes, to establish online dating safety as a first-class research domain before commercial deployment outpaces ethical deliberation.
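[33] Contextual inference from single objects in Vision-Language models cs.CV | cs.AI
Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig
TL;DR: Through behavioral and mechanistic analysis, this paper investigates the ability of vision-language models (VLMs) to infer scene context from a single object. Single objects support inference of both fine-grained scene category and coarse superordinate category (indoor vs. outdoor); performance is modulated by object properties, and the inference levels are partially dissociable. Mechanistically, object representations that remain stable after background removal favor contextual inference, and scene and superordinate information are encoded in fundamentally different ways.
Details
Motivation: The work aims to understand how VLMs infer scene context from single objects, a question that bears directly on model robustness but remains underexplored in VLMs.
Result: Experiments show that VLMs achieve above-chance scene inference from a single object on a masked background, that performance is modulated by object properties, and that the degree of coupling among scene, superordinate, and object identity predictions differs markedly across models.
Insight: The paper shows that contextual inference in VLMs is more complex than accuracy suggests: the inference levels are partially dissociable, and scene information is encoded in image tokens throughout the network, whereas superordinate information emerges only late or not at all. Stable object representations are a key mechanism behind successful inference.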
Abstract: How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We found that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate inference at the others, and the degree of coupling differs markedly across models. Mechanistically, object representations that remain stable when background context is removed are more predictive of successful contextual inference. Scene and superordinate schemas are grounded in fundamentally different ways: scene identity is encoded in image tokens throughout the network, while superordinate information emerges only late or not at all. Together, these results reveal that the organization of contextual inference in VLMs is more complex than accuracy alone suggests, with behavioral and mechanistic signatures
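[34] Distilled Large Language Model-Driven Dynamic Sparse Expert Activation Mechanism cs.CV | cs.AI
Qinghui Chen, Zekai Zhang, Zaigui Zhang, Kai Zhang, Dagang Li
TL;DR: This paper proposes DS-MoE, a framework in which a distilled large language model drives a sparse mixture-of-experts mechanism. Through text-guided dynamic routing and lightweight multi-scale comprehension, it dynamically aligns textual semantics with defect-specific visual patterns to address high inter-class similarity, extreme scale variation, and limited computational budgets in visual recognition.
Details
Motivation: The work targets the challenges of visual recognition on real-world data: high inter-class similarity, extreme scale variation, and limited computational budgets. Existing methods often rely on rigid fusion mechanisms and heavy annotation pipelines, which leads to sub-optimal generalization.
Result: Extensive experiments on PCB, aluminum foil, and mold defect datasets show that the framework outperforms existing pure vision models. DS-MoE improves mAP@0.5:0.95 over YOLOv8/YOLOX by +13.9, +1.4, and +2.0 percentage points on the BBMP, aluminum, and PCB datasets, respectively, while also improving precision and recall.
Insight: The novelty is combining a distilled LLM with a sparse MoE architecture for text-guided dynamic expert activation that resolves inter-class ambiguity, and using a lightweight MobileSAM encoder for real-time inference while preserving multi-scale detail. Together these form a cross-modal approach to dynamic sparse routing.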
Abstract: High inter-class similarity, extreme scale variation, and limited computational budgets hinder reliable visual recognition across diverse real-world data. Existing vision-centric and cross-modal approaches often rely on rigid fusion mechanisms and heavy annotation pipelines, leading to sub-optimal generalization. We propose the Distilled Large Language Model (LLM)-Driven Sparse Mixture-of-Experts (DS-MoE) framework, which integrates text-guided dynamic routing and lightweight multi-scale comprehension. The DS-MoE framework dynamically aligns textual semantics with defect-specific visual patterns through a sparse MoE architecture, where task-relevant experts are adaptively activated based on semantic relevance, resolving inter-class ambiguity. A lightweight MobileSAM encoder enables real-time inference while preserving multi-scale defect details. Extensive experiments on PCB, aluminum foil, and mold defect datasets demonstrate that our framework achieves superior performance compared to existing pure vision models. DS-MoE surpasses YOLOv8/YOLOX with gains of +13.9, +1.4, and +2.0 pp mAP@0.5:0.95 on BBMP, aluminum, and PCB, respectively, while also improving precision and recall.
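The idea of activating only task-relevant experts based on semantic relevance can be pictured with a minimal top-k routing sketch. It assumes, purely for illustration, that each expert has a key vector and that relevance is a dot product with a text embedding; the function name `top_k_route`, the toy vectors, and `k=2` are hypothetical and are not taken from the paper.

```python
import math

def top_k_route(text_embedding, expert_keys, k=2):
    """Return {expert_index: weight} for the k most relevant experts.

    Relevance is a dot product between the text embedding and each
    expert's key vector (an illustrative assumption, not the paper's router).
    """
    scores = [sum(t * w for t, w in zip(text_embedding, key)) for key in expert_keys]
    # Keep only the k highest-scoring experts; all others stay inactive.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (shifted for numerical stability).
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}

# Toy example: four experts, two-dimensional embeddings.
weights = top_k_route([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]], k=2)
print(weights)
```

Only the selected experts would be evaluated downstream, which is where the compute saving of a sparse mixture comes from.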
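[35] Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning cs.CV | cs.AI
Guangfu Guo, Xiaoqian Lu, Yue Feng, Mingming Sun
TL;DR: This paper proposes Structural Sequential Visual Chain-of-Thought (SSV-CoT), a method that addresses a limitation of current multimodal large language models: they encode images as static visual prefixes and rely on text-only reasoning, lacking goal-driven, adaptive visual access. Inspired by human visual perception, the method first uses a saliency map to identify and organize the key visual regions relevant to the question, then reasons sequentially from primary to secondary cues, and is trained end-to-end.
Details
Motivation: Visual encoding in current multimodal LLMs is static and lacks the selective, sequential, goal-driven attention of human vision, which limits performance on complex visual reasoning tasks.
Result: Experiments on several visual reasoning benchmarks show performance gains, supporting the effectiveness of structured, sequential visual cognition.
Insight: The core innovation is to explicitly structure and serialize visual reasoning, mimicking the progressive human process of moving from primary to secondary cues, and to learn this end-to-end without region-level annotations or external tools.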
Abstract: Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. This method is trained end-to-end, using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.
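[36] SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model cs.CV | cs.AI | cs.CL
Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie
TL;DR: SleepVLM is an explainable, rule-grounded automated sleep staging system built on a vision-language model (VLM). It converts polysomnography (PSG) waveforms into images and generates clinician-readable rationales based on the American Academy of Sleep Medicine (AASM) scoring criteria, improving transparency and auditability while matching state-of-the-art performance.
Details
Motivation: Automated sleep staging systems reach expert-level accuracy, but their clinical adoption is hindered by the lack of an auditable reasoning process. This work addresses the problem by building a model that produces understandable explanations grounded in explicit clinical rules, to increase trust in clinical workflows.
Result: Cohen's kappa reaches 0.767 on the held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art (SOTA) performance. In expert evaluation, mean scores for factual accuracy, evidence comprehensiveness, and logical coherence all exceed 4.0/5.0.
Insight: The main innovation is combining a VLM framework with explicit clinical rules (the AASM criteria) to generate explainable sleep staging decisions. The approach (waveform-perceptual pre-training and rule-grounded supervised fine-tuning) suggests a way to unite high performance with high interpretability in medical AI. The newly released expert-annotated dataset, MASS-EX, supports further research in interpretable sleep medicine.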
Abstract: While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) designed to stage sleep from multi-channel polysomnography (PSG) waveform images while generating clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen’s kappa scores of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Expert evaluations further validated the quality of the model’s reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
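For readers unfamiliar with the agreement metric reported here, the following is a minimal sketch of Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The function name and the toy stage labels are illustrative assumptions; they are not the paper's data or evaluation code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of epochs where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)

# Hypothetical per-epoch stages from a model and a human scorer.
model = ["W", "N1", "N2", "N2", "N3", "REM", "N2", "W"]
expert = ["W", "N2", "N2", "N2", "N3", "REM", "N1", "W"]
kappa = cohens_kappa(model, expert)
print(round(kappa, 3))
```

In this toy case the raters agree on 6 of 8 epochs (0.75) and chance agreement is 0.25, so kappa is lower than the raw agreement rate.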
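[37] Language-Conditioned World Modeling for Visual Navigation cs.CV | cs.AI | cs.RO
Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen
TL;DR: This paper studies language-conditioned visual navigation (LCVN), in which an embodied agent follows a natural language instruction based only on an initial egocentric observation. To address language grounding without goal images, the authors build the LCVN Dataset and propose two model families: one combines a diffusion-based world model (LCVN-WM) with a latent-space actor-critic agent (LCVN-AC), and the other is an autoregressive multimodal architecture (LCVN-Uni) that jointly predicts actions and future observations. Experiments show complementary strengths: the former yields more temporally coherent rollouts, while the latter generalizes better.
Details
Motivation: The work tackles the grounding challenge of how language instructions shape perception and continuous control when no goal image is available, and builds a benchmark dataset that supports reproducible research.
Result: On the LCVN Dataset, which contains 39,016 trajectories and 117,048 human-verified instructions, the two families show different advantages: LCVN-WM with LCVN-AC provides more temporally coherent rollouts, and LCVN-Uni generalizes better to unseen environments.
Insight: Contributions include the large-scale LCVN Dataset and two complementary frameworks that jointly model language grounding, future-state prediction, and action generation. More broadly, the study highlights the value of studying language grounding, imagination, and policy learning together in a unified task setting, providing a concrete basis for language-conditioned world models.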
Abstract: We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at https://github.com/F1y1113/LCVN.
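[38] Steering Sparse Autoencoder Latents to Control Dynamic Head Pruning in Vision Transformers (Student Abstract) cs.CV | cs.AI | cs.LG
Yousung Lee, Dongsoo Har
TL;DR: This paper proposes a framework that integrates Sparse Autoencoders (SAEs) with dynamic pruning. It decomposes the final-layer residual embedding of a ViT into interpretable, controllable sparse latents and amplifies those latents with different strategies to alter pruning decisions, making dynamic head pruning in Vision Transformers more controllable.
Details
Motivation: Existing dynamic head pruning policies for Vision Transformers are often hard to interpret and control. The paper introduces SAEs to improve the interpretability and controllability of pruning.
Result: Per-class steering reveals compact, class-specific head subsets. For the bowl class, for example, accuracy rises from 76% to 82% while head usage falls from 0.72 to 0.33 (via heads h2 and h5), indicating that sparse latent features enable class-specific control of dynamic pruning.
Insight: The novelty lies in combining SAEs with dynamic pruning and using the interpretability of sparse latent features to steer pruning decisions, bridging pruning efficiency and mechanistic interpretability in ViTs and suggesting a direction for controllable, efficient model design.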
Abstract: Dynamic head pruning in Vision Transformers (ViTs) improves efficiency by removing redundant attention heads, but existing pruning policies are often difficult to interpret and control. In this work, we propose a novel framework by integrating Sparse Autoencoders (SAEs) with dynamic pruning, leveraging their ability to disentangle dense embeddings into interpretable and controllable sparse latents. Specifically, we train an SAE on the final-layer residual embedding of the ViT and amplify the sparse latents with different strategies to alter pruning decisions. Among them, per-class steering reveals compact, class-specific head subsets that preserve accuracy. For example, bowl improves accuracy (76% to 82%) while reducing head usage (0.72 to 0.33) via heads h2 and h5. These results show that sparse latent features enable class-specific control of dynamic pruning, effectively bridging pruning efficiency and mechanistic interpretability in ViTs.
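[39] Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly Detection cs.CV
Yang Liu, Boan Chen, Yuanyuan Meng, Jing Liu, Zhengliang Guo
TL;DR: The paper proposes MSG-Flow, a privacy-preserving video anomaly detection method based on human skeleton data that improves detection by decomposing continuous motion into hierarchical semantic representations.
Details
Motivation: Existing skeleton-based methods model continuous motion trajectories monolithically and fail to capture the hierarchical nature of human activities, which are composed of discrete semantic primitives and fine-grained kinematic details. This reduces discriminability when anomalies appear at different abstraction levels.
Result: On the HR-ShanghaiTech and HR-UBnormal benchmarks, MSG-Flow reaches 88.1% and 75.8% AUC, respectively, achieving state-of-the-art (SOTA) performance.
Insight: The novelty lies in decomposing skeleton-based video anomaly detection into hierarchical motion semantics modeling: a vector-quantized variational autoencoder discretizes continuous motion, an autoregressive Transformer models semantic-level temporal dependencies, and a conditional normalizing flow captures detail-level pose variations, giving a more effective representation of human activity.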
Abstract: As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, the need for privacy-preserving approaches to understand human activities in physical environments has become paramount. Video anomaly detection is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner, failing to capture the hierarchical nature of human activities composed of discrete semantic primitives and fine-grained kinematic details, which leads to reduced discriminability when anomalies manifest at different abstraction levels. In this regard, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow), which decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs a vector-quantized variational autoencoder to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations. Extensive experiments on benchmarks (HR-ShanghaiTech & HR-UBnormal) demonstrate that MSG-Flow achieves state-of-the-art performance with 88.1% and 75.8% AUC, respectively.
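The discretization step can be illustrated with a minimal sketch of nearest-codebook lookup, the operation at the heart of vector quantization. This shows only the inference-time assignment of feature vectors to code indices, not VQ-VAE training; the function name, the two-entry codebook, and the toy motion features are hypothetical and are not from the paper.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]

# Hypothetical codebook of two motion primitives and three encoded motion segments.
codebook = [[0.0, 0.0], [1.0, 1.0]]
segments = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]
codes = quantize(segments, codebook)
print(codes)
```

The resulting index sequence is the kind of discrete token stream that a sequence model, such as the autoregressive Transformer mentioned in the abstract, could then operate on.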
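[40] Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models cs.CV
Qionghao Huang, Can Hu
TL;DR: This survey systematically reviews the evolution of remote sensing scene classification from traditional handcrafted features to modern generative AI models, covering classical texture descriptors, machine learning classifiers, the deep learning revolution, and current state-of-the-art foundation models and generative AI approaches.
Details
Motivation: The survey aims to trace the methodological evolution of remote sensing scene classification comprehensively, analyze the paradigm shift from traditional methods to AI systems, and lay out the technical trajectory and future research directions for modern Earth observation applications.
Result: As a survey, the paper reports no quantitative results of its own, but it summarizes the strong performance of various methods (including foundation models and vision-language systems) in zero-shot and few-shot learning scenarios.
Insight: The contribution is a systematic synthesis of the full technical lineage from traditional methods to generative AI, together with forward-looking research priorities such as hyperspectral and multi-temporal analysis, cross-domain generalization methods, and standardized evaluation protocols.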
Abstract: Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks. The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.
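[41] Tiny-ViT: A Compact Vision Transformer for Efficient and Explainable Potato Leaf Disease Classification cs.CV | cs.AI
Shakil Mia, Umme Habiba, Urmi Akter, SK Rezwana Quadir Raisa, Jeba Maliha
TL;DR: This paper proposes Tiny-ViT, a compact Vision Transformer for efficient and explainable potato leaf disease classification. On a dataset with three classes (Early Blight, Late Blight, and Healthy leaves), the model reaches a test accuracy of 99.85%, outperforming several baselines, with low computational cost and potential for real-time use.
Details
Motivation: Early and precise identification of potato leaf diseases such as Early Blight and Late Blight is critical to crop health and yield. Traditional detection methods are time-consuming and error-prone, so automated, efficient methods are needed, particularly in resource-constrained settings.
Result: On the three-class dataset (Early Blight, Late Blight, and Healthy leaves), Tiny-ViT achieves a test accuracy of 99.85% and a mean cross-validation accuracy of 99.82%, outperforming baselines such as DEIT Small, SWIN Tiny, and MobileViT XS. Its Matthews Correlation Coefficient (MCC) is 0.9990 with a confidence interval (CI) of [0.9980, 0.9995], indicating high reliability and generalization.
Insight: The main contribution is a compact Vision Transformer (ViT) optimized for resource-constrained systems that keeps accuracy high while significantly reducing computational overhead. Integrating GRAD-CAM adds interpretability by visualizing and localizing diseased regions, which matters for trust in practical agricultural use.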
Abstract: Early and precise identification of plant diseases, especially in potato crops, is important for protecting crop health and maximizing yield. Potato leaf diseases, such as Early Blight and Late Blight, pose significant challenges to farmers, often resulting in yield losses and increased pesticide use. Traditional detection methods are time-consuming and subject to human error, which is why automated and efficient methods are required. This paper introduces Tiny-ViT, a small and effective Vision Transformer (ViT) for potato leaf disease classification, developed for use in resource-limited systems. The model is tested on a dataset of three classes (Early Blight, Late Blight, and Healthy leaves), with preprocessing that includes resizing, CLAHE, and Gaussian blur to improve image quality. Tiny-ViT achieves a test accuracy of 99.85% and a mean CV accuracy of 99.82%, outperforming baseline models such as DEIT Small, SWIN Tiny, and MobileViT XS. In addition, the model has a Matthews Correlation Coefficient (MCC) of 0.9990 with a narrow confidence interval (CI) of [0.9980, 0.9995], which indicates high reliability and generalization. Training and testing inference times are competitive, and the model has low computational cost, making it applicable to real-time use. Moreover, the model's interpretability is improved with GRAD-CAM, which identifies diseased areas. Altogether, the proposed Tiny-ViT offers a robust, efficient, and explainable solution to plant disease classification.
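[42] A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks cs.CV | cs.MM | eess.IV
Babak Naderi, Ross Cutler
TL;DR: This paper open-sources a near-raw talking-head video dataset of 847 recordings from 805 participants, captured in natural environments with 446 distinct consumer webcams and totaling about 212 minutes. Recordings are stored with the FFV1 lossless codec to preserve the original signal and are annotated with a Mean Opinion Score (MOS) and perceptual quality tokens. From this corpus the authors curate a stratified benchmarking subset of 120 clips and evaluate the efficiency of four codecs (H.264, H.265, H.266, and AV1), finding that H.266 saves up to 71.3% bitrate relative to H.264 and that background processing and content type significantly affect compression performance.
Details
Motivation: Talking-head video is the predominant content type in real-time communication, yet public datasets in this domain are scarce and limited in signal fidelity. The authors therefore provide a large-scale, high-quality, lossless talking-head dataset to support training and evaluation of video compression and enhancement models.
Result: Evaluated across four datasets and four codecs (H.264, H.265, H.266, and AV1), H.266 achieves VMAF BD-rate savings of up to -71.3% relative to H.264, with significant encoder × dataset (η_p² = .112) and encoder × content condition (η_p² = .149) interactions, indicating that background processing and content type affect compression efficiency. The dataset is 5× the size of the largest prior talking-head webcam dataset (847 vs. 160 clips) while retaining lossless signal fidelity.
Insight: The contribution is a large-scale, near-raw, losslessly encoded talking-head video dataset with rich perceptual quality annotations and a stratified benchmarking subset, a high-quality resource for compression and enhancement research. The multi-device, natural-environment capture and the inclusion of background processing conditions allow a more realistic assessment of how codecs differ in practice.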
Abstract: Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal, either uncompressed (24.4%) or MJPEG-encoded (75.6%), without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to -71.3% (H.266) relative to H.264, with significant encoder × dataset (η_p² = .112) and encoder × content condition (η_p² = .149) interactions, demonstrating that both content type and background processing affect compression efficiency. The dataset offers 5× the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for training and benchmarking video compression and enhancement models in real-time communication.
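[43] Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models cs.CV | cs.AI | cs.CL
Chen Zheng, Yuxuan Lai, Haoyang Lu, Wentao Ma, Jitao Yang
TL;DR: This paper proposes a vision-language model (VLM) approach to aesthetic assessment of Chinese handwriting that goes beyond traditional regression scoring to generate multi-level feedback (simple grades and enriched descriptions) that helps learners improve their writing. The authors explore low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs.
Details
Motivation: Traditional automated assessment treats handwriting scoring as a regression problem and returns only a score, without actionable guidance, which limits how much it helps learners improve.
Result: Experiments show that the approach achieves state-of-the-art (SOTA) performance across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.
Insight: The novelty is recasting handwriting aesthetic assessment from pure score regression into VLM-generated multi-level, actionable feedback (grades and descriptive guidance), and exploring two knowledge-integration strategies, LoRA fine-tuning and in-context learning, which make the assessment more practical and instructive.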
Abstract: The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performance across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.
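[44] Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption cs.CV | cs.AI
Mehmet Kaan Erol
TL;DR: This study quantifies the reliability gap of compressed vision-language models under visual corruption. Comparing a quantized 7B-parameter model with a small 500M-parameter model on VQAv2 and COCO, it finds that the small model not only errs more often but also shows a distinct failure pattern, most notably in semantic drift and negation collapse.
Details
Motivation: The paper addresses an underexplored question about compressed VLMs for edge deployment: do compact models merely make more errors, or do they fail in qualitatively different ways from larger models?
Result: Across 4,000 samples from VQAv2 and COCO Captions, the compact SmolVLM2-500M shows a markedly larger negation collapse than Qwen2.5-VL-7B (-33.2 pp vs. -20.8 pp), a gap driven almost entirely by COCO. On the false_yn template it answers incorrectly on 100% of COCO trials, versus 14% for Qwen2.5-VL-7B. Semantic Drift is the dominant failure mode, and Prior Bias appears only on VQAv2.
Insight: The contribution is a three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) used as a diagnostic framework, together with evidence that the compressed model has a qualitatively different failure signature, in particular a dataset-dependent fragility in negation reasoning. The work provides a reproducible pipeline for systematic safety auditing before edge deployment.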
Abstract: The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as the dominant failure mode on VQAv2 and on COCO for Qwen, with a mixed Object Blindness / Semantic Drift profile for SmolVLM2 on COCO; Prior Bias (C) is present on VQAv2 but absent on COCO for both models. Confidence calibration is measured via Expected Calibration Error (ECE) using geometric mean token probability, compositional reasoning is probed with structured negation probes across four templates, and a blur robustness experiment completes the evaluation. For this model pair, the compact model exhibits a qualitatively distinct failure signature: a 12.5pp larger negation collapse (-33.2pp vs. -20.8pp, Wald 95% CI [8.2, 16.8]pp, p < 10^-8), driven almost entirely by COCO while the VQAv2 gap is not statistically significant (4.5pp, p=0.19). The most discriminating template is false_yn: SmolVLM2-500M responds “Yes” (incorrectly claiming a depicted object is absent) on 100% of COCO trials vs. 14% for Qwen2.5-VL-7B. Asymmetric dataset-dependent miscalibration and a blur experiment with two controlled ablations complete the analysis. The fully reproducible pipeline is released for systematic safety auditing of compressed VLMs prior to edge deployment.
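The calibration measurement mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming equal-width confidence bins and a confidence defined as the geometric mean of the answer's token probabilities; the bin count, function names, and toy inputs are assumptions and are not taken from the paper's released pipeline.

```python
import math

def geometric_mean_confidence(token_probs):
    """Geometric mean of token probabilities (each must be > 0)."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Toy example: four answers, each with per-token probabilities and a correctness flag.
answers = [[0.9, 0.9], [0.81, 1.0], [0.9, 0.9, 0.9], [0.9]]
confidences = [geometric_mean_confidence(p) for p in answers]
ece = expected_calibration_error(confidences, [True, True, True, False])
print(round(ece, 3))
```

Using the geometric mean rather than the product keeps confidence comparable across answers of different token lengths, since a raw product shrinks with every additional token.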
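[45] Quantized Vision-Language Models for Damage Assessment: A Comparative Study of LLaVA-1.5-7B Quantization Levels cs.CV
Takato Yasuno
TL;DR: This paper compares quantized vision-language models for bridge damage assessment, focusing on the trade-offs of LLaVA-1.5-7B at three quantization levels (Q4_K_M, Q5_K_M, and Q8_0). It develops an end-to-end pipeline combining visual damage analysis, structured JSON extraction, and rule-based priority scoring, and evaluates description quality, inference speed, and resource requirements on 254 rebar exposure images.
Details
Motivation: Bridge infrastructure inspection is a critical but labor-intensive task that requires expert assessment of structural damage such as rebar exposure, cracking, and corrosion. The study aims to automate bridge damage assessment with quantized VLMs and to balance model efficiency against quality for deployment on consumer-grade GPUs.
Result: On 254 rebar exposure images, Q5_K_M gives the best balance: a quality score of 3.18 ± 1.35/5.0, an inference time of 5.67 s/image, and an efficiency of 0.56 quality/sec. Its quality is 8.5% higher than Q4_K_M with only a 4.5% speed reduction, and it matches Q8_0's quality with 25% faster inference. Statistical analysis shows that Q5_K_M has the weakest text-quality correlation (-0.148), indicating performance that is consistent regardless of description length.
Insight: The contribution is a systematic comparison of how quantization level affects a VLM in a specialized domain (bridge damage assessment), together with a 5-point quality evaluation framework. The study offers practical guidance for deploying quantized models in resource-constrained environments and highlights Q5_K_M as a good balance between quality and efficiency, a quantization choice that similar applications can draw on.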
Abstract: Bridge infrastructure inspection is a critical but labor-intensive task requiring expert assessment of structural damage such as rebar exposure, cracking, and corrosion. This paper presents a comprehensive study of quantized Vision-Language Models (VLMs) for automated bridge damage assessment, focusing on the trade-offs between description quality, inference speed, and resource requirements. We develop an end-to-end pipeline combining LLaVA-1.5-7B for visual damage analysis, structured JSON extraction, and rule-based priority scoring. To enable deployment on consumer-grade GPUs, we conduct a systematic comparison of three quantization levels: Q4_K_M, Q5_K_M, and Q8_0 across 254 rebar exposure images. We introduce a 5-point quality evaluation framework assessing damage type recognition, severity classification. Our results demonstrate that Q5_K_M achieves the optimal balance: quality score 3.18 ± 1.35/5.0, inference time 5.67 s/image, and 0.56 quality/sec efficiency – 8.5% higher quality than Q4_K_M with only 4.5% speed reduction, while matching Q8_0’s quality with 25% faster inference. Statistical analysis reveals Q5_K_M exhibits the weakest text-quality correlation (-0.148), indicating consistent performance regardless of description length.
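The efficiency figure reported for Q5_K_M appears to be the mean quality score divided by the per-image inference time. The short check below works through that arithmetic under this assumed definition; the variable names are illustrative.

```python
# Values reported in the abstract for the Q5_K_M quantization level.
quality_score = 3.18      # mean quality score out of 5.0
seconds_per_image = 5.67  # inference time per image, in seconds

# Assumed definition: efficiency = quality points delivered per second of inference.
efficiency = quality_score / seconds_per_image
print(f"{efficiency:.2f} quality/sec")  # prints "0.56 quality/sec"
```

The result matches the reported 0.56 quality/sec, which supports reading the metric as a simple quality-to-latency ratio.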
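[46] From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics cs.CV | cs.AI | cs.CY
Paolo Cupini, Francesco Pierri
TL;DR: This paper systematically evaluates multimodal large language models on semantic annotation of Italian broadcast television news, building a domain-specific benchmark with four dimensions: visual environment classification, topic classification, sensitive content detection, and named entity recognition. It compares two pipeline architectures and nine frontier models under different input strategies and finds that gains from video input depend strongly on model scale. The selected pipeline is then deployed on full broadcast episodes, and minute-level annotations are integrated with audience ratings data to enable content-based audience analytics.
Details
Motivation: Automated semantic annotation of broadcast television content poses distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. Although multimodal large language models show strong general-purpose video understanding, the relative effectiveness of different pipeline architectures and input configurations in broadcast-specific settings has lacked empirical study.
Result: Gains from video input are strongly model-dependent: larger models exploit temporal continuity effectively, while smaller models degrade under extended multimodal context, likely because of token overload. Nine frontier models, including Gemini 3.0 Pro, LLaMA 4 Maverick, Qwen-VL variants, and Gemma 3, are evaluated on the domain benchmark.
Insight: The contribution is a multimodal annotation framework for broadcast television analytics, backed by a systematic empirical evaluation. The framework integrates minute-level semantic annotations with normalized audience measurement data to relate content to audience behavior, demonstrating operational viability for content-based audience analytics. The study also shows how strongly model scale affects the ability to handle extended multimodal context, which informs model selection in deployment.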
Abstract: Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This paper presents a systematic evaluation of multimodal annotation pipelines applied to broadcast television news in the Italian setting. We construct a domain-specific benchmark of clips labeled across four semantic dimensions: visual environment classification, topic classification, sensitive content detection, and named entity recognition. Two different pipeline architectures are evaluated across nine frontier models, including Gemini 3.0 Pro, LLaMA 4 Maverick, Qwen-VL variants, and Gemma 3, under progressively enriched input strategies combining visual signals, automatic speech recognition, speaker diarization, and metadata. Experimental results demonstrate that gains from video input are strongly model-dependent: larger models effectively leverage temporal continuity, while smaller models show performance degradation under extended multimodal context, likely due to token overload. Beyond benchmarking, the selected pipeline is deployed on 14 full broadcast episodes, with minute-level annotations integrated with normalized audience measurement data provided by an Italian media company. This integration enables correlational analysis of topic-level audience sensitivity and generational engagement divergence, demonstrating the operational viability of the proposed framework for content-based audience analytics.
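[47] From Prediction to Diagnosis: Reasoning-Aware AI for Photovoltaic Defect Inspection cs.CV
Dev Mistry, Feng Qiu, Bo Chen, Feng Liu, Can Chen
TL;DR: This paper presents REVL-PV, a vision-language framework for photovoltaic defect inspection that embeds domain-specific diagnostic reasoning into multimodal learning across electroluminescence, thermal, and visible-light imagery and produces structured diagnostic reports.
Details
Motivation: Most existing photovoltaic defect detection systems are opaque classifiers that offer limited diagnostic insight for high-stakes energy infrastructure, so a reliable method that provides interpretable diagnostic reasoning is needed.
Result: On 1,927 real-world modules spanning eight defect categories, REVL-PV achieves 93% classification accuracy, produces interpretable diagnostic rationales, and remains robust under realistic image corruptions. A blind concordance study with a certified expert shows strong agreement between model explanations and expert assessments on defect identification, root-cause attribution, and visual descriptions.
Insight: The novelty is embedding the diagnostic reasoning process in multimodal learning: the model must link visual evidence to plausible defect mechanisms before classifying, which yields structured reports aligned with professional practice and, according to the authors, establishes a general paradigm for trustworthy AI-assisted inspection.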
Abstract: Reliable photovoltaic defect identification is essential for maintaining energy yield, ensuring warranty compliance, and enabling scalable inspection of rapidly expanding solar fleets. Although recent advances in computer vision have improved automated defect detection, most existing systems operate as opaque classifiers that provide limited diagnostic insight for high-stakes energy infrastructure. Here we introduce REVL-PV, a vision-language framework that embeds domain-specific diagnostic reasoning into multimodal learning across electroluminescence, thermal, and visible-light imagery. By requiring the model to link visual evidence to plausible defect mechanisms before classification, the framework produces structured diagnostic reports aligned with professional photovoltaic inspection practice. Evaluated on 1,927 real-world modules spanning eight defect categories, REVL-PV achieves 93% classification accuracy while producing interpretable diagnostic rationales and maintaining strong robustness under realistic image corruptions. A blind concordance study with a certified solar inspection expert shows strong semantic alignment between model explanations and expert assessments across defect identification, root-cause attribution, and visual descriptions. These results demonstrate that reasoning-aware multimodal learning establishes a general paradigm for trustworthy AI-assisted inspection of photovoltaic energy infrastructure.
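[48] Limits of Imagery Reasoning in Frontier LLM Models cs.CV | cs.AI
Sergio Y. Hayashi, Nina S. T. Hirata
TL;DR: This paper studies the limits of frontier large language models on spatial reasoning tasks such as mental rotation. It equips an LLM with an external "Imagery Module" that can render and rotate 3D models, forming a dual-module architecture. Performance is lower than expected, with accuracy reaching at most 62.5%. Even when the work of maintaining and manipulating a holistic 3D state is outsourced, the system still fails, revealing that current frontier models lack the foundational visual-spatial primitives needed to interface with imagery.
Details
Motivation: Although LLMs show strong reasoning ability, they perform poorly on spatial tasks that require mental simulation, such as mental rotation. The paper asks whether an external "Imagery Module", acting as a "cognitive prosthetic", can compensate for this deficit.
Result: On 3D model rotation tasks, accuracy reaches at most 62.5%, lower than expected.
Insight: The paper uses a dual-module architecture (a reasoning module plus an imagery module) to systematically evaluate the spatial reasoning of LLMs. The core insight is that current frontier models lack key visual-spatial primitives: (1) the low-level sensitivity to extract spatial signals such as depth, motion, and short-horizon dynamic prediction, and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information. These gaps point to fundamental problems that must be solved to improve models' spatial intelligence.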
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" (a tool capable of rendering and rotating 3D models) can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.
[49] HighlightBench: Benchmarking Markup-Driven Table Reasoning in Scientific Documents cs.CVPDF
Lexin Wang, Shenghua Liu, Yiwei Wang, Yujun Cai, Yuyao Ge
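TL;DR: This paper introduces HighlightBench, a diagnostic benchmark for evaluating how well multimodal large language models reason over tables in scientific documents when guided by visual markup (highlights, underlines, bold text). The benchmark decomposes evaluation into five task families and provides a reference pipeline that makes intermediate decisions explicit, allowing finer-grained error attribution along the perception-to-execution chain.
Details
Motivation: Existing evaluations cannot distinguish whether a model fails to perceive the visual markup or fails to reason with it, leaving a key blind spot in assessing markup-conditioned behavior over tables.
Result: Experiments show that even strong models remain unstable when visual cues must be kept consistent with symbolic reasoning under structured output constraints.
Insight: The contribution is a diagnostic benchmark that decomposes markup-driven table understanding into multiple task families and supplies a reproducible reference pipeline, enabling fine-grained error attribution along the perception-to-execution chain and a more precise assessment of model capability.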
Abstract: Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and finer-grained attribution of errors along the perception-to-execution chain. Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
[50] Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval cs.CVPDF
Xintao Zong, Xian Zhong, Wenxuan Liu, Jianhao Ding, Zhaofei Yu
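TL;DR: This paper proposes a brain-inspired Cross-Modal Spike Fusion network (CMSF), applying spiking neural networks (SNNs) to image-text retrieval (ITR) for the first time. Using a spike-level feature fusion mechanism and only two time steps, the network achieves top-tier retrieval accuracy while keeping energy consumption very low and retrieval speed high.
Details
Motivation: Existing ANN-based image-text retrieval methods tend to pursue deeper, more complex architectures for richer unimodal semantics while overlooking cross-modal interaction, retrieval latency, and energy efficiency. Building a directly trained, low-energy, high-performance SNN for multimodal applications such as ITR remains highly challenging.
Result: CMSF surpasses state-of-the-art ANN methods on image-text retrieval, reaching top-tier retrieval accuracy while maintaining very low energy consumption and high retrieval speed.
Insight: The core contribution is a spike-level cross-modal fusion mechanism that integrates unimodal features into enhanced multimodal representations, which serve as soft supervisory signals to refine the unimodal spike embeddings and mitigate semantic loss. This gives multimodal SNNs a brain-inspired framework that unifies temporal dynamics with cross-modal alignment.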
Abstract: Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at https://github.com/zxt6174/CMSF.
[51] Deep Learning Aided Vision System for Planetary Rovers cs.CV | cs.RO | eess.IVPDF
Lomash Relia, Jai G Singla, Amitabh, Nitant Dube
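TL;DR: This study presents a vision system for planetary rovers that combines real-time perception with offline terrain reconstruction. The real-time module integrates CLAHE-enhanced stereo imagery, YOLOv11n-based object detection, and a neural network that estimates object distances. The offline module uses the Depth Anything V2 metric monocular depth estimation model to generate depth maps from captured images and fuses them into dense point clouds with Open3D. Real-world distance estimates from the real-time pipeline provide reliable metric context for the qualitative reconstructions.
Details
Motivation: To provide a scalable, compute-efficient vision solution for autonomous planetary exploration, covering both real-time perception and offline high-accuracy terrain reconstruction on planetary surfaces.
Result: Evaluated on Chandrayaan 3 NavCam stereo imagery and benchmarked against a CAHV-based utility, the neural network achieves a median depth error of 2.26 cm within a 1 to 10 meter range; the object detection model maintains a balanced precision-recall tradeoff on grayscale lunar scenes.
Insight: The contribution is the combination of real-time stereo object detection and distance estimation with offline monocular-depth point cloud reconstruction, giving planetary rovers a complete vision architecture with reliable metric context. The depth estimation network reaches centimeter-level accuracy within the stated range, showing the potential for high-accuracy perception in resource-constrained settings.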
Abstract: This study presents a vision system for planetary rovers, combining real-time perception with offline terrain reconstruction. The real-time module integrates CLAHE-enhanced stereo imagery, YOLOv11n-based object detection, and a neural network to estimate object distances. The offline module uses the Depth Anything V2 metric monocular depth estimation model to generate depth maps from captured images, which are fused into dense point clouds using Open3D. Real-world distance estimates from the real-time pipeline provide reliable metric context alongside the qualitative reconstructions. Evaluation on Chandrayaan 3 NavCam stereo imagery, benchmarked against a CAHV-based utility, shows that the neural network achieves a median depth error of 2.26 cm within a 1 to 10 meter range. The object detection model maintains a balanced precision-recall tradeoff on grayscale lunar scenes. This architecture offers a scalable, compute-efficient vision solution for autonomous planetary exploration.
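The step from a metric depth map to a point cloud can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with intrinsics fx, fy, cx, cy; the function name and the intrinsics are hypothetical, and the abstract does not say how the authors configure Open3D or the CAHV camera model, so this is not their pipeline.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a metric depth map into 3D camera-frame points.

    depth: list of rows, each a list of depths in meters (<= 0 means invalid).
    fx, fy, cx, cy: pinhole intrinsics in pixels (assumed, not from the paper).
    """
    points = []
    for v, row in enumerate(depth):          # v = pixel row
        for u, z in enumerate(row):          # u = pixel column
            if z <= 0:
                continue                     # skip pixels without a valid depth
            x = (u - cx) * z / fx            # inverse of u = fx * x / z + cx
            y = (v - cy) * z / fy            # inverse of v = fy * y / z + cy
            points.append((x, y, z))
    return points
```

In practice a library routine would replace the double loop; the sketch only makes the geometry explicit.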
[52] arg-VU: Affordance Reasoning with Physics-Aware 3D Geometry for Visual Understanding in Robotic Surgery cs.CV | cs.ROPDF
Nan Xiao, Yunxin Fan, Farong Wang, Fei Liu
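TL;DR: This paper proposes arg-VU, a physics-aware affordance reasoning framework for visual understanding in robotic surgery. It integrates temporally consistent geometry tracking (based on 3D Gaussian Splatting reconstruction) with constraint-induced mechanical modeling (based on Extended Position-Based Dynamics) to derive a physically consistent compliance energy and a positional agreement score from surgical video, which are used to predict the affordances of tissue-tool interaction.
Details
Motivation: In surgical robotics, tissues are highly deformable, compliant, and dynamically coupled with tool motion, which makes conventional affordance reasoning methods hard to apply.
Result: Experiments on surgical video datasets show that arg-VU yields more stable, physically consistent, and interpretable affordance predictions than kinematic baselines.
Insight: The contribution is the combination of temporal 3D geometry reconstruction with physics-constrained modeling, defining anisotropic stiffness metrics (via representative geometry points and their constraint sensitivities) that capture the local constraint-manifold geometry. This yields a reliable affordance reasoning framework for deformable surgical environments that supports embodied robotic interaction.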
Abstract: Affordance reasoning provides a principled link between perception and action, yet remains underexplored in surgical robotics, where tissues are highly deformable, compliant, and dynamically coupled with tool motion. We present arg-VU, a physics-aware affordance reasoning framework that integrates temporally consistent geometry tracking with constraint-induced mechanical modeling for surgical visual understanding. Surgical scenes are reconstructed using 3D Gaussian Splatting (3DGS) and converted into a temporally tracked surface representation. Extended Position-Based Dynamics (XPBD) embeds local deformation constraints and produces representative geometry points (RGPs) whose constraint sensitivities define anisotropic stiffness metrics capturing the local constraint-manifold geometry. Robotic tool poses in SE(3) are incorporated to compute rigidly induced displacements at RGPs, from which we derive two complementary measures: a physics-aware compliance energy that evaluates mechanical feasibility with respect to local deformation constraints, and a positional agreement score that captures motion alignment (serving as a kinematic motion baseline). Experiments on surgical video datasets show that arg-VU yields more stable, physically consistent, and interpretable affordance predictions than kinematic baselines. These results demonstrate that physics-aware geometric representations enable reliable affordance reasoning for deformable surgical environments and support embodied robotic interaction.
[53] Envisioning global urban development with satellite imagery and generative AI cs.CV | cs.AIPDF
Kailai Sun, Yuebing Liang, Mingyi He, Yunhan Zheng, Alok Prakash
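TL;DR: This study proposes a multimodal generative AI framework that uses satellite imagery to envision sustainable urban development at a global scale. Given user-specified urban development goals and geospatial constraints, the framework generates high-fidelity, diverse, and realistic urban satellite imagery covering the 500 largest metropolitan areas worldwide.
Details
Motivation: Past studies mostly treat urban development as a predictive task and fail to reflect its generative nature. This study therefore designs a framework that generates visions of global urban development to support sustainable urban planning.
Result: Human expert evaluation confirms that the generated urban images are comparable to real ones. The latent representations learned by the framework also improve downstream prediction tasks such as carbon emission prediction.
Insight: The contribution is to treat urban development as a generative task and to build a multimodal generative framework that combines text prompts with geospatial controls, enabling global cross-city style transfer and learning from the surrounding environment, and offering a new approach to accelerated, scenario-based urban planning.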
Abstract: Urban development has been a defining force in human history, shaping cities for centuries. However, past studies mostly analyze such development as predictive tasks, failing to reflect its generative nature. Therefore, this study designs a multimodal generative AI framework to envision sustainable urban development at a global scale. By integrating prompts and geospatial controls, our framework can generate high-fidelity, diverse, and realistic urban satellite imagery across the 500 largest metropolitan areas worldwide. It enables users to specify urban development goals, creating new images that align with them while offering diverse scenarios whose appearance can be controlled with text prompts and geospatial constraints. It also facilitates urban redevelopment practices by learning from the surrounding environment. Beyond visual synthesis, we find that it encodes and interprets latent representations of urban form for global cross-city learning, successfully transferring styles of urban environments across a global spatial network. The latent representations can also enhance downstream prediction tasks such as carbon emission prediction. Further, human expert evaluation confirms that our generated urban images are comparable to real urban images. Overall, this study presents innovative approaches for accelerated urban planning and supports scenario-based planning processes for worldwide cities.
[54] Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation cs.CV | cs.AI | eess.IVPDF
Dongsheng Yang, Yinfeng Yu, Liejun Wang
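TL;DR: This paper proposes Beyond Textual Knowledge (BTK), a vision-and-language navigation framework that addresses the difficulty existing methods have in capturing key semantic cues and aligning vision and language precisely. It improves navigation performance by synergistically integrating an environment-specific textual knowledge base with generative image knowledge bases.
Details
Motivation: Existing VLN methods struggle to capture key semantic cues and align them accurately with visual observations, which limits navigation performance.
Result: Extensive experiments on R2R (7,189 trajectories) and REVERIE (21,702 instructions) show that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, success rate (SR) increases by 5% and 2.07%, and success weighted by path length (SPL) increases by 4% and 3.69%, respectively.
Insight: The main contribution is the construction and joint use of multimodal knowledge bases (generative image knowledge bases built with Qwen3-4B and Flux-Schnell, and a panorama-based textual knowledge base built with BLIP-2), integrated through the Goal-Aware Augmentor and Knowledge Augmentor to strengthen semantic grounding and cross-modal alignment.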
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at https://github.com/yds3/IPM-BTK/.
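For readers unfamiliar with the two reported metrics, the sketch below computes SR and SPL using the standard definitions from the embodied navigation literature: SPL averages, over episodes, the success indicator times the ratio of shortest-path length to the larger of the agent's path length and the shortest-path length. The function names and the toy numbers are illustrative only and are not taken from the paper.

```python
def success_rate(successes):
    """SR: fraction of episodes that end in success (1) rather than failure (0)."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """SPL: success weighted by (normalized inverse) path length.

    successes: 0/1 per episode.
    shortest_lengths: shortest-path distance from start to goal, per episode (> 0).
    path_lengths: length of the path the agent actually took, per episode.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)   # a detour (p > l) scales the credit down
    return total / len(successes)


# Three toy episodes: a success with a 2x detour, a failure, an optimal success.
print(success_rate([1, 0, 1]))
print(spl([1, 0, 1], [10.0, 8.0, 5.0], [20.0, 8.0, 5.0]))
```

Because SPL penalizes detours, an agent can raise SR without raising SPL, which is why the abstract reports both.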
[55] Computer Vision with a Superpixelation Camera cs.CVPDF
Sasidharan Mahalingam, Rachel Brown, Atul Ingle
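TL;DR: This paper proposes SuperCam, a novel camera design that adaptively processes captured data by performing superpixel segmentation on the fly, reducing data redundancy in resource-constrained applications.
Details
Motivation: Conventional cameras generate large amounts of data that are hard to process in resource-constrained applications, and most of it is redundant for downstream computer vision algorithms, so a camera design that compresses data adaptively is needed.
Result: Under memory constraints, SuperCam outperforms current state-of-the-art superpixel algorithms. When the compressed data is used for downstream tasks (image segmentation, object detection, monocular depth estimation), SuperCam provides superior output.
Insight: The contribution is integrating superpixel segmentation into the camera hardware for real-time adaptive processing, which helps deploy more efficient computer vision systems on edge devices by reducing data redundancy and memory requirements.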
Abstract: Conventional cameras generate a lot of data that can be challenging to process in resource-constrained applications. Usually, cameras generate data streams on the order of the number of pixels in the image. However, most of this captured data is redundant for many downstream computer vision algorithms. We propose a novel camera design, which we call SuperCam, that adaptively processes captured data by performing superpixel segmentation on the fly. We show that SuperCam performs better than current state-of-the-art superpixel algorithms under memory-constrained situations. We also compare how well SuperCam performs when the compressed data is used for downstream computer vision tasks. Our results demonstrate that the proposed design provides superior output for image segmentation, object detection, and monocular depth estimation in situations where the available memory on the camera is limited. We posit that superpixel segmentation will play a crucial role as more computer vision inference models are deployed in edge devices. SuperCam would allow computer vision engineers to design more efficient systems for these applications.
[56] FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition cs.CVPDF
Jie Zhu, Xiao Guo, Yiyang Su, Anil Jain, Xiaoming Liu
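TL;DR: This paper proposes FusionAgent, a novel agentic framework for dynamic, sample-specific model selection that addresses the limitations of static score-fusion strategies in whole-body human recognition. The framework uses a multimodal large language model (MLLM), treats each expert model as a tool, and applies Reinforcement Fine-Tuning (RFT) with a metric-based reward so that the agent learns to determine the best model combination for each test input. The paper also introduces Anchor-based Confidence Top-k (ACT) score fusion to handle score misalignment and embedding heterogeneity across models.
Details
Motivation: Existing score-fusion strategies for human recognition are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. This is inefficient and not robust in real unconstrained scenarios, because different models have complementary strengths and biometric cues (face, gait, body shape) vary across samples.
Result: Extensive experiments on multiple whole-body biometric benchmarks show that FusionAgent significantly outperforms state-of-the-art (SoTA) methods while achieving higher efficiency through fewer model invocations.
Insight: The core contribution is an MLLM-based agentic framework for dynamic, explainable, and robust model fusion. Specifically: (1) model selection is recast as an agent decision problem optimized with RFT; (2) ACT score fusion anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. This suggests a route to efficient, adaptive multi-model cooperation in practical recognition systems.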
Abstract: Model fusion is a key strategy for robust recognition in unconstrained scenarios, as different models provide complementary strengths. This is especially important for whole-body human recognition, where biometric cues such as face, gait, and body shape vary across samples and are typically integrated via score-fusion. However, existing score-fusion strategies are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each expert model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address the model score misalignment and embedding heterogeneity, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms SoTA methods while achieving higher efficiency through fewer model invocations, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. Project page: https://fusionagent.github.io/.
[57] Live Interactive Training for Video Segmentation cs.CVPDF
Xinyu Yang, Haozheng Yu, Yihong Sun, Bharath Hariharan, Jennifer J. Sun
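TL;DR: This paper proposes Live Interactive Training (LIT), a new framework for prompt-based visual systems in which the model learns online from the user's real-time corrections at inference time. Its primary instantiation, LIT-LoRA, continually updates a lightweight LoRA module, training it rapidly on each user correction so that segmentation improves on subsequent frames of the same video and the user needs to repeat far fewer corrections.
Details
Motivation: Existing interactive video segmentation models such as SAM2 use corrections only for immediate fixes and do not learn from them, so challenging scenarios (occlusions, object separations, camouflage) demand many repeated user interventions. This paper addresses that by letting the model learn online from user feedback.
Result: On challenging video segmentation benchmarks, LIT-LoRA reduces total corrections by 18-34% on average, with a training overhead of only about 0.5 s per correction. The framework was also adapted to other segmentation models and extended to CLIP-based fine-grained image classification.
Insight: The core contribution is a general framework (LIT) that lets a model learn online during inference, with a lightweight LoRA module providing efficient, low-overhead continual adaptation. It offers a new paradigm for interactive vision tools in which the model actively learns from user interaction, cutting redundant human effort.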
Abstract: Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: https://youngxinyu1802.github.io/projects/LIT/.
[58] Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark cs.CVPDF
Laura Pedrouzo-Rodriguez, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Roberto Daza
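TL;DR: This paper introduces AVAPrintDB, a new public multi-generator database of photorealistic talking-head avatars for avatar fingerprinting, and defines a standardized, reproducible benchmark on top of it. The database combines three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait) with two audiovisual corpora and simulates both legitimate usage and impersonation scenarios.
Details
Motivation: Advances in photorealistic avatar generation raise security concerns about identity impersonation in AI-mediated communication, while existing public databases are scarce and based on outdated generators that do not represent realistic scenarios for today's avatar fingerprinting task.
Result: Although identity-related motion cues persist across synthetic avatars, current avatar fingerprinting systems remain highly sensitive to changes in the synthesis pipeline and source domain.
Insight: The contribution is a multi-generator, multi-paradigm public database and benchmark, together with an exploration of new methods based on foundation models (DINOv2 and CLIP), providing resources for reproducible research. The comprehensive analysis under generator and dataset shift exposes the limitations of existing systems and points to directions for future work.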
Abstract: Recent advances in photorealistic avatar generation have enabled highly realistic talking-head avatars, raising security concerns regarding identity impersonation in AI-mediated communication. To make progress on this challenging problem, the task of avatar fingerprinting aims to determine whether two avatar videos are driven by the same human operator or not. However, current public databases in the literature are scarce and based solely on outdated talking-head avatar generators, not representing realistic scenarios for the current task of avatar fingerprinting. To overcome this situation, the present article introduces AVAPrintDB, a new publicly available multi-generator talking-head avatar database for avatar fingerprinting. AVAPrintDB is constructed from two audiovisual corpora and three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait), representing different synthesis paradigms, and includes both self- and cross-reenactments to simulate legitimate usage and impersonation scenarios. Building on this database, we also define a standardized and reproducible benchmark for avatar fingerprinting, considering public state-of-the-art avatar fingerprinting systems and exploring novel methods based on Foundation Models (DINOv2 and CLIP). We also conduct a comprehensive analysis under generator and dataset shift. Our results show that, while identity-related motion cues persist across synthetic avatars, current avatar fingerprinting systems remain highly sensitive to changes in the synthesis pipeline and source domain. The AVAPrintDB, benchmark protocols, and avatar fingerprinting systems are publicly available to facilitate reproducible research.
[59] From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching cs.CVPDF
Yuyang Ji, Yixuan Shen, Shengjie Zhu, Yu Kong, Feng Liu
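TL;DR: This paper presents BioCoach, a biomechanics-grounded vision-language framework for fitness coaching from streaming video. It fuses visual appearance with 3D skeletal kinematics through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector, a structured biomechanical context, and a vision-biomechanics conditioned feedback module. Parameter-efficient training keeps the vision and language backbones frozen and yields transparent, personalized reasoning. On the augmented QEVD-bio-fit-coach dataset, BioCoach shows clear gains on lexical and judgment metrics, and on the original QEVD-fit-coach dataset it improves text quality and correctness.
Details
Motivation: To deliver precise, actionable fitness coaching from streaming video by fusing visual and biomechanical information, going beyond simple pattern matching toward personalized, phase-aware feedback.
Result: On QEVD-bio-fit-coach, BioCoach achieves clear gains on lexical and judgment metrics while maintaining temporal triggering. On the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, indicating that explicit kinematics and constraints are key to accurate, phase-aware coaching.
Insight: Contributions include the three-stage pipeline (degree-of-freedom selector, biomechanical context, conditioned feedback module) that fuses visual and biomechanical information; parameter-efficient training with frozen backbones; and a biomechanics-aware LLM judge metric. Combining biomechanical constraints with vision-language models gives fitness coaching a more transparent, interpretable reasoning mechanism.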
Abstract: We present BioCoach, a biomechanics-grounded vision–language framework for fitness coaching from streaming video. BioCoach fuses visual appearance and 3D skeletal kinematics, through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision–biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.
[60] Multimodal Deep Learning for Diabetic Foot Ulcer Staging Using Integrated RGB and Thermal Imaging cs.CV | cs.LGPDF
Gulengul Mermer, Mustafa Furkan Aksu, Gozde Ozsezer, Sevki Cetinkalp, Orhan Er
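TL;DR: This study develops a Raspberry Pi-based portable multimodal imaging system that captures RGB and thermal images of diabetic foot ulcers (DFU) simultaneously, and uses it to build a dataset of 1,205 samples. It evaluates how three data modalities (RGB, thermal, and their fusion with the thermal image as a fourth channel) affect six-stage DFU classification across several deep learning models (DenseNet121, EfficientNetV2, and others). Multimodal fusion clearly outperforms the single-modality approaches, with VGG16 performing best on the fused dataset.
Details
Motivation: Diabetic foot ulcers are a serious complication of diabetes that can lead to amputation and high healthcare costs. Regular monitoring and early diagnosis are critical for reducing the clinical burden and the risk of amputation. The study investigates how much multimodal images (RGB plus thermal) improve deep learning models on DFU stage classification.
Result: On the hospital-collected dataset, the multimodal training set (RGB plus thermal as a fourth channel) outperforms the RGB-only and thermal-only approaches. The VGG16 model trained on the RGB+Thermal dataset performs best, with an accuracy of 93.25%, an F1-score of 92.53%, and a Matthews correlation coefficient (MCC) of 91.03%.
Insight: The contribution is a portable multimodal imaging system for data collection and an empirical study of the gain from channel-level fusion of RGB and thermal data for DFU staging. The key insight is that the thermal channel helps the model localize the relevant region by highlighting temperature anomalies in the ulcer area, while the RGB channel supplies complementary structural and textural information. This complementarity improves performance and offers a fusion strategy that other medical imaging work can draw on.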
Abstract: Diabetic foot ulcers (DFU) are one of the serious complications of diabetes that can lead to amputations and high healthcare costs. Regular monitoring and early diagnosis are critical for reducing the clinical burden and the risk of amputation. The aim of this study is to investigate the impact of using multimodal images on deep learning models for the classification of DFU stages. To this end, we developed a Raspberry Pi-based portable imaging system capable of simultaneously capturing RGB and thermal images. Using this prototype, a dataset consisting of 1,205 samples was collected in a hospital setting. The dataset was labeled by experts into six distinct stages. To evaluate the models' performance, we prepared three different training sets: RGB-only, thermal-only, and RGB+Thermal (with the thermal image added as a fourth channel). We trained the DenseNet121, EfficientNetV2, InceptionV3, ResNet50, and VGG16 models on each of these training sets. The results show that the multimodal training dataset, in which RGB and thermal data are combined across four channels, outperforms single-modal approaches. The highest performance was observed in the VGG16 model trained on the RGB+Thermal dataset. The model achieved an accuracy of 93.25%, an F1-score of 92.53%, and an MCC of 91.03%. Grad-CAM heatmap visualizations demonstrated that the thermal channel helped the model focus on the correct location by highlighting temperature anomalies in the ulcer region, while the RGB channel supported the decision-making process with complementary structural and textural information.
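The "thermal image as a fourth channel" input described above amounts to per-pixel concatenation. The sketch below shows that idea on nested lists; it assumes the RGB and thermal images are already registered to the same resolution, and the function name is hypothetical. The abstract does not describe the authors' alignment or normalization steps, nor how the pretrained networks were adapted to accept four input channels.

```python
def stack_rgb_thermal(rgb, thermal):
    """Append the thermal value as a fourth channel to each RGB pixel.

    rgb: H x W grid of (r, g, b) tuples.
    thermal: H x W grid of scalars, assumed pixel-aligned with rgb.
    Returns an H x W grid of [r, g, b, t] lists.
    """
    if len(rgb) != len(thermal) or any(
        len(rgb_row) != len(t_row) for rgb_row, t_row in zip(rgb, thermal)
    ):
        raise ValueError("RGB and thermal images must have the same height and width")
    return [
        [list(pixel) + [t] for pixel, t in zip(rgb_row, t_row)]
        for rgb_row, t_row in zip(rgb, thermal)
    ]
```

A network whose first layer expects three channels would need that layer widened to four before it could consume this input.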
[61] A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models cs.CVPDF
Mujtaba Hussain Mirza, Antonio D’Orazio, Odelia Melamed, Iacopo Masi
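TL;DR: This paper proposes ET3, a lightweight, training-free Energy-Guided Test-Time Transformation defense that improves the robustness of large vision-language models (LVLMs) to adversarial perturbations by minimizing the energy of input samples. It shows strong defense on classification, CLIP zero-shot classification, image captioning, and visual question answering.
Details
Motivation: Despite rapid progress, multimodal models and LVLMs remain highly susceptible to adversarial perturbations, raising concerns about their real-world reliability. Adversarial training is the leading approach to building robust models, while Test-Time Transformations (TTT) have emerged as a promising way to boost robustness at inference. The paper aims for a training-free, lightweight defense that strengthens LVLM adversarial robustness.
Result: Extensive experiments show that ET3 provides a strong defense for classifiers, CLIP zero-shot classification, and LVLM image captioning and visual question answering.
Insight: The contribution is an energy-minimization-based test-time transformation defense (ET3) that is training-free, lightweight, and backed by theory (the transformation is proven to succeed in classification under reasonable assumptions). Combining energy guidance with test-time defense gives a practical way to improve LVLM adversarial robustness.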
Abstract: Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLM), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances the robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense.
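To give a feel for what "minimizing the energy of the input" can mean, here is a toy sketch. It assumes the free-energy definition common in energy-based-model work, E(x) = -log sum_k exp(f_k(x)), applied to a linear classifier f(x) = Wx + b, and takes a few gradient steps on the input while keeping it within eps of the original. The energy definition, the update rule, the projection, and every name below are assumptions for illustration; the abstract does not specify how ET3 defines or minimizes energy.

```python
import math


def logits(W, b, x):
    """Linear classifier scores f(x) = W x + b (a stand-in for a real model)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def energy(W, b, x):
    """Assumed free energy: E(x) = -logsumexp(f(x)), computed stably."""
    z = logits(W, b, x)
    m = max(z)
    return -(m + math.log(sum(math.exp(v - m) for v in z)))


def energy_grad(W, b, x):
    """dE/dx = -sum_k softmax_k(f(x)) * W[k] for the linear model above."""
    z = logits(W, b, x)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [-sum(p * row[j] for p, row in zip(probs, W)) for j in range(len(x))]


def minimize_energy(W, b, x, step=0.1, n_steps=5, eps=0.05):
    """Gradient descent on the input, clipped to an L-infinity ball of radius eps."""
    x0 = list(x)
    x = list(x)
    for _ in range(n_steps):
        g = energy_grad(W, b, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        x = [min(max(xi, oi - eps), oi + eps) for xi, oi in zip(x, x0)]
    return x
```

With a deep network, the gradient would come from automatic differentiation rather than the closed form used here.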
[62] GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection cs.CVPDF
Jiaming Li, Zhijia Liang, Weikai Chen, Lin Ma, Guanbin Li
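TL;DR: This paper proposes GUIDED, a framework that addresses the semantic entanglement of subjects and attributes in fine-grained open-vocabulary object detection (FG-OVD). It decomposes object localization and fine-grained recognition into separate pathways, using a language model to extract the subject, attention-based fusion to incorporate attribute information, and a region-level attribute discrimination module, to achieve more accurate detection.
Details
Motivation: Existing open-vocabulary detectors underperform in fine-grained settings, mainly because subjects and attributes are semantically entangled in pretrained vision-language model (VLM) embeddings, which leads to over-representation of attributes, mislocalization, and semantic drift.
Result: Extensive experiments on the FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art (SOTA) results.
Insight: The contribution is a framework of disentangled modeling and modular optimization: it separates the subject localization and attribute recognition pathways and adds attention-based attribute embedding fusion and a region-level attribute discrimination module, which mitigates semantic entanglement and improves the accuracy and robustness of fine-grained open-vocabulary detection.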
Abstract: Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings, leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited for its role. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. Then the detector is guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or overrepresented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization. Our code will be released at https://github.com/lijm48/GUIDED.
[63] YOLO Object Detectors for Robotics – a Comparative Study cs.CV | cs.LGPDF
Patryk Niżeniec, Marcin Iwanowski, Marcin Gahbler
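TL;DR: This paper compares YOLO-family object detectors for robotic vision tasks. Using a custom dataset and COCO2017, with image distortions added to test robustness, it aims to give experimental grounds for choosing a YOLO version for robotic vision.
Details
Motivation: To validate the applicability of YOLO-family detectors to detecting objects within the robot workspace, addressing model selection in robotic vision systems.
Result: Experiments on the custom dataset and COCO2017 cover different training/testing configurations and model variants and test robustness under image distortions; the results serve as a reference for choosing a YOLO version for robotic vision tasks.
Insight: By systematically comparing YOLO versions on robotic vision tasks, and in particular testing robustness under image distortions, the study offers empirical grounds for model selection in practice rather than relying on benchmark metrics alone.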
Abstract: YOLO object detectors recently became a key component of vision systems in many domains. The family of available YOLO models consists of multiple versions, each in various variants. The research reported in this paper aims to validate the applicability of members of this family to detect objects located within the robot workspace. In our experiments, we used our custom dataset and the COCO2017 dataset. To test the robustness of investigated detectors, the images of these datasets were subject to distortions. The results of our experiments, including variations of training/testing configurations and models, may support the choice of the appropriate YOLO version for robotic vision tasks.
[64] RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs cs.CVPDF
Logan Lawrence, Mustafa Chasmai, Rangel Daroya, Wuao Liu, Seoyun Jeong
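TL;DR: This paper proposes the RealBirdID benchmark for evaluating multimodal large language models on fine-grained bird species identification in the wild, focusing on whether a model facing an unanswerable case (for example, missing key cues or poor image quality) can abstain in a principled way and give an evidence-based rationale.
Details
Motivation: In fine-grained bird identification, a single image often cannot determine the species because key cues are non-visual (such as vocalization) or hidden by occlusion, camera angle, or low resolution. Existing evaluations consider only answerable cases, which encourages confident guessing over principled abstention.
Result: On the answerable set, a range of open-source and proprietary MLLMs, including GPT-5 and Gemini-2.5 Pro, score below 13% accuracy; models with stronger classification ability are not necessarily better calibrated to abstain on unanswerable examples; and MLLMs generally fail to give correct reasons even when they do abstain.
Insight: The contribution is a benchmark containing both answerable and unanswerable examples (the latter labeled with abstention rationales), emphasizing "principled abstention" and giving fine-grained recognition a more complete evaluation framework that may extend to other visual recognition domains that must handle uncertainty.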
Abstract: Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g. vocalization), or obscured due to occlusion, camera angle, or low resolution. Yet today’s multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: “requires vocalization,” “low quality image,” or “view obstructed”. For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) the species identification on the answerable set is challenging for a variety of open-source and proprietary models (less than 13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily more calibrated to abstain from unanswerable examples, and (3) that MLLMs generally fail at providing correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.
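The three findings above correspond to three quantities one might measure: accuracy on answerable images, how often a system abstains on unanswerable ones, and whether its stated reason matches the labeled one. The sketch below is one plausible way to compute them. The record fields, the function name, and the exact metric definitions are assumptions; the abstract does not give the benchmark's scoring protocol.

```python
def abstention_report(examples):
    """Summarize answer and abstention behavior over a list of example records.

    Each record is a dict with (hypothetical) keys:
      gold_species   - species label, or None if the image is unanswerable
      gold_rationale - labeled reason for unanswerable images, else None
      pred_species   - predicted species, or None if the system abstained
      pred_rationale - reason given when abstaining, else None
    """
    answerable = [e for e in examples if e["gold_species"] is not None]
    unanswerable = [e for e in examples if e["gold_species"] is None]
    abstained = [e for e in unanswerable if e["pred_species"] is None]

    def ratio(hits, total):
        return hits / total if total else None   # None when the subset is empty

    return {
        "answerable_accuracy": ratio(
            sum(e["pred_species"] == e["gold_species"] for e in answerable),
            len(answerable),
        ),
        "abstention_rate": ratio(len(abstained), len(unanswerable)),
        "rationale_accuracy": ratio(
            sum(e["pred_rationale"] == e["gold_rationale"] for e in abstained),
            len(abstained),
        ),
    }
```

Reporting the three numbers separately keeps a strong classifier from masking poor abstention behavior, which is the gap the benchmark highlights.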
[65] MOOZY: A Patient-First Foundation Model for Computational Pathology cs.CVPDF
Yousef Kotp, Vincent Quoc-Huy Trinh, Christopher Pal, Mahdi S. Hosseini
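TL;DR: MOOZY is a patient-first foundation model for computational pathology that takes the patient case, not the individual slide, as the core unit of representation. A case transformer explicitly models dependencies among all slides from the same patient, and two-stage pretraining on public data (self-supervised feature learning followed by multi-task semantic alignment) builds transferable pathology image representations.
Details
Motivation: Current computational pathology foundation models are mostly slide-centric, depend on private data and expensive paired-report supervision, and do not explicitly model relationships among multiple slides from the same patient. MOOZY addresses these issues with an open, patient-first model that transfers across clinical tasks.
Result: In five-fold frozen-feature probe evaluation on eight held-out tasks, MOOZY achieves best or tied-best performance on most metrics. Its macro averages of weighted F1, weighted ROC-AUC, and balanced accuracy improve over TITAN by +7.37%, +5.50%, and +7.83% and over PRISM by +8.83%, +10.70%, and +9.78%, respectively. The model has 85.77M parameters, 14x smaller than GigaPath.
Insight: Contributions: (1) a representation paradigm built around the patient case, with a case transformer modeling cross-slide dependencies; (2) pretraining that combines multi-stage self-supervision on public data with low-cost multi-task supervision; (3) parameter-efficient, strong results on public benchmarks, offering a feasible path to scalable patient-first pathology foundation models.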
Abstract: Computational pathology needs whole-slide image (WSI) foundation models that transfer across diverse clinical tasks, yet current approaches remain largely slide-centric, often depend on private data and expensive paired-report supervision, and do not explicitly model relationships among multiple slides from the same patient. We present MOOZY, a patient-first pathology foundation model in which the patient case, not the individual slide, is the core unit of representation. MOOZY explicitly models dependencies across all slides from the same patient via a case transformer during pretraining, combining multi-stage open self-supervision with scaled low-cost task supervision. In Stage 1, we pretrain a vision-only slide encoder on 77,134 public slide feature grids using masked self-distillation. In Stage 2, we align these representations with clinical semantics using a case transformer and multi-task supervision over 333 tasks from 56 public datasets, including 205 classification and 128 survival tasks across four endpoints. Across eight held-out tasks with five-fold frozen-feature probe evaluation, MOOZY achieves best or tied-best performance on most metrics and improves macro averages over TITAN by +7.37%, +5.50%, and +7.83% and over PRISM by +8.83%, +10.70%, and +9.78% for weighted F1, weighted ROC-AUC, and balanced accuracy, respectively. MOOZY is also parameter efficient with 85.77M parameters, 14x smaller than GigaPath. These results demonstrate that open, reproducible patient-level pretraining yields transferable embeddings, providing a practical path toward scalable patient-first histopathology foundation models.
[66] Towards Intrinsic-Aware Monocular 3D Object Detection cs.CVPDF
Zhihao Zhang, Abhinav Kumar, Xiaoming Liu
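TL;DR: This paper proposes MonoIA, a unified intrinsic-aware framework for monocular 3D object detection. It models and adapts to camera-intrinsic variation through a language-grounded representation, treating that variation as a perceptual transformation that alters apparent scale, perspective, and spatial geometry rather than as a numeric difference. This improves generalization and detection consistency across camera configurations.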
Details
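Motivation: Existing monocular 3D object detection methods are highly sensitive to camera intrinsics and struggle to generalize across diverse camera setups, because intrinsics determine how a 3D scene is projected onto the image plane.
Result: MonoIA achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard) and improves further under multi-dataset training (e.g., +4.46% on KITTI Val).
Insight: The contribution is to shift intrinsic modeling from numeric conditioning to semantic representation. Large language models and vision-language models generate intrinsic embeddings that encode the visual and geometric implications of camera parameters, and an Intrinsic Adaptation Module integrates them hierarchically into the detection network, so the model can adjust its feature representations to a specific camera configuration and perceive robustly and uniformly across intrinsics.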
Abstract: Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsics govern how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision-language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics. This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras. Extensive experiments show that MonoIA achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).
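The claim that intrinsics govern how 3D scenes are projected onto the image plane can be made concrete with the standard pinhole camera model. The sketch below is textbook projection geometry, not MonoIA's method. The function names are illustrative, and lens distortion and skew are assumed to be zero.

```python
def project(point_xyz, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates (meters) to pixel coordinates."""
    x, y, z = point_xyz
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    # Pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy
    return (fx * x / z + cx, fy * y / z + cy)


def apparent_width_px(width_m, depth_m, fx):
    """Pixel width of a fronto-parallel object of a given metric width and depth."""
    return fx * width_m / depth_m


# The same 2 m wide car at 20 m looks twice as wide when the focal length doubles.
narrow = apparent_width_px(2.0, 20.0, fx=700.0)
wide = apparent_width_px(2.0, 20.0, fx=1400.0)
```

Because apparent size scales linearly with focal length, a detector that has only seen one camera can confuse a focal-length change with a change in object depth, which is the sensitivity the abstract describes.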
[67] VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation cs.CVPDF
Jihwan Hong, Jaeyoung Do
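TL;DR: This paper proposes VIRST, an end-to-end framework for referring video object segmentation. A Spatio-Temporal Fusion module fuses segmentation-aware video features into the vision-language backbone, and a Temporal Dynamic Anchor Updater maintains temporally adjacent anchor frames that supply stable temporal cues. Together they unify global video reasoning and pixel-level mask prediction, addressing the performance drop of existing methods on motion-intensive videos and on queries that require multi-step reasoning.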
Details
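Motivation: Existing fixed-keyframe approaches to referring video object segmentation typically couple a vision-language model with a separate propagation module. They struggle to capture rapidly changing spatiotemporal dynamics and cannot handle queries that require multi-step reasoning, so performance drops sharply on motion-intensive and reasoning-oriented videos.
Result: VIRST achieves state-of-the-art results across diverse RVOS benchmarks, particularly under realistic and challenging conditions, and generalizes well to both referring and reasoning-oriented settings.
Insight: The contributions are a Spatio-Temporal Fusion module that unifies semantic and segmentation representations and a Temporal Dynamic Anchor Updater that copes with large motion, occlusion, and reappearance, integrating global reasoning and pixel-level prediction end to end in a single model.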
Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning oriented settings. The code and checkpoints are available at https://github.com/AIDASLab/VIRST.
[68] ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding cs.CV | cs.AI | cs.CLPDF
Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel
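TL;DR: ChartNet is a million-scale, high-quality multimodal dataset designed to advance chart understanding and reasoning. A code-guided synthesis pipeline generates 1.5 million diverse samples spanning 24 chart types and 6 plotting libraries, each with five aligned components: plotting code, rendered chart image, data table, natural language summary, and question answering with reasoning. The dataset also includes specialized subsets covering human-annotated data, real-world data, safety, and grounding, and a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity. Fine-tuning on ChartNet consistently improves multimodal models on benchmarks. As the largest open-source dataset of its kind, it supports the development of foundation models with robust and generalizable data visualization understanding.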
Details
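Motivation: Current vision-language models remain limited at jointly reasoning over geometric visual patterns, structured numerical data, and natural language, which is exactly what chart understanding requires. High-quality, large-scale multimodal data is therefore needed to advance research on chart interpretation and reasoning.
Result: Fine-tuning on ChartNet consistently improves results across multiple benchmarks, demonstrating its utility as large-scale supervision for multimodal models.
Insight: The contribution is a code-guided synthesis pipeline that generates large-scale, high-quality, and diverse chart data with fine-grained cross-modal alignment among the five components. With its specialized subsets and rigorous quality filtering, the dataset covers the full spectrum of chart comprehension while maintaining quality, making it a key resource for building foundation models with robust, generalizable data visualization understanding.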
Abstract: Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language – a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at https://huggingface.co/datasets/ibm-granite/ChartNet
[69] Structural Graph Probing of Vision-Language Models cs.CVPDF
Haoyu He, Yue Zhuo, Yu Zheng, Qi R. Wang
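TL;DR: This paper studies vision-language models (VLMs) through neural topology, representing each layer as a within-layer correlation graph built from neuron co-activations. It asks whether population-level structure is behaviorally meaningful, how it changes across modalities and depth, and whether it identifies internal components that are causally influential under intervention. The authors find that correlation topology carries recoverable behavioral signal, that cross-modal structure progressively consolidates with depth around a compact set of recurrent hub neurons, and that targeted perturbation of these hubs substantially alters model output. Neural topology thus emerges as a meaningful intermediate scale for VLM interpretability.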
Details
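Motivation: Vision-language models perform strongly on multimodal tasks, yet how their computation is organized across populations of neurons remains unclear. This paper uses neural topology to understand the internal structure of VLMs and to examine whether population-level organization relates to behavior and how it varies.
Result: Correlation topology carries recoverable behavioral signal; cross-modal structure progressively consolidates with depth around a compact set of recurrent hub neurons; and targeted perturbation of these hub neurons substantially alters model output, indicating that they are causally influential.
Insight: The contribution is to position neural topology as an intermediate scale for VLM interpretability, richer than local attribution and more tractable than full circuit recovery, and to tie it empirically to multimodal behavior. The method offers a new tool for understanding how VLMs organize internal computation and may help with model debugging and optimization.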
Abstract: Vision-language models (VLMs) achieve strong multimodal performance, yet how computation is organized across populations of neurons remains poorly understood. In this work, we study VLMs through the lens of neural topology, representing each layer as a within-layer correlation graph derived from neuron-neuron co-activations. This view allows us to ask whether population-level structure is behaviorally meaningful, how it changes across modalities and depth, and whether it identifies causally influential internal components under intervention. We show that correlation topology carries recoverable behavioral signal; moreover, cross-modal structure progressively consolidates with depth around a compact set of recurrent hub neurons, whose targeted perturbation substantially alters model output. Neural topology thus emerges as a meaningful intermediate scale for VLM interpretability: richer than local attribution, more tractable than full circuit recovery, and empirically tied to multimodal behavior. Code is publicly available at https://github.com/he-h/vlm-graph-probing.
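A within-layer correlation graph of the kind the abstract describes can be sketched as follows: each node is a neuron, and two neurons are linked when their activations across a set of inputs are strongly correlated. The 0.8 threshold, the use of absolute Pearson correlation, and ranking hubs by degree are all assumptions made for illustration. The paper's actual graph construction and hub criterion may differ.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length activation traces."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant neuron carries no correlation signal
    return cov / math.sqrt(var_a * var_b)


def correlation_graph(activations, threshold=0.8):
    """activations[i] holds neuron i's activation on each input; returns adjacency sets."""
    n = len(activations)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(activations[i], activations[j])) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def hub_neurons(adjacency, top_k=1):
    """Rank neurons by degree (ties broken by index) as a simple hub proxy."""
    return sorted(adjacency, key=lambda i: (-len(adjacency[i]), i))[:top_k]
```

Under this reading, a perturbation experiment would zero or rescale the activations of the neurons returned by `hub_neurons` and compare the model's output before and after.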
[70] LightCtrl: Training-free Controllable Video Relighting cs.CVPDF
Yizuo Peng, Xuelin Chen, Kai Zhang, Xiaodong Cun
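TL;DR: LightCtrl is a training-free, controllable video relighting method that gives explicit control over video illumination through a user-supplied light trajectory. It combines pre-trained diffusion models: an image relighting model processes each frame, and a video diffusion prior then enhances temporal consistency. Its core components are a Light Map Injection module, which injects light-trajectory-specific noise, and a Geometry-Aware Relighting module, which suppresses the influence of the original lighting, yielding high-quality videos with diverse illumination changes.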
Details
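Motivation: Existing diffusion models offer limited explicit control over illumination in video relighting, so a method is needed that achieves precise, controllable relighting from a user-specified light trajectory.
Result: Experiments show that LightCtrl produces high-quality videos whose diverse illumination changes closely follow the specified light trajectory, with better controllability than baseline methods.
Insight: The contributions are explicit illumination control without any training, a Light Map Injection module that improves illumination coherence, and a Geometry-Aware Relighting module that dynamically combines RGB and normal-map latents in the frequency domain to suppress the original lighting, improving adherence to the input light trajectory.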
Abstract: Recent diffusion models have achieved remarkable success in image relighting, and this success has quickly been extended to video relighting. However, existing methods offer limited explicit control over illumination in the relighted output. We present LightCtrl, the first controllable video relighting method that enables explicit control of video illumination through a user-supplied light trajectory in a training-free manner. Our approach combines pre-trained diffusion models: an image relighting model processes each frame individually, followed by a video diffusion prior to enhance temporal consistency. To achieve explicit control over dynamically varying lighting, we introduce two key components. First, a Light Map Injection module samples light trajectory-specific noise and injects it into the latent representation of the source video, improving illumination coherence with the conditional light trajectory. Second, a Geometry-Aware Relighting module dynamically combines RGB and normal map latents in the frequency domain to suppress the influence of the original lighting, further enhancing adherence to the input light trajectory. Experiments show that LightCtrl produces high-quality videos with diverse illumination changes that closely follow the specified light trajectory, demonstrating improved controllability over baseline methods. Code is available at: https://github.com/GVCLab/LightCtrl.
[71] EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow cs.CVPDF
Dogyun Park, Yanyu Li, Sergey Tulyakov, Anil Kag
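TL;DR: EFlow is an efficient few-step training framework that targets the compute bottlenecks of video diffusion transformers in both training and inference. By combining a Gated Local-Global Attention mechanism with an efficient few-step training recipe, EFlow substantially reduces per-step compute and the number of sampling steps, enables fast training from scratch, and achieves competitive performance on Kinetics and text-to-video datasets.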
Details
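Motivation: Scaling video diffusion transformers is limited by two main bottlenecks: the quadratic complexity of attention and the high cost of iterative sampling steps. EFlow aims to address both at once to enable efficient, high-quality few-step video generator training.
Result: EFlow achieves competitive performance on Kinetics and large-scale text-to-video datasets, with up to 2.5x higher training throughput than standard solution-flow and 45.3x lower inference latency than standard iterative models.
Insight: The contributions are Gated Local-Global Attention, a token-droppable hybrid block that lowers compute cost, and an efficient few-step training recipe comprising Path-Drop Guided training and a Mean-Velocity Additivity regularizer, which preserves high fidelity at extremely low step counts. Together these improve training efficiency and inference speed.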
Abstract: Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the expensive quadratic complexity of attention per step, and the iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework, that tackles these bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block which is efficient, expressive, and remains highly stable under aggressive random token-dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe. We propose Path-Drop Guided training to replace the expensive guidance target with a computationally cheap, weak path. Furthermore, we augment this with a Mean-Velocity Additivity regularizer to ensure high fidelity at extremely low step counts. Together, our EFlow enables a practical from-scratch training pipeline, achieving up to 2.5x higher training throughput over standard solution-flow, and 45.3x lower inference latency than standard iterative models with competitive performance on Kinetics and large-scale text-to-video datasets.
[72] LLM Enhanced Action Recognition via Hierarchical Global-Local Skeleton-Language Model cs.CVPDF
Ruosi Wang, Fangwei Zuo, Lei Li, Zhaoqiang Xia
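TL;DR: This paper proposes a hierarchical global-local skeleton-language model (HocSLM) to enhance skeleton-based human action recognition. A hierarchical global-local network (HGLNet) models complex spatio-temporal relationships, a large vision-language model (VLM) generates textual descriptions from RGB videos, and a skeleton-language model (SLM) aligns skeleton features with text semantics, improving semantic discrimination and cross-modal understanding.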
Details
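Motivation: Existing GCN-based skeleton action recognition methods usually rely on short-range motion topologies, which struggle to capture long-range joint dependencies and complex temporal dynamics. Their insufficient modeling of action semantics also limits cross-modal semantic alignment and understanding.
Result: The method achieves state-of-the-art performance on three mainstream benchmarks: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
Insight: The contribution is to combine hierarchical global-local spatio-temporal modeling with textual semantic descriptions generated from RGB videos, aligned across modalities through a skeleton-language model, which strengthens the representation and discrimination of action semantics. The approach integrates the spatio-temporal priors of the skeleton modality with the rich semantics of language and suggests a new direction for multimodal fusion in action recognition.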
Abstract: Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), enabling the large action model to be more representative of action semantics. First, we design a hierarchical global-local network (HGLNet) that consists of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior knowledge of human physical structure, significantly enhancing the model’s representation of complex spatio-temporal relationships. Then, a large vision-language model (VLM) generates textual descriptions from the original RGB video sequences, providing rich action semantics for further training the skeleton-language model. Furthermore, we introduce a skeleton-language sequential fusion module that combines the features from HGLNet with the generated descriptions and uses a skeleton-language model (SLM) to align skeletal spatio-temporal features and textual action descriptions precisely within a unified semantic space. The SLM can significantly enhance HGLNet’s semantic discrimination and cross-modal understanding. Extensive experiments demonstrate that the proposed HocSLM achieves state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
[73] UniDAC: Universal Metric Depth Estimation for Any Camera cs.CVPDF
Girish Chandar Ganesan, Yuliang Guo, Liu Ren, Xiaoming Liu
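TL;DR: This paper proposes UniDAC, a universal framework for monocular metric depth estimation (MMDE) that addresses the difficulty existing methods have in generalizing across camera types such as fisheye and 360° cameras. Its core idea is to decouple metric depth estimation into relative depth prediction and spatially varying scale estimation, supported by a Depth-Guided Scale Estimation module and RoPE-φ, a distortion-aware positional embedding for equirectangular projections.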
Details
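Motivation: Existing zero-shot MMDE methods generalize poorly across camera types (for example, large-FoV cameras), while existing unified approaches either require such data during training or need separately trained models for different domains. This paper aims to build a single model that delivers universal, robust metric depth estimation for any camera.
Result: UniDAC achieves state-of-the-art cross-camera generalization, consistently outperforming prior methods on all evaluated datasets.
Insight: The main contributions are: 1) decoupling metric depth estimation into two subtasks, relative depth and spatially varying scale; 2) a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map using the relative depth map as guidance to account for local scale variation; 3) RoPE-φ, a positional embedding that adapts to the spatial warping of equirectangular projections via latitude-aware weighting. The decoupled architecture and the projection-specific design are key to its universality.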
Abstract: Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and $360^\circ$ cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large-FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-$φ$, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.
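The decoupling described above amounts to predicting a relative depth map and a per-pixel scale map, then multiplying them. The sketch below illustrates that composition, together with the row-to-latitude mapping of an equirectangular (ERP) image. The cosine weight is a common ERP area correction used here only to illustrate what "latitude-aware" can mean; it is an assumption and is not the definition of RoPE-φ. All function names are illustrative.

```python
import math


def metric_depth(relative_depth, scale_map):
    """Compose metric depth as per-pixel scale times relative depth (nested lists)."""
    return [
        [rel * scale for rel, scale in zip(rel_row, scale_row)]
        for rel_row, scale_row in zip(relative_depth, scale_map)
    ]


def erp_row_latitude(row, height):
    """Latitude in radians of the center of an ERP image row (top row near +pi/2)."""
    return (0.5 - (row + 0.5) / height) * math.pi


def erp_row_weight(row, height):
    """cos(latitude): rows near the poles are stretched horizontally, so they weigh less."""
    return math.cos(erp_row_latitude(row, height))
```

A spatially varying scale map, rather than a single global scalar, lets the composition absorb local scale differences that a relative depth predictor cannot express on its own.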
[74] RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation cs.CVPDF
Sen Zhang, Runmei Li, Zhichao Zheng, Yuhe Zhang, Jiani Li
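TL;DR: Targeting the visual perception and decision-making needs of Automatic Train Operation (ATO), this paper introduces the RailVQA-bench benchmark and the RailVQA-CoM collaborative framework. RailVQA-bench is the first VQA benchmark for cab-view visual cognition in ATO, with QA pairs over static and dynamic scenarios that evaluate cognitive generalization and interpretability. RailVQA-CoM combines small-model efficiency with large-model cognition through a transparent three-module architecture and adaptive temporal sampling, aiming to improve perceptual generalization, enable efficient reasoning and planning, and reduce latency and hallucination risk.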
Details
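Motivation: Existing ATO methods generalize poorly and lack high-level reasoning and planning, while large multimodal models (LMMs) are computationally expensive and prone to hallucination. Reliable domain-specific benchmarks for evaluating cognitive capabilities are also missing.
Result: Experiments show that the proposed approach substantially improves performance, enhances interpretability, reduces inference latency, and strengthens cross-domain generalization, while supporting plug-and-play deployment in autonomous driving systems.
Insight: The contributions are the first VQA benchmark for the ATO domain and a transparent, efficient collaborative large-small model framework that balances efficiency and cognition through a modular architecture and adaptive sampling, offering a new approach to interpretable visual cognition in safety-critical systems.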
Abstract: Automatic Train Operation (ATO) relies on low-latency, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, reduces inference latency, and strengthens cross-domain generalization, while enabling plug-and-play deployment in autonomous driving systems. Code and datasets will be available at https://github.com/Cybereye-bjtu/RailVQA.
[75] The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models cs.CVPDF
Shivang Chopra, Shaunak Halbe, Chengyue Huan, Brisa Maneechotesuwan, Zsolt Kira
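TL;DR: This paper proposes GRACE, a unified fine-tuning framework that addresses the three-way trade-off among ID accuracy, OOD generalization, and adversarial robustness in fine-tuning vision-language models (VLMs). By jointly regularizing parameter-space curvature and feature-space invariance, GRACE improves performance in-distribution, out-of-distribution, and under adversarial attack at the same time.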
Details
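Motivation: Existing robust fine-tuning strategies resolve at most two axes of the three-way trade-off: generalization-preserving methods remain vulnerable to adversarial attacks, while adversarial training degrades ID/OOD accuracy. The paper argues that the trade-off stems from two geometric failures: sharp, anisotropic minima in parameter space and feature representations that are unstable under perturbation.
Result: When fine-tuning CLIP models on ImageNet, GRACE improves ID accuracy by 10.8% and adversarial accuracy by 13.5% while maintaining 57.0% OOD accuracy (vs. the 57.4% zero-shot baseline). Geometric analysis confirms that GRACE converges to flatter minima without feature distortion under distribution shift.
Insight: The core contribution is a geometric view (parameter-space curvature and feature-space alignment) that addresses the robustness trade-off in a unified way. Specifically, adaptive weight perturbations scaled by local curvature promote flatter minima, and a feature alignment loss across clean, adversarial, and OOD inputs keeps representations consistent, giving a principled approach to generalized robustness in foundation VLMs.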
Abstract: Fine-tuning approaches for Vision-Language Models (VLMs) face a critical three-way trade-off between In-Distribution (ID) accuracy, Out-of-Distribution (OOD) generalization, and adversarial robustness. Existing robust fine-tuning strategies resolve at most two axes of this trade-off. Generalization-preserving methods retain ID/OOD performance but leave models vulnerable to adversarial attacks, while adversarial training improves robustness to targeted attacks but degrades ID/OOD accuracy. Our key insight is that the robustness trade-off stems from two geometric failures: sharp, anisotropic minima in parameter space and unstable feature representations that deform under perturbation. To address this, we propose GRACE (Gram-aligned Robustness via Adaptive Curvature Estimation), a unified fine-tuning framework that jointly regularizes the parameter-space curvature and feature-space invariance for VLMs. Grounded in Robust PAC-Bayes theory, GRACE employs adaptive weight perturbations scaled by local curvature to promote flatter minima, combined with a feature alignment loss that maintains representation consistency across clean, adversarial, and OOD inputs. On ImageNet fine-tuning of CLIP models, GRACE simultaneously improves ID accuracy by 10.8%, and adversarial accuracy by 13.5% while maintaining 57.0% OOD accuracy (vs. 57.4% zero-shot baseline). Geometric analysis confirms that GRACE converges to flatter minima without feature distortion across distribution shifts, providing a principled step toward generalized robustness in foundation VLMs.
[76] RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation cs.CVPDF
Yiyang Zou, Tianhao Zhao, Peilun Xiao, Hongyu Jin, Longyu Qi
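TL;DR: This paper proposes RiskProp, a collision-anchored self-supervised risk propagation paradigm for early accident anticipation. It needs no anomaly-onset annotations and uses only the reliably annotated collision frame, modeling risk evolution through a future-frame regularization loss and an adaptive monotonic constraint.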
Details
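Motivation: Existing methods rely on binary supervision from manually annotated "anomaly onset" frames, which are subjective and inconsistent and lead to inaccurate risk estimation. This paper aims to remove the dependence on anomaly-onset annotations and achieve more reliable accident anticipation from collision frames alone.
Result: Experiments on the CAP and Nexar datasets show that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.
Insight: The contributions are a future-frame regularization loss, which uses the next-frame prediction as a soft target to propagate risk signals backward in time, and an adaptive monotonic constraint that encourages non-decreasing risk, together enabling self-supervised modeling of risk evolution.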
Abstract: Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated “anomaly onset” frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose RiskProp, a novel collision-anchored self-supervised risk propagation paradigm for early accident anticipation, which removes the need for anomaly onset annotations and leverages only the reliably annotated collision frame. RiskProp models temporal risk evolution through two observation-driven losses: first, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model’s next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals; second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.
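The two observation-driven losses can be illustrated on a single per-frame risk sequence. The sketch below is a simplified reading of the abstract, not the paper's formulation: it uses a squared error for the future-frame term and a fixed-margin hinge for the monotonic term, whereas the paper describes the constraint as adaptive. The function names and the `margin` parameter are hypothetical.

```python
def future_frame_loss(risk):
    """Mean squared gap between each frame's risk and the next frame's risk.

    The next-frame value acts as a soft target. In an autodiff framework it
    would be detached so gradients only update the current frame.
    """
    if len(risk) < 2:
        raise ValueError("need at least two frames")
    pairs = list(zip(risk[:-1], risk[1:]))
    return sum((current - target) ** 2 for current, target in pairs) / len(pairs)


def monotonic_penalty(risk, margin=0.0):
    """Hinge penalty on every step where risk decreases over time."""
    if len(risk) < 2:
        raise ValueError("need at least two frames")
    pairs = list(zip(risk[:-1], risk[1:]))
    return sum(max(0.0, current - nxt + margin) for current, nxt in pairs) / len(pairs)


# A risk curve with one dip (0.3 -> 0.2) before the collision frame.
curve = [0.1, 0.3, 0.2, 0.9]
total = future_frame_loss(curve) + monotonic_penalty(curve)
```

The first term pulls earlier frames toward later, better-informed predictions. The second penalizes only decreases, so a steadily rising curve incurs no monotonic cost.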
[77] MEDIC-AD: Towards Medical Vision-Language Model’s Clinical Intelligence cs.CVPDF
Woohyeon Park, Jaeik Kim, Sunghwan Steve Cho, Pa Hong, Wookyoung Jeong
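TL;DR: This paper presents MEDIC-AD, a clinically oriented medical vision-language model that strengthens lesion detection, symptom tracking, and visual explainability. A stage-wise framework introduces learnable anomaly-aware tokens, inter-image difference tokens, and a dedicated explainability stage to turn the model's broad knowledge into clinically actionable outputs.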
Details
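Motivation: Current medical vision-language models lack mechanisms that translate broad knowledge into clinically actionable outputs, so they cannot adequately support lesion detection, symptom tracking, and visual explainability, the core tasks of real-world medical image analysis.
Result: MEDIC-AD delivers steady gains on anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results against closed-source and medical-specialized baselines. Evaluation on longitudinal clinical data collected from real hospital workflows further shows that it provides stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows.
Insight: The contribution is the stage-wise design: anomaly-aware tokens strengthen lesion representation, inter-image difference tokens encode temporal change, and a dedicated stage trains the model to generate lesion heatmaps consistent with its reasoning, systematically improving clinical intelligence. This offers a reusable framework for deploying medical VLMs in clinical settings.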
Abstract: Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present MEDIC-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (
[78] Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision cs.CVPDF
Yizhou Jin, Yuezhu Feng, Jinjin Zhang, Peng Wang, Qingjie Liu
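TL;DR: This paper proposes ReAL, a method that uses the intrinsic reasoning ability of multimodal large language models (MLLMs) to perform anomaly detection, pixel-level localization, and interpretable reasoning from image-level supervision alone, with no extra vision modules or pixel-level annotations. It extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses into an anomaly heatmap, and it introduces a Consistency-Guided Reasoning Optimization (CGRO) module that aligns reasoning tokens with visual attention to improve localization accuracy and reasoning consistency.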
Details
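Motivation: Most existing MLLM-based anomaly detection methods are confined to image-level detection and textual reasoning, while pixel-level localization still depends on external vision modules and dense annotations. This paper aims to activate the intrinsic reasoning potential of MLLMs and achieve end-to-end anomaly detection and localization with image-level supervision only.
Result: Extensive experiments on four public benchmarks show that the method significantly improves anomaly detection, localization, and interpretability. Despite using only image-level supervision, it is competitive with MLLM-based methods trained under dense pixel-level supervision.
Insight: The contributions are extracting anomaly tokens directly from the MLLM's autoregressive reasoning process and using their attention for pixel-level localization, and a reinforcement-learning-driven CGRO module that aligns textual reasoning with visual attention, yielding high-quality anomaly localization and interpretable reasoning from image-level labels alone.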
Abstract: Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at https://github.com/YizhouJin313/ReADL.
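The localization idea, aggregating the attention that anomaly-related tokens pay to image patches, can be sketched as below. How anomaly tokens are selected, which attention layers and heads are used, and the min-max normalization are all assumptions made for illustration, and the function name is hypothetical. This is not the released ReAL implementation.

```python
def anomaly_map(token_attention, anomaly_token_ids, grid_h, grid_w):
    """Sum patch attention over selected tokens and reshape to a grid in [0, 1].

    token_attention[t] is a flat list of grid_h * grid_w attention weights
    from generated token t to the image patches (row-major order).
    """
    num_patches = grid_h * grid_w
    aggregated = [0.0] * num_patches
    for token_id in anomaly_token_ids:
        weights = token_attention[token_id]
        if len(weights) != num_patches:
            raise ValueError("attention row does not match the patch grid")
        for patch, weight in enumerate(weights):
            aggregated[patch] += weight
    # Min-max normalize so the map can be thresholded or visualized.
    low, high = min(aggregated), max(aggregated)
    span = high - low
    normalized = [(a - low) / span if span > 0 else 0.0 for a in aggregated]
    return [normalized[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
```

In practice the coarse patch-level map would be upsampled to image resolution before being compared with a pixel-level ground-truth mask.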
[79] Communicating about Space: Language-Mediated Spatial Integration Across Partial Views cs.CVPDF
Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, Aishwarya Agrawal
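TL;DR: This paper asks whether multimodal large language models (MLLMs) can integrate observations from different viewpoints through dialogue to build a unified spatial mental model of a shared environment. The authors introduce the COSMIC benchmark, with 899 3D indoor scenes and 1250 question-answer pairs. MLLMs do reasonably well at identifying anchor objects shared across views but perform worse on relational reasoning and on building globally consistent maps, where even frontier models are near chance. Humans (95% accuracy) far outperform the best model (72%), and human conversations gradually converge on a shared model whereas model dialogues keep exploring new possibilities, indicating a limited ability to build a robust shared mental model.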
Details
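Motivation: Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. This paper investigates whether MLLMs can do the same: align distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment.
Result: Across COSMIC's five tasks, MLLMs show a consistent capability hierarchy: they are most reliable at identifying shared anchor objects across views, worse at relational reasoning, and largely fail at building globally consistent maps (near chance). The best model, Gemini-3-Pro-Thinking, reaches 72% aggregate accuracy, well below the human 95%.
Insight: The contribution is a systematic benchmark (COSMIC) for evaluating the spatial communication and collaboration abilities of MLLMs, which exposes their core limitation in building and maintaining a robust shared mental model. The methodology and findings clarify the capability boundary of MLLMs in complex, multi-turn, goal-directed dialogue and highlight their weakness at converging on and integrating information during conversation.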
Abstract: Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance even for frontier models. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement even for the best-performing model, Gemini-3-Pro-Thinking, which achieves 72% aggregate accuracy. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic
[80] Incentivizing Temporal-Awareness in Egocentric Video Understanding Models cs.CVPDF
Zhiyang Xu, Tian Qin, Bowen Jin, Zhengfeng Lai, Meng Cao
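TL;DR: This paper proposes TGPO (Temporal Global Policy Optimization), a reinforcement learning algorithm designed to strengthen the temporal awareness of multimodal large language models (MLLMs) in egocentric video understanding. It contrasts model outputs on temporally ordered versus shuffled video frames to produce calibrated global reward signals, suppressing reliance on spatial shortcuts and improving the coherence of temporal reasoning.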
Details
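Motivation: Existing MLLMs perform well at visual understanding but often lack temporal awareness in egocentric video, where reasoning depends on the correct ordering and evolution of events. The deficiency stems in part from training objectives that do not explicitly reward temporal reasoning and instead allow frame-level spatial shortcuts.
Result: Experiments on five egocentric video benchmarks show that TGPO consistently improves temporal grounding and causal coherence, outperforming prior RL-based video reasoning approaches.
Insight: The contribution is TGPO, a reinforcement learning algorithm that derives verifiable reward signals by contrasting outputs on temporally ordered and shuffled frames, explicitly incentivizing temporal awareness. The method offers a simple, scalable route to temporally robust MLLMs for egocentric video understanding.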
Abstract: Multimodal large language models (MLLMs) have recently shown strong performance in visual understanding, yet they often lack temporal awareness, particularly in egocentric settings where reasoning depends on the correct ordering and evolution of events. This deficiency stems in part from training objectives that fail to explicitly reward temporal reasoning and instead rely on frame-level spatial shortcuts. To address this limitation, we propose Temporal Global Policy Optimization (TGPO), a reinforcement learning with verifiable rewards (RLVR) algorithm designed to incentivize temporal awareness in MLLMs. TGPO contrasts model outputs generated from temporally ordered versus shuffled video frames to derive calibrated, globally normalized reward signals that explicitly favor temporally coherent reasoning. Integrated with GRPO and GSPO, TGPO supports cold-start RL training and effectively suppresses spatial shortcut behaviors learned by existing MLLMs. Experiments across five egocentric video benchmarks demonstrate that TGPO consistently improves temporal grounding and causal coherence, outperforming prior RL-based video reasoning approaches. Our results suggest that TGPO offers a simple and scalable pathway toward temporally robust MLLMs for egocentric video understanding.
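The contrast between ordered and shuffled frames can be turned into a reward in many ways. The sketch below shows one simple possibility: score each sampled response under both frame orderings, take the gap, and normalize the gaps over the whole batch. The inputs, the gap definition, and the normalization are assumptions made for illustration. The abstract does not specify TGPO's actual reward, and the function name is hypothetical.

```python
import statistics


def temporal_rewards(ordered_scores, shuffled_scores):
    """Reward responses that are better with ordered frames than with shuffled ones.

    ordered_scores[i] and shuffled_scores[i] are verifiable correctness scores
    (for example 0.0 or 1.0) for the same question under each frame ordering.
    """
    if len(ordered_scores) != len(shuffled_scores) or not ordered_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    gaps = [o - s for o, s in zip(ordered_scores, shuffled_scores)]
    mean = statistics.fmean(gaps)
    std = statistics.pstdev(gaps)
    if std == 0:
        return [0.0] * len(gaps)  # no contrast in this batch, so no signal
    # Normalize over the whole batch so rewards are comparable across questions.
    return [(gap - mean) / std for gap in gaps]
```

A response that is equally correct with shuffled frames likely relied on frame-level spatial cues, so under this scheme it earns less than one whose correctness depends on frame order.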
[81] MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation cs.CVPDF
Xiaofeng Tan, Wanjiang Weng, Hongsong Wang, Fang Zhao, Xin Geng
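TL;DR: This paper proposes MotionRFT, a unified reinforcement fine-tuning framework for improving text-to-motion generation models. It comprises MotionReward, a multi-dimensional reward model over heterogeneous motion representations, and EasyTune, an efficient, fine-grained fine-tuning method. MotionReward maps motions in different representations into a shared, text-anchored semantic space for multi-dimensional reward learning, while EasyTune optimizes step by step to remove the bottleneck of recursive gradient dependence in the denoising process. Experiments show significant gains on several benchmarks with lower memory consumption.
Details
Motivation: Supervised pretraining of existing text-to-motion models (diffusion- or flow-based) is not enough to align them with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they target a specific motion representation (such as joints), optimize one aspect (such as text-motion alignment) at the possible expense of others, and are computationally expensive, data-dependent, and coarse-grained.
Result: Validated on multiple benchmarks: for the MLD model, FID 0.132 at 22.10 GB peak memory, saving up to 15.22 GB over DRaFT; a 22.9% FID reduction on joint-based ACMDM; and a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion.
Insight: The novelties are: 1) MotionReward, a multi-dimensional reward model over heterogeneous representations that obtains a unified semantic representation through a shared semantic space and uses self-refinement preference learning to strengthen semantics without extra annotation; 2) EasyTune, an efficient fine-tuning method that identifies the recursive gradient dependence across denoising steps as the bottleneck and optimizes step-wise rather than over the full trajectory, giving dense, fine-grained, memory-efficient updates. Viewed objectively, the ideas in reward-model design and fine-tuning efficiency are general and could carry over to post-training for other generative tasks.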
Abstract: Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints; (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantic representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multidimensional reward learning; Self-refinement Preference Learning further enhances semantics without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck, and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework, achieving FID 0.132 at 22.10 GB peak memory for the MLD model and saving up to 15.22 GB over DRaFT. It reduces FID by 22.9% on joint-based ACMDM, and achieves a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion. Our project page with code is publicly available.
[82] K$α$LOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks cs.CVPDF
David Tschirschwitz, Volker Rodehorst
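TL;DR: This paper proposes KαLOS, a meta-algorithm for evaluating inter-annotator agreement in complex vision tasks, aimed at the stagnation that label noise causes in benchmarks such as object detection. Following the "Localization First" principle, it converts the spatial correspondence problem into nominal reliability matrices and calibrates its parameters from data, which makes it applicable to bounding boxes, volumetric segmentation, pose estimation, and other tasks.
Details
Motivation: Progress on object detection benchmarks is limited by the inability to separate model improvements from label noise. Reliable ways to quantify annotation consistency are lacking, and existing statistical metrics cannot handle the instance correspondence problem in vision tasks, so the trustworthiness of evaluation data is in doubt.
Result: A controllable noise generator that simulates complex human annotation variability is used to validate the robustness of KαLOS, supporting it as a reliable standard for separating signal from noise in modern computer vision benchmarks.
Insight: The novelty lies in systematizing the "Localization First" principle into a unified meta-algorithm, statistically calibrating localization parameters to fit different tasks, and providing granular diagnostics (such as annotator vitality and collaboration clustering) beyond a single score. Viewed objectively, the data-driven configuration avoids the limits of heuristic approaches and provides a standardized framework for assessing benchmark quality.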
Abstract: Progress in object detection benchmarks is stagnating. It is limited not by architectures but by the inability to distinguish model improvements from label noise. To restore trust in benchmarking, the field requires rigorous quantification of annotation consistency to ensure the reliability of evaluation data. However, standard statistical metrics fail to handle the instance correspondence problem inherent to vision tasks. Furthermore, validating new agreement metrics remains circular because no objective ground truth for agreement exists, which forces reliance on unverifiable heuristics. We propose K$α$LOS (KALOS), a unified meta-algorithm that generalizes the “Localization First” principle to standardize dataset quality evaluation. By resolving spatial correspondence before assessing agreement, our framework transforms complex spatio-categorical problems into nominal reliability matrices. Unlike prior heuristic implementations, K$α$LOS employs a principled, data-driven configuration; by statistically calibrating the localization parameters to the inherent agreement distribution, it generalizes to diverse tasks ranging from bounding boxes to volumetric segmentation or pose estimation. This standardization enables granular diagnostics beyond a single score, including annotator vitality, collaboration clustering, and localization sensitivity. To validate this approach, we introduce a novel and empirically derived noise generator. Where prior validations relied on uniform error assumptions, our controllable testbed models complex and non-isotropic human variability. This provides evidence of the metric’s properties and establishes K$α$LOS as a robust standard for distinguishing signal from noise in modern computer vision benchmarks.
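Once spatial correspondence has been resolved, agreement on a nominal reliability matrix can be scored with the standard nominal Krippendorff's alpha. The sketch below shows only that final step and assumes the matching has already been done, with each row holding the labels that different annotators gave to one matched instance; the function name and the use of `None` for a missing annotation are illustrative choices, and the localization calibration that K$α$LOS performs is not shown.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units -- list of rows; each row lists the category labels assigned to one
             matched instance by the annotators (None = no annotation).
    """
    # Coincidence matrix: every ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a single label cannot be paired
        for a, b in permutations(range(m), 2):
            coincidences[(values[a], values[b])] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable labels")

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance.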
[83] Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models cs.CVPDF
Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
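TL;DR: This paper studies hallucination in multimodal chain-of-thought (MCoT) models, finds that its distinctive cause is divergent thinking in associative reasoning steps, and proposes a simple, effective strategy that localizes these steps and intervenes in them to mitigate hallucination. Experiments show the method clearly outperforms existing methods and can be combined with other mitigation methods for further gains.
Details
Motivation: MCoT models perform well on complex visual reasoning tasks but hallucinate severely. Visual attention decay, the usual explanation, is already well studied in large vision-language models (LVLMs); because MCoT reasoning differs fundamentally from that of traditional LVLMs, the paper investigates whether MCoT hallucination has causes of its own and how to mitigate them.
Result: In extensive experiments, the method outperforms existing methods by a large margin and integrates easily with other hallucination mitigation methods, further improving their performance.
Insight: The novelty is identifying a pattern specific to MCoT hallucination, namely that it arises mainly in associative reasoning steps (divergent thinking), and designing a strategy that localizes those steps and intervenes in decoding. Viewed objectively, attributing hallucination to a specific type of thinking within the reasoning process and targeting the intervention accordingly is an effective and transferable idea.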
Abstract: Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucination? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can be conveniently integrated with other hallucination mitigation methods and further boost their performance. The code is publicly available at https://github.com/ASGO-MM/MCoT-hallucination.
[84] Make It Up: Fake Images, Real Gains in Generalized Few-shot Semantic Segmentation cs.CVPDF
Guohuan Xie, Xin He, Dingying Fan, Le Zhang, Ming-Ming Cheng
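TL;DR: This paper proposes Syn4Seg, a framework that synthesizes novel-class images with diffusion models and improves pseudo-label quality, addressing the insufficient coverage of novel-class appearances that scarce annotation causes in generalized few-shot semantic segmentation (GFSS). It first builds a deduplicated prompt bank to generate diverse, class-consistent synthetic images, then estimates support-guided pseudo-labels through a two-stage refinement, and finally refines boundary regions with a constrained SAM-based update.
Details
Motivation: GFSS is limited by insufficient coverage of novel-class appearances under scarce annotation, and images synthesized by existing diffusion models often bring noisy supervision because of incomplete coverage and unreliable masks, which holds back practical gains.
Result: Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ show consistent improvements in both 1-shot and 5-shot settings, supporting synthetic data as a scalable path to reliable masks and precise boundaries.
Insight: The novelties include an embedding-deduplicated prompt bank that maximizes prompt-space coverage; two-stage pseudo-label refinement with image-adaptive prototypes that combine global (support set) and local (image) statistics; and a constrained SAM update that improves contour fidelity without overwriting high-confidence interiors.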
Abstract: Generalized few-shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel-class appearances under scarce annotations. While diffusion models can synthesize novel-class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation-enhanced GFSS framework designed to expand novel-class coverage while improving pseudo-label quality. Syn4Seg first maximizes prompt-space coverage by constructing an embedding-deduplicated prompt bank for each novel class, yielding diverse yet class-consistent synthetic images. It then performs support-guided pseudo-label estimation via a two-stage refinement that i) filters low-consistency regions to obtain high-precision seeds and ii) relabels uncertain pixels with image-adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary-band and unlabeled pixels using a constrained SAM-based update to improve contour fidelity without overwriting high-confidence interiors. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate consistent improvements in both 1-shot and 5-shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
[85] LightMover: Generative Light Movement with Color and Intensity Controls cs.CV | cs.CL | cs.GR | cs.LGPDF
Gengze Zhou, Tianyu Wang, Soo Ye Kim, Zhixin Shu, Xin Yu
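TL;DR: LightMover is a controllable light-editing framework that uses video diffusion priors to produce physically plausible illumination changes in a single image without re-rendering the scene. It casts light editing as sequence-to-sequence prediction in visual token space: given an input image and light-control tokens, the model adjusts the light's position, color, and intensity together with the resulting reflections, shadows, and falloff.
Details
Motivation: To address the challenge of controllable, physically plausible light editing (light position, color, and intensity) in a single image while avoiding a complex re-rendering process.
Result: An adaptive token-pruning mechanism shortens the control sequence by 41% while preserving editing fidelity. The method achieves strong PSNR and semantic consistency (DINO, CLIP) and enables precise, independent control over light position, color, and intensity.
Insight: The novelty lies in unifying light editing as a sequence prediction problem in visual token space, introducing adaptive token pruning to encode spatial and non-spatial attributes efficiently, and training with video diffusion priors and a scalable rendering pipeline.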
Abstract: We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.
[86] HD-VGGT: High-Resolution Visual Geometry Transformer cs.CVPDF
Tianrun Chen, Yuanqi Hu, Yidong Han, Hanjie Xu, Deyi Ji
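TL;DR: This paper proposes HD-VGGT, a dual-branch Transformer architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts coarse, globally consistent geometry; a high-resolution branch refines details through a learned feature upsampling module; and a feature modulation mechanism suppresses unstable features, improving reconstruction quality while keeping compute cost under control.
Details
Motivation: Existing feed-forward models such as VGGT incur prohibitive compute and memory costs on high-resolution multi-view images because the number of Transformer tokens explodes. In addition, visually ambiguous regions at high resolution (such as repetitive patterns and weak textures) produce unstable feature tokens that degrade geometric inference.
Result: HD-VGGT achieves state-of-the-art (SOTA) reconstruction quality by exploiting high-resolution images and supervision without paying full-resolution Transformer costs.
Insight: The main novelties are the dual-branch design, which separates coarse global geometry from high-resolution detail refinement, and the feature modulation mechanism, which suppresses unreliable features early in the Transformer. Together they offer a reusable idea for handling high-resolution vision tasks efficiently within a Transformer framework.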
Abstract: High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
[87] EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams cs.CV | cs.AIPDF
JaeSeong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee
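TL;DR: This paper introduces EuraGovExam, a multilingual multimodal benchmark drawn from real civil service examinations in five Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. It contains over 8,000 high-resolution scanned multiple-choice questions across 17 academic and administrative domains. All question content (problem statement, answer choices, and visual elements) is embedded in a single image, with only a minimal standardized instruction for answer formatting, so the benchmark tests layout-aware, cross-lingual reasoning directly from visual input.
Details
Motivation: Existing benchmarks fall short of the real complexity, visual layout, and cultural diversity of public-sector assessments. The goal is a more realistic and difficult multilingual multimodal benchmark that diagnoses the limitations of current vision-language models.
Result: Even state-of-the-art vision-language models (VLMs) reach only 86% accuracy on the benchmark, which underlines its difficulty and its power to diagnose current model limitations.
Insight: The novelty lies in building the dataset from real exam documents, preserving rich visual structure such as tables, multilingual typography, and form-like layouts. By emphasizing cultural realism, visual complexity, and linguistic diversity, it sets a new standard for high-stakes, multilingual, image-grounded VLM evaluation and supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.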
Abstract: We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content (including problem statements, answer choices, and visual elements) within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark’s difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.
[88] Inference-Time Structural Reasoning for Compositional Vision-Language Understanding cs.CV | cs.CLPDF
Amartya Bhattacharya
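TL;DR: This paper proposes a unified framework for evaluating and enhancing the compositional reasoning of vision-language models (VLMs). On the Winoground benchmark it evaluates four architecturally different VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) and augments them with structured relational priors derived from scene graphs. Qwen3-VL-8B-Thinking performs best, and the proposed multi-turn scene-graph filtering strategy lifts it further, beyond the previous open-source SOTA.
Details
Motivation: VLMs perform well on image-text retrieval but persistently fall short on compositional reasoning, that is, distinguishing captions that use the same words in a different relational structure. The paper aims to evaluate and improve VLMs on this ability.
Result: On Winoground, Qwen3-VL-8B-Thinking reaches a group score of 62.75, far above all encoder-based models. The proposed multi-turn scene-graph filtering strategy raises this to 66.0, surpassing the previous open-source state of the art. The analysis shows that scene-graph augmentation mainly helps models that are already capable.
Insight: The novelty is a unified evaluation and augmentation framework that extracts text scene graphs (subject-relation-object triples) with a dependency parser and injects structured relational priors through a graph asymmetry scorer. Viewed objectively, the inference-time structural reasoning approach, and the multi-turn scene-graph filtering strategy in particular, offers an effective and interpretable way to improve compositional understanding in VLMs.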
Abstract: Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and augmentation framework that benchmarks four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph-augmented regimes. We introduce a dependency-based TextSceneGraphParser (spaCy) extracting subject-relation-object triples, and a Graph Asymmetry Scorer using optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject-object masking and swapping) reveal that Qwen3-VL-8B-Thinking achieves a group score of 62.75, far above all encoder-based models, while a proposed multi-turn scene-graph (SG) filtering strategy further lifts it to 66.0, surpassing prior open-source state-of-the-art. We analyze the capability-augmentation tradeoff and find that SG augmentation benefits already capable models while providing negligible or negative gains for weaker baselines. Code: https://github.com/amartyacodes/Inference-Time-Structural-Reasoning-for-Compositional-Vision-Language-Understanding
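For readers unfamiliar with the group score reported above, the sketch below computes the three standard Winoground metrics from caption-image match scores. Each example has two captions (c0, c1) and two images (i0, i1), where c0 belongs with i0 and c1 with i1. The dictionary keys and function name are illustrative assumptions and are not taken from the linked repository.

```python
def winoground_scores(examples):
    """Return text, image, and group scores as percentages.

    examples -- list of dicts with match scores under the keys
                'c0_i0', 'c0_i1', 'c1_i0', 'c1_i1' (caption x image).
    """
    if not examples:
        raise ValueError("need at least one example")
    text = image = group = 0
    for s in examples:
        # Text score: for each image, the correct caption must score higher.
        text_ok = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
        # Image score: for each caption, the correct image must score higher.
        image_ok = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
        text += text_ok
        image += image_ok
        # Group score: both conditions must hold on the same example.
        group += text_ok and image_ok
    n = len(examples)
    return {
        "text": 100.0 * text / n,
        "image": 100.0 * image / n,
        "group": 100.0 * group / n,
    }
```

Because the group score requires both conditions on the same example, it can never exceed the text or image score.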
[89] An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving cs.CVPDF
Yi Feng, Junwu E, Zizhan Guo, Yu Ma, Hanli Wang
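TL;DR: This paper proposes an instance-centric panoptic occupancy prediction benchmark for autonomous driving. By building ADMesh, a library of high-quality 3D meshes, and CarlaOcc, a large-scale physically consistent dataset generated with the CARLA simulator, it addresses the lack of high-quality 3D assets, instance-level annotation, and physical consistency in existing benchmarks, and it establishes standardized evaluation metrics and baselines for representative models.
Details
Motivation: Panoptic occupancy prediction is held back by the lack of high-quality 3D mesh resources, instance-level annotations, and physically consistent datasets. Removing these limits should advance models capable of precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding.
Result: The work delivers the ADMesh library with over 15K high-quality 3D models and the CarlaOcc dataset with over 100K frames at voxel resolutions as fine as 0.05 m, together with standardized evaluation metrics and baselines for representative models, giving a unified platform for fair comparison and reproducible research.
Insight: The novelty is the first unified high-quality 3D mesh library for autonomous driving and a physically consistent, large-scale, instance-level panoptic occupancy dataset, together with a standardized evaluation protocol, providing a key data foundation and evaluation framework for 3D panoptic perception research.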
Abstract: Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic annotations. Building upon ADMesh, we further construct CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset generated using the CARLA simulator. This dataset contains over 100K frames with fine-grained, instance-level occupancy ground truth at voxel resolutions as fine as 0.05 m. Furthermore, standardized evaluation metrics are introduced to quantify the quality of existing occupancy datasets. Finally, a systematic benchmark of representative models is established on the proposed dataset, which provides a unified platform for fair comparison and reproducible research in the field of 3D panoptic perception. Code and dataset are available at https://mias.group/CarlaOcc.
[90] Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection cs.CV | cs.AIPDF
Jinhu Fu, Yihang Lou, Qingyi Si, Shudong Zhang, Yan Bai
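TL;DR: This paper proposes CARE, a framework for diagnosing and repairing unsafe channels in large vision-language models (LVLMs). It first uses causal mediation analysis to identify the neurons and layers responsible for unsafe behavior, then introduces a dual-modal safety subspace projection method that learns generalized safety subspaces for the visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. At inference, a hybrid fusion mechanism dynamically projects activations toward these subspaces, adaptively balancing visual and textual corrections to suppress unsafe features while preserving semantic fidelity.
Details
Motivation: LVLMs perform well on multimodal understanding and reasoning, but their internal safety mechanisms are opaque and hard to control, which leaves a risk of unsafe behavior. A systematic method for diagnosis and repair is needed to make them safer.
Result: Extensive experiments on multiple safety benchmarks show that the causal-subspace repair framework significantly improves safety robustness without degrading general multimodal capability, outperforms prior activation steering and alignment baselines, and transfers well, defending against unseen attacks.
Insight: The novelty is combining causal discovery with dual-modal safety subspace projection: causal analysis pinpoints unsafe channels, generalized eigen-decomposition learns generalized safety representations for both modalities, and an adaptive hybrid fusion mechanism applies dynamic correction. This improves safety without harming core capabilities and offers a new angle on the interpretability and controllability of model safety.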
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs (CARE). We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity. Extensive experiments on multiple safety benchmarks demonstrate that our causal-subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Additionally, our method exhibits good transferability, defending against unseen attacks.
[91] SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track cs.CVPDF
Dengxian Gong, Quanzhu Niu, Shihao Chen, Yuanzheng Wu, Yikang Zhou
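TL;DR: This paper proposes SaSaSaSa2VA for referring video object segmentation on the MeViS benchmark. Building on SaSaSa2VA, it adds a target existence-aware verification mechanism that handles motion-centric text expressions and no-target queries, and it took 2nd place in the MeViS-Text track of the 5th PVUW Challenge.
Details
Motivation: Existing referring video object segmentation methods usually ground targets from static textual descriptions, whereas MeViS introduces motion-centric expressions (referring and reasoning about motion) and no-target queries, which require stronger temporal understanding and the ability to judge whether a target exists.
Result: In the MeViS-Text track of the 5th PVUW Challenge, the method achieves a final score of 89.19 and takes 2nd place. Ablations show that the existence-aware verification strategy effectively improves performance on motion-centric tasks.
Insight: The main novelty is a simple yet effective target existence-aware verification mechanism that helps the model decide more accurately whether a target exists before segmenting when it handles motion descriptions and no-target queries, giving strong performance on complex temporal tasks.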
Abstract: Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. The MeViS benchmark extends this by incorporating motion-centric expressions (referring & reasoning motion expressions) and introducing no-target queries. Extending SaSaSa2VA, where increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, leading to Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy is sufficient to unlock strong performance on motion-centric referring tasks.
[92] LongCat-Next: Lexicalizing Modalities as Discrete Tokens cs.CV | cs.CLPDF
Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang
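TL;DR: The paper proposes Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information in a shared discrete space so that autoregressive modeling is consistent across modalities. On this basis it develops LongCat-Next, a native multimodal model that handles text, vision, and audio under a single autoregressive objective and performs strongly across a wide range of multimodal benchmarks.
Details
Motivation: Current language-centric multimodal systems usually treat non-linguistic modalities as external attachments, which leads to fragmented architectures and suboptimal integration. To move past this limitation, the paper aims for a unified and principled framework for autoregressive modeling across modalities.
Result: As an industrial-strength foundation model, LongCat-Next achieves strong performance across a wide range of multimodal benchmarks. In particular, it addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and reconciles the conflict between understanding and generation.
Insight: The core novelty is the DiNA framework and its key component dNaViT, which performs hierarchical discretization of visual signals at arbitrary resolution. This offers a new approach to building unified, efficient native multimodal models and may stimulate further research in the area.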
Abstract: The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
[93] Zero-shot Vision-Language Reranking for Cross-View Geolocalization cs.CV | cs.AIPDF
Yunus Talha Erzurumlu, John E. Anderson, William J. Shuart, Charles Toth, Alper Yilmaz
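TL;DR: This paper proposes a two-stage framework that uses zero-shot vision-language models (VLMs) as rerankers to address low Top-1 accuracy in cross-view geolocalization (CVGL). On the VIGOR dataset it compares pointwise (independent scoring) and pairwise (relative comparison) reranking and finds that pointwise methods cause catastrophic performance drops, while a pairwise comparison strategy using LLaVA improves Top-1 accuracy.
Details
Motivation: CVGL systems are effective at retrieving a list of relevant candidates (high Recall@k) but weak at identifying the single best match (low Top-1 accuracy). The paper uses zero-shot VLMs as rerankers to close this gap.
Result: Experiments on VIGOR show that every pointwise reranking method causes either a catastrophic drop in performance or no change, whereas pairwise comparison with LLaVA beats the strong retrieval baseline on Top-1 accuracy.
Insight: The novelty is a systematic evaluation of two zero-shot VLM reranking strategies for CVGL. It shows that these models are poorly suited to absolute relevance scoring but good at fine-grained relative visual judgment, which makes pairwise reranking a promising direction for improving CVGL precision.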
Abstract: Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods cause a catastrophic drop in performance or no change at all. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that these VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.
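The pairwise strategy can be sketched as a round-robin tournament over the retrieved candidates. This is a minimal illustration and not the paper's pipeline: `prefers_first` is a hypothetical callback standing in for a VLM query that returns True when the first candidate is judged the better match, and ranking by win count with ties broken by the original retrieval order is an assumed aggregation rule.

```python
from itertools import combinations


def pairwise_rerank(query, candidates, prefers_first):
    """Rerank retrieved candidates by the number of pairwise comparisons won.

    query         -- the query item (for example, a ground-level image)
    candidates    -- retrieval results in their original order
    prefers_first -- hypothetical comparator f(query, a, b) -> bool
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefers_first(query, candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # More wins first; ties keep the original retrieval order.
    order = sorted(range(len(candidates)), key=lambda i: (-wins[i], i))
    return [candidates[i] for i in order]
```

A full round-robin needs k(k-1)/2 comparisons for k candidates, so in practice it is only affordable on a short list from the first-stage retriever.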
[94] Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark cs.CVPDF
Seng Nam Chen, Hao Chen, Chenglam Ho, Xinyu Mao, Jinping Wang
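TL;DR: This paper introduces SceneBench, a new benchmark that evaluates how well vision-language models handle scene-level context in long video understanding. Existing models drop sharply on scene-level questions, which indicates forgetting of long-range context. The paper therefore proposes Scene-RAG, which builds a dynamic scene memory by retrieving and integrating relevant context across scenes and improves model performance.
Details
Motivation: Existing long-video benchmarks focus on fine-grained perception or coarse summarization and offer little insight into temporal understanding over long contexts. The paper asks whether current vision-language models can reason effectively over long, scene-level contexts, and in doing so exposes their forgetting problem.
Result: On SceneBench, vision-language models show a marked drop in accuracy on scene-level questions. The proposed Scene-RAG improves performance by +2.50%, confirming that current models still struggle to retain long context.
Insight: The novelties are: 1) defining a scene, in line with human perception, as a coherent video segment whose visual and semantic contexts stay consistent; 2) SceneBench, a benchmark focused on scene-level challenges; 3) Scene-RAG, which mitigates forgetting through a dynamic scene memory. Viewed objectively, the study highlights the importance of scene coherence in long video understanding and points to a new direction for evaluating and improving long-context ability.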
Abstract: Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. This Scene-RAG improves VLM performance by +2.50%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
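The retrieval step of a scene-level RAG pipeline can be illustrated as a top-k similarity search over per-scene embeddings. The sketch below is a generic example under stated assumptions and not the Scene-RAG implementation: the embeddings are assumed to come from some external encoder, cosine similarity is an assumed relevance measure, and returning the selected scenes in temporal order is an illustrative design choice.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_scenes(question_vec, scene_vecs, k=2):
    """Return indices of the k scenes most similar to the question.

    Indices are returned in temporal (ascending) order so the retrieved
    scenes can be passed to the model in the order they occur in the video.
    """
    ranked = sorted(
        range(len(scene_vecs)),
        key=lambda i: cosine(question_vec, scene_vecs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])
```

The selected scene indices would then be used to assemble the context passed to the VLM alongside the question.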
[95] TrendGen: An Outfit Recommendation and Display System cs.CVPDF
Theodoros Koukopoulos, Dimos Klimenof, Ioannis Xarchakos
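TL;DR: TrendGen is a fashion AI system built on computer vision and generative AI that aims to improve online shopping through intelligent outfit recommendation and high-quality lay-down image generation. It uses garment images and product attributes to generate trend-aligned, cohesive outfit suggestions, and it handles lighting, angle, background, and occlusion problems in raw images to produce clean lay-down views.
Details
Motivation: Inconsistent lighting, non-ideal angles, complex backgrounds, and occlusions in raw garment images keep computer vision from reaching its full potential in the fashion industry. The goal is a robust fashion AI system suitable for real-world use.
Result: Evaluation on production data shows that TrendGen consistently produces high-quality outfits and lay-down images, marking a significant advance in AI-driven solutions for fashion retail.
Insight: The novelty is combining outfit recommendation with generative image enhancement: generating trend-aligned, cohesive outfits and converting raw images into high-quality lay-down views makes garment presentation in online shopping clearer and more structured, improving the system's practicality and user experience.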
Abstract: Recent advances in Computer Vision have significantly improved image understanding and generation, revolutionizing the fashion industry. However, challenges such as inconsistent lighting, non-ideal garment angles, complex backgrounds, and occlusions in raw images hinder their full potential. Overcoming these obstacles is crucial for developing robust fashion AI systems capable of real-world applications. In this paper, we introduce TrendGen, a Fashion AI system designed to enhance online shopping with intelligent outfit recommendations. Deployed on a major e-commerce platform, TrendGen leverages cloth images and product attributes to generate trend-aligned, cohesive outfit suggestions. Additionally, it employs Generative AI to transform raw images into high-quality lay-down views, offering a clear and structured presentation of garments. Our evaluation on production data demonstrates TrendGen’s consistent high-quality outfits and lay-down images, marking a significant advancement in AI-driven solutions for fashion retail.
[96] TrackMAE: Video Representation Learning via Track Mask and Predict cs.CVPDF
Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, Bernard Ghanem
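TL;DR: TrackMAE is a self-supervised pretraining method that learns video representations by masking and predicting tracks. It uses an off-the-shelf point tracker to generate motion trajectories, then uses them both to improve the masking strategy and to provide motion targets as a supervision signal, learning more discriminative and generalizable video representations across multiple downstream tasks.
Details
Motivation: Existing masked video modeling methods encode motion only implicitly, which limits how well they capture temporal dynamics and leads to weak performance on tasks that need fine-grained motion awareness.
Result: Evaluated on six datasets across different downstream tasks, TrackMAE outperforms state-of-the-art video self-supervised learning baselines.
Insight: The novelty is using motion information explicitly (trajectories from a point tracker) as a reconstruction target and designing a motion-aware masking strategy. This provides a supervision signal complementary to the pixel and feature-semantic reconstruction spaces and encodes temporal dynamics in video more effectively.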
Abstract: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines, learning more discriminative and generalizable representations. Code available at https://github.com/rvandeghen/TrackMAE
[97] CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models cs.CV | cs.AI | cs.CLPDF
Kesheng Chen, Yamin Hu, Qi Zhou, Zhenqian Zhu, Wenjian Luo
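TL;DR: This paper introduces CDH-Bench, a new benchmark for evaluating the visual fidelity of vision-language models when visual evidence conflicts with commonsense. It reveals "commonsense-driven hallucination", in which models lean on commonsense and ignore the visual evidence.
Details
Motivation: Current vision-language models do well on many benchmarks, but a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, does the model follow what it sees or what commonsense suggests?
Result: Frontier VLMs are evaluated on CDH-Bench, which covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. Even strong models remain vulnerable to prior-driven normalization under visual-commonsense conflict. Reported metrics include Counterfactual Accuracy, Commonsense Accuracy, Counterfactual Accuracy Drop, Commonsense Collapse Rate, and Relative Prior Dependency.
Insight: The novelty is explicitly defining and systematically evaluating the "commonsense-driven hallucination" failure mode and building a controlled benchmark dedicated to diagnosing visual fidelity under visual-commonsense conflict, which offers a new perspective on model reliability.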
Abstract: Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence–commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary Question Answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc), Commonsense Accuracy (CS-Acc), Counterfactual Accuracy Drop (CFAD), Commonsense Collapse Rate (CCR), and Relative Prior Dependency (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence–commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence–commonsense conflict.
[98] Class-Distribution Guided Active Learning for 3D Occupancy Prediction in Autonomous Driving cs.CVPDF
Wonjune Kim, In-Jae Lee, Sihwan Hwang, Sanmin Kim, Dongsuk Kum
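TL;DR: This paper proposes a class-distribution guided active learning framework for 3D occupancy prediction in autonomous driving, targeting the high cost of voxel-level annotation and class imbalance. The method selects samples to annotate by combining three complementary criteria: inter-sample diversity, intra-set diversity, and frequency-weighted uncertainty. Using only 42.4% of the labeled data, it reaches performance comparable to full supervision on Occ3D-nuScenes, and its generality is validated on SemanticKITTI.
Details
Motivation: The voxel representation in 3D occupancy prediction causes severe class imbalance (safety-critical objects occupy very few voxels), voxel-level annotation is expensive, and existing approaches spend annotation effort inefficiently on dominant classes.
Result: On Occ3D-nuScenes, the framework reaches 26.62 mIoU with only 42.4% of the labeled data, comparable to full supervision and better than active learning baselines at the same budget. Experiments on SemanticKITTI with a different architecture also confirm that the method generalizes.
Insight: Contributions include: 1) a multi-criteria active learning framework that combines class-distribution difference, redundancy avoidance, and rare-class weighting; 2) a geographically disjoint data split that reduces train-validation overlap and mitigates map memorization; 3) reweighting voxel entropy by inverse per-sample class proportions to emphasize rare classes.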
Abstract: 3D occupancy prediction provides dense spatial understanding critical for safe autonomous driving. However, this task suffers from a severe class imbalance due to its volumetric representation, where safety-critical objects (bicycles, traffic cones, pedestrians) occupy minimal voxels compared to dominant backgrounds. Additionally, voxel-level annotation is costly, yet dedicating effort to dominant classes is inefficient. To address these challenges, we propose a class-distribution guided active learning framework for selecting training samples to annotate in autonomous driving datasets. Our approach combines three complementary criteria to select the training samples. Inter-sample diversity prioritizes samples whose predicted class distributions differ from those of the labeled set, intra-set diversity prevents redundant sampling within each acquisition cycle, and frequency-weighted uncertainty emphasizes rare classes by reweighting voxel-level entropy with inverse per-sample class proportions. We ensure evaluation validity by using a geographically disjoint train/validation split of Occ3D-nuScenes, which reduces train-validation overlap and mitigates potential map memorization. With only 42.4% labeled data, our framework reaches 26.62 mIoU, comparable to full supervision and outperforming active learning baselines at the same budget. We further validate generality on SemanticKITTI using a different architecture, demonstrating consistent effectiveness across datasets.
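The frequency-weighted uncertainty criterion can be sketched as follows. This is a minimal illustration, assuming each voxel is assigned to its argmax class and its entropy is divided by that class's share of voxels in the sample; the exact weighting used in the paper is not given in the abstract, and `frequency_weighted_uncertainty` is a hypothetical name.

```python
import math
from collections import Counter

def frequency_weighted_uncertainty(voxel_probs):
    """Score one sample by class-reweighted voxel entropy.

    voxel_probs: list of per-voxel class-probability lists (each sums to 1).
    Assumption: a voxel's class is the argmax of its predicted distribution.
    """
    preds = [max(range(len(p)), key=p.__getitem__) for p in voxel_probs]
    counts = Counter(preds)
    total = len(preds)
    score = 0.0
    for p, c in zip(voxel_probs, preds):
        entropy = -sum(q * math.log(q) for q in p if q > 0.0)
        proportion = counts[c] / total   # share of this class within the sample
        score += entropy / proportion    # inverse-proportion weighting
    # After normalisation each predicted class contributes its mean entropy,
    # so a few uncertain rare-class voxels are not drowned out by background.
    return score / total
```

With three confident background voxels and one uncertain rare-class voxel, the score equals the entropy of the rare voxel, whereas a plain mean entropy would be four times smaller.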
[99] Complet4R: Geometric Complete 4D Reconstruction cs.CVPDF
Weibang Wang, Kenan Li, Zhuoguang Chen, Yijun Yuan, Hang Zhao
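TL;DR: This paper proposes Complet4R, an end-to-end framework for geometric complete 4D reconstruction that aims to recover temporally coherent and geometrically complete reconstructions from videos of dynamic scenes. It formalizes the task as a unified reconstruction-and-completion framework and processes all context globally to produce complete geometry for every frame, including occluded regions.
Details
Motivation: In 4D reconstruction of dynamic scenes, existing methods rely on pairwise reconstruction or local motion estimation and struggle to recover temporal coherence and the complete geometry of occluded regions.
Result: The method achieves state-of-the-art performance on the authors' proposed geometric complete 4D reconstruction benchmark and on the 3D point tracking task.
Insight: The novelty lies in unifying 4D reconstruction as reconstruction plus completion, and in using a decoder-only Transformer that processes sequential video input globally, accumulating full context directly onto each timestamp to recover occluded regions.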
Abstract: We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstructions of dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R uses a decoder-only transformer to operate on all context globally, directly from sequential video input, reconstructing a complete geometry for every single timestamp, including occluded regions visible in other frames. Our method demonstrates state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and on the 3D Point Tracking task. Code will be released to support future research.
[100] Dual-Path Learning based on Frequency Structural Decoupling and Regional-Aware Fusion for Low-Light Image Super-Resolution cs.CVPDF
Ji-Xuan He, Jia-Cheng Zhao, Feng-Qi Cui, Jinyang Huang, Yang Liu
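TL;DR: This paper proposes DTP (Decoupling then Perceive), a new frequency-aware framework for low-light image super-resolution (LLISR). A frequency-aware structural decoupling mechanism separates the image into low-frequency luminance and high-frequency texture subspaces; a semantics-specific dual-path representation learning strategy then enhances and reconstructs each component; finally, a cross-frequency semantic recomposition module integrates the decoupled representations. The design targets the artifact amplification, texture suppression, and structural degradation caused by existing serial methods.
Details
Motivation: Existing methods handle low-light image super-resolution serially, which tends to cause artifact amplification, texture suppression, and structural degradation. This paper explicitly decouples the luminance and texture components of the image to enable more specialized modeling and coherent reconstruction.
Result: On the most widely used LLISR benchmarks, DTP improves PSNR by 1.6% and SSIM by 9.6% and reduces LPIPS by 48% relative to the state-of-the-art (SOTA) algorithm.
Insight: The novelty lies in the frequency-aware structural decoupling mechanism and the semantics-specific dual-path representation learning strategy, which separate luminance and texture as semantically independent components and process each in a targeted way. Cross-frequency semantic recomposition then ensures structural consistency and perceptual alignment, offering a new decouple-then-fuse paradigm for joint-task processing.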
Abstract: Low-light image super-resolution (LLISR) is essential for restoring fine visual details and perceptual quality under insufficient illumination conditions with ubiquitous low-resolution devices. Although pioneering methods achieve high performance on single tasks, they solve both tasks in a serial manner, which inevitably leads to artifact amplification, texture suppression, and structural degradation. To address this, we propose Decoupling then Perceive (DTP), a novel frequency-aware framework that explicitly separates luminance and texture into semantically independent components, enabling specialized modeling and coherent reconstruction. Specifically, to adaptively separate the input into low-frequency luminance and high-frequency texture subspaces, we propose a Frequency-aware Structural Decoupling (FSD) mechanism, which lays a solid foundation for targeted representation learning and reconstruction. Based on the decoupled representation, a Semantics-specific Dual-path Representation (SDR) learning strategy that performs targeted enhancement and reconstruction for each frequency component is further designed, facilitating robust luminance adjustment and fine-grained texture recovery. To promote structural consistency and perceptual alignment in the reconstructed output, building upon this dual-path modeling, we further introduce a Cross-frequency Semantic Recomposition (CSR) module that selectively integrates the decoupled representations. Extensive experiments on the most widely used LLISR benchmarks demonstrate the superiority of our DTP framework, with +1.6% PSNR, +9.6% SSIM, and -48% LPIPS compared to the leading state-of-the-art (SOTA) algorithm. Code is released at https://github.com/JXVision/DTP.
[101] Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models cs.CVPDF
Kaishen Wang, Heng Huang
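TL;DR: This paper studies the safety risks created by the tight coupling of generation and understanding in unified multimodal models. It proposes the RICE attack framework, which exploits bidirectional interactions to expose cross-functionality attack pathways, and shows that unsafe signals propagating across modalities amplify safety risks.
Details
Motivation: Existing safety research usually analyzes understanding and generation in isolation, yet their tight coupling in unified multimodal models may be a structural source of vulnerability. This paper investigates whether that reciprocity itself constitutes a safety risk.
Result: Experiments show high attack success rates under the proposed RICE framework in both the generation-to-understanding and understanding-to-generation directions, revealing previously overlooked safety weaknesses inherent to unified multimodal models.
Insight: The novelty lies in being the first to systematically treat cross-functionality reciprocity as an attack surface and in proposing an attack paradigm that exploits bidirectional interactions. Viewed objectively, the study shows that safety evaluation of multimodal models must account for the new risk dimension introduced by functional coupling.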
Abstract: Recent advances in Large Language Models (LLMs) and Text-to-Image (T2I) models have led to the emergence of Unified Multimodal Models (UMMs), where multimodal understanding and image generation are tightly integrated within a shared architecture. Prior studies suggest that such reciprocity enhances cross-functionality performance through shared representations and joint optimization. However, the safety implications of this tight coupling remain largely unexplored, as existing safety research predominantly analyzes understanding and generation functionalities in isolation. In this work, we investigate whether cross-functionality reciprocity itself constitutes a structural source of vulnerability in UMMs. We propose RICE: Reciprocal Interaction-based Cross-functionality Exploitation, a novel attack paradigm that explicitly exploits bidirectional interactions between understanding and generation. Using this framework, we systematically evaluate Generation-to-Understanding (G-U) and Understanding-to-Generation (U-G) attack pathways, demonstrating that unsafe intermediate signals can propagate across modalities and amplify safety risks. Extensive experiments show high Attack Success Rates (ASR) in both directions, revealing previously overlooked safety weaknesses inherent to UMMs.
[102] EVA: Bridging Performance and Human Alignment in Hard-Attention Vision Models for Image Classification cs.CVPDF
Pengcheng Pan, Yonekura Shogo, Kuniyoshi Yasuo
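TL;DR: This paper proposes EVA, a neuroscience-inspired hard-attention testbed that balances the performance of image classification models against alignment with human scanpaths. EVA samples a small number of sequential glimpses and combines a CNN-based feature extractor with variance control and adaptive gating. Trained without gaze supervision, it offers an adjustable trade-off between classification accuracy and similarity to human scanpaths.
Details
Motivation: Optimizing vision models for classification accuracy alone can incur an alignment tax, reducing similarity to human scanpaths and limiting interpretability. A method that makes the trade-off between performance and human alignment explicit is therefore needed.
Result: On CIFAR-10, EVA improves scanpath alignment (for example under the DTW and NSS metrics) while keeping competitive classification accuracy. Scalability is validated on ImageNet-100, and on COCO-Search18 EVA generates human-like scanpaths without additional training or gaze supervision.
Insight: Contributions include: framing performance and human alignment as an adjustable trade-off; introducing variance control and adaptive gating to stabilize attention dynamics; and showing that CNN feature extraction drives accuracy but suppresses human-likeness, whereas the control mechanisms restore alignment with minimal performance loss. Together these provide a principled framework for trustworthy, interpretable active vision.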
Abstract: Optimizing vision models purely for classification accuracy can impose an alignment tax, degrading human-like scanpaths and limiting interpretability. We introduce EVA, a neuroscience-inspired hard-attention mechanistic testbed that makes the performance-human-likeness trade-off explicit and adjustable. EVA samples a small number of sequential glimpses using a minimal fovea-periphery representation with CNN-based feature extractor and integrates variance control and adaptive gating to stabilize and regulate attention dynamics. EVA is trained with the standard classification objective without gaze supervision. On CIFAR-10 with dense human gaze annotations, EVA improves scanpath alignment under established metrics such as DTW, NSS, while maintaining competitive accuracy. Ablations show that CNN-based feature extraction drives accuracy but suppresses human-likeness, whereas variance control and gating restore human-aligned trajectories with minimal performance loss. We further validate EVA’s scalability on ImageNet-100 and evaluate scanpath alignment on COCO-Search18 without COCO-Search18 gaze supervision or finetuning, where EVA yields human-like scanpaths on natural scenes without additional training. Overall, EVA provides a principled framework for trustworthy, human-interpretable active vision.
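DTW, one of the scanpath alignment metrics named above, can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook version over (x, y) fixation sequences with Euclidean point cost; the paper's exact variant (normalization, windowing) is not stated in the abstract, so treat this as an assumption.

```python
import math

def dtw_distance(path_a, path_b):
    """Dynamic time warping distance between two scanpaths.

    Each path is a non-empty list of (x, y) fixation coordinates.
    """
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i points of path_a
    # with the first j points of path_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a point of path_b
                                 cost[i][j - 1],      # repeat a point of path_a
                                 cost[i - 1][j - 1])  # advance both paths
    return cost[n][m]
```

Because warping lets one fixation match several, a model scanpath that lingers on a location longer than the human one is not penalized for the extra dwell.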
[103] ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning cs.CV | cs.AI | cs.CLPDF
Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He
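TL;DR: ResAdapt is an adaptive-resolution framework that addresses the efficiency bottleneck caused by visual token growth when multimodal large language models (MLLMs) process high-resolution images or long videos. It learns to assign each frame an appropriate visual budget (pixel volume) and adjusts input resolution dynamically before encoding, maintaining high performance under limited compute. On video QA, temporal grounding, and image reasoning tasks, it supports more frames and improves performance at the same visual budget.
Details
Motivation: MLLMs strengthen visual understanding by raising input resolution, but the resulting visual token growth makes it infeasible to sustain both high spatial resolution and long temporal context. Existing methods mainly compress post-encoding representations; this paper argues that the bottleneck is the volume of pixels the encoder receives and therefore adapts on the input side.
Result: Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Specifically, at the same visual budget ResAdapt supports up to 16x more frames and delivers a performance gain of over 15%.
Insight: The novelty lies in formulating input-resolution allocation as a contextual bandit and training a lightweight allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. This allows the input to be adjusted dynamically to fit compute constraints while the MLLM backbone stays unchanged, balancing efficiency and performance.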
Abstract: Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
[104] Falcon Perception cs.CVPDF
Aviraj Bevli, Sofian Chaybouti, Yasser Dahou, Hakim Hacid, Ngoc Dung Huynh
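TL;DR: Falcon Perception is a unified dense Transformer that processes image patches and text tokens with an early-fusion architecture, using a hybrid attention pattern to combine global visual context with autoregressive instance generation. The design simplifies the perception system to a single scalable backbone plus lightweight output heads. It improves mask quality on the SA-Co benchmark and shows better gains on the PBench benchmark for compositional prompts and long-context tasks.
Details
Motivation: Traditional perception systems use a modular encoder-decoder pipeline, with a vision backbone for feature extraction and a separate decoder for task prediction. This paper asks whether a single early-fusion stack can handle both perception and task modeling at scale.
Result: On SA-Co, Falcon Perception raises mask quality to 68.0 Macro-F$_1$, compared with 62.3 for SAM3. On PBench, which targets compositional prompts (OCR, spatial constraints, relations) and dense long-context tasks, it shows better gains. The extended Falcon OCR model (300M parameters) reaches 80.3% on olmOCR and 88.64 on OmniDocBench.
Insight: Contributions include: an early-fusion architecture that handles images and text in a unified way; a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) that combines global context with autoregressive generation; and a lightweight token interface with specialized heads for parallel high-resolution mask prediction. Viewed objectively, the design improves scalability and performance by simplifying the architecture and shifting complexity toward data and training signals.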
Abstract: Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F$_1$ compared to 62.3 of SAM3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows better gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model which attains 80.3% on olmOCR and 88.64 on OmniDocBench.
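The hybrid attention pattern described above (bidirectional among image tokens, causal for prediction tokens) can be written down as a boolean mask. This is a minimal sketch that assumes image tokens come first in the sequence and leaves out text prompt tokens for brevity; the actual token layout is not specified in the abstract, and `hybrid_attention_mask` is a hypothetical helper.

```python
def hybrid_attention_mask(num_image, num_pred):
    """Return mask[i][j] = True if token i may attend to token j.

    Assumed sequence layout: image tokens first, then prediction tokens.
    """
    total = num_image + num_pred
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_image:
                # Image tokens see every image token (bidirectional),
                # but no prediction tokens.
                mask[i][j] = j < num_image
            else:
                # Prediction tokens see all image tokens and only
                # prediction tokens up to and including themselves (causal).
                mask[i][j] = j <= i
    return mask
```

For two image tokens and two prediction tokens, the top-left 2x2 block is fully True, while the bottom-right block is lower triangular.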
[105] HMPDM: A Diffusion Model for Driving Video Prediction with Historical Motion Priors cs.CVPDF
Ke Li, Tianjia Yang, Kaidi Liang, Xianbiao Hu, Ruwen Qin
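TL;DR: This paper proposes HMPDM, a diffusion model for driving video prediction. By introducing historical motion priors, the model improves motion understanding and temporal consistency, addressing the multi-stage training and the insufficient modeling of diverse motion in real driving scenes that limit existing models.
Details
Motivation: Existing driving video prediction models are constrained by multi-stage training pipelines and model the diverse motion patterns of real driving scenes inadequately, which degrades temporal consistency and visual quality. This paper uses historical motion priors to improve motion understanding and temporal coherence.
Result: Extensive experiments on the Cityscapes and KITTI benchmarks show that HMPDM outperforms state-of-the-art video prediction methods while remaining efficient, achieving a 28.2% improvement in FVD on Cityscapes under the same monocular RGB input configuration.
Insight: Contributions include: 1) a Temporal-aware Latent Conditioning module for implicit historical motion injection; 2) a Motion-aware Pyramid Encoder for multi-scale motion representation; 3) a Self-Conditioning strategy for stable iterative denoising. Together these designs improve the modeling of complex motion patterns and the quality of predictions.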
Abstract: Video prediction is a useful function for autonomous driving, enabling intelligent vehicles to reliably anticipate how driving scenes will evolve and thereby supporting reasoning and safer planning. However, existing models are constrained by multi-stage training pipelines and remain insufficient in modeling the diverse motion patterns in real driving scenes, leading to degraded temporal consistency and visual quality. To address these challenges, this paper introduces the historical motion priors-informed diffusion model (HMPDM), a video prediction model that leverages historical motion priors to enhance motion understanding and temporal coherence. The proposed deep learning system introduces three key designs: (i) a Temporal-aware Latent Conditioning (TaLC) module for implicit historical motion injection; (ii) a Motion-aware Pyramid Encoder (MaPE) for multi-scale motion representation; (iii) a Self-Conditioning (SC) strategy for stable iterative denoising. Extensive experiments on the Cityscapes and KITTI benchmarks demonstrate that HMPDM outperforms state-of-the-art video prediction methods with efficiency, achieving a 28.2% improvement in FVD on Cityscapes under the same monocular RGB input configuration setting. The implementation codes are publicly available at https://github.com/KELISBU/HMPDM.
[106] Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models cs.CVPDF
Yuhang Han, Yuyang Wu, Zhengbo Jiao, Yiyu Wang, Xuyang Liu
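TL;DR: This paper proposes KAWHI, a plug-and-play reward reweighting mechanism that addresses the visual representation bottleneck encountered when reinforcement learning from verifiable rewards is applied to large vision-language models. The method localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps.
Details
Motivation: Existing approaches lack explicit modeling and effective use of visual information, so visual representations cannot be tightly coupled with the reinforcement learning optimization process, which limits further gains in multimodal reasoning performance.
Result: Extensive empirical evaluation on diverse reasoning benchmarks shows that KAWHI, as a general-purpose enhancement module, consistently improves the performance of various uniform reward optimization methods.
Insight: The novelty lies in explicitly integrating structured visual information into uniform reward policy optimization. By adaptively localizing semantically salient regions and identifying vision-critical attention heads, the method aligns visual representations more tightly with the reinforcement learning process and overcomes the representational bottleneck of existing approaches.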
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models in abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck. Existing approaches generally lack explicit modeling and effective utilization of visual information, preventing visual representations from being tightly coupled with the reinforcement learning optimization process and thereby limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward policy optimization methods (e.g., GRPO and GSPO). The method adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps. Extensive empirical evaluations on diverse reasoning benchmarks substantiate KAWHI as a general-purpose enhancement module, consistently improving the performance of various uniform reward optimization methods. Project page: KAWHI (https://kawhiiiileo.github.io/KAWHI_PAGE/)
[107] SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning cs.CVPDF
Jiang Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan
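TL;DR: This paper proposes SpatialStack, a hierarchical fusion framework that addresses the weakness of large vision-language models in 3D spatial reasoning. The framework progressively aligns vision, geometry, and language representations across the model hierarchy, going beyond conventional late fusion to capture both local geometric detail and global contextual semantics. VLM-SpatialStack, a model built on this framework, achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks.
Details
Motivation: Large vision-language models are limited in 3D spatial reasoning because they cannot capture fine-grained 3D geometry and spatial relationships. Recent work has introduced multi-view geometry transformers, but these typically fuse the vision and geometry encoders only at deep-layer features, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding.
Result: VLM-SpatialStack, built on the SpatialStack framework, achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations show that the multi-level fusion strategy consistently improves 3D understanding and generalizes well across diverse spatial reasoning tasks.
Insight: The core contribution claimed by the paper is a general hierarchical fusion framework that progressively stacks and synchronizes multi-level geometric features with the language backbone across the model hierarchy, removing the bottleneck of conventional late fusion. Viewed objectively, integrating geometric information deeply and layer by layer into the language model's processing pipeline offers an effective and extensible design for vision-language-geometry integration in next-generation multimodal physical AI systems.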
Abstract: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
[108] Evaluating Large and Lightweight Vision Models for Irregular Component Segmentation in E-Waste Disassembly cs.CV | cs.AIPDF
Xinyao Zhang, Chang Liu, Xiao Liang, Minghui Zheng, Sara Behdad
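TL;DR: This study evaluates two vision models, SAM2 and YOLOv8, on segmenting irregular components for e-waste disassembly. Both were trained and tested, with data augmentation, on a newly collected dataset of 1,456 annotated RGB images of laptop components (such as logic boards, heat sinks, and fans). The lightweight YOLOv8 clearly outperforms SAM2, a large Transformer-based pretrained model, in segmentation accuracy and boundary precision.
Details
Motivation: Robotic disassembly and material recovery in e-waste recycling require precise segmentation of irregular, densely arranged components. The study also examines how model architecture and scale affect segmentation performance.
Result: On the laptop component segmentation dataset, YOLOv8 achieved higher segmentation accuracy (mAP50 = 98.8%, mAP50-95 = 85%) and stronger boundary precision, whereas SAM2 reached an mAP50 of only 8.4%. SAM2 showed flexibility in representing diverse object structures but often produced overlapping masks and inconsistent contours.
Insight: The study finds that large pretrained models such as SAM2 need task-specific optimization for particular industrial applications such as e-waste component segmentation, while lightweight models such as YOLOv8 may be more directly effective. The paper contributes a new dataset and benchmarking framework, laying a foundation for scalable vision algorithms for robotic e-waste disassembly.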
Abstract: Precise segmentation of irregular and densely arranged components is essential for robotic disassembly and material recovery in electronic waste (e-waste) recycling. This study evaluates the impact of model architecture and scale on segmentation performance by comparing SAM2, a transformer-based vision model, with the lightweight YOLOv8 network. Both models were trained and tested on a newly collected dataset of 1,456 annotated RGB images of laptop components including logic boards, heat sinks, and fans, captured under varying illumination and orientation conditions. Data augmentation techniques, such as random rotation, flipping, and cropping, were applied to improve model robustness. YOLOv8 achieved higher segmentation accuracy (mAP50 = 98.8%, mAP50-95 = 85%) and stronger boundary precision than SAM2 (mAP50 = 8.4%). SAM2 demonstrated flexibility in representing diverse object structures but often produced overlapping masks and inconsistent contours. These findings show that large pre-trained models require task-specific optimization for industrial applications. The resulting dataset and benchmarking framework provide a foundation for developing scalable vision algorithms for robotic e-waste disassembly and circular manufacturing systems.
[109] LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model cs.CVPDF
Quankai Gao, Jiawei Yang, Qiangeng Xu, Le Chen, Yue Wang
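TL;DR: This paper proposes LOME, an egocentric world model that generates realistic human-object interaction videos conditioned on an input image, a text prompt, and per-frame human actions (body poses and hand gestures). It injects strong, precise action guidance into object manipulation by jointly estimating spatial human actions and environment context during training. After a pretrained video generative model is finetuned on diverse egocentric human-object interaction videos, LOME shows high action-following accuracy and strong generalization to unseen scenarios, and it also generates realistic physical consequences of hand-object interactions.
Details
Motivation: Learning human-object manipulation is challenging because the motions involved are fine-grained and contact-rich. Traditional physics-based animation requires extensive modeling and manual setup, generalizes poorly across object morphologies, and does not scale effectively to real-world environments.
Result: Extensive experiments show that the video-based framework significantly outperforms state-of-the-art image-based and video-based action-conditioned methods, as well as image/text-to-video generative models, in temporal consistency and motion control.
Insight: The novelty lies in an egocentric world model with precise spatial action guidance that generates physically realistic interaction videos by jointly estimating actions and environment context. It opens a route to AR/VR experiences and robot training that does not depend on simulated environments or explicit 3D/4D modeling.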
Abstract: Learning human-object manipulation presents significant challenges due to the fine-grained and contact-rich nature of the motions involved. Traditional physics-based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real-world environments. To address these limitations, we introduce LOME, an egocentric world model that can generate realistic human-object interactions as videos conditioned on an input image, a text prompt, and per-frame human actions, including both body poses and hand gestures. LOME injects strong and precise action guidance into object manipulation by jointly estimating spatial human actions and the environment contexts during training. After finetuning a pretrained video generative model on videos of diverse egocentric human-object interactions, LOME demonstrates not only high action-following accuracy and strong generalization to unseen scenarios, but also realistic physical consequences of hand-object interactions, e.g., liquid flowing from a bottle into a mug after executing a "pouring" action. Extensive experiments demonstrate that our video-based framework significantly outperforms state-of-the-art image-based and video-based action-conditioned methods and Image/Text-to-Video (I/T2V) generative models in terms of both temporal consistency and motion control. LOME paves the way for photorealistic AR/VR experiences and scalable robotic training, without being limited to simulated environments or relying on explicit 3D/4D modeling.
[110] From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis cs.CVPDF
Ranran Huang, Weixun Luo, Ye Mao, Krystian Mikolajczyk
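TL;DR: This paper proposes NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters from uncalibrated, unposed context views, without ground-truth annotations or pretrained priors. It reconstructs 3D Gaussians and renders target views with self-predicted camera parameters, so training relies only on 2D photometric supervision.
Details
Motivation: The paper addresses 3D reconstruction from unconstrained image data when ground-truth 3D annotations and camera parameters are unavailable, aiming for fully self-supervised, scalable learning of 3D geometry.
Result: Extensive experiments show that NAS3R achieves superior results among self-supervised 3D reconstruction methods, establishing a scalable, geometry-aware paradigm for 3D reconstruction from unconstrained data.
Insight: The main contributions are integrating 3D reconstruction and camera parameter prediction in a shared Transformer backbone regulated by masked attention, and adopting a depth-based 3D Gaussian representation that promotes stable optimization. The framework is compatible with existing supervised SOTA architectures and can incorporate priors when available.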
Abstract: In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.
[111] Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning cs.CV | cs.AIPDF
Feiding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Chunzheng Zhu
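TL;DR: This paper proposes Difference Feedback, a method that strengthens process-level supervision for vision-language models during reinforcement learning. It automatically builds token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that need correction. This enables process-level visual alignment without costly human annotation and integrates seamlessly into existing GRPO-style frameworks.
Details
Motivation: Existing GRPO-based alignment training for vision-language models relies only on final outcome rewards. Credit assignment in multi-step reasoning is therefore sparse, which weakens the link between visual evidence and intermediate steps and often causes unstable optimization and visual hallucinations.
Result: Experiments on the multimodal reasoning benchmarks MMMStar and MathVista show an average 3% improvement under matched compute budgets.
Insight: The novelty lies in generating fine-grained supervision signals from automatically repaired reasoning trajectories, giving a low-cost, efficient way to align the visual reasoning process and compensating for the limits of sparse terminal rewards.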
Abstract: Vision–language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost solution for accurate vision–reasoning process alignment.
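The idea of deriving a token-level supervision mask from a repaired trajectory can be sketched with a plain sequence diff. This is an illustration only: it assumes the erroneous and repaired trajectories are available as token lists and marks the tokens the repair replaced or removed. How the paper actually produces repairs and masks is not described in the abstract, and `correction_mask` is a hypothetical name.

```python
from difflib import SequenceMatcher

def correction_mask(wrong_tokens, repaired_tokens):
    """Mark tokens of the erroneous trajectory that the repair changed.

    Returns a list of 0/1 flags aligned with wrong_tokens (1 = needs correction).
    Tokens that exist only in the repaired trajectory have no position in
    wrong_tokens and are therefore not represented in the mask.
    """
    mask = [0] * len(wrong_tokens)
    matcher = SequenceMatcher(a=wrong_tokens, b=repaired_tokens, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                mask[i] = 1
    return mask
```

Such a mask could then weight a per-token loss or advantage so that the update concentrates on the steps that were actually wrong, instead of spreading one terminal reward over the whole trajectory.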
[112] Estimating the Impact of COVID-19 on Travel Demand in Houston Area Using Deep Learning and Satellite Imagery cs.CV | stat.APPDF
Alekhya Pachika, Lu Gao, Lingguang Song, Pan Lu, Xingju Wang
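TL;DR: This study builds a car-counting model from high-resolution satellite imagery and deep learning (Detectron2 with Faster R-CNN) to estimate the impact of COVID-19 on travel demand in the Houston metropolitan area. Comparing car counts before and during the pandemic at locations such as a university, shopping mall, and community plaza, it finds an average drop of 30% in 2020 relative to 2019, which demonstrates the value of combining satellite imagery with computer vision for monitoring travel demand and economic activity.
Details
Motivation: The goal is to use high-resolution satellite imagery and advanced computer vision algorithms to monitor the use of transportation infrastructure and to quantify the impact of COVID-19 on travel demand, providing data to support transportation decisions.
Result: At the selected locations in the Houston metropolitan area, car counts fell by 30% on average in 2020 relative to 2019, indicating that satellite imagery combined with deep learning can reliably estimate changes in travel demand.
Insight: The novelty lies in combining high-resolution satellite imagery (for example, Google Earth Engine datasets) with an off-the-shelf deep learning object detection framework (Detectron2/Faster R-CNN) for large-area, non-intrusive monitoring of travel demand, offering a scalable method for transportation and economic analysis.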
Abstract: Considering recent advances in remote sensing satellite systems and computer vision algorithms, many satellite sensing platforms and sensors have been used to monitor the condition and usage of transportation infrastructure systems. The level of detail that can be detected increases significantly as the ground sample distance (GSD) becomes finer; GSD is around 15-30 cm for high-resolution satellite images. In this study, we analyzed data acquired from high-resolution satellite imagery to provide insights, predictive signals, and trends for travel demand estimation. More specifically, we estimate the impact of COVID-19 in the metropolitan area of Houston using satellite imagery from Google Earth Engine datasets. We developed a car-counting model through Detectron2 and Faster R-CNN to monitor the presence of cars within different locations (i.e., university, shopping mall, community plaza, restaurant, supermarket) before and during COVID-19. The results show that the number of cars detected at these selected locations fell by 30% on average in 2020 compared with the previous year, 2019. The results also show that satellite imagery provides rich information for travel demand and economic activity estimation. Together with advanced computer vision and deep learning algorithms, it can generate reliable and accurate information for transportation agency decision makers.
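The reported average reduction is a percent-change calculation over detected car counts. A minimal sketch follows, assuming the average is an unweighted mean of per-location changes (the abstract does not say whether sites are averaged or pooled). The counts in the example are made-up placeholders, not data from the study.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline count must be non-zero")
    return 100.0 * (after - before) / before

def mean_site_change(counts_before, counts_after):
    """Unweighted mean of the per-site percent change (assumed aggregation)."""
    changes = [percent_change(counts_before[s], counts_after[s])
               for s in counts_before]
    return sum(changes) / len(changes)

# Hypothetical detected-car counts per location, for illustration only.
cars_2019 = {"university": 200, "shopping_mall": 100}
cars_2020 = {"university": 120, "shopping_mall": 80}
average_change = mean_site_change(cars_2019, cars_2020)
```

Pooling all cars before computing the change would weight busy sites more heavily, so the two aggregation choices can give different headline numbers.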
[113] Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs cs.CV | cs.AIPDF
Xuanpu Zhao, Zhentao Tan, Dianmo Sheng, Tianxiang Chen, Yao Liu
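TL;DR: This paper proposes a novel two-stage reinforcement learning framework for multimodal large language models, which in complex visual scenes rely too heavily on the global image and neglect the details of cropped regions. By introducing an "information gap" mechanism and a grounding loss, the framework trains the model to focus on cropped key regions and to crop more precisely, strengthening perception of and reasoning about fine-grained details.
Details
Motivation: Existing agent-based MLLM workflows have a key limitation in training: the model relies heavily on the global input and only weakly on details within the cropped region, which limits fine-grained perception and reasoning in complex visual scenes.
Result: Experiments show that the method significantly increases the model's attention to cropped regions and achieves state-of-the-art performance on high-resolution visual question-answering benchmarks.
Insight: The novelty lies in a two-stage reinforcement learning framework that needs no trajectory supervision. The first stage introduces the "information gap" mechanism by adjusting the granularity of the global image, driving the model to attend to key regions for the information gain they provide. The second stage adds a grounding loss, using a small number of bounding box annotations, to further improve cropping precision. This gives MLLMs a more efficient way to perceive fine-grained details.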
Abstract: To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously use an image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation. We demonstrate the model's strong reliance on global input and its weak dependence on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning about fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.
[114] Streamlined Open-Vocabulary Human-Object Interaction Detection cs.CVPDF
Chang Sun, Dongliang Liao, Changxing Ding
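TL;DR: This paper proposes SL-HOI, a streamlined open-vocabulary human-object interaction (HOI) detection framework built on DINOv3. It targets the difficulty existing methods have in recognizing unseen interaction categories, which stems from hard cross-model feature fusion. The method uses only DINOv3: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification, together with a cross-attention design that bridges the representation gap. All DINOv3 parameters are frozen and only a small number of learnable parameters are added, allowing fast adaptation to HOI detection.
Details
Motivation: Existing open-vocabulary HOI detection methods usually rely on collaboration between a conventional HOI detector and a vision-language model, but the significant gap between cross-model representations makes feature fusion difficult. This paper addresses the problem with a streamlined framework based solely on DINOv3.
Result: Extensive experiments on the SWiG-HOI and HICO-DET benchmarks show that SL-HOI achieves state-of-the-art performance.
Insight: The novelty lies in exploiting the complementary strengths of different components (backbone and vision head) of a single pretrained model, DINOv3, and in a cross-attention design that passes both the interaction queries and the image tokens through the vision head to smoothly bridge the representation gap. The streamlined architecture is parameter-efficient and easy to adapt to new tasks.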
Abstract: Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3’s components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head’s output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at https://github.com/MPI-Lab/SL-HOI.
[115] Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM cs.CVPDF
Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao
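TL;DR: This paper proposes Chat-Scene++, a multimodal large language model framework that represents a 3D scene as a sequence of context-rich objects, enabling object-centric representation and interaction. It extracts object features with large-scale pretrained 3D scene-level and 2D image-level encoders, supports grounded chain-of-thought reasoning, and achieves state-of-the-art performance on multiple 3D vision-language benchmarks.
Details
Motivation: Existing multimodal large language models struggle with fine-grained object grounding and contextual reasoning in 3D scene understanding, which limits their ability to interpret and interact with complex 3D environments.
Result: The method achieves state-of-the-art performance on five major 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D) without additional task-specific heads or fine-tuning.
Insight: The work structures a 3D scene as a sequence of objects with contextual semantics, captures inter-object relationships and global semantics by combining 3D scene-level and 2D image-level encoders, and supports grounded chain-of-thought reasoning to distinguish objects at both the category and spatial levels.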
Abstract: Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.
[116] SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision cs.CVPDF
Shuai Xiang, Wei Guo, James Burridge, Shouyang Liu, Hao Lu
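TL;DR: SPROUT is a scalable agricultural vision foundation model based on diffusion denoising. It learns rich multi-crop, multi-task representations through unsupervised training and performs strongly on downstream agricultural tasks.
Details
Motivation: Existing vision foundation models succeed on general computer vision tasks but suffer from a significant domain gap when applied to agriculture, which calls for a model tailored to the agricultural domain.
Result: On downstream tasks spanning diverse crops, growth stages, and environments, SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models, at substantially lower pre-training cost.
Insight: A VAE-free pixel-space Diffusion Transformer learns by denoising, which yields structure-aware representations and efficient end-to-end training and offers a scalable solution for agricultural vision tasks.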
Abstract: Vision Foundation Models (VFMs) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce SPROUT (Scalable Plant Representation model via Open-field Unsupervised Training), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free Pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising while enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
[117] TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets cs.CVPDF
Zhixuan Liu, Peter Schaldenbrand, Yijun Li, Long Mai, Aniruddha Mahapatra
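TL;DR: TokenDial is a continuous attribute control framework for pretrained text-to-video generation models. By adding learnable offsets in the spatiotemporal visual patch-token space, it gives precise, continuous control over how much an attribute changes (e.g., effect intensity or motion magnitude) while preserving the video's identity, background, and temporal coherence.
Details
Motivation: Modern text-to-video generators produce strong holistic videos, but controlling the degree of an attribute change (e.g., effect intensity or motion magnitude) tends to cause drift in identity, background, or temporal coherence. Continuous, fine-grained control is lacking.
Result: Tested on diverse attributes and prompts, TokenDial outperforms state-of-the-art baselines in both controllability and edit quality, as supported by extensive quantitative evaluation and human studies.
Insight: The key observation is that additive offsets in the intermediate spatiotemporal visual patch-token space form semantic control directions, so adjusting the offset magnitude yields coherent, predictable edits to both appearance and motion dynamics. The method does not retrain the backbone. It learns attribute-specific token offsets from pretrained understanding signals (semantic direction matching for appearance, motion-magnitude scaling for motion), which gives efficient, continuous attribute control.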
Abstract: We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drifting identity, background, or temporal coherence. TokenDial is built on the observation: additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial’s effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
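The slider-style control described in this entry can be illustrated with a toy sketch. It only shows the idea of adding a scaled offset to every patch token. The function name `apply_token_offset`, the two-channel tokens, and the single shared offset vector are assumptions made for illustration and are not taken from the paper, which learns attribute-specific offsets inside a pretrained video generator.

```python
def apply_token_offset(tokens, offset, strength):
    """Shift every patch token by strength * offset (one shared direction)."""
    return [[t + strength * o for t, o in zip(token, offset)] for token in tokens]


# Toy "patch tokens" (2 tokens, 2 channels) and a hypothetical learned offset.
tokens = [[1.0, 2.0], [0.0, -1.0]]
offset = [0.5, -0.5]

# Each strength value plays the role of one slider position.
for strength in (0.0, 1.0, 2.0):
    print(strength, apply_token_offset(tokens, offset, strength))
```

A strength of 0 leaves the tokens unchanged, and larger values move them further along the same direction, which is the property the abstract relies on for continuous control.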
[118] OmniColor: A Unified Framework for Multi-modal Lineart Colorization cs.CVPDF
Xulu Zhang, Haoqian Du, Xiaoyong Wei, Qing Li
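TL;DR: This paper proposes OmniColor, a unified multi-modal lineart colorization framework that supports arbitrary combinations of control signals. It systematically divides guidance signals into spatially-aligned conditions and semantic-reference conditions and designs a treatment for each: dual-path encoding with a Dense Feature Alignment loss, VLM-only encoding with a Temporal Redundancy Elimination mechanism, and an Adaptive Spatial-Semantic Gating module that dynamically balances multi-modal constraints. The result is precise, flexible, and efficient colorization.
Details
Motivation: In the lineart colorization stage of professional content creation, achieving precise and flexible results under diverse user constraints remains a significant challenge.
Result: Experiments show that OmniColor performs strongly in controllability, visual quality, and temporal stability, providing a robust and practical solution for lineart colorization.
Insight: The novelty lies in systematically categorizing multi-modal control signals and designing a dedicated module for each (dual-path encoding, VLM-only encoding, adaptive gating), and in introducing the Dense Feature Alignment loss and the Temporal Redundancy Elimination mechanism, which improve inference efficiency while preserving boundary precision and color fidelity.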
Abstract: Lineart colorization is a critical stage in professional content creation, yet achieving precise and flexible results under diverse user constraints remains a significant challenge. To address this, we propose OmniColor, a unified framework for multi-modal lineart colorization that supports arbitrary combinations of control signals. Specifically, we systematically categorize guidance signals into two types: spatially-aligned conditions and semantic-reference conditions. For spatially-aligned inputs, we employ a dual-path encoding strategy paired with a Dense Feature Alignment loss to ensure rigorous boundary preservation and precise color restoration. For semantic-reference inputs, we utilize a VLM-only encoding scheme integrated with a Temporal Redundancy Elimination mechanism to filter repetitive information and enhance inference efficiency. To resolve potential input conflicts, we introduce an Adaptive Spatial-Semantic Gating module that dynamically balances multi-modal constraints. Experimental results demonstrate that OmniColor achieves superior controllability, visual quality, and temporal stability, providing a robust and practical solution for lineart colorization. The source code and dataset will be open at https://github.com/zhangxulu1996/OmniColor.
[119] Demo-Pose: Depth-Monocular Modality Fusion For Object Pose Estimation cs.CV | cs.AIPDF
Rachit Agarwal, Abhishek Joshi, Sathish Chalasani, Woo Jin Kim
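TL;DR: This paper proposes DeMo-Pose, a hybrid architecture for category-level 9-DoF object pose estimation. It fuses monocular semantic features with depth-based graph convolutional representations through a novel multimodal fusion strategy and introduces a Mesh-Point Loss that exploits mesh structure to improve geometric reasoning. It runs in real time and clearly surpasses existing methods on the REAL275 benchmark.
Details
Motivation: Category-level 9-DoF pose estimation from RGB-D input is challenging: existing depth-only methods ignore semantic cues from RGB, and many RGB-D fusion models underperform because of poor cross-modal fusion. This paper aims to improve performance through effective depth-RGB fusion and geometry-aware learning.
Result: On the REAL275 benchmark, the method improves over the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy, reaching the state of the art (SOTA).
Insight: The novelty lies in a multimodal fusion strategy that aligns RGB semantics with 3D geometric representations and in the Mesh-Point Loss, which uses mesh structure to improve geometric reasoning without adding inference overhead. Both offer useful ideas for fusing depth with RGB and for geometry-aware training.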
Abstract: Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multimodal fusion strategy. To further improve geometric reasoning, we introduce a novel Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and significantly improves over state-of-the-art methods across object categories, outperforming the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy on the REAL275 benchmark. The results highlight the effectiveness of depth-RGB fusion and geometry-aware learning, enabling robust category-level 3D pose estimation for real-world applications.
[120] MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction cs.CVPDF
Jongmin Lee, Seungyeop Kang, Sungjoo Yoo
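TL;DR: MV-RoMa is a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible target images, addressing the fragmented and geometrically inconsistent tracks that conventional pairwise matchers produce in multi-view settings. The model uses an efficient architecture that takes pairwise matching results as a geometric prior and refines correspondences with pixel-wise attention. A post-processing strategy then integrates the consistent multi-view correspondences into high-quality tracks for structure-from-motion (SfM).
Details
Motivation: Most existing matchers operate pairwise, and chaining their predictions across views often yields fragmented, geometrically inconsistent tracks that hurt 3D vision tasks such as SfM. A dense matching method that handles multi-view correspondences directly is therefore needed.
Result: Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences than existing sparse and dense matching methods and yields substantially denser, more accurate 3D reconstructions, reaching SOTA.
Insight: The contributions are an efficient multi-view encoder that uses pairwise matches as a geometric prior and a multi-view matching refiner that optimizes correspondences with pixel-wise attention. Viewed objectively, joint end-to-end multi-view modeling improves correspondence consistency and reconstruction quality and offers a new direction for multi-view 3D tasks.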
Abstract: Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture that avoids the high computational cost of full cross-attention for multi-view feature interaction: (i) a multi-view encoder that leverages pairwise matching results as a geometric prior, and (ii) a multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model’s consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods. Project page: https://icetea-cv.github.io/mv-roma/.
[121] Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models cs.CVPDF
Baoheng Zhang, Jiahui Liu, Gui Zhao, Weizhou Zhang, Yixuan Ma
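TL;DR: This paper proposes Event-MLLM, an event-enhanced multimodal large language model that performs visual reasoning under all lighting conditions by dynamically fusing event streams with RGB frames. Its core components are a learnable Illumination Indicator, derived from a DINOv2 branch, that adaptively modulates event-RGB fusion, and an Illumination Correction Loss that aligns the fused features with normal-light semantics in the latent space. The authors also build the first multi-illumination event-instruction dataset and benchmark for MLLMs.
Details
Motivation: Existing multimodal large language models perform well under standard lighting, but under extreme illumination the RGB input loses structure and semantics that cannot be recovered, and the models fail. This paper addresses visual reasoning for MLLMs under extreme lighting.
Result: Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines on reasoning, counting, and fine-grained recognition under extreme lighting, setting a new SOTA for robust multimodal perception and reasoning under challenging illumination.
Insight: The novelty lies in dynamically fusing event streams into the MLLM framework and in using a learnable Illumination Indicator and an Illumination Correction Loss to guide and optimize the fusion, compensating for information lost under extreme lighting. Viewed objectively, the first multi-illumination event-instruction dataset also gives the field an important benchmark resource.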
Abstract: Multimodal Large Language Models (MLLMs) perform strong vision-language reasoning under standard conditions but fail in extreme illumination, where RGB inputs irrecoverably lose structure and semantics. We propose Event-MLLM, an event-enhanced model that performs all-light visual reasoning by dynamically fusing event streams with RGB frames. Two key components drive our approach: an Illumination Indicator, a learnable signal derived from a DINOv2 branch that represents exposure degradation and adaptively modulates event-RGB fusion, and an Illumination Correction Loss that aligns fused features with non-degraded (normal-light) semantics in the latent space, compensating for information lost in extreme lighting. We curate the first multi-illumination event-instruction corpus for MLLMs, with 2,241 event-RGB samples (around 6 QA pairs each) across diverse scenes and 17 brightness rates (0.05x - 20x), plus an instruction-following benchmark for reasoning, counting, and fine-grained recognition under extreme lighting. Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting a new state of the art in robust multimodal perception and reasoning under challenging illumination.
[122] Structured Observation Language for Efficient and Generalizable Vision-Language Navigation cs.CV | cs.ROPDF
Daojie Peng, Fulong Ma, Jun Ma
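TL;DR: This paper proposes SOL-Nav, a framework that converts egocentric RGB-D images into compact structured language descriptions, concatenates them with the navigation instruction, and feeds the result as pure language input to a pretrained language model for efficient, generalizable vision-language navigation. The approach reduces model size and the dependence on large-scale visual pre-training, and it is validated on standard benchmarks and in real-world scenes.
Details
Motivation: Existing vision-language navigation methods usually convert raw images into visual tokens or implicit features, which requires large-scale visual pre-training and generalizes poorly under environmental variations such as lighting and texture.
Result: On standard VLN benchmarks (R2R, RxR) and in real-world deployments, SOL-Nav significantly reduces model size and training-data dependency, fully leverages the reasoning and representation capabilities of PLMs, and generalizes strongly to unseen environments.
Insight: The novelty lies in converting the visual modality into structured language descriptions, which turns the multimodal navigation task into a pure language-processing problem, makes effective use of pretrained language models, and improves both efficiency and generalization.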
Abstract: Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB-D images into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as pure language input to a pre-trained language model (PLM). Experimental results on standard VLN benchmarks (R2R, RxR) and real-world deployments demonstrate that SOL-Nav significantly reduces the model size and training data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.
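The grid-to-text step described in this entry can be sketched roughly as follows. This is only an illustration under stated assumptions: the per-pixel semantic labels, color names, and depths are taken as already computed, the "representative" value per cell is assumed to be the most frequent label or color and the mean depth, and the output format, the helper name `describe_grid`, and the sample instruction are all hypothetical. The abstract does not specify any of these details.

```python
from collections import Counter


def describe_grid(labels, colors, depths, n):
    """Summarize H x W per-pixel maps as one text line per cell of an n x n grid."""
    h, w = len(labels), len(labels[0])
    lines = []
    for gi in range(n):
        for gj in range(n):
            rows = range(gi * h // n, (gi + 1) * h // n)
            cols = range(gj * w // n, (gj + 1) * w // n)
            cell = [(r, c) for r in rows for c in cols]
            # Assumed "representative" values: majority label/color, mean depth.
            label = Counter(labels[r][c] for r, c in cell).most_common(1)[0][0]
            color = Counter(colors[r][c] for r, c in cell).most_common(1)[0][0]
            depth = sum(depths[r][c] for r, c in cell) / len(cell)
            lines.append(f"cell({gi},{gj}): {color} {label}, depth {depth:.1f}m")
    return "\n".join(lines)


# Toy 2 x 4 observation (hypothetical values).
labels = [["wall", "wall", "door", "door"], ["floor", "floor", "chair", "chair"]]
colors = [["white", "white", "brown", "brown"], ["gray", "gray", "red", "red"]]
depths = [[2.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.5, 1.5]]

observation = describe_grid(labels, colors, depths, n=2)
prompt = observation + "\nInstruction: walk through the door."
print(prompt)
```

The resulting `prompt` is plain text, so it could be passed to any language model without a visual encoder, which is the efficiency argument the abstract makes.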
[123] STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding cs.CV | cs.AIPDF
Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro
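TL;DR: STRIDE is a framework for streaming video understanding that recasts the "when to speak" decision as a structured sequence modeling problem. It jointly predicts and progressively refines activation signals over a sliding window with iterative denoising, giving more reliable and temporally coherent proactive responses in online streaming scenarios.
Details
Motivation: Real-world deployments require streaming perception and proactive interaction, where the system must decide not only what to respond but also when. Temporal transitions in streaming video naturally form span-structured activation patterns, which motivates treating proactive activation as a structured sequence modeling problem.
Result: Extensive experiments on diverse streaming benchmarks and downstream models show that STRIDE produces more reliable and temporally coherent proactive responses and significantly improves "when to speak" decision quality in online streaming scenarios.
Insight: The novelty lies in modeling proactive activation in streaming video as a span-level structured sequence problem and in introducing a lightweight masked diffusion module at the activation interface that jointly predicts and iteratively denoises the activation signals, capturing span structure over time and improving the temporal consistency of decisions.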
Abstract: Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
[124] You Only Erase Once: Erasing Anything without Bringing Unexpected Content cs.CVPDF
Yixing Zhu, Qing Zhang, Wenju Xu, Wei-Shi Zheng
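TL;DR: YOEO is a diffusion-based object erasure method. Trained on unpaired, large-scale real-world images with a sundries detector and a context coherence loss, it produces high-quality, artifact-free erasure results while staying coherent with the surrounding content.
Details
Motivation: Existing diffusion-based object erasure methods generate unexpected content inside the masked region because they lack sufficient paired training data and explicit constraints on content generation.
Result: Extensive experiments show that the method outperforms state-of-the-art object erasure methods, reaching SOTA.
Insight: The contributions include training on unpaired data, a sundries detector and a context coherence loss built on an entity segmentation model, and a diffusion distillation strategy for efficient training and inference.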
Abstract: We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods, which struggle to erase target objects without generating unexpected content within the masked regions due to a lack of sufficient paired training data and explicit constraints on content generation, our method produces high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving overall context coherence with the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Code will be available at https://zyxunh.github.io/YOEO-ProjectPage/.
[125] OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery cs.CVPDF
Qi Guo, Jue Wang, Yinhe Liu, Yanfei Zhong
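TL;DR: This paper proposes OpenDPR, an open-vocabulary change detection method that uses a vision-centric, diffusion-guided prototype retrieval framework to recognize changes of arbitrary categories in remote sensing imagery. It follows a two-stage pipeline: visual foundation models (such as SAM and DINOv2) first generate class-agnostic change proposals, and a diffusion model then builds diverse visual prototypes for the target categories, against which the proposals are matched by similarity retrieval to identify their category. The paper also designs a weakly supervised change detection module, S2C, to improve change localization, which gives the OpenDPR-W variant.
Details
Motivation: Open-vocabulary change detection aims to recognize arbitrary changes beyond a fixed set of predefined classes, but existing methods hit a bottleneck at the category identification stage, mainly because vision-language models based on image-text matching struggle to represent fine-grained land-cover categories. In addition, visual foundation models lack change priors, which weakens change localization.
Result: Experiments on four benchmark datasets show that OpenDPR and its weakly supervised variant OpenDPR-W achieve state-of-the-art performance under both supervision modes.
Insight: The novelty lies in reformulating open-vocabulary change detection as a two-stage pipeline and proposing a training-free, vision-centric, diffusion-guided prototype retrieval framework that overcomes the difficulty of fine-grained category identification. The S2C module adapts the strong spatial modeling of visual foundation models to change localization and improves performance with only weak supervision.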
Abstract: Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and to perform similarity retrieval with change proposals in the visual space during inference. The secondary bottleneck lies in change localization, due to the inherent lack of change priors in VFMs. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for change localization. Integrating the pretrained S2C into OpenDPR leads to an optional weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available at https://github.com/guoqi2002/OpenDPR.
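The retrieval step in this entry, matching a change proposal against per-category visual prototypes, can be sketched as a nearest-prototype lookup. This is a minimal sketch under assumptions: the embeddings are toy 2-D vectors standing in for visual features, cosine similarity is assumed as the metric, and each category is scored by its best-matching prototype. The function names and the category names are hypothetical, and the paper's actual feature extractor and aggregation rule are not given in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_category(proposal, prototypes):
    """Return (category, score) of the best-matching prototype set.

    prototypes maps a category name to a list of prototype vectors; a
    category's score is the maximum similarity over its prototypes
    (an assumed aggregation rule).
    """
    scored = {
        name: max(cosine(proposal, p) for p in vectors)
        for name, vectors in prototypes.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical offline prototypes and one change-proposal embedding.
prototypes = {
    "building": [[1.0, 0.0], [0.9, 0.1]],
    "water": [[0.0, 1.0]],
}
category, score = retrieve_category([0.8, 0.2], prototypes)
print(category, round(score, 3))
```

Because the prototypes are built offline, inference reduces to similarity lookups in the visual space, which is consistent with the training-free framing in the abstract.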
[126] V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models cs.CVPDF
Xinying Lin, Xuyang Liu, Yiyu Wang, Teng Ma, Wenqi Ren
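TL;DR: This paper proposes V-CAST, a training-free, plug-and-play pruning method for long-context inference in video large language models (VideoLLMs). It casts video token compression as a trajectory approximation problem: a curvature-guided temporal allocation module routes per-frame token budgets to semantic turns and event boundaries, and a dual-anchor spatial selection mechanism preserves high-entropy visual evidence. Retained tokens keep their original coordinates to maintain positional alignment, so redundant visual tokens are compressed efficiently under a tight budget.
Details
Motivation: VideoLLMs carry a large number of redundant visual tokens in the prefill stage of long-context inference. Existing token compression methods produce discontinuous spatio-temporal coverage, and token merging can misalign coordinates under MRoPE-style discrete (time, height, width) bindings, so insufficient spatio-temporal coverage becomes the key bottleneck under a tight budget.
Result: Extensive experiments on multiple VideoLLMs of different architectures and scales show that V-CAST reaches 98.6% of the original performance, outperforms the second-best method by +1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4% of the vanilla Qwen3-VL-8B-Instruct model.
Insight: The novelty lies in recasting token compression as trajectory approximation and introducing curvature-guided temporal allocation and dual-anchor spatial selection. This gives a training-free pruning strategy that preserves positional alignment and balances compression efficiency against continuous spatio-temporal coverage, which is instructive for optimizing VideoLLM inference.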
Abstract: Video large language models (VideoLLMs) show strong capability in video understanding, yet long-context inference is still dominated by massive redundant visual tokens in the prefill stage. We revisit token compression for VideoLLMs under a tight budget and identify a key bottleneck, namely insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t,h,w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token budgets to semantic turns and event boundaries. It further adopts a dual-anchor spatial selection mechanism that preserves high-entropy visual evidence without attention intervention, while keeping retained tokens at their original coordinates to maintain positional alignment. Extensive experiments across multiple VideoLLMs of different architectures and scales demonstrate that V-CAST achieves 98.6% of the original performance, outperforms the second-best method by +1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4% of vanilla Qwen3-VL-8B-Instruct.
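One way to picture the curvature-guided temporal allocation in this entry is to treat per-frame features as points on a trajectory and give more tokens to frames where the trajectory bends. The sketch below is an assumption-laden illustration, not the paper's algorithm: it uses the turning angle between consecutive displacement vectors as a discrete curvature proxy, a fixed per-frame floor, and proportional sharing of the remaining budget. The names `turning_angles` and `allocate_budget` and the toy 2-D features are hypothetical.

```python
import math


def turning_angles(feats):
    """Angle (radians) between consecutive displacements; endpoints get 0."""
    angles = [0.0] * len(feats)
    for t in range(1, len(feats) - 1):
        a = [x - y for x, y in zip(feats[t], feats[t - 1])]
        b = [x - y for x, y in zip(feats[t + 1], feats[t])]
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        if norm_a == 0 or norm_b == 0:
            continue  # a static frame contributes no turn
        cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
        angles[t] = math.acos(max(-1.0, min(1.0, cos)))
    return angles


def allocate_budget(feats, total_budget, floor=1):
    """Give each frame `floor` tokens, then share the rest by turning angle."""
    n = len(feats)
    spare = total_budget - floor * n
    if spare < 0:
        raise ValueError("total_budget is smaller than floor * number of frames")
    angles = turning_angles(feats)
    total = sum(angles)
    if total == 0:
        budgets = [floor + spare // n] * n
    else:
        budgets = [floor + int(spare * (a / total)) for a in angles]
    # Hand any rounding leftovers to the sharpest turn.
    peak = max(range(n), key=lambda t: angles[t])
    budgets[peak] += total_budget - sum(budgets)
    return budgets


# Toy trajectory: straight, a 90-degree turn at frame 2, then straight again.
feats = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]]
budgets = allocate_budget(feats, total_budget=20, floor=2)
print(budgets)
```

In this toy case the frame at the turn receives all of the spare budget while the frames on straight segments keep only the floor, mirroring the idea of routing tokens to semantic turns and event boundaries.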
[127] Amped: Adaptive Multi-stage Non-edge Pruning for Edge Detection cs.CVPDF
Yuhan Gao, Xinqing Li, Xin He, Bing Li, Xinzhong Zhu
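TL;DR: This paper proposes Amped, an adaptive multi-stage non-edge pruning framework that reduces the computational cost of edge detection. It identifies high-confidence non-edge tokens early and removes them, cutting computation substantially while keeping accuracy high. The authors also introduce SED, a simple yet high-performance Transformer-based model that reaches state-of-the-art performance with a streamlined structure.
Details
Motivation: Transformer-based edge detectors improve edge quality by capturing long-range dependencies, but they are computationally expensive, and high input resolutions raise the cost further and limit practical deployment. A method is needed that reduces the computational burden while keeping accuracy high.
Result: Applied to existing detectors and to the proposed SED model, the pruning strategy reduces GFLOPs by up to 40% with only a 0.4% drop in ODS F-measure. SED itself reaches an ODS F-measure of 86.5%, which is state of the art (SOTA).
Insight: The contributions are the adaptive multi-stage non-edge pruning framework, which improves efficiency by removing non-edge tokens early, and the Streamline Edge Detector (SED), a structurally simple yet high-performance Transformer-based model that balances accuracy and efficiency and eases integration into real-world systems.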
Abstract: Edge detection is a fundamental image analysis task that underpins numerous high-level vision applications. Recent advances in Transformer architectures have significantly improved edge quality by capturing long-range dependencies, but this often comes with computational overhead. Achieving higher pixel-level accuracy requires increased input resolution, further escalating computational cost and limiting practical deployment. Building on the strong representational capacity of recent Transformer-based edge detectors, we propose an Adaptive Multi-stage non-edge Pruning framework for Edge Detection (Amped). Amped identifies high-confidence non-edge tokens and removes them as early as possible to substantially reduce computation, cutting GFLOPs and accelerating inference with minimal performance loss. Moreover, to mitigate the structural complexity of existing edge detection networks and facilitate their integration into real-world systems, we introduce a simple yet high-performance Transformer-based model, termed Streamline Edge Detector (SED). Applied to both existing detectors and our SED, the proposed pruning strategy provides a favorable balance between accuracy and efficiency, reducing GFLOPs by up to 40% with only a 0.4% drop in ODS F-measure. In addition, despite its simplicity, SED achieves a state-of-the-art ODS F-measure of 86.5%. The code will be released.
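The multi-stage pruning idea in this entry can be sketched as a loop that drops tokens whose predicted edge probability falls below a threshold at each stage. This is a simplified illustration under assumptions: the per-stage edge probabilities are supplied as plain lists rather than produced by a network, the thresholds are fixed by hand rather than chosen adaptively, and the function name `multi_stage_prune` is hypothetical.

```python
def multi_stage_prune(token_ids, stage_probs, thresholds):
    """Drop confident non-edge tokens stage by stage.

    stage_probs[s][i] is the (assumed) edge probability predicted for
    token i at stage s; tokens below thresholds[s] are removed and are
    not processed by later stages.
    """
    active = list(token_ids)
    kept_per_stage = []
    for probs, threshold in zip(stage_probs, thresholds):
        active = [i for i in active if probs[i] >= threshold]
        kept_per_stage.append(len(active))
    return active, kept_per_stage


# Six toy tokens with hypothetical edge probabilities at two stages.
stage_probs = [
    [0.01, 0.50, 0.90, 0.02, 0.30, 0.20],
    [0.00, 0.40, 0.95, 0.00, 0.05, 0.60],
]
active, kept_per_stage = multi_stage_prune(range(6), stage_probs, [0.05, 0.1])
print(active, kept_per_stage)
```

Since later stages only process the surviving tokens, the compute saved grows with how early the non-edge tokens are removed, which is the trade-off the abstract quantifies in GFLOPs.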
[128] A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos cs.CVPDF
David Miranda Paredes, Jose M. Saavedra, Marcelo Pizarro
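TL;DR: This paper proposes a benchmarking methodology for evaluating open-source video large language models on automatic captioning of news videos. It compares eight state-of-the-art open-source VidLLMs on two complementary benchmark datasets (a Chilean TV news corpus and a BBC News corpus) using a range of metrics, including the newly proposed Thematic Fidelity Score and Entity Fidelity Score. The analysis finds that standard metrics have limited discriminative power for news video captioning, whereas the new metrics better assess coverage of thematic structure and named entities. Gemma 3 performs best on both datasets and on most evaluation dimensions, followed closely by Qwen-VL.
Details
Motivation: News videos are among the most prevalent content types on television stations and online streaming platforms, yet generating textual descriptions for indexing and retrieval still relies mainly on manual work. Video large language models could automate the task, but a comprehensive evaluation in the news domain is still lacking.
Result: Eight state-of-the-art open-source VidLLMs are evaluated on a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips), using lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and the newly proposed Thematic Fidelity Score and Entity Fidelity Score. Gemma 3 achieves the highest overall performance on both datasets and most evaluation dimensions, with Qwen-VL a consistent runner-up.
Insight: The contribution is a benchmarking methodology designed specifically for automatic news video captioning, together with two new fidelity metrics (TFS and EFS) that compensate for the weaknesses of standard metrics (surface-form dependence, static-frame insensitivity, and function-word inflation) and assess thematic structure preservation and named-entity coverage in generated captions more directly. Viewed objectively, the work offers a finer-grained, more targeted methodology for evaluating video understanding models in a specific domain (news) and underlines the importance of domain-adapted evaluation.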
Abstract: News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS). Our analysis reveals that standard metrics exhibit limited discriminative power for news video captioning due to surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing thematic structure preservation and named-entity coverage in the generated captions. Results show that Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
[129] Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers cs.CVPDF
Yuhe Liu, Zhenxiong Tan, Yujia Hu, Songhua Liu, Xinchao Wang
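TL;DR: This paper proposes a controllable diffusion framework for linear-attention architectures such as SANA. A unified gated condition injection module, working in a dual-path pipeline, integrates multiple types of conditional input, with the goal of efficient and secure controllable visual generation on edge devices.
Details
Motivation: Existing controllable diffusion models are computationally demanding and deployed in the cloud, which raises user data privacy concerns, and existing controllable generation frameworks (such as ControlNet and OminiControl) either lack flexibility or converge slowly on linear-attention models. The paper therefore studies controllable linear-attention Transformers suited to edge devices.
Result: Extensive experiments on multiple tasks and benchmarks show that the method achieves state-of-the-art controllable generation performance with linear-attention models, surpassing existing methods in fidelity and controllability.
Insight: The novelty lies in a unified gated conditioning module that integrates both spatially aligned and non-aligned conditional inputs without multimodal attention, improving the flexibility and efficiency of linear-attention models for controllable generation.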
Abstract: Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively integrates multi-type conditional inputs, such as spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability.
[130] Customized Visual Storytelling with Unified Multimodal LLMs cs.CVPDF
Wei-Hua Li, Cheng Sun, Chu-Song Chen
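TL;DR: This paper proposes VstoryGen, a unified multimodal large language model framework for customized visual story generation. It integrates textual descriptions with character and background reference images and introduces shot-type control through parameter-efficient prompt tuning on movie data to enhance cinematic diversity.
Details
Motivation: Most existing story generation methods rely on text-only input. The few that incorporate character identity cues (such as facial ID) lack broader multimodal conditioning and cannot produce highly customized visual stories.
Result: Experiments show that, on two newly built evaluation benchmarks (assessing character and scene consistency, text-visual alignment, and shot-type control), VstoryGen improves consistency and cinematic diversity over existing methods.
Insight: The novelty lies in a unified framework that integrates several modal conditions (text, character, background) and in shot-type control via parameter-efficient prompt tuning on movie data, which strengthens the customizability and cinematic quality of generated stories. Viewed objectively, the new multimodal benchmarks also give the field a quantitative evaluation standard.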
Abstract: Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.
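[131] LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation cs.CV | cs.AI | cs.LG | cs.MA | cs.MM
Shentong Mo, Sukmin Yun
TL;DR: This paper proposes LVRPO, a language-visual reinforcement-based preference optimization framework that uses Group Relative Policy Optimization (GRPO) to explicitly align language and visual representations in support of multimodal understanding and generation.
Details
Motivation: Existing unified multimodal pretraining methods rely on implicit or indirect alignment signals and fall short when supporting multimodal understanding and generation at the same time, particularly in settings that require fine-grained language-visual reasoning and controllable generation.
Result: LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
Insight: The novelty is to optimize multimodal model behavior directly through preference-driven reinforcement signals, without auxiliary encoders or handcrafted cross-modal objectives, which yields effective alignment and extends naturally to diverse multimodal capabilities.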
Abstract: Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
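As a rough illustration of the group-relative signal that GRPO is generally described as using, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own group. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for this example; the abstract does not specify LVRPO's reward definition or normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in the same group.

    `rewards` holds one scalar reward per response sampled for a single prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one prompt, scored by some reward function.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses scored above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.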
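[132] Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation? cs.CV
Samik Some, Vinay P. Namboodiri
TL;DR: This paper explores using unsupervised segmentation models (such as SAM and SAM 2) together with coarse annotations to lower the annotation cost of video semantic segmentation. It shows that automatic mask generation can reduce annotation requirements by about one third while maintaining similar performance.
Details
Motivation: Video semantic segmentation requires large numbers of fine-grained pixel-level annotations, which are expensive, whereas raw unannotated video frames and coarse annotations are comparatively cheap. The paper aims to exploit these resources to cut annotation cost.
Result: For video semantic segmentation, the proposed approach reduces annotation requirements by about one third with performance comparable to full annotation. The analysis also indicates that the variety of frames in a dataset matters more for performance than their number.
Insight: The novelty lies in using segmentation foundation models (the SAM family) to automate mask generation, combining unannotated frames and coarse annotations to reduce manual labeling cost, and in highlighting the key role of data diversity in model performance.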
Abstract: Present-day deep neural networks for video semantic segmentation require a large number of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state-of-the-art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames as well as coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that if used appropriately, we can reduce the need for annotation by a third with similar performance for video semantic segmentation. More significantly, our analysis suggests that the variety of frames in the dataset is more important than the number of frames for obtaining the best performance.
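[133] Synergizing Discriminative Exemplars and Self-Refined Experience for MLLM-based In-Context Learning in Medical Diagnosis cs.CV
Wenkai Zhao, Zipei Wang, Mengjie Fang, Di Dong, Jie Tian
TL;DR: This paper proposes a new in-context learning framework, the "Clinician Mimetic Workflow", to address the weak performance of general multimodal LLMs in medical diagnosis. It combines two mechanisms, Discriminative Exemplar Coreset Selection and Self-Refined Experience Summarization, to mimic a clinician's diagnostic process and improve performance without updating pretrained model weights.
Details
Motivation: General multimodal LLMs struggle to capture domain-specific nuances in specialized fields such as medical diagnosis and trail fully supervised baselines. Fine-tuning helps, but expert annotation is costly and the computational overhead is large, which limits scalability. The work therefore aims to close this gap by improving in-context learning without updating pretrained weights.
Result: Evaluated extensively on all 12 datasets of the MedMNIST 2D benchmark, the method outperforms zero-shot general and medical MLLMs and reaches performance comparable to fully supervised vision models and domain-specific fine-tuned MLLMs, setting a new benchmark for parameter-efficient medical in-context learning.
Insight: The core idea is to mimic a clinician's diagnostic cognition. DECS selects discriminative visual coresets from noisy data at the computational level, imitating how clinicians refer to "anchor cases"; SRES distills diverse rollouts into a dynamic textual experience bank, imitating the cognition and reflection involved in clinical diagnosis. This synergy of visual exemplar selection and textual experience distillation offers a new route to efficient domain adaptation without weight updates.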
Abstract: General Multimodal Large Language Models (MLLMs) often underperform in capturing domain-specific nuances in medical diagnosis, trailing behind fully supervised baselines. Although fine-tuning provides a remedy, the high costs of expert annotation and massive computational overhead limit its scalability. To bridge this gap without updating the weights of the pre-trained backbone of the MLLM, we propose a Clinician Mimetic Workflow. This is a novel In-Context Learning (ICL) framework designed to synergize Discriminative Exemplar Coreset Selection (DECS) and Self-Refined Experience Summarization (SRES). Specifically, DECS simulates a clinician’s ability to reference “anchor cases” by selecting discriminative visual coresets from noisy data at the computational level; meanwhile, SRES mimics the cognition and reflection in clinical diagnosis by distilling diverse rollouts into a dynamic textual Experience Bank. Extensive evaluation across all 12 datasets of the MedMNIST 2D benchmark demonstrates that our method outperforms zero-shot general and medical MLLMs. Simultaneously, it achieves performance levels comparable to fully supervised vision models and domain-specific fine-tuned MLLMs, setting a new benchmark for parameter-efficient medical in-context learning. Our code is available at an anonymous repository: https://anonymous.4open.science/r/Synergizing-Discriminative-Exemplars-and-Self-Refined-Experience-ED74.
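[134] TIR-Agent: Training an Explorative and Efficient Agent for Image Restoration cs.CV
Yisheng Zhang, Guoli Jia, Haote Hu, Shanxu Zhao, Kaikai Zhao
TL;DR: This paper proposes TIR-Agent, a trainable vision-language agent for image restoration. Through two-stage training with supervised fine-tuning followed by reinforcement learning, it learns a direct tool-calling policy that improves restoration paths and reduces computational cost.
Details
Motivation: Existing vision-language-model-based image restoration agents are usually training-free and rely on heuristic task scheduling and exhaustive tool traversal, which leads to sub-optimal restoration paths and high computational cost. The core bottleneck is the absence of a learned decision policy.
Result: Experiments show that TIR-Agent outperforms 12 baselines (6 all-in-one models, 3 training-free agents, and 3 proprietary models) on both in-domain and out-of-domain degradations, and achieves over 2.5x inference speedup by eliminating redundant tool executions.
Insight: The contributions are: 1) two-stage training (SFT + RL) that learns a direct tool-calling policy; 2) a random perturbation strategy applied to the SFT data to broaden policy exploration; 3) a multi-dimensional adaptive reward mechanism that dynamically re-weights image quality metrics to prevent reward hacking; 4) a globally shared model-call pool that supports high-throughput asynchronous GPU tool invocation during training.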
Abstract: Vision-language agents that orchestrate specialized tools for image restoration (IR) have emerged as a promising method, yet most existing frameworks operate in a training-free manner. They rely on heuristic task scheduling and exhaustive tool traversal, resulting in sub-optimal restoration paths and prohibitive computational cost. We argue that the core bottleneck lies in the absence of a learned decision-making policy, as a vision-language model cannot efficiently handle degradation-aware task ordering and tool composition. To this end, we propose TIR-Agent, a trainable image restoration agent that performs a direct tool-calling policy through a two-stage training pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL). Two key designs underpin effective RL training: (i) a random perturbation strategy applied to the SFT data, which broadens the policy’s exploration over task schedules and tool compositions, and (ii) a multi-dimensional adaptive reward mechanism that dynamically re-weights heterogeneous image quality metrics to mitigate reward hacking. To support high-throughput, asynchronous GPU-based tool invocation during training, we further develop a globally shared model-call pool. Experiments on both in-domain and out-of-domain degradations show that TIR-Agent outperforms 12 baselines, including 6 all-in-one models, 3 training-free agents, and 3 proprietary models, and achieves over 2.5$\times$ inference speedup by eliminating redundant tool executions.
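[135] Data Organization Matters in Multimodal Instruction Tuning: A Controlled Study of Capability Trade-offs cs.CV
Guowei Tang
TL;DR: This paper studies how data organization in multimodal instruction tuning affects capability trade-offs. Using a controlled three-stage training framework, it compares four data scheduling strategies: direct mixture, curriculum training, balanced sampling, and reverse curriculum. It finds that data organization is a first-order design variable in multimodal adaptation, that curriculum training gives the best overall trade-off and structured reasoning performance, and that building general understanding and reasoning before introducing OCR-intensive supervision leads to smoother optimization and faster convergence.
Details
Motivation: The capabilities of multimodal large language models (MLLMs) come from heterogeneous supervision sources with different task structures and learning demands, yet the effect of how these sources are ordered over time during training remains underexplored. The paper asks whether data organization affects the trade-off among general visual understanding, structured reasoning, and fine-grained OCR/document understanding.
Result: Experiments on general visual instruction following, diagram and chart reasoning, scene-text question answering, and document question answering show that curriculum training achieves the best overall trade-off and the strongest structured reasoning performance. Balanced sampling is better for OCR-oriented capability but weakens the broader capability balance. Reverse curriculum performs worst in both final performance and optimization stability.
Insight: The novelty lies in treating data scheduling as an explicit design dimension of multimodal model adaptation and in isolating it from other variables through a controlled experimental framework, which reveals how strongly the temporal arrangement of training data shapes capability balance and optimization dynamics. Viewed objectively, the study provides empirical guidance for data orchestration in multimodal instruction tuning and underlines the value of introducing task data of differing complexity in deliberate stages.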
Abstract: Recent multimodal large language models (MLLMs) perform strongly on general visual understanding, diagram and chart reasoning, and document-centric perception. However, these abilities are learned from heterogeneous supervision sources with very different task structures and learning demands, and the effect of their temporal organization during training remains underexplored. We study whether data organization affects the trade-off among general understanding, structured reasoning, and fine-grained OCR/document understanding in multimodal instruction tuning. To isolate this factor, we use a controlled three-stage training framework in which the backbone, trainable modules, and optimization pipeline are fixed across all runs, and only the temporal arrangement of post-alignment supervision is changed. We compare four strategies: direct mixture, curriculum training, balanced sampling, and reverse curriculum. Experiments on general visual instruction following, diagram reasoning, chart reasoning, scene-text question answering, and document question answering show that data organization is a first-order design variable in multimodal adaptation. Curriculum training gives the best overall trade-off and the strongest structured reasoning performance. Balanced sampling is better for OCR-oriented capability but weakens the broader capability balance. Reverse curriculum performs worst in both final performance and optimization stability. Training-dynamics analysis further suggests that building general understanding and reasoning before introducing OCR-intensive supervision leads to smoother optimization and faster convergence. These findings highlight data scheduling as an explicit design dimension for multimodal model adaptation.
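The four strategies compared above can be pictured as different orderings of the same pool of examples. The sketch below is a minimal illustration under assumed definitions: the source names, the stage order (general, then reasoning, then OCR), and the exact sampling rules are hypothetical and are not taken from the paper's training pipeline.

```python
import random


def build_schedule(sources, strategy, seed=0):
    """Return (source_name, example) pairs in training order.

    `sources` maps a source name to its examples; the dict order is assumed
    to run from general understanding to OCR-heavy supervision.
    """
    rng = random.Random(seed)
    names = list(sources)
    if strategy == "direct_mixture":
        # Pool everything and shuffle once; sources mix in proportion to size.
        pool = [(n, x) for n in names for x in sources[n]]
        rng.shuffle(pool)
        return pool
    if strategy in ("curriculum", "reverse_curriculum"):
        # Train on one source per stage, in forward or reversed order.
        order = names if strategy == "curriculum" else names[::-1]
        schedule = []
        for n in order:
            stage = [(n, x) for x in sources[n]]
            rng.shuffle(stage)
            schedule.extend(stage)
        return schedule
    if strategy == "balanced_sampling":
        # Draw each step from a uniformly chosen source, regardless of size.
        total = sum(len(v) for v in sources.values())
        schedule = []
        for _ in range(total):
            n = rng.choice(names)
            schedule.append((n, rng.choice(sources[n])))
        return schedule
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical toy sources; integers stand in for training examples.
toy = {"general": [0, 1, 2, 3], "reasoning": [0, 1], "ocr": [0, 1]}
curriculum_order = [name for name, _ in build_schedule(toy, "curriculum")]
```

Keeping the example pool fixed and changing only the ordering mirrors the controlled setup described in the abstract, where the backbone, trainable modules, and optimization pipeline stay the same across runs.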
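[136] When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models cs.CV
Chengyin Hu, Xuemeng Sun, Jiajun Han, Qike Zhang, Xiang Chen
TL;DR: This paper proposes a new physical adversarial attack on vision-language models (VLMs). By simulating the mechanics of three-dimensional fabric wrinkles, it generates realistic non-rigid deformation perturbations (such as surface wrinkles) that degrade VLM performance on zero-shot classification, image captioning, and visual question answering.
Details
Motivation: The work examines the robustness of VLMs to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces. This area is understudied, and the goal is to expose model vulnerability under real-world deformation.
Result: Experiments show that the method significantly degrades the performance of various state-of-the-art VLMs and consistently outperforms baselines on image captioning and visual question answering, demonstrating its adversarial effectiveness and transferability.
Insight: The novelty is a parametric structural perturbation method that combines multi-scale wrinkle fields with surface-consistent appearance variations to produce realistic perturbations, optimized in a low-dimensional parameter space with a hierarchical fitness function to balance visual naturalness against adversarial effectiveness.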
Abstract: Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces, remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.
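[137] RINO: Rotation-Invariant Non-Rigid Correspondences cs.CV
Maolin Gao, Shao Jie Hu-Chen, Congyue Deng, Riccardo Marin, Leonidas Guibas
TL;DR: RINO is an unsupervised, rotation-invariant framework for dense 3D shape correspondence. Its feature extractor, RINONet, combines vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry, unifying rigid and non-rigid shape matching.
Details
Motivation: Existing deep learning methods depend on intermediate geometric features or handcrafted descriptors, which limits their effectiveness under non-isometric deformations, partial data, and non-manifold inputs.
Result: RINO shows unprecedented performance on challenging non-rigid matching tasks involving arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.
Insight: The novelty is the combination of vector-based SO(3)-invariant learning with complex functional maps, giving an end-to-end, data-driven method that needs neither shape pre-alignment nor handcrafted features and improves the robustness of non-rigid shape correspondence.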
Abstract: Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.
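[138] GS3LAM: Gaussian Semantic Splatting SLAM cs.CV
Linfei Li, Lin Zhang, Zhong Wang, Ying Shen
TL;DR: This paper presents GS3LAM, a semantic SLAM framework based on 3D Gaussian Splatting that fuses RGB, depth, and semantic data to render consistent, dense semantic maps in real time. It models the scene as a Semantic Gaussian Field and jointly optimizes camera poses and the scene representation, and it introduces depth-adaptive scale regularization and a random-sampling-based keyframe mapping strategy to address scale alignment and catastrophic forgetting.
Details
Motivation: Existing semantic SLAM systems built on explicit representations are limited by resolution and cannot predict unknown areas, while implicit representations usually rely on time-consuming ray tracing and struggle to meet real-time requirements. 3D Gaussian Splatting combines the efficiency of point-based methods with the continuity of geometric structures, offering a new route to dense, efficient, and scalable scene representation.
Result: Extensive experiments on benchmark datasets show that GS3LAM improves tracking robustness, rendering quality, and semantic precision over state-of-the-art methods.
Insight: The core novelty is the combination of 3D Gaussian Splatting with semantic SLAM, building a Semantic Gaussian Field for joint multimodal optimization. The specific technical contributions are: 1) a depth-adaptive scale regularization scheme that resolves misalignment between scale-invariant Gaussians and geometric surfaces; 2) a random-sampling-based keyframe mapping strategy that mitigates catastrophic forgetting and outperforms common local covisibility optimization. Together these provide an efficient new paradigm for real-time, dense, high-precision semantic mapping.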
Abstract: Recently, the multi-modal fusion of RGB, depth, and semantics has shown great potential in dense Simultaneous Localization and Mapping (SLAM). However, a prerequisite for generating consistent semantic maps is the availability of dense, efficient, and scalable scene representations. Existing semantic SLAM systems based on explicit representations are often limited by resolution and an inability to predict unknown areas. Conversely, implicit representations typically rely on time-consuming ray tracing, failing to meet real-time requirements. Fortunately, 3D Gaussian Splatting (3DGS) has emerged as a promising representation that combines the efficiency of point-based methods with the continuity of geometric structures. To this end, we propose GS3LAM, a Gaussian Semantic Splatting SLAM framework that processes multimodal data to render consistent, dense semantic maps in real-time. GS3LAM models the scene as a Semantic Gaussian Field (SG-Field) and jointly optimizes camera poses and the field via multimodal error constraints. Furthermore, a Depth-adaptive Scale Regularization (DSR) scheme is introduced to resolve misalignments between scale-invariant Gaussians and geometric surfaces. To mitigate catastrophic forgetting, we propose a Random Sampling-based Keyframe Mapping (RSKM) strategy, which demonstrates superior performance over common local covisibility optimization methods. Extensive experiments on benchmark datasets show that GS3LAM achieves increased tracking robustness, superior rendering quality, and enhanced semantic precision compared to state-of-the-art methods. Source code is available at https://github.com/lif314/GS3LAM.
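[139] Tracking without Seeing: Geospatial Inference using Encrypted Traffic from Distributed Nodes cs.CV | cs.LG | cs.NI
Sadik Yagiz Yetim, Gaofeng Dong, Isaac-Neil Zanoria, Ronit Barman, Maggie Wigness
TL;DR: This paper proposes GraySense, a learning-based framework that performs geospatial object tracking solely by analyzing encrypted wireless video transmission traffic (such as packet sizes), without access to raw sensor data. A Packet Grouping module and a Transformer-encoder-based Tracker module fuse indirect packet inputs with optional direct camera inputs to infer object motion. On CARLA-simulated video under emulated network conditions, GraySense achieves a tracking error of 2.33 meters.
Details
Motivation: Observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work explores an alternative: geospatial inference from encrypted packet-level information alone, without access to raw sensory data, and studies how this indirect information can be fused with directly available sensory data to extend overall inference capability.
Result: In extensive experiments with realistic CARLA-simulated video and emulated networks under varying conditions, GraySense achieves a tracking error of 2.33 meters (Euclidean distance) without raw signal access, which is within the dimensions of the tracked objects (4.61 m x 1.93 m). To the authors' knowledge, this capability has not been demonstrated before.
Insight: The novelty is to exploit the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion, fusing indirect and direct inputs through a Transformer-encoder-based tracker with a recurrent state. This demonstrates a new way of sensing with latent signals and extends what can be inferred from encrypted traffic without raw data access.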
Abstract: Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video transmission traffic, such as packet sizes, from cameras with inaccessible streams. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, which fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object’s position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves 2.33 meters tracking error (Euclidean distance) without raw signal access, within the dimensions of tracked objects (4.61m x 1.93m). To our knowledge, this capability has not been previously demonstrated, expanding the use of latent signals for sensing.
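To make the frame-size signal concrete, the sketch below groups packets into bursts with a simple inter-arrival gap heuristic and sums the bytes in each burst. This is not GraySense's Packet Grouping module, which the abstract describes as a stage of a learning-based framework; the function name, the `(timestamp, size)` input format, and the 5 ms gap threshold are assumptions chosen only to show how frame sizes might be read off encrypted traffic without decrypting it.

```python
def group_packets(packets, gap_threshold=0.005):
    """Group (timestamp_seconds, size_bytes) packets into bursts.

    A new burst starts whenever the gap to the previous packet exceeds
    `gap_threshold` (a hypothetical value). Each burst is treated as one
    video frame and returned as (start_time, total_bytes).
    """
    frames = []
    prev_time = None
    for timestamp, size in sorted(packets):
        if prev_time is None or timestamp - prev_time > gap_threshold:
            frames.append([timestamp, 0])  # open a new burst
        frames[-1][1] += size
        prev_time = timestamp
    return [(start, total) for start, total in frames]


# Hypothetical capture: two bursts roughly one 30 fps frame interval apart.
capture = [(0.000, 1200), (0.001, 1200), (0.002, 400), (0.033, 1200), (0.034, 300)]
frame_sizes = group_packets(capture)
```

A sequence of per-frame byte counts like this is the kind of indirect input a downstream tracker could consume, since encoders typically spend more bits on frames with more scene change.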
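[140] MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences cs.CV
Shijian Wang, Jiarui Jin, Runhao Fu, Zexuan Yan, Xingjian Wang
TL;DR: This paper proposes MuSEAgent, a multimodal reasoning agent that introduces a stateful experience learning paradigm: interaction data is abstracted into atomic decision experiences and organized into a quality-filtered experience bank, strengthening multimodal reasoning.
Details
Motivation: Research agents have made progress in retrieving and synthesizing heterogeneous textual and visual information but lack the ability to discover and exploit stateful experiences. MuSEAgent aims to improve multimodal decision quality through stateful experience modeling.
Result: On fine-grained visual perception and complex multimodal reasoning tasks, MuSEAgent consistently outperforms baselines based on trajectory-level experience retrieval, validating the effectiveness of stateful experience modeling.
Insight: The novelty is the stateful experience learning paradigm, which abstracts interaction data into atomic decision experiences through hindsight reasoning and uses complementary wide- and deep-search strategies for adaptive experience exploitation, offering a new approach to multimodal agent reasoning.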
Abstract: Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.
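[141] Towards Context-Aware Image Anonymization with Multi-Agent Reasoning cs.CV | cs.AI | cs.CR
Robert Aufschläger, Jakob Folz, Gautam Savaliya, Manjitha D Vidanalage, Michael Heigl
TL;DR: This paper proposes CAIAMAR, a multi-agent reasoning framework for context-aware image anonymization. It combines pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate in a PDCA cycle, using large vision-language models to classify and segment personally identifiable information according to spatial context (such as private vs. public property), and anonymization is performed with a diffusion-based decorrelation approach.
Details
Motivation: Existing street-level image anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. The paper aims to build an open-source framework that understands context, preserves image quality while protecting privacy, and runs entirely on-premise.
Result: On CUHK03-NP, the method reduces person re-identification risk by 73% (R1 accuracy: 16.9% vs. 62.4% for the baseline). On CityScapes, it achieves strong image quality metrics (KID: 0.001, FID: 9.1), significantly outperforming existing anonymization methods.
Insight: The novelty lies in combining multi-agent reasoning with a PDCA cycle for context-aware PII identification and in using modal-specific diffusion guidance with appearance decorrelation to lower re-identification risk. The framework runs entirely on-premise, produces interpretable audit trails that support GDPR transparency requirements, detects non-direct PII instances across object categories, and preserves downstream semantic segmentation performance.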
Abstract: Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework CAIAMAR (Context-Aware Image Anonymization with Multi-Agent Reasoning) for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection where a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and IoU-based deduplication (30% threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve KID: 0.001 and FID: 9.1, significantly outperforming existing anonymization methods. The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting the EU’s GDPR transparency requirements while flagging failed cases for human review.
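The IoU-based deduplication step mentioned above can be sketched as a greedy filter: a candidate is kept only if it overlaps every already-kept candidate by no more than the threshold. The example below is a minimal sketch that assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and first-come priority; the abstract gives only the 30% threshold, so whether CAIAMAR compares boxes or segmentation masks, and in what order, is not known here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def deduplicate(boxes, threshold=0.30):
    """Greedily keep boxes whose IoU with every kept box is <= threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, other) <= threshold for other in kept):
            kept.append(box)
    return kept


# Hypothetical detections: the second box largely overlaps the first.
candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
unique = deduplicate(candidates)
```

With a coarse-to-fine pipeline, the same object can be found in several overlapping crops, so a filter of this kind keeps the later anonymization step from processing one region repeatedly.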
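[142] Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning cs.CV
Ming Liu, Yunbei Zhang, Shilong Liu, Liwen Wang, Wensheng Zhang
TL;DR: This paper proposes Wan-R1, a reinforcement learning approach with verifiable rewards that improves the generalization of video generation models on spatial reasoning and multi-step planning tasks. It adapts Group Relative Policy Optimization (GRPO) to flow-based video models and trains them on maze-solving and robotic navigation tasks. The study finds that multimodal reward models fail in this setting, whereas verifiable reward functions grounded in objective task metrics improve performance substantially.
Details
Motivation: Video generation models produce visually coherent content but are limited on tasks that require spatial reasoning and multi-step planning. Reinforcement learning can improve generalization, but its effectiveness depends heavily on reward design, a challenge that has received little systematic study.
Result: On complex 3D mazes, the model improves exact match accuracy by 29.1% over the supervised fine-tuning (SFT) baseline, and on trap-avoidance tasks by 51.4%. Experiments show that verifiable rewards stabilize training and improve generalization.
Insight: The novelty is a systematic analysis of reward design for video reasoning, together with a multi-component trajectory reward for structured game environments and an embedding-level verifiable reward for robotic navigation. The results show that verifiable rewards are key to avoiding degenerate solutions and achieving robust video reasoning.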
Abstract: Video generation models produce visually coherent content but struggle with tasks requiring spatial reasoning and multi-step planning. Reinforcement learning (RL) offers a path to improve generalization, but its effectiveness in video reasoning hinges on reward design – a challenge that has received little systematic study. We investigate this problem by adapting Group Relative Policy Optimization (GRPO) to flow-based video models and training them on maze-solving and robotic navigation tasks. We first show that multimodal reward models fail catastrophically in this setting. To address this, we design verifiable reward functions grounded in objective task metrics. For structured game environments, we introduce a multi-component trajectory reward. For robotic navigation, we propose an embedding-level verifiable reward. Our experiments show that RL fine-tuning with verifiable rewards improves generalization. For example, on complex 3D mazes, our model improves exact match accuracy by 29.1% over the SFT baseline, and on trap-avoidance tasks by 51.4%. Our systematic reward analysis reveals that verifiable rewards are critical for stable training, while multimodal reward models could lead to degenerate solutions. These findings establish verifiable reward design as a key enabler for robust video reasoning. Code will be publicly available.
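A verifiable reward is one that a program can check against the task itself rather than one predicted by a learned model. The sketch below shows what a multi-component trajectory reward for a grid maze could look like. It assumes the trajectory has already been extracted from the generated video as a list of grid cells, and the three components and their weights are hypothetical; the abstract does not describe the components Wan-R1 actually uses.

```python
def trajectory_reward(grid, path, goal, weights=(0.3, 0.3, 0.4)):
    """Score a path through a grid maze with three checkable components.

    `grid` is a list of strings where '#' marks a wall; `path` is a list of
    (row, col) cells. The weights are assumed values for illustration.
    """
    if not path:
        return 0.0

    def is_free(cell):
        r, c = cell
        return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#"

    # Component 1: fraction of visited cells that are inside the maze and not walls.
    free_frac = sum(is_free(cell) for cell in path) / len(path)
    # Component 2: fraction of moves that step to a 4-connected neighbor.
    steps = list(zip(path, path[1:]))
    adjacent = sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 for a, b in steps)
    step_frac = adjacent / len(steps) if steps else 1.0
    # Component 3: goal reached by a fully valid path.
    reached = 1.0 if path[-1] == goal and free_frac == 1.0 and step_frac == 1.0 else 0.0

    w_free, w_step, w_goal = weights
    return w_free * free_frac + w_step * step_frac + w_goal * reached


maze = ["..#", "...", "#.."]
valid_path = [(0, 0), (1, 0), (1, 1), (1, 2), (2, 2)]
score = trajectory_reward(maze, valid_path, goal=(2, 2))
```

Because every component is computed from the maze layout, a policy cannot raise the score by producing output that merely looks plausible, which is the failure mode the abstract attributes to multimodal reward models.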
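[143] SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation cs.CV
Tripti Shukla, Zsolt Kira
TL;DR: This paper proposes SAGE (Sink-Aware Grounded Decoding), a framework that mitigates hallucination in large vision-language models (VLMs) by dynamically modulating self-attention during generation. It uses attention sink tokens (punctuation and other function tokens on which attention concentrates) as anchors to monitor the visual grounding reliability of the generated content in real time, and suppresses hallucination by adaptively adjusting attention distributions, with no retraining or architectural changes.
Details
Motivation: Existing methods usually address VLM hallucination through post-hoc filtering, additional training objectives, or external verification, but do not intervene during decoding at the moment hallucinations arise. This work intervenes dynamically during generation to mitigate hallucination more directly.
Result: Extensive experiments on multiple hallucination benchmarks show that SAGE consistently outperforms existing decoding strategies, with average relative improvements of 10.65% on MSCOCO and 7.19% on AMBER, substantially reducing hallucination while preserving descriptive coverage.
Insight: The novelty is the observation that hallucinations correlate strongly with attention sink tokens, and the use of those tokens as anchors for real-time monitoring and intervention. Visual grounding is estimated by combining self-attention maps with gradient-based attribution and measuring their spatial agreement, and attention distributions are then adjusted adaptively. This is a new training-free, decoding-time intervention strategy.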
Abstract: Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process when hallucinations arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during generation. Hallucinations are strongly correlated with attention sink tokens, that is, punctuation or function tokens that accumulate disproportionate attention despite carrying limited semantic content. SAGE leverages these tokens as anchors to monitor grounding reliability in real time. At each sink trigger, the method extracts semantic concepts from the generated sequence, estimates their visual grounding using both self-attention maps and gradient-based attribution, and measures their spatial agreement. Based on this signal, self-attention distributions are adaptively sharpened or broadened to reinforce grounded regions or suppress unreliable ones. Extensive experiments across diverse hallucination benchmarks demonstrate that SAGE consistently outperforms existing decoding strategies, achieving substantial reductions in hallucination while preserving descriptive coverage, without requiring model retraining or architectural modifications. Our method achieves an average relative improvement of 10.65% on MSCOCO and 7.19% on AMBER across diverse VLM architectures, demonstrating consistent gains in hallucination mitigation.
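[144] Rényi Entropy: A New Token Pruning Metric for Vision Transformers cs.CV
Wei-Yuan Su, Ruijie Zhang, Zheng Zhang
TL;DR: This paper proposes Col-Ln, a new training-free token importance metric based on Rényi entropy, to accelerate inference in Vision Transformers (ViTs) and large vision-language models (LVLMs). It targets the unreliability of [CLS]-token-based importance estimates in early layers, so that token pruning retains information more accurately and runs more efficiently.
Details
Motivation: Self-attention in Vision Transformers has O(N²) computational complexity, which makes inference on high-resolution inputs expensive. Existing token pruning methods estimate importance from the [CLS] token, whose semantic representation is still immature in early layers, which can lead to inaccurate estimates and unnecessary information loss.
Result: Extensive experiments on ViTs and LVLMs show that the method consistently outperforms state-of-the-art pruning methods across multiple benchmarks.
Insight: The novelty lies in introducing Rényi entropy as a token importance metric for the first time. The metric needs no training and can identify informative tokens from the first layer of the network, which makes early-layer pruning more reliable and offers a new, theory-driven perspective on model acceleration.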
Abstract: Vision Transformers (ViTs) achieve state-of-the-art performance but suffer from the $O(N^2)$ complexity of self-attention, making inference costly for high-resolution inputs. To address this bottleneck, token pruning has emerged as a critical technique to accelerate inference. Most existing methods rely on the [CLS] token to estimate patch importance. However, we argue that the [CLS] token can be unreliable in early layers where semantic representations are still immature. As a result, pruning in early layers often leads to inaccurate importance estimation and unnecessary information loss. In this work, we propose a training-free token importance metric, namely Col-Ln, which is derived from Rényi entropy and identifies informative tokens from the first layer of the network, thereby enabling more reliable pruning in token reduction. Extensive experiments on ViTs and Large Vision-Language Models (LVLMs) demonstrate that our approach consistently outperforms state-of-the-art pruning methods across diverse benchmarks.
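For reference, the Rényi entropy of order α of a discrete distribution p is H_α(p) = log(Σ p_i^α) / (1 − α), which reduces to Shannon entropy as α approaches 1. The sketch below computes only this standard quantity. How Col-Ln is derived from it is not stated in the abstract, so the function below should not be read as the paper's metric, and the choice of α = 2 as the default is an assumption.

```python
import math


def renyi_entropy(probs, alpha=2.0):
    """Rényi entropy (in nats) of a discrete probability distribution.

    Zero-probability entries are skipped; alpha close to 1 falls back to the
    Shannon entropy, which is the limiting case of the definition.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    support = [p for p in probs if p > 0]
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in support)
    return math.log(sum(p ** alpha for p in support)) / (1.0 - alpha)


# A flat attention-like distribution versus a peaked one (toy values).
flat = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]
flat_entropy = renyi_entropy(flat)
peaked_entropy = renyi_entropy(peaked)
```

A uniform distribution over N outcomes has entropy log N for every order, while a peaked distribution scores lower, which is the property that makes entropy-style quantities usable as a concentration signal.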
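[145] BINO: Encoder Centric Self Supervised Stereo With Native Pair Input cs.CV
Haokun Zhou
TL;DR: BINO is an encoder-centric self-supervised pretraining method for stereo vision. It fuses the rectified stereo pair at the input stage into stereo micro-cell tokens and uses a row-aware patch-phase positional encoding, so that a compact encoder learns strong binocular structure by itself. In a low-resource setting with pretraining only on the KITTI object dataset, it achieves the best or near-best frozen descriptor performance on several tasks.
Details
Motivation: Stereo needs features that preserve fine cross-view correspondence, not just semantic similarity. Existing self-supervised vision models transfer well but are not designed for this goal, while geometry-focused methods usually depend on a binocular decoder or an explicit linkage module during pretraining. BINO asks whether strong binocular structure can be learned inside a compact encoder.
Result: In a strict low-resource setting with pretraining only on the KITTI object dataset, BINO gives the best frozen descriptor results among all compared baselines under a no-linkage probe on proxy dense stereo, hard negative retrieval, and KITTI Stereo 2012 disparity estimation. With the same lightweight stereo head, its performance is close to CroCo v2 while using a much smaller encoder. Supplementary transfer experiments on KITTI Stereo 2015 show the same qualitative trend.
Insight: The core novelty is to internalize binocular structure learning in the encoder, through input-stage fusion of the stereo pair (forming stereo micro-cell tokens) and a row-aware patch-phase positional encoding. This challenges the usual assumption that cross-view reasoning needs a separate linkage module, suggests that much of it can be integrated into a compact and reusable encoder, and points to more efficient stereo vision system designs.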
Abstract: Stereo needs features that preserve fine cross view correspondence rather than only semantic similarity. Recent self supervised vision models transfer well, but they are not built for this goal, and geometry focused methods often rely on a binocular decoder or another explicit linkage module during pretraining. BINO asks whether strong binocular structure can instead be learned inside a compact encoder. It does this by fusing the rectified pair at the input stage, forming stereo micro cell tokens, and using a row aware patch phase positional encoding. Training uses one view masked token only distillation together with occlusion and view specific appearance mismatch. In a strict low resource setting with pretraining only on KITTI object, BINO gives the best frozen descriptor results under a no linkage probe among all compared baselines on proxy dense stereo, hard negative retrieval, and KITTI Stereo 2012 disparity. With the same lightweight stereo head for every encoder, it stays near CroCo v2 while using a much smaller encoder. Supplementary transfer experiments on KITTI Stereo 2015 show the same qualitative trend. These results suggest that much of the cross view reasoning often assigned to a separate linkage module can be learned inside a compact and reusable encoder.
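[146] Spatial Orthogonal Refinement for Robust RGB-Event Visual Object Tracking cs.CV
Dexing Huang, Shiao Wang, Fan Zhang, Xiao Wang
TL;DR: This paper proposes SOR-Track, a robust RGB-Event visual object tracking framework. Its Spatial Orthogonal Refinement (SOR) module uses the directional geometric priors inherent in event streams to rectify degraded RGB features, yielding more robust tracking in high-speed motion scenarios.
Details
Motivation: In high-speed motion scenarios, conventional RGB sensors suffer from severe motion blur and tracking performance degrades. Existing RGB-Event fusion methods do not explicitly exploit the directional geometric priors encoded in event streams to rectify RGB features.
Result: Extensive experiments on the large-scale FE108 benchmark show that SOR-Track consistently outperforms existing fusion-based trackers under motion blur and low-light conditions, reaching state-of-the-art results.
Insight: The novelty is the Spatial Orthogonal Refinement module, which uses dynamically guided orthogonal directional filters to extract sharp, motion-consistent structural responses from event streams and explicitly bridges the structural discrepancy between the two modalities through an asymmetric structural modulation mechanism, providing a physics-grounded approach to multi-modal feature alignment and texture rectification.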
Abstract: Robust visual object tracking (VOT) remains challenging in high-speed motion scenarios, where conventional RGB sensors suffer from severe motion blur and performance degradation. Event cameras, with microsecond temporal resolution and high dynamic range, provide complementary structural cues that can potentially compensate for these limitations. However, existing RGB-Event fusion methods typically treat event data as dense intensity representations and adopt black-box fusion strategies, failing to explicitly leverage the directional geometric priors inherently encoded in event streams to rectify degraded RGB features. To address this limitation, we propose SOR-Track, a streamlined framework for robust RGB-Event tracking based on Spatial Orthogonal Refinement (SOR). The core SOR module employs a set of orthogonal directional filters that are dynamically guided by local motion orientations to extract sharp and motion-consistent structural responses from event streams. These responses serve as geometric anchors to modulate and refine aliased RGB textures through an asymmetric structural modulation mechanism, thereby explicitly bridging structural discrepancies between two modalities. Extensive experiments on the large-scale FE108 benchmark demonstrate that SOR-Track consistently outperforms existing fusion-based trackers, particularly under motion blur and low-light conditions. Despite its simplicity, the proposed method offers a principled and physics-grounded approach to multi-modal feature alignment and texture rectification. The source code of this paper will be released on https://github.com/Event-AHU/OpenEvTracking
[147] FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation cs.CVPDF
Liuzhou Zhang, Zeyu Zhang, Biao Wu, Luyao Tang, Zirui Song
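TL;DR: This paper proposes FlashSign, a new pose-free framework for real-time sign language video generation. It drops the complex intermediate pose representations that earlier methods depend on and uses a diffusion-based approach to map natural language text directly to sign language video. Its two core contributions are a pose-free generative model built on a state-of-the-art diffusion backbone and a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality, giving a 3.07x generation speedup while maintaining quality.
Details
Motivation: Existing sign language video generation models usually rely on complex intermediate representations such as pose estimation, which limits their flexibility and efficiency. The paper proposes a real-time generation framework that needs no intermediate pose representation, with the aim of bridging the communication gap for deaf and hard-of-hearing communities.
Result: The method increases video generation speed by 3.07x while maintaining video quality, making real-time deployment feasible.
Insight: The main innovations are (1) a pose-free generative model that learns implicit text-to-gesture alignment without pose estimation, and (2) T-STA, which integrates trainable sparsity into both training and inference, eliminating the train-test gap and substantially reducing computational overhead while keeping the two phases consistent. This opens a path to real-time, high-quality, pose-free sign language synthesis.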
Abstract: Sign language plays a crucial role in bridging communication gaps between the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model based on a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free sparsity approaches, T-STA integrates trainable sparsity into both training and inference, ensuring consistency and eliminating the train-test gap. This approach significantly reduces computational overhead while maintaining high generation quality, making real-time deployment feasible. Our method increases video generation speed by 3.07x without compromising video quality. Our contributions open new avenues for real-time, high-quality, pose-free sign language synthesis, with potential applications in inclusive communication tools for diverse communities. Code: https://github.com/AIGeeksGroup/FlashSign.
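The locality idea behind sliding tile attention can be illustrated with a small sketch. This is not the FlashSign implementation: the grid size, the window size, the function names, and the choice to clamp the window at the grid border are all assumptions made for illustration, and the trainable part of T-STA is not modeled. The sketch only shows which key tiles a query tile would attend to when attention is restricted to a local window of tiles over (time, height, width), and how sparse that is relative to full attention.

```python
from itertools import product

def tile_neighbors(tile, grid, window):
    """Return the key tiles a query tile attends to under a local tile window.

    tile   -- (t, h, w) index of the query tile
    grid   -- (T, H, W) number of tiles along each axis
    window -- (wt, wh, ww) odd window sizes in tiles, each <= the grid size
    """
    ranges = []
    for idx, size, win in zip(tile, grid, window):
        half = win // 2
        # Assumption: clamp the window centre so the window stays inside the grid.
        centre = min(max(idx, half), size - 1 - half)
        ranges.append(range(centre - half, centre + half + 1))
    return list(product(*ranges))

def attention_density(grid, window):
    """Fraction of (query tile, key tile) pairs kept, relative to full attention."""
    total = grid[0] * grid[1] * grid[2]
    kept = sum(len(tile_neighbors(t, grid, window))
               for t in product(*(range(n) for n in grid)))
    return kept / (total * total)

if __name__ == "__main__":
    # Hypothetical 4 x 6 x 6 tile grid with a 3 x 3 x 3 window.
    print(len(tile_neighbors((0, 0, 0), (4, 6, 6), (3, 3, 3))))
    print(attention_density((4, 6, 6), (3, 3, 3)))
```

With these hypothetical sizes every query tile attends to 27 of 144 tiles, so under 19% of the tile pairs of full attention are computed. The speedup a real kernel achieves depends on the implementation and is not predicted by this count.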
[148] A Cross-Scale Decoder with Token Refinement for Off-Road Semantic Segmentation cs.CVPDF
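Seongkyu Choi, Jhonghyun An
TL;DR: This paper proposes a cross-scale decoder for off-road semantic segmentation. Three mechanisms (global-local token refinement, a gated detail bridge, and uncertainty-guided class-aware point refinement) address the incoherent boundaries, hard-to-segment rare structures, and noise amplification caused by irregular terrain, vegetation clutter, and ambiguous annotation, yielding noise-robust, boundary-preserving segmentation.
Details
Motivation: Off-road semantic segmentation is fundamentally challenged by irregular terrain, vegetation clutter, and inherent annotation ambiguity. Transition regions between classes are thick and uncertain, and rare or thin structures receive sparse, unreliable supervision. Existing decoder designs either oversmooth fine detail or repeatedly fuse high-detail features, which amplifies noise and increases computational cost.
Result: Experiments on standard off-road benchmarks show consistent improvements over prior approaches without relying on heavy dense feature fusion.
Insight: The novelty lies in three components: global-local token refinement, which consolidates semantic context on a compact bottleneck lattice under boundary-aware regularization; a gated detail bridge, which injects fine-scale structural cues only once through cross-scale attention to avoid noise accumulation; and uncertainty-guided class-aware point refinement, which selectively updates the least reliable pixels to improve rare and ambiguous structures efficiently.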
Abstract: Off-road semantic segmentation is fundamentally challenged by irregular terrain, vegetation clutter, and inherent annotation ambiguity. Unlike urban scenes with crisp object boundaries, off-road environments exhibit strong class-level similarity among terrain categories, resulting in thick and uncertain transition regions that degrade boundary coherence and destabilize training. Rare or thin structures, such as narrow traversable gaps or isolated obstacles, further receive sparse and unreliable supervision and are easily overwhelmed by dominant background textures. Existing decoder designs either rely on low-scale bottlenecks that oversmooth fine structural details, or repeatedly fuse high-detail features, which tends to amplify annotation noise and incur substantial computational cost. We present a cross-scale decoder that explicitly addresses these challenges through three complementary mechanisms. First, a global–local token refinement module consolidates semantic context on a compact bottleneck lattice, guided by boundary-aware regularization to remain robust under ambiguous supervision. Second, a gated detail bridge selectively injects fine-scale structural cues only once through cross-scale attention, preserving boundary and texture information while avoiding noise accumulation. Third, an uncertainty-guided class-aware point refinement selectively updates the least reliable pixels, improving rare and ambiguous structures with minimal computational overhead. The resulting framework achieves noise-robust and boundary-preserving segmentation tailored to off-road environments, recovering fine structural details while maintaining deployment-friendly efficiency. Experimental results on standard off-road benchmarks demonstrate consistent improvements over prior approaches without resorting to heavy dense feature fusion.
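The third mechanism, updating only the least reliable pixels, can be sketched as follows. The abstract does not say how reliability is measured, so this example assumes Shannon entropy of the per-pixel class distribution as the uncertainty score and picks the top-k pixels. The function names are hypothetical, and the class-aware part of the selection is not modeled.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_uncertain_points(prob_map, k):
    """Return the (row, col) positions of the k highest-entropy pixels.

    prob_map -- 2-D list; prob_map[r][c] is a list of class probabilities
    """
    scored = [(entropy(prob_map[r][c]), r, c)
              for r in range(len(prob_map))
              for c in range(len(prob_map[r]))]
    # Highest uncertainty first; only these pixels would be re-predicted.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(r, c) for _, r, c in scored[:k]]

if __name__ == "__main__":
    # Hypothetical 2 x 2 map with two classes.
    probs = [[[0.98, 0.02], [0.5, 0.5]],
             [[0.7, 0.3], [0.9, 0.1]]]
    print(select_uncertain_points(probs, 2))
```

Refining only k points keeps the extra cost proportional to k and independent of image resolution, which matches the "minimal computational overhead" the abstract claims for this step.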
[149] JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding cs.CV | cs.AIPDF
Koki Maeda, Naoaki Okazaki
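TL;DR: This paper introduces JaWildText, a diagnostic benchmark for evaluating vision-language models on Japanese scene text understanding. It contains 3,241 instances drawn from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. Three complementary tasks (dense scene text visual question answering, receipt key information extraction, and handwriting OCR) test models across different visual organizations, output formats, and writing styles.
Details
Motivation: Existing multilingual benchmarks and Japanese visual text datasets do not adequately capture the particular challenges of Japanese scene text (mixed scripts, vertical writing, a very large character inventory), and in-the-wild scene text remains underexplored. A dedicated benchmark is needed to evaluate VLMs in this setting.
Result: Among the 14 open-weight VLMs evaluated, the best model achieves an average score of only 0.64 across the three tasks. Error analysis shows that recognition, especially of kanji, is the dominant bottleneck.
Insight: The novelty is the first comprehensive diagnostic benchmark focused on Japanese scene text understanding. Its three complementary tasks enable fine-grained, script-aware evaluation, fill a data gap in this area, and point to clear directions for model improvement.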
Abstract: Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing style: (i) Dense Scene Text Visual Question Answering (STVQA), which requires reasoning over multiple pieces of visual text evidence; (ii) Receipt Key Information Extraction (KIE), which tests layout-aware structured extraction from mobile-captured receipts; and (iii) Handwriting OCR, which evaluates page-level transcription across various media and writing directions. We evaluate 14 open-weight VLMs and find that the best model achieves an average score of 0.64 across the three tasks. Error analyses show recognition remains the dominant bottleneck, especially for kanji. JaWildText enables fine-grained, script-aware diagnosis of Japanese scene text capabilities, and will be released with evaluation code.
[150] RehearsalNeRF: Decoupling Intrinsic Neural Fields of Dynamic Illuminations for Scene Editing cs.CVPDF
Changyeon Won, Hyunjun Jung, Jungu Cho, Seonmi Park, Chi-Hoon Lee
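TL;DR: This paper proposes RehearsalNeRF, which decouples a scene's intrinsic neural radiance field under dynamic illumination changes to enable scene editing. It uses rehearsal scenes captured under stable lighting to enforce geometric consistency, represents illumination colors with a learnable vector to separate lighting from scene radiance, and adds optical-flow regularization to handle dynamic objects, giving robust novel view synthesis and scene editing under dynamic illumination.
Details
Motivation: Existing neural radiance field methods handle dynamic illumination changes poorly. Scene radiance is highly entangled with the subjects' own emitted radiance and with lighting colors in the spatio-temporal domain, which makes disentanglement difficult and limits scene editing.
Result: RehearsalNeRF shows robust performance on novel view synthesis and scene editing under dynamic illumination. The abstract gives no specific quantitative results.
Insight: The innovations are the use of rehearsal scenes (stable lighting) to enforce geometric consistency, a learnable lighting vector that disentangles illumination color from scene radiance, and optical-flow regularization for dynamic objects. Together they offer a new approach to disentangling neural fields under dynamic illumination.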
Abstract: Although there has been significant progress in neural radiance fields, an issue on dynamic illumination changes still remains unsolved. Different from relevant works that parameterize time-variant/-invariant components in scenes, subjects’ radiance is highly entangled with their own emitted radiance and lighting colors in spatio-temporal domain. In this paper, we present a new effective method to learn disentangled neural fields under the severe illumination changes, named RehearsalNeRF. Our key idea is to leverage scenes captured under stable lighting like rehearsal stages, easily taken before dynamic illumination occurs, to enforce geometric consistency between the different lighting conditions. In particular, RehearsalNeRF employs a learnable vector for lighting effects which represents illumination colors in a temporal dimension and is used to disentangle projected light colors from scene radiance. Furthermore, our RehearsalNeRF is also able to reconstruct the neural fields of dynamic objects by simply adopting off-the-shelf interactive masks. To decouple the dynamic objects, we propose a new regularization leveraging optical flow, which provides coarse supervision for the color disentanglement. We demonstrate the effectiveness of RehearsalNeRF by showing robust performances on novel view synthesis and scene editing under dynamic illumination conditions. Our source code and video datasets will be publicly available.
[151] ExFusion: Efficient Transformer Training via Multi-Experts Fusion cs.CVPDF
Jiacheng Ruan, Daize Dong, Xiaoye Qu, Tong Zhu, Ting Liu
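TL;DR: ExFusion is a new pre-training method that improves Transformer training efficiency through multi-expert fusion. At initialization it upcycles the Transformer's feed-forward network (FFN) into a multi-expert configuration and assigns each expert a weight. During training these weights fuse the experts into a single unified expert, so the compute cost stays close to standard dense training. After training, the learned weights merge the experts into one, eliminating extra storage and deployment overhead.
Details
Motivation: Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures, but training them directly requires considerable compute and adds parameter storage and deployment overhead. A method is needed that exploits MoE's multi-expert capability while incurring minimal extra cost.
Result: Extensive experiments on a range of computer vision and natural language processing tasks demonstrate the method's effectiveness. ExFusion introduces multi-expert characteristics at a compute cost only marginally above standard dense training.
Insight: The novelty is an efficient pre-training strategy that emulates multi-expert capacity during training through weight fusion while keeping only a single expert at deployment, balancing performance and efficiency. Viewed objectively, it offers a way to use MoE advantages in resource-constrained settings, and its dynamic weight assignment and parameter fusion could be reused to optimize other training pipelines.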
Abstract: Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. Therefore, it is critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion, which improves the efficiency of Transformer training through multi-expert fusion. Specifically, during the initialization phase, ExFusion upcycles the feed-forward network (FFN) of the Transformer into a multi-expert configuration, where each expert is assigned a weight for later parameter fusion. During training, these weights allow multiple experts to be fused into a single unified expert equivalent to the original FFN, which is subsequently used for forward computation. As a result, ExFusion introduces multi-expert characteristics into the training process while incurring only marginal computational cost compared to standard dense training. After training, the learned weights are used to integrate multi-experts into a single unified expert, thereby eliminating additional overhead in storage and deployment. Extensive experiments on a variety of computer vision and natural language processing tasks demonstrate the effectiveness of the proposed method.
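The fusion step described above can be sketched for a single linear layer. The abstract does not give the fusion rule, so this example assumes the simplest one, a weighted sum of expert parameter matrices, W = sum_i a_i * W_i. The function names and toy values are hypothetical. Because a linear layer is linear in its parameters, one forward pass through the fused matrix equals the weighted sum of the experts' outputs, which is why the forward cost matches a single dense layer.

```python
def fuse_experts(experts, weights):
    """Fuse expert weight matrices into one matrix: W = sum_i a_i * W_i.

    Assumption: fusion is a weighted sum in parameter space.
    """
    rows, cols = len(experts[0]), len(experts[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for matrix, a in zip(experts, weights):
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += a * matrix[r][c]
    return fused

def linear(matrix, x):
    """Matrix-vector product, i.e. one bias-free linear layer."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

if __name__ == "__main__":
    # Two hypothetical 2 x 2 experts with fusion weights 0.75 and 0.25.
    experts = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.0, 2.0], [2.0, 0.0]]]
    weights = [0.75, 0.25]
    fused = fuse_experts(experts, weights)
    x = [2.0, 4.0]
    print(linear(fused, x))  # one matmul through the fused expert
    # Same result as mixing the outputs of each expert separately:
    print([sum(a * y for a, y in zip(weights, ys))
           for ys in zip(*(linear(e, x) for e in experts))])
```

The equivalence shown here holds for the linear maps only. A full FFN has a nonlinearity between its two projections, so fusing parameters is not the same as averaging expert outputs, and how ExFusion handles that is not stated in the abstract.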
[152] Learning Multi-View Spatial Reasoning from Cross-View Relations cs.CVPDF
Suchae Jeong, Jaehwi Song, Haeone Lee, Hanna Kim, Jian Kim
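TL;DR: This paper presents Cross-View Relations (XVR), a large-scale dataset designed to strengthen the multi-view spatial reasoning of vision-language models. Models fine-tuned on XVR improve substantially on benchmarks such as MindCube and RoboSpatial, and their representations raise success rates on embodied AI tasks such as RoboCasa.
Details
Motivation: Existing vision-language models perform well on single-view tasks but lack multi-view spatial reasoning, which embodied AI systems need in order to understand 3D environments and manipulate objects across viewpoints.
Result: VLMs fine-tuned on XVR achieve substantial improvements on the multi-view and robotic spatial reasoning benchmarks MindCube and RoboSpatial. When used as the backbone of a Vision-Language-Action model, they improve success rates on RoboCasa.
Insight: The core contribution is XVR, a large-scale dataset built specifically around multi-view spatial relations (correspondence, verification, localization). It shows that explicit training on cross-view spatial relations improves multi-view reasoning and transfers to real-world robotic manipulation.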
Abstract: Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as backbones in Vision-Language-Action models, XVR-trained representations improve success rates on RoboCasa. Our results demonstrate that explicit training on cross-view spatial relations significantly enhances multi-view reasoning and transfers effectively to real-world robotic manipulation.
[153] Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation cs.CVPDF
Jiachen Li, Hongyun Wang, Jinyu Xu, Wenbo Jiang, Yanchun Ma
TL;DR: This paper proposes PPCR, a progressive prompt-guided cross-modal reasoning framework for referring image segmentation. A multimodal large language model generates semantic segmentation prompts and spatial segmentation prompts, forming a progressive pipeline from semantic understanding to spatial grounding to instance segmentation that aligns the language description with the target region more precisely.
Details
Motivation: The core challenge in referring image segmentation is bridging linguistic descriptions and object-level visual representations, especially when expressions involve detailed attributes and complex inter-object relationships. Existing methods lack an explicit reasoning mechanism for grounding the description in the target region of the image.
Result: Extensive experiments on standard referring image segmentation benchmarks show that PPCR consistently outperforms existing methods.
Insight: The novelty is a structured progressive reasoning pipeline that decouples semantic understanding from spatial grounding and uses generated prompts (Semantic and Spatial Segmentation Prompts) to guide segmentation explicitly. This gives an interpretable reasoning mechanism for visual grounding under complex language descriptions.
Abstract: Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate Semantic Segmentation Prompts that capture key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompts are further generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation Prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.
[154] CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition cs.CVPDF
Muhammad Osama Zeeshan, Masoumeh Sharafi, Benoît Savary, Alessandro Lameiras Koerich, Marco Pedersoli
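TL;DR: This paper proposes two methods, CLIP-AU and CLIP-AUTT, for fine-grained video emotion recognition. CLIP-AU uses Action Units (AUs) as structured text prompts, integrating interpretable AU semantics into CLIP to learn generic, subject-agnostic representations. CLIP-AUTT adds video-based test-time personalization: entropy-guided temporal window selection and prompt tuning adapt the model dynamically to videos of unseen subjects, giving subject-specific adaptation while preserving temporal consistency.
Details
Motivation: Existing CLIP-based emotion recognition methods rely either on CLIP's contrastive pretraining or on LLM-generated descriptive text prompts. These are noisy, computationally expensive, and miss fine-grained expressions, which degrades performance. The paper uses Action Units (AUs) as structured text prompts to model fine-grained facial expressions, providing localized, interpretable semantic cues for more robust emotion recognition.
Result: Extensive experiments on three challenging video-based fine-grained emotion recognition datasets (BioVid, StressID, and BAH) show that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based facial expression recognition methods and test-time adaptation methods, achieving robust and personalized fine-grained emotion recognition.
Insight: The novelty is introducing Action Units (AUs) into the CLIP framework as structured text prompts to capture the semantics of fine-grained facial expressions, and the test-time personalization method CLIP-AUTT, which achieves subject-specific adaptation through entropy-guided temporal window selection and prompt tuning. The approach avoids CLIP fine-tuning and LLM-generated text supervision while improving adaptability and performance.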
Abstract: Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP’s contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.
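Entropy-guided temporal window selection can be sketched as follows. The abstract does not specify the selection criterion, so this example assumes the common test-time-adaptation heuristic of preferring the contiguous window whose frames have the lowest mean prediction entropy, that is, the most confident predictions. The function names and the toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one frame's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_confident_window(frame_probs, window):
    """Return (start, mean_entropy) of the contiguous window with lowest mean entropy.

    frame_probs -- list of per-frame class-probability lists
    window      -- window length in frames (<= number of frames)
    """
    scores = [entropy(p) for p in frame_probs]
    best_start, best_score = 0, float("inf")
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean < best_score:
            best_start, best_score = start, mean
    return best_start, best_score

if __name__ == "__main__":
    # Hypothetical per-frame predictions over two emotion classes.
    frames = [[0.5, 0.5], [0.6, 0.4], [0.95, 0.05], [0.9, 0.1], [0.5, 0.5]]
    start, score = most_confident_window(frames, 2)
    print(start, round(score, 3))
```

Selecting a contiguous window rather than individual frames keeps neighbouring frames together, which is one plausible way to preserve the temporal consistency the abstract mentions.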
[155] DipGuava: Disentangling Personalized Gaussian Features for 3D Head Avatars from Monocular Video cs.CVPDF
Jeonghaeng Lee, Seok Keun Choi, Zhixuan Li, Weisi Lin, Sanghoon Lee
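TL;DR: This paper proposes DipGuava, a new method for generating 3D Gaussian head avatars with personalized attributes from monocular video. It is the first to explicitly disentangle facial appearance into two complementary components, using a two-stage training pipeline that reduces learning ambiguity and improves reconstruction fidelity. The first stage learns a stable geometry-driven base appearance that captures global facial structure and coarse expression-dependent variation. The second stage predicts the personalized residual details the first stage misses, including high-frequency components and nonlinearly varying features such as wrinkles and subtle skin deformations. Dynamic appearance fusion integrates the residual details after deformation, ensuring spatial and semantic alignment.
Details
Motivation: Existing 3D head avatar methods often fail to capture personalized details when animating facial dynamics, which limits realism and expressiveness. The paper aims to fill this gap and generate realistic avatars with personalized attributes.
Result: In extensive experiments, DipGuava consistently outperforms prior methods in both visual quality and quantitative performance, producing photorealistic, identity-preserving avatars.
Insight: The novelty is the first explicit disentanglement of facial appearance into a geometry-driven base appearance and personalized residual details, a two-stage training pipeline that reduces ambiguity, and dynamic appearance fusion that ensures alignment, together improving reconstruction of personalized detail.
Abstract: While recent 3D head avatar creation methods attempt to animate facial dynamics, they often fail to capture personalized details, limiting realism and expressiveness. To fill this gap, we present DipGuava (Disentangled and Personalized Gaussian UV Avatar), a novel 3D Gaussian head avatar creation method that successfully generates avatars with personalized attributes from monocular video. DipGuava is the first method to explicitly disentangle facial appearance into two complementary components, trained in a structured two-stage pipeline that significantly reduces learning ambiguity and enhances reconstruction fidelity. In the first stage, we learn a stable geometry-driven base appearance that captures global facial structure and coarse expression-dependent variations. In the second stage, the personalized residual details not captured in the first stage are predicted, including high-frequency components and nonlinearly varying features such as wrinkles and subtle skin deformations. These components are fused via dynamic appearance fusion that integrates residual details after deformation, ensuring spatial and semantic alignment. This disentangled design enables DipGuava to generate photorealistic, identity-preserving avatars, consistently outperforming prior methods in both visual quality and quantitative performance, as demonstrated in extensive experiments.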
[156] Energy-Aware Imitation Learning for Steering Prediction Using Events and Frames cs.CVPDF
Hu Cao, Jiong Liu, Xingzhuo Yan, Rui Song, Yan Xia
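TL;DR: This paper proposes an energy-aware imitation learning framework for steering prediction in autonomous driving that combines the strengths of event cameras and conventional frame cameras. An Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder are designed to produce reliable and safe predictions. Experiments on two public real-world datasets, DDD20 and DRFuser, show that the method outperforms existing state-of-the-art approaches.
Details
Motivation: In autonomous driving, relying only on frame-based cameras can cause inaccuracies due to long exposure times, high-speed motion, and challenging lighting. To address this, the paper brings in the event camera, a bio-inspired vision sensor whose sparse, asynchronous events serve as a complementary modality that mitigates these problems.
Result: Extensive experiments on the two public real-world datasets DDD20 and DRFuser show that the method surpasses existing state-of-the-art (SOTA) approaches.
Insight: The innovations are (1) an energy-aware imitation learning framework that combines event and frame data, and (2) the Energy-driven Cross-modality Fusion Module (ECFM) and energy-aware decoder for reliable and safe prediction. Viewed objectively, the method makes effective use of the event camera's high dynamic range and low latency to complement frame cameras and improve the robustness of steering prediction.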
Abstract: In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors like long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we introduce a bio-inspired vision sensor known as the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events that provide a complementary modality to mitigate these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The codes and trained models will be released upon acceptance.
[157] Physically Inspired Gaussian Splatting for HDR Novel View Synthesis cs.CVPDF
Huimin Zeng, Yue Bai, Hailing Wang, Yun Fu
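TL;DR: This paper proposes PhysHDR-GS, a physically inspired framework for HDR novel view synthesis that addresses the difficulty existing methods have in capturing ambient-illumination-dependent appearance and in correcting abnormal HDR values. It models scene appearance through intrinsic reflectance and adjustable ambient illumination, and uses complementary image-exposure and Gaussian-illumination branches to reproduce standard camera observations faithfully and to capture illumination changes, respectively.
Details
Motivation: Existing high dynamic range novel view synthesis methods reconstruct scenes with dynamic details by fusing multi-exposure low dynamic range views, but struggle to capture appearance that depends on ambient illumination. Supervising HDR content implicitly by constraining tone-mapped results cannot correct abnormal HDR values and leaves Gaussians in under- and over-exposed regions with limited gradients.
Result: Experiments on realistic and synthetic datasets show superior reconstruction of HDR detail (for example, a PSNR gain of 2.04 dB over HDR-GS) while maintaining real-time rendering speed (up to 76 FPS).
Insight: The novelty is a physically inspired model that decomposes scene appearance into intrinsic reflectance and adjustable ambient illumination; a complementary two-branch architecture (IE and GI branches) that handles standard observations and illumination changes separately; a cross-branch HDR consistency loss that supervises HDR content explicitly; and an illumination-guided gradient scaling strategy that mitigates exposure-biased gradient starvation and under-densified representations.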
Abstract: High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations. Experimental results across realistic and synthetic datasets demonstrate our superiority in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS). Code and models are available at https://huimin-zeng.github.io/PhysHDR-GS/.
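To put the reported 2.04 dB PSNR gain in perspective, the standard definition PSNR = 10 * log10(peak^2 / MSE) lets a dB gain be converted into a ratio of mean squared errors. The sketch below uses only that textbook formula. It assumes both methods are scored on the same images with the same peak value, and it ignores the fact that a PSNR averaged over many images does not map exactly onto one MSE ratio.

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

if __name__ == "__main__":
    ratio = mse_ratio_for_gain(2.04)
    print(round(ratio, 3))            # MSE reduction factor
    print(round(1.0 - 1.0 / ratio, 3))  # fraction of squared error removed
    # Sanity check on a hypothetical baseline MSE of 0.01:
    print(round(psnr(0.01 / ratio) - psnr(0.01), 2))
```

Under these assumptions a 2.04 dB gain corresponds to an MSE about 1.6 times smaller, or roughly 37% less squared error.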
[158] SegRGB-X: General RGB-X Semantic Segmentation Model cs.CVPDF
Jiong Liu, Yingjie Xu, Xingcheng Zhou, Rui Song, Walter Zimmer
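TL;DR: This paper proposes SegRGB-X, a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple sensor modalities (event, thermal, depth, polarization, and light field). Three key innovations, the Modality-aware CLIP (MA-CLIP), Modality-aligned Embeddings, and the Domain-specific Refinement Module (DSRM), address the challenges posed by differing sensor characteristics and reduce the redundant development effort of traditional setups. Evaluated on five diverse datasets, the model surpasses specialized multi-modal methods and reaches state-of-the-art performance with an mIoU of 65.03%.
Details
Motivation: Semantic segmentation across arbitrary sensor modalities is challenging because sensor characteristics vary widely, and traditional configurations for the task lead to redundant development effort.
Result: Evaluated on five diverse datasets with complementary modalities (event, thermal, depth, polarization, and light field), the model surpasses specialized multi-modal methods and achieves state-of-the-art performance with an mIoU of 65.03%.
Insight: The innovations are the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; Modality-aligned Embeddings that capture fine-grained features; and the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Viewed objectively, the unified architecture simplifies the multi-modal segmentation pipeline and improves generalization and efficiency.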
Abstract: Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and the traditional configurations for this task result in redundant development efforts. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with a mIoU of 65.03%. The codes will be released upon acceptance.
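For readers unfamiliar with the headline metric, mean intersection-over-union (mIoU) averages, over classes, the overlap between predicted and ground-truth pixels divided by their union. The sketch below is a minimal single-image version on flattened label lists. The function name is hypothetical, and the abstract does not say how the 65.03% figure is aggregated across the five datasets, so this only illustrates the metric itself. Benchmarks usually accumulate intersections and unions over a whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in the prediction or ground truth.

    pred, target -- flat lists of integer class labels of equal length
    """
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)

if __name__ == "__main__":
    # Hypothetical six-pixel example with three classes.
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2.
    print(mean_iou(pred, target, 3))
```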
[159] Adapting SAM to Nuclei Instance Segmentation and Classification via Cooperative Fine-Grained Refinement cs.CVPDF
Jingze Su, Tianle Zhu, Jiaxin Cai, Zhiyi Wang, Qi Li
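TL;DR: This paper proposes Cooperative Fine-Grained Refinement of SAM, a parameter-efficient fine-tuning framework that transfers the strong prior knowledge of the Segment Anything Model to nuclei instance segmentation. Three cooperating components (a Multi-scale Adaptive Local-aware Adapter, a Hierarchical Modulated Fusion Module, and Boundary-Guided Mask Refinement) strengthen perception of local structure, preserve spatial detail, and refine boundaries, enabling accurate nuclei instance segmentation and classification without full fine-tuning.
Details
Motivation: Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. SAM performs very well on natural image segmentation, but applying it directly to medical imaging has limitations: it lacks sufficient perception of the local structural features that matter for nuclei segmentation, and full fine-tuning for the downstream task is computationally expensive.
Result: The summary states that experiments on several public nuclei segmentation datasets show the method outperforming existing methods in parameter efficiency, segmentation accuracy, and boundary sharpness, reaching state-of-the-art performance. The abstract itself gives no quantitative results.
Insight: The novelty is a parameter-efficient fine-tuning framework that strengthens local perception with dynamically generated multi-scale convolutional kernels, preserves detail through hierarchical modulated fusion, and refines masks with explicit boundary-guided supervision, transferring SAM to medical imaging without the high cost of full fine-tuning.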
Abstract: Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. Recently, the Segment Anything Model has demonstrated exceptional performance in various segmentation tasks, leveraging its rich priors and powerful global context modeling capabilities derived from large-scale pre-training on natural images. However, directly applying SAM to the medical imaging domain faces significant limitations: it lacks sufficient perception of the local structural features that are crucial for nuclei segmentation, and full fine-tuning for downstream tasks requires substantial computational costs. To efficiently transfer SAM’s robust prior knowledge to nuclei instance segmentation while supplementing its task-aware local perception, we propose a parameter-efficient fine-tuning framework, named Cooperative Fine-Grained Refinement of SAM, consisting of three core components: 1) a Multi-scale Adaptive Local-aware Adapter, which enables effective capability transfer by augmenting the frozen SAM backbone with minimal parameters and instilling a powerful perception of local structures through dynamically generated, multi-scale convolutional kernels; 2) a Hierarchical Modulated Fusion Module, which dynamically aggregates multi-level encoder features to preserve fine-grained spatial details; and 3) a Boundary-Guided Mask Refinement, which integrates multi-context boundary cues with semantic features through explicit supervision, producing a boundary-focused signal to refine initial mask predictions for sharper delineation. These three components work cooperatively to enhance local perception, preserve spatial details, and refine boundaries, enabling SAM to perform accurate nuclei instance segmentation directly.
[160] Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting cs.CVPDF
Zhen Zou, Xiaoxiao Ma, Mingde Yao, Jie Huang, LinJiang Huang
TL;DR: Drift-AR is a single-step visual autoregressive generation method. It uses the prediction entropy of the autoregressive model as a unified signal to accelerate both the autoregressive stage and the visual decoding stage, achieving a 3.8-5.5x speedup while matching or surpassing the original models' quality.
Details
Motivation: Existing AR-Diffusion hybrid paradigms have a dual speed bottleneck: the sequential autoregressive stage and the iterative multi-step denoising of the visual decoding stage. Existing methods address each bottleneck in isolation and lack a unified design principle.
Result: On MAR, TransDiff, and NextStep-1, Drift-AR achieves a 3.8-5.5x speedup with genuine single-step (1-NFE) decoding, matching or surpassing the original models' quality.
Insight: The novelty is using the autoregressive model's prediction entropy as a unified acceleration signal: (1) for AR acceleration, entropy-informed speculative decoding aligns the draft and target entropy distributions through a causal-normalized entropy loss; (2) for visual decoding acceleration, entropy is reinterpreted as the physical variance of the initial state of an anti-symmetric drifting field, enabling single-step decoding without iterative denoising or distillation. Both stages share the same entropy signal, so the computational overhead is low.
Abstract: Autoregressive (AR)-Diffusion hybrid paradigms combine AR's structured semantic modeling with diffusion's high-fidelity synthesis, yet suffer from a dual speed bottleneck: the sequential AR stage and the iterative multi-step denoising of the diffusion vision decoding stage. Existing methods address each in isolation without a unified design principle. We observe that the per-position prediction entropy of continuous-space AR models naturally encodes spatially varying generation uncertainty, which simultaneously governs draft prediction quality in the AR stage and reflects the corrective effort required by the vision decoding stage, a connection that has not been fully explored before. Since entropy is inherently tied to both bottlenecks, it serves as a natural unifying signal for joint acceleration. In this work, we propose Drift-AR, which leverages the entropy signal to accelerate both stages: 1) for AR acceleration, we introduce Entropy-Informed Speculative Decoding that aligns draft–target entropy distributions via a causal-normalized entropy loss, resolving the entropy mismatch that causes excessive draft rejection; 2) for visual decoder acceleration, we reinterpret entropy as the physical variance of the initial state for an anti-symmetric drifting field (high-entropy positions activate stronger drift toward the data manifold while low-entropy positions yield vanishing drift), enabling single-step (1-NFE) decoding without iterative denoising or distillation. Moreover, both stages share the same entropy signal, which is computed once with no extra cost. Experiments on MAR, TransDiff, and NextStep-1 demonstrate a 3.8–5.5× speedup with genuine 1-NFE decoding, matching or surpassing original quality. Code will be available at https://github.com/aSleepyTree/Drift-AR.
[161] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation cs.CVPDF
Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu
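TL;DR: This paper proposes AIBench, the first benchmark to evaluate the logical consistency of generated academic illustrations through visual question answering, combined with vision-language models to assess aesthetic quality. It reveals the performance gaps and optimization challenges that current state-of-the-art models face in producing illustrations ready for use in papers.
Details
Motivation: Despite the rapid progress of image generation, whether it can produce illustrations ready for direct use in academic papers is largely unexplored. Evaluating directly with a vision-language model requires very strong multimodal understanding and is unreliable for long, complex texts and illustrations.
Result: Extensive experiments on AIBench show that the performance gap between models on this task is significantly larger than on general tasks, reflecting differences in their complex reasoning and high-density generation abilities. Logic and aesthetics are also hard to optimize simultaneously in the way handcrafted illustrations manage.
Insight: The novelty is the first VQA-based benchmark for the logical consistency of academic illustrations. Multi-scale questions are designed from a logic diagram summarized from each paper's method section, which reduces dependence on the judging VLM's ability and gives more accurate, detailed evaluation of visual-logical consistency. The study also finds that test-time scaling of both logic and aesthetics significantly improves task performance.
Abstract: Although image generation has boosted various applications via its rapid evolution, whether state-of-the-art models are able to produce ready-to-use academic illustrations for papers is still largely unexplored. Directly comparing or evaluating the illustration with a VLM is the naive approach, but it requires oracle-level multi-modal understanding ability, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark using VQA for evaluating the logic correctness of academic illustrations and VLMs for assessing aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method part of the paper, which query whether the generated illustration aligns with the paper on different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than on general ones, reflecting their differing complex reasoning and high-density generation abilities. Further, logic and aesthetics are hard to optimize simultaneously as in handcrafted illustrations. Additional experiments show that test-time scaling on both abilities significantly boosts performance on this task.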
[162] MolmoPoint: Better Pointing for VLMs with Grounding Tokens cs.CV | cs.AIPDF
Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang
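TL;DR: MolmoPoint proposes a new pointing mechanism for vision-language models (VLMs): instead of generating coordinates as text, the model generates special pointing tokens that directly select the visual tokens containing the target concept. It uses three levels of tokens for fine-grained selection and adds sequential generation, relative position encoding, and a no-more-points class. It sets new state-of-the-art results on image, GUI, and video pointing and shows higher sample efficiency.
Details
Motivation: Existing VLMs usually point by generating coordinate text, which requires learning a complicated coordinate system and consumes many tokens, making it inefficient. The paper designs a more intuitive and efficient pointing mechanism that operates directly on visual tokens to simplify learning and improve performance.
Result: On image pointing, it reaches a state-of-the-art 70.7% on PointBench. On GUI pointing, it reaches 61.1% on ScreenSpotPro, the state of the art among fully open models. On video pointing, it achieves a 59.1% human preference win rate against a text-coordinate baseline. On tracking, it gains 6.3% on Molmo2Track.
Insight: The innovations are replacing coordinate generation with pointing tokens that select visual tokens directly, which simplifies learning; a three-level token design for coarse-to-fine localization; and sequential generation, relative position encoding, and a no-more-points class, which improve coherence and efficiency. Viewed objectively, this tokenized pointing mechanism may fit the cross-modal attention architecture of VLMs more naturally, lowering learning difficulty and raising sample efficiency.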
Abstract: Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
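The coarse-to-fine selection described in the abstract (visual token, then subpatch, then location within the subpatch) can be illustrated with a small decoding sketch. This is not the authors' implementation: the grid width, patch size, subpatch grid, location grid, and the `NO_MORE_POINTS` sentinel are all hypothetical values chosen only to show how three hierarchical indices could map to one pixel position.

```python
NO_MORE_POINTS = -1  # hypothetical sentinel standing in for the no-more-points class


def decode_point(patch_idx, subpatch_idx, loc_idx,
                 grid_w=4, patch_px=16, sub_grid=2, loc_grid=2):
    """Map three hierarchical indices to a pixel centre (x, y)."""
    # Level 1: which visual token (patch), indexed in row-major order.
    py, px = divmod(patch_idx, grid_w)
    # Level 2: which subpatch inside the selected patch.
    sub_px = patch_px / sub_grid
    sy, sx = divmod(subpatch_idx, sub_grid)
    # Level 3: which cell inside the subpatch; return the centre of that cell.
    cell_px = sub_px / loc_grid
    ly, lx = divmod(loc_idx, loc_grid)
    x = px * patch_px + sx * sub_px + (lx + 0.5) * cell_px
    y = py * patch_px + sy * sub_px + (ly + 0.5) * cell_px
    return x, y


def decode_points(triples, **grid):
    """Decode points in generation order, stopping at the sentinel."""
    points = []
    for patch_idx, subpatch_idx, loc_idx in triples:
        if patch_idx == NO_MORE_POINTS:
            break
        points.append(decode_point(patch_idx, subpatch_idx, loc_idx, **grid))
    return points
```

With these assumed sizes, each point costs three tokens regardless of image resolution, whereas a text coordinate typically costs several tokens per number.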
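[163] LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization cs.CV | cs.MA
Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang
TL;DR: This paper proposes LogiStory, a logic-aware framework for multi-image story visualization that targets the broken logical flow and fragmented narratives of existing models when generating coherent visual sequences. It explicitly models visual logic with a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, turning narrative coherence from an implicit byproduct of image generation into an explicit modeling objective.
Details
Motivation: Despite progress in visual quality and world-knowledge integration, current multimodal systems still struggle to maintain logical flow when generating coherent, communicative visual sequences such as image sequences and videos, which leads to disjointed actions, fragmented narratives, and unclear storylines. The authors attribute this to neglect of visual logic, a critical yet underexplored dimension.
Result: Experiments show that the method significantly improves the narrative logic of generated visual stories. The authors also build the LogicTale benchmark, a set of richly annotated stories emphasizing causal reasoning and visual-logic interpretability, along with comprehensive automatic and human evaluation protocols that measure visual logic and perceptual quality.
Insight: The core novelty is treating visual logic, defined as the perceptual and causal coherence among characters, actions, and scenes over time, as an explicit modeling objective, with a multi-agent system bridging structured story planning and visual generation. This is a foundational step toward modeling and enforcing visual logic in general image-sequence and video generation.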
Abstract: Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate the generation capacity, we construct LogicTale, a benchmark comprising richly annotated stories, emphasizing causal reasoning, and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.
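[164] GEMS: Agent-Native Multimodal Generation with Memory and Skills cs.CV
Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu
TL;DR: This paper proposes GEMS, an agent-native multimodal generation framework built on three core components (Agent Loop, Agent Memory, and Agent Skill) to address the limitations of existing foundation models on complex instructions and downstream tasks.
Details
Motivation: Current multimodal generation models perform well on general-purpose tasks but still struggle with complex instructions and specialized downstream tasks, which motivates an agent framework that goes beyond the inherent limits of foundation models.
Result: Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant gains. Notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2.
Insight: The novelty is systematically bringing an agent framework (a structured multi-agent loop, hierarchical persistent memory, and an extensible skill library loaded on demand) to multimodal generation, which extends model capabilities beyond their original limits.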
Abstract: Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.
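[165] To View Transform or Not to View Transform: NeRF-based Pre-training Perspective cs.CV
Hyeonjun Jeong, Juyeb Shin, Dongsuk Kum
TL;DR: This paper proposes a NeRF-Resembled Point-based 3D detector (NeRP3D) to resolve the conflict of priors between view transformation and NeRF's continuous-representation assumption in existing NeRF-based pre-training for autonomous driving. The method keeps the pre-trained NeRF network and learns a continuous 3D representation, which avoids ambiguous representations and improves both scene reconstruction and downstream detection.
Details
Motivation: Existing methods apply NeRF to volumetric features obtained from view transformation, but the discrete, rigid representation imposed by view transformation conflicts with NeRF's assumption of continuous, adaptive functions. The result is blurry, ambiguous 3D representations that limit 3D scene understanding. Moreover, the pre-trained NeRF network is discarded for downstream tasks, so the enhanced representation is used inefficiently.
Result: Experiments on nuScenes show that the method significantly outperforms previous state-of-the-art methods on both the pretext scene reconstruction task and the downstream detection task.
Insight: The core novelty is a NeRF network that is retained regardless of the task, building the principle of continuous 3D representation learning directly into the 3D detector. This avoids the conflicting priors of view transformation and raises the potential of 3D representation learning for both reconstruction and detection.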
Abstract: Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, which enhances 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pretraining to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation inherits conflicting priors; view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network for pre-training is discarded during downstream tasks, resulting in inefficient utilization of enhanced 3D representations through NeRF. In this paper, we propose a novel NeRF-Resembled Point-based 3D detector that can learn continuous 3D representation and thus avoid the misaligned priors from view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the tasks, inheriting the principle of continuous 3D representation learning and leading to greater potentials for both scene reconstruction and detection tasks. Experiments on nuScenes dataset demonstrate that our proposed approach significantly improves previous state-of-the-art methods, outperforming not only pretext scene reconstruction tasks but also downstream detection tasks.
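[166] RAWIC: Bit-Depth Adaptive Lossless Raw Image Compression cs.CV
Chunhang Zheng, Tongda Xu, Mingli Xie, Yan Wang, Dou Li
TL;DR: This paper proposes RAWIC, a bit-depth-adaptive lossless compression framework for Bayer-pattern raw images. It converts single-channel Bayer data into a four-channel RGGB format, partitions it into patches, and uses each patch's bit depth as auxiliary input to guide compression. An adaptive entropy model estimates the conditional distributions, so a single model handles raw images from different cameras and bit depths. Experiments show that RAWIC outperforms traditional codecs in lossless compression.
Details
Motivation: Raw images preserve linear sensor measurements and high bit-depth information that matter for advanced vision tasks and photography, but storing them is difficult because of large file sizes, varying bit depths, and sensor-dependent characteristics. Existing learned methods mainly target 8-bit sRGB images, while raw reconstruction methods are inherently lossy and rely on camera-specific assumptions.
Result: RAWIC consistently surpasses traditional lossless codecs, with an average 7.7% bitrate reduction over JPEG-XL.
Insight: The novelties are an adaptive entropy model conditioned on bit depth and a general framework in which one model handles raw images from multiple sources, offering a scalable solution for lossless compression of high-bit-depth sensor data.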
Abstract: Raw images preserve linear sensor measurements and high bit-depth information crucial for advanced vision tasks and photography applications, yet their storage remains challenging due to large file sizes, varying bit depths, and sensor-dependent characteristics. Existing learned lossless compression methods mainly target 8-bit sRGB images, while raw reconstruction approaches are inherently lossy and rely on camera-specific assumptions. To address these challenges, we introduce RAWIC, a bit-depth-adaptive learned lossless compression framework for Bayer-pattern raw images. We first convert single-channel Bayer data into a four-channel RGGB format and partition it into patches. For each patch, we compute its bit depth and use it as auxiliary input to guide compression. A bit-depth-adaptive entropy model is then designed to estimate patch distributions conditioned on their bit depths. This architecture enables a single model to handle raw images from diverse cameras and bit depths. Experiments show that RAWIC consistently surpasses traditional lossless codecs, achieving an average 7.7% bitrate reduction over JPEG-XL. Our code is available at https://github.com/chunbaobao/RAWIC.
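The preprocessing steps in the abstract (splitting the Bayer mosaic into four RGGB planes and computing a per-patch bit depth) can be sketched in plain Python. This is an illustrative sketch, not the RAWIC code: it assumes an RGGB layout with red at the top-left pixel, and it assumes that a patch's bit depth means the smallest number of bits that can hold its largest sample. The function names are hypothetical.

```python
def bayer_to_rggb(bayer):
    """Split an H x W Bayer mosaic (list of rows, RGGB layout assumed)
    into four half-resolution planes: R, G1, G2, B."""
    r = [row[0::2] for row in bayer[0::2]]   # even rows, even columns
    g1 = [row[1::2] for row in bayer[0::2]]  # even rows, odd columns
    g2 = [row[0::2] for row in bayer[1::2]]  # odd rows, even columns
    b = [row[1::2] for row in bayer[1::2]]   # odd rows, odd columns
    return [r, g1, g2, b]


def patch_bit_depth(planes):
    """Smallest number of bits that can represent the largest sample
    in the patch (an assumed definition; at least 1 bit)."""
    peak = max(value for plane in planes for row in plane for value in row)
    return max(1, peak.bit_length())
```

The split is a pure rearrangement of samples, so it loses no information, which is a precondition for any lossless pipeline built on top of it.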
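[167] Contour-Guided Query-Based Feature Fusion for Boundary-Aware and Generalizable Cardiac Ultrasound Segmentation cs.CV
Zahid Ullah, Sieun Choi, Jihie Kim
TL;DR: This paper proposes CGQR-Net, a contour-guided query refinement network that addresses the loss of boundary precision and structural consistency in cardiac ultrasound segmentation caused by low contrast, speckle noise, and irregular boundaries. It integrates multi-resolution features with contour-derived structural priors: an HRNet backbone extracts features, a coarse segmentation is generated, anatomical contours are extracted from it and encoded into learnable query embeddings, and these queries interact with fused feature maps through cross-attention for structure-aware refinement. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction.
Details
Motivation: Cardiac ultrasound segmentation is essential in intelligent healthcare systems, but existing appearance-driven methods struggle to preserve boundary precision and structural consistency under low contrast, speckle noise, irregular boundaries, and domain shift across devices and patient populations.
Result: The method is evaluated on the CAMUS dataset, with cross-dataset generalization validated on CardiacNet. Results show improved segmentation accuracy and boundary precision, and robust performance across varying imaging conditions.
Insight: The novelty lies in combining contour-level structural information with feature-level representations through contour-guided query embeddings and cross-attention, and in a dual-head supervision strategy that jointly optimizes segmentation and boundary prediction. More broadly, it offers a way to integrate geometric priors with deep features in medical image segmentation, which may help generalization on difficult ultrasound images.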
Abstract: Accurate cardiac ultrasound segmentation is essential for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are challenging due to low contrast, speckle noise, irregular boundaries, and domain shifts across devices and patient populations. Existing methods, largely based on appearance-driven learning, often fail to preserve boundary precision and structural consistency under these conditions. To address these issues, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The framework integrates multi-resolution feature representations with contour-derived structural priors. An HRNet backbone preserves high-resolution spatial details while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps via cross-attention, enabling structure-aware refinement that improves boundary delineation and reduces noise artifacts. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction to enforce structural consistency. The proposed method is evaluated on the CAMUS dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate improved segmentation accuracy, enhanced boundary precision, and robust performance across varying imaging conditions. These results highlight the effectiveness of integrating contour-level structural information with feature-level representations for reliable cardiac ultrasound segmentation.
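[168] MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding cs.CV
Guangjing Yang, Ziyuan Qin, Chaoran Zhang, Chenlin Du, Jinlin Wang
TL;DR: This paper proposes MedLoc-R1, a performance-aware curriculum reward scheduling framework that addresses the vanishing gradients and stalled optimization caused by reward sparsity in GRPO-based medical visual grounding. Using a sliding-window performance tracker and a multi-condition update rule, it progressively tightens the reward criterion according to model readiness, moving from dense, easily obtained signals to strict, fine-grained localization requirements. Experiments on three medical visual grounding benchmarks show that MedLoc-R1 improves both localization accuracy and training stability over GRPO baselines.
Details
Motivation: Existing reinforcement learning approaches to medical visual grounding, such as GRPO, suffer from severe reward sparsity because regions of interest in medical images are small or ambiguous and fixed IoU-based reward schemes are rigid and suboptimal. This leads to vanishing gradients and stagnation early in training.
Result: On three medical visual grounding benchmarks, MedLoc-R1 consistently improves localization accuracy and training stability over GRPO-based baselines.
Insight: The main novelty is a lightweight, performance-aware curriculum reward scheduler that eases reward sparsity by adjusting the reward criterion from easy to hard. It adds no auxiliary networks or gradient paths and preserves the favorable properties of GRPO, offering a general solution for RL-based grounding in high-stakes medical applications.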
Abstract: Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at https://github.com/MembrAI/MedLoc-R1.
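A minimal sketch of a performance-aware reward schedule, in the spirit of the sliding-window tracker and multi-condition update rule described above. The abstract does not give the actual rule, so everything here is an assumption made for illustration: the class name `CurriculumReward`, the binary reward, the starting and final IoU thresholds, the step size, the window length, and the promotion rate.

```python
from collections import deque


class CurriculumReward:
    """Binary IoU reward whose threshold tightens as recent performance improves."""

    def __init__(self, start=0.3, final=0.7, step=0.1, window=4, promote_at=0.75):
        self.threshold = start          # current IoU needed to earn a reward
        self.final = final              # strictest threshold the schedule may reach
        self.step = step
        self.promote_at = promote_at    # recent success rate required to tighten
        self.history = deque(maxlen=window)

    def reward(self, iou):
        hit = iou >= self.threshold
        self.history.append(hit)
        window_full = len(self.history) == self.history.maxlen
        success_rate = sum(self.history) / len(self.history)
        # Update only if the window is full, the success rate is high,
        # and there is still room to tighten the criterion.
        if window_full and success_rate >= self.promote_at and self.threshold < self.final:
            self.threshold = min(self.final, round(self.threshold + self.step, 10))
            self.history.clear()        # restart tracking under the new criterion
        return 1.0 if hit else 0.0
```

Starting from a loose threshold keeps rewards dense early in training, and the schedule only becomes strict once the policy has shown it can meet the current bar. The scheduler itself adds no network and no gradient path, since it only changes how a scalar reward is computed.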
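[169] MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios cs.CV | cs.AI
Zhang Li, Zhibo Lin, Qiang Liu, Ziyang Zhang, Shuo Zhang
TL;DR: This paper introduces MDPBench, the first benchmark for multilingual document parsing, which evaluates models on real-world multilingual, multi-script, and photographed documents. It contains 3,400 document images in 17 languages, covering both digital and photographed documents, with a rigorous annotation pipeline. The evaluation finds that closed-source models (such as Gemini3-Pro) are relatively robust, while open-source models degrade markedly on non-Latin scripts and photographed documents, revealing performance imbalances across languages and conditions.
Details
Motivation: Document parsing research has focused on clean, digital documents in a few dominant languages, and no systematic benchmark evaluates models on diverse scripts, low-resource languages, and real-world photographed documents. A comprehensive multilingual benchmark is needed to drive more inclusive, deployable systems.
Result: On MDPBench, closed-source models (notably Gemini3-Pro) are relatively robust, while open-source models drop by an average of 17.8% on photographed documents and 14.0% on non-Latin scripts. Separate public and private evaluation splits support fair comparison, and the benchmark reports comprehensive results for open- and closed-source models.
Insight: The novelty is the first document parsing benchmark spanning multiple languages, multiple scripts, and real photographed conditions. It exposes the fragility of current models, especially open-source ones, on non-Latin scripts and photographed documents, and points to concrete directions for more inclusive parsing systems. Its rigorous annotation pipeline and evaluation design give future research a reliable testbed.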
Abstract: We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.
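[170] Robust Remote Sensing Image-Text Retrieval with Noisy Correspondence cs.CV
Qiya Song, Yiqiang Xie, Yuan Sun, Renwei Dian, Xudong Kang
TL;DR: This paper targets the noisy correspondence problem that is common in remote sensing image-text retrieval and proposes a robust retrieval paradigm (RRSITR). A self-paced learning strategy mimics the human easy-to-hard learning pattern by categorizing and weighting training samples, and a robust triplet loss improves robustness to noisy data.
Details
Motivation: Existing remote sensing image-text retrieval methods usually assume perfectly matched image-text pairs, but well-aligned data is expensive to collect and real datasets often contain inaccurate or mismatched descriptions, i.e., noisy correspondence. The paper addresses this overlooked challenge.
Result: Extensive experiments on three popular benchmark datasets show that RRSITR significantly outperforms state-of-the-art methods, especially at high noise rates.
Insight: The novelty is bringing self-paced learning to multimodal noisy correspondence: loss-based sample categorization with dynamic weighting, combined with a robust triplet loss whose soft margin adapts to semantic similarity, forms a progressive learning framework that improves robustness to noisy data.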
Abstract: As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. In addition, we also notice that the remote sensing datasets (e.g., RSITMD) truly contain some inaccurate or mismatched image text descriptions. Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image-Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard from multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we respectively estimate the reliability of each training pair by assigning a weight to each pair based on the values of the loss. Further, we respectively design a new multi-modal self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process. Finally, for noisy sample pairs, we present a robust triplet loss to dynamically adjust the soft margin based on semantic similarity, thereby enhancing the robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially in high noise rates. The code is available at: https://github.com/MSFLabX/RRSITR
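The three ingredients named in the abstract (a loss-based split into clean, ambiguous, and noisy pairs; a per-pair reliability weight; and a triplet loss with a similarity-dependent soft margin) can be sketched as follows. The abstract gives no formulas, so the thresholds `low` and `high`, the linear weighting, and the margin scaling are hypothetical choices for illustration only, not the RRSITR definitions.

```python
def partition_by_loss(losses, low=0.5, high=1.5):
    """Group sample indices by loss magnitude (thresholds are assumed values)."""
    groups = {"clean": [], "ambiguous": [], "noisy": []}
    for index, loss in enumerate(losses):
        if loss < low:
            groups["clean"].append(index)
        elif loss < high:
            groups["ambiguous"].append(index)
        else:
            groups["noisy"].append(index)
    return groups


def pair_weight(loss, high=1.5):
    """Linear reliability weight: 1 at zero loss, 0 at or above `high`."""
    return max(0.0, 1.0 - loss / high)


def soft_margin_triplet(sim_pos, sim_neg, base_margin=0.2, semantic_sim=1.0):
    """Hinge triplet loss whose margin shrinks when the image-text pair
    looks semantically mismatched (low `semantic_sim`)."""
    margin = base_margin * max(0.0, semantic_sim)
    return max(0.0, margin - sim_pos + sim_neg)
```

Under these assumptions, a pair judged to be mismatched contributes a smaller margin, so the model is not pushed hard to pull a wrong caption toward its image.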
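[171] RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation cs.CV | cs.AI
Chanseul Cho, Seokju Yun, Jeaseong Jeon, Seungjae Moon, Youngmin Ro
TL;DR: This paper proposes RecycleLoRA for domain generalized semantic segmentation. Dual LoRA subspace adapters based on rank-revealing QR decomposition exploit the multi-domain knowledge of vision foundation models and improve robustness on unseen target domains.
Details
Motivation: Existing domain generalized semantic segmentation methods do not fully exploit the subspace structure of vision foundation models, and their LoRA components suffer from limited representational diversity and inefficient parameter use.
Result: The method reaches state-of-the-art performance on synthetic-to-real and real-to-real generalization, without complex architectures or additional inference latency.
Insight: The novelty is using rank-revealing QR decomposition to systematically exploit the pre-trained subspace structure. A dual-LoRA design with a main adapter and a sub adapter learns distinct, complementary representations without extra regularization losses, improving representational richness and parameter efficiency.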
Abstract: Domain Generalized Semantic Segmentation (DGSS) aims to maintain robust performance across unseen target domains. Vision Foundation Models (VFMs) offer rich multi-domain knowledge that can enhance generalization. However, strategies for actively exploiting the rich subspace structures within VFMs remain under-explored, with many existing methods focusing primarily on preserving pre-trained knowledge. Furthermore, their LoRA components often suffer from limited representational diversity and inefficient parameter utilization. We propose RecycleLoRA, which addresses both challenges by employing Rank-Revealing QR Decomposition (RRQR) to systematically exploit VFM’s subspace structures and enhance LoRA’s representational richness. Our main adapter leverages minor subspace directions identified by RRQR to learn diverse and independent features, achieving competitive performance even when used alone. We further introduce a sub adapter that carefully refines major directions with minimal adjustments, providing complementary improvements to the main adapter’s strong baseline performance. This design enables the dual adapters to learn distinct representations without requiring additional regularization losses. Our systematic exploitation of pre-trained subspace structures through RRQR-based initialization leads to superior domain generalization performance. RecycleLoRA achieves state-of-the-art performance on both synthetic-to-real generalization and real-to-real generalization tasks without complex architectures or additional inference latency.
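[172] BlankSkip: Early-exit Object Detection onboard Nano-drones cs.CV
Carlo Marra, Beatrice Alessandra Motetti, Alessio Burrello, Enrico Macii, Massimo Poncino
TL;DR: This paper proposes BlankSkip, an adaptive network for object detection on nano-drones. It exits early via a simple auxiliary classification task, namely identifying frames with no objects of interest, which improves inference efficiency with only a small loss in detection accuracy.
Details
Motivation: Nano-drone computing platforms are extremely constrained (about 10 MiB of memory and a 1 W power budget), which makes deploying deep neural networks for autonomy difficult. Early-exit adaptive networks reduce computation on "easy-to-process" frames, but prior work focuses on classification, and applying the idea to dense tasks such as object detection is not straightforward.
Result: In experiments on a real nano-drone platform, the Bitcraze Crazyflie 2.1, using a state-of-the-art nano-drone object detection dataset, BlankSkip improves average throughput by up to 24% over a static MobileNet-SSD detector, with a mean Average Precision (mAP) drop of only 0.015.
Insight: The novelty is applying early exit to object detection by using an auxiliary classification task to identify "blank frames" and skip unnecessary computation. The method is tailored to resource-constrained platforms, trades a small accuracy loss for efficiency, and suggests a route for deploying dense tasks on edge devices.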
Abstract: Deploying tiny computer vision Deep Neural Networks (DNNs) on-board nano-sized drones is key for achieving autonomy, but is complicated by the extremely tight constraints of their computational platforms (approximately 10 MiB memory, 1 W power budget). Early-exit adaptive DNNs that dial down the computational effort for “easy-to-process” input frames represent a promising way to reduce the average inference latency. However, while this approach is extensively studied for classification, its application to dense tasks like object detection (OD) is not straightforward. In this paper, we propose BlankSkip, an adaptive network for on-device OD that leverages a simple auxiliary classification task for early exit, i.e., identifying frames with no objects of interest. With experiments using a real-world nano-drone platform, the Bitcraze Crazyflie 2.1, we achieve up to 24% average throughput improvement with a limited 0.015 mean Average Precision (mAP) drop compared to a static MobileNet-SSD detector, on a state-of-the-art nano-drones OD dataset.
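The early-exit control flow described above can be sketched in a few lines. This is a schematic under stated assumptions, not the BlankSkip network: `stem`, `blank_head`, and `detector` are hypothetical callables standing in for the shared early layers, the auxiliary blank-frame classifier, and the remaining detection layers, and the 0.9 threshold is an arbitrary illustrative value.

```python
def detect_with_early_exit(frame, stem, blank_head, detector, blank_threshold=0.9):
    """Return (detections, exited_early) for a single frame."""
    features = stem(frame)              # shared early layers, always executed
    p_blank = blank_head(features)      # auxiliary classifier: P(no object of interest)
    if p_blank >= blank_threshold:
        return [], True                 # blank frame: skip the expensive detection layers
    return detector(features), False


# Toy stand-ins so the control flow can be exercised without a real network.
detector_calls = []


def toy_stem(frame):
    return frame


def toy_blank_head(features):
    return 0.95 if features == "empty" else 0.1


def toy_detector(features):
    detector_calls.append(features)
    return [("object", 0.8)]
```

The average saving depends on how often frames are blank and on how early the exit sits in the network, which is why the reported throughput gain is an average over a dataset rather than a fixed per-frame speedup.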
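[173] ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining cs.CV
Yucheng Huang, Luping Ji, Xiangwei Jiang, Wen Li, Mao Ye
TL;DR: This paper proposes ToLL, a new framework for 3D scene graph pretraining that recovers the global layout of subgraphs through anchor-conditioned topological geometry reasoning and uses structural multi-view augmentation to avoid semantic corruption and strengthen representation learning.
Details
Motivation: 3D scene graph generation is limited by data scarcity. Existing solutions either rely on predicate annotations or let strong object priors bypass predicate learning, so a label-free, robust self-supervised proxy task for fine-tuning is lacking.
Result: Extensive experiments on the 3DSSG dataset show that ToLL improves representation quality and outperforms state-of-the-art baselines.
Insight: The novelty is a topological geometry reasoning process strictly modulated by predicate features, which forces the model to learn predicate relations, combined with structural multi-view augmentation and self-distillation to avoid semantic corruption and strengthen representations. Together these give an effective label-free self-supervised method for 3D scene graph pretraining.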
Abstract: 3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pre-training. The former relies heavily on predicate annotations, while the latter’s predicate learning may be bypassed due to strong object priors. Consequently, they often cannot provide a label-free and robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose Topological Layout Learning (ToLL), a framework for 3DSG pretraining. In detail, we design an Anchor-Conditioned Topological Geometry Reasoning, with a GNN to recover the global layout of zero-centered subgraphs by the spatial priors from sparse anchors. This process is strictly modulated by predicate features, thereby enforcing the predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption and enhance representations via self-distillation. Extensive experiments on the 3DSSG dataset demonstrate that ToLL improves representation quality, outperforming state-of-the-art baselines.
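[174] Explaining CLIP Zero-shot Predictions Through Concepts cs.CV
Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas
TL;DR: This paper proposes EZPC, which explains CLIP's zero-shot predictions through human-understandable concepts. It projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, giving faithful and transparent explanations without additional supervision.
Details
Motivation: Large-scale vision-language models such as CLIP are very successful at zero-shot image recognition, but their predictions remain opaque to humans. Concept Bottleneck Models provide interpretable intermediate representations through human-defined concepts, but they depend on concept supervision and cannot generalize to unseen classes. The paper bridges the two paradigms to make CLIP's predictions interpretable.
Result: Extensive experiments on five benchmark datasets (CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k) show that the method preserves CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations.
Insight: The core novelty is a projection learned with combined alignment and reconstruction objectives, which maps CLIP embeddings into a concept space learned without concept supervision and preserves CLIP's semantic structure. Grounding open-vocabulary predictions in explicit semantic concepts offers a principled path toward interpretable, trustworthy vision-language models.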
Abstract: Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP’s zero-shot predictions through human-understandable concepts. Our method projects CLIP’s joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP’s semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP’s strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
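One way to picture a concept-level explanation of a zero-shot score is sketched below. It is only an illustration of the general idea of scoring in a concept space, not the EZPC method: the abstract does not state how the projection is parameterized or how scores are formed, so the linear `concept_matrix`, the product-of-activations score, and the function names are all assumptions.

```python
def project(embedding, concept_matrix):
    """Linear map from embedding space to concept activations
    (one row of `concept_matrix` per concept)."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in concept_matrix]


def explain(image_emb, class_emb, concept_matrix, concept_names, top_k=2):
    """Score an image against a class in concept space and return the
    concepts that contribute most to that score."""
    image_concepts = project(image_emb, concept_matrix)
    class_concepts = project(class_emb, concept_matrix)
    # Each concept contributes the product of its two activations,
    # so the class score decomposes into a sum over concepts.
    contributions = [a * b for a, b in zip(image_concepts, class_concepts)]
    score = sum(contributions)
    ranked = sorted(zip(concept_names, contributions),
                    key=lambda pair: pair[1], reverse=True)
    return score, ranked[:top_k]
```

Because the score is a sum of per-concept terms, the top-ranked terms can be read directly as the explanation for that class, which is the property that makes a concept space useful for interpretation.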
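[175] DiffAttn: Diffusion-Based Drivers’ Visual Attention Prediction with LLM-Enhanced Semantic Reasoning cs.CV | cs.AI
Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng
TL;DR: This paper proposes DiffAttn, which predicts drivers' visual attention through a conditional diffusion-denoising process. It combines a Swin Transformer encoder, a Feature Fusion Pyramid decoder, and a large language model layer to model local and global scene features and semantic reasoning, and it reaches SOTA performance on several public datasets.
Details
Motivation: Drivers' visual attention is critical for anticipating latent hazards and for decision-making, and its absence compromises traffic safety. Existing methods fall short in modeling driver attention accurately, so more precise models of drivers' perception patterns are needed.
Result: Extensive experiments on four public datasets show that DiffAttn achieves state-of-the-art performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines.
Insight: The novelties are formulating driver attention prediction as a conditional diffusion-denoising process, a decoder that combines a Feature Fusion Pyramid with multi-scale diffusion to model local and global context, and an LLM layer that strengthens top-down semantic reasoning and sensitivity to safety-critical cues. A transferable idea is combining diffusion models with Transformers and LLMs for fine-grained vision tasks.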
Abstract: Drivers’ visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers’ perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers’ attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers’ state measurement in intelligent vehicles.
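[176] DinoDental: Benchmarking DINOv3 as a Unified Vision Encoder for Dental Image Analysis cs.CV
Kun Tang, Xinquan Yang, Mianjie Zheng, Xuefen Liu, Xuguang Li
TL;DR: This paper presents DinoDental, a benchmark that systematically evaluates how reliably the self-supervised vision foundation model DINOv3 performs as an off-the-shelf unified encoder for dental image analysis, without domain-specific pre-training. Tasks include classification, detection, and instance segmentation on panoramic radiographs and intraoral photographs.
Details
Motivation: Expert annotations in dental imaging are scarce and costly, which holds back AI development. DINOv3, an advanced self-supervised pre-trained model, could ease this problem, but how reliably it transfers to the dental domain, with its distinctive imaging characteristics and clinical subtleties, is unclear.
Result: Experiments show that DINOv3 serves as a strong unified encoder for dental image analysis and stays competitive across tasks, with particularly clear advantages on intraoral image understanding and boundary-sensitive dense prediction.
Insight: The novelty is a unified, multi-task, multi-modality benchmark for dental image analysis, together with a systematic study of how model size, input resolution, and adaptation strategy (frozen features, full fine-tuning, and LoRA) affect transfer performance. It gives the dental AI community a reference for efficient model selection and adaptation.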
Abstract: The scarcity and high cost of expert annotations in dental imaging present a significant challenge for the development of AI in dentistry. DINOv3, a state-of-the-art, self-supervised vision foundation model pre-trained on 1.7 billion images, offers a promising pathway to mitigate this issue. However, its reliability when transferred to the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. To address this, we introduce DinoDental, a unified benchmark designed to systematically evaluate whether DINOv3 can serve as a reliable, off-the-shelf encoder for comprehensive dental image analysis without requiring domain-specific pre-training. Constructed from multiple public datasets, DinoDental covers a wide range of tasks, including classification, detection, and instance segmentation on both panoramic radiographs and intraoral photographs. We further analyze the model’s transfer performance by scaling its size and input resolution, and by comparing different adaptation strategies, including frozen features, full fine-tuning, and the parameter-efficient Low-Rank Adaptation (LoRA) method. Our experiments show that DINOv3 can serve as a strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks while showing particularly clear advantages for intraoral image understanding and boundary-sensitive dense prediction. Collectively, DinoDental provides a systematic framework for comprehensively evaluating DINOv3 in dental analysis, establishing a foundational benchmark to guide efficient and effective model selection and adaptation for the dental AI community.
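[177] Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes cs.CV
Luke Palmer, Petar Palasek, Hazem Abdelkawy
TL;DR: This paper proposes a graph-based autoregressive dynamical system that simulates human gaze in dynamic scenes. An Affinity Relation Transformer (ART) processes gaze-centric graphs and an Object Density Network (ODN) predicts the next-step gaze distribution, yielding more natural gaze trajectories, scanpaths, and saliency maps in driving scenes.
Details
Motivation: Existing methods typically collapse gaze into saliency maps or scanpaths and treat gaze dynamics only implicitly. This paper explicitly models how raw gaze trajectories evolve over time, conditioned on gaze history and the changing environment, to model human attention more accurately, particularly for automotive safety.
Result: On the newly released Focus100 dataset (raw gaze data from 30 participants), the method is trained directly on raw gaze without fixation filtering and produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models.
Insight: The novelties are modeling gaze as an autoregressive dynamical system, using a heterogeneous graph transformer (ART) to model interactions among driver gaze, traffic objects, and road structure, and introducing ODN to capture the stochastic, object-centric nature of attention shifts in complex environments. The work offers a new approach to temporal modeling of human attention in dynamic environments and highlights the value of using raw gaze data directly.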
Abstract: Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.
[178] Integrating Multimodal Large Language Model Knowledge into Amodal Completion cs.CV | cs.AIPDF
Heecheol Yun, Eunho Yang
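TL;DR: This paper proposes AmodalCG, a framework for amodal completion, the task of reconstructing the occluded parts of people and objects in an image. The framework uses the real-world knowledge of multimodal large language models (MLLMs) to guide completion: it assesses the extent of occlusion to invoke the MLLM selectively, has the MLLM reason about the extent and content of the missing regions, and then iteratively refines the completion with a visual generative model.
Details
Motivation: With the spread of autonomous vehicles and robotics, amodal completion has become increasingly important, and it requires real-world physical knowledge to infer occluded parts. Existing methods either rely solely on visual generative models that lack such knowledge, or use the knowledge only during segmentation, so it cannot explicitly guide completion.
Result: Experiments on a variety of real-world images show marked improvements over all existing work, suggesting that MLLMs are a promising direction for challenging amodal completion.
Insight: The novelty lies in explicitly integrating MLLM real-world knowledge into the amodal completion process: the MLLM is invoked selectively and reasons about the extent and content of the missing regions to guide completion, and a visual generative model refines the result iteratively, which improves handling of heavily occluded objects.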
Abstract: With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates this guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.
[179] VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning cs.CVPDF
Li-Heng Chen, Ke Cheng, Yahui Liu, Lei Shi, Shi-Sheng Huang
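TL;DR: This paper proposes VistaGEN, a driving video generation technique with fine-grained control that maintains spatiotemporal consistency in long video sequences through multiview visual-language reasoning and a generation-evaluation-regeneration closed loop.
Details
Motivation: Existing driving video generation methods have advanced in controllability, resolution, and length, but they lack fine-grained object-level control for diverse driving videos and struggle to preserve spatiotemporal consistency in long videos.
Result: Extensive evaluation shows that VistaGEN generates diverse driving videos with fine-grained controllability, especially for long-tail objects, and achieves much better spatiotemporal consistency than previous methods.
Insight: The contributions include injecting multiview visual-language features into the video generator for fine-grained control, and a multiview vision-language evaluator (MV-VLM) that closes the generation loop, combined with an object-level refinement module that improves output quality and consistency.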
Abstract: Driving video generation has achieved much progress in controllability, video resolution, and length, but it still fails to support fine-grained object-level controllability for diverse driving videos while preserving spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate the spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. In addition, within the closed-loop generation, we introduce an object-level refinement module that refines the results the MV-VLM judges unsatisfactory and then feeds them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.
[180] SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering cs.CVPDF
Jiho Park, Sieun Choi, Jaeyoon Seo, Minho Sohn, Yeana Kim
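TL;DR: This paper proposes SEA, a reference-free metric for sketch abstraction efficiency, that is, how economically a sketch represents class-defining visual elements while preserving semantic recognizability. The authors also build CommonSketch, the first semantically annotated sketch dataset, with 23,100 human-drawn sketches across 300 classes, which supports the SEA metric and serves as a benchmark.
Details
Motivation: Existing evaluation methods rely on reference images, low-level visual features, or recognition accuracy and fail to capture the abstraction that defines sketches, so a new method is needed to quantify the efficiency of semantic abstraction in sketches.
Result: Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency; CommonSketch serves as a benchmark for systematically evaluating element-level sketch understanding across vision-language models.
Insight: The novelty lies in a reference-free abstraction-efficiency metric built on commonsense knowledge (features typical of each class) and a visual question answering model, together with the first large-scale semantically annotated sketch dataset, providing new tools and a benchmark for sketch understanding and evaluation.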
Abstract: A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.
[181] AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation cs.CVPDF
Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang
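TL;DR: AutoCut is an end-to-end advertisement video editing framework based on multimodal discretization and controllable generation. It discretizes video and audio features into unified tokens aligned with text to build a shared multimodal token space, and develops a multimodal large language model on top of a foundation model for video editing. It supports video selection and ordering, script generation, and background music selection, and produces deployable long video outputs.
Details
Motivation: Current workflows and AI tools for short-form video advertising are fragmented and modality-specific, which drives up production cost and lowers efficiency. The goal is scalable, efficient end-to-end video content creation.
Result: Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability.
Insight: The contributions include discretizing video and audio features with residual vector quantization to build a shared token space, and developing a multimodal large language model for unified editing through combined multimodal alignment and supervised fine-tuning, opening a path toward scalable video creation.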
Abstract: Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
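The residual vector quantization step mentioned in this abstract can be illustrated with a minimal sketch. This is generic RVQ in plain Python, not AutoCut's implementation: the two tiny codebooks, the 2-D feature, and the function names are all hypothetical, and real codebooks would be learned from encoder outputs.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the vector as the sum of the selected codes."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy hand-written codebooks (assumption: real ones are learned).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage
]

feature = [1.2, 0.9]                     # stand-in for one encoder feature
tokens = rvq_encode(feature, CODEBOOKS)  # one discrete token per stage
recon = rvq_decode(tokens, CODEBOOKS)    # approximate reconstruction
```

Each stage contributes one integer token, so a continuous feature becomes a short token sequence. How AutoCut aligns such tokens with text is not covered by this sketch.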
[182] Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models cs.CVPDF
Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang
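TL;DR: This paper proposes a text-guided image editing framework based on visual autoregressive (VAR) models. Building on an analysis of the models' intermediate feature distributions, it designs a coarse-to-fine token localization strategy and a reinforcement learning-based adaptive feature injection mechanism to better preserve structural consistency during editing.
Details
Motivation: Existing VAR-based editing methods face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. This paper addresses both.
Result: Extensive experiments show that the method outperforms state-of-the-art approaches in structural consistency and editing quality across both local and global editing scenarios.
Insight: The contributions are: identifying structure-related features by analyzing intermediate feature distributions in VAR models; a coarse-to-fine token localization strategy that balances editing fidelity and background preservation; and a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios.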
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
[183] Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation cs.CVPDF
Weichao Cai, Weiliang Huang, Biao Xue, Chao Huang, Fei Yuan
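TL;DR: This paper proposes a unified restoration-perception learning framework for fusing and segmenting maritime infrared-visible images. To address image degradation caused by fog, strong reflections, and similar factors at sea, the authors first build the Infrared-Visible Maritime Ship Dataset (IVMSD) and then propose the Multi-task Complementary Learning Framework (MCLF), which jointly performs image restoration, multimodal fusion, and semantic segmentation in a single architecture.
Details
Motivation: Maritime scene understanding and segmentation are vital for maritime monitoring and navigation safety, but environmental factors such as fog and strong reflections severely degrade images and destabilize semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus only on visual quality, lacking an end-to-end collaborative mechanism that improves structural recovery and semantic effectiveness together. Moreover, public infrared-visible datasets come mostly from urban scenes and fail to capture the coupled degradations of real maritime environments.
Result: Experiments on the proposed IVMSD dataset show that the method achieves state-of-the-art segmentation performance and significantly improves robustness and perceptual quality under complex maritime conditions.
Insight: The contributions are: 1) IVMSD, a maritime infrared-visible dataset covering diverse weather and illumination conditions; 2) MCLF, a unified multi-task complementary learning framework that integrates restoration, fusion, and segmentation in one end-to-end architecture; 3) a Frequency-Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, and a Semantic-Visual Consistency Attention (SVCA) module for semantically consistent guidance; 4) a cross-modality guided attention mechanism for selective fusion.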
Abstract: Marine scene understanding and segmentation plays a vital role in maritime monitoring and navigation safety. However, prevalent factors like fog and strong reflections in maritime environments cause severe image degradation, significantly compromising the stability of semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus solely on visual quality, lacking end-to-end collaborative mechanisms that simultaneously improve structural recovery and semantic effectiveness. Moreover, publicly available infrared-visible datasets are predominantly collected from urban scenes, failing to capture the authentic characteristics of coupled degradations in marine environments. To address these challenges, the Infrared-Visible Maritime Ship Dataset (IVMSD) is proposed to cover various maritime scenarios under diverse weather and illumination conditions. Building upon this dataset, a Multi-task Complementary Learning Framework (MCLF) is proposed to collaboratively perform image restoration, multimodal fusion, and semantic segmentation within a unified architecture. The framework includes a Frequency-Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, a Semantic-Visual Consistency Attention (SVCA) module for semantic-consistent guidance, and a cross-modality guided attention mechanism for selective fusion. Experimental results on IVMSD demonstrate that the proposed method achieves state-of-the-art segmentation performance, significantly enhancing robustness and perceptual quality under complex maritime conditions.
[184] From Pixels to Reality: Physical-Digital Patch Attacks on Real-World Camera cs.CVPDF
Victoria Leonenkova, Ekaterina Shumitskaya, Dmitriy Vatolin, Anastasia Antsiferova
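TL;DR: This paper presents DiPA (Digital-Physical Adversarial Attacks), a practical adversarial attack on camera-based authentication systems. The attacker displays the adversarial patch directly on a smartphone screen instead of relying on printed artifacts, which enables rapid deployment, removes the need for total-variation regularization, and improves patch transferability under black-box conditions.
Details
Motivation: Existing physical adversarial attacks depend on printed artifacts, are cumbersome to deploy, and transfer poorly. The work aims to expose critical security vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructure.
Result: DiPA outperforms existing physical attacks in success rate, feature-space distortion, and reduction in detection confidence. An ensemble of state-of-the-art face-recognition models (ArcFace, MagFace, and CosFace) strengthens transfer to unseen commercial systems.
Insight: The novelty is presenting a physical adversarial patch by purely digital means (on a phone screen), which simplifies the attack pipeline and improves black-box transferability. Reusable ideas include using a model ensemble to strengthen cross-system attacks and using a real-time interactive demonstration to show the attack's immediate effect.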
Abstract: This demonstration presents Digital-Physical Adversarial Attacks (DiPA), a new class of practical adversarial attacks against pervasive camera-based authentication systems, where an attacker displays an adversarial patch directly on a smartphone screen instead of relying on printed artifacts. This digital-only physical presentation enables rapid deployment, removes the need for total-variation regularization, and improves patch transferability in black-box conditions. DiPA leverages an ensemble of state-of-the-art face-recognition models (ArcFace, MagFace, CosFace) to enhance transfer across unseen commercial systems. Our interactive demo shows a real-time dodging attack against a deployed face-recognition camera, preventing authorized users from being recognized while participants dynamically adjust patch patterns and observe immediate effects on the sensing pipeline. We further demonstrate DiPA’s superiority over existing physical attacks in terms of success rate, feature-space distortion, and reductions in detection confidence, highlighting critical vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructures.
[185] $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation cs.CV | cs.LGPDF
Linqian Fan, Peiqin Sun, Tiancheng Wen, Shun Lu, Chengru Song
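TL;DR: This paper proposes a new paradigm for diffusion model distillation that reconceptualizes distribution matching as a reward signal, $R_{dm}$, bridging the algorithmic gap between Diffusion Matching Distillation (DMD) and reinforcement learning. It introduces Group Normalized Distribution Matching (GNDM) to stabilize optimization and supports adaptive combination with external reward models, improving sampling efficiency while retaining high fidelity.
Details
Motivation: Diffusion models generate high-quality samples, but iterative sampling is slow. Existing distillation techniques usually anchor the student to the teacher, which caps the student's performance. Recent work has tried to break this ceiling with reinforcement learning, typically by simply summing the distillation and RL objectives. This paper addresses the problem from a unified perspective.
Result: In experiments, GNDM outperforms vanilla DMD, reducing FID by 1.87. The multi-reward variant, GNDMR, strikes a better balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21 and surpassing existing baselines.
Insight: The core idea is to recast distribution matching itself as a reward signal in reinforcement learning, which yields a unified optimization framework. The resulting group-normalized distribution matching improves optimization stability, while the reward-centric formulation allows flexible integration of external rewards and naturally accommodates importance sampling, improving sampling efficiency while preserving generation quality.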
Abstract: Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student’s performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.
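The "standard RL group normalization" that GNDM is said to adapt can be sketched as follows. This is only the generic normalization step, assuming one scalar reward per sample within a group. How $R_{dm}$ itself is estimated is not described in this entry, so the reward values below are made-up placeholders and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Center and scale rewards within one group of samples.

    Returns (r - group_mean) / (group_std + eps) for each reward, so that
    samples are compared against their own group rather than on an
    absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder per-sample reward estimates for one group (assumption).
dm_scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_normalize(dm_scores)
```

After normalization the values sum to approximately zero, and a group whose rewards are all equal yields all-zero outputs, so such a group contributes no optimization signal.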
[186] CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains cs.CV | cs.AIPDF
Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin
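TL;DR: This paper presents CiQi-Agent, a multimodal agent specialized in the connoisseurship of antique Chinese porcelain. It accepts multi-image inputs and, by invoking vision tools and using multimodal retrieval-augmented generation, performs fine-grained analysis of six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. To train and evaluate the agent, the team built a large-scale expert-annotated dataset, CiQi-VQA, and a benchmark, CiQi-Bench.
Details
Motivation: Connoisseurship of antique Chinese porcelain demands deep historical knowledge, material understanding, and aesthetic sensitivity, which is a major barrier for non-specialists. To broaden access to cultural heritage and assist expert connoisseurship, the work aims to build a domain-specific agent capable of intelligent analysis.
Result: Experiments show that CiQi-Agent (7B parameters) outperforms all competing open- and closed-source models on all six attributes of CiQi-Bench, with average accuracy 12.2% higher than GPT-5.
Insight: The novelty lies in a tool-augmented reasoning framework that combines vision tool invocation with multimodal retrieval-augmented generation, and in the first large-scale, expert-annotated porcelain visual question answering dataset and benchmark, offering a systematic solution for fine-grained multimodal reasoning in a specific cultural domain.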
Abstract: The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent – a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question–answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.
[187] INSID3: Training-Free In-Context Segmentation with DINOv3 cs.CVPDF
Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone
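TL;DR: INSID3 is a training-free in-context segmentation method that uses only frozen DINOv3 self-supervised features to segment arbitrary concepts (objects, parts, or personalized instances) from a single visual example. It avoids both the loss of generalization caused by fine-tuning and the architectural complexity of combining multiple models, and achieves state-of-the-art performance on several segmentation tasks.
Details
Motivation: Existing in-context segmentation methods either fine-tune vision foundation models, which harms generalization, or combine multiple frozen models, which adds architectural complexity and fixes the segmentation granularity. Taking a minimalist view, this paper asks whether a single self-supervised backbone (DINOv3) can support both semantic matching and segmentation without any supervision or auxiliary models.
Result: INSID3 achieves state-of-the-art results on one-shot semantic, part, and personalized segmentation, outperforming previous work by 7.5% mean intersection over union (mIoU) while using 3x fewer parameters and no mask or category-level supervision.
Insight: The key finding is that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. This enables a training-free segmentation framework built purely on frozen features, with flexible segmentation granularity, a simpler architecture, and preserved generalization.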
Abstract: In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given a single annotated visual example. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3.
[188] Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree cs.CVPDF
Fei Wu, Guanghao Ding, Zijian Niu, Zhenrui Wang, Lei Yang
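TL;DR: This paper proposes an AI-generated image detection framework that uses a fuzzy decision tree to integrate lightweight artifact-aware detectors with multimodal large language models, fusing complementary semantic and perceptual cues to achieve strong generalization while maintaining high accuracy.
Details
Motivation: Existing AI-generated image detectors generalize poorly because they overfit to specific generative models, while multimodal large language models lack perceptual sensitivity to subtle generation artifacts.
Result: Extensive experiments show that the proposed method achieves state-of-the-art detection accuracy and strong generalization across diverse generative models.
Insight: The novelty is using a fuzzy decision tree to adaptively fuse the fine-grained perception of lightweight artifact detectors with the high-level semantic reasoning of MLLMs, forming a complementary detection framework with better generalization and robustness.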
Abstract: The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.
[189] MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures cs.CVPDF
Tim Strohmeyer, Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Ahmed Nassar
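TL;DR: MarkushGrapher-2 is an end-to-end multimodal method for chemical structure recognition that combines text, image, and layout information to extract chemical structures from documents automatically. It uses a dedicated OCR model to extract text from images, jointly encodes the inputs with a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder, and fuses the encodings through a two-stage training strategy to generate a Markush structure representation autoregressively. To address the shortage of training data, the paper also introduces an automatic pipeline for building a large-scale dataset of real-world Markush structures and releases IP5-M, a manually annotated benchmark.
Details
Motivation: Existing methods recognize molecules in images or in text independently with some success, but recognition of multimodal descriptions (Markush structures) is not precise enough for automatic large-scale processing, so a more accurate multimodal method is needed.
Result: Extensive experiments show that the method substantially outperforms state-of-the-art models in multimodal Markush structure recognition while maintaining strong performance in molecule structure recognition.
Insight: The contributions include an end-to-end multimodal fusion architecture, a two-stage training strategy, an automatic pipeline for building a large-scale real-world dataset, and the manually annotated IP5-M benchmark, providing new tools and data for multimodal chemical structure recognition.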
Abstract: Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets are released publicly.
[190] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model cs.CV | cs.AI | cs.IRPDF
Athos Georgiou
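TL;DR: This paper presents Hydra, a dual-head approach that unifies document retrieval and generation in a single vision-language model (VLM). A LoRA adapter trained only for retrieval is toggled at inference to provide either ColBERT-style retrieval or autoregressive generation without separate models, reducing memory use and system complexity.
Details
Motivation: Visual document understanding typically requires separate retrieval and generation models, which doubles memory use and adds system complexity.
Result: Compared against an independent base-model pipeline, Hydra with the adapter disabled produces byte-identical outputs in 100% of 10,500 samples, with a maximum delta-ANLS of 0.0044 on four VQA benchmarks. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3. The design reduces peak GPU memory by 41%.
Insight: The core contribution is a switchable LoRA adapter that lets a single model move between retrieval and generation modes with no loss in generation quality. A proof-of-concept extension suggests the mechanism carries over to audio retrieval, video embedding, and speech generation. The paper also identifies three engineering requirements for the mechanism: attention-mode restoration, lm_head preservation, and KV-cache-aware decoding.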
Abstract: Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model’s generation quality – byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
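ColBERT-style late interaction, the retrieval mode named in this abstract, scores a query against a document by summing, over the query vectors, the maximum similarity to any document vector (often called MaxSim). The sketch below shows only that scoring rule in plain Python. The 2-D vectors, page names, and function names are hypothetical, and nothing here reflects how Hydra produces or stores its multi-vector embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query vectors of the best dot product with any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, corpus):
    """Return document ids sorted by descending late-interaction score."""
    scores = {doc_id: maxsim_score(query_vecs, vecs) for doc_id, vecs in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy multi-vector embeddings (assumption: real ones come from the model).
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[1.0, 0.0], [0.7, 0.1]],
}
ranking = rank(query, corpus)
```

Here `page_a` matches both query vectors exactly, while `page_b` matches only the first, so `page_a` ranks higher. Production systems typically normalize the vectors and batch this computation on an accelerator.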
[191] Domain-Invariant Prompt Learning for Vision-Language Models cs.CV | cs.AIPDF
Arsham Gholamzadeh Khoee, Yinan Yu, Robert Feldt
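TL;DR: This paper proposes Domain-invariant Context Optimization (DiCoOp), which aims to improve vision-language models such as CLIP on domain generalization tasks. DiCoOp uses adversarial training to learn domain-invariant prompt vectors, improving robustness on unseen distributions.
Details
Motivation: Existing soft-prompting methods such as CoOp lack an explicit mechanism for handling domain shift, which limits generalization to unseen distributions.
Result: Experiments show that DiCoOp consistently surpasses CoOp on domain generalization tasks across diverse visual domains.
Insight: The novelty is bringing adversarial training into prompt learning to force domain-invariant representations, which suggests a new way to improve the cross-domain adaptability of pre-trained models.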
Abstract: Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization (CoOp), effectively adapts these models for downstream recognition tasks by learning a set of context vectors. However, CoOp lacks explicit mechanisms for handling domain shifts across unseen distributions. To address this, we propose Domain-invariant Context Optimization (DiCoOp), an extension of CoOp optimized for domain generalization. By employing an adversarial training approach, DiCoOp forces the model to learn domain-invariant prompts while preserving discriminative power for classification. Experimental results show that DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains.
[192] XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs cs.CVPDF
Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei
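TL;DR: This paper proposes XSPA, an adversarial attack that tests the robustness of vision-language models (VLMs) by adding sparse, imperceptible perturbations only along two intersecting diagonal lines of the image. Experiments show that this highly constrained perturbation substantially degrades VLM performance on zero-shot classification, image captioning, and visual question answering, revealing a robustness gap in current multimodal systems.
Details
Motivation: The visual-textual representation space shared within a VLM may introduce a common vulnerability: small visual perturbations can propagate through it and cause semantic errors across tasks. This paper investigates whether VLMs are robust to highly constrained, sparse, geometrically fixed perturbations.
Result: Experiments on COCO show that XSPA substantially degrades several VLMs on all three tasks. Zero-shot classification accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16; caption consistency and VQA correctness drop by up to 58.60 and 44.38 points, respectively.
Insight: The novelty is a highly structured, sparse, visually imperceptible attack paradigm (XSPA) whose budget, about 1.76% of pixels modified, is stricter than that of dense perturbations or localized patch attacks, giving a more demanding benchmark for VLM robustness. The central insight is that even sparse perturbations with a fixed geometric prior can break cross-task semantic consistency in VLMs, exposing a potential security risk in the shared representation space.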
Abstract: Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
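The geometric constraint described in this abstract, perturbations restricted to two intersecting diagonals, can be pictured with a small mask-building sketch. It covers only the sparse support, not the optimization objectives. The function names, the `half_width` parameter, the square-image assumption, and the clamping to [0, 1] are all illustrative choices, not details taken from the paper.

```python
def x_mask(size, half_width=0):
    """Binary mask (list of rows) that is 1 on both diagonals of a square image."""
    mask = [[0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            on_main = abs(row - col) <= half_width
            on_anti = abs(row + col - (size - 1)) <= half_width
            if on_main or on_anti:
                mask[row][col] = 1
    return mask


def coverage(mask):
    """Fraction of pixels inside the mask."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total


def apply_sparse(image, delta, mask):
    """Add delta only where mask is 1, clamping pixel values to [0, 1]."""
    return [
        [min(1.0, max(0.0, px + d * m)) for px, d, m in zip(img_row, d_row, m_row)]
        for img_row, d_row, m_row in zip(image, delta, mask)
    ]


# Tiny grayscale example: 3x3 image, uniform perturbation of 0.1.
image = [[0.5] * 3 for _ in range(3)]
delta = [[0.1] * 3 for _ in range(3)]
perturbed = apply_sparse(image, delta, x_mask(3))
```

The covered fraction depends on the image size and the line width, neither of which is stated in this entry, so `coverage` is provided for experimenting with those settings and is not meant to reproduce the reported 1.76% figure.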
[193] Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering cs.CV | cs.AI | cs.MMPDF
Yanjie Zhang, Yafei Li, Rui Sheng, Zixin Chen, Yanna Lin
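TL;DR: This paper proposes ChartCynics, a dual-path agentic framework for handling misleading charts. It decouples perception from verification: a diagnostic vision path captures structural anomalies, an OCR-driven data path ensures numerical grounding, and an agentic summarizer optimized in two stages resolves cross-modal conflicts, which improves the robustness of chart question answering.
Details
Motivation: Despite the success of vision-language models (VLMs), misleading charts with deceptive visual structures and distorted data representations remain a major challenge that existing holistic models handle poorly.
Result: On two benchmarks, ChartCynics reaches 74.43% and 64.55% accuracy, an absolute gain of about 29% over its Qwen3-VL-8B backbone, and outperforms state-of-the-art proprietary models.
Insight: The claimed contribution is a dual-path framework built on a “skeptical” reasoning paradigm that decouples perception from verification, plus an agentic summarizer optimized in two stages (Oracle-Informed SFT and Deception-Aware GRPO) to penalize visual traps and enforce logical consistency. Viewed objectively, the core contribution is applying a specialized agentic workflow to a small open-source model so that it becomes more robust on this task than large proprietary models, which lays groundwork for trustworthy chart interpretation.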
Abstract: Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a “skeptical” reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a Diagnostic Vision Path captures structural anomalies (e.g., inverted axes) through strategic ROI cropping, while an OCR-Driven Data Path ensures numerical grounding. To resolve cross-modal conflicts, we introduce an Agentic Summarizer optimized via a two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment. This pipeline effectively penalizes visual traps and enforces logical consistency. Evaluations on two benchmarks show that ChartCynics achieves 74.43% and 64.55% accuracy, providing an absolute performance boost of ~29% over the Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models. Our results demonstrate that specialized agentic workflows can grant smaller open-source models superior robustness, establishing a new foundation for trustworthy chart interpretation.
[194] Unsafe2Safe: Controllable Image Anonymization for Downstream Utility cs.CV | cs.CY | cs.LGPDF
Mih Dinh, SouYoung Jin
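TL;DR: Unsafe2Safe is a fully automated image anonymization pipeline that uses multimodally guided diffusion editing to detect and rewrite sensitive regions in images, producing privacy-safe datasets.
Details
Motivation: Large-scale image datasets contain identifiable or sensitive content, which creates privacy risks because models may memorize and leak it during training.
Result: On MS-COCO, Caltech101, and MIT Indoor67, the method reduces face similarity, text similarity, and demographic predictability by large margins while keeping downstream model accuracy comparable to training on the raw data.
Insight: The contribution is the combination of a vision-language model for privacy-risk detection and structured instruction generation with instruction-driven diffusion editing that rewrites only the sensitive regions, protecting privacy while preserving global structure and task-relevant semantics. The automatically generated triplets (private caption, public caption, edit instruction) can also be used to fine-tune the diffusion editors, which further improves performance.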
Abstract: Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.
[195] Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration cs.CVPDF
Joanna Wiekiera, Martyna Zur
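TL;DR: This paper proposes a modular, task-decoupled image restoration framework in which a lightweight CNN classifier evaluates the input image and dynamically routes it to a specialized restoration node. This avoids negative interference between tasks, lowers training overhead, and allows new degradation types to be added flexibly.
Details
Motivation: Existing all-in-one restoration models suffer from negative task interference and require extensive joint training on high-end computing clusters. The goal is a more efficient and extensible alternative.
Result: The experiments indicate that the framework provides a scalable and efficient solution for multi-degradation restoration on standard local hardware. The abstract reports no quantitative results and instead emphasizes computational accessibility.
Insight: The contribution lies in the explicit diagnostic routing mechanism and model-agnostic extensibility. Isolating the reconstruction paths prevents feature conflicts, and adding a new degradation type requires only training a single expert and updating the router rather than retraining the whole system. Viewed objectively, this is a modular and efficient design paradigm.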
Abstract: Restoring images affected by various types of degradation, such as noise, blur, or improper exposure, remains a significant challenge in computer vision. While recent trends favor complex monolithic all-in-one architectures, these models often suffer from negative task interference and require extensive joint training cycles on high-end computing clusters. In this paper, we propose a modular, task-decoupled image restoration framework based on an explicit diagnostic routing mechanism. The architecture consists of a lightweight Convolutional Neural Network (CNN) classifier that evaluates the input image and dynamically directs it to a specialized restoration node. A key advantage of this framework is its model-agnostic extensibility: while we demonstrate it using three independent U-Net experts, the system allows for the integration of any restoration method tailored to specific tasks. By isolating reconstruction paths, the framework prevents feature conflicts and significantly reduces training overhead. Unlike monolithic models, adding new degradation types in our framework only requires training a single expert and updating the router, rather than a full system retraining. Experimental results demonstrate that this computationally accessible approach offers a scalable and efficient solution for multi-degradation restoration on standard local hardware. The code will be published upon paper acceptance.
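The routing idea in the abstract (a classifier diagnoses the degradation, then a dedicated expert restores the image) can be sketched as a small dispatch table. This is a hedged illustration under stated assumptions: `RestorationRouter`, `toy_classifier`, and the toy expert are hypothetical stand-ins for the paper's CNN classifier and U-Net experts, whose details the abstract does not give.

```python
from typing import Callable, Dict
import numpy as np

Image = np.ndarray
Expert = Callable[[Image], Image]

class RestorationRouter:
    """Dispatch an image to the expert registered for its diagnosed degradation."""

    def __init__(self, classify: Callable[[Image], str]):
        self.classify = classify            # stands in for the lightweight CNN
        self.experts: Dict[str, Expert] = {}

    def register(self, degradation: str, expert: Expert) -> None:
        # Adding a degradation type touches only this table and the classifier.
        self.experts[degradation] = expert

    def restore(self, image: Image) -> Image:
        label = self.classify(image)
        expert = self.experts.get(label)
        # Pass the image through unchanged when no expert matches the label.
        return image if expert is None else expert(image)

def toy_classifier(image: Image) -> str:
    """Hypothetical diagnosis rule: very dark images count as underexposed."""
    return "underexposed" if image.mean() < 0.2 else "clean"

router = RestorationRouter(toy_classifier)
router.register("underexposed", lambda img: np.clip(img * 2.0, 0.0, 1.0))
restored = router.restore(np.full((2, 2), 0.1))
```

Because each expert is an independent callable, any restoration method can be plugged in without retraining the others, which is the extensibility property the abstract emphasizes.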
[196] AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding cs.CV | cs.AIPDF
Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis
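TL;DR: This paper proposes AdaptToken, a training-free framework that addresses the high memory cost and context-length limits multimodal large language models (MLLMs) face in long-video understanding. It splits a video into groups, ranks the tokens in each group by cross-modal attention, and uses the model's response entropy to estimate each group's relevance to the prompt, which enables global token budget allocation and an early-stopping variant (AdaptToken-Lite).
Details
Motivation: Existing methods ease long-video understanding by scoring and selecting frames or tokens within short clips, but they lack a principled way to compare relevance across distant clips and to stop processing once enough evidence has been gathered.
Result: Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and handles extremely long inputs (up to 10K frames), while AdaptToken-Lite cuts inference time by about half with comparable performance.
Insight: The contribution is turning the MLLM's own uncertainty, quantified by response entropy, into a global control signal for long-video token selection. This enables entropy-based adaptive token budget allocation and early stopping in a lightweight, training-free framework that generalizes across MLLMs.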
Abstract: Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM’s self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model’s response entropy to estimate each group’s prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
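The entropy-driven control signal can be illustrated with a small sketch. It rests on assumptions that the abstract does not confirm: that lower response entropy marks a group as more relevant, that the budget is split by a softmax over negative entropy, and that early stopping uses a fixed threshold. All function names and the `temperature` parameter are hypothetical.

```python
import math
from typing import List, Sequence

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a response distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(entropies: Sequence[float], total_budget: int,
                    temperature: float = 1.0) -> List[int]:
    """Split a global token budget across groups; lower entropy gets more tokens."""
    weights = [math.exp(-h / temperature) for h in entropies]
    norm = sum(weights)
    # Truncation means the shares can sum to slightly less than total_budget.
    return [int(total_budget * w / norm) for w in weights]

def groups_to_process(entropies: Sequence[float], stop_threshold: float) -> List[int]:
    """Visit groups in order and stop once one makes the model certain enough."""
    visited = []
    for index, h in enumerate(entropies):
        visited.append(index)
        if h <= stop_threshold:
            break  # skip the remaining groups, as in the early-stopping variant
    return visited
```

A single scalar per group makes groups from distant parts of the video directly comparable, which is the gap in clip-local scoring that the abstract points to.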
[197] DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing cs.CVPDF
Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye
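TL;DR: This paper proposes DreamLite, a lightweight (0.39B-parameter) unified on-device diffusion model that supports both text-to-image (T2I) generation and text-guided image editing within a single network. The model is built on a pruned mobile U-Net, unifies conditioning through in-context spatial concatenation in the latent space, and is trained stably with a task-progressive joint pretraining strategy. After fine-tuning and reinforcement learning it performs well on both generation and editing, and step distillation reduces denoising to 4 steps, giving sub-second image processing on a smartphone.
Details
Motivation: Existing diffusion models have very large parameter counts, which leads to high latency and difficult deployment, while existing on-device diffusion models focus mainly on T2I generation and lack support for image editing. A lightweight, efficient, unified on-device model is needed to handle both tasks.
Result: DreamLite scores 0.72 on GenEval for image generation and 4.11 on ImgEdit for image editing, outperforming existing on-device models and remaining competitive with several server-side models. With step distillation, the model needs only 4 denoising steps and can generate or edit a 1024 x 1024 image in under 1 s on a Xiaomi 14 smartphone.
Insight: The main contributions are: 1) a unified network architecture with a conditioning mechanism based on image concatenation in latent space, so that a single model supports generation and editing; 2) a task-progressive joint pretraining strategy that stabilizes training of the compact model; 3) step distillation for very fast inference, providing an efficient option for on-device deployment. The authors describe it as the first unified on-device diffusion model that supports both image generation and editing.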
Abstract: Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves a GenEval score of 0.72 for image generation and an ImgEdit score of 4.11 for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce the denoising process to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
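The horizontal concatenation described in the abstract, (target | blank) for generation and (target | source) for editing, can be sketched on latent arrays. This is a minimal illustration under assumptions: latents are laid out as (channels, height, width), the blank slot is all zeros, and `build_model_input` is a hypothetical name; the abstract does not specify how the blank is encoded.

```python
from typing import Optional
import numpy as np

def build_model_input(target_latent: np.ndarray,
                      source_latent: Optional[np.ndarray] = None) -> np.ndarray:
    """Concatenate latents along the width axis.

    Generation: (target | blank), with the blank assumed to be zeros.
    Editing:    (target | source), with the source image latent on the right.
    """
    if source_latent is None:
        right = np.zeros_like(target_latent)
    else:
        right = source_latent
    return np.concatenate([target_latent, right], axis=-1)

noisy_target = np.ones((4, 8, 8))
generation_input = build_model_input(noisy_target)
editing_input = build_model_input(noisy_target, np.full((4, 8, 8), 2.0))
```

Both tasks produce an input of the same shape, so a single network can serve them without task-specific branches.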
[198] SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild cs.CV | cs.ROPDF
Patrick Rim, Kevin Harris, Braden Copple, Shangchen Han, Xu Xie
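TL;DR: This paper introduces SHOW3D, a 3D dataset of hand-object interactions captured in real-world environments. Because existing datasets are captured in controlled settings and lack environmental diversity, the authors develop a marker-less multi-camera system that allows capture with nearly unconstrained mobility while producing precise 3D annotations of hands and objects. SHOW3D is described as the first large-scale dataset with 3D annotations of hand-object interaction in diverse real-world environments, including outdoor scenes.
Details
Motivation: Existing hand-object interaction datasets are captured mostly in controlled indoor studios, which limits environmental diversity and makes models trained on them generalize poorly to real-world scenes. A method is needed that captures precise 3D annotations under real-world conditions.
Result: The authors build a lightweight, back-mounted multi-camera rig that is synchronized and calibrated with a VR headset for capture. They propose an ego-exo tracking pipeline to produce 3D ground-truth annotations and rigorously evaluate its quality. Experiments on several downstream tasks validate that the approach significantly reduces the fundamental trade-off between environmental realism and 3D annotation accuracy.
Insight: The main contribution is a novel marker-less, multi-camera, mobile capture system that records high-precision 3D hand-object interaction data in the wild, overcoming the environmental-diversity limits of earlier datasets. Viewed objectively, the ego-exo tracking pipeline and the first large-scale real-world 3D hand-object interaction dataset (SHOW3D) are significant contributions that give model training and evaluation a resource closer to practical use.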
Abstract: Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. show3d-dataset.github.io
[199] On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers cs.CV | cs.AI | cs.GR | cs.LGPDF
Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or
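TL;DR: This paper proposes a framework that applies repulsion in the Contextual Space of Diffusion Transformers to address the limited diversity of current text-to-image (T2I) diffusion models. During the transformer's forward pass, it injects a repulsion intervention on the fly into the multimodal attention channels where text conditioning merges with image structure, which substantially increases output diversity while preserving visual fidelity and semantic adherence.
Details
Motivation: Current T2I diffusion models align well with prompts but show a typicality bias: for a given prompt they tend to produce a narrow, visually similar set of results, which limits the broad diversity that creative applications need. Existing approaches face a trade-off: modifying model inputs requires costly optimization, whereas intervening on already-formed spatial latents tends to disrupt visual structure and cause artifacts.
Result: Experiments show that the method yields significantly richer diversity while maintaining visual fidelity and semantic adherence. It also has low computational overhead and remains effective on modern “Turbo” and distilled models, where traditional trajectory-based interventions typically fail.
Insight: The core contribution is on-the-fly repulsion in the Contextual Space, i.e., the multimodal attention channels where text conditioning merges with emergent image structure. This redirects the guidance trajectory after it is structurally informed but before the composition is fixed, injecting diversity efficiently and non-destructively during the forward pass and avoiding the problems of directly modifying the inputs or the latent space.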
Abstract: Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer’s forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern “Turbo” and distilled models where traditional trajectory-based interventions typically fail.
[200] HandX: Scaling Bimanual Motion and Interaction Generation cs.CVPDF
Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu
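TL;DR: HandX is a unified framework for generating realistic bimanual motion and interaction. It consolidates existing datasets, collects new high-quality motion-capture data of bimanual interaction, and uses a decoupled annotation strategy with LLM reasoning to produce fine-grained descriptions. On this basis it benchmarks diffusion and autoregressive models, demonstrating high-quality dexterous motion generation and positive scaling trends in data and model size.
Details
Motivation: Existing whole-body motion synthesis models miss the fine-grained cues that drive dexterous behavior (finger articulation, contact timing, and inter-hand coordination), and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. The work aims to fill this gap.
Result: Experiments demonstrate high-quality dexterous motion generation and clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion.
Insight: The contribution is a unified foundation spanning data, annotation, and evaluation. In particular, it proposes a scalable, decoupled annotation strategy that combines motion feature extraction with LLM reasoning to generate fine-grained semantic descriptions aligned with the motion features, and it introduces new hand-focused evaluation metrics.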
Abstract: Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior (finger articulation, contact timing, and inter-hand coordination), and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
[201] Gen-Searcher: Reinforcing Agentic Search for Image Generation cs.CVPDF
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan
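TL;DR: This paper proposes Gen-Searcher, described as the first trained search-augmented image generation agent, which uses multi-hop reasoning and search to collect the textual knowledge and reference images needed for generation. The authors build a dedicated data pipeline and two high-quality datasets, and introduce the KnowGen benchmark for evaluation. The agent is trained with supervised fine-tuning followed by agentic reinforcement learning with dual reward feedback, and experiments show that it substantially improves Qwen-Image on the KnowGen and WISE benchmarks.
Details
Motivation: Existing image generation models are constrained by frozen internal knowledge and often fail in real-world scenarios that are knowledge-intensive or need up-to-date information, so external search is needed to augment generation.
Result: Gen-Searcher improves Qwen-Image by about 16 points on KnowGen and about 15 points on WISE.
Insight: The contribution is the first trained search-augmented image generation agent, combining multi-hop reasoning and search with reinforcement learning under dual reward feedback. Reusable ideas include the method for building search-intensive datasets, the multi-dimensional KnowGen benchmark, and the training strategy that combines text-based and image-based rewards for stability.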
Abstract: Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
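The dual reward feeding GRPO can be sketched as follows. The group-relative advantage (reward minus the group mean, divided by the group standard deviation) is the standard GRPO normalization; the linear mix of text-based and image-based rewards, the `text_weight` parameter, and the function names are assumptions made for illustration, since the abstract does not say how the two rewards are combined.

```python
import statistics
from typing import List, Sequence

def dual_reward(text_score: float, image_score: float,
                text_weight: float = 0.5) -> float:
    """Hypothetical linear mix of a text-based and an image-based reward."""
    return text_weight * text_score + (1.0 - text_weight) * image_score

def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, each scored as (text_score, image_score).
scores = [(0.9, 0.8), (0.4, 0.6), (0.7, 0.2), (0.1, 0.1)]
rewards = [dual_reward(t, i) for t, i in scores]
advantages = group_relative_advantages(rewards)
```

Mixing two reward sources means a rollout that retrieves the right facts but renders them poorly, or the reverse, still receives a graded signal rather than an all-or-nothing one.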
cs.AI [Back]
[202] daVinci-LLM: Towards the Science of Pretraining cs.AI | cs.CLPDF
Yiwei Qin, Yixiu Liu, Tiantian Mi, Muhang Xie, Zhen Huang
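TL;DR: The daVinci-LLM project aims to advance the scientific study of LLM pretraining by combining industrial-scale compute with a fully open research paradigm. It uses the Data Darwinism framework for systematic data processing and trains a 3B-parameter model on 8T tokens with a two-stage adaptive curriculum. More than 200 controlled ablations reveal how data processing depth, domain saturation dynamics, compositional balance, and evaluation protocol choices affect pretraining capability.
Details
Motivation: Pretraining is badly under-explored as a science because of commercial secrecy and limited academic resources. The project aims to fill the gap between industrial-scale resources and full research freedom.
Result: Through extensive controlled experiments, the work establishes that processing depth is a key dimension of capability gains, shows that different domains have distinct saturation dynamics, and demonstrates that compositional balance enables targeted enhancement.
Insight: The work treats openness itself as scientific methodology and applies Data Darwinism, a principled L0-L9 taxonomy that runs from filtering to synthesis. Its two-stage adaptive curriculum and systematic ablations provide reproducible baselines and insights for understanding pretraining dynamics.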
Abstract: The foundational pretraining phase determines a model’s capability ceiling, as post-training struggles to overcome capability foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully-open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks systematic methodology for data processing, we employ the Data Darwinism framework, a principled L0-L9 taxonomy from filtering to synthesis. We train a 3B-parameter model from random initialization across 8T tokens using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement. Through 200+ controlled ablations, we establish that: processing depth systematically enhances capabilities, establishing it as a critical dimension alongside volume scaling; different domains exhibit distinct saturation dynamics, necessitating adaptive strategies from proportion adjustments to format shifts; compositional balance enables targeted intensification while preventing performance collapse; and evaluation protocol choices shape our understanding of pretraining progress. By releasing the complete exploration process, we enable the community to build upon our findings and systematic methodologies to form cumulative scientific knowledge in pretraining.
[203] Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring cs.AI | cs.CL | cs.CY | cs.HC | cs.MAPDF
Jakub Masłowski, Jarosław A. Chudziak
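TL;DR: This paper proposes the Heterogeneous Debate Engine (HDE), a cognitive architecture that combines Identity-Grounded Retrieval-Augmented Generation (ID-RAG) with Heuristic Theory of Mind (Heuristic ToM). It targets LLM-based multi-agent systems that struggle to give precise answers in ethical tutoring because of semantic drift and logical deterioration. By enforcing doctrinal fidelity and modeling opponents, the architecture markedly improves the complexity and stability of debates.
Details
Motivation: Current LLM-based multi-agent systems are prone to semantic drift and logical deterioration in complex reasoning tasks, so debates stall or fall into circular arguments and cannot meet the need for precise answers in ethical tutoring. The core challenge is enforcing doctrinal fidelity while keeping generative flexibility.
Result: The evaluation indicates that architectural heterogeneity is a key variable for stability: contrary doctrinal initializations (e.g., deontology vs. utilitarianism) raised students' Argument Complexity Scores by an order of magnitude over baselines, supporting the effectiveness of ID-RAG and Heuristic ToM for sustaining high-fidelity adversarial pedagogy.
Insight: The contribution is using Identity-Grounded Retrieval-Augmented Generation (ID-RAG) to enforce doctrinal fidelity and combining it with Heuristic Theory of Mind for strategic opponent modeling, which yields a heterogeneous cognitive architecture. This offers a new approach to semantic drift and logical deterioration in multi-agent systems and highlights the role of architectural diversity in debate quality and stability.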
Abstract: Large Language Models (LLMs) are increasingly used as autonomous agents in complex reasoning tasks, opening a niche for dialectical interactions. However, multi-agent systems implemented with unconstrained systems systematically undergo semantic drift and logical deterioration, and thus can hardly be used for ethical tutoring, where a precise answer is required. Current simulations often degenerate into dialectical stagnation, with the agents falling into recursive concurrence or circular arguments. A critical challenge remains: how to enforce doctrinal fidelity without suppressing the generative flexibility required for dialectical reasoning? To address this challenge, we contribute the Heterogeneous Debate Engine (HDE), a cognitive architecture that combines Identity-Grounded Retrieval-Augmented Generation (ID-RAG) for doctrinal fidelity and Heuristic Theory of Mind for strategic opponent modeling. Our evaluation shows that architectural heterogeneity is a crucial variable for stability: contrary doctrinal initializations (e.g., Deontology vs. Utilitarianism) increased the Argument Complexity Scores of students by an order of magnitude over baselines. These findings validate the effectiveness of ID-RAG and Heuristic ToM as architectural requirements in maintaining high-fidelity (adversarial) pedagogy.
[204] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome cs.AI | cs.CLPDF
Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin
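TL;DR: This paper presents MiroEval, a benchmark and evaluation framework for comprehensively assessing deep research agents. The benchmark contains 100 tasks grounded in real user needs (70 text-only, 30 multimodal) and supports periodic updates. The framework diagnoses systems along three complementary dimensions: adaptive synthesis quality, agentic factuality verification, and process-centric evaluation. Results on 13 systems show that process quality reliably predicts overall outcome, that multimodal tasks are substantially harder, and that the MiroThinker series is the most balanced.
Details
Motivation: Existing benchmarks mainly score final reports with fixed rubrics, cannot assess the underlying research process, offer limited multimodal coverage, rely on synthetic tasks, and are hard to refresh as knowledge changes. MiroEval aims to close these gaps with a more comprehensive, realistic, and evolvable evaluation tool for deep research systems.
Result: Evaluation on 13 systems shows that the three dimensions capture complementary aspects of system capability, that process quality is a reliable predictor of overall outcome, and that most systems decline by 3 to 10 points on multimodal tasks. MiroThinker-H1 ranks highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and framework.
Insight: The contribution is a three-dimensional evaluation framework that integrates adaptive synthesis quality, active-retrieval verification, and process auditing, together with an updatable multimodal benchmark grounded in real needs. The core insight is that process evaluation exposes weaknesses that output-level metrics miss, which gives a fuller view for diagnosing system capability.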
Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
[205] Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG cs.AI | cs.CLPDF
Davide Di Gioia
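TL;DR: This paper proposes Entropic Claim Resolution (ECR), a new inference-time algorithm for retrieval-augmented generation (RAG) systems. It recasts RAG reasoning as entropy minimization over competing semantic answer hypotheses, sequentially selects atomic evidence claims by maximizing Expected Entropy Reduction (EER), and terminates dynamically once a state of epistemic sufficiency is reached. The aim is to address the weaknesses of relevance-based retrieval when evidence conflicts or the query is ambiguous.
Details
Motivation: Current RAG systems rely mainly on relevance-based dense retrieval, but in knowledge-intensive or real-world scenarios with conflicting evidence or fundamental query ambiguity, relevance alone is not enough to resolve epistemic uncertainty.
Result: The paper integrates ECR into a production-grade multi-strategy retrieval pipeline (CSGR++) and analyzes its theoretical properties, but the abstract reports no quantitative experiments or benchmark comparisons.
Insight: The core contribution is bringing a decision-theoretic value-of-information criterion (Expected Entropy Reduction) to evidence selection. This gives uncertainty-aware evidence selection a rigorous foundation and shifts retrieval from “retrieve what is most relevant” to “retrieve what is most discriminative”.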
Abstract: Current Retrieval-Augmented Generation (RAG) systems predominantly rely on relevance-based dense retrieval, sequentially fetching documents to maximize semantic similarity with the query. However, in knowledge-intensive and real-world scenarios characterized by conflicting evidence or fundamental query ambiguity, relevance alone is insufficient for resolving epistemic uncertainty. We introduce Entropic Claim Resolution (ECR), a novel inference-time algorithm that reframes RAG reasoning as entropy minimization over competing semantic answer hypotheses. Unlike action-driven agentic frameworks (e.g., ReAct) or fixed-pipeline RAG architectures, ECR sequentially selects atomic evidence claims by maximizing Expected Entropy Reduction (EER), a decision-theoretic criterion for the value of information. The process dynamically terminates when the system reaches a mathematically defined state of epistemic sufficiency (H <= epsilon, subject to epistemic coherence). We integrate ECR into a production-grade multi-strategy retrieval pipeline (CSGR++) and analyze its theoretical properties. Our framework provides a rigorous foundation for uncertainty-aware evidence selection, shifting the paradigm from retrieving what is most relevant to retrieving what is most discriminative.
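The selection rule can be made concrete with a small worked sketch. It assumes a simplified setting that the abstract does not spell out: a discrete prior over answer hypotheses, and binary claims with a known probability of being true under each hypothesis. Expected Entropy Reduction is then the prior entropy minus the expected posterior entropy. The function names, the `observe` callback, and the `epsilon` default are hypothetical, and the abstract's additional “epistemic coherence” condition is not modeled.

```python
import math
from typing import Callable, Dict, List, Sequence

def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def posterior(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """Bayes update of the hypothesis distribution given one outcome."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint] if total > 0 else list(prior)

def expected_entropy_reduction(prior: Sequence[float],
                               p_true_given_h: Sequence[float]) -> float:
    """EER = H(prior) - sum over outcomes of P(outcome) * H(posterior | outcome)."""
    expected = 0.0
    for lik in (list(p_true_given_h), [1.0 - l for l in p_true_given_h]):
        p_outcome = sum(p * l for p, l in zip(prior, lik))
        if p_outcome > 0:
            expected += p_outcome * entropy(posterior(prior, lik))
    return entropy(prior) - expected

def resolve(prior: Sequence[float], claims: Dict[str, List[float]],
            observe: Callable[[str], bool], epsilon: float = 0.1) -> List[float]:
    """Greedily check the most discriminative claim until H <= epsilon."""
    belief, remaining = list(prior), dict(claims)
    while remaining and entropy(belief) > epsilon:
        name = max(remaining,
                   key=lambda n: expected_entropy_reduction(belief, remaining[n]))
        lik = remaining.pop(name)
        if not observe(name):              # claim turned out to be false
            lik = [1.0 - l for l in lik]
        belief = posterior(belief, lik)
    return belief

claims = {"discriminative": [1.0, 0.0], "uninformative": [0.5, 0.5]}
belief = resolve([0.5, 0.5], claims, observe=lambda name: True)
```

In this toy case the claim that is equally likely under both hypotheses has zero EER however relevant it looks, which is the distinction between relevance and discriminativeness that the abstract draws.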
[206] The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle cs.AI | cs.CL | cs.HCPDF
Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen
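TL;DR: This paper introduces AIGENIE, an R package that implements the AI-GENIE framework to automate the early stages of psychological scale development using large language models and network psychometric methods. The framework generates a candidate item pool with an LLM, converts the items into high-dimensional embeddings, and applies a multi-step reduction pipeline (Exploratory Graph Analysis, Unique Variable Analysis, and bootstrap Exploratory Graph Analysis) to produce structurally validated item pools entirely in silico. The tutorial demonstrates the package with two running examples: the Big Five personality model and AI Anxiety.
Details
Motivation: Traditional scale development requires extensive expert involvement, iterative revision, and large-scale pilot testing, which is slow and labor-intensive. The work aims to use AI, particularly large language models, to automate the early stages of this process, improving efficiency and reducing reliance on expert resources.
Result: The paper is primarily a tutorial that introduces the package's functions and usage. The abstract reports no quantitative experiments or benchmarks and illustrates the package through the Big Five and AI Anxiety examples.
Insight: The main contribution is combining LLM text generation with network psychometric methods (such as EGA and UVA) into a fully automated pipeline for item generation and structural validation that can also run offline. This gives psychometrics a new data-driven research tool, particularly suited to exploratory work on emerging constructs.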
Abstract: Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin. The AIGENIE R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early stages of this process. The package generates candidate item pools using LLMs, transforms them into high-dimensional embeddings, and applies a multi-step reduction pipeline – Exploratory Graph Analysis (EGA), Unique Variable Analysis (UVA), and bootstrap EGA – to produce structurally validated item pools entirely in silico. This tutorial introduces the package across six parts: installation and setup, understanding Application Programming Interfaces (APIs), text generation, item generation, the AIGENIE function, and the GENIE function. Two running examples illustrate the package’s use: the Big Five personality model (a well-established construct) and AI Anxiety (an emerging construct). The package supports multiple LLM providers (OpenAI, Anthropic, Groq, HuggingFace, and local models), offers a fully offline mode with no external API calls, and provides the GENIE() function for researchers who wish to apply the psychometric reduction pipeline to existing item pools regardless of their origin. The AIGENIE package is freely available on R-universe at https://laralee.r-universe.dev/AIGENIE.
[207] A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI cs.AI | cs.CV | cs.LGPDF
Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara
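TL;DR: Through a case study of surgical tool detection in neurosurgery, this paper examines the potential and limits of modern AI for surgical image analysis. It finds that even with multi-billion-parameter models and extensive training, current vision-language models perform poorly on this seemingly simple task, and that increasing model size and training time brings only marginal gains, which indicates significant obstacles for surgical applications.
Details
Motivation: The study explores whether general-purpose AI models are viable as collaborative tools in surgery. AI has matched or exceeded human experts on several biomedical tasks but still lags on surgical image analysis, and surgery involves complex factors such as multimodal data integration, human interaction, and physical effects.
Result: On neurosurgical tool detection, current state-of-the-art vision-language models fall short, and scaling model size and training time yields only marginal improvements in the relevant metrics, which suggests that the obstacles cannot be removed by compute scaling alone.
Insight: The contribution is showing the high barriers that expert annotation and compute requirements pose for surgical AI, along with the limits of model scaling. This challenges the view that data and label availability are the only limiting factors, and the paper discusses potential solutions.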
Abstract: Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks – including multimodal data integration, human interaction, and physical effects – generally-capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.
eess.IV
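[208] External Benchmarking of Lung Ultrasound Models for Pneumothorax-Related Signs: A Manifest-Based Multi-Source Study eess.IV | cs.CV
Takehiro Ishikawa
TL;DR: This study builds a manifest-based, multi-source external benchmark for evaluating pneumothorax-related lung ultrasound (LUS) AI models. Testing a previously published single-site binary classifier shows a marked performance drop on heterogeneous external data and reveals that binary classification (normal sliding vs. absent sliding) fails to capture clinically important signs such as lung point and lung pulse.
Details
Motivation: Pneumothorax-related LUS AI models lack reproducible external benchmarks, and binary classification of lung sliding (present or absent) may obscure clinically important signs such as lung point and lung pulse. The study therefore builds a manifest-based external benchmark to assess cross-domain generalization and task validity.
Result: The single-site binary classifier reaches ROC-AUC 0.9625 in-domain but falls to 0.7050 on the heterogeneous external benchmark (0.7212 when restricted to linear-probe clips). In challenge-state analysis, the mean predicted probability of absent sliding, P(absent), ranks absent (0.504) > lung point (0.313) > normal (0.186) > lung pulse (0.143). Lung pulse does not differ significantly from normal clips (p=0.813), whereas lung point differs significantly from both absent and normal clips.
Insight: The novelty is a manifest-based construction method for a multi-source external benchmark: the manifest carries metadata such as URLs, timestamps, and crop coordinates, so source videos need not be redistributed and external evaluation stays reproducible. The core insight is that binary lung-sliding classification is an incomplete proxy for pneumothorax reasoning, because it cannot separate lung pulse (which the model treats as normal-like) or lung point (an intermediate, ambiguous state), which argues for finer-grained classification.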
Abstract: Background and Aims: Reproducible external benchmarks for pneumothorax-related lung ultrasound (LUS) AI are scarce, and binary lung-sliding classification may obscure clinically important signs. We therefore developed a manifest-based external benchmark and used it to test both cross-domain generalization and task validity. Methods: We curated 280 clips from 190 publicly accessible LUS source videos and released a reconstruction manifest containing URLs, timestamps, crop coordinates, labels, and probe shape. Labels were normal lung sliding, absent lung sliding, lung point, and lung pulse. A previously published single-site binary classifier was evaluated on this benchmark; challenge-state analysis examined lung point and lung pulse using the predicted probability of absent sliding, P(absent). Results: The single-site comparator achieved ROC-AUC 0.9625 in-domain but 0.7050 on the heterogeneous external benchmark; restricting external evaluation to linear clips still yielded ROC-AUC 0.7212. In challenge-state analysis, mean P(absent) ranked absent (0.504) > lung point (0.313) > normal (0.186) > lung pulse (0.143). Lung pulse differed from absent clips (p=0.000470) but not from normal clips (p=0.813), indicating that the binary model treated pulse as normal-like despite absent sliding. Lung point differed from both absent (p=0.000468) and normal (p=0.000026), supporting its interpretation as an intermediate ambiguity state rather than a clean binary class. Conclusion: A manifest-based, multi-source benchmark can support reproducible external evaluation without redistributing source videos. Binary lung-sliding classification is an incomplete proxy for pneumothorax reasoning because it obscures blind-spot and ambiguity states such as lung pulse and lung point.
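To make the manifest idea and the two reported analyses concrete, here is a minimal Python sketch. It assumes a hypothetical manifest schema (the field names `url`, `start_s`, `end_s`, `crop_xywh`, `label`, and `probe` are illustrative, not the paper's actual column names) and computes ROC-AUC by pairwise comparison plus the mean P(absent) per label. It is not the author's evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ManifestEntry:
    # Hypothetical schema: the real manifest's field names may differ.
    url: str                            # where the source video can be fetched
    start_s: float                      # clip start time in seconds
    end_s: float                        # clip end time in seconds
    crop_xywh: Tuple[int, int, int, int]  # crop rectangle in pixels
    label: str                          # "normal", "absent", "lung_point", "lung_pulse"
    probe: str                          # probe shape, e.g. "linear"


def roc_auc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mean_score_by_label(entries: List[ManifestEntry], scores: List[float]) -> Dict[str, float]:
    """Average a per-clip score such as P(absent) within each label group."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for entry, score in zip(entries, scores):
        sums[entry.label] = sums.get(entry.label, 0.0) + score
        counts[entry.label] = counts.get(entry.label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```

Because the manifest stores only pointers and coordinates, anyone with access to the public videos can rebuild the same clips; the per-label means are what the challenge-state ranking in the abstract summarizes.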
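[209] ANVIL: Accelerator-Native Video Interpolation via Codec Motion Vector Priors eess.IV | cs.CV
Shibo Liu
TL;DR: ANVIL is a video frame interpolation method designed for mobile devices. It reuses motion vectors already computed by the H.264 decoder to prealign input frames, avoiding compute-heavy optical flow estimation and spatial sampling on mobile neural processing units. A convolution-dominated network refines the residual, achieving 12.8 ms 1080p network inference on a Snapdragon 8 Gen 3 device and supporting real-time frame-rate doubling.
Details
Motivation: Mainstream flow-based video frame interpolation faces three structural barriers on mobile accelerators: spatial sampling operators exceed the frame budget or lack hardware support, iterative flow refinement collapses under 8-bit post-training quantization, and memory-bound operators dominate the inference graph.
Result: On a Snapdragon 8 Gen 3 device, ANVIL achieves 12.8 ms 1080p network inference in 8-bit integer precision. An open-source Android player sustains a median end-to-end latency of 28.4 ms per interpolated frame pair over 54,623 consecutively logged samples during 30 minutes of continuous playback.
Insight: The key idea is to use the codec's existing motion vectors as a prior, removing learned optical flow, spatial sampling, and iterative accumulation so that the inference graph consists mainly of compute-bound operators, which suits mobile accelerators. The analysis identifies quantized accumulation in iterative methods as a key mechanism behind integer quantization failure. The method offers an efficient solution for H.264 playback scenarios.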
Abstract: Mobile displays refresh at 90-120 Hz, yet most video is encoded at 24-30 frames per second; real-time frame-rate doubling requires each synthesized frame within 33.3 ms on mobile neural processing units. We show that mainstream flow-based video frame interpolation faces three structural deployment barriers on mobile accelerators: spatial sampling operators exceed the frame budget or lack hardware support, iterative flow refinement collapses under 8-bit post-training quantization, and memory-bound operators dominate the inference graph. ANVIL addresses these barriers by reusing motion vectors already computed by the H.264 decoder to prealign input frames, removing learned optical flow, spatial sampling, and iterative accumulation from the accelerator graph. The remaining residual is refined by a convolution-dominated network whose inference graph is composed almost entirely of compute-bound operators. On a Snapdragon 8 Gen 3 device, ANVIL achieves 12.8 ms 1080p network inference in 8-bit integer precision; an open-source Android player sustains 28.4 ms median end-to-end latency per interpolated frame pair over 54,623 consecutively logged samples during 30-minute continuous playback. Per-operator causal analysis identifies quantized accumulation on recurrent flow states as a key mechanism behind integer quantization failure in iterative methods. The current design targets H.264 playback scenarios with decoder-exposed motion vectors.
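[210] Reliability-Aware Weighted Multi-Scale Spatio-Temporal Maps for Heart Rate Monitoring eess.IV | cs.CV
Arpan Bairagi, Rakesh Dey, Siladittya Manna, Umapada Pal
TL;DR: This paper proposes a reliability-aware Weighted Multi-Scale Spatio-Temporal (WMST) map for heart rate monitoring with remote photoplethysmography (rPPG). The method models pixel reliability by suppressing environmental noise and uses the WMST map in a Swin-Unet-based self-supervised contrastive learning framework, in which positive samples come from conventional rPPG signals and temporally expanded WMST maps, and negative samples use a new High-High-High (HHH) wavelet map. Experiments on public rPPG benchmarks show improved robustness to motion and illumination, with lower heart rate estimation error and higher Pearson correlation.
Details
Motivation: In unconstrained environments, rPPG signal quality degrades under illumination changes, motion, shadows, and specular reflections. The goal is to make heart rate monitoring more robust and accurate.
Result: On public rPPG benchmarks, the method achieves lower heart rate estimation error and higher Pearson correlation than existing self-supervised learning (SSL) based rPPG methods, with better robustness to motion and illumination.
Insight: The contributions are: 1) a reliability-aware weighted multi-scale spatio-temporal map that focuses on physiologically valid regions through noise suppression and weighting strategies; 2) a Swin-Unet-based self-supervised contrastive learning framework that generates positive samples from WMST maps; 3) an HHH wavelet map used as a negative sample, which keeps motion and structural detail while filtering out physiological information, improving the model's discriminative ability.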
Abstract: Remote photoplethysmography (rPPG) allows for the contactless estimation of physiological signals from facial videos by analyzing subtle skin color changes. However, rPPG signals are extremely susceptible to illumination changes, motion, shadows, and specular reflections, resulting in low-quality signals in unconstrained environments. To overcome these issues, we present a Reliability-Aware Weighted Multi-Scale Spatio-Temporal (WMST) map that models pixel reliability through the suppression of environmental noises. These noises are modeled using different weighting strategies to focus on more physiologically valid areas. Leveraging the WMST map, we develop an SSL contrastive learning approach based on Swin-Unet, where positive pairs are generated from conventional rPPG signals and temporally expanded WMST maps. Moreover, we introduce a new High-High-High (HHH) wavelet map as a negative example that maintains motion and structural details while filtering out physiological information. Here, our aim is to estimate heart rate (HR), and the experiments on public rPPG benchmarks show that our approach enhances motion and illumination robustness with lower HR estimation error and higher Pearson correlation than existing Self-Supervised Learning (SSL) based rPPG methods.
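[211] Uncertainty-Aware Mapping from 3D Keypoints to Anatomical Landmarks for Markerless Biomechanics eess.IV | cs.AI | cs.CV
Cesare Davide Pace, Alessandro Marco De Nunzio, Claudio De Stefano, Francesco Fontanella, Mario Molinara
TL;DR: This paper proposes a framework that uses predictive uncertainty for frame-wise quality control when mapping 3D pose keypoints to 3D anatomical landmarks in markerless biomechanics. Within a temporal learning model, it separately models uncertainty arising from observation noise and from model limitations. Evaluation on AMASS shows that model-uncertainty estimates correlate strongly with landmark error, allowing reliable frames to be selected and severe failures to be detected.
Details
Motivation: Markerless biomechanics usually treats video-derived 3D skeletal keypoints as deterministic estimates and has no frame-wise quality-control mechanism for the keypoint-to-landmark mapping step. The paper studies predictive uncertainty as a quantitative confidence measure for this step.
Result: On AMASS, model-uncertainty estimates show a strong monotonic association with landmark error (Spearman ρ≈0.63). Retaining only the most reliable frames reduces error to about 16.8 mm at 10% coverage, and severe failures are detected accurately (ROC-AUC≈0.92 for errors >50 mm). Reliability ranking remains stable under input degradation such as Gaussian noise and simulated missing joints.
Insight: The novelty is the systematic use of predictive uncertainty, model uncertainty in particular, as a practical tool for automatic quality control in the markerless keypoint-mapping pipeline. The analysis indicates that separating out and exploiting model uncertainty, rather than observation-noise uncertainty, is what improves mapping reliability, an error-aware filtering idea that may transfer to other complex, error-prone computer vision tasks.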
Abstract: Markerless biomechanics increasingly relies on 3D skeletal keypoints extracted from video, yet downstream biomechanical mappings typically treat these estimates as deterministic, providing no principled mechanism for frame-wise quality control. In this work, we investigate predictive uncertainty as a quantitative measure of confidence for mapping 3D pose keypoints to 3D anatomical landmarks, a critical step preceding inverse kinematics and musculoskeletal analysis. Within a temporal learning framework, we model both uncertainty arising from observation noise and uncertainty related to model limitations. Using synchronized motion capture ground truth on AMASS, we evaluate uncertainty at frame and joint level through error–uncertainty rank correlation, risk–coverage analysis, and catastrophic outlier detection. Across experiments, uncertainty estimates, particularly those associated with model uncertainty, exhibit a strong monotonic association with landmark error (Spearman $ρ\approx 0.63$), enabling selective retention of reliable frames (error reduced to $\approx 16.8$ mm at 10% coverage) and accurate detection of severe failures (ROC-AUC $\approx 0.92$ for errors $>50$ mm). Reliability ranking remains stable under controlled input degradation, including Gaussian noise and simulated missing joints. In contrast, uncertainty attributable to observation noise provides limited additional benefit in this setting, suggesting that dominant failures in keypoint-to-landmark mapping are driven primarily by model uncertainty. Our results establish predictive uncertainty as a practical, frame-wise tool for automatic quality control in markerless biomechanical pipelines.
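The risk-coverage analysis mentioned above can be sketched in a few lines of Python: sort frames by predicted uncertainty, keep the most confident fraction, and report the mean error of what is kept. The function name and the toy numbers are illustrative assumptions; the paper's exact protocol (for example, per-joint versus per-frame aggregation) is not specified here.

```python
from typing import Dict, List


def risk_coverage(errors: List[float], uncertainties: List[float],
                  coverages: List[float]) -> Dict[float, float]:
    """Mean error of the retained frames at each coverage level.

    Frames are ranked from lowest to highest uncertainty; at coverage c the
    first round(c * N) frames (at least one) are kept.
    """
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    sorted_errors = [errors[i] for i in order]
    result = {}
    for c in coverages:
        k = max(1, int(round(c * len(sorted_errors))))
        result[c] = sum(sorted_errors[:k]) / k
    return result
```

If uncertainty is informative, the mean error should fall as coverage shrinks, which is the behaviour the abstract reports at 10% coverage.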
eess.SP
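[212] EMPD: An Event-based Multimodal Physiological Dataset for Remote Pulse Wave Detection eess.SP | cs.CV | cs.LG
Qian Feng, Pengfei Li, Rongshan Gao, Jiale Xu, Rui Gong
TL;DR: This paper presents EMPD, the first benchmark dataset for non-contact physiological sensing with event cameras. A laser-assisted acquisition system lets an event camera capture micro-vibrations of the radial artery, and an RGB camera and a pulse oximeter supply synchronized multimodal data. The aim is to overcome the motion artifacts and limited temporal resolution that frame-based cameras face in remote photoplethysmography (rPPG).
Details
Motivation: Frame-based cameras used for remote pulse wave detection (rPPG) often suffer from motion artifacts and limited temporal resolution. A more robust solution is needed, and the high temporal resolution and low latency of event cameras make them a promising candidate.
Result: EMPD contains 193 valid records from 83 subjects, covering a wide heart rate range (40-110 BPM) under resting and post-exercise conditions. It provides synchronized multimodal data with microsecond-level temporal precision and can serve as a key resource for algorithm development in neuromorphic physiological monitoring.
Insight: The novelty is a benchmark that brings event cameras to non-contact physiological sensing: laser modulation strengthens the detectable signal, and the dataset combines event streams, RGB video, and clinical-grade ground truth. This opens a direction for more robust rPPG algorithms with high temporal resolution.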
Abstract: Remote photoplethysmography (rPPG) based on traditional frame-based cameras often struggles with motion artifacts and limited temporal resolution. To address these limitations, we introduce EMPD (Event-based Multimodal Physiological Dataset), the first benchmark dataset specifically designed for non-contact physiological sensing via event cameras. The dataset leverages a laser-assisted acquisition system where a high-coherence laser modulates subtle skin vibrations from the radial artery into significant signals detectable by a neuromorphic sensor. The hardware platform integrates a high-resolution event camera to capture micro-motions and intensity transients, an industrial RGB camera to provide traditional rPPG benchmarks, and a clinical-grade pulse oximeter to record ground truth PPG waveforms. EMPD contains 193 valid records collected from 83 subjects, covering a wide heart rate range (40-110 BPM) under both resting and post-exercise conditions. By providing precisely synchronized multimodal data with microsecond-level temporal precision, EMPD serves as a crucial resource for developing robust algorithms in the field of neuromorphic physiological monitoring. The dataset is publicly available at: https://doi.org/10.5281/zenodo.18765701
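[213] Stress Classification from ECG Signals Using Vision Transformer eess.SP | cs.AI | cs.CV | cs.LG
Zeeshan Ahmad, Naimul Khan
TL;DR: The paper proposes a Vision Transformer (ViT) based method for multilevel stress classification. Raw ECG signals are converted to 2D spectrograms, which are split into patches and processed by a transformer encoder, so robust representations are learned end to end without handcrafted features.
Details
Motivation: Assess stress from physiological signals such as ECG, in particular handling the inter-subject variability that makes leave-one-subject-out cross-validation (LOSOCV) difficult, and explore the potential of Vision Transformers in this area.
Result: In LOSOCV experiments on WESAD and RML, the Vision Transformer reaches 76.7% and 71.01% accuracy respectively on three-class classification and 88.3% on binary classification on WESAD, clearly outperforming CNN-based models (1D CNN and ResNet-18) and all previous state-of-the-art methods.
Insight: The novelty is converting ECG signals to 2D spectrograms so that the Vision Transformer's attention mechanism can be applied, which handles inter-subject variability effectively. The method is end to end, avoids handcrafted feature engineering, and demonstrates the advantage of the transformer architecture for physiological signal classification.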
Abstract: Vision Transformers have shown tremendous success in numerous computer vision applications; however, they have not been exploited for stress assessment using physiological signals such as Electrocardiogram (ECG). In order to get the maximum benefit from the vision transformer for multilevel stress assessment, in this paper, we transform the raw ECG data into 2D spectrograms using short-time Fourier transform (STFT). These spectrograms are divided into patches for feeding to the transformer encoder. We also perform experiments with 1D CNN and ResNet-18 (CNN model). We perform leave-one-subject-out cross validation (LOSOCV) experiments on WESAD and Ryerson Multimedia Lab (RML) dataset. One of the biggest challenges of LOSOCV based experiments is to tackle the problem of intersubject variability. In this research, we address the issue of intersubject variability and show our success using 2D spectrograms and the attention mechanism of transformer. Experiments show that vision transformer handles the effect of intersubject variability much better than CNN-based models and beats all previous state-of-the-art methods by a considerable margin. Moreover, our method is end-to-end, does not require handcrafted features, and can learn robust representations. The proposed method achieved 71.01% and 76.7% accuracies with RML dataset and WESAD dataset respectively for three-class classification and 88.3% for binary classification on WESAD.
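The preprocessing described here (STFT to a 2D spectrogram, then non-overlapping patches for the transformer encoder) can be illustrated with a dependency-free Python sketch. The window length, hop size, patch size, and the toy sinusoid standing in for ECG are all assumptions chosen for readability; the paper's actual settings are not given in the abstract, and a real pipeline would use an optimized FFT.

```python
import cmath
import math


def stft_magnitude(signal, win_len, hop):
    """Return a list of frames; each frame holds |DFT| for bins 0..win_len//2 - 1.

    Uses a periodic Hann window and a naive DFT (fine for a toy example).
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win_len)]
        mags = []
        for k in range(win_len // 2):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n in range(win_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def to_patches(image, ph, pw):
    """Split a 2D list (rows x cols) into non-overlapping ph x pw patches, row-major."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - ph + 1, ph):
        for c in range(0, cols - pw + 1, pw):
            patches.append([row[c:c + pw] for row in image[r:r + ph]])
    return patches


# Toy stand-in for an ECG trace: a sinusoid centred on DFT bin 4.
signal = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
frames = stft_magnitude(signal, win_len=16, hop=16)   # time x frequency
spectrogram = [list(col) for col in zip(*frames)]      # frequency x time
patches = to_patches(spectrogram, ph=4, pw=2)          # inputs to a patch embedding
```

Each patch would then be flattened and linearly projected before entering the encoder, as in a standard ViT.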
cs.IR
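[214] LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval cs.IR | cs.AI | cs.CL | cs.CV
Seonok Kim
TL;DR: LITTA is a query-expansion-centric retrieval framework for finding evidence pages in visually rich documents such as textbooks, technical reports, and manuals. It uses a large language model to generate complementary query variants, scores pages with a frozen vision retriever using late interaction, and aggregates candidates with reciprocal rank fusion, improving the robustness and accuracy of multimodal document retrieval without retraining the retriever.
Details
Motivation: Retrieving relevant evidence from visually rich documents is hard because of long context, complex layouts, and weak lexical overlap between user questions and supporting pages. The goal is to improve multimodal retrieval without retraining the retriever.
Result: Evaluated on visually grounded document retrieval in three domains (computer science, pharmaceuticals, and industrial manuals), multi-query retrieval consistently improves top-k accuracy, recall, and MRR over single-query retrieval, with especially large gains in domains with high visual and semantic variability. The accuracy-efficiency trade-off is controlled directly by the number of query variants, which makes deployment under latency constraints practical.
Insight: The novelty is combining query expansion with late-interaction scoring and test-time alignment: complementary query variants are generated and their results aggregated to improve retrieval robustness and coverage. Viewed objectively, this is a simple and effective mechanism that works with existing multimodal embedding indices, improves performance without retraining, and is flexible to deploy.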
Abstract: Retrieving relevant evidence from visually rich documents such as textbooks, technical reports, and manuals is challenging due to long context, complex layouts, and weak lexical overlap between user questions and supporting pages. We propose LITTA, a query-expansion-centric retrieval framework for evidence page retrieval that improves multimodal document retrieval without retriever retraining. Given a user query, LITTA generates complementary query variants using a large language model and retrieves candidate pages for each variant using a frozen vision retriever with late-interaction scoring. Candidates from expanded queries are then aggregated through reciprocal rank fusion to improve evidence coverage and reduce sensitivity to any single phrasing. This simple test-time strategy significantly improves retrieval robustness while remaining compatible with existing multimodal embedding indices. We evaluate LITTA on visually grounded document retrieval tasks across three domains: computer science, pharmaceuticals, and industrial manuals. Multi-query retrieval consistently improves top-k accuracy, recall, and MRR compared to single-query retrieval, with particularly large gains in domains with high visual and semantic variability. Moreover, the accuracy-efficiency trade-off is directly controllable by the number of query variants, making LITTA practical for deployment under latency constraints. These results demonstrate that query expansion provides a simple yet effective mechanism for improving visually grounded multimodal retrieval.
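The aggregation step, reciprocal rank fusion (RRF), is simple enough to show directly. The sketch below is a generic RRF implementation rather than LITTA's code; the smoothing constant `k=60` is the value commonly used in the RRF literature and is an assumption here, as are the page identifiers in the usage note.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of page ids into one list.

    Each page receives 1 / (k + rank) from every list it appears in (rank is
    1-based); pages are returned by descending fused score, ties broken by id.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))
```

For instance, fusing the per-variant results `["p1", "p2", "p3"]`, `["p2", "p1", "p4"]`, and `["p2", "p3", "p1"]` ranks `p2` first because it sits near the top of every list, which is how fusion reduces sensitivity to any single phrasing. Adding more query variants adds more lists to fuse, at the cost of one extra retrieval call each.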
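[215] GroupRAG: Cognitively Inspired Group-Aware Retrieval and Reasoning via Knowledge-Driven Problem Structuring cs.IR | cs.AI | cs.CL
Xinyi Duan, Yuanrong Tang, Jiangtao Gong
TL;DR: This paper proposes GroupRAG, a cognitively inspired, group-aware retrieval and reasoning framework based on knowledge-driven keypoint grouping. It identifies latent structural groups within a problem and performs retrieval and reasoning from multiple conceptual starting points, enabling fine-grained interaction between the two processes. Experiments on MedQA show that it outperforms representative RAG and CoT baselines.
Details
Motivation: Methods such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) mitigate insufficient knowledge and constrained reasoning in language models by adding external knowledge or enforcing linear reasoning chains, but they often degrade in real-world settings. Drawing on cognitive science, the authors argue that inadequate awareness of problem structure is a key overlooked limitation: humans solve problems by searching a structured problem space, not by following a single inference chain.
Result: On the MedQA benchmark, GroupRAG outperforms representative RAG-based and CoT-based baselines.
Insight: The core novelty is to model problem structure explicitly, inspired by human cognition: knowledge-driven keypoint grouping identifies latent structural groups, and retrieval and reasoning interact at a fine grain from multiple conceptual starting points. This is a promising direction for more robust retrieval-augmented reasoning systems.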
Abstract: The performance of language models is commonly limited by insufficient knowledge and constrained reasoning. Prior approaches such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) address these issues by incorporating external knowledge or enforcing linear reasoning chains, but often degrade in real-world settings. Inspired by cognitive science, which characterizes human problem solving as search over structured problem spaces rather than single inference chains, we argue that inadequate awareness of problem structure is a key overlooked limitation. We propose GroupRAG, a cognitively inspired, group-aware retrieval and reasoning framework based on knowledge-driven keypoint grouping. GroupRAG identifies latent structural groups within a problem and performs retrieval and reasoning from multiple conceptual starting points, enabling fine-grained interaction between the two processes. Experiments on MedQA show that GroupRAG outperforms representative RAG- and CoT-based baselines. These results suggest that explicitly modeling problem structure, as inspired by human cognition, is a promising direction for robust retrieval-augmented reasoning.
cs.GR
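[216] ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks cs.GR | cs.AI | cs.CV | cs.LG
Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab
TL;DR: This paper introduces ImagenWorld, a benchmark for stress-testing image generation models. It covers 6 core tasks and 6 topical domains, contains 3,600 condition sets and 20K human annotations, and uses an explainable evaluation schema that localizes object-level and segment-level errors. A large-scale evaluation of 14 models reveals difficulties with editing tasks and with symbolic and text-heavy domains, as well as the limits of VLM-based automatic metrics.
Details
Motivation: Existing image generation benchmarks are limited: they focus on isolated tasks, cover narrow domains, or give opaque scores that do not explain failure modes. A new, comprehensive, explainable benchmark is needed to evaluate model robustness on open-world tasks.
Result: Across the 14 models evaluated on ImagenWorld: models struggle more with editing tasks (especially local edits) than with generation; they do well on artistic and photorealistic images but poorly on symbolic and text-heavy domains such as screenshots and information graphics; closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap on text-heavy cases; VLM-based automatic metrics reach Kendall accuracies up to 0.79, approximating human ranking, but fall short on fine-grained, explainable error attribution.
Insight: The novelty is a large-scale, multi-task, multi-domain, explainable benchmark for image generation with fine-grained human annotations and error tags, usable as a diagnostic tool for robust image generation. Viewed objectively, its systematic, explainable approach to human evaluation and its analysis of how model ability varies across tasks and domains are a useful reference for the field.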
Abstract: Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
astro-ph.GA
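[217] Segmenting Superbubbles in a Simulated Multiphase Interstellar Medium using Computer Vision astro-ph.GA | cs.CV
Jing-Wen Chen, Alex S. Hill, Anna Ordog, Rebecca A. Booth, Mohamed S. Shehata
TL;DR: This paper presents a computer vision method for precise 3D segmentation and tracking of superbubbles in magnetohydrodynamic simulations of the supernova-driven interstellar medium. It uses advanced 3D transformer models to capture the complex morphology and dynamic evolution of these structures. Focusing on one superbubble, driven by successive supernova explosions and interacting with its surroundings, the model produces detailed 3D segmentation masks that allow the bubble's structural evolution to be visualized and analyzed over time.
Details
Motivation: Precisely segmenting and tracking superbubbles in complex interstellar medium simulations is difficult; solving it would improve understanding of their dynamic evolution.
Result: The model produces detailed 3D segmentation masks that allow the superbubble's structural evolution to be visualized and analyzed, revealing its growth patterns, energy retention, and interactions with surrounding interstellar matter.
Insight: The novelty is applying advanced 3D transformer models to a computer vision task on astrophysical simulations, achieving accurate segmentation and tracking of complex dynamic structures and offering a robust interdisciplinary framework for studying other complex phenomena in the cosmos.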
Abstract: We developed a computer vision-based methodology to achieve precise 3D segmentation and tracking of superbubbles within magnetohydrodynamic simulations of the supernova-driven interstellar medium. Leveraging advanced 3D transformer models, our approach effectively captures the complex morphology and dynamic evolution of these astrophysical structures. To demonstrate the technique, we specifically focused on a superbubble exhibiting interesting interactions with its surrounding medium, driven by a series of successive supernova explosions. Our model successfully generated detailed 3D segmentation masks, enabling us to visualize and analyze the bubble’s structural evolution over time. The results reveal insights into the superbubble’s growth patterns, energy retention, and interactions with surrounding interstellar matter. This interdisciplinary approach not only enhances our understanding of superbubble dynamics but also offers a robust framework for investigating other complex phenomena in the cosmos.
cs.LG
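[218] Learning to Select Visual In-Context Demonstrations cs.LG | cs.AI | cs.CL | cs.CV
Eugene Lee, Yu-Chi Lin, Jiajie Diao
TL;DR: This paper proposes LSD (Learning to Select Demonstrations), a method for improving how demonstrations are selected for in-context learning (ICL) in multimodal large language models (MLLMs) on visual tasks. It recasts demonstration selection as a sequential decision problem and trains a reinforcement learning agent (a Dueling DQN with a query-centric Transformer decoder) to build demonstration sets that maximize downstream task performance.
Details
Motivation: The dominant demonstration selection strategy is unsupervised k-nearest-neighbor (kNN) search. This similarity-first approach is sub-optimal for complex factual regression tasks because it tends to pick redundant examples that do not cover the task's full output range.
Result: Evaluation on five visual regression benchmarks shows that LSD significantly outperforms baselines (including kNN) on objective, factual regression tasks and reveals a dichotomy by task type: kNN remains optimal for subjective preference tasks, while LSD is better on objective regression tasks.
Insight: The core novelty is formulating demonstration selection as a reinforcement learning problem and training an agent to learn the selection policy. The key insight is that visual ICL needs demonstrations that balance visual relevance with output diversity; a learned selection policy matters most where regression boundaries must be defined. This suggests matching the selection scheme to the task type (subjective vs. objective).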
Abstract: Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task’s full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
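The abstract names a Dueling DQN as the agent's value estimator. As a reminder of what "dueling" refers to, the following sketch shows the standard dueling aggregation, Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'), followed by a greedy choice of the next demonstration. The function names and inputs are hypothetical; how LSD's query-centric Transformer decoder produces the value and advantage estimates is not described in the abstract.

```python
from typing import List, Set


def dueling_q_values(state_value: float, advantages: List[float]) -> List[float]:
    """Combine a scalar state value with per-action advantages.

    Subtracting the mean advantage keeps V and A identifiable, as in the
    standard dueling-network formulation.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def pick_next_demonstration(state_value: float, advantages: List[float],
                            already_selected: Set[int]) -> int:
    """Greedy step: choose the not-yet-selected candidate with the highest Q-value."""
    q_values = dueling_q_values(state_value, advantages)
    candidates = [i for i in range(len(q_values)) if i not in already_selected]
    return max(candidates, key=lambda i: q_values[i])
```

Repeating the greedy step, with the state updated after each pick, builds the demonstration set one example at a time, which is what makes selection a sequential decision problem rather than a single nearest-neighbor lookup.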
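[219] Efficient Inference of Large Vision Language Models cs.LG | cs.CL | cs.CV
Surendra Pathak
TL;DR: This paper surveys recent optimization techniques for accelerating inference in Large Vision Language Models (LVLMs). It organizes existing methods along four dimensions (visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies) and identifies the limitations of current methods and directions for future research.
Details
Motivation: LVLMs show strong multimodal reasoning, but their computational demands limit scalability and deployment. High-resolution inputs produce very large numbers of visual tokens, and the quadratic complexity of attention makes this worse, so inference efficiency needs to be reviewed and improved systematically.
Result: This is a survey; it proposes no specific model or experiment. It summarizes the current optimization frameworks for accelerating LVLM inference and analyzes their limitations.
Insight: The contribution is a systematic taxonomy that groups LVLM inference optimization into four core dimensions, giving future work on efficient multimodal systems a clear roadmap and a list of key open problems.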
Abstract: Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of visual tokens from high-resolution input data aggravates the situation due to the quadratic complexity of attention mechanisms. To address these issues, the research community has developed several optimization frameworks. This paper presents a comprehensive survey of the current state-of-the-art techniques for accelerating LVLM inference. We introduce a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. Furthermore, we critically examine the limitations of these current methodologies and identify critical open problems to inspire future research directions in efficient multimodal systems.
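[220] From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning cs.LG | cs.CV
Alberto G. Rodriguez Salgado
TL;DR: This paper introduces MazeBench, a benchmark of 110 procedurally generated maze images for evaluating multimodal models on visual spatial tasks. Models such as GPT-5.4 and Gemini 3.1 Pro reach high maze-solving accuracy (91% and 79%), but they do not solve the mazes through genuine visual planning: they convert the image to a text grid and then enumerate paths by brute force, much like breadth-first search, consuming large numbers of tokens. Without added reasoning budgets, every model configuration scores only 2-12%, and on 20×20 ultra-hard mazes the models hit token limits and fail. High accuracy therefore does not indicate human-like spatial understanding.
Details
Motivation: Determine whether multimodal models solve visual spatial tasks through genuine planning or merely through brute-force search in token space, in order to expose their real ability on visual planning tasks.
Result: On MazeBench, GPT-5.4 and Gemini 3.1 Pro reach 91% and 79% accuracy, but through token-hungry brute-force search; without added reasoning budgets, all configurations fall to 2-12%. In a text-grid ablation, Claude Sonnet 4.6 rises from 6% on images to 80% when given the correct grid, which points to visual extraction as the bottleneck.
Insight: The contributions are the MazeBench benchmark for systematically evaluating visual planning in multimodal models and the finding that models rely on image-to-text conversion and token-level search rather than genuine spatial understanding. High accuracy can mask fundamental weaknesses on visual tasks, so reported performance should be interpreted with care.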
Abstract: How do multimodal models solve visual spatial tasks – through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
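For reference, the search that the models are said to emulate "in prose" is ordinary breadth-first search over a text grid. The sketch below is a textbook BFS, not code from the paper; the grid encoding (`#` for walls, any other character for open cells) and the function name are assumptions.

```python
from collections import deque
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def bfs_shortest_path(grid: List[str], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Return a shortest 4-connected path from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

On the 3×3 grid `["S.#", ".##", "..G"]` with start `(0, 0)` and goal `(2, 2)`, the function returns the five-cell path down the left column and along the bottom row. A program does this in microseconds, whereas spelling out the same frontier expansion token by token is what drives the per-solve token counts reported above.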
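[221] VAN-AD: Visual Masked Autoencoder with Normalizing Flow For Time Series Anomaly Detection cs.LG | cs.AI | cs.CV
PengYu Chen, Shang Wan, Xiaohou Shi, Yuan Chang, Yan Sun
TL;DR: This paper proposes VAN-AD, a time series anomaly detection framework built on a visual Masked Autoencoder (MAE) and normalizing flow. It targets the poor generalization of existing methods, which need a separate model for each dataset. An Adaptive Distribution Mapping Module (ADMM) and a Normalizing Flow Module (NFM) mitigate the two main problems of transferring MAE directly to time series, over-generalization and limited local perception, and the method outperforms existing state-of-the-art approaches on multiple real-world datasets.
Details
Motivation: Existing time series anomaly detection methods usually train a dataset-specific model, generalize poorly, and perform badly when training data is scarce. Foundation models are a promising direction, but approaches built on large language models or on large-scale time series datasets still face severe cross-modal gaps or in-domain heterogeneity. This paper explores applying large-scale vision models, specifically MAE, to time series anomaly detection.
Result: Extensive experiments on nine real-world datasets show that VAN-AD consistently outperforms existing state-of-the-art methods across multiple evaluation metrics.
Insight: The core novelty is adapting a pretrained visual MAE to time series anomaly detection with two modules that address the problems of direct transfer. ADMM maps the results before and after MAE reconstruction into a unified statistical space to amplify discrepancies caused by abnormal patterns, mitigating over-generalization. NFM adds a normalizing flow to estimate the probability density of the current window under the global distribution, overcoming limited local perception. This offers a new way to use large-scale vision foundation models for cross-modal time series problems.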
Abstract: Time series anomaly detection (TSAD) is essential for maintaining the reliability and security of IoT-enabled service systems. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. To address this limitation, foundation models have emerged as a promising direction. However, existing approaches either repurpose large language models (LLMs) or construct large-scale time series datasets to develop general anomaly detection foundation models, and still face challenges caused by severe cross-modal gaps or in-domain heterogeneity. In this paper, we investigate the applicability of large-scale vision models to TSAD. Specifically, we adapt a visual Masked Autoencoder (MAE) pretrained on ImageNet to the TSAD task. However, directly transferring MAE to TSAD introduces two key challenges: over-generalization and limited local perception. To address these challenges, we propose VAN-AD, a novel MAE-based framework for TSAD. To alleviate the over-generalization issue, we design an Adaptive Distribution Mapping Module (ADMM), which maps the reconstruction results before and after MAE into a unified statistical space to amplify discrepancies caused by abnormal patterns. To overcome the limitation of local perception, we further develop a Normalizing Flow Module (NFM), which combines MAE with normalizing flow to estimate the probability density of the current window under the global distribution. Extensive experiments on nine real-world datasets demonstrate that VAN-AD consistently outperforms existing state-of-the-art methods across multiple evaluation metrics. We make our code and datasets available at https://github.com/PenyChen/VAN-AD.
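[222] Stepwise Credit Assignment for GRPO on Flow-Matching Models cs.LG | cs.AI | cs.CV
Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi
TL;DR: This paper proposes Stepwise-Flow-GRPO to improve the reinforcement learning use of Flow-GRPO on flow-matching models. The original method assigns credit uniformly across all generation steps, ignoring the temporal structure of diffusion generation (early steps set low-frequency structure and content, late steps resolve high-frequency detail). The new method estimates intermediate rewards with Tweedie's formula and introduces gain-based advantages, giving stepwise credit assignment that improves sample efficiency and convergence speed.
Details
Motivation: Flow-GRPO assigns credit uniformly on flow-matching models. This ignores the temporal nature of diffusion generation, and relying only on the final image reward can wrongly reward suboptimal intermediate steps, especially when errors are corrected later in the trajectory.
Result: Stepwise-Flow-GRPO achieves better sample efficiency and faster convergence than the original method, and its DDIM-inspired SDE improves reward quality while keeping the stochasticity that policy gradients require.
Insight: The novelty is aligning credit assignment with the temporal structure of diffusion generation: Tweedie's formula provides intermediate reward estimates, and gain-based advantages guide policy optimization more precisely. The DDIM-inspired SDE preserves reward quality while maintaining the necessary stochasticity, a design that may carry over to other diffusion-based reinforcement learning frameworks.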
Abstract: Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step’s reward improvement. By leveraging Tweedie’s formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
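One way to read "credit based on each step's reward improvement" is sketched below, under explicit assumptions: each sample in a group has an intermediate reward estimate after every step (the abstract obtains these via Tweedie's formula; here they are simply given as numbers), the per-step gain is the difference between consecutive estimates, and gains are normalized across the group at each step in the usual GRPO style. The function name and the normalization details are illustrative and may differ from the paper.

```python
import math
from typing import List, Tuple


def stepwise_gain_advantages(
    reward_estimates: List[List[float]], eps: float = 1e-8
) -> Tuple[List[List[float]], List[List[float]]]:
    """Compute per-step gains and group-normalized advantages.

    reward_estimates[i][t] is the estimated reward of sample i after step t,
    for t = 0..T. Returns (gains, advantages), each of shape [group][T].
    """
    # Gain at step t: how much the estimated reward improved during that step.
    gains = [[r[t + 1] - r[t] for t in range(len(r) - 1)] for r in reward_estimates]
    num_steps = len(gains[0])
    advantages = [[0.0] * num_steps for _ in gains]
    for t in range(num_steps):
        column = [g[t] for g in gains]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
        for i, x in enumerate(column):
            # Assumed GRPO-style normalization within the group at this step.
            advantages[i][t] = (x - mean) / (std + eps)
    return gains, advantages
```

With two samples that end at the same final reward but improve at different steps, uniform credit would treat all steps alike, whereas the stepwise version gives positive advantage only to the step where each sample actually gained.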
cs.RO
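[223] SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning cs.RO | cs.CL | cs.CV
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart
TL;DR: This paper presents SOLE-R1, a video-language reasoning model designed to serve as the sole reward signal for online reinforcement learning. Given only raw video observations and a natural-language goal, it performs per-timestep spatiotemporal chain-of-thought reasoning and produces dense task-progress estimates that are used directly as rewards. Trained with a large-scale video trajectory and reasoning synthesis pipeline, SOLE-R1 enables zero-shot online reinforcement learning in four simulation environments and a real-robot setting, so robots learn unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning.
Details
Motivation: When used as evaluators in reinforcement learning, today's strongest vision-language models often fail under partial observability and distribution shift, letting policies exploit perceptual errors instead of solving the task. The paper aims to remove this limitation with a video-language reasoning model reliable enough to be the only reward signal for online reinforcement learning.
Result: Across four simulation environments and a real-robot setting, SOLE-R1 succeeds on 24 unseen tasks, substantially outperforms strong vision-language rewarders including GPT-5 and Gemini-3-Pro, and is markedly more robust to reward hacking.
Insight: The contributions are: 1) a video-language reasoning model built specifically for online reinforcement learning, which generates dense progress rewards through spatiotemporal chain-of-thought reasoning; 2) a large-scale video trajectory and reasoning synthesis pipeline that produces temporally grounded chain-of-thought traces aligned with continuous progress supervision; 3) a hybrid training framework that couples supervised fine-tuning with reinforcement learning from verifiable rewards, enabling zero-shot online learning.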
Abstract: Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today’s strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.
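[224] Contextual Graph Representations for Task-Driven 3D Perception and Planning cs.RO | cs.AI | cs.CV | cs.LG
Christopher Agia
TL;DR: This thesis explores how graph neural networks can learn context-aware graph representations from 3D scene graphs to address the oversized state spaces that arise in robot task planning, and it builds a benchmark for evaluating classical planners.
Details
Motivation: 3D scene graphs provide rich representations of object relations, but they contain a large amount of information that is irrelevant to any given task. This inflates the state space for task planning and makes efficient deployment on resource-constrained robot systems difficult.
Result: The thesis constructs a benchmark for empirical comparison of state-of-the-art classical planners and explores using graph neural networks to exploit invariances in the relational structure of planning domains for faster planning.
Insight: The main idea is to use graph neural networks to learn task-driven contextual graph representations that compress a 3D scene graph down to its task-relevant subset, improving planning efficiency and offering a new way to connect robot perception and planning.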
Abstract: Recent advances in computer vision facilitate fully automatic extraction of object-centric relational representations from visual-inertial data. These state representations, dubbed 3D scene graphs, are a hierarchical decomposition of real-world scenes with a dense multiplex graph structure. While 3D scene graphs claim to promote efficient task planning for robot systems, they contain numerous objects and relations when only small subsets are required for a given task. This magnifies the state space that task planners must operate over and prohibits deployment in resource constrained settings. This thesis tests the suitability of existing embodied AI environments for research at the intersection of robot task planning and 3D scene graphs and constructs a benchmark for empirical comparison of state-of-the-art classical planners. Furthermore, we explore the use of graph neural networks to harness invariances in the relational structure of planning domains and learn representations that afford faster planning.
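[225] SpatialPoint: Spatial-aware Point Prediction for Embodied Localization cs.RO | cs.AI | cs.CV
Qiming Zhu, Zhirui Fang, Tianming Zhang, Chuanxiu Liu, Xiaoke Jiang
TL;DR: This paper proposes SpatialPoint, a spatial-aware vision-language framework for embodied localization, i.e., predicting executable 3D points from visual observations and language instructions. The method integrates structured depth into a vision-language model to generate 3D coordinates in the camera frame, and a 2.6M-sample RGB-D dataset is built for training and evaluation. Experiments show that adding depth significantly improves embodied localization, and the approach is validated on real-robot tasks (grasping at specified locations, object placement, and navigation).
Details
Motivation: Embodied intelligence must determine executable locations in 3D space from vision and language instructions. Existing vision-language systems rely mainly on RGB input and lack explicit geometric information, which limits cross-scene generalization.
Result: On the constructed RGB-D dataset, integrating depth significantly improves embodied localization performance. Effectiveness is further validated in real-robot deployment on three representative tasks (grasping at specified locations, object placement, and navigation).
Insight: The paper formalizes embodied localization as 3D point prediction over two complementary target types, touchable points and air points, and explicitly integrates depth into a vision-language model to strengthen spatial reasoning. It also builds a large-scale RGB-D dataset to support training and evaluation.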
Abstract: Embodied intelligence fundamentally requires a capability to determine where to act in 3D space. We formalize this requirement as embodied localization – the problem of predicting executable 3D points conditioned on visual observations and language instructions. We instantiate embodied localization with two complementary target types: touchable points, surface-grounded 3D points enabling direct physical interaction, and air points, free-space 3D points specifying placement and navigation goals, directional constraints, or geometric relations. Embodied localization is inherently a problem of embodied 3D spatial reasoning – yet most existing vision-language systems rely predominantly on RGB inputs, necessitating implicit geometric reconstruction that limits cross-scene generalization, despite the widespread adoption of RGB-D sensors in robotics. To address this gap, we propose SpatialPoint, a spatial-aware vision-language framework with careful design that integrates structured depth into a vision-language model (VLM) and generates camera-frame 3D coordinates. We construct a 2.6M-sample RGB-D dataset covering both touchable and air points QA pairs for training and evaluation. Extensive experiments demonstrate that incorporating depth into VLMs significantly improves embodied localization performance. We further validate SpatialPoint through real-robot deployment across three representative tasks: language-guided robotic arm grasping at specified locations, object placement to target destinations, and mobile robot navigation to goal positions.
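To make "camera-frame 3D coordinates" concrete, the sketch below back-projects a pixel with known depth through a standard pinhole camera model. This is textbook geometry and not SpatialPoint's method (the paper's model generates the coordinates itself). The function name `backproject` and the intrinsics are illustrative assumptions.

```python
from typing import Tuple


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics for a 640x480 RGB-D sensor.
point = backproject(u=420.0, v=240.0, depth=2.0,
                    fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

With RGB alone the depth term must be inferred implicitly, which is the gap the abstract points to when it argues for feeding structured depth to the VLM.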
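[226] ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation cs.RO | cs.CV
Feng Wu, Wei Zuo, Wenliang Yang, Jun Xiao, Yang Liu
TL;DR: This paper proposes ReMemNav, a hierarchical navigation framework for zero-shot object navigation. It augments vision-language models (VLMs) with panoramic semantic priors and episodic memory, and introduces the Recognize Anything Model to anchor the spatial reasoning process. An adaptive dual-modal rethinking mechanism, built on an episodic semantic buffer queue, actively verifies target visibility and corrects decisions to prevent exploration deadlocks. For low-level action execution, depth masks are used to extract a sequence of feasible actions so that the VLM can select the best one to map into actual spatial movement.
Details
Motivation: Zero-shot object navigation requires an agent to locate unseen target objects in unfamiliar environments without prior maps or task-specific training, which remains a major challenge. Although progress in VLMs offers promising commonsense reasoning for the task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control.
Result: Extensive evaluation on HM3D and MP3D shows that ReMemNav outperforms existing training-free zero-shot baselines in both success rate and exploration efficiency. The absolute gains in SR and SPL are 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.
Insight: Contributions include: integrating panoramic semantic priors and episodic memory to strengthen the VLM's spatial reasoning; using the Recognize Anything Model to anchor spatial reasoning and reduce hallucination; an adaptive dual-modal rethinking mechanism that uses historical memory to verify and correct decisions and prevent deadlocks; and combining depth information in low-level control to extract feasible action sequences, mapping semantic intent to actual actions. Together these improve the robustness and efficiency of zero-shot navigation.
Abstract: Zero-shot object navigation requires agents to locate unseen target objects in unfamiliar environments without prior maps or task-specific training, which remains a significant challenge. Although recent advancements in vision-language models (VLMs) provide promising commonsense reasoning capabilities for this task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. In this regard, we propose a novel hierarchical navigation framework named ReMemNav, which seamlessly integrates panoramic semantic priors and episodic memory with VLMs. We introduce the Recognize Anything Model to anchor the spatial reasoning process of the VLM. We also design an adaptive dual-modal rethinking mechanism based on an episodic semantic buffer queue. The proposed mechanism actively verifies target visibility and corrects decisions using historical memory to prevent deadlocks. For low-level action execution, ReMemNav extracts a sequence of feasible actions using depth masks, allowing the VLM to select the optimal action for mapping into actual spatial movement. Extensive evaluations on HM3D and MP3D demonstrate that ReMemNav outperforms existing training-free zero-shot baselines in both success rate and exploration efficiency. Specifically, we achieve significant absolute performance improvements, with SR and SPL increasing by 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.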
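For readers unfamiliar with the two metrics quoted above, SR is the fraction of successful episodes and SPL (Success weighted by Path Length) scales each success by the ratio of the shortest-path length to the length of the path the agent actually took. The sketch below follows the standard definitions; the episode data are made up for illustration and are not from the paper.

```python
from typing import List, Tuple

# (success, shortest_path_length, agent_path_length)
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: one optimal success, two detours, one failure.
episodes = [(True, 5.0, 5.0), (True, 4.0, 8.0), (False, 6.0, 10.0), (True, 3.0, 4.0)]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

Because failed episodes contribute zero and detours are penalized, SPL can never exceed SR, which is why the two numbers are usually reported together.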
[227] SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation cs.RO | cs.AI | cs.CV
Jiwen Zhang, Xiangyu Shi, Siyuan Wang, Zerui Li, Zhongyu Wei
TL;DR: This paper proposes SpatialAnt, a zero-shot robot navigation framework that targets a practical limitation of vision-and-language navigation based on multimodal large language models: dependence on high-quality human-crafted scene reconstructions. The framework builds its own priors through active scene reconstruction, introduces a physical grounding strategy to recover the absolute metric scale of monocular reconstructions, and adds a visual anticipation mechanism for handling noisy reconstructions, enabling robust navigation in unseen environments.
Details
Motivation: Existing exploration-based zero-shot navigation methods rely on high-quality human-crafted scene reconstructions, which are impractical for real robot deployment. A robot should build its own priors through pre-exploration, but such self-built reconstructions are often incomplete and noisy, which severely hurts methods that depend on high-quality reconstructions.
Result: Extensive experiments in simulated and real environments show that SpatialAnt significantly outperforms existing zero-shot methods, reaching success rates of 66% on R2R-CE and 50.8% on RxR-CE. Physical deployment on a Hello Robot further confirms its efficiency and effectiveness, with a 52% success rate in challenging real-world settings.
Insight: The main contributions are: 1) a physical grounding strategy that recovers the absolute metric scale of monocular reconstructions; 2) a visual anticipation mechanism that renders future observations from the noisy point cloud, letting the agent reason counterfactually and prune paths that contradict the human instruction, thereby bridging imperfect self-reconstruction and robust execution.
Abstract: Vision-and-Language Navigation (VLN) has recently benefited from Multimodal Large Language Models (MLLMs), enabling zero-shot navigation. While recent exploration-based zero-shot methods have shown promising results by leveraging global scene priors, they rely on high-quality human-crafted scene reconstructions, which are impractical for real-world robot deployment. When encountering an unseen environment, a robot should build its own priors through pre-exploration. However, these self-built reconstructions are inevitably incomplete and noisy, which severely degrades methods that depend on high-quality scene reconstructions. To address these issues, we propose SpatialAnt, a zero-shot navigation framework designed to bridge the gap between imperfect self-reconstructions and robust execution. SpatialAnt introduces a physical grounding strategy to recover the absolute metric scale for monocular-based reconstructions. Furthermore, rather than treating the noisy self-reconstructed scenes as absolute spatial references, we propose a novel visual anticipation mechanism. This mechanism leverages the noisy point clouds to render future observations, enabling the agent to perform counterfactual reasoning and prune paths that contradict human instructions. Extensive experiments in both simulated and real-world environments demonstrate that SpatialAnt significantly outperforms existing zero-shot methods. We achieve a 66% Success Rate (SR) on R2R-CE and 50.8% SR on RxR-CE benchmarks. Physical deployment on a Hello Robot further confirms the efficiency and efficacy of our framework, achieving a 52% SR in challenging real-world settings.
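[228] Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving cs.RO | cs.CV
Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao
TL;DR: This paper proposes Uni-World VLA, a unified vision-language-action model for autonomous driving. Its core idea is to tightly interleave future frame prediction with trajectory planning, instead of the conventional open-loop pattern of predicting a complete future scene first and planning afterwards. The model alternates step by step between predicting future frames and ego actions, so planning decisions are continuously conditioned on imagined future observations, forming a closed-loop interaction between world modeling and control. The model also incorporates monocular depth into frames to strengthen geometric cues and improve long-horizon scene prediction.
Details
Motivation: Existing world-model-based methods usually predict future scenes first and plan afterwards, so the open-loop imagination can drift from the actual decision process. This paper addresses the separation of world modeling and planning by tightly interleaving the two, aiming for decisions that adapt better to dynamic traffic scenarios.
Result: Experiments on the NAVSIM benchmark show that the method achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions.
Insight: The main contribution is a closed-loop generation paradigm in which world modeling and planning alternate, overcoming the limits of open-loop imagination. Incorporating monocular depth as a geometric prior in visual prediction, to improve the long-horizon consistency of the world model, is also a technique worth borrowing. Tightly coupling prediction and planning is a promising direction for scalable VLA driving systems.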
Abstract: Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.
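[229] CARLA-Air: Fly Drones Inside a CARLA World – A Unified Infrastructure for Air-Ground Embodied Intelligence cs.RO | cs.AI | cs.CV | cs.HC
Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang
TL;DR: CARLA-Air is an open-source simulation infrastructure that unifies high-fidelity urban driving (CARLA) and physics-accurate multirotor flight (AirSim) within a single Unreal Engine process, providing a shared, spatially and temporally consistent physical environment for air-ground embodied intelligence.
Details
Motivation: Existing open-source simulation platforms (driving simulators and multirotor simulators) are siloed and cannot jointly model aerial and ground agents in a single physically consistent environment, while bridge-based co-simulation suffers from synchronization overhead and spatial-temporal consistency problems.
Result: Within a shared physics tick and rendering pipeline, the platform provides photorealistic environments with rule-compliant traffic, socially aware pedestrians, and aerodynamically consistent UAV dynamics. It synchronously captures up to 18 sensor modalities at each tick and supports a range of air-ground embodied intelligence tasks.
Insight: The main contribution is the deep integration of two widely used simulators (CARLA and AirSim) into a single process, which achieves true spatial-temporal consistency while preserving API compatibility and zero-modification code reuse. An extensible asset pipeline supports integration of custom robot platforms, and the project carries forward AirSim's flight simulation capabilities.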
Abstract: The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim’s aerial capabilities – whose upstream development has been archived – CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir
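[230] $AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning cs.RO | cs.CV
Yuqi Ye, Zijian Zhang, Junhong Lin, Shangkun Sun, Changhao Peng
TL;DR: This paper proposes the AutoDrive-P3 framework, which uses reinforcement fine-tuning to unify perception, prediction, and planning in a single chain-of-thought reasoning process. The framework uses the P3-CoT dataset to promote coherent reasoning and the hierarchical reinforcement learning algorithm P3-GRPO for progressive supervision, and it introduces two modes, detailed thinking and fast thinking, to balance inference efficiency and performance. It achieves state-of-the-art planning performance on the nuScenes and NAVSIMv1/v2 benchmarks.
Details
Motivation: Current end-to-end autonomous driving systems based on vision-language models (VLMs) have two major limitations. First, some models bypass perception and prediction and output planning results directly, which creates a domain gap and weakens decision-making. Second, other models can generate perception, prediction, and planning outputs but make decisions with separate modules, so the lack of synergy hurts real planning performance.
Result: On both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks, the method achieves state-of-the-art performance on planning tasks.
Insight: Contributions include a unified perception-prediction-planning chain-of-thought reasoning framework (P3-CoT), progressive supervision through a hierarchical reinforcement learning algorithm (P3-GRPO), and dual thinking modes (detailed and fast) that balance efficiency and performance. The work combines chain-of-thought reasoning with reinforcement learning to improve the coordinated decision-making and interpretability of autonomous driving systems.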
Abstract: Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) Some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing crucial perception and prediction stages which creates a significant domain gap and compromises decision-making capability; 2) Other VLMs can generate outputs for perception, prediction, and planning tasks but employ a fragmented decision-making approach where these modules operate separately, leading to a significant lack of synergy that undermines true planning performance. To address these limitations, we propose ${AutoDrive\text{-}P^3}$, a novel framework that seamlessly integrates $\textbf{P}$erception, $\textbf{P}$rediction, and $\textbf{P}$lanning through structured reasoning. We introduce the ${P^3\text{-}CoT}$ dataset to facilitate coherent reasoning and propose ${P^3\text{-}GRPO}$, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, ${AutoDrive\text{-}P^3}$ progressively generates CoT reasoning and answers for perception, prediction, and planning, where perception provides essential information for subsequent prediction and planning, while both perception and prediction collectively contribute to the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks. Code is available at https://github.com/haha-yuki-haha/AutoDrive-P3.
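The abstract does not describe how ${P^3\text{-}GRPO}$ assigns credit across the three stages. As background only, the sketch below shows the group-relative advantage used in standard GRPO, where each sampled response is scored against the mean and standard deviation of its own group. It is not the paper's hierarchical variant. The reward values are hypothetical, and the use of the population standard deviation is an assumption, since implementations differ on this.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
group_rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(group_rewards)
```

Responses that beat their group average get positive advantages and the rest get negative ones, so no separate value network is needed.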
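[231] RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time cs.RO | cs.AI | cs.CV | cs.LG
Anurag Ghosh, Srinivasa Narasimhan, Manmohan Chandraker, Francesco Pittaluga
TL;DR: This paper presents RAD-LAD, which pairs the rule-based planner RAD with the learning-based real-time language-action planner LAD. LAD can produce a motion plan in real time at about 20 Hz, or generate textual reasoning together with a motion plan. The system sets a new learning-based state of the art on nuPlan benchmarks, and hybrid planning combines the reliability of rule-based methods with the adaptability and interpretability of language-based methods.
Details
Motivation: In autonomous driving planning, existing learning-based models have high latency and are hard to deploy in real-time closed loop, while purely rule-based methods have structural limitations in complex scenarios.
Result: LAD sets a new learning-based state of the art on the nuPlan Test14-Hard and InterPlan benchmarks, with roughly 3x lower latency than prior driving language models. RAD achieves state-of-the-art performance among rule-based planners on the same benchmarks. The hybrid system combines the strengths of both.
Insight: The paper proposes an efficient, interruptible language-action planning architecture that produces a motion plan in a single forward pass, and it shows that rules and learning are complementary in driving planning: rules provide reliability, while the language model provides adaptive and explainable decisions.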
Abstract: We present LAD, a real-time language–action planner with an interruptible architecture that produces a motion plan in a single forward pass (20 Hz) or generates textual reasoning alongside a motion plan (10 Hz). LAD is fast enough for real-time closed-loop deployment, achieving ~3x lower latency than prior driving language models while setting a new learning-based state of the art on nuPlan Test14-Hard and InterPlan. We also introduce RAD, a rule-based planner designed to address structural limitations of PDM-Closed. RAD achieves state-of-the-art performance among rule-based planners on nuPlan Test14-Hard and InterPlan. Finally, we show that combining RAD and LAD enables hybrid planning that captures the strengths of both approaches. This hybrid system demonstrates that rules and learning provide complementary capabilities: rules support reliable maneuvering, while language enables adaptive and explainable decision-making.
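[232] ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation cs.RO | cs.CV
Yu Sun, Meng Cao, Ping Yang, Rongtao Xu, Yunxiao Yan
TL;DR: This paper introduces ManipArena, a standardized real-world framework for evaluating generalist robot manipulation intelligence, designed to bridge the gap between simulation and real deployment. It contains 20 diverse tasks and more than ten thousand expert trajectories, emphasizes manipulation that requires semantic and spatial reasoning, and supports multi-level generalization evaluation through controlled out-of-distribution settings.
Details
Motivation: Most existing benchmarks are simulator-centric and cannot capture the reality gap caused by perception noise, complex contact dynamics, hardware constraints, and system latency. Fragmented evaluations across different robot platforms also prevent fair and reproducible comparison.
Result: The abstract gives no specific quantitative results or benchmark rankings, but it claims that the framework provides a fair, realistic, and reproducible basis for evaluating vision-language-action models and world models.
Insight: The contribution is a comprehensive evaluation framework that combines diverse reasoning tasks, multi-level generalization tests, long-horizon mobile manipulation, rich sensory diagnostics, and synchronized real-to-sim environments built from high-quality 3D scans, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.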
Abstract: Vision-Language-Action (VLA) models and world models have recently emerged as promising paradigms for general-purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real-world deployment. Existing benchmarks are largely simulator-centric, which provide controllability but fail to capture the reality gap caused by perception noise, complex contact dynamics, hardware constraints, and system latency. Moreover, fragmented real-world evaluations across different robot platforms prevent fair and reproducible comparison. To address these challenges, we introduce ManipArena, a standardized evaluation framework designed to bridge simulation and real-world execution. ManipArena comprises 20 diverse tasks across 10,812 expert trajectories emphasizing reasoning-oriented manipulation tasks requiring semantic and spatial reasoning, supports multi-level generalization through controlled out-of-distribution settings, and incorporates long-horizon mobile manipulation beyond tabletop scenarios. The framework further provides rich sensory diagnostics, including low-level motor signals, and synchronized real-to-sim environments constructed via high-quality 3D scanning. Together, these features enable fair, realistic, and reproducible evaluation for both VLA and world model approaches, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.
[233] StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation cs.RO | cs.CV
Yiran Shi, Dongqi Guo, Tianchen Zhao, Feng Gao, Liangzhi Shi
TL;DR: This paper proposes StreamingVLA, a streaming vision-language-action model that targets the high latency and frequent halting of conventional VLA models, whose stages (observation, action generation, execution) must run sequentially. By replacing action chunking with action flow matching and adding an action saliency-aware adaptive observation mechanism, it parallelizes the VLA stages asynchronously, which significantly improves execution speed and fluency without sacrificing performance.
Details
Motivation: Conventional vision-language-action models perform well at natural-language-driven perception and control, but their high computational cost and sequential execution pattern (observation, action generation, and execution must run serially) cause frequent halting and high latency, making efficient deployment on resource-constrained edge platforms difficult.
Result: While maintaining performance, StreamingVLA achieves a 2.4x latency speedup and reduces execution halting by 6.5x.
Insight: The main contributions are: 1) action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions, so that action generation overlaps with execution; 2) an action saliency-aware adaptive observation mechanism, which adjusts observation frequency according to action importance, so that execution overlaps with observation. Together these let the VLA model parallelize its stages asynchronously in a streaming manner, improving overall efficiency and fluency.
Abstract: Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, since different stages of VLA (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across VLA stages in a “streaming” manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions. This overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, thereby overlapping the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves substantial speedup and improves the fluency of execution. It achieves a 2.4$\times$ latency speedup and reduces execution halting by 6.5$\times$.
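The benefit of overlapping stages can be illustrated with a back-of-the-envelope model. When the three stages run strictly in sequence, one control cycle costs the sum of their durations. In an idealized, fully overlapped pipeline, the steady-state cycle time is bounded by the slowest stage. This is a simplification, not the paper's analysis, and the stage timings below are hypothetical, chosen only to make the arithmetic clear.

```python
def sequential_cycle_ms(observe_ms: float, generate_ms: float, execute_ms: float) -> float:
    """Each stage waits for the previous one, so the costs add up."""
    return observe_ms + generate_ms + execute_ms


def overlapped_cycle_ms(observe_ms: float, generate_ms: float, execute_ms: float) -> float:
    """Idealized pipelining: steady-state cycle time is set by the slowest stage."""
    return max(observe_ms, generate_ms, execute_ms)


# Hypothetical per-stage latencies in milliseconds.
stages = dict(observe_ms=20.0, generate_ms=60.0, execute_ms=40.0)
sequential = sequential_cycle_ms(**stages)
overlapped = overlapped_cycle_ms(**stages)
speedup = sequential / overlapped
```

Real systems fall short of this bound because stages depend on each other's outputs, which is why the abstract introduces flow matching and adaptive observation to make the overlap possible.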
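[234] Pandora: Articulated 3D Scene Graphs from Egocentric Vision cs.RO | cs.CV
Alan Yu, Yun Chang, Christopher Xie, Luca Carlone
TL;DR: This paper presents Pandora, a system that uses egocentric visual data captured by a human wearing Project Aria glasses to build 3D scene graphs containing articulated object parts. This compensates for the limits of the robot's own perception, and the paper shows that it improves a robot's ability to perform mobile manipulation tasks.
Details
Motivation: Current robotic mapping systems perceive the environment incompletely because of limits in the robot's embodiment or skills (for example, it may be unable to open drawers). The paper fills these blind spots using egocentric data recorded as a human naturally explores the scene.
Result: Experiments show that articulated object part models recovered from egocentric data with simple heuristics are comparable in quality to state-of-the-art methods based on other input modalities. The resulting articulated 3D scene graphs improve a robot's performance (for example, a Boston Dynamics Spot) on mobile manipulation tasks such as retrieving concealed items.
Insight: The contribution is to use human egocentric visual data to transfer knowledge about object articulation directly to the robot and to integrate it into 3D scene graphs, improving the understanding of object dynamics and object-container relationships. It is a new paradigm of learning from human experience to compensate for the limits of the robot's embodiment.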
Abstract: Robotic mapping systems typically approach building metric-semantic scene representations from the robot’s own sensors and cameras. However, these “first person” maps inherit the robot’s own limitations due to its embodiment or skillset, which may leave many aspects of the environment unexplored. For example, the robot might not be able to open drawers or access wall cabinets. In this sense, the map representation is not as complete, and requires a more capable robot to fill in the gaps. We narrow these blind spots in current methods by leveraging egocentric data captured as a human naturally explores a scene wearing Project Aria glasses, giving a way to directly transfer knowledge about articulation from the human to any deployable robot. We demonstrate that, by using simple heuristics, we can leverage egocentric data to recover models of articulate object parts, with quality comparable to those of state-of-the-art methods based on other input modalities. We also show how to integrate these models into 3D scene graph representations, leading to a better understanding of object dynamics and object-container relationships. We finally demonstrate that these articulated 3D scene graphs enhance a robot’s ability to perform mobile manipulation tasks, showcasing an application where a Boston Dynamics Spot is tasked with retrieving concealed target items, given only the 3D scene graph as input.
cs.DB
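[235] SEAR: Schema-Based Evaluation and Routing for LLM Gateways cs.DB | cs.AI | cs.CL
Zecheng Zhang, Han Zheng, Yue Xu
TL;DR: SEAR is a schema-based evaluation and routing system for multi-model, multi-provider LLM gateways. It defines an extensible relational schema that jointly manages LLM evaluation signals (such as context, intent, and response characteristics) and gateway operational metrics (such as latency and cost), and it uses LLM reasoning to generate structured data, enabling semantics-aware routing decisions.
Details
Motivation: Production LLM gateways need fine-grained quality signals and operationally grounded routing decisions. The paper fills the gap left by existing systems, which do not unify evaluation and routing.
Result: Across thousands of production sessions, SEAR achieves strong signal accuracy on human-labeled data, supports practical routing decisions, and substantially reduces cost while keeping comparable quality.
Insight: Contributions include a unified data architecture based on a relational schema, structured evaluation signals generated through LLM reasoning (rather than shallow classifiers), and the integration of evaluation and routing in a single query layer, which improves semantic understanding and the interpretability of routing.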
Abstract: Evaluating production LLM responses and routing requests across providers in LLM gateways requires fine-grained quality signals and operationally grounded decisions. To address this gap, we present SEAR, a schema-based evaluation and routing system for multi-model, multi-provider LLM gateways. SEAR defines an extensible relational schema covering both LLM evaluation signals (context, intent, response characteristics, issue attribution, and quality scores) and gateway operational metrics (latency, cost, throughput), with cross-table consistency links across around one hundred typed, SQL-queryable columns. To populate the evaluation signals reliably, SEAR proposes self-contained signal instructions, in-schema reasoning, and multi-stage generation that produces database-ready structured outputs. Because signals are derived through LLM reasoning rather than shallow classifiers, SEAR captures complex request semantics, enables human-interpretable routing explanations, and unifies evaluation and routing in a single query layer. Across thousands of production sessions, SEAR achieves strong signal accuracy on human-labeled data and supports practical routing decisions, including large cost reductions with comparable quality.
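A minimal sketch of what "unifying evaluation and routing in a single query layer" could look like, using Python's built-in `sqlite3`. The two tables, their columns, the model names, and the routing rule (cheapest model whose average quality clears a threshold) are all hypothetical stand-ins. SEAR's actual schema spans around one hundred columns and is not given in the abstract.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eval_signals (
    request_id INTEGER PRIMARY KEY, model TEXT, intent TEXT, quality_score REAL);
CREATE TABLE gateway_metrics (
    request_id INTEGER PRIMARY KEY REFERENCES eval_signals(request_id),
    latency_ms REAL, cost_usd REAL);
""")
# Hypothetical evaluation signals and operational metrics for five requests.
conn.executemany("INSERT INTO eval_signals VALUES (?, ?, ?, ?)", [
    (1, "model-a", "coding", 0.92), (2, "model-a", "coding", 0.88),
    (3, "model-b", "coding", 0.90), (4, "model-b", "coding", 0.86),
    (5, "model-c", "coding", 0.60),
])
conn.executemany("INSERT INTO gateway_metrics VALUES (?, ?, ?)", [
    (1, 900.0, 0.030), (2, 950.0, 0.028),
    (3, 400.0, 0.006), (4, 420.0, 0.007),
    (5, 200.0, 0.001),
])


def cheapest_adequate_model(db: sqlite3.Connection, intent: str, min_quality: float):
    """Pick the lowest-cost model whose mean quality for this intent meets the bar."""
    row = db.execute("""
        SELECT e.model
        FROM eval_signals AS e
        JOIN gateway_metrics AS g USING (request_id)
        WHERE e.intent = ?
        GROUP BY e.model
        HAVING AVG(e.quality_score) >= ?
        ORDER BY AVG(g.cost_usd) ASC
        LIMIT 1
    """, (intent, min_quality)).fetchone()
    return row[0] if row else None


choice = cheapest_adequate_model(conn, "coding", 0.85)
```

Because both quality and cost live in joinable tables, the routing rule is an ordinary SQL query that a human can read and audit, which matches the interpretability argument in the abstract.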
q-bio.NC
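[236] Grounding Social Perception in Intuitive Physics q-bio.NC | cs.AI | cs.CV
Lance Ying, Aydan Y. Huang, Aviv Netanyahu, Andrei Barbu, Boris Katz
TL;DR: This paper argues that the way people infer rich social information from others' actions is more than visual pattern matching: it is a reasoning process grounded in the integration of intuitive psychology with intuitive physics. To test this hypothesis, the authors built the PHASE dataset (procedurally generated animations of physically simulated two-agent interactions on a 2D surface) and the SIMPLE computational model (a physics-grounded Bayesian inverse planning model). Experiments show that SIMPLE infers agents' goals and relationships with high accuracy and strong agreement with human judgments, whereas feedforward baselines (including strong vision-language models) and physics-agnostic inverse planning fail to reach human-level performance.
Details
Motivation: The paper seeks a computational account of how people integrate intuitive physics and intuitive psychology when reasoning about social scenes in physical environments.
Result: On the PHASE dataset, SIMPLE achieves high accuracy in inferring agents' goals and relationships and agrees closely with human judgments, outperforming feedforward baselines, including strong vision-language models, as well as physics-agnostic inverse planning models.
Insight: The contribution is a framework in which social perception is grounded in reasoning that integrates intuitive physics and intuitive psychology, validated with the PHASE dataset and the SIMPLE model. The combination of physics simulation, probabilistic planning, and inverse inference offers a computable theoretical model for understanding social interaction in physical environments.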
Abstract: People infer rich social information from others’ actions. These inferences are often constrained by the physical world: what agents can do, what obstacles permit, and how the physical actions of agents causally change an environment and other agents’ mental states and behavior. We propose that such rich social perception is more than visual pattern matching, but rather a reasoning process grounded in an integration of intuitive psychology with intuitive physics. To test this hypothesis, we introduced PHASE (PHysically grounded Abstract Social Events), a large dataset of procedurally generated animations, depicting physically simulated two-agent interactions on a 2D surface. Each animation follows the style of the Heider and Simmel movie, with systematic variation in environment geometry, object dynamics, agent capacities, goals, and relationships (friendly/adversarial/neutral). We then present a computational model, SIMPLE, a physics-grounded Bayesian inverse planning model that integrates planning, probabilistic planning, and physics simulation to infer agents’ goals and relations from their trajectories. Our experimental results showed that SIMPLE achieved high accuracy and agreement with human judgments across diverse scenarios, while feedforward baseline models – including strong vision-language models – and physics-agnostic inverse planning failed to achieve human-level performance and did not align with human judgments. These results suggest that our model provides a computational account for how people understand physically grounded social scenes by inverting a generative model of physics and agents.
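The core idea of Bayesian inverse planning can be shown in a few lines: assume the agent picks actions that tend to bring it closer to its goal, then apply Bayes' rule to ask which goal best explains the observed actions. The sketch below is a generic toy version with a uniform prior, a Boltzmann-rational action model, and no physics simulation. It is not the SIMPLE model, and the goals, trajectory, and rationality parameter `beta` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
ACTIONS: List[Point] = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # unit moves on a 2D plane


def action_likelihood(state: Point, action: Point, goal: Point, beta: float = 2.0) -> float:
    """Boltzmann-rational policy: moves that end nearer the goal are more probable."""
    def utility(a: Point) -> float:
        nxt = (state[0] + a[0], state[1] + a[1])
        return -math.dist(nxt, goal)
    weights = [math.exp(beta * utility(a)) for a in ACTIONS]
    return math.exp(beta * utility(action)) / sum(weights)


def goal_posterior(trajectory: List[Tuple[Point, Point]],
                   goals: Dict[str, Point], beta: float = 2.0) -> Dict[str, float]:
    """P(goal | trajectory) under a uniform prior, computed in log space."""
    log_post = {name: 0.0 for name in goals}
    for state, action in trajectory:
        for name, goal in goals.items():
            log_post[name] += math.log(action_likelihood(state, action, goal, beta))
    peak = max(log_post.values())
    unnorm = {name: math.exp(v - peak) for name, v in log_post.items()}
    total = sum(unnorm.values())
    return {name: v / total for name, v in unnorm.items()}


# Hypothetical scene: the agent takes three steps to the right.
goals = {"left_object": (-5.0, 0.0), "right_object": (5.0, 0.0)}
trajectory = [((0.0, 0.0), (1, 0)), ((1.0, 0.0), (1, 0)), ((2.0, 0.0), (1, 0))]
posterior = goal_posterior(trajectory, goals)
```

A physics-grounded model would replace the distance-based utility with outcomes from a simulator, so that obstacles and object dynamics change which actions count as rational for each goal.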