cs.CL

[1] TabReX: Tabular Referenceless eXplainable Evaluation cs.CL

Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta

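TL;DR: This paper presents TabReX, a reference-less, property-driven framework for evaluating tabular generation. Using graph-based reasoning, it converts the source text and the generated table into canonical knowledge graphs, aligns them, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. On the TabReX-Bench benchmark, the framework achieves the highest correlation with expert rankings and remains stable across perturbation difficulty levels, offering a trustworthy, explainable evaluation paradigm for structured generation systems.

Details

Motivation: Existing metrics for judging tables generated by large language models either flatten tables into text and ignore structure, or rely on fixed references that limit generalization.

Result: On TabReX-Bench (six domains and twelve planner-driven perturbation types across three difficulty tiers), TabReX achieves the highest correlation with expert rankings and remains stable under harder perturbations, reaching state-of-the-art (SOTA) performance.

Insight: The contributions are a reference-less, graph-based evaluation framework that produces interpretable, rubric-aware scores through LLM-guided matching; a large-scale benchmark, TabReX-Bench, for systematically assessing metric robustness; and support for fine-grained model-vs-prompt analysis, with a controllable sensitivity-specificity trade-off and cell-level error tracing.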

Abstract: Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.


[2] Social Story Frames: Contextual Reasoning about Narrative Intent and Reception cs.CL | cs.AI | cs.LG | cs.SI

Joel Mire, Maria Antoniak, Steven R. Wilson, Zexin Ma, Achyutarama R. Ganti

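TL;DR: This paper introduces SocialStoryFrames, a formalism for distilling plausible inferences about reader response to social media stories, such as perceived author intent, explanatory reasoning, affective responses, and value judgments. The formalism is grounded in narrative theory, linguistic pragmatics, and psychology. Two models, SSF-Generator and SSF-Classifier, are developed, validated, and applied to SSF-Corpus, a dataset of 6,140 stories.

Details

Motivation: Computational models of reader response are limited and cannot support nuanced analysis of narrative intent, affect, and evaluation, so a formalism that captures contextual reasoning is needed to fill this gap.

Result: The two models were validated through a human survey (N=382 participants) and expert annotations, respectively. Pilot analyses on SSF-Corpus show the formalism's utility for studying stories at scale: it quantifies the frequency and interdependence of storytelling intents and compares the diversity of narrative practices across communities.

Insight: The contribution is linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, which gives research on storytelling in online communities a new analytical tool and enables computational reasoning about narrative intent and reception.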

Abstract: Reading stories evokes rich interpretive, affective, and evaluative responses, such as inferences about narrative intent or judgments about characters. Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. We develop two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382 participants) and expert annotations, respectively. We conduct pilot analyses to showcase the utility of the formalism for studying storytelling at scale. Specifically, applying our models to SSF-Corpus, a curated dataset of 6,140 social media stories from diverse contexts, we characterize the frequency and interdependence of storytelling intents, and we compare and contrast narrative practices (and their diversity) across communities. By linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames enable new research into storytelling in online communities.


[3] BRAID: Bounded Reasoning for Autonomous Inference and Decisions cs.CL | cs.AI

Armağan Amcalar, Eyup Cinar

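TL;DR: This paper presents BRAID (Bounded Reasoning for Autonomous Inference and Decisions), a framework that implements structured prompting through Mermaid-based instruction graphs, so that large language models reason within bounds rather than through unbounded natural-language token expansion. The study reports that these structured, machine-readable prompts substantially improve reasoning accuracy and cost efficiency across multiple GPT model tiers and benchmark datasets.

Details

Motivation: Large language models exhibit nonlinear relationships between performance, cost, and token usage; the work aims to use structured prompting to optimize inference efficiency in autonomous agent systems.

Result: Evaluated on the AdvancedIF, GSM-Hard, and SCALE MultiChallenge benchmark datasets, BRAID substantially improves reasoning accuracy and cost efficiency, which establishes it as an effective and scalable technique.

Insight: The contribution is a bounded reasoning framework built on Mermaid instruction graphs, which replaces unbounded token expansion with structured reasoning. The transferable idea is to use machine-readable prompts to balance model performance against compute cost, which suits autonomous agents in production systems.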

Abstract: Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Autonomous Inference and Decisions) across multiple GPT model tiers, evaluated on the AdvancedIF, GSM-Hard, and the SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason structurally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at https://benchmark.openserv.ai.


[4] Are We on the Right Way to Assessing LLM-as-a-Judge? cs.CL | cs.AI

Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen

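TL;DR: This paper introduces Sage, an evaluation suite that assesses the quality of LLM-as-a-Judge without relying on human annotation. Grounded in rational choice theory, it measures judge reliability along two new dimensions: local self-consistency (pairwise preference stability) and global logical consistency (preference transitivity). Experiments show that Sage's metrics are stable and correlate highly with existing supervised benchmarks, and that current top LLMs have significant reliability problems as judges, especially on difficult cases.

Details

Motivation: Existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias, undermines the assessment of reliability, and limits scalability.

Result: On a dataset of 650 questions that combines structured benchmark problems with real-world user queries, Sage's metrics are stable and correlate highly with supervised benchmarks such as LLMBar and RewardBench2. Even top models such as Gemini-2.5-Pro and GPT-5 fail to maintain consistent preferences in nearly a quarter of difficult cases.

Insight: The contribution is Sage, an annotation-free evaluation framework built on axioms of rational choice theory (local self-consistency and global logical consistency). The analysis identifies a "situational preference" phenomenon in LLM judging, finds that explicit rubrics, fine-tuning, panel-based judging, and deep reasoning help improve consistency, and questions the reliability of human annotation as a gold standard.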

Abstract: LLM-as-a-Judge has been widely adopted as an evaluation method and as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage’s reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that a finetuned LLM-as-a-Judge is a feasible way to boost performance, and that a panel-based judge as well as deep reasoning can enhance judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
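
The two consistency lenses named in this abstract can be illustrated with a minimal Python sketch. It is not Sage's implementation: `judge(a, b)` is a hypothetical callable that returns the answer it prefers, and treating a verdict that flips when the presentation order is swapped as a self-consistency failure is an assumption made here for illustration.

```python
from itertools import combinations
from typing import Callable, Sequence

# Hypothetical judge interface: given two answers, return the preferred one.
Judge = Callable[[str, str], str]

def self_inconsistency_rate(judge: Judge, answers: Sequence[str]) -> float:
    """Fraction of answer pairs whose verdict changes when the order is swapped."""
    pairs = list(combinations(answers, 2))
    flips = sum(judge(a, b) != judge(b, a) for a, b in pairs)
    return flips / len(pairs)

def intransitive_triples(judge: Judge, answers: Sequence[str]) -> int:
    """Count triples whose pairwise preferences form a cycle (x > y, y > z, z > x)."""
    def prefers(x: str, y: str) -> bool:
        return judge(x, y) == x

    count = 0
    for a, b, c in combinations(answers, 3):
        clockwise = prefers(a, b) and prefers(b, c) and prefers(c, a)
        counter = prefers(b, a) and prefers(c, b) and prefers(a, c)
        if clockwise or counter:
            count += 1
    return count

# Toy judges for illustration only.
def longest(x: str, y: str) -> str:
    """Prefers the longer answer: order-independent and transitive."""
    return max(x, y, key=len)

def first_position(x: str, y: str) -> str:
    """Always prefers whichever answer is shown first: a pure position bias."""
    return x
```

A judge that always prefers the first-listed answer flips on every pair, while a judge with a fixed ranking (such as `longest`) never flips and produces no cycles.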


[5] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation cs.CL | cs.AI

Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim

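TL;DR: This paper proposes MRG-R1, a semantic-driven reinforcement learning method for medical report generation. It optimizes a report-level reward, a margin-based cosine similarity, to improve the clinical correctness of generated reports rather than merely imitating linguistic style.

Details

Motivation: Existing medical report generation methods are usually trained on token-level objectives that focus on word choice and sentence structure and cannot guarantee clinical correctness. This paper addresses the problem by using reinforcement learning to optimize clinical-label agreement directly.

Result: On the IU X-Ray and MIMIC-CXR datasets, MRG-R1 reaches SOTA on clinical efficacy metrics, with CE-F1 scores of 51.88 and 40.39, respectively.

Insight: The contribution is the use of a report-level semantic reward (MCCS) and a lightweight reasoning format constraint to guide the model toward clinically correct, structured reports. This proves more effective than conventional token-level supervision and suggests a new direction for training medical large vision-language models.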

Abstract: Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured “thinking report” outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward, rather than token overlap, meaningfully improves clinical correctness. This work is an early exploration of semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.


[6] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures cs.CL

Yehor Tereshchenko, Mika Hämäläinen, Svitlana Myroniuk

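TL;DR: This study systematically evaluates OpenAI GPT models on translation between Finnish and four endangered Uralic languages (Komi-Zyrian, Moksha, Erzya, and Udmurt), with a focus on comparing reasoning and non-reasoning architectures. By analyzing refusal rates on a parallel corpus of literary texts, it finds that reasoning models are markedly more willing to attempt translation than non-reasoning models.

Details

Motivation: Evaluation of LLM translation has concentrated on high-resource languages, leaving a significant gap in understanding performance on low-resource and endangered languages. This study aims to fill that gap by examining how GPT models with different architectures perform on endangered Uralic language translation.

Result: On the parallel corpus of literary texts, reasoning models show refusal rates 16 percentage points lower than non-reasoning models, which indicates a greater willingness to attempt translation into and out of low-resource languages.

Insight: The paper offers a systematic comparison of reasoning and non-reasoning LLM architectures on endangered language translation, shows a potential advantage of reasoning models on low-resource tasks, and provides empirical evidence for technical approaches to endangered language preservation. Using refusal rate as a key metric also offers a new angle on the practical usability of LLMs in open-domain, low-resource tasks.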

Abstract: The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI’s GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.


[7] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe cs.CL

Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo

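TL;DR: This paper presents JustRL, a minimal approach to reinforcement learning for large language models. With single-stage training and fixed hyperparameters, it achieves SOTA performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than more sophisticated approaches.

Details

Motivation: Reinforcement learning pipelines have grown increasingly complex (multi-stage training, dynamic hyperparameter schedules, and so on). The study asks whether this complexity is necessary and provides a simple, effective baseline.

Result: Across nine mathematical reasoning benchmarks, the two 1.5B models reach 54.9% and 64.3% average accuracy, achieving SOTA performance. The same fixed hyperparameters work for both models, and training is smooth and stable, without collapses or plateaus.

Insight: The contribution is showing that a minimal, single-stage RL recipe with fixed hyperparameters can outperform complex methods at lower compute cost. The ablations indicate that some "standard tricks", such as explicit length penalties, may hurt performance by restricting exploration, which suggests that the field may be adding complexity to fix baseline instabilities that need not exist.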

Abstract: Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding “standard tricks” like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.


[8] From Facts to Conclusions: Integrating Deductive Reasoning in Retrieval-Augmented LLMs cs.CL | cs.AI | cs.CY | cs.IR

Shubham Mishra, Samyek Jain, Gorang Mehrishi, Shiv Tiwari, Harsh Sharma

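TL;DR: This paper proposes a reasoning-trace-augmented retrieval-augmented generation (RAG) framework that gives LLMs structured, interpretable reasoning through three stages: document-level adjudication, conflict analysis, and grounded synthesis. It targets cases where retrieved information is conflicting, outdated, or subjective. The paper also introduces a Conflict-Aware Trust-Score evaluation pipeline and validates the approach on a reasoning dataset of 539 queries.

Details

Motivation: Existing RAG methods tend to fail when retrieved evidence conflicts, is outdated, or contains subjective information, and they lack a unified mechanism for reasoning supervision.

Result: The method substantially outperforms baselines. Most notably with Qwen, supervised fine-tuning raises end-to-end answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.

Insight: The contribution is structuring interpretable deductive reasoning as a three-stage process and introducing a conflict-aware evaluation pipeline, which together provide a dataset and evaluation foundation for RAG systems that handle conflicts and explain their answers.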

Abstract: Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work addresses these issues independently but lacks unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages: (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. We introduce a Conflict-Aware Trust-Score (CATS) pipeline that evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.


[9] Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology cs.CL

Primož Kocbek, Azra Frkatović-Hodžić, Dora Lalić, Vivian Hui, Gordan Lauc

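TL;DR: This paper studies the trade-offs among visual augmentation strategies in multi-modal retrieval-augmented generation (MM-RAG) for the biomedical domain, specifically in glycobiology, a visually dense field. Using a benchmark of 120 multiple-choice questions, it evaluates four augmentation strategies (no augmentation, Text RAG, Multi-modal conversion, and OCR-free visual retrieval) on models of different sizes (e.g., Gemma-3-27B-IT, GPT-4o, and the GPT-5 family). The choice of strategy depends on model capacity: for mid-size models, converting visual content to text is more reliable, whereas for frontier models OCR-free visual retrieval becomes competitive.

Details

Motivation: In biomedical MM-RAG, and especially in visually dense glycobiology, it is unclear when figures and tables should be converted to text and when OCR-free visual retrieval should be used. The paper addresses this trade-off.

Result: With Gemma-3-27B-IT, Text and Multi-modal conversion augmentation (0.722-0.740 average accuracy) clearly outperform OCR-free visual retrieval (0.510). With GPT-4o, Multi-modal conversion reaches 0.808, Text 0.782, and ColPali visual retrieval 0.745, with small within-model differences. With the GPT-5 family, the best results (using ColPali and ColFlor) rise to about 0.828, the visual retrievers (ColPali/ColQwen/ColFlor) are statistically indistinguishable, and GPT-5-nano trails the larger models by roughly 8-10%.

Insight: The paper systematically compares visual augmentation strategies on a biomedical QA task and shows that the best strategy depends on model capacity: mid-size models benefit from text conversion, which lowers the interpretation burden, while frontier models can make effective use of OCR-free visual retrieval. In addition, the lightweight retriever ColFlor matches heavier options at lower computational cost, which makes it an efficient default.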

Abstract: Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval with ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
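
The Agresti-Coull interval mentioned in this abstract has a standard closed form, sketched below in Python. How the authors pooled the 5 runs into one count is not stated in the abstract; the example call assumes, purely for illustration, that all 120 × 5 = 600 answers are pooled.

```python
import math

def agresti_coull(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Agresti-Coull confidence interval for a binomial proportion.

    z is the standard normal quantile (about 1.96 for a 95% interval).
    """
    n_adj = trials + z ** 2                       # adjusted sample size
    p_adj = (successes + z ** 2 / 2) / n_adj      # adjusted proportion
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

# Illustrative only: 485 correct out of 600 pooled answers (accuracy about 0.808).
low, high = agresti_coull(485, 600)
```

The interval is centered on the adjusted proportion rather than the raw accuracy, which keeps it well behaved for small samples and for accuracies near 0 or 1.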


[10] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels cs.CL

Aditya Yadavalli, Tiago Pimentel, Tamar I Regev, Ethan Wilcox, Alex Warstadt

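TL;DR: This paper proposes an information-theoretic approach that uses large speech and language models to quantify how much information prosody and text each convey in speech, and what that information is about. The study focuses on three dimensions of meaning (sarcasm, emotion, and questionhood) and analyzes speech from television and podcasts.

Details

Motivation: Prosody, the melody of speech, often conveys critical information that the text does not capture, but prior work lacks a systematic way to quantify the separate contributions of prosody and text. The paper addresses how to measure precisely how much information each communication channel (such as audio or text) carries about a given dimension of meaning (such as emotion or sarcasm).

Result: For sarcasm and emotion, the audio channel (and by implication the prosodic channel) conveys over an order of magnitude more information than the text channel alone, particularly when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively little additional information. The experiments use television and podcast speech; no comparison with existing SOTA models is reported.

Insight: The contribution is a general framework, based on mutual information estimation, for quantifying what each communication channel contributes to meaning, which offers a new tool for multimodal information analysis. The approach systematically shows the dominant role of prosody in conveying emotion and sarcasm and can inform the development of speech processing models.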

Abstract: Prosody – the melody of speech – conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance’s meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel – and by implication the prosodic channel – transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
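
One common way to estimate mutual information with a predictive model is to subtract the model's cross-entropy from the label entropy, since I(Y; X) = H(Y) - H(Y | X). The Python sketch below illustrates that general idea under stated assumptions; it is not taken from the paper, and the function name and inputs are hypothetical. `probs_true_label` holds the probability a predictor with access to one channel (audio or text) assigned to the correct label of each held-out utterance.

```python
import math
from collections import Counter
from typing import Sequence

def estimate_mi_bits(labels: Sequence[str], probs_true_label: Sequence[float]) -> float:
    """Estimate I(Y; X) in bits as H(Y) minus the predictor's cross-entropy.

    H(Y) is the plug-in entropy of the label distribution. The cross-entropy of
    a predictor that sees channel X stands in for H(Y | X); because cross-entropy
    is at least the true conditional entropy, the estimate tends to be conservative.
    """
    n = len(labels)
    counts = Counter(labels)
    h_y = -sum((c / n) * math.log2(c / n) for c in counts.values())
    h_y_given_x = -sum(math.log2(p) for p in probs_true_label) / n
    return h_y - h_y_given_x
```

Running the same estimator once with an audio-based predictor and once with a text-only predictor yields two numbers whose comparison indicates which channel carries more information about the label.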


[11] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning cs.CL

Tzu-Han Lin, Wei-Lin Chen, Chen-An Li, Hung-yi Lee, Yun-Nung Chen

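TL;DR: This paper proposes AdaSearch, a two-stage reinforcement learning framework that lets large language model (LLM) search agents adaptively balance internal parametric knowledge against external search calls. The agent searches only when necessary, which reduces unnecessary cost and risk and makes its decisions more transparent.

Details

Motivation: Existing RL-trained LLM search agents either over-rely on search (which is costly and exposes them to noisy or malicious content) or rely solely on parametric knowledge (which invites hallucination). Prior methods curb search overuse by penalizing the number of tool calls, but this requires substantial reward engineering, gives ambiguous credit assignment, cannot distinguish necessary from unnecessary searches, and offers little decision transparency.

Result: In experiments across multiple model families and sizes, AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and yields more transparent, interpretable decision behavior.

Insight: The contribution is a two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, making that decision explicit and interpretable. An F1-based decision metric quantifies the self-knowledge awareness of existing search agents and shows that they overlook available parametric knowledge, which motivates the design of AdaSearch.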

Abstract: Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
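
An F1-based decision metric of the kind mentioned in this abstract can be sketched as follows. The abstract does not define the metric, so the labeling here is an assumption: the positive class is "search was actually needed" (the model could not answer from parametric knowledge alone), and the prediction is whether the agent invoked search. Both input lists are hypothetical.

```python
from typing import Sequence

def search_decision_f1(needs_search: Sequence[bool], did_search: Sequence[bool]) -> float:
    """F1 of the agent's search decisions, with 'search needed' as the positive class."""
    pairs = list(zip(needs_search, did_search))
    tp = sum(1 for need, did in pairs if need and did)        # searched when needed
    fp = sum(1 for need, did in pairs if not need and did)    # searched unnecessarily
    fn = sum(1 for need, did in pairs if need and not did)    # skipped a needed search
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

An agent that searches on every question gets perfect recall but low precision, so a metric like this penalizes it in a way that raw call counts cannot.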


[12] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image cs.CL | cs.CV

Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer

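TL;DR: This paper introduces Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for evaluating reward models for omni models that handle interleaved image and text sequences. The benchmark covers four tasks (text-to-image, image editing, interleaved generation, and multimodal reasoning) and contains expert-annotated preference pairs drawn from 23 models and agents. Using it to evaluate existing judges, the study finds that advanced models such as Gemini 3 Pro reach about 75-80% accuracy, still well below human level (>90%), and shows a strong correlation between benchmark performance and downstream task success.

Details

Motivation: Reward models are essential for training large language models, but reward models for omni models that handle interleaved image and text sequences remain underexplored and lack a comprehensive evaluation benchmark.

Result: On MMRB2, Gemini 3 Pro reaches 75-80% accuracy, and GPT-5 and Gemini 2.5 Pro reach 66-75%, ahead of the widely used GPT-4o (59%). The best open-source model, Qwen3-VL-32B, reaches 64%, comparable to Gemini 2.5 Flash. All remain far below human experts (>90%). Benchmark performance correlates strongly with downstream task success under Best-of-N sampling.

Insight: The contribution is MMRB2, the first comprehensive multimodal benchmark for omni reward models. Its design combines practical but challenging prompts, responses from SOTA models, and preference pairs with strong human-expert consensus selected through an ensemble filtering strategy. The benchmark exposes a significant gap between current reward models and human judgment and points to key areas for improvement.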

Abstract: Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning (“thinking-with-images”), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
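
The two quantities this abstract relates, preference-pair accuracy and Best-of-N sampling, can be sketched generically in Python. The `generate` and `reward` callables are hypothetical stand-ins for a generator and a reward model; nothing here reproduces MMRB2's actual evaluation code.

```python
from typing import Callable, Iterable, List, Tuple

Reward = Callable[[str, str], float]  # hypothetical: (prompt, response) -> score

def pairwise_accuracy(reward: Reward, pairs: Iterable[Tuple[str, str, str]]) -> float:
    """Share of (prompt, chosen, rejected) pairs where the reward model ranks
    the human-preferred response above the rejected one."""
    pairs = list(pairs)
    correct = sum(reward(p, chosen) > reward(p, rejected) for p, chosen, rejected in pairs)
    return correct / len(pairs)

def best_of_n(prompt: str, generate: Callable[[str], str], reward: Reward, n: int) -> str:
    """Sample n candidate responses and keep the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```

A reward model with higher pairwise accuracy should, in principle, pick better candidates in `best_of_n`, which is the link between benchmark score and downstream success that the abstract reports.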


[13] In-Context Algebra cs.CL | cs.LG

Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau

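TL;DR: This paper studies how transformers solve algebra tasks in which the meaning of variable symbols changes from sequence to sequence. The models learn symbolic reasoning mechanisms, namely commutative copying, identity element recognition, and closure-based cancellation, and generalize to unseen algebraic groups.

Details

Motivation: The paper asks how transformers develop effective reasoning mechanisms for arithmetic over sequences of variables whose meanings are not fixed, complementing prior work on geometric embeddings in fixed-symbol settings.

Result: The models reach near-perfect accuracy on the task and generalize to unseen algebraic groups. Causal tests confirm three mechanisms that models consistently learn.

Insight: The contribution is showing that transformers develop symbolic reasoning mechanisms, rather than relying only on geometric representations, when symbol meanings vary dynamically. This offers a new perspective on what models can do when reasoning in context.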

Abstract: We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
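
The task setup described in this abstract, where the mapping from symbols to group elements is redrawn for every sequence, can be illustrated with a small Python generator. This is a sketch under assumptions: it uses only the cyclic group Z_n, and the function name, vocabulary, and fact format are hypothetical, not the paper's data pipeline.

```python
import random
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (x, y, z) meaning "x * y = z"

def sample_sequence(n: int, num_facts: int, vocab: List[str],
                    rng: random.Random) -> Tuple[List[str], List[Fact]]:
    """Sample facts over the cyclic group Z_n under a fresh symbol assignment.

    symbols[i] is the token standing for group element i in this sequence only,
    so a token's meaning can be recovered solely from how it interacts with others.
    """
    symbols = rng.sample(vocab, n)  # new random assignment for each sequence
    facts: List[Fact] = []
    for _ in range(num_facts):
        i, j = rng.randrange(n), rng.randrange(n)
        facts.append((symbols[i], symbols[j], symbols[(i + j) % n]))
    return symbols, facts

# Two sequences over the same group use unrelated symbol assignments.
rng = random.Random(0)
symbols_a, facts_a = sample_sequence(5, 8, list("abcdefgh"), rng)
symbols_b, facts_b = sample_sequence(5, 8, list("abcdefgh"), rng)
```

In such data, the token playing the identity role differs between sequences, so a model must identify it from context (for example, from facts of the form "x * e = x") rather than from a fixed embedding.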


[14] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates cs.CL

Nikhil Prakash, Donghao Ren, Dominik Moritz, Yannick Assogba

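TL;DR: This paper proposes Constructive Circuit Amplification, a method that identifies pivotal tokens in model reasoning traces and the model components responsible for a target task, then updates only that sparse sub-network. It substantially improves large language model performance on mathematical reasoning while minimizing the impact on other abilities.

Details

Motivation: Prior studies have found sparse sub-networks (circuits) inside LLMs that are responsible for specific tasks, and have shown that fine-tuning often improves performance by strengthening existing circuits. Together, these findings suggest the possibility of making precise, targeted updates directly to task-specific circuits.

Result: On mathematical reasoning, the method improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA.

Insight: The contribution is a circuit-identification-based paradigm for targeted model updates: selectively strengthening a sparse sub-network reliably improves a specific capability, which suggests a new route to efficient and precise model editing.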

Abstract: Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.


cs.CV

[15] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models cs.CV | cs.AI | cs.CL | cs.MM

Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi

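TL;DR: This paper proposes JARVIS, a framework that integrates the I-JEPA self-supervised learning paradigm to strengthen the visual understanding of multimodal large language models (MLLMs). It lets the model learn structural and semantic regularities from images without depending on text supervision, which improves performance on vision-centric tasks.

Details

Motivation: Existing MLLMs learn visual understanding mainly from textual descriptions, a supervisory signal that is subjective and incomplete. Multimodal instruction tuning is also modest in scale, so models over-rely on language priors and overlook visual details, which limits their fundamental visual reasoning.

Result: Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families without degrading multimodal reasoning abilities.

Insight: The contribution is integrating self-supervised visual learning (I-JEPA) into the MLLM training pipeline: frozen vision foundation models serve as encoders, and the early layers of the LLM are trained as the predictor. This reduces dependence on language supervision and strengthens the model's ability to learn visual structure.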

Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.


[16] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs cs.CV

Dwip Dalal, Utkarsh Mishra, Narendra Ahuja, Nebojsa Jojic

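TL;DR: This paper introduces the task of Sparsely Grounded Visual Navigation and builds the CityNav benchmark to evaluate how well multimodal large language models (MLLMs) navigate real urban environments through sequential decisions based on visual input and internal knowledge. Existing SOTA MLLMs and standard reasoning techniques perform poorly on the task, so the paper proposes Verbalization of Path (VoP), which explicitly elicits a cognitive map and substantially improves navigation success.

Details

Motivation: Current benchmarks for MLLM-based embodied agents are predominantly language-centric or rely heavily on simulated environments, and rarely test complex, knowledge-intensive reasoning in the real world. A task is therefore needed that evaluates MLLM sequential decision-making in real, knowledge-intensive environments.

Result: On CityNav, which covers four global cities, current state-of-the-art MLLMs and standard reasoning techniques significantly underperform, whereas the proposed VoP method, by explicitly eliciting a cognitive map, substantially improves navigation success.

Insight: The contributions are the new Sparsely Grounded Visual Navigation task and the corresponding CityNav benchmark, which evaluate autonomous localization, spatial reasoning, and route planning from vision and internal knowledge alone, without additional environmental annotations or architectural modifications. The proposed VoP method makes internal reasoning explicit as a cognitive map of key landmarks and directions, which improves navigation in complex real-world environments.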

Abstract: Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent’s internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/


[17] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space cs.CV | cs.RO

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

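TL;DR: R4 is a training-free retrieval-augmented reasoning framework that gives vision-language models (VLMs) structured, lifelong memory so they can reason in 4D spatio-temporal space. It continuously builds a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are then integrated into the VLM's reasoning.

Details

Motivation: Inspired by the human ability to perceive and reason about surroundings in four dimensions, the paper addresses the lack of persistent, structured internal representations in VLMs operating in dynamic environments, so that they can retrieve and reason over spatio-temporal information.

Result: On embodied question answering and navigation benchmarks, R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.

Insight: The key idea is that retrieval operates directly in 4D space, which enables training-free episodic and collaborative reasoning; a shareable, persistent world model strengthens the VLM's spatio-temporal understanding.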

Abstract: Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM’s reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
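
The retrieval step described in this abstract can be illustrated with a minimal sketch. Everything named here (`Observation`, `retrieve`, the toy two-dimensional embeddings, the radius and time-window filters) is a hypothetical stand-in chosen for illustration; the abstract does not specify R4's data structures or scoring, so this only shows the general idea of combining semantic, spatial, and temporal keys.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    description: str   # object-level semantic description
    embedding: list    # toy semantic vector (assumed; a real system would use a learned encoder)
    position: tuple    # metric position (x, y, z)
    timestamp: float   # observation time in seconds


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(memory, query_embedding, near=None, radius=None, time_window=None, top_k=3):
    """Filter by spatial and temporal keys, then rank by semantic similarity."""
    scored = []
    for obs in memory:
        # Spatial key: keep observations within `radius` of the point `near`.
        if near is not None and radius is not None:
            if math.dist(obs.position, near) > radius:
                continue
        # Temporal key: keep observations inside the closed time window.
        if time_window is not None:
            start, end = time_window
            if not (start <= obs.timestamp <= end):
                continue
        # Semantic key: score the survivors against the query embedding.
        scored.append((cosine(obs.embedding, query_embedding), obs))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [obs for _, obs in scored[:top_k]]
```

In this sketch the spatial and temporal keys act as hard filters and the semantic key as a ranking score; a soft, weighted combination of all three would be an equally plausible design.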


[18] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs cs.CV

Tejas Anvekar, Fenil Bardoliya, Pavan K. Turaga, Chitta Baral, Vivek Gupta

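TL;DR: The paper proposes The Perceptual Observatory, a framework for systematically evaluating the perceptual capabilities of multimodal large language models (MLLMs) along dimensions such as simple vision tasks and local-to-global understanding. By introducing controlled perturbations, namely pixel-level augmentations and diffusion-based illusions, it analyzes how robustly models preserve visual grounding and relational structure, going beyond traditional evaluation that looks only at end-task accuracy.

Details

Motivation: Current MLLMs largely reuse the same vision encoders and mainly scale the language component, which raises concerns about whether progress reflects genuine visual grounding or reliance on large-scale textual knowledge. Existing evaluation methods overemphasize end-task accuracy and overlook robustness, attribution fidelity, and reasoning under controlled perturbations.

Result: The abstract reports no specific quantitative results or benchmark rankings. By building ground-truth datasets of faces and words and perturbing them systematically, the work provides a principled foundation for analyzing the strengths and weaknesses of current and future models.

Insight: The contribution is a systematic evaluation framework that goes beyond leaderboard accuracy. It defines multiple verticals (such as simple vision tasks and local-to-global understanding) and applies controlled perturbations (pixel augmentations, diffusion-based illusions) to analyze the perceptual robustness and visual grounding of MLLMs in depth, giving a more complete view of model capability.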

Abstract: Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as (i) simple vision tasks, including face matching and text-in-vision comprehension capabilities; and (ii) local-to-global understanding, encompassing image matching, the grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.


[19] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models cs.CV | cs.AI

Utsav Panchal, Yuchen Liu, Luigi Palmieri, Ilche Georgievski, Marco Aiello

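TL;DR: The paper proposes CAMP-VLM, a context-aware multi-human behavior prediction framework based on a vision language model. It combines contextual features from visual input with spatial awareness from scene graphs to predict human behavior from a third-person view. Because no suitable dataset exists, the authors fine-tune on synthetic data generated by a photorealistic simulator and evaluate generalization on both synthetic and real-world sequences.

Details

Motivation: Accurately predicting human behavior is crucial for mobile robots operating in human-populated environments. Prior work focuses mainly on single-human scenarios from an egocentric view, whereas many robotic applications require understanding multi-human behavior from a third-person perspective.

Result: With supervised fine-tuning (SFT) and direct preference optimization (DPO), CAMP-VLM outperforms the best baseline by up to 66.9% in prediction accuracy, and it is evaluated on both synthetic and real-world sequences to assess generalization.

Insight: The contributions are the use of a VLM with contextual and spatial awareness for multi-human behavior prediction, and fine-tuning on synthetic data to address the lack of datasets. Viewed objectively, the study applies VLMs to a new task domain and demonstrates their effectiveness from a third-person view.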

Abstract: Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.


[20] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection cs.CV

Manuel Nkegoum, Minh-Tan Pham, Élisa Fromont, Bruno Avignon, Sébastien Lefèvre

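TL;DR: This paper explores vision-language models (VLMs) for few-shot multispectral object detection. It adapts the VLM-based detectors Grounding DINO and YOLO-World to multispectral inputs and proposes an effective mechanism for integrating text, visual, and thermal modalities. Experiments on the FLIR and M3FD benchmarks show that the approach performs well in both few-shot and fully supervised settings and significantly outperforms specialized multispectral models.

Details

Motivation: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, but scarce annotated data limits the training of deep detectors. The paper uses the textual class information available to VLMs as a source of semantic supervision to overcome the data limits of few-shot multispectral detection.

Result: On the FLIR and M3FD multispectral image benchmarks, VLM-based detectors significantly outperform specialized multispectral models in few-shot scenarios and achieve competitive or superior results under fully supervised settings.

Insight: The contribution is adapting VLM-based detectors to multispectral inputs and integrating multimodal information. The results show that semantic priors learned by large-scale VLMs transfer effectively to unseen spectral modalities, opening a path toward data-efficient multispectral perception.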

Abstract: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.


[21] Are vision-language models ready to zero-shot replace supervised classification models in agriculture? cs.CV

Earl Ranario, Mason J. Earles

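TL;DR: This paper evaluates the zero-shot performance of a range of open-source and closed-source vision-language models (VLMs) on 27 agricultural classification datasets and finds that all of them fall well behind a supervised task-specific baseline (YOLO11). The best VLM (Gemini-3 Pro) reaches about 62% average accuracy under multiple-choice prompting, and open-ended prompting performs worse. The study concludes that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can serve as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.

Details

Motivation: VLMs are often proposed as general-purpose solutions for visual recognition tasks, but their reliability for agricultural decision support is unclear. The paper uses systematic benchmarking to evaluate zero-shot VLM performance on agricultural classification and to determine whether VLMs are ready to replace supervised classification models.

Result: On 27 agricultural classification datasets from the AgML collection (162 classes), zero-shot VLMs substantially underperform the supervised YOLO11 baseline on all tasks. The best VLM (Gemini-3 Pro) reaches about 62% average accuracy under multiple-choice prompting, while raw accuracy under open-ended prompting is typically below 25%. LLM-based semantic judging raises open-ended accuracy (for example, from 21% to 30% for top models). Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing the top proprietary systems. Task-level analysis shows that plant and weed species classification is easier than pest and damage identification.

Insight: The contribution is a large-scale, systematic benchmark of zero-shot VLM performance in agriculture, which reveals a substantial gap to supervised models. The core insight is that evaluation methodology (such as prompting strategy and semantic judging) strongly affects the conclusions. The paper also proposes a practical path in which VLMs act as assistive components rather than standalone systems, a useful reference for deploying models in domain-specific applications.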

Abstract: Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.


[22] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings cs.CV

Lars Beckers, Arno Waes, Aaron Van Campenhout, Toon Goedemé

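TL;DR: The paper proposes a robotic mowing framework based on deep visual embeddings that mows selectively according to the visual diversity of vegetation, actively enhancing garden biodiversity. The system extracts ecological embeddings with a pretrained ResNet50 model, estimates biodiversity without species identification, and is validated on a mock-up lawn and in real environments.

Details

Motivation: Conventional mowing produces monocultural lawns of low ecological value. The paper aims to actively increase garden biodiversity through robotic vision and adaptive decision-making, addressing urban ecological degradation.

Result: Experiments on a controlled mock-up lawn and on real garden datasets show a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming that deep visual diversity is a feasible proxy for ecological richness and that the proposed mowing decision approach is effective.

Insight: The claimed contributions are estimating biodiversity through deep feature-space analysis without species-level supervision, and a dynamic selective mowing algorithm. Viewed objectively, the paper combines computer vision with robotics to offer a novel automated approach to ecological conservation.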

Abstract: This paper presents a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making. Unlike passive rewilding approaches, the proposed system uses deep feature-space analysis to identify and preserve visually diverse vegetation patches in camera images by selectively deactivating the mower blades. A ResNet50 network pretrained on PlantNet300K provides ecologically meaningful embeddings, from which a global deviation metric estimates biodiversity without species-level supervision. These estimates drive a selective mowing algorithm that dynamically alternates between mowing and conservation behavior. The system was implemented on a modified commercial robotic mower and validated both in a controlled mock-up lawn and on real garden datasets. Results demonstrate a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming the feasibility of deep visual diversity as a proxy for ecological richness and the effectiveness of the proposed mowing decision approach. Widespread adoption of such systems will turn ecologically worthless, monocultural lawns into vibrant, valuable biotopes that boost urban biodiversity.
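
As a rough illustration of how embedding-space dispersion could drive a mow-or-preserve decision, the sketch below uses the mean Euclidean distance to the centroid as the dispersion measure. This particular measure, the function names, and the threshold are assumptions made for the example; the abstract does not define the paper's global deviation metric or its decision rule.

```python
import math


def embedding_dispersion(embeddings):
    """Mean Euclidean distance of each embedding to the centroid (assumed measure)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    return sum(math.dist(e, centroid) for e in embeddings) / n


def should_mow(patch_embeddings, threshold):
    """Mow visually uniform patches; preserve patches whose dispersion reaches the threshold."""
    return embedding_dispersion(patch_embeddings) < threshold
```

In a real system the embeddings would come from the pretrained network applied to image patches, and the threshold would need tuning against expert biodiversity assessments.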


[23] CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion cs.CV

Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi

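TL;DR: This paper proposes CoVAR, which uses a multi-modal diffusion model to generate video-action pairs that follow text instructions, starting from an initial image and the robot's joint states. It automatically provides action labels for video diffusion models, addressing the scarcity of action annotations in robotic policy learning.

Details

Motivation: Existing methods either use two-stage pipelines, which limit cross-modal information sharing, or adapt a single-modal diffusion model, which cannot fully exploit pretrained video knowledge. The paper aims to overcome these limits and co-generate video and action to support robot learning.

Result: Extensive evaluations on multiple public benchmarks and real-world datasets show that the method generates higher-quality videos and more accurate actions, significantly outperforming existing baselines.

Insight: The contributions are (1) extending a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) a Bridge Attention mechanism for effective cross-modal interaction, and (3) an action refinement module that converts coarse actions into precise controls for low-resolution datasets. Together they provide a scalable framework for using large-scale video data in robot learning.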

Abstract: We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot’s joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos, more accurate actions, and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.


[24] Auto-Vocabulary 3D Object Detection cs.CV

Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh

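TL;DR: This paper proposes the task of Auto-Vocabulary 3D Object Detection (AV3DOD), in which class names for detected 3D objects are generated automatically without user input. The authors introduce the Semantic Score (SS) to evaluate the quality of generated class names and develop a new framework that uses 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. The method achieves state-of-the-art (SOTA) localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets.

Details

Motivation: Existing open-vocabulary 3D object detection methods still rely on user-specified classes at both training and inference. The paper studies fully automatic 3D object detection, in which the classes of detected objects are generated without any human input.

Result: On ScanNetV2, AV3DOD surpasses the current SOTA method, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in Semantic Score (SS); it also reaches SOTA performance on SUNRGB-D.

Insight: The contributions are defining the AV3DOD task for the first time and introducing the Semantic Score (SS) as an evaluation metric. In terms of framework design, multi-stage semantic generation with 2D VLMs (such as image captioning and feature-space expansion) improves the accuracy and richness of automatically generated classes, offering a new direction for 3D scene understanding.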

Abstract: Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves the state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.


[25] LAPX: Lightweight Hourglass Network with Global Context cs.CV | cs.AI

Haopeng Zhao, Marsha Mariya Kappan, Mahdi Bamdad, Francisco Cruz

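TL;DR: This paper proposes LAPX, a lightweight human pose estimation network based on the Hourglass architecture. It adds a self-attention module to capture global context and refines the stage design and the lightweight attention modules, achieving real-time performance and competitive accuracy with only 2.3M parameters.

Details

Motivation: Existing SOTA pose estimation models have many parameters and high computational cost, while lightweight variants are either inefficient to deploy on edge devices or limited in accuracy because of overly simplified designs.

Result: LAPX achieves competitive results on the MPII and COCO benchmark datasets with only 2.3M parameters and demonstrates real-time performance, confirming its suitability for edge devices.

Insight: The contributions are integrating self-attention into the Hourglass network to strengthen global context modeling, and optimizing the stage structure and lightweight attention modules for edge devices, balancing accuracy and efficiency.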

Abstract: Human pose estimation is a crucial task in computer vision. Methods that achieve SOTA (state-of-the-art) accuracy often involve a large number of parameters and incur substantial computational cost. Many lightweight variants have been proposed to reduce their model size and computational cost. However, several of these methods still contain components that are not well suited for efficient deployment on edge devices. Moreover, models that primarily emphasize inference speed on edge devices often suffer from limited accuracy due to their overly simplified designs. To address these limitations, we propose LAPX, an Hourglass network with self-attention that captures global contextual information, building on previous work, LAP. In addition to adopting the self-attention module, LAPX advances the stage design and refines the lightweight attention modules. It achieves competitive results on two benchmark datasets, MPII and COCO, with only 2.3M parameters, and demonstrates real-time performance, confirming its edge-device suitability.


[26] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times cs.CV | cs.AI | cs.LG

Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica

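TL;DR: TurboDiffusion is a video generation acceleration framework that uses attention acceleration, step distillation, and quantization to speed up end-to-end diffusion generation by 100-200x while maintaining video quality.

Details

Motivation: Video diffusion models are slow and computationally expensive; the work aims for efficient, high-quality video generation.

Result: Experiments on Wan2-series models show that TurboDiffusion achieves a 100-200x speedup even on a single RTX 5090 GPU, with video quality comparable to the original models.

Insight: The contributions are low-bit SageAttention, trainable Sparse-Linear Attention (SLA), efficient step distillation with rCM, and W8A8 quantization. Combined with engineering optimizations, they form a system-level solution for efficient deployment of video diffusion models.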

Abstract: We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
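
To make the W8A8 idea concrete, here is a minimal sketch of symmetric 8-bit quantization applied to both weights and activations of a single dot product. It assumes per-tensor scaling and uses hypothetical helper names (`quantize_int8`, `w8a8_dot`); the abstract does not state TurboDiffusion's quantization granularity or kernel design, so this is not a description of its implementation.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127] (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def w8a8_dot(weights, activations):
    """Dot product with 8-bit weights and 8-bit activations, dequantized at the end."""
    q_w, scale_w = quantize_int8(weights)
    q_a, scale_a = quantize_int8(activations)
    # Accumulate in integers, as an integer matrix-multiply kernel would.
    accumulator = sum(w * a for w, a in zip(q_w, q_a))
    # One floating-point rescale recovers an approximation of the real-valued result.
    return accumulator * scale_w * scale_a
```

The speed benefit on real hardware comes from running the accumulation on integer tensor units; this pure-Python version only shows the arithmetic and the small approximation error it introduces.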


[27] Flexible Camera Calibration using a Collimator System cs.CV

Shunkun Liang, Banglei Guan, Zhenbao Yu, Dongcai Tan, Pengju Sun

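TL;DR: This paper proposes a new camera calibration method based on a collimator system. Exploiting the unique optical geometry of the collimator system, it introduces an angle invariance constraint that reduces the relative motion between calibration target and camera to a pure rotation. On this basis it proposes a closed-form linear solver for multiple images, a minimal solver for two images, and a single-image calibration algorithm that requires no camera motion, enabling flexible and fast calibration.

Details

Motivation: Camera calibration is a key step in photogrammetry and 3D vision applications, and existing methods fall short in flexibility and convenience. The paper uses a designed collimator system to provide a reliable and controllable calibration environment, removing the dependence of traditional methods on complex camera motion or specific calibration scenes.

Result: The method is evaluated in synthetic and real-world experiments, which show that calibration with the collimator system is feasible and that the method outperforms existing baseline methods in accuracy and robustness.

Insight: The main contribution is the angle invariance constraint introduced through the collimator system, which reduces the 6DOF relative motion to a 3DOF pure rotation and thereby simplifies the calibration model. This suggests a broader idea: designing a specific optical system to impose geometric constraints can simplify complex vision problems. In particular, the single-image calibration algorithm offers a novel solution for fast, motion-free calibration.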

Abstract: Camera calibration is a crucial step in photogrammetry and 3D vision applications. This paper introduces a novel camera calibration method using a designed collimator system. Our collimator system provides a reliable and controllable calibration environment for the camera. Exploiting the unique optical geometry of our collimator system, we introduce an angle invariance constraint and further prove that the relative motion between the calibration target and camera conforms to a spherical motion model. This constraint reduces the original 6DOF relative motion between target and camera to a 3DOF pure rotation. Using the spherical motion constraint, we propose a closed-form linear solver for multiple images and a minimal solver for two images. Furthermore, we propose a single collimator image calibration algorithm based on the angle invariance constraint. This algorithm eliminates the requirement for camera motion, providing a novel solution for flexible and fast calibration. The performance of our method is evaluated in both synthetic and real-world experiments, which verify the feasibility of calibration using the collimator system and demonstrate that our method is superior to existing baseline methods. Demo code is available at https://github.com/LiangSK98/CollimatorCalibration


[28] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space cs.CV

Ren Nakagawa, Yang Yang, Risa Shinoda, Hiroaki Santo, Kenji Oyama

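TL;DR: This paper proposes CattleAct, a data-efficient method for automatically detecting behavioral interactions between grazing cattle from a single image. It decomposes interactions into combinations of actions by individual cattle and uses contrastive learning to fine-tune a pre-trained action latent space, constructing a unified action-interaction latent space that addresses the scarcity of cattle interaction data.

Details

Motivation: The goal is an automated tool for smart livestock management, such as estrus detection. Because interactions between cattle are rare events, no comprehensive behavioral dataset that includes interactions exists, so a data-efficient detection method is needed.

Result: Experiments on a commercial-scale pasture show that the method detects interactions accurately compared with baselines. The authors also develop a practical working system that integrates video and GPS inputs.

Insight: The contribution is decomposing interactions into individual actions and using contrastive learning to build a unified action-interaction latent space. This data-efficient approach could carry over to other animal behavior analysis tasks that involve rare events or scarce data.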

Abstract: This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into the combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at https://github.com/rakawanegan/CattleAct.
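
The abstract says only that rare interactions are embedded by fine-tuning the latent space with contrastive learning. As a generic illustration of what a contrastive objective looks like, the sketch below implements an InfoNCE-style loss for one anchor; the choice of InfoNCE, the cosine similarity, and the temperature value are all assumptions and may differ from the loss CattleAct actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closer to the positive than to the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

Minimizing such a loss pulls embeddings of matching samples together and pushes non-matching ones apart, which is the mechanism by which a shared latent space can be shaped from few labeled interactions.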


[29] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation cs.CV | cs.AI

Chao Li, Dasha Hu, Chengyang Li, Yuming Jiang, Yuncheng Shen

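TL;DR: This paper proposes C-DGPA, a class-centric dual-alignment generative prompt adaptation framework for unsupervised domain adaptation. A dual-branch architecture jointly optimizes marginal distribution alignment and conditional distribution alignment, addressing the class prototype misalignment and degraded semantic discriminability that arise when vision-language models ignore conditional distribution discrepancies in domain adaptation tasks.

Details

Motivation: Existing prompt-tuning methods for unsupervised domain adaptation focus mainly on marginal distribution alignment and neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability.

Result: In extensive experiments on the OfficeHome, Office31, and VisDA-2017 benchmarks, C-DGPA achieves new state-of-the-art (SOTA) results on all of them.

Insight: The contribution is a dual-branch architecture that jointly performs marginal and conditional distribution alignment. The conditional branch introduces a Class Mapping Mechanism that standardizes semantic prompt understanding and prevents over-reliance on the source domain, integrating domain knowledge into prompt learning and producing domain-invariant, semantically discriminative representations.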

Abstract: Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribution, but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, the work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.


[30] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation cs.CV

Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury

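TL;DR: This paper proposes VALOR, a reinforcement learning-based post-alignment framework that improves medical vision-language models so they generate visually grounded and clinically accurate radiology reports. Training proceeds in two stages: textual rewards first improve the model's use of clinically precise terminology, and the vision projection module is then aligned with disease findings to guide attention toward relevant image regions.

Details

Motivation: Existing medical vision-language models often hallucinate when generating radiology reports because of poor cross-modal alignment between visual and linguistic representations. They also rely on large labeled corpora or retrieval-based methods, which do not ensure visual grounding and clinical accuracy.

Result: Extensive experiments on multiple benchmarks show that VALOR substantially improves factual accuracy and visual grounding, outperforming current state-of-the-art report generation methods.

Insight: The contribution is a reinforcement learning-based post-alignment framework with GRPO optimization that aligns the textual and visual modules in stages. The approach could inform efforts to improve the accuracy and interpretability of multimodal models in specialized domains such as medicine.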

Abstract: Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.


[31] Enhanced 3D Shape Analysis via Information Geometry cs.CV | math.DG

Amit Vishwakarma, K. S. Subrahamanian Moosath

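TL;DR: This paper proposes an information geometric framework for 3D point cloud shape analysis. It represents point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold and introduces the Modified Symmetric KL (MSKL) divergence, which has theoretically guaranteed upper and lower bounds, to compare point cloud shapes stably, addressing the limitations of traditional geometric metrics and existing KL approximations.

Details

Motivation: Comparing 3D point clouds is difficult because of their unstructured nature and complex geometry. Traditional geometric metrics (such as the Hausdorff distance) fail to capture global statistical structure and are sensitive to outliers, while existing KL divergence approximations for GMMs can produce unbounded or unstable values.

Result: Experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset) show that MSKL yields stable, monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.

Insight: The contributions are modeling point clouds as GMMs on a statistical manifold, proving that the space of GMMs forms a statistical manifold, and proposing the MSKL divergence with guaranteed numerical stability, giving point cloud shape analysis a more robust information geometric measure.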

Abstract: Three-dimensional point clouds provide highly accurate digital representations of objects, essential for applications in computer graphics, photogrammetry, computer vision, and robotics. However, comparing point clouds faces significant challenges due to their unstructured nature and the complex geometry of the surfaces they represent. Traditional geometric metrics such as Hausdorff and Chamfer distances often fail to capture global statistical structure and exhibit sensitivity to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian Mixture Models can produce unbounded or numerically unstable values. This paper introduces an information geometric framework for 3D point cloud shape analysis by representing point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold. We prove that the space of GMMs forms a statistical manifold and propose the Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Through comprehensive experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset), we demonstrate that MSKL provides stable and monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.
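
The abstract does not give the definition of MSKL, so the sketch below shows only the standard building block it modifies: the closed-form KL divergence between two single Gaussians (with diagonal covariance, an assumption made to keep the example dependency-free) and its symmetrized form. No closed form exists for the KL divergence between full mixtures, which is why approximations are needed in the first place.

```python
import math


def kl_diag_gaussian(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariances, given per-dimension variances."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kl(mean_p, var_p, mean_q, var_q):
    """Plain symmetrized KL: KL(p || q) + KL(q || p). This is not the paper's MSKL."""
    return (kl_diag_gaussian(mean_p, var_p, mean_q, var_q)
            + kl_diag_gaussian(mean_q, var_q, mean_p, var_p))
```

The sketch also hints at the instability the abstract mentions: as a variance in the denominator shrinks toward zero, the divergence grows without bound, so an unmodified KL-based score has no upper limit.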


[32] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection cs.CV | cs.AI

Satya Narayana Panda, Vaishnavi Kukkala, Spandana Iyer

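TL;DR: This study proposes a comprehensive multi-modal AI framework that enhances dermatological diagnosis by integrating deep learning-based image analysis with structured clinical data, including detailed family history patterns. The framework combines interpretable convolutional neural networks with clinical decision trees that incorporate hereditary risk factors, and the authors plan prospective clinical trials to validate its effectiveness in real healthcare settings.

Details

Motivation: Skin conditions affect 1.9 billion people worldwide, yet accurate diagnosis remains challenging because of limited specialist availability and complex clinical presentations. Family history significantly influences susceptibility and treatment response but is often underutilized in diagnosis. The study asks how family history data can be combined with clinical imaging to improve diagnostic accuracy while supporting clinical trial validation and real-world use.

Result: The AI system with integrated family history data shows higher diagnostic accuracy, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations, but formal clinical trials are still at the planning stage.

Insight: The contribution is the multi-modal integration of family history data and clinical imaging through interpretable AI mechanisms (such as CNNs combined with decision trees) to increase diagnostic trustworthiness and clinical actionability. Viewed objectively, the framework highlights the importance of data fusion and interpretability in medical AI and offers a more comprehensive and trustworthy approach to skin disease diagnosis.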

Abstract: Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation? We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. The methodology includes prospective clinical trials across diverse healthcare settings to validate AI-assisted diagnosis against traditional clinical assessment. In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.


[33] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering cs.CV | cs.AI

Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu

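TL;DR: This paper presents TextEditBench, an evaluation benchmark focused on editing text regions in images. It emphasizes editing scenarios that require reasoning and introduces Semantic Expectation (SE) as a new evaluation dimension that measures a model's semantic consistency and cross-modal alignment in text editing.

Details

Motivation: Visual generation has made progress on text rendering, but text editing within images remains underexplored because it requires generating legible characters while preserving semantic, geometric, and contextual coherence. A dedicated evaluation benchmark is needed to fill this gap.

Result: Extensive experiments on state-of-the-art editing systems show that current models can follow simple textual instructions but still struggle with context-dependent reasoning, physical consistency, and layout-aware integration.

Insight: The contribution is an evaluation benchmark focused on text-centric regions that emphasizes reasoning-intensive editing scenarios and adds Semantic Expectation (SE) as an evaluation dimension, providing a new testing ground for text-guided image editing and reasoning in multimodal generation.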

Abstract: Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's ability to reason so as to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.


[34] GFLAN: Generative Functional Layouts cs.CV | cs.AIPDF

Mohamed Abouagour, Eleftherios Garyfallidis

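TL;DR: This paper presents GFLAN, a generative framework for automated floor plan generation whose core idea is to explicitly factorize floor plan synthesis into two stages: topological planning and geometric realization. Given a building's exterior boundary and a front-door location, the first stage uses a specialized convolutional architecture with dual encoders to sequentially allocate room centroids among feasible placements via discrete probability maps. The second stage constructs a heterogeneous graph linking room nodes to boundary vertices and applies a Transformer-augmented graph neural network to jointly regress room boundaries.

Details

Motivation: Existing deep learning approaches to automated floor plan generation struggle to capture architectural reasoning, such as the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions.

Result: The abstract does not report specific quantitative results, benchmarks, or comparisons with the state of the art.

Insight: The main contribution is the explicit factorization of floor plan generation into topological planning and geometric realization, with a dedicated architecture for each stage (a dual-encoder convolutional network for sequential allocation and a Transformer-augmented GNN for boundary regression), to better model functional constraints and spatial relationships in architecture.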

Abstract: Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements – a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders – separating invariant spatial context from evolving layout state – to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.


[35] PixelArena: A benchmark for Pixel-Precision Visual Intelligence cs.CV | cs.AIPDF

Feng Liang, Sizhe Cheng, Chenqi Yi

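TL;DR: The paper proposes PixelArena, a benchmark that uses semantic segmentation tasks to objectively evaluate the fine-grained, pixel-precise image generation capabilities of multimodal large language models. It finds that Gemini 3 Pro Image generates high-fidelity semantic masks in zero-shot settings, showing visual intelligence not seen before and true generalization to new image generation tasks.

Details

Motivation: Current image generation benchmarks for multimodal large language models mostly focus on aesthetics rather than fine-grained generation capabilities, so a pixel-precise evaluation method is needed to test fine-grained generative intelligence objectively.

Result: Gemini 3 Pro Image generates high-fidelity semantic masks in zero-shot settings and performs strongly on the PixelArena benchmark, showing advantages over other models in both qualitative and quantitative evaluation. The paper also presents its failure cases.

Insight: The contribution is the use of semantic segmentation as a benchmark for pixel-precise generation, which reveals emergent capabilities and generalization in fine-grained visual tasks and offers insights for future research on multimodality, reasoning, interpretability, and benchmarking.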

Abstract: Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.


[36] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation cs.CVPDF

Haiyu Zhao, Yiwen Shan, Yuanbiao Gou, Xi Peng

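TL;DR: This paper presents LaverNet, a lightweight all-in-one video restoration network with only 362K parameters. Its selective propagation mechanism transmits only degradation-agnostic features across frames, which reduces the interference of time-varying degradations with temporal modeling and gives a compact network strong unified restoration ability across multiple degradations.

Details

Motivation: Existing all-in-one video restoration methods face two challenges with time-varying degradations: the degradation can dominate temporal modeling, so the model focuses on artifacts rather than video content, and the methods typically rely on large models that conceal these underlying difficulties.

Result: LaverNet achieves performance comparable or even superior to existing models across multiple benchmarks, with less than 1% of their parameters.

Insight: The contribution is a selective propagation mechanism that transmits only degradation-agnostic features across frames, which isolates the effect of degradations. The work also shows that a lightweight network can deliver strong all-in-one restoration, challenging the field's reliance on large models.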

Abstract: Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, confusing the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, less than 1% of the parameters of existing models, LaverNet achieves comparable, even superior performance across benchmarks.


[37] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs cs.CVPDF

Huayu Huang, Chen Chen, Banglei Guan, Ze Tan, Yang Shang

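TL;DR: This paper proposes a ridge-estimation-based fusion localization method for UAVs that combines the rich scene information of sequential imagery with the high precision of laser ranging to improve localization accuracy. Under limited observation conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix in least squares estimation are severely multicollinear, which makes the problem ill-conditioned, unstable, and not robust; ridge estimation is introduced to mitigate this. Experiments show higher localization accuracy than ground localization algorithms based on a single information source, and ridge estimation improves robustness, particularly under limited observation conditions.

Details

Motivation: To address the ill-conditioning, instability, and low robustness caused by severe multicollinearity in the design matrix when least squares estimation is used for multi-sensor fusion localization of UAVs under limited observation conditions such as long distances, small intersection angles, and large inclination angles.

Result: Experimental results show that the method achieves higher localization accuracy than ground localization algorithms based on a single information source (vision or laser), and that ridge estimation improves robustness, particularly under limited observation conditions.

Insight: The contribution is the introduction of ridge estimation into a multi-sensor fusion localization framework for UAVs to address multicollinearity under limited observation conditions. It offers an effective regularization approach for improving the numerical stability and robustness of localization systems under challenging observation geometry.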

Abstract: Tracking and measuring targets using a variety of sensors mounted on UAVs is an effective means to quickly and accurately locate the target. This paper proposes a fusion localization method based on ridge estimation, combining the advantages of rich scene information from sequential imagery with the high precision of laser ranging to enhance localization accuracy. Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix have serious multicollinearity when using the least squares estimation algorithm. The multicollinearity will lead to ill-conditioned problems, resulting in significant instability and low robustness. Ridge estimation is introduced to mitigate the serious multicollinearity under the condition of limited observation. Experimental results demonstrate that our method achieves higher localization accuracy compared to ground localization algorithms based on single information. Moreover, the introduction of ridge estimation effectively enhances the robustness, particularly under limited observation conditions.
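
The effect of ridge estimation on a nearly collinear design matrix can be illustrated with a minimal sketch. It uses the textbook ridge estimate, x = (AᵀA + kI)⁻¹Aᵀb, on synthetic data; the matrix `A`, the ridge parameter `k = 1e-2`, and the noise levels are illustrative assumptions and are not taken from the paper, whose observation model is not given in the abstract.

```python
import numpy as np

def least_squares(A, b):
    # Ordinary least squares via the normal equations: x = (A^T A)^{-1} A^T b
    return np.linalg.solve(A.T @ A, A.T @ b)

def ridge(A, b, k):
    # Ridge estimate: x = (A^T A + k I)^{-1} A^T b, with ridge parameter k > 0
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + k * np.eye(n), A.T @ b)

rng = np.random.default_rng(0)

# Two nearly identical columns stand in for a poorly conditioned geometry
# (an assumption for illustration, not the paper's design matrix).
col = rng.normal(size=50)
A = np.column_stack([col, col + 1e-4 * rng.normal(size=50)])
x_true = np.array([1.0, 1.0])
b = A @ x_true + 0.01 * rng.normal(size=50)

k = 1e-2  # hypothetical ridge parameter
cond_ls = np.linalg.cond(A.T @ A)
cond_ridge = np.linalg.cond(A.T @ A + k * np.eye(2))
x_ls = least_squares(A, b)
x_ridge = ridge(A, b, k)

print("condition number (LS):   ", cond_ls)
print("condition number (ridge):", cond_ridge)
print("LS estimate:   ", x_ls)
print("ridge estimate:", x_ridge)
```

Adding `k` to the diagonal raises the smallest eigenvalue of the normal matrix, which lowers its condition number and shrinks the estimate along the poorly observed direction at the cost of a small bias.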


[38] Collaborative Edge-to-Server Inference for Vision-Language Models cs.CV | cs.AIPDF

Soochang Song, Yongjune Kim

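TL;DR: This paper proposes a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces communication cost while maintaining inference accuracy. The design has two stages: the server first runs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention, then computes the min-entropy of the output tokens as a confidence measure to decide whether to ask the edge device to send a detail-preserved local image of the RoI, and finally refines its inference using both the global and local images.

Details

Motivation: In typical deployments, edge devices send visual data to a server for VLM inference, and resizing the original image (the global image) to match the vision encoder's input resolution discards fine-grained details and degrades accuracy. The work addresses this problem while also reducing communication overhead.

Result: Experiments across multiple VLM architectures show that the framework significantly reduces communication cost while maintaining inference accuracy.

Insight: The contribution is the use of the VLM's internal attention to identify the RoI dynamically, combined with a min-entropy confidence measure that drives a selective retransmission strategy so that only essential visual content is transmitted. Viewed objectively, the collaborative edge-server inference balances communication efficiency against model accuracy and offers a practical option for deploying VLMs in resource-constrained environments.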

Abstract: We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder’s input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM’s internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
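
The retransmission decision can be sketched as follows. Min-entropy of a distribution p is -log2(max p), so it is low when one token dominates. The abstract does not say how per-token values are aggregated or what threshold is used; averaging over tokens, the function names, and the threshold of 0.5 bits are assumptions made for illustration.

```python
import numpy as np

def min_entropy(probs):
    # probs: array of shape (num_tokens, vocab_size); each row sums to 1.
    # Min-entropy of one distribution is -log2 of its largest probability.
    return -np.log2(np.max(probs, axis=-1))

def needs_retransmission(probs, threshold):
    # Assumption: aggregate token-level min-entropy by averaging.
    # A high value means low confidence, so the RoI crop is requested.
    return float(np.mean(min_entropy(probs))) > threshold

# Toy next-token distributions over a 3-word vocabulary (illustrative only).
confident = np.array([[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]])
uncertain = np.array([[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]])

threshold = 0.5  # hypothetical threshold in bits
print(needs_retransmission(confident, threshold))
print(needs_retransmission(uncertain, threshold))
```

In this sketch only the uncertain output would trigger a request for the detail-preserved local image, which is how the selective strategy avoids sending extra data for answers the model is already confident about.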


[39] GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction cs.CVPDF

Tao Hu, Weiyu Zhou, Yanjie Tu, Peng Wu, Wei Dong

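TL;DR: GMODiff is a gain map (GM) driven one-step diffusion framework for multi-exposure high dynamic range (HDR) reconstruction. It reformulates HDR reconstruction as a conditionally guided gain map estimation task, uses a pre-trained latent diffusion model (LDM) as a perceptual prior, and initializes the denoising process from a regression prior, so that it generates high-quality gain maps in a single step while suppressing hallucinations and preserving structural accuracy.

Details

Motivation: To address three challenges in directly applying pre-trained latent diffusion models (LDMs) to HDR reconstruction: limited dynamic-range representation caused by 8-bit latent compression, high inference cost from multi-step denoising, and the content hallucination inherent to generative models.

Result: In extensive experiments, GMODiff performs favorably against several state-of-the-art methods and is 100 times faster than previous LDM-based methods.

Insight: The contributions are: reformulating HDR reconstruction as gain map estimation, so the output keeps the same bit depth as LDR images; initializing the denoising process from a regression-based estimate, which enables high-quality one-step generation; and combining regression priors (content fidelity) with LDM priors (perceptual quality) to guide denoising and latent decoding, which suppresses hallucinations and preserves structural accuracy.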

Abstract: Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to their generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100× faster than previous LDM-based methods.


[40] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models cs.CVPDF

Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi

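TL;DR: This paper proposes Factorized Video Generation (FVG), which decomposes text-to-video generation into three specialized stages (reasoning, composition, and temporal synthesis) to address the difficulty existing diffusion models have in composing complex scenes and following logical temporal instructions. A large language model rewrites the prompt, a text-to-image model generates a high-quality anchor frame, and a video model focuses on animation, which markedly improves the quality and logical consistency of the generated videos.

Details

Motivation: State-of-the-art text-to-video diffusion models produce visually impressive results but still frequently fail to compose complex scenes or follow logical temporal instructions. The authors argue that many errors stem from the model's inability to construct a semantically correct or logically consistent initial frame.

Result: The method sets a new state of the art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. In addition, visual anchoring allows the number of sampling steps to be cut by 70% without loss of performance, giving a substantial sampling speed-up.

Insight: The core idea is to decouple video generation into two separate subtasks, scene construction and temporal synthesis, each handled by a dedicated model, which improves overall quality and efficiency. This decomposition offers a simple, practical path toward more efficient, robust, and controllable video synthesis.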

Abstract: State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model’s inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.


[41] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture cs.CV | cs.AI | cs.GRPDF

Haodi He, Jihun Yu, Ronald Fedkiw

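TL;DR: This paper presents a method that uses Gaussian Splatting to reconstruct high-fidelity 3D facial geometry and texture from a small number of uncalibrated face images. It combines semantic segmentation alignment, triangulated-mesh constraints, and a relightable Gaussian model to reconstruct a neutral-pose face from only 11 images, producing a geometric mesh and a delit high-resolution albedo texture usable in a standard graphics pipeline.

Details

Motivation: To reconstruct high-fidelity 3D facial geometry and texture efficiently and consistently from a small number of uncalibrated images (as few as 11), and to make the results integrate seamlessly into a standard graphics pipeline.

Result: The method achieves high-fidelity reconstruction from a small number of input images and demonstrates its use in a text-driven asset creation pipeline.

Insight: The contribution is to couple Gaussian Splatting with an underlying triangulated mesh so that each improves the other, and to use the resulting accurate geometry to transform the Gaussian Splats into texture space, where they act as a view-dependent neural texture. This keeps high visual fidelity while remaining compatible with a standard graphics pipeline.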

Abstract: We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.


[42] CountZES: Counting via Zero-Shot Exemplar Selection cs.CVPDF

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

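TL;DR: CountZES is a training-free framework for zero-shot object counting. Three cooperating stages (detection-anchored, density-guided, and feature-consensus) progressively discover diverse object exemplars, addressing the challenge of counting instances of unseen categories in complex scenes from a class name alone.

Details

Motivation: Existing zero-shot object counting methods rely either on open-vocabulary detectors, which yield multi-instance candidates, or on random patch sampling, which fails to delineate object instances accurately. CountZES addresses both problems by selecting precise single-instance exemplars to improve counting accuracy.

Result: Experiments on multiple datasets show that CountZES achieves superior performance among zero-shot object counting methods and generalizes effectively across natural, aerial, and medical domains.

Insight: The contribution is a three-stage progressive exemplar selection framework that balances textual grounding, count consistency, and feature representativeness and achieves accurate zero-shot counting without training. Its density-guided, self-supervised paradigm is key to identifying statistically consistent and semantically compact exemplars.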

Abstract: Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate that CountZES achieves superior performance among ZOC methods while generalizing effectively across natural, aerial, and medical domains.


[43] SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning cs.CV | cs.ROPDF

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

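TL;DR: The paper proposes SNOW, a training-free, backbone-agnostic framework for unified 4D scene understanding that combines the open-world semantic priors of vision-language models (VLMs) with point cloud geometry and temporal consistency. It processes synchronized RGB images and 3D point clouds, produces multimodal tokens that capture semantic, geometric, and temporal attributes, and incrementally builds a queryable 4D Scene Graph (4DSG) that serves as a 4D prior for downstream embodied reasoning.

Details

Motivation: Autonomous robotic systems must navigate and interact in dynamic environments, but existing vision-language models (VLMs) lack grounding in 3D geometry and temporal dynamics, while geometric perception methods are semantically sparse. The work aims at unified 4D scene understanding.

Result: Experiments on multiple benchmarks show that SNOW achieves precise 4D scene understanding and spatially grounded inference, setting new state-of-the-art (SOTA) performance in several settings.

Insight: The contribution is a training-free, backbone-agnostic framework that generates object-level proposals with HDBSCAN clustering and SAM2 segmentation, encodes localized semantic, geometric, and temporal attributes with Spatio-Temporal Tokenized Patch Encoding (STEP), and builds a unified 4D Scene Graph (4DSG) spatially anchored by a lightweight SLAM backend, giving VLMs a structured 4D prior for directly interpreting the spatio-temporal dynamics of a scene.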

Abstract: Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.


[44] Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment cs.CV | cs.AIPDF

Yuan Li, Yahan Yu, Youyuan Lin, Yong-Hao Yang, Chenhui Chu

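TL;DR: This paper proposes a blind image quality assessment method that uses reinforcement learning to guide a model to imitate the human perception-reasoning process, so that the model acquires human-like and self-consistent reasoning ability.

Details

Motivation: Humans assess image quality by combining perceptual cues with implicit reasoning to form self-consistent judgments, whereas existing BIQA models lack this human-like, interpretable reasoning ability.

Result: On general metrics (Pearson and Spearman correlation coefficients), the method performs comparably to state-of-the-art BIQA systems. On similarity to human explanations (ROUGE-1), the model scores 0.512 on over 1,000 annotated samples (baseline: 0.443), indicating substantial coverage of human explanations.

Insight: The contribution is to use human annotations as reward signals in reinforcement learning to guide the model toward human-like perception and reasoning, and to design a reward under which the model infers quality purely from its self-generated descriptions, so that it internalizes self-consistent reasoning and BIQA becomes more interpretable.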

Abstract: Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of the human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
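
For readers unfamiliar with the alignment metric, ROUGE-1 is based on unigram overlap between a candidate text and a reference text. The sketch below computes precision, recall, and F1 with whitespace tokenization; the abstract does not state which variant or tokenizer the authors use, so both are assumptions, and the two sentences are invented examples rather than data from the paper.

```python
from collections import Counter

def rouge1(candidate, reference):
    # Count unigrams in each text (lowercased, split on whitespace).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented example: a model explanation versus a human explanation.
p, r, f = rouge1("the image is blurry and dark", "the image looks blurry")
print(p, r, f)
```

Here three of the four reference words appear in the candidate, so recall is 0.75, while precision is 0.5 because only three of the six candidate words are matched.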


[45] Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors cs.CV | cs.AIPDF

Kejun Liu, Yuanyuan Liu, Lin Wei, Chang Tang, Yibing Zhan

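TL;DR: To address the gap between facial expression recognition (FER) and genuine emotion recognition (ER), this paper introduces eye behaviors as an important emotional cue, builds EMER, a multimodal emotion recognition dataset containing eye movement sequences, eye fixation maps, and facial videos, and designs EMERT, a Transformer model that fuses eye behaviors with facial expressions to bridge the emotion gap and make emotion recognition more robust.

Details

Motivation: Emotion recognition currently relies heavily on facial expressions, but facial expressions are often used as social tools rather than as manifestations of genuine inner emotions, which creates a recognition gap. The paper aims to understand and bridge this gap by introducing eye behaviors, an emotional cue that is harder to disguise.

Result: On the newly built EMER dataset, the proposed EMERT model outperforms other state-of-the-art multimodal methods by a large margin across seven multimodal benchmark protocols, demonstrating the importance of modeling eye behaviors for robust emotion recognition.

Insight: The contribution is to introduce eye behaviors (such as eye movement sequences and eye fixation maps) as a key complementary modality for emotion recognition, along with a corresponding multimodal dataset and model. Viewed objectively, the core contribution is showing that decoupling and fusing features across modalities (face and eyes) is effective for capturing genuine, hard-to-disguise emotions, which provides new data and methods for more robust emotion recognition research.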

Abstract: Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm is used with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiments, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that EMERT outperforms other state-of-the-art multimodal methods by a large margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.


[46] YOLO11-4K: An Efficient Architecture for Real-Time Small Object Detection in 4K Panoramic Images cs.CVPDF

Huma Hafeez, Matthew Garratt, Jo Plested, Sankaran Iyer, Arcot Sowmya

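TL;DR: This paper presents YOLO11-4K, an efficient real-time small object detection framework designed for 4K panoramic images. It adds a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects and uses a GhostConv-based backbone to reduce computational complexity. On the CVIP360 dataset, which the authors annotated manually, it substantially lowers inference latency while keeping accuracy high.

Details

Motivation: Conventional detectors such as YOLO face heavy computational demands and miss small objects when processing panoramic images at 4K or higher resolution, because of spatial distortions, wide fields of view, and ultra-high-resolution inputs.

Result: On the CVIP360 dataset, annotated by the authors with 6,876 bounding boxes, YOLO11-4K reaches 0.95 mAP at an IoU threshold of 0.50 with 28.3 ms inference per frame. Compared with YOLO11 (112.3 ms, 0.908 mAP), latency drops by 75% while accuracy improves.

Insight: The main contributions are: (1) an efficient architecture tailored to 4K panoramic images; (2) a multi-scale detection head with a P2 layer that strengthens small object detection; and (3) a GhostConv backbone that balances computational efficiency against representational power. The approach extends to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.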

Abstract: The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard image sizes (for example, 640x640 pixels) and often struggle with the computational demands of 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications. While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
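
The reported 75 percent latency reduction can be checked directly from the two per-frame timings in the abstract. The short sketch below redoes the arithmetic; the derived speed-up factor and frames-per-second figure are computed here for context and are not numbers quoted by the authors.

```python
# Per-frame inference times reported in the abstract, in milliseconds.
baseline_ms = 112.3   # YOLO11
proposed_ms = 28.3    # YOLO11-4K

# Relative latency reduction: (old - new) / old
reduction = (baseline_ms - proposed_ms) / baseline_ms

# Derived quantities (not stated in the abstract).
speedup = baseline_ms / proposed_ms
frames_per_second = 1000.0 / proposed_ms

# Accuracy difference at IoU 0.50.
map_gain = 0.95 - 0.908

print(f"latency reduction: {reduction:.1%}")
print(f"speed-up factor:   {speedup:.2f}x")
print(f"throughput:        {frames_per_second:.1f} frames/s")
print(f"mAP@0.50 gain:     {map_gain:.3f}")
```

The reduction works out to roughly 74.8 percent, which the abstract rounds to 75 percent, and it corresponds to a speed-up of a little under four times.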


[47] VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks cs.CVPDF

Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia

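TL;DR: This paper presents VenusBench-GD, a comprehensive, multi-platform, bilingual benchmark for graphical user interface (GUI) grounding. With a large-scale cross-platform dataset, a high-quality data construction pipeline, and a hierarchical task taxonomy (grounding is divided into basic and advanced categories covering six subtasks), it addresses the insufficient data volume, narrow domain coverage, and single-platform focus of existing benchmarks and supports more comprehensive hierarchical evaluation.

Details

Motivation: Existing GUI grounding benchmarks have significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. A more comprehensive, multi-platform benchmark is needed to support hierarchical evaluation for real-world applications.

Result: Experiments show that general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. On advanced tasks, specialized GUI models still have the edge, but they show significant overfitting and poor robustness. These results underscore the need for comprehensive, multi-tiered evaluation frameworks.

Insight: The main contributions are: (1) a large-scale, cross-platform benchmark dataset with broad coverage of applications and UI elements; (2) a high-quality data construction pipeline for grounding tasks, with higher annotation accuracy than existing benchmarks; and (3) a hierarchical task taxonomy that divides grounding into basic and advanced categories with six subtasks, evaluating models from complementary perspectives. Together these provide a finer-grained and more comprehensive framework for evaluating GUI agents.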

Abstract: GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though these models exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.


[48] Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization cs.CVPDF

Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi

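TL;DR: This paper proposes a self-supervised pretraining method for temporal action localization on skeleton sequences. A snippet discrimination pretext task and a U-shaped multiscale feature fusion module improve the model's sensitivity to action boundaries and its localization accuracy.

Details

Motivation: Existing skeleton-based self-supervised contrastive learning methods mainly target video-level action recognition, whereas temporal action localization requires temporally sensitive features that capture subtle differences between adjacent frames, and it remains underexplored.

Result: On BABEL, across diverse subsets and evaluation protocols, the method consistently improves existing skeleton-based contrastive learning methods. With pretraining on NTU RGB+D and BABEL, it achieves state-of-the-art transfer learning performance on PKUMMD.

Insight: The contribution is a dense snippet discrimination pretext task for learning discriminative temporal features, together with a U-shaped module that fuses multiscale intermediate features to raise feature resolution for frame-level localization, extending self-supervised pretraining to temporal action localization.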

Abstract: The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.


[49] TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models cs.CV | cs.AIPDF

Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li

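TL;DR: This paper proposes Test-Time Padding (TTP), a lightweight defense framework that improves the robustness of vision-language models such as CLIP to adversarial perturbations. At inference, it detects adversarial examples via the cosine similarity shift between features computed before and after spatial padding, applies trainable padding to detected examples to restore disrupted attention patterns, and makes a robust prediction with a similarity-aware ensemble strategy. Clean inputs are left unchanged by default, or can optionally be passed to existing test-time adaptation techniques for further accuracy gains.

Details

Motivation: Existing training-time defenses rely on adversarial fine-tuning with labeled data and costly retraining, while existing test-time strategies cannot reliably distinguish clean from adversarial inputs, so adversarial robustness and clean accuracy cannot both reach their optimum. TTP aims to overcome these limitations with a lightweight solution that needs no retraining and detects and adapts to adversarial attacks at test time.

Result: Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, substantially improving adversarial robustness without compromising clean accuracy.

Insight: The contribution is a universal detection threshold based on the cosine similarity shift between features before and after spatial padding, which applies across architectures and datasets. Trainable padding then repairs the attention patterns disrupted by adversarial attacks, and a similarity-aware ensemble produces the robust prediction, unifying detection and adaptation in one lightweight framework.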

Abstract: Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
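
The detection step can be sketched as a comparison of embeddings before and after padding. Everything below is a hedged illustration: `toy_encode` is a hypothetical stand-in for a CLIP image encoder, zero padding of 8 pixels and the threshold value are assumptions, and the rule that a larger shift indicates an adversarial input is an assumed direction that the abstract does not spell out.

```python
import numpy as np

def pad_image(image, pad):
    # Zero-pad height and width of an (H, W, C) image; channels are untouched.
    return np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_shift(image, encode, pad=8):
    # 1 - cos(f(x), f(pad(x))): how far the embedding moves under padding.
    before = encode(image)
    after = encode(pad_image(image, pad))
    return 1.0 - cosine_similarity(before, after)

def is_adversarial(image, encode, threshold, pad=8):
    # Assumption: inputs whose embedding shifts more than the threshold
    # are flagged as adversarial.
    return similarity_shift(image, encode, pad) > threshold

def toy_encode(image):
    # Hypothetical encoder used only to make the sketch runnable:
    # per-channel mean and standard deviation. A real system would call
    # a CLIP image encoder here.
    return np.concatenate([image.mean(axis=(0, 1)), image.std(axis=(0, 1))])

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
shift = similarity_shift(image, toy_encode, pad=8)
print("similarity shift:", shift)
print("flagged:", is_adversarial(image, toy_encode, threshold=0.05))
```

Passing the encoder as an argument keeps the detector independent of any particular backbone, which matches the abstract's claim that a single threshold is intended to work across architectures.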


[50] N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models cs.CV

Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu

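TL;DR: This paper proposes N3D-VLM, a new framework that unifies native 3D object perception with 3D-aware visual reasoning, addressing the fact that existing multimodal models lack intrinsic 3D perception and struggle to understand spatial relationships in 3D scenes. A scalable data construction pipeline lifts large-scale 2D annotations into 3D space and generates chain-of-thought datasets for 3D object grounding and 3D spatial reasoning, which are used to train the model jointly.

Details

Motivation: Current multimodal models built on 2D images lack intrinsic 3D object perception, which limits their understanding of spatial relationships and depth cues in 3D scenes.

Result: Experiments show that the unified framework achieves state-of-the-art performance on 3D grounding tasks and consistently surpasses existing methods on 3D spatial reasoning for vision-language models.

Insight: The core novelty is combining native 3D object perception (localizing objects directly in 3D space from textual descriptions) with explicit 3D reasoning, which gives more interpretable and structured spatial understanding. The scalable data pipeline uses depth estimation to lift 2D annotations into 3D, greatly expanding the scale and diversity of 3D grounding data, and produces training data that supports 3D chain-of-thought reasoning.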

Abstract: While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning for vision-language models.


[51] 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction cs.CV

Kirill Mazur, Marwan Taher, Andrew J. Davison

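TL;DR: This paper proposes 4D Primitive-Mâché, a dynamic scene reconstruction system that produces a complete and persistent 4D reconstruction from monocular RGB video. The method decomposes the scene into a set of rigid 3D primitives, jointly infers their rigid motion through an optimization pipeline to obtain 3D geometry that moves over time, and introduces a motion extrapolation mechanism to maintain continuity when objects are not visible.

Details

Motivation: The goal is to reconstruct a complete dynamic scene from monocular video, covering both the currently visible parts and everything seen earlier, so that the persistent 4D reconstruction can be replayed across timesteps. This overcomes the incomplete reconstructions that existing methods produce when objects temporarily disappear.

Result: On object scanning and multi-object datasets, the system significantly outperforms existing methods in both quantitative and qualitative evaluation, reaching state-of-the-art performance.

Insight: The novelty lies in decomposing a dynamic scene into rigidly moving primitives and inferring their motion by joint optimization, together with a motion extrapolation mechanism based on motion grouping that maintains continuity while objects are invisible. This yields genuine 4D spatio-temporal awareness and object permanence in the reconstruction.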

Abstract: We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.


[52] Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation cs.CV

Yin Zhang, Yongqiang Zhang, Yaoyue Zheng, Bogdan Raducanu, Dan Liu

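TL;DR: This paper proposes Causal-Tune, a new fine-tuning strategy for domain generalized semantic segmentation. The method identifies and separates causal and non-causal factors in the features of vision foundation models, operating in the frequency domain with the Discrete Cosine Transform and band-pass filtering to suppress artifact-related non-causal components and improve generalization to unseen domains.

Details

Motivation: Existing fine-tuning methods, which either train lightweight adapters or refine intermediate features, overlook the artifacts present in long-term pre-trained vision foundation models. These artifacts are associated with non-causal factors, hinder the use of valuable representations, and degrade domain generalization performance.

Result: Experiments on various cross-domain tasks show that the method improves generalization, particularly under adverse weather, for example a gain of 4.8% mIoU over the baseline in snow conditions.

Insight: Starting from causal mechanisms, the paper is the first to explicitly separate causal from non-causal factors in vision foundation model features. It refines the causal components through frequency-domain processing (DCT and band-pass filtering) combined with learnable causal-aware tokens, offering a simple yet effective approach to domain generalization.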

Abstract: Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving +4.8% mIoU over the baseline in snow conditions.
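The frequency-domain step in the abstract (DCT, Gaussian band-pass, inverse DCT) can be sketched as follows for a single 2-D feature map. This is a minimal illustration under stated assumptions, not the Causal-Tune implementation: the radial frequency definition, the `center` and `sigma` values, and the function names are hypothetical, and the learnable causal-aware tokens are omitted.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; its transpose is the inverse."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    mat[0, :] = np.sqrt(1.0 / n)
    return mat


def gaussian_band_pass(h, w, center=0.3, sigma=0.15):
    """Gaussian band-pass mask over normalized radial frequency in [0, 1).

    `center` and `sigma` are illustrative values, not taken from the paper.
    """
    u = np.arange(h)[:, None] / h
    v = np.arange(w)[None, :] / w
    radius = np.sqrt(u**2 + v**2) / np.sqrt(2.0)
    return np.exp(-((radius - center) ** 2) / (2.0 * sigma**2))


def keep_band(feature, center=0.3, sigma=0.15):
    """Keep the mid-frequency band of an (H, W) feature map and drop the rest."""
    h, w = feature.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    spectrum = ch @ feature @ cw.T           # forward 2-D DCT
    mask = gaussian_band_pass(h, w, center, sigma)
    causal = spectrum * mask                 # band treated as "causal" here
    return ch.T @ causal @ cw                # inverse 2-D DCT
```

In the paper's setting this would be applied to each layer's features, with the masked-out low- and high-frequency parts discarded before the result is passed to the next layer.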


[53] Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs cs.CV

Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou

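TL;DR: The paper proposes the Sketch-in-Latents (SkiLa) paradigm to strengthen the visual imagination of multimodal large language models (MLLMs). During reasoning, the model natively generates continuous visual embeddings (latent sketch tokens) interleaved with text tokens, giving unified visual-text reasoning and strong performance on vision-centric tasks.

Details

Motivation: Current MLLMs are good at text-based visual understanding but fall short in scenarios that require visual imagination. Inspired by the flexible visual-text interaction that humans carry out in a unified space inside the brain, the paper aims to let MLLMs insert visual tokens seamlessly into reasoning within their existing unified feature space, without relying on predefined external tools or generating full images.

Result: Extensive experiments show that SkiLa achieves superior performance on vision-centric tasks and generalizes strongly to diverse general multimodal benchmarks.

Insight: The core novelty is a unified multimodal reasoning paradigm that extends the autoregressive capability of MLLMs so that they natively generate continuous visual embeddings as visual thoughts. A latent visual semantics reconstruction mechanism keeps these embeddings semantically grounded, and the model alternates dynamically between textual thinking and visual sketching, which is closer to how humans reason.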

Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current approaches that rely on predefined external toolkits or generate images during thinking, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits; one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, and given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Code will be released at https://github.com/TungChintao/SkiLa.


[54] Hazedefy: A Lightweight Real-Time Image and Video Dehazing Pipeline for Practical Deployment cs.CV

Ayush Bhavsar

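TL;DR: This paper presents Hazedefy, a lightweight, application-focused image and video dehazing pipeline designed for real-time processing on consumer-grade hardware. Building on the Dark Channel Prior and the atmospheric scattering model, it uses gamma-adaptive reconstruction, a fast transmission approximation, a stabilized atmospheric light estimate based on fractional top-pixel averaging, and an optional color balance stage to improve visibility and contrast in real-world images and videos without GPU acceleration.

Details

Motivation: Existing dehazing methods are computationally complex and hard to deploy in real time on mobile and embedded devices. The paper aims to build a lightweight, practical dehazing pipeline.

Result: Demonstrations on real-world images and videos show that the method improves visibility and contrast without GPU acceleration, making it suitable for mobile and embedded applications.

Insight: The contribution is a lightweight engineering optimization of the classic Dark Channel Prior method, including gamma-adaptive reconstruction, a fast transmission approximation, and a stabilized atmospheric light estimator, which together enable real-time dehazing on consumer-grade hardware and offer a workable path to practical deployment.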

Abstract: This paper introduces Hazedefy, a lightweight and application-focused dehazing pipeline intended for real-time video and live camera feed enhancement. Hazedefy prioritizes computational simplicity and practical deployability on consumer-grade hardware, building upon the Dark Channel Prior (DCP) concept and the atmospheric scattering model. Key elements include gamma-adaptive reconstruction, a fast transmission approximation with lower bounds for numerical stability, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance stage. The pipeline is suitable for mobile and embedded applications, as experimental demonstrations on real-world images and videos show improved visibility and contrast without requiring GPU acceleration.
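For orientation, the classic Dark Channel Prior recipe that the abstract builds on can be sketched as below. This is a generic DCP baseline under the atmospheric scattering model `I = J * t + A * (1 - t)`, not Hazedefy itself: the gamma-adaptive reconstruction and color balance stages are omitted, and `patch`, `omega`, `t_min`, and `top_fraction` are illustrative parameters chosen here.

```python
import numpy as np


def dark_channel(image, patch=3):
    """Per-pixel minimum over color channels and a local patch; image is (H, W, 3) in [0, 1]."""
    min_rgb = image.min(axis=2)
    r = patch // 2
    padded = np.pad(min_rgb, r, mode="edge")
    h, w = min_rgb.shape
    out = np.full((h, w), np.inf)
    for dy in range(patch):
        for dx in range(patch):
            out = np.minimum(out, padded[dy:dy + h, dx:dx + w])
    return out


def atmospheric_light(image, dark, top_fraction=0.001):
    """Average the colors of the top fraction of pixels ranked by dark-channel value."""
    n = max(1, int(dark.size * top_fraction))
    idx = np.argsort(dark.ravel())[-n:]
    return image.reshape(-1, 3)[idx].mean(axis=0)


def dehaze(image, omega=0.95, t_min=0.1, patch=3):
    """Recover J from I = J * t + A * (1 - t), with transmission bounded below by t_min."""
    dark = dark_channel(image, patch)
    light = np.maximum(atmospheric_light(image, dark), 1e-6)
    t = 1.0 - omega * dark_channel(image / light, patch)
    t = np.maximum(t, t_min)[..., None]      # lower bound for numerical stability
    return np.clip((image - light) / t + light, 0.0, 1.0)
```

Averaging over a fraction of pixels rather than taking a single brightest pixel makes the atmospheric light estimate less sensitive to isolated highlights, which matters for frame-to-frame stability in video.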


[55] Plug to Place: Indoor Multimedia Geolocation from Electrical Sockets for Digital Investigation cs.CV

Kanwal Aftab, Graham Adams, Mark Scanlon

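TL;DR: The paper proposes an indoor multimedia geolocation pipeline based on electrical sockets for digital forensic investigation. The pipeline detects sockets, classifies the socket type, and maps it to countries, addressing the challenges of indoor geolocation, and the authors build dedicated datasets and evaluate performance on real-world images.

Details

Motivation: Indoor multimedia geolocation is difficult because of similar room layouts, varying lighting, and unreliable GPS signals. Solving it would help law enforcement fight serious crimes such as human trafficking and child exploitation.

Result: Socket detection with YOLOv11 reaches mAP@0.5 of 0.843, classification with Xception reaches an accuracy of 0.912, and country mapping reaches an accuracy of 0.96 (at confidence above 90%). Evaluation on the TraffickCam subset of Hotels-50K shows performance under real-world conditions.

Insight: The novelty is using standardized electrical sockets as indoor markers and building a three-stage deep learning pipeline around them. Viewed objectively, the geographic consistency of socket types lets the method cope with the data scarcity and visual ambiguity of indoor geolocation.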

Abstract: Computer vision is a rapidly evolving field, giving rise to powerful new tools and techniques in digital forensic investigation, and shows great promise for novel digital forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at >90% threshold confidence). To address data scarcity, two dedicated datasets were created: a socket detection dataset of 2,328 annotated images expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles. This dataset provides a more realistic evaluation than using professional, well-lit, often wide-angle images from travel websites. This framework demonstrates a practical step toward real-world digital forensic applications. The code, trained models, and the data for this paper are available open source.
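The third stage, mapping classified socket types to candidate countries above a confidence threshold, could look roughly like the sketch below. The lookup table is a small illustrative excerpt written for this example, not the paper's mapping, and the function name and input format are hypothetical.

```python
# Illustrative excerpt only; a real table would cover all 12 socket classes.
SOCKET_TO_COUNTRIES = {
    "G": ["United Kingdom", "Ireland"],
    "I": ["Australia", "New Zealand"],
    "B": ["United States", "Canada"],
}


def candidate_countries(predictions, min_confidence=0.90):
    """Map (socket_type, confidence) pairs to countries, keeping confident predictions only."""
    countries = []
    for socket_type, confidence in predictions:
        if confidence <= min_confidence:
            continue  # discard classifications at or below the threshold
        for country in SOCKET_TO_COUNTRIES.get(socket_type, []):
            if country not in countries:
                countries.append(country)
    return countries
```

Because one socket type is usually shared by several countries, the output is a shortlist that narrows an investigation rather than a single location.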


[56] DeContext as Defense: Safe Image Editing in Diffusion Transformers cs.CV

Linghui Shen, Mingyue Cui, Xingyi Yang

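TL;DR: This paper proposes DeContext, a new method that protects input images from unauthorized in-context editing. It injects small, targeted perturbations that weaken the cross-attention pathways in the multimodal attention layers of a diffusion transformer (DiT), which breaks the link between the input image and the output and provides a defense for safe image editing.

Details

Motivation: In-context diffusion models raise privacy and security concerns: personal images can easily be manipulated for identity impersonation, misinformation, and other malicious uses. Existing input-perturbation defenses target personalized text-to-image generation, and their robustness against large-scale DiT-based in-context models has not been studied sufficiently.

Result: Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality, confirming the effectiveness of attention-based perturbation as a defense.

Insight: The paper shows that contextual information propagates mainly through multimodal attention layers, and that early denoising steps and specific transformer blocks dominate this propagation, so perturbations can be concentrated at those key locations. Viewed objectively, the method provides an efficient and robust defense mechanism that could carry over to protecting other attention-based generative models.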

Abstract: In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner’s consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.


[57] REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion cs.CV

Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas

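TL;DR: This paper proposes REGLUE, a framework that jointly models VAE image latents, local (patch-level) vision foundation model semantics, and a global (image-level) [CLS] token within a single SiT backbone. This strengthens semantic supervision in latent diffusion models, which accelerates training and improves image synthesis quality.

Details

Motivation: The reconstruction-style denoising objective of existing latent diffusion models provides only indirect semantic supervision, so high-level semantics emerge slowly, training takes longer, and sample quality is limited. Existing methods also under-use the rich, nonlinear, multi-layer spatial semantics that vision foundation models provide.

Result: On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over the SiT-B/2 and SiT-XL/2 baselines as well as over REPA, ReDi, and REG.

Insight: The novelty is a unified global-local-latent joint modeling framework: a lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features and entangles them with the VAE latents in the diffusion process, while an external alignment loss further regularizes the internal representations. Key findings are that spatial VFM semantics are crucial, that nonlinear compression is the key to obtaining their full benefit, and that the global token and external alignment act as complementary, lightweight enhancements.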

Abstract: Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue .


[58] FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering cs.CV | cs.GR

Ole Beisswenger, Jan-Niklas Dihlmann, Hendrik P. A. Lensch

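TL;DR: This paper proposes FrameDiffuser, a G-buffer-conditioned autoregressive diffusion model for neural forward frame rendering. By combining the current frame's geometry buffer (G-buffer) data with the model's own previously generated frame, it synthesizes temporally consistent, photorealistic images frame by frame in interactive applications.

Details

Motivation: Existing diffusion-based methods for G-buffer-conditioned image synthesis are limited: single-image models lack temporal consistency, while video models are too computationally expensive and require complete sequences, which makes them unsuitable for interactive applications that depend on user input.

Result: With environment-specific training, the model generates stable, temporally consistent output over hundreds to thousands of frames and outperforms generalized approaches in photorealism, lighting, shadows, and reflections.

Insight: The contributions are a dual-conditioning architecture (ControlNet for structural guidance and ControlLoRA for temporal coherence), a three-stage training strategy that enables stable autoregressive generation, and specialization to individual environments to prioritize consistency and inference speed.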

Abstract: Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.


[59] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation cs.CV

Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen

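TL;DR: This paper proposes TODSynth, a task-oriented data synthesis framework for remote sensing semantic segmentation. It consists of a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. The framework addresses the complexity of semantic mask control and the uncertainty of sampling quality: a joint attention mechanism and dynamic adjustment of the sampling direction based on a semantic loss produce more stable synthetic data that is better suited to the downstream task.

Details

Motivation: Manual annotation is expensive in remote sensing, and controllable generation offers a way to expand labeled datasets. Existing methods, however, are limited in semantic mask control and sampling quality, which restricts the usefulness of synthetic data for downstream semantic segmentation.

Result: On remote sensing semantic segmentation, the method performs well in few-shot and complex-scene settings, and extensive experiments show that it consistently outperforms state-of-the-art controllable generation methods.

Insight: The contributions are a text-image-mask joint attention scheme combined with a full fine-tuning strategy to improve synthesis, and a control-rectify flow matching method that uses a semantic loss to guide sampling during the early high-plasticity stage. This stabilizes generation and narrows the gap between synthetic data and the downstream task.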

Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.


[60] Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation cs.CV

Zhiyang Guo, Ori Zhang, Jax Xiang, Alan Zhao, Wengang Zhou

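TL;DR: This paper proposes Make-It-Poseable, a novel feed-forward framework for animating 3D humanoid characters that reformulates character posing as a latent-space transformation problem. The method reconstructs the character in new poses by directly manipulating its latent representation. At its core is a latent posing transformer that manipulates shape tokens based on skeletal motion, supported by a dense pose representation, a latent-space supervision strategy, and an adaptive completion module that ensure high-fidelity geometry and accommodate topological changes.

Details

Motivation: Existing methods such as auto-rigging and pose-conditioned generation suffer from inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, which limits their robustness and generalizability. The paper aims to overcome these limitations.

Result: The method shows superior posing quality and extends naturally to 3D editing applications such as part replacement and refinement.

Insight: The main novelty is recasting character posing as a latent-space transformation problem, which avoids the traditional deformation of mesh vertices. The specific techniques are a latent posing transformer driven by skeletal motion, a dense pose representation for precise control, a latent-space supervision strategy, and an adaptive completion module that handles topological changes and preserves geometric fidelity.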

Abstract: Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.


[61] Kling-Omni Technical Report cs.CV

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng

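TL;DR: Kling-Omni is a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual-language inputs. Taking an end-to-end perspective, it integrates diverse video generation, editing, and intelligent reasoning tasks into one holistic system that accepts text instructions, reference images, and video contexts and produces cinematic-quality, intelligent video content.

Details

Motivation: Existing video generation, editing, and reasoning tasks are functionally separated and handled by disjointed pipelines. The goal is a holistic video creation system that processes a unified multimodal representation and supports diverse user inputs.

Result: Comprehensive evaluations show that Kling-Omni has exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following.

Insight: The main contribution is an end-to-end unified generative framework that integrates multiple video-related tasks, together with the comprehensive data system, efficient large-scale pre-training strategies, and inference infrastructure optimizations that support it. The authors position it not only as a content creation tool but as a key step toward multimodal world simulators that can perceive, reason, generate, and interact with dynamic and complex worlds.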

Abstract: We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating, and interacting with dynamic and complex worlds.


[62] R3ST: A Synthetic 3D Dataset With Realistic Trajectories cs.CV

Simone Teglia, Claudia Melis Tonti, Francesco Pro, Leonardo Russo, Andrea Alfarano

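TL;DR: This paper introduces R3ST (Realistic 3D Synthetic Trajectories), a dataset built by generating a synthetic 3D environment and integrating real-world trajectories extracted from SinD, a dataset recorded from drone footage. It addresses the lack of realistic vehicle motion in existing synthetic datasets and gives traffic analysis and road safety research a resource that has both precise multimodal annotations and real human-driven trajectories.

Details

Motivation: Existing real datasets capture authentic road object behavior but typically lack precise ground-truth annotations. Synthetic datasets provide large numbers of annotations at low cost, but their vehicle trajectories are usually generated by AI models or rule-based systems and lack realism. The paper aims to close the gap between synthetic data and realistic trajectories.

Result: By integrating real trajectories from SinD, the authors produce a synthetic 3D environment with real human-driven vehicle trajectories and accurate multimodal ground-truth annotations. The abstract does not report specific quantitative results or benchmark comparisons.

Insight: The novelty is combining real-world trajectory data (from SinD) with a synthetic 3D environment, so that a synthetic dataset offers both precise annotations and realistic vehicle motion. This gives research such as vehicle trajectory forecasting a higher-quality data foundation.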

Abstract: Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios, capturing authentic road object behaviors; however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird’s-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing the research in trajectory forecasting of road vehicles, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.


[63] GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation cs.CV | cs.RO

Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang

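TL;DR: GeoPredict is a geometry-aware vision-language-action framework that augments a continuous-action policy with predictive kinematic and geometric priors, addressing the limitations of existing VLA models on tasks that require precise 3D reasoning.

Details

Motivation: Existing vision-language-action models rely mainly on reactive 2D perception and are unreliable in robotic manipulation tasks that need precise 3D spatial reasoning, so their geometric awareness and predictive ability need to be strengthened.

Result: Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

Insight: The contributions are a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories, and a predictive 3D Gaussian geometry module with track-guided refinement along future trajectories. Both serve only as training-time supervision; inference needs only lightweight query tokens and no 3D decoding.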

Abstract: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.


[64] Radiology Report Generation with Layer-Wise Anatomical Attention cs.CV

Emmanuel D. Muñiz-De-León, Jorge A. Rosales-de-Golferichs, Ana S. Muñoz-Rodríguez, Alejandro I. Trejo-Castro, Eduardo de Avila-Armenta

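TL;DR: This paper proposes a compact image-to-text architecture for chest X-ray report generation that produces the Findings section from a single frontal image. The model combines a frozen DINOv3 ViT encoder with a GPT-2 decoder enhanced by layer-wise anatomical attention, which uses lung and heart segmentation masks to steer attention toward clinically relevant regions without adding trainable parameters. Evaluation on MIMIC-CXR shows substantial gains on CheXpert and RadGraph metrics.

Details

Motivation: Current state-of-the-art radiology report generation systems such as MAIRA-2 and MedPaLM-M depend on large-scale multimodal training, clinical metadata, and multiple imaging views, which makes them resource-intensive and hard to deploy widely. The paper aims to build a compact model that relies on a single frontal image, lowering resource requirements and improving accessibility.

Result: On MIMIC-CXR, evaluated with CheXpert and RadGraph metrics, CheXpert Macro-F1 for five key pathologies improved by 168% (0.083 -> 0.238), Micro-F1 improved by 146% (0.137 -> 0.337), overall performance across 14 observations improved by 86% (0.170 -> 0.316), and RadGraph F1 improved by 9.7%.

Insight: The novelty is a layer-wise anatomical attention mechanism that integrates anatomical segmentation masks through hierarchical Gaussian smoothing and guides attention at the decoder level toward clinically relevant regions, improving spatial grounding and report coherence without adding trainable parameters. This suggests a route to efficient radiology report generation in resource-constrained settings.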

Abstract: Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.


[65] OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction cs.CV | cs.AI | cs.RO

Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou

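TL;DR: This paper introduces OpenTouch, the first egocentric full-hand tactile dataset collected in real-world environments. It contains 5.1 hours of synchronized video-touch-pose data and 2,900 clips with detailed text annotations, and the authors build retrieval and classification benchmarks on it to bridge the gap between visual perception and physical interaction.

Details

Motivation: Egocentric perception currently lacks tactile information, and no in-the-wild dataset aligns video with full-hand touch. Filling this gap is intended to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

Result: The study shows that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries.

Insight: The contribution is the first in-the-wild synchronized video-touch-pose dataset together with associated benchmark tasks, which provides new data and a research platform for fusing tactile sensing with vision.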

Abstract: The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.


[66] RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing cs.CV

Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu

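TL;DR: RePlan is a reasoning-guided region planning framework for complex instruction-based image editing. It follows a plan-then-execute strategy: a vision-language planner decomposes the instruction and grounds it to target regions, and a training-free attention-region injection mechanism then performs parallel multi-region edits, avoiding iterative inpainting.

Details

Motivation: Existing instruction-based image editing models handle Instruction-Visual Complexity (IV-Complexity) poorly: they tend to fail when intricate instructions meet cluttered or ambiguous scenes. RePlan aims to improve editing precision and fidelity through explicit region planning and reasoning.

Result: On the proposed IV-Edit benchmark (focused on fine-grained grounding and knowledge-intensive edits), RePlan consistently outperforms strong baselines trained on far larger datasets across IV-Complex settings, improving regional precision and overall fidelity.

Insight: The contributions include: 1) a plan-then-execute framework that decomposes instructions through step-by-step reasoning and explicitly grounds them to regions; 2) a training-free attention-region injection mechanism that supports parallel multi-region editing; 3) GRPO-based reinforcement learning that strengthens the planner's reasoning and format reliability using only 1K instruction-only examples.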

Abstract: Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io


[67] Pixel Seal: Adversarial-only training for invisible image and video watermarking cs.CV | cs.AI | cs.CR | cs.LG

Tomáš Souček, Pierre Fernandez, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu

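TL;DR: This paper proposes Pixel Seal, a new method for invisible image and video watermarking trained with adversarial-only objectives. It targets the difficulty existing methods have in balancing robustness against true imperceptibility, and sets a new state of the art on image and video watermarking.

Details

Motivation: Existing methods have three fundamental issues: reliance on proxy perceptual losses such as MSE and LPIPS, which fail to mimic human perception and lead to visible watermark artifacts; optimization instability caused by conflicting objectives, which requires exhaustive hyperparameter tuning; and reduced robustness and imperceptibility when scaling to high-resolution images and videos.

Result: Pixel Seal is evaluated for robustness and imperceptibility across different image types and a wide range of transformations. It shows clear improvements over state-of-the-art methods and adapts efficiently to video watermarking.

Insight: The contributions include: an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses; a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility; and closing the resolution gap through JND-based attenuation and training-time inference simulation, which eliminates upscaling artifacts and yields a scalable, practical solution.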

Abstract: Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.


[68] Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation cs.CV

Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch

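TL;DR: The paper proposes ReMeDI-SAM3, a training-free, memory-enhanced extension of SAM3 for surgical instrument segmentation in endoscopic videos, where occlusions, rapid motion, and related issues make the task hard. Through relevance-aware memory filtering, a piecewise interpolation scheme, and a feature-based re-identification module, it mitigates error accumulation and improves recovery after occlusions.

Details

Motivation: In surgical scenes, the existing SAM3 framework is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. The paper addresses these issues to achieve more robust surgical instrument segmentation.

Result: Zero-shot evaluation on EndoVis17 and EndoVis18 shows mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches.

Insight: The novelty is a training-free memory-enhanced framework that introduces an occlusion-aware memory, expands effective capacity via piecewise interpolation, and adds feature-based re-identification with temporal voting. All three are tailored to long-term occlusion and identity ambiguity in surgical video, improving segmentation robustness and accuracy.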

Abstract: Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
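The re-identification step with temporal voting can be sketched as follows. This is a hedged illustration, not the authors' module: the cosine-similarity matching, the `min_sim` threshold, and the prototype dictionary are all assumptions introduced here.

```python
from collections import Counter
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def vote_identity(frame_features, prototypes, min_sim=0.5):
    """Each frame votes for its most similar stored identity; majority wins.

    frame_features: per-frame feature vectors of the re-appearing object.
    prototypes: dict mapping identity name to a stored feature vector
                (hypothetical stand-in for pre-occlusion memory).
    Returns None when no frame is similar enough to any identity.
    """
    votes = Counter()
    for feat in frame_features:
        sims = {name: cosine(feat, proto) for name, proto in prototypes.items()}
        best = max(sims, key=sims.get)
        if sims[best] >= min_sim:
            votes[best] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

prototypes = {
    "grasper": np.array([1.0, 0.0, 0.0]),
    "scissors": np.array([0.0, 1.0, 0.0]),
}
# Four post-occlusion frames; the third one is a noisy outlier.
frames = [
    np.array([0.9, 0.1, 0.0]),
    np.array([0.8, 0.3, 0.0]),
    np.array([0.2, 0.9, 0.0]),
    np.array([1.0, 0.05, 0.0]),
]
identity = vote_identity(frames, prototypes)
print(identity)
```

Voting over several frames lets a single ambiguous frame be outvoted, which is the intuition behind using a temporal window rather than a per-frame decision.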


[69] M-PhyGs: Multi-Material Object Dynamics from Video cs.CV

Norika Wada, Kohei Yamashita, Ryo Kawahara, Ko Nishino

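TL;DR: This paper proposes M-PhyGs, a method for estimating from video the material composition and continuum mechanical parameters of complex multi-material natural objects, with flowers as the representative case. Using newly introduced cascaded 3D and 2D losses and temporal mini-batching, it jointly segments the object and recovers its physical parameters from a short video captured in a natural setting, while accounting for gravity.

Details

Motivation: Existing methods for estimating physical material parameters from visual data typically assume homogeneous single-material objects, pre-learned dynamics, or simple topologies. Real-world objects such as flowers are often complex in material composition and geometry and fall outside these assumptions, so a new method is needed for multi-material physical parameter estimation.

Result: Experiments on the newly introduced Phlowers dataset (videos of people interacting with flowers) demonstrate the accuracy and effectiveness of M-PhyGs and its components on the challenging task of multi-material physical parameter estimation.

Insight: The novelty is a framework that jointly performs material segmentation and parameter recovery for complex multi-material natural objects (typified by flowers), with cascaded 3D/2D losses and temporal mini-batching for efficiency. Viewed objectively, the main contribution is extending physical parameter estimation to heterogeneous, multi-material natural objects and building a matching evaluation dataset.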

Abstract: Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry, lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on the Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.


[70] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation cs.CV | cs.AI | cs.IR | cs.LG | cs.MM

Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen

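TL;DR: This paper proposes LinkedOut, a method that extracts world-knowledge representations from video large language models (VLLMs). It addresses the high latency, limited multi-video input support, and language-output bottleneck that arise when VLLMs are used directly for video recommendation. The method uses promptable queries to extract semantically grounded, knowledge-aware tokens from raw video frames and introduces a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction, enabling personalized, interpretable, and low-latency recommendation.

Details

Motivation: Deploying VLLMs for downstream video recommendation is challenging: decoding-based generation causes high latency, typical interfaces do not support multi-video inputs, and language outputs discard fine-grained visual details that matter for vision tasks. These limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge.

Result: LinkedOut achieves state-of-the-art results on standard benchmarks and is, according to the authors, the first VLLM-based video recommendation method that operates directly on raw video frames without handcrafted labels. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion.

Insight: The novelty is LinkedOut, a representation that extracts VLLM world knowledge directly from video. It obtains knowledge-aware tokens through promptable queries and optional auxiliary modalities, and uses a cross-layer knowledge fusion MoE to fuse VLLM features from different levels of abstraction. This bypasses the language bottleneck, supports fast inference and multi-video histories, and offers a practical path for downstream vision tasks such as recommendation to exploit VLLM priors and visual reasoning.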

Abstract: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
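The idea of selecting a level of abstraction across layers can be sketched with a small gating function. This is only an assumed form: the abstract does not specify the MoE design, so the linear gate, the top-k selection, and all names below are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_layers(layer_feats, gate_w, context, top_k=2):
    """Fuse per-layer features with a sparse, context-dependent gate.

    layer_feats: (L, D) one feature vector per VLLM layer (hypothetical).
    gate_w:      (L, C) gate weights, one row per layer (hypothetical).
    context:     (C,)   e.g. a user or query embedding (hypothetical).
    Only the top_k layers by gate logit receive nonzero weight.
    """
    logits = gate_w @ context
    keep = np.argsort(logits)[-top_k:]
    weights = np.zeros_like(logits)
    weights[keep] = softmax(logits[keep])
    return weights @ layer_feats, weights

rng = np.random.default_rng(0)
layer_feats = rng.normal(size=(4, 8))   # 4 layers, 8-dim features
gate_w = rng.normal(size=(4, 3))
context = rng.normal(size=3)
fused, weights = fuse_layers(layer_feats, gate_w, context)
print(weights.round(3))
```

Because the gate depends on the context vector, different users or queries can draw on shallower or deeper layers, which is one way to read "selecting the appropriate level of abstraction".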


[71] Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation cs.CV

Kaiwen Jiang, Xueting Li, Seonwook Park, Ravi Ramamoorthi, Shalini De Mello

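TL;DR: This paper proposes a method for instant, expressive Gaussian head avatars via 3D-aware expression distillation, aiming to combine the high animation quality of 2D diffusion models with the fast inference and consistency of 3D representations. A feed-forward encoder quickly converts a single in-the-wild image into an animatable, 3D-consistent representation. By decoupling the animation representation from the face's 3D representation and using a lightweight local fusion strategy, it runs fast while remaining highly expressive.

Details

Motivation: Existing 2D portrait animation methods offer high quality but lack 3D consistency and speed, whereas feed-forward 3D facial animation methods ensure consistency and speed at the cost of expression detail. The paper aims to combine both strengths by distilling knowledge from a 2D diffusion model into a feed-forward encoder that yields a fast, 3D-consistent, and expressive animation representation.

Result: The method runs at 107.31 FPS for animation and pose control with animation quality comparable to the state of the art, balancing speed and quality and surpassing alternative designs that trade one for the other.

Insight: The contributions include: decoupling the 3D representation from the animation representation and learning motion implicitly from data, which removes the dependency on pre-defined parametric models; and replacing computationally intensive global fusion with an efficient lightweight local fusion strategy, which improves expressivity and speed. This offers an efficient architectural design that real-time 3D avatar animation can draw on.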

Abstract: Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods – built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting – ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face’s 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is https://research.nvidia.com/labs/amri/projects/instant4d


[72] FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction cs.CV

Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing

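TL;DR: FlashPortrait is an end-to-end video diffusion transformer that synthesizes identity-preserving, infinite-length portrait animations and achieves up to 6x inference acceleration through adaptive latent prediction.

Details

Motivation: Existing diffusion-based acceleration methods for long portrait animation struggle to ensure identity consistency.

Result: On benchmarks, FlashPortrait proves effective both qualitatively and quantitatively, achieving up to 6x faster inference.

Insight: The contributions include a Normalized Facial Expression Block that aligns features to improve identity stability; a dynamic sliding window with weighted blending for smooth transitions in long animations; and, based on the latent variation rate and the derivative magnitude ratio among layers, the use of higher-order latent derivatives to directly predict latents at future timesteps, skipping several denoising steps for acceleration.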

Abstract: Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
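Predicting future latents from higher-order derivatives can be illustrated with a finite-difference extrapolation. This sketch swaps in a standard technique (Newton backward differences up to second order) and a simple relative-change test; it is an assumption about the general idea, not FlashPortrait's actual predictor, and the function names and `tol` value are hypothetical.

```python
import numpy as np

def extrapolate_latent(history, k):
    """Predict the latent k steps ahead from the last three latents.

    Uses Newton's backward-difference formula up to second order,
    assuming evenly spaced denoising steps:
        z(t + k) ~ z_t + k * d1 + k * (k + 1) / 2 * d2
    """
    z_old, z_prev, z_now = history[-3], history[-2], history[-1]
    d1 = z_now - z_prev                   # first backward difference
    d2 = z_now - 2.0 * z_prev + z_old     # second backward difference
    return z_now + k * d1 + 0.5 * k * (k + 1) * d2

def can_skip(history, tol=0.05):
    """Allow skipping only when the latent is changing slowly."""
    z_prev, z_now = history[-2], history[-1]
    rate = np.linalg.norm(z_now - z_prev) / (np.linalg.norm(z_now) + 1e-8)
    return rate < tol

# Toy latent trajectory that is exactly quadratic in the step index.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 4))

def traj(n):
    return a + b * n + c * n ** 2

history = [traj(0), traj(1), traj(2)]
pred = extrapolate_latent(history, k=2)   # jump from step 2 to step 4
print(np.abs(pred - traj(4)).max())
```

The second-order formula is exact for quadratic trajectories, so the printed error is at floating-point level here; real diffusion latents are not quadratic, which is why a check like `can_skip` gates when extrapolation is attempted.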


[73] VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization cs.CV

Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang

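TL;DR: This paper proposes VIVA, a framework for instruction-based video editing that uses VLM-guided encoding and reward optimization to address the limited generalization of existing methods.

Details

Motivation: Existing diffusion-based instruction video editing methods are usually trained only on paired data of simple editing operations and generalize poorly to diverse, complex real-world instructions, so generalization needs to improve.

Result: In extensive experiments, VIVA outperforms state-of-the-art methods in instruction following, generalization, and editing quality.

Insight: The contributions include a VLM-guided instruction encoder that provides fine-grained spatial and semantic context; an Edit-GRPO post-training stage that optimizes the model directly with relative rewards; and a data construction pipeline that synthesizes diverse, high-fidelity video-instruction pairs.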

Abstract: Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
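The "relative rewards" in Group Relative Policy Optimization can be sketched as follows: several candidate edits are sampled for the same input, and each reward is normalized against the group mean and standard deviation. The three reward components and their equal weighting are assumptions for illustration; the abstract names the goals (instruction-faithful, content-preserving, aesthetically pleasing) but not how Edit-GRPO scores or combines them.

```python
import numpy as np

def combined_reward(instruction, preservation, aesthetics, weights=(1.0, 1.0, 1.0)):
    """Weighted average of three per-candidate reward components (hypothetical)."""
    w = np.asarray(weights, dtype=float)
    parts = np.stack([instruction, preservation, aesthetics], axis=0).astype(float)
    return (w[:, None] * parts).sum(axis=0) / w.sum()

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four candidate edits sampled for one (video, instruction) pair.
instruction = [0.9, 0.4, 0.7, 0.2]
preservation = [0.8, 0.9, 0.6, 0.3]
aesthetics = [0.7, 0.5, 0.6, 0.4]

rewards = combined_reward(instruction, preservation, aesthetics)
advantages = group_relative_advantages(rewards)
print(rewards.round(3), advantages.round(3))
```

Because advantages are computed within the group, no separate value network is needed: a candidate is reinforced only to the extent that it beats its siblings.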


[74] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos cs.CV | cs.AI | cs.RO

Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen

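TL;DR: The paper presents the EgoMAN dataset and model, addressing the decoupling of semantics from motion and the weak link between reasoning and action in existing 3D hand trajectory prediction. EgoMAN is a large-scale egocentric dataset with 219K 6DoF trajectories and 3M structured QA pairs that support semantic, spatial, and motion reasoning. The EgoMAN model is a reasoning-to-motion framework that connects vision-language reasoning and motion generation through a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, it produces accurate, interaction-stage-aware trajectories and generalizes across real-world scenes.

Details

Motivation: Prior 3D hand trajectory prediction work is constrained by datasets that decouple motion from semantic supervision and by models that only weakly link reasoning and action.

Result: On the EgoMAN dataset, the proposed approach yields accurate, interaction-stage-aware trajectory predictions and generalizes across real-world scenes. The abstract gives no specific quantitative metrics or direct comparisons with other state-of-the-art models.

Insight: The novelty lies in building the first large-scale egocentric 3D hand trajectory dataset richly annotated for semantic reasoning (EgoMAN), and in an end-to-end framework that tightly couples vision-language reasoning with motion generation through a trajectory-token interface, progressively aligning high-level semantic reasoning with low-level motion trajectories.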

Abstract: Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.


[75] SceneDiff: A Benchmark and Method for Multiview Object Change Detection cs.CV

Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou

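TL;DR: This paper presents SceneDiff, a benchmark and method for multiview object change detection. The method detects objects that were added, removed, or moved by aligning captures in 3D, extracting object regions, and comparing spatial and semantic features. It requires no training and outperforms existing methods by large margins on multiple benchmarks.

Details

Motivation: The goal is to accurately identify object changes (added, removed, moved) between a pair of images or videos of the same scene captured at different times from different viewpoints, which matters for applications such as robotic tidying and construction progress monitoring.

Result: On the proposed SceneDiff benchmark (the first multiview change detection benchmark with object instance annotations) and on other benchmarks, the method achieves relative AP improvements of 94% and 37.4%, respectively, far ahead of existing methods.

Insight: The novelty is SceneDiff Benchmark, the first multiview change detection benchmark with instance annotations, together with SceneDiff, a training-free method that combines pretrained 3D, segmentation, and image encoding models for robust multiview change detection.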

Abstract: We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
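The final comparison step (spatial and semantic region features) can be sketched with a toy matcher. It assumes the captures are already aligned in a common 3D frame and that each object region comes with a centroid and a feature vector; the greedy matching rule, both thresholds, and the dictionary layout are hypothetical and not taken from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def detect_changes(objs_a, objs_b, sim_thresh=0.8, move_thresh=0.25):
    """Label objects as moved, removed, or added between two aligned captures.

    Each object is a dict with "name", "center" (3D point), and "feat".
    An object in A is matched to the most similar unmatched object in B
    (semantic check); a matched pair counts as moved if the centers differ
    by more than move_thresh (spatial check).
    """
    changes = []
    unmatched_b = set(range(len(objs_b)))
    for a in objs_a:
        best_j, best_sim = None, sim_thresh
        for j in unmatched_b:
            s = cosine(a["feat"], objs_b[j]["feat"])
            if s >= best_sim:
                best_j, best_sim = j, s
        if best_j is None:
            changes.append(("removed", a["name"]))
            continue
        unmatched_b.discard(best_j)
        if np.linalg.norm(a["center"] - objs_b[best_j]["center"]) > move_thresh:
            changes.append(("moved", a["name"]))
    changes += [("added", objs_b[j]["name"]) for j in sorted(unmatched_b)]
    return changes

def obj(name, center, feat):
    return {"name": name, "center": np.array(center, float), "feat": np.array(feat, float)}

before = [
    obj("mug", [0.0, 0.0, 0.0], [1, 0, 0]),
    obj("book", [1.0, 0.0, 0.0], [0, 1, 0]),
    obj("lamp", [2.0, 0.0, 0.0], [0, 0, 1]),
]
after = [
    obj("mug", [0.02, 0.0, 0.0], [1, 0, 0]),
    obj("book", [1.8, 0.5, 0.0], [0, 1, 0]),
    obj("plant", [3.0, 0.0, 0.0], [0.5, 0.5, 0.5]),
]
changes = detect_changes(before, after)
print(changes)
```

Comparing in the aligned 3D frame is what keeps a viewpoint shift (the mug's 2 cm offset here) from being reported as a change.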


[76] MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning cs.CV | cs.RO

Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju

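TL;DR: This paper proposes MomaGraph, a unified scene graph representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. To fill the gaps in data and evaluation, the authors contribute the MomaGraph-Scenes dataset and the MomaGraph-Bench evaluation suite. On this basis they develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning that predicts task-oriented scene graphs and serves as a zero-shot task planner. Experiments show the model reaches state-of-the-art results among open-source models and performs well on benchmarks and in real-robot experiments.

Details

Motivation: Existing scene graph representations are limited for embodied task planning: spatial and functional relations are handled separately, object states and temporal updates are missing, and information most relevant to the current task is overlooked.

Result: The model reaches 71.6% accuracy on MomaGraph-Bench, 11.4% above the best baseline and state of the art among open-source models, while generalizing to public benchmarks and transferring effectively to real-robot experiments.

Insight: The novelty is MomaGraph, a unified state-aware scene graph representation that integrates spatial, functional, and part-level interaction information, accompanied by a large-scale task-driven scene graph dataset and a systematic evaluation benchmark. MomaGraph-R1 follows a Graph-then-Plan framework that combines scene graph prediction with task planning, enabling zero-shot planning.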

Abstract: Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
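A state-aware scene graph of the kind described above might be represented as follows. This is a minimal data-structure sketch under stated assumptions: the class names, fields, and the `task_subgraph` filter are hypothetical and do not reflect the MomaGraph schema, which the abstract does not specify.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    state: str = "unknown"                      # e.g. "on", "off", "open"
    parts: list = field(default_factory=list)   # actionable parts, e.g. "handle"

@dataclass
class Edge:
    src: str
    dst: str
    relation: str   # e.g. "on", "powers"
    kind: str       # "spatial" or "functional"

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, edge):
        self.edges.append(edge)

    def update_state(self, name, state):
        """Temporal update: record a new object state."""
        self.nodes[name].state = state

    def task_subgraph(self, targets):
        """Keep only the target objects and whatever is directly related to them."""
        targets = set(targets)
        edges = [e for e in self.edges if e.src in targets or e.dst in targets]
        names = targets | {e.src for e in edges} | {e.dst for e in edges}
        sub = SceneGraph()
        sub.nodes = {n: self.nodes[n] for n in names if n in self.nodes}
        sub.edges = edges
        return sub

g = SceneGraph()
for n in [Node("kettle", "off", ["handle", "power button"]), Node("counter"),
          Node("socket"), Node("sofa"), Node("tv", "off")]:
    g.add_node(n)
g.add_edge(Edge("kettle", "counter", "on", "spatial"))
g.add_edge(Edge("socket", "kettle", "powers", "functional"))
g.add_edge(Edge("sofa", "tv", "faces", "spatial"))

g.update_state("kettle", "on")
sub = g.task_subgraph({"kettle"})
print(sorted(sub.nodes), len(sub.edges))
```

Keeping spatial and functional edges in one graph, with states on the nodes, is what lets a planner ask a single structure both "where is it" and "what does it do right now".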


[77] SFTok: Bridging the Performance Gap in Discrete Tokenizers cs.CV | cs.LG

Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

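TL;DR: This paper proposes SFTok, a discrete image tokenizer that addresses the gap in reconstruction quality between discrete and continuous tokenizers. By introducing a multi-step iterative mechanism, self-forcing guided visual reconstruction, and a debias-and-fitting training strategy, SFTok substantially improves reconstruction accuracy at high compression rates and reaches state-of-the-art performance on ImageNet.

Details

Motivation: In current multimodal models, discrete tokenizers are promising because they fit the autoregressive paradigm naturally, but their reconstruction quality still lags behind continuous tokenizers, which limits adoption in multimodal systems. The paper aims to close this performance gap.

Result: At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and performs strongly on class-to-image generation (gFID = 2.29).

Insight: The core novelty is combining a multi-step iterative mechanism with self-forcing guided reconstruction and a debias-and-fitting training strategy, which resolves the training-inference inconsistency of the multi-step process and markedly improves discrete tokenizer reconstruction. The design suggests a new direction for improving discrete representation learning.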

Abstract: Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).


[78] StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors cs.CV

Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang

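TL;DR: This paper proposes StereoPilot, a model for efficient monocular-to-stereo video conversion. It introduces UniStereo, a unified large-scale stereo video dataset, and a feed-forward model that synthesizes the target view directly without explicit depth maps or iterative diffusion sampling, addressing the error propagation, depth ambiguity, and format inconsistency of traditional multi-stage pipelines.

Details

Motivation: The rapid growth of stereoscopic displays such as VR headsets and 3D cinemas has increased demand for high-quality stereo video content. Automatic monocular-to-stereo conversion, however, is held back by the traditional multi-stage "Depth-Warp-Inpaint" pipeline, which suffers from error propagation, depth ambiguity, and inconsistency between parallel and converged stereo formats.

Result: Extensive experiments show that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency.

Insight: The main contributions are: 1) UniStereo, the first large-scale unified dataset covering both stereo formats, for fair benchmarking and robust model training; 2) StereoPilot, an efficient feed-forward model that bypasses explicit depth estimation and iterative sampling and synthesizes the target view directly; 3) a learnable domain switcher and a cycle consistency loss that let the model adapt seamlessly to different stereo formats and improve consistency.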

Abstract: The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.


[79] AdaTooler-V: Adaptive Tool-Use for Images and Videos cs.CV

Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li

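TL;DR: The paper proposes AdaTooler-V, a multimodal large language model that uses tools adaptively by judging whether a visual problem actually requires them, reducing the overhead and performance loss of blind tool calls. It introduces AT-GRPO, a reinforcement learning algorithm that adjusts rewards according to a Tool Benefit Score, and builds two training datasets (AdaTooler-V-CoT-100k and AdaTooler-V-300k) covering single-image, multi-image, and video data. The model shows strong visual reasoning across twelve benchmarks.

Details

Motivation: Existing open-source multimodal LLMs exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which increases inference overhead and degrades performance. The paper addresses this problem.

Result: AdaTooler-V outperforms existing methods on multiple visual reasoning benchmarks. In particular, AdaTooler-V-7B reaches 89.8% accuracy on the V* benchmark, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro and reaching state-of-the-art level.

Insight: The contributions include the adaptive tool-use mechanism, the AT-GRPO reinforcement learning algorithm based on the Tool Benefit Score, and the diverse training datasets. Viewed objectively, this offers a more efficient and accurate approach to multimodal visual reasoning by cutting unnecessary tool calls.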

Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.


[80] EasyV2V: A High-quality Instruction-based Video Editing Framework cs.CV | cs.AI

Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov

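TL;DR: EasyV2V is a high-quality instruction-based video editing framework that systematically designs data, architecture, and control mechanisms to address the consistency, control, and generalization challenges of video editing.

Details

Motivation: Image editing has advanced rapidly, but video editing still faces challenges in consistency, control, and generalization. The study explores the design space of data, architecture, and control to address them.

Result: EasyV2V achieves state-of-the-art video editing results, surpassing concurrent and commercial systems, and supports flexible inputs such as video+text, video+mask+text, and video+mask+reference image+text.

Insight: The contributions include: 1) composing existing expert models with fast inverses to build diverse video pairs; 2) lifting image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion; 3) using simple sequence concatenation and lightweight LoRA fine-tuning to exploit the editing capability of pretrained text-to-video models; 4) unifying spatiotemporal control through a single mask mechanism with support for optional reference images.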

Abstract: While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/


[81] Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification cs.CV | cs.AI

Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu

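TL;DR: This paper proposes AuditDM, a framework that trains a multimodal LLM as an auditor via reinforcement learning to automatically generate challenging questions and counterfactual images that maximize disagreement among target models, uncovering capability gaps and using the findings to rectify the models.

Details

Motivation: Conventional evaluation methods for multimodal LLMs lack interpretability and are insufficient to fully reveal significant capability gaps across models.

Result: On SoTA models such as Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks and enables a 3B model to surpass its 28B counterpart.

Insight: The novelty is an active, interpretable model auditing framework that diagnoses and repairs model weaknesses by generating disagreement-maximizing examples. Viewed objectively, it offers an effective path to model diagnosis and improvement through targeted auditing once data scaling hits diminishing returns.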

Abstract: Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
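The notion of rewarding an auditor for disagreement among target models can be sketched with a simple pairwise measure. The abstract does not define the reward, so the pairwise-mismatch fraction, the string normalization, and the helper names below are all assumptions made for illustration.

```python
from itertools import combinations

def disagreement(answers):
    """Fraction of model pairs whose normalized answers differ."""
    norm = [a.strip().lower() for a in answers]
    pairs = list(combinations(norm, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def rank_questions(candidates):
    """Sort candidate questions by how strongly the target models disagree.

    candidates: dict mapping a question to the list of target-model answers
    (hypothetical data; in practice these would come from querying the models).
    """
    return sorted(candidates, key=lambda q: disagreement(candidates[q]), reverse=True)

candidates = {
    "How many chairs are visible?": ["3", "3", "3"],
    "What is left of the red cup?": ["Cat", "cat ", "dog"],
    "Which object casts the longest shadow?": ["lamp", "tree", "pole"],
}
ranked = rank_questions(candidates)
score = disagreement(candidates["What is left of the red cup?"])
print(ranked[0], round(score, 3))
```

Questions on which all models agree carry little diagnostic signal, so a reward of this shape pushes the auditor toward inputs that separate the models.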


[82] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text cs.CV

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng

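TL;DR: The paper proposes WorldCanvas, a framework for generating videos of user-promptable world events. By combining text, trajectories, and reference images, it enables rich, user-directed simulation and produces coherent, controllable videos featuring multi-agent interactions, object entry and exit, reference-guided appearance, and counterintuitive events.

Details

Motivation: Existing approaches, such as text-only methods and trajectory-controlled image-to-video methods, are limited in generating rich, controllable world events. The paper takes a multimodal approach that combines trajectories (encoding motion, timing, and visibility), natural language (semantic intent), and reference images (visual grounding of object identity) to generate more coherent, controllable events, advancing world models from passive predictors to interactive, user-shaped simulators.

Result: The generated videos show not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. The paper claims WorldCanvas supports expressive world event generation, but the abstract reports no specific benchmarks or quantitative comparisons.

Insight: The main novelty is a multimodal prompting framework that combines trajectories, text, and reference images for fine-grained control over complex world events such as multi-agent interactions and object entry and exit. Viewed objectively, combining trajectory information (motion, timing, visibility) with semantic and visual grounding offers a new approach to controllable video generation and underscores the shift from passive prediction to interactive simulation.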

Abstract: We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories – encoding motion, timing, and visibility – with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.


cs.CY [Back]

[83] Explainable Ethical Assessment on Human Behaviors by Generating Conflicting Social Norms cs.CY | cs.AI | cs.CL

Yuxi Sun, Wei Gao, Hongzhan Lin, Jing Ma, Wenxuan Zhang

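TL;DR: This paper proposes ClarityEthic, a novel ethical assessment approach that generates the conflicting social norms behind human actions to improve an AI model's prediction and explanation of valence (support or oppose). It uses a contrastive learning strategy to strengthen the moral reasoning of language models.

Details

Motivation: Current AI systems that assess the valence of human actions are trained on large-scale data without explicit grounding in social norms, which makes them hard to explain and untrustworthy. Emulating human assessors by considering social norms, especially conflicting ones, can help AI models better understand and predict valence.

Result: Extensive experiments show that the method outperforms strong baselines. Human evaluation confirms that the generated social norms provide plausible explanations for the assessment of human behaviors.

Insight: The novelty lies in explicitly generating conflicting social norms as the basis for explanation and in using contrastive learning to strengthen moral reasoning. This suggests a new route to explainable, trustworthy AI ethical assessment: explicitly modeling norm conflicts to mimic the moral trade-offs humans make.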

Abstract: Human behaviors are often guided or constrained by social norms, which are defined as shared, commonsense rules. For example, underlying an action "report a witnessed crime" are social norms that inform our conduct, such as "It is expected to be brave to report crimes". Current AI systems that assess valence (i.e., support or oppose) of human actions by leveraging large-scale data training not grounded on explicit norms may be difficult to explain, and thus untrustworthy. Emulating human assessors by considering social norms can help AI models better understand and predict valence. While multiple norms come into play, conflicting norms can create tension and directly influence human behavior. For example, when deciding whether to "report a witnessed crime", one may balance bravery against self-protection. In this paper, we introduce ClarityEthic, a novel ethical assessment approach, to enhance valence prediction and explanation by generating conflicting social norms behind human actions, which strengthens the moral reasoning capabilities of language models by using a contrastive learning strategy. Extensive experiments demonstrate that our method outperforms strong baseline approaches, and human evaluations confirm that the generated social norms provide plausible explanations for the assessment of human behaviors.


cs.AI [Back]

[84] Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild cs.AI | cs.CL

Yumeng Wang, Tianyu Fan, Lingrui Xu, Chao Huang

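TL;DR: This paper introduces Needle in the Web, a benchmark for evaluating how well modern search agents and LLM-based systems retrieve and reason over web pages in response to vague, exploratory queries in real-world web settings. It contains 663 questions across seven domains and generates queries with a controllable-difficulty methodology. Experiments show that current leading models and systems perform poorly, with many below 35% accuracy, exposing major challenges for fuzzy retrieval under semantic ambiguity.

Details

Motivation: Existing benchmarks such as BrowseComp and xBench-DeepSearch focus on complex reasoning searches that require multi-hop synthesis, but neglect fuzzy exploratory search, where queries are vague and multifaceted and the goal is the most relevant web page rather than a single factual answer. The paper fills this gap with a benchmark dedicated to evaluating how systems handle such queries.

Result: Three leading LLMs and three agent-based search systems were evaluated on Needle in the Web. Most models struggle, many score below 35% accuracy, and none consistently excels across all domains or difficulty levels, indicating that the benchmark poses a significant challenge for current search systems.

Insight: The paper contributes a benchmark dedicated to fuzzy exploratory web retrieval, described as the first of its kind, and a flexible method that reliably generates queries of controllable difficulty from factual claims in web content. Viewed objectively, the benchmark exposes the weaknesses of existing systems on semantically ambiguous retrieval tasks and points the way toward more robust web search agents.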

Abstract: Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.


[85] Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning cs.AI | cs.CL | cs.LG

Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

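TL;DR: This paper proposes Generative Adversarial Reasoner, an on-policy joint training framework that co-evolves an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning to improve LLM reasoning. The method partitions reasoning chains into logically complete slices, has the discriminator judge the soundness of each slice, and produces dense, well-calibrated step-level rewards that improve credit assignment, sample efficiency, and overall reasoning quality.

Details

Motivation: LLMs with explicit reasoning capabilities excel at mathematical reasoning but still commit process errors such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. The paper aims to reduce these errors with an adversarial reinforcement learning framework, improving the robustness and accuracy of reasoning.

Result: Across multiple mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. On AIME24, it improves DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0).

Insight: The novelty is a modular adversarial training framework in which the reasoner and discriminator co-evolve, with a compute-efficient review schedule producing dense step-level reward signals. This improves credit assignment in reinforcement learning and raises sample efficiency. The modular discriminator also allows flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.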

Abstract: Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice’s soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
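
The slicing and reward-coupling idea can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the abstract does not say how slice boundaries are chosen or how step-level and exact-match signals are weighted, so `slice_chain`, `slice_rewards`, `max_chars`, and `step_weight` are all hypothetical.

```python
def slice_chain(steps, max_chars=200):
    """Greedily group reasoning steps into slices of comparable length.

    Assumption: every element of `steps` is already a logically complete
    step, so a slice never cuts a step in half.
    """
    slices, current, size = [], [], 0
    for step in steps:
        # Close the current slice once adding this step would exceed the budget.
        if current and size + len(step) > max_chars:
            slices.append(current)
            current, size = [], 0
        current.append(step)
        size += len(step)
    if current:
        slices.append(current)
    return slices


def slice_rewards(verdicts, final_correct, step_weight=0.5):
    """Blend per-slice discriminator verdicts (True = judged sound) with the
    sparse exact-match signal. The weighting is illustrative only."""
    outcome = 1.0 if final_correct else 0.0
    return [
        step_weight * (1.0 if ok else -1.0) + (1.0 - step_weight) * outcome
        for ok in verdicts
    ]
```

In this toy form, every slice receives its own reward, so a flawed slice is penalized even when the final answer happens to be correct, which is the credit-assignment benefit the abstract describes.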


q-bio.QM [Back]

[86] Foundation Models in Biomedical Imaging: Turning Hype into Reality q-bio.QM | cs.AI | cs.CV | cs.LG

Amgad Muneer, Kai Zhang, Ibraheem Hamdi, Rizwan Qureshi, Muhammad Waqas

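TL;DR: This paper critically assesses the state of foundation models in biomedical imaging, analyzes their core capabilities and limitations, and discusses the need to move from statistical correlation toward causal inference and to address deployment challenges such as trustworthiness and bias. It argues that the way forward is hybrid, causally aware, and verifiably safe systems that augment human expertise.

Details

Motivation: To address the critical gap between the potential of foundation models in biomedical imaging and the reality of their clinical evaluation and deployment, to critically examine the hype, and to push toward more robust, interpretable, and ethical applications.

Result: The paper reports no specific quantitative experimental results. Instead, it critically assesses the current state of the art and proposes a taxonomy for evaluating models' reasoning capabilities.

Insight: The contribution is to identify causal inference, beyond statistical correlation, as a key frontier, and to stress the importance of building hybrid, causally aware, verifiably safe systems that augment human expertise rather than simply pursuing model scale.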

Abstract: Foundation models (FMs) are driving a prominent shift in artificial intelligence across different domains, including biomedical imaging. These models are designed to move beyond narrow pattern recognition towards emulating sophisticated clinical reasoning, understanding complex spatial relationships, and integrating multimodal data with unprecedented flexibility. However, a critical gap exists between this potential and the current reality, where the clinical evaluation and deployment of FMs are hampered by significant challenges. Herein, we critically assess the current state-of-the-art, analyzing hype by examining the core capabilities and limitations of FMs in the biomedical domain. We also provide a taxonomy of reasoning, ranging from emulated sequential logic and spatial understanding to the integration of explicit symbolic knowledge, to evaluate whether these models exhibit genuine cognition or merely mimic surface-level patterns. We argue that a critical frontier lies beyond statistical correlation, in the pursuit of causal inference, which is essential for building robust models that understand cause and effect. Furthermore, we discuss the paramount issues in deployment stemming from trustworthiness, bias, and safety, dissecting the challenges of algorithmic bias, data bias and privacy, and model hallucinations. We also draw attention to the need for more inclusive, rigorous, and clinically relevant validation frameworks to ensure their safe and ethical application. We conclude that while the vision of autonomous AI-doctors remains distant, the immediate reality is the emergence of powerful technology and assistive tools that would benefit clinical practice. The future of FMs in biomedical imaging hinges not on scale alone, but on developing hybrid, causally aware, and verifiably safe systems that augment, rather than replace, human expertise.


cond-mat.mtrl-sci [Back]

[87] Machine Learning Enabled Graph Analysis of Particulate Composites: Application to Solid-state Battery Cathodes cond-mat.mtrl-sci | cs.CV

Zebin Li, Shimao Deng, Yijin Liu, Jia-Mian Hu

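TL;DR: This paper proposes a machine-learning-enabled graph analysis framework that automatically transforms experimental multimodal X-ray images of multiphase particulate composites into scalable, topology-aware graphs, in order to extract physical insights and establish local microstructure-property relationships at both the particle and network level. Using the multiphase particulate cathode of solid-state lithium batteries as an example, the framework corroborates the critical role of triple phase junctions and concurrent ion/electron conduction channels in achieving desirable local electrochemical activity.

Details

Motivation: To address the challenge of harnessing large-scale, high-throughput multimodal X-ray microscopy datasets to discover new physical insights and guide microstructure optimization of particulate composites.

Result: With the multiphase particulate cathode of solid-state lithium batteries as the application case, the ML-enabled graph analysis qualitatively corroborates the critical role of triple phase junctions and concurrent ion/electron conduction channels in local electrochemical activity.

Insight: The contribution is establishing graph-based microstructure representation as a paradigm that bridges multimodal experimental imaging and functional understanding and facilitates microstructure-aware, data-driven materials design. Viewed objectively, converting complex multimodal image data into analyzable graphs offers a novel, scalable computational approach to studying microstructure-property relationships in materials science.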

Abstract: Particulate composites underpin many solid-state chemical and electrochemical systems, where microstructural features such as multiphase boundaries and inter-particle connections strongly influence system performance. Advances in X-ray microscopy enable capturing large-scale, multimodal images of these complex microstructures with an unprecedentedly high throughput. However, harnessing these datasets to discover new physical insights and guide microstructure optimization remains a major challenge. Here, we develop a machine learning (ML) enabled framework that enables automated transformation of experimental multimodal X-ray images of multiphase particulate composites into scalable, topology-aware graphs for extracting physical insights and establishing local microstructure-property relationships at both the particle and network level. Using the multiphase particulate cathode of solid-state lithium batteries as an example, our ML-enabled graph analysis corroborates the critical role of triple phase junctions and concurrent ion/electron conduction channels in realizing desirable local electrochemical activity. Our work establishes graph-based microstructure representation as a powerful paradigm for bridging multimodal experimental imaging and functional understanding, and facilitating microstructure-aware data-driven materials design in a broad range of particulate composites.


cs.MM [Back]

[88] A Tri-Dynamic Preprocessing Framework for UGC Video Compression cs.MM | cs.CV

Fei Zhao, Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang

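TL;DR: The paper proposes a Tri-Dynamic Preprocessing framework for UGC video compression that adaptively adjusts preprocessing intensity, quantization level, and the rate-distortion trade-off to optimize encoding performance and cope with the diversity of UGC videos.

Details

Motivation: The diversity and variability of UGC videos challenge data-driven encoding optimization algorithms, so new methods adapted to UGC scenarios are needed to improve encoding effectiveness.

Result: According to experiments on large-scale test sets, the method achieves exceptional performance; the abstract gives no specific figures.

Insight: The novelty lies in three adaptive mechanisms (preprocessing intensity, quantization level, and lambda trade-off), which improve the framework's adaptability to UGC videos and its coding efficiency.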

Abstract: In recent years, user generated content (UGC) has become the dominant force in internet traffic. However, UGC videos exhibit a higher degree of variability and diverse characteristics compared to traditional encoding test videos. This variance challenges the effectiveness of data-driven machine learning algorithms for optimizing encoding in the broader context of UGC scenarios. To address this issue, we propose a Tri-Dynamic Preprocessing framework for UGC. Firstly, we employ an adaptive factor to regulate preprocessing intensity. Secondly, an adaptive quantization level is employed to fine-tune the codec simulator. Thirdly, we utilize an adaptive lambda tradeoff to adjust the rate-distortion loss function. Experimental results on large-scale test sets demonstrate that our method attains exceptional performance.
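
For readers unfamiliar with the "lambda tradeoff", the rate-distortion loss is commonly written in the Lagrangian form below. This is the textbook form, offered as an assumption about what the abstract refers to; the symbols $D$, $R$, and $\lambda(x)$ are illustrative, and the paper's exact loss may differ.

```latex
\mathcal{L}_{\mathrm{RD}}(x) = D\bigl(x, \hat{x}\bigr) + \lambda(x)\, R\bigl(\hat{x}\bigr)
```

Here $D$ measures distortion between the source $x$ and its reconstruction $\hat{x}$, $R$ is the bitrate, and writing $\lambda$ as a function of the input is one way to read "adaptive": a larger $\lambda$ favors lower bitrate, a smaller one favors fidelity.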


cs.LG [Back]

[89] D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models cs.LG | cs.CL | cs.CV | cs.CY

Javon Hickmon

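TL;DR: This paper proposes D3G (Diverse Demographic Data Generation), a training-free, zero-shot method for improving the zero-shot image classification accuracy of pre-trained multimodal models such as CLIP while reducing demographic bias. It uses a generative model such as Stable Diffusion XL to generate images with diverse demographic attributes at inference time to boost classification performance.

Details

Motivation: Although multimodal models such as CLIP perform well on image classification, low-capacity models can underfit and perform poorly on fine-grained classification. In addition, datasets without balanced demographics introduce harmful bias, skewing predictions toward over-represented classes. The paper addresses demographic bias in zero-shot image classification and explores how to improve accuracy and fairness without retraining.

Result: Experiments show that providing diverse demographic data at inference time improves the classification performance of models such as CLIP; the paper also analyzes the effect of individual demographic attributes on accuracy. The method improves accuracy and reduces bias in the zero-shot setting, but the abstract does not state that it reaches state of the art on any specific benchmark.

Insight: The novelty is a training-free, generative-model-based zero-shot data augmentation method that dynamically generates images with diverse demographic attributes to mitigate data bias, improving both the fairness and the accuracy of multimodal classifiers. It offers a flexible, efficient way to reduce model bias.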

Abstract: Image classification is a task essential for machine perception to achieve human-level image understanding. Multimodal models such as CLIP have been able to perform well on this task by learning semantic similarities across vision and language; however, despite these advances, image classification is still a challenging task. Models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Along with this, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. When datasets do not enforce balanced demographics, the predictions will be biased toward the more represented class, while others will be neglected. We focus on how these issues can lead to harmful bias for zero-shot image classification, and explore how to combat these issues in demographic bias. We propose Diverse Demographic Data Generation (D3G), a training-free, zero-shot method of boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. With this method, we utilize CLIP as our base multimodal model and Stable Diffusion XL as our generative model. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and explore the impact of individual demographics on the resulting accuracy metric.


[90] DSO: Direct Steering Optimization for Bias Mitigation cs.LG | cs.CL | cs.CY

Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina Donaldson, Luca Zappella

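TL;DR: This paper proposes Direct Steering Optimization (DSO), a method for controllable inference-time bias mitigation in generative models such as vision-language models and large language models. It uses reinforcement learning to find linear transformations of activations that reduce bias while preserving model performance, achieving a state-of-the-art trade-off between fairness and capabilities.

Details

Motivation: Decisions of generative models are often influenced by the perceived demographic attributes of people in the input, leading to biased outcomes such as failing to identify women as doctors. Existing steering methods struggle to correct biases that require equiprobable outcomes across demographic groups, and users need a controllable trade-off between bias mitigation and overall capability, which calls for methods that enable controllable bias reduction at inference time.

Result: On both VLMs and LLMs, DSO achieves a state-of-the-art trade-off between fairness and capabilities and gives practitioners inference-time control over that trade-off.

Insight: The paper's claimed contribution is that steering strategies optimized directly to control model behavior are more effective than methods that rely on pre-defined heuristics for controllability. Viewed objectively, the core novelty is applying reinforcement learning to find linear transformations in activation space that directly optimize the bias-mitigation objective, yielding a more flexible and effective inference-time intervention.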

Abstract: Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.


[91] Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models cs.LG | cs.CL

Caner Erden

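TL;DR: This paper proposes Dynamic Rank Reinforcement Learning (DR-RL), a framework that adaptively optimizes the low-rank factorization of multi-head self-attention (MHSA) in large language models (LLMs) by combining reinforcement learning with online matrix perturbation theory. It selects ranks dynamically based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints, and significantly reduces floating point operations (FLOPs), particularly for long sequences, while keeping downstream accuracy statistically equivalent to full-rank attention.

Details

Motivation: Traditional low-rank approximations rely on static rank assumptions, which limits their flexibility across input contexts and compute budgets. The paper aims to balance theoretical rigor and adaptive efficiency in MHSA.

Result: Experiments show that the method keeps downstream accuracy statistically equivalent to full-rank attention while significantly reducing FLOPs, especially in long-sequence regimes (L > 4096).

Insight: The core novelty is formulating rank selection as a sequential policy optimization problem for reinforcement learning and using online matrix perturbation theory for incremental rank updates, avoiding the cost of full decomposition at inference time. A lightweight Transformer-based policy network and batched SVD operations support scalable deployment on modern GPUs, giving a theoretically grounded adaptive low-rank scheme for resource-constrained deep learning.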

Abstract: We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs) through the integration of reinforcement learning and online matrix perturbation theory. While traditional low-rank approximations often rely on static rank assumptions–limiting their flexibility across diverse input contexts–our method dynamically selects ranks based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation lies in an RL agent that formulates rank selection as a sequential policy optimization problem, where the reward function strictly balances attention fidelity against computational latency. Crucially, we employ online matrix perturbation bounds to enable incremental rank updates, thereby avoiding the prohibitive cost of full decomposition during inference. Furthermore, the integration of a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensures scalable deployment on modern GPU architectures. Experiments demonstrate that DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs), particularly in long-sequence regimes (L > 4096). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to heuristic rank reduction techniques in resource-constrained deep learning. Source code and experiment logs are available at: https://github.com/canererden/DR_RL_Project
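
To make the rank-versus-cost trade-off concrete, the sketch below shows a static, heuristic rank rule of the kind the abstract contrasts DR-RL with, plus a rough FLOP count for a matrix-vector product. It is not DR-RL's learned policy; `rank_by_energy`, `dense_flops`, `lowrank_flops`, and the `keep` threshold are hypothetical names chosen for illustration.

```python
def rank_by_energy(singular_values, keep=0.95):
    """Smallest rank r whose top-r singular values retain at least `keep`
    of the total squared-singular-value energy (a common static heuristic)."""
    energy = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energy)
    if total == 0:
        return 0
    cumulative = 0.0
    for r, e in enumerate(energy, start=1):
        cumulative += e
        if cumulative / total >= keep:
            return r
    return len(energy)  # guards against floating-point shortfall


def dense_flops(m, n):
    """Approximate FLOPs for an (m x n) matrix times an n-vector."""
    return 2 * m * n


def lowrank_flops(m, n, r):
    """Approximate FLOPs when the matrix is stored as rank-r factors
    (m x r) and (r x n): two thinner products instead of one dense one."""
    return 2 * r * (m + n)
```

For a 4096 x 4096 matrix at rank 64, `lowrank_flops` is 32 times smaller than `dense_flops`. A fixed `keep` threshold ignores input context and hardware limits, which is the gap an adaptive rank policy is meant to close.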


[92] DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI cs.LG | cs.CL

Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao

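TL;DR: This paper presents DataFlow, a unified LLM-driven framework for data preparation and workflow automation. It uses system-level abstractions to enable modular, reusable, and composable data transformations, provides a PyTorch-style pipeline construction API, and includes nearly 200 reusable operators and six domain-general pipelines. It also introduces DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines. Across six representative use cases, DataFlow consistently improves downstream LLM performance.

Details

Motivation: Demand for high-quality data for LLMs is growing rapidly, yet data preparation is still dominated by ad-hoc scripts and loosely specified workflows that lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. The paper aims to address these challenges.

Result: Gains on multiple benchmarks: up to +3% execution accuracy over SynSQL on Text-to-SQL; +7% average improvement on code benchmarks; and 1–3 point gains on MATH, GSM8K, and AIME. In addition, a unified 10K-sample dataset produced by DataFlow lets base models surpass counterparts trained on 1M Infinity-Instruct data.

Insight: Main contributions: (1) a unified, system-level, LLM-driven data preparation framework with modular, composable abstractions; (2) DataFlow-Agent, which automates translation from natural language to executable pipelines via operator synthesis, pipeline planning, and iterative verification; (3) a library of nearly 200 reusable operators and prebuilt pipelines covering key domains such as text, mathematical reasoning, and code, laying a system-level foundation for data-centric AI development.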

Abstract: The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1–3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
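
The phrase "PyTorch-style pipeline construction" suggests operators composed the way layers are composed into a module. The sketch below illustrates that general pattern only. It is not DataFlow's actual API: `Operator`, `DropEmpty`, `Deduplicate`, and `Pipeline` are hypothetical classes invented for this example.

```python
class Operator:
    """Hypothetical base class: a reusable transformation over records (dicts)."""

    def run(self, records):
        raise NotImplementedError


class DropEmpty(Operator):
    """Remove records whose `field` is missing or blank."""

    def __init__(self, field):
        self.field = field

    def run(self, records):
        return [r for r in records if (r.get(self.field) or "").strip()]


class Deduplicate(Operator):
    """Keep the first record for each distinct value of `field`."""

    def __init__(self, field):
        self.field = field

    def run(self, records):
        seen, kept = set(), []
        for r in records:
            key = r[self.field]
            if key not in seen:
                seen.add(key)
                kept.append(r)
        return kept


class Pipeline:
    """Chain operators in order, loosely analogous to a sequential module."""

    def __init__(self, *operators):
        self.operators = operators

    def forward(self, records):
        for op in self.operators:
            records = op.run(records)
        return records
```

Usage: `Pipeline(DropEmpty("text"), Deduplicate("text")).forward(rows)` filters blank rows and then removes duplicates. Because each operator is a plain object, intermediate outputs can be inspected step by step, which is one reading of the "debuggable" dataflows the abstract mentions.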


[93] Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward cs.LG | cs.AI | cs.CL

Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen

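TL;DR: This paper studies the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of large language models (LLMs). Two seemingly contradictory mechanisms, spurious rewards unrelated to the ground truth and entropy minimization, have both been reported to improve mathematical reasoning. By analyzing how policy entropy relates to performance and how spurious rewards yield gains through the interplay of clipping bias and model contamination, the paper explains the underlying principles and proposes a reward-misalignment model to account for the effectiveness of spurious rewards in uncontaminated settings.

Details

Motivation: Recent studies suggest that RLVR can improve LLMs' mathematical reasoning through two seemingly paradoxical mechanisms, spurious rewards (which suppress exploitation) and entropy minimization (which suppresses exploration), but the principles that reconcile them remain unclear. The paper investigates how policy entropy affects performance and whether spurious rewards yield gains through the interplay of clipping bias and model contamination.

Result: Under spurious rewards, clipping bias reduces policy entropy, producing more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. The proposed reward-misalignment model explains why spurious rewards can improve performance even in uncontaminated settings.

Insight: The paper clarifies the mechanisms behind spurious-reward gains, shows the key role of clipping bias in reducing policy entropy, and proposes a reward-misalignment model to explain the effectiveness of spurious rewards, providing principled guidance for more effective RLVR training.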

Abstract: This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
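
The two quantities at the center of this abstract can be sketched in a few lines. The abstract does not define its clipping scheme, so this example assumes PPO-style probability-ratio clipping, the form commonly used in RLVR training; `clipped_surrogate`, `policy_entropy`, and `eps` are illustrative names, not the paper's code.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one sample.

    `ratio` is pi_new(a|s) / pi_old(a|s). Taking the minimum of the clipped
    and unclipped terms removes any incentive to move the ratio outside
    [1 - eps, 1 + eps] in the direction the advantage favors.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

With `eps=0.2`, a positive-advantage sample stops gaining once its ratio passes 1.2, and a negative-advantage sample with ratio 0.5 is scored as if the ratio were 0.8. Lower `policy_entropy` corresponds to the "more confident and deterministic outputs" the abstract describes: a peaked distribution such as `[0.9, 0.1]` has lower entropy than a uniform `[0.5, 0.5]`.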


[94] Surely Large Multimodal Models (Don’t) Excel in Visual Species Recognition? cs.LG | cs.CV

Tian Liu, Anwesha Basu, James Caverlee, Shu Kong

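TL;DR: This paper studies large multimodal models (LMMs) on visual species recognition (VSR). Although LMMs do well on general recognition tasks, they struggle on the highly specialized VSR task and even significantly underperform simple few-shot learning (FSL) expert models. The analysis shows, however, that LMMs can effectively correct the expert models' wrong predictions post hoc. Building on this, the authors propose Post-hoc Correction (POC), a simple method that prompts an LMM to re-rank the expert model's predictions using enriched prompts containing confidence scores and few-shot examples, significantly improving FSL methods on five VSR benchmarks.

Details

Motivation: VSR is vital to biodiversity assessment, but species-level annotation requires domain expertise, so labeled data is scarce, which motivates training expert models via few-shot learning. Meanwhile, LMMs perform strongly on general recognition tasks. This raises the question of how they fare on the highly specialized VSR task and whether they outperform FSL expert models.

Result: On five challenging VSR benchmarks, POC outperforms prior FSL methods by +6.4% in accuracy without extra training, validation, or manual intervention. It generalizes to different pretrained backbones and LMMs and works as a plug-and-play module that significantly enhances existing FSL methods.

Insight: The paper reveals the limitations of LMMs on specialized VSR tasks while showing that a post-hoc correction mechanism can harness their capabilities to improve expert models. By combining softmax confidence scores and few-shot visual examples in enriched prompts, POC provides a simple, effective integration strategy for strengthening few-shot learning.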

Abstract: Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet, species-level annotation demands domain expertise, making it realistic for domain experts to annotate only a few examples. These limited labeled data motivate training an "expert" model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. It is straightforward to ask whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models. Somewhat surprisingly, we find that LMMs struggle in this task, despite using various established prompting techniques. LMMs even significantly underperform FSL expert models, which are as simple as finetuning a pretrained visual encoder on the few-shot images. However, our in-depth analysis reveals that LMMs can effectively post-hoc correct the expert models’ incorrect predictions. Briefly, given a test image, when prompted with the top predictions from an FSL expert model, LMMs can recover the ground-truth label. Building on this insight, we derive a simple method called Post-hoc Correction (POC), which prompts an LMM to re-rank the expert model’s top predictions using enriched prompts that include softmax confidence scores and few-shot visual examples. Across five challenging VSR benchmarks, POC outperforms prior art of FSL by +6.4% in accuracy without extra training, validation, or manual intervention. Importantly, POC generalizes to different pretrained backbones and LMMs, serving as a plug-and-play module to significantly enhance existing FSL methods.
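
The text side of the re-ranking step can be sketched as follows. This is a minimal sketch under assumptions: the prompt wording, the choice of `k`, and the helper names `softmax`, `top_k`, and `build_rerank_prompt` are hypothetical, and the few-shot reference images that POC attaches are only referred to in the prompt text, since passing images depends on the specific LMM interface.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(labels, logits, k=3):
    """Return the k most confident (label, probability) pairs from the expert model."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def build_rerank_prompt(candidates):
    """Format the expert's candidates and confidences into a re-ranking prompt.

    The wording is illustrative; it is not the prompt used in the paper.
    """
    lines = ["An expert classifier proposed these species for the attached image:"]
    for i, (label, prob) in enumerate(candidates, start=1):
        lines.append(f"{i}. {label} (confidence {prob:.2f})")
    lines.append(
        "Compare the image with the reference examples for each candidate "
        "and answer with the single most likely species."
    )
    return "\n".join(lines)
```

Restricting the LMM to the expert's top candidates turns an open-ended recognition problem into a short multiple-choice comparison, which matches the abstract's observation that LMMs recover the correct label when shown the expert's top predictions.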


[95] SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks cs.LG | cs.AI | cs.CV

Vegard Flovik

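TL;DR: This paper presents SALVE, a framework that learns a model-native feature basis with an unsupervised sparse autoencoder, validates the features with Grad-FAM, and uses the autoencoder's structure to perform weight-space interventions, enabling interpretable control of neural network behavior.

Details

Motivation: Deep neural networks are hard to interpret and control; the paper aims to bridge mechanistic interpretability and model editing.

Result: Validated on a convolutional network (ResNet-18) and a transformer-based model (ViT-B/16), demonstrating consistent, interpretable control over their behavior.

Insight: Contributions include unsupervised sparse feature discovery, feature-level saliency-map validation, precise weight-space editing, and the derivation of a critical suppression threshold $\alpha_{\mathrm{crit}}$, together giving a systematic methodology from feature discovery to model editing.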

Abstract: Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified “discover, validate, and control” framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder’s structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_{\mathrm{crit}}$, quantifying each class’s reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is validated on both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control over their behavior. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.
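
For reference, an $\ell_1$-regularized autoencoder is usually trained with an objective of the following shape. This is the standard textbook form, given as an assumption; the symbols $f_{\mathrm{enc}}$, $f_{\mathrm{dec}}$, $z$, and $\lambda$ are illustrative, and the paper's exact loss may differ.

```latex
z = f_{\mathrm{enc}}(x), \qquad \hat{x} = f_{\mathrm{dec}}(z), \qquad
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \, \lVert z \rVert_1
```

The first term rewards faithful reconstruction of the activation $x$, while the $\ell_1$ penalty pushes most entries of the latent code $z$ to zero, so that each input is explained by a few active features. That sparsity is what makes individual features candidates for targeted edits.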


cs.RO [Back]

[96] Large Video Planner Enables Generalizable Robot Control cs.RO | cs.CV

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang

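TL;DR: This paper proposes a new paradigm for building robot foundation models from large-scale video pretraining, called Large Video Planner (LVP). Unlike the prevailing vision-language-action (VLA) systems built by extending multimodal large language models (MLLMs), it trains on an internet-scale video dataset of human activities and task demonstrations, generates zero-shot video plans, and post-processes them to extract executable robot actions, achieving generalizable control over novel scenes and tasks.

Details

Motivation: The prevailing approach to general-purpose robot decision-making models extends MLLMs into VLA systems, motivated by transferring MLLMs' large-scale language and image pretraining to the action output modality. This paper explores an alternative, arguing that videos, unlike static images and language, naturally capture spatio-temporal sequences of states and actions aligned with robotic behavior and are therefore better suited as the primary modality for robot foundation models.

Result: The model is evaluated on third-party-selected tasks in the wild and in real-robot experiments, demonstrating successful physical execution. The results indicate robust instruction following, strong generalization, and real-world feasibility.

Insight: The core contribution is training, for the first time at foundation-model scale, a generative robot planning model with large-scale video pretraining, and demonstrating the effectiveness of video as the primary modality. Viewed objectively, this gives robot learning an alternative to the dominant VLA paradigm, grounded in the spatio-temporal dynamics of video; the released model and dataset also support open, reproducible video-based robot learning research.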

Abstract: General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs’ large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.


cs.ET [Back]

[97] Human-like Working Memory from Artificial Intrinsic Plasticity Neurons cs.ET | cs.AI | cs.CV | cs.NE

Jingli Liu, Huannan Zheng, Bohao Zou, Kezhou Yang

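TL;DR: The paper proposes IPNet, a neuromorphic architecture that physically emulates the volatility of biological memory by exploiting the Joule-heating dynamics of magnetic tunnel junctions (MTJs), realizing human-like working memory. The architecture shows trends similar to human subjects on several memory tasks, outperforms RNN, LSTM, and CNN baselines on dynamic vision sensor (DVS) gesture data and an autonomous driving task, and achieves very low energy consumption and a compact footprint.

Details

Motivation: Artificial networks that implement working memory with recurrent or parallel architectures incur high energy costs and noise sensitivity. The paper emulates biological working memory through neuronal intrinsic plasticity for more efficient, more biologically plausible computation.

Result: 99.65% accuracy on 11-class DVS gesture datasets and 99.48% on a 22-class time-reversed benchmark, outperforming RNN, LSTM, and 2+1D CNN baselines; on autonomous driving (DDD-20), 14.4% lower steering prediction error than ResNet-LSTM; memory power reduced 2,874x versus LSTMs and 90,920x versus parallel 3D-CNNs; a footprint of about 1.5 µm² (28 nm CMOS), a more than 20-fold reduction over standard LIF neurons.

Insight: The novelty is a hardware realization of neuronal intrinsic plasticity that uses the physical properties of MTJs (Joule-heating dynamics) to emulate biological memory volatility. The paper identifies a "Memory-at-the-Frontier" effect, where performance is maximized at the sensing interface, validating a bio-plausible near-sensor processing paradigm. All results rely on raw parameters from fabricated devices without optimization, and hardware-in-the-loop validation confirms physical realizability.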

Abstract: Working memory enables the brain to integrate transient information for rapid decision-making. Artificial networks typically replicate this via recurrent or parallel architectures, yet incur high energy costs and noise sensitivity. Here we report IPNet, a hardware-software co-designed neuromorphic architecture realizing human-like working memory via neuronal intrinsic plasticity. Exploiting Joule-heating dynamics of Magnetic Tunnel Junctions (MTJs), IPNet physically emulates biological memory volatility. The memory behavior of the proposed architecture shows similar trends in n-back, free recall and memory interference tasks to that of reported human subjects. Implemented exclusively with MTJ neurons, the architecture with human-like working memory achieves 99.65% accuracy on 11-class DVS gesture datasets and maintains 99.48% on a novel 22-class time-reversed benchmark, outperforming RNN, LSTM, and 2+1D CNN baselines sharing identical backbones. For autonomous driving (DDD-20), IPNet reduces steering prediction error by 14.4% compared to ResNet-LSTM. Architecturally, we identify a ‘Memory-at-the-Frontier’ effect where performance is maximized at the sensing interface, validating a bio-plausible near-sensor processing paradigm. Crucially, all results rely on raw parameters from fabricated devices without optimization. Hardware-in-the-loop validation confirms the system’s physical realizability. Separately, energy analysis reveals a reduction in memory power of 2,874x compared to LSTMs and 90,920x versus parallel 3D-CNNs. This capacitor-free design enables a compact ~1.5 µm² footprint (28 nm CMOS): a >20-fold reduction over standard LIF neurons. Ultimately, we demonstrate that instantiating human-like working memory via intrinsic neuronal plasticity endows neural networks with the dual biological advantages of superior dynamic vision processing and minimal metabolic cost.


cs.CR

[98] Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation cs.CR | cs.AI | cs.CL

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu

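TL;DR: This paper presents the first systematic study of a new privacy risk introduced by LLM-driven single-agent, multi-tool architectures: Tools Orchestration Privacy Risk (TOP-R), in which an agent pursuing a benign user goal autonomously aggregates information fragments across tools and uses its reasoning capabilities to synthesize unexpected sensitive information. The authors establish a formal framework, construct the TOP-Bench evaluation benchmark, and propose the Privacy Enhancement Principle (PEP) method, which effectively mitigates the risk.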

Details

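Motivation: The popular single-agent, multi-tool architecture pursues efficiency while neglecting privacy awareness, so an agent may leak sensitive information by aggregating and reasoning over information from multiple tools. This is the Tools Orchestration Privacy Risk (TOP-R), which the paper sets out to study systematically.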

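Result: Eight representative models were evaluated on the newly constructed TOP-Bench. The average Risk Leakage Rate (RLR) reaches 90.24%, while the H-Score, a holistic metric for the trade-off between safety and robustness, averages only 0.167 (no model exceeds 0.3). The proposed PEP method reduces the Risk Leakage Rate to 46.58% and raises the H-Score to 0.624.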

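Insight: The novelty lies in the first formalization and systematic evaluation of privacy risk in agent tool orchestration, which exposes a fundamental flaw in how the objective function of current architectures is aligned: helpfulness is over-optimized while privacy is neglected. The proposed PEP method mitigates the risk through principle-based design and offers useful guidance for future secure agent architectures.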

Abstract: Driven by Large Language Models, the single-agent, multi-tool architecture has become a popular paradigm for autonomous agents due to its simplicity and effectiveness. However, this architecture also introduces a new and severe privacy risk, which we term Tools Orchestration Privacy Risk (TOP-R), where an agent, to achieve a benign user goal, autonomously aggregates information fragments across multiple tools and leverages its reasoning capabilities to synthesize unexpected sensitive information. We provide the first systematic study of this risk. First, we establish a formal framework, attributing the risk’s root cause to the agent’s misaligned objective function: an overoptimization for helpfulness while neglecting privacy awareness. Second, we construct TOP-Bench, comprising paired leakage and benign scenarios, to comprehensively evaluate this risk. To quantify the trade-off between safety and robustness, we introduce the H-Score as a holistic metric. The evaluation results reveal that TOP-R is a severe risk: the average Risk Leakage Rate (RLR) of eight representative models reaches 90.24%, while the average H-Score is merely 0.167, with no model exceeding 0.3. Finally, we propose the Privacy Enhancement Principle (PEP) method, which effectively mitigates TOP-R, reducing the Risk Leakage Rate to 46.58% and significantly improving the H-Score to 0.624. Our work reveals both a new class of risk and inherent structural limitations in current agent architectures, while also offering feasible mitigation strategies.
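
As a reading aid for the figures above, the following minimal Python sketch shows how a leakage rate and its relative reduction could be computed. It assumes that the Risk Leakage Rate is the fraction of leakage scenarios in which the agent synthesized the sensitive information; the abstract does not spell out this definition, and the names `ScenarioOutcome`, `risk_leakage_rate`, and `relative_reduction` are hypothetical, not part of TOP-Bench. The H-Score formula is not given in the abstract, so it is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ScenarioOutcome:
    """Outcome of one leakage scenario (hypothetical record type)."""
    scenario_id: str
    leaked: bool  # True if the agent synthesized the sensitive information


def risk_leakage_rate(outcomes: list[ScenarioOutcome]) -> float:
    """Fraction of leakage scenarios that leaked (assumed RLR definition)."""
    if not outcomes:
        raise ValueError("at least one scenario outcome is required")
    return sum(o.leaked for o in outcomes) / len(outcomes)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


if __name__ == "__main__":
    # Toy outcomes for illustration only, not data from the paper.
    toy = [
        ScenarioOutcome("s1", True),
        ScenarioOutcome("s2", True),
        ScenarioOutcome("s3", False),
        ScenarioOutcome("s4", True),
    ]
    print(f"toy RLR: {risk_leakage_rate(toy):.2%}")

    # Averages reported in the abstract: 90.24% without PEP, 46.58% with PEP.
    drop = relative_reduction(90.24, 46.58)
    print(f"relative RLR reduction with PEP: {drop:.1%}")
```

Under this assumption, the reported drop from 90.24% to 46.58% is 43.66 percentage points, or roughly a 48% relative reduction, so about half of the leaking scenarios still leak with PEP applied.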