Table of Contents
- cs.CL [Total: 27]
- cs.CV [Total: 121]
- cs.RO [Total: 3]
- cs.DC [Total: 1]
- cs.AI [Total: 9]
- q-fin.GN [Total: 1]
- cs.LG [Total: 15]
- econ.GN [Total: 1]
cs.CL
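[1] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA cs.CL | cs.CV | cs.LG
Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang
TL;DR: This paper introduces Moxin 7B, a fully open-source large language model (LLM) that follows the Model Openness Framework. Its core principle is to go beyond sharing model weights and make the training process, datasets, and implementation details fully transparent. Building on Moxin 7B, the authors develop three variants: Moxin-VLM (vision-language tasks), Moxin-VLA (vision-language-action tasks), and Moxin-Chinese (Chinese-language capability), extending the model's capabilities across tasks.
Details
Motivation: The LLM field is dominated by closed-source models such as GPT-4. Open-source models such as LLaMA have broadened access but remain insufficiently transparent. The paper aims to foster a more inclusive, collaborative open-source research ecosystem through Moxin, a fully transparent model that complies with the openness framework, and its multimodal variants.
Result: Experiments show that the proposed models perform strongly across a range of evaluations. The abstract names no specific benchmarks and does not say whether the models reach SOTA; it only emphasizes superior performance.
Insight: The main contribution is an open-source LLM built on full transparency (training, data, implementation), extended with task-specific variants (vision-language, vision-language-action, Chinese), which helps the open-source ecosystem develop healthily. Viewed objectively, the emphasis on "full transparency" is the key difference from many existing open-source models and may set a new standard for reproducibility and community collaboration.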
Abstract: Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have contributed greatly to the ever-increasing popularity of LLMs because they are easy to customize and deploy across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation details, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with various capabilities in different tasks, we develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target the vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt an open-source framework and open data for the training. We release our models, along with the available data and code to derive these models.
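[2] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents cs.CL | cs.AI | cs.CV | cs.LG | cs.MA
Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li
TL;DR: This paper proposes the SmartSnap paradigm, which turns task verification from passive post-processing into proactive, online self-verification by the agent. The self-verifying agent has a dual mission (complete the task and provide evidence) and follows the 3C Principles (Completeness, Conciseness, Creativity) to collect a minimal, decisive set of snapshots, which a general LLM-as-a-Judge verifier then assesses. This allows LLM-driven agents to be trained in a scalable way.
Details
Motivation: In existing reinforcement learning for GUI agents, task verification is a passive, post-hoc process (rule-based scripts, reward models, or LLM-as-a-Judge) that must process long interaction trajectories full of irrelevant noise, which makes it costly and unreliable.
Result: Experiments on mobile tasks show that SmartSnap brings performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively, and that the resulting efficient self-verifying agents are competitive with DeepSeek V3.1 and Qwen3-235B-A22B.
Insight: The core innovation is moving verification from passive, after-the-fact checking to proactive, online self-verification by the agent, with the 3C Principles guiding evidence collection. This creates synergy between solution finding and evidence seeking and promises scalable agent training at lower cost and higher reliability.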
Abstract: Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, or LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine whether the agent succeeds. Processing such verbose context, which contains irrelevant, noisy history, poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.
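[3] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition cs.CL | cs.AI | cs.CV
Elsen Ronando, Sozo Inoue
TL;DR: This paper proposes an LLM-guided exemplar selection framework for wearable-sensor human activity recognition (HAR), addressing existing methods' reliance on large labeled datasets and purely geometric exemplar selection. The method performs semantic reasoning through an LLM-generated knowledge prior (capturing feature importance, inter-class confusability, and exemplar budget multipliers) and combines it with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to select a compact, informative exemplar set. Under strict few-shot conditions on the UCI-HAR dataset, the framework reaches a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and k-center.
Details
Motivation: State-of-the-art HAR methods depend on large labeled datasets and purely geometric exemplar selection, and they struggle to separate similar wearable-sensor activities such as walking, walking upstairs, and walking downstairs.
Result: On UCI-HAR under strict few-shot conditions, the macro F1-score reaches 88.78%, outperforming classical approaches such as random sampling, herding, and k-center.
Insight: The contribution is the combination of LLM-generated semantic priors (feature importance, inter-class confusability, exemplar budget multipliers) with structural and geometric cues (PageRank, hubness penalization, facility-location optimization), giving a stronger basis for exemplar selection in few-shot wearable-sensor HAR. Viewed objectively, semantic reasoning improves discrimination between similar activities and yields more representative exemplars for few-shot learning.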
Abstract: In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar wearable sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and $k$-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
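To make the facility-location component mentioned in the abstract concrete, here is a minimal sketch of plain greedy facility-location selection. It is not the authors' implementation: the LLM prior, margin cues, PageRank, and hubness terms are omitted, and the function name, the similarity function, and the toy points are all illustrative assumptions.

```python
import math

def facility_location_greedy(points, budget):
    """Greedily pick up to `budget` indices maximizing sum_i max_{j in S} sim(i, j)."""
    def sim(a, b):
        # Assumed similarity: exp(-Euclidean distance); any non-negative similarity works.
        return math.exp(-math.dist(a, b))

    n = len(points)
    best_cover = [0.0] * n  # best similarity of each point to the current selection
    selected = []
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much candidate j improves each point's coverage.
            gain = sum(max(sim(points[i], points[j]) - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim(points[i], points[best_j])) for i in range(n)]
    return selected

# Toy data (hypothetical): two well-separated clusters in 2-D feature space.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
chosen = facility_location_greedy(points, budget=2)
```

With a budget of two, the greedy rule picks one exemplar from each cluster, which is the coverage behavior that distinguishes facility location from purely density-based selection.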
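[4] HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG cs.CL | cs.AI | cs.IR | cs.LG
Cattalyya Nuengsigkapian
TL;DR: This paper presents HiFi-RAG, a system for open-domain retrieval-augmented generation (RAG) that uses a multi-stage pipeline to deal with irrelevant information in retrieved documents and with aligning answers to user intent. The system won the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Its core is hierarchical content filtering and two-pass generation: Gemini 2.5 Flash handles fast, low-cost query formulation, content filtering, and citation attribution, and Gemini 2.5 Pro produces the final answer.
Details
Motivation: In open-domain RAG, retrieved documents contain a great deal of irrelevant information, and generated answers are hard to align with user intent.
Result: On the MMU-RAGent validation set, the system beat the baseline, raising ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, a custom test set of questions requiring knowledge from after January 2025, HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.
Insight: The main contribution is the hierarchical filtering and two-pass generation architecture: a fast, cheap model (Flash) handles preprocessing and filtering, and a stronger reasoning model (Pro) handles final generation, balancing efficiency and quality. Dividing work according to model capability in a pipeline is a reusable system-design idea.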
Abstract: Retrieval-Augmented Generation (RAG) in open-domain settings faces significant challenges regarding irrelevant information in retrieved documents and the alignment of generated answers with user intent. We present HiFi-RAG (Hierarchical Filtering RAG), the winning closed-source system in the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Our approach moves beyond standard embedding-based retrieval via a multi-stage pipeline. We leverage the speed and cost-efficiency of Gemini 2.5 Flash (4-6x cheaper than Pro) for query formulation, hierarchical content filtering, and citation attribution, while reserving the reasoning capabilities of Gemini 2.5 Pro for final answer generation. On the MMU-RAGent validation set, our system outperformed the baseline, improving ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, our custom dataset evaluating questions that require post-cutoff knowledge (post January 2025), HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.
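[5] Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models cs.CL
Jie Zhou, Xin Chen, Jie Zhang, Zhe Li
TL;DR: This paper examines the reasoning capabilities of large language models in a vertical domain, specifically accounting. It introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria. Several representative models, including the GLM series and GPT-4, are evaluated on accounting reasoning tasks. Different prompt engineering strategies improve performance to varying degrees, GPT-4 performs best, and existing models still fall short of practical requirements.
Details
Motivation: Integrating LLMs effectively into professional fields such as accounting requires a close understanding of their domain-specific reasoning capabilities, a key challenge for enterprise digital transformation and broader social development.
Result: GLM-6B, GLM-130B, GLM-4, and GPT-4 were evaluated on accounting reasoning tasks. Different prompt engineering strategies improved performance, GPT-4 showed the strongest accounting reasoning, and no model met real-world application requirements.
Insight: The contribution is the concept of vertical-domain accounting reasoning and its evaluation criteria, which provide a benchmark for later work on reasoning paradigms. Viewed objectively, combining domain expertise with an LLM evaluation framework offers a template for assessing LLMs in other vertical domains.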
Abstract: Large Language Models (LLMs) are reshaping learning paradigms, cognitive processes, and research methodologies across a wide range of domains. Integrating LLMs with professional fields and redefining the relationship between LLMs and domain-specific applications has become a critical challenge for promoting enterprise digital transformation and broader social development. To effectively integrate LLMs into the accounting domain, it is essential to understand their domain-specific reasoning capabilities. This study introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria by analyzing the training data characteristics of representative GLM-series models. These criteria provide a foundation for subsequent research on reasoning paradigms and offer benchmarks for improving accounting reasoning performance. Based on this framework, we evaluate several representative models, including GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4, on a set of accounting reasoning tasks. Experimental results show that different prompt engineering strategies lead to varying degrees of performance improvement across models, with GPT-4 achieving the strongest accounting reasoning capability. However, current LLMs still fall short of real-world application requirements. In particular, further optimization is needed for deployment in enterprise-level accounting scenarios to fully realize the potential value of LLMs in this domain.
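[6] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis cs.CL
Zhiqiang Gao, Shihao Gao, Zixing Zhang, Yihao Guo, Hongyu Chen
TL;DR: This paper addresses the Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) challenge with structured prompting and LLM ensembling. For Subtask-I (extracting sentiment sextuples), it designs a structured prompting pipeline that guides LLMs to extract sentiment components sequentially. For Subtask-II (detecting sentiment flipping), it ensembles three LLMs with complementary strengths to identify sentiment shifts and their triggers.
Details
Motivation: Understanding sentiment in multimodal conversations is hard, particularly extracting fine-grained sentiment elements (holder, target, aspect, opinion, sentiment, rationale) from multi-speaker dialogues and detecting dynamic sentiment flips and their triggers, both of which are needed for more emotionally intelligent AI systems.
Result: In the MCABSA challenge, the system scored a 47.38% average on Subtask-I and a 74.12% exact-match F1 on Subtask-II, showing that step-wise refinement and ensembling are effective for multimodal sentiment analysis.
Insight: The contributions are a structured prompting pipeline that guides LLMs step by step through complex sentiment structures, and LLM ensembling that exploits model complementarity to make sentiment-flip detection more robust. Viewed objectively, combining prompt engineering with model ensembling gives a scalable approach to multimodal sentiment analysis.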
Abstract: Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, including holder, target, aspect, opinion, sentiment, and rationale from multi-speaker dialogues, and (2) detecting sentiment flipping, which detects dynamic sentiment shifts and their underlying triggers. For Subtask-I, in the present paper, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.
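[7] Chain-of-thought Reviewing and Correction for Time Series Question Answering cs.CL
Chen Su, Yuanhe Tian, Yan Song
TL;DR: This paper proposes T3LLM, a new framework for time series question answering. It uses three large language models (a worker, a reviewer, and a student) to perform multi-step reasoning with an explicit correction mechanism, improving accuracy on complex numerical sequences.
Details
Motivation: Existing LLM-based approaches to time series question answering mostly adopt general NLP techniques and are prone to reasoning errors on complex numerical sequences. Time series data are inherently verifiable, which motivates a correction mechanism that checks reasoning steps against the original input.
Result: Experiments on multiple real-world time series question answering benchmarks show that T3LLM outperforms strong LLM-based baselines and reaches state-of-the-art performance.
Insight: The core idea is to exploit the verifiability of time series data in a collaborative framework of generation, review, and learning, internalizing multi-step reasoning and self-correction into the student model's parameters. This offers a new, verifiable reasoning paradigm for numerical sequence tasks.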
Abstract: With the advancement of large language models (LLMs), diverse time series analysis tasks are reformulated as time series question answering (TSQA) through a unified natural language interface. However, existing LLM-based approaches largely adopt general natural language processing techniques and are prone to reasoning errors when handling complex numerical sequences. Different from purely textual tasks, time series data are inherently verifiable, enabling consistency checking between reasoning steps and the original input. Motivated by this property, we propose T3LLM, which performs multi-step reasoning with an explicit correction mechanism for time series question answering. The T3LLM framework consists of three LLMs, namely, a worker, a reviewer, and a student, that are responsible for generation, review, and reasoning learning, respectively. Within this framework, the worker generates step-wise chains of thought (CoT) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments. The collaboratively generated corrected CoT are used to fine-tune the student model, internalizing multi-step reasoning and self-correction into its parameters. Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.
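The abstract's point that time series are "inherently verifiable" can be illustrated with a small consistency check between a stated reasoning step and the raw input. In T3LLM the reviewer is itself an LLM; the programmatic checker below is only a stand-in, and the claim format, function name, and sample series are hypothetical.

```python
def verify_step(series, claim, tol=1e-9):
    """Check a (statistic, value) claim from a reasoning step against the raw series.

    Returns (is_consistent, actual_value). The claim format is an illustrative assumption.
    """
    kind, expected = claim
    statistics = {
        "max": max,
        "min": min,
        "mean": lambda s: sum(s) / len(s),
    }
    actual = statistics[kind](series)
    return abs(actual - expected) <= tol, actual

# Hypothetical series and two reasoning-step claims, one wrong and one right.
series = [3.0, 7.5, 5.0]
wrong_step = verify_step(series, ("max", 8.0))
right_step = verify_step(series, ("min", 3.0))
```

A failed check pinpoints the erroneous step and supplies the correct value, which is the kind of corrective comment the reviewer role is described as producing.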
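[8] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs cs.CL
Hadi Mohammadi, Tamas Kozak, Anastasia Giachanou
TL;DR: This paper evaluates how well two optimization methods, GRPO and DPO, improve the faithfulness of chain-of-thought reasoning in large language models. GRPO outperforms DPO in larger models, and Qwen2.5-14B-Instruct achieves the best results on all evaluation metrics. Both methods show a positive correlation between model size and performance, but GRPO has more potential for improving faithfulness metrics, although its behavior is less stable at smaller scales.
Details
Motivation: Chain-of-thought reasoning improves multi-step reasoning in LLMs, but its explanations often fail to reflect the model's actual reasoning process. Models may produce coherent yet misleading justifications, or change answers without acknowledging external cues, which undermines the reliability of CoT-based methods for safety supervision and alignment monitoring.
Result: GRPO outperforms DPO in larger models, and Qwen2.5-14B-Instruct achieves the best results on all evaluation metrics. Both methods show a positive correlation between model size and performance; GRPO has more potential for improving faithfulness metrics but is less stable at smaller scales.
Insight: The contribution is a systematic evaluation of how GRPO and DPO affect chain-of-thought faithfulness, finding GRPO more effective in larger models and pointing toward more transparent, trustworthy LLM reasoning. Viewed objectively, the study highlights how the optimization method affects reasoning faithfulness and how model size interacts with that effect, which is useful for improving alignment and interpretability.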
Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model’s actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
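For readers comparing the two methods, the sketch below shows the standard per-pair DPO loss and the group-normalized advantage commonly used in GRPO. It follows the textbook formulations, not this paper's training code; the function names, `beta` default, and the use of population standard deviation are assumptions.

```python
import math
from statistics import mean, pstdev

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group of sampled responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical numbers: one preference pair, and one group of four sampled responses.
pair_loss = dpo_loss(-1.0, -2.0, -1.5, -1.5)
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

DPO learns from fixed chosen/rejected pairs against a reference policy, whereas GRPO scores several sampled responses per prompt and pushes probability toward those above the group mean.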
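[9] Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2 cs.CL | cs.AI | cs.LG
Pere Martra
TL;DR: This paper studies structured width pruning of GLU-MLP layers under the Maximum Absolute Weight (MAW) criterion and reports a systematic dichotomy in Llama-3.2 models: pruning hurts tasks that rely on parametric knowledge but substantially improves instruction following and leaves multi-step reasoning robust.
Details
Motivation: The paper challenges the common assumption that pruning degrades model capabilities uniformly. It investigates how pruning selectively affects different cognitive capabilities and shows the role of the expansion ratio as a key architectural parameter.
Result: On Llama-3.2-1B and 3B, instruction following (IFEval) improved by 46% to 75% and multi-step reasoning stayed robust, while factual knowledge (MMLU) and perplexity metrics degraded. The study also finds a strong inverse correlation (r = -0.864) between factual knowledge capacity and truthfulness (TruthfulQA-MC2).
Insight: The expansion ratio is a key parameter that selectively modulates cognitive capabilities, not merely a compression metric. MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Pruned configurations also trade energy consumption (up to 23% lower) against single-request latency, while batch workloads benefit uniformly.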
Abstract: Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model’s ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
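The following sketch shows the general shape of width pruning in a GLU-MLP block: score each intermediate neuron, keep the top-scoring ones, and slice the three projection matrices consistently. The scoring rule used here (largest absolute weight in the neuron's gate row plus the same for its up row) is an assumption for illustration; the abstract names the MAW criterion but does not spell out its formula. All names and weights are hypothetical.

```python
def maw_scores(gate_proj, up_proj):
    """Assumed importance score per intermediate neuron: max |w| over its gate and up rows."""
    return [
        max(abs(w) for w in gate_row) + max(abs(w) for w in up_row)
        for gate_row, up_row in zip(gate_proj, up_proj)
    ]

def prune_glu_mlp(gate_proj, up_proj, down_proj, keep):
    """Keep the `keep` highest-scoring neurons; returns pruned matrices and kept indices.

    gate_proj, up_proj: [intermediate][hidden]; down_proj: [hidden][intermediate].
    """
    scores = maw_scores(gate_proj, up_proj)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)  # preserve original neuron order
    new_gate = [gate_proj[i] for i in kept]            # drop rows
    new_up = [up_proj[i] for i in kept]                # drop rows
    new_down = [[row[i] for i in kept] for row in down_proj]  # drop columns
    return new_gate, new_up, new_down, kept

# Toy block: hidden size 2, intermediate size 4 (expansion ratio 2.0).
gate = [[0.1, -0.2], [0.9, 0.0], [0.05, 0.05], [-0.7, 0.3]]
up = [[0.2, 0.1], [0.1, -0.8], [0.0, 0.1], [0.6, 0.6]]
down = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
new_gate, new_up, new_down, kept = prune_glu_mlp(gate, up, down, keep=2)
new_expansion_ratio = len(kept) / len(gate[0])
```

Keeping two of four neurons lowers the expansion ratio from 2.0 to 1.0, the architectural knob the paper varies across its seven configurations.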
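[10] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages cs.CL
Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich
TL;DR: This paper builds a human-validated framework to evaluate whether reasoning traces generated by large language models logically support their conclusions across languages. Although the models achieve good task accuracy, their reasoning is often misaligned with their conclusions, and the problem is more severe in languages written in non-Latin scripts.
Details
Motivation: The study asks whether the reasoning ability LLMs show under chain-of-thought prompting transfers across languages, and whether current multilingual evaluation methods fully reflect models' true reasoning capabilities.
Result: The authors analyzed 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models. The reasoning-conclusion misalignment rate in non-Latin-script languages is at least twice that in Latin-script languages. An error taxonomy built through human annotation shows that failures stem mainly from evidential errors (unsupported claims, ambiguous facts), followed by illogical reasoning steps.
Insight: The contribution is exposing a key blind spot in multilingual reasoning evaluation: high task accuracy can hide logical flaws in the reasoning trace, so reasoning-aware evaluation frameworks are needed. Viewed objectively, the study offers a new methodological perspective for assessing the reliability of multilingual AI systems.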
Abstract: Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions than those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.
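[11] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference cs.CL
Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang
TL;DR: This paper proposes WeDLM, a diffusion language model decoding framework built on standard causal attention. It targets a limitation of existing diffusion language models: their reliance on bidirectional attention prevents effective use of prefix KV caching and limits inference speed. Through Topological Reordering, WeDLM keeps a causal mask while letting every masked position condition on all observed tokens, enabling prefix-cache-friendly parallel decoding.
Details
Motivation: The token-by-token nature of autoregressive decoding limits parallelism at inference time. Existing diffusion language models support parallel decoding, but their bidirectional attention breaks standard prefix KV caching, so the parallelism does not turn into real deployment speedups, especially against optimized AR engines such as vLLM.
Result: WeDLM preserves the quality of strong AR backbones while delivering substantial speedups: close to 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes. Importantly, the comparisons are against AR baselines served by vLLM under matched deployment settings, showing that diffusion-style decoding can beat an optimized AR engine in practice.
Insight: The contribution is building diffusion decoding entirely on standard causal attention, using Topological Reordering for prefix-cache-friendly parallel generation, and adding a streaming decoding procedure that continuously commits high-confidence tokens to keep a fixed parallel workload. This avoids the stop-and-wait behavior common in block diffusion methods and greatly improves inference efficiency without sacrificing model quality.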
Abstract: Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.
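A toy illustration of the reordering idea described in the abstract: observed tokens move to the physical prefix, masked slots follow, and each entry keeps its logical position as a position id. This is only a sketch of the bookkeeping, not WeDLM's implementation; the `MASK` sentinel, function name, and example sentence are hypothetical.

```python
MASK = None  # hypothetical sentinel for a not-yet-decoded position

def topological_reorder(tokens):
    """Return (physical_tokens, position_ids) with observed tokens placed first.

    Under a strict causal mask over the physical order, every masked slot can then
    attend to all observed tokens, and position ids retain the logical order.
    """
    observed = [(pos, tok) for pos, tok in enumerate(tokens) if tok is not MASK]
    masked = [(pos, tok) for pos, tok in enumerate(tokens) if tok is MASK]
    ordered = observed + masked
    physical_tokens = [tok for _, tok in ordered]
    position_ids = [pos for pos, _ in ordered]
    return physical_tokens, position_ids

# Logical sequence with two undecoded positions (1 and 3).
tokens = ["The", MASK, "sat", MASK, "mat"]
physical, pos_ids = topological_reorder(tokens)
```

Because the observed prefix only grows as tokens are committed, its key/value entries can be cached and reused between steps, which is the property bidirectional attention breaks.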
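[12] Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis cs.CL
Dongning Rao, Yunbiao Zeng, Zhihua Jiang, Jujian Lv
TL;DR: This paper proposes TEXT, a text-routed sparse mixture-of-experts model for multi-modal sentiment analysis (MSA). The model uses a multi-modal large language model (MLLM) to generate explanations that augment the analysis, aligns audio and video representations with a temporality-oriented neural network block (combining Mamba and temporal cross-attention), and finally integrates the modalities through a text-routed sparse mixture-of-experts with gate fusion.
Details
Motivation: Existing MSA methods have not fully exploited explanations and temporal alignment, both of which matter for understanding subtle emotions in applications involving human interaction.
Result: TEXT achieves the best performance on all four datasets among all tested models (three recently proposed approaches and three MLLMs), winning on at least four of the six metrics. On CH-SIMS, for example, the mean absolute error drops to 0.353, a 13.5% reduction compared with recent approaches.
Insight: The contributions are: using MLLM-generated explanations to augment MSA; a temporal alignment block combining Mamba and temporal cross-attention to align audio and video representations over time; and a text-routed sparse mixture-of-experts with gate fusion that integrates multimodal information with text as the lead modality.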
Abstract: Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLM), and then aligns the representations of audio and video in a novel way through a temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four of the six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, a 13.5% reduction compared with recently proposed approaches.
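[13] AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning cs.CL | cs.AI
Shihao Cai, Runnan Fang, Jialong Wu, Baixuan Li, Xinyu Wang
TL;DR: This paper proposes AutoForge, a framework that automatically synthesizes simulated environments with high-difficulty but easily verifiable tasks, together with an environment-level reinforcement learning algorithm that mitigates user instability and improves training efficiency.
Details
Motivation: Prior reinforcement learning work in simulated environments is limited to semi-automated environment synthesis or to tasks that are not difficult enough. In addition, unstable simulated users and heterogeneity across environments make agentic reinforcement learning harder.
Result: Comprehensive evaluation on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validates the method's effectiveness and shows strong out-of-domain generalization.
Insight: The contributions are an automated, scalable environment synthesis pipeline and an environment-level reinforcement learning algorithm that improves training stability and efficiency through environment-level advantage estimation.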
Abstract: Conducting reinforcement learning (RL) in simulated environments offers a cost-effective and highly scalable way to enhance language-based agents. However, previous work has been limited to semi-automated environment synthesis or tasks lacking sufficient difficulty, offering little breadth or depth. In addition, the instability of simulated users integrated into these environments, along with the heterogeneity across simulated environments, poses further challenges for agentic RL. In this work, we propose: (1) a unified pipeline for automated and scalable synthesis of simulated environments associated with high-difficulty but easily verifiable tasks; and (2) an environment level RL algorithm that not only effectively mitigates user instability but also performs advantage estimation at the environment level, thereby improving training efficiency and stability. Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validate the effectiveness of our proposed method. Further in-depth analyses underscore its out-of-domain generalization.
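[14] Diversity or Precision? A Deep Dive into Next Token Prediction cs.CL
Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang
TL;DR: This paper revisits the standard cross-entropy loss, interpreting it as a special case of policy gradient optimization within a single-step episode, and proposes a generalized pre-training objective that applies on-policy reinforcement learning principles to supervised learning. By modeling next-token prediction as a stochastic decision process, it introduces a reward-shaping strategy that explicitly balances diversity and precision, reshaping the pre-trained token-output distribution to give later reinforcement learning a more favorable exploration space. Counterintuitively, imposing a precision-oriented prior yields the better exploration space.
Details
Motivation: The study explores how a pre-trained model's token-output distribution defines the effective exploration space for reinforcement learning, in order to examine systematically how the pre-training distribution affects later exploration potential.
Result: Contrary to the intuition that higher distribution entropy helps exploration, imposing a precision-oriented prior gives reinforcement learning a superior exploration space and ultimately improves end-to-end reasoning performance.
Insight: The contribution is reframing next-token prediction as a stochastic decision process and proposing a generalized pre-training objective whose reward shaping (a positive reward scaling factor and a rank-aware, asymmetric treatment of negative tokens) explicitly balances diversity and precision, reshaping the pre-trained distribution to improve the exploration space for later reinforcement learning.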
Abstract: Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model’s token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
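One standard way to see the connection the abstract draws between cross-entropy and a single-step policy gradient is sketched below. This is a generic derivation under stated assumptions, not necessarily the paper's own: $x$ is the context, $y$ the ground-truth token, $a$ the sampled token, $r(a)$ a reward, and $\operatorname{sg}(\cdot)$ a stop-gradient, all introduced here for illustration.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{CE}}(\theta) &= -\log \pi_\theta(y \mid x), \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}
     \big[\, r(a)\, \nabla_\theta \log \pi_\theta(a \mid x) \,\big]
   = \sum_{a} \pi_\theta(a \mid x)\, r(a)\, \nabla_\theta \log \pi_\theta(a \mid x), \\
r(a) &= \frac{\mathbb{1}[a = y]}{\operatorname{sg}\!\big(\pi_\theta(y \mid x)\big)}
  \;\Longrightarrow\;
  \nabla_\theta J(\theta) = \nabla_\theta \log \pi_\theta(y \mid x)
  = -\nabla_\theta \mathcal{L}_{\mathrm{CE}}(\theta).
\end{align*}
```

Under this reading, only the ground-truth token receives a nonzero reward, so changing how positive and negative tokens are rewarded is a natural place to trade precision against diversity.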
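[15] Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks cs.CL
Mengdi Chai, Ali R. Zomorrodi
TL;DR: This study evaluates three large language models (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) on clinical decision support tasks spanning the full workflow of a typical patient encounter, and examines how prompt engineering, specifically dynamic few-shot learning based on the MedPrompt framework, affects performance. Performance varies widely by task, and prompt engineering is not universally effective: its impact depends heavily on the model and the task.
Details
Motivation: LLMs show promise in medical knowledge assessments, but their practical utility in real-world clinical decision-making workflows is underexplored. The study evaluates LLMs across the complete clinical reasoning workflow and tests whether prompt engineering improves performance universally.
Result: Across 36 case studies, all models varied widely by task: near-perfect accuracy on final diagnosis, poor performance on relevant diagnostic testing, and moderate performance on the remaining tasks. Temperature affected models differently (ChatGPT did better at zero temperature, Llama at the default temperature). Applying the MedPrompt framework significantly improved only the task with the lowest baseline (relevant diagnostic testing) and was counterproductive for others. Targeted dynamic few-shot prompting also did not consistently outperform random selection.
Insight: The paper's claimed contribution is a systematic evaluation of LLMs across the whole clinical decision workflow, showing the limits of prompt engineering (not universal; model- and task-dependent). Viewed objectively, its core insight is that integrating LLMs into healthcare requires tailored, context-aware strategies, and that the benefit of closely matched examples in dynamic few-shot learning may be offset by a loss of broader contextual diversity.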
Abstract: Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed the LLMs' out-of-the-box performance across five key sequential clinical decision-making tasks under two temperature settings (default vs. zero): differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation. All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in remaining tasks. Furthermore, ChatGPT performed better under the zero temperature, whereas Llama showed stronger performance under the default temperature. Next, we assessed whether prompt engineering could enhance LLM performance by applying variations of the MedPrompt framework, incorporating targeted and random dynamic few-shot learning. The results demonstrate that prompt engineering is not a one-size-fits-all solution. While it significantly improved the performance on the task with the lowest baseline accuracy (relevant diagnostic testing), it was counterproductive for others. Another key finding was that the targeted dynamic few-shot prompting did not consistently outperform random selection, indicating that the presumed benefits of closely matched examples may be counterbalanced by loss of broader contextual diversity. These findings suggest that the impact of prompt engineering is highly model- and task-dependent, highlighting the need for tailored, context-aware strategies for integrating LLMs into healthcare.
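The contrast between targeted and random dynamic few-shot selection can be sketched as follows. This is not the MedPrompt implementation: word-overlap (Jaccard) similarity stands in for the embedding-based retrieval such frameworks typically use, and the function names, example pool, and query are all hypothetical.

```python
import random

def jaccard(a, b):
    """Word-overlap similarity; an assumed stand-in for embedding similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def targeted_examples(query, pool, k):
    """Targeted selection: the k pool cases most similar to the query."""
    return sorted(pool, key=lambda ex: jaccard(query, ex["case"]), reverse=True)[:k]

def random_examples(pool, k, seed=0):
    """Random selection: k pool cases sampled without regard to the query."""
    return random.Random(seed).sample(pool, k)

# Hypothetical pool of previously solved cases and a new query.
pool = [
    {"case": "chest pain and shortness of breath"},
    {"case": "fever and productive cough"},
    {"case": "severe headache with nausea"},
]
query = "chest pain radiating to left arm"
targeted = targeted_examples(query, pool, k=1)
randomized = random_examples(pool, k=2)
```

Targeted selection concentrates the prompt on near-duplicates of the query, while random selection spreads it across the pool, which is the diversity trade-off the abstract points to.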
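[16] LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models cs.CL | cs.AI
Wenxuan Xu, Arvind Pillai, Subigya Nepal, Amanda C Collins, Daniel M Mackin
TL;DR: LENS is a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. It builds a large-scale dataset of sensor-text question-answer pairs and trains a patch-level encoder that projects raw sensor signals directly into the LLM's representation space, addressing the fact that LLMs cannot natively process long-duration sensor streams.
Details
Motivation: Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce, which hinders translating multimodal health sensing data into natural language for mental-health assessment.
Result: LENS outperforms strong baselines on standard NLP metrics and on task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-generated narratives are comprehensive and clinically meaningful.
Insight: The contributions are a large-scale sensor-text QA dataset and a patch-level encoder that lets an LLM natively integrate time-series sensor data, advancing LLMs as interfaces for health sensing that can support downstream clinical decision-making.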
Abstract: Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM’s representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.
[17] Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization cs.CL | cs.AI | cs.LGPDF
Kerem Zaman, Shashank Srivastava
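TL;DR: This paper questions the practice of using the Biasing Features metric to label a chain of thought (CoT) as unfaithful when it does not explicitly mention a prompt-injected hint. It argues that the metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural-language narrative. In multi-hop reasoning experiments with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. The study also shows that larger inference-time token budgets substantially increase hint verbalization (up to 90% in some settings), and that non-verbalized hints can still causally mediate prediction changes through the CoT.
Details
Motivation: Existing work uses the Biasing Features metric to judge faithfulness by whether a CoT explicitly mentions the prompt-injected hint. This paper points out that the metric conflates unfaithfulness with incompleteness and sets out to reassess what faithfulness of CoT explanations actually means.
Result: On multi-hop reasoning tasks with Llama-3 and Gemma-3, more than 50% of the CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics in some models. With the newly proposed faithful@k metric, larger inference-time token budgets raise hint verbalization to as much as 90% in some settings. Causal Mediation Analysis further shows that non-verbalized hints can still causally mediate prediction changes.
Insight: The paper separates unfaithfulness from incompleteness in CoT, shows that a larger token budget improves hint verbalization, and uses Causal Mediation Analysis to expose the causal role of non-verbalized hints. More broadly, it argues that evaluating CoT faithfulness should go beyond a single hint-based metric and draw on a wider interpretability toolkit, including causal mediation and corruption-based metrics.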
Abstract: Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.
[18] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process cs.CL | cs.AIPDF
Zhijun Chen, Zeyu Ji, Qianren Mao, Junhang Cheng, Bangjie Qin
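TL;DR: This paper proposes LLM-PeerReview, an unsupervised LLM ensemble method that uses a peer-review-inspired framework to select, for each query, the most ideal response among candidates generated by multiple LLMs. The method has three stages (scoring, reasoning, and selecting): it scores responses with the LLM-as-a-Judge technique, aggregates the scores with either a graphical model or an averaging strategy, and outputs the highest-scoring response.
Details
Motivation: The goal is to harness the collective wisdom of multiple LLMs with diverse strengths by selecting the best response from several candidates without supervision, improving answer quality and adaptability.
Result: Experiments on four datasets show strong results; the two variants outperform the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
Insight: The novelty is an interpretable, unsupervised, peer-review-inspired ensemble framework that scores with LLM-as-a-Judge and aggregates with graphical-model-based truth inference, enabling flexible and strong response selection without human annotation.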
Abstract: We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages. For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand. For reasoning, we apply either a principled graphical-model-based truth inference algorithm or a straightforward averaging strategy to aggregate the multiple scores into a final score for each response. Finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
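The averaging variant of the three stages can be illustrated with a minimal Python sketch. It assumes the judge scores have already been collected into a matrix; the function name `select_by_peer_review` and the score values are hypothetical and are not taken from the paper, and the graphical-model variant is not shown.

```python
from statistics import mean

def select_by_peer_review(scores):
    """scores[j][i] is the score that judge j gave to candidate response i."""
    n_candidates = len(scores[0])
    # Reasoning stage (averaging variant): mean score per candidate across judges.
    final = [mean(judge[i] for judge in scores) for i in range(n_candidates)]
    # Selection stage: index of the highest final score (ties go to the lowest index).
    best = max(range(n_candidates), key=lambda i: final[i])
    return best, final

# Hypothetical scores from three judges for three candidate responses.
scores = [
    [7.0, 9.0, 6.0],
    [8.0, 8.0, 5.0],
    [6.0, 9.0, 7.0],
]
best, final = select_by_peer_review(scores)
print(best, final)
```

In this toy input the second candidate has the highest mean score, so it would be returned as the ensemble output.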
[19] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents cs.CL | cs.AI | cs.CVPDF
Jiafeng Liang, Hao Li, Chang Li, Jiaqi Zhou, Shixin Jiang
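TL;DR: This survey systematically integrates knowledge about memory systems from cognitive neuroscience and from LLM-driven agents, aiming to bridge the interdisciplinary gap. It explains the definition and function of memory, compares the taxonomy, storage mechanisms, and management lifecycle of biological and artificial memory, reviews mainstream benchmarks for evaluating agent memory, discusses memory security, and outlines future directions such as multimodal memory systems and skill acquisition.
Details
Motivation: Research on autonomous agents increasingly draws on cognitive neuroscience to design efficient memory workflows, but interdisciplinary barriers make it hard to assimilate the essence of human memory mechanisms. This paper aims to close that gap.
Result: The abstract reports no quantitative experiments or benchmark performance; the main contribution is a systematic survey and comparative analysis framework.
Insight: The paper systematically connects memory insights from cognitive neuroscience with LLM-driven agents, offering a cross-disciplinary framework that spans definition, taxonomy, storage, management, evaluation, and security, and it points to multimodal memory and skill acquisition as future research directions.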
Abstract: Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesize interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from the dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.
[20] A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation cs.CLPDF
Xin Zhang, Yang Cao, Baoxing Wu, Xinyi Chen, Kai Song
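TL;DR: This paper proposes SGR, a stepwise reasoning enhancement framework based on external subgraph generation, aimed at improving LLM performance on complex reasoning tasks. The framework dynamically builds query-relevant subgraphs from external knowledge and uses their semantic structure to guide multi-step reasoning, reducing interference from noisy information and improving reasoning accuracy.
Details
Motivation: LLMs still struggle with tasks that require deep reasoning and logical inference. In particular, they may introduce noisy or irrelevant information during generation, leading to incorrect predictions or outputs inconsistent with factual knowledge.
Result: Experiments on multiple benchmark datasets show that SGR consistently outperforms strong baselines, improving the reasoning capabilities of LLMs.
Insight: The novelty is to integrate external knowledge dynamically as structured subgraphs and to combine stepwise, subgraph-grounded reasoning paths, which strengthens logical consistency and factual accuracy. The approach offers a structured guidance mechanism that other knowledge-enhanced reasoning methods could borrow.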
Abstract: Large Language Models (LLMs) have achieved strong performance across a wide range of natural language processing tasks in recent years, including machine translation, text generation, and question answering. As their applications extend to increasingly complex scenarios, however, LLMs continue to face challenges in tasks that require deep reasoning and logical inference. In particular, models trained on large-scale textual corpora may incorporate noisy or irrelevant information during generation, which can lead to incorrect predictions or outputs that are inconsistent with factual knowledge. To address this limitation, we propose a stepwise reasoning enhancement framework for LLMs based on external subgraph generation, termed SGR. The proposed framework dynamically constructs query-relevant subgraphs from external knowledge bases and leverages their semantic structure to guide the reasoning process. By performing reasoning in a step-by-step manner over structured subgraphs, SGR reduces the influence of noisy information and improves reasoning accuracy. Specifically, the framework first generates an external subgraph tailored to the input query, then guides the model to conduct multi-step reasoning grounded in the subgraph, and finally integrates multiple reasoning paths to produce the final answer. Experimental results on multiple benchmark datasets demonstrate that SGR consistently outperforms strong baselines, indicating its effectiveness in enhancing the reasoning capabilities of LLMs.
[21] C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs cs.CLPDF
Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao
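TL;DR: This paper proposes C2PO (Causal-Contrastive Preference Optimization), a framework for jointly diagnosing and mitigating bias shortcuts in LLMs, covering both stereotypical and structural bias. It discovers and suppresses spurious feature correlations in the input during optimization, reducing bias while preserving the model's general reasoning ability.
Details
Motivation: Existing methods usually treat stereotypical bias and structural bias in isolation, and mitigating one often worsens the other. This paper aims to address both by identifying and suppressing the latent spurious feature correlations that drive erroneous reasoning shortcuts.
Result: Across benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K), C2PO effectively mitigates bias while preserving robust general reasoning capabilities.
Insight: The novelty is to combine causal counterfactual signals with a fairness-sensitive preference update mechanism that isolates bias-inducing features and suppresses shortcut features during optimization. This handles multiple bias types in one framework and avoids the trade-off between them seen in earlier methods.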
Abstract: Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.
[22] ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning cs.CLPDF
Yuqi Tang, Jing Yu, Zichang Su, Kehua Feng, Zhihui Zhu
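TL;DR: This paper proposes ClinDEF, a dynamic framework for evaluating LLMs in clinical reasoning. It simulates diagnostic dialogues, dynamically generates patient cases from a disease knowledge graph, and supports multi-turn interaction between an LLM-based doctor and an automated patient agent. The evaluation protocol covers diagnostic accuracy plus fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF exposes critical clinical reasoning gaps in current state-of-the-art LLMs.
Details
Motivation: Existing LLM benchmarks focus on static question answering and do not capture the dynamic, iterative reasoning of clinical diagnosis. Recent methods explore interactive clinical dialogues, but they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation.
Result: Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.
Insight: The novelty is a dynamic evaluation framework grounded in a disease knowledge graph that assesses models through simulated multi-turn diagnostic dialogues and adds fine-grained efficiency analysis and rubric-based quality assessment on top of diagnostic accuracy, giving a more complete evaluation that is closer to real clinical practice.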
Abstract: Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, decide on examinations, and refine the differential diagnosis based on patients' responses. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To mitigate these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.
[23] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale? cs.CL | cs.AIPDF
Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li
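TL;DR: This paper presents UniHetero and uses it to study whether visual generation can enhance visual understanding under large-scale pretraining (more than 200M samples). It finds that generating semantics rather than pixels improves understanding, that generation shows a better data scaling trend and higher data utilization, and that autoregression on input embeddings effectively captures visual details.
Details
Motivation: In unified models for visual understanding and generation, the effect of generation on understanding is still unclear. The paper investigates whether visual generation tasks can strengthen the understanding ability of vision-language large models at large data scale.
Result: Under large-scale pretraining (more than 200M samples), UniHetero improves visual understanding when it generates semantics rather than pixels, and it shows a better data scaling trend and higher data utilization.
Insight: The core finding is that generating semantics (for example text descriptions or high-level features) rather than low-level pixels is what helps visual understanding. Autoregression on input embeddings also models visual detail effectively, offering a data-efficiency perspective for the design of unified models.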
Abstract: Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze a unified model with a concise structure, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but Only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective to capture visual details.
[24] Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias cs.CLPDF
Hazel Kim, Philip Torr
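TL;DR: This paper proposes MoLaCE (Mixture of Latent Concept Experts), a lightweight inference-time framework that addresses input confirmation bias in large language models (LLMs). By mixing experts with different activation strengths over latent concepts, it lets a single LLM internally emulate the benefits of debate, reducing bias and improving robustness, and it can be combined with multi-agent debate frameworks to diversify perspectives.
Details
Motivation: LLMs are highly vulnerable to input confirmation bias: when a prompt implies a preferred answer, the model reinforces that bias instead of exploring alternatives. The problem is already harmful in base models and becomes riskier in multi-agent debate because of echo-chamber effects, so an efficient mitigation is needed.
Result: Experiments show that MoLaCE consistently reduces confirmation bias and improves robustness, matching or surpassing multi-agent debate at a fraction of the computational cost.
Insight: The core idea builds on the compositional nature of language: differently phrased prompts reweight the latent concepts that affect factual correctness in prompt-specific ways, so no single fixed intervention works for all inputs. Mixing experts over these concepts lets one LLM emulate debate internally while staying efficient and scalable, and it gives multi-agent debate a way to reduce correlated errors.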
Abstract: Large language models (LLMs) are highly vulnerable to input confirmation bias. When a prompt implies a preferred answer, models often reinforce that bias rather than explore alternatives. This phenomenon remains underexplored, yet it is already harmful in base models and poses an even greater risk in multi-agent debate, where echo chambers reinforce bias instead of correction. We introduce Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework that addresses confirmation bias by mixing experts instantiated as different activation strengths over latent concepts that shape model responses. Our key insight is that, due to the compositional nature of language, differently phrased prompts reweight latent concepts in prompt-specific ways that affect factual correctness, so no single fixed intervention can be applied universally across inputs. This design enables a single LLM to emulate the benefits of debate internally while remaining computationally efficient and scalable. It can also be integrated into multi-agent debate frameworks to diversify perspectives and reduce correlated errors. We empirically show that it consistently reduces confirmation bias, improves robustness, and matches or surpasses multi-agent debate while requiring only a fraction of the computation.
[25] Instruction-Following Evaluation of Large Vision-Language Models cs.CL | cs.CVPDF
Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka, Jun Suzuki
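TL;DR: This paper studies the decline in instruction-following ability that large vision-language models (LVLMs) show after visual instruction tuning. Using new training datasets that highlight whether the output format is specified, it quantifies and confirms the decline and finds that explicitly indicating the output format helps models follow instructions more accurately.
Details
Motivation: After LLMs are integrated with vision capabilities and tuned on visual instructions, their instruction-following ability drops relative to the original LLM. The paper sets out to measure and explain this drop.
Result: Quantitative evaluation confirms that instruction-following ability declines after fine-tuning. Models trained on datasets that include output-format instructions follow instructions more accurately than models trained without such data.
Insight: Explicitly including samples with output-format instructions in visual instruction tuning data is an effective way to mitigate the decline, a data-level approach to improving alignment in multimodal models.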
Abstract: Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after visual instruction tuning on commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, so they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets that include instructions on output format tend to follow instructions more accurately than models trained without them. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.
[26] Nested Browser-Use Learning for Agentic Information Seeking cs.CL | cs.AI | cs.IR | cs.MAPDF
Baixuan Li, Jialong Wu, Wenbiao Yin, Kuan Li, Zhongwang Zhang
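TL;DR: This paper proposes NestBrowse, which uses a nested browser-action framework to decouple interaction control from page exploration, addressing the limits of existing information-seeking agents in real browsing and improving deep-web information acquisition.
Details
Motivation: Existing information-seeking agents rely mainly on API-level snippet retrieval and URL-based page fetching, which limits access to the richer information available through real browsing. Full browser interaction, however, brings fine-grained control and verbose page content that add substantial complexity for ReAct-style function-calling agents.
Result: On challenging deep information-seeking benchmarks, NestBrowse shows clear practical benefits, and in-depth analyses support its efficiency and flexibility.
Insight: The novelty is a minimal and complete browser-action framework whose nested structure decouples interaction control from page exploration, simplifying agentic reasoning while still supporting effective deep-web information acquisition.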
Abstract: Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.
[27] Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans cs.CLPDF
Sky CH-Wang, Justin Svegliato, Helen Appel, Jason Eisner
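TL;DR: This paper proposes preference supervision from fine-grained human feedback on text spans, fine-tuning language models with feedback-driven improvement chains. Annotators mark "liked" and "disliked" spans in a model response and explain why; the model then rewrites the disliked spans from left to right, producing a sequence of incremental improvements. Preference pairs built from adjacent steps of the chain are used for direct alignment, so the model learns from localized, targeted edits.
Details
Motivation: Direct alignment methods based on standard A/B preference ranking or full contrastive rewrites are relatively inefficient and limited in effect. The paper aims for more efficient and effective preference tuning through structured, revision-based supervision.
Result: The approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, indicating that structured, revision-based supervision yields more efficient and effective preference tuning.
Insight: The novelty is fine-grained span-level feedback combined with improvement chains, which decompose a global preference into a sequence of local edits. The model learns from concrete, step-by-step revisions, giving a finer-grained supervision signal and a more interpretable training process.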
Abstract: We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking "liked" and "disliked" spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
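How an improvement chain turns into preference pairs can be sketched in Python as follows. This is only an illustration under stated assumptions: in the paper the base model performs each rewrite, whereas here a plain string substitution stands in for it, and the helper names `rewrite_chain` and `adjacent_preference_pairs`, the example text, and the `chosen`/`rejected` keys are all hypothetical.

```python
def rewrite_chain(response, edits):
    """Apply (disliked_span, replacement) edits in order, keeping every version.

    Assumption: string substitution stands in for the model's rewrite step,
    and `edits` is already ordered left to right.
    """
    chain = [response]
    current = response
    for disliked, replacement in edits:
        current = current.replace(disliked, replacement, 1)
        chain.append(current)
    return chain

def adjacent_preference_pairs(chain):
    # Each step is preferred over the step immediately before it.
    return [
        {"chosen": later, "rejected": earlier}
        for earlier, later in zip(chain, chain[1:])
    ]

response = "The capital is Lyon. It is very big."
edits = [("Lyon", "Paris"), ("very big", "the country's largest city")]
chain = rewrite_chain(response, edits)
pairs = adjacent_preference_pairs(chain)
for pair in pairs:
    print(pair)
```

A chain with n rewrites yields n pairs, and the two responses in each pair differ only in one span, which is what makes the supervision signal localized.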
cs.CV [Back]
[28] Characterizing Motion Encoding in Video Diffusion Timesteps cs.CV | cs.AI | eess.IVPDF
Vatsal Baherwani, Yixuan Ren, Abhinav Shrivastava
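TL;DR: This paper systematically studies how motion encoding is distributed across timesteps in text-to-video diffusion models. A large-scale quantitative analysis shows that early timesteps dominate motion and layout while later timesteps dominate appearance, and the paper uses this finding to simplify motion customization.
Details
Motivation: How video diffusion models encode motion is poorly understood. The paper aims to turn a heuristic widely used in practice (early timesteps control motion, later timesteps refine appearance) into a quantifiable spatiotemporal disentanglement principle.
Result: Quantitative experiments across diverse architectures consistently identify an early, motion-dominant regime and a later, appearance-dominant regime. The resulting timestep-constrained training recipe achieves strong motion transfer without auxiliary debiasing modules or specialized objectives.
Insight: The novelty is to proxy motion encoding by the trade-off between appearance editing and motion preservation, and to map that trade-off quantitatively in timestep space so that motion and appearance can be analyzed separately. The timestep-constrained recipe can be integrated directly into existing motion transfer and editing methods, simplifying customization.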
Abstract: Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space. Building on this characterization, we simplify the current one-shot motion customization paradigm by restricting training and inference to the motion-dominant regime, achieving strong motion transfer without auxiliary debiasing modules or specialized objectives. Our analysis turns a widely used heuristic into a spatiotemporal disentanglement principle, and our timestep-constrained recipe can be readily integrated into existing motion transfer and editing methods.
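Restricting training to the motion-dominant regime amounts to drawing training timesteps from only part of the schedule. The Python sketch below assumes the common indexing convention in which index `num_timesteps - 1` is the noisiest step, so that "early" denoising steps have the highest indices; the function name and the boundary value of 700 are hypothetical, since the abstract does not give the measured boundary.

```python
import random

def sample_motion_timestep(num_timesteps, boundary, rng):
    """Draw a training timestep from the early-denoising (high-noise) range.

    Assumption: denoising runs from index num_timesteps - 1 down to 0, so the
    motion-dominant regime is [boundary, num_timesteps).
    """
    return rng.randrange(boundary, num_timesteps)

rng = random.Random(0)
# Hypothetical schedule of 1000 steps with a motion-appearance boundary at 700.
draws = [sample_motion_timestep(1000, 700, rng) for _ in range(5)]
print(draws)
```

A standard training loop would instead sample uniformly over the whole schedule; the only change sketched here is the narrowed sampling range.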
[29] Real-Time American Sign Language Recognition Using 3D Convolutional Neural Networks and LSTM: Architecture, Training, and Deployment cs.CVPDF
Dawnena Key
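TL;DR: This paper presents a real-time American Sign Language (ASL) recognition system built on a hybrid deep learning architecture that combines 3D convolutional neural networks with long short-term memory networks. It processes webcam video streams to recognize word-level ASL signs, with the goal of reducing communication barriers for more than 70 million deaf and hard-of-hearing people worldwide.
Details
Motivation: Deaf and hard-of-hearing people face real-time communication barriers. The paper applies deep learning to achieve efficient, accurate sign recognition in support of accessible communication.
Result: Trained on the WLASL dataset (2,000 common words), the ASL-LEX lexical database (about 2,700 signs), and a curated set of 100 expert-annotated ASL signs, the system achieves F1-scores between 0.71 and 0.99 across sign classes. It is deployed on AWS infrastructure with edge deployment on OAK-D cameras for real-time inference.
Insight: The hybrid architecture uses a 3D CNN to extract spatial-temporal features and an LSTM to model sequential dependencies, capturing the dynamics of sign gestures. The real-time deployment and training across several data sources give the system practical value as a basis for sign recognition applications.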
Abstract: This paper presents a real-time American Sign Language (ASL) recognition system utilizing a hybrid deep learning architecture combining 3D Convolutional Neural Networks (3D CNN) with Long Short-Term Memory (LSTM) networks. The system processes webcam video streams to recognize word-level ASL signs, addressing communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide. Our architecture leverages 3D convolutions to capture spatial-temporal features from video frames, followed by LSTM layers that model sequential dependencies inherent in sign language gestures. Trained on the WLASL dataset (2,000 common words), ASL-LEX lexical database (~2,700 signs), and a curated set of 100 expert-annotated ASL signs, the system achieves F1-scores ranging from 0.71 to 0.99 across sign classes. The model is deployed on AWS infrastructure with edge deployment capability on OAK-D cameras for real-time inference. We discuss the architecture design, training methodology, evaluation metrics, and deployment considerations for practical accessibility applications.
[30] Unbiased Visual Reasoning with Controlled Visual Inputs cs.CV | cs.AI | cs.CLPDF
Zhaonan Li, Shijie Lu, Fei Wang, Jacob Dineen, Xiao Ye
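TL;DR: This paper proposes VISTA, a framework that decouples visual perception from reasoning through an explicit information bottleneck, addressing the tendency of end-to-end vision-language models (VLMs) to rely on spurious correlations when answering visual questions. A frozen VLM sensor handles short, objective perception queries, while a text-only LLM reasoner decomposes the question, plans queries, and aggregates visual facts in natural language. The resulting interface is used as a reinforcement learning environment for training unbiased visual reasoning.
Details
Motivation: End-to-end VLMs often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and they become more shortcut-prone after fine-tuning. A method is needed to make visual reasoning more robust and less biased.
Result: On the SpuriVerse benchmark, VISTA significantly improves robustness (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B) while remaining competitive on MMVP and a balanced SeedBench subset. It transfers robustly to unseen VLM sensors and can recognize and recover from VLM perception failures.
Insight: The novelty is a modular design (perception decoupled from reasoning) with a controlled visual input interface, trained with reinforcement learning (GRPO). This reduces reliance on spurious attributes and makes reasoning more neutral and more explicitly grounded in visual evidence, pointing toward more reliable and interpretable visual reasoning systems.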
Abstract: End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA’s reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.
[31] SAMM2D: Scale-Aware Multi-Modal 2D Dual-Encoder for High-Sensitivity Intracranial Aneurysm Screening cs.CVPDF
Antara Titikhsha, Divyanshu Tak
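TL;DR: This paper proposes SAMM2D, a scale-aware multi-modal dual-encoder framework for intracranial aneurysm screening. The model reaches an AUC of 0.686 on the RSNA dataset, a 32% improvement over the clinical baseline, and achieves 95% sensitivity after calibrating the decision threshold, surpassing average radiologist performance. A key finding is that, with a strong pretrained backbone, every form of data augmentation tested degraded performance, overturning the assumption that "more augmentation is always better" in low-data medical settings.
Details
Motivation: Aneurysm detection is difficult because of the subtle morphology of aneurysms, pronounced class imbalance, and the scarcity of annotated data.
Result: On the RSNA intracranial aneurysm dataset the AUC is 0.686, 32% above the clinical baseline. With a calibrated threshold, sensitivity reaches 95%, surpassing average radiologist performance, with projected savings of $13.9M per 1,000 patients screened. Grad-CAM visualizations show that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations).
Insight: The core contribution is a dual-encoder framework that is multi-modal (for example, combining different imaging sequences) and scale-aware. The more important observation is that with a strong ImageNet-pretrained backbone, additional data augmentation can become redundant or even harmful. This challenges common practice in medical image analysis and stresses high-quality pretrained features over complex augmentation pipelines.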
Abstract: Effective aneurysm detection is essential to avert life-threatening hemorrhages, but it remains challenging due to the subtle morphology of the aneurysm, pronounced class imbalance, and the scarcity of annotated data. We introduce SAMM2D, a dual-encoder framework that achieves an AUC of 0.686 on the RSNA intracranial aneurysm dataset, an improvement of 32% over the clinical baseline. In a comprehensive ablation across six augmentation regimes, we made a striking discovery: any form of data augmentation degraded performance when coupled with a strong pretrained backbone. Our unaugmented baseline model outperformed all augmented variants by 1.75–2.23 percentage points (p < 0.01), overturning the assumption that “more augmentation is always better” in low-data medical settings. We hypothesize that ImageNet-pretrained features already capture robust invariances, rendering additional augmentations both redundant and disruptive to the learned feature manifold. By calibrating the decision threshold, SAMM2D reaches 95% sensitivity, surpassing average radiologist performance, which translates to a projected $13.9M in savings per 1,000 patients in screening applications. Grad-CAM visualizations confirm that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations), demonstrating the model’s clinically meaningful focus. Our results suggest that future medical imaging workflows could benefit more from strong pretraining than from increasingly complex augmentation pipelines.
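Calibrating a decision threshold to a target sensitivity can be sketched in Python as below. The abstract does not describe the calibration procedure, so this is one common approach and should be read as an assumption: pick the largest threshold that still catches the required share of positive cases on a held-out set. The function names and the toy scores are hypothetical.

```python
import math

def threshold_for_sensitivity(scores, labels, target=0.95):
    """Largest threshold t such that 'score >= t' recovers at least
    `target` of the positive cases."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    needed = math.ceil(target * len(positives))  # positives that must be caught
    return positives[needed - 1]

def sensitivity(scores, labels, threshold):
    # True positive rate at the given threshold.
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return true_pos / sum(labels)

# Toy held-out scores: five positive cases followed by three negatives.
scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 1, 1, 0, 0, 0]
t = threshold_for_sensitivity(scores, labels, target=0.8)
print(t, sensitivity(scores, labels, t))
```

Lowering the threshold raises sensitivity at the cost of more false positives, which is why the abstract reports the 95% sensitivity figure separately from the AUC.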
[32] HookMIL: Revisiting Context Modeling in Multiple Instance Learning for Computational Pathology cs.CV | cs.AIPDF
Xitong Ling, Minxi Ouyang, Xiaoxiao Li, Jiawen Li, Ying Chen
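TL;DR: This paper proposes HookMIL, a multiple instance learning (MIL) framework for weakly supervised analysis of whole-slide images (WSIs) in computational pathology. It introduces learnable hook tokens for structured contextual aggregation, addressing the loss of context in traditional MIL and the high computational cost of transformer-based variants. Hook tokens can be initialized from visual features, text embeddings, and spatial transcriptomics-vision models, and they interact with instances through bidirectional attention with linear complexity. A Hook Diversity Loss and a hook-to-hook communication mechanism promote specialization and reduce redundancy.
Details
Motivation: Traditional MIL methods in computational pathology often lose crucial contextual information, while transformer-based variants are more expressive but suffer from quadratic complexity and redundant computation. The paper aims for an MIL framework that models context effectively and remains computationally efficient.
Result: Extensive experiments on four public pathology datasets show that HookMIL achieves state-of-the-art (SOTA) performance with improved computational efficiency and interpretability.
Insight: The novelty is structured contextual aggregation through learnable hook tokens, with multimodal initialization that brings in rich textual and spatial priors. Linear-complexity bidirectional attention, the Hook Diversity Loss, and hook-to-hook communication increase expressiveness and specialization while keeping computation efficient.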
Abstract: Multiple Instance Learning (MIL) has enabled weakly supervised analysis of whole-slide images (WSIs) in computational pathology. However, traditional MIL approaches often lose crucial contextual information, while transformer-based variants, though more expressive, suffer from quadratic complexity and redundant computations. To address these limitations, we propose HookMIL, a context-aware and computationally efficient MIL framework that leverages compact, learnable hook tokens for structured contextual aggregation. These tokens can be initialized from (i) key-patch visual features, (ii) text embeddings from vision-language pathology models, and (iii) spatially grounded features from spatial transcriptomics-vision models. This multimodal initialization enables hook tokens to incorporate rich textual and spatial priors, accelerating convergence and enhancing representation quality. During training, hook tokens interact with instances through bidirectional attention with linear complexity. To further promote specialization, we introduce a Hook Diversity Loss that encourages each token to focus on distinct histopathological patterns. Additionally, a hook-to-hook communication mechanism refines contextual interactions while minimizing redundancy. Extensive experiments on four public pathology datasets demonstrate that HookMIL achieves state-of-the-art performance, with improved computational efficiency and interpretability. Code is available at https://github.com/lingxitong/HookMIL.
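The linear cost of hook-instance attention follows from routing all context through a small, fixed number k of hook tokens: with n instances, each direction computes only n × k attention scores instead of the n × n of full self-attention. The pure-Python sketch below illustrates that counting argument only. It omits the learned projections, the Hook Diversity Loss, and hook-to-hook communication, and the function names are hypothetical rather than taken from the HookMIL code.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Scaled dot-product attention; computes len(queries) * len(keys_values) scores."""
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, kv)) / scale for kv in keys_values]
        w = softmax(logits)
        out.append([sum(wi * kv[d] for wi, kv in zip(w, keys_values)) for d in range(dim)])
    return out

def hook_exchange(hooks, instances):
    # Hooks gather context from all instances: k x n scores.
    new_hooks = attend(hooks, instances)
    # Instances read the summarized context back from the hooks: n x k scores.
    new_instances = attend(instances, new_hooks)
    return new_hooks, new_instances

# Toy bag: 5 patch features of dimension 3, summarized through 2 hook tokens.
instances = [[0.1 * i, 0.2, -0.1 * i] for i in range(5)]
hooks = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
new_hooks, new_instances = hook_exchange(hooks, instances)
print(len(new_hooks), len(new_instances))
```

Because k stays fixed as the slide grows, the total work scales linearly with the number of patches n.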
[33] Quadrant Segmentation VLM with Few-Shot Adaptation and OCT Learning-based Explainability Methods for Diabetic Retinopathy cs.CVPDF
Shivum Telang
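TL;DR: This paper proposes a multimodal explainability model for diabetic retinopathy (DR) diagnosis that combines a vision-language model (VLM) with few-shot learning and mimics an ophthalmologist's reasoning by analyzing lesion distributions within retinal quadrants of fundus images. The model generates paired Grad-CAM heatmaps that visualize, on both OCT and fundus images, the regions that drive DR severity classification, aiming to explain more than lesion location alone.
Details
Motivation: Current AI models for DR diagnosis offer limited explainability. They typically explain their output through a single imaging modality (for example lesion segmentation), and manual lesion annotation is impractical for clinicians. Physicians need a model that explains the reasoning behind a classification rather than just highlighting lesions, to support screening, treatment, and research.
Result: The model is developed on a dataset of 3,000 fundus images and 1,000 OCT images and produces cross-modal Grad-CAM heatmaps that highlight the regions contributing to DR severity classification. The abstract reports no quantitative metrics (such as accuracy) and no direct comparison with existing SOTA models.
Insight: The novelty is to combine a VLM with few-shot learning and a retinal-quadrant analysis framework that mirrors clinical reasoning. Paired Grad-CAM heatmaps on OCT and fundus images provide multimodal explainability, going beyond single-modality explanations and offering a more practical tool for DR diagnosis.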
Abstract: Diabetic Retinopathy (DR) is a leading cause of vision loss worldwide, requiring early detection to preserve sight. Limited access to physicians often leaves DR undiagnosed. To address this, AI models utilize lesion segmentation for interpretability; however, manually annotating lesions is impractical for clinicians. Physicians require a model that explains the reasoning for classifications rather than just highlighting lesion locations. Furthermore, current models are one-dimensional, relying on a single imaging modality for explainability and achieving limited effectiveness. In contrast, a quantitative-detection system that identifies individual DR lesions in natural language would overcome these limitations, enabling diverse applications in screening, treatment, and research settings. To address this issue, this paper presents a novel multimodal explainability model utilizing a VLM with few-shot learning, which mimics an ophthalmologist’s reasoning by analyzing lesion distributions within retinal quadrants for fundus images. The model generates paired Grad-CAM heatmaps, showcasing individual neuron weights across both OCT and fundus images, which visually highlight the regions contributing to DR severity classification. Using a dataset of 3,000 fundus images and 1,000 OCT images, this innovative methodology addresses key limitations in current DR diagnostics, offering a practical and comprehensive tool for improving patient outcomes.
[34] TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting cs.CV | cs.AIPDF
Qiang Guo, Rubo Zhang, Bingbing Zhang, Junjie Liu, Jianqing Liu
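TL;DR: This paper proposes TCFormer, a lightweight weakly-supervised transformer framework for crowd counting with only 5 million parameters. Using a Learnable Density-Weighted Averaging module and a density-level classification loss, it achieves performance competitive with fully supervised methods while training only on image-level global counts.
Details
Motivation: Crowd counting typically depends on dense point-level annotations and computationally heavy backbones. The paper aims to remove both requirements to improve scalability and deployment in resource-constrained environments.
Result: Experiments on four benchmarks (ShanghaiTech A/B, UCF-QNRF, and NWPU) show a superior trade-off between parameter efficiency and counting accuracy, making the method suitable for edge devices.
Insight: The contributions are a lightweight ViT that extracts global context features; a Learnable Density-Weighted Averaging module that dynamically re-weights local tokens by predicted density to improve spatial awareness; and a density-level classification loss that discretizes density into grades to help the model distinguish density levels. Together they allow accurate counting under weak supervision.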
Abstract: Crowd counting typically relies on labor-intensive point-level annotations and computationally intensive backbones, restricting its scalability and deployment in resource-constrained environments. To address these challenges, this paper proposes TCFormer, a tiny, ultra-lightweight, weakly-supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance. Firstly, a powerful yet efficient vision transformer is adopted as the feature extractor, whose global context-aware capabilities provide semantically meaningful crowd features with a minimal memory footprint. Secondly, to compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module. This module dynamically re-weights local tokens according to predicted density scores, enabling the network to adaptively modulate regional features based on their specific density characteristics without the need for additional annotations. Furthermore, this paper introduces a density-level classification loss, which discretizes crowd density into distinct grades, thereby regularizing the training process and enhancing the model's classification power across varying levels of crowd density. Therefore, although TCFormer is trained under a weakly-supervised paradigm utilizing only image-level global counts, the joint optimization of count and density-level losses enables the framework to achieve high estimation accuracy. Extensive experiments on four benchmarks (ShanghaiTech A/B, UCF-QNRF, and NWPU) demonstrate that our approach strikes a superior trade-off between parameter efficiency and counting accuracy and can be a good solution for crowd counting tasks on edge devices.
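The two mechanisms named in the TCFormer abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the weights are a softmax over per-token density scores, and the function names, the `temperature` parameter, and the grade `boundaries` are hypothetical.

```python
import math


def density_weighted_average(tokens, density_scores, temperature=1.0):
    """Pool local token features into one vector.

    Assumption: each token's weight is the softmax of its predicted
    density score; the paper's actual weighting may differ.
    """
    if not tokens or len(tokens) != len(density_scores):
        raise ValueError("need one density score per token")
    scaled = [s / temperature for s in density_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


def density_level(count, boundaries):
    """Map a global count to a discrete density grade (hypothetical boundaries)."""
    return sum(count >= b for b in boundaries)
```

With equal density scores the pooling reduces to a plain mean; higher scores shift the pooled feature toward the corresponding tokens. The grade returned by `density_level` could serve as the target of a classification loss trained jointly with the count loss.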
[35] VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition cs.CV | cs.AIPDF
Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Fadi Dornaika, Cosimo Distante, Abdenour Hadid
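TL;DR: This paper proposes VLM-PAR, a modular vision-language framework for pedestrian attribute recognition built on frozen SigLIP 2 multilingual encoders. It refines visual features through a compact cross-attention fusion to align image and prompt embeddings, addressing severe class imbalance, complex attribute co-dependencies, and domain shift.
Details
Motivation: Pedestrian attribute recognition is bottlenecked by severe class imbalance, complex dependencies between attributes, and domain shift.
Result: The method achieves a significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new SOTA, and also delivers significant gains in mean accuracy on the PETA and Market-1501 benchmarks.
Insight: The novelty lies in combining large-scale vision-language pretraining with targeted cross-modal refinement to overcome the imbalance and generalization challenges of PAR. Viewed objectively, the modular design and the effective use of frozen encoders form a lightweight, efficient strategy worth borrowing.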
Abstract: Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. By refining visual features through a compact cross-attention fusion that aligns image and prompt embeddings, VLM-PAR achieves a significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state of the art, while also delivering significant gains in mean accuracy across the PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in PAR.
[36] Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark cs.CV | cs.MMPDF
Hieu Minh Nguyen, Tam Le-Thanh Dang, Kiet Van Nguyen
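TL;DR: This paper introduces ViSignVQA, the first large-scale visual question answering dataset for Vietnamese signboards, comprising 10,762 images and 25,573 question-answer pairs. The study adapts SOTA VQA models by integrating a Vietnamese OCR model and a Vietnamese pretrained language model, and proposes a multi-agent VQA framework that combines perception and reasoning agents with GPT-4, reaching 75.98% accuracy.
Details
Motivation: Understanding signboard text in natural scenes remains underexplored, particularly for low-resource languages such as Vietnamese.
Result: Appending OCR text to the question improves F1 score by up to 209%. The proposed multi-agent framework reaches 75.98% accuracy via majority voting.
Insight: The novelty lies in building the first large-scale multimodal VQA dataset for Vietnamese signboards and in showing how much domain-specific resources (such as integrated OCR) matter for text-based VQA in low-resource languages. The paper also proposes a multi-agent framework combining perception and reasoning as a baseline method.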
Abstract: Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework combining perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark capturing real-world scene text characteristics and supporting the development and evaluation of OCR-integrated VQA models in Vietnamese.
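The majority-voting step mentioned in the ViSignVQA abstract can be sketched as follows. The abstract does not say what is voted over or how ties are handled, so this sketch assumes each agent returns a short answer string, normalizes by trimming and lowercasing, and breaks ties in favor of the earliest answer; `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent normalized answer; ties go to the earliest one."""
    if not answers:
        raise ValueError("no answers to vote on")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    top = max(counts.values())
    # Scan in the original order so ties resolve to the first answer given.
    for answer in normalized:
        if counts[answer] == top:
            return answer
```

A stricter pipeline would probably normalize diacritics and punctuation as well before counting.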
[37] On Extending Semantic Abstraction for Efficient Search of Hidden Objects cs.CV | cs.AI | cs.ROPDF
Tasha Pais, Nikhilesh Belulkar
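TL;DR: This paper presents a method, built on the Semantic Abstraction framework, for efficiently localizing and completing the 3D position of hidden objects, defined as objects that are at least partially occluded and cannot be identified directly by a vision-language model. It uses historical placement data to improve unstructured search so that household robots can find lost objects faster.
Details
Motivation: Vision-language models (VLMs) struggle to identify occluded objects directly. Extending the Semantic Abstraction framework lets a robot localize hidden objects more efficiently, saving search time and effort.
Result: The model accurately identifies the complete 3D location of a hidden object on the first try and is significantly faster than random search. No specific benchmarks or SOTA comparisons are mentioned.
Insight: The novelty lies in treating the relevancy maps of Semantic Abstraction as "abstract object" representations and using historical placement data to optimize the search, giving robot vision tasks a data-driven approach to efficient localization.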
Abstract: Semantic Abstraction's key observation is that 2D VLMs' relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as "abstract object" representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try, significantly faster than a naive random search. These extensions to Semantic Abstraction aim to give household robots the skills necessary to save time and effort when looking for lost objects.
[38] VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs cs.CV | cs.AI | cs.CL | cs.LGPDF
Naishan Zheng, Jie Huang, Qingpei Guo, Feng Zhao
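TL;DR: This paper proposes VideoScaffold, a dynamic representation framework for streaming video understanding. Through two core components, Elastic-Scale Event Segmentation (EES) and Hierarchical Event Consolidation (HEC), it adaptively adjusts event granularity while preserving fine-grained visual semantics, seamlessly extending image-based MLLMs to continuous video understanding.
Details
Motivation: Existing long-video understanding methods based on multimodal large language models (MLLMs) face heavy inter-frame redundancy and the need for temporally coherent representations. Static strategies such as sparse sampling and frame compression produce fragmented or over-compressed outputs in streaming scenarios.
Result: Extensive experiments on both offline and streaming video understanding benchmarks show that VideoScaffold achieves state-of-the-art (SOTA) performance.
Insight: The main novelty is a dynamic, modular, plug-and-play framework that uses prediction-guided segmentation and progressive semantic aggregation to move smoothly from fine-grained frame understanding to abstract event reasoning, addressing adaptive representation for streaming video.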
Abstract: Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at https://github.com/zheng980629/VideoScaffold.
[39] KAN-FPN-Stem: A KAN-Enhanced Feature Pyramid Stem for Boosting ViT-based Pose Estimation cs.CVPDF
HaoNan Tang
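TL;DR: This paper proposes KAN-FPN-Stem, a new architecture for improving pose estimation models based on Vision Transformers (ViT). Its core idea is to replace the standard linear 3x3 smoothing convolution at the end of the Feature Pyramid Network (FPN) fusion stream with a convolutional layer based on Kolmogorov-Arnold Networks (KAN), which adaptively learns and rectifies the artifacts produced during multi-scale fusion. This addresses the poor handling of multi-scale variation and the information loss caused by the simple patchification in ViT front ends.
Details
Motivation: The front-end design of existing ViT-based dense prediction models such as ViTPose is overly simple. Its naive patchification struggles with multi-scale variation and causes irreversible information loss during initial feature extraction, which limits further performance gains.
Result: Extensive experiments on the COCO dataset show that the KAN-FPN-Stem module yields a gain of up to +2.0 AP over the lightweight ViTPose-S baseline.
Insight: The paper claims to reveal that the performance bottleneck of the ViT front end usually lies not in "feature refinement" (attention) but in the quality of "feature fusion", and it offers the KAN operator as an effective way to address that bottleneck. Viewed objectively, the core novelty is integrating KAN's non-linear modeling capacity into the classic FPN fusion pipeline in place of a conventional linear convolution so that fused features are adaptively refined, which is a concrete and reusable architectural idea.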
Abstract: Vision Transformers (ViT) have demonstrated significant promise in dense prediction tasks such as pose estimation. However, their performance is frequently constrained by the overly simplistic front-end designs employed in models like ViTPose. This naive patchification mechanism struggles to effectively handle multi-scale variations and results in irreversible information loss during the initial feature extraction phase. To overcome this limitation, we introduce a novel KAN-enhanced FPN-Stem architecture. Through rigorous ablation studies, we first identified that the true bottleneck for performance improvement lies not in plug-and-play attention modules (e.g., CBAM), but in the post-fusion non-linear smoothing step within the FPN. Guided by this insight, our core innovation is to retain the classic “upsample-and-add” fusion stream of the FPN, but replace its terminal, standard linear 3x3 smoothing convolution with a powerful KAN-based convolutional layer. Leveraging its superior non-linear modeling capabilities, this KAN-based layer adaptively learns and rectifies the “artifacts” generated during the multi-scale fusion process. Extensive experiments on the COCO dataset demonstrate that our KAN-FPN-Stem achieves a significant performance boost of up to +2.0 AP over the lightweight ViTPose-S baseline. This work not only delivers a plug-and-play, high-performance module but, more importantly, reveals that: the performance bottleneck in ViT front-end often lies not in ‘feature refinement’ (Attention), but in the quality of ‘feature fusion’ (Fusion). Furthermore, it provides an effective path to address this bottleneck through the introduction of the KAN operator.
[40] Multi-objective hybrid knowledge distillation for efficient deep learning in smart agriculture cs.CV | cs.AIPDF
Phi-Hung Hoang, Nam-Thuan Trinh, Van-Manh Tran, Thi-Thu-Hong Phan
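TL;DR: This study proposes a multi-objective hybrid knowledge distillation framework for smart agriculture, aimed at building lightweight yet high-performance convolutional neural networks. It designs a customized student model that combines inverted residual blocks with dense connectivity and trains it under the guidance of a ResNet18 teacher using a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Experiments on a rice seed variety identification dataset with nine varieties, extended to four plant leaf disease datasets (rice, potato, coffee, and corn), show that the method keeps accuracy high while substantially reducing computational cost and model size.
Details
Motivation: Deploying deep learning models on resource-constrained edge devices forces a trade-off between computational efficiency and recognition accuracy. The work addresses this trade-off to support smart agriculture applications.
Result: On rice seed variety classification, the distilled student reaches 98.56% accuracy (teacher: 98.65%) while requiring only 0.68 GFLOPs and about 1.07 million parameters, a reduction of about 2.7 times in computational cost and more than 10 times in model size. Relative to DenseNet121 and the Vision Transformer (ViT), it reduces the parameter count by more than 6 times and more than 80 times respectively, with comparable or better classification accuracy. Consistent gains on multiple plant leaf disease datasets further support its robustness and deployment potential.
Insight: The novelty lies in a customized student architecture that combines inverted residual blocks with dense connectivity, trained with a multi-objective hybrid distillation strategy that integrates hard-label, feature-level, response-level, and self-distillation terms. This offers a reusable framework for building efficient, lightweight, high-performance models in resource-constrained settings, particularly for agricultural image recognition.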
Abstract: Deploying deep learning models on resource-constrained edge devices remains a major challenge in smart agriculture due to the trade-off between computational efficiency and recognition accuracy. To address this challenge, this study proposes a hybrid knowledge distillation framework for developing a lightweight yet high-performance convolutional neural network. The proposed approach designs a customized student model that combines inverted residual blocks with dense connectivity and trains it under the guidance of a ResNet18 teacher network using a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Experiments are conducted on a rice seed variety identification dataset containing nine varieties and further extended to four plant leaf disease datasets, including rice, potato, coffee, and corn, to evaluate generalization capability. On the rice seed variety classification task, the distilled student model achieves an accuracy of 98.56%, which is only 0.09% lower than the teacher model (98.65%), while requiring only 0.68 GFLOPs and approximately 1.07 million parameters. This corresponds to a reduction of about 2.7 times in computational cost and more than 10 times in model size compared with the ResNet18 teacher model. In addition, compared with representative pretrained models, the proposed student reduces the number of parameters by more than 6 times relative to DenseNet121 and by over 80 times compared with the Vision Transformer (ViT) architecture, while maintaining comparable or superior classification accuracy. Consistent performance gains across multiple plant leaf disease datasets further demonstrate the robustness, efficiency, and strong deployment potential of the proposed framework for hardware-limited smart agriculture systems.
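A minimal sketch of how the four objectives named in the distillation abstract might be combined into a single loss. The abstract gives no formulas, so the specific terms here are assumptions for illustration: cross-entropy for the hard label, temperature-scaled KL divergence for response-level distillation, mean squared error for feature-level distillation, and a KL term toward an auxiliary student head for self-distillation. The `weights` and `temperature` values and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hybrid_kd_loss(student_logits, teacher_logits, label,
                   student_feat, teacher_feat, aux_logits,
                   weights=(1.0, 1.0, 1.0, 1.0), temperature=4.0):
    """Weighted sum of four loss terms (all forms assumed, not from the paper)."""
    t2 = temperature * temperature
    hard = cross_entropy(student_logits, label)
    response = t2 * kl_divergence(softmax(teacher_logits, temperature),
                                  softmax(student_logits, temperature))
    feature = mse(student_feat, teacher_feat)
    # Assumed self-distillation: an auxiliary (shallower) student head is
    # pulled toward the student's own final prediction.
    self_distill = t2 * kl_divergence(softmax(student_logits, temperature),
                                      softmax(aux_logits, temperature))
    terms = (hard, response, feature, self_distill)
    return sum(w * t for w, t in zip(weights, terms))
```

When the student already matches the teacher in both logits and features, and the auxiliary head matches the final head, only the hard-label term remains.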
[41] Evaluating an Adaptive Multispectral Turret System for Autonomous Tracking Across Variable Illumination Conditions cs.CV | cs.LG | cs.ROPDF
Aahan Sachdeva, Dhanvinkumar Ganeshkumar, James E. Gallagher, Tyler Treat, Edward J. Oughton
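TL;DR: This paper presents an adaptive multispectral turret system for autonomous tracking under variable illumination. The system fuses RGB and long-wave infrared (LWIR) video streams and dynamically selects the best detection model for each illumination condition, overcoming the poor low-light performance of conventional RGB detection and the lack of color and texture information in thermal imaging.
Details
Motivation: For autonomous robotic platforms used in emergency services (such as search and rescue and reconnaissance), conventional RGB detection performs poorly in low light, while thermal systems lack color and texture information.
Result: 33 YOLO models were trained across three light levels (no-light, dim-light, full-light). The best full-light model (80/20 RGB-LWIR fusion) and the best dim-light model (90/10 fusion) reached mean confidences of 92.8% and 92.0% respectively, significantly outperforming the YOLOv5n and YOLOv11n baselines. Under no-light conditions, the best 40/60 fusion reached 71.0%, which also exceeded the baselines, though not with statistical significance.
Insight: The novelty is an adaptive multimodal fusion framework that adjusts the RGB-to-LWIR fusion ratio to the illumination condition, improving detection confidence and reliability and making autonomous robotic vision more robust under difficult lighting.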
Abstract: Autonomous robotic platforms are playing a growing role across the emergency services sector, supporting missions such as search and rescue operations in disaster zones and reconnaissance. However, traditional red-green-blue (RGB) detection pipelines struggle in low-light environments, and thermal-based systems lack color and texture information. To overcome these limitations, we present an adaptive framework that fuses RGB and long-wave infrared (LWIR) video streams at multiple fusion ratios and dynamically selects the optimal detection model for each illumination condition. We trained 33 You Only Look Once (YOLO) models on over 22,000 annotated images spanning three light levels: no-light (<10 lux), dim-light (10-1000 lux), and full-light (>1000 lux). To integrate both modalities, fusion was performed by blending aligned RGB and LWIR frames at eleven ratios, from full RGB (100/0) to full LWIR (0/100) in 10% increments. Evaluation showed that the best full-light model (80/20 RGB-LWIR) and dim-light model (90/10 fusion) achieved 92.8% and 92.0% mean confidence; both significantly outperformed the YOLOv5 nano (YOLOv5n) and YOLOv11 nano (YOLOv11n) baselines. Under no-light conditions, the top 40/60 fusion reached 71.0%, exceeding baselines though not statistically significant. Adaptive RGB-LWIR fusion improved detection confidence and reliability across all illumination conditions, enhancing autonomous robotic vision performance.
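The fusion scheme in the turret abstract (eleven blend ratios in 10% steps, then a per-illumination choice) can be sketched in a few lines. The lux thresholds and the best ratios per light level are taken from the abstract; the per-pixel linear blend, the treatment of exactly 1000 lux as dim light, and all function names are assumptions for illustration, and frames are assumed to be aligned single-channel grids of 0 to 255 values.

```python
def fusion_ratios(step=10):
    """All (RGB %, LWIR %) pairs from 100/0 down to 0/100."""
    return [(rgb, 100 - rgb) for rgb in range(100, -1, -step)]


def blend_pixel(rgb_value, lwir_value, rgb_percent):
    """Linear blend of two aligned pixel intensities (assumed blending rule)."""
    return round((rgb_percent * rgb_value + (100 - rgb_percent) * lwir_value) / 100)


def blend_frame(rgb_frame, lwir_frame, rgb_percent):
    """Blend two aligned frames given as nested lists of intensities."""
    return [
        [blend_pixel(r, l, rgb_percent) for r, l in zip(rgb_row, lwir_row)]
        for rgb_row, lwir_row in zip(rgb_frame, lwir_frame)
    ]


def select_rgb_percent(lux):
    """Pick the RGB share reported as best for each light level."""
    if lux < 10:        # no-light: best reported fusion is 40/60
        return 40
    if lux <= 1000:     # dim-light: best reported fusion is 90/10
        return 90
    return 80           # full-light: best reported fusion is 80/20
```

In a deployed system the selected ratio would also determine which of the trained detection models is loaded, since each model was trained on frames blended at one ratio.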
[42] GeCo: A Differentiable Geometric Consistency Metric for Video Generation cs.CVPDF
Leslie Gu, Junhwa Hur, Charles Herrmann, Fangneng Zhan, Todd Zickler
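TL;DR: This paper introduces GeCo, a differentiable, geometry-grounded consistency metric for detecting geometric deformation and occlusion-inconsistency artifacts in the static scenes of generated videos. It fuses residual motion and depth priors to produce interpretable, dense consistency maps. The metric is used to systematically benchmark recent video generation models, exposing common failure modes, and as a training-free guidance loss that reduces deformation artifacts in generated videos.
Details
Motivation: Video generation models produce geometric deformation and occlusion-inconsistency artifacts in static parts of a scene, and there has been no dedicated metric for detecting and quantifying them.
Result: GeCo was used to systematically benchmark recent video generation models, revealing common failure modes. Applied as a guidance loss, it effectively reduces deformation artifacts.
Insight: The novelty is fusing residual motion with depth priors to build a differentiable, geometry-grounded consistency metric that serves both for evaluation and as a plug-and-play guidance loss for improving generation quality.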
Abstract: We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.
[43] The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency cs.CV | cs.AIPDF
Dingyu Wang, Zimu Yuan, Jiajun Liu, Shanggui Liu, Nan Zhou
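TL;DR: This paper introduces the Bones and Joints (B&J) benchmark for comprehensively evaluating the clinical reasoning ability of vision-language models. Although current state-of-the-art models do well on structured multiple-choice questions (accuracy above 90%), their performance drops sharply on open-ended tasks that require multimodal integration (accuracy barely reaching 60%), exposing a gap in real clinical settings.
Details
Motivation: Current benchmarks based on medical licensing exams or curated vignettes fail to capture the integrated, multimodal reasoning that real-world patient care requires, so a more comprehensive evaluation framework is needed to test what foundation models can actually do in clinical practice.
Result: Eleven vision-language models and six large language models were evaluated on the B&J benchmark, which contains 1,245 questions from real orthopedics and sports medicine cases. Accuracy on open-ended tasks requiring multimodal integration (such as diagnosis generation and treatment planning) was far below accuracy on structured multiple-choice questions, and models fine-tuned specifically for medical applications showed no consistent advantage over general-purpose models.
Insight: The paper's novelty is a comprehensive benchmark that mirrors the full clinical reasoning pathway, from knowledge recall to treatment planning, and exposes fundamental limitations of current AI models in multimodal integration and visual understanding. Viewed objectively, the study argues that safe deployment should currently be limited to supportive, text-based roles and that further progress requires fundamental breakthroughs in multimodal integration.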
Abstract: Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts. Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.
[44] FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound cs.CVPDF
Hussain Alasmawi, Numan Saeed, Mohammad Yaqub
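TL;DR: This paper presents Fetal-Gauge, the first and largest visual question answering benchmark for evaluating vision-language models on fetal ultrasound imaging. It contains over 42,000 images and 93,000 question-answer pairs, covering anatomical plane identification, localization of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis.
Details
Motivation: There is a global shortage of trained sonographers, and no standardized benchmark exists for evaluating vision-language models in fetal ultrasound, a modality that is challenging, strongly operator-dependent, and short on public datasets.
Result: Several SOTA vision-language models, both general-purpose and medical-specific, were systematically evaluated on Fetal-Gauge. The best model reached only 55% accuracy, far below clinical requirements, revealing a substantial performance gap in this domain.
Insight: The novelty is the first comprehensive VLM evaluation benchmark for fetal ultrasound. It underlines the urgent need for domain-adapted architectures and specialized training approaches and lays a rigorous foundation for advancing multimodal deep learning in prenatal care.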
Abstract: The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers’ efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality’s challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper gets accepted.
[45] A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation cs.CV | cs.AIPDF
Philip Xu, David Elizondo, Raouf Hamzaoui
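TL;DR: The paper proposes Uni4D, a unified framework for large-scale open-vocabulary 3D retrieval and controlled 4D generation. It is based on a structured three-level alignment across text, 3D models, and images: it optimizes text-to-3D retrieval through improved semantic alignment and strengthens cross-modal alignment with multi-view 3D-to-image alignment and image-to-text alignment, enabling high-quality 3D retrieval and the generation of temporally consistent 4D assets.
Details
Motivation: Large-scale open-vocabulary 3D retrieval and controlled 4D generation both depend on cross-modal alignment. The work targets that alignment problem to improve dynamic multimodal understanding and practical applicability.
Result: Experimental results indicate that Uni4D achieves high-quality 3D retrieval and controllable 4D generation, but the abstract mentions no specific benchmarks or comparisons with SOTA.
Insight: The novelty is a structured three-level alignment framework that combines text, 3D, and image modalities and uses multi-view alignment and an attention mechanism to improve semantic alignment, offering a unified solution for cross-modal retrieval and generation.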
Abstract: We introduce Uni4D, a unified framework for large-scale open-vocabulary 3D retrieval and controlled 4D generation based on structured three-level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D-text multi-head attention and search model to optimize text-to-3D retrieval through improved semantic alignment. The framework further strengthens cross-modal alignment through three components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high-quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.
[46] Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware cs.CV | cs.HC | cs.LG | eess.IVPDF
Vesal Ahsani, Babak Hossein Khalaj
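TL;DR: This paper presents a real-time in-cabin driver behavior recognition system for low-cost edge hardware, designed to recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. Targeting the Raspberry Pi 5 and Google Coral Edge TPU, it combines a compact per-frame vision model, a label design that reduces false positives, and a decision mechanism based on temporal confidence. It covers 17 behavior classes, and its real-time performance was validated in real in-vehicle tests.
Details
Motivation: In-cabin Driver Monitoring Systems (DMS) need low latency and high reliability on low-cost edge hardware with strict limits on compute, power, and cost.
Result: The system reaches about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions.
Insight: The contributions are: 1) a compact per-frame vision model optimized for edge deployment; 2) a confounder-aware label design that reduces false positives among visually similar behaviors; 3) a temporal decision head that triggers alerts only when predictions stay confident over time. Together they form a practical system-level solution for reliable human-state perception on low-cost hardware, which can serve as an upstream input for human-centered vehicle intelligence.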
Abstract: In-cabin Driver Monitoring Systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and Google Coral Edge TPU. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label design to reduce visually similar false positives, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system covers 17 behavior classes, including multiple phone-use modes, eating/drinking, smoking, reaching behind, gaze/attention shifts, passenger interaction, grooming, control-panel interaction, yawning, and eyes-closed sleep. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions (details in Section 6), and we further validate runtime behavior in real in-vehicle tests. The optimized deployments achieve about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation on inexpensive hardware. Finally, we discuss how reliable in-cabin human-state perception can serve as an upstream input for human-centered vehicle intelligence, including emerging agentic vehicle concepts.
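The temporal decision head described in the driver-monitoring abstract fires only when predictions are both confident and sustained. A minimal sketch of one way to implement that rule follows; the run-length logic, the default `threshold` and `min_frames` values, and the class name are assumptions, since the abstract does not describe the actual mechanism.

```python
class TemporalDecisionHead:
    """Emit an alert label only after a confident prediction persists.

    Assumption: "sustained" means the same label appears with confidence
    at or above `threshold` for `min_frames` consecutive frames.
    """

    def __init__(self, threshold=0.8, min_frames=8):
        self.threshold = threshold
        self.min_frames = min_frames
        self._label = None
        self._run = 0

    def update(self, label, confidence):
        """Feed one per-frame prediction; return the alert label or None."""
        if confidence < self.threshold:
            # A low-confidence frame breaks the streak.
            self._label, self._run = None, 0
        elif label == self._label:
            self._run += 1
        else:
            # A new confident label starts a fresh streak.
            self._label, self._run = label, 1
        return self._label if self._run >= self.min_frames else None
```

At roughly 16 FPS, `min_frames=8` would correspond to about half a second of sustained evidence before an alert, which trades a short delay for fewer spurious alerts.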
[47] Attack-Aware Deepfake Detection under Counter-Forensic Manipulations cs.CV | cs.AIPDF
Noor Fatima, Hasan Faraz Khan, Muzammil Behzad
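TL;DR: This paper presents an attack-aware deepfake and image-forensics detector designed for robustness, well-calibrated probabilities, and transparent evidence under realistic deployment conditions. The method combines red-team training with a randomized test-time defense in a two-stream architecture: one stream encodes semantic content with a pretrained backbone, the other extracts forensic residuals, and a lightweight residual adapter fuses them for classification, while a shallow Feature Pyramid Network style head produces tamper heatmaps under weak supervision.
Details
Motivation: Deepfake detectors deployed in the real world face counter-forensic manipulations such as JPEG realignment and recompression, resampling warps, denoise-to-regrain operations, seam smoothing, color and gamma shifts, and social-app transcodes. The work targets robustness against these attacks while keeping output probabilities well calibrated and evidence interpretable.
Result: On existing benchmarks, including standard deepfake datasets and a surveillance-style split with low light and heavy compression, the method maintains near-perfect ranking (AUC) across attacks, low calibration error, minimal abstention risk, and controlled degradation under regrain attacks, establishing a modular, data-efficient, and practically deployable baseline for attack-aware detection.
Insight: The novelty lies in a two-stream architecture that combines red-team training (applying worst-of-K counter-forensic operations per batch) with a randomized test-time defense (injecting low-cost jitters), and in using weak supervision (face-box masks only) to concentrate heatmaps on face regions, yielding actionable heatmaps and calibrated probabilities without strict pixel-level annotation.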
Abstract: This work presents an attack-aware deepfake and image-forensics detector designed for robustness, well-calibrated probabilities, and transparent evidence under realistic deployment conditions. The method combines red-team training with randomized test-time defense in a two-stream architecture, where one stream encodes semantic content using a pretrained backbone and the other extracts forensic residuals, fused via a lightweight residual adapter for classification, while a shallow Feature Pyramid Network style head produces tamper heatmaps under weak supervision. Red-team training applies worst-of-K counter-forensics per batch, including JPEG realign and recompress, resampling warps, denoise-to-regrain operations, seam smoothing, small color and gamma shifts, and social-app transcodes, while test-time defense injects low-cost jitters such as resize and crop phase changes, mild gamma variation, and JPEG phase shifts with aggregated predictions. Heatmaps are guided to concentrate within face regions using face-box masks without strict pixel-level annotations. Evaluation on existing benchmarks, including standard deepfake datasets and a surveillance-style split with low light and heavy compression, reports clean and attacked performance, AUC, worst-case accuracy, reliability, abstention quality, and weak-localization scores. Results demonstrate near-perfect ranking across attacks, low calibration error, minimal abstention risk, and controlled degradation under regrain, establishing a modular, data-efficient, and practically deployable baseline for attack-aware detection with calibrated probabilities and actionable heatmaps.
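The two defenses named in the abstract, worst-of-K red-team training and a jittered test-time ensemble, can be sketched generically. This is an illustration under stated assumptions, not the paper's pipeline: attacks and jitters are modeled as plain callables, the worst case is chosen per sample by the highest loss, predictions are aggregated by a simple mean, and both function names are hypothetical.

```python
import random


def worst_of_k(sample, attacks, loss_fn, k, rng=random):
    """Apply k randomly drawn attacks and keep the most damaging result.

    `loss_fn` scores an attacked sample; a higher value means the
    detector is hurt more (assumed selection criterion).
    """
    candidates = rng.sample(attacks, k)
    return max((attack(sample) for attack in candidates), key=loss_fn)


def jittered_predict(sample, jitters, predict):
    """Average the detector's probability over several jittered copies."""
    probabilities = [predict(jitter(sample)) for jitter in jitters]
    return sum(probabilities) / len(probabilities)
```

In training, the output of `worst_of_k` would replace the clean sample in the batch; at inference, `jitters` would hold cheap transforms such as small resize, crop, gamma, or JPEG phase changes.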
[48] PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation cs.CVPDF
Darrin Bright, Rakshith Raj, Kanchan Keisham
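TL;DR: PortionNet is a cross-modal knowledge distillation framework for accurately estimating food nutrition from a single RGB image. It learns geometric features from point clouds during training and needs only RGB images at inference, with no depth sensor, addressing the loss of 3D information.
Details
Motivation: Estimating food nutrition from a single image is inaccurate because 3D geometric information is missing. The work addresses this while avoiding a dependence on smartphone depth sensors, which keeps the method widely accessible.
Result: PortionNet achieves state-of-the-art performance on the MetaFood3D dataset, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further shows strong generalization in energy estimation.
Insight: The novelty is a dual-mode training strategy in which a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without specialized hardware and distilling 3D geometric knowledge into a 2D image model, which improves both accessibility and accuracy.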
Abstract: Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.
[49] MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation cs.CVPDF
Run Ling, Ke Cao, Jian Lu, Ao Ma, Haowei Liu
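TL;DR: MoFu is a unified framework for multi-subject video generation that uses scale-aware modulation and Fourier fusion to address subject scale inconsistency and sensitivity to the order of reference inputs, with its effectiveness validated on a dedicated benchmark.
Details
Motivation: Multi-subject video generation suffers from scale inconsistency caused by variations in subject size and from permutation sensitivity caused by changes in the order of reference images.
Result: On a purpose-built benchmark with controlled variations in subject scale and reference permutation, MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
Insight: The contributions are: 1) an LLM-guided Scale-Aware Modulation module that extracts implicit scale cues from the text prompt to modulate features; 2) a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform; 3) a Scale-Permutation Stability Loss that jointly encourages scale-consistent and permutation-invariant generation.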
Abstract: Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
[50] VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning cs.CV | cs.AIPDF
Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, Yujiu Yang
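TL;DR: This paper proposes VideoZoomer, a reinforcement-learning-based agentic framework. It targets a limitation of multimodal large language models in long video understanding: constrained by the context window, they rely on uniform sampling or statically pre-selected frames and cannot focus dynamically on key evidence. Starting from a low-frame-rate overview, the framework autonomously invokes a temporal zoom tool to fetch high-frame-rate clips, gathers fine-grained evidence over multiple interactive turns, and is trained in two stages with supervised fine-tuning followed by reinforcement learning.
Details
Motivation: Existing MLLMs are limited by context length in long video understanding and usually fall back on uniform frame sampling or static pre-selection. These can miss critical visual evidence and cannot correct an initial selection error during reasoning, so a method that controls visual focus dynamically is needed.
Result: Extensive experiments on multiple long video understanding and reasoning benchmarks show that the proposed 7B model produces diverse and complex reasoning patterns, substantially outperforms existing open-source models, rivals proprietary systems on challenging tasks, and achieves higher efficiency under reduced frame budgets.
Insight: The novelty is framing MLLMs as agents that autonomously invoke a temporal zoom tool, enabling dynamic temporal focusing during reasoning, together with a two-stage training strategy that combines supervised fine-tuning with reinforcement learning to refine the agentic policy. The approach yields more efficient evidence gathering and more accurate reasoning on long videos.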
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which may overlook critical evidence and cannot correct an initial selection error during the reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
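The coarse-overview-then-zoom idea can be illustrated with a small sketch of frame timestamp selection. This is not the authors' implementation: the function names `overview` and `temporal_zoom`, the frame rates, and the frame budget are all hypothetical choices made for illustration only.

```python
def sample_timestamps(start, end, fps, max_frames=None):
    """Evenly spaced timestamps (seconds) in [start, end) at the given fps."""
    if end <= start or fps <= 0:
        return []
    step = 1.0 / fps
    count = int((end - start) * fps)
    times = [round(start + i * step, 3) for i in range(count)]
    if max_frames is not None and len(times) > max_frames:
        # Subsample uniformly so the clip respects the frame budget.
        stride = len(times) / max_frames
        times = [times[int(i * stride)] for i in range(max_frames)]
    return times


def overview(duration, fps=0.5):
    """Hypothetical coarse pass: a low-frame-rate view of the whole video."""
    return sample_timestamps(0.0, duration, fps)


def temporal_zoom(start, end, fps=4.0, max_frames=16):
    """Hypothetical zoom tool: a high-frame-rate clip of a chosen window."""
    return sample_timestamps(start, end, fps, max_frames)


if __name__ == "__main__":
    print(overview(10.0))             # sparse frames over the full video
    print(temporal_zoom(12.0, 14.0))  # dense frames inside one window
```

In an agentic loop, the model would see the overview frames first and then choose the `start` and `end` arguments of each zoom call itself; that policy is what the two-stage training is described as learning.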
[51] The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma cs.CV | cs.AIPDF
Mariya Miteva, Maria Nisheva-Pavlova
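TL;DR: This paper proposes a multi-view latent representation learning framework based on variational autoencoders. It integrates complementary radiomic features from two MRI sequences, T1Gd and FLAIR, to non-invasively predict MGMT promoter methylation status in glioblastoma. Each modality is encoded by an independent probabilistic encoder and the results are fused in a compact latent space, preserving modality-specific structure while enabling effective multimodal integration.
Details
Motivation: To address the high feature redundancy and incomplete modeling of modality-specific information that limit conventional unimodal and early-fusion approaches to radiomics-based prediction of MGMT methylation status in glioblastoma.
Result: The abstract does not report specific quantitative results, benchmarks, or comparisons with existing methods.
Insight: The novelty is a multi-view latent representation learning paradigm in which independent probabilistic encoders and latent-space fusion are used to integrate multimodal MRI information more effectively, reducing feature redundancy while preserving modality specificity. This offers a new feature learning framework for radiogenomics tasks.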
Abstract: Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and an incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) to integrate complementary radiomic features derived from post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) magnetic resonance imaging (MRI). By encoding each modality through an independent probabilistic encoder and performing fusion in a compact latent space, the proposed approach preserves modality-specific structure while enabling effective multimodal integration. The resulting latent embeddings are subsequently used for MGMT promoter methylation classification.
[52] Feature Learning with Multi-Stage Vision Transformers on Inter-Modality HER2 Status Scoring and Tumor Classification on Whole Slides cs.CV | cs.AIPDF
Olaide N. Oyelade, Oliver Hoxey, Yulia Humrye
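TL;DR: This paper proposes an end-to-end pipeline based on multi-stage vision transformers (ViT) for HER2 status scoring and tumor classification on pathology whole slide images (WSIs). The method localizes tumor regions on H&E-stained images and uses a novel mapping function to associate them with the corresponding regions on IHC-stained images, producing pixel-level 4-way HER2 scoring (0, 1+, 2+, 3+) as well as HER2-negative/positive classification.
Details
Motivation: Existing deep learning methods for HER2 scoring struggle to provide pixel-level localization of HER2 status, and jointly analyzing H&E and IHC stained images is challenging. The paper aims to build an end-to-end system that accurately predicts HER2 expression levels and produces pixel-level annotations to support cancer treatment decisions.
Result: On a private dataset (H&E and IHC WSIs from 13 cases), the method achieves good classification accuracy in tumor localization, and a classification accuracy of 0.94 and a specificity of 0.933 in 4-way HER2 scoring, with performance comparable to human pathologists.
Insight: The contributions include a novel mapping function that links malignant H&E regions to the corresponding IHC regions; a clinically inspired HER2 scoring mechanism embedded in an end-to-end ViT model for automatic pixel-level scoring; and the first joint evaluation of H&E and IHC images for HER2 scoring in a ViT-based model, which improves the practicality of multimodal pathology image analysis.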
Abstract: Histopathology images, such as hematoxylin and eosin (H&E) stained slides, have proven useful in detecting tumors. However, moving such cancer cases forward for treatment requires accurate information on the amount of human epidermal growth factor receptor 2 (HER2) protein expression. Predicting both the lower and higher levels of HER2 can be challenging. Moreover, jointly analyzing H&E and immunohistochemistry (IHC) stained images for HER2 scoring is difficult. Although several deep learning methods have been investigated to address the challenge of HER2 scoring, they struggle to provide pixel-level localization of HER2 status. In this study, we propose a single end-to-end pipeline using a system of vision transformers for HER2 status scoring on whole slide images (WSIs). The method includes patch-wise processing of H&E WSIs for tumor localization. A novel mapping function is proposed to identify the IHC WSI regions that correspond to malignant regions on H&E. A clinically inspired HER2 scoring mechanism is embedded in the pipeline and allows for automatic pixel-level annotation of 4-way HER2 scoring (0, 1+, 2+, and 3+). The proposed method also accurately classifies cases as HER2-negative or HER2-positive. Privately curated datasets were collaboratively extracted from 13 different cases of H&E and IHC WSIs. A thorough experiment is conducted on the proposed method. The results showed good classification accuracy during tumor localization. In addition, a classification accuracy of 0.94 and a specificity of 0.933 were obtained for 4-way HER2 status scoring. The applicability of the proposed pipeline was investigated using WSI patches, with performance comparable to human pathologists. Findings from the study showed the usability of jointly evaluated H&E and IHC images on end-to-end ViT-based models for HER2 scoring.
[53] Human-like visual computing advances explainability and few-shot learning in deep neural networks for complex physiological data cs.CV | cs.AI | cs.HC | cs.LGPDF
Alaa Alahmadi, Mohamed Hasan
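TL;DR: This paper proposes a pseudo-colouring technique inspired by human vision to improve the explainability and few-shot learning ability of deep neural networks analysing complex physiological data such as electrocardiograms (ECG). By encoding clinically salient temporal features (such as the QT interval) into structured colour representations, the method lets models learn discriminative and interpretable features from very few training samples. Its effectiveness is validated on drug-induced long QT syndrome.
Details
Motivation: Current machine vision models, especially deep neural networks, typically need large training datasets for physiological signal interpretation (such as ECG) and offer limited insight into the causal features behind their predictions. This limits their clinical reliability and alignment with human reasoning. The paper aims to address low data efficiency and poor interpretability.
Result: On the challenging case of drug-induced long QT syndrome (LQTS), one-shot and few-shot learning are evaluated with prototypical networks and a ResNet-18 architecture. The experiments show that pseudo-colouring guides the model toward clinically meaningful ECG features and suppresses irrelevant signal components, enabling effective learning from only 1 or 5 training examples. Aggregating multiple cardiac cycles further improves performance, mirroring human perceptual averaging.
Insight: The novelty is bringing human perceptual encoding (pseudo-colouring) into deep learning, representing clinical features as structured colour to improve data efficiency, explainability, and causal reasoning at the same time. This suggests a human-machine collaborative direction for medical machine intelligence: imitating human visual processing to strengthen model generalisation and explainability.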
Abstract: Machine vision models, particularly deep neural networks, are increasingly applied to physiological signal interpretation, including electrocardiography (ECG), yet they typically require large training datasets and offer limited insight into the causal features underlying their predictions. This lack of data efficiency and interpretability constrains their clinical reliability and alignment with human reasoning. Here, we show that a perception-informed pseudo-colouring technique, previously demonstrated to enhance human ECG interpretation, can improve both explainability and few-shot learning in deep neural networks analysing complex physiological data. We focus on acquired, drug-induced long QT syndrome (LQTS) as a challenging case study characterised by heterogeneous signal morphology, variable heart rate, and scarce positive cases associated with life-threatening arrhythmias such as torsades de pointes. This setting provides a stringent test of model generalisation under extreme data scarcity. By encoding clinically salient temporal features, such as QT-interval duration, into structured colour representations, models learn discriminative and interpretable features from as few as one or five training examples. Using prototypical networks and a ResNet-18 architecture, we evaluate one-shot and few-shot learning on ECG images derived from single cardiac cycles and full 10-second rhythms. Explainability analyses show that pseudo-colouring guides attention toward clinically meaningful ECG features while suppressing irrelevant signal components. Aggregating multiple cardiac cycles further improves performance, mirroring human perceptual averaging across heartbeats. Together, these findings demonstrate that human-like perceptual encoding can bridge data efficiency, explainability, and causal reasoning in medical machine intelligence.
[54] VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement cs.CV | cs.AIPDF
Zhengfei Kuang, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie
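TL;DR: This paper proposes VULCAN, a tool-augmented multi-agent system that tackles the difficulty of applying multimodal large language models to complex 3D object arrangement tasks. By introducing an MCP-based API, a suite of specialized visual tools, and a collaborative multi-agent architecture with planning, execution, and verification roles, the framework achieves iterative, precise manipulation of 3D scenes.
Details
Motivation: Although multimodal large language models have made notable progress on 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. The paper addresses three key challenges MLLMs face in 3D object arrangement: weak visual grounding, insufficient 3D scene understanding, and an error-prone iterative update process.
Result: The method is evaluated on 25 diverse, complex object arrangement tasks, where it significantly outperforms existing baselines.
Insight: The contributions are: 1) an MCP-based API that shifts the interaction from brittle raw code manipulation to more robust function-level updates, addressing the weak visual grounding of MLLMs; 2) specialized visual tools that strengthen the MLLM's 3D scene perception and form a perceptual feedback loop; 3) a collaborative multi-agent framework with divided roles (planning, execution, verification) that manages the iterative, error-prone update process, so multi-step instructions are handled robustly and intermediate errors can be recovered from. Together these provide a systematic way to extend MLLM capabilities to precise 3D-aware manipulation tasks.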
Abstract: Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM’s 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: vulcan-3d.github.io
[55] Self-Evaluation Unlocks Any-Step Text-to-Image Generation cs.CV | cs.AI | cs.LGPDF
Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang
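TL;DR: This paper proposes the Self-Evaluating Model (Self-E), a new text-to-image generation approach trained from scratch that supports any-step inference. It combines Flow Matching style training with a novel self-evaluation mechanism that evaluates the model's own generated samples using its current score estimates, acting as a dynamic self-teacher. Experiments show that Self-E excels at few-step generation, is competitive with state-of-the-art Flow Matching models at 50 steps, and improves monotonically as inference steps increase.
Details
Motivation: Traditional diffusion or flow models rely on local supervision and therefore need many inference steps, while distillation-based methods require a pretrained teacher. The paper aims to train, from scratch, a high-quality text-to-image model that supports any-step inference.
Result: On large-scale text-to-image benchmarks, Self-E performs strongly in few-step generation and is on par with state-of-the-art Flow Matching models at 50 steps. Its performance improves monotonically with the number of inference steps, unifying ultra-fast few-step generation and high-quality long-trajectory sampling.
Insight: The novelty is combining instantaneous local learning with self-driven global matching. The self-evaluation mechanism serves as a dynamic self-teacher, so no pretrained teacher is needed, and a single unified model supporting any-step inference is trained from scratch, offering a new framework for efficient and scalable generation.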
Abstract: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
[56] iOSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI cs.CVPDF
Himanshu Naidu, Yuxiang Zhang, Sachin Mehta, Anat Caspi
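TL;DR: iOSPointMapper is a mobile-AI application for real-time sidewalk and accessibility mapping. It uses semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data on iPhones and iPads to detect and localize pedestrian-relevant features such as traffic signs and traffic lights. After validation through a user-guided annotation interface, the data is anonymized and uploaded to a transportation data exchange platform, closing pedestrian infrastructure data gaps in a low-cost, scalable way.
Details
Motivation: Current sidewalk data collection methods are costly, fragmented, and hard to scale, which hinders the building of accessible and inclusive pedestrian infrastructure. A real-time, privacy-preserving, and scalable way to obtain accurate, up-to-date sidewalk data is therefore needed.
Result: The paper gives a detailed evaluation of the system's feature detection and spatial mapping performance, indicating the application's potential for enhanced sidewalk mapping. It does not mention specific benchmarks or quantitative comparisons with existing methods (for example, whether it reaches SOTA).
Insight: The novelty is combining on-device AI (semantic segmentation, LiDAR) with user-guided validation to achieve privacy-preserving, real-time, and scalable sidewalk mapping. Viewed objectively, its mobile integration and anonymized data collection workflow offer a practical template for crowdsourced geographic data collection.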
Abstract: Accurate, up-to-date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real-time, privacy-conscious sidewalk mapping on the ground, using recent-generation iPhones and iPads. The system leverages on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user-guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system’s feature detection and spatial mapping performance reveal the application’s potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user-centered approach to closing critical data gaps in pedestrian
[57] LECalib: Line-Based Event Camera Calibration cs.CVPDF
Zibin Liu, Banglei Guan, Yang Shang, Zhenbao Yu, Yifei Bian
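TL;DR: This paper proposes LECalib, a line-based event camera calibration framework. It calibrates using the geometric lines of objects common in man-made environments (doors, windows, boxes, and so on), detects lines directly from event streams, and uses an event-line calibration model to generate initial camera parameters, which works for both planar and non-planar lines. The parameters are then refined by non-linear optimization.
Details
Motivation: Existing event camera calibration methods are generally time-consuming and require manually placed calibration objects, so they cannot keep up with rapidly changing scenarios. The paper aims to develop a more efficient calibration method that needs no dedicated calibration object.
Result: Simulation and real-world experiments demonstrate the feasibility and accuracy of the method, with validation on both monocular and stereo event cameras, reaching high calibration accuracy.
Insight: The novelty is calibrating directly from line features in the event stream, without relying on intensity image reconstruction or dedicated calibration patterns. This suits dynamic environments and improves the flexibility and efficiency of calibration.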
Abstract: Camera calibration is an essential prerequisite for event-based vision applications. Current event camera calibration methods typically involve using flashing patterns, reconstructing intensity images, and utilizing the features extracted from events. Existing methods are generally time-consuming and require manually placed calibration objects, which cannot meet the needs of rapidly changing scenarios. In this paper, we propose a line-based event camera calibration framework exploiting the geometric lines of commonly-encountered objects in man-made environments, e.g., doors, windows, boxes, etc. Different from previous methods, our method detects lines directly from event streams and leverages an event-line calibration model to generate the initial guess of camera parameters, which is suitable for both planar and non-planar lines. Then, a non-linear optimization is adopted to refine camera parameters. Both simulation and real-world experiments have demonstrated the feasibility and accuracy of our method, with validation performed on monocular and stereo event cameras. The source code is released at https://github.com/Zibin6/line_based_event_camera_calib.
[58] SonoVision: A Computer Vision Approach for Helping Visually Challenged Individuals Locate Objects with the Help of Sound Cues cs.CVPDF
Md Abu Obaida Zishan, Annajiat Alim Rasel
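TL;DR: SonoVision is a computer-vision-based smartphone application that helps visually impaired people locate everyday objects through sound cues. The app uses the EfficientDet-D2 model for object detection and plays a sinusoidal sound through earphones to indicate whether an object is to the left, to the right, or in front, improving users' independence and safety.
Details
Motivation: To address the challenges visually impaired people face when locating objects, reduce their reliance on the people around them, and lower potential risks, thereby strengthening their ability to live independently.
Result: The app implements object detection with EfficientDet-D2 and a sound cue system. The paper does not report specific quantitative results or benchmarks, but emphasizes that the app runs fully offline and is safe and user-friendly.
Insight: The novelty is combining computer vision (the EfficientDet-D2 model) with auditory feedback (directional sound cues) in a lightweight offline smartphone app, providing a practical, low-cost solution for assistive technology for the visually impaired.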
Abstract: Locating objects is a significant challenge for the visually impaired, and one that does not get easier with time. It hinders their independence and can push them towards risky and dangerous scenarios. Hence, in the spirit of making the visually challenged more self-sufficient, we present SonoVision, a smartphone application that helps them find everyday objects using sound cues through earphones or headphones. If an object is on the right or left side of the user, the app plays a sinusoidal sound in the corresponding ear. To indicate objects located directly in front, both the left and right earphones sound simultaneously. These sound cues could help a visually impaired individual locate objects with their smartphone and reduce reliance on people in their surroundings, consequently making them more independent. The application is built with the Flutter development platform and uses the EfficientDet-D2 model for object detection in the backend. We believe the app will significantly assist the visually impaired in a safe and user-friendly manner with its capacity to work completely offline. Our application can be accessed at https://github.com/MohammedZ666/SonoVision.git.
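The left/right/front cue logic described in the abstract can be sketched in a few lines. This is an illustration in Python, not the app's Flutter code: the function names, the width of the "in front" band (the central third of the frame), and the tone parameters are all assumptions that the abstract does not specify.

```python
import math


def cue_gains(x_center, frame_width, center_band=1 / 3):
    """Map a detection's horizontal centre to (left_gain, right_gain).

    center_band is an assumed fraction of the frame width, around the
    middle of the image, that counts as "directly in front".
    """
    offset = x_center / frame_width - 0.5  # negative means left of centre
    half_band = center_band / 2
    if offset < -half_band:
        return (1.0, 0.0)  # object on the left: left ear only
    if offset > half_band:
        return (0.0, 1.0)  # object on the right: right ear only
    return (1.0, 1.0)      # object in front: both ears


def stereo_tone(gains, freq_hz=440.0, sample_rate=8000, duration_s=0.01):
    """Generate (left, right) sample pairs of a sine tone scaled by gains."""
    left_gain, right_gain = gains
    count = round(sample_rate * duration_s)
    samples = []
    for i in range(count):
        value = math.sin(2 * math.pi * freq_hz * i / sample_rate)
        samples.append((left_gain * value, right_gain * value))
    return samples


if __name__ == "__main__":
    gains = cue_gains(x_center=100, frame_width=900)
    print(gains, len(stereo_tone(gains)))
```

Here `x_center` would come from the bounding box returned by the object detector; a real app would pass the generated samples to the platform's audio output.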
[59] SAM 3D for 3D Object Reconstruction from Remote Sensing Images cs.CVPDF
Junsheng Yao, Lichao Mou, Qingyu Li
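TL;DR: This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, on monocular 3D building reconstruction from remote sensing images. It is benchmarked against TRELLIS on the NYC Urban Dataset using the FID and CMMD metrics. Experiments show that SAM 3D produces more coherent roof geometry and sharper boundaries. The paper also extends it to urban scene reconstruction through a segment-reconstruct-compose pipeline and analyzes its limitations and future directions.
Details
Motivation: Existing methods for monocular 3D building reconstruction from remote sensing images usually require task-specific architectures and intensive supervision. The paper explores whether a general-purpose foundation model is applicable in this domain.
Result: On the NYC Urban Dataset, SAM 3D outperforms TRELLIS on the FID and CMMD metrics and produces more coherent roof geometry and sharper boundaries.
Insight: The novelty is the first systematic application of the general-purpose image-to-3D foundation model SAM 3D to remote sensing building reconstruction, along with a demonstration that a simple pipeline can extend it to urban scene modeling. Viewed objectively, the approach avoids task-specific design, provides practical guidance for deploying foundation models in urban 3D reconstruction, and motivates research on integrating scene-level structural priors.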
Abstract: Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, yet existing methods often require task-specific architectures and intensive supervision. This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular remote sensing building reconstruction. We benchmark SAM 3D against TRELLIS on samples from the NYC Urban Dataset, employing Frechet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. We further extend SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline, demonstrating its potential for urban scene modeling. We also analyze practical limitations and discuss future research directions. These findings provide practical guidance for deploying foundation models in urban 3D reconstruction and motivate future integration of scene-level structural priors.
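For reference, the two metrics named in the abstract have standard definitions, given below. The symbols are assumptions introduced here: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features for the reference and generated image sets, $P$ and $Q$ are the corresponding distributions of CLIP embeddings, and $k$ is a kernel function. How the paper obtains the images fed to these metrics is not stated in the abstract.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)

\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\left[ k(x, x') \right]
  + \mathbb{E}_{y, y' \sim Q}\left[ k(y, y') \right]
  - 2\, \mathbb{E}_{x \sim P,\, y \sim Q}\left[ k(x, y) \right]
```

Lower values of both metrics indicate that the generated set is statistically closer to the reference set.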
[60] Comparing Object Detection Models for Electrical Substation Component Mapping cs.CVPDF
Haley Mody, Namish Bansal, Dennies Kiprono Bor, Edward J. Oughton
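TL;DR: This study aims to automate the identification and mapping of electrical substation components with computer vision models, replacing the traditional time-consuming and labor-intensive manual approach. On a manually labeled dataset of US substation images, the paper trains and compares three object detection models, YOLOv8, YOLOv11, and RF-DETR, evaluating their detection accuracy, precision, and efficiency, and presents a use case of applying these models to large-scale mapping of US substation components.
Details
Motivation: Substations are a key part of the electrical grid, and their assets (such as transformers) are vulnerable to hazards including hurricanes, flooding, and earthquakes, while traditional manual mapping is inefficient. An autonomous computer-vision solution is therefore needed to identify key components efficiently so that vulnerability can be quantified and failures prevented.
Result: The paper evaluates the three models (YOLOv8, YOLOv11, RF-DETR) for detection accuracy, precision, and efficiency on a manually labeled dataset of US substation images, and analyzes the strengths and weaknesses of each to determine which provides reliable, large-scale substation component mapping.
Insight: The novelty is a systematic comparison of several current object detection models, including the established YOLO family and the newer Transformer-based RF-DETR, in a specific industrial setting (substation component detection). This gives practical model selection guidance and a feasibility check for automated asset mapping of critical infrastructure.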
Abstract: Electrical substations are a significant component of an electrical grid. Indeed, the assets at these substations (e.g., transformers) are prone to disruption from many hazards, including hurricanes, flooding, earthquakes, and geomagnetically induced currents (GICs). As electrical grids are considered critical national infrastructure, any failure can have significant economic and public safety implications. To help prevent and mitigate these failures, it is thus essential that we identify key substation components to quantify vulnerability. Unfortunately, traditional manual mapping of substation infrastructure is time-consuming and labor-intensive. Therefore, an autonomous solution utilizing computer vision models is preferable, as it allows for greater convenience and efficiency. In this research paper, we train and compare the outputs of 3 models (YOLOv8, YOLOv11, RF-DETR) on a manually labeled dataset of US substation images. Each model is evaluated for detection accuracy, precision, and efficiency. We present the key strengths and limitations of each model, identifying which provides reliable and large-scale substation component mapping. Additionally, we utilize these models to effectively map the various substation components in the United States, showcasing a use case for machine learning in substation mapping.
[61] Tracking by Predicting 3-D Gaussians Over Time cs.CVPDF
Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, Jitendra Malik
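TL;DR: This paper proposes Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised video representation learning approach that encodes an image sequence into a set of 3D Gaussian splats moving over time. Representing a video as a dynamic set of 3D Gaussians enforces a reasonable inductive bias: 2D videos are usually consistent projections of a dynamic 3D scene. The study finds that tracking ability emerges naturally when a network is pretrained with this architecture.
Details
Motivation: To develop a self-supervised video representation learning method built on the idea that a video is a 2D projection of a dynamic 3D scene. By encoding the video as a set of 3D Gaussians evolving over time, the method learns effective representations, with the expectation that object tracking ability emerges from them.
Result: The method achieves zero-shot tracking performance comparable to the state of the art. With small-scale finetuning, it achieves a 34.6% improvement on Kinetics and a 13.1% improvement on Kubric, surpassing existing self-supervised video learning approaches.
Insight: The novelty is representing a video as a set of 3D Gaussian splats moving over time, which provides a strong inductive bias about 3D scene dynamics. A key insight is that self-supervised pretraining under this representation yields tracking as an emergent property, with no explicit tracking supervision.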
Abstract: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.
[62] SCAFusion: A Multimodal 3D Detection Framework for Small Object Detection in Lunar Surface Exploration cs.CVPDF
Xin Chen, Kang Luo, Yangyi Xiao, Hesheng Wang
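TL;DR: This paper proposes SCAFusion, a multimodal 3D object detection framework designed for lunar robotic missions and aimed at detecting small objects on the lunar surface. Built on the BEVFusion framework, the model adds a Cognitive Adapter, a Contrastive Alignment Module, a Camera Auxiliary Training Branch, and a Section-aware Coordinate Attention mechanism, which markedly improve detection of small, irregular objects (such as meteor fragments and rocks) in unstructured environments like the Moon.
Details
Motivation: Multimodal 3D perception methods designed for terrestrial autonomous driving underperform in unstructured environments such as the Moon because of poor feature alignment, limited multimodal synergy, and weak small object detection. The paper addresses this to provide the reliable, precise small object detection that autonomous navigation and operation in lunar surface exploration require.
Result: On the nuScenes validation set, the model achieves 69.7% mAP and 72.1% NDS, improving on the baseline by 5.0% and 2.7%, respectively. In a simulated lunar environment built on Isaac Sim, it achieves 90.93% mAP, 11.5% above the baseline, with notable gains in detecting small meteor-like obstacles.
Insight: The main novelty is the Section-aware Coordinate Attention mechanism, explicitly designed to strengthen detection of small, irregular targets, together with the systematic integration of the Cognitive Adapter, Contrastive Alignment Module, and Camera Auxiliary Training Branch to improve feature alignment and multimodal fusion. Viewed objectively, the design approach of adding targeted modules to a general framework for a specific application (lunar exploration) is worth borrowing.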
Abstract: Reliable and precise detection of small and irregular objects, such as meteor fragments and rocks, is critical for autonomous navigation and operation in lunar surface exploration. Existing multimodal 3D perception methods designed for terrestrial autonomous driving often underperform in off-world environments due to poor feature alignment, limited multimodal synergy, and weak small object detection. This paper presents SCAFusion, a multimodal 3D object detection model tailored for lunar robotic missions. Built upon the BEVFusion framework, SCAFusion integrates a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module to enhance camera-LiDAR feature consistency, a Camera Auxiliary Training Branch to strengthen visual representation, and most importantly, a Section-aware Coordinate Attention mechanism explicitly designed to boost the detection performance of small, irregular targets. With negligible increase in parameters and computation, our model achieves 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion achieves 90.93% mAP, outperforming the baseline by 11.5%, with notable gains in detecting small meteor-like obstacles.
[63] DreamOmni3: Scribble-based Editing and Generation cs.CVPDF
Bin Xia, Bohao Peng, Jiyang Liu, Sitong Wu, Jingyao Li
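TL;DR: This paper proposes DreamOmni3, a unified scribble-based editing and generation model. By combining text, images, and freehand sketches, it addresses the shortcomings of text-only instructions in specifying edit locations and visual details, and it designs a corresponding data synthesis pipeline and joint input framework.
Details
Motivation: Existing unified generation and editing models rely mainly on text prompts, but language struggles to capture precisely the edit locations and fine-grained visual details a user intends. The paper therefore introduces scribbles as an input to enable more flexible and intuitive creation on a graphical interface.
Result: Experimental results show that DreamOmni3 achieves outstanding performance on the proposed scribble-based editing and generation tasks. The authors establish comprehensive benchmarks for these tasks to promote further research, and the models and code will be publicly released.
Insight: The main contributions are: 1) defining the new tasks of scribble-based editing and generation, which extend how users interact with unified models; 2) designing data synthesis pipelines for these tasks; 3) proposing a joint input scheme (instead of binary masks) that uses different colors to distinguish regions and shares index and position encodings, so that complex edits involving multiple scribbles, images, and instructions are handled precisely.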
Abstract: Recently, unified generation and editing models have achieved remarkable success with their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users' intended edit locations and fine-grained visual details. To this end, we propose two tasks, scribble-based editing and generation, that enable more flexible creation on a graphical user interface (GUI) by combining text, images, and freehand sketches from the user. We introduce DreamOmni3, tackling two challenges: data creation and framework design. Our data synthesis pipeline includes two parts: scribble-based editing and generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on the DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles, or cropped images to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance, and models and code will be publicly released.
[64] CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation cs.CV | cs.AIPDF
Qinglin Zeng, Kaitong Cai, Ruiqi Chen, Qinhan Lv, Keze Wang
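TL;DR: This paper proposes CoAgent, a collaborative closed-loop framework for coherent video generation that uses a plan-synthesize-verify process to address narrative coherence and visual consistency in open-domain video generation.
Details
Motivation: Existing text-to-video models usually treat each shot independently, which leads to identity drift, scene inconsistency, and unstable temporal structure. The paper aims to address these challenges.
Result: Extensive experiments show that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
Insight: The contributions include modeling video generation as a closed-loop plan-synthesize-verify pipeline, introducing a Global Context Manager that maintains entity-level memory to preserve cross-shot consistency, and using a Verifier Agent that performs vision-language reasoning and triggers selective regeneration.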
Abstract: Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
[65] Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains cs.CV | cs.AIPDF
Jesen Zhang, Ningyuan Liu, Kaitong Cai, Sidi Liu, Jing Yang
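TL;DR: This paper proposes SR-MCR, a lightweight, label-free framework for improving the reasoning reliability of multimodal large language models (MLLMs). The framework extracts five self-referential signals from the model's own outputs to build a fine-grained process-level reward and trains with a modified GRPO objective, improving both answer accuracy and reasoning coherence across multiple visual benchmarks.
Details
Motivation: The reasoning of existing MLLMs is fluent but unreliable, with weak step-to-step coherence and insufficient visual grounding. The root cause is that conventional alignment methods supervise only the final answer and ignore the reliability of the intermediate reasoning process.
Result: SR-MCR-7B, built on Qwen2.5-VL, achieves an average accuracy of 81.4% across multiple visual benchmarks, reaching state-of-the-art (SOTA) performance among open-source models of comparable size and clearly improving answer accuracy and reasoning coherence.
Insight: The core novelty is a self-rewarding mechanism that builds a process-level reward from the model's own outputs (semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency), together with a critic-free GRPO training objective that includes a confidence-aware cooling mechanism. This achieves fine-grained, stable alignment of the reasoning process.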
Abstract: Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues – semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency – are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
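To make the reward construction more concrete, the sketch below combines five cue scores into one weighted, normalized reward and then computes group-relative advantages in the style of GRPO (each sampled response is scored relative to the mean and standard deviation of its group, so no learned critic is needed). This is a minimal illustration under assumptions: the weights, the [0, 1] cue range, and the function names are hypothetical, the way the paper computes each cue is not given in the abstract, and the confidence-aware cooling mechanism is omitted.

```python
from statistics import mean, pstdev

# The five cue names come from the abstract; everything else is assumed.
CUES = (
    "semantic_alignment",
    "lexical_fidelity",
    "non_redundancy",
    "visual_grounding",
    "step_consistency",
)


def process_reward(cue_scores, weights):
    """Weighted average of cue scores (each assumed to lie in [0, 1])."""
    total_weight = sum(weights[name] for name in CUES)
    weighted_sum = sum(weights[name] * cue_scores[name] for name in CUES)
    return weighted_sum / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    centre = mean(rewards)
    spread = pstdev(rewards)
    return [(reward - centre) / (spread + eps) for reward in rewards]


if __name__ == "__main__":
    uniform = {name: 1.0 for name in CUES}
    strong = {name: 0.9 for name in CUES}
    weak = {name: 0.3 for name in CUES}
    rewards = [process_reward(strong, uniform), process_reward(weak, uniform)]
    print(rewards, group_relative_advantages(rewards))
```

Responses with above-average reward in their group receive positive advantages and are reinforced; those below average receive negative advantages.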
[66] KV-Tracker: Real-Time Pose Tracking with Transformers cs.CVPDF
Marwan Taher, Ignacio Alzugaray, Kirill Mazur, Xin Kong, Andrew J. Davison
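TL;DR: KV-Tracker is a Transformer-based real-time pose tracking method. By caching the key-value (KV) pairs of a multi-view geometry network as the scene representation, it enables online 6-DoF pose tracking and scene/object reconstruction from monocular RGB video, with inference speedups of up to 15x.
Details
Motivation: Multi-view 3D geometry networks provide a powerful prior but are too slow for real-time applications, so a way to adapt them for online use is needed.
Result: Experiments on the TUM RGB-D, 7-Scenes, Arctic, and OnePose datasets show that the system delivers strong performance while maintaining high frame rates of up to about 27 FPS.
Insight: The novelty is a model-agnostic KV caching strategy that uses the key-value pairs of the global self-attention block as the sole scene representation for online tracking. It avoids drift and catastrophic forgetting and can be applied to off-the-shelf multi-view networks without retraining.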
Abstract: Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $π^3$ with full bidirectional attention. We then cache the global self-attention block’s key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\sim}27$ FPS.
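The caching idea can be illustrated with a toy single-head attention layer in NumPy: keys and values are computed once from the keyframe tokens, and each new frame only computes its own queries and attends to the cache. This is a sketch of generic KV caching, not the authors' code; the dimensions, random weights, and function names are hypothetical, and a real network would have many layers and heads and may also let the new frame attend to its own tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    """Scaled dot-product attention of queries q over keys k and values v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


rng = np.random.default_rng(0)
dim = 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))

# Mapping phase: project keyframe tokens once and keep the results.
keyframe_tokens = rng.standard_normal((12, dim))
cache_k = keyframe_tokens @ w_k
cache_v = keyframe_tokens @ w_v


def track_frame(frame_tokens):
    """Tracking phase: only the new frame's queries are computed."""
    return attend(frame_tokens @ w_q, cache_k, cache_v)


if __name__ == "__main__":
    new_frame = rng.standard_normal((4, dim))
    print(track_frame(new_frame).shape)
```

The saving comes from skipping the keyframe projections (and, in a full model, the keyframes' own forward pass) for every incoming frame, while the result is identical to recomputing the keys and values each time.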
[67] PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment cs.CVPDF
Bin Wang, Yang Xu, Huan Zhao, Hao Zhang, Zixing Zhang
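TL;DR: PTalker is a new framework for personalized speech-driven 3D talking head animation. Using style disentanglement and modality alignment, it captures and preserves individual speaking styles while maintaining high lip-synchronization accuracy, improving both realism and personalization.
Details
Motivation: Existing speech-driven 3D talking head methods have improved lip-synchronization accuracy but largely ignore the nuances of individual speaking styles, which limits personalization and realism. A method is needed that delivers both accurate synchronization and personalized style.
Result: Extensive qualitative and quantitative experiments on public datasets show that PTalker generates realistic, stylized 3D talking head animations that accurately match identity-specific speaking styles, outperforming state-of-the-art (SOTA) methods.
Insight: Disentanglement constraints encode audio and motion sequences into separate style and content spaces to strengthen the style representation. A three-level modality alignment mechanism then improves lip-synchronization accuracy: spatial alignment (Graph Attention Networks), temporal alignment (cross-attention), and feature alignment (bidirectional contrastive losses with KL-divergence constraints). Together these separate style from content and align the two modalities at fine granularity.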
Abstract: Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely “PTalker”. This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.
[68] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone cs.CV | cs.CLPDF
Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu
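TL;DR: This paper presents Dream-VL and Dream-VLA, an open vision-language model and an open vision-language-action model built on a diffusion language model (dLLM). Dream-VL matches top-tier autoregressive VLMs on multiple benchmarks and shows greater potential on visual planning tasks. Dream-VLA, obtained by continued pre-training of Dream-VL on robotic datasets, achieves leading results on several robotics benchmarks and, thanks to its bidirectional generation, converges faster in downstream fine-tuning.
Details
Motivation: The sequential generation of autoregressive large vision-language models (VLMs) limits their effectiveness in complex visual planning and dynamic robotic control. This work explores building VLMs on diffusion language models to overcome these limits.
Result: Dream-VL is comparable to top-tier autoregressive VLMs trained on open data across multiple benchmarks and shows greater potential on visual planning tasks. Dream-VLA reaches a 97.2% average success rate on LIBERO, a 71.4% overall average on SimplerEnv-Bridge, and a 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as π₀ and GR00T-N1.
Insight: The core contribution is using a diffusion language model (dLLM) as the backbone of a vision-language model. Its inherently bidirectional generation suits tasks that need action chunking and parallel generation, such as robotic control, which yields faster convergence in downstream fine-tuning and better task performance. This offers a new and promising architectural direction for vision-language and vision-language-action models.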
Abstract: While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $π_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
[69] Rethinking Memory Design in SAM-Based Visual Object Tracking cs.CVPDF
Mohamad Alansari, Muzammal Naseer, Hasan Al Marzouqi, Naoufel Werghi, Sajid Javed
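TL;DR: This paper presents a systematic study of memory design in SAM-based visual object tracking. It analyzes how existing SAM2 trackers differ in selecting short-term memory frames and ports their memory mechanisms to the SAM3 framework for large-scale evaluation. Based on the empirical findings, the authors propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory, improving tracking robustness across a range of challenging scenarios.
Details
Motivation: Existing SAM-based trackers address memory limitations in method-specific ways, so the design principles of memory remain poorly understood, and it is unclear how these mechanisms transfer to next-generation foundation models such as SAM3.
Result: Large-scale evaluation on ten diverse benchmarks shows that the proposed unified hybrid memory framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones.
Insight: The core contribution is the explicit decomposition of memory into short-term appearance memory and long-term distractor-resolving memory. This allows existing memory policies to be integrated in a modular, principled way and gives SAM-based tracking a unified, transferable memory design framework.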
Abstract: Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: https://github.com/HamadYA/SAM3_Tracking_Zoo. This is a preprint. Some results are being finalized and may be updated in a future revision.
[70] Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion cs.CVPDF
Yuming Gu, Yizhi Wang, Yining Hong, Yipeng Gao, Hao Jiang
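TL;DR: This paper proposes Envision, a diffusion-based framework for embodied visual planning that imagines the trajectory from an initial scene to a goal state through goal-image-guided video generation. It has two stages: a Goal Imagery Model synthesizes a goal image consistent with the instruction, and an Env-Goal Video Model interpolates between the first and last frames to produce smooth, physically plausible video trajectories.
Details
Motivation: Existing visual planning methods based on video diffusion models are mostly forward predictive: they generate trajectories conditioned only on the initial observation, with no explicit goal modeling, which leads to spatial drift and goal misalignment. Envision introduces an explicit goal-image constraint to keep generated trajectories physically plausible and consistent with the goal.
Result: On object manipulation and image editing benchmarks, Envision outperforms baselines in goal alignment, spatial consistency, and object preservation, and the generated visual plans can directly support downstream robotic planning and control.
Insight: Visual planning is split into two stages, goal image synthesis and video interpolation, with trajectories generated by a first-and-last-frame-conditioned video diffusion model (FL2V). Region-aware cross attention and the explicit goal constraint mitigate spatial drift and improve the coherence and practical usefulness of the trajectories.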
Abstract: Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.
[71] FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution cs.CVPDF
Yidi Liu, Zihao Fan, Jie Huang, Jie Xiao, Dong Li
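TL;DR: This paper proposes FinPercep-RM, a fine-grained perceptual reward model for real-world super-resolution, together with a Co-evolutionary Curriculum Learning (CCL) mechanism. Built on an encoder-decoder architecture, the reward model outputs a global quality score and also generates a perceptual degradation map that localizes and quantifies local defects. To counter the training instability that a complex reward model introduces, CCL puts the reward model and the super-resolution model through synchronized easy-to-hard curricula, which stabilizes training and suppresses reward hacking.
Details
Motivation: Traditional image quality assessment models usually output a single global score and are highly insensitive to local, fine-grained distortions. An RL-based super-resolution model can therefore produce perceptually undesirable artifacts and still receive spuriously high scores, a reward hacking problem in which the optimization objective is misaligned with perceptual quality.
Result: Experiments confirm that the method is effective for RLHF-based super-resolution in terms of both global quality and local realism.
Insight: The two core contributions are the fine-grained perceptual reward model, which provides a spatially aware degradation map, and the co-evolutionary curriculum learning mechanism. The former aligns more precisely with human perceptual preferences by quantifying local defects. The latter raises the difficulty of the reward model and the generator in step, which addresses the instability of reinforcement learning under a complex reward and suggests a way to apply RLHF stably to detail-sensitive vision tasks.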
Abstract: Reinforcement Learning with Human Feedback (RLHF) has proven effective in image generation, where reward models guide the generator toward human preferences. Motivated by this, adapting RLHF to Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality, with Image Quality Assessment (IQA) models serving as reward models. However, traditional IQA models usually output a single global score, which is exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and resulting in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validate the effectiveness of our method across ISR models, in both global quality and local realism, on RLHF methods.
[72] Visual Autoregressive Modelling for Monocular Depth Estimation cs.CVPDF
Amir El-Ghoussani, André Kaup, Nassir Navab, Gustavo Carneiro, Vasileios Belagiannis
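TL;DR: This paper proposes a monocular depth estimation method based on visual autoregressive (VAR) priors as an alternative to diffusion-based approaches. It adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Inference runs in ten fixed autoregressive stages, and fine-tuning on only 74K synthetic samples yields competitive results.
Details
Motivation: The aim is to provide a geometry-aware generative model for monocular depth estimation that is based on autoregressive priors rather than diffusion, and to explore its advantages in data scalability and adaptability to 3D vision tasks.
Result: Under constrained training conditions, the method achieves state-of-the-art performance on indoor benchmarks and also performs strongly on outdoor datasets.
Insight: The contribution is the successful adaptation of a large-scale text-to-image VAR model to depth estimation, together with the scale-wise conditional upsampling mechanism. Viewed objectively, the method shows the potential of autoregressive priors as generative models for depth estimation, particularly in data efficiency and task adaptability.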
Abstract: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code available at https://github.com/AmirMaEl/VAR-Depth.
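The abstract mentions classifier-free guidance without giving a formula. The sketch below shows the commonly used form, which extrapolates from an unconditional prediction toward a conditional one. Applying it to next-token logits, the function name `cfg_logits`, and the example numbers are assumptions made for illustration, not details taken from the paper.

```python
def cfg_logits(cond, uncond, scale):
    """Common classifier-free guidance rule: uncond + scale * (cond - uncond).

    scale = 0 returns the unconditional logits, scale = 1 the conditional ones,
    and scale > 1 pushes further in the direction the condition suggests.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


conditional = [2.0, 0.0]    # hypothetical logits given the conditioning input
unconditional = [1.0, 1.0]  # hypothetical logits with the condition dropped
guided = cfg_logits(conditional, unconditional, 2.0)
```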
[73] Investigating Deep Learning Models for Ejection Fraction Estimation from Echocardiography Videos cs.CV | cs.AI | cs.LGPDF
Shravan Saranyan, Pramit Saha
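TL;DR: This study systematically evaluates several deep learning architectures (3D Inception, two-stream networks, and CNN-RNN models) for automatic estimation of left ventricular ejection fraction (LVEF) from echocardiography videos. Trained and evaluated on the EchoNet-Dynamic dataset of 10,030 videos, a modified 3D Inception architecture achieves the best overall performance, with a root mean squared error (RMSE) of 6.79%. The study also finds a general tendency toward overfitting and a high sensitivity to hyperparameters such as convolutional kernel size and normalization strategy.
Details
Motivation: Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function, but manual assessment from echocardiograms is time-consuming and subject to inter-observer variability. Deep learning offers a promising route to automated, accurate LVEF estimation.
Result: On the EchoNet-Dynamic dataset, the modified 3D Inception architecture performs best, with an RMSE of 6.79%. Smaller, simpler models generally generalize better.
Insight: The paper systematically compares and modifies several video analysis architectures for LVEF estimation and analyzes the effects of architectural changes, fusion strategies, overfitting, and hyperparameter sensitivity. Its observations on architecture design and training strategy may carry over to a broader range of medical and non-medical video analysis tasks.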
Abstract: Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and plays a central role in the diagnosis and management of cardiovascular disease. Echocardiography, as a readily accessible and non-invasive imaging modality, is widely used in clinical practice to estimate LVEF. However, manual assessment of cardiac function from echocardiograms is time-consuming and subject to considerable inter-observer variability. Deep learning approaches offer a promising alternative, with the potential to achieve performance comparable to that of experienced human experts. In this study, we investigate the effectiveness of several deep learning architectures for LVEF estimation from echocardiography videos, including 3D Inception, two-stream, and CNN-RNN models. We systematically evaluate architectural modifications and fusion strategies to identify configurations that maximize prediction accuracy. Models were trained and evaluated on the EchoNet-Dynamic dataset, comprising 10,030 echocardiogram videos. Our results demonstrate that modified 3D Inception architectures achieve the best overall performance, with a root mean squared error (RMSE) of 6.79%. Across architectures, we observe a tendency toward overfitting, with smaller and simpler models generally exhibiting improved generalization. Model performance was also found to be highly sensitive to hyperparameter choices, particularly convolutional kernel sizes and normalization strategies. While this study focuses on echocardiography-based LVEF estimation, the insights gained regarding architectural design and training strategies may be applicable to a broader range of medical and non-medical video analysis tasks.
[74] Unleashing Foundation Vision Models: Adaptive Transfer for Diverse Data-Limited Scientific Domains cs.CV | cs.AIPDF
Qiankun Li, Feng He, Huabao Chen, Xin Ning, Kun Wang
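TL;DR: This paper proposes the Cluster Attention Adapter (CLAdapter), which adaptively transfers the knowledge of large-scale pre-trained vision models such as ViT and ConvNeXt to data-limited downstream tasks in scientific domains. It introduces attention mechanisms and cluster centers, and uses distribution correlation and transformation matrices to personalize the enhancement of feature representations, helping models adapt to different domains.
Details
Motivation: Models pre-trained on large-scale datasets such as LAION and ImageNet have acquired rich knowledge, yet many specialized, data-scarce downstream tasks in scientific domains remain difficult. An effective way to transfer and adapt these pre-trained representations is needed.
Result: In extensive experiments on 10 datasets spanning domains such as generic, multimedia, biological, medical, industrial, agricultural, environmental, geographical, materials science, out-of-distribution (OOD), and 3D analysis, CLAdapter achieves state-of-the-art (SOTA) performance, demonstrating its effectiveness in data-limited scientific domains.
Insight: The contribution is a unified adapter interface (CLAdapter) that combines attention mechanisms with cluster centers to personalize feature adjustment. It integrates with multiple 2D and 3D model architectures, including CNNs and Transformers, enabling adaptive knowledge transfer from large-scale pre-trained models to diverse downstream tasks.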
Abstract: In the big data era, the computer vision field benefits from large-scale datasets such as LAION-2B, LAION-400M, ImageNet-21K, and Kinetics, on which popular models like the ViT and ConvNeXt series have been pre-trained, acquiring substantial knowledge. However, numerous downstream tasks in specialized and data-limited scientific domains continue to pose significant challenges. In this paper, we propose a novel Cluster Attention Adapter (CLAdapter), which refines and adapts the rich representations learned from large-scale data to various data-limited downstream tasks. Specifically, CLAdapter introduces attention mechanisms and cluster centers to personalize the enhancement of transformed features through distribution correlation and transformation matrices. This enables models fine-tuned with CLAdapter to learn distinct representations tailored to different feature sets, facilitating the models’ adaptation from rich pre-trained features to various downstream scenarios effectively. In addition, CLAdapter’s unified interface design allows for seamless integration with multiple model architectures, including CNNs and Transformers, in both 2D and 3D contexts. Through extensive experiments on 10 datasets spanning domains such as generic, multimedia, biological, medical, industrial, agricultural, environmental, geographical, materials science, out-of-distribution (OOD), and 3D analysis, CLAdapter achieves state-of-the-art performance across diverse data-limited scientific domains, demonstrating its effectiveness in unleashing the potential of foundation vision models via adaptive transfer. Code is available at https://github.com/qklee-lz/CLAdapter.
[75] INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading cs.CV | cs.LGPDF
Mert Ikinci, Luna Toma, Karin U. Loeffler, Leticia Ussem, Daniela Süsskind
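TL;DR: This paper proposes INTERACT-CMIL, a multi-head deep learning framework for grading Conjunctival Melanocytic Intraepithelial Lesions (CMIL). Using shared feature learning with combinatorial partial supervision and an inter-dependence loss that enforces cross-task consistency, it jointly predicts five histopathological axes: WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia. On a newly curated multi-center dataset, it improves markedly over CNN and foundation-model baselines.
Details
Motivation: Accurate CMIL grading is essential for treatment and melanoma prediction, but it remains difficult because of subtle morphological cues and interrelated diagnostic criteria. Existing methods may not fully exploit the intrinsic relationships among the grading tasks.
Result: The model is trained and evaluated on a newly curated multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals. It improves consistently over CNN and foundation-model baselines, with relative macro F1 gains of up to 55.1% on WHO4 and 25.0% on vertical spread.
Insight: The main contribution is a multi-task framework that combines shared feature learning with combinatorial partial supervision and adds an inter-dependence loss enforcing consistency across task predictions. This yields coherent, interpretable multi-criteria predictions aligned with expert grading and provides a reproducible computational benchmark for CMIL diagnosis.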
Abstract: Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.
[76] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation cs.CVPDF
ZhenQi Chen, TsaiChing Ni, YuanFu Yang
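TL;DR: This paper proposes CritiFusion, a training-free inference-time framework that combines a multimodal semantic critique mechanism with frequency-domain refinement to improve the semantic alignment and detail quality of text-to-image generation.
Details
Motivation: Existing text-to-image diffusion models achieve excellent visual fidelity but align poorly with complex prompts, so the consistency between generated content and prompt intent needs to improve.
Result: On standard benchmarks, the method notably improves human-aligned metrics of text-image correspondence and visual quality, reaching human preference scores and aesthetic evaluations on par with state-of-the-art reward optimization approaches.
Insight: The CritiCore module uses a vision-language model and large language models to provide high-level semantic feedback, while SpecFusion merges intermediate generation states in the frequency domain to inject coarse structural information and preserve high-frequency details. The framework works as a plug-in compatible with existing diffusion backbones.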
Abstract: Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt’s intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
[77] Autoregressive Flow Matching for Motion Prediction cs.CVPDF
Johnathan Xie, Stefan Stojanov, Cristobal Eyzaguirre, Daniel L. K. Yamins, Jiajun Wu
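TL;DR: This paper proposes autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data, and trains it on diverse video datasets to generate future point track locations over long horizons. It targets the weakness of existing video prediction models in modeling complex motion and is applied to the downstream tasks of human motion prediction and robot action prediction.
Details
Motivation: Existing video prediction models achieve impressive visual realism but still struggle to model complex motion accurately, even at large scale. A new method is needed that handles sequential continuous data effectively and improves downstream task performance.
Result: On the human and robot motion prediction benchmarks developed by the authors, ARFM predicts complex motions, and conditioning robot action prediction and human motion prediction on the predicted future tracks significantly improves downstream task performance.
Insight: The contribution is the combination of autoregression with flow matching for probabilistic modeling of sequential data, plus the use of predicted tracks as conditioning to strengthen downstream tasks, giving motion prediction a scalable and effective framework.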
Abstract: Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models publicly available at: https://github.com/Johnathan-Xie/arfm-motion-prediction.
[78] Multimodal Diffeomorphic Registration with Neural ODEs and Structural Descriptors cs.CV | cs.LGPDF
Salvador Rodriguez-Sanz, Monica Hernandez
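TL;DR: This paper proposes a multimodal diffeomorphic registration method based on Neural Ordinary Differential Equations (Neural ODEs) and structural descriptors. It is an instance-specific framework that does not depend on large amounts of training data and handles images from modalities unseen during training. By integrating image-based or feature-based structural descriptors with nonstructural image similarities computed by local mutual information, it outperforms state-of-the-art methods in accuracy and computational efficiency across experiments built from combinations of scan datasets, works for both large and small deformations, and is robust to changes in explicit regularization.
Details
Motivation: Traditional nonrigid registration algorithms trade off accuracy, computational complexity, and regularization, and they assume intensity correlation in anatomically homologous regions of an image pair, which limits them to the monomodal setting.
Result: Across multiple experiments formed from combinations of scan datasets, the method surpasses, both qualitatively and quantitatively, state-of-the-art baselines suited to large or small deformations as well as baselines designed specifically for multimodal registration.
Insight: The contribution is combining the Neural ODE paradigm with modality-agnostic structural descriptors, which exploit self-similarities on parameterized neighborhood geometries. The result is an instance-specific registration framework that needs no large training set and is robust to unseen modalities, shown to be effective and efficient across regularization levels and scales.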
Abstract: This work proposes a multimodal diffeomorphic registration method using Neural Ordinary Differential Equations (Neural ODEs). Nonrigid registration algorithms exhibit tradeoffs between their accuracy, the computational complexity of their deformation model, and its proper regularization. In addition, they also assume intensity correlation in anatomically homologous regions of interest among image pairs, limiting their applicability to the monomodal setting. Unlike learning-based models, we propose an instance-specific framework that is not subject to high scan requirements for training and does not suffer performance degradation at inference time on modalities unseen during training. Our method exploits the potential of continuous-depth networks in the Neural ODE paradigm with structural descriptors, widely adopted as modality-agnostic metric models which exploit self-similarities on parameterized neighborhood geometries. We propose three different variants that integrate image-based or feature-based structural descriptors and nonstructural image similarities computed by local mutual information. We conduct extensive evaluations on different experiments formed by scan dataset combinations and show surpassing qualitative and quantitative results compared to state-of-the-art baselines adequate for large or small deformations, and specific of multimodal registration. Lastly, we also demonstrate the underlying robustness of the proposed framework to varying levels of explicit regularization while maintaining low error, its suitability for registration at varying scales, and its efficiency with respect to other methods targeted to large-deformation registration.
[79] SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis cs.CVPDF
Paul Dobre, Jackson Cooper, Xin Wang, Hongzhou Yang
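TL;DR: This paper proposes SCPainter, a unified framework for realistic 3D asset insertion and novel view synthesis in autonomous driving simulation. It combines vehicle assets represented as 3D Gaussian Splats with 3D scene point clouds and uses diffusion-based generation to handle asset insertion and view synthesis jointly, producing diverse and realistic driving data.
Details
Motivation: Autonomous driving simulation needs diverse training data, especially long-tailed driving scenarios. Existing methods usually treat 3D asset reconstruction/insertion and novel view synthesis separately, so inserted assets lack realism in lighting and shadows and are hard to make interact with the scene when creating new training scenarios. A unified framework that jointly handles realistic 3D asset insertion and novel view synthesis is therefore needed.
Result: Evaluation on the Waymo Open Dataset shows that the framework achieves both 3D asset insertion and novel view synthesis, supporting the creation of diverse and realistic driving data.
Insight: The main contribution is a unified framework that combines 3D Gaussian Splat asset representations and 3D scene point clouds with diffusion-based generation to jointly improve the realism of inserted assets (for example, lighting consistency) and the quality of novel views, increasing data diversity for autonomous driving simulation.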
Abstract: 3D asset insertion and novel view synthesis (NVS) are key components for autonomous driving simulation, enhancing the diversity of training data. With better training data that is diverse and covers a wide range of situations, including long-tailed driving scenarios, autonomous driving models can become more robust and safer. This motivates a unified simulation framework that can jointly handle realistic integration of inserted 3D assets and NVS. Recent 3D asset reconstruction methods enable reconstruction of dynamic actors from video, supporting their re-insertion into simulated driving scenes. While the overall structure and appearance can be accurate, these methods still struggle to capture the realism of 3D assets through lighting or shadows, particularly when the assets are inserted into scenes. In parallel, recent advances in NVS methods have demonstrated promising results in synthesizing viewpoints beyond the originally recorded trajectories. However, existing approaches largely treat asset insertion and NVS capabilities in isolation. To allow for interaction with the rest of the scene and to enable more diverse creation of new scenarios for training, realistic 3D asset insertion should be combined with NVS. To address this, we present SCPainter (Street Car Painter), a unified framework which integrates 3D Gaussian Splat (GS) car asset representations and 3D scene point clouds with diffusion-based generation to jointly enable realistic 3D asset insertion and NVS. The 3D GS assets and 3D scene point clouds are projected together into novel views, and these projections are used to condition a diffusion model to generate high quality images. Evaluation on the Waymo Open Dataset demonstrates the capability of our framework to enable 3D asset insertion and NVS, facilitating the creation of diverse and realistic driving data.
[80] Split4D: Decomposed 4D Scene Reconstruction Without Video Segmentation cs.CVPDF
Yongzhen Hu, Yihui Yang, Haotong Lin, Yifan Wang, Junting Dong
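TL;DR: This paper proposes Split4D, a new method for decomposed 4D scene reconstruction (a dynamic 3D scene changing over time) from multi-view videos. It represents the dynamic scene with Freetime FeatureGS and uses a streaming feature learning strategy to recover the 4D decomposition directly from per-image 2D segmentation maps, avoiding the reliance of earlier methods on unstable video segmentation.
Details
Motivation: Existing methods obtain decomposed 4D reconstructions by lifting video segmentation results to a 4D representation, so their quality depends heavily on the video segmentation maps, which are often unstable and lead to unreliable reconstructions. This paper aims to achieve reliable decomposed reconstruction without video segmentation.
Result: Experiments on several datasets show that the method's reconstruction quality outperforms recent methods by a large margin.
Insight: The main contributions are: 1) Freetime FeatureGS, which models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion, allowing them to move to neighboring regions over time; 2) a contrastive loss that pulls primitive features together or pushes them apart according to whether their 2D projections belong to the same instance, which extends feature learning to the temporal dimension and achieves 4D segmentation; 3) a streaming training strategy that samples observations in temporal order, propagating features over time and avoiding local minima during optimization. Viewed objectively, the method recasts 4D decomposition as streaming feature learning from single-frame 2D segmentation, sidestepping the video segmentation bottleneck.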
Abstract: This paper addresses the problem of decomposed 4D scene reconstruction from multi-view videos. Recent methods achieve this by lifting video segmentation results to a 4D representation through differentiable rendering techniques. Therefore, they heavily rely on the quality of video segmentation maps, which are often unstable, leading to unreliable reconstruction results. To overcome this challenge, our key idea is to represent the decomposed 4D scene with the Freetime FeatureGS and design a streaming feature learning strategy to accurately recover it from per-image segmentation maps, eliminating the need for video segmentation. Freetime FeatureGS models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion ability, allowing them to move to neighboring regions over time. We apply a contrastive loss to Freetime FeatureGS, forcing primitive features to be close or far apart based on whether their projections belong to the same instance in the 2D segmentation map. As our Gaussian primitives can move across time, it naturally extends the feature learning to the temporal dimension, achieving 4D segmentation. Furthermore, we sample observations for training in a temporally ordered manner, enabling the streaming propagation of features over time and effectively avoiding local minima during the optimization process. Experimental results on several datasets show that the reconstruction quality of our method outperforms recent methods by a large margin.
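The abstract describes a contrastive loss that pulls primitive features together or pushes them apart depending on whether their projections share an instance in the 2D segmentation map, but it does not give the formula. The sketch below substitutes the classic margin-based pairwise contrastive loss as a stand-in. The function name, the `margin` parameter, and the toy features are assumptions, and instance labels are passed in directly rather than obtained by projecting Gaussians into a segmentation map.

```python
import math


def pairwise_contrastive_loss(features, instance_ids, margin=1.0):
    """Margin-based contrastive loss over all feature pairs (illustrative stand-in).

    Same-instance pairs are penalized by squared distance; different-instance
    pairs are penalized only when they are closer than `margin`.
    """
    total, pairs = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            dist = math.dist(features[i], features[j])
            if instance_ids[i] == instance_ids[j]:
                total += dist ** 2                      # pull together
            else:
                total += max(0.0, margin - dist) ** 2   # push apart
            pairs += 1
    return total / pairs if pairs else 0.0


feats = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
well_separated = pairwise_contrastive_loss(feats, [1, 1, 2])  # labels agree with geometry
mixed_up = pairwise_contrastive_loss(feats, [1, 2, 1])        # labels contradict geometry
```

The loss is zero when features of the same instance coincide and different instances are at least `margin` apart, and it grows when the labels and the feature geometry disagree.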
[81] TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts cs.CVPDF
Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin
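TL;DR: This paper proposes TrimTokenator-LC, an adaptive visual token pruning method designed for large multimodal models with long contexts and multi-image inputs. It analyzes intra-image and inter-image redundancy, allocates token budgets dynamically, and prunes in two stages, substantially reducing inference cost while preserving model performance.
Details
Motivation: Large multimodal models generate many visual tokens when processing visual inputs, which makes inference expensive, especially in long-context, multi-image scenarios whose challenges existing pruning methods often overlook.
Result: Extensive experiments show that the method maintains strong performance in long-context settings while significantly reducing the number of visual tokens.
Insight: Redundancy is decomposed into intra-image diversity and inter-image variation, which guide dynamic budget allocation. A two-stage pruning strategy combines greedy selection, global diversity filtering, and Pareto selection to reduce tokens while balancing diversity with text alignment.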
Abstract: Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach maintains strong performance in long context settings while significantly cutting down the number of visual tokens.
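The intra-image stage is described as greedily selecting tokens under a per-image budget, but the abstract does not state the selection criterion. As one plausible illustration, the sketch below uses greedy farthest-point selection under cosine distance, which favors a diverse subset. This criterion, the function names, and the choice to seed with the first token are assumptions, not the paper's procedure. Token vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def greedy_select(tokens, budget):
    """Greedy farthest-point selection: return indices of up to `budget` tokens."""
    if budget <= 0 or not tokens:
        return []
    selected = [0]  # seed with the first token (an arbitrary, hypothetical choice)
    # Distance from each token to its nearest already-selected token.
    min_dist = [cosine_distance(t, tokens[0]) for t in tokens]
    while len(selected) < min(budget, len(tokens)):
        candidates = [i for i in range(len(tokens)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, cosine_distance(t, tokens[nxt]))
                    for d, t in zip(min_dist, tokens)]
    return selected


tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept = greedy_select(tokens, budget=2)  # the near-duplicate token at index 1 is dropped
```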
[82] Neighbor-Aware Token Reduction via Hilbert Curve for Vision Transformers cs.CV [PDF]
Yunge Li, Lanyu Xu
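TL;DR: This paper proposes a neighbor-aware token reduction method based on Hilbert curve reordering to improve the computational efficiency of Vision Transformers. Two strategies, Neighbor-Aware Pruning (NAP) and Merging by Adjacent Token similarity (MAT), reduce redundant token representations while preserving spatial continuity and neighbor relationships, giving the best reported balance between accuracy and efficiency.
Details
Motivation: Vision Transformers perform well on visual recognition tasks, but redundant token representations limit their computational efficiency; existing token merging and pruning methods often ignore spatial continuity and neighbor relationships, which loses local context.
Result: Experiments show that the method achieves a state-of-the-art (SOTA) accuracy-efficiency trade-off, outperforming existing methods.
Insight: The novelty lies in using the Hilbert curve to explicitly preserve 2D neighbor structure in a 1D sequential representation; the work highlights the importance of spatial continuity and neighbor structure for ViT architectural optimization and offers a new perspective on token reduction.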
Abstract: Vision Transformers (ViTs) have achieved remarkable success in visual recognition tasks, but redundant token representations limit their computational efficiency. Existing token merging and pruning strategies often overlook spatial continuity and neighbor relationships, resulting in the loss of local context. This paper proposes novel neighbor-aware token reduction methods based on Hilbert curve reordering, which explicitly preserves the neighbor structure in a 2D space using 1D sequential representations. Our method introduces two key strategies: Neighbor-Aware Pruning (NAP) for selective token retention and Merging by Adjacent Token similarity (MAT) for local token aggregation. Experiments demonstrate that our approach achieves state-of-the-art accuracy-efficiency trade-offs compared to existing methods. This work highlights the importance of spatial continuity and neighbor structure, offering new insights for the architectural optimization of ViTs.
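To make the Hilbert reordering concrete, here is a small sketch of the standard grid-to-curve index mapping, followed by a simple pass that merges consecutive tokens along the curve. The index function is the textbook algorithm for a power-of-two grid; the merge rule (average a run of neighbors whose cosine similarity exceeds a threshold) is only an assumed stand-in for MAT, whose exact definition the abstract does not give. The function names are hypothetical.

```python
import numpy as np

def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(n):
    """All grid cells sorted by their Hilbert index."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))

def merge_adjacent(seq, thresh):
    """Average runs of consecutive tokens whose cosine similarity exceeds thresh
    (an assumed merge rule, not the paper's MAT definition)."""
    merged, counts = [seq[0].astype(float)], [1]
    for tok in seq[1:]:
        prev = merged[-1]
        cos = tok @ prev / (np.linalg.norm(tok) * np.linalg.norm(prev))
        if cos > thresh:
            merged[-1] = (prev * counts[-1] + tok) / (counts[-1] + 1)
            counts[-1] += 1
        else:
            merged.append(tok.astype(float))
            counts.append(1)
    return np.stack(merged)
```

The useful property is that consecutive positions on the curve are always adjacent cells in the grid, so a 1D pass over the reordered sequence only ever compares spatial neighbors, unlike a row-major scan, which jumps across the image at every row end.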
[83] Next Best View Selections for Semantic and Dynamic 3D Gaussian Splatting cs.CV | cs.AI [PDF]
Yiqian Li, Wen Jiang, Kostas Daniilidis
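TL;DR: This paper proposes an active learning algorithm based on Fisher information that selects the most informative next best view for semantic and dynamic 3D Gaussian Splatting scenes to improve model training. It quantifies each candidate view's information gain with respect to the semantic Gaussian parameters and the deformation network, handles semantic reasoning and dynamic scene modeling jointly, and is validated on large-scale static image and dynamic video datasets.
Details
Motivation: To address data redundancy in semantic understanding and dynamic scene modeling, the paper uses an active learning strategy to prioritize the views with the greatest information gain for training, replacing heuristic or random selection.
Result: Experiments show that, when selecting informative frames in multi-camera setups, the method consistently improves rendering quality and semantic segmentation performance, outperforming baselines based on random selection and uncertainty heuristics.
Insight: The novelty lies in formulating view selection as an active learning problem and using Fisher information to quantify candidate views' information gain for the semantic and dynamic parameters, which gives a principled basis for jointly handling semantic reasoning and dynamic modeling.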
Abstract: Understanding semantics and dynamics has been crucial for embodied agents in various tasks. Both tasks have much more data redundancy than the static scene understanding task. We formulate the view selection problem as an active learning problem, where the goal is to prioritize frames that provide the greatest information gain for model training. To this end, we propose an active learning algorithm with Fisher Information that quantifies the informativeness of candidate views with respect to both semantic Gaussian parameters and deformation networks. This formulation allows our method to jointly handle semantic reasoning and dynamic scene modeling, providing a principled alternative to heuristic or random strategies. We evaluate our method on large-scale static images and dynamic video datasets by selecting informative frames from multi-camera setups. Experimental results demonstrate that our approach consistently improves rendering quality and semantic segmentation performance, outperforming baseline methods based on random selection and uncertainty-based heuristics.
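One way to picture Fisher-information view selection is the toy sketch below. It assumes a Gaussian observation model, so that a view with Jacobian J (rendered outputs with respect to model parameters) contributes Fisher information J^T J, and it scores each candidate by the trace of its information against the inverse of the information already collected. This scoring rule is an assumption chosen for illustration; the abstract does not state the paper's exact criterion, and `select_next_view` is a hypothetical name.

```python
import numpy as np

def fisher(jacobian):
    """Fisher information of one view under unit-variance Gaussian noise (assumed model)."""
    return jacobian.T @ jacobian

def select_next_view(train_jacobians, candidate_jacobians, damping=1e-3):
    """Return the index of the candidate with the highest expected information gain."""
    dim = train_jacobians[0].shape[1]
    # Information accumulated from views already in the training set.
    # The damping term keeps the matrix invertible for unobserved parameters.
    collected = damping * np.eye(dim) + sum(fisher(j) for j in train_jacobians)
    collected_inv = np.linalg.inv(collected)
    scores = [float(np.trace(fisher(j) @ collected_inv)) for j in candidate_jacobians]
    return int(np.argmax(scores)), scores
```

A candidate that only re-observes well-constrained parameters scores low, while one that touches poorly constrained parameters scores high, which matches the redundancy argument made in the abstract.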
[84] Plug In, Grade Right: Psychology-Inspired AGIQA cs.CV | eess.IV [PDF]
Zhicheng Liao, Baoliang Chen, Hanwei Zhu, Lingyu Zhu, Shiqi Wang
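TL;DR: This paper proposes an AGIQA method based on the Graded Response Model (GRM) from psychometrics. A two-branch quality grading module mitigates semantic drift in text-image shared-space learning; it plugs into existing AGIQA models to improve their performance and generalizes to natural and screen content image quality assessment.
Details
Motivation: Existing AGIQA models assess quality by measuring the similarity between image embeddings and text embeddings of multi-grade quality descriptions, but the similarity distribution is often multimodal (semantic drift): text embeddings are semantically inconsistent with their intended descriptions, which undermines the reliability of text-image shared-space learning.
Result: The proposed Arithmetic GRM based Quality Grading (AGQG) module is plug-and-play across multiple state-of-the-art AGIQA frameworks, consistently improving performance, and it generalizes effectively to natural and screen content image quality assessment.
Insight: The novelty lies in bringing the psychometric Graded Response Model into AGIQA: a two-branch module separately estimates image ability and constructs multiple difficulty levels, and difficulties are generated arithmetically to ensure monotonicity and a unimodal, interpretable distribution, which mitigates semantic drift and improves reliability and generalization.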
Abstract: Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both “excellent” and “poor” grade descriptions while deviating from the “good” one. We refer to this phenomenon as “semantic drift”, where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject’s ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image’s ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.
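The grading step can be sketched with the classical Graded Response Model, where the probability of reaching at least grade k is a logistic function of ability minus the k-th difficulty, and per-grade probabilities are differences of consecutive cumulative terms. Generating the difficulties as an arithmetic sequence with a positive step keeps them strictly increasing, so every grade probability is non-negative. The parameter names and the shared discrimination value are assumptions for this sketch, not the AGQG module's actual parameterization.

```python
import numpy as np

def grade_probabilities(ability, first_difficulty, step, num_grades, discrimination=1.0):
    """Per-grade probabilities under a GRM with arithmetic difficulty levels."""
    assert step > 0, "a positive step keeps the difficulties monotone"
    k = np.arange(num_grades - 1)
    difficulties = first_difficulty + step * k  # b_1 < b_2 < ... by construction
    # reach[i] = P(grade >= i + 1 | ability)
    reach = 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulties)))
    # P(grade >= 0) = 1 and P(grade >= num_grades) = 0 close the sequence.
    cumulative = np.concatenate(([1.0], reach, [0.0]))
    return cumulative[:-1] - cumulative[1:]
```

For example, `grade_probabilities(0.2, -1.5, 1.0, 5, discrimination=2.0)` puts most of its mass on the middle grade and less on the grades further away, instead of the "excellent and poor but not good" pattern the abstract calls semantic drift.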
[85] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization cs.CV [PDF]
Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang
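TL;DR: This paper proposes EPD-Solver, a new ODE solver that reduces truncation error by introducing multiple parallel gradient evaluations in each step, thereby accelerating diffusion model sampling. Based on a geometric insight, it uses the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately, and it is trained with a two-stage optimization framework: distillation-based parameter learning followed by a parameter-efficient reinforcement learning fine-tuning scheme. The method can also serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
Details
Motivation: Diffusion models achieve SOTA performance on generative tasks, but their sequential denoising causes high sampling latency. Existing solver-based acceleration methods often suffer significant image quality degradation under low-latency budgets, mainly because of truncation errors accumulated from failing to capture high-curvature trajectory segments.
Result: The paper validates EPD-Solver on text-to-image (T2I) generation, showing that it significantly reduces image quality degradation while keeping sampling latency low and improves performance on complex generation tasks.
Insight: Contributions include: 1) using the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, and improving ODE solution accuracy through parallel gradient evaluations and the Mean Value Theorem for vector-valued functions; 2) a two-stage optimization framework that combines distillation with parameter-efficient RL fine-tuning (reformulating the solver as a stochastic Dirichlet policy), which avoids fine-tuning the large backbone and effectively mitigates reward hacking; 3) flexibility, since the method works as a plugin compatible with existing ODE samplers.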
Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling nature. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
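The idea of combining several independent gradient evaluations inside one solver step can be sketched on a generic ODE dx/dt = f(x, t). The sketch below is one reading of the abstract, not the authors' solver: it predicts a few intermediate states with an Euler guess, evaluates f at each of them (these calls do not depend on one another, so they could run in parallel), and mixes the results with weights on the probability simplex. The intermediate times `taus`, the weights, and the function names are all hypothetical; in the paper these quantities are learned, and the Dirichlet sampling shown here only illustrates where a stochastic Dirichlet policy would produce such weights.

```python
import numpy as np

def parallel_direction_step(f, x, t, h, taus, weights):
    """One step x(t) -> x(t + h) using several independent gradient evaluations."""
    taus = np.asarray(taus, dtype=float)        # fractions of the step, each in [0, 1]
    weights = np.asarray(weights, dtype=float)  # convex combination weights
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    base = f(x, t)
    # Each evaluation uses only (x, t, base), so the calls are mutually independent.
    grads = [f(x + tau * h * base, t + tau * h) for tau in taus]
    direction = sum(w * g for w, g in zip(weights, grads))
    return x + h * direction

def sample_weights(alpha, rng):
    """Draw simplex weights from a Dirichlet distribution (stand-in for a learned policy)."""
    return rng.dirichlet(alpha)
```

On the test equation dx/dt = -x with ten steps over [0, 1], two evaluations at a quarter and three quarters of each step with equal weights land closer to exp(-1) than plain Euler does, which is the truncation-error effect the abstract describes.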
[86] VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM cs.CV [PDF]
Jingchao Wang, Kaiwen Zhou, Zhijian Wu, Kunhua Ji, Dingjiang Huang
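TL;DR: This paper presents VPTracker, the first global vision-language tracking framework based on multimodal large language models (MLLMs). Visual prompts and spatial priors strengthen target localization, addressing existing methods' tendency to fail under viewpoint changes, occlusion, and fast motion.
Details
Motivation: Existing vision-language tracking methods are usually limited to local search and fail easily under viewpoint changes, occlusion, and rapid target movement, so a robust tracking framework capable of global search is needed.
Result: Across several challenging scenarios, VPTracker significantly improves tracking stability and target disambiguation; the experiments indicate that it effectively integrates MLLMs into visual tracking.
Insight: The novelty lies in a location-aware visual prompting mechanism that injects spatial priors into the MLLM, balancing region-level recognition with global inference and thereby suppressing distractors while retaining the advantages of global tracking.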
Abstract: Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target’s previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.
[87] Medical Scene Reconstruction and Segmentation based on 3D Gaussian Representation cs.CV [PDF]
Bin Liu, Wenyan Tian, Huangxin Fu, Zizheng Li, Zhifen He
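TL;DR: This paper proposes an efficient 3D reconstruction method for medical images based on 3D Gaussian and tri-plane representations, targeting the high computational cost, structural discontinuities, and loss of detail that traditional methods exhibit with sparse slices. It retains the efficient rendering and geometric representation advantages of Gaussians while significantly improving structural continuity and semantic consistency under sparse slicing.
Details
Motivation: Traditional 3D reconstruction methods for medical images are computationally expensive and prone to structural discontinuities and loss of detail with sparse slices, which makes clinical accuracy requirements hard to meet; a more efficient and robust reconstruction method is needed.
Result: Experiments on multimodal medical datasets such as ultrasound (US) and magnetic resonance imaging (MRI) show that the method generates high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions while significantly improving reconstruction efficiency.
Insight: The novelty lies in combining 3D Gaussian and tri-plane representations to improve structural continuity and semantic consistency under sparse slicing; viewed objectively, the method offers an approach to 3D medical visualization that balances efficiency and quality and may apply to other 3D reconstruction tasks that need efficient rendering.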
Abstract: 3D reconstruction of medical images is a key technology in medical image analysis and clinical diagnosis, providing structural visualization support for disease assessment and surgical planning. Traditional methods are computationally expensive and prone to structural discontinuities and loss of detail in sparse slices, making it difficult to meet clinical accuracy requirements. To address these challenges, we propose an efficient 3D reconstruction method based on 3D Gaussian and tri-plane representations. This method not only maintains the advantages of Gaussian representation in efficient rendering and geometric representation but also significantly enhances structural continuity and semantic consistency under sparse slicing conditions. Experimental results on multimodal medical datasets such as US and MRI show that our proposed method can generate high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency. This provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images.
[88] EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation cs.CV | cs.AI [PDF]
Libo Zhang, Zekun Li, Tianyu Li, Zeyu Cao, Rui Xu
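TL;DR: The paper proposes EgoReAct, the first autoregressive framework that generates 3D spatially aligned human reaction motions in real time from egocentric video streams. To address spatial inconsistency in existing datasets, the authors build the Human Reaction Dataset (HRD). EgoReAct compresses motion into a latent space with a VQ-VAE and trains a GPT model for generation, incorporating 3D dynamic features to strengthen spatial grounding.
Details
Motivation: Faithfully modeling human reactions from egocentric video is challenging because generation must be strictly causal and precisely aligned with 3D space, while existing datasets show significant spatial inconsistency.
Result: Extensive experiments show that EgoReAct achieves markedly higher realism, spatial consistency, and generation efficiency than prior methods while maintaining strict causality during generation.
Insight: Contributions include the spatially aligned HRD dataset and an autoregressive generation framework that incorporates 3D dynamic features (such as metric depth and head dynamics), which strengthens the spatial grounding of generated motions and enables real-time, causal 3D reaction generation.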
Abstract: Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth, and head dynamics during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.
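The compression step relies on vector quantization: each continuous motion latent is replaced by the index of its nearest codebook entry, and the resulting index sequence is what an autoregressive transformer is trained to predict. The lookup itself can be sketched as below; the codebook size, latent dimension, and the name `quantize` are illustrative assumptions, and encoder, decoder, and training losses are omitted.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  (N, D) array of encoder outputs
    codebook: (K, D) array of code vectors
    Returns the (N,) code indices and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance between every latent and every code: shape (N, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]
```

The discrete indices are what make next-token generation possible: the generator only has to choose among K codes at each step, conditioned on the video features seen so far.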
[89] ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning cs.CV | cs.GR | cs.LG [PDF]
Bangya Liu, Xinyu Gong, Zelin Zhao, Ziyang Song, Yulei Lu
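TL;DR: This paper proposes ByteLoom, a Diffusion Transformer (DiT)-based framework for generating realistic, geometry-consistent human-object interaction (HOI) videos. An RCM-cache mechanism preserves the object's multi-view geometric consistency, and a progressive curriculum learning strategy reduces the dependence on fine-grained hand mesh annotations.
Details
Motivation: Existing HOI video generation methods have two main limitations: they lack an effective mechanism for injecting multi-view object information, which leads to poor cross-view consistency, and they rely heavily on fine-grained hand mesh annotations to model interaction occlusions.
Result: Extensive experiments show that the method faithfully preserves human identity and the object's multi-view geometric consistency while maintaining smooth motion and object manipulation.
Insight: Contributions include: 1) an RCM-cache mechanism that uses Relative Coordinate Maps (RCM) as a universal representation to maintain object geometry consistency and precisely control 6-DoF object transformations; 2) a progressive curriculum learning strategy that strengthens model capability and lowers the need for hand mesh annotations.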
Abstract: Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object illustration, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain the object's geometric consistency while precisely controlling 6-DoF object transformations. To compensate for the scarcity of HOI datasets and leverage existing datasets, we further design a training curriculum that enhances model capabilities progressively and relaxes the need for hand mesh annotations. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry, while maintaining smooth motion and object manipulation.
[90] MUSON: A Reasoning-oriented Multimodal Dataset for Socially Compliant Navigation in Urban Environments cs.CV | cs.RO [PDF]
Zhuonan Liu, Xinyu Zhang, Zishuo Wang, Tomohito Kawabata, Xuesu Xiao
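TL;DR: The paper introduces MUSON, a reasoning-oriented multimodal dataset for socially compliant navigation in urban environments. It supervises model learning with structured five-step Chain-of-Thought annotations (perception, prediction, reasoning, action, and explanation) and was collected across diverse indoor and outdoor campus scenes. Compared with existing datasets, MUSON offers a more balanced action distribution and explicit modeling of physical constraints, and it can serve as an effective benchmark for socially compliant navigation.
Details
Motivation: Existing social navigation datasets lack explicit reasoning supervision and have severely long-tailed action distributions, which limits models' ability to learn safety-critical behaviors.
Result: When several state-of-the-art small vision-language models are benchmarked on MUSON, Qwen2.5-VL-3B achieves the highest decision accuracy, 0.8625.
Insight: The novelty lies in a social navigation dataset with structured reasoning annotations (five-step Chain-of-Thought) and a balanced discrete action space, which helps models learn safer, more interpretable decision processes. Viewed objectively, the dataset design provides a new benchmark and evaluation framework for addressing weak reasoning and data bias in social navigation.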
Abstract: Socially compliant navigation requires structured reasoning over dynamic pedestrians and physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models’ ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a rationally balanced discrete action space. Compared to SNEI, MUSON provides consistent reasoning, action, and explanation. Benchmarking multiple state-of-the-art Small Vision Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at https://huggingface.co/datasets/MARSLab/MUSON
[91] Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs cs.CV [PDF]
Ziyu Zhou, Haozhe Luo, Mohammad Reza Hosseinzadeh Taher, Jiaxuan Pang, Xiaowei Ding
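TL;DR: This paper proposes Lamps, a self-supervised learning method that learns human anatomy in chest radiographs from multiple perspectives (consistency, coherence, and hierarchy) in order to build a foundation model for medical imaging.
Details
Motivation: Existing self-supervised learning methods often overlook key foundational properties of human anatomy in medical images (consistency, coherence, and hierarchy), which limits their ability to learn anatomical features effectively.
Result: In extensive experiments on 10 datasets, evaluated through fine-tuning and emergent property analysis, Lamps shows superior robustness, transferability, and clinical potential compared with 10 baseline models.
Insight: The novelty lies in using the consistency, coherence, and hierarchy of human anatomy as supervision signals to learn anatomical features from multiple perspectives; this offers a new direction for medical imaging foundation models and may yield meaningful, robust representations aligned with anatomical structure.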
Abstract: Foundation models have been successful in natural language processing and computer vision because they are capable of capturing the underlying structures (foundation) of natural languages. However, in medical imaging, the key foundation lies in human anatomy, as these images directly represent the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet, existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to effectively learn anatomical features. To overcome the limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision) pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as the supervision signal. Extensive experiments across 10 datasets evaluated through fine-tuning and emergent property analysis demonstrate Lamps’ superior robustness, transferability, and clinical potential when compared to 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations that are aligned with the structure of human anatomy.
[92] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models cs.CV [PDF]
Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou, Jun-Cheng Chen
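TL;DR: This paper introduces M-ErasureBench, a multimodal benchmark framework for evaluating concept erasure methods in diffusion models. It covers three input modalities (text prompts, learned embeddings, and inverted latents) and includes white-box and black-box access scenarios. The study finds that existing methods work well on text prompts but tend to fail on the other modalities. The authors therefore propose IRECE, a module that perturbs latents during denoising to improve robustness and substantially lowers the concept reproduction rate.
Details
Motivation: Existing concept erasure methods focus on text prompts and ignore other input modalities (such as learned embeddings and inverted latents) that are increasingly important in real applications like image editing and personalized generation; these modalities can become attack surfaces through which erased concepts re-emerge.
Result: Experiments on M-ErasureBench show that existing methods erase effectively against text prompts but fail at a high rate under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. The proposed IRECE module restores robustness, reducing CRR by up to 40% in the most challenging white-box latent inversion scenario while preserving visual quality.
Insight: The novelty lies in establishing the first comprehensive multimodal concept erasure benchmark beyond text prompts and in proposing a plug-and-play inference-time robustness module (IRECE) that localizes target concepts via cross-attention and perturbs the associated latents, countering multimodal attacks and offering practical safeguards for building more reliable protective generative models.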
Abstract: Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.
[93] SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation cs.CV | cs.AI [PDF]
Hasan Faraz Khan, Noor Fatima, Muzammil Behzad
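TL;DR: This paper proposes SwinTF3D, a lightweight multimodal fusion approach for text-guided 3D medical image segmentation. A transformer-based visual encoder extracts volumetric features, which an efficient fusion mechanism combines with a compact text encoder, so the system can interpret natural-language prompts and align semantic cues with spatial structures in medical volumes, producing accurate, context-aware segmentation at low computational cost.
Details
Motivation: Existing 3D segmentation frameworks rely mainly on visual learning from large annotated datasets and lack semantic understanding, so they adapt poorly to new domains and to flexible, user-defined segmentation targets.
Result: Extensive experiments on the BTCV dataset show that SwinTF3D achieves competitive Dice and IoU scores across multiple organs despite its compact architecture, offers significant efficiency gains over conventional transformer-based segmentation networks, and generalizes well to unseen data.
Insight: The novelty lies in combining visual perception with language understanding through a lightweight multimodal fusion mechanism, giving a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation that could enable more adaptive and resource-efficient clinical imaging solutions.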
Abstract: The recent integration of artificial intelligence into medical imaging has driven remarkable advances in automated organ segmentation. However, most existing 3D segmentation frameworks rely exclusively on visual learning from large annotated datasets, restricting their adaptability to new domains and clinical tasks. The lack of semantic understanding in these models makes them ineffective in addressing flexible, user-defined segmentation objectives. To overcome these limitations, we propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. The model employs a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder via an efficient fusion mechanism. This design allows the system to understand natural-language prompts and correctly align semantic cues with their corresponding spatial structures in medical volumes, while producing accurate, context-aware segmentation results with low computational overhead. Extensive experiments on the BTCV dataset demonstrate that SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks. Bridging visual perception with linguistic understanding, SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening perspectives for more adaptive and resource-efficient solutions in clinical imaging.
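The two reported metrics, Dice and IoU, have standard definitions on binary masks: Dice is twice the overlap divided by the sum of the two mask sizes, and IoU is the overlap divided by the union. A minimal sketch for a single organ in a 3D volume follows; the function name `dice_iou` and the convention of returning 1.0 when both masks are empty are choices made for this example, not details from the paper.

```python
import numpy as np

def dice_iou(pred, target):
    """Dice coefficient and IoU for two binary masks of the same shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    union = total - intersection
    # Two empty masks are treated as a perfect match (assumed convention).
    dice = 2.0 * intersection / total if total else 1.0
    iou = intersection / union if union else 1.0
    return float(dice), float(iou)
```

For a multi-organ benchmark, the usual practice is to compute these per organ label (for instance on `pred == label` and `target == label`) and then average.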
[94] JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation cs.CV [PDF]
Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao
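TL;DR: JavisGPT is the first unified multimodal large language model built for joint audio-video comprehension and generation. It adopts a concise encoder-LLM-decoder architecture, uses a SyncFusion module for spatio-temporal audio-video fusion, and bridges a pretrained JAV-DiT generator with synchrony-aware learnable queries, supporting temporally coherent video-audio understanding and generation from multimodal instructions. The work also builds JavisInst-Omni, a high-quality instruction dataset of over 200K GPT-4o-curated dialogues, and designs a three-stage training pipeline of multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning.
Details
Motivation: Existing multimodal large language models fall short on joint audio-video comprehension and generation, particularly with temporal synchronization and complex scenes; the goal is a single unified model that handles audio and video comprehension and generation together.
Result: On joint audio-video comprehension and generation benchmarks, JavisGPT outperforms existing multimodal large language models, especially in complex and temporally synchronized settings, reaching SOTA.
Insight: Contributions include the SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries that bridge to the generator; viewed objectively, the unified architecture and three-stage training pipeline provide an effective end-to-end solution for multimodal comprehension and generation, and the high-quality instruction dataset also improves generalization.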
Abstract: This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
[95] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving cs.CV [PDF]
Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li
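TL;DR: This paper proposes ColaVLA, a unified vision-language-action framework for trajectory planning in autonomous driving. By moving reasoning from text into a unified latent space and coupling it with a hierarchical parallel trajectory decoder, it addresses three problems of current vision-language model (VLM) based planners: the mismatch between discrete text reasoning and continuous control, the high latency of autoregressive chain-of-thought decoding, and non-causal or inefficient planners.
Details
Motivation: Current VLM-based autonomous driving planners face three key challenges: a mismatch between discrete text reasoning and continuous control, high latency from autoregressive chain-of-thought decoding, and inefficient or non-causal planners that limit real-time deployment.
Result: On the nuScenes benchmark, ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings, with favorable efficiency and robustness.
Insight: Main contributions: 1) moving reasoning from text into a unified latent space to bridge discrete reasoning and continuous control; 2) the Cognitive Latent Reasoner, which compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes; 3) a hierarchical parallel planner that generates multi-scale, causality-consistent trajectories in a single forward pass, improving efficiency while preserving the generalization and interpretability of VLMs.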
Abstract: Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.
[96] CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision cs.CV [PDF]
Behnam Raoufi, Hossein Sharify, Mohamad Mahdee Ramezanee, Khosrow Hajsadeghi, Saeed Bagheri Shouraki
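TL;DR: This paper proposes CLIP-Joint-Detect, a detector-agnostic, end-to-end joint training framework that improves object detection by integrating CLIP-style contrastive vision-language supervision. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via an InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized at the same time. The framework applies to both two-stage and one-stage detection architectures.
Details
Motivation: Conventional object detectors rely on cross-entropy classification, which is vulnerable to class imbalance and label noise, so a more robust form of supervision is needed.
Result: Validated on Pascal VOC 2007+2012 with Faster R-CNN and on MS COCO 2017 with YOLOv11, the method yields substantial and consistent improvements while preserving real-time inference speed.
Insight: The novelty lies in combining learnable class-specific text embeddings with contrastive vision-language supervision and improving closed-set detection through end-to-end joint training; viewed objectively, the method is a general, lightweight framework that adapts to different detection architectures and mitigates the limitations of conventional classification.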
Abstract: Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
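The alignment term can be sketched as an InfoNCE-style loss in which each projected region feature is pulled toward the text embedding of its class and pushed away from the embeddings of all other classes. The sketch below is forward-only NumPy and makes several assumptions: features are already projected to the embedding dimension, similarity is cosine, and the temperature of 0.07 is a common default rather than a value reported in the abstract. The function name `info_nce` is hypothetical, and the auxiliary cross-entropy term and detection losses are omitted.

```python
import numpy as np

def info_nce(region_feats, text_embeds, labels, temperature=0.07):
    """Mean InfoNCE loss between region features (N, D) and class text embeddings (C, D)."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = r @ t.T / temperature                 # (N, C) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-probability of each region's own class.
    return float(-log_prob[np.arange(len(labels)), labels].mean())
```

In a joint-training setup of the kind the abstract describes, a term like this would be added to the detector's usual box and classification losses, with the text embeddings updated by the same optimizer.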
[97] RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance cs.CV [PDF]
Chunyuan Chen, Yunuo Cai, Shujuan Li, Weiyun Liang, Bin Wang
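TL;DR: This paper proposes the RealCamo framework, which adds layout controls and joint textual-visual guidance to make generated camouflage images more realistic, addressing existing methods' shortcomings in visual similarity and semantic consistency; it also introduces a new metric for camouflage quality.
Details
Motivation: Existing camouflaged image generation methods either produce weak camouflage because visual similarity is insufficient or generate cluttered backgrounds that are semantically inconsistent with the foreground target, leaving a large gap to real camouflaged images.
Result: Extensive experiments and visualizations demonstrate the framework's effectiveness, but the abstract does not state specific benchmarks or quantitative comparisons (for example, whether it reaches SOTA).
Insight: Contributions: 1) explicit layout controls that improve the global structure and the semantic coherence between the foreground and the generated background; 2) a multimodal condition that combines a fine-grained textual task description with texture-oriented background retrieval to jointly guide generation and improve visual fidelity; 3) a background-foreground distribution divergence metric that measures how effective the camouflage is in generated images.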
Abstract: Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose RealCamo, a unified out-painting based framework for realistic camouflaged image generation. RealCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multi-modal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.
[98] PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects cs.CV [PDF]
Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng
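TL;DR: This paper proposes PoseStreamer, a robust multimodal 6DoF pose estimation framework for high-speed motion. It integrates three core components (an Adaptive Pose Memory Queue, an Object-centric 2D Tracker, and a Ray Pose Filter) to counter the degradation in pose estimation that motion blur causes for standard RGB cameras in high-speed and low-light conditions, and it is validated on the newly built MoCapCube6D dataset.
Details
Motivation: Standard RGB cameras suffer from motion blur in high-speed and low-light scenes, which degrades 6DoF pose estimation for novel objects; the work exploits the high temporal resolution of event cameras to close the performance gap of current methods on fast-moving objects.
Result: Extensive experiments on the new multimodal dataset MoCapCube6D show that PoseStreamer achieves superior accuracy in high-speed motion scenarios and, as a template-free framework, generalizes strongly to unseen moving objects.
Insight: Contributions include: 1) a robust multimodal framework designed specifically for high-speed motion; 2) an Adaptive Pose Memory Queue that uses historical orientation cues to ensure temporal consistency; 3) an Object-centric 2D Tracker that supplies strong 2D priors to improve 3D center recall; 4) a Ray Pose Filter for geometric refinement along camera rays; 5) the new multimodal benchmark dataset MoCapCube6D. Viewed objectively, the core novelty is combining event camera data with carefully designed temporal and geometric modules to handle the challenges of high-speed motion systematically.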
Abstract: Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in scenarios with fast-moving objects. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed motion scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed motion scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.
[99] Spatial-aware Symmetric Alignment for Text-guided Medical Image Segmentation cs.CV [PDF]
Linglin Liao, Qichuan Geng, Yu Liu
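TL;DR: This paper proposes the Spatial-aware Symmetric Alignment (SSA) framework to improve medical image segmentation guided by hybrid medical texts that contain locational, descriptive, and diagnostic information. A symmetric optimal transport alignment mechanism establishes bi-directional, fine-grained multimodal correspondences between image regions and multiple relevant text expressions, and a composite directional guidance strategy explicitly introduces the spatial constraints in the text by constructing region-level guidance masks.
Details
Motivation: Current text-guided medical image segmentation methods have two key bottlenecks. First, they struggle to process diagnostic and descriptive texts simultaneously, which makes it hard to identify lesions and associate them with image regions. Second, they focus on lesion descriptions and fail to capture positional constraints, leading to critical deviations (for example, given the text "left lower lung", the segmentation may incorrectly cover both lungs).
Result: Extensive experiments on public benchmarks show that SSA achieves state-of-the-art performance, particularly when segmenting lesions characterized by spatial relational constraints.
Insight: Contributions: 1) a symmetric optimal transport alignment mechanism that establishes bi-directional, fine-grained multimodal correspondences between image regions and multiple relevant text expressions; 2) a composite directional guidance strategy that explicitly introduces the spatial constraints in the text through region-level guidance masks, improving the handling of hybrid medical texts (locational, descriptive, and diagnostic information).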
Abstract: Text-guided Medical Image Segmentation has shown considerable promise, with rich clinical text serving as an effective supplement for scarce data. However, current methods have two key bottlenecks. On one hand, they struggle to process diagnostic and descriptive texts simultaneously, making it difficult to identify lesions and establish associations with image regions. On the other hand, existing approaches focus on lesion descriptions and fail to capture positional constraints, leading to critical deviations. Specifically, with the text “in the left lower lung”, the segmentation results may incorrectly cover both sides of the lung. To address these limitations, we propose the Spatial-aware Symmetric Alignment (SSA) framework to enhance the capacity for handling hybrid medical texts consisting of locational, descriptive, and diagnostic information. Specifically, we propose a symmetric optimal transport alignment mechanism to strengthen the associations between image regions and multiple relevant expressions, which establishes bi-directional fine-grained multimodal correspondences. In addition, we devise a composite directional guidance strategy that explicitly introduces spatial constraints in the text by constructing region-level guidance masks. Extensive experiments on public benchmarks demonstrate that SSA achieves state-of-the-art (SOTA) performance, particularly in accurately segmenting lesions characterized by spatial relational constraints.
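To make the idea of a region-level guidance mask concrete, here is a minimal sketch that turns directional words in a phrase into a binary mask over image halves. The function name and the keyword rules are hypothetical, and "left" here means the left half of the image array; a real system would have to account for radiographic orientation conventions, which the abstract does not discuss.

```python
def direction_mask(height, width, text):
    """Return a height x width 0/1 mask for the image region named in `text`.

    Hypothetical rule set: "upper"/"lower" select a vertical half,
    "left"/"right" select a horizontal half (in image coordinates).
    """
    words = text.lower().split()
    rows = range(height)
    cols = range(width)
    if "upper" in words:
        rows = range(0, height // 2)
    elif "lower" in words:
        rows = range(height // 2, height)
    if "left" in words:
        cols = range(0, width // 2)
    elif "right" in words:
        cols = range(width // 2, width)
    row_set, col_set = set(rows), set(cols)
    return [[1 if r in row_set and c in col_set else 0 for c in range(width)]
            for r in range(height)]
```

For the phrase "in the left lower lung" on a 4 x 4 grid, only the bottom-left quadrant is set, so a segmentation constrained by this mask cannot spill into the opposite side.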
[100] OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding cs.CV | cs.AI [PDF]
Wenyuan Huang, Zhao Wang, Zhou Wei, Ting Huang, Fang Zhao
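TL;DR: OpenGround is a novel zero-shot framework for open-world 3D visual grounding. It targets the inability of existing methods, which depend on a pre-defined Object Lookup Table (OLT), to handle undefined or unforeseen targets. Its core is the Active Cognition-based Reasoning (ACR) module, which mimics human perception to progressively extend the cognitive scope of the vision-language model (VLM) and dynamically update the OLT, supporting both pre-defined and open-world categories. The paper also introduces OpenTarget, a new evaluation dataset with over 7000 object-description pairs.
Details
Motivation: Existing 3D visual grounding methods query a VLM through a pre-defined OLT, which limits their use in scenes with undefined or unforeseen targets. A zero-shot method that can handle open-world categories is therefore needed.
Result: Competitive performance on Nr3D, state-of-the-art (SOTA) results on ScanRefer, and a substantial 17.6% improvement on the newly proposed OpenTarget dataset.
Insight: The key novelty is the ACR module, which uses a cognitive task chain to mimic human perception and dynamically extends the VLM's cognitive scope, overcoming the limits of a pre-defined OLT and enabling zero-shot open-world 3D visual grounding. Viewed objectively, integrating a dynamic cognitive process into visual reasoning offers a reusable template for handling unknown categories.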
Abstract: 3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget.
[101] With Great Context Comes Great Prediction Power: Classifying Objects via Geo-Semantic Scene Graphs cs.CV [PDF]
Ciprian Constantinescu, Marius Leordeanu
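TL;DR: This paper proposes a context-aware object classification framework based on a Geo-Semantic Contextual Graph (GSCG). It builds a structured graph representation from a single monocular image, encoding object geometry, color, material attributes, and spatial relationships, and designs a graph classifier that aggregates features from the target object, its neighbors, and the global scene to predict the class.
Details
Motivation: Existing object recognition systems usually process image regions in isolation and ignore the contextual cues humans rely on, such as spatial relationships, materials, and object co-occurrence. The paper argues that context plays a critical role in object classification.
Result: On COCO 2017, the method reaches 73.4% classification accuracy, well above its context-agnostic versions (as low as 38.4%), fine-tuned ResNet models (at most 53.5%), and the multimodal large language model Llama 4 Scout (at most 42.3%).
Insight: The novelty lies in building an explicit, structured, and interpretable Geo-Semantic Contextual Graph (GSCG) and a dedicated graph classifier that exploits both local and global context. The core insight is that an explicit, structured scene representation improves object recognition more effectively than implicit or purely descriptive context, such as the descriptions given to an LLM.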
Abstract: Humans effortlessly identify objects by leveraging a rich understanding of the surrounding scene, including spatial relationships, material properties, and the co-occurrence of other objects. In contrast, most computational object recognition systems operate on isolated image regions, devoid of meaning in isolation, thus ignoring this vital contextual information. This paper argues for the critical role of context and introduces a novel framework for contextual object classification. We first construct a Geo-Semantic Contextual Graph (GSCG) from a single monocular image. This rich, structured representation is built by integrating a metric depth estimator with a unified panoptic and material segmentation model. The GSCG encodes objects as nodes with detailed geometric, chromatic, and material attributes, and their spatial relationships as edges. This explicit graph structure makes the model’s reasoning process inherently interpretable. We then propose a specialized graph-based classifier that aggregates features from a target object, its immediate neighbors, and the global scene context to predict its class. Through extensive ablation studies, we demonstrate that our context-aware model achieves a classification accuracy of 73.4%, dramatically outperforming context-agnostic versions (as low as 38.4%). Furthermore, our GSCG-based approach significantly surpasses strong baselines, including fine-tuned ResNet models (max 53.5%) and a state-of-the-art multimodal Large Language Model (LLM), Llama 4 Scout, which, even when given the full image alongside a detailed description of objects, maxes out at 42.3%. These results on COCO 2017 train/val splits highlight the superiority of explicitly structured and interpretable context for object recognition tasks.
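The aggregation step described above (target object, immediate neighbors, global scene) can be sketched as a simple feature concatenation. This is a toy illustration under stated assumptions: the node names, the two-dimensional features, and the use of plain means are hypothetical, and the paper's classifier is presumably learned rather than hand-built like this.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def context_features(nodes, edges, target):
    """Concatenate target, mean-neighbor, and mean-scene features.

    nodes: dict mapping object name -> feature list
    edges: list of undirected (name_a, name_b) spatial relations
    """
    dim = len(nodes[target])
    neighbors = [nodes[b] for a, b in edges if a == target]
    neighbors += [nodes[a] for a, b in edges if b == target]
    scene = mean_vector(list(nodes.values()), dim)
    return nodes[target] + mean_vector(neighbors, dim) + scene
```

The resulting vector could then be fed to any classifier; dropping the second and third parts gives a context-agnostic baseline of the kind the ablations compare against.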
[102] An Architecture-Led Hybrid Report on Body Language Detection Project cs.CV | cs.AI | cs.SE [PDF]
Thomson Tong, Diba Darooneh
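TL;DR: This architecture-led report analyzes two modern vision-language models, Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to the video-to-artifact pipeline implemented in the BodyLanguageDetection repository. The system samples video frames, prompts a VLM to detect visible people and produce pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates the output structure against a predefined schema, and optionally renders an annotated video.
Details
Motivation: To connect the properties of modern vision-language models to a practical body language detection video pipeline through architectural analysis, and to clarify how model behavior shapes system design and constraints.
Result: The report gives no quantitative results on specific benchmarks and no SOTA comparison. It focuses on qualitative observations about architecture and system implementation.
Insight: The novelty is a systematic link between VLM architectural properties (visual tokenization, Transformer attention, instruction following) and concrete engineering choices (structured output validation, frame-local identifiers, the interactive analysis mode). It stresses that building robust interfaces and designing evaluations requires separating syntactic validity from semantic correctness, and structural validation from geometric validation.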
Abstract: This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
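The distinction between structural and geometric validation can be shown with a small sketch. The field names (`id`, `bbox`, `emotion`) and the `[x1, y1, x2, y2]` box layout are assumptions made for illustration; the abstract does not give the repository's actual schema.

```python
import json

# Hypothetical schema: each detection needs these keys with these types.
REQUIRED = {"id": int, "bbox": list, "emotion": str}

def is_structurally_valid(raw):
    """Check JSON shape only: a list of objects with the required typed keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list):
        return False
    for item in data:
        if not isinstance(item, dict):
            return False
        for key, typ in REQUIRED.items():
            if not isinstance(item.get(key), typ):
                return False
        if len(item["bbox"]) != 4:
            return False
    return True

def is_geometrically_valid(raw, width, height):
    """Check that each box [x1, y1, x2, y2] is non-empty and inside the frame.

    Assumes `raw` already passed is_structurally_valid.
    """
    for item in json.loads(raw):
        x1, y1, x2, y2 = item["bbox"]
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            return False
    return True
```

A response whose box has its corners swapped passes the first check and fails the second, which is the "syntactically valid but semantically incorrect" case the report warns about. Neither check says whether the box actually contains a person.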
[103] Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion cs.CV [PDF]
Yi Zhou, Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang
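TL;DR: This paper proposes Co2S, a stable framework for semi-supervised semantic segmentation of remote sensing images that aims to mitigate pseudo-label drift. It builds a heterogeneous dual-student architecture whose two students are initialized from pretrained CLIP and DINOv3, and combines an explicit-implicit semantic co-guidance mechanism with a global-local feature collaborative fusion strategy to fuse priors from vision-language and self-supervised models, improving segmentation accuracy and stability.
Details
Motivation: Semi-supervised semantic segmentation of remote sensing images reduces the annotation burden but suffers from pseudo-label drift: confirmation bias causes errors to accumulate during training and degrades model performance.
Result: Extensive experiments on six popular datasets show that the method achieves leading performance across various partition protocols and diverse scenarios.
Insight: The novelty is a heterogeneous dual-student architecture that combines CLIP and DINOv3 priors, together with explicit-implicit semantic co-guidance and global-local feature fusion. These components mitigate pseudo-label drift and improve semantic consistency, giving a stable solution for semi-supervised segmentation.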
Abstract: Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.
[104] 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds cs.CV [PDF]
Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, Gilles Puy
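TL;DR: This paper proposes LAM3C, a self-supervised learning framework that learns 3D representations from point clouds generated from unlabeled videos, without relying on real 3D scans. The authors build the RoomTours dataset with 49,219 scenes and propose a noise-regularized loss to stabilize learning. Experiments show that the method outperforms previous self-supervised methods on indoor semantic and instance segmentation.
Details
Motivation: Collecting large-scale 3D scene scans is expensive and labor-intensive. The paper explores whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensor.
Result: On indoor semantic and instance segmentation, LAM3C outperforms previous self-supervised methods, showing that higher performance is achievable without any real 3D scans.
Insight: Contributions include using video-generated point clouds as a source of self-supervised data, a noise-regularized loss that improves feature stability, and RoomTours, a large-scale dataset of video-generated point clouds. Together they offer an inexpensive and scalable route to 3D pre-training.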
Abstract: Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.
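The method name refers to Sinkhorn-Knopp, but the abstract does not describe how it is applied. The sketch below shows the generic Sinkhorn-Knopp iteration as it is commonly used in clustering-based self-supervised learning, where it turns point-to-prototype scores into soft assignments that use all clusters equally. The function name, the temperature `eps`, and the iteration count are assumptions for illustration, not LAM3C's settings.

```python
import math

def sinkhorn_knopp(scores, n_iters=50, eps=1.0):
    """Turn an N x K score matrix into soft assignments with balanced clusters."""
    n, k = len(scores), len(scores[0])
    q = [[math.exp(s / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize columns so each cluster receives total mass 1/K.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Normalize rows so each point carries total mass 1/N.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    # Rescale so each row is a probability distribution over clusters.
    return [[v * n for v in row] for row in q]
```

Each returned row sums to 1, and after convergence each column sums to about N/K, which prevents the degenerate solution where every point collapses into one cluster.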
[105] Video-BrowseComp: Benchmarking Agentic Video Research on Open Web cs.CV [PDF]
Zhengyang Liang, Yan Shu, Xiangrui Liu, Minghao Qin, Kaixin Liang
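TL;DR: This paper presents Video-BrowseComp, a benchmark of 210 questions designed for agentic video research on the open web. It evaluates a model's ability to actively explore video timelines, cross-reference dispersed evidence, and verify information against the open web.
Details
Motivation: Existing video benchmarks focus mainly on passive perception and cannot evaluate an agent's ability to conduct active, open-ended video research on the open web, leaving a significant modality gap.
Result: Evaluation of state-of-the-art models shows that even search-augmented models such as GPT-5.1 (w/ Search) reach only 15.24% accuracy and perform poorly in metadata-sparse, dynamic environments such as sports and gameplay.
Insight: The work is the first to move video benchmarks from passive perception toward proactive reasoning. It highlights the key role of visual grounding in open-web video research and exposes the limitation that current models rely heavily on textual proxies.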
Abstract: The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web’s most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
[106] MedSAM-based lung masking for multi-label chest X-ray classification cs.CV | cs.AI [PDF]
Brayden Miao, Zain Rehman, Xin Miao, Siming Liu, Jianjie Wang
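TL;DR: This paper proposes a multi-label chest X-ray classification pipeline built on MedSAM, a foundation model for medical image segmentation. MedSAM is fine-tuned to extract lung-region masks, which serve as an anatomical prior for deep convolutional networks that classify five pulmonary abnormalities (Mass, Nodule, Pneumonia, Edema, Fibrosis) and the normal case. Experiments show that the effect of masking depends on the task and the architecture: loose lung masking preserves overall abnormality classification performance while markedly improving screening of normal cases.
Details
Motivation: Automated chest X-ray interpretation is challenged by weak disease signals, dataset bias, and limited spatial supervision. The paper introduces an anatomically grounded prior from a foundation model to improve the robustness and interpretability of classification.
Result: Evaluated on a subset of the NIH CXR dataset, ResNet50 trained on original images achieves the strongest overall abnormality discrimination. Loose lung masking yields comparable macro AUROC but significantly improves discrimination of "No Finding" (normal cases), indicating a trade-off between abnormality-specific classification and normal case screening. Tight masking reduces abnormality-level performance but improves training efficiency.
Insight: The novelty is integrating MedSAM into the classification pipeline as a controllable spatial prior rather than applying it uniformly. The core insight is that lung masking should be treated as an adjustable prior whose tightness is chosen according to the backbone architecture and the clinical objective (abnormality detection versus normal screening). Loose masks partially mitigate the performance drop by preserving perihilar and peripheral context.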
Abstract: Chest X-ray (CXR) imaging is widely used for screening and diagnosing pulmonary abnormalities, yet automated interpretation remains challenging due to weak disease signals, dataset bias, and limited spatial supervision. Foundation models for medical image segmentation (MedSAM) provide an opportunity to introduce anatomically grounded priors that may improve robustness and interpretability in CXR analysis. We propose a segmentation-guided CXR classification pipeline that integrates MedSAM as a lung region extraction module prior to multi-label abnormality classification. MedSAM is fine-tuned using a public image-mask dataset from Airlangga University Hospital. We then apply it to a curated subset of the public NIH CXR dataset to train and evaluate deep convolutional neural networks for multi-label prediction of five abnormalities (Mass, Nodule, Pneumonia, Edema, and Fibrosis), with the normal case (No Finding) evaluated via a derived score. Experiments show that MedSAM produces anatomically plausible lung masks across diverse imaging conditions. We find that masking effects are both task-dependent and architecture-dependent. ResNet50 trained on original images achieves the strongest overall abnormality discrimination, while loose lung masking yields comparable macro AUROC but significantly improves No Finding discrimination, indicating a trade-off between abnormality-specific classification and normal case screening. Tight masking consistently reduces abnormality level performance but improves training efficiency. Loose masking partially mitigates this degradation by preserving perihilar and peripheral context. These results suggest that lung masking should be treated as a controllable spatial prior selected to match the backbone and clinical objective, rather than applied uniformly.
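The abstract does not say how the loose and tight masks differ in construction. One common way to loosen a segmentation mask is morphological dilation, sketched below in plain Python; treating the loose mask as a dilated tight mask is an assumption, and the function names and `radius` parameter are hypothetical.

```python
def dilate(mask, radius):
    """Grow a binary mask by `radius` pixels using a square neighborhood."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = 1
    return out

def apply_mask(image, mask, fill=0):
    """Keep pixels where the mask is set and replace the rest with `fill`."""
    return [[px if m else fill for px, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]
```

With `radius=0` the tight mask is returned unchanged, while larger radii keep more of the surrounding context, which is the kind of control the paper argues should be matched to the backbone and the clinical objective.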
[107] Domain-Shift Immunity in Deep Deformable Registration via Local Feature Representations cs.CV [PDF]
Mingzhen Shao, Sarang Joshi
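TL;DR: This paper introduces the UniReg framework, which decouples feature extraction from deformation estimation, to show that deep learning-based deformable image registration models have inherent immunity to domain shift. Their robustness comes from relying on local feature representations rather than global appearance.
Details
Motivation: To address the perceived sensitivity of learning-based registration models to domain shift and to explain the mechanism behind their robustness, instead of relying on large and diverse training data.
Result: Trained on a single dataset, UniReg shows robust performance on cross-domain and multi-modal registration tasks, comparable to optimization-based methods.
Insight: Local feature consistency is the key driver of robustness in learning-based deformable registration. Dataset-induced biases in early convolutional layers are the root cause of failures of conventional CNN models under modality shift. Designing backbones that preserve domain-invariant local features is an important direction.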
Abstract: Deep learning has advanced deformable image registration, surpassing traditional optimization-based methods in both accuracy and efficiency. However, learning-based models are widely believed to be sensitive to domain shift, with robustness typically pursued through large and diverse training datasets, without explaining the underlying mechanisms. In this work, we show that domain-shift immunity is an inherent property of deep deformable registration models, arising from their reliance on local feature representations rather than global appearance for deformation estimation. To isolate and validate this mechanism, we introduce UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network. Despite training on a single dataset, UniReg exhibits robust cross-domain and multi-modal performance comparable to optimization-based methods. Our analysis further reveals that failures of conventional CNN-based models under modality shift originate from dataset-induced biases in early convolutional layers. These findings identify local feature consistency as the key driver of robustness in learning-based deformable registration and motivate backbone designs that preserve domain-invariant local features.
[108] GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection cs.CV [PDF]
Jingyu Li, Xiaolong Zhao, Zhe Liu, Wenxiao Wu, Li Zhang
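TL;DR: This paper proposes GeoTeacher, a geometry-guided semi-supervised 3D object detection method. It uses geometric relation supervision and a voxel-wise data augmentation strategy to help the student model capture object geometry from limited labeled data, improving object perception and localization.
Details
Motivation: Existing semi-supervised 3D object detection methods usually rely on heterogeneous teacher models to provide pseudo labels or on feature consistency constraints. They overlook the model's low sensitivity to object geometry when labeled data is limited, which restricts the student's understanding of geometric relations, even though geometric information is crucial for object perception and localization.
Result: Extensive experiments on the ONCE and Waymo datasets show that the method is effective and generalizes well, reaching new state-of-the-art (SOTA) results. It can also be combined with different semi-supervised 3D detection methods to further improve their performance.
Insight: Contributions include a keypoint-based geometric relation supervision module that transfers the teacher's geometric knowledge to the student, and a voxel-wise data augmentation strategy with a distance-decay mechanism that increases the diversity of object geometries while preserving the integrity of distant objects. These designs strengthen semi-supervised learning from a geometric perspective and improve the model's understanding of 3D structure.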
Abstract: Semi-supervised 3D object detection, aiming to explore unlabeled data for boosting 3D object detectors, has emerged as an active research area in recent years. Some previous methods have shown substantial improvements by either employing heterogeneous teacher models to provide high-quality pseudo labels or enforcing feature-perspective consistency between the teacher and student networks. However, these methods overlook the fact that the model usually tends to exhibit low sensitivity to object geometries with limited labeled data, making it difficult to capture geometric information, which is crucial for enhancing the student model’s ability in object perception and localization. In this paper, we propose GeoTeacher to enhance the student model’s ability to capture geometric relations of objects with limited training data, especially unlabeled data. We design a keypoint-based geometric relation supervision module that transfers the teacher model’s knowledge of object geometry to the student, thereby improving the student’s capability in understanding geometric relations. Furthermore, we introduce a voxel-wise data augmentation strategy that increases the diversity of object geometries, thereby further improving the student model’s ability to comprehend geometric structures. To preserve the integrity of distant objects during augmentation, we incorporate a distance-decay mechanism into this strategy. Moreover, GeoTeacher can be combined with different SS3D methods to further improve their performance. Extensive experiments on the ONCE and Waymo datasets demonstrate the effectiveness and generalization of our method, which achieves new state-of-the-art results. Code will be available at https://github.com/SII-Whaleice/GeoTeacher
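The abstract names a distance-decay mechanism but gives no formula. As one possible reading, the sketch below drops voxels at random with a drop rate that decays exponentially with distance from the sensor, so sparse far-away objects are mostly left intact. The exponential form, the parameter names `base_drop` and `decay`, and their values are all assumptions.

```python
import math
import random

def keep_probability(distance, base_drop=0.5, decay=0.05):
    """Probability of keeping a voxel; the drop rate shrinks with distance."""
    return 1.0 - base_drop * math.exp(-decay * distance)

def augment_voxels(voxels, rng, base_drop=0.5, decay=0.05):
    """Randomly drop (x, y, z) voxels, dropping fewer as distance grows.

    `rng` is a random.Random instance so the augmentation is reproducible.
    """
    kept = []
    for x, y, z in voxels:
        d = math.sqrt(x * x + y * y + z * z)
        if rng.random() < keep_probability(d, base_drop, decay):
            kept.append((x, y, z))
    return kept
```

A voxel at the origin is kept with probability `1 - base_drop`, and the keep probability approaches 1 as distance increases, which matches the stated goal of preserving distant objects during augmentation.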
[109] REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation cs.CV [PDF]
Fulin Shi, Wenyi Xiao, Bin Chen, Liang Din, Leilei Gan
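TL;DR: This paper proposes REVEALER, a unified framework for fine-grained text-image alignment evaluation based on reinforcement-guided visual reasoning. It adopts a structured "grounding-reasoning-conclusion" paradigm that guides a multimodal large language model to explicitly localize semantic elements and reach interpretable alignment judgments. The model is optimized with Group Relative Policy Optimization using a composite reward that combines structural format, grounding accuracy, and alignment fidelity.
Details
Motivation: Most existing alignment evaluation methods for text-to-image models rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. A finer-grained, interpretable evaluation framework is therefore needed.
Result: Extensive experiments on four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) show that REVEALER achieves state-of-the-art performance, consistently outperforming strong proprietary models and supervised baselines while offering better inference efficiency than existing iterative visual reasoning methods.
Insight: The novelty is combining reinforcement learning with a structured visual reasoning paradigm: a composite reward is used to optimize a multimodal large language model for fine-grained, element-level alignment evaluation. This yields an interpretable and efficient evaluation pipeline and offers a new approach to automated text-image alignment evaluation.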
Abstract: Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured “grounding-reasoning-conclusion” paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
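The combination of a composite reward with GRPO can be sketched in a few lines. The three reward terms follow the abstract (structural format, grounding accuracy, alignment fidelity), but the weights, the use of IoU for grounding, and the function names are assumptions. The advantage computation shows the group-relative normalization that GRPO is known for: each sampled response is scored against the mean and standard deviation of its own group.

```python
import statistics

def composite_reward(format_ok, grounding_iou, alignment_correct,
                     weights=(0.2, 0.4, 0.4)):
    """Weighted sum of format, grounding, and alignment terms (weights are hypothetical)."""
    w_f, w_g, w_a = weights
    return (w_f * float(format_ok)
            + w_g * grounding_iou
            + w_a * float(alignment_correct))

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above their group's average get positive advantages and are reinforced, while below-average ones are discouraged. If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.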
[110] GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection cs.CV [PDF]
Yi Zhang, Yi Wang, Lei Yao, Lap-Pui Chau
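TL;DR: GVSynergy-Det is a new framework for 3D object detection from RGB images alone. It improves detection by jointly learning a continuous Gaussian representation and a discrete voxel representation: Gaussians capture fine surface detail, voxels provide structured spatial context, and a learnable integration mechanism fuses the two feature sets, achieving high-accuracy detection without depth or dense 3D geometry supervision.
Details
Motivation: Image-based 3D object detection faces a key dilemma: high-accuracy methods typically require dense 3D supervision, while methods without such supervision struggle to extract accurate geometry from images alone.
Result: State-of-the-art (SOTA) results on the challenging indoor benchmarks ScanNetV2 and ARKitScenes, significantly outperforming existing methods without any depth or dense 3D geometry supervision (such as point clouds or TSDF).
Insight: The core novelty is a Gaussian-voxel synergistic representation learning framework that combines the complementary strengths of continuous Gaussians (good at modeling fine-grained surface detail) and discrete voxels (structured spatial context). Through a cross-representation enhancement mechanism and learnable feature integration, it uses features from both representations directly for more accurate localization, avoiding the time-consuming per-scene optimization of earlier methods and their use of Gaussians only for depth regularization.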
Abstract: Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details while voxels provide structured spatial context. We introduce a dual-representation architecture that: 1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or utilize Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).
[111] GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation cs.CV [PDF]
Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu
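TL;DR: This paper proposes GaussianDWM, a new unified driving world model based on a 3D Gaussian scene representation that performs both 3D scene understanding and multi-modal scene generation. Its core novelty is embedding rich linguistic features into each Gaussian primitive to achieve early modality alignment, along with a task-aware language-guided sampling strategy and a dual-condition multi-modal generation model.
Details
Motivation: Existing driving world models lack 3D scene understanding: they can only generate content conditioned on input data and cannot interpret or reason about the driving environment. In addition, current methods that represent 3D spatial information with point clouds or BEV features cannot accurately align textual information with the underlying 3D scene.
Result: Comprehensive studies on the nuScenes and NuInteract datasets show that the method achieves state-of-the-art (SOTA) performance.
Insight: Main contributions: 1) a unified framework based on a 3D Gaussian representation that supports both understanding and generation; 2) early modality alignment by embedding linguistic features into Gaussian primitives; 3) a task-aware language-guided sampling strategy that injects accurate, compact 3D tokens into the LLM; 4) a dual-condition multi-modal generation model that combines a high-level language condition with a low-level image condition.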
Abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub https://github.com/dtc111111/GaussianDWM.
[112] Exploring Syn-to-Real Domain Adaptation for Military Target Detection cs.CV | cs.AI [PDF]
Jongoh Jeong, Youngjin Oh, Gyeongrae Nam, Jeongeun Lee, Kuk-Jin Yoon
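TL;DR: This paper proposes using Unreal Engine to generate photorealistic synthetic RGB data to address the scarcity of real data and the high cost of SAR data in military target detection, and it evaluates existing domain adaptation methods in a synthetic-to-real setting.
Details
Motivation: Military target detection must cope with varied real-world environments, and public datasets are lacking. SAR data is costly, while RGB cameras are inexpensive but RGB data is scarce, so synthetic data is needed to enable cross-domain adaptation.
Result: On the proposed synthetic-real dataset pair, domain adaptation methods that use minimal image-level supervision (such as the object class) achieve a substantial improvement over unsupervised or semi-supervised methods, although current methods still face open challenges.
Insight: The novelty is using Unreal Engine to generate photorealistic synthetic RGB data that compensates for missing military target data, and systematically evaluating domain adaptation methods with different degrees of supervision for synthetic-to-real transfer, which reveals the potential of minimally supervised methods.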
Abstract: Object detection is one of the key target tasks of interest in the context of civil and military applications. In particular, the real-world deployment of target detection methods is pivotal in the decision-making process during military command and reconnaissance. However, current domain adaptive object detection algorithms consider adapting one domain to another similar one only within the scope of natural or autonomous driving scenes. Since military domains often deal with a mixed variety of environments, detecting objects from multiple varying target domains poses a greater challenge. Several studies for armored military target detection have made use of synthetic aperture radar (SAR) data due to its all-weather robustness, long range, and high resolution. Nevertheless, the costs of SAR data acquisition and processing are still much higher than those of the conventional RGB camera, which is a more affordable alternative with significantly lower data processing time. Furthermore, the lack of military target detection datasets limits the use of such a low-cost approach. To mitigate these issues, we propose to generate RGB-based synthetic data using a photorealistic visual tool, Unreal Engine, for military target detection in a cross-domain setting. To this end, we conducted synthetic-to-real transfer experiments by training on our synthetic dataset and validating on our web-collected real military target datasets. We benchmark the state-of-the-art domain adaptation methods distinguished by the degree of supervision on our proposed train-val dataset pair, and find that current methods using minimal hints on the image (e.g., object class) achieve a substantial improvement over unsupervised or semi-supervised DA methods. From these observations, we recognize the current challenges that remain to be overcome.
[113] MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios? cs.CV [PDF]
Shiqi Dai, Zizhi Ma, Zhicong Luo, Xuesong Yang, Yibin Huang
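TL;DR: This paper presents MM-UAVBench, a comprehensive benchmark for evaluating the perception, cognition, and planning abilities of multimodal large language models (MLLMs) in low-altitude UAV scenarios. It contains 19 sub-tasks and more than 5.7K manually annotated questions, with data drawn from real UAV datasets. Extensive experiments on 16 open-source and proprietary MLLMs show that existing models struggle with the complex visual and cognitive demands of low-altitude scenarios and reveal key bottlenecks such as spatial bias and multi-view understanding.
Details
Motivation: Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations focus on specific tasks such as localization or navigation and lack a unified assessment of MLLMs' general intelligence. The paper aims to fill this gap by systematically evaluating the potential of MLLMs in low-altitude UAV applications.
Result: Extensive experiments on 16 MLLMs with MM-UAVBench show that current models have difficulty adapting to the complex demands of low-altitude scenarios and expose their performance bottlenecks.
Insight: The novelty is the first comprehensive benchmark that systematically evaluates MLLMs across multiple capability dimensions (perception, cognition, planning) in low-altitude UAV scenarios. Viewed objectively, its real-world data and multi-dimensional task design provide useful methodology and insight for evaluating and improving the robustness and reliability of MLLMs in vertical domains such as UAVs. The identified bottlenecks in spatial bias and multi-view understanding can guide future model optimization.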
Abstract: While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs’ general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks such as spatial bias and multi-view understanding that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.
[114] Bridging Your Imagination with Audio-Video Generation via a Unified Director cs.CV | cs.MMPDF
Jiaxu Zhang, Tianshu Hu, Yuan Zhang, Zenan Li, Linjie Luo
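TL;DR: This paper proposes UniMAGE, a unified director model that brings script drafting and key-shot design, two tasks usually handled separately, into a single framework. Built on a Mixture-of-Transformers architecture and trained with a "first interleaving, then disentangling" paradigm, it generates logically coherent video scripts and visually consistent keyframe images, enabling non-experts to create long-context, multi-shot films with existing audio-video generation models.
Details
Motivation: Existing AI video creation systems usually treat script drafting (which relies on large language models) and key-shot design (which relies on image generation models) as two separate tasks. The authors argue that logical reasoning and imaginative thinking are both fundamental qualities of a film director, so the two tasks should be unified in a single framework.
Result: Extensive experiments show that UniMAGE achieves state-of-the-art performance among open-source models and generates logically coherent video scripts and visually consistent keyframe images.
Insight: The novelty lies in UniMAGE, a unified director model that uses a Mixture-of-Transformers architecture to unify text and image generation, together with a two-phase training paradigm. Interleaved Concept Learning first uses interleaved text-image data to deepen the model's understanding and imaginative interpretation of scripts; Disentangled Expert Learning then decouples script writing from keyframe generation for greater flexibility. The paradigm is intended to strengthen narrative logic and keyframe consistency.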
Abstract: Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model’s deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.
[115] Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network cs.CV | cs.AIPDF
Dongsheng Li, Chaobo Chen, Siling Wang, Song Gao
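TL;DR: This paper proposes PEG-DRNet, a physics-edge hybrid gas dynamic routing network for detecting faint, semitransparent gas-leak plumes in infrared images. The method models gas diffusion and convection with a Gas Block module, extracts reliable edge priors with the adaptive gradient and phase edge operator (AGPEO), and selectively propagates features across scales with the content-adaptive sparse routing path aggregation network (CASR-PAN), maintaining high detection accuracy while substantially improving computational efficiency.
Details
Motivation: Infrared gas leak detection is critical for environmental monitoring and industrial safety, but it is very challenging because plumes are faint, small, and semitransparent, with blurry boundaries. Existing methods struggle to capture such weak-contrast targets.
Result: On the IIG dataset, PEG-DRNet achieves an overall AP of 29.8%, an AP50 of 84.3%, and a small-object AP of 25.3%, exceeding the RT-DETR-R18 baseline by 3.0%, 6.5%, and 5.3%, respectively, while requiring only 43.7 Gflops and 14.9 M parameters. On the IIG and LangGas datasets, its AP and AP50 surpass those of existing CNN and Transformer detectors, giving the best balance of accuracy and computational efficiency.
Insight: The main contributions are: 1) a physics-inspired Gas Block unit whose local and large-kernel branches model short-range variation and long-range propagation of gas, respectively, with an edge-gated fusion module that balances local detail and global context; 2) the AGPEO operator, which computes reliable edge priors from multi-directional gradients and phase-consistent responses, followed by a multi-scale edge perception module (MSEPM) that produces hierarchical edge features to reinforce boundaries; 3) the CASR-PAN network, which adaptively modulates and selectively propagates informative features across scales based on edge and content cues, improving cross-scale discriminability while reducing redundancy. These ideas offer a reusable combination of physical modeling and adaptive feature routing for weak-contrast, small-object detection.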
Abstract: Detecting infrared gas leaks is critical for environmental monitoring and industrial safety, yet remains difficult because plumes are faint, small, semitransparent, and have weak, diffuse boundaries. We present the physics-edge hybrid gas dynamic routing network (PEG-DRNet). First, we introduce the Gas Block, a diffusion-convection unit modeling gas transport: a local branch captures short-range variations, while a large-kernel branch captures long-range propagation. An edge-gated learnable fusion module balances local detail and global context, strengthening weak-contrast plume and contour cues. Second, we propose the adaptive gradient and phase edge operator (AGPEO), computing reliable edge priors from multi-directional gradients and phase-consistent responses. These are transformed by a multi-scale edge perception module (MSEPM) into hierarchical edge features that reinforce boundaries. Finally, the content-adaptive sparse routing path aggregation network (CASR-PAN), with adaptive information modulation modules for fusion and self, selectively propagates informative features across scales based on edge and content cues, improving cross-scale discriminability while reducing redundancy. Experiments on the IIG dataset show that PEG-DRNet achieves an overall AP of 29.8%, an AP$_{50}$ of 84.3%, and a small-object AP of 25.3%, surpassing the RT-DETR-R18 baseline by 3.0%, 6.5%, and 5.3%, respectively, while requiring only 43.7 Gflops and 14.9 M parameters. The proposed PEG-DRNet achieves superior overall performance with the best balance of accuracy and computational efficiency, outperforming existing CNN and Transformer detectors in AP and AP$_{50}$ on the IIG and LangGas datasets.
[116] Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism cs.CVPDF
Siyu Zhang, Ying Chen, Lianlei Shan, Runhe Qiu
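TL;DR: This paper proposes a vision-language model (VLM) framework that combines a Dynamic Resolution Input Strategy (DRIS) with a Multi-scale Vision-Language Alignment Mechanism (MS-VLAM) to improve both the semantic understanding accuracy and the computational efficiency of multimodal fusion for remote sensing images.
Details
Motivation: Fixed input resolutions cannot balance efficiency against detail, and single-scale alignment lacks a semantic hierarchy. The work targets these shortcomings in order to overcome the limits of single-source data and improve the accuracy of surface information extraction.
Result: Experiments on the RS-GPT4V dataset show that the framework outperforms conventional methods on image captioning (for example BLEU-4 and CIDEr) and cross-modal retrieval (for example R@10), reaching SOTA performance.
Insight: DRIS improves efficiency by adaptively allocating computation in a coarse-to-fine manner, while MS-VLAM strengthens cross-modal semantic consistency through alignment at three levels (object, local region, and global). Together they offer a new approach to building efficient and robust multimodal remote sensing systems.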
Abstract: Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM). Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity imbalance. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.
[117] ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing cs.CV | cs.AIPDF
Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang
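TL;DR: This paper proposes ViLaCD-R1, a two-stage vision-language framework for semantic change detection in remote sensing. Using a Multi-Image Reasoner and a Mask-Guided Decoder trained with supervised fine-tuning and reinforcement learning, it improves the recognition and localization of semantic change regions and effectively suppresses non-semantic interference.
Details
Motivation: Traditional remote sensing change detection methods, based on pixel-level operators or encoder-decoder networks, struggle to capture high-level semantics and are vulnerable to non-semantic perturbations. Existing multimodal or VLM-based methods improve semantic understanding but still suffer from inaccurate spatial localization, imprecise pixel boundaries, and limited interpretability.
Result: In comprehensive evaluations on multiple remote sensing change detection benchmarks, ViLaCD-R1 substantially improves the recognition and localization of true semantic changes, robustly suppresses non-semantic changes, and achieves state-of-the-art accuracy in complex real-world scenes.
Insight: The contributions are: 1) a two-stage framework in which the VLM is trained on block-level dual-temporal inference tasks to produce a coarse change mask; 2) a Mask-Guided Decoder that combines dual-temporal image features with the coarse mask to predict a precise binary change map; 3) optimization with supervised fine-tuning and reinforcement learning, which improves semantic understanding and localization accuracy.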
Abstract: Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
[118] Contour Information Aware 2D Gaussian Splatting for Image Representation cs.CVPDF
Masaya Takabe, Hiroshi Watanabe, Sujun Hong, Tomohiro Ikai, Zheming Fan
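TL;DR: This paper proposes a contour information-aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation, addressing the blurry boundaries that existing 2DGS methods produce when few Gaussians are available. During rasterization, each Gaussian is constrained to a specific segmentation region, which prevents cross-boundary blending and preserves edge structure under high compression.
Details
Motivation: Existing 2D Gaussian Splatting methods lack contour awareness and often produce blurry or indistinct boundaries when the number of Gaussians is small. This work introduces object segmentation priors to improve edge reconstruction quality.
Result: Experiments on synthetic color charts and the DAVIS dataset show higher reconstruction quality around object edges than existing 2DGS methods, with the largest gains when very few Gaussians are used, while fast rendering and low memory usage are retained.
Insight: The novelty lies in integrating segmentation constraints into the Gaussian Splatting framework to add contour awareness, together with a warm-up scheme that stabilizes training. Viewed objectively, the region constraint effectively improves edge fidelity under compression and offers a new direction for lightweight image representation.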
Abstract: Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real-time decoding, they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross-boundary blending and preserves edge structures under high compression. We also introduce a warm-up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.
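The region constraint described in this abstract can be illustrated with a toy rasterizer. The sketch below is not the authors' implementation: it assumes isotropic, grayscale Gaussians and a hypothetical `region` field that assigns each Gaussian to one segment label. It only shows the core idea that a Gaussian contributes nothing to pixels outside its own segment.

```python
import math

def render(gaussians, seg, width, height):
    """Accumulate 2D Gaussians; each one only touches pixels in its own segment.

    gaussians: list of dicts with hypothetical keys cx, cy, sigma, color, region.
    seg: height x width grid of integer segment labels (the segmentation prior).
    """
    image = [[0.0] * width for _ in range(height)]
    for g in gaussians:
        for y in range(height):
            for x in range(width):
                if seg[y][x] != g["region"]:
                    continue  # region constraint: no cross-boundary blending
                d2 = (x - g["cx"]) ** 2 + (y - g["cy"]) ** 2
                image[y][x] += g["color"] * math.exp(-d2 / (2.0 * g["sigma"] ** 2))
    return image

# One-row image split into two segments at x = 4.
# A single wide Gaussian sits right next to the boundary, inside segment 0.
W, H = 8, 1
seg = [[0 if x < 4 else 1 for x in range(W)]]
gaussians = [{"cx": 3, "cy": 0, "sigma": 2.0, "color": 1.0, "region": 0}]
img = render(gaussians, seg, W, H)
```

Without the `continue` line the Gaussian would bleed into segment 1 and soften the edge at x = 4; with it, pixels on the far side of the boundary stay at zero even though the Gaussian is wide.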
[119] Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization cs.CVPDF
Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian
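TL;DR: This paper proposes CEM, a plug-and-play fidelity-optimization plugin that accelerates Diffusion Transformer (DiT) inference through cumulative error minimization. CEM predefines an error that characterizes the model's sensitivity to acceleration and uses a dynamic programming algorithm to optimize the caching strategy, substantially improving generation fidelity without any training. The method is model-agnostic, adapts to arbitrary acceleration budgets, and integrates seamlessly into existing error correction frameworks and quantized models.
Details
Motivation: DiT dominates image and video generation, but its iterative denoising process makes inference slow and limits wider use. Caching-based methods provide training-free acceleration but introduce considerable computational error, and existing error correction strategies rely on a fixed caching strategy that cannot adapt to the complex error variations during denoising, which limits how much they can correct.
Result: Extensive experiments on nine generation models and quantized methods across three tasks show that CEM significantly improves the generation fidelity of existing acceleration models and outperforms the original generation performance on FLUX.1-dev, PixArt-α, StableDiffusion1.5, and Hunyuan.
Insight: The novelty lies in using a predefined error to jointly characterize how timesteps and cache intervals affect the model's sensitivity to acceleration, and in optimizing the strategy with a dynamic programming algorithm based on cumulative error approximation, which minimizes caching error. The method is model-agnostic, generalizes well, and integrates seamlessly without additional computational overhead.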
Abstract: Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration but suffer from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the model's sensitivity to acceleration, which is jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves the caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-α, StableDiffusion1.5 and Hunyuan. The code will be made publicly available.
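To make the dynamic programming idea concrete, here is a minimal sketch of choosing which denoising steps to compute in full under a fixed budget. It is an illustration under stated assumptions, not the CEM algorithm itself: the error table `err[t][k]`, the additive accumulation of errors, the toy sensitivity vector `w`, and the function name `plan_cache_schedule` are all hypothetical.

```python
def plan_cache_schedule(err, num_steps, budget, max_interval):
    """Pick full-compute steps that minimize the summed (assumed additive) error.

    err[t][k]: assumed error of computing at step t and reusing that result
               so that the interval covers k steps in total (k = 1 means no reuse).
    budget:    maximum number of full computations allowed.
    Returns (sorted list of full-compute steps, total error).
    """
    INF = float("inf")
    # dp[t][j]: least error for covering steps 0..t-1 with exactly j full computes.
    dp = [[INF] * (budget + 1) for _ in range(num_steps + 1)]
    back = [[None] * (budget + 1) for _ in range(num_steps + 1)]
    dp[0][0] = 0.0
    for t in range(num_steps):
        for j in range(budget):
            if dp[t][j] == INF:
                continue
            for k in range(1, max_interval + 1):
                if t + k > num_steps:
                    break
                cand = dp[t][j] + err[t][k]
                if cand < dp[t + k][j + 1]:
                    dp[t + k][j + 1] = cand
                    back[t + k][j + 1] = t
    best_j = min(range(budget + 1), key=lambda j: dp[num_steps][j])
    if dp[num_steps][best_j] == INF:
        raise ValueError("budget and max_interval cannot cover all steps")
    schedule, t, j = [], num_steps, best_j
    while t > 0:  # walk the back-pointers to recover the chosen steps
        t = back[t][j]
        schedule.append(t)
        j -= 1
    return schedule[::-1], dp[num_steps][best_j]

# Toy sensitivity per step (hypothetical): steps 1 and 4 are costly to approximate.
w = [1, 5, 1, 1, 5, 1]
num_steps, budget, max_interval = 6, 3, 3
err = [[0.0] * (max_interval + 1) for _ in range(num_steps)]
for t in range(num_steps):
    for k in range(1, max_interval + 1):
        if t + k <= num_steps:
            # Reusing a cache i steps after it was computed costs i * w[step].
            err[t][k] = float(sum(i * w[t + i] for i in range(1, k)))

schedule, total = plan_cache_schedule(err, num_steps, budget, max_interval)
print(schedule, total)
```

On this toy table the planner spends its three full computations on step 0 and on the two sensitive steps, which a fixed every-other-step schedule would not do. A real error table would have to be measured per model, as the abstract's notion of a predefined error suggests.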
[120] Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition cs.CVPDF
Arman Martirosyan, Shahane Tigranyan, Maria Razzhivina, Artak Aslanyan, Nazgul Salikhova
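TL;DR: This paper presents two multimodal frameworks for the iMiGUE dataset, one for micro-gesture recognition and one for behavior-based emotion prediction. For micro-gesture classification, RGB video and 3D pose representations are fused to capture subtle spatio-temporal patterns; for emotion recognition, facial and contextual embeddings are fused. The method took 2nd place in the behavior-based emotion prediction task of the MiGA 2025 Challenge.
Details
Motivation: Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behavior in video and skeletal pose data.
Result: Experiments on the iMiGUE dataset show robust and accurate performance on behavior-based emotion prediction, where the method placed 2nd in the MiGA 2025 Challenge.
Insight: The novelty lies in a dedicated multimodal fusion module for each task (Cross-Modal Token Fusion and InterFusion), which effectively integrates complementary visual modalities such as RGB, pose, face, and context to better represent subtle behavior.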
Abstract: Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To comprehensively represent gestures, video and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. Then, they are integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and contextual embeddings extracted using SwinFace and MViTv2-S models and fuse them through an InterFusion module designed to capture emotional expressions and body gestures. Experiments conducted on the iMiGUE dataset, within the scope of the MiGA 2025 Challenge, demonstrate the robust performance and accuracy of our method in the behavior-based emotion prediction task, where our approach secured 2nd place.
[121] MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images cs.CV | cs.AIPDF
Md. Sazzadul Islam Prottasha, Nabil Walid Rafi
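TL;DR: This study compares the open-source model MedGemma with the proprietary model GPT-4 on zero-shot disease classification from medical images. MedGemma-4b-it fine-tuned with LoRA achieves a markedly higher mean test accuracy than GPT-4 and shows higher sensitivity in high-stakes clinical tasks such as cancer and pneumonia detection.
Details
Motivation: To compare the performance of open-source and proprietary multimodal large language models on medical image disease classification, and to examine how much domain-specific fine-tuning matters for improving diagnostic accuracy and reducing hallucinations.
Result: Across six disease classification tasks, MedGemma-4b-it reaches a mean test accuracy of 80.37%, compared with 69.58% for the untuned GPT-4, and MedGemma is more sensitive in high-stakes tasks such as cancer and pneumonia detection.
Insight: Domain-specific fine-tuning (for example with LoRA) is essential for improving the accuracy and reliability of multimodal LLMs in clinical tasks. With targeted optimization, an open-source model can outperform a general-purpose proprietary model in specialized medical settings.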
Abstract: Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
[122] CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation cs.CVPDF
Ke Niu, Haiyang Yu, Zhuofan Chen, Zhengtao Yao, Weitao Jia
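TL;DR: This paper proposes Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD), a new paradigm for automatically generating high-precision, editable CAD code from sketches. Training proceeds in two stages, Multi-Expert Fine-Tuning and Multi-Expert Reinforcement Learning, which combine the complementary strengths of different models to produce accurate, constraint-compatible, and fully editable CAD models. The authors also build CADExpert, an open-source benchmark of 17,299 instances that provides orthographic projections, precise dimension annotations, expert-generated chain-of-thought processes, executable CADQuery code, and rendered 3D models.
Details
Motivation: In automating traditional CAD modeling, existing methods that reconstruct 3D models from sketches produce non-editable, imprecise models, and their reliance on heavily hand-annotated text or image inputs limits scalability and industrial applicability.
Result: The paper runs experiments on the proposed CADExpert benchmark, but the abstract does not state specific quantitative results (such as accuracy) or comparisons with existing SOTA models.
Insight: The novelty lies in the heterogeneous collaborative multi-expert reinforcement learning paradigm, whose two-stage training encourages collaborative learning among models so that the generated CAD code meets the strict requirements of industrial design, and in the richly annotated open-source benchmark CADExpert, which should help advance research in this area. Viewed objectively, combining multi-expert collaboration with reinforcement learning for CAD code generation, backed by a high-quality dataset, is a promising research direction.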
Abstract: Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods that reconstruct 3D models from sketches often produce non-editable and approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model’s ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.
[123] Visual Language Hypothesis cs.CV | cs.LGPDF
Xiu Li
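TL;DR: This paper studies visual representation learning from a structural and topological perspective and proposes the visual language hypothesis: visual understanding requires a semantic language in which many perceptual observations correspond to a small number of discrete semantic states. From this hypothesis the author derives that the visual observation space has a fiber-bundle structure and draws two theoretical consequences. First, the semantic quotient space cannot be obtained by smooth deformation and requires a non-homeomorphic, discriminative target. Second, semantic abstraction requires an architecture that supports topology change, namely an "expand-and-snap" process that first expands the geometry and then collapses it into discrete semantic regions.
Details
Motivation: To understand visual representation learning from a structural and topological perspective, to ask whether visual understanding presupposes a semantic language, and to derive from that hypothesis how the visual observation space is organized and what this implies for learning mechanisms.
Result: The paper reports no experiments or benchmark results. It presents a theoretical framework that is consistent with empirical regularities observed in large-scale discriminative and multimodal models and with classical principles of statistical learning theory.
Insight: The novelty lies in the visual language hypothesis and the fiber-bundle structure of the visual observation space derived from it. The framework stresses that semantic abstraction requires non-smooth topology change and an external semantic target, and it gives a topological explanation for existing models such as discriminative models and multimodal alignment models.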
Abstract: We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber-bundle-like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation alone; semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross-instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.
[124] CountGD++: Generalized Prompting for Open-World Counting cs.CVPDF
Niki Amini-Naieni, Andrew Zisserman
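TL;DR: CountGD++ is a generalized prompting method for open-world counting. It extends how prompts can be specified (describing what not to count, automatically annotating pseudo-exemplars, and accepting exemplars from natural and synthetic external images) to make object counting more flexible and accurate, and it can be integrated with an LLM as a vision expert agent.
Details
Motivation: Existing counting methods are limited in how the target object can be specified: visual exemplars must be annotated manually, and there is no way to specify what not to count, which restricts flexibility and accuracy.
Result: Across multiple datasets, CountGD++ significantly improves accuracy, efficiency, and generalization, achieving SOTA performance in open-world counting.
Insight: The contributions include extending the prompt so that objects not to be counted can be described with text or visual exemplars, introducing pseudo-exemplars that are annotated automatically, supporting exemplars from external images, and using the counting model as a vision agent for an LLM, which strengthens multimodal interaction.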
Abstract: The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.
[125] SpatialMosaic: A Multiview VLM Dataset for Partial Visibility cs.CVPDF
Kanghee Lee, Injae Lee, Minseok Kwak, Kwonyoung Ryu, Jungi Hong
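TL;DR: This paper presents SpatialMosaic, a multi-view vision-language model (VLM) dataset and benchmark for partial-visibility scenarios, which uses large-scale instruction-tuning data to strengthen spatial reasoning in realistic, complex conditions such as occlusion and low overlap.
Details
Motivation: Existing MLLMs mostly rely on pre-constructed 3D representations or reconstruction pipelines, which limits scalability and real-world applicability, and spatial reasoning under realistic challenges such as partial visibility and occlusion remains under-explored.
Result: The authors build the SpatialMosaic dataset with 2M QA pairs and the SpatialMosaic-Bench benchmark with 1M QA pairs across 6 tasks. Experiments show that they effectively improve spatial reasoning under challenging multi-view conditions.
Insight: The contributions include a scalable multi-view data generation and annotation pipeline, a benchmark focused on realistic challenging scenarios, and SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models into VLMs as geometry encoders for more robust spatial reasoning.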
Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.
[126] MGCA-Net: Multi-Graph Contextual Attention Network for Two-View Correspondence Learning cs.CVPDF
Shuyuan Lin, Mengtin Lo, Haosheng Chen, Yanjie Liang, Qiangqiang Wu
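TL;DR: This paper proposes MGCA-Net, a Multi-Graph Contextual Attention Network that addresses weak local geometric modeling and insufficient cross-stage information optimization in two-view correspondence learning. The network consists of a Contextual Geometric Attention module and a Cross-Stage Multi-Graph Consensus module, which aim to capture the geometric constraints of matched pairs more accurately and improve robustness.
Details
Motivation: Existing methods are limited in local geometric modeling and cross-stage information optimization, which makes it hard to capture the geometric constraints of matched pairs accurately and reduces robustness in tasks such as camera pose estimation.
Result: Experiments on two representative datasets, YFCC100M and SUN3D, show that MGCA-Net significantly outperforms existing SOTA methods on outlier rejection and camera pose estimation.
Insight: The Contextual Geometric Attention module adaptively fuses spatial position and feature information to capture local and global geometric relationships, and the Cross-Stage Multi-Graph Consensus module uses a sparse graph network to keep geometric information consistent across stages. Together they strengthen the geometric modeling ability of correspondence learning.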
Abstract: Two-view correspondence learning is a key task in computer vision, which aims to establish reliable matching relationships for applications such as camera pose estimation and 3D reconstruction. However, existing methods have limitations in local geometric modeling and cross-stage information optimization, which make it difficult to accurately capture the geometric constraints of matched pairs and thus reduce the robustness of the model. To address these challenges, we propose a Multi-Graph Contextual Attention Network (MGCA-Net), which consists of a Contextual Geometric Attention (CGA) module and a Cross-Stage Multi-Graph Consensus (CSMGC) module. Specifically, CGA dynamically integrates spatial position and feature information via an adaptive attention mechanism and enhances the capability to capture both local and global geometric relationships. Meanwhile, CSMGC establishes geometric consensus via a cross-stage sparse graph network, ensuring the consistency of geometric information across different stages. Experimental results on two representative datasets, YFCC100M and SUN3D, show that MGCA-Net significantly outperforms existing SOTA methods in the outlier rejection and camera pose estimation tasks. Source code is available at http://www.linshuyuan.com.
[127] SoulX-LiveTalk Technical Report cs.CV | cs.AIPDF
Le Shen, Qiao Qian, Tan Yu, Ke Zhou, Tianhang Yu
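TL;DR: SoulX-LiveTalk is a 14B-parameter large-scale diffusion framework that addresses the conflict between computational load and latency constraints in real-time, infinite-duration, audio-driven avatar generation. A Self-correcting Bidirectional Distillation strategy retains bidirectional attention within video chunks to improve motion coherence and visual detail, and a Multi-step Retrospective Self-Correction Mechanism keeps infinite generation stable. Combined with a full-stack inference acceleration suite, the system achieves sub-second start-up latency and real-time throughput of 32 FPS.
Details
Motivation: To meet real-time requirements, existing methods usually adopt strictly unidirectional attention or reduce model capacity, which lowers visual fidelity. This work addresses the fundamental conflict between the computational load of large diffusion models and the strict latency requirements of real-time streaming generation.
Result: In extensive evaluations, SoulX-LiveTalk is the first 14B-scale system to achieve sub-second start-up latency (0.87 s) and real-time throughput of 32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.
Insight: The novelty lies in the Self-correcting Bidirectional Distillation strategy, which keeps bidirectional attention within video chunks to preserve spatiotemporal correlations, and the Multi-step Retrospective Self-Correction Mechanism, which recovers from accumulated errors and prevents collapse. Viewed objectively, the full-stack inference acceleration suite (hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations) is essential to the reported performance and serves as an engineering example for real-time deployment of large models.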
Abstract: Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-LiveTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.
[128] SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance Segmentation cs.CVPDF
Xiaolan Li, Wanquan Liu, Pengcheng Li, Pengyu Jie, Chenqiang Gao
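TL;DR: SOFTooth is a semantics-enhanced, order-aware 2D-3D fusion framework for 3D tooth instance segmentation. It draws on semantics from a frozen 2D SAM model: a residual gating module injects occlusal-view SAM embeddings to refine boundaries, a center-guided mask refinement reduces center drift, and an order-aware Hungarian matching strategy incorporates anatomical tooth order. Together these yield robust segmentation under crowding, ambiguous boundaries, and missing teeth.
Details
Motivation: 3D tooth instance segmentation is difficult because of crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare but important classes such as third molars. Existing pure 3D methods are prone to boundary leakage, center drift, and inconsistent identities, while 2D foundation models such as SAM provide strong semantics but are hard to apply directly in 3D clinical workflows.
Result: On the 3DTeethSeg’22 benchmark, SOFTooth achieves state-of-the-art (SOTA) overall accuracy and mean IoU, with clear gains on cases involving third molars.
Insight: The novelty lies in a fusion framework that strengthens 3D segmentation with frozen 2D semantics (SAM) and needs no 2D mask supervision. Specifically, a point-wise residual gating module injects 2D semantics to refine boundaries; a center-guided mask refinement keeps masks consistent with geometric centroids; and an order-aware Hungarian matching combines anatomical order and center distance to keep labels coherent. This offers an efficient route for transferring the rich semantics of 2D foundation models to dense 3D prediction tasks.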
Abstract: Three-dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary-aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine tooth-gingiva and inter-tooth boundaries. Second, a center-guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order-aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity-based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg’22, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning.
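The order-aware assignment can be pictured as a matching cost that adds an anatomical-order penalty to the center distance. The sketch below is an illustration under assumptions, not the paper's formulation: the cost form, the `order_weight` parameter, and the integer tooth indices are hypothetical, and a brute-force search over permutations stands in for the Hungarian algorithm (both return a minimum-cost assignment; brute force is only practical for tiny inputs).

```python
import math
from itertools import permutations

def order_aware_match(pred_centers, gt_centers, pred_order, gt_order, order_weight=1.0):
    """Return assignment[i] = index of the ground-truth tooth matched to prediction i.

    Cost of pairing i with j = center distance + order_weight * |order gap|.
    Brute-force search replaces the Hungarian algorithm in this toy example.
    """
    n = len(pred_centers)

    def cost(i, j):
        dist = math.dist(pred_centers[i], gt_centers[j])
        order_gap = abs(pred_order[i] - gt_order[j])
        return dist + order_weight * order_gap

    best = min(permutations(range(n)),
               key=lambda p: sum(cost(i, p[i]) for i in range(n)))
    return list(best)

# Two crowded neighbouring teeth: predicted centers lie between the true
# centers, so distance alone prefers the swapped (wrong) pairing.
gt_centers = [(1.0, 0.0), (1.2, 0.0)]
gt_order = [3, 4]                      # hypothetical tooth indices along the arch
pred_centers = [(1.12, 0.0), (1.08, 0.0)]
pred_order = [3, 4]

distance_only = order_aware_match(pred_centers, gt_centers, pred_order, gt_order,
                                  order_weight=0.0)
with_order = order_aware_match(pred_centers, gt_centers, pred_order, gt_order,
                               order_weight=1.0)
print(distance_only, with_order)
```

With the order term switched off the two teeth swap identities; with it on, the anatomically consistent pairing wins. For realistic numbers of instances, a polynomial-time solver such as SciPy's `linear_sum_assignment` would be the usual choice.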
[129] Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment cs.CVPDF
Henglin Liu, Nisha Huang, Chang Liu, Jiangpeng Yan, Huijuan Huang
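TL;DR: This paper proposes ArtQuant, a new framework for aesthetic quality assessment of artistic images. It tackles data scarcity and imbalance by building RAD, a large-scale, multi-dimensional structured dataset, and uses LLM decoders to jointly generate aesthetic descriptions that couple otherwise isolated aesthetic dimensions, narrowing the cognitive gap between artistic images and aesthetic judgment.
Details
Motivation: Aesthetic quality assessment spans visual perception, cognition, and emotion, which leads to scarce and imbalanced data, and existing models (such as multi-branch encoders or contrastive learning methods) handle long textual descriptions poorly.
Result: The method achieves state-of-the-art performance on several datasets while needing only 33% of the conventional number of training epochs, combining efficient training with strong assessment quality.
Insight: The contributions include building the large-scale, multi-dimensional aesthetic description dataset RAD at low cost through an iterative pipeline, and designing the ArtQuant framework, which uses LLM decoders to jointly generate descriptions and model long-text semantics effectively. A theoretical analysis shows that data and model together minimize prediction entropy, giving the framework a mathematical grounding.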
Abstract: The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing datasets focus too heavily on visual perception and neglect deeper dimensions because manual annotation is expensive; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoders, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline that avoids heavy annotation costs and is easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD’s semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.
[130] DriveLaW: Unifying Planning and Video Generation in a Latent Driving World cs.CVPDF
Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong
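TL;DR: DriveLaW proposes a new paradigm that unifies video generation and motion planning. By injecting the video generator's latent representation directly into the planner, it keeps high-fidelity future scene generation and reliable trajectory planning inherently consistent.
Details
Motivation: Current world models for autonomous driving typically treat scene prediction and motion planning as decoupled processes, which limits them. This work aims to bridge that gap and unify the two.
Result: On video prediction, DriveLaW significantly surpasses the best-performing prior work, improving FID by 33.3% and FVD by 1.8%. It also sets a new record on the NAVSIM planning benchmark, reaching a new state of the art.
Insight: The core contribution is a unified architecture that tightly couples video generation (DriveLaW-Video) with a diffusion planner (DriveLaW-Act) through latent representations, optimized with a three-stage progressive training strategy so that prediction and planning stay consistent.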
Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
[131] Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision cs.CVPDF
Dohyun Kim, Seungwoo Lyu, Seung Wook Kim, Paul Hongsuck Seo
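TL;DR: This paper proposes Direct Diffusion Score Preference Optimization (DDSPO), which provides dense supervision at every timestep of the denoising trajectory by contrasting winning and losing policies. The goal is to improve the fine-grained alignment of diffusion models with user intent, and their visual quality, in text-to-image generation. The method does not rely on costly human-labeled data; it generates preference signals automatically with a pretrained reference model.
Details
Motivation: Existing preference-based training methods such as Diffusion Direct Preference Optimization improve the alignment of diffusion outputs with user intent and their aesthetic consistency, but they depend on costly and potentially noisy human-labeled datasets. This work aims to develop an effective form of supervision that needs neither explicit reward modeling nor manual annotation.
Result: Experiments show that DDSPO improves both text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. The abstract does not name specific benchmarks, though it implies comparisons against existing methods.
Insight: The novelty lies in deriving the supervision signal for each timestep directly from the available policies, which gives dense, transition-level signals across the denoising trajectory rather than signals based only on final samples. Viewed objectively, the method produces label-free preference signals by using a pretrained reference model to contrast outputs for original prompts against outputs for semantically degraded variants, which reduces data dependence and cost.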
Abstract: Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: https://dohyun-as.github.io/DDSPO
[132] CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models cs.CV | cs.AIPDF
Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen
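TL;DR: This paper proposes CoFi-Dec, a training-free decoding framework that aims to reduce hallucinations in Large Vision-Language Models (LVLMs). It generates intermediate textual responses under coarse-to-fine visual conditioning, converts them into synthetic images with a text-to-image model to form multi-level visual hypotheses, and then unifies the predictive distributions from the different visual conditions with a Wasserstein-distance-based fusion mechanism, producing outputs that are more faithful to the visual input.
Details
Motivation: LVLMs have made notable progress in multimodal understanding and generation, but they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. This work addresses the problem by introducing generative self-feedback and coarse-to-fine visual conditioning to mitigate hallucinations.
Result: Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations and outperforms existing decoding strategies.
Insight: Inspired by the human visual process (from global scene perception to detailed inspection), the work designs a coarse-to-fine visual conditioning, generation, and fusion mechanism together with a Wasserstein-distance-based distribution alignment method. The result is training-free, model-agnostic, hallucination-resistant decoding with more faithful and robust outputs.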
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at https://github.com/AI-Researcher-Team/CoFi-Dec.
[133] Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin cs.CVPDF
Kayathri Vigneswaran, Hugo Retief, Jai Clifford Holmes, Mariangel Garcia Andarcia, Hansaka Tennakoon
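TL;DR: This study proposes a hybrid framework for automatically reading river gauge plates. The framework integrates vision-based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT-4o and Gemini 2.0 Flash). Through stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction, it automates gauge plate reading in the Limpopo River Basin with high accuracy.
Details
Motivation: Traditional hydrological observation suffers from manual measurement errors and environmental constraints, while accurate, continuous water-level monitoring is essential for flood forecasting, water resource management, and ecological protection. This work aims to develop an automated, scalable solution to these problems.
Result: Waterline detection reaches a precision of 94.24% and an F1 score of 83.64%. Under optimal image conditions and with scale gap metadata included, the Gemini Stage 2 model performs best, with a mean absolute error of 5.43 cm, a root mean square error of 8.58 cm, and an R-squared of 0.84. The results underline how strongly image quality affects LLM performance and how important it is to include geometric metadata.
Insight: The novelty is the combination of conventional computer-vision object detection (YOLOv8 for waterline and scale extraction) with generative multimodal large models (GPT-4o/Gemini), using geometric metadata (scale gaps) to improve LLM predictions. This yields a robust automated hydrological monitoring framework and a workable route to real-time digitization of water levels.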
Abstract: Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision-based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT-4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved high precision of 94.24% and an F1 score of 83.64%, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R-squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real-time river gauge digitization and improved water resource management.
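For readers less familiar with the metrics quoted in this abstract, the following is a minimal Python sketch of how MAE, RMSE, and R-squared are conventionally computed, plus the recall implied by a given precision and F1 score. The gauge readings below are hypothetical values chosen for illustration; they are not data from the paper, and the function names are my own.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def recall_from_precision_f1(precision, f1):
    """Invert F1 = 2PR / (P + R) to get R = F1 * P / (2P - F1)."""
    return f1 * precision / (2.0 * precision - f1)

# Hypothetical gauge readings in cm (illustrative only, not from the paper).
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 205.0, 245.0]

toy_mae = mae(observed, predicted)
toy_rmse = rmse(observed, predicted)
toy_r2 = r_squared(observed, predicted)

# Recall implied by the reported precision and F1, assuming both figures
# were measured on the same evaluation set.
implied_recall = recall_from_precision_f1(0.9424, 0.8364)
```

Under that assumption, the reported precision and F1 correspond to a recall of roughly 75%, which suggests the waterline detector misses more waterlines than it falsely reports.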
[134] HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation cs.CV | cs.AI | cs.GRPDF
Yuxin Wen, Qing Shuai, Di Kang, Jing Li, Cheng Wen
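TL;DR: HY-Motion 1.0 is a family of large-scale models for generating 3D human motion from text, and the first to scale Diffusion Transformer (DiT)-based flow matching models to the billion-parameter level in this domain. It adopts a full-stage training paradigm comprising large-scale pretraining, high-quality fine-tuning, and reinforcement learning from human feedback and reward models, achieving precise adherence to text instructions and high-quality motion generation.
Details
Motivation: Addresses the limitations of current text-to-motion models in scale, instruction following, and motion quality, with the goal of moving 3D human motion generation models toward commercial maturity.
Result: On text-to-motion generation, its instruction-following ability significantly exceeds that of current open-source baselines, and it achieves the broadest motion coverage, spanning more than 200 motion categories across 6 major classes.
Insight: The contributions are the first successful scaling of DiT-based flow matching models to the billion-parameter level, a full-stage training paradigm that combines large-scale pretraining, high-quality fine-tuning, and reinforcement learning, and a rigorous data processing pipeline. Together these deliver improvements across the board.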
Abstract: We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm – including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models – to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.
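As background on the flow matching objective mentioned above, here is a minimal sketch of one training step's loss. It assumes the common linear interpolation path between a noise sample and a data sample; the abstract does not state which path or parameterization HY-Motion 1.0 uses, so this is an assumption, and `predict_velocity` is a hypothetical stand-in for the DiT.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t) is a hypothetical model callable; a real
    text-to-motion model would also take the text condition as input.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                          # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

# Usage with a trivial predictor that always outputs zero velocity.
rng = random.Random(0)
motion_frame = [0.5, -1.0, 2.0]               # hypothetical pose features
loss = flow_matching_loss(lambda x_t, t: [0.0] * len(x_t), motion_frame, rng)
```

At inference time, a model trained this way is typically sampled by integrating the predicted velocity from noise at t=0 to data at t=1.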
[135] TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding cs.CVPDF
Zongsheng Cao, Yangfan He, Anran Liu, Feng Chen, Zepeng Wang
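TL;DR: The paper proposes TV-RAG, a training-free architecture that improves long-video retrieval and understanding by combining temporal alignment with entropy-guided semantics. It consists of a time-decay retrieval module and an entropy-weighted key-frame sampler, and targets two weaknesses of existing large video language models on long videos: narrow temporal windows and poor capture of semantic change.
Details
Motivation: Existing large video language models have narrow temporal windows on long videos and struggle to capture fine-grained semantic changes over long durations. Mainstream text-based retrieval pipelines rely mainly on surface-level lexical overlap and ignore the rich temporal interdependence among visual, audio, and subtitle channels.
Result: TV-RAG consistently surpasses most leading baselines on long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, which supports its effectiveness.
Insight: The contributions are: (1) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, so that text queries are ranked according to their true multimedia context; and (2) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while staying representative. The framework can be attached to any large video language model without retraining or fine-tuning, offering a lightweight, cost-effective upgrade path.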
Abstract: Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and (ii) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at https://github.com/AI-Researcher-Team/TV-RAG.
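The two mechanisms named in this abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: an exponential decay on the temporal offset, Shannon entropy over a per-frame intensity histogram as the "information density" measure, and one pick per equal-length segment to keep frames roughly evenly spaced. None of the names come from the TV-RAG code base.

```python
import math

def time_decay_score(similarity, chunk_time, reference_time, decay_rate=0.01):
    """Down-weight a text-similarity score by its temporal offset (in seconds)."""
    return similarity * math.exp(-decay_rate * abs(chunk_time - reference_time))

def shannon_entropy(histogram):
    """Entropy in bits of a histogram given as non-negative counts."""
    total = sum(histogram)
    probs = [count / total for count in histogram if count > 0]
    return -sum(p * math.log2(p) for p in probs)

def sample_keyframes(frame_histograms, num_segments):
    """Return the index of the highest-entropy frame in each equal-length segment."""
    n = len(frame_histograms)
    picks = []
    for s in range(num_segments):
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        if start == end:          # more segments than frames: skip empty spans
            continue
        best = max(range(start, end),
                   key=lambda i: shannon_entropy(frame_histograms[i]))
        picks.append(best)
    return picks

# Usage: four toy frames with 2-bin histograms, split into two segments.
toy_histograms = [[4, 0], [2, 2], [3, 1], [4, 0]]
keyframes = sample_keyframes(toy_histograms, num_segments=2)
```

Segmenting first and ranking by entropy second is one simple way to trade off even spacing against information density; a weighted sampling scheme would be another.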
[136] PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis cs.CV | cs.AIPDF
Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang
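TL;DR: This paper proposes PathFound, an agentic multimodal model that supports evidence-seeking inference in pathological diagnosis. It integrates pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning, and it actively acquires information and refines the diagnosis across three stages: initial diagnosis, evidence seeking, and final decision.
Details
Motivation: Most current pathology foundation models rely on a static inference paradigm that processes whole-slide images once to produce predictions. They cannot reassess or acquire targeted evidence when a diagnosis is ambiguous, which is at odds with clinical workflows that refine hypotheses through repeated observation and further examinations.
Result: Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating that evidence-seeking workflows are effective in computational pathology. PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and shows strong potential for discovering subtle details such as nuclear features and local invasions.
Insight: The novelty is bringing agentic reasoning into pathological diagnosis. The model mimics the clinical workflow with dynamic evidence seeking and refines the diagnosis through multi-stage active information acquisition rather than a single static prediction, giving computational pathology a framework closer to real clinical decision-making.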
Abstract: Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.
[137] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature cs.CV | cs.AIPDF
Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang
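TL;DR: This paper presents RxnBench, a multi-tiered benchmark for evaluating how well multimodal large language models understand chemical reactions in scientific literature PDFs. It comprises two tasks, single-figure QA and full-document QA, which test fine-grained visual perception, mechanistic reasoning, and cross-modal information integration. The evaluation finds that existing models extract explicit text well but fall short on deep chemical logic and precise structural recognition, underscoring the need for domain-specific visual encoders and stronger reasoning engines.
Details
Motivation: Multimodal large language models hold great promise for chemistry, but their ability to understand the dense, graphical language of chemical reactions in real scientific literature has not been fully explored. A dedicated benchmark is therefore needed to evaluate this ability rigorously.
Result: Evaluation on RxnBench reveals a critical capability gap in existing MLLMs: models are good at extracting explicit text but perform poorly on deep chemical logic and precise structural recognition. Models with inference-time reasoning significantly outperform standard architectures, yet on the full-document QA task no model reaches 50% accuracy.
Insight: The novelty is a chemical reaction understanding benchmark that starts from real scientific literature PDFs and includes multi-tiered tasks, exposing a core weakness of current MLLMs in scientific vision-language understanding. Viewed objectively, the work highlights the importance of developing visual encoders specialized for a scientific domain such as chemistry (for parsing reaction schemes and figures) and of integrating stronger symbolic and logical reasoning engines, pointing to key research directions for building autonomous AI chemists.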
Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.
[138] ThinkGen: Generalized Thinking for Visual Generation cs.CVPDF
Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou
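TL;DR: ThinkGen is a visual generation framework built on chain-of-thought reasoning. It uses a decoupled architecture of a multimodal large language model and a Diffusion Transformer, together with a separable GRPO training paradigm, to achieve generalized, high-quality image generation across a variety of generation scenarios.
Details
Motivation: Chain-of-thought reasoning performs well on complex understanding tasks, but its application to generation tasks is still immature and is constrained by scenario-specific mechanisms that limit generalization. This work aims to extend chain-of-thought reasoning systematically to visual generation.
Result: ThinkGen achieves robust, state-of-the-art performance on multiple generation benchmarks.
Insight: The novelty is the first general visual generation framework that explicitly uses MLLM chain-of-thought reasoning. Its decoupled architecture and separable GRPO training paradigm allow joint training across datasets, enabling effective adaptation and generalization to a wide range of generation scenarios.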
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM’s CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen
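For context on the GRPO-based training mentioned above, the sketch below shows the group-relative advantage that GRPO-style methods compute from a group of sampled outputs, followed by a toy alternation schedule between the two modules. The abstract does not specify how SepGRPO alternates or normalizes, so the schedule and the choice of population standard deviation are assumptions for illustration only.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Uses the population standard deviation; implementations differ on
    population vs. sample std, so treat this choice as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def alternating_schedule(num_rounds, modules=("mllm", "dit")):
    """Hypothetical schedule: update one module per round, in turn."""
    return [modules[i % len(modules)] for i in range(num_rounds)]

# Usage: rewards for four images sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
schedule = alternating_schedule(4)
```

Because advantages are computed within a group, no separate value network is needed; outputs are only judged relative to their siblings for the same prompt.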
[139] ProGuard: Towards Proactive Multimodal Safeguard cs.CVPDF
Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao
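TL;DR: This paper proposes ProGuard, a proactive safeguard against multimodal safety risks. Using a balanced dataset and reinforcement learning, it aims to identify and describe out-of-distribution (OOD) safety risks without the model adjustments that traditional reactive approaches require.
Details
Motivation: The rapid development of generative models keeps producing new multimodal safety risks, and existing defenses have limitations. A safeguard is needed that can proactively identify and describe previously unseen safety risks.
Result: ProGuard performs on par with closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. It improves OOD risk detection by 52.6% and OOD risk description by 64.8%.
Insight: The contributions include a modality-balanced dataset that mitigates bias, training purely through reinforcement learning for efficient reasoning, and a synonym-bank-based similarity reward that improves descriptions of unseen risk categories. Together they offer a proactive approach to multimodal safety.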
Abstract: The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
[140] LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation cs.CVPDF
Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern
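TL;DR: This paper presents LiveTalk, a real-time multimodal interactive video generation system. An improved on-policy distillation method sharply reduces the inference cost and latency of video diffusion models, enabling real-time interactive video generation conditioned on text, image, and audio.
Details
Motivation: Existing diffusion models cannot support real-time interaction because they rely on iterative denoising and bidirectional attention. Existing distillation methods mainly target text-to-video generation and suffer from visual artifacts and quality degradation under multimodal conditioning. A video generation method is therefore needed that handles multimodal conditions and supports real-time interaction.
Result: On multimodal-conditioned (audio, image, and text) avatar video generation benchmarks including HDTF, AVSpeech, and CelebV-HQ, the distilled model matches or exceeds the visual quality of full-step bidirectional baselines while cutting inference cost and latency by 20x. On a curated multi-turn interaction benchmark, the LiveTalk system outperforms state-of-the-art models such as Sora2 and Veo3 in video coherence and content quality, and reduces response latency from 1 to 2 minutes down to real-time generation.
Insight: The novelty is an improved on-policy distillation recipe that focuses on the quality of the condition inputs and on the initialization and schedule of the on-policy optimization, in order to resolve visual artifacts under multimodal conditioning. At the system level, it integrates audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to achieve real-time multimodal interaction.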
Abstract: Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
[141] Same or Not? Enhancing Visual Perception in Vision-Language Models cs.CVPDF
Damiano Marsili, Aditya Mehta, Ryan Y. Lin, Georgia Gkioxari
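TL;DR: This paper introduces the TWIN dataset and the FGVQA benchmark to improve the fine-grained visual perception of vision-language models (VLMs). TWIN contains 561,000 image-pair queries that ask a model to decide whether two visually similar images depict the same object, which encourages attention to subtle visual cues. Fine-tuning on TWIN gives VLMs notable gains in fine-grained recognition on unseen domains such as art, animals, plants, and landmarks, without hurting performance on general VQA benchmarks.
Details
Motivation: Existing VLMs are strong at broad visual understanding but are coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora emphasize general recognition over fine-grained perception, which reinforces this limitation.
Result: On the proposed FGVQA benchmark (12,000 queries), existing VLMs perform poorly. After fine-tuning on TWIN they improve by up to 19.3%, with no loss on general VQA benchmarks.
Insight: The novelty is TWIN, a large-scale dataset focused on fine-grained visual similarity judgments, and FGVQA, the corresponding evaluation benchmark. Viewed objectively, emphasizing attention to subtle visual cues effectively improves the perceptual precision of VLMs, and data scale is key to the gains. TWIN can serve as a drop-in addition to open-source VLM training corpora.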
Abstract: Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition (“Is it a cat or a dog?”) over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/
[142] Detection Fire in Camera RGB-NIR cs.CVPDF
Nguyen Truong Khai, Luong Duc Vinh
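TL;DR: This paper proposes an approach to improving fire detection accuracy with infrared night vision cameras. It contributes an additional near-infrared (NIR) dataset, a two-stage detection model that combines YOLOv11 with EfficientNetV2-B0, and a Patched-YOLO model for RGB images. The aim is to stop bright artificial lights from being mistaken for flames at night and to improve detection of small, distant targets.
Details
Motivation: Existing fire detection models (such as YOLOv7, RT-DETR, and YOLOv9) are limited by how their datasets were constructed. In night-time scenes in particular, they often misclassify bright artificial lights as fire, leading to a high false-alarm rate. Night-time fire detection therefore needs higher accuracy and fewer false positives.
Result: The proposed two-stage method achieves higher detection accuracy than previous methods for night-time fire detection, but the abstract gives no specific quantitative results (such as mAP values) or explicit benchmark comparisons for it.
Insight: The contributions are: (1) several data augmentation strategies applied to the NIR and classification datasets to ease data scarcity; (2) a two-stage detection pipeline in which YOLOv11 performs initial detection and an EfficientNetV2-B0 classifier then reduces false positives; and (3) Patched-YOLO, which uses patch-based processing to improve detection of small, distant fire targets in RGB images.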
Abstract: Improving the accuracy of fire detection using infrared night vision cameras remains a challenging task. Previous studies have reported strong performance with popular detection models. For example, YOLOv7 achieved an mAP50-95 of 0.51 using an input image size of 640 x 1280, RT-DETR reached an mAP50-95 of 0.65 with an image size of 640 x 640, and YOLOv9 obtained an mAP50-95 of 0.598 at the same resolution. Despite these results, limitations in dataset construction continue to cause issues, particularly the frequent misclassification of bright artificial lights as fire. This report presents three main contributions: an additional NIR dataset, a two-stage detection model, and Patched-YOLO. First, to address data scarcity, we explore and apply various data augmentation strategies for both the NIR dataset and the classification dataset. Second, to improve night-time fire detection accuracy while reducing false positives caused by artificial lights, we propose a two-stage pipeline combining YOLOv11 and EfficientNetV2-B0. The proposed approach achieves higher detection accuracy compared to previous methods, particularly for night-time fire detection. Third, to improve fire detection in RGB images, especially for small and distant objects, we introduce Patched-YOLO, which enhances the model’s detection capability through patch-based processing. Further details of these contributions are discussed in the following sections.
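The patch-based processing and the two-stage filtering described above can be sketched in a few lines. The abstract does not give patch size, overlap, or thresholds, so the defaults below are assumptions, and `crop_is_fire` is a hypothetical stand-in for a second-stage classifier such as EfficientNetV2-B0; no real detector API is used.

```python
def tile_origins(length, patch, stride):
    """Start offsets so that patches of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)   # last patch sits flush with the border
    return origins

def make_patches(width, height, patch=640, overlap=0.2):
    """Overlapping patch rectangles (x1, y1, x2, y2) covering the image."""
    stride = max(1, int(patch * (1.0 - overlap)))
    return [(x, y, min(x + patch, width), min(y + patch, height))
            for y in tile_origins(height, patch, stride)
            for x in tile_origins(width, patch, stride)]

def to_full_image(box, patch_origin):
    """Shift a box detected inside a patch back to full-image coordinates."""
    x1, y1, x2, y2 = box
    ox, oy = patch_origin
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)

def two_stage_filter(detections, crop_is_fire, threshold=0.5):
    """Keep first-stage detections that the second-stage classifier confirms."""
    return [d for d in detections if crop_is_fire(d) >= threshold]

# Usage on a 1280 x 640 frame.
patches = make_patches(1280, 640)
shifted = to_full_image((10, 20, 30, 40), patch_origin=(512, 0))
```

A complete pipeline would also merge duplicate boxes from overlapping patches, for example with non-maximum suppression, which is omitted here.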
[143] Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging cs.CV | cs.IRPDF
Janani Annur Thiruvengadam, Kiran Mayee Nabigaru, Anusha Kovi
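TL;DR: This paper proposes a Scalable Residual Feature Aggregation (SRFA) framework for early detection of pancreatic tumors in multimodal CT imaging. The framework combines MAGRes-UNet segmentation, DenseNet-121 feature extraction, HHO-BA hybrid metaheuristic feature selection, and a hybrid classifier that integrates Vision Transformer and EfficientNet-B3, with hyperparameters fine-tuned by a dual SSA and GWO optimization mechanism to improve robustness and generalization.
Details
Motivation: Addresses the clinical difficulty of detecting pancreatic tumors early. Because tumors show low contrast on CT scans and patient anatomy varies widely, an effective and scalable system is needed that makes subtle visual cues more salient and generalizes well across multimodal imaging data.
Result: In experiments, the proposed model reaches 96.23% accuracy, a 95.58% F1 score, and 94.83% specificity, significantly outperforming traditional CNNs and current Transformer-based models.
Insight: The contributions include the scalable residual feature aggregation framework, the hybrid metaheuristic (HHO-BA) feature selection strategy, the hybrid classifier combining Vision Transformer and EfficientNet-B3, and the dual SSA and GWO optimization mechanism for hyperparameter fine-tuning. Together these improve the model's robustness and detection accuracy.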
Abstract: Early detection of pancreatic neoplasms is a major clinical challenge, largely because tumors tend to appear with low-contrast margins on CT scans and because anatomy varies widely across patients. Addressing these difficulties requires an effective, scalable system that enhances the salience of subtle visual cues and generalizes well across multimodal imaging data. This study proposes a Scalable Residual Feature Aggregation (SRFA) framework to meet these requirements. The framework combines a preprocessing pipeline with segmentation by MAGRes-UNet, which makes pancreatic structures more visible and isolates regions of interest. DenseNet-121 with residual feature storage extracts features, allowing deep hierarchical features to be aggregated without loss of information. A hybrid HHO-BA metaheuristic feature selection strategy is then used to refine the feature subset. For classification, the system uses a new hybrid model that combines the global attention of the Vision Transformer (ViT) with the representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO fine-tunes the hyperparameters to improve robustness and reduce overfitting. Experimental results show a significant improvement in performance: the proposed model reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity, outperforming traditional CNNs and contemporary transformer-based models. These results highlight the potential of the SRFA framework as a useful tool for the early detection of pancreatic tumors.
[144] Memorization in 3D Shape Generation: An Empirical Study cs.CV | cs.LGPDF
Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, Zhuang Liu
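TL;DR: This paper designs an evaluation framework to quantify memorization in 3D generative models and studies how different data and modeling choices affect it. Memorization depends on data modality and increases with data diversity and finer-grained conditioning. On the modeling side, it peaks at a moderate guidance scale and can be mitigated with longer vector sets and simple rotation augmentation.
Details
Motivation: Generative models are widely used in 3D vision to synthesize new shapes, but it is unclear whether their outputs rely on memorized training shapes. Understanding memorization helps prevent training data leakage and improves the diversity of generated results.
Result: The framework is first applied to quantify memorization in existing methods. Controlled experiments with a latent vector-set diffusion model then show that data diversity and fine-grained conditioning increase memorization, and that memorization peaks at a moderate guidance scale. Longer vector sets and rotation augmentation reduce memorization without degrading generation quality.
Insight: The novelty is an evaluation framework that quantifies memorization in 3D generative models, together with an empirical analysis of how data and modeling factors affect it. The proposed mitigations, longer vector sets and rotation augmentation, are simple and effective, and offer practical guidance for improving the diversity and safety of 3D generative models.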
Abstract: Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at https://github.com/zlab-princeton/3d_mem.
[145] Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception cs.CVPDF
Xiaoyu Li, Peidong Li, Xian Wu, Long Shi, Dedong Liu
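TL;DR: This paper proposes HAT, a spatio-temporal alignment module that improves temporal modeling for end-to-end 3D perception in autonomous driving. HAT adaptively decodes the optimal alignment from multiple hypotheses, addressing the suboptimal alignment that results when existing methods rely on a single explicit physical model and on semantic features.
Details
Motivation: Existing methods typically align objects across frames with attention and a single, unified explicit physical model (such as constant velocity). This ignores differences in motion state and object features across categories and frames, and leads to poor alignment.
Result: On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. Paired with the DETR3D detector, it achieves state-of-the-art tracking with 46.0% AMOTA on the test set. In an object-centric end-to-end autonomous driving method, HAT improves perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), HAT's motion modeling makes perception and planning in end-to-end autonomous driving more robust.
Insight: The core contribution is a spatio-temporal alignment module that needs no direct supervision and adaptively decodes the optimal alignment hypothesis. It uses multiple explicit motion models to generate spatial anchors and motion-aware feature proposals, then performs multi-hypothesis decoding by fusing the semantic and motion cues in cached object queries, which handles complex spatio-temporal variation more flexibly and robustly.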
Abstract: Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on the attention mechanism to align objects across frames, simplifying the motion model with a unified explicit physical model (constant velocity, etc.). These approaches prefer semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), the enhancement of motion modeling by HAT enables more robust perception and planning in the E2E AD.
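To illustrate the idea of generating spatial anchors from multiple explicit motion models, here is a minimal sketch. It is not the authors' implementation: the abstract names only constant velocity as an example, so the static and constant-acceleration hypotheses, the 2D state, and the function name `motion_anchors` are assumptions made for illustration. HAT's learned multi-hypothesis decoder, which selects among such proposals, is not shown.

```python
from typing import Dict, Tuple

Vec = Tuple[float, float]


def motion_anchors(pos: Vec, vel: Vec, acc: Vec, dt: float) -> Dict[str, Vec]:
    """Propagate one historical object centre to the target frame
    under several explicit motion hypotheses (illustrative set)."""
    x, y = pos
    vx, vy = vel
    ax, ay = acc
    return {
        # Hypothesis 1: the object did not move.
        "static": (x, y),
        # Hypothesis 2: constant velocity (the example named in the abstract).
        "constant_velocity": (x + vx * dt, y + vy * dt),
        # Hypothesis 3: constant acceleration (assumed here for illustration).
        "constant_acceleration": (
            x + vx * dt + 0.5 * ax * dt * dt,
            y + vy * dt + 0.5 * ay * dt * dt,
        ),
    }


anchors = motion_anchors(pos=(0.0, 0.0), vel=(2.0, 0.0), acc=(0.0, 2.0), dt=0.5)
```

Each hypothesis yields a different candidate location for the same object, which is what makes a per-object choice among them meaningful.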
[146] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding cs.CVPDF
Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu
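TL;DR: This paper proposes OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools for fine-grained audio-visual understanding, addressing the weaknesses of existing omnimodal LLMs in cross-modal alignment and fine-grained understanding.
Details
Motivation: Existing omnimodal LLMs have made progress in unifying audio and visual modalities but lack fine-grained cross-modal understanding and struggle with multimodal alignment, so an active perception mechanism is needed to improve accuracy.
Result: Extensive experiments on three audio-video understanding benchmarks show that OmniAgent achieves SOTA performance, surpassing leading open-source and proprietary models by 10% to 20% accuracy.
Insight: The contribution is a shift from passive response generation to active multimodal inquiry: dynamic planning that autonomously invokes tools on demand, and a coarse-to-fine audio-guided perception paradigm that uses audio cues to localize temporal events and guide subsequent reasoning.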
Abstract: Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack the fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.
[147] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation cs.CVPDF
Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li
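TL;DR: This paper proposes DKT, which repurposes a video diffusion model's intrinsic understanding of transparency for depth and normal estimation of transparent objects. Trained on TransPhy3D, a synthetic video dataset of transparent/reflective scenes, it learns video-to-video translation from RGB video to depth (and normals) via lightweight LoRA adapters, yielding temporally consistent predictions for arbitrary-length input videos.
Details
Motivation: Refraction, reflection, and transmission in transparent objects break the basic assumptions of stereo, time-of-flight, and purely discriminative monocular depth estimation, causing holes and temporally unstable estimates. The authors observe that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules, and explore how to use this generative prior to improve perception of transparent objects.
Result: On real and synthetic video benchmarks involving transparency (ClearPose, DREDS, and TransPhy3D-Test), DKT achieves zero-shot SOTA performance. It outperforms strong image/video baselines in accuracy and temporal consistency, and its normal variant achieves the best video normal estimation results on ClearPose. A compact 1.3B version runs at about 0.17 s/frame. Integrated into a grasping system, DKT's depth estimates raise grasp success rates on translucent, reflective, and diffuse surfaces.
Insight: The core contribution is repurposing a large pretrained video diffusion model for transparent-object perception: lightweight LoRA adaptation and a co-training strategy exploit the optical knowledge the model has internalized, efficiently and without additional labels. This supports the broader claim that "diffusion knows transparency" and shows that generative video priors can be repurposed into robust, temporally coherent perception for real-world manipulation.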
Abstract: Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT’s depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: “Diffusion knows transparency.” Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
[148] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion cs.CVPDF
Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu
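TL;DR: This paper proposes Stream-DiffVSR, a low-latency, streamable video super-resolution method built on a causally conditioned diffusion model. Combining a four-step distilled denoiser, an auto-regressive temporal guidance module, and a lightweight temporal-aware decoder, it performs efficient online inference using only past frames and substantially reduces the latency of diffusion models for VSR.
Details
Motivation: Existing diffusion-based VSR methods rely on future frames and expensive multi-step denoising, which makes their latency too high for latency-sensitive online settings.
Result: The method processes a 720p frame in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA method TMP, it improves perceptual quality (LPIPS +0.095) while reducing latency by over 130x, the lowest latency reported for diffusion-based VSR.
Insight: The contribution is a complete causally conditioned diffusion framework: a distilled denoiser for fast inference, an auto-regressive temporal guidance module that injects motion-aligned cues during latent denoising, and a lightweight decoder that enhances detail and temporal coherence. It adapts high-perceptual-quality diffusion models to strict online low-latency settings and is described as the first diffusion VSR method suitable for low-latency online deployment.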
Abstract: Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/
cs.RO [Back]
[149] VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models cs.RO | cs.CVPDF
Borong Zhang, Jiahao Li, Jiachen Shen, Yishuai Cai, Yuhao Zhang
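TL;DR: This paper introduces VLA-Arena, an open-source benchmarking framework for evaluating vision-language-action models. Its structured task design quantifies difficulty along three orthogonal axes (task structure, language command, and visual observation) and comprises 170 tasks, aiming to systematically measure models' capability frontiers and failure modes.
Details
Motivation: VLAs are rapidly advancing toward generalist robot policies, but their limits and failure modes are hard to understand quantitatively, so a comprehensive benchmark is needed to evaluate them systematically.
Result: Extensive evaluation of several state-of-the-art VLAs reveals key limitations: a tendency toward memorization over generalization, asymmetric robustness, disregard for safety constraints, and an inability to compose skills for long-horizon tasks.
Insight: The contribution is an orthogonal three-axis structured task design framework that enables fine-grained control of difficulty and decoupled analysis, allowing precise measurement of capability frontiers. It also provides a complete end-to-end toolchain and datasets for reproducible research.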
Abstract: While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena’s 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at https://vla-arena.github.io.
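The three evaluation axes described above (difficulty levels L0-L2, language perturbations W0-W4, visual perturbations V0-V4) can be pictured as a grid of settings per task. The sketch below only enumerates that grid. Whether VLA-Arena evaluates the full cross product is not stated in the abstract, and the function and field names are hypothetical, not part of the VLA-Arena toolchain.

```python
from itertools import product

# Axis labels as given in the abstract.
LEVELS = [f"L{i}" for i in range(3)]    # task difficulty L0-L2
LANGUAGE = [f"W{i}" for i in range(5)]  # language perturbation W0-W4
VISUAL = [f"V{i}" for i in range(5)]    # visual perturbation V0-V4


def evaluation_grid():
    """Enumerate every (level, language, visual) setting for one task.
    Assumes, for illustration, that all combinations are valid."""
    return [
        {"level": lvl, "language": w, "visual": v}
        for lvl, w, v in product(LEVELS, LANGUAGE, VISUAL)
    ]


grid = evaluation_grid()
```

Under this assumption each task has 3 x 5 x 5 = 75 settings, and holding two axes fixed while varying the third gives the decoupled robustness analysis the abstract describes.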
[150] SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling cs.RO | cs.CVPDF
Yufan He, Pengfei Guo, Mengya Xu, Zhaoshuo Li, Andriy Myronenko
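TL;DR: The paper proposes SurgWorld, a world model for surgical physical AI that addresses data scarcity in surgical robotics by generating synthetic surgical videos and inferring pseudo-kinematics. The authors curate the Surgical Action Text Alignment (SATA) dataset and build SurgWorld on an advanced world model and SATA to generate diverse, generalizable, realistic surgical videos. An inverse dynamics model infers pseudo-kinematics from the synthetic videos, producing paired video-action data for training vision-language-action (VLA) policies. Experiments show that a surgical VLA policy trained with the augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform.
Details
Motivation: Surgical robotics faces a fundamental data-scarcity barrier, especially the lack of paired datasets with both visual observations and accurate robot kinematics. The abundant surgical videos lack action labels, which prevents direct application of imitation learning or VLA training.
Result: On a real surgical robot platform, a surgical VLA policy trained with the augmented data (synthetic videos and inferred pseudo-kinematics) significantly outperforms models trained only on real demonstrations.
Insight: The contributions are the SATA dataset built specifically for surgical robots, the SurgWorld world model for generating synthetic surgical videos, and what the authors describe as the first use of an inverse dynamics model to infer pseudo-kinematics from synthetic videos to create paired data. This offers a path to scalable autonomous surgical skill acquisition from unlabeled surgical video and generative world modeling.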
Abstract: Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large scale vision language action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video action data from diverse domains, surgical robotics suffers from the paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset with detailed action description specifically for surgical robots. Then we built SurgWorld based on the most advanced physical AI world model and SATA. It’s able to generate diverse, generalizable and realistic surgery videos. We are also the first to use an inverse dynamics model to infer pseudokinematics from synthetic surgical videos, producing synthetic paired video action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data efficient surgical robot policies.
[151] RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion cs.RO | cs.CVPDF
Zhe Li, Cheng Chi, Yangyang Wei, Boan Zhu, Tao Huang
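TL;DR: This paper proposes RoboMirror, described as the first retargeting-free framework for direct video-to-humanoid locomotion control. It uses vision-language models (VLMs) to extract visual motion intents from egocentric or third-person videos, which directly drive a diffusion-based policy to generate physically plausible, semantically aligned locomotion, bridging the gap between visual understanding and control.
Details
Motivation: Existing humanoid locomotion systems rely on curated motion-capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based methods perform only mechanical pose mimicry without genuine visual understanding.
Result: Extensive experiments validate RoboMirror: it enables telepresence via egocentric videos, reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines.
Insight: The core contribution is the "understand before you imitate" paradigm: VLMs distill raw video into visual motion intents that directly condition a diffusion policy, avoiding explicit pose reconstruction or retargeting and yielding more natural, semantically accurate humanoid locomotion control.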
Abstract: Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying “understand before you imitate”. Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.
cs.DC [Back]
[152] SlimEdge: Lightweight Distributed DNN Deployment on Constrained Hardware cs.DC | cs.CVPDF
Mahadev Sunil Kumar, Arnab Raha, Debayan Das, Gopakumar G, Amitava Mukherjee
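TL;DR: This paper proposes SlimEdge, a lightweight approach to deploying distributed deep neural networks, targeting the large parameter counts and computational demands of running large DNNs on resource-constrained edge devices. It combines structured model pruning with multi-objective optimization to tailor network capacity to heterogeneous device constraints, and is validated on a Multi-View Convolutional Neural Network (MVCNN).
Details
Motivation: Distributed deep networks are hard to deploy on edge devices with limited compute and storage. A deployment must satisfy hardware limits while preserving task performance.
Result: The resulting models satisfy user-specified bounds on accuracy and memory footprint while reducing inference latency by 1.2x to 5.0x across hardware platforms.
Insight: The contribution is the joint use of structured pruning and multi-objective optimization, with per-view contributions to classification accuracy quantified for the multi-view network so that pruning budgets can be allocated adaptively. This performance-aware, view-adaptive compression offers a viable path to deploying complex vision models in distributed edge environments.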
Abstract: Deep distributed networks (DNNs) have become central to modern computer vision, yet their deployment on resource-constrained edge devices remains hindered by substantial parameter counts and computational demands. Here, we present an approach to the efficient deployment of distributed DNNs that jointly respects hardware limitations and preserves task performance. Our method integrates structured model pruning with multi-objective optimization to tailor network capacity to heterogeneous device constraints. We demonstrate this framework using Multi-View Convolutional Neural Network (MVCNN), a state-of-the-art architecture for 3D object recognition, by quantifying the contribution of individual views to classification accuracy and allocating pruning budgets accordingly. Experimental results show that the resulting models satisfy user-specified bounds on accuracy and memory footprint while reducing inference latency by factors ranging from 1.2x to 5.0x across diverse hardware platforms. These findings suggest that performance-aware, view-adaptive compression provides a viable pathway for deploying complex vision models in distributed edge environments.
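As a rough illustration of allocating pruning budgets by per-view contribution, the sketch below prunes low-contribution views more aggressively. It is a simple inverse-contribution heuristic chosen for illustration, not the paper's multi-objective optimization. The function name, the `mean_ratio` and `max_ratio` parameters, and the example contribution scores are all assumptions.

```python
from typing import List


def allocate_pruning_ratios(
    contributions: List[float],
    mean_ratio: float,
    max_ratio: float = 0.9,
) -> List[float]:
    """Give each view a pruning ratio inversely proportional to its
    contribution, targeting an average ratio of `mean_ratio`.
    Ratios are clipped at `max_ratio`, which can lower the realised mean."""
    if any(c <= 0 for c in contributions):
        raise ValueError("contributions must be positive")
    inverse = [1.0 / c for c in contributions]
    total = sum(inverse)
    n = len(contributions)
    return [min(max_ratio, mean_ratio * n * inv / total) for inv in inverse]


# Hypothetical scores: view 0 contributes twice as much as views 1 and 2.
ratios = allocate_pruning_ratios([0.5, 0.25, 0.25], mean_ratio=0.4)
```

In this example the most informative view is pruned at roughly half the rate of the other two, while the average pruning ratio stays at the requested 0.4.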
cs.AI [Back]
[153] SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence cs.AI | cs.CLPDF
Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu
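TL;DR: SciEvalKit is an open-source, unified toolkit for evaluating scientific general intelligence, designed to assess AI models across scientific disciplines and task capabilities. It focuses on core competencies of scientific intelligence (scientific multimodal perception, reasoning, and understanding, symbolic reasoning, code generation, hypothesis generation, and knowledge understanding) and covers six major domains, spanning from physics and chemistry to astronomy and materials science. Built on expert-grade scientific benchmarks, it supports a flexible, extensible evaluation pipeline to foster development in the AI4Science community.
Details
Motivation: General-purpose evaluation platforms fall short in evaluating scientific intelligence. The toolkit provides standardized, cross-disciplinary evaluation infrastructure focused on core scientific competencies, to advance scientific foundation models and intelligent agents.
Result: The abstract reports no quantitative results or benchmark rankings. It emphasizes that the toolkit provides transparent, reproducible, and comparable results and supports batch evaluation and integration of custom models and datasets.
Insight: The contribution is combining capability-based evaluation with disciplinary diversity in a unified, extensible framework for evaluating scientific intelligence. Viewed objectively, its open-source, community-driven design helps standardize scientific AI evaluation and promotes reproducible, comparable research.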
Abstract: We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
[154] Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback cs.AI | cs.CLPDF
Mengkang Hu, Bowei Xia, Yuran Wu, Ailing Yu, Yude Zou
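TL;DR: This paper proposes Agent2World, a tool-augmented multi-agent framework for generating symbolic world models (such as PDDL domains or executable simulators). It runs in three stages: a Deep Researcher agent synthesizes knowledge to fill specification gaps, a Model Developer agent implements executable world models, and a specialized Testing Team performs adaptive unit testing and simulation-based validation. The method achieves SOTA inference-time performance on multiple benchmarks and also serves as a data engine for supervised fine-tuning, generating training trajectories from multi-agent feedback that substantially improve the fine-tuned model.
Details
Motivation: Training LLMs to generate symbolic world models is limited by the lack of large-scale verifiable supervision, and existing static validation methods miss behavior-level errors that arise during interactive execution.
Result: On three benchmarks spanning PDDL and executable code representations, Agent2World achieves SOTA inference-time performance. Fine-tuning on trajectories generated by the framework yields an average relative gain of 30.95% in world-model generation over the same model before training.
Insight: The contribution is decomposing world-model generation into a multi-agent collaborative pipeline and introducing interactive, behavior-aware adaptive feedback from the Testing Team, which improves inference-time performance and creates a closed loop of data generation and model improvement.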
Abstract: Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution. In this paper, we propose Agent2World, a tool-augmented multi-agent framework that achieves strong inference-time world-model generation and also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback. Agent2World follows a three-stage pipeline: (i) A Deep Researcher agent performs knowledge synthesis by web searching to address specification gaps; (ii) A Model Developer agent implements executable world models; And (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation. Agent2World demonstrates superior inference-time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state-of-the-art results. Beyond inference, Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. The model fine-tuned on these trajectories substantially improves world-model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: https://agent2world.github.io.
[155] Monadic Context Engineering cs.AI | cs.CL | cs.FLPDF
Yifan Zhang, Mengdi Wang
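TL;DR: This paper proposes Monadic Context Engineering (MCE), a new architectural paradigm that gives a formal foundation to the design of LLM-based autonomous agents. It uses the algebraic structures of functors, applicative functors, and monads to treat agent workflows as computational contexts, so that cross-cutting concerns such as state propagation, error handling, and concurrent execution are managed intrinsically.
Details
Motivation: Current LLM-based agent architectures are typically built with imperative, ad hoc patterns, producing brittle systems that struggle with state management, error handling, and concurrency. The paper introduces formal algebraic structures to give agent design a more robust, composable foundation.
Result: Through theoretical exposition and framework design, the paper shows how MCE enables robust sequential composition, structured parallel execution, and systematic composition of capabilities via monad transformers.
Insight: The core contribution is the systematic application of algebraic structures from functional programming, such as monads, to AI agent architecture, addressing cross-cutting engineering concerns with mathematical rigor. Its Meta-Agent concept, which uses MCE for generative orchestration and dynamic workflow management, suggests a new way to build complex, composable agent systems.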
Abstract: The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming. Project Page: https://github.com/yifanzhang-pro/monadic-context-engineering.
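The short-circuiting error handling that the abstract attributes to monadic composition can be sketched with a minimal Result type. This is a generic illustration of monadic bind, not code from the MCE repository. The `Result` class and the `plan`, `call_tool`, and `report` steps are hypothetical stand-ins for agent stages.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Result:
    """Either a value or an error message (a minimal Either/Result type)."""
    value: Any = None
    error: Optional[str] = None

    def bind(self, step: Callable[[Any], "Result"]) -> "Result":
        # Monadic bind: run the next step only if no error has occurred.
        if self.error is not None:
            return self
        return step(self.value)


def plan(task: str) -> Result:
    cleaned = task.strip()
    return Result(value=cleaned) if cleaned else Result(error="empty task")


def call_tool(query: str) -> Result:
    # Stand-in for a tool call; here it just measures the query length.
    return Result(value=len(query))


def report(n: int) -> Result:
    return Result(value=f"{n} chars")


def run(task: str) -> Result:
    # Steps are chained without any explicit error checks between them.
    return Result(value=task).bind(plan).bind(call_tool).bind(report)


ok = run("  hello ")
failed = run("   ")
```

Here `ok.value` is `"5 chars"`, while `failed` carries the error from `plan` and the later steps never run. The error handling lives in `bind` rather than in each step.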
[156] Multimodal Fact-Checking: An Agent-based Approach cs.AI | cs.CLPDF
Danni Xu, Shaojing Fan, Xuanang Cheng, Mohan Kankanhalli
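TL;DR: To address multimodal misinformation detection, this paper presents the RW-Post dataset and the AgentFact framework. RW-Post is a high-quality, explainable real-world multimodal fact-checking dataset containing complete social media posts, annotated reasoning, and verifiable evidence. AgentFact is an agent-based multimodal fact-checking framework in which five specialized agents collaborate to emulate the human verification workflow, improving accuracy and interpretability.
Details
Motivation: Existing large vision-language models and multimodal fusion methods show limited reasoning and shallow evidence use in multimodal misinformation detection, and dedicated datasets with complete reasoning processes and verifiable evidence are lacking.
Result: Experiments show that combining RW-Post with AgentFact substantially improves the accuracy and interpretability of multimodal fact-checking.
Insight: The contributions are RW-Post, a high-quality, explainable real-world multimodal fact-checking dataset, and AgentFact, an agent-based collaborative framework that emulates the human workflow and performs systematic analysis through an iterative loop of evidence search, filtering, and reasoning.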
Abstract: The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.
[157] CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations cs.AI | cs.CL | cs.CVPDF
Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan
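TL;DR: This paper introduces CubeBench, a generative benchmark built on the Rubik's Cube for diagnosing LLM agents' interactive, long-horizon spatial reasoning under partial observation. It uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from basic state tracking with full symbolic information to active exploration with only partial visual data. Leading LLMs score a 0.00% pass rate on long-horizon tasks, exposing a fundamental failure in long-term planning.
Details
Motivation: LLM agents face core challenges in physical-world deployment: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation.
Result: Experiments on leading LLMs show a 0.00% pass rate on all long-horizon tasks, indicating a fundamental failure of current models in long-term planning.
Insight: The contribution is a three-tiered diagnostic framework that isolates and evaluates spatial-cognition bottlenecks, with external solver tools provided to analyze failure modes, yielding key insights for building more physically grounded agents. Viewed objectively, the concrete Rubik's Cube task systematically exposes LLM limitations in physical spatial reasoning and long-horizon planning and points to directions for future work.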
Abstract: Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik’s Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.
[158] Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following cs.AI | cs.CL | cs.LGPDF
Kongcheng Zhang, Qi Yao, Shunyu Liu, Wenjian Zhang, Min Cen
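TL;DR: This paper proposes Hindsight instruction Replay (HiR), a sample-efficient reinforcement learning framework for complex instruction following, where a model's limited initial ability makes successful samples sparse and reward signals hard to distinguish. HiR uses a select-then-rewrite strategy to replay failed attempts as successes based on the constraints satisfied in hindsight, and performs RL on these replayed samples together with the originals, achieving dual-preference learning at the instruction and response levels.
Details
Motivation: When RL is used to align LLMs to follow instructions, the initial model struggles to generate high-quality responses that satisfy all constraints, so rewards are sparse or indistinguishable and learning is inefficient.
Result: Extensive experiments on multiple instruction-following tasks show that HiR achieves promising results while requiring less computational budget, supporting its sample efficiency.
Insight: The contribution is a select-then-rewrite replay strategy based on constraints satisfied in hindsight, which turns failed attempts into successful samples, and its theoretical framing as dual-preference learning at the instruction and response levels, enabling efficient optimization with only a binary reward signal. It offers a new approach for RL tasks where successful samples are sparse.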
Abstract: Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset is available at https://github.com/sastpg/HIR.
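The select-then-rewrite idea can be sketched as follows: check which constraints a failed response happens to satisfy, then rewrite the instruction to contain only those constraints so the same response counts as a success. This is an illustrative reading of the abstract, not the authors' code. Representing constraints as named predicates, the instruction template, and the function name `hindsight_replay` are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Constraint = Callable[[str], bool]
Sample = Tuple[str, str, int]  # (instruction, response, binary reward)


def render(task: str, names: List[str]) -> str:
    """Build an instruction string from a task and its constraint names."""
    if not names:
        return task
    return task + " Constraints: " + "; ".join(names)


def hindsight_replay(
    task: str, constraints: Dict[str, Constraint], response: str
) -> List[Sample]:
    # Select: find the constraints the response satisfies in hindsight.
    satisfied = [name for name, check in constraints.items() if check(response)]
    all_met = len(satisfied) == len(constraints)
    samples = [(render(task, list(constraints)), response, int(all_met))]
    # Rewrite: a partially successful attempt is replayed as a success
    # against an instruction containing only the satisfied constraints.
    if satisfied and not all_met:
        samples.append((render(task, satisfied), response, 1))
    return samples


constraints = {
    "under 5 words": lambda r: len(r.split()) < 5,
    "ends with a period": lambda r: r.endswith("."),
}
samples = hindsight_replay("Greet the user.", constraints, "Hello there friend")
```

The example response meets the length constraint but not the punctuation one, so it yields the original sample with reward 0 plus one replayed sample with reward 1 whose instruction keeps only the length constraint.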
[159] Learning Multi-Modal Mobility Dynamics for Generalized Next Location Recommendation cs.AI | cs.CVPDF
Junshu Dai, Yu Wang, Tongya Zheng, Wei Ji, Qinghong Guo
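TL;DR: The paper proposes M³ob, a multi-modal mobility dynamics learning method that aims to improve the generalization of location recommendation. It builds a unified spatial-temporal relational graph (STRG) for multi-modal representation, drawing on the functional semantics and spatial-temporal knowledge captured by an LLM-enhanced spatial-temporal knowledge graph (STKG). A gating mechanism fuses the graph representations of different modalities, and STKG-guided cross-modal alignment injects spatial-temporal dynamic knowledge into the static image modality.
Details
Motivation: Existing human mobility prediction methods generalize poorly: unimodal methods are constrained by data sparsity and inherent biases, while multi-modal methods struggle to capture mobility dynamics because of the semantic gap between static multi-modal representations and spatial-temporal dynamics.
Result: Extensive experiments on six public datasets show consistent improvements in normal scenarios and significant generalization ability in abnormal scenarios.
Insight: The contribution is using an LLM-enhanced STKG to build a unified STRG that characterizes multi-modal mobility dynamics, with gated fusion and STKG-guided cross-modal alignment bridging the semantic gap between static images and spatial-temporal dynamics to improve generalization.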
Abstract: The precise prediction of human mobility has produced significant socioeconomic impacts, such as location recommendations and evacuation suggestions. However, existing methods suffer from limited generalization capability: unimodal approaches are constrained by data sparsity and inherent biases, while multi-modal methods struggle to effectively capture mobility dynamics caused by the semantic gap between static multi-modal representation and spatial-temporal dynamics. Therefore, we leverage multi-modal spatial-temporal knowledge to characterize mobility dynamics for the location recommendation task, dubbed Multi-Modal Mobility (M³ob). First, we construct a unified spatial-temporal relational graph (STRG) for multi-modal representation, by leveraging the functional semantics and spatial-temporal knowledge captured by the large language model (LLM)-enhanced spatial-temporal knowledge graph (STKG). Second, we design a gating mechanism to fuse spatial-temporal graph representations of different modalities, and propose an STKG-guided cross-modal alignment to inject spatial-temporal dynamic knowledge into the static image modality. Extensive experiments on six public datasets show that our proposed method not only achieves consistent improvements in normal scenarios but also exhibits significant generalization ability in abnormal scenarios.
[160] Memento-II: Learning by Stateful Reflective Memory cs.AI | cs.CV | cs.LGPDF
Jun Wang
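TL;DR: This paper proposes Memento-II, a theoretical framework for continual and experiential learning in LLM agents that integrates episodic memory with reinforcement learning. It identifies reflection as the key mechanism that lets agents adapt through interaction without backpropagation or fine-tuning, relaxing the conventional separation between training and deployment. To formalize this, the authors introduce the Stateful Reflective Decision Process, which models reflective learning as a two-stage read-write interaction with episodic memory: writing stores interaction outcomes and corresponds to policy evaluation, while reading retrieves relevant past cases and corresponds to policy improvement. The process induces an equivalent Markov decision process over augmented state-memory representations, allowing the use of classical tools from dynamic programming and reinforcement learning. The authors instantiate the framework with entropy-regularized policy iteration and establish convergence guarantees: as episodic memory grows and sufficiently covers the state space, the resulting policy converges to the optimal solution. The work gives a principled foundation for memory-augmented and retrieval-based language model agents that adapt continually without parameter updates.
Details
Motivation: LLM agents need to keep learning and adapting after deployment, without the training-deployment separation imposed by reliance on backpropagation or fine-tuning.
Result: Theoretical analysis shows that, when episodic memory sufficiently covers the state space, entropy-regularized policy iteration under the proposed framework is guaranteed to converge to the optimal policy.
Insight: The work formalizes reflection as read-write interaction with episodic memory and maps it to policy evaluation and policy improvement in reinforcement learning, giving a theoretical framework with convergence guarantees for continual learning without parameter updates. It is a principled advance for memory-augmented agents.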
Abstract: We propose a theoretical framework for continual and experiential learning in large language model agents that integrates episodic memory with reinforcement learning. The framework identifies reflection as the key mechanism that enables agents to adapt through interaction without back propagation or model fine tuning, thereby relaxing the conventional separation between training and deployment. To formalise this process, we introduce the Stateful Reflective Decision Process, which models reflective learning as a two stage read write interaction with episodic memory. Writing stores interaction outcomes and corresponds to policy evaluation, while reading retrieves relevant past cases and corresponds to policy improvement. We show that this process induces an equivalent Markov decision process over augmented state memory representations, allowing the use of classical tools from dynamic programming and reinforcement learning. We further instantiate the framework using entropy regularised policy iteration and establish convergence guarantees. As episodic memory grows and achieves sufficient coverage of the state space, the resulting policy converges to the optimal solution. This work provides a principled foundation for memory augmented and retrieval based language model agents capable of continual adaptation without parameter updates.
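The read-write correspondence can be made concrete with a toy episodic memory: writing accumulates observed returns (a crude form of policy evaluation), and reading turns the stored estimates into a softmax policy, which is the maximizer of expected value plus an entropy bonus (policy improvement). This is a minimal sketch under strong simplifying assumptions, not the paper's Stateful Reflective Decision Process. Exact-match state lookup, Monte Carlo averaging, a default value of 0 for unseen pairs, and the class name `EpisodicMemory` are all choices made here for illustration.

```python
import math
from typing import Dict, Hashable, List, Tuple


class EpisodicMemory:
    def __init__(self, actions: List[str], temperature: float = 1.0):
        self.actions = list(actions)
        self.temperature = temperature
        self.total: Dict[Tuple[Hashable, str], float] = {}
        self.count: Dict[Tuple[Hashable, str], int] = {}

    def write(self, state: Hashable, action: str, ret: float) -> None:
        """Store an interaction outcome (policy evaluation by averaging)."""
        key = (state, action)
        self.total[key] = self.total.get(key, 0.0) + ret
        self.count[key] = self.count.get(key, 0) + 1

    def q(self, state: Hashable, action: str) -> float:
        """Mean stored return for (state, action); 0.0 if never seen."""
        n = self.count.get((state, action), 0)
        return self.total.get((state, action), 0.0) / n if n else 0.0

    def read(self, state: Hashable) -> Dict[str, float]:
        """Retrieve past cases and return a softmax policy over actions
        (the entropy-regularised improvement step)."""
        scaled = [self.q(state, a) / self.temperature for a in self.actions]
        m = max(scaled)  # subtract the max for numerical stability
        weights = [math.exp(v - m) for v in scaled]
        z = sum(weights)
        return {a: w / z for a, w in zip(self.actions, weights)}


memory = EpisodicMemory(actions=["a", "b"], temperature=1.0)
memory.write("s", "a", 1.0)
memory.write("s", "a", 3.0)
policy = memory.read("s")
```

No model parameters change anywhere in this loop. The policy shifts only because the memory contents change, which is the kind of adaptation without parameter updates that the abstract describes.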
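[161] HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery cs.AI | cs.CV
Yaping Zhang, Qixuan Zhang, Xingquan Zhang, Zhiyuan Chen, Wenwen Zhuang
TL;DR: This paper introduces HiSciBench, a hierarchical, multi-disciplinary benchmark for comprehensively evaluating the scientific intelligence of foundation models. The benchmark mirrors the complete scientific workflow with five levels, from scientific literacy to scientific discovery, covers six major disciplines, and supports multimodal and cross-lingual evaluation. Evaluations of mainstream models show acceptable performance on basic tasks but substantial gaps on advanced discovery tasks.
Details
Motivation: Existing benchmarks for scientific intelligence are mostly fragmented and narrow in task scope, and do not reflect the hierarchical, multi-disciplinary nature of real scientific inquiry. A new comprehensive benchmark is needed to evaluate the full spectrum of model abilities, from understanding knowledge to creative discovery.
Result: Leading models such as GPT-5 and DeepSeek-R1 were evaluated on HiSciBench. Accuracy reaches up to 69% on basic literacy tasks but drops sharply to 25% on discovery-level challenges, revealing a large capability gap across stages of scientific reasoning.
Insight: The contribution is an integrated, hierarchical, multi-disciplinary, dependency-aware evaluation framework that structures the scientific workflow and supports multimodal inputs and cross-lingual evaluation, offering a new standard for diagnosing model capabilities at each stage of scientific reasoning.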
Abstract: The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities ranging from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry. We introduce HiSciBench, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: Scientific Literacy (L1), Literature Parsing (L2), Literature-based Question Answering (L3), Literature Review Generation (L4), and Scientific Discovery (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines, including mathematics, physics, chemistry, biology, geography, and astronomy, and supports multimodal inputs including text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities across different stages of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69% accuracy on basic literacy tasks, performance declines sharply to 25% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.
q-fin.GN
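[162] Deep Learning for Art Market Valuation q-fin.GN | cs.AI | cs.CV | cs.LG | econ.GN
Jianping Mei, Michael Moses, Jan Waelty, Yucheng Yang
TL;DR: This study examines how deep learning can improve art market valuation by incorporating the visual content of artworks. Using a large repeated-sales dataset from major auction houses, the authors benchmark classical hedonic regressions and tree-based methods against modern deep architectures, including multi-modal models that fuse tabular and image data. Artist identity and prior transaction history dominate overall predictive power, but for fresh-to-market works that lack historical anchors, visual embeddings make a distinct and economically meaningful contribution. Interpretability analyses using Grad-CAM and embedding visualizations show that the models attend to compositional and stylistic cues. The study shows that multi-modal deep learning adds significant value in first-time sales, where valuation is hardest, and offers new insights for both research and practice in art market valuation.
Details
Motivation: Traditional valuation methods such as hedonic regression may ignore visual content, which matters most for first-time sales with no transaction history. The paper explores how deep learning, and multi-modal models in particular, can improve predictive accuracy by fusing image and tabular data.
Result: Benchmarks on a large repeated-sales dataset from major auction houses show that multi-modal deep learning makes a significant and economically meaningful contribution when valuing fresh-to-market works; visual embeddings complement the traditional predictors, which are dominated by artist identity and transaction history.
Insight: The paper systematically applies multi-modal deep learning (fusing image and tabular data) to art market valuation and shows that visual content has distinct value where historical anchors are absent, such as first-time sales. A reusable element is the use of Grad-CAM and embedding visualizations for interpretability, revealing the visual features the model attends to (composition and style); this offers a methodological reference for multi-modal prediction in other domains, such as cultural heritage or luxury goods valuation.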
Abstract: We study how deep learning can improve valuation in the art market by incorporating the visual content of artworks into predictive models. Using a large repeated-sales dataset from major auction houses, we benchmark classical hedonic regressions and tree-based methods against modern deep architectures, including multi-modal models that fuse tabular and image data. We find that while artist identity and prior transaction history dominate overall predictive power, visual embeddings provide a distinct and economically meaningful contribution for fresh-to-market works where historical anchors are absent. Interpretability analyses using Grad-CAM and embedding visualizations show that models attend to compositional and stylistic cues. Our findings demonstrate that multi-modal deep learning delivers significant value precisely when valuation is hardest, namely first-time sales, and thus offers new insights for both academic research and practice in art market valuation.
cs.LG
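[163] AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing cs.LG | cs.CL
Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun
TL;DR: This paper proposes AFA-LoRA, a training strategy that introduces an annealed activation function into LoRA (Low-Rank Adaptation). The activation transitions from non-linear to linear during training, increasing LoRA's expressive power while preserving mergeability. The method is validated on supervised fine-tuning, reinforcement learning, and speculative decoding.
Details
Motivation: LoRA is a widely used parameter-efficient fine-tuning method, but its linear adaptation limits its expressive power, leaving a performance gap between linear and non-linear training. The paper aims to close this gap.
Result: Experiments show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training, giving a more powerful and practical paradigm for parameter-efficient adaptation.
Insight: The core idea is an annealed activation function that gives the adapter stronger non-linear representational capacity early in training and converges to a mergeable linear form at the end, increasing expressivity while keeping LoRA's mergeability advantage.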
Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method. However, its linear adaptation process limits its expressive power. This means there is a gap between the expressive power of linear training and non-linear training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings non-linear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a non-linear to a linear transformation during training, allowing the adapter to initially adopt stronger representational capabilities before converging to a mergeable linear form. We implement our method on supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training. This work enables a more powerful and practical paradigm of parameter-efficient adaptation.
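One way to picture an activation that anneals from non-linear to linear is as a blend between a non-linearity and the identity. The sketch below is an illustration under stated assumptions, not the authors' formulation: the interpolation form, the choice of `tanh`, the linear schedule, and all function names are hypothetical.

```python
import math


def annealed_activation(x, alpha):
    """Blend a non-linearity with the identity.

    alpha = 0.0 gives a purely non-linear map (tanh, chosen for illustration);
    alpha = 1.0 gives the identity, so the adapter becomes linear and mergeable.
    """
    return (1.0 - alpha) * math.tanh(x) + alpha * x


def linear_schedule(step, total_steps):
    """Hypothetical schedule: alpha rises linearly from 0 to 1 over training."""
    return min(1.0, max(0.0, step / total_steps))


def adapter_forward(x, A, B, alpha):
    """Low-rank update B @ act(A @ x), written with plain lists.

    A has shape (r, d) and B has shape (d, r) for input dimension d and rank r.
    """
    hidden = [
        annealed_activation(sum(a * xi for a, xi in zip(row, x)), alpha)
        for row in A
    ]
    return [sum(b * h for b, h in zip(row, hidden)) for row in B]


x = [1.0, -2.0]
A = [[0.5, -1.0]]        # rank r = 1
B = [[2.0], [3.0]]
early = adapter_forward(x, A, B, linear_schedule(0, 100))    # non-linear phase
final = adapter_forward(x, A, B, linear_schedule(100, 100))  # linear phase
```

Once `alpha` reaches 1, the adapter output equals `(B @ A) @ x`, so the product `B @ A` can be folded into the base weight matrix exactly as in ordinary LoRA.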
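[164] Scaling Unverifiable Rewards: A Case Study on Visual Insights cs.LG | cs.AI | cs.CL
Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong
TL;DR: This paper proposes Selective Test-Time Scaling (Selective TTS), a process-based refinement framework that addresses error accumulation in multi-stage tasks whose final outcomes lack verifiable rewards. The authors build an end-to-end multi-agent data science pipeline that generates charts and reports, and use an LLM judge aligned with human experts to improve insight quality under a fixed compute budget.
Details
Motivation: Real-world multi-stage tasks, such as data science pipelines, often have final outcomes with no verifiable reward and too little data to train a robust reward model, so judge-based refinement tends to accumulate errors across stages.
Result: In the data science pipeline, Selective TTS raises the mean score from 61.64 to 65.86 under a fixed compute budget while reducing variance, and the LLM judge is aligned with human experts (Kendall’s τ=0.55).
Insight: Instead of refining repeatedly over time as in prior work, the method distributes compute across the stages of a multi-stage pipeline and uses process-specific judges to prune low-quality branches early. This mitigates judge drift and stabilises refinement, and is a first step toward scaling complex, open-ended tasks such as scientific discovery and story generation.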
Abstract: Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating error across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the different stages of a multi-agent pipeline, rather than refining repeatedly over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent pipeline for generating visually insightful charts and reports for a given dataset, and design a reliable LLM-based judge model aligned with human experts (Kendall’s τ=0.55). Our proposed Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
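The judge-alignment figure above is a Kendall rank correlation. As a reminder of what that statistic measures, here is a minimal sketch of the tau-a variant, which counts concordant and discordant pairs and ignores tie corrections. Which variant the authors used is not stated in the abstract, and the score lists below are made-up illustrations rather than data from the paper.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1      # both raters order the pair the same way
        elif sign < 0:
            discordant += 1      # the raters disagree on the pair's order
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


human_ranks = [1, 2, 3, 4]   # hypothetical expert ranking of four charts
judge_ranks = [1, 3, 2, 4]   # hypothetical LLM-judge ranking of the same charts
tau = kendall_tau_a(human_ranks, judge_ranks)
```

A value of 1 means the judge orders every pair as the experts do, 0 means no association, and -1 means a fully reversed ordering.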
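[165] A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms cs.LG | cs.AI | cs.CL
Yingru Li, Ziniu Li, Jiacai Liu
TL;DR: This paper presents a unified framework for LLM fine-tuning that combines imitation learning with reinforcement learning. By analysing the gradient of a composite objective that combines trajectory-level KL divergence with task rewards, it decomposes the gradient into an analytically computable dense gradient (for token-level imitation) and a sparse gradient that requires Monte Carlo estimation (for long-horizon reward optimization).
Details
Motivation: To integrate imitation learning, which provides dense supervision, with reinforcement learning, which optimizes long-horizon task rewards, in LLM fine-tuning so as to improve training efficiency and model performance.
Result: The abstract reports no quantitative results or benchmarks, but the paper derives a closed-form expression for the dense gradient that supports efficient GPU implementation.
Insight: The contribution is the decomposition of the hybrid objective's gradient into dense and sparse components, which gives LLM fine-tuning both a theoretical framework and an efficient algorithm. Viewed objectively, the decomposition may simplify the training pipeline and help balance imitation against exploration.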
Abstract: We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.
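[166] VL-RouterBench: A Benchmark for Vision-Language Model Routing cs.LG | cs.AI | cs.CL
Zhehao Huang, Baijiong Lin, Jingyuan Zhang, Jingying Wang, Yuhang Liu
TL;DR: This paper presents VL-RouterBench, a benchmark for systematically evaluating routing systems for vision-language models (VLMs). Built from raw VLM inference and scoring logs, it constructs quality and cost matrices over sample-model pairs, covering 14 datasets in 3 task groups and 17 models, for more than 500,000 sample-model pairs and about 34 million input-output tokens. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to compare router configurations and cost budgets.
Details
Motivation: Existing multimodal routing work lacks a systematic, reproducible benchmark for evaluating vision-language models, so a comprehensive benchmark is needed to assess the overall capability of VLM routing systems.
Result: Ten routing methods and baselines are evaluated on the benchmark. A significant routability gain is observed, but the best current routers still show a clear gap to the ideal Oracle, indicating considerable room to improve router architectures through finer visual cues and modelling of textual structure.
Insight: The paper offers the first systematic VLM routing benchmark, built around a large-scale, multi-task, multi-model quality-cost matrix and a ranking score that jointly measures performance and cost. It gives multimodal routing research a comparable, reproducible evaluation basis and exposes the gap between current routers and ideal performance, pointing to directions for improvement.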
Abstract: Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision-language models (VLMs). We present VL-RouterBench to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample-model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample-model pairs and a total input-output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.
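The ranking score is described as a harmonic mean of normalized cost and accuracy. The sketch below shows how such a score behaves; the min-max cost normalization (cheaper maps closer to 1) and the function names are assumptions, since the abstract does not give the benchmark's exact normalization. The last lines check the reported scale figures against each other.

```python
def normalized_cost_score(cost, min_cost, max_cost):
    """Map a cost to [0, 1] so that cheaper is better (assumed min-max scaling)."""
    if max_cost == min_cost:
        return 1.0
    return (max_cost - cost) / (max_cost - min_cost)


def rank_score(accuracy, cost_score):
    """Harmonic mean of accuracy and cost score, both expected in [0, 1]."""
    if accuracy + cost_score == 0:
        return 0.0
    return 2 * accuracy * cost_score / (accuracy + cost_score)


# A router that is accurate but expensive versus one that is balanced.
expensive = rank_score(0.90, normalized_cost_score(4.5, 1.0, 5.0))
balanced = rank_score(0.80, normalized_cost_score(1.8, 1.0, 5.0))

# Scale check from the abstract: samples x models = sample-model pairs.
n_samples, n_models = 30_540, 15 + 2
n_pairs = n_samples * n_models
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a router cannot score well by maximizing accuracy while ignoring cost, or the reverse. The pair count works out to 30,540 × 17 = 519,180, matching the abstract.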
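[167] Training AI Co-Scientists Using Rubric Rewards cs.LG | cs.CL | cs.HC
Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain
TL;DR: This paper proposes a method that automatically builds a training corpus from existing research papers and trains language models, via reinforcement learning with self-grading, to generate high-quality research plans, with the aim of improving the planning ability of AI co-scientists.
Details
Motivation: Current language models struggle to generate research plans that satisfy all constraints and implicit requirements. A scalable, automated training method is needed to improve the planning ability of AI co-scientists.
Result: In a human expert evaluation on machine learning research goals, plans from the fine-tuned Qwen3-30B-A3B model were preferred over the initial model for 70% of goals, and experts approved 84% of the automatically extracted grading rubrics. In cross-domain evaluations on medical papers and new arXiv preprints, the model shows 12-22% relative improvements and significant generalization.
Insight: Two ideas stand out: building a scalable training corpus by automatically extracting research goals and grading rubrics from papers across domains, and using reinforcement learning with self-grading (the initial policy acts as the grader) to improve without external supervision. The method remains effective in domains where execution feedback is unavailable, such as medical research, and offers a scalable, automated recipe for training general AI co-scientists.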
Abstract: AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.
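[168] SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models cs.LG | cs.CV
Jiesong Lian, Ruizhe Zhong, Zixiang Zhou, Xiaoyue Mi, Yixue Hao
TL;DR: This paper proposes SoliReward, a framework that addresses reward hacking and annotation noise in training reward models for video generation. It collects high-quality data through single-item binary annotations, builds preference pairs with a cross-prompt pairing strategy, introduces a Hierarchical Progressive Query Attention mechanism to strengthen feature aggregation, and uses a modified BT loss to regularize the reward score distribution. Across several benchmarks it improves both reward model evaluation metrics and the post-training results of video generation models.
Details
Motivation: Post-training alignment of video generation models is a key goal, but existing reward models suffer from annotation noise, under-explored architectural design, and susceptibility to reward hacking, which calls for a systematic solution.
Result: On benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, SoliReward improves both direct reward model evaluation metrics and the effectiveness of post-training on video generation models.
Insight: The contributions are a data collection strategy based on single-item binary annotation and cross-prompt pairing, the Hierarchical Progressive Query Attention mechanism, and the modified BT loss. Together they mitigate annotation noise and reward hacking and provide a systematic framework for training video reward models.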
Abstract: Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RMs are susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM’s score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark will be publicly available.
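For context on the loss being modified, here is a minimal sketch of the standard pairwise preference loss, assuming "BT" refers to the Bradley-Terry model. This is the unmodified baseline only: the win-tie extension described in the abstract is not reproduced here because its form is not given, and the function name is hypothetical.

```python
import math


def bradley_terry_loss(reward_preferred, reward_rejected):
    """Standard pairwise loss: -log(sigmoid(r_preferred - r_rejected)).

    Uses the identity -log(sigmoid(d)) = log(1 + exp(-d)), evaluated in a
    numerically stable way for both signs of d.
    """
    diff = reward_preferred - reward_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


well_ordered = bradley_terry_loss(2.0, 0.0)   # small loss: correct ordering
mis_ordered = bradley_terry_loss(0.0, 2.0)    # large loss: wrong ordering
undecided = bradley_terry_loss(1.0, 1.0)      # equal scores give log(2)
```

The standard loss always pushes the preferred score above the rejected one, which leaves no way to express that two samples are equally good; that is the gap a win-tie variant is meant to fill.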
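[169] Masking Teacher and Reinforcing Student for Distilling Vision-Language Models cs.LG | cs.AI | cs.CV
Byung-Kwan Lee, Yu-Chiang Frank Wang, Ryo Hachiuma
TL;DR: The paper proposes Masters, a distillation framework for compressing the knowledge of large vision-language models into small student models. It uses mask-progressive reinforcement learning: the teacher's non-dominant weights are first masked to reduce complexity, the teacher's capacity is then restored step by step, and an offline reinforcement learning stage with accuracy and distillation rewards refines the knowledge transfer.
Details
Motivation: Large vision-language models are hard to deploy on mobile or edge devices because of their parameter count, and the large size gap between teacher and student makes knowledge distillation unstable and degrades performance.
Result: The abstract reports no specific benchmarks or quantitative results, but claims that the method lets the student learn richer representations in a stable manner and achieve strong performance.
Insight: The contributions are a mask-progressive strategy that smooths the teacher-student learning process, and offline reinforcement learning with a dual reward (an accuracy reward and a distillation reward) that guides knowledge transfer efficiently while avoiding the high compute cost of online think-answer paradigms.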
Abstract: Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher’s complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
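The mask-then-restore idea can be illustrated with magnitude-based masking. In the sketch below, treating "non-dominant" weights as those with the smallest absolute value is an assumption, as are the keep-ratio schedule and the function names; the abstract does not define the selection criterion.

```python
def mask_non_dominant(weights, keep_ratio):
    """Zero all but the largest-magnitude fraction `keep_ratio` of the weights."""
    n_keep = max(1, round(len(weights) * keep_ratio))
    by_magnitude = sorted(
        range(len(weights)), key=lambda i: abs(weights[i]), reverse=True
    )
    keep = set(by_magnitude[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def keep_ratio_at(stage, n_stages, start=0.5):
    """Hypothetical schedule: restore teacher capacity linearly up to 1.0."""
    return start + (1.0 - start) * stage / (n_stages - 1)


teacher = [0.1, -2.0, 0.5, 3.0]
# Each stage exposes the student to a slightly more complete teacher.
stages = [mask_non_dominant(teacher, keep_ratio_at(s, 3)) for s in range(3)]
```

The student first sees a simplified teacher and only later the full one, which is the intuition behind bridging a large teacher-student size gap gradually.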
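[170] Temporal Visual Semantics-Induced Human Motion Understanding with Large Language Models cs.LG | cs.CV
Zheng Xing, Weibing Zhao
TL;DR: This paper proposes a human motion segmentation method that combines temporal vision semantics with subspace clustering. A large language model extracts textual motion information from consecutive frames, and that information is used to improve subspace clustering. The method outperforms existing state-of-the-art approaches on four benchmark datasets.
Details
Motivation: Traditional unsupervised human motion segmentation methods overlook the role of temporal semantic exploration. The paper uses the image-to-text capability of large language models to extract temporal vision semantics from human motion sequences and improve subspace clustering.
Result: Experiments on four benchmark human motion datasets show that the proposed method outperforms existing state-of-the-art approaches.
Insight: The contribution is to integrate LLM-derived temporal semantic information into a subspace clustering framework. A temporal regularizer and a feedback-based optimization mechanism encourage neighbouring frames to share similar subspace embeddings, which improves segmentation accuracy.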
Abstract: Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via an LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using an LLM and integrating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.
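[171] Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model cs.LG | cs.AI | cs.CV
Renping Zhou, Zanlin Ni, Tianyi Chen, Zeyu Liu, Yang Yue
TL;DR: This paper proposes Co-GRPO, which reformulates the generation process of masked diffusion models as a unified Markov decision process and applies Group Relative Policy Optimization at the trajectory level. Model parameters and inference schedule parameters are optimized jointly, closing the gap between training and inference and improving generation quality.
Details
Motivation: Masked diffusion models are trained with a simplified single-step, BERT-style objective, whereas inference is a multi-step iterative process shaped by both the model and the schedule. Because of this mismatch, the schedule is never optimized during training.
Result: Empirical results on four benchmarks (ImageReward, HPS, GenEval, and DPG-Bench) show that the method improves generation quality.
Insight: The main contribution is to unify MDM generation under an MDP framework and apply Group Relative Policy Optimization without costly backpropagation through the multi-step generation process. This optimizes the model and the schedule together and aligns training with inference more thoroughly.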
Abstract: Recently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective that masks a subset of tokens and predicts all of them simultaneously. This step-level simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, leaving the inference schedules never optimized during training. In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks (ImageReward, HPS, GenEval, and DPG-Bench) demonstrate the effectiveness of our approach. For more details, please refer to our project page: https://co-grpo.github.io/
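The "group relative" part of Group Relative Policy Optimization refers to scoring each sampled trajectory against the other trajectories drawn for the same prompt. A minimal sketch of that normalization follows; the use of the population standard deviation and the `eps` constant are assumptions (implementations differ on both), and nothing here is specific to Co-GRPO's joint treatment of model and schedule.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for several trajectories sampled from the same prompt (made-up values).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
no_signal = group_relative_advantages([0.5, 0.5, 0.5])
```

Trajectories that beat their group's average get a positive advantage and the rest a negative one, so no learned value function is needed. When every trajectory in a group earns the same reward, all advantages are zero and the group contributes no gradient.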
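[172] LangPrecip: Language-Aware Multimodal Precipitation Nowcasting cs.LG | cs.AI | cs.CV
Xudong Ling, Tianxi Huang, Qian Dong, Tao He, Chaorong Li
TL;DR: This paper proposes LangPrecip, a language-aware multimodal precipitation nowcasting framework. It treats meteorological text as a semantic motion constraint on precipitation evolution and formulates nowcasting as semantically constrained trajectory generation under the Rectified Flow paradigm, enabling efficient and physically consistent fusion of text and radar information in latent space.
Details
Motivation: In short-term precipitation nowcasting, especially for rapidly evolving and extreme weather events, existing approaches rely mainly on visual conditioning, which leaves future motion weakly constrained and ambiguous.
Result: Experiments on the Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, with gains of over 60% and 19%, respectively, in heavy-rainfall CSI at an 80-minute lead time.
Insight: The contributions are the use of meteorological text as a semantic motion constraint in the generative process and the construction of LangPrecip-160k, a large-scale multimodal dataset. Viewed objectively, using the text modality as a strong semantic constraint offers a new conditional generation paradigm for uncertain spatiotemporal prediction problems.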
Abstract: Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework (LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space. We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60% and 19% gains in heavy-rainfall CSI at an 80-minute lead time.
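The reported gains are in CSI, the Critical Success Index, a standard categorical forecast score defined as hits / (hits + misses + false alarms). The sketch below computes it for a flattened grid; the threshold and the rain-rate values are made-up illustrations and are not taken from the paper.

```python
def critical_success_index(predicted, observed, threshold):
    """CSI = hits / (hits + misses + false alarms) at a rain-rate threshold."""
    hits = misses = false_alarms = 0
    for p, o in zip(predicted, observed):
        predicted_event = p >= threshold
        observed_event = o >= threshold
        if predicted_event and observed_event:
            hits += 1
        elif observed_event:
            misses += 1
        elif predicted_event:
            false_alarms += 1
    total = hits + misses + false_alarms
    # With no events predicted or observed, the score is undefined.
    return hits / total if total else float("nan")


forecast = [10.0, 0.0, 12.0, 3.0]   # hypothetical rain rates per grid cell
radar = [11.0, 9.0, 0.0, 1.0]       # hypothetical observed rain rates
score = critical_success_index(forecast, radar, threshold=8.0)
```

Correct negatives (cells where no event was predicted or observed) do not enter the score, which is why CSI is a common choice for rare events such as heavy rainfall.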
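[173] Schrodinger AI: A Unified Spectral-Dynamical Framework for Classification, Reasoning, and Operator-Based Generalization cs.LG | cs.CV
Truong Son Nguyen
TL;DR: The paper proposes Schrödinger AI, a unified machine learning framework inspired by quantum mechanics. It consists of three tightly coupled components: a time-independent wave-energy solver for perception and classification, a time-dependent dynamical solver for dynamic reasoning, and a low-rank operator calculus for learning symbolic transformations. The framework is intended as an alternative to conventional cross-entropy training and transformer attention, offering robust generalization, interpretable semantics, and emergent topology.
Details
Motivation: To build a unified, physics-inspired machine learning framework that addresses the limitations of conventional methods in generalization, interpretability, and adaptation to dynamic environments, recasting learning as discovering and navigating an underlying semantic energy landscape.
Result: Experiments show results on several tasks: emergent semantic manifolds that reflect human-conceived class relations without explicit supervision; adaptive reasoning in dynamic environments, such as maze navigation with real-time potential-field perturbations; and exact operator generalization on modular arithmetic, where the system learns group actions and composes them across sequences far beyond training length.
Insight: The claimed contribution is to unify spectral decomposition, dynamical evolution, and operator calculus from quantum mechanics in one machine learning framework, giving a physics-driven, interpretable alternative. Viewed objectively, unifying classification, dynamic reasoning, and generalization of symbolic operations in a single energy-landscape model suggests a new direction for foundational machine learning research.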
Abstract: We introduce Schrödinger AI, a unified machine learning framework inspired by quantum mechanics. The system is defined by three tightly coupled components: (1) a time-independent wave-energy solver that treats perception and classification as spectral decomposition under a learned Hamiltonian; (2) a time-dependent dynamical solver governing the evolution of semantic wavefunctions over time, enabling context-aware decision revision, re-routing, and reasoning under environmental changes; and (3) a low-rank operator calculus that learns symbolic transformations such as modular arithmetic through learned quantum-like transition operators. Together, these components form a coherent physics-driven alternative to conventional cross-entropy training and transformer attention, providing robust generalization, interpretable semantics, and emergent topology. Empirically, Schrödinger AI demonstrates: (a) emergent semantic manifolds that reflect human-conceived class relations without explicit supervision; (b) dynamic reasoning that adapts to changing environments, including maze navigation with real-time potential-field perturbations; and (c) exact operator generalization on modular arithmetic tasks, where the system learns group actions and composes them across sequences far beyond training length. These results suggest a new foundational direction for machine learning, where learning is cast as discovering and navigating an underlying semantic energy landscape.
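[174] ReDiF: Reinforced Distillation for Few Step Diffusion cs.LG | cs.CV
Amirhossein Tighkhorshid, Zahra Dehghanian, Gholamali Aminian, Chengchun Shi, Hamid R. Rabiee
TL;DR: This paper proposes ReDiF, a reinforcement-learning-based distillation framework for diffusion models that targets their slow sampling. It treats distillation as a policy optimization problem in which a reward signal dynamically guides the student to explore multiple denoising paths, enabling efficient sampling with fewer inference steps and less compute.
Details
Motivation: Existing diffusion distillation methods usually rely on fixed reconstruction or consistency losses, which limits how well the student can approximate the teacher in few steps. The paper uses a reinforcement learning framework to optimize the distillation process dynamically and improve the student's sampling efficiency and quality.
Result: Experiments show that the method outperforms existing distillation techniques while using significantly fewer inference steps and computational resources.
Insight: The contribution is to formalise distillation as a reinforcement learning policy optimization problem, using a reward signal to guide the student's exploration instead of a fixed loss function. This gives a model-agnostic optimization paradigm applicable to different kinds of diffusion models.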
Abstract: Distillation addresses the slow sampling problem in diffusion models by creating models with smaller size or fewer steps that approximate the behavior of high-step teachers. In this work, we propose a reinforcement-learning-based distillation framework for diffusion models. Instead of relying on fixed reconstruction or consistency losses, we treat the distillation process as a policy optimization problem, where the student is trained using a reward signal derived from alignment with the teacher’s outputs. This RL-driven approach dynamically guides the student to explore multiple denoising paths, allowing it to take longer, optimized steps toward high-probability regions of the data distribution, rather than relying on incremental refinements. Our framework utilizes the inherent ability of diffusion models to handle larger steps and effectively manage the generative process. Experimental results show that our method achieves superior performance with significantly fewer inference steps and computational resources compared to existing distillation techniques. Additionally, the framework is model-agnostic, applicable to any type of diffusion model with a suitable reward function, providing a general optimization paradigm for efficient diffusion learning.
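[175] Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models cs.LG | cs.CV
Mingyuan Zhang, Yue Bai, Yifan Wang, Yiyang Huang, Yun Fu
TL;DR: This paper rethinks fine-tuning for vision-language models and applies Mask Fine-Tuning (MFT) to their language and projector components. Instead of updating weights, MFT assigns learnable gating scores to the weights so the model can reorganize its internal subnetworks, making effective use of representational structure already present in the pre-trained model.
Details
Motivation: Existing fine-tuning methods such as LoRA rely on explicit weight updates and overlook the extensive representational structure that is already encoded in pre-trained models but remains underused. The paper explores a more efficient fine-tuning paradigm.
Result: Experiments show that the method consistently surpasses LoRA variants and even full fine-tuning on vision-language models with different language backbones, achieving high performance without altering the frozen backbone.
Insight: The paper reconsiders fine-tuning from a structural reparameterization perspective, using mask fine-tuning to reorganize internal connections. It shows that effective adaptation can come not only from updating weights but also from re-establishing connections among the model's existing knowledge.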
Abstract: Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model’s existing knowledge. Code available at: https://github.com/Ming-K9/MFT-VLM
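The difference between updating weights and gating them can be shown in a few lines. The sketch below applies a hard mask derived from per-weight scores to a frozen weight matrix; the thresholding rule, the score values, and the function names are assumptions for illustration, and the training of the scores (which needs a gradient estimator for the hard threshold) is not shown.

```python
def masked_weights(frozen_weights, gate_scores, threshold=0.0):
    """Keep a weight where its gate score exceeds the threshold, else zero it."""
    return [
        [w if s > threshold else 0.0 for w, s in zip(w_row, s_row)]
        for w_row, s_row in zip(frozen_weights, gate_scores)
    ]


def linear(x, weights):
    """Plain matrix-vector product with list-of-rows weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


W = [[1.0, -2.0], [0.5, 3.0]]          # pretrained weights, never updated
scores = [[0.7, -0.1], [-0.4, 0.2]]    # hypothetical learned gate scores
y = linear([1.0, 1.0], masked_weights(W, scores))
```

Only the scores would be trained; the values in `W` stay exactly as pretrained, and adaptation comes from selecting which existing connections remain active.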
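[176] Machine Learning-Assisted Vocal Cord Ultrasound Examination: Project VIPR cs.LG | cs.CE | cs.CV
Will Sebelik-Lassiter, Evan Schubert, Muhammad Alliyu, Quentin Robbins, Excel Olatunji
TL;DR: This study presents a machine learning-assisted algorithm that automatically identifies the vocal cords and distinguishes normal vocal cords from vocal cord paralysis in vocal cord ultrasound images, with the aim of improving the accuracy of vocal cord ultrasound examination.
Details
Motivation: The accuracy of vocal cord ultrasound depends on the operator. The study aims to reduce this dependence through machine learning by automating vocal cord identification and disease classification.
Result: The vocal cord segmentation model reached a validation accuracy of 96%, and the best classification model, VIPRnet, reached a validation accuracy of 99% on the study's in-house dataset.
Insight: The contribution is the application of machine learning to automated analysis of vocal cord ultrasound. Combining segmentation and classification models offers a feasible way to reduce operator dependence and illustrates the potential of AI-assisted diagnosis in medical imaging.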
Abstract: Intro: Vocal cord ultrasound (VCUS) has emerged as a less invasive and better tolerated examination technique, but its accuracy is operator dependent. This research aims to apply a machine learning-assisted algorithm to automatically identify the vocal cords and distinguish normal vocal cord images from vocal cord paralysis (VCP). Methods: VCUS videos were acquired from 30 volunteers, which were split into still frames and cropped to a uniform size. Healthy and simulated VCP images were used as training data for vocal cord segmentation and VCP classification models. Results: The vocal cord segmentation model achieved a validation accuracy of 96%, while the best classification model (VIPRnet) achieved a validation accuracy of 99%. Conclusion: Machine learning-assisted analysis of VCUS shows great promise in improving diagnostic accuracy over operator-dependent human interpretation.
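[177] A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers cs.LG | cs.AI | cs.CV | cs.NI | cs.OS
Mohammad Nasirzadeh, Jafar Tahmoresnezhad, Parviz Rashidi-Khazaee
TL;DR: This paper proposes CoLog, a unified framework for detecting point and collective anomalies in operating system logs. It treats logs as multimodal data, uses collaborative transformers and multi-head attention to learn interactions between modalities, and adds a modality adaptation layer to handle heterogeneity, improving anomaly detection.
Details
Motivation: Existing unimodal methods ignore the multimodal nature of log data, while multimodal methods fail to handle the interactions between modalities. A framework is needed that encodes multiple log modalities collaboratively and learns their interactions so that anomalies can be detected comprehensively.
Result: Across seven benchmark datasets for log-based anomaly detection, CoLog achieves a mean precision of 99.63%, a mean recall of 99.59%, and a mean F1 score of 99.61% in detecting point and collective anomalies, outperforming existing state-of-the-art methods.
Insight: The paper applies ideas from multimodal sentiment analysis to log anomaly detection. Collaborative transformers and a modality adaptation layer jointly learn the interactions and representations of log modalities, which addresses modality heterogeneity and gives an effective unified approach to detecting point and collective anomalies.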
Abstract: Log anomaly detection is crucial for preserving the security of operating systems. Depending on the source of log data collection, various information is recorded in logs that can be considered log modalities. In light of this intuition, unimodal methods often struggle by ignoring the different modalities of log data. Meanwhile, multimodal methods fail to handle the interactions between these modalities. Applying multimodal sentiment analysis to log anomaly detection, we propose CoLog, a framework that collaboratively encodes logs utilizing various modalities. CoLog utilizes collaborative transformers and multi-head impressed attention to learn interactions among several modalities, ensuring comprehensive anomaly detection. To handle the heterogeneity caused by these interactions, CoLog incorporates a modality adaptation layer, which adapts the representations from different log modalities. This methodology enables CoLog to learn nuanced patterns and dependencies within the data, enhancing its anomaly detection capabilities. Extensive experiments demonstrate CoLog’s superiority over existing state-of-the-art methods. Furthermore, in detecting both point and collective anomalies, CoLog achieves a mean precision of 99.63%, a mean recall of 99.59%, and a mean F1 score of 99.61% across seven benchmark datasets for log-based anomaly detection. The comprehensive detection capabilities of CoLog make it highly suitable for cybersecurity, system monitoring, and operational efficiency. CoLog represents a significant advancement in log anomaly detection, providing a sophisticated and effective solution to point and collective anomaly detection through a unified framework and a solution to the complex challenges automatic log data analysis poses. We also provide the implementation of CoLog at https://github.com/NasirzadehMoh/CoLog.
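The three reported means can be sanity-checked against the definition of the F1 score as the harmonic mean of precision and recall. This is only a consistency check: an F1 averaged over seven datasets need not equal the F1 of the averaged precision and recall, so agreement here is suggestive, not a reproduction of the paper's computation.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


mean_precision = 99.63   # percent, as reported in the abstract
mean_recall = 99.59      # percent, as reported in the abstract
implied_f1 = round(f1_score(mean_precision, mean_recall), 2)
```

The implied value rounds to 99.61, which matches the reported mean F1 score.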
econ.GN [Back]
[178] The Big Three in Marriage Talk: LLM-Assisted Analysis of Moral Ethics and Sentiment on Weibo and Xiaohongshu econ.GN | cs.CL PDF
Frank Tian-Fang Ye, Xiaozi Gao
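TL;DR: Using large language model (LLM)-assisted content analysis, this study analyzes 219,358 marriage-related posts from two major Chinese social media platforms (Sina Weibo and Xiaohongshu) to examine public attitudes toward marriage, covering both sentiment and the moral reasoning coded with Shweder’s Big Three moral ethics framework (Autonomy, Community, Divinity).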
Details
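Motivation: Marriage registrations in China have fallen sharply. Understanding public attitudes toward marriage requires examining not only sentiment but also the moral reasoning that underlies these evaluations.
Result: The study finds platform differences: Weibo discourse skews positive, while Xiaohongshu is predominantly neutral. Most posts lack explicit moral framing, but when moral ethics are invoked, sentiment is significantly associated with specific moral dimensions. Posts invoking Autonomy ethics and Community ethics are predominantly negative, whereas Divinity-framed posts tend toward neutral or positive sentiment.
Insight: The novelty lies in using LLMs for large-scale qualitative analysis and combining them with a moral ethics framework to examine marriage discourse on social media in depth. The findings suggest that concerns about constraints on personal autonomy and about communal obligations are key drivers of negative marriage attitudes in contemporary China, which offers insights for developing culturally informed policies to address marriage decline.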
Abstract: China’s marriage registrations have declined dramatically, dropping from 13.47 million couples in 2013 to 6.1 million in 2024. Understanding public attitudes toward marriage requires examining not only emotional sentiment but also the moral reasoning underlying these evaluations. This study analyzed 219,358 marriage-related posts from two major Chinese social media platforms (Sina Weibo and Xiaohongshu) using large language model (LLM)-assisted content analysis. Drawing on Shweder’s Big Three moral ethics framework, posts were coded for sentiment (positive, negative, neutral) and moral dimensions (Autonomy, Community, Divinity). Results revealed platform differences: Weibo discourse skewed positive, while Xiaohongshu was predominantly neutral. Most posts across both platforms lacked explicit moral framing. However, when moral ethics were invoked, significant associations with sentiment emerged. Posts invoking Autonomy ethics and Community ethics were predominantly negative, whereas Divinity-framed posts tended toward neutral or positive sentiment. These findings suggest that concerns about both personal autonomy constraints and communal obligations drive negative marriage attitudes in contemporary China. The study demonstrates LLMs’ utility for scaling qualitative analysis and offers insights for developing culturally informed policies addressing marriage decline in Chinese contexts.