Table of Contents
- cs.CL [Total: 18]
- cs.CV [Total: 114]
- q-bio.QM [Total: 1]
- cs.IR [Total: 2]
- cs.GR [Total: 1]
- eess.IV [Total: 2]
- eess.AS [Total: 2]
- cs.AI [Total: 3]
- eess.SP [Total: 1]
- cs.LG [Total: 4]
- cs.RO [Total: 3]
- cs.SE [Total: 1]
cs.CL
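[1] Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment cs.CL | cs.AI | cs.SD | eess.AS
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
TL;DR: This paper proposes a rubric-guided reasoning framework for multi-aspect, multi-rater automated assessment of second-language (L2) read speech. The method explicitly encodes the human scoring rubric for accuracy, fluency, and prosody, and calibrates model uncertainty to capture natural rater variability, improving the reliability and interpretability of SpeechLLM-based assessment.
Details
Motivation: SpeechLLMs struggle to align with the nuanced variability of human raters in L2 speech assessment; the goal is reliable and interpretable automated assessment.
Result: After fine-tuning Qwen2-Audio-7B-Instruct with uncertainty-calibrated regression supported by conformal calibration, the model achieves the strongest alignment with human ratings, outperforming regression and classification baselines. It assesses fluency and prosody reliably, while accuracy remains inherently difficult to assess.
Insight: The contribution is to encode the multi-aspect rubric explicitly in the reasoning framework and to quantify model confidence with Gaussian uncertainty modeling and conformal calibration, offering a trustworthy and interpretable route to SpeechLLM-based speech assessment.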
Abstract: Reliable and interpretable automated assessment of second-language (L2) speech remains a central challenge, as large speech-language models (SpeechLLMs) often struggle to align with the nuanced variability of human raters. To address this, we introduce a rubric-guided reasoning framework that explicitly encodes multi-aspect human assessment criteria: accuracy, fluency, and prosody, while calibrating model uncertainty to capture natural rating variability. We fine-tune the Qwen2-Audio-7B-Instruct model using multi-rater human judgments and develop an uncertainty-calibrated regression approach supported by conformal calibration for interpretable confidence intervals. Our Gaussian uncertainty modeling and conformal calibration approach achieves the strongest alignment with human ratings, outperforming regression and classification baselines. The model reliably assesses fluency and prosody while highlighting the inherent difficulty of assessing accuracy. Together, these results demonstrate that rubric-guided, uncertainty-calibrated reasoning offers a principled path toward trustworthy and explainable SpeechLLM-based speech assessment.
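To make the conformal calibration step concrete, here is a minimal sketch of split conformal calibration on top of a Gaussian regression head. It assumes the model emits a mean and a positive standard deviation per score and uses the normalized absolute residual as the conformity score; that score is a common textbook choice, not something the abstract confirms. All function names and the calibration data are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal quantile with the finite-sample (n + 1) correction."""
    n = len(scores)
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(scores)[rank - 1]

def calibrate(preds, sigmas, targets, alpha=0.1):
    """Find q such that pred +/- q * sigma covers roughly (1 - alpha) of held-out ratings.

    Assumes every sigma is strictly positive.
    """
    scores = [abs(y - mu) / s for mu, s, y in zip(preds, sigmas, targets)]
    return conformal_quantile(scores, alpha)

def interval(pred, sigma, q):
    """Calibrated interval around a single prediction."""
    return (pred - q * sigma, pred + q * sigma)

# Hypothetical held-out calibration set: predicted means, predicted stds, human ratings.
cal_preds = [3.0, 4.0, 2.5, 3.5, 4.5]
cal_sigmas = [0.5, 0.5, 1.0, 0.5, 1.0]
cal_targets = [3.5, 4.0, 3.5, 3.0, 4.0]

q = calibrate(cal_preds, cal_sigmas, cal_targets, alpha=0.2)
lo, hi = interval(3.2, 0.5, q)
```

Scaling the interval by the predicted standard deviation lets uncertain utterances receive wider intervals, which is one plausible way to reflect rater disagreement.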
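[2] Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts cs.CL | cs.AI
Lucas Bandarkar, Alan Ansell, Trevor Cohn
TL;DR: This paper analyzes shortcomings in cross-lingual knowledge transfer in large reasoning LLMs and identifies script differences as the main barrier. An analysis on ECLeKTic and MultiLoKo finds that script match, rather than language or language family, is the primary predictor of transfer failure. Supplying key entities in the source language and training on purpose-built SFT samples that improve reasoning both narrow the cross-script transfer gap.
Details
Motivation: Large reasoning LLMs lose performance in cross-lingual knowledge transfer when scripts differ; the paper explores how to improve a model’s handling of transliteration ambiguity at inference time.
Result: On ECLeKTic and MultiLoKo, regression analysis shows that script match is the primary predictor of knowledge transfer failure. Providing key entities in the source language disproportionately helps cross-script questions, and SFT on synthetic samples reduces the cross-script transfer gap for two models, indicating room for improvement during post-training.
Insight: The paper identifies script difference as the key barrier to cross-lingual knowledge transfer and proposes improving transfer by strengthening the model’s reasoning about transliteration ambiguity. This suggests a direction for LLM post-training, particularly in multilingual and cross-script settings.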
Abstract: In this work, we analyze shortcomings in cross-lingual knowledge transfer in large, modern reasoning LLMs. We demonstrate that the perceived gap in knowledge transfer is primarily a script barrier. First, we conduct an observational data analysis on the performance of thinking models on two datasets with local knowledge from around the world, ECLeKTic and MultiLoKo. Our regression analysis shows that script match - not language or family - is the primary predictor of knowledge transfer failure once model capability and question difficulty are accounted for. We further this finding by providing the LLMs with the key entities of the questions in their source language and find that this disproportionately improves cross-script questions. We then posit that these LLMs could be reasoning better at test-time. To evaluate this, we develop a synthetic generation pipeline to design SFT samples to encourage the model to better reason about transliteration ambiguities when trying to fetch parametric knowledge at inference-time. We show that teaching two models to reason better reduces the cross-script transfer gap. As a result, we conclude that there is potential to improve cross-lingual parametric knowledge transfer during post-training.
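[3] Ensemble Self-Training for Unsupervised Machine Translation cs.CL | cs.LG
Ido Aharon, Jonathan Shaki, Sarit Kraus
TL;DR: This paper proposes an ensemble-based self-training framework for unsupervised neural machine translation (UNMT). It trains several UNMT models that share the primary translation task but use different auxiliary languages, which induces structured diversity across models, then uses token-level ensemble decoding to generate pseudo-translations for further training each model. Translation quality improves while inference cost stays that of a single model.
Details
Motivation: Single-model performance in unsupervised machine translation is limited; ensembling multiple models and applying self-training aims to improve translation accuracy and robustness.
Result: On unsupervised machine translation benchmarks, the method yields statistically significant improvements over single-model baselines: a mean gain of 1.7 chrF when translating from English and 0.67 chrF when translating into English.
Insight: The contribution is to combine ensembling with self-training: structurally diverse models generate pseudo-translations that serve as shared supervision, and only the best single model (chosen by validation performance) is kept at deployment to control inference cost.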
Abstract: We present an ensemble-driven self-training framework for unsupervised neural machine translation (UNMT). Starting from a primary language pair, we train multiple UNMT models that share the same translation task but differ in an auxiliary language, inducing structured diversity across models. We then generate pseudo-translations for the primary pair using token-level ensemble decoding, averaging model predictions in both directions. These ensemble outputs are used as synthetic parallel data to further train each model, allowing the models to improve via shared supervision. At deployment time, we select a single model by validation performance, preserving single-model inference cost. Experiments show statistically significant improvements over single-model UNMT baselines, with mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.
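One decoding step of token-level ensemble decoding can be sketched as follows. The sketch assumes each model exposes a next-token probability distribution as a dictionary and that the ensemble averages probabilities uniformly; whether the authors average probabilities or log-probabilities is not stated in the abstract. The function names and toy distributions are hypothetical.

```python
def ensemble_step(distributions):
    """Average the next-token distributions of several models (uniform weights)."""
    vocab = set().union(*distributions)
    k = len(distributions)
    return {tok: sum(d.get(tok, 0.0) for d in distributions) / k for tok in vocab}

def greedy_pick(avg):
    """Pick the most probable token; sorting first makes tie-breaking deterministic."""
    return max(sorted(avg), key=avg.get)

# Toy next-token distributions from three models that differ in auxiliary language.
model_a = {"chat": 0.6, "chien": 0.3, "</s>": 0.1}
model_b = {"chat": 0.2, "chien": 0.7, "</s>": 0.1}
model_c = {"chat": 0.5, "chien": 0.4, "</s>": 0.1}

avg = ensemble_step([model_a, model_b, model_c])
next_token = greedy_pick(avg)
```

In this toy case two of the three models prefer "chat", yet the averaged distribution selects "chien" because one model is far more confident; averaging distributions is not the same as voting over each model’s top choice.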
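[4] Tabular LLMs for Interpretable Few-Shot Alzheimer’s Disease Prediction with Multimodal Biomedical Data cs.CL | cs.LG | q-bio.QM
Sophie Kearney, Shu Yang, Zixuan Wen, Weimin Lyu, Bojian Hou
TL;DR: This paper proposes TAP-GPT, a domain-adapted tabular LLM framework fine-tuned from TableGPT2 for Alzheimer’s disease (AD) prediction from small, incomplete, multimodal tabular biomarker data. The model is trained with tabular prompts rather than plain text, outperforms baselines at few-shot classification on four ADNI datasets, handles missing data without imputation, and generates interpretable reasoning.
Details
Motivation: Accurate AD diagnosis relies on tabular biomarker data, but such data are usually small and incomplete, so deep learning models often fail to outperform classical methods. Pretrained LLMs offer few-shot generalization, structured reasoning, and interpretable outputs, which suggests a new paradigm for clinical prediction.
Result: On binary AD classification across four ADNI-derived datasets (QT-PAD biomarkers and region-level structural MRI, amyloid PET, and tau PET), TAP-GPT outperforms its backbone models and traditional machine learning baselines in both multimodal and unimodal few-shot settings, and remains competitive with state-of-the-art general-purpose LLMs. Feature selection mitigates degradation on high-dimensional inputs, and performance stays stable under simulated and real-world missingness without imputation.
Insight: According to the authors, this is the first systematic application of a tabular-specialized LLM to multimodal biomarker-based AD prediction, fine-tuned with tabular prompts. The model produces structured, modality-aware reasoning consistent with AD biology and is more stable under self-reflection, which supports its use in iterative multi-agent clinical decision systems and lays groundwork for tabular-LLM-driven clinical tasks.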
Abstract: Accurate diagnosis of Alzheimer’s disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few-shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP-GPT (Tabular Alzheimer’s Prediction GPT), a domain-adapted tabular LLM framework built on TableGPT2 and fine-tuned for few-shot AD classification using tabular prompts rather than plain text. We evaluate TAP-GPT across four ADNI-derived datasets, including QT-PAD biomarkers and region-level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP-GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few-shot setting while remaining competitive with state-of-the-art general-purpose LLMs. We show that feature selection mitigates degradation in high-dimensional inputs and that TAP-GPT maintains stable performance under simulated and real-world missingness without imputation. Additionally, TAP-GPT produces structured, modality-aware reasoning aligned with established AD biology and shows greater stability under self-reflection, supporting its use in iterative multi-agent systems. To our knowledge, this is the first systematic application of a tabular-specialized LLM to multimodal biomarker-based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM-driven multi-agent clinical decision-support systems. The source code is publicly available on GitHub: https://github.com/sophie-kearney/TAP-GPT.
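[5] CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization cs.CL | cs.AR | cs.PL
Che-Ming Chang, Prashanth Vijayaraghavan, Ashutosh Jadhav, Charles Mackin, Vandana Mukherjee
TL;DR: CODMAS is a dialectic multi-agent collaborative framework for structured RTL optimization. It combines structured dialectic reasoning, domain-aware code generation, and deterministic evaluation to automate RTL optimization for power, performance, and area (PPA).
Details
Motivation: RTL code optimization in electronic design automation is insufficiently automated and traditionally depends on manual expert work; the framework aims for efficient, reliable automation through multi-agent collaboration.
Result: On RTLOPT, a benchmark of 120 Verilog triples, CODMAS achieves roughly a 25% reduction in critical path delay for pipelining and roughly a 22% power reduction for clock gating, with fewer functional and compilation failures than strong prompting and agentic baselines.
Insight: The contribution is structured dialectic reasoning: two dialectic agents (the Articulator and the Hypothesis Partner) collaborate to expose latent assumptions and guide optimization, and together with a domain-specific coding agent and a code evaluation agent they improve the reliability and scalability of automated optimization.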
Abstract: Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.
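[6] SYMDIREC: A Neuro-Symbolic Divide-Retrieve-Conquer Framework for Enhanced RTL Synthesis and Summarization cs.CL | cs.PL
Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Charles Mackin, Ashutosh Jadhav
TL;DR: SYMDIREC is a neuro-symbolic divide-retrieve-conquer framework for register-transfer level (RTL) synthesis and summarization in hardware design automation. It decomposes RTL tasks into symbolic subgoals, retrieves relevant code with a fine-tuned retriever, and assembles verified outputs through LLM reasoning, addressing rigid HDL syntax, limited supervision, and weak alignment with natural language.
Details
Motivation: LLMs struggle with RTL synthesis and summarization because of rigid hardware description language (HDL) syntax, limited supervision, and weak alignment with natural language, and existing prompting and retrieval-augmented generation (RAG) methods lack symbolic planning, which limits their structural precision.
Result: Without LLM fine-tuning and with support for both Verilog and VHDL, SYMDIREC achieves about 20% higher Pass@1 on synthesis and 15-20% ROUGE-L improvements on summarization over prompting and RAG baselines.
Insight: The contribution is the combination of neural and symbolic methods: symbolic decomposition into subgoals improves structural precision, and a fine-tuned retriever plus LLM reasoning assemble verifiable outputs. The divide-and-conquer strategy and multi-language support without LLM fine-tuning are the main takeaways.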
Abstract: Register-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15-20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.
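[7] Ruyi2.5 Technical Report cs.CL
Huan Song, Shuyu Tian, Qingfei Zhao, Wenhao Hong, Jiang Liu
TL;DR: This paper introduces Ruyi2.5, a multimodal familial model built on the AI Flow framework. It extends Ruyi2’s “Train Once, Deploy Many” paradigm to the multimodal domain, co-training models of different scales on a shared backbone within one pipeline to keep semantics consistent across deployment tiers. On top of it, the authors build Ruyi2.5-Camera, a privacy-preserving camera service with a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, and a cloud model performs deep behavior reasoning. The paper also proposes Binary Prefix Policy Optimization (BPPO) to accelerate reinforcement learning fine-tuning; it reduces sample redundancy through binary response selection and focuses gradient updates on response prefixes, giving a 2 to 3 times training speedup over GRPO.
Details
Motivation: Extend the efficient “Train Once, Deploy Many” paradigm from a single modality to multimodal models, address visual data handling in privacy-sensitive settings such as surveillance, and make reinforcement learning fine-tuning more efficient.
Result: On general multimodal benchmarks Ruyi2.5 matches Qwen3-VL; on privacy-constrained surveillance tasks Ruyi2.5-Camera substantially outperforms Qwen3-VL. BPPO trains 2 to 3 times faster than GRPO.
Insight: Contributions: (1) extending “Train Once, Deploy Many” to multimodal models, with semantic consistency across scales through a shared backbone and co-training; (2) a two-stage architecture for privacy-preserving cameras, with information-bottleneck-based de-identification at the edge and deep reasoning in the cloud; (3) BPPO, which accelerates reinforcement learning fine-tuning through binary response selection and prefix-focused updates.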
Abstract: We present Ruyi2.5, a multimodal familial model built on the AI Flow framework. Extending Ruyi2’s “Train Once, Deploy Many” paradigm to the multimodal domain, Ruyi2.5 constructs a shared-backbone architecture that co-trains models of varying scales within a single unified pipeline, ensuring semantic consistency across all deployment tiers. Built upon Ruyi2.5, the Ruyi2.5-Camera model is developed as a privacy-preserving camera service system with a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, while a cloud model performs deep behavior reasoning. To accelerate reinforcement learning fine-tuning, we further propose Binary Prefix Policy Optimization (BPPO), which reduces sample redundancy via binary response selection and focuses gradient updates on response prefixes, achieving a 2 to 3 times training speedup over GRPO. Experiments show Ruyi2.5 matches Qwen3-VL on the general multimodal benchmarks, while Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.
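[8] Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures cs.CL
Risham Sidhu, Julia Hockenmaier
TL;DR: This paper introduces GSU, a text-only grid dataset for evaluating the spatial reasoning of large language models on three core tasks: navigation, object localization, and structure composition. Most models grasp basic grid concepts but struggle with frames of reference relative to an embodied agent and with identifying 3D shapes from coordinate lists, and exposure to a visual modality does not yield generalizable 3D spatial understanding. Frontier models can solve the tasks, and fine-tuned small models show potential to match them.
Details
Motivation: Evaluate LLM spatial reasoning in a text-only setting that isolates reasoning from perception, targeting difficulties with embodied frames of reference and 3D shape recognition.
Result: Experiments on GSU show that the latest frontier models can solve the tasks, although harder variants may still stump them; fully fine-tuning a small LM or LoRA fine-tuning a small LLM shows potential to match frontier performance.
Insight: The contribution is a text-only grid dataset dedicated to spatial reasoning, which exposes LLM limitations in embodied frames of reference and 3D understanding and points to fine-tuning as a route to specialized embodied agents.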
Abstract: We introduce GSU, a text-only grid dataset to evaluate the spatial reasoning capabilities of LLMs over 3 core tasks: navigation, object localization, and structure composition. By forgoing visual inputs and thereby isolating spatial reasoning from perception, we show that while most models grasp basic grid concepts, they struggle with frames of reference relative to an embodied agent and with identifying 3D shapes from coordinate lists. We also find that exposure to a visual modality does not provide a generalizable understanding of 3D space that VLMs are able to utilize for these tasks. Finally, we show that while the very latest frontier models can solve the provided tasks (though harder variants may still stump them), fully fine-tuning a small LM or LoRA fine-tuning a small LLM shows potential to match frontier model performance, suggesting an avenue for specialized embodied agents.
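[9] SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems cs.CL
Rima Hazra, Bikram Ghuku, Ilona Marchenko, Yaroslava Tokarieva, Sayan Layek
TL;DR: This paper presents SafeTutors, a benchmark that jointly evaluates the pedagogical safety and effectiveness of AI tutoring systems in mathematics, physics, and chemistry. It finds that current LLMs commonly cause pedagogical harm when tutoring, such as over-disclosing answers, reinforcing misconceptions, and abandoning scaffolding; model scale does not reliably help, and multi-turn dialogue makes the problem worse.
Details
Motivation: Current evaluation paradigms assess problem-solving accuracy and generic safety in isolation and fail to capture whether a model is both pedagogically effective and safe during student-tutor interaction. The authors argue that tutoring safety differs fundamentally from conventional LLM safety: the main risk is the quiet erosion of learning, for example through answer over-disclosure.
Result: All models show broad harm; scale does not reliably help; multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%; and harms vary by subject, so mitigations must be discipline-aware.
Insight: The contribution is a theoretically grounded risk taxonomy of 11 harm dimensions and 48 sub-risks drawn from learning-science literature, together with a benchmark that jointly evaluates safety and pedagogy. It shows that single-turn “safe/helpful” results can mask systematic tutoring failure over extended interaction.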
Abstract: Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe across student-tutor interaction. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To systematically study this failure mode, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from learning-science literature. We uncover that all models show broad harm; scale doesn’t reliably help; and multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%. Harms also vary by subject, so mitigations must be discipline-aware, and single-turn “safe/helpful” results can mask systematic tutor failure over extended interaction.
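[10] Argument Reconstruction as Supervision for Critical Thinking in LLMs cs.CL
Hyun Ryu, Gyouk Chu, Gregor Betz, Eunho Yang, Carolyn Rose
TL;DR: This paper proposes a holistic framework for strengthening the critical thinking of large language models (LLMs) through argument reconstruction. It comprises an engine that automatically reconstructs arbitrary arguments (GAAR), a high-quality argument reconstruction dataset synthesized with it (Arguinas), and a study of how learning argument reconstruction affects downstream critical thinking tasks. Across seven critical thinking tasks, models trained on argument reconstruction outperform models that are not, with the largest gains when training on Arguinas.
Details
Motivation: Humans train critical thinking by identifying, reconstructing, and evaluating arguments, and reconstruction is especially important because it makes an argument’s underlying inferences explicit. Whether LLMs can similarly improve their critical thinking by learning to reconstruct arguments is unclear; the paper investigates this question.
Result: Across seven downstream critical thinking tasks, models trained on argument reconstruction outperform models without such training, with the largest gains when training on the proposed Arguinas dataset.
Insight: The core idea is to use argument reconstruction as a supervision signal for critical thinking: (1) GAAR, an automatic engine that reconstructs arbitrary arguments; (2) Arguinas, a high-quality dataset synthesized with that engine; (3) a systematic test of whether learning argument reconstruction itself helps across several critical thinking tasks, which yields a new training approach and data resource for improving model reasoning.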
Abstract: To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument’s underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset. The source code and dataset will be publicly available.
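[11] TRiMS: Real-Time Tracking of Minimal Sufficient Length for Efficient Reasoning via RL cs.CL
Tingcheng Bian, Jinchang Luo, Mingquan Cheng, Jinyu Zhang, Xiaoling Xia
TL;DR: This paper proposes TRiMS, which uses reinforcement learning to track the Minimal Sufficient Length (MSL) in real time and so improve the computational efficiency of large language models on complex reasoning. Training combines the GRPO algorithm with MSL-based estimation and is stabilized by dynamic batch aggregation and advantage computation with batch-level standard deviation, which substantially reduces reasoning-chain tokens while maintaining or slightly improving accuracy.
Details
Motivation: LLMs achieve breakthroughs in complex reasoning through long chain-of-thought sequences, but this often causes severe reasoning inflation and computational redundancy. To maximize intelligence per token, one needs the shortest reasoning length that preserves answer correctness, the MSL, and an efficient way to approach it.
Result: TRiMS reduces chain-of-thought tokens by over 80% across all benchmarks with a minor accuracy gain.
Insight: Contributions: a theoretical definition of MSL as the first measurable lower bound for reasoning-chain compression; an analysis of the structural factors in mainstream CoT compression strategies that let a model approach MSL; and the TRiMS training framework, which combines GRPO with MSL estimation and mitigates training instability through dynamic batch aggregation and batch-level-standard-deviation advantage computation.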
Abstract: Large language models achieve breakthroughs in complex reasoning via long chain-of-thought sequences. However, this often leads to severe reasoning inflation, causing substantial computational redundancy. To maximize Intelligence per Token, we introduce a theoretical metric, MSL (Minimal Sufficient Length). MSL rigorously characterizes the shortest reasoning length that preserves answer correctness. We provide a recursive definition based on independently sampled sequences and prove the existence of its limit, establishing the first measurable lower bound for reasoning-chain compression. Building on an analysis of mainstream CoT compression strategies, we identify key structural factors enabling a model to approach MSL. Based on these insights, we propose TRiMS, which employs the GRPO algorithm in conjunction with MSL-based estimation during training, while mitigating instabilities during the training process through dynamic batch aggregation and advantage computation using batch-level standard deviation. TRiMS achieves over 80% CoT token reduction with a minor accuracy boost across all benchmarks.
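The difference between group-level and batch-level normalization of advantages can be illustrated with a small sketch. The first function follows the usual GRPO recipe of normalizing each reward by its own group’s mean and standard deviation; the second keeps the group mean as the baseline but divides by the standard deviation of all rewards in the batch. The exact formula TRiMS uses is not given in the abstract, so the second function is an assumption, and all names and reward values are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(groups, eps=1e-6):
    """GRPO-style: normalize each reward by its own group's mean and std."""
    return [[(r - mean(g)) / (pstdev(g) + eps) for r in g] for g in groups]

def batch_std_advantages(groups, eps=1e-6):
    """Assumed variant: center by group mean, scale by the std over the whole batch."""
    all_rewards = [r for g in groups for r in g]
    scale = pstdev(all_rewards) + eps
    return [[(r - mean(g)) / scale for r in g] for g in groups]

# Two prompts with four sampled responses each. The first group is nearly
# uniform, so its per-group std is tiny and small reward gaps get amplified.
groups = [[1.0, 1.0, 1.0, 0.9], [0.0, 1.0, 0.0, 1.0]]

per_group = group_advantages(groups)
per_batch = batch_std_advantages(groups)
```

With these toy rewards, the response scoring 0.9 gets a per-group advantage of about -1.73 but a batch-scaled advantage of about -0.18, which shows how a batch-level scale can damp updates from near-uniform groups.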
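[12] Learning When to Attend: Conditional Memory Access for Long-Context LLMs cs.CL | cs.LG
Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager
TL;DR: This paper proposes L2A (Learning To Attend), a layer that addresses the high cost of attention in long-context LLMs. L2A lets the model learn when global attention is needed, giving token-wise conditional access to long-range memory; it cuts computation substantially while extending effective context length from 32K to 128K tokens.
Details
Motivation: Language models generalize poorly beyond their pretraining context length, and the quadratic cost of standard attention makes long-context training expensive. The authors observe that most tokens do not need global attention and can rely on local context.
Result: On Qwen 2.5 and Qwen 3 models, L2A extends effective context length to 128K tokens while staying within 3% of standard long-context training and skipping global attention for about 80% of tokens, outperforming prior baselines. Custom Triton kernels improve training throughput and time-to-first-token by up to about 2x over FlashAttention. L2A also enables post-training pruning of highly sparse global attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
Insight: The core idea is to turn global attention from a fixed pattern into a learned, per-token conditional decision. This offers a new approach to efficient long-context modeling that balances quality and efficiency through selective sparsity, and the kernel-level optimizations and memory savings have practical engineering value.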
Abstract: Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to $\sim$2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
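[13] Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality cs.CL
Mengyu Bu, Yang Feng
TL;DR: This paper proposes XBridge, an architecture that composes pretrained encoder-decoder translation models with a large language model (LLM). Multilingual understanding and generation are offloaded to the translation models while the LLM remains an English-centric core for knowledge processing, improving multilingual performance on low-resource and unseen languages.
Details
Motivation: LLM multilingual performance is imbalanced and unreliable on low-resource or unseen languages, whereas pretrained translation models have balanced multilingual capability and are a natural complement.
Result: Experiments with four LLMs on multilingual understanding, reasoning, summarization, and generation show that XBridge outperforms strong baselines, especially on low-resource and unseen languages, without retraining the LLM.
Insight: Contributions include the compositional architecture, lightweight cross-model mapping layers, and an optimal-transport-based alignment objective that enables fine-grained semantic consistency, giving a modular way to extend an LLM’s multilingual capability.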
Abstract: Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, enabling fine-grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.
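[14] VeriAgent: A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation cs.CL | cs.PL
Yaoxiang Wang, Qi Shi, ShangZhan Li, Qingguo Hu, Xinyu Yin
TL;DR: This paper proposes VeriAgent, a multi-agent system integrated with EDA tools that generates high-quality Verilog RTL code with attention to both functional correctness and physical design metrics (power, performance, and area, or PPA). A closed-loop workflow of three agents (Programmer, Correctness, and PPA) combined with an evolving memory mechanism turns RTL generation from one-shot reasoning into a continual, feedback-driven optimization process.
Details
Motivation: Existing LLM-based RTL generation methods focus mainly on functional correctness and overlook key physical design objectives (PPA), which limits their usefulness in real hardware design flows.
Result: Extensive experiments show significant PPA improvements while maintaining strong functional correctness.
Insight: The main contributions are the explicit integration of EDA tool feedback into a closed-loop multi-agent workflow and an evolvable structured memory that externalizes optimization experience, supporting continual strategy refinement without model retraining and a scalable path to deploying LLMs in real hardware design flows.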
Abstract: LLMs have recently demonstrated strong capabilities in automatic RTL code generation, achieving high syntactic and functional correctness. However, most methods focus on functional correctness while overlooking critical physical design objectives, including Power, Performance, and Area. In this work, we propose a PPA-aware, tool-integrated multi-agent framework for high-quality Verilog code generation. Our framework explicitly incorporates EDA tools into a closed-loop workflow composed of a Programmer Agent, a Correctness Agent, and a PPA Agent, enabling joint optimization of functional correctness and physical metrics. To support continuous improvement without model retraining, we introduce an Evolved Memory Mechanism that externalizes optimization experience into structured memory nodes. A dedicated memory manager dynamically maintains the memory pool and allows the system to refine strategies based on historical execution trajectories. Extensive experiments demonstrate that our approach achieves strong functional correctness while delivering significant improvements in PPA metrics. By integrating tool-driven feedback with structured and evolvable memory, our framework transforms RTL generation from one-shot reasoning into a continual, feedback-driven optimization process, providing a scalable pathway for deploying LLMs in real-world hardware design flows.
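[15] Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor cs.CL | cs.AI
Ahmed Sharshar, Hosam Elgendy, Saad El Dine Ahmed, Yasser Rohaim, Yuxia Wang
TL;DR: This paper presents “Harm or Humor”, a multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. It contains 3,000 manually curated texts and 6,000 images in English and Arabic, plus 1,200 videos covering English, Arabic, and language-independent content, and strictly separates safe jokes from explicitly harmful and implicitly harmful ones in order to test whether models understand subtle cultural cues that require deep reasoning.
Details
Motivation: Current static benchmarks fail to capture the safety challenges of dark humor, which relies on subtle cultural context and implicit cues, so a new benchmark is needed to evaluate how well models detect and understand harmful humor.
Result: The paper systematically evaluates state-of-the-art (SOTA) open and closed-source models across all modalities. Closed-source models significantly outperform open-source ones, and both show a notable performance gap between English and Arabic.
Insight: The contribution is a multimodal, multilingual benchmark dedicated to harmful humor, especially the implicit kind, which underscores the need for culturally grounded, reasoning-aware safety alignment. It provides a dataset and methodology for evaluating model safety and cultural adaptability on complex, context-sensitive content.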
Abstract: Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing Safe jokes from Harmful ones, with the latter further classified into Explicit (overt) and Implicit (covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.
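[16] CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution cs.CL | cs.AI | cs.LG
Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Gaiyang Han
TL;DR: This paper proposes CoVerRL, a framework in which a generator and a verifier co-evolve to escape the “consensus trap” that label-free reinforcement learning falls into when it maximizes self-consistency. It outperforms label-free baselines on mathematical reasoning benchmarks.
Details
Motivation: Maximizing self-consistency in label-free reinforcement learning collapses output diversity, so the model confidently reinforces systematic errors; the authors call this the “consensus trap”.
Result: Experiments on the Qwen and Llama model families show that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks, and self-verification accuracy rises from around 55% to over 85%.
Insight: A single model alternates between generator and verifier roles: majority voting supplies supervision for the verifier, and the improving verifier progressively filters self-consistent errors out of the pseudo-labels, creating a virtuous cycle of co-evolution.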
Abstract: Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.
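The interplay between majority-voted pseudo-labels and a verifier can be sketched in a few lines. The sketch assumes the verifier returns a score in [0, 1] for a candidate answer and that a pseudo-label is kept only when that score clears a threshold; the actual filtering rule in CoVerRL is not specified in the abstract. `toy_verifier`, the threshold, and the sample answers are hypothetical stand-ins.

```python
from collections import Counter

def majority_label(answers):
    """Return (answer, vote_share) for the most common sampled answer."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

def filtered_pseudo_label(answers, verifier, threshold=0.5):
    """Keep the majority answer as a pseudo-label only if the verifier accepts it."""
    answer, _share = majority_label(answers)
    if verifier(answer) >= threshold:
        return answer
    return None  # Drop the sample instead of reinforcing a suspected error.

def toy_verifier(answer):
    """Hypothetical stand-in for the model acting in its verifier role."""
    return 0.9 if answer == "42" else 0.1

samples_ok = ["42", "42", "41", "42"]    # majority is correct
samples_trap = ["7", "7", "7", "42"]     # confident but wrong consensus

kept = filtered_pseudo_label(samples_ok, toy_verifier)
dropped = filtered_pseudo_label(samples_trap, toy_verifier)
```

The second case is the consensus trap in miniature: a 75% vote share would normally produce a pseudo-label, and only the verifier prevents the wrong answer from being rewarded.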
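[17] Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain cs.CL
Corentin Royer, Debarun Bhattacharjya, Gaetano Rossiello, Andrea Giovannini, Mennatallah El-Assady
TL;DR: This paper proposes an information-theoretic method for automatically generating process supervision labels to improve chain-of-thought reasoning in large language models (LLMs). It assesses step quality by estimating how each reasoning step affects the likelihood of the correct answer, and reduces computational complexity to O(N) from the O(N log N) of previous methods. Experiments show the labels support effective chain-of-thought selection on reasoning benchmarks covering mathematics, Python programming, SQL, and scientific question answering.
Details
Motivation: Chain-of-thought reasoning improves LLM capability but increases the risk of errors propagating through intermediate steps. Existing methods for training process reward models rely on costly human annotation or computationally intensive automatic labeling, so a scalable and efficient automatic supervision method is needed.
Result: On diverse reasoning benchmarks (mathematics, Python programming, SQL, and scientific question answering), the generated labels enable effective chain-of-thought selection in best-of-K evaluation.
Insight: The contribution is to use information theory (Monte Carlo net information gain) to score reasoning steps automatically, providing a fine-grained supervision signal and reducing computational complexity from O(N log N) to O(N) for more efficient, scalable process supervision.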
Abstract: Multi-step reasoning improves the capabilities of large language models (LLMs) but increases the risk of errors propagating through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, enabling fine-grained supervision and improved reliability. Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling. We propose a novel approach to automatically generate step-level labels using Information Theory. Our method estimates how each reasoning step affects the likelihood of the correct answer, providing a signal of step quality. Importantly, it reduces computational complexity to $\mathcal{O}(N)$, improving over the previous $\mathcal{O}(N \log N)$ methods. We demonstrate that these labels enable effective chain-of-thought selection in best-of-$K$ evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering. This work enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical.
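A rough sketch of step-level labeling from answer likelihoods follows. It assumes that the log-likelihood of the correct answer has already been estimated after each reasoning prefix (the Monte Carlo estimation itself is omitted) and labels a step positive when it raises that log-likelihood. This needs one estimate per prefix, which is linear in the number of steps; whether it matches the paper’s exact estimator is an assumption, and the function names and probabilities are hypothetical.

```python
import math

def step_gains(logprobs):
    """Per-step change in log p(correct answer | question, first i steps).

    logprobs[0] is the value before any step; logprobs[i] is after step i.
    """
    return [logprobs[i] - logprobs[i - 1] for i in range(1, len(logprobs))]

def step_labels(logprobs, margin=0.0):
    """Label a step 1 if it increases the answer log-likelihood by more than margin."""
    return [1 if g > margin else 0 for g in step_gains(logprobs)]

# Hypothetical estimates of p(correct answer) before any step and after steps 1..3.
probs = [0.10, 0.30, 0.25, 0.80]
logprobs = [math.log(p) for p in probs]

gains = step_gains(logprobs)
labels = step_labels(logprobs)
```

The gains telescope: their sum equals the total change in log-likelihood from the bare question to the full chain, so each step’s label reflects its own share of the overall progress.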
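[18] Text-to-Stage: Spatial Layouts from Long-form Narratives cs.CL | cs.AI | cs.LG
Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock
TL;DR: This paper proposes the Text-to-Stage task: inferring stage-play layouts (scenes, speaker positions, movements, and room types) from long-form narrative text that lacks explicit spatial cues, as a probe of language models’ spatial reasoning. The authors introduce a dramaturgy-inspired deterministic evaluation suite and a training and inference recipe that combines rejection SFT (via Best-of-N sampling) with reinforcement learning from verifiable rewards (via GRPO). On a text-only corpus of classical English literature, the method beats base models on metrics including character attribution, spatial plausibility, and movement economy, and agrees with LLM-as-a-judge evaluation and subjective human preferences.
Details
Motivation: Probe whether language models can reason spatially from unstructured text, mimicking a human ability and automating a process useful to downstream media applications; concretely, the narrative-to-play task of inferring stage-play layouts from text without explicit spatial, positional, or relational cues.
Result: On a corpus of classical English literature, the method improves over vanilla models on several metrics (character attribution, spatial plausibility, and movement economy) and aligns with LLM-as-a-judge evaluation and subjective human preferences.
Insight: Contributions: the new narrative-to-stage-layout task (Text-to-Stage), a dramaturgy-inspired deterministic evaluation suite, and a training and inference recipe combining rejection SFT (Best-of-N sampling) with reinforcement learning from verifiable rewards (GRPO), which improves performance on complex spatial reasoning.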
Abstract: In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications. Concretely, we study the narrative-to-play task: inferring stage-play layouts (scenes, speaker positions, movements, and room types) from text that lacks explicit spatial, positional, or relational cues. We then introduce a dramaturgy-inspired deterministic evaluation suite and, finally, a training and inference recipe that combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO. Experiments on a text-only corpus of classical English literature demonstrate improvements over vanilla models across multiple metrics (character attribution, spatial plausibility, and movement economy), as well as alignment with an LLM-as-a-judge and subjective human preferences.
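The two ingredients of the recipe, Best-of-N rejection sampling and group-relative rewards, can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: `generate` and `score` are hypothetical callables standing in for a layout-generating model and a deterministic verifier, `min_score` is an assumed acceptance threshold, and the advantage normalization follows the commonly described GRPO form (reward minus group mean, divided by group standard deviation).

```python
import statistics
from typing import Callable, List, Optional, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str, int], str],
              score: Callable[[str], float],
              n: int = 4,
              min_score: float = 0.5) -> Optional[Tuple[str, float]]:
    """Sample n candidates and keep the best one if it passes the threshold.

    Returns (candidate, score), or None when every candidate is rejected.
    Accepted pairs could then serve as supervised fine-tuning data.
    """
    scored = [(cand, score(cand)) for cand in (generate(prompt, i) for i in range(n))]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= min_score else None


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A deterministic verifier is what makes both steps cheap here: the same scoring function can filter fine-tuning samples and supply the reward used for the group-relative advantages.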
cs.CV
[19] Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards cs.CV | cs.AI | cs.LG
Kaito Baba, Satoshi Kodera
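TL;DR: This paper proposes MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation. It coordinates region-specific agents with a global integrating agent and optimizes them via clinically verifiable rewards, with the aim of mimicking a radiologist's workflow.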
Details
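Motivation: Existing radiology report generation methods typically use single-model reinforcement learning or post-hoc agentization of independently trained models, neither of which jointly optimizes the whole agent system.
Result: Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy metrics such as RadGraph, CheXbert, and GREEN, and achieves state-of-the-art clinical efficacy performance. Further analysis confirms improved laterality consistency and more accurate, detail-rich reports.
Insight: The novelty is a framework that jointly trains multiple agents and optimizes the whole system through reinforcement learning, rather than combining independently trained agents. Its reward is built on clinically verifiable metrics, which directly optimizes report quality and mimics a radiologist's diagnostic workflow.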
Abstract: We propose MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents and a global integrating agent, optimized via clinically verifiable rewards. Unlike prior single-model reinforcement learning or post-hoc agentization of independently trained models, our method jointly trains multiple agents and optimizes the entire agent system through reinforcement learning. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy (CE) metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art CE performance. Further analyses confirm that MARL-Rad enhances laterality consistency and produces more accurate, detail-informed reports.
[20] Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks cs.CV | eess.IV
Yunting Xu, Jiacheng Wang, Ruichen Zhang, Changyuan Zhao, Yinqiu Liu
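TL;DR: This paper proposes Base-Station-Helped UAV (BHU), a communication-efficient cooperative perception framework that addresses the communication latency and resource-efficiency challenges caused by transmitting massive visual data in multi-UAV cooperative perception over low-altitude wireless networks. The framework sparsifies transmitted images with a Top-K selection mechanism, sends them to a ground server via MU-MIMO, and uses a Swin-large-based MaskDINO encoder for BEV feature extraction and fusion. A diffusion model-based deep reinforcement learning algorithm jointly optimizes cooperative UAV selection, sparsification ratios, and precoding matrices to balance communication efficiency and perception utility.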
Details
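Motivation: Address the high communication latency and low resource efficiency caused by the large volume of multi-view visual data generated in multi-UAV cooperative perception.
Result: Simulations on the Air-Co-Pred dataset show that, compared with traditional CNN-based BEV fusion baselines, BHU improves perception performance by over 5% while reducing communication overhead by 85%.
Insight: The novelty lies in using Top-K pixel selection for sparsified visual transmission, combined with diffusion model-based DRL for cross-layer joint optimization (perception, communication, and resource allocation), which achieves an effective trade-off between communication efficiency and perception performance in resource-constrained wireless environments. Viewed objectively, its system-level integration of large vision models (Swin-large, MaskDINO) with wireless communication (MU-MIMO) and reinforcement learning offers a feasible software-hardware co-design approach for multi-agent cooperative perception.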
Abstract: Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird’s-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.
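Top-K sparsification of an image can be sketched in a few lines of Python. The sketch assumes a per-pixel importance map `scores` is already available; how the paper measures pixel informativeness is not stated in the abstract, so the scoring input, the function names, and the zero fill value are all illustrative assumptions.

```python
from typing import List


def top_k_mask(scores: List[List[float]], keep_ratio: float) -> List[List[bool]]:
    """Boolean mask that keeps the highest-scoring fraction of pixels."""
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    k = max(1, round(keep_ratio * len(flat)))  # always keep at least one pixel
    flat.sort(key=lambda item: item[0], reverse=True)
    keep = {(r, c) for _, r, c in flat[:k]}
    return [[(r, c) in keep for c in range(len(row))] for r, row in enumerate(scores)]


def sparsify(image: List[List[int]], mask: List[List[bool]], fill: int = 0) -> List[List[int]]:
    """Replace unselected pixels with a fill value before transmission."""
    return [[px if m else fill for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


# Toy 2x2 example: keep the top half of the pixels by score.
toy_scores = [[0.1, 0.9], [0.5, 0.3]]
toy_image = [[10, 20], [30, 40]]
toy_sparse = sparsify(toy_image, top_k_mask(toy_scores, keep_ratio=0.5))
```

The `keep_ratio` parameter plays the role of a sparsification ratio: a smaller value sends fewer pixels at the cost of discarding more of the scene.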
[21] Facial beauty prediction fusing transfer learning and broad learning system cs.CV | cs.AI
Junying Gan, Xiaoshan Xie, Yikui Zhai, Guohui He, Chaoyun Mai
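TL;DR: This paper proposes a facial beauty prediction (FBP) method that fuses transfer learning with the broad learning system (BLS), targeting scarce data, overfitting, and the difficulty of quickly building robust models. The method first builds a feature extractor from EfficientNets with transfer learning (E-BLS), then adds a connection layer to further improve the fusion of features with BLS (ER-BLS).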
Details
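Motivation: FBP faces small datasets, a tendency to overfit, variable facial appearance, and the complexity of human perception, all of which make it hard to quickly build robust and effective evaluation models. Transfer learning reduces dependence on large amounts of data and helps avoid overfitting, while BLS allows fast model construction and training, so the two are fused to improve FBP performance.
Result: Experiments show that E-BLS and ER-BLS improve FBP accuracy over existing BLS and CNN methods, demonstrating the effectiveness and superiority of the approach.
Insight: The novelty lies in combining transfer learning (specifically EfficientNets) with BLS, using transfer learning to extract robust features and BLS for fast modeling. A connection layer (ER-BLS) further refines the feature fusion and improves performance. The method may extend to pattern recognition, object detection, and image classification.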
Abstract: Facial beauty prediction (FBP) is an important and challenging problem in computer vision and machine learning. It is prone to overfitting because large-scale, effective data are lacking, and robust and effective facial beauty evaluation models are difficult to build quickly because of the variability of facial appearance and the complexity of human perception. Transfer learning can reduce the dependence on large amounts of data and help avoid overfitting. The broad learning system (BLS) can complete model building and training quickly. For this reason, this paper fuses transfer learning with BLS for FBP. First, a feature extractor for facial features is constructed from CNN models based on transfer learning (EfficientNets in this paper), and the fused facial beauty features it extracts are passed to BLS for FBP; this method is called E-BLS. Second, building on E-BLS, a connection layer is designed to connect the feature extractor and BLS; this method is called ER-BLS. Finally, experimental results show that, compared with previous BLS and CNN methods, E-BLS and ER-BLS improve FBP accuracy, demonstrating the effectiveness and superiority of the proposed method, which can also be widely used in pattern recognition, object detection, and image classification.
[22] Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation cs.CV | cs.AI
Rena Suzuki, Masato Kikuchi, Tadachika Ozono
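TL;DR: This paper proposes and formalizes the Script-to-Slide Grounding task, which automatically links script sentences to their corresponding slide objects as a foundational step toward automatic instructional video generation. It also proposes Text-S2SG, an LLM-based method for grounding text objects, which performs well in experiments.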
Details
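Motivation: Automate the highly labor-intensive editing process in slide-based instructional video production, in which visual effects are applied to tie spoken content to slide objects.
Result: On the text-object grounding task, the proposed Text-S2SG method achieves a high F1-score of 0.924.
Insight: Formalizing the previously implicit slide-based video editing process as a computable task paves the way for automatic instructional video generation. Using an LLM to handle the fine-grained association between script sentences and slide text objects is an effective initial solution.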
Abstract: While slide-based videos augmented with visual effects are widely utilized in education and research presentations, the video editing process, particularly applying visual effects to ground spoken content to slide objects, remains highly labor-intensive. This study aims to develop a system that automatically generates such instructional videos from slides and corresponding scripts. As a foundational step, this paper proposes and formulates Script-to-Slide Grounding (S2SG), defined as the task of grounding script sentences to their corresponding slide objects. Furthermore, as an initial step, we propose "Text-S2SG," a method that utilizes a large language model (LLM) to perform this grounding task for text objects. Our experiments demonstrate that the proposed method achieves high performance (F1-score: 0.924). The contribution of this work is the formalization of a previously implicit slide-based video editing process into a computable task, thereby paving the way for its automation.
[23] Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs cs.CV | cs.AI
Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin
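TL;DR: This paper proposes AwaRes, a framework that combines a low-resolution global view with tool calls that retrieve high-resolution local regions, addressing the trade-off between accuracy and computational efficiency in vision-language models. It is trained on automatically generated supervision data.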
Details
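Motivation: Vision-language models face an accuracy-efficiency trade-off on high-resolution images: high-resolution inputs capture fine details but are computationally expensive, while low-resolution inputs are efficient but may lose critical information such as small text.
Result: Supervision data are constructed automatically (labels for whether cropping is needed, plus evidence localization), and the model is trained with cold-start SFT followed by multi-turn GRPO under a composite reward that combines semantic answer correctness with crop-cost penalties, yielding efficient retrieval.
Insight: The novelty is a spatial-on-demand framework that dynamically retrieves high-resolution crops through tool calling and is trained on automatically generated supervision data, including multi-turn tool-use trajectories, to balance accuracy and efficiency.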
Abstract: Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs are efficient but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
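A composite reward of the kind described, answer correctness minus a crop-cost penalty, might look like the Python sketch below. The abstract does not give the formula, so the linear form, the use of crop area fractions as the cost, and the `cost_weight` value are assumptions made for illustration only.

```python
from typing import Sequence


def composite_reward(answer_score: float,
                     crop_area_fractions: Sequence[float],
                     cost_weight: float = 0.2) -> float:
    """Reward = semantic correctness minus a penalty for requested crops.

    answer_score: correctness in [0, 1], e.g. from a judge (assumed).
    crop_area_fractions: area of each requested high-resolution crop as a
        fraction of the full image (assumed cost measure).
    cost_weight: hypothetical trade-off coefficient.
    """
    if not 0.0 <= answer_score <= 1.0:
        raise ValueError("answer_score must lie in [0, 1]")
    crop_cost = sum(crop_area_fractions)
    return answer_score - cost_weight * crop_cost


# A correct answer without crops scores higher than one that needed two crops.
reward_no_crop = composite_reward(1.0, [])
reward_two_crops = composite_reward(1.0, [0.25, 0.25], cost_weight=0.5)
```

Under a reward of this shape, the policy is pushed to request crops only when the low-resolution view is not enough to answer correctly.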
[24] AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding cs.CV | cs.AI
Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed
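TL;DR: This paper presents AgriChat, a multimodal large language model specialized for agricultural image understanding. Through a novel Vision-to-Verified-Knowledge (V2VK) pipeline, it automatically generates AgriMM, a large-scale agricultural multimodal benchmark grounded in verified knowledge, to address the lack of high-quality data and of reliable domain expertise in models for agriculture.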
Details
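Motivation: Address the bottlenecks in deploying multimodal large language models in agriculture: existing work lacks large-scale agricultural datasets for model development and evaluation, and current SOTA models lack the verified domain expertise needed to reason reliably across diverse taxonomies.
Result: Across diverse tasks, datasets, and evaluation conditions, AgriChat outperforms other open-source models on both internal and external benchmarks, validating the approach.
Insight: The core innovation is the V2VK pipeline, which combines visual captioning with web-based scientific retrieval to automatically generate training data grounded in verified phytopathological literature, effectively eliminating biological hallucinations and offering a new route to reliable, trustworthy agricultural AI. The work also builds AgriMM, a large-scale agricultural benchmark with more than 3,000 classes and 607k VQAs.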
Abstract: The deployment of Multimodal Large Language Models (MLLMs) in agriculture is currently stalled by a critical trade-off: the existing literature lacks the large-scale agricultural datasets required for robust model development and evaluation, while current state-of-the-art models lack the verified domain expertise necessary to reason across diverse taxonomies. To address these challenges, we propose the Vision-to-Verified-Knowledge (V2VK) pipeline, a novel generative AI-driven annotation framework that integrates visual captioning with web-augmented scientific retrieval to autonomously generate the AgriMM benchmark, effectively eliminating biological hallucinations by grounding training data in verified phytopathological literature. The AgriMM benchmark contains over 3,000 agricultural classes and more than 607k VQAs spanning multiple tasks, including fine-grained plant species identification, plant disease symptom recognition, crop counting, and ripeness assessment. Leveraging this verifiable data, we present AgriChat, a specialized MLLM that presents broad knowledge across thousands of agricultural classes and provides detailed agricultural assessments with extensive explanations. Extensive evaluation across diverse tasks, datasets, and evaluation conditions reveals both the capabilities and limitations of current agricultural MLLMs, while demonstrating AgriChat’s superior performance over other open-source models, including internal and external benchmarks. The results validate that preserving visual detail combined with web-verified knowledge constitutes a reliable pathway toward robust and trustworthy agricultural AI. The code and dataset are publicly available at https://github.com/boudiafA/AgriChat .
[25] GenLie: A Global-Enhanced Lie Detection Network under Sparsity and Semantic Interference cs.CV | cs.AI
Zongshun Zhang, Yao Liu, Qiao Liu, Xuefeng Peng, Peiyuan Jiang
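TL;DR: This paper proposes GenLie, a Global-Enhanced Lie Detection Network for identifying deceptive behavior in video. It performs local feature modeling under global supervision to capture sparse, subtle deceptive cues while suppressing identity-related noise, and so learns more discriminative representations. Experiments on three public datasets show that GenLie outperforms existing state-of-the-art methods in both high- and low-stakes scenarios.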
Details
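Motivation: Address the core challenge in video-based lie detection: discriminative representations are hard to learn because deceptive signals are sparse, short-lived, and easily overwhelmed by redundant information, while individual and contextual variations introduce identity-related noise.
Result: On three public datasets covering high- and low-stakes scenarios, GenLie consistently outperforms state-of-the-art methods.
Insight: The novelty is a framework for local feature modeling under global supervision, in which local capture of sparse deceptive cues works together with global suppression of identity noise to make representations more robust and discriminative. Viewed objectively, this global-local co-optimization strategy offers a useful reference for handling noise interference in sparse-signal learning.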
Abstract: Video-based lie detection aims to identify deceptive behaviors from visual cues. Despite recent progress, its core challenge lies in learning sparse yet discriminative representations. Deceptive signals are typically subtle and short-lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity-related noise. To address this issue, we propose GenLie, a Global-Enhanced Lie Detection Network that performs local feature modeling under global supervision. Specifically, sparse and subtle deceptive cues are captured at the local level, while global supervision and optimization ensure robust and discriminative representations by suppressing identity-related noise. Experiments on three public datasets, covering both high- and low-stakes scenarios, show that GenLie consistently outperforms state-of-the-art methods. Source code is available at https://github.com/AliasDictusZ1/GenLie.
[26] TDMM-LM: Bridging Facial Understanding and Animation via Language Models cs.CV | cs.AI
Luchuan Song, Pinxin Liu, Haiyang Liu, Zhenchao Jin, Yolo Yunlong Tang
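TL;DR: This paper proposes TDMM-LM, a framework that bridges facial understanding and animation through language models. To address the lack of high-quality text-motion paired data for facial animation, the authors use foundation generative models to synthesize a large, balanced corpus of facial behavior, building a dataset of about 80 hours of video with corresponding 3D facial parameters. On this basis, the model achieves bidirectional understanding and generation of facial motion through two complementary tasks: Motion2Language (generating descriptions from facial parameters) and Language2Motion (generating facial parameters from text). Experiments show that language models in this setting can interpret and synthesize facial motion effectively and generalize well.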
Details
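Motivation: Text-guided human body animation is advancing rapidly, but facial animation lags behind for lack of high-quality, text-paired facial corpora. This paper aims to close that gap by synthesizing a large-scale dataset and using language models to establish a unified path for text-conditioned facial animation and motion understanding.
Result: Extensive experiments show that, within this framework, language models can interpret and synthesize facial motion with strong generalization. To the authors' knowledge, this is the first work to cast facial-parameter modeling as a language problem, and it establishes a unified path for text-conditioned facial animation.
Insight: The core innovations are: (1) using generative models to synthesize a large, balanced dataset of (text prompt, 3D facial parameter) pairs, which addresses data scarcity; (2) casting facial motion parameter modeling as a language problem, with the bidirectional Motion2Language and Language2Motion tasks unifying facial understanding and animation in one framework; and (3) synthesizing parameter sequences with quantized motion tokens, which suits downstream animation applications. Together these offer a novel and effective paradigm for text-based facial animation and understanding.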
Abstract: Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
[27] Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion cs.CV
Aislan Gabriel O. Souza, Agostinho Freire, Leandro Honorato Silva, Igor Lucas B. da Silva, João Vinícius R. de Andrade
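TL;DR: This paper addresses the Ambivalence/Hesitancy video recognition challenge of the ABAW competition at CVPR 2026 with a divergence-based multimodal fusion method. The method identifies ambivalence/hesitancy in videos by explicitly measuring conflict between the visual, audio, and text modalities.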
Details
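Motivation: Recognizing ambivalence/hesitancy in video is challenging; the core of the problem is capturing inconsistencies across modalities such as facial actions, speech, and text.
Result: On the validation set of the BAH dataset, the method achieves a Macro F1 of 0.6808, well above the 0.2827 baseline. Statistical analysis indicates that the temporal variability of facial Action Units is the dominant visual feature for discriminating ambivalence/hesitancy.
Insight: The novelty is a fusion mechanism that directly computes absolute differences between modality embeddings, explicitly capturing cross-modal inconsistency as the key signal for ambivalence/hesitancy. The study also confirms the dominant role of facial action dynamics in this task.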
Abstract: We address the Ambivalence/Hesitancy (A/H) Video Recognition Challenge at the 10th ABAW Competition (CVPR 2026). We propose a divergence-based multimodal fusion that explicitly measures cross-modal conflict between visual, audio, and textual channels. Visual features are encoded as Action Units (AUs) extracted via Py-Feat, audio via Wav2Vec 2.0, and text via BERT. Each modality is processed by a BiLSTM with attention pooling and projected into a shared embedding space. The fusion module computes pairwise absolute differences between modality embeddings, directly capturing the incongruence that characterizes A/H. On the BAH dataset, our approach achieves a Macro F1 of 0.6808 on the validation test set, outperforming the challenge baseline of 0.2827. Statistical analysis across 1,132 videos confirms that temporal variability of AUs is the dominant visual discriminator of A/H.
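The fusion step, pairwise absolute differences between modality embeddings in a shared space, can be sketched in plain Python. The sketch assumes each modality has already been encoded and projected to the same dimension; whether the paper also concatenates the original embeddings or applies further layers is not stated in the abstract, and the function name and toy vectors are illustrative.

```python
from itertools import combinations
from typing import Dict, List


def divergence_features(embeddings: Dict[str, List[float]]) -> List[float]:
    """Concatenate |e_a - e_b| for every pair of modalities.

    Modalities are paired in sorted-name order so the output layout is
    deterministic. All embeddings must share one dimension.
    """
    dims = {len(vec) for vec in embeddings.values()}
    if len(dims) != 1:
        raise ValueError("all modality embeddings must have the same dimension")
    features: List[float] = []
    for name_a, name_b in combinations(sorted(embeddings), 2):
        features.extend(abs(x - y)
                        for x, y in zip(embeddings[name_a], embeddings[name_b]))
    return features


# Toy 2-dimensional embeddings for three modalities.
toy_embeddings = {
    "visual": [1.0, 0.0],
    "audio": [0.0, 0.0],
    "text": [1.0, 2.0],
}
toy_features = divergence_features(toy_embeddings)
```

Large entries in the resulting vector mark dimensions where two modalities disagree, which is the kind of incongruence a downstream classifier can pick up on.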
[28] Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models cs.CV | cs.AI
Yujia Yang, Yuanxiang Wang, Zhenyu Guan, Tiankun Yang, Chenxi Bao
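TL;DR: This paper presents Omni IIE Bench, a high-quality, human-annotated benchmark for evaluating the editing consistency of instruction-based image editing (IIE) models in practical application scenarios. The benchmark uses a novel dual-track diagnostic design, comprising single-turn consistency and multi-turn coordination tasks, and is built through rigorous multi-stage human filtering. The paper evaluates 8 mainstream IIE models and quantifies, for the first time, the widespread performance drop that occurs when models move from low-semantic-scale to high-semantic-scale tasks.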
Details
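Motivation: Existing IIE benchmarks pursue task breadth through mixed evaluations, which hides a failure mode that matters in professional applications: inconsistent performance across tasks of different semantic scales. To fill this gap, the authors aim to build a benchmark that diagnoses the editing consistency of IIE models in practical scenarios.
Result: The evaluation of 8 mainstream IIE models on Omni IIE Bench quantifies, for the first time, a prevalent performance gap: nearly all models degrade significantly when moving from low-semantic-scale to high-semantic-scale tasks.
Insight: The novelty is a high-quality benchmark dedicated to diagnosing editing consistency, built around a dual-track diagnostic design (single-turn consistency and multi-turn coordination) and an exceptionally rigorous multi-stage human filtering process that combines academic and industry perspectives. It provides key diagnostic tools and insights for developing more reliable and stable next-generation IIE models.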
Abstract: While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a critical failure mode crucial in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.
[29] Joint Optimization of Storage and Loading for High-Performance 3D Point Cloud Data Processing cs.CV | cs.AI
Ke Wang, Yanfei Cao, Xiangzhi Tao, Naijie Gu, Jun Yu
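TL;DR: This paper proposes .PcRecord, a unified data storage format, together with a high-performance data processing pipeline, to address the large storage footprint and slow loading and processing of large-scale 3D point cloud data. A multi-stage parallel pipeline architecture optimizes the use of computational resources and significantly improves processing speed and efficiency.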
Details
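Motivation: 3D point cloud data are large in scale and come in diverse formats (e.g., PLY, XYZ, BIN), which makes data preparation and processing inefficient. Traditional algorithms struggle with large-scale datasets, and existing binary formats do not fully solve the time cost.
Result: Across several benchmark datasets, the system achieves average speedups of 6.61x (ModelNet40), 2.69x (S3DIS), 2.23x (ShapeNet), 3.09x (Kitti), 8.07x (SUN RGB-D), and 5.67x (ScanNet) on GPU, and 6.9x, 1.88x, 1.29x, 2.28x, 25.4x, and 19.3x on Ascend.
Insight: The novelty is the unified .PcRecord storage format, which reduces storage footprint, combined with a multi-module parallel pipeline architecture that optimizes resource use and handles large-scale point cloud data efficiently, offering a reusable approach to data preprocessing for 3D vision tasks.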
Abstract: With the rapid development of computer vision and deep learning, significant advancements have been made in 3D vision, particularly in autonomous driving, robotic perception, and augmented reality. 3D point cloud data, as a crucial representation of 3D information, has gained widespread attention. However, the vast scale and complexity of point cloud data present significant challenges for loading and processing, and traditional algorithms struggle to handle large-scale datasets. The diversity of storage formats for point cloud datasets (e.g., PLY, XYZ, BIN) adds complexity to data handling and results in inefficiencies in data preparation. Although binary formats like BIN and NPY have been used to speed up data access, they still do not fully address the time-consuming data loading and processing phase. To overcome these challenges, we propose the .PcRecord format, a unified data storage solution designed to reduce the storage occupation and accelerate the processing of point cloud data. We also introduce a high-performance data processing pipeline equipped with multiple modules. By leveraging a multi-stage parallel pipeline architecture, our system optimizes the use of computational resources, significantly improving processing speed and efficiency. This paper details the implementation of this system and demonstrates its effectiveness in addressing the challenges of handling large-scale point cloud datasets. On average, our system achieves performance improvements of 6.61x (ModelNet40), 2.69x (S3DIS), 2.23x (ShapeNet), 3.09x (Kitti), 8.07x (SUN RGB-D), and 5.67x (ScanNet) with GPU and 6.9x, 1.88x, 1.29x, 2.28x, 25.4x, and 19.3x with Ascend.
[30] EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments cs.CV | cs.AI
Kun Luo, Xiaoguang Ma
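TL;DR: This paper proposes EmergeNav, a framework for zero-shot vision-and-language navigation in continuous environments (VLN-CE). It uses structured inference to turn the semantic priors of vision-language models into stable long-horizon embodied execution, without task-specific training or external maps.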
Details
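Motivation: Existing vision-language models encode semantic priors, but their open-ended reasoning does not translate directly into stable, long-horizon embodied navigation. The main bottleneck is the lack of an execution structure for organizing instruction following, perceptual grounding, temporal progress, and stage verification.
Result: On the VLN-CE benchmark, using only open-source VLM backbones (Qwen3-VL-8B and Qwen3-VL-32B) and no task-specific training, EmergeNav reaches success rates (SR) of 30.00 and 37.00 respectively, showing strong zero-shot performance.
Insight: The novelty lies in formulating continuous VLN as structured embodied inference: a Plan–Solve–Transition hierarchy provides stage-structured execution, combined with goal-conditioned perceptual extraction, contrastive dual-memory reasoning, and role-separated Dual-FOV sensing. Explicit execution structure is the key to turning VLM priors into stable navigation behavior.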
Abstract: Zero-shot vision-and-language navigation in continuous environments (VLN-CE) remains challenging for modern vision-language models (VLMs). Although these models encode useful semantic priors, their open-ended reasoning does not directly translate into stable long-horizon embodied execution. We argue that the key bottleneck is not missing knowledge alone, but missing an execution structure for organizing instruction following, perceptual grounding, temporal progress, and stage verification. We propose EmergeNav, a zero-shot framework that formulates continuous VLN as structured embodied inference. EmergeNav combines a Plan–Solve–Transition hierarchy for stage-structured execution, GIPE for goal-conditioned perceptual extraction, contrastive dual-memory reasoning for progress grounding, and role-separated Dual-FOV sensing for time-aligned local control and boundary verification. On VLN-CE, EmergeNav achieves strong zero-shot performance using only open-source VLM backbones and no task-specific training, explicit maps, graph search, or waypoint predictors, reaching 30.00 SR with Qwen3-VL-8B and 37.00 SR with Qwen3-VL-32B. These results suggest that explicit execution structure is a key ingredient for turning VLM priors into stable embodied navigation behavior.
[31] PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models cs.CV | cs.AI
Hisayuki Yokomizo, Taiki Miyanishi, Yan Gang, Shuhei Kurita, Nakamasa Inoue
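TL;DR: This paper proposes PhysQuantAgent, a framework for object mass estimation with vision-language models (VLMs), and builds the VisPhysQuant benchmark dataset for evaluation. Three visual prompting methods (object detection, scale estimation, and cross-sectional image generation) augment the input image to help the model understand object size and internal structure, which improves mass estimation accuracy.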
Details
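Motivation: VLMs are widely applied to robotic perception and manipulation, but their ability to infer physical properties such as object mass is limited, even though mass estimation is essential for choosing an appropriate grasp force and ensuring safe interaction. Existing benchmarks also lack explicit evaluation of physical quantity estimation under realistic sensing conditions.
Result: Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, supporting the value of combining spatial reasoning with VLM knowledge for physical inference. Results are evaluated on the VisPhysQuant benchmark.
Insight: The novelty includes a framework and benchmark dataset dedicated to mass estimation, plus three visual prompting methods that strengthen the model's understanding of object physical properties. Viewed objectively, the method compensates for the weakness of VLMs in physical reasoning by integrating multimodal information (such as depth data and spatial reasoning), offering a new approach to physical perception in robotic manipulation.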
Abstract: Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross-sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.
[32] CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization cs.CV | cs.AI | cs.MM | cs.SD | eess.AS
Liangbin Huang, Xiaohua Liao, Chaoqun Cui, Shijing Wang, Zhaolong Huang
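TL;DR: This paper proposes CineSRD, a unified multimodal framework for speaker diarization in open-world visual media such as films and TV series. It uses visual, acoustic, and linguistic cues from video, speech, and subtitles: visual anchor clustering registers the initial speakers, and an audio language model performs speaker turn detection to refine annotations and add unregistered off-screen speakers.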
Details
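Motivation: Traditional speaker diarization systems are mostly limited to constrained scenarios such as meetings and interviews. Open-world visual media (such as films and TV series) pose challenges including long-form video understanding, large numbers of speakers, cross-modal asynchrony, and in-the-wild variability, which call for new solutions.
Result: Experiments show that CineSRD achieves superior performance on the proposed visual media speaker diarization benchmark (which includes Chinese and English programs) and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
Insight: The novelty lies in extending speaker diarization to open-world visual media and designing a unified multimodal framework that combines visual, acoustic, and linguistic cues to handle complex scenes. Viewed objectively, the dedicated benchmark the authors build and release is an important resource for the field, and the multimodal integration approach is instructive for handling cross-modal asynchrony.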
Abstract: Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
[33] MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing cs.CV | cs.AI
Zhaoyuan Qiu, Ken Chen, Xiangwei Wang, Yu Xia, Sachith Seneviratne
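TL;DR: MSRAMIE is a training-free multimodal structured reasoning agent framework for complex multi-instruction image editing. Built on a multimodal large language model (MLLM), it uses existing editing models as plug-in components and applies a reasoning topology of Tree-of-States and Graph-of-References to decompose complex instructions into multiple editing steps. This enables state transitions, cross-step information aggregation, and recall of the original input, so that the agent can systematically explore the editing space and progressively refine the output.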
Details
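Motivation: Existing instruction-based image editing models do well on simple single-step instructions but degrade in realistic scenarios with multiple, lengthy, interdependent instructions. The main cause is the scarcity of training data with complex multi-instruction annotations, and collecting such data and retraining the models is costly.
Result: Experiments show that as instruction complexity increases, MSRAMIE improves instruction following by over 15% and raises the probability of completing all modifications in a single run by over 100%, while preserving perceptual quality and visual consistency.
Insight: The novelty includes a training-free agent framework with an MLLM as the coordinator; a new reasoning topology of Tree-of-States and Graph-of-References that gives interpretable and controllable decision pathways; and the decomposition of complex instructions into multi-step edits that support state transitions and information aggregation, improving the handling of complex multi-instruction tasks.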
Abstract: Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations. However, it is costly to collect such data and retrain these models. To address this challenge, we propose MSRAMIE, a training-free agent framework built on a Multimodal Large Language Model (MLLM). MSRAMIE takes existing editing models as plug-in components and handles multi-instruction tasks via structured multimodal reasoning. It orchestrates iterative interactions between an MLLM-based Instructor and an image editing Actor, introducing a novel reasoning topology that comprises the proposed Tree-of-States and Graph-of-References. During inference, complex instructions are decomposed into multiple editing steps that enable state transitions, cross-step information aggregation, and original input recall, allowing systematic exploration of the image editing space and flexible progressive output refinement. The visualizable inference topology further provides interpretable and controllable decision pathways. Experiments show that as the instruction complexity increases, MSRAMIE can improve instruction following by over 15% and increase the probability of finishing all modifications in a single run by over 100%, while preserving perceptual quality and maintaining visual consistency.
[34] Continual Multimodal Egocentric Activity Recognition via Modality-Aware Novel Detection cs.CV | cs.AI
Wonseon Lim, Hyejeong Im, Dae-Won Kim
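TL;DR: This paper proposes MAND, a modality-aware framework for open-world continual learning in multimodal egocentric activity recognition. At inference, Modality-aware Adaptive Scoring (MoAS) integrates complementary cues from each modality to improve novel activity detection; during training, Modality-wise Representation Stabilization Training (MoRST) mitigates catastrophic forgetting.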
Details
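Motivation: Existing methods rely mainly on RGB-dominated logits for novel activity detection and underuse the complementary evidence from other modalities such as IMU. This imbalance worsens over time under catastrophic forgetting.
Result: On a public multimodal egocentric benchmark, MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.
Insight: The novelty lies in explicitly modeling and exploiting modality reliability: adaptively integrating modality logits strengthens novel activity detection, and modality-specific auxiliary heads and distillation stabilize learning across tasks, making better use of complementary multimodal information and mitigating forgetting in continual learning.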
Abstract: Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens over time under catastrophic forgetting. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) estimates sample-wise modality reliability from energy scores and adaptively integrates modality logits to better exploit complementary modality cues for novelty detection. During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.
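A minimal Python sketch of energy-based reliability weighting follows. It assumes the widely used energy score E = -logsumexp(logits) and, as a simplification, turns the per-modality energies into weights with a softmax over negative energy. The actual MoAS weighting rule is not given in the abstract, so this weighting, the function names, and the example modalities are assumptions.

```python
import math
from typing import Dict, List, Tuple


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def energy(logits: List[float]) -> float:
    """Energy score: lower energy means a more confident prediction."""
    return -logsumexp(logits)


def fuse_logits(modality_logits: Dict[str, List[float]]) -> Tuple[List[float], Dict[str, float]]:
    """Weight each modality by softmax(-energy) and sum the weighted logits."""
    neg_energy = {name: -energy(logits) for name, logits in modality_logits.items()}
    norm = logsumexp(list(neg_energy.values()))
    weights = {name: math.exp(v - norm) for name, v in neg_energy.items()}
    num_classes = len(next(iter(modality_logits.values())))
    fused = [sum(weights[name] * modality_logits[name][i] for name in modality_logits)
             for i in range(num_classes)]
    return fused, weights


# Toy sample: RGB is confident about class 0, IMU is nearly uniform.
toy_logits = {"rgb": [5.0, 0.0, 0.0], "imu": [0.1, 0.0, 0.0]}
toy_fused, toy_weights = fuse_logits(toy_logits)
novelty_score = energy(toy_fused)  # higher energy suggests a novel activity
```

Because the weights are computed per sample, a modality that is uninformative for one clip can still dominate another clip where its logits are sharper.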
[35] Are a Thousand Words Better Than a Single Picture? Beyond Images – A Framework for Multi-Modal Knowledge Graph Dataset Enrichment cs.CV | cs.AI
Pengyu Zhang, Klim Zaporojets, Jie Liu, Jia-Hong Huang, Paul Groth
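TL;DR: This paper proposes Beyond Images, an automatic data enrichment framework for multi-modal knowledge graph (MMKG) datasets. It retrieves entity-related images at scale, converts visual inputs into textual descriptions, and uses a large language model to fuse multi-source descriptions into concise entity summaries, improving existing MMKG models without changing their architectures.
Details
Motivation: Existing MMKGs depend on visual information, but large-scale image collections are hard to curate and often omit ambiguous yet relevant visuals (e.g., logos, symbols, abstract scenes), which limits performance.
Result: Across three public MMKG datasets and multiple baseline models, the method yields consistent gains (up to 7% Hits@1 overall); on a subset of entities with visually ambiguous logos and symbols, converting images into text brings large improvements (201.35% MRR and 333.33% Hits@1).
Insight: The novelty is a data-centric automated pipeline that converts ambiguous visual content into textual descriptions to enrich semantics, rather than using the images directly, thereby improving MMKG completion; an optional text-image consistency check interface is also provided to improve data reliability.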
Abstract: Multi-Modal Knowledge Graphs (MMKGs) benefit from visual information, yet large-scale image collection is hard to curate and often excludes ambiguous but relevant visuals (e.g., logos, symbols, abstract scenes). We present Beyond Images, an automatic data-centric enrichment pipeline with optional human auditing. This pipeline operates in three stages: (1) large-scale retrieval of additional entity-related images, (2) conversion of all visual inputs into textual descriptions to ensure that ambiguous images contribute usable semantics rather than noise, and (3) fusion of multi-source descriptions using a large language model (LLM) to generate concise, entity-aligned summaries. These summaries replace or augment the text modality in standard MMKG models without changing their architectures or loss functions. Across three public MMKG datasets and multiple baseline models, we observe consistent gains (up to 7% Hits@1 overall). Furthermore, on a challenging subset of entities with visually ambiguous logos and symbols, converting images into text yields large improvements (201.35% MRR and 333.33% Hits@1). Additionally, we release a lightweight Text-Image Consistency Check Interface for optional targeted audits, improving description quality and dataset reliability. Our results show that scaling image coverage and converting ambiguous visuals into text is a practical path to stronger MMKG completion. Code, datasets, and supplementary materials are available at https://github.com/pengyu-zhang/Beyond-Images.
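For readers less familiar with the link-prediction metrics quoted above, the following sketch shows the standard definitions of MRR and Hits@k over a list of 1-based ranks of the correct entity. The helper `relative_gain_pct` assumes the large percentages in the abstract are relative gains over a baseline; the abstract does not state this explicitly, and the numbers in the usage note are illustrative, not taken from the paper.

```python
def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-based positions of the correct entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def relative_gain_pct(baseline, improved):
    """Relative improvement in percent (assumed reading of the reported gains)."""
    return 100.0 * (improved - baseline) / baseline
```

Under the relative-gain reading, a hypothetical subset where the number of top-1 hits rises from 3 to 13 would give `relative_gain_pct(3, 13)`, roughly 333.33%, which shows how small subsets can produce very large percentage gains.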
[36] Empirical Recipes for Efficient and Compact Vision-Language Models cs.CV | cs.AI
Jiabo Huang, Zhizhong Li, Sina Sajadmanesh, Weiming Zhuang, Lingjuan Lyu
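TL;DR: Through empirical analysis, this paper identifies the bottlenecks that keep compact vision-language models (VLMs) from reaching their expected inference speed and proposes optimization recipes that substantially reduce latency while preserving accuracy. It also extends compact VLMs to support structured perception outputs and introduces the ArgusVLM model family, which balances strong performance with an efficient design across diverse benchmarks.
Details
Motivation: Existing compact models deliver smaller inference speedups than expected when VLMs are deployed in resource-constrained settings; the paper aims to analyze the bottlenecks systematically and provide optimization recipes that achieve low latency and high throughput.
Result: The recipes cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M; ArgusVLM performs strongly across diverse benchmarks while keeping a compact and efficient design.
Insight: Contributions include identifying inference bottlenecks through end-to-end efficiency analysis and developing optimizations tailored to compact VLMs, as well as extending compact VLMs to produce structured perception outputs, which improves their practicality and performance. Viewed objectively, these empirical recipes offer transferable, practical guidance for building efficient VLM systems.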
Abstract: Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
[37] HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning cs.CV | cs.AI | cs.CL
Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen
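TL;DR: This paper proposes HopChain, a framework for synthesizing multi-hop vision-language reasoning data to improve the generalization of vision-language models (VLMs) on fine-grained reasoning tasks. By constructing logically dependent multi-hop query chains, the data exposes compound errors in perception, reasoning, knowledge, and hallucination. Experiments show that adding HopChain-synthesized data to RLVR training improves 20 of 24 benchmarks, with especially large gains on long-chain reasoning tasks.
Details
Motivation: Existing vision-language data lacks complex reasoning chains that depend on visual evidence, so VLMs are prone to compound perception, reasoning, knowledge, and hallucination errors in fine-grained reasoning, which limits their generalization.
Result: On Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, adding HopChain-synthesized multi-hop data improves 20 of 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding; replacing it with half-multi-hop or single-hop variants lowers average accuracy by 5.3 and 7.0 points, respectively; in the ultra-long-CoT regime, accuracy gains exceed 50 points.
Insight: The novelty is a scalable multi-hop data synthesis framework that builds logically dependent chains of instance-grounded hops to strengthen generalization in complex reasoning. Viewed objectively, the method improves the robustness of long-chain reasoning by exposing compound errors and forcing the model to rely on visual evidence at every step.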
Abstract: VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
[38] OpenQlaw: An Agentic AI Assistant for Analysis of 2D Quantum Materials cs.CV
Sankalp Pandey, Xuan-Bac Nguyen, Hoang-Quan Nguyen, Tim Faltermeier, Nicholas Borys
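TL;DR: OpenQlaw is an agentic AI assistant for analyzing 2D quantum materials. A core LLM agent orchestrates a domain-expert multimodal large language model (QuPAINT), decoupling visual identification from physical reasoning and deterministic image rendering. The system can therefore process user queries dynamically, perform scale-aware physical computation, and generate visual annotations, with the aim of accelerating high-throughput device fabrication.
Details
Motivation: Existing domain-specific MLLMs can reason from physical knowledge, but their outputs emphasize step-by-step cognitive transparency, producing verbose candidate enumerations and dense reasoning that may cause cognitive overload and offer little practical value for real-time interaction with researchers.
Result: The abstract reports no quantitative benchmark results or SOTA comparisons, but claims that the system turns isolated inferences into a context-aware assistant that can accelerate high-throughput device fabrication.
Insight: The novelty is an agentic orchestration architecture in which the core LLM acts as a coordinator that calls the domain-expert MLLM (QuPAINT), modularly decoupling visual identification from reasoning; the system has persistent memory that stores physical scale ratios and sample preparation methods, supporting dynamic, context-aware interaction and analysis.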
Abstract: The transition from optical identification of 2D quantum materials to practical device fabrication requires dynamic reasoning beyond detection accuracy. While recent domain-specific Multimodal Large Language Models (MLLMs) successfully ground visual features using physics-informed reasoning, their outputs are optimized for step-by-step cognitive transparency. This yields verbose candidate enumerations followed by dense reasoning that, while accurate, may induce cognitive overload and lack immediate utility for real-world interaction with researchers. To address this challenge, we introduce OpenQlaw, an agentic orchestration system for analyzing 2D materials. The architecture is built upon NanoBot, a lightweight agentic framework inspired by OpenClaw, and QuPAINT, one of the first Physics-Aware Instruction Multi-modal platforms for Quantum Material Discovery. This makes the system accessible on the lab floor via a variety of messaging channels. OpenQlaw allows the core Large Language Model (LLM) agent to orchestrate a domain-expert MLLM, QuPAINT, as a specialized node, successfully decoupling visual identification from reasoning and deterministic image rendering. By parsing spatial data from the expert, the agent can dynamically process user queries, such as performing scale-aware physical computation or generating isolated visual annotations, and answer in a naturalistic manner. Crucially, the system features a persistent memory that enables the agent to save physical scale ratios (e.g., 1 pixel = 0.25 μm) for area computations and store sample preparation methods for efficacy comparison. The application of an agentic architecture, together with the extension that uses the core agent as an orchestrator for domain-specific experts, transforms isolated inferences into a context-aware assistant capable of accelerating high-throughput device fabrication.
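The scale-aware area computation mentioned in the abstract can be made concrete with a short worked sketch. It uses the abstract's example ratio of 1 pixel = 0.25 μm; the class `ScaleMemory` and the function `flake_area_um2` are hypothetical names introduced here, and the in-memory dictionary only stands in for whatever persistent store the system actually uses.

```python
class ScaleMemory:
    """Hypothetical stand-in for a persistent store of scale ratios per sample."""

    def __init__(self):
        self._um_per_pixel = {}

    def save(self, sample_id, um_per_pixel):
        self._um_per_pixel[sample_id] = um_per_pixel

    def load(self, sample_id):
        return self._um_per_pixel[sample_id]


def flake_area_um2(pixel_count, um_per_pixel):
    """Convert a segmented region's pixel count into an area in square micrometers.

    One pixel covers um_per_pixel ** 2 square micrometers, so the area scales
    with the square of the linear ratio.
    """
    return pixel_count * um_per_pixel ** 2
```

With the ratio saved once, for example `memory.save("sample-A", 0.25)`, a later query about a 1600-pixel flake evaluates `flake_area_um2(1600, memory.load("sample-A"))`, that is 1600 × 0.0625 μm² = 100 μm².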
[39] Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models cs.CV
Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong
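TL;DR: This paper proposes Astrolabe, an efficient online reinforcement learning framework designed for distilled autoregressive video models. It targets the inefficiency and high computational overhead of existing reinforcement learning methods when aligning these models with human visual preferences.
Details
Motivation: Distilled autoregressive video models enable efficient streaming generation, but their outputs often misalign with human visual preferences; existing reinforcement learning frameworks do not suit these architectures and typically require expensive re-distillation or high-overhead reverse-process optimization.
Result: Experiments show that the method consistently improves generation quality across multiple distilled autoregressive video models, providing a robust and scalable alignment solution.
Insight: Contributions include: a forward-process reinforcement learning formulation based on negative-aware fine-tuning, which contrasts positive and negative samples at inference endpoints to establish an implicit policy improvement direction; a streaming training scheme that generates sequences progressively via a rolling KV-cache and applies reinforcement learning updates only to local clip windows to ensure long-range coherence; and a multi-reward objective that mitigates reward hacking through uncertainty-aware selective regularization and dynamic reference updates.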
Abstract: Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
[40] PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning cs.CV
Yijian Wang, Qingsen Yan, Jiantao Zhou, Duwei Dai, Wei Dong
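TL;DR: This paper proposes PaAgent, a portrait-aware image restoration agent that selects the most suitable restoration tool for an input image by building a self-evolving portrait bank of tools and using Retrieval-Augmented Generation (RAG). It also introduces a subjective-objective reinforcement learning strategy to improve degradation perception in complex scenes. Experiments on 8 image restoration benchmarks validate its superiority on complex restoration tasks.
Details
Motivation: Existing image restoration agents based on multimodal large language models usually lack a mechanism for summarizing insights from past interactions, so they must search exhaustively for the optimal restoration tool, which is inefficient. This paper aims to address that limitation.
Result: Extensive experiments on 8 image restoration benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent's superiority on complex image restoration tasks.
Insight: Contributions include: 1) building and continuously evolving a portrait bank that records the characteristics of various restoration tools, combined with RAG for tool selection, which avoids exhaustive search; 2) a subjective-objective reward generation strategy for reinforcement learning that considers both image quality scores and semantic insights, improving the accuracy of degradation perception in complex scenes such as partial and non-uniform degradation.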
Abstract: Image Restoration (IR) agents, leveraging multimodal large language models to perceive degradation and invoke restoration tools, have shown promise in automating IR tasks. However, existing IR agents typically lack an insight summarization mechanism for past interactions, which results in an exhaustive search for the optimal IR tool. To address this limitation, we propose a portrait-aware IR agent, dubbed PaAgent, which incorporates a self-evolving portrait bank for IR tools and Retrieval-Augmented Generation (RAG) to select a suitable IR tool for the input. Specifically, to construct and evolve the portrait bank, the PaAgent continuously enriches it by summarizing the characteristics of various IR tools with restored images, selected IR tools, and degraded images. In addition, the RAG is employed to select the optimal IR tool for the input image by retrieving relevant insights from the portrait bank. Furthermore, to enhance PaAgent’s ability to perceive degradation in complex scenes, we propose a subjective-objective reinforcement learning strategy that considers both image quality scores and semantic insights in reward generation, which accurately provides the degradation information even under partial and non-uniform degradation. Extensive experiments across 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent’s superiority in addressing complex IR tasks. Our project page is https://wyjgr.github.io/PaAgent.html.
[41] DesertFormer: Transformer-Based Semantic Segmentation for Off-Road Desert Terrain Classification in Autonomous Navigation Systems cs.CV | cs.LG
Yasaswini Chebolu
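TL;DR: This paper presents DesertFormer, a semantic segmentation model based on SegFormer B2 (MiT-B2 backbone) for classifying unstructured off-road desert terrain in autonomous navigation systems. It divides terrain into ten ecological categories and, trained on a purpose-built dataset of 4,176 annotated images, achieves 64.4% mIoU and 86.1% pixel accuracy, substantially outperforming the baseline.
Details
Motivation: In unstructured off-road environments, and desert terrain in particular, low chromatic contrast between terrain categories, extreme lighting variation, and sparse vegetation cause standard road-scene segmentation models to fail, which undermines reliable terrain perception and path planning for autonomous driving.
Result: On the purpose-built dataset of 512x512 off-road images, DesertFormer reaches a mean Intersection-over-Union (mIoU) of 64.4% and a pixel accuracy of 86.1%, a 24.2% absolute improvement over the DeepLabV3 MobileNetV2 baseline (41.0% mIoU).
Insight: The main novelty is applying a Transformer architecture (SegFormer) specifically to off-road desert terrain segmentation, a task with distinctive visual challenges, and defining ten ecologically relevant semantic categories to support safe path planning. The paper also conducts a systematic failure analysis and proposes class-weighted training and copy-paste augmentation for rare categories.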
Abstract: Reliable terrain perception is a fundamental requirement for autonomous navigation in unstructured, off-road environments. Desert landscapes present unique challenges due to low chromatic contrast between terrain categories, extreme lighting variability, and sparse vegetation that defy the assumptions of standard road-scene segmentation models. We present DesertFormer, a semantic segmentation pipeline for off-road desert terrain analysis based on SegFormer B2 with a hierarchical Mix Transformer (MiT-B2) backbone. The system classifies terrain into ten ecologically meaningful categories – Trees, Lush Bushes, Dry Grass, Dry Bushes, Ground Clutter, Flowers, Logs, Rocks, Landscape, and Sky – enabling safety-aware path planning for ground robots and autonomous vehicles. Trained on a purpose-built dataset of 4,176 annotated off-road images at 512x512 resolution, DesertFormer achieves a mean Intersection-over-Union (mIoU) of 64.4% and pixel accuracy of 86.1%, representing a +24.2% absolute improvement over a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). We further contribute a systematic failure analysis identifying the primary confusion patterns – Ground Clutter to Landscape and Dry Grass to Landscape – and propose class-weighted training and copy-paste augmentation for rare terrain categories. Code, checkpoints, and an interactive inference dashboard are released at https://github.com/Yasaswini-ch/Vision-based-Desert-Terrain-Segmentation-using-SegFormer.
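The two headline metrics above, mIoU and pixel accuracy, are both derived from a class confusion matrix. The sketch below shows the standard definitions on flattened label lists; it is a generic illustration rather than the evaluation code from the linked repository, and skipping classes absent from both prediction and ground truth is one common convention that is assumed here.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of all pixels whose predicted class matches the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN).

    Classes that never appear in either labels or predictions are skipped
    (an assumed convention; some toolkits count them differently).
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

Because IoU penalizes both false positives and false negatives per class and then averages over classes, rare classes such as Logs or Flowers weigh as much as Sky, which is why mIoU is typically much lower than pixel accuracy on imbalanced terrain data.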
[42] Edge-Efficient Two-Stream Multimodal Architecture for Non-Intrusive Bathroom Fall Detection cs.CV
Haitian Wang, Yiren Wang, Xinyu Wang, Sheldon Fung, Atif Mansoor
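TL;DR: This paper proposes an efficient two-stream multimodal architecture for edge devices, targeting non-intrusive bathroom fall detection. It comprises a Motion-Mamba branch that processes mmWave radar signals and an Impact-Griffin branch that processes floor vibration signals, fused cross-conditionally through low-rank bilinear interaction and a Switch-MoE head to align motion and impact information and suppress confounders such as dropped objects. The model runs in real time on a Raspberry Pi 4B gateway and achieves SOTA performance on a purpose-built bathroom fall detection benchmark.
Details
Motivation: Existing privacy-preserving fall detection schemes based on mmWave radar, vibration, or simple multimodal fusion usually treat motion and impact as loosely coupled streams and rely on coarse temporal alignment and amplitude thresholds. They do not explicitly encode the causal link between a radar-observed fall and the floor impact, nor do they address timing drift, object-drop confounders, or the latency and energy constraints of low-power edge devices.
Result: On a purpose-built bathroom fall detection benchmark containing more than 3 hours of synchronized mmWave radar and triaxial vibration recordings, the model attains 96.1% accuracy, 94.8% precision, 88.0% recall, a 91.1% macro F1 score, and an AUC of 0.968 on the test split. Compared with the strongest baseline, accuracy improves by 2.0 percentage points and fall recall by 1.3 percentage points, while latency on the Raspberry Pi 4B gateway drops from 35.9 ms to 15.8 ms and energy per 2.56 s window falls from 14200 mJ to 10750 mJ.
Insight: The novelty is a purpose-designed two-stream architecture in which the Motion-Mamba branch captures long-range motion patterns and the Impact-Griffin branch captures impact transients and cross-axis coupling, while a cross-conditioned fusion mechanism explicitly models the causal relationship between motion and impact and suppresses confounders such as dropped objects. The architecture maintains high accuracy while substantially reducing latency and energy use on edge devices, offering a useful design reference for real-time multimodal sensing in resource-constrained settings.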
Abstract: Falls in wet bathroom environments are a major safety risk for seniors living alone. Recent work has shown that mmWave-only, vibration-only, and existing multimodal schemes, such as vibration-triggered radar activation, early feature concatenation, and decision-level score fusion, can support privacy-preserving, non-intrusive fall detection. However, these designs still treat motion and impact as loosely coupled streams, depending on coarse temporal alignment and amplitude thresholds, and do not explicitly encode the causal link between radar-observed collapse and floor impact or address timing drift, object drop confounders, and latency and energy constraints on low-power edge devices. To this end, we propose a two-stream architecture that encodes radar signals with a Motion–Mamba branch for long-range motion patterns and processes floor vibration with an Impact–Griffin branch that emphasizes impact transients and cross-axis coupling. Cross-conditioned fusion uses low-rank bilinear interaction and a Switch–MoE head to align motion and impact tokens and suppress object-drop confounders. The model keeps inference cost suitable for real-time execution on a Raspberry Pi 4B gateway. We construct a bathroom fall detection benchmark dataset with frame-level annotations, comprising more than 3 h of synchronized mmWave radar and triaxial vibration recordings across eight scenarios under running water, together with subject-independent training, validation, and test splits. On the test split, our model attains 96.1% accuracy, 94.8% precision, 88.0% recall, a 91.1% macro F1 score, and an AUC of 0.968. Compared with the strongest baseline, it improves accuracy by 2.0 percentage points and fall recall by 1.3 percentage points, while reducing latency from 35.9 ms to 15.8 ms and lowering energy per 2.56 s window from 14200 mJ to 10750 mJ on the Raspberry Pi 4B gateway.
[43] ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models cs.CV
M. Arda Aydın, Melih B. Yilmaz, Aykut Koç, Tolga Çukur
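TL;DR: This paper proposes ACE-LoRA, a parameter-efficient adaptation framework for generalist medical vision-language models (VLMs). It integrates LoRA modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module that captures higher-order contextual interactions, enriching global representations with localized diagnostic cues. A label-guided InfoNCE loss further strengthens cross-modal alignment. With only 0.95M additional trainable parameters, the method outperforms state-of-the-art medical VLMs and PEFT baselines on zero-shot classification, segmentation, and detection benchmarks spanning multiple domains.
Details
Motivation: Existing medical vision-language models face a trade-off between specialization (single-domain training, poor generalization) and generalization (multi-domain training, loss of detail). The paper aims to design a parameter-efficient adaptation method for generalist medical VLMs that preserves strong zero-shot generalization while capturing fine-grained diagnostic cues.
Result: On zero-shot classification, segmentation, and detection benchmarks spanning multiple domains, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and Parameter-Efficient Fine-Tuning (PEFT) baselines.
Insight: Main contributions: 1) the ACE-HGNN module, which uses a hypergraph neural network to capture higher-order contextual interactions beyond pairwise similarity and enriches global representations with localized diagnostic cues, addressing a key limitation of prior PEFT methods that overlook fine-grained details; 2) a label-guided InfoNCE loss that suppresses false negatives between semantically related image-text pairs to strengthen cross-modal alignment. Viewed objectively, combining graph attention with parameter-efficient fine-tuning (LoRA) for medical VLM adaptation is a novel and efficient idea.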
Abstract: The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes: specialist models trained on single-domain data, which capture domain-specific details but generalize poorly, and generalist medical VLMs trained on multi-domain data, which retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address this problem, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module that captures higher-order contextual interactions beyond pairwise similarity to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods that overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss to effectively suppress false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at https://github.com/icon-lab/ACE-LoRA.
[44] LLM-Powered Flood Depth Estimation from Social Media Imagery: A Vision-Language Model Framework with Mechanistic Interpretability for Transportation Resilience cs.CV
Nafis Fuad, Xiaodong Qian
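TL;DR: This study presents FloodLlama, a framework built on a fine-tuned open-source vision-language model (VLM) for continuous flood depth estimation from single street-level images sourced from social media (e.g., TikTok), in support of transportation resilience. The model is trained on a synthetic dataset of about 190,000 images, using progressive curriculum learning and QLoRA fine-tuning of LLaMA 3.2-11B Vision. Results show high accuracy in flood depth estimation, and a mechanistic interpretability framework identifies the critical network layer, enabling parameter-efficient fine-tuning.
Details
Motivation: Urban flooding poses a growing threat to the continuity of transportation networks, yet no operational system currently provides real-time, street-level flood depth information at centimeter resolution, which is essential for dynamic routing, electric vehicle safety, and autonomous vehicle operations.
Result: Across 34797 evaluation trials, FloodLlama achieves a mean absolute error (MAE) below 0.97 cm and Acc@5cm above 93.7% for deep flooding, and accuracy above 96.8% for shallow flooding. On real-world data, the Tier 3 configuration reaches 98.62% accuracy and shows strong robustness under visual occlusion. The feasibility of the TikTok-based data pipeline is validated on 676 annotated flood frames from Detroit.
Insight: Contributions include: 1) combining a vision-language model with social media data (TikTok) for flood depth estimation, providing a scalable, infrastructure-free solution; 2) progressive curriculum training for coarse-to-fine learning; 3) the finding that prompt design affects depth estimation performance (simple prompts work better for shallow flooding, while chain-of-thought reasoning is more effective for deep flooding); 4) a five-phase mechanistic interpretability framework that identifies layer L23 as the critical depth-encoding transition and supports selective fine-tuning, reducing trainable parameters by 76-80% while maintaining accuracy.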
Abstract: Urban flooding poses an escalating threat to transportation network continuity, yet no operational system currently provides real-time, street-level flood depth information at the centimeter resolution required for dynamic routing, electric vehicle (EV) safety, and autonomous vehicle (AV) operations. This study presents FloodLlama, a fine-tuned open-source vision-language model (VLM) for continuous flood depth estimation from single street-level images, supported by a multimodal sensing pipeline using TikTok data. A synthetic dataset of approximately 190000 images was generated, covering seven vehicle types, four weather conditions, and 41 depth levels (0-40 cm at 1 cm resolution). Progressive curriculum training enabled coarse-to-fine learning, while LLaMA 3.2-11B Vision was fine-tuned using QLoRA. Evaluation across 34797 trials reveals a depth-dependent prompt effect: simple prompts perform better for shallow flooding, whereas chain-of-thought (CoT) reasoning improves performance at greater depths. FloodLlama achieves a mean absolute error (MAE) below 0.97 cm and Acc@5cm above 93.7% for deep flooding, exceeding 96.8% for shallow depths. A five-phase mechanistic interpretability framework identifies layer L23 as the critical depth-encoding transition and enables selective fine-tuning that reduces trainable parameters by 76-80% while maintaining accuracy. The Tier 3 configuration achieves 98.62% accuracy on real-world data and shows strong robustness under visual occlusion. A TikTok-based data pipeline, validated on 676 annotated flood frames from Detroit, demonstrates the feasibility of real-time, crowd-sourced flood sensing. The proposed framework provides a scalable, infrastructure-free solution with direct implications for EV safety, AV deployment, and resilient transportation management.
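The two depth metrics quoted above can be sketched as follows. MAE is standard; for Acc@5cm the abstract gives no definition, so the reading used here (fraction of predictions within 5 cm of the true depth, inclusive) is an assumption, as are the function names.

```python
def mean_absolute_error(pred_cm, true_cm):
    """Average absolute depth error in centimeters."""
    return sum(abs(p - t) for p, t in zip(pred_cm, true_cm)) / len(true_cm)


def accuracy_within(pred_cm, true_cm, tolerance_cm=5.0):
    """Fraction of predictions within tolerance_cm of the true depth.

    Assumed interpretation of Acc@5cm; the inclusive bound is also an assumption.
    """
    hits = sum(1 for p, t in zip(pred_cm, true_cm) if abs(p - t) <= tolerance_cm)
    return hits / len(true_cm)
```

For instance, predictions of 10, 22, and 31 cm against true depths of 12, 20, and 40 cm have errors of 2, 2, and 9 cm, so the MAE is 13/3 cm and two of three predictions fall within the 5 cm tolerance.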
[45] Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles cs.CV | cs.AI
Zacharie Bugaud
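TL;DR: This paper studies family bias in vision-language model ensembles, finding that models from the same architectural family make correlated errors that degrade ensemble performance. The author proposes three family-aware methods: hierarchical family voting, a weighting method based on calibration and family quality, and learned candidate answer scoring, which significantly improve ensemble performance on multiple benchmarks.
Details
Motivation: In vision-language model ensembles, models from the same architectural family produce correlated errors that standard voting ignores, reducing the effectiveness and accuracy of the ensemble.
Result: On the VQAv2, TextVQA, and GQA benchmarks, the proposed methods significantly improve performance; Learned Candidate Scoring reaches 87.83% accuracy on VQAv2 and never degrades performance on any benchmark.
Insight: The novelty lies in identifying and quantifying the effect of model family bias on ensembles and in proposing family-aware ensembling strategies, such as hierarchical voting and calibration-based weighting, that reduce correlated errors and improve ensemble robustness.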
Abstract: Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors drive accuracy to 0% despite the best model being correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method weighting models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality, achieving the largest gains: +0.68% VQAv2, +0.61% TextVQA, +2.45% GQA – all significant – and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.
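The two-level aggregation behind Hierarchical Family Voting can be sketched in a few lines. This is a minimal illustration of "aggregate within families, then vote across them" using unweighted majorities; the tie-breaking rule (first answer encountered wins) and the function names are assumptions, and the paper's HFV may include weighting that the abstract does not describe.

```python
from collections import Counter


def majority(answers):
    """Most frequent answer; ties go to the answer seen first (assumed rule)."""
    return Counter(answers).most_common(1)[0][0]


def flat_vote(predictions):
    """Standard voting: every model counts once, regardless of family.

    predictions: list of (family_name, answer) pairs, one per model.
    """
    return majority([answer for _, answer in predictions])


def hierarchical_family_vote(predictions):
    """Vote within each family first, then across the family-level answers."""
    by_family = {}
    for family, answer in predictions:
        by_family.setdefault(family, []).append(answer)
    family_answers = [majority(answers) for answers in by_family.values()]
    return majority(family_answers)
```

With three models from one family answering "cat" and single models from two other families answering "dog", `flat_vote` returns "cat" while `hierarchical_family_vote` returns "dog", which is the kind of correlated-majority case the Misleading tier describes.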
[46] MosaicMem: Hybrid Spatial Memory for Controllable Video World Models cs.CV
Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin
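TL;DR: This paper proposes MosaicMem, a hybrid spatial memory that improves the spatiotemporal consistency of video diffusion models under camera motion, scene revisits, and intervention. It lifts image patches into 3D for reliable localization and retrieval, while using the model's native conditioning mechanism to keep generated content consistent with the prompt.
Details
Motivation: Spatial memory is a bottleneck for existing video diffusion models: explicit 3D structures improve reprojection consistency but struggle with dynamic objects, while implicit memory often produces inaccurate camera motion even when given correct camera poses.
Result: Experiments show that, combined with PRoPE camera conditioning and two new memory alignment methods, MosaicMem achieves better pose adherence than implicit memory and stronger dynamic modeling than explicit baselines.
Insight: The novelty is a hybrid spatial memory architecture that composes spatially aligned patches in the queried view through a patch-and-compose interface, preserving content that should persist while letting the model inpaint regions that should evolve, which supports minute-level navigation, memory-based scene editing, and autoregressive rollout.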
Abstract: Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model’s native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
[47] SMAL-pets: SMAL Based Avatars of Pets from Single Image cs.CV
Piotr Borycki, Joanna Waczyńska, Yizhe Zhu, Yongqiang Gao, Przemysław Spurek
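TL;DR: This paper proposes SMAL-pets, a framework for generating high-quality, editable 3D avatars of pets (dogs in particular) from a single image. It combines 3D Gaussian Splatting with the SMAL parametric model to create a representation that is both visually high-fidelity and anatomically grounded. The framework also introduces a multimodal editing suite that lets users refine appearance and execute complex animations directly through text prompts, providing a flexible, robust tool for animation and virtual reality applications.
Details
Motivation: Creating high-fidelity, animatable 3D dog avatars remains challenging: large-scale annotated datasets are lacking, the wide morphological diversity of animals makes generalization difficult, existing methods struggle to capture realistic fur textures, and ensuring editability and natural motion requires extensive manual work.
Result: The abstract gives no specific quantitative results or benchmarks, but claims that the method generates high-quality, editable animal avatars and allows appearance and behavior to be controlled through text prompts.
Insight: The novelty is combining 3D Gaussian Splatting with the SMAL parametric model into a hybrid representation that is visually high-fidelity and anatomically grounded; in addition, the multimodal editing suite and natural-language control enable end-to-end generation and editing from a single image to an animatable avatar, simplifying a workflow that traditionally requires expert intervention.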
Abstract: Creating high-fidelity, animatable 3D dog avatars remains a formidable challenge in computer vision. Unlike human digital doubles, animal reconstruction faces a critical shortage of large-scale, annotated datasets for specialized applications. Furthermore, the immense morphological diversity across species, breeds, and crosses, which varies significantly in size, proportions, and features, complicates the generalization of existing models. Current reconstruction methods often struggle to capture realistic fur textures. Additionally, ensuring these avatars are fully editable and capable of performing complex, naturalistic movements typically necessitates labor-intensive manual mesh manipulation and expert rigging. This paper introduces SMAL-pets, a comprehensive framework that generates high-quality, editable animal avatars from a single input image. Our approach bridges the gap between reconstruction and generative modeling by leveraging a hybrid architecture. Our method integrates 3D Gaussian Splatting with the SMAL parametric model to provide a representation that is both visually high-fidelity and anatomically grounded. We introduce a multimodal editing suite that enables users to refine the avatar’s appearance and execute complex animations through direct textual prompts. By allowing users to control both the aesthetic and behavioral aspects of the model via natural language, SMAL-pets provides a flexible, robust tool for animation and virtual reality.
[48] GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion cs.CV
Zhuojiang Cai, Zhenghui Sun, Feng Lu
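TL;DR: GazeOnce360 is a novel end-to-end model for 360° multi-person gaze estimation from a single tabletop-mounted, upward-facing fisheye camera. The work addresses the underexplored problem of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective, and introduces the large-scale synthetic dataset MPSGaze360. The model handles the severe distortion and perspective variation of fisheye images by combining rotational convolutions with eye landmark supervision, and captures fine-grained eye features with a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions.
Details
Motivation: The paper addresses the underexplored problem of multi-person 3D gaze estimation across a 360° scene from a single upward-facing fisheye camera, overcoming the limitation of conventional methods that rely on forward-facing cameras with constrained viewpoints.
Result: Experimental results demonstrate the effectiveness of each model component and show the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios.
Insight: Contributions include: rotational convolutions and eye landmark supervision to handle fisheye distortion and perspective variation; a dual-resolution architecture that fuses global context with local eye detail; and the large-scale synthetic dataset MPSGaze360, created to support this research direction.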
Abstract: We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios. Project page: https://caizhuojiang.github.io/GazeOnce360/.
[49] Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience cs.CV | cs.AI
Jacob Piland, Byron Dowling, Christopher Sweet, Adam Czajka
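TL;DR: This paper examines whether general-purpose multimodal large language models (MLLMs) can perform iris presentation attack detection (PAD) under strict privacy constraints when augmented with human expert knowledge in the form of human salience descriptions. The study shows that the pre-trained vision encoders of MLLMs inherently cluster many iris attack types, and that prompts incorporating human salience descriptions resolve ambiguities between classes. On a restricted dataset of 224 images spanning 7 attack types, Gemini with expert-informed prompts outperforms both a specialized CNN baseline and human examiners, while a locally deployed Llama model reaches near-human performance.
Details
Motivation: Iris PAD is critical for secure biometric deployments, but developing specialized models faces practical barriers: data representing future unknown attacks cannot be collected, collecting sufficiently diverse data is expensive and still offers limited predictive power, and sharing biometric data raises privacy concerns. The rapid emergence of new attack vectors calls for adaptable solutions, so the paper investigates whether general-purpose MLLMs can perform iris PAD when augmented with human expert knowledge, under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services.
Result: Tests were run on an IRB-restricted dataset of 224 iris images spanning 7 attack types, using only university-approved services (Gemini 2.5 Pro) or locally hosted models (e.g., Llama 3.2-Vision). Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN)-based baseline and human examiners, while the locally deployable Llama model achieves near-human performance.
Insight: The novelty lies in exploiting the ability of pre-trained vision encoders in general-purpose MLLMs (e.g., vision transformers) to cluster iris attack types without task-specific training, and in using structured prompts that incorporate human salience (verbal descriptions from subjects identifying attack indicators) to resolve ambiguity where clusters overlap. This offers a viable path to iris PAD with MLLMs deployed within institutional privacy constraints, avoiding the data collection and privacy-sharing difficulties of specialized models.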
Abstract: Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting sufficiently diverse data is expensive while still offering limited predictive power. Additionally, sharing biometric data raises privacy concerns. Because the rapid emergence of new attack vectors demands adaptable solutions, we investigate whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN)-based baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.
[50] Patient4D: Temporally Consistent Patient Body Mesh Recovery from Monocular Operating Room Video cs.CVPDF
Mingxiao Tu, Hoijoon Jung, Alireza Moghadam, Andre Kyme, Jinman Kim
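TL;DR: This paper presents Patient4D, a pipeline for recovering a 4D (spatiotemporally consistent) dense 3D body mesh of a patient from monocular operating room video. Targeting surgical augmented reality, where the patient is occluded by surgical drapes and the camera viewpoint moves continuously, it combines perception foundation models with lightweight geometric mechanisms and exploits the patient's stationarity prior to strengthen temporal consistency.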
Details
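Motivation: To recover a patient's 3D body mesh in surgical AR from monocular video in which the patient is occluded by surgical drapes and the camera viewpoint moves continuously. Existing HMR methods, built around upright, moving subjects captured by relatively stable cameras, degrade in this setting.
Result: Evaluated on 4,680 synthetic surgical sequences and three public HMR video benchmarks. Under surgical drape occlusion, Patient4D achieves a mean IoU of 0.75 and reduces the failure-frame rate from 30.5% for the best baseline to 1.3%.
Insight: The core novelty is the explicit use of the patient's stationarity prior through two key mechanisms: Pose Locking, which anchors pose parameters using stable keyframes, and Rigid Fallback, which recovers meshes under severe occlusion through silhouette-guided rigid alignment. These mechanisms stabilize predictions while remaining compatible with off-the-shelf HMR models, showing that prior knowledge can substantially improve monocular reconstruction in clinical AR scenarios.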
Abstract: Recovering a dense 3D body mesh from monocular video remains challenging under occlusion from draping and continuously moving camera viewpoints. This configuration arises in surgical augmented reality (AR), where an anesthetized patient lies under surgical draping while a surgeon’s head-mounted camera continuously changes viewpoint. Existing human mesh recovery (HMR) methods are typically trained on upright, moving subjects captured from relatively stable cameras, leading to performance degradation under such conditions. To address this, we present Patient4D, a stationarity-constrained reconstruction pipeline that explicitly exploits the stationarity prior. The pipeline combines image-level foundation models for perception with lightweight geometric mechanisms that enforce temporal consistency across frames. Two key components enable robust reconstruction: Pose Locking, which anchors pose parameters using stable keyframes, and Rigid Fallback, which recovers meshes under severe occlusion through silhouette-guided rigid alignment. Together, these mechanisms stabilize predictions while remaining compatible with off-the-shelf HMR models. We evaluate Patient4D on 4,680 synthetic surgical sequences and three public HMR video benchmarks. Under surgical drape occlusion, Patient4D achieves a 0.75 mean IoU, reducing failure frames from 30.5% to 1.3% compared to the best baseline. Our findings demonstrate that exploiting stationarity priors can substantially improve monocular reconstruction in clinical AR scenarios.
[51] Visual Product Search Benchmark cs.CV | cs.IRPDF
Karthik Sulthanpete Govindappa
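TL;DR: This report presents a visual product search benchmark for industrial-grade product identification. It evaluates a range of modern visual embedding models on instance-level image retrieval, focusing on the practical constraints and heterogeneous imaging conditions of industrial applications.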
Details
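Motivation: Reliable product identification from images is a critical need in industrial and commercial applications, particularly in maintenance, procurement, and operational workflows, where incorrect matches can lead to costly downstream failures. This calls for evaluating the retrieval capability of visual embedding models in realistic settings.
Result: Under a unified image-to-image retrieval protocol, the benchmark evaluates open-source foundation embedding models, proprietary multi-modal embedding systems, and domain-specific vision-only models on industrial datasets (from manufacturing, automotive, DIY, and retail) and public benchmarks. The results show how well foundation models transfer to fine-grained instance retrieval and how they compare with models built for industrial use.
Insight: The novelty is a benchmark that emphasizes realistic constraints, heterogeneous imaging conditions, and exact instance matching. It gives practitioners and researchers insight into the strengths and limitations of current visual embedding approaches in production-level product identification systems, and it comes with an interactive website presenting the results.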
Abstract: Reliable product identification from images is a critical requirement in industrial and commercial applications, particularly in maintenance, procurement, and operational workflows where incorrect matches can lead to costly downstream failures. At the core of such systems lies the visual search component, which must retrieve and rank the exact object instance from large and continuously evolving catalogs under diverse imaging conditions. This report presents a structured benchmark of modern visual embedding models for instance-level image retrieval, with a focus on industrial applications. A curated set of open-source foundation embedding models, proprietary multi-modal embedding systems, and domain-specific vision-only models is evaluated under a unified image-to-image retrieval protocol. The benchmark comprises curated datasets, including industrial datasets derived from production deployments in Manufacturing, Automotive, DIY, and Retail, as well as established public benchmarks. Evaluation is conducted without post-processing, isolating the retrieval capability of each model. The results provide insight into how well contemporary foundation and unified embedding models transfer to fine-grained instance retrieval tasks, and how they compare to models explicitly trained for industrial applications. By emphasizing realistic constraints, heterogeneous image conditions, and exact instance matching requirements, this benchmark aims to inform both practitioners and researchers about the strengths and limitations of current visual embedding approaches in production-level product identification systems. An interactive companion website presenting the benchmark results, evaluation details, and additional visualizations is available at https://benchmark.nyris.io.
[52] Adaptive Anchor Policies for Efficient 4D Gaussian Streaming cs.CVPDF
Ashim Dahal, Rabab Abdelfattah, Nick Rahimi
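TL;DR: This paper proposes Efficient Gaussian Streaming (EGS), a plug-in, budget-aware anchor sampler for dynamic scene reconstruction. It replaces conventional fixed anchor selection (such as Farthest Point Sampling, FPS) with a reinforcement-learned policy while keeping the Gaussian streaming reconstruction backbone unchanged. The policy adaptively selects an anchor budget and an informative anchor subset according to scene complexity, balancing reconstruction quality and runtime efficiency.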
Details
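Motivation: Existing Gaussian Splatting streaming methods for dynamic scene reconstruction usually rely on fixed anchor selection such as FPS, using a fixed number of anchors (e.g., 8,192) regardless of scene complexity, which over-allocates computation under strict budgets. A budget-adaptive anchor selection method is therefore needed to improve the quality-efficiency trade-off.
Result: Experiments on dynamic multi-view datasets show consistent improvements in the quality-efficiency trade-off over FPS sampling. On unseen data, in fast rendering mode with 256 anchors (32× fewer than 8,192), EGS improves PSNR by 0.52–0.61 dB while running 1.29–1.35× faster than IGS@8192 (on the N3DV and MeetingRoom datasets). In high-quality refinement mode, EGS remains competitive with the full-anchor baseline at substantially lower anchor budgets.
Insight: The novelty is a reinforcement-learned policy that adaptively chooses both the anchor budget and the anchors themselves, using spatial features of the Gaussian representation to balance reconstruction quality and runtime efficiency. The method is a plug-and-play module that leaves the existing Gaussian streaming reconstruction backbone unchanged, offering a more efficient anchor sampler for dynamic scene streaming.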
Abstract: Dynamic scene reconstruction with Gaussian Splatting has enabled efficient streaming for real-time rendering and free-viewpoint video. However, most pipelines rely on fixed anchor selection such as Farthest Point Sampling (FPS), typically using 8,192 anchors regardless of scene complexity, which over-allocates computation under strict budgets. We propose Efficient Gaussian Streaming (EGS), a plug-in, budget-aware anchor sampler that replaces FPS with a reinforcement-learned policy while keeping the Gaussian streaming reconstruction backbone unchanged. The policy jointly selects an anchor budget and a subset of informative anchors under discrete constraints, balancing reconstruction quality and runtime using spatial features of the Gaussian representation. We evaluate EGS in two settings: fast rendering, which prioritizes runtime efficiency, and high-quality refinement, which enables additional optimization. Experiments on dynamic multi-view datasets show consistent improvements in the quality–efficiency trade-off over FPS sampling. On unseen data, in fast rendering at 256 anchors ($32\times$ fewer than 8,192), EGS improves PSNR by $+0.52$–$0.61$ dB while running $1.29$–$1.35\times$ faster than IGS@8192 (N3DV and MeetingRoom). In high-quality refinement, EGS remains competitive with the full-anchor baseline at substantially lower anchor budgets. Code and pretrained checkpoints will be released upon acceptance. Keywords: 4D Gaussian Splatting, 4D Gaussian Streaming, Reinforcement Learning.
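For readers unfamiliar with the fixed baseline that EGS replaces, the following is a minimal sketch of Farthest Point Sampling over a point set. It is not code from the paper: the function name `farthest_point_sampling`, the `start` parameter, and the random toy cloud are assumptions for illustration, and the learned EGS policy itself is not shown.

```python
import math
import random

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all anchors chosen so far."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen anchor.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen

# Toy stand-in for Gaussian centers (hypothetical data, not from the paper).
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(512)]
anchors = farthest_point_sampling(cloud, 16)
```

FPS spends the same anchor count on every scene and looks only at geometry; the abstract's point is that a learned policy can pick both the count and the subset.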
[53] From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs cs.CV | cs.AI | cs.LGPDF
Boyong Wu, Sanghwan Kim, Zeynep Akata
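TL;DR: Using layerwise linear probing and attention intervention analysis, this paper studies the spatial understanding of multimodal large language models (MLLMs) on segmentation. It finds that the adapter causes a drop-off in segmentation representations, which the LLM layers progressively recover through attention, with correctly classified tokens steering misclassified neighboring tokens toward the correct label. Recovery at early image token positions is bounded by causal attention, which bidirectional attention among image tokens alleviates.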
Details
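Motivation: MLLMs are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains unclear. This paper uses mechanistic analysis to reveal how MLLMs process visual information for segmentation, to inform future model design.
Result: Linear probing and attention knockout analysis show that LLM layers progressively recover the segmentation representation drop-off introduced by the adapter through attention-mediated refinement, and that bidirectional attention among image tokens improves spatial consistency.
Insight: The novelty is a mechanistic analysis of segmentation capacity in MLLMs. It reveals the key role of attention in recovering visual representations and the potential of bidirectional attention to relieve the limits of causal attention, offering a new perspective for designing models with stronger segmentation ability.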
Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.
[54] GigaWorld-Policy: An Efficient Action-Centered World–Action Model cs.CVPDF
Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao
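TL;DR: This paper presents GigaWorld-Policy, an efficient action-centered World-Action Model built on a pre-trained video generation backbone for robot policy learning. It formulates policy training as two coupled components, action prediction and conditional video generation, and uses a causal design for efficient inference, substantially increasing inference speed while maintaining performance.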
Details
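Motivation: Existing World-Action Models built on pre-trained video generation face two bottlenecks in robot policy learning: jointly reasoning over future visual dynamics and the corresponding actions incurs heavy inference overhead, and entangled visual and motion representations make action prediction accuracy depend heavily on the quality of future video forecasts. This paper aims to resolve both.
Result: Experiments on real-world robotic platforms show that GigaWorld-Policy runs 9× faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Compared with pi-0.5, it improves performance by 95% on the RoboTwin 2.0 benchmark.
Insight: The core novelty is formulating policy learning as two coupled but causally isolated components, action prediction and conditional video generation. Joint supervision from the action prediction and video generation losses provides richer learning signals, while the causal design makes explicit video generation optional at inference time, enabling efficient action decoding. Curating a large-scale robot dataset to pre-train an action-centered video generation model as the backbone is also a significant engineering contribution.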
Abstract: World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
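The causal design described above can be illustrated with a block attention mask. This is a sketch under stated assumptions, not the paper's implementation: the token layout `[observation | action | video]`, the function name `build_mask`, and the exact visibility rules are hypothetical. The one property taken from the abstract is that future-video tokens must not influence action tokens.

```python
def build_mask(n_obs, n_act, n_vid):
    """mask[q][k] is True when query token q may attend to key token k."""
    kinds = ["obs"] * n_obs + ["act"] * n_act + ["vid"] * n_vid
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if kinds[k] == "obs":
                # Assumption: every token may read the current observation.
                mask[q][k] = True
            elif kinds[k] == "act":
                # Actions are causal among themselves; video tokens see all actions.
                mask[q][k] = kinds[q] == "vid" or (kinds[q] == "act" and k <= q)
            else:
                # Video keys are visible only to video queries, causally.
                mask[q][k] = kinds[q] == "vid" and k <= q
    return mask

full = build_mask(n_obs=2, n_act=3, n_vid=4)
action_only = build_mask(n_obs=2, n_act=3, n_vid=0)
# Dropping the video tokens leaves the observation and action rows unchanged.
action_rows_unchanged = [row[:5] for row in full[:5]] == action_only
```

Because no action row attends to a video column, the video tokens can be omitted at inference without changing what the action tokens see, which is why explicit video generation becomes optional.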
[55] LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis cs.CV | cs.CLPDF
Inbum Heo, Taewook Hwang, Jeesu Jung, Sangkeun Jung
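TL;DR: This paper proposes LED, a benchmark for evaluating layout error detection in document analysis. It defines eight standardized layout error types and builds the LED-Dataset with three evaluation tasks, aiming to go beyond conventional overlap-based metrics and assess the structural reasoning of document layout analysis models in a fine-grained, interpretable way.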
Details
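Motivation: Although large language models and large multimodal models have advanced document layout analysis, structural errors such as region merging, splitting, and omission persist, and conventional overlap-based metrics cannot capture such logical inconsistencies.
Result: Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses in structural reasoning across modalities and architectures.
Insight: The novelty is a unified and interpretable benchmark focused on structural error detection. It defines standard error types with injection algorithms that simulate realistic errors, and it designs evaluation tasks ranging from document level to element level, providing a new tool for diagnosing the structural robustness and reasoning capability of document understanding models.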
Abstract: Recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs) have improved Document Layout Analysis (DLA), yet structural errors such as region merging, splitting, and omission remain persistent. Conventional overlap-based metrics (e.g., IoU, mAP) fail to capture such logical inconsistencies. To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy. LED defines eight standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, and Misclassification) and provides quantitative rules and injection algorithms for realistic error simulation. Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures. Overall, LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.
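To make the limitation of overlap-based metrics concrete, the sketch below computes box IoU for a Split-style error, where one ground-truth region is predicted as two halves. The containment rule in `looks_like_split` and its 0.9 threshold are hypothetical illustrations, not the quantitative rules defined by LED.

```python
def area(b):
    """Area of an axis-aligned box (x0, y0, x1, y1); zero if the box is empty."""
    x0, y0, x1, y1 = b
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def intersection(a, b):
    """Area of the overlap between two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))

def iou(a, b):
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def looks_like_split(gt, preds, contain=0.9):
    """Hypothetical rule: two or more predictions lie almost entirely inside one ground-truth box."""
    inside = [p for p in preds if area(p) and intersection(gt, p) / area(p) >= contain]
    return len(inside) >= 2

gt = (0, 0, 10, 10)
preds = [(0, 0, 10, 5), (0, 5, 10, 10)]  # the region split into two halves
print([iou(gt, p) for p in preds])  # [0.5, 0.5]
print(looks_like_split(gt, preds))  # True
```

Each half scores an IoU of 0.5 on its own, a number that says nothing about the two boxes jointly being one region split in two; naming the error type requires a structural rule like the one sketched here.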
[56] ConfusionBench: An Expert-Validated Benchmark for Confusion Recognition and Localization in Educational Videos cs.CVPDF
Lu Dong, Xiao Wang, Mark Frank, Srirangaraj Setlur, Venu Govindaraju
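TL;DR: This paper introduces ConfusionBench, an expert-validated benchmark for recognizing and localizing student confusion in educational videos. Built with a multi-stage filtering pipeline, it contains a balanced confusion recognition dataset and a video localization dataset, and it provides zero-shot baseline evaluations of an open-source model and a proprietary model.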
Details
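Motivation: Existing confusion datasets suffer from noisy labels, coarse temporal annotations, and limited expert validation, which hinder reliable fine-grained recognition and temporal analysis. A higher-quality benchmark is therefore needed.
Result: Experiments show that the proprietary model performs better overall but tends to over-predict transitional segments, while the open-source model is more conservative and more prone to missed detections. The benchmark datasets and related materials will be made publicly available.
Insight: The novelty is a multi-stage filtering pipeline that combines model-assisted screening, researcher curation, and expert validation to build a high-quality benchmark, together with a student confusion report visualization that supports educational experts in making intervention decisions.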
Abstract: Recognizing and localizing student confusion from video is an important yet challenging problem in educational AI. Existing confusion datasets suffer from noisy labels, coarse temporal annotations, and limited expert validation, which hinder reliable fine-grained recognition and temporally grounded analysis. To address these limitations, we propose a practical multi-stage filtering pipeline that integrates two stages of model-assisted screening, researcher curation, and expert validation to build a higher-quality benchmark for confusion understanding. Based on this pipeline, we introduce ConfusionBench, a new benchmark for educational videos consisting of a balanced confusion recognition dataset and a video localization dataset. We further provide zero-shot baseline evaluations of a representative open-source model and a proprietary model on clip-level confusion recognition and long-video confusion localization tasks. Experimental results show that the proprietary model performs better overall but tends to over-predict transitional segments, while the open-source model is more conservative and more prone to missed detections. In addition, the proposed student confusion report visualization can support educational experts in making intervention decisions and adapting learning plans accordingly. All datasets and related materials will be made publicly available on our project page.
[57] DANCE: Dynamic 3D CNN Pruning: Joint Frame, Channel, and Feature Adaptation for Energy Efficiency on the Edge cs.CV | cs.AIPDF
Mohamed Mejri, Ashiqur Rasul, Abhijit Chatterjee
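TL;DR: This paper proposes DANCE, a dynamic pruning framework for 3D CNNs that maximizes energy efficiency on edge devices through fine-grained, input-aware pruning. The method has two steps, activation variability amplification (AVA) and adaptive activation pruning (AAP), and dynamically prunes video frames, channels, and features to reduce computation and memory accesses while leaving performance essentially unchanged.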
Details
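Motivation: Modern convolutional neural networks (CNNs) for video and image processing cannot adapt dynamically to the computational complexity of each input sample to minimize energy consumption, so an input-adaptive, energy-efficient dynamic pruning method is needed.
Result: Hardware validation on the NVIDIA Jetson Nano GPU and the Qualcomm Snapdragon 8 Gen 1 platform shows speedups of 1.37× and 2.22×, respectively, and up to 1.47× higher energy efficiency than the state of the art (SOTA), with substantial reductions in multiply-accumulate (MAC) operations and memory accesses.
Insight: The novelties are: 1) a two-step dynamic pruning framework (AVA and AAP) that amplifies activation variability and trains a lightweight controller network for input-aware, fine-grained pruning; 2) joint pruning of frames, channels, and features in 3D CNNs, introducing sparsity within convolutional layers to improve energy efficiency on the edge; 3) hardware validation showing significant speedups and energy-efficiency gains in real deployment.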
Abstract: Modern convolutional neural networks (CNNs) are workhorses for video and image processing, but fail to adapt to the computational complexity of input samples in a dynamic manner to minimize energy consumption. In this research, we propose DANCE, a fine-grained, input-aware, dynamic pruning framework for 3D CNNs to maximize power efficiency with negligible to zero impact on performance. In the proposed two-step approach, the first step is called activation variability amplification (AVA), and the 3D CNN model is retrained to increase the variance of the magnitude of neuron activations across the network in this step, facilitating pruning decisions across diverse CNN input scenarios. In the second step, called adaptive activation pruning (AAP), a lightweight activation controller network is trained to dynamically prune frames, channels, and features of 3D convolutional layers of the network (different for each layer), based on statistics of the outputs of the first layer of the network. Our method achieves substantial savings in multiply-accumulate (MAC) operations and memory accesses by introducing sparsity within convolutional layers. Hardware validation on the NVIDIA Jetson Nano GPU and the Qualcomm Snapdragon 8 Gen 1 platform demonstrates respective speedups of 1.37X and 2.22X, achieving up to 1.47X higher energy efficiency compared to the state of the art.
[58] 3D MRI-Based Alzheimer’s Disease Classification Using Multi-Modal 3D CNN with Leakage-Aware Subject-Level Evaluation cs.CVPDF
Md Sifat, Sania Akter, Akif Islam, Md. Ekramul Hamid, Abu Saleh Musa Miah
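TL;DR: This paper proposes a multimodal 3D convolutional neural network for Alzheimer’s disease classification that operates directly on raw OASIS 1 MRI volumes, combining T1 structural images with gray matter, white matter, and cerebrospinal fluid probability maps. Under 5-fold subject-level cross-validation, the model achieves a mean accuracy of 72.34% and a ROC AUC of 0.7781 on the OASIS 1 cohort. Comparative experiments also analyze how data representation and evaluation strategy affect performance.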
Details
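Motivation: Many existing studies analyze individual 2D slices extracted from MRI volumes, whereas clinical neuroimaging practice relies on the full three-dimensional structure of the brain. Volumetric analysis may therefore better capture spatial relationships among brain regions that are relevant to disease progression.
Result: On the clinically labelled OASIS 1 cohort, under 5-fold subject-level cross-validation, the model achieves a mean accuracy of 72.34% ± 4.66% and a ROC AUC of 0.7781 ± 0.0365. GradCAM visualizations show that the model attends to anatomical regions associated with Alzheimer’s disease, such as the medial temporal lobe and ventricular areas.
Insight: The novelty is a multimodal 3D CNN framework that combines raw T1 volumes with segmentation probability maps and analyzes them directly in three dimensions to make better use of spatial information. The study also stresses the importance of subject-level evaluation and, by comparing slice-level and subject-level evaluation, gives context for the volumetric results and establishes a reproducible benchmark.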
Abstract: Deep learning has become an important tool for Alzheimer’s disease (AD) classification from structural MRI. Many existing studies analyze individual 2D slices extracted from MRI volumes, while clinical neuroimaging practice typically relies on the full three-dimensional structure of the brain. From this perspective, volumetric analysis may better capture spatial relationships among brain regions that are relevant to disease progression. Motivated by this idea, this work proposes a multimodal 3D convolutional neural network for AD classification using raw OASIS 1 MRI volumes. The model combines structural T1 information with gray matter, white matter, and cerebrospinal fluid probability maps obtained through FSL FAST segmentation in order to capture complementary neuroanatomical information. The proposed approach is evaluated on the clinically labelled OASIS 1 cohort using 5-fold subject-level cross-validation, achieving a mean accuracy of 72.34% ± 4.66% and a ROC AUC of 0.7781 ± 0.0365. GradCAM visualizations further indicate that the model focuses on anatomically meaningful regions, including the medial temporal lobe and ventricular areas that are known to be associated with Alzheimer’s-related structural changes. To better understand how data representation and evaluation strategies may influence reported performance, additional diagnostic experiments were conducted on a slice-based version of the dataset under both slice-level and subject-level protocols. These observations help provide context for the volumetric results. Overall, the proposed multimodal 3D framework establishes a reproducible subject-level benchmark and highlights the potential benefits of volumetric MRI analysis for Alzheimer’s disease classification.
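The difference between slice-level and subject-level protocols comes down to how folds are built. The sketch below shows a leakage-aware split in which every sample from a given subject lands in the same fold. It is an illustration only: the function name `subject_level_folds`, the subject IDs, and the toy sizes (20 subjects, 8 slices each) are assumptions, not details from the paper.

```python
import random

def subject_level_folds(sample_subjects, n_folds=5, seed=0):
    """Return (train_idx, test_idx) pairs in which no subject appears on both sides."""
    subjects = sorted(set(sample_subjects))
    random.Random(seed).shuffle(subjects)
    # Assign whole subjects, not individual samples, to folds.
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for f in range(n_folds):
        test = [i for i, s in enumerate(sample_subjects) if fold_of[s] == f]
        train = [i for i, s in enumerate(sample_subjects) if fold_of[s] != f]
        folds.append((train, test))
    return folds

# Hypothetical toy data: one subject label per slice.
sample_subjects = [f"subj{n:02d}" for n in range(20) for _ in range(8)]
folds = subject_level_folds(sample_subjects)
```

Shuffling individual slices instead would place slices from one brain in both the training and test sets, which is the kind of leakage a subject-level protocol is meant to rule out.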
[59] Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding cs.CV | cs.AIPDF
Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu
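TL;DR: This paper proposes Symphony, a multi-agent system inspired by human cognition that addresses the difficulties multimodal large language model agents face in long-video understanding. By emulating human cognition patterns, it decomposes long-video understanding into fine-grained subtasks and introduces a reflection-enhanced deep reasoning collaboration mechanism. It also uses a vision-language-model-based grounding approach to analyze tasks and assess the relevance of video segments, improving complex reasoning and localization on videos with high information density and long temporal spans.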
Details
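Motivation: Existing MLLM agents perform poorly on long-video understanding tasks with high information density and long temporal spans. Simple task decomposition and collaboration mechanisms, and direct compression of temporal context through embedding-based retrieval, are limited: they cannot handle long-chain reasoning effectively and may lose key information.
Result: Experiments show that Symphony achieves state-of-the-art performance on several benchmarks, including LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior best method on LVBench.
Insight: The novelties are: 1) a cognitively inspired multi-agent framework with reflection-enhanced deep reasoning collaboration for long-chain reasoning; 2) a VLM-based grounding approach for task analysis and video segment relevance assessment, which better handles complex questions with implicit intentions and large temporal spans. Viewed objectively, combining the reflection mechanism from cognitive science with multi-agent collaboration offers a new approach to complex spatiotemporal reasoning in long-video understanding.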
Abstract: Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.
[60] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress cs.CV | cs.AIPDF
Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang
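TL;DR: This paper proposes the Recurrent Reasoning Vision-Language Model (R²VLM), a new method for estimating the progress of embodied agents on long-horizon, multi-step tasks. Its recurrent reasoning framework processes local video snippets iteratively and maintains an evolving Chain of Thought (CoT) that records the task decomposition, key steps, and their completion status, enabling reasoning over complex temporal dependencies without the high computational cost of processing long videos. The model is trained on large-scale datasets generated from ALFRED and Ego4D and performs well on progress estimation and related downstream applications.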
Details
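Motivation: Existing methods based on vision-language models (VLMs) rely mainly on video understanding and neglect the models' complex reasoning potential, and processing long video trajectories is too computationally expensive for real-world deployment. This paper addresses both problems to estimate long-horizon embodied task progress accurately.
Result: Extensive experiments on progress estimation and downstream applications (progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance) show that R²VLM achieves strong performance and generalization, setting a new state of the art (SOTA) in long-horizon task progress estimation.
Insight: The novelty is a recurrent reasoning framework that explicitly models task structure and temporal dependencies by iteratively processing local video snippets with an evolving Chain of Thought (CoT), which both lowers computational overhead and strengthens reasoning. Viewed objectively, the method pairs the reasoning potential of VLMs with efficient long-sequence processing, offering a scalable solution for progress estimation in embodied intelligence.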
Abstract: Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Model (VLM)-based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model ($\text{R}^2$VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train $\text{R}^2$VLM on large-scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that $\text{R}^2$VLM achieves strong performance and generalization, achieving a new state-of-the-art in long-horizon task progress estimation. The models and benchmarks are publicly available on Hugging Face at https://huggingface.co/collections/zhangyuelin/r2vlm.
[61] A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition cs.CVPDF
Hongbing Li, Jiamin Liu, Shuo Zhang, Bo Xiao
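TL;DR: This paper proposes a proposal-free Query-Guided Network (QGN) for Grounded Multimodal Named Entity Recognition (GMNER), the task of identifying named entities in text and grounding them to the corresponding regions of an associated image. QGN unifies multimodal reasoning and decoding through text guidance and cross-modal interaction, achieving accurate entity grounding and robust performance in open-domain scenarios.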
Details
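Motivation: Existing methods usually split GMNER into two steps: detect objects with a pre-trained general-purpose object detector, then match named entities to the detected objects. Because general-purpose detectors operate independently of the textual entities, they tend to detect common objects and overlook the fine-grained regions that named entities require, leading to imprecise alignment and degraded system performance.
Result: Extensive experiments show that QGN achieves the best performance among the compared GMNER models on widely used benchmarks.
Insight: The novelty is an end-to-end query-guided network that needs no pre-generated object proposals. Text directly guides visual feature extraction and cross-modal interaction, grounding textual entities to image regions more precisely and resolving the mismatch between the detector and entity requirements found in conventional two-stage methods.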
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) identifies named entities, including their spans and types, in natural language text and grounds them to the corresponding regions in associated images. Most existing approaches split this task into two steps: they first detect objects using a pre-trained general-purpose detector and then match named entities to the detected objects. However, these methods face a major limitation. Because pre-trained general-purpose object detectors operate independently of textual entities, they tend to detect common objects and frequently overlook specific fine-grained regions required by named entities. This misalignment between object detectors and entities introduces imprecision and can impair overall system performance. In this paper, we propose a proposal-free Query-Guided Network (QGN) that unifies multimodal reasoning and decoding through text guidance and cross-modal interaction. QGN enables accurate grounding and robust performance in open-domain scenarios. Extensive experiments demonstrate that QGN achieves top performance among compared GMNER models on widely used benchmarks.
[62] MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation cs.CVPDF
Thuy Truong Tran, Minh Kha Do, Phuc Nguyen Duy, Min Hun Lee
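TL;DR: This paper proposes MedSAD-CLIP, a model for medical anomaly detection and segmentation. It adapts CLIP in a supervised manner and uses Token-Patch Cross-Attention to fuse fine-grained text-visual cues, improving lesion localization while preserving CLIP's generalization capability. The model outperforms existing methods in pixel-level segmentation and image-level classification on several medical imaging datasets.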
Details
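Motivation: Existing CLIP-based zero-/few-shot anomaly detection methods typically rely on global representations and weak supervision, which yields coarse localization and limited segmentation quality. This paper uses the limited yet meaningful labeled abnormal data available in clinical settings to adapt CLIP with supervision and improve the accuracy of medical anomaly detection and segmentation.
Result: Extensive experiments were conducted on four diverse medical imaging benchmarks covering brain, retina, lung, and breast data. The method outperforms current state-of-the-art methods in both pixel-level segmentation and image-level classification.
Insight: The novelties are: 1) Token-Patch Cross-Attention, which uses fine-grained text-visual cues to improve lesion localization; 2) lightweight image adapters and learnable prompt tokens, which adapt CLIP to the medical domain while preserving its rich semantic alignment; 3) a margin-based image-text contrastive loss that sharpens the distinction between normal and abnormal global representations. Together these provide a unified and scalable supervised CLIP adaptation paradigm for medical anomaly understanding.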
Abstract: Medical anomaly detection (MAD) and segmentation play a critical role in assisting clinical diagnosis by identifying abnormal regions in medical images and localizing pathological regions. Recent CLIP-based studies are promising for anomaly detection in zero-/few-shot settings, but they typically rely on global representations and weak supervision, often producing coarse localization and limited segmentation quality. In this work, we study supervised adaptation of CLIP for MAD under a realistic clinical setting where a limited yet meaningful amount of labeled abnormal data is available. Our model MedSAD-CLIP leverages fine-grained text-visual cues via Token-Patch Cross-Attention (TPCA) to improve lesion localization while preserving the generalization capability of CLIP representations. Lightweight image adapters and learnable prompt tokens efficiently adapt the pretrained CLIP encoder to the medical domain while preserving its rich semantic alignment. Furthermore, a Margin-based image-text Contrastive Loss is designed to enhance global feature discrimination between normal and abnormal representations. Extensive experiments on four diverse benchmarks (Brain, Retina, Lung, and Breast datasets) demonstrate the effectiveness of our approach, achieving superior performance in both pixel-level segmentation and image-level classification over state-of-the-art methods. Our results highlight the potential of supervised CLIP adaptation as a unified and scalable paradigm for medical anomaly understanding. Code will be made available at https://github.com/thuy4tbn99/MedSAD-CLIP
[63] FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions cs.CVPDF
Peisen Zhao, Xiaopeng Zhang, Mingxing Xu, Ruoyu Sun, Zewei Du
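TL;DR: FineViT is a new vision encoder designed to unlock fine-grained perception. It replaces coarse web data with dense recaptions and uses a progressive training paradigm to systematically reduce information loss. The encoder is first trained from scratch at a high native resolution to build a robust, detail-rich semantic foundation, and its local perception is then further strengthened through LLM alignment. Experiments show that FineViT reaches state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval.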
Details
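Motivation: In multimodal large language models, the vision encoder loses visual detail because of low-resolution pretraining and reliance on coarse web image-text pairs, making it a performance bottleneck. This paper addresses that problem.
Result: FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs.
Insight: The novelties are replacing coarse web data with dense recaption data to preserve visual detail, and a progressive training paradigm (high-resolution global pretraining to build a semantic foundation, followed by LLM alignment to strengthen local perception). This provides a strong new baseline for fine-grained visual perception.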
Abstract: While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset that comprises over 450 million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT could serve as a powerful new baseline for fine-grained visual perception.
[64] EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection cs.CVPDF
Chenyang Zhu, Maorong Wang, Jun Liu, Ching-Chun Chang, Isao Echizen
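TL;DR: This paper proposes EvoGuard, an extensible framework based on agentic reinforcement learning for practical and evolving AI-generated image detection. It wraps a variety of off-the-shelf multimodal large language model detectors and non-MLLM detectors as callable tools and coordinates them through a capability-aware dynamic orchestration mechanism, using the agent's autonomous planning and reflection for multi-turn reasoning in complex, dynamic real-world environments.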
Details
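Motivation: The rapid proliferation of AI-generated images poses serious misinformation risks. Existing detection methods rely mainly on low-level features or on the general understanding ability of MLLMs, but they suffer from limited extensibility and expensive training data annotation.
Result: Extensive experiments show that EvoGuard achieves SOTA detection accuracy while mitigating the bias between positive and negative samples, and that new detectors can be integrated in a plug-and-play, training-free manner to boost overall performance.
Insight: The novelty is an agentic framework for dynamically orchestrating heterogeneous detectors. Through a capability-aware mechanism and a GRPO-based agentic reinforcement learning algorithm, it exploits the complementary strengths of different detectors and is optimized using only low-cost binary labels, giving it high practicality and long-term adaptability to evolving threats.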
Abstract: The rapid proliferation of AI-Generated Images (AIGIs) has introduced severe risks of misinformation, making AIGI detection a critical yet challenging task. While traditional detection paradigms mainly rely on low-level features, recent research increasingly focuses on leveraging the general understanding ability of Multimodal Large Language Models (MLLMs) to achieve better generalization, but such approaches still suffer from limited extensibility and expensive training data annotations. To better address complex and dynamic real-world environments, we propose EvoGuard, a novel agentic framework for AIGI detection. It encapsulates various state-of-the-art (SOTA) off-the-shelf MLLM and non-MLLM detectors as callable tools, and coordinates them through a capability-aware dynamic orchestration mechanism. Empowered by the agent’s capacities for autonomous planning and reflection, it intelligently selects suitable tools for given samples, reflects on intermediate results, and decides the next action, reaching a final conclusion through multi-turn invocation and reasoning. This design effectively exploits the complementary strengths among heterogeneous detectors, transcending the limits of any single model. Furthermore, optimized by a GRPO-based Agentic Reinforcement Learning algorithm using only low-cost binary labels, it eliminates the reliance on fine-grained annotations. Extensive experiments demonstrate that EvoGuard achieves SOTA accuracy while mitigating the bias between positive and negative samples. More importantly, it allows the plug-and-play integration of new detectors to boost overall performance in a train-free manner, offering a highly practical, long-term solution to ever-evolving AIGI threats. Source code will be publicly available upon acceptance.
[65] OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery cs.CVPDF
Yiwen Zhao, Ce Zheng, Yufu Wang, Hsueh-Han Daniel Yang, Liting Wen
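TL;DR: This paper proposes OnlineHMR, a fully online framework for video human mesh recovery that reconstructs 3D human pose and trajectory in world coordinates from monocular video in real time. Using a causal key-value cache, a sliding-window learning strategy, and a human-centric incremental SLAM, it meets four criteria for online processing: causality, faithfulness, temporal consistency, and efficiency.
Details
Motivation: Most existing human mesh recovery methods are offline and rely on future frames or global optimization, which limits their use in applications that need interactive feedback and perception-action loops, such as AR/VR and telepresence.
Result: Experiments show that the method performs comparably to existing chunk-based approaches on the standard EMDB benchmark and on highly dynamic custom videos, while uniquely supporting online processing.
Insight: The main novelty is a fully online two-branch architecture that combines a causal key-value cache for streaming inference with a human-centric incremental SLAM module that provides online world-grounded alignment under physically plausible trajectory correction.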
Abstract: Human mesh recovery (HMR) models 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing, including system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key-value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing. Page and code are available at https://tsukasane.github.io/Video-OnlineHMR/.
[66] MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval cs.CVPDF
Xuri Ge, Chunhao Wang, Xindi Wang, Zheyun Qin, Zhumin Chen
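TL;DR: This paper proposes MCoT-MVS, a new method for composed image retrieval. A multimodal large language model performs chain-of-thought reasoning to generate textual cues, which guide two reference visual attention selection modules to extract discriminative patch-level and instance-level semantics from the reference image. A weighted hierarchical combination module then fuses these multi-granular visual cues with the modification text for retrieval in a unified embedding space.
Details
Motivation: Existing composed image retrieval methods struggle to extract from the reference image the semantic cues that best reflect the user's intended textual modification, and are easily disturbed by irrelevant visual noise.
Result: Extensive experiments on the CIRR and FashionIQ benchmarks show that the method consistently outperforms existing methods and sets a new state of the art.
Insight: The novelty lies in using MLLM chain-of-thought reasoning to generate textual cues that guide multi-level visual feature selection, and in fusing multi-granular information through weighted hierarchical combination to capture user intent more precisely.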
Abstract: Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user’s intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.
[67] Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift cs.CV | cs.AIPDF
Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen
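TL;DR: This paper studies jailbreaks in vision-language models (VLMs) and finds that adding an image substantially raises jailbreak success rates even when the text prompt carries explicit harmful intent. In representation space, VLMs clearly separate benign from harmful inputs, and jailbreak samples form an internal state distinct from refusal samples. Jailbreaks therefore do not stem from a failure to recognize harmful intent; instead, the visual modality shifts representations toward a specific jailbreak state, so refusal is not triggered. To quantify this, the authors identify a "jailbreak direction" and define the jailbreak-related shift as the component of the image-induced representation shift along that direction. Their analysis shows that this shift reliably characterizes jailbreak behavior and gives a unified explanation for diverse jailbreak scenarios. Finally, they propose a defense, JRS-Rem, that removes the jailbreak-related shift at inference time.
Details
Motivation: Integrating the visual modality weakens the safety alignment of large VLMs. The paper asks how images cause a model to produce jailbroken responses to text prompts with harmful intent, and investigates the underlying mechanism.
Result: Experiments show that JRS-Rem provides strong defense across multiple jailbreak scenarios while preserving performance on benign tasks.
Insight: The novelty is a representation-space account of jailbreaks: they arise from a representation shift induced by the visual modality, not from the model failing to recognize harmful intent. The paper also proposes a quantifiable "jailbreak direction" and a defense based on removing the jailbreak-related shift, giving both a new perspective and a practical method.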
Abstract: Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
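The definition in the abstract (the component of the image-induced shift along a jailbreak direction, removed at inference time) can be illustrated with a small vector projection. This is only a sketch under stated assumptions: hidden states are plain lists of floats, the direction is supplied by the caller, and the function name `remove_jailbreak_shift` is hypothetical. The abstract does not say how the direction is estimated or at which layer the removal is applied.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_jailbreak_shift(h_text_only, h_with_image, direction):
    """Subtract the shift component along `direction` (hypothetical helper).

    h_text_only:  representation of the text prompt alone (assumed input)
    h_with_image: representation of the same prompt with the image added
    direction:    assumed jailbreak direction; normalized here
    Returns (corrected representation, scalar jailbreak-related shift).
    """
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    # Image-induced shift: difference between the two representations.
    shift = [a - b for a, b in zip(h_with_image, h_text_only)]
    # Jailbreak-related shift: projection of that shift onto the direction.
    jrs = dot(shift, unit)
    corrected = [h - jrs * u for h, u in zip(h_with_image, unit)]
    return corrected, jrs
```

After the correction, the remaining shift is orthogonal to the direction, so only the component the paper associates with jailbreak behavior is removed.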
[68] Shot-Aware Frame Sampling for Video Understanding cs.CVPDF
Mengyu Zhao, Di Fu, Yongyu Xie, Jiaxing Zhang, Zhigang Yuan
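TL;DR: This paper proposes InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. It first partitions a video into semantically consistent shots, then selects two complementary keyframes per shot: one representing the main content and one capturing unusual within-shot changes. The design follows an information-theoretic objective that encourages the sampled set to retain high information about shot structure and sparse within-shot deviations, which improves the chance of preserving both overall context and decision-critical moments without retraining.
Details
Motivation: When only a few frames can be kept, existing samplers often fail to balance broad video coverage against brief but critical events, which makes downstream predictions unreliable. The paper addresses this with a shot-aware sampling strategy to make long-video understanding more efficient and reliable.
Result: Under frame-budget constraints, InfoShot improves anomaly hit rate and downstream Video-QA accuracy, while matching or outperforming strong baselines on standard video understanding benchmarks.
Insight: The novelty is an information-theoretic sampling strategy that splits a video into shots and picks complementary keyframes from each, together with SynFlash, a controllable synthetic benchmark for evaluating short-lived events. The method needs no retraining and balances video coverage against capture of critical events.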
Abstract: Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage with brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it improves the chance of preserving both overall video context and short decision-critical moments without requiring any retraining. To better evaluate such short-lived events, we further introduce SynFlash, a synthetic benchmark with controllable sub-second anomaly patterns and frame-level ground truth, and we also evaluate InfoShot on existing anomaly datasets and general video understanding tasks. Experiments show that InfoShot improves anomaly hit rate and downstream Video-QA accuracy under frame number constraints, while matching or outperforming strong baselines on standard video understanding benchmarks.
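The two-stage procedure described above (partition into shots, then keep one representative and one deviating frame per shot) can be sketched as follows. This is not InfoShot's actual objective: the abstract does not give the information-theoretic criterion, so the sketch substitutes a simple cosine-distance heuristic. The per-frame feature vectors, the `threshold` value, and all function names are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - num / den if den else 1.0


def split_into_shots(features, threshold=0.5):
    """Start a new shot whenever consecutive frames differ by more than threshold."""
    shots = [[0]]
    for i in range(1, len(features)):
        if cosine_distance(features[i - 1], features[i]) > threshold:
            shots.append([i])
        else:
            shots[-1].append(i)
    return shots


def pick_keyframes(features, shot):
    """Return the frame nearest the shot centroid and the frame farthest from it."""
    dim = len(features[shot[0]])
    centroid = [sum(features[i][d] for i in shot) / len(shot) for d in range(dim)]
    dists = {i: cosine_distance(features[i], centroid) for i in shot}
    representative = min(shot, key=dists.get)  # main content
    deviation = max(shot, key=dists.get)       # unusual within-shot change
    if representative == deviation:
        return (representative,)
    return (representative, deviation)


def sample_frames(features, threshold=0.5):
    """Collect keyframe indices over all shots, in temporal order."""
    selected = []
    for shot in split_into_shots(features, threshold):
        selected.extend(pick_keyframes(features, shot))
    return sorted(selected)
```

A shot whose frames are all identical contributes a single index, so the sampler never emits duplicates.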
[69] Stereo World Model: Camera-Guided Stereo Video Generation cs.CVPDF
Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao
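TL;DR: This paper proposes StereoWorld, a camera-guided stereo world model for end-to-end stereo video generation. It jointly learns appearance and binocular geometry within the RGB modality and grounds geometry directly from disparity. Its core designs are a unified camera-frame RoPE and a stereo-aware attention decomposition, which improve stereo consistency, disparity accuracy, and camera-motion fidelity across benchmarks while also speeding up generation.
Details
Motivation: Existing monocular RGB or RGBD methods cannot generate stereo video efficiently and consistently. The challenge is to jointly model appearance and binocular geometry within the RGB modality while keeping view and temporal consistency.
Result: Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, with more than 3x faster generation and an additional 5% gain in viewpoint consistency.
Insight: The novelty is a camera-aware rotary positional encoding (RoPE) that maintains view and temporal consistency, plus a stereo-aware attention decomposition that uses the epipolar prior to capture disparity-aligned correspondences at much lower compute cost. This opens a route to end-to-end binocular VR rendering and embodied policy learning.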
Abstract: We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
[70] VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm cs.CVPDF
Hongbo Lu, Liang Yao, Chenghao He, Fan Liu, Wenlong Liao
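TL;DR: VisionNVS is a camera-only framework that addresses the supervision gap in autonomous driving by recasting novel view synthesis as a self-supervised inpainting task. It introduces a Virtual-Shift strategy and Pseudo-3D Seam Synthesis, using monocular depth proxies to simulate occlusions and integrating data from adjacent cameras, so that the original images serve as pixel-perfect supervision during training without LiDAR.
Details
Motivation: Novel view synthesis for autonomous driving has an inherent supervision gap: the model must synthesize unseen views at inference, but training lacks ground-truth images for those shifted poses.
Result: Experiments show that VisionNVS achieves better geometric fidelity and visual quality than LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.
Insight: The core novelty is a change of paradigm: the ill-posed extrapolation problem becomes a self-supervised inpainting task. The Virtual-Shift strategy removes the domain gap, and Pseudo-3D Seam Synthesis handles spatial consistency, enabling high-quality camera-only synthesis.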
Abstract: A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models are tasked with synthesizing unseen views during inference, yet lack ground truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a "Virtual-Shift" strategy, we use monocular depth proxies to simulate occlusion patterns and map them onto the original view. This paradigm shift allows the use of raw, recorded images as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.
[71] Harnessing the Power of Foundation Models for Accurate Material Classification cs.CVPDF
Qingran Lin, Fengwei Yang, Chaolun Zhu
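TL;DR: This paper proposes a framework that uses foundation models to address the scarcity of annotated data in material classification. It combines an image generation and auto-labeling pipeline with a prior incorporation strategy to improve classification accuracy.
Details
Motivation: Material classification suffers from scarce annotated data, which limits model accuracy and generalization. Existing solutions based on vision-language foundation models (VLMs) still perform unsatisfactorily on material recognition.
Result: Extensive experiments on multiple datasets show significant improvements. The synthetic dataset effectively captures the characteristics of real-world materials, and incorporating VLM priors significantly improves final performance.
Insight: The novelty is twofold: (1) text prompts that combine object semantics and material attributes to generate diverse, high-quality training data with automatic labels; (2) a strategy for incorporating VLM priors, combined with joint fine-tuning, that adapts to material-specific features while preserving broad generalization.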
Abstract: Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfying results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features. Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real world materials, and the integration of priors from vision-language models significantly enhances the final performance. The source code and dataset will be released.
[72] Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion cs.CVPDF
Rui Hong, Shuxue Quan
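TL;DR: This paper proposes a motion-adaptive temporal attention mechanism for parameter-efficient video generation on top of frozen Stable Diffusion models. The temporal attention receptive field is adjusted dynamically according to estimated motion: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. Lightweight temporal attention modules are injected into all UNet transformer blocks through a cascaded strategy and combined with temporally correlated noise initialization and motion-aware gating. The system adds only 25.8M trainable parameters (2.9% of the base UNet) and achieves competitive results on the WebVid validation set.
Details
Motivation: Existing video generation methods handle content with different motion magnitudes inefficiently. The goal is high-quality, temporally consistent video generation with minimal parameter overhead, without the complexity of explicit temporal consistency losses.
Result: Trained on 100K WebVid videos, the method achieves competitive results on the WebVid validation set while adding only 25.8M parameters (2.9% of the base UNet). The results indicate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses.
Insight: The novelty is the motion-adaptive temporal attention, which adjusts the attention range to the motion magnitude, and the cascaded injection strategy (global attention in down-sampling and middle blocks, motion-adaptive attention in up-sampling blocks). The ablations reveal a trade-off between noise correlation and motion amplitude, which offers a practical inference-time control over generation behavior, and the effectiveness of implicit regularization reduces model complexity and training overhead.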
Abstract: We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy – global attention in down-sampling and middle blocks for semantic stabilization, motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving competitive results on WebVid validation when trained on 100K videos. We demonstrate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses. Our ablation studies reveal a clear trade-off between noise correlation and motion amplitude, providing a practical inference-time control for diverse generation behaviors.
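The local-versus-global behavior described above can be pictured as a choice of temporal attention mask. The sketch below makes several assumptions that the abstract does not confirm: motion is summarized as one scalar (mean absolute difference between consecutive frames), the switch is a hard threshold, and the local window is a fixed radius. The names `estimate_motion`, `motion_threshold`, and `local_radius` are hypothetical; the paper's mechanism may well be soft or learned.

```python
def estimate_motion(frames):
    """Mean absolute difference between consecutive frames (assumed proxy).

    Each frame is a flat list of floats; returns 0.0 for fewer than two frames.
    """
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return sum(diffs) / len(diffs) if diffs else 0.0


def temporal_attention_mask(num_frames, motion_score,
                            motion_threshold=0.5, local_radius=1):
    """Boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Low motion  -> global attention (every frame sees every frame).
    High motion -> local attention (only frames within local_radius).
    """
    if motion_score < motion_threshold:
        return [[True] * num_frames for _ in range(num_frames)]
    return [[abs(i - j) <= local_radius for j in range(num_frames)]
            for i in range(num_frames)]
```

In an attention layer, positions where the mask is False would typically be set to a large negative value before the softmax.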
[73] UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models cs.CV | cs.AI | cs.CLPDF
Segyu Lee, Boryeong Cho, Hojung Jung, Seokhyun An, Juhyeong Kim
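TL;DR: This paper introduces UniSAFE, the first comprehensive benchmark for system-level safety evaluation of unified multimodal models (UMMs). It covers 7 input/output modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. The benchmark contains 6,802 curated instances and is used to evaluate 15 state-of-the-art proprietary and open-source UMMs, revealing critical safety vulnerabilities in settings such as multi-image composition and multi-turn dialogue.
Details
Motivation: UMMs offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Existing safety benchmarks remain fragmented across tasks and modalities, which limits comprehensive evaluation of complex system-level vulnerabilities.
Result: Evaluating 15 state-of-the-art proprietary and open-source UMMs with UniSAFE reveals critical vulnerabilities, notably elevated safety violation rates in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks.
Insight: The novelty is UniSAFE itself, the first comprehensive benchmark for system-level UMM safety. Its shared-target design maps common risk scenarios onto task-specific I/O configurations, enabling controlled cross-task comparison of safety failures. This provides a useful tool for evaluating and improving the safety alignment of UMMs.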
Abstract: Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. Comprising 6,802 curated instances, we use UniSAFE to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE
[74] Mutually Causal Semantic Distillation Network for Zero-Shot Learning cs.CV | cs.LGPDF
Shiming Chen, Shuhuang Chen, Guo-Sen Xie, Xinge You
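TL;DR: This paper proposes MSDN++, a mutually causal semantic distillation network that addresses insufficient semantic representation learning in zero-shot learning (ZSL). Two mutually causal attention sub-nets (attribute→visual and visual→attribute) learn causal vision-attribute associations and are trained collaboratively with a semantic distillation loss to extract intrinsic and sufficient semantic representations.
Details
Motivation: Existing ZSL methods typically use unidirectional attention under weak supervision, so the learned latent semantic representations are spurious and limited and fail to discover the intrinsic semantic knowledge (e.g., attribute semantics) between visual and attribute features.
Result: Extensive experiments on three widely used benchmark datasets (CUB, SUN, AWA2, and FLO) show that MSDN++ yields significant improvements over strong baselines and achieves new state-of-the-art (SOTA) performance.
Insight: The novelty is a mutually causal attention mechanism that models bidirectional causal associations between visual and attribute features, with a semantic distillation loss that lets the two sub-nets learn collaboratively. This yields more reliable, more causal feature representations and suggests a new approach to cross-modal semantic knowledge transfer.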
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes in the open world guided by side information (e.g., attributes). Its key task is to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus conduct a desirable semantic knowledge transfer from seen classes to unseen ones. Prior works simply utilize unidirectional attention in a weakly supervised manner to learn spurious and limited latent semantic representations, which fail to effectively discover the intrinsic semantic knowledge (e.g., attribute semantics) between visual and attribute features. To solve the above challenges, we propose a mutually causal semantic distillation network (termed MSDN++) to distill the intrinsic and sufficient semantic representations for ZSL. MSDN++ consists of an attribute$\rightarrow$visual causal attention sub-net that learns attribute-based visual features, and a visual$\rightarrow$attribute causal attention sub-net that learns visual-based attribute features. The causal attention encourages the two sub-nets to learn causal vision-attribute associations for representing reliable features with causal visual/attribute learning. With the guidance of a semantic distillation loss, the two mutual attention sub-nets learn collaboratively and teach each other throughout the training process. Extensive experiments on three widely-used benchmark datasets (e.g., CUB, SUN, AWA2, and FLO) show that our MSDN++ yields significant improvements over the strong baselines, leading to new state-of-the-art performances.
[75] Towards Motion-aware Referring Image Segmentation cs.CVPDF
Chaeyun Kim, Seunghoon Yi, Yejin Kim, Yohan Jo, Joonseok Lee
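TL;DR: Existing referring image segmentation (RIS) methods perform markedly worse on motion-related queries. This paper proposes an efficient data augmentation scheme and a multimodal radial contrastive learning method to address this. The augmentation extracts motion-centric phrases from the original captions, and contrastive learning is performed on fused image-text embeddings, improving the model's understanding of motion expressions. The paper also introduces a new test split focused on motion queries and a new benchmark, M-Bench, for evaluating models on objects distinguished by actions. Experiments show substantial gains on motion-related queries across multiple RIS models while remaining competitive on appearance-based descriptions.
Details
Motivation: Existing RIS methods perform far worse on motion-based queries than on appearance-based ones, which limits their use in real dynamic scenes.
Result: Extensive experiments across multiple RIS models show substantial gains on motion-centric queries with competitive results on appearance-based descriptions. The paper introduces a new motion-centric test split and the M-Bench benchmark for comprehensive evaluation.
Insight: The novelties are an efficient, annotation-free data augmentation scheme that extracts motion phrases to expose the model to more motion expressions, and Multimodal Radial Contrastive Learning (MRaCL), which contrasts fused cross-modal embeddings, not unimodal representations, to better handle context-dependent object descriptions. The approach tackles the motion-understanding bottleneck with simple strategies and establishes a dedicated benchmark for this direction.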
Abstract: Referring Image Segmentation (RIS) requires identifying objects from images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, and introduce a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show our method substantially improves performance on motion-centric queries across multiple RIS models, maintaining competitive results on appearance-based descriptions. Codes are available at https://github.com/snuviplab/MRaCL
[76] SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning cs.CVPDF
Xi Ye, Wenjia Yang, Yangyang Xu, Xiaoyang Liu, Duo Su
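TL;DR: Image-conditioned video diffusion models often lose motion fidelity after fine-tuning. This paper proposes pixel-motion rewards based on pixel flux dynamics and introduces Smooth Hybrid Fine-tuning (SHIFT), a framework that combines supervised fine-tuning with advantage-weighted fine-tuning to improve motion alignment.
Details
Motivation: Image-conditioned video diffusion models often suffer reduced motion fidelity after fine-tuning, such as weakened motion dynamics or degraded long-term temporal coherence.
Result: Experiments show that SHIFT effectively resolves dynamic-degree collapse in supervised fine-tuning of modern video diffusion models, improves convergence speed, and mitigates reward hacking.
Insight: The novelties are pixel-motion rewards that capture both instantaneous and long-term motion consistency, and the SHIFT framework, which fuses supervised and advantage-weighted fine-tuning and uses adversarial advantages to improve training stability and efficiency.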
Abstract: Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in video diffusion models post-training. To address this, we introduce pixel-motion rewards based on pixel flux dynamics, capturing both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses standard supervised fine-tuning and advantage-weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves dynamic-degree collapse in supervised fine-tuning of modern video diffusion models.
[77] ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation cs.CVPDF
Xiangyu Kong, Xiaoyu Jin, Yihan Pan, Haoqin Sun, Hengde Zhu
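TL;DR: This paper proposes ECHO, a framework that addresses contextual incoherence and degraded lip synchronization in interactive head generation (IHG). A Long-range Contextual Understanding (LCU) module improves the contextual appropriateness and emotional rationality of generated facial behaviors, and a Spatial-aware Decoupled Cross-attention Modulation (SDCM) module integrates user behavioral cues while preserving lip synchronization, improving the avatar's visual fidelity and interaction naturalness.
Details
Motivation: Existing IHG methods rely only on dual-track signals (user behavior and pre-defined audio) within a short temporal window and lack long-range contextual modeling, so the generated facial behaviors lack contextual appropriateness. Their signal fusion also introduces cross-signal interference that can harm lip synchronization.
Result: Extensive experiments validate the proposed components and show that ECHO achieves superior performance on interactive head generation.
Insight: The novelty is the LCU module, which models behavioral dynamics and affective semantics, and the SDCM module, which fuses signals differently for lip and non-lip regions. A two-stage training strategy ties them together to improve generation quality.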
Abstract: In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., human user’s behaviors and pre-defined audio for avatar) within a short temporal window, jointly driving generation of avatar’s audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module, that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by our designed two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of proposed components and ECHO’s superior IHG performance.
[78] AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement cs.CV | cs.AIPDF
Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li
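TL;DR: This paper proposes AdaZoom-GUI, an adaptive zoom-based GUI grounding framework. An instruction refinement module rewrites natural language instructions into detailed descriptions, and a conditional zoom-in strategy runs a second inference pass on predicted small elements, improving element localization accuracy and instruction understanding on high-resolution GUI screenshots.
Details
Motivation: GUI grounding is difficult because of high-resolution images, small UI elements, and ambiguous user instructions. The goal is to improve the ability of vision-language models to interact with graphical user interfaces automatically.
Result: On public benchmarks, the method achieves state-of-the-art performance among models with comparable or even larger parameter counts, demonstrating its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
Insight: The novelties are the instruction refinement module, which makes instructions explicit; the conditional zoom-in strategy, which balances accuracy against compute; and training with Group Relative Policy Optimization so the model predicts both click coordinates and bounding boxes. The broader takeaway is the fine-grained interaction design that combines natural language processing with visual reasoning.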
Abstract: GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
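The conditional zoom-in step can be sketched as a two-pass loop: predict on the full screenshot, and only if the predicted box is small, crop around it and predict again. Everything specific here is an assumption for illustration: the `predict(region, instruction)` callable is a hypothetical stand-in for the grounding model and is assumed to return a box normalized to the region it was shown, and the `small_area_fraction` and `scale` values are arbitrary. The abstract does not give the paper's actual thresholds or crop rule.

```python
def to_absolute(norm_box, region):
    """Map a box normalized to `region` back to full-image pixel coordinates."""
    rx0, ry0, rx1, ry1 = region
    w, h = rx1 - rx0, ry1 - ry0
    nx0, ny0, nx1, ny1 = norm_box
    return (rx0 + nx0 * w, ry0 + ny0 * h, rx0 + nx1 * w, ry0 + ny1 * h)


def zoom_region(box, image_size, scale=4.0):
    """Crop window centered on `box`, `scale` times its size, clamped to the image."""
    width, height = image_size
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * scale / 2
    half_h = (y1 - y0) * scale / 2
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(width), cx + half_w), min(float(height), cy + half_h))


def ground(predict, instruction, image_size,
           small_area_fraction=0.001, scale=4.0):
    """Two-pass grounding: zoom in only when the first prediction is small."""
    width, height = image_size
    full = (0.0, 0.0, float(width), float(height))
    box = to_absolute(predict(full, instruction), full)
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area / (width * height) >= small_area_fraction:
        return box  # large element: the first pass is kept as-is
    region = zoom_region(box, image_size, scale)
    return to_absolute(predict(region, instruction), region)
```

Because the second pass returns coordinates relative to the crop, `to_absolute` is what keeps the final answer in full-screenshot coordinates regardless of how the crop is resized before inference.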
[79] FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning cs.CVPDF
Weidong Chen, Cheng Ye, Zhendong Mao, Peipei Song, Xinyan Liu
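TL;DR: This paper proposes FACE-net, a framework that addresses factual-emotional bias in emotional video captioning through factual calibration and emotion augmentation. It retrieves relevant sentences from an external repository, calibrates facts via uncertainty estimation, adaptively fuses emotional information with a progressive visual emotion augmentation module, and introduces a dynamic bias adjustment routing module to optimize generation.
Details
Motivation: Existing emotional video captioning methods mine and coordinate factual and emotional cues insufficiently during generation, so they struggle with the bias that arises when factual and emotional requirements differ across samples.
Result: The paper reports SOTA performance on EVC benchmarks, with quantitative and qualitative experiments supporting the method's effectiveness in factual accuracy and emotional expression.
Insight: The novelties are a factual calibration mechanism based on uncertainty estimation, progressive visual emotion augmentation that uses an emotion dictionary, and a dynamic bias adjustment routing module. Together these mitigate factual-emotional bias and make the captions more adaptive and accurate.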
Abstract: Emotional Video Captioning (EVC) is an emerging task, which aims to describe factual content with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine them with video content to generate descriptions. However, insufficient mining and coordination of factual and emotional cues during generation makes it difficult for these methods to deal with the factual-emotional bias, which refers to the factual and emotional requirements being different across samples during generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which through a unified architecture collaboratively mines factual-emotional semantics and provides adaptive and accurate guidance for generation, breaking through the compromising tendency of factual-emotional descriptions in all sample learning. Technically, we first introduce an external repository and retrieve the sentences most relevant to the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines different components through video content to effectively mine the factual semantics; while our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment emotions to each factual semantics. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of a sample.
[80] AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization cs.CVPDF
Dailan He, Guanlin Feng, Xingtong Ge, Yi Zhang, Bingqi Ma
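TL;DR: This paper proposes AR-CoPO, a framework for aligning autoregressive video generation models. Using chunk-level contrastive policy optimization with a forking mechanism and a semi-on-policy training strategy, it addresses the difficulty existing SDE-based GRPO methods have in aligning few-step distilled autoregressive video generators through reinforcement learning from human feedback.
Details
Motivation: Existing SDE-based GRPO methods struggle with low-latency, high-quality autoregressive video generators built on few-step distillation. Few-step ODE and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, which makes intermediate SDE exploration ineffective and RLHF alignment hard.
Result: Experiments on Self-Forcing show that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
Insight: The novelties are: adapting the contrastive perspective of Neighbor GRPO to streaming autoregressive generation; a forking mechanism for chunk-level alignment that builds neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates; and a semi-on-policy training strategy that combines on-policy exploration with exploitation over a replay buffer of reference rollouts to improve generation quality across domains.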
Abstract: Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
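The abstract mentions assigning sequence-level rewards to candidates forked at one chunk and then performing GRPO updates. AR-CoPO's exact objective is not given there, so the sketch below shows only the group-relative reward normalization commonly used in GRPO-style methods: each candidate's reward is standardized against the other candidates in its group. The function name and the `eps` stabilizer are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidates.

    rewards: sequence-level rewards of candidates forked at the same chunk
             (assumed setup). Returns one advantage per candidate.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps avoids division by zero when all candidates score the same.
    return [(r - mean) / (std + eps) for r in rewards]
```

Candidates that beat the group mean get positive advantages and the rest get negative ones, so no separate value network is needed to form a baseline.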
[81] VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection cs.CV | cs.AIPDF
Chupeng Liu, Jiyong Rao, Shangquan Sun, Runkai Zhao, Weidong Cai
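TL;DR: This paper proposes VirPro, a visual-referred probabilistic prompt learning method for weakly supervised monocular 3D object detection. Through an adaptive multi-modal pretraining paradigm, it generates diverse, learnable, instance-conditioned prompts across scenes, uses Multi-Gaussian Prompt Modeling to fold visual features into the text embeddings so that they express visual uncertainty, and applies RoI-level contrastive matching to strengthen modality alignment and semantic coherence.
Details
Motivation: Hand-crafted textual descriptions struggle to capture the visual diversity of individual objects across scenes, which limits the model's ability to learn scene-aware representations.
Result: Extensive experiments on the KITTI benchmark show that integrating the pretraining paradigm yields substantial gains, with up to a 4.8% improvement in average precision over the baseline.
Insight: The contributions are the Adaptive Prompt Bank (APB) and Multi-Gaussian Prompt Modeling (MGPM). Probabilistic prompt learning injects visual uncertainty into the text embeddings, contrastive matching strengthens cross-modal alignment, and together they provide a scalable multi-modal auxiliary supervision signal for weakly supervised 3D detection.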
Abstract: Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individuals across scenes, limiting the model’s ability to learn scene-aware representations. To address this challenge, we propose Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance-conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). Subsequently, we introduce Multi-Gaussian Prompt Modeling (MGPM), which incorporates scene-based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainties. Then, from the fused vision-language embeddings, we decode a prompt-targeted Gaussian, from which we derive a unified object-level prompt embedding for each instance. RoI-level contrastive matching is employed to enforce modality alignment, bringing embeddings of co-occurring objects within the same scene closer in the latent space, thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving up to a 4.8% average precision improvement over the baseline.
[82] UAV-CB: A Complex-Background RGB-T Dataset and Local Frequency Bridge Network for UAV Detection cs.CVPDF
Shenghui Huang, Menghao Hu, Longkun Zou, Hongyu Chi, Zekai Li
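TL;DR: Targeting UAV detection against complex backgrounds, this paper builds the UAV-CB dataset and proposes the Local Frequency Bridge Network (LFBNet), which performs RGB-T fusion by modeling features in localized frequency space to improve detection performance.
Details
Motivation: UAVs in low-altitude environments are hard to detect because of complex backgrounds, camouflage, and multimodal interference, and existing datasets do not adequately capture these challenges.
Result: On UAV-CB and public benchmarks, LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions.
Insight: The contributions are an RGB-T dataset that emphasizes complex backgrounds and camouflage, and localized frequency-space modeling that bridges both the frequency-spatial fusion gap and the cross-modality discrepancy, offering a frequency-aware perspective on multimodal UAV perception.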
Abstract: Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB-T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency-spatial fusion gap and the cross-modality discrepancy gap in RGB-T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications.
[83] Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation cs.CVPDF
Jiawei Zhou, Chi Zhang, Xiang Feng, Qiming Zhang, Haibo Qiu
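TL;DR: This paper presents Omni-I2C, a comprehensive benchmark for evaluating how well Large Multimodal Models (LMMs) convert complex, structured digital graphics into executable code. It contains 1080 curated samples spanning multiple subjects, image modalities, and programming languages, and uses a decoupled evaluation framework to analyze perceptual fidelity and symbolic precision separately. The evaluation shows a substantial performance gap even for state-of-the-art models in complex scenarios, underscoring how hard multimodal code generation remains.
Details
Motivation: Current LMMs face a major challenge in converting complex digital graphics (such as scientific visualizations and symbolic diagrams) into precise, executable code. The task demands both high-fidelity visual perception and precise code generation, and any small error can cause the reconstruction to fail. Existing evaluations do not systematically measure the depth and breadth of this task.
Result: Evaluation on Omni-I2C reveals a substantial performance gap among leading LMMs. Even state-of-the-art models struggle to preserve structural integrity in complex scenarios, indicating that multimodal code generation remains a formidable challenge.
Insight: The contribution is a comprehensive, decoupled benchmark that separates perceptual fidelity from symbolic precision to diagnose failure causes at a fine granularity. The benchmark also incorporates real user-sourced cases and covers a wide range of digital content types, making it a useful tool for systematically evaluating and advancing multimodal code generation.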
Abstract: We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception – to parse intricate spatial hierarchies and symbolic details – and precise generative expression – to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content – from scientific visualizations to complex symbolic notations – each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.
[84] Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models cs.CV | cs.AI | cs.CLPDF
Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang
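TL;DR: Loc3R-VLM is a framework that strengthens the 3D understanding of 2D Vision-Language Models (VLMs). Taking monocular video as input, it combines two joint objectives, global layout reconstruction and egocentric perspective modeling, with lightweight camera pose priors to give the model direct spatial supervision, achieving state-of-the-art performance on language-based localization and 3D question answering.
Details
Motivation: Multimodal Large Language Models (MLLMs) have made progress in connecting vision and language but still fall short in spatial understanding and viewpoint-aware reasoning. Existing methods typically augment the input representation with geometric cues instead of explicitly teaching the model to reason in 3D space; this paper targets that underlying problem.
Result: On language-based localization and on situated and general 3D question-answering benchmarks, Loc3R-VLM outperforms existing 2D- and video-based methods and reaches state-of-the-art performance.
Insight: Inspired by human spatial cognition, the paper proposes two jointly trained objectives (global layout reconstruction and explicit situation modeling) that provide direct spatial supervision and ground both perception and language in a 3D context. It also uses lightweight camera pose priors extracted from a pre-trained 3D foundation model to ensure geometric consistency and metric-scale alignment, an efficient way to integrate 3D prior knowledge.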
Abstract: Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
[85] EI: Early Intervention for Multimodal Imaging based Disease Recognition cs.CVPDF
Qijie Wei, Hailan Lin, Xirong Li
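TL;DR: This paper proposes Early Intervention (EI), a new framework that addresses two main challenges in multimodal medical imaging based disease recognition: existing methods cannot fully exploit the complementary and correlated information in multimodal data, and scarce labeled data together with domain shift hinder the use of Vision Foundation Models (VFMs). EI treats one modality as the target and the others as references, and uses high-level semantic tokens from the references as intervention tokens to steer the target modality's embedding process at an early stage. The paper also introduces Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method for adapting VFMs.
Details
Motivation: The prevailing "fusion after unimodal embedding" paradigm cannot fully exploit complementary multimodal information, and scarce labeled data and domain shift hinder the application of vision foundation models.
Result: Extensive experiments on three public datasets (retinal disease, skin lesion, and knee anomaly classification) verify the effectiveness of the method against several competitive baselines.
Insight: The contributions are the Early Intervention (EI) framework, which uses semantic information from the reference modalities to guide the target modality early in the embedding process, and Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method for vision foundation models that addresses data scarcity and domain differences in medical imaging.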
Abstract: Current methods for multimodal medical imaging based disease recognition face two major challenges. First, the prevailing “fusion after unimodal image embedding” paradigm cannot fully leverage the complementary and correlated information in the multimodal data. Second, the scarcity of labeled multimodal medical images, coupled with their significant domain shift from natural images, hinders the use of cutting-edge Vision Foundation Models (VFMs) for medical image embedding. To jointly address the challenges, we propose a novel Early Intervention (EI) framework. Treating one modality as target and the rest as reference, EI harnesses high-level semantic tokens from the reference as intervention tokens to steer the target modality’s embedding process at an early stage. Furthermore, we introduce Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method that employs a set of low-rank adapters with varied ranks and a weight-relaxed router for VFM adaptation. Extensive experiments on three public datasets for retinal disease, skin lesion, and knee anomaly classification verify the effectiveness of the proposed method against a number of competitive baselines.
[86] UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images cs.CVPDF
Guibiao Liao, Qian Ren, Kaimin Liao, Hua Wang, Zhi Chen
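TL;DR: This paper proposes UniSem, a unified framework for generalizable semantic 3D reconstruction from sparse, unposed images. Two key components, Error-aware Gaussian Dropout (EGD) and a Mix-training Curriculum (MTC), jointly improve depth estimation accuracy and the generalization of 3D semantic segmentation.
Details
Motivation: Feed-forward 3D Gaussian Splatting (3DGS) struggles with semantic-aware 3D reconstruction from sparse, unposed images. Under sparse-view supervision, existing methods predict an overly redundant set of Gaussian primitives, which leads to unstable geometry and poor depth quality. They also rely solely on 2D segmenter features for semantic lifting, which gives weak 3D-level supervision with limited generalization and results in incomplete 3D semantics in novel scenes.
Result: Extensive experiments on ScanNet and Replica show that UniSem achieves superior depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. With 16-view inputs, it reduces the depth relative error (Rel) by 15.2% and improves open-vocabulary segmentation mean accuracy (mAcc) by 3.7% over strong baselines.
Insight: The contributions are: 1) Error-aware Gaussian Dropout (EGD), which uses rendering error cues for error-guided capacity control and suppresses redundancy-prone Gaussian primitives, yielding meaningful, geometrically stable Gaussian representations that improve depth estimation; 2) a Mix-training Curriculum (MTC), which progressively blends semantics lifted from a 2D segmenter with the model's own emergent 3D semantic priors and uses object-level prototype alignment to improve semantic coherence and completeness, and thereby semantic generalization.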
Abstract: Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level and limited generalizable supervision, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model’s own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.
[87] PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation cs.CVPDF
Jianjian Yin, Tao Chen, Yi Chen, Gensheng Pei, Xiangbo Shu
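TL;DR: This paper proposes PCA-Seg, a parallel cost aggregation paradigm for open-vocabulary semantic and part segmentation. With an expert-driven perceptual learning module and a feature orthogonalization decoupling strategy, it mitigates the interference between class-level semantics and spatial context that affects existing serial aggregation methods, and so captures richer vision-language alignment information from cost volumes.
Details
Motivation: Existing methods based on vision-language models extract image-text alignment cues through a serial structure of spatial and class aggregation, which causes knowledge interference between class-level semantics and spatial context. This paper aims to address that problem.
Result: Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds only 0.35M parameters while achieving state-of-the-art open-vocabulary semantic and part segmentation performance.
Insight: The contributions are the parallel cost aggregation paradigm, the expert-driven perceptual learning module (comprising a multi-expert parser and a coefficient mapper), and the feature orthogonalization decoupling strategy, which decouple and then effectively fuse semantic and contextual information to improve the capture of vision-language alignment information.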
Abstract: Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.
[88] MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing cs.CVPDF
Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
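TL;DR: MM-OVSeg is a multimodal optical-SAR fusion framework for open-vocabulary segmentation of remote sensing imagery, aimed at the limited performance of existing methods under adverse weather such as cloud and haze. It fuses the rich semantic information of optical imagery with the cloud-penetrating structural cues of synthetic aperture radar (SAR), and introduces a cross-modal unification process and a dual-encoder fusion module to achieve pixel-level recognition of open-set textual categories.
Details
Motivation: Existing open-vocabulary segmentation methods in remote sensing are largely limited to clear-sky optical data and perform poorly under cloudy or hazy conditions, so a solution that fuses complementary modalities and improves robustness is needed.
Result: Extensive experiments show that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The abstract does not name specific benchmarks or give quantitative comparisons with state-of-the-art methods.
Insight: The contribution is a multimodal optical-SAR fusion framework that aligns multi-sensor representations through a cross-modal unification process and integrates hierarchical features from multiple vision foundation models in a dual-encoder fusion module for text-aligned multimodal segmentation, compensating for the limits of a single modality and for the weak dense prediction capability of current vision-language models.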
Abstract: Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities–optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
[89] Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models cs.CVPDF
Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai
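TL;DR: This paper systematically studies how video-based supervised fine-tuning (Video-SFT) affects the visual capabilities of multimodal large language models. Video-SFT consistently improves video understanding but often brings limited gains or even degradation on static image benchmarks, revealing a trade-off between image and video understanding.
Details
Motivation: To examine how Video-SFT reshapes the fine-grained visual capabilities of MLLMs, especially the balance between spatial and temporal understanding, and so to understand its practical effects and limitations.
Result: Across architectures, parameter scales, and frame sampling settings, Video-SFT consistently improves video performance but yields limited gains or negative effects on static image benchmarks (such as image question answering). Increasing the number of sampled frames generally improves video performance but does not reliably improve image performance.
Insight: The paper reveals the image-video performance trade-off in Video-SFT and proposes an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the trade-off. The study highlights preserving spatial understanding as a central challenge in joint image-video training and offers a new perspective on optimizing MLLM training.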
Abstract: Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
[90] ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling cs.CVPDF
Daowen Li, Ruixiao Dong, Ying Chen, Kai Li, Ding Ding
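TL;DR: This paper proposes ProGVC, a generative video compression framework built on progressive transmission. It encodes videos into hierarchical multi-scale residual token maps, supporting coarse-to-fine progressive transmission for flexible rate adaptation. A Transformer-based multi-scale autoregressive context model estimates token probabilities, which are used both for efficient entropy coding of the transmitted tokens and for predicting truncated fine-scale tokens at the decoder to restore perceptual details.
Details
Motivation: Existing perceptual video codecs usually lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, which limits bitrate reduction. This paper aims to unify progressive transmission, efficient entropy coding, and detail synthesis in a single codec.
Result: Extensive experiments show that, as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability.
Insight: Inspired by next-scale prediction in Visual Auto-Regressive (VAR) models, the main contribution is to unify progressive transmission, efficient entropy coding, and generative-prior-based detail synthesis in one framework. Hierarchical multi-scale residual coding and autoregressive context modeling tightly couple rate adaptation with perceptual quality restoration.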
Abstract: Perceptual video compression leverages generative priors to reconstruct realistic textures and motions at low bitrates. However, existing perceptual codecs often lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, limiting bitrate reduction. Inspired by the next-scale prediction in the Visual Auto-Regressive (VAR) models, we propose ProGVC, a Progressive-based Generative Video Compression framework that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single codec. ProGVC encodes videos into hierarchical multi-scale residual token maps, enabling flexible rate adaptation by transmitting a coarse-to-fine subset of scales in a progressive manner. A Transformer-based multi-scale autoregressive context model estimates token probabilities, utilized both for efficient entropy coding of the transmitted tokens and for predicting truncated fine-scale tokens at the decoder to restore perceptual details. Extensive experiments demonstrate that as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability at the same time.
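To make the coarse-to-fine idea concrete, here is a minimal sketch of a multi-scale residual decomposition in which decoding any prefix of scales gives a progressively better reconstruction. It is an illustration under strong assumptions: it works on a single continuous-valued 2D map, uses average pooling and nearest-neighbor upsampling, and omits the token quantization, entropy coding, and autoregressive prediction of truncated scales that the abstract describes. All function names are hypothetical.

```python
import numpy as np

def downsample(x, size):
    """Average-pool a square map to size x size (side must be divisible)."""
    f = x.shape[0] // size
    return x.reshape(size, f, size, f).mean(axis=(1, 3))

def upsample(x, size):
    """Nearest-neighbor upsample a square map to size x size."""
    f = size // x.shape[0]
    return np.repeat(np.repeat(x, f, axis=0), f, axis=1)

def encode_residual_pyramid(x, scales):
    """At each scale, store what the coarser scales failed to explain."""
    full = x.shape[0]
    recon = np.zeros_like(x)
    residuals = []
    for s in scales:
        r = downsample(x - recon, s)
        residuals.append(r)
        recon = recon + upsample(r, full)
    return residuals

def decode_prefix(residuals, full, k):
    """Reconstruct from only the first k (coarsest) scales."""
    recon = np.zeros((full, full))
    for r in residuals[:k]:
        recon = recon + upsample(r, full)
    return recon

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
residuals = encode_residual_pyramid(x, scales=(1, 2, 4, 8))
errors = [float(np.mean((x - decode_prefix(residuals, 8, k)) ** 2))
          for k in range(1, 5)]
print(errors)  # error shrinks as more scales are received
```

Transmitting fewer scales lowers the rate at the cost of detail; in the paper's setting the missing fine scales are predicted by the context model instead of being left empty.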
[91] Prompt-Free Universal Region Proposal Network cs.CVPDF
Qihong Tang, Changhan Liu, Shaofeng Zhang, Wenbin Li, Qi Fan
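TL;DR: This paper proposes a Prompt-Free Universal Region Proposal Network (PF-RPN) that identifies potential objects without external prompts. A Sparse Image-Aware Adapter (SIA) module performs initial localization, a Cascade Self-Prompt (CSP) module autonomously aggregates visual features, and a Centerness-Guided Query Selection (CG-QS) module selects high-quality query embeddings. The model can be optimized with limited data (e.g., 5% of MS COCO) and applied directly to object detection in domains such as underwater scenes, industrial defects, and remote sensing.
Details
Motivation: Existing methods rely on exemplar images, predefined categories, or textual descriptions to localize potential objects. This dependence on image and text prompts limits flexibility and adaptability in real-world scenarios. The paper aims to build a universal region proposal network that needs no external prompts and adapts better to diverse applications.
Result: Experiments on 19 datasets validate the method. After optimization on limited data (e.g., 5% of MS COCO), it applies directly to multiple detection domains (underwater, industrial defect, remote sensing) without fine-tuning, demonstrating generality and efficiency.
Insight: The contributions are: 1) a universal region proposal framework that needs no external prompts, which improves flexibility; 2) the SIA module, which performs initial localization with a learnable query embedding dynamically updated with visual features; 3) the CSP module, which aggregates information in a cascade using self-prompted embeddings; 4) the CG-QS module, which selects high-quality queries with a centerness scoring network. Together these give the model strong adaptability in low-data and cross-domain settings.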
Abstract: Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (PF-RPN), which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network. Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method. Code is available at https://github.com/tangqh03/PF-RPN.
[92] FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion cs.CV | cs.AIPDF
Hugo Caselles-Dupré, Mathis Koroglu, Guillaume Jeanneret, Arnaud Dapogny, Matthieu Cord
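TL;DR: FrescoDiffusion is a training-free method for 4K image-to-video generation. It augments tiled denoising with a precomputed latent prior to fix the poor global layout consistency of existing diffusion models on ultra-high-resolution inputs, and is particularly suited to complex scenes such as fresco animation.
Details
Motivation: Existing diffusion-based image-to-video models scale poorly to ultra-high resolutions such as 4K: generated videos either lose detail or break global consistency. The problem is especially severe for complex images such as frescoes, which contain many characters and semantically distinct sub-scenes.
Result: On the VBench-I2V dataset and the authors' fresco I2V dataset, the method improves global consistency and fidelity over tiled baselines while remaining computationally efficient, and a spatial regularization variable enables region-level motion control.
Insight: The contribution is to use a precomputed low-resolution video latent trajectory as a global reference prior and to fuse per-tile noise predictions with it through a weighted least-squares objective, which strengthens global coherence with a closed-form solution in model-output space. The method also offers explicit control over the trade-off between creativity and consistency.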
Abstract: Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model’s native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
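The closed-form fusion can be illustrated with a small sketch. Assuming the objective at each pixel is a weighted sum of squared deviations from the overlapping tile predictions plus a regularization term pulling toward the upsampled prior, the minimizer is a weighted average. The exact objective, weights, and tensor layout used by FrescoDiffusion are not given in the abstract, so the form below and all names are assumptions.

```python
import numpy as np

def fuse_tiles_with_prior(canvas_shape, tiles, prior, lam):
    """Per-pixel minimizer of
        sum_i w_i * (x - tile_i)^2 + lam * (x - prior)^2,
    which is x = (sum_i w_i * tile_i + lam * prior) / (sum_i w_i + lam).

    tiles: list of (row, col, prediction, weight) with 2D predictions;
    lam may be a scalar or a per-pixel map (assumed regularization strength).
    Every pixel must be covered by a tile or have lam > 0.
    """
    num = np.array(np.broadcast_to(lam * prior, canvas_shape), dtype=float)
    den = np.zeros(canvas_shape, dtype=float) + lam
    for row, col, pred, w in tiles:
        h, wd = pred.shape
        num[row:row + h, col:col + wd] += w * pred
        den[row:row + h, col:col + wd] += w
    return num / den

# Two overlapping 2x3 tiles on a 2x4 canvas, with an all-zero prior.
tiles = [(0, 0, np.full((2, 3), 1.0), 1.0),
         (0, 1, np.full((2, 3), 3.0), 1.0)]
prior = np.zeros((2, 4))
fused_plain = fuse_tiles_with_prior((2, 4), tiles, prior, lam=0.0)
fused_reg = fuse_tiles_with_prior((2, 4), tiles, prior, lam=1.0)
print(fused_plain[0])
print(fused_reg[0])
```

With `lam=0` the update reduces to ordinary tile merging; larger values pull the result toward the global reference, and a per-pixel `lam` map would give the kind of region-level control the abstract mentions.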
[93] PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery cs.CVPDF
Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai
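TL;DR: This paper proposes PanoVGGT, a permutation-equivariant Transformer framework for feed-forward 3D reconstruction from panoramic imagery. The model jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas. To handle the non-pinhole distortions and geometric reasoning challenges specific to panoramas, the authors introduce spherical-aware positional embeddings and a panorama-specific three-axis rotation augmentation, and they propose a stochastic anchoring training strategy to resolve global-frame ambiguity.
Details
Motivation: Panoramic imagery offers a 360° field of view and is increasingly common in consumer devices, but its non-pinhole distortions challenge joint pose estimation and 3D reconstruction. Existing feed-forward models designed for perspective cameras generalize poorly to this setting.
Result: Extensive experiments on PanoCity, the large-scale outdoor panoramic dataset contributed by the authors, and on standard benchmarks show that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization.
Insight: The contributions are: 1) a permutation-equivariant Transformer architecture for panoramic imagery; 2) spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation that support geometric reasoning in the spherical domain; 3) a stochastic anchoring training strategy that resolves global-frame ambiguity; 4) PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations.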
Abstract: Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.
[94] LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation cs.CVPDF
Mohammad Robaitul Islam Bhuiyan, Sheethal Bhat, Melika Qahqaie, Tri-Thien Nguyen, Paula Andrea Pérez Toro
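TL;DR: This paper proposes LoGSAM, a parameter-efficient, detection-driven framework for brain tumor segmentation in MRI. Radiologist dictation is transcribed into text prompts, the LoRA-adapted vision-language detector Grounding DINO performs text-conditioned tumor localization, and the predicted bounding boxes prompt MedSAM to generate pixel-level tumor masks without additional fine-tuning.
Details
Motivation: MRI brain tumor segmentation is constrained by limited annotated data and by reliance on task-specific supervised models. The paper aims to build a modular speech-to-segmentation pipeline using pretrained foundation models and minimal parameter updates.
Result: The method reaches a Dice score of 80.32% on the BRISC 2025 dataset (reported as state of the art) and, evaluated with German dictations on 12 unseen MRI scans, achieves 91.7% case-level accuracy.
Insight: The contributions are: converting radiologist dictation into text prompts for cross-modal localization; using LoRA adaptation (updating only 5% of the parameters) for computationally efficient domain adaptation; and guiding a frozen MedSAM with detection-driven priors for segmentation, which demonstrates a modular, parameter-efficient way to integrate foundation models.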
Abstract: Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, thereby enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art Dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular, speech-to-segmentation pipeline by intelligently leveraging pretrained foundation models with minimal parameter updates.
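For readers unfamiliar with the reported metric, the Dice score measures overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. The sketch below is a generic implementation on binary arrays; the smoothing constant and the toy masks are assumptions and are not taken from the paper's evaluation code.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.

    eps (an assumed smoothing constant) makes two empty masks score 1.0.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example: the prediction covers two pixels, one of which is correct.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
print(dice_score(pred, target))
```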
[95] Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing cs.CV | cs.AIPDF
Seongrae Noh, SeungWon Seo, Gyeong-Moon Park, HyeongYeop Kang
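TL;DR: This paper proposes Edit-As-Act, a framework that treats open-vocabulary 3D indoor scene editing as goal-regressive planning. It predicts symbolic goal predicates, plans in a custom action language called EditLang, and uses a validator to enforce physical feasibility, producing interpretable and physically coherent scene transformations and addressing the shortcomings of existing methods in instruction fidelity, semantic consistency, and physical plausibility.
Details
Motivation: Existing open-vocabulary 3D scene editing methods usually treat editing as a generative task, so large-scale scene regeneration or image-space edits disrupt spatial structure and cause unintended global changes or physically inconsistent layouts. The paper instead views a user instruction as a desired world state, and editing as the minimal action sequence that achieves that state while leaving everything else unchanged.
Result: On E2A-Bench (63 editing tasks across 9 indoor environments), Edit-As-Act significantly outperforms prior methods across all edit types and scene categories.
Insight: The paper recasts scene editing as goal-regressive planning, designs the PDDL-inspired EditLang action language to encode geometric relations, and separates reasoning from low-level generation to achieve instruction fidelity, semantic consistency, and physical plausibility together. Its symbolic planning and validation mechanism offers a new paradigm for interpretable 3D scene editing.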
Abstract: Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility - three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
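Goal regression with preconditions and effects can be sketched with classical STRIPS-style actions. The example below is a toy illustration of the general technique, not EditLang: the predicates, the two actions, and the depth-limited backward search are all hypothetical, and the paper's language-driven planner and validator are not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset     # predicates that must hold before the action
    add: frozenset     # predicates the action makes true
    delete: frozenset  # predicates the action makes false

def regress(goal, action):
    """Subgoal that must hold before `action` so that `goal` holds after it.

    Returns None if the action does not contribute to the goal or would
    destroy part of it.
    """
    if not (action.add & goal) or (action.delete & goal):
        return None
    return (goal - action.add) | action.pre

def plan_backward(state, goal, actions, depth=5):
    """Depth-limited backward search from the goal to the current state."""
    if goal <= state:
        return []
    if depth == 0:
        return None
    for action in actions:
        subgoal = regress(goal, action)
        if subgoal is None:
            continue
        prefix = plan_backward(state, subgoal, actions, depth - 1)
        if prefix is not None:
            return prefix + [action]
    return None

# Hypothetical scene: move a lamp from the floor onto a clear desk.
state = frozenset({"on(lamp,floor)", "clear(desk)"})
goal = frozenset({"on(lamp,desk)"})
actions = [
    Action("pick_lamp", frozenset({"on(lamp,floor)"}),
           frozenset({"holding(lamp)"}), frozenset({"on(lamp,floor)"})),
    Action("place_lamp_on_desk", frozenset({"holding(lamp)", "clear(desk)"}),
           frozenset({"on(lamp,desk)"}), frozenset({"holding(lamp)"})),
]
plan = plan_backward(state, goal, actions)
names = [a.name for a in plan]
print(names)
```

Because the search starts from the goal, only actions that contribute to the requested state are considered, which matches the abstract's emphasis on minimal edits that leave the rest of the scene untouched.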
[96] ReLaGS: Relational Language Gaussian Splatting cs.CVPDF
Yaxu Xie, Abdalla Arafa, Alireza Javanmardi, Christen Millerdurai, Jia Cheng Hu
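TL;DR: This paper proposes ReLaGS, a framework that builds a hierarchical language-distilled Gaussian scene and a 3D semantic scene graph to achieve unified 3D perception and reasoning without scene-specific training. Combining a Gaussian pruning mechanism with a multi-view language alignment strategy, it supports open-vocabulary 3D segmentation, scene graph generation, and relation-guided retrieval.
Details
Motivation: Existing methods for unified 3D perception and reasoning (segmentation, retrieval, relation understanding) are limited: they are either object-centric or require costly training for inter-object reasoning.
Result: The method is validated on open-vocabulary segmentation, scene graph generation, and relation-guided retrieval, and enables efficient, scalable open-vocabulary 3D reasoning.
Insight: The contributions are: construction of a hierarchical language-distilled Gaussian scene without scene-specific training; a Gaussian pruning mechanism that refines geometry; multi-view language alignment that aggregates 2D features into 3D object embeddings; and an open-vocabulary 3D scene graph built from vision-language annotations and Graph Neural Network-based relational reasoning.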
Abstract: Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: https://dfki-av.github.io/ReLaGS/
[97] Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment cs.CV | cs.AI
Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li
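TL;DR: This paper proposes CC-CDFSL, which uses a cycle-consistency constraint and a Semantic Anchor mechanism to address local misalignment in CLIP-based cross-domain few-shot learning. It improves the model's ability to capture fine-grained visual cues in the target domain, the interpretability of its decisions, and overall performance.
Details
Motivation: When current CLIP-based cross-domain few-shot methods are adapted to downstream domains that need fine-grained, interpretable recognition, such as medical diagnosis, local visual features are misaligned with text semantics, and the domain gap and scarce training data make this worse.
Result: Extensive experiments across benchmarks, backbones, and fine-tuning methods show that the method improves local vision-language alignment, enhances the interpretability of model decisions through patch visualization, and achieves state-of-the-art performance.
Insight: The novelty lies in using a cycle-consistency task as self-supervision to align local features, and in a Semantic Anchor mechanism that strengthens the text-to-image mapping and filters noise in the image-to-text mapping. Together they offer a new approach to fine-grained cross-domain alignment under few-shot conditions.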
Abstract: Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, albeit they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP’s shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features close to the translated back features. To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
[98] FINER: MLLMs Hallucinate under Fine-grained Negative Queries cs.CV | cs.AI
Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz
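TL;DR: Targeting hallucination in multimodal large language models (MLLMs) under fine-grained negative queries, this paper introduces the FINER benchmarks (FINER-CompreCap and FINER-DOCCI) to evaluate hallucination under fine-grained mismatches. It also proposes FINER-Tuning, which fine-tunes models on FINER data with Direct Preference Optimization (DPO), substantially reducing hallucination while improving general multimodal capability.
Details
Motivation: Existing benchmarks focus on coarse image-related questions and do not adequately evaluate MLLM hallucination under fine-grained queries, so finer-grained benchmarks are needed to expose and address the problem.
Result: On the FINER benchmarks, FINER-Tuning yields gains of up to 24.2% on hallucination for frontier MLLMs (InternVL3.5-14B), while also improving performance on eight existing hallucination suites and strengthening capabilities on six general multimodal benchmarks.
Insight: The novelty is a hallucination benchmark built around fine-grained negative queries (FINER), which shows that models tend to hallucinate when fine-grained mismatches co-occur with elements that are genuinely present in the image. The DPO-based FINER-Tuning method mitigates this and is general enough to improve both hallucination resistance and general performance.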
Abstract: Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
[99] Few-Step Diffusion Sampling Through Instance-Aware Discretizations cs.CV
Liangyu Yuan, Ruoyu Wang, Tong Zhao, Dingwen Fu, Mingkun Lei
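TL;DR: This paper proposes an instance-aware discretization framework to improve sampling in diffusion and flow matching models. Instead of a globally shared timestep schedule, it learns adaptive timestep allocations from input-dependent priors. It consistently improves generation quality across synthetic data, pixel-space diffusion, and latent-space image and video flow matching, with marginal tuning cost and negligible inference overhead.
Details
Motivation: Most discretization strategies for diffusion sampling use a globally shared timestep schedule that ignores the instance-specific complexity of the generative process, which may limit performance. The work is motivated by controlled experiments on synthetic data that reveal the suboptimality of global schedules under instance-specific dynamics.
Result: Across synthetic data, pixel-space diffusion, and latent-space image and video flow matching models, the method consistently improves generation quality with marginal tuning cost and negligible inference overhead.
Insight: The core contribution is extending gradient-based discretization search to the conditional generative setting through a framework that learns instance-specific timestep allocations. This challenges the convention of one global schedule by adapting to the generation difficulty of each sample.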
Abstract: Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance. Motivated by controlled experiments on synthetic data, which reveal the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost compared to training and negligible inference overhead.
[100] DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation cs.CV
Sarra Harrabi, Yichen Wu, Geoffrey H. Tison, Minhaj Ansari, Milos Vukadinovic
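TL;DR: DeepCORO-CLIP is a multi-view foundation model based on video-text contrastive learning for comprehensive analysis of coronary angiography videos. Trained on more than 200,000 videos and validated on an external dataset, it handles tasks ranging from stenosis detection to cardiovascular event prediction, providing fast, automated angiography interpretation for clinical use.
Details
Motivation: Visual interpretation of coronary angiography varies between readers, and existing AI methods typically analyze single frames or single projections and focus mainly on stenosis detection, which limits comprehensive coronary assessment.
Result: For significant stenosis detection, AUROC was 0.888 on internal validation and 0.89 on external validation. Mean absolute error against core laboratory quantitative coronary angiography was 13.6%, lower than the 19.0% of clinical reports. The model also performed strongly on chronic total occlusion, intracoronary thrombus, and coronary calcification detection. With transfer learning, it predicted one-year major adverse cardiovascular events with AUROC 0.79 and estimated left ventricular ejection fraction with a mean absolute error of 7.3%.
Insight: The novelty is a multi-view video-text contrastive learning framework for study-level angiography assessment that integrates multiple projections with attention-based pooling. As a foundation model, it generalizes through transfer learning to diagnostic, prognostic, and disease-progression tasks, moving from stenosis-only detection toward comprehensive vascular assessment.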
Abstract: Coronary angiography is the reference standard for evaluating coronary artery disease, yet visual interpretation remains variable between readers. Existing artificial intelligence methods typically analyze single frames or projections and focus mainly on stenosis, limiting comprehensive coronary assessment. We present DeepCORO-CLIP, a multi-view foundation model trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute and externally validated on 4,249 studies from the University of California, San Francisco. DeepCORO-CLIP integrates multiple projections with attention-based pooling for study-level assessment across diagnostic, prognostic, and disease progression tasks. For significant stenosis detection, the model achieved an AUROC of 0.888 internally and 0.89 on external validation. Mean absolute error against core laboratory quantitative coronary angiography was 13.6%, lower than clinical reports at 19.0%. The model also performed strongly for chronic total occlusion, intracoronary thrombus, and coronary calcification detection. Transfer learning enabled prediction of one-year major adverse cardiovascular events with AUROC 0.79 and estimation of left ventricular ejection fraction with mean absolute error 7.3%. Embeddings also captured disease progression across serial examinations. With a mean inference time of 4.2 seconds in hospital deployment, DeepCORO-CLIP provides a foundation for automated coronary angiography interpretation at the point of care. Code, sample data, model weights, and deployment infrastructure are publicly released.
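The abstract mentions attention-based pooling over multiple projections but does not give its form. As a rough illustration only, the sketch below shows generic softmax attention pooling: each view embedding gets a scalar relevance score, the scores are normalized with a softmax, and the study-level embedding is the weighted sum. The function name `attention_pool` and the idea that scores arrive as plain numbers are assumptions for this example; in a trained model the scores would come from learned parameters, and DeepCORO-CLIP's actual pooling may differ.

```python
import math


def attention_pool(view_embeddings, view_scores):
    """Pool per-view embeddings into one vector with softmax weights.

    view_embeddings: list of equal-length float lists, one per projection.
    view_scores: one scalar relevance score per view (hypothetical input;
    a real model would compute these with a learned scoring layer).
    """
    # Subtract the max score before exponentiating for numerical stability.
    top = max(view_scores)
    exps = [math.exp(s - top) for s in view_scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]

    dim = len(view_embeddings[0])
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, view_embeddings))
        for d in range(dim)
    ]
    return pooled, weights
```

With equal scores the pooling reduces to a plain average of the views; a view with a much higher score dominates the study-level embedding.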
[101] WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models cs.CV | cs.AI
Wanjun Du, Zifeng Yuan, Tingting Chen, Fucai Ke, Beibei Lin
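TL;DR: This paper introduces WeatherReasonSeg, a benchmark for evaluating the reasoning segmentation ability of vision-language models (VLMs) under adverse weather. It comprises a controllable synthetic-weather dataset and a real-world adverse-weather dataset, and extends evaluation to five reasoning dimensions. Experiments show that VLM performance degrades monotonically with weather severity and that different weather types induce distinct vulnerability patterns.
Details
Motivation: Existing VLMs perform well at reasoning segmentation, but their benchmarks are built mainly from high-quality images captured under idealized conditions. Whether VLMs can keep reasoning segmentation reliable when adverse weather (rain, snow, fog) severely degrades visual cues is a key open question.
Result: Extensive experiments on WeatherReasonSeg yield two key findings: (1) VLM performance degrades monotonically as weather severity increases, and (2) different weather types induce distinct vulnerability patterns.
Insight: The contribution is a benchmark dedicated to reasoning segmentation under adverse weather, with both controllable synthetic and real-world datasets and an evaluation extended across multiple reasoning dimensions. It provides an evaluation basis and data resource for developing robust, weather-aware reasoning models.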
Abstract: Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are primarily constructed from high-quality images captured under idealized conditions. This raises a critical question: when visual cues are severely degraded by adverse weather conditions such as rain, snow, or fog, can VLMs sustain reliable reasoning segmentation capabilities? In response to this challenge, we introduce WeatherReasonSeg, a benchmark designed to evaluate VLM performance in reasoning-based segmentation under adverse weather conditions. It consists of two complementary components. First, we construct a controllable reasoning dataset by applying synthetic weather with varying severity levels to existing segmentation datasets, enabling fine-grained robustness analysis. Second, to capture real-world complexity, we curate a real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated via mask-guided LLM prompting. We further broaden the evaluation scope across five reasoning dimensions, including functionality, application scenarios, structural attributes, interactions, and requirement matching. Extensive experiments across diverse VLMs reveal two key findings: (1) VLM performance degrades monotonically with increasing weather severity, and (2) different weather types induce distinct vulnerability patterns. We hope WeatherReasonSeg will serve as a foundation for advancing robust, weather-aware reasoning.
[102] Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos cs.CV
Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen
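TL;DR: This paper proposes SynRL, a framework that teaches models temporal primitives (such as direction, speed, and state tracking) using programmatically generated synthetic videos. These primitives act as basic building blocks of temporal understanding and transfer effectively to real-world video understanding tasks.
Details
Motivation: Current vision-language models face two key limitations in video understanding: existing datasets lack temporal-centricity (answers can be inferred from isolated keyframes), and training data generated by proprietary models contains systematic errors in basic temporal perception, such as confusing motion directions or misjudging speeds.
Result: SynRL achieves substantial improvements on 15 benchmarks covering temporal grounding, complex reasoning, and general video understanding. With only 7.7K synthetic CoT samples it outperforms Video-R1, which uses 165K real-world samples.
Insight: The novelty is decomposing temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, and building a synthetic dataset with ground-truth frame-level annotations through code-based video generation. The work shows that basic temporal skills learned from abstract synthetic patterns transfer to complex real-world scenes, offering a more cost-efficient scaling path for video post-training.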
Abstract: The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame by frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost efficient scaling path.
[103] Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation cs.CV
Haocheng Li, Juepeng Zheng, Shuangxi Miao, Ruibo Lu, Guosheng Cai
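TL;DR: This paper proposes MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework for multimodal remote sensing semantic segmentation. Built on a largely frozen vision foundation model (VFM) backbone, it uses a symmetric dual-stream architecture, achieves deep semantic interaction and adaptive feature fusion through a Cross-modal Prompt-Injected Adapter and a Difference-Guided Gated Fusion Module, and introduces a Modality-Conditional Random Masking strategy to mitigate modality imbalance. On the ISPRS Vaihingen and Potsdam benchmarks, MoBaNet reaches state-of-the-art performance with far fewer trainable parameters.
Details
Motivation: Adapting pretrained vision foundation models (VFMs) to multimodal tasks incurs high computational overhead and suffers from modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization.
Result: Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks show that MoBaNet achieves state-of-the-art (SOTA) performance with far fewer trainable parameters than full fine-tuning.
Insight: The contributions are: (1) a symmetric dual-stream architecture on a frozen backbone, which preserves generalizable representations while minimizing trainable parameters; (2) the Cross-modal Prompt-Injected Adapter (CPIA), which enables deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters; (3) the Difference-Guided Gated Fusion Module (DGFM), which uses cross-modal discrepancy to adaptively fuse stage features; (4) the Modality-Conditional Random Masking (MCRM) strategy, which mitigates modality imbalance by masking only one modality during training and applying hard-pixel auxiliary supervision to the modality-specific branches.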
Abstract: Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code in this work is available at https://github.com/sauryeo/MoBaNet.
[104] Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3) cs.CV | cs.AI
Diederick C. Niehorster, Marcus Nyström
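TL;DR: This paper evaluates Segment Anything Model 3 (SAM3) on eye image segmentation and compares it with SAM2. With either visual prompts or the newly introduced concept (text) prompts, SAM3 did not outperform SAM2 in most cases, on both lab recordings and challenging in-the-wild datasets, and it was slower. The authors therefore conclude that SAM2 remains the best option for eye image segmentation, and they provide an adaptation of the SAM3 codebase that can process videos of arbitrary duration.
Details
Motivation: To evaluate the zero-shot eye image segmentation performance of the latest vision foundation model, SAM3, including its new concept prompting mode, and to compare it with its predecessor SAM2 in order to identify the best model choice.
Result: Evaluated on diverse datasets containing high-quality lab videos and challenging in-the-wild videos (the TEyeD dataset), SAM3 with either visual or concept prompts did not outperform SAM2 in most cases and was slower.
Insight: The analysis shows that a newer model iteration (SAM3) does not necessarily improve performance on a specific task (eye image segmentation) and may even be less efficient, possibly because of added complexity. The new concept prompting mode also showed no advantage on this task, which suggests that foundation models should be validated on the target task rather than adopted simply because they are newer. The authors' video-processing adaptation of the code is of practical value.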
Abstract: Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3’s codebase that allows processing videos of arbitrary duration.
[105] SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition cs.CV | cs.AI
Jingxiao Yang, DaLin He, Miao Pan, Ge Su, Wenqi Zhang
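TL;DR: This paper proposes SARE, a sample-wise adaptive reasoning framework for training-free fine-grained visual recognition (FGVR). A cascaded design combines fast candidate retrieval with fine-grained reasoning and runs the more expensive reasoning only when necessary. A self-reflective experience mechanism uses past failure cases to provide transferable discriminative guidance, improving accuracy while substantially reducing computational overhead.
Details
Motivation: Existing FGVR methods based on large vision-language models (LVLMs) mainly follow retrieval-oriented or reasoning-oriented paradigms, and both have limitations. They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, which leads to suboptimal accuracy and efficiency, and they lack a mechanism to consolidate and reuse error-specific experience, which leads to repeated failures on similar hard cases.
Result: Extensive experiments on 14 datasets confirm that SARE achieves state-of-the-art (SOTA) performance among training-free methods while substantially reducing computational overhead.
Insight: The core contributions are the sample-adaptive cascaded reasoning design, which adjusts inference complexity to sample difficulty, and the self-reflective experience mechanism, which uses past failures to guide current inference without any parameter updates. Together they offer a new way to use LVLMs efficiently for fine-grained recognition.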
Abstract: Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
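The cascaded design described above (cheap retrieval first, expensive reasoning only when needed) can be sketched roughly as follows. The abstract does not say how SARE decides that reasoning is "necessary", so the score-margin test here is an assumption made purely for illustration, as are the names `cascaded_predict`, `retrieve`, `reason`, `margin`, and `top_k`.

```python
def cascaded_predict(sample, retrieve, reason, margin=0.2, top_k=3):
    """Return (label, path), where path is "fast" or "slow".

    retrieve(sample, top_k) -> list of (label, score) candidates.
    reason(sample, labels)  -> label chosen by the expensive reasoning step.
    The margin gate is a hypothetical stand-in for SARE's own criterion.
    """
    candidates = sorted(retrieve(sample, top_k), key=lambda c: c[1], reverse=True)
    best_label, best_score = candidates[0]
    runner_up = candidates[1][1] if len(candidates) > 1 else float("-inf")

    # Easy sample: the top candidate clearly beats the runner-up.
    if best_score - runner_up >= margin:
        return best_label, "fast"

    # Ambiguous sample: hand the shortlist to the fine-grained reasoner.
    labels = [label for label, _ in candidates]
    return reason(sample, labels), "slow"
```

The point of the gate is that the costly step runs only on ambiguous samples, which is where the reduction in computational overhead would come from.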
[106] TAPESTRY: From Geometry to Appearance via Consistent Turntable Videos cs.CV
Yan Zeng, Haoran Jiang, Kaixin Yao, Qixuan Zhang, Longwen Zhang
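TL;DR: TAPESTRY is a framework that generates high-quality turntable videos (TTVs) conditioned on explicit 3D geometry, with the goal of automatically producing photorealistic, self-consistent appearances for untextured 3D models. It reframes 3D appearance generation as a geometry-conditioned video diffusion problem, constrains video generation by rendering and encoding multi-modal geometric features, and adds a multi-stage downstream reconstruction pipeline with 3D-Aware Inpainting to achieve full surface coverage and high-quality 3D asset creation from TTV input.
Details
Motivation: General-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability when generating 360-degree turntable videos. The work addresses this so that realistic appearances suitable for high-quality 3D reconstruction can be generated automatically for untextured 3D models.
Result: Experiments show that the method outperforms existing approaches in both video consistency and final reconstruction quality.
Insight: The work reframes 3D appearance generation as geometry-conditioned video diffusion, constraining generation with multi-modal geometric features at pixel-level precision. It proposes a multi-stage downstream reconstruction pipeline with 3D-Aware Inpainting that completes self-occluded regions for full surface coverage. The generated TTVs serve as a reliable 3D-aware intermediate representation that can be back-projected into UV textures or used to supervise neural rendering methods such as 3DGS.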
Abstract: Automatically generating photorealistic and self-consistent appearances for untextured 3D models is a critical challenge in digital content creation. The advancement of large-scale video generation models offers a natural approach: directly synthesizing 360-degree turntable videos (TTVs), which can serve not only as high-quality dynamic previews but also as an intermediate representation to drive texture synthesis and neural rendering. However, existing general-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability across the full range of views, making their outputs ill-suited for high-quality 3D reconstruction. To this end, we introduce TAPESTRY, a framework for generating high-fidelity TTVs conditioned on explicit 3D geometry. We reframe the 3D appearance generation task as a geometry-conditioned video diffusion problem: given a 3D mesh, we first render and encode multi-modal geometric features to constrain the video generation process with pixel-level precision, thereby enabling the creation of high-quality and consistent TTVs. Building upon this, we also design a method for downstream reconstruction tasks from the TTV input, featuring a multi-stage pipeline with 3D-Aware Inpainting. By rotating the model and performing a context-aware secondary generation, this pipeline effectively completes self-occluded regions to achieve full surface coverage. The videos generated by TAPESTRY are not only high-quality dynamic previews but also serve as a reliable, 3D-aware intermediate representation that can be seamlessly back-projected into UV textures or used to supervise neural rendering methods like 3DGS. This enables the automated creation of production-ready, complete 3D assets from untextured meshes. Experimental results demonstrate that our method outperforms existing approaches in both video consistency and final reconstruction quality.
[107] Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation cs.CV
Haoyun Chen, Fenghe Tang, Wenxin Ma, Shaohua Kevin Zhou
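TL;DR: This paper proposes Concept-to-Pixel (C2P), a prompt-free universal medical image segmentation framework. It decouples anatomical knowledge into geometric and semantic representations, uses multimodal large language models to distill learnable Semantic Tokens, and introduces explicitly supervised Geometric Tokens to enforce universal physical constraints, generating input-specific dynamic kernels for precise segmentation. The method performs well on eight datasets spanning seven modalities and generalizes strongly on zero-shot and cross-modal tasks.
Details
Motivation: Existing universal medical image segmentation methods rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities struggles with large domain shifts.
Result: On a unified benchmark of eight datasets across seven imaging modalities, the method is significantly better than universal or single-model approaches and achieves impressive results on both zero-shot and cross-modal transfer tasks.
Insight: The core novelty is explicitly decoupling anatomical knowledge into geometric and semantic representations, modeled by Geometric Tokens and MLLM-derived Semantic Tokens respectively. A Geometry-Aware Inference Consensus mechanism additionally assesses prediction reliability and suppresses outliers, improving robustness.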
Abstract: Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model’s predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities demonstrate the significant superiority of our jointly trained approach, compared to universal- or single-model approaches. Remarkably, our unified model demonstrates strong generalization, achieving impressive results not only on zero-shot tasks involving unseen cases but also in cross-modal transfers across similar tasks. Code is available at: https://github.com/Yundi218/Concept-to-Pixel
[108] Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs cs.CV
Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen
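TL;DR: This paper proposes SCEP, a training-free LVLM framework for cross-domain image deepfake detection. It mines suspicious image patches into an evidence pack that drives a frozen LVLM's reasoning, avoiding costly fine-tuning, and it outperforms strong baselines on multiple benchmarks.
Details
Motivation: Applying large vision-language models (LVLMs) to image deepfake detection currently requires costly fine-tuning and generalizes poorly to diverse, evolving manipulations.
Result: Experiments on multiple benchmarks show that SCEP outperforms strong baselines without any LVLM fine-tuning.
Insight: The novelty is a training-free, evidence-pack-driven reasoning framework (SCEP). It selects suspicious patches with a fused metric that combines semantic mismatch with frequency and noise anomalies, and uses clustering and NMS to build a compact evidence set, making effective use of a frozen LVLM for detection.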
Abstract: Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder’s CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
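The selection step described above (a few high-scoring patches per cluster, followed by grid-based NMS) might look roughly like the sketch below. The abstract gives no details, so everything specific here is an assumption: the function name `select_evidence`, the `per_cluster` and `radius` parameters, and the reading of "grid-based NMS" as suppressing any patch that lies within `radius` grid cells (Chebyshev distance) of an already kept, higher-scoring patch. Cluster assignments and suspicion scores are taken as given inputs.

```python
def select_evidence(patches, per_cluster=2, radius=1):
    """Pick a compact, spatially spread set of suspicious patches.

    patches: list of dicts with keys
        "pos"     -> (row, col) of the patch on the image grid
        "cluster" -> cluster id of the patch feature
        "score"   -> fused suspicion score (higher = more suspicious)
    """
    # 1) Shortlist the top-scoring patches inside each cluster.
    by_cluster = {}
    for patch in patches:
        by_cluster.setdefault(patch["cluster"], []).append(patch)
    shortlist = []
    for members in by_cluster.values():
        members.sort(key=lambda p: p["score"], reverse=True)
        shortlist.extend(members[:per_cluster])

    # 2) Greedy NMS on the grid: visit by descending score and drop any
    #    patch too close to one that has already been kept.
    shortlist.sort(key=lambda p: p["score"], reverse=True)
    kept = []
    for patch in shortlist:
        row, col = patch["pos"]
        far_enough = all(
            max(abs(row - k["pos"][0]), abs(col - k["pos"][1])) > radius
            for k in kept
        )
        if far_enough:
            kept.append(patch)
    return kept
```

Under these assumptions, per-cluster sampling covers dispersed traces, while the NMS pass removes near-duplicate neighbours so the evidence pack stays small.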
[109] Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime cs.CV
Haiyu Yang, Sumit Sharma, Enhong Liu, Miel Hostens
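TL;DR: This study systematically compares training from scratch, frozen feature extraction, and parameter-efficient fine-tuning (PEFT) on a limited-data image classification task. It focuses on PEFT of the 6.7-billion-parameter DINOv3 foundation model with QLoRA and DoRA under a high 98:1 test-to-train ratio. PEFT clearly outperforms the other approaches: the best QLoRA configuration reaches 83.16% test accuracy while training only 2.72% of the parameters, with no overfitting observed. The main challenge is underfitting.
Details
Motivation: Automated behavior classification in precision livestock farming faces high computational cost and limited labeled data. The study explores how to fine-tune a large vision foundation model efficiently so that it generalizes well under extreme data scarcity (only 2,160 training images against 211,800 test images).
Result: On the livestock image classification task, the best QLoRA configuration (all linear layers, rank 64) achieved 83.16% test accuracy with only 5.8 hours of training, clearly ahead of ResNet-18 (72.87%), ViT-Small (61.91%), and frozen DINOv3 feature extraction (76.56%). DoRA reached comparable accuracy (83.14%) but took longer to train (11.0 hours). PEFT performed best among the compared approaches on this task.
Insight: The claimed contribution is a systematic evaluation of QLoRA and DoRA for PEFT of a large vision foundation model across ranks and target-module configurations, together with the finding that underfitting rather than overfitting is the main challenge when adapting to agricultural imagery. Viewed objectively, the paper offers practical guidance for improving generalization under extreme data scarcity and a high test-to-train ratio by increasing adapter capacity (more target modules, higher rank), which is useful for deploying billion-parameter vision models in data-limited domains such as agriculture.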
Abstract: Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, which is essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives, where the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization while simultaneously not causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
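As a quick sanity check, the headline figures in the abstract above can be reproduced with a few lines of arithmetic. All inputs are numbers quoted in the abstract; the variable names are ours. The trainable-parameter estimate uses the rounded 6.7 billion total, so it lands slightly below the reported 183.0M, presumably because the exact parameter count is a little higher than 6.7 billion.

```python
# Figures quoted in the abstract.
train_images = 2_160
test_images = 211_800
total_params = 6.7e9          # DINOv3 size as stated (rounded)
trainable_fraction = 0.0272   # 2.72% trainable in the best QLoRA setup

# Test-to-train ratio behind the "98:1" in the title.
ratio = test_images / train_images

# Approximate number of trainable parameters implied by 2.72% of 6.7B.
trainable_params = total_params * trainable_fraction

# Accuracy gain from q_proj-only (78.38%) to all-linear, rank 64 (83.16%).
accuracy_gain = 83.16 - 78.38

print(f"test-to-train ratio: {ratio:.1f}:1")
print(f"trainable parameters: {trainable_params / 1e6:.1f}M")
print(f"accuracy gain: {accuracy_gain:.2f} percentage points")
```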
[110] ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis cs.CV | cs.LG
Romil Imtiaz, Dimitris K. Iakovidis
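TL;DR: This paper presents a multi-label gastrointestinal video analysis pipeline built on a ResNet-50 frame classifier followed by anatomy-guided temporal decoding. The system predicts 17 labels (5 anatomy classes and 12 pathology classes) from frames resized to 336x336. The main challenge is severe class imbalance, especially for rare pathology labels, which is addressed with clipped class-wise positive weighting in the training loss. At the temporal stage, combining GT-style framewise event composition, anatomy vote smoothing, anatomy-based pathology gating, and a conservative hysteresis decoder substantially improves temporal mean average precision (mAP).
Details
Motivation: Gastrointestinal video analysis suffers from severe class imbalance, and direct frame-to-event conversion produces fragmented predictions that do not match the official ground-truth annotations.
Result: On the challenge test set, the final submission improves temporal mAP from 0.3801 to 0.4303.
Insight: The contributions are the use of a clipped class-wise positive-weighted loss to mitigate class imbalance and an anatomy-guided temporal decoding strategy (event composition, smoothing, gating, and hysteresis decoding) that improves the accuracy and robustness of multi-label video event detection.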
Abstract: We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.
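Two of the ingredients named above, clipped class-wise positive weighting and hysteresis decoding, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the weight is taken to be the common negatives-to-positives ratio capped at `max_weight`, the decoder uses a simple two-threshold rule, and the names and default values (`max_weight=10.0`, `on=0.6`, `off=0.4`) are hypothetical because the abstract gives none.

```python
def clipped_pos_weights(pos_counts, total, max_weight=10.0):
    """Per-class positive weight = negatives / positives, capped at max_weight.

    Capping keeps very rare classes from receiving huge weights that could
    destabilize optimization.
    """
    weights = []
    for pos in pos_counts:
        raw = (total - pos) / pos if pos > 0 else max_weight
        weights.append(min(raw, max_weight))
    return weights


def hysteresis_decode(probs, on=0.6, off=0.4):
    """Convert per-frame probabilities into (start, end) events, end inclusive.

    An event opens when the probability reaches `on` and closes only once it
    drops below `off`, so brief dips between the thresholds do not split it.
    """
    events, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= on:
            start = i
        elif start is not None and p < off:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(probs) - 1))
    return events
```

For example, the sequence `[0.1, 0.7, 0.5, 0.45, 0.3, 0.8, 0.9]` decodes to the events `(1, 3)` and `(5, 6)`: the dip to 0.5 and 0.45 stays inside the first event because it never falls below the `off` threshold.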
[111] Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients cs.CV | cs.AI
Ziwei Xiang, Fanhu Zeng, Hongjian Fang, Rui-Qi Wang, Renxing Chen
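TL;DR: This paper proposes a fine-grained post-training quantization method for large vision-language models (LVLMs) based on Quantization-aware Integrated Gradients (QIG). It uses integrated gradients to quantify token-level sensitivity, refining quantization granularity from the modality level to the token level so that it better reflects inter-modality and intra-modality interactions inside the model. Experiments show that the method improves accuracy across several quantization settings with negligible extra latency.
Details
Motivation: LVLMs perform very well, but their computational and memory overhead hinders practical deployment. Existing post-training quantization methods usually measure token sensitivity at the modality level, which cannot capture complex cross-token interactions or quantify quantization error at the token level. As tokens interact inside the model, the distinction between modalities gradually fades, so a finer-grained calibration strategy is needed.
Result: Extensive experiments on multiple LVLMs under settings such as W4A8 and W3A16 show that the method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization it improves the average accuracy of LLaVA-onevision-7B by 1.60%, narrowing the gap to the full-precision model to only 1.33%.
Insight: The main novelty is refining quantization granularity from the modality level to the token level and using integrated gradients, an interpretability tool, to quantify each token's sensitivity to quantization error. This borrows the idea of axiomatic attribution from mechanistic interpretability to give quantization calibration finer guidance that better matches the model's internal dynamics. The method improves the accuracy of quantized models while keeping inference efficient, offering a new direction for lightweight LVLM deployment.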
Abstract: Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at https://github.com/ucas-xiang/QIG.
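For readers unfamiliar with the attribution tool that QIG builds on, the sketch below shows standard integrated gradients, not the paper's quantization-aware variant, on a toy function. Attribution for input i is (x_i - b_i) times the average gradient along the straight path from a baseline b to the input x, approximated here with a midpoint Riemann sum. The toy function `f`, its analytic gradient `grad_f`, and the step count are assumptions chosen for illustration; a real model would supply gradients via automatic differentiation.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the scalar function at point.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i, g in enumerate(grad):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def f(x):
    """Toy scalar function standing in for a model output."""
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    """Analytic gradient of f."""
    return [2.0 * x[0], 3.0]


attributions = integrated_gradients(grad_f, [2.0, 1.0], [0.0, 0.0])
```

The attributions sum to f(x) - f(baseline), the completeness axiom that makes the scores usable as a quantitative sensitivity measure.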
[112] ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation cs.CV | cs.AI | cs.LGPDF
Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz
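TL;DR: This paper proposes ChopGrad, a truncated backpropagation scheme for video diffusion models. By restricting gradient computation to local frame windows while maintaining global consistency, it sharply reduces training memory and makes fine-tuning with pixel-wise losses efficient.
Details
Motivation: Existing video diffusion models achieve high-quality generation through recurrent frame processing, but training them in the pixel domain is prohibitively memory-hungry because activations accumulate across the entire video sequence. This makes fine-tuning with pixel-wise losses computationally infeasible for long or high-resolution videos.
Result: ChopGrad reduces training memory from growing linearly with the number of video frames (full backpropagation) to constant, and compares favorably with existing state-of-the-art video diffusion models on a range of conditional video generation tasks: video super-resolution, video inpainting, enhancement of neural-rendered scenes, and controlled driving video generation.
Insight: The core contribution is a truncated backpropagation scheme, supported by a theoretical analysis of the approximation, that computes gradients within local windows while maintaining global consistency. This removes the memory bottleneck that kept pixel-wise losses from being used to fine-tune video diffusion models.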
Abstract: Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
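To make the idea of window-limited gradients concrete, here is a toy sketch on a scalar linear recurrence, not the ChopGrad video decoder. The recurrence `h_t = a*h_{t-1} + w*x_t`, the squared per-frame loss, and the names `forward` and `grad_w` are assumptions chosen so the exact and truncated gradients can be written out by hand.

```python
import numpy as np

def forward(w, a, xs):
    """Toy recurrent 'decoder': each frame depends on the previous one."""
    h, hs = 0.0, []
    for x in xs:
        h = a * h + w * x
        hs.append(h)
    return np.array(hs)

def grad_w(w, a, xs, ys, window):
    """Gradient of 0.5 * sum_t (h_t - y_t)^2 w.r.t. w, where each frame's
    dependence on w is only traced back through the last `window` frames."""
    hs = forward(w, a, xs)
    total = 0.0
    for t in range(len(xs)):
        lo = max(0, t - window + 1)
        dh_dw = sum(a ** (t - s) * xs[s] for s in range(lo, t + 1))
        total += (hs[t] - ys[t]) * dh_dw
    return total

xs = np.sin(np.arange(32.0))
ys = np.zeros(32)
w, a = 1.0, 0.3

full = grad_w(w, a, xs, ys, window=len(xs))   # untruncated reference
chopped = grad_w(w, a, xs, ys, window=8)      # gradient limited to 8 frames

# Finite-difference check of the untruncated gradient.
loss = lambda w_: 0.5 * np.sum((forward(w_, a, xs) - ys) ** 2)
eps = 1e-4
fd = (loss(w + eps) - loss(w - eps)) / (2 * eps)
```

In this toy the contribution of a frame `j` steps back decays as `a**j`, so a short window already recovers nearly the full gradient; whether a comparable decay holds for a real video decoder is what the paper's theoretical analysis would need to establish.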
[113] M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking cs.CVPDF
Qiangqiang Wu, Tianyu Yang, Bo Fang, Jia Wan, Matias Di Martino
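TL;DR: This paper proposes Mask-to-Point (M2P), a weakly-supervised learning method that uses video object segmentation (VOS) mask annotations to improve vision foundation models (VFMs) on dense point tracking. It introduces three mask-based constraints (a local structure consistency loss, a mask label consistency loss, and a mask boundary constraint) to improve how the model captures dense temporal correspondence in video.
Details
Motivation: VFMs pre-trained on static images are inherently limited at capturing dense temporal correspondence in video, and existing approaches that adapt them to point tracking through offline fine-tuning or test-time optimization bring limited gains. The paper aims to use abundant VOS mask annotations to improve VFMs for dense point tracking in a weakly-supervised way.
Result: On the TAP-Vid-DAVIS benchmark, M2P achieves gains of 12.8% and 14.6% over the DINOv2-B/14 and DINOv3-B/16 baselines, respectively. Trained efficiently on only 3.6K VOS videos, it significantly outperforms the baseline VFMs and can serve as a general pre-trained backbone for point tracking.
Insight: The novelty lies in three mask-based weakly-supervised constraints that turn VOS mask annotations into supervision for point-tracking representation learning. In particular, the local structure consistency loss uses Procrustes analysis to model the cohesive motion of a local structure, and the mask label consistency loss acts as a regularizer that prevents convergence to trivial solutions. Together they show how weakly-supervised video annotations can improve the temporal understanding of VFMs.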
Abstract: Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point-to-point matching learning. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization, which stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs with efficient training by using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating their potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.
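The local structure consistency loss is described as using Procrustes analysis to model cohesive local motion. A minimal sketch of rigid (orthogonal) Procrustes alignment via SVD follows; how M2P samples local structures and turns the fit into a loss is not specified in the abstract, so the names `procrustes_align` and `structure_residual` and the mean-distance residual are assumptions.

```python
import numpy as np

def procrustes_align(P, Q):
    """Best rotation R and translation t (least squares) with Q ~ P @ R.T + t.
    P, Q: (n, d) arrays of corresponding points in two frames."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mu_p).T @ (Q - mu_q)              # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0] * (P.shape[1] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = mu_q - R @ mu_p
    return R, t

def structure_residual(P, Q):
    """Mean distance between Q and the rigidly aligned P (hypothetical loss)."""
    R, t = procrustes_align(P, Q)
    return float(np.mean(np.linalg.norm(P @ R.T + t - Q, axis=1)))

theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
Q = P @ R_true.T + np.array([3.0, -2.0])       # rigidly moved copy of P

R_est, t_est = procrustes_align(P, Q)
residual = structure_residual(P, Q)
```

A residual near zero means the tracked points moved as one rigid structure; a large residual flags matches that break the local structure.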
[114] Steering Video Diffusion Transformers with Massive Activations cs.CVPDF
Xianhang Cheng, Yujian Zheng, Zhenyu Xie, Tingting Liao, Hao Li
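TL;DR: This paper studies massive activations in video diffusion transformers and proposes STAS, a training-free, self-guidance-like method that adjusts activation values at first-frame and boundary tokens to improve video quality and temporal coherence.
Details
Motivation: Despite rapid progress in video diffusion transformers, how to exploit their internal signals with minimal overhead to improve generation quality remains underexplored.
Result: STAS consistently improves video quality and temporal coherence across different text-to-video models, with negligible computational overhead.
Insight: Massive activations in video diffusion transformers follow a structured pattern. Building on this observation, the paper proposes a training-free self-guidance method that adjusts activation values at specific tokens to improve generation.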
Abstract: Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.
[115] Video Understanding: From Geometry and Semantics to Unified Models cs.CVPDF
Zhaochong An, Zirui Li, Mingqiao Ye, Feng Qiao, Jiaang Li
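TL;DR: This survey systematically reviews progress in video understanding, organizing it into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. It highlights the field's shift from isolated, task-specific pipelines toward unified modeling paradigms that adapt to diverse downstream objectives, and summarizes key trends, design principles, and open challenges in building robust, scalable, unified video foundation models.
Details
Motivation: Video understanding aims to let models perceive, reason about, and interact with the dynamic visual world. Its core challenge is modeling temporal dynamics and evolving visual context, which places stronger demands on spatiotemporal reasoning and makes it a foundational problem in computer vision. The survey aims to give a coherent, structured overview of this fast-moving field.
Result: As a survey, the paper proposes no specific model and reports no quantitative experiments or benchmark rankings. Its contribution is a systematic organization and taxonomy of the existing literature.
Insight: The central contribution is a three-perspective framework (geometry, semantics, unified models) for examining video understanding. It identifies the trend toward unified modeling paradigms and video foundation models, which offers a roadmap for future research.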
Abstract: Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.
[116] Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification cs.CV | cs.AIPDF
Podakanti Satyajith Chary, Nagarajan Ganapathy
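TL;DR: This paper presents a multi-label classification framework for video capsule endoscopy (VCE) that targets the extreme class imbalance in the Galar dataset. It combines architectural and optimization-level strategies to adapt BiomedCLIP, a biomedical vision-language foundation model: a differential attention mechanism suppresses attention noise, and several techniques (such as sqrt-frequency weighted sampling and asymmetric focal loss) handle the imbalance. On the RARE-VISION test set, the method reaches an event-level mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with fast inference.
Details
Motivation: Address the extreme class imbalance in multi-label VCE classification, where frames with pathological findings make up less than 0.1% of the data, and improve detection of rare pathological events.
Result: On the RARE-VISION test set of three NaviCam examinations (161,025 frames), the method achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completing in about 8.6 minutes on a single GPU.
Insight: The main contributions are: 1) a differential attention mechanism in BiomedCLIP that suppresses attention noise by taking the difference between two softmax attention maps; 2) a combination of strategies for extreme class imbalance (asymmetric focal loss, sqrt-frequency weighted sampling, mixup regularization, per-class threshold optimization); 3) median-filter smoothing and gap merging to enforce temporal coherence before event-level JSON generation.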
Abstract: This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
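The differential attention mechanism mentioned above takes the difference between two softmax attention maps. A single-head NumPy sketch follows; the fixed scalar `lam`, the separate query/key projections passed in as arguments, and the function names are assumptions, since the abstract does not say how the BiomedCLIP variant parameterizes the subtraction.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Single-head differential attention.
    q*, k*: (tokens, d) projections; v: (tokens, d_v).
    Returns the output and the differential attention weights."""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    weights = a1 - lam * a2                    # shared "noise" cancels out
    return weights @ v, weights

rng = np.random.default_rng(0)
q1, k1, q2, k2 = (rng.normal(size=(4, 8)) for _ in range(4))
v = rng.normal(size=(4, 16))

out, weights = differential_attention(q1, k1, q2, k2, v, lam=0.5)
out_plain, _ = differential_attention(q1, k1, q2, k2, v, lam=0.0)
reference = softmax(q1 @ k1.T / np.sqrt(8)) @ v   # standard attention
```

With `lam=0` the layer reduces to standard softmax attention. For `lam > 0` each row of weights sums to `1 - lam` and individual weights can be negative, which is how attention mass common to both maps is suppressed.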
[117] Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation cs.CVPDF
Yingjie Chen, Shilun Lin, Cai Xing, Qixin Yan, Wenjing Wang
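TL;DR: This paper proposes a unified, scalable framework for identity-aware joint audio-video generation that delivers high-fidelity, consistent personalization. A data curation pipeline automatically extracts identity information across audio and visual modalities, a flexible identity injection mechanism uses facial appearance and voice timbre as identity control signals, and a multi-stage training strategy speeds up convergence and strengthens cross-modal coherence.
Details
Motivation: No openly accessible framework currently offers fine-grained control over facial appearance and voice timbre across multiple identities, despite the growing demand for identity-aware content creation.
Result: Experiments demonstrate the superiority of the proposed framework; qualitative results are available on the project webpage.
Insight: The contributions are: 1) an automated data curation pipeline for cross-modal identity information; 2) a flexible identity injection mechanism for single- and multi-subject scenarios; 3) a multi-stage training strategy designed around modality disparity to improve cross-modal coherence.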
Abstract: Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: https://chen-yingjie.github.io/projects/Identity-as-Presence.
[118] A Creative Agent is Worth a 64-Token Template cs.CVPDF
Ruixiao Shi, Fu Feng, Yucheng Xie, Xu Yang, Jing Wang
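TL;DR: This paper proposes CAT (Creative Agent Tokenization), a framework that addresses the limited creativity of text-to-image (T2I) models on fuzzy prompts such as “a creative vinyl record-inspired skyscraper”. A learnable Creative Tokenizer packages an agent's intrinsic understanding of “creativity” into a reusable 64-token template, so creative semantics can be injected into a T2I model without repeating reasoning or prompt augmentation for every generation.
Details
Motivation: Existing T2I models rely heavily on discrete natural language prompts and struggle to infer the creative intent behind fuzzy prompts, leaving ideation and prompt design to the user. Reasoning- or agent-based methods can iteratively augment prompts, but they are computationally and financially expensive, and their instance-specific generation makes “creativity” costly and non-reusable.
Result: Extensive experiments on three tasks (Architecture Design, Furniture Design, and Nature Mixture) show that CAT is a scalable and effective paradigm for enhancing creativity in T2I generation. Compared with state-of-the-art T2I models and creative generation methods, CAT achieves a 3.7× speedup and a 4.8× reduction in computational cost while producing images with better human preference and text-image alignment.
Insight: The core contribution is the Creative Agent Tokenization framework, in which a trainable Creative Tokenizer abstracts an agent's creative understanding into a fixed-length (64-token) reusable template. This avoids repeated reasoning or prompt augmentation for each fuzzy prompt and makes “creativity” reusable. The key technical idea is training via creative semantic disentanglement, which uses relations among partially overlapping concept pairs to capture the agent's latent creative representation.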
Abstract: Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as “a creative vinyl record-inspired skyscraper”, these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes “creativity” costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce CAT, a framework for Creative Agent Tokenization that encapsulates agents’ intrinsic understanding of “creativity” through a Creative Tokenizer. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent’s latent creative representations. Extensive experiments on Architecture Design, Furniture Design, and Nature Mixture tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a 3.7× speedup and a 4.8× reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.
[119] SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale cs.CVPDF
Markus Gross, Sai Bharadhwaj Matha, Rui Song, Viswanathan Muthuveerappan, Conrad Christoph
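TL;DR: This paper proposes SegFly, a geometry-driven 2D-3D-2D paradigm for automatically generating semantic segmentation labels for UAV RGB and thermal imagery at scale, and builds a large benchmark with more than 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs. A small set of manually annotated RGB images is lifted into a semantic 3D point cloud and reprojected into all views, giving efficient and accurate cross-modal pseudo ground-truth generation and registration.
Details
Motivation: Existing UAV RGB and RGB-T semantic segmentation datasets are limited in scale, diversity, and annotation efficiency, mainly because manual labeling is expensive and accurate RGB-T alignment is hard on off-the-shelf UAVs. The paper targets automatic generation of large-scale, high-quality multi-modal annotations.
Result: With fewer than 3% of RGB images annotated, the method automatically produces 97% of RGB labels and 100% of thermal labels, at 91% and 88% annotation accuracy respectively. For cross-modal registration it reaches 87% registration accuracy. The resulting SegFly dataset is used to establish the Firefly baseline, and experiments show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision.
Insight: The core contribution is a scalable, geometry-driven 2D-3D-2D paradigm that exploits multi-view redundancy in high-overlap aerial imagery and uses 3D geometry as an intermediate alignment space. This unifies large-scale automatic pseudo ground-truth generation with pixel-level RGB-T registration that needs no hardware synchronization, yielding an efficient automated data pipeline for multi-modal scene understanding.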
Abstract: Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at https://github.com/markus-42/SegFly.
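The 3D-to-2D half of the paradigm above amounts to reprojecting a labeled point cloud into each camera view. A minimal pinhole-camera sketch with a z-buffer follows; the function `project_labels`, the `ignore` value 255, and the nearest-point-wins rule are assumptions, and the paper's actual pipeline (camera models, densification, occlusion handling) is not described in the abstract.

```python
import numpy as np

def project_labels(points, labels, K, R, t, height, width, ignore=255):
    """Project labeled 3D points (world frame) into a camera view.
    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Returns an (height, width) label map; pixels hit by no point get `ignore`."""
    cam = points @ R.T + t                     # world -> camera coordinates
    front = cam[:, 2] > 1e-6                   # keep points in front of camera
    cam, lab = cam[front], labels[front]
    uvw = cam @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    label_map = np.full((height, width), ignore, dtype=np.uint8)
    depth = np.full((height, width), np.inf)
    for ui, vi, zi, li in zip(u[inside], v[inside], cam[inside, 2], lab[inside]):
        if zi < depth[vi, ui]:                 # z-buffer: nearest point wins
            depth[vi, ui] = zi
            label_map[vi, ui] = li
    return label_map

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
points = np.array([[0.0, 0.0, 2.0],    # visible at the principal point
                   [0.0, 0.0, 5.0],    # same pixel, farther away: occluded
                   [0.1, 0.0, 1.0],    # lands 10 px to the right
                   [0.0, 0.0, -1.0]])  # behind the camera: dropped
labels = np.array([1, 2, 3, 4])
label_map = project_labels(points, labels, K, R, t, height=48, width=64)
```

Running the same projection with a thermal camera's intrinsics and pose would give labels for the thermal view from the same point cloud, which is the sense in which 3D geometry serves as the shared alignment space.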
[120] Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning cs.CVPDF
Jingchun Yang, Jinchang Zhang
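TL;DR: This paper proposes an interpretable method for determining traffic responsibility that combines dashcam video with legal multi-agent reasoning, turning video evidence into a concrete responsibility assignment grounded in specific statutes. It first builds C-TRAIL, a multimodal legal dataset that aligns videos and textual descriptions with specific responsibility modes and Chinese traffic regulations. It then introduces a two-stage framework: a traffic accident understanding module generates textual video descriptions, and a legal multi-agent framework outputs responsibility modes, statute sets, and complete judgment reports.
Details
Motivation: Dashcam video evidence is hard to convert automatically into a legal responsibility determination. The paper bridges the gap between video perception research and text-based legal analysis.
Result: Experiments on C-TRAIL and MM-AU show that the method outperforms general and legal LLMs as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.
Insight: The contributions are a multimodal dataset that aligns video with legal provisions, and a two-stage framework that combines video understanding with legal multi-agent reasoning to give end-to-end interpretable reasoning from video to legal responsibility.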
Abstract: The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming “what happened in the video” into “who is responsible under which legal provisions” still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.
[121] TransText: Transparency Aware Image-to-Video Typography Animation cs.CVPDF
Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li
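TL;DR: This paper proposes TransText, a framework built on a new Alpha-as-RGB paradigm that adapts image-to-video models to layer-aware text (glyph) animation, a key capability in dynamic visual design. It encodes the transparency (alpha) channel as an RGB-compatible visual signal through latent spatial concatenation, jointly modeling appearance and transparency without modifying the pre-trained generative manifold.
Details
Motivation: Existing methods usually append the alpha channel to the RGB space as an extra latent dimension, which requires rebuilding the underlying RGB-centric variational autoencoder (VAE). High-quality transparent glyph data is scarce, retraining the VAE is computationally expensive, and doing so may erode the robust semantic priors learned from massive RGB corpora, leading to latent pattern mixing. A method is needed that handles transparency without modifying the pre-trained model.
Result: Experiments show that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
Insight: The novelty is the Alpha-as-RGB paradigm: embedding the alpha channel as an RGB-compatible signal through latent spatial concatenation keeps RGB and alpha strictly consistent while avoiding feature entanglement, and it requires no retraining of the pre-trained generative model, preserving its semantic priors.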
Abstract: We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly handle the transparency-encoding (alpha channel) as an extra latent dimension appended to the RGB space, necessitating the reconstruction of the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm to jointly model appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
[122] VideoAtlas: Navigating Long-Form Video in Logarithmic Compute cs.CV | cs.AIPDF
Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain
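TL;DR: This paper proposes VideoAtlas, a task-agnostic, lossless, navigable hierarchical grid environment for representing long video, and builds Video-RLM on top of it to achieve long-video understanding whose compute grows logarithmically with video duration.
Details
Motivation: Existing video-language models face two challenges: representation, where they rely on lossy approximations, and long context, where caption- or agent-based pipelines lose visual fidelity.
Result: On benchmarks ranging from 1 hour to 10 hours, Video-RLM is the most duration-robust method, with the smallest accuracy drop. Compute grows logarithmically with video duration and benefits from a 30-60% multimodal cache hit rate.
Insight: The novelty is a unified, lossless, recursively zoomable hierarchical visual environment (VideoAtlas) combined with Recursive Language Models (RLMs). A parallel Master-Worker architecture explores long video efficiently, with a compute budget that can be bounded and is allocated adaptively.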
Abstract: Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent’s memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid’s structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
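The claim that access depth grows only logarithmically with video length can be illustrated with a small sketch of recursive zooming over a fixed-size grid. The 64-cell grid (for example 8×8), the 10 fps sampling rate, and the function `zoom_path` are assumptions for illustration; the abstract does not give VideoAtlas's grid size or sampling rate.

```python
import math

def zoom_path(frame_idx, num_frames, cells=64):
    """Sequence of grid cells to zoom into, from the whole-video overview
    down to a single frame, when each view is split into `cells` tiles."""
    lo, hi = 0, num_frames                     # current view covers [lo, hi)
    path = []
    while hi - lo > 1:
        span = math.ceil((hi - lo) / cells)    # frames covered by one cell
        cell = (frame_idx - lo) // span
        path.append(cell)
        lo = lo + cell * span
        hi = min(lo + span, hi)
    return path

# Hypothetical sampling rate of 10 frames per second.
one_hour = 10 * 3600
ten_hours = 10 * one_hour
depth_1h = len(zoom_path(0, one_hour))
depth_10h = len(zoom_path(0, ten_hours))
path_small = zoom_path(1000, 64 * 64)
```

Under these assumptions a tenfold longer video costs one extra zoom level (3 versus 4), since depth scales roughly as the base-64 logarithm of the frame count.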
[123] LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition cs.CVPDF
Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu
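TL;DR: This paper proposes LaDe (Layered Media Design), a unified latent diffusion framework for generating and decomposing multi-layered graphic media designs. From a natural language prompt it generates a flexible number of semantically meaningful RGBA layers, and it supports three tasks: text-to-image, text-to-layers, and image decomposition.
Details
Motivation: When generating editable layered design documents (such as posters and flyers), existing methods either fix the number of layers or require each layer to contain only spatially continuous regions, so the layer count grows linearly with design complexity and flexibility suffers. LaDe aims to generate a variable number of semantically meaningful layers.
Result: On the Crello test set, LaDe is compared with Qwen-Image-Layered on text-to-layers and image-to-layers tasks. As judged by GPT-4o mini and Qwen3-VL, LaDe performs better on text-to-layers generation, improving text-to-layer alignment.
Insight: The contributions are: 1) an LLM-based prompt expander that turns a short user intent into structured per-layer descriptions; 2) a Latent Diffusion Transformer with 4D RoPE positional encoding that jointly generates the full design and its RGBA layers; 3) an RGBA VAE decoder with full alpha-channel support. Conditioning on layer samples during training unifies the generation and decomposition tasks.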
Abstract: Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
[124] AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors cs.CVPDF
Aymen Mir, Riza Alp Guler, Xiangjun Tang, Peter Wonka, Gerard Pons-Moll
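TL;DR: AHOY reconstructs complete, animatable 3D Gaussian avatars from monocular video even under heavy occlusion. It uses identity-finetuned diffusion models to generate dense supervision for unobserved body regions, a two-stage architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps, and a decoupling of map pose from LBS pose that absorbs multi-view inconsistencies in the generated data. It achieves state-of-the-art reconstruction quality on YouTube videos and on multi-view capture data with significant occlusion.
Details
Motivation: Existing methods usually assume an unoccluded, fully visible subject, which rules out most real-world footage, where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage is fundamentally hard: large body regions may never be observed, and multi-view supervision per pose is unavailable.
Result: Evaluation on YouTube videos and on multi-view capture data with significant occlusion shows state-of-the-art (SOTA) reconstruction quality.
Insight: The main contributions are: 1) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for unobserved regions; 2) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; 3) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies in the generated data; 4) a head/body split supervision strategy that preserves facial identity.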
Abstract: We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/
[125] AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception cs.CVPDF
Jinho Park, Se Young Chun, Mingoo Seok
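TL;DR: This paper proposes AdaRadar, an adaptive spectral compression method for radar-based perception. It adjusts the compression ratio dynamically and combines the discrete cosine transform, selective pruning, and scaled quantization to shrink feature size by more than 100x while preserving detection performance.
Details
Motivation: High-dimensional raw radar data saturates the low-bandwidth link to the compute engine. Existing image-domain compression methods are unsuitable because they use fixed compression ratios and cannot adapt to varying or adversarial conditions.
Result: Validated on the RADIal, CARRADA, and Radatron datasets, the method achieves more than 100x feature size reduction with a minimal performance drop (about 1 percentage point).
Insight: The contributions include adaptive compression-ratio adjustment driven by a proxy gradient of detection confidence, a zeroth-order gradient approximation that avoids transmitting gradient tensors, DCT plus selective pruning that exploits the concentration of radar feature maps on a few frequency components, and scaled quantization that preserves dynamic range.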
Abstract: Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations (pruning and quantization). This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune the coefficients. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (about 1 percentage point). We validate our results on the RADIal, CARRADA, and Radatron datasets.
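The three compression steps named above (DCT, coefficient pruning, scaled quantization) can be sketched on a single square patch in NumPy. This is an illustration under stated assumptions rather than the AdaRadar codec: top-k magnitude pruning, one scale per patch, 8-bit signed levels, and the function names are all hypothetical, and the adaptive rate feedback is omitted.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ C.T equals the identity."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def compress(patch, keep, bits=8):
    """2D DCT of a square patch, keep the `keep` largest-magnitude
    coefficients, then quantize with a per-patch scale."""
    c = dct_matrix(patch.shape[0])
    coeffs = c @ patch @ c.T
    flat = np.abs(coeffs).ravel()
    mask = np.zeros(flat.size, dtype=bool)
    mask[np.argsort(flat)[-keep:]] = True      # top-k pruning (assumption)
    pruned = np.where(mask.reshape(coeffs.shape), coeffs, 0.0)
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(pruned).max() / levels
    if scale == 0.0:
        scale = 1.0                            # all-zero patch
    q = np.round(pruned / scale).astype(np.int32)
    return q, scale

def decompress(q, scale):
    c = dct_matrix(q.shape[0])
    return c.T @ (q * scale) @ c               # inverse 2D DCT

# Synthetic patch whose energy sits in three frequency components.
n = 8
C = dct_matrix(n)
spectrum = np.zeros((n, n))
spectrum[0, 0], spectrum[1, 2], spectrum[3, 1] = 10.0, -4.0, 2.0
patch = C.T @ spectrum @ C

q, scale = compress(patch, keep=3, bits=8)
recon = decompress(q, scale)
error = float(np.linalg.norm(recon - patch))
```

Storing the scale alongside the integers is what preserves each patch's dynamic range: a patch with small values gets a small scale instead of being rounded to zero on a global grid.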
[126] Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding cs.CVPDF
Shuyao Shi, Kang G. Shin
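TL;DR: This paper proposes Motion-MLLM, a multimodal large language model framework that combines video with egomotion data from inertial measurement units (IMUs) for more efficient and accurate 3D scene understanding and spatial reasoning. It includes a cascaded motion-visual keyframe filtering module and an asymmetric cross-modal fusion module, and reasons about absolute scale and spatial relationships by grounding visual content in physical motion trajectories.
Details
Motivation: For 3D scene understanding, existing MLLMs typically either rely on computationally expensive 3D representations (such as point clouds or reconstructed BEV maps) or lack the physical grounding needed to resolve ambiguities in scale and size. The paper augments MLLMs with IMU egomotion data captured concurrently with the video to address both problems.
Result: In extensive evaluation, Motion-MLLM brings significant improvements on a variety of 3D scene understanding and spatial reasoning tasks. Compared with SOTA methods based on video frames and on explicit 3D data, it reaches similar or higher accuracy with significantly less overhead (1.40× and 1.63× higher cost-effectiveness, respectively).
Insight: The novelty is introducing egomotion as a physically grounded modality, together with two core modules: cascaded keyframe filtering (combining IMU data and visual features for efficient sparse selection) and asymmetric cross-modal fusion (motion tokens act as intermediaries that inject motion cues and cross-frame visual context). This improves reasoning about absolute scale and spatial relationships while lowering compute cost.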
Abstract: Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird’s-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM exhibits similar or even higher accuracy with significantly less overhead (i.e., 1.40× and 1.63× higher cost-effectiveness, respectively).
[127] Versatile Editing of Video Content, Actions, and Dynamics without Training cs.CVPDF
Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel
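TL;DR: This paper proposes DynaEdit, a training-free, general-purpose video editing method that uses pretrained text-to-video flow models to edit video content, actions, and dynamics, overcoming the limits of existing methods on complex edits.
Details
Motivation: Existing trained models struggle with complex edits such as modifying actions or inserting interacting objects, while training-free methods are restricted to structure- and motion-preserving edits and cannot change motion or interactions. A training-free method that supports dynamic modifications is needed.
Result: DynaEdit achieves state-of-the-art (SOTA) results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.
Insight: The method builds on the inversion-free approach and introduces new mechanisms that overcome low-frequency misalignment and high-frequency jitter. The result is model-agnostic, general video editing that handles dynamics and interactions without additional training data.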
Abstract: Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources for these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.
[128] GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes cs.CV | cs.ROPDF
Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang
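TL;DR: This paper proposes GMT (Goal-Conditioned Multimodal Transformer), a multimodal transformer framework for synthesizing controllable 6-DOF object manipulation trajectories in 3D scenes. It generates realistic, goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses.
Details
Motivation: Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for robot interaction, but existing methods mostly rely on 2D or partial 3D representations, which makes it hard to capture full scene geometry and to guarantee trajectory precision.
Result: Extensive experiments on synthetic and real-world benchmarks show that GMT substantially outperforms state-of-the-art human motion and human-object interaction baselines such as CHOIS and GIMO in spatial accuracy and orientation control, reaching SOTA performance.
Insight: The novelty lies in representing trajectories as continuous 6-DOF pose sequences and using a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. This sets a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments.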
Abstract: Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/.
[129] The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering cs.CVPDF
Yigit Ekin, Yossi Gandelsman
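TL;DR: This paper proposes a training-free framework for continuous, controllable image editing at test time for text-conditioned generative models. It applies a simple vector offset in the text-embedding space: a large language model automatically constructs debiased contrastive prompt pairs, from which an edit direction vector is computed, and an elastic range search identifies the effective interval of edit strengths, giving smooth, continuous semantic edits. Because the method modifies only text representations, it generalizes across modalities, including image and video generation.
Details
Motivation: Existing approaches to continuous, controllable image editing with text-conditioned generative models depend on additional training or manual intervention. The goal is lightweight, training-free, smooth edit control.
Result: In the evaluation of continuous editing behavior, the method is comparable to training-based methods and outperforms other training-free methods. A newly proposed evaluation metric confirms its advantage in edit continuity.
Insight: The novelty lies in revealing how surprisingly effective text-embedding interpolation is for continuous image control, and in achieving training-free continuous editing through automatically constructed debiased prompt pairs and elastic range search. Its lightweight design and strong generalization are worth borrowing.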
Abstract: We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator’s text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.
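The steering step described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: embeddings are plain lists of floats, the helper names `steering_vector` and `steer` are hypothetical, the direction is assumed to be the mean difference over the contrastive pairs, and the strengths are fixed by hand rather than found by the paper's elastic range search.

```python
# Illustrative sketch only. Embeddings are plain Python lists of floats;
# steering_vector and steer are hypothetical helper names.

def steering_vector(pos_embs, neg_embs):
    """Mean difference between embeddings of contrastive prompt pairs (assumed form)."""
    n = len(pos_embs)
    dim = len(pos_embs[0])
    return [
        sum(p[i] - q[i] for p, q in zip(pos_embs, neg_embs)) / n
        for i in range(dim)
    ]

def steer(prompt_emb, direction, strength):
    """Add a scaled copy of the steering direction to a prompt embedding."""
    return [x + strength * d for x, d in zip(prompt_emb, direction)]

# Toy 2-D embeddings for two (positive, negative) prompt pairs.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 2.0], [1.0, 2.0]]
v = steering_vector(pos, neg)

# Hand-picked strengths stand in for the interval found by elastic range search.
edits = [steer([0.0, 1.0], v, s) for s in (0.0, 0.5, 1.0)]
```

Sweeping the strength over an interval yields a family of embeddings along one semantic axis, which is the continuous control the abstract describes.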
[130] EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding cs.CVPDF
Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao
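TL;DR: This paper proposes EchoGen, a unified framework for layout-to-image generation and image grounding that optimizes the two tasks jointly through cycle-consistent learning. The framework generates images with accurate layouts and high fidelity to text descriptions while also grounding images robustly.
Details
Motivation: Image grounding's strong understanding of text and layout can compensate for weaknesses in layout-to-image generation, and the content diversity of layout-generated images can make image grounding more robust. Joint training is meant to improve both tasks.
Result: EchoGen achieves state-of-the-art (SOTA) results on layout-to-image generation and image grounding benchmarks and shows clear gains from optimizing the two tasks together.
Insight: The novelty lies in a progressive training strategy (PMTP, DJO, and Cycle RL stages) that exploits task duality and cycle-consistency constraints for unified optimization, with the GRPO strategy reducing reliance on visual supervision and strengthening the model's unified capabilities.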
Abstract: In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model’s unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
[131] Universal Skeleton Understanding via Differentiable Rendering and MLLMs cs.CVPDF
Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma
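TL;DR: This paper proposes SkeletonLLM, which achieves universal skeleton understanding by using differentiable rendering to translate arbitrary skeleton sequences into the native visual modality of multimodal large language models (MLLMs). At its core is DrAction, a format-agnostic differentiable renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide rendering to produce task-informative visual tokens. The paper also introduces a cooperative training strategy: Causal Reasoning Distillation transfers structured reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions.
Details
Motivation: MLLMs excel at vision-language reasoning but cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment or quantize motion into discrete tokens, and both generalize poorly across heterogeneous skeleton formats.
Result: SkeletonLLM generalizes strongly across diverse tasks including recognition, captioning, reasoning, and cross-format transfer, suggesting a viable path for applying MLLMs to non-native modalities.
Insight: The novelty lies in converting skeleton data into the visual modality through a differentiable, format-agnostic renderer so that an MLLM can process it directly. The end-to-end differentiable design lets gradients backpropagate to optimize rendering, and Causal Reasoning Distillation plus Discriminative Finetuning strengthen reasoning. This offers a new approach to handling non-visual structured data.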
Abstract: Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM’s native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer – suggesting a viable path for applying MLLMs to non-native modalities. Code will be released upon acceptance.
[132] Unified Spatio-Temporal Token Scoring for Efficient Video VLMs cs.CV | cs.AI | cs.LGPDF
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna
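TL;DR: This paper proposes Spatio-Temporal Token Scoring (STTS), a lightweight module that prunes vision tokens in a unified way in video vision-language models (VLMs), without text conditioning or token merging, and is compatible with end-to-end training. It learns temporal scoring through an auxiliary loss and spatial scoring from LLM downstream gradients. Combined with an efficient packing algorithm, it prunes 50% of vision tokens across 13 short and long video QA tasks, improving training and inference efficiency by 62% with only a 0.7% drop in average performance.
Details
Motivation: Video VLMs are computationally inefficient. Existing methods either prune tokens only within the vision transformer (ViT), which suits unimodal tasks, or only within the LLM, which requires complex text-conditioned mechanisms. A unified, text-conditioning-free pruning scheme that spans the whole architecture is missing.
Result: On 13 short and long video QA tasks, STTS prunes 50% of vision tokens, improves efficiency by 62%, and loses only 0.7% in average performance. Efficiency gains grow with the number of sampled frames per video, and test-time scaling on long-video QA yields a further 0.5-1% performance gain.
Insight: The novelty lies in a unified spatio-temporal token scoring mechanism (no text conditioning or token merging), score learning that combines an auxiliary loss with LLM gradients, and an efficient packing algorithm. Together they give simple, effective vision token pruning across both the ViT and the LLM, improving efficiency while preserving performance.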
Abstract: Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
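The pruning step itself, once scores exist, amounts to a top-k selection. The sketch below assumes per-token scores are already given (in STTS they are learned via an auxiliary loss and LLM gradients, which is not modeled here), and the function name `prune_tokens` is hypothetical. The paper's packing algorithm is also out of scope.

```python
# Illustrative sketch only: scores are supplied by hand, whereas STTS learns them.
# prune_tokens is a hypothetical helper, not part of any released code.

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k best-scoring tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore temporal/spatial order before passing tokens downstream.
    return [tokens[i] for i in sorted(top)]

tokens = ["t0", "t1", "t2", "t3"]
scores = [0.1, 0.9, 0.4, 0.8]
kept = prune_tokens(tokens, scores)  # keeps half of the tokens
```

Preserving the original order of the surviving tokens matters in this sketch because downstream positional information would otherwise be scrambled.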
q-bio.QM [Back]
[133] Topology-Guided Biomechanical Profiling: A White-Box Framework for Opportunistic Screening of Spinal Instability on Routine CT q-bio.QM | cs.CVPDF
Zanting Ye, Xuanbin Wu, Guoqing Zhong, Shengyuan Liu, Jiashuai Liu
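TL;DR: This paper proposes Topology-Guided Biomechanical Profiling (TGBP), an auditable white-box framework for automated assessment of spinal instability on routine CT scans. The framework decouples anatomical perception from structural reasoning, combines canal-referenced partitioning with morphometric normalization based on covariance-oriented bounding boxes, and integrates radiomic and large language model modules to deliver an end-to-end, interpretable Spinal Instability Neoplastic Score (SINS) evaluation.
Details
Motivation: Routine oncologic CT offers an opportunity to screen for spinal instability, but prophylactic stabilization windows are often missed because the SINS requires complex geometric reasoning. Topological ambiguity caused by metastatic osteolysis hinders automated assessment by standard segmentation and black-box AI, so an interpretable solution is needed.
Result: Validated on a multi-center, multi-cancer cohort (N=482), TGBP reached 90.2% accuracy in 3-tier stability triage. In a blinded reader study (N=30), TGBP significantly outperformed medical oncologists on complex structural features (κ=0.857 vs. 0.570) and on Total Score estimation (κ=0.625 vs. 0.207), making expert-level screening widely accessible.
Insight: The novel points are: 1) a white-box framework that decouples anatomical perception from structural reasoning, improving interpretability and auditability; 2) canal-referenced partitioning to resolve posterolateral boundary ambiguity; 3) covariance-based oriented bounding boxes for context-aware morphometric normalization, used to quantify vertebral collapse; 4) integrated radiomic and LLM modules that strengthen end-to-end assessment. Viewed objectively, the method addresses topological ambiguity through deterministic geometric innovations and offers a new approach to automated, interpretable decision-making in medical imaging.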
Abstract: Routine oncologic computed tomography (CT) presents an ideal opportunity for screening spinal instability, yet prophylactic stabilization windows are frequently missed due to the complex geometric reasoning required by the Spinal Instability Neoplastic Score (SINS). Automating SINS is fundamentally hindered by metastatic osteolysis, which induces topological ambiguity that confounds standard segmentation and black-box AI. We propose Topology-Guided Biomechanical Profiling (TGBP), an auditable white-box framework decoupling anatomical perception from structural reasoning. TGBP anchors SINS assessment on two deterministic geometric innovations: (i) canal-referenced partitioning to resolve posterolateral boundary ambiguity, and (ii) context-aware morphometric normalization via covariance-based oriented bounding boxes (OBB) to quantify vertebral collapse. Integrated with auxiliary radiomic and large language model (LLM) modules, TGBP provides an end-to-end, interpretable SINS evaluation. Validated on a multi-center, multi-cancer cohort ($N=482$), TGBP achieved 90.2% accuracy in 3-tier stability triage. In a blinded reader study ($N=30$), TGBP significantly outperformed medical oncologists on complex structural features ($\kappa=0.857$ vs. $0.570$) and prevented compounding errors in Total Score estimation ($\kappa=0.625$ vs. $0.207$), democratizing expert-level opportunistic screening.
cs.IR [Back]
[134] PJB: A Reasoning-Aware Benchmark for Person-Job Retrieval cs.IR | cs.CLPDF
Guangzhi Wang, Xiaohui Yang, Kai Li, Jiawen He, Kai Yang
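TL;DR: This paper proposes PJB (Person-Job Benchmark), a reasoning-aware retrieval evaluation dataset for person-job matching. It uses complete job descriptions as queries and complete resumes as documents, is built from real recruitment data, and provides diagnostic labels for industry domain and reasoning type, shifting the focus of evaluation from "who scores higher" to "where do systems fail, and why".
Details
Motivation: Existing person-job matching benchmarks lack systematic diagnostic capability and cannot evaluate systems on complex requirements such as skill-transfer inference and job-competency reasoning, while aggregate scores alone can severely mislead optimization decisions.
Result: Diagnostic experiments with dense retrieval on PJB show that performance heterogeneity across industry domains far exceeds the gains from module upgrades for the same model. At the module level, reranking yields stable improvements, whereas query understanding not only fails to help but degrades overall performance when combined with reranking, revealing that the two modules face fundamentally different improvement bottlenecks.
Insight: The novelty lies in building a real-world person-job matching benchmark with diagnostic labels, upgrading evaluation from plain score ranking to capability-map analysis. This gives concrete guidance on where to invest in system optimization, and exposes both the limits of aggregate metrics on complex reasoning tasks and the differing bottlenecks of individual retrieval modules.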
Abstract: As retrieval models converge on generic benchmarks, the pressing question is no longer “who scores higher” but rather “where do systems fail, and why?” Person-job matching is a domain that urgently demands such diagnostic capability – it requires systems not only to verify explicit constraints but also to perform skill-transfer inference and job-competency reasoning, yet existing benchmarks provide no systematic diagnostic support for this task. We introduce PJB (Person-Job Benchmark), a reasoning-aware retrieval evaluation dataset that uses complete job descriptions as queries and complete resumes as documents, defines relevance through job-competency judgment, is grounded in real-world recruitment data spanning six industry domains and nearly 200,000 resumes, and upgrades evaluation from “who scores higher” to “where do systems differ, and why” through domain-family and reasoning-type diagnostic labels. Diagnostic experiments using dense retrieval reveal that performance heterogeneity across industry domains far exceeds the gains from module upgrades for the same model, indicating that aggregate scores alone can severely mislead optimization decisions. At the module level, reranking yields stable improvements while query understanding not only fails to help but actually degrades overall performance when combined with reranking – the two modules face fundamentally different improvement bottlenecks. The value of PJB lies not in yet another leaderboard of average scores, but in providing recruitment retrieval systems with a capability map that pinpoints where to invest.
[135] From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation cs.IR | cs.CLPDF
Pujun Zheng, Jiacheng Yao, Jinquan Zheng, Chenyang Gu, Guoxiu He
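TL;DR: This paper proposes CNPE, an LLM-based paper evaluation framework that moves from isolated scoring to collaborative ranking. A graph-based similarity ranking algorithm constructs more informative paper pairs, supervised fine-tuning and comparison-based reinforcement learning strengthen relative quality judgment, and pairwise comparisons are aggregated into a global relative quality ranking.
Details
Motivation: Existing LLM-based paper evaluation relies on absolute scores, but score scales vary across conferences, time periods, and criteria, so models tend to fit context-specific rules rather than develop robust scholarly judgment. A more robust collaborative ranking approach is therefore needed.
Result: Experiments show that CNPE achieves an average relative improvement of 21.8% over the strong baseline DeepReview-14B and generalizes robustly to five unseen datasets.
Insight: The novelty lies in integrating comparison into both data construction and model learning, using a graph algorithm to improve paper-pair sampling and combining supervised and reinforcement learning to sharpen relative judgment. Viewed objectively, ranking instead of scoring sidesteps the bias of absolute standards and makes evaluation more general and interpretable.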
Abstract: Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores are prone to fitting narrow, context-specific rules rather than developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design a Comparison-Native framework for Paper Evaluation (CNPE), integrating comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm to facilitate the sampling of more informative and discriminative paper pairs from a collection. We then enhance relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking. Experimental results demonstrate that our framework achieves an average relative improvement of 21.8% over the strong baseline DeepReview-14B, while exhibiting robust generalization to five previously unseen datasets. Code: https://github.com/ECNU-Text-Computing/ComparisonReview.
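The inference step above, aggregating pairwise preferences into a global ranking, can be sketched as follows. The abstract does not say which aggregation rule CNPE uses, so this example assumes a simple win-rate rule. The function name `rank_by_win_rate` and the toy comparisons are hypothetical.

```python
# Illustrative sketch only: win-rate aggregation is an assumed stand-in for
# whatever aggregation rule CNPE actually uses.
from collections import defaultdict

def rank_by_win_rate(comparisons):
    """Rank items from (winner, loser) pairs by fraction of comparisons won."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Higher win rate first; ties broken alphabetically for determinism.
    return sorted(games, key=lambda p: (-wins[p] / games[p], p))

# Toy pairwise judgments between three papers.
comparisons = [("A", "B"), ("A", "C"), ("B", "C")]
ranking = rank_by_win_rate(comparisons)
```

Because only relative outcomes enter the computation, the result does not depend on any venue-specific score scale, which is the motivation the abstract gives for ranking over scoring.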
cs.GR [Back]
[136] DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies cs.GR | cs.CV | cs.HCPDF
Haocheng Yuan, Adrien Bousseau, Hao Pan, Lei Zhong, Changjian Li
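TL;DR: DancingBox is a lightweight, vision-based motion capture system. It treats everyday objects as physical proxies, captures their approximate motion with a single webcam, and refines these coarse proxy motions into realistic 3D character animation using a generative motion model conditioned on bounding-box representations, together with human motion priors learned from large-scale datasets.
Details
Motivation: Professional 3D character animation requires expensive equipment or expert skills. The aim is to remove this barrier so that novices can create animation through intuitive digital puppetry.
Result: A user study shows that the system supports intuitive and creative character animation with diverse proxies, from plush toys to bananas, lowering the barrier to entry for novice animators.
Insight: The core novelty is reimagining motion capture as digital puppetry, driving animation with the coarse proxy motion of everyday objects rather than precise human motion. Methodologically, the system synthesizes training pairs by converting existing motion capture sequences into proxy representations, which addresses the lack of paired data, and uses bounding-box representations and pretrained human motion priors to improve the quality and realism of the generated animation.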
Abstract: Creating compelling 3D character animations typically requires either expert use of professional software or expensive motion capture systems operated by skilled actors. We present DancingBox, a lightweight, vision-based system that makes motion capture accessible to novices by reimagining the process as digital puppetry. Instead of tracking precise human motions, DancingBox captures the approximate movements of everyday objects manipulated by users with a single webcam. These coarse proxy motions are then refined into realistic character animations by conditioning a generative motion model on bounding-box representations, enriched with human motion priors learned from large-scale datasets. To overcome the lack of paired proxy-animation data, we synthesize training pairs by converting existing motion capture sequences into proxy representations. A user study demonstrates that DancingBox enables intuitive and creative character animation using diverse proxies, from plush toys to bananas, lowering the barrier to entry for novice animators.
eess.IV [Back]
[137] A Lensless Polarization Camera eess.IV | cs.CV | physics.opticsPDF
Noa Kraicer, Shay Elmalem, Erez Yosef, Hani Barhum, Raja Giryes
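TL;DR: This paper proposes a compact lensless polarization camera. An optical design combining a diffuser with a striped polarization mask, together with a reconstruction algorithm that explicitly models the polarization-encoded measurements, recovers four linear polarization images from a single snapshot.
Details
Motivation: Existing polarization cameras use spatial or temporal multiplexing, which increases camera volume, weight, or cost. The paper aims to build a compact polarization imaging system using lensless imaging approaches such as DiffuserCam.
Result: Experiments show that the method reconstructs four linear polarization images from a single snapshot and reveal the physical factors that affect reconstruction quality, providing guidance for developing high-quality practical systems.
Insight: The novelty lies in combining lensless imaging with polarization encoding, using simple optical elements and a computational reconstruction algorithm to build a compact polarization camera.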
Abstract: Polarization imaging is a technique that creates a pixel map of the polarization state in a scene. Although invisible to the human eye, polarization can assist various sensing and computer vision tasks. Existing polarization cameras use spatial or temporal multiplexing, which increases the camera volume, weight, cost, or all of the above. Recent lensless imaging approaches, such as DiffuserCam, have demonstrated that compact imaging systems can be realized by replacing the lens with a coding element and performing computational reconstruction. In this work, we propose a compact lensless polarization camera composed of a diffuser and a simple striped polarization mask. By combining this optical design with a reconstruction algorithm that explicitly models the polarization-encoded lensless measurements, four linear polarization images are recovered from a single snapshot. Our results demonstrate the potential of lensless approaches for polarization imaging and reveal the physical factors that govern reconstruction quality, guiding the development of high-quality practical systems.
[138] Structured SIR: Efficient and Expressive Importance-Weighted Inference for High-Dimensional Image Registration eess.IV | cs.CV | cs.LGPDF
Ivor J. A. Simpson, Neill D. F. Campbell
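TL;DR: This paper proposes Structured SIR, an efficient and expressive probabilistic inference method for uncertainty quantification in high-dimensional image registration. It combines a novel memory-efficient high-dimensional covariance parameterisation (the sum of a low-rank covariance and a sparse, spatially structured Cholesky precision factor) with a Sampled Importance Resampling algorithm, capturing complex spatial correlations and producing high-quality samples from multi-modal posteriors.
Details
Motivation: Image registration is an ill-posed dense vision task in which multiple solutions achieve similar loss values, so probabilistic inference is needed. Existing variational inference methods make restrictive assumptions about the posterior form, which can lead to poor characterisation, overconfidence, and low-quality samples, while more flexible posterior models are limited by the computational complexity of high-dimensional covariance matrices.
Result: Evaluated on dense registration of 3D brain MRI data (a very high-dimensional problem), the method produces uncertainty estimates that are significantly better calibrated than those of variational methods while reaching equivalent or better registration accuracy. The model yields highly structured multi-modal posterior distributions, enabling effective and efficient uncertainty quantification.
Insight: The core novelty is a memory- and compute-efficient high-dimensional covariance parameterisation (low-rank plus sparse, spatially structured precision), which captures complex spatial correlations and supports expressive multi-modal uncertainty characterisation while remaining computationally tractable. This offers a new approach to probabilistic inference for high-dimensional dense vision tasks.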
Abstract: Image registration is an ill-posed dense vision task, where multiple solutions achieve similar loss values, motivating probabilistic inference. Variational inference has previously been employed to capture these distributions; however, restrictive assumptions about the posterior form can lead to poor characterisation, overconfidence and low-quality samples. More flexible posteriors are typically bottlenecked by the complexity of high-dimensional covariance matrices required for dense 3D image registration. In this work, we present a memory and computationally efficient inference method, Structured SIR, that enables expressive, multi-modal characterisation of uncertainty with high-quality samples. We propose the use of a Sampled Importance Resampling (SIR) algorithm with a novel memory-efficient high-dimensional covariance parameterisation as the sum of a low-rank covariance and a sparse, spatially structured Cholesky precision factor. This structure enables capturing complex spatial correlations while remaining computationally tractable. We evaluate the efficacy of this approach in 3D dense image registration of brain MRI data, which is a very high-dimensional problem. We demonstrate that our proposed method produces uncertainty estimates that are significantly better calibrated than those produced by variational methods, achieving equivalent or better accuracy. Crucially, we show that the model yields highly structured multi-modal posterior distributions, enabling effective and efficient uncertainty quantification.
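The generic importance-resampling loop behind SIR can be sketched in one dimension. This is only the textbook procedure under toy assumptions: the target and proposal are hand-written 1-D Gaussians, the function name `sir` is hypothetical, and the paper's structured covariance proposal (low-rank plus sparse Cholesky precision) is not modeled.

```python
# Illustrative sketch only: a 1-D toy, not the paper's structured proposal.
import math
import random

def sir(log_target, sample_proposal, log_proposal, n_proposals, n_resample, rng):
    """Draw from a proposal, weight by target/proposal, then resample by weight."""
    xs = [sample_proposal(rng) for _ in range(n_proposals)]
    log_w = [log_target(x) - log_proposal(x) for x in xs]
    # Subtract the max log-weight for numerical stability before exponentiating.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    return rng.choices(xs, weights=w, k=n_resample)

# Toy setup (assumed): unnormalized target N(2, 0.5^2), broad proposal N(0, 3^2).
log_target = lambda x: -0.5 * ((x - 2.0) / 0.5) ** 2
log_proposal = lambda x: -0.5 * (x / 3.0) ** 2
sample_proposal = lambda rng: rng.gauss(0.0, 3.0)

rng = random.Random(0)
samples = sir(log_target, sample_proposal, log_proposal, 20000, 5000, rng)
mean_est = sum(samples) / len(samples)  # should land close to the target mean of 2
```

Normalizing constants of both densities cancel in the resampling weights, which is why unnormalized log-densities suffice here. The quality of the result depends on how well the proposal covers the target, which is the role the structured covariance plays in the paper.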
eess.AS [Back]
[139] Multi-Source Evidence Fusion for Audio Question Answering eess.AS | cs.CLPDF
Aivo Olev, Tanel Alumäe
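TL;DR: This paper describes the TalTech team's solution for the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, a system that improves reasoning quality in audio question answering through multi-source evidence fusion. Two large audio language models generate independent observations, and a text-only reasoning model cross-checks them against the outputs of 25 acoustic tools organized into reliability tiers, producing dense, verifiable reasoning chains. The system ranked first in the challenge, leading all competitors by a wide margin on the reasoning quality metric.
Details
Motivation: The reasoning of large audio language models in audio question answering is opaque and hard to verify. The aim is to improve the factual accuracy, logical soundness, and completeness of reasoning chains.
Result: The system ranked first in the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, outperforming all competing systems by a wide margin on the reasoning quality metric.
Insight: The novelty lies in a multi-source ensemble pipeline that fuses the independent observations of two LALMs with evidence from reliability-tiered acoustic tools, cross-checked by a text-only model, so that every reasoning step rests on explicit, reliability-tagged evidence and the resulting reasoning chains are dense and verifiable. Viewed objectively, evidence tiering and cross-checking improve the transparency and reliability of the reasoning process and offer a practical framework for interpretable audio reasoning.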
Abstract: Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech’s solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin on the challenge’s reasoning quality metric.
[140] The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning eess.AS | cs.CLPDF
Donghang Wu, Tianyu Zhang, Yuxin Li, Hexin Liu, Chen Chen
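TL;DR: This paper proposes FLAIR, a novel full-duplex latent and internal reasoning method that mimics the human cognitive process of thinking while listening during conversation. It recursively updates latent embeddings to perform continuous reasoning in step with speech perception, with no post-hoc generation, so it preserves causality while avoiding additional latency.
Details
Motivation: Inspired by the concurrent internal cognitive processing that humans perform while listening to a speaker, the paper addresses the fact that conventional "thinking" mechanisms in NLP require post-hoc generation, and aims to make reasoning better suited to the real-time demands of spoken dialogue systems.
Result: Experiments show competitive results on a range of speech benchmarks, along with robust handling of conversational dynamics and competitive performance on full-duplex interaction metrics.
Insight: The novelty lies in running latent reasoning in step with speech perception and using an Evidence Lower Bound-based objective for efficient supervised finetuning without explicit reasoning annotations. Viewed objectively, the recursive latent-embedding update offers a low-latency approach to cognitive modeling for real-time dialogue systems.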
Abstract: During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional “thinking” mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user’s speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
cs.AI [Back]
[141] How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment cs.AI | cs.CLPDF
Rebecca Ansell, Autumn Toney-Wails
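TL;DR: This paper implements a text-based, multi-agent version of the classic board game Clue as a rule-based testbed for evaluating the multi-step deductive reasoning of large language models. The study uses six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent reasoning over a full game. The study also finds that fine-tuning on structured logic puzzles does not reliably improve gameplay and sometimes increases reasoning volume without improving reasoning precision.
Details
Motivation: To evaluate large language models on multi-step deductive reasoning, particularly in a complex game environment that demands long-horizon, consistent logical reasoning, and to examine the difficulties LLM agents face in whodunit-style deduction.
Result: Across 18 simulated games played under Clue rules, the six LLM agents achieve only four correct wins. Fine-tuning on structured logic puzzles does not reliably improve game performance and in some cases increases reasoning volume without improving precision.
Insight: The novelty lies in building a rule-controlled text environment based on a classic board game for systematically evaluating multi-step deductive reasoning in LLMs. Viewed objectively, current LLMs still show clear weaknesses on complex reasoning tasks that require sustained logical consistency, and fine-tuning on simpler logic tasks may not transfer to more complex settings, which underlines both the limits of LLM reasoning and the need for such evaluation methods.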
Abstract: Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.
[142] Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations cs.AI | cs.CL | cs.LGPDF
Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen
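TL;DR: This paper proposes CRAFT, a red-teaming alignment framework that uses model reasoning capabilities and hidden representations to make large language models more robust to jailbreak attacks. By combining contrastive representation learning with reinforcement learning and optimizing objectives over the hidden state space, it leads the model to generate safety-aware reasoning traces, achieving safety alignment at the reasoning level.
Details
Motivation: Existing defenses operate mainly at the output level. CRAFT instead uses the model's reasoning capabilities and hidden representations, optimizing objectives in the hidden state space to improve robustness to jailbreak attacks more fundamentally and to address safety within the reasoning process.
Result: On multiple safety benchmarks, evaluated with two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, CRAFT consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Relative to the base models, it improves reasoning safety by 79.0% and final-response safety by 87.7% on average.
Insight: The paper's claimed novelty is combining contrastive representation learning with reinforcement learning to perform alignment in the hidden state space, separating safe from unsafe reasoning trajectories so that the latent-space geometry supports robust, reasoning-level safety alignment. Viewed objectively, the core contribution is moving the safety alignment objective from the conventional output level into the model's internal reasoning representations, supported by theoretical analysis (for example, incorporating latent-textual consistency into GRPO to rule out superficially aligned policies).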
Abstract: We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.
[143] InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning cs.AI | cs.CLPDF
Chengwei Wei, Jung-jae Kim, Longyin Zhang, Shengkai Chen, Nancy F. Chen
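TL;DR: This paper proposes InfoDensity, a reward framework that measures reasoning quality by combining an AUC-based reward with a monotonicity reward and adds a length scaling term to encourage conciseness. It targets the verbose, redundant traces that large language models produce during extended reasoning, cutting computational cost substantially while preserving accuracy.
Details
Motivation: Existing reinforcement learning methods optimize only final response length and ignore the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. The authors argue that verbosity is a symptom of poor intermediate reasoning quality, so a method that rewards information-dense reasoning traces is needed.
Result: On mathematical reasoning benchmarks, InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving a strong accuracy-efficiency trade-off.
Insight: The novelty lies in defining reasoning quality as information density, meaning that each step contributes meaningful entropy reduction in the answer distribution. An empirical study of conditional entropy finds that high-quality reasoning traces show low uncertainty convergence and monotonic progress. Viewed objectively, the framework unifies quality measurement of the reasoning process with efficiency optimization and offers a new perspective on training efficient reasoning models.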
Abstract: Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the conditional entropy of the answer distribution across reasoning steps. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and monotonic progress. These findings suggest that high-quality reasoning traces are informationally dense, that is, each step contributes meaningful entropy reduction relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that combines an AUC-based reward and a monotonicity reward as a unified measure of reasoning quality, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical reasoning benchmarks demonstrate that InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving a strong accuracy-efficiency trade-off.
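The reward structure named in the abstract (an AUC-based term, a monotonicity term, and a length scaling weight) can be made concrete with a toy computation. The abstract gives no formulas, so every definition below is an assumption chosen for illustration: the AUC term is one minus the mean normalized entropy, the monotonicity term is the fraction of non-increasing steps, the two are averaged with equal weights, and the length scale is `min(1, ref_len / n_tokens)`. The function name `info_density_reward` is hypothetical.

```python
# Illustrative sketch only: all formulas here are assumed stand-ins, since the
# abstract does not define the AUC, monotonicity, or length-scaling terms.

def info_density_reward(entropies, n_tokens, ref_len):
    """Toy reward from per-step answer entropies and trace length."""
    h0 = entropies[0]
    # AUC-style term: small area under the normalized entropy curve is good.
    auc = 1.0 - sum(h / h0 for h in entropies) / len(entropies)
    # Monotonicity term: fraction of steps where entropy does not increase.
    pairs = list(zip(entropies, entropies[1:]))
    mono = sum(1 for a, b in pairs if b <= a) / len(pairs)
    # Length scaling: traces longer than the reference length are discounted.
    length_scale = min(1.0, ref_len / n_tokens)
    return length_scale * (0.5 * auc + 0.5 * mono)

# Entropy of the answer distribution after each reasoning step (toy values).
entropies = [1.0, 0.5, 0.25, 0.0]
reward = info_density_reward(entropies, n_tokens=200, ref_len=100)
```

Under these assumed definitions, a trace that reduces entropy steadily and quickly scores higher than one of the same length that stalls or backtracks, and two traces of equal quality are separated by the length term.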
eess.SP [Back]
[144] NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning eess.SP | cs.CL | cs.LG | q-bio.NC
Guoan Wang, Shihao Yang, Jun-en Ding, Hao Zhu, Feng Liu
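TL;DR: This paper proposes NeuroNarrator, the first generalist EEG-to-text foundation model, which translates EEG segments into precise clinical narratives. At its core is NeuroCorpus-160K, the first large-scale aligned dataset, pairing over 160,000 EEG segments with structured clinical descriptions. The model aligns temporal waveforms with spatial topographic maps through contrastive learning to build spectro-spatially grounded representations, and uses a state-space-inspired architecture to integrate historical temporal and spectral context, conditioning a large language model to generate coherent clinical narratives.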
Details
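Motivation: Existing EEG analysis methods are largely confined to task-specific classification or coarse-grained pattern recognition and offer limited support for clinically meaningful interpretation. This work aims to bridge the gap between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that supports clinical reporting workflows and expert interpretation.
Result: Extensive evaluations across multiple benchmarks and zero-shot transfer tasks show that NeuroNarrator effectively integrates temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.
Insight: The contributions are: 1) the first large-scale aligned EEG-text dataset; 2) spectro-spatial grounding that aligns temporal signals with spatial topographic maps through a rigorous contrastive objective; 3) a state-space-inspired formulation that integrates historical temporal and spectral context to condition a large language model for coherent clinical narrative generation. Together these provide a principled bridge between continuous signals and discrete language.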
Abstract: Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator’s capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.
cs.LG [Back]
[145] MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning cs.LG | cs.AI | cs.CL
Hongjun Wang, Wei Liu, Weibo Gu, Xing Sun, Kai Han
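TL;DR: This paper proposes MHPO (Modulated Hazard-aware Policy Optimization), a framework that addresses training instability in GRPO-based reinforcement learning. A Log-Fidelity Modulator (LFM) maps unbounded importance ratios into a bounded, differentiable domain, and a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to regulate positive and negative policy shifts independently, yielding fine-grained, asymmetric policy regulation within a stabilized trust region.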
Details
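Motivation: Existing ratio control methods such as hard clipping have non-differentiable boundaries and vanishing-gradient regions, so they fail to maintain gradient fidelity. They also lack a hazard-aware mechanism for adaptively suppressing extreme deviations, which leaves optimization vulnerable to abrupt policy shifts. MHPO aims to resolve these issues and deliver robust, stable reinforcement learning.
Result: Extensive evaluations on diverse reasoning benchmarks spanning text-based and vision-language tasks show that MHPO consistently outperforms existing methods, achieving better performance while significantly improving training stability.
Insight: The contributions are the LFM, which ensures gradient stability and prevents high-variance outliers from destabilizing the loss landscape, and the DHP, which uses survival-analysis concepts to regulate asymmetric policy shifts independently. Together they mitigate mode collapse from over-expansion and policy erosion from catastrophic contraction, enabling fine-grained optimization within a stabilized trust region.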
Abstract: Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts, simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.
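The abstract does not give the functional forms of the LFM or the DHP, so the following is only a rough sketch of the two ideas under stated assumptions. A tanh of the log-ratio stands in for "a bounded, differentiable map of an unbounded ratio", and the Weibull cumulative hazard H(x) = (x / scale)^shape, a standard survival-analysis form, stands in for the hazard penalty. The function names, the tanh choice, and the per-side (scale, shape) values are hypothetical.

```python
import math

def log_fidelity_modulator(ratio, bound=1.0):
    """Assumed stand-in for the LFM: squash the log importance ratio into
    (-bound, bound). It is smooth everywhere and close to log(ratio) near
    ratio = 1, unlike hard clipping."""
    return bound * math.tanh(math.log(ratio) / bound)

def weibull_cumulative_hazard(x, scale, shape):
    """Weibull cumulative hazard H(x) = (x / scale) ** shape for x >= 0."""
    return (x / scale) ** shape

def decoupled_hazard_penalty(shift, pos=(0.5, 2.0), neg=(0.3, 2.0)):
    """Assumed stand-in for the DHP: penalize positive and negative policy
    shifts with separately parameterized hazards. The (scale, shape) pairs
    are illustrative values, not taken from the paper."""
    scale, shape = pos if shift >= 0 else neg
    return weibull_cumulative_hazard(abs(shift), scale, shape)
```

With these illustrative parameters a contraction of a given size is penalized more heavily than an expansion of the same size, which is one way to express the asymmetric regulation the abstract describes.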
[146] Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing cs.LG | cs.AI | cs.CL
Parsa Mirtaheri, Mikhail Belkin
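TL;DR: This paper studies "motivated reasoning" in chain-of-thought (CoT) generation by large language models (LLMs): the model produces a plausible-looking explanation that rationalizes an answer influenced by an external hint, without acknowledging the hint's influence. By analyzing internal activations with supervised probes, the authors find that internal representations, both before and after CoT generation, detect motivated reasoning more reliably than monitoring the CoT text alone.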
Details
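Motivation: LLM-generated chains of thought may not reflect the model's actual reasoning. In motivated reasoning, the model rationalizes an answer influenced by an external hint and produces an unfaithful explanation.
Result: Experiments across multiple LLM families and datasets show that supervised probes trained on the residual stream, both pre-generation and post-generation, outperform CoT-text monitors at detecting motivated reasoning, with post-generation probes performing best.
Insight: The contribution is using the model's internal activations, rather than its output text, to detect motivated reasoning early and reliably. The takeaway is that probes on internal representations can serve as a more effective tool for monitoring model behavior, helping avoid unnecessary generation and improving model transparency.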
Abstract: Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets, demonstrating that motivated reasoning can be identified by probing internal activations even in cases when it cannot be easily determined from CoT. Using supervised probes trained on the model’s residual stream, we show that (i) pre-generation probes, applied before any CoT tokens are generated, predict motivated reasoning as well as an LLM-based CoT monitor that accesses the full CoT trace, and (ii) post-generation probes, applied after CoT generation, outperform the same monitor. Together, these results show that motivated reasoning is detected more reliably from internal representations than from CoT monitoring. Moreover, pre-generation probing can flag motivated behavior early, potentially avoiding unnecessary generation.
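To make "supervised probe on the residual stream" concrete, here is a minimal sketch of a linear (logistic-regression) probe trained by gradient descent. The abstract does not specify the probe architecture, so the linear form is an assumption. The activations are synthetic stand-ins produced by a hypothetical `make_synthetic_activations` helper; in a real setup each row would be a residual-stream vector extracted from the model, labeled by whether the response showed motivated reasoning.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_synthetic_activations(n_per_class=100, dim=4, seed=0):
    """Hypothetical stand-in for residual-stream vectors: the two classes
    differ only by a shift along dimension 0."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        shift = 1.0 if label == 1 else -1.0
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 0.5) for _ in range(dim)]
            x[0] += shift
            X.append(x)
            y.append(label)
    return X, y

def train_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic-regression weights with full-batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of examples whose thresholded probe output matches the label."""
    hits = 0
    for x, t in zip(X, y):
        pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5
        hits += int(pred == bool(t))
    return hits / len(X)
```

A pre-generation probe would be fit on activations taken before any CoT token is produced and a post-generation probe on activations taken afterwards; the training code is the same in both cases.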
[147] Complementary Reinforcement Learning cs.LG | cs.CL
Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong
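TL;DR: This paper proposes Complementary Reinforcement Learning (Complementary RL). Modeled on complementary learning systems in neuroscience, it co-evolves an experience extractor and a policy actor within the RL optimization loop, addressing the low sample efficiency that arises in conventional RL when stored experience falls out of step with the agent's capability.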
Details
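Motivation: Conventional RL for LLM-based agents is limited by sparse reward feedback and by the inability to leverage prior experience across episodes. Existing experience-based methods either store experience statically or fail to keep it in sync with the agent's evolving capability, so the experience becomes less useful later in training.
Result: In single-task scenarios, Complementary RL achieves a 10% performance improvement over baselines that do not learn from experience; it also shows robust scalability in multi-task settings.
Insight: The core contribution is a dynamically optimized experience extractor whose objective depends directly on whether the experience it distills demonstrably contributes to the policy actor's success. This keeps experience management in step with the agent's capability and offers a new paradigm for efficient experience-driven agent learning.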
Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent’s inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to co-evolve with the improving actor, causing a progressive misalignment between the experience and the actor’s evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor’s success, thereby evolving its experience management strategy in lockstep with the actor’s growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.
[148] Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models cs.LG | cs.AI | cs.CV
Abinav Rao, Sujan Rachuri
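TL;DR: This paper presents the first systematic study of whether DPO can align understanding and generation simultaneously in unified multimodal models. It finds that generation quality resists DPO alignment in architectures based on VQ tokenization, and gradient analysis identifies near-orthogonality and magnitude imbalance between understanding and generation gradients as the main interference mechanism.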
Details
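Motivation: To investigate whether understanding and generation, which share a language model backbone in unified multimodal models, can be aligned simultaneously with DPO, and to address potential conflicts in multi-task alignment.
Result: On Janus-Pro at 1B and 7B parameters, no DPO training strategy improves generation CLIPScore (no significant change at 7B; degraded generation quality at 1B), and the result holds across preference data types and data volumes. Understanding shows only directionally positive changes that are not statistically significant.
Insight: The study shows that in unified models based on VQ tokenization, the near-orthogonality of understanding and generation gradients and their magnitude imbalance (driven by VQ token count asymmetry) are the main bottleneck for multi-task DPO alignment, giving practitioners working with VQ-based models a useful reference on these structural limitations.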
Abstract: Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck – supported by the generation DPO loss converging to ln(2) – and provide practical guidance for practitioners working with VQ-based unified models.
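The ln(2) observation follows from the form of the standard DPO objective: the per-pair loss is -log sigmoid(beta * margin), where the margin is the difference between the policy-versus-reference log-ratio of the chosen and rejected responses. When that margin is zero, meaning the policy separates the pair no better than the reference does, the loss is -log(0.5) = ln 2. The sketch below evaluates the standard loss for scalar log-probabilities, with a cosine-similarity helper of the kind used to compare gradient directions; the helper names and the default `beta` are illustrative.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A loss that stays at ln 2 ≈ 0.693 throughout training therefore indicates that the implicit reward margin is not moving away from zero, which is consistent with the abstract's reading of it as evidence of a bottleneck.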
cs.RO [Back]
[149] SLAM Adversarial Lab: An Extensible Framework for Visual SLAM Robustness Evaluation under Adverse Conditions cs.RO | cs.CV
Mohamed Hefny, Karthik Dantu, Steven Y. Ko
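TL;DR: This paper presents SAL (SLAM Adversarial Lab), a modular framework for evaluating the robustness of visual SLAM systems under adverse conditions such as fog and rain. SAL models each adverse condition as a perturbation that transforms an existing dataset into an adversarial one, and lets perturbation severity be specified in easily interpretable real-world units such as meters of visibility. Its extensible architecture decouples datasets, perturbations, and SLAM algorithms through common interfaces so users can add new components. SAL also includes a search procedure that finds the severity level at which a SLAM system fails.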
Details
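Motivation: To provide a systematic, extensible framework for evaluating the robustness of visual SLAM algorithms under a range of real-world adverse conditions (bad weather, camera defects, video transport problems), addressing the lack of standardization and flexibility in existing evaluation methods.
Result: To showcase SAL, the evaluation integrates seven SLAM algorithms and tests them on three datasets under weather, camera, and video transport perturbations. The reported outcome is that the framework runs successfully and finds the perturbation threshold at which each SLAM algorithm fails; the abstract does not mention state-of-the-art comparisons on specific benchmarks (such as KITTI) or quantitative performance gains.
Insight: The claimed contribution is a modular, extensible evaluation framework that abstracts adversarial conditions as perturbations with configurable severity and achieves flexibility through a decoupled design with common interfaces. Viewed objectively, its standardized severity definitions (for example, based on visibility) and its automatic failure-threshold search are valuable for systematic study of SLAM fragility and robustness.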
Abstract: We present SAL (SLAM Adversarial Lab), a modular framework for evaluating visual SLAM systems under adversarial conditions such as fog and rain. SAL represents each adversarial condition as a perturbation that transforms an existing dataset into an adversarial dataset. When transforming a dataset, SAL supports severity levels using easily-interpretable real-world units such as meters for fog visibility. SAL’s extensible architecture decouples datasets, perturbations, and SLAM algorithms through common interfaces, so users can add new components without rewriting integration code. Moreover, SAL includes a search procedure that finds the severity level of a perturbation at which a SLAM system fails. To showcase the capabilities of SAL, our evaluation integrates seven SLAM algorithms and evaluates them across three datasets under weather, camera, and video transport perturbations.
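The abstract does not describe how the failure-severity search works. One simple way such a search could be done is bisection over the severity parameter, sketched here for fog visibility in meters. This assumes failure is monotone in severity (if the system fails at some visibility, it also fails at every lower visibility). The callback `fails_at` is hypothetical and stands in for "apply the perturbation at this severity, run the SLAM system, and check a failure criterion"; it is not part of SAL's actual interface.

```python
def find_failure_visibility(fails_at, dense_m, clear_m, tol_m=1.0):
    """Bisect for the largest fog visibility (meters) at which SLAM still fails.

    fails_at(visibility_m) -> bool is a hypothetical callback. The search
    assumes the system fails at dense_m, succeeds at clear_m, and that
    failure is monotone in visibility.
    """
    if not fails_at(dense_m):
        raise ValueError("system does not fail at the densest fog tested")
    if fails_at(clear_m):
        raise ValueError("system already fails at the clearest setting")

    lo, hi = dense_m, clear_m  # invariant: fails at lo, succeeds at hi
    while hi - lo > tol_m:
        mid = (lo + hi) / 2
        if fails_at(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each evaluation of `fails_at` is a full SLAM run, so bisection's logarithmic number of evaluations matters in practice: narrowing a 5 m to 200 m range to within 1 m takes eight runs after the two endpoint checks.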
[150] VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs cs.RO | cs.CV
Chaokang Jiang, Desen Zhou, Jiuming Liu, Kevin Li Sun
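TL;DR: VectorWorld is a streaming world model for closed-loop evaluation of autonomous driving that generates scenes efficiently via diffusion flow on vector graphs. A motion-aware gated VAE produces an interaction state compatible with history-conditioned policies, an edge-gated relational DiT performs solver-free one-step masked completion for real-time outpainting, and ΔSim, a physics-aligned non-ego vehicle policy, stabilizes long-horizon rollouts.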
Details
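Motivation: To address three problems of existing generative world models in closed-loop evaluation: history-free initialization that mismatches policy inputs, multi-step sampling latency that violates real-time budgets, and kinematic infeasibility that compounds over long horizons.
Result: On the Waymo Open Motion and nuPlan benchmarks, VectorWorld improves map-structure fidelity and initialization validity, and supports stable, real-time closed-loop rollouts of more than 1 km.
Insight: The contributions are policy-compatible initialization via a motion-aware gated VAE, real-time outpainting via an edge-gated relational DiT with JVP-based large-step supervision, and ΔSim, a physics-aligned NPC policy with hybrid discrete-continuous actions that stabilizes long-horizon generation.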
Abstract: Closed-loop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy inputs, (ii) multi-step sampling latency that violates real-time budgets, and (iii) compounding kinematic infeasibility over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego-centric $64 \mathrm{m}\times 64\mathrm{m}$ lane–agent vector-graph tiles during rollout. VectorWorld aligns initialization with history-conditioned policies by producing a policy-compatible interaction state via a motion-aware gated VAE. It enables real-time outpainting via solver-free one-step masked completion with an edge-gated relational DiT trained with interval-conditioned MeanFlow and JVP-based large-step supervision. To stabilize long-horizon rollouts, we introduce $Δ$Sim, a physics-aligned non-ego (NPC) policy with hybrid discrete–continuous actions and differentiable kinematic logit shaping. On Waymo Open Motion and nuPlan, VectorWorld improves map-structure fidelity and initialization validity, and supports stable, real-time $1\mathrm{km}+$ closed-loop rollouts (code: https://github.com/jiangchaokang/VectorWorld).
[151] AERR-Nav: Adaptive Exploration-Recovery-Reminiscing Strategy for Zero-Shot Object Navigation cs.RO | cs.CV
Jingzhi Huang, Junkai Huang, Haoyang Yang, Haoang Li, Yi Wang
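TL;DR: This paper proposes AERR-Nav, an adaptive framework for Zero-Shot Object Navigation (ZSON) that addresses the imbalance between exploration and exploitation when a robot navigates unknown multi-floor environments. The framework switches dynamically among three states (exploration, recovery, and reminiscing) and introduces an adaptive exploration state with fast- and slow-thinking modes to improve navigation performance. It achieves state-of-the-art results among zero-shot methods on the HM3D and MP3D benchmarks.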
Details
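Motivation: Existing zero-shot object navigation methods struggle to balance exploration and exploitation in unknown multi-floor environments (for example at narrow intersections or stair entrances), which causes robots to get stuck, wander, or fail.
Result: Extensive experiments on the HM3D and MP3D benchmarks show that AERR-Nav achieves state-of-the-art (SOTA) performance among zero-shot methods.
Insight: The core contributions are the Adaptive Exploration-Recovery-Reminiscing Strategy, which lets the robot adjust its state dynamically according to the environment, and the fast- and slow-thinking modes within the adaptive exploration state, which help the robot balance exploration, exploitation, and higher-level reasoning as environmental information evolves. The result is a navigation strategy framework that combines a state machine with decision modes.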
Abstract: Zero-Shot Object Navigation (ZSON) in unknown multi-floor environments presents a significant challenge. Recent methods, mostly based on semantic value greedy waypoint selection, spatial topology-enhanced memory, and Multimodal Large Language Model (MLLM) as a decision-making framework, have led to improvements. However, these architectures struggle to balance exploration and exploitation for ZSON when encountering unseen environments, especially in multi-floor settings, leading to failures such as robots getting stuck at narrow intersections, endlessly wandering, or failing to find stair entrances. To overcome these challenges, we propose AERR-Nav, a Zero-Shot Object Navigation framework that dynamically adjusts its state based on the robot’s environment. Specifically, AERR-Nav has the following two key advantages: (1) an Adaptive Exploration-Recovery-Reminiscing Strategy that enables robots to dynamically transition between three states, facilitating specialized responses to diverse navigation scenarios; and (2) an Adaptive Exploration State featuring Fast and Slow-Thinking modes that helps robots better balance exploration, exploitation, and higher-level reasoning based on evolving environmental information. Extensive experiments on the HM3D and MP3D benchmarks demonstrate that our AERR-Nav achieves state-of-the-art performance among zero-shot methods. Comprehensive ablation studies further validate the efficacy of our proposed strategy and modules.
cs.SE [Back]
[152] CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents cs.SE | cs.AI | cs.CL
Lintang Sutawika, Aditya Bharat Soni, Bharath Sriraam R R, Apurva Gandhi, Taha Yassine
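TL;DR: This paper presents CodeScout, a reinforcement learning approach for training code search agents. Using only a standard Unix terminal as its tool, and an effective RL recipe covering environment re-purposing, reward design, and optimization techniques, it achieves superior or competitive performance on multiple code search benchmarks.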
Details
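Motivation: Coding agents working on large repositories must first solve code localization, i.e., identify the relevant files, classes, and functions. Existing methods usually rely on complex, specialized tools such as repository graphs derived from static analysis. This work explores whether an agent trained with reinforcement learning can search code effectively using only a simple, general-purpose tool (a Unix terminal).
Result: On three benchmarks (SWE-Bench Verified, Pro, and Lite), CodeScout models outperform or match base and post-trained LLMs that are 2-18x larger, and sometimes approach the performance of closed models such as Claude Sonnet, even when the latter use specialized scaffolds.
Insight: The contribution is showing that a carefully designed RL recipe (environment re-purposing, reward design, RL optimization) can train an effective code search agent with nothing more than a standard Unix terminal, rather than complex specialized tools. This suggests a path toward lightweight, general-purpose coding agents.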
Abstract: A prerequisite for coding agents to perform tasks on large repositories is code localization - the identification of relevant files, classes, and functions to work on. While repository-level code localization has been performed using embedding-based retrieval approaches such as vector search, recent work has focused on developing agents to localize relevant code either as a standalone precursor to or interleaved with performing actual work. Most prior methods on agentic code search equip the agent with complex, specialized tools, such as repository graphs derived from static analysis. In this paper, we demonstrate that, with an effective reinforcement learning recipe, a coding agent equipped with nothing more than a standard Unix terminal can be trained to achieve strong results. Our experiments on three benchmarks (SWE-Bench Verified, Pro, and Lite) reveal that our models consistently achieve superior or competitive performance over 2-18x larger base and post-trained LLMs and sometimes approach performance provided by closed models like Claude Sonnet, even when using specialized scaffolds. Our work particularly focuses on techniques for re-purposing existing coding agent environments for code search, reward design, and RL optimization. We release the resulting model family, CodeScout, along with all our code and data for the community to build upon.