Table of Contents
- cs.CL [Total: 53]
- cs.CV [Total: 199]
- cs.CR [Total: 1]
- cs.AI [Total: 9]
- cs.RO [Total: 8]
- cs.HC [Total: 1]
- cs.CY [Total: 1]
- cs.ET [Total: 1]
- cs.IR [Total: 4]
- physics.soc-ph [Total: 1]
- eess.IV [Total: 1]
- eess.AS [Total: 1]
- cs.LG [Total: 14]
cs.CL [Back]
[1] ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents cs.CL | cs.AI | cs.IR [PDF]
Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang
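TL;DR: This paper proposes ActMem, a novel actionable memory framework for large language model (LLM) agents in long-term interactions. It targets a limitation of existing memory systems, which passively record and retrieve information but lack deeper reasoning. The framework transforms unstructured dialogue history into a structured causal and semantic graph and applies counterfactual reasoning and commonsense completion, so that agents can infer implicit constraints and resolve potential conflicts between past states and current intentions.
Details
Motivation: Current memory frameworks for LLM agents typically act as passive "recorders" that retrieve information without understanding its deeper implications, so they perform poorly in scenarios requiring conflict detection and complex decision-making. The paper aims to bridge the gap between memory retrieval and active causal reasoning.
Result: Experiments show that ActMem significantly outperforms state-of-the-art baselines on complex, memory-dependent tasks. The authors also introduce the ActMemEval dataset to evaluate agent reasoning in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks.
Insight: The core innovation is combining memory retrieval with active causal reasoning: building a causal and semantic graph and applying counterfactual reasoning lets the agent perform deeper logical analysis and conflict resolution. This suggests a path toward more consistent and reliable intelligent assistants.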
Abstract: Effective memory management is essential for large language model (LLM) agents handling long-term interactions. Current memory frameworks typically treat agents as passive “recorders” and retrieve information without understanding its deeper implications. They may fail in scenarios requiring conflict detection and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms state-of-the-art baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.
[2] EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal cs.CL [PDF]
Samah Fodeh, Yan Wang, Linhai Ma, Srivani Talakokkul, Jordan M. Alpert
TL;DR: This paper introduces EPPCMinerBen, a new benchmark for evaluating large language models on electronic patient-provider communication (EPPC). It covers three sub-tasks: code classification, subcode classification, and evidence extraction. The benchmark is built from 1,933 expert-annotated sentences drawn from 752 secure messages in the Yale New Haven Hospital patient portal, and a range of LLMs are tested under zero-shot and few-shot settings. Large instruction-tuned models (such as Llama-3.1-70B and Llama-3.3-70b-Instruct) perform best on evidence extraction and code classification, while smaller models struggle with fine-grained reasoning.
Details
Motivation: As patient-provider exchanges shift to secure messaging, analyzing EPPC data has become both essential and challenging. This calls for evaluating how well LLMs detect communication patterns and extract insights from messages.
Result: On EPPCMinerBen, Llama-3.1-70B leads in evidence extraction with an F1 of 82.84%; Llama-3.3-70b-Instruct outperforms all models in code classification with an F1 of 67.03%; DeepSeek-R1-Distill-Qwen-32B stands out in subcode classification with an F1 of 48.25%; and sdoh-llama-3-70B performs consistently. Smaller models underperform, especially in subcode classification, where F1 falls below 30%. Few-shot prompting improves performance on most tasks.
Insight: The paper's stated contribution is EPPCMinerBen, the first LLM evaluation benchmark dedicated to electronic patient-provider communication, spanning tasks from intent classification to evidence extraction. Viewed objectively, the work highlights the advantage of large instruction-tuned models on medical text understanding and the benefit of few-shot learning, and it provides a benchmark for future work on model generalization and patient-provider communication analysis.
Abstract: Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-provider communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert annotated sentences from 752 secure messages of the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in evidence extraction (F1: 82.84%) and performed well in classification. Llama-3.3-70b-Instruct outperformed all models in code classification (F1: 67.03%). DeepSeek-R1-Distill-Qwen-32B excelled in subcode classification (F1: 48.25%), while sdoh-llama-3-70B showed consistent performance. Smaller models underperformed, especially in subcode classification (<30% F1). Few-shot prompting improved most tasks. Our results show that large, instruction-tuned models generally perform better in EPPCMinerBen tasks, particularly evidence extraction, while smaller models struggle with fine-grained reasoning. EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis. Keywords: Electronic Patient-Provider Communication, Large language models, Data collection, Prompt engineering
[3] Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning cs.CL | cs.AI | cs.LG [PDF]
Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li
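TL;DR: This paper proposes SWAP, a step-wise adaptive penalization framework for improving the length efficiency of chain-of-thought (CoT) reasoning in large reasoning models. During reinforcement learning, it applies a fine-grained length penalty based on each step's intrinsic contribution, allocating length reduction across individual reasoning steps and substantially compressing reasoning length while preserving accuracy.
Details
Motivation: Large reasoning models often produce overly long chains of thought at test time, which raises cost without improving accuracy. Prior reinforcement learning methods typically rely on trajectory-level length penalties that cannot distinguish essential from redundant reasoning steps, which leads to blunt compression.
Result: In extensive experiments, SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
Insight: The contribution is to treat reasoning length as an explicit step-level optimization objective. Step importance is estimated from the model's on-policy log-probability improvement, and excess length is treated as a penalty mass that is redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. Optimization uses a unified outcome-process advantage within group-relative policy optimization.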
Abstract: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model’s on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
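The redistribution idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' formulation: the function name `stepwise_penalties`, the `length_budget` parameter, and the weighting rule (step length times one minus min-max normalized importance) are assumptions chosen only to show how excess length might be spread so that low-importance steps absorb more of the penalty.

```python
def stepwise_penalties(step_lengths, step_importance, length_budget):
    """Spread the excess-length penalty over steps (hypothetical weighting).

    step_lengths:    token count of each reasoning step
    step_importance: importance score per step (higher = more useful)
    length_budget:   assumed target total length; only the excess is penalized
    """
    excess = max(0, sum(step_lengths) - length_budget)
    if excess == 0:
        return [0.0] * len(step_lengths)

    lo, hi = min(step_importance), max(step_importance)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal

    # Assumed rule: long, low-importance steps receive the largest weight.
    weights = [
        length * (1.0 - (imp - lo) / span)
        for length, imp in zip(step_lengths, step_importance)
    ]
    total = sum(weights)
    if total == 0:
        # Degenerate case: fall back to a length-proportional split.
        weights = [float(length) for length in step_lengths]
        total = sum(weights)

    # The penalties always sum to the excess length (the "penalty mass").
    return [excess * w / total for w in weights]


penalties = stepwise_penalties([10, 30, 20], [0.9, 0.1, 0.5], length_budget=40)
```

In this toy call the most important step receives no penalty and the long, low-importance step receives the largest share, which is the qualitative behavior the abstract describes.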
[4] Distribution-Aware Companding Quantization of Large Language Models cs.CL [PDF]
Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva
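TL;DR: The paper proposes a training method called multi-token prediction, in which a language model predicts several future tokens at once to improve sample efficiency and downstream performance. The method is effective for both code and natural language models, particularly improves generative tasks such as code generation, and significantly speeds up inference.
Details
Motivation: Conventional language models predict only the next token, which limits sample efficiency. The paper aims to improve learning efficiency and reasoning ability through a multi-token prediction task.
Result: On the HumanEval and MBPP code generation benchmarks, the 13B-parameter model solves 12% and 17% more problems than the baseline, respectively. Inference is up to 3 times faster, and experiments indicate that multi-token prediction favors the development of induction heads and algorithmic reasoning capabilities.
Insight: The contribution is to use multi-token prediction as an auxiliary training task, with independent output heads sharing a common trunk. This improves sample efficiency and generative performance while accelerating inference, and it is especially useful for larger models and multi-epoch training.
Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.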
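As a rough illustration of the training objective described in the abstract above, the sketch below builds the n targets for each position and sums one cross-entropy term per output head. The helper names (`multi_token_targets`, `multi_token_loss`) and the dictionary representation of each head's predicted distribution are assumptions for illustration; a real model would compute these distributions from a shared trunk.

```python
import math


def multi_token_targets(tokens, n):
    """For each position t, pair the current token with the next n tokens.

    In a real model the trunk sees the whole prefix up to t; only the
    current token is returned here to keep the example small.
    """
    return [
        (tokens[t], tokens[t + 1 : t + 1 + n])
        for t in range(len(tokens) - n)
    ]


def multi_token_loss(head_probs, targets):
    """Sum of per-head negative log-likelihoods at one position.

    head_probs[k] is a (hypothetical) dict mapping token -> probability
    predicted by output head k, which is responsible for token t + 1 + k.
    """
    return sum(-math.log(head_probs[k][tok]) for k, tok in enumerate(targets))


pairs = multi_token_targets([5, 6, 7, 8, 9], n=2)
loss = multi_token_loss([{6: 0.5}, {7: 0.25}], pairs[0][1])
```

With `n = 1` this reduces to the ordinary next-token objective, which is why the abstract can describe the extra heads as an auxiliary task.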
[5] LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation cs.CL | cs.CV [PDF]
Cunyuan Yang, Dejuan Song, Xiaotao Pang, Qianqian Shen, Wenjie Nie
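TL;DR: This paper proposes Fact-Flow, a framework that addresses the factual instability common in medical report generation with multimodal large language models (MLLMs), such as omitted findings or inaccurate information. The framework separates visual fact identification from report generation: it first predicts clinical findings and then uses them to guide the MLLM toward a factually accurate report.
Details
Motivation: Current MLLMs generate reports directly from image features without an explicit factual basis, which leads to factual instability and limits their use in clinical settings.
Result: Extensive experiments on two disease-focused medical datasets show a significant improvement in factual accuracy over state-of-the-art models while maintaining high text quality.
Insight: The core innovation is a pipeline that uses a large language model (LLM) to automatically create a dataset of labeled medical findings, avoiding expensive manual annotation, together with a framework that decouples fact identification from report generation to improve the factual accuracy of generated reports.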
Abstract: The automatic generation of medical reports utilizing Multimodal Large Language Models (MLLMs) frequently encounters challenges related to factual instability, which may manifest as the omission of findings or the incorporation of inaccurate information, thereby constraining their applicability in clinical settings. Current methodologies typically produce reports based directly on image features, which inherently lack a definitive factual basis. In response to this limitation, we introduce Fact-Flow, an innovative framework that separates the process of visual fact identification from the generation of reports. This is achieved by initially predicting clinical findings from the image, which subsequently directs the MLLM to produce a report that is factually precise. A pivotal advancement of our approach is a pipeline that leverages a Large Language Model (LLM) to autonomously create a dataset of labeled medical findings, effectively eliminating the need for expensive manual annotation. Extensive experimental evaluations conducted on two disease-focused medical datasets validate the efficacy of our method, demonstrating a significant enhancement in factual accuracy compared to state-of-the-art models, while concurrently preserving high standards of text quality.
[6] Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research cs.CL [PDF]
Yubo Dong, Nianhao You, Yuxuan Hou, Zixun Sun, Yue Zhang
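TL;DR: The paper introduces the Super Research task, which targets the limited ability of large language models to handle highly complex questions requiring long-horizon planning, large-scale evidence gathering, and synthesis across heterogeneous sources. The approach integrates structured decomposition into a research plan, super wide retrieval for diverse perspectives, and super deep investigation that resolves uncertainties through iterative queries. The paper also builds a benchmark of 300 expert-written questions and introduces a graph-based auditing protocol that evaluates performance along five dimensions.
Details
Motivation: Although large language models perform well at deep research or wide search, their ability to solve highly complex questions, which require long-horizon planning, large-scale evidence gathering, and synthesis across heterogeneous sources, remains largely unexplored.
Result: The evaluation uses a benchmark of 300 expert-written questions across domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts, and a graph-anchored auditing protocol evaluates them along five dimensions: coverage, logical consistency, report utility, objectivity, and citation health.
Insight: The contributions include the structured decomposition of complex research tasks into manageable plans, the combination of super wide and super deep retrieval strategies, and a multi-dimensional auditing protocol for systematically evaluating LLM research ability. Together these provide a new benchmark and stress test for the upper limits of LLM capability and for general research competence.
Abstract: While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions (those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources) remains largely unexplored. We introduce Super Research, a task for complex autonomous research that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph-anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity and Citation Health. While super-complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model’s proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task. The leaderboard is available at: https://cnsdqd-dyb.github.io/Super-Research-Benchmark/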
[7] From Literature to Hypotheses: An AI Co-Scientist System for Biomarker-Guided Drug Combination Hypothesis Generation cs.CL [PDF]
Raneen Younis, Suvinava Basak, Lukas Chavez, Zahra Ahmadi
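TL;DR: This paper presents AI Co-Scientist (CoDHy), an interactive human-in-the-loop system for generating biomarker-guided drug combination hypotheses in cancer research. The system integrates structured biomedical databases and unstructured literature evidence into a task-specific knowledge graph, then uses knowledge graph embeddings and agent-based reasoning to generate, validate, and rank candidate drug combinations, grounding each hypothesis in retrievable evidence.
Details
Motivation: With the rapid growth of biomedical literature and databases, researchers find it difficult to systematically connect biomarker mechanisms to actionable drug combination hypotheses, so a tool is needed to support this process.
Result: The paper demonstrates the design, interaction workflow, and practical use cases of CoDHy as a tool for exploratory hypothesis generation and decision support in translational oncology. The abstract does not report specific quantitative benchmark results or comparisons with existing methods.
Insight: The contribution lies in integrating structured and unstructured data into a task-specific knowledge graph, combining knowledge graph embeddings with agent-based reasoning for hypothesis generation and ranking, and offering a web-based interactive interface that supports transparent, iterative, researcher-steered exploration rather than fully automated decision-making.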
Abstract: The rapid growth of biomedical literature and curated databases has made it increasingly difficult for researchers to systematically connect biomarker mechanisms to actionable drug combination hypotheses. We present AI Co-Scientist (CoDHy), an interactive, human-in-the-loop system for biomarker-guided drug combination hypothesis generation in cancer research. CoDHy integrates structured biomedical databases and unstructured literature evidence into a task-specific knowledge graph, which serves as the basis for graph-based reasoning and hypothesis construction. The system combines knowledge graph embeddings with agent-based reasoning to generate, validate, and rank candidate drug combinations, while explicitly grounding each hypothesis in retrievable evidence. Through a web-based interface, users can configure the scientific context, inspect intermediate results, and iteratively refine hypotheses, enabling transparent and researcher-steerable exploration rather than automated decision-making. We demonstrate CoDHy as a system for exploratory hypothesis generation and decision support in translational oncology, highlighting its design, interaction workflow, and practical use cases.
[8] RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis cs.CL [PDF]
Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Bosi Wen, Yidong Wang
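TL;DR: This paper proposes RAVEL, an agent-based evaluation framework for assessing what large language models can actually do in complex text synthesis tasks such as outlining, drafting, and editing. The authors also build the C3EBench benchmark, which contains 1,258 samples of professional writing and evaluates models on four tasks: Cloze, Edit, Expand, and End-to-End. An analysis of 14 LLMs finds that most models struggle on tasks with limited context or under-specified instructions, and that the quality of agentic text synthesis depends mainly on a model's reasoning ability rather than its raw generative capacity.
Details
Motivation: Current evaluation frameworks cannot assess the actual text synthesis operations of LLMs (such as outlining, drafting, and editing) and therefore fail to measure their detailed capabilities. A new framework is needed to fill this gap.
Result: Evaluating 14 LLMs on C3EBench shows that most models struggle on tasks that demand contextual understanding. When state-of-the-art LLMs serve as operators in the RAVEL framework, the quality of agentic text synthesis is dominated by reasoning ability: a strong reasoner can guide a weaker generator to produce higher-quality results, but the reverse does not hold.
Insight: The contribution is RAVEL, an agentic framework focused on evaluating text synthesis operations, and its companion benchmark C3EBench. Viewed objectively, the study shows that in complex text synthesis an LLM's reasoning ability matters more than its raw generative capacity, and it points to the possibility of collaboration between models of different strengths (for example, a strong reasoner guiding a weaker generator).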
Abstract: Large Language Models have evolved from single-round generators into long-horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables the LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writings. We utilize a “reverse-engineering” pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End-to-End. Through our analysis of 14 LLMs, we uncover that most LLMs struggle with tasks that demand contextual understanding under limited or under-specified instructions. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM’s reasoning capability rather than raw generative capacity. Furthermore, we find that a strong reasoner can guide a weaker generator to yield higher-quality results, whereas the inverse does not hold. Our code and data are available at this link: https://github.com/ZhuoerFeng/RAVEL-Reasoning-Agents-Text-Eval.
[9] DRIV-EX: Counterfactual Explanations for Driving LLMs cs.CL [PDF]
Amaia Cardiel, Eloi Zablocki, Elias Ramzi, Eric Gaussier
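TL;DR: This paper proposes DRIV-EX, a method that provides counterfactual explanations for the decisions of large language models in autonomous driving. It uses gradient-based optimization in the continuous embedding space to find minimal semantic changes, and these guide a controlled decoding process that generates fluent, valid descriptions close to the original scene which change the model's driving plan.
Details
Motivation: Large language models are increasingly used as reasoning engines in autonomous driving, yet their decision-making is opaque. The paper studies their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description needed to alter the driving plan.
Result: Evaluated with the LC-LLM planner on a textual transcription of the highD dataset, DRIV-EX generates valid, fluent counterfactual explanations more reliably than existing baselines.
Insight: The core innovation is to use the result of continuous embedding optimization only as a semantic guide that biases a controlled decoding process regenerating the original scene description. This preserves linguistic fluency, domain validity, and proximity to the original input, avoiding the incoherent text produced by unconstrained continuous optimization, and it offers a concrete way to explain model decisions and expose latent biases.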
Abstract: Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, yet their decision-making remains opaque. We propose to study their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description required to alter a driving plan. We introduce DRIV-EX, a method that leverages gradient-based optimization on continuous embeddings to identify the input shifts required to flip the model’s decision. Crucially, to avoid the incoherent text typical of unconstrained continuous optimization, DRIV-EX uses these optimized embeddings solely as a semantic guide: they are used to bias a controlled decoding process that re-generates the original scene description. This approach effectively steers the generation toward the counterfactual target while guaranteeing the linguistic fluency, domain validity, and proximity to the original input, essential for interpretability. Evaluated using the LC-LLM planner on a textual transcription of the highD dataset, DRIV-EX generates valid, fluent counterfactuals more reliably than existing baselines. It successfully exposes latent biases and provides concrete insights to improve the robustness of LLM-based driving agents.
[10] RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models cs.CL [PDF]
Andrew Zhuoer Feng, Cunxiang Wang, Bosi Wen, Yidong Wang, Yu Luo
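TL;DR: This paper proposes RLAR (Reinforcement Learning from Agent Rewards), a framework that dynamically synthesizes and invokes tools to assign a tailored reward function to each query, addressing the poor generalization of static reward models in large language model alignment.
Details
Motivation: Conventional reinforcement-learning-based alignment of large language models relies on static, domain-specific reward models, which are costly to train and generalize poorly to the out-of-distribution scenarios encountered during RL iterations.
Result: On mathematics, coding, translation, and dialogue tasks, RLAR yields performance gains ranging from 10 to 60. On the RewardBench-V2 benchmark, it significantly outperforms static baselines and approaches the performance upper bound.
Insight: Reward acquisition is recast as a dynamic tool synthesis and invocation task. LLM agents autonomously retrieve optimal reward models and synthesize programmatic verifiers through code generation, which lets the reward system evolve as the data distribution shifts during training.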
Abstract: Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent-driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self-evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench-V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available on this link: https://github.com/ZhuoerFeng/RLAR.
[11] LaSTR: Language-Driven Time-Series Segment Retrieval cs.CL [PDF]
Kota Dohi, Harsh Purohit, Tomoya Nishida, Takashi Endo, Yusuke Ohtsubo
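TL;DR: LaSTR is a language-driven time-series segment retrieval method that retrieves relevant local segments from large time-series repositories given a natural language query. It builds large-scale segment-caption training data by applying TV2-based segmentation to LOTSA windows and generating segment descriptions with GPT-5.2, then trains a Conformer-based contrastive retriever that operates in a shared text and time-series embedding space.
Details
Motivation: Existing time-series retrieval methods usually depend on expert-designed similarity criteria or global, series-level descriptions, and cannot effectively support natural-language-driven retrieval of local segments.
Result: On a held-out test split, LaSTR outperforms random and CLIP baselines across multiple candidate pool sizes. In single-positive retrieval and caption-side consistency (evaluated with SBERT and VLM-as-a-judge), it achieves better ranking quality and stronger semantic agreement between retrieved segments and query intent.
Insight: The contributions are the method for constructing large-scale segment-caption training pairs (TV2 segmentation plus GPT-5.2 generation) and the training of a cross-modal contrastive retriever for fine-grained alignment between language and time series. The method offers a more flexible natural language interface for time-series analysis.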
Abstract: Effectively searching time-series data is essential for system analysis, but existing methods often require expert-designed similarity criteria or rely on global, series-level descriptions. We study language-driven segment retrieval: given a natural language query, the goal is to retrieve relevant local segments from large time-series repositories. We build large-scale segment–caption training data by applying TV2-based segmentation to LOTSA windows and generating segment descriptions with GPT-5.2, and then train a Conformer-based contrastive retriever in a shared text–time-series embedding space. On a held-out test split, we evaluate single-positive retrieval together with caption-side consistency (SBERT and VLM-as-a-judge) under multiple candidate pool sizes. Across all settings, LaSTR outperforms random and CLIP baselines, yielding improved ranking quality and stronger semantic agreement between retrieved segments and query intent.
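Retrieval in a shared embedding space, as described in the abstract above, typically comes down to ranking candidates by similarity to the query embedding. The sketch below shows only that ranking step. The two-dimensional vectors and segment names are made-up placeholders rather than outputs of the paper's Conformer-based encoder, and cosine similarity is an assumed choice of scoring function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, segment_vecs, top_k=2):
    """Return the names of the top_k segments most similar to the query."""
    ranked = sorted(
        segment_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Placeholder embeddings: in practice these would come from trained
# text and time-series encoders that map into the same space.
segments = {"spike": [1.0, 0.0], "flat": [0.0, 1.0], "ramp": [0.7, 0.7]}
query = [0.9, 0.1]  # hypothetical embedding of a query such as "a sudden spike"
top = retrieve(query, segments, top_k=2)
```

Evaluating under multiple candidate pool sizes, as the abstract mentions, would correspond to varying how many entries `segment_vecs` contains.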
[12] Qwen3-Coder-Next Technical Report cs.CL [PDF]
Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng
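TL;DR: Qwen3-Coder-Next is an open-weight language model designed for coding agents. It has 80 billion parameters but activates only 3 billion during inference, combining efficient inference with strong coding capability. The model undergoes agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, learning directly from environment feedback, and explores how far strong training recipes can push models with small parameter footprints.
Details
Motivation: To explore how far strong training recipes (such as agentic training) can push the capability limits of models with small parameter footprints, in order to build a coding agent model with both strong coding capability and efficient inference.
Result: On agent-centric benchmarks including SWE-Bench and Terminal-Bench, Qwen3-Coder-Next achieves competitive performance relative to its active parameter count (3 billion).
Insight: The main contribution is agentic training that pairs large-scale synthesis of verifiable coding tasks with executable environments and uses environment feedback (via mid-training and reinforcement learning) to optimize the model directly, achieving strong coding capability at a small inference footprint. The architecture (80 billion total parameters, 3 billion active) also reflects a trade-off between model efficiency and capability.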
Abstract: We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong training recipes can push the capability limits of models with small parameter footprints. To achieve this, we perform agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, allowing learning directly from environment feedback via mid-training and reinforcement learning. Across agent-centric benchmarks including SWE-Bench and Terminal-Bench, Qwen3-Coder-Next achieves competitive performance relative to its active parameter count. We release both base and instruction-tuned open-weight versions to support research and real-world coding agent development.
[13] Learning Nested Named Entity Recognition from Flat Annotations cs.CL [PDF]
Igor Rozhkov, Natalia Loukachevitch
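TL;DR: The paper studies how to learn nested named entity recognition from flat annotations alone. It proposes four approaches (string inclusions, entity corruption, flat neutralization, and a hybrid fine-tuned + LLM pipeline) and validates them on NEREL, a Russian nested NER benchmark.
Details
Motivation: Nested named entity recognition requires expensive multi-level annotation, while flat annotated data is abundant and nested resources are scarce. The paper explores whether nested structure can be learned from flat annotations alone.
Result: On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, the best combined method reaches 26.37% inner F1, closing 40% of the gap to full nested supervision.
Insight: The contribution is four learning strategies that need no nested annotation, in particular the hybrid pipeline, which offer a practical solution for nested NER in low-resource settings. Viewed objectively, the methods exploit the implicit structural information in flat annotations through pseudo-nested data generation and reduction of false negative signal.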
Abstract: Nested named entity recognition identifies entities contained within other entities, but requires expensive multi-level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo-nested data), flat neutralization (reducing false negative signal), and a hybrid fine-tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at https://github.com/fulstock/Learning-from-Flat-Annotations.
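The "string inclusions (substring matching)" approach named in the abstract above can be sketched as follows. This is an illustrative guess at the general idea rather than the paper's procedure: the function name, the tuple layout, and the English example entities are assumptions, and a practical version would likely need token-boundary checks and handling of inflected forms, which matter for Russian.

```python
def string_inclusions(flat_entities):
    """Propose nested entities by matching known entity strings inside others.

    flat_entities: list of (text, label) pairs from a flat-annotated corpus.
    Returns tuples (outer_text, inner_text, inner_label, start, end), where
    start/end are character offsets of the inner mention in the outer one.
    """
    lexicon = dict(flat_entities)  # entity string -> label
    proposals = []
    for outer_text, _ in flat_entities:
        for inner_text, inner_label in lexicon.items():
            if inner_text == outer_text:
                continue  # an entity is not nested inside itself
            start = outer_text.find(inner_text)
            if start != -1:
                end = start + len(inner_text)
                proposals.append((outer_text, inner_text, inner_label, start, end))
    return proposals


# Hypothetical flat annotations, used only to exercise the function.
flat = [("Moscow State University", "ORG"), ("Moscow", "CITY")]
nested = string_inclusions(flat)
```

Here the flat lexicon alone is enough to propose "Moscow" as an inner CITY mention within the ORG span, without any nested annotation.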
[14] MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine cs.CL [PDF]
Kai Zhang, Zhengqing Yuan, Cheng Peng, Songlin Zhao, Mengxian Lyu
TL;DR: MEDGPT-OSS is an open, 20-billion-parameter generalist multimodal vision-language model designed for biomedicine. Through an optimized three-stage training curriculum, it pairs the GPT-oss language backbone with a visual front-end. It aims to unify radiology, pathology, and clinical-text reasoning and to support on-premises deployment that meets patient privacy and PHI compliance requirements.
Details
Motivation: Today's top-performing biomedical multimodal assistants are either closed-source or computationally prohibitive, so they cannot meet the on-premises deployment requirements of patient privacy and health information compliance. An open, efficient, and capable model is needed to advance open research in clinical AI.
Result: On out-of-distribution multimodal reasoning and complex text-only clinical tasks, MEDGPT-OSS outperforms larger open medical models.
Insight: The contribution is a parameter-efficient design that progressively domain-adapts the model through rigorous data curation and long-context multimodal alignment, allowing a 20-billion-parameter model to bridge the capacity gap while remaining fully compatible with commodity GPUs. This provides a verifiable foundation for privacy-preserving, institution-specific clinical AI research.
Abstract: Biomedical multimodal assistants have the potential to unify radiology, pathology, and clinical-text reasoning, yet a critical deployment gap remains: top-performing systems are either closed-source or computationally prohibitive, precluding the on-premises deployment required for patient privacy and PHI compliance. We introduce MEDGPT-OSS, an open-weight, 20B-parameter generalist vision-language model designed to facilitate open research in clinical AI. Rather than relying on architectural complexity, MEDGPT-OSS pairs the GPT-oss language backbone with a visual front-end via an optimized, three-stage training curriculum. By progressively domain-adapting these modules through rigorous data curation and long-context multimodal alignment, we demonstrate that a 20B model can bridge the capacity gap. It successfully outperforms larger open medical models on out-of-distribution (OOD) multimodal reasoning and complex text-only clinical tasks. By unifying diverse modalities under a single instruction-following interface, MEDGPT-OSS maintains a parameter-efficient footprint fully compatible with commodity GPUs. We release the complete training recipe, open-weight checkpoints, and a rigorous evaluation harness to serve as a verifiable foundation for privacy-preserving, institution-specific clinical AI research.
[15] CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning cs.CL | cs.AI [PDF]
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li
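TL;DR: This paper presents CHIMERA, a compact synthetic dataset for generalizable cross-domain reasoning. It targets three challenges behind the scarcity of high-quality reasoning data: the cold-start problem, limited domain coverage, and the annotation bottleneck. The dataset contains 9K samples with rich chain-of-thought trajectories and structured topics spanning 8 major disciplines, and it uses an automated evaluation pipeline. A 4B-parameter Qwen3 model fine-tuned on CHIMERA performs strongly on several difficult reasoning benchmarks, approaching or matching much larger models.
Details
Motivation: Reproducing and extending LLM reasoning capabilities in open, scalable settings faces three data-centric challenges: the lack of seed datasets needed to initialize reasoning policies (cold start), the narrow domain coverage of existing open-source datasets (mostly mathematics), and the prohibitive cost or infeasibility of human annotation for frontier-level reasoning tasks (annotation bottleneck).
Result: The 4B Qwen3 model fine-tuned on CHIMERA achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity’s Last Exam, approaching or matching the reasoning performance of much larger models such as DeepSeek-R1 and Qwen3-235B.
Insight: The paper's stated contribution is a compact synthetic dataset with rich long chains of thought, structured cross-disciplinary coverage, and a fully automated evaluation pipeline. Viewed objectively, the core contribution is showing that synthetic data generation with automated validation can deliver high-quality, broad-coverage reasoning data at small scale (9K samples), and that a small model fine-tuned on such data can reach reasoning performance comparable to large models, which suggests a more efficient route to developing LLM reasoning capabilities.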
Abstract: Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset’s modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity’s Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
[16] Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan cs.CL [PDF]
Siyu Liang, Talant Mawkanuli, Gina-Anne Levow
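TL;DR: This paper proposes a hybrid neural-LLM pipeline for morphological glossing of endangered languages, combining a neural sequence labeling model with large language model (LLM) post-correction, and evaluates it on Jungar Tuvan, a low-resource Turkic language. The two-stage pipeline meaningfully reduces annotation workload and offers a computationally light solution for automatic glossing in the documentation of morphologically complex languages.
Details
Motivation: Creating interlinear glossed text (IGT) is a major bottleneck in language documentation and fieldwork for low-resource, morphologically rich languages. The paper aims to reduce the manual annotation burden through automation.
Result: Experiments on Jungar Tuvan show that the two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models. Retrieval-augmented prompting provides large gains over random example selection, and performance scales approximately logarithmically with the number of few-shot examples.
Insight: The contribution is a set of design principles for hybrid architectures that combine structured prediction models with LLM reasoning. The study finds that retrieval-augmented prompting beats random examples and that providing a morpheme dictionary hurts performance in most cases, which offers new insight into automatic annotation in morphologically complex settings.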
Abstract: Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
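The contrast between retrieval-augmented prompting and random example selection in the abstract above can be made concrete with a small sketch. The abstract does not say how examples are retrieved, so the similarity measure here (Jaccard overlap of character trigrams) is purely an assumption, as are the function names and the placeholder strings, which are not Tuvan data.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, with '#' marking the string boundaries."""
    padded = f"#{text}#"
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(query, pool, k=2):
    """Pick the k pool examples whose source text is most similar to the query.

    pool: list of dicts with hypothetical keys "source" and "gloss".
    The selected examples would then be placed in the few-shot prompt.
    """
    query_grams = char_ngrams(query)
    ranked = sorted(
        pool,
        key=lambda ex: jaccard(query_grams, char_ngrams(ex["source"])),
        reverse=True,
    )
    return ranked[:k]


# Placeholder strings only; real use would draw on glossed sentences.
pool = [
    {"source": "abxyz", "gloss": "G1"},
    {"source": "zzzzz", "gloss": "G2"},
    {"source": "abcd", "gloss": "G3"},
]
chosen = select_examples("abcde", pool, k=2)
```

Random selection would instead sample `k` items from `pool` without looking at the query, which is the baseline the abstract reports retrieval as outperforming.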
[17] The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors cs.CL | cs.CV | cs.CY [PDF]
Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
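TL;DR: By analyzing how 11 vision-language models perform on the DrawEduMath benchmark, the paper finds that these models have significant weaknesses in handling student errors in math education, and that they perform especially poorly when describing the work of students who need more pedagogical help. Current models are good at solving problems, but their development incentives need to change for them to support educational applications well.
Details
Motivation: The study evaluates how well AI models, vision-language models in particular, identify and respond to student mistakes in math education, to determine whether they can support pedagogical applications across different levels of student proficiency.
Result: On DrawEduMath, a benchmark built from real students' handwritten, hand-drawn responses to math problems and evaluated here over a year-long period, all 11 vision-language models struggle most on questions about assessing student error and underperform when describing work from students who need more pedagogical help. The models fall short of what educational use cases require.
Insight: The paper exposes a limitation of vision-language models in math education: they may be optimized as problem-solving experts, yet they lack the ability to handle student errors and to adapt to students of differing proficiency. This suggests that alternative development incentives are needed to make the models suitable for education.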
Abstract: Effective mathematics education requires identifying and responding to students’ mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students’ handwritten, hand-drawn responses to math problems. We find that models’ weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
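[18] Thoth: Mid-Training Bridges LLMs to Time Series Understanding cs.CL | cs.AI | cs.LG
Jiafeng Lin, Yuxuan Wang, Jialong Wu, Huakun Luo, Zhongyi Pei
TL;DR: This paper proposes Thoth, the first family of models that gives large language models general-purpose time series understanding through mid-training. The authors construct Book-of-Thoth, a high-quality time-series-centric corpus for task- and domain-agnostic alignment between time series and natural language, and introduce KnoTS, a benchmark for knowledge-intensive time series understanding. Experiments show that Thoth significantly outperforms its base model and advanced LLMs on several time series question answering benchmarks and performs well when fine-tuned under data scarcity.
Details
Motivation: Large language models excel at general-purpose reasoning but struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics.
Result: Across several time series question answering benchmarks, Thoth significantly outperforms its base model and advanced LLMs, and it shows superior capabilities when fine-tuned under data scarcity.
Insight: The novelty lies in introducing mid-training as a pivotal intermediate stage. The Book-of-Thoth corpus supports generation in both directions (time series to text and text to time series), giving LLMs a foundational grasp of temporal patterns. The proposed KnoTS benchmark focuses on joint reasoning over temporal patterns and domain knowledge.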
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with general-purpose time series understanding capabilities. As a pivotal intermediate stage, mid-training achieves task- and domain-agnostic alignment between time series and natural language, for which we construct Book-of-Thoth, a high-quality, time-series-centric mid-training corpus. Book-of-Thoth enables both time-series-to-text and text-to-time-series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge-intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid-training with Book-of-Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine-tuned under data scarcity, underscoring the effectiveness of mid-training for time series understanding. Code is available at: https://github.com/thuml/Thoth.
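[19] GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant cs.CL
Zhuokang Shen, Yifan Wang, Hanyu Chen, Wenxuan Huang, Shaohui Lin
TL;DR: This paper proposes GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user group chat assistants. The framework uses a small-large model collaborative architecture that decouples the decision of when to intervene from response generation, and it supports multimodal inputs. The authors also build the MUIR benchmark dataset for evaluation. Experiments show that GroupGPT produces accurate, well-timed responses while substantially reducing token consumption and providing privacy protection.
Details
Motivation: Existing chatbot systems mainly target single-user settings and adapt poorly to complex, evolving multi-user group chats. Existing approaches also typically rely on LLMs for both reasoning and generation, which leads to high token consumption, limited scalability, and privacy risks.
Result: On the proposed MUIR benchmark, GroupGPT achieves an average score of 4.72/5.0 in LLM-based evaluation and is well received by users. Compared with baseline methods, it reduces token usage by up to 3 times.
Insight: The core contribution is the small-large model collaborative architecture, which decouples intervention timing from response generation and balances efficiency with performance. The paper also builds MUIR, a dedicated benchmark for multi-user intervention reasoning, and integrates a privacy sanitization mechanism, offering a new direction for designing multi-user agent systems.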
Abstract: Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user chat assistants. GroupGPT adopts a small-large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision-making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi-user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well-timed responses, achieving an average score of 4.72/5.0 in LLM-based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to 3 times compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: https://github.com/Eliot-Shen/GroupGPT.
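[20] How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning cs.CL
Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong
TL;DR: Geometric reasoning requires interleaving diagram plotting with logical deduction. This paper finds that naively applying supervised fine-tuning (SFT) to interleaved plot-solution data substantially degrades reasoning performance. The authors propose Faire, a reinforcement learning framework that enforces three causal constraints to achieve functional alignment rather than superficial imitation, improving model performance on geometric reasoning tasks.
Details
Motivation: When multimodal large language models (MLLMs) perform interleaved reasoning on complex geometry problems, SFT learns only the surface format and fails to internalize the causal dependency between plotting and reasoning steps, which degrades performance. The paper aims to address this problem.
Result: On challenging geometric reasoning benchmarks, Faire induces a qualitative shift in model behavior in which plotting is effectively internalized, and it achieves competitive performance.
Insight: The novelty lies in exposing a fundamental limitation of SFT on interleaved reasoning tasks (it achieves only distributional alignment) and in proposing a framework that uses reinforcement learning to enforce causal constraints and reach functional alignment. This offers a new approach to the gap between form and function in multimodal reasoning.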
Abstract: Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
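[21] CARD: Towards Conditional Design of Multi-agent Topological Structures cs.CL | cs.LG
Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo
TL;DR: This paper proposes CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that dynamically designs the communication topology of multi-agent systems to cope with real-world changes such as model upgrades and API changes, improving system effectiveness and robustness.
Details
Motivation: Existing LLM-based multi-agent systems usually adopt fixed or statically learned communication topologies and ignore dynamic environmental factors such as changes in model capability, API tools, or knowledge sources, which limits their adaptability and robustness.
Result: On the HumanEval, MATH, and MMLU benchmarks, CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions.
Insight: The novelty lies in explicitly incorporating dynamic environmental signals into graph construction. Through a conditional variational graph encoder and environment-aware optimization, CARD generates communication topologies that adapt at both training time and runtime, making multi-agent systems more resilient to shifts in capability or resources.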
Abstract: Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: https://github.com/Warma10032/CARD.
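[22] Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification cs.CL
Jacob Devasier
TL;DR: This paper studies the reasoning dynamics of masked diffusion language models (MDLMs) on fact verification. It finds that the models typically converge on a verdict early in the diffusion process and treat it as a global anchor, while the justifications generated afterward are often post-hoc rationalizations of the verdict rather than genuine reasoning. Enforcing a reasoning-first constraint actually hurts performance, because the model conditions on the noisy justification tokens it generates and revises its initially correct prediction.
Details
Motivation: Unlike autoregressive models, MDLMs refine all sequence positions simultaneously, which raises the question of how they handle tasks that require justified verdicts. The paper investigates MDLM reasoning dynamics on fact verification, namely whether justifications serve as genuine reasoning or as post-hoc rationalization.
Result: On fact verification, the standard MDLM reaches 86.2% accuracy. Enforcing a reasoning-first constraint (delayed verdict unmasking) drops accuracy to 71.9%. Interventional experiments show that the model rationalizes incorrect forced verdicts in 56% of cases; verdict accuracy falls to 57.3% with corrupted justifications and reaches 97.1% with ground-truth justifications.
Insight: The paper reveals a distinctive verdict-first, justification-second pattern in MDLMs on fact verification, in which justifications often act as post-hoc rationalization rather than as a reasoning process. Its core contribution is to show, through causal intervention experiments, that verdicts depend strongly and causally on justification quality, and to explain why forced deliberation degrades performance: the model adjusts its initially correct judgment based on the noisy justification tokens it generates.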
Abstract: Unlike autoregressive models, which generate tokens sequentially and benefit from reasoning-before-answering strategies such as Chain-of-Thought, Masked Diffusion Language Models (MDLMs) refine all sequence positions simultaneously, raising questions about how these models handle tasks requiring justified verdicts. In this work, we investigate the dynamics of MDLM reasoning on fact verification, examining whether justifications serve as genuine reasoning or post-hoc rationalization. We observe that MDLMs typically converge on a verdict early in the diffusion process, treating it as a global anchor that is resolved before the justification is complete. Crucially, enforcing a reasoning-first constraint via delayed verdict unmasking actively degrades performance, dropping accuracy from 86.2% to 71.9% as accumulating justification tokens introduce inconsistencies that override initially correct predictions. Interventional experiments reveal that the model rationalizes incorrect forced verdicts in 56% of cases, and that verdicts are strongly causally dependent on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth). This causal dependence explains the degradation under forced deliberation: as the model generates noisy justification tokens, it conditions on them, gradually overriding its initially correct assessment. Our findings suggest that for fact verification with MDLMs, extended deliberation can be counterproductive, risking the dilution of accurate early predictions with noise introduced during justification generation.
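[23] XAI-enhanced Comparative Opinion Mining via Aspect-based Scoring and Semantic Reasoning cs.CL
Ngoc-Quang Le, T. Thanh-Lam Nguyen, Quoc-Trung Phu, Thi-Phuong Le, Duy-Cat Can
TL;DR: This paper proposes XCom, a model that improves the transparency of comparative opinion mining through two modules, aspect-based rating prediction and semantic analysis, and integrates a Shapley-based explainability module that exposes the basis for its decisions.
Details
Motivation: Transformer-based models for comparative opinion mining lack transparency. The paper addresses this to increase user trust in the models.
Result: On comparative opinion mining, XCom achieves leading performance relative to baseline models, demonstrating its effectiveness in providing meaningful explanations.
Insight: The novelty lies in combining explainable AI (XAI) with comparative opinion mining, using a modular design and Shapley values to make model decisions more interpretable and reliable.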
Abstract: Comparative opinion mining involves comparing products from different reviews. However, transformer-based models designed for this task often lack transparency, which can hinder the development of user trust. In this paper, we propose XCom, an enhanced transformer-based model separated into two principal modules, i.e., (i) aspect-based rating prediction and (ii) semantic analysis for comparative opinion mining. XCom also incorporates a Shapley additive explanations module to provide interpretable insights into the model’s deliberative decisions. Empirically, XCom achieves leading performance compared with the baselines, which demonstrates its effectiveness in providing meaningful explanations, making it a more reliable tool for comparative opinion mining. Source code is available at: https://anonymous.4open.science/r/XCom.
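[24] Reasoning Boosts Opinion Alignment in LLMs cs.CL | cs.LG
Frédéric Berdoz, Yann Billeter, Yann Vonlanthen, Roger Wattenhofer
TL;DR: This paper studies whether reasoning improves large language model performance on opinion alignment. It proposes a structured reasoning approach based on reinforcement learning and validates it on political datasets covering the U.S., Europe, and Switzerland.
Details
Motivation: Large language models exhibit bias in opinion modeling. The paper explores whether reasoning ability can help models produce answers consistent with users' political preferences.
Result: Experiments on three political datasets show that reasoning improves opinion modeling and is competitive with strong baselines, but does not fully remove bias.
Insight: The novelty lies in bringing reinforcement-learning-driven structured reasoning to opinion alignment, offering a new route toward more faithful political digital twins. Viewed objectively, the method establishes a solid baseline for future work while showing that reasoning alone is not enough to fully resolve bias.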
Abstract: Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
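[25] Can Thinking Models Think to Detect Hateful Memes? cs.CL
Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam
TL;DR: This paper proposes a reinforcement-learning-based post-training framework that improves the reasoning of thinking-based multimodal large language models on hateful meme detection. Using task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective, the framework jointly optimizes meme classification and the quality of generated explanations.
Details
Motivation: Hateful memes often require compositional multimodal reasoning: the image and text may each look harmless in isolation yet convey harmful intent in combination. Although thinking-based multimodal large language models have advanced vision-language understanding, their capabilities for hateful meme analysis remain underexplored.
Result: Experiments on the Hateful Memes benchmark show that the method achieves state-of-the-art performance, improving accuracy and F1 by approximately 1% and explanation quality by approximately 3%.
Insight: The main contributions are: (1) a reinforcement learning post-training framework that combines task rewards with a GRPO objective to improve model reasoning; (2) an extension of an existing dataset with weakly or pseudo-supervised chain-of-thought rationales generated via distillation; and (3) a GRPO-based objective that jointly optimizes classification and explanation quality to encourage fine-grained, step-by-step reasoning.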
Abstract: Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
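The abstract does not spell out the paper's novel objective, so the following is only a minimal sketch of the standard group-relative advantage used in GRPO, applied to a joint reward. The reward mix (`combined_reward`, `alpha`) and the explanation score are hypothetical names introduced for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev


def combined_reward(label_correct: bool, explanation_score: float, alpha: float = 0.5) -> float:
    """Hypothetical joint reward: classification correctness mixed with an
    explanation-quality score in [0, 1]. The weighting alpha is an assumption."""
    return alpha * float(label_correct) + (1.0 - alpha) * explanation_score


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize each sampled response's reward
    by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for the same meme (illustrative values only).
rewards = [
    combined_reward(True, 0.8),
    combined_reward(True, 0.4),
    combined_reward(False, 0.6),
    combined_reward(False, 0.2),
]
advantages = group_relative_advantages(rewards)
```

Responses that are both correct and well explained receive positive advantages relative to their group, which is one way a single objective could push classification and explanation quality together.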
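[26] Suffix-Constrained Greedy Search Algorithms for Causal Language Models cs.CL
Ayoub Hammal, Pierre Zweigenbaum, Caio Corro
TL;DR: This paper proposes suffix-constrained generation. Algorithms based on greedy search ensure that responses from large language models (LLMs) follow a strict template so that the final answer can be easily parsed and extracted, addressing the difficulty of extracting answers from free-form LLM output.
Details
Motivation: When LLMs generate reasoning traces for tasks such as math question answering, the final answer is hard to extract reliably from their free-form output, which is an information extraction problem in its own right.
Result: Experiments on several datasets show that the method guarantees deterministic extraction of the final answer from LLM output without hurting results, and even improves them.
Insight: The novelty lies in the suffix-constrained generation framework, which uses greedy search algorithms to force the output to match a predefined template and so guarantees that the answer is parseable. This provides a reliable solution for applying LLMs to structured prediction tasks.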
Abstract: Large language models (LLMs) are powerful tools that have found applications beyond human-machine interfaces and chatbots. In particular, their ability to generate reasoning traces has motivated their use in many prediction tasks such as math question answering. Unfortunately, extracting the final answer from an LLM's free-form output is difficult, as it is an information extraction problem in its own right. In this work, we introduce suffix-constrained generation, which aims to produce well-formed LLM responses in which final answers follow strict templates and are guaranteed to be trivially parseable. To this end, we introduce several algorithms that are based on greedy search procedures. We experiment on several datasets and show that our approach guarantees trivial deterministic extraction of the final answer from an LLM output without having a negative impact on results, and can even improve them.
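To illustrate why a guaranteed suffix makes extraction trivial, here is a small sketch of the parsing side only. It does not implement the paper's search algorithms, and the template `Final answer: <number>` is a hypothetical choice made for this example.

```python
import re

# Hypothetical suffix template: the response must end with "Final answer: <number>".
ANSWER_SUFFIX = re.compile(r"Final answer: (-?\d+(?:\.\d+)?)\s*\Z")


def extract_answer(response: str):
    """Return the numeric answer string if the response ends with the template,
    otherwise None (the case that constrained decoding is meant to rule out)."""
    match = ANSWER_SUFFIX.search(response)
    return match.group(1) if match else None


constrained = "We add 2 and 3 to get 5.\nFinal answer: 5"
free_form = "The answer is probably five."
parsed = extract_answer(constrained)
missing = extract_answer(free_form)
```

With free-form output the extractor can fail, as the second call shows; a decoder that enforces the suffix removes that failure case by construction.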
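[27] Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning cs.CL | cs.LG
Hamed Damirchi, Ignacio Meza De la Jara, Ehsan Abbasnejad, Afshar Shamsi, Zhen Zhang
TL;DR: This paper proposes Truth as a Trajectory (TaT), which models large language model inference as a trajectory of iterative refinements across layers. TaT distinguishes valid reasoning from spurious behavior by analyzing the geometric displacement of representations between layers rather than static activation points. On benchmarks spanning commonsense reasoning, question answering, and toxicity detection, TaT uses only changes in activations, reduces reliance on surface lexical patterns, and outperforms conventional probing.
Details
Motivation: Existing interpretability methods for large language models usually treat hidden states as static points in activation space and assume that a single layer's representation can separate correct from incorrect inferences. These activations are saturated with polysemantic features, however, so linear probes learn surface-level lexical patterns rather than the underlying reasoning structure.
Result: Across dense and Mixture-of-Experts architectures, on benchmarks spanning commonsense reasoning, question answering, and toxicity detection, TaT uses only cross-layer changes in activations, without access to the activations themselves, yet mitigates reliance on static lexical confounds and outperforms conventional probing.
Insight: The core contribution is to treat inference as a geometric trajectory that evolves across layers and to analyze layer-wise displacement of representations to uncover geometric invariants of reasoning. This offers a perspective on LLM interpretability that complements static activation analysis and better separates out the influence of surface lexical patterns.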
Abstract: Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models the transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves and using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
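As a rough sketch of what working with layer-wise displacement rather than static activations could look like, the code below computes per-layer displacement vectors and two simple descriptors. The descriptors (step norms and cosines between consecutive steps) are assumptions for illustration; the abstract does not state which trajectory features TaT actually uses.

```python
import math


def layer_displacements(hidden_states):
    """hidden_states: list of per-layer vectors for one token position.
    Returns the displacement h[l+1] - h[l] for each consecutive layer pair."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(hidden_states, hidden_states[1:])
    ]


def trajectory_features(hidden_states):
    """Hypothetical trajectory descriptors: the norm of each displacement and the
    cosine between consecutive displacements (how sharply the path turns)."""
    deltas = layer_displacements(hidden_states)
    norms = [math.sqrt(sum(x * x for x in d)) for d in deltas]
    cosines = []
    for d1, d2, n1, n2 in zip(deltas, deltas[1:], norms, norms[1:]):
        dot = sum(x * y for x, y in zip(d1, d2))
        cosines.append(dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0)
    return norms, cosines


# Toy 2-D hidden states for four layers (illustrative values only).
states = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 3.0]]
norms, cosines = trajectory_features(states)
```

A downstream classifier would consume features of this kind in place of the raw activations, which is the sense in which the analysis avoids static lexical content.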
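[28] PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology cs.CL | cs.AI
Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva
TL;DR: This paper presents PanCanBench, a specialized benchmark for evaluating the clinical utility of large language models in pancreatic oncology. The benchmark contains 282 real patient questions and 3,130 expert-written, question-specific evaluation criteria, created through a human-in-the-loop pipeline. Using an LLM-as-a-judge framework, the study evaluates 22 proprietary and open-source models on clinical completeness, factual accuracy, and web-search integration.
Details
Motivation: Existing medical evaluation benchmarks such as HealthBench rely on simulated queries and lack disease-specific depth. Multiple-choice accuracy does not reflect the real-world utility and safety of LLMs in complex clinical settings such as pancreatic cancer, particularly given the problem of hallucination.
Result: On PanCanBench, rubric-based completeness scores vary widely, from 46.5% to 82.3%. Factual errors are common: hallucination rates (the percentage of responses containing at least one factual error) range from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Enabling web search did not meaningfully raise average scores (for example, Gemini-2.5 Pro went from 66.8% to 63.9%).
Insight: The novelty lies in building an in-depth evaluation benchmark for a specific disease (pancreatic cancer) based on real patient questions and expert rubrics. Viewed objectively, even newer reasoning-optimized models such as o3 can underperform on factual accuracy, and web-search integration does not necessarily improve response quality, which challenges common assumptions. Synthetic AI-generated rubrics inflate absolute scores substantially (by 17.9 points on average) while roughly preserving relative rankings.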
Abstract: Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
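The two headline metrics can be sketched as follows. The hallucination rate follows the definition given in the abstract; the completeness score is shown as an unweighted fraction of rubric criteria met, which is an assumption, since the abstract does not say whether criteria are weighted. All input values are illustrative, not data from the paper.

```python
def hallucination_rate(error_counts):
    """Percentage of responses that contain at least one factual error.
    error_counts: number of factual errors found in each response."""
    if not error_counts:
        raise ValueError("need at least one response")
    flagged = sum(1 for n in error_counts if n >= 1)
    return 100.0 * flagged / len(error_counts)


def rubric_completeness(criteria_met):
    """Assumed unweighted completeness: percentage of rubric criteria satisfied.
    criteria_met: list of booleans, one per question-specific criterion."""
    if not criteria_met:
        raise ValueError("need at least one criterion")
    return 100.0 * sum(criteria_met) / len(criteria_met)


# Five responses, two of which contain at least one error (illustrative).
rate = hallucination_rate([0, 2, 0, 1, 0])
# One response judged against four criteria, three of them met (illustrative).
score = rubric_completeness([True, True, False, True])
```

The two numbers are independent: a response can satisfy most rubric criteria and still be counted as hallucinated, which is why the abstract reports them separately.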
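[29] Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning cs.CL | cs.AI
Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, Chuan Shi
TL;DR: This paper proposes RGLM, a reconstructive graph instruction tuning method that improves graph-text alignment in Graph-Tokenizing LLMs (GTokenLLMs). It introduces three variants from the input space and the latent space (RGLM-Decoder, RGLM-Similarizer, and RGLM-Denoiser) and uses graph information reconstruction to add explicit graph supervision, overcoming the text-dominant bias that arises when existing methods rely on text supervision alone.
Details
Motivation: Existing GTokenLLMs achieve only implicit graph-text alignment through text supervision from language instructions, which produces a text-dominant bias and underuses graph context. The paper proves through information-theoretic analysis that the alignment objective is upper-bounded by the mutual information between the input graphs and the LLM's hidden representations, and proposes improving this upper bound to achieve better alignment.
Result: Extensive experiments on multiple benchmarks and task scenarios validate the effectiveness of RGLM and open new directions for alignment research on GTokenLLMs.
Insight: The novelty lies in formalizing the alignment problem from an information-theoretic perspective and introducing reconstructive graph supervision to constrain the alignment process explicitly, which reduces text bias and improves the use of graph context. The three variants provide complementary optimization perspectives from the input and latent spaces, and the alignment effectiveness of each is analyzed theoretically, offering a new direction for graph foundation models.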
Abstract: The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes across diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instruction tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieves only implicit graph-text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM’s graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs’ alignment research.
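[30] Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction cs.CL
Jiyoon Myung
TL;DR: This paper systematically evaluates the reliability of large language models in multi-turn conversation. Three representative tasks (maintaining global constraints across topic shifts, selecting the correct tool or agent amid interleaved intents, and tracking structured entities under revisions and distractions) quantify the performance degradation between single-turn and multi-turn settings. The study finds that both commercial and open-source models become substantially less reliable in multi-turn interaction, smaller models especially, and it identifies common failure modes such as instruction drift, intent confusion, and contextual overwriting.
Details
Motivation: The reliability of LLMs in real-world multi-turn, mixed-topic conversations is poorly understood. The paper evaluates how they behave in extended dialogues that depend on prior context.
Result: On all three tasks, every tested model, commercial and open-source alike, shows a substantial reliability drop in the multi-turn setting, with smaller models degrading more. The study identifies specific failure modes but does not mention specific benchmarks or state-of-the-art comparisons.
Insight: The novelty lies in a systematic framework for quantifying reliability degradation in multi-turn conversation and in exposing key failure modes. Viewed objectively, the task design (paired single-turn and multi-turn settings) and the methodology offer a new perspective for evaluating the conversational robustness of LLMs.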
Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.
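[31] LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval cs.CL | cs.IR
Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie
TL;DR: This paper proposes LaSER, a framework that internalizes explicit reasoning into the latent space of dense retrievers, addressing the prohibitive latency of existing methods that generate explicit reasoning chains. Through dual-view training and a multi-grained alignment strategy, the retriever reasons efficiently in latent space, combining the depth of explicit reasoning with the efficiency of standard retrieval.
Details
Motivation: Current LLM-based dense retrievers mainly use the LLM as a static encoder and do not exploit its strong reasoning capabilities. Existing methods that do use reasoning (such as rewrite-then-retrieve pipelines) must generate explicit reasoning chains, which incurs prohibitive inference latency.
Result: On multiple in-domain and out-of-domain reasoning-intensive benchmarks, LaSER significantly outperforms state-of-the-art baselines. Analyses across different backbones and model scales validate the robustness of the method.
Insight: The core contribution is a self-distillation framework that internalizes explicit reasoning into latent space. Dual-view training (an Explicit view and a Latent view) together with a trajectory alignment mechanism lets the model "think silently" and efficiently in latent space, avoiding the overhead of autoregressive text generation.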
Abstract: LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines. Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
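[32] Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics cs.CL
Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty
TL;DR: This paper offers a physics-inspired view that treats key-value (KV) cache compression in large language models as a controlled perturbation of token-level routing, distinguishing three notions: retention, accessibility, and utilization. Using synthetic tasks that probe multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, it reports three findings. Moderate compression degrades internal representations with little accuracy loss, revealing redundancy. All models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in the Global Eviction Ratio, suggesting a phase transition in semantic reachability. Architectures differ in routing dynamics, which leads to different resilience to compression.
Details
Motivation: As LLM context windows scale to 100K+ tokens, the KV cache becomes the dominant memory bottleneck. Existing evaluations miss a structural issue: attention is not just storage but routing, so retaining KV pairs does not guarantee semantic accessibility.
Result: Experiments on synthetic tasks show that moderate compression degrades internal representations with little accuracy loss. All models show a sharp performance drop (the hallucination safety cliff) near 90% compression, correlated with spikes in the Global Eviction Ratio. Different architectures, such as LLaMA and Qwen, exhibit different routing dynamics and different resilience to compression.
Insight: The novelty lies in reframing KV compression as a structural probe of attention geometry, showing that sparse token-route structures govern compression tolerance, and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention. The paper also introduces the notion of representational rigidity, in which excessive head-level consensus harms routing flexibility.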
Abstract: As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
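[33] Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents cs.CL | cs.AI
Yuxin Liu, Mingye Zhu, Siyuan Liu, Bo Hu, Lei Zhang
TL;DR: This paper proposes Persona Dynamic Decoding (PDD), a framework that improves how well role-playing language agents follow their personas at decoding time. It has two core components: the PIE module, which dynamically estimates the contextual importance of persona attributes, and the PIA paradigm, which uses these importance scores to build weighted multi-objective rewards that adjust generation probabilities at inference time.
Details
Motivation: Existing methods, such as static prompt engineering or costly fine-tuning, cannot make role-playing agents adapt their personas to dynamic scenarios. Psychological theories such as the Cognitive-Affective Personality System indicate that a persona's influence on behavior varies with the situation, so an adaptive approach to persona management is needed.
Result: Extensive experiments show that the method is effective in utterance consistency and behavioral fidelity. The abstract does not specify which benchmarks were used or whether the method reaches the state of the art.
Insight: The novelty lies in bringing a psychological theory (situation-dependent persona importance) into the decoding process. The paper proposes a dynamic importance estimation method that needs no ground-truth supervision and uses weighted reward guidance at inference time to achieve adaptive persona following, offering a new approach to real-time adaptability in role-playing agents.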
Abstract: The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality Systems, provide a crucial explanation for this failure: a persona’s influence on behavior is not static but varies with the scenario. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) a Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) a Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.
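One simple way to picture weighted reward-guided decoding is to add an importance-weighted sum of per-attribute rewards to the base logits before the softmax. This is a minimal sketch under that assumption; the function `reweight`, the additive form, and the `beta` strength parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight(logits, attribute_rewards, importance, beta=1.0):
    """Hypothetical reward-guided step: shift each candidate's logit by the
    importance-weighted sum of its per-attribute rewards.
    attribute_rewards[k][i] is the reward of candidate i under persona attribute k;
    importance[k] is the (context-dependent) weight of attribute k."""
    total = sum(importance)
    weights = [w / total for w in importance]
    adjusted = []
    for i, logit in enumerate(logits):
        bonus = sum(w * rewards[i] for w, rewards in zip(weights, attribute_rewards))
        adjusted.append(logit + beta * bonus)
    return softmax(adjusted)


# Two candidates with equal base logits; attribute 0 favors the first candidate,
# attribute 1 favors the second, and the context makes attribute 0 more important.
probs = reweight([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], importance=[3.0, 1.0])
```

When the estimated importance shifts with the dialogue context, the same candidates are reranked differently, which is the adaptive behavior the abstract describes.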
[34] Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models cs.CLPDF
Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty
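TL;DR: This paper proposes MarODE, a framework for assessing the quality of reasoning traces generated by language models. It models trace dynamics with a Markovian formulation of reasoning progression and ordinary differential equations, and it substantially outperforms existing baselines in a large-scale evaluation.
Details
Motivation: Existing evaluation methods are mechanical and struggle to capture human-centric notions of reasoning quality. They do not generalize across varied and progressively degraded reasoning, so a theory-driven evaluation framework is needed.
Result: In a large-scale evaluation, MarODE exceeds existing baselines by over 250% under Somers’ D correlation. Its effectiveness and robustness are validated with human-centric perturbations and human judgments.
Insight: The novelty lies in modeling reasoning progression as a Markov process and characterizing trace dynamics with ordinary differential equations, which enables efficient evaluation of reasoning quality and gives language-model-based systems a theory-driven evaluation tool.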
Abstract: Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the two fundamental dimensions of an evaluation metric: goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary-differential-equation-based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers’ D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
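Since the headline result is stated in terms of Somers’ D, a short sketch of that statistic may help. The version below computes the asymmetric D of `y` with respect to `x`: concordant minus discordant pairs, divided by the number of pairs not tied on `x`. Treating human judgments as `x` and metric scores as `y` is an assumption made here for illustration; the abstract does not say which direction the paper uses.

```python
def somers_d(x, y):
    """Somers' D of y with respect to x.

    D = (concordant - discordant) / (pairs not tied on x).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = untied_on_x = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                continue  # pairs tied on x are excluded from the denominator
            untied_on_x += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    if untied_on_x == 0:
        raise ValueError("all pairs are tied on x")
    return (concordant - discordant) / untied_on_x

# Hypothetical human ratings (x) and metric scores (y) for three traces.
d = somers_d([1, 2, 3], [1, 1, 2])
```

A value of 1 means the metric orders every pair of traces the same way the reference does, and -1 means it reverses every pair; ties on `y` pull the value toward 0.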
[35] Measuring What VLMs Don’t Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation cs.CL | cs.AIPDF
Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank
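TL;DR: This paper studies a blind spot in the metrics used to evaluate vision-language models (VLMs) for radiology report generation: token-overlap metrics can hide "template collapse", where a model produces repetitive, generic text and omits clinical terminology. The paper proposes lexical diversity measures and introduces the Clinical Association Displacement (CAD) and Weighted Association Erasure (WAE) frameworks to quantify shifts in demographic-based word associations and the loss of clinical information in generated reports. It finds that deterministic decoding causes high levels of semantic erasure, while stochastic sampling increases diversity but may introduce new bias, which motivates rethinking how an "optimal" report is defined.
Details
Motivation: Current validation metrics for radiology report generation (such as token-overlap scores) cannot detect decoding-induced template collapse, in which a model generates repetitive, safe, generic text and omits key clinical terms. A model that scores well on benchmarks may therefore be clinically uninformative, which creates a risk of metric gaming.
Result: Quantitative analysis with the proposed CAD and WAE frameworks shows that deterministic decoding produces high levels of semantic erasure (loss of clinical terminology), while stochastic sampling yields more diverse outputs but may introduce new bias. These results expose the limits of the current definition of "optimal" report generation.
Insight: The paper shows that current evaluation metrics (such as BLEU and ROUGE) can conceal a serious lack of clinical specificity in generated text, and it proposes a quantitative framework (CAD/WAE) based on lexical diversity and shifts in demographic word associations. This offers a new perspective and methodological basis for validating the clinical fidelity and demographic fairness of VLMs in radiology.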
Abstract: Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how “optimal” reporting is defined.
[36] Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning cs.CLPDF
Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu
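TL;DR: This paper proposes Learning to Draft (LTD), an adaptive speculative decoding method that uses reinforcement learning to train two co-adaptive policies. The policies dynamically coordinate the draft and verification phases to directly optimize the throughput of each decoding cycle and thereby speed up LLM inference.
Details
Motivation: Current speculative decoding methods either use a static time allocation or optimize proxy metrics, ignoring the true time cost and treating the draft and verification phases in isolation. This paper aims to remove these limitations and optimize decoding efficiency directly.
Result: Extensive evaluation on five different LLMs and four different tasks shows that LTD achieves speedups of 2.24x to 4.32x, outperforming the current state-of-the-art method Eagle3 by up to 36.4%.
Insight: The main novelty is to formulate speculative decoding as a reinforcement learning environment and train two co-adaptive policies that dynamically coordinate drafting and verification, directly maximizing decoding throughput rather than optimizing a proxy metric.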
Abstract: Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes the throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
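The drafting versus verification trade-off can be made concrete with a small static model. This is not LTD's learned policy. It is a sketch that assumes each drafted token is accepted independently with probability `alpha`, in which case a cycle with `k` drafted tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation, and that a cycle costs `k * t_draft + t_verify` seconds. All parameter names and timings are hypothetical.

```python
def expected_tokens(alpha, k):
    """Expected tokens produced by one draft-and-verify cycle with k drafts."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def cycle_throughput(alpha, k, t_draft, t_verify):
    """Tokens per second for one cycle: expected tokens over wall-clock cost."""
    return expected_tokens(alpha, k) / (k * t_draft + t_verify)

def best_draft_length(alpha, t_draft, t_verify, max_k=16):
    """Draft length in [0, max_k] that maximizes per-cycle throughput."""
    return max(
        range(max_k + 1),
        key=lambda k: cycle_throughput(alpha, k, t_draft, t_verify),
    )

# With a high acceptance rate and cheap drafting, longer drafts pay off.
k_star = best_draft_length(alpha=0.9, t_draft=0.1, t_verify=1.0)
```

Optimizing acceptance length alone would always favor longer drafts; dividing by the cycle time is what makes the objective a throughput, which is the quantity the abstract says LTD targets.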
[37] LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence cs.CL | cs.AIPDF
Anka Chandrahas Tummepalli, Preethu Rose Anish
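TL;DR: This paper proposes LexChronos, an agentic framework for extracting structured event timelines from Supreme Court of India judgments. It uses a dual-agent architecture: a fine-tuned extraction agent identifies candidate events, and a pre-trained feedback agent scores and refines them in a confidence-driven loop. To address the scarcity of Indian legal event data, the authors build a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, and generate gold-standard event annotations.
Details
Motivation: Traditional approaches treat judgments and proceedings as unstructured text, which limits the effectiveness of large language models on tasks such as legal summarization, argument generation, and judgment prediction. A method that extracts structured event timelines from Indian legal documents is needed to improve model comprehension and reasoning.
Result: On the constructed synthetic corpus, the framework reaches a BERT-based F1 score of 0.8751. In a downstream legal summarization evaluation, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, indicating improved comprehension and reasoning in Indian jurisprudence.
Insight: The main contributions are: 1) an iterative, confidence-driven dual-agent (extraction and feedback) architecture for refining event extraction; 2) a way to synthesize a high-quality annotated dataset for a data-scarce domain (Indian law) by reverse engineering with LLMs (DeepSeek-R1 and GPT-4); and 3) converting legal documents into structured event timelines, which lays a foundation for later legal AI applications such as precedent mapping, argument synthesis, and predictive judgment modelling.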
Abstract: Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. LexChronos employs a dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, while a pre-trained feedback agent scores and refines them through a confidence-driven loop. To address the scarcity of Indian legal event datasets, we construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. Our pipeline achieves a BERT-based F1 score of 0.8751 against this synthetic ground truth. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence. This work lays a foundation for future legal AI applications in the Indian context, such as precedent mapping, argument synthesis, and predictive judgment modelling, by harnessing structured representations of legal events.
[38] Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations cs.CL | cs.IRPDF
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu
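TL;DR: This paper proposes ColParse, a visual document retrieval method that parses the document layout to produce a small set of layout-informed sub-image embeddings and fuses them with a global page-level vector. The result is a compact, structure-aware multi-vector representation that sharply reduces storage while improving retrieval performance.
Details
Motivation: Current multi-vector retrieval architectures face a storage bottleneck in visual document retrieval. Existing optimizations such as embedding merging, pruning, or abstract tokens tend to sacrifice performance or ignore important layout information, so they do not balance storage efficiency and retrieval accuracy well.
Result: Across multiple benchmarks and base models, the method reduces storage requirements by over 95% while delivering significant performance gains, improving both storage efficiency and retrieval accuracy.
Insight: The paper brings a document parsing model into the multi-vector retrieval framework: fusing layout-informed local embeddings with a global vector keeps fine-grained retrieval ability while greatly shrinking the representation, which offers an efficient and interpretable option for deploying large-scale multimodal information systems.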
Abstract: Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
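To make the storage argument concrete, the sketch below scores a page with a ColBERT-style MaxSim late-interaction function and compares vector counts. The abstract does not state which scoring function ColParse uses or how many vectors it keeps per page, so MaxSim, the count of 1024 patch vectors, and the count of 16 layout-level vectors are all assumptions chosen for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query vector takes its best document match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def storage_reduction(vectors_before, vectors_after):
    """Fraction of per-page vectors removed, assuming equal vector size."""
    return 1 - vectors_after / vectors_before

# Toy 2-d example: two query vectors against two document vectors.
score = maxsim([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])

# Hypothetical counts: 1024 patch vectors vs. 1 global + 15 region vectors.
saved = storage_reduction(1024, 16)
```

Because the MaxSim cost and the index size both scale with the number of document vectors, replacing a dense patch grid with a handful of layout regions reduces storage and scoring work together.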
[39] Surgical Post-Training: Cutting Errors, Keeping Knowledge cs.CL | cs.AIPDF
Wenye Lin, Kai Han
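TL;DR: This paper proposes Surgical Post-Training (SPoT), a paradigm for efficiently improving the reasoning ability of large language models (LLMs) while mitigating catastrophic forgetting. It combines a data rectification pipeline, in which an Oracle makes minimal edits to erroneous reasoning steps, with a reward-based binary cross-entropy objective. Experiments show that with only 4k rectified math data pairs, SPoT markedly improves accuracy on in-domain and out-of-domain tasks in a short training run.
Details
Motivation: LLM post-training faces a trade-off between efficiency and catastrophic forgetting. Prior work emphasizes the role of on-policy data, but this paper identifies and validates the implicit regularization in DPO's reward estimate as a key mechanism, which motivates the design of SPoT.
Result: On Qwen3-8B, training on only 4k rectified data pairs for 28 minutes (8x H800 GPUs) improves average accuracy across in-domain and out-of-domain math tasks by 6.2%.
Insight: The contributions are: 1) showing that the implicit regularization in DPO's reward estimate plays a key role in mitigating forgetting; and 2) the SPoT paradigm, which pairs a data rectification pipeline (surgical edits by an Oracle) with a supervised objective that treats reasoning correctness as binary classification, cutting errors efficiently while preserving knowledge.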
Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization’s (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model’s distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B’s accuracy by 6.2% on average across in-domain and out-of-distribution (OOD) tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
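The contrast between DPO's relative ranking and a per-sample binary objective can be sketched numerically. The code below assumes the standard DPO implicit reward, beta times the log-probability ratio between policy and reference, and applies an ordinary binary cross-entropy to it. The abstract does not give SPoT's exact loss, so this is one plausible reading rather than the authors' formulation; the function names and `beta` default are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward from sequence log-probabilities."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(reward_chosen, reward_rejected):
    """Pairwise loss: depends only on the reward difference."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

def bce_loss(reward, is_correct):
    """Per-sample loss: each response is classified as correct or not."""
    p = sigmoid(reward)
    return -math.log(p) if is_correct else -math.log(1.0 - p)

# Shifting both rewards by the same amount leaves the DPO loss unchanged...
same = abs(dpo_loss(1.0, 0.0) - dpo_loss(6.0, 5.0)) < 1e-12
# ...but the binary objective still penalizes a high reward on a wrong answer.
penalty_grows = bce_loss(5.0, False) > bce_loss(0.0, False)
```

The pairwise loss only constrains the gap between the two responses, while the binary loss supervises each response on its own, which is one way to read the abstract's phrase "decoupled supervision signals".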
[40] Legal RAG Bench: an end-to-end benchmark for legal RAG cs.CL | cs.IR | cs.LGPDF
Abdur-Rahman Butler, Umar Butler
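TL;DR: This paper introduces Legal RAG Bench, a benchmark and methodology for evaluating the end-to-end performance of retrieval-augmented generation (RAG) systems in the legal domain. The benchmark contains 4,876 passages from the Victorian Criminal Charge Book and 100 complex, hand-written questions that require expertise in criminal law and procedure, with long-form answers and supporting passages. The evaluation uses a full factorial design and a novel hierarchical error decomposition framework to separate the contributions of the retrieval and reasoning models. Evaluating three state-of-the-art embedding models and two frontier LLMs, the study finds that information retrieval is the main driver of legal RAG performance, while the LLM has a more moderate effect on correctness and groundedness.
Details
Motivation: The legal domain lacks a standardized, end-to-end benchmark for RAG systems. The paper aims to fill that gap and to clarify how much the retrieval and generation components each contribute to system performance, in support of more reliable legal AI systems.
Result: Three SOTA embedding models (Kanon 2 Embedder, Gemini Embedding 001, Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro, GPT-5.2) were evaluated on Legal RAG Bench. Kanon 2 Embedder performed best, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. The results indicate that retrieval sets the ceiling for legal RAG performance and that many errors attributed to LLM hallucination are in fact triggered by retrieval failures.
Insight: The paper builds an end-to-end RAG benchmark specific to the legal domain, with complex expert questions and long-form answers, and proposes a full factorial design with hierarchical error decomposition that allows apples-to-apples comparison and attribution between retrieval and reasoning components. Its core finding is that, in legal RAG, retrieval quality matters more to final performance than LLM capability, which indicates where optimization effort should go first.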
Abstract: We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus’ Kanon 2 Embedder, Google’s Gemini Embedding 001, and OpenAI’s Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
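A full factorial design simply evaluates every embedder with every LLM, so that each component's effect can be compared with the other held fixed. The sketch below enumerates that grid using the model names listed in the abstract and adds a toy hierarchical error attribution. The attribution order (retrieval first, then groundedness) and the label names are assumptions for illustration; the abstract does not spell out the paper's actual hierarchy.

```python
from itertools import product

EMBEDDERS = ["Kanon 2 Embedder", "Gemini Embedding 001", "Text Embedding 3 Large"]
LLMS = ["Gemini 3.1 Pro", "GPT-5.2"]

# Full factorial grid: every embedder paired with every LLM (3 x 2 = 6 runs).
configs = list(product(EMBEDDERS, LLMS))

def attribute_error(correct, retrieved_support, grounded):
    """Assign one label per answer, checking earlier pipeline stages first.

    Hypothetical hierarchy: a wrong answer is blamed on retrieval before it
    is blamed on the LLM, since the LLM never saw the supporting passage.
    """
    if correct:
        return "correct"
    if not retrieved_support:
        return "retrieval_failure"
    if not grounded:
        return "hallucination"
    return "reasoning_error"

label = attribute_error(correct=False, retrieved_support=False, grounded=False)
```

Checking retrieval before groundedness is what lets an ungrounded wrong answer be counted as a retrieval failure rather than a hallucination, which mirrors the abstract's observation that many apparent hallucinations are triggered upstream.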
[41] FreeAct: Freeing Activations for LLM Quantization cs.CL | cs.AI | cs.CVPDF
Xiaohao Liu, Xiaobo Xia, Manyi Zhang, Ji-Fu Li, Xianzhi Yu
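TL;DR: FreeAct is a quantization framework for large language models that decouples the transformations applied to activations and weights and assigns dynamic transformation matrices to different token types. It addresses the weakness of existing transformation-based quantization methods when handling the dynamic activation patterns of diffusion LLMs and multimodal LLMs.
Details
Motivation: Existing quantization methods based on orthogonal transformations impose a static one-to-one transformation constraint. They cannot adapt to the inherently dynamic patterns of input activations, especially in diffusion LLMs and multimodal LLMs where different token types follow different distributions, and this limits quantization performance.
Result: Extensive experiments on diffusion LLMs and multimodal LLMs show that FreeAct significantly outperforms baselines, with performance gains of up to 5.3%.
Insight: The core idea is to exploit the rank-deficient nature of activations to derive a solution space beyond simple inverse matrices. This decouples the activation transformation from the weights, so the activation side can use dynamic, token-type-specific transformation matrices while the weight side keeps a single static transformation, which handles dynamic activation disparities flexibly.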
Abstract: Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To address this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs, together with in-depth analyses, demonstrate that FreeAct significantly outperforms baselines, with performance improvements of up to 5.3%. Our code will be publicly released.
[42] Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering cs.CLPDF
Xufei Lv, Jiahui Yang, Yifu Gao, Linbo Qiao, Houde Liu
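TL;DR: This paper proposes AT2QA, an autonomous, training-free agent for temporal knowledge graph question answering. It gives an off-the-shelf large language model the autonomy to interact iteratively with the temporal knowledge graph through a general search tool, which substantially improves performance in a strict zero-shot setting.
Details
Motivation: Existing LLM-based methods for temporal knowledge graph question answering rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. The paper explores whether granting the model autonomy can improve reasoning instead.
Result: On the MultiTQ benchmark, AT2QA achieves 88.7% Hits@1, a 10.7% improvement over the previous SOTA, including a 20.1% gain on challenging multi-target queries. This indicates that the autonomous approach can clearly outperform fine-tuned methods on temporal question answering.
Insight: The core idea is to give an off-the-shelf LLM autonomous decision-making (letting the model decide its next action) within a training-free, interactive agent built on a general search tool. This offers an efficient and flexible alternative for temporal reasoning that avoids the cost of hand-designed workflows or supervised fine-tuning.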
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to do next, already yields substantial gains even in a strict zero-shot setting. Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval. Experiments on MultiTQ demonstrate large improvements: AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including a +20.1% gain on challenging multi-target queries, showing that agentic autonomy can decisively outperform fine-tuning for temporal question answering. Code and the full set of sampled trajectories are available at https://github.com/AT2QA-Official-Code/AT2QA-Official-Code
[43] CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production cs.CL | cs.AI | cs.SIPDF
Yixin Nie, Lin Guan, Zhongyao Ma, Anchit Gupta, Yipin Zhou
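TL;DR: This paper presents CharacterFlywheel, a flywheel process for iteratively improving large language models in production social chat applications such as Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, the process refined models over 15 generations using data from internal and external real-user traffic. Through continuous deployment from July 2024 to April 2025 and 7-day A/B tests, the models showed clear gains in both user engagement and steerability.
Details
Motivation: The paper addresses how to improve LLMs continuously and at scale in production, so as to raise user engagement and steerability (instruction following) in social chat applications.
Result: In A/B tests spanning several months, 7 of 8 newly deployed models showed positive lift over the baseline; the best models improved engagement breadth by up to 8.8% and engagement depth by up to 19.4%. Instruction following rose from 59.2% to 84.8%, and instruction violations fell from 26.6% to 5.8%.
Insight: The paper describes in detail a complete iterative flywheel that integrates data curation, reward modeling, supervised fine-tuning, reinforcement learning, and offline and online evaluation, and it shares practical methods for preventing overfitting and handling production dynamics at scale. Together these give a systematic framework and practical experience for improving LLMs rigorously in production.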
Abstract: This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.
[44] MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning cs.CL | cs.AI | cs.CVPDF
Jiachun Li, Shaoping Huang, Zhuoran Jin, Chenlong Zhang, Pengfei Cao
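TL;DR: This paper presents MMR-Life, a comprehensive benchmark for evaluating the multi-image reasoning ability of multimodal large language models in real-life scenarios. It contains 2,646 multiple-choice questions based on 19,108 real-world images, covers seven reasoning types, and is used to evaluate 37 advanced models, revealing how challenging the task is for current models.
Details
Motivation: The reasoning ability of multimodal large language models across diverse real-life scenarios has not been fully explored, and standardized evaluation benchmarks are lacking, so a comprehensive benchmark is needed to fill this gap.
Result: On MMR-Life, even top models such as GPT-5 reach only 58% accuracy, and performance varies considerably across reasoning types, showing that the benchmark poses a substantial challenge to existing models.
Insight: The paper builds a real-life benchmark that does not depend on domain expertise and instead requires models to integrate information across multiple images and apply several kinds of reasoning, providing a broad basis for evaluating and analyzing the next generation of multimodal reasoning systems.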
Abstract: Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs’ reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
[45] EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training cs.CL | cs.AIPDF
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu
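TL;DR: This paper proposes EstLLM, which uses continued pretraining (CPT) and post-training alignment to substantially improve the Estonian capabilities of a pretrained multilingual model (Llama 3.1 8B) while preserving its English and general reasoning performance.
Details
Motivation: Mainstream large language models are trained on English-centric data, which leads to poor support for smaller languages such as Estonian.
Result: On a comprehensive suite of Estonian benchmarks, the model shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction following over the original base model and its instruction-tuned variant, while remaining competitive on English benchmarks.
Insight: The approach uses a balanced data mixture for continued pretraining (more exposure to the target language, with English replay, code, mathematics, and instruction-like data to approximate the original training distribution) and then applies post-training alignment (supervised fine-tuning, preference optimization, and chat vector merging) to add robust instruction following. This is an effective path to strengthening a single language in a multilingual model.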
Abstract: Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
[46] Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game cs.CLPDF
Thierry Geoffre, Trystan Geoffre
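TL;DR: This study uses sequence-based learning analytics to examine grammatical reasoning in primary school learners, analyzing fine-grained action sequences from an interactive game on French morphosyntactic agreement. It treats each slider movement as a hypothesis-testing action, captures real-time cognitive strategies during sentence construction, and reveals patterns of dynamic hypothesis revision.
Details
Motivation: Traditional assessments rely only on final answers and cannot capture the real-time cognitive strategies learners use while learning grammar. The study analyzes action sequences from an interactive game to expose hidden dimensions of morphosyntactic reasoning and provide a basis for real-time teaching interventions.
Result: The data comprise 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in real classrooms, with Hamming distance used to quantify proximity to valid grammatical solutions. Determiners and verbs are the main sites of difficulty, and action sequences deviate from the usual left-to-right treatment; exercises with fewer solutions converge more slowly and more erratically.
Insight: The study treats each slider movement as a hypothesis test and uses Hamming distance to quantify the learning process. The analysis shows that this approach can reveal how learners dynamically revise hypotheses, which offers a new angle for building real-time teaching tools.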
Abstract: This study investigates grammatical reasoning in primary school learners through a sequence-based learning analytics approach, leveraging fine-grained action sequences from an interactive game targeting morphosyntactic agreement in French. Unlike traditional assessments that rely on final answers, we treat each slider movement as a hypothesis-testing action, capturing real-time cognitive strategies during sentence construction. Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across exercises with varying levels of difficulty. Results reveal that determiners and verbs are key sites of difficulty, with action sequences deviating from the usual left-to-right treatment. This suggests learners often fix the verb first and adjust preceding elements. Exercises with fewer solutions exhibit slower and more erratic convergence, while changes in the closest valid solution indicate dynamic hypothesis revision. Our findings demonstrate how sequence-based analytics can uncover hidden dimensions of linguistic reasoning, offering a foundation for real-time scaffolding and teacher-facing tools in linguistically diverse classrooms.
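The Hamming-distance measure can be shown with a toy exercise. The sketch below assumes each game state is a tuple of slider choices and that an exercise has a known set of valid sentences; the example sentences and the three-slot layout are invented for illustration and are not taken from the study's data.

```python
def hamming(state, solution):
    """Number of slots where the current state differs from a solution."""
    if len(state) != len(solution):
        raise ValueError("state and solution must have the same length")
    return sum(a != b for a, b in zip(state, solution))

def closest_valid(state, valid_solutions):
    """Valid solution nearest to the state (first one wins on ties)."""
    return min(valid_solutions, key=lambda s: hamming(state, s))

def distance_trace(states, valid_solutions):
    """Distance to the nearest valid solution after each action."""
    return [min(hamming(s, v) for v in valid_solutions) for s in states]

# Hypothetical exercise with two valid sentences (singular and plural).
VALID = [("le", "chat", "dort"), ("les", "chats", "dorment")]
session = [
    ("les", "chat", "dort"),      # one slot away from the singular solution
    ("les", "chats", "dort"),     # now one slot away from the plural solution
    ("les", "chats", "dorment"),  # valid plural sentence
]
trace = distance_trace(session, VALID)
targets = [closest_valid(s, VALID) for s in session]
```

Here the distance stays at 1 across the first two actions while the nearest valid sentence switches from singular to plural, which is the kind of change the abstract interprets as hypothesis revision.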
[47] ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels cs.CLPDF
Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li
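TL;DR: This paper presents ClinConsensus, a Chinese medical benchmark for large language models (LLMs) that is curated, validated, and quality-controlled by clinical experts. It contains 2500 open-ended cases covering the full course of care from prevention to long-term follow-up, 36 medical specialties, and 12 clinical task types, with progressively increasing difficulty. For reliable evaluation, the authors adopt a rubric-based grading protocol and the Clinically Applicable Consistency Score (CACS@k), and introduce a dual-judge framework that combines a high-capability LLM-as-judge with a locally deployable distilled judge model. Evaluating several leading LLMs on the benchmark reveals substantial heterogeneity across tasks, care stages, and specialties, and identifies clinically actionable treatment planning as a key bottleneck.
Details
Motivation: Most existing medical benchmarks are static and task-isolated, and they fail to capture the openness, longitudinal structure, and safety-critical complexity of real clinical workflows. A new benchmark closer to clinical practice is needed to evaluate medical LLMs.
Result: A comprehensive evaluation of several leading LLMs on ClinConsensus reveals substantial heterogeneity across task themes, care stages, and medical specialties. The top models reach comparable overall scores but differ markedly in reasoning, evidence use, and longitudinal follow-up, and clinically actionable treatment planning remains a key bottleneck.
Insight: The contributions are an expert-validated, open-ended Chinese medical benchmark covering the full course of care at multiple difficulty levels; the CACS@k metric for scoring complex scenarios; and a dual-judge framework that pairs a high-capability LLM with a locally deployable distilled model for scalable evaluation aligned with physician judgment. Together they provide evaluation tools for developing medical LLMs that are robust, clinically grounded, and deployable.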
Abstract: Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care (from prevention and intervention to long-term follow-up), covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
[48] Recursive Think-Answer Process for LLMs and VLMs cs.CLPDF
Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
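TL;DR: This paper proposes an efficient Recursive Think-Answer Process (R-TAP) to strengthen the reasoning of large language models and vision-language models. By adding a confidence generator and two complementary rewards, it lets a model run iterative reasoning cycles, going beyond conventional single-pass inference to produce more accurate and more stable answers.
Details
Motivation: Think-Answer reasoners such as DeepSeek-R1 have made notable progress through interpretable internal reasoning, but they still produce output errors in single-pass inference even when self-reflective cues such as “Oops!” are present. A method that supports iterative reasoning is needed to overcome this limitation.
Result: Experiments show that models enhanced with R-TAP consistently outperform conventional single-pass methods on both LLM and VLM tasks. The frequency of “Oops”-like expressions in responses drops significantly, and reasoning becomes more stable and faster.
Insight: The core idea is a confidence generator that evaluates the certainty of a model's response and guides subsequent improvement, combined with two complementary rewards: the Recursively Confidence Increase Reward and the Final Answer Confidence Reward. This gives an effective way to refine a model's reasoning through iterative cycles guided by confidence.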
Abstract: Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like “Oops!”, they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards (Recursively Confidence Increase Reward and Final Answer Confidence Reward), we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of “Oops”-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and elaborate methods for refining the reasoning processes of future AI.
[49] LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations cs.CL | cs.AI | cs.CYPDF
Veronika Solopova, Viktoria Skorik, Maksym Tereshchenko, Alina Haidun, Ostap Vykhopen
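TL;DR: This study evaluates the strategic decision-making of six advanced large language models in four real-world crisis simulation scenarios and compares them with human performance. The models approximate human decision patterns at first but diverge over time, and their explanations consistently show a normative-cooperative framing centered on stability, coordination, and risk mitigation.
Details
Motivation: Large language models are increasingly proposed as agents in strategic decision environments, but their behavior in structured geopolitical simulations has not been studied enough. This study aims to fill that gap.
Result: Across four real-world crisis simulation scenarios, the models approximate human decision patterns in the base rounds but diverge over time, showing distinct behavioral profiles and strategy updates. The decision explanations of all models show a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.
Insight: The study shows how LLM behavior in strategic simulations evolves over time and exposes a built-in preference for normative-cooperative framing in the models' explanations, which is relevant to risk calibration and argumentation framing when LLMs are deployed as strategic actors.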
Abstract: Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs alongside human results across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans in action alignment, risk calibration through the severity of chosen actions, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioural profiles and strategy updates. LLM explanations for chosen actions across all models exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.
[50] LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards cs.CLPDF
Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing
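TL;DR: This paper proposes LongRLVR, which addresses reward sparsity in long-context reinforcement learning by adding a verifiable context reward. On top of the standard answer reward it adds a dense context reward signal that directly encourages the model to select the correct supporting context, markedly improving the long-context reasoning of large language models.
Details
Motivation: Standard RLVR performs poorly in long-context settings because a reward based only on the final answer is too sparse to guide the model in locating and reasoning over relevant information in externally provided long contexts, which leads to vanishing gradients and makes learning difficult.
Result: Experiments on Qwen and LLaMA models show that LongRLVR raises a 14B model's score on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5, and that it significantly outperforms standard RLVR across all models and benchmarks.
Insight: The core idea is to add a verifiable context reward as an auxiliary signal that directly incentivizes the model's context grounding, which resolves the optimization difficulty in long-context reinforcement learning. Explicitly rewarding the grounding process emerges as a key strategy for unlocking the long-context reasoning potential of LLMs.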
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding, that is, the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model in identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model’s scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
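The idea of adding a dense context reward to a sparse answer reward can be sketched as follows. The abstract does not define the context reward, so this example assumes the model reports which context chunks it relied on and scores them by F1 against a gold set of chunk IDs; the F1 choice, the additive combination, and `context_weight` are all hypothetical.

```python
def context_reward(selected_chunks, gold_chunks):
    """F1 overlap between the chunks the model cited and the gold chunks."""
    selected, gold = set(selected_chunks), set(gold_chunks)
    overlap = len(selected & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def total_reward(answer_correct, selected_chunks, gold_chunks, context_weight=0.5):
    """Sparse answer reward plus a weighted, dense context reward."""
    answer = 1.0 if answer_correct else 0.0
    return answer + context_weight * context_reward(selected_chunks, gold_chunks)

# A wrong answer with the right evidence still gets a learning signal...
partial = total_reward(False, [1, 2], [1, 2])
# ...while a wrong answer with the wrong evidence gets none.
nothing = total_reward(False, [3], [1, 2])
```

With an answer-only reward both rollouts above would score 0 and be indistinguishable; the context term separates them, which is the intuition behind the abstract's claim that the auxiliary signal restores a usable gradient for grounding.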
[51] Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale cs.CLPDF
Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui
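TL;DR: This paper proposes AgentSkillOS, a framework for effectively leveraging, managing, and scaling the Claude agent skill ecosystem. It has two stages: organizing skills into a capability tree through node-level recursive categorization for efficient discovery, and retrieving, orchestrating, and executing multiple skills through pipelines based on directed acyclic graphs (DAGs).
Details
Motivation: With the rapid proliferation of Claude agent skills, how to effectively leverage, manage, and scale the skill ecosystem has become a central question.
Result: On a benchmark of 30 tasks across five categories, LLM-based pairwise evaluation aggregated with a Bradley-Terry model shows that tree-based retrieval effectively approximates oracle skill selection and that DAG-based orchestration substantially outperforms native flat invocation given the identical skill set. The experiments cover ecosystems ranging from 200 to 200K skills.
Insight: The novelty is the first principled framework for skill selection, orchestration, and ecosystem-level management. Its core idea is to unlock skill potential through structured composition (capability tree and DAG pipelines), offering a systematic approach to organizing and coordinating large-scale skill ecosystems.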
Abstract: The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent’s ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. Our findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at: https://github.com/ynulihao/AgentSkillOS.
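The aggregation of pairwise judgments into unified scores can be sketched with the standard iterative (minorization-maximization) fit of a Bradley-Terry model. The system names and win counts below are made up for illustration, and the paper's actual fitting procedure is not stated in the abstract.

```python
def bradley_terry(wins: dict, iters: int = 200) -> dict:
    """Fit Bradley-Terry strengths; wins[(a, b)] is how often a beat b."""
    items = sorted({name for pair in wins for name in pair})
    strength = {name: 1.0 for name in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                # Number of comparisons between i and j, in either direction.
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {k: v / norm for k, v in updated.items()}
    return strength


# Hypothetical pairwise outcomes between three orchestration variants.
toy_wins = {
    ("dag", "flat"): 8, ("flat", "dag"): 2,
    ("dag", "tree_only"): 6, ("tree_only", "dag"): 4,
    ("tree_only", "flat"): 7, ("flat", "tree_only"): 3,
}
scores = bradley_terry(toy_wins)
```

The returned strengths sum to one, so they can be read as relative quality scores on a shared scale.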
[52] Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training cs.CLPDF
Valentin Lacombe, Valentin Quesnel, Damien Sileo
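TL;DR: The paper presents Reasoning Core, a scalable procedural data generation suite that produces verifiable symbolic reasoning data across core formal domains: PDDL planning, first-order logic, context-free grammars, causal reasoning, and systems of equations. It supports curriculum design and reinforcement learning, and experiments show its data improves downstream reasoning.
Details
Motivation: Existing procedural generators usually rely on fixed puzzles or templates and lack the distributional breadth needed at scale, so a scalable way to generate diverse, verifiable symbolic reasoning data is needed to expand the reasoning frontier of language models.
Result: Mixing Reasoning Core data into pre-training improves downstream reasoning while preserving or slightly improving language modeling quality; zero-shot evaluations confirm that the tasks challenge frontier models such as GPT-5.
Insight: The novelty is a unified, scalable suite that integrates multiple formal domains with external-solver verification, difficulty control, and reasoning-trace generation, so the same interface supports supervised training from pre-training through post-training and serves as a reinforcement learning reward function, giving a systematic route to symbolic reasoning data at scale.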
Abstract: Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
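The generate, solve, and verify loop described above can be illustrated on the simplest of the listed domains, systems of equations. This is a toy sketch and not the Reasoning Core API: the function names, the 2x2 restriction, and the mapping from `difficulty` to coefficient range are all assumptions, and Cramer's rule stands in for an external solver.

```python
import random
from fractions import Fraction


def generate_linear_system(difficulty: int, rng: random.Random):
    """Sample a 2x2 integer system with a unique solution; difficulty widens the range."""
    bound = 2 + 3 * difficulty
    x, y = rng.randint(-bound, bound), rng.randint(-bound, bound)
    while True:
        a, b, c, d = (rng.randint(-bound, bound) for _ in range(4))
        if a * d - b * c != 0:  # nonzero determinant guarantees uniqueness
            break
    e, f = a * x + b * y, c * x + d * y
    text = f"{a}*x + {b}*y = {e}; {c}*x + {d}*y = {f}"
    return text, (a, b, c, d), (e, f)


def solve(coeffs, consts):
    """Reference solver (Cramer's rule with exact rational arithmetic)."""
    a, b, c, d = coeffs
    e, f = consts
    det = a * d - b * c
    return Fraction(e * d - b * f, det), Fraction(a * f - e * c, det)


def verify(coeffs, consts, answer) -> bool:
    """Check a candidate answer by substitution."""
    a, b, c, d = coeffs
    e, f = consts
    x, y = answer
    return a * x + b * y == e and c * x + d * y == f


def reward(coeffs, consts, answer) -> float:
    """Verifiable reward for reinforcement learning: 1.0 if and only if the answer checks out."""
    return 1.0 if verify(coeffs, consts, answer) else 0.0
```

Because the verifier only substitutes the answer back into the equations, the same check serves both for filtering generated data and for scoring model outputs.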
[53] MemeIntel: Explainable Detection of Propagandistic and Hateful Memes cs.CL | cs.AI | cs.CVPDF
Mohamed Bayan Kmainasi, Abul Hasnat, Md Arid Hasan, Ali Ezzat Shahroor, Firoj Alam
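TL;DR: This paper introduces the MemeXplain dataset and a multi-stage optimization approach for jointly detecting propagandistic and hateful memes and generating explanatory rationales. On Arabic propaganda meme and English hateful meme detection, the approach significantly improves both label detection and explanation quality, surpassing the previous best methods.
Details
Motivation: The proliferation of multimodal content on social media makes it hard to understand and moderate complex, context-dependent issues such as misinformation, hate speech, and propaganda. Existing work pays limited attention to jointly modeling label detection and explanation-based rationale generation, and training the two simultaneously often degrades classification performance.
Result: The approach improves accuracy over the current best model by about 1.4% (absolute) on ArMeme and about 2.2% (absolute) on Hateful Memes, setting a new state of the art.
Insight: The contributions are MemeXplain, the first large-scale explanation-enhanced dataset for Arabic propagandistic memes and English hateful memes, and a multi-stage optimization strategy for training vision-language models that avoids the performance drop seen in joint training and improves interpretability.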
Abstract: The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to jointly modeling label detection and the generation of explanation-based rationales, which often leads to degraded classification performance when trained simultaneously. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propagandistic memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results show that this strategy significantly improves both label detection and explanation generation quality over the base model, outperforming the current state-of-the-art with an absolute improvement of ~1.4% (Acc) on ArMeme and ~2.2% (Acc) on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available (https://github.com/MohamedBayan/MemeIntel).
cs.CV
[54] VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation cs.CVPDF
Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo, Tomoya Yamanokuchi, Takamitsu Matsubara
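TL;DR: This paper proposes VoxelDiffusionCut, a method for non-destructive extraction of a target internal part (such as a battery or motor) through iterative cutting and structure estimation. It uses a diffusion model to iteratively estimate the voxel-represented internal structure from observed cutting surfaces and plans cutting paths from the estimates so that the target part is not damaged.
Details
Motivation: At recycling and disposal sites, it is crucial to extract target internal parts non-destructively by cutting the surrounding structure, but product diversity and missing disassembly information make it hard to decide where to cut. Existing conditional generative models (e.g., conditional variational autoencoders) struggle to capture multi-modal predictive uncertainty because of mode collapse, which yields overconfident predictions and poor estimates of the probability that the target part is present.
Result: Simulation experiments suggest that the method can estimate internal structure from observed cutting surfaces and use the estimated uncertainty to extract the target internal part without damage.
Insight: The novelty is applying a diffusion model to 3D shape completion over a voxel representation to capture uncertainty in unobserved regions and avoid erroneous cuts. The voxel representation simplifies learning because the model only predicts attributes (types of constituent parts) at fixed grid positions, which makes the problem more tractable.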
Abstract: Non-destructive extraction of the target internal part, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part’s presence from partial observations. However, learning conditional generative models for this task is challenging: The high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. Voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts. Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.
[55] NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence cs.CV | cs.AI | cs.IRPDF
Aman Ulla
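TL;DR: NovaLAD is a fast, CPU-optimized document extraction pipeline for generative AI and data intelligence. It runs two YOLO object detection models in parallel (element detection and layout detection), combined with rule-based grouping and optional vision-language enhancement, to convert unstructured documents such as PDFs and scans into structured text and layout-aware representations. The system runs on CPU, executes detection, classification, OCR, and conversion in parallel, and outputs structured JSON, Markdown, RAG-ready text, and knowledge graphs.
Details
Motivation: Document extraction is an important preprocessing step for retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI applications, turning unstructured documents (such as PDFs and scans) into structured, layout-aware representations. Existing approaches face challenges in speed, cost (especially GPU dependence), and accuracy.
Result: On the DP-Bench benchmark (upstage/dp-bench), NovaLAD reaches 96.49% TEDS (Tree-Edit-Distance-based Similarity, a table-structure metric) and 98.51% NID (Normalized Indel Distance, a text-similarity metric), outperforming both commercial and open-source parsers.
Insight: The main contributions are: 1) two parallel YOLO models for semantic element detection and layout structure detection respectively; 2) an image classifier (ViT) that pre-filters figures so that only relevant images are sent to the Vision LLM for title, summary, and structured information extraction, reducing noise and compute cost; 3) a pipeline optimized for CPU with no GPU required, parallelized for speed and supporting multiple output formats, balancing accuracy, speed, and practicality.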
Abstract: Document extraction is an important step before retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI can work. It turns unstructured documents like PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models (element detection and layout detection) with rule-based grouping and optional vision-language enhancement. When a page image is submitted, it first passes through both models at the same time. The element model finds semantic content such as title, header, text, table, and image, and the layout model finds structural regions such as layout_box, column_group, multi_column, and row_group. A key design decision is to first send each image or figure through an image classifier (ViT) that decides whether it is relevant. Only useful images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and cost. NovaLAD is built for speed: it runs on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several output formats, including structured JSON, Markdown, RAG-ready text, and knowledge graphs. We test on the DP-Bench benchmark (upstage/dp-bench) and get 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers. This paper explains how to extract data, how the architecture works, how data flows, and how to make NovaLAD both accurate and usable without needing a GPU.
[56] CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers cs.CV | cs.AIPDF
Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang
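TL;DR: This paper proposes CT-Flow, an agentic framework built on the Model Context Protocol (MCP) that turns static 3D CT analysis into a dynamic, tool-aware clinical workflow. Backed by CT-FlowBench, the first large-scale instruction-tuning benchmark for this setting, the framework decomposes complex natural language queries into automated tool-use sequences for interoperable volumetric interpretation.
Details
Motivation: Existing large vision-language models (LVLMs) for 3D CT analysis mostly rely on static, single-pass inference, whereas real clinical interpretation is a dynamic, tool-mediated, iterative workflow, leaving a gap between method and practice.
Result: Experiments on CT-FlowBench and standard 3D VQA datasets show that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and reaching a 95% success rate in autonomous tool invocation.
Insight: The novelty is bringing MCP into radiological analysis, shifting from closed-box inference to an open, tool-aware paradigm, and using a dedicated benchmark to couple multi-step reasoning with tool use, which provides a scalable foundation for integrating autonomous agents into clinical radiology.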
Abstract: Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.
[57] OrthoAI: A Lightweight Deep Learning Framework for Automated Biomechanical Analysis in Clear Aligner Orthodontics – A Methodological Proof-of-Concept cs.CV | cs.AIPDF
Edouard Lansiaux, Margaux Leman, Mehdi Ammi
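TL;DR: The paper presents OrthoAI, a lightweight deep learning framework for automated biomechanical analysis in clear aligner orthodontics. The system combines lightweight 3D dental segmentation with a rule-based biomechanical engine to assist treatment-plan evaluation: it decomposes per-tooth motion across six degrees of freedom, computes movement-specific predictability, issues alerts when limits are exceeded, and derives a composite index to support clinical decisions.
Details
Motivation: Clear aligner therapy has become mainstream in orthodontics, but clinician review of digitally planned tooth movements (typically via ClinCheck) is slow and error-prone, so an automated decision-support system is needed to improve efficiency and accuracy.
Result: On a surrogate dataset of point clouds reconstructed from 3DTeethLand (MICCAI) landmarks, the segmentation model (only 60,705 trainable parameters) reaches a Tooth Identification Rate of 81.4% and an mIoU of 8.25%, reflecting the limits of sparse landmark supervision rather than dense meshes. The end-to-end pipeline runs in under 4 seconds on consumer hardware and establishes a baseline for future full-mesh training.
Insight: The contributions include integrating a lightweight Dynamic Graph CNN for 3D dental segmentation with an evidence-based biomechanical analysis engine for fast automated assessment. The method relies on extracting tooth identity and approximate centroid/axis estimates from sparse landmark supervision to drive downstream analysis, and provides open-source tools and a proof of concept for reproducible research in geometric deep learning and digital orthodontics.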
Abstract: Clear aligner therapy now dominates orthodontics, yet clinician review of digitally planned tooth movements, typically via ClinCheck (Align Technology), remains slow and error-prone. We present OrthoAI, an open-source proof-of-concept decision-support system combining lightweight 3D dental segmentation with automated biomechanical analysis to assist treatment-plan evaluation. The framework uses a Dynamic Graph CNN trained on landmark-reconstructed point clouds from 3DTeethLand (MICCAI) and integrates a rule-based biomechanical engine grounded in orthodontic evidence (Kravitz et al 2009; Simon et al 2014). The system decomposes per-tooth motion across six degrees of freedom, computes movement-specific predictability, issues alerts when biomechanical limits are exceeded, and derives an exploratory composite index. With 60,705 trainable parameters, segmentation reaches a Tooth Identification Rate of 81.4% and mIoU of 8.25% on surrogate point clouds, reflecting sparse landmark supervision rather than dense meshes. Although spatial boundaries are coarse, downstream analysis depends mainly on tooth identity and approximate centroid/axis estimation. Results establish a baseline for future full-mesh training and highlight current perceptual limits. The end-to-end pipeline runs in under 4 s on consumer hardware. Code, weights, and analysis tools are released to support reproducible research in geometric deep learning and digital orthodontics. The system has not been validated on real intraoral meshes and should not be assumed to generalize beyond landmark-derived representations.
[58] QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference cs.CV | cs.AI | cs.IR | cs.MM | cs.PF | eess.SYPDF
Miao Zhang, Ruixiao Zhang, Jianxin Shi, Hengzhi Wang, Hao Fang
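TL;DR: The paper proposes QuickGrasp, a responsive, quality-of-service-aware video-language querying system. It uses a local-first architecture with on-demand edge augmentation to balance the high accuracy of large video-language models against the low latency of small local models. Its core designs are accelerated video tokenization, query-adaptive edge augmentation, and delay-aware vision token density configuration, with the vision representation shared across model variants to avoid redundant computation.
Details
Motivation: Deploying large video-language models in real-world systems incurs high resource demands and response delays, while small local models fall short in accuracy; the goal is a video querying service that is both responsive and accurate.
Result: Evaluations on multiple video understanding benchmarks show that QuickGrasp matches the accuracy of large VLMs while reducing response delay by up to 12.8x.
Insight: The novelty is a hybrid local-first architecture with on-demand edge augmentation, together with system-level optimizations (accelerated tokenization, query-adaptive augmentation, delay-aware token density configuration) that markedly cut response time without sacrificing accuracy, a key step toward responsive video querying services for open-world understanding.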
Abstract: Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
[59] TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings cs.CV | cs.AIPDF
Bibin Wilson
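TL;DR: TinyVLM is the first framework to enable zero-shot object detection on microcontrollers (MCUs) with less than 1MB of memory. By decoupling visual inference from text encoding, training nested multi-dimensional embeddings with Matryoshka distillation, and storing quantized embeddings, it sharply reduces memory requirements while keeping competitive zero-shot accuracy.
Details
Motivation: Existing zero-shot object detection methods depend on large vision-language models (such as CLIP) that need hundreds of megabytes of memory and therefore cannot be deployed on severely resource-constrained microcontrollers.
Result: TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101. The deployed vision encoder needs only 285KB of RAM and 892KB of flash, with real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator.
Insight: The contributions are: 1) a decoupled architecture that lets precomputed class embeddings be stored in flash; 2) Matryoshka distillation for flexible accuracy-memory trade-offs across embedding dimensions; 3) quantized embedding storage that cuts class prototype memory by 4x with minimal accuracy loss. The core idea is systematic model compression and storage optimization that brings zero-shot detection to extreme edge devices for the first time.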
Abstract: Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision language models (VLMs) like CLIP that require hundreds of megabytes of memory - far exceeding the constraints of micro controller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285KB of RAM and 892KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
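The interplay of nested embeddings and quantized prototype storage can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not TinyVLM's implementation: truncating to the first `dim` coordinates is the usual way Matryoshka embeddings are consumed, and symmetric int8 quantization is one scheme that gives a 4x saving over float32; the abstract does not say which scheme the paper uses. All names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def truncate(vec, dim):
    """Matryoshka-style use: keep the first `dim` coordinates and renormalize."""
    return l2_normalize(vec[:dim])


def quantize_int8(vec):
    """Symmetric per-vector int8 quantization; returns (codes, scale)."""
    peak = max(abs(v) for v in vec) or 1.0
    scale = peak / 127.0
    return [round(v / scale) for v in vec], scale


def build_table(prototypes, dim):
    """Precompute what would sit in flash: quantized, truncated class prototypes."""
    return {label: quantize_int8(truncate(vec, dim)) for label, vec in prototypes.items()}


def classify(image_emb, table, dim):
    """Pick the class whose dequantized prototype has the highest dot product."""
    query = truncate(image_emb, dim)
    best_label, best_score = None, float("-inf")
    for label, (codes, scale) in table.items():
        score = sum(q * c * scale for q, c in zip(query, codes))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy 4-dimensional "text" prototypes and one "image" embedding.
prototypes = {"cat": [0.9, 0.1, 0.0, 0.4], "dog": [0.1, 0.9, 0.3, 0.0]}
table = build_table(prototypes, dim=2)
label = classify([0.8, 0.2, 0.5, 0.1], table, dim=2)
```

Choosing a smaller `dim` shrinks both the stored table and the per-class dot product, which is the accuracy-memory trade-off the abstract refers to.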
[60] Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion cs.CV | cs.AI | cs.LGPDF
Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal
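TL;DR: This paper proposes RADS, an inference-time framework that prevents memorization in text-to-image diffusion models while preserving generation fidelity. It models the denoising process as a dynamical system, uses reachability analysis to approximate the backward reachable tube, and learns a policy via constrained reinforcement learning that applies minimal perturbations in the caption embedding space to steer trajectories away from memorized samples.
Details
Motivation: Text-to-image diffusion models memorize training data, and existing mitigations usually sacrifice image quality or prompt alignment. The paper aims for robust mitigation without modifying the diffusion backbone.
Result: Across generation diversity (SSCD), quality (FID), and alignment (CLIP), RADS achieves a better Pareto frontier than state-of-the-art baselines, offering a plug-and-play solution for safe generation.
Insight: The novelty is treating diffusion denoising as a dynamical system and using reachability analysis to identify trajectories that lead to memorization, then formulating mitigation as a constrained RL problem that steers with minimal perturbations. It is an inference-time intervention that requires no retraining of the model.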
Abstract: Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the “backward reachable tube”, the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: https://s-karnik.github.io/rads-memorization-project-page/.
[61] From Scale to Speed: Adaptive Test-Time Scaling for Image Editing cs.CV | cs.AI | cs.LG | eess.IVPDF
Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang
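TL;DR: This paper proposes ADE-CoT, an adaptive test-time scaling framework that tackles three challenges of applying the Image Chain-of-Thought (Image-CoT) paradigm to goal-directed image editing: inefficient resource allocation, unreliable early-stage verification, and redundant results. Using difficulty-aware resource allocation, edit-specific verification, and depth-first opportunistic stopping, it achieves a better performance-efficiency trade-off on several state-of-the-art editing models and benchmarks.
Details
Motivation: Existing Image-CoT methods mainly target text-to-image generation, whereas image editing is goal-directed and its solution space is constrained by the source image and instruction. Applying Image-CoT directly leads to inefficient resource allocation, unreliable verification, and redundant results.
Result: Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that, with comparable sampling budgets, ADE-CoT outperforms Best-of-N with more than a 2x speedup.
Insight: The novelty is adapting test-time scaling from generation to editing with a complete framework: dynamic budget allocation, early verification based on region localization and caption consistency, and early stopping guided by an instance-specific verifier. This enables on-demand scaling and a better balance of efficiency and quality in editing.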
Abstract: Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
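The three strategies can be read as one control loop: size the budget from estimated difficulty, prune candidates with a cheap early score, then finish candidates one at a time and stop once one is good enough. The sketch below shows only that control flow. Every callable, threshold, and default is a hypothetical stand-in; the paper's verifiers and budget rule are not specified in the abstract.

```python
from typing import Callable


def adaptive_edit_search(
    generate_preview: Callable[[int], object],  # cheap early-stage candidate per seed
    refine: Callable[[object], object],         # finish generation for one candidate
    early_score: Callable[[object], float],     # stand-in for edit-specific verification
    final_score: Callable[[object], float],     # stand-in for the instance-specific verifier
    difficulty: float,
    min_budget: int = 2,
    max_budget: int = 8,
    keep: int = 2,
    accept: float = 0.9,
):
    # 1) Difficulty-aware budget: harder edits get more samples.
    difficulty = min(max(difficulty, 0.0), 1.0)
    budget = min_budget + round(difficulty * (max_budget - min_budget))
    # 2) Early pruning: keep only the most promising previews.
    previews = [generate_preview(seed) for seed in range(budget)]
    ranked = sorted(previews, key=early_score, reverse=True)[:keep]
    # 3) Depth-first opportunistic stopping: finish one candidate at a time.
    best, best_score, refined = None, float("-inf"), 0
    for candidate in ranked:
        result = refine(candidate)
        refined += 1
        score = final_score(result)
        if score > best_score:
            best, best_score = result, score
        if score >= accept:
            break
    return best, best_score, budget, refined
```

Compared with Best-of-N, which refines every sample before picking one, this loop spends full compute only on the pruned shortlist and may stop after the first acceptable result.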
[62] GrapHist: Graph Self-Supervised Learning for Histopathology cs.CV | cs.LGPDF
Sevda Öğüt, Cédric Vincent-Cuaz, Natalia Dubljevic, Carlos Hurtado, Vaishnavi Subramanian
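TL;DR: GrapHist is a graph-based self-supervised learning framework for histopathology image analysis that models tissue as cell graphs to learn generalizable embeddings supporting diverse downstream tasks.
Details
Motivation: Self-supervised vision models have advanced digital pathology, but their domain-agnostic transformer architectures do not account for the basic biological elements of histopathology images, namely cells and their complex interactions, so a more efficient, biologically informed modeling approach is needed.
Result: After pre-training on 11 million cell graphs derived from breast tissue, GrapHist performs competitively with vision-based models on slide-, region-, and cell-level tasks across in-domain and out-of-domain benchmarks while using four times fewer parameters, and it drastically outperforms fully supervised graph models on cancer subtyping.
Insight: The contributions include modeling tissue as cell graphs to incorporate biological structure, combining masked autoencoders with heterophilic graph neural networks to capture the heterogeneity of tumor microenvironments, and releasing the first large-scale graph benchmark in the field, advancing graph learning for digital pathology.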
Abstract: Self-supervised vision models have achieved notable success in digital pathology. However, their domain-agnostic transformer architectures are not originally designed to account for fundamental biological elements of histopathology images, namely cells and their complex interactions. In this work, we hypothesize that a biologically-informed modeling of tissues as cell graphs offers a more efficient representation learning. Thus, we introduce GrapHist, a novel graph-based self-supervised learning framework for histopathology, which learns generalizable and structurally-informed embeddings that enable diverse downstream tasks. GrapHist integrates masked autoencoders and heterophilic graph neural networks that are explicitly designed to capture the heterogeneity of tumor microenvironments. We pre-train GrapHist on a large collection of 11 million cell graphs derived from breast tissues and evaluate its transferability across in- and out-of-domain benchmarks. Our results show that GrapHist achieves competitive performance compared to its vision-based counterparts in slide-, region-, and cell-level tasks, while requiring four times fewer parameters. It also drastically outperforms fully-supervised graph models on cancer subtyping tasks. Finally, we also release five graph-based digital pathology datasets used in our study at https://huggingface.co/ogutsevda/datasets , establishing the first large-scale graph benchmark in this field. Our code is available at https://github.com/ogutsevda/graphist .
[63] Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models cs.CVPDF
Binesh Sadanandan, Vahid Behzadan
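TL;DR: This paper proposes a mechanistically guided LoRA method to improve the paraphrase consistency of medical vision-language models when answering clinical questions. Using MedGemma-4B, it validates the method on MIMIC-CXR and PadChest Balanced: fine-tuning with a combined loss over paraphrase consistency and answer accuracy significantly lowers the rate at which answers flip between rephrasings of the same question while keeping accuracy stable.
Details
Motivation: Medical vision-language models can give inconsistent yes/no answers to different paraphrases of the same clinical question, which undermines reliability and clinical use. The paper aims to improve paraphrase consistency.
Result: On MIMIC-CXR, the baseline flip rate is 14.6% and the mean margin difference is 1.63 logits; after training, the flip rate drops to 4.4% and the margin difference to 0.33 (a 79.5% reduction), while accuracy moves from 84.2% to 82.3% (not significant). On PadChest Balanced, the flip rate drops from 13.6% to 7.8%, the margin difference from 1.08 to 0.35 (a 67.9% reduction), and accuracy rises from 66.4% to 69.4%.
Insight: The novelty is combining mechanistic interpretability analysis with sparse autoencoders (SAEs) and LoRA fine-tuning, using a combined loss that balances consistency and accuracy and avoids the mode collapse seen with pure consistency training. Viewed objectively, the method offers an interpretable fine-tuning strategy for improving robustness, with particular relevance to medicine.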
Abstract: Medical Vision-Language Models can give different yes or no answers to rephrasings of the same clinical question. We study this in MedGemma-4B using PSF-Med (Sadanandan and Behzadan, 2025), which provides paraphrase pairs for systematic consistency evaluation on medical VQA. On MIMIC-CXR binary questions (n = 158), the baseline flip rate is 14.6% and mean margin difference is 1.63 logits. We validate that Gemma Scope 2 Sparse Autoencoders (SAEs) transfer to MedGemma activations, achieving $R^2 \approx 0.997$ on both medical and general text (n = 100 prompts each, p < 0.001 for exceeding a 0.95 threshold). We then fine-tune Low-Rank Adaptation (LoRA) adapters with a combined loss that balances paraphrase consistency with answer accuracy. This combined approach prevents mode collapse that occurs with pure consistency training while reducing flip rate from 14.6% to 4.4% (p = 0.002, two-proportion z-test) and margin difference from 1.63 to 0.33 (79.5% reduction). Accuracy remains stable at 84.2% baseline versus 82.3% after training (-1.9pp, not significant). On PadChest Balanced (n = 250), flip rate drops from 13.6% to 7.8%, mean margin difference drops from 1.08 to 0.35 (67.9% reduction), and accuracy increases from 66.4% to 69.4%. A layer-range ablation shows that early layers reduce margin differences more than mechanistically selected middle layers.
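The flip-rate metric and the reported two-proportion z-test can be reproduced approximately with the standard library. The counts 23 and 7 out of 158 are back-computed from the reported 14.6% and 4.4% and are therefore an assumption, as is the choice of a pooled, two-sided test; the function names are illustrative.

```python
import math


def flip_rate(answers_a, answers_b):
    """Fraction of paraphrase pairs whose yes/no answers disagree."""
    flips = sum(a != b for a, b in zip(answers_a, answers_b))
    return flips / len(answers_a)


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Assumed counts: 23/158 is about 14.6%, 7/158 is about 4.4%.
z, p = two_proportion_z_test(23, 158, 7, 158)
toy_rate = flip_rate(["yes", "no", "yes", "yes"], ["yes", "yes", "yes", "no"])
```

Under these assumed counts the test gives a z statistic of roughly 3.07 and a p-value that rounds to the reported 0.002.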
[64] Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design cs.CV | cs.AIPDF
Haoxiang Sun, Tao Wang, Chenwei Tang, Li Yuan, Jiancheng Lv
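TL;DR: This paper proposes Dr. Seg, a plug-and-play GRPO-based framework for improving the training of visual large language models on perception tasks such as segmentation. It finds that the GRPO training paradigm designed for language reasoning does not transfer seamlessly to visual perception and reveals intrinsic differences between reasoning-oriented and perception-oriented settings.
Details
Motivation: Prior work transfers training paradigms built for language reasoning (such as GRPO) directly to visual large language models for perception tasks, resting on an assumption that does not hold; the paper aims to close this gap.
Result: Extensive experiments show that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.
Insight: The novelty is identifying two overlooked factors in perception tasks (the need for a broader output space and the importance of fine-grained, stable rewards) and, based on them, designing the Look-to-Confirm mechanism and the Distribution-Ranked Reward module, neither of which requires architectural changes.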
Abstract: Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long-standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr.Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr.Seg improves performance in complex visual scenarios while maintaining strong generalization. Code and models will be available at https://github.com/xVI-group-SCU/Dr-Seg.
[65] EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection cs.CV | cs.AI | cs.IRPDF
Wenxin Tang, Jingyu Xiao, Yanpei Gong, Fengyuan Ran, Tongchuan Xia
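TL;DR: This paper presents EfficientPosterGen, an end-to-end framework for automated academic poster generation that addresses three problems of existing approaches based on multimodal large language models: low information density in the input, excessive token consumption, and unreliable layout verification. It combines semantic-aware retrieval with token-efficient multimodal generation to produce high-quality posters efficiently.
Details
Motivation: Existing MLLM-based poster generation methods have three key limitations that hinder practical use: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification.
Result: Extensive experiments show that EfficientPosterGen achieves substantial gains in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation.
Insight: The core contributions are: 1) Semantic-aware Key Information Retrieval, which builds a semantic contribution graph to model inter-segment relationships and selectively keep important content; 2) Visual-based Context Compression, which renders selected text segments as images to move textual information into the visual modality, significantly reducing token usage while producing poster-ready bullet points; 3) Agentless Layout Violation Detection, a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without an auxiliary MLLM.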
Abstract: Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing Multimodal Large Language Models (MLLMs) based approaches, however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end framework that addresses these challenges through semantic-aware retrieval and token-efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic-aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter-segment relationships and selectively preserves important content; (2) Visual-based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster-ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at https://github.com/vinsontang1/EfficientPosterGen-Code.
[66] BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation cs.CVPDF
Saivan Talaei, Fatemeh Daneshfar, Abdulhady Abas Abdullah, Mustaqeem Khan
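TL;DR: This paper proposes BiCLIP, a framework for making medical image segmentation more robust in real clinical settings with scarce annotations and degraded image quality. It uses a bidirectional multimodal fusion mechanism in which visual features iteratively refine textual representations, plus an augmentation consistency objective to stabilize learning. On the QaTa-COV19 and MosMedData+ benchmarks, BiCLIP outperforms existing methods both with few labels and in the presence of clinical artifacts.
Details
Motivation: Current multimodal vision-language models are not robust enough in real clinical environments with scarce annotations and hardware-induced image degradation; the paper aims to make medical image segmentation more reliable under these conditions.
Result: On QaTa-COV19 and MosMedData+, BiCLIP consistently surpasses state-of-the-art image-only and multimodal baselines, maintains high performance when trained on as little as 1% of the labeled data, and shows significant resistance to clinical artifacts such as motion blur and low-dose CT noise.
Insight: The contributions are the bidirectional multimodal fusion mechanism (visual features iteratively refine textual representations for better semantic alignment) and the augmentation consistency objective (regularizing intermediate representations to stabilize learning). These designs improve robustness under data scarcity and image degradation and may carry over to other multimodal tasks with noisy or sparsely labeled data.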
Abstract: Medical image segmentation is a cornerstone of computer-assisted diagnosis and treatment planning. While recent multimodal vision-language models have shown promise in enhancing semantic understanding through textual descriptions, their resilience in “in-the-wild” clinical settings, characterized by scarce annotations and hardware-induced image degradations, remains under-explored. We introduce BiCLIP (Bidirectional and Consistent Language-Image Processing), a framework engineered to bolster robustness in medical segmentation. BiCLIP features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations, ensuring superior semantic alignment. To further stabilize learning, we implement an augmentation consistency objective that regularizes intermediate representations against perturbed input views. Evaluation on the QaTa-COV19 and MosMedData+ benchmarks demonstrates that BiCLIP consistently surpasses state-of-the-art image-only and multimodal baselines. Notably, BiCLIP maintains high performance when trained on as little as 1% of labeled data and exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
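[67] FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility cs.CV
Bryceton Bible, Shah Md Nehal Hasnaeen, Hairong Qi
TL;DR: This paper presents FujiView, a multimodal learning framework and dataset for predicting the visibility of natural landmarks such as Mount Fuji. The framework late-fuses webcam imagery with structured meteorological data and classifies visibility into five categories. Experiments show that YOLO-based vision features dominate short-term prediction, while weather data matters more at longer horizons; late fusion reaches an accuracy of about 0.89 for same-day prediction and up to 84% for next-day forecasts.
Details
Motivation: The visibility of natural landmarks such as Mount Fuji is central to tourism planning and visitor experience, but it is hard to predict because atmospheric conditions change quickly. The paper addresses this by fusing webcam imagery with meteorological data.
Result: On the Mount Fuji visibility task, late fusion achieves an accuracy (ACC) of about 0.89 for same-day prediction and up to 84% for next-day forecasts. YOLO-based vision features are strongest at short horizons ("nowcasting" and "samedaycasting"), while weather data becomes the primary signal for forecasts beyond one day.
Insight: The contributions are a multimodal late-fusion framework that combines image classification probabilities with numerical weather features; a dataset of more than 100,000 webcam images paired with weather data, which the authors plan to release; and the framing of Scenic Visibility Forecasting (SVF) as a new benchmark task for multimodal learning. The approach fuses the visual and meteorological modalities effectively and offers a useful direction for environmental forecasting.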
Abstract: Visibility of natural landmarks such as Mount Fuji is a defining factor in both tourism planning and visitor experience, yet it remains difficult to predict due to rapidly changing atmospheric conditions. We present FujiView, a multimodal learning framework and dataset for predicting scenic visibility by fusing webcam imagery with structured meteorological data. Our late-fusion approach combines image-derived class probabilities with numerical weather features to classify visibility into five categories. The dataset currently comprises over 100,000 webcam images paired with concurrent and forecasted weather conditions from more than 40 cameras around Mount Fuji, and continues to expand; it will be released to support further research in environmental forecasting. Experiments show that YOLO-based vision features dominate short-term horizons such as “nowcasting” and “samedaycasting”, while weather-driven forecasts increasingly take over as the primary predictive signal beyond $+1$d. Late fusion consistently yields the highest overall accuracy, achieving an accuracy (ACC) of approximately 0.89 for same-day prediction and up to 84% for next-day forecasts. These results position Scenic Visibility Forecasting (SVF) as a new benchmark task for multimodal learning.
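The late-fusion step described in the FujiView abstract can be illustrated with a minimal sketch. It assumes the simplest possible fusion head: the image model's five class probabilities are concatenated with numeric weather features and passed through a linear softmax classifier. The abstract does not specify the fusion head, so `late_fusion_predict`, the weight layout, and the two weather features (cloud cover, humidity) are hypothetical.

```python
import math

NUM_CLASSES = 5  # the abstract states five visibility categories


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_predict(image_probs, weather_feats, weights, bias):
    """Concatenate image class probabilities with weather features and
    apply one linear layer (one weight row per class) plus softmax."""
    fused = list(image_probs) + list(weather_feats)
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical weights: each class copies its own image probability,
# and class 0 ("not visible") also responds to cloud cover.
weights = [[1.0 if j == k else 0.0 for j in range(NUM_CLASSES)] + [0.0, 0.0]
           for k in range(NUM_CLASSES)]
weights[0][NUM_CLASSES] = 2.0  # assumed cloud-cover weight
bias = [0.0] * NUM_CLASSES

image_probs = [0.1, 0.6, 0.1, 0.1, 0.1]
clear_day = late_fusion_predict(image_probs, [0.0, 0.3], weights, bias)
cloudy_day = late_fusion_predict(image_probs, [0.8, 0.3], weights, bias)
clear_pred = clear_day.index(max(clear_day))
cloudy_pred = cloudy_day.index(max(cloudy_day))
```

With these illustrative weights the image branch decides the class on a clear day, while heavy cloud cover shifts the decision toward class 0, which is the kind of interaction a learned fusion head would be expected to capture.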
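[68] FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation cs.CV | cs.AI | cs.MM | cs.SD
Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn
TL;DR: FlowPortrait is a reinforcement-learning framework for audio-driven portrait video generation, built on a multimodal backbone for autoregressive audio-to-video generation. It introduces a human-aligned evaluation system based on multimodal large language models to assess lip-sync accuracy, expressiveness, and motion quality, combines these signals with perceptual and temporal consistency regularizers into a stable composite reward, and post-trains the generator with Group Relative Policy Optimization. Experiments show that FlowPortrait consistently produces higher-quality talking-head videos.
Details
Motivation: Talking-head video generation still suffers from imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception.
Result: Extensive experiments, including automatic evaluations and human preference studies, show that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
Insight: The contribution is to bring reinforcement learning to portrait animation and to design a human-aligned, MLLM-based evaluation system from which a composite reward is built. This reward guides the optimization of the generator and improves lip sync, expressiveness, and motion quality.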
Abstract: Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
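The reward construction and GRPO post-training mentioned in the FlowPortrait abstract can be sketched as follows. The group-relative advantage (each reward standardized against the other samples generated for the same prompt) is the standard GRPO normalization. The form of `composite_reward`, its equal weighting of the three MLLM scores, and `reg_weight` are assumptions made for illustration; the abstract does not give the actual formula.

```python
import statistics


def composite_reward(lip_sync, expressiveness, motion,
                     perceptual_penalty, temporal_penalty, reg_weight=0.5):
    """Hypothetical composite reward: mean of three MLLM-judged scores
    minus weighted perceptual and temporal consistency penalties."""
    judged = (lip_sync + expressiveness + motion) / 3.0
    return judged - reg_weight * (perceptual_penalty + temporal_penalty)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three candidate videos generated for the same audio clip (made-up scores).
rewards = [
    composite_reward(0.9, 0.8, 0.7, 0.1, 0.1),
    composite_reward(0.6, 0.6, 0.6, 0.1, 0.1),
    composite_reward(0.3, 0.4, 0.5, 0.1, 0.1),
]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no separate value network is needed: samples better than their group average are reinforced and the rest are penalized.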
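[69] DINOv3 Meets YOLO26 for Weed Detection in Vegetable Crops cs.CV | cs.AI
Boyang Deng, Yuzhen Lu
TL;DR: To address the scarcity of large-scale annotated datasets for weed detection in vegetable crops, this study proposes a foundational crop-weed detection model based on self-supervised learning. Heterogeneous datasets are integrated and refined through a sequential curation strategy to fine-tune a DINOv3 ViT-small, which is then integrated into YOLO26 as either the primary backbone or part of a dual-backbone architecture. A feature alignment loss is introduced in the dual-backbone framework to improve feature fusion. Experiments show clear gains on datasets from multiple seasons while retaining real-time detection.
Details
Motivation: Robust models for precision vegetable weeding are limited by the lack of large-scale annotated weed-crop datasets.
Result: On in-domain images collected in the 2025 season, mAP50 improves by up to +5.4%. Cross-domain generalization is strong, with mAP50 gains of +14.0% on the 2021-2023 season dataset and +11.9% on the 2024 season dataset, relative to the standard YOLO26-large. The model has 45.6% more parameters and 2.9x higher inference latency, but still runs in real time at about 28.5 FPS.
Insight: The contributions are a foundational detection model built by integrating heterogeneous datasets and fine-tuning DINOv3 with self-supervised learning, and a lightweight feature alignment loss that improves fusion in the dual-backbone architecture. Accuracy improves while the model remains practical to deploy. Transferable ideas include using self-supervised pretrained models for data-scarce domains and designing efficient feature fusion mechanisms.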
Abstract: Developing robust models for precision vegetable weeding is currently constrained by the scarcity of large-scale, annotated weed-crop datasets. To address this limitation, this study proposes a foundational crop-weed detection model by integrating heterogeneous datasets and leveraging self-supervised learning. A total of 618,642 crop-weed images were initially collected and subsequently refined to 199,388 filtered images for fine-tuning a DINOv3 vision transformer (ViT-small) through a sequential curation strategy. The fine-tuned DINOv3 backbone was then integrated into YOLO26, serving either as a primary backbone or part of a dual-backbone architecture. A feature alignment loss was introduced in the dual backbone framework to enhance feature fusion with minimal computational overhead. Experimental results show that the proposed DINOv3-finetuned ViT-small-based YOLO26-large achieved up to a +5.4% mAP50 gain on in-domain images collected in the 2025 season. Moreover, it demonstrated strong cross-domain generalization with mAP50 improvements of +14.0% on the 2021-2023 season dataset and +11.9% on the 2024 season dataset, compared to the standard YOLO26-large. Although the DINOv3-YOLO26-large model has 45.6% more parameters and a 2.9x increase in inference latency, it maintains real-time performance at ~28.5 frames per second (fps). The curated dataset and software programs developed in this study will be made publicly available.
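[70] SKINOPATHY AI: Smartphone-Based Ophthalmic Screening and Longitudinal Tracking Using Lightweight Computer Vision cs.CV | cs.LG
S. Kalaycioglu, C. Hong, M. Zhu, H. Xie
TL;DR: SKINOPATHY AI is a smartphone-based web application for ophthalmic screening and longitudinal tracking using lightweight computer vision. It contains five complementary, explainable screening modules: redness quantification via LAB a* color-space normalization; blink-rate estimation using the MediaPipe FaceMesh Eye Aspect Ratio (EAR) with adaptive thresholding; pupil light reflex characterization through Pupil-to-Iris Ratio (PIR) time-series analysis; scleral color indexing via LAB/HSV statistics as a proxy for icterus and anemia; and iris-landmark-calibrated lesion encroachment measurement with millimeter-scale estimates and longitudinal trend tracking. The system uses a React/FastAPI stack with OpenCV and MediaPipe, MongoDB for session persistence, and PDF report generation. All algorithms are fully deterministic, privacy-preserving, and designed for non-diagnostic consumer triage.
Details
Motivation: In low-resource and remote settings, early ophthalmic screening is limited by access to specialized equipment and trained practitioners.
Result: The paper presents the SKINOPATHY AI platform and demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones without cloud-based AI inference, laying a foundation for future clinically validated mobile ophthalmoscopy tools.
Insight: The contribution is the integration of several complementary, explainable, lightweight computer vision algorithms into a single smartphone-first application that performs privacy-preserving ophthalmic screening without cloud-based AI inference and supports longitudinal tracking. It serves as an example of how to build accessible healthcare tools for resource-limited settings.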
Abstract: Early ophthalmic screening in low-resource and remote settings is constrained by access to specialized equipment and trained practitioners. We present SKINOPATHY AI, a smartphone-first web application that delivers five complementary, explainable screening modules entirely through commodity mobile hardware: (1) redness quantification via LAB a* color-space normalization; (2) blink-rate estimation using MediaPipe FaceMesh Eye Aspect Ratio (EAR) with adaptive thresholding; (3) pupil light reflex characterization through Pupil-to-Iris Ratio (PIR) time-series analysis; (4) scleral color indexing for icterus and anemia proxies via LAB/HSV statistics; and (5) iris-landmark-calibrated lesion encroachment measurement with millimeter-scale estimates and longitudinal trend tracking. The system is implemented as a React/FastAPI stack with OpenCV and MediaPipe, MongoDB-backed session persistence, and PDF report generation. All algorithms are fully deterministic, privacy-preserving, and designed for non-diagnostic consumer triage. We detail system architecture, algorithm design, evaluation methodology, clinical context, and ethical boundaries of the platform. SKINOPATHY AI demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones without cloud-based AI inference, providing a foundation for future clinically validated mobile ophthalmoscopy tools.
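The blink-rate module in the SKINOPATHY AI abstract relies on the Eye Aspect Ratio, which has a standard six-landmark definition: the sum of two vertical eyelid distances divided by twice the horizontal eye width. The sketch below computes EAR from six 2D points and counts blinks against a threshold. The adaptive rule used here (a fixed fraction of the median EAR of the recording) is an assumption, since the abstract does not describe the paper's thresholding; landmark extraction with MediaPipe is left out entirely.

```python
import math
import statistics


def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR from six eye landmarks: p1/p4 are the horizontal corners,
    p2/p3 lie on the upper lid, p6/p5 on the lower lid."""
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def count_blinks(ear_series, ratio=0.75):
    """Count open-to-closed transitions. The threshold is a fraction of
    the median EAR (an assumed adaptive rule, not the paper's)."""
    threshold = ratio * statistics.median(ear_series)
    blinks, closed = 0, False
    for ear in ear_series:
        if ear < threshold and not closed:
            blinks += 1
            closed = True
        elif ear >= threshold:
            closed = False
    return blinks


open_eye = eye_aspect_ratio((0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1))
series = [0.30, 0.31, 0.10, 0.09, 0.30, 0.29, 0.08, 0.30]
blinks = count_blinks(series)
```

Dividing `blinks` by the recording duration in minutes gives a blink rate; tying the threshold to each user's own median EAR avoids a single hard-coded cutoff across different eye shapes and camera distances.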
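[71] ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering cs.CV
Zhaodong Wu, Haochen Xue, Qi Cao, Wenqi Mo, Yu Pei
TL;DR: ConFoThinking is a consolidated, focused-attention-driven thinking framework for improving the fine-grained perception of multimodal large language models in visual question answering. It aggregates attention signals scattered across layers into a designated intermediate layer and extracts attention from concise semantic cues, then mines and zooms in on salient regions to strengthen visual understanding.
Details
Motivation: Existing methods that crop regions of interest from attention have two main problems: attention signals are fragmented across model layers, which leads to poor localization, and attention extraction depends on the question or on redundant text, which introduces semantic noise. The paper addresses both to make visual grounding in MLLMs more reliable for VQA.
Result: Experiments on five VQA benchmarks show that ConFoThinking significantly improves perception performance. The abstract gives no specific quantitative results.
Insight: The contributions are a learned mechanism that aggregates cross-layer attention into a designated intermediate layer, and attention extraction driven by concise "what to look at" semantic cues rather than the question text, which reduces noise and makes region localization more robust. The method offers a new way to consolidate and clean attention for improving visual grounding in MLLMs.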
Abstract: Thinking with Images improves fine-grained VQA for MLLMs by emphasizing visual cues. However, tool-augmented methods depend on the capacity of grounding, which remains unreliable for MLLMs. In parallel, attention-driven methods that crop Regions of Interest (ROIs) have been proposed, but they are constrained by (1) fragmented attention signals scattered across layers, leading to suboptimal localization, and (2) reliance on question- or redundant-text-conditioned attention extraction. Our analysis reveals three patterns: MLLMs may attend to the correct region yet generate incorrect coordinates, where-to-look attention is often fragmented across layers, and attention extraction is query-sensitive. Motivated by these findings, we propose ConFoThinking, a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Moreover, we extract attention using concise semantic cues of what to look into, which mitigates the semantic noise introduced by question- or redundant-text-based attention extraction. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance. The code, checkpoints, and dataset will be released upon acceptance.
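[72] AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning cs.CV | cs.AI
Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Chengjun Xie
TL;DR: AdaFocus is a training-free framework for adaptive visual reasoning. A two-stage pipeline (a confidence-based cropping decision followed by a semantic-guided localization module) dynamically decides when and where to crop the image, addressing the perceptual redundancy and the drift between semantic intent and spatial attention found in existing training-free methods. It delivers substantial performance gains while running about 4x faster at inference than the SOTA method ZoomEyes.
Details
Motivation: Lightweight, training-free approaches for multimodal large language models (MLLMs) suffer from perceptual redundancy caused by indiscriminate cropping and from a drift between semantic intent and spatial attention. The paper aims to improve both the accuracy and the efficiency of adaptive visual reasoning.
Result: In experiments, AdaFocus achieves substantial performance gains while running about 4x faster at inference than the current state-of-the-art method ZoomEyes, improving both accuracy and efficiency.
Insight: The contribution is a two-stage adaptive cropping mechanism (when to crop and where to crop). Confidence-based decisions and semantic-guided localization cut redundant computation and align crops with the regions the user cares about, giving a scalable and efficient option for training-free visual reasoning.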
Abstract: Multimodal Large Language Models (MLLMs) are shifting towards “Thinking with Images” by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which adds overhead and noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose AdaFocus, a novel training-free framework designed for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantic-guided localization module determines where to crop. This enables adaptive visual reasoning without additional training. Experimentally, AdaFocus delivers substantial performance gains while achieving an approximately 4.0$\times$ inference speedup over the SOTA method ZoomEyes, representing a significant advance in both accuracy and efficiency.
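[73] Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model cs.CV | cs.AI | cs.LG
Simo Ryu, Chunghwan Han
TL;DR: This report describes the engineering experience of training the video foundation model Summer-22B from scratch, covering the full path from raw video collection to a functional model trained on approximately 50 million clips. It focuses on dataset engineering, multi-stage filtering, μP parameterization, and hypersphere-constrained optimization, and shares key observations such as dataset engineering dominating the effort and architectural variants differing less than expected.
Details
Motivation: Training video foundation models at scale raises systemic difficulties in dataset construction, engineering, and scalable training. The report documents practical experience and lessons for similar projects.
Result: The report provides no specific quantitative benchmark results or SOTA comparisons. It mainly shares qualitative observations from engineering practice, for example that μP hyperparameter transfer appeared to remain effective under geometric constraints.
Insight: The contribution is a systematic, end-to-end engineering account from data collection to model training, including the Lavender Data dataset management system and inference-aware architectural choices. It emphasizes that dataset engineering, rather than model architecture alone, is central to developing video foundation models.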
Abstract: We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, $μ$P parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $μ$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.
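[74] Infinite Self-Attention cs.CV
Giorgio Roffo
TL;DR: This paper proposes Infinite Self-Attention (InfSA), a spectral reformulation that treats self-attention as a diffusion process on a content-adaptive token graph and accumulates multi-hop interactions through a Neumann series, linking self-attention to classical graph centrality measures. It also proposes Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix and keeps a fixed-size auxiliary state independent of sequence length. The method reaches 84.7% top-1 accuracy on ImageNet-1K, 3.2 points above an equal-depth softmax ViT, and shows strong compute efficiency and memory scalability for high-resolution inference (for example 9216x9216).
Details
Motivation: The quadratic cost of softmax attention limits the scalability of Transformers in high-resolution vision tasks.
Result: On ImageNet-1K, Linear-InfSA in a 4-layer ViT reaches 84.7% top-1 accuracy, 3.2 points above a softmax ViT with the same configuration. On ImageNet-V2, InfViT variants reach up to 79.8% top-1, ahead of all compared baselines (best 76.8%). On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image, 13x better in throughput and energy than an equal-depth ViT, and is the only tested model to complete 9216x9216 inference without running out of memory. The linear approximation closely matches the principal eigenvector of the quadratic operator (cosine similarity 0.985).
Insight: The core idea is to reinterpret self-attention as a diffusion process on a graph and to connect it, through the Neumann series, to graph centrality theory (Katz centrality, PageRank), which yields an interpretable token weighting mechanism. Linear-InfSA achieves linear time complexity by approximating the principal eigenvector, and its fixed-size auxiliary state makes memory consumption independent of sequence length, enabling efficient training and inference at high resolution.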
Abstract: The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token’s centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension dh (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
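The discounted Neumann series in the InfSA abstract can be made concrete with a small sketch. For a discount $\gamma$ below the reciprocal of the spectral radius of $A$, the series $\sum_{k \ge 1} \gamma^k A^k$ converges to $(I - \gamma A)^{-1} - I$, and applying it to a vector of ones gives Katz-style centrality scores. The code below truncates the series and uses only matrix-vector products. Applying the transpose of a row-stochastic attention matrix (so that scores measure attention received), the toy 3-token matrix, and the names `matvec_transpose` and `katz_scores` are illustrative assumptions, not the paper's exact operator.

```python
def matvec_transpose(A, v):
    """Return A^T v for a square matrix A stored as a list of rows."""
    n = len(A)
    return [sum(A[i][j] * v[i] for i in range(n)) for j in range(n)]


def katz_scores(A, gamma=0.5, hops=50):
    """Truncated discounted Neumann series: sum_{k=1..hops} gamma^k (A^T)^k 1.
    Each loop iteration adds one more hop of propagated attention."""
    n = len(A)
    term = [1.0] * n
    scores = [0.0] * n
    for _ in range(hops):
        term = [gamma * x for x in matvec_transpose(A, term)]
        scores = [s + t for s, t in zip(scores, term)]
    return scores


# Toy row-stochastic attention matrix: every token attends mostly to token 1.
A = [
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
]
scores = katz_scores(A)
most_central = scores.index(max(scores))
```

Because each row of `A` sums to one, the k-th term carries total mass $n\gamma^k$, so the scores here sum to about $n\gamma/(1-\gamma) = 3$. The sketch never forms a power of `A` explicitly, which is the same reason a linear-time variant can avoid materializing the full attention matrix.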
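[75] Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO1.5, YOLOv11, and SAM2.1 cs.CV | cs.AI
Abhinav Munagala
TL;DR: This paper presents a dual-pipeline framework for bird image segmentation based on foundation models. Grounding DINO 1.5 and YOLOv11 handle detection, and Segment Anything Model 2.1 (SAM 2.1) serves as a shared frozen backbone, giving both a zero-shot and a supervised segmentation mode with strong results on the CUB-200-2011 dataset.
Details
Motivation: Bird image segmentation is difficult because of pose diversity, complex plumage patterns, and variable lighting. The paper uses foundation models to reduce dependence on labeled data and improve segmentation accuracy.
Result: On CUB-200-2011, the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1 0.953, outperforming all prior baselines, including SegFormer-B2 (IoU 0.842), by 7.0 percentage points. The zero-shot pipeline achieves IoU 0.831 using only a text prompt, the first such result reported on this benchmark.
Insight: The contribution is a dual-pipeline framework built on foundation models that supports zero-shot and supervised segmentation without retraining the segmentation model for new species or domains; only lightweight detector fine-tuning (about 1 hour) is needed for domain adaptation. The results indicate that prompt-based foundation model pipelines can outperform task-specific segmentation networks trained end to end.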
Abstract: Bird image segmentation remains a challenging task in computer vision due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. This paper presents a dual-pipeline framework for binary bird image segmentation leveraging 2025 foundation models. We introduce two operating modes built upon Segment Anything Model 2.1 (SAM 2.1) as a shared frozen backbone: (1) a zero-shot pipeline using Grounding DINO 1.5 to detect birds via the text prompt “bird” before prompting SAM 2.1 with bounding boxes, requiring no labelled bird data; and (2) a supervised pipeline that fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection, again prompting SAM 2.1 for pixel-level masks. The segmentation model is never retrained for new species or domains. On CUB-200-2011 (11,788 images, 200 species), the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1 0.953, outperforming all prior baselines including SegFormer-B2 (IoU 0.842) by +7.0 percentage points. The zero-shot pipeline achieves IoU 0.831 using only a text prompt, the first such result reported on this benchmark. We demonstrate that prompt-based foundation model pipelines outperform task-specific end-to-end trained segmentation networks, while requiring only lightweight detector fine-tuning (~1 hour) for domain adaptation. Complete PyTorch implementation, dataset preparation scripts, and trained weights are publicly available.
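The IoU and Dice figures reported above are related by a fixed identity on any single pair of binary masks: Dice = 2·IoU / (1 + IoU). The sketch below computes both metrics from flattened 0/1 masks and checks the identity. The helper names and toy masks are illustrative. Dataset-level scores are typically averaged per image, so the identity need not hold exactly for reported means, although IoU 0.912 does map to Dice 0.954 here.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two flattened binary masks (sequences of 0/1)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    iou = inter / union if union else 1.0
    dice = 2.0 * inter / total if total else 1.0
    return iou, dice


def dice_from_iou(iou):
    """Identity that holds for a single mask pair."""
    return 2.0 * iou / (1.0 + iou)


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
iou, dice = iou_and_dice(pred, target)
reported = round(dice_from_iou(0.912), 3)
```

In this toy example two pixels overlap out of four in the union, giving IoU 0.5 and Dice of about 0.667.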
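[76] Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression cs.CV | cs.AI | cs.LG
Bowen Zhou, Zhou Xu, Wanli Li, Jingyu Xiao, Haoqian Wang
TL;DR: This paper proposes ST-Lite, a training-free KV cache compression framework that targets the memory and latency problems of long-horizon GUI agent tasks. It compresses the KV cache with a dual-branch scoring policy that combines Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG), achieving a 2.45x decoding speedup with only a 10-20% cache budget while matching or exceeding full-cache baselines.
Details
Motivation: Existing KV cache compression methods perform poorly in GUI scenarios because GUI attention patterns show uniformly high sparsity across all Transformer layers, a fundamental mismatch with general visual tasks, where attention sparsity varies by layer.
Result: In extensive evaluations, ST-Lite achieves a 2.45x decoding speedup with only a 10-20% cache budget while maintaining performance comparable or even superior to full-cache baselines.
Insight: The key observation is that attention in GUI scenarios is uniformly highly sparse. Building on it, the authors design a training-free, dual-branch compression framework (CSS and TSG) tailored to the dynamic spatio-trajectory dependencies in GUI data streams, offering a scalable solution for resource-constrained GUI agents.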
Abstract: Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG). Specifically, CSS preserves the structural integrity of interactive UI elements by evaluating local neighborhood saliency, while TSG mitigates historical redundancy by dynamically filtering visually repetitive KV pairs within the interaction trajectory. Extensive evaluations demonstrate that with only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines, offering a scalable solution for resource-constrained GUI agents.
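The cache-budget idea in the ST-Lite abstract can be illustrated with a generic score-based eviction sketch: given one importance score per cached token, keep only the top fraction and preserve the original token order. How ST-Lite actually computes and combines its CSS and TSG scores is not specified in the abstract, so `scores` is treated here as an opaque input and `compress_kv_cache` is a hypothetical helper.

```python
def compress_kv_cache(keys, values, scores, budget=0.2):
    """Keep the highest-scoring fraction of KV entries.
    Returns the kept keys, kept values, and kept indices in original order."""
    n = len(scores)
    keep = max(1, int(n * budget))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)  # restore positional order for the attention computation
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Ten cached tokens with made-up importance scores.
keys = [f"k{i}" for i in range(10)]
values = [f"v{i}" for i in range(10)]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.4, 0.05, 0.6, 0.7]
kept_keys, kept_values, kept_idx = compress_kv_cache(keys, values, scores)
```

With a 20% budget only two of the ten entries survive, which is where the memory saving comes from; decoding then attends over the shortened cache.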
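[77] SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models cs.CV | cs.AI | cs.CR
Yang Yang, Xinze Zou, Zehua Ma, Han Fang, Weiming Zhang
TL;DR: This paper proposes SKeDA, a generative watermarking framework designed for text-to-video diffusion models. Its two components, Shuffle-Key-based Distribution-preserving Sampling (SKe) and Differential Attention (DA), address the frame-alignment dependence and temporal distortions that arise when image watermarking methods are extended directly to video. The framework preserves high generation fidelity while markedly improving watermark robustness to frame reordering, frame loss, and compression.
Details
Motivation: The rise of text-to-video generation models has intensified concerns about content authenticity, copyright protection, and malicious misuse. Extending existing image watermarking methods directly to video has two key limitations. First, they rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, so extraction becomes unreliable once alignment is broken (for example by frame reordering or loss). Second, video-specific distortions such as inter-frame compression significantly degrade watermark reliability.
Result: Extensive experiments show that SKeDA preserves high video generation quality while significantly improving watermark robustness.
Insight: The paper makes two contributions. (1) SKe uses a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation, turning watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation and improving robustness to frame reordering and loss. (2) DA computes inter-frame differences and dynamically adjusts attention weights during extraction, improving robustness to temporal distortions. The design ties watermark embedding closely to the temporal properties of video and addresses challenges specific to video watermarking.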
Abstract: The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and frame-dependent pseudo-random binary sequences used for watermark encryption, so once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation. This design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.
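[78] Stateful Token Reduction for Long-Video Hybrid VLMs cs.CV | cs.AI
Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu
TL;DR: This paper studies token reduction for long-video hybrid vision-language models (VLMs) and proposes a low-to-high progressive reduction schedule together with a unified language-aware scoring mechanism, accelerating inference while preserving accuracy.
Details
Motivation: Existing token reduction methods are designed mainly for dense Transformers and do not suit hybrid architectures that interleave attention with linear-time state-space blocks such as Mamba. A reduction strategy that handles such architectures is needed.
Result: Under an aggressive compression setting that retains 25% of visual tokens, the method delivers 3.8-4.2x prefilling speedups at test time with near-baseline accuracy. Light finetuning further improves performance on long-context video benchmarks.
Insight: The contributions are a progressive reduction schedule derived from an analysis of layerwise sparsity and importance stability, and a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention proxy for Mamba), which together enable all-layer token reduction in hybrid architectures.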
Abstract: Token reduction is an effective way to accelerate long-video vision-language models (VLMs), but most existing methods are designed for dense Transformers and do not directly account for hybrid architectures that interleave attention with linear-time state-space blocks (e.g., Mamba). We study query-conditioned token reduction for hybrid video VLMs and analyze reduction behavior through two properties: layerwise sparsity (how many tokens capture query-relevant information) and importance stability (whether token-importance rankings persist across depth). Although token importance is sparse within each layer, the set of important tokens changes across layers, so aggressive early pruning is unreliable. Motivated by this, we propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention proxy for Mamba), enabling all-layer token reduction in hybrids. Under an aggressive compression setting (retaining 25% of visual tokens), our approach delivers substantial prefilling speedups (3.8–4.2x) with near-baseline accuracy at test time, and light finetuning under reduction further improves performance on long-context video benchmarks.
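[79] TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models cs.CV | cs.AI
Daniel Nobrega Medeiros
TL;DR: This paper introduces the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains (spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology). It provides dual-track evaluation: a generative track in which models must produce solution images that are verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. The release contains 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification.
Details
Motivation: Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. The TACIT Benchmark aims to overcome these limitations with a broader, objective, programmatic evaluation framework.
Result: The paper releases version 0.1.0 of the TACIT Benchmark, with 6,000 puzzles (108,000 PNG images across three resolutions) and fully deterministic generation and verification. It is released on HuggingFace under the Apache 2.0 license (DOI: 10.57967/hf/7904).
Insight: The contributions are: 1) a programmatic benchmark design that avoids dependence on natural language and subjective scoring; 2) dual-track evaluation (generative and discriminative) that tests model capabilities more fully; 3) carefully designed distractors, each violating exactly one structural constraint, which force fine-grained visual reasoning rather than exploitation of superficial cues; 4) broad coverage of reasoning domains (6 domains, 10 tasks). Deterministic generation and verification improve reproducibility and objectivity, giving visual reasoning models a stricter test bed.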
Abstract: Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: 10.57967/hf/7904).
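[80] VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models cs.CV | cs.AI
Soumya Suvra Ghosal, Youngeun Kim, Zhuowei Li, Ritwick Chaudhry, Linghan Xu
TL;DR: This paper proposes VisRef, a framework that uses visual refocusing to dynamically re-inject, during reasoning, visual tokens that are semantically relevant to the current reasoning context and globally representative of the image. This improves test-time scaling of multimodal large reasoning models on vision-dependent tasks and keeps the model from over-relying on textual priors while ignoring visual information.
Details
Motivation: Large reasoning models improve on complex reasoning tasks by scaling test-time compute through longer reasoning chains. In vision-dependent tasks, however, long textual reasoning causes the model to gradually lose attention to visual tokens and over-rely on textual priors, which degrades performance. Existing RL-based fine-tuning or refocusing methods are computationally expensive, so a lightweight test-time scaling method that needs no additional RL fine-tuning is required.
Result: On three visual reasoning benchmarks with state-of-the-art multimodal large reasoning models, VisRef consistently outperforms existing test-time scaling methods under fixed test-time compute budgets, by up to 6.4%.
Insight: The contribution is a visual refocusing framework that requires no RL fine-tuning. It guides reasoning by dynamically selecting and re-injecting a subset of visual tokens that is semantically relevant yet diverse, strengthening the visual grounding of multimodal reasoning and making better use of test-time compute.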
Abstract: Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
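One way to picture the selection of a coreset that is "semantically relevant to the reasoning context while remaining diverse" is a greedy relevance-minus-redundancy rule, similar to maximal marginal relevance. The sketch below is only an illustration under that assumption. It is not the authors' selection algorithm, and `select_coreset`, `trade_off`, and the toy 2D embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_coreset(tokens, context, k, trade_off=0.5):
    """Greedily pick k token indices, rewarding similarity to the reasoning
    context and penalizing similarity to tokens already selected."""
    selected = []
    candidates = list(range(len(tokens)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(tokens[i], context)
            redundancy = max((cosine(tokens[i], tokens[j]) for j in selected),
                             default=0.0)
            return trade_off * relevance - (1.0 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Tokens 0 and 1 are near-duplicates; token 2 covers a different direction.
tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
context = [1.0, 0.2]
chosen = select_coreset(tokens, context, k=2)
```

The most relevant token is picked first. Its near-duplicate is then penalized, so the second slot goes to the token that covers a different part of the image, which reflects the relevance-plus-diversity trade-off described in the abstract.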
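[81] Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization cs.CV
He Li, Wenyue He, Weihang Kong, Xingchen Zhang
TL;DR: This paper proposes AP-PCO, a joint position-color optimization framework for generating adversarial patches against visual-infrared dense prediction tasks. By optimizing patch position and color composition together, a single patch can perturb both the visible and infrared modalities. A cross-modal color adaptation strategy reduces cross-spectral saliency, and the method supports flexible black-box attacks.
Details
Motivation: Existing adversarial patch methods mainly target single-modal inputs and cannot handle the cross-spectral inconsistencies caused by heterogeneous spectral characteristics and modality-specific intensity distributions in visual-infrared perception systems, which reduces attack effectiveness and stealthiness.
Result: Extensive experiments on visual-infrared dense prediction tasks show that AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation of VI perception systems.
Insight: The contributions are the joint position-color optimization framework and the cross-modal color adaptation strategy. The patch is optimized with a fitness function derived from model outputs, so black-box attacks are possible without internal model information, and the approach handles cross-spectral inconsistency while improving stealthiness.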
Abstract: Multimodal adversarial attacks for dense prediction remain largely underexplored. In particular, visual-infrared (VI) perception systems introduce unique challenges due to heterogeneous spectral characteristics and modality-specific intensity distributions. Existing adversarial patch methods are primarily designed for single-modal inputs and fail to account for cross-spectral inconsistencies, leading to reduced attack effectiveness and poor stealthiness when applied to VI dense prediction models. To address these challenges, we propose a joint position-color optimization framework (AP-PCO) for generating adversarial patches in visual-infrared settings. The proposed method optimizes patch placement and color composition simultaneously using a fitness function derived from model outputs, enabling a single patch to perturb both visible and infrared modalities. To further bridge spectral discrepancies, we introduce a cross-modal color adaptation strategy that constrains patch appearance according to infrared grayscale characteristics while maintaining strong perturbations in the visible domain, thereby reducing cross-spectral saliency. The optimization procedure operates without requiring internal model information, supporting flexible black-box attacks. Extensive experiments on visual-infrared dense prediction tasks demonstrate that the proposed AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation in VI perception systems.
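[82] Seeking Necessary and Sufficient Information from Multimodal Medical Data cs.CV
Boyu Chen, Weiye Bao, Junjie Liu, Michael Shen, Bo Peng
TL;DR: This paper proposes a method for learning necessary and sufficient feature representations from multimodal medical data. It decomposes multimodal representations into modality-invariant and modality-specific components and uses the Probability of Necessity and Sufficiency (PNS) as the learning objective to improve model performance and robustness.
Details
Motivation: Existing multimodal models overlook features that are both necessary (must be present for the outcome to occur) and sufficient (enough to determine the outcome), even though such features are important for capturing key predictive information and for robustness to missing modalities.
Result: Experiments on synthetic and real-world medical datasets demonstrate the method's effectiveness, with improved model performance and greater robustness to missing modalities.
Insight: The contribution is the extension of the PNS learning objective to the multimodal setting. Decomposing the representation resolves the violated conditions of PNS estimation and offers a new route to interpretable and robust multimodal medical features.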
Abstract: Learning multimodal representations from medical images and other data sources can provide richer information for decision-making. While various multimodal models have been developed for this, they overlook learning features that are both necessary (must be present for the outcome to occur) and sufficient (enough to determine the outcome). We argue learning such features is crucial as they can improve model performance by capturing essential predictive information, and enhance model robustness to missing modalities as each modality can provide adequate predictive signals. Such features can be learned by leveraging the Probability of Necessity and Sufficiency (PNS) as a learning objective, an approach that has proven effective in unimodal settings. However, extending PNS to multimodal scenarios remains underexplored and is non-trivial as key conditions of PNS estimation are violated. We address this by decomposing multimodal representations into modality-invariant and modality-specific components, then deriving tractable PNS objectives for each. Experiments on synthetic and real-world medical datasets demonstrate our method’s effectiveness. Code will be available on GitHub.
[83] Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees cs.CV
Arya Fayyazi, Haleh Akrami
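TL;DR: This paper presents Proof-of-Perception (PoP), a framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, giving calibrated, stepwise uncertainty, and a lightweight controller allocates resources under a compute budget, calling extra tools only when needed and stopping early otherwise. This reduces error compounding and hallucinations and enables principled accuracy-compute trade-offs.
Details
Motivation: Multimodal reasoning suffers from error propagation, hallucinations, and a lack of verifiable reliability guarantees. The paper aims to improve reasoning accuracy and efficiency through a structured, certifiable tool-use framework.
Result: On document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using compute more efficiently.
Insight: The novelty lies in formalizing multimodal reasoning as an executable graph with conformal guarantees. Stepwise uncertainty calibration and a budget-aware controller provide verifiable evidence grounding and adaptive compute allocation, giving tool-using systems a framework with certified reliability.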
Abstract: We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
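The idea of a node that emits a conformal set, with a controller that expands only when the set is ambiguous, can be illustrated with standard split conformal prediction. This is a minimal sketch, not PoP's implementation: the function names, the nonconformity score (1 minus predicted probability), and the `max_size` controller rule are all assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the corrected quantile
    if k > n:
        return float("inf")  # too few calibration points: every label must be kept
    return sorted(cal_scores)[k - 1]


def conformal_set(probs, threshold):
    """Keep every label whose nonconformity score (1 - probability) is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


def needs_more_tools(pred_set, max_size=1):
    """Hypothetical controller rule: spend extra compute only when the set is ambiguous."""
    return len(pred_set) > max_size


# Nonconformity scores of the true labels on a held-out calibration split (made-up values).
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.80]
threshold = conformal_threshold(cal_scores, alpha=0.25)

confident = conformal_set({"cat": 0.90, "dog": 0.07, "fox": 0.03}, threshold)
ambiguous = conformal_set({"cat": 0.48, "dog": 0.45, "fox": 0.07}, threshold)
print(confident, needs_more_tools(confident))
print(sorted(ambiguous), needs_more_tools(ambiguous))
```

A singleton set lets the controller stop early, while a larger set signals that another tool call may be worth its cost; how PoP composes such guarantees across a graph is described in the paper, not here.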
[84] Diffusion-Based Low-Light Image Enhancement with Color and Luminance Priors cs.CV
Xuanshuo Fu, Lei Kang, Javier Vazquez-Corral
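TL;DR: This paper proposes a low-light image enhancement method based on a conditional diffusion model. A Structured Control Embedding Module (SCEM) decomposes the low-light image into four informative components (illumination, illumination-invariant features, shadow priors, and color-invariant cues) that serve as control signals guiding a U-Net diffusion model. Trained only on LOLv1, the model achieves state-of-the-art performance on multiple benchmarks.
Details
Motivation: Low-light images commonly suffer from low contrast, noise, and color distortion. Addressing these problems improves visual quality and downstream vision task performance.
Result: Without fine-tuning, the method reaches state-of-the-art (SOTA) quantitative and perceptual metrics on LOLv2-real, LSRW, DICM, MEF, and LIME, showing strong generalization.
Insight: The novelty is the SCEM, which decomposes physical image priors into conditioning signals for the diffusion model. Combining physical priors with a generative model adds structured guidance and interpretability to the enhancement process.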
Abstract: Low-light images often suffer from low contrast, noise, and color distortion, degrading visual quality and impairing downstream vision tasks. We propose a novel conditional diffusion framework for low-light image enhancement that incorporates a Structured Control Embedding Module (SCEM). SCEM decomposes a low-light image into four informative components: illumination, illumination-invariant features, shadow priors, and color-invariant cues. These components serve as control signals that condition a U-Net-based diffusion model trained with a simplified noise-prediction loss. Thus, the proposed SCEM-equipped diffusion method enforces structured enhancement guided by physical priors. In experiments, our model is trained only on the LOLv1 dataset and evaluated without fine-tuning on LOLv2-real, LSRW, DICM, MEF, and LIME. The method achieves state-of-the-art performance in quantitative and perceptual metrics, demonstrating strong generalization across benchmarks. https://casted.github.io/scem/.
[85] Percept-Aware Surgical Planning for Visual Cortical Prostheses with Vascular Avoidance cs.CV
Galen Pogoncheff, Alvin Wang, Jacob Granley, Michael Beyeler
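TL;DR: This paper presents a percept-aware framework for surgical planning of visual cortical prostheses. Electrode placement is formulated as a constrained optimization problem in anatomical space, and electrode coordinates are optimized end-to-end through a differentiable forward model to minimize task-level perceptual error under vascular avoidance and gray matter feasibility constraints.
Details
Motivation: Existing planning strategies for cortical visual prostheses emphasize visual field coverage and anatomical heuristics and do not directly optimize predicted perceptual outcomes under safety constraints. A planning method is therefore needed that optimizes perceptual performance directly while ensuring safety.
Result: Evaluated on simulated reading and natural image tasks with realistic folded cortical geometry (FreeSurfer fsaverage), percept-aware optimization consistently improves reconstruction fidelity over coverage-based placement, and vascular safety constraints eliminate margin violations while preserving perceptual performance.
Insight: The novelty is treating electrode coordinates as learnable parameters optimized end-to-end through a differentiable percept model. This enables task-driven, safety-constrained electrode placement in anatomical space and provides a foundation for optimizing next-generation visual prostheses.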
Abstract: Cortical visual prostheses aim to restore sight by electrically stimulating neurons in early visual cortex (V1). With the emergence of high-density and flexible neural interfaces, electrode placement within three-dimensional cortex has become a critical surgical planning problem. Existing strategies emphasize visual field coverage and anatomical heuristics but do not directly optimize predicted perceptual outcomes under safety constraints. We present a percept-aware framework for surgical planning of cortical visual prostheses that formulates electrode placement as a constrained optimization problem in anatomical space. Electrode coordinates are treated as learnable parameters and optimized end-to-end using a differentiable forward model of prosthetic vision. The objective minimizes task-level perceptual error while incorporating vascular avoidance and gray matter feasibility constraints. Evaluated on simulated reading and natural image tasks using realistic folded cortical geometry (FreeSurfer fsaverage), percept-aware optimization consistently improves reconstruction fidelity relative to coverage-based placement strategies. Importantly, vascular safety constraints eliminate margin violations while preserving perceptual performance. The framework further enables co-optimization of multi-electrode thread configurations under fixed insertion budgets. These results demonstrate how differentiable percept models can inform anatomically grounded, safety-aware computer-assisted planning for cortical neural interfaces and provide a foundation for optimizing next-generation visual prostheses.
[86] SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning cs.CV
Yi Zhang, Youya Xia, Yong Wang, Meng Song, Xin Wu
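TL;DR: This paper proposes SSR, a framework that improves the spatial intelligence of multimodal large language models through structured scene reasoning. It integrates 2D and 3D representations via a lightweight alignment mechanism and uses a scene graph generation pipeline to build language-model-friendly structural scaffolds that support complex spatial reasoning.
Details
Motivation: Current multimodal large language models excel at semantic tasks but lack the "spatial sense" needed for fine-grained geometric reasoning, and they face high modality-alignment costs and insufficient precision in fine-grained structural modeling.
Result: At the 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench and significantly outperforming much larger models.
Insight: The novelties include lightweight alignment through cross-modal addition and token interleaving, which reduces training overhead, and a scene graph generation method based on a chain of local triplets defined by relative coordinates, which builds structured representations of complex environments and improves spatial reasoning.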
Abstract: While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the “spatial sense” essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and deficiencies in fine-grained structural modeling precision. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model’s pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct “language-model-friendly” structural scaffolds for complex environments. Furthermore, we extend these capabilities to the global-scale 3D global grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
[87] PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models cs.CV
Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan
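TL;DR: This paper proposes PointAlign, a feature-level alignment regularization method that addresses the degradation of geometric information in 3D vision-language models (VLMs) caused by the scarcity of paired 3D-text data. It explicitly supervises intermediate point cloud tokens during language modeling, aligning them with the visual input tokens to preserve fine-grained 3D geometric-semantic information.
Details
Motivation: Existing 3D VLMs rely solely on next-token prediction loss and use only language tokens for supervision, which uses limited 3D data inefficiently and causes significant degradation and loss of geometric information in intermediate representations.
Result: Extensive experiments on ModelNet40 and Objaverse show an average improvement of 2.08 pp on classification tasks, a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task, and a 4.88 pp improvement on 3D object captioning as evaluated by Qwen2-72B-Instruct, validating the method.
Insight: The novelty is feature-level alignment regularization: a consistency loss constrains intermediate point cloud tokens to align with the visual input tokens, explicitly supervising and preserving geometric information. Viewed objectively, the method trains only a lightweight alignment projector and LoRA adapters, so it adds little computational overhead and is an efficient way to prevent geometric degradation.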
Abstract: The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at https://github.com/yharoldsu0627/PointAlign.
[88] Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models cs.CV | cs.AI
Hulingxiao He, Zhi Tan, Yuxin Peng
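TL;DR: This paper proposes TARA (Taxonomy-Aware Representation Alignment), which injects taxonomic knowledge into large multimodal models (LMMs) to improve hierarchical visual recognition (HVR), especially for novel categories unseen in training. It aligns visual features with the hierarchical representations of biology foundation models (BFMs) and flexibly aligns the answer representation with the ground-truth label, improving the consistency and accuracy of coarse-to-fine category predictions.
Details
Motivation: Current LMMs perform well on fine-grained visual recognition (FGVR) for known categories but remain limited in HVR, particularly in predicting consistent coarse-to-fine label paths and recognizing novel categories.
Result: Experiments show that TARA consistently improves LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies.
Insight: The core novelty is exploiting the rich biological relationships encoded by BFMs pretrained with hierarchical contrastive learning. Representation alignment injects taxonomic structure into the LMM's visual feature extraction, and a flexible mechanism bridges visual features and category labels of varying granularity, which handles hierarchical recognition and generalization to novel categories.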
Abstract: A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels and identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs’ hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. Code is available at https://github.com/PKU-ICST-MIPL/TARA_CVPR2026.
[89] TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis cs.CV | cs.AI
Hui Wan, Libin Lan
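TL;DR: This paper proposes TAP-SLF, a parameter-efficient fine-tuning framework for adapting vision foundation models (VFMs) to multi-task analysis of medical ultrasound images. Task-aware soft prompts encode task-specific priors into the input sequence, and LoRA is applied selectively to the top layers of the encoder, so only a small fraction of parameters is updated while the pretrained backbone stays frozen. A shared backbone thus efficiently supports segmentation, classification, detection, and regression.
Details
Motivation: Multi-task medical image analysis suffers from poor model generalization and difficulty optimizing shared feature representations. Full fine-tuning on limited medical data leads to overfitting and high computational cost, and existing parameter-efficient fine-tuning methods ignore task specificity and the varying sensitivity of model layers.
Result: TAP-SLF placed fifth on the FMC_UIA 2026 Challenge test set. Together with evaluations on the officially released training dataset using an 8:2 split, the results indicate that task-aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.
Insight: The novelty is combining task-aware soft prompts with selective LoRA fine-tuning of specific top layers (the higher encoder layers) in a unified multi-task adaptation framework. It accounts for task specificity and exploits the differing sensitivity of model layers during fine-tuning, enabling efficient, targeted adaptation to diverse medical tasks on a shared backbone.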
Abstract: Executing multiple tasks simultaneously in medical image analysis, including segmentation, classification, detection, and regression, often introduces significant challenges regarding model generalizability and the optimization of shared feature representations. While Vision Foundation Models (VFMs) provide powerful general representations, full fine-tuning on limited medical data is prone to overfitting and incurs high computational costs. Moreover, existing parameter-efficient fine-tuning approaches typically adopt task-agnostic adaptation protocols, overlooking both task-specific mechanisms and the varying sensitivity of model layers during fine-tuning. In this work, we propose Task-Aware Prompting and Selective Layer Fine-Tuning (TAP-SLF), a unified framework for multi-task ultrasound image analysis. TAP-SLF incorporates task-aware soft prompts to encode task-specific priors into the input token sequence and applies LoRA to selected top layers of the encoder. This strategy updates only a small fraction of the VFM parameters while keeping the pre-trained backbone frozen. By combining task-aware prompts with selective high-layer fine-tuning, TAP-SLF enables efficient VFM adaptation to diverse medical tasks within a shared backbone. Results on the FMC_UIA 2026 Challenge test set, where TAP-SLF places fifth, combined with evaluations on the officially released training dataset using an 8:2 train-test split, demonstrate that task-aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.
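The combination of task-aware prompts and LoRA on only the top layers can be sketched in a few lines of dependency-free Python. This is an illustrative toy under stated assumptions, not the TAP-SLF code: the class `LoRALinear`, the helpers `adapt_top_layers` and `prepend_task_prompt`, the rank, the scaling, and the layer count are all hypothetical, and real encoders would use a tensor library.

```python
import random


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen, shape d_out x d_in
        # A is small random, B starts at zero, so the update is exactly zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = [sum(w * v for w, v in zip(row, x)) for row in self.weight]
        down = [sum(a * v for a, v in zip(row, x)) for row in self.A]
        up = [sum(b * d for b, d in zip(row, down)) for row in self.B]
        return [y + self.scale * u for y, u in zip(base, up)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def adapt_top_layers(layer_weights, num_top, rank=2, alpha=4.0):
    """Wrap only the last `num_top` layers with LoRA; earlier layers stay plain frozen weights."""
    cut = len(layer_weights) - num_top
    return [LoRALinear(w, rank, alpha, seed=i) if i >= cut else w
            for i, w in enumerate(layer_weights)]


def prepend_task_prompt(tokens, task_prompts, task):
    """Put the task's learnable prompt tokens in front of the image tokens."""
    return task_prompts[task] + tokens


identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layers = adapt_top_layers([identity] * 6, num_top=2)
adapted = [layer for layer in layers if isinstance(layer, LoRALinear)]
print(len(adapted), "adapted layers,", adapted[0].trainable_params(), "trainable params each")
```

Because `B` starts at zero, the adapted layers initially reproduce the frozen backbone exactly, and only the small `A` and `B` matrices (plus the prompt tokens) would receive gradient updates during training.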
[90] Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models cs.CV
April Fu
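TL;DR: This paper proposes ICLA, an internal self-correction mechanism that uses layer attention acting directly on hidden states during generation to mitigate hallucinations in large vision-language models (LVLMs). A diagonal cross-layer attention mechanism lets each layer selectively retrieve information from all preceding layers, enabling self-refinement without external correction signals. With only 0.2M and 0.1M additional trained parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, respectively, it consistently improves visual grounding across multiple hallucination benchmarks.
Details
Motivation: Despite substantial progress in LVLMs, hallucination (generated text not grounded in the visual input) remains a challenge. As models become stronger, previously reported hallucination patterns such as linguistic bias and the overthinking phenomenon become far less consistent, making the corresponding mitigation techniques much less effective. The paper aims to mitigate hallucinations more effectively with an internal self-correction mechanism that draws directly on the model's internal information.
Result: The method consistently improves visual grounding across multiple hallucination benchmarks. On LLaVA1.5-7B and Qwen2.5-VL-7B it is effective with only 0.2M and 0.1M additional parameters, respectively, indicating that it applies to more advanced LVLMs.
Insight: The abstract's claimed novelty is ICLA, an internal self-correction mechanism that applies diagonal cross-layer attention directly to hidden states during generation, enabling self-refinement without external signals. Viewed objectively, the contributions are: 1) shifting the focus of hallucination mitigation from external correction to self-correction inside the model; 2) a lightweight cross-layer attention mechanism that improves performance with very few extra parameters; 3) an effective solution to the inconsistency of hallucination patterns in more advanced LVLMs.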
Abstract: Although Large Vision-Language Models (LVLMs) have made substantial progress, hallucination, where generated text is not grounded in the visual input, remains a challenge. As LVLMs become stronger, previously reported hallucination patterns, such as linguistic bias and the overthinking phenomenon, become far less consistent, making the corresponding mitigation techniques substantially less effective. In this paper, we introduce an Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without any external correction signals. By introducing and training only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, ICLA consistently improves visual grounding across multiple hallucination benchmarks, demonstrating its effectiveness for more advanced LVLMs.
[91] SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment cs.CV
Zhuoran Zhao, Xianghao Kong, Linlin Yang, Zheng Wei, Pan Hui
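TL;DR: This paper proposes SesaHand, which enhances controllable hand image generation through semantic and structural alignment to improve 3D hand reconstruction. It uses Chain-of-Thought inference to extract human behavior semantics from image captions generated by a vision-language model, integrates structural information of different granularity through hierarchical structural fusion, and adds a hand structure attention enhancement mechanism to focus on hand regions. Experiments show it outperforms existing methods in both generation quality and 3D hand reconstruction.
Details
Motivation: Existing 3D hand reconstruction methods mostly rely on game engines to synthesize training data, which lacks diversity in textures and environments and omits crucial components such as arms or interacting objects. Generative models can produce diverse hand images but still suffer from misalignment.
Result: Experiments show that the method outperforms prior work in generation performance and that using the generated images improves 3D hand reconstruction.
Insight: The novelties are: extracting human behavior semantics via Chain-of-Thought inference to suppress human-irrelevant environmental details; hierarchical structural fusion to align structural information at multiple granularities; and a hand structure attention enhancement mechanism that increases the model's attention to hand regions. Together they offer a semantic and structural alignment approach for controllable generation and 3D reconstruction.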
Abstract: Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures sufficient human-centric contexts for hand image generation. For structural alignment, we introduce hierarchical structural fusion to integrate structural information with different granularity for feature refinement to better align the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model’s attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.
[92] Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution cs.CV
Bin Chen, Weiqi Li, Shijie Zhao, Xuanyu Zhang, Junlin Li
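TL;DR: This paper proposes an improved adversarial diffusion compression method for real-world video super-resolution. It distills DOVE, a large diffusion Transformer teacher with 3D spatio-temporal attention, into a pruned 2D Stable Diffusion-based backbone augmented with lightweight 1D temporal convolutions, achieving higher efficiency. It also introduces a dual-head adversarial distillation scheme in which discriminators in the pixel and feature domains explicitly disentangle the discrimination of details and consistency, so both objectives can be optimized effectively.
Details
Motivation: Existing diffusion models for real-world video super-resolution rely on multi-step sampling, which makes inference slow. One-step networks are faster but still have large parameter counts and high latency. Applying adversarial diffusion compression directly to this task fails to balance spatial detail and temporal consistency because it lacks temporal awareness and is limited by standard adversarial learning.
Result: Experiments show that the compressed AdcVSR model reduces parameters by 95% and achieves an 8× speedup over its DiT teacher DOVE while maintaining competitive video quality and efficiency.
Insight: The novelties are distilling a teacher with 3D spatio-temporal attention into a 2D backbone augmented with lightweight 1D temporal convolutions for efficiency, and a dual-head adversarial distillation scheme that explicitly disentangles detail and consistency discrimination in the pixel and feature domains so the two objectives are optimized in balance. Viewed objectively, the method offers an effective strategy for handling spatio-temporal information in model compression and acceleration.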
Abstract: While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this by condensing generation into a single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher, DOVE, equipped with 3D spatio-temporal attention, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other. Experiments demonstrate that the resulting compressed AdcVSR model reduces complexity by 95% in parameters and achieves an 8× acceleration over its DiT teacher DOVE, while maintaining competitive video quality and efficiency.
[93] ReMoT: Reinforcement Learning with Motion Contrast Triplets cs.CV
Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai
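TL;DR: This paper proposes ReMoT, a unified training paradigm that systematically addresses the fundamental shortcomings of vision-language models (VLMs) in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT has two core components: a rule-based automatic framework that generates the large-scale motion-contrast dataset ReMoT-16K, and Group Relative Policy Optimization for efficiently learning contrastive reasoning. It achieves state-of-the-art performance on a new fine-grained motion contrast benchmark and on multiple standard VLM benchmarks.
Details
Motivation: VLMs have a fundamental weakness in spatio-temporal consistency reasoning, especially on tasks that require distinguishing subtle motion attributes (e.g., opposing directions), which is critical for navigation, robotics, and autonomous driving.
Result: The model reaches state-of-the-art performance on the new fine-grained motion contrast triplet benchmark and strong results on multiple standard VLM benchmarks, with a 25.1% performance gain on spatio-temporal reasoning tasks.
Insight: The novelties are: 1) a rule-based automatic framework that generates a large-scale motion-contrast dataset, avoiding costly manual or model-based generation; 2) Group Relative Policy Optimization, which far exceeds standard supervised fine-tuning in performance and data efficiency for learning contrastive reasoning; 3) the first benchmark for evaluating a VLM's discrimination of fine-grained motion attributes.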
Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency – a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM’s discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
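The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value function is needed. The sketch below shows only that advantage computation. The binary `triplet_reward`, the use of population standard deviation, and the `eps` constant are assumptions for illustration; the abstract does not specify ReMoT's reward design.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def triplet_reward(predicted_index, correct_index):
    """Hypothetical rule-based reward: 1 if the response picks the right clip, else 0."""
    return 1.0 if predicted_index == correct_index else 0.0


# Four sampled responses to one motion-contrast question whose correct answer is index 2.
picks = [2, 0, 2, 1]
rewards = [triplet_reward(p, correct_index=2) for p in picks]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Correct responses receive positive advantages and incorrect ones negative, while a group in which every response earns the same reward yields all-zero advantages and therefore no gradient signal.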
[94] OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation cs.CV | cs.AI
Zhaolin Yu, Litao Yang, Ben Babicka, Ming Hu, Jing Hao
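TL;DR: This paper proposes OPGAgent, a multi-tool agentic system for auditable interpretation of dental panoramic X-rays (OPGs). It coordinates specialized perception modules with a consensus mechanism, decomposing OPG analysis into global, quadrant, and tooth-level phases and invoking tools dynamically. The authors also propose OPG-Bench, a structured-report evaluation benchmark built from real clinical reports.
Details
Motivation: Although vision-language models (VLMs) allow multi-task OPG analysis through natural language, they underperform task-specific models on most individual tasks. Agentic systems that orchestrate specialized tools have not yet been explored in dental imaging, although this approach promises both versatility and accuracy.
Result: On the proposed OPG-Bench and the public MMOral-OPG benchmark, OPGAgent outperforms current dental VLMs and medical agent frameworks in both structured-report and visual question answering (VQA) evaluation.
Insight: The main novelties are: 1) an evidence-gathering module that decomposes the analysis hierarchically and invokes tools dynamically; 2) a toolbox encapsulating spatial, detection, utility, and expert tools; 3) a consensus subagent that resolves conflicts through anatomical constraints. In addition, the structured-report protocol based on (Location, Field, Value) triples offers a new way to comprehensively evaluate findings and hallucinations, going beyond the limits of conventional VQA metrics.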
Abstract: Orthopantomograms (OPGs) are the standard panoramic radiograph in dentistry, used for full-arch screening across multiple diagnostic tasks. While Vision Language Models (VLMs) now allow multi-task OPG analysis through natural language, they underperform task-specific models on most individual tasks. Agentic systems that orchestrate specialized tools offer a path to both versatility and accuracy, but this approach remains unexplored in the field of dental imaging. To address this gap, we propose OPGAgent, a multi-tool agentic system for auditable OPG interpretation. OPGAgent coordinates specialized perception modules with a consensus mechanism through three components: (1) a Hierarchical Evidence Gathering module that decomposes OPG analysis into global, quadrant, and tooth-level phases with dynamically invoked tools, (2) a Specialized Toolbox encapsulating spatial, detection, utility, and expert zoos, and (3) a Consensus Subagent that resolves conflicts through anatomical constraints. We further propose OPG-Bench, a structured-report protocol based on (Location, Field, Value) triples derived from real clinical reports, which enables a comprehensive review of findings and hallucinations, extending beyond the limitations of VQA indicators. On our OPG-Bench and the public MMOral-OPG benchmark, OPGAgent outperforms current dental VLMs and medical agent frameworks across both structured-report and VQA evaluation. Code will be released upon acceptance.
[95] DreamWorld: Unified World Modeling in Video Generation cs.CV
Boming Tan, Xiangdong Zhang, Ning Liao, Yuqing Zhang, Shaofeng Zhang
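TL;DR: DreamWorld is a unified video generation framework that integrates multiple complementary kinds of world knowledge (such as physical commonsense, 3D, and temporal consistency) through a Joint World Modeling Paradigm to improve the world consistency of generated video. It uses Consistent Constraint Annealing and Multi-Source Inner-Guidance to address visual instability in multi-objective optimization, and it outperforms Wan2.1 on VBench.
Details
Motivation: Existing video generation models focus on surface-level plausibility and lack a coherent, unified understanding of the world. They typically incorporate only a single form of world knowledge or rely on rigid alignment strategies, and cannot jointly model multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency).
Result: On VBench, DreamWorld performs strongly on consistency and outperforms Wan2.1 by 2.26 points, indicating better world consistency.
Insight: The novelty is the Joint World Modeling Paradigm, which integrates heterogeneous world knowledge (temporal dynamics, spatial geometry, and semantic consistency) into video generation. Consistent Constraint Annealing and Multi-Source Inner-Guidance mitigate the visual instability caused by multi-objective optimization, yielding more unified and coherent world modeling.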
Abstract: Despite impressive progress in video generation, existing models remain limited to surface-level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning a single form of world knowledge is insufficient to constitute a world model, which requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency). To address this limitation, we introduce DreamWorld, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm, jointly predicting video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose Consistent Constraint Annealing (CCA) to progressively regulate world-level constraints during training, and Multi-Source Inner-Guidance to enforce learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available at https://github.com/ABU121111/DreamWorld.
[96] U-VLM: Hierarchical Vision Language Modeling for Report Generation cs.CV
Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu
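TL;DR: This paper proposes U-VLM for automated radiology report generation. Its hierarchical vision-language modeling addresses two limitations of existing models: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of the language model, losing multi-scale information.
Details
Motivation: The paper targets two key limitations of existing vision-language models for 3D medical imaging report generation: no use of segmentation-pretrained encoders, and visual feature injection only at the language model's input layer, which loses multi-scale information.
Result: U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and on AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection), using only a 0.1B decoder trained from scratch and outperforming approaches built on 7B+ pretrained language models.
Insight: The novelties are progressive training (segmentation, then classification, then report generation) and multi-layer visual injection (routing U-Net encoder features to corresponding language model layers). Different training stages can use datasets without unified annotations, and the results suggest that well-designed vision encoder pretraining outweighs the benefits of large pretrained language models.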
Abstract: Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at https://github.com/yinghemedical/U-VLM.
[97] TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications cs.CV | cs.IT
Feibo Jiang, Siwei Tu, Li Dong, Xiaolong Li, Kezhi Wang
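TL;DR: This paper proposes TaiChi, a vision-language model framework for multimodal, multitask token communications. A dual-visual tokenizer architecture processes high- and low-resolution images to jointly capture pixel-level details and global conceptual features; a Bilateral Attention Network fuses multi-scale visual tokens; and a Kolmogorov Arnold Network-based modality projector achieves precise nonlinear alignment from visual features to the text semantic space. TaiChi is then integrated into a multimodal, multitask token communication system equipped with a joint VLM-channel coding scheme.
Details
Motivation: Existing vision-language models are constrained in token communications by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment.
Result: Experiments validate the superior performance of TaiChi and the feasibility and effectiveness of the TaiChi-driven token communication system.
Insight: The novelties are the dual-visual tokenizer architecture, the Bilateral Attention Network, and the KAN-based modality projector, designed to improve visual understanding, compress tokens, and improve cross-modal alignment. They may inform efficiency improvements in multimodal communication systems.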
Abstract: Visual-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communications. TaiChi adopts a dual-visual tokenizer architecture that processes both high- and low-resolution images to collaboratively capture pixel-level details and global conceptual features. A Bilateral Attention Network (BAN) is introduced to intelligently fuse multi-scale visual tokens, thereby enhancing visual understanding and producing compact visual tokens. In addition, a Kolmogorov Arnold Network (KAN)-based modality projector with learnable activation functions is employed to achieve precise nonlinear alignment from visual features to the text semantic space, thus minimizing information loss. Finally, TaiChi is integrated into a multimodal and multitask token communication system equipped with a joint VLM-channel coding scheme. Experimental results validate the superior performance of TaiChi, as well as the feasibility and effectiveness of the TaiChi-driven token communication system.
[98] RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment cs.CV | cs.AI
Liyao Jiang, Ruichen Chen, Chao Gao, Di Niu
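TL;DR: This paper proposes RAISE, a training-free, requirement-driven evolutionary framework for improving alignment in text-to-image generation. It formulates image generation as a requirement-driven adaptive scaling process: at inference time it evolves a population of candidate images through refinement actions (such as prompt rewriting, noise resampling, and instructional editing) and allocates compute dynamically according to a structured checklist of requirements.
Details
Motivation: Existing text-to-image diffusion models align prompts and images poorly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods use fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned methods need carefully curated datasets and extensive joint fine-tuning, tend to overfit, and transfer poorly across models.
Result: RAISE reaches state-of-the-art alignment on GenEval and DrawBench (0.94 overall on GenEval), with 30-40% fewer generated samples and 80% fewer vision-language model calls than prior scaling and reflection-tuned baselines.
Insight: The novelty is combining an evolutionary algorithm with structured requirement verification to allocate compute adaptively at inference time. Being training-free and model-agnostic, it provides an efficient, generalizable framework for multi-round self-improvement that dynamically identifies unmet requirements and refines them in a targeted way.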
Abstract: Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while incurring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
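The control flow described above (verify against a checklist, refine only what is unmet, stop as soon as everything passes) can be sketched with a toy loop. Everything here is a stand-in: a candidate is modeled as the set of requirements it satisfies, `refine` succeeds at random instead of calling a generator or editor, and `unmet` replaces a VLM verifier. It is not the RAISE implementation.

```python
import random


def unmet(candidate, checklist):
    """Checklist items the candidate does not yet satisfy (stand-in for a VLM verifier)."""
    return [req for req in checklist if req not in candidate]


def refine(candidate, requirement, rng):
    """Stand-in for one refinement action; it fixes the targeted item with probability 0.5."""
    return candidate | {requirement} if rng.random() < 0.5 else candidate


def evolve(checklist, population_size=4, max_rounds=10, seed=0):
    """Evolve candidates until the checklist is satisfied or the round budget runs out."""
    rng = random.Random(seed)
    best = frozenset()
    calls = 0
    for round_idx in range(max_rounds):
        missing = unmet(best, checklist)
        if not missing:
            return best, round_idx, calls  # early stop: easy prompts cost fewer rounds
        children = []
        for _ in range(population_size):
            # Spend compute only on requirements that are still unsatisfied.
            children.append(refine(best, rng.choice(missing), rng))
            calls += 1
        # Keep the parent in the pool so the best candidate never gets worse.
        best = min(children + [best], key=lambda c: len(unmet(c, checklist)))
    return best, max_rounds, calls


checklist = ["two cats", "a red ball", "cat left of ball"]
best, rounds, calls = evolve(checklist)
print(sorted(best), rounds, calls)
```

The number of rounds, and hence the number of refinement calls, depends on how many checklist items remain unmet, which is the adaptive-budget behavior the abstract contrasts with fixed iteration budgets.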
[99] Random Wins All: Rethinking Grouping Strategies for Vision Tokens cs.CVPDF
Qihang Fan, Yuang Ai, Huaibo Huang, Ran He
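TL;DR: This paper proposes a simple random grouping strategy to replace the complex token grouping methods used in vision Transformers. Experiments show that random grouping outperforms existing carefully designed strategies across multiple baseline models and downstream tasks, and the paper analyzes where its advantages come from.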
Details
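Motivation: To address the quadratic complexity of vision Transformers, prior work has proposed many carefully designed token grouping strategies. The authors question whether such complexity is necessary and ask whether a simpler, more unified alternative exists.
Result: Experiments on multiple baselines show that random grouping outperforms almost all other grouping methods, with a more pronounced advantage on downstream tasks such as object detection. Its effectiveness is also validated across modalities, including visual tasks, point clouds, and vision-language models.
Insight: The paper identifies four key elements of grouping-strategy design: positional information, head feature diversity, global receptive field, and a fixed grouping pattern. As long as these conditions are met, an extremely simple random grouping handles a variety of vision tasks efficiently, which challenges the need for complex grouping designs.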
Abstract: Since Transformers were introduced into vision architectures, their quadratic complexity has been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens and then either performing self-attention within each group or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods? To answer them, we propose a simple and fast random grouping strategy for vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping outperforms almost all other grouping methods. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. In response to this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: positional information, head feature diversity, global receptive field, and fixed grouping pattern. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks. We also validate the effectiveness of our proposed random method across multiple modalities, including visual tasks, point cloud processing, and vision-language models. Code will be available at https://github.com/qhfan/random.
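To make the idea concrete, here is a minimal sketch of random token grouping with a fixed pattern. It is an illustration only: the abstract does not say how the pattern is fixed, so the use of a seeded shuffle, the function names, and the pair-counting cost model are all assumptions rather than the paper's code.

```python
import random
from typing import List


def make_random_groups(num_tokens: int, group_size: int, seed: int = 0) -> List[List[int]]:
    """Shuffle token indices once with a fixed seed and cut them into groups."""
    if num_tokens % group_size != 0:
        raise ValueError("num_tokens must be divisible by group_size in this sketch")
    order = list(range(num_tokens))
    # A fixed seed gives the same grouping pattern on every call (assumption:
    # this is one way to obtain a "fixed grouping pattern").
    random.Random(seed).shuffle(order)
    return [order[i:i + group_size] for i in range(0, num_tokens, group_size)]


def attention_pairs(groups: List[List[int]]) -> int:
    """Count query-key pairs when self-attention runs only inside each group."""
    return sum(len(g) * len(g) for g in groups)


groups = make_random_groups(num_tokens=16, group_size=4, seed=0)
full_cost = 16 * 16                     # global self-attention over 16 tokens
grouped_cost = attention_pairs(groups)  # four groups of four tokens each
```

Because each group draws tokens from anywhere in the sequence, every group can span the whole image, while the number of attention pairs falls from 256 to 64 in this toy setting.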
[100] ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models cs.CV | cs.AI | cs.GR | cs.LGPDF
Riccardo de Lutio, Tobias Fischer, Yen-Yu Chang, Yuxuan Zhang, Jay Zhangjie Wu
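TL;DR: This paper proposes ArtiFixer, a two-stage pipeline for enhancing and extending 3D reconstruction. It first trains a powerful bidirectional generative model with a novel opacity mixing strategy that keeps outputs consistent with existing observations while retaining the ability to extrapolate. It then distills that model into a causal auto-regressive model that generates hundreds of frames in a single pass, which can directly produce novel views or serve as pseudo-supervision to improve the 3D representation. The method generates plausible reconstructions in scenarios where existing approaches fail completely and outperforms existing baselines on benchmark datasets by a wide margin, with PSNR gains of 1-3 dB.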
Details
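Motivation: Existing methods that use generative priors to correct 3D reconstruction artifacts have two shortcomings. The first is scalability: they rely on image diffusion models or bidirectional video models that generate only a limited number of views per pass and therefore need costly iterative distillation. The second is quality: generated outputs are inconsistent with existing scene content and fail in completely unobserved regions.
Result: On commonly used benchmark datasets, the method outperforms all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR, and it generates plausible reconstructions in scenarios where existing approaches fail completely.
Insight: The contributions are: 1) a novel opacity mixing strategy for training the bidirectional generative model, which balances consistency with observations against the ability to extrapolate; and 2) distillation of the bidirectional model into a causal auto-regressive model that scales efficiently to hundreds of frames in a single pass and can be used directly for view generation or as pseudo-supervision that simplifies improving the 3D representation.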
Abstract: Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability, as existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself, as generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To solve these, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model’s ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR.
[101] COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation cs.CVPDF
Yuchen Che, Jingtu Wu, Hao Zheng, Asako Kanezaki
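TL;DR: This paper proposes COG, an unsupervised framework for 6DoF pose estimation of novel objects from a single reference view. It formulates correspondence estimation as a confidence-aware optimal transport problem: point-wise confidences are predicted and used as the optimal transport marginals to produce balanced soft correspondences and suppress non-overlapping regions. Semantic priors from vision foundation models regularize the correspondences, yielding stable pose estimation.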
Details
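Motivation: When estimating the pose of a novel object from a single reference view, existing methods struggle to find robust cross-view correspondences because of occlusions, viewpoint changes, and outliers. They also typically rely on discrete one-to-one matching, which is non-differentiable and tends to collapse onto sparse keypoints.
Result: Experiments show that unsupervised COG performs comparably to supervised methods, while supervised COG outperforms them.
Insight: Casting correspondence estimation as a confidence-aware optimal transport problem, with confidences serving as marginal constraints to produce balanced soft correspondences, and regularizing with semantic priors from vision foundation models enables stable pose estimation under unsupervised learning.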
Abstract: Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, view-point changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse key-points. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as optimal transport marginals, suppressing non-overlapping regions. Semantic priors from vision foundation models further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the correspondence finding and pose estimation pipeline, enabling unsupervised learning. Experiments show unsupervised COG achieves comparable performance to supervised methods, and supervised COG outperforms them.
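One way to picture "confidences as optimal transport marginals" is entropic optimal transport solved with Sinkhorn iterations, sketched below. The abstract does not state which solver COG uses, so the Sinkhorn scheme, the function name `sinkhorn_plan`, the regularization strength `eps`, and the toy cost matrix are all assumptions made for illustration.

```python
import math
from typing import List


def sinkhorn_plan(
    cost: List[List[float]],
    src_conf: List[float],
    tgt_conf: List[float],
    eps: float = 0.5,
    iters: int = 200,
) -> List[List[float]]:
    """Soft correspondence matrix whose row/column sums follow the confidences.

    Confidences are assumed strictly positive; they are normalised so that
    both marginals carry unit total mass.
    """
    a = [c / sum(src_conf) for c in src_conf]
    b = [c / sum(tgt_conf) for c in tgt_conf]
    n, m = len(a), len(b)
    # Gibbs kernel: low matching cost -> high affinity.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: point 2 on each side is given low confidence (e.g. non-overlapping).
toy_cost = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
plan = sinkhorn_plan(toy_cost, src_conf=[0.45, 0.45, 0.10], tgt_conf=[0.45, 0.45, 0.10])
```

In the resulting plan, the low-confidence point receives only its small share of the total mass, so it contributes little to any downstream pose fit, while the confident points keep soft, differentiable correspondences.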
[102] M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval cs.CVPDF
Dawei Yan, Haokui Zhang, Guangda Huzhang, Yang Li, Yibo Wang
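TL;DR: This paper proposes M², a training-free framework that uses a dual-memory augmentation mechanism to improve web agents based on multimodal large language models on long-horizon tasks. It compresses interaction history through dynamic trajectory summarization and retrieves guidance from an offline knowledge base through insight retrieval augmentation, improving context efficiency and decision-making robustness.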
Details
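Motivation: Current MLLM-based web agents face high computational costs and insufficient reasoning capabilities on long-horizon tasks. The work aims to remove this bottleneck while reducing reliance on extensive data collection and model training.
Result: On the WebVoyager and OnlineMind2Web benchmarks, M² significantly surpasses baselines. With Qwen3-VL-32B it achieves up to a 19.6% increase in success rate and a 58.7% reduction in tokens, while proprietary models such as Claude gain up to 12.5% in accuracy with significantly lower computational overhead.
Insight: The novelty is a dual-tier memory mechanism that combines internal memory (dynamic trajectory summarization) with external memory (insight retrieval augmentation). It is a training-free architecture that manages long contexts effectively and improves the efficiency and robustness of agent decisions in complex web navigation tasks.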
Abstract: Agents based on Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies often rely heavily on extensive data collection and model training, yet still struggle with high computational costs and insufficient reasoning capabilities when facing complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory) to compress verbose interaction history into concise state updates, and Insight Retrieval Augmentation (External Memory) to guide the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations across WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains up to 12.5% alongside significantly lower computational overhead.
[103] What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models cs.CV | cs.AIPDF
Yingqi Fan, Junlong Tong, Anhao Zhao, Xiaoyu Shen
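TL;DR: Using the proposed EmbedLens analysis framework, this paper reveals semantic sparsity and processing redundancy in the visual tokens of multimodal large language models (MLLMs). Input visual tokens fall into sink, dead, and alive categories; only the alive tokens, about 60% of the input, carry image-specific semantics, and they already encode rich fine-grained information before entering the LLM. For most standard tasks, internal visual computation (such as attention) is redundant. For the few highly vision-centric tasks, alive tokens naturally align with intermediate LLM layers, indicating that shallow-layer processing is unnecessary and direct mid-layer injection suffices.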
Details
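Motivation: Although MLLMs project visual tokens into the embedding space of the language model, how visual semantics are structured and processed internally remains unclear. This paper analyzes the encoding properties and processing efficiency of visual tokens in depth.
Result: On a targeted patch-compression benchmark, alive tokens are shown to encode fine-grained cues such as objects, colors, and OCR before entering the LLM. Internal visual computation proves redundant for standard tasks, and for highly vision-centric tasks direct mid-layer injection is sufficient without shallow-layer processing.
Insight: The novelty is EmbedLens, a fine-grained probing tool that exposes the semantic sparsity of visual tokens (only alive tokens are informative) and the redundancy of their processing (most internal computation is unnecessary). This gives mechanistic insight for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection.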
Abstract: Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, EmbedLens, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising approximately 60% of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most standard tasks. For the small subset of highly vision-centric tasks that actually benefit from internal processing, we reveal that alive tokens naturally align with intermediate LLM layers rather than the initial embedding space, indicating that shallow-layer processing is unnecessary and that direct mid-layer injection is sufficient. Ultimately, our findings provide a unified mechanistic view of visual token processing, paving the way for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection. The code is released at: https://github.com/EIT-NLP/EmbedLens.
[104] Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning cs.CV | cs.LGPDF
Ruoshuang Du, Xin Sun, Qiang Liu, Bowen Song, Zhongqi Chen
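TL;DR: This paper proposes MMA-RAG, a multimodal adaptive retrieval-augmented generation framework that addresses reliability problems caused by hallucination in visual question answering systems. It analyzes the model’s internal representations to dynamically assess confidence in the model’s own knowledge and decides whether to incorporate retrieved external knowledge, balancing external knowledge utilization against inference robustness.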
Details
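Motivation: Hallucination in visual question answering systems produces answers that are inconsistent with the visual input or with factual knowledge, and existing static retrieval-augmented generation methods can introduce irrelevant or conflicting evidence that is visually similar but semantically incorrect.
Result: Experiments on three VQA datasets show a significant improvement in response performance, and ablation studies highlight the importance of internal representations for adaptive retrieval decisions.
Insight: The novelty is an adaptive decision mechanism based on internal representation learning, which jointly analyzes internal visual and textual representations to guide the use of reverse image retrieval and thereby fuse external knowledge more intelligently.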
Abstract: Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments demonstrated that the model achieves a significant improvement in response performance in three VQA datasets. Meanwhile, ablation studies highlighted the importance of internal representations in adaptive retrieval decisions. In general, the experimental results demonstrated that MMA-RAG effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
[105] Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding cs.CVPDF
Wang Chen, Yuhui Zeng, Yongdong Luo, Tianyu Xie, Luojun Lin
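TL;DR: This paper proposes WFS-SB, a training-free framework for frame selection in long video understanding. It applies a wavelet transform to the query-frame similarity signal to detect semantic boundaries, segments the video into coherent clips, and adaptively allocates a frame budget to each clip based on clip importance and diversity, with the goal of improving how large vision-language models understand a video’s overall narrative structure.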
Details
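Motivation: Existing frame selection methods usually pick only frames that are highly relevant to the query and ignore the video’s narrative structure, which yields a disjointed set of frames. The work starts from the observation that effective video understanding requires not only high relevance but, more importantly, capturing the pivotal semantic shifts where the narrative changes.
Result: Extensive experiments on benchmarks including VideoMME, MLVU, and LongVideoBench show that WFS-SB significantly improves large vision-language model performance, raising accuracy by 5.5%, 9.5%, and 6.2% respectively and consistently outperforming state-of-the-art methods.
Insight: The novelty is to recast frame selection as detecting semantic boundaries that capture narrative change, and to use the multi-resolution analysis of the wavelet transform to robustly extract a clean semantic change signal. This enables segmentation of the video into coherent clips and adaptive frame selection based on importance and diversity.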
Abstract: Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
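The boundary-detection step can be illustrated with a small sketch. It assumes a Haar wavelet and takes the coarsest-scale approximation (repeated pairwise averaging) as the "clean" signal; the abstract specifies neither the wavelet family nor which coefficients are used, so these choices, the function names, and the toy similarity signal are illustrative assumptions. Budget allocation and Maximal Marginal Relevance selection are left out.

```python
from typing import List


def haar_approximation(signal: List[float], levels: int) -> List[float]:
    """Coarse Haar approximation: average neighbouring pairs `levels` times."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels in this sketch")
    coarse = list(signal)
    for _ in range(levels):
        coarse = [(coarse[i] + coarse[i + 1]) / 2.0 for i in range(0, len(coarse), 2)]
    return coarse


def local_extrema(values: List[float]) -> List[int]:
    """Indices of strict interior local maxima and minima."""
    found = []
    for i in range(1, len(values) - 1):
        is_max = values[i] > values[i - 1] and values[i] > values[i + 1]
        is_min = values[i] < values[i - 1] and values[i] < values[i + 1]
        if is_max or is_min:
            found.append(i)
    return found


def semantic_boundaries(similarity: List[float], levels: int) -> List[int]:
    """Map extrema of the coarse signal back to frame indices."""
    scale = 2 ** levels
    return [i * scale for i in local_extrema(haar_approximation(similarity, levels))]


# Toy query-frame similarity: four plateaus plus alternating high-frequency noise.
base = [0.1] * 4 + [0.9] * 4 + [0.1] * 4 + [0.5] * 4
similarity = [b + (0.02 if i % 2 == 0 else -0.02) for i, b in enumerate(base)]
boundaries = semantic_boundaries(similarity, levels=2)
```

On the raw toy signal almost every frame is a local extremum because of the noise, whereas the coarse signal yields only two boundaries, which is the motivation the abstract gives for working at the coarsest scale.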
[106] MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence cs.CVPDF
Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, Xiaodong Cun
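TL;DR: MLLM-4D is a comprehensive framework that aims to give multimodal large language models (MLLMs) 4D spatial-temporal understanding and reasoning from purely 2D visual inputs. An efficient data pipeline converts existing stereo video datasets into high-quality 4D spatiotemporal instruction data, and the model is post-trained with supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT) based on Group Relative Policy Optimization (GRPO), without modifying the model architecture.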
Details
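Motivation: Humans are born with vision-based 4D spatial-temporal intelligence, but this capability is a significant bottleneck for current multimodal large language models. The work addresses this challenge by improving a model’s ability to perceive and reason about how 3D space evolves over time from purely visual inputs.
Result: Extensive experiments show that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning from purely 2D RGB inputs.
Insight: The main contributions are: 1) an efficient data curation pipeline that repurposes existing stereo video datasets to build high-quality 4D spatiotemporal instruction datasets (MLLM4D-2M, MLLM4D-R1-30k) and an evaluation benchmark (MLLM4D-Bench); and 2) a post-training strategy requiring no architectural changes, which combines specialized Spatiotemporal Chain of Thought (ST-CoT) prompting with spatiotemporal reward functions (ST-reward), establishing foundational understanding through SFT and strengthening reasoning through GRPO.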
Abstract: Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward) without involving the modification of architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Project page: https://github.com/GVCLab/MLLM-4D.
[107] Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training cs.CVPDF
Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang
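TL;DR: This paper proposes Vision-TTT, a new model that brings Test-Time Training, a linear-time sequence modeling method, to vision tasks. Using a bidirectional scan strategy and a Conv2d module, it compresses the visual token sequence in a self-supervised manner and models 2D visual correlations with a global receptive field. Experiments show strong accuracy on ImageNet classification and downstream tasks, with computational efficiency far beyond Transformer models such as DeiT.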
Details
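Motivation: Vision Transformers (ViTs) are computationally inefficient in practice because of the quadratic complexity of self-attention. This work addresses that challenge by seeking visual representations that are both efficient and expressive.
Result: On ImageNet classification, Vittt-T/S/B reach 77.3%, 81.2%, and 82.5% Top-1 accuracy respectively, and they greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, Vittt-T uses 79.4% fewer FLOPs, runs 4.38x faster, and uses 88.9% less memory than DeiT-T, showing its potential as a next-generation generic visual backbone.
Insight: The novelty is to introduce the linear-time sequence modeling method Test-Time Training (TTT) into vision for the first time and to extend it with bidirectional scanning and a Conv2d module so that it models 2D visual correlations. This offers a way to cut computational complexity substantially while keeping a global receptive field, and is a useful reference for designing efficient visual backbones.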
Abstract: Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address the challenge, we introduce a new linear-time sequence modeling method, Test-Time Training (TTT), into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating a bidirectional scan strategy and the Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that Vittt-T/S/B achieve 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet classification and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, Vittt-T reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.
[108] Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation cs.CVPDF
Zhen Zhou, Jian Liu, Biwen Lei, Jing Xu, Haohan Weng
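TL;DR: This paper proposes Mesh-Pro, an asynchronous advantage-guided ranking preference optimization method for artist-style quadrilateral mesh generation. It improves training efficiency with an asynchronous online reinforcement learning framework, introduces the ARPO algorithm to balance efficiency and generalization, and combines a diagonal-aware mixed triangular-quadrilateral tokenization with a ray-based reward, reaching SOTA performance on artistic and dense mesh generation.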
Details
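Motivation: Existing 3D mesh generation methods based on offline direct preference optimization (DPO) suffer from low training efficiency and limited generalization. This work aims to improve both the training efficiency and the generation quality of reinforcement learning for 3D mesh generation.
Result: Mesh-Pro achieves state-of-the-art performance on artistic and dense mesh generation. The asynchronous framework is 3.75x faster than synchronous RL, and ARPO strikes a better balance between training efficiency and generalization.
Insight: The contributions are the first asynchronous online RL framework designed for 3D mesh generation, the ARPO algorithm, a diagonal-aware mixed triangular-quadrilateral tokenization, and a ray-based reward for geometric integrity, which together improve generation efficiency and quality.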
Abstract: Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO), which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored to efficient post-training for 3D mesh generation, which is 3.75$\times$ faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.
[109] CaptionFool: Universal Image Captioning Model Attacks cs.CV | cs.AIPDF
Swapnil Parekh
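TL;DR: This paper proposes CaptionFool, a universal adversarial attack on Transformer-based image captioning models. By modifying only about 1.2% of the patches in an image, it forces the model to generate arbitrary target captions (including offensive content) with a 94-96% success rate, and it can produce “slang” terms that evade existing content moderation filters.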
Details
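Motivation: Image captioning models with encoder-decoder architectures trained on large-scale image-text datasets are susceptible to adversarial attacks; this work studies their security vulnerabilities.
Result: On SOTA Transformer-based image captioning models, modifying 7 of 577 image patches (about 1.2%) achieves a 94-96% success rate, generating arbitrary target captions as well as slang that evades content moderation.
Insight: The paper proposes an input-agnostic universal adversarial attack, exposes critical vulnerabilities in deployed vision-language models, and underscores the urgency of developing robust defenses. The attack is highly efficient (the modified region is very small) and can be designed to evade content moderation systems.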
Abstract: Image captioning models are encoder-decoder architectures trained on large-scale image-text datasets, making them susceptible to adversarial attacks. We present CaptionFool, a novel universal (input-agnostic) adversarial attack against state-of-the-art transformer-based captioning models. By modifying only 7 out of 577 image patches (approximately 1.2% of the image), our attack achieves 94-96% success rate in generating arbitrary target captions, including offensive content. We further demonstrate that CaptionFool can generate “slang” terms specifically designed to evade existing content moderation filters. Our findings expose critical vulnerabilities in deployed vision-language models and underscore the urgent need for robust defenses against such attacks. Warning: This paper contains model outputs which are offensive in nature.
[110] RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation cs.CVPDF
Xianhao Zhou, Jianghao Wu, Lanfeng Zhong, Ku Zhao, Jinlong He
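TL;DR: This paper proposes a retrieval-augmented flow matching method for unpaired CBCT-to-CT translation in medical imaging. It uses a DINOv3 encoder and a global CT memory bank to build retrieval-guided pseudo pairs that stabilize flow-based unpaired training, and it outperforms existing methods on the SynthRAD2023 dataset.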
Details
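Motivation: CBCT is widely used in radiotherapy but suffers from artifacts and unreliable HU values, while paired CBCT-CT data are often unavailable or unreliable because of temporal gaps, anatomical changes, and registration errors. Effective unpaired translation methods are therefore needed.
Result: On the SynthRAD2023 dataset, RAFM outperforms existing methods on FID, MAE, SSIM, PSNR, and SegScore, reaching SOTA level.
Insight: The novelty is to bring rectified flow to unpaired medical image translation and to use a retrieval-augmented strategy to resolve training instability under small datasets and limited batch sizes. A pretrained encoder and a global memory bank improve pseudo-pair quality, suggesting a new direction for flow-based generative models in medicine.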
Abstract: Cone-beam CT (CBCT) is routinely acquired in radiotherapy but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, limiting its direct use for dose calculation. Synthetic CT (sCT) generation from CBCT is therefore an important task, yet paired CBCT–CT data are often unavailable or unreliable due to temporal gaps, anatomical variation, and registration errors. In this work, we introduce rectified flow (RF) into unpaired CBCT-to-CT translation in medical imaging. Although RF is theoretically compatible with unpaired learning through distribution-level coupling and deterministic transport, its practical effectiveness under small medical datasets and limited batch sizes remains underexplored. Direct application with random or batch-local pseudo pairing can produce unstable supervision due to semantically mismatched endpoint samples. To address this challenge, we propose Retrieval-Augmented Flow Matching (RAFM), which adapts RF to the medical setting by constructing retrieval-guided pseudo pairs using a frozen DINOv3 encoder and a global CT memory bank. This strategy improves empirical coupling quality and stabilizes unpaired flow-based training. Experiments on SynthRAD2023 under a strict subject-level true-unpaired protocol show that RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore. The code is available at https://github.com/HiLab-git/RAFM.git.
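The two ingredients named in this abstract, retrieval-guided pseudo pairing and rectified-flow supervision, can be sketched as below. This is an illustration under assumptions rather than the released RAFM code: plain Python lists stand in for frozen-encoder embeddings and image tensors, cosine similarity is assumed as the retrieval metric, and the function names are hypothetical. The interpolation x_t = (1 - t) x_0 + t x_1 with target velocity x_1 - x_0 is the standard rectified-flow formulation.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_partner(query_emb: List[float], bank_embs: List[List[float]]) -> int:
    """Index of the memory-bank CT embedding most similar to the CBCT query."""
    return max(range(len(bank_embs)), key=lambda k: cosine(query_emb, bank_embs[k]))


def rectified_flow_sample(
    x0: List[float], x1: List[float], t: float
) -> Tuple[List[float], List[float]]:
    """Point on the straight path from x0 (source) to x1 (target) and its velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


# Toy memory bank of CT embeddings and one CBCT query embedding (made-up values).
ct_bank = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
partner = retrieve_partner([0.9, 0.1], ct_bank)

# Toy "images": the CBCT sample is x0, the retrieved CT sample is x1.
x_t, velocity = rectified_flow_sample([0.0, 0.0], [2.0, 4.0], t=0.25)
```

Pairing each CBCT sample with a semantically close CT sample, instead of a random one from the batch, is what the abstract credits with stabilizing the velocity targets during unpaired training.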
[111] Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation cs.CVPDF
Yafei Zhang, Shuaitian Song, Huafeng Li, Shujuan Wang, Yu Liu
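TL;DR: This paper proposes an adaptive dynamic dehazing framework with a closed-loop optimization mechanism. During inference it adjusts its output based on downstream task performance and user instructions, so that the model meets the specific needs of multiple downstream tasks without retraining.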
Details
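Motivation: In real-world vision systems, dehazing must not only improve image visibility but also meet the specific needs of diverse downstream tasks, and existing methods cannot adapt dynamically to different task requirements.
Result: Extensive experiments across various vision tasks show strong effectiveness, robustness, and generalizability, establishing a new paradigm for interactive, task-adaptive dehazing.
Insight: The novelty is a dual-guidance strategy that integrates a task feedback loop with a text instruction interface, allowing dehazing behavior to adapt dynamically after training. The closed-loop optimization mechanism could be reused in multi-task vision systems.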
Abstract: In real-world vision systems, haze removal is required not only to enhance image visibility but also to meet the specific needs of diverse downstream tasks. To address this challenge, we propose a novel adaptive dynamic dehazing framework that incorporates a closed-loop optimization mechanism. It enables feedback-driven refinement based on downstream task performance and user instruction-guided adjustment during inference, allowing the model to satisfy the specific requirements of multiple downstream tasks without retraining. Technically, our framework integrates two complementary and innovative mechanisms: (1) a task feedback loop that dynamically modulates dehazing outputs based on performance across multiple downstream tasks, and (2) a text instruction interface that allows users to specify high-level task preferences. This dual-guidance strategy enables the model to adapt its dehazing behavior after training, tailoring outputs in real time to the evolving needs of multiple tasks. Extensive experiments across various vision tasks demonstrate the strong effectiveness, robustness, and generalizability of our approach. These results establish a new paradigm for interactive, task-adaptive dehazing that actively collaborates with downstream applications.
[112] Multiple Inputs and Mixwd data for Alzheimer’s Disease Classification Based on 3D Vision Transformer cs.CVPDF
Juan A. Castro-Silva, Maria N. Moreno Garcia, Diego H. Peluffo-Ordoñez
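TL;DR: This paper proposes MIMD-3DVT, a new method for Alzheimer’s disease classification. It uses a 3D Vision Transformer to process consecutive brain MRI slices and preserve 3D context, fuses imaging data from multiple 3D regions of interest (ROIs), and integrates mixed data from demographics, cognitive assessments, and brain imaging. On a combined ADNI, AIBL, and OASIS dataset, the method reaches 97.14% accuracy in distinguishing normal cognition from Alzheimer’s disease, surpassing the best existing methods.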
Details
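Motivation: Existing MRI-based methods for diagnosing Alzheimer’s disease are limited in three ways. Many studies use 2D Transformers to analyze individual brain slices independently, potentially losing critical 3D context. ROI-based models usually focus on a few brain regions, although Alzheimer’s disease affects many. And most classification models rely on a single test, whereas accurate diagnosis requires integrating multiple data sources.
Result: On the combined ADNI, AIBL, and OASIS dataset, MIMD-3DVT (using single or multiple ROIs) reaches 97.14% accuracy in classifying normal cognition versus Alzheimer’s disease, outperforming the state-of-the-art (SOTA) methods.
Insight: The novelty is a 3D Vision Transformer framework that integrates multiple inputs and mixed data: 1) it processes consecutive slices to capture feature dimensions and spatial information; 2) it fuses multiple 3D ROI imaging inputs; and 3) it integrates multimodal data from demographics, cognitive assessments, and brain imaging. This suggests a way to combine 3D structural information with multi-source clinical data in medical image analysis.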
Abstract: Current methods for diagnosing Alzheimer’s Disease using Magnetic Resonance Imaging (MRI) have significant limitations. Many previous studies used 2D Transformers to analyze individual brain slices independently, potentially losing critical 3D contextual information. Region of interest-based models often focus on only a few brain regions despite Alzheimer’s affecting multiple areas. Additionally, most classification models rely on a single test, whereas diagnosing Alzheimer’s requires a multifaceted approach integrating diverse data sources for a more accurate assessment. This study introduces a novel methodology called the Multiple Inputs and Mixed Data 3D Vision Transformer (MIMD-3DVT). This method processes consecutive slices together to capture the feature dimensions and spatial information, fuses multiple 3D ROI imaging data inputs, and integrates mixed data from demographic factors, cognitive assessments, and brain imaging. The proposed methodology was experimentally evaluated using a combined dataset that included the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarker, and Lifestyle Flagship Study of Ageing (AIBL), and the Open Access Series of Imaging Studies (OASIS). Our MIMD-3DVT, utilizing single or multiple ROIs, achieved an accuracy of 97.14%, outperforming the state-of-the-art methods in distinguishing between Normal Cognition and Alzheimer’s Disease.
[113] Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning cs.CVPDF
Yu Wang, Shengjie Zhao
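TL;DR: This paper proposes LAS-VAD, a new framework for weakly supervised video anomaly detection (WS-VAD). By integrating an anomaly-connected component mechanism and an intention awareness mechanism, it learns anomaly semantics more effectively and uses anomaly attribute information to guide detection, addressing the key limitation that anomaly semantics are hard to learn without dense frame-level annotations.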
Details
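Motivation: WS-VAD provides only video-level annotations as supervision. The lack of dense frame-level annotations leaves existing methods struggling to learn anomaly semantics effectively, which is the problem this work addresses.
Result: Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, show that LAS-VAD outperforms current state-of-the-art methods with remarkable gains.
Insight: The contributions are: 1) an anomaly-connected component mechanism that assigns video frames to distinct semantic groups, with frame segments in the same group sharing the same semantic information; 2) an intention awareness mechanism that distinguishes similar normal and abnormal behaviors (such as taking items versus stealing); and 3) anomaly attribute information (for example, explosions are accompanied by flames and thick smoke) that models anomaly semantics to guide more accurate detection. Together these mechanisms strengthen the model’s ability to understand and discriminate anomaly semantics.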
Abstract: Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD, as dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates an anomaly-connected component mechanism and an intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (e.g., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.
[114] MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs cs.CV | cs.AI | cs.CRPDF
Yilian Liu, Xiaojun Jia, Guoshun Nan, Jiuyang Lyu, Zhican Chen
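TL;DR: This paper proposes MIDAS, a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual clues, and uses cross-image reasoning to gradually reconstruct the malicious intent, thereby bypassing the safety mechanisms of multimodal large language models.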
Details
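Motivation: Existing jailbreak methods rely on single-image masking or isolated visual cues, which extend reasoning paths only modestly and are ineffective against strongly aligned commercial closed-source models. A more effective attack is therefore needed to disrupt the model’s security attention.
Result: Extensive experiments across multiple datasets and MLLMs show that MIDAS outperforms state-of-the-art MLLM jailbreak attacks, with an average attack success rate of 81.46% across 4 closed-source MLLMs.
Insight: The novelty is that multi-image dispersion and semantic reconstruction force the model into longer, more structured chained reasoning, which increases reliance on visual cues and delays the exposure of malicious semantics, significantly reducing security attention and improving jailbreak performance.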
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance but remain vulnerable to jailbreak attacks that can induce harmful content and undermine their secure deployment. Previous studies have shown that introducing additional inference steps, which disrupt security attention, can make MLLMs more susceptible to being misled into generating malicious content. However, these methods rely on single-image masking or isolated visual cues, which only modestly extend reasoning paths and thus achieve limited effectiveness, particularly against strongly aligned commercial closed-source models. To address this problem, in this paper, we propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual clues, and leverages cross-image reasoning to gradually reconstruct the malicious intent, thereby bypassing existing safety mechanisms. The proposed MIDAS enforces longer and more structured multi-image chained reasoning, substantially increases the model’s reliance on visual cues while delaying the exposure of malicious semantics and significantly reducing the model’s security attention, thereby improving the performance of jailbreak against advanced MLLMs. Extensive experiments across different datasets and MLLMs demonstrate that the proposed MIDAS outperforms state-of-the-art jailbreak attacks for MLLMs and achieves an average attack success rate of 81.46% across 4 closed-source MLLMs. Our code is available at this link.
[115] Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation cs.CV | cs.AIPDF
Yongbo He, Zirun Guo, Tao Jin
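TL;DR: This paper proposes DASP, a new framework that addresses negative transfer and catastrophic forgetting in multi-modal test-time adaptation. It diagnoses modality bias and uses decoupled stability and plasticity adapters to apply an asymmetric adaptation strategy, adapting flexibly to new domains while preserving generalizable knowledge.
Details
Motivation: Existing multi-modal test-time adaptation methods often suffer from negative transfer in the unbiased modality and catastrophic forgetting in the biased modality; this paper aims to overcome both by decoupling stability from plasticity.
Result: Comprehensive evaluations on multiple multi-modal benchmarks show that DASP significantly outperforms existing state-of-the-art methods.
Insight: The novelty lies in the finding that, within the unified latent space, the biased modality exhibits higher interdimensional redundancy, and in the asymmetric adaptation strategy built on that finding: a decoupled architecture handles stability and plasticity separately to avoid negative transfer and catastrophic forgetting.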
Abstract: Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
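The diagnosis step described above can be illustrated with a small sketch. It assumes that "interdimensional redundancy" is measured as the mean absolute Pearson correlation between pairs of feature dimensions; that metric, and the function names `redundancy` and `pick_biased_modality`, are illustrative assumptions rather than the paper's actual definition.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def redundancy(features):
    """Mean |correlation| over all distinct pairs of feature dimensions.

    `features` is a list of samples, each a list of D floats.
    Assumed proxy for interdimensional redundancy, not the paper's formula.
    """
    dims = list(zip(*features))  # one tuple of values per feature dimension
    total, pairs = 0.0, 0
    for i in range(len(dims)):
        for j in range(i + 1, len(dims)):
            total += abs(pearson(dims[i], dims[j]))
            pairs += 1
    return total / pairs if pairs else 0.0

def pick_biased_modality(features_by_modality):
    """Flag the modality whose features are most redundant as the biased one."""
    return max(features_by_modality, key=lambda m: redundancy(features_by_modality[m]))

if __name__ == "__main__":
    batch = {
        "audio": [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]],                 # perfectly correlated dims
        "video": [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]],   # weakly correlated dims
    }
    print(pick_biased_modality(batch))
```

Under the asymmetric strategy in the abstract, the modality flagged here would have its plastic adapter component updated, while the other modality would update only its stable component under KL regularization.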
[116] WildActor: Unconstrained Identity-Preserving Video Generation cs.CVPDF
Qin Guo, Tianyu Yang, Xuanhua He, Fei Shen, Yong Zhang
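TL;DR: This paper proposes WildActor, a framework for identity-consistent human video generation under unconstrained viewpoints, and builds the large-scale dataset Actor-18M and the evaluation benchmark Actor-Bench.
Details
Motivation: Existing methods struggle to keep full-body identity consistent when generating human videos; they often show face-centric behavior or rigid copy-paste artifacts caused by pose locking, and so fall short of real production needs.
Result: Evaluated on the proposed Actor-Bench, WildActor preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motion, surpassing existing methods.
Insight: Contributions include the large-scale multi-view human video dataset Actor-18M, an Asymmetric Identity-Preserving Attention mechanism, and a Viewpoint-Adaptive Monte Carlo Sampling strategy based on marginal utility that achieves balanced manifold coverage.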
Abstract: Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level consistency, or produce copy-paste artifacts where subjects appear rigid due to pose locking. We present Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments. Actor-18M comprises 1.6M videos with 18M corresponding human images, covering both arbitrary views and canonical three-view representations. Leveraging Actor-18M, we propose WildActor, a framework for any-view conditioned human video generation. We introduce an Asymmetric Identity-Preserving Attention mechanism coupled with a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions by marginal utility for balanced manifold coverage. Evaluated on the proposed Actor-Bench, WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, surpassing existing methods in these challenging settings.
[117] AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution cs.CV | cs.AIPDF
Cencen Liu, Dongyang Zhang, Wen Yin, Jielei Wang, Tianyu Li
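TL;DR: This paper proposes AlignVAR, a globally consistent visual autoregressive framework for image super-resolution (ISR). It targets the global-consistency problems that existing visual autoregressive models face in ISR because of locality-biased attention and residual-only supervision. The framework strengthens long-range dependencies through Spatial Consistency Autoregression (SCA) and adds full reconstruction supervision at each scale through the Hierarchical Consistency Constraint (HCC), improving structural coherence and perceptual fidelity.
Details
Motivation: Visual autoregressive (VAR) models offer stable training, non-iterative inference, and high-fidelity synthesis for image generation, but applying them to image super-resolution faces two major challenges: locality-biased attention fragments spatial structures, and residual-only supervision lets errors accumulate across scales, severely compromising the global consistency of reconstructed images.
Result: Extensive experiments show that AlignVAR consistently outperforms existing generative methods in structural coherence and perceptual fidelity, while running over 10x faster at inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
Insight: The novelty lies in Spatial Consistency Autoregression (SCA), which mitigates overly local attention, and the Hierarchical Consistency Constraint (HCC), which exposes accumulated deviations early and stabilizes coarse-to-fine refinement. Viewed objectively, combining attention reweighting with full supervision at every scale improves the global consistency of autoregressive models on super-resolution while also making them considerably more efficient.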
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales; together these severely compromise the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
[118] UNICBench: UNIfied Counting Benchmark for MLLM cs.CVPDF
Chenggang Rong, Tao Han, Zhiyuan Zhao, Yaowu Fan, Jia Wan
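TL;DR: The paper presents UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit for assessing the counting ability of multimodal large language models across image, text, and audio. The benchmark contains over 14,000 annotated QA pairs; 45 advanced models are evaluated under a standardized protocol, revealing good performance on basic tasks but significant gaps on complex reasoning and hard tasks.
Details
Motivation: There is currently no unified multimodal counting benchmark that rigorously evaluates the core counting ability of MLLMs across image, text, and audio, so a standardized evaluation framework is needed.
Result: 45 SOTA MLLMs were evaluated on UNICBench. Models perform strongly on some basic counting tasks but show significant gaps on reasoning and on the hardest partitions, leaving substantial room for improvement.
Insight: The novelty lies in building the first unified multimodal, multi-level counting benchmark, together with a standardized evaluation toolkit featuring deterministic numeric parsing and stratified reporting, which gives a rigorous and comparable basis for measuring and accelerating progress in MLLM counting.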
Abstract: Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three-level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality-specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.
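To make "deterministic numeric parsing" concrete, here is a minimal sketch of how a free-form model answer might be reduced to a single count and scored by exact match. The rule used (take the last number mentioned, accept thousands separators, fall back to a small number-word table) is an assumption for illustration; the actual UNICBench toolkit may parse and match differently, and `parse_count` and `exact_match_accuracy` are hypothetical names.

```python
import re

# Small number-word table; an illustrative assumption, not the toolkit's vocabulary.
_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
# Digits with optional thousands separators and an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_count(answer):
    """Return the last count mentioned in `answer`, or None if none is found."""
    text = answer.lower()
    matches = _NUMBER.findall(text)
    if matches:
        value = float(matches[-1].replace(",", ""))
        # A count must be a whole number; reject answers such as "12.5".
        return int(value) if value.is_integer() else None
    words = [w for w in re.findall(r"[a-z]+", text) if w in _WORDS]
    return _WORDS[words[-1]] if words else None

def exact_match_accuracy(answers, ground_truth):
    """Fraction of answers whose parsed count equals the reference count."""
    hits = sum(parse_count(a) == g for a, g in zip(answers, ground_truth))
    return hits / len(ground_truth)

if __name__ == "__main__":
    outputs = ["There are 1,204 people.", "I see three cats", "no idea"]
    print(exact_match_accuracy(outputs, [1204, 3, 7]))
```

Because the parser has no randomness and no model in the loop, the same output string always yields the same score, which is the property the benchmark's protocol relies on.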
[119] IdGlow: Dynamic Identity Modulation for Multi-Subject Generation cs.CV | cs.AIPDF
Honghao Cai, Xiangyuan Wang, Yunhao Bai, Tianze Zhou, Sijie Xu
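TL;DR: IdGlow is a two-stage framework built on Flow Matching diffusion models that addresses the “stability-plasticity dilemma” in multi-subject image generation. It combines task-adaptive diffusion timestep scheduling, a temporal gating mechanism, and vision-language-model prompt synthesis, and in the second stage applies fine-grained group-level Direct Preference Optimization, achieving high-fidelity fusion of multiple subject identities and harmonious scene generation without spatial masks.
Details
Motivation: Existing multi-subject image generation methods based on rigid spatial masks or localized attention struggle to balance identity preservation (stability) against scene adaptation (plasticity) in tasks that require complex structural deformation, such as identity-preserving age transformation, and generation quality drops as a result.
Result: On two challenging benchmarks, direct multi-person fusion and age-transformed group generation, IdGlow effectively mitigates the stability-plasticity conflict, reaches a superior Pareto balance between facial fidelity and commercial-grade aesthetic quality, and achieves state-of-the-art (SOTA) results.
Insight: The novelty lies in: 1) aligning task-adaptive diffusion timestep scheduling (linear decay and temporal gating) with the generative dynamics of the diffusion model to control when and how strongly identity is injected; 2) using a vision-language model for badcase-driven, context-aware prompt synthesis to resolve attribute leakage and semantic ambiguity; 3) designing a fine-grained group-level Direct Preference Optimization strategy with a weighted margin to jointly optimize multi-subject artifacts, texture harmony, and identity fidelity. Together these offer a new approach to mask-free, high-quality multi-subject generation.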
Abstract: Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the “stability-plasticity dilemma,” particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks – direct multi-person fusion and age-transformed group generation – demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.
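The two timestep schedules named in the abstract can be sketched as simple weight functions. This is only an illustration: `progress` is assumed to be a normalized position in [0, 1] along the training or sampling trajectory, and the start/end weights and window bounds are hypothetical values, since the abstract does not give the actual parameterization or time direction.

```python
def linear_decay(progress, start=1.0, end=0.0):
    """Identity-constraint weight that relaxes linearly as `progress` goes 0 -> 1."""
    p = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return start + (end - start) * p

def temporal_gate(progress, window_start, window_end, strength=1.0):
    """Inject identity only inside a chosen window; zero weight elsewhere."""
    return strength if window_start <= progress <= window_end else 0.0

if __name__ == "__main__":
    # Hypothetical window of [0.3, 0.6]; print both schedules at a few positions.
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, linear_decay(p), temporal_gate(p, 0.3, 0.6))
```

The contrast is the point of the sketch: the decaying weight suits group composition, where constraints loosen gradually, while the gate confines identity injection to one window so that it does not override structure formed outside it.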
[120] Linking Modality Isolation in Heterogeneous Collaborative Perception cs.CVPDF
Changxing Liu, Zichen Chao, Siheng Chen
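TL;DR: This paper proposes CodeAlign, an efficient cross-modal alignment framework that needs no co-occurring training data, to address modality isolation in heterogeneous collaborative perception. Through cross-modal feature-code-feature translation, it directly learns mappings between modality-specific feature spaces without spatial correspondence supervision, substantially reducing parameter and communication overhead and achieving state-of-the-art perception performance on the OPV2V and DAIR-V2X datasets.
Details
Motivation: Modality differences among heterogeneous agents create domain gaps, and modality isolation (agents with different modalities never co-occurring in the training data) makes this worse; existing alignment methods that rely on spatially overlapping observations cannot handle it.
Result: Achieves SOTA perception performance on the OPV2V and DAIR-V2X datasets, using only 8% of the training parameters of prior alignment methods and reducing communication load by 1024x.
Insight: The novelty lies in using codebooks to regularize feature spaces into compact code spaces and learning cross-modal feature-code-feature translation, which aligns modalities without spatial correspondence supervision, greatly improves efficiency, and resolves modality isolation.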
Abstract: Collaborative perception leverages data exchange among multiple agents to enhance overall perception capabilities. However, heterogeneity across agents introduces domain gaps that hinder collaboration, and this is further exacerbated by an underexplored issue: modality isolation. It arises when multiple agents with different modalities never co-occur in any training data frame, enlarging cross-modal domain gaps. Existing alignment methods rely on supervision from spatially overlapping observations, and thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature (FCF) translation. The key idea is to explicitly identify the representation consistency through a codebook, and directly learn mappings between modality-specific feature spaces, thereby eliminating the need for spatial correspondence. Codebooks regularize feature spaces into code spaces, providing compact yet expressive representations. With a prepared code space for each modality, CodeAlign learns FCF translations that map features to the corresponding codes of other modalities, which are then decoded back into features in the target code space, enabling effective alignment. Experiments show that, when integrating three modalities, CodeAlign requires only 8% of the training parameters of prior alignment methods, reduces communication load by 1024x, and achieves state-of-the-art perception performance on both the OPV2V and DAIR-V2X datasets. Code will be released at https://github.com/cxliu0314/CodeAlign.
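The feature-code-feature path can be sketched in a few lines. This is a toy illustration under stated assumptions: quantization is nearest-neighbour lookup in a fixed codebook, and the cross-modal translator is replaced by a plain lookup table (`code_map`), whereas the paper learns this translation. All names and values below are hypothetical.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared Euclidean distance)."""
    def sq_dist(k):
        return sum((f - c) ** 2 for f, c in zip(feature, codebook[k]))
    return min(range(len(codebook)), key=sq_dist)

def fcf_translate(feature, src_codebook, code_map, dst_codebook):
    """Feature -> source code -> target code -> target-space feature.

    `code_map` stands in for the learned translator between code spaces.
    Returns the target code index and the decoded target-space feature.
    """
    src_idx = nearest_code(feature, src_codebook)
    dst_idx = code_map[src_idx]
    return dst_idx, dst_codebook[dst_idx]

if __name__ == "__main__":
    lidar_codes = [[0.0, 0.0], [1.0, 1.0]]       # hypothetical source codebook
    camera_codes = [[10.0, 10.0], [20.0, 20.0]]  # hypothetical target codebook
    lidar_to_camera = {0: 1, 1: 0}               # stand-in for the learned mapping
    print(fcf_translate([0.9, 1.2], lidar_codes, lidar_to_camera, camera_codes))
```

One consequence of this design is that an agent can transmit a code index instead of a dense feature vector, which is consistent with the reduced communication load the abstract reports.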
[121] Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark cs.CVPDF
Lijing Cai, Zhan Shi, Chenglong Huang, Jinyao Wu, Qiping Li
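TL;DR: This paper aims to advance spectral compressive imaging (SCI) from image-level to video-level reconstruction, addressing the shortcomings of existing image-level methods in spatial-spectral feature uncertainty and temporal consistency. To that end, it builds DynaSpec, the first high-quality dynamic hyperspectral image dataset, and proposes the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which uses spatial-then-temporal attention to exploit complementary information and temporal continuity in video for effective reconstruction. Performance evaluation and benchmarking are carried out through simulation and a purpose-built DD-CASSI prototype.
Details
Motivation: Existing SCI reconstruction methods are mainly image-level and have two main problems: first, the encoding process masks spatial-spectral features, creating uncertainty when reconstructing missing information from a single compressed measurement; second, the frame-by-frame reconstruction paradigm cannot ensure the temporal consistency that is crucial for video perception. This paper aims to move spectral reconstruction from the image level to the video level by exploiting the complementary features and temporal continuity of adjacent frames in dynamic scenes.
Result: Extensive experiments show that the proposed PG-SVRT achieves superior reconstruction quality, spectral fidelity, and temporal consistency while keeping FLOPs very low. The study evaluates four SCI systems in simulation and builds a DD-CASSI prototype for real-world data collection and benchmarking.
Insight: Main contributions: 1) DynaSpec, the first high-quality dynamic hyperspectral video dataset, which fills a gap in benchmark data for the field; 2) PG-SVRT, which uses spatial-then-temporal attention and introduces a bridged token to reduce computational complexity, making effective use of temporal information in video for spectral reconstruction; 3) extending spectral reconstruction from the image-level paradigm to the video level, emphasizing the importance of temporal consistency, with validation from simulation to the real world through a prototype system.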
Abstract: Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) the encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) the frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in video perception. To address these challenges, this paper seeks to advance spectral reconstruction from the image level to the video level, leveraging the complementary features and temporal continuity across adjacent frames in dynamic scenes. Initially, we construct the first high-quality dynamic hyperspectral image dataset (DynaSpec), comprising 30 sequences obtained through frame-scanning acquisition. Subsequently, we propose the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which employs spatial-then-temporal attention to effectively reconstruct spectral features from abundant video information, while using a bridged token to reduce computational complexity. Finally, we conduct simulation experiments to assess the performance of four SCI systems, and construct a DD-CASSI prototype for real-world data collection and benchmarking. Extensive experiments demonstrate that PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency, while maintaining minimal FLOPs. Project page: https://github.com/nju-cite/DynaSpec
[122] Exploring 3D Dataset Pruning cs.CV | cs.LGPDF
Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Zhiqiang Shen
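TL;DR: This paper studies dataset pruning for 3D data. Targeting the long-tail class distribution common in 3D data, it proposes a unified framework that tackles the trade-off between Overall Accuracy (OA) and Mean Accuracy (mAcc) during pruning. Representation-aware subset selection combined with prior-invariant teacher supervision reduces coverage error and prior-mismatch bias.
Details
Motivation: Existing dataset pruning methods mainly target 2D images, while pruning for 3D data remains largely unexplored. The long-tail class distribution common in 3D data makes optimization under the conventional metrics OA and mAcc inherently conflicting, which makes pruning particularly challenging.
Result: Extensive experiments on multiple 3D datasets show that the method improves both OA and mAcc across a range of settings and adapts to different downstream preferences.
Insight: The novelty lies in formulating pruning as approximating the full-data expected risk with a weighted subset and identifying two key error sources, coverage error and prior-mismatch bias. The method combines representation-aware subset selection with per-class retention quotas and prior-invariant teacher supervision using calibrated soft labels and embedding-geometry distillation; the retention quota also serves as a switch for controlling the OA-mAcc trade-off.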
Abstract: Dataset pruning has been widely studied for 2D images to remove redundancy and accelerate training, while pruning methods specifically for 3D data remain largely unexplored. In this work, we study dataset pruning for 3D data, where the commonly observed long-tail class distribution makes optimization under the conventional evaluation metrics Overall Accuracy (OA) and Mean Accuracy (mAcc) inherently conflicting, and further makes pruning particularly challenging. To address this, we formulate pruning as approximating the full-data expected risk with a weighted subset, which reveals two key errors: coverage error from insufficient representativeness and prior-mismatch bias from inconsistency between subset-induced class weights and target metrics. We propose representation-aware subset selection with per-class retention quotas for long-tail coverage, and prior-invariant teacher supervision using calibrated soft labels and embedding-geometry distillation. The retention quota also serves as a switch to control the OA-mAcc trade-off. Extensive experiments on 3D datasets show that our method can improve both metrics across multiple settings while adapting to different downstream preferences. Our code is available at https://github.com/XiaohanZhao123/3D-Dataset-Pruning.
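How a per-class retention quota can act as an OA-mAcc switch is easier to see with a toy allocation rule. The sketch below assumes quotas proportional to `count ** alpha`: `alpha = 1` keeps the original class proportions (favouring OA), while `alpha = 0` splits the budget evenly across classes (favouring mAcc). This power-law rule and the name `retention_quotas` are illustrative assumptions, not the allocation used in the paper.

```python
import math

def retention_quotas(class_counts, budget, alpha):
    """Split `budget` kept samples across classes in proportion to count ** alpha.

    No class is asked to keep more samples than it has; any budget freed by
    that cap is handed out one sample at a time to classes with spare capacity.
    """
    weights = {c: n ** alpha for c, n in class_counts.items()}
    total = sum(weights.values())
    raw = {c: budget * w / total for c, w in weights.items()}
    quotas = {c: min(class_counts[c], math.floor(raw[c])) for c in class_counts}
    remaining = budget - sum(quotas.values())
    # Classes with the largest fractional share are served first.
    order = sorted(class_counts, key=lambda c: raw[c] - math.floor(raw[c]), reverse=True)
    while remaining > 0:
        progressed = False
        for c in order:
            if remaining == 0:
                break
            if quotas[c] < class_counts[c]:
                quotas[c] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every class is already fully retained
            break
    return quotas

if __name__ == "__main__":
    counts = {"chair": 900, "lamp": 90, "bowl": 10}  # hypothetical long-tail dataset
    print(retention_quotas(counts, budget=100, alpha=1))
    print(retention_quotas(counts, budget=100, alpha=0))
```

Intermediate values of `alpha` interpolate between the two extremes, which is one simple way to expose the trade-off as a single knob.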
[123] RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception cs.CVPDF
Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Huiliang Shen
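TL;DR: This paper proposes RC-GeoCP, the first framework to explore fusing 4D radar and image information in collaborative perception. By establishing a radar-anchored geometric consensus, it resolves misalignment across agents caused by depth ambiguity and spatial dispersion, using three modules: Geometric Structure Rectification, Uncertainty-Aware Communication, and the Consensus-Driven Assembler.
Details
Motivation: LiDAR-centric collaborative perception systems are costly and degrade in adverse weather, while the combination of cameras and 4D radar, which offers both dense visual semantics and robust spatial measurements, remains underexplored in collaborative settings.
Result: Establishes the first unified radar-camera collaborative perception benchmark on the V2X-Radar and V2X-R datasets and achieves state-of-the-art performance with significantly reduced communication overhead.
Insight: The novelty lies in the first systematic exploration of 4D radar and camera fusion in collaborative perception and in a consensus-building approach anchored on radar geometry; geometric structure rectification, uncertainty-based feature selection, and consensus-based aggregation together address spatial alignment and transmission efficiency across modalities and agents.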
Abstract: Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.
[124] Stateful Cross-layer Vision Modulation cs.CVPDF
Ying Liu, Yudong Han, Kean Shi, Liyuan Pan
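TL;DR: This paper proposes SCVM, a cross-layer memory-modulated vision framework that addresses the limitations of visual feature fusion in multimodal large language models. Existing methods typically perform static concatenation or weighted aggregation after visual encoding, whereas SCVM introduces a recursively updated cross-layer memory state inside the vision encoder to model long-range inter-layer dependencies, and designs a layer-wise feedback modulation mechanism to structurally regulate the representation evolution trajectory. The method improves performance on multiple visual question answering and hallucination evaluation benchmarks without expanding visual tokens, introducing additional vision encoders, or modifying or fine-tuning the language model.
Details
Motivation: Existing MLLMs widely adopt multi-layer visual feature fusion to enhance visual representation, but usually fuse statically after encoding without intervening in representation formation itself. As a result, fine-grained details from early layers are suppressed during hierarchical abstraction, and directly introducing shallow-layer features can cause a semantic distribution mismatch with the visual feature space the LLM was pretrained on, requiring additional adaptation or fine-tuning.
Result: Experiments on multiple visual question answering and hallucination evaluation benchmarks show that SCVM delivers consistent performance improvements without expanding visual tokens, introducing additional vision encoders, or modifying or fine-tuning the language model.
Insight: The novelty lies in revisiting visual representation learning from the perspective of representation evolution control: a recursively updated cross-layer memory state inside the vision encoder models inter-layer dependencies, a layer-wise feedback modulation mechanism structurally regulates the representation evolution trajectory, and an auxiliary semantic alignment objective supervises the final memory state, encouraging progressive compression and reinforcement of task-relevant information. The approach improves representation quality while keeping the model architecture lightweight.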
Abstract: Recent multimodal large language models (MLLMs) widely adopt multi-layer visual feature fusion to enhance visual representation. However, existing approaches typically perform static concatenation or weighted aggregation after visual encoding, without intervening in the representation formation process itself. As a result, fine-grained details from early layers may be progressively suppressed during hierarchical abstraction. Moreover, directly introducing shallow-layer features into the language model often leads to semantic distribution mismatch with the visual feature space that the LLM’s cross-attention layers were pretrained on, which typically requires additional adaptation or fine-tuning of the LLM. To address these limitations, we revisit visual representation learning from the perspective of representation evolution control and propose a cross-layer memory-modulated vision framework (SCVM). Specifically, we introduce a recursively updated cross-layer memory state inside the vision encoder to model long-range inter-layer dependencies. We further design a layer-wise feedback modulation mechanism that refreshes token representations at each layer based on the accumulated memory, thereby structurally regulating the representation evolution trajectory. In addition, we incorporate an auxiliary semantic alignment objective that explicitly supervises the final memory state, encouraging progressive compression and reinforcement of task-relevant information. Experimental results on multiple visual question answering and hallucination evaluation benchmarks demonstrate that SCVM achieves consistent performance improvements without expanding visual tokens, introducing additional vision encoders, or modifying or fine-tuning the language model.
[125] Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning cs.CVPDF
Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu
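TL;DR: This paper proposes HistoSelect, a reasoning framework for pathology whole slide images (WSIs) that mimics how pathologists examine slides. Using a question-guided, tissue-aware, coarse-to-fine retrieval strategy, it selects the most relevant tissue regions and the most informative patches for question answering, markedly improving efficiency and accuracy.
Details
Motivation: When handling information-dense gigapixel slides, current pathology question-answering models typically use uniform sampling or broad attention mechanisms, so they may attend equally to irrelevant regions while overlooking key visual evidence, unlike human pathologists, who scan and zoom in selectively according to the clinical question.
Result: Evaluated on 356,000 question-answer pairs, the method outperforms existing methods on three pathology QA tasks while reducing visual token usage by 70% on average, and produces answers grounded in interpretable regions consistent with pathologists' judgment.
Insight: The core innovation is bringing human-like search and attention patterns into WSI reasoning: a two-stage retrieval framework (a tissue-region sampler followed by a patch selector) enables efficient, precise, question-guided information extraction and points the way toward practical and reliable pathology vision-language models (VLMs).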
Abstract: Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.
[126] Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis cs.CVPDF
Youngjin Yoo, Han Liu, Bogdan Georgescu, Yanbo Zhang, Sasa Grbic
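TL;DR: This paper proposes MoLRE (Mixture of Low-Rank Experts), a framework for specializing foundation models to complex multi-label diagnostic tasks such as comprehensive head CT finding detection. It extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing to enable conditional feature adaptation, adding less than 0.5% parameters and requiring no explicit pathology supervision. In an extensive benchmark across six state-of-the-art medical imaging foundation models, using over 70,000 non-contrast head CT scans with 75 annotated findings, MoLRE yields consistent gains on all models; combined with MedGemma it achieves the highest average detection AUC, 0.917.
Details
Motivation: Foundation models pre-trained on large-scale datasets show strong transfer learning capabilities, but their adaptation to complex multi-label diagnostic tasks such as comprehensive head CT finding detection remains understudied. Standard parameter-efficient fine-tuning methods such as LoRA apply uniform adaptations across pathology types, which may limit performance on diverse medical findings.
Result: MoLRE yields consistent gains on all six state-of-the-art medical imaging foundation models tested, with the size of the gain varying by model: general-purpose and medical-domain models improve most (DINOv3-Base: +4.6%; MedGemma: +4.3%), while 3D CT-specialized or very large models improve less (+0.2-1.3%). MoLRE combined with MedGemma achieves the highest average detection AUC, 0.917.
Insight: The novelty lies in MoLRE's mixture of multiple specialized low-rank adapters with unsupervised soft routing, which enables conditional feature adaptation and improves multi-label diagnostic performance with few additional parameters (less than 0.5%) and no explicit supervision. Viewed objectively, the work underlines the importance of systematic benchmarking on the target clinical task, since pretraining domain, architecture, and model scale interact in non-obvious ways, and it offers a new approach to specializing foundation models.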
Abstract: Foundation models pre-trained on large-scale datasets demonstrate strong transfer learning capabilities; however, their adaptation to complex multi-label diagnostic tasks, such as comprehensive head CT finding detection, remains understudied. Standard parameter-efficient fine-tuning methods such as LoRA apply uniform adaptations across pathology types, which may limit performance for diverse medical findings. We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. This approach enables conditional feature adaptation with less than 0.5% additional parameters and without explicit pathology supervision. We present a comprehensive benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters. Using over 70,000 non-contrast head CT scans with 75 annotated findings (including hemorrhage, infarction, trauma, mass lesions, structural abnormalities, and chronic changes), our experiments demonstrate consistent performance improvements across all models. Gains vary substantially: general-purpose and medical-domain models show the largest improvements (DINOv3-Base: +4.6%; MedGemma: +4.3%), whereas 3D CT-specialized or very large models show more modest gains (+0.2-1.3%). The combination of MoLRE and MedGemma achieves the highest average detection AUC of 0.917. These findings highlight the importance of systematic benchmarking on target clinical tasks, as pretraining domain, architecture, and model scale interact in non-obvious ways.
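The idea of extending LoRA with several low-rank adapters and soft routing can be sketched for a single linear layer. The sketch assumes a frozen weight `W`, experts given as `(A, B)` pairs that each contribute the low-rank update `B(Ax)`, and a linear router followed by a softmax that produces the mixing weights. The router form and the name `molre_forward` are illustrative assumptions; the abstract does not specify how routing is computed.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def molre_forward(x, W, experts, router):
    """Frozen linear layer plus a softly routed mixture of low-rank updates.

    experts: list of (A, B) with A of shape r x d_in and B of shape d_out x r.
    router:  matrix of shape n_experts x d_in (assumed linear router).
    Returns the layer output and the gate weights.
    """
    output = matvec(W, x)                 # frozen base projection
    gates = softmax(matvec(router, x))    # one soft weight per expert
    for gate, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))   # low-rank update B(Ax)
        output = [o + gate * d for o, d in zip(output, delta)]
    return output, gates

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    experts = [([[1.0, 0.0]], [[1.0], [0.0]]),   # rank-1 expert acting on dim 0
               ([[0.0, 1.0]], [[0.0], [1.0]])]   # rank-1 expert acting on dim 1
    router = [[0.0, 0.0], [0.0, 0.0]]            # zero router gives uniform gates
    print(molre_forward([1.0, 2.0], W, experts, router))
```

Each rank-r expert on a d_out x d_in layer adds r * (d_in + d_out) parameters, which is why a handful of experts can stay a small fraction of the base model's size.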
[127] STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification cs.CVPDF
Xingguo Xu, Zhanyu Liu, Weixiang Zhou, Yuansheng Gao, Junjie Cao
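TL;DR: This paper proposes STMI, a novel multi-modal object re-identification framework. Its three core modules, segmentation-guided feature modulation, semantic token reallocation, and cross-modal hypergraph interaction, exploit complementary information across modalities, reduce background interference, and improve feature discriminability.
Details
Motivation: Existing methods mostly rely on hard token filtering or simple fusion strategies, which tend to lose discriminative cues and amplify background interference, so a finer-grained fusion mechanism is needed to improve multi-modal object re-identification.
Result: Extensive experiments on public benchmarks including RGBNT201, RGBNT100, and MSVR310 show that the STMI framework is effective and robust in multi-modal re-identification scenarios and performs at an advanced level.
Insight: Contributions include segmentation-guided attention modulation using SAM-generated masks, semantic token reallocation through learnable query tokens without discarding any tokens, and a cross-modal hypergraph that captures high-order semantic relationships, offering a fine-grained interaction strategy that other multi-modal fusion work can draw on.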
Abstract: Multi-modal object Re-Identification (ReID) aims to exploit complementary information from different modalities to retrieve specific objects. However, existing methods often rely on hard token filtering or simple fusion strategies, which can lead to the loss of discriminative cues and increased background interference. To address these challenges, we propose STMI, a novel multi-modal learning framework consisting of three key components: (1) Segmentation-Guided Feature Modulation (SFM) module leverages SAM-generated masks to enhance foreground representations and suppress background noise through learnable attention modulation; (2) Semantic Token Reallocation (STR) module employs learnable query tokens and an adaptive reallocation mechanism to extract compact and informative representations without discarding any tokens; (3) Cross-Modal Hypergraph Interaction (CHI) module constructs a unified hypergraph across modalities to capture high-order semantic relationships. Extensive experiments on public benchmarks (i.e., RGBNT201, RGBNT100, and MSVR310) demonstrate the effectiveness and robustness of our proposed STMI framework in multi-modal ReID scenarios.
[128] TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction cs.CVPDF
Yihui Li, Chengxin Lv, Zichen Tang, Hongyu Yang, Di Huang
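TL;DR: TokenSplat is a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core is a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views in feature space; learnable camera tokens and an Asymmetric Dual-Flow Decoder enhance pose robustness, enabling coherent reconstruction and stable pose estimation without iterative refinement.
Details
Motivation: Addresses the challenge of jointly performing high-quality 3D reconstruction and camera pose estimation from unposed multi-view images (with uncalibrated camera parameters), avoiding the reliance of traditional methods on iterative optimization or known poses.
Result: In pose-free settings, TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality, and significantly improves pose estimation accuracy over prior pose-free methods.
Insight: The novelty lies in the Token-aligned Gaussian Prediction module for long-range cross-view reasoning, and in learnable camera tokens with an Asymmetric Dual-Flow Decoder that disentangle viewpoint cues from scene semantics, giving clean factorization and stable joint estimation within a feed-forward architecture.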
Abstract: We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods. Project page: https://kidleyh.github.io/tokensplat/.
[129] Towards Khmer Scene Document Layout Detection cs.CVPDF
Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing, Masakazu Iwamura
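TL;DR: This paper presents the first comprehensive study of Khmer scene document layout detection. It proposes a new framework comprising a dedicated dataset, an open-source document augmentation tool, and YOLO-based layout detection baselines, addressing the failure of existing Latin-script models caused by scarce annotated data and complex script structure.
Details
Motivation: Khmer scene document layout analysis is held back by scarce annotated training data, complex script structure (such as diacritics and multi-layer character stacking), and the perspective distortions and complex backgrounds of scene documents; as a result, existing Latin-based models cannot accurately delineate semantic layout units, especially dense text regions.
Result: The paper contributes the first dataset and baseline models dedicated to Khmer scene document layout detection, uses YOLO architectures with oriented bounding boxes (OBB) to handle geometric distortions, and develops an open-source document augmentation tool that synthesizes realistic scene documents to scale training data.
Insight: The novelty lies in the first systematic benchmark for Khmer scene document layout detection, comprising a dedicated dataset, a data augmentation tool, and baseline models. It gives document analysis for low-resource languages a scalable data synthesis method and a detection architecture designed for geometric distortion, and is of value to the community.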
Abstract: While document layout analysis for Latin scripts has advanced significantly, driven by the advent of large multimodal models (LMMs), progress for the Khmer language remains constrained because of the scarcity of annotated training data. This gap is particularly acute for scene documents, where perspective distortions and complex backgrounds challenge traditional methods. Given the structural complexities of Khmer script, such as diacritics and multi-layer character stacking, existing Latin-based layout analysis models fail to accurately delineate semantic layout units, particularly for dense text regions (e.g., list items). In this paper, we present the first comprehensive study on Khmer scene document layout detection. We contribute a novel framework comprising three key elements: (1) a robust training and benchmarking dataset specifically for Khmer scene layouts; (2) an open-source document augmentation tool capable of synthesizing realistic scene documents to scale training data; and (3) layout detection baselines utilizing YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions. To foster further research in the Khmer document analysis and recognition (DAR) community, we release our models, code, and datasets in this gated repository (in review).
[130] A Reconstruction System for Industrial Pipeline Inner Walls Using Panoramic Image Stitching with Endoscopic Imaging cs.CVPDF
Rui Ma, Yifeng Wang, Ziteng Yang, Xinghui Li
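TL;DR: This paper proposes a reconstruction system for industrial pipeline inner walls based on panoramic image stitching. It extracts key frames from industrial endoscope video and combines polar coordinate transformation with image stitching to unwrap annular video frames into planar panoramic images, providing intuitive and accurate visual support for defect detection and condition assessment of pipeline inner walls.
Details
Motivation: Addresses the challenge of visual analysis and reconstruction of pipeline inner walls in industrial inspection, where traditional frame-by-frame video review is inefficient and an efficient, comprehensive inner-wall reconstruction solution is needed.
Result: Experiments show that the method processes industrial endoscope videos efficiently and that the generated panoramic stitched images preserve all detailed features of the pipeline inner wall; compared with traditional frame-by-frame review, it significantly improves reconstruction efficiency and has considerable engineering application value.
Insight: The novelty lies in combining polar coordinate transformation with image stitching specifically to unwrap and stitch annular video frames of pipeline inner walls into panoramic images that support defect detection; the system includes a custom GUI and automates the pipeline from video to panoramic image, improving the practicality and efficiency of industrial inspection.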
Abstract: Visual analysis and reconstruction of pipeline inner walls remain challenging in industrial inspection scenarios. This paper presents a dedicated reconstruction system for pipeline inner walls via industrial endoscopes, which is built on panoramic image stitching technology. Equipped with a custom graphical user interface (GUI), the system extracts key frames from endoscope video footage, and integrates polar coordinate transformation with image stitching techniques to unwrap annular video frames of pipeline inner walls into planar panoramic images. Experimental results demonstrate that the proposed method enables efficient processing of industrial endoscope videos, and the generated panoramic stitched images preserve all detailed features of pipeline inner walls in their entirety. This provides intuitive and accurate visual support for defect detection and condition assessment of pipeline inner walls. In comparison with the traditional frame-by-frame video review method, the proposed approach significantly elevates the efficiency of pipeline inner wall reconstruction and exhibits considerable engineering application value.
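The polar unwrapping step can be sketched without any imaging library. The sketch assumes a grayscale frame stored as a list of rows, a known pipe-axis centre `(cx, cy)`, and nearest-neighbour sampling. The function name and parameters are hypothetical, and the key-frame extraction and stitching stages of the system are not covered.

```python
import math

def unwrap_annulus(frame, cx, cy, r_min, r_max, n_angles):
    """Unwrap the ring r_min <= r < r_max around (cx, cy) into a rectangle.

    Output rows correspond to radii and columns to angles, so a circular
    band of the endoscope image becomes a horizontal strip.
    Pixels that fall outside the frame are filled with 0.
    """
    height, width = len(frame), len(frame[0])
    panorama = []
    for r in range(r_min, r_max):
        row = []
        for k in range(n_angles):
            theta = 2.0 * math.pi * k / n_angles
            x = int(round(cx + r * math.cos(theta)))  # nearest source column
            y = int(round(cy + r * math.sin(theta)))  # nearest source row
            inside = 0 <= x < width and 0 <= y < height
            row.append(frame[y][x] if inside else 0)
        panorama.append(row)
    return panorama

if __name__ == "__main__":
    # 5x5 test frame whose pixel value encodes its position as 10*row + column.
    frame = [[10 * y + x for x in range(5)] for y in range(5)]
    for strip in unwrap_annulus(frame, cx=2, cy=2, r_min=1, r_max=3, n_angles=4):
        print(strip)
```

In a full system, strips unwrapped from successive key frames would then be aligned and stitched along the pipe axis to form the panoramic image.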
[131] BornoViT: A Novel Efficient Vision Transformer for Bengali Handwritten Basic Characters Classification cs.CV | cs.LGPDF
Rafi Hassan Chowdhury, Naimul Haque, Kaniz Fatiha
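TL;DR: This paper proposes BornoViT, a novel, efficient, and lightweight Vision Transformer for classifying Bengali handwritten basic characters and digits. By using a simplified deep convolutional neural network architecture, the model substantially reduces parameter count and computational overhead and performs well in resource-limited settings.
Details
Motivation: Bengali handwritten character classification is difficult because the characters are complex and highly variable. Existing models are computationally expensive and data-hungry, which makes them ill-suited to resource-limited Bengali settings, so an efficient, lightweight classifier is needed.
Result: The model reaches 95.77% accuracy on the BanglaLekha Isolated dataset and 91.51% on the self-collected Bornomala dataset. With only 0.65 million parameters, a size of 0.62 MB, and 0.16 GFLOPs, it is significantly lighter than current SOTA models.
Insight: The novelty lies in combining a Vision Transformer with a simplified deep convolutional network to achieve very low parameter count and computational cost, giving an efficient solution for handwritten character classification in resource-limited languages. The lightweight design may transfer to other low-resource scenarios.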
Abstract: Handwritten character classification in the Bengali script is a significant challenge due to the complexity and variability of the characters. The models commonly used for classification are often computationally expensive and data-hungry, making them unsuitable for resource-limited languages such as Bengali. In this experiment, we propose a novel, efficient, and lightweight Vision Transformer model that effectively classifies Bengali handwritten basic characters and digits, addressing several shortcomings of traditional methods. The proposed solution utilizes a deep convolutional neural network (DCNN) in a more simplified manner compared to traditional DCNN architectures, with the aim of reducing computational burden. With only 0.65 million parameters, a model size of 0.62 MB, and 0.16 GFLOPs, our model, BornoViT, is significantly lighter than current state-of-the-art models, making it more suitable for resource-limited environments, which is essential for Bengali handwritten character classification. BornoViT was evaluated on the BanglaLekha Isolated dataset, achieving an accuracy of 95.77%, and demonstrating superior efficiency compared to existing state-of-the-art approaches. Furthermore, the model was evaluated on our self-collected dataset, Bornomala, consisting of approximately 222 samples from different age groups, where it achieved an accuracy of 91.51%.
[132] DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents cs.CVPDF
Zikang Xu, Ruinan Jin, Xiaoxiao Li
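TL;DR: This paper presents DUCX, a framework for systematically auditing unfairness in tool-using chest X-ray agents. Beyond demographic disparities in end-to-end performance, the agents' intermediate behaviors (such as tool usage, transition patterns, and reasoning traces) also show marked subgroup differences that cannot be predicted from end-to-end evaluation alone.
Details
Motivation: Tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, but the added pipeline complexity opens new pathways for demographic bias beyond those of standalone models.
Result: Extensive experiments on tool-using agentic frameworks with five driver backbones show that (i) demographic gaps persist in end-to-end performance, with equalized odds up to 20.79% and the lowest fairness-utility tradeoff down to 28.65%; and (ii) intermediate behaviors, tool usage, transition patterns, and reasoning traces exhibit distinct subgroup disparities that cannot be predicted from end-to-end evaluation alone (for example, conditioned on segmentation-tool availability, the subgroup utility gap reaches 50%).
Insight: The novelty is a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias, tool transition bias, and model reasoning bias. This underscores the need for process-level fairness auditing and debiasing in clinical agentic systems and offers a new methodological lens for understanding and mitigating bias in complex AI systems.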
Abstract: Tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, but this added pipeline complexity also creates new pathways for demographic bias beyond standalone models. We present DUCX (Decomposing Unfairness in Chest X-ray agents), a systematic audit of chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors). Extensive experiments on tool-using agentic frameworks across five driver backbones reveal that (i) demographic gaps persist in end-to-end performance, with equalized odds up to 20.79% and the lowest fairness-utility tradeoff down to 28.65%, and (ii) intermediate behaviors, including tool usage, transition patterns, and reasoning traces, exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone (e.g., conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%). Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code is available here: https://anonymous.4open.science/r/DUCK-E5FE/README.md
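To make the equalized-odds figure concrete, the sketch below computes one common form of the gap: the larger of the between-group spreads in true-positive rate and false-positive rate. The abstract does not give the exact definition used, so this formulation, the function names, and the toy labels are assumptions.

```python
import numpy as np

def rate(pred, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    return float(pred[mask].mean()) if mask.any() else float("nan")

def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group difference in TPR or FPR (assumed definition)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tprs, fprs = [], []
    for g in np.unique(group):
        in_g = group == g
        tprs.append(rate(y_pred, in_g & (y_true == 1)))
        fprs.append(rate(y_pred, in_g & (y_true == 0)))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Toy binary answers for two hypothetical demographic subgroups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = equalized_odds_gap(y_true, y_pred, group)
```

Here group `a` has TPR 1.0 and FPR 0.0 while group `b` has 0.5 and 0.5, so the gap is 0.5. A stage-wise audit would repeat such a computation on subsets of cases, for example only those where a given tool was invoked.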
[133] Neural Functional Alignment Space: Brain-Referenced Representation of Artificial Neural Networks cs.CVPDF
Ruiyu Yan, Hanqi Jiang, Yi Pan, Xiaobo Li, Tianming Liu
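TL;DR: This paper proposes the Neural Functional Alignment Space (NFAS), a brain-referenced representational framework for characterizing artificial neural networks on a common functional basis. Rather than relying on layer-wise features or task-specific activations, NFAS characterizes a model by the intrinsic dynamical evolution of stimulus representations across network depth. Concretely, layer-wise embeddings are modeled as a depth-wise dynamical trajectory, Dynamic Mode Decomposition (DMD) extracts the stable mode, and the resulting representation is projected into a biologically anchored coordinate system defined by distributed neural responses. The paper also introduces the Signal-to-Noise Consistency Index (SNCI) to quantify cross-model consistency at the modality level. Across 45 pretrained models spanning vision, audio, and language, NFAS reveals structured organization in this brain-referenced space, including modality-specific clustering and cross-modal convergence in integrative cortical systems.
Details
Motivation: The goal is to move beyond conventional alignment approaches that depend on layer-wise features or task-specific activations, and to provide a brain-referenced framework on a unified functional basis for characterizing and comparing different artificial neural networks, revealing how their intrinsic representational dynamics relate to biological neural systems.
Result: Across 45 pretrained models spanning vision, audio, and language, NFAS reveals structured organization in the brain-referenced space, such as modality-specific clustering and cross-modal convergence.
Insight: The novelty lies in modeling representational change along network depth as a dynamical trajectory and extracting the stable mode with DMD, yielding a unified representation space functionally aligned with biological neural systems. The paper also introduces the Signal-to-Noise Consistency Index to quantify cross-model consistency. This offers a way to understand the functional correspondence between artificial networks and the brain through representational dynamics rather than static features.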
Abstract: We propose the Neural Functional Alignment Space (NFAS), a brain-referenced representational framework for characterizing artificial neural networks on equal functional grounds. NFAS departs from conventional alignment approaches that rely on layer-wise features or task-specific activations by modeling the intrinsic dynamical evolution of stimulus representations across network depth. Specifically, we model layer-wise embeddings as a depth-wise dynamical trajectory and apply Dynamic Mode Decomposition (DMD) to extract the stable mode. This representation is then projected into a biologically anchored coordinate system defined by distributed neural responses. We also introduce the Signal-to-Noise Consistency Index (SNCI) to quantify cross-model consistency at the modality level. Across 45 pretrained models spanning vision, audio, and language, NFAS reveals structured organization within this brain-referenced space, including modality-specific clustering and cross-modal convergence in integrative cortical systems. Our findings suggest that representation dynamics provide a principled basis for
[134] NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code cs.CV | cs.MAPDF
Seemandhar Jain, Keshav Gupta, Kunal Gupta, Manmohan Chandraker
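TL;DR: NERFIFY is a multi-agent framework that automatically converts neural radiance field (NeRF) research papers into trainable Nerfstudio plugin code. Through context-free grammar constraints, Graph-of-Thought code synthesis, compositional citation recovery, visual feedback, knowledge enhancement, and benchmarking, it addresses the problem that generic paper-to-code methods usually fail to produce runnable code for NeRF papers, and it substantially reduces implementation time.
Details
Motivation: The surge in NeRF research forces researchers to spend significant effort reimplementing papers before building on them, while generic paper-to-code methods and frontier models (such as GPT-5) usually fail to produce executable code. A domain-specific solution is therefore needed to accelerate reproducible research.
Result: On a benchmark of 30 diverse NeRF papers, for papers without public implementations, NERFIFY produces code whose visual quality matches expert human code (within ±0.5 dB PSNR and ±0.2 SSIM) while reducing implementation time from weeks to minutes.
Insight: The contributions include: constraining LLM synthesis with a context-free grammar to guarantee architectural invariants; Graph-of-Thought multi-file code synthesis in topological dependency order; compositional citation recovery that automatically integrates components from the citation graph; iterative improvement through visual feedback (PSNR analysis, geometric validation, and VLM-guided patching); method enhancement with novel optimizations beyond reproduction; and an evaluation framework for NeRF paper-to-code synthesis. These domain-aware designs make code translation of complex vision papers feasible and support accelerated, democratized research.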
Abstract: The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+/-0.5 dB PSNR, +/-0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research. Code, data and implementations will be publicly released.
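The "topological dependency order" in innovation (2) can be illustrated with Python's standard `graphlib`. The file names and their dependencies below are hypothetical and are not taken from NERFIFY or Nerfstudio; the sketch only shows how an ordering arises in which every file is generated after the files it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical repository: each file maps to the set of files it imports.
deps = {
    "config.py": set(),
    "field.py": {"config.py"},
    "sampler.py": {"config.py"},
    "model.py": {"field.py", "sampler.py"},
    "pipeline.py": {"model.py"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(deps).static_order())

for path in order:
    # A generation agent would synthesize and validate `path` here,
    # with all of its dependencies already available.
    pass
```

Generating in this order means that contract checks at each node can run against already-generated dependencies rather than against placeholders.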
[135] COMBAT: Conditional World Models for Behavioral Agent Training cs.CVPDF
Anmol Agarwal, Pranay Meshram, Sumer Singh, Saurav Suman, Andrew Lapp
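TL;DR: This paper presents COMBAT, a conditional world model for behavioral agent training. Built on a diffusion model with a Transformer architecture, it simulates a dynamic, reactive opponent in real time in the complex 1v1 fighting game Tekken 3. Trained only on single-player inputs, with no explicit supervision of the opponent's policy, the model learns to generate intelligent interactive behavior.
Details
Motivation: Existing video generation and world models are limited in simulating dynamic, reactive agents and cannot effectively model agents that intelligently influence and interact with the world. This work aims to fill that gap with a world model that simulates a dynamic opponent reacting to player actions.
Result: In the Tekken 3 environment, COMBAT successfully simulates a dynamic opponent and exhibits emergent sophisticated agent behavior. The study introduces new evaluation methods to benchmark this emergent behavior, laying a foundation for training interactive agents within diffusion-based world models.
Insight: The novelty is that a diffusion model implicitly learns a dynamic opponent's behavior from partially observed data (single-player inputs only), without complete action labels or explicit supervision, in contrast to traditional imitation learning. Technically, it combines a 1.2 billion parameter Diffusion Transformer, latent representations from a deep compression autoencoder, causal distillation, and diffusion forcing to achieve real-time inference.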
Abstract: Recent advances in video generation have spurred the development of world models capable of simulating 3D-consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real-time, action-controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state-of-the-art techniques, including causal distillation and diffusion forcing, to achieve real-time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single-player inputs, without any explicit supervision for the opponent’s policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion-based world models.
[136] MMTA: Multi Membership Temporal Attention for Fine-Grained Stroke Rehabilitation Assessment cs.CVPDF
Halil Ismail Helvaci, Justin Huber, Jihye Bae, Sen-ching Samson Cheung
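TL;DR: This paper proposes Multi-Membership Temporal Attention (MMTA), a high-resolution temporal Transformer for fine-grained rehabilitation assessment. Each frame attends to multiple locally normalized temporal attention windows within the same layer; fusing these views preserves competing local contexts near transitions and increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs in a unified single-stage architecture, making it applicable to clinical and home settings.
Details
Motivation: Existing temporal action segmentation models struggle to capture sub-second micro-movements while retaining exercise context, which blurs rapid phase transitions and limits the reliability of downstream assessment of motor recovery. A method for temporally precise segmentation of fine-grained actions is needed to support iterative assessment during rehabilitation.
Result: On StrokeRehab, MMTA improves Edit Score over the Global Attention transformer by +1.3 (video) and +1.6 (IMU); on 50Salads it improves by a further +3.3. Ablations confirm that the gains stem from multi-membership temporal views rather than architectural complexity.
Insight: The novelty is the multi-membership temporal attention mechanism, which lets each frame attend to multiple local temporal windows within the same layer and fuses these concurrent views via feature-space overlap resolution, increasing boundary sensitivity while retaining longer-range reasoning. This gives a practical single-stage solution for resource-constrained rehabilitation assessment.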
Abstract: To empower the iterative assessments involved during a person’s rehabilitation, automated assessment of a person’s abilities during daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab while further improving 50Salads by +3.3. Ablations confirm that performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.
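The idea of letting each frame belong to several locally normalized attention windows can be sketched as follows. This is a toy illustration, not the MMTA architecture: it uses unprojected single-head self-attention, fixed overlapping windows, and plain averaging as a stand-in for the paper's feature-space overlap resolution. The function name, `window`, and `stride` are assumptions, and `stride` is assumed not to exceed `window`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_window_attention(x, window=4, stride=2):
    """Self-attention inside overlapping windows, averaged per frame.

    x has shape (T, d). Each softmax is normalized within its own window,
    so a frame covered by several windows receives several local views.
    """
    T, d = x.shape
    out = np.zeros_like(x, dtype=float)
    count = np.zeros(T)
    starts = list(range(0, max(T - window, 0) + 1, stride))
    if starts[-1] + window < T:          # make sure the tail is covered
        starts.append(T - window)
    for s in starts:
        seg = x[s:s + window]
        attn = softmax(seg @ seg.T / np.sqrt(d))   # local normalization
        out[s:s + window] += attn @ seg
        count[s:s + window] += 1
    return out / count[:, None]                    # average the memberships

features = np.random.default_rng(0).standard_normal((10, 8))
fused = multi_window_attention(features)
```

Frames near a window boundary are covered by two windows here, which is where competing local contexts would be preserved around an action transition.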
[137] Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos cs.CVPDF
Yu Luo, Guangyu Wei, Yangfan Li, Jieyu He, Yueming Lyu
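TL;DR: This paper proposes SMART, a semi-supervised learning method for vessel segmentation in X-ray coronary angiography videos. It builds a teacher-student framework on SAM3 and uses motion-aware consistency and progressive confidence regularization to address blurred vessel boundaries, low contrast, and complex motion patterns, achieving state-of-the-art performance with fewer annotations.
Details
Motivation: Vessel segmentation in X-ray coronary angiography videos faces scarce annotations, blurred vessel boundaries, inconsistent radiation contrast, and complex motion patterns. Existing semi-supervised methods also struggle with complicated temporal dynamics and unreliable uncertainty quantification.
Result: Extensive experiments on three XCA sequence datasets from different institutions show that SMART achieves state-of-the-art performance while requiring significantly fewer annotations.
Insight: The contributions are: 1) a SAM3-based teacher-student framework that exploits SAM3's promptable concept segmentation design to maximize the potential of both teacher and student; 2) vessel mask warping combined with a motion consistency loss to model complex vessel dynamics; and 3) a progressive confidence-aware consistency regularization that mitigates unreliable teacher predictions caused by blurred boundaries and low contrast. Viewed objectively, the core novelty is pairing a strong foundation model (SAM3) with regularization tailored to domain-specific challenges such as motion modeling and uncertainty handling.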
Abstract: Segmentation of the main coronary artery from X-ray coronary angiography (XCA) sequences is crucial for the diagnosis of coronary artery diseases. However, this task is challenging due to issues such as blurred boundaries, inconsistent radiation contrast, complex motion patterns, and a lack of annotated images for training. Although Semi-Supervised Learning (SSL) can alleviate the annotation burden, conventional methods struggle with complicated temporal dynamics and unreliable uncertainty quantification. To address these challenges, we propose SAM3-based Teacher-student framework with Motion-Aware consistency and Progressive Confidence Regularization (SMART), a semi-supervised vessel segmentation approach for X-ray angiography videos. First, our method utilizes SAM3’s unique promptable concept segmentation design and innovates a SAM3-based teacher-student framework to maximize the performance potential of both the teacher and the student. Second, we enhance segmentation by integrating the vessel mask warping technique and motion consistency loss to model complex vessel dynamics. To address the issue of unreliable teacher predictions caused by blurred boundaries and minimal contrast, we further propose a progressive confidence-aware consistency regularization to mitigate the risk of unreliable outputs. Extensive experiments on three datasets of XCA sequences from different institutions demonstrate that SMART achieves state-of-the-art performance while requiring significantly fewer annotations, making it particularly valuable for real-world clinical applications where labeled data is scarce. Our code is available at: https://github.com/qimingfan10/SMART.
[138] Unified Vision-Language Modeling via Concept Space Alignment cs.CV | cs.AI | cs.CL | cs.LGPDF
Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk
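TL;DR: This paper presents V-SONAR, a unified vision-language embedding space built by post-hoc alignment of an existing vision encoder to the multilingual text embedding space SONAR. Building on V-SONAR, the paper develops V-LCM, which extends the text-only Large Concept Model (LCM) with vision-language instruction tuning and achieves strong zero-shot and instruction-following performance on multimodal tasks.
Details
Motivation: The goal is a unified multimodal embedding space that extends the strong multilingual text representations of SONAR to the visual domain, enabling unified understanding and generation across languages and modalities (text, image, video).
Result: V-SONAR achieves competitive performance on text-to-video retrieval. Equipped with the OMNISONAR decoder, it surpasses SOTA models on the video captioning tasks DREAM-1K and PE-VIDEO (BLEU improves from 19.6 to 23.9 and from 30.0 to 39.0, respectively). V-LCM matches SOTA vision-language models on image/video captioning and question answering, and significantly outperforms them on 61 of the 62 tested languages, from rich- to low-resource.
Insight: The main novelty is a post-hoc alignment pipeline that efficiently maps a pretrained vision encoder into a mature multilingual text embedding space, quickly yielding a strong unified multimodal representation. A second insight is that an LCM trained on text only can perform zero-shot visual concept understanding through this unified space, and that multimodal instruction tuning with the same latent diffusion objective yields strong multilingual generalization.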
Abstract: We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024), operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM’s text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them on 61 of the 62 tested languages, which range from rich- to low-resource.
[139] pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning cs.CVPDF
Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang
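TL;DR: This paper presents pySpatial, a visual programming framework that has a multimodal large language model (MLLM) generate Python code calling spatial tools (such as 3D reconstruction, camera-pose recovery, and novel-view rendering). This converts 2D image inputs into an explorable 3D scene and strengthens 3D spatial reasoning in a zero-shot setting.
Details
Motivation: MLLMs perform well on general-purpose perception and reasoning but still struggle with tasks that require understanding 3D space. This work targets that limitation.
Result: On the MindCube and Omni3D-Bench benchmarks, pySpatial consistently surpasses strong MLLM baselines, for example outperforming GPT-4.1-mini by 12.94% on MindCube. In real-world indoor navigation experiments, it successfully guides a robot through complex environments.
Insight: The novelty is connecting MLLMs to spatial tools through a code-generation interface, enabling explicit zero-shot 3D spatial reasoning without gradient-based fine-tuning. Viewed objectively, combining visual programming with 3D scene construction gives a scalable and practical way to improve a model's spatial understanding.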
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
[140] On the Exact Algorithmic Extraction of Finite Tesselations Through Prime Extraction of Minimal Representative Forms cs.CVPDF
Sushish Baral, Paulo Garcia, Warisa Sritriratanarak
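TL;DR: This paper proposes a hierarchical algorithm for exactly extracting repeating patterns of axis-aligned rectangular tessellations in finite planar grids. It uses composite discovery (dual inspection and breadth-first pruning) to identify rectangular regions with internal repetition, normalizes them to a minimal representative form, and applies prime extraction (selective duplication and hierarchical memoization) to handle irregular dimensions and keep computation efficient. Scalability is evaluated on grids from 2x2 to 32x32. The algorithm behaves deterministically and applies to symbolic grid analysis, puzzle solving, and identifying exact repeating structures in discrete symbolic domains.
Details
Motivation: Deterministic extraction of periodic repeating patterns in discrete grids, which matters for symbolic reasoning, algorithm synthesis, and structural optimization, is underdeveloped, particularly when multiple independent patterns may coexist within a hierarchical structure.
Result: Scalability is evaluated on grid sizes from 2x2 to 32x32. Overlap detection on simple repeating tiles takes under 1 ms, while complex patterns that require exhaustive search show exponential growth in processing time.
Insight: The novelty is a hierarchical method combining composite discovery, normalization to a minimal representative form, and prime extraction, which yields exact, deterministic extraction of axis-aligned rectangular tessellations and fills a gap in symbolic grid analysis techniques.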
Abstract: The identification of repeating patterns in discrete grids is fundamental to symbolic reasoning, algorithm synthesis and structural optimization across diverse computational domains. Although statistical approaches targeting noisy data can approximately recognize patterns, symbolic analysis utilizing deterministic extraction of periodic structures is underdeveloped. This paper aims to fill this gap by employing a hierarchical algorithm that discovers exact tessellations in finite planar grids, addressing the problem where multiple independent patterns may coexist within a hierarchical structure. The proposed method utilizes composite discovery (dual inspection and breadth-first pruning) for identifying rectangular regions with internal repetition, normalization to a minimal representative form, and prime extraction (selective duplication and hierarchical memoization) to account for irregular dimensions and to achieve efficient computation time. We evaluate scalability on grid sizes from 2x2 to 32x32, showing that overlap detection on simple repeating tiles takes under 1 ms, while complex patterns that require exhaustive search and systematic exploration show exponential growth in processing time. This algorithm provides deterministic behavior for exact, axis-aligned, rectangular tessellations, addressing a critical gap in symbolic grid analysis techniques, and is applicable to puzzle-solving reasoning tasks and the identification of exact repeating structures in discrete symbolic domains.
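The notion of a "minimal representative form" can be illustrated on the simplest case: a grid that is an exact repetition of one rectangular tile whose sides divide the grid dimensions. The brute-force sketch below is not the paper's algorithm (it has no composite discovery, pruning, memoization, or handling of irregular dimensions), and the function name is hypothetical.

```python
import numpy as np

def minimal_tile(grid):
    """Return the smallest tile whose exact repetition reproduces the grid.

    Only tiles whose height and width divide the grid dimensions are tried.
    The full grid always qualifies, so a result is always found.
    """
    grid = np.asarray(grid)
    H, W = grid.shape
    best = None
    for h in (d for d in range(1, H + 1) if H % d == 0):
        for w in (d for d in range(1, W + 1) if W % d == 0):
            candidate = grid[:h, :w]
            if np.array_equal(np.tile(candidate, (H // h, W // w)), grid):
                if best is None or h * w < best[0] * best[1]:
                    best = (h, w)
    return grid[:best[0], :best[1]]

grid = np.tile([[1, 2], [3, 4]], (2, 3))   # 4x6 grid built from a 2x2 tile
tile = minimal_tile(grid)
```

For the example grid the recovered tile is the original 2x2 block. A grid with no repetition returns itself.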
[141] Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards cs.CV | cs.AIPDF
Seungwook Kim, Minsu Cho
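TL;DR: This paper proposes ARC (Adaptive Rewarding by self-Confidence), a framework for post-training text-to-image generative models. It derives a scalar reward from the model's internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes, enabling fully unsupervised optimization without external reward supervision. Experiments show that ARC outperforms the baseline on compositional generation, text rendering, and text-image alignment, and that combining it with external rewards gives complementary improvements and alleviates reward hacking.
Details
Motivation: Post-training of text-to-image models usually depends on external reward supervision (such as human preference, factuality, or aesthetic assessment), which requires extra data, annotation, and reward models. This work aims to enable unsupervised optimization through an internal self-confidence signal.
Result: ARC delivers consistent gains over the baseline on compositional generation, text rendering, and text-image alignment. Combined with external rewards, it yields complementary improvements and reduces reward hacking.
Insight: The novelty is using the model's own self-denoising ability as a self-confidence signal and converting it into a reward that drives unsupervised post-training. This avoids dependence on external resources while remaining compatible with, and complementary to, external reward mechanisms.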
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
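One way to read "how accurately the model recovers injected noise" is as a negative noise-prediction error averaged over a few probes. The sketch below takes that reading. The reward form, the single noise level `sigma`, the `denoiser(x_noisy, sigma)` interface, and the toy denoisers are all assumptions for illustration, not the ARC formulation.

```python
import numpy as np

def self_confidence_reward(x0, denoiser, sigma=0.5, n_probes=4, seed=0):
    """Negative mean squared error between injected and predicted noise."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_probes):
        eps = rng.standard_normal(x0.shape)       # noise injected by the probe
        eps_hat = denoiser(x0 + sigma * eps, sigma)
        errors.append(np.mean((eps_hat - eps) ** 2))
    return -float(np.mean(errors))                # higher means more confident

# Toy "generated sample" and two hypothetical denoisers.
x0 = np.random.default_rng(1).standard_normal((8, 8))
oracle = lambda x_noisy, sigma: (x_noisy - x0) / sigma   # recovers the noise
blind = lambda x_noisy, sigma: np.zeros_like(x_noisy)    # predicts no noise

r_oracle = self_confidence_reward(x0, oracle)
r_blind = self_confidence_reward(x0, blind)
```

The oracle denoiser earns a reward near zero and the blind one a clearly negative reward, which is the ordering a policy-gradient style post-training loop would reinforce.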
[142] DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving cs.CV | cs.ROPDF
Zhiye Wang, Yanbo Jiang, Rui Zhou, Bo Zhang, Fang Zhang
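TL;DR: This paper proposes DriveCode, a domain-specific numerical encoding method for LLM-based autonomous driving. It represents numbers as dedicated embeddings rather than discrete text tokens, addressing the limited numerical reasoning precision and decoding efficiency of existing approaches.
Details
Motivation: In LLM-based autonomous driving, discretizing numbers into tokens makes numerical reasoning imprecise, fails to reflect the positional significance of digits, and makes it hard to balance decoding efficiency against numerical precision. These issues hinder the deployment of LLM-based driving systems.
Result: Evaluations on the OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets show that DriveCode achieves superior performance on trajectory prediction and control signal generation.
Insight: The novelty is a numerical encoding method in which a number projector maps numbers into the language model's hidden space, so that they integrate with visual and textual features in a unified multimodal sequence. This suggests a direction for improving LLMs in domains such as autonomous driving that require precise numerical processing.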
Abstract: Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model’s hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.
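A number projector of the kind described, mapping a scalar into the model's hidden space and splicing the result into the embedding sequence, might look roughly like this. The two-layer MLP, the hidden size, the random weights, and the splice position are illustrative assumptions; the abstract does not specify the projector's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16   # assumed hidden size of the language model

# Hypothetical two-layer projector weights (randomly initialized here).
W1, b1 = rng.standard_normal((1, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.standard_normal((32, HIDDEN)) * 0.1, np.zeros(HIDDEN)

def project_number(value):
    """Map one scalar to a single embedding in the hidden space."""
    h = np.tanh(np.array([[value]], dtype=float) @ W1 + b1)
    return (h @ W2 + b2)[0]

# Three ordinary token embeddings with one number spliced between them,
# e.g. for a prompt like "speed is <num> m/s".
text_embeddings = rng.standard_normal((3, HIDDEN))
number_embedding = project_number(12.5)
sequence = np.vstack([text_embeddings[:2], number_embedding, text_embeddings[2:]])
```

The number occupies exactly one position in the sequence regardless of how many digits it has, which is the efficiency argument made against digit-by-digit tokenization.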
[143] Learning to Weigh Waste: A Physics-Informed Multimodal Fusion Framework and Large-Scale Dataset for Commercial and Industrial Applications cs.CVPDF
Md. Adnanul Islam, Wasimul Karim, Md Mahbub Alam, Subhey Sadi Rahman, Md. Abdur Rahman
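TL;DR: This paper proposes the Multimodal Weight Predictor (MWP), a physics-informed fusion framework that estimates the weight of commercial and industrial waste from RGB images and physical metadata (object dimensions, camera distance, and camera height). The authors also build Waste-Weight-10K, a large-scale real-world dataset of more than 10,000 synchronized image-metadata pairs covering 11 waste categories and a wide weight range. The model extracts visual features with a Vision Transformer, encodes metadata with a dedicated encoder, and integrates the two with Stacked Mutual Attention Fusion to handle perspective effects and link objects to material properties. Training uses Mean Squared Logarithmic Error to stabilize performance over the wide weight range, and a physically grounded explanation module based on SHAP and a large language model provides readable explanations of predictions.
Details
Motivation: Image-based weight estimation for commercial and industrial waste is difficult because similar-looking objects may have different densities and visible size changes with camera distance, which makes conventional vision-only methods inaccurate.
Result: On the Waste-Weight-10K test set, the method achieves a Mean Absolute Error (MAE) of 88.06 kg, a Mean Absolute Percentage Error (MAPE) of 6.39%, and an R2 coefficient of 0.9548. For light objects in the 0-100 kg range, MAE is 2.38 kg and MAPE is 3.1%; for heavy waste in the 1000-2000 kg range, MAPE is 11.1%.
Insight: The contributions are: 1) a multimodal fusion framework that combines physical metadata (such as dimensions and camera parameters) with RGB images to handle perspective and density variation; 2) the large-scale real-world dataset Waste-Weight-10K, which fills a data gap; 3) Stacked Mutual Attention Fusion, which lets visual and physical cues guide each other; 4) training with Mean Squared Logarithmic Error to stabilize prediction over a wide weight range; and 5) a physically grounded explanation module based on SHAP and an LLM for interpretability. Viewed objectively, combining physical priors with deep learning improves the robustness and interpretability of weight estimation for practical industrial use.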
Abstract: Accurate weight estimation of commercial and industrial waste is important for efficient operations, yet image-based estimation remains difficult because similar-looking objects may have different densities, and the visible size changes with camera distance. To address this problem, we propose the Multimodal Weight Predictor (MWP) framework, which estimates waste weight by combining RGB images with physics-informed metadata, including object dimensions, camera distance, and camera height. We also introduce Waste-Weight-10K, a real-world dataset containing 10,421 synchronized image-metadata pairs collected from logistics and recycling sites. The dataset covers 11 waste categories and a wide weight range from 3.5 to 3,450 kg. Our model uses a Vision Transformer for visual features and a dedicated metadata encoder for geometric and category information, combining them with Stacked Mutual Attention Fusion, which allows visual and physical cues to guide each other. This helps the model manage perspective effects and link objects to material properties. To ensure stable performance across the wide weight range, we train the model using Mean Squared Logarithmic Error. On the test set, the proposed method achieves 88.06 kg Mean Absolute Error (MAE), 6.39% Mean Absolute Percentage Error (MAPE), and an R2 coefficient of 0.9548. The model shows strong accuracy for light objects in the 0-100 kg range with 2.38 kg MAE and 3.1% MAPE, while maintaining reliable performance for heavy waste in the 1000-2000 kg range with 11.1% MAPE. Finally, we incorporate a physically grounded explanation module using Shapley Additive Explanations (SHAP) and a large language model to provide clear, human-readable explanations for each prediction.
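The metrics named in this abstract have standard definitions, sketched below with NumPy. The toy weights are invented to show why a logarithmic loss suits a 3.5 to 3,450 kg range: the same 10 kg error is penalized far more on a light object than on a heavy one. Using `log1p` follows the common MSLE convention and is an assumption about the authors' exact loss.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error (log1p convention)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error, in the units of the target (kg here)."""
    return float(np.mean(np.abs(np.asarray(y_pred, float) - np.asarray(y_true, float))))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true) / y_true) * 100.0)

# Toy example: a 10 kg item and a 1000 kg item, each overestimated by 10 kg.
light_loss = msle([10.0], [20.0])
heavy_loss = msle([1000.0], [1010.0])
overall_mae = mae([10.0, 1000.0], [20.0, 1010.0])
overall_mape = mape([10.0, 1000.0], [20.0, 1010.0])
```

Both items contribute equally to MAE, but the light item dominates MSLE, so training with MSLE keeps relative error on light objects from being swamped by heavy ones.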
[144] Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos cs.CV | cs.AIPDF
Shreshth Saini, Bowen Chen, Neil Birkbeck, Yilin Wang, Balu Adsumilli
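TL;DR: This paper addresses quality assessment of High Dynamic Range (HDR) user-generated content (UGC) videos. It builds Beyond8Bits, a large-scale subjective dataset, and proposes HDR-Q, the first multimodal large language model for HDR-UGC video quality assessment. With an HDR-aware vision encoder and an HDR-aware policy optimization framework, HDR-Q substantially improves assessment of HDR-specific distortions.
Details
Motivation: HDR-UGC videos are spreading rapidly on social platforms, but existing perceptual video quality assessment systems mainly target Standard Dynamic Range video. They cannot handle distortions exposed by HDR's higher bit depth, wider color gamut, and higher luminance range, such as near-black crushing, highlight clipping, banding, and exposure flicker.
Result: HDR-Q achieves state-of-the-art performance on both the self-built Beyond8Bits dataset and public HDR-VQA benchmarks.
Insight: The contributions are: 1) Beyond8Bits, the first large-scale, diverse subjective quality dataset for HDR-UGC videos; and 2) HDR-Q, the first multimodal large language model for HDR-UGC VQA, built around a novel HDR-aware vision encoder and the HDR-Aware Policy Optimization framework. The latter uses an HDR-SDR contrastive KL term and a Gaussian weighted regression reward to steer the model toward HDR input cues and fine-grained quality score calibration.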
Abstract: High Dynamic Range (HDR) user-generated content (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR has a higher bit depth, wide color gamut, and elevated luminance range, exposing distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate Beyond8Bits, a large-scale subjective dataset of 44K videos from 6.5K sources with over 1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR-SDR contrastive KL that encourages token reliance on HDR inputs and a Gaussian weighted regression reward for fine-grained MOS calibration. Across Beyond8Bits and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance.
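A Gaussian-shaped regression reward can be sketched as a function that peaks when the predicted score equals the ground-truth MOS and decays smoothly with the error. The exact form and weighting used in HAPO are not given in the abstract, so the formula, the `sigma` value, and the 1-to-5 MOS scale in the example are assumptions.

```python
import math

def gaussian_reward(pred_score, mos, sigma=0.5):
    """Reward in (0, 1]: equals 1 at pred_score == mos, decays with squared error."""
    return math.exp(-((pred_score - mos) ** 2) / (2.0 * sigma ** 2))

# Hypothetical predictions for a video whose ground-truth MOS is 3.8.
exact = gaussian_reward(3.8, 3.8)
close = gaussian_reward(3.6, 3.8)
far = gaussian_reward(2.0, 3.8)
```

Unlike a binary correct/incorrect reward, this gives a graded signal: near misses still earn most of the reward, which suits calibrating continuous quality scores.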
[145] Mobile-VTON: High-Fidelity On-Device Virtual Try-On cs.CVPDF
Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen
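TL;DR: This paper presents Mobile-VTON, a high-quality, privacy-preserving framework for offline virtual try-on on mobile devices. It uses a TeacherNet-GarmentNet-TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment, aiming for high-fidelity try-on on a mobile device from a single user image and a garment image.
Details
Motivation: Most existing virtual try-on systems rely on cloud GPUs, which raises privacy risks and prevents on-device deployment. This work aims to deliver a fully offline, efficient, and high-quality mobile try-on solution.
Result: Experiments on VITON-HD and DressCode (at 1024x768 resolution) show that Mobile-VTON matches or outperforms strong server-based baselines while running entirely offline.
Insight: The contributions are: 1) the modular TGT architecture, optimized for on-device efficiency; 2) a Feature-Guided Adversarial Distillation strategy that combines teacher supervision with adversarial learning to better match real image distributions; and 3) a trajectory-consistency loss in GarmentNet to preserve garment semantics, with latent concatenation and lightweight cross-modal conditioning in TryonNet for robust garment-to-person alignment without large-scale pretraining. Together these give a feasible route to high-quality generation on resource-constrained devices.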
Abstract: Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet–GarmentNet–TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024×768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
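[146] PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation cs.CV
Jiangshan Wang, Kang Zhao, Jiayi Guo, Jiayu Wang, Hang Guo
TL;DR: This paper proposes PreciseCache, a plug-and-play framework that accelerates inference by precisely detecting and skipping redundant computation in video generation models while preserving generation quality. It has two components: LFCache for step-wise caching, which identifies redundant steps by computing a low-frequency difference, and BlockCache for block-wise caching. Together they deliver substantial speedups without sacrificing fidelity.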
Details
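Motivation: Existing video generation models are computationally expensive and slow at inference, which limits practical use. Earlier feature-caching accelerators often degrade quality noticeably; the root cause is that they cannot accurately identify truly redundant features and therefore skip computation on important ones.
Result: Extensive experiments on multiple backbones show that PreciseCache achieves an average 2.6x speedup with no noticeable quality loss.
Insight: The core contribution is a two-level (step-level and block-level) mechanism for precise redundancy detection. It uses the low-frequency difference as an effective measure of step-level redundancy, then performs finer-grained redundancy detection and skipping at the block level. This gives a systematic way to cut redundant computation in video generation while keeping fidelity high.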
Abstract: High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose PreciseCache, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, which achieves an average of 2.6x speedup without noticeable quality loss. Source code will be released.
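Illustration (not from the paper): the step-wise skipping rule described in the PreciseCache abstract can be sketched as below. This is a minimal sketch under stated assumptions. The moving-average low-pass filter, the relative L2 distance, the threshold value, and every function name are hypothetical choices, since the abstract does not specify how the Low-Frequency Difference is computed. How the current-step features are obtained cheaply enough to make skipping worthwhile is not modeled.

```python
import math


def low_pass(x, k=3):
    """Moving-average low-pass filter over a 1D feature list.

    Assumption: a box filter stands in for whatever low-frequency
    extraction the paper actually uses. Windows are clamped at the edges.
    """
    n = len(x)
    half = k // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def low_freq_diff(current, cached, k=3):
    """Relative L2 distance between the low-pass parts of two feature vectors."""
    a, b = low_pass(current, k), low_pass(cached, k)
    num = math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    den = math.sqrt(sum(v * v for v in b)) + 1e-8
    return num / den


def should_skip(current, cached, threshold=0.05):
    """Reuse the cached features when the low-frequency content barely changed.

    The threshold is a hypothetical value chosen for this example.
    """
    return low_freq_diff(current, cached) < threshold


cached = [1.0, 2.0, 3.0, 4.0]
jittered = [1.1, 1.9, 3.1, 3.9]   # only high-frequency change
shifted = [2.0, 3.0, 4.0, 5.0]    # low-frequency (global) change
print(should_skip(jittered, cached), should_skip(shifted, cached))
```

In this toy case the alternating perturbation is mostly removed by the filter, so the step counts as redundant, while the global shift survives filtering and forces a recompute.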
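[147] EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization cs.CV | cs.AI
Zhaoxin Fan, Nanxiang Jiang, Daiheng Gao, Shiji Zhou, Wenjun Wu
TL;DR: EraseAnything++ is a unified concept-erasure framework for modern text-to-image (T2I) and text-to-video (T2V) diffusion models built on flow matching and Transformer architectures (e.g., Stable Diffusion v3, Flux, OpenSora). It formulates concept erasure as a constrained multi-objective optimization problem and uses implicit gradient surgery, LoRA parameter tuning, and attention-level regularization to remove the specified concepts while preserving overall generation quality and spatio-temporal consistency.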
Details
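Motivation: Existing concept-erasure methods were designed mainly for earlier diffusion models and generalize poorly to modern flow-matching, Transformer-based models that extend to long-horizon video generation (e.g., Stable Diffusion v3, Flux, OpenSora). These architectures raise new challenges and call for a unified, efficient framework that removes undesired concepts while maintaining generation quality.
Result: Extensive experiments on image and video benchmarks show that EraseAnything++ substantially outperforms existing methods in erasure effectiveness, generative fidelity, and temporal consistency, setting a new state of the art (SOTA) for concept erasure in next-generation diffusion models.
Insight: The core contributions are: 1) formalizing concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal against preservation of generative utility; 2) an efficient utility-preserving unlearning strategy based on implicit gradient surgery to resolve the conflicting objectives; 3) combining LoRA parameter tuning with attention-level regularization to anchor erasure on key visual representations and propagate it consistently across spatial and temporal dimensions; 4) for video, an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it through subsequent Transformer layers to mitigate temporal drift.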
Abstract: Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
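Illustration (not from the paper): the abstract of EraseAnything++ mentions implicit gradient surgery for the conflict between erasing a concept and preserving utility, but gives no formula. The sketch below swaps in the explicit projection rule from PCGrad-style gradient surgery for multi-task learning, purely to show what resolving a gradient conflict means. It is not the authors' implicit variant, and the names `g_erase`, `g_keep`, and `project_conflict` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g_erase, g_keep):
    """PCGrad-style projection (a stand-in, not the paper's implicit method).

    If the erasure gradient points against the utility-preservation
    gradient (negative inner product), remove the conflicting component
    so the update no longer works against the preservation objective.
    """
    d = dot(g_erase, g_keep)
    if d >= 0:
        return list(g_erase)  # no conflict: leave the gradient untouched
    scale = d / (dot(g_keep, g_keep) + 1e-12)
    return [ge - scale * gk for ge, gk in zip(g_erase, g_keep)]


g_keep = [0.0, 1.0]
conflicting = project_conflict([1.0, -1.0], g_keep)
agreeing = project_conflict([1.0, 1.0], g_keep)
print(conflicting, agreeing)
```

After projection the conflicting gradient is orthogonal to `g_keep`, so a step along it leaves the preservation objective unchanged to first order.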
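[148] From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents cs.CV | cs.AI | cs.CL | cs.IR | cs.MM
Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen
TL;DR: This paper proposes MM-Mem, a pyramidal multimodal memory architecture that addresses the limitations of multimodal large language models in long-horizon video understanding. Inspired by Fuzzy-Trace Theory, it organizes memory into a Sensory Buffer, an Episodic Stream, and a Symbolic Schema, optimizes the trade-off between memory compression and task-relevant information retention with a Semantic Information Bottleneck objective, and uses an entropy-driven top-down retrieval strategy. Experiments on four benchmarks validate its effectiveness on both offline and streaming tasks.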
Details
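Motivation: Existing approaches to long-horizon video understanding fall into two extremes: vision-centric methods incur high latency and redundancy from dense visual accumulation, while text-centric methods lose detail and hallucinate because of aggressive captioning. The paper aims to bridge this gap and build an efficient multimodal memory system that mirrors human cognitive efficiency.
Result: Extensive experiments on four benchmarks confirm the effectiveness of MM-Mem on offline and streaming tasks, demonstrate robust generalization, and validate cognition-inspired memory organization.
Insight: The main contributions are: 1) a hierarchical pyramidal multimodal memory architecture inspired by Fuzzy-Trace Theory that progressively distills fine-grained perceptual traces into high-level semantic schemas; 2) a Semantic Information Bottleneck objective, optimized with SIB-GRPO, that governs memory construction dynamically; 3) an entropy-driven top-down retrieval strategy that drills down into lower memory levels when uncertainty is high. Together these provide a cognition-inspired, efficient, and generalizable memory mechanism for long-horizon video understanding.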
Abstract: While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively “drills down” to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
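Illustration (not from the paper): the entropy-driven top-down retrieval described for MM-Mem can be sketched as a loop that starts at the most abstract memory level and drills down only while the answer distribution stays uncertain. This is a minimal sketch under assumptions: the level order, the entropy threshold, and the `answer_dist` callable (a stand-in for the model's answer probabilities given the retrieved context) are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def retrieve(levels, answer_dist, max_entropy=0.5):
    """Top-down retrieval with an entropy-based stopping rule.

    levels: list of (name, memory) ordered from most abstract to most
        detailed (the order is an assumption of this sketch).
    answer_dist: callable mapping the list of retrieved memories to a
        probability list over candidate answers (hypothetical interface).
    Returns the names of the levels consulted and the final distribution.
    """
    if not levels:
        raise ValueError("at least one memory level is required")
    context, used = [], []
    for name, memory in levels:
        context.append(memory)
        used.append(name)
        probs = answer_dist(context)
        if entropy(probs) <= max_entropy:
            break  # confident enough: stop before touching finer levels
    return used, probs


def toy_answer_dist(context):
    # Toy model: the schema alone is ambiguous, one more level resolves it.
    return [0.5, 0.5] if len(context) < 2 else [0.95, 0.05]


levels = [("symbolic_schema", "gist"), ("episodic_stream", "events"),
          ("sensory_buffer", "frames")]
used, probs = retrieve(levels, toy_answer_dist)
print(used, probs)
```

The design point is that the expensive, fine-grained level is consulted only when cheaper levels leave the answer uncertain.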
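[149] Fake It Right: Injecting Anatomical Logic into Synthetic Supervised Pre-training for Medical Segmentation cs.CV
Jiaqi Tang, Mengyan Zheng, Shu Zhang, Fandong Zhang, Qingchao Chen
TL;DR: This paper proposes an anatomy-informed synthetic supervised pre-training framework for medical image segmentation that addresses the semantic gap between synthetic data and real anatomy in conventional Formula-Driven Supervised Learning (FDSL). It generates anatomically plausible synthetic data using a lightweight anatomical shape bank and a structure-aware sequential placement strategy, letting the model learn essential global structural priors while preserving privacy.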
Details
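Motivation: Vision Transformers for medical image segmentation depend on large annotated datasets, and self-supervised learning faces privacy and logistical barriers. Existing FDSL methods generate synthetic data from generic mathematical primitives that lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, so models cannot learn the necessary global structural priors, which limits their effectiveness.
Result: Extensive experiments on the BTCV and MSD datasets show that the method significantly outperforms state-of-the-art FDSL baselines and self-supervised methods, by 1.74% and up to 1.66%, respectively. It also shows a robust scaling effect: performance improves as the volume of synthetic data grows.
Insight: The core contribution is unifying FDSL's unlimited scalability with anatomical realism. Specifically: 1) generic primitives are replaced by a lightweight shape bank built from de-identified, label-only segmentation masks from a small number of subjects; 2) a structure-aware sequential placement strategy uses spatial anchors for correct localization and a topological graph to manage inter-organ interactions (e.g., preventing impossible organ overlaps), enforcing physiological plausibility during synthesis. The result is a data-efficient, privacy-compliant solution for medical segmentation.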
Abstract: Vision Transformers (ViTs) excel in 3D medical segmentation but require massive annotated datasets. While Self-Supervised Learning (SSL) mitigates this using unlabeled data, it still faces strict privacy and logistical barriers. Formula-Driven Supervised Learning (FDSL) offers a privacy-preserving alternative by pre-training on synthetic mathematical primitives. However, a critical semantic gap limits its efficacy: generic shapes lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, preventing models from learning essential global structural priors. To bridge this gap, we propose an Anatomy-Informed Synthetic Supervised Pre-training framework unifying FDSL’s infinite scalability with anatomical realism. We replace basic primitives with a lightweight shape bank with de-identified, label-only segmentation masks from 5 subjects. Furthermore, we introduce a structure-aware sequential placement strategy to govern the patch synthesis process. Instead of random placement, we enforce physiological plausibility using spatial anchors for correct localization and a topological graph to manage inter-organ interactions (e.g., preventing impossible overlaps). Extensive experiments on BTCV and MSD datasets demonstrate that our method significantly outperforms state-of-the-art FDSL baselines and SSL methods by 1.74% and up to 1.66%, while exhibiting a robust scaling effect where performance improves with increased synthetic data volume. This provides a data-efficient, privacy-compliant solution for medical segmentation. The code will be made publicly available upon acceptance.
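[150] Event-Anchored Frame Selection for Effective Long-Video Understanding cs.CV
Wang Chen, Yongdong Luo, Yuhui Zeng, Luojun Lin, Tianyu Xie
TL;DR: This paper proposes Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline that addresses the efficiency problems caused by massive frame redundancy and limited context windows in long-video understanding. It first uses self-supervised DINO embeddings to partition the video stream into visually homogeneous temporal segments that serve as proxies for semantic events, then selects the most query-relevant frame in each event as an anchor, and finally runs a global refinement with an adaptive Maximal Marginal Relevance scheme so that the keyframe set jointly optimizes event coverage, query relevance, and visual diversity.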
Details
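Motivation: Existing methods typically adopt a flat sampling paradigm that treats a video as an unstructured collection of frames and cannot handle the frame redundancy and context limits of long videos effectively, so a more efficient, structured frame selection method is needed.
Result: As a training-free, plug-and-play module applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on the challenging VideoMME, LongVideoBench, and MLVU video understanding benchmarks, respectively, reaching the current state of the art.
Insight: The contribution is a hierarchical, event-aware frame selection pipeline that combines event segmentation and anchor selection with adaptive MMR refinement to give a structured view of long-video content. Viewed objectively, the method treats a video as a sequence of semantic events rather than an unordered set of frames, which improves the efficiency and quality of frame selection and is general and extensible.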
Abstract: Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
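Illustration (not from the paper): the global refinement stage of EFS builds on Maximal Marginal Relevance (MMR), which trades query relevance against redundancy with frames already chosen. The sketch below runs plain MMR seeded with the event anchors. It assumes a fixed trade-off weight `lam` instead of the paper's adaptive scheme, cosine similarity on toy embeddings, and hypothetical function names.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mmr_select(embeddings, relevance, anchors, budget, lam=0.7):
    """Greedy MMR seeded with anchor frames.

    embeddings: per-frame feature vectors.
    relevance: per-frame query-relevance scores.
    anchors: indices pre-selected as event anchors (kept unconditionally).
    budget: total number of frames to return.
    lam: fixed relevance/diversity weight (the paper's scheme is adaptive;
        a constant is used here for simplicity).
    """
    selected = list(anchors)
    candidates = [i for i in range(len(embeddings)) if i not in selected]
    while len(selected) < budget and candidates:
        def score(i):
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)


embeddings = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
relevance = [0.9, 0.85, 0.5, 0.6]
picked = mmr_select(embeddings, relevance, anchors=[0], budget=2, lam=0.5)
print(picked)
```

Frame 1 is highly relevant but nearly identical to the anchor, so MMR prefers the less relevant but visually distinct frame 2.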
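[151] The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers cs.CV
Jiaqi Tang, Weixuan Xu, Shu Zhang, Fandong Zhang, Qingchao Chen
TL;DR: This paper proposes a physics-inspired, spatially decoupled synthesis framework to address the difficulty of training medical Vision Transformers under data scarcity and privacy constraints. It orthogonalizes the synthesis process: a gradient-shielded buffer zone is built around boundaries to stabilize shape learning, and physics-driven spectral textures are injected into the object core, reconciling shape representation learning with invariance to acquisition noise.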
Details
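Motivation: The work targets a limitation of Formula-Driven Supervised Learning (FDSL) for medical image synthesis: existing methods rely on simple geometric shapes with homogeneous intensities and ignore the tissue textures and noise patterns inherent in modalities such as CT and MRI, leaving a significant gap between synthetic and real data. Naively adding textures in turn causes an optimization conflict, boundary aliasing.
Result: Extensive experiments on the BTCV and MSD datasets show that the method significantly outperforms previous FDSL methods, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on MSD tasks, providing a scalable, annotation-free foundation for medical ViTs.
Insight: The contribution lies in identifying and resolving boundary aliasing: a spatially decoupled synthesis framework orthogonalizes texture and shape learning, a gradient-shielded buffer zone protects boundary gradient signals, and physics-driven textures are injected to increase realism, offering a new approach to synthetic data generation in data-scarce domains.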
Abstract: Vision Transformers (ViTs) have revolutionized medical image analysis, yet their data-hungry nature clashes with the scarcity and privacy constraints of clinical archives. Formula-Driven Supervised Learning (FDSL) has emerged as a promising solution to this bottleneck, synthesizing infinite annotated samples from mathematical formulas without utilizing real patient data. However, existing FDSL paradigms rely on simple geometric shapes with homogeneous intensities, creating a substantial gap by neglecting tissue textures and noise patterns inherent in modalities like CT and MRI. In this paper, we identify a critical optimization conflict termed boundary aliasing: when high-frequency synthetic textures are naively added, they corrupt the image gradient signals necessary for learning structural boundaries, causing the model to fail in delineating real anatomical margins. To bridge this gap, we propose a novel Physics-inspired Spatially-Decoupled Synthesis framework. Our approach orthogonalizes the synthesis process: it first constructs a gradient-shielded buffer zone based on boundary distance to ensure stable shape learning, and subsequently injects physics-driven spectral textures into the object core. This design effectively reconciles robust shape representation learning with invariance to acquisition noise. Extensive experiments on the BTCV and MSD datasets demonstrate that our method significantly outperforms previous FDSL, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on MSD task, offering a scalable, annotation-free foundation for medical ViTs. The code will be made publicly available upon acceptance.
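Illustration (not from the paper): the spatial decoupling described above, a texture-free buffer near the boundary and texture only in the object core, can be sketched on a 2D mask. This is a minimal sketch under assumptions: a 4-connected breadth-first distance stands in for the boundary distance, the buffer width is arbitrary, the texture is a caller-supplied function in place of the paper's physics-driven spectral textures, and all names are hypothetical.

```python
from collections import deque


def boundary_distance(mask):
    """4-connected BFS distance from each foreground pixel to the background.

    Background pixels get 0. If the mask has no background at all,
    foreground pixels get a large fallback value (height + width).
    """
    h, w = len(mask), len(mask[0])
    dist = [[None if mask[y][x] else 0 for x in range(w)] for y in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w) if not mask[y][x])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return [[h + w if d is None else d for d in row] for row in dist]


def synthesize(mask, base, texture, buffer_width=2):
    """Fill the object with a flat intensity, adding texture only in the core.

    Pixels within `buffer_width` of the boundary stay homogeneous, so the
    intensity step at the boundary is not disturbed by texture.
    """
    dist = boundary_distance(mask)
    out = []
    for y, row in enumerate(mask):
        out_row = []
        for x, inside in enumerate(row):
            value = base if inside else 0.0
            if inside and dist[y][x] > buffer_width:
                value += texture(y, x)
            out_row.append(value)
        out.append(out_row)
    return out


# 7x7 image with a 5x5 square object in the middle.
mask = [[1 <= y <= 5 and 1 <= x <= 5 for x in range(7)] for y in range(7)]
image = synthesize(mask, base=1.0, texture=lambda y, x: 0.5, buffer_width=2)
print(image[3])
```

With a buffer width of 2, only the center pixel of the 5x5 object is far enough from the boundary to receive texture.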
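[152] Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality cs.CV | cs.SE
Danfeng Hong, Chenyu Li, Xuyang Li, Gustau Camps-Valls, Jocelyn Chanussot
TL;DR: This paper is a technical survey of foundation models in remote sensing. It systematically traces their development from unimodal to multimodal models and aims to give researchers an entry point and a reference for practical applications.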
Details
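Motivation: As the volume and diversity of remote sensing data grow exponentially, advanced data modeling and understanding capabilities are urgently needed to manage and interpret these massive datasets. Foundation models offer revolutionary potential and new growth opportunities for the field.
Result: As a survey, the paper reports no specific quantitative experimental results; it systematically reviews and categorizes existing remote sensing foundation models and provides a practical tutorial.
Insight: The survey organizes remote sensing foundation models from the perspective of their evolution from unimodality to multimodality and adds beginner-oriented practical guidance, which helps lower the barrier to entry and promote applications.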
Abstract: Remote sensing (RS) techniques are increasingly crucial for deepening our understanding of the planet. As the volume and diversity of RS data continue to grow exponentially, there is an urgent need for advanced data modeling and understanding capabilities to manage and interpret these vast datasets effectively. Foundation models present significant new growth opportunities and immense potential to revolutionize the RS field. In this paper, we conduct a comprehensive technical survey on foundation models in RS, offering a brand-new perspective by exploring their evolution from unimodality to multimodality. We hope this work serves as a valuable entry point for researchers interested in both foundation models and RS and helps them launch new projects or explore new research topics in this rapidly evolving area. This survey addresses the following three key questions: What are foundation models in RS? Why are foundation models needed in RS? How can we effectively guide junior researchers in gaining a comprehensive and practical understanding of foundation models in RS applications? More specifically, we begin by outlining the background and motivation, emphasizing the importance of foundation models in RS. We then review existing foundation models in RS, systematically categorizing them into unimodal and multimodal approaches. Additionally, we provide a tutorial-like section to guide researchers, especially beginners, on how to train foundation models in RS and apply them to real-world tasks. The survey aims to equip researchers in RS with a deeper and more efficient understanding of foundation models, enabling them to get started easily and effectively apply these models across various RS applications.
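[153] MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation cs.CV
Yi Zhang, Puxun Tu, Kun Wang, Yulin Yan, Tao Ying
TL;DR: MLRecon is a robust markerless freehand 3D ultrasound reconstruction framework that achieves drift-resilient 6D probe pose tracking with a single consumer-grade RGB-D camera. It leverages the generalization ability of vision foundation models for continuous markerless tracking and uses a vision-guided divergence detector to monitor tracking integrity and trigger failure recovery. It also proposes a dual-stage pose refinement network that decouples high-frequency jitter from low-frequency bias, denoising the trajectory while preserving the kinematic fidelity of operator motion.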
Details
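Motivation: Existing tracking paradigms for freehand 3D ultrasound reconstruction face a restrictive trilemma: marker-based systems are costly, inside-out methods require intrusive sensor attachment, and sensorless methods suffer severe cumulative drift. MLRecon aims to overcome these limits and provide low-cost, accessible volumetric ultrasound imaging for resource-limited clinical settings.
Result: Experiments show that MLRecon achieves average position errors as low as 0.88 mm on complex trajectories and produces high-quality 3D reconstructions with sub-millimeter mean surface accuracy, significantly outperforming competing sensorless and sensor-aided methods and setting a new benchmark for low-cost volumetric ultrasound imaging.
Insight: The contributions are: robust continuous markerless tracking built on vision foundation models; a vision-guided divergence detector for autonomous integrity monitoring and failure recovery; and a dual-stage pose refinement network that explicitly decouples high-frequency jitter from low-frequency bias to refine the trajectory. Together they offer a new technical route to high-accuracy, low-cost, non-invasive 3D ultrasound imaging in resource-limited settings.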
Abstract: Freehand 3D ultrasound (US) reconstruction promises volumetric imaging with the flexibility of standard 2D probes, yet existing tracking paradigms face a restrictive trilemma: marker-based systems demand prohibitive costs, inside-out methods require intrusive sensor attachment, and sensorless approaches suffer from severe cumulative drift. To overcome these limitations, we present MLRecon, a robust markerless 3D US reconstruction framework delivering drift-resilient 6D probe pose tracking using a single commodity RGB-D camera. Leveraging the generalization power of vision foundation models, our pipeline enables continuous markerless tracking of the probe, augmented by a vision-guided divergence detector that autonomously monitors tracking integrity and triggers failure recovery to ensure uninterrupted scanning. Crucially, we further propose a dual-stage pose refinement network that explicitly disentangles high-frequency jitter from low-frequency bias, effectively denoising the trajectory while maintaining the kinematic fidelity of operator maneuvers. Experiments demonstrate that MLRecon significantly outperforms competing sensorless and sensor-aided methods, achieving average position errors as low as 0.88 mm on complex trajectories and yielding high-quality 3D reconstructions with sub-millimeter mean surface accuracy. This establishes a new benchmark for low-cost, accessible volumetric US imaging in resource-limited clinical settings.
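[154] Let Your Image Move with Your Motion! – Implicit Multi-Object Multi-Motion Transfer cs.CV
Yuze Li, Dong Gong, Xiao Cao, Junchao Yuan, Dongsheng Li
TL;DR: This paper proposes FlexiMMT, the first implicit image-to-video framework for multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, it extracts motion representations independently and assigns distinct motion patterns to different objects accurately, supporting flexible recombination and arbitrary motion-to-object mappings.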
Details
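Motivation: Existing motion transfer methods focus mainly on single-object scenarios and perform poorly when multiple objects need distinct motion patterns. The paper addresses the core challenge of multi-object, multi-motion transfer: cross-object motion entanglement.
Result: Extensive experiments show that FlexiMMT achieves precise, compositional performance on image-to-video multi-object multi-motion transfer and reaches the state of the art.
Insight: The contributions are: 1) a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention so that motion and text tokens influence only their designated regions; 2) a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and propagates them efficiently across frames. These mechanisms resolve motion entanglement in multi-object scenes.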
Abstract: Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
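Illustration (not from the paper): the idea behind the Motion Decoupled Mask Attention Mechanism, letting each image region attend only to the motion tokens of its own object, can be sketched as a masked softmax. This is a minimal sketch under assumptions: object identities are given as integer labels per query and per key, the logits are toy values, and the function names are hypothetical.

```python
import math


def build_mask(query_object, key_object):
    """allowed[q][k] is True when query q and key k belong to the same object."""
    return [[q == k for k in key_object] for q in query_object]


def masked_attention(scores, allowed):
    """Row-wise softmax over attention logits, restricted to allowed keys.

    Disallowed keys receive exactly zero weight, so a motion token cannot
    influence pixels of another object. A row with no allowed key yields
    all zeros.
    """
    weights = []
    for row, mask in zip(scores, allowed):
        kept = [s for s, m in zip(row, mask) if m]
        if not kept:
            weights.append([0.0] * len(row))
            continue
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row, mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three image queries (two on object 0, one on object 1) and
# two motion tokens (one per object).
allowed = build_mask(query_object=[0, 0, 1], key_object=[0, 1])
scores = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
weights = masked_attention(scores, allowed)
print(weights)
```

Even with identical logits, each query places all of its attention on the motion token of its own object.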
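[155] Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving cs.CV
Xubo Zhu, Haoyang Zhang, Fei He, Rui Wu, Yanhu Shan
TL;DR: The paper proposes Dr.Occ, a depth- and region-guided 3D semantic occupancy prediction framework for autonomous driving perception. A depth-guided 2D-to-3D View Transformer (D²-VFormer) uses high-quality depth cues for precise geometric alignment, and a region-guided Expert Transformer (R/R²-EFormer) adaptively allocates region-specific experts to handle spatial semantic variation, addressing the geometric misalignment in view transformation and the spatial class imbalance of existing methods.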
Details
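Motivation: Existing 3D semantic occupancy prediction methods suffer geometric misalignment during view transformation because they lack pixel-level accurate depth estimation, and semantic classes are severely imbalanced in space (spatial anisotropy). The paper aims to address these challenges and improve geometric understanding and semantic recognition in autonomous driving scenes.
Result: On the Occ3D-nuScenes benchmark in the vision-only setting, Dr.Occ improves over the strong BEVDet4D baseline by 7.43% mIoU and 3.09% IoU, reaching a new state of the art.
Insight: The contribution is combining depth guidance (dense depth from MoGe-2) with region guidance (MoE-inspired adaptive allocation of region experts), improving 3D occupancy prediction from two complementary angles, geometric alignment and semantic learning. This dual depth-and-region guidance is an effective way to handle the complex geometric and semantic variation of driving scenes.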
Abstract: 3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to the lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D²-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R²-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting.
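[156] Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT cs.CV | cs.CL | cs.LG
Simon Ging, Philipp Arnold, Sebastian Walter, Hani Alnahas, Hannah Bast
TL;DR: This paper proposes a disease-aware vision-language pretraining model for 3D CT. It combines 98k report-volume pairs collected at a hospital with public datasets and trains with SigLIP-style contrastive pretraining plus prompt-based disease supervision in a shared vision-text embedding space. On the CT-RATE benchmark the model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). The paper also automatically mines 262k text snippet-slice pairs and introduces intra-scan snippet localization, which predicts the axial depth a text snippet refers to; this reduces mean absolute error to 36.3 mm (at 12 mm feature resolution), compared with 67.0 mm for the best baseline.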
Details
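Motivation: Existing 3D CT vision-language models usually rely on limited public data and provide only coarse global supervision, so they cannot precisely link textual descriptions in a report to specific axial locations in the CT scan. The paper aims to improve retrieval, classification, and localization through large-scale hospital data and disease-aware supervision.
Result: On CT-RATE, the model reaches SOTA text-to-image retrieval (R@10 31.5) and matches the best existing model on disease classification (AUC 83.8); on Rad-ChestCT its disease classification AUC is 77.0, comparable to the baseline (77.3). On intra-scan snippet localization, mean absolute error falls to 36.3 mm, well below the baseline's 67.0 mm.
Insight: The contributions are: 1) pretraining on large-scale in-hospital data combined with public datasets; 2) prompt-based disease supervision that strengthens semantic alignment in the shared embedding space; 3) a new intra-scan snippet localization task, enabled by automatically mined snippet-slice pairs, that provides fine-grained grounding without hurting retrieval or classification, yielding a single unified model for retrieval, classification, and localization.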
Abstract: Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., “series X, image Y”), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization – predicting the axial depth referred to by a text snippet – reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
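Illustration (not from the paper): the SigLIP-style contrastive objective mentioned above treats every volume-report pair in a batch as an independent binary classification, with matched pairs labeled +1 and all other pairs labeled -1. The sketch below is a minimal pure-Python version under assumptions: embeddings are already L2-normalized, the temperature `t` and bias `b` are fixed illustrative constants (they are learnable in practice), and the function names are hypothetical. It does not include the paper's prompt-based disease supervision.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def sigmoid_pair_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of n image/text embeddings.

    img[i] and txt[i] are assumed to be a matched pair. Every (i, j)
    combination contributes one binary term; the sum is divided by n.
    """
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * dot(img[i], txt[j]) + b
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n


img = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pair_loss(img, [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pair_loss(img, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Because each pair is scored independently, the loss needs no softmax normalization across the batch; the toy run shows a much lower loss when reports are paired with their own volumes.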
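[157] RaUF: Learning the Spatial Uncertainty Field of Radar cs.CV
Shengpeng Wang, Kuangyu Wang, Wei Wang
TL;DR: This paper proposes RaUF, a framework that learns a spatial uncertainty field by modeling the physically grounded anisotropic properties of millimeter-wave radar measurements, addressing radar's low spatial fidelity, azimuth ambiguity, and clutter interference. It designs an anisotropic probabilistic model to learn fine-grained uncertainty and introduces a Bidirectional Domain Attention mechanism that exploits the complementarity between spatial structure and Doppler consistency to suppress spurious reflections.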
Details
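Motivation: Millimeter-wave radar has advantages in adverse weather but suffers from low spatial fidelity, azimuth ambiguity, and clutter-induced spurious returns. Existing methods often overlook the ambiguous feature-to-label mapping, which makes geometric inference ill-posed and poses challenges for downstream perception tasks.
Result: Extensive experiments on public benchmarks and real-world datasets show that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty; downstream case studies further validate its enhanced reliability and scalability in challenging real-world driving scenarios.
Insight: The contributions are: modeling the spatial uncertainty field of radar measurements from their physical anisotropy; an anisotropic probabilistic model that resolves conflicting feature-to-label mappings; and a Bidirectional Domain Attention mechanism that fuses spatial structure with Doppler consistency to suppress spurious reflections, giving radar perception a more reliable framework for geometric inference.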
Abstract: Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios.
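[158] Vision-Language Feature Alignment for Road Anomaly Segmentation cs.CV
Zhuolin He, Jiacheng Tang, Jian Pu, Xiangyang Xue
TL;DR: This paper proposes VL-Anomaly, a vision-language anomaly segmentation framework that improves how autonomous driving systems recognize road anomalies (such as unknown obstacles) in complex environments. It incorporates semantic priors from pre-trained vision-language models (such as CLIP) through a prompt learning-driven alignment module, reducing false positives in background regions (e.g., sky, vegetation) and improving recall of true out-of-distribution instances. At inference, a multi-source strategy fuses text-guided similarity, image-text similarity, and detector confidence for more reliable anomaly prediction.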
Details
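Motivation: Existing road anomaly segmentation methods usually rely on pixel-level statistics to decide whether a region is anomalous. This yields high false-positive rates on semantically normal background regions (such as sky or vegetation) and low recall of true out-of-distribution instances, creating safety risks for robotic perception and decision-making.
Result: Extensive experiments on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes show that VL-Anomaly achieves state-of-the-art (SOTA) performance.
Insight: The contribution is using semantic priors from pre-trained vision-language models: prompt learning aligns visual features with text embeddings, suppressing spurious anomaly responses in background regions, while the multi-source inference strategy integrates complementary information sources to make anomaly prediction more reliable. Viewed objectively, the method applies vision-language alignment to road anomaly segmentation and offers a new way to address the false-positive and recall problems.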
Abstract: Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former’s visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code is released at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
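[159] From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing cs.CV | cs.AI
Haoyuan Zhang, Keyao Wang, Guosheng Zhang, Haixiao Yue, Zhiwen Tan
TL;DR: This paper proposes TAR-FAS, a tool-augmented reasoning framework for multimodal large language models that improves the generalization of face anti-spoofing (FAS). It reformulates the task as a chain-of-thought paradigm with visual tools, so the model starts from intuitive observations and adaptively invokes external visual tools for fine-grained investigation.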
Details
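Motivation: Existing MLLM-based face anti-spoofing methods rely mainly on intuitive semantic cues and struggle to perceive fine-grained visual patterns, which limits cross-domain generalization. The paper introduces external visual tools to push the model toward deeper investigation of subtle spoof cues.
Result: Extensive experiments under a challenging one-to-eleven cross-domain protocol show that TAR-FAS achieves state-of-the-art performance while providing fine-grained visual investigation for trustworthy spoof detection.
Insight: The contribution is reformulating face anti-spoofing as a tool-augmented chain-of-thought reasoning paradigm, with a matching data annotation pipeline and training method. The key insight is to use external visual tools to compensate for MLLMs' weak fine-grained visual perception, with a self-learned tool-invocation policy that deepens investigation and improves generalization.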
Abstract: Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
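Illustration (not from the paper): DT-GRPO builds on Group Relative Policy Optimization, whose core step scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that standard group-relative advantage; the tool-diversity component that distinguishes DT-GRPO is not described in the abstract and is not modeled. The function name and the reward values are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize rewards within one group.

    rewards: scalar rewards of the responses sampled for a single prompt.
    Each advantage is (reward - group mean) / (group std + eps), so no
    learned value function is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled tool-use trajectories: two judged correct, two incorrect.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# If every trajectory gets the same reward, all advantages are zero
# and the group contributes no learning signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0])
print(mixed, uniform)
```

Correct trajectories receive positive advantages and incorrect ones negative advantages, while a group with identical rewards carries no gradient, which is why reward diversity within a group matters.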
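[160] MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline cs.CV | cs.AI
Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang
TL;DR: This paper presents MM-DeepResearch, a powerful multimodal deep research agent. Three core designs address the challenges of building such an agent: 1) Hyper-Search, a hypergraph-based method that generates search-intensive multimodal QA data requiring multiple search tools to solve; 2) DR-TTS, which decomposes tasks by tool type, optimizes an expert per tool, and then recombines the experts through tree search to explore effective search trajectories; 3) an offline search engine supporting multiple search tools, enabling low-cost agentic reinforcement learning.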
Details
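Motivation: The goal is a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis for deep research tasks. Three challenges stand in the way: scarce search-intensive multimodal QA data, a lack of effective search trajectories, and the prohibitive cost of training with online search APIs.
Result: Extensive experimental results show that MM-DeepResearch is superior across multiple benchmarks.
Insight: The contributions are: 1) hypergraph modeling of relations among cross-modal nodes to generate high-quality training data; 2) a trajectory exploration strategy that decomposes tasks by tool type, optimizes experts, and recombines them through tree search; 3) an offline search engine that sharply reduces training cost. Together they form an effective baseline for building low-cost, high-performance multimodal agents.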
Abstract: We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types and optimizes a specialized search tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With the three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM-DeepResearch
[161] Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures cs.CVPDF
Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu
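TL;DR: This paper proposes ELF-VLA, a framework that addresses the performance plateau that Vision-Language-Action (VLA) models for autonomous driving encounter during reinforcement learning optimization. The method learns explicitly from failures: it replaces vague scalar rewards with structured diagnostic feedback, produces interpretable failure-mode reports, and uses them to drive feedback-guided refinement of the policy, resolving critical failure cases in long-tail scenarios.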
Details
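Motivation: During RL optimization, driving VLA models often stagnate because their exploration is constrained by prior Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. The sparse reward signal cannot identify the root cause of a failure (for example, errors in planning, reasoning, or trajectory execution).
Result: On the public NAVSIM benchmark, the method achieves state-of-the-art (SOTA) performance in overall PDMS, EPDMS, and high-level planning accuracy.
Insight: The key idea is an explicit failure-learning mechanism: structured diagnostic feedback and feedback-guided refinement give the policy a targeted gradient update, unlocking latent capabilities of VLA models and solving critical scenarios that unguided exploration cannot.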
Abstract: Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause – whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
[162] LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model cs.CV | cs.LGPDF
Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen
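TL;DR: LLaDA-o is an effective, length-adaptive omni diffusion model for multimodal understanding and generation, built on a Mixture of Diffusion (MoD) framework. It decouples discrete masked diffusion for text understanding from continuous diffusion for visual generation and couples them through a shared, efficient attention backbone, reducing redundant computation for fixed conditions. A data-centric length adaptation strategy enables flexible-length decoding without architectural changes. Experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks.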
Details
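Motivation: To address the challenge of modeling understanding and generation in a unified way for multimodal tasks, reduce the redundant computation that diffusion models incur for fixed conditions, and enable flexible-length decoding.
Result: Achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and scores 87.04 on DPG-Bench for text-to-image generation.
Insight: The contributions are a Mixture of Diffusion framework that decouples and then couples text and visual processing, and a data-centric length adaptation strategy. Together they yield efficient and flexible unified multimodal modeling that adapts to inputs and outputs of different lengths without modifying the architecture.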
Abstract: We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
[163] Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval cs.CV | cs.IRPDF
Xuan Lu, Kangle Li, Haohang Huang, Rui Meng, Wenjun Zeng
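TL;DR: This paper introduces MCMR (Multi-Conditional Multimodal Retrieval), a benchmark for evaluating multimodal large language models on fine-grained, multi-condition cross-modal retrieval. It covers several product domains, including clothing, jewelry, shoes, and furniture, and requires models to satisfy the complementary visual and textual attribute conditions in a query simultaneously.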
Details
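Motivation: Existing multimodal retrieval benchmarks focus mainly on coarse-grained or single-condition alignment and overlook real-world queries that contain multiple interdependent constraints across modalities. A more realistic evaluation framework supporting fine-grained, multi-condition retrieval is therefore needed.
Result: Experiments evaluate a range of MLLM-based multimodal retrievers and vision-language rerankers and find: (1) distinct modality asymmetries across models; (2) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; (3) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency.
Insight: The contribution is MCMR, the first large-scale, fine-grained, multi-condition cross-modal retrieval benchmark, which emphasizes compositional matching and constraint-aware understanding. Viewed objectively, the work reveals the complementary roles of visual and textual cues in multimodal retrieval and offers a diagnostic evaluation tool for developing retrieval systems with better interpretability and compositional reasoning.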
Abstract: Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset are available at https://github.com/EIT-NLP/MCMR
[164] Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective cs.CVPDF
Arctanx An, Shizhao Sun, Danqing Huang, Mingxi Cheng, Yan Gao
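TL;DR: This paper studies whether vision language models (VLMs) can assess the aesthetic quality of graphic design. The authors introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three quantifiable tasks, and systematically evaluate a range of VLMs. They also construct a training dataset for fine-tuning VLMs to improve performance on aesthetic assessment.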
Details
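Motivation: Assessing the aesthetic quality of graphic design is central to visual communication but remains underexplored in VLMs. Prior work has three main limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement.
Result: The authors systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Fine-tuning on their training dataset significantly improves VLM performance on aesthetic assessment tasks.
Insight: The contribution is the first systematic framework for aesthetic quality assessment in graphic design, comprising a comprehensive benchmark, a systematic model evaluation, and a training dataset built with human-guided VLM labeling to produce task labels at scale. This gives VLM-based aesthetic assessment a new benchmark and methodology.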
Abstract: Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design regions. Together, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: https://github.com/arctanxarc/AesEval-Bench
[165] HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views cs.CVPDF
Jiashu Li, Xumeng Han, Zhaoyang Wei, Zipeng Wang, Kuiran Wang
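TL;DR: HeroGS is a hierarchically guided, robust 3D Gaussian Splatting (3DGS) framework that targets the globally sparse coverage, blurred backgrounds, and distorted high-frequency regions caused by insufficient supervision under sparse views. It establishes unified guidance at the image, feature, and parameter levels: sparse supervision is converted into pseudo-dense guidance, and the Gaussian distributions are refined through Feature-Adaptive Densification and Pruning (FADP) and Co-Pruned Geometry Consistency (CPG), improving structural fidelity and rendering quality.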
Details
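Motivation: 3DGS performs well in novel view synthesis but relies heavily on dense camera coverage. Under sparse views, insufficient supervision produces irregular Gaussian distributions, characterized by globally sparse coverage, blurred backgrounds, and distorted high-frequency regions, so a robust method is needed to improve sparse-view performance.
Result: Extensive experiments show that HeroGS achieves high-fidelity reconstruction under sparse-view conditions and consistently surpasses state-of-the-art baselines.
Insight: The contribution is a unified hierarchical guidance framework: pseudo-dense guidance at the image level, Feature-Adaptive Densification and Pruning (FADP) at the feature level, and Co-Pruned Geometry Consistency (CPG) at the parameter level. These multi-level constraints optimize the Gaussian distributions and improve rendering quality and structural consistency under sparse views.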
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions, characterized by globally sparse coverage, blurred backgrounds, and distorted high-frequency areas. To address this, we propose HeroGS, Hierarchical Guidance for Robust 3D Gaussian Splatting, a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions. The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at the parameter level, which guides geometric consistency through parameter freezing and co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality. Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.
[166] Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting cs.CVPDF
Dantong Qin, Alessandro Bozzon, Xian Yang, Xun Zhang, Yike Guo
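TL;DR: This paper proposes StrokeDiff, a diffusion-based brushstroke generation framework. Using Smooth Regularization (SmR), it learns human-like brushstrokes from a small set of hand-drawn samples (n=470). A Bézier-based conditioning module makes generation controllable, and the model is integrated into a complete stroke-based painting pipeline.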
Details
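Motivation: Data for visual primitives such as brushstrokes is scarce, which makes it hard for generative models to learn expressive and controllable primitives. The work addresses this scarcity to support process-aware content creation.
Result: Experiments show that the method generates diverse and structurally coherent brushstrokes, and both automatic metrics and human evaluation confirm that it enables paintings with richer texture and finer layering.
Insight: The contributions are Smooth Regularization (SmR), which stabilizes diffusion training under sparse supervision, and a Bézier-based conditioning module for controllable stroke generation. Together they provide a framework for data-efficient primitive modeling.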
Abstract: Many creative multimedia systems are built upon visual primitives such as strokes or textures, which are difficult to collect at scale and fundamentally different from natural image data. This data scarcity makes it challenging for modern generative models to learn expressive and controllable primitives, limiting their use in process-aware content creation. In this work, we study the problem of learning human-like brushstroke generation from a small set of hand-drawn samples (n=470) and propose StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR). SmR injects stochastic visual priors during training, providing a simple mechanism to stabilize diffusion models under sparse supervision without altering the inference process. We further show how the learned primitives can be made controllable through a Bézier-based conditioning module and integrated into a complete stroke-based painting pipeline, including prediction, generation, ordering, and compositing. This demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation. Experiments indicate that the proposed approach produces diverse and structurally coherent brushstrokes and enables paintings with richer texture and layering, validated by both automatic metrics and human evaluation.
[167] GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation cs.CVPDF
Tajamul Ashraf, Abrar Ul Riyaz, Wasif Tak, Tavaheed Tariq, Sonia Yadav
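TL;DR: This paper presents GroundedSurg, the first language-conditioned, instance-level surgical tool grounding benchmark. It evaluates whether models can localize a specific surgical instrument instance from a natural-language description in clinically realistic multi-instrument scenes.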
Details
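Motivation: Existing surgical tool benchmarks mainly evaluate category-level segmentation. They do not cover what clinical decisions often require: resolving a reference to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction.
Result: Extensive experiments reveal substantial performance gaps for modern segmentation models and vision-language models on the benchmark, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems.
Insight: The contribution is the first instance-level surgical instrument grounding benchmark that pairs natural-language descriptions with structured spatial annotations (bounding boxes and point-level anchors). It jointly evaluates linguistic reference resolution and pixel-level localization, providing a systematic framework for clinically realistic evaluation of vision-language models.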
Abstract: Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction, capabilities not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables a systematic and realistic evaluation of vision-language models in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation models and VLMs, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at https://github.com/gaash-lab/GroundedSurg
[168] DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles cs.CVPDF
Yiming Ma, Hongkun Yang, Lionel Z. Wang, Bin Chen, Weizhi Xian
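TL;DR: This paper proposes DeAR, a framework for fine-grained vision-language model (VLM) adaptation that decomposes attention head roles. It challenges the conventional layer-centric view, arguing that functional specialization occurs at the level of individual attention heads in the deeper layers. A Concept Entropy metric classifies heads into Attribute, Generalization, and Mixed roles; a Role-Based Attention Mask and a Task-Adaptive Fusion Strategy then control information flow. Across 15 datasets, DeAR balances task adaptation and generalization.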
Details
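Motivation: Existing prompt learning methods usually rest on a layer-centric assumption: shallow layers capture general features while deep layers handle task-specific knowledge. This leaves interactions between learnable tokens and original tokens uncontrolled, so task-specific knowledge can degrade the model's core generalization, creating a trade-off between task adaptation and zero-shot generalization.
Result: Extensive experiments on 15 datasets show that DeAR achieves a strong balance between task adaptation and generalization and outperforms previous methods across various tasks.
Insight: The contributions are the challenge to the layer-centric view, the proposal that functional specialization occurs at the attention-head level, and Concept Entropy for role classification. A Role-Based Attention Mask and attribute tokens precisely control information flow and isolate generalization heads from task-specific knowledge, improving task adaptation while preserving zero-shot generalization.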
Abstract: Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge can degrade the model’s core generalization, creating a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
[169] GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation cs.CVPDF
Zhuonan Liang, Wei Guo, Jie Gan, Yaxuan Song, Runnan Chen
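TL;DR: GuiDINO is a medical image segmentation framework that repositions a foundation vision model (such as DINOv3) as a visual guidance generator for downstream segmentation. A lightweight TokenBook mechanism converts the visual features extracted by DINOv3 into a spatial guide mask, which gates feature activations in multiple segmentation backbones, injecting foundation-model priors while preserving the inductive biases and efficiency of dedicated medical architectures.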
Details
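Motivation: Because of domain shift, pretrained foundation vision models are misaligned with the needs of medical image segmentation unless they are fully fine-tuned or lightly adapted. GuiDINO addresses this by repositioning the foundation model as a visual guidance generator, using its priors more efficiently.
Result: Across multiple medical datasets and with nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting that it is a practical alternative to fine-tuning.
Insight: The contributions include repositioning the foundation model as a visual guidance generator, a lightweight TokenBook mechanism that aggregates token-prototype similarities into a spatial guide mask, and training with a guide supervision loss plus an optional boundary-focused hinge loss. The framework also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone, offering a new perspective on how foundation models can best serve medical vision.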
Abstract: Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions the native foundation model to act as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representations from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of dedicated medical architectures. Training relies on a guide supervision objective loss that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at https://github.com/Hi-FishU/GuiDINO
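The guide-mask idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how token-prototype similarities are aggregated, so this sketch takes the maximum cosine similarity per token and squashes it with a sigmoid. The names `guide_mask`, `gate_features`, and `temperature` are hypothetical and are not taken from the GuiDINO code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def guide_mask(tokens, prototypes, temperature=0.1):
    """One gate value in (0, 1) per token.

    Assumption: similarities are aggregated with a max over prototypes and
    passed through a sigmoid; the paper may aggregate differently.
    """
    mask = []
    for t in tokens:
        best = max(cosine(t, p) for p in prototypes)
        mask.append(1.0 / (1.0 + math.exp(-best / temperature)))
    return mask

def gate_features(features, mask):
    """Scale each backbone feature vector by the gate value at its position."""
    return [[m * f for f in feat] for feat, m in zip(features, mask)]
```

Under these assumptions, a token aligned with a prototype gets a gate near 1 and passes its backbone features through, while a token pointing away from every prototype gets a gate near 0 and is suppressed.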
[170] ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models cs.CV | cs.AIPDF
Xiwei Liu, Yulong Li, Xinlin Zhuang, Xuhui Li, Jianxu Chen
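TL;DR: This paper proposes ClinCoT, a clinical-aware visual chain-of-thought framework that targets factual hallucinations in medical vision-language models caused by insufficient grounding in localized pathological evidence. It shifts preference optimization from response-level correction to visual-driven reasoning, builds clinically grounded preference pairs through an automatic data generation pipeline, and refines region-level reasoning trajectories with a scoring-based margin-aware optimization strategy.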
Details
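Motivation: Existing medical alignment methods mainly apply preference optimization at the response level, which improves output correctness but leaves intermediate reasoning weakly connected to visual regions. Chain-of-thought enhances multimodal reasoning but remains text-centric, limiting effective integration of clinical visual cues.
Result: Extensive experiments on three medical VQA and report generation benchmarks show that ClinCoT consistently improves factual grounding and outperforms existing preference-based alignment methods.
Insight: The contribution is combining preference optimization with visual-driven reasoning chains: clinically grounded preference data is generated automatically through hypothesis-driven region proposals, and an iterative learning scheme regenerates the data dynamically to maintain alignment as the model's policy evolves. This offers a new approach to improving region awareness and factual reliability in medical multimodal models.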
Abstract: Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypothesis-driven region proposals. Multiple Med-LLM evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model’s policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
[171] Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations cs.CV | cs.AIPDF
Chengtai Li, Yuting He, Jianfeng Ren, Ruibin Bai, Yitian Zhao
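TL;DR: This paper proposes PR-A^2CL, a new method for compositional visual relations (CVR) tasks, in which the goal is to identify, among four images, the outlier that does not follow the same compositional rule as the others. The method combines Augmented Anomaly Contrastive Learning, which extracts discriminative features, with a predict-and-verify paradigm for rule-based reasoning that localizes the anomaly through iterative prediction and verification.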
Details
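Motivation: Compositional visual relations are relatively unexplored because of their complexity, and existing methods struggle to model the abundance of compositional rules. The paper targets outlier identification in CVR tasks to improve reasoning over complex visual relations.
Result: Experiments on the SVRT, CVR, and MC^2R datasets show that PR-A^2CL significantly outperforms existing state-of-the-art reasoning models.
Insight: The contributions are Augmented Anomaly Contrastive Learning, which improves feature generalization, and a predict-and-verify paradigm built from Predictive Anomaly Reasoning Blocks (PARBs) for iterative rule-based reasoning, offering a useful template for modeling complex visual relations.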
Abstract: While visual reasoning for simple analogies has received significant attention, compositional visual relations (CVR) remain relatively unexplored due to their greater complexity. To solve CVR tasks, i.e., to identify an outlier image given three other images that follow the same compositional rules, we propose Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A$^2$CL). To address the challenge of modelling abundant compositional rules, an Augmented Anomaly Contrastive Learning is designed to distil discriminative and generalizable features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers. More importantly, a predict-and-verify paradigm is introduced for rule-based reasoning, in which a series of Predictive Anomaly Reasoning Blocks (PARBs) iteratively leverage features from three out of the four images to predict those of the remaining one. Throughout the subsequent verification stage, the PARBs progressively pinpoint the specific discrepancies attributable to the underlying rules. Experimental results on SVRT, CVR and MC$^2$R datasets show that PR-A$^2$CL significantly outperforms state-of-the-art reasoning models.
[172] Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers cs.CVPDF
Kuai Jiang, Zhaoyan Ding, Guijuan Zhang, Dianjie Lu, Zhuoran Zheng
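TL;DR: This paper proposes the Teacher-Guided Causal Disentanglement Network (TCD-Net) for image denoising, based on causal intervention. Within a Vision Transformer framework, it uses structured interventions to decompose features into content and noise, addressing the over-removed details and residual noise that conventional denoisers produce because of spurious correlations and high-frequency ambiguity.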
Details
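Motivation: Conventional image denoising models tend to learn spurious correlations between environmental factors and noise patterns, and high-frequency ambiguity makes it hard to reliably distinguish subtle textures from stochastic noise, which degrades robustness. The paper revisits denoising through causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise.
Result: Extensive experiments on multiple benchmarks show that TCD-Net outperforms mainstream methods in both fidelity and efficiency, reaching a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
Insight: The contribution is bringing causal intervention into a Vision Transformer denoising framework: an Environmental Bias Adjustment module for de-confounding, a dual-branch disentanglement head with an orthogonality constraint that strictly separates content from noise, and a causal prior guided by Google's Nano Banana Pro generative model to resolve structural ambiguity. The result is an orthogonal disentanglement of content and noise.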
Abstract: Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google’s reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
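The orthogonality constraint in component (2) of the abstract above can be sketched as a simple penalty. This is an assumption made for illustration: the abstract does not give the loss, so the sketch uses squared cosine similarity between one content vector and one noise vector, a common way to encourage orthogonality. The name `orthogonality_penalty` is hypothetical.

```python
import math

def orthogonality_penalty(content, noise):
    """Squared cosine similarity between a content and a noise representation.

    Returns 0.0 when the vectors are orthogonal and 1.0 when they are
    parallel or anti-parallel. Assumption: TCD-Net's actual constraint
    may take a different form (for example, a batch-level Gram penalty).
    """
    dot = sum(c * n for c, n in zip(content, noise))
    norm_c = math.sqrt(sum(c * c for c in content))
    norm_n = math.sqrt(sum(n * n for n in noise))
    if norm_c == 0.0 or norm_n == 0.0:
        return 0.0
    cos = dot / (norm_c * norm_n)
    return cos * cos
```

Minimizing a term like this alongside the reconstruction loss pushes the two branches toward carrying non-overlapping information, which is the stated purpose of the dual-branch head.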
[173] ArtLLM: Generating Articulated Assets via 3D LLM cs.CVPDF
Penghao Wang, Siyuan Xie, Hongyu Yan, Xianghui Yang, Jingwei Huang
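TL;DR: ArtLLM is a novel framework for generating high-quality articulated (movable-joint) 3D assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset, which autoregressively predicts a variable number of parts and joints and infers their kinematic structure in a unified manner. This articulation-aware layout then guides a 3D generative model to synthesize high-fidelity part geometry.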
Details
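Motivation: Existing approaches to generating articulated 3D objects, such as optimization-based reconstruction and retrieval-based assembly, are fundamentally limited: they are slow, handle only simple joints, or produce repetitive geometry with poor generalization. ArtLLM targets these limitations to supply high-quality articulated assets for interactive digital environments in gaming, robotics, and simulation.
Result: Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in part layout accuracy and joint prediction, and generalizes robustly to real-world objects.
Insight: The main contribution is casting articulation structure generation as a single autoregressive prediction problem: a 3D multimodal large language model infers the kinematic structure of a variable number of parts and joints directly from the point cloud and uses it to guide geometry generation. This yields high-quality, generalizable articulated assets, with potential for building digital twins and scalable robot learning.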
Abstract: Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object’s point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
[174] TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning cs.CV | cs.AIPDF
Zhuo Chen, Shawn Young, Lijian Xu
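TL;DR: This paper proposes TC-SSA (Token Compression via Semantic Slot Aggregation), a learnable token compression framework that addresses the excessively long sequences that gigapixel Whole Slide Images (WSIs) produce in large vision-language models. A gated routing module aggregates patch features into a fixed number of semantic slots, giving global slide coverage under a strict token budget and reducing the number of visual tokens to 1.7% of the original.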
Details
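Motivation: In computational pathology, a gigapixel WSI contains so many patches (over 10^5) that the sequence length exceeds the limits of standard Transformer architectures. The work removes this bottleneck while avoiding the risk, inherent in existing spatial sampling methods, of discarding diagnostically critical evidence.
Result: On SlideBench (TCGA), the model reaches 78.34% overall accuracy and 77.14% on the diagnosis subset, outperforming sampling-based baselines under comparable token budgets. On MIL classification, it reaches AUCs of 95.83%, 98.27%, and 79.80% on TCGA-BRCA, TCGA-NSCLC, and PANDA, respectively.
Insight: The contribution is a learnable semantic slot aggregation framework that uses sparse Top-2 routing and weighted aggregation to compress tokens efficiently while retaining diagnostic information. Viewed objectively, it offers an effective trade-off between computational efficiency and diagnostic performance and a new approach to gigapixel image reasoning.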
Abstract: The application of large vision-language models to computational pathology holds great promise for diagnostic assistants but faces a critical computational bottleneck: the gigapixel scale of Whole Slide Images (WSIs). A single WSI typically contains over $10^5$ patches, creating sequence lengths that exceed the constraints of standard Transformer architectures. Existing solutions often resort to spatial sampling, which risks discarding diagnostically critical evidence. To address this, we propose TC-SSA (Token Compression via Semantic Slot Aggregation), a learnable token compression framework that aggregates patch features into a fixed number of semantic slots. A gated routing module assigns patches to slots using sparse Top-2 routing, followed by weighted aggregation, enabling global slide coverage under a strict token budget. The resulting representation retains diagnostically relevant information while reducing the number of visual tokens to 1.7% of the original sequence. On SlideBench (TCGA), our model achieves 78.34% overall accuracy and 77.14% on the diagnosis subset, outperforming sampling-based baselines under comparable token budgets. The method also generalizes to MIL classification, reaching AUCs of 95.83% on TCGA-BRCA, 98.27% on TCGA-NSCLC and 79.80% on PANDA. These results suggest that learnable semantic aggregation provides an effective trade-off between efficiency and diagnostic performance for gigapixel pathology reasoning.
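The routing step described in the abstract above (sparse Top-2 routing followed by weighted aggregation) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the gate here is a plain dot product against hypothetical per-slot key vectors (`slot_keys`), whereas the paper's gated routing module is learned, and the function name `top2_slot_aggregate` is made up for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a short list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_slot_aggregate(patches, slot_keys):
    """Compress many patch vectors into len(slot_keys) slot vectors.

    Each patch is routed to its two highest-scoring slots; the two gate
    weights are a softmax over those two scores. Each slot is the
    gate-weighted mean of the patches routed to it. Slots that receive
    no patch stay at zero.
    """
    dim = len(patches[0])
    num_slots = len(slot_keys)
    sums = [[0.0] * dim for _ in range(num_slots)]
    weights = [0.0] * num_slots
    for p in patches:
        # Hypothetical gate: dot-product score between the patch and each slot key.
        scores = [sum(a * b for a, b in zip(p, k)) for k in slot_keys]
        top2 = sorted(range(num_slots), key=lambda s: scores[s], reverse=True)[:2]
        gates = softmax([scores[s] for s in top2])
        for s, g in zip(top2, gates):
            weights[s] += g
            for d in range(dim):
                sums[s][d] += g * p[d]
    return [[v / w for v in row] if w > 0 else row
            for row, w in zip(sums, weights)]
```

The output length depends only on the number of slots, not on the number of patches, which is what keeps the token budget fixed. As a rough sense of scale, for a slide with exactly 10^5 patches, 1.7% would correspond to about 1,700 visual tokens.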
[175] BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling cs.CVPDF
Jiachen Yang, Xianhui Lin, Yi Dong, Zebiao Zheng, Xing Liu
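TL;DR: This paper proposes BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences through Dynamic Path Guidance and fine-grained preference modeling. The authors build FRPref-10K, a fine-grained preference dataset covering five key dimensions, and train a reward model that can evaluate subtle perceptual differences. To resolve the conflict between exploration and fidelity, Dynamic Path Guidance (DPG) stabilizes the stochastic sampling trajectory and corrects stochastic drift. Experiments show that the method outperforms existing approaches in texture quality, blemish removal accuracy, and alignment with human aesthetic preferences.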
Details
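Motivation: Existing face retouching methods face a fundamental trade-off. Supervised learning is limited to pixel-level label mimicry and cannot capture complex, subjective aesthetic preferences. Online reinforcement learning excels at preference alignment, but its stochastic exploration conflicts with the high-fidelity demands of face retouching and tends to introduce noticeable noise artifacts through accumulated stochastic drift.
Result: Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models in texture quality, blemish removal accuracy, and alignment with human aesthetic preferences.
Insight: The main contributions are: (1) FRPref-10K, a fine-grained face retouching preference dataset covering five key dimensions; (2) a specialized reward model that evaluates subtle perceptual differences; (3) Dynamic Path Guidance (DPG), which dynamically computes an anchor-based ODE path and replans a guided trajectory at each sampling timestep to stabilize stochastic sampling, reconciling exploration with fidelity. Viewed objectively, the method combines the preference-alignment strength of reinforcement learning with the high-fidelity requirements of face retouching, suggesting an approach to the exploration-fidelity trade-off in similar subjective perception tasks.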
Abstract: Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
[176] FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing cs.CVPDF
Maomao Li, Yunfei Liu, Yu Li
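TL;DR: This paper proposes FREE-Edit, a zero-shot image-driven video editing framework. Its core contribution is an Editing-awaRE (REE) attention injection method that modulates the injection intensity of each token in a rectified flow model, better balancing preservation of the source video's motion against propagation of the edited content, and producing higher-quality videos across editing scenarios without fine-tuning or training.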
Details
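Motivation: Existing video editing methods built on pretrained image-to-video (I2V) models usually preserve the source video's motion and layout by injecting attention during denoising. A fixed injection intensity creates a conflict: too much injection brings in conflicting semantics from the source video, while too little gives limited source representation, and either degrades the edit.
Result: FREE-Edit is validated across a range of image-driven video editing scenarios, where it produces higher-quality outputs than existing techniques.
Insight: The core contribution is the Editing-awaRE (REE) attention injection mechanism: it computes an editing mask on the first frame, propagates it with optical flow, and modulates the injection intensity per token (with no injection in edited regions), balancing the edited content against retained source information. The method builds on recent Rectified Flow models and performs zero-shot editing without additional training.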
Abstract: Image-driven video editing aims to propagate edit contents from the modified first frame to the remaining frames. Existing methods usually invert the source video to noise using a pre-trained image-to-video (I2V) model and then guide the sampling process using the edited first frame. Generally, a popular choice for maintaining motion and layout from the source video is intervening in the denoising process by injecting attention during reconstruction. However, such injection often leads to unsatisfactory results, where excessive injection leads to conflicting semantics from the source video while insufficient injection brings limited source representation. Recognizing this, we propose an Editing-awaRE (REE) injection method to modulate the injection intensity of each token. Specifically, we first compute the pixel difference between the source and edited first frame to form a corresponding editing mask. Next, we track the editing area throughout the entire video by using optical flow to warp the first-frame mask. Then, editing-aware feature injection intensity for each token is generated accordingly, where injection is not conducted on editing areas. Building upon REE injection, we further propose a zero-shot image-driven video editing framework with recently emerging rectified flow models, dubbed FREE-Edit. Without fine-tuning or training, our FREE-Edit demonstrates effectiveness in various image-driven video editing scenarios, showing its capability to produce higher-quality outputs compared with existing techniques. Project page: https://free-edit.github.io/page/.
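The first and last steps of the REE procedure in the abstract above (forming an editing mask from the first-frame pixel difference, then turning it into a per-position injection intensity) can be sketched as follows. This is a simplified illustration under stated assumptions: frames are small grayscale grids of floats, the `threshold` and `base` parameters are hypothetical, the mask is binary, and the optical-flow warping that propagates the mask to later frames is omitted entirely.

```python
def editing_mask(source, edited, threshold=0.1):
    """Binary mask: 1 where the first frame changed by more than threshold.

    Assumption: a fixed per-pixel threshold; the paper may binarize or
    smooth the pixel difference differently.
    """
    return [[1 if abs(s - e) > threshold else 0 for s, e in zip(srow, erow)]
            for srow, erow in zip(source, edited)]

def injection_intensity(mask, base=1.0):
    """Per-position injection intensity: none inside edited areas, base elsewhere."""
    return [[0.0 if m else base for m in row] for row in mask]
```

With this mapping, source attention is injected only where the first frame was left untouched, so edited regions are free to follow the new content while the rest of the frame keeps the source motion and layout.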
[177] TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization cs.CV | cs.AI | cs.LGPDF
Sumin Kim, Hyemin Jeong, Mingu Kang, Yejin Kim, Yoori Oh
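TL;DR: This paper proposes TripleSumm, an adaptive triple-modality fusion architecture for video summarization that dynamically weights and fuses the visual, text, and audio modalities to improve understanding of complex videos. It also introduces MoSu, the first large-scale triple-modality benchmark for video summarization.
Details
Motivation: Existing video summarization methods use static or modality-agnostic fusion strategies that cannot handle the dynamic variation of modality saliency in video data, which leaves complex videos poorly understood. The lack of comprehensive multimodal benchmarks has also held back research.
Result: TripleSumm achieves state-of-the-art performance on four benchmarks, including the newly proposed MoSu dataset, outperforming existing methods by a significant margin.
Insight: The contributions are a frame-level adaptive triple-modality fusion mechanism and MoSu, the first large-scale benchmark that provides the visual, text, and audio modalities together, which gives multimodal video summarization research an important tool.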
Abstract: The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.
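Frame-level adaptive weighting of three modalities can be illustrated with a softmax gate per frame. The sketch below is not the TripleSumm architecture: the single linear gate `gate_w`, the shared feature dimension, and the function names are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_frames(visual, text, audio, gate_w):
    """Fuse per-frame features with per-frame modality weights.

    visual, text, audio: arrays of shape (T, D).
    gate_w: hypothetical gating matrix of shape (3 * D, 3).
    Returns the fused features (T, D) and the weights (T, 3).
    """
    stacked = np.stack([visual, text, audio], axis=1)                # (T, 3, D)
    logits = np.concatenate([visual, text, audio], axis=1) @ gate_w  # (T, 3)
    weights = softmax(logits, axis=-1)        # each frame gets its own weights
    fused = (weights[:, :, None] * stacked).sum(axis=1)              # (T, D)
    return fused, weights

rng = np.random.default_rng(0)
T, D = 5, 4
v, t, a = (rng.normal(size=(T, D)) for _ in range(3))
fused, weights = fuse_frames(v, t, a, rng.normal(size=(3 * D, 3)))
```

A static fusion scheme corresponds to a `weights` matrix whose rows are all identical; here each row can differ, which is the frame-dependent behaviour the abstract describes.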
[178] VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification cs.CVPDF
Abdellah Zakaria Sellam, Fadi Abdeladhim Zidi, Salah Eddine Bekhouche, Ihssen Houhou, Marouane Tliba
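TL;DR: VP-Hype is a hybrid Mamba-Transformer framework for hyperspectral image classification. It combines the linear-time efficiency of state-space models with the relational modeling of Transformers, adds visual and textual prompts to address label scarcity, and sets a new state of the art in low-data regimes.
Details
Motivation: To resolve the tension in hyperspectral image classification between high-dimensional spectral data and extremely scarce labeled samples, and to overcome the scaling barrier posed by the quadratic complexity of standard Transformers.
Result: In a low-data setting with only 2% of samples used for training, the model reaches 99.69% overall accuracy on the Salinas dataset and 99.45% on the Longkou dataset, establishing a new state of the art.
Insight: The contribution is a hybrid backbone that pairs linear-complexity SSMs with Transformers to capture long-range dependencies efficiently, together with dual-modal visual and textual prompts that give context-aware guidance to feature extraction. This offers a new path toward sample-efficient remote sensing analysis.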
Abstract: Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.
[179] VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning cs.CV | cs.AIPDF
Mingkang Dong, Hongyi Cai, Jie Li, Sifan Zhou, Bin Ren
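TL;DR: This paper proposes VisNec (Visual Necessity Score), a framework that measures the marginal contribution of visual input in multimodal instruction tuning and uses it to select training samples that genuinely require visual reasoning. Combined with semantic clustering to preserve task diversity, training on only the 15% of LLaVA-665K with the highest necessity reaches 100.2% of full-data performance.
Details
Motivation: Existing instruction-tuning datasets contain many visually redundant samples (solvable from text alone) and multimodally misaligned supervision, both of which degrade learning. A method is therefore needed to identify and select samples that truly depend on vision.
Result: Across 10 downstream benchmarks, training on the 15% of LLaVA-665K selected by VisNec reaches 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, the selected data not only further reduces data size but also surpasses full-data training by 15.8%.
Insight: The contribution is a principled data selection framework, VisNec, which quantifies visual necessity by comparing predictive loss with and without visual context and combines it with semantic clustering to keep tasks diverse, enabling efficient and robust multimodal instruction tuning.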
Abstract: The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Codes and selected subsets will be released upon acceptance.
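The selection idea (score each sample by how much the image lowers the loss, then keep the top samples inside each semantic cluster) can be sketched as follows. The exact scoring formula, the sign convention, and the rounding rule for the per-cluster budget are assumptions; the abstract only states that losses with and without visual context are compared.

```python
import numpy as np

def visual_necessity(loss_text_only, loss_with_image):
    """Assumed score: the loss reduction obtained by adding the image.

    Large positive: vision-critical. Near zero: visually redundant.
    Negative: the image hurts, which may indicate misaligned supervision.
    """
    return np.asarray(loss_text_only, dtype=float) - np.asarray(loss_with_image, dtype=float)

def select_per_cluster(scores, cluster_ids, keep_ratio=0.15):
    """Keep the highest-scoring fraction of samples within every cluster."""
    scores = np.asarray(scores, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    selected = []
    for c in np.unique(cluster_ids):
        idx = np.flatnonzero(cluster_ids == c)
        k = max(1, int(round(keep_ratio * len(idx))))   # at least one per cluster
        top = idx[np.argsort(-scores[idx], kind="stable")[:k]]
        selected.extend(top.tolist())
    return sorted(selected)

# Four toy samples in two clusters; keep half of each cluster.
scores = visual_necessity([2.0, 1.0, 0.5, 3.0], [0.5, 1.0, 1.5, 2.9])
chosen = select_per_cluster(scores, [0, 0, 1, 1], keep_ratio=0.5)
```

Selecting within clusters rather than globally keeps every task type represented even when one cluster has uniformly lower scores.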
[180] Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction cs.CV | cs.AI | cs.HC | cs.LG | cs.ROPDF
Ari Wahl, Dorian Gawlinski, David Przewozny, Paul Chojecki, Felix Bießmann
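TL;DR: This paper presents a method for monocular 3D object position estimation with a pre-trained vision-language model. By fine-tuning the VLM, adding a custom regression head, and using a conditional routing mechanism, the model keeps its ability to answer general visual queries while gaining 3D coordinate detection for improved human-robot interaction.
Details
Motivation: Pre-trained general-purpose vision-language models have rich world knowledge and 2D object detection capabilities but lack 3D coordinate detection, which limits their use in intuitive human-robot interaction. The paper therefore studies how to extend a VLM to estimate 3D object positions from monocular RGB images.
Result: The model achieves robust predictive performance with a median MAE of 13 mm on the test set, a five-fold improvement over a simpler baseline without fine-tuning. About 25% of predictions fall within the error range in which the robot can interact with objects.
Insight: Extending a VLM's 3D capability through QLoRA fine-tuning and a custom regression head, and balancing general and specialized tasks through conditional routing, provides a workable approach to 3D perception with VLMs in robot interaction.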
Abstract: Pre-trained general-purpose Vision-Language Models (VLM) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinates detection tasks are rare. In this work, we investigate interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.
[181] Towards Policy-Adaptive Image Guardrail: Benchmark and Method cs.CVPDF
Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin
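TL;DR: This paper addresses the difficulty that harmful-image guardrail models have in adapting to changing safety policies, and proposes the SafeEditBench benchmark and the SafeGuard-VL method. SafeEditBench uses image editing to produce image pairs that violate specific safety rules and supports fine-grained evaluation under five different policies. SafeGuard-VL optimizes the model with reinforcement learning based on verifiable rewards so that it generalizes across policies.
Details
Motivation: Traditional classifiers are trained on fixed categories and struggle to adapt to evolving safety policies. Existing guardrail methods based on vision-language models are usually trained under a single fixed policy, which leads to overfitting, poor generalization, and even loss of basic instruction-following ability.
Result: The cross-policy generalization of existing VLMs is evaluated on the proposed SafeEditBench benchmark. Experiments show that SafeGuard-VL is effective for harmful-image guardrails under a variety of policies, supporting its robustness.
Insight: The contributions are a policy-aligned dataset built through image editing for fine-grained evaluation, and a reinforcement learning method with verifiable rewards that explicitly optimizes the model to adapt to changing policies instead of relying only on supervised fine-tuning under fixed policies.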
Abstract: Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
[182] AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models cs.CV | cs.LGPDF
Changwoo Baek, Jouwon Song, Sohyeon Kim, Kyeongbo Kong
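TL;DR: This paper presents an empirical analysis of visual token pruning in large vision-language models, comparing attention-based and diversity-based strategies. It finds that diversity-based methods preserve less feature diversity than expected and that the diversity they retain is associated with more hallucination, that attention-based methods work better on simple images, and that diversity-based methods work better on complex images. Based on these findings, the paper proposes a simple adaptive pruning mechanism that performs reliably on both standard benchmarks and hallucination evaluations.
Details
Motivation: Existing visual token pruning methods focus on either attention or diversity, but their characteristics and limitations have not been analyzed in depth. The paper aims to fill this gap through empirical study, in order to improve pruning strategies and reduce computational overhead.
Result: The proposed adaptive pruning mechanism achieves strong and reliable performance on standard benchmarks and on hallucination-specific evaluations such as the CHAIR dataset.
Insight: The contribution is a quantitative analysis using effective rank and attention entropy that exposes the strengths and weaknesses of both pruning families, and an image-aware adaptive hybrid strategy designed from it. The idea of adjusting the pruning method dynamically according to image complexity is worth borrowing.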
Abstract: Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct a thorough empirical analysis, using effective rank (erank) as a measure of feature diversity together with attention score entropy, to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
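Both diagnostics named in the abstract have standard definitions that are easy to compute. The sketch below uses the common definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and plain Shannon entropy for attention scores; whether the paper uses exactly these forms is an assumption.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """exp(entropy of the normalized singular values) of an (N, D) matrix."""
    s = np.linalg.svd(np.asarray(features, dtype=np.float64), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                      # drop numerically zero directions
    return float(np.exp(-(p * np.log(p)).sum()))

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of a non-negative attention score vector."""
    p = np.asarray(attn, dtype=np.float64)
    p = p / p.sum()
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())

diverse = np.eye(4)                     # four orthogonal token features
redundant = np.ones((4, 4))             # four identical token features
focused = [1.0, 0.0, 0.0, 0.0]          # attention concentrated on one token
spread = [0.25, 0.25, 0.25, 0.25]       # attention spread evenly
```

Read this way, a low attention entropy would mark a "simple" image with concentrated evidence and a high one a "complex" image, which is one plausible signal for switching between attention-based and diversity-based pruning.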
[183] FoSS: Modeling Long Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier State Space Integration cs.CVPDF
Yizhou Huang, Gengze Jiang, Yihua Cheng, Kezhi Wang
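TL;DR: This paper proposes FoSS, a framework that models long-range dependencies and multimodal uncertainty in trajectory prediction through Fourier state-space integration. It has two branches. The frequency-domain branch decomposes trajectories with a discrete Fourier transform and refines spectral features at linear complexity using a progressive helix reordering module and two selective state-space submodules. The time-domain branch uses a dynamic selective state-space model to reconstruct self-attention behavior in linear time and retain long-range temporal context. A cross-attention layer fuses the two branches, learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty.
Details
Motivation: Existing trajectory prediction methods struggle to balance modeling power and computational efficiency. Attention-based architectures scale quadratically with the number of agents, while recurrent models have difficulty capturing long-range dependencies and fine-grained local dynamics.
Result: On the Argoverse 1 and Argoverse 2 benchmarks, FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%.
Insight: The contribution is a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The Fourier transform separates global intent from local variation, and selective state-space modules fuse temporal and spectral representations and handle uncertainty while keeping linear complexity.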
Abstract: Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. Building upon this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
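The amplitude and phase decomposition in the frequency-domain branch can be illustrated with NumPy's real FFT. This sketch covers only that decomposition and its inverse; the helix reordering and SSM modules are not modeled, and the function names are hypothetical.

```python
import numpy as np

def decompose(traj):
    """Split a (T, 2) xy trajectory into amplitude and phase spectra.

    Each output has shape (T // 2 + 1, 2), one spectrum per coordinate.
    """
    spec = np.fft.rfft(traj, axis=0)
    return np.abs(spec), np.angle(spec)

def reconstruct(amplitude, phase, n):
    """Invert the decomposition back to a length-n trajectory."""
    return np.fft.irfft(amplitude * np.exp(1j * phase), n=n, axis=0)

# Toy trajectory: steady forward motion plus a small lateral wiggle.
steps = np.arange(16)
traj = np.stack([0.5 * steps, 0.1 * np.sin(2 * np.pi * steps / 8)], axis=1)
amp, phase = decompose(traj)
recovered = reconstruct(amp, phase, n=len(traj))
```

Because amplitude and phase together carry the full spectrum, the round trip is lossless, so a model can process the two components separately without discarding information.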
[184] When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains cs.CVPDF
Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu
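TL;DR: Through controlled experiments, this paper studies the role of reinforcement learning (RL) in training medical vision-language models (VLMs). It finds that RL is effective mainly when the model already has some support (high Pass@K), improving accuracy and sampling efficiency by sharpening the output distribution, whereas supervised fine-tuning (SFT) expands support and makes RL effective. Based on this, the authors propose a boundary-aware training recipe that performs strongly on several medical VQA benchmarks.
Details
Motivation: The study aims to clarify whether RL in the post-training of medical VLMs truly improves visual reasoning or merely reinforces behaviors already induced by SFT, by systematically disentangling the effects of vision, SFT, and RL.
Result: On the MedMNIST multi-modality testbed, RL markedly improves Acc@1 and sampling efficiency when the model's support is high. The proposed recipe achieves strong average performance across six medical VQA benchmarks.
Insight: The contribution is a controlled quantification of the complementary roles of RL and SFT in medical VLMs: SFT expands support and RL sharpens the distribution. The proposed boundary-aware recipe can guide RL post-training strategy.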
Abstract: Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
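The distinction between Accuracy@1 and Pass@K is easy to see on a small correctness matrix. The sketch below uses the plain empirical definitions (first sample correct, and any of the first K samples correct); the paper's exact estimators are not stated in the abstract, so treat these as assumptions.

```python
import numpy as np

def acc_at_1(correct):
    """Fraction of questions whose first sampled answer is correct."""
    return float(np.asarray(correct)[:, 0].mean())

def pass_at_k(correct, k):
    """Fraction of questions with at least one correct answer in k samples."""
    return float(np.asarray(correct)[:, :k].any(axis=1).mean())

# Rows are questions, columns are independently sampled answers (1 = correct).
correct = np.array([
    [0, 1, 0, 0],   # solvable, but not on the first try
    [1, 1, 1, 1],   # reliably solved
    [0, 0, 0, 0],   # outside the model's support
    [0, 0, 1, 0],   # solvable, but rarely sampled
])
gap = pass_at_k(correct, 4) - acc_at_1(correct)
```

A large gap between Pass@K and Acc@1 means correct answers exist in the model's distribution but are not the most likely ones, which is the regime where the abstract reports RL helping most. The third row, where no sample is correct, is the case that sharpening alone cannot fix.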
[185] AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models cs.CV | cs.AIPDF
Zhen Qu, Xian Tao, Xiaoyi Bao, Dingrong Wang, ShiChen Qu
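TL;DR: This paper proposes AG-VAS, a framework that strengthens large multimodal models (LMMs) for zero-shot visual anomaly segmentation (ZSAS) by introducing three learnable semantic anchor tokens ([SEG], [NOR], [ANO]). It addresses the abstract and context-dependent nature of anomaly concepts and the poor alignment between semantic and pixel-level features, and achieves state-of-the-art performance on several industrial and medical benchmarks.
Details
Motivation: Existing LMM-based segmentation methods face fundamental limitations: anomaly concepts are abstract and lack stable visual prototypes, and weak alignment between high-level semantic embeddings and pixel-level spatial features makes localization inaccurate. A new framework is needed to guide anomaly segmentation in a unified way.
Result: Extensive experiments on six industrial and medical benchmarks show that AG-VAS achieves consistent state-of-the-art (SOTA) performance in the zero-shot setting.
Insight: The contributions are semantic anchor tokens that turn abstract anomaly semantics into explicit spatial entities, a Semantic-Pixel Alignment Module (SPAM) and an Anchor-Guided Mask Decoder (AGMD) that improve cross-modal alignment and precise localization, and Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge in structured form.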
Abstract: Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens-[SEG], [NOR], and [ANO], establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
[186] Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding cs.CVPDF
Anna Michailidou, Georgios Angelidis, Vasileios Argyriou, Panagiotis Sarigiannidis, Georgios Th. Papadopoulos
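TL;DR: This paper compares supervised learning with open-vocabulary vision models on post-disaster scene understanding tasks such as semantic segmentation and object detection, using several datasets including FloodNet+, RescueNet, DFire, and LADD, and analyzes performance trends, failure modes, and the practical trade-offs between learning paradigms.
Details
Motivation: To address the challenges of automatically interpreting post-disaster aerial imagery, including cluttered scenes, visual variability, and cross-event domain shift, while reducing dependence on fixed label sets and costly task-specific annotations.
Result: Across all evaluated benchmarks, supervised training remains the most reliable approach when the label space is fixed and annotations are available, and it is notably better for small objects and fine boundary delineation in cluttered scenes.
Insight: Open-vocabulary and foundation vision models reduce dependence on task-specific annotations through large-scale pretraining and vision-language representations, which suits post-disaster domains where visual concepts are ambiguous and data is limited. Supervised learning still has the advantage with fixed label sets, particularly in handling fine detail.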
Abstract: Aerial imagery is critical for large-scale post-disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative: they reduce dependence on fixed label sets and extensive task-specific annotations, and instead leverage large-scale pretraining and vision-language representations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between different learning paradigms, providing insight into their applicability for real-world disaster response. The most notable finding across all evaluated benchmarks is that supervised training remains the most reliable approach when the label space is fixed and annotations are available, especially for small objects and fine boundary delineation in cluttered scenes.
[187] Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis cs.CVPDF
Junwei Zeng, Dong Liang, Sheng-Jun Huang, Kun Zhan, Songcan Chen
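TL;DR: This paper proposes a continuous exposure-time modeling approach for synthesizing more realistic atmospheric turbulence effects and builds ET-Turb, a large-scale synthetic dataset. By introducing an exposure-time-dependent modulation transfer function (ET-MTF) and a tilt-invariant point spread function (PSF), the method models blur as a continuous function of exposure time and characterizes turbulence-induced geometric distortion and blur more accurately.
Details
Motivation: Existing turbulence synthesis methods usually oversimplify the relationship between blur and exposure time, for example by assuming fixed or binary exposure settings, which makes the synthetic data unrealistic and limits the generalization of trained models. The paper aims to close this gap by modeling the relationship continuously, improving both the realism of synthetic data and the practical usefulness of the models.
Result: The ET-Turb dataset built with the proposed synthesis pipeline contains 5,083 videos (2,005,835 frames), split into 3,988 training videos and 1,095 test videos. Experiments show that models trained on ET-Turb produce more realistic restorations and generalize better to real turbulence data than models trained on other datasets.
Insight: The contribution is to model turbulence-induced blur as a continuous function of exposure time (ET-MTF) and to derive a tilt-invariant PSF which, combined with a spatially varying blur-width field, gives a comprehensive and physically accurate characterization of turbulence blur. This offers a new way to synthesize more realistic turbulence data and improve model generalization.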
Abstract: Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. The dataset is publicly available at: github.com/Jun-Wei-Zeng/ET-Turb.
[188] Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models cs.CVPDF
Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe
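TL;DR: This paper proposes AOT (anchor optimization via local and global optimal transport), a training-free token reduction method for efficient video large language models (VLLMs). Within each frame, it establishes local- and global-aware token anchors and uses optimal transport to aggregate information from pruned tokens into them. Within each temporal clip, it treats the first frame as the keyframe anchor, aggregates similar information from consecutive frames through optimal transport, and keeps distinct tokens to represent temporal dynamics. The result is a substantial gain in computational efficiency while preserving visual and temporal fidelity.
Details
Motivation: Video large language models are inefficient because of redundant visual tokens. Existing pruning methods mainly target intra-frame spatial redundancy or prune in the shallow layers of the LLM, which gives suboptimal spatiotemporal reduction and underuses long-context compressibility, and they often discard subtle but informative context from merged or pruned tokens.
Result: Evaluated with leading video LLMs (such as Video-LLaVA and LLaMA-VID) on several short- and long-video benchmarks (such as ActivityNet, YouCook2, EgoSchema, and Next-QA), AOT keeps competitive performance while delivering substantial gains in computational efficiency.
Insight: The contribution is a new perspective: establish intra-frame and inter-frame token anchors and aggregate informative context comprehensively through local-global optimal transport. This achieves training-free, efficient token reduction while better preserving spatiotemporal dynamics and visual detail.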
Abstract: Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primarily targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. These methods often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that constructs token Anchors within and across frames to comprehensively aggregate informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative contexts from pruned tokens into them, constructing intra-frame token anchors. Then, building on temporal frame clips, the first frame within each clip is treated as the keyframe anchor, which ensembles similar information from consecutive frames through optimal transport while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT.
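To make the role of optimal transport concrete, the sketch below aggregates pruned tokens into a small set of anchors using entropic OT solved with Sinkhorn iterations. It is a generic illustration, not the AOT algorithm: the cosine cost, uniform marginals, regularization `reg`, and the `mix` blending factor are all hypothetical choices, and the anchor selection itself is taken as given.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic optimal transport plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]          # (n, m) transport plan

def aggregate_into_anchors(anchors, pruned, reg=0.1, mix=0.5):
    """Blend each anchor with the pruned tokens that the plan sends to it.

    anchors: (M, D) kept tokens; pruned: (N, D) tokens about to be dropped.
    `mix` is a hypothetical blending weight.
    """
    an = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    pn = pruned / np.linalg.norm(pruned, axis=1, keepdims=True)
    cost = 1.0 - pn @ an.T                       # cosine distance, (N, M)
    plan = sinkhorn(cost, reg)
    weights = plan / plan.sum(axis=0, keepdims=True)   # normalize per anchor
    return (1.0 - mix) * anchors + mix * (weights.T @ pruned), plan

rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 8))
pruned = rng.normal(size=(10, 8))
updated, plan = aggregate_into_anchors(anchors, pruned)
```

Compared with simply discarding the pruned tokens, this keeps the token count at the number of anchors while letting each anchor absorb a weighted summary of the tokens most similar to it.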
[189] UETrack: A Unified and Efficient Framework for Single Object Tracking cs.CVPDF
Ben Kang, Jie Zhao, Xin Chen, Wanting Geng, Bin Zhang
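TL;DR: UETrack is a unified and efficient framework for single object tracking. It targets the limitations of existing methods, which are restricted to RGB input, perform poorly in multi-modal scenarios, and are inefficient because of complex designs. With a Token-Pooling-based Mixture-of-Experts mechanism and a Target-aware Adaptive Distillation strategy, it efficiently handles RGB, depth, thermal, event, and language inputs and achieves a superior speed-accuracy trade-off across several hardware platforms.
Details
Motivation: Most existing trackers support only RGB input and struggle in multi-modal scenarios, while current multi-modal trackers use complex, computation-heavy designs that are unsuitable for resource-constrained deployment.
Result: Extensive experiments on 12 benchmarks and 3 hardware platforms (GPU/CPU/AGX) show that UETrack outperforms previous methods in the speed-accuracy trade-off. For example, UETrack-B reaches 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX.
Insight: The contributions are: 1) a Token-Pooling-based Mixture-of-Experts mechanism that increases modeling capacity through feature aggregation and expert specialization; 2) a Target-aware Adaptive Distillation strategy that distills selectively according to sample characteristics, reducing redundant supervision and improving performance. Together they give efficient, general-purpose multi-modal tracking with strong practicality and generalization.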
Abstract: With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, an efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed-accuracy trade-off compared to previous methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code is available at https://github.com/kangben258/UETrack.
[190] UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation cs.CV | cs.MM | cs.SDPDF
Hebeizi Li, Zihao Liang, Benyuan Sun, Zihao Yin, Xiao Sha
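TL;DR: This paper proposes UniTalking, a unified end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. Using Multi-Modal Transformer Blocks with a shared self-attention mechanism, it explicitly models the fine-grained temporal correspondence between audio and video latent tokens, and it relies on the strong priors of a pre-trained video generation model for visual fidelity. It also integrates personalized voice cloning, generating speech in a target style from a short audio reference.
Details
Motivation: State-of-the-art audio-video generation models such as Veo3 and Sora2 are closed-source, so their architectures and training paradigms are inaccessible. To close this gap in accessibility and performance, the authors propose an open-source unified framework.
Result: Qualitative and quantitative results show that the method generates highly realistic talking portraits and outperforms existing open-source methods in lip-sync accuracy, audio naturalness, and overall perceptual quality, reaching SOTA level.
Insight: The contribution is a unified, end-to-end diffusion framework that explicitly models audio-video temporal alignment through Multi-Modal Transformer Blocks and integrates voice cloning. Viewed objectively, its strategy of using pre-trained model priors for efficient training is worth borrowing.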
Abstract: While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
[191] SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation cs.CVPDF
Yingjian Zhu, Ying Wang, Yuyang Hong, Ruohao Guo, Kun Ding
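TL;DR: SeaVIS is the first online audio-visual instance segmentation framework. It uses a Causal Cross Attention Fusion module for efficient online processing and an Audio-Guided Contrastive Learning strategy to strengthen its audio-following ability, surpassing existing SOTA models on the AVISeg dataset while keeping real-time inference speed.
Details
Motivation: Existing audio-visual instance segmentation methods mainly follow an offline paradigm and cannot associate instances across a continuous video stream. In addition, the appearance-based association used by conventional visual instance segmentation cannot distinguish an object's sounding state from its silent state, which leads to incorrect segmentation.
Result: Extensive experiments on the AVISeg dataset show that SeaVIS surpasses existing state-of-the-art models on multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.
Insight: The contributions are a Causal Cross Attention Fusion module that fuses audio and visual features online under strict causal constraints, and an Audio-Guided Contrastive Learning strategy that produces instance prototypes encoding both visual appearance and sounding activity, which suppresses incorrect segmentation of silent objects during association.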
Abstract: Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment, and track individual sounding instances in videos. However, prevailing methods primarily adopt an offline paradigm that cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages a Causal Cross Attention Fusion (CCAF) module to enable efficient online processing; the module integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object's sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances preserved during per-frame prediction that do not emit sound can be effectively suppressed during the instance association process, thereby significantly enhancing the audio-following capability of SeaVIS. Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state-of-the-art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.
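The causal constraint described for CCAF (visual tokens of the current frame attend only to audio observed so far) can be shown with a bare cross-attention loop. This is a simplified stand-in, not the SeaVIS module: there are no learned projections or multiple heads, and the residual addition and function names are assumptions.

```python
import numpy as np

def causal_cross_attention(visual_t, audio_history):
    """Fuse current-frame visual tokens with past and present audio.

    visual_t: (N, D) visual tokens of the current frame.
    audio_history: (t + 1, D) audio features for frames 0..t only.
    """
    d = visual_t.shape[-1]
    scores = visual_t @ audio_history.T / np.sqrt(d)      # (N, t + 1)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return visual_t + w @ audio_history                    # residual fusion

def run_online(visual_frames, audio_frames):
    """Process a stream frame by frame; frame t never sees audio after t."""
    outputs = []
    for t in range(len(visual_frames)):
        outputs.append(causal_cross_attention(visual_frames[t], audio_frames[: t + 1]))
    return outputs

rng = np.random.default_rng(0)
visual = rng.normal(size=(3, 4, 8))     # 3 frames, 4 tokens each, D = 8
audio = rng.normal(size=(3, 8))         # one audio feature per frame
fused = run_online(visual, audio)
```

Because the slice `audio_frames[: t + 1]` never includes later frames, changing future audio cannot alter earlier outputs, which is the property that makes the fusion usable on a live stream.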
[192] Unifying Language-Action Understanding and Generation for Autonomous Driving cs.CV | cs.ROPDF
Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li
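TL;DR: This paper proposes LinkVLA, which unifies language and action representations, adds an auxiliary action understanding task, and adopts a two-step coarse-to-fine generation strategy. It addresses the instruction-action misalignment and the slow auto-regressive generation found in end-to-end vision-language-action models for autonomous driving, improving instruction-following accuracy and driving performance on closed-loop driving benchmarks while sharply reducing inference latency.
Details
Motivation: Existing end-to-end vision-language-action models for autonomous driving have two key flaws: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation.
Result: On closed-loop driving benchmarks, the model shows consistent gains in instruction-following accuracy and driving performance, and the proposed C2F generation method saves 86% of inference time.
Insight: The contributions are: 1. unifying language and action tokens in a shared discrete codebook, which structurally enforces cross-modal consistency; 2. an auxiliary action understanding task that generates descriptive text from trajectories, establishing a bidirectional language-action mapping; 3. replacing slow auto-regressive generation with an efficient two-step coarse-to-fine method, which markedly speeds up inference.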
Abstract: Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method C2F that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, alongside reduced inference latency.
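[193] Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection cs.CV
Jianfeng Liao, Yichen Wei, Raymond Chan Ching Bon, Shulan Wang, Kam-Pui Chow
TL;DR: This paper proposes the Deepfake Forensics Adapter (DFA), a dual-stream network framework for improving the generalization of deepfake detection. The framework combines the pre-trained CLIP vision-language foundation model with targeted forensic analysis. Through three core components, a Global Feature Adapter, a Local Anomaly Stream, and an Interactive Fusion Classifier, it identifies global inconsistencies and local facial forgery traces without modifying CLIP's parameters.
Details
Motivation: The rapid development of deepfake generation poses a serious threat to public safety, while existing detection methods generalize poorly to emerging forgery patterns. The paper aims to build a robust deepfake detection system that generalizes well.
Result: On frame-level and video-level benchmarks, DFA shows strong generalization, in particular reaching state-of-the-art performance on the challenging DFDC dataset: frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, a 4.8% video AUC improvement over previous methods.
Insight: The paper's novelty is a dual-stream framework that combines the general capability of a foundation model with specialized forensic analysis, using adapters to specialize detection without fine-tuning the foundation model. Viewed objectively, strengthening local perception with facial structure priors and using a transformer to drive deep interaction between global and local features offers an effective direction for building robust detection systems.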
Abstract: The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. While existing detection methods demonstrate limitations in generalizing to emerging forgery patterns, this paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components to achieve specialized deepfake detection by leveraging the powerful general capabilities of CLIP without changing CLIP parameters: 1) A Global Feature Adapter is used to identify global inconsistencies in image content that may indicate forgery, 2) A Local Anomaly Stream enhances the model’s ability to perceive local facial forgery cues by explicitly leveraging facial structure priors, and 3) An Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations of frame-level and video-level benchmarks demonstrate the superior generalization capabilities of DFA, particularly achieving state-of-the-art performance in the challenging DFDC dataset with frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, representing a 4.8% video AUC improvement over previous methods. Our framework not only demonstrates state-of-the-art performance, but also points out a feasible and effective direction for developing a robust deepfake detection system with enhanced generalization capabilities against the evolving deepfake threats. Our code is available at https://github.com/Liao330/DFA.git
[194] VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models cs.CV | cs.AI
Duoxun Tang, Dasen Dai, Jiyao Wang, Xiao Yang, Jianyu Wang
TL;DR: This paper proposes VidDoS, the first universal denial-of-service attack framework targeting video large language models. It generates instance-agnostic trigger patterns that substantially increase a model's computational cost and latency without any inference-time gradient computation.
Details
Motivation: Video-LLMs are increasingly deployed in safety-critical applications yet are vulnerable to energy-latency attacks. Existing image-centric methods fail because of temporal aggregation mechanisms, and real-time requirements make per-instance optimization impractical.
Result: Tested on three mainstream Video-LLMs and three video datasets (covering video question answering and autonomous driving scenarios), VidDoS causes a token expansion of more than 205× and an inference latency increase of more than 15×. Simulations of real-time autonomous driving streams show that this leads to critical safety violations.
Insight: The innovations include using masked teacher forcing to steer the model toward high-cost target sequences, combined with a refusal penalty and early-termination suppression to override conciseness priors, yielding a universal attack framework that needs no inference-time optimization.
Abstract: Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through masked teacher forcing to steer models toward expensive target sequences, combined with a refusal penalty and early-termination suppression to override conciseness priors. Testing across three mainstream Video-LLMs and three video datasets, which include video question answering and autonomous driving scenarios, shows extreme degradation. VidDoS induces a token expansion of more than 205× and inflates the inference latency by more than 15× relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELAs in Video-LLMs.
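[195] WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments cs.CV
Joshua Knights, Joseph Reid, Kaushik Roy, David Hall, Mark Cox
TL;DR: This paper presents WildCross, a large-scale cross-modal benchmark dataset for place recognition and metric depth estimation in natural environments. It contains over 476K sequential RGB frames with semi-dense depth and surface normal annotations, accurate 6DoF poses, and synchronized dense lidar submaps.
Details
Motivation: Existing robotics datasets mainly target structured urban environments and are inadequate for the challenges of complex, unstructured natural settings, so a new benchmark is needed to advance the bridging of 2D and 3D scene understanding in natural scenes.
Result: The paper runs comprehensive experiments on visual, lidar, and cross-modal place recognition and on metric depth estimation, demonstrating the difficulty and value of WildCross as a benchmark for multi-modal robotic perception tasks.
Insight: The novelty lies in building the first cross-modal benchmark dataset focused on large-scale natural environments, integrating rich aligned multi-modal data (RGB, depth, normals, poses, lidar) and providing a key resource for research on robotic perception in unstructured environments.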
Abstract: Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.
[196] SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout cs.CV
Brian Cheong, Letian Wang, Sandro Papais, Steven L. Waslander
TL;DR: This paper proposes SCATR, a novel LiDAR-based tracking-by-attention (TBA) model that systematically addresses the high false-negative errors inherent to the TBA framework. Its core innovations are two architecture-agnostic training strategies: Second Chance Assignment and Track Query Dropout. On the nuScenes tracking benchmark, SCATR achieves state-of-the-art performance among LiDAR-based TBA methods and substantially narrows the gap to traditional tracking-by-detection (TBD) methods.
Details
Motivation: LiDAR-based tracking-by-attention (TBA) frameworks inherently suffer from high false-negative errors, leaving a significant performance gap to traditional LiDAR-based tracking-by-detection (TBD) methods. The paper aims to address this fundamental challenge systematically.
Result: Experiments on the nuScenes tracking benchmark show that SCATR achieves state-of-the-art performance among LiDAR-based TBA methods, improving AMOTA by 7.6% over prior work and bridging the long-standing gap between LiDAR-based TBA and TBD methods. Ablation studies further validate the effectiveness and generalization of the proposed strategies.
Insight: The claimed innovations are two architecture-agnostic training strategies. Second Chance Assignment concatenates unassigned track queries with the proposal queries before bipartite matching, giving them a second chance at assignment and mitigating the inherent conflict between the detection and tracking tasks. Track Query Dropout diversifies the supervised object query configurations so that the decoder learns to handle different track query sets, improving robustness to missing or newborn tracks. Viewed objectively, these strategies target the core weaknesses of TBA frameworks, namely new-instance suppression and a lack of diversity in query configurations, and appear broadly applicable.
Abstract: LiDAR-based tracking-by-attention (TBA) frameworks inherently suffer from high false negative errors, leading to a significant performance gap compared to traditional LiDAR-based tracking-by-detection (TBD) methods. This paper introduces SCATR, a novel LiDAR-based TBA model designed to address this fundamental challenge systematically. SCATR leverages recent progress in vision-based tracking and incorporates targeted training strategies specifically adapted for LiDAR. Our work’s core innovations are two architecture-agnostic training strategies for TBA methods: Second Chance Assignment and Track Query Dropout. Second Chance Assignment is a novel ground truth assignment that concatenates unassigned track queries to the proposal queries before bipartite matching, giving these track queries a second chance to be assigned to a ground truth object and effectively mitigating the conflict between detection and tracking tasks inherent in tracking-by-attention. Track Query Dropout is a training method that diversifies supervised object query configurations to efficiently train the decoder to handle different track query sets, enhancing robustness to missing or newborn tracks. Experiments on the nuScenes tracking benchmark demonstrate that SCATR achieves state-of-the-art performance among LiDAR-based TBA methods, outperforming previous works by 7.6% AMOTA and successfully bridging the long-standing performance gap between LiDAR-based TBA and TBD methods. Ablation studies further validate the effectiveness and generalization of Second Chance Assignment and Track Query Dropout. Code can be found at https://github.com/TRAILab/SCATR
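The two training strategies can be sketched at the level of query bookkeeping. The snippet below is a simplified illustration, not the SCATR implementation: the function names, the use of plain ids in place of query embeddings, and the independent per-query drop probability are all assumptions.

```python
import random
from typing import Hashable, List, Sequence, Set, Tuple


def track_query_dropout(
    track_ids: Sequence[Hashable], drop_prob: float, rng: random.Random
) -> Tuple[List[Hashable], List[Hashable]]:
    """Randomly drop track queries during training (hypothetical scheme).

    Objects whose track query was dropped would then have to be picked up
    again by proposal queries, as if they were newborn tracks.
    """
    kept, dropped = [], []
    for tid in track_ids:
        (dropped if rng.random() < drop_prob else kept).append(tid)
    return kept, dropped


def second_chance_pool(
    proposal_ids: Sequence[Hashable],
    track_ids: Sequence[Hashable],
    assigned_track_ids: Set[Hashable],
) -> List[Hashable]:
    """Build the candidate set handed to bipartite matching.

    Track queries that did not receive a ground-truth object in the first
    pass are appended to the proposal queries, so they compete for the
    remaining ground-truth objects instead of being left unsupervised.
    """
    unassigned = [t for t in track_ids if t not in assigned_track_ids]
    return list(proposal_ids) + unassigned
```

For example, `second_chance_pool(["p0", "p1"], ["t0", "t1", "t2"], {"t1"})` yields the pool `["p0", "p1", "t0", "t2"]`, which would then go through the usual bipartite matching step.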
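[197] ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models cs.CV | cs.AI
Cheng Yang, Jianhao Jiao, Lingyi Huang, Jinqi Xiao, Zhexiang Tang
TL;DR: This paper proposes ATA, a novel training-free framework that introduces implicit reasoning into the inference of vision-language-action (VLA) models through complementary attention-guided and action-guided strategies, improving task success and robustness while preserving or even improving inference efficiency.
Details
Motivation: Existing VLA models predict actions from current observations (images, language instructions, robot states). Methods that add explicit reasoning, such as chain-of-thought or visual grounding annotations, depend on data-intensive annotation, are time-consuming to build, and reduce inference efficiency.
Result: Extensive experiments show that ATA consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.
Insight: The novelty is a training-free, plug-and-play implicit reasoning approach that formulates reasoning implicitly by integrating attention maps with an action-based region of interest. It adaptively refines the visual input without extra training or annotation, avoiding the explicit methods' dependence on annotation and retraining.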
Abstract: Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.
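[198] Radiometrically Consistent Gaussian Surfels for Inverse Rendering cs.CV | cs.GR
Kyu Beom Han, Jaeyoon Kim, Woo Jae Kim, Jinhwan Seo, Sung-eui Yoon
TL;DR: This paper proposes Radiometrically Consistent Gaussian Surfels (RadioGS), an inverse rendering framework that tackles the challenge, in Gaussian Splatting, of accurately disentangling material properties from complex global illumination, especially indirect illumination. By introducing a radiometric consistency constraint, the method supervises unobserved views, and it integrates the constraint efficiently using Gaussian surfels and 2D Gaussian ray tracing, enabling accurate indirect illumination modeling and fast scene relighting.
Details
Motivation: Existing Gaussian Splatting-based inverse rendering methods typically query indirect radiance from Gaussian primitives pre-trained for novel-view synthesis. These primitives are supervised only on a limited set of training views and receive no supervision for indirect radiance from unobserved views, which makes it hard to disentangle materials from global illumination effects.
Result: Extensive experiments on existing inverse rendering benchmarks show that RadioGS outperforms existing Gaussian-based methods while retaining computational efficiency (rendering cost <10 ms) and adapting to new illumination for relighting within minutes.
Insight: The novelty is radiometric consistency, a physically-based constraint that supervises unobserved views by minimizing the residual between each Gaussian primitive's learned radiance and its physically-based rendered counterpart. This forms a self-correcting feedback loop that combines physically-based rendering and novel-view synthesis, enabling accurate modeling of inter-reflection. The efficient integration of the constraint through Gaussian surfels and 2D Gaussian ray tracing, and the finetuning-based relighting strategy, are also reusable techniques.
Abstract: Inverse rendering with Gaussian Splatting has advanced rapidly, but accurately disentangling material properties from complex global illumination effects, particularly indirect illumination, remains a major challenge. Existing methods often query indirect radiance from Gaussian primitives pre-trained for novel-view synthesis. However, these pre-trained Gaussian primitives are supervised only towards limited training viewpoints and thus lack supervision for modeling indirect radiance from unobserved views. To address this issue, we introduce radiometric consistency, a novel physically-based constraint that provides supervision towards unobserved views by minimizing the residual between each Gaussian primitive’s learned radiance and its physically-based rendered counterpart. Minimizing the residual for unobserved views establishes a self-correcting feedback loop that provides supervision from both physically-based rendering and novel-view synthesis, enabling accurate modeling of inter-reflection. We then propose Radiometrically Consistent Gaussian Surfels (RadioGS), an inverse rendering framework built upon this principle that integrates radiometric consistency efficiently using Gaussian surfels and 2D Gaussian ray tracing. We further propose a finetuning-based relighting strategy that adapts Gaussian surfel radiances to new illuminations within minutes, achieving low rendering cost (<10 ms). Extensive experiments on existing inverse rendering benchmarks show that RadioGS outperforms existing Gaussian-based methods in inverse rendering while retaining computational efficiency.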
[199] Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling cs.CV | cs.AI
Zillur Rahman, Alex Sheng, Cristian Meo
TL;DR: This paper proposes 3R, a retrieval-augmented generation (RAG) based prompt optimization framework for improving the quality of text-to-video (T2V) generation. Without training the core generator, it uses RAG-based modifier extraction to enrich context, diffusion-based preference optimization to align with human preferences, and temporal frame interpolation to ensure temporal consistency, producing more accurate, efficient, and contextually aligned videos.
Details
Motivation: Existing T2V models are highly sensitive to input prompts, and methods for improving video output either rely on complex post-editing models (which may introduce artifacts) or require expensive fine-tuning of the core generator, which limits scalability and accessibility.
Result: Experiments show that 3R improves the static fidelity and dynamic coherence of generated videos, confirming the importance of optimizing user prompts.
Insight: The novelty is combining RAG, diffusion preference optimization, and temporal interpolation in a prompt optimization framework that needs no training of the core model, offering a scalable and accessible way to improve T2V generation quality.
Abstract: While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG-based prompt optimization framework. 3R builds on a current state-of-the-art T2V diffusion model and a vision-language model. It can be used with any T2V model without any model training. The framework leverages three key strategies: RAG-based modifier extraction for enriched contextual grounding, diffusion-based Preference Optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual content. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.
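[200] Better Matching, Less Forgetting: A Quality-Guided Matcher for Transformer-based Incremental Object Detection cs.CV
Qirui Wu, Shizhou Zhang, De Cheng, Yinghui Xing, Lingyan Ran
TL;DR: This paper proposes a new remedy for the background foregrounding problem in transformer-based incremental object detection (IOD): a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. It builds a flow graph and prunes implausible matches based on geometric quality, which avoids the erroneous supervision caused by the Hungarian matcher's forced assignments and thereby mitigates catastrophic forgetting. In experiments on the COCO dataset under various incremental settings, it outperforms existing state-of-the-art methods.
Details
Motivation: Incremental object detection faces catastrophic forgetting, which in conventional detectors is mainly attributed to background shift. The paper identifies a distinct new source of forgetting in DETR-like architectures, background foregrounding. The exhaustiveness constraint of the Hungarian matcher forces every ground-truth target to be assigned to a prediction, even when that prediction mostly covers background (i.e., low IoU). The model then misclassifies background features as specific foreground classes, which disrupts learned representations and accelerates forgetting.
Result: Experiments on the COCO dataset under various incremental settings show that the method consistently outperforms existing state-of-the-art approaches.
Insight: The novelty lies in identifying background foregrounding as a source of forgetting specific to DETR-like architectures and in proposing the Q-MCMF matcher, which uses quality-guided flow-graph optimization to avoid forced assignments, removing harmful supervision while maximizing the foreground learning signal. Viewed objectively, the method recasts matching as a min-cost max-flow problem with geometric quality pruning, giving transformer-based incremental learning a new matching strategy.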
Abstract: Incremental Object Detection (IOD) aims to continuously learn new object classes without forgetting previously learned ones. A persistent challenge is catastrophic forgetting, primarily attributed to background shift in conventional detectors. While pseudo-labeling mitigates this in dense detectors, we identify a novel, distinct source of forgetting specific to DETR-like architectures: background foregrounding. This arises from the exhaustiveness constraint of the Hungarian matcher, which forcibly assigns every ground truth target to one prediction, even when predictions primarily cover background regions (i.e., low IoU). This erroneous supervision compels the model to misclassify background features as specific foreground classes, disrupting learned representations and accelerating forgetting. To address this, we propose a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. To avoid forced assignments, Q-MCMF builds a flow graph and prunes implausible matches based on geometric quality. It then optimizes for the final matching that minimizes cost and maximizes valid assignments. This strategy eliminates harmful supervision from background foregrounding while maximizing foreground learning signals. Extensive experiments on the COCO dataset under various incremental settings demonstrate that our method consistently outperforms existing state-of-the-art approaches.
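A small sketch can make the contrast with forced Hungarian assignment concrete. The code below is not the authors' Q-MCMF: it assumes IoU as the geometric quality measure, `1 - IoU` as the edge cost, and a hypothetical threshold `min_iou`, and it solves the resulting min-cost max-flow with a plain successive-shortest-path loop.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def quality_guided_match(gts: List[Box], preds: List[Box], min_iou: float = 0.3) -> Dict[int, int]:
    """Match ground truths to predictions; low-quality pairs are pruned,
    so a ground truth may legitimately stay unmatched."""
    n_gt, n_pred = len(gts), len(preds)
    source, sink = 0, n_gt + n_pred + 1
    n_nodes = sink + 1
    edges: List[List[float]] = []          # [to, capacity, cost]; k ^ 1 is the reverse edge
    adj: List[List[int]] = [[] for _ in range(n_nodes)]

    def add_edge(u: int, v: int, cost: float) -> None:
        adj[u].append(len(edges)); edges.append([v, 1, cost])
        adj[v].append(len(edges)); edges.append([u, 0, -cost])

    for i in range(n_gt):
        add_edge(source, 1 + i, 0.0)
    for j in range(n_pred):
        add_edge(1 + n_gt + j, sink, 0.0)
    for i, g in enumerate(gts):
        for j, p in enumerate(preds):
            q = iou(g, p)
            if q >= min_iou:               # quality-guided pruning of implausible pairs
                add_edge(1 + i, 1 + n_gt + j, 1.0 - q)

    inf = float("inf")
    while True:
        # Bellman-Ford shortest path in the residual graph (handles negative costs).
        dist = [inf] * n_nodes
        prev_edge = [-1] * n_nodes
        dist[source] = 0.0
        for _ in range(n_nodes - 1):
            updated = False
            for u in range(n_nodes):
                if dist[u] == inf:
                    continue
                for k in adj[u]:
                    v, cap, cost = edges[k]
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], prev_edge[v], updated = dist[u] + cost, k, True
            if not updated:
                break
        if dist[sink] == inf:
            break                          # no augmenting path left: flow is maximal
        v = sink
        while v != source:                 # push one unit of flow along the path
            k = prev_edge[v]
            edges[k][1] -= 1
            edges[k ^ 1][1] += 1
            v = int(edges[k ^ 1][0])

    matches: Dict[int, int] = {}
    for i in range(n_gt):
        for k in adj[1 + i]:
            if k % 2 == 0 and edges[k][1] == 0:   # saturated forward edge gt -> pred
                matches[i] = int(edges[k][0]) - 1 - n_gt
    return matches
```

With two ground truths where the second overlaps no prediction, the matcher returns only the first pair, whereas an exhaustive one-to-one assignment would have to attach the second ground truth to a background prediction.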
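[201] Boosting AI Reliability with an FSM-Driven Streaming Inference Pipeline: An Industrial Case cs.CV
Yutian Zhang, Zhongyi Pei, Yi Mao, Chen Wang, Lin Liu
TL;DR: This paper proposes a novel streaming inference pipeline that combines a finite state machine (FSM) with an object detection model, explicitly incorporating prior knowledge into a data-driven model to make industrial AI applications more robust. It is applied to a concrete industrial case, automatically counting excavator workloads from surveillance video, and validated on a real-world dataset.
Details
Motivation: Industrial adoption of AI is often hampered by limited robustness to scenarios absent from the training data, which leads to prediction bias and vulnerabilities. The paper aims to improve the reliability of data-driven models by incorporating prior knowledge.
Result: On a real-world dataset of 12 site videos with over 7,000 images and more than 300 excavator workloads, the method shows better performance and greater robustness than the original solution based on manual heuristic rules.
Insight: The core innovation is an FSM-driven streaming inference pipeline that encodes domain knowledge (operational scenarios) as a state machine to guide and correct the AI's predictions on streaming data, explicitly adding prior knowledge while keeping the benefits of a data-driven approach.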
Abstract: The widespread adoption of AI in industry is often hampered by its limited robustness when faced with scenarios absent from training data, leading to prediction bias and vulnerabilities. To address this, we propose a novel streaming inference pipeline that enhances data-driven models by explicitly incorporating prior knowledge. This paper presents the work on an industrial AI application that automatically counts excavator workloads from surveillance videos. Our approach integrates an object detection model with a Finite State Machine (FSM), which encodes knowledge of operational scenarios to guide and correct the AI’s predictions on streaming data. In experiments on a real-world dataset of over 7,000 images from 12 site videos, encompassing more than 300 excavator workloads, our method demonstrates superior performance and greater robustness compared to the original solution based on manual heuristic rules. We will release the code at https://github.com/thulab/video-streamling-inference-pipeline.
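How an FSM can guide and correct per-frame predictions is easiest to see in a toy example. The states, the allowed transitions, and the counting rule below are hypothetical and are not taken from the paper; the sketch only shows the general pattern of rejecting detector outputs that violate known operational order.

```python
from typing import Dict, Iterable, Set

# Hypothetical operating cycle of an excavator; the real state set is not given in the abstract.
ALLOWED: Dict[str, Set[str]] = {
    "idle": {"idle", "digging"},
    "digging": {"digging", "swinging"},
    "swinging": {"swinging", "dumping"},
    "dumping": {"dumping", "idle", "digging"},
}


class WorkloadCounter:
    """Consumes per-frame detector labels and counts completed work cycles."""

    def __init__(self) -> None:
        self.state = "idle"
        self.count = 0

    def update(self, detected: str) -> str:
        # A label that is not a legal successor of the current state is
        # treated as a detector error and ignored.
        if detected not in ALLOWED[self.state]:
            return self.state
        # One workload is counted each time the cycle reaches the dumping state.
        if self.state == "swinging" and detected == "dumping":
            self.count += 1
        self.state = detected
        return self.state


def count_workloads(labels: Iterable[str]) -> int:
    counter = WorkloadCounter()
    for label in labels:
        counter.update(label)
    return counter.count
```

In a stream such as `idle, digging, dumping, digging, swinging, dumping`, the first `dumping` is rejected because it cannot follow `digging` directly, so only one workload is counted.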
[202] Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory cs.CV
Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, Fanzhang Li
TL;DR: This paper proposes SDAM, a training-free spatio-temporal decoupled reasoning video segmentation method. Using an Adaptive Object Memory module and a spatio-temporal decoupling strategy with only pre-trained models, it achieves stable segmentation of video objects from complex textual inputs and strong results on several benchmark datasets.
Details
Motivation: Existing reasoning video segmentation methods usually fine-tune multimodal large language models, which is resource-intensive, and they couple spatial and temporal processing, which hurts temporal stability. The paper aims to design a training-free framework that outperforms fine-tuned methods.
Result: The method achieves excellent results on five benchmark datasets, Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS, outperforming existing methods that require fine-tuning.
Insight: The innovations include the training-free framework design, an Adaptive Object Memory module driven by motion cues, and a spatio-temporal decoupling strategy that separates precise spatial localization from stable temporal propagation, improving efficiency and stability.
Abstract: Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demand substantial resources. Additionally, some existing methods couple the processing of spatial and temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.
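[203] PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification cs.CV
Jian Yu, Joakim Nguyen, Jinrui Fang, Awais Naeem, Zeyuan Cao
TL;DR: This paper proposes PathMoE, an interpretable multimodal interaction-expert model for pediatric brain tumor classification. It integrates H&E pathology slides, pathology reports, and nuclei-level cell graphs through an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality, dynamically weighting the unique, redundant, and synergistic information across modalities to improve classification and provide sample-level interpretability.
Details
Motivation: The histological complexity of pediatric central nervous system tumors and the scarcity of training data make accurate classification difficult. Existing pathology foundation models can analyze whole-slide images but do not fully exploit the rich complementary information in clinical text and tissue microarchitecture.
Result: On the internal pediatric brain tumor dataset, integrating WSI, text, and graph modalities raises macro-F1 from 0.762 to 0.799; on the external TCGA datasets, augmenting WSI with graph knowledge raises macro-F1 from 0.668 to 0.709. Both significantly outperform state-of-the-art image-only baselines.
Insight: The novelty is an interaction-aware mixture-of-experts architecture that fuses multimodal information dynamically through an input-dependent gating mechanism. It improves performance and also exposes which modality interactions drive each individual prediction, which is critical for clinical trust in rare tumor subtypes.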
Abstract: Accurate classification of pediatric central nervous system tumors remains challenging due to histological complexity and limited training data. While pathology foundation models have advanced whole-slide image (WSI) analysis, they often fail to leverage the rich, complementary information found in clinical text and tissue microarchitecture. To this end, we propose PathMoE, an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. By training specialized experts to capture modality uniqueness, redundancy, and synergy, PathMoE employs an input-dependent gating mechanism that dynamically weights these interactions, providing sample-level interpretability. We evaluate our framework on two dataset-specific classification tasks on an internal pediatric brain tumor dataset (PBT) and external TCGA datasets. PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on PBT when integrating WSI, text, and graph modalities; on TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041). These results demonstrate significant performance gains over state-of-the-art image-only baselines while revealing the specific modality interactions driving individual predictions. This interpretability is particularly critical for rare tumor subtypes, where transparent model reasoning is essential for clinical trust and diagnostic validation.
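The input-dependent gating described above can be sketched in a few lines. This is a generic softmax-gated mixture, not PathMoE's architecture: the gate scores and expert logits are assumed to come from upstream networks that are not modeled here, and the three experts stand in for the uniqueness, redundancy, and synergy experts.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(
    gate_scores: Sequence[float], expert_logits: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Combine per-expert class logits with sample-specific gate weights.

    Returns the mixed logits and the gate weights; the weights double as a
    per-sample explanation of which interaction expert drove the prediction.
    """
    weights = softmax(gate_scores)
    n_classes = len(expert_logits[0])
    mixed = [
        sum(w * logits[c] for w, logits in zip(weights, expert_logits))
        for c in range(n_classes)
    ]
    return mixed, weights
```

Because the gate scores depend on the input sample, two slides can receive the same predicted class for different reasons, for example one dominated by the synergy expert and another by the uniqueness expert.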
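[204] Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation cs.CV | cs.AI | cs.RO
Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim
TL;DR: This paper proposes Pri4R, which uses privileged 4D information (3D space plus time) during training to give vision-language-action models an implicit understanding of world dynamics, improving their performance on physical interaction tasks. A lightweight point track prediction head is trained jointly with the VLA model to learn how scene geometry evolves, and it adds no computational overhead at inference.
Details
Motivation: Existing vision-language-action models show strong semantic understanding but often fail to capture the spatiotemporal dynamics of physical interaction, which limits their performance on precise control tasks.
Result: In simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on the LIBERO-Long benchmark and a +40% gain on the RoboCasa benchmark.
Insight: The novelty lies in using 3D point track prediction as a supervision signal for learning action-world dynamics and in injecting privileged 4D information during training (with no inference-time modification) to strengthen the physical awareness of VLA models. The architecture is simple and compatible with mainstream VLA designs.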
Abstract: Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
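[205] Align-cDAE: Alzheimer’s Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder cs.CV
Ayantika Das, Keerthi Ram, Mohanasankar Sivaprakasam
TL;DR: This paper proposes Align-cDAE, an attention-aligned conditional diffusion auto-encoder framework for modeling Alzheimer's disease (AD) progression. It introduces an explicit objective function that enforces alignment between multi-modal information (such as non-imaging attributes) and image features, and it assigns separate latent subspaces for integrating progression-related conditions and for retaining subject-specific identity information, giving more precise control over generated disease progression.
Details
Motivation: Existing diffusion-based methods for generating disease progression do not explicitly ensure that non-imaging conditioning information is meaningfully aligned with image features so as to introduce the desired changes, such as modulation of progression-specific regions, in the generated images. They also lack progression-relevant structure in the model's internal representations, which would allow more precise control over generation.
Result: Experiments show that enforcing alignment and better structuring the latent representational space of the diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer's disease progression.
Insight: The novelty is an explicit objective function that enforces cross-modal alignment, letting the model focus on regions that exhibit progression-related changes, together with a separate-latent-subspace mechanism that handles progression conditions and identity information independently for better-controlled generation. This offers a reusable structured latent space design for controllable medical image generation.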
Abstract: Generative AI framework-based modeling and prediction of longitudinal human brain images offer an efficient mechanism to track neurodegenerative progression, essential for the assessment of diseases like Alzheimer’s. Among the existing generative approaches, recent diffusion-based models have emerged as an effective alternative to generate disease progression images. Incorporating multi-modal and non-imaging attributes as conditional information into diffusion frameworks has been shown to improve controllability during such generations. However, existing methods do not explicitly ensure that information from non-imaging conditioning modalities is meaningfully aligned with image features to introduce desirable changes in the generated images, such as modulation of progression-specific regions. Further, more precise control over the generation process can be achieved by introducing progression-relevant structure into the internal representations of the model, lacking in the existing approaches. To address these limitations, we propose a diffusion autoencoder-based framework for disease progression modeling that explicitly enforces alignment between different modalities. The alignment is enforced by introducing an explicit objective function that enables the model to focus on the regions exhibiting progression-related changes. Further, we devise a mechanism to better structure the latent representational space of the diffusion auto-encoding framework. Specifically, we assign separate latent subspaces for integrating progression-related conditions and retaining subject-specific identity information, allowing better-controlled image generation. These results demonstrate that enforcing alignment and better structuring of the latent representational space of diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer’s disease progression.
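[206] SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis cs.CV | cs.AI
Chuqiao Wu, Jin Song, Yiyun Fei
TL;DR: This paper proposes SkeleGuide, a framework that uses explicit skeletal reasoning to fix the distorted limbs and unnatural poses that existing generative models produce when synthesizing humans into scenes. It jointly trains the reasoning and rendering stages, learns to generate an internal pose that serves as a structural prior, and introduces a PoseInverter module for fine-grained pose editing.
Details
Motivation: Existing generative models lack explicit reasoning over skeletal structure when synthesizing human images in scenes, which causes artifacts such as distorted limbs and unnatural poses. A method that explicitly models human skeletal structure is needed to improve the structural plausibility and realism of the synthesized images.
Result: Extensive experiments show that SkeleGuide significantly outperforms both specialized and general-purpose models at generating high-fidelity, context-aware human images, confirming that explicit skeletal modeling makes human image synthesis more robust.
Insight: The novelty lies in introducing explicit skeletal reasoning as a structural prior in the generation process and learning an internal pose representation through joint training of the reasoning and rendering stages. The PoseInverter module makes the latent pose decodable and editable, offering a new approach to controllable human synthesis.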
Abstract: Generating realistic and structurally plausible human images into existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.
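[207] InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning cs.CV
Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao
TL;DR: This paper proposes InterCoG, a novel text-vision interleaved chain-of-grounding reasoning framework for fine-grained image editing in complex multi-entity scenes. It edits non-salient targets precisely in three steps: spatial relation reasoning in text only, visual grounding (bounding boxes and masks), and rewriting of the editing description. To support the framework, the authors also build GroundEdit-45K, a dataset of 45K samples, and the GroundEdit-Bench evaluation benchmark.
Details
Motivation: Existing unified editing models still struggle with fine-grained editing in complex multi-entity scenes when the targets are not visually salient and require spatial reasoning.
Result: Extensive experiments confirm the method's superiority in achieving highly precise edits in spatially intricate, multi-entity scenes.
Insight: The core innovation is decomposing the editing task into a chain of text-only spatial reasoning, visual grounding, and description rewriting, together with two auxiliary training modules, multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment, which improve spatial localization accuracy and reasoning interpretability respectively. The large-scale dataset and evaluation benchmark are also important resources for this research direction.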
Abstract: Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
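[208] PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation cs.CV
Bo Ma, Jinsong Wu, Weiqi Yan, Catherine Shi, Minh Nguyen
TL;DR: This paper proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework for location-privacy protection in sequence videos. It injects calibrated perturbations into inferred location-sensitive background regions, defending against background-based retrieval attacks while minimizing the loss in detection performance.
Details
Motivation: Dashcam videos collected by autonomous or assisted-driving systems are often shared for safety auditing and model improvement. Even after GPS metadata are removed, an attacker can infer the recording location by matching background visual cues (such as buildings and road layouts) against large-scale street-view imagery, so location privacy needs protection.
Result: Experiments on public driving datasets show that, compared with baselines such as global noise, white-noise masking, and feature-based anonymization, PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics).
Insight: The innovations are: 1) a dynamic CRF that enforces temporal consistency across frames to discover and track location-sensitive regions; 2) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model; 3) a utility-preserving noise injection module that minimizes interference with object detection and segmentation. Viewed objectively, the method decouples privacy protection from the detection task and achieves fine-grained, region-level perturbation that balances privacy and utility.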
Abstract: Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference to object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization. The source code is in https://github.com/mabo1215/PPEDCRF.git
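The idea of perturbing only location-sensitive background while leaving foreground objects untouched can be shown with a minimal sketch. It is not the PPEDCRF pipeline: the sensitivity map and foreground mask are assumed to be given (in the paper they would come from the dynamic CRF and a detector), and the Gaussian noise scaled linearly by sensitivity is a simplification chosen for illustration.

```python
import random
from typing import List

Grid = List[List[float]]


def perturb_background(
    frame: Grid,
    sensitivity: Grid,
    foreground: List[List[bool]],
    sigma: float,
    rng: random.Random,
) -> Grid:
    """Add noise only to background pixels with positive sensitivity.

    frame       : grayscale intensities in [0, 255]
    sensitivity : per-pixel location sensitivity in [0, 1] (assumed input)
    foreground  : True where a detected object must be preserved
    sigma       : base noise standard deviation (hypothetical parameter)
    """
    out: Grid = []
    for row, s_row, fg_row in zip(frame, sensitivity, foreground):
        new_row = []
        for px, s, fg in zip(row, s_row, fg_row):
            if fg or s <= 0.0:
                new_row.append(px)                      # leave pixel untouched
            else:
                noisy = px + rng.gauss(0.0, sigma * s)  # stronger noise where more sensitive
                new_row.append(min(255.0, max(0.0, noisy)))
        out.append(new_row)
    return out
```

Keeping foreground pixels bit-identical is what lets a downstream detector see unchanged objects, while buildings or road layouts flagged as sensitive are degraded for a retrieval attacker.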
[209] YCDa: YCbCr Decoupled Attention for Real-time Realistic Camouflaged Object Detection cs.CV | cs.AIPDF
PeiHuang Zheng, Yunlong Zhao, Zheng Cui, Yang Li
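TL;DR: This paper proposes YCDa (YCbCr Decoupled Attention), a strategy inspired by how the human visual system shifts from chrominance to luminance perception in camouflaged environments. It decouples the chrominance and luminance information of the input image and dynamically allocates channel attention, amplifying discriminative cues and suppressing misleading color noise. The strategy is plug-and-play: it integrates into existing real-time detectors by replacing only the first downsampling layer.
Details
Motivation: Camouflaged object detection degrades when color cues are unreliable. The work draws on the biological mechanism by which human vision relies on luminance and texture in complex environments, aiming to make detectors more robust in visually confusing scenes.
Result: Validated on multiple baseline models, YCDa consistently improves performance with negligible overhead. On COD10K-D, YCDa-YOLO12s improves mAP by 112% over the baseline, and it sets new SOTA results for real-time camouflaged object detection on the COD-D datasets.
Insight: The novelty lies in embedding biologically inspired chrominance-luminance decoupling and dynamic attention into early feature processing. Channel-level dynamic attention weighting separates color noise from texture information, and the plug-and-play design offers a lightweight, efficient enhancement for real-time detection.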
Abstract: Human vision exhibits remarkable adaptability in perceiving objects under camouflage. When color cues become unreliable, the visual system instinctively shifts its reliance from chrominance (color) to luminance (brightness and texture), enabling more robust perception in visually confusing environments. Drawing inspiration from this biological mechanism, we propose YCDa, an efficient early-stage feature processing strategy that embeds this “chrominance-luminance decoupling and dynamic attention” principle into modern real-time detectors. Specifically, YCDa separates color and luminance information in the input stage and dynamically allocates attention across channels to amplify discriminative cues while suppressing misleading color noise. The strategy is plug-and-play and can be integrated into existing detectors by simply replacing the first downsampling layer. Extensive experiments on multiple baselines demonstrate that YCDa consistently improves performance with negligible overhead. Notably, YCDa-YOLO12s achieves a 112% improvement in mAP over the baseline on COD10K-D and sets new state-of-the-art results for real-time camouflaged object detection across COD-D datasets.
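To make the chrominance-luminance decoupling concrete, here is a minimal Python sketch. It is not the authors' implementation. The color conversion uses the standard full-range BT.601 (JFIF) formulas. The function `channel_weights` is a hypothetical, hand-written stand-in for the learned dynamic channel attention: it applies a softmax over per-channel log-variance, which is an assumption made purely for illustration.

```python
import math

def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601, JFIF form)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def decouple(image):
    """Split an image (rows of (r, g, b) tuples) into separate Y, Cb, Cr planes."""
    planes = ([], [], [])
    for row in image:
        converted = [rgb_to_ycbcr(*pixel) for pixel in row]
        for index, plane in enumerate(planes):
            plane.append([value[index] for value in converted])
    return planes

def variance(plane):
    """Population variance of all values in a 2D plane."""
    values = [v for row in plane for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def channel_weights(planes):
    """Hypothetical attention: softmax over log(1 + variance) of each channel."""
    scores = [math.log1p(variance(plane)) for plane in planes]
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A tiny 2x2 image with mostly greenish pixels.
image = [[(90, 140, 60), (95, 150, 62)],
         [(30, 60, 20), (200, 230, 180)]]
y_plane, cb_plane, cr_plane = decouple(image)
weights = channel_weights((y_plane, cb_plane, cr_plane))
print("Y/Cb/Cr attention weights:", weights)
```

In a real detector the weights would come from a small learned module inside the first downsampling layer rather than from a fixed statistic.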
[210] Sparse View Distractor-Free Gaussian Splatting cs.CVPDF
Yi Gu, Zhaorui Wang, Jiahang Cao, Jiaxu Wang, Mingle Zhao
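TL;DR: This paper proposes a framework that strengthens distractor-free 3D Gaussian Splatting (3DGS) under sparse views. By integrating priors from the geometry foundation model VGGT and from vision-language models (VLMs), it addresses the sharp performance drop of existing methods under sparse input and suppresses transient distractors in the sparse-view setting.
Details
Motivation: Existing distractor-free 3DGS methods perform well with dense image captures but degrade severely under sparse input, mainly because the color-residual heuristics they rely on become unreliable when observations are limited.
Result: Extensive experiments confirm the effectiveness and robustness of the method in mitigating transient distractors during sparse-view 3DGS training. The abstract does not mention specific benchmarks or quantitative comparisons with SOTA.
Insight: The novelty lies in integrating priors from the geometry foundation model VGGT (used for camera parameter estimation, initial 3D point generation, and semantic entity matching) and from vision-language models (used to identify and preserve large static regions in the scene) into existing distractor-free 3DGS methods, improving robustness under sparse views.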
Abstract: 3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the reliance on the color residual heuristics to guide the training, which becomes unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.
[211] What Helps – and What Hurts: Bidirectional Explanations for Vision Transformers cs.CV | cs.AI | cs.LGPDF
Qin Su, Tie Luo
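TL;DR: This paper proposes BiCAM, a bidirectional class activation mapping method for explaining the decisions of Vision Transformers (ViTs). BiCAM captures the positive contributions that support a prediction and also preserves the negative contributions that suppress it, yielding more complete and contrastive explanations. It further introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance and enables lightweight adversarial example detection. On ImageNet, VOC, and COCO, BiCAM improves localization accuracy and faithfulness while remaining computationally efficient, and it generalizes to ViT variants such as DeiT and Swin.
Details
Motivation: ViTs perform strongly in visual recognition, yet their decision process is hard to interpret. In particular, existing CAM-based methods discard negative signals, which leaves explanations incomplete.
Result: On the ImageNet, VOC, and COCO benchmarks, BiCAM improves localization and faithfulness while remaining computationally efficient, and it enables lightweight adversarial example detection without retraining.
Insight: The novelty lies in bidirectional attribution that captures both positive and negative contributions, and in the PNR for summarizing attribution balance and detecting adversarial inputs. The work underlines the importance of modeling both supportive and suppressive evidence when interpreting transformer-based vision models.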
Abstract: Vision Transformers (ViTs) achieve strong performance in visual recognition, yet their decision-making remains difficult to interpret. We propose BiCAM, a bidirectional class activation mapping method that captures both supportive (positive) and suppressive (negative) contributions to model predictions. Unlike prior CAM-based approaches that discard negative signals, BiCAM preserves signed attributions to produce more complete and contrastive explanations. BiCAM further introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance and enables lightweight detection of adversarial examples without retraining. Across ImageNet, VOC, and COCO, BiCAM improves localization and faithfulness while remaining computationally efficient. It generalizes to multiple ViT variants, including DeiT and Swin. These results suggest the importance of modeling both supportive and suppressive evidence for interpreting transformer-based vision models.
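The abstract does not give the formula for the Positive-to-Negative Ratio, so the following Python sketch rests on an assumed definition: total positive attribution mass divided by total negative attribution mass. The function names and the example attribution map are hypothetical and serve only to show how signed attributions can be split and summarized.

```python
def split_signed(attribution):
    """Split a signed 2D attribution map into supportive and suppressive maps."""
    supportive = [[max(v, 0.0) for v in row] for row in attribution]
    suppressive = [[max(-v, 0.0) for v in row] for row in attribution]
    return supportive, suppressive

def positive_to_negative_ratio(attribution, eps=1e-8):
    """Assumed PNR: sum of positive attributions over sum of |negative| ones."""
    supportive, suppressive = split_signed(attribution)
    positive_mass = sum(v for row in supportive for v in row)
    negative_mass = sum(v for row in suppressive for v in row)
    return positive_mass / (negative_mass + eps)  # eps avoids division by zero

# Hypothetical signed attribution map over a 2x2 grid of patches.
attribution = [[0.5, -0.25],
               [0.25, -0.25]]
supportive, suppressive = split_signed(attribution)
pnr = positive_to_negative_ratio(attribution)
print("PNR:", pnr)
```

Under this assumed definition, a detector could flag inputs whose PNR deviates strongly from values seen on clean data; the threshold would have to be calibrated empirically.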
[212] Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration cs.CV | cs.LGPDF
Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo
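TL;DR: This paper proposes Spectrum, a training-free method that treats the denoiser's latent features as functions of time and approximates them with Chebyshev polynomials, enabling global, long-range feature reuse that accelerates diffusion sampling. It fits the basis coefficients via ridge regression to forecast features at multiple future diffusion steps. In theory, it has more favorable long-horizon behavior and an error bound that does not compound with the step size.
Details
Motivation: Diffusion models have become the dominant tool for high-fidelity image and video generation, but their inference speed is limited by the many iterative passes of Diffusion Transformers. Existing feature caching and reuse methods rely only on local approximation, so errors accumulate rapidly with large skips and sample quality degrades.
Result: Extensive experiments on several state-of-the-art image and video diffusion models verify the method's superiority. Specifically, it achieves up to 4.79x speedup on FLUX.1 and 4.67x speedup on Wan2.1-14B while maintaining higher sample quality than the baselines.
Insight: The novelty lies in modeling features as functions of time and approximating them globally with Chebyshev polynomials, combined with ridge regression for stable forecasting. This yields a theoretical guarantee that the error bound does not compound with the step size, giving high-quality long-range feature reuse and substantial sampling acceleration without extra training.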
Abstract: Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79$\times$ speedup on FLUX.1 and 4.67$\times$ speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
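The forecasting idea in this abstract (fit Chebyshev coefficients by ridge regression on cached features, then evaluate the fit at future steps) can be sketched for a single scalar feature as follows. This is an illustrative toy in plain Python, not the paper's code. The time normalization to [-1, 1], the degree, the regularization strength `lam`, and the quadratic test signal are all assumptions.

```python
def chebyshev_basis(x, degree):
    """T_0..T_degree at x in [-1, 1], using T_{k+1} = 2x T_k - T_{k-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[:degree + 1]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ridge(xs, ys, degree, lam):
    """Ridge regression: solve (Phi^T Phi + lam I) c = Phi^T y."""
    phi = [chebyshev_basis(x, degree) for x in xs]
    k = degree + 1
    gram = [[sum(p[i] * p[j] for p in phi) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    moment = [sum(p[i] * y for p, y in zip(phi, ys)) for i in range(k)]
    return solve(gram, moment)

def forecast(coeffs, x):
    """Evaluate the fitted Chebyshev expansion at normalized time x."""
    basis = chebyshev_basis(x, len(coeffs) - 1)
    return sum(c * t for c, t in zip(coeffs, basis))

# Toy feature trajectory f(x) = x^2, observed at four early (normalized) steps.
observed_x = [-1.0, -0.6, -0.2, 0.2]
observed_f = [x * x for x in observed_x]
coeffs = fit_ridge(observed_x, observed_f, degree=2, lam=1e-8)
predicted = forecast(coeffs, 0.6)  # a later step whose network call is skipped
print("coefficients:", coeffs, "forecast at 0.6:", predicted)
```

For real latent features the same fit would be applied per feature dimension, and the regularization strength trades stability against fidelity.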
[213] DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving cs.CVPDF
Enhui Ma, Jiahuan Zhang, Guantian Zheng, Tao Tang, Shengbo Eben Li
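TL;DR: This paper proposes DriveCombo, a benchmark for evaluating how well multimodal large language models (MLLMs) understand and reason about compositional traffic rules in autonomous driving. Using a Five-Level Cognitive Ladder that runs from single-rule understanding to multi-rule integration and conflict resolution, it quantifies model performance systematically. The work also develops a Rule2Scene Agent that turns textual rules into dynamic driving scenes for visual reasoning. An evaluation of 14 mainstream MLLMs shows that performance drops as task complexity grows, especially under rule conflicts, while fine-tuning on the training set substantially improves both rule reasoning and downstream planning.
Details
Motivation: Existing benchmarks focus mainly on single-rule scenarios (such as traffic sign recognition) and ignore the concurrency and conflicts among multiple rules in real driving. As a result, models do well on simple tasks but tend to fail or violate rules in complex real-world situations. A new benchmark is needed to assess whether MLLMs truly understand and follow complex traffic rules.
Result: Evaluating 14 mainstream MLLMs on DriveCombo shows that performance drops significantly as task complexity increases, especially under rule conflicts. After splitting the dataset and fine-tuning on the training set, models improve substantially in both traffic rule reasoning and downstream planning.
Insight: The contributions are: 1) a systematic evaluation framework based on a Five-Level Cognitive Ladder, enabling quantitative assessment across cognitive stages; 2) a Rule2Scene Agent that maps language rules to dynamic driving scenes, supporting scene-level visual reasoning about traffic rules. Together they provide a benchmark and method for advancing compliant, intelligent autonomous driving systems.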
Abstract: Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in real world complex situations. To bridge this gap, we propose DriveCombo, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers’ cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.
[214] FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters cs.CVPDF
Shao Shitong, Gu Yufei, Xie Zeke
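TL;DR: This paper proposes FastLightGen, an algorithm that jointly distills model size and inference steps to turn large video generation models into fast, lightweight versions, achieving efficient video generation on the HunyuanVideo-ATI2V and WanX-TI2V benchmarks.
Details
Motivation: Existing video generation models (such as Hunyuan and WanX) carry excessive computational overhead because of their large parameter counts and multi-step iterative sampling at inference, which hinders practical deployment. Existing acceleration methods either reduce sampling steps or compress model size, and the potential of compressing both at once has not been explored.
Result: On the HunyuanVideo-ATI2V and WanX-TI2V benchmarks, a generator using 4-step sampling and 30% parameter pruning achieves the best visual quality under a constrained inference budget and consistently outperforms all competing methods, establishing a new SOTA in efficient video generation.
Insight: The novelty lies in a synergistic distillation framework that compresses model parameters and inference steps together, and in constructing an optimal teacher model that maximizes student performance, offering a new approach to efficient deployment of generative models.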
Abstract: The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.
[215] CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions cs.CVPDF
Gong Chen, Chaokun Zhang, Pengcheng Lv
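TL;DR: This paper proposes CoopDiff, a diffusion-based cooperative perception framework designed to cope with diverse and unpredictable real-world corruptions. It adopts a teacher-student paradigm. The Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, and uses a diffusion denoiser to generate clean supervision features. The Dual-Branch Diffusion Student encodes the ego and cooperative streams separately to reconstruct the teacher's clean targets, and uses an Ego-Guided Cross-Attention mechanism to fuse features adaptively under degradation for balanced decoding.
Details
Motivation: Diverse and unpredictable corruptions in real-world scenes (such as environmental and sensor-level distortions) undermine the robustness and generalization of cooperative perception, so a method that mitigates these corruptions effectively is needed.
Result: On two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn (each with six corruption types), CoopDiff outperforms prior methods on every corruption type, lowers the relative corruption error, and offers a tunable balance between precision and inference efficiency.
Insight: The contributions are: exploiting the inherent denoising properties of diffusion models to make cooperative perception more robust; using a teacher-student paradigm for feature purification and reconstruction; and introducing an Ego-Guided Cross-Attention mechanism that adaptively balances the fusion of ego and cooperative features under degradation.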
Abstract: Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher’s clean targets. Then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.
[216] MVR: Multi-view Video Reward Shaping for Reinforcement Learning cs.CV | cs.AI | cs.LGPDF
Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang
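TL;DR: This paper proposes Multi-View Video Reward Shaping (MVR), a framework for reward design based on vision-language models (VLMs) in reinforcement learning. Using videos captured from multiple viewpoints, it learns a state relevance function from video-text similarity, reducing the bias of static-image methods. It also introduces a state-dependent reward shaping formulation that automatically reduces the influence of VLM guidance once the desired motion pattern is achieved.
Details
Motivation: Existing methods usually add VLM scores linearly to the task reward, which can alter the optimal policy. They also rely on single static images, which struggle with tasks involving complex dynamic motion, and a single viewpoint can occlude critical behavioral information.
Result: Extensive experiments on humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld confirm the framework's efficacy, and ablation studies validate the design choices.
Insight: The novelty lies in using multi-view video to model state relevance and overcome static-image bias, and in a state-dependent reward shaping formulation that provides adaptive guidance. The approach is transferable to reinforcement learning tasks that need complex visual feedback.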
Abstract: Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent’s behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
[217] Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning cs.CV | cs.AIPDF
Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang
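TL;DR: This paper proposes Cross-modal Identity Mapping (CIM), a reinforcement learning framework that aims to reduce the information loss of large vision-language models (LVLMs) during image captioning. It evaluates caption quality through the similarity of images retrieved by text search, and optimizes the model under two metrics, Gallery Representation Consistency and Query-gallery Image Relevance, improving the model's ability to capture image details.
Details
Motivation: When generating image captions, LVLMs often omit or misrepresent key visual content, causing information loss. Because of the modality gap between visual content and text output, measuring this loss directly is challenging.
Result: On the COCO-LN500 benchmark, the method achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B, and it outperforms even Supervised Fine-Tuning.
Insight: The novelty lies in a reinforcement learning framework that needs no additional annotations, quantifies information loss through retrieval-based image similarity, and uses two new metrics to guide optimization. This achieves an "identity mapping" from images to captions and narrows the information gap in modality conversion.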
Abstract: Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The code will be released when the paper is accepted.
[218] Dual Distillation for Few-Shot Anomaly Detection cs.CVPDF
Le Dong, Qinzhong Tan, Chunlei Li, Jingliang Hu, Yilei Shi
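TL;DR: This paper proposes D$^2$4FAD, a novel dual distillation framework for few-shot anomaly detection in medical images. It uses a pre-trained encoder as a teacher network to extract multi-scale features from a small number of normal reference images (the support set) and from the images under test (the query set), while a student decoder performs knowledge distillation and self-distillation. It also introduces a learnable weighting mechanism that dynamically assesses the value of each support image conditioned on the query image. The method is evaluated on a comprehensive benchmark covering four organs, four imaging modalities, and five disease categories, where it significantly outperforms existing methods.
Details
Motivation: Current unsupervised anomaly detection methods need large amounts of normal training data and generalize poorly across anatomical contexts. This paper addresses the challenge of detecting anomalies in unseen medical imaging tasks using only a few normal reference images.
Result: Extensive experiments on a comprehensive medical anomaly detection benchmark of 13,084 images show that D$^2$4FAD significantly outperforms existing methods, reaching a new SOTA in few-shot medical anomaly detection.
Insight: The contributions are: 1) a dual distillation framework that combines knowledge distillation from the teacher network with self-distillation of the student on the support set; 2) a learnable weighting mechanism that dynamically adjusts the reference value of support images according to the query image. Viewed objectively, the method combines few-shot learning with anomaly detection effectively, and its distillation and weighting strategies improve generalization on medical images where data are scarce and domains are diverse.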
Abstract: Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D$^2$4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of each support image conditioned on the query, optimizing anomaly detection performance. To evaluate our method, we curate a comprehensive benchmark dataset comprising 13,084 images across four organs, four imaging modalities, and five disease categories. Extensive experiments demonstrate that D$^2$4FAD significantly outperforms existing approaches, establishing a new state-of-the-art in few-shot medical anomaly detection. Code is available at https://github.com/ttttqz/D24FAD.
[219] Preoperative-to-intraoperative Liver Registration for Laparoscopic Surgery via Latent-Grounded Correspondence Constraints cs.CVPDF
Ruize Cui, Jialun Pei, Haiqiao Wang, Jun Zhou, Jeremy Yuen-Chun Teoh
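TL;DR: This paper proposes Land-Reg, a correspondence-driven deformable registration framework for registering preoperative 3D models to intraoperative 2D views in laparoscopic liver surgery. It explicitly learns latent-grounded 2D-3D landmark correspondences as an interpretable intermediate representation that bridges cross-modal alignment. It designs a Cross-modal Latent Alignment module and an Uncertainty-enhanced Overlap Landmark Detector for rigid registration, and a shape-constrained supervision strategy for non-rigid registration, to improve robustness and interpretability.
Details
Motivation: Existing augmented reality registration methods for laparoscopic liver surgery lack explicit modeling of reliable 2D-3D geometric correspondences supported by latent evidence, which limits interpretability and can produce unstable alignment in clinical scenarios.
Result: Experiments on the P2ILF dataset show that the method is superior in both rigid pose estimation and non-rigid deformable registration.
Insight: The novelty lies in a registration framework that explicitly learns latent-grounded 2D-3D landmark correspondences as an interpretable intermediate representation, together with cross-modal latent alignment, uncertainty-enhanced landmark detection, and a shape-constrained supervision strategy that combines reprojection consistency with local-isometric regularization, addressing depth ambiguity in cross-modal registration and improving robustness.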
Abstract: In laparoscopic liver surgery, augmented reality technology enhances intraoperative anatomical guidance by overlaying 3D liver models from preoperative CT/MRI onto laparoscopic 2D views. However, existing registration methods lack explicit modeling of reliable 2D-3D geometric correspondences supported by latent evidence, leading to limited interpretability and potentially unstable alignment in clinical scenarios. In this work, we introduce Land-Reg, a correspondence-driven deformable registration framework that explicitly learns latent-grounded 2D-3D landmark correspondences as an interpretable intermediate representation to bridge cross-modal alignment. For rigid registration, Land-Reg embraces a Cross-modal Latent Alignment module to map multi-modal features into a unified latent space. Further, an Uncertainty-enhanced Overlap Landmark Detector with similarity matching is proposed to robustly estimate explicit 2D-3D landmark correspondences. For non-rigid registration, we design a novel shape-constrained supervision strategy that anchors shape deformation to matched landmarks through reprojection consistency and incorporates local-isometric regularization to alleviate inherent 2D-3D depth ambiguity, while a rendered-mask alignment enforces global shape consistency. Experimental results on the P2ILF dataset demonstrate the superiority of our method on both rigid pose estimation and non-rigid deformation. Our code will be available at https://github.com/cuiruize/Land-Reg.
[220] Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration cs.CVPDF
Guanglu Dong, Chunlei Li, Chao Ren, Jingliang Hu, Yilei Shi
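TL;DR: This paper proposes DATPRL-IR, the first multi-domain all-in-one image restoration method. Through Domain-Aware Task Prompt Representation Learning, a single model handles multiple restoration tasks across several image domains (such as natural scenes, medical imaging, and remote sensing). The method builds a task prompt pool and a domain prompt pool, uses a prompt composition mechanism to generate instance-level task and domain representations adaptively for each input image, and fuses them into a domain-aware task prompt representation that draws on both specific and shared knowledge across tasks and domains to guide restoration.
Details
Motivation: Existing all-in-one image restoration methods are usually confined to one image domain (such as natural scenes) and cannot handle restoration tasks across multiple domains. This paper extends all-in-one image restoration to multiple domains.
Result: Extensive experiments show that DATPRL-IR significantly outperforms existing SOTA methods on multiple image restoration tasks and generalizes strongly.
Insight: The novelty lies in the Domain-Aware Task Prompt Representation Learning framework. It builds and adaptively combines task and domain prompt pools and distills domain priors from multimodal large language models into the domain prompts, fusing cross-task and cross-domain knowledge for efficient multi-domain all-in-one restoration.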
Abstract: Recently, significant breakthroughs have been made in all-in-one image restoration (AiOIR), which can handle multiple restoration tasks with a single model. However, existing methods typically focus on a specific image domain, such as natural scene, medical imaging, or remote sensing. In this work, we aim to extend AiOIR to multiple domains and propose the first multi-domain all-in-one image restoration method, DATPRL-IR, based on our proposed Domain-Aware Task Prompt Representation Learning. Specifically, we first construct a task prompt pool containing multiple task prompts, in which task-related knowledge is implicitly encoded. For each input image, the model adaptively selects the most relevant task prompts and composes them into an instance-level task representation via a prompt composition mechanism (PCM). Furthermore, to endow the model with domain awareness, we introduce another domain prompt pool and distill domain priors from multimodal large language models into the domain prompts. PCM is utilized to combine the adaptively selected domain prompts into a domain representation for each input image. Finally, the two representations are fused to form a domain-aware task prompt representation which can make full use of both specific and shared knowledge across tasks and domains to guide the subsequent restoration process. Extensive experiments demonstrate that our DATPRL-IR significantly outperforms existing SOTA image restoration methods, while exhibiting strong generalization capabilities. Code is available at https://github.com/GuangluDong0728/DATPRL-IR.
[221] Action-Guided Attention for Video Action Anticipation cs.CVPDF
Tsung-Ming Tai, Sofia Casarin, Andrea Pilzer, Werner Nutt, Oswald Lanz
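TL;DR: This paper proposes Action-Guided Attention (AGA), a mechanism for video action anticipation. It uses predicted action sequences as queries and keys to guide sequence modeling, emphasizes relevant past moments based on the upcoming activity, and combines them with the current frame embedding through a gating function, so as to capture latent intentions better and improve generalization.
Details
Motivation: Existing Transformer-based methods rely on attention over pixel representations and lack the high-level semantics needed to model video sequences for action anticipation. Models therefore tend to overfit to explicit visual cues in past frames, struggle to capture latent intentions, and generalize poorly to unseen samples.
Result: On the widely adopted EPIC-Kitchens-100 benchmark, AGA generalizes well from the validation set to the unseen test set.
Insight: The novelty lies in the action-guided attention mechanism, which strengthens semantic modeling by using predicted action sequences as queries and keys and integrates information through a gating function. The method also supports post-training analysis of the action dependencies and counterfactual evidence captured by the model, giving transparent, interpretable insight into its predictions.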
Abstract: Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach fosters the attention module to emphasize relevant moments from the past based on the upcoming activity and combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
[222] NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation cs.CVPDF
Rong Fu, Yiqing Lyu, Chunlei Meng, Muge Qi, Yabin Jin
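TL;DR: This paper proposes NeuroSymb-MRG, a framework that combines neurosymbolic abductive reasoning with active uncertainty minimization to generate structured, clinically grounded radiology reports. The system maps image features to probabilistic clinical concepts, builds differentiable logic-based reasoning chains, decodes them into templated clauses, and refines the text through retrieval and constrained language-model editing. It targets the weaknesses of existing methods: visual-linguistic bias, factual inconsistency, and the lack of explicit multi-hop clinical reasoning.
Details
Motivation: Existing methods for automatic radiology report generation (such as encoder-decoder or retrieval-augmented pipelines) have improved fluency but remain vulnerable to visual-linguistic bias, suffer from factual inconsistency, and lack explicit multi-hop clinical reasoning.
Result: Experiments on standard benchmarks show consistent improvements in factual consistency and standard language metrics over representative baselines.
Insight: The novelty lies in unifying neurosymbolic abductive reasoning and active uncertainty minimization in one framework. Differentiable logic-based reasoning chains produce structured reports, and an active sampling loop driven by rule-level uncertainty and diversity guides clinician-in-the-loop adjudication and promptbook refinement, making clinical reasoning more explicit and improving factual consistency.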
Abstract: Automatic generation of radiology reports seeks to reduce clinician workload while improving documentation consistency. Existing methods that adopt encoder-decoder or retrieval-augmented pipelines achieve progress in fluency but remain vulnerable to visual-linguistic biases, factual inconsistency, and lack of explicit multi-hop clinical reasoning. We present NeuroSymb-MRG, a unified framework that integrates NeuroSymbolic abductive reasoning with active uncertainty minimization to produce structured, clinically grounded reports. The system maps image features to probabilistic clinical concepts, composes differentiable logic-based reasoning chains, decodes those chains into templated clauses, and refines the textual output via retrieval and constrained language-model editing. An active sampling loop driven by rule-level uncertainty and diversity guides clinician-in-the-loop adjudication and promptbook refinement. Experiments on standard benchmarks demonstrate consistent improvements in factual consistency and standard language metrics compared to representative baselines.
[223] StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models cs.CVPDF
Keli Liu, Zhendong Wang, Wengang Zhou, Houqiang Li
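TL;DR: This paper proposes StepVAR, a training-free, structure-texture guided pruning framework for visual autoregressive (VAR) models. By jointly considering structural and textural importance, it accelerates VAR inference while maintaining generation quality.
Details
Motivation: The inference cost of existing VAR models grows quadratically at high resolutions. The computationally intensive later scales mainly refine high-frequency textures and are spatially redundant. Existing pruning methods focus on high-frequency detection and often neglect structural coherence, which degrades global semantics.
Result: Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models show that StepVAR achieves substantial inference speedups while maintaining generation quality. Quantitative and qualitative evaluations both show that it outperforms existing acceleration methods.
Insight: The novelty lies in a dual-criterion design that pairs a lightweight high-pass filter, which captures local texture detail, with Principal Component Analysis (PCA), which preserves global structural information. A nearest neighbor feature propagation strategy then reconstructs dense feature maps from the pruned representation to keep next-scale prediction valid.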
Abstract: Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. This dual-criterion design enables the model to retain tokens critical for both fine-grained fidelity and overall composition. To maintain valid next-scale prediction under sparse tokens, we further introduce a nearest neighbor feature propagation strategy to reconstruct dense feature maps from pruned representations. Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models demonstrate that StepVAR achieves substantial inference speedups while maintaining generation quality. Quantitative and qualitative evaluations consistently show that our method outperforms existing acceleration approaches, validating its effectiveness and general applicability across diverse VAR architectures.
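Two of the steps described here, scoring tokens with a high-pass filter and rebuilding a dense map from the kept tokens by nearest neighbor propagation, can be illustrated on a toy grid of scalar features. This is a simplified sketch and not the paper's method: the 4-neighbor Laplacian filter, the top-k selection, and the scalar features are assumptions, and the PCA-based structural criterion is omitted for brevity.

```python
def high_pass_score(grid):
    """Absolute response of a 4-neighbour Laplacian with edge replication."""
    height, width = len(grid), len(grid[0])

    def at(r, c):
        # Clamp indices so border cells reuse their own edge values.
        return grid[min(max(r, 0), height - 1)][min(max(c, 0), width - 1)]

    return [[abs(4 * at(r, c) - at(r - 1, c) - at(r + 1, c)
                 - at(r, c - 1) - at(r, c + 1))
             for c in range(width)] for r in range(height)]

def select_tokens(scores, keep):
    """Return the `keep` grid positions with the highest scores."""
    flat = [(scores[r][c], r, c)
            for r in range(len(scores)) for c in range(len(scores[0]))]
    flat.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(r, c) for _, r, c in flat[:keep]]

def propagate_nearest(shape, kept):
    """Fill every grid position with the feature of its nearest kept token."""
    height, width = shape
    dense = []
    for r in range(height):
        row = []
        for c in range(width):
            nearest = min(kept, key=lambda p: ((p[0] - r) ** 2 + (p[1] - c) ** 2, p))
            row.append(kept[nearest])
        dense.append(row)
    return dense

# Toy 4x4 feature map: flat background with one high-frequency detail.
grid = [[1, 1, 1, 1],
        [1, 1, 5, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]
scores = high_pass_score(grid)
positions = select_tokens(scores, keep=3)
kept = {(r, c): grid[r][c] for r, c in positions}  # only these would be processed
dense = propagate_nearest((4, 4), kept)
print("kept positions:", positions)
```

Only the kept tokens would pass through the expensive layers; the propagated map restores the dense shape that next-scale prediction expects.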
[224] Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining cs.CVPDF
Yuxuan Li, Yuming Chen, Yunheng Li, Ming-Ming Cheng, Xiang Li
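TL;DR: This paper proposes BabelRS, a language-pivoted pretraining framework that unifies heterogeneous multi-modal remote sensing object detection. Concept-Shared Instruction Aligning (CSIA) aligns different sensor modalities (such as RGB, SAR, and infrared) to shared linguistic concepts, and Layerwise Visual-Semantic Annealing (LVSA) aggregates multi-scale visual features to provide fine-grained semantic guidance, decoupling modality alignment from downstream task learning.
Details
Motivation: Most existing methods for heterogeneous multi-modal remote sensing object detection adopt a late alignment paradigm, in which modality alignment and task optimization are entangled during downstream fine-tuning, causing unstable training and suboptimal generalization. This paper addresses these problems with a language-pivoted pretraining framework.
Result: Extensive experiments show that BabelRS stabilizes training and consistently outperforms state-of-the-art methods on multiple benchmarks without complex tricks.
Insight: The novelty lies in using language as a semantic pivot to bridge heterogeneous visual representations, and in the CSIA and LVSA components that decouple modality alignment from task learning, improving training stability and generalization.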
Abstract: Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.
[225] Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation cs.CVPDF
Minseok Seo, Wonjun Lee, Jaehyuk Jang, Changick Kim
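TL;DR: This paper proposes an efficient test-time optimization method for depth completion based on low-rank decoder adaptation. The method updates only a low-dimensional subspace of the decoder, adapting it with sparse depth supervision, and markedly improves computational efficiency while maintaining high accuracy.
Details
Motivation: Existing zero-shot depth completion methods rely on diffusion-based test-time optimization, which is computationally expensive; visual-prompt-based methods lower training cost but remain slow at inference. The paper targets this inefficiency of test-time optimization.
Result: Experiments on five indoor and outdoor datasets show that the method establishes a new Pareto frontier between accuracy and efficiency and achieves state-of-the-art performance.
Insight: The key finding is that depth foundation models concentrate depth-relevant information in a low-dimensional decoder subspace, so adapting the decoder alone suffices for effective test-time optimization, which suggests a new route to lightweight adaptation.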
Abstract: Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward–backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.
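The idea of updating only a low-dimensional part of the decoder from sparse depth can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the "decoder" is a tiny linear map v^T W x, the low-dimensional update is a LoRA-style rank-1 term b a^T, and the names `predict`, `adapt`, and the sample values are hypothetical. Only `a` and `b` are trained; `W` and `v` stay frozen.

```python
def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))

def predict(x, W, v, a, b):
    """Depth for feature x from a linear decoder: v^T (W + b a^T) x."""
    frozen = dot(v, [dot(row, x) for row in W])  # frozen decoder output
    update = dot(v, b) * dot(a, x)               # rank-1 correction (hypothetical)
    return frozen + update

def mse(samples, W, v, a, b):
    return sum((predict(x, W, v, a, b) - d) ** 2 for x, d in samples) / len(samples)

def adapt(samples, W, v, steps=300, lr=0.05):
    """Gradient descent on half the mean squared error over sparse depth samples."""
    a = [0.5] * len(W[0])  # nonzero factor
    b = [0.0] * len(W)     # zero factor, so the initial update is exactly zero
    n = len(samples)
    for _ in range(steps):
        s = dot(v, b)
        grad_a = [0.0] * len(a)
        grad_b = [0.0] * len(b)
        for x, d in samples:
            err = predict(x, W, v, a, b) - d
            ax = dot(a, x)
            for j in range(len(b)):
                grad_b[j] += err * v[j] * ax / n
            for k in range(len(a)):
                grad_a[k] += err * s * x[k] / n
        a = [ak - lr * g for ak, g in zip(a, grad_a)]
        b = [bj - lr * g for bj, g in zip(b, grad_b)]
    return a, b

# Frozen toy decoder and four sparse (feature, depth) measurements, all made up.
W = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0), ([2.0, 1.0], 5.0)]

before = mse(samples, W, v, [0.5, 0.5], [0.0, 0.0])
a, b = adapt(samples, W, v)
after = mse(samples, W, v, a, b)
```

The design point this is meant to convey is that each step only touches the small factors, so no gradient has to flow through the frozen weights themselves.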
[226] Downstream Task Inspired Underwater Image Enhancement: A Perception-Aware Study from Dataset Construction to Network Design cs.CV | eess.IV
Bosen Lin, Feng Gao, Yanwei Yu, Junyu Dong, Qian Du
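TL;DR: This paper proposes a downstream-task-inspired, perception-aware underwater image enhancement framework (DTI-UIE) that improves preprocessed images to boost underwater vision tasks such as semantic segmentation and object detection. The framework comprises an efficient two-branch network, a task-aware attention module, a multi-stage training strategy, and a task-driven perceptual loss, and it automatically constructs a task-inspired dataset (TI-UIED).
Details
Motivation: Existing underwater image enhancement methods mainly aim to improve human visual perception and often neglect reconstructing the high-frequency details that matter for downstream recognition, so the preprocessed images yield limited task gains.
Result: Experiments show that DTI-UIE significantly improves performance on downstream tasks including semantic segmentation, object detection, and instance segmentation, producing preprocessed images that better serve these tasks.
Insight: The novelty lies in making downstream task performance a direct optimization target for image enhancement and in using a human visual perception model together with task-specific networks to construct the dataset automatically, giving perception-aware optimization from dataset construction through network design.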
Abstract: In real underwater environments, downstream image recognition tasks such as semantic segmentation and object detection often face challenges posed by problems like blurring and color inconsistencies. Underwater image enhancement (UIE) has emerged as a promising preprocessing approach, aiming to improve the recognizability of targets in underwater images. However, most existing UIE methods mainly focus on enhancing images for human visual perception, frequently failing to reconstruct high-frequency details that are critical for task-specific recognition. To address this issue, we propose a Downstream Task-Inspired Underwater Image Enhancement (DTI-UIE) framework, which leverages a human visual perception model to enhance images effectively for underwater vision tasks. Specifically, we design an efficient two-branch network with a task-aware attention module for feature mixing. The network benefits from a multi-stage training framework and a task-driven perceptual loss. Additionally, inspired by human perception, we automatically construct a Task-Inspired UIE Dataset (TI-UIED) using various task-specific networks. Experimental results demonstrate that DTI-UIE significantly improves task performance by generating preprocessed images that are beneficial for downstream tasks such as semantic segmentation, object detection, and instance segmentation. The code is publicly available at https://github.com/oucailab/DTIUIE.
[227] Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments cs.CV | cs.AI
Dragos Costea, Alina Marcu, Cristina Lazar, Marius Leordeanu
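TL;DR: This paper studies how AI-generated data differs from human-generated data in statistical fidelity in the context of non-verbal full-body motion, and proposes the first framework that generates natural non-verbal human-AI interaction in real time from 2D body keypoints.
Details
Motivation: The paper asks whether current generative models go beyond surface mimicry and can take part in the silent but expressive dialogue of body language, addressing the naturalness of AI in real-time non-verbal interaction.
Result: Four lightweight architectures run at up to 100 FPS on an NVIDIA Orin Nano. Training uses 437 human video clips, and pretraining on synthetic sequences significantly reduces motion error without sacrificing speed. When evaluated on clips from cutting-edge text-to-video systems such as SORA and VEO, performance drops on SORA-generated clips but degrades less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance.
Insight: The novelty lies in introducing a real-time non-verbal human-AI interaction framework and in showing that AI and human motion remain statistically distinguishable, with temporal coherence as the key factor. This offers a new perspective on lightweight real-time interaction systems and on the evaluation of generative models.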
Abstract: We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically-generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.
[228] Neural Operator-Grounded Continuous Tensor Function Representation and Its Applications cs.CV | math.NA
Ruoyang Su, Xi-Le Zhao, Sheng Liu, Wei-Hao Wu, Yisi Luo
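TL;DR: This paper proposes a neural operator-grounded continuous tensor function representation (NO-CTR), which replaces the traditional discrete, linear mode-n product with continuous, nonlinear mode-n operators so that complex real-world data can be represented more faithfully, and applies it to multi-dimensional data completion.
Details
Motivation: Existing continuous tensor function representations are limited by the discrete, linear mode-n product and cannot reach their full potential, so a continuous, nonlinear alternative is needed to represent data more accurately.
Result: Experiments on a variety of data (multi-spectral images, color videos, Sentinel-2 images at different resolutions, and point clouds) show that NO-CTR is superior on data completion tasks.
Insight: The novelty lies in realizing continuous, nonlinear mode-n operators with neural operators, which removes the bottleneck of the traditional discrete linear representation. The paper proves that NO-CTR can approximate any continuous tensor function, giving a more flexible and powerful framework for multi-dimensional data representation.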
Abstract: Recently, continuous tensor functions have attracted increasing attention, because they can represent data both on mesh grids and beyond mesh grids in a unified way. However, since the mode-$n$ product is essentially discrete and linear, the potential of current continuous tensor function representations is still locked. To break this bottleneck, we suggest neural operator-grounded mode-$n$ operators as a continuous and nonlinear alternative to the discrete and linear mode-$n$ product. Instead of mapping the discrete core tensor to the discrete target tensor, the proposed mode-$n$ operator directly maps the continuous core tensor function to the continuous target tensor function, which provides a genuine continuous representation of real-world data and can ameliorate discretization artifacts. Empowered with continuous and nonlinear mode-$n$ operators, we propose a neural operator-grounded continuous tensor function representation (abbreviated as NO-CTR), which can more faithfully represent complex real-world data compared with classic discrete tensor representations and continuous tensor function representations. Theoretically, we also prove that any continuous tensor function can be approximated by NO-CTR. To examine the capability of NO-CTR, we suggest an NO-CTR-based multi-dimensional data completion model. Extensive experiments across various data on regular mesh grids (multi-spectral images and color videos), on mesh grids with different resolutions (Sentinel-2 images) and beyond mesh grids (point clouds) demonstrate the superiority of NO-CTR.
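For reference, the classic mode-$n$ product that the abstract calls discrete and linear is usually written as below. This is the standard textbook definition, not a formula taken from the paper; the symbols $\mathcal{X}$, $\mathbf{U}$, $I_k$, and $J$ are notation assumed here.

```latex
\[
(\mathcal{X} \times_n \mathbf{U})_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N}
  = \sum_{i_n = 1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{j i_n},
\qquad
\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N},\quad
\mathbf{U} \in \mathbb{R}^{J \times I_n}.
\]
```

The sum runs over a finite index set and is linear in $\mathcal{X}$, which is the sense in which the operation is "discrete and linear"; the abstract's mode-$n$ operator is described as acting on functions instead of index arrays.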
[229] Affine Correspondences in Stereo Vision: Theory, Practice, and Limitations cs.CV
Levente Hajder
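TL;DR: This paper systematically studies the use of affine transformations in stereo vision, covering theory, practical methods, and limitations. It first reviews the fundamentals of affine transformations and epipolar geometry, then analyzes how transformation accuracy affects 3D reconstruction quality, and proposes new methods that estimate local affine transformations from corresponding image directions, with the fundamental matrix available to improve the estimate. Quantitative evaluation on synthetic and real data (using a special object with three perpendicular chessboard planes), scored by the accuracy of the reconstructed surface normals, shows an estimation accuracy of a few degrees in realistic test cases; special stereo poses and plane orientations are analyzed in detail.
Details
Motivation: The paper addresses how to use affine transformations effectively in stereo vision to improve the accuracy and robustness of 3D reconstruction (e.g., estimating surface normals, homographies, and fundamental and essential matrices), and examines the limitations that arise in practice.
Result: Quantitative evaluations on synthetic and real data, based on the accuracy of reconstructed surface normals, show an estimation accuracy of around a few degrees in realistic test cases. The effects of special stereo poses and plane orientations are also evaluated in detail.
Insight: The novelty lies in new techniques for estimating local affine transformations from corresponding image directions, combined with fundamental-matrix information. Viewed objectively, the systematic evaluation of how transformation accuracy affects 3D reconstruction provides useful theoretical guidance and practical validation for applying affine transformations in stereo vision, particularly for complex scenes and poses.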
Abstract: Affine transformations have been recently used for stereo vision. They can be exploited in various computer vision applications, e.g., when estimating surface normals, homographies, fundamental and essential matrices. Even full 3D reconstruction can be obtained by using affine correspondences. First, this paper overviews the fundamental statements for affine transformations and epipolar geometry. It then investigates how the transformation accuracy influences the quality of the 3D reconstruction. Besides, we propose novel techniques for estimating the local affine transformation from corresponding image directions; moreover, the fundamental matrix, related to the processed image pair, can also be exploited. Quantitative evaluations on both synthetic and real data are based on the accuracy of the reconstructed surface normals. For the real data, a special object containing three perpendicular planes with chessboard patterns is constructed. It is concluded that the estimation accuracy is around a few degrees for realistic test cases. Special stereo poses and plane orientations are also evaluated in detail.
[230] LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization cs.CV | cs.RO
Kuangyi Chen, Jun Zhang, Yuxi Hu, Yi Zhou, Friedrich Fraundorfer
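TL;DR: This paper proposes LEAR, a framework that jointly learns edge structures and dense event-depth flow fields to localize an event camera against LiDAR point clouds in GPS-denied and visually degraded environments. It uses a cross-modal fusion mechanism and an iterative refinement strategy to strengthen modality-invariant geometric cues and achieve more robust pose estimation.
Details
Motivation: Event cameras offer high temporal resolution under highly dynamic and low-light conditions, but aligning sparse, asynchronous events with dense LiDAR maps suffers from a modality gap that makes direct correspondence estimation difficult. The paper targets this cross-modal localization challenge.
Result: On several popular and challenging datasets, LEAR outperforms the best prior method, recovering more accurate and robust poses via a PnP solver.
Insight: The novelty lies in coupling edge estimation with flow estimation: cross-modal fusion injects geometric cues, and iterative refinement enforces consistency between the two tasks, yielding edge-aware, depth-aligned flow fields that make cross-modal localization more robust.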
Abstract: Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.
[231] FireRed-OCR Technical Report cs.CV | eess.IV
Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun
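TL;DR: FireRed-OCR is a systematic framework for specializing general-purpose vision-language models (VLMs) into high-performance OCR models. It builds a "Geometry + Semantics" data factory to generate high-quality structured data and uses a three-stage progressive training strategy that takes the model from pixel-level perception to logical structure generation. It achieves a state-of-the-art overall score of 92.94% on the OmniDocBench v1.5 benchmark.
Details
Motivation: General-purpose VLMs often exhibit "structural hallucination" when processing complex documents, which limits their usefulness in industrial OCR applications. The paper aims to solve this by turning a general VLM into a pixel-precise structured document parsing expert.
Result: In extensive evaluation on OmniDocBench v1.5, FireRed-OCR achieves an overall score of 92.94% and significantly outperforms strong baselines such as DeepSeek-OCR 2 and OCRVerse on text, formula, table, and reading-order metrics, reaching state-of-the-art performance.
Insight: The contributions include: 1) a "Geometry + Semantics" data factory that uses geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types; 2) a three-stage progressive training strategy consisting of multi-task pre-alignment, specialized SFT, and format-constrained Group Relative Policy Optimization (GRPO), which uses reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax).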
Abstract: We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model’s understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
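A rough sketch of how a format-constrained reward can feed GRPO-style group-relative advantages is given below. The reward rule (`format_reward`, which only checks table-tag and `$` balance) and the candidate strings are hypothetical stand-ins, since the abstract does not specify the actual reward; the advantage normalization (reward minus group mean, divided by group standard deviation) follows the commonly described GRPO formulation.

```python
import statistics

def format_reward(markdown: str) -> float:
    """Toy structural check (hypothetical): tables are closed and '$' delimiters are balanced."""
    tables_closed = markdown.count("<table>") == markdown.count("</table>")
    math_balanced = markdown.count("$") % 2 == 0
    return 1.0 if tables_closed and math_balanced else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled outputs for one page (made-up strings).
candidates = [
    "<table><tr><td>1</td></tr></table>",
    "<table><tr><td>1</td></tr>",
    "$x^2$ ok",
    "$x^2 broken",
]
rewards = [format_reward(c) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed within the group, no separate value network is needed; structurally valid outputs receive positive advantage only relative to their invalid siblings.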
[232] Generative Visual Chain-of-Thought for Image Editing cs.CV
Zijin Yin, Tiankai Hang, Yiji Cheng, Shiyi Zhang, Runze He
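TL;DR: This paper proposes Generative Visual Chain-of-Thought (GVCoT), a unified framework for image editing under complex scenes and fine-grained spatial instructions. The method first generates spatial cues to localize the target region and then performs the edit, jointly optimizing visual reasoning and editing end to end.
Details
Motivation: Existing image editing methods struggle to perceive where to edit under complex scenes and subtle spatial instructions. GVCoT aims to improve spatial localization and editing ability through a visual reasoning chain.
Result: On the SREdit-Bench and ImgEdit benchmarks, GVCoT consistently outperforms state-of-the-art models.
Insight: The contributions include: an end-to-end framework that jointly optimizes visual reasoning and editing; the GVCoT-Edit-Instruct dataset of 1.8 million high-quality samples; a progressive training strategy (supervised fine-tuning followed by reinforcement learning); and a new benchmark, SREdit-Bench, for comprehensively evaluating editing in complex scenes.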
Abstract: Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.
[233] Zero-shot Low-Field MRI Enhancement via Diffusion-Based Adaptive Contrast Transport cs.CV
Muyu Liu, Chenhe Du, Xuanyu Tian, Qing Wu, Xiao Wang
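TL;DR: This paper proposes DACT (Diffusion-Based Adaptive Contrast Transport), a zero-shot framework that reconstructs high-field (HF) quality images from low-field (LF) MRI data without paired supervision. It combines a pre-trained HF diffusion prior with a physically informed adaptive forward model, using a differentiable Sinkhorn optimal transport module to explicitly model and correct the intensity distribution shift between the LF and HF domains during reverse diffusion, so that realistic tissue contrast is recovered while anatomical fidelity is preserved.
Details
Motivation: Low-field MRI broadens access to diagnostic imaging, but image quality is limited by its inherently low signal-to-noise ratio and by significant tissue contrast distortion caused by field-dependent relaxation dynamics. Reconstructing HF-quality images from LF data is a blind inverse problem, severely challenged by scarce paired training data and an unknown, non-linear contrast transformation operator. Existing zero-shot methods usually assume simplified linear degradation and struggle to recover authentic tissue contrast.
Result: Extensive experiments on simulated and real clinical LF datasets show that DACT achieves state-of-the-art performance, with reconstructions that have superior structural detail and correct tissue contrast.
Insight: The novelty lies in combining a pre-trained HF diffusion prior with a physically informed adaptive forward model and introducing a differentiable Sinkhorn optimal transport module that explicitly models and corrects the cross-domain intensity distribution shift. This lets the framework learn the otherwise intractable contrast mapping dynamically during reverse diffusion while preserving topological consistency, offering a new approach to blind inverse reconstruction without paired data.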
Abstract: Low-field (LF) magnetic resonance imaging (MRI) democratizes access to diagnostic imaging but is fundamentally limited by low signal-to-noise ratio and significant tissue contrast distortion due to field-dependent relaxation dynamics. Reconstructing high-field (HF) quality images from LF data is a blind inverse problem, severely challenged by the scarcity of paired training data and the unknown, non-linear contrast transformation operator. Existing zero-shot methods, which assume simplified linear degradation, often fail to recover authentic tissue contrast. In this paper, we propose DACT (Diffusion-Based Adaptive Contrast Transport), a novel zero-shot framework that restores HF-quality images without paired supervision. DACT synergizes a pre-trained HF diffusion prior to ensure anatomical fidelity with a physically-informed adaptive forward model. Specifically, we introduce a differentiable Sinkhorn optimal transport module that explicitly models and corrects the intensity distribution shift between LF and HF domains during the reverse diffusion process. This allows the framework to dynamically learn the intractable contrast mapping while preserving topological consistency. Extensive experiments on simulated and real clinical LF datasets demonstrate that DACT achieves state-of-the-art performance, yielding reconstructions with superior structural detail and correct tissue contrast.
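To make the Sinkhorn optimal transport step more concrete, here is a minimal sketch of entropic optimal transport between two intensity histograms. It is the generic Sinkhorn iteration, not the authors' module: the histograms, the squared-difference cost, the regularization value, and the barycentric "contrast mapping" at the end are all assumptions chosen for illustration.

```python
import math

def sinkhorn(a, b, cost, reg=0.2, iters=200):
    """Entropic OT plan between histograms a and b via alternating scaling."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows, then columns, to match the target marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Three intensity levels with made-up LF and HF histograms.
levels = [0.0, 0.5, 1.0]
lf_hist = [0.5, 0.3, 0.2]
hf_hist = [0.2, 0.3, 0.5]
cost = [[(x - y) ** 2 for y in levels] for x in levels]

plan = sinkhorn(lf_hist, hf_hist, cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(len(levels))) for j in range(len(levels))]

# Barycentric projection: where each LF intensity level is sent on average.
mapped = [sum(p * y for p, y in zip(plan[i], levels)) / row_sums[i]
          for i in range(len(levels))]
```

Every operation in the loop is a smooth function of its inputs, which is why Sinkhorn iterations are a common choice when the transport step has to sit inside a differentiable pipeline.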
[234] LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving cs.CV
Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang
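TL;DR: LaST-VLA is a vision-language-action model for autonomous driving that shifts the reasoning paradigm from discrete symbolic processing to a physically grounded latent spatio-temporal chain of thought. A dual-feature alignment mechanism distills geometric constraints from 3D foundation models and dynamic foresight from world models, and training combines progressive supervised fine-tuning with reinforcement learning. The model achieves state-of-the-art results on multiple benchmarks.
Details
Motivation: Existing VLA models rely on explicit textual chain of thought, which causes semantic-perceptual decoupling and perceptual-symbolic conflicts, while standard latent chain of thought lacks physical constraints. The paper addresses both limitations.
Result: The model sets new records on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS) and shows strong spatio-temporal reasoning on the SURDS and NuDynamics benchmarks.
Insight: The paper proposes a latent spatio-temporal chain-of-thought paradigm that injects physical geometric constraints and dynamic foresight directly into the latent space, together with a progressive training strategy that moves from feature alignment to trajectory generation and is combined with GRPO reinforcement learning, targeting safer and more rule-compliant driving decisions.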
Abstract: While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. This is coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks.
[235] physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection cs.CV | cs.AI
Yuting Wan, Liguo Sun, Jiuwu Hao, Zao Zhang, Pin LV
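TL;DR: This paper proposes PhysFusion, a Transformer-based dual-stream radar and vision fusion framework for detecting objects on open water surfaces. It comprises a physics-informed radar encoder, a radar-guided interactive fusion module, and a temporal query aggregation module, and is designed to make effective use of sparse, intermittent maritime radar point clouds together with visual information in the presence of wave clutter, specular reflections, and similar challenges.
Details
Motivation: For unmanned surface vehicles (USVs), detecting water-surface targets at long range is difficult because of wave clutter, specular reflections, and weak appearance cues, and conventional fusion methods struggle to exploit maritime radar point clouds that are sparse, intermittent, and highly variable in reflectivity.
Result: Experiments on the WaterScenes and FLOW datasets show that PhysFusion reaches 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (with a 5-frame radar history), and 94.8% mAP50 and 46.2% mAP50:95 on FLOW (radar + camera setting), with 5.6M parameters and 12.5G FLOPs.
Insight: The contributions include: 1) a Physics-Informed Radar Encoder (PIR Encoder) that turns per-point radar attributes into scattering priors and predicts point-wise reliability; 2) a dual-stream backbone consisting of a Transformer-based global stream using Scattering-Aware Self-Attention (SASA) and a point-based local stream; 3) a Radar-guided Interactive Fusion Module (RIFM) that performs query-level radar-image fusion; 4) a Temporal Query Aggregation (TQA) module that aggregates frame-wise fused queries over a short window to obtain temporally consistent representations. Together these make the use of radar cues more robust in complex maritime environments.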
Abstract: Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.
[236] PreSight: Preoperative Outcome Prediction for Parkinson’s Disease via Region-Prior Morphometry and Patient-Specific Weighting cs.CV
Yand Wang, Chen Zhang, Lanyun Zhu, Yixin Chen, Qunbo Wang
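TL;DR: This paper presents PreSight, a model that predicts, before Parkinson's disease surgery, the rate of postoperative motor improvement. It fuses clinical priors, preoperative MRI, and deformation-based morphometry (DBM), and adjusts regional importance through a patient-specific weighting module to produce end-to-end, well-calibrated, decision-ready predictions.
Details
Motivation: Predicting the improvement rate before Parkinson's disease surgery is clinically important yet very difficult, because imaging signals are subtle and patients are heterogeneous. The paper aims to predict patient-specific postoperative motor benefit using only information available before surgery.
Result: On a real-world two-center cohort of 400 subjects, PreSight reaches 88.89% accuracy on internal validation and 85.29% on an external-center test for responder classification, outperforming clinical, imaging-only, and multimodal baselines, with better probability calibration and higher decision-curve net benefit.
Insight: The main novelty is combining clinical prior knowledge with region-adaptive morphometry (realized through DBM and the patient-specific weighting module), so that the model emphasizes disease-relevant regions in a patient-specific way and supports reliable preoperative decisions.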
Abstract: Preoperative improvement rate prediction for Parkinson’s disease surgery is clinically important yet difficult because imaging signals are subtle and patients are heterogeneous. We address this setting, where only information available before surgery is used, and the goal is to predict patient-specific postoperative motor benefit. We present PreSight, a presurgical outcome model that fuses clinical priors with preoperative MRI and deformation-based morphometry (DBM) and adapts regional importance through a patient-specific weighting module. The model produces end-to-end, calibrated, decision-ready predictions with patient-level explanations. We evaluate PreSight on a real-world two-center cohort of 400 subjects with multimodal presurgical inputs and postoperative improvement labels. PreSight outperforms strong clinical, imaging-only, and multimodal baselines. It attains 88.89% accuracy on internal validation and 85.29% on an external-center test for responder classification and shows better probability calibration and higher decision-curve net benefit. Ablations and analyses confirm the contribution of DBM and the patient-specific weighting module and indicate that the model emphasizes disease-relevant regions in a patient-specific manner. These results demonstrate that integrating clinical prior knowledge with region-adaptive morphometry enables reliable presurgical decision support in routine practice.
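The abstract reports "decision-curve net benefit" without defining it. The sketch below uses the standard decision curve analysis formula, net benefit = TP/N - (FP/N) * p_t / (1 - p_t), at a threshold probability p_t. The function name and the four example predictions are hypothetical, and nothing here reproduces the paper's numbers.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of treating everyone whose predicted probability is >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    # False positives are weighted by the odds implied by the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)

# Made-up responder probabilities and true labels (1 = responder).
probs = [0.9, 0.8, 0.3, 0.6]
labels = [1, 1, 0, 0]

model_nb = net_benefit(probs, labels, threshold=0.5)
treat_all_nb = net_benefit([1.0] * len(labels), labels, threshold=0.5)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero by definition.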
[237] Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection cs.CV
Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu
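TL;DR: This paper proposes REFORM, a framework that shifts learning from outcome fitting to process modeling and argues that forensic reasoning is central to generalizable multimodal manipulation detection. It uses a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm the authors build ROM, a large-scale dataset. Experiments show that REFORM achieves state-of-the-art results on multiple benchmarks with superior generalization.
Details
Motivation: Existing manipulation detection methods mainly classify manipulation types under outcome-oriented supervision; they lack interpretability, tend to overfit superficial artifacts, and generalize poorly to unseen manipulation patterns. The paper pushes detection beyond classifying a limited set of types toward explicit forensic reasoning in order to generalize better.
Result: REFORM achieves 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench, setting new state-of-the-art results on each and showing superior generalization.
Insight: The core idea is the "process over outcome" paradigm shift: modeling an explicit forensic reasoning process improves both generalization and interpretability. The reasoning-driven framework (REFORM), its three-stage curriculum, and the accompanying large-scale dataset with reasoning annotations (ROM) offer a methodology for building more robust, generalizable multimodal manipulation detectors. Viewed objectively, using reinforcement learning to refine the logical consistency of reasoning is a technique worth borrowing.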
Abstract: Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
[238] MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising cs.CV | cs.AI
Peiyuan Jing, Chun-Wun Cheng, Liutao Yang, Zhenxuan Zhang, Thiago V. Lima
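TL;DR: This paper proposes MAP-Diff, a multi-anchor guided diffusion model for progressive 3D whole-body low-dose PET denoising. It uses clinically observed intermediate-dose scans as trajectory anchors for the reverse diffusion process and applies timestep-dependent supervision so that denoising follows the progressive nature of dose formation, giving progressive, dose-consistent restoration from ultra-low-dose input to a high-quality reconstruction.
Details
Motivation: Low-dose positron emission tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Existing diffusion-based denoisers achieve strong reconstructions, but their reverse trajectories are typically unconstrained and do not match the progressive nature of PET dose formation.
Result: On the internal dataset (Siemens Biograph Vision Quadra), MAP-Diff raises PSNR from 42.48 dB to 43.71 dB, raises SSIM to 0.986, and lowers NMAE from 0.115 to 0.103, outperforming baselines such as 3D DDPM. On the cross-scanner dataset (United Imaging uEXPLORER) it reaches 34.42 dB PSNR and 0.141 NMAE, outperforming all compared methods and showing good generalization.
Insight: The core novelty is introducing real clinical intermediate-dose scans as "anchors" in the reverse diffusion process. Timestep calibration and a weighted loss impose progressive, dose-aligned constraints on the denoising trajectory, closing the gap between standard diffusion models and the physics of PET imaging and yielding progressive restoration that better matches clinical practice.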
Abstract: Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM. Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.
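As a reading aid for the reported metrics, the sketch below computes PSNR and NMAE on a toy signal and re-derives the deltas quoted in the abstract. The metric definitions are common conventions assumed here (in particular, NMAE is normalized by the summed absolute reference, and `data_range` is supplied by the caller); the paper may use different normalizations, and the toy arrays are made up.

```python
import math

def psnr(ref, est, data_range):
    """Peak signal-to-noise ratio in dB for flat sequences of intensities."""
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    return 10.0 * math.log10(data_range ** 2 / mse)

def nmae(ref, est):
    """Mean absolute error normalized by the reference magnitude (assumed convention)."""
    return sum(abs(r - e) for r, e in zip(ref, est)) / sum(abs(r) for r in ref)

# Toy reference and estimate (hypothetical values).
ref = [1.0, 2.0, 3.0, 4.0]
est = [1.0, 2.0, 3.0, 5.0]
toy_psnr = psnr(ref, est, data_range=4.0)
toy_nmae = nmae(ref, est)

# Deltas implied by the numbers quoted in the abstract.
psnr_gain = round(43.71 - 42.48, 2)
nmae_change = round(0.103 - 0.115, 3)

# A PSNR gain of g dB at a fixed data range corresponds to an MSE ratio of 10^(-g/10).
mse_ratio = 10 ** (-psnr_gain / 10)
```

Under these assumptions the quoted +1.23 dB corresponds to roughly a quarter less mean squared error than the 3D DDPM baseline.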
[239] NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis cs.CV
Manuel Serna-Aguilera, Raegan Anderes, Page Dobbs, Khoa Luu
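TL;DR: This paper introduces the NICO dataset and the NICO-RAG framework to help address the nicotine addiction public health crisis. NICO contains more than 200,000 multimodal samples covering 55 tobacco and nicotine product brands. NICO-RAG is a retrieval-augmented generation framework that organizes image and text entity relations into hypergraphs, enabling efficient multimodal retrieval for factual answers.
Details
Motivation: Innovation in nicotine products is driving a public health crisis, and existing research is limited in data scale and in its ability to connect data points. Large-scale multimodal data and efficient retrieval methods are needed to support public health research.
Result: In experiments on more than 100 questions, NICO-RAG performs comparably to the current state-of-the-art RAG method adapted for images, without processing additional image tokens.
Insight: The contributions are the large-scale multimodal NICO dataset and a hypergraph-based multimodal knowledge representation that retrieves images not only by visual similarity but also by the semantic similarity of their descriptions, reducing the high cost of language models and image-token processing.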
Abstract: The nicotine addiction public health crisis continues to be pervasive. In this century alone, the tobacco industry has released and marketed new products in an aggressive effort to lure new and young customers for life. Such innovations and product development, namely flavored nicotine or tobacco such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. Thus, we introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset to provide public health researchers with over 200,000 multimodal samples, including images and text descriptions, on 55 tobacco and nicotine product brands. In addition, to provide public health researchers with factual connections across a large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that can retrieve image features without incurring the high cost of language models, as well as the added cost of processing image tokens with large-scale datasets such as NICO. At construction time, NICO-RAG organizes image- and text-extracted entities and relations into hypergraphs to produce responses that are as factual as possible. This joint multimodal knowledge representation enables NICO-RAG to retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions. Experiments show that without needing to process additional tokens from images for over 100 questions, NICO-RAG performs comparably to the state-of-the-art RAG method adapted for images.
[240] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories cs.CV
Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu
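TL;DR: This paper proposes WorldStereo, a framework that links camera-guided video generation with 3D scene reconstruction through two modules, a global-geometric memory and a spatial-stereo memory. It targets the limited camera controllability and multi-view inconsistency of existing video diffusion models, generating high-quality, multi-view-consistent videos that support accurate 3D reconstruction.
Details
Motivation: Videos generated by existing video diffusion models have high visual quality, but limited camera controllability and inconsistent content across different camera trajectories make it hard to reconstruct consistent 3D scenes from them.
Result: Extensive experiments on camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of the method. It acts as a strong world model for diverse scene generation tasks (starting from perspective or panoramic images) and produces high-fidelity 3D results.
Insight: The novelty lies in two geometric memory modules: the global-geometric memory achieves precise camera control and injects coarse structural priors through incrementally updated point clouds, while the spatial-stereo memory uses 3D correspondences to constrain the attention receptive field so that it focuses on fine-grained details in the memory bank. In addition, the flexible control-branch design benefits from a distribution-matching-distilled VDM backbone and needs no joint training, which makes it efficient.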
Abstract: Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model’s attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
[241] MMNavAgent: Multi-Magnification WSI Navigation Agent for Clinically Consistent Whole-Slide Analysis cs.CVPDF
Zhengyang Xu, Han Li, Jingsong Liu, Linrui Xie, Xun Ma
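TL;DR: This paper proposes a clinically consistent Multi-Magnification WSI Navigation Agent (MMNavAgent) that improves whole-slide image (WSI) diagnosis by explicitly modeling multi-magnification interaction and adaptive magnification selection. It introduces a Cross-Magnification navigation Tool (CMT) that aggregates context from adjacent magnifications, and a Magnification Selection Tool (MST) that mimics pathologists' sequential decision process to select magnifications interactively and adaptively.
Details
Motivation: Existing AI navigation methods usually operate at a single fixed magnification or rely on a predefined magnification traversal, whereas pathologists in clinical practice examine slides across multiple magnifications and dynamically integrate global and cellular evidence. This mismatch prevents existing methods from modeling the cross-magnification interaction and adaptive magnification selection inherent in real diagnostic workflows.
Result: Extensive experiments on a public dataset show improved diagnostic performance, with gains of 1.45% in AUC and 2.93% in BACC over a non-agent baseline.
Insight: The novelty lies in explicitly modeling multi-magnification interaction (via CMT) and adaptive magnification selection that mimics clinical decision-making (via MST). This is closer to how pathologists actually work and may improve the clinical consistency and practicality of AI-assisted diagnosis.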
Abstract: Recent AI navigation approaches aim to improve Whole-Slide Image (WSI) diagnosis by modeling spatial exploration and selecting diagnostically relevant regions, yet most operate at a single fixed magnification or rely on predefined magnification traversal. In clinical practice, pathologists examine slides across multiple magnifications and selectively inspect only necessary scales, dynamically integrating global and cellular evidence in a sequential manner. This mismatch prevents existing methods from modeling cross-magnification interactions and adaptive magnification selection inherent to real diagnostic workflows. To address these limitations, we propose a clinically consistent Multi-Magnification WSI Navigation Agent (MMNavAgent) that explicitly models multi-magnification interaction and adaptive magnification selection. Specifically, we introduce a Cross-Magnification navigation Tool (CMT) that aggregates contextual information from adjacent magnifications to enhance discriminative representations along the navigation path. We further introduce a Magnification Selection Tool (MST) that leverages memory-driven reasoning within the agent framework to enable interactive and adaptive magnification selection, mimicking the sequential decision process of pathologists. Extensive experiments on a public dataset demonstrate improved diagnostic performance, with gains of 1.45% in AUC and 2.93% in BACC over a non-agent baseline. Code will be made public upon acceptance.
[242] Detection-Gated Glottal Segmentation with Zero-Shot Cross-Dataset Transfer and Clinical Feature Extraction cs.CV | cs.AI | cs.LGPDF
Harikrishnan Unnikrishnan
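TL;DR: This paper proposes a detection-gated pipeline for glottal segmentation in high-speed videoendoscopy (HSV). It combines a YOLOv8-based detector with a U-Net segmenter and uses a temporal consistency wrapper to suppress false positives. The model is trained on limited GIRAFE data, evaluated via zero-shot cross-dataset transfer on BAGLS, and achieves state-of-the-art performance.
Details
Motivation: Existing deep learning models tend to produce spurious artifacts in non-glottal frames and generalize poorly across clinical settings. This paper addresses both the accuracy and the cross-dataset generalization of glottal segmentation.
Result: The pipeline reaches a DSC of 0.81 on the GIRAFE benchmark (SOTA) and a DSC of 0.85 under zero-shot transfer to BAGLS (better than in-distribution performance). Downstream clinical validation shows that automatically extracted kinematic features (e.g., Open Quotient, coefficient of variation) agree with established clinical benchmarks, and that the coefficient of variation of the glottal area is a significant marker for distinguishing healthy from pathological vocal function (p=0.006).
Insight: Novel points include the detection-gated architecture (combining object detection with segmentation), the temporal consistency wrapper for robustness, and zero-shot cross-dataset transfer without institutional fine-tuning, giving a lightweight and efficient solution (~35 frames/s) for real-time clinical use.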
Abstract: Background: Accurate glottal segmentation in high-speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non-glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection-gated pipeline that integrates a YOLOv8-based detector with a U-Net segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and instrument occlusion. The model was trained on a limited subset of the GIRAFE dataset (600 frames) and evaluated via zero-shot transfer on the large-scale BAGLS dataset. Results: The pipeline achieved state-of-the-art performance on the GIRAFE benchmark (DSC 0.81) and demonstrated superior generalizability on BAGLS (DSC 0.85, in-distribution) without institutional fine-tuning. Downstream validation on a 65-subject clinical cohort confirmed that automated kinematic features (Open Quotient, coefficient of variation) remained consistent with established clinical benchmarks. The coefficient of variation (CV) of the glottal area was found to be a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: The detection-gated architecture provides a lightweight, computationally efficient solution (~35 frames/s) for real-time clinical use. By enabling robust zero-shot transfer, this framework facilitates the standardized, large-scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at https://github.com/hari-krishnan/openglottal.
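The quantities named in this abstract can be illustrated with a minimal Python sketch: the Dice similarity coefficient (DSC), a detection gate that blanks the segmentation mask when the detector is not confident, and the coefficient of variation (CV) of a glottal-area series. The function names, the flat-list mask format, and the 0.5 gate threshold are assumptions made for illustration; they are not taken from the released openglottal code, and the temporal consistency wrapper is not modeled here.

```python
from statistics import mean, pstdev


def dice(pred, target):
    """Dice similarity coefficient between two binary masks (flat lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def gated_mask(mask, detection_score, threshold=0.5):
    """Return the mask unchanged if the detector is confident, else an empty mask.

    The 0.5 threshold is an assumed placeholder, not a value from the paper.
    """
    return list(mask) if detection_score >= threshold else [0] * len(mask)


def coefficient_of_variation(areas):
    """CV of a per-frame area series: population standard deviation over mean."""
    m = mean(areas)
    return 0.0 if m == 0 else pstdev(areas) / m
```

Gating before segmentation means a frame with no detected glottis contributes an empty mask instead of a spurious one, which is the behavior the abstract attributes to the detection-gated design.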
[243] FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding cs.CV | cs.AIPDF
Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye
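TL;DR: FluxMem is a training-free framework for efficient streaming video understanding that compresses redundant visual information through an adaptive hierarchical memory. It uses a two-stage design: a Temporal Adjacency Selection module removes redundant visual tokens across adjacent frames, and a Spatial Domain Consolidation module merges spatially repetitive regions within each frame into compact representations. A self-adaptive token compression mechanism determines the compression rate automatically in dynamic scenes. The method markedly reduces latency and GPU memory usage and sets a new state of the art on several online video benchmarks.
Details
Motivation: To address the low computational efficiency and high memory usage caused by redundant visual information in streaming video understanding, with the goal of efficient, adaptive real-time video processing.
Result: FluxMem scores 76.4 on StreamingBench and 67.2 on OVO-Bench (real-time setting), while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Offline, it scores 73.1 on MLVU while using 65% fewer visual tokens. The online benchmark results are a new SOTA.
Insight: Novel points include the hierarchical two-stage memory compression design (the TAS and SDC modules) and a self-adaptive token compression mechanism driven by scene statistics, which requires no manual tuning. Viewed objectively, dynamically adjusting the compression rate balances efficiency against accuracy and offers a scalable solution for streaming video processing.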
Abstract: This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
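The idea behind the temporal stage can be sketched in a few lines of Python: compare each token with the token at the same position in the previous frame, and keep only tokens whose similarity falls below a threshold derived from the frame pair itself. This is an illustrative simplification, not FluxMem's actual TAS rule. Using the mean cosine similarity as the adaptive threshold is an assumption, as are the function names, and the spatial stage (SDC) is not covered.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def prune_adjacent(prev_frame, cur_frame):
    """Return indices of tokens in cur_frame that changed relative to prev_frame.

    Each frame is a list of token vectors, aligned by position. The threshold is
    the mean similarity of this frame pair (an assumed scene statistic), so no
    fixed compression rate is set by hand.
    """
    sims = [cosine(p, c) for p, c in zip(prev_frame, cur_frame)]
    if not sims:
        return []
    threshold = sum(sims) / len(sims)
    # Tokens at least as similar as the pair's average are treated as redundant.
    return [i for i, s in enumerate(sims) if s < threshold]
```

With this rule a completely static frame pair keeps no tokens, because every similarity equals the mean; a practical system would likely retain a minimum budget per frame.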
[244] LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation cs.CV | cs.AIPDF
Hualiang Wei, Shunran Jia, Jialun Liu, Wenhui Li
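TL;DR: LiftAvatar proposes a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity 3D Gaussian avatar animation. It is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on one or more reference images.
Details
Motivation: To address the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting avatars caused by sparse kinematic cues in everyday monocular videos, and thereby improve reconstruction and animation quality in downstream 3D avatar pipelines.
Result: Extensive experiments show that LiftAvatar consistently improves the animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
Insight: Novel points include: 1) lifting incomplete input data into a richer kinematic representation to strengthen downstream pipelines; 2) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise, stable driving; 3) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames for strong 3D consistency and controllability; 4) acting as a plug-and-play enhancer that distills priors from large-scale video generative models into 3D pipelines.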
Abstract: We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
[245] Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera cs.CVPDF
Tutian Tang, Xingyu Ji, Yutong Li, MingHao Liu, Wenqiang Xu
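TL;DR: This paper presents Stereo-Inertial Poser, a real-time motion capture system that combines a single stereo camera with six IMUs to estimate metric-accurate, shape-aware 3D human motion. It resolves monocular depth ambiguity through stereo vision, fuses IMU data with visual cues to predict drift-compensated joint positions and root motion, and introduces a novel shape-aware fusion module that reconciles anthropometric variation with global translation. The end-to-end pipeline runs at over 200 FPS without optimization-based post-processing, enabling real-time deployment.
Details
Motivation: Existing visual-inertial motion capture systems that pair a monocular camera with sparse IMUs are cost-effective and mitigate occlusion and drift, but they still suffer from metrically inaccurate global translation caused by monocular depth ambiguity, and from shape-agnostic local motion estimation that ignores anthropometric variation.
Result: Quantitative evaluation on multiple datasets shows state-of-the-art performance. Qualitative results show drift-free global translation over long recordings and reduced foot skating.
Insight: The main novelty is replacing monocular RGB with stereo vision to resolve depth ambiguity directly, enabling direct 3D keypoint extraction and body shape parameter estimation, together with a novel shape-aware fusion module that dynamically reconciles anthropometric variation with global translation. Viewed objectively, the core contribution is the tight fusion of stereo geometric constraints with IMU inertial data, with explicit handling of how body shape variation affects motion estimation.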
Abstract: Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and by shape-agnostic local motion estimation that ignores anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show that our method produces drift-free global translation over long recordings and reduces foot-skating effects.
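Why a calibrated stereo baseline removes the depth ambiguity can be shown with the standard rectified-stereo relation Z = f·B/d, where f is the focal length in pixels, B the baseline in meters, and d the disparity in pixels. The sketch below back-projects a 2D keypoint to metric 3D under that textbook pinhole model. It assumes rectified images and a hypothetical calibration; it is not the paper's pipeline.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Metric depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity_px


def backproject(u, v, disparity_px, focal_px, baseline_m, cx, cy):
    """Lift pixel (u, v) in the left image to a 3D point in the left-camera frame.

    (cx, cy) is the principal point; all calibration values are assumed known.
    """
    z = depth_from_disparity(focal_px, baseline_m, disparity_px)
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)
```

A monocular camera observes only (u, v), so any depth along the viewing ray is consistent with the image; the disparity measurement is what pins down the metric scale.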
[246] SimRecon: SimReady Compositional Scene Reconstruction from Real Videos cs.CVPDF
Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang
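TL;DR: SimRecon is a framework for compositional scene reconstruction from real videos that implements a “Perception-Generation-Simulation” pipeline. It first performs scene-level semantic reconstruction from video input, then generates individual objects, and finally assembles these assets in a simulator. Because naively chaining the three stages leads to low visual fidelity of the generated assets and poor physical plausibility of the final scene, the paper proposes two bridging modules: Active Viewpoint Optimization for visual fidelity and a Scene Graph Synthesizer for physical plausibility.
Details
Motivation: Conventional compositional reconstruction methods focus mainly on visual appearance and generalize poorly to real-world scenes. This paper addresses how to ensure the visual quality of generated assets and the physical plausibility of the assembled scene when reconstructing compositional, object-centric representations from real videos.
Result: Extensive experiments on the ScanNet dataset show that the method outperforms previous state-of-the-art approaches.
Insight: The core novelty is a complete “Perception-Generation-Simulation” pipeline with two key bridging modules (Active Viewpoint Optimization and the Scene Graph Synthesizer) that close the gaps in visual fidelity and physical plausibility respectively, yielding higher-quality compositional scene reconstruction that is better suited to simulation and interaction.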
Abstract: Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which are natively applicable to simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a “Perception-Generation-Simulation” pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method’s superior performance over previous state-of-the-art approaches.
[247] OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens cs.CVPDF
Yiying Yang, Wei Cheng, Sijin Chen, Honghao Fu, Xianfang Zeng
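TL;DR: OmniLottie is a general framework that generates high-quality vector animations from multi-modal instructions. A carefully designed Lottie tokenizer converts JSON files into structured sequences of commands and parameters, and the model is built on a pretrained vision-language model to follow instructions. The paper also builds a large-scale dataset, MMLottie-2M, to advance research.
Details
Motivation: To address the challenge of generating high-quality vector animations directly from multi-modal instructions (such as text and visual input), in particular the fact that raw Lottie JSON files are structurally complex and contain large amounts of invariant metadata and formatting tokens, which makes them hard to learn from directly.
Result: Extensive experiments verify that OmniLottie produces vivid, semantically aligned vector animations that closely follow multi-modal human instructions.
Insight: The novelty lies in a dedicated Lottie tokenizer that turns the complex JSON representation into structured command-parameter sequences, making it possible to use pretrained vision-language models for generation. The paper also contributes MMLottie-2M, a large-scale dataset of professionally designed vector animations, as a resource for the field.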
Abstract: OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible motion and visual content control, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision-language models to follow multi-modal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multi-modal human instructions.
[248] Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection cs.CV | cs.LGPDF
Kwame Mbobda-Kuate, Gabriel Kasmi
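TL;DR: This paper presents a systematic efficiency analysis of small object detection in resource-constrained Earth observation (EO), challenging the common assumption that larger models and more data yield better performance. Scaling experiments along three dimensions (model size, dataset size, and input resolution) on rooftop PV detection in Madagascar reveal an “efficiency inversion”: the smallest model, YOLO11N, has the highest detection efficiency per unit of model size and also the best absolute performance. Resolution is the most important factor for efficiency, while adding data brings little benefit at low resolution. Small high-resolution configurations are Pareto-dominant in the accuracy-throughput trade-off across all deployment scenarios.
Details
Motivation: The motivation is to challenge the scaling-law assumption widespread in computer vision (that larger models and more data always perform better), which remains unverified in resource-constrained Earth observation (EO). The paper systematically analyzes how model size, dataset size, and input resolution affect efficiency in resource-constrained small object detection.
Result: On rooftop PV detection, YOLO11N achieves both the highest efficiency (24× higher than YOLO11X) and the highest absolute mAP50 (0.617). Resolution is the dominant lever (+120% efficiency gain), while additional data yields negligible returns at low resolution. In the joint accuracy-throughput space, small high-resolution configurations are Pareto-dominant across all 44 experimental setups, leaving no trade-off to resolve.
Insight: The paper's claimed novelty is revealing an “efficiency inversion” in data-scarce Earth observation, where a smaller model can surpass larger ones in both efficiency and absolute performance, directly challenging scaling laws. Viewed objectively, the core insight is that in resource-constrained small object detection, optimizing input resolution is a more effective resource allocation strategy than blindly enlarging the model or dataset, which gives new, counterintuitive guidance for model selection in deployment.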
Abstract: Scaling laws assume larger models trained on more data consistently outperform smaller ones – an assumption that drives model selection in computer vision but remains untested in resource-constrained Earth observation (EO). We conduct a systematic efficiency analysis across three scaling dimensions: model size, dataset size, and input resolution, on rooftop PV detection in Madagascar. Optimizing for model efficiency (mAP$_{50}$ per unit of model size), we find a consistent efficiency inversion: YOLO11N achieves both the highest efficiency ($24\times$ higher than YOLO11X) and the highest absolute mAP$_{50}$ (0.617). Resolution is the dominant resource allocation lever (+120% efficiency gain), while additional data yields negligible returns at low resolution. These findings are robust to the deployment objective: small high-resolution configurations are Pareto-dominant across all 44 setups in the joint accuracy-throughput space, leaving no tradeoff to resolve. In data-scarce EO, bigger is not just unnecessary: it can be worse.
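The two notions used in this abstract, efficiency as mAP50 per unit of model size and Pareto dominance in accuracy-throughput space, can be made concrete with a short Python sketch. The configuration names and numbers in the usage note are made-up placeholders for illustration, not results from the paper, and the unit of model size is left to the caller.

```python
def efficiency(map50, model_size):
    """Detection efficiency: mAP50 divided by model size (unit chosen by caller)."""
    return map50 / model_size


def pareto_front(configs):
    """Return names of configurations that no other configuration dominates.

    configs maps a name to (accuracy, throughput); higher is better for both.
    A dominates B if A is at least as good on both axes and strictly better on one.
    """
    front = []
    for name, (acc, thr) in configs.items():
        dominated = any(
            a >= acc and t >= thr and (a > acc or t > thr)
            for other, (a, t) in configs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front
```

For example, with the placeholder values `{"small_hi_res": (0.6, 90.0), "large_hi_res": (0.55, 20.0), "small_lo_res": (0.3, 120.0)}`, the large configuration is dominated by the small high-resolution one, since it is worse on both axes, and drops off the front.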
[249] Bridging the gap between Performance and Interpretability: An Explainable Disentangled Multimodal Framework for Cancer Survival Prediction cs.CVPDF
Aniek Eijpe, Soufyan Lakbir, Melis Erdal Cesur, Sara P. Oliveira, Angelos Chatzimparmpas
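TL;DR: This paper proposes DIMAFx, an explainable multimodal framework for cancer survival prediction that produces disentangled, interpretable modality-specific and modality-shared representations from histopathology whole-slide images and transcriptomics data. It achieves state-of-the-art performance across multiple cancer cohorts, and its interpretable design reveals key multimodal interactions and biological information.
Details
Motivation: Multimodal survival prediction models keep getting more accurate, but their complexity often reduces interpretability and limits understanding of how different data sources influence predictions. This paper aims to resolve the traditional trade-off between performance and interpretability.
Result: Across multiple cancer cohorts, DIMAFx achieves state-of-the-art performance and improved representation disentanglement. In breast cancer survival prediction, the most predictive features contain modality-shared information, a finding consistent with known breast cancer biology.
Insight: The novelty is an explainable multimodal framework that produces disentangled representations and systematically reveals multimodal interactions and biological information. It shows that multimodal models can overcome the traditional trade-off between performance and interpretability, supporting their use in precision medicine.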
Abstract: While multimodal survival prediction models are increasingly accurate, their complexity often reduces interpretability, limiting insight into how different data sources influence predictions. To address this, we introduce DIMAFx, an explainable multimodal framework for cancer survival prediction that produces disentangled, interpretable modality-specific and modality-shared representations from histopathology whole-slide images and transcriptomics data. Across multiple cancer cohorts, DIMAFx achieves state-of-the-art performance and improved representation disentanglement. Leveraging its interpretable design and SHapley Additive exPlanations, DIMAFx systematically reveals key multimodal interactions and the biological information encoded in the disentangled representations. In breast cancer survival prediction, the most predictive features contain modality-shared information, including one capturing solid tumor morphology contextualized primarily by late estrogen response, where higher-grade morphology aligned with pathway upregulation and increased risk, consistent with known breast cancer biology. Key modality-specific features capture microenvironmental signals from interacting adipose and stromal morphologies. These results show that multimodal models can overcome the traditional trade-off between performance and explainability, supporting their application in precision medicine.
[250] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance cs.CV | cs.AIPDF
Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen
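TL;DR: This paper proposes Kiwi-Edit, a unified architecture for versatile video editing guided by instructions and references. To address the lack of precise visual control in existing methods and the scarcity of high-quality paired training data, the authors introduce a scalable data generation pipeline and use it to build a large-scale dataset, RefVIE, and its evaluation benchmark, RefVIE-Bench.
Details
Motivation: Current instruction-based video editing methods struggle with precise control because natural language is inherently limited in describing complex visual detail, while the potential of reference-guided editing is held back by the scarcity of high-quality paired training data. This paper aims to bridge that gap.
Result: Extensive experiments on RefVIE-Bench show that the proposed data and architecture set a new state of the art (SOTA) in controllable video editing.
Insight: The main novel points are: 1) a scalable data generation pipeline that uses image generative models to create synthesized reference scaffolds, turning existing video editing pairs into high-quality training quadruplets; 2) a unified editing architecture in which learnable queries and latent visual features work together to provide reference semantic guidance; 3) a progressive multi-stage training curriculum that markedly improves instruction following and reference fidelity.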
Abstract: Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
[251] Adaptive Confidence Regularization for Multimodal Failure Detection cs.CV | cs.AI | cs.LGPDF
Moru Liu, Hao Dong, Olga Fink, Mario Trapp
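TL;DR: This paper proposes Adaptive Confidence Regularization (ACR), a new framework designed to detect prediction failures in multimodal models. It rests on a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon called confidence degradation. To mitigate it, ACR introduces an Adaptive Confidence Loss that penalizes degradation during training, and Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training samples. Extensive experiments across four datasets, three modalities, and multiple evaluation settings show consistent and robust gains.
Details
Motivation: Deploying multimodal models in high-stakes domains such as autonomous driving and medical diagnosis requires not only strong predictive performance but also reliable failure detection. This paper addresses failure detection in multimodal settings, a problem that remains largely unexplored.
Result: Extensive experiments across four datasets, three modalities, and multiple evaluation settings show that ACR achieves consistent and robust gains, demonstrating its effectiveness for multimodal failure detection.
Insight: The paper's novel points are: 1) identifying and formalizing the “confidence degradation” phenomenon in multimodal failures; 2) an Adaptive Confidence Loss that directly penalizes this degradation; 3) Multimodal Feature Swapping, which synthesizes failure-aware outlier samples to strengthen failure detection. Viewed objectively, the method moves failure detection away from the traditional unimodal or post-hoc view toward a regularization framework that targets multimodal interaction and is integrated into training, which is a useful reference for other work.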
Abstract: The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at https://github.com/mona4399/ACR.
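The confidence degradation observation can be illustrated with a small Python sketch: take confidence as the maximum softmax probability, and measure how far the multimodal confidence falls below the most confident unimodal branch. The hinge form below is an assumed stand-in chosen for illustration. The abstract does not give the exact form of the Adaptive Confidence Loss, and all function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits):
    """Confidence of a prediction, taken here as the maximum softmax probability."""
    return max(softmax(logits))


def degradation_penalty(multimodal_logits, unimodal_logits_list):
    """Hinge penalty: positive only when the fused prediction is less confident
    than the most confident unimodal branch (an assumed form, not the paper's loss)."""
    best_unimodal = max(confidence(logits) for logits in unimodal_logits_list)
    return max(0.0, best_unimodal - confidence(multimodal_logits))
```

A penalty of zero means fusion did not reduce confidence relative to any single modality; a positive value flags the degradation pattern the abstract associates with failure cases.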
[252] HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images cs.CVPDF
Yichen Liu, Donghao Zhou, Jie Wang, Xin Gao, Guisheng Liu
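TL;DR: This paper proposes HiFi-Inpaint, a new high-fidelity reference-based inpainting framework for generating detail-preserving human-product images. It introduces Shared Enhancement Attention to refine product features and a Detail-Aware Loss that provides pixel-level supervision using high-frequency maps. The authors also build a new dataset, HP-Image-40K, for training. Experiments show state-of-the-art performance in generating images that preserve product details.
Details
Motivation: To address three limitations of existing reference-based inpainting methods for human-product image generation: the lack of large-scale training data, difficulty focusing on product detail preservation, and coarse supervision that cannot provide precise guidance.
Result: Experimental results show that HiFi-Inpaint achieves state-of-the-art performance on detail-preserving human-product image generation.
Insight: The novelty lies in Shared Enhancement Attention, which refines product features, and the Detail-Aware Loss, which uses high-frequency maps for precise pixel-level supervision. The paper also builds a new large-scale dataset, HP-Image-40K, to support training, which suggests one way to address data scarcity.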
Abstract: Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision to provide precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.
cs.CR [Back]
[253] RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration cs.CR | cs.AI | cs.CL | cs.LGPDF
Srikumar Nayak
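TL;DR: This paper proposes RLShield, a practical multi-agent reinforcement learning (RL) framework for financial cyber defense. It models the enterprise attack surface as a Markov decision process (MDP) whose states include alerts, asset exposure, and service health, and whose actions represent real response steps. RLShield learns coordinated policies across multiple agents (assets or service groups) and optimizes a risk-sensitive objective that balances containment speed, business disruption, and response cost.
Details
Motivation: To address the challenges financial systems face under continuous cyber attack: existing security tools (such as fixed rules or static playbooks) adapt slowly to changes in attacker behavior, while existing RL research in finance mostly targets trading and does not adequately account for practical limits on cyber response, such as action cost, service disruption, and defender coordination across many assets.
Result: Experiments show that, under the same constraints, RLShield reduces time-to-containment and residual exposure compared with static rule baselines and single-agent RL, while keeping business disruption within a fixed response budget.
Insight: The main novel points are: 1) modeling the enterprise attack surface as an MDP with real response steps and operational constraints; 2) using a multi-agent RL framework to learn coordinated defense policies; 3) optimizing a risk-sensitive objective that balances several operational goals (speed, cost, disruption); 4) introducing a game-aware evaluation against adaptive attackers that reports operational outcomes rather than reward alone. This offers a deployable, practical approach to automated response in financial security operations.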
Abstract: Financial systems run nonstop and must stay reliable even during cyber incidents. Modern attacks move across many services (apps, APIs, identity, payment rails), so defenders must take a sequence of actions under time pressure. Most security tools still use fixed rules or static playbooks, which can be slow to adapt when the attacker changes behavior. Reinforcement learning (RL) is a good fit for sequential decisions, but much of the RL-in-finance literature targets trading and does not model real cyber response limits such as action cost, service disruption, and defender coordination across many assets. This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense. We model the enterprise attack surface as a Markov decision process (MDP) where states summarize alerts, asset exposure, and service health, and actions represent real response steps (e.g., isolate a host, rotate credentials, rate-limit an API, block an account, or trigger recovery). RLShield learns coordinated policies across multiple agents (assets or service groups) and optimizes a risk-sensitive objective that balances containment speed, business disruption, and response cost. We also include a game-aware evaluation that tests policies against adaptive attackers and reports operational outcomes, not only reward. Experiments show that RLShield reduces time-to-containment and residual exposure while keeping disruption within a fixed response budget, outperforming static rule baselines and single-agent RL under the same constraints. These results suggest that multi-agent, cost-aware RL can provide a deployable layer for automated response in financial security operations.
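One way to picture an objective that trades off exposure, disruption, and response cost is a weighted per-step reward plus a budget check, sketched below in Python. This is a simplified stand-in and not RLShield's objective: a plain weighted sum does not capture risk sensitivity (for example, tail-risk measures), and the field names, weights, and budget semantics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Operational effect of one response action (all fields are assumed units)."""
    residual_exposure: float  # exposure still open after the action
    disruption: float         # business impact of the action
    action_cost: float        # cost charged against the response budget


def step_reward(outcome, w_exposure=1.0, w_disruption=0.5, w_cost=0.1):
    """Negative weighted sum: lower exposure, disruption, and cost give higher reward.

    The weights are illustrative placeholders, not values from the paper.
    """
    return -(
        w_exposure * outcome.residual_exposure
        + w_disruption * outcome.disruption
        + w_cost * outcome.action_cost
    )


def within_budget(outcomes, budget):
    """True if the total action cost of an episode stays within the response budget."""
    return sum(o.action_cost for o in outcomes) <= budget
```

Keeping the budget as a separate constraint, as the abstract describes, means disruption and cost can be reported as operational outcomes in their own right instead of being hidden inside a single reward number.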
cs.AI [Back]
[254] Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs cs.AI | cs.CLPDF
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao
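TL;DR: This paper proposes Draft-Thinking, a new method that addresses the excessive computational cost of long chain-of-thought reasoning. It guides the model to learn a concise “draft-style” reasoning structure that keeps only the critical reasoning steps, and uses progressive curriculum learning to internalize this efficient pattern stably. It also introduces adaptive prompting, which makes reasoning depth a behavior the model can choose flexibly. Experiments show that the method sharply reduces reasoning cost while largely preserving reasoning performance.
Details
Motivation: The long chain-of-thought paradigm improves the performance of large reasoning models but also greatly increases inference cost, and existing methods often induce systematic overthinking that unnecessarily couples reasoning capability with reasoning cost. Existing approaches to reducing token usage are mostly post hoc and do not address the core mechanisms of reasoning.
Result: On the MATH500 benchmark, Draft-Thinking reduces the reasoning budget by 82.6% at the cost of only a 2.6% performance drop. Extensive experiments show that it substantially reduces the reasoning budget while largely preserving reasoning performance.
Insight: The paper's novel points are: 1) a “draft-style” reasoning structure that trims reasoning steps at the source; 2) progressive curriculum learning to internalize the efficient reasoning pattern stably; 3) adaptive prompting, which makes reasoning depth a flexibly adjustable model behavior rather than a fixed property. This suggests a new direction for designing more efficient, more economical reasoning models.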
Abstract: Long chain-of-thought (CoT) has become a dominant paradigm for enhancing the reasoning capability of large reasoning models (LRMs); however, the performance gains often come with a substantial increase in reasoning budget. Recent studies show that existing CoT paradigms tend to induce systematic overthinking, unnecessarily coupling reasoning capability with reasoning cost. Most prior approaches reduce token usage through post hoc techniques such as token compression, truncation, or length penalties, without explicitly addressing the core mechanisms of reasoning. We propose Draft-Thinking, which guides models to first learn a concise draft-style reasoning structure that retains only the critical reasoning steps. Through progressive curriculum learning, the model stably internalizes this efficient reasoning pattern as its capability scales. Moreover, Draft-Thinking introduces adaptive prompting, which elevates reasoning depth to a flexible, model-selectable behavior. Extensive experiments demonstrate that Draft-Thinking substantially reduces reasoning budget while largely preserving reasoning performance; for example, on MATH500, it achieves an 82.6% reduction in reasoning budget at the cost of only a 2.6% performance drop.
[255] TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces cs.AI | cs.CLPDF
Shu-Xun Yang, Cunxiang Wang, Haoke Zhang, Wenbo Yu, Lindong Wu
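TL;DR: This paper proposes TraceSIR, a multi-agent framework for structured analysis and reporting of agentic execution traces. Three specialized agents (StructureAgent, InsightAgent, and ReportAgent) work together to tackle the difficulty of failure diagnosis and root cause analysis caused by long, complex execution traces in agentic systems.
Details
Motivation: Agentic systems, which combine large language models with external tools, produce long and complex execution traces that make failure diagnosis and root cause analysis very challenging. Manual inspection does not scale, applying LLMs directly to raw traces is limited by input length and unreliable reasoning, and looking only at final task outcomes discards critical behavioral information.
Result: Experiments are run on the TraceBench benchmark, which covers three real-world agentic scenarios, using the ReportEval evaluation protocol aligned with industry needs. TraceSIR significantly outperforms existing methods on all evaluation dimensions and consistently produces coherent, informative, and actionable reports.
Insight: Novel points include: 1) a new abstraction format, TraceFormat, that compresses execution traces while preserving key behavioral information; 2) a multi-agent collaborative framework that decomposes trace analysis into three specialized steps: structuring, diagnosis, and report generation; 3) the TraceBench benchmark and ReportEval evaluation protocol, which provide evaluation standards for agent trace analysis.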
Abstract: Agentic systems augment large language models with external tools and iterative decision making, enabling complex tasks such as deep research, function calling, and coding. However, their long and intricate execution traces make failure diagnosis and root cause analysis extremely challenging. Manual inspection does not scale, while directly applying LLMs to raw traces is hindered by input length limits and unreliable reasoning. Focusing solely on final task outcomes further discards critical behavioral information required for accurate issue localization. To address these issues, we propose TraceSIR, a multi-agent framework for structured analysis and reporting of agentic execution traces. TraceSIR coordinates three specialized agents: (1) StructureAgent, which introduces a novel abstraction format, TraceFormat, to compress execution traces while preserving essential behavioral information; (2) InsightAgent, which performs fine-grained diagnosis including issue localization, root cause analysis, and optimization suggestions; (3) ReportAgent, which aggregates insights across task instances and generates comprehensive analysis reports. To evaluate TraceSIR, we construct TraceBench, covering three real-world agentic scenarios, and introduce ReportEval, an evaluation protocol for assessing the quality and usability of analysis reports aligned with industry needs. Experiments show that TraceSIR consistently produces coherent, informative, and actionable reports, significantly outperforming existing approaches across all evaluation dimensions. Our project and video are publicly available at https://github.com/SHU-XUN/TraceSIR.
[256] ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning cs.AI | cs.CLPDF
Congying Liu, Taihao Li, Ming Huang, Xingyuan Wei, Peipei Liu
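TL;DR: This paper proposes ProtRLSearch, a multi-round multimodal protein search agent trained with reinforcement learning, aimed at tasks in healthcare settings that require accurate reasoning under protein sequence constraints. It jointly uses protein sequences and text as multimodal inputs during real-time search to produce high-quality reports.
Details
Motivation: Most existing protein search agents are limited to single-round, text-only search and cannot incorporate protein sequences as a multimodal input into search decisions. Their reinforcement learning supervision also focuses only on the final answer and places no constraints on the search process, so deviations in keyword selection and reasoning direction are hard to correct.
Result: To evaluate how well models integrate sequence information with text in realistic protein query settings, the authors build ProtMCQs, a benchmark of 3,000 multiple-choice questions organized into three difficulty levels. It ranges from sequence-constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi-dimensional sequence features with signaling pathways and regulatory networks.
Insight: The novelty is a multi-round multimodal search framework that combines protein sequences with text and is trained with multi-dimensional reward-based reinforcement learning to optimize the search process. Viewed objectively, the ProtMCQs benchmark provides a new testbed for evaluating multimodal protein reasoning.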
Abstract: Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease-related variants, protein-level analysis for clinical research, and similar scenarios. To address such tasks, search agents have been introduced to retrieve protein-related information, providing support for disease-related variant analysis and protein function reasoning in protein-centric inference. However, such search agents are mostly limited to single-round, text-only search, which prevents the protein sequence modality from being incorporated as a multimodal input into the search decision-making process. Meanwhile, their reliance on reinforcement learning (RL) supervision that focuses solely on the final answer results in a lack of search process constraints, making deviations in keyword selection and reasoning directions difficult to identify and correct in a timely manner. To address these limitations, we propose ProtRLSearch, a multi-round protein search agent trained with multi-dimensional reward-based RL, which jointly leverages protein sequence and text as multimodal inputs during real-time search to produce high-quality reports. To evaluate the ability of models to integrate protein sequence information and text-based multimodal inputs in realistic protein query settings, we construct ProtMCQs, a benchmark of 3,000 multiple-choice questions (MCQs) organized into three difficulty levels. The benchmark evaluates protein query tasks that range from sequence-constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi-dimensional sequence features with signal pathways and regulatory networks.
[257] According to Me: Long-Term Personalized Referential Memory QA cs.AI | cs.CL | cs.CVPDF
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li
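TL;DR: This paper introduces ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA, containing approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs. It proposes Schema-Guided Memory (SGM), which represents memory items from different sources in a structured way, and evaluates several existing memory systems. These systems perform poorly on the ATM-Bench-Hard set (under 20% accuracy), while SGM improves over the Descriptive Memory approach commonly used in prior work.
Details
Motivation: Existing long-term memory benchmarks focus mainly on dialogue history and fail to capture personalized references grounded in real lived experience. A more realistic multimodal, multi-source personalized memory QA benchmark is therefore needed to advance long-term user memory recall and reasoning in personalized AI assistants.
Result: On ATM-Bench, 5 state-of-the-art memory systems and a standard RAG baseline score under 20% accuracy on the ATM-Bench-Hard set. The proposed SGM improves over the commonly used Descriptive Memory, although it is not explicitly stated to reach state-of-the-art performance.
Insight: The contributions are ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA, and Schema-Guided Memory (SGM), which structurally represents memory items from different sources. Together they help with complex tasks such as personal reference resolution, multi-source multi-evidence reasoning, and conflicting evidence, and offer a new evaluation framework and methodological direction for personalized AI memory systems.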
Abstract: Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing Long-term Memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, multi-evidence reasoning across multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code available at: https://github.com/JingbiaoMei/ATM-Bench
[258] Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning cs.AI | cs.CL | cs.HC | cs.MAPDF
Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher
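TL;DR: This paper proposes a conversational framework based on multi-agent large language models (LLMs) that generates interactive explanations in automated planning tasks to support human-AI collaboration. The framework is independent of the specific explanation method and is instantiated for goal-conflict explanations; a user study validates its effectiveness against a template-based baseline interface.
Details
Motivation: In real-world sequential decision problems, the goal of automated planning is often not to replace the human planner but to support an iterative reasoning and elicitation process in which the human guides the AI planner according to their preferences and expertise. Explanations that respond to users' questions are therefore crucial for improving their understanding of potential solutions and increasing their trust in the system.
Result: The paper describes an instantiation of the framework for goal-conflict explanations and runs a user study comparing the LLM-powered interactive explanation interface with a template-based baseline interface. The results indicate that the LLM-powered interactive approach has advantages in improving user understanding and trust.
Insight: The novelty is a multi-agent LLM architecture that is agnostic to the explanation framework and supports user- and context-dependent interactive explanations, enabling more natural conversational interaction. Viewed objectively, the work uses the LLM as a mediator and relies on multi-agent collaboration to generate explanations of the plan space dynamically, offering a new, flexible framework for explainability in human-AI collaborative planning.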
Abstract: When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human’s role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users’ questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations. We also describe an instantiation of this framework for goal-conflict explanations, which we use to conduct a user study comparing the LLM-powered interaction with a baseline template-based explanation interface.
[259] Tool Verification for Test-Time Reinforcement Learning cs.AI | cs.CLPDF
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh
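TL;DR: This paper proposes T^3RL (Tool-Verification for Test-Time Reinforcement Learning) to address a failure mode of test-time reinforcement learning (TTRL), in which an unverified but high-frequency spurious consensus biases the reward signal and causes incorrect mode collapse. The method introduces test-time tool verification into reward estimation: a verifier uses an external tool (e.g., code execution) as evidence and upweights verified rollouts in a verification-aware vote, producing more reliable pseudo-labels for training.
Details
Motivation: Test-time reinforcement learning (TTRL) is a promising paradigm that lets large reasoning models adapt online on unlabeled test inputs using self-induced rewards from majority voting. However, an unverified high-frequency spurious consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse.
Result: Across math datasets of varying difficulty (MATH-500, AMC, and AIME 2024) and different backbone models, T^3RL improves significantly over TTRL, with larger gains on harder problems.
Insight: The novelty is the integration of test-time tool verification into reward estimation, using external tool checks to stabilize self-evolution. The approach can be viewed as verified online data synthesis and provides a key mechanism for stable self-evolution.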
Abstract: Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
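The verification-aware voting described above can be illustrated with a minimal sketch. The abstract does not give the weighting rule, so the fixed `verified_weight` multiplier, the function name, and the toy rollouts below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

def verification_aware_vote(rollouts, verified_weight=2.0):
    """Pick a pseudo-label from (answer, verified) pairs.

    Each rollout contributes 1.0 to its answer's score, or
    `verified_weight` if an external tool confirmed it.
    The weighting scheme is a hypothetical choice.
    """
    scores = defaultdict(float)
    for answer, verified in rollouts:
        scores[answer] += verified_weight if verified else 1.0
    # max() keeps the first maximal key, so ties go to the answer seen first.
    return max(scores, key=scores.get)

# Three unverified rollouts agree on "12"; two tool-verified rollouts say "15".
rollouts = [("12", False), ("12", False), ("12", False),
            ("15", True), ("15", True)]

plain_majority = verification_aware_vote(rollouts, verified_weight=1.0)
weighted = verification_aware_vote(rollouts, verified_weight=2.0)
```

With equal weights the spurious consensus "12" wins; once verified rollouts count double, the pseudo-label flips to "15", which is the effect the method relies on.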
[260] Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation cs.AI | cs.CVPDF
Zeyu Chen, Huanjin Yao, Ziwang Zhao, Min Yang
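TL;DR: This paper proposes M-JudgeBench, a capability-oriented benchmark for comprehensively evaluating multimodal large language models as judges, and uses it to expose systematic weaknesses in existing systems. To address these weaknesses, the authors further propose Judge-MCTS, a data generation framework used to build an augmented dataset and train M-Judger, a series of strong judge models. Experiments show that M-Judger performs better on both existing benchmarks and M-JudgeBench.
Details
Motivation: Existing judge benchmarks categorize samples by task type but fail to capture the fundamental judgment capabilities that reliable evaluation requires. A more principled foundation is therefore needed for assessing the capability and reliability of MLLMs as judges and for addressing their systematic weaknesses.
Result: Extensive experiments on M-JudgeBench and existing judge benchmarks show that the M-Judger model series, trained with the Judge-MCTS framework, performs strongly and outperforms the compared judge models.
Insight: The novelty is a benchmark (M-JudgeBench) that decomposes judgment capability along ten dimensions, together with a data generation framework based on Monte Carlo tree search (Judge-MCTS) that systematically strengthens judge models. This provides a more principled, capability-driven methodological foundation for evaluating and training judge models.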
Abstract: Using Multimodal Large Language Models (MLLMs) as judges to achieve precise and consistent evaluations has gradually become an emerging paradigm across various domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for ensuring trustworthy assessment. Existing judge benchmarks categorize samples by task types but fail to capture the fundamental judgment capabilities required for reliable evaluation. In this work, we introduce M-JudgeBench, a ten-dimensional capability-oriented benchmark designed to comprehensively assess the judgment abilities of MLLMs. Our benchmark decomposes evaluation into pairwise Chain-of-Thought (CoT) comparison, length bias avoidance, and process error detection tasks, jointly covering ten fine-grained subtasks. This design enables diagnosis of model reliability across reasoning styles, response lengths, and cross-model variations. Systematic evaluation uncovers systematic weaknesses in existing MLLM-as-a-judge systems. To address this issue, we further propose Judge-MCTS, a data construction framework that generates pairwise reasoning trajectories of varying correctness and length. Using Judge-MCTS, we construct an MCTS-augmented dataset and train M-Judger, a series of strong judge models. Extensive experiments demonstrate the superiority of M-Judger on existing judge benchmarks as well as M-JudgeBench. Overall, our work establishes a more principled foundation for evaluating MLLM-as-a-judge through M-JudgeBench and the Judge-MCTS framework, paving the way for future research on judge model evaluation and capability-driven judge training.
[261] MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation cs.AI | cs.CVPDF
Rongsheng Wang, Minghao Wu, Hongru Zhou, Zhihan Yu, Zhenyang Cai
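TL;DR: This paper proposes MicroVerse, a video generation model for micro-world simulation. The authors first build the MicroWorldBench benchmark to evaluate microscale simulation tasks systematically and find that existing SOTA video generation models fall short in scientific fidelity, temporal consistency, and related aspects. They then create MicroSim-10K, a high-quality expert-verified dataset, and use it to train MicroVerse, a model specialized for microscale simulation that can accurately reproduce complex microscale mechanisms.
Details
Motivation: Existing video generation techniques are applied mainly to macroscopic simulation of complex dynamic systems, while the simulation of microscopic phenomena (such as drug discovery and cellular dynamics in biomedicine) remains underexplored, even though microscale simulation has great potential in biomedicine, education, and scientific visualization.
Result: Evaluated on the proposed MicroWorldBench benchmark (459 expert-annotated criteria), current SOTA video generation models fail at microscale simulation, violating physical laws and showing temporal inconsistency. MicroVerse, trained on the MicroSim-10K dataset, accurately reproduces complex microscale mechanisms and serves as a proof of concept for micro-world simulation.
Insight: The novelty is the first formulation of the concept of "micro-world simulation", together with MicroWorldBench, the first multi-level rubric-based benchmark for microscale simulation, and MicroSim-10K, a high-quality expert-verified dataset. These are used to train the specialized video generation model MicroVerse, paving the way for applications in biology, education, and related fields.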
Abstract: Recent advances in video generation have opened new avenues for macroscopic simulation of complex dynamic systems, but their application to microscopic phenomena remains largely unexplored. Microscale simulation holds great promise for biomedical applications such as drug discovery, organ-on-chip systems, and disease mechanism studies, while also showing potential in education and interactive visualization. In this work, we introduce MicroWorldBench, a multi-level rubric-based benchmark for microscale simulation tasks. MicroWorldBench enables systematic, rubric-based evaluation through 459 unique expert-annotated criteria spanning multiple microscale simulation tasks (e.g., organ-level processes, cellular dynamics, and subcellular molecular interactions) and evaluation dimensions (e.g., scientific fidelity, visual quality, instruction following). MicroWorldBench reveals that current SOTA video generation models fail in microscale simulation, showing violations of physical laws, temporal inconsistency, and misalignment with expert criteria. To address these limitations, we construct MicroSim-10K, a high-quality, expert-verified simulation dataset. Leveraging this dataset, we train MicroVerse, a video generation model tailored for microscale simulation. MicroVerse can accurately reproduce complex microscale mechanisms. Our work is the first to introduce the concept of Micro-World Simulation and presents a proof of concept, paving the way for applications in biology, education, and scientific visualization. Our work demonstrates the potential of educational microscale simulations of biological mechanisms. Our data and code are publicly available at https://github.com/FreedomIntelligence/MicroVerse
[262] Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy cs.AI | cs.CVPDF
Jiahao Huang, Fengyan Lin, Xuechao Yang, Chen Feng, Kexin Zhu
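TL;DR: This paper proposes a cognitively inspired three-level hierarchy of emotional intelligence (perception, understanding, interaction) and, based on it, designs Nano-EmoX, a small-scale multitask multimodal language model, along with P2E (Perception-to-Empathy), a curriculum-based training framework. By integrating enhanced omni-modal encoders and heterogeneous adapters, the model handles diverse affective tasks in a unified language space and achieves state-of-the-art or highly competitive performance on multiple benchmarks.
Details
Motivation: Existing affective multimodal language models suffer from a gap between low-level perception and high-level interaction, with fragmented capabilities and limited generalization. The paper aims to provide a unified conceptual foundation for affective modeling.
Result: Nano-EmoX (2.2B parameters) is the first compact multimodal language model to unify six core affective tasks across all three hierarchy levels. It achieves state-of-the-art (SOTA) or highly competitive performance on multiple benchmarks, demonstrating excellent efficiency and generalization.
Insight: The contributions are: 1) a cognitively inspired three-level hierarchy of affective tasks that provides a unified framework for affective modeling; 2) a compact model architecture that integrates enhanced omni-modal encoders (such as a facial encoder) and heterogeneous adapters to enable cross-task knowledge transfer; 3) the P2E curriculum training framework, which progressively cultivates the model's emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy.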
Abstract: The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.
cs.RO [Back]
[263] LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models cs.RO | cs.AI | cs.CL | cs.CV | cs.LGPDF
Yuchen Hou, Lin Zhao
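TL;DR: This paper proposes the LangGap benchmark to diagnose and expose serious deficits in how current state-of-the-art vision-language-action models understand language instructions. Although these models exceed 95% success on standard benchmarks, they largely ignore language. A benchmark designed to force language understanding, combined with targeted data augmentation, can partially close this language gap, but it also exposes a fundamental lack of learning capacity when models face semantically diverse tasks.
Details
Motivation: Current state-of-the-art vision-language-action models perform well on standard benchmarks but lack systematic understanding of language instructions. Prior work lacks systematic semantic perturbation diagnostics, a benchmark that forces language understanding, and linguistically diverse training data, so a new benchmark is needed to expose and address the problem.
Result: On LangGap, the language understanding deficits of the π0.5 model are exposed, with an initial success rate of 0%. With targeted data augmentation, the success rate rises from 0% to 90% under single-task training and from 0% to 28% under multi-task training. However, as the semantic diversity of the extended tasks increases, model learning capacity proves severely insufficient, and even trained tasks perform poorly.
Insight: The novelty is the LangGap benchmark, which uses a four-dimensional semantic perturbation method (varying instruction semantics while keeping the tabletop layout fixed) to force models to understand language, addressing a shortcoming of existing benchmarks such as LIBERO. Viewed objectively, the study reveals a fundamental challenge for VLA models in understanding diverse language instructions and underlines the importance of benchmarks that force language understanding and of linguistically diverse training data.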
Abstract: Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state-of-the-art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method – varying instruction semantics while keeping the tabletop layout fixed – revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick-and-place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap – success rate improves from 0% to 90% with single-task training, and 0% to 28% with multi-task training. However, as semantic diversity of extended tasks increases, model learning capacity proves severely insufficient; even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions – precisely the long-term value of LangGap.
[264] UniHM: Unified Dexterous Hand Manipulation with Vision Language Model cs.RO | cs.CVPDF
Zhenhao Zhang, Jiaxin Liu, Ye Shi, Jingya Wang
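TL;DR: This paper proposes UniHM, the first framework for unified dexterous hand manipulation guided by free-form language instructions. A unified dexterous-hand tokenizer maps heterogeneous hand morphologies into a shared codebook, a vision-language-action model trained only on human-object interaction data generates human-like manipulation sequences, and a physics-guided dynamic refinement module ensures physical feasibility.
Details
Motivation: Existing dexterous hand manipulation planners rely on object-centric cues or precise hand-object interaction sequences and lack the rich, compositional guidance of open-vocabulary instructions.
Result: Across multiple datasets and real-world evaluations, UniHM achieves state-of-the-art (SOTA) results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility.
Insight: The contributions are: 1) a tokenization method that unifies heterogeneous hand morphologies, improving cross-hand generalization and scalability to new morphologies; 2) training on human-object interaction data only, which avoids the need for large-scale real-world teleoperation datasets; 3) a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors to ensure physical realism and smoothness.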
Abstract: Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is at https://unihm.github.io/.
[265] Certifiable Estimation with Factor Graphs cs.RO | cs.CVPDF
Zhexin Xu, Nikolas R. Sanderson, Hanna Jiamei Zhang, David M. Rosen
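TL;DR: This paper proposes a framework that unifies factor graphs with certifiable estimation. Using Shor's relaxation and Burer-Monteiro factorization, it turns quadratically constrained quadratic programs (QCQPs) into certifiable optimization problems while preserving factor graph structure, so that certifiable estimation can use mature factor graph libraries and workflows, which simplifies deployment.
Details
Motivation: Local optimization methods used for factor graph inference can converge to suboptimal solutions, while certifiable estimation methods are hard to deploy in practice because of their high computational cost and implementation complexity.
Result: The method achieves certifiable, globally optimal estimation while preserving factor graph structure, making certifiable estimation as easy to design and deploy as conventional factor graph inference.
Insight: The novelty is a natural synthesis of the factor graph and certifiable estimation paradigms. A structure-preserving transformation lets certifiable optimization use existing high-performance factor graph libraries directly, lowering the implementation barrier and improving reliability in safety-critical applications.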
Abstract: Factor graphs provide a convenient modular modeling language that enables practitioners to design and deploy high-performance robotic state estimation systems by composing simple, reusable building blocks. However, inference in these models is typically performed using local optimization methods that can converge to suboptimal solutions, a serious reliability concern in safety-critical applications. Conversely, certifiable estimators based on convex relaxation can recover verifiably globally optimal solutions in many practical settings, but the computational cost of solving their large-scale relaxations necessitates specialized, structure-exploiting solvers that require substantial expertise to implement, significantly hampering practical deployment. In this paper, we show that these two paradigms, which have thus far been treated as independent in the literature, can be naturally synthesized into a unified framework for certifiable factor graph optimization. The key insight is that factor graph structure is preserved under Shor’s relaxation and Burer-Monteiro factorization: applying these transformations to a QCQP with an associated factor graph representation yields a lifted problem admitting a factor graph model with identical connectivity, in which variables and factors are simple one-to-one algebraic transformations of those in the original QCQP. This structural preservation enables the Riemannian Staircase methodology for certifiable estimation to be implemented using the same mature, highly-performant factor graph libraries and workflows already ubiquitously employed throughout robotics and computer vision, making certifiable estimation as straightforward to design and deploy as conventional factor graph inference.
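The two transformations named in the abstract can be written out for a generic QCQP. This is a sketch in standard textbook form, assuming a homogeneous problem with equality constraints; the symbols $Q$, $A_i$, $b_i$, $Z$, $Y$, and $r$ are notation chosen here and do not come from the paper.

```latex
% Original QCQP over x in R^n
\min_{x \in \mathbb{R}^{n}} \; x^{\top} Q x
\quad \text{s.t.} \quad x^{\top} A_i x = b_i, \; i = 1, \dots, m

% Shor's relaxation: substitute Z = x x^T and drop the rank-1 constraint
\min_{Z \succeq 0} \; \operatorname{tr}(Q Z)
\quad \text{s.t.} \quad \operatorname{tr}(A_i Z) = b_i, \; i = 1, \dots, m

% Burer-Monteiro factorization: Z = Y Y^T with Y in R^{n x r}, r << n
\min_{Y \in \mathbb{R}^{n \times r}} \; \operatorname{tr}\!\left(Q\, Y Y^{\top}\right)
\quad \text{s.t.} \quad \operatorname{tr}\!\left(A_i\, Y Y^{\top}\right) = b_i, \; i = 1, \dots, m
```

Each row of $Y$ takes the place of one scalar entry of $x$, and $Q$ and the $A_i$ are unchanged. This is consistent with the abstract's claim that the lifted problem keeps the connectivity of the original factor graph.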
[266] Tiny-DroNeRF: Tiny Neural Radiance Fields aboard Federated Learning-enabled Nano-drones cs.RO | cs.CV | eess.SYPDF
Ilenia Carboni, Elia Cereda, Lorenzo Lamberti, Daniele Malpetti, Francesco Conti
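TL;DR: This paper proposes Tiny-DroNeRF, a lightweight neural radiance field model based on Instant-NGP and designed for the ultra-low-power microcontrollers aboard resource-constrained nano-drones. To overcome the memory and compute limits of a single drone, it further introduces a federated learning scheme that lets multiple nano-drones train the model collaboratively. The approach sharply reduces the model's memory footprint at an acceptable loss in reconstruction accuracy and is the first to combine NeRF training on an ultra-low-power MCU with federated learning on nano-drones.
Details
Motivation: Nano-drones are extremely resource-constrained (about 100 GOps/s of compute and less than 100 MB of memory), which makes complex vision tasks such as dense 3D scene reconstruction challenging. Existing high-performance NeRF models need gigabytes of memory and power-hungry GPUs and cannot be deployed on nano-drones.
Result: On the GAP9 ultra-low-power MCU, Tiny-DroNeRF reduces the memory footprint by 96% compared with Instant-NGP, with only a 5.7 dB drop in reconstruction accuracy. The federated learning scheme lets the model train on more data than fits in a single drone's memory, increasing overall reconstruction accuracy.
Insight: The main contributions are an extremely lightweight NeRF model adapted to the nano-drone platform and the first use of federated learning for collaborative training on that platform. This offers a feasible technical path for deploying and training complex neural rendering models on resource-constrained edge devices, combining model efficiency with collaborative learning.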
Abstract: Sub-30g nano-sized aerial robots can leverage their agility and form factor to autonomously explore cluttered and narrow environments, like in industrial inspection and search and rescue missions. However, the price for their tiny size is a strong limit in their resources, i.e., sub-100 mW microcontroller units (MCUs) delivering ~100 GOps/s at best, and memory budgets well below 100 MB. Despite these strict constraints, we aim to enable complex vision-based tasks aboard nano-drones, such as dense 3D scene reconstruction: a key robotic task underlying fundamental capabilities like spatial awareness and motion planning. Top-performing 3D reconstruction methods leverage neural radiance fields (NeRF) models, which require GBs of memory and massive computation, usually delivered by high-end GPUs consuming 100s of Watts. Our work introduces Tiny-DroNeRF, a lightweight NeRF model, based on Instant-NGP, and optimized for running on a GAP9 ultra-low-power (ULP) MCU aboard our nano-drones. Then, we further empower our Tiny-DroNeRF by leveraging a collaborative federated learning scheme, which distributes the model training among multiple nano-drones. Our experimental results show a 96% reduction in Tiny-DroNeRF's memory footprint compared to Instant-NGP, with only a 5.7 dB drop in reconstruction accuracy. Finally, our federated learning scheme allows Tiny-DroNeRF to train with an amount of data otherwise impossible to keep in a single drone's memory, increasing the overall reconstruction accuracy. Ultimately, our work combines, for the first time, NeRF training on a ULP MCU with federated learning on nano-drones.
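The abstract does not say how the drones' models are combined. As a hedged illustration of what a collaborative federated scheme can look like, the sketch below uses sample-weighted parameter averaging (FedAvg), which is a common choice and an assumption here, not the paper's stated method. The function name and the toy numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Sample-weighted average of per-client parameter vectors (FedAvg-style).

    client_weights: one flat list of floats per drone (same length each).
    client_sizes:   number of local training samples per drone.
    Assumption: the paper may aggregate differently.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical drones; the second saw three times as much data.
global_model = federated_average(
    client_weights=[[1.0, 2.0], [3.0, 4.0]],
    client_sizes=[1, 3],
)
```

Each drone only needs to hold its own data and one copy of the model, which is how such a scheme can train on more data than fits in a single drone's memory.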
[267] Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation cs.RO | cs.CV | cs.LGPDF
Jan Finke, Wayne Paul Martis, Adrian Schmelter, Lars Erbach, Christian Jestel
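TL;DR: This paper proposes a teacher-student framework for vision-based omnidirectional navigation. A fine-tuned Depth Anything V2 model predicts monocular depth maps from four RGB cameras, replacing the conventional 2D LiDAR and enabling vision-only mobile robot navigation. The whole inference pipeline runs onboard an NVIDIA Jetson Orin AGX with no external computation.
Details
Motivation: In industrial environments, a 2D LiDAR perceives only a single horizontal slice and cannot detect critical obstacles above or below the scan plane (such as overhanging structures and low-profile objects). The paper aims for more reliable 3D scene understanding and obstacle avoidance.
Result: In simulation, the student policy achieves success rates of 82-96.5%, consistently outperforming the standard 2D LiDAR teacher policy (50-89%). In real-world experiments, the student based on monocular depth estimation outperforms the 2D LiDAR teacher when navigating around obstacles with complex 3D geometry.
Insight: The novelty is distilling a teacher policy trained on privileged LiDAR observations into a student policy that relies only on monocular depth estimation, achieving vision-only navigation and demonstrating real-time operation on an edge device. Viewed objectively, the method addresses the geometric perception limits of 2D LiDAR and improves navigation robustness in complex 3D environments.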
Abstract: Reliable obstacle avoidance in industrial settings demands 3D scene understanding, but widely used 2D LiDAR sensors perceive only a single horizontal slice of the environment, missing critical obstacles above or below the scan plane. We present a teacher-student framework for vision-based mobile robot navigation that eliminates the need for LiDAR sensors. A teacher policy trained via Proximal Policy Optimization (PPO) in NVIDIA Isaac Lab leverages privileged 2D LiDAR observations that account for the full robot footprint to learn robust navigation. The learned behavior is distilled into a student policy that relies solely on monocular depth maps predicted by a fine-tuned Depth Anything V2 model from four RGB cameras. The complete inference pipeline, comprising monocular depth estimation (MDE), policy execution, and motor control, runs entirely onboard an NVIDIA Jetson Orin AGX mounted on a DJI RoboMaster platform, requiring no external computation for inference. In simulation, the student achieves success rates of 82-96.5%, consistently outperforming the standard 2D LiDAR teacher (50-89%). In real-world experiments, the MDE-based student outperforms the 2D LiDAR teacher when navigating around obstacles with complex 3D geometries, such as overhanging structures and low-profile objects, that fall outside the single scan plane of a 2D LiDAR.
[268] LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers cs.RO | cs.CVPDF
Fabian Schmidt, Karol Fedurko, Markus Enzweiler, Abhinav Valada
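TL;DR: LAD-Drive is a generative framework for autonomous driving that tackles the difficulty of translating the discrete semantic knowledge of multimodal large language models (MLLMs) into continuous trajectories. It structurally disentangles high-level intention from low-level spatial planning: an action decoder infers a probabilistic meta-action distribution, which is combined with the vehicle's kinematic state and passed to an action-aware diffusion decoder that generates safe, feasible trajectories.
Details
Motivation: Existing methods often rely on unimodal planning heads, which limits their ability to represent multimodal driving behavior, and most generative methods condition on one-hot encoded actions, discarding the navigational uncertainty that is critical in complex scenarios. LAD-Drive aims to overcome these limits and bridge language and trajectory more effectively.
Result: Extensive evaluation on the LangAuto benchmark shows that LAD-Drive achieves state-of-the-art (SOTA) results, with a Driving Score up to 59% higher than competitive baselines, while significantly reducing route deviations and collisions.
Insight: The novelty is the structural disentanglement of intention and planning, the use of a probabilistic meta-action distribution as an explicit belief state that preserves navigational uncertainty, and a truncated denoising process that refines motion anchors. This offers a new way to turn language reasoning into robust, continuous driving policies.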
Abstract: While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, most generative approaches frequently condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost by one-hot encodings. This distribution, fused with the vehicle’s kinematic state, conditions an action-aware diffusion decoder that utilizes a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models on https://github.com/iis-esslingen/lad-drive.
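The contrast between a probabilistic meta-action distribution and a one-hot action can be seen in a small sketch. The meta-action names and logits below are hypothetical, and the softmax is only an illustrative way to obtain a distribution; the abstract does not specify how the action decoder produces it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def one_hot(logits):
    """Keep only the top-scoring action and discard all uncertainty."""
    k = max(range(len(logits)), key=logits.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(logits))]

# Hypothetical meta-actions and decoder logits for an ambiguous junction.
meta_actions = ["keep_lane", "turn_left", "stop"]
logits = [2.0, 1.9, -1.0]

belief = softmax(logits)   # two near-equal options remain visible
hard = one_hot(logits)     # the near-tie collapses to a single action
```

In `belief`, "keep_lane" and "turn_left" carry almost the same probability, which is the kind of navigational uncertainty the abstract says one-hot conditioning throws away.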
[269] $π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs cs.RO | cs.CVPDF
Siting Wang, Xiaofeng Wang, Zheng Zhu, Minnan Pei, Xinyu Cui
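TL;DR: This paper proposes π-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that addresses the intractable likelihoods caused by multi-step sampling when flow-based vision-language-action models are trained with online reinforcement learning. The method needs only a single forward pass per optimization step and no auxiliary value network, and it shows strong few-shot robustness and superior generalization on the LIBERO and ManiSkill benchmarks.
Details
Motivation: Flow-based vision-language-action models excel at embodied control, but multi-step sampling makes their likelihoods intractable, which hinders online reinforcement learning. The paper aims to remove this bottleneck by using finer-grained, step-wise guidance to cope with a wider exploration space.
Result: On LIBERO, π-StepNFT shows competitive few-shot robustness and unlocks latent model capability. On ManiSkill, it outperforms value-based methods in out-of-distribution scenarios, achieving better generalization by preventing overfitting to multimodal features.
Insight: The claimed novelty is a critic-and-likelihood-free optimization framework, together with the observation that a wider exploration space requires finer-grained, step-wise alignment guidance. Viewed objectively, the core contribution is reducing complex multi-step sampling optimization to a single forward pass, removing the dependence on a value network that conventional methods have and offering a scalable option for complex real-world applications.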
Abstract: Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose $π$-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $π$-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a scalable solution promising for complex real-world applications.
[270] Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation cs.RO | cs.CVPDF
Han Xue, Nan Min, Xiaotong Liu, Wendi Chen, Yuan Fang
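TL;DR: This paper presents the first comprehensive empirical study of how fisheye cameras affect robotic imitation learning, focusing on three key questions: spatial localization, scene generalization, and hardware generalization. The wide field of view of fisheye cameras significantly improves spatial localization in complex environments but tends to overfit in simple scenes; training with greater environmental diversity improves scene generalization; and cross-camera hardware generalization fails mainly because of scale overfitting, which a simple random scale augmentation strategy can mitigate.
Details
Motivation: Fisheye cameras are widely adopted in robotic manipulation for their very wide field of view, but their specific effects on policy learning are not systematically understood. This study aims to fill that gap.
Result: Through extensive experiments in simulation and the real world, the paper characterizes how fisheye cameras behave in spatial localization, scene generalization, and hardware generalization, and proposes a random scale augmentation strategy to improve hardware generalization.
Insight: The novelty is the first systematic empirical analysis of fisheye camera properties in robot learning, with concrete guidance for collecting and using datasets, such as training with environmental diversity and applying random scale augmentation.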
Abstract: The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning. More results and videos are available on https://robo-fisheye.github.io/
cs.HC [Back]
[271] Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI cs.HC | cs.AI | cs.CV | cs.CYPDF
Sicheng Yang, Yukai Huang, Weitong Cai, Shitong Sun, Fengyi Fang
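TL;DR: This paper proposes Egocentric Co-Pilot, a web-based neuro-symbolic framework that runs on smart glasses. It uses a large language model (LLM) to coordinate perception, reasoning, and web tools, and combines Temporal Chain-of-Thought with Hierarchical Context Compression to process continuous first-person video for long-horizon question answering and decision support. A lightweight multimodal intent layer maps noisy speech and gaze into structured commands. The paper also implements and evaluates a cloud-native WebRTC pipeline that integrates streaming speech, video, and control messages into a unified channel between smart glasses and browsers.
Details
Motivation: People in crowded cities, with low vision, or under cognitive overload need to access the web without a screen, a stable desk, or even free hands. The paper aims to combine smart glasses with AI agents so that the web becomes an always-available assistive layer in daily life.
Result: Experiments on the Egolife and HD-EPIC datasets show competitive or state-of-the-art performance on egocentric question answering. A human-in-the-loop study on smart glasses shows higher task completion and user satisfaction than leading commercial baselines.
Insight: The core contribution is an egocentric reasoning core that combines Temporal Chain-of-Thought with Hierarchical Context Compression to handle continuous first-person video far beyond a single model's context window. In addition, by grounding operation in web-native communication primitives and modular, auditable tool use, the work offers a concrete blueprint for assistive, always-on web agents that support education, accessibility, and social inclusion.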
Abstract: What if accessing the web did not require a screen, a stable desk, or even free hands? For people navigating crowded cities, living with low vision, or experiencing cognitive overload, smart glasses coupled with AI agents could turn the web into an always-on assistive layer over daily life. We present Egocentric Co-Pilot, a web-native neuro-symbolic framework that runs on smart glasses and uses a Large Language Model (LLM) to orchestrate a toolbox of perception, reasoning, and web tools. An egocentric reasoning core combines Temporal Chain-of-Thought with Hierarchical Context Compression to support long-horizon question answering and decision support over continuous first-person video, far beyond a single model’s context window. Additionally, a lightweight multimodal intent layer maps noisy speech and gaze into structured commands. We further implement and evaluate a cloud-native WebRTC pipeline integrating streaming speech, video, and control messages into a unified channel for smart glasses and browsers. In parallel, we deploy an on-premise WebSocket baseline, exposing concrete trade-offs between local inference and cloud offloading in terms of latency, mobility, and resource use. Experiments on Egolife and HD-EPIC demonstrate competitive or state-of-the-art egocentric QA performance, and a human-in-the-loop study on smart glasses shows higher task completion and user satisfaction than leading commercial baselines. Taken together, these results indicate that web-connected egocentric co-pilots can be a practical path toward more accessible, context-aware assistance in everyday life. By grounding operation in web-native communication primitives and modular, auditable tool use, Egocentric Co-Pilot offers a concrete blueprint for assistive, always-on web agents that support education, accessibility, and social inclusion for people who may benefit most from contextual, egocentric AI.
cs.CY [Back]
[272] How effective are VLMs in assisting humans in inferring the quality of mental models from Multimodal short answers? cs.CY | cs.AI | cs.CLPDF
Pritam Sil, Durgaprasad Karnam, Vinay Reddy Venumuddala, Pushpak Bhattacharyya
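TL;DR: This paper proposes MMGrader, an approach that infers the quality of students' mental models from their multimodal answers, using concept graphs as the analytical framework. The best existing models still fall short of human-level performance at this task, but with improved accuracy they could become effective assistants that help teachers assess the understanding of a whole class.
Details
Motivation: STEM mental models are critical for assessing students' conceptual understanding of a topic, but conventional approaches focus only on grading and overlook the deep reasoning needed to infer the quality of a mental model from student answers, which is challenging for teachers.
Result: In an evaluation of 9 openly available models, the best model reaches an accuracy of about 40% and a prediction error of 1.1 units, with a scoring distribution broadly aligned with human scoring patterns, but it still falls short of human-level performance.
Insight: The novelty is the use of concept graphs as an analytical framework for inferring mental model quality from multimodal answers. The paper also notes that more accurate models could help teachers assess the understanding of an entire class efficiently and design targeted teaching plans.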
Abstract: STEM Mental models can play a critical role in assessing students’ conceptual understanding of a topic. They not only offer insights into what students know but also into how effectively they can apply, relate to, and integrate concepts across various contexts. Thus, students’ responses are critical markers of the quality of their understanding and not entities that should be merely graded. However, inferring these mental models from student answers is challenging as it requires deep reasoning skills. We propose MMGrader, an approach that infers the quality of students’ mental models from their multimodal responses using concept graphs as an analytical framework. In our evaluation with 9 openly available models, we found that the best-performing models fall short of human-level performance. This is because they only achieved an accuracy of approximately 40%, a prediction error of 1.1 units, and a scoring distribution fairly aligned with human scoring patterns. With improved accuracy, these can be highly effective assistants to teachers in inferring the mental models of their entire classrooms, enabling them to do so efficiently and help improve their pedagogies more effectively by designing targeted help sessions and lectures that strengthen areas where students collectively demonstrate lower proficiency.
cs.ET [Back]
[273] RTLocating: Intent-aware RTL Localization for Hardware Design Iteration cs.ET | cs.CL | cs.IRPDF
Changwen Xing, Yanfeng Lu, Lei Qi, Chenxu Niu, Jie Li
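TL;DR: The paper proposes RTLocating, a framework that localizes the RTL code blocks affected by a natural-language change request (ΔSpec) during hardware design iteration. It is the first to formalize the ΔSpec-to-RTL localization problem, and it builds EvoRTL-Bench, the first industrial-scale benchmark for the task.
Details
Motivation: Industrial chip development is inherently iterative, yet most LLM-aided hardware design work targets one-shot synthesis and overlooks intent-driven localized updates. This creates the need to map natural-language change requests precisely to RTL code blocks.
Result: On EvoRTL-Bench, RTLocating achieves 0.568 MRR and 15.08% R@1, improving on the strongest baseline by 22.9% and 67.0% respectively and setting a new state of the art for intent-driven localization in hardware design.
Insight: Contributions include formalizing the multi-positive ΔSpec-to-RTL localization problem; a dynamic router that fuses three encoding views (textual semantics, local structure, and global interaction and dependency); and the first industrial-scale intent-code alignment benchmark built from real Git history.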
Abstract: Industrial chip development is inherently iterative, favoring localized, intent-driven updates over rewriting RTL from scratch. Yet most LLM-Aided Hardware Design (LAD) work focuses on one-shot synthesis, leaving this workflow underexplored. To bridge this gap, we for the first time formalize $Δ$Spec-to-RTL localization, a multi-positive problem mapping natural language change requests ($Δ$Spec) to the affected Register Transfer Level (RTL) syntactic blocks. We propose RTLocating, an intent-aware RTL localization framework, featuring a dynamic router that adaptively fuses complementary views from a textual semantic encoder, a local structural encoder, and a global interaction and dependency encoder (GLIDE). To enable scalable supervision, we introduce EvoRTL-Bench, the first industrial-scale benchmark for intent-code alignment derived from OpenTitan’s Git history, comprising 1,905 validated requests and 13,583 $Δ$Spec-RTL block pairs. On EvoRTL-Bench, RTLocating achieves 0.568 MRR and 15.08% R@1, outperforming the strongest baseline by +22.9% and +67.0%, respectively, establishing a new state-of-the-art for intent-driven localization in evolving hardware designs.
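The two reported metrics can be illustrated with a minimal Python sketch for the multi-positive setting, where one change request may map to several relevant RTL blocks. The definitions used here are assumptions, since the abstract does not spell them out: MRR takes the rank of the first relevant block, and R@k is the fraction of a request's relevant blocks that appear in the top k. The block identifiers and the `evaluate` helper are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], positives: Set[str]) -> float:
    """Return 1/rank of the first relevant block, or 0.0 if none is ranked."""
    for rank, block in enumerate(ranked, start=1):
        if block in positives:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], positives: Set[str], k: int) -> float:
    """Fraction of the relevant blocks that appear among the top-k results."""
    if not positives:
        return 0.0
    hits = sum(1 for block in ranked[:k] if block in positives)
    return hits / len(positives)


def evaluate(queries: List[Tuple[List[str], Set[str]]]) -> Dict[str, float]:
    """Average both metrics over (ranking, relevant-set) pairs."""
    n = len(queries)
    mrr = sum(reciprocal_rank(r, p) for r, p in queries) / n
    r_at_1 = sum(recall_at_k(r, p, 1) for r, p in queries) / n
    return {"mrr": mrr, "recall@1": r_at_1}


# Hypothetical block identifiers, for illustration only.
toy_queries = [
    (["alu.always_0", "fsm.case_1", "regs.assign_2"], {"alu.always_0", "regs.assign_2"}),
    (["fsm.case_1", "alu.always_0", "regs.assign_2"], {"alu.always_0"}),
]
metrics = evaluate(toy_queries)
```

Under this convention, R@1 stays low when a request has many relevant blocks, even if the top-ranked block is correct, which may help explain how a fairly high MRR can coexist with a modest R@1.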
cs.IR [Back]
[274] OmniRet: Efficient and High-Fidelity Omni Modality Retrieval cs.IR | cs.CL | cs.CVPDF
Chuong Huynh, Manh Luong, Abhinav Shrivastava
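TL;DR: This paper presents OmniRet, the first retrieval model able to handle complex composed queries spanning three modalities: text, vision, and audio. An attention-based resampling mechanism and Attention Sliced Wasserstein Pooling address computational inefficiency and limited representation fidelity. Evaluated on 13 retrieval tasks and an MMEBv2 subset, the model shows significant improvements on composed-query, audio, and video retrieval.
Details
Motivation: State-of-the-art multimodal retrieval models are typically limited to text and vision, which hinders the development of universal retrieval systems that can understand queries combining more than two modalities.
Result: Benchmarks on 13 retrieval tasks and an MMEBv2 subset show significant improvements on composed-query, audio, and video retrieval, with performance on par with state-of-the-art models on the remaining tasks.
Insight: Contributions include an attention-based resampling mechanism that produces compact, fixed-size representations and Attention Sliced Wasserstein Pooling that preserves fine-grained detail. The authors also curate a new Audio-Centric Multimodal Benchmark (ACM) to evaluate omni-modal embedding capacity more comprehensively.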
Abstract: Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and an MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks, composed audio retrieval and audio-visual retrieval, to more comprehensively evaluate a model’s omni-modal embedding capacity.
[275] PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval cs.IR | cs.AI | cs.CV | cs.MMPDF
Tianyi Xu, Rong Shan, Junjie Wu, Jiadeng Huang, Teng Wang
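TL;DR: This paper introduces PhotoBench, the first retrieval benchmark built from authentic personal albums, intended to shift the paradigm from visual matching to personalized, multi-source, intent-driven reasoning. By integrating visual semantics, spatial-temporal metadata, social identity, and temporal events, the benchmark synthesizes complex intent-driven queries rooted in users' life trajectories. Evaluation exposes two key limitations of existing methods: the modality gap and the source fusion paradox.
Details
Motivation: Existing retrieval benchmarks rely heavily on context-isolated web snapshots and cannot capture the multi-source reasoning needed to resolve authentic, intent-driven user queries. Closing this gap requires a benchmark that reflects the ecological complexity of personal albums, such as temporal continuity, social entanglement, and rich metadata.
Result: Extensive evaluation on PhotoBench shows that unified embedding models collapse on non-visual constraints (the modality gap), while agentic systems orchestrate tools poorly (the source fusion paradox), exposing the limits of current personal multimodal retrieval methods.
Insight: The core contribution is the first benchmark built from authentic personal albums, together with a rigorous multi-source profiling framework for synthesizing complex intent-driven queries. The key takeaway is that the next frontier in personal multimodal retrieval lies not in unified embeddings but in robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion.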
Abstract: Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes the personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots, failing to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic, personal albums. It is designed to shift the paradigm from visual matching to personalized multi-source intent-driven reasoning. Based on a rigorous multi-source profiling framework, which integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users’ life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems perform poor tool orchestration. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings, necessitating robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. Our PhotoBench is available.
[276] MealRec: Multi-granularity Sequential Modeling via Hierarchical Diffusion Models for Micro-Video Recommendation cs.IR | cs.CVPDF
Xinxin Dong, Haokai Ma, Yuze Zheng, Yongfu Zha, Yonghui Yang
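TL;DR: This paper proposes MealRec, a micro-video recommendation method that performs multi-granularity sequential modeling with hierarchical diffusion models. It targets the weakened correspondence between behaviors and interests caused by noisy multimodal content and unreliable implicit feedback. The method has two core modules: TCD refines video representations under intra-video temporal guidance, emphasizing salient content and suppressing redundancy; NPD recovers informative user preferences from corrupted states under blind denoising.
Details
Motivation: Noisy multimodal content and unreliable implicit feedback in micro-video recommendation lead to preference-irrelevant video representation extraction and inherent modality conflicts, and conventional approaches such as behavior-augmented modeling and content-centric multimodal analysis are limited in addressing them.
Result: Extensive experiments and analyses on four micro-video datasets from two platforms demonstrate the effectiveness, universality, and robustness of MealRec and shed light on how the TCD and NPD modules work.
Insight: The novelty is to consider temporal correlations in preference modeling from both intra-video and inter-video perspectives, achieving multi-granularity sequential modeling through hierarchical diffusion models. TCD refines representations using personalized collaborative signals, and NPD enables semantically coherent preference modeling, offering a new way to handle noise and modality conflicts.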
Abstract: Micro-video recommendation aims to capture user preferences from the collaborative and context information of the interacted micro-videos, thereby predicting the appropriate videos. This target is often hindered by the inherent noise within multimodal content and unreliable implicit feedback, which weakens the correspondence between behaviors and underlying interests. While conventional works have predominantly approached this scenario through behavior-augmented modeling and content-centric multimodal analysis, these paradigms can inadvertently give rise to two non-trivial challenges: preference-irrelevant video representation extraction and inherent modality conflicts. To address these issues, we propose a Multi-granularity sequential modeling method via hierarchical diffusion models for micro-video Recommendation (MealRec), which simultaneously considers temporal correlations during preference modeling from intra- and inter-video perspectives. Specifically, we first propose Temporal-guided Content Diffusion (TCD) to refine video representations under intra-video temporal guidance and personalized collaborative signals to emphasize salient content while suppressing redundancy. To achieve semantically coherent preference modeling, we further design the Noise-unconditional Preference Denoising (NPD) to recover informative user preferences from corrupted states under blind denoising. Extensive experiments and analyses on four micro-video datasets from two platforms demonstrate the effectiveness, universality, and robustness of our MealRec, further uncovering the effective mechanism of our proposed TCD and NPD. The source code and corresponding dataset will be available upon acceptance.
[277] NextAds: Towards Next-generation Personalized Video Advertising cs.IR | cs.CVPDF
Yiyan Xu, Ruoxuan Xia, Wuqiang Zheng, Fengbin Zhu, Wenjie Wang
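TL;DR: This paper proposes NextAds, a generative-AI-based paradigm for next-generation personalized video advertising that aims to move beyond conventional retrieval-based ad delivery. It conceptualizes four core components of NextAds and defines benchmarks and end-to-end pipelines for two tasks: personalized creative generation and personalized creative integration. Initial experiments suggest that generative AI has potential for generating and integrating personalized ad creatives, and the paper also discusses the challenges and opportunities of the paradigm.
Details
Motivation: Retrieval-based personalized video advertising systems depend on static, finite inventories of pre-produced creatives, which leads to coarse-grained and untimely personalization and prevents creatives from being continuously refined using online user feedback.
Result: The paper builds lightweight benchmarks and instantiates end-to-end pipelines for the two representative tasks (personalized creative generation and integration). Initial exploratory experiments indicate that generative AI can generate and integrate personalized creatives with encouraging performance, but no specific quantitative metrics or state-of-the-art comparisons are mentioned.
Insight: The core contribution is a systematic framework (NextAds) for shifting from a retrieval-based to a generation-based paradigm, treating creative optimization as real-time generation and adjustment in a continuous space. This opens a research direction and a feasible path toward finer-grained, more timely, and online-iterable personalized advertising with generative AI.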
Abstract: With the rapid growth of online video consumption, video advertising has become increasingly dominant in the digital advertising landscape. Yet diverse users and viewing contexts make one-size-fits-all ad creatives insufficient for consistent effectiveness, underlining the importance of personalization. In practice, most personalized video advertising systems follow a retrieval-based paradigm, selecting the optimal one from a small set of professionally pre-produced creatives for each user. Such static and finite inventories limit both the granularity and the timeliness of personalization, and prevent the creatives from being continuously refined based on online user feedback. Recent advances in generative AI make it possible to move beyond retrieval toward optimizing video creatives in a continuous space at serving time. In this light, we propose NextAds, a generation-based paradigm for next-generation personalized video advertising, and conceptualize NextAds with four core components. To enable comparable research progress, we formulate two representative tasks: personalized creative generation and personalized creative integration, and introduce corresponding lightweight benchmarks. To assess feasibility, we instantiate end-to-end pipelines for both tasks and conduct initial exploratory experiments, demonstrating that GenAI can generate and integrate personalized creatives with encouraging performance. Moreover, we discuss the key challenges and opportunities under this paradigm, aiming to provide actionable insights for both researchers and practitioners and to catalyze progress in personalized video advertising.
physics.soc-ph [Back]
[278] Multimodal Modular Chain of Thoughts in Energy Performance Certificate Assessment physics.soc-ph | cs.AI | cs.CVPDF
Zhen Peng, Peter J. Bentley
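TL;DR: This paper proposes Multimodal Modular Chain of Thoughts (MMCoT), a framework that uses vision-language models to automate Energy Performance Certificate (EPC) pre-assessment from limited visual information. Through structured prompting, it decomposes EPC estimation into intermediate reasoning stages and explicitly propagates inferred attributes across tasks. Experiments on a multimodal dataset of 81 residential properties in the United Kingdom show statistically significant improvements over instruction-only prompting.
Details
Motivation: Accurately evaluating building energy performance remains challenging in regions that lack scalable EPC assessments. The paper aims to address this with a low-cost automated approach.
Result: On a multimodal dataset of 81 UK residential properties, MMCoT outperforms instruction-only prompting in terms of accuracy, recall, mean absolute error, and confusion matrices. Most errors fall between adjacent classes, indicating that the method captures the ordinal structure of EPC ratings.
Insight: The novelty is to decompose multimodal reasoning into a modular chain of thought and propagate attributes across tasks via structured prompting, offering a new direction for low-cost EPC pre-assessment in data-scarce settings.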
Abstract: Accurate evaluation of building energy performance remains challenging in regions where scalable Energy Performance Certificate (EPC) assessments are unavailable. This paper presents a cost-efficient framework that leverages Vision-Language models for automated EPC pre-assessment from limited visual information. The proposed Multimodal Modular Chain of Thoughts (MMCoT) architecture decomposes EPC estimation into intermediate reasoning stages and explicitly propagates inferred attributes across tasks using structured prompting. Experiments on a multimodal dataset of 81 residential properties in the United Kingdom show that MMCoT achieves statistically significant improvements over instruction-only prompting for EPC estimation. Analysis based on accuracy, recall, mean absolute error, and confusion matrices indicates that the proposed approach captures the ordinal structure of EPC ratings, with most errors occurring between adjacent classes. These results suggest that modular prompt-based reasoning offers a promising direction for low-cost EPC pre-assessment in data-scarce settings.
eess.IV [Back]
[279] GazeXPErT: An Expert Eye-tracking Dataset for Interpretable and Explainable AI in Oncologic FDG-PET/CT Scans eess.IV | cs.CV | cs.HCPDF
Joy T Wu, Daniel Beckmann, Sarah Miller, Alexander Lee, Elizabeth Theng
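TL;DR: This paper introduces GazeXPErT, a 4D eye-tracking dataset for oncologic FDG-PET/CT scans, intended to improve the interpretability and clinical workflow integration of AI models. It captures expert search patterns during tumor detection and measurement on 346 scans and provides 9,030 gaze trajectories in COCO-style format. Baseline experiments show that incorporating expert gaze patterns improves 3D nnUNet tumor segmentation and that vision-transformer-based models improve dynamic lesion localization and expert intention prediction.
Details
Motivation: AI models for FDG-PET/CT image analysis are hard to translate into practice because of limited interpretability and reliability and difficult workflow integration. Capturing expert gaze patterns is meant to make such models more clinically trustworthy and useful.
Result: For tumor segmentation, the DICE score of a 3D nnUNet rises from 0.6008 to 0.6819 when expert gaze patterns are incorporated. For dynamic lesion localization, a vision-transformer-based model moves 74.95% of predicted gaze points closer to the tumor. For expert intention prediction, accuracy reaches 67.53% with an AUROC of 0.747.
Insight: The novelty lies in constructing the first large-scale, synchronized dataset of expert gaze trajectories and rendering it in a standard machine-learning format, providing a new paradigm for explainable-AI research. Viewed objectively, using expert visual attention as a supervision signal or as feature augmentation is an effective way to improve the performance and clinical acceptability of medical-imaging AI models, with particular promise for lesion localization and intention understanding.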
Abstract: [18F]FDG-PET/CT is a cornerstone imaging modality for tumor staging and treatment response assessment across many cancer types, yet expert reader shortages necessitate more efficient diagnostic aids. While standalone AI models for automatic lesion segmentation exist, clinical translation remains hindered by concerns about interpretability, explainability, reliability, and workflow integration. We present GazeXPErT, a 4D eye-tracking dataset capturing expert search patterns during tumor detection and measurement on 346 FDG-PET/CT scans. Each study was read by a trainee and a board-certified nuclear medicine or radiology specialist using an eye-tracking-enabled annotation platform that simulates routine clinical reads. From 3,948 minutes of raw 60Hz eye-tracking data, 9,030 unique gaze-to-lesion trajectories were extracted, synchronized with PET/CT image slices, and rendered in COCO-style format for multiple machine learning applications. Baseline validation experiments demonstrate that a 3D nnUNet tumor segmentation model achieved superior performance when incorporating expert gaze patterns versus without (DICE score 0.6819 versus 0.6008), and that vision transformers trained on sequential gaze and PET/CT images can improve dynamic lesion localization (74.95% predicted gaze point closer to tumor) and expert intention prediction (Accuracy 67.53% and AUROC 0.747). GazeXPErT is a valuable resource designed to explore multiple machine learning problems beyond these baseline experiments, which include and are not limited to, visual grounding or causal reasoning, clinically explainable feature augmentation, human-computer interaction, human intention prediction or understanding, and expert gaze-rewarded modeling approaches to AI in oncologic FDG-PET/CT imaging.
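For readers unfamiliar with the segmentation metric quoted above, the following is a minimal Python sketch of the Dice coefficient on binary masks. It assumes masks are flattened to 0/1 sequences; the function name and the convention that two empty masks score 1.0 are illustrative choices, not details taken from the paper.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    denominator = pred_size + target_size
    if denominator == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / denominator


# Toy 4-voxel example: one voxel overlaps, each mask marks two voxels.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

In the toy example the score is 2 * 1 / (2 + 2) = 0.5. The reported gain from 0.6008 to 0.6819 is on this same 0-to-1 scale.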
eess.AS [Back]
[280] VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling eess.AS | cs.CL | cs.LG | cs.SD | eess.SPPDF
Yanir Marmor, Arad Zulti, David Krongauz, Adam Gabet, Yoad Snapir
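TL;DR: The paper introduces VoxKnesset, a large-scale longitudinal Hebrew speech dataset containing about 2,300 hours of Israeli parliamentary speech recorded from 2009 to 2025. It covers 393 speakers with recording spans of up to 15 years and includes aligned transcripts and verified demographic metadata. Using the dataset, the paper benchmarks modern speech embedding models on age prediction and speaker verification, revealing how age-related voice change affects system performance.
Details
Motivation: Speech processing systems face a fundamental challenge: the human voice changes with age, yet few existing datasets support rigorous longitudinal evaluation. A dataset that enables the study of vocal aging is therefore needed.
Result: In longitudinal speaker verification, the equal error rate (EER) of the strongest model (WavLM-Large) rises from 2.15% to 4.58% over 15 years. In age prediction, regressors trained on cross-sectional data fail to capture within-speaker aging, whereas models trained on longitudinal data recover a meaningful temporal signal.
Insight: The main contribution is creating and openly releasing VoxKnesset, the first large-scale longitudinal Hebrew speech dataset with rich metadata, a valuable resource for research on vocal aging, robust speaker recognition, and Hebrew speech processing. Viewed objectively, its construction method (drawing on public parliamentary records to ensure data quality and time span) and its emphasis on a longitudinal evaluation paradigm are of significant value for advancing aging-speaker modeling.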
Abstract: Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15% to 4.58% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to support aging-robust speech systems and Hebrew speech processing.
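The equal error rate cited above is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from verification scores. It is a generic illustration with made-up scores, not the paper's evaluation pipeline, and the threshold sweep over observed scores is an assumed simplification (toolkits often interpolate along the ROC curve instead).

```python
from typing import Sequence, Tuple


def error_rates(genuine: Sequence[float], impostor: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) when trials scoring >= threshold are accepted."""
    far = sum(1 for s in impostor if s >= threshold) / len(impostor)
    frr = sum(1 for s in genuine if s < threshold) / len(genuine)
    return far, frr


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Sweep observed scores as thresholds and take the point where FAR is closest to FRR."""
    best_gap, best_eer = float("inf"), 1.0
    for threshold in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, threshold)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Made-up similarity scores for same-speaker and different-speaker trials.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

In a longitudinal setup, one would presumably compute this separately for trial pairs grouped by the time gap between enrollment and test recordings, which is how a rise such as 2.15% to 4.58% over 15 years would become visible.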
cs.LG [Back]
[281] Stabilizing Policy Optimization via Logits Convexity cs.LG | cs.CLPDF
Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan
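TL;DR: This paper proposes Logits Convex Optimization (LCO), a policy optimization framework that targets the instability of reinforcement learning (RL) training. From a gradient perspective, it analyzes how the convexity of the supervised fine-tuning (SFT) loss with respect to model logits stabilizes training, and it designs LCO to emulate this effect, improving training stability and performance across multiple models and benchmarks.
Details
Motivation: RL training is unstable, especially with large language models, and lags behind SFT in stability. The paper sets out to explain, through a theoretical analysis of gradient directionality, the stabilizing mechanism provided by the convexity of the SFT loss, and to design an RL optimization method that emulates it.
Result: Extensive experiments across multiple model families show that LCO consistently improves training stability and outperforms conventional RL methods such as PPO on a broad range of benchmarks.
Insight: The claimed novelty is the theoretical finding that convexity of the SFT loss with respect to logits is key to stable training, together with the LCO framework that emulates this property in RL optimization. Viewed objectively, the core insight is to formalize the stabilizing mechanism of SFT (gradient directionality) and transfer it to RL, yielding a simple and effective approach to stabilized policy optimization.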
Abstract: While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
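The convexity property the abstract relies on can be checked numerically. The sketch below assumes the SFT loss is the usual softmax cross-entropy over logits (standard practice, though the abstract does not state the loss explicitly). For that loss, the Hessian with respect to the logits is diag(p) - p pᵀ, so the quadratic form vᵀHv equals the variance of v under p and is never negative. The code illustrates this fact only; it does not implement LCO.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-likelihood of the target class: logsumexp(z) - z[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def hessian_quadratic_form(logits: List[float], v: List[float]) -> float:
    """v^T (diag(p) - p p^T) v, i.e. the variance of v under p = softmax(logits)."""
    p = softmax(logits)
    mean = sum(pi * vi for pi, vi in zip(p, v))
    return sum(pi * vi * vi for pi, vi in zip(p, v)) - mean * mean


rng = random.Random(0)
min_form = float("inf")
midpoint_convex = True
for _ in range(200):
    z1 = [rng.gauss(0.0, 3.0) for _ in range(5)]
    z2 = [rng.gauss(0.0, 3.0) for _ in range(5)]
    v = [rng.gauss(0.0, 1.0) for _ in range(5)]
    y = rng.randrange(5)
    min_form = min(min_form, hessian_quadratic_form(z1, v))
    mid = [(a + b) / 2.0 for a, b in zip(z1, z2)]
    # Convexity: loss at the midpoint never exceeds the average of the endpoints.
    if cross_entropy(mid, y) > 0.5 * (cross_entropy(z1, y) + cross_entropy(z2, y)) + 1e-9:
        midpoint_convex = False
```

Both checks pass for every random draw, consistent with the loss being convex in the logits. A clipped surrogate objective such as PPO's has no such guarantee, which is the contrast the abstract draws.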
[282] Learn Hard Problems During RL with Reference Guided Fine-tuning cs.LG | cs.CLPDF
Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar
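TL;DR: This paper proposes Reference-Guided Fine-Tuning (ReGFT) to address the difficulty of applying reinforcement learning (RL) to mathematical reasoning under sparse rewards. The method uses human-written reference solutions to synthesize positive trajectories that the model can learn from and fine-tunes on them before RL, improving performance on hard problems.
Details
Motivation: RL for mathematical reasoning suffers from reward sparsity: on hard problems the LLM cannot sample any correct reasoning trajectory, so RL receives no useful positive feedback. Meanwhile, human reference solutions often lie outside the model's own reasoning distribution, so fine-tuning on them directly has limited effect.
Result: On three benchmarks (AIME24, AIME25, and BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL, effectively overcoming reward sparsity.
Insight: The novelty is to guide the model with a partial reference solution so that it generates trajectories within its own reasoning space for fine-tuning, which exploits human knowledge while keeping the trajectories learnable. This offers a simple and effective strategy for combining supervised learning and RL under sparse rewards.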
Abstract: Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, the LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, there often exist human-written reference solutions along with the problem (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model’s reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
[283] I Can’t Believe It’s Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift cs.LG | cs.CLPDF
Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha
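TL;DR: This paper systematically examines the assumption that safety classifiers trained on frozen embeddings remain robust when instruction-tuned models are updated. It finds that even small embedding drift (for example a normalized perturbation of σ=0.02, corresponding to roughly 1 degree of angular drift on the embedding sphere) causes classifier performance to collapse from 85% to 50% ROC-AUC while mean confidence drops by only 14%, producing silent failures with high-confidence misclassifications.
Details
Motivation: The aim is to test the representation-stability assumption common in production AI safety architectures, namely that safety classifiers built on frozen embeddings keep their performance across model updates, and to expose how fragile that assumption is in practice.
Result: Under embedding drift, classifier ROC-AUC drops sharply, 72% of misclassifications occur with high confidence, and instruction-tuned models show 20% worse class separability than base models, making aligned systems harder to safeguard.
Insight: The novelty is the first systematic quantification of the catastrophic effect of embedding drift on safety classifiers, which challenges the assumption that safety mechanisms transfer across model versions and reveals the paradox that instruction tuning may reduce representation robustness.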
Abstract: Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence only drops $14\%$, producing dangerous silent failures where $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
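The stated correspondence between a perturbation of magnitude 0.02 and roughly one degree of angular drift can be sanity-checked with a short Python sketch. It rests on an assumed reading of "normalized perturbation": a random direction scaled to norm σ is added to a unit-norm embedding, and the result is renormalized. The embedding dimension of 768 and the helper names are hypothetical, and the paper's exact perturbation procedure may differ.

```python
import math
import random
from typing import List


def unit(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angle_degrees(u: List[float], v: List[float]) -> float:
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def perturb(u: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Add a random direction of norm sigma to u, then renormalize (assumed procedure)."""
    noise = unit([rng.gauss(0.0, 1.0) for _ in u])
    return unit([a + sigma * b for a, b in zip(u, noise)])


rng = random.Random(0)
dim = 768  # hypothetical embedding dimension
embedding = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
angles = [angle_degrees(embedding, perturb(embedding, 0.02, rng)) for _ in range(100)]
mean_angle = sum(angles) / len(angles)
```

In high dimensions a random direction is nearly orthogonal to the embedding, so the drift is close to arctan(0.02), about 1.15 degrees, in line with the "approximately 1 degree" figure in the abstract.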
[284] Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain cs.LG | cs.AI | cs.CLPDF
Yuma Okochi, Fabio Milentiansen Sim, Tomoyasu Okada
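TL;DR: This paper proposes a general method for building high-quality synthetic instruction data starting from domain-specific vocabulary and applies it to the Japanese financial domain, producing a large-scale instruction dataset of about 9.5 billion tokens that includes Chain-of-Thought reasoning traces. Evaluation shows improved performance on financial benchmarks, and the paper analyzes the effect of reasoning-trace length and its limitations.
Details
Motivation: When adapting LLMs to a specific domain, achieving both domain expertise and reasoning ability remains a challenge.
Result: On financial benchmarks, the models improve over baseline models, supporting the effectiveness of the method.
Insight: The novelty lies in a general method for constructing synthetic instruction data from domain vocabulary and in building a large-scale dataset with Chain-of-Thought traces. Viewed objectively, the work provides a reusable data-construction framework and empirical evidence on how reasoning traces affect performance.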
Abstract: In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and constructed a large-scale instruction dataset totaling approximately 9.5 billion tokens with Chain-of-Thought reasoning traces. Evaluation results confirmed performance improvements over baseline models on financial benchmarks, demonstrating the effectiveness of our approach. We also report findings on the impact of reasoning trace length on performance and its limitations. Lastly, we open-source our models and datasets on https://huggingface.co/nri-ai .
[285] TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training cs.LG | cs.CLPDF
Jinluan Yang, Yuxin Liu, Zhengyu Chen, Chengcheng Han, Yueqing Sun
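TL;DR: This paper proposes TopoCurate, an interaction-aware framework for training tool-use agents. It projects multi-trial rollouts of the same task into a unified semantic quotient topology, turning scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environment responses drive the divergence between effective strategies and failure modes. On top of this representation, TopoCurate introduces a dual-selection mechanism: for supervised fine-tuning (SFT), it selects trajectories that exhibit reflective recovery, semantic efficiency, and strategic diversity; for reinforcement learning (RL), it selects tasks with high error-branch ratios and strategic heterogeneity.
Details
Motivation: The current paradigm for training tool-use agents (outcome-based filtering, such as SFT on successful trajectories or RL on pass-rate-selected tasks) ignores interaction dynamics. Successful trajectories may lack error recovery or contain redundancy, and pass rates cannot distinguish structurally informative tasks from trivial ones.
Result: Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
Insight: The core contribution is to project multi-trial rollouts into a unified semantic quotient topology, building a structured representation of interaction dynamics, and to design on top of it a dual-selection mechanism for SFT and RL that addresses covariate shift, mode collapse, and vanishing gradient signals in sparse-reward settings. Training-data selection is thus guided by interaction topology rather than by outcomes alone.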
Abstract: Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories demonstrating reflective recovery, semantic efficiency, and strategic diversity to mitigate covariate shift and mode collapse; for RL, we select tasks with high error branch ratios and strategic heterogeneity, maximizing gradient Signal-to-Noise Ratio to address vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines. We will release the code and data soon for further investigations.
[286] Efficient RLVR Training via Weighted Mutual Information Data Selection cs.LG | cs.CLPDF
Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo
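TL;DR: This paper proposes InSight, an information-guided data sampling method that improves the training efficiency of reinforcement learning with verifiable rewards (RLVR). Grounded in a weighted mutual information objective, it models data outcomes with Bayesian latent success rates and decomposes expected uncertainty reduction into complementary difficulty-dependent and evidence-dependent components, overcoming the limits of sampling strategies based on difficulty heuristics alone.
Details
Motivation: Existing online data selection strategies rely mainly on difficulty-based heuristics that favor datapoints with intermediate success rates. This implicitly equates difficulty with informativeness and neglects the epistemic uncertainty that arises from limited evidence. The paper aims to remove this efficiency bottleneck with a more efficient data sampling method.
Result: Extensive experiments show that InSight achieves state-of-the-art performance on multiple benchmarks and markedly improves training efficiency: an average gain of +1.41 on planning and mathematics benchmarks, +1.01 on general reasoning, and up to about 2.2x faster training, with negligible additional computational overhead.
Insight: The core contribution is a data selection framework based on weighted mutual information, which shows that expected uncertainty reduction consists of difficulty-dependent and evidence-dependent components and thereby challenges difficulty-only selection. The method builds a stable acquisition score from the mean belief about each datapoint's success rate rather than from noisy sampled outcomes, and it extends naturally to the multi-rollout settings common in RLVR, giving it both a sound theoretical basis and practical value.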
Abstract: Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints’ success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
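To make the difference between difficulty and evidence concrete, the sketch below computes a plain (unweighted) mutual information between a Bernoulli rollout outcome and a latent success rate with a Beta posterior. This is a generic Bayesian illustration and not InSight's acquisition score, whose weighting and exact form are not given in the abstract. The uniform Beta(1, 1) prior, the midpoint-rule integration, and all function names are assumptions.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def beta_pdf(p: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p))


def expected_information_gain(successes: int, failures: int,
                              prior_a: float = 1.0, prior_b: float = 1.0,
                              grid: int = 2000) -> float:
    """I(outcome; success rate) = H(E[p]) - E[H(p)] under the Beta posterior."""
    a = prior_a + successes
    b = prior_b + failures
    mean_belief = a / (a + b)
    step = 1.0 / grid
    expected_entropy = 0.0
    for i in range(grid):
        p = (i + 0.5) * step  # midpoint rule keeps p strictly inside (0, 1)
        expected_entropy += beta_pdf(p, a, b) * binary_entropy(p) * step
    return binary_entropy(mean_belief) - expected_entropy


# Two datapoints with the same mean belief (0.5) but different amounts of evidence.
fresh = expected_information_gain(successes=0, failures=0)
well_observed = expected_information_gain(successes=49, failures=49)
```

Both datapoints look equally "hard" to a difficulty-only heuristic because their estimated success rate is 0.5, yet the unobserved one carries far more expected information. That gap is the evidence-dependent component the abstract says difficulty-based selection neglects.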
[287] Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment cs.LG | cs.CL | cs.CVPDF
Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm
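TL;DR: By evaluating generative vision-language models (VLMs) on comic understanding tasks, this paper finds that semantic similarity is a spurious measure of comic understanding. It systematically analyzes the hallucinations that models produce during comic interpretation and points to targeted mitigation strategies.
Details
Motivation: Comics and manga are a storytelling medium that blind and visually impaired people cannot currently access. No existing system supports page-level comic understanding, and most research is confined to panel-level analysis, so it is necessary to evaluate VLMs on comic interpretation and to identify their hallucination problems.
Result: The study builds a preliminary benchmark of VLM performance on comic interpretation, and it identifies and categorizes the hallucinations that emerge, organizing them into generalized object-hallucination taxonomies. No specific quantitative metrics or state-of-the-art comparisons are mentioned.
Insight: The novelty is to expose the limits of semantic similarity as an evaluation metric for comic understanding and to systematize the types of hallucination VLMs produce in comic interpretation. Reusable elements include the fine-grained hallucination taxonomy and the suggested future directions on data curation and hallucination mitigation.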
Abstract: A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
[288] Learning from Synthetic Data Improves Multi-hop Reasoning cs.LG | cs.AI | cs.CLPDF
Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go
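TL;DR: This paper studies reinforcement learning (RL) fine-tuning of large language models on rule-generated synthetic data to improve multi-hop reasoning. Although the synthetic data contains only fictional knowledge, the fine-tuned models perform significantly better on real-world question-answering benchmarks, especially on complex questions, indicating that synthetic data can teach knowledge composition as a general reasoning skill.
Details
Motivation: RL fine-tuning conventionally depends on high-quality verifiable data (human annotations, data generated by frontier LLMs, or scores from LLM-based verifiers), but these sources are costly, hallucination-prone, inaccurate, or slow. The paper explores a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks.
Result: LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks. Stratifying by question difficulty shows that synthetic data helps most on complex questions, supporting the approach.
Insight: The novelty is to treat rule-generated synthetic data as a free, scalable resource for strengthening LLM reasoning, emphasizing that it teaches knowledge composition, a fundamental and generalizable skill, and thereby offering a way to lower data acquisition costs.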
Abstract: Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge – a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
[289] Recursive Models for Long-Horizon Reasoning cs.LG | cs.CLPDF
Chenxiao Yang, Nathan Srebro, Zhiyuan Li
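TL;DR: The paper proposes recursive models as a core approach to long-horizon reasoning under the bounded context of language models. A recursive model invokes itself to solve subtasks in isolated contexts. The paper proves that this requires exponentially smaller active context and shows that the approach outperforms frontier LLMs on Boolean satisfiability.
Details
Motivation: Modern language models reason within bounded context, which poses a fundamental barrier to long-horizon reasoning. The paper aims to overcome it through the principle of recursion.
Result: On Boolean satisfiability, a task requiring long-horizon combinatorial search, the trained 3B-parameter recursive model significantly outperforms frontier LLMs.
Insight: The novelty is to establish recursion as a core principle and propose recursive models as its minimal realization, with a proof that their context efficiency strictly surpasses single-sequence approaches and a generalization to modern agentic systems with arbitrary context processing and control flow.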
Abstract: Modern language models reason within bounded context, an inherent constraint that poses a fundamental barrier to long-horizon reasoning. We identify recursion as a core principle for overcoming this barrier, and propose recursive models as a minimal realization, where the model can recursively invoke itself to solve subtasks in isolated contexts. We prove that any computable problem admits a recursive decomposition in which each subtask requires only exponentially smaller active context than standard autoregressive models; this strictly surpasses any context management approach confined to a single sequence, such as summarization. We further generalize our framework to modern agentic systems with arbitrary context processing and control flows, and prove that recursive models can achieve optimal power within this broader class. Experimentally, we train a 3B model to reason recursively and evaluate on Boolean satisfiability, a task requiring long-horizon combinatorial search, where it significantly outperforms frontier LLMs.
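The idea of solving subtasks in isolated contexts can be illustrated by analogy with a classical recursive SAT search. In the toy Python sketch below, each call receives only the simplified sub-formula it must solve and returns only a short result, so no call ever holds the full search history. This is an analogy chosen for illustration and not the paper's trained model or training procedure. Clauses use DIMACS-style signed integers, and all function names are hypothetical.

```python
from typing import Dict, List, Optional

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def simplify(clauses: List[Clause], literal: int) -> List[Clause]:
    """Assume `literal` is true: drop satisfied clauses, remove the falsified literal."""
    reduced = []
    for clause in clauses:
        if literal in clause:
            continue  # clause already satisfied
        reduced.append([lit for lit in clause if lit != -literal])
    return reduced


def solve(clauses: List[Clause]) -> Optional[Dict[int, bool]]:
    """Return a satisfying partial assignment, or None if unsatisfiable.

    Each recursive call sees only its own sub-formula (its "isolated context")
    and hands back only the assignment it found.
    """
    if not clauses:
        return {}
    if any(len(clause) == 0 for clause in clauses):
        return None
    variable = abs(clauses[0][0])
    for literal in (variable, -variable):
        sub_result = solve(simplify(clauses, literal))
        if sub_result is not None:
            sub_result[variable] = literal > 0
            return sub_result
    return None


def satisfies(clauses: List[Clause], model: Dict[int, bool]) -> bool:
    """Check a model, treating unassigned variables as False."""
    return all(any(model.get(abs(lit), False) == (lit > 0) for lit in clause)
               for clause in clauses)


formula = [[1, 2], [-1, 3], [-2, -3]]
model = solve(formula)
unsat = solve([[1], [-1]])
```

The search tree can be exponentially large, but the working state of any single call is bounded by the size of its sub-formula. That is the intuition behind contrasting recursion with approaches that keep the whole trace in a single sequence.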
[290] Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems cs.LG | cs.CV
Datorien L. Anderson
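TL;DR: For discrete commitment systems (architectures whose outputs are {-W, 0, +W}), this paper proposes the Certainty-Validity (CVS) diagnostic framework. It decomposes model performance into a 2x2 matrix (high/low certainty versus valid/invalid predictions) and exposes the Confident-Incorrect (CI) failure mode that standard accuracy hides, namely hallucinating structure in ambiguous data. Ablation experiments on Fashion-MNIST, EMNIST, and IMDB analyze the “83% Ambiguity Ceiling” that the discrete model reaches on noisy benchmarks. The paper argues that refusing to commit to ambiguous samples is a feature rather than a defect: the model stops where structural evidence ends. However, standard training on ambiguous data leads to benign overfitting, which causes a pathological migration from Uncertain-Incorrect (appropriate doubt) to Confident-Incorrect (hallucination). The paper therefore argues that “good training” for reasoning systems should be defined not by accuracy but by maximizing the Certainty-Validity Score (CVS), so that the model knows where to stop.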
Details
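Motivation: Standard machine learning evaluation metrics (accuracy, precision, recall, AUROC) assume that all errors are equivalent, so a confident incorrect prediction is penalized the same as an uncertain one. For discrete commitment systems, this assumption is epistemologically flawed. The paper aims to expose and diagnose the failure modes that standard metrics hide.
Result: Ablation experiments on Fashion-MNIST, EMNIST, and IMDB show that the specific discrete architecture hits an “83% Ambiguity Ceiling” on noisy benchmarks, where performance plateaus. Unlike continuous models, which can break through this ceiling by memorizing texture or statistical noise, the discrete model refuses to commit to ambiguous samples. The CVS framework reveals the Confident-Incorrect behavior that standard accuracy masks.
Insight: The main contribution is the Certainty-Validity (CVS) diagnostic framework, which decomposes model performance and highlights Confident-Incorrect as a key failure mode. More broadly, the work argues that evaluation and training objectives for reasoning systems should go beyond accuracy and ensure that the model stays uncertain when evidence is insufficient (that is, maximize CVS), which offers a new methodological angle on preventing hallucination on ambiguous data. The discrete model stopping at the ambiguity ceiling is reinterpreted as a feature (stopping where structural evidence ends) rather than a limitation, which challenges training paradigms that simply chase higher accuracy.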
Abstract: Standard evaluation metrics for machine learning – accuracy, precision, recall, and AUROC – assume that all errors are equivalent: a confident incorrect prediction is penalized identically to an uncertain one. For discrete commitment systems (architectures that select committed states {-W, 0, +W}), this assumption is epistemologically flawed. We introduce the Certainty-Validity (CVS) Framework, a diagnostic method that decomposes model performance into a 2x2 matrix distinguishing high/low certainty from valid/invalid predictions. This framework reveals a critical failure mode hidden by standard accuracy: Confident-Incorrect (CI) behavior, where models hallucinate structure in ambiguous data. Through ablation experiments on Fashion-MNIST, EMNIST, and IMDB, we analyze the “83% Ambiguity Ceiling” – a stopping point where this specific discrete architecture consistently plateaus on noisy benchmarks. Unlike continuous models that can surpass this ceiling by memorizing texture or statistical noise, the discrete model refuses to commit to ambiguous samples. We show that this refusal is not a failure but a feature: the model stops where structural evidence ends. However, standard training on ambiguous data eventually forces Benign Overfitting, causing a pathological migration from Uncertain-Incorrect (appropriate doubt) to Confident-Incorrect (hallucination). We propose that “good training” for reasoning systems must be defined not by accuracy, but by maximizing the Certainty-Validity Score (CVS) – ensuring the model knows where to stop.
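The 2x2 decomposition can be sketched as a simple tally. The abstract names the Confident-Incorrect (CI) and Uncertain-Incorrect (UI) cells; the labels CC and UC, the scalar certainty value, and the 0.5 threshold used here are assumptions for illustration. The formula for the Certainty-Validity Score itself is not given in the abstract, so it is deliberately not computed.

```python
from typing import Dict, Iterable, Tuple


def certainty_validity_matrix(
    records: Iterable[Tuple[float, bool]],
    threshold: float = 0.5,  # hypothetical cut between "confident" and "uncertain"
) -> Dict[str, int]:
    """Tally predictions into the four certainty/validity cells.

    Each record is (certainty, is_correct). Keys: CC, CI, UC, UI, where the
    first letter is Confident/Uncertain and the second is Correct/Incorrect.
    """
    counts = {"CC": 0, "CI": 0, "UC": 0, "UI": 0}
    for certainty, is_correct in records:
        row = "C" if certainty >= threshold else "U"
        col = "C" if is_correct else "I"
        counts[row + col] += 1
    return counts


demo = certainty_validity_matrix(
    [(0.9, True), (0.8, False), (0.2, False), (0.1, True), (0.95, True)]
)
```

Two models with identical accuracy can differ sharply in how their errors split between CI and UI, which is the distinction the framework is built to surface.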
[291] Deep Learning-Based Meat Freshness Detection with Segmentation and OOD-Aware Classification cs.LG | cs.CV | eess.IV
Hutama Arif Bramantyo, Mukarram Ali Faridi, Rui Chen, Clarissa Harris, Yin Sun
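TL;DR: This study proposes a deep learning framework for meat freshness detection that handles images of both packaged and unpackaged meat. A U-Net segmentation module isolates the meat region to reduce background interference, several deep neural network backbones then perform four-class freshness classification, and an out-of-distribution (OOD)-aware abstention mechanism flags low-confidence samples.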
Details
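Motivation: To detect meat freshness from RGB images accurately and robustly. In practice, this means handling both packaged and unpackaged meat, reducing background interference, and identifying unknown or low-confidence samples (OOD cases) to make the system more reliable.
Result: On segmentation, the U-Net module achieves 75% IoU and an 82% Dice coefficient. On classification, EfficientNet-B0 reaches the highest accuracy on the held-out in-distribution test set (98.10%), with ResNet-50 and MobileNetV3-Small both at 97.63%. The paper also evaluates OOD scoring and threshold settings, and reports on-device latency on a smartphone using TFLite to weigh accuracy against speed.
Insight: The contribution is to use segmentation as a preprocessing step that standardizes the classifier's inputs, combined with an OOD-aware mechanism that improves reliability in deployment. The study also systematically compares several modern CNN and Transformer backbones on this task and accounts for on-device latency, which gives a practical framework for industrial food inspection.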
Abstract: In this study, we present a meat freshness classification framework from Red-Green-Blue (RGB) images that supports both packaged and unpackaged meat datasets. The system classifies four in-distribution (ID) meat classes and uses an out-of-distribution (OOD)-aware abstention mechanism that flags low-confidence samples as No Result. The pipeline combines U-Net-based segmentation with deep feature classifiers. Segmentation is used as a preprocessing step to isolate the meat region and reduce background, producing more consistent inputs for classification. The segmentation module achieved an Intersection over Union (IoU) of 75% and a Dice coefficient of 82%, producing standardized inputs for the classification stage. For classification, we benchmark five backbones: Residual Network-50 (ResNet-50), Vision Transformer-Base/16 (ViT-B/16), Swin Transformer-Tiny (Swin-T), EfficientNet-B0, and MobileNetV3-Small. We use nested 5x3 cross-validation (CV) for model selection and hyperparameter tuning. On the held-out ID test set, EfficientNet-B0 achieves the highest accuracy (98.10%), followed by ResNet-50 and MobileNetV3-Small (both 97.63%) and Swin-T (97.51%), while ViT-B/16 is lower (94.42%). We additionally evaluate OOD scoring and thresholding using standard OOD metrics and sensitivity analysis over the abstention threshold. Finally, we report on-device latency using TensorFlow Lite (TFLite) on a smartphone, highlighting practical accuracy-latency trade-offs for future deployment.
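The two segmentation metrics and the abstention step can be sketched in a few lines. The IoU and Dice formulas are the standard definitions. The abstention rule shown here (maximum softmax probability compared against a fixed threshold) is only one common OOD score and is an assumption, as are the 0.7 threshold and the placeholder class labels; the abstract does not state which score or class names the paper uses.

```python
import math
from typing import List, Sequence, Tuple


def iou_and_dice(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """IoU and Dice for two flattened binary masks (values 0 or 1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    union = total - inter
    iou = inter / union if union else 1.0   # two empty masks count as a perfect match
    dice = 2 * inter / total if total else 1.0
    return iou, dice


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def classify_or_abstain(
    logits: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.7,  # hypothetical abstention threshold
) -> str:
    """Return the top label, or "No Result" when confidence is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "No Result"


LABELS = ["class_0", "class_1", "class_2", "class_3"]  # placeholder names
```

Raising the threshold trades coverage for reliability, which is the kind of sensitivity analysis over the abstention threshold that the abstract describes.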
[292] Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols cs.LG | cs.CV
Xu Luo, Ji Zhang, Lianli Gao, Heng Tao Shen, Jingkuan Song
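TL;DR: This paper introduces the FEWTRANS benchmark, comprising 10 diverse datasets, and the Hyperparameter Ensemble (HPE) protocol, which addresses the “validation set illusion” in few-shot settings. The empirical study shows that the choice of pre-trained model is the dominant factor for performance, while many sophisticated transfer methods offer no significant advantage over a simple full-parameter fine-tuning baseline. A mechanistic analysis shows that full fine-tuning avoids overfitting through distributed micro-adjustments and flexible reshaping of high-level semantic representations. The study also quantifies the performance collapse of multimodal models in specialized domains caused by linguistic rarity.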
Details
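Motivation: Few-shot transfer learning currently lacks a unified, rigorous evaluation protocol that reflects real-world usage, which makes it hard to compare pre-trained models and transfer methods fairly.
Result: On the FEWTRANS benchmark, experiments show that the choice of pre-trained model is the key to performance, while sophisticated transfer methods offer limited gains over the full-parameter fine-tuning baseline. The paper also uses adjusted Zipf frequency scores to quantify the performance drop of multimodal models in specialized domains.
Insight: The contribution is a comprehensive benchmark (FEWTRANS) and an evaluation protocol (HPE) that address bias in few-shot evaluation. The analysis reveals the unexpected effectiveness of full-parameter fine-tuning in few-shot settings and the mechanism behind it, giving future research a more reliable evaluation tool.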
Abstract: Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms. However, the field lacks a unified, rigorous evaluation protocol that is both challenging and realistic for real-world usage. In this work, we establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and propose the Hyperparameter Ensemble (HPE) protocol to overcome the “validation set illusion” in data-scarce regimes. Our empirical findings demonstrate that the choice of pre-trained model is the dominant factor for performance, while many sophisticated transfer methods offer negligible practical advantages over a simple full-parameter fine-tuning baseline. To explain this surprising effectiveness, we provide an in-depth mechanistic analysis showing that full fine-tuning succeeds via distributed micro-adjustments and more flexible reshaping of high-level semantic representations without suffering from overfitting. Additionally, we quantify the performance collapse of multimodal models in specialized domains as a result of linguistic rarity using adjusted Zipf frequency scores. By releasing FEWTRANS, we aim to provide a rigorous “ruler” to streamline reproducible advances in few-shot transfer learning research. We make the FEWTRANS benchmark publicly available at https://github.com/Frankluox/FewTrans.
[293] When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning cs.LG | cs.CV
Joshua Steier
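TL;DR: This paper studies how the way the positive-pair margin is applied in Contrastive Forward-Forward (CFF) learning, namely through saturating similarity clamping, affects training variance. The author proves that an alternative formulation is gradient-neutral, and experiments on CIFAR-10 and other datasets show that clamping can substantially increase test-accuracy variance, with an effect that depends strongly on dataset characteristics such as positive-pair density per batch and task difficulty.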
Details
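Motivation: CFF training is sensitive to the random seed, but the sources of this instability are unclear. The paper investigates whether an implementation detail of the positive-pair margin in the contrastive loss, specifically similarity clamping, is a key cause of the increased training variance.
Result: On CIFAR-10, clamping increases test-accuracy variance by 5.90x (p=0.003) with no difference in mean accuracy. On CIFAR-100, SVHN, and Fashion-MNIST, however, clamping yields equal or lower variance. An SVHN difficulty sweep confirms that the variance ratio is 0.25x at high accuracy and rises to 16.73x under aggressive augmentation.
Insight: The contribution is to show that the effect of margin clamping on training variance is dataset-dependent and governed by two factors: positive-pair density per batch and task difficulty. The paper proposes that switching to the gradient-neutral subtraction reference removes the variance inflation without hurting mean performance, and suggests measuring the layer-0 clamp activation rate as a simple check for whether the problem applies.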
Abstract: Contrastive Forward-Forward (CFF) learning trains Vision Transformers layer by layer against supervised contrastive objectives. CFF training can be sensitive to random seed, but the sources of this instability are poorly understood. We focus on one implementation detail: the positive-pair margin in the contrastive loss is applied through saturating similarity clamping, $\min(s + m, 1)$. We prove that an alternative formulation, subtracting the margin after the log-probability, is gradient-neutral under the mean-over-positives reduction. On CIFAR-10 ($2 \times 2$ factorial, $n{=}7$ seeds per cell), clamping produces $5.90\times$ higher pooled test-accuracy variance ($p{=}0.003$) with no difference in mean accuracy. Analyses of clamp activation rates, layerwise gradient norms, and a reduced-margin probe point to saturation-driven gradient truncation at early layers. The effect does not transfer cleanly to other datasets: on CIFAR-100, SVHN, and Fashion-MNIST, clamping produces equal or lower variance. Two factors account for the discrepancy. First, positive-pair density per batch controls how often saturation occurs. Second, task difficulty compresses seed-to-seed spread when accuracy is high. An SVHN difficulty sweep confirms the interaction on a single dataset, with the variance ratio moving from $0.25\times$ at high accuracy to $16.73\times$ under aggressive augmentation. In moderate-accuracy regimes with many same-class pairs per batch, switching to the gradient-neutral subtraction reference removes this variance inflation at no cost to mean accuracy. Measuring the layer-0 clamp activation rate serves as a simple check for whether the problem applies.
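The saturation mechanism can be illustrated with the clamp alone. The sketch below implements $\min(s + m, 1)$, its derivative with respect to $s$, and a clamp activation rate over a batch of positive-pair similarities. The function names and the example values are assumptions; how the paper measures the rate per layer, and the full contrastive loss, are not reproduced here.

```python
from typing import Sequence


def clamped_similarity(s: float, m: float) -> float:
    """Apply the positive-pair margin by saturating clamping: min(s + m, 1)."""
    return min(s + m, 1.0)


def clamp_grad(s: float, m: float) -> float:
    """Derivative of min(s + m, 1) w.r.t. s: 1 below saturation, 0 once clamped."""
    return 1.0 if s + m < 1.0 else 0.0


def clamp_activation_rate(positive_sims: Sequence[float], m: float) -> float:
    """Fraction of positive pairs whose similarity is saturated by the clamp."""
    if not positive_sims:
        return 0.0
    active = sum(1 for s in positive_sims if s + m >= 1.0)
    return active / len(positive_sims)


# Illustrative similarities for four positive pairs and a margin of 0.1.
rate = clamp_activation_rate([0.95, 0.5, 0.92, 0.1], m=0.1)
```

Saturated pairs contribute no gradient through the similarity term, so a high activation rate means much of the positive-pair signal is being truncated, consistent with the abstract's account of gradient truncation at early layers.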
[294] Rate-Distortion Signatures of Generalization and Information Trade-offs cs.LG | cs.CV | cs.IT | q-bio.NC
Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin
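TL;DR: This paper proposes a framework based on rate-distortion theory for analyzing and comparing how visual systems generalize under image perturbations. The framework treats stimulus-response behavior as an effective communication channel, derives rate-distortion frontiers from confusion matrices, and uses two geometric signatures, slope (β) and curvature (κ), to capture the marginal cost and abruptness of a system's accuracy-robustness trade-off. Applying the framework to human psychophysics data and 18 deep vision models, the authors find that biological and artificial systems both follow a lossy-compression principle but occupy different regions of rate-distortion space. Humans show smoother, more flexible trade-offs, whereas modern deep networks show steeper, more brittle trade-offs even at matched accuracy.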
Details
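Motivation: Evaluating how well visual systems, both human and machine, generalize to novel visual conditions is difficult, and standard robustness metrics reveal little about how a system trades accuracy for robustness.
Result: The framework is applied to 18 deep vision models and to human psychophysics data under controlled image perturbations. Humans and artificial systems follow a common rate-distortion compression principle but occupy different regions of rate-distortion space. Humans show smoother trade-offs (lower β and κ), while deep networks show steeper, more brittle ones. Different training regimes, such as robustness training, induce systematic but dissociable shifts in β and κ, revealing that improved robustness or accuracy does not always translate into more human-like generalization geometry.
Insight: The contribution is a compact, model-agnostic framework grounded in information theory (rate-distortion theory) that quantifies generalization behavior with two interpretable geometric signatures (β and κ), going beyond standard accuracy-based metrics. It provides a new theoretical lens and analysis tool for comparing generalization across biological and artificial systems, and reveals systematic differences in generalization geometry between deep networks and human vision.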
Abstract: Generalization to novel visual conditions remains a central challenge for both human and machine vision, yet standard robustness metrics offer limited insight into how systems trade accuracy for robustness. We introduce a rate-distortion-theoretic framework that treats stimulus-response behavior as an effective communication channel, derives rate-distortion (RD) frontiers from confusion matrices, and summarizes each system with two interpretable geometric signatures - slope ($β$) and curvature ($κ$) - which capture the marginal cost and abruptness of accuracy-robustness trade-offs. Applying this framework to human psychophysics and 18 deep vision models under controlled image perturbations, we compare generalization geometry across model architectures and training regimes. We find that both biological and artificial systems follow a common lossy-compression principle but occupy systematically different regions of RD space. In particular, humans exhibit smoother, more flexible trade-offs, whereas modern deep networks operate in steeper and more brittle regimes even at matched accuracy. Across training regimes, robustness training induces systematic but dissociable shifts in $β$ and $κ$, revealing cases where improved robustness or accuracy does not translate into more human-like generalization geometry. These results demonstrate that RD geometry provides a compact, model-agnostic lens for comparing generalization behavior across systems beyond standard accuracy-based metrics.
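To show how a confusion matrix can be read as a communication channel, the sketch below computes one (rate, distortion) point from a matrix of counts: rate as the mutual information between stimulus and response in bits, and distortion as the error rate. Both choices are assumptions made for illustration. The abstract does not specify the paper's distortion measure, how the full RD frontier is traced, or how $β$ and $κ$ are fitted, so none of that is reproduced here.

```python
import math
from typing import Sequence, Tuple


def rate_and_distortion(confusion: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (mutual information in bits, error rate) for a count matrix.

    Rows are true stimulus classes; columns are responses.
    """
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    rate = 0.0
    for i, row in enumerate(confusion):
        for j, n in enumerate(row):
            if n > 0:
                # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
                rate += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return rate, 1.0 - correct / total


perfect = rate_and_distortion([[50, 0], [0, 50]])      # noiseless two-class channel
chance = rate_and_distortion([[25, 25], [25, 25]])     # responses carry no information
```

Evaluating this at several perturbation strengths would give a set of points in RD space, from which a frontier and its slope and curvature could in principle be estimated.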