Table of Contents
- cs.CL [Total: 15]
- cs.CV [Total: 41]
- q-bio.NC [Total: 1]
- cs.LG [Total: 2]
- cs.AI [Total: 4]
- cs.IR [Total: 1]
- cs.CR [Total: 1]
- cs.RO [Total: 1]
cs.CL
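[1] Uncovering Competency Gaps in Large Language Models and Their Benchmarks cs.CL | cs.AI | cs.LG
Matyas Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler
TL;DR: This paper proposes a new method based on sparse autoencoders (SAEs) that automatically uncovers concept-level weaknesses in large language models (LLMs), called model gaps, and imbalanced concept coverage in the benchmarks themselves, called benchmark gaps. By extracting SAE concept activations and computing saliency-weighted performance scores over benchmark data, the method grounds evaluation in the model's internal representations and supports comparison across benchmarks. Applied to two open-source models and ten benchmarks, it shows that the models underperform on concepts opposed to sycophancy and on safety-related concepts, while many benchmarks over-represent concepts such as obedience, authority, or instruction-following.
Details
Motivation: LLM evaluation relies heavily on standardized benchmarks, but their aggregate metrics can hide sub-areas where a model is weak (model gaps) and imbalanced concept coverage in the benchmarks themselves (benchmark gaps). A method is therefore needed that reveals these gaps automatically and without supervision.
Result: The method was applied to two popular open-source models and ten benchmarks. The models consistently underperformed on concepts that stand in contrast to sycophantic behavior (such as politely refusing a request or asserting boundaries) and on concepts connected to safety discussions; these model gaps agree with observations in the literature. Many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following while missing core concepts that fall within their intended scope.
Insight: The novelty is a representation-grounded evaluation method that uses sparse autoencoders to decompose benchmark scores at the concept level, automatically and without supervision, revealing why a model scored as it did and how benchmarks could be improved. It gives model evaluation a finer-grained, interpretable analysis tool that complements conventional aggregate metrics.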
Abstract: The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak (“model gaps”) and (ii) imbalanced coverage in the benchmarks themselves (“benchmark gaps”). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model’s internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions. These model gaps align with observations previously surfaced in the literature; our automated, unsupervised method was able to recover them without manual supervision. We also observed benchmark gaps: many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following, while missing core concepts that should fall within their intended scope. In sum, our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores. Rather than replacing conventional aggregated metrics, CG complements them by providing a concept-level decomposition that can reveal why a model scored as it did and how benchmarks could evolve to better reflect their intended scope. Code is available at https://competency-gaps.github.io.
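The saliency-weighted scoring mentioned in the abstract can be pictured with a small sketch. The weighting formula, the input layout, and the name `concept_scores` are illustrative assumptions; the abstract does not give the paper's exact definition.

```python
def concept_scores(activations, correct):
    """Activation-weighted accuracy and coverage per concept (hypothetical formula).

    activations: one row per benchmark example, one non-negative
                 SAE concept activation per column
    correct:     1 if the model answered that example correctly, else 0
    """
    n_concepts = len(activations[0])
    # Total saliency each concept receives across the benchmark.
    weight = [sum(row[c] for row in activations) for c in range(n_concepts)]
    total = sum(weight)
    # Low score on a concept suggests a possible "model gap".
    score = [
        sum(row[c] * ok for row, ok in zip(activations, correct)) / weight[c]
        if weight[c] > 0 else None
        for c in range(n_concepts)
    ]
    # Low or skewed coverage suggests a possible "benchmark gap".
    coverage = [w / total if total > 0 else 0.0 for w in weight]
    return score, coverage


acts = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
correct = [1, 1, 1, 0]
score, coverage = concept_scores(acts, correct)
```

In this toy input the second concept is active only on the last two examples, one of which the model gets wrong, so its score is lower than that of the first concept.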
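[2] Semantic Deception: When Reasoning Models Can’t Compute an Addition cs.CL
Nathaniël de Leeuw, Marceau Nahon, Mathis Reymond, Raja Chatila, Mehdi Khamassi
TL;DR: This paper introduces a semantic deception experimental framework to probe how LLMs reason over novel symbols. It finds that models are easily misled by the form of a symbol and over-rely on surface-level semantic associations, which causes marked performance drops on simple calculation tasks.
Details
Motivation: LLMs are increasingly used in decision-making tasks where human values are at stake. The study asks whether they possess genuine symbolic abstraction and reasoning abilities or rely only on semantic associations learned during training.
Result: Experiments on four LLMs show that semantic cues can severely degrade performance on simple calculation tasks, exposing the limits of current models in symbolic manipulation.
Insight: The novelty is the notion of semantic deception: digits and operators are redefined with new symbols to test the abstraction ability of LLMs. Viewed objectively, the study warns that LLMs may over-rely on statistical correlations rather than logical reasoning, which has significant ethical implications for decision-making applications that need robust symbolic reasoning.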
Abstract: Large language models (LLMs) are increasingly used in situations where human values are at stake, such as decision-making tasks that involve reasoning when performed by humans. We investigate the so-called reasoning capabilities of LLMs over novel symbolic representations by introducing an experimental framework that tests their ability to process and manipulate unfamiliar symbols. We introduce semantic deceptions: situations in which symbols carry misleading semantic associations due to their form, such as being embedded in specific contexts, designed to probe whether LLMs can maintain symbolic abstraction or whether they default to exploiting learned semantic associations. We redefine standard digits and mathematical operators using novel symbols, and task LLMs with solving simple calculations expressed in this altered notation. The objective is: (1) to assess LLMs’ capacity for abstraction and manipulation of arbitrary symbol systems; (2) to evaluate their ability to resist misleading semantic cues that conflict with the task’s symbolic logic. Through experiments with four LLMs we show that semantic cues can significantly deteriorate reasoning models’ performance on very simple tasks. They reveal limitations in current LLMs’ ability for symbolic manipulations and highlight a tendency to over-rely on surface-level semantics, suggesting that chain-of-thoughts may amplify reliance on statistical correlations. Even in situations where LLMs seem to correctly follow instructions, semantic cues still impact basic capabilities. These limitations raise ethical and societal concerns, undermining the widespread and pernicious tendency to attribute reasoning abilities to LLMs and suggesting how LLMs might fail, in particular in decision-making contexts where robust symbolic reasoning is essential and should not be compromised by residual semantic associations inherited from the model’s training.
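To make the altered-notation setup concrete, here is a minimal sketch that rewrites an addition problem under two symbol mappings. Both mappings are invented for illustration; in particular, reusing familiar digit glyphs with swapped meanings is only one assumed way to create a misleading cue, not necessarily the paper's construction.

```python
# Hypothetical mappings from standard symbols to "novel" ones.
NEUTRAL = dict(zip("0123456789+", "◆●▲■★☆○△□◇⊕"))    # unfamiliar glyphs
DECEPTIVE = dict(zip("0123456789+", "9876543210-"))  # familiar glyphs, swapped meanings


def encode(expr, mapping):
    """Rewrite an expression in the altered notation."""
    return "".join(mapping.get(ch, ch) for ch in expr)


def decode(expr, mapping):
    """Map altered notation back to standard digits and operators."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(ch, ch) for ch in expr)


problem = "12+7"
neutral_prompt = encode(problem, NEUTRAL)
deceptive_prompt = encode(problem, DECEPTIVE)

# Ground truth: solve in standard notation, then re-encode the answer.
left, right = decode(deceptive_prompt, DECEPTIVE).split("+")
encoded_answer = encode(str(int(left) + int(right)), DECEPTIVE)
```

Under the deceptive mapping the prompt looks like an ordinary subtraction, so a model that follows surface form rather than the stated symbol definitions would compute the wrong quantity.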
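[3] MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs cs.CL | cs.AI
Zhan Qu, Michael Färber
TL;DR: This paper presents MediEval, a benchmark that links MIMIC-IV electronic health records to a unified medical knowledge base (built from UMLS and other vocabularies) and generates diverse factual and counterfactual medical statements in real patient contexts, in order to evaluate LLMs systematically on knowledge grounding and contextual consistency. To address the failure modes the evaluation reveals, it also proposes CoRFu, a DPO-based counterfactual risk-aware fine-tuning method that improves accuracy and safety.
Details
Motivation: Existing medical LLM evaluations either test factual knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. The paper aims to fill it with a unified benchmark that systematically evaluates the reliability and safety of LLMs in medicine.
Result: Evaluation on MediEval shows that current proprietary, open-source, and domain-specific LLMs frequently exhibit critical failure modes such as hallucinated support and truth inversion. CoRFu improves macro-F1 by 16.4 points over the base model and eliminates truth inversion errors, giving higher accuracy and substantially greater safety.
Insight: The novelty lies in a medical benchmark (MediEval) that combines real patient context with a unified knowledge base, together with a four-quadrant evaluation framework that jointly considers knowledge grounding and contextual consistency. Viewed objectively, CoRFu, which uses a DPO-based asymmetric penalty to target unsafe confusions, is an effective technical route to safer medical LLMs.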
Abstract: Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
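[4] Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning cs.CL | cs.AI | cs.LG
NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant
TL;DR: This paper presents Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts language model with a hybrid Mamba-Transformer architecture. It was pretrained on 25 trillion text tokens, followed by supervised fine-tuning and large-scale reinforcement learning. It is more accurate than the previous-generation Nemotron 2 Nano, reaches higher inference throughput than similarly sized open models, shows enhanced agentic, reasoning, and chat abilities, and supports context lengths of up to 1M tokens.
Details
Motivation: To build a more efficient and more capable open language model by combining Mamba and Transformer layers with a Mixture-of-Experts design, reducing the parameters activated per forward pass while improving inference speed and task performance, especially in complex settings such as agentic reasoning.
Result: The model activates fewer than half as many parameters per forward pass as Nemotron 2 Nano while being more accurate. It achieves up to 3.3x higher inference throughput than similarly sized open models such as GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, and is also more accurate on popular benchmarks.
Insight: The novelty is the combination of Mamba's efficient sequence modeling with Transformer layers under a Mixture-of-Experts design, which balances sparse parameter activation against computational efficiency. Large-scale multi-stage training (pretraining, supervised fine-tuning, reinforcement learning) strengthens agentic reasoning, and the open release improves accessibility.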
Abstract: We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.
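[5] NVIDIA Nemotron 3: Efficient and Open Intelligence cs.CL | cs.AI | cs.LG
NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant
TL;DR: NVIDIA introduces the Nemotron 3 family of models (Nano, Super, and Ultra), which deliver strong agentic, reasoning, and conversational capabilities. The family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. The models are post-trained with multi-environment reinforcement learning and support reasoning, multi-step tool use, and granular reasoning budget control.
Details
Motivation: To provide a family of efficient, open, and capable language models that covers needs ranging from low-cost inference to high-performance complex tasks (such as IT ticket automation) and that advances the open AI ecosystem.
Result: Nano outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads. Ultra provides state-of-the-art accuracy and reasoning performance.
Insight: The main innovations are: 1) a hybrid Mamba-Transformer MoE architecture that balances efficiency and capability; 2) LatentMoE, a novel approach that improves model quality; 3) post-training with multi-environment reinforcement learning to strengthen reasoning and tool use; 4) MTP layers in the larger models for faster text generation. The commitment to openly release model weights, training software, and recipes also matters for the ecosystem.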
Abstract: We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning enabling reasoning, multi-step tool use, and support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.
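[6] Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation cs.CL
Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen
TL;DR: This paper proposes a cross-model provenance tracing framework for reasoning distillation that analyzes where the content generated by a distilled model comes from, and builds on it a teacher-guided data selection method that improves the generalization of distilled models.
Details
Motivation: Existing reasoning distillation methods lack a detailed analysis of where the distilled model's capabilities originate. It is unclear whether the student behaves consistently with the teacher in novel test-time contexts, which raises concerns about how well distilled models generalize.
Result: Experiments show that, in test-time contexts, the distilled model does generate teacher-originated behaviors, which correlate with and plausibly explain the observed performance. The teacher-guided data selection method is validated across multiple representative teacher models and diverse student models.
Insight: The novelty is a systematic provenance framework that disentangles the sources of a distilled model's behavior and, built on it, a principled data selection criterion that directly compares teacher-student divergences on the training data. Together they offer a new perspective for understanding and improving reasoning distillation.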
Abstract: Reasoning distillation has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher’s behavior in training contexts. However, previous approaches have lacked a detailed analysis of the origins of the distilled model’s capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distilled models. To analyse this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into different categories. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain the observed performance of the distilled model. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approaches that rely on heuristics, our method directly compares teacher-student divergences on the training data, providing a principled selection criterion. We validate the effectiveness of our approach across multiple representative teacher models and diverse student models. The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing and our insights into reasoning distillation with the community.
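The probability comparison described in the abstract could look roughly like the sketch below. The category names, the single threshold, and the function `classify_action` are assumptions made for illustration; the abstract does not list the paper's actual categories or decision rule, and the distilled model's own probability is omitted here for brevity.

```python
def classify_action(p_teacher, p_student, high=0.5):
    """Label one action produced by the distilled model (hypothetical rule).

    p_teacher: probability the teacher assigns to the action in this context
    p_student: probability the original (pre-distillation) student assigns
    high:      assumed cut-off above which a model is said to "support" it
    """
    teacher_supports = p_teacher >= high
    student_supports = p_student >= high
    if teacher_supports and not student_supports:
        return "teacher-originated"
    if student_supports and not teacher_supports:
        return "student-originated"
    if teacher_supports and student_supports:
        return "shared"
    return "novel"


labels = [classify_action(t, s) for t, s in [(0.9, 0.1), (0.2, 0.8), (0.7, 0.6), (0.1, 0.1)]]
```

Counting how often each label occurs on test-time outputs gives a rough picture of how much of the distilled model's behavior traces back to the teacher.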
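[7] Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study cs.CL | cs.SD
Zhongren Dong, Haotian Guo, Weixiang Xu, Huan Zhao, Zixing Zhang
TL;DR: This paper proposes FEND, a foundation model-based multi-modal evaluation framework for detecting neuropsychiatric disorders such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD). The framework integrates speech and text modalities and is evaluated systematically on 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch. Multi-modal fusion excels in AD and depression detection but underperforms in ASD detection because of dataset heterogeneity, and the study identifies modality imbalance as a prevalent issue.
Details
Motivation: To address limited multi-lingual generalization, the lack of a unified evaluation framework, and the challenges that multi-modal methods face (such as modality imbalance and dataset heterogeneity) in neuropsychiatric disorder detection.
Result: Experiments on multi-lingual datasets covering AD, depression, and ASD show that multi-modal fusion performs well for AD and depression detection but poorly for ASD detection. Cross-corpus experiments show robust performance when task and language are consistent, and noticeable degradation in multi-lingual and task-heterogeneous settings.
Insight: The novelty is a unified, foundation model-based, multi-modal, multi-lingual, and lifespan-inclusive evaluation framework for neuropsychiatric disorders (FEND), together with a systematic analysis of the factors that affect multi-modal fusion (such as modality imbalance and dataset heterogeneity) and of its applicability across disorders and languages. It provides benchmarks for fair comparison and reproducible research.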
Abstract: Neuropsychiatric disorders, such as Alzheimer’s disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.
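[8] Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models cs.CL | cs.AI
Xiang Zhang, Jiaqi Wei, Yuejin Yang, Zijie Qiu, Yuhan Chen
TL;DR: This paper proposes reflection pretraining to address the fact that biological sequence models (such as protein and RNA language models) cannot use chain-of-thought (CoT) reasoning because their token spaces have limited expressiveness. Auxiliary "thinking tokens" increase the expressiveness of the biological language, so the model can perform intermediate reasoning and self-correction, which improves task performance.
Details
Motivation: The aim is to extend CoT prompting, which has been successful in natural language processing, to non-natural-language domains such as protein and RNA sequences. Because biological sequence tokens (for example, amino acid tokens) have limited expressiveness, conventional CoT cannot be applied directly, so the language's expressiveness must be increased to support complex reasoning.
Result: Experiments show that reflection pretraining teaches protein models to self-correct and yields substantial performance gains over standard pretraining.
Insight: The novelty is the first use of reflection pretraining in a biological sequence model: extending the token set with thinking tokens increases the expressiveness of the biological language and enables CoT-style intermediate reasoning and self-correction, opening a new route for biological sequence models to solve complex tasks.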
Abstract: Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non-answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary “thinking tokens” beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.
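[9] Automatic Replication of LLM Mistakes in Medical Conversations cs.CL | cs.AI | cs.LG
Oleksii Proniakin, Diego Fajardo, Ruslan Nazarenko, Razvan Marinescu
TL;DR: This paper proposes MedMistake, an automatic pipeline that extracts the mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. The pipeline simulates conversations between an LLM patient and an LLM doctor, evaluates them along multiple dimensions with a committee of LLM judges, and reduces the mistakes to QA pairs. The authors release MedMistake-All, a dataset of 3,390 QA pairs, and MedMistake-Bench, a subset of 211 questions validated by medical experts, and use the latter to evaluate 12 frontier LLMs.
Details
Motivation: LLMs are evaluated in clinical settings with multi-dimensional rubrics, but replicating the mistakes one LLM makes in a specific conversation on other models usually takes substantial manual effort, and no automated method exists. The paper aims to identify and reproduce LLM mistakes in patient-doctor conversations automatically.
Result: Twelve frontier LLMs were evaluated on the expert-validated MedMistake-Bench subset; GPT models, Claude, and Grok performed best. The 3,390 QA pairs in MedMistake-All are ones that GPT-5 and Gemini 2.5 Pro currently fail to answer correctly, as judged by two LLM judges.
Insight: The novelty is a fully automatic pipeline that extracts and reproduces LLM mistakes from complex conversational settings (such as patient-doctor dialogue) and turns them into a scalable benchmark. It provides a new tool for systematically evaluating and comparing the error patterns of different LLMs in a specific domain, which can help improve model safety and reliability.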
Abstract: Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLM models is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude and Grok obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench), as well as the full dataset (MedMistake-All) at https://huggingface.co/datasets/TheLumos/MedicalMistakeBenchmark.
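[10] Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation cs.CL | cs.AI
Wei-Rui Chen, Vignesh Kothapalli, Ata Fatahibaarzi, Hejian Sang, Shao Tang
TL;DR: This paper proposes an efficient reasoning distillation method that reduces computational cost through sequence truncation. It finds that supervising only the chain-of-thought (CoT) tokens can be effective, and that training on only the first 50% of the tokens of each sequence retains about 94% of full-sequence performance on math benchmarks while cutting training time, memory usage, and FLOPs by about 50% each.
Details
Motivation: Distilling reasoning ability from a large language model into a smaller student requires processing long sequences made of prompt, chain-of-thought, and answer segments, which is computationally expensive. The paper studies how allocating supervision across these segments affects the student, with the goal of improving efficiency.
Result: On math benchmarks, training on only the first 50% of tokens retains about 94% of full-sequence performance on average while reducing training time, memory usage, and FLOPs by about 50% each, an effective trade-off between computation and quality.
Insight: The novelty is showing the importance of early reasoning tokens in reasoning distillation and establishing a truncation protocol that quantifies the trade-off between computation and quality as a function of sequence length, giving a simple and practical lever for efficient distillation.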
Abstract: Distilling the reasoning capabilities from a large language model (LLM) to a smaller student model often involves training on substantial amounts of reasoning data. However, distillation over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different segments (P, CoT, A) affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that training on only the first $50\%$ of tokens of every training sequence can retain, on average, $\approx 94\%$ of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about $50\%$ each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens and provides a simple lever for computation-quality tradeoffs. Code is available at https://github.com/weiruichen01/distilling-the-essence.
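The truncation idea in the abstract amounts to keeping a leading fraction of each training sequence. The sketch below is a minimal version under stated assumptions: the function name, the rounding rule, and the choice to truncate the whole token list (rather than, say, only the CoT segment) are illustrative and not taken from the paper's code.

```python
def truncate_sequence(token_ids, keep_fraction=0.5):
    """Keep only the leading fraction of a tokenized training sequence."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Round down, but always keep at least one token of a non-empty sequence.
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]


full = list(range(10))            # stand-in for prompt + CoT + answer token ids
half = truncate_sequence(full)    # first 50% of the tokens
```

Sweeping `keep_fraction` over several values is one simple way to trace out a computation-quality curve of the kind the abstract describes.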
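[11] Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy cs.CL | cs.AI
Xiaofeng Shi, Qian Kou, Yuduo Li, Hua Zhou
TL;DR: This paper proposes SFTKey, a two-stage supervised fine-tuning method. It targets a weakness of conventional SFT, in which the model devotes too much attention to overly long chain-of-thought sequences and neglects the much shorter but critical answer portion. The first stage applies conventional SFT to ensure a proper output format; the second stage fine-tunes only the key answer portion to improve accuracy.
Details
Motivation: In conventional supervised fine-tuning, the model may focus too heavily on lengthy chain-of-thought sequences and too little on the key answer portion that directly determines task success, which limits accuracy on complex reasoning tasks.
Result: Extensive experiments across multiple benchmarks and model families show that SFTKey improves average accuracy by more than 5% over conventional SFT while preserving the ability to generate correctly formatted output.
Insight: The novelty is a two-stage training scheme that explicitly balances chain-of-thought learning with additional optimization on answer-relevant tokens, making better use of the supervision signal to raise LLM accuracy. Viewed objectively, fine-tuning focused on key answer tokens is a simple and effective improvement to the training paradigm.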
Abstract: With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion, the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
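One plausible way to realize the two stages is through label masking, sketched below. This is an assumption about the mechanics, not the authors' implementation: the function `build_labels`, the `key_start` index, and the simplification that stage one supervises every token are all hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute no loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(token_ids, key_start, stage):
    """Build per-token training labels for a two-stage scheme (hypothetical).

    token_ids: tokenized response (chain of thought followed by final answer)
    key_start: index of the first token of the final-answer ("Key") portion
    stage:     1 = supervise the full response, 2 = supervise only the Key portion
    """
    if stage == 1:
        return list(token_ids)
    # Stage 2: mask everything before the answer so only Key tokens are trained.
    return [tok if i >= key_start else IGNORE_INDEX for i, tok in enumerate(token_ids)]


response = [11, 12, 13, 14, 42, 43]   # four CoT tokens, then a two-token answer
stage1_labels = build_labels(response, key_start=4, stage=1)
stage2_labels = build_labels(response, key_start=4, stage=2)
```

Because the answer is usually far shorter than the reasoning, stage two concentrates the whole gradient signal on a handful of tokens.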
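[12] SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation cs.CL | cs.AI
Mahi Luthra, Jiayi Shen, Maxime Poli, Angelo Ortiz, Yosuke Higuchi
TL;DR: This paper presents SpidR-Adapt, a universal speech representation model for few-shot adaptation. Using a meta-learning framework and a novel bi-level optimization method, it adapts to a new language with less than 1 hour of target-language audio and achieves clear gains in phonemic discriminability and spoken language modeling.
Details
Motivation: Current self-supervised speech models are data-hungry and inefficient, whereas human infants acquire the basic units of a new language from limited speech exposure. The paper aims for similarly data-efficient speech representation learning.
Result: On phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC) benchmarks, the model improves over in-domain language models after training on less than 1 hour of target-language audio, making it over 100x more data-efficient than standard training.
Insight: The innovations are: casting low-resource speech representation learning as a meta-learning problem; a multi-task adaptive pre-training (MAdaPT) protocol formulated as bi-level optimization; first-order bi-level optimization (FOBLO), which lowers computation cost; and interleaved supervision, which alternates self-supervised and supervised objectives to stabilize meta-training. Together they offer a practical, architecture-agnostic path toward biologically inspired, data-efficient universal representations.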
Abstract: Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over $100\times$ more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr-adapt.
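[13] SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance cs.CL | cs.AI
Divij Dudeja, Mayukha Pal
TL;DR: SMART SLM is a Structured Memory and Reasoning Transformer, a small language model designed for document assistance with engineering manuals. Its hierarchical processing combines a syntax-aware fact extractor, a compact indexed memory network, and a six-layer Transformer, which lowers the parameter count while improving comprehension of, and response accuracy on, the complex information in engineering manuals.
Details
Motivation: Engineering manuals are hard to read because they are long and densely formatted (written documentation, procedures, and parameter lists). Existing compact Transformer models treat them as a flat token stream, which leads to wrong numeric answers and inefficient memorization of facts.
Result: On engineering-manual tasks, SMART uses 45.51M parameters, 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), yet reaches 21.3% higher accuracy than GPT-2. It supports an indexed fast path (sub-second responses) and a dynamic RAG-assisted path, and it reduces hallucinations in real-world deployment.
Insight: The innovations are hierarchical structured processing (syntax-aware fact extraction, indexed memory, Transformer fusion), a dual-mode inference mechanism, and a compact design that yields an efficient and accurate small model. Viewed objectively, combining structured memory with a Transformer improves the interpretability and efficiency of document understanding.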
Abstract: Users of engineering manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documentation, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to this problem. SMART structures its processing hierarchically around three main components: (1) a syntax-aware fact extractor (Grammarian), a Tree-LSTM that extracts facts as subject-relation-object triples from EM sentences; (2) a compact indexed memory, a MANN (Memory-Augmented Neural Network), that indexes these subject-relation-object triples as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the retrieved facts into its generated response. The entire SMART model uses 45.51M parameters, which is 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), and it achieves 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the lowest processing requirements. SMART employs two inference modes: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAG for new uploads (FAISS top-20 results, with memory capped at 64 slots). In real-world deployment, this framework produces better-supported results with fewer hallucinations than comparable small transformer models.
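[14] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks cs.CL
Xinhe Wang, Jin Huang, Xingjian Zhang, Tianhao Wang, Jiaqi W. Ma
TL;DR: This paper challenges the prevailing reading of abstract reasoning benchmarks such as ARC and ARC-AGI, arguing that the performance gap on them stems mainly from a visual perception bottleneck rather than from weak machine reasoning. The authors design a two-stage pipeline (a perception stage that converts each image into a natural-language description, and a reasoning stage that induces and applies rules from the descriptions) and test the hypothesis on Mini-ARC, ACRE, and Bongard-LOGO, finding that about 80% of model failures stem from perception errors.
Details
Motivation: The poor results of frontier vision-language models (VLMs) on abstract reasoning benchmarks are usually attributed to reasoning deficiencies. The authors question this view and hypothesize that the gap comes mainly from limits in visual perception rather than from shortcomings in reasoning.
Result: Comparing the two-stage pipeline, which separates perception from reasoning, against standard end-to-end one-stage evaluation on three ARC-style datasets shows that perception capability is the dominant factor behind the performance gap. Manual inspection of VLM outputs shows that about 80% of failures stem from perception errors.
Insight: The novelty is proposing and testing the perception bottleneck hypothesis. By separating perception from reasoning, the experiments show that existing reasoning benchmarks conflate perceptual and reasoning challenges and may therefore overstate deficiencies in machine reasoning. This underscores the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.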
Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called “fluid” reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.
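The control flow of the two-stage pipeline can be sketched as follows. The `describe` and `reason` callables stand in for the perception and reasoning models; they, the function name `two_stage_predict`, and the toy stand-ins are hypothetical and only serve to show how each image is described independently before any rule induction happens.

```python
def two_stage_predict(examples, query, describe, reason):
    """Run perception and reasoning as two separate stages.

    examples: list of (input_image, output_image) demonstration pairs
    query:    the test input image
    describe: callable mapping one image to a text description
    reason:   callable mapping (described examples, query description) to an answer
    """
    # Stage 1 (perception): each image is described on its own, so no
    # cross-image inductive signal can leak into the descriptions.
    described = [(describe(inp), describe(out)) for inp, out in examples]
    query_text = describe(query)
    # Stage 2 (reasoning): rule induction and application on text only.
    return reason(described, query_text)


# Toy stand-ins for the two models: "images" are lists of integers.
def toy_describe(image):
    return " ".join(str(v) for v in image)


def toy_reason(described, query_text):
    # Induce a constant offset from the first example and apply it to the query.
    first_in, first_out = described[0]
    offset = int(first_out.split()[0]) - int(first_in.split()[0])
    return " ".join(str(int(v) + offset) for v in query_text.split())


prediction = two_stage_predict([([1, 2], [3, 4])], [5, 6], toy_describe, toy_reason)
```

Swapping in a perfect `describe` while keeping the same `reason` is the kind of controlled comparison that lets perception errors be separated from reasoning errors.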
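[15] Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty cs.CL | cs.AI | cs.LG
Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin
TL;DR: This paper addresses the sensitivity of masked diffusion models (MDMs) to decoding order. It is the first to formalize the issue as cumulative predictive uncertainty along a generative path, and it introduces Denoising Entropy, a computable metric that quantifies this uncertainty. Based on the metric, the authors design two algorithms that optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments show that the entropy-guided methods significantly improve generation quality and consistently raise accuracy on challenging reasoning, planning, and code benchmarks.
Details
Motivation: Masked diffusion models offer flexible, non-autoregressive generation, but this freedom makes final output quality highly sensitive to the decoding order. The paper sets out to address this challenge.
Result: Experiments show that the proposed entropy-guided methods significantly improve generation quality and consistently raise accuracy on challenging reasoning, planning, and code benchmarks.
Insight: The core novelty is formalizing, for the first time, the variability of MDM output quality as cumulative predictive uncertainty along the generative path, and introducing Denoising Entropy as a computable internal signal for quantifying and guiding generation. This turns model uncertainty from a liability into a key advantage for discovering high-quality solutions.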
Abstract: Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
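A post-hoc selection of the kind the abstract describes can be sketched with plain Shannon entropy. Everything specific here is an assumption: the abstract does not define Denoising Entropy, so the sketch simply sums the entropy of the predictive distribution at each decoding step and prefers the candidate path with the lowest total; the candidate layout and function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def path_entropy(step_distributions):
    """Cumulative predictive uncertainty along one decoding path (assumed form)."""
    return sum(entropy(p) for p in step_distributions)


def select_path(candidates):
    """Post-hoc selection: keep the candidate with the lowest cumulative entropy."""
    return min(candidates, key=lambda c: path_entropy(c["steps"]))


candidates = [
    {"text": "uncertain path", "steps": [[0.5, 0.5], [0.5, 0.5]]},
    {"text": "confident path", "steps": [[0.9, 0.1], [1.0, 0.0]]},
]
best = select_path(candidates)
```

A real-time variant would apply the same signal during decoding, for example by choosing at each step which masked position to fill next, rather than ranking completed paths afterwards.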
cs.CV
[16] VL4Gaze: Unleashing Vision-Language Models for Gaze Following cs.CVPDF
Shijing Wang, Chaoqun Cui, Yaping Huang, Hyung Jin Chang, Yihua Cheng
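TL;DR: This paper presents VL4Gaze, the first large-scale benchmark for studying and evaluating the gaze understanding abilities of vision-language models. The benchmark contains 489K automatically generated question-answer pairs over 124K images and unifies gaze understanding as a visual question answering problem through four complementary tasks. Evaluation shows that large-scale vision-language models struggle to reliably infer gaze semantics and spatial localization without task-specific supervision, whereas training on VL4Gaze yields substantial and consistent improvements.
Details
Motivation: Human gaze provides key cues for interpreting attention, intention, and social interaction in visual scenes, yet the gaze understanding abilities of current vision-language models remain largely unexplored. No benchmark systematically evaluates or trains models for gaze interpretation, so the paper aims to fill this gap and to test whether gaze understanding emerges from general-purpose vision-language pre-training.
Result: Commercial and open-source vision-language models were evaluated comprehensively under in-context learning and fine-tuning settings. Even large-scale models struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on the VL4Gaze dataset brings substantial and consistent improvements across all tasks.
Insight: The main contribution is VL4Gaze, the first large-scale, multi-task visual question answering benchmark for gaze understanding, which systematically formulates gaze understanding as four complementary VQA tasks. Viewed objectively, its core finding is that general-purpose vision-language pre-training alone does not produce reliable gaze understanding and that targeted multi-task supervision is essential for developing it. This provides a new dataset and methodological basis for integrating fine-grained social perception tasks into VLMs.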
Abstract: Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.
[17] OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective cs.CVPDF
Markus Gross, Sai B. Matha, Aya Fahmy, Rui Song, Daniel Cremers
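TL;DR: This paper presents OccuFly, the first real-world, camera-based aerial semantic scene completion (SSC) benchmark for 3D perception from the aerial perspective. The dataset was captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter; it covers urban, industrial, and rural scenes and provides 22 semantic classes. The paper also proposes a LiDAR-free automated annotation framework based on traditional 3D reconstruction, evaluates existing state-of-the-art methods on the benchmark, and highlights challenges specific to the aerial viewpoint.
Details
Motivation: Semantic scene completion is crucial for 3D perception in mobile robotics, but existing work focuses mainly on terrestrial domains such as autonomous driving, while aerial scenarios such as autonomous flying remain largely unexplored. In addition, LiDAR, the primary data modality, is constrained on UAVs by flight regulations, mass and energy limits, and point cloud sparsity.
Result: The paper benchmarks existing state-of-the-art (SOTA) methods on OccuFly and analyzes the specific challenges of semantic scene completion from elevated aerial viewpoints.
Insight: The main contributions are: 1) the first camera-based, multi-season, multi-altitude real-world aerial SSC benchmark; 2) a LiDAR-free framework that uses traditional 3D reconstruction and annotated 2D masks to generate 3D labels automatically, substantially reducing manual 3D annotation effort; and 3) a comprehensive vision benchmark for holistic 3D scene understanding from the aerial perspective.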
Abstract: Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by lifting a subset of annotated 2D masks into the reconstructed point cloud, thereby substantially minimizing manual 3D annotation effort. Finally, we benchmark the state-of-the-art on OccuFly and highlight challenges specific to elevated viewpoints, yielding a comprehensive vision benchmark for holistic aerial 3D scene understanding.
[18] NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts cs.CV | cs.AI | cs.LGPDF
Raja Mallina, Bryar Shareef
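TL;DR: This paper proposes NullBUS, a multimodal mixed-supervision framework for breast ultrasound image segmentation. Its core innovation is the nullable prompt, which uses learnable null embeddings and presence masks to handle training data with missing text prompts. A single model can therefore be trained on images both with and without prompts, improving robustness and segmentation performance on real-world data, which often lacks reliable metadata.
Details
Motivation: Existing promptable segmentation methods depend heavily on text or spatial prompts during training. Because many public breast ultrasound datasets lack reliable metadata or reports, training is confined to small multimodal subsets, which limits generalization and robustness.
Result: On a unified pool of three public breast ultrasound datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, reaching state-of-the-art performance.
Insight: The main innovation is the nullable prompt mechanism. Through learnable null embeddings and presence masks, the model handles mixed scenarios in which prompts may be present or absent during training and inference, so that a single model learns effectively from mixed supervision and remains practical when data are incomplete.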
Abstract: Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.
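The nullable-prompt mechanism described in this abstract can be sketched as follows. This is only an illustration of the fallback logic under stated assumptions: in a trained model the null embedding would be a learnable parameter, whereas here it is a fixed list, and the helper names `resolve_prompt` and `batch_prompts` are hypothetical rather than taken from NullBUS.

```python
def resolve_prompt(text_embedding, null_embedding):
    """Return (embedding, presence) for one sample.

    presence is 1 when a real text embedding exists and 0 when the
    null embedding is substituted for a missing prompt.
    """
    if text_embedding is None:
        return list(null_embedding), 0
    return list(text_embedding), 1

def batch_prompts(text_embeddings, null_embedding):
    """Build a dense prompt batch plus a presence mask from possibly-missing prompts."""
    pairs = [resolve_prompt(e, null_embedding) for e in text_embeddings]
    embeddings = [embedding for embedding, _ in pairs]
    presence_mask = [presence for _, presence in pairs]
    return embeddings, presence_mask

# Hypothetical batch: the second image has no metadata or report.
null_vec = [0.0, 0.0]  # stand-in for a learnable null embedding
embeddings, mask = batch_prompts([[1.0, 2.0], None], null_vec)
```

The presence mask lets downstream layers distinguish a genuine prompt from the fallback, which is what allows prompted and unprompted images to share one model.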
[19] Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference cs.CVPDF
Putu Indah Githa Cahyani, Komang David Dananjaya Suartana, Novanto Yudistira
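TL;DR: This paper proposes an input-adaptive visual preprocessing method that improves the inference efficiency of vision-language models (VLMs) such as FastVLM. The method dynamically adjusts input resolution and spatial coverage according to image content in order to reduce visual redundancy, significantly lowering inference latency and computational cost without modifying the model architecture or retraining.
Details
Motivation: VLM deployment faces high inference latency and computational cost, especially for high-resolution visual inputs. Architectures such as FastVLM improve efficiency through optimized vision encoders, but their preprocessing pipelines remain static, which causes redundant computation on visually simple inputs. The paper targets this efficiency bottleneck.
Result: Inference experiments on a subset of the DocVQA dataset show that the method reduces per-image inference time by over 50%, lowers mean full generation time, and consistently reduces the visual token count by more than 55% compared to the baseline, improving deployment-oriented efficiency.
Insight: The innovation is to replace static preprocessing with dynamic, adaptive preprocessing driven by image content characteristics such as image complexity, including resolution selection and content-aware cropping. It is a lightweight strategy that requires no model changes, effectively reduces visual redundancy, and offers a new direction for efficient VLM deployment.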
Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models. To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at https://github.com/kmdavidds/mlfastlm.
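To make the adaptive resolution selection step concrete, here is a minimal sketch. The abstract does not specify which content statistics the authors use, so this example assumes a simple complexity proxy (mean absolute difference between horizontally adjacent grayscale pixels) and hypothetical thresholds and resolutions; none of these values come from the paper or from the FastVLM repository.

```python
def edge_density(gray):
    """Mean absolute difference between horizontally adjacent pixels (values 0..255)."""
    total, count = 0, 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            total += abs(left - right)
            count += 1
    return total / count if count else 0.0

def choose_resolution(gray, low=256, mid=512, high=1024, t_low=5.0, t_high=20.0):
    """Pick an input resolution from a content-complexity score.

    All thresholds and resolutions are illustrative assumptions.
    """
    density = edge_density(gray)
    if density < t_low:
        return low    # visually simple input: fewer visual tokens
    if density < t_high:
        return mid
    return high       # dense content such as small text: keep detail

# Hypothetical 4x4 grayscale images.
flat_page = [[10] * 4 for _ in range(4)]
textured = [[0, 10, 0, 10] for _ in range(4)]
dense_text = [[0, 255, 0, 255] for _ in range(4)]
```

With these inputs, `choose_resolution` returns the low, middle, and high resolution respectively. Because the decision happens before vision encoding, a scheme of this kind needs no change to the model itself, which matches the training-free setting the abstract describes.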
[20] ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction cs.CVPDF
Md Zabirul Islam, Md Motaleb Hossen Manik, Ge Wang
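TL;DR: ALIVE is a locally deployed interactive video engine that turns traditional lecture videos into a dynamic, real-time learning experience. It generates avatar-delivered lectures through ASR transcription, LLM refinement, and neural avatar synthesis, and it combines this with a content-aware retrieval mechanism based on semantic similarity and timestamp alignment. Students can ask questions by text or voice and receive explanations as text or as avatar responses.
Details
Motivation: Traditional lecture videos lack a mechanism for real-time clarification, and existing interactive learning systems typically lack lecture awareness, rely on cloud services, or fail to integrate retrieval with avatar-delivered explanations. ALIVE aims to solve these problems with a unified, privacy-preserving interaction pipeline on local hardware.
Result: The system is demonstrated on a complete medical imaging course and evaluated for retrieval accuracy, latency characteristics, and user experience, showing that ALIVE provides accurate, content-aware, and engaging real-time support.
Insight: The innovation lies in combining multimodal AI with content-aware retrieval and local deployment. Lightweight embedding models, FAISS retrieval, and segmented avatar synthesis keep the system responsive, offering an extensible path toward next-generation interactive learning environments.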
Abstract: Traditional lecture videos offer flexibility but lack mechanisms for real-time clarification, forcing learners to search externally when confusion arises. Recent advances in large language models and neural avatars provide new opportunities for interactive learning, yet existing systems typically lack lecture awareness, rely on cloud-based services, or fail to integrate retrieval and avatar-delivered explanations in a unified, privacy-preserving pipeline. We present ALIVE, an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience. ALIVE operates fully on local hardware and integrates (1) avatar-delivered lectures generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) a content-aware retrieval mechanism that combines semantic similarity with timestamp alignment to surface contextually relevant lecture segments; and (3) real-time multimodal interaction, enabling students to pause the lecture, ask questions through text or voice, and receive grounded explanations either as text or as avatar-delivered responses. To maintain responsiveness, ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading. We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support. ALIVE illustrates how multimodal AI, when combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures, offering an extensible pathway toward next-generation interactive learning environments.
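A retrieval score that combines semantic similarity with timestamp alignment, as this abstract describes, might look like the sketch below. The abstract does not state how the two signals are combined, so the weighted sum, the exponential time decay, and the parameters `alpha` and `tau` are assumptions for illustration. A brute-force scan also stands in for the FAISS index the authors mention.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query_vec, pause_time, segments, alpha=0.3, tau=60.0, k=1):
    """Rank lecture segments by semantic similarity plus temporal proximity.

    segments: list of dicts with 'id', 'vec', 'start', 'end' (seconds).
    alpha weights the timestamp term; tau is a decay constant in seconds.
    Both are hypothetical tuning parameters.
    """
    def score(segment):
        midpoint = 0.5 * (segment["start"] + segment["end"])
        proximity = math.exp(-abs(pause_time - midpoint) / tau)
        return (1.0 - alpha) * cosine(query_vec, segment["vec"]) + alpha * proximity
    return sorted(segments, key=score, reverse=True)[:k]

# Hypothetical segments: 'a' and 'b' have the same content embedding but differ in time.
segments = [
    {"id": "a", "vec": [1.0, 0.0], "start": 0.0, "end": 10.0},
    {"id": "b", "vec": [1.0, 0.0], "start": 100.0, "end": 110.0},
    {"id": "c", "vec": [0.0, 1.0], "start": 100.0, "end": 110.0},
]
top = retrieve([1.0, 0.0], pause_time=105.0, segments=segments)
```

In this example the timestamp term breaks the tie between two equally relevant segments in favor of the one nearest to where the student paused, while the off-topic segment at the same timestamp still ranks last.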
[21] NeRV360: Neural Representation for 360-Degree Videos with a Viewport Decoder cs.CV | cs.MM | eess.IVPDF
Daichi Arai, Kyohei Unno, Yasuko Sugito, Yuichi Kusakabe
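TL;DR: This paper proposes NeRV360, an implicit neural representation framework for 360-degree videos that decodes only the user-selected viewport rather than the entire panoramic frame, significantly reducing memory consumption and increasing decoding speed.
Details
Motivation: Applying existing NeRV methods to high-resolution 360-degree videos causes high memory usage and slow decoding, which makes real-time applications impractical.
Result: Experiments on 6K-resolution videos show that, compared to the representative prior work HNeRV, NeRV360 achieves a 7-fold reduction in memory consumption and a 2.5-fold increase in decoding speed while delivering better image quality on objective metrics.
Insight: The main innovation is to integrate viewport extraction into the decoding process and to introduce a spatial-temporal affine transform module for conditional decoding based on viewpoint and time, enabling efficient viewport-adaptive decoding.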
Abstract: Implicit neural representations for videos (NeRV) have shown strong potential for video compression. However, applying NeRV to high-resolution 360-degree videos causes high memory usage and slow decoding, making real-time applications impractical. We propose NeRV360, an end-to-end framework that decodes only the user-selected viewport instead of reconstructing the entire panoramic frame. Unlike conventional pipelines, NeRV360 integrates viewport extraction into decoding and introduces a spatial-temporal affine transform module for conditional decoding based on viewpoint and time. Experiments on 6K-resolution videos show that NeRV360 achieves a 7-fold reduction in memory consumption and a 2.5-fold increase in decoding speed compared to HNeRV, a representative prior work, while delivering better image quality in terms of objective metrics.
[22] Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification cs.CVPDF
Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Zhelin Li
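TL;DR: To address the significant modality discrepancy in cross-modality ship re-identification, this paper proposes a new parameter-efficient fine-tuning strategy called Domain Representation Injection. Built on vision foundation models, the method uses a lightweight, learnable Offset Encoder and Modulator to optimize in the feature space rather than the weight space. Domain-specific representations rich in modality and identity attributes are injected into intermediate layers, dynamically adjusting the feature distribution for the downstream task while the pre-trained weights stay fully frozen.
Details
Motivation: Cross-modality ship re-identification is critical for all-day, all-weather maritime target tracking but faces significant modality discrepancies. Mainstream methods rely on explicit modality alignment strategies, which depend heavily on constructing large-scale paired datasets for pre-training. In addition, existing generic parameter-efficient fine-tuning methods operate mainly in the weight space and perform suboptimally on limited-capacity models.
Result: On the HOSS-ReID dataset, the method attains 57.9% and 60.5% mAP using only 1.54M and 7.05M trainable parameters, respectively, achieving state-of-the-art performance.
Insight: The innovation is to shift the optimization perspective from the weight space to the feature space through Domain Representation Injection. The vision foundation model is frozen to preserve general knowledge, and lightweight modules extract and modulate domain-specific representations for feature-space injection, giving efficient downstream adaptation. This suggests a new way to exploit foundation models under limited data and model capacity.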
Abstract: Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm heavily depends on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder to extract domain-specific representations rich in modality and identity attributes from raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. Subsequently, they are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM’s pre-trained weights. Extensive experimental results demonstrate the superiority of our method, achieving State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9% and 60.5% mAP using only 1.54M and 7.05M parameters, respectively. The code is available at https://github.com/TingfengXian/DRI.
[23] DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction cs.CV | cs.AIPDF
Xiao Yu, Zhaojie Fang, Guanyu Zhou, Yin Shen, Huoling Luo
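TL;DR: This paper proposes DGSAN, a dual-graph spatiotemporal attention network for improving the accuracy of pulmonary nodule malignancy prediction. The method captures nodule features with a Global-Local Feature Encoder, constructs intra-modal and inter-modal graphs, and introduces a Hierarchical Cross-Modal Graph Fusion Module for feature integration. The authors also compile a new multimodal dataset, NLST-cmst. Experiments show that DGSAN significantly outperforms existing methods on multiple datasets with high computational efficiency.
Details
Motivation: When fusing multimodal and multi-temporal information, existing work on pulmonary nodule malignancy prediction typically uses inefficient vector concatenation or simple mutual attention, which fails to exploit the available information fully. More effective multimodal fusion methods are therefore needed.
Result: Extensive experiments on the NLST-cmst and CSTL-derived datasets show that DGSAN significantly outperforms state-of-the-art (SOTA) methods on pulmonary nodule classification with excellent computational efficiency.
Insight: The contributions are: 1) a Global-Local Feature Encoder that captures local, global, and fused features; 2) a Dual-Graph Construction method that organizes multimodal features into intra-modal and inter-modal graphs; 3) a Hierarchical Cross-Modal Graph Fusion Module that refines feature integration; and 4) a new multimodal dataset, NLST-cmst, to support related research. Viewed objectively, combining graph neural networks with spatiotemporal attention for multimodal medical image analysis is a promising direction.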
Abstract: Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single-modality and single-time-point approaches, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.
[24] Benchmarking and Enhancing VLM for Compressed Image Understanding cs.CVPDF
Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang
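TL;DR: This paper builds the first comprehensive benchmark for evaluating how well vision-language models (VLMs) handle compressed images, covering a range of mainstream image codecs and tasks and containing over one million compressed images. After analyzing the sources of the performance gap (information loss and generalization failure), the authors propose a universal VLM adaptor that improves VLM performance by 10%-30% across codecs and bitrates.
Details
Motivation: With the rapid development of VLMs and growing application demand, efficient compression of image inputs has become increasingly important. Existing VLMs mainly process high-bitrate compressed images, and their ability to understand low-bitrate compressed images has not been adequately explored.
Result: On the constructed benchmark, which spans multiple codecs and tasks, the proposed universal adaptor improves VLM performance on compressed images by 10%-30%.
Insight: The innovation is the first systematic benchmark of VLM understanding of compressed images, together with an analysis that splits the performance gap into information loss and generalization failure and finds that only the generalization gap can be mitigated. The proposed universal adaptor is a lightweight solution that improves performance across codecs and bitrates, giving a practical path for combining VLMs with compressed images.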
Abstract: With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images remains largely unexplored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, covering widely used image codecs and a diverse set of tasks, and encompassing over one million compressed images. Next, we analyse the source of the performance gap by categorising it into a) the information loss during compression and b) the generalisation failure of the VLM. We visualize these gaps with concrete examples and identify that for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.
[25] PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding cs.CVPDF
Seongmin Jung, Seongho Choi, Gunwoo Jeon, Minsu Cho, Jongwoo Lim
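TL;DR: PanoGrounder is a generalizable 3D visual grounding framework that couples multi-modal panoramic representations with pretrained 2D vision-language models (VLMs) for strong vision-language reasoning. It uses panoramic renderings as an intermediate representation between 2D and 3D and performs grounding in three stages: placing panoramic viewpoints, grounding the text query on each panorama with a VLM, and fusing the predictions into a 3D bounding box. It achieves SOTA results on ScanRefer and Nr3D and shows superior generalization to unseen 3D datasets and text rephrasings.
Details
Motivation: Traditional supervised models for 3D visual grounding generalize poorly because 3D vision-language datasets are scarce and their reasoning capabilities are limited. The paper combines the strong language understanding of 2D VLMs with the 3D scene information in panoramic representations to improve generality and performance.
Result: The method achieves state-of-the-art (SOTA) results on ScanRefer and Nr3D and shows superior generalization to unseen 3D datasets and text rephrasings.
Insight: The contributions include using panoramic renderings as a bridge between 2D and 3D that retains long-range object relations; a three-stage pipeline (viewpoint placement, VLM grounding, prediction fusion) for efficient 3D grounding; and the use of pretrained 2D VLMs to strengthen language reasoning and avoid dependence on large-scale 3D datasets.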
Abstract: 3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
[26] Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting cs.CVPDF
Yoonwoo Jeong, Cheng Sun, Frank Wang, Minsu Cho, Jaesung Choe
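TL;DR: This paper proposes Quantile Rendering (Q-Render), a new rendering strategy for efficiently embedding high-dimensional features in 3D Gaussian Splatting (3D-GS) to support open-vocabulary segmentation (OVS). By sparsely sampling only the Gaussians with dominant influence along each ray, it avoids the information loss caused by the codebooks or feature compression used in existing methods. The authors also integrate Q-Render into a generalizable 3D neural network, the Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner.
Details
Motivation: Existing methods that extend open-vocabulary segmentation to 3D make use of 3D Gaussian Splatting, but efficiently rendering the high-dimensional features needed for open-vocabulary queries remains a significant challenge. Current techniques typically rely on codebooks or feature compression, which lose information and degrade segmentation quality.
Result: Extensive experiments on ScanNet and LeRF show that the framework outperforms state-of-the-art (SOTA) methods while achieving an approximately 43.7x speedup on 512-dimensional feature maps, enabling real-time rendering.
Insight: The main innovation is the Q-Render strategy, which handles high-dimensional features efficiently through selective sparse sampling and greatly increases rendering speed while maintaining high fidelity. In addition, GS-Net predicts features in a generalizable manner, which broadens the applicability of the method. Viewed objectively, the method strikes a good balance between rendering efficiency and feature quality and offers an efficient new solution for open-vocabulary understanding of 3D scenes.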
Abstract: Recent advancements in computer vision have successfully extended Open-vocabulary segmentation (OVS) to the 3D domain by leveraging 3D Gaussian Splatting (3D-GS). Despite this progress, efficiently rendering the high-dimensional features required for open-vocabulary queries poses a significant challenge. Existing methods employ codebooks or feature compression, causing information loss, thereby degrading segmentation quality. To address this limitation, we introduce Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, which densely samples all 3D Gaussians intersecting each ray, Q-Render sparsely samples only those with dominant influence along the ray. By integrating Q-Render into a generalizable 3D neural network, we also propose Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner. Extensive experiments on ScanNet and LeRF demonstrate that our framework outperforms state-of-the-art methods, while enabling real-time rendering with an approximately 43.7x speedup on 512-D feature maps. Code will be made publicly available.
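The contrast between dense volume rendering and sparse, influence-based sampling can be sketched as follows. The abstract does not specify how Q-Render selects Gaussians, so this example makes an assumption suggested by the method's name: compute the usual front-to-back alpha-compositing weights, then keep only the Gaussians at which the cumulative weight crosses evenly spaced quantile levels. The functions `quantile_indices` and `render_feature` are hypothetical illustrations, not the paper's algorithm.

```python
def blend_weights(alphas):
    """Front-to-back alpha-compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for alpha in alphas:
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def quantile_indices(alphas, k):
    """Indices of the Gaussians where the cumulative weight crosses k quantile levels.

    Assumption: levels are (j + 0.5) / k of the total accumulated weight.
    A Gaussian with dominant influence may be picked more than once.
    """
    weights = blend_weights(alphas)
    total = sum(weights)
    if total == 0.0:
        return []
    targets = [(j + 0.5) / k * total for j in range(k)]
    picked, cumulative, t = [], 0.0, 0
    for i, weight in enumerate(weights):
        cumulative += weight
        while t < k and targets[t] <= cumulative:
            picked.append(i)
            t += 1
    return picked

def render_feature(alphas, features, k):
    """Average the high-dimensional features of the k quantile-selected Gaussians."""
    picked = quantile_indices(alphas, k)
    if not picked:
        return []
    dim = len(features[0])
    return [sum(features[i][d] for i in picked) / len(picked) for d in range(dim)]

# Hypothetical ray hitting three Gaussians; the middle one is nearly opaque.
alphas = [0.01, 0.9, 0.01]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rendered = render_feature(alphas, features, k=2)
```

With a fixed `k`, the per-ray cost of touching feature vectors is independent of how many Gaussians the ray intersects, which is the kind of saving that matters when each feature has hundreds of dimensions.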
[27] Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning cs.CV | cs.AI | cs.CL | cs.MAPDF
Shengguang Wu, Xiaohan Wang, Yuhui Zhang, Hao Zhu, Serena Yeung-Levy
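TL;DR: This paper proposes Transductive Visual Programming (TVP), a framework that abstracts reusable higher-level tools from its own experience of solving 3D spatial reasoning problems. It thereby builds and evolves a tool library dynamically and solves complex spatial geometry problems more effectively.
Details
Motivation: Existing visual programming methods rely on fixed toolsets or on speculative tool induction before solving problems, which leads to suboptimal programs and poor tool utilization. TVP aims to overcome these limits through experience-driven, transductive tool creation.
Result: On the Omni3D-Bench benchmark, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Its transductively learned tools are used as core program dependencies 5x more frequently than inductively created ones, and they generalize strongly to unseen spatial tasks from the SpatialScore-Hard collection.
Insight: The core innovation is to learn and evolve the tool library transductively from experience rather than defining it in advance or inducing it speculatively. This makes the tools self-evolving and markedly improves tool discovery, reuse, and effectiveness on complex spatial reasoning tasks.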
Abstract: Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependencies than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from the SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.
[28] Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation cs.CVPDF
Hongxing Fan, Shuyu Zhao, Jiayang Ao, Lu Sheng
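TL;DR: This paper proposes a collaborative multi-agent reasoning framework to address semantic consistency and structural integrity in amodal completion. The framework decouples semantic planning from visual synthesis: specialized agents reason up front and produce a structured plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. The paper also introduces MAC-Score, a new evaluation metric for better assessing inferred invisible content.
Details
Motivation: Amodal completion faces challenges in maintaining semantic consistency and structural integrity when inferring invisible object parts, and existing progressive methods are limited by inference instability and error accumulation.
Result: Extensive experiments on multiple datasets show that the framework significantly outperforms state-of-the-art methods.
Insight: The contributions are: 1) a collaborative multi-agent reasoning framework that decouples semantic planning from visual synthesis; 2) a self-correcting Verification Agent that uses Chain-of-Thought reasoning during the semantic planning phase to rectify visible region segmentation and identify residual occluders; 3) a Diverse Hypothesis Generator that offers varied semantic interpretations to resolve the ambiguity of invisible regions; and 4) the MAC-Score metric, which is MLLM-based and aligned with human judgment, providing a new standard for assessing structural completeness and semantic consistency.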
Abstract: Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by inference instability and error accumulation. To tackle these limitations, we present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. We integrate this framework with two critical mechanisms: (1) a self-correcting Verification Agent that employs Chain-of-Thought reasoning to rectify visible region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel-level variations of standard random seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score (MLLM Amodal Completion Score), a novel human-aligned evaluation metric. Validated against human judgment and ground truth, these metrics establish a robust standard for assessing structural completeness and semantic consistency with visible context. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods across multiple datasets. Our project is available at: https://fanhongxing.github.io/remac-page.
[29] MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds cs.CVPDF
Xiangzuo Wu, Chengwei Ren, Jun Zhou, Xiu Li, Yuan Liu
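TL;DR: This paper proposes MVInverse, a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, it models intra-view long-range lighting interactions and inter-view material consistency within a single forward pass. To cope with the scarcity of real-world training data, the paper also proposes a consistency-based finetuning strategy that uses unlabeled real-world videos to improve robustness and multi-view consistency.
Details
Motivation: When applied to multi-view images, existing single-view inverse rendering methods ignore cross-view relationships and produce inconsistent results, while multi-view optimization methods depend on slow differentiable rendering and per-scene optimization, making them computationally expensive and hard to scale. The paper aims to overcome these limits and achieve fast, consistent multi-view inverse rendering.
Result: Extensive experiments on benchmark datasets show that the method achieves state-of-the-art performance in multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.
Insight: The main contributions are: 1) a feed-forward multi-view inverse rendering framework that achieves scene-level consistent reasoning efficiently through alternating attention; and 2) a consistency-based finetuning strategy that uses unlabeled real videos to improve robustness and consistency in real scenes. Viewed objectively, the work turns a slow optimization process into efficient feed-forward prediction and uses unlabeled data to narrow the domain gap between synthetic and real data, which gives it practical value.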
Abstract: Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches often ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.
[30] Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations cs.CVPDF
Jinghan Li, Yang Jin, Hao Jiang, Yadong Mu, Yang Song
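TL;DR: This paper proposes NExT-Vid, a new autoregressive visual generative pretraining framework that jointly models images and videos through masked next-frame prediction. It targets three weaknesses of existing methods: neglect of temporal information, inaccurate semantic localization, and poor generation quality.
Details
Motivation: Most current visual generative pretraining relies on BERT-style masked modeling, which disregards the temporal information essential for video analysis. Existing autoregressive methods suffer from inaccurate semantic localization and poor generation quality, which lead to poor semantic learning.
Result: Extensive experiments on large-scale pretrained models show that, under attentive probing on downstream classification tasks, the method consistently outperforms previous generative pretraining methods and learns strong representations.
Insight: The contributions include a context-isolated autoregressive predictor that decouples semantic representation from target decoding, and a conditioned flow-matching decoder that improves generation quality and diversity. Effective representations are obtained through context-isolated flow-matching pretraining.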
Abstract: Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.
[31] Granular-ball Guided Masking: Structure-aware Data Augmentation cs.CVPDF
Shuyin Xia, Fan Chen, Dawei Dai, Meng Yang, Junwei Han
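TL;DR: This paper proposes Granular-ball Guided Masking (GBGM), a structure-aware data augmentation method guided by Granular-ball Computing (GBC). Through a coarse-to-fine hierarchical masking process, it adaptively preserves semantically rich, structurally important regions while suppressing redundant ones, producing augmented samples that are more representative and discriminative.
Details
Motivation: Deep learning models have achieved remarkable success in computer vision but rely heavily on large-scale labeled data and tend to overfit when data are limited or distributions shift. Existing mask-based information-dropping augmentation methods usually lack structural awareness and may discard essential semantic information.
Result: Extensive experiments on multiple benchmarks show consistent improvements in both classification accuracy and masked image reconstruction, confirming the effectiveness and broad applicability of the method.
Insight: The innovation is to bring Granular-ball Computing (GBC) into data augmentation, achieving structure-aware, adaptive hierarchical masking and offering a new paradigm for structure-aware augmentation. The method is simple and model-agnostic and integrates seamlessly into CNNs and Vision Transformers.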
Abstract: Deep learning models have achieved remarkable success in computer vision, but they still rely heavily on large-scale labeled data and tend to overfit when data are limited or distributions shift. Data augmentation, particularly mask-based information dropping, can enhance robustness by forcing models to explore complementary cues; however, existing approaches often lack structural awareness and may discard essential semantics. We propose Granular-ball Guided Masking (GBGM), a structure-aware augmentation strategy guided by Granular-ball Computing (GBC). GBGM adaptively preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process, producing augmentations that are both representative and discriminative. Extensive experiments on multiple benchmarks demonstrate consistent improvements in classification accuracy and masked image reconstruction, confirming the effectiveness and broad applicability of the proposed method. Simple and model-agnostic, it integrates seamlessly into CNNs and Vision Transformers and provides a new paradigm for structure-aware data augmentation.
[32] FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing cs.CVPDF
Mingshu Cai, Yixuan Li, Osamu Yoshie, Yuya Ieiri
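TL;DR: This paper proposes FluencyVE, a simple yet effective one-shot video editing method. It integrates Mamba, a linear time-series module, into a video editing model built on pretrained Stable Diffusion, replacing the original temporal attention layers to enable global frame-level attention while reducing computational cost.
Details
Motivation: Large-scale text-to-image diffusion models have been highly successful in image generation and editing, but extending them to video editing remains challenging; existing methods suffer from temporal inconsistency and high computational overhead.
Result: Experiments and analyses show promising results in editing various attributes, subjects, and locations in real-world videos, while effectively preserving the generative power of the text-to-image model.
Insight: The novelty lies in introducing Mamba into video editing as a replacement for temporal attention, combined with low-rank approximation matrices and a weighted averaging technique that reduce the computational burden while maintaining model performance.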
Abstract: Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.
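The low-rank replacement of the query and key weight matrices mentioned above can be illustrated with a generic factorization. This is a minimal sketch, not FluencyVE's implementation: the dimensions (`d_model`, `rank`), the Gaussian initialization, and all function names are assumptions chosen for illustration. It shows only why replacing a full d x d projection with two thin factors reduces the parameter count.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def low_rank_projection(d_model, rank, seed=0):
    """Return factors A (d_model x rank) and B (rank x d_model).

    Their product stands in for a full d_model x d_model projection matrix.
    The initialization scale is an arbitrary illustrative choice.
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0, 0.02) for _ in range(rank)] for _ in range(d_model)]
    b = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(rank)]
    return a, b

def param_counts(d_model, rank):
    """Parameters in a full projection vs. its rank-r factorization."""
    return d_model * d_model, 2 * d_model * rank

d_model, rank = 64, 4                  # hypothetical sizes
A, B = low_rank_projection(d_model, rank)
x = [[1.0] * d_model]                  # one frame token, shape 1 x d_model
q = matmul(matmul(x, A), B)            # apply the factors in sequence
full, factored = param_counts(d_model, rank)
print(full, factored)                  # 4096 512
```

Applying the factors one after the other avoids ever materializing the d x d product, which is where the compute saving comes from when `rank` is much smaller than `d_model`.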
[33] Efficient and Robust Video Defense Framework against 3D-field Personalized Talking Face cs.CVPDF
Rui-qing Sun, Xingshan Yao, Tian Lan, Hui-Yang Zhao, Jia-Ling Shi
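TL;DR: This paper proposes an efficient and robust video defense framework that protects portrait videos from malicious misuse by 3D-field personalized talking face generation (TFG) methods. The framework defends by perturbing the 3D information acquisition process while keeping video fidelity high, and introduces a similarity-guided parameter sharing mechanism and a multi-scale dual-domain attention module to improve efficiency and effectiveness.
Details
Motivation: Existing 3D-field TFG methods can synthesize high-fidelity personalized talking-face videos in real time, raising serious concerns about privacy misuse, yet no efficient video defense framework exists to protect such videos. Image-based per-frame 2D perturbation methods are computationally expensive, severely degrade video quality, and cannot disrupt 3D information for video protection.
Result: Extensive experiments show that the framework has strong defense capability and achieves a 47x speedup over the fastest baseline while maintaining high fidelity. It also remains robust to scaling operations and state-of-the-art purification attacks, and ablation studies further validate the design choices.
Insight: The novelty lies in defending against 3D-field TFG methods by perturbing the 3D information acquisition process rather than applying 2D perturbations alone. The specific technical contributions are a similarity-guided parameter sharing mechanism that improves computational efficiency and a multi-scale dual-domain attention module that jointly optimizes spatial-frequency perturbations, balancing efficiency and defense effectiveness.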
Abstract: State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from reference portrait video. This capability raises significant privacy concerns regarding malicious misuse of personal portraits. However, no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs and severe video quality degradation, and they fail to disrupt 3D information for video protection. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait video by perturbing the 3D information acquisition process while maintaining high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations. Extensive experiments demonstrate that our proposed framework exhibits strong defense capability and achieves a 47x acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and the effectiveness of our design choices is further validated through ablation studies. Our project is available at https://github.com/Richen7418/VDF.
[34] DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors cs.CV | cs.AI | cs.HC | cs.LGPDF
Kaustubh Kundu, Hrishav Bakul Barua, Lucy Robertson-Bell, Zhixi Cai, Kalin Stefanov
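TL;DR: DexAvatar is a novel framework for reconstructing biomechanically accurate fine-grained hand articulations and body movements from monocular sign language videos. It guides reconstruction with learned 3D hand and body priors to address the poor 3D pose estimation quality that existing methods suffer under self-occlusion, noise, and motion blur. On the SGNify motion capture dataset, DexAvatar achieves a 35.11% improvement over the state of the art.
Details
Motivation: Current sign language generation relies on data-driven generative methods that need large amounts of precise 2D and 3D human pose data, but most sign language datasets are limited to automatically reconstructed 2D keypoints and lack accurate 3D information. In addition, existing SOTA methods for 3D human pose estimation from sign language videos are prone to self-occlusion, noise, and motion blur, resulting in poor reconstruction quality.
Result: On the SGNify motion capture dataset (the only benchmark available for this task), DexAvatar improves hand and body pose estimation by 35.11% over the state of the art, achieving strong performance.
Insight: The novelty lies in a framework that incorporates learned 3D hand and body priors to reconstruct biomechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular videos. Viewed objectively, integrating prior knowledge to mitigate self-occlusion and noise may offer a new way to improve 3D reconstruction quality from limited 2D data.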
Abstract: The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, existing state-of-the-art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance in the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art. The official website of this work is: https://github.com/kaustesseract/DexAvatar.
[35] Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control cs.CVPDF
Minghao Han, YiChen Liu, Yizhou Liu, Zizhi Chen, Jingqun Tang
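TL;DR: This paper proposes the UniPath framework, which achieves semantics-driven pathology image generation through Diagnostic Semantic Tokens and prototype control. It addresses data scarcity, imprecise semantic control, and terminological heterogeneity in pathology image generation, and reaches SOTA performance in a multi-tier evaluation.
Details
Motivation: To address the problems that generative models in computational pathology merely simulate pixels and lack fine-grained semantic control, that data are scarce, and that heterogeneous diagnostic terminology makes text conditioning unreliable.
Result: Achieves SOTA in a four-tier evaluation tailored to pathology, with a Patho-FID of 80.9 (51% better than the second-best model) and fine-grained semantic control reaching 98.7% of the real-image level.
Insight: The innovations include using a frozen pathology MLLM to extract paraphrase-robust Diagnostic Semantic Tokens, achieving component-level morphological control through a prototype bank, and building a large-scale, high-quality image-text corpus to ease the data bottleneck.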
Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image level. The curated datasets, complete source code, and pre-trained model weights developed in this study will be made publicly available.
[36] Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition cs.CVPDF
Hongsong Wang, Heng Fei, Bingxuan Dai, Jie Gui
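TL;DR: This paper proposes Decomposition and Composition, a self-supervised multimodal skeleton-based action representation learning framework that targets the trade-off between efficiency and performance in multimodal action understanding. Its Decomposition strategy separates fused multimodal features into unimodal features and aligns them, while its Composition strategy integrates unimodal features as self-supervised guidance to strengthen multimodal representation learning.
Details
Motivation: To address the difficulty existing multimodal action understanding methods have in balancing model efficiency against exploiting modality complementarity: simple late fusion incurs high computational overhead, while early fusion with a shared backbone performs poorly.
Result: Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets show that the method strikes an excellent balance between computational cost and model performance.
Insight: The novelty lies in the Decomposition and Composition strategies: Decomposition finely separates multimodal features into unimodal features and aligns them, and Composition uses unimodal features as self-supervised signals to refine the multimodal representation, reconciling modality complementarity with model efficiency within a self-supervised framework.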
Abstract: Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.
[37] UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer cs.CVPDF
Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang
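TL;DR: This paper proposes UniPR-3D, a new architecture for visual place recognition (VPR) that is the first to effectively integrate multi-view information. Built on a VGGT backbone capable of encoding multi-view 3D representations, the method designs feature aggregators and fine-tunes for place recognition, jointly using the 3D tokens and intermediate 2D tokens produced by VGGT to build descriptors, and adopts single- and multi-frame aggregation schemes together with a variable-length sequence retrieval strategy to improve generalization.
Details
Motivation: Visual place recognition has traditionally been formulated as a single-image retrieval task. Using multi-view information offers clear advantages, but this setting remains underexplored and existing methods struggle to generalize across environments. This paper aims to solve the generality problem of multi-view VPR.
Result: Experiments show that UniPR-3D sets a new state of the art (SOTA) on the relevant benchmarks, outperforming both single-view and multi-view baselines.
Insight: The main novelty is the first VPR architecture that effectively integrates multi-view information, with dedicated 2D and 3D feature aggregation modules that let the descriptor capture fine-grained texture cues while reasoning across viewpoints. It also combines single-/multi-frame aggregation with variable-length sequence retrieval to improve generalization, underscoring the effectiveness of geometry-grounded tokens for VPR.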
Abstract: Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work, we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on GitHub at https://github.com/dtc111111/UniPR-3D.
[38] T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation cs.CVPDF
Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang
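TL;DR: This paper presents T2AV-Compass, a unified benchmark for comprehensive evaluation of text-to-audio-video (T2AV) generation systems. It contains 500 diverse and complex prompts and uses a dual-level evaluation framework that integrates objective signal-level metrics with a subjective MLLM-as-a-Judge protocol. An evaluation of 11 representative T2AV systems shows that existing models still fall far short of human level in realism and cross-modal consistency.
Details
Motivation: Current evaluation of T2AV generation is fragmented, relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. This paper aims to address this limitation with a unified, comprehensive evaluation benchmark.
Result: Eleven representative T2AV systems were evaluated extensively. The results show that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following. T2AV-Compass proves to be a challenging and diagnostic testbed.
Insight: The novelty is a unified, comprehensive T2AV evaluation benchmark (T2AV-Compass) whose core consists of: 1) a diverse set of complex prompts, built via a taxonomy-driven pipeline, that are semantically rich and physically plausible; and 2) a dual-level evaluation framework combining objective signal-level metrics (video/audio quality, cross-modal alignment) with a subjective MLLM-as-a-Judge protocol to jointly assess instruction following and realism. This gives future models clear diagnostics and directions for improvement.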
Abstract: Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in areas such as audio realism, fine-grained synchronization, and instruction following. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.
[39] UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters cs.CVPDF
Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi
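TL;DR: This paper proposes UniRec-0.1B, a lightweight unified recognition model with only 0.1B parameters that recognizes text and formulas in documents at multiple levels (characters, words, lines, paragraphs, and documents). By building UniRec40M, a large-scale dataset of 40 million samples, and introducing hierarchical supervision training and a semantic-decoupled tokenizer to handle structural variability and semantic entanglement, the model outperforms general-purpose vision-language models and document parsing expert models on multiple benchmarks while achieving a 2-9x speedup.
Details
Motivation: Existing vision-language models (VLMs) perform well at unified recognition of text and formulas, but their large parameter counts and high computational demands limit wide adoption. This paper aims to develop a lightweight, efficient, and capable unified recognition model to remove this bottleneck.
Result: On a comprehensive evaluation benchmark covering multi-domain, multi-level Chinese and English documents, as well as on public benchmarks, UniRec-0.1B outperforms general-purpose VLMs and leading document parsing expert models while achieving a 2-9x inference speedup.
Insight: The main innovations are: 1) building UniRec40M, a large-scale dataset mixing text and formulas; 2) a hierarchical supervision training strategy that explicitly guides the model to understand structure at different levels; and 3) a semantic-decoupled tokenizer that separates text and formula representations. Together these address the two challenges of structural variability and semantic entanglement and offer a new approach to designing lightweight unified recognition models.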
Abstract: Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed samples, enabling the training of a powerful yet lightweight model. Second, we identify two challenges in building such a lightweight but unified expert model: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training scheme that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and at multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9$\times$ speedup, validating its effectiveness and efficiency. Codebase and Dataset: https://github.com/Topdu/OpenOCR.
[40] MarineEval: Assessing the Marine Intelligence of Vision-Language Models cs.CV | cs.DBPDF
Yuk-Kwan Wong, Tuan-An To, Jipeng Zhang, Ziqiang Zheng, Sai-Kit Yeung
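TL;DR: This paper presents MarineEval, the first large-scale evaluation dataset and benchmark for vision-language models (VLMs) in the marine domain. It contains 2,000 image-based question-answering pairs designed to assess how existing VLMs handle marine questions that require domain expertise. The paper comprehensively tests 17 existing VLMs and finds that they answer domain-specific questions poorly, leaving considerable room for improvement.
Details
Motivation: Although VLMs have succeeded in many fields, it is unclear whether they can act as experts and answer questions accurately in specific domains that demand substantial expertise, such as marine science. The paper aims to explore the capability boundaries of existing VLMs and to address the challenges and requirements particular to the marine domain.
Result: Seventeen existing VLMs were tested on the MarineEval benchmark. The experiments show that existing models cannot effectively answer domain-specific questions and leave considerable room for improvement; the abstract does not report specific quantitative results or whether any model reaches SOTA.
Insight: The novelty lies in building the first large-scale, diverse marine-domain VLM evaluation dataset, covering 7 task dimensions and 20 capacity dimensions and verified by domain experts, which provides a benchmark and direction for evaluating and improving VLMs in specialized domains.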
Abstract: We have witnessed promising progress led by large language models (LLMs) and, further, vision-language models (VLMs) in handling various queries as general-purpose assistants. VLMs, as a bridge connecting the visual world and language corpora, receive both visual content and various text-only user instructions to generate corresponding responses. Though VLMs have achieved great success in various fields, in this work we ask whether existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges and requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark, called MarineEval, with 2,000 image-based question-answering pairs. During dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by marine domain experts. We comprehensively benchmark 17 existing VLMs on MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still considerable room for further performance improvements. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/
[41] TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation cs.CV | cs.AIPDF
Gaoren Lin, Huangxuan Zhao, Yuan Xiong, Lefei Zhang, Bo Du
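TL;DR: This paper proposes TGC-Net, a CLIP-based framework for text-guided medical image segmentation. It introduces a Semantic-Structural Synergy Encoder to better preserve fine-grained anatomical structures, a Domain-Augmented Text Encoder that injects medical knowledge to model complex clinical descriptions, and a Vision-Language Calibration Module that refines cross-modal correspondence in a unified feature space, efficiently addressing the three main limitations of applying CLIP directly to medical imaging.
Details
Motivation: Existing text-guided medical segmentation methods typically rely on unaligned image and text encoders and need complex interaction modules for multimodal fusion. CLIP provides a pre-aligned multimodal feature space, but applying it directly to medical imaging faces three main problems: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment.
Result: Experiments on five datasets spanning chest X-ray and thoracic CT modalities show that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
Insight: The novelty is a parameter-efficient, task-specific CLIP adaptation framework built on three cooperating components: SSE augments CLIP's ViT with a CNN branch for multi-scale structural refinement; DATE enriches text representations by injecting large-language-model-derived medical knowledge; and VLCM refines cross-modal correspondence in a unified feature space. This offers a structured approach to efficiently adapting general vision-language models to specialized medical domains.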
Abstract: Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP’s ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
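For readers unfamiliar with the Dice metric cited above, the following sketch computes the standard Dice coefficient, 2|P ∩ T| / (|P| + |T|), on binary masks. It is a generic illustration rather than the paper's evaluation code; the flat 0/1 list representation and the convention for two empty masks are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total

# Toy 4-pixel example: one true-positive pixel and one false-positive pixel.
prediction = [1, 1, 0, 0]
ground_truth = [1, 0, 0, 0]
print(round(dice_score(prediction, ground_truth), 4))  # 0.6667
```

Because the score is driven by the overlap term alone, it rewards correct foreground pixels without being inflated by the large background regions typical of medical images.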
[42] ORCA: Object Recognition and Comprehension for Archiving Marine Species cs.CVPDF
Yuk-Kwan Wong, Haixin Liang, Zeyu Ma, Yiwei Chen, Ziqiang Zheng
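TL;DR: ORCA is a multi-modal benchmark dataset for archiving marine species, comprising 14,647 images, 478 species, 42,217 bounding box annotations, and 22,321 expert-verified instance captions. It aims to advance marine visual understanding through systematic task definitions such as object detection, instance captioning, and visual grounding.
Details
Motivation: To address the limited training data and the lack of systematic task definitions in marine visual understanding, and thereby enable automated, scalable marine biological surveys.
Result: Eighteen state-of-the-art models were evaluated on the ORCA benchmark. The results show that marine species diversity, morphological overlap, and domain-specific demands pose key challenges, underscoring the difficulty of marine understanding.
Insight: The novelty lies in building the first multi-modal marine benchmark that combines fine-grained visual and textual annotations; its systematic task evaluation exposes domain-specific challenges and provides a standardized platform for follow-up research.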
Abstract: Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in the marine domain. Project Page: http://orca.hkustvgd.com/.
[43] VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs cs.CVPDF
Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac
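TL;DR: This paper introduces VisRes Bench, a benchmark designed to evaluate the visual reasoning of vision-language models (VLMs) in naturalistic settings rather than their reliance on linguistic priors. The benchmark has three levels of complexity, testing perceptual completion, rule-based attribute inference, and compositional reasoning. Across more than 19,000 controlled images, it shows that current SOTA VLMs perform near random under perceptual perturbations and have limited abstract reasoning ability.
Details
Motivation: Existing VLMs may lean too heavily on linguistic priors in visual tasks instead of genuine visual reasoning, so a benchmark without contextual language supervision is needed to assess their purely visual reasoning ability.
Result: On VisRes Bench, current SOTA VLMs perform near random under subtle perceptual perturbations, showing clear limitations in perceptual and relational visual reasoning.
Insight: The novelty lies in a tiered visual reasoning benchmark (VisRes Bench) whose three levels of complexity isolate distinct reasoning abilities. It gives multimodal research a unified framework for evaluating abstract visual reasoning and helps expose VLMs' lack of abstraction beyond pattern recognition.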
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
[44] Human Motion Estimation with Everyday Wearables cs.CVPDF
Siqi Zhu, Yixuan Li, Junfu Li, Qi Wu, Zan Wang
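TL;DR: This paper proposes EveryWear, a lightweight human motion capture method that uses only everyday wearables (a smartphone, smartwatch, earbuds, and camera-equipped smart glasses) and requires no explicit calibration. It targets the poor wearability, expensive hardware, and cumbersome calibration of existing methods.
Details
Motivation: To overcome the shortcomings of device-based human motion estimation in wearability, hardware cost, and ease of calibration, and thereby promote wide adoption in everyday applications such as XR interaction.
Result: Experiments show that the method outperforms baseline models on a real-world dataset, validating its effectiveness for practical full-body motion estimation.
Insight: The novelty lies in building the system entirely from everyday wearables and introducing Ego-Elec, a large-scale real-world multimodal dataset, to train the model. A teacher-student framework integrates visual and inertial signals, and training directly on real data effectively eliminates the sim-to-real gap.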
Abstract: While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by motion capture (MoCap), to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.
[45] Latent Implicit Visual Reasoning cs.CVPDF
Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell
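TL;DR: This paper proposes a task-agnostic mechanism that lets large multimodal models discover and use visual reasoning tokens on their own, without explicit supervision, improving their performance on vision-dominant reasoning tasks. Through global attention and task-adaptive image re-encoding, the method extracts relevant visual information effectively and achieves state-of-the-art results on a variety of vision-centric tasks.
Details
Motivation: Existing large multimodal models are text-centric and rely on language as their core reasoning modality, which limits them on vision-dominant reasoning tasks. Existing approaches supervise intermediate visual steps with helper images, depth maps, or image crops, but this constrains what counts as a visual abstraction, incurs high annotation costs, and generalizes poorly.
Result: The method outperforms direct fine-tuning and achieves state-of-the-art results on a variety of vision-centric tasks, including those where intermediate abstractions are hard to specify, and it generalizes to multi-task instruction tuning.
Insight: The novelty lies in an unsupervised mechanism for discovering visual reasoning tokens. Task-adaptive image re-encoding enables flexible extraction of visual information, avoids the limits of hand-crafted supervision, and improves generalization in visual reasoning.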
Abstract: While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what “useful” visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks – including those where intermediate abstractions are hard to specify – while also generalizing to multi-task instruction tuning.
[46] Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval cs.CV | cs.AIPDF
Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen
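TL;DR: This paper proposes a lightweight two-stage image retrieval method that uses event-centric entity extraction to incorporate temporal and contextual information from image captions, improving the accuracy and scalability of image retrieval from natural language descriptions.
Details
Motivation: To address the challenges of real-world image-text retrieval, such as vague queries, context dependence, linguistic variability, and the need for scalability.
Result: On the OpenEvents v1 benchmark, the method achieves a mean average precision of 0.559, substantially outperforming prior baselines.
Insight: The novelty lies in combining event-guided lightweight entity filtering (based on BM25) with deep semantic reranking by a long-text vision-language model (BEiT-3), giving an efficient and accurate retrieval solution for complex real-world scenarios.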
Abstract: Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
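The filter-then-rerank structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the Okapi BM25 variant and its `k1`/`b` defaults are a common textbook choice, whitespace tokenization stands in for real entity extraction, and `rerank_fn` is a hypothetical placeholder for the BEiT-3 scoring stage.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query_entities, captions, rerank_fn, top_k=2):
    """Stage 1: BM25 filtering on entities. Stage 2: rerank survivors with rerank_fn."""
    docs = [c.lower().split() for c in captions]
    scores = bm25_scores([e.lower() for e in query_entities], docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank_fn(captions[i]), reverse=True)

captions = [
    "flood hits hanoi streets in october",
    "hanoi marathon draws thousands of runners",
    "wildfire spreads across california hills",
]
# Hypothetical stand-in for a learned multimodal reranker: count overlapping cue words.
rerank = lambda caption: len(set(caption.split()) & {"flood", "streets"})
print(retrieve(["Hanoi", "flood"], captions, rerank))  # [0, 1]
```

The design point is cost asymmetry: the cheap lexical stage prunes the collection so that the expensive model scores only `top_k` candidates per query.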
[47] SegMo: Segment-aligned Text to 3D Human Motion Generation cs.CVPDF
Bowen Dang, Lin Wu, Xiaohang Yang, Zheng Yuan, Zhixiang Chen
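TL;DR: This paper proposes SegMo, a novel segment-aligned text-to-3D human motion generation framework that aims for fine-grained text-motion alignment. The method decomposes textual descriptions and motion sequences into semantically coherent segments and aligns them with contrastive learning, improving generation quality.
Details
Motivation: Existing methods align textual descriptions with human motion at the sequence level and ignore the semantic structure within each modality. This paper is motivated by the observation that both text and motion decompose naturally into smaller semantic segments, which allows finer-grained correspondence.
Result: On the two widely used datasets HumanML3D and KIT-ML, SegMo improves on a strong baseline, reaching an improved TOP 1 score of 0.553 on the HumanML3D test set.
Insight: The core novelty is decomposing text and motion into atomic alignment units (segments) and building a shared embedding space through contrastive learning. This improves generation performance and also lets the model be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.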
Abstract: Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
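Aligning text and motion segments with contrastive learning is commonly done with an InfoNCE-style objective, sketched below. The abstract does not specify SegMo's exact loss, so the cosine similarity, the temperature of 0.07, the text-to-motion direction, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(text_emb, motion_emb, temperature=0.07):
    """Mean text-to-motion InfoNCE loss; index i in both lists is a matched pair."""
    losses = []
    for i, t in enumerate(text_emb):
        logits = [cosine(t, m) / temperature for m in motion_emb]
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

text_segments = [[1.0, 0.0], [0.0, 1.0]]      # toy segment embeddings
aligned_motion = [[1.0, 0.0], [0.0, 1.0]]     # each motion matches its text
shuffled_motion = [[0.0, 1.0], [1.0, 0.0]]    # pairs deliberately mismatched
print(info_nce(text_segments, aligned_motion) < info_nce(text_segments, shuffled_motion))  # True
```

The loss is low when each text segment is closest to its own motion segment and high otherwise, which is the property that makes the learned space usable for retrieval-style tasks.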
[48] DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation cs.CVPDF
Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou
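TL;DR: This paper presents DreaMontage, a framework for generating arbitrary frame-guided, long-duration one-shot videos. By introducing a lightweight intermediate-conditioning mechanism, a high-quality dataset with Visual Expression fine-tuning, and a Segment-wise Auto-Regressive inference strategy, it addresses the insufficient visual smoothness and temporal coherence of existing video generation methods and can synthesize seamless, expressive one-shot videos from diverse user-provided inputs.
Details
Motivation: To address the difficulty of realizing the "one-shot" aesthetic in filmmaking because of prohibitive costs and real-world constraints, and the visual and temporal incoherence that results when existing video generation models rely on naive clip concatenation.
Result: Extensive experiments show that the method produces visually striking, seamlessly coherent one-shot effects while remaining computationally efficient, with significantly improved success rate and usability.
Insight: The innovations include: integrating a lightweight intermediate-conditioning mechanism and an Adaptive Tuning strategy into the DiT architecture for robust arbitrary-frame control; improving visual fidelity and motion rationality through a high-quality dataset, Visual Expression fine-tuning, and a Tailored DPO scheme; and designing a memory-efficient Segment-wise Auto-Regressive inference strategy for generating long sequences. Together these provide a systematic solution for turning fragmented visual materials into coherent cinematic experiences.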
Abstract: The “one-shot” technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
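The general idea behind segment-wise autoregressive inference can be sketched as a loop that generates a long sequence in chunks, conditioning each chunk only on the tail of what came before. The abstract gives no implementation details of SAR, so everything here is hypothetical: the function names, the `context_len` window, and the toy generator that stands in for a video model.

```python
def generate_segmentwise(total_frames, segment_len, context_len, generate_segment):
    """Generate a long frame sequence one segment at a time.

    Each call to generate_segment sees only the last context_len frames,
    so the conditioning input stays bounded regardless of total length.
    """
    frames = []
    while len(frames) < total_frames:
        # Guard against context_len == 0, where frames[-0:] would be the whole list.
        context = frames[-context_len:] if context_len > 0 else []
        n = min(segment_len, total_frames - len(frames))
        frames.extend(generate_segment(context, n))
    return frames

def toy_generator(context, n):
    """Stand-in for a video model: continues an integer 'frame' counter."""
    start = context[-1] + 1 if context else 0
    return list(range(start, start + n))

video = generate_segmentwise(total_frames=10, segment_len=4, context_len=2,
                             generate_segment=toy_generator)
print(video)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because each segment continues from the previous segment's final frames, the toy output has no discontinuity at segment boundaries, which is the behavior a chunked scheme needs in order to avoid the seams of naive clip concatenation.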
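[49] AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI cs.CV
Changwei Wu, Yifei Chen, Yuxin Du, Mingxuan Liu, Jinying Zong
TL;DR: This paper proposes AnyAD, a unified any-modality anomaly detection framework that handles the missing-modality problem common in clinical brain MRI. Using a dual-pathway DINOv2 encoder, a feature distribution alignment mechanism, and a decoder guided by Intrinsic Normal Prototypes (INPs), the framework performs robust anomaly detection and localization under any combination of MRI modalities, without retraining for different modality configurations.
Details
Motivation: Address the challenges in brain MRI anomaly detection caused by the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in clinical workflows. Existing methods typically rely on fixed modality configurations, require repeated training, or fail to generalize to unseen modality combinations, which limits their clinical scalability.
Result: Extensive experiments on the BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks datasets show that the method consistently outperforms state-of-the-art industrial and medical anomaly detection baselines across 7 modality combinations, achieving superior generalization.
Insight: The innovation is a unified framework that adapts to arbitrary modality availability. It strengthens semantic consistency through feature distribution alignment and INP-guided reconstruction, and is trained with randomized modality masking and indirect feature completion so that it learns to handle all modality configurations, establishing a scalable paradigm for multimodal medical anomaly detection under imperfect real-world conditions.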
Abstract: Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions. Our source code is available at https://github.com/wuchangw/AnyAD.
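[50] ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision cs.CV
Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang
TL;DR: This paper proposes Attention-Conditional Diffusion (ACD), a new framework that achieves direct conditional control in video diffusion models through attention supervision, improving the alignment between generated videos and conditioning signals.
Details
Motivation: Existing classifier-based and classifier-free guidance methods offer limited conditional control in video synthesis: the former can produce adversarial artifacts, while the latter models the joint distribution indirectly and yields imprecise alignment. A more direct and effective conditioning mechanism is therefore needed.
Result: Extensive experiments on benchmark video generation datasets show that ACD achieves better alignment with conditioning inputs while preserving temporal coherence and visual fidelity.
Insight: The innovation lies in achieving direct conditional control by aligning attention maps with external control signals, and in introducing a sparse 3D-aware object layout as an efficient conditioning signal, together with a dedicated Layout ControlNet and an automated annotation pipeline, providing an effective paradigm for conditional video synthesis.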
Abstract: Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model’s attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.
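[51] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation cs.CV
Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller
TL;DR: This paper proposes GriDiT, an efficient framework for long image sequence generation based on factorized grids. It splits generation into two stages: first generating a coarse sequence at low resolution, then refining individual frames at high resolution. The method captures inter-frame correlations through the self-attention mechanism of the Diffusion Transformer (DiT), extending a 2D image generator into a low-resolution 3D sequence generator without architectural modifications.
Details
Motivation: Targeting the efficiency bottlenecks and suboptimal representation of current image sequence generation methods that operate on large tensors, the work aims to design a more effective way of modeling image sequences.
Result: Across multiple datasets, the method outperforms existing SOTA methods in synthesis quality and sequence coherence, runs inference at least twice as fast, and efficiently generates high-fidelity sequences of arbitrary length.
Insight: The innovation is a two-stage factorization that separates low-resolution global generation from high-resolution per-frame refinement, using DiT self-attention to capture inter-frame dependencies naturally. This extends a 2D generator to 3D without architectural changes and improves efficiency and generalization.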
Abstract: Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.
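The grid formulation described in this abstract can be illustrated with a minimal sketch: subsampled frames are tiled into one grid image that a 2D generator could treat as a single sample, and the tiles are cut back out afterwards. The row-major layout, the helper names `frames_to_grid` and `grid_to_frames`, and the use of nested lists as images are assumptions made for illustration; they are not details taken from the paper.

```python
def frames_to_grid(frames, rows, cols):
    """Tile rows*cols equally sized frames (2D lists) into one grid image, row-major."""
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    height = len(frames[0])
    grid = []
    for r in range(rows):
        for y in range(height):
            line = []
            for c in range(cols):
                # Append row y of the frame sitting at grid cell (r, c).
                line.extend(frames[r * cols + c][y])
            grid.append(line)
    return grid


def grid_to_frames(grid, rows, cols):
    """Cut a grid image back into its rows*cols frames, in row-major order."""
    height = len(grid) // rows
    width = len(grid[0]) // cols
    frames = []
    for r in range(rows):
        for c in range(cols):
            frames.append(
                [grid[r * height + y][c * width:(c + 1) * width] for y in range(height)]
            )
    return frames


# Four hypothetical 2x2 frames, each filled with its own frame index.
frames = [[[i, i], [i, i]] for i in range(4)]
grid = frames_to_grid(frames, rows=2, cols=2)
recovered = grid_to_frames(grid, rows=2, cols=2)
```

In this toy layout, self-attention over the grid image would see all frames at once, which is the property the abstract relies on; the per-frame super-resolution stage is not sketched here.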
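[52] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential cs.CV
Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang
TL;DR: This paper proposes SpikeSurgSeg, the first spike-driven video Transformer framework designed for surgical scene segmentation, aiming for real-time potential on non-GPU platforms. It addresses the scarcity of labeled data for spiking neural networks (SNNs) with a surgical-scene masked autoencoding pretraining strategy, and uses a lightweight spike-driven segmentation head to preserve low latency. Experiments on EndoVis18 and the in-house SurgBleed dataset show mIoU comparable to SOTA ANN-based models, with inference latency reduced by at least 8× and over 20× acceleration relative to most foundation-model baselines.
Details
Motivation: Current deep learning models for surgical scene segmentation, especially large-scale foundation models, are computationally heavy and power-hungry, which makes real-time deployment in resource-constrained surgical environments difficult. The work explores SNNs as a path to efficient surgical intelligence.
Result: On EndoVis18 and the in-house SurgBleed dataset, SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8×, and delivers over 20× acceleration relative to most foundation-model baselines.
Insight: The innovations include: 1) the first spike-driven video Transformer framework tailored for surgical scene segmentation; 2) a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking; 3) a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Viewed objectively, combining SNNs with video Transformers for surgical scene understanding offers a new approach to the trade-off between real-time performance and accuracy.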
Abstract: Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8×. Notably, it delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.
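[53] Fast SAM2 with Text-Driven Token Pruning cs.CV
Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang
TL;DR: This paper proposes a text-guided token pruning framework that improves the inference efficiency of Segment Anything Model 2 (SAM2) for video object segmentation. Operating after visual encoding and before memory-based propagation, a lightweight routing mechanism ranks and selectively prunes tokens by combining local visual context, semantic relevance derived from object-centric textual descriptions, and uncertainty cues, thereby reducing redundant computation.
Details
Motivation: Vision foundation models such as SAM2 have significantly advanced prompt-driven video object segmentation, but their practical deployment is limited by the high computational and memory cost of processing dense visual tokens across time. Existing pipelines typically propagate all visual tokens to downstream temporal reasoning modules regardless of their relevance to the target object, incurring quadratic memory-attention overhead and reducing scalability.
Result: Extensive experiments on multiple challenging video segmentation benchmarks show that, while preserving competitive J and F performance, the method achieves up to 42.50% faster inference and 37.41% lower GPU memory usage than the unpruned SAM2 baseline.
Insight: The innovation is a text-driven token pruning framework placed after the encoder and before temporal propagation, which selects the most informative tokens through a routing mechanism that integrates visual, textual-semantic, and uncertainty cues. This offers Transformer-based video segmentation systems an effective way to improve scalability for real-time and resource-constrained applications without modifying the underlying segmentation architecture.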
Abstract: Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
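The ranking-and-pruning step described in this abstract can be sketched roughly as follows. The weighted-sum scoring rule, the `weights` and `keep_ratio` parameters, and the function name `prune_tokens` are hypothetical; the abstract states only that visual context, text relevance, and uncertainty cues are combined by a lightweight router, not how.

```python
def prune_tokens(visual, text_relevance, uncertainty, keep_ratio=0.5,
                 weights=(1.0, 1.0, 1.0)):
    """Return the sorted indices of tokens kept after score-based pruning.

    Each argument is a list with one score per token. The linear combination
    below is an assumed stand-in for the paper's routing mechanism.
    """
    n = len(visual)
    if not (len(text_relevance) == n and len(uncertainty) == n):
        raise ValueError("all cue lists must have one entry per token")
    w_v, w_t, w_u = weights
    scores = [w_v * v + w_t * t + w_u * u
              for v, t, u in zip(visual, text_relevance, uncertainty)]
    # Keep at least one token so downstream modules always receive input.
    k = max(1, int(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Restore the original token order for the surviving tokens.
    return sorted(ranked[:k])


# Four hypothetical tokens: token 0 matches the text prompt, token 2 is uncertain.
kept = prune_tokens(
    visual=[0.9, 0.1, 0.2, 0.1],
    text_relevance=[0.8, 0.1, 0.1, 0.2],
    uncertainty=[0.1, 0.0, 0.9, 0.1],
    keep_ratio=0.5,
)
```

Including the uncertainty term means an ambiguous token (index 2 here) can survive even with low text relevance, which mirrors the stated goal of preserving boundary-critical regions.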
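[54] Streaming Video Instruction Tuning cs.CV
Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
TL;DR: This paper presents Streamo, a real-time streaming video large language model that serves as a general-purpose interactive assistant. Unlike existing online video models that focus on question answering or captioning, Streamo performs a wide range of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To enable this versatility, the authors construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. Trained end-to-end, Streamo shows strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks.
Details
Motivation: Bridge the gap between offline video perception models and real-time multimodal assistants, advancing unified, intelligent video understanding in continuous video streams.
Result: Streamo performs strongly across a wide range of streaming benchmarks, showing strong temporal reasoning and generalization, and takes a step toward unified real-time video understanding.
Insight: The innovation lies in constructing a large-scale streaming video instruction dataset (Streamo-Instruct-465K) that covers diverse temporal contexts and multi-task supervision, and in using a unified end-to-end training pipeline that lets a single model handle many heterogeneous real-time streaming video tasks, marking a shift from offline to online video understanding.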
Abstract: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
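[55] Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models cs.CV
Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu
TL;DR: This paper reveals a significant popularity bias in state-of-the-art vision-language models (VLMs): accuracy on famous buildings is up to 34% higher than on ordinary ones, indicating reliance on memorization rather than generalizable understanding. To study this, the authors build YearGuessr, the largest open benchmark dataset for this task, containing 55,546 building images with multi-modal attributes, and introduce a construction-year prediction task framed as ordinal regression along with popularity-aware interval accuracy metrics to quantify the bias. Evaluation of more than 30 models, including the authors’ YearCLIP model, confirms that VLMs excel on popular, memorized items but degrade markedly on unrecognized subjects, exposing a critical flaw in their reasoning capabilities.
Details
Motivation: Expose and systematically study the popularity bias in vision-language models, namely that models over-rely on memorizing popular items rather than developing genuinely generalizable understanding, which limits their reliability in real-world applications.
Result: On the YearGuessr benchmark, more than 30 models were evaluated. State-of-the-art VLMs achieve up to 34% higher accuracy on famous buildings than on ordinary ones, quantifying a significant popularity bias. The authors’ YearCLIP model is included in the evaluation, which confirms that this flaw is widespread.
Insight: The innovation is a large-scale, multi-modal benchmark dataset, YearGuessr, with continuous ordinal labels (construction year) and a popularity proxy (page-view counts). The prediction task is formalized as ordinal regression, and popularity-aware evaluation metrics are designed to quantify model bias systematically, providing new tools and a new perspective for evaluating VLM generalization rather than memorization.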
Abstract: We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
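As a rough illustration of what a popularity-aware interval accuracy metric might look like, the sketch below scores a prediction as correct when it falls within a tolerance of the true construction year, then compares accuracy between high and low page-view groups. The abstract does not define the metric, so the tolerance, the page-view threshold, the function names, and the toy data are all assumptions.

```python
def interval_accuracy(pred_years, true_years, tolerance):
    """Fraction of predictions within +/- tolerance years of the true year."""
    if not true_years or len(pred_years) != len(true_years):
        raise ValueError("inputs must be non-empty and equally long")
    hits = sum(1 for p, t in zip(pred_years, true_years) if abs(p - t) <= tolerance)
    return hits / len(true_years)


def popularity_gap(pred_years, true_years, page_views, tolerance, view_threshold):
    """Return (popular accuracy, ordinary accuracy, gap) for a page-view split."""
    rows = list(zip(pred_years, true_years, page_views))
    popular = [(p, t) for p, t, v in rows if v >= view_threshold]
    ordinary = [(p, t) for p, t, v in rows if v < view_threshold]
    if not popular or not ordinary:
        raise ValueError("both popularity groups must be non-empty")
    acc_pop = interval_accuracy([p for p, _ in popular], [t for _, t in popular], tolerance)
    acc_ord = interval_accuracy([p for p, _ in ordinary], [t for _, t in ordinary], tolerance)
    return acc_pop, acc_ord, acc_pop - acc_ord


# Hypothetical predictions for four buildings; two have many page views.
pred = [1890, 1655, 1950, 2001]
true = [1889, 1650, 1955, 1900]
views = [100000, 50, 80000, 10]
overall = interval_accuracy(pred, true, tolerance=5)
acc_pop, acc_ord, gap = popularity_gap(pred, true, views, tolerance=5, view_threshold=1000)
```

A positive gap in this toy setup plays the role of the bias the benchmark is meant to expose: the model is accurate on well-known buildings and less so on obscure ones.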
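[56] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming cs.CV
Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren
TL;DR: HiStream is an efficient high-resolution video generation framework that substantially reduces the computational cost of diffusion models by eliminating redundancy along three dimensions: spatial, temporal, and timestep. On 1080p benchmarks, the primary HiStream model achieves SOTA visual quality with up to 76.2× faster denoising than the Wan2.1 baseline; the accelerated variant, HiStream+, reaches a 107.5× speedup, striking a good balance between speed and quality.
Details
Motivation: High-resolution video generation is crucial for digital media and film, but the quadratic computational complexity of diffusion models makes practical inference infeasible. The paper aims to remove this bottleneck and make high-resolution video generation practical and scalable.
Result: On 1080p benchmarks, the primary HiStream model (spatial and temporal compression) achieves SOTA visual quality with up to 76.2× faster denoising than the Wan2.1 baseline and negligible quality loss. The HiStream+ variant (all three compressions) reaches a 107.5× speedup, offering an attractive trade-off between speed and quality.
Insight: The claimed innovation is the systematic elimination of redundancy along three axes: spatial (denoising at low resolution, then refining at high resolution with cached features), temporal (a chunk-by-chunk strategy with a fixed-size anchor cache), and timestep (fewer denoising steps for subsequent, cache-conditioned chunks). Viewed objectively, this multi-axis compression within a streaming autoregressive framework is a novel and effective design paradigm for efficient video generation, and its caching mechanism and chunking strategy are instructive for reducing computational overhead.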
Abstract: High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.
q-bio.NC
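[57] Decoding Predictive Inference in Visual Language Processing via Spatiotemporal Neural Coherence q-bio.NC | cs.CL
Sean C. Borneman, Julia Krebs, Ronnie B. Wilbur, Evie A. Malaia
TL;DR: This paper proposes a machine learning framework that decodes predictive inference in the brain by analyzing electroencephalography (EEG) responses of Deaf signers watching dynamic visual language stimuli. The method builds spatiotemporal representations from the coherence between neural signals and optical-flow motion features, and uses entropy-based feature selection to identify frequency-specific neural signatures that distinguish intelligible linguistic input from time-reversed distractor stimuli.
Details
Motivation: Explore the neural mechanisms of predictive inference in human language processing, in particular how experience-driven generative models of perception can be decoded with a multimodal approach during visual language (e.g., sign language) processing.
Result: The results show that distributed left-hemispheric and frontal low-frequency coherence are key features of language comprehension, and that these neural signatures correlate with age, reflecting experience dependence.
Insight: The innovation lies in combining neural signals (EEG) with visual motion features (optical flow) to build spatiotemporal coherence representations, and in using entropy-based feature selection to reveal frequency-specific neural signatures, providing a new multimodal decoding framework for studying predictive perception in the brain.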
Abstract: Human language processing relies on the brain’s capacity for predictive inference. We present a machine learning framework for decoding neural (EEG) responses to dynamic visual language stimuli in Deaf signers. Using coherence between neural signals and optical flow-derived motion features, we construct spatiotemporal representations of predictive neural dynamics. Through entropy-based feature selection, we identify frequency-specific neural signatures that differentiate interpretable linguistic input from linguistically disrupted (time-reversed) stimuli. Our results reveal distributed left-hemispheric and frontal low-frequency coherence as key features in language comprehension, with experience-dependent neural signatures correlating with age. This work demonstrates a novel multimodal approach for probing experience-driven generative models of perception in the brain.
cs.LG
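[58] Generalization of RLVR Using Causal Reasoning as a Testbed cs.LG | cs.AI | cs.CL
Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding
TL;DR: This paper empirically studies the generalization of reinforcement learning with verifiable rewards (RLVR) using probabilistic inference tasks over causal graphical models. RLVR shows stronger within-level and across-level generalization than supervised fine-tuning (SFT) for specific combinations of model size and training query level, but its effectiveness depends on the model’s initial reasoning competence.
Details
Motivation: RLVR is a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks, but the conditions under which it generalizes robustly remain unclear. The paper uses causal reasoning as a testbed to probe RLVR’s generalization properties.
Result: Qwen-2.5-Instruct models (3B-32B) are fine-tuned with RLVR or SFT on datasets built from causal graphs and queries. For specific combinations of model size and training query level, RLVR achieves stronger generalization and higher accuracy than SFT across associational, interventional, and counterfactual query levels and across queries of varying structural complexity, especially on more complex queries.
Insight: The innovation is to place the study of RLVR generalization within a causal reasoning framework and to examine two difficulty axes systematically: query level and structural complexity. The core finding is that RLVR’s effectiveness depends on the model’s initial reasoning competence. With sufficient initial competence, RLVR improves the model’s marginalization strategy and reduces errors in intermediate probability calculations, strengthening specific causal reasoning subskills and yielding generalization gains. This offers useful insight into the conditions under which RLVR applies.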
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain poorly understood. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query – associational, interventional, or counterfactual – and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct datasets of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR’s effectiveness depends on the model’s initial reasoning competence. With sufficient initial competence, RLVR improves an LLM’s marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These findings show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence.
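To make the difference between associational and interventional queries concrete, here is a small worked example on a hypothetical three-node graph (Z → X, Z → Y, X → Y) with binary variables. The graph and all probabilities are invented for illustration and are not taken from the paper's datasets; counterfactual queries are not covered.

```python
from fractions import Fraction as F

# Hypothetical conditional probability tables for the graph Z -> X, Z -> Y, X -> Y.
p_z = {1: F(1, 2), 0: F(1, 2)}                      # P(Z = z)
p_x1_given_z = {1: F(3, 4), 0: F(1, 4)}             # P(X = 1 | Z = z)
p_y1_given_xz = {(1, 1): F(9, 10), (1, 0): F(1, 2),
                 (0, 1): F(1, 2), (0, 0): F(1, 10)}  # P(Y = 1 | X = x, Z = z)


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary X."""
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]


def associational(x):
    """P(Y = 1 | X = x): condition on X, so Z is weighted by P(z | x)."""
    joint = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in (0, 1))
    evidence = sum(p_x_given_z(x, z) * p_z[z] for z in (0, 1))
    return joint / evidence


def interventional(x):
    """P(Y = 1 | do(X = x)): backdoor adjustment, so Z keeps its prior P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in (0, 1))


assoc = associational(1)    # 4/5
interv = interventional(1)  # 7/10
```

Both queries require marginalizing over Z, but with different weights; the gap between 4/5 and 7/10 comes entirely from the confounder, which is the kind of intermediate computation the abstract says RLVR helps models get right.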
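[59] HyDRA: Hierarchical and Dynamic Rank Adaptation for Mobile Vision Language Model cs.LG | cs.AI | cs.CV
Yuanhao Xi, Xiaohuan Bing, Ramin Yahyapour
TL;DR: This paper proposes HyDRA, a parameter-efficient fine-tuning framework for mobile vision language models (VLMs) that improves performance through hierarchical and dynamic rank scheduling while keeping the number of trainable parameters unchanged.
Details
Motivation: Training mobile VLMs is computationally expensive, and existing LoRA methods lack capacity because of their fixed rank, so a more flexible and efficient parameter fine-tuning scheme is needed.
Result: On multiple popular benchmarks, HyDRA outperforms the baseline, with a 4.7% improvement across model sizes; on some tasks it even surpasses full-parameter fine-tuning.
Insight: The innovation combines hierarchical optimization (coarse-grained rank allocation across layers and fine-grained rank adjustment within layers) with dynamic adjustment (end-to-end automatic rank optimization using a lightweight performance model), achieving adaptive parameter-efficient fine-tuning.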
Abstract: Vision Language Models (VLMs) have undergone significant advancements, particularly with the emergence of mobile-oriented VLMs, which offer a wide range of application scenarios. However, the substantial computational requirements for training these models present a significant obstacle to their practical application. To address this issue, Low-Rank Adaptation (LoRA) has been proposed. Nevertheless, the standard LoRA with a fixed rank lacks sufficient capability for training mobile VLMs that process both text and image modalities. In this work, we introduce HyDRA, a parameter-efficient fine-tuning framework designed to implement hierarchical and dynamic rank scheduling for mobile VLMs. This framework incorporates two essential optimization strategies: (1) hierarchical optimization, which involves a coarse-grained approach that assigns different ranks to various layers, as well as a fine-grained method that adjusts ranks within individual layers, and (2) dynamic adjustment, which employs an end-to-end automatic optimization using a lightweight performance model to determine and adjust ranks during the fine-tuning process. Comprehensive experiments conducted on popular benchmarks demonstrate that HyDRA consistently outperforms the baseline, achieving a 4.7% improvement across various model sizes without increasing the number of trainable parameters. In some tasks, it even surpasses full-parameter fine-tuning.
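The claim that ranks can vary per layer without increasing the trainable parameter count follows from simple arithmetic: a LoRA adapter of rank r on a d_out × d_in weight adds r · (d_in + d_out) parameters. The sketch below checks this for a uniform and a non-uniform allocation. The layer sizes and the specific rank schedule are hypothetical and are not HyDRA's actual allocation.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_in + d_out)


def total_params(layer_dims, ranks):
    """Sum LoRA parameters over layers, given one rank per layer."""
    if len(layer_dims) != len(ranks):
        raise ValueError("need one rank per layer")
    return sum(lora_params(d_in, d_out, r) for (d_in, d_out), r in zip(layer_dims, ranks))


# Four hypothetical 1024 x 1024 projection layers.
layer_dims = [(1024, 1024)] * 4
uniform_ranks = [8, 8, 8, 8]            # standard fixed-rank LoRA
hierarchical_ranks = [2, 6, 10, 14]     # assumed per-layer allocation, same rank sum

uniform_total = total_params(layer_dims, uniform_ranks)
hierarchical_total = total_params(layer_dims, hierarchical_ranks)
```

When all layers share the same dimensions, any allocation with the same rank sum has the same budget; with mixed layer sizes the schedule would need to balance rank against each layer's d_in + d_out.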
cs.AI
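[60] MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation cs.AI | cs.CL | cs.CV | cs.IR
Chi-Hsiang Hsiao, Yi-Cheng Wang, Tzung-Sheng Lin, Yi-Ren Yeh, Chu-Song Chen
TL;DR: This paper proposes MegaRAG, a retrieval-augmented generation method based on multimodal knowledge graphs. It targets the weakness of conventional RAG methods in high-level conceptual and holistic understanding of long-form, domain-specific content (such as full-length books), which stems from limited context windows. By incorporating visual cues into knowledge graph construction, retrieval, and answer generation, the method enables cross-modal reasoning.
Details
Motivation: Conventional text-based knowledge-graph RAG methods cannot exploit the complementary insights offered by vision and other modalities, while understanding visual documents requires combining textual, visual, and spatial cues. The paper addresses the limitations of existing methods in multimodal content understanding and deep reasoning.
Result: Experiments on global and fine-grained question answering tasks show that the method consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.
Insight: The main innovation is the systematic integration of visual information into the entire knowledge-graph RAG pipeline, enabling cross-modal knowledge representation and reasoning and improving the understanding of complex, long-form content.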
Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, they struggle with high-level conceptual understanding and holistic comprehension due to limited context windows, which constrain their ability to perform deep reasoning over long-form, domain-specific content such as full-length books. To solve this problem, knowledge graphs (KGs) have been leveraged to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to leverage the complementary insights provided by other modalities such as vision. On the other hand, reasoning from visual documents requires integrating textual, visual, and spatial cues into structured, hierarchical concepts. To address this issue, we introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process. Experimental results across both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.
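[61] AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent cs.AI | cs.CL | cs.LG
Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng
TL;DR: This paper proposes AgentMath, an agent framework that combines the reasoning capabilities of large language models with the computational precision of code interpreters to solve complex mathematical problems efficiently. Through automatically generated high-quality supervised fine-tuning data, a new agentic reinforcement learning paradigm, and an efficient training system, the method substantially improves the accuracy and efficiency of mathematical reasoning.
Details
Motivation: Existing large reasoning models have made progress in natural language reasoning but remain computationally inefficient and insufficiently accurate on problems requiring complex mathematical operations, so a new approach is needed that combines language-model reasoning with the precision of code execution.
Result: On the challenging mathematical competition benchmarks AIME24, AIME25, and HMMT25, the AgentMath-30B-A3B model reaches 90.6%, 86.4%, and 73.8% accuracy respectively, achieving state-of-the-art performance.
Insight: The innovations include: automatically converting natural language chain-of-thought into structured tool-augmented trajectories to generate high-quality SFT data; an agentic reinforcement learning paradigm that dynamically interleaves natural language generation with real-time code execution, fostering emergent code refinement and error correction; and an efficient training system with asynchronous scheduling and load balancing that achieves a 4-5× speedup. These methods offer new ideas for building more sophisticated and scalable mathematical reasoning agents.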
Abstract: Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models’ reasoning capabilities with code interpreters’ computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5x speedup and making efficient RL training feasible on ultra-long sequences in scenarios with massive tool calls. Extensive evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25. Specifically, AgentMath-30B-A3B attains 90.6%, 86.4%, and 73.8% accuracy respectively, achieving advanced capabilities. These results validate the effectiveness of our approach and pave the way for building more sophisticated and scalable mathematical reasoning agents.
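[62] Beyond Context: Large Language Models Failure to Grasp Users Intent cs.AI | cs.CL | cs.CR | cs.CY
Ahmed M. Hussain, Salahuddin Salahuddin, Panos Papadimitratos
TL;DR: This paper argues that current safety approaches for large language models (LLMs) focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. Malicious users can exploit this systematically to bypass safety mechanisms through emotional framing, progressive revelation, and academic justification.
Details
Motivation: Reveal a fundamental flaw in LLM safety mechanisms: over-reliance on explicit content filtering without understanding user intent and context, which leaves systematically exploitable vulnerabilities.
Result: An empirical evaluation of several SOTA LLMs, including ChatGPT, Claude, Gemini, and DeepSeek, shows that their safety mechanisms are easily circumvented and that enabling reasoning amplifies rather than mitigates the effectiveness of exploitation. The exception is Claude Opus 4.1, which prioritized intent detection over information provision in some use cases.
Insight: The innovation is the argument that LLM safety requires a paradigm shift, treating contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms. Viewed objectively, this exposes systematic vulnerabilities in current architectural designs and points to an important direction for future safety research.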
Abstract: Current Large Language Models (LLMs) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates the circumvention of reliable safety mechanisms through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of exploitation, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which prioritized intent detection over information provision in some use cases. This pattern reveals that current architectural designs create systematic vulnerabilities. These limitations require paradigmatic shifts toward contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.
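[63] RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic cs.AI | cs.CV | cs.RO
Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin
TL;DR: This paper proposes RoboSafe, a hybrid-reasoning runtime safeguard for embodied agents that uses executable predicate-based safety logic to defend against hazardous instructions. It combines backward reflective reasoning and forward predictive reasoning modules operating on a hybrid long-short safety memory to detect and prevent implicit risks in dynamic, temporally dependent environments.
Details
Motivation: Existing defenses based on static rule filters or prompt-level control struggle with implicit risks arising in dynamic, temporally dependent, and context-rich environments, so a more flexible and verifiable safety mechanism is needed to protect embodied agents powered by vision-language models.
Result: Extensive experiments across multiple agents show that, compared with leading baselines, RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) while maintaining near-original task performance; real-world evaluations on physical robotic arms further confirm its practicality.
Insight: The innovation is executable predicate-based safety logic, a hybrid reasoning mechanism that combines backward reflective and forward predictive reasoning, and a hybrid long-short safety memory, which together provide adaptive, verifiable, and interpretable runtime safety protection.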
Abstract: Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent’s multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
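The idea of safety logic that is "executable as code" can be sketched with a temporal predicate evaluated over an action trajectory before the next action runs. Everything below is hypothetical: the action names, the `stove_left_on` predicate, and the `guard` function are invented for illustration, since RoboSafe's code has not been released and the abstract does not specify its predicate format.

```python
def stove_left_on(trajectory):
    """Temporal predicate: True if the agent leaves the kitchen while the stove is on."""
    stove_on = False
    for action in trajectory:
        if action == "turn_on_stove":
            stove_on = True
        elif action == "turn_off_stove":
            stove_on = False
        elif action == "leave_kitchen" and stove_on:
            return True
    return False


def guard(trajectory, next_action, predicates):
    """Check the trajectory extended by next_action against every safety predicate.

    Returns ("replan", [violated predicate names]) or ("execute", []).
    """
    candidate = trajectory + [next_action]
    violated = [p.__name__ for p in predicates if p(candidate)]
    if violated:
        return ("replan", violated)
    return ("execute", [])


# The same next action is unsafe or safe depending on what happened earlier.
unsafe = guard(["turn_on_stove"], "leave_kitchen", [stove_left_on])
safe = guard(["turn_on_stove", "turn_off_stove"], "leave_kitchen", [stove_left_on])
```

The example shows why a static rule filter on single actions falls short: "leave_kitchen" is harmless on its own, and the risk only appears once the earlier steps of the trajectory are taken into account.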
cs.IR
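[64] ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling cs.IR | cs.CL
Chuan Wang, Gaoming Yang, Han Wu, Jiakai Tang, Jiahao Yu
TL;DR: This paper proposes ReaSeq, a framework that addresses two fundamental limitations of the log-driven paradigm in industrial recommender systems: the knowledge poverty of ID-based item representations and the systemic blindness of in-platform behavior data. The framework draws on the world knowledge in large language models, using explicit Chain-of-Thought reasoning to enrich item semantic representations and latent reasoning with diffusion large language models to infer users’ off-platform interests. Deployed in Taobao’s ranking system, ReaSeq achieves substantial gains on several key metrics.
Details
Motivation: Address two core problems of industrial recommender systems under the log-driven paradigm: first, ID representations lack semantic knowledge, which makes interest modeling brittle under data sparsity; second, the system is confined to in-platform logs and cannot capture users’ latent interests beyond the platform.
Result: Deployed in Taobao’s ranking system, which serves hundreds of millions of users, ReaSeq achieves substantial gains: over 6.0% in IPV and CTR, over 2.9% in orders, and over 2.5% in GMV, validating world-knowledge-enhanced reasoning over purely log-driven approaches.
Insight: The innovation is to bring the world knowledge of large language models into recommender systems through two reasoning mechanisms, explicit (Chain-of-Thought via multi-agent collaboration) and latent (diffusion large language models), enriching item representations with structured knowledge and inferring beyond-log behaviors, thereby overcoming the over-reliance of conventional recommender systems on shallow interaction statistics and closed-loop feedback.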
Abstract: Industrial recommender systems face two fundamental limitations under the log-driven paradigm: (1) knowledge poverty in ID-based item representations that causes brittle interest modeling under data sparsity, and (2) systemic blindness to beyond-log user interests that constrains model performance within platform boundaries. These limitations stem from an over-reliance on shallow interaction statistics and closed-loop feedback while neglecting the rich world knowledge about product semantics and cross-domain behavioral patterns that Large Language Models have learned from vast corpora. To address these challenges, we introduce ReaSeq, a reasoning-enhanced framework that leverages world knowledge in Large Language Models to address both limitations through explicit and implicit reasoning. Specifically, ReaSeq employs explicit Chain-of-Thought reasoning via multi-agent collaboration to distill structured product knowledge into semantically enriched item representations, and latent reasoning via Diffusion Large Language Models to infer plausible beyond-log behaviors. Deployed on Taobao’s ranking system serving hundreds of millions of users, ReaSeq achieves substantial gains: >6.0% in IPV and CTR, >2.9% in Orders, and >2.5% in GMV, validating the effectiveness of world-knowledge-enhanced reasoning over purely log-driven approaches.
cs.CR [Back]
[65] Automated Red-Teaming Framework for Large Language Model Security Assessment: A Comprehensive Attack Generation and Detection System cs.CR | cs.CL PDF
Zhang Wei, Peilu Hu, Shengning Lang, Hao Yan, Li Mei
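TL;DR: This paper proposes an automated red-teaming framework for assessing the security of large language models (LLMs). Through meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols, the framework systematically generates, executes, and evaluates adversarial prompts to uncover LLM security vulnerabilities. Experiments on the GPT-OSS-20B model revealed 47 distinct vulnerabilities, including 21 high-severity and 12 novel attack patterns, a 3.9× improvement in vulnerability discovery rate over manual expert testing, while maintaining 89% detection accuracy.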
Details
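Motivation: As LLMs are increasingly deployed in high-stakes domains, ensuring their security and alignment has become a critical challenge. Existing red-teaming depends heavily on manual testing, which limits scalability and cannot comprehensively cover the space of potential adversarial behaviors.
Result: Experiments on the GPT-OSS-20B model show that the framework found 47 distinct vulnerabilities (including 21 high-severity and 12 novel attack patterns), improved the vulnerability discovery rate by 3.9× over manual expert testing, and reached 89% detection accuracy, covering six major threat categories (such as reward hacking and deceptive alignment).
Insight: The contributions include the integrated design of the automated red-teaming framework (combining meta-prompting attack synthesis with multi-modal detection), standardized evaluation protocols covering six threat categories, and a scalable, systematic methodology. Together they provide actionable insights for improving LLM alignment robustness and advance automated AI safety evaluation.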
Abstract: As large language models (LLMs) are increasingly deployed in high-stakes domains, ensuring their security and alignment has become a critical challenge. Existing red-teaming practices depend heavily on manual testing, which limits scalability and fails to comprehensively cover the vast space of potential adversarial behaviors. This paper introduces an automated red-teaming framework that systematically generates, executes, and evaluates adversarial prompts to uncover security vulnerabilities in LLMs. Our framework integrates meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols spanning six major threat categories – reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Experiments on the GPT-OSS-20B model reveal 47 distinct vulnerabilities, including 21 high-severity and 12 novel attack patterns, achieving a $3.9\times$ improvement in vulnerability discovery rate over manual expert testing while maintaining 89% detection accuracy. These results demonstrate the framework’s effectiveness in enabling scalable, systematic, and reproducible AI safety evaluations. By providing actionable insights for improving alignment robustness, this work advances the state of automated LLM red-teaming and contributes to the broader goal of building secure and trustworthy AI systems.
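The generate, execute, and evaluate loop described in the abstract can be sketched as a small harness. This is only a toy outline under stated assumptions: `synthesize_attacks` stands in for meta-prompting with simple string templates, and the `target` and `detector` callables are hypothetical stubs, not the paper's components. Only the six threat category names come from the abstract.

```python
from collections import Counter
from typing import Callable, List, Tuple

# The six threat categories named in the abstract.
THREAT_CATEGORIES = [
    "reward hacking",
    "deceptive alignment",
    "data exfiltration",
    "sandbagging",
    "inappropriate tool use",
    "chain-of-thought manipulation",
]


def synthesize_attacks(category: str, seeds: List[str]) -> List[str]:
    """Stand-in for attack synthesis: tag each seed prompt with its category."""
    return [f"[{category}] {seed}" for seed in seeds]


def run_red_team(
    target: Callable[[str], str],
    detector: Callable[[str, str], bool],
    seeds: List[str],
) -> List[Tuple[str, str]]:
    """Generate prompts per category, query the target, and keep flagged cases."""
    findings = []
    for category in THREAT_CATEGORIES:
        for prompt in synthesize_attacks(category, seeds):
            response = target(prompt)
            if detector(prompt, response):
                findings.append((category, prompt))
    return findings


def summarize(findings: List[Tuple[str, str]]) -> Counter:
    """Count flagged cases per threat category."""
    return Counter(category for category, _ in findings)


# Toy stubs: the target "leaks" on exfiltration prompts; the detector looks for the leak marker.
def toy_target(prompt: str) -> str:
    return "LEAK: secret" if "data exfiltration" in prompt else "refused"


def toy_detector(prompt: str, response: str) -> bool:
    return response.startswith("LEAK")


report = summarize(run_red_team(toy_target, toy_detector, ["seed A", "seed B"]))
```

Keeping the target and detector as plain callables makes it easy to swap in a real model client and a real classifier without changing the loop.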
cs.RO [Back]
[66] Language-Guided Grasp Detection with Coarse-to-Fine Learning for Robotic Manipulation cs.RO | cs.CV PDF
Zebin Jiang, Tianle Jin, Xiangtong Yao, Alois Knoll, Hu Cao
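TL;DR: This paper proposes Language-Guided Grasp Detection (LGGD), which follows a coarse-to-fine learning paradigm. A hierarchical cross-modal fusion pipeline and a language-conditioned dynamic convolution head provide fine-grained visual-semantic alignment, improving the consistency between language instructions and visual reasoning in robotic grasping.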
Details
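Motivation: Existing language-conditioned grasping methods typically rely on shallow fusion strategies, which leads to weak semantic grounding and poor alignment between linguistic intent and visual grasp reasoning, making precise grasping difficult in unstructured, cluttered, and semantically diverse environments.
Result: Experiments on the OCID-VLG and Grasp-Anything++ datasets show that LGGD surpasses existing language-guided grasping methods and generalizes strongly to unseen objects and diverse language queries. Deployment on a real robotic platform confirms its practical effectiveness in executing accurate, instruction-conditioned grasp actions.
Insight: The contributions are: a hierarchical cross-modal fusion pipeline based on CLIP embeddings that progressively injects linguistic cues to strengthen visual feature reconstruction; a language-conditioned dynamic convolution head (LDCH) that mixes multiple convolution experts based on sentence-level features, producing instruction-adaptive coarse mask and grasp predictions; and a final refinement module that improves grasp consistency and robustness in complex scenes.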
Abstract: Grasping is one of the most fundamental and challenging capabilities in robotic manipulation, especially in unstructured, cluttered, and semantically diverse environments. Recent research has increasingly explored language-guided manipulation, where robots not only perceive the scene but also interpret task-relevant natural language instructions. However, existing language-conditioned grasping methods typically rely on shallow fusion strategies, leading to limited semantic grounding and weak alignment between linguistic intent and visual grasp reasoning. In this work, we propose Language-Guided Grasp Detection (LGGD) with a coarse-to-fine learning paradigm for robotic manipulation. LGGD leverages CLIP-based visual and textual embeddings within a hierarchical cross-modal fusion pipeline, progressively injecting linguistic cues into the visual feature reconstruction process. This design enables fine-grained visual-semantic alignment and improves the feasibility of the predicted grasps with respect to task instructions. In addition, we introduce a language-conditioned dynamic convolution head (LDCH) that mixes multiple convolution experts based on sentence-level features, enabling instruction-adaptive coarse mask and grasp predictions. A final refinement module further enhances grasp consistency and robustness in complex scenes. Experiments on the OCID-VLG and Grasp-Anything++ datasets show that LGGD surpasses existing language-guided grasping methods, exhibiting strong generalization to unseen objects and diverse language queries. Moreover, deployment on a real robotic platform demonstrates the practical effectiveness of our approach in executing accurate, instruction-conditioned grasp actions. The code will be released publicly upon acceptance.
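The idea of mixing convolution experts based on sentence-level features can be illustrated with a minimal single-channel sketch in plain Python. It assumes a linear gate followed by a softmax and a weighted sum of expert kernels, which is one common way to build a dynamic convolution; the actual LDCH design may differ, and all function names and shapes here are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(sentence_feat: List[float], gate_matrix: Matrix) -> List[float]:
    """One mixing weight per expert, computed from the sentence feature (assumed linear gate)."""
    logits = [sum(w * f for w, f in zip(row, sentence_feat)) for row in gate_matrix]
    return softmax(logits)


def mix_kernels(weights: List[float], experts: List[Matrix]) -> Matrix:
    """Weighted sum of expert kernels, giving one instruction-specific kernel."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(w * kernel[r][c] for w, kernel in zip(weights, experts)) for c in range(cols)]
        for r in range(rows)
    ]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Single-channel cross-correlation with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + r][j + c] * kernel[r][c] for r in range(kh) for c in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Two 2x2 experts; the gate below scores both equally, so each gets weight 0.5.
experts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
weights = gate_weights([2.0, 3.0], [[1.0, 0.0], [1.0, 0.0]])
response = conv2d_valid([[1.0, 2.0], [3.0, 4.0]], mix_kernels(weights, experts))
```

Because convolution is linear in the kernel, mixing kernels first gives the same output as running every expert and mixing their outputs, while requiring only one convolution per input.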