Table of Contents
- cs.CL [Total: 85]
- cs.CV [Total: 179]
- cs.HC [Total: 2]
- cs.AI [Total: 11]
- eess.IV [Total: 1]
- cs.CY [Total: 1]
- cs.SD [Total: 4]
- cs.RO [Total: 8]
- cs.DB [Total: 1]
- cs.CR [Total: 2]
- cs.IR [Total: 5]
- eess.AS [Total: 1]
- cs.SE [Total: 1]
- cs.LG [Total: 10]
cs.CL
[1] Multimodal Claim Extraction for Fact-Checking cs.CL | cs.AI | cs.SI
Joycelyn Teo, Rui Cao, Zhenyun Deng, Zifeng Ding, Michael Sejr Schlichtkrull
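TL;DR: This paper presents the first benchmark for multimodal claim extraction from social media, consisting of posts with text and images annotated with gold-standard claims derived from real-world fact-checkers. The authors evaluate state-of-the-art multimodal LLMs (MLLMs), find that they struggle to model rhetorical intent and contextual cues, and propose MICE, an intent-aware framework, to improve performance.
Details
Motivation: Existing automated fact-checking methods largely overlook the multimodal nature of misinformation at the claim extraction step, yet social media posts often combine informal text with images (such as memes and screenshots). This poses challenges distinct from both text-only claim extraction and traditional multimodal tasks.
Result: Under the proposed three-part evaluation framework (semantic alignment, faithfulness, and decontextualization), baseline MLLMs perform poorly, while the proposed MICE framework shows improvements in intent-critical cases.
Insight: The novelty lies in establishing the first benchmark for multimodal claim extraction and in proposing MICE, a framework focused on modeling rhetorical intent, to handle challenges specific to multimodal social media content. This opens a new research direction for fact-checking.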
Abstract: Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today’s misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To address this, we introduce MICE, an intent-aware framework which shows improvements in intent-critical cases.
[2] Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction cs.CL | cs.AI | cs.CV
Xiaoli Yang, Huiyuan Tian, Yurui Li, Jianyu Zhang, Shijian Li
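TL;DR: This paper proposes Brain-CLIPLM, a two-stage framework for decoding language from electroencephalography (EEG) signals. It rests on a semantic compression hypothesis, under which EEG encodes compressed semantic anchors rather than full sentence structure. Decoding is therefore split into two steps: semantic anchors are first extracted via contrastive learning, and sentences are then reconstructed by a retrieval-grounded large language model (LLM) with Chain-of-Thought (CoT) reasoning, so that decoding complexity matches neural information capacity.
Details
Motivation: Non-invasive EEG has a low signal-to-noise ratio and limited information bandwidth, so directly reconstructing sentence-level linguistic structure from it may be unrealistic. The authors question the assumption that such structure can be reliably recovered and propose instead that EEG signals encode compressed semantic content.
Result: On the Zurich Cognitive Language Processing Corpus, Brain-CLIPLM reaches 67.55% top-5 and 85.00% top-25 sentence retrieval accuracy, significantly outperforming a direct decoding baseline, and cross-subject evaluation confirms robust generalization. Control analyses (such as permutation testing) indicate that EEG representations carry sentence-specific information beyond language model priors.
Insight: The novelty lies in the semantic compression hypothesis and the granularity matching principle, which reframe EEG-to-text decoding as recovering compressed semantic content rather than reconstructing full sentences. Methodologically, the two-stage framework combines contrastive extraction of semantic anchors with retrieval-grounded LLM reasoning, offering a more biologically grounded and data-efficient path for brain-computer interfaces.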
Abstract: Decoding natural language from non-invasive electroencephalography (EEG) remains fundamentally limited by low signal-to-noise ratio and restricted information bandwidth. This raises a fundamental question regarding whether sentence-level linguistic structure can be reliably recovered from such signals. In this work, we suggest that this assumption may not hold under realistic information constraints, and instead propose a semantic compression hypothesis in which EEG signals encode a compressed set of semantic anchors rather than full linguistic structure. Under our new perspective, direct sentence reconstruction becomes an overparameterized objective relative to the intrinsic information capacity of EEG. To address this mismatch, we introduce Brain-CLIPLM, a two-stage framework that decomposes EEG-to-text decoding into semantic anchor extraction via contrastive learning and sentence reconstruction using a retrieval-grounded large language model (LLM) with Chain-of-Thought (CoT) reasoning, following a granularity matching principle that aligns decoding complexity with neural information capacity. Evaluated on the Zurich Cognitive Language Processing Corpus, Brain-CLIPLM achieves 67.55% top-5 and 85.00% top-25 sentence retrieval accuracy, significantly outperforming a direct decoding baseline, while cross-subject evaluation confirms robust generalization. Control analyses, including permutation testing, further demonstrate that EEG-derived representations carry sentence-specific information beyond language model priors. These results suggest that EEG-to-text decoding is better framed as recovering compressed semantic content rather than reconstructing full sentences, providing a biologically grounded and data-efficient pathway for non-invasive brain-computer interfaces.
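The top-5 and top-25 figures above are retrieval accuracies. As a rough illustration of how such a metric is typically computed (not the authors' evaluation code), the sketch below assumes a square score matrix in which candidate i is the correct match for query i. The function name `top_k_accuracy` and the toy scores are hypothetical.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose true match (same index) ranks in the top k.

    similarity[i][j] is the score between query i (e.g. an EEG embedding)
    and candidate j (e.g. a sentence embedding). Assumption: candidate i
    is the correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix (made-up numbers, not data from the paper).
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
top1 = top_k_accuracy(scores, 1)
top2 = top_k_accuracy(scores, 2)
```

Larger k can only keep or raise the score, which is why the reported top-25 accuracy is higher than the top-5 accuracy.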
[3] CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark cs.CL | cs.AI
Junzhao Zhang, Hsiu-Yuan Huang, Chenming Tang, Yutong Yang, Yunfang Wu
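TL;DR: This paper constructs CFMS, the first fine-grained multimodal sarcasm detection dataset for Chinese social media, with 2,796 high-quality image-text pairs and a triple-level annotation framework covering sarcasm identification, target recognition, and explanation generation. The authors find that fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent, and they expose significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, the paper proposes a reinforcement learning-augmented in-context learning strategy (PGDS) that dynamically optimizes exemplar selection. Extensive experiments show that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems and that PGDS significantly outperforms existing baselines on key tasks.
Details
Motivation: Existing multimodal sarcasm detection benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinders research into fine-grained semantic understanding. A fine-grained dataset for Chinese social media is therefore needed.
Result: Extensive experiments on CFMS show that the dataset provides a solid foundation for building reliable multimodal sarcasm understanding systems, and that the proposed PGDS method significantly outperforms existing baselines on key tasks.
Insight: The novelty lies in CFMS, the first fine-grained Chinese multimodal sarcasm dataset, its triple-level annotation framework (sarcasm identification, target recognition, and explanation generation), and PGDS, a reinforcement learning-augmented in-context learning strategy that dynamically optimizes exemplar selection. Together these help models better understand sarcasm and metaphor.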
Abstract: Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social media. It comprises 2,796 high-quality image-text pairs and provides a triple-level annotation framework: sarcasm identification, target recognition, and explanation generation. We find that the fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent. Furthermore, we curate a high-consistency parallel Chinese-English metaphor subset (200 entries each), revealing significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, we propose a Reinforcement Learning-augmented In-Context Learning strategy (PGDS) to dynamically optimize exemplar selection. Extensive experiments demonstrate that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems, and the PGDS method significantly outperforms existing baselines on key tasks. Our data and code are available at https://anonymous.4open.science/r/CFMS-E8F9.
[4] GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution cs.CL | cs.CY
Nitin Choudhury, Bikrant Bikram Pratap Maurya, Bhavinkumar Vinodbhai Kuwar, Arun Balaji Buduru
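TL;DR: This paper proposes GoCoMA, a multimodal framework for attributing code generated by large language models (LLMs). It fuses code stylometry (higher-level structural and stylistic features) with image representations of binary pre-executable artifacts (BPEA; lower-level byte semantics) in hyperbolic geometry (a Poincaré ball), using a geodesic-cosine similarity-based cross-modal attention (GCSA) mechanism, and then projects the fused representation back to Euclidean space for LLM-source attribution.
Details
Motivation: LLM-generated code is hard to distinguish from human-written code, which raises practical concerns (such as security vulnerabilities and licensing ambiguity) and motivates a forensic question: ‘Who (or which LLM) wrote this piece of code?’
Result: Experiments on two open-source benchmarks (CoDET-M4 and LLMAuthorBench) show that GoCoMA consistently outperforms unimodal and Euclidean multimodal baselines under identical evaluation protocols.
Insight: The novelty lies in modeling code attribution as an extrinsic hierarchy (higher-level style versus lower-level byte semantics) and in introducing, for the first time, hyperbolic space for multimodal representation fusion, exploiting its suitability for hierarchical data. The paper also proposes GCSA, a novel cross-modal attention fusion mechanism based on geodesic-cosine similarity.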
Abstract: Large Language Models (LLMs) trained on massive code corpora are now increasingly capable of generating code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and also motivates a forensic question: ‘Who (or which LLM) wrote this piece of code?’ We present GoCoMA, a multimodal framework that models an extrinsic hierarchy between (i) code stylometry, capturing higher-level structural and stylistic signatures, and (ii) image representations of binary pre-executable artifacts (BPEA), capturing lower-level, execution-oriented byte semantics shaped by compilation and toolchains. GoCoMA projects modality embeddings into a hyperbolic Poincaré ball, fuses them via a geodesic-cosine similarity-based cross-modal attention (GCSA) fusion mechanism, and back-projects the fused representation to Euclidean space for final LLM-source attribution. Experiments on two open-source benchmarks (CoDET-M4 and LLMAuthorBench) show that GoCoMA consistently outperforms unimodal and Euclidean multimodal baselines under identical evaluation protocols.
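The projection into a Poincaré ball and the back-projection to Euclidean space can be illustrated with the standard exponential and logarithmic maps at the origin. This is a minimal sketch of textbook hyperbolic operations, not the paper's GCSA mechanism. The function names and the choice of maps at the origin are assumptions made for illustration.

```python
import math


def exp_map_origin(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin into the Poincaré ball
    of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def log_map_origin(y, c=1.0):
    """Inverse of exp_map_origin: back-project a ball point to Euclidean space."""
    norm = math.sqrt(sum(x * x for x in y))
    if norm == 0.0:
        return list(y)
    scale = math.atanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in y]


def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincaré ball (c = 1)."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq_norm(x)) * (1.0 - sq_norm(y))))


# Hypothetical 2-D embedding; real modality embeddings would be much larger.
point = exp_map_origin([0.3, 0.4])
origin_distance = poincare_distance([0.0, 0.0], point)  # twice the tangent norm
recovered = log_map_origin(point)
```

Distances grow rapidly near the boundary of the ball, which is the property usually cited as making hyperbolic space a good fit for hierarchical data.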
[5] Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning cs.CL | cs.LG
Yunshuo Tian, Akayou Kitessa, Tanuja Chitnis, Yijun Zhao
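TL;DR: This paper proposes Reciprocal Co-Training (RCT), a framework that uses reinforcement learning to couple a gradient-optimized large language model (LLM) with a non-differentiable Random Forest (RF) classifier, forming an iterative feedback loop in which each model improves by drawing on the other’s strengths.
Details
Motivation: LLMs and classical machine learning methods (such as Random Forests) have complementary strengths for predictive modeling, but their fundamentally different representations and training paradigms (gradient-based optimization versus non-differentiable feature partitioning) make effective integration difficult. This paper addresses that integration problem.
Result: Experiments on three medical datasets show consistent performance gains for both models, with particularly strong gains for the LLM. Ablation analyses confirm that iterative refinement, hybrid reward design, and dimensionality control jointly contribute to these gains.
Insight: The core novelty is a bidirectional adaptation mechanism built with reinforcement learning: tabular data are converted into standardized text for the LLM, while calibrated RF probability estimates serve as feedback signals that guide the LLM’s reinforcement learning updates. This gives incompatible model families a general way to strengthen each other.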
Abstract: Large language models (LLMs) and classical machine learning methods offer complementary strengths for predictive modeling, yet their fundamentally different representations and training paradigms hinder effective integration: LLMs rely on gradient-based optimization over textual data, whereas models such as Random Forests (RF) employ non-differentiable feature partitioning. This work introduces a reciprocal co-training framework that couples an LLM with an RF classifier via reinforcement learning, creating an iterative feedback loop in which each model improves using signals from the other. Tabular data are reformulated into standardized textual representations for the LLM, whose embeddings augment the RF feature space, while calibrated RF probability estimates provide feedback signals that guide reinforcement learning updates of the LLM. Experiments across three medical datasets demonstrate consistent performance gains for both models, with particularly strong effects for the LLM. Ablation analyses show that iterative refinement, hybrid reward design, and dimensionality control jointly contribute to these gains. The proposed framework provides a general mechanism that allows incompatible model families to leverage each other’s strengths through bidirectional adaptation.
[6] LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models? cs.CL
Iqra Ali, Talia Tseriotou, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata
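TL;DR: This paper proposes LiFT, an instruction fine-tuning framework for longitudinal modelling, aimed at the difficulty large language models have in integrating historical context, tracking evolving interactions, and handling rare change events during in-context learning. LiFT unifies diverse longitudinal modelling tasks under a shared instruction schema and uses a curriculum that progressively increases temporal difficulty, combined with few-shot structure and temporal conditioning, to encourage effective use of past context.
Details
Motivation: Longitudinal NLP tasks require reasoning over temporally ordered text to detect persistence and change in human behavior and opinions, yet in-context learning with current LLMs struggles on tasks that require integrating historical context, tracking evolving interactions, and handling rare change events.
Result: Evaluation on five datasets shows that LiFT consistently outperforms base-model in-context learning across models of different parameter sizes (OLMo (1B/7B), LLaMA-8B, and Qwen-14B), with strong gains on out-of-distribution data and minority change events.
Insight: The novelty lies in unifying longitudinal modelling tasks under a shared instruction schema and designing a progressive curriculum to strengthen temporal reasoning. Viewed objectively, structured instructions and conditioned training improve generalization on longitudinal tasks and the handling of rare events.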
Abstract: Longitudinal NLP tasks require reasoning over temporally ordered text to detect persistence and change in human behavior and opinions. However, in-context learning with large language models struggles on tasks where models must integrate historical context, track evolving interactions, and handle rare change events. We introduce LiFT, a longitudinal instruction fine-tuning framework that unifies diverse longitudinal modeling tasks under a shared instruction schema. LiFT uses a curriculum that progressively increases temporal difficulty while incorporating few-shot structure and temporal conditioning to encourage effective use of past context. We evaluate LiFT across five datasets. Models trained on longitudinal tasks with different levels of temporal granularity are tested for generalisability on two separate datasets. Across models with different parameter sizes (OLMo (1B/7B), LLaMA-8B, and Qwen-14B), LiFT consistently outperforms base-model ICL, with strong gains on out-of-distribution data and minority change events.
[7] QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning cs.CL
Mohammad AL-Smadi
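TL;DR: This paper describes the QU-NLP team’s system for the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. The system applies a multi-stage QLoRA fine-tuning strategy to Qwen3-4B, with domain adaptation followed by task-specific training to generate JSON-formatted output, and achieves a 90% MIR-E score on the test set, showing that small models can be effective on complex legal reasoning tasks.
Details
Motivation: Islamic inheritance law requires multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations, which makes it a challenging domain for evaluating the structured reasoning of large language models. The paper asks how a smaller language model can be made to perform such complex legal reasoning effectively.
Result: On the test set, the model achieves a 90% MIR-E score, competitive with commercial systems such as Gemini-2.5-flash, while requiring minimal computational resources.
Insight: The claimed novelty is a multi-stage QLoRA fine-tuning strategy that combines domain-specific pre-adaptation with structured output training. Viewed objectively, staged fine-tuning (domain adaptation first, then task training) together with efficient quantization (4-bit NF4) improves a small model’s structured reasoning in a specialized domain under resource constraints, making it a reusable recipe for lightweight domain adaptation.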
Abstract: Islamic inheritance law (ilm al-mawarith) presents a challenging domain for evaluating large language models’ structured reasoning capabilities, requiring multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations. We present QU-NLP’s submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. Our approach employs a multi-stage Quantized Low-Rank Adaptation (QLoRA) fine-tuning strategy on Qwen3-4B: (1) domain adaptation on 3,166 Islamic fatwa records to acquire inheritance terminology and jurisprudential reasoning patterns, followed by (2) task-specific training on 12,000 structured inheritance cases to optimize JSON-formatted output generation. Using 4-bit NF4 quantization with rank-128 LoRA adapters, our model achieves a 90% MIR-E (Mawarith Inheritance Reasoning Evaluation) score on the test set, demonstrating competitive performance while requiring minimal computational resources. Our results show that domain-specific pre-adaptation combined with structured output training enables small language models to perform complex legal reasoning tasks effectively, with performance comparable to commercial systems such as Gemini-2.5-flash.
[8] Measuring Representation Robustness in Large Language Models for Geometry cs.CL | cs.AI
Vedant Jawandhia, Yash Sinha, Murari Mandal, Ankan Pal, Dhruv Kumar
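TL;DR: This paper proposes GeoRepEval, an evaluation framework for measuring the representation robustness of large language models on geometry problems. Accuracy differs significantly across representations (Euclidean, coordinate, and vector), suggesting that models rely on representation-specific heuristics rather than abstract geometric reasoning.
Details
Motivation: Existing benchmarks evaluate LLM mathematical reasoning on fixed formats, implicitly assuming representation invariance and overlooking failures caused by representational changes alone. A systematic evaluation of robustness to equivalent problem representations is therefore needed.
Result: Evaluating 11 LLMs on 158 high-school geometry problems (474 instances) reveals accuracy gaps of up to 14 percentage points caused solely by representation choice. Vector formulations are a consistent failure point (Invariance@3 as low as 0.044). A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models.
Insight: The novelty lies in a representation-aware evaluation framework and the Invariance@3 metric, which decomposes accuracy into robust and fragile components. The analysis indicates that current models are sensitive to representation and lack abstract reasoning; prompt engineering offers partial mitigation, but low-capacity models face more fundamental limitations.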
Abstract: Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
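The abstract does not spell out the Invariance@3 formula. The sketch below assumes one natural reading: the fraction of problems answered correctly in all three representations. Under that assumption the metric cannot exceed the accuracy of the weakest representation, and mean accuracy splits into a robust part (Invariance@3) plus a fragile remainder. The function name, data layout, and toy results are hypothetical.

```python
REPRESENTATIONS = ("euclidean", "coordinate", "vector")


def invariance_at_3(results):
    """results maps problem id -> {representation: answered correctly?}.

    Returns (robust, fragile, per_representation_accuracy), where `robust`
    is the assumed Invariance@3: the share of problems solved in every
    representation.
    """
    n = len(results)
    per_rep = {
        rep: sum(outcome[rep] for outcome in results.values()) / n
        for rep in REPRESENTATIONS
    }
    robust = sum(
        all(outcome[rep] for rep in REPRESENTATIONS) for outcome in results.values()
    ) / n
    mean_accuracy = sum(per_rep.values()) / len(REPRESENTATIONS)
    fragile = mean_accuracy - robust  # correct answers that do not survive a format change
    return robust, fragile, per_rep


# Made-up outcomes for four problems (not data from the paper).
toy = {
    "p1": {"euclidean": True, "coordinate": True, "vector": True},
    "p2": {"euclidean": True, "coordinate": True, "vector": False},
    "p3": {"euclidean": True, "coordinate": False, "vector": False},
    "p4": {"euclidean": False, "coordinate": False, "vector": False},
}
robust, fragile, per_rep = invariance_at_3(toy)
```

In the toy data only one of four problems survives all three formats, so the robust share equals the vector accuracy, the weakest of the three.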
[9] Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG cs.CL | cs.AI | cs.LG
Jaafer Klila, Sondes Bannour Souihi, Rahma Boujelben, Nasredine Semmar, Lamia Hadrich Belguith
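TL;DR: This study explores two ways of injecting structured biomedical knowledge (from the UMLS Metathesaurus) into language models: continual pretraining and knowledge-graph-based retrieval-augmented generation (GraphRAG). The authors build a large-scale UMLS knowledge graph, derive a text corpus from it, and continually pretrain BERT and BioBERT to obtain BERTUMLS and BioBERTUMLS. The continually pretrained models are evaluated on several BLURB tasks, and GraphRAG is evaluated on the PubMedQA and BioASQ question answering tasks. Continual pretraining clearly improves the base BERT model, especially on knowledge-intensive QA, but yields limited gains for BioBERT, which already encodes biomedical knowledge. GraphRAG improves LLaMA 3-8B on biomedical QA without any retraining.
Details
Motivation: Current domain adaptation methods rely mainly on unstructured text. This study investigates how structured biomedical knowledge (such as the UMLS Metathesaurus) can be used more effectively to improve language models in specialized domains, addressing insufficient biomedical knowledge and low reasoning transparency.
Result: On six BLURB datasets (spanning five task types), BERTUMLS improves over the original BERT, with the largest gains on knowledge-intensive QA. Gains for BioBERT are more limited, suggesting diminishing returns when the base model already encodes substantial biomedical text knowledge. Without any retraining, GraphRAG raises LLaMA 3-8B accuracy by more than 3 points on PubMedQA and by 5 points on BioASQ, providing transparent, multi-hop, and easily updated knowledge access.
Insight: The novelty lies in a systematic comparison of two complementary strategies for injecting structured knowledge: continual pretraining (parametric) and GraphRAG (inference-time retrieval). Viewed objectively, the core contribution is a large-scale, publicly available UMLS knowledge graph resource, together with evidence that, for base models that already hold domain knowledge, dynamic retrieval augmentation (GraphRAG) may be more efficient and flexible than further parametric training while offering better interpretability.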
Abstract: The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) Continual pretraining that embeds knowledge into model parameters, and (ii) Graph Retrieval-Augmented Generation (GraphRAG) that consults a knowledge graph at inference time. We first construct a large-scale biomedical knowledge graph from UMLS (3.4 million concepts and 34.2 million relations), stored in Neo4j for efficient querying. We then derive a ~100-million-token textual corpus from this graph to continually pretrain two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT). We evaluate these models on six BLURB (Biomedical Language Understanding and Reasoning Benchmark) datasets spanning five task types and evaluate GraphRAG on the two QA (Question Answering) datasets (PubMedQA, BioASQ). On BLURB tasks, BERTUMLS improves over BERT, with the largest gains on knowledge-intensive QA. Effects on BioBERT are more nuanced, suggesting diminishing returns when the base model already encodes substantial biomedical text knowledge. Finally, augmenting LLaMA 3-8B with our GraphRAG pipeline yields gains of more than 3 accuracy points on PubMedQA and 5 points on BioASQ without any retraining, delivering transparent, multi-hop, and easily updated knowledge access. We release the processed UMLS Neo4j graph to support reproducibility.
[10] SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future cs.CL | cs.CV | cs.LG | physics.ao-ph
Timothy B. Higgins, Antonios Mamalakis, Chirag Agarwal
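TL;DR: This paper presents the SynopticBench dataset and the SPACE evaluation framework for assessing how well vision-language models generate forecast discussions of future weather. The dataset contains more than 1.3 million Area Forecast Discussion texts issued by the US National Weather Service, paired with forecast images of 500mb geopotential height, 2 meter temperature, and 850mb wind velocity. SPACE is designed to estimate the quality of text descriptions of weather phenomena.
Details
Motivation: Although vision-language models have made significant progress on complex multimodal tasks such as image captioning and report generation, generating text from meteorological data remains highly challenging because the atmosphere is a chaotic system that changes rapidly across spatial and temporal scales. The effectiveness of existing VLMs on weather forecasting data needs to be quantified verifiably.
Result: Extensive experiments with state-of-the-art vision-language models on SynopticBench show the sensitivity of existing evaluation metrics in this domain and lay groundwork for further exploration of weather and climate text generation.
Insight: The contributions include SynopticBench, a high-quality, large-scale dataset of paired meteorological text and images, and SPACE, an evaluation framework tailored to text descriptions of weather phenomena. Together they support the application and evaluation of VLMs on complex scientific data generation tasks.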
Abstract: Recent advances in visual-language models (VLMs) have led to significant improvements in a plethora of complex multimodal tasks like image captioning, report generation, and visual perception. However, generating text from meteorological data is highly challenging because the atmosphere is a chaotic system that is rapidly changing at various spatial and temporal scales. Given the complexity of atmospheric phenomena, it is critical to verifiably quantify the effectiveness of existing VLMs on weather forecasting data. In this work, we present SynopticBench, a high-quality dataset consisting of 1,367,041 text samples of Area Forecast Discussions created by the National Weather Service over the continental United States, paired with images of 500mb geopotential height, 2 meter temperature, and 850mb wind velocity in weather forecasts. We also present Synoptic Phenomena Alignment and Coverage Evaluation (SPACE), a novel evaluation framework that can be used to effectively estimate the quality of text descriptions of synoptic weather phenomena. Extensive experiments on generating forecast discussions using state-of-the-art VLMs show the sensitivity of existing evaluation metrics in this domain and enable further exploration into synoptic weather and climate text generation.
[11] EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions cs.CL | cs.AI | cs.LG | cs.SD
Smit Nautambhai Modi, Gandharv Mahajan, Marc Wetter, Randall Welles
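TL;DR: This paper presents EchoChain, a benchmark for evaluating full-duplex state-update reasoning in real-time voice assistants when users interrupt mid-response. The benchmark generates scenario-driven conversations and injects interruptions at a standardized point relative to assistant speech onset, and it identifies three recurring failure patterns in post-interruption continuations: contextual inertia, interruption amnesia, and objective displacement.
Details
Motivation: Existing spoken-dialog benchmarks mainly evaluate turn-based interaction and miss a key failure mode: the assistant must revise task state when the user interrupts mid-response. A benchmark dedicated to full-duplex state-update reasoning is therefore needed.
Result: Across the evaluated real-time voice models, no system exceeds a 50% pass rate, leaving substantial room for improvement in mid-generation state revision. In a paired half-duplex control, total failures drop by 40.2% relative to interrupted runs, indicating that many errors are driven by state-update reasoning under interruption rather than by task difficulty alone.
Insight: The novelty lies in EchoChain, a controlled and reproducible benchmark for diagnosing state-update reasoning failures in full-duplex voice interaction, and in the systematic identification of three specific failure patterns, which gives model developers clear diagnostic targets.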
Abstract: Real-time voice assistants must revise task state when users interrupt mid-response, but existing spoken-dialog benchmarks largely evaluate turn-based interaction and miss this failure mode. We introduce EchoChain, a controlled benchmark for evaluating full-duplex state-update reasoning under mid-speech interruptions. EchoChain identifies three recurring failure patterns in post-interruption continuations: contextual inertia, interruption amnesia, and objective displacement. The benchmark generates scenario-driven conversations and injects interruptions at a standardized point relative to assistant speech onset, enabling controlled cross-model comparison. In a paired half-duplex control, total failures drop by 40.2% relative to interrupted runs, indicating that many errors are driven by state-update reasoning under interruption rather than task difficulty alone. Across evaluated real-time voice models, no system exceeds a 50% pass rate, showing substantial room for improvement in mid-generation state revision. EchoChain provides a reproducible benchmark for diagnosing state-update reasoning failures in full-duplex voice interaction.
[12] Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models cs.CL
Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang
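TL;DR: This paper presents SemanticQA, an evaluation suite for systematically assessing language models on semantic phrase processing tasks. The benchmark consolidates existing multiword expression resources into a unified testbed covering lexical collocations, idiomatic expressions, noun compounds, and verbal constructions. Language models of different architectures and scales are evaluated on extraction, classification, and interpretation tasks, as well as sequential task compositions, revealing substantial performance variation on semantic reasoning tasks.
Details
Motivation: Evaluation of language models on non-trivial semantic phrase understanding is insufficient. The paper builds a unified semantic reasoning benchmark to measure model performance on complex semantic tasks systematically.
Result: Evaluation of a range of language models on SemanticQA shows substantial performance variation on tasks requiring semantic reasoning, particularly for idiomatic expressions and complex phrase understanding, exposing limits in current models’ semantic understanding.
Insight: The novelty lies in consolidating and reorganizing multiword expression resources into a unified evaluation framework with fine-grained semantic phrase categories. Viewed objectively, the work provides a quantifiable tool for assessing deep semantic understanding in language models and supports progress on non-literal semantic reasoning.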
Abstract: We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in the reasoning efficacy and semantic understanding of LMs and providing insights for developing LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.
[13] Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning cs.CL | cs.AI | cs.LG
Livia Qian, Gabriel Skantze
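TL;DR: This paper proposes a two-stage framework that aligns representations of dialogue context and backchannels via contrastive LLM fine-tuning. First, large language models are fine-tuned on dialogue transcripts to obtain rich contextual representations; second, a joint embedding space is learned for dialogue contexts and backchannel realizations. Alignment with human perception is evaluated via triadic similarity judgments and a context-backchannel suitability task. The method outperforms previous approaches on context-backchannel retrieval, and the learned embeddings align more closely with human judgments than raw WavLM features.
Details
Motivation: The relationship between the lexico-prosodic form of backchannels (such as ‘yeah’ and ‘mhm’) and their pragmatic meaning remains underexplored. Prior computational work has focused mainly on predicting backchannel timing and has neglected the link between form and meaning.
Result: On context-backchannel retrieval, the learned projections substantially outperform previous methods. The learned embeddings align better with human perception than raw WavLM features, and backchannel form proves highly sensitive to extended conversational context.
Insight: The novelty lies in a two-stage framework that combines LLM fine-tuning with contrastive learning to jointly align dialogue context and backchannel representations. Viewed objectively, the method underscores the importance of context for understanding backchannel form and offers a new approach to multimodal dialogue modeling.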
Abstract: Backchannels (e.g., ‘yeah’, ‘mhm’, and ‘right’) are short, non-interruptive feedback signals whose lexical form and prosody jointly convey pragmatic meaning. While prior computational research has largely focused on predicting backchannel timing, the relationship between lexico-prosodic form and meaning remains underexplored. We propose a two-stage framework: first, fine-tuning large language models on dialogue transcripts to derive rich contextual representations; and second, learning a joint embedding space for dialogue contexts and backchannel realizations. We evaluate alignment with human perception via triadic similarity judgments (prosodic and cross-lexical) and a context-backchannel suitability task. Our results demonstrate that the learned projections substantially improve context-backchannel retrieval compared to previous methods. In addition, they reveal that backchannel form is highly sensitive to extended conversational context and that the learned embeddings align more closely with human judgments than raw WavLM features.
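A joint embedding space of this kind is commonly trained with an InfoNCE-style contrastive objective. The abstract does not name the exact loss, so the sketch below is an assumption: a one-directional (context to backchannel) InfoNCE over cosine similarities, where pair i is the positive and all other backchannels in the batch act as negatives. The function name, temperature value, and toy vectors are hypothetical.

```python
import math


def info_nce_loss(contexts, backchannels, temperature=0.1):
    """Mean InfoNCE loss; contexts[i] and backchannels[i] form the positive pair."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]  # assumes non-zero vectors

    ctx = [normalize(v) for v in contexts]
    bcs = [normalize(v) for v in backchannels]
    total = 0.0
    for i, c in enumerate(ctx):
        # Cosine similarity of this context to every backchannel in the batch.
        logits = [sum(a * b for a, b in zip(c, bc)) / temperature for bc in bcs]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]  # negative log-probability of the positive
    return total / len(ctx)


# Toy 2-D embeddings (made up): aligned pairs versus deliberately swapped pairs.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each context toward its own backchannel and pushes it away from the others, which is what makes retrieval in the shared space possible.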
[14] Evaluating Adaptive Personalization of Educational Readings with Simulated Learners cs.CL | cs.AI | cs.HC
Ryan T. Woo, Anmol Rao, Aryan Keluskar, Yinong Chen
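TL;DR: This paper presents a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds an ontology of learning objectives and knowledge components from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model that combines DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner’s explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
Details
Motivation: Evaluation of adaptive personalization of educational reading materials lacks a theory-grounded simulated learner framework. The paper provides one so that personalized learning effects can be tested and optimized more systematically.
Result: Across ontologies and simulated learner cohorts for computer science, inorganic chemistry, and general biology, adaptive reading significantly improves learning outcomes in computer science, has a smaller and inconclusive effect in inorganic chemistry, and is neutral to slightly negative in general biology.
Insight: The contributions include combining theory-grounded simulated learner models (such as Construction-Integration memory, DIME reader factors, and KREC misconception revision) with ontology-driven reading material generation and BKT-based adaptation, yielding a scalable evaluation framework for educational personalization. Viewed objectively, the method’s effectiveness varies by subject, which highlights the role of contextual factors in adaptive learning.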
Abstract: We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner’s explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
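BKT here presumably refers to Bayesian Knowledge Tracing, which keeps a per-skill mastery probability and updates it after every answer. The sketch below shows the standard BKT update. The slip, guess, and transition values are placeholder assumptions, not parameters from the paper, and the function name is hypothetical.

```python
def bkt_update(p_known, correct, p_slip=0.1, p_guess=0.2, p_transit=0.15):
    """One Bayesian Knowledge Tracing step for a single knowledge component.

    p_known:   prior probability that the learner has mastered the skill
    correct:   whether the observed answer was correct
    p_slip:    probability of answering wrong despite mastery (assumed value)
    p_guess:   probability of answering right without mastery (assumed value)
    p_transit: probability of learning the skill after this step (assumed value)
    """
    if correct:
        evidence = p_known * (1 - p_slip)
        total = evidence + (1 - p_known) * p_guess
    else:
        evidence = p_known * p_slip
        total = evidence + (1 - p_known) * (1 - p_guess)
    posterior = evidence / total  # Bayes' rule given the observed answer
    # Account for the chance that the learner acquires the skill afterwards.
    return posterior + (1 - posterior) * p_transit


# Track mastery over a made-up answer sequence.
mastery = 0.3
for answer in [True, False, True, True]:
    mastery = bkt_update(mastery, answer)
```

An adaptive system could, for instance, serve an easier passage while the mastery estimate stays below a chosen threshold; how the paper maps BKT estimates to reading choices is not stated in the abstract.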
[15] DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training cs.CL
Ziwen Pan, Zihan Liang, Jad Kabbara, Ali Emami
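TL;DR: This paper proposes DART (Distill-Audit-Repair Training), a training framework that mitigates the ‘harm drift’ triggered when large language models (LLMs) are fine-tuned for accuracy. DART distills label-conditioned reasoning from a teacher model, audits outputs for harm drift cases relative to the baseline, and repairs problematic cases via severity-weighted fine-tuning. This improves accuracy on questions involving demographic differences while substantially reducing harmful content.
Details
Motivation: Safety-tuned LLMs often avoid acknowledging demographic differences even when such acknowledgment is factually correct or contextually justified, which leads to incorrect responses, unnecessary refusals, or generic ‘equal-treatment’ defaults. Fine-tuning for accuracy triggers ‘harm drift’: model-generated explanations become increasingly harmful as decision accuracy improves.
Result: On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8%, with the largest gains on equal-treatment prompts (11.3% -> 72.6%), while reducing harm drift cases by 72.6%. On 280 open-ended real-world queries across medical, legal, policy, and educational domains, the rate of difference-appropriate responses rises from 39.8% to 77.5% and the refusal rate falls from 34.3% to 3.0%.
Insight: The novelty lies in the DART framework, whose distill-audit-repair pipeline explicitly detects and repairs harm drift, demonstrating that accuracy and safety can coexist. Viewed objectively, it offers a systematic mechanism for balancing factual accuracy against safety and ethical constraints on sensitive topics.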
Abstract: Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic “equal-treatment” defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the baseline identified. To mitigate this, we introduce DART (Distill–Audit–Repair Training), which distills label-conditioned reasoning from a teacher, audits outputs for harm drift cases relative to baseline, and repairs problematic cases via severity-weighted fine-tuning. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8%, with largest gains on equal-treatment prompts (11.3% -> 72.6%), while reducing harm drift cases by 72.6%. It also transfers to 280 open-ended real-world queries across medical, legal, policy, and educational domains, improving difference-appropriate responses from 39.8% to 77.5% while reducing refusals from 34.3% to 3.0%. Our results demonstrate that accuracy and safety need not conflict when explicit detection and repair mechanisms are in place.
[16] Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation cs.CL | cs.AI
Jiang Zhou, Xiaohu Zhao, Xinwei Wu, Tianyu Dong, Hao Wang
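TL;DR: This paper proposes EA-RLVR, a reinforcement learning training framework that incentivizes large language models to make effective use of their internal parametric knowledge for cross-cultural entity translation. It stabilizes optimization with a verifiable entity-level reward signal and lightweight structural gates, steering the model toward a robust reasoning process rather than mere imitation of reference translations.
Details
Motivation: LLMs often produce literal or phonetic renderings in cross-cultural entity translation instead of culturally appropriate translations in context. The aim is to incentivize effective use of parametric knowledge already encoded during pre-training, without relying on external knowledge bases.
Result: On the XC-Translate benchmark, both entity translation accuracy and out-of-domain generalization improve consistently. Training on only 7k samples raises Qwen3-14B’s entity translation accuracy from 23.66% to 31.87% on a 50k test set of entirely unseen entities. The learned ability transfers to general translation, yielding +1.35 XCOMET on WMT24++, which rises to +1.59 with extended optimization.
Insight: The novelty lies in an entity-anchored reinforcement learning framework based on verifiable rewards, combined with lightweight structural gates to stabilize optimization, with an emphasis on learning a reasoning process rather than imitating outputs. The analysis attributes the gains to improved sampling efficiency and a stable optimization landscape, which together incentivize use of parametric knowledge.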
Abstract: Cross-cultural entity translation remains challenging for large language models (LLMs), which usually yield literal or phonetic renderings instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. EA-RLVR anchors supervision on a verifiable, entity-level reward signal and incorporates lightweight structural gates to stabilize optimization. This design steers the model toward learning a robust reasoning process rather than merely imitating reference translations. We evaluate EA-RLVR on XC-Translate and observe consistent improvements in both entity translation accuracy and out-of-domain generalization. Specifically, training on merely 7k samples boosts Qwen3-14B’s entity translation accuracy from 23.66% to 31.87% on a 50k test set comprising entirely unseen entities. The learned entity translation ability also transfers to general translation, yielding +1.35 XCOMET on WMT24++, which scales to +1.59 with extended optimization. Extensive analyses of pass@k dynamics and reward formulations attribute these gains to superior sampling efficiency and a stable optimization landscape.
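The pass@k metric mentioned in the analysis is usually estimated with the unbiased estimator popularized in code-generation evaluation: draw n samples, count the c that pass a verifier, and compute the probability that at least one of k randomly chosen samples passes. The sketch below assumes that estimator and assumes that "passing" here means the sampled translation contains the verified entity rendering; the paper's exact protocol is not given in the abstract.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts: 10 sampled translations, 3 with the correct entity.
single_try = pass_at_k(10, 3, 1)
five_tries = pass_at_k(10, 3, 5)
```

Comparing pass@1 with pass@k for larger k gives a rough picture of sampling efficiency: a small gap suggests the model already places the correct rendering near the top of its output distribution.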
[17] PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations cs.CL | cs.AI
Yuhe Wu, Guangyu Wang, Yuran Chen, Jiatong Zhang, Yutong Zhang
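TL;DR: This paper proposes PRISM, a benchmark for diagnosing the root causes of LLM hallucinations. It decomposes hallucinations into four dimensions (missing knowledge, knowledge errors, reasoning errors, and instruction-following errors) tied to three stages of generation (memory, instruction, and reasoning). The benchmark contains 9,448 instances across 65 tasks and supports fine-grained, stage-aware diagnostic evaluation. An evaluation of 24 mainstream open-source and proprietary LLMs reveals consistent trade-offs among instruction following, memory retrieval, and logical reasoning, and shows that mitigation strategies often improve one dimension at the expense of others.
Details
Motivation: Existing benchmarks rely mainly on mixed queries and post-hoc, output-level scoring. They can quantify hallucination severity but offer limited insight into where and why hallucinations arise in the generation pipeline. The paper therefore recasts hallucination evaluation as a diagnostic problem, aiming to understand the specific mechanisms behind LLM hallucinations.
Result: Evaluating 24 mainstream LLMs on PRISM reveals consistent trade-offs among instruction following, memory retrieval, and logical reasoning. The results indicate that hallucination mitigation strategies often improve only specific dimensions while sacrificing others.
Insight: The novelty is moving hallucination evaluation from simple output scoring to a fine-grained, stage-aware diagnostic framework that explicitly separates four hallucination dimensions and three generation stages. This gives a systematic tool for understanding how hallucinations arise and supports targeted development of more trustworthy LLMs. Viewed objectively, the work offers a new perspective and methodology for model evaluation and diagnosis, and underlines the importance of multi-dimensional trade-offs.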
Abstract: As large language models (LLMs) evolve from conversational assistants into agents capable of handling complex tasks, they are increasingly deployed in high-risk domains. However, existing benchmarks largely rely on mixed queries and posterior, output-level scoring, which quantifies hallucination severity but offers limited insight into where and why hallucinations arise in the generation pipeline. We therefore reformulate hallucination evaluation as a diagnostic problem and propose PRISM, a controlled benchmark that disentangles hallucinations into four dimensions: knowledge missing, knowledge errors, reasoning errors, and instruction-following errors, grounded in three stages of generation (memory, instruction, and reasoning). PRISM contains 9,448 instances across 65 tasks and supports fine-grained, stage-aware diagnostic evaluation. Evaluating 24 mainstream open-source and proprietary LLMs, we uncover consistent trade-offs across instruction following, memory retrieval, and logical reasoning, showing that mitigation strategies often improve specific dimensions at the expense of others. We hope PRISM provides a framework for understanding the specific mechanisms behind LLM hallucinations, ultimately accelerating the development of trustworthy large language models.
[18] x1: Learning to Think Adaptively Across Languages and Cultures cs.CL
Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Yichong Huang, Zekun Yuan
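TL;DR: This paper presents x1, a family of reasoning models that adaptively choose an advantageous language to reason in for each instance, addressing the way mainstream LLMs overlook linguistic diversity by reasoning in a single language. Trained by contrasting reasoning trajectories in different languages for the same input, x1 shows advantages on multilingual mathematical reasoning and culturally grounded tasks.
Details
Motivation: Existing LLMs usually reason in a single dominant language, ignoring the distinct abstractions and inductive priors that different languages encode. x1 aims to exploit this diversity through adaptive language selection to improve reasoning.
Result: Experiments on multilingual mathematical reasoning and culturally grounded tasks show that x1 benefits from adaptive multilingual reasoning. The study also finds that scaling reduces cross-lingual disparities in procedural domains such as math reasoning, yet in culturally grounded tasks the culture-associated language still yields more efficient and accurate recall of cultural knowledge.
Insight: The novelty is treating language choice as a functional component of reasoning and achieving adaptive multilingual reasoning through contrastive training. This challenges a simplistic view of scaling laws and suggests a route to more generalist, globally competent reasoning models.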
Abstract: Languages encode distinct abstractions and inductive priors, yet most large language models (LLMs) overlook this diversity by reasoning in a single dominant language. In this work, we introduce x1, a family of reasoning models that can adaptively reason in an advantageous language on a per-instance basis. To isolate the effect of reasoning-language choice, x1 is constructed without expanding the model’s knowledge boundaries and is trained by contrasting linguistically distinct reasoning trajectories for the same input. Our extensive experiments demonstrate the benefits of adaptive multilingual reasoning across multilingual mathematical reasoning and culturally grounded tasks. Moreover, our results challenge a simplistic view of scaling laws: while scaling reduces cross-lingual disparities in procedural domains such as math reasoning, it does not eliminate the advantages of culture-associated languages in culturally grounded tasks, as we empirically show that such reasoning enables more efficient and accurate cultural knowledge recall. Overall, our findings establish language choice as a functional component of reasoning, with implications for building more generalist and globally competent reasoning models.
[19] Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning cs.CL | cs.LG
Weiyu Ma, Yongcheng Zeng, Yan Song, Xinyu Cui, Jian Zhao
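TL;DR: This paper proposes Freshness-Aware Prioritized Experience Replay for reinforcement learning of LLMs and VLMs. It introduces an exponential age decay grounded in effective sample size analysis to address stale priorities in conventional prioritized experience replay, substantially improving sample efficiency.
Details
Motivation: Current LLM/VLM reinforcement learning relies mainly on on-policy algorithms such as PPO, which discard all trajectories after each gradient update. This makes them sample-inefficient and especially wasteful in agentic tasks where multi-turn interaction is expensive. Applying conventional PER directly to LLMs fails because the policy evolves quickly, stored priorities go stale, and old high-priority trajectories keep dominating sampling after they have stopped being informative.
Result: The method is evaluated on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, with gains of +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, whereas standard PER without age decay consistently hurts performance.
Insight: The core novelty is combining an exponential age decay factor, derived from effective sample size analysis, with any PER priority, which quantifies and mitigates priority staleness mathematically. The authors describe it as the first work to successfully apply PER to LLM/VLM reinforcement learning, offering a general and effective mechanism for the sample-efficiency bottleneck in large-model RL.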
Abstract: Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision-CAIR/Freshness-Aware-PER.
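The multiplicative exponential age decay described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names are hypothetical, age is assumed to be measured in policy updates since the priority was computed, and `decay_rate` is treated as a free hyperparameter rather than derived from the effective sample size analysis in the paper.

```python
import math
import random


def freshness_weighted_priorities(priorities, ages, decay_rate):
    """Multiply each stored priority by exp(-decay_rate * age).

    priorities: base PER priorities (any non-negative scores)
    ages: assumed number of policy updates since each priority was computed
    decay_rate: hypothetical hyperparameter controlling how fast
        old trajectories lose sampling weight
    """
    return [p * math.exp(-decay_rate * a) for p, a in zip(priorities, ages)]


def sample_indices(weights, k, rng):
    """Draw k replay indices with probability proportional to weight."""
    return rng.choices(range(len(weights)), weights=weights, k=k)


# A stale high-priority trajectory versus a fresh low-priority one.
weights = freshness_weighted_priorities([5.0, 1.0], ages=[50, 0], decay_rate=0.1)
batch = sample_indices(weights, k=4, rng=random.Random(0))
```

With these illustrative numbers the trajectory that is 50 updates old ends up with a smaller sampling weight than the fresh one despite its larger stored priority, which is the behavior the abstract attributes to age decay.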
[20] MeasHalu: Mitigation of Scientific Measurement Hallucinations for Large Language Models with Enhanced Reasoning cs.CL
Ruijun Huang, Zhiqiao Kang, Yuxuan Zhu, Junxiong Li, Jiahao Zhao
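TL;DR: This paper proposes MeasHalu, a framework that uses enhanced reasoning and targeted optimization to mitigate the hallucinations LLMs produce when extracting measurement data from scientific literature. It first builds a fine-grained taxonomy of measurement-related hallucinations, then applies a two-stage reasoning-aware fine-tuning strategy and a progressive reward curriculum, substantially improving extraction accuracy.
Details
Motivation: LLMs frequently hallucinate severely when extracting measurement data from scientific literature. Addressing this is needed to make automated scientific document understanding systems more reliable.
Result: On the MeasEval benchmark, MeasHalu substantially reduces hallucination rates and improves overall accuracy.
Insight: The novelties are a fine-grained taxonomy of measurement hallucinations, a two-stage reasoning-aware fine-tuning strategy that combines augmented scientific data with process supervision, and a progressive reward curriculum that penalizes specific hallucination types. Together they offer a targeted solution for automated scientific knowledge extraction.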
Abstract: The accurate extraction of scientific measurements from literature is a critical yet challenging task in AI4Science, enabling large-scale analysis and integration of quantitative research findings. However, Large Language Models (LLMs) frequently exhibit severe hallucinations, which significantly undermine the reliability of automated scientific document understanding systems. To address this problem, we propose MeasHalu, a novel framework for mitigating scientific measurement hallucinations through enhanced reasoning and targeted optimization. We first present a fine-grained taxonomy of measurement-specific hallucinations, categorizing errors across quantities, units, modifiers, and relations. Our approach incorporates a two-stage reasoning-aware fine-tuning strategy using augmented scientific data and process-based supervision. Furthermore, we introduce a progressive reward curriculum designed to penalize specific hallucination types, significantly improving extraction faithfulness. Experimental results demonstrate that MeasHalu substantially reduces hallucination rates and improves overall accuracy on the MeasEval benchmark. This work provides a targeted solution to a key bottleneck in automated scientific knowledge extraction, facilitating more trustworthy and scalable machine-assisted scientific literature analysis.
[21] MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation cs.CL
Bo Li, Ningyuan Deng, Tianyu Dong, Shaobo Wang, Shaolin Zhu
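TL;DR: This paper proposes modality neuron-aware fine-tuning (MNAFT), a method for improving multimodal large language models (MLLMs) on image translation. Using instruction-driven activation analysis, it identifies language-agnostic and language-specific neurons in the vision and language modules, then selectively fine-tunes only those key neurons within the layers relevant to the target task. This improves translation accuracy while avoiding parameter redundancy of pre-trained knowledge.
Details
Motivation: Existing methods based on instruction fine-tuning risk parameter redundancy, which can hurt generalization. MLLMs also struggle to capture the fine-grained text inside images that accurate image translation depends on, leaving a modality gap between visual text inputs and textual inputs/outputs.
Result: Extensive experiments on multiple benchmarks show that MNAFT significantly outperforms state-of-the-art image translation methods, including cascaded models, standard full fine-tuning, and parameter-efficient fine-tuning techniques.
Insight: The novelty is a neuron-level selective fine-tuning strategy. By analyzing the specialized roles neurons play in cross-modal understanding (language-agnostic versus language-specific), it adapts the model efficiently to a specific task such as image translation while preserving most pre-trained knowledge. This offers a new perspective on the inner workings of MLLMs and on designing more efficient fine-tuning methods.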
Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities, yet they often struggle to effectively capture the fine-grained textual information within images crucial for accurate image translation. This often leads to a modality gap between visual text inputs and textual inputs/outputs for image translation. Existing methods, primarily relying on instruction fine-tuning, risk parameter redundancy of pre-trained knowledge, hindering generalization performance. To address this, we introduce modality neuron-aware fine-tuning (MNAFT), a novel approach that takes advantage of the specialized roles of individual neurons within MLLMs for enhanced image translation. MNAFT identifies language-agnostic and language-specific neurons in both vision and language modules through an instruction-driven activation analysis, evaluating their importance in various translation tasks. We then perform selective fine-tuning, updating only the parameters of language-specific and language-agnostic neurons within the selected layers relevant to the target task, while preserving the knowledge encoded in other neurons and layers. Our extensive experiments on multiple benchmarks demonstrate that MNAFT significantly outperforms state-of-the-art image translation methods, including cascaded models, standard full fine-tuning, and parameter-efficient tuning techniques. Furthermore, we provide comprehensive analysis, including visualizations of neuron activations and clustering patterns, to offer insights into the roles of different neuron groups in mediating cross-modal understanding and facilitating accurate language-specific translation.
[22] On Safety Risks in Experience-Driven Self-Evolving Agents cs.CL
Weixiang Zhao, Yichen Zhang, Yingshuo Wang, Yang Deng, Yanyan Zhao
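TL;DR: This paper studies safety risks in experience-driven self-evolving agents. It finds that experience accumulated solely from benign tasks can still compromise safety in high-risk scenarios, because execution-oriented experience reinforces the agent's tendency to act rather than refuse. In more realistic mixed-task settings, refusal-related experience mitigates the safety decline but induces over-refusal, revealing a fundamental safety-utility trade-off.
Details
Motivation: Experience-driven self-evolution is an emerging paradigm for improving the autonomy of LLM agents, but its reliance on self-curated experience introduces safety risks that remain underexplored. The paper sets out to examine them.
Result: Through experiments in web-based and embodied environments, the study shows how experience accumulation and utilization affect safety performance, and points out inherent limitations of current self-evolving agents.
Insight: The novelty is the first systematic analysis of how the orientation of experience harms safety in self-evolving agents. It exposes the safety-utility trade-off between execution bias and refusal mechanisms, and provides principled insights for designing safer and more reliable adaptation strategies.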
Abstract: Experience-driven self-evolution has emerged as a promising paradigm for improving the autonomy of large language model agents, yet its reliance on self-curated experience introduces underexplored safety risks. In this study, we investigate how experience accumulation and utilization in self-evolving agents affect safety performance across web-based and embodied environments. Notably, experience gathered solely from benign tasks can still compromise safety in high-risk scenarios. Further analysis attributes this degradation to the execution-oriented nature of accumulated experience, which reinforces agents’ tendency to act rather than refuse. In more realistic settings where agents encounter both benign and harmful tasks, refusal-related experience mitigates safety decline but induces over-refusal, revealing a fundamental safety-utility trade-off. Overall, our findings expose inherent limitations of current self-evolving agents and call for more principled strategies to ensure safe and reliable adaptation.
[23] SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models cs.CL | cs.LG
Yifu Huo, Chenglong Wang, Ziming Zhu, Shunjie Xing, Peinan Feng
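TL;DR: This paper proposes Steering Probability Squeezing (SPS), a training paradigm that alternates reinforcement learning (RL) with inverse reinforcement learning (IRL) to address insufficient exploration in RL training. The aim is to improve the diversity and multi-sample performance (Pass@k) of LLMs on reasoning tasks.
Details
Motivation: Conventional RL training tends to raise the single-sample success rate (Pass@1) but over-concentrates probability mass, which limits exploration of diverse reasoning trajectories. That exploration is crucial for multi-sample performance (Pass@k).
Result: Experiments on five commonly used reasoning benchmarks show that SPS enables better exploration and improves Pass@k. The paper also analyzes RL learning dynamics and identifies an empirical upper bound on Pass@k.
Insight: The novelty is treating on-policy rollouts as demonstrations and using IRL to explicitly reshape the trajectory distribution, which enhances exploration without external supervision. The core insight is that alternating RL and IRL offers an effective route to extending the exploration capacity of reasoning-oriented LLMs.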
Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training typically tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, in this work, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs IRL to explicitly reshape the induced trajectory distribution, thereby enhancing exploration without introducing external supervision. Experiments on five commonly used reasoning benchmarks demonstrate that SPS can enable better exploration and improve Pass@k. Beyond algorithmic contributions, we provide an analysis of RL learning dynamics and identify an empirical upper bound on Pass@k, shedding light on intrinsic exploration limits in RL-based reasoning models. Our findings suggest that alternating between RL and IRL offers an effective pathway toward extending the exploration capacity of reasoning-oriented large language models.
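Since the argument turns on the gap between Pass@1 and Pass@k, a short sketch of how Pass@k is commonly estimated may help. The abstract does not say which estimator the authors use; the code below assumes the widely used unbiased estimator computed from n samples of which c are correct, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: the probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a single problem, `pass_at_k(n, c, 1)` reduces to c / n, the empirical Pass@1. A policy whose probability mass is squeezed onto a few trajectories can raise that value on problems it already solves while leaving c at zero on the rest, which caps Pass@k averaged over the benchmark.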
[24] Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification cs.CL | cs.AI | cs.LG | cs.PL
Antonio Valerio Miceli Barone, Poon Tsz Nok
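TL;DR: This paper proposes a self-play framework built on semantic equivalence, in which formal verification guides adversarial training between a generator and an evaluator to improve LLM reasoning about Haskell code. The framework validates equivalence with Liquid Haskell proofs and inequivalence with execution-based counterexamples, organized by a difficulty-aware curriculum. To support it, the authors release OpInstruct-HSx, a synthetic dataset of about 28k validated Haskell programs. Experiments show that the evaluator transfers to downstream tasks, with accuracy gains of up to 13.3 percentage points on EquiBench and consistent gains on PySecDB. Ablations indicate that inequivalence supervision supplies data volume, while equivalence proofs are the distinctive source of the model's reasoning ability.
Details
Motivation: LLMs are weak at judging semantic equivalence in code reasoning tasks. The paper combines formal verification with self-play training to improve the precision of the model's logical reasoning.
Result: The method achieves accuracy gains of up to 13.3 percentage points on EquiBench and consistent gains on PySecDB, supporting its effectiveness.
Insight: The novelty is combining formal verification (Liquid Haskell proofs) with adversarial self-play training and adding a difficulty-aware curriculum. Viewed objectively, the joint supervision from equivalence proofs and inequivalence counterexamples strengthens the model's deeper reasoning in a way that data scale alone does not.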
Abstract: We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release OpInstruct-HSx, a synthetic dataset of approximately 28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model’s reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face respectively.
[25] Dynamic Emotion and Personality Profiling for Multimodal Deception Detection cs.CL
Li Zheng, Yanyi Luo, Hao Fei, Yuzhe Ding, Yujie Huang
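TL;DR: This paper proposes Rel-DDEP, a dynamic emotion and personality profiling framework for multimodal deception detection. Using a new multi-model, multi-prompt annotation scheme, the authors build DDEP, a joint detection dataset covering deception, emotion, and personality. The framework applies Gaussian-based uncertainty quantification and reliability-weighted fusion, and significantly improves deception, emotion, and personality detection.
Details
Motivation: Existing deception detection methods lack sample-level dynamic annotations for emotion and personality, both of which matter greatly for detecting deception. This calls for a dynamically annotated dataset and an effective multimodal fusion framework.
Result: Experiments on the MDPE and DDEP datasets show that Rel-DDEP significantly outperforms existing state-of-the-art baselines on deception, emotion, and personality detection, with F1 gains of 2.53%, 2.66%, and 9.30%, respectively.
Insight: The novelties are: (1) a multi-model, multi-prompt annotation scheme with a strict label quality evaluation standard, used to build a dataset with dynamic emotion and personality annotations; (2) a reliability-weighted fusion framework that quantifies modality uncertainty with Gaussian distributions and adds alignment and sorting constraint modules for joint detection; (3) validation that sample-level dynamic annotation and reliability-weighted fusion are effective.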
Abstract: Deception detection is of great significance for ensuring information security and conducting public opinion analysis, with personality factors and emotion cues playing a critical role. However, existing methods lack sample-level dynamic annotations for emotions and personality. In this paper, we propose an innovative multi-model multi-prompt annotation scheme and a strict label quality evaluation standard, and establish a multimodal joint detection dataset DDEP for deception, emotion, and personality. Meanwhile, we propose Rel-DDEP, an adaptive reliability-weighted fusion framework. Our framework quantifies uncertainty by mapping modal features to a high-dimensional Gaussian distribution space. It then performs reliability-weighted fusion and incorporates an alignment module and a sorting constraint module to achieve joint detection of deception, emotion, and personality. Experimental results on the MDPE and DDEP datasets show that our Rel-DDEP significantly outperforms the existing state-of-the-art baseline models in three tasks. The F1 score of the deception detection increases by 2.53%, that of the emotion detection increases by 2.66%, and that of the personality detection increases by 9.30%. The experiments fully verify the necessity of annotating dynamic emotion and personality labels for each sample and the effectiveness of reliability-weighted fusion.
[26] Stability-Weighted Decoding for Diffusion Language Models cs.CL | cs.LG
Yue Wu, Jian Huang
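TL;DR: This paper proposes Stability-Weighted Decoding (SWD), a training-free, plug-and-play strategy for improving text generation quality in diffusion large language models (dLLMs). It measures a token's temporal instability as the difference (KL divergence) between its prediction distributions at consecutive denoising steps, and folds this stability information into decoding scores to suppress premature unmasking of unstable tokens.
Details
Motivation: Existing dLLM decoding strategies rely on static confidence metrics computed at a single denoising step. They ignore temporal history and often unmask unstable tokens too early, which hurts generation quality.
Result: Experiments on code generation and mathematical reasoning benchmarks show that SWD consistently improves generation accuracy across representative scoring metrics and selection policies, and that it stays significantly ahead of standard baselines across different acceleration ratios, demonstrating strong robustness.
Insight: The core novelty is a theoretical result: a token's temporal instability, quantified by the KL divergence between consecutive prediction distributions, is a strict lower bound on its mutual information with the remaining masked context. This gives a principled basis for identifying tokens that are unsafe to unmask. Building on it, SWD uses temporal stability as a universal modulator for any score-based decoding policy, with no training required.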
Abstract: Diffusion large language models (dLLMs) enable parallel text generation by iteratively denoising a fully masked sequence, unmasking a subset of masked tokens at each step. Existing decoding strategies rely on static confidence metrics computed at a single denoising step, ignoring temporal history and often leading to premature unmasking of unstable tokens. In this work, we theoretically establish that a token’s temporal instability, quantified by the KL divergence between consecutive prediction distributions, provides a strict lower bound on its mutual information with the remaining masked context, indicating that temporally unstable tokens are inherently unsafe to unmask. Based on this insight, we propose Stability-Weighted Decoding (SWD), a training-free, plug-and-play strategy that incorporates temporal stability into token scoring and acts as a universal modulator for arbitrary score-based decoding policies. Experiments on code generation and mathematical reasoning benchmarks demonstrate that SWD consistently improves generation accuracy across representative scoring metrics and selection policies, and exhibits exceptional robustness, maintaining a significant performance lead over standard baselines across varying acceleration ratios.
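To make the stability weighting concrete, here is a minimal sketch of how a confidence score might be modulated by the KL divergence between consecutive prediction distributions. The abstract does not give the exact formula, so the exponential modulation, the direction of the KL term, the `beta` parameter, and the use of max-probability as the base confidence are all assumptions for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def stability_weighted_scores(prev_probs, curr_probs, beta=1.0):
    """Score each masked position by confidence * exp(-beta * instability).

    prev_probs, curr_probs: per-position token distributions from two
        consecutive denoising steps
    beta: hypothetical strength of the stability penalty
    """
    scores = []
    for prev, curr in zip(prev_probs, curr_probs):
        confidence = max(curr)                  # static, single-step confidence
        instability = kl_divergence(curr, prev)  # change since the previous step
        scores.append(confidence * math.exp(-beta * instability))
    return scores


# Position 0 is stable; position 1 flipped its prediction between steps.
prev = [[0.9, 0.1], [0.1, 0.9]]
curr = [[0.9, 0.1], [0.95, 0.05]]
scores = stability_weighted_scores(prev, curr)
```

In this toy case position 1 has the higher single-step confidence (0.95 versus 0.9), yet it receives the lower stability-weighted score because its distribution changed sharply, so a top-score selection policy would unmask position 0 first.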
[27] Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL cs.CL | cs.AI
Skylar Zhai, Jingcheng Liang, Dongyeop Kang
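TL;DR: This paper proposes Abstain-R1, a model trained with a verifiable reinforcement learning reward that jointly optimizes explicit abstention on unanswerable queries and semantically aligned post-refusal clarification. It targets the tendency of models to guess or hallucinate on questions they cannot answer reliably.
Details
Motivation: Existing abstention methods either train models to give generic refusals or encourage follow-up clarifications without verifying that the clarification identifies the key missing information. The paper argues that a reliable model should not only abstain but also explain what information is missing.
Result: On the Abstain-Test, Abstain-QA, and SelfAware benchmarks, Abstain-R1 (3B parameters) improves substantially over its base model, and its behavior on unanswerable queries is competitive with larger systems, including DeepSeek-R1.
Insight: The novelty is a clarification-aware RLVR reward function that jointly optimizes correct answers, explicit abstention, and semantically aligned clarification. The results suggest that calibrated abstention and clarification can be learned through verifiable rewards rather than depending on model scale alone.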
Abstract: Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.
[28] How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them cs.CL
Disen Liao, Freda Shi
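TL;DR: This paper studies how tokenization affects the ability of language models (LMs) to represent phonological knowledge. It finds that subword-based tokenization weakens the encoding of both local (for example, rhyme) and global (for example, syllabification) phonological features. The authors introduce the syllabification-tokenization alignment distance (STAD) to quantify the misalignment, and design a lightweight fine-tuning method based on the International Phonetic Alphabet (IPA) that improves phonological awareness while largely preserving math and general reasoning performance.
Details
Motivation: Tokenization in existing LMs never takes pronunciation into account, which limits how well models can represent phonological knowledge. The paper examines how tokenization affects phonological encoding in text-only LMs and addresses the resulting weakening of phonological feature representations.
Result: The proposed IPA-based fine-tuning brings consistent improvements on three phonology-related tasks, with drops of only 1.1% and 0.9% on GSM8K and MMLU, respectively, so math and general reasoning ability are largely preserved.
Insight: The novelties are STAD, a diagnostic metric for the misalignment between tokenization and natural syllable boundaries, and a lightweight IPA fine-tuning method that injects phonological awareness into LMs, offering a new way to improve phonological knowledge representation.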
Abstract: Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs’ ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model’s tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively.
[29] The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning cs.CL
Md Shamim Ahmed, Maja Dusanic, Moritz Nikolai Kirschner, Elisabeth Nyoungui, Jana Zschüntzsch
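TL;DR: This paper addresses the "Provenance Gap" in clinical AI: frontier large language models (LLMs) can produce clinically accurate outputs, yet their citations are often fabricated. The authors test five frontier LLMs on 36 clinician-validated scenarios involving rare neuromuscular diseases. Without prompting, no model produces a relevant PubMed identifier, and even when explicitly asked to cite, the best model reaches only 15.3% relevant PMIDs. They propose HEG-TKG (Hierarchical Evidence-Grounded Temporal Knowledge Graphs), a system that builds temporal knowledge graphs from 4,512 PubMed records and curated sources and grounds clinical claims in evidence. In a controlled three-arm comparison, HEG-TKG matches baseline clinical feature coverage while achieving 100% evidence verifiability with 203 inline citations, whereas Guideline-RAG, given overlapping source documents as raw text, produces zero verifiable citations. Independent clinician evaluation confirms the verifiability advantage (Cohen's d = 1.81, p < 0.001) with no degradation in safety or completeness. A counterfactual experiment shows 80% resistance to injected clinical errors and 100% detectability via citation tracing. The system is deployed on-premise with open-source models, so patient data never leaves institutional infrastructure.
Details
Motivation: The paper tackles the "Provenance Gap" in clinical AI: LLM-generated clinical outputs may be accurate, but their citations are often fabricated and lack a verifiable evidence base. This matters most in high-stakes settings such as rare disease reasoning.
Result: On 36 clinician-validated rare neuromuscular disease scenarios, HEG-TKG matches baseline clinical feature coverage in a controlled three-arm comparison while achieving 100% evidence verifiability with 203 inline citations. This is far ahead of Guideline-RAG (zero verifiable citations) and of frontier LLMs (best case 15.3% relevant PMIDs). Independent clinician evaluation shows a large effect size for the verifiability advantage (Cohen's d = 1.81, p < 0.001) with no degradation in safety or completeness. The counterfactual experiment shows 80% resistance to injected clinical errors and 100% detectability via citation tracing.
Insight: The novelty is HEG-TKG, which builds hierarchical, evidence-grounded temporal knowledge graphs from PubMed records with quality-tier stratification and disease-trajectory milestones, and anchors clinical claims strictly in verifiable evidence, addressing citation fabrication by LLMs. Viewed objectively, the system combines knowledge graphs with evidence provenance, emphasizes temporality and quality stratification, and improves the trustworthiness and auditability of clinical AI. On-premise deployment protects data privacy, making it a useful reference framework for high-stakes medical applications.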
Abstract: Frontier large language models generate clinically accurate outputs, but their citations are often fabricated. We term this the Provenance Gap. We tested five frontier LLMs across 36 clinician-validated scenarios for three rare neuromuscular disease pairs. No model produced a clinically relevant PubMed identifier without prompting. When explicitly asked to cite, the best model achieved 15.3% relevant PMIDs; the majority resolved to real publications in unrelated fields. We present HEG-TKG (Hierarchical Evidence-Grounded Temporal Knowledge Graphs), a system that grounds clinical claims in temporal knowledge graphs built from 4,512 PubMed records and curated sources with quality-tier stratification and 1,280 disease-trajectory milestones. In a controlled three-arm comparison using the same synthesis model, HEG-TKG matches baseline clinical feature coverage while achieving 100% evidence verifiability with 203 inline citations. Guideline-RAG, given overlapping source documents as raw text, produces zero verifiable citations. LLM judges cannot distinguish fabricated from verified citations without PubMed audit data. Independent clinician evaluation confirms the verifiability advantage (Cohen’s d = 1.81, p < 0.001) with no degradation on safety or completeness. A counterfactual experiment shows 80% resistance to injected clinical errors with 100% detectability via citation trace. The system deploys on-premise via open-source models so patient data never leaves institutional infrastructure.
[30] The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration cs.CL | cs.AI | cs.MA
Jiayuan Liu, Shiyi Du, Weihua Du, Mingyu Guo, Vincent Conitzer
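TL;DR: This paper exposes a structural vulnerability in current multi-agent LLM systems that aggregate at the response level (for example, by majority voting): the system collapses when adversarially corrupted agents form a local majority. To overcome this, it proposes token-level round-robin collaboration, in which agents take turns generating tokens within a shared autoregressive context. This turns aggregation from a brittle count of final votes into a dynamically interwoven chain of logic. Theoretical analysis and experiments show that the method stays robust even when corrupted agents are in the majority.
Details
Motivation: Multi-agent LLM systems in open environments are vulnerable to stealthy contextual corruption such as targeted prompt injection. In particular, when corrupted agents form a majority, existing response-level aggregation methods such as majority voting fail completely.
Result: An extensive evaluation across multiple reasoning benchmarks shows that majority voting collapses once corrupted agents reach a majority, while the proposed token-level round-robin collaboration maintains robust accuracy beyond this critical threshold.
Insight: The core novelty is moving the aggregation granularity of multi-agent collaboration from the response level down to the token level. Sequential interleaved generation changes the aggregation operation from a linear sum into a non-linear operator product, which gives a theoretical guarantee that the honest model's restorative pull can overpower adversarial corruption. This suggests a new architectural direction for more robust multi-agent systems.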
Abstract: Multi-agent large language model (LLM) architectures increasingly rely on response-level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi-agent systems: response-level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully-formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token-Level Round-Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto-regressive context. We formalize this process as a discrete-time dynamical system, proving that token-level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product). Through this theoretical lens, we prove that the honest model’s restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.
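The mechanics of token-level round-robin generation can be sketched with toy agents. This is an illustration of the interleaving only, under stated assumptions: each agent is modeled as a hypothetical callable mapping the shared context to its next token, the `<eos>` stop token and function names are invented here, and nothing in the sketch reproduces the paper's robustness analysis.

```python
from typing import Callable, List

# Hypothetical agent interface: shared context in, next token out.
Agent = Callable[[List[str]], str]


def round_robin_generate(
    agents: List[Agent],
    prompt: List[str],
    max_tokens: int,
    stop_token: str = "<eos>",
) -> List[str]:
    """Agents take turns appending one token each to a shared context.

    Every agent conditions on all tokens produced so far, including those
    written by the other agents, instead of producing a full answer alone.
    """
    context = list(prompt)
    for step in range(max_tokens):
        agent = agents[step % len(agents)]  # strict rotation over agents
        token = agent(context)
        if token == stop_token:
            break
        context.append(token)
    return context[len(prompt):]
```

By contrast, majority voting would call each agent to completion in isolation and count the final answers, so no agent ever sees or reacts to another agent's intermediate tokens.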
[31] Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization cs.CL | cs.AI
Xiaoyong Mei, Tingting Zuo, Da Chen, Guangyu Hu, Xiangyu Wen
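TL;DR: This paper proposes a framework that couples explicit cognitive-style reasoning with reward-based optimization for multi-role dialogue summarization. It first distills structured reasoning traces from a large teacher model as auxiliary supervision and initializes a reasoning-aware summarizer through staged supervised fine-tuning. It then applies GRPO with a dual reward that combines metric-based signals and human-aligned criteria.
Details
Motivation: Existing methods mainly optimize automatic metrics such as ROUGE and BERTScore, which favor surface-level imitation of reference summaries over real gains in faithfulness or alignment with human preferences. The paper addresses the difficulty of preserving role-specific information and factual consistency in multi-role dialogue summarization.
Result: On multilingual multi-role dialogue benchmarks, the method matches strong baselines on ROUGE and BERTScore. Results on CSDS confirm the framework's stability in semantic consistency, and an in-depth analysis on SAMSum shows clear gains in factual faithfulness and model-based preference alignment.
Insight: The novelty is bringing structured reasoning traces into training as auxiliary supervision and designing a dual reward that blends metric signals with human-aligned criteria. This underscores the value of reasoning-aware and preference-aware training for reliable dialogue summarization, and points beyond surface overlap metrics toward better summary quality and faithfulness.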
Abstract: Multi-role dialogue summarization requires modeling complex interactions among multiple speakers while preserving role-specific information and factual consistency. However, most existing methods optimize for automatic metrics such as ROUGE and BERTScore, which favor surface-level imitation of references rather than genuine gains in faithfulness or alignment with human preferences. We propose a novel framework that couples explicit cognitive-style reasoning with reward-based optimization for multi-role dialogue summarization. Our method first distills structured reasoning traces (e.g., step-by-step inferences and intermediate reflections) from a large teacher model and uses them as auxiliary supervision to initialize a reasoning-aware summarizer via staged supervised fine-tuning. It then applies GRPO with a dual-principle reward that blends metric-based signals with human-aligned criteria targeting key information coverage, implicit inference, factual faithfulness, and conciseness. Experiments on multilingual multi-role dialogue benchmarks show that our method matches strong baselines on ROUGE and BERTScore. Specifically, results on CSDS confirm the framework’s stability in semantic consistency, while in-depth analysis on SAMSum demonstrates clear gains in factual faithfulness and model-based preference alignment. These findings underscore the value of reasoning-aware and preference-aware training for reliable dialogue summarization. Checkpoints and datasets are available at https://huggingface.co/collections/NebulaPixel/summorchestra-multirole-summary.
[32] A Multi-Agent Approach for Claim Verification from Tabular Data Documents cs.CL
Rudra Ranajee Saha, Laks V. S. Lakshmanan, Raymond T. Ng
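TL;DR: This paper proposes MACE, a multi-agent framework for claim verification from tabular data documents. It comprises three specialized agents (Planner, Executor, and Verifier), each working in a zero-shot Chain-of-Thought setup with no elaborate fine-tuning, and it produces interpretable verification traces. Experiments show that MACE achieves state-of-the-art performance on two datasets and is on par with the best models on two others, while reaching 80-100% of the best performance with smaller models (27-92B versus 235B parameters).
Details
Motivation: Existing LLM-based claim verification methods either depend on complex pre-training or fine-tuning, or decompose verification into subtasks, and they often lack comprehensive explanations and generalizability. The paper aims to overcome these limitations with a more efficient, interpretable, and general framework.
Result: MACE achieves state-of-the-art performance on two datasets and performs on par with the best models on two others. With substantially smaller models (27-92B parameters), it reaches 80-100% of the performance of the best model (235B parameters), showing competitive performance and memory efficiency.
Insight: The novelty is a multi-agent architecture (Planner, Executor, Verifier) for claim verification, combined with zero-shot Chain-of-Thought to obtain interpretable reasoning without fine-tuning. Viewed objectively, the framework lowers model-size requirements while keeping performance high and exposes a transparent reasoning process, which is relevant to table understanding and trustworthy AI.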
Abstract: We present a novel approach for claim verification from tabular data documents. Recent LLM-based approaches either employ complex pretraining/fine-tuning or decompose verification into subtasks, often lacking comprehensive explanations and generalizability. To address these limitations, we propose a Multi-Agentic framework for Claim verification (MACE) consisting of three specialized agents: Planner, Executor, and Verifier. Instead of elaborate finetuning, each agent employs a zero-shot Chain-of-Thought setup to perform its tasks. MACE produces interpretable verification traces, with the Planner generating explicit reasoning strategies, the Executor providing detailed computation steps, and the Verifier validating the logic. Experiments demonstrate that MACE achieves state-of-the-art (SOTA) performance on two datasets and performs on par with the best models on two others, while achieving 80–100% of best performance with substantially smaller models: 27–92B parameters versus 235B. This combination of competitive performance, memory efficiency, and transparent reasoning highlights our framework’s effectiveness.
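[33] Seeing Isn’t Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents cs.CL | cs.AI | cs.RO
Hanlin Wang, Chak Tou Leong, Jian Wang, Wenjie Li
TL;DR: This paper proposes an active intervention mechanism called Estimate-Verify-Update (EVU) to address "belief inertia" in embodied agents, where an agent stubbornly adheres to prior beliefs and ignores critical environmental feedback. By predicting, verifying, and updating beliefs, the mechanism helps agents interact with the environment more effectively and substantially improves task success rates on multiple embodied benchmarks.
Details
Motivation: LLM-driven embodied agents make suboptimal decisions and take ineffective actions in complex tasks because they overlook environmental feedback that conflicts with their internal beliefs, a phenomenon the authors call belief inertia.
Result: Extensive experiments on three embodied benchmarks show that integrating EVU consistently yields substantial gains in task success rates, validating its effectiveness.
Insight: The novelty lies in the "active belief intervention" paradigm, which moves from passive understanding to active management, and in a unified EVU mechanism that generates explicit textual belief states. EVU can be integrated flexibly into prompting-based or training-based agent reasoning methods, mitigating belief inertia and advancing more robust embodied agents.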
Abstract: Recent advancements in large language models (LLMs) have enabled agents to tackle complex embodied tasks through environmental interaction. However, these agents still make suboptimal decisions and perform ineffective actions, as they often overlook critical environmental feedback that differs from their internal beliefs. Through a formal probing analysis, we characterize this as belief inertia, a phenomenon where agents stubbornly adhere to prior beliefs despite explicit observations. To address this, we advocate active belief intervention, moving from passive understanding to active management. We introduce the Estimate-Verify-Update (EVU) mechanism, which empowers agents to predict expected outcomes, verify them against observations through explicit reasoning, and actively update prior beliefs based on the verification evidence. EVU is designed as a unified intervention mechanism that generates textual belief states explicitly, and can be integrated into both prompting-based and training-based agent reasoning methods. Extensive experiments across three embodied benchmarks demonstrate that EVU consistently yields substantial gains in task success rates. Further analyses validate that our approach effectively mitigates belief inertia, advancing the development of more robust embodied agents. Our code is available at https://github.com/WangHanLinHenry/EVU.
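The Estimate-Verify-Update cycle can be sketched as a small control loop. This is only an illustration under strong simplifying assumptions: beliefs are a dictionary rather than LLM-generated text, and `predict` and `observe` are hypothetical stand-ins for the agent's expectation model and the environment. None of these names come from the paper or its repository.

```python
def evu_step(belief, action, predict, observe):
    """One Estimate-Verify-Update step over a dictionary-valued belief state."""
    expected = predict(belief, action)   # Estimate: what the agent expects to see
    observed = observe(action)           # Act and read the environment feedback
    # Verify: collect every observed fact that contradicts the expectation.
    mismatches = {k: v for k, v in observed.items() if expected.get(k) != v}
    if mismatches:
        belief = {**belief, **mismatches}  # Update only the contradicted entries
    return belief, not mismatches

# Toy environment: the agent believes the drawer holds a key, but it is empty.
belief = {"drawer": "contains key"}
predict = lambda b, a: {"drawer": b["drawer"]}
observe = lambda a: {"drawer": "empty"}
new_belief, consistent = evu_step(belief, "open drawer", predict, observe)
```

An agent suffering from belief inertia would keep acting on `belief`; the explicit verification step forces the contradiction into `new_belief` before the next action is chosen.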
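[34] Are Emotion and Rhetoric Neurons in LLM? Neuron Recognition and Adaptive Masking for Emotion-Rhetoric Prediction Steering cs.CL
Li Zheng, Xin Zhang, Shuyi He, Fei Li, Chong Teng
TL;DR: This paper proposes a neuron identification framework and an adaptive masking method to explore how emotion and rhetoric are represented at the neuron level in large language models (LLMs), how the two are related, and how to steer their expression at a fine granularity.
Details
Motivation: Existing studies mostly rely on external optimization and lack in-depth exploration of LLMs' internal representation mechanisms, so they cannot achieve fine-grained steering at the neuron level. Existing neuron studies are confined to emotion and neglect rhetoric neurons and their intrinsic connections, and traditional neuron masking exhibits counterintuitive phenomena that make reliable verification of neuron functionality difficult.
Result: Experiments on 5 commonly used datasets validate the method. Through neuron regulation, the authors achieve directed induction of non-target sentences and enhancement of emotion tasks via rhetoric neurons.
Insight: The novelty lies in systematically studying the neuron representation mechanisms and intrinsic associations of 6 emotion categories and 4 core rhetorical devices, and in proposing a neuron identification framework with multi-dimensional screening together with an adaptive masking method that combines dynamic filtering, attenuation masking, and feedback optimization. This offers a new paradigm for fine-grained steering of emotion and rhetoric expression in LLMs.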
Abstract: Accurate comprehension and controllable generation of emotion and rhetoric are pivotal for enhancing the reasoning capabilities of large language models (LLMs). Existing studies mostly rely on external optimizations, lacking in-depth exploration of internal representation mechanisms, thus failing to achieve fine-grained steering at the neuron level. A handful of works on neurons are confined to emotions, neglecting rhetoric neurons and their intrinsic connections. Traditional neuron masking also exhibits counterintuitive phenomena, making reliable verification of neuron functionality infeasible. To address these issues, we systematically investigate the neuron representation mechanisms and inherent associations of 6 emotion categories and 4 core rhetorical devices. We propose a neuron identification framework that integrates multi-dimensional screening, and design an adaptive masking method incorporating dynamic filtering, attenuation masking, and feedback optimization, enabling reliable causal validation of neuron functionality. Through neuron regulation, we achieve directed induction of non-target sentences and enhancement of emotion tasks via rhetoric neurons. Experiments on 5 commonly used datasets validate the effectiveness of our method, providing a novel paradigm for the fine-grained steering of emotion and rhetoric expressions in LLMs.
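[35] REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning cs.CL | cs.AI
Seungmin Lee, Jeonghwan Lee, Hyunkuk Lim, Sejoon Kim, Mingi Sung
TL;DR: This paper proposes REZE, a representation regularization framework that controls representation shift during domain-adaptive text embedding pre-finetuning. The method decomposes the relations of anchor-positive pairs in an eigenspace, measures task-wise dispersion along each eigencomponent to identify task-variant directions, and applies adaptive soft-shrinkage to suppress task-induced noise while preserving task-invariant semantic structure, with no inference-time overhead.
Details
Motivation: When existing text embedding models are adapted to specialized domains via contrastive pre-finetuning, task-induced bias often causes uncontrolled representation shifts that distort the pretrained embedding geometry and lead to substantial performance degradation.
Result: Experiments across multiple embedding backbones and specialized benchmarks show that REZE outperforms standard pre-finetuning and isotropy-oriented post-hoc regularization in most settings and remains stable where existing pre-finetuning variants collapse.
Insight: The novelty lies in treating representation shift control as a key principle for robust embedding pre-finetuning under heterogeneous supervision. Eigenspace decomposition and adaptive soft-shrinkage yield controlled shifts that stay aligned with the original embedding manifold and add no inference-time overhead.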
Abstract: Recent text embedding models are often adapted to specialized domains via contrastive pre-finetuning (PFT) on a naive collection of scattered, heterogeneous tasks. However, this approach often introduces task-induced bias alongside domain knowledge, leading to uncontrolled representation shifts that distort the pretrained embedding geometry and cause substantial performance degradation. To address this issue, we propose REZE, a representation regularization framework that explicitly controls representation shift during embedding pre-finetuning. REZE operates on the relations of anchor-positive pairs and decomposes them in an eigenspace. It then measures task-wise dispersion along each eigencomponent to identify task-variant directions and applies adaptive soft-shrinkage to suppress task-induced noise while preserving task-invariant semantic structure, without inference-time overhead. Experiments across multiple embedding backbones and specialized benchmarks show that REZE outperforms standard pre-finetuning and isotropy-oriented post-hoc regularization in most settings, remaining stable where existing PFT variants collapse. Embedding space analyses further confirm that REZE induces controlled shifts aligned with the original embedding manifold, underscoring representation shift control as a key principle for robust embedding pre-finetuning under heterogeneous supervision.
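The idea of shrinking task-variant directions can be illustrated with a generic soft-thresholding sketch. It assumes the eigendecomposition has already been done, that dispersion is measured as the variance of per-task mean projections, and that the shrinkage threshold scales linearly with that dispersion. These choices, the function names, and the `strength` parameter are assumptions made for illustration; the abstract does not specify REZE's exact formulas.

```python
from statistics import pvariance

def soft_shrink(value, threshold):
    """Standard soft-thresholding: pull a value toward zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def shrink_components(coeffs, task_projections, strength=1.0):
    """Shrink each eigencomponent coefficient in proportion to its task-wise dispersion.

    coeffs[k] is the coefficient along eigencomponent k;
    task_projections[k] holds the per-task mean projections along k.
    """
    shrunk = []
    for coeff, per_task in zip(coeffs, task_projections):
        dispersion = pvariance(per_task)  # high variance across tasks = task-variant
        shrunk.append(soft_shrink(coeff, strength * dispersion))
    return shrunk

coeffs = [2.0, -1.5, 0.3]
task_projections = [
    [1.0, 1.0, 1.0],   # identical across tasks: task-invariant, left untouched
    [0.0, 2.0, -2.0],  # varies strongly across tasks: suppressed to zero
    [0.5, 0.4, 0.6],   # mild variation: shrunk slightly
]
result = shrink_components(coeffs, task_projections)
```

Because the shrinkage is applied during training-time regularization in the paper's description, a scheme of this kind would leave the inference path unchanged.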
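[36] HopRank: Self-Supervised LLM Preference-Tuning on Graphs for Few-Shot Node Classification cs.CL
Ziqing Wang, Kaize Ding
TL;DR: This paper proposes HopRank, a fully self-supervised LLM preference-tuning framework for few-shot node classification on text-attributed graphs (TAGs). It reformulates node classification as link prediction, builds preference data through hierarchical hop-based sampling, and uses adaptive preference learning to prioritize informative training signals without any class labels. At inference, a node is classified by predicting its connection preferences to labeled anchors, with an adaptive early-exit voting scheme for efficiency.
Details
Motivation: GNN-based methods suffer from shallow text encoding and heavy dependence on labeled data, while existing LLM-for-graph methods either still need abundant labels during training or fail to exploit the rich structural signals in graph topology. The authors observe that in many real-world TAGs, edges mostly connect similar nodes under homophily, so the topology itself encodes class structure without labels.
Result: Experiments on three TAG benchmarks show that HopRank, trained with zero labeled data, matches fully supervised GNNs and substantially outperforms prior graph-LLM methods.
Insight: The novelty lies in recasting node classification as a self-supervised link prediction task that uses graph homophily as a label-free signal of class structure, and in building effective training data through hierarchical hop-based sampling and adaptive preference learning. This enables LLM tuning without labeled data and suits few-shot settings.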
Abstract: Node classification on text-attributed graphs (TAGs) is a fundamental task with broad applications in citation analysis, social networks, and recommendation systems. Current GNN-based approaches suffer from shallow text encoding and heavy dependence on labeled data, limiting their effectiveness in label-scarce settings. While large language models (LLMs) naturally address the text understanding gap with deep semantic reasoning, existing LLM-for-graph methods either still require abundant labels during training or fail to exploit the rich structural signals freely available in graph topology. Our key observation is that, in many real-world TAGs, edges predominantly connect similar nodes under the homophily principle, meaning graph topology inherently encodes class structure without any labels. Building on this insight, we reformulate node classification as a link prediction task and present HopRank, a fully self-supervised LLM-tuning framework for TAGs. HopRank constructs preference data via hierarchical hop-based sampling and employs adaptive preference learning to prioritize informative training signals without any class labels. At inference, nodes are classified by predicting their connection preferences to labeled anchors, with an adaptive early-exit voting scheme to improve efficiency. Experiments on three TAG benchmarks show that HopRank matches fully-supervised GNNs and substantially outperforms prior graph-LLM methods, despite using zero labeled training data.
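Hop-based preference construction can be sketched with a breadth-first search. The sketch assumes that, for a given anchor node, any node at a smaller hop distance is preferred over any node at a larger one. The function names, the `max_hop` cutoff, and the `(anchor, chosen, rejected)` triple format are hypothetical; the paper's hierarchical sampling and adaptive weighting are not reproduced here.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search returning the hop distance from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def preference_pairs(adj, anchor, max_hop=3):
    """Build (anchor, chosen, rejected) triples where `chosen` is closer to the anchor."""
    by_hop = {}
    for node, d in hop_distances(adj, anchor).items():
        if 1 <= d <= max_hop:
            by_hop.setdefault(d, []).append(node)
    hops = sorted(by_hop)
    pairs = []
    for i, near in enumerate(hops):
        for far in hops[i + 1:]:
            for chosen in by_hop[near]:
                for rejected in by_hop[far]:
                    pairs.append((anchor, chosen, rejected))
    return pairs

# A four-node path graph: a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
pairs = preference_pairs(adj, "a")
```

No class label appears anywhere in the construction; under homophily, the hop ordering alone serves as the supervision signal.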
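[37] MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning cs.CL
Lingyan Wu, Xiang Zheng, Weiqi Zhai, Wei Wang, Xuan Ren
TL;DR: This paper presents MedPRMBench, the first process-level reward model (PRM) benchmark for the medical domain, designed to evaluate how well large language models detect errors in complex medical reasoning. Built on Clinical Reasoning Blueprints (CRBs), it covers 6,500 questions, 13,000 reasoning chains, and 113,910 step-level labels from seven medical QA sources, and introduces the first 4-level severity grading system to quantify clinical impact.
Details
Motivation: Existing PRM benchmarks mainly cover general domains such as mathematics and offer no evaluation framework for medical reasoning, which is safety-critical, knowledge-intensive, and diverse in its error patterns. As a result, models' error detection ability in clinical reasoning cannot be quantified and their safety in real healthcare applications remains unverified.
Result: The proposed medical PRM baseline achieves an overall PRMScore of 87.1% on MedPRMBench, substantially surpassing all baselines. Used as a plug-and-play verifier, it improves downstream medical QA accuracy by 3.2-6.7 percentage points.
Insight: The novelty lies in building the first fine-grained medical PRM benchmark, systematically defining 14 error types across three categories (Simplicity, Soundness, and Sensitivity), and introducing the first 4-level severity grading system to quantify clinical impact. This provides clear evaluation directions and high-quality data for improving PRMs in safety-critical domains.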
Abstract: Process-Level Reward Models (PRMs) are essential for guiding complex reasoning in large language models, yet existing PRM benchmarks cover only general domains such as mathematics, failing to address medical reasoning – which is uniquely characterized by safety criticality, knowledge intensity, and diverse error patterns. Without a reliable medical PRM evaluation framework, we cannot quantify models’ error detection capabilities in clinical reasoning, leaving their safety in real-world healthcare applications unverified. We propose MedPRMBench, the first process-level reward model benchmark for the medical domain. Built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs), MedPRMBench systematically generates high-quality evaluation data from seven medical QA sources, covering 14 fine-grained error types across three categories (Simplicity, Soundness, and Sensitivity) with the first 4-level severity grading system to quantify clinical impact. The benchmark comprises 6,500 questions with 13,000 reasoning chains and 113,910 step-level labels, plus 6,879 questions for training. Our medical PRM baseline achieves an 87.1% overall PRMScore – substantially surpassing all baselines – and serves as a plug-and-play verifier that improves downstream medical QA accuracy by 3.2–6.7 percentage points. Systematic evaluation spanning proprietary frontier models, open-source reasoning models, and medical-specialized models reveals critical weaknesses in current models’ medical reasoning error detection capabilities, providing clear directions for future PRM improvement.
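[38] HorizonBench: Long-Horizon Personalization with Evolving Preferences cs.CL | cs.AI
Shuyue Stella Li, Bhargavi Paranjape, Kerem Oktar, Zhongyao Ma, Gelin Zhou
TL;DR: This paper introduces HorizonBench, a benchmark for evaluating long-horizon personalization that tracks how user preferences evolve over simulated 6-month conversation histories. It contains 4,245 items from 360 simulated users, with histories averaging about 4,300 turns and 163K tokens, and is intended to test long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling.
Details
Motivation: User preferences evolve over time, yet no existing resource provides both naturalistic long-horizon interactions and ground-truth provenance for preference changes, which limits progress on long-horizon personalization.
Result: Across 25 frontier models, the best model reaches only 52.8% accuracy and most score at or below the 20% chance baseline. When models err on evolved preferences, over a third of the time they select the user's originally stated value, failing to track the updated user state.
Insight: The novelty lies in generating conversations from a structured mental state graph, which yields ground-truth provenance for every preference change and makes the benchmark diagnostic of why models fail. The study identifies state tracking as the primary bottleneck for long-horizon personalization, and the problem persists across context lengths and levels of expression explicitness.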
Abstract: User preferences evolve across months of interaction, and tracking them requires inferring when a stated preference has been changed by a subsequent life event. We define this problem as long-horizon personalization and observe that progress on it is limited by data availability and measurement, with no existing resource providing both naturalistic long-horizon interactions and the ground-truth provenance needed to diagnose why models fail. We introduce a data generator that produces conversations from a structured mental state graph, yielding ground-truth provenance for every preference change across 6-month timelines, and from it construct HorizonBench, a benchmark of 4,245 items from 360 simulated users with 6-month conversation histories averaging ~4,300 turns and ~163K tokens. HorizonBench provides a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling. Across 25 frontier models, the best model reaches 52.8% and most score at or below the 20% chance baseline. When these models err on evolved preferences, over a third of the time they select the user’s originally stated value without tracking the updated user state. This belief-update failure persists across context lengths and expression explicitness levels, identifying state-tracking capability as the primary bottleneck for long-horizon personalization.
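[39] Probabilistic Programs of Thought cs.CL | cs.AI | cs.PL
Poorva Garg, Renato Lui Geh, Daniel Israel, Todd Millstein, Kyle Richardson
TL;DR: This paper proposes a new framework called "probabilistic programs of thought" that reduces the GPU compute large language models need when generating structured output for code generation and mathematical reasoning. Using a model-generated program and its next-token probability distributions, it builds a compact probabilistic program that represents exponentially many deterministic programs, obtaining more samples from fewer LLM generations.
Details
Motivation: In code generation and mathematical reasoning, LLMs usually have to sample many programs, and each sample requires an expensive GPU generation, which becomes prohibitive for large sample counts. The paper addresses this efficiency bottleneck by exposing the LLM's distribution within the generated programs themselves.
Result: On benchmarks for code generation, code understanding, and mathematical reasoning, the method improves performance with fewer LLM generations, yielding more usable samples for the same compute. The summary does not state whether it reaches SOTA or matches any particular model.
Insight: The novelty lies in turning the LLM's generation process into a probabilistic program and using probabilistic reasoning to sample new programs efficiently with no additional GPU compute. Extending deterministic generation to a probabilistic representation offers a way to cut the compute cost of structured-output tasks in resource-intensive applications.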
Abstract: LLMs are widely used for code generation and mathematical reasoning tasks where they are required to generate structured output. They either need to reason about code, generate code for a given specification, or reason using programs of thought. The typical approach to code generation is to prompt the model and generate samples until an appropriate program is obtained. Within this process, sampling $n$ programs from the language model requires $n$ GPU compute-intensive generations which becomes prohibitively expensive for larger values of $n$. In this work, we address this limitation by exposing the LLM’s distribution within the generated programs themselves. We propose a novel test-time framework we dub probabilistic programs of thought to obtain more samples from the model with fewer LLM generations. Given a program generated by a model and the associated next-token probabilities, we build a probabilistic program that compactly represents exponentially many deterministic programs. Since performing probabilistic reasoning in this probabilistic program is much cheaper, our approach allows sampling new programs without any additional GPU compute and little CPU overhead. We instantiate our approach on benchmarks for code generation, code understanding and mathematical reasoning and report improvements in performance with fewer generations from the LLM.
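A toy version of the idea: keep the alternatives the model considered at each token position, then draw new programs on the CPU. The sketch assumes each position is sampled independently from its stored next-token probabilities, which ignores how later tokens depend on earlier choices. The `slots` data structure and function names are hypothetical, not the paper's representation.

```python
import random

def count_programs(slots):
    """Number of distinct deterministic programs the slot structure encodes."""
    total = 1
    for alternatives in slots:
        total *= len(alternatives)
    return total

def sample_program(slots, rng):
    """Draw one program by sampling a token at each position from its stored probabilities."""
    tokens = []
    for alternatives in slots:
        choices, weights = zip(*alternatives)
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(tokens)

# One LLM generation of "x = a + b", with alternatives recorded at two positions.
slots = [
    [("x", 1.0)],
    [("=", 1.0)],
    [("a", 1.0)],
    [("+", 0.7), ("-", 0.3)],
    [("b", 0.6), ("c", 0.4)],
]
rng = random.Random(0)
samples = [sample_program(slots, rng) for _ in range(20)]
```

With two alternatives at each of k uncertain positions, the same structure encodes 2^k programs, which is where the "exponentially many" in the abstract comes from.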
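[40] Beyond “I Don’t Know”: Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty cs.CL
Jingyi Ren, Ante Wang, Yunghwei Lai, Xiaolong Wang, Linlu Gong
TL;DR: This paper introduces UA-Bench, a benchmark for evaluating whether large language models (LLMs) can tell data uncertainty apart from model uncertainty. Even frontier LLMs struggle to discriminate the two reliably, and high answer accuracy does not imply strong uncertainty attribution. To narrow the gap, the paper proposes a lightweight data synthesis and reinforcement learning strategy and validates it on Qwen3 models.
Details
Motivation: Prior work usually treats an LLM's refusal as a generic "I don't know" and does not distinguish input-level ambiguity (data uncertainty) from capability limitations (model uncertainty), which limits downstream decisions such as requesting clarification or invoking external tools.
Result: Eighteen frontier LLMs are evaluated on UA-Bench, which contains over 3,500 questions from six datasets; they struggle to discriminate data uncertainty from model uncertainty reliably. On Qwen3-4B-Instruct-2507 and Qwen3-8B (thinking mode), the proposed method improves uncertainty attribution while preserving answer accuracy.
Insight: The novelty lies in explicitly separating and evaluating how LLMs attribute uncertainty to data versus the model, and in providing a dedicated benchmark (UA-Bench) together with a lightweight synthetic-data and reinforcement learning training strategy to improve this ability.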
Abstract: Reliable Large Language Models (LLMs) should abstain when confidence is insufficient. However, prior studies often treat refusal as a generic “I don’t know”, failing to distinguish input-level ambiguity (data uncertainty) from capability limitations (model uncertainty). This lack of distinction limits downstream action decisions like requesting clarification or invoking external tools. In this work, we introduce UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge-intensive and reasoning-intensive tasks, designed to evaluate explicit uncertainty attribution. An evaluation of 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between data uncertainty and model uncertainty, and that high answer accuracy does not necessarily imply strong uncertainty attribution ability. To narrow this gap, we propose a lightweight data synthesis and reinforcement learning strategy. Experiments on both Qwen3-4B-Instruct-2507 and Qwen3-8B in thinking mode show that the proposed method improves uncertainty attribution while preserving answer accuracy. Our code and data are publicly available now.
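[41] CRISP: Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning cs.CL
Yangsong Lan, Hongliang Dai, Piji Li
TL;DR: CRISP is a framework that compresses redundancy in Chain-of-Thought through intrinsic saliency pruning. It uses the model's internal attention signals to identify and remove non-essential reasoning steps, sharply reducing the number of reasoning tokens while preserving logical coherence.
Details
Motivation: Long Chain-of-Thought reasoning incurs high computational overhead and latency. Existing external compression methods are hard to align with the model's internal reasoning dynamics, so critical logical steps get lost.
Result: Across various backbone models and mathematical datasets, CRISP reduces token count by 50-60% without loss of accuracy, easing the efficiency bottleneck of long-context reasoning.
Insight: The novelty lies in the finding that the attention pattern of the reasoning termination token acts as an information anchor separating essential reasoning from redundancy, and in a fine-grained compression policy built on that finding. Viewed objectively, guiding compression with the model's intrinsic signals avoids the alignment problem of external compressors and improves the precision and efficiency of compression.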
Abstract: Long Chain-of-Thought (CoT) reasoning is pivotal for the success of recent reasoning models but suffers from high computational overhead and latency. While prior works attempt to compress CoT via external compressors, they often fail to align with the model’s internal reasoning dynamics, resulting in the loss of critical logical steps. This paper presents Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning (CRISP), a framework that compresses CoT by exploiting the model’s intrinsic saliency. Our analysis reveals a distinct phenomenon: the reasoning termination token acts as an information anchor, where its attention pattern effectively demarcates essential reasoning from redundancy. Based on this finding, we design a policy that utilizes these intrinsic attention signals to guide atomic compression operations. In contrast to coarse-grained pruning strategies, CRISP strategically distills the reasoning chain to maximize information density while preserving logical coherence. Empirical results across various backbone models and mathematical datasets demonstrate that CRISP achieves a 50-60% reduction in token count without compromising accuracy, effectively mitigating the efficiency bottleneck of long-context reasoning. We open-source our implementation to facilitate further research in efficient reasoning.
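Saliency-guided pruning can be illustrated with a deliberately coarse sketch. It assumes each reasoning step already has a scalar saliency score, for instance the attention mass the termination token places on that step, and simply keeps the highest-scoring steps in their original order. CRISP's actual policy uses finer-grained atomic operations; the function name, `keep_ratio`, and the scores below are hypothetical.

```python
def prune_steps(steps, saliency, keep_ratio=0.5):
    """Keep the most salient reasoning steps, preserving their original order."""
    keep = max(1, round(len(steps) * keep_ratio))
    top = sorted(range(len(steps)), key=lambda i: saliency[i], reverse=True)[:keep]
    return [steps[i] for i in sorted(top)]

steps = [
    "Let x be the unknown.",
    "Hmm, wait, let me re-read the question.",
    "Double-check the units.",
    "So x = 12.",
]
saliency = [0.40, 0.05, 0.10, 0.45]  # assumed attention mass per step
compressed = prune_steps(steps, saliency)
```

Re-sorting the kept indices matters: the compressed chain has to read in the order the reasoning was produced, or logical coherence is lost.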
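[42] RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation cs.CL | cs.AI | cs.HC | cs.IR | cs.LG
Juhyeon Lee, Wonduk Seo, Junseo Koh, Seunghyun Lee, Haihua Chen
TL;DR: This paper proposes RoTRAG, a framework for detecting harmful content in multi-turn dialogue. It uses retrieval-augmented generation (RAG) to bring external, human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment, improving accuracy, consistency, and interpretability.
Details
Motivation: Existing methods rely mainly on the model's internal parametric knowledge and lack explicit grounding in external normative principles, which leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns.
Result: Experiments on the ProsocialDialog and Safety Reasoning Multi Turn Dialogue benchmarks show that RoTRAG outperforms existing baselines on both harm classification and severity estimation, with an average relative F1 gain of about 40% and an average relative reduction in distributional error of 8.4%, while cutting redundant computation without sacrificing performance.
Insight: The core novelty is integrating external, concise human moral norms (RoTs) into LLM reasoning as explicit normative evidence, and adding a lightweight binary routing classifier that decides dynamically whether new evidence needs to be retrieved, which improves efficiency along with performance. This suggests a new direction for rule-based retrieval-augmented reasoning.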
Abstract: Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models’ internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.
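[43] Align Documents to Questions: Question-Oriented Document Rewriting for Retrieval-Augmented Generation cs.CL
Jiaang Li, Zhendong Mao, Quan Wang, Yuning Wan, Yongdong Zhang
TL;DR: This paper proposes QREAM, a document rewriting method that addresses the style mismatch between retrieved documents and generated content in retrieval-augmented generation (RAG). QREAM rewrites retrieved documents into a question-oriented style while preserving facts, so that large language models (LLMs) can use retrieved evidence more effectively. It has two stages: iterative exploration guided by stylistic seeds (QREAM-ICL) and a lightweight student model distilled via rejection sampling (QREAM-FT). It works as a plug-and-play module for existing RAG pipelines.
Details
Motivation: LLMs in RAG exhibit a stylistic bias: they favor fluent but possibly hallucinated generated content over factually grounded but disorganized retrieved evidence, which limits the usefulness of retrieved information.
Result: Experiments show that QREAM consistently improves advanced RAG pipelines, yielding up to 8% relative improvement with negligible latency overhead and balancing question relevance with factual grounding.
Insight: The novelty lies in a style-controlled rewriter that improves how retrieved information is presented by aligning documents to a question-oriented style. Its two-stage framework (ICL-guided exploration followed by distillation with dual-criteria rejection sampling) ensures high-quality supervision, and the method is plug-and-play.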
Abstract: Retrieval-Augmented Generation (RAG) enhances the factuality of Large Language Models (LLMs) by incorporating retrieved documents and/or generated context. However, LLMs often exhibit a stylistic bias when presented with mixed contexts, favoring fluent but hallucinated generated content over factually grounded yet disorganized retrieved evidence. This phenomenon reveals that the utility of retrieved information is bottlenecked by its presentation. To bridge this gap, we propose QREAM, a style-controlled rewriter that aligns retrieved documents with a question-oriented style while preserving facts, better for LLM readers to utilize. Our framework consists of two stages: (1) QREAM-ICL, which uses stylistic seeds to guide iterative rewriting exploration; and (2) QREAM-FT, a lightweight student model distilled from denoised ICL outputs. QREAM-FT employs dual-criteria rejection sampling, filtering based on answer correctness and factual consistency to ensure high-quality supervision. QREAM seamlessly integrates into existing RAG pipelines as a plug-and-play module. Experiments demonstrate that QREAM consistently enhances advanced RAG pipelines, yielding up to 8% relative improvement with negligible latency overhead, effectively balancing question relevance with factual grounding.
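[44] More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage cs.CL | cs.CV
Wei He
TL;DR: This paper introduces the DIVA benchmark and the Semantic Alignment Gap (Δ) metric to measure the gap between literal and idiomatic interpretation in vision-language models. Models show a consistent Literal Superiority Bias, and higher visual fidelity is associated with weaker symbolic alignment.
Details
Motivation: The study asks whether, when vision-language models generate high-fidelity images, the abundance of visual detail interferes with compositional understanding of abstract meaning, such as idiomatic interpretations of noun compounds.
Result: Evaluating 8 recent vision-language models on DIVA reveals a consistent Literal Superiority Bias. Scaling up the model does not remove the bias, and higher visual fidelity comes with weaker symbolic alignment, suggesting cognitive interference from hyper-realistic imagery.
Insight: The novelty includes an icon-based abstraction of visual input (the DIVA benchmark) and an architecture-agnostic Semantic Alignment Gap metric (Δ). Viewed objectively, the analysis suggests that improving compositional understanding requires iconographic abstraction of visual input and anchoring interpretation and generation in the intended meaning.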
Abstract: Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we introduce DIVA, a controlled benchmark that replaces high-fidelity visual detail with schematic iconicity by generating paired, sense-anchored visualizations for literal and idiomatic readings. We further propose Semantic Alignment Gap ($Δ$), an architecture-agnostic metric that quantifies divergence between literal and idiomatic visual grounding. We additionally introduce a directional signed bias $b(t)$ to separately measure the direction and strength of literal preference. Evaluating 8 recent VLMs, we reveal a consistent Literal Superiority Bias: model scale alone does not resolve literal preference, and increased visual fidelity is associated with weaker symbolic alignment, suggesting cognitive interference from hyper-realistic imagery. Our findings suggest that improving compositional understanding requires iconographic abstraction of visual input and anchoring interpretation and generation in intended meaning.
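[45] ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL | cs.AI
Yamen Ajjour, Carlotta Quensel, Nedim Lipka, Henning Wachsmuth
TL;DR: This paper presents ArgBench, the first benchmark for standardized evaluation of large language models on computational argumentation. It unifies 33 existing datasets covering 46 tasks, including argument mining, perspective assessment, quality assessment, argument reasoning, and argument generation. The authors use it to evaluate the generalizability of five LLM families and systematically analyze how few-shot examples, reasoning steps, model size, and training skills affect performance.
Details
Motivation: Argumentation skills are a key capability for large language models and matter for applications such as self-reflection, collaborative debate, and countering hate speech, yet no standardized benchmark exists. A unified benchmark is needed to evaluate LLMs systematically on computational argumentation tasks.
Result: Five LLM families are evaluated on ArgBench, and the contributions of few-shot examples, reasoning steps, model size, and training skills to task performance are analyzed. The abstract reports no specific quantitative results or comparison with SOTA.
Insight: The novelty lies in creating the first unified benchmark for computational argumentation, consolidating multi-task datasets, and conducting a comprehensive factor analysis. Viewed objectively, it provides an important tool for standardized assessment of LLM argumentation ability and for studying what influences it.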
Abstract: Argumentation skills are an essential toolkit for large language models (LLMs). These skills are crucial in various use cases, including self-reflection, debating collaboratively for diverse answers, and countering hate speech. In this paper, we create the first benchmark for a standardized evaluation of LLM-based approaches to computational argumentation, encompassing 33 datasets from previous work in unified form. Using the benchmark, we evaluate the generalizability of five LLM families across 46 computational argumentation tasks that cover mining arguments, assessing perspectives, assessing argument quality, reasoning about arguments, and generating arguments. On the benchmark, we conduct an extensive systematic analysis of the contribution of few-shot examples, reasoning steps, model size, and training skills to the performance of LLMs on the computational argumentation tasks in the benchmark.
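[46] Jupiter-N Technical Report cs.CL | cs.AI
George Drayson
TL;DR: Jupiter-N is a hybrid reasoning model post-trained from the open-source LLM Nemotron 3 Super (120 billion parameters). Through a carefully designed data strategy, it targets stronger agentic capability, UK cultural alignment, and Welsh language support, while guarding against catastrophic forgetting and retaining the base model's capabilities.
Details
Motivation: The goal is to build a strong hybrid reasoning model with cultural alignment and multilingual support, addressing the weak performance of large models in specific cultural contexts and languages, and to explore a reproducible template for sovereign post-training.
Result: Jupiter-N shows clear gains over the base Nemotron model on several benchmarks: Welsh (ARC-Easy +18, MMLU-Lite +5.25), terminal use (Terminal Bench 2 +9.1), and instruction following (IFBench +4.4), while retaining the base model's core capabilities.
Insight: The novelty lies in the data curation strategy, especially the proposed Forget-Me-Not framework, which mixes on-policy synthetic replay with off-policy task data to mitigate catastrophic forgetting and mixes reasoning and non-reasoning traces to preserve hybrid reasoning. Treating cultural knowledge, institutional corpora, and target languages as substitutable modules also gives a reproducible template for building customized models for any country.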
Abstract: We present Jupiter-N, a hybrid reasoning model post-trained from Nemotron 3 Super, a fully open-source 120 billion parameter LLM. We target three objectives: (1) agentic capability via uncertainty-curated trajectories; (2) UK cultural alignment via synthetic data grounded in cultural norms; and (3) Welsh language support via parallel corpora and LLM-translated Welsh conversations. Our data curation strategy carefully preserves the base model’s capabilities: using our Forget-Me-Not framework, we mix on-policy synthetic replay with off-policy task data to mitigate catastrophic forgetting, and include a mixture of reasoning and non-reasoning traces to maintain Nemotron’s hybrid reasoning ability. Jupiter-N achieves standout gains over Nemotron in Welsh (+18 on ARC-Easy, +5.25 on MMLU-Lite), terminal-use (+9.1 on Terminal Bench 2) and instruction following (+4.4 on IFBench), while retaining the base model capabilities. We frame this work as a reproducible template for sovereign post-training: substituting cultural knowledge, institutional corpora, and target languages produces an equivalent pipeline for any country. All model weights and all post-training datasets are publicly released under open licences.
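[47] Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning cs.CL | cs.AI | cs.LG
Raman Saparkhan, Majd Hawasly, Md Rizwan Parvez, Mohammad Raza
TL;DR: This paper proposes CoT-PoT Ensembling, a hybrid ensembling method that combines Chain-of-Thought (CoT) and Program-of-Thought (PoT) reasoning within the self-consistency (SC) framework. It improves LLM reasoning accuracy while sharply reducing the number of samples required; most tasks can be solved with only two samples.
Details
Motivation: Self-consistency improves LLM reasoning accuracy by aggregating multiple sampled outputs, but its extensive sampling makes it computationally expensive. The paper aims to solve this efficiency problem.
Result: CoT-PoT ensembling maintains high accuracy while reducing the number of samples SC needs by a factor of 9.3; 78.6% of tasks can be handled with only two samples, which was not possible with prior SC methods.
Insight: The novelty lies in exploiting the complementary strengths of CoT and PoT reasoning through hybrid ensembling, with strategies for both full sampling and early stopping, which lowers compute cost while improving performance and points to a new route to efficient LLM reasoning.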
Abstract: Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We introduce a hybrid ensembling approach that leverages the complementary strengths of two distinct modes of reasoning: Chain-of-Thought (CoT) and Program-of-Thought (PoT). We describe a general framework for combining these two forms of reasoning in self-consistency, as well as particular strategies for both full sampling and early-stopping. We show that CoT-PoT ensembling not only improves overall accuracy, but also drastically reduces the number of samples required for SC by a factor of 9.3x. In particular, the majority of tasks (78.6%) can be addressed with only two samples, which has not been possible with any prior SC methods.
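An early-stopping variant of this scheme can be sketched in a few lines. The sketch assumes the simplest plausible rule: draw one CoT answer and one PoT answer, stop if they agree, and otherwise fall back to alternating samples with a majority vote. The abstract does not spell out the paper's actual stopping criteria, so this rule, the function name, and `max_samples` are assumptions.

```python
from collections import Counter

def cot_pot_answer(sample_cot, sample_pot, max_samples=10):
    """Return (answer, samples_used), stopping after two samples when CoT and PoT agree."""
    cot, pot = sample_cot(), sample_pot()
    votes = Counter([cot, pot])
    used = 2
    if cot == pot:
        return cot, used  # cross-mode agreement: stop early
    while used < max_samples:
        sampler = sample_cot if used % 2 == 0 else sample_pot
        votes[sampler()] += 1
        used += 1
    return votes.most_common(1)[0][0], used

# Agreement case: both reasoning modes return the same answer.
agree = cot_pot_answer(lambda: "42", lambda: "42")

# Disagreement case: the first CoT sample is wrong, later samples recover.
cot_stream = iter(["41", "42", "42", "42", "42"])
disagree = cot_pot_answer(lambda: next(cot_stream), lambda: "42", max_samples=6)
```

The intuition is that agreement between two different reasoning modes is stronger evidence than agreement between two samples of the same mode, since a prose derivation and an executed program tend to fail in different ways.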
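[48] CoAct: Co-Active LLM Preference Learning with Human-AI Synergy cs.CL
Ruiyao Xu, Mihir Parmar, Tiankai Yang, Zhengyu Hu, Yue Zhao
TL;DR: This paper proposes CoAct, a framework that combines self-rewarding and active learning through human-AI collaboration to address the scarcity and cost of high-quality human-annotated preference data, enabling more effective alignment of large language models.
Details
Motivation: Self-rewarding methods rely on purely AI-generated labels that may be unreliable, while active learning relies on human annotation and cannot fully exploit unlabeled data. CoAct aims to overcome both limitations through human-AI collaboration.
Result: On three reasoning benchmarks (GSM8K, MATH, WebInstruct) across two model families, CoAct achieves average improvements of 13.25%, 8.19%, and 13.16% respectively, outperforming all baselines.
Insight: The novelty lies in using self-consistency to identify both reliable self-labeled data and samples that need human verification, while human feedback guides the model to generate new instructions within its solvable capability, achieving human-AI collaborative preference learning.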
Abstract: Learning from preference-based feedback has become an effective approach for aligning LLMs across diverse tasks. However, high-quality human-annotated preference data remains expensive and scarce. Existing methods address this challenge through either self-rewarding, which scales by using purely AI-generated labels but risks unreliability, or active learning, which ensures quality through oracle annotation but cannot fully leverage unlabeled data. In this paper, we present CoAct, a novel framework that synergistically combines self-rewarding and active learning through strategic human-AI collaboration. CoAct leverages self-consistency to identify both reliable self-labeled data and samples that require oracle verification. Additionally, oracle feedback guides the model to generate new instructions within its solvable capability. Evaluated on three reasoning benchmarks across two model families, CoAct achieves average improvements of +13.25% on GSM8K, +8.19% on MATH, and +13.16% on WebInstruct, consistently outperforming all baselines.
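The routing role of self-consistency can be illustrated as follows. The sketch assumes that a prompt counts as reliably self-labeled when the majority answer's share of the sampled answers reaches a fixed threshold, and that everything else goes to the oracle. The 0.8 threshold and all names are hypothetical; the paper may use a different criterion.

```python
from collections import Counter

def route_by_self_consistency(samples_per_prompt, threshold=0.8):
    """Split prompts into self-labeled ones and ones that need oracle verification."""
    self_labeled, needs_oracle = {}, []
    for prompt, answers in samples_per_prompt.items():
        answer, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= threshold:
            self_labeled[prompt] = answer   # consistent enough to trust the model's label
        else:
            needs_oracle.append(prompt)     # too inconsistent: ask the human or oracle
    return self_labeled, needs_oracle

samples = {
    "q1": ["4", "4", "4", "4", "5"],  # 80% agreement
    "q2": ["7", "9", "8", "7", "3"],  # 40% agreement
}
self_labeled, needs_oracle = route_by_self_consistency(samples)
```

Spending annotation budget only on the low-agreement prompts is what lets a scheme like this combine the scale of self-rewarding with the label quality of active learning.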
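[49] OPSDL: On-Policy Self-Distillation for Long-Context Language Models cs.CL | cs.AI
Xinsen Zhang, Zhenkai Ding, Tianjun Pan, Run Yang, Chun Kang
TL;DR: This paper proposes OPSDL, an on-policy self-distillation method for improving the long-context capability of large language models. It uses the model's own strong short-context ability as a self-teacher to supervise its generation in long-context scenarios, providing dense supervision through per-token reverse KL divergence to reduce hallucination and improve faithful use of evidence.
Details
Motivation: Extending the effective context length of LLMs is a central challenge for real-world applications. Existing post-training methods depend on high-quality supervision data or sparse sequence-level rewards, which makes optimization unstable and inefficient.
Result: On long-context benchmarks with models from 7B to 32B parameters, OPSDL delivers consistent and substantial improvements across context lengths, outperforming standard post-training methods such as SFT and DPO with higher sample efficiency and without degrading general short-context performance.
Insight: The novelty lies in using the model's inherent short-context ability as a self-supervision signal and providing dense supervision via per-token reverse KL divergence, which improves long-context performance stably and efficiently without relying on an external teacher model or high-quality data.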
Abstract: Extending the effective context length of large language models (LLMs) remains a central challenge for real-world applications. While recent post-training methods have made progress in long-context scaling, they either rely on high-quality supervision data or sparse sequence-level rewards, leading to unstable and inefficient optimization. We propose OPSDL, an On-Policy Self-Distillation method for enhancing the Long-context capabilities of LLMs. Unlike other recent self-distillation methods that inject privileged information and rely on the model’s in-context learning ability to act as a teacher, OPSDL leverages the model’s own inherently strong short-context capability as a self-teacher to supervise its own generation in long-context scenarios. The model first generates responses conditioned on the full long-context, then the self-teacher provides per-token supervision signals via point-wise reverse KL divergence under the relevant extracted short-context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across a range of models from 7B to 32B parameters. Results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These findings highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.
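The dense token-level signal can be sketched as a reverse KL term averaged over positions. The sketch assumes full next-token distributions are available at each position, with the student conditioned on the long context and the self-teacher on the extracted short context. It shows the full-vocabulary form KL(student || teacher); the paper's "point-wise" variant may be computed differently, and all names and numbers here are hypothetical.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )

def token_level_loss(student_seq, teacher_seq):
    """Average the per-token reverse KL over all positions of a generated response."""
    losses = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(losses) / len(losses)

# Two token positions over a two-word vocabulary.
student_seq = [[0.9, 0.1], [0.5, 0.5]]  # model conditioned on the full long context
teacher_seq = [[0.5, 0.5], [0.5, 0.5]]  # same model conditioned on the short context
loss = token_level_loss(student_seq, teacher_seq)
```

Because every position contributes a gradient signal, a loss of this shape is much denser than a single sequence-level reward, which is the contrast the abstract draws with prior post-training methods.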
[50] PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs cs.CLPDF
Yuting Huang, Yinghao Hu, Qian Xiao, Wenlin Zhong, Yiquan Wu
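TL;DR: This paper presents PoliLegalLM, a large language model designed for the political and legal domain. A unified training framework (continued pretraining, progressive supervised fine-tuning, and preference-based reinforcement learning) strengthens legal knowledge grounding, task alignment, and reasoning, and the model performs strongly on several legal benchmarks.
Details
Motivation: General-purpose LLMs applied to the legal domain suffer from hallucinated legal citations, incomplete knowledge coverage, and weak structured reasoning.
Result: Evaluated on LawBench, LexEval, and the real-world dataset PoliLegal, PoliLegalLM performs strongly and consistently, outperforming competitive models of similar scale, remaining highly competitive with much larger models, and achieving the best results on real-world legal scenarios.
Insight: The contribution is a unified training framework and structured post-training pipeline for the legal domain, together with a large-scale, high-quality legal corpus, demonstrating the practical value of domain-specific LLMs in real-world legal applications.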
Abstract: Large language models (LLMs) have achieved remarkable success in general-domain tasks, yet their direct application to the legal domain remains challenging due to hallucinated legal citations, incomplete knowledge coverage, and weak structured reasoning. To address these issues, we propose PoliLegalLM, a domain-specific large language model tailored for political and legal applications. Our approach adopts a unified training framework that integrates continued pretraining, progressive supervised fine-tuning, and preference-based reinforcement learning to jointly enhance legal knowledge grounding, task alignment, and reasoning capability. We construct a large-scale, high-quality legal corpus and design a structured post-training pipeline, enabling the model to effectively learn domain-specific knowledge and adapt to diverse legal tasks. We evaluate PoliLegalLM on three representative benchmarks, including LawBench, LexEval, and a real-world dataset, PoliLegal. Experimental results demonstrate that PoliLegalLM achieves strong and consistent performance, outperforming competitive models of similar scale and remaining highly competitive with significantly larger models, while achieving the best results on real-world legal scenarios. These results highlight the effectiveness of our training paradigm and the practical value of domain-specific LLMs for real-world legal applications.
[51] Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation cs.CLPDF
Elaf Alhazmi, Quan Z. Sheng, Wei Emma Zhang
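TL;DR: This paper proposes a distractor generation (DG) framework based on large language models (LLMs) and in-context learning. It selects few-shot examples via unsupervised semantic retrieval and jointly generates distractors and their rationales, substantially improving distractor quality and reasoning consistency.
Details
Motivation: Distractor generation is labor-intensive and depends heavily on domain experts, and existing methods based on fine-tuning and contrastive learning struggle to capture the implicit reasoning experts rely on when selecting distractors.
Result: On six benchmarks with varying average distractor lengths and domains, few-shot prompting substantially outperforms recent DG models and achieves state-of-the-art (SOTA) results, with generated distractors closely aligned with human-labeled benchmarks.
Insight: The novelty lies in bringing LLM in-context learning and reasoning (such as chain-of-thought) to distractor generation. Rationale-augmented generation of distractors and their rationales mimics expert decision-making, and unsupervised semantic retrieval improves few-shot example selection, raising the semantic relevance and misleading power of the generated distractors.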
Abstract: Distractor generation (DG) remains a labor-intensive task that still significantly depends on domain experts. The task focuses on generating plausible yet incorrect options, known as distractors, for multiple-choice questions. A reliable distractor must be contextually relevant to the question and able to mislead examinees through implicit reasoning when identifying the correct answer. While a recent method integrates fine-tuning pre-trained encoder-decoder models with contrastive learning to generate semantically relevant distractors for a given question-answer, it often fails to capture the underlying reasoning process that experts utilize when selecting distractors in benchmarks. In this paper, we explore large language models (LLMs) reasoning for DG through in-context learning with unsupervised semantic retrieval for selecting few-shot examples. We design a rationale-augmented DG framework that jointly generates distractors and their rationales for a given question-answer. Extensive experiments on six benchmarks, with varying average distractor lengths and domains, demonstrate that prompting LLMs with few-shot examples substantially improves the performance compared to recent DG models. It outperforms recent approaches and achieves state-of-the-art results in generating reasoned distractors that align with human-labeled benchmarks.
[52] Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity cs.CL | cs.LGPDF
Leon Engländer, Sophia Althammer, Ahmet Üstün, Matthias Gallé, Tom Sherborne
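TL;DR: This paper argues that current LLM-based agents lack environmental curiosity: they fail to recognize and exploit unexpected but relevant information in their environment. By injecting task solutions into three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), the study finds that agents discover these solutions but rarely exploit them.
Details
Motivation: The paper examines why LLM agents fail to integrate and exploit unexpected information they discover in the environment, challenging the assumption that agents naturally exploit their own discoveries.
Result: On Terminal-Bench, agents discover the solution in 79-81% of runs but exploit it in only 37-50%; on AppWorld, discovery exceeds 90% while exploitation stays below 7%. Optimized configurations increase curiosity, yet agents still ignore their discoveries in the majority of trials.
Insight: The paper introduces the concept of environmental curiosity and identifies three factors that influence it: tool availability, test-time compute, and training data distribution. Current agents mainly use the environment to fetch expected information rather than to revise their strategy, which exposes a limitation in their reasoning.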
Abstract: LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments to deliberately expose a task’s solution to a model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact with, or exploit, them in only 37-50% of cases. This gap is starkest in AppWorld: agents see documentation stating that a command “returns the complete solution to this task” in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: available tools in the agent scaffold, test-time compute, and training data distribution. Our findings show that configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.
[53] ThreadSumm: Summarization of Nested Discourse Threads Using Tree of Thoughts cs.CLPDF
Olubusayo Olabisi, Ekata Mitra, Ameeta Agrawal
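TL;DR: This paper proposes ThreadSumm, a framework for summarizing deeply nested discussion threads. It treats thread summarization as a hierarchical reasoning problem: it plans content by extracting discourse aspects and Atomic Content Units, applies sentence ordering to build thread-aware sequences, and then uses a Tree of Thoughts search to generate and score multiple paragraph candidates, jointly optimizing coherence and coverage.
Details
Motivation: Standard LLM summarizers cannot reliably handle the interleaved replies, quotes, and overlapping topics of deeply nested discussion threads, so a method that captures multiple viewpoints and logical structure is needed.
Result: ThreadSumm outperforms existing baselines at generating logically structured summaries and achieves higher aspect retention and opinion coverage in nested discussions.
Insight: The novelty lies in modeling thread summarization as hierarchical reasoning that combines content planning, sentence ordering, and Tree of Thoughts search, generating multiple candidate summaries on top of interpretable units to improve coherence and coverage.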
Abstract: Summarizing deeply nested discussion threads requires handling interleaved replies, quotes, and overlapping topics, which standard LLM summarizers struggle to capture reliably. We introduce ThreadSumm, a multi-stage LLM framework that treats thread summarization as a hierarchical reasoning problem over explicit aspect and content unit representations. Our method first performs content planning via LLM-based extraction of discourse aspects and Atomic Content Units, then applies sentence ordering to construct thread-aware sequences that surface multiple viewpoints rather than a single linear strand. On top of these interpretable units, ThreadSumm employs a Tree of Thoughts search that generates and scores multiple paragraph candidates, jointly optimizing coherence and coverage within a unified search space. With this multi-proposal and iterative refinement design, we show improved performance in generating logically structured summaries compared to existing baselines, while achieving higher aspect retention and opinion coverage in nested discussions.
[54] Peerispect: Claim Verification in Scientific Peer Reviews cs.CL | cs.IRPDF
Ali Ghorbanpour, Soroush Sadeghian, Alireza Daghighfarsoodeh, Sajad Ebrahimi, Negar Arabzadeh
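TL;DR: Peerispect is an interactive system for verifying claims in scientific peer reviews. It extracts check-worthy claims from reviews, retrieves relevant evidence from the manuscript, verifies the claims with natural language inference, and presents the results in a visual interface to support rapid inspection by reviewers, authors, and program committees.
Details
Motivation: Peer review is central to scientific publishing, yet reviewers often include claims that are subjective, rhetorical, or misaligned with the submitted work. Manually assessing whether such claims are factual and verifiable is infeasible at the scale of large conferences and journals, so automated tools are needed to ensure fairness and accountability.
Result: The paper demonstrates Peerispect through a publicly available demo and API services; the abstract reports no quantitative experimental results or benchmark performance.
Insight: The novelty lies in operationalizing claim-level verification as a modular information retrieval pipeline that supports interchangeable retrievers, rerankers, and verifiers, and in highlighting evidence directly in the paper through a visual interface, which improves interpretability and practicality.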
Abstract: Peer review is central to scientific publishing, yet reviewers frequently include claims that are subjective, rhetorical, or misaligned with the submitted work. Assessing whether review statements are factual and verifiable is crucial for fairness and accountability. At the scale of modern conferences and journals, manually inspecting the grounding of such claims is infeasible. We present Peerispect, an interactive system that operationalizes claim-level verification in peer reviews by extracting check-worthy claims from peer reviews, retrieving relevant evidence from the manuscript, and verifying the claims through natural language inference. Results are presented through a visual interface that highlights evidence directly in the paper, enabling rapid inspection and interpretation. Peerispect is designed as a modular Information Retrieval (IR) pipeline, supporting alternative retrievers, rerankers, and verifiers, and is intended for use by reviewers, authors, and program committees. We demonstrate Peerispect through a live, publicly available demo (https://app.reviewer.ly/app/peerispect) and API services (https://github.com/Reviewerly-Inc/Peerispect), accompanied by a video tutorial (https://www.youtube.com/watch?v=pc9RkvkUh14).
[55] RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models cs.CL | cs.AIPDF
Arya Hadizadeh Moghaddam, Drew Ross, Mohsen Nayebi Kerdabadi, Dongjie Wang, Zijun Yao
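TL;DR: This paper proposes RePrompT, a framework that integrates structured electronic health record (EHR) encoders with large language models (LLMs) through prompt tuning. It addresses two problems LLMs face on time-stamped EHR data: blurred temporal structure and unused population-level information. The method outperforms baselines on multiple clinical prediction tasks on MIMIC-III and MIMIC-IV.
Details
Motivation: LLMs face two key challenges on structured EHRs (such as diagnosis and medication codes). First, converting time-stamped EHR sequences into plain text obscures temporal structure and code identities, weakening the ability to capture code co-occurrence and longitudinal regularities. Second, LLMs are usually applied in a case-isolated inference setting and cannot exploit population-level patterns, whereas conventional cohort-trained predictive models learn a shared, task-aligned representation space across patients.
Result: Experiments on MIMIC-III and MIMIC-IV show that RePrompT consistently outperforms both EHR-based and LLM-based baselines across multiple clinical prediction tasks, reaching state-of-the-art performance.
Insight: The contributions are: integrating structured EHR encoders through prompt tuning without modifying the underlying architectures; recurrently incorporating latent states from prior visits to preserve longitudinal information; and injecting population-level information through trainable prompt tokens derived from a cohort-trained, task-aligned EHR encoder. Together these offer a scalable and efficient way to apply LLMs to structured temporal data.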
Abstract: Large Language Models (LLMs) have shown strong promise for mining Electronic Health Records (EHRs) by reasoning over longitudinal clinical information to capture context-rich patient trajectories. However, leveraging LLMs for structured EHRs (e.g., standardized diagnosis and medication codes) presents two key challenges. First, translating time-stamped EHR sequences into plain text can obscure both temporal structure and code identities, weakening the ability to capture code co-occurrence and longitudinal regularities. Second, unlike cohort-trained predictive models that learn a shared, task-aligned representation space across patients, LLMs are often applied in a case-isolated inference setting where each patient is processed independently without leveraging population-level patterns. To address these challenges, we introduce RePrompT, a time-aware LLM framework that integrates structured EHR encoders through prompt tuning, without modifying underlying architectures. Specifically, RePrompT recurrently incorporates latent states from prior visits to preserve longitudinal information, and injects population-level information through trainable prompt tokens derived from a cohort-trained, task-aligned EHR encoder. Experiments on MIMIC-III and MIMIC-IV demonstrate that RePrompT consistently outperforms both EHR-based and LLM-based baselines across multiple clinical prediction tasks.
[56] Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data cs.CLPDF
Zhaohua Liang, Zhilin Wang, Renjie Cao, Yining Zhang
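TL;DR: This paper introduces mira-embeddings-v1, a semantic reranking system for the recruitment domain. It reshapes the embedding space with LLM-synthesized training data and corrects boundary confusions with a lightweight reranking head, improving candidate recall under a limited review budget.
Details
Motivation: Candidate sourcing is a two-stage pipeline of retrieval and reranking. The goal is to semantically rerank the shortlist returned by an upstream retriever so that qualified candidates rank as high as possible, without requiring large-scale manually labeled data.
Result: On a local pool of 300 real job descriptions, Recall@50 rises from 68.89% to 77.55% and Precision@10 from 35.77% to 39.62%. On a global pool of 44,138 candidates, Recall@200 reaches 0.7047 versus 0.5969 for the baseline, indicating robust gains.
Insight: The contributions are a five-stage prompt pipeline that generates diverse positive and negative samples to shape the semantic space from multiple angles, a two-round LoRA adaptation (JD-JD contrastive training followed by JD-CV triplet alignment), and a BoundaryHead MLP that resolves boundary confusions between roles with the same title but different scope. Together they achieve domain adaptation without large-scale manual labeling or a heavy cross-encoder.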
Abstract: Candidate sourcing for recruiters is best viewed as a two-stage retrieval and reranking pipeline with recall as the primary objective under a limited review budget. An upstream production retriever first returns a candidate shortlist for each job description (JD), and our goal is to rerank that shortlist so that qualified candidates appear as high as possible. We present mira-embeddings-v1, a semantic reranking system for the recruitment domain that reshapes the embedding space with LLM-synthesized training data and corrects boundary confusions with a lightweight reranking head. Starting from real JDs, we build a five-stage prompt pipeline to generate diverse positive and hard negative samples that sculpt the semantic space from multiple angles. We then apply a two-round LoRA adaptation: JD–JD contrastive training followed by JD–CV triplet alignment on a heterogeneous text dataset. Importantly, these gains require no large-scale manually labeled industrial training pairs: a modest set of real JDs is expanded into supervision through LLM synthesis. Finally, a BoundaryHead MLP reranks the Top-K results to distinguish between roles that share the same title but differ in scope. On a local pool of 300 real JDs with candidates from an upstream production retriever, mira-embeddings-v1 improves Recall@50 from 68.89% (baseline) to 77.55% while lifting Precision@10 from 35.77% to 39.62%. On a supportive global pool over 44,138 candidates judged by a Qwen3-32B rubric, it achieves Recall@200 of 0.7047 versus 0.5969 for the baseline. These results show that LLM-synthesized supervision with boundary-aware reranking yields robust gains without a heavy cross-encoder.
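For readers unfamiliar with the metrics reported above, the following is a minimal sketch of Recall@K and Precision@K for a single job description. It assumes the standard definitions; how the paper aggregates across JDs (for example, averaging per JD) is not stated here, and the candidate IDs are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all qualified candidates that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for c in ranked_ids[:k] if c in relevant_ids)
    return hits / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k slots filled by qualified candidates."""
    if k <= 0:
        return 0.0
    hits = sum(1 for c in ranked_ids[:k] if c in relevant_ids)
    return hits / k

# Hypothetical reranked shortlist for one JD; three candidates are qualified.
ranked = ["c3", "c1", "c7", "c2", "c9"]
qualified = {"c1", "c2", "c4"}

r3 = recall_at_k(ranked, qualified, 3)     # only c1 is in the top 3
p5 = precision_at_k(ranked, qualified, 5)  # c1 and c2 are in the top 5
```

Recall is bounded by the upstream shortlist: a qualified candidate the retriever never returned (c4 here) cannot be recovered by reranking.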
[57] Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF cs.CL | cs.AIPDF
Yuan Fang, Yiming Luo, Aimin Zhou, Fei Tan
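TL;DR: This paper proposes Reverse Constitutional AI (R-CAI), a framework for automated, controllable generation of adversarial toxic data to strengthen safety evaluation of large language models. It inverts a harmless constitution into a constitution of toxicity and refines model outputs through an iterative critique-revision pipeline, synthesizing multi-dimensional adversarial data at scale without human annotation. Because optimizing toxicity rewards alone can cause reward hacking and degraded semantic coherence, the paper introduces probability clamping within reinforcement learning from AI feedback to stabilize adversarial optimization while preserving adversarial intent.
Details
Motivation: Ensuring LLM safety requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. Existing methods are usually limited to isolated jailbreak prompts and lack a controllable, scalable framework for automated data generation.
Result: Experiments show that R-CAI generates diverse, high-quality toxic data. Probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. The framework offers a fully automated solution for red-teaming data generation and systematic safety evaluation of aligned language models.
Insight: The novelty lies in the R-CAI framework, which guides adversarial data generation by inverting a harmless constitution into a constitution of toxicity, and in probability clamping, which addresses reward hacking during reinforcement learning and improves output quality while keeping adversarial strength. The method applies the idea of Constitutional AI in reverse to red teaming, giving automated safety evaluation a new, controllable data generation paradigm.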
Abstract: Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique–revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
[58] Bridging the Reasoning Gap in Vietnamese with Small Language Models via Test-Time Scaling cs.CL | cs.AIPDF
Bui The Trung, Do Minh Duc, Nguyen Van Vinh, Bui Nguyen Quoc Trinh
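TL;DR: This paper addresses the “reasoning gap” of Small Language Models (SLMs) in Vietnamese by studying test-time scaling strategies on Vietnamese elementary mathematics. The authors build Vi-S1K, a high-quality reasoning dataset, and Vi-Elementary-Bench, an evaluation benchmark, and find that the base model has latent knowledge but suffers from a severe “formatting gap”. Supervised fine-tuning (SFT) markedly improves explanation quality, and for the 1.7B-parameter model, simplified chain-of-thought combined with self-consistency outperforms the more complex ReAct framework.
Details
Motivation: When deploying AI on resource-constrained devices, SLMs show a “reasoning gap” in non-English languages such as Vietnamese, particularly in maintaining coherent chains of thought.
Result: On Vi-Elementary-Bench, the base model (Qwen3-1.7B) shows strong latent knowledge (Accuracy: 4.05/5.00) but has formatting problems. SFT improves Explanation Quality by 77%. Simplified Chain-of-Thought (CoT) with self-consistency outperforms the complex ReAct framework, establishing a deployment hierarchy for edge-based reasoning.
Insight: The main contributions are: 1) Vi-S1K, a high-quality localized dataset for Vietnamese reasoning, and Vi-Elementary-Bench, a dual-resource evaluation benchmark; 2) evidence that SFT acts as a “reasoning unlocker” that bridges the gap between raw calculation and pedagogical coherence; 3) the finding that complex prompting frameworks such as ReAct impose a “cognitive tax” on small models, so simplified test-time scaling (CoT with self-consistency) combined with SFT works better, which gives practical guidance for deploying SLMs on edge devices.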
Abstract: The democratization of ubiquitous AI hinges on deploying sophisticated reasoning capabilities on resource-constrained devices. However, Small Language Models (SLMs) often face a “reasoning gap”, particularly in non-English languages like Vietnamese, where they struggle to maintain coherent chains of thought. This paper investigates Test-Time Scaling strategies for the Qwen3-1.7B architecture within the context of Vietnamese Elementary Mathematics. We introduce Vi-S1K, a high-fidelity reasoning dataset localized via a Gemini 2.5 Flash-Lite powered pipeline, and Vi-Elementary-Bench, a dual-resource benchmark for rigorous evaluation. Using an LLM-as-a-Judge protocol, we reveal that the base model possesses robust latent knowledge (Accuracy: 4.05/5.00) but suffers from a severe “formatting gap” in communication. Supervised Fine-Tuning (SFT) acts as a critical “reasoning unlocker”, yielding a 77% improvement in Explanation Quality and bridging the gap between raw calculation and pedagogical coherence. Furthermore, our analysis of prompting strategies uncovers a significant trade-off: structured frameworks like ReAct impose a “cognitive tax” on the 1.7B parameter capacity, degrading performance relative to pure Chain-of-Thought (CoT) combined with Self-Consistency. These findings establish a deployment hierarchy for SLMs, demonstrating that SFT combined with simplified test-time scaling is superior to complex agentic workflows for edge-based reasoning.
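Self-Consistency, the test-time scaling strategy that the abstract pairs with Chain-of-Thought, amounts to sampling several reasoning chains and keeping the most frequent final answer. The sketch below shows only the voting step; the sampled answers are made up, and the sampling itself (for example, from Qwen3-1.7B at non-zero temperature) is assumed to happen elsewhere.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent final answer among sampled reasoning chains.

    Answers are compared after whitespace stripping; empty answers are ignored.
    Ties go to the answer that was seen first.
    """
    normalized = [a.strip() for a in answers if a and a.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]

# Hypothetical final answers extracted from five sampled chains of thought.
sampled_answers = ["12", "15", "12 ", "12", "9"]
final_answer = self_consistency(sampled_answers)
```

A real pipeline would also need an answer-extraction step that pulls the final number out of each generated explanation before voting.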
[59] PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking cs.CL | cs.AIPDF
Wang Bill Zhu, Qiutong Tony Yi, Robin Jia, Jesse Thomason
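TL;DR: This paper proposes PDDL-Mind, a neuro-symbolic framework that translates narrative descriptions into explicit states and actions in the Planning Domain Definition Language (PDDL) and verifies action-induced state transitions. This gives large language models (LLMs) a logically consistent, explicit representation of world states for theory-of-mind (ToM) tasks and substantially improves their performance on ToM benchmarks.
Details
Motivation: On existing ToM benchmarks, LLMs perform far below human level even with chain-of-thought prompting or probabilistic belief updates. The authors argue that this stems mainly from unreliable implicit state tracking rather than from limits in high-level reasoning.
Result: On the ToM benchmarks MMToM-QA, MuMA, and FanToM, PDDL-Mind achieves an absolute accuracy gain of over 5% over the best existing method.
Insight: The core idea is to decouple environment state evolution from belief inference and use PDDL to provide explicit, verifiable state representations, compensating for LLMs' unreliable implicit state tracking. It is a neuro-symbolic approach that combines symbolic logic with neural models.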
Abstract: Large language models (LLMs) perform substantially below human level on existing theory-of-mind (ToM) benchmarks, even when augmented with chain-of-thought prompting or probabilistic belief updates. We argue that these failures primarily arise from unreliable implicit state tracking rather than limitations in high-level reasoning. We introduce PDDL-Mind, a neuro-symbolic framework that decouples environment state evolution from belief inference. By translating narrative descriptions into explicit states and actions expressed in Planning Domain Definition Language (PDDL), and by verifying action-induced state transitions against a predefined domain, PDDL-Mind provides LLMs with a logically consistent and explicit representation of world states for ToM tasks. Experiments on MMToM-QA, MuMA and FanToM show that PDDL-Mind achieves over 5% absolute accuracy gain over the best existing state-of-the-art method on ToM benchmark questions.
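The idea of verifying action-induced state transitions against a predefined domain can be sketched with a STRIPS-style model, in which each action has preconditions, add effects, and delete effects. This is an illustration of the general technique, not PDDL-Mind's implementation; the `Action` class, the fact tuples, and the Sally-Anne-style scenario are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

def apply_action(state, action):
    """Apply an action only if its preconditions hold; otherwise reject it."""
    missing = action.preconditions - state
    if missing:
        raise ValueError(f"{action.name}: unmet preconditions {sorted(missing)}")
    return (state - action.del_effects) | action.add_effects

# World state as a set of ground facts (illustrative predicates).
state = frozenset({
    ("in", "apple", "basket"),
    ("at", "anne", "room"),
})

move_apple = Action(
    name="anne-moves-apple-to-box",
    preconditions=frozenset({("in", "apple", "basket"), ("at", "anne", "room")}),
    add_effects=frozenset({("in", "apple", "box")}),
    del_effects=frozenset({("in", "apple", "basket")}),
)

new_state = apply_action(state, move_apple)

# Repeating the action is rejected: the apple is no longer in the basket.
try:
    apply_action(new_state, move_apple)
    second_move_rejected = False
except ValueError:
    second_move_rejected = True
```

Because every transition is checked, a mistranslated narrative step surfaces as an explicit error instead of silently corrupting the state that later belief questions depend on.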
[60] Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models cs.CLPDF
Hang Zeng, Xiangyu Liu, Yong Hu, Chaoyue Niu, Jiarui Zhang
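TL;DR: This paper proposes a dynamic collaboration framework in which a small language model (SLM) learns to proactively decide when to seek help from a large language model (LLM) during multi-step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. The framework aims to combine the strong capabilities of LLMs with the efficiency and privacy advantages of SLMs.
Details
Motivation: LLMs are costly and raise privacy concerns, while SLMs have limited capacity; a dynamic collaboration mechanism can combine their complementary strengths.
Result: Evaluation reveals a clear scaling effect: stronger SLMs become more self-reliant, and stronger LLMs enable fewer, more informative interactions. The learned dynamic collaboration strategies significantly outperform static pipelines and standalone inference, and transfer robustly to unseen LLMs.
Insight: The novelty lies in having the SLM actively learn a collaboration strategy instead of following a fixed calling pattern. Adaptive interaction improves the trade-off between efficiency and performance and offers a new approach to collaboration between heterogeneous models.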
Abstract: Large language models (LLMs) offer strong capabilities but raise cost and privacy concerns, whereas small language models (SLMs) facilitate efficient and private local inference yet suffer from limited capacity. To synergize the complementary strengths, we introduce a dynamic collaboration framework, where an SLM learns to proactively decide how to request an LLM during multi-step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. We further systematically investigate how collaboration strategies are shaped by SLM and LLM capabilities as well as efficiency and privacy constraints. Evaluation results reveal a distinct scaling effect: stronger SLMs become more self-reliant, while stronger LLMs enable fewer and more informative interactions. In addition, the learned dynamic collaboration strategies significantly outperform static pipelines and standalone inference, and transfer robustly to unseen LLMs.
[61] Latent Abstraction for Retrieval-Augmented Generation cs.CL | cs.AIPDF
Ha Lan N. T, Minh-Anh Nguyen, Dung D. Le
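TL;DR: This paper proposes LAnR (Latent Abstraction for RAG), a unified framework that addresses limitations of existing retrieval-augmented generation (RAG) systems. A single LLM jointly performs encoding, retrieval, and generation within its own latent space: it produces dense retrieval vectors from the hidden states of a designated [PRED] token to match documents, and a lightweight MLP control head adaptively decides when to stop retrieving, eliminating the separate retriever and explicit stopping reasoning.
Details
Motivation: Existing RAG systems rely on generating natural language queries at each step and on a strict architectural separation between retriever and generator, which prevents them from using the full representational capacity of the LLM. The paper addresses this with a unified latent-space framework that integrates retrieval and generation more tightly.
Result: Extensive experiments on six QA benchmarks spanning single-hop and multi-hop settings show that LAnR outperforms existing RAG methods while improving inference efficiency through fewer retrieval calls and tighter model integration.
Insight: The main novelty is a unified RAG framework that operates entirely within the latent space of a single LLM. Latent vectors replace textual queries for retrieval, and, motivated by the observation that answer token entropy signals retrieval sufficiency, a lightweight control head makes adaptive stopping decisions. This simplifies the architecture and improves both efficiency and performance.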
Abstract: Retrieval-Augmented Generation (RAG) has become a standard approach for enhancing large language models (LLMs) with external knowledge, mitigating hallucinations, and improving factuality. However, existing systems rely on generating natural language queries at each hop and maintaining a strict architectural separation between retriever and generator, preventing them from leveraging the full representational capacity of the LLM. We propose LAnR (Latent Abstraction for RAG), a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated [PRED] token and uses them to match against encoded document representations from the same model. Furthermore, LAnR adaptively decides when sufficient evidence has been retrieved using a lightweight MLP control head over those same hidden states, eliminating both the separate retriever and explicit token-level stopping reasoning. This design is motivated by our empirical observation that answer token entropy reliably signals retrieval sufficiency. Extensive experiments on six QA benchmarks spanning single-hop and multi-hop settings demonstrate that LAnR outperforms existing RAG methods, while achieving improved inference efficiency through reduced number of retrieval calls and tighter model integration.
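The observation that answer token entropy signals retrieval sufficiency can be made concrete with a small sketch. It computes the Shannon entropy of each answer token's predictive distribution and stops retrieving when the mean entropy is low. Note that the fixed threshold is a stand-in chosen for illustration: LAnR itself uses a learned MLP control head over hidden states, and the function names, threshold value, and probabilities below are assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_stop_retrieval(answer_token_probs, threshold=0.5):
    """Heuristic stand-in for a learned stop head: stop when the model is confident."""
    entropies = [token_entropy(p) for p in answer_token_probs]
    mean_entropy = sum(entropies) / len(entropies)
    return mean_entropy < threshold

# Hypothetical distributions over a 4-token vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01]]   # peaked: evidence likely sufficient
uncertain = [[0.25, 0.25, 0.25, 0.25]]   # uniform: keep retrieving

stop_when_confident = should_stop_retrieval(confident)
stop_when_uncertain = should_stop_retrieval(uncertain)
```

A learned head can pick up on signals beyond entropy, which is presumably why the paper trains one instead of thresholding directly.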
[62] Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions cs.CLPDF
Kun Zhou, Jiakai He, Wenmian Yang, Zhensheng Wang, Yiquan Zhang
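TL;DR: This paper presents the DynaSlide benchmark and the SlideAgent framework for automatically updating presentation slides based on user-defined dynamic templates and natural language instructions. DynaSlide is a large-scale dataset of 20,036 real-world instruction-execution triples, and SlideAgent is an agent framework that combines multimodal parsing, instruction understanding, and tool-augmented reasoning to update slide content while preserving layout and style.
Details
Motivation: Existing automation methods mostly follow fixed template filling and cannot support dynamic updates for diverse, user-authored slides, so maintaining complex, analytics-style presentations remains labor-intensive.
Result: SlideAgent provides a strong reference baseline on DynaSlide, and the end-to-end and component-level evaluation protocols reveal key challenges and opportunities for future research.
Insight: The novelty lies in defining the task of dynamic slide updating from user-provided templates and natural language instructions and in building the first large-scale benchmark grounded in real business reports under bring-your-own-template (BYO-template) conditions. SlideAgent combines multimodal parsing with tool-augmented reasoning in an agent to balance content updates against style preservation.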
Abstract: Presentation slides are a primary medium for data-driven reporting, yet keeping complex, analytics-style decks up to date remains labor-intensive. Existing automation methods mostly follow fixed template filling and cannot support dynamic updates for diverse, user-authored slide decks. We therefore define “Dynamic Slide Update via Natural Language Instructions on User-provided Templates” and introduce DynaSlide, a large-scale benchmark with 20,036 real-world instruction-execution triples (source slide, user instruction, target slide) grounded in a shared external database and built from business reporting slides under bring-your-own-template (BYO-template) conditions. To tackle this task, we propose SlideAgent, an agent-based framework that combines multimodal slide parsing, natural language instruction grounding, and tool-augmented reasoning for tables, charts, and textual conclusions. SlideAgent updates content while preserving layout and style, providing a strong reference baseline on DynaSlide. We further design end-to-end and component-level evaluation protocols that reveal key challenges and opportunities for future research. The dataset and code are available at https://github.com/XiaoZhou2024/SlideAgent.
[63] ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering cs.CLPDF
Yindong Zhang, Wenmian Yang, Yiquan Zhang, Weijia Jia
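TL;DR: This paper presents ReCoQA, a benchmark of 29,270 real-estate question answering instances for evaluating tool-augmented and multi-step reasoning, and designs the hierarchical framework HIRE-Agent as a strong baseline.
Details
Motivation: Benchmarks that reflect hybrid workflows combining database querying with external APIs are scarce, which hinders the development of agents able to handle fragmented, multi-source information.
Result: Extensive experiments show that HIRE-Agent is a strong baseline and support the necessity of hierarchical collaboration for complex, real-world reasoning tasks.
Insight: The novelty lies in a large-scale benchmark with machine-verifiable supervision for intermediate steps and in a hierarchical agent framework with an understand-plan-execute architecture that integrates heterogeneous evidence.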
Abstract: Developing agents capable of navigating fragmented, multi-source information remains challenging, primarily due to the scarcity of benchmarks reflecting hybrid workflows combining database querying with external APIs. To bridge this gap, we introduce ReCoQA, a large-scale benchmark of 29,270 real-estate instances featuring machine-verifiable supervision for intermediate steps, including structured intent labels, SQL queries, and API calls. Complementarily, we propose HIRE-Agent, a hierarchical framework instantiating an understand-plan-execute architecture as a strong baseline. By orchestrating a Front-end parser, a planning Supervisor, and execution Specialists, HIRE-Agent effectively integrates heterogeneous evidence. Extensive experiments demonstrate that HIRE-Agent constitutes a strong baseline and substantiates the necessity of hierarchical collaboration for complex, real-world reasoning tasks.
[64] Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards cs.CLPDF
Raffaele Pisano, Roberto Navigli
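TL;DR: This paper proposes a new method based on the Planning Domain Definition Language (PDDL) for generating large-scale, precise datasets for Process Reward Models (PRMs), addressing the fact that existing PRM datasets are costly to build, error-prone, and mostly limited to mathematics. The method produces a corpus of about one million reasoning steps, which is used to train PRMs. Experiments show that adding the PDDL data to existing training sets substantially improves PRM performance on both mathematical and non-mathematical reasoning tasks.
Details
Motivation: Existing PRM datasets are expensive to construct, prone to annotation errors, and mostly limited to the mathematical domain, which makes scalable fine-grained reasoning feedback hard to provide.
Result: Across multiple benchmarks, augmenting widely used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning.
Insight: The novelty lies in using planning problems expressed in PDDL as a scalable resource for generating precise, fine-grained PRM training data, moving beyond the traditional mathematical data sources and giving PRMs a more general and robust training foundation.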
Abstract: Process Reward Models (PRMs) have emerged as a powerful tool for providing step-level feedback when evaluating the reasoning of Large Language Models (LLMs), which frequently produce chains of thought (CoTs) containing errors even when the final answer is correct. However, existing PRM datasets remain expensive to construct, prone to annotation errors, and predominantly limited to the mathematical domain. This work introduces a novel and scalable approach to PRM dataset generation based on planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, we generate a corpus of approximately one million reasoning steps across various PDDL domains and use it to train PRMs. Experimental results show that augmenting widely-used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning, as demonstrated across multiple benchmarks. These findings indicate that planning problems constitute a scalable and effective resource for generating robust, precise, and fine-grained training data for PRMs, going beyond the classical mathematical sources that dominate this field.
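One plausible reason planning problems yield precise step-level labels is that each step of a candidate plan can be checked mechanically against the domain. The sketch below labels a step 1 if its preconditions hold in the current state and 0 otherwise. It is an assumption about how such labels could be derived, not the paper's pipeline; the blocks-world actions, the dictionary layout, and the choice to skip invalid steps without changing the state are all illustrative.

```python
def label_steps(initial_state, steps, actions):
    """Return one 0/1 label per plan step by simulating the plan."""
    state = set(initial_state)
    labels = []
    for name in steps:
        act = actions.get(name)
        valid = act is not None and act["pre"] <= state
        labels.append(1 if valid else 0)
        if valid:  # invalid steps leave the state unchanged (a design assumption)
            state = (state - act["del"]) | act["add"]
    return labels

# Tiny blocks-world domain with two ground actions (hypothetical encoding).
actions = {
    "pickup_a": {
        "pre": {"clear_a", "ontable_a", "handempty"},
        "add": {"holding_a"},
        "del": {"clear_a", "ontable_a", "handempty"},
    },
    "stack_a_b": {
        "pre": {"holding_a", "clear_b"},
        "add": {"on_a_b", "clear_a", "handempty"},
        "del": {"holding_a", "clear_b"},
    },
}
initial = {"clear_a", "ontable_a", "clear_b", "ontable_b", "handempty"}

good_labels = label_steps(initial, ["pickup_a", "stack_a_b"], actions)
bad_labels = label_steps(initial, ["stack_a_b", "pickup_a"], actions)
```

Unlike human annotation, these labels are deterministic for a given domain definition, which matches the abstract's emphasis on precision and scalability.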
[65] Modeling Multiple Support Strategies within a Single Turn for Emotional Support Conversations cs.CLPDF
Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen
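TL;DR: This paper recasts Emotional Support Conversation as multi-strategy utterance generation and proposes two generation methods: All-in-One, which predicts all strategy-response pairs in a single decoding step, and One-by-One, which generates strategy-response pairs iteratively. Cognitive reasoning guided by reinforcement learning further improves strategy selection and response composition. Experiments on the ESConv dataset show that these methods model multi-strategy utterances effectively and improve supportive quality and dialogue success.
Details
Motivation: Prior work usually assumes that each supporter turn corresponds to a single strategy, but real-world supportive communication often combines several strategies within one utterance. The paper addresses this limitation by modeling multiple support strategies within a single utterance.
Result: On ESConv, the methods improve supportive quality and dialogue success under both utterance-level and dialogue-level settings, providing the first systematic empirical evidence for using multiple strategies within a single utterance.
Insight: The novelty lies in redefining emotional support conversation as multi-strategy utterance generation and in proposing two corresponding generation frameworks, enhanced with reinforcement-learning-guided cognitive reasoning, which removes the traditional single-strategy assumption.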
Abstract: Emotional Support Conversation (ESC) aims to assist individuals experiencing distress by generating empathetic and supportive dialogue. While prior work typically assumes that each supporter turn corresponds to a single strategy, real-world supportive communication often involves multiple strategies within a single utterance. In this paper, we revisit the ESC task by formulating it as multi-strategy utterance generation, where each utterance may contain one or more strategy-response pairs. We propose two generation methods: All-in-One, which predicts all strategy-response pairs in a single decoding step, and One-by-One, which iteratively generates strategy-response pairs until completion. Both methods are further enhanced with cognitive reasoning guided by reinforcement learning to improve strategy selection and response composition. We evaluate our models on the ESConv dataset under both utterance-level and dialogue-level settings. Experimental results show that our methods effectively model multi-strategy utterances and lead to improved supportive quality and dialogue success. To our knowledge, this work provides the first systematic empirical evidence that allowing multiple support strategies within a single utterance is both feasible and beneficial for emotional support conversations. All code and data will be publicly available at https://github.com/aliyun/qwen-dianjin.
[66] Employing General-Purpose and Biomedical Large Language Models with Advanced Prompt Engineering for Pharmacoepidemiologic Study Design cs.CLPDF
Xinyao Zhang, Nicole Sonne Heckmann, Manuela Del Castillo Suero, Francesco Paolo Speca, Maurizio Sessa
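TL;DR: This study evaluates general-purpose LLMs (GPT-4o and DeepSeek-R1) against biomedically fine-tuned LLMs on automating pharmacoepidemiologic study design. Using advanced prompting strategies such as Least-to-Most and Active Prompting on 46 study protocols from the HMA-EMA Catalogue and the Sentinel System, it finds that the general-purpose models score higher on relevance and logic of justification, while all models are limited at mapping medical terms to ontology codes.
Details
Motivation: LLMs have great potential to automate pharmacoepidemiologic study design, but their reliability has not been sufficiently validated. General-purpose models are often inaccurate, and the comparative performance of specialized biomedical models in this domain is unknown, so a systematic evaluation is needed.
Result: GPT-4o and DeepSeek-R1 paired with Least-to-Most (LTM) prompting achieve the highest relevance and logic of justification scores, with GPT-4o-LTM reaching a median relevance score of 4 in 8 of 9 questions for HMA-EMA protocols. Biomedical models show lower relevance overall and often fail to provide sufficient justification. All models have limited ability in ontology-code agreement, although LTM gives the most consistent improvements in reasoning stability.
Insight: Off-the-shelf general-purpose LLMs support pharmacoepidemiologic design better than specialized biomedically fine-tuned models, which challenges the intuition that domain-specific models are necessarily better at specialized tasks. Prompting strategy (such as Least-to-Most) strongly affects model performance and is key to reasoning stability, suggesting that careful prompt engineering may matter more than domain fine-tuning of the model itself.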
Abstract: Background: The potential of large language models (LLMs) to automate and support pharmacoepidemiologic study design is an emerging area of interest, yet their reliability remains insufficiently characterized. General-purpose LLMs often display inaccuracies, while the comparative performance of specialized biomedical LLMs in this domain remains unknown. Methods: This study evaluated general-purpose LLMs (GPT-4o and DeepSeek-R1) versus biomedically fine-tuned LLMs (QuantFactory/Bio-Medical-Llama-3-8B-GGUF and Irathernotsay/qwen2-1.5B-medical_qa-Finetune) using 46 protocols (2018-2024) from the HMA-EMA Catalogue and Sentinel System. Performance was assessed across relevance, logic of justification, and ontology-code agreement across multiple coding systems using Least-to-Most (LTM) and Active Prompting strategies. Results: GPT-4o and DeepSeek-R1 paired with LTM prompting achieved the highest relevance and logic of justification scores, with GPT-4o-LTM reaching a median relevance score of 4 in 8 of 9 questions for HMA-EMA protocols. Biomedical LLMs showed lower relevance overall and frequently generated insufficient justification. All LLMs demonstrated limited proficiency in ontology-code mapping, although LTM provided the most consistent improvements in reasoning stability. Conclusion: Off-the-shelf general-purpose LLMs currently offer superior support for pharmacoepidemiologic design compared to biomedical LLMs. Prompt strategy strongly influenced LLM performance.
[67] JudgeMeNot: Personalizing Large Language Models to Emulate Judicial Reasoning in Hebrew cs.CL | cs.CY
Itay Razumenko, Arnon Sturm, Nir Grinberg
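TL;DR: This paper proposes JudgeMeNot, which uses a synthetic-organic supervision pipeline to turn raw judicial decisions into instruction-tuning data, enabling parameter-efficient personalized fine-tuning of large language models in low-resource settings to emulate the reasoning style of Hebrew-language judges.
Details
Motivation: Despite significant advances in large language models, personalizing them for individual decision-makers such as judges remains an open problem, especially for low-resource languages such as Hebrew and in the judicial domain.
Result: Across three different tasks and settings, the approach (Causal Language Modeling followed by synthetically generated instruction tuning) significantly outperforms all baselines, with significant gains in lexical, stylistic, and semantic similarity; the model-generated outputs are reported to be indistinguishable from the reasoning of human judges.
Insight: The novelty is the synthetic-organic supervision pipeline, which efficiently converts raw decisions into instruction data for parameter-efficient personalized fine-tuning in low-resource settings, and shows that emulating an individual's reasoning style is feasible in specialized domains such as law.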
Abstract: Despite significant advances in large language models, personalizing them for individual decision-makers remains an open problem. Here, we introduce a synthetic-organic supervision pipeline that transforms raw judicial decisions into instruction-tuning data, enabling parameter-efficient fine-tuning of personalized models for individual judges in low-resource settings. We compare our approach to state-of-the-art personalization techniques across three different tasks and settings. The results show that Causal Language Modeling followed by synthetically generated instruction-tuning significantly outperforms all other baselines, providing significant improvements across lexical, stylistic, and semantic similarity. Notably, our model-generated outputs are indistinguishable from the reasoning of human judges, highlighting the viability of efficient personalization, even in low-resource settings.
[68] Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts cs.CL | cs.CV
Run Xu, Lu Li, Rongzhao Zhang, Jie Xu
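TL;DR: This paper introduces a new multimodal generation task, culture-aware humorous captioning, in which a model generates a humorous caption conditioned on an input image and a target cultural context. To address existing models' difficulty in jointly maintaining image relevance, contextual appropriateness, and humor quality under a given cultural background, the authors build a six-dimensional evaluation framework and propose a staged alignment framework. The framework performs multi-dimensional preference alignment via judge-based GRPO, adds a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking in open-ended generation, and finally adapts the model to the Eastern cultural context. Experiments show stronger overall performance under the proposed evaluation framework, with especially large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.
Details
Motivation: Existing multimodal large language models show promise in generating humorous image captions but lack stable control over explicit cultural context, which makes it hard to jointly maintain image relevance, contextual appropriateness, and humor quality under a specified cultural background.
Result: The method achieves stronger overall performance under the proposed six-dimensional evaluation framework, with large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.
Insight: The contributions are: (1) the new task of culture-aware humorous captioning; (2) a six-dimensional evaluation framework covering image relevance, contextual fit, semantic richness, reasonableness, humor, and creativity; (3) a staged alignment framework that combines judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking and enable cross-cultural adaptation.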
Abstract: Recent multimodal large language models have shown promising ability in generating humorous captions for images, yet they still lack stable control over explicit cultural context, making it difficult to jointly maintain image relevance, contextual appropriateness, and humor quality under a specified cultural background. To address this limitation, we introduce a new multimodal generation task, culture-aware humorous captioning, which requires a model to generate a humorous caption conditioned on both an input image and a target cultural context. Captions generated under different cultural contexts are not expected to share the same surface form, but should remain grounded in similar visual situations or humorous rationales. To support this task, we establish a six-dimensional evaluation framework covering image relevance, contextual fit, semantic richness, reasonableness, humor, and creativity. We further propose a staged alignment framework that first initializes the model with high-resource supervision under the Western cultural context, then performs multi-dimensional preference alignment via judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking in open-ended generation, and finally adapts the model to the Eastern cultural context with a small amount of supervision. Experimental results show that our method achieves stronger overall performance under the proposed evaluation framework, with particularly large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.
[69] FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings cs.CL | cs.SD
Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz
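TL;DR: This paper proposes factorized linear projection (FLiP) models for understanding and interpreting pretrained multilingual, multimodal, and API-based sentence embedding spaces. FLiP models trained to recover lexical content from LaBSE, SONAR, and Gemini embeddings recall more than 75% of lexical content in several high- and mid-resource languages, significantly outperforming existing non-factorized baselines. Used as a diagnostic tool, the method reveals modality and language biases in the selected sentence encoders and gives practitioners intrinsic insights that do not rely on conventional downstream evaluation tasks.
Details
Motivation: The goal is a tool for understanding the internal representations of pretrained sentence embedding spaces, in particular how they encode lexical information, and for exposing the modality and language biases of different encoders (multilingual, multimodal, API-based), offering a more intrinsic view of evaluation.
Result: On recovering lexical content from sentence embeddings, FLiP models achieve more than 75% recall in several high- and mid-resource languages, significantly outperforming existing non-factorized baselines.
Insight: The novelty is FLiP, a simple and effective diagnostic model that quantifies how much lexical information an embedding space encodes and serves as an analysis tool for exposing intrinsic encoder biases (such as modality and language bias). This offers a new view of model understanding and evaluation that does not depend on downstream task performance.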
Abstract: This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is public https://github.com/BUTSpeechFIT/FLiP.
[70] Retrieval-Augmented Multimodal Model for Fake News Detection cs.CL | cs.MM
Yiheng Li, Weihai Lu, Hanyi Yu, Yue Wang
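TL;DR: This paper proposes RAMM, a retrieval-augmented multimodal model that targets two key challenges in multimodal, multidomain fake news detection: modeling cross-instance narrative consistency and reasoning with domain-specific knowledge. RAMM uses a multimodal large language model as its backbone and combines an Abstract Narrative Alignment Module with a Semantic Representation Alignment Module, extracting abstract narrative consistency from instances across domains and performing instance-based analogical reasoning to improve performance.
Details
Motivation: Existing fake news detection models usually evaluate each news item in isolation, so they struggle to capture cross-instance narrative consistency; they also rely on parametric knowledge encoded during training, which generalizes poorly to new or data-scarce domains.
Result: Extensive experiments on three public datasets validate the effectiveness of the proposed method.
Insight: The novelty lies in the Abstract Narrative Alignment Module, which models high-level narrative information, and the Semantic Representation Alignment Module, which shifts the model's reasoning from direct inference on features to instance-based analogical reasoning that mimics human decision-making.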
Abstract: In recent years, multimodal multidomain fake news detection has garnered increasing attention. Nevertheless, this direction presents two significant challenges: (1) Failure to Capture Cross-Instance Narrative Consistency: existing models usually evaluate each news in isolation, fail to capture cross-instance narrative consistency, and thus struggle to address the spread of cluster based fake news driven by social media; (2) Lack of Domain Specific Knowledge for Reasoning: conventional models, which rely solely on knowledge encoded in their parameters during training, struggle to generalize to new or data-scarce domains (e.g., emerging events or niche topics). To tackle these challenges, we introduce Retrieval-Augmented Multimodal Model for Fake News Detection (RAMM). First, RAMM employs a Multimodal Large Language Model (MLLM) as its backbone to capture cross-modal semantic information from news samples. Second, RAMM incorporates an Abstract Narrative Alignment Module. This component adaptively extracts abstract narrative consistency from diverse instances across distinct domains, aggregates relevant knowledge, and thereby enables the modeling of high-level narrative information. Finally, RAMM introduces a Semantic Representation Alignment Module, which aligns the model’s decision-making paradigm with that of humans - specifically, it shifts the model’s reasoning process from direct inference on multimodal features to an instance-based analogical reasoning process. Extensive experimental results on three public datasets validate the efficacy of our proposed approach. Our code is available at the following link: https://github.com/li-yiheng/RAMM
[71] Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents cs.CL
Akriti Jain, Anish Mulay, Divyansh Verma, Aishani Pandey, Pritika Ramu
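TL;DR: This paper proposes Decisive, an interactive decision-support framework that combines document-grounded reasoning with Bayesian preference inference. It extracts an objective option-scoring matrix from unstructured documents and actively learns the user's latent preference vector to guide the decision. By adaptively selecting pairwise tradeoff questions that maximize information gain, the system converges efficiently and provides transparent, personalized recommendations with minimal user effort.
Details
Motivation: Existing methods, including large language models and traditional decision-support systems, fall short in decision support: they either overwhelm users with information or fail to capture nuanced user preferences accurately. The paper targets cognitively intensive decision tasks that require synthesizing relevant factors from multiple unstructured sources and incorporating subjective preferences.
Result: Extensive experiments show that the approach significantly outperforms general-purpose LLMs and existing decision-making frameworks in decision accuracy across domains, with up to 20% improvement over strong baselines.
Insight: The novelty is the combination of document-driven objective option scoring with Bayesian active preference learning, where questions are selected adaptively on information-theoretic grounds to optimize the decision process. This offers a decision-support paradigm that reduces users' cognitive load efficiently while preserving transparency and personalization.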
Abstract: Decision-making is a cognitively intensive task that requires synthesizing relevant information from multiple unstructured sources, weighing competing factors, and incorporating subjective user preferences. Existing methods, including large language models and traditional decision-support systems, fall short: they often overwhelm users with information or fail to capture nuanced preferences accurately. We present Decisive, an interactive decision-making framework that combines document-grounded reasoning with Bayesian preference inference. Our approach grounds decisions in an objective option-scoring matrix extracted from source documents, while actively learning a user’s latent preference vector through targeted elicitation. Users answer pairwise tradeoff questions adaptively selected to maximize information gain over the final decision. This process converges efficiently, minimizing user effort while ensuring recommendations remain transparent and personalized. Through extensive experiments, we demonstrate that our approach significantly outperforms both general-purpose LLMs and existing decision-making frameworks, achieving up to 20% improvement in decision accuracy over strong baselines across domains.
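The elicitation loop described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the particle (grid) posterior over preference weights, the Bradley-Terry style answer model, the `temp` parameter, the choice to compare pairs of options, and all function names are assumptions made for illustration. It only shows the general idea of picking the pairwise question with the largest expected reduction in uncertainty about the final decision.

```python
import math
from itertools import combinations


def utility(w, scores):
    """Weighted score of one option under preference weights w."""
    return sum(wi * si for wi, si in zip(w, scores))


def best_option(w, S):
    """Index of the option with the highest utility under w."""
    return max(range(len(S)), key=lambda k: utility(w, S[k]))


def decision_entropy(particles, probs, S):
    """Entropy (nats) of the posterior over which option is best."""
    mass = [0.0] * len(S)
    for w, p in zip(particles, probs):
        mass[best_option(w, S)] += p
    return -sum(m * math.log(m) for m in mass if m > 0)


def p_prefers(w, S, i, j, temp=0.1):
    """Assumed answer model: chance a user with weights w picks option i over j."""
    diff = utility(w, S[i]) - utility(w, S[j])
    return 1.0 / (1.0 + math.exp(-diff / temp))


def update(particles, probs, S, i, j, chose_i):
    """Bayes update of the particle weights after one pairwise answer."""
    post = []
    for w, p in zip(particles, probs):
        like = p_prefers(w, S, i, j)
        post.append(p * (like if chose_i else 1.0 - like))
    z = sum(post)
    return [p / z for p in post]


def expected_info_gain(particles, probs, S, i, j):
    """Expected drop in decision entropy from asking about the pair (i, j)."""
    gain = decision_entropy(particles, probs, S)
    p_yes = sum(p * p_prefers(w, S, i, j) for w, p in zip(particles, probs))
    for chose_i, p_ans in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_ans > 0:
            post = update(particles, probs, S, i, j, chose_i)
            gain -= p_ans * decision_entropy(particles, post, S)
    return gain


def next_question(particles, probs, S):
    """Pair of options whose comparison is expected to be most informative."""
    pairs = combinations(range(len(S)), 2)
    return max(pairs, key=lambda q: expected_info_gain(particles, probs, S, *q))


if __name__ == "__main__":
    # Hypothetical option-scoring matrix: 3 options x 2 criteria.
    S = [[0.9, 0.2], [0.2, 0.9], [0.5, 0.5]]
    particles = [(k / 10, 1 - k / 10) for k in range(11)]
    probs = [1 / len(particles)] * len(particles)
    print("ask about options:", next_question(particles, probs, S))
```

A grid of candidate weight vectors keeps the sketch dependency-free; a real system would need a richer posterior and a way to phrase the chosen pair as a natural-language tradeoff question.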
[72] TLoRA: Task-aware Low Rank Adaptation of Large Language Models cs.CL | cs.AI
Weicheng Lin, Yi Zhang, Jiawei Dang, Liang-Jie Zhang
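TL;DR: This paper proposes TLoRA (Task-aware Low-Rank Adaptation), a unified framework for parameter-efficient fine-tuning of large language models. A data-driven initialization aligns the LoRA A matrix with task-relevant subspaces and then freezes it, so only the B matrix is trained, while a sensitivity-based importance metric adaptively allocates ranks and scaling factors across layers under a fixed parameter budget.
Details
Motivation: Existing LoRA variants usually optimize only one of rank allocation, scaling factors, or initialization, often at the cost of increased training complexity or reduced practical efficiency. The paper aims to optimize initialization and resource allocation jointly, improving LoRA in a more efficient and unified way.
Result: Extensive experiments on natural language understanding, commonsense reasoning, math reasoning, code generation, and chat generation show that TLoRA performs consistently well while significantly reducing the number of trainable parameters.
Insight: The contributions are: (1) a data-driven initialization that sets the A matrix via singular value decomposition of the product of pre-trained weights and input activation covariance; (2) a sensitivity-based importance metric that adaptively allocates per-layer ranks and scaling factors under a fixed parameter budget; (3) a unified framework that optimizes initialization and resource allocation together, improving both efficiency and performance.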
Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning method for large language models, with its effectiveness largely influenced by the allocation of ranks and scaling factors, as well as initialization. Existing LoRA variants typically address only one of these factors, often at the cost of increased training complexity or reduced practical efficiency. In this work, we present Task-aware Low-Rank Adaptation (TLoRA), a unified framework that jointly optimizes initialization and resource allocation at the outset of training. TLoRA introduces a data-driven initialization strategy that aligns the LoRA $A$ matrix with task-relevant subspaces by performing singular value decomposition on the product of pre-trained weights and input activation covariance. After this, the $A$ matrix is frozen, and only the $B$ matrix is trained. Furthermore, TLoRA employs a sensitivity-based importance metric to adaptively allocate ranks and scaling factors across layers under a fixed parameter budget. We conduct extensive experiments that demonstrate TLoRA consistently performs excellently across various tasks, including natural language understanding, commonsense reasoning, math reasoning, code generation, and chat generation, while significantly reducing the number of trainable parameters.
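The initialization described in this abstract can be written out as a short derivation. This is one plausible reading, not the paper's exact formulation: the shapes ($W \in \mathbb{R}^{d_{out} \times d_{in}}$, $A \in \mathbb{R}^{r \times d_{in}}$, $B \in \mathbb{R}^{d_{out} \times r}$), the use of the top-$r$ right singular vectors, the uncentered covariance estimate, and the symbols $\Sigma$, $V_r$, and $\alpha$ are all assumptions introduced here.

```latex
\begin{aligned}
\Sigma &= \tfrac{1}{n}\sum_{i=1}^{n} x_i x_i^{\top}
  && \text{(input activation covariance estimated on task data)}\\
W\Sigma &= U S V^{\top}
  && \text{(SVD of the weight--covariance product)}\\
A &\leftarrow V_{r}^{\top}
  && \text{(assumed: top-$r$ right singular vectors; frozen afterwards)}\\
W' &= W + \alpha\, B A
  && \text{(only $B$ is trained; $\alpha$ is the per-layer scaling factor)}
\end{aligned}
```

Under this reading, the rank $r$ and the factor $\alpha$ would be the per-layer quantities that the sensitivity-based metric allocates under the fixed parameter budget.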
[73] MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge cs.CL | cs.AI | cs.CV
Sua Lee, Sanghee Park, Jinbae Im
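TL;DR: This paper introduces MM-JudgeBias, a benchmark for systematically evaluating compositional bias when multimodal large language models act as automatic evaluators. It finds that existing MLLM judges give unreliable evaluations when evidence is missing or mismatched and are unstable under semantically irrelevant perturbations.
Details
Motivation: Multimodal large language models are increasingly used as automatic evaluators, yet their reliability and vulnerability to bias remain underexplored, so their compositional biases need systematic evaluation.
Result: Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, indicating that existing models are not yet reliable as evaluators.
Insight: The paper defines compositional bias in MLLM-as-a-Judge systems and proposes an evaluation framework with two complementary metrics, Bias-Deviation (BD) and Bias-Conformity (BC), together with a fine-grained diagnostic dataset of more than 1,800 samples.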
Abstract: Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerabilities to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias introduces controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables a fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
[74] STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs cs.CL | cs.AI
Sungeun An, Swanand Ravindra Kadhe, Shailja Thakur, Chad DeLuca, Hima Patel
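TL;DR: This paper proposes STaD (Scaffolded Task Design), a framework for systematically identifying gaps in the compositional reasoning skills of large language models (LLMs). It generates controlled variations of benchmark tasks based on the concept of scaffolding, introducing support in a structured, incremental way, so that model behavior can be probed at scale and the specific combinations of reasoning skills a model lacks can be identified.
Details
Motivation: Aggregate benchmark scores give limited insight into LLMs' compositional skill gaps and how to fix them, so a method is needed to expose these weaknesses systematically.
Result: Experiments on three reasoning benchmarks with six models of varying sizes reveal multiple failure points and highlight each model's distinct skill gaps.
Insight: The novelty is bringing the concept of scaffolding from educational psychology into LLM evaluation. Controlled, incremental task variations allow systematic and scalable probing of compositional skill gaps in black-box models, which is more informative than inspecting failures one by one.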
Abstract: Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve them. To make these weaknesses visible, we propose Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions they lack. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model’s unique and distinct skill gaps.
[75] Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs cs.CL
Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak
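TL;DR: This paper studies how multimodal LLMs perform on multi-digit multiplication and finds that models fail to compute exactly across modalities (text, image, audio), with the main bottleneck in computation rather than perception. The authors introduce a controlled multimodal multiplication benchmark, define arithmetic load (C) as a predictor of performance, and use a loss probe to analyze which heuristic reasoning strategies models favor.
Details
Motivation: Existing multimodal LLMs can perceive numerical content across modalities but struggle with exact multi-digit multiplication, and there is no systematically paired cross-modal benchmark for comparing arithmetic limits across models.
Result: On the proposed benchmark, accuracy falls sharply as arithmetic load C grows, often nearing zero. C predicts performance across modalities and models (R-squared often > 0.5). Perception checks show near-perfect accuracy (> 99%) across modalities, indicating that the degradation stems mainly from computation rather than perception.
Insight: The contributions are: a controlled multimodal multiplication benchmark with arithmetic load C as a compact proxy for performance; a perception-versus-computation decomposition showing that multimodal degradation is computational in nature; and a forced-completion loss probe showing that models favor heuristics such as distributive decomposition, with evidence that the base model already maintains a well-tuned internal router.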
Abstract: Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total and non-zero digit count as a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C > 100. Indeed, C remains predictive of performance across modalities and models, with R-squared often > 0.5, nearing the value from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes, including columnar multiplication, distributive decomposition, and rounding/compensation. Here, decomposition is favored in both text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating the base model maintains a well-tuned internal router.
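The arithmetic load C defined in this abstract is simple enough to compute directly. The sketch below assumes that both digit counts are taken over the two operands together; the abstract does not spell this out, so treat that choice (and the function name `arithmetic_load`) as an assumption rather than the paper's definition.

```python
def arithmetic_load(a: int, b: int) -> int:
    """Arithmetic load C for the product a * b.

    C = (total digit count) * (non-zero digit count).
    Assumption: both counts run over the digits of a and b combined.
    """
    digits = str(abs(a)) + str(abs(b))
    total = len(digits)
    nonzero = sum(ch != "0" for ch in digits)
    return total * nonzero


if __name__ == "__main__":
    print(arithmetic_load(12, 34))        # 4 digits, 4 non-zero  -> 16
    print(arithmetic_load(1000, 2000))    # 8 digits, 2 non-zero  -> 16
    print(arithmetic_load(98765, 43210))  # 10 digits, 9 non-zero -> 90
```

Under this reading, sparse operands such as 1000 x 2000 get the same load as a dense two-digit problem, which matches the abstract's framing of C as a proxy for operation count rather than operand length.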
[76] Model in Distress: Sentiment Analysis on French Synthetic Social Media cs.CL
Pierre-Carl Langlais, Pavel Chizhov, Yannick Detrois, Carlos Rosas Hinostroza, Ivan P. Yamshchikov
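TL;DR: This paper presents a generalizable synthetic data generation pipeline that addresses the high cost of annotated data, the scarcity of multilingual evaluation sets, and privacy concerns in analyzing customer feedback on social media. Using backtranslation with fine-tuned models, the pipeline generates 1.7 million synthetic French tweets from a small seed corpus, complemented by synthetic reasoning traces. The resulting 600M-parameter reasoning models reach 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders.
Details
Motivation: Automated analysis of customer feedback on social media faces three challenges: the high cost of annotated training data, the scarcity of evaluation sets in multilingual settings, and privacy concerns that limit data sharing and reproducibility.
Result: In a case study on customer distress detection in French public transportation, the 600M-parameter reasoning models reach 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders.
Insight: The novelty is a generalizable synthetic data pipeline that combines backtranslation with synthetic reasoning traces. It lowers annotation costs substantially, protects privacy by avoiding exposure of sensitive user data, and can be carried over to other use cases and languages.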
Abstract: Automated analysis of customer feedback on social media is hindered by three challenges: the high cost of annotated training data, the scarcity of evaluation sets, especially in multilingual settings, and privacy concerns that prevent data sharing and reproducibility. We address these issues by developing a generalizable synthetic data generation pipeline applied to a case study on customer distress detection in French public transportation. Our approach utilizes backtranslation with fine-tuned models to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. We train 600M-parameter reasoners with English and French reasoning that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders. Beyond reducing annotation costs, our pipeline preserves privacy by eliminating the exposure of sensitive user data. Our methodology can be adopted for other use cases and languages.
[77] Reasoning Models Know What’s Important, and Encode It in Their Activations cs.CL
Yaniv Nikankin, Martin Tutek, Tomer Ashuach, Jonathan Rosenfeld, Yonatan Belinkov
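TL;DR: This paper studies how a language model's internal activations encode the importance of reasoning steps when it generates reasoning chains for complex tasks. It finds that activations carry more information about step importance than the tokens of the chain themselves, and that the model forms an internal representation of importance before generating subsequent steps.
Details
Motivation: The aim is to understand how language models handle complex reasoning tasks, in particular which steps in a reasoning chain matter most for the final answer and why, which remains a central open question.
Result: Probes trained on model activations to predict step importance show that models encode an internal representation of step importance. This representation generalizes across models, is distributed across layers, and does not correlate with surface-level features such as a step's relative position or length.
Insight: The key finding is that analyzing internal activations, rather than surface tokens alone, is essential for understanding the reasoning process: it reveals aspects that surface-level approaches fundamentally miss, which points to a new direction for reasoning analysis.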
Abstract: Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. This internal representation of importance generalizes across models, is distributed across layers, and does not correlate with surface-level features, such as a step’s relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.
[78] Multilingual Training and Evaluation Resources for Vision-Language Models cs.CL | cs.AI
Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini
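TL;DR: Current vision-language model (VLM) development relies mainly on English and lacks multilingual multimodal training data and evaluation benchmarks. This paper addresses the problem with a comprehensive suite of resources covering five European languages (English, French, German, Italian, and Spanish). It adopts a regeneration-translation paradigm that combines curated synthetic generation with manual annotation to build high-quality cross-lingual resources: Multi-PixMo, a multilingual training corpus, and a set of multilingual evaluation benchmarks created by translating widely used English datasets. Experiments show that training on multilingual multimodal data is consistently beneficial on non-English benchmarks and also transfers positively to English.
Details
Motivation: VLM development depends heavily on English, which leaves a shortage of multilingual multimodal training datasets and of comprehensive cross-lingual evaluation benchmarks and limits global applicability.
Result: Resource quality is assessed through human analysis (qualitative and quantitative, including inter-annotator agreement) and ablation studies. Experiments with 3 different models show that training on multilingual multimodal data is consistently beneficial on non-English benchmarks, with positive transfer to English.
Insight: The novelty is a regeneration-translation paradigm for building high-quality cross-lingual resources, including the Multi-PixMo training corpus and multilingual evaluation benchmarks, which addresses the scarcity of multilingual resources for VLMs. Its systematic construction method and empirical validation provide a scalable framework and data foundation for extending VLMs beyond English.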
Abstract: Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite their growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, with respect to English only, in VLM training. Experiments comprising 3 different models show that using multilingual, multimodal examples for training VLMs is consistently beneficial on non-English benchmarks, with positive transfer to English as well.
[79] HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents cs.CL
Shuqi Cao, Jingyi He, Fei Tan
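TL;DR: This paper proposes HiGMem, a hierarchical, LLM-guided memory system for long-term conversational LLM agents. Using event summaries as semantic anchors, the LLM predicts which related dialogue turns are worth reading, so retrieval first inspects high-level event summaries and then focuses on a small set of potentially useful turns. This yields a concise and reliable evidence set and avoids the bloated evidence sets of pure vector retrieval without excessive retrieval overhead.
Details
Motivation: Existing memory systems for long-term conversational LLM agents, including hierarchical ones, rely too heavily on vector-similarity retrieval. The retrieved evidence sets are bloated and imprecise, raise answer-stage context cost, and are hard to inspect and manage.
Result: On the LoCoMo10 benchmark, HiGMem achieves the best F1 on four of five question categories and improves adversarial F1 from 0.54 (A-Mem) to 0.78, while retrieving an order of magnitude fewer turns.
Insight: The novelty is a two-level (event-turn) memory system with an LLM-guided reasoning step that uses event summaries as semantic anchors to actively select and filter relevant turns, giving more precise and efficient memory retrieval than passive vector-similarity matching alone.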
Abstract: Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases answer-stage context cost, and makes retrieved memories harder to inspect and manage. To address this, we propose HiGMem (Hierarchical and LLM-Guided Memory System), a two-level event-turn memory system that allows LLMs to use event summaries as semantic anchors to predict which related turns are worth reading. This allows the model to inspect high-level event summaries first and then focus on a smaller set of potentially useful turns, providing a concise and reliable evidence set through reasoning, while avoiding the retrieval overhead that would be excessively high compared to vector retrieval. On the LoCoMo10 benchmark, HiGMem achieves the best F1 on four of five question categories and improves adversarial F1 from 0.54 to 0.78 over A-Mem, while retrieving an order of magnitude fewer turns. Code is publicly available at https://github.com/ZeroLoss-Lab/HiGMem.
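The two-level retrieval flow in this abstract (read event summaries first, then only the turns inside the selected events) can be sketched as follows. This is not the HiGMem code: the `Event` class, the `retrieve` function, and especially `keyword_selector`, a toy word-overlap rule standing in for the LLM's judgment, are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    """One memory event: a short summary plus the dialogue turns it covers."""
    summary: str
    turns: List[str] = field(default_factory=list)


# A selector takes (question, candidate texts) and returns indices worth reading.
Selector = Callable[[str, List[str]], List[int]]


def keyword_selector(question: str, candidates: List[str]) -> List[int]:
    """Toy stand-in for the LLM: keep candidates sharing a word with the question."""
    q_words = set(question.lower().split())
    return [i for i, text in enumerate(candidates)
            if q_words & set(text.lower().split())]


def retrieve(question: str, events: List[Event],
             select: Selector = keyword_selector) -> List[str]:
    """Two-level retrieval: pick events by summary, then turns within them."""
    evidence: List[str] = []
    for i in select(question, [e.summary for e in events]):
        turns = events[i].turns
        evidence.extend(turns[j] for j in select(question, turns))
    return evidence


if __name__ == "__main__":
    memory = [
        Event("trip to paris planning",
              ["Alice: I booked the paris flight for May",
               "Bob: nice weather today"]),
        Event("cooking pasta dinner", ["Alice: the pasta was great"]),
    ]
    print(retrieve("when is the paris flight", memory))
```

In the example, the pasta turn shares the word "the" with the question, but it is never read because its event summary was not selected; this is the filtering effect the event level is meant to provide.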
[80] PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues cs.CL
Prajwal Vijay Kajare, Priyanshu Priya, Bikash Santra, Asif Ekbal
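TL;DR: This paper introduces PRISMA, an interpretable, emotionally intelligent negotiation dialogue system for two domains: job interviews and resource allocation. Interpretability comes from an emotion-aware, negotiation-strategy-informed chain-of-thought reasoning mechanism, and the model is trained on two newly built datasets with self-training combined with Direct Preference Optimization, so that it generates more accurate, interpretable, and emotionally appropriate negotiation responses.
Details
Motivation: Emotion is central to negotiation, shaping trust, cooperation, and long-term relationships. Negotiation dialogue systems that recognize emotions and respond to them strategically are essential for more effective human-centered interaction, and interpretability is equally important for fostering reliability and building rapport.
Result: Automatic and human evaluation on the JobNego and ResNego datasets shows that PRISMA substantially enhances interpretability and generates appropriate emotion-aware responses while improving overall negotiation effectiveness.
Insight: The novelty is the ENS-CoT reasoning mechanism, which mimics how humans handle emotions in negotiation to provide interpretability, together with self-training augmented by DPO to produce responses that are both accurate and emotionally appropriate. The two new domain-specific negotiation datasets are a further resource for related research.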
Abstract: Emotion plays a pivotal role in shaping negotiation outcomes, influencing trust, cooperation, and long-term relationships. Developing negotiation dialog systems that can recognize and respond strategically to emotions is, therefore, essential to create more effective human-centered interactions. Beyond generating emotionally appropriate responses, interpretability - understanding how a system generates a particular emotion-aware response, is critical for fostering reliability and building rapport. Driven by these aspects, in this work, we introduce PRISMA, an interpretable emotionally intelligent negotiation dialogue system targeting two application domains, viz. job interviews and resource allocation. To enable interpretability, we propose an Emotion-aware Negotiation Strategy-informed Chain-of-Thought (ENS-CoT) reasoning mechanism, which mimics human negotiation by perceiving, understanding, using, and managing emotions. Leveraging ENS-CoT, we curate two new datasets: JobNego (for job interview negotiation) and ResNego (for resource allocation negotiation). We then leverage these datasets to develop PRISMA by augmenting self-training with Direct Preference Optimization (DPO), guiding agents toward more accurate, interpretable, and emotionally appropriate negotiation responses. Automatic and human evaluation on JobNego and ResNego datasets demonstrate that PRISMA substantially enhances interpretability and generates appropriate emotion-aware responses, while improving overall negotiation effectiveness.
[81] IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters cs.CL | cs.AI
Hongwei Zheng, Weiqi Wu, Zhengjia Wang, Guanyu Jiang, Haoming Li
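TL;DR: This paper presents IceBreaker, a system that addresses the "first-message barrier" conversational agents face at the conversation initiation stage. It generates personalized conversation starters that guide users into a conversation when their intent is unclear. Its core is a two-step handshake: "Resonance-Aware Interest Distillation" first extracts resonant interests from session summaries, and "Interaction-Oriented Starter Generation" then optimizes the starter to maximize user engagement. The system passed online A/B tests on one of the world's largest conversational agent products and has been deployed.
Details
Motivation: Existing conversational agents focus on activation within ongoing dialogues and overlook the bottleneck at conversation initiation: users may have a vague need but no explicit query intent, so the conversation is hard to start. The paper aims to overcome this "first-message barrier" and move agents from passive response to proactive guidance.
Result: Online A/B tests on one of the world's largest conversational agent products show that IceBreaker improves user active days by +0.184% and click-through rate by +9.425%, and it has been deployed in production.
Insight: The novelty is framing conversation initiation as a two-step handshake that combines "Resonance-Aware Interest Distillation" (mining trigger interests from past sessions) with "Interaction-Oriented Starter Generation" (optimizing engagement through personalized preference alignment and a self-reinforced loop). This offers a new approach to proactive dialogue generation in cold-start settings and highlights the value of inferring latent user interests from session summaries rather than from immediate context.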
Abstract: Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation holds before it begins. To overcome this, we introduce Conversation Starter Generation: generating personalized starters to guide users into conversation. However, unlike in-conversation stages where immediate context guides the response, initiation must operate in a cold-start moment without explicit user intent. To pioneer in this direction, we present IceBreaker that frames human ice-breaking as a two-step handshake: (i) evoke resonance via Resonance-Aware Interest Distillation from session summaries to capture trigger interests, and (ii) stimulate interaction via Interaction-Oriented Starter Generation, optimized with personalized preference alignment and a self-reinforced loop to maximize engagement. Online A/B tests on one of the world’s largest conversational agent products show that IceBreaker improves user active days by +0.184% and click-through rate by +9.425%, and has been deployed in production.
[82] River-LLM: Large Language Model Seamless Exit Based on KV Share cs.CL
Yingtao Shen, An Zou
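TL;DR: This paper proposes River-LLM, a training-free, lightweight framework based on KV sharing that addresses the efficiency bottleneck Early Exit faces in decoder-only large language models (LLMs) because of the KV Cache Absence problem. A shared Exit River lets the missing KV cache be generated and preserved naturally during the exit process, avoiding costly recovery operations, and state transition similarity within decoder blocks is used to predict cumulative KV errors and guide precise exit decisions. The result is a significant inference speedup with high generation quality.
Details
Motivation: LLM inference latency is high, and Early Exit is a promising way to accelerate it. In decoder-only architectures, however, its efficiency is severely limited by the KV Cache Absence problem: skipped layers cannot provide the historical states that subsequent tokens need. Existing solutions such as recomputation or masking either add significant latency overhead or cause severe precision loss, and so fail to close the gap between theoretical layer reduction and practical speedup.
Result: Extensive experiments on mathematical reasoning and code generation tasks show that River-LLM achieves 1.71x to 2.16x practical speedup while maintaining high generation quality.
Insight: The claimed novelty is a training-free, lightweight exit-stream framework based on KV sharing that resolves KV cache absence in Early Exit and uses state transition similarity to predict errors and guide exit decisions. Viewed objectively, the core idea is to integrate KV cache generation into the exit process itself, which avoids the overhead of conventional recovery operations and converts layer reduction into end-to-end speedup more efficiently.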
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone’s missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves 1.71 to 2.16 times of practical speedup while maintaining high generation quality.
[83] StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning cs.CL
Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu
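TL;DR: This paper presents StepPO (Step-Aligned Policy Optimization), a step-aligned optimization paradigm for agentic reinforcement learning (Agentic RL). It argues that the conventional token-level Markov Decision Process (MDP) formulation does not suit multi-turn interactive LLM agents and should be advanced to a step-level MDP, with the step rather than the token as the proper action representation for LLM agents.
Details
Motivation: As general agent systems (such as OpenClaw and Claude Code) pursue more ambitious goals, they demand stronger agentic capabilities, such as decision making and tool use, from foundation large language models (LLMs). Agentic RL has become a key post-training paradigm for building these capabilities, but it faces new challenges such as delayed and sparse rewards and long, variable context, and the token-centric modeling and optimization paradigm inherited from traditional LLM RL no longer captures real LLM agent behavior.
Result: The abstract states that preliminary experiments provide initial evidence for the effectiveness of this perspective, but it does not specify quantitative results on any benchmark or whether the method reaches SOTA.
Insight: The core contribution is a step-aligned Agentic RL paradigm: the action representation is raised from the token level to the step level (one complete decision unit of the agent), and step-level credit assignment is proposed as the matching optimization counterpart, aligning policy optimization and reward propagation with the granularity of agent decisions. This offers a new theoretical framing and a practical lens for understanding agent behavior and strengthening general-agent capabilities in LLMs.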
Abstract: General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key systems designs required to realize step-level Agentic RL in practice and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
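To make the idea of step-level credit assignment concrete, here is a minimal sketch. The abstract does not give an algorithm, so everything below is an assumption made for illustration: discounted returns computed per step, a mean-return baseline, and broadcasting one advantage to every token of a step. The names `step_returns` and `token_advantages` are hypothetical.

```python
from typing import List, Tuple


def step_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return for each step, computed over steps rather than tokens."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def token_advantages(steps: List[Tuple[List[str], float]],
                     gamma: float = 1.0) -> List[List[float]]:
    """Give every token in a step the same step-level advantage.

    Each element of `steps` is (tokens emitted in that step, reward after the step).
    The baseline is the mean step return of this trajectory (an assumed choice).
    """
    if not steps:
        return []
    returns = step_returns([reward for _, reward in steps], gamma)
    baseline = sum(returns) / len(returns)
    return [[g - baseline] * len(tokens)
            for (tokens, _), g in zip(steps, returns)]


if __name__ == "__main__":
    trajectory = [
        (["search", "(", "q", ")"], 0.0),  # tool call
        (["read"], 0.0),                   # observation handling
        (["answer", "42"], 1.0),           # final answer, sparse reward
    ]
    for (tokens, _), adv in zip(trajectory, token_advantages(trajectory, gamma=0.5)):
        print(tokens, adv)
```

The point of the sketch is the granularity: a four-token tool call and a one-token step each receive a single credit value, so credit does not depend on how many tokens a decision happened to take.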
[84] BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources cs.CLPDF
Raghvendra Kumar, Devankar Raj, Sriparna Saha
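TL;DR: This paper presents the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks. It aims to provide a foundation for equitable, culturally grounded, and scalable NLP research in the Indian language ecosystem.
Details
Motivation: India's linguistic landscape is complex, spanning 22 scheduled languages and hundreds of marginalized dialects. Existing surveys either focus on a few high-resource languages or fold Indian languages into broader multilingual settings, and no dedicated survey consolidates resources built specifically for Indian languages, so low-resource and culturally diverse varieties remain under-covered.
Result: The survey organizes resources by linguistic phenomena, domains, and modalities; analyzes trends in annotation, evaluation, and model design; and identifies persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization.
Insight: The novelty lies in the first comprehensive resource survey dedicated to Indian languages, with an emphasis on cultural relevance and equity. It provides a structured framework for low-resource language research and supports more balanced development of Indian-language NLP.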
Abstract: India’s linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties. To address this gap, we present the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks. We organize resources by linguistic phenomena, domains, and modalities; analyze trends in annotation, evaluation, and model design; and identify persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization. This survey offers a consolidated foundation for equitable, culturally grounded, and scalable NLP research in the Indian linguistic ecosystem.
[85] MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation cs.CLPDF
Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie
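TL;DR: This paper proposes MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation. Several role-specialized agents (evidence summarization, evidence extraction, and reasoning) process the retrieved context, and a synthesis stage combines their outputs to improve generation quality when evidence is noisy, incomplete, or heterogeneous.
Details
Motivation: A single generation pass struggles to reconcile evidence effectively when the retrieved context is noisy, incomplete, or heterogeneous.
Result: Experiments on four benchmarks show that MASS-RAG consistently outperforms strong RAG baselines, especially when relevant evidence is scattered across the retrieved contexts.
Insight: The novelty lies in structuring evidence processing into multiple role-specialized agents. Exposing several intermediate views of the evidence lets the model compare and integrate complementary information before generating an answer, which improves the robustness and accuracy of the RAG system.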
Abstract: Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, and combines their outputs through a dedicated synthesis stage to produce the final answer. This design exposes multiple intermediate evidence views, allowing the model to compare and integrate complementary information before answer generation. Experiments on four benchmarks show that MASS-RAG consistently improves performance over strong RAG baselines, particularly in settings where relevant evidence is distributed across retrieved contexts.
cs.CV [Back]
[86] From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration cs.CV | cs.AIPDF
Jiaqi Shi, Yuechan Li, Xulong Zhang, Xiaoyang Qu, Jianzong Wang
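TL;DR: This paper proposes HalfV, a framework that decouples visual redundancy into universal Intrinsic Visual Redundancy (IVR) and architecture-dependent Secondary Saturation Redundancy (SSR), and handles them with a unified pruning strategy and an adaptive strategy, respectively. The goal is to accelerate inference in high-resolution multimodal large language models (MLLMs); the framework achieves superior efficiency-performance trade-offs across diverse backbones.
Details
Motivation: The explosion of visual tokens makes inference in high-resolution MLLMs prohibitively expensive. Existing acceleration methods, such as token pruning or layer sparsity, suffer from severe "backbone dependency": they perform well on Vicuna or Mistral architectures but degrade significantly when transferred to architectures such as Qwen.
Result: On Qwen25-VL, HalfV retains 96.8% of performance at a 4.1x FLOPs speedup, significantly outperforming state-of-the-art baselines. Experiments show superior efficiency-performance trade-offs across diverse backbones.
Insight: The novelty lies in using truncated matrix entropy to reveal a universal three-stage lifecycle in MLLM inference, which allows visual redundancy to be decoupled into IVR and SSR and motivates the two-stage framework. Viewed objectively, this decoupling offers general guidance for architecture-aware acceleration and reduces the backbone-dependency problem.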
Abstract: High-resolution Multimodal Large Language Models (MLLMs) face prohibitive computational costs during inference due to the explosion of visual tokens. Existing acceleration strategies, such as token pruning or layer sparsity, suffer from severe “backbone dependency”, performing well on Vicuna or Mistral architectures (e.g., LLaVA) but causing significant performance degradation when transferred to architectures like Qwen. To address this, we leverage truncated matrix entropy to uncover a universal three-stage inference lifecycle, decoupling visual redundancy into universal Intrinsic Visual Redundancy (IVR) and architecture-dependent Secondary Saturation Redundancy (SSR). Guided by this insight, we propose HalfV, a framework that first mitigates IVR via a unified pruning strategy and then adaptively handles SSR based on its specific manifestation. Experiments demonstrate that HalfV achieves superior efficiency-performance trade-offs across diverse backbones. Notably, on Qwen25-VL, it retains 96.8% performance at a 4.1$\times$ FLOPs speedup, significantly outperforming state-of-the-art baselines. Our code is available at https://github.com/civilizwa/HalfV.
[87] Latent-Compressed Variational Autoencoder for Video Diffusion Models cs.CV | cs.AIPDF
Jiarui Guan, Wenshuai Zhao, Zhengtao Zou, Juho Kannala, Arno Solin
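TL;DR: This paper proposes a Latent-Compressed VAE for video diffusion models, addressing the problem that too many latent channels in a video VAE hurt the generative performance of latent diffusion models. Instead of directly reducing the channel count, the method compresses the latent space by removing high-frequency components from video latent representations, preserving high reconstruction quality while easing diffusion model training.
Details
Motivation: VAEs used in existing video latent diffusion models typically need many latent channels to ensure high-quality reconstruction, but too many channels impede convergence of the latent diffusion model and degrade its generative performance. The paper aims to compress the latent representation without sacrificing reconstruction fidelity.
Result: At the same overall compression ratio, the proposed method achieves better video reconstruction quality than strong baselines.
Insight: The novelty is a latent compression strategy that operates in the frequency domain (removing high-frequency components) rather than reducing the number of channels. This may offer a better trade-off between reconstruction quality and downstream generative performance, and it suggests a new direction for designing encoders for generative models.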
Abstract: Video variational autoencoders (VAEs) used in latent diffusion models typically require a sufficiently large number of latent channels to ensure high-quality video reconstruction. However, recent studies have revealed that an excessive number of latent channels can impede the convergence of latent diffusion models and deteriorate their generative performance, even when reconstruction quality remains high. We propose a latent compression method that removes high-frequency components in video latent representations rather than directly reducing the number of channels, which often compromises reconstruction fidelity. Experimental results demonstrate that the proposed method achieves superior video reconstruction quality compared to strong baselines while maintaining the same overall compression ratio.
[88] Positioning radiata pine branches requiring pruning by drone stereo vision cs.CVPDF
Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
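TL;DR: This paper presents a drone-mounted stereo vision system for detecting and localising radiata pine branches to support autonomous pruning. The pipeline has two stages: branch segmentation and depth estimation. For segmentation, YOLOv8, YOLOv9, and Mask R-CNN variants are compared on a custom dataset; for depth estimation, a traditional method (SGBM) and several deep learning methods are evaluated. Branch distance is computed with a proposed centroid-based triangulation algorithm, and qualitative evaluation indicates that deep-learning-based disparity maps yield more coherent depth estimates at distances of 1-2 m.
Details
Motivation: Forestry needs automated pruning, which requires accurate branch detection and localisation; the paper pursues this with a low-cost stereo vision system.
Result: In qualitative evaluation at distances of 1-2 m, deep-learning-based disparity maps (from methods such as PSMNet and ACVNet) produce more coherent depth estimates than the traditional SGBM method, demonstrating the feasibility of the system for automated branch positioning in forestry.
Insight: The contributions include combining drone stereo vision with deep learning segmentation (YOLO / Mask R-CNN) and depth estimation methods, and proposing a centroid-based triangulation algorithm with MAD outlier rejection, which together offer a feasible route to low-cost forestry automation.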
Abstract: This paper presents a stereo-vision-based system mounted on a drone for detecting and localising radiata pine branches to support autonomous pruning. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, YOLOv8, YOLOv9, and Mask R-CNN variants are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera. For depth estimation, both a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated. A centroid-based triangulation algorithm with MAD outlier rejection is proposed to compute branch distance from the segmentation mask and disparity map. Qualitative evaluation at distances of 1-2 m indicates that the deep learning-based disparity maps produce more coherent depth estimates than SGBM, demonstrating the feasibility of low-cost stereo vision for automated branch positioning in forestry.
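The idea of centroid-based triangulation with MAD outlier rejection can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the sampling window around the mask centroid, the threshold `k`, and the function names are hypothetical, and depth is recovered with the standard rectified-stereo relation Z = f * B / d (focal length in pixels, baseline in metres, disparity in pixels).

```python
from statistics import median


def mask_centroid(mask):
    """Mean (row, col) of all foreground pixels in a binary mask."""
    pts = [(r, c) for r, row in enumerate(mask) for c, m in enumerate(row) if m]
    if not pts:
        raise ValueError("empty mask")
    n = len(pts)
    return sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n


def mad_inliers(values, k=3.0):
    """Keep values within k median absolute deviations of the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # more than half the samples agree exactly; keep only those
        return [v for v in values if v == med]
    return [v for v in values if abs(v - med) <= k * mad]


def branch_distance(mask, disparity, focal_px, baseline_m, radius=2, k=3.0):
    """Distance (m) to a branch from its mask and a disparity map (assumed layout)."""
    cr, cc = mask_centroid(mask)
    samples = [
        disparity[r][c]
        for r, row in enumerate(mask)
        for c, m in enumerate(row)
        if m and abs(r - cr) <= radius and abs(c - cc) <= radius
        and disparity[r][c] > 0  # non-positive disparity means no valid match
    ]
    if not samples:
        raise ValueError("no valid disparity near the centroid")
    d = median(mad_inliers(samples, k))
    return focal_px * baseline_m / d


# Toy 3x3 example: one disparity outlier (500 px) among consistent 50 px values.
mask = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
disp = [[50, 50, 50], [50, 500, 50], [50, 50, 50]]
# focal_px and baseline_m are illustrative values, not taken from the paper.
distance_m = branch_distance(mask, disp, focal_px=700.0, baseline_m=0.063)
```

Using the median of the inliers rather than the mean keeps the estimate stable when the segmentation mask leaks onto background pixels with very different disparity.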
[89] A Survey of Spatial Memory Representations for Efficient Robot Navigation cs.CV | cs.ROPDF
Ma. Madecheen S. Pangaliman, Steven S. Sison, Erwin P. Quilloy, Rowel Atienza
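TL;DR: This survey systematically studies the memory efficiency of spatial memory representations for robot navigation, analyzing 52 systems from occupancy grids to neural implicit representations. It introduces α, the ratio of peak runtime memory to saved map size, to quantify real deployment cost. It also proposes a standardized evaluation protocol with metrics such as memory growth rate and query latency, and shows through Pareto frontier analysis that no single paradigm dominates across all evaluation dimensions.
Details
Motivation: When vision-based robots navigate large environments, spatial memory grows without bound and exhausts computational resources, especially on embedded platforms. The survey also exposes the gap between published map sizes and actual deployment cost.
Result: Independent profiling on an NVIDIA A100 GPU shows that α spans two orders of magnitude within neural methods (from 2.3 for Point-SLAM to 215 for NICE-SLAM). On the Replica dataset, 3DGS methods achieve the best absolute accuracy at map sizes of 90-254 MB, while scene graphs provide semantic abstraction at predictable cost.
Insight: The survey proposes the α ratio as a key deployment-feasibility metric and designs an α-aware budgeting algorithm that lets practitioners assess feasibility on target hardware before implementation. It also proposes a standardized evaluation protocol that fills gaps in current benchmarks.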
Abstract: As vision-based robots navigate larger environments, their spatial memory grows without bound, eventually exhausting computational resources, particularly on embedded platforms (8-16 GB shared memory, <30 W) where adding hardware is not an option. This survey examines the spatial memory efficiency problem across 88 references spanning 52 systems (1989-2025), from occupancy grids to neural implicit representations. We introduce $\alpha = M_{\text{peak}} / M_{\text{map}}$, the ratio of peak runtime memory (the total RAM or GPU memory consumed during operation) to saved map size (the persistent checkpoint written to disk), exposing the gap between published map sizes and actual deployment cost. Independent profiling on an NVIDIA A100 GPU reveals that $\alpha$ spans two orders of magnitude within neural methods alone, ranging from 2.3 (Point-SLAM) to 215 (NICE-SLAM, whose 47 MB map requires 10 GB at runtime), showing that memory architecture, not paradigm label, determines deployment feasibility. We propose a standardized evaluation protocol comprising memory growth rate, query latency, memory-completeness curves, and throughput degradation, none of which current benchmarks capture. Through a Pareto frontier analysis with explicit benchmark separation, we show that no single paradigm dominates within its evaluation regime: 3DGS methods achieve the best absolute accuracy at 90-254 MB map size on Replica, while scene graphs provide semantic abstraction at predictable cost. We provide the first independently measured $\alpha$ reference values and an $\alpha$-aware budgeting algorithm enabling practitioners to assess deployment feasibility on target hardware prior to implementation.
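The ratio defined in the abstract lends itself to a quick feasibility check. The sketch below only illustrates the arithmetic: the function names, the `headroom` fraction, and the 8 GB budget are assumptions, and the survey's own budgeting algorithm is not described in the abstract, so this should not be read as a reimplementation of it.

```python
def alpha(peak_mb: float, map_mb: float) -> float:
    """alpha = M_peak / M_map: peak runtime memory over saved map size."""
    if map_mb <= 0:
        raise ValueError("map size must be positive")
    return peak_mb / map_mb


def fits_budget(map_mb: float, alpha_value: float, budget_mb: float,
                headroom: float = 0.8) -> bool:
    """Hypothetical check: predicted peak memory must stay within a fraction
    (headroom) of the device budget, leaving room for the rest of the stack."""
    predicted_peak_mb = map_mb * alpha_value
    return predicted_peak_mb <= budget_mb * headroom


# NICE-SLAM figures quoted in the abstract: a 47 MB map needing 10 GB at runtime.
nice_slam_alpha = alpha(peak_mb=10 * 1024, map_mb=47)

# An 8 GB embedded budget (lower end of the 8-16 GB range in the abstract).
budget_mb = 8 * 1024
nice_slam_ok = fits_budget(47, 215, budget_mb)    # alpha reported for NICE-SLAM
low_alpha_ok = fits_budget(47, 2.3, budget_mb)    # alpha reported for Point-SLAM
```

The rounded figures give a ratio close to the reported 215; the exact value depends on the unit convention and the precise measurements. The point of the check is that the same 47 MB map can be trivially deployable or out of reach depending on α alone.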
[90] Aletheia: Physics-Conditioned Localized Artifact Attention (PhyLAA-X) for End-to-End Generalizable and Robust Deepfake Video Detection cs.CV | cs.LGPDF
Devendra Ghori
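TL;DR: This paper proposes PhyLAA-X, a physics-conditioned localized artifact attention mechanism for end-to-end, generalizable, and robust deepfake video detection. It injects physical features (optical-flow curl, specular-reflectance skewness, and spatially upsampled rPPG power spectra) directly into the attention computation, forcing the network to learn manipulation boundaries where semantic inconsistencies and physical violations co-occur. The method is embedded in an efficient spatiotemporal ensemble and achieves strong results on several benchmark datasets.
Details
Motivation: State-of-the-art deepfake detectors reach near-perfect in-domain accuracy but degrade under cross-generator shifts, heavy compression, and adversarial perturbations. The core limitation is that semantic artifact learning is decoupled from physical invariants (optical-flow discontinuities, specular-reflection inconsistencies, and cardiac-modulated reflectance), which are either used as post-hoc features or ignored.
Result: The system reaches 97.2% accuracy / 0.992 AUC-ROC on FaceForensics++ (c23), 94.9% / 0.981 on Celeb-DF v2, and 90.8% / 0.966 on DFDC. In cross-generator settings it outperforms the strongest baseline (LAA-Net) by 4.1-7.3%, and it maintains 79.4% accuracy under PGD-10 attacks with epsilon = 0.02. Single-backbone ablations confirm that PhyLAA-X alone delivers a 4.2% cross-dataset AUC gain.
Insight: The novelty lies in integrating physical invariants (optical flow, specular reflection, rPPG) directly into the attention mechanism in an end-to-end differentiable way. Cross-attention gating and a resonance consistency loss force the network to learn semantic and physical violations jointly, improving generalization and robustness. The system also uses an uncertainty-aware adaptive weighting strategy for the ensemble.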
Abstract: State-of-the-art deepfake detectors achieve near-perfect in-domain accuracy yet degrade under cross-generator shifts, heavy compression, and adversarial perturbations. The core limitation remains the decoupling of semantic artifact learning from physical invariants: optical-flow discontinuities, specular-reflection inconsistencies, and cardiac-modulated reflectance (rPPG) are treated either as post-hoc features or ignored. We introduce PhyLAA-X, a novel physics-conditioned extension of Localized Artifact Attention (LAA-X). PhyLAA-X injects three end-to-end differentiable physics-derived feature volumes - optical-flow curl, specular-reflectance skewness, and spatially-upsampled rPPG power spectra - directly into the LAA-X attention computation via cross-attention gating and a resonance consistency loss. This forces the network to learn manipulation boundaries where semantic inconsistencies and physical violations co-occur - regions inherently harder for generative models to replicate consistently. PhyLAA-X is embedded across an efficient spatiotemporal ensemble (EfficientNet-B4+BiLSTM, ResNeXt-101+Transformer, Xception+causal Conv1D) with uncertainty-aware adaptive weighting. On FaceForensics++ (c23), Aletheia reaches 97.2% accuracy / 0.992 AUC-ROC; on Celeb-DF v2, 94.9% / 0.981; on DFDC, 90.8% / 0.966 - outperforming the strongest published baseline (LAA-Net [1]) by 4.1-7.3% in cross-generator settings and maintaining 79.4% accuracy under epsilon = 0.02 PGD-10 attacks. Single-backbone ablations confirm PhyLAA-X alone delivers a 4.2% cross-dataset AUC gain. The full production system is open-sourced at https://github.com/devghori1264/Aletheia (v1.2, April 2026) with pretrained weights, the adversarial corpus (referred to as ADC-2026 in this work), and complete reproducibility artifacts.
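Of the three physics-derived features, optical-flow curl is the easiest to make concrete. The sketch below computes the scalar curl of a 2D flow field, curl = ∂v/∂x − ∂u/∂y, with central differences. It is a plain finite-difference illustration written for this digest; how PhyLAA-X actually computes or differentiates through this quantity is not stated in the abstract, and `flow_curl` is a hypothetical name.

```python
def flow_curl(u, v):
    """Scalar curl of a 2D flow field given as nested lists u[y][x], v[y][x].

    u is the horizontal flow component and v the vertical one. Interior pixels
    use central differences; border pixels are left at 0.0 for simplicity.
    """
    h, w = len(u), len(u[0])
    curl = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dv_dx = (v[y][x + 1] - v[y][x - 1]) / 2.0
            du_dy = (u[y + 1][x] - u[y - 1][x]) / 2.0
            curl[y][x] = dv_dx - du_dy
    return curl


# Rigid rotation about the origin: (u, v) = (-y, x) has constant curl 2.
size = 5
u = [[-float(y) for x in range(size)] for y in range(size)]
v = [[float(x) for x in range(size)] for y in range(size)]
rotation_curl = flow_curl(u, v)

# Pure translation has zero curl everywhere.
flat = [[1.0] * size for _ in range(size)]
translation_curl = flow_curl(flat, flat)
```

Smooth, physically plausible motion tends to produce a smooth curl map, so abrupt local changes in curl are a natural candidate signal for blended or warped regions.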
[91] HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models cs.CV | cs.AIPDF
Han Liu, Jiaqi Li, Zhi Xu, Xiaotong Zhang, Xiaoming Xu
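TL;DR: This paper proposes HQA-VLAttack, a high-quality black-box adversarial attack on vision-language pre-trained models that generates adversarial examples in two stages, text and image. The text attack uses counter-fitting word vectors to preserve semantic consistency. The image attack uses layer-importance guided initialization followed by contrastive learning, aiming to lower the similarity of positive pairs while raising that of negative pairs, which increases the attack success rate.
Details
Motivation: Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task. Existing methods either incur high query costs or only reduce the similarity of positive pairs while ignoring negative pairs, which limits attack performance.
Result: Experiments on three benchmark datasets show that HQA-VLAttack significantly outperforms existing baselines in attack success rate.
Insight: The novelty lies in bringing contrastive learning into the adversarial optimization, strengthening the attack by optimizing the similarity of positive and negative pairs at the same time. In addition, the text attack uses counter-fitting word vectors to preserve semantic consistency, and the image attack uses layer-importance guided initialization; together these strategies improve the quality of the adversarial examples and the efficiency of the attack.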
Abstract: Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation generation, it leverages the counter-fitting word vector to generate the substitute word set, thus guaranteeing the semantic consistency between the substitute word and the original word. For image perturbation generation, it first initializes the image adversarial example via the layer-importance guided strategy, and then utilizes contrastive learning to optimize the image adversarial perturbation, which ensures that the similarity of positive image-text pairs is decreased while that of negative image-text pairs is increased. In this way, the optimized adversarial images and texts are more likely to retrieve negative examples, thereby enhancing the attack success rate. Experimental results on three benchmark datasets demonstrate that HQA-VLAttack significantly outperforms strong baselines in terms of attack success rate.
[92] Topology-Aware Layer Pruning for Large Vision-Language Models cs.CVPDF
Pengcheng Zheng, Chaoning Zhang, Ya Wen, Wang Liu, Qigan Sun
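TL;DR: This paper proposes a topology-aware layer pruning framework for large vision-language models (LVLMs). It models hidden states as point clouds and uses simplicial complexes and zigzag persistent homology to quantify inter-layer topological consistency, enabling adaptive pruning that outperforms existing methods across a range of sparsity ratios.
Details
Motivation: LVLMs incur high computational and memory costs, which hinders deployment in resource-constrained scenarios. Existing layer pruning methods rely on local similarity metrics or static proxy signals and fail to capture the global, dynamic evolution of representations across model depth, which often leads to the removal of transition-critical layers.
Result: Extensive experiments on diverse multimodal benchmarks show that the proposed framework consistently outperforms existing pruning methods across a range of sparsity ratios.
Insight: The novelty lies in bringing topological data analysis (TDA), in particular zigzag persistent homology, into model pruning. Quantifying the topological evolution of inter-layer representations from a global and dynamic perspective makes it possible to identify and preserve transition-critical layers more accurately.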
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning, while recent extensions that incorporate visual inputs enable them to process multimodal information. Despite these advances, Large Vision-Language Models (LVLMs) incur substantial computational and memory costs, hindering deployment in resource-constrained scenarios. Existing layer pruning methods typically rely on local similarity metrics or static proxy signals, failing to capture the global and dynamic evolution of representations across model depth, which often leads to the removal of transition-critical layers. To address this limitation, we propose a topology-aware layer pruning framework for LVLMs. Specifically, we represent layer-wise hidden states as point clouds and model their evolution using simplicial complexes. By leveraging zigzag persistent homology, we quantify inter-layer topological consistency and enable adaptive pruning that preserves critical representational transitions. Extensive experiments on diverse multimodal benchmarks demonstrate that the proposed framework consistently outperforms existing pruning methods across a wide range of sparsity ratios. Our code is available at https://github.com/zpc456/TopoVLM.
[93] Motif-Video 2B: Technical Report cs.CV | cs.AIPDF
Junghwan Lim, Wai Ting Cheung, Minsu Ha, Beomgyu Kim, Taewhan Kim
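TL;DR: Motif-Video 2B is a 2-billion-parameter text-to-video generation model. Its core claim is that video generation quality can be improved through careful architectural specialization rather than by relying only on large-scale data and compute. The model uses Shared Cross-Attention to strengthen text control over long video sequences, and a three-part backbone that separates early fusion, joint representation learning, and detail refinement. Paired with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder, it achieves strong performance under a limited compute budget.
Details
Motivation: The paper tackles the challenge of training a high-quality text-to-video model under a limited budget (fewer than 10M video clips and less than 100,000 H200 GPU hours). The core motivation is to explore whether organizing model capacity better (architectural specialization) can improve performance without simply scaling up.
Result: On VBench, Motif-Video 2B reaches 83.76%, surpassing the 14-billion-parameter Wan2.1 model while using 7x fewer parameters and substantially less training data.
Insight: The main novelty is an architecture that separates tasks that interfere with one another in video generation (prompt alignment, temporal consistency, and fine-detail recovery), implemented through Shared Cross-Attention and a three-part backbone. Viewed objectively, combining efficient training strategies (dynamic token routing, early-phase feature alignment) with a specialized architecture offers a reusable path toward high-performance video generation under limited resources.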
Abstract: Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7$\times$ fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
[94] From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms cs.CV | cs.LGPDF
Nicholas Pather, Joshua Fouché, Sitwala Mundia, Karl-Günter Technau, Thokozile Malaba
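TL;DR: This paper benchmarks 17 frontier multimodal large language models on digitising a complex handwritten medical form, evaluating how they handle mixed dates, printed text, and handwritten responses. The latest Google and OpenAI models reach about 85% accuracy and weighted F1 scores of about 90% on structured fields. Gemini 3.1 performs best overall, and prompt optimisation substantially improves macro metrics.
Details
Motivation: Manual digitisation of structured handwritten documents is slow and costly, particularly in low- and middle-income countries, so the potential of multimodal large language models to automate complex handwritten workflows needs to be evaluated.
Result: On a very challenging real-world medical form benchmark, the latest models each show task-specific strengths. Gemini 3.1 is best on free-text error rates (WER = 0.50, CER = 0.31) and on classification metrics; GPT 5.4 leads on noisy date extraction and on reliability (lowest hallucination rate, 6%); Claude Sonnet 4.6 has the best average performance on formatted fields (dates and numerical values). Prompt optimisation improves macro precision, recall, and F1 by over 60%.
Insight: The study supports the feasibility and rapid progress of multimodal large language models for automated digitisation of complex handwritten documents and offers a technical path for such applications. It also shows that models differ in their strengths on specific subtasks (date extraction, formatted fields, free-text recognition) and that prompt engineering has a large effect on macro metrics, which gives practical guidance for model selection and optimisation.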
Abstract: Manual digitisation of structured handwritten documents is slow and costly. We benchmark 17 leading frontier multi-modal large language models and open-source models against a very challenging real-world medical form that mixes dates; structured, printed text; handwritten responses; and significant variability challenges. None of the smaller or older models perform well but the latest Google and OpenAI models reach accuracies around 85% with weighted F1 scores of approximately 90% across the discrete or predefined fields despite the very challenging nature of the responses. Clear task-specific strengths emerge: GPT 5.4 excels in noisy date extraction as well as reliability with the lowest hallucination rate (6%). Claude Sonnet 4.6 had the best average performance across formatted fields (dates and numerical values), while Gemini 3.1 delivered the best overall performance, with the lowest free text error rates (WER = 0.50 and CER = 0.31) and the strongest results across discrete classification metrics. We further show that prompt optimisation dramatically improves macro precision, recall and F1 by over 60%, but has little impact on weighted metrics (only about 2-5% improvement). These results provide evidence that the rapid improvements of multimodal large language models offer a compelling pathway toward fully automated digitisation of complex handwritten workflows that is particularly relevant in low- and middle-income countries.
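The free-text metrics quoted above, WER and CER, are both edit distances normalised by reference length. The sketch below uses the standard Levenshtein definition; the paper's exact text normalisation (casing, punctuation, whitespace handling) is not given in the abstract, so the whitespace tokenisation here is an assumption and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented example: one misread word out of four, one wrong character out of 22.
reference = "patient has mild fever"
hypothesis = "patient had mild fever"
example_wer = wer(reference, hypothesis)
example_cer = cer(reference, hypothesis)
```

A single misread character costs a whole word in WER but only one character in CER, which is why a WER of 0.50 can coexist with a noticeably lower CER of 0.31 on the same handwriting.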
[95] Predicting Blastocyst Formation in IVF: Integrating DINOv2 and Attention-Based LSTM on Time-Lapse Embryo Images cs.CV | cs.AI | cs.LGPDF
Zahra Asghari Varzaneh, Niclas Wölner-Hanssen, Reza Khoshkangini, Thomas Ebner, Magnus Johnsson
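TL;DR: This paper proposes a hybrid model that combines the DINOv2 vision transformer with an LSTM equipped with a multi-head attention layer to predict blastocyst formation from a limited number of daily time-lapse embryo images, with the aim of supporting embryo selection in in vitro fertilization (IVF).
Details
Motivation: Selecting the optimal embryo in IVF relies on manual inspection of extensive time-lapse imaging data. Accurate prediction of blastocyst formation is especially needed when clinics lack complete time-lapse systems and only a limited number of daily images is available.
Result: On a real dataset of 704 embryo videos, the model achieves 96.4% accuracy, surpassing existing methods, and it still performs well when image frames are missing.
Insight: The novelty lies in using DINOv2 to extract features from embryo images and combining it with an attention-enhanced LSTM to model development over time. Viewed objectively, the approach has practical value for resource-limited IVF laboratories, since handling incomplete data makes the predictions more robust.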
Abstract: The selection of the optimal embryo for transfer is a critical yet challenging step in in vitro fertilization (IVF), primarily due to its reliance on the manual inspection of extensive time-lapse imaging data. A key obstacle in this process is predicting blastocyst formation from the limited number of daily images available. Many clinics also lack complete time-lapse systems, so full videos are often unavailable. In this study, we aimed to predict which embryos will develop into blastocysts using limited daily images from time-lapse recordings. We propose a novel hybrid model that combines DINOv2, a transformer-based vision model, with an enhanced long short-term memory (LSTM) network featuring a multi-head attention layer. DINOv2 extracts meaningful features from embryo images, and the LSTM model then uses these features to analyze embryo development over time and generate final predictions. We tested our model on a real dataset of 704 embryo videos. The model achieved 96.4% accuracy, surpassing existing methods. It also performs well with missing frames, making it valuable for many IVF laboratories with limited imaging systems. Our approach can assist embryologists in selecting better embryos more efficiently and with greater confidence.
[96] Medical thinking with multiple images cs.CV | cs.CLPDF
Zonghai Yao, Benlu Wang, Yifan Zhang, Junda Wang, Iris Xia
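TL;DR: This paper introduces MedThinkVQA, an expert-annotated benchmark for multi-image medical reasoning. The dataset contains 8,067 cases with an average of 6.62 images per case, far more than prior work. Evaluating several large language models on the benchmark shows that even the best closed-source model (Claude-4.6-Opus) reaches only about 57% accuracy, and open-source models do worse. The analysis indicates that the main bottleneck is grounded, cross-view visual reasoning: reliably extracting, aligning, and integrating evidence from multiple images.
Details
Motivation: Large language models perform well on many medical QA benchmarks, but real clinical reasoning usually requires integrating evidence across multiple images (such as medical images from different views) rather than interpreting a single view. Current benchmarks (at most 1.43 images per case on average) cannot adequately evaluate this multi-image reasoning ability.
Result: On the MedThinkVQA test set (720 cases), the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach 57.2%, 55.3%, and 54.9% accuracy. The best open-source models, Qwen3.5-397B-A17B and Qwen3.5-27B, reach 52.2% and 50.6%. More than 70% of errors arise from image reading and cross-view integration.
Insight: The core contribution is MedThinkVQA, a multi-image medical reasoning benchmark with high image density (6.62 images per case on average), intermediate supervision, and step-level evaluation, which is closer to real clinical practice. The key finding is that the main challenge in multimodal medical reasoning is not reasoning length alone but reliable mechanisms for grounding, aligning, and composing evidence across multiple real-world clinical images. When early visual evidence extraction is weak, additional inference-time computation yields limited or even unstable gains. This points future model design toward stronger multi-image visual understanding and evidence integration rather than simply longer reasoning chains.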
Abstract: Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by Qwen3.5-397B-A17B at 52.2% and Qwen3.5-27B at 50.6%. Further analysis identifies grounded multi-image reasoning as the main bottleneck: models often fail to extract, align, and compose evidence across views before higher-level inference can help. Providing expert single-image cues and cross-image summaries improves performance, whereas replacing them with self-generated intermediates reduces accuracy. Step-level analysis shows that over 70% of errors arise from image reading and cross-view integration. Scaling results further show that additional inference-time computation helps only when visual grounding is already reliable; when early evidence extraction is weak, longer reasoning yields limited or unstable gains and can amplify misread cues. These results suggest that the key challenge is not reasoning length alone, but reliable mechanisms for grounding, aligning, and composing distributed evidence across real-world multimodal clinical inputs.
[97] BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation cs.CV | cs.LGPDF
Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu
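TL;DR: This paper proposes BARD, a framework for efficiently converting a pretrained autoregressive vision-language model (VLM) into a large-block diffusion vision-language model (dVLM) with the same architecture but more efficient decoding. The method uses progressive supervised block merging to gradually enlarge the decoding block size, combined with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover lost performance, and adds a mixed noise scheduler and memory-friendly training. Experiments show that, with limited data, BARD-VL transfers the multimodal capability of Qwen3-VL to a large-block dVLM, reaches SOTA among comparable open dVLMs at the 4B and 8B scales, and achieves up to a 3x decoding throughput speedup.
Details
Motivation: Autoregressive VLMs have strong multimodal capability, but token-by-token decoding is an inherent inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block dVLM often causes substantial quality degradation. The paper aims to solve this conversion problem and obtain efficient decoding without sacrificing performance.
Result: With at most 4.4M data, BARD-VL transfers the capability of Qwen3-VL to a large-block dVLM and sets a new SOTA among comparable-scale open dVLMs on the evaluation suite at both 4B and 8B scales, while achieving up to a 3x decoding throughput speedup over the source model.
Insight: The contributions include: combining progressive supervised block merging with stage-wise intra-dVLM distillation, which avoids the performance loss of direct autoregressive-to-diffusion distillation; a mixed noise scheduler that improves denoising robustness and token revision; and memory-friendly training that supports efficient training on long multimodal sequences. The key finding is that distillation within the diffusion regime is more effective than cross-paradigm distillation.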
Abstract: Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3$\times$ decoding throughput speedup compared to the source model.
[98] Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations cs.CV | cs.CR | cs.LGPDF
Jiachen Qian, Zhaolu Kang
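TL;DR: This paper studies the adversarial robustness of multimodal large language models (MLLMs) in screenshot-based, price-constrained settings. The authors identify a phenomenon they call Visual Dominance Hallucination, in which tiny visual adversarial perturbations override textual price evidence and lead agents to irrational financial decisions. They propose PriceBlind, an attack framework that exploits the modality gap in CLIP-based encoders, using a Semantic-Decoupling Loss to align the image embedding with low-cost value anchors.
Details
Motivation: As MLLMs are used to execute high-stakes financial transactions, their adversarial robustness remains underexplored. The paper aims to expose and exploit their vulnerability in price-constrained settings, where visual information can overpower textual evidence and cause wrong purchasing decisions.
Result: On E-ShopBench, PriceBlind achieves an attack success rate of around 80% in white-box evaluation. Under a simplified single-turn coordinate-selection protocol, black-box transfer attacks with Ensemble-DI-FGSM on GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet succeed roughly 35-41% of the time. The paper also shows that robust encoders and Verify-then-Act defenses reduce the attack success rate substantially, at some cost in clean accuracy.
Insight: The core contribution is exposing Visual Dominance Hallucination as a security vulnerability in multimodal agents and proposing a targeted attack framework that exploits the modality gap. Viewed objectively, the work underlines the importance of cross-modal consistency verification in multimodal systems and offers a new perspective and method for evaluating and improving their adversarial robustness.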
Abstract: The rapid proliferation of Multimodal Large Language Models (MLLMs) has enabled mobile agents to execute high-stakes financial transactions, but their adversarial robustness remains underexplored. We identify Visual Dominance Hallucination (VDH), where imperceptible visual cues can override textual price evidence in screenshot-based, price-constrained settings and lead agents to irrational decisions. We propose PriceBlind, a stealthy white-box adversarial attack framework for controlled screenshot-based evaluation. PriceBlind exploits the modality gap in CLIP-based encoders via a Semantic-Decoupling Loss that aligns the image embedding with low-cost, value-associated anchors while preserving pixel-level fidelity. On E-ShopBench, PriceBlind achieves around 80% ASR in white-box evaluation; under a simplified single-turn coordinate-selection protocol, Ensemble-DI-FGSM transfers with roughly 35-41% ASR across GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet. We also show that robust encoders and Verify-then-Act defenses reduce ASR substantially, though with some clean-accuracy trade-off.
[99] SmoGVLM: A Small, Graph-enhanced Vision-Language Model cs.CV | cs.CLPDF
Debjyoti Mondal, Rituraj Singh, Subhadarshi Panda
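TL;DR: SmoGVLM is a small, graph-enhanced vision-language model that integrates structured knowledge with visual and textual modalities and uses graph neural networks to improve multimodal reasoning. The method is effective across model sizes (from 1.3B to 13B); a small model gains up to 16.24% and even surpasses larger vision-language models and strong fine-tuned baselines.
Details
Motivation: Large vision-language models suffer from hallucination and poor grounding in knowledge-intensive reasoning. The paper introduces structured knowledge to improve the efficiency and performance of small models.
Result: Across several model sizes, a small model gains up to 16.24% and surpasses larger vision-language models and strong fine-tuned baselines, showing the potential of structured knowledge augmentation.
Insight: The novelty lies in combining graph neural networks with vision-language models and using structured knowledge to strengthen multimodal reasoning. Viewed objectively, the approach effectively improves small-model performance and suggests a direction for efficient multimodal systems.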
Abstract: Large vision-language models (VLMs) achieve strong performance on multimodal tasks but often suffer from hallucination and poor grounding in knowledge-intensive reasoning. We propose SmoGVLM, a small, graph-enhanced VLM that integrates structured knowledge with visual and textual modalities, using Graph Neural Networks. We investigate the effects of our method across a range of model sizes, from tiny (1.3B) to large (13B) models. Our results demonstrate that, when trained using our approach, a small model can achieve performance gains of up to 16.24% and surpass larger VLMs and strong fine-tuned baselines. These results highlight the potential of structured knowledge augmentation for efficient, smaller-scale multimodal reasoning systems.
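[100] Privacy-Preserving Semantic Segmentation without Key Management cs.CV | cs.CR
Mare Hirose, Shoko Imaizumi, Hitoshi Kiya
TL;DR: This paper proposes a privacy-preserving semantic segmentation method that lets each client and each image use an independent encryption key. Model training and inference run on encrypted images with no centralized key management, and the method is validated on the Cityscapes dataset with the vision transformer-based SETR model.
Details
Motivation: Conventional privacy-preserving semantic segmentation requires complex key management that can itself leak private information. The goal is a more flexible and secure encrypted-image segmentation scheme that needs no key management.
Result: Experiments with SETR on Cityscapes confirm that the method effectively limits performance degradation while performing semantic segmentation under privacy protection.
Insight: The novelty is that each client and image is encrypted locally with an independent key, without centralized key management, while applying image encryption during training mitigates the performance loss. This offers a new direction for distributed privacy-preserving vision tasks.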
Abstract: This paper proposes a novel privacy-preserving semantic segmentation method that can use independent keys for each client and image. In the proposed method, the model creator and each client encrypt images using locally generated keys, and model training and inference are conducted on the encrypted images. To mitigate performance degradation, an image encryption method is applied to model training in addition to the generation of test images. In experiments, the effectiveness of the proposed method is confirmed on the Cityscapes dataset under the use of a vision transformer-based model, called SETR.
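[101] Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF cs.CV | cs.AI
Nicklas Neu, Thomas Ebner, Jasmin Primus, Bernhard Schenkenfelder, Raphael Zefferer
TL;DR: This paper presents an expert-annotated embryo image dataset that pairs embryo images with natural-language morphological descriptions to support evidence-based patient communication in in-vitro fertilization. The dataset can be used to fine-tune vision-language foundation models for automated embryo assessment, and extracting evidence from the literature makes decisions more transparent.
Details
Motivation: AI methods for embryo selection lack interpretability and are hard to adapt to clinical data, and patients increasingly question expert decisions. The paper aims to support transparent decision-making and patient communication through evidence-based decision justification.
Result: The abstract reports no quantitative results or benchmarks. It states that the dataset enables high-accuracy model fine-tuning and has the potential to significantly improve decision-making and patient outcomes.
Insight: The novelty is a dataset that combines embryo images with natural-language descriptions, enabling interpretable automated embryo assessment with vision-language models and using literature evidence extraction to make clinical decisions more transparent and scientifically grounded.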
Abstract: Embryo selection is one of multiple crucial steps in in-vitro fertilization, commonly based on morphological assessment by clinical embryologists. Although artificial intelligence methods have demonstrated their potential to support embryo selection by automated embryo ranking or grading methods, the overall impact of AI-based solutions is still limited. This is mainly due to the required adaptation of automated solutions to custom clinical data, reliance on time lapse incubators and a lack of interpretability to understand AI reasoning. The modern, informed patient is questioning expert decisions, particularly if the treatment is not successful. Thus, evidence-based decision justification in tasks like embryo selection would support transparent decision making and respectful patient communication. To support this aim, we hereby present an expert-annotated dataset consisting of embryo images and corresponding morphological description using natural language. The description contains relevant information on embryonic cell cycle, developmental stage and morphological features. This dataset enables the finetuning of modern foundational vision-language models to learn and improve over time with high accuracy. Predicted embryo descriptions can then be leveraged to automatically extract scientific evidence from literature, facilitating well-informed, evidence-based decision-making and transparent communication with patients. Our proposed dataset supports research in language-based, interpretable, and transparent automated embryo assessment and has the potential to enhance the decision-making process and improve patient outcomes significantly over time.
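[102] Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models cs.CV | cs.AI
Emily Curl, Kofi Ampomah, Md Erfan, Sayanton Dibbo
TL;DR: Evaluations of adversarial attacks on medical imaging deep learning models rely too heavily on a single metric, attack success rate (ASR), and this paper argues for multi-metric evaluation instead. A systematic empirical study across four medical imaging datasets, seven models, and seven attack methods finds that perceptual quality metrics (such as PSNR and SSIM) correlate only minimally with ASR, so ASR alone cannot fully characterize adversarial robustness and transferability.
Details
Motivation: Adversarial vulnerability assessments of medical imaging AI models rely mainly on ASR, a binary metric that ignores perturbation strength, perceptual image quality, and cross-architecture transferability, which leaves the evaluation incomplete. As newer architectures such as Vision Transformers (ViTs) challenge the dominance of CNNs, it is unclear whether a single metric can capture adversarial behavior.
Result: On four datasets (PathMNIST, DermaMNIST, RetinaMNIST, and CheXpert), seven models (VGG-16, ResNet-50, DenseNet-121, Inception-v3, DeiT, Swin Transformer, and ViT-B/16) were evaluated against seven attack methods at five perturbation budgets. Perceptual metrics (PSNR, SSIM) and the distortion metric ($L_2$ norm) are strongly associated with one another but correlate minimally with ASR, and this pattern holds for both CNNs and ViTs.
Insight: The paper systematically demonstrates, for medical imaging, the limitations of ASR as an evaluation metric for adversarial attacks, and shows empirically that perceptual quality metrics are decoupled from attack success. Assessing adversarial risk in medical AI therefore calls for a multi-metric framework covering attack efficacy, methodology, and associated overheads rather than ASR alone.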
Abstract: While deep learning systems are becoming increasingly prevalent in medical image analysis, their vulnerabilities to adversarial perturbations raise serious concerns for clinical deployment. These vulnerability evaluations largely rely on Attack Success Rate (ASR), a binary metric that indicates solely whether an attack is successful. However, the ASR metric does not account for other factors, such as perturbation strength, perceptual image quality, and cross-architecture attack transferability, and therefore, the interpretation is incomplete. This gap requires consideration, as complex, large-scale deep learning systems, including Vision Transformers (ViTs), are increasingly challenging the dominance of Convolutional Neural Networks (CNNs). These architectures learn differently, and it is unclear whether a single metric, e.g., ASR, can effectively capture adversarial behavior. To address this, we perform a systematic empirical study on four medical image datasets: PathMNIST, DermaMNIST, RetinaMNIST, and CheXpert. We evaluate seven models (VGG-16, ResNet-50, DenseNet-121, Inception-v3, DeiT, Swin Transformer, and ViT-B/16) against seven attack methods at five perturbation budgets, measuring ASR, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and $L_2$ perturbation magnitude. Our findings show a consistent pattern: perceptual and distortion metrics are strongly associated with one another and exhibit minimal correlation with ASR. This applies to both CNNs and ViTs. The results demonstrate that ASR alone is an inadequate indicator of adversarial robustness and transferability. Consequently, we argue that a thorough assessment of adversarial risk in medical AI necessitates multi-metric frameworks that encompass not only the attack efficacy but also its methodology and associated overheads.
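The four quantities named in this abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' evaluation code: images are assumed to be flat lists of floats in [0, 1], the ASR convention used here (counting only samples the model classified correctly before the attack) is one common choice that the abstract does not specify, and SSIM is omitted because it needs windowed statistics. All function names are hypothetical.

```python
import math

def l2_norm(clean, adv):
    """L2 magnitude of the perturbation (adv - clean)."""
    return math.sqrt(sum((a - c) ** 2 for c, a in zip(clean, adv)))

def psnr(clean, adv, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    mse = sum((a - c) ** 2 for c, a in zip(clean, adv)) / len(clean)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)

def attack_success_rate(clean_preds, adv_preds, labels):
    """Fraction of originally correct samples that the attack misclassifies.

    Assumption: samples misclassified before the attack are excluded.
    """
    eligible = [(a, y) for p, a, y in zip(clean_preds, adv_preds, labels) if p == y]
    if not eligible:
        return 0.0
    return sum(1 for a, y in eligible if a != y) / len(eligible)

def pearson(xs, ys):
    """Pearson correlation, e.g. between per-setting PSNR values and ASR values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)
```

In a study of this kind, one would presumably collect one (ASR, PSNR, SSIM, $L_2$) tuple per model, attack, and perturbation budget, then pass the resulting columns to `pearson` to check how strongly each image-quality metric tracks ASR.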
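[103] BOOKAGENT: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration cs.CV
Bo Gao, Chang Liu, Yuyang Miao, Siyuan Ma, Ser-Nam Lim
TL;DR: BookAgent is a safety-aware multi-agent collaboration framework for generating high-quality, safe, and controllable illustrated storybooks. It synthesizes a complete storybook end to end from a user draft by jointly planning, scripting, illustrating, and globally repairing inconsistencies. It dynamically calibrates page-level alignment between textual scripts and visual layouts, and it verifies and then rectifies global inconsistencies in character identity and narrative logic along the temporal dimension.
Details
Motivation: Large generative models have advanced multimodal generation, but producing illustrated storybooks remains difficult. Prior work mostly decomposes the task into separate stages, which limits holistic multimodal grounding, and rarely integrates child-specific safety constraints into narrative planning and sequence-level multimodal verification.
Result: Extensive experiments show that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, offering a robust paradigm for reliable agents in complex multimodal creation.
Insight: The novelty is a multi-agent cognitive calibration framework that combines dynamic page-level alignment with holistic consistency verification along the temporal dimension, enabling safety-aware end-to-end storybook synthesis beyond the limits of conventional staged pipelines.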
Abstract: Recent advancements in Large Generative Models (LGMs) have revolutionized multi-modal generation. However, generating illustrated storybooks remains an open challenge, where prior works mainly decompose this task into separate stages, and thus, holistic multi-modal grounding remains limited. Besides, while safety alignment is studied for text- or image-only generation, existing works rarely integrate child-specific safety constraints into narrative planning and sequence-level multi-modal verification. To address these limitations, we propose BookAgent, a safety-aware multi-agent collaboration framework designed for high-quality, safety-aware visual narratives. Different from prior story visualization models that assume a fixed storyline sequence, BookAgent targets end-to-end storybook synthesis from a user draft by jointly planning, scripting, illustrating, and globally repairing inconsistencies. To ensure precise multi-modal grounding, BookAgent dynamically calibrates page-level alignment between textual scripts and visual layouts. Furthermore, BookAgent calibrates holistic consistency from the temporal dimension, by verifying-then-rectifying global inconsistencies in character identity and storytelling logic. Extensive experiments demonstrate that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, offering a robust paradigm for reliable agents in complex multi-modal creation. The implementation will be publicly released at https://github.com/bogao-code/BookAgent/tree/main.
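[104] Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion cs.CV | cs.AI
Zhenggang Tang, Yuehao Wang, Yuchen Fan, Jun-Kun Chen, Yu-Ying Yeh
TL;DR: This paper proposes a new paradigm of sequential text-to-scene generation in which an autoregressive diffusion model, 3D-ARD+, generates scene layout and object shapes together from text instructions. The method unifies autoregressive generation over a multimodal token sequence with diffusion generation of the next object's 3D latents, producing coherent scenes from instructions that describe complex shape, appearance, and spatial arrangement.
Details
Motivation: Existing text-to-scene methods usually generate either the scene layout or the objects, and the resulting layouts tend to be simple and inconsistent with text inputs that contain non-trivial descriptions. The goal is to interactively generate complete 3D scenes, with detailed layout and objects, from text instructions.
Result: A 7B-parameter 3D-ARD+ model was trained on a large dataset of 230K indoor scenes with paired text instructions and evaluated on challenging scenes, showing that it can generate and place objects that follow the non-trivial spatial layout and semantics prescribed by the text.
Insight: The core novelty is the sequential text-to-scene paradigm together with the two-step 3D-ARD+ architecture, which unifies autoregressive and diffusion generation and proceeds coherently from coarse-grained latents in scene space to fine-grained latents in object space, improving consistency between generated scenes and complex text descriptions.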
Abstract: Recent text-to-scene generation approaches have largely reduced the manual effort required to create 3D scenes. However, their focus is either to generate a scene layout or to generate objects, and few generate both. The generated scene layout is often simple even with an LLM’s help. Moreover, the generated scene is often inconsistent with the text input that contains non-trivial descriptions of the shape, appearance, and spatial arrangement of the objects. We present a new paradigm of sequential text-to-scene generation and propose a novel generative model for interactive scene creation. At the core is a 3D Autoregressive Diffusion model, 3D-ARD+, which unifies autoregressive generation over a multimodal token sequence and diffusion generation of next-object 3D latents. To generate the next object, the model uses one autoregressive step to generate the coarse-grained 3D latents in the scene space, conditioned on both the text instructions seen so far and the already synthesized 3D scene. It then uses a second step to generate the 3D latents in the smaller object space, which can be decoded into fine-grained object geometry and appearance. We curate a large dataset of 230K indoor scenes with paired text instructions for training. We evaluate the 7B 3D-ARD+ on challenging scenes and show that the model can generate and place objects following the non-trivial spatial layout and semantics prescribed by the text instructions.
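[105] PA-TCNet: Pathology-Aware Temporal Calibration with Physiology-Guided Target Refinement for Cross-Subject Motor Imagery EEG Decoding in Stroke Patients cs.CV | cs.AI
Xiangkai Wang, Yun Zhao, Dongyi He, Qingling Xia, Gen Li
TL;DR: This paper proposes PA-TCNet, a pathology-aware temporal calibration framework for cross-subject decoding of motor imagery EEG in stroke patients. A Pathology-aware Rhythmic State Mamba module captures abnormal temporal dynamics, and a Physiology-Guided Target Calibration module constrains pseudo-labels, addressing lesion-related abnormal dynamics and inter-patient heterogeneity.
Details
Motivation: In cross-subject motor imagery EEG decoding for stroke patients, lesion-related abnormal temporal dynamics and pronounced inter-subject heterogeneity degrade generalization. Existing adaptation methods are easily misled by pathological slow-wave activity and unstable target-domain pseudo-labels.
Result: In leave-one-subject-out experiments on two independent stroke EEG datasets, XW-Stroke and 2019-Stroke, the method reaches mean accuracies of 66.56% and 72.75%, respectively, outperforming state-of-the-art baselines.
Insight: The novelty is the joint modeling of pathological temporal dynamics and physiology-constrained pseudo-supervision. Specifically: 1) EEG spatiotemporal features are decomposed into slowly varying rhythmic context and fast transient perturbations, and the fused pathological context is injected into selective state propagation to capture abnormal dynamics; 2) source-domain sensorimotor region-of-interest templates impose physiological consistency constraints that dynamically refine target-domain pseudo-labels, improving adaptation reliability. This provides a more robust cross-subject initialization for personalized post-stroke motor imagery BCI rehabilitation.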
Abstract: Stroke patient cross-subject electroencephalography (EEG) decoding of motor imagery (MI) brain-computer interface (BCI) is essential for motor rehabilitation, yet lesion-related abnormal temporal dynamics and pronounced inter-patient heterogeneity often undermine generalization. Existing adaptation methods are easily misled by pathological slow-wave activity and unstable target-domain pseudo-labels. To address this challenge, we propose PA-TCNet, a pathology-aware temporal calibration framework with physiology-guided target refinement for stroke motor imagery decoding. PA-TCNet integrates two coordinated components. The Pathology-aware Rhythmic State Mamba (PRSM) module decomposes EEG spatiotemporal features into slowly varying rhythmic context and fast transient perturbations, injecting the fused pathological context into selective state propagation to more effectively capture abnormal temporal dynamics. The Physiology-Guided Target Calibration (PGTC) module constructs source-domain sensorimotor region-of-interest templates, imposing physiological consistency constraints and dynamically refining target-domain pseudo-labels, thereby improving adaptation reliability. Leave-one-subject-out experiments on two independent stroke EEG datasets, XW-Stroke and 2019-Stroke, yielded mean accuracies of 66.56% and 72.75%, respectively, outperforming state-of-the-art baselines. These results indicate that jointly modeling pathological temporal dynamics and physiology-constrained pseudo-supervision can provide more robust cross-subject initialization for personalized post-stroke MI-BCI rehabilitation. The implemented code is available at https://github.com/wxk1224/PA-TCNet.
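The leave-one-subject-out protocol mentioned in this abstract can be sketched in a few lines of standard-library Python. This only illustrates the generic evaluation split, not PA-TCNet itself; the function names and the toy subject labels are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids[i] is the subject that trial i belongs to. Every fold tests on
    all trials of one subject and trains on the trials of all other subjects.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

def mean_accuracy(per_subject_accuracy):
    """Average the per-fold accuracies into the reported mean accuracy."""
    return sum(per_subject_accuracy) / len(per_subject_accuracy)
```

Because the held-out subject contributes no trials to training, the mean over folds measures how well a decoder transfers to an unseen patient, which is the setting the abstract describes as cross-subject.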
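[106] Classification of systolic murmurs in heart sounds using multiresolution complex Gabor dictionary and vision transformer cs.CV | cs.AI
Mahmoud Fakhry, Abeer FathAllah Brery
TL;DR: This paper proposes an automatic classification system for systolic heart murmurs based on a multiresolution complex Gabor dictionary and a vision transformer. Murmur segments are first projected, via complex orthogonal matching pursuit, onto a redundant dictionary of multiresolution complex Gabor basis functions, yielding variable-resolution time-frequency feature matrices. A vision transformer then classifies these matrices, reaching 95.96% accuracy on four types of systolic murmurs from the CirCor DigiScope dataset.
Details
Motivation: Systolic murmurs are extra heart sounds that occur during cardiac contraction. Their intensity, pitch, and quality vary, and identifying them precisely matters for accurate diagnosis of cardiac disorders, which motivates an automatic system that can recognize and classify them.
Result: Classification of four types of systolic murmurs on the CirCor DigiScope dataset reaches 95.96% accuracy.
Insight: The novelty is the use of a multiresolution complex Gabor dictionary for time-frequency feature extraction combined with a vision transformer for classification. Reusable ideas include processing multiple segments with a shared dictionary to reduce murmur variability, and fusing multiresolution features by tokenizing patches with a convolutional neural network before the transformer encoder.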
Abstract: Systolic murmurs are extra heart sounds that occur during the contraction phase of the cardiac cycle, often indicating heart abnormalities caused by turbulent blood flow. Their intensity, pitch, and quality vary, requiring precise identification for the accurate diagnosis of cardiac disorders. This study presents an automatic classification system for systolic murmurs using a feature extraction module, followed by a classification model. The feature extraction module employs complex orthogonal matching pursuit to project single or multiple murmur segments onto a redundant dictionary composed of multiresolution complex Gabor basis functions (GBFs). The resulting projection weights are split and reshaped into variable-resolution time–frequency feature matrices. Processing multiple segments of a single recording using a shared dictionary mitigates murmur variability. This is achieved by learning the weights for each segment while enforcing that they correspond to the same set of basis functions in the dictionary, promoting consistent time–frequency feature matrices. The classification model is built based on a vision transformer to process multiple input matrices of different resolutions by passing each through a convolutional neural network for patch tokenization. All embedding tokens are then concatenated to form a matrix and forwarded to an encoder layer that includes multihead attention, residual connections, and a convolutional network with a kernel size of one. This integration of multiresolution feature extraction with transformer-based feature classification enhances the accuracy and reliability of heart murmur identification. An experimental analysis of four types of systolic murmurs from the CirCor DigiScope dataset demonstrates the effectiveness of the system, achieving a classification accuracy of 95.96%.
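To make the idea of projecting a segment onto a multiresolution complex Gabor dictionary concrete, here is a small standard-library Python sketch. It deliberately swaps in plain matching pursuit, a simpler relative of the complex orthogonal matching pursuit named in the abstract (it omits the least-squares re-fit over all selected atoms), and it does not implement the shared-dictionary constraint across segments. The atom parameterization, function names, and parameter values are assumptions for illustration only.

```python
import cmath
import math

def gabor_atom(n, center, scale, freq):
    """Unit-norm complex Gabor atom: Gaussian window times a complex exponential."""
    raw = [
        math.exp(-((t - center) ** 2) / (2.0 * scale ** 2))
        * cmath.exp(2j * math.pi * freq * t)
        for t in range(n)
    ]
    norm = math.sqrt(sum(abs(v) ** 2 for v in raw))
    return [v / norm for v in raw]

def build_dictionary(n, scales, freqs, hop):
    """Redundant dictionary: one atom per (scale, frequency, time shift)."""
    return [
        gabor_atom(n, center, scale, freq)
        for scale in scales
        for freq in freqs
        for center in range(0, n, hop)
    ]

def inner(atom, signal):
    """Complex inner product <atom, signal>, conjugating the atom."""
    return sum(a.conjugate() * x for a, x in zip(atom, signal))

def matching_pursuit(signal, atoms, n_iter):
    """Greedy decomposition; returns (projection weights, residual)."""
    residual = list(signal)
    weights = [0j] * len(atoms)
    for _ in range(n_iter):
        coeffs = [inner(a, residual) for a in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(coeffs[i]))
        weights[k] += coeffs[k]
        residual = [r - coeffs[k] * a for r, a in zip(residual, atoms[k])]
    return weights, residual
```

The `weights` list is ordered by scale, then frequency, then time shift, so slicing it per scale and reshaping each slice into a frequency-by-time grid would give one matrix per resolution, which is roughly the role the abstract assigns to the variable-resolution time–frequency feature matrices.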
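[107] Real-Time Visual Attribution Streaming in Thinking Model cs.CV | cs.AI
Seil Kang, Woojung Han, Junhyeok Kim, Jinyeong Kim, Youngeun Kim
TL;DR: This paper proposes an amortized framework for real-time visual attribution streaming in multimodal thinking models, addressing the difficulty of verifying that a model's reasoning is grounded in visual evidence when it generates code or solves math problems. The method learns to estimate the causal effects of semantic regions directly from attention features, providing reliable visual attribution during reasoning without expensive repeated computation.
Details
Motivation: The long reasoning traces of multimodal thinking models (for example, generating code from a screenshot or solving math problems from images) should be grounded in visual evidence, but verifying this reliance is hard: conventional causal methods are computationally expensive, while raw attention maps are available instantly but lack causal validity.
Result: Across five benchmarks and four thinking models, the method achieves faithfulness comparable to exhaustive causal methods while supporting visual attribution streaming, so users can observe the evidence as the model reasons rather than afterwards.
Insight: The novelty is achieving real-time, faithful visual attribution in multimodal thinking models through lightweight learning instead of brute-force computation. The method amortizes causal-effect estimation using the rich signals in attention features, balancing efficiency and accuracy.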
Abstract: We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access but lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.
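[108] MambaKick: Early Penalty Direction Prediction from HAR Embeddings cs.CV | cs.AI
Henry O. Velesaca, David Freire-Obregon, Abel Reyes-Angulo, Steven Araujo, Angel Sappa
TL;DR: This paper proposes MambaKick, a learning-based framework for predicting shot direction from soccer penalty-kick videos. It uses pretrained human action recognition (HAR) models to extract spatiotemporal representations from short segments around ball contact and aggregates the sequence with a lightweight temporal predictor (the selective state-space model Mamba), with simple contextual metadata (such as field side and footedness) as complementary cues.
Details
Motivation: In a penalty kick, the goalkeeper must anticipate the shot direction from the kicker's motion under extreme time constraints. Conventional approaches rely on explicit kinematic reconstruction or handcrafted biomechanical features, whereas this paper explores transferable pretrained representations and efficient temporal modeling as a practical route to low-latency intention prediction.
Result: Across a range of HAR backbones, MambaKick consistently improves on or matches strong baselines. Under the proposed methodology it reaches 53.1% accuracy for three-class and 64.5% for two-class direction prediction.
Insight: The main novelty is combining pretrained general-purpose HAR representations with efficient temporal modeling based on the selective state-space model Mamba for low-latency intention prediction in a specific sports-video task, avoiding complex feature engineering. Viewed objectively, the transfer ability of pretrained models and the efficiency of Mamba on long sequences make this a lightweight and effective framework for real-time sports analysis.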
Abstract: Penalty kicks in soccer are decided under extreme time constraints, where goalkeepers benefit from anticipating shot direction from the kicker's motion before or around ball contact. In this paper, MambaKick is presented as a learning-based framework for penalty direction prediction that leverages pretrained human action recognition (HAR) embeddings extracted from contact-centered short video segments and combines them with a lightweight temporal predictor. Rather than relying on explicit kinematic reconstruction or handcrafted biomechanical features, the approach reuses transferable spatiotemporal representations and utilizes selective state-space models (Mamba) for efficient sequence aggregation. Simple contextual metadata (e.g., field side and footedness) are also considered as complementary cues that may reduce ambiguity in real-world footage. Across a range of HAR backbones, MambaKick consistently improves or matches strong embedding baselines, achieving up to 53.1% accuracy for three classes and 64.5% for two classes under the proposed methodology. Overall, the results indicate that combining pretrained HAR representations with efficient state-space temporal modeling is a practical direction for low-latency intention prediction in real-world sports video. The code will be available at GitHub: https://github.com/hvelesaca/MambaKick/
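[109] IncepDeHazeGAN: Novel Satellite Image Dehazing cs.CV
Tejeswar Pokuri, Shivarth Rai
TL;DR: This paper proposes IncepDeHazeGAN, a new generative adversarial network for single-image satellite dehazing. The network combines Inception blocks with multi-layer feature fusion to improve the visual quality of remote sensing images captured in cloudy or foggy conditions.
Details
Motivation: Satellite images captured in cloudy, foggy, or hazy conditions are degraded. The goal is to recover clear, high-quality images from haze-affected remote sensing data.
Result: Experiments show that the network achieves state-of-the-art (SOTA) results on several datasets.
Insight: The novelty is using Inception blocks for multi-scale feature extraction and a multi-layer feature fusion design for efficient feature reuse. In addition, the Grad-CAM explainability technique visualizes the regions the network focuses on while dehazing and how it adapts to different haze conditions.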
Abstract: Dehazing is a technique in computer vision for enhancing the visual quality of images captured in cloudy or foggy conditions. Dehazing helps to recover clear, high-quality images from haze-affected remote sensing data. In this study, we introduce IncepDeHazeGAN, a novel Generative Adversarial Network (GAN) involving an Inception block and multi-layer feature fusion for the task of single-image dehazing. Utilizing the Inception block allows for multi-scale feature extraction. The multi-layer feature fusion design, in turn, achieves efficient reuse of features, as the features extracted at different convolution layers are fused several times. The Grad-CAM XAI technique has been applied to our network, highlighting the regions focused on by the network for dehazing and its adaptation to different haze conditions. Experiments demonstrate that our network achieves state-of-the-art results on several datasets.
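[110] AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers cs.CV | cs.MM | cs.SD
Edson Araujo, Saurabhchand Bhati, M. Jehanzeb Mirza, Brian Kingsbury, Samuel Thomas
TL;DR: This paper proposes AVRT, a framework that generates high-quality audio-visual reasoning traces from single-modality teacher models to address the scarcity of high-quality multimodal reasoning data. It first generates independent vision and audio reasoning traces, merges them into multimodal traces with an LLM merger model, and then adapts the target model to audio-visual reasoning through two training stages: supervised fine-tuning followed by reinforcement learning.
Details
Motivation: Reasoning models have made notable progress in text-based domains, but transferring these capabilities to multimodal settings such as reasoning over audio-visual data remains difficult, mainly because high-quality reasoning data for the targeted multimodal combinations is limited.
Result: On seven audio-visual and audio benchmarks, the 3B and 7B parameter models reach SOTA among models of comparable size, including OmniBench and DailyOmni for audio-visual reasoning and MMAR for audio-only reasoning, indicating that cross-modal training also transfers to single-modality tasks.
Insight: The novelty is using single-modality teacher models to generate multimodal reasoning traces and adapting to multimodal reasoning with two-stage training (supervised fine-tuning, then reinforcement learning). Cross-modal training also improves single-modality performance, establishing a new training pipeline for multimodal reasoning models.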
Abstract: Recent advances in reasoning models have shown remarkable progress in text-based domains, but transferring those capabilities to multimodal settings, e.g., to allow reasoning over audio-visual data, still remains a challenge, in part because of the limited availability of high-quality reasoning data in targeted multimodal combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning traces from single-modality teacher models. We generate independent vision- and audio-reasoning traces via models specialized to reason over their respective modalities and merge the resulting traces with an LLM merger model. The resulting multimodal traces are used in a supervised fine-tuning (SFT) cold start to adapt the target model to audio-visual reasoning traces first, before training it in a second reinforcement learning stage on larger-scale data. Evaluated on seven audio-visual and audio benchmarks, our 3B and 7B parameter models achieve state-of-the-art results among models of comparable size including OmniBench and DailyOmni for audio-visual and MMAR for audio-only reasoning, showing that cross-modal training also transfers to single-modality tasks and establishing a new training pipeline for multimodal reasoning models.
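[111] Tri-Modal Fusion Transformers for UAV-based Object Detection cs.CV
Craig Iaboni, Pramod Abichandani
TL;DR: This paper proposes a tri-modal fusion transformer framework for UAV object detection that integrates RGB, thermal, and event camera data. The data is processed by a dual-stream hierarchical vision transformer, with Modality-Aware Gated Exchange (MAGE) and Bidirectional Token Exchange (BiTE) modules performing the multimodal fusion.
Details
Motivation: In UAV object detection, illumination changes, motion blur, and dynamic scenes suppress RGB cues. Fusing thermal (LWIR) imaging, which preserves contrast in low light, with the microsecond-level temporal edges of event cameras improves detection robustness.
Result: On the proposed UAV dataset of 10,489 synchronized and pre-aligned RGB-thermal-event frames with 24,223 annotated vehicles, tri-modal fusion outperforms all dual-modal baselines. Fusion depth has a significant effect on performance, and a lightweight CSSA variant recovers most of the benefit at minimal cost.
Insight: The contributions include the first systematic study of tri-modal detection with RGB, thermal, and event cameras, the MAGE and BiTE modules for channel-, spatial-, and token-level interaction, and the first benchmark and modular backbone for tri-modal UAV detection.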
Abstract: Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise-pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector. We introduce a 10,489-frame UAV dataset with synchronized and pre-aligned RGB-thermal-event streams and 24,223 annotated vehicles across day and night flights. Through 61 controlled ablations, we evaluate fusion placement, mechanism (baseline MAGE+BiTE, CSSA, GAFF), modality subsets, and backbone capacity. Tri-modal fusion improves over all dual-modal baselines, with fusion depth having a significant effect and a lightweight CSSA variant recovering most of the benefit at minimal cost. This work provides the first systematic benchmark and modular backbone for tri-modal UAV-based object detection.
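[112] Appearance-free Action Recognition: Zero-shot Generalization in Humans and a Two-Pathway Model cs.CV
Prerana Kumar, Martin A. Giese
TL;DR: This paper studies how well humans and vision models generalize zero-shot to appearance-free action videos. A psychophysics experiment shows that humans recognize appearance-free actions without training on them, though less accurately than in natural videos. The authors propose a two-pathway 3D CNN model combining an RGB stream and an optical flow stream, which performs well on appearance-free datasets and narrows the gap to human performance.
Details
Motivation: Action recognition is a fundamental ability for social species, yet its computational mechanisms are not well understood. Existing studies mostly use simplified stimuli or train explicitly on appearance-free videos, and it is not yet known whether humans and vision models generalize zero-shot to appearance-free transformations of real-world action videos.
Result: In the human experiment, recognition accuracy on both types of appearance-free videos (dense-noise motion and random dots) was well above chance but lower than on natural videos. The proposed two-pathway 3D CNN outperforms contemporary video classification models on appearance-free datasets such as AFD5 and approaches human performance.
Insight: The contributions are a psychophysical quantification of human zero-shot generalization to appearance-free actions, and a two-pathway model with a coherence-gating mechanism inspired by Gestalt common-fate grouping. The results highlight the critical role of the motion pathway for appearance-free generalization and support multi-stream architectures for modeling video-based action recognition.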
Abstract: Action recognition is a fundamental ability for social species. Yet, its underlying computations are not well understood. Classical psychophysical studies using simplified stimuli have shown that humans can perceive body motion even under degradation of relevant shape cues. Recent work using real-world action videos and their appearance-free counterparts (that preserve motion but lack static shape cues) included explicit training of humans and models on the appearance-free videos. Whether humans and vision models generalize in a zero-shot manner to appearance-free transformations of real-world action videos is not yet known. To measure this generalization in humans, we conducted a laboratory-based psychophysics experiment. 22 participants were trained to recognize five action categories using naturalistic videos (UCF5 dataset), and tested zero-shot on two types of appearance-free transformations: (i) dense-noise motion videos from an existing dataset (AFD5) and (ii) random-dot appearance-free videos. We find that participants recognize actions in both types of appearance-free videos well above chance, albeit with reduced accuracy compared to naturalistic videos. To model this behavior, we developed a two-pathway 3D CNN-based model combining an RGB (form) stream and an optical flow (motion) stream, including a coherence-gating mechanism inspired by Gestalt common-fate grouping. Our model generalizes to both appearance-free datasets and outperforms contemporary video classification models, narrowing the gap to human performance. We find that the motion pathway is critical for generalization to appearance-free videos, while the form pathway improves performance on naturalistic videos. Our findings highlight the importance of motion-based representations for generalization to appearance-free videos, and support the use of multi-stream architectures to model video-based action recognition.
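[113] C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion cs.CV
Yuval Haitman, Amit Efraim, Joseph M. Francos
TL;DR: C-GenReg is a training-free framework for 3D point cloud registration. A World Foundation Model synthesizes multi-view-consistent RGB images from the input geometry, a Vision Foundation Model extracts dense correspondences in the image domain, and a probabilistic fusion scheme combines the correspondence posteriors of the generated-RGB branch and the raw geometric branch for robust zero-shot registration.
Details
Motivation: Current learning-based 3D point cloud registration methods generalize poorly across sensing modalities, sampling differences, and environments.
Result: The method shows strong zero-shot performance and superior cross-domain generalization on indoor (3DMatch, ScanNet) and outdoor (Waymo) benchmarks, and is the first generative registration framework shown to run successfully on real outdoor LiDAR data where no imagery is available.
Insight: The novelty is transferring the 3D geometric matching problem into the image domain through generation, to exploit the strengths of vision foundation models, together with a "Match-then-Fuse" probabilistic cold-fusion scheme that combines the inductive biases of the two modalities and provides calibrated confidence without fine-tuning.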
Abstract: We introduce C-GenReg, a training-free framework for 3D point cloud registration that leverages the complementary strengths of world-scale generative priors and registration-oriented Vision Foundation Models (VFMs). Current learning-based 3D point cloud registration methods struggle to generalize across sensing modalities, sampling differences, and environments. Hence, C-GenReg augments the geometric point cloud registration branch by transferring the matching problem into an auxiliary image domain, where VFMs excel, using a World Foundation Model to synthesize multi-view-consistent RGB representations from the input geometry. This generative transfer preserves spatial coherence across source and target views without any fine-tuning. From these generated views, a VFM pretrained for finding dense correspondences extracts matches. The resulting pixel correspondences are lifted back to 3D via the original depth maps. To further enhance robustness, we introduce a “Match-then-Fuse” probabilistic cold-fusion scheme that combines two independent correspondence posteriors, that of the generated-RGB branch with that of the raw geometric branch. This principled fusion preserves each modality's inductive bias and provides calibrated confidence without any additional learning. C-GenReg is zero-shot and plug-and-play: all modules are pretrained and operate without fine-tuning. Extensive experiments on indoor (3DMatch, ScanNet) and outdoor (Waymo) benchmarks demonstrate strong zero-shot performance and superior cross-domain generalization. For the first time, we demonstrate a generative registration framework that operates successfully on real outdoor LiDAR data, where no imagery data is available.
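One way to picture combining two independent correspondence posteriors is a normalized product, sketched below in standard-library Python. The abstract does not give the exact fusion rule, so the product form, the uniform fallback, and the function names are assumptions rather than the authors' Match-then-Fuse scheme.

```python
def fuse_posteriors(p_rgb, p_geo):
    """Fuse two posteriors over the same candidate matches by a normalized product.

    Assumption: the two branches are conditionally independent and share a
    uniform prior, so the fused posterior is proportional to p_rgb * p_geo.
    """
    product = [a * b for a, b in zip(p_rgb, p_geo)]
    total = sum(product)
    if total == 0:
        # The branches disagree completely; fall back to a uniform posterior.
        return [1.0 / len(product)] * len(product)
    return [v / total for v in product]

def best_match(posterior):
    """Return (index, confidence) of the most probable candidate match."""
    k = max(range(len(posterior)), key=lambda i: posterior[i])
    return k, posterior[k]
```

With this rule, an uninformative branch (a flat posterior) leaves the other branch's belief unchanged, while two branches that agree sharpen the fused confidence, which matches the intuition of letting each modality contribute only where it is informative.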
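[114] iDocV2: Leveraging Self-Supervision and Open-Set Detection for Improving Pattern Spotting in Historical Documents cs.CV
Jose M. Saavedra, Crhistopher Stears, Marcelo Pizarro, Cristóbal Loyola, Luis Aros
TL;DR: This paper proposes iDocV2, which combines self-supervised learning with open-set detection to improve pattern spotting in historical documents, substantially increasing search speed and precision on small non-square queries.
Details
Motivation: Existing methods for document retrieval and pattern spotting in historical documents have limited precision, especially for small non-square queries, and long processing times. The goal is to improve both efficiency and accuracy.
Result: On the DocExplore dataset, the model is competitive with SOTA in pattern spotting and document retrieval, runs 10x faster, and sets a new SOTA precision of 0.612 on small non-square queries.
Insight: The contributions are an improved encoder (iDoc) trained with self-supervision, an open-set detector that accelerates search, and non-maximum suppression to reduce false positives. These ideas can carry over to optimizing other document retrieval systems.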
Abstract: Considering the imminent massification of digital books, it has become critical to facilitate searching collections through graphical patterns. Current strategies for document retrieval and pattern spotting in historical documents still need to be improved. State-of-the-art strategies achieve an overall precision of 0.494 for pattern spotting, where the precision for small non-square queries reaches 0.427. In addition, the processing time is excessive, requiring up to 7 seconds for searching in the DocExplore dataset due to a dense-based strategy used by SOTA models. Therefore, we propose a new model based on a better encoder (iDoc), trained under a self-supervised strategy, and an open-set detector to accelerate searching. Our model achieves competitive results with state-of-the-art pattern spotting and document retrieval, improving speed by 10x. Furthermore, our model reaches a new SOTA performance on the small non-square queries, achieving a new precision of 0.612. Unlike the previous version, this one leverages non-maximum suppression to reduce false positives.
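The non-maximum suppression step mentioned at the end of this abstract can be illustrated with the usual greedy algorithm in standard-library Python. This is a generic sketch, not the iDocV2 implementation; the `(x1, y1, x2, y2)` box format and the 0.5 IoU threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap a kept one.

    Returns the indices of the kept boxes, ordered by descending score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

In pattern spotting, near-duplicate detections of the same motif on a page would otherwise each count as a separate retrieved result, so suppressing them is a plausible way to cut false positives.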
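[115] Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis cs.CV | cs.AI
Ayhan Can Erdur, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert
TL;DR: This paper proposes a training-free agentic framework in which a large language model (LLM) orchestrates specialized external tools to automate complex brain MRI analysis workflows, including preprocessing, pathology segmentation, and volumetric analysis.
Details
Motivation: LLMs perform well in general visual question answering, but their architectures lack the native spatial reasoning needed for 3D medical imaging such as CT or MRI. Emerging agentic AI sidesteps the need for built-in 3D processing by calling external tools, but its feasibility in complex, multi-step radiological workflows remains underexplored.
Result: Validated on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, the system autonomously executes increasingly complex radiological tasks, from single-scan segmentation to longitudinal response assessment requiring multi-timepoint comparison. The paper also analyzes how single-agent and multi-agent "domain-expert" collaboration designs affect results.
Insight: The novelty is an agentic pipeline that needs no training or fine-tuning and solves complex neuro-radiological image analysis tasks by having an LLM orchestrate external tools. The paper also introduces a benchmark dataset derived from public BraTS data to support rigorous evaluation of future agentic systems, offering an efficient and flexible way to extend general LLM capabilities to specialized 3D medical image analysis.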
Abstract: State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent “domain-expert” collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve neuro-radiological image analysis tasks through tool use without the need for training or fine-tuning.
[116] Active World-Model with 4D-informed Retrieval for Exploration and Awareness cs.CVPDF
Elaheh Vaezpour, Amirhosein Javadi, Tara Javidi
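TL;DR: This paper proposes AW4RE (Active World-model with 4D-informed Retrieval for Exploration), an awareness-centric generative world model for physical awareness in partially observable environments. By combining 4D-informed evidence retrieval, action-conditioned geometric support with temporal coherence, and conditional generative completion, it provides a sensor-native surrogate environment for exploring sensing queries and estimates the action-conditioned observation process.
Details
Motivation: In large, dynamic environments, physical awareness is shaped by sensing decisions, while observations in turn affect the quality of those decisions. This loopy information structure makes partially observable decision problems such as POMDPs very hard. Reinforcement learning has succeeded on fully observable problems, but partially observable ones remain difficult: real-world exploration is costly, and sim-to-real pipelines suffer from unobserved viewpoints.
Result: Experiments show that under extreme viewpoint shifts, temporal gaps, and sparse geometric support, AW4RE produces more grounded and consistent predictions than geometry-aware generative baselines.
Insight: The contribution is a generative world model that integrates 4D-informed retrieval with action-conditioned geometric support and temporal coherence. It gives partially observable settings a sensor-native surrogate environment in which sensing queries can be explored and physical awareness improved.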
Abstract: Physical awareness, especially in a large and dynamic environment, is shaped by sensing decisions that determine observability across space, time, and scale, while observations impact the quality of sensing decisions. This loopy information structure makes physical awareness a fundamentally challenging decision problem with partial observations. While in the past decade we have witnessed the unprecedented success of reinforcement learning (RL) in problems with full observability, decision problems with partial observation, such as POMDPs, remain largely open: real-world explorations are excessively costly, while sim-to-real pipelines suffer from unobserved viewpoints. We introduce AW4RE (Active World-model with 4D-informed Retrieval for Exploration), an awareness-centric generative world model that provides a sensor-native surrogate environment for exploring sensing queries. Conditioned on a queried sensing action, AW4RE estimates the action-conditioned observation process. This is done by combining 4D-informed evidence retrieval, action-conditioned geometric support with temporal coherence, and conditional generative completion. Experiments demonstrate that AW4RE produces more grounded and consistent predictions than geometry-aware generative baselines under extreme viewpoint shifts, temporal gaps, and sparse geometric support.
[117] Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines cs.CV | cs.AIPDF
Junwan Kim, Hyunkyung Bae
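TL;DR: Multimodal LLM inference suffers from high peak memory because large numbers of vision tokens are stored in the cache. This paper proposes a sequential input-compression mechanism that performs structure-aware key-value cache compression during the prefill stage to control memory growth, substantially lowering peak memory usage with only minor degradation in generative performance.
Details
Motivation: As multimodal LLMs process high-resolution images and long video sequences, inference must store many vision tokens in the key-value cache, making memory consumption a central bottleneck. Existing methods usually compress the cache only after all inputs have been processed, so they do not address the high peak memory of the prefill stage.
Result: The proposed method substantially reduces peak memory usage while keeping the drop in generative performance minimal, enabling more practical and memory-efficient multimodal inference.
Insight: The paper shows that multimodal LLMs exhibit inherent structural regularities and representational redundancy, and exploits them to perform structure-aware key-value cache compression during prefill. Memory growth is controlled under a fixed budget throughout inference instead of being compressed after the fact.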
Abstract: Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale to richer visual representations, inference increasingly relies on storing large numbers of vision tokens in the key-value (KV) cache, making memory consumption a central bottleneck. Existing methods address this issue by identifying redundancy in vision tokens and compressing the cache, but such compression is typically applied only after all inputs are processed, resulting in high peak memory usage during the prefill stage. In this work, we show that MLLMs exhibit inherent structural regularities and representational redundancy that can be exploited to control memory growth throughout inference. Based on this insight, we propose a sequential input-compression mechanism that enforces a fixed memory budget by performing structure-aware key-value cache compression during the prefill process. This approach substantially reduces peak memory usage while maintaining generative performance with only minimal degradation, enabling more practical and memory-efficient multimodal inference.
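The difference between post-hoc compression and compression during prefill can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not describe the structure-aware criterion, so the per-token `scores` used for eviction, the function name `prefill_with_budget`, and the parameters `budget` and `chunk_size` are all assumptions made for illustration. Tokens are processed chunk by chunk, and after each chunk the cache is trimmed back to the budget, so the peak size is bounded by `budget + chunk_size` instead of the full input length.

```python
def prefill_with_budget(tokens, scores, budget, chunk_size):
    """Process tokens chunk by chunk, trimming the cache to `budget` entries.

    tokens: list of token ids (stand-ins for key/value pairs)
    scores: list of floats, a hypothetical importance score per token
    Returns (cache, peak): cache is a list of (position, token) pairs,
    peak is the largest cache size observed at any point.
    """
    cache = []  # entries are (position, token, score)
    peak = 0
    for start in range(0, len(tokens), chunk_size):
        end = min(start + chunk_size, len(tokens))
        for pos in range(start, end):
            cache.append((pos, tokens[pos], scores[pos]))
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Keep the highest-scoring entries, then restore positional order.
            cache.sort(key=lambda entry: entry[2], reverse=True)
            cache = sorted(cache[:budget], key=lambda entry: entry[0])
    return [(pos, tok) for pos, tok, _ in cache], peak
```

Compressing only after the whole input has been read would let the cache grow to `len(tokens)` entries before any eviction happens, which is the peak the paper aims to avoid.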
[118] Automated Palynological Analysis System: Integrating Deep Metric Learning and $U^{2}$-Net Detection in $H\infty$ bright field microscopy cs.CV | physics.opticsPDF
J. Staforelli-Vivanco, R. Jofré, B. Muñoz, V. Salamanca, P. Coelho
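TL;DR: This paper presents an automated, high-throughput pollen analysis system that combines H∞ robust mechanical control with a deep learning pipeline for precise counting, classification, and morphological analysis of pollen grains from the Bio Bio region of Chile. The system uses U²-Net for salient object detection and a DINOv2 Vision Transformer backbone trained with Deep Metric Learning for classification, and integrates Gradient-Weighted Attention to provide human-interpretable texture and diagnostic feature annotations.
Details
Motivation: Traditional melissopalynological analysis is slow and subjective, typically taking 4-6 hours per sample, so an automated system is needed to improve efficiency and objectivity.
Result: The system achieves 95.8% classification recall and a 6x processing speedup compared to manual expert analysis.
Insight: The contribution is the integration of H∞ robust control, U²-Net object detection, DINOv2 Vision Transformer classification trained with deep metric learning, and a gradient-weighted attention mechanism into one system that delivers accurate and interpretable automated pollen analysis.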
Abstract: Traditional melissopalynology is a time-consuming and subjective process, often taking 4-6 hours per sample. We present an automated, high-throughput microscopy system that integrates $H\infty$ robust mechanical control with advanced deep learning pipelines for the precise counting, classification, and morphological analysis of pollen grains from the Bio Bio region of south-central Chile. Our system employs $U^{2}$-Net for salient object detection and a DINOv2 Vision Transformer backbone trained via Deep Metric Learning for classification. By integrating Gradient-Weighted Attention, the model provides human-interpretable texture and diagnostic feature annotations. The system achieves a 95.8% classification recall and a 6x processing speedup compared to manual expert analysis.
[119] Incoherent Deformation, Not Capacity: Diagnosing and Mitigating Overfitting in Dynamic Gaussian Splatting cs.CVPDF
Ahmad Droby
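TL;DR: This paper studies why dynamic 3D Gaussian Splatting methods reach strong training-view PSNR on monocular video yet generalize poorly. The generalization gap is mostly triggered by Gaussian splitting, but the root cause is incoherence of the deformation field, not parameter capacity. Regularizers such as Elastic Energy Regularization substantially reduce the gap, and the effect is confirmed on a different architecture and on real video.
Details
Motivation: Dynamic 3D Gaussian Splatting trained on monocular video achieves high training PSNR but generalizes poorly at test time, leaving a large train-test PSNR gap. The paper aims to diagnose and mitigate this overfitting.
Result: On the D-NeRF benchmark the average train-test PSNR gap is 6.18 dB, rising to 11 dB on individual scenes. Elastic Energy Regularization and the accompanying regularizers reduce the gap by 40.8% to 57%, and the improvement carries over to the Deformable-3DGS architecture and to real HyperNeRF video scenes.
Insight: The paper shows that overfitting in dynamic 3D Gaussian Splatting is driven mainly by incoherent deformation and not by parameter count, and proposes regularizers such as Elastic Energy Regularization that enforce deformation coherence and thereby improve generalization.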
Abstract: Dynamic 3D Gaussian Splatting methods achieve strong training-view PSNR on monocular video but generalize poorly: on the D-NeRF benchmark we measure an average train-test PSNR gap of 6.18 dB, rising to 11 dB on individual scenes. We report two findings that together account for most of that gap. Finding 1 (the role of splitting). A systematic ablation of the Adaptive Density Control pipeline (split, clone, prune, frequency, threshold, schedule) shows that splitting is responsible for over 80% of the gap: disabling split collapses the cloud from 44K to 3K Gaussians and the gap from 6.18 dB to 1.15 dB. Across all threshold-varying ablations, gap is log-linear in count (r = 0.995, bootstrap 95% CI [0.99, 1.00]), which suggests a capacity-based explanation. Finding 2 (the role of deformation coherence). We show that the capacity explanation is incomplete. A local-smoothness penalty on the per-Gaussian deformation field – Elastic Energy Regularization (EER) – reduces the gap by 40.8% while growing the cloud by 85%. Measuring per-Gaussian strain directly on trained checkpoints, EER reduces mean strain by 99.72% (median 99.80%) across all 8 scenes; on 8/8 scenes the median Gaussian under EER is less strained than the 1st-percentile (best-behaved) Gaussian under baseline. Alongside EER, we evaluate two further regularizers: GAD, a loss-rate-aware densification threshold, and PTDrop, a jitter-weighted Gaussian dropout. GAD+EER reduces the gap by 48%; adding PTDrop and a soft growth cap reaches 57%. We confirm that coherence generalizes to (a) a different deformation architecture (Deformable-3DGS, +40.6% gap reduction at re-tuned lambda), and (b) real monocular video (4 HyperNeRF scenes, reducing the mean PSNR gap by 14.9% at the same lambda as D-NeRF, with near-zero quality cost). The overfitting in dynamic 3DGS is driven by incoherent deformation, not parameter count.
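A local-smoothness penalty on a per-Gaussian deformation field can be sketched as follows. The abstract does not give the exact form of EER, so this is an assumed formulation: each Gaussian's displacement is compared with the displacements of its `k` nearest neighbours, and the mean squared difference is the penalty. The helper names `knn_indices` and `smoothness_penalty` are hypothetical, and the brute-force neighbour search is only suitable for tiny examples.

```python
import math


def knn_indices(points, k):
    """Brute-force k nearest neighbours of each point, excluding itself."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


def smoothness_penalty(points, displacements, k=2):
    """Mean squared difference between each displacement and its neighbours'."""
    neighbours = knn_indices(points, k)
    total, count = 0.0, 0
    for i, nbrs in enumerate(neighbours):
        for j in nbrs:
            total += sum(
                (a - b) ** 2
                for a, b in zip(displacements[i], displacements[j])
            )
            count += 1
    return total / count if count else 0.0
```

Under this sketch, a rigid translation (every Gaussian moving by the same vector) costs nothing, while neighbouring Gaussians that move in opposite directions are penalized, which matches the intuition of "coherent" versus "incoherent" deformation in the abstract.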
[120] TriTS: Time Series Forecasting from a Multimodal Perspective cs.CV | cs.AIPDF
Xiang Ao
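TL;DR: TriTS is a cross-modal disentanglement framework for long-term time series forecasting. It projects a 1D time series into orthogonal time, frequency, and 2D-vision spaces, uses a Period-Aware Reshaping strategy with Visual Mamba (Vim) to model cross-period dependencies efficiently, and adds a Multi-Resolution Wavelet Mixing module to decouple non-stationary signals. Experiments report state-of-the-art results on multiple benchmark datasets with far fewer parameters and lower inference latency.
Details
Motivation: In long-term time series forecasting, a purely 1D view struggles to capture the highly entangled temporal dynamics of real-world signals, which creates a representation bottleneck.
Result: Extensive experiments on multiple benchmark datasets show that TriTS achieves state-of-the-art performance and outperforms existing vision-based forecasters while greatly reducing parameter count and inference latency.
Insight: The contribution is a multimodal view of forecasting: a cross-modal disentanglement framework maps the 1D series into orthogonal time, frequency, and vision spaces, with Period-Aware Reshaping and Visual Mamba bridging the modality gap and a Multi-Resolution Wavelet Mixing module providing fine-grained time-frequency localization. The efficient use of vision models for forecasting and the attention to computational complexity are both worth borrowing.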
Abstract: Time series forecasting plays a pivotal role in critical sectors such as finance, energy, transportation, and meteorology. However, Long-term Time Series Forecasting (LTSF) remains a significant challenge because real-world signals contain highly entangled temporal dynamics that are difficult to fully capture from a purely 1D perspective. To break this representation bottleneck, we propose TriTS, a novel cross-modal disentanglement framework that projects 1D time series into orthogonal time, frequency, and 2D-vision spaces. To seamlessly bridge the 1D-to-2D modality gap without the prohibitive $O(N^2)$ computational overhead of Vision Transformers (ViTs), we introduce a Period-Aware Reshaping strategy and incorporate Visual Mamba (Vim). This approach efficiently models cross-period dependencies as global visual textures while maintaining linear computational complexity. Complementing this, we design a Multi-Resolution Wavelet Mixing (MR-WM) module for the frequency modality, which explicitly decouples non-stationary signals into trend and noise components to achieve fine-grained time-frequency localization. Finally, a streaming linear branch is retained in the time domain to anchor numerical stability. By dynamically fusing these three complementary representations, TriTS effectively adapts to diverse data contexts. Extensive experiments across multiple benchmark datasets demonstrate that TriTS achieves state-of-the-art (SOTA) performance, fundamentally outperforming existing vision-based forecasters by drastically reducing both parameter count and inference latency.
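The idea of reshaping a 1D series into a 2D array by its period can be sketched in a few lines. The abstract does not say how TriTS estimates the period, so the autocorrelation-based `dominant_period` below is an assumed stand-in (an FFT-based estimate would be a common alternative), and `fold_by_period` is a hypothetical helper name. Each row of the folded array holds one period, so the same phase in consecutive periods lines up in a column.

```python
def dominant_period(series, min_period=2):
    """Return the lag in [min_period, n // 2] with the highest autocorrelation.

    Uses the biased estimator (sum divided by n), which favours shorter
    lags among equally periodic candidates.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    best_lag, best_score = min_period, float("-inf")
    for lag in range(min_period, n // 2 + 1):
        score = sum(
            centered[i] * centered[i + lag] for i in range(n - lag)
        ) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def fold_by_period(series, period):
    """Fold a 1D series into rows of length `period`, dropping any remainder."""
    rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(rows)]
```

In this layout, within-period structure runs along rows and cross-period structure runs along columns, which is what lets a 2D vision backbone treat cross-period dependencies as image texture.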
[121] Frozen Vision Transformers for Dense Prediction on Small Datasets: A Case Study in Arrow Localization cs.CV | cs.AI | cs.LGPDF
Maxwell Shepherd
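TL;DR: This paper presents a system for automated detection, localization, and scoring of arrow punctures on indoor archery target faces. Trained on only 48 annotated photographs, it combines color-based image rectification, a frozen self-supervised vision transformer (DINOv3 ViT-L/16) with AnyUp guided feature upsampling, and lightweight CenterNet-style detection heads.
Details
Motivation: Dense prediction (detecting and localizing arrow punctures) on a very small dataset of only 48 annotated images is challenging. The paper explores how practical a frozen foundation model with minimal task-specific adaptation is in this setting.
Result: In cross-validation the system reaches a mean F1 score of 0.893±0.011 and a mean localization error of 1.41±0.06 mm, comparable to or better than fully supervised approaches that need far more training data. On downstream archery metrics, per-image average arrow scores are recovered with a median error of 1.8%, and group centroid positions to within a median of 4.00 mm.
Insight: The main contribution is pairing a frozen self-supervised vision transformer (DINOv3) with guided feature upsampling (AnyUp) to recover sub-millimeter spatial precision from coarse patch tokens, enabling strong dense prediction with little data. An interesting finding is that in this setting the CenterNet offset regression head adds almost nothing to detection and even hurts localization, which suggests that guided upsampling already recovers the spatial precision lost to patch tokenization.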
Abstract: We present a system for automated detection, localization, and scoring of arrow punctures on 40 cm indoor archery target faces, trained on only 48 annotated photographs (5,084 punctures). Our pipeline combines three components: a color-based canonical rectification stage that maps perspective-distorted photographs into a standardized coordinate system where pixel distances correspond to known physical measurements; a frozen self-supervised vision transformer (DINOv3 ViT-L/16) paired with AnyUp guided feature upsampling to recover sub-millimeter spatial precision from $32 \times 32$ patch tokens; and lightweight CenterNet-style detection heads for arrow-center heatmap prediction. Only 3.8M of 308M total parameters are trainable. Across three cross-validation folds, we achieve a mean F1 score of $0.893 \pm 0.011$ and a mean localization error of $1.41 \pm 0.06$ mm, comparable to or better than prior fully-supervised approaches that require substantially more training data. An ablation study shows that the CenterNet offset regression head, typically essential for sub-pixel refinement, provides negligible detection improvement while degrading localization in our setting. This suggests that guided feature upsampling already resolves the spatial precision lost through patch tokenization. On downstream archery metrics, the system recovers per-image average arrow scores with a median error of 1.8% and group centroid positions to within a median of 4.00 mm. These results demonstrate that frozen foundation models with minimal task-specific adaptation offer a practical paradigm for dense prediction in small-data regimes.
[122] FairNVT: Improving Fairness via Noise Injection in Vision Transformers cs.CV | cs.AI | cs.LGPDF
Qiaoyue Tang, Sepidehsadat Hosseini, Mengyao Zhai, Thibaut Durand, Greg Mori
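TL;DR: FairNVT is a lightweight debiasing framework that injects calibrated Gaussian noise into pretrained transformer encoders to improve fairness at both the representation and prediction levels while preserving task accuracy.
Details
Motivation: Existing methods usually treat representation-level and prediction-level fairness separately. The paper argues that the two are inherently connected: suppressing sensitive information in the representation helps produce fairer predictions.
Result: Across three datasets spanning vision and language, FairNVT lowers the accuracy of sensitive-attribute attackers, improves fairness metrics such as demographic parity and equalized odds, and maintains high task performance.
Insight: Lightweight adapters learn task-relevant and sensitive embeddings; noise is applied to the sensitive embedding, which is then fused with the task representation. Combined with orthogonality constraints and fairness regularization, this jointly optimizes representation and prediction fairness, and the framework is compatible with a range of pretrained transformer encoders.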
Abstract: This paper presents FairNVT, a lightweight debiasing framework for pretrained transformer-based encoders that improves both representation-level and prediction-level fairness while preserving task accuracy. Unlike many existing debiasing approaches that address these notions separately, we argue they are inherently connected: suppressing sensitive information at the representation level can facilitate fairer predictions. Our approach learns task-relevant and sensitive embeddings via lightweight adapters, applies calibrated Gaussian noise to the sensitive embedding, and fuses it with the task representation. Together with orthogonality constraints and fairness regularization, these components jointly reduce sensitive-attribute leakage in the learned embeddings and encourage fairer downstream predictions. The framework is compatible with a wide range of pretrained transformer encoders. Across three datasets spanning vision and language, FairNVT reduces sensitive-attribute attacker accuracy, improves demographic-parity and equalized-odds metrics, and maintains high task performance.
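The three ingredients named in the abstract (noise on the sensitive embedding, fusion with the task embedding, and an orthogonality constraint) can be sketched with plain lists. Everything concrete here is an assumption: the abstract does not specify the fusion operator (concatenation is used below), how the noise scale `sigma` is calibrated, or the form of the orthogonality term (a squared dot product is used). The function names are hypothetical.

```python
import random


def inject_noise(sensitive, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [s + rng.gauss(0.0, sigma) for s in sensitive]


def fuse(task, noisy_sensitive):
    """Fuse by concatenation (an assumed choice of fusion operator)."""
    return list(task) + list(noisy_sensitive)


def orthogonality_penalty(task, sensitive):
    """Squared dot product: zero when the two embeddings are orthogonal."""
    dot = sum(a * b for a, b in zip(task, sensitive))
    return dot ** 2


rng = random.Random(0)          # seeded for reproducibility
task_emb = [1.0, 0.0, 0.5]      # toy task-relevant embedding
sens_emb = [0.0, 2.0, 0.0]      # toy sensitive embedding
fused = fuse(task_emb, inject_noise(sens_emb, sigma=1.0, rng=rng))
```

The larger `sigma` is, the less an attacker can recover about the sensitive attribute from the fused vector, at the cost of discarding whatever task-useful signal the sensitive embedding carried.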
[123] EdgeVTP: Exploration of Latency-efficient Trajectory Prediction for Edge-based Embedded Vision Applications cs.CVPDF
Seungjin Kim, Reza Jafarpourmarzouni, Christopher Neff, Hamed Tabkhi, Vinit Katariya
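TL;DR: EdgeVTP is a vehicle trajectory predictor designed for edge-based embedded vision applications. It combines interaction-aware graph modeling, a lightweight transformer backbone, and a one-shot curve decoder to achieve deterministic, low end-to-end latency.
Details
Motivation: Deploying vehicle trajectory prediction on roadside edge devices requires end-to-end latency that is both bounded and deterministic.
Result: Across three highway benchmarks and two Jetson-class platforms, EdgeVTP achieves the lowest measured end-to-end latency under a protocol that includes graph construction and post-processing. It reaches state-of-the-art prediction accuracy on two of the datasets and competitive error on the other benchmarks.
Insight: Future motion is predicted as compact curve parameters instead of per-timestep autoregressive waypoints, which cuts decoding overhead and yields smooth trajectories. Interaction complexity is explicitly bounded by a locality graph with a hard neighbor cap, so runtime stays predictable in crowded scenes. Overall this is an embedded-first design for edge deployment that balances latency and accuracy well.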
Abstract: Vehicle trajectory prediction is central to highway perception, but deployment on roadside edge devices necessitates bounded, deterministic end-to-end latency. We present EdgeVTP, an embedded-first trajectory predictor that combines interaction-aware graph modeling with a lightweight transformer backbone and a one-shot curve decoder. By predicting future motion as compact curve parameters (anchored at the last observed position) rather than horizon-scaled autoregressive waypoints, EdgeVTP reduces decoding overhead while producing smooth trajectories. To keep runtime predictable in crowded scenes, we explicitly bound interaction complexity via a locality graph with a hard neighbor cap. Across three highway benchmarks and two Jetson-class platforms, EdgeVTP achieves the lowest measured end-to-end latency under a protocol that includes graph construction and post-processing, while attaining state-of-the-art (SotA) prediction accuracy on two of the three datasets and competitive error on other benchmarks. Our code is available at https://github.com/SeungjinStevenKim/EdgeVTP.
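Two of the design choices in the abstract lend themselves to a short sketch: a locality graph with a hard neighbor cap, and decoding a whole trajectory in one shot from curve parameters anchored at the last observed position. The abstract does not state the curve family or the graph construction rule, so the polynomial parameterization, the `radius` cutoff, and the names `capped_neighbors` and `decode_curve` are assumptions for illustration only.

```python
import math


def capped_neighbors(positions, radius, max_neighbors):
    """For each agent, list the nearest agents within `radius`,
    keeping at most `max_neighbors` of them."""
    graph = []
    for i, p in enumerate(positions):
        near = sorted(
            (math.dist(p, q), j)
            for j, q in enumerate(positions)
            if j != i and math.dist(p, q) <= radius
        )
        graph.append([j for _, j in near[:max_neighbors]])
    return graph


def decode_curve(anchor, coeffs, horizon):
    """Evaluate a polynomial displacement curve starting at `anchor`.

    coeffs: list of (cx, cy) pairs multiplying t, t**2, ... with t in (0, 1].
    Returns `horizon` future (x, y) points, all computed without recurrence.
    """
    points = []
    for step in range(1, horizon + 1):
        t = step / horizon
        x = anchor[0] + sum(cx * t ** (k + 1) for k, (cx, _) in enumerate(coeffs))
        y = anchor[1] + sum(cy * t ** (k + 1) for k, (_, cy) in enumerate(coeffs))
        points.append((x, y))
    return points
```

Because every agent has at most `max_neighbors` edges, the cost of message passing grows linearly with the number of agents, and because the curve is evaluated in closed form, decoding cost does not depend on feeding earlier waypoints back into the model.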
[124] Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games cs.CV | cs.AIPDF
Hanling Yi, Feng Lin, Mao Luo, Yifan Yang, Xiaotian Yu
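TL;DR: This paper proposes HyMOR, a hybrid multi-granularity open-ended object recognition framework. It combines the open-ended, coarse-grained recognition of a multimodal large language model (MLLM) with the fine-grained recognition strength of CLIP to address the weakness of existing models at multi-granularity recognition in open-ended settings.
Details
Motivation: Existing MLLMs handle open-ended object recognition well but fall short on fine-grained tasks, while CLIP models excel at fine-grained recognition but cover general object categories narrowly. The paper aims to bridge this gap and give downstream multimodal content generation and interactive games a more robust perceptual foundation.
Result: In extensive experiments that include the newly introduced TextBook Objects (TBO) dataset, HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% and improves general object recognition by 2.5% over a baseline MLLM. Average Sentence-BERT similarity improves by 23.2% across all evaluated datasets.
Insight: The contribution is a hybrid framework with a division of labor: the MLLM handles open-ended coarse-grained recognition and CLIP handles domain-specific fine-grained recognition, giving accurate object understanding across semantic granularities. The modular design could transfer to other multimodal perception tasks that must balance generality and specialization, and the TBO dataset provides a new benchmark for educational scenarios.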
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% while improving general object recognition by 2.5% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
[125] Improving Radio Interferometry Imaging by Explicitly Modeling Cross-Domain Consistency in Reconstruction cs.CVPDF
Kai Cheng, Ruoqi Wang, Qiong Luo
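TL;DR: This paper proposes CDCRec, a multimodal reconstruction method for radio interferometric data that improves imaging quality by explicitly modeling cross-domain consistency. A hierarchical multi-task, multi-stage framework strengthens the exploration of interactions between domains during reconstruction.
Details
Motivation: Existing reconstruction methods for radio interferometric imaging are usually confined to a single domain, either the dirty image or the sparse visibility data. They lose complementary contextual information and cannot fully model the mutual dependency and consistency between domains. The paper addresses this by modeling cross-domain consistency.
Result: Experiments show that CDCRec improves imaging performance through better extraction of cross-domain correlations. Its self-supervised complementary modeling strategy outperforms current methods on interferometric domain translation tasks that depend on recovering dense information from constrained source-domain data.
Insight: The contribution is the explicit modeling of cross-domain consistency together with a hierarchical multi-task, multi-stage framework for exploring inter-domain interactions. Through multimodal fusion and self-supervision, the method makes effective use of complementary information in the visibility and image domains and offers a new reconstruction approach for radio astronomy imaging.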
Abstract: Radio astronomy plays a crucial role in understanding the universe, particularly within the realm of non-thermal astrophysics. Images of celestial objects are derived from the signals (called visibility) measured by radio telescopes. Such imaging results, called dirty images, contain artifacts due to factors such as sparsity and therefore require reconstruction to improve imaging quality. Existing methods typically restrict reconstruction to a unimodal domain, either to the dirty image after imaging or to the sparse visibility prior to imaging. Focusing solely on each unimodal reconstruction results in the loss of complementary in-context information in either the visibility or image domain, leading to an incomplete modeling of mutual dependency and consistency. To address these challenges, we propose CDCRec, a multimodal radio interferometric data reconstruction method that explicitly models cross-domain consistency. We design a hierarchical multi-task and multi-stage framework to enhance the exploration of interplays between domains during reconstruction. Our experimental results demonstrate that CDCRec improves imaging performance through enhanced cross-domain correlation extraction. In particular, our self-supervised complementary modeling strategy is better than current methods at interferometric domain translations that rely heavily on recovering dense information from constrained source-domain data.
[126] Frequency-Decomposed INR for NIR-Assisted Low-Light RGB Image Denoising cs.CVPDF
Ligen Shi, Zengyu Pang, Chang Liu, Shuchen Sun, Jun Qiu
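TL;DR: This paper proposes a near-infrared (NIR)-assisted low-light RGB image denoising method based on a Frequency-Decoupled Implicit Neural Representation (FD-INR). Using a prior on RGB-NIR cross-modal frequency correlations, it decomposes images into frequency components with multi-scale wavelet transforms and builds a dual-branch implicit neural representation: low-light RGB guides low-frequency reconstruction, and high-SNR NIR constrains high-frequency texture generation, so the two modalities complement each other in the frequency domain.
Details
Motivation: Visible images captured in low light suffer from severe noise and degraded high-frequency structure, and traditional methods that fuse rigidly in the spatial domain introduce color distortion and artifacts.
Result: Experiments show that FD-INR effectively restores luminance consistency and structural detail and, thanks to its implicit continuous representation, outperforms existing methods on arbitrary-resolution reconstruction, significantly improving the reliability of low-light perception.
Insight: The core contribution is frequency decomposition with differentiated supervision, grounded in a statistical prior on RGB-NIR cross-modal frequency correlations, together with an uncertainty-based adaptive weighting loss that automatically balances the contributions of the different frequency tasks. The result is a flexible, complementary fusion in the frequency domain.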
Abstract: Addressing the issues of severe noise and high-frequency structural degradation in visible images under low-light conditions, this paper proposes a near-infrared (NIR)-aided low-light image restoration method based on a Frequency-Decoupled Implicit Neural Representation (FD-INR). Based on the statistical prior of RGB-NIR cross-modal frequency correlations, specifically that low-frequency RGB signals are more reliable, whereas high-frequency NIR signals exhibit higher correlation, we explicitly decompose images into distinct frequency components via multi-scale wavelet transforms and construct a dual-branch implicit neural representation framework. Within this framework, we design a cross-modal differentiated frequency supervision mechanism, leveraging low-light RGB to guide the reconstruction of low-frequency luminance and color, and utilizing high-SNR NIR signals to constrain the generation of high-frequency texture details, thereby achieving complementary advantages in the frequency domain. Furthermore, an uncertainty-based adaptive weighting loss function is introduced to automatically balance the contributions of different frequency tasks, solving the problems of color distortion and artifacts caused by rigid fusion in the spatial domain that are common in traditional methods. Experimental results demonstrate that FD-INR not only effectively restores image luminance consistency and structural details but also, benefiting from its implicit continuous representation, outperforms existing methods in arbitrary-resolution reconstruction tasks, significantly enhancing the reliability of low-light perception.
[127] Channel Attention-Guided Cross-Modal Knowledge Distillation for Referring Image Segmentation cs.CVPDF
Chen Yang
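TL;DR: This paper proposes a channel attention-guided cross-modal knowledge distillation method for referring image segmentation, where large models are hard to deploy under limited compute. The method transfers to the student network both the high-order fine-grained vision-language correlations learned by the teacher and the correlations between the semantic components represented by each channel, so the student gains substantially without any extra parameters at inference.
Details
Motivation: Existing referring image segmentation methods typically rely on large vision and language encoders to improve performance, but their parameter counts severely restrict deployment where computing resources are limited. An efficient model compression method is therefore needed.
Result: Experiments on two public datasets show that the proposed distillation method introduces no additional parameters at inference and gives the student model a significant performance improvement.
Insight: Channel attention-guided cross-modal distillation transfers not only pixel-level relations but also the semantic correlations encoded in channel representations. The student inherits the teacher's knowledge while retaining part of its independent learning ability, which reduces the transfer of the teacher's learning bias.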
Abstract: Referring image segmentation (RIS) requires accurate segmentation of target regions in images according to language descriptions, which is a cross-modal task integrating vision and language. Existing RIS methods typically employ large-scale vision and language encoding models to improve performance, but their enormous parameter size severely restricts deployment in scenarios with limited computing resources. To solve this problem, this paper proposes a channel attention-guided cross-modal knowledge distillation method, which transfers the high-order fine-grained correlations between vision and language learned by the teacher network, as well as the correlations between semantic components represented by each channel, to the student network. Compared with the traditional pixel-wise relational distillation, this method not only enables the student to learn the knowledge of the teacher, but also retains part of its independent learning ability, alleviating the transfer of learning bias. Experimental results on two public datasets show that the proposed distillation method does not introduce additional parameters during inference and can achieve significant performance improvement for the student model.
[128] Hierarchical Vision Transformer Enhanced by Graph Convolutional Network for Image Classification cs.CV | cs.AIPDF
Haibin Jiao
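TL;DR: This paper proposes GCN-HViT, a hierarchical Vision Transformer enhanced by a Graph Convolutional Network (GCN), for image classification. The hierarchical ViT models information interaction on a global scale and across levels (small patches and large patches). The GCN acts as a local feature extractor that supplies a 2D position embedding for each patch and models interaction on a local scale, compensating for ViT's weakness in local structure modeling and GCN's weakness in capturing global structure.
Details
Motivation: ViT faces a difficult choice of patch size, and its 1D position embeddings do not capture spatial structure accurately, while GCN cannot capture global graph structure. The paper aims to combine ViT's global attention with GCN's local connectivity.
Result: Extensive experiments on three real-world datasets show that GCN-HViT achieves state-of-the-art (SOTA) performance.
Insight: The contribution is combining a hierarchical ViT with a GCN: the GCN produces 2D position embeddings and extracts local features, and the hierarchical structure integrates multi-scale patch information, making local and global modeling complementary.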
Abstract: Vision Transformer (ViT) has brought new breakthroughs to the field of image classification by introducing the self-attention mechanism, and Graph Convolutional Networks (GCN) have been proposed and successfully applied in data representation and analysis. However, there are key challenges which limit their further development: (1) The patch size selected by ViT is crucial for accurate predictions, which raises a natural question: how to select the size of patches properly, or how to comprehensively combine small patches and larger patches; (2) While spatial structure information is important in vision tasks, 1D position embeddings fail to capture the spatial structure of patches accurately; (3) The GCN can capture the local connectivity relationships between image nodes, but it lacks the ability to capture global graph structural information. Conversely, the self-attention mechanism of ViT can capture global relations among image patches, but it is unable to model the local structure of an image. To overcome such limitations, we propose the Hierarchical Vision Transformer Enhanced by Graph Convolutional Network (GCN-HViT) for image classification. Specifically, the Hierarchical ViT we designed can model patch-wise information interactions on a global scale within each level and model hierarchical relationships between small patches and large patches across multiple levels. In addition, the proposed GCN method functions as a local feature extractor to obtain the local representation of each image patch, which serves as a 2D position embedding of each patch in the 2D space. Meanwhile, it models patch-wise information interactions on a local scale within each level. Extensive experiments on 3 real-world datasets demonstrate that GCN-HViT achieves state-of-the-art performance.
[129] Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement cs.CVPDF
Xudong Li, Jiaxi Tan, Ziyin Zhou, Yan Zhong, Zihao Huang
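TL;DR: This paper proposes Q-DeepSight, a think-with-image framework for image quality assessment and refinement. It emulates human evidence-seeking judgment through interleaved multimodal chain-of-thought and tool-augmented evidence acquisition, aiming to give actionable, localized quality feedback. The framework achieves state-of-the-art results on several benchmarks, and a training-free Perceptual-in-Generation (PiG) framework uses its diagnoses to guide iterative image enhancement.
Details
Motivation: Current MLLM-based image quality assessment methods follow a single-look, language-only paradigm. This departs from how humans seek evidence when judging quality, so the rationales they produce lack solid visual grounding, which limits their reliability for in-the-loop refinement.
Result: Q-DeepSight achieves state-of-the-art performance across diverse benchmarks covering natural images, restored images, and AI-generated content.
Insight: The core contribution is a think-with-image framework that emulates human evidence seeking, gathering visual evidence through interleaved multimodal chain-of-thought and tool use such as crop-and-zoom. To train these long trajectories with reinforcement learning, the paper introduces Perceptual Curriculum Reward to mitigate reward sparsity and Evidence Gradient Filtering to improve credit assignment for visual reasoning. Practical value is shown through the training-free Perceptual-in-Generation framework, which closes the loop between assessment and refinement.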
Abstract: Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight’s diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.
[130] Adaptive Forensic Feature Refinement via Intrinsic Importance Perception cs.CVPDF
Jiazhen Yang, Junjun Zheng, Kejia Chen, Xiangheng Kong, Jie Lei
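TL;DR: This paper proposes I2P, a synthetic image detection framework that targets the weak cross-distribution generalization of existing methods built on visual foundation models. It adaptively identifies the representation layers most critical to detection and constrains task-driven parameter updates to a low-sensitivity parameter subspace, improving task adaptation while preserving as much of the transferable structure of the pretrained representations as possible.
Details
Motivation: With advances in generative models and multimodal content editing, the key challenge for synthetic image detection is cross-distribution generalization to unknown generation sources. Existing methods based on visual foundation models adapt coarsely: they typically use the final-layer representation directly or simply fuse multi-layer features, without explicitly modeling which level of the representation hierarchy best carries transferable forgery cues. Directly fine-tuning the foundation model, meanwhile, can damage the cross-modal pretrained structure that supports open-set generalization.
Result: The abstract does not report specific quantitative results, benchmarks, or performance levels.
Insight: The paper reformulates adapting a visual foundation model to synthetic image detection as a joint optimization problem: identify the critical representation layers that carry forgery-discriminative information, and at the same time constrain how much task knowledge injection perturbs the pretrained structure. Its intrinsic importance perception mechanism, combining adaptive critical-layer identification with parameter updates restricted to a low-sensitivity subspace, offers a new way to resolve the tension between task specificity and preservation of pretrained structure.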
Abstract: With the rapid development of generative models and multimodal content editing technologies, the key challenge faced by synthetic image detection (SID) lies in cross-distribution generalization to unknown generation sources. In recent years, visual foundation models (VFM), which acquire rich visual priors through large scale image-text alignment pretraining, have become a promising technical route for improving the generalization ability of SID. However, existing VFM-based methods remain relatively coarse-grained in their adaptation strategies. They typically either directly use the final layer representations of VFM or simply fuse multi layer features, lacking explicit modeling of the optimal representational hierarchy for transferable forgery cues. Meanwhile, although directly fine-tuning VFM can enhance task adaptation, it may also damage the cross-modal pretrained structure that supports open-set generalization. To address this task specific tension, we reformulate VFM adaptation for SID as a joint optimization problem: it is necessary both to identify the critical representational layer that is more suitable for carrying forgery discriminative information and to constrain the disturbance caused by task knowledge injection to the pretrained structure. Based on this, we propose I2P, an SID framework centered on intrinsic importance perception. I2P first adaptively identifies the critical layer representations that are most discriminative for SID, and then constrains task-driven parameter updates within a low sensitivity parameter subspace, thereby improving task specificity while preserving the transferable structure of pretrained representations as much as possible.
[131] Bias-constrained multimodal intelligence for equitable and reliable clinical AI cs.CVPDF
Cheng Li, Weijian Huang, Jiarun Liu, Hao Yang, Qi Yang
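TL;DR: This paper proposes BiasCareVL, a bias-aware multimodal learning framework for healthcare. It targets the pervasive biases in systems that fuse medical imaging with clinical text, with the goal of improving the fairness and reliability of AI in clinical settings. Bias control is built directly into the model design, and adaptive uncertainty modeling with optional human-in-the-loop refinement regulates the influence of dominant data patterns and promotes equitable reasoning under distributional imbalance.
Details
Motivation: In real clinical settings, medical vision-language systems face fairness and reliability challenges from pervasive biases such as imbalanced disease prevalence, skewed anatomical region distributions, heterogeneous imaging protocols, and demographic disparities.
Result: Across eight public benchmarks covering dermatology, oncology, radiology, and pathology, BiasCareVL consistently outperforms 20 state-of-the-art methods. Gains are pronounced in clinically challenging scenarios, including over 10% higher accuracy in multi-class skin lesion diagnosis and more than 20% higher Dice in small tumor segmentation. When evaluated with board-certified radiologists, its diagnostic performance exceeds human accuracy with substantially less time required.
Insight: The contribution is integrating bias control into the model design instead of applying it as a post hoc correction. Adaptive uncertainty modeling and human-in-the-loop refinement actively manage bias, and a unified representation space supports multiple clinical tasks, offering a transparent, reproducible, and equitable approach to medical AI.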
Abstract: The integration of medical imaging and clinical text has enabled the emergence of generalist artificial intelligence (AI) systems for healthcare. However, pervasive biases, such as imbalanced disease prevalence, skewed anatomical region distributions, heterogeneous imaging protocols, and demographic disparities, pose significant challenges to the fairness and reliability of vision-language systems in real-world clinical settings. Here we present BiasCareVL, a bias-aware multimodal learning framework that introduces bias control directly into model design, rather than treating it as a post hoc correction. BiasCareVL incorporates adaptive uncertainty modeling with optional human-in-the-loop refinement to regulate the influence of dominant data patterns and to promote equitable reasoning under distributional imbalance. Trained on 3.44 million samples spanning over 15 imaging modalities, the framework supports diverse clinical tasks, including visual question answering, disease classification, segmentation, and report generation within a unified representation space. Across eight public benchmarks covering dermatology, oncology, radiology, and pathology, BiasCareVL consistently outperforms 20 state-of-the-art methods, with pronounced gains in clinically challenging scenarios, including over 10% accuracy improvement in multi-class skin lesion diagnosis and more than 20% Dice improvement in small tumor segmentation. Furthermore, BiasCareVL achieves diagnostic performance exceeding human accuracy with substantially reduced time requirements when evaluated with board-certified radiologists. By open-sourcing BiasCareVL, we aim to promote a transparent, reproducible, and equitable future for AI in healthcare, paving the way for general-purpose, trustworthy, and clinically reliable AI systems.
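[132] CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization cs.CV PDF
Antonios Kritikos, Nikolaos Spanos, Athanasios Voulodimos
TL;DR: This paper proposes CrossFlowDG, a framework that addresses the modality gap in domain generalization through cross-modal flow matching. It uses noise-free flow matching to align domain-biased image embeddings toward domain-invariant text embeddings, achieving competitive performance on several benchmarks and state-of-the-art results on TerraIncognita.
Details
Motivation: Under style shifts, domain generalization models overfit to domain-specific appearance cues rather than class semantics. Existing contrastive alignment methods based on cosine similarity leave a modality gap in which image and text embeddings remain geometrically separated.
Result: Using the VMamba image encoder and CLIP’s text encoder on four common domain generalization benchmarks, the method achieves competitive performance on several benchmarks and state-of-the-art results on TerraIncognita.
Insight: The novelty lies in cross-modal flow matching, which learns a continuous transformation in a joint Euclidean latent space and explicitly transports domain-biased image embeddings toward domain-invariant text embeddings, bridging the modality gap. Viewed objectively, the continuous alignment mechanism of flow matching improves the geometric consistency of cross-modal representations.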
Abstract: Domain generalization (DG) aims to maintain performance under domain shift, which in computer vision appears primarily as stylistic variations that cause models to overfit to domain-specific appearance cues rather than class semantics. To overcome this, recent methods use textual representations as stable, domain-invariant anchors. However, multimodal approaches that rely on cosine similarity-based contrastive alignment leave a modality gap where image and text embeddings remain geometrically separated despite semantic correspondence. We propose CrossFlowDG, a novel DG framework that addresses this residual gap using noise-free, cross-modal flow matching. By learning a continuous transformation in the joint Euclidean latent space, our framework explicitly transports domain-biased image embeddings toward domain-invariant text embeddings of the correct class. Using the efficient VMamba image encoder and CLIP’s text encoder, CrossFlowDG is tested against four common DG benchmarks, and achieves competitive performance on several benchmarks and state-of-the-art on TerraIncognita. Code is available at: https://github.com/ajkrit/CrossFlowDG
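As a rough illustration of the noise-free flow matching idea described in the abstract, the sketch below treats an image embedding as the start point and a text embedding as the end point of a straight path. The straight-line path, the function names, and the toy three-dimensional embeddings are assumptions made for illustration; the abstract does not specify the path or the network that CrossFlowDG uses.

```python
def interpolate(x_img, x_txt, t):
    """Point on the straight path from the image embedding (t=0) to the text embedding (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x_img, x_txt)]


def target_velocity(x_img, x_txt):
    """For a straight path, the regression target is the constant displacement."""
    return [b - a for a, b in zip(x_img, x_txt)]


def flow_matching_loss(predicted_v, x_img, x_txt):
    """Mean squared error between a predicted velocity and the straight-path target."""
    target = target_velocity(x_img, x_txt)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, target)) / len(target)


def euler_transport(x_start, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x_start)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy embeddings (hypothetical values, not from the paper).
image_embedding = [0.9, 0.1, -0.3]
text_embedding = [0.2, 0.7, 0.4]

# An oracle velocity field stands in for the learned network.
oracle = lambda x, t: target_velocity(image_embedding, text_embedding)
transported = euler_transport(image_embedding, oracle)
```

With the oracle field, the transported point lands on the text embedding; a trained model would instead predict the velocity from the current point and time, and `flow_matching_loss` would be the quantity minimized.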
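[133] EasyVideoR1: Easier RL for Video Understanding cs.CV | cs.LG PDF
Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao
TL;DR: This paper presents EasyVideoR1, a complete and efficient reinforcement learning framework designed for training large vision-language models on video understanding tasks. Through offline preprocessing with tensor caching, a comprehensive task-aware reward system, a mixed offline-online data training paradigm, joint image-video training with independently configurable budgets, and an asynchronous multi-benchmark evaluation framework, it addresses the task diversity, computational overhead, and evaluation reproducibility challenges of extending reinforcement learning from verifiable rewards to video.
Details
Motivation: As models evolve into natively multimodal architectures, extending reinforcement learning from verifiable rewards to video understanding becomes increasingly important. The area remains underexplored because video task types are diverse, repeatedly decoding and preprocessing high-dimensional visual inputs is computationally expensive, and reproducible evaluation is difficult across many sensitive hyperparameters. Existing open-source RL training frameworks lack systematic optimizations for the video modality.
Result: The framework achieves a 1.47× throughput improvement and includes an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely matching officially reported scores.
Insight: Contributions include: (1) offline preprocessing with tensor caching that eliminates redundant video decoding and markedly improves training efficiency; (2) a modular, extensible reward system covering 11 distinct video and image problem types; (3) a mixed training paradigm that combines high-quality trajectories with on-policy exploration to handle more challenging tasks; (4) joint image-video training with independently configurable budgets per modality, so the two modalities reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework that ensures reproducible and comparable results.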
Abstract: Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
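To make the first contribution more concrete, here is a minimal sketch of preprocessing with a cache keyed on the video path and the preprocessing parameters, so that repeated epochs reuse decoded frames instead of decoding again. The class name `TensorCache`, the in-memory dictionary, and the `fake_decode` stand-in are all hypothetical; the abstract does not describe EasyVideoR1's actual cache layout, and a real pipeline would presumably persist tensors to disk.

```python
import hashlib
import json


class TensorCache:
    """In-memory cache of preprocessed frames, keyed by path and preprocessing parameters."""

    def __init__(self, decode_fn):
        self._decode_fn = decode_fn
        self._store = {}
        self.decode_calls = 0  # counts how often real decoding was needed

    def _key(self, video_path, params):
        # Sorting keys makes the hash independent of dictionary ordering.
        payload = json.dumps({"path": video_path, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, video_path, params):
        key = self._key(video_path, params)
        if key not in self._store:
            self.decode_calls += 1
            self._store[key] = self._decode_fn(video_path, params)
        return self._store[key]


def fake_decode(video_path, params):
    # Stand-in for real video decoding: returns a nested list shaped [frames][pixels].
    return [[0.0] * params["max_pixels"] for _ in range(params["num_frames"])]


cache = TensorCache(fake_decode)
params = {"num_frames": 4, "max_pixels": 8}
for _epoch in range(3):
    frames = cache.get("clip_0001.mp4", params)
```

Including the preprocessing parameters in the key matters because a change in frame count or pixel budget must invalidate the cached entry rather than silently return stale tensors.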
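[134] Physics-Informed Tracking (PIT) cs.CV | cs.AI PDF
Emil Hovad, Allan Peter Engsig-Karup
TL;DR: This paper proposes Physics-Informed Tracking (PIT), a video-based framework for tracking a single particle. It combines a neural network autoencoder with a differentiable physics module and constrains trajectories to satisfy known dynamics through the Physics-Informed Landmark Loss (PILL), which needs no labels, or its supervised variant (PILLS), which uses ground truth from simulation and enables end-to-end training.
Details
Motivation: The goal is to use known physical dynamics as a constraint to improve the accuracy and physical consistency of single-particle tracking from video when no labels, or only simulated data, are available.
Result: Evaluated in a replicated 2^6 factorial design (64 configurations, 4 replicates), PILLS consistently achieves sub-pixel tracking accuracy for both the bilinear and physics-refined decoder outputs under clean and noisy conditions.
Insight: The core novelty is embedding a differentiable physics module in the autoencoder and designing the Physics-Informed Landmark Loss (PILL) to enforce physical consistency without supervision; its supervised variant (PILLS) allows end-to-end training on simulated data. The autoencoder’s split bottleneck separates tracking-related structure from background noise and supports both supervised and unsupervised learning.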
Abstract: We propose Physics-Informed Tracking (PIT), a video-based framework for tracking a single particle, where a neural network autoencoder localizes the particle as a heatmap peak (landmark) and a differentiable physics module embedded in the autoencoder constrains several landmarks over time (a trajectory) to satisfy known dynamics. The novel Physics-Informed Landmark Loss (PILL) compares this predicted trajectory back against the landmarks, enforcing physical consistency without labels. Its supervised variant (PILLS) instead compares the prediction against ground-truth position, velocity, and bounce from simulation, enabling end-to-end backpropagation. To support supervised and unsupervised learning, we use an autoencoder with a split bottleneck that separates A) tracking-related structure via landmark heatmaps from B) background noise and subsequent image reconstruction. We evaluate a replicated 2^6 factorial design (n = 4 replicates, 64 configurations), showing that PILLS consistently achieves sub-pixel tracking accuracy for the bilinear and physics-refined decoder outputs under both clean and noisy conditions.
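The label-free consistency idea behind PILL can be illustrated with a toy one-dimensional example: a physics model predicts a trajectory, and the loss measures how far the detected landmarks deviate from it. The constant-acceleration model, the function names, and the numeric values are assumptions chosen for illustration; the paper's physics module (which also handles bounces) is not specified in the abstract.

```python
def ballistic_trajectory(y0, v0, gravity, dt, num_frames):
    """Vertical positions under constant acceleration, one per video frame."""
    return [y0 + v0 * (k * dt) + 0.5 * gravity * (k * dt) ** 2 for k in range(num_frames)]


def landmark_consistency_loss(landmarks, y0, v0, gravity, dt):
    """Mean squared distance between detected landmarks and the physics prediction."""
    predicted = ballistic_trajectory(y0, v0, gravity, dt, len(landmarks))
    return sum((l - p) ** 2 for l, p in zip(landmarks, predicted)) / len(landmarks)


# Hypothetical setup: 5 frames, pixel units, gravity pointing down the image.
clean_landmarks = ballistic_trajectory(y0=10.0, v0=2.0, gravity=9.8, dt=0.1, num_frames=5)
noisy_landmarks = list(clean_landmarks)
noisy_landmarks[2] += 0.5  # one landmark drifts off the physical path

clean_loss = landmark_consistency_loss(clean_landmarks, 10.0, 2.0, 9.8, 0.1)
noisy_loss = landmark_consistency_loss(noisy_landmarks, 10.0, 2.0, 9.8, 0.1)
```

No ground-truth positions appear in the loss: landmarks that follow the assumed dynamics give zero loss, and a landmark that leaves the physical path is penalized.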
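[135] KIRA: Knowledge-Intensive Image Retrieval and Reasoning Architecture for Specialized Visual Domains cs.CV PDF
Parthaw Goswami, Jaynto Goswami Deep
TL;DR: This paper presents KIRA, a knowledge-intensive image retrieval and reasoning architecture for specialized visual domains. It is a unified five-stage framework that addresses ten core challenges in visual retrieval-augmented generation, including cross-modal retrieval, knowledge base construction, multi-hop reasoning, and verification of answer grounding.
Details
Motivation: The work extends retrieval-augmented generation to visual domains, tackling fundamental challenges: the modality gap between image queries and text-based knowledge bases, the construction of meaningful visual knowledge bases, multi-hop visual reasoning, and ensuring that generated answers are grounded in visual evidence.
Result: On the DOMAINVQAR benchmark, which spans four specialized domains (medical X-ray, circuit diagrams, satellite imagery, and histopathology), KIRA achieves on average 0.97 retrieval precision, a grounding score of 1.0, and 0.707 domain correctness; a progressive ablation study examines the role of each component.
Insight: Novel components include: hierarchical semantic chunking with DINO-based region detection for multi-granularity knowledge base construction; domain-adaptive contrastive encoders with few-shot adaptation; dual-path cross-modal retrieval with chain-of-thought query expansion; multi-hop visual reasoning chains with temporal and multi-view support; and evidence-conditioned generation with post hoc hallucination verification. The framework offers a systematic solution for RAG in specialized visual domains, along with actionable insights into component trade-offs.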
Abstract: Retrieval-augmented generation (RAG) has transformed text-based question answering, yet its extension to visual domains remains hindered by fundamental challenges: bridging the modality gap between image queries and text-heavy knowledge bases, constructing semantically meaningful visual knowledge bases, performing multi-hop reasoning over retrieved images, and verifying that generated answers are faithfully grounded in visual evidence. We present KIRA (Knowledge-Intensive Image Retrieval and Reasoning Architecture), a unified five-stage framework that addresses ten core problems in visual RAG for specialized domains. KIRA introduces: (1) hierarchical semantic chunking with DINO-based region detection for multi-granularity knowledge base construction, (2) domain-adaptive contrastive encoders with few-shot adaptation for rare visual concepts, (3) dual-path cross-modal retrieval with chain-of-thought query expansion, (4) chain-of-retrieval for multi-hop visual reasoning with temporal and multi-view support, and (5) evidence-conditioned grounded generation with post hoc hallucination verification. We also propose DOMAINVQAR, a benchmark suite that evaluates visual RAG along three axes (retrieval precision, reasoning faithfulness, and domain correctness), going beyond standard recall metrics. Experiments across four specialized domains (medical X-ray, circuit diagrams, satellite imagery, and histopathology) with a progressive six-variant ablation demonstrate that KIRA achieves 0.97 retrieval precision, 1.0 grounding scores, and 0.707 domain correctness averaged across domains, while the ablation reveals actionable insights about when each component helps and when components introduce precision-diversity trade-offs that must be managed. Code will be released upon acceptance.
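[136] CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering cs.CV | cs.AI PDF
Xiyin Zeng, Yi Lu, Hao Wang
TL;DR: This paper proposes CoGR-MoE, a concept-guided expert routing framework for visual question answering. It uses the semantics of the answer options to guide expert selection during training and reweights the selected experts with option features, producing a discriminative representation for each candidate option that is then optimized via contrastive learning.
Details
Motivation: Existing Mixture-of-Experts methods for VQA suffer either from unstable routing, which leads to inconsistent expert selection within the same question type, or from overly stable routing, which reduces flexibility. The work aims for a routing mechanism that is both more stable and more flexible.
Result: Experiments show that CoGR-MoE delivers strong performance across multiple VQA tasks, supporting the effectiveness of the approach.
Insight: The novelty is explicitly incorporating answer-option semantics into expert routing and representation generation: concept guidance yields more consistent yet flexible expert selection, and option-level representations with contrastive learning strengthen the model’s discriminative ability.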
Abstract: Visual Question Answering (VQA) requires models to identify the correct answer options based on both visual and textual evidence. Recent Mixture-of-Experts (MoE) methods improve option reasoning by grouping similar concepts or routing based on examples. However, unstable routing can lead to inconsistent expert selection within the same question type, while overly stable routing may reduce flexibility. To address this, we propose a Concept-Guided Routing framework (CoGR-MoE), which incorporates the semantics of the answer options to guide expert selection in the training phase. Next, option features are used to reweight the selected experts, producing discriminative representations for each candidate option. These option-level representations are further used for option comparison and optimized via contrastive learning. The experimental results indicate that CoGR-MoE delivers strong performance across multiple VQA tasks, demonstrating the effectiveness of our approach.
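[137] Better with Less: Tackling Heterogeneous Multi-Modal Image Joint Pretraining via Conditioned and Degraded Masked Autoencoder cs.CV PDF
Bowen Peng, Yongxiang Liu, Jie Zhou, Xiaodong Chen, Tianpeng Liu
TL;DR: This paper proposes CoDe-MAE to resolve the Heterogeneity-Resolution Paradox in joint pretraining of high-resolution optical and synthetic aperture radar (SAR) imagery. Following a "better synergy with less alignment" philosophy, its three core modules (optical-anchored knowledge distillation, conditioned contrastive learning, and cross-modal degraded reconstruction) prevent feature suppression and contamination, yielding robust multimodal representations.
Details
Motivation: The core challenge in joint pretraining of high-resolution optical and SAR imagery is the Heterogeneity-Resolution Paradox: the higher the spatial resolution, the more the large differences in physical geometry and texture between the two modalities are amplified. Existing rigid-alignment paradigms then cause feature suppression or contamination, leading to representation degradation and negative transfer.
Result: Pretrained on 1M samples, CoDe-MAE shows remarkable data efficiency, prevents representation degradation, and sets new state-of-the-art (SOTA) results across diverse single- and bi-modal downstream tasks, substantially outperforming foundation models trained on much larger datasets.
Insight: The novelty is the "better synergy with less alignment" paradigm, realized through: (1) optical-anchored knowledge distillation, which maps SAR noise into a pure semantic manifold; (2) conditioned contrastive learning, which aligns shared consensus while safely preserving physical differences; and (3) cross-modal degraded reconstruction, which strips non-homologous spectral pseudo-features to capture true structural invariants. This offers a new approach to joint learning across extremely heterogeneous modalities.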
Abstract: Learning robust representations across extremely heterogeneous modalities remains a fundamental challenge in multi-modal vision. As a critical and profound instantiation of this challenge, high-resolution (HR) joint optical and synthetic aperture radar (SAR) pretraining seeks modality synergy to mutually enhance single-source representations; its potential is severely hindered by the Heterogeneity-Resolution Paradox: finer spatial scales drastically amplify the physical divergence between complex radar geometries and non-homologous optical textures. Consequently, migrating medium-resolution-oriented rigid alignment paradigms to HR scenarios triggers either severe feature suppression to force equivalence, or feature contamination driven by extreme epistemic uncertainty. Both extremes inevitably culminate in profound representation degradation and negative transfer. To overcome this bottleneck, we propose CoDe-MAE, pioneering a "better synergy with less alignment" philosophy. First, Optical-anchored Knowledge Distillation (OKD) implicitly regularizes SAR’s speckle noise by mapping it into a pure semantic manifold. Building on this, Conditioned Contrastive Learning (CCL) utilizes a gradient buffering mechanism to align shared consensus while safely preserving divergent physical signatures. Concurrently, Cross-Modal Degraded Reconstruction (CDR) deliberately strips non-homologous spectral pseudo-features, truncating the inherently ill-posed mapping to capture true structural invariants. Extensive analyses validate our theoretical claims. Pretrained on 1M samples, CoDe-MAE demonstrates remarkable data efficiency, successfully preventing representation degradation and establishing new state-of-the-art performance across diverse single- and bi-modal downstream tasks, substantially outperforming foundation models scaled on vastly larger datasets.
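[138] Self-Reasoning Agentic Framework for Narrative Product Grid-Collage Generation cs.CV PDF
Minyan Luo, Yuxin Zhang, Yifei Li, Xincan Wang, Fuzhang Wu
TL;DR: This paper proposes a self-reasoning agentic framework for generating narrative-driven product grid collages. The framework first constructs a Product Narrative Framework that explicitly represents the product’s identity, usage context, and environment, and translates it into complementary grids governed by a shared visual style. It then compiles constraint-aware prompts, feeds them to a generation model that synthesizes the collage jointly, and refines the result through evaluation and iterative self-reflection.
Details
Motivation: Existing image generation methods lack structured narrative planning and cross-panel coordination for narrative product photography, resulting in weak storytelling and visual inconsistency, particularly in the common multi-grid collage format.
Result: Experiments show that, compared with direct prompting baselines, the framework consistently improves aesthetic quality, narrative richness, and visual coherence.
Insight: The novelty is integrating narrative planning, constraint-aware prompt generation, joint image synthesis, and iterative self-reflective refinement based on evaluation and failure attribution into one agentic framework that produces visually consistent and narratively coherent grid collages.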
Abstract: Narrative-driven product photography has become a prevalent paradigm in modern marketing, as coherent visual storytelling helps convey product value and establishes emotional engagement with consumers. However, existing image generation methods do not support structured narrative planning or cross-panel coordination, often resulting in weak storytelling and visual incoherence. In practice, narrative product photography is commonly presented as multi-grid collages, where multiple views or scenes jointly communicate a product narrative. To ensure visual consistency across grids and aesthetic harmony of the overall composition, we generate the collage as a single unified image rather than composing independently synthesized panels. We propose a self-reasoning agentic framework for narrative product grid collage generation. Given a product packshot and its name, the system first constructs a Product Narrative Framework that explicitly represents the product’s identity, usage context, and situational environment, and translates it into complementary grids governed by a shared visual style. Constraint-aware prompts are then compiled and fed to a generation model that synthesizes the collage jointly. The generated output is evaluated on both content validity and photography quality, with explicit gates determining whether to proceed or refine. When evaluation fails, the system performs failure attribution and applies targeted refinement, enabling progressive improvement through iterative self-reflection. Experiments demonstrate that our framework consistently improves aesthetic quality, narrative richness, and visual coherence, compared to direct prompting baselines.
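[139] DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models cs.CV | cs.CL PDF
Biao Wu, Yiwu Zhong, Meng Fang, Ling Chen
TL;DR: This paper proposes DOSE, which uses off-the-shelf pretrained models (with no fine-tuning on the target data) to select multimodal training data. A joint quality-alignment distribution and adaptive weighted sampling improve data diversity and model performance.
Details
Motivation: Multimodal datasets contain noisy, redundant, and poorly aligned samples, and conventional data filtering adds extra computational cost; DOSE aims to address the former while avoiding the latter.
Result: On standard VQA and math benchmarks, models trained on DOSE-selected data match or surpass models trained on the full dataset.
Insight: The novelty is performing data selection with off-the-shelf pretrained models that have never seen the target data, with no task-specific training; the joint quality-alignment distribution and adaptive weighted sampling balance informativeness against long-tail diversity.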
Abstract: High-quality and diverse multimodal data are essential for improving vision-language models (VLMs), yet existing datasets often contain noisy, redundant, and poorly aligned samples. To address these problems, data filtering is commonly used to enhance the efficiency and performance of multimodal learning, but it introduces extra computational cost because filtering models are usually trained on the same data they are meant to screen. To reduce this cost, we study DOSE, which explores whether off-the-shelf pretrained models that have never seen the target data can be used to select training samples for larger and stronger multimodal models without any task-specific training. Even without fine-tuning, these models can effectively assess text quality and image-text alignment to guide data selection. Based on this, we build a joint quality-alignment distribution and apply adaptive weighted sampling to select informative samples while maintaining long-tail diversity. This approach enhances data diversity, enabling models trained on DOSE-filtered data to match or surpass those trained on the full dataset on standard VQA and math benchmarks. Extensive experiments demonstrate its effectiveness, efficiency, and scalability.
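One way to picture the selection step is the sketch below: each sample has a text-quality score and an image-text alignment score, the two scores are binned into a joint histogram, and samples are drawn without replacement with weights that favor high scores but discount crowded cells. The weighting formula, the `alpha` exponent, the bin count, and the function names are assumptions for illustration, not the scheme used by DOSE; scores are assumed to lie in [0, 1].

```python
import random
from collections import Counter


def bin_index(value, num_bins):
    """Map a score in [0, 1] to a histogram bin."""
    return min(int(value * num_bins), num_bins - 1)


def select_samples(quality, alignment, k, num_bins=4, alpha=0.5, seed=0):
    """Weighted sampling without replacement over a joint quality-alignment histogram."""
    rng = random.Random(seed)
    cells = [(bin_index(q, num_bins), bin_index(a, num_bins))
             for q, a in zip(quality, alignment)]
    density = Counter(cells)  # number of samples per joint cell
    keyed = []
    for i, (q, a) in enumerate(zip(quality, alignment)):
        # Higher joint score raises the weight; crowded cells are down-weighted.
        weight = (q * a + 1e-6) / (density[cells[i]] ** alpha)
        u = 1.0 - rng.random()  # uniform in (0, 1]
        # Efraimidis-Spirakis key: the k largest keys form a weighted sample.
        keyed.append((u ** (1.0 / weight), i))
    keyed.sort(reverse=True)
    return sorted(i for _, i in keyed[:k])


# Hypothetical scores from off-the-shelf scorers.
quality = [0.9, 0.85, 0.88, 0.2, 0.6, 0.95, 0.1, 0.7]
alignment = [0.8, 0.82, 0.79, 0.3, 0.9, 0.4, 0.2, 0.65]
selected = select_samples(quality, alignment, k=4)
```

Sampling rather than taking the top-k by score is what leaves room for less common regions of the score space, which is the long-tail diversity the abstract refers to.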
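[140] Adverse-to-the-eXtreme Panoptic Segmentation: URVIS 2026 Study and Benchmark cs.CV PDF
Yiting Wang, Nolwenn Peyratout, Tim Brodermann, Jiahui Wang, Yusi Cao
TL;DR: This paper reports on the URVIS 2026 challenge, the first challenge focused on panoptic segmentation in adverse-to-extreme weather conditions. Built on the MUSES multi-sensor dataset, it attracted 17 registered participants, and Weighted Panoptic Quality (wPQ) was designed as the official ranking metric. The report summarizes the challenge setting and benchmark results, analyzes the performance of the submitted methods, and discusses current progress and remaining challenges for robust multimodal panoptic segmentation.
Details
Motivation: The goal is robust panoptic segmentation from multimodal sensor data in adverse-to-extreme weather, advancing perception in complex environments for applications such as autonomous driving.
Result: The challenge, based on the MUSES dataset, attracted 17 registered participants and 47 submissions, with 4 teams reaching the final phase. The report summarizes the performance of each method, but the abstract gives no quantitative results and does not state whether SOTA was reached.
Insight: The novelty is organizing the first challenge on panoptic segmentation in extreme weather and introducing the Weighted Panoptic Quality (wPQ) metric for fair evaluation across weather conditions, underscoring the importance of fusing multimodal sensors (RGB camera, LiDAR, radar, and event camera) in adverse environments.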
Abstract: This paper presents the report of the URVIS 2026 challenge on adverse-to-extreme panoptic segmentation. As the first challenge of its kind, it attracted 17 registered participants and 47 submissions, with 4 teams reaching the final phase. The challenge is based on the MUSES dataset, a multi-sensor benchmark for panoptic segmentation in adverse-to-extreme weather, including RGB frame camera, LiDAR, radar, and event camera data. Weighted Panoptic Quality (wPQ) is designed and adopted as the official ranking metric for fair evaluation across weather conditions. In this report, we summarise the challenge setting and benchmark results, analyse the performance of the submitted methods, and discuss current progress and remaining challenges for robust multimodal panoptic segmentation. Link: https://urvis-workshop.github.io/challenge-Muses.html
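For readers unfamiliar with the underlying metric, the sketch below computes standard Panoptic Quality (the sum of IoUs over matched segments divided by TP + 0.5·FP + 0.5·FN) and then combines per-condition scores with a weighted average. The abstract does not define how wPQ weights weather conditions, so the `weighted_pq` function, the condition names, and the weights are purely hypothetical.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Standard PQ: sum of matched IoUs over (TP + 0.5 * FP + 0.5 * FN)."""
    denominator = (len(matched_ious)
                   + 0.5 * num_false_positives
                   + 0.5 * num_false_negatives)
    if denominator == 0:
        return 0.0
    return sum(matched_ious) / denominator


def weighted_pq(per_condition_pq, weights):
    """Weighted average of per-condition PQ (the weighting scheme is an assumption)."""
    total_weight = sum(weights[c] for c in per_condition_pq)
    return sum(weights[c] * pq for c, pq in per_condition_pq.items()) / total_weight


# Hypothetical per-condition results.
per_condition = {
    "clear": panoptic_quality([0.8, 0.6], 1, 1),
    "fog": panoptic_quality([0.7], 2, 0),
}
hypothetical_weights = {"clear": 1.0, "fog": 2.0}
score = weighted_pq(per_condition, hypothetical_weights)
```

A weighted average of this kind lets harder or rarer conditions count more than their share of the data; whether wPQ does this is not stated in the abstract.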
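[141] DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection cs.CV PDF
Hongyuan Qi, Feifei Shao, Ming Li, Hehe Fan, Jun Xiao
TL;DR: This paper proposes DVAR, a training-free adversarial multi-agent debate framework for video authenticity detection. It recasts detection as a structured multi-agent forensic reasoning process: a Generative Hypothesis Agent and a Natural Mechanism Agent debate through cross-examination, the Minimum Description Length principle quantifies the cost of each explanation, and the dynamic knowledge repository GenVideoKB supplies heuristics on generative boundaries, enabling generalization beyond training distributions.
Details
Motivation: Rapid progress in video generation leaves conventional detection methods with poor generalization; the work aims to improve detection robustness on unseen generative architectures through a debate-based reasoning framework.
Result: In extensive experiments, DVAR achieves competitive performance against supervised SOTA methods and shows superior generalization to unseen generative architectures.
Insight: Novel elements include recasting detection as a multi-agent debate, introducing the Minimum Description Length framework to quantify explanatory cost, and integrating a dynamic knowledge repository that provides high-level reasoning heuristics. Viewed objectively, its training-free, transparent reasoning through logical debate offers an interpretable new paradigm for media forensics.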
Abstract: The rapid evolution of video generation technologies poses a significant challenge to media forensics, as conventional detection methods often fail to generalize beyond their training distributions. To address this, we propose DVAR (Debate-based Video Authenticity Reasoning), a training-free framework that reformulates video detection as a structured multi-agent forensic reasoning process. Moving beyond the paradigm of pattern matching, DVAR orchestrates a competition between a Generative Hypothesis Agent and a Natural Mechanism Agent. Through iterative rounds of cross-examination, these agents defend their respective explanations against abnormal evidence, driving a logical convergence where the truth emerges from rigorous stress-testing. To adjudicate these conflicting claims, we apply Occam’s Razor through the Minimum Description Length (MDL) framework, defining an Explanatory Cost to quantify the “logical burden” of each reasoning path. Furthermore, we integrate GenVideoKB, a dynamic knowledge repository that provides high-level reasoning heuristics on generative boundaries and failure modes. Extensive experiments demonstrate that DVAR achieves competitive performance against supervised state-of-the-art methods while exhibiting superior generalization to unseen generative architectures. By transforming detection into a transparent debate, DVAR provides explicit, interpretable reasoning traces for robust video authenticity assessment.
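The MDL-based adjudication can be illustrated with a small two-part code: the cost of an explanation is the number of bits needed to state the hypothesis plus the bits needed to encode each piece of evidence under it, and the cheaper explanation wins. The probabilities below and the function names are invented for illustration; how DVAR actually derives its Explanatory Cost from agent arguments is not given in the abstract.

```python
import math


def description_length(prior, evidence_likelihoods):
    """Two-part code length in bits: hypothesis cost plus evidence cost under it."""
    hypothesis_bits = -math.log2(prior)
    evidence_bits = sum(-math.log2(p) for p in evidence_likelihoods)
    return hypothesis_bits + evidence_bits


def adjudicate(costs):
    """Return the explanation with the smallest description length."""
    return min(costs, key=costs.get)


# Hypothetical likelihoods each agent assigns to three observed anomalies.
costs = {
    "generated": description_length(0.5, [0.8, 0.7, 0.9]),
    "natural": description_length(0.5, [0.3, 0.2, 0.9]),
}
verdict = adjudicate(costs)
```

Evidence that an explanation finds surprising (low likelihood) costs many bits, which is one way to read the "logical burden" mentioned in the abstract.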
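[142] Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling cs.CV | cs.AI PDF
Wei Li, Yuyang Li, Kaile Du, Yi Yu, Guangcan Liu
TL;DR: This paper proposes Inductive Convolution Nuclear Norm Minimization (ICNNM), a new method for tensor completion with arbitrary sampling. By introducing pre-learned convolution eigenvectors, it avoids the computational bottleneck of repeated singular value decompositions in the original Convolution Nuclear Norm Minimization (CNNM), significantly reducing computation time, and it improves recovery performance through the additional prior knowledge.
Details
Motivation: When the original CNNM is applied to tensor completion with arbitrary sampling, its optimization procedure must perform singular value decomposition many times, which is computationally expensive and hard to parallelize, limiting practical use.
Result: Extensive experiments on video completion, prediction, and frame interpolation show that ICNNM outperforms CNNM and several other competing methods in recovery performance.
Insight: The novelty is reformulating the CNNM optimization objective from the perspective of convolution eigenvectors and introducing pre-learned convolution eigenvectors that are shared across different tensors. This avoids costly SVD computation and, by encoding additional prior knowledge, improves performance, giving an efficient and more accurate inductive solution for tensor completion.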
Abstract: The recently established Convolution Nuclear Norm Minimization (CNNM) addresses the problem of tensor completion with arbitrary sampling (TCAS), which involves restoring a tensor from a subset of its entries sampled in an arbitrary manner. Despite its promising performance, the optimization procedure of CNNM requires performing Singular Value Decomposition (SVD) multiple times, which is computationally expensive and hard to parallelize. To address the issue, we reformulate the optimization objective of CNNM from the perspective of convolution eigenvectors. By introducing pre-learned convolution eigenvectors which are shared among different tensors, we propose a novel method called Inductive Convolution Nuclear Norm Minimization (ICNNM), which bypasses the SVD step and thereby significantly decreases the computational time. In addition, due to the extra prior knowledge encoded in the pre-learned convolution eigenvectors, ICNNM also outperforms CNNM in terms of recovery performance. Extensive experiments on video completion, prediction and frame interpolation verify the superiority of ICNNM over CNNM and several other competing methods.
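[143] TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation cs.CV | cs.SD PDF
Xinran Liu, Diptesh Kanojia, Wenwu Wang, Zhenhua Feng
TL;DR: This paper proposes TeMuDance, a framework that addresses the lack of semantic controllability in music-driven dance generation. Without any manually annotated music-text-motion triplet dataset, it uses motion as a shared semantic anchor to align disjoint music-dance and text-motion datasets in a unified embedding space, then trains a lightweight text control branch that provides fine-grained semantic guidance while preserving rhythmic fidelity to the music.
Details
Motivation: Existing music-driven dance generation methods achieve strong realism and effective audio-motion alignment but generally lack semantic controllability, making it hard to guide specific movements with natural language descriptions. This stems mainly from the absence of large-scale datasets that align music, text, and motion for supervised learning.
Result: Extensive experiments show that TeMuDance maintains competitive dance generation quality while substantially improving text-conditioned control over existing methods. The paper also proposes a novel task-aligned metric that quantifies whether text prompts induce the intended kinematic attributes under music conditioning.
Insight: The novelty is a motion-centred bridging paradigm that uses motion as a shared semantic anchor to align datasets from different modalities, avoiding costly triplet annotation. The work also designs a dual-stream fine-tuning strategy with confidence-based filtering to suppress noise in the retrieved supervision, and proposes a task-aligned evaluation metric. Viewed objectively, it is a practical way to reuse existing data for cross-modal alignment and control.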
Abstract: Existing music-driven dance generation approaches have achieved strong realism and effective audio-motion alignment. However, they generally lack semantic controllability, making it difficult to guide specific movements through natural language descriptions. This limitation primarily stems from the absence of large-scale datasets that jointly align music, text, and motion for supervised learning of text-conditioned control. To address this challenge, we propose TeMuDance, a framework that enables text-based control for music-conditioned dance generation without requiring any manually annotated music-text-motion triplet dataset. TeMuDance introduces a motion-centred bridging paradigm that leverages motion as a shared semantic anchor to align disjoint music-dance and text-motion datasets within a unified embedding space, enabling cross-modal retrieval of missing modalities for end-to-end training. A lightweight text control branch is then trained on top of a frozen music-to-dance diffusion backbone, preserving rhythmic fidelity while enabling fine-grained semantic guidance. To further suppress noise inherent in the retrieved supervision, we design a dual-stream fine-tuning strategy with confidence-based filtering. We also propose a novel task-aligned metric that quantifies whether textual prompts induce the intended kinematic attributes under music conditioning. Extensive experiments demonstrate that TeMuDance achieves competitive dance quality while substantially improving text-conditioned control over existing methods.
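The motion-centred bridging idea can be sketched as a retrieval step: a dance clip from the music-dance dataset is embedded in the shared motion space, the nearest motion from the text-motion dataset supplies a pseudo text label, and low-similarity matches are discarded. The embeddings, the cosine-similarity retrieval, the threshold value, and the function names are illustrative assumptions; the paper's encoder and its confidence-based filter are not specified in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_text_for_dance(dance_motion_emb, text_motion_pairs, min_confidence=0.8):
    """Find the text whose paired motion is closest to the dance motion.

    Returns (text, score), or (None, score) when the best match is below
    the confidence threshold and should not be used as supervision.
    """
    best_text, best_score = None, -1.0
    for text, motion_emb in text_motion_pairs:
        score = cosine(dance_motion_emb, motion_emb)
        if score > best_score:
            best_text, best_score = text, score
    if best_score < min_confidence:
        return None, best_score
    return best_text, best_score


# Hypothetical 2-D motion embeddings from the text-motion dataset.
text_motion_pairs = [
    ("spin clockwise", [1.0, 0.0]),
    ("jump in place", [0.0, 1.0]),
]
confident = retrieve_text_for_dance([1.0, 0.1], text_motion_pairs)
ambiguous = retrieve_text_for_dance([1.0, 1.0], text_motion_pairs)
```

The second query sits halfway between both motions, so its best similarity falls under the threshold and no pseudo label is produced, which mirrors the role of filtering noisy retrieved supervision.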
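[144] Towards Universal Skeleton-Based Action Recognition cs.CV PDF
Jidong Kuang, Hongsong Wang, Jie Gui
TL;DR: This paper studies the challenging problem of heterogeneous skeleton-based action recognition with open vocabularies, aiming at universal skeleton-based action recognition. The authors construct a large-scale heterogeneous open-vocabulary skeleton dataset and propose a Transformer-based model with three key components (unified skeleton representation, a skeleton motion encoder, and multi-grained motion-text alignment) that learns spatio-temporal action representations and maps them to a semantic space aligned with text embeddings.
Details
Motivation: Real-world applications involve heterogeneous skeleton data (arising from different sources of human skeletons and different humanoid robot structures) and require open-vocabulary action recognition, whereas previous work overlooks this heterogeneity and builds models using only homogeneous skeletons.
Result: Extensive experiments on popular benchmarks with heterogeneous skeleton data demonstrate the effectiveness and generalization ability of the proposed method.
Insight: The novelty is constructing a large-scale heterogeneous open-vocabulary skeleton dataset and proposing a framework that combines unified skeleton representation, a two-stream Transformer motion encoder, and multi-grained (global, stream-specific, and fine-grained) motion-text contrastive alignment to handle heterogeneous skeleton data and enable open-vocabulary recognition.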
Abstract: With the development of robotics, skeleton-based action recognition has become increasingly important, as human-robot interaction requires understanding the actions of humans and humanoid robots. Due to different sources of human skeletons and structures of humanoid robots, skeleton data naturally exhibit heterogeneity. However, previous works overlook the data heterogeneity of skeletons and solely construct models using homogeneous skeletons. Moreover, open-vocabulary action recognition is also essential for real-world applications. To this end, this work studies the challenging problem of heterogeneous skeleton-based action recognition with open vocabularies. We construct a large-scale Heterogeneous Open-Vocabulary (HOV) Skeleton dataset by integrating and refining multiple representative large-scale skeleton-based action datasets. To address universal skeleton-based action recognition, we propose a Transformer-based model that comprises three key components: unified skeleton representation, motion encoder for skeletons, and multi-grained motion-text alignment. The motion encoder feeds multi-modal skeleton embeddings into a two-stream Transformer-based encoder to learn spatio-temporal action representations, which are then mapped to a semantic space to align with text embeddings. Multi-grained motion-text alignment incorporates contrastive learning at three levels: global instance alignment, stream-specific alignment, and fine-grained alignment. Extensive experiments on popular benchmarks with heterogeneous skeleton data demonstrate both the effectiveness and the generalization ability of the proposed method. Code is available at https://github.com/jidongkuang/Universal-Skeleton.
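Motion-text contrastive alignment is commonly implemented with a symmetric InfoNCE loss over a similarity matrix whose diagonal holds the matching motion-text pairs. The sketch below shows that loss and a weighted sum over the three levels named in the abstract. Using InfoNCE, the temperature value, the level weights, and the toy similarity matrices are assumptions; the abstract does not say how the three alignment terms are computed or combined.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax of one row."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def info_nce(similarity, temperature=0.1):
    """Symmetric InfoNCE: matching motion-text pairs lie on the diagonal."""
    n = len(similarity)
    scaled = [[s / temperature for s in row] for row in similarity]
    motion_to_text = -sum(log_softmax_row(scaled[i])[i] for i in range(n)) / n
    transposed = [list(col) for col in zip(*scaled)]
    text_to_motion = -sum(log_softmax_row(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (motion_to_text + text_to_motion)


def multi_grained_loss(similarity_by_level, level_weights):
    """Weighted sum of per-level contrastive losses (the weights are hypothetical)."""
    return sum(level_weights[name] * info_nce(sim)
               for name, sim in similarity_by_level.items())


aligned = [[1.0, 0.0], [0.0, 1.0]]      # each motion matches its own text
mismatched = [[0.0, 1.0], [1.0, 0.0]]   # each motion matches the wrong text

levels = {"global": aligned, "stream": aligned, "fine": mismatched}
weights = {"global": 1.0, "stream": 0.5, "fine": 0.5}
total = multi_grained_loss(levels, weights)
```

Well-aligned pairs give a loss near zero, while swapped pairs are penalized heavily, so minimizing the sum pulls each motion embedding toward its own text embedding at every level.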
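[145] LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing cs.CV PDF
Weicheng Wang, Zhicheng Zhang, Zhongqi Zhang, Juncheng Zhou, Yongjie Zhu
TL;DR: This paper proposes LIVE, a framework that strengthens instruction-based video editing by jointly leveraging large-scale, high-quality image editing data and video datasets. It introduces a frame-wise token noise strategy to handle the domain discrepancy between static images and dynamic videos, and adopts a two-stage training strategy to progressively improve video editing performance.
Details
Motivation: High annotation costs limit the scale, quality, and task diversity of video editing training data; the work aims to compensate for this shortage with abundant image editing data.
Result: On a comprehensive evaluation benchmark covering over 60 challenging tasks, extensive comparative and ablation experiments validate the method, which achieves state-of-the-art performance.
Insight: The novelty is using image editing priors to assist video editing, mitigating the image-video domain discrepancy through the frame-wise token noise strategy and two-stage training, while building an automated data pipeline and a comprehensive evaluation benchmark to improve model capability.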
Abstract: Video editing aims to modify input videos according to user intent. Recently, end-to-end training methods have garnered widespread attention, constructing paired video editing data through video generation or editing models. However, compared to image editing, the high annotation costs of video data severely constrain the scale, quality, and task diversity of video editing datasets when relying on video generative models or manual annotation. To bridge this gap, we propose LIVE, a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, we introduce a frame-wise token noise strategy, which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. Moreover, through cleaning public datasets and constructing an automated data pipeline, we adopt a two-stage training strategy to anneal video editing capabilities. Furthermore, we curate a comprehensive evaluation benchmark encompassing over 60 challenging tasks that are prevalent in image editing but scarce in existing video datasets. Extensive comparative and ablation experiments demonstrate that our method achieves state-of-the-art performance. The source code will be publicly available.
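[146] IMA-MoE: An Interpretable Modality-Aware Mixture-of-Experts Framework for Characterizing the Neurobiological Signatures of Binge Eating Disorder cs.CV PDF
Lin Zhao, Qiaohui Gao, Elizabeth Martin, Kurt P. Schulz, Tom Hildebrandt
TL;DR: This study proposes IMA-MoE, an interpretable modality-aware Mixture-of-Experts framework that integrates multimodal data (neuroimaging, behavioral, hormonal, and demographic) to identify the neurobiological signatures of binge eating disorder (BED). The framework encodes each measure as a distinct token to flexibly model cross-modal dependencies and introduces a token-importance mechanism to enhance interpretability. Evaluation on the large-scale ABCD dataset shows that IMA-MoE outperforms baseline methods in differentiating BED from healthy controls and reveals sex-specific predictive patterns.
Details
Motivation: Binge eating disorder is the most prevalent eating disorder, yet current diagnosis rests mainly on symptoms rather than biological mechanisms, limiting early detection and intervention. Existing studies rely largely on hypothesis-driven parametric models, single-modality analyses, or limited data, so their findings are hard to generalize. Advanced data-driven frameworks are therefore needed to model multimodal data and uncover generalizable, biologically meaningful signatures of BED.
Result: Evaluated on the large-scale Adolescent Brain Cognitive Development (ABCD) dataset, IMA-MoE outperforms baseline methods in differentiating BED from healthy controls. Qualitative results reveal sex-specific predictive patterns, such as hormonal measures contributing more prominently to prediction in females.
Insight: The novelty is an interpretable multimodal Mixture-of-Experts framework (IMA-MoE) that integrates heterogeneous data through tokenized encoding and models cross-modal dependencies, while a token-importance mechanism quantifies each measure’s contribution to the prediction and improves interpretability. Viewed objectively, the study combines multimodal learning with interpretable AI for biomarker discovery in neuropsychiatric disorders, offering a data-driven approach to precision medicine.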
Abstract: Binge eating disorder (BED) is the most prevalent eating disorder. However, current diagnostic frameworks remain largely grounded in symptom-based criteria rather than underlying biological mechanisms, thereby limiting early detection and the development of biologically-informed interventions. Emerging studies have begun to investigate the neurobiological signatures of BED, yet their findings are often difficult to generalize due to the reliance on hypothesis-driven parametric models, single-modality analyses, and limited data diversity. Therefore, there is a critical need for advanced data-driven frameworks capable of modeling multimodal data to uncover generalizable and biologically meaningful signatures of BED. In this study, we propose the Interpretable Modality-Aware Mixture-of-Experts (IMA-MoE), a novel architecture designed to integrate heterogeneous neuroimaging, behavioral, hormonal, and demographic measures within a unified predictive framework. By encoding each measure as a distinct token, IMA-MoE enables flexible modeling of cross-modal dependencies while preserving modality-specific characteristics. We further introduce a token-importance mechanism to enhance interpretability by quantifying the contribution of each measure to model predictions. Evaluated on the large-scale Adolescent Brain Cognitive Development (ABCD) dataset, IMA-MoE demonstrates superior performance in differentiating BED from healthy controls compared with baseline methods, while revealing sex-specific predictive patterns, with hormonal measures contributing more prominently to prediction in females. Collectively, these findings highlight the promise of interpretable, data-driven multimodal modeling in advancing biologically-informed characterization of BED and facilitating more precise and personalized interventions in neuropsychiatric disorders.
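[147] Conditional Evidence Reconstruction and Decomposition for Interpretable Multimodal Diagnosis cs.CV PDF
Shaowen Wan, Yanjun Lv, Lu Zhang, Dajiang Zhu, Bharat Biswal
TL;DR: This paper proposes Conditional Evidence Reconstruction and Decomposition (CERD), a framework that handles missing modalities in multimodal diagnosis and improves model interpretability. It first reconstructs missing modality representations conditioned on each subject’s observed inputs, then decomposes diagnostic evidence via logit-level attribution into shared cross-modal corroboration and modality-specific cues.
Details
Motivation: Incomplete modality coverage makes multimodal diagnostic models brittle, and existing methods struggle to capture subject-specific cross-modal dependencies and offer limited interpretability; the work addresses both limitations.
Result: Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset show that CERD outperforms competitive baselines under incomplete-modality settings and produces structured, clinically aligned evidence attributions.
Insight: The novelty is combining conditional reconstruction with evidence decomposition, handling missing modalities on a per-subject basis and supporting interpretable diagnostic decisions through logit-level attribution, which provides a new interpretability framework for multimodal modeling.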
Abstract: Neurobiological and neurodegenerative diseases are inherently multifactorial, arising from coupled influences spanning genetic susceptibility, brain alterations, and environmental and behavioral factors. Multimodal modeling has therefore been increasingly adopted for disease diagnosis by integrating complementary evidence across data sources. However, in both large-scale cohorts and real-world clinical workflows, modality coverage is often incomplete, making many multimodal models brittle when one or more modalities are unavailable. Existing approaches to incomplete multimodal diagnosis typically rely on group-wise or static priors, which may fail to capture subject-specific cross-modal dependencies; moreover, many models provide limited interpretability into which evidence sources drive the final decision. To address these limitations, we propose Conditional Evidence Reconstruction and Decomposition (CERD), a framework for interpretable multimodal diagnosis with incomplete modalities. CERD first reconstructs missing modality representations conditioned on each subject’s observed inputs, then decomposes diagnostic evidence into shared cross-modal corroboration and modality-specific cues via logit-level attribution. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) demonstrate that CERD outperforms competitive baselines under incomplete-modality settings while producing structured and clinically aligned evidence attributions for trustworthy decision support.
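[148] SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models cs.CV
Yifei Zhao, Qian Lou, Mengxin Zheng
TL;DR: This paper addresses copyright protection for large vision-language models (LVLMs) with SIF (Semantically In-Distribution Fingerprints), a non-intrusive ownership verification framework. Semantic-Aligned Fingerprint Distillation (SAFD) transfers text watermarking signals into the visual modality to produce semantically coherent yet fingerprinted responses, and Robust-Fingerprint Optimization (RFO) strengthens robustness to model modifications such as fine-tuning and quantization. Experiments on LLaVA-1.5 and Qwen2.5-VL confirm SIF's stealthiness and robustness.
Details
Motivation: Existing fingerprinting methods rely on semantically abnormal queries or out-of-distribution responses, which adversaries can easily detect and remove, leaving a security gap. The paper targets unauthorized reuse and intellectual property infringement of LVLMs with a stealthier, more robust fingerprinting scheme.
Result: Extensive experiments on LLaVA-1.5 and Qwen2.5-VL show that SIF achieves strong stealthiness and robustness, resisting the Semantic Divergence Attack (SDA) as well as model modifications such as fine-tuning and quantization, and offers a practical solution for LVLM copyright protection.
Insight: The novelty is the semantically in-distribution fingerprint: SAFD keeps fingerprints consistent with the normal semantic distribution, which improves stealthiness, while RFO simulates worst-case perturbations to improve robustness. The method requires no parameter modification, is non-intrusive, and could inform copyright protection for other multimodal models.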
Abstract: The public accessibility of large vision-language models (LVLMs) raises serious concerns about unauthorized model reuse and intellectual property infringement. Existing ownership verification methods often rely on semantically abnormal queries or out-of-distribution responses as fingerprints, which can be easily detected and removed by adversaries. We expose this vulnerability through a Semantic Divergence Attack (SDA), which identifies and filters fingerprint queries by measuring semantic divergence between a suspect model and a reference model, showing that existing fingerprints are not semantic-preserving and are therefore easy to detect and bypass. To address these limitations, we propose SIF (Semantically In-Distribution Fingerprints), a non-intrusive ownership verification framework that requires no parameter modification. SIF introduces Semantic-Aligned Fingerprint Distillation (SAFD), which transfers text watermarking signals into the visual modality to produce semantically coherent yet fingerprinted responses. In addition, Robust-Fingerprint Optimization (RFO) enhances robustness by simulating worst-case representation perturbations, making the fingerprints resilient to model modifications such as fine-tuning and quantization. Extensive experiments on LLaVA-1.5 and Qwen2.5-VL demonstrate that SIF achieves strong stealthiness and robustness, providing a practical solution for LVLM copyright protection. Code is available at https://github.com/UCF-ML-Research/SIF-VLM-Fingerprint
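[149] OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning cs.CV
Zhijia Liang, Jiaming Li, Weikai Chen, Yanhao Zhang, Haonan Lu
TL;DR: OASIS is a new framework for streaming video reasoning that uses structured, on-demand retrieval to cope with history that grows without bound while key evidence stays scarce. It organizes streaming history into hierarchical events and reasons by controlled refinement: short-context inference first, with semantically grounded retrieval only when uncertainty arises.
Details
Motivation: In streaming video reasoning, unbounded history creates heavy redundancy that drowns out the key signal (the "oasis"), and neither enlarging memory nor aggressive compression helps. The core challenge is deciding where to look rather than how much to remember.
Result: Experiments across multiple benchmarks and streaming MLLM backbones show that OASIS substantially improves long-horizon accuracy and compositional reasoning with bounded token cost and low request delay.
Insight: The novelty is a plug-and-play, training-free, structured on-demand retrieval mechanism driven by high-level intent rather than embedding similarity, with reasoning organized as controlled refinement so that memory is used more accurately and with less noise.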
Abstract: Streaming video reasoning requires models to operate in a setting where history grows without bound while meaningful evidence remains scarce. In such a landscape, relevant signal is like an oasis: small, critical, and easily lost in a desert of redundancy. Enlarging memory only widens the desert; aggressive compression dries up the oasis. The real difficulty lies in discovering where to look, not how much to remember. We therefore introduce OASIS, a novel framework for streaming video reasoning that tackles this challenge through structured, on-demand retrieval. It organizes streaming history into hierarchical events and performs reasoning as controlled refinement: short-context inference first, followed by semantically grounded retrieval only when uncertainty arises. As the retrieval is driven by high-level intent rather than embedding similarity, the retrieved memory is substantially more accurate and less noisy. Additionally, the mechanism is plug-and-play, training-free, and readily attaches to different streaming MLLM backbones. Experiments across multiple benchmarks and backbones show that OASIS achieves strong gains in long-horizon accuracy and compositional reasoning with bounded token cost and low request delay. Code is available at https://github.com/Solus-sano/OASIS.
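[150] mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval cs.CV | cs.AI
Kyeong Seon Kim, Baek Seong-Eun, Lee Jung-Mok, Tae-Hyun Oh
TL;DR: This paper proposes mEOL, a training-free, instruction-guided multimodal embedding framework that uses a multimodal large language model to map text, raster images, and SVG code into an aligned embedding space. Modality-specific instructions and structural SVG cues steer the embeddings, with no learned projection heads or contrastive training. The two key components are the Multimodal Explicit One-word Limitation and a semantic SVG rewriting module; using a repurposed VGBench, the authors build the first text-to-SVG retrieval benchmark and outperform encoder-based and training-based multimodal baselines.
Details
Motivation: Existing methods usually rasterize Scalable Vector Graphics (SVGs), discarding the rich geometric and layout information they carry as structured code; meanwhile, sentence embedding methods produce strong text representations but do not extend naturally to visual or structured modalities. The paper asks how to align the semantic representations of text, images, and SVG code and how to exploit SVG structure for retrieval.
Result: On the first text-to-SVG retrieval benchmark, built from a repurposed VGBench, the training-free embeddings outperform both encoder-based and training-based multimodal baselines.
Insight: The contributions are (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding, and (2) a semantic SVG rewriting module that uses visual reasoning to assign meaningful identifiers to SVG elements and simplify nested structure, exposing geometric and relational cues hidden in raw code. Together they show that prompt-level control can be an effective alternative to parameter-level training for structure-aware multimodal retrieval.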
Abstract: Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding; and (2) a semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: https://scene-the-ella.github.io/meol/
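[151] Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition cs.CV
Yiming Wang, Frederick W. B. Li, Jingyun Wang
TL;DR: This paper proposes a zero-shot video action recognition framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive features from global static features, and a Motion Aggregation Block (MAB) refines the motion representation with gated cross-attention. Projected embeddings are aligned with positive textual prompts to tighten video-text semantic alignment, while negative prompts explicitly model "non-class" semantics to help generalization to unseen classes.
Details
Motivation: The semantic gap between seen and unseen classes makes zero-shot action recognition hard; the goal is better generalization to unseen action classes.
Result: Experiments on standard benchmarks show that the method consistently outperforms prior CLIP-based approaches and achieves robust zero-shot action recognition on both coarse- and fine-grained datasets.
Insight: The novelty is bringing motion feature disentanglement and aggregation into CLIP and aligning semantics with both positive and negative textual prompts, explicitly modeling class and non-class semantics to narrow the semantic gap and improve zero-shot generalization.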
Abstract: Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model “non-class” semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse and fine-grained datasets.
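[152] BasketHAR: A Multimodal Dataset for Human Activity Recognition and Sport Analysis in Basketball Training Scenarios cs.CV
Xian Gao, Haoyue Zhang, Zongyun Zhang, Jiacheng Ruan, Ting Liu
TL;DR: This paper introduces BasketHAR, a new multimodal human activity recognition (HAR) dataset designed for basketball training. It includes inertial measurement unit data, heart rate, skin temperature, and synchronized video, and comes with a baseline multimodal alignment method, with the aim of supporting advanced HAR tasks and sports analytics applications.
Details
Motivation: Existing HAR datasets focus on basic daily activities and lack diverse, well-annotated data for specialized sports settings such as basketball training, which limits their use in areas like sports performance analysis. A dedicated dataset is needed to fill this gap.
Result: Experimental results indicate that BasketHAR is complex enough to suit advanced HAR tasks and can serve as a benchmark resource for future research; the abstract reports no specific quantitative results or comparisons with existing state-of-the-art methods.
Insight: The novelty is a multimodal HAR dataset dedicated to basketball training that covers professional-level actions and rich sensor data, together with a multimodal alignment baseline, giving sports analytics and HAR research a new data resource and evaluation framework.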
Abstract: Human Activity Recognition (HAR) involves the automatic identification of user activities and has gained significant research interest due to its broad applicability. Most HAR systems rely on supervised learning, which necessitates large, diverse, and well-annotated datasets. However, existing datasets predominantly focus on basic activities such as walking, standing, and stair navigation, limiting their utility in specialized contexts like sports performance analysis. To address this gap, we present BasketHAR, a novel multimodal HAR dataset tailored for basketball training, encompassing a diverse set of professional-level actions. BasketHAR includes comprehensive motion data from inertial measurement units (accelerometers and gyroscopes), angular velocity, magnetic field, heart rate, skin temperature, and synchronized video recordings. We also provide a baseline multimodal alignment method to benchmark performance. Experimental results underscore the dataset's complexity and suitability for advanced HAR tasks. Furthermore, we highlight its potential applications in the analysis of basketball training sessions and in the generation of specialized performance reports, representing a valuable resource for future research in HAR and sports analytics. The dataset is publicly accessible at https://huggingface.co/datasets/Xian-Gao/BasketHAR under the Apache License 2.0.
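[153] NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge Report cs.CV
Andrei Dumitriu, Aakash Ralhan, Florin Miron, Florin Tatui, Radu Tudor Ionescu
TL;DR: This report presents the NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge, which targets automatic detection and segmentation of rip currents in images. Built on the RipVIS benchmark, the challenge evaluates both detection and segmentation; it attracted 159 registered participants and received 9 valid test submissions. The report summarizes the dataset, challenge protocol, evaluation methodology, final results, and main insights from the submitted methods.
Details
Motivation: Rip currents are a major hazard behind beach-related fatalities worldwide, yet they are hard to identify automatically because their visual appearance varies substantially across beaches, viewpoints, and sea states. The challenge aims to advance research on this safety-critical problem.
Result: Final rankings use a composite score combining F1[50], F2[50], F1[40:95], and F2[40:95]. Most submitted solutions rely on pretrained models combined with strong data augmentation and post-processing design.
Insight: The results suggest that rip current understanding benefits strongly from progress in robust general-purpose vision models, while leaving ample room for methods tailored to its unique visual structure. The diverse dataset (more than 10 countries, 4 camera orientations, and varied beach and sea conditions) provides an important benchmark for future research.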
Abstract: This report presents the NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge, which targets automatic rip current understanding in images. Rip currents are hazardous nearshore flows that cause many beach-related fatalities worldwide, yet remain difficult to identify because their visual appearance varies substantially across beaches, viewpoints, and sea states. To advance research on this safety-critical problem, the challenge builds on the RipVIS benchmark, evaluating both detection and segmentation. The dataset is diverse, sourced from more than $10$ countries, with $4$ camera orientations and diverse beach and sea conditions. This report describes the dataset, challenge protocol, evaluation methodology, and final results, and summarizes the main insights from the submitted methods. The challenge attracted $159$ registered participants and produced $9$ valid test submissions across the two tasks. Final rankings are based on a composite score that combines $F_1[50]$, $F_2[50]$, $F_1[40\!:\!95]$, and $F_2[40\!:\!95]$. Most participant solutions relied on pretrained models, combined with strong augmentation and post-processing design. These results suggest that rip current understanding benefits strongly from progress in robust general-purpose vision models, while leaving ample room for future methods tailored to the unique visual structure of rip currents.
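The $F_1$ and $F_2$ terms in the ranking score are presumably instances of the standard F-beta measure, which weights recall more heavily as beta grows. The sketch below shows that standard definition only; the function name `f_beta` and the precision/recall values are illustrative assumptions, and the abstract does not say how the four terms are weighted in the composite, so no composite is computed here.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative numbers only, not challenge results.
p, r = 0.5, 1.0
f1 = f_beta(p, r, beta=1.0)  # balances precision and recall
f2 = f_beta(p, r, beta=2.0)  # weights recall more heavily
print(f"F1={f1:.3f}  F2={f2:.3f}")
```

With high recall and low precision, as in these made-up numbers, $F_2$ exceeds $F_1$, which is why an $F_2$ term rewards methods that miss fewer rip currents.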
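[154] Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment cs.CV
Minghao Zou, Gen Liu, Guanghui Yue, Baoquan Zhao, Zhihua Wang
TL;DR: This paper proposes RefVQA, a reference-aware video quality assessment method for AI-generated videos. It revisits AIGC-VQA from an inter-video perspective: a query-centered reference graph organizes semantically related samples, and graph-guided difference aggregation from the reference nodes to the query node drives the quality estimate. Experiments show that the method outperforms existing state-of-the-art methods across multiple quality dimensions and generalizes well.
Details
Motivation: Existing AIGC-VQA methods typically analyze each video independently and ignore potential relationships among videos. The paper reformulates AIGC-VQA as a reference-aware evaluation problem in which quality assessment is guided not only by intrinsic video characteristics but also by comparisons with related videos, which is more consistent with human perception.
Result: Experiments on existing datasets show that RefVQA outperforms state-of-the-art methods across multiple quality dimensions, and cross-dataset evaluation confirms strong generalization.
Insight: The main novelty is shifting AIGC-VQA from independent per-video analysis to a reference-aware evaluation framework with graph-based reference modeling. The inter-video comparison mechanism mirrors how humans judge quality by contrast and offers a new modeling direction for AIGC-VQA.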
Abstract: The rapid advancement of generative models has led to a growing volume of AI-generated videos, making the automatic quality assessment of such videos increasingly important. Existing AI-generated content video quality assessment (AIGC-VQA) methods typically estimate visual quality by analyzing each video independently, ignoring potential relationships among videos. In this work, we revisit AIGC-VQA from an inter-video perspective and formulate it as a reference-aware evaluation problem. Through this formulation, quality assessment is guided not only by intrinsic video characteristics but also by comparisons with related videos, which is more consistent with human perception. To validate its effectiveness, we propose Reference-aware Video Quality Assessment (RefVQA), which utilizes a query-centered reference graph to organize semantically related samples and performs graph-guided difference aggregation from the reference nodes to the query node. Experiments on existing datasets demonstrate that our proposed RefVQA outperforms state-of-the-art methods across multiple quality dimensions, with strong generalization ability validated by cross-dataset evaluation. These results highlight the effectiveness of the proposed reference-based formulation and suggest its potential to advance AIGC-VQA.
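[155] EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling cs.CV | cs.LG
Jiafei Song, Fengwei Zhou, Jin Qu, Wenjin Jason Li, Tong Wu
TL;DR: This paper proposes EvoComp, a visual token compression framework for multimodal large language models (MLLMs) that uses a semantic-guided evolutionary labeling strategy to sharply reduce the number of visual tokens while preserving task accuracy, thereby improving inference efficiency.
Details
Motivation: MLLMs perform well on vision-language understanding tasks, but their inference efficiency is often hampered by the large number of visual tokens, especially in high-resolution or multi-image scenarios, so an effective compression method is needed to cut token count without losing performance.
Result: Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms methods based on attention or similarity heuristics; it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.
Insight: The contributions are a lightweight encoder-only transformer compressor that selects the most informative, non-redundant tokens by jointly considering visual and textual context; an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss while enforcing semantic diversity through vocabulary-based token grouping; and a tailored loss that combines GHM loss with cosine-similarity regularization to mitigate class and difficulty imbalance and encourage semantic separation between retained and discarded tokens.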
Abstract: Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM’s output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.
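To make "3x token compression" concrete, here is a minimal sketch of score-based token selection at a fixed ratio. It is not EvoComp's compressor: the per-token scores stand in for whatever the learned compressor would output, and the function name `compress_tokens` and the keep-top-k rule are assumptions made for illustration.

```python
def compress_tokens(scores, ratio=3):
    """Return indices of the tokens to keep: the top len(scores)//ratio by score.

    Indices are returned in their original order so that the spatial
    ordering of the surviving visual tokens is preserved.
    """
    keep = max(1, len(scores) // ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical importance scores for nine visual tokens.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.4, 0.6]
kept = compress_tokens(scores)  # keeps 9 // 3 = 3 tokens
print(kept)
```

Only the selection step is shown; how the scores are supervised (the evolutionary labeling and the GHM-based loss described in the abstract) is outside this sketch.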
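[156] Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition cs.CV
Jidong Kuang, Hongsong Wang, Jie Gui
TL;DR: This paper proposes CoAMD, a unified framework for action recognition and motion generation that uses skeleton coordinates for both motion understanding and generation. It synthesizes motion with a coarse-to-fine diffusion model and includes a multi-modal action recognizer that provides gradient-based semantic guidance.
Details
Motivation: Existing work usually treats action recognition and motion generation as separate problems and overlooks the link between them (motion generation requires semantic comprehension). The paper explores modeling the two jointly.
Result: Extensive experiments on 13 benchmarks spanning four tasks (action recognition, text-to-motion generation, text-motion retrieval, and motion editing) show state-of-the-art performance.
Insight: The core novelty is unifying action recognition and motion generation in one skeleton-coordinate framework and using the multi-modal action recognizer to provide gradient guidance to the diffusion process, so that the tasks reinforce each other.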
Abstract: Human action recognition and motion generation are two active research problems in human-centric computer vision, both aiming to align motion with textual semantics. However, most existing works study these two problems separately, without uncovering the links between them, namely that motion generation requires semantic comprehension. This work investigates unified action recognition and motion generation by leveraging skeleton coordinates for both motion understanding and generation. We propose Coordinates-based Autoregressive Motion Diffusion (CoAMD), which synthesizes motion in a coarse-to-fine manner. As a core component of CoAMD, we design a Multi-modal Action Recognizer (MAR) that provides gradient-based semantic guidance for motion generation. Furthermore, we establish a rigorous benchmark by evaluating baselines on absolute coordinates. Our model can be applied to four important tasks, including skeleton-based action recognition, text-to-motion generation, text-motion retrieval, and motion editing. Extensive experiments on 13 benchmarks across these tasks demonstrate that our approach achieves state-of-the-art performance, highlighting its effectiveness and versatility for human motion modeling. Code is available at https://github.com/jidongkuang/CoAMD.
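[157] Inference-Time Temporal Probability Smoothing for Stable Video Segmentation with SAM2 under Weak Prompts cs.CV
Dawar Jyoti Deka
TL;DR: This paper proposes an inference-time temporal probability smoothing method that improves the temporal stability of SAM2 video segmentation under weak prompts (such as sparse point prompts) without retraining or architectural changes. It warps per-frame segmentation probability maps with optical flow, estimates pixel-wise uncertainty from segmentation entropy and forward-backward flow consistency, and adaptively blends current-frame predictions with motion-aligned historical estimates to produce temporally coherent segmentation.
Details
Motivation: Interactive video segmentation models such as SAM2 become temporally unstable under weak user supervision (for example, sparse point prompts on a single frame), showing flickering boundaries, object dropout, and inconsistent object extents across frames. Fixing this would make them more reliable for video understanding and control applications.
Result: The method is evaluated on four diverse video sequences with frame-wise and temporal stability metrics, including motion-compensated IoU, boundary consistency, object persistence, and area volatility. It consistently improves temporal stability over vanilla SAM2 inference while preserving spatial accuracy.
Insight: The novelty is a lightweight, model-agnostic, inference-time post-processing framework that uses optical-flow motion alignment and uncertainty-based adaptive probability blending to improve temporal coherence, suits real-time interactive use, and adds no training cost.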
Abstract: Interactive video segmentation models such as SAM2 have demonstrated strong generalization across diverse visual domains. However, under weak user supervision, for example, when sparse point prompts are provided on a single frame, their predictions often suffer from temporal instability, including flickering boundaries, object dropout, and inconsistent object extents across frames. These issues limit their reliability in downstream video understanding and control applications. In this paper, we propose an inference-time temporal probability smoothing method that improves the temporal stability of SAM2-based video segmentation without retraining or architectural modification. Our approach operates directly on per-frame segmentation probability maps and leverages optical-flow-based motion warping together with pixel-wise uncertainty estimates derived from segmentation entropy and forward-backward flow consistency. These signals are used to adaptively blend current-frame predictions with motion-aligned historical estimates, yielding temporally coherent segmentation outputs under weak prompts. We evaluate the proposed method on four diverse video sequences using a comprehensive set of frame-wise and temporal stability metrics, including motion-compensated IoU, boundary consistency, object persistence, and area volatility. Experimental results demonstrate consistent improvements in temporal stability over vanilla SAM2 inference while preserving spatial accuracy. The proposed framework is lightweight, model-agnostic, and well-suited for real-time, interactive video segmentation.
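The adaptive blending step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's rule: the history map is taken as already warped to the current frame, the flow-consistency check is reduced to a boolean per pixel, and the weight given to history is simply the binary entropy of the current prediction. The names `binary_entropy` and `blend` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a foreground probability: 0 = certain, 1 = maximally uncertain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def blend(current, warped_prev, flow_ok):
    """Blend per-pixel foreground probabilities.

    current, warped_prev: flat lists of probabilities; warped_prev is assumed
    to be motion-aligned to the current frame already.
    flow_ok: flat list of booleans, True where the forward-backward flow
    check passed (so the history at that pixel can be trusted).
    """
    out = []
    for p_cur, p_prev, ok in zip(current, warped_prev, flow_ok):
        # Assumed weighting: lean on history only where the current
        # prediction is uncertain AND the flow is trusted.
        w_hist = binary_entropy(p_cur) if ok else 0.0
        out.append((1.0 - w_hist) * p_cur + w_hist * p_prev)
    return out


current = [0.5, 0.99, 0.5]     # uncertain, confident, uncertain
warped_prev = [0.9, 0.1, 0.9]  # motion-aligned history
flow_ok = [True, True, False]  # third pixel fails the flow check
smoothed = blend(current, warped_prev, flow_ok)
print(smoothed)
```

In this toy case the first pixel adopts the history value, the second stays close to its confident current prediction, and the third ignores history because the flow there is not trusted.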
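[158] Multimodal Fusion of Histopathology Images and Electronic Health Records for Early Breast Cancer Diagnosis cs.CV
Aditya Shribhagwan Khandelwal, Mohammad Samar Ansari, Asra Aslam
TL;DR: This paper proposes a multimodal fusion framework for early breast cancer diagnosis that integrates patch-level histopathology features from the BreCaHAD dataset with structured electronic health record (EHR) data from MIMIC-IV. The authors train and evaluate unimodal image models (a CNN baseline and ResNet-18), unimodal tabular models (XGBoost and a multilayer perceptron), and an intermediate-fusion model, and find that the fusion model outperforms the unimodal baselines in both predictive performance and clinical interpretability.
Details
Motivation: Breast cancer is a leading cause of cancer-related mortality worldwide, and timely, accurate diagnosis is critical to improving survival. Most existing work treats histopathology images and EHRs in isolation and does not exploit their complementarity.
Result: ResNet-18 achieves near-perfect accuracy (1.000) and AUC (1.000) on three-class patch-level histopathology classification on BreCaHAD, and XGBoost reaches 98% accuracy on the EHR prediction task. The intermediate-fusion model achieves a macro-average AUC of 0.997, outperforming all unimodal baselines, with the largest improvement on the diagnostically critical but class-imbalanced mitosis category (AUC 0.994).
Insight: The novelty is a systematic integration of histopathology images and EHRs, with an intermediate-fusion strategy that improves predictive performance. Grad-CAM and SHAP interpretability analyses confirm that model decisions align with established pathological and clinical criteria, which improves clinical transparency.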
Abstract: Breast cancer is a leading cause of cancer-related mortality worldwide, and timely accurate diagnosis is critical to improving survival outcomes. While convolutional neural networks (CNNs) have demonstrated strong performance on histopathology image classification, and machine learning models on structured electronic health records (EHR) have shown utility for clinical risk stratification, most existing work treats these modalities in isolation. This paper presents a systematic multimodal framework that integrates patch-level histopathology features from the BreCaHAD dataset with structured clinical data from MIMIC-IV. We train and evaluate unimodal image models (a simple CNN baseline and ResNet-18 with transfer learning), unimodal tabular models (XGBoost and a multilayer perceptron), and an intermediate-fusion model that concatenates latent representations from both modalities. ResNet-18 achieves near-perfect accuracy (1.000) and AUC (1.000) on three-class patch-level classification, while XGBoost achieves 98% accuracy on the EHR prediction task. The intermediate fusion model yields a macro-average AUC of 0.997, outperforming all unimodal baselines and delivering the largest improvements on the diagnostically critical but class-imbalanced mitosis category (AUC 0.994). Grad-CAM and SHAP interpretability analyses validate that model decisions align with established pathological and clinical criteria. Our results demonstrate that multimodal integration delivers meaningful improvements in both predictive performance and clinical transparency.
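[159] Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection cs.CV
Dawar Jyoti Deka, Amit Sethi, Syed Mohammad Ali
TL;DR: This paper studies how prompt sensitivity in vision-language models affects object grounding. Semantically similar prompts (such as "a person," "a human," and "a pedestrian") frequently select different object instances in the same image, with a mean instability of 2.11 distinct selections. Using a controlled pipeline that combines DETR object proposals with CLIP language-conditioned selection on COCO val2017, the analysis shows that this variability is structured and directional rather than random noise.
Details
Motivation: The paper tests an implicit assumption in vision-language models, that semantically equivalent descriptions yield consistent outputs, and measures how small wording changes actually affect grounding stability.
Result: On 263 COCO val2017 images, prompt ensembling does not improve quality and often shifts selections toward generic regions; text embedding proximity explains only 34% of grounding disagreement (r = -0.58), indicating that instability arises from the argmax selection mechanism rather than text-level distance alone.
Insight: The novelty is a systematic quantification of prompt sensitivity, showing that grounding instability stems from structured bias in the selection mechanism rather than random error, with implications for prompt engineering and robust model design.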
Abstract: Vision-language models enable open-vocabulary object grounding through natural language queries, under the implicit assumption that semantically equivalent descriptions yield consistent outputs. We examine this assumption using a controlled pipeline combining DETR for object proposals with CLIP for language-conditioned selection on 263 COCO val2017 images. We find that overlapping prompts such as “a person,” “a human,” and “a pedestrian” frequently select different instances, with mean instability of 2.11 distinct selections across six prompts. PCA analysis shows this variability is structured and directional, not random. Prompt ensembling does not improve quality and often shifts selections toward generic regions. We further show that text embedding proximity explains only 34% of grounding disagreement (r = -0.58), confirming that instability arises from the argmax selection mechanism rather than text-level distances alone.
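The instability figure can be read as a simple count. The sketch below assumes that "instability" for one image is the number of distinct proposals chosen across the six prompts, which matches the phrase "distinct selections across six prompts" but is not confirmed beyond that; the image names and picks are made up. It also checks that the reported 34% is consistent with squaring the reported correlation.

```python
def instability(selections):
    """Number of distinct proposal indices chosen across prompts for one image."""
    return len(set(selections))


# Hypothetical argmax picks (proposal index) for six prompts on three images.
picks = {
    "img_a": [0, 0, 0, 0, 0, 0],  # all prompts agree
    "img_b": [1, 1, 2, 1, 2, 1],  # two different instances selected
    "img_c": [0, 3, 3, 5, 0, 3],  # three different instances selected
}
per_image = {name: instability(sel) for name, sel in picks.items()}
mean_instability = sum(per_image.values()) / len(per_image)

# Variance explained by a linear relationship is the squared correlation.
r = -0.58
explained = r * r  # about 0.336, i.e. roughly 34%
print(per_image, mean_instability, round(explained, 4))
```

A value of 1 means every prompt picked the same instance; the paper's mean of 2.11 therefore indicates that, on average, six near-synonymous prompts landed on about two different objects per image.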
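[160] ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation cs.CV | cs.RO
Lili Gao, Yanbo Xu, William Koch, Samuele Ruffino, Luke Rowe
TL;DR: ScenarioControl is a vision-language controllable method for vectorized latent scenario generation in driving. Given a text prompt or an input image, it synthesizes diverse, realistic 3D scenario rollouts, including the map, 3D boxes of reactive actors over time, pedestrians, driving infrastructure, and ego camera observations. Scenes are generated in a vectorized latent space that jointly represents road structure and dynamic agents, and a cross-global control mechanism connects multimodal control with sparse vectorized scene elements. The method produces temporally consistent rollouts from the perspectives of different actors and supports long-horizon continuation of driving scenarios.
Details
Motivation: Existing driving scenario generation methods lack fine-grained vision-language control. The goal is a generation mechanism that controls road layout and traffic conditions from text or image input while preserving realism and diversity, in support of training and evaluating autonomous driving systems.
Result: In extensive experiments, ScenarioControl compares favorably to all tested methods in control adherence and fidelity. The authors also release a dataset of text annotations aligned to vectorized map structures to facilitate training and evaluation.
Insight: The contributions are the first vision-language control mechanism for learned driving scenario generation; a vectorized latent space that jointly represents roads and dynamic agents; a cross-global control mechanism that combines cross-attention with a lightweight global-context branch for fine-grained control without sacrificing realism; and support for multiple actor perspectives, temporal consistency, and long-horizon continuation. The method offers a new tool for autonomous driving simulation.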
Abstract: We introduce ScenarioControl, the first vision-language control mechanism for learned driving scenario generation. Given a text prompt or an input image, ScenarioControl synthesizes diverse, realistic 3D scenario rollouts, including map, 3D boxes of reactive actors over time, pedestrians, driving infrastructure, and ego camera observations. The method generates scenes in a vectorized latent space that represents road structure and dynamic agents jointly. To connect multimodal control with sparse vectorized scene elements, we propose a cross-global control mechanism that integrates cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism. The method produces temporally consistent scenario rollouts from the perspectives of different actors in the scene, supporting long-horizon continuation of driving scenarios. To facilitate training and evaluation, we release a dataset with text annotations aligned to vectorized map structures. Extensive experiments validate that the control adherence and fidelity of ScenarioControl compare favorably to all tested methods across all experiments. Project webpage: https://light.princeton.edu/ScenarioControl
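[161] PPEDCRF: Dynamic-CRF-Guided Selective Perturbation for Background-Based Location Privacy in Video Sequences cs.CV
Bo Ma, Weiqi Yan, Jinsong Wu
TL;DR: PPEDCRF is a selective perturbation framework that protects background-based location privacy in video frames against gallery-based retrieval attacks. It estimates location-sensitive background regions with a dynamic conditional random field (DCRF), rescales perturbation strength with a normalized control penalty (NCP), and injects Gaussian noise only inside the inferred regions via a differential-privacy-style calibration rule, lowering retrieval accuracy while keeping visual quality high.
Details
Motivation: Even after GPS metadata are stripped from released video frames, an attacker can geolocate a frame by matching background visual cues to geo-tagged reference imagery. The paper aims to protect background-based location privacy against this threat.
Result: On a paired-scene retrieval benchmark with eight attacker backbones and three noise seeds, PPEDCRF reduces ResNet18 Top-1 retrieval accuracy from 0.667 to 0.361±0.127 (at σ_0 = 8) while preserving 36.14 dB PSNR, a quality advantage of about 6 dB over global Gaussian noise. 23 of 24 backbone-gallery cells show a negative Δ, with MixVPR as a remaining exception.
Insight: The contributions are DCRF-based estimation of location-sensitive regions, NCP-based rescaling of perturbation strength, and selective noise injection. The matched-operating-point analysis indicates that the practical benefit is spatially concentrated perturbation that preserves higher visual quality at a given noise scale, rather than stronger privacy at matched utility.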
Abstract: We propose PPEDCRF, a calibrated selective perturbation framework that protects background-based location privacy in released video frames against gallery-based retrieval attackers. Even after GPS metadata are stripped, an adversary can geolocate a frame by matching its background visual cues to geo-tagged reference imagery; PPEDCRF mitigates this threat by estimating location-sensitive background regions with a dynamic conditional random field (DCRF), rescaling perturbation strength with a normalized control penalty (NCP), and injecting Gaussian noise only inside the inferred regions via a DP-style calibration rule. On a controlled paired-scene retrieval benchmark with eight attacker backbones and three noise seeds, PPEDCRF reduces ResNet18 Top-1 retrieval accuracy from 0.667 to $0.361\pm0.127$ at $\sigma_0=8$ while preserving $36.14\,$dB PSNR, an ${\approx}6\,$dB quality advantage over global Gaussian noise. Transfer across the eight-backbone seed-averaged benchmark is broadly supportive (23 of 24 backbone-gallery cells show negative $\Delta$), while appendix-scale confirmation identifies MixVPR as a remaining adverse-transfer exception. Matched-operating-point analysis shows that PPEDCRF and global Gaussian noise converge in Top-1 privacy at equal utility, so the practical benefit is spatially concentrated perturbation that preserves higher visual quality at any given noise scale rather than stronger matched-utility privacy. Code: https://github.com/mabo1215/PPEDCRF
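The contrast between selective and global noise can be illustrated with a toy example. The sketch below is not the PPEDCRF pipeline: the mask is hand-written rather than inferred by a DCRF, there is no NCP rescaling or DP-style calibration, and the function names `perturb` and `psnr` are hypothetical. It only shows that, at the same noise scale, perturbing a subset of pixels leaves a higher PSNR than perturbing all of them.

```python
import math
import random


def perturb(pixels, mask, sigma, seed=0):
    """Add zero-mean Gaussian noise only where mask is True; clip to [0, 255]."""
    rng = random.Random(seed)
    out = []
    for value, sensitive in zip(pixels, mask):
        if sensitive:
            value = min(255.0, max(0.0, value + rng.gauss(0.0, sigma)))
        out.append(float(value))
    return out


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


frame = [128.0] * 1000                 # flat grey toy "frame"
mask = [i < 250 for i in range(1000)]  # hypothetical sensitive region: 25% of pixels
selective = perturb(frame, mask, sigma=8.0)
global_noise = perturb(frame, [True] * 1000, sigma=8.0)

print(f"selective PSNR: {psnr(frame, selective):.2f} dB")
print(f"global PSNR:    {psnr(frame, global_noise):.2f} dB")
```

Pixels outside the mask are returned unchanged, so the quality gap depends on how small the sensitive region is relative to the frame; the numbers printed here are properties of this toy setup and should not be compared with the paper's reported PSNR.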
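[162] LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation cs.CV
Yuwei Ning, Ganlong Zhao, Yipeng Qin, Si Liu, Yang Liu
TL;DR: This paper proposes LookasideVLN, a new paradigm for aerial vision-and-language navigation with UAVs. It exploits directional cues in natural language instructions by building a dynamic graph of landmarks and their directional relationships (ELG), and combines it with lightweight memory retrieval (SLKB) and a navigation agent with multimodal alignment, aiming for more accurate spatial reasoning and greater computational efficiency.
Details
Motivation: Existing aerial vision-and-language navigation methods rely mainly on landmark descriptions and overlook the directional cues in instructions, which leads to shallow instruction understanding and high computational cost. The paper addresses both problems by exploiting directional information.
Result: Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with single-level lookahead.
Insight: The core novelty is explicitly modeling the directional cues in natural language instructions for spatial reasoning, through an Egocentric Lookaside Graph that dynamically encodes direction-landmark relationships and a lightweight memory retrieval mechanism. Directional information, rather than landmark descriptions alone, is treated as the key to better navigation accuracy and efficiency.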
Abstract: Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments. While recent advances have achieved progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions, overlooking directional cues, a key source of spatial context in human navigation. In this work, we propose LookasideVLN, a new paradigm that exploits directional cues in natural language to achieve both more accurate spatial reasoning and greater computational efficiency. LookasideVLN comprises three core components: (1) an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relationships, (2) a Spatial Landmark Knowledge Base (SLKB) that provides lightweight memory retrieval from prior navigation experiences, and (3) a Lookaside MLLM Navigation Agent that aligns multimodal information from user instructions, visual observations, and landmark-direction information from ELG for path planning. Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with a single-level lookahead, demonstrating that leveraging directional cues is a powerful yet efficient strategy for Aerial VLN.
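[163] DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior cs.CV
Junjia Huang, Binbin Yang, Pengxiang Yan, Jiyang Liu, Bin Xia
TL;DR: DreamShot is a personalized storyboard synthesis framework built on video diffusion priors that aims to generate multi-shot sequences with long-range temporal coherence, consistent character identities, and narrative flow. It supports Text-to-Shot and Reference-to-Shot generation as well as story continuation conditioned on previous frames, enabling flexible, context-aware storyboard generation.
Details
Motivation: Most existing methods are adapted from text-to-image diffusion models and struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots, so a framework that exploits the spatial-temporal consistency of video generative models is needed.
Result: Extensive experiments show that DreamShot surpasses state-of-the-art text-to-image storyboard models in scene coherence, role consistency, and generation efficiency, pointing toward controllable, video-model-driven visual storytelling.
Insight: The novelty is fully exploiting video diffusion priors for controllable multi-shot synthesis and adding a multi-reference role conditioning module with a Role-Attention Consistency Loss that enforces identity alignment, which improves narrative fidelity and character continuity.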
Abstract: Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a storyboard framework built on video generative models that fully exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatial-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, explicitly constraining attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable, video-model-driven visual storytelling.
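[164] CDSA-Net: Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography cs.CV | cs.AI
Si Li, Chen-Kai Hu, Zhenhuan Lyu, Yuanqing He
TL;DR: CDSA-Net is a novel deep learning framework for coronary digital subtraction angiography (DSA). By collaboratively decoupling vascular structure from background, it addresses the boundary artifacts and loss of grayscale fidelity seen in existing methods and produces high-fidelity subtraction images.
Details
Motivation: Coronary DSA is affected by physiological motion, and existing deep learning methods often produce boundary artifacts and grayscale distortion that undermine diagnostic confidence. A method is needed that jointly optimizes vascular structure preservation and realistic background restoration.
Result: CDSA-Net significantly outperforms existing SOTA methods in vascular intensity correlation and perceptual quality. It improves morphology assessment efficiency by 25.6% and hemodynamic evaluation speed by 42.9%, setting a new benchmark for utility in interventional cardiology while maintaining diagnostic results consistent with raw angiograms.
Insight: The contributions are: 1) the first explicit decoupling and joint optimization of vascular structure preservation and background restoration; 2) a hierarchical geometric prior guidance (HGPG) mechanism that combines an integrated geometric prior, gated spatial modulation, and a centerline-aware topology loss to ensure structural continuity; 3) an adaptive noise module (ANM) that models the stochastic nature of clinical X-ray noise, eliminating boundary artifacts and enabling seamless background intensity estimation.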
Abstract: Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, forcing reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produce images with two clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity that undermines diagnostic confidence. We propose a novel framework, CDSA-Net, that for the first time explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA-Net introduces two core innovations: (i) a hierarchical geometric prior guidance (HGPG) mechanism, embedded in our coronary structure extraction network (CSENet), which combines an integrated geometric prior (IGP) with gated spatial modulation (GSM) and centerline-aware topology (CAT) loss supervision to ensure structural continuity; and (ii) an adaptive noise module (ANM) within our coronary background restoration network (CBResNet). Unlike standard restoration, ANM models the stochastic nature of clinical X-ray noise, bridging the domain gap to enable seamless background intensity estimation and the complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, CDSA-Net significantly outperforms state-of-the-art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while maintaining diagnostic results consistent with raw angiograms. The project code is available at https://github.com/DrThink-ai/CDSA-Net.
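[165] DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation cs.CV | cs.AI | eess.SP
Nagur Shareef Shaik, Teja Krishna Cherukuri, Dong Hye Ye
TL;DR: DREAM is a new framework for high-fidelity medical report generation from retinal images. Its two-stage adaptive multi-modal fusion mechanism integrates visual data with clinical keywords provided by ophthalmologists, addressing the overfitting and missed pathologies of large vision-language models in data-scarce specialized medical domains.
Details
Motivation: Current large vision-language models tend to overfit in data-scarce specialized medical domains such as retinal image analysis and can miss subtle but critical pathological features. An automated method is needed that effectively fuses visual information with clinical knowledge to generate precise medical reports.
Result: DREAM sets a new state of the art on the DeepEyeNet benchmark with a BLEU-4 score of 0.241 and generalizes well to the ROCO dataset.
Insight: The novelty is a two-stage fusion mechanism comprising an Abstractor module (which maps image and keyword features into a shared space to enhance the visual data) and an Adaptor module (which uses learnable parameters to weight each modality dynamically for adaptive fusion), plus a module that uses contrastive alignment during training to keep the output semantics consistent with clinical ground truth, thereby combining medical expertise with an efficient fusion strategy.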
Abstract: Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, leading to models that overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a novel framework for high-fidelity medical report generation that excels even with limited data. DREAM employs a unique two-stage fusion mechanism that intelligently integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enhancing visual data with pathology-relevant insights. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting the importance of each modality using learnable parameters to create a unified representation. To ensure the model’s outputs are semantically grounded in clinical reality, a Contrastive Alignment module aligns these fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state-of-the-art on the DeepEyeNet benchmark, achieving a BLEU-4 score of 0.241, and further demonstrates strong generalization to the ROCO dataset.
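[166] EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents cs.CV
Yu Zhang, Kaiyuan Shen, Yang Li
TL;DR: EmbodiedHead is a real-time, speech-driven conversational avatar framework that equips LLMs with a visual avatar. It couples a Rectified-Flow Diffusion Transformer with a differentiable renderer to generate high-quality images in as few as four sampling steps, adopts a single-stream audio interface with a Streaming Audio Scheduler for seamless listening-speaking transitions, and improves rendering quality with a two-stage training scheme.
Details
Motivation: Existing conversational avatar systems cannot simultaneously achieve real-time generation, unified listening-speaking behavior, and high visual quality. In particular, the work removes the interlocutor look-ahead dependency so that causal user–LLM interaction becomes possible.
Result: Extensive experiments in listening and speaking scenarios show state-of-the-art (SOTA) visual quality and motion fidelity.
Insight: The contributions include the first Rectified-Flow DiT model for this task, a single-stream audio interface with explicit state conditioning, a Streaming Audio Scheduler that suppresses spurious mouth motion, and a two-stage training scheme of coefficient-space pretraining followed by joint image-domain refinement.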
Abstract: We present EmbodiedHead, a speech-driven talking-head framework that equips LLMs with real-time visual avatars for conversation. A practical embodied avatar must achieve real-time generation, unified listening-speaking behavior, and high rendered visual quality simultaneously. Our framework couples the first Rectified-Flow Diffusion Transformer (DiT) for this task with a differentiable renderer, enabling diverse, high-fidelity generation in as few as four sampling steps. Prior listening-speaking methods rely on dual-stream audio, introducing an interlocutor look-ahead dependency incompatible with causal user–LLM interaction. We instead adopt a single-stream interface with explicit per-frame listening-speaking state conditioning and a Streaming Audio Scheduler, suppressing spurious mouth motion during listening while enabling seamless turn-taking. A two-stage training scheme of coefficient-space pretraining and joint image-domain refinement further closes the gap between motion-level supervision and rendered quality. Extensive experiments demonstrate state-of-the-art visual quality and motion fidelity in both speaking and listening scenarios.
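[167] Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability cs.CV | cs.AI
Lijie Zhou
TL;DR: This paper studies cross-modal attention in vision-language models (VLMs) and finds "text shortcut learning": models over-rely on textual descriptions and neglect visual evidence. The author proposes an adversarial evaluation framework to quantify this dependency and applies four adversarial strategies to a geometric-shapes dataset. In a comparison of Baseline CLIP, LoRA fine-tuning, and an optimized LoRA model, the optimized model substantially reduces accuracy degradation (Drop) while maintaining high normal accuracy, and attention visualization and embedding-space analysis confirm its increased attention to visual features.
Details
Motivation: To address text shortcut learning in VLMs, where models over-rely on text and neglect visual evidence, and thereby improve their visual reliability.
Result: On the geometric-shapes dataset, the optimized model reduces the average accuracy degradation (Drop) from 27.5% to 9.8% (a 64.4% relative improvement, p<0.001) while maintaining 97% normal accuracy, indicating greater robustness to adversarial text.
Insight: The contributions include an adversarial evaluation framework that quantifies cross-modal dependency, and a LoRA optimization recipe integrating hard negative mining, label smoothing, layer-wise learning rates, and other techniques, which increases the model's attention to visual features and improves cross-modal alignment.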
Abstract: Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence, a phenomenon termed “text shortcut learning.” We propose an adversarial evaluation framework that quantifies this cross-modal dependency by measuring accuracy degradation (Drop) when semantically conflicting text is paired with unchanged images. Four adversarial strategies (shape_swap, color_swap, position_swap, and random_text) are applied to a controlled geometric-shapes dataset ($n{=}1{,}000$). We compare three configurations: Baseline CLIP (ViT-B/32), LoRA fine-tuning, and LoRA Optimized (integrating Hard Negative Mining, Label Smoothing, layer-wise learning rates, Cosine Restarts, curriculum learning, and data augmentation). The optimized model reduces average Drop from 27.5% to 9.8% (64.4% relative improvement, $p{<}0.001$) while maintaining 97% normal accuracy. Attention visualization and embedding-space analysis confirm that the optimized model attends more to visual features and achieves tighter cross-modal alignment.
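As a rough illustration of the Drop metric described in this abstract, the sketch below computes accuracy degradation and the relative improvement between two configurations. The function names are hypothetical and not taken from the paper; only the 27.5% and 9.8% figures come from the abstract.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def drop(normal_acc, adversarial_acc):
    """Accuracy degradation when conflicting text is paired with the same images."""
    return normal_acc - adversarial_acc


def relative_improvement(baseline_drop, optimized_drop):
    """Reduction in Drop, expressed as a percentage of the baseline Drop."""
    return 100.0 * (baseline_drop - optimized_drop) / baseline_drop


# Average Drop values reported in the abstract (in percent).
print(round(relative_improvement(27.5, 9.8), 1))  # 64.4
```

The relative figure is computed against the baseline Drop, which is why a 17.7-point reduction corresponds to roughly a 64.4% improvement.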
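[168] Fringe Projection Based Vision Pipeline for Autonomous Hard Drive Disassembly cs.CV | cs.RO
Badrinath Balasubramaniam, Vignesh Suresh, Benjamin Metcalf, Beiwen Li
TL;DR: This paper proposes a visual perception pipeline for autonomous hard disk drive disassembly. The system performs 3D sensing with fringe projection profilometry (FPP), selectively triggers a depth completion module where FPP fails, and integrates a lightweight real-time instance segmentation network for scene understanding and critical component localization. Because the same FPP camera-projector system is used throughout, the depth maps are pixel-wise aligned with the segmentation masks.
Details
Motivation: Automated HDD disassembly lacks robust 3D sensing, scene understanding, and fastener localization, and existing approaches are fragmented. Solving this helps recover the economic value of e-waste.
Result: Instance segmentation reaches a box mAP@50 of 0.960 and a mask mAP@50 of 0.957; the depth completion configuration with the Depth Anything V2 Base backbone achieves an RMSE of 2.317 mm and an MAE of 1.836 mm; on the evaluation workstation, the Platter Facing inference stack has a latency of 12.86 ms and a throughput of 77.7 FPS.
Insight: The novelty is using the same FPP system for both depth sensing and component localization, which gives registration-free pixel-wise alignment, an advantage over the RGB-D perception systems common in industrial sensing. The work also applies deployment-oriented network optimization and sim-to-real transfer learning to augment the dataset, and the resulting pipeline provides high-fidelity semantic and spatial data for downstream robotic disassembly.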
Abstract: Unrecovered e-waste represents a significant economic loss. Hard disk drives (HDDs) comprise a valuable e-waste stream necessitating robotic disassembly. Automating the disassembly of HDDs requires holistic 3D sensing, scene understanding, and fastener localization; however, current methods are fragmented and lack both robust 3D sensing and fastener localization. We propose an autonomous vision pipeline which performs 3D sensing using a Fringe Projection Profilometry (FPP) module, with selective triggering of a depth completion module where FPP fails, and integrates this module with a lightweight, real-time instance segmentation network for scene understanding and critical component localization. By utilizing the same FPP camera-projector system for both our depth sensing and component localization modules, our depth maps and derived 3D geometry are inherently pixel-wise aligned with the segmentation masks without registration, providing an advantage over RGB-D perception systems common in industrial sensing. We optimize both our trained depth completion and instance segmentation networks for deployment-oriented inference. The proposed system achieves a box mAP@50 of 0.960 and mask mAP@50 of 0.957 for instance segmentation, while the selected depth completion configuration with the Depth Anything V2 Base backbone achieves an RMSE of 2.317 mm and MAE of 1.836 mm; the Platter Facing learned inference stack achieves a combined latency of 12.86 ms and a throughput of 77.7 Frames Per Second (FPS) on the evaluation workstation. Finally, we adopt a sim-to-real transfer learning approach to augment our physical dataset. The proposed perception pipeline provides both high-fidelity semantic and spatial data which can be valuable for downstream robotic disassembly. The synthetic dataset developed for HDD instance segmentation will be made publicly available.
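The selective fallback from FPP depth to completed depth, and the RMSE/MAE metrics quoted above, can be sketched as follows. This is a minimal illustration on flattened depth values, assuming a per-pixel validity mask for FPP; the function names and the mask representation are hypothetical, not the authors' implementation.

```python
import math


def fuse_depth(fpp_mm, completed_mm, fpp_valid):
    """Keep FPP depth where it is valid; fall back to completed depth elsewhere."""
    return [f if ok else c for f, c, ok in zip(fpp_mm, completed_mm, fpp_valid)]


def depth_errors(pred_mm, target_mm):
    """Return (RMSE, MAE) in millimetres over flattened depth values."""
    diffs = [p - t for p, t in zip(pred_mm, target_mm)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae


# Toy example: the middle pixel has no valid FPP measurement.
fused = fuse_depth([10.0, 0.0, 12.0], [10.5, 11.0, 12.5], [True, False, True])
rmse, mae = depth_errors(fused, [10.0, 14.0, 8.0])
```

RMSE penalizes large depth errors more heavily than MAE, so reporting both, as the abstract does, gives a sense of how much the error is driven by outliers.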
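[169] Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM cs.CV | cs.AI
Chun Wang, Chenfeng Wei, Chenyang Liu, Weihong Deng
TL;DR: This paper proposes P-MLLM, a profile-aware multimodal large language model for zero-shot personalized image aesthetics assessment. It uses user profiles as contextual signals for personalization and integrates visual information into a frozen LLM's reasoning in a profile-aware manner through selective fusion modules, enabling personalized aesthetic prediction without historical user ratings.
Details
Motivation: Existing personalized image aesthetics assessment methods depend on historical user ratings and are limited in the zero-shot setting, where no such data exist. The paper explores user profiles as an alternative signal for personalization.
Result: On recent PIAA benchmarks, P-MLLM achieves competitive zero-shot performance and remains effective even with coarse profile information.
Insight: The novelty lies in a profile-based personalization paradigm and selective fusion modules that let a frozen LLM integrate visual information in a controlled, profile-aware way, offering a new approach to applying multimodal LLMs to zero-shot personalization tasks.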
Abstract: Personalized image aesthetics assessment (PIAA) aims to predict an individual user’s subjective rating of an image, which requires modeling user-specific aesthetic preferences. Existing methods rely on historical user ratings for this modeling and therefore struggle when such data are unavailable. We address this zero-shot setting by using user profiles as contextual signals for personalization and adopting a profile-based personalization paradigm. We introduce P-MLLM, a profile-aware multimodal LLM that augments a frozen LLM with selective fusion modules for controlled visual integration. These modules selectively integrate visual information into the model’s evolving hidden states during profile-conditioned reasoning, allowing visual information to be incorporated in a profile-aware manner. Experiments on recent PIAA benchmarks show that P-MLLM achieves competitive zero-shot performance and remains effective even with coarse profile information, highlighting the potential of profile-based personalization for zero-shot PIAA.
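[170] RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation cs.CV
Rui Min, Liang Yao, Shiyu Miao, Shengxiang Xu, Yuxuan Liu
TL;DR: This paper proposes RemoteShield, a method for improving the robustness of multimodal large language models for Earth Observation. To counter the performance degradation of existing remote sensing MLLMs under realistic noisy inputs, it builds semantic equivalence clusters and applies preference learning so that the model keeps consistent visual-semantic reasoning under image degradations (such as cloud and fog) and textual variations (such as colloquial or vague instructions).
Details
Motivation: Current remote sensing MLLMs for Earth Observation are trained on carefully curated clean datasets and learn brittle mappings that do not generalize to the noisy conditions common in operational Earth Observation (image degradations, textual variations), so performance degrades markedly when they face imperfect inputs in deployment.
Result: Experiments on three Earth Observation tasks show that, under realistic multimodal perturbations, RemoteShield consistently delivers stronger robustness and cross-condition consistency than representative baselines.
Insight: The novelty is building semantic equivalence clusters (pairing each clean sample with its image-text perturbed variants) and optimizing through preference learning over clean and perturbed conditions within a cluster, rather than directly fitting noisy samples. By comparing responses to clean and corrupted inputs, the model is encouraged to focus on the underlying task semantics, which makes it robust to visual degradations and textual noise. This is an effective strategy for aligning model robustness to domain-specific noise.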
Abstract: A robust Multimodal Large Language Model (MLLM) for Earth Observation should maintain consistent interpretation and reasoning under realistic input variations. However, current Remote Sensing MLLMs fail to meet this requirement. Trained on carefully curated clean datasets, they learn brittle mappings that do not generalize to noisy conditions in operational Earth Observation. Consequently, their performance degrades when confronted with imperfect inputs in deployment. To quantify this vulnerability, we construct a realistic set of multimodal perturbations, including visual degradations such as cloud and fog cover, together with diverse human-centric textual variations ranging from colloquialisms to vague or omitted instructions. Empirical evaluations show that these perturbations significantly impair the visual-semantic reasoning capabilities of leading RS foundation models. To address this limitation, we introduce RemoteShield, a robust Remote Sensing MLLM trained to maintain consistent outputs across realistic input variations. During training, each clean sample is paired with its image-text perturbed variants to form a semantic equivalence cluster. Rather than directly fitting noisy samples, RemoteShield is optimized through preference learning over clean and perturbed conditions within the same cluster. By comparing model responses to clean and corrupted inputs, the model is encouraged to favor stable responses over perturbation-induced failures. This cross-condition alignment helps the model focus on underlying task semantics despite visual degradations and textual noise. Experiments on three Earth Observation tasks show that RemoteShield consistently delivers stronger robustness and cross-condition consistency than representative baselines under realistic multimodal perturbations.
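The abstract does not specify the preference objective, so the following is only a generic sketch of pairwise preference learning: a Bradley-Terry style loss that is small when a stable response scores above a perturbation-induced failure. The scoring inputs, the `beta` parameter, and the function name are assumptions for illustration, not RemoteShield's actual loss.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected, beta=1.0):
    """Bradley-Terry style loss: -log(sigmoid(beta * (preferred - rejected)))."""
    margin = beta * (score_preferred - score_rejected)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for two responses to inputs from the same cluster:
# the stable response should be preferred over the failure case.
loss_good = pairwise_preference_loss(2.0, 0.0)  # model already prefers the stable response
loss_bad = pairwise_preference_loss(0.0, 2.0)   # model prefers the failure, so the loss is larger
```

Minimizing such a loss over pairs drawn from one semantic equivalence cluster pushes the model toward the stable response regardless of which perturbation was applied.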
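[171] Instinct vs. Reflection: Unifying Token and Verbalized Confidence in Multimodal Large Models cs.CV | cs.AI
Yunkai Dang, Yifan Jiang, Yizhu Jiang, Anqi Chen, Wenbin Li
TL;DR: This paper addresses confidence estimation for multimodal large language models (MLLMs) and reveals an instinct-reflection misalignment: the model's implicit token-level support often disagrees with its explicit verbalized self-assessment. To address it, the paper proposes a monotone confidence fusion framework that merges the dual-channel signals and cross-channel consistency to estimate answer correctness, then applies an order-preserving mean alignment step to correct global bias, improving calibration and selective prediction.
Details
Motivation: Although MLLMs perform well on many perception and reasoning tasks, reliable practical deployment requires robust confidence estimation. Prior work focuses mainly on text-only LLMs and often relies on computationally expensive self-consistency sampling. This paper extends confidence estimation to multimodal settings and comprehensively evaluates MLLMs' response confidence.
Result: Experiments on diverse open-source and closed-source MLLMs show that the method consistently yields more reliable confidence estimates and improves both calibration and failure prediction.
Insight: The core contribution is the first systematic identification of the instinct-reflection misalignment in MLLMs, together with a unified framework that fuses token-level and verbalized confidence signals. Monotone fusion and order-preserving alignment improve calibration while preserving the risk-coverage trade-off, offering a new perspective on model reliability and interpretability.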
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in various perception and reasoning tasks. Despite this success, ensuring their reliability in practical deployment necessitates robust confidence estimation. Prior works have predominantly focused on text-only LLMs, often relying on computationally expensive self-consistency sampling. In this paper, we extend this to multimodal settings and conduct a comprehensive evaluation of MLLMs’ response confidence estimation. Our analysis reveals a significant instinct-reflection misalignment: the model’s implicit token-level support frequently diverges from its verbal self-assessment confidence. To address this misalignment, we propose a monotone confidence fusion framework to merge dual-channel signals and cross-channel consistency to estimate correctness. Subsequently, an order-preserving mean alignment step is applied to correct global bias, which improves calibration while preserving the risk-coverage trade-off for selective prediction. Experiments on diverse open-source and closed-source MLLMs show that our method consistently yields more reliable confidence estimates and improves both calibration and failure prediction. Code will be available at https://github.com/Yunkaidang/Instinct-vs.-Reflection.
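To make the idea of an order-preserving mean alignment concrete, here is one possible instantiation, which is an assumption rather than the paper's procedure: a power transform `c ** gamma` is strictly increasing on [0, 1], so it keeps the ranking of confidences (and hence the risk-coverage curve) while `gamma` is tuned so that the mean confidence matches a target such as observed accuracy.

```python
def align_mean(confidences, target_mean, iters=60):
    """Rescale confidences with c -> c ** gamma so their mean is close to target_mean.

    Assumes confidences lie in [0, 1] and that the target is attainable.
    The transform is monotone, so the ordering of examples is unchanged.
    """
    lo, hi = 1e-3, 1e3
    for _ in range(iters):
        gamma = (lo * hi) ** 0.5  # bisection in log space
        mean = sum(c ** gamma for c in confidences) / len(confidences)
        if mean > target_mean:
            lo = gamma  # a larger exponent lowers the mean
        else:
            hi = gamma
    gamma = (lo * hi) ** 0.5
    return [c ** gamma for c in confidences]


# An overconfident model: mean confidence 0.8125, but suppose accuracy is 0.6.
raw = [0.9, 0.8, 0.95, 0.6]
aligned = align_mean(raw, target_mean=0.6)
```

Because only a global bias is corrected, any selective-prediction rule that thresholds on rank behaves the same before and after alignment.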
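[172] PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction cs.CV
Xueheng Li, Tao Hu, Ke Cao, Runsheng Qi, Huixin Zhang
TL;DR: This paper proposes PestVL-Net, a novel vision-language framework for multimodal pest learning through fine-grained vision-language interaction. It combines a visual pathway based on the RWKV architecture (with a saliency-guided adaptive window partitioning scheme) and a linguistic pathway that uses multimodal large language model priors to generate precise pest semantic descriptions, structured via multimodal chain-of-thought reasoning. Experiments on two multi-species pest datasets validate its superior performance.
Details
Motivation: In real agricultural scenarios, pest data are hard to collect, species are numerous with complex and diverse morphology, and existing techniques struggle to model the key visual and high-level semantic features of pests at a fine-grained level. The work aims to enable effective pest recognition and management.
Result: Extensive experimental evaluations on multiple pest datasets validate the superior performance of PestVL-Net and indicate its potential for real-world pest management.
Insight: The novelty is a synergistic vision-language framework that fuses visual and textual representations at a fine-grained level for pest learning. Specifically, the visual pathway uses the RWKV architecture with saliency-guided adaptive window partitioning to model fine-grained visual features, and the linguistic pathway draws on MLLM priors and agricultural expert knowledge to generate structured semantic descriptions via multimodal chain-of-thought reasoning. The deep fusion of these complementary representations is the key to fine-grained multimodal pest learning.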
Abstract: Effective pest recognition and management are crucial for sustainable agricultural development. However, collecting pest data in real scenarios is often challenging. Compared to other domains, pests exhibit a wide variety of species with complex and diverse morphological characteristics. Existing techniques struggle to effectively model the key visual and high-level semantic features of pests in a fine-grained manner. These limitations hinder the practical application of such methods in real agricultural scenarios. To address these critical challenges, we present a synergistic approach that integrates PestVL-Net, a novel vision-language framework, with two multi-species pest datasets to facilitate fine-grained pest learning. The visual pathway of PestVL-Net utilizes the Recurrent Weighted Key Value (RWKV) architecture, incorporating a saliency-guided adaptive window partitioning scheme to effectively model the fine-grained visual characteristics of pests. Concurrently, the linguistic component generates precise pest semantic descriptions by leveraging Multimodal Large Language Models (MLLMs) priors, critically informed by agricultural expert knowledge and structured via multimodal Chain-of-Thought (CoT) reasoning. The deep fusion of these complementary visual and textual representations enables fine-grained multimodal pest learning. Extensive experimental evaluations on multiple pest datasets validate the superior performance of PestVL-Net, highlighting its potential for effective real-world pest management.
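[173] Frequency-guided Multi-level Reasoning for Scene Graph Generation in Video cs.CV
Chenxing Li, Yiping Duan, Xiaoming Tao
TL;DR: This paper proposes FReMuRe, a Frequency-guided Relational Multi-level Reasoning model for video scene graph generation, which strengthens the modeling of long-tail relationships through its mechanism design and thereby improves the quality of structured semantic representations.
Details
Motivation: Existing video scene graph generation methods are limited in handling long-tail relationship distributions. The paper aims to improve long-tail relationship modeling from a mechanism perspective.
Result: Experiments on the Action Genome dataset show that FReMuRe significantly improves the recall of long-tail relationships and overall reasoning robustness.
Insight: The contributions include relation-specific branches to handle gradient conflicts, a frequency-aware dual-branch predicate embedding network that models high- and low-frequency relationships separately, and two interchangeable relation classification heads, a Bayesian Head and a Gaussian Mixture Model Head, to improve uncertainty estimation and intra-class diversity.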
Abstract: Video Scene Graph Generation aims to obtain structured semantic representations of objects and their relationships in videos for high-level understanding. However, existing methods still have limitations in handling long-tail distributions. This paper proposes the Frequency-guided Relational Multi-level Reasoning (FReMuRe) model, which enhances the modeling of long-tail relationships from a mechanism perspective. We introduce relation-specific branches to deal with gradient conflicts, yielding more balanced and tail-aware learning. We also design a frequency-aware dual-branch predicate embedding network that models high-frequency and low-frequency relationships separately and improves the recall of tail classes through gated fusion. In addition, we propose two interchangeable relation classification heads: a Bayesian Head for uncertainty estimation and a new Gaussian Mixture Model Head to enhance intra-class diversity. Experimental results show that FReMuRe significantly improves the recall of long-tail relationships and overall reasoning robustness on the Action Genome dataset.
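[174] When Background Matters: Breaking Medical Vision Language Models by Transferable Attack cs.CV
Akash Ghosh, Subhadip Baidya, Sriparna Saha, Xiuying Chen
TL;DR: This paper proposes MedFocusLeak, a transferable black-box multimodal attack on medical vision-language models. It injects coordinated perturbations into non-diagnostic background regions and uses an attention distraction mechanism to shift the model's focus away from pathological areas, inducing incorrect yet clinically plausible diagnoses while keeping the perturbations imperceptible.
Details
Motivation: Existing medical adversarial attacks are limited: they either target secondary objectives such as model stealing, or are transferred from natural images and introduce visible distortions that clinicians can easily detect. The robustness of medical VLMs has therefore not been explored effectively.
Result: Extensive evaluations across six medical imaging modalities show that MedFocusLeak achieves state-of-the-art performance, generating misleading yet realistic diagnostic outputs across diverse VLMs.
Insight: The novelty is a transferable, stealthy attack paradigm that targets the background regions of medical VLM inputs, together with a unified evaluation framework that jointly measures attack success and image fidelity, revealing a critical weakness in the reasoning capabilities of modern clinical VLMs.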
Abstract: Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, yet their robustness to adversarial attacks remains largely unexplored, posing serious risks. Existing medical attacks focus on secondary objectives such as model stealing or adversarial fine-tuning, while transferable attacks from natural images introduce visible distortions that clinicians can easily detect. To address this, we propose MedFocusLeak, a highly transferable black-box multimodal attack that induces incorrect yet clinically plausible diagnoses while keeping perturbations imperceptible. The method injects coordinated perturbations into non-diagnostic background regions and employs an attention distraction mechanism to shift the model’s focus away from pathological areas. Extensive evaluations across six medical imaging modalities show that MedFocusLeak achieves state-of-the-art performance, generating misleading yet realistic diagnostic outputs across diverse VLMs. We further introduce a unified evaluation framework with novel metrics that jointly capture attack success and image fidelity, revealing a critical weakness in the reasoning capabilities of modern clinical VLMs.
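[175] E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition cs.CV | cs.CL
Meng Zhang, Jinzhong Ning, Xiaolong Wu, Hongfei Lin, Yijia Zhang
TL;DR: This paper proposes E2E-GMNER, an end-to-end generative framework for Grounded Multimodal Named Entity Recognition. Built on a multimodal large language model, it unifies textual entity recognition, semantic type prediction, visual region grounding, and implicit knowledge reasoning in a single generative model, and uses chain-of-thought reasoning to decide adaptively when to use visual evidence or background knowledge.
Details
Motivation: Existing GMNER methods mostly use pipeline architectures that decouple textual entity recognition from visual grounding, leading to error accumulation and suboptimal joint optimization. The paper addresses these problems with a unified end-to-end generative framework.
Result: Extensive experiments on the Twitter-GMNER and Twitter-FMNERG benchmarks show that E2E-GMNER is highly competitive with current state-of-the-art methods, validating unified end-to-end optimization and noise-aware grounding supervision.
Insight: The main contributions are: 1) formulating GMNER as an instruction-tuned conditional generation task with chain-of-thought reasoning, so the model can decide adaptively when to rely on visual evidence or background knowledge; 2) Gaussian Risk-Aware Box Perturbation, which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness to annotation noise and discretization errors. Together these give a more unified and robust end-to-end generative paradigm for multimodal entity recognition.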
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E-GMNER, a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction-tuned conditional generation task and incorporate chain-of-thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To further address the instability of generative bounding box prediction, we introduce Gaussian Risk-Aware Box Perturbation (GRBP), which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness against annotation noise and discretization errors. Extensive experiments on the Twitter-GMNER and Twitter-FMNERG benchmarks demonstrate that E2E-GMNER achieves highly competitive performance compared with state-of-the-art methods, validating the effectiveness of unified end-to-end optimization and noise-aware grounding supervision. Code is available at: https://github.com/Finch-coder/E2E-GMNER
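The abstract does not give the details of GRBP, so the sketch below shows only the general idea of perturbing a box target with Gaussian noise: each coordinate is jittered with a standard deviation proportional to the box size, then clipped and reordered. The function name, `sigma_ratio`, and the normalized coordinate convention are assumptions, not the authors' formulation.

```python
import random


def perturb_box(box, sigma_ratio=0.05, rng=None):
    """Jitter a normalized (x1, y1, x2, y2) box with size-scaled Gaussian noise."""
    rng = rng or random.Random()
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1

    def clip(v):
        return min(1.0, max(0.0, v))

    nx1 = clip(x1 + rng.gauss(0.0, sigma_ratio * w))
    ny1 = clip(y1 + rng.gauss(0.0, sigma_ratio * h))
    nx2 = clip(x2 + rng.gauss(0.0, sigma_ratio * w))
    ny2 = clip(y2 + rng.gauss(0.0, sigma_ratio * h))
    # Keep the corners ordered even if noise makes them cross.
    return (min(nx1, nx2), min(ny1, ny2), max(nx1, nx2), max(ny1, ny2))


# Draw a few softened targets around one annotated box.
rng = random.Random(0)
samples = [perturb_box((0.2, 0.3, 0.6, 0.9), 0.05, rng) for _ in range(5)]
```

Training against several such samples instead of one exact box is one way to make a generative model less sensitive to small annotation and discretization errors.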
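[176] Towards Joint Quantization and Token Pruning of Vision-Language Models cs.CV
Xinqing Li, Xin He, Xindong Zhang, Ming-Ming Cheng, Lei Zhang
TL;DR: This paper proposes a collaborative quantization-and-pruning framework, built around the QUOTA allocator, that jointly optimizes low-bit inference and deterministic visual-token pruning for vision-language models (VLMs) to reduce inference cost. It converts quantization calibration signals into a layer-wise token allocation schedule, realized as a pruning recipe, which substantially reduces the number of visual tokens and the KV cache size while preserving performance.
Details
Motivation: Deploying VLMs under low-bit inference is challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary, but naive stage-wise combinations are often brittle because quantization calibration and pruning execution are mismatched, so a unified framework is needed to optimize the two together.
Result: On standard VLM benchmarks under the same low-bit (W4A4) setting, the method achieves 95.65% average performance retention while keeping only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise baselines, showing better robustness.
Insight: The novelty is QUOTA, which unifies quantization and pruning in a single deployable pipeline. Token importance is evaluated under a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-k selection. This offers a co-optimization approach to model compression that could carry over to low-bit deployment of other multimodal or large language models.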
Abstract: Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-$k$ selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65% average retention while retaining only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise combinations. The code will be released.
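A budgeted top-k selection of the kind mentioned above might look like the sketch below. How QUOTA actually combines the three signals is not stated in the abstract, so the additive score, the weights, and the function name are assumptions made purely for illustration.

```python
def select_tokens(act, attn, risk, keep_ratio=0.3, weights=(1.0, 1.0, 1.0)):
    """Return the sorted indices of visual tokens kept under a fixed budget.

    act, attn, risk: per-token activation magnitude, attention cue, and
    low-bit risk signal (hypothetical inputs, combined additively here).
    """
    w_act, w_attn, w_risk = weights
    scores = [w_act * a + w_attn * t + w_risk * r for a, t, r in zip(act, attn, risk)]
    budget = max(1, round(keep_ratio * len(scores)))
    # Rank by descending score; ties break on index so the result is deterministic.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:budget])


# Ten visual tokens with a 30% budget keeps three of them.
act = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.0]
kept = select_tokens(act, [0.0] * 10, [0.0] * 10, keep_ratio=0.3)
```

Returning the kept indices in their original order preserves the positional layout of the remaining visual tokens, and the deterministic tie-break means the same recipe can be computed offline and replayed at inference time.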
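[177] R-FLoRA: Residual-Statistic-Gated Low-Rank Adaptation for Single-Image Face Morphing Attack Detection cs.CV
Raghavendra Ramachandra
TL;DR: This paper proposes R-FLoRA, a framework for single-image face morphing attack detection (S-MAD). It combines high-frequency Laplacian residual statistics with representations from a frozen foundation vision transformer (ViT), using residual-statistic-gated low-rank adapters (R-FLoRA) and feature-wise residual fusion (Res-FiLM) to increase sensitivity to local morphing artefacts while preserving the backbone's semantic context. A novel residual-contrastive alignment loss regularises the fused token space to improve discrimination under unseen morphing conditions.
Details
Motivation: Face morphing attacks pose a major threat to the reliability of face recognition systems used in passport issuance, border control, and digital identity verification. Detecting morphing attacks from a single face image remains challenging because there is no trusted reference and attack generation methods are diverse.
Result: Comprehensive experiments on four ICAO-compliant datasets covering seven morph generation techniques show that the method consistently surpasses nine recent SOTA S-MAD algorithms in detection accuracy and cross-domain (or cross-dataset) generalisation. With a frozen backbone and very few trainable parameters, the model achieves real-time efficiency and interpretability.
Insight: The contributions are: 1) combining high-frequency residual statistics with frozen foundation ViT representations to capture local morphing artefacts; 2) the R-FLoRA and Res-FiLM modules, which fuse multi-scale features effectively; 3) a residual-contrastive alignment loss that regularises the feature space to improve generalisation. Viewed objectively, the lightweight adapter design delivers high performance and efficiency while keeping the backbone frozen, making it a practical option for real biometric verification systems.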
Abstract: Face morphing attacks pose a substantial risk to the reliability of face recognition systems used in passport issuance, border control, and digital identity verification. Detecting morphing attacks from a single facial image remains challenging owing to the lack of a trusted reference and the diversity of attack generation methods. This paper presents a new Single-Image Face Morphing Attack Detection (S-MAD) framework that integrates high-frequency Laplacian residual statistics with representations from a frozen, foundation-scale vision transformer. The approach employs residual-statistic-gated low-rank adapters (R-FLoRA) and feature-wise residual fusion (Res-FiLM) to enhance sensitivity to local morphing artefacts while preserving the semantic context of the backbone. A novel residual-contrastive alignment loss further regularises the fused token space, improving discrimination under unseen morphing conditions. Comprehensive experiments on four ICAO-compliant datasets, encompassing seven morph generation techniques, demonstrate that the proposed method consistently surpasses nine recent state-of-the-art S-MAD algorithms in detection accuracy and cross-domain (or dataset) generalisation. With a frozen backbone and minimal trainable parameters, the model achieves real-time efficiency and interpretability, making it suitable for real-life scenarios in biometric verification systems.
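For readers unfamiliar with Laplacian residuals, the sketch below applies a standard 4-neighbour Laplacian to a grayscale image (as nested lists) and summarises the response. The particular statistics returned here are assumptions chosen for illustration; the abstract does not say which residual statistics gate the adapters.

```python
def laplacian_residual(img):
    """4-neighbour Laplacian response on the interior pixels of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    return [
        [
            img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
            for x in range(1, w - 1)
        ]
        for y in range(1, h - 1)
    ]


def residual_stats(img):
    """Mean, variance, and mean absolute value of the Laplacian residual."""
    vals = [v for row in laplacian_residual(img) for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    mean_abs = sum(abs(v) for v in vals) / len(vals)
    return {"mean": mean, "var": var, "mean_abs": mean_abs}


# A single bright pixel produces a strong high-frequency response.
spike = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
stats = residual_stats(spike)
```

Smooth regions and linear intensity ramps give a zero Laplacian, so the statistics respond mainly to fine texture and blending seams, which is where morphing artefacts tend to appear.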
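[178] Robust Diabetic Retinopathy Grading Using Dual-Resolution Attention-Based Deep Learning with Ordinal Regression cs.CV | cs.AI
Afshan Hashmi
TL;DR: This study proposes a robust dual-resolution deep learning framework for diabetic retinopathy (DR) grading that combines attention-based feature fusion with ordinal regression to improve cross-dataset generalization. Two parallel EfficientNet backbones at different spatial resolutions capture complementary retinal features, a learnable attention mechanism adaptively fuses the multi-resolution representations, and an ordinal regression formulation based on the cumulative link model (CORAL) explicitly handles the ordered nature of DR severity. A preprocessing strategy combining circular cropping, contrast enhancement, and histogram matching mitigates domain discrepancies between datasets.
Details
Motivation: Deep learning models degrade when deployed across datasets acquired under different imaging conditions. The goal is a robust, generalizable automated DR grading system to support large-scale screening.
Result: The model was trained on APTOS 2019 and evaluated on an internal validation split and the external Messidor-2 test set. It achieves a quadratic weighted kappa (QWK) of 0.88 on the APTOS validation set and 0.68 on the unseen Messidor-2 dataset, indicating improved robustness for cross-dataset DR grading.
Insight: The novelty is combining dual-resolution feature extraction, adaptive attention fusion, and ordinal regression (CORAL) to handle multi-scale features and the ordering of DR grades together. Viewed objectively, the preprocessing strategy (for example, histogram matching) is a practical and effective engineering approach to domain adaptation for medical images that helps performance on out-of-distribution data. The framework offers a reusable template for medical image classification tasks with inherently ordered labels.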
Abstract: Diabetic retinopathy (DR) is a leading cause of vision impairment worldwide, and automated grading systems play a crucial role in large-scale screening programs. However, deep learning models often exhibit degraded performance when deployed across datasets acquired under different imaging conditions. This study presents a robust dual-resolution deep learning framework for DR grading that integrates attention-based feature fusion with ordinal regression to improve cross-dataset generalization. The proposed method employs two parallel EfficientNet backbones operating at different spatial resolutions to capture complementary retinal features. A learnable attention mechanism adaptively fuses multi-resolution representations, while an ordinal regression formulation based on the cumulative link model (CORAL) explicitly accounts for the ordered nature of DR severity levels. To mitigate domain discrepancies between datasets, a preprocessing strategy combining circular cropping, contrast enhancement, and histogram matching is applied. The model was trained on the APTOS 2019 dataset and evaluated on both an internal validation split and an external Messidor-2 test set. Experimental results demonstrate strong grading performance, achieving a quadratic weighted kappa (QWK) of 0.88 on the APTOS validation set and 0.68 on the unseen Messidor-2 dataset, indicating improved robustness for cross-dataset DR grading applications.
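The two ordinal ingredients named in this abstract can be sketched in a few lines: the extended binary encoding commonly used for ordinal regression (a grade k becomes K-1 "is the grade greater than j?" targets) and the quadratic weighted kappa metric. This is a plain-Python illustration with hypothetical function names, not the paper's code.

```python
def ordinal_targets(grade, num_classes):
    """Encode a grade in 0..K-1 as K-1 binary targets: 1 if grade > threshold j."""
    return [1 if grade > j else 0 for j in range(num_classes - 1)]


def decode_grade(threshold_probs):
    """Predicted grade is the number of thresholds passed with probability > 0.5."""
    return sum(1 for p in threshold_probs if p > 0.5)


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), w = (i - j)^2 / (K - 1)^2.

    Assumes the labels are not all identical, otherwise the denominator is zero.
    """
    n = num_classes
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_true[i] * hist_pred[j] / total
    return 1.0 - num / den


targets = ordinal_targets(3, num_classes=5)
grade = decode_grade([0.9, 0.8, 0.6, 0.2])
```

Because the weight grows with the squared distance between grades, QWK penalizes confusing grade 0 with grade 4 far more than confusing adjacent grades, which matches the clinical cost of grading errors.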
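[179] When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models cs.CV | cs.AI
Cui Yakun, Xingqun Qi, TianTian Geng, Yuyao Zhang, Sirui Han
TL;DR: This paper targets Text Overlay-Induced Hallucination (TOIH) in vision-language models (VLMs): when on-screen overlay text contradicts the visual scene, models prioritize the text semantics over the actual visual content and hallucinate systematically. The authors propose VisualTextTrap, the first comprehensive benchmark, with large-scale human-validated samples and purpose-built evaluation metrics, and develop VTHM-MoE, a novel vision-text disentanglement framework that mitigates hallucination with a dual-encoder architecture and an adaptive token routing strategy, outperforming existing state-of-the-art methods on video question answering tasks.
Details
Motivation: Existing VLMs hallucinate systematically when on-screen overlay text contradicts the visual content, prioritizing text semantics over what is actually shown. The paper addresses this problem, Text Overlay-Induced Hallucination (TOIH).
Result: Extensive experiments on the proposed VisualTextTrap benchmark verify the effectiveness of VTHM-MoE, which outperforms state-of-the-art counterparts on diverse video question answering tasks.
Insight: The contributions are: 1) the first definition of the TOIH problem and a large-scale benchmark with fine-grained annotations, VisualTextTrap; 2) the vision-text disentanglement framework VTHM-MoE, which uses a dual-encoder architecture and four dimension-specialized expert modules (Temporal, Action, Object, Spatial) with an adaptive token routing strategy that allocates experts dynamically to identify and exploit cross-modal discrepancies, resisting TOIH while preserving performance on uncontaminated videos.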
Abstract: Recent advances in Vision-Language Models (VLMs) have substantially enhanced their ability across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH). In this work, we propose VisualTextTrap, the first comprehensive benchmark for TOIH, including large-scale human-validated samples with specifically designed evaluation metrics. In particular, we construct VisualTextTrap from widely used public datasets using a scalable hybrid pipeline of VLM-assisted text generation and rigorous manual verification. The benchmark features 6,057 samples annotated across 88 fine-grained attributes within four dimensions, with hallucination intensity quantified on a five-level scale (L1–L5) that reflects the semantic contradiction between overlay text and visual reality. Moreover, we propose Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a novel Vision-Text Disentanglement framework that employs a dual-encoder architecture. Concretely, four dimension-specialized expert modules spanning Temporal, Action, Object, and Spatial reasoning are first pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content. We develop an Adaptive Token Routing Strategy to enable dynamic expert allocation, conferring robust resistance to TOIH while preserving performance on uncontaminated videos. Extensive experiments conducted on our VisualTextTrap benchmark verify the effectiveness of VTHM-MoE, which outperforms state-of-the-art counterparts on diverse video question answering tasks.
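[180] Towards Generalizable Deepfake Image Detection with Vision Transformers cs.CV | cs.AI | cs.LG | eess.IV
Kaliki V Srinanda, M Manvith Prabhu, Hemanth K Mogilipalem, Jayavarapu S Abhinai, Vaibhav Santhosh
TL;DR: This paper proposes an ensemble of fine-tuned vision transformers (DINOv2, AIMv2, and OpenCLIP's ViT-L/14) to improve the generalization of deepfake image detection. The method is evaluated on the DF-Wild dataset released for the IEEE SP Cup 2025, which covers a diverse and challenging set of manipulation and generation techniques. Experiments show that the ensemble outperforms individual models and strong CNN baselines, reaching 96.77% AUC and 9% EER on the DF-Wild test set and surpassing Effort, the current state-of-the-art deepfake detection algorithm.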
Details
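Motivation: Deepfake image detection is challenging because modern generative models evolve quickly and existing methods generalize poorly; the paper aims to develop a detector that generalizes well.
Result: On the DF-Wild test set, the ensemble reaches 96.77% AUC and 9% EER, improving on the SOTA algorithm Effort by 7.05% and 8% respectively, and was the winning solution of the SP Cup at ICASSP 2025.
Insight: The contribution is an ensemble of fine-tuned vision transformers for better generalization. Viewed objectively, combining several pretrained ViT models captures general deepfake features effectively and improves detection across generation techniques.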
Abstract: Detecting deepfake images is challenging because modern generative models evolve quickly and existing methods generalize poorly. In this paper, we use an ensemble of fine-tuned vision transformers, such as DINOv2, AIMv2, and OpenCLIP’s ViT-L/14, to create a generalizable method for detecting deepfakes. We use the DF-Wild dataset released as part of the IEEE SP Cup 2025 because it covers a challenging and diverse set of manipulation and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% and 8% in AUC and EER respectively. This was the winning solution for SP Cup, presented at ICASSP 2025.
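The two reported metrics depend on how per-model scores are combined and thresholded. The sketch below is illustrative only: the abstract does not state the fusion rule, so simple score averaging is an assumption, and the function names `ensemble_scores` and `equal_error_rate` are hypothetical. It shows how an Equal Error Rate can be read off a set of fake-probability scores.

```python
import numpy as np

def ensemble_scores(per_model_scores):
    """Average fake-probabilities across models (rows = models, columns = images).

    Averaging is an assumed fusion rule, not the one confirmed by the paper.
    """
    return np.mean(np.asarray(per_model_scores, dtype=float), axis=0)

def equal_error_rate(labels, scores):
    """Return the error rate at the threshold where false accepts and false rejects are closest."""
    labels = np.asarray(labels, dtype=bool)      # True = fake image
    scores = np.asarray(scores, dtype=float)     # higher = more likely fake
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        predicted_fake = scores >= threshold
        far = np.mean(predicted_fake[~labels])   # real images flagged as fake
        frr = np.mean(~predicted_fake[labels])   # fake images that were missed
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With two toy models whose averaged scores separate real from fake images perfectly, the EER is 0; overlapping scores push it toward 0.5.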
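[181] SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning cs.CV
Yian Li, Yang Jiao, Bin Zhu, Tianwen Qian, Shaoxiang Chen
TL;DR: This paper proposes SpatialImaginer, a unified multimodal generation framework that addresses the loss of geometric detail and the fragile reasoning traces that multimodal large language models show on spatial reasoning tasks because textual representations are abstract. The framework follows a divide-and-conquer strategy, using textual chain-of-thought for high-level semantic planning and visual imagination for geometry-sensitive state transformation and consistency preservation. A difficulty-aware data engine with closed-loop verification trains the model to invoke visual imagination selectively, yielding state-of-the-art performance on several spatial intelligence benchmarks.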
Details
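Motivation: Current MLLMs exhibit fragile reasoning traces on spatial reasoning tasks that involve consistent spatial state recognition. The root cause is a mismatch between the spatial recognition mechanism and text-only reasoning behavior: textual representations tend to abstract away critical geometric details, whereas effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process.
Result: Extensive experiments on diverse spatial intelligence benchmarks show that SpatialImaginer achieves state-of-the-art performance and substantially improves robustness on complex multi-step spatial reasoning tasks.
Insight: The contribution is a unified multimodal generation framework that integrates textual reasoning with visual imagination, using a divide-and-conquer strategy to handle semantic planning and geometric state transformation separately. It also introduces a difficulty-aware data engine with closed-loop verification that trains the model to invoke visual imagination selectively when stable spatial state tracking is required, an architectural design that other work on spatial reasoning can draw on.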
Abstract: Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models. Despite promising performance, recent multimodal large language models (MLLMs) often exhibit fragile reasoning traces in spatial intelligence tasks that involve consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual imagination. Our framework adopts a divide-and-conquer strategy, using text chain-of-thought for high-level semantic planning and the visual imagination for geometry-sensitive state transformation and consistency preservation. To support this capability, we further introduce a difficulty-aware data engine with closed-loop verification to train the model to invoke visual imagination selectively when stable spatial state tracking is required. Extensive experiments on diverse spatial intelligence benchmarks show that SpatialImaginer achieves state-of-the-art performance and substantially improves robustness on complex multi-step spatial reasoning tasks.
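[182] Speculative Decoding for Autoregressive Video Generation cs.CV | cs.AI
Yuezhou Hu, Jintao Zhang
TL;DR: This paper proposes SDVG, which applies speculative decoding to autoregressive video diffusion models by introducing an image-quality router that verifies candidate video blocks, accelerating inference without changing the model architecture.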
Details
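Motivation: Speculative decoding has successfully accelerated inference in large language models, but whether it can be applied effectively to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling.
Result: On 1003 MovieGenVideoBench prompts (832x480), SDVG achieves a 1.59x speedup while retaining 98.1% of the target model's VisionReward quality (0.0773 vs. 0.0788), reaches 2.09x at 95.7% quality retention, and consistently outperforms draft-only generation by more than 17%.
Insight: Innovations include replacing token verification with an image-quality router, using worst-frame aggregation to catch single-frame artifacts, and balancing quality against speed by force-rejecting the first block and tuning the threshold tau. The method is training-free and integrates seamlessly into existing autoregressive video generation pipelines.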
Abstract: Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation, that is, taking the minimum per-frame reward to catch single-frame artifacts that averaging would mask. Blocks scoring above a fixed threshold tau are accepted into the 14B target’s KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and tau serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with tau=-0.7, and reaches 2.09x at 95.7% quality retention, while consistently outperforming draft-only generation by over 17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
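The routing rule described in the abstract (worst-frame aggregation, a fixed threshold tau, and a force-rejected first block) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the per-frame rewards are passed in as plain numbers rather than computed by ImageReward, and the names `worst_frame_score` and `route_blocks` are hypothetical.

```python
def worst_frame_score(frame_rewards):
    """Aggregate a block's per-frame rewards by taking the minimum (the worst frame)."""
    return min(frame_rewards)

def route_blocks(draft_block_rewards, tau):
    """Decide, per block, whether the draft is kept or the target regenerates it.

    draft_block_rewards: one list of per-frame rewards for each drafted block.
    Returns a list of "draft" / "target" decisions in block order.
    """
    decisions = []
    for index, frame_rewards in enumerate(draft_block_rewards):
        if index == 0:
            decisions.append("target")   # first block is always force-rejected
        elif worst_frame_score(frame_rewards) > tau:
            decisions.append("draft")    # strictly above tau: accept the draft
        else:
            decisions.append("target")   # otherwise the target regenerates it
    return decisions
```

Raising tau sends more blocks to the target (higher quality, lower speedup); lowering it keeps more drafted blocks.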
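[183] Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding cs.CV | cs.MM
Shaoguang Wang, Weiyu Guo, Ziyang Chen, Xuming Hu, Hui Xiong
TL;DR: This paper proposes Q-Gate, a framework that uses a query-modulated gating mechanism to dynamically route multimodal experts for keyframe selection in long video understanding, addressing the noise interference between visual and textual modalities in existing methods.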
Details
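Motivation: Existing long video understanding methods incur high computational cost when processing dense frame sequences, and keyframe selection paradigms based on a single visual metric or a static fusion of heuristic scores cannot adapt to different query intents (e.g., visual tasks versus narrative queries), which leads to modal noise or outright failure.
Result: On the LongVideoBench and Video-MME benchmarks, Q-Gate substantially outperforms existing state-of-the-art methods across multiple MLLM backbones, delivering robust performance for scalable video reasoning.
Insight: The contribution is to model keyframe selection as a dynamic modality routing problem. Lightweight expert streams (visual grounding, global matching, contextual alignment) and a query-modulated gating mechanism based on LLM in-context reasoning activate the relevant modalities and suppress noise, providing an interpretable, training-free, plug-and-play solution.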
Abstract: Long video understanding remains a formidable challenge for Multimodal Large Language Models (MLLMs) due to the prohibitive computational cost of processing dense frame sequences. Prevailing solutions, which select a keyframe subset, typically rely on either a single visual-centric metric (e.g., CLIP similarity) or a static fusion of heuristic scores. This “one-size-fits-all” paradigm frequently fails: visual-only metrics are ineffective for plot-driven narrative queries, while indiscriminately incorporating textual scores introduces severe “modal noise” for purely visual tasks. To break this bottleneck, we propose Q-Gate, a plug-and-play and training-free framework that treats keyframe selection as a dynamic modality routing problem. We decouple the retrieval process into three lightweight expert streams: Visual Grounding for local details, Global Matching for scene semantics, and Contextual Alignment for subtitle-driven narratives. Crucially, Q-Gate introduces a Query-Modulated Gating Mechanism that leverages the in-context reasoning of an LLM to assess the query’s intent and dynamically allocate attention weights across the experts. This mechanism intelligently activates necessary modalities while “muting” irrelevant ones, thereby maximizing the signal-to-noise ratio. Extensive experiments on LongVideoBench and Video-MME across multiple MLLM backbones demonstrate that Q-Gate substantially outperforms state-of-the-art baselines. By effectively suppressing modality-specific noise, it provides a robust, highly interpretable solution for scalable video reasoning.
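The gating idea above can be illustrated as a weighted fusion of three per-frame score streams followed by a top-k pick. The sketch is an assumption-laden toy: in the paper the weights come from an LLM's reading of the query, whereas here they are supplied as raw logits and normalized with a softmax, and the function name `select_keyframes` is hypothetical.

```python
import numpy as np

def select_keyframes(expert_scores, gate_logits, k):
    """Fuse per-frame expert scores with query-dependent weights and keep the top-k frames.

    expert_scores: array of shape (num_experts, num_frames), one row per expert stream
                   (e.g. visual grounding, global matching, contextual alignment).
    gate_logits:   one logit per expert; a large logit activates that expert,
                   a small one effectively mutes it (softmax is an assumed choice).
    """
    scores = np.asarray(expert_scores, dtype=float)
    logits = np.asarray(gate_logits, dtype=float)
    weights = np.exp(logits - logits.max())      # numerically stable softmax
    weights /= weights.sum()
    fused = weights @ scores                     # one fused score per frame
    top = np.argsort(-fused, kind="stable")[:k]  # indices of the k best frames
    return sorted(top.tolist()), weights
```

For a narrative query the gate would favor the subtitle-driven stream, so frames that only look visually similar to the query are ranked lower; for a purely visual query the opposite happens.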
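[184] Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation cs.CV | cs.AI
Zhijiang Tang, Jiaxin Qi, Bing Zhao, Jianqiang Huang
TL;DR: This paper proposes Long-CODE, a new framework dedicated to long-video evaluation that addresses the shortcomings of existing video generation metrics in long-video settings. The authors argue that short-term visual perception and long-context attributes are orthogonal dimensions, so long-video evaluation should be decoupled from short-video evaluation. The paper first exposes the limitations of existing metrics through a suite of long-video attribute corruption tests, then designs a new long-video metric based on shot dynamics and builds Long-CODE, a benchmark dataset dedicated to long-video evaluation. Experiments show that the proposed metric reaches SOTA correlation with human judgments.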
Details
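Motivation: As video generation models become more capable, the need for robust video evaluation metrics grows more urgent. Traditional metrics mainly target short-video evaluation, focusing on frame-level visual quality and local temporal smoothness, and cannot capture key long-range properties of long videos such as narrative richness and global causal consistency.
Result: Extensive experiments on the purpose-built long-video evaluation benchmark Long-CODE show that the proposed metric reaches SOTA correlation with human judgments.
Insight: The contribution is to treat long-context attributes explicitly as an orthogonal dimension of video evaluation for the first time, and to design a dedicated evaluation framework and benchmark dataset for it. Viewed objectively, the core contributions are a new metric based on shot dynamics that is sensitive to long-range structure and a human-annotated dataset that isolates purely long-range characteristics, offering a more comprehensive and unbiased evaluation paradigm for video generation models.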
Abstract: As video generation models achieve unprecedented capabilities, the demand for robust video evaluation metrics becomes increasingly critical. Traditional metrics are intrinsically tailored for short-video evaluation, predominantly assessing frame-level visual quality and localized temporal smoothness. However, as state-of-the-art video generation models scale to generate longer videos, these metrics fail to capture essential long-range characteristics, such as narrative richness and global causal consistency. Recognizing that short-term visual perception and long-context attributes are fundamentally orthogonal dimensions, we argue that long-video metrics should be disentangled from short-video assessments. In this paper, we focus on the rigorous justification and design of a dedicated framework for long-video evaluation. We first introduce a suite of long-video attribute corruption tests, exposing the critical limitations of existing short-video metrics, which stem from their insensitivity to structural inconsistencies such as shot-level perturbations and narrative shuffling. To bridge this gap, we design a novel long-video metric based on shot dynamics, which is highly sensitive to the long-range testing framework. Furthermore, we introduce Long-CODE (Long-Context as an Orthogonal Dimension for video Evaluation), a specialized dataset designed to benchmark long-video evaluation, with human annotations isolated specifically to genuine long-range characteristics. Extensive experiments show that our proposed metrics achieve state-of-the-art correlation with human judgments. Ultimately, our metric and benchmark seamlessly complement existing short-video standards, establishing a holistic and unbiased evaluation paradigm for video generation models.
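One way to see why frame-level metrics miss narrative shuffling is to apply the corruption and watch a frame-averaged score stay unchanged. The toy below is a sketch under stated assumptions: each shot is represented as a list of per-frame quality numbers, and `shuffle_shots` and `mean_frame_quality` are hypothetical helpers, not the paper's corruption suite or metric.

```python
import random

def shuffle_shots(shots, seed=0):
    """Return the shots in a permuted order (a narrative-shuffling corruption)."""
    rng = random.Random(seed)
    order = list(range(len(shots)))
    # Re-shuffle until the order actually changes (only possible with 2+ shots).
    while len(order) > 1 and order == sorted(order):
        rng.shuffle(order)
    return [shots[i] for i in order]

def mean_frame_quality(shots):
    """A stand-in short-video metric: the average per-frame quality over all shots."""
    frames = [quality for shot in shots for quality in shot]
    return sum(frames) / len(frames)
```

Because the average ignores ordering, the shuffled video scores the same as the original, which is the kind of insensitivity the corruption tests are meant to expose.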
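[185] Attention Is not Everything: Efficient Alternatives for Vision cs.CV
Nur Mohammad Kazi, Ibteshum Khaled, Md. Luthful Hasan Galib, Ali Faruk Shihab, Md. Rakibul Islam
TL;DR: This survey systematically reviews research on non-Transformer models in computer vision. It proposes a comprehensive taxonomy that groups methods into convolution-based, MLP-based, state-space-based, and other categories, and assesses them in terms of efficiency, scalability, interpretability, and robustness.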
Details
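Motivation: Although Transformer models have driven significant progress in computer vision, many non-Transformer methods still perform well and compete directly with them. The paper aims to survey these non-Transformer methods comprehensively and to discuss the challenges they face and future research opportunities.
Result: The study selects 40 papers for analysis and assesses the non-Transformer methods along efficiency, scalability, understandability, and robustness, but provides no specific quantitative experimental results or benchmark comparisons.
Insight: The contribution is a systematic taxonomy for non-Transformer vision models that emphasizes the importance of architectural diversity in computer vision and gives future research a structured perspective and potential directions.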
Abstract: Recently, computer vision has advanced mainly thanks to Transformer-based models. However, many non-Transformer methods still perform well and compete directly with Transformer-based models. This review presents a comprehensive taxonomy of such methods, organizing them into categories such as convolution-based models, MLP-based models, state-space-based models, and more. The methods are assessed in terms of how efficient they are, how well they scale, how easy they are to understand, and how robust they are. A total of 40 papers were chosen for this study. The goal is to give an overview of non-Transformer methods and to identify challenges and opportunities for future computer vision research.
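[186] HSG: Hyperbolic Scene Graph cs.CV
Liyang Wang, Zeyu Zhang, Hao Tang
TL;DR: This paper proposes HSG (Hyperbolic Scene Graph), a method that learns scene graph embeddings in hyperbolic space to better capture hierarchical entailment relationships between objects and places in a scene, improving the structural quality of scene graphs while maintaining strong retrieval performance.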
Details
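Motivation: Existing methods such as MSG learn scene graph embeddings in Euclidean space through contrastive learning and attention mechanisms, but Euclidean geometry cannot explicitly model the hierarchical entailment relationships between places and objects, which limits the structural consistency of the learned representations.
Result: HSG performs well on scene graph modeling, particularly on graph-level metrics: it reaches a PP IoU of 33.17 and a Graph IoU of 33.51, an improvement of 8.14 over the best AoMSG variant (25.37), demonstrating the effectiveness of hyperbolic representation learning.
Insight: The core idea is to move scene graph embedding learning from Euclidean space to hyperbolic space, exploiting the natural ability of hyperbolic geometry to encode hierarchy in order to improve the structural quality of the representations. This offers a new geometric representation learning approach for hierarchically structured data such as scene graphs.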
Abstract: Scene graph representations enable structured visual understanding by modeling objects and their relationships, and have been widely used for multiview and 3D scene reasoning. Existing methods such as MSG learn scene graph embeddings in Euclidean space using contrastive learning and attention-based association. However, Euclidean geometry does not explicitly capture hierarchical entailment relationships between places and objects, limiting the structural consistency of learned representations. To address this, we propose Hyperbolic Scene Graph (HSG), which learns scene graph embeddings in hyperbolic space where hierarchical relationships are naturally encoded through geometric distance. Our results show that HSG improves hierarchical structure quality while maintaining strong retrieval performance. The largest gains are observed in graph-level metrics: HSG achieves a PP IoU of 33.17 and the highest Graph IoU of 33.51, outperforming the best AoMSG variant (25.37) by 8.14, highlighting the effectiveness of hyperbolic representation learning for scene graph modeling. Code: https://github.com/AIGeeksGroup/HSG.
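To make "hierarchy encoded through geometric distance" concrete, the sketch below computes the geodesic distance in the Poincaré ball with curvature -1. This is an assumption for illustration: the abstract does not say which model of hyperbolic space HSG uses, and `poincare_distance` is a hypothetical helper rather than code from the linked repository.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    def squared_norm(x):
        return sum(c * c for c in x)

    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

Distances grow rapidly toward the boundary of the ball, so general concepts (places) can sit near the origin while their more specific children (objects) spread out near the boundary with plenty of room between them.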
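[187] UniMesh: Unifying 3D Mesh Understanding and Generation cs.CV
Peng Huang, Yifeng Chen, Zeyu Zhang, Hao Tang
TL;DR: UniMesh is a unified framework for 3D mesh understanding and generation. A novel Mesh Head bridges diffusion-based image generation with implicit shape decoders, Chain of Mesh enables user-driven semantic mesh editing, and a self-reflection mechanism improves performance on high-level tasks such as 3D captioning.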
Details
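Motivation: In current 3D vision, understanding and generation tasks are handled separately, which leads to fragmented architectures, hinders knowledge transfer, and impedes holistic scene modeling. UniMesh aims to learn 3D generation and understanding jointly in a unified framework to promote synergy between the tasks.
Result: Experiments show that UniMesh achieves competitive performance on standard benchmarks and exhibits new capabilities in iterative editing and in mutual enhancement between generation and understanding.
Insight: Innovations include the Mesh Head as a cross-model interface, Chain of Mesh as geometric iterative reasoning that supports semantic editing, and a self-reflection triad mechanism for diagnosing and correcting errors in high-level tasks. These designs may advance unified modeling and interactive applications for 3D tasks.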
Abstract: Recent advances in 3D vision have led to specialized models for either 3D understanding (e.g., shape classification, segmentation, reconstruction) or 3D generation (e.g., synthesis, completion, and editing). However, these tasks are often tackled in isolation, resulting in fragmented architectures and representations that hinder knowledge transfer and holistic scene modeling. To address these challenges, we propose UniMesh, a unified framework that jointly learns 3D generation and understanding within a single architecture. First, we introduce a novel Mesh Head that acts as a cross-model interface, bridging diffusion-based image generation with implicit shape decoders. Second, we develop Chain of Mesh (CoM), a geometric instantiation of iterative reasoning that enables user-driven semantic mesh editing through a closed-loop latent, prompting, and regeneration cycle. Third, we incorporate a self-reflection mechanism based on an Actor-Evaluator-Self-reflection triad to diagnose and correct failures in high-level tasks like 3D captioning. Experimental results demonstrate that UniMesh not only achieves competitive performance on standard benchmarks but also unlocks novel capabilities in iterative editing and mutual enhancement between generation and understanding. Code: https://github.com/AIGeeksGroup/UniMesh. Website: https://aigeeksgroup.github.io/UniMesh.
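[188] Dual-Anchoring: Addressing State Drift in Vision-Language Navigation cs.CV | cs.AI
Kangyi Wu, Pengna Li, Kailin Lyu, Lin Zhao, Qingrong He
TL;DR: This paper proposes Dual-Anchoring, a framework that addresses the state drift agents are prone to in long-horizon vision-language navigation. Through Instruction Progress Anchoring and Memory Landmark Anchoring, the framework supervises the agent to explicitly distinguish completed sub-goals from remaining ones, and compels it to explicitly verify past observations so that it preserves representations of visited landmarks.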
Details
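Motivation: The goal is to address the aimless wandering and failure to execute key maneuvers that Video-LLMs exhibit in long-horizon vision-language navigation because of internal state drift. The authors attribute the failures to two cognitive deficits: progress drift and memory drift.
Result: Extensive experiments in simulation and real-world environments demonstrate the superiority of the method, with a 15.2% improvement in success rate and a notable 24.7% gain on long-horizon trajectories.
Insight: The contribution is to formulate the state drift problem explicitly, decompose it into progress and memory drift, and design a matching dual-anchoring supervision mechanism. At its core, structured text tokens anchor progress, and a landmark-centric world model performs retrospective prediction to anchor memory, giving the model explicit and interpretable intermediate supervision signals. The accompanying large-scale datasets are also valuable.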
Abstract: Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent’s internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent’s history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. Facilitating this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark data for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.
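[189] AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation cs.CV
Rongsheng Hu, Runwei Guan, Yicheng Di, Jiayu Bao, Yuan Liu
TL;DR: This paper proposes AutoVQA-G, a self-improving agentic framework for automatically generating visual question answering with grounding (VQA-G) annotations. It targets two problems in existing automated methods: inconsistent data caused by model hallucinations and brittle verification mechanisms based on simple heuristics. The framework progressively improves the quality of generated data through an iterative refinement loop comprising a Consistency Evaluation module and a memory-based Prompt Optimization agent.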
Details
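Motivation: Manually annotating high-quality VQA-G datasets is crucial for advancing vision-language models (VLMs) but does not scale, and existing automated methods suffer from two key problems: inconsistent data fidelity caused by model hallucinations and brittle verification mechanisms.
Result: Experiments show that, compared with leading multimodal LLMs, AutoVQA-G generates VQA-G datasets with better visual grounding accuracy, offering a promising approach to creating high-fidelity data for more robust VLM training and evaluation.
Insight: The contribution is a Consistency Evaluation module that uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification, together with a memory-based Prompt Optimization agent that analyzes critiques of failed samples to progressively refine generation prompts. Together they form a self-improving agentic framework that systematically raises the reliability and data quality of automated annotation.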
Abstract: Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is crucial for advancing vision-language models (VLMs), but remains unscalable. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations; (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop where a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G
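The iterative refinement loop can be sketched as a generate, evaluate, and update-prompt cycle with a critique memory. The code below is a schematic under stated assumptions: `generate`, `evaluate`, and `update_prompt` are hypothetical callables standing in for the generator, the Consistency Evaluation module, and the Prompt Optimization agent, and the stopping rule (stop when no sample is criticized) is an illustrative choice rather than the paper's.

```python
def refine_annotations(generate, evaluate, update_prompt, prompt, max_rounds=3):
    """Run a generate -> evaluate -> refine loop and return the final state.

    generate(prompt)               -> list of candidate samples
    evaluate(sample)               -> critique string, or None if the sample passes
    update_prompt(prompt, memory)  -> refined prompt built from accumulated critiques
    """
    memory = []    # critiques collected across rounds
    samples = []
    for _ in range(max_rounds):
        samples = generate(prompt)
        critiques = [c for c in (evaluate(s) for s in samples) if c is not None]
        if not critiques:
            break                        # every sample passed verification
        memory.extend(critiques)
        prompt = update_prompt(prompt, memory)
    return samples, prompt, memory
```

Keeping the critiques in a shared memory lets later prompt updates account for earlier failures instead of reacting only to the latest round.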
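[190] RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding cs.CV | cs.AI
Gaozhi Zhou, Hu He, Peng Shen, Jipeng Zhang, Liujue Zhang
TL;DR: This paper proposes RS-HyRe-R1, a hybrid reward framework that addresses the “perceptual inertia” that remote sensing vision-language models (RS-VLMs) develop under reinforcement learning post-training when handling complex remote sensing imagery. By introducing a spatial reasoning activation reward, a perception correctness reward, and a visual-semantic path evolution reward, the framework encourages more comprehensive and diverse mining of visual evidence. Experiments show that the method effectively mitigates perceptual inertia and achieves state-of-the-art performance on several tasks.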
Details
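Motivation: The goal is to address the “perceptual inertia” induced by RL post-training: when handling complex remote sensing imagery that requires exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference, which impedes complete evidence construction and flexible shifts of visual focus across tasks.
Result: With only 3B parameters, the method achieves state-of-the-art performance on remote sensing image understanding tasks (REC, OVD, VQA), outperforming models with up to 7B parameters. In zero-shot generalization, it surpasses the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC respectively.
Insight: The contribution is a hybrid reward mechanism designed to systematically overcome perceptual inertia in reinforcement learning. It comprises: 1) an activation reward for structured visual reasoning; 2) a perception correctness reward with adaptive quality anchors across tasks; 3) a visual-semantic path evolution reward that penalizes repetitive reasoning and encourages exploration of complementary cues. This offers a new direction for improving RL post-training of vision-language models.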
Abstract: Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias “perceptual inertia”. Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates “perceptual inertia”, encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox-lab/RS-HyRe-R1.
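A hybrid reward of this kind is typically a weighted combination of its component terms. The sketch below is purely illustrative: the abstract gives no formulas, so the distinct-step ratio used for the path evolution term, the equal default weights, and the names `path_evolution_reward` and `hybrid_reward` are all assumptions.

```python
def path_evolution_reward(steps):
    """Score a reasoning path by how few of its steps are repeats.

    Returns 1.0 when every step is distinct and decreases as steps repeat
    (an assumed stand-in for the paper's repetition penalty).
    """
    if not steps:
        return 0.0
    normalized = [step.strip().lower() for step in steps]
    return len(set(normalized)) / len(normalized)

def hybrid_reward(activation, correctness, steps, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward terms with (assumed) scalar weights."""
    w_activation, w_correctness, w_path = weights
    return (w_activation * activation
            + w_correctness * correctness
            + w_path * path_evolution_reward(steps))
```

A response that reaches the right answer by repeating the same observation earns less than one that gathers several complementary cues, which is the behavior the path term is meant to encourage.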
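[191] Dual Strategies for Test-Time Adaptation cs.CV
Nam Nguyen Phuong, Duc Nguyen The Minh, Phi Le Nguyen, Ehsan Abbasnejad, Minh Hoai
TL;DR: This paper proposes DualTTA, a framework that splits test samples into two groups, those with reliable predictions and those with unreliable predictions, and adapts the model by minimizing prediction entropy on the first group and maximizing it on the second, making fuller use of the test distribution and improving performance under distribution shift.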
Details
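Motivation: Conventional test-time adaptation methods use only a small number of low-entropy test samples and fail to exploit all the information in the test distribution.
Result: On multiple distribution-shift benchmarks, DualTTA achieves better performance, and its new reliability criterion separates reliable from unreliable samples more clearly.
Insight: The contribution is a dual adaptation strategy based on prediction reliability, together with a sample selection criterion that measures prediction stability under both semantic-preserving and semantic-altering transformations, with theoretical analysis showing provably more effective model updates.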
Abstract: Conventional test-time adaptation (TTA) approaches typically adapt the model using only a small fraction of test samples, often those with low-entropy predictions, thereby failing to fully leverage the available information in the test distribution. This paper introduces DualTTA, a novel framework that improves performance under distribution shifts by utilizing a larger and more diverse set of test samples. DualTTA identifies two distinct groups: one where the model’s predictions are likely consistent with the underlying semantics, and another where predictions are likely incorrect. For the first group, it minimizes prediction entropy to reinforce reliable decisions; for the second, it maximizes entropy to suppress overconfident errors and unlearn spurious behavior. These groups are adaptively selected using a new reliability criterion that measures prediction stability under both semantic-preserving and semantic-altering transformations, addressing the limitations of purely entropy-based selection. We further provide theoretical analysis and empirical justification showing that our approach enables a tighter separation between reliable and unreliable samples, in the context of their suitability for adaptation, leading to provably more effective model updates.
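The dual objective can be written as a signed entropy loss: entropy enters with a plus sign for reliable samples (so minimizing the loss sharpens them) and a minus sign for unreliable ones (so minimizing the loss flattens them). The sketch below assumes the reliable/unreliable split is already given as a boolean mask; the paper's stability-based reliability criterion is not reproduced, equal weighting of the two groups is an assumption, and `dual_entropy_loss` is a hypothetical name.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of each row of class probabilities."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=-1)

def dual_entropy_loss(probs, reliable_mask):
    """Loss to minimize: +entropy on reliable samples, -entropy on unreliable ones."""
    per_sample_entropy = entropy(probs)
    mask = np.asarray(reliable_mask, dtype=bool)
    signs = np.where(mask, 1.0, -1.0)
    return float(np.mean(signs * per_sample_entropy))
```

In a real adaptation loop this scalar would be computed on model outputs with an autodiff framework and backpropagated; NumPy is used here only to show the sign convention.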
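[192] UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models cs.CV
Hong Jiang, Wensong Song, Zongxing Yang, Ruijie Quan, Yi Yang
TL;DR: This paper proposes UniGeo, a framework that injects geometric guidance in a unified way at multiple levels (representation, architecture, loss function). It addresses the geometric drift and structural degradation caused by fragmented geometric guidance in existing camera-controllable image editing methods, and significantly improves visual quality and geometric consistency under continuous camera motion.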
Details
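Motivation: Existing camera-controllable image editing methods rely on fragmented geometric guidance (e.g., injecting point clouds only at the representation level) and are mainly built on image diffusion models that operate on discrete view mappings, which leads to geometric drift and structural degradation under continuous camera motion.
Result: Comprehensive experiments on multiple public benchmarks, covering both extensive and limited camera motion settings, show that UniGeo significantly outperforms existing methods in visual quality and geometric consistency.
Insight: The contribution is the systematic, unified injection of geometric guidance at three levels: representation, architecture, and loss function. Specifically, it uses a frame-decoupled geometric reference injection mechanism, geometric anchor attention, and a trajectory-endpoint geometric supervision strategy, which stabilize the generative model's understanding of geometry.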
Abstract: Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.
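[193] Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses cs.CV | eess.IV
Fan Yang, Changsoo Jung, Ryosuke Kawamura, Hon Yung Wong
TL;DR: This paper proposes a tool-free method for multi-camera extrinsic calibration, tailored to sports involving stick-like implements (e.g., golf clubs, bats, hockey sticks). The method exploits two complementary cues from synchronized multi-camera videos: human body keypoints with unknown metric scale and a rigid stick-like implement of known length. A three-stage optimization pipeline refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale through the stick-length constraint.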
Details
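Motivation: Multi-camera systems are widely used in sports to capture the 3D motion of athletes and equipment, but calibrating their extrinsic parameters remains costly and labor-intensive. The paper addresses this pain point with an efficient, tool-free calibration method for sports involving stick-like implements.
Result: Comprehensive experiments on the proposed dataset, the first for multi-camera self-calibration in stick-based sports (synthetic sequences across four sports categories with 3 to 10 cameras), show state-of-the-art performance with low rotation and translation errors.
Insight: The contribution is the joint use of human body keypoints (without scale information) and a rigid stick-like implement of known length as complementary cues. A three-stage optimization pipeline resolves the global scale problem in multi-camera extrinsic calibration and achieves accurate calibration without dedicated tools. Viewed objectively, the method combines human pose estimation with a known geometric constraint to give a practical and efficient solution for a specific application scenario.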
Abstract: Multi-camera systems are widely employed in sports to capture the 3D motion of athletes and equipment, yet calibrating their extrinsic parameters remains costly and labor-intensive. We introduce an efficient, tool-free method for multi-camera extrinsic calibration tailored to sports involving stick-like implements (e.g., golf clubs, bats, hockey sticks). Our approach jointly exploits two complementary cues from synchronized multi-camera videos: (i) human body keypoints with unknown metric scale and (ii) a rigid stick-like implement of known length. We formulate a three-stage optimization pipeline that refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale via the stick-length constraint. Our method achieves accurate extrinsic calibration without dedicated calibration tools. To benchmark this task, we present the first dataset for multi-camera self-calibration in stick-based sports, consisting of synthetic sequences across four sports categories with 3 to 10 cameras. Comprehensive experiments demonstrate that our method delivers SOTA performance, achieving low rotation and translation errors. Our project page: https://fandulu.github.io/sport_stick_multi_cam_calib/.
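The scale-resolution step can be illustrated on its own: a reconstruction that is correct only up to scale becomes metric once the reconstructed stick length is matched to the known one. This is a simplified sketch, not the paper's three-stage pipeline; using the median of per-frame lengths is an assumed robustness choice, and `resolve_global_scale` and `apply_scale` are hypothetical names.

```python
import numpy as np

def resolve_global_scale(stick_ends_a, stick_ends_b, known_length_m):
    """Return the factor that maps the up-to-scale reconstruction to metres.

    stick_ends_a, stick_ends_b: arrays of shape (num_frames, 3) with the two
    reconstructed stick endpoints per frame, in arbitrary units.
    """
    a = np.asarray(stick_ends_a, dtype=float)
    b = np.asarray(stick_ends_b, dtype=float)
    lengths = np.linalg.norm(a - b, axis=1)       # reconstructed length per frame
    return known_length_m / np.median(lengths)    # median damps outlier frames

def apply_scale(scale, points, camera_translations):
    """Scale 3D points and camera translations; camera rotations are unaffected."""
    return (scale * np.asarray(points, dtype=float),
            scale * np.asarray(camera_translations, dtype=float))
```

Only translations and 3D points change under a global rescaling, which is why a single known length is enough to fix the metric scale of the whole rig.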
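[194] PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation cs.CV | cs.AI
Yuanlong Wang, Weichi Chen, Adrian Rajab, Wenfang Liu, Yulan Jin
TL;DR: This paper proposes PBSBench, a multi-level vision-language framework and benchmark for interpreting peripheral blood smear (PBS) whole slide images (WSIs) in hematopathology. To address the poor generalization of existing multimodal large language models (MLLMs) to PBS images, the authors build PBSInstr, the first PBS vision-language dataset, and use it to develop PBS-VL, a hematopathology-specific vision-language model for interpretation at both the cell and slide levels.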
Details
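Motivation: Existing pathology MLLMs are developed mainly on solid-tissue WSIs and generalize poorly to PBS interpretation, whose visual characteristics and diagnostic reasoning both differ, so dedicated PBS datasets and models are needed to fill the gap.
Result: Experiments show that the purpose-built PBS-VL model outperforms existing general-purpose and pathology-specific MLLMs on the authors' PBSBench visual question answering benchmark (four question categories and six PBS interpretation tasks), demonstrating the value of PBS-specific data.
Insight: Contributions: 1) PBSInstr, the first multi-level (cell and slide) vision-language dataset for PBS interpretation, containing image-text pairs and QA pairs; 2) PBS-VL, a vision-language model tailored to hematopathology (PBS); 3) PBSBench, a comprehensive benchmark for evaluating PBS understanding. This lays the foundation for practical AI assistants that support hematopathology decision-making and underlines the importance of building dedicated datasets and models for specific medical imaging subdomains.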
Abstract: Peripheral Blood Smear (PBS) is a critical microscopic examination in hematopathology that yields whole-slide imaging (WSI). Unlike solid tissue pathology, PBS interpretation focuses on individual cell morphologies rather than tissue architecture, making it distinct in both visual characteristics and diagnostic reasoning. However, current multimodal large language models (MLLMs) for pathology are primarily developed on solid-tissue WSIs and struggle to generalize to PBS. To bridge this gap, we construct PBSInstr, the first vision-language dataset for PBS interpretation, comprising 353 PBS WSIs paired with microscopic impression paragraphs and 29k cell-level image crops annotated with cell type labels and morphological descriptions. To facilitate instruction tuning, PBSInstr further includes 27k question-answer (QA) pairs for cell crops and 1,286 QA pairs for PBS slides. Building upon PBSInstr, we develop PBS-VL, a hematopathology-tailored vision-language model for multi-level PBS interpretation at both cell and slide levels. To comprehensively evaluate PBS understanding, we construct PBSBench, a visual question answering (VQA) benchmark featuring four question categories and six PBS interpretation tasks. Experiments show that PBS-VL outperforms existing general-purpose and pathology MLLMs, underscoring the value of PBS-specific data. We release our code, datasets, and model weights to facilitate future research. Our proposed framework lays the foundation for developing practical AI assistants supporting decision-making in hematopathology.
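[195] DGSSM: Diffusion guided state-space models for multimodal salient object detection cs.CV | cs.AI | cs.LG
Suklav Ghosh, Arijit Sur, Pinaki Mitra
TL;DR: This paper proposes DGSSM, a diffusion-guided state space model framework for multimodal salient object detection. The framework combines the structural priors of diffusion models with multi-scale state space encoding, adaptive saliency prompting, and an iterative Mamba diffusion refinement mechanism to model long-range contextual dependencies and recover precise object boundaries.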
Details
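Motivation: Existing convolutional, Transformer-based, and Mamba-based state space models for salient object detection struggle to model long-range dependencies and recover fine boundaries at the same time. Diffusion models capture structural priors, but their use in discriminative dense prediction remains limited by computational cost and integration challenges.
Result: Extensive experiments on 13 public benchmarks across RGB, RGB-D, and RGB-T settings show that DGSSM consistently outperforms state-of-the-art methods on multiple evaluation metrics while keeping a compact model size.
Insight: The contribution is to formulate multimodal salient object detection as a progressive denoising process in which diffusion structural priors guide state space modeling, and to introduce adaptive saliency prompting, an iterative Mamba diffusion refinement mechanism, a boundary-aware refinement head, and a self-distillation strategy, providing an effective and generalizable paradigm for multimodal dense prediction.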
Abstract: Salient object detection (SOD) requires modeling both long-range contextual dependencies and fine-grained structural details, which remains challenging for convolutional, transformer-based, and Mamba-based state space models. While recent Mamba-based state space approaches enable efficient global reasoning, they often struggle to recover precise object boundaries. In contrast, diffusion models capture strong structural priors through iterative denoising, but their use in discriminative dense prediction is still limited due to computational cost and integration challenges. In this work, we propose DGSSM, a diffusion-guided state space (Mamba) framework that formulates multimodal salient object detection as a progressive denoising process. The framework integrates diffusion structural priors with multi-scale state space encoding, adaptive saliency prompting, and an iterative Mamba diffusion refinement mechanism to improve boundary accuracy. A boundary-aware refinement head and self-distillation strategy further enhance spatial coherence and feature consistency. Extensive experiments on 13 public benchmarks across RGB, RGB-D, and RGB-T settings demonstrate that DGSSM consistently outperforms state-of-the-art methods across multiple evaluation metrics while maintaining a compact model size. These results suggest that diffusion-guided state space modeling is an effective and generalizable paradigm for multimodal dense prediction tasks.
[196] ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes cs.CV | cs.GR
Honglin Chen, Karran Pandey, Rundi Wu, Matheus Gadelha, Yannick Hold-Geoffroy
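TL;DR: This paper proposes ViPS (Video-informed Pose Spaces), a feed-forward framework that discovers the latent distribution of valid joint configurations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Rather than relying on scarce artist-authored 4D datasets, it transfers generative video priors into a universal distribution over a given rig parameterization and uses differentiable geometric validators to enforce asset-specific validity.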
Details
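Motivation: Kinematic rigs lack an inherent representation of the manifold of plausible joint configurations for a given asset, so stochastic sampling or manual manipulation of raw rig parameters leads to semantic or geometric violations such as anatomical hyperextension and non-physical self-intersections.
Result: Evaluations show that ViPS, trained solely on video priors, matches state-of-the-art (SOTA) methods trained on synthetic artist-created 4D data in both plausibility and diversity. More importantly, as a universal model, ViPS shows robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.
Insight: The novelty lies in using the generative priors of a pretrained video diffusion model to build a 3D pose space, closing the loop between 2D generative priors and structured 3D kinematic control; differentiable geometric validators enforce validity automatically, without manual regularizers; the learned pose space is smooth, compact, and controllable, supporting diverse sampling, manifold projection (for inverse kinematics), and temporally coherent trajectories (for keyframe animation).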
Abstract: Kinematic rigs provide a structured interface for articulating 3D meshes, but they lack an inherent representation of the plausible manifold of joint configurations for a given asset. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters often leads to semantic or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feed-forward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce artist-authored 4D datasets, ViPS transfers generative video priors into a universal distribution over a given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce asset-specific validity without requiring manual regularizers. Our model learns a smooth, compact, and controllable pose space that supports diverse sampling, manifold projection for inverse kinematics, and temporally coherent trajectories for keyframing. Furthermore, the distilled 3D pose samples serve as precise semantic proxies for guiding video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely on video priors, matches the performance of state-of-the-art methods trained on synthetic artist-created 4D data in both plausibility and diversity. Most importantly, as a universal model, ViPS demonstrates robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.
[197] FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation cs.CV
Hovhannes Margaryan, Quentin Bammey, Christian Sandor
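TL;DR: This paper proposes FlowC2S, a method for fast and memory-efficient video continuation. It fine-tunes a pretrained text-to-video flow model to learn a vector field between the current and succeeding video chunks, generating the succeeding frames directly and avoiding the conventional step of combining current frames with noise, which halves the model’s input dimensionality.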
Details
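Motivation: Existing video continuation methods combine current frames with noise during generation, which raises computational complexity and memory consumption; the goal is faster, more efficient video generation.
Result: In quantitative evaluations with FID and FVD, FlowC2S surpasses the state of the art (SOTA) with as few as five neural function evaluations.
Insight: The novelties are inherent optimal couplings (using temporally adjacent video chunks as a proxy for true optimal couplings, yielding straighter flows) and target inversion (injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity); flowing directly between chunks reduces input dimensionality and improves efficiency.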
Abstract: This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.
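As an illustration of the sampling idea in the abstract above, the following minimal Python sketch integrates a vector field from a current-chunk latent toward a succeeding-chunk latent with five Euler steps. Everything in it is an assumption made for illustration: `euler_flow`, `toy_field`, and the three-element latents are hypothetical stand-ins, the toy field is a perfectly straight flow rather than the fine-tuned LTXV or Wan network, and the abstract does not state which solver the authors use.

```python
def euler_flow(x_current, vector_field, num_steps=5):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    Each step calls the vector field once, so num_steps equals the
    number of function evaluations.
    """
    x = list(x_current)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical latents for the current and the succeeding video chunk.
current_latent = [0.0, 1.0, -2.0]
next_latent = [1.0, 3.0, 0.5]


def toy_field(x, t):
    """Toy stand-in for a learned network: a constant, straight-line velocity."""
    return [b - a for a, b in zip(current_latent, next_latent)]


predicted = euler_flow(current_latent, toy_field, num_steps=5)
print(predicted)
```

Because the toy flow is exactly straight, five steps land on the target up to floating-point error, which is the intuition behind preferring straighter flows when the evaluation budget is small.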
[198] BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs cs.CV
Mainak Singha, Tanisha Gupta, Ankit Jha, Muhammad Haris Khan, Sayantani Ghosh
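TL;DR: BioVLM is a prompt-learning framework for biomedical vision-language models that aims to improve cross-domain generalization on challenging modalities. It dynamically selects the most discriminative prompts, coupling sparse few-shot evidence with rich LLM semantic priors, and chooses modality-appropriate prompts at test time, enabling transfer to new categories and domains while keeping training lightweight and inference efficient.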
Details
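Motivation: Pretrained biomedical vision-language models degrade on challenging modalities with small inter-class margins and pronounced acquisition-specific variations, especially under few-shot supervision and when modality priors differ substantially from the pretraining corpora.
Result: On 11 MedMNIST+ 2D datasets, BioVLM achieves a new state of the art across three distinct generalization settings.
Insight: The novelty is a dynamic prompt selection mechanism that picks the most discriminative prompts via a low-entropy criterion, combined with distillation of LLM semantic priors and an augmentation-consistency constraint, achieving efficient cross-domain generalization without extensive backbone fine-tuning.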
Abstract: Pretrained biomedical vision-language models (VLMs) such as BioMedCLIP perform well on average but often degrade on challenging modalities where inter-class margins are small and acquisition-specific variations are pronounced, especially under few-shot supervision and when modality priors differ from pretraining corpora substantially. We propose BioVLM, a prompt-learning framework that improves cross-domain generalization without extensive backbone fine-tuning. BioVLM learns a diverse prompt bank and introduces dynamic prompt selection: for each input, it selects the most discriminative prompts via a low-entropy criterion on the predictive distribution, effectively coupling sparse few-shot evidence with rich LLM semantic priors. To strengthen this coupling, we distill high-confidence LLM-derived attributes and enforce robust knowledge transfer through strong/weak augmentation consistency. At test time, BioVLM adapts by choosing modality-appropriate prompts, enabling transfer to unseen categories and domains, while keeping training lightweight and inference efficient. On 11 MedMNIST+ 2D datasets, BioVLM achieves new state of the art across three distinct generalization settings. Codes are available at https://github.com/mainaksingha01/BioVLM.
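The low-entropy selection rule mentioned in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BioVLM implementation: `entropy`, `select_prompts`, and `ensemble` are hypothetical helper names, the probabilities are made up, and averaging the selected prompts' predictions is an assumed way of combining them that the abstract does not specify.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_prompts(per_prompt_probs, k):
    """Return the indices of the k prompts with the lowest predictive entropy."""
    ranked = sorted(range(len(per_prompt_probs)),
                    key=lambda i: entropy(per_prompt_probs[i]))
    return ranked[:k]


def ensemble(per_prompt_probs, selected):
    """Average the class distributions of the selected prompts (assumed rule)."""
    n_classes = len(per_prompt_probs[0])
    return [sum(per_prompt_probs[i][c] for i in selected) / len(selected)
            for c in range(n_classes)]


# Made-up class distributions produced by three prompts for one input image.
prompt_probs = [
    [0.34, 0.33, 0.33],  # near-uniform: an uninformative prompt
    [0.90, 0.05, 0.05],  # confident prompt
    [0.70, 0.20, 0.10],  # moderately confident prompt
]

chosen = select_prompts(prompt_probs, k=2)
fused = ensemble(prompt_probs, chosen)
print(chosen, fused)
```

The near-uniform prompt has the highest entropy and is dropped, so the fused prediction is driven by the two prompts that discriminate most sharply for this input.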
[199] Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception cs.CV | cs.RO
Siyuan Meng, Chengbo Ai
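TL;DR: This paper proposes a new paradigm, Infrastructure-centric World Models (I-WM), which uses the bird’s-eye, multi-sensor, persistent viewpoint unique to roadside perception systems to simulate and predict how traffic environments evolve. It lays out a three-phase vision and a dual-layer architecture, and introduces an Infrastructure Vision-Language-Action model (I-VLA) as a unifying framework.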
Details
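Motivation: Existing world models for autonomous driving are all ego-vehicle-centric and overlook the unique viewpoint of roadside infrastructure. The paper argues that the temporal depth of roadside systems (long-term behavioral distributions) complements the spatial breadth of vehicle-borne systems (scene sampling over large areas), so infrastructure-centric world models offer a fundamentally complementary capability.
Result: The abstract reports no quantitative experimental results or benchmark figures; the paper instead presents a vision framework and an implementation path.
Insight: The core novelty is the first systematic vision for building world models from the infrastructure viewpoint, emphasizing the unique value of roadside perception in temporal depth, together with a phased technical path (quality-aware uncertainty propagation, physics-informed predictive dynamics, and V2X latent space alignment) and the I-VLA architecture that unifies perception, language commands, and traffic control actions.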
Abstract: World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt an ego-vehicle perspective, leaving the infrastructure viewpoint unexplored. We argue that infrastructure-centric world models offer a fundamentally complementary capability: the bird’s-eye, multi-sensor, persistent viewpoint that roadside systems uniquely possess. Central to our thesis is a spatio-temporal complementarity: fixed roadside sensors excel at temporal depth, accumulating long-term behavioral distributions including rare safety-critical events, while vehicle-borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure-centric World Models (I-WM) in three phases: (I) generative scene understanding with quality-aware uncertainty propagation, (II) physics-informed predictive dynamics with multi-agent counterfactual reasoning, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual-layer architecture, annotation-free perception as a multi-modal data engine feeding end-to-end generative world models, with a phased sensor strategy from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I-WM relative to LeCun’s JEPA, Li Fei-Fei’s spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I-VLA) as a novel unification of roadside perception, language commands, and traffic control actions. Our vision builds upon existing multi-LiDAR pipelines and identifies open-source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.
[200] Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation cs.CV
Jiawen Duan, Jian Xiang, Zhiqiang Li, Linlin Xue, Wan Xiang
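TL;DR: This paper proposes MixTGFormer, a dual-stream spatio-temporal GCN-Transformer network for lifting 2D poses to 3D human pose estimates. It models the spatial and temporal relationships of the human skeleton simultaneously through two parallel channels and uses a Mixformer module that integrates graph convolutional networks to fuse global and local features effectively.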
Details
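Motivation: Existing Transformer-based 2D-to-3D pose estimation methods focus mainly on modeling global spatio-temporal relationships while neglecting local skeletal relationships and the information interaction between channels, so a method that exploits both global and local information is needed.
Result: Evaluated extensively on the Human3.6M and MPI-INF-3DHP benchmarks, MixTGFormer achieves state-of-the-art (SOTA) results, with P1 errors of 37.6 mm and 15.7 mm, respectively.
Insight: The novelty lies in the dual-stream architecture and the Mixformer module, which integrates GCN into the Transformer to improve the use of local and global information; spatial and temporal variants of the block, plus an SE layer that supplements the fused information, yield effective fusion of global and local features.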
Abstract: 3D human pose estimation is a classic and important research direction in the field of computer vision. In recent years, Transformer-based methods have made significant progress in lifting 2D to 3D human pose estimation. However, these methods primarily focus on modeling global temporal and spatial relationships, neglecting local skeletal relationships and the information interaction between different channels. Therefore, we propose a novel method, the Dual-stream Spatio-temporal GCN-Transformer Network (MixTGFormer). This method models the spatial and temporal relationships of human skeletons simultaneously through two parallel channels, achieving effective fusion of global and local features. The core of MixTGFormer is composed of stacked Mixformers. Specifically, the Mixformer includes the Mixformer Block and the Squeeze-and-Excitation Layer (SE Layer). It first extracts and fuses various information of human skeletons through two parallel Mixformer Blocks with different modes. Then, it further supplements the fused information through the SE Layer. The Mixformer Block integrates Graph Convolutional Networks (GCN) into the Transformer, enhancing both local and global information utilization. Additionally, we implement its temporal and spatial forms to extract both spatial and temporal relationships. We extensively evaluated our model on two benchmark datasets (Human3.6M and MPI-INF-3DHP). The experimental results showed that, compared to other methods, our MixTGFormer achieved state-of-the-art results, with P1 errors of 37.6mm and 15.7mm on these datasets, respectively.
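For readers unfamiliar with the metric, "P1 error" on Human3.6M conventionally denotes the mean per-joint position error (MPJPE) under Protocol 1; the sketch below assumes that reading, which the abstract itself does not spell out. The function name `mpjpe` and the two-joint example values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def mpjpe(predicted, target):
    """Mean Euclidean distance between corresponding 3D joints (same unit as the input)."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must contain the same number of joints")
    distances = [math.dist(p, g) for p, g in zip(predicted, target)]
    return sum(distances) / len(distances)


# Made-up two-joint skeleton, coordinates in millimetres.
pred_joints = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
true_joints = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

error_mm = mpjpe(pred_joints, true_joints)
print(error_mm)
```

Here the first joint is exact and the second is off by 5 mm, so the mean error is 2.5 mm. In practice the value is averaged over all joints and all frames of the test set.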
[201] Dynamic Visual-semantic Alignment for Zero-shot Learning with Ambiguous Labels cs.CV
Jiangnan Li, Linqing Huang, Xiaowen Yan, Min Gan, Wenpeng Lu
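TL;DR: This paper proposes the Dynamic Visual-semantic Alignment (DVSA) framework to address label noise and ambiguity in zero-shot learning (ZSL). It strengthens discriminative semantic attributes through a bidirectional visual-semantic alignment module and mutual-information-based contrastive optimization, and uses a dynamic label disambiguation mechanism to iteratively correct noisy supervision, improving generalization under ambiguous labels.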
Details
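Motivation: Existing zero-shot learning methods usually assume clean labels and overlook real-world label noise and ambiguity, which degrades performance. The paper aims to design a robust ZSL framework that handles ambiguous labels.
Result: Extensive experiments on standard benchmarks show that DVSA achieves stronger performance under ambiguous supervision.
Insight: The novelties are a bidirectional visual-semantic alignment module that mutually calibrates visual features and attribute prototypes; attribute-level contrastive optimization grounded in mutual information to strengthen semantic consistency; and a dynamic label disambiguation mechanism that iteratively corrects noisy labels while preserving semantic consistency, narrowing the instance-label gap.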
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes without visual instances. However, existing methods usually assume clean labels, overlooking real-world label noise and ambiguity, which degrades performance. To bridge this gap, we propose the Dynamic Visual-semantic Alignment (DVSA), a robust ZSL framework for learning from ambiguous labels. DVSA uses a bidirectional visual-semantic alignment module with attention to mutually calibrate visual features and attribute prototypes, and a contrastive optimization grounded in Mutual Information (MI) at the attribute level to strengthen discriminative, semantically consistent attributes. In addition, a dynamic label disambiguation mechanism iteratively corrects noisy supervision while preserving semantic consistency, narrowing the instance-label gap, and improving generalization. Extensive experiments on standard benchmarks verify that DVSA achieves stronger performance under ambiguous supervision.
[202] Source-Free Domain Adaptation with Vision-Language Prior cs.CV
Song Tang, Yunxiang Bai, Wenxin Su, Mao Ye, Jianwei Zhang
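TL;DR: This paper proposes DIFO++, a new method for source-free domain adaptation (SFDA). It draws on the rich knowledge of off-the-shelf vision-language (ViL) multimodal models such as CLIP, alternating between prompt learning and knowledge distillation and focusing on reducing the "gap region" in the target domain to produce more reliable pseudo-labels and achieve semantic alignment.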
Details
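Motivation: Conventional SFDA methods rely on pseudo-labeling or auxiliary supervision and are error-prone. The paper is the first to explore the heterogeneous knowledge of off-the-shelf ViL models such as CLIP to mitigate this limitation, but finds that direct zero-shot application is unsatisfactory, so the ViL model must be made task-specific.
Result: Extensive experiments show that DIFO++ significantly outperforms existing state-of-the-art methods on multiple benchmarks.
Insight: The novelty is a progressive knowledge adaptation framework that alternates between customizing the ViL model (maximizing mutual information with the target model via prompt learning) and knowledge distillation (centered on gap region reduction), along with memory-supported reliable pseudo-label generation and a semantic alignment and entropy minimization strategy based on category attention and predictive consistency.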
Abstract: Source-Free Domain Adaptation (SFDA) seeks to adapt a source model, which is pre-trained on a supervised source domain, for a target domain, with only access to unlabeled target training data. Relying on pseudo labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we explore, for the first time, the potential of off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP), whose knowledge is rich yet heterogeneous. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is not specialized for this particular task but largely generic. To make it task-specific, we propose a novel DIFO++ approach. Specifically, DIFO++ alternates between two steps during adaptation: (i) customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner, and (ii) distilling the knowledge of this customized ViL model to the target model, centering on gap region reduction. During progressive knowledge adaptation, we first identify and focus on the gap region, where enclosed features are entangled and class-ambiguous, as it often captures richer task-specific semantics. Reliable pseudo-labels are then generated by fusing predictions from the target and ViL models, supported by a memory mechanism. Finally, gap region reduction is guided by category attention and predictive consistency for semantic alignment, complemented by referenced entropy minimization to suppress uncertainty. Extensive experiments show that DIFO++ significantly outperforms the state-of-the-art alternatives. Our code and data are available at https://github.com/tntek/DIFO-Plus.
[203] Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos cs.CV
Mengmeng Ge, Takashi Isobe, Xu Jia, Yanan Sun, Zetong Yang
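TL;DR: This paper proposes the EgoIn framework for generating object state transitions from an egocentric viewpoint (EIVST), i.e., producing a sequence of intermediate transition frames from an initial state, a target state, and a brief action instruction. The framework infers the multi-step transition process with a fine-tuned TransitionVLM and uses a Transition Conditioning module and Object-aware Auxiliary Supervision to generate semantically plausible and visually consistent transition sequences.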
Details
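Motivation: Understanding physical transformation processes from an egocentric viewpoint is key to bridging human and machine action modeling, but existing generative models struggle to simultaneously understand the initial and target states, reason about the transformation steps, and generate an intermediate transition that follows the instruction while keeping object appearance consistent.
Result: Extensive experiments on human-object and robot-object interaction datasets show that EgoIn generates semantically meaningful and visually coherent transition sequences and outperforms existing methods.
Insight: The novelties are defining the EIVST task, proposing the EgoIn framework that couples a fine-tuned vision-language model (TransitionVLM) with a conditional generation module, and introducing Object-aware Auxiliary Supervision to keep object appearance consistent, giving a systematic solution for state-transition generation in egocentric video.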
Abstract: Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn’s superior performance in generating semantically meaningful and visually coherent transformation sequences.
[204] Subject-Aware Multi-Granularity Alignment for Zero-Shot EEG-to-Image Retrieval cs.CV
Lin Jiang, Qingshan She, Jiale Xu, Haiqi Xu, Duanpo Wu
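TL;DR: This paper proposes SAMGA (Subject-Aware Multi-Granularity Alignment), a framework for zero-shot EEG-to-image retrieval. It builds a subject-aware visual supervision target by adaptively aggregating multi-level intermediate representations from a pretrained vision encoder and applies a coarse-to-fine cross-modal alignment strategy to better match the multi-scale representations of visually evoked EEG signals and inter-subject differences.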
Details
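Motivation: Existing methods usually rely on a single fixed visual target or a subject-invariant target construction scheme, overlooking two key properties of visually evoked EEG signals: they preserve information across multiple representational scales, and the visual granularity that best matches EEG may vary across subjects.
Result: On the THINGS-EEG benchmark, the method reaches 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting and 34.4% Top-1 and 64.8% Top-5 accuracy in the inter-subject setting, surpassing current state-of-the-art methods.
Insight: The novelty is subject-aware multi-granularity visual target construction combined with coarse-to-fine cross-modal alignment, which lets the model absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference, handling the multi-scale nature of EEG signals and inter-subject differences more robustly.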
Abstract: Zero-shot EEG-to-image retrieval aims to decode perceived visual content from electroencephalography (EEG) by aligning neural responses with pretrained visual representations, providing a promising route toward scalable visual neural decoding and practical brain-computer interfaces. However, robust EEG-to-image retrieval remains challenging, because prior methods usually rely on either a single fixed visual target or a subject-invariant target construction scheme. Such designs overlook two important properties of visually evoked EEG signals: they preserve information across multiple representational scales, and the visual granularity best matched to EEG may vary across subjects. To address these issues, a subject-aware multi-granularity alignment (SAMGA) framework is proposed for zero-shot EEG-to-image retrieval. SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder, wherein the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting, and 34.4% Top-1 and 64.8% Top-5 accuracy in the inter-subject setting, outperforming recent state-of-the-art methods.
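The Top-1 and Top-5 figures quoted above are retrieval accuracies. A minimal Python sketch of how such a Top-k score is typically computed from a query-by-candidate similarity matrix follows; it assumes, purely for illustration, that the correct image for EEG query `i` sits at candidate index `i`. The function name `top_k_accuracy` and the 3x3 similarity values are hypothetical and not taken from the paper.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose matching candidate ranks among the k most similar.

    similarity[i][j] is the score between query i and candidate j;
    the matching candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Made-up similarity scores for three EEG queries against three images.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct image ranked first
    [0.2, 0.4, 0.8],  # query 1: correct image ranked second
    [0.1, 0.2, 0.7],  # query 2: correct image ranked first
]

top1 = top_k_accuracy(sim, k=1)
top2 = top_k_accuracy(sim, k=2)
print(top1, top2)
```

With these toy scores two of three queries are retrieved at rank one, and all three fall within the top two, which is why Top-5 is always at least as high as Top-1 on the same run.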
[205] Weakly-Supervised Referring Video Object Segmentation through Text Supervision cs.CV
Miaojing Shi, Jun Huang, Zijie Yue, Hanli Wang
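TL;DR: This paper proposes WSRVOS, a novel weakly-supervised referring video object segmentation method that uses only text expressions as supervision, with no pixel-level masks, bounding boxes, or point annotations. It uses a multimodal large language model to generate positive and negative expressions, performs fine-grained vision-language feature alignment and interaction, and trains the model with instance-aware classification, high-quality pseudo-mask generation, and a temporal segment ranking constraint. Experiments on several public datasets demonstrate its superiority.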
Details
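Motivation: Conventional fully supervised RVOS methods depend on expensive pixel-level annotations, and existing weakly-supervised methods still require bounding boxes or points; the goal is to train the model with text expressions alone, the weakest form of supervision.
Result: Extensive experiments on four public RVOS datasets (A2D Sentences, J-HMDB Sentences, Ref-YouTube-VOS, and Ref-DAVIS17) demonstrate the superiority of the method.
Insight: The novelties are contrastive referring expression augmentation that leverages the captioning ability of a multimodal LLM; bidirectional vision-language feature selection and interaction for fine-grained alignment; instance-aware expression classification; a positive-prediction fusion strategy for generating high-quality pseudo-masks; and a temporal segment ranking constraint. Together they enable effective training with text-only supervision.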
Abstract: Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred by a text expression. Conventional approaches are mostly supervised learning, requiring expensive pixel-level mask annotations. To tackle it, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are however still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the model with only text expressions. Given an input video and the referring expression, we first design a contrastive referring expression augmentation scheme that leverages the captioning capabilities of a multimodal large language model to generate both positive and negative expressions. We extract visual and linguistic features from the input video and generated expressions, then perform bi-directional vision-language feature selection and interaction to enable fine-grained multimodal alignment. Next, we propose an instance-aware expression classification scheme to optimize the model in distinguishing positive from negative expressions. Also, we introduce a positive-prediction fusion strategy to generate high-quality pseudo-masks, which serve as additional supervision to the model. Last, we design a temporal segment ranking constraint such that the overlaps between mask predictions of temporally neighboring frames are required to conform to specific orders. Extensive experiments on four publicly available RVOS datasets, including A2D Sentences, J-HMDB Sentences, Ref-YouTube-VOS, and Ref-DAVIS17, demonstrate the superiority of our method. Code is available at https://github.com/viscom-tongji/WSRVOS.
[206] View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity cs.CV
Pufan Li, Bi’an Du, Shenghe Zheng, Junyi Yao, Wei Hu
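TL;DR: This paper proposes a view-consistent 3D scene editing framework built on a dual-path consistency mechanism and semantic continuity. It introduces cross-view dependencies to mitigate inconsistency in multi-view editing and constructs a paired multi-view editing dataset for supervised learning.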
Details
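Motivation: Existing render-edit-optimize 3D scene editing methods are bottlenecked by cross-view inconsistency; although prior methods introduce geometric cues, cross-view interactions, or video priors, they still rely on inference-time synchronization and are limited in robustness and generalization. The paper revisits multi-view consistent editing from a distribution modeling perspective, arguing that 3D editing is essentially joint distribution modeling across viewpoints.
Result: Extensive experiments show that the method achieves precise, view-consistent editing in complex scenes and outperforms existing methods.
Insight: The novelties are: 1) recasting multi-view consistent editing from a distributional perspective; 2) a dual-path consistency mechanism that combines projection-guided structural guidance with patch-level semantic propagation to handle structural correspondence and semantic continuity, respectively; 3) a paired multi-view editing dataset that provides reliable supervision for learning cross-view consistency in edited scenes.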
Abstract: Text-driven 3D scene editing has recently attracted increasing attention. Most existing methods follow a render-edit-optimize pipeline, where multi-view images are rendered from a 3D scene, edited with 2D image editors, and then used to optimize the underlying 3D representation. However, cross-view inconsistency remains a major bottleneck. Although recent methods introduce geometric cues, cross-view interactions, or video priors to mitigate this issue, they still largely rely on inference-time synchronization and thus remain limited in robustness and generalization. In this work, we recast multi-view consistent 3D editing from a distributional perspective: 3D scene editing essentially requires a joint distribution modeling across viewpoints. Based on this insight, we propose a view-consistent 3D editing framework that explicitly introduces cross-view dependencies into the editing process. Furthermore, motivated by the observation that structural correspondence and semantic continuity rely on different cross-view cues, we introduce a dual-path consistency mechanism consisting of projection-guided structural guidance and patch-level semantic propagation for effective cross-view editing. Further, we construct a paired multi-view editing dataset that provides reliable supervision for learning cross-view consistency in edited scenes. Extensive experiments demonstrate that our method achieves superior editing performance with precise and consistent views for complex scenes.
[207] Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement cs.CV | cs.RO
Jiakun Zheng, Ting Xiao, Shiqin Cao, Xinran Li, Zhe Wang
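TL;DR: This paper proposes the Re$^2$MoGen framework to address the performance drop of text-to-motion generation models on text descriptions that differ substantially from their training data. The framework produces an initial motion plan through enhanced large language model reasoning and refines it with physics-aware reinforcement learning post-training, generating open-vocabulary motions that are semantically consistent and physically plausible.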
Details
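Motivation: Existing text-to-motion generation models perform well within their training distribution but degrade markedly when the text description differs significantly from the training texts. The paper targets this open-vocabulary motion generation problem to improve generalization to diverse text descriptions.
Result: Extensive experiments show that the framework generates semantically consistent and physically plausible motions and achieves state-of-the-art performance in open-vocabulary motion generation.
Insight: The novelty is combining enhanced LLM reasoning (via Monte Carlo tree search) for keyframe planning with a staged refinement pipeline (optimization based on a human pose prior, followed by physics-aware reinforcement learning post-training), achieving end-to-end generation of high-quality motion from text with particular attention to physical plausibility.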
Abstract: Text-to-motion (T2M) generation aims to control the behavior of a target character via textual descriptions. Leveraging text-motion paired datasets, existing T2M models have achieved impressive performance in generating high-quality motions within the distribution of their training data. However, their performance deteriorates notably when the motion descriptions differ significantly from the training texts. To address this issue, we propose Re$^2$MoGen, a Reasoning and Refinement open-vocabulary Motion Generation framework that leverages enhanced Large Language Model (LLM) reasoning to generate an initial motion plan and then refines its physical plausibility via reinforcement learning (RL) post-training. Specifically, Re$^2$MoGen consists of three stages. First, we employ Monte Carlo tree search to enhance the LLM’s reasoning ability in generating reasonable keyframes of the motion based on text prompts, specifying only the root and several key joints’ positions to ease the reasoning process. Then, we apply a human pose model as a prior to optimize the full-body poses based on the planned keyframes and use the resulting incomplete motion to supervise fine-tuning a pre-trained motion generator via a dynamic temporal matching objective, enabling spatiotemporal completion. Finally, we use post-training with a physics-aware reward to refine motion quality and eliminate physical implausibility in LLM-planned motions. Extensive experiments demonstrate that our framework can generate semantically consistent and physically plausible motions and achieve state-of-the-art performance in open-vocabulary motion generation.
[208] AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion cs.CV
Hongjie Li, Heng Yu, Jiaman Li, Hong-Xing Yu, Ehsan Adeli
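TL;DR: This paper proposes AnyLift, a two-stage framework that uses 2D diffusion models to reconstruct 3D human motion and human-object interaction (HOI) from Internet videos. The first stage extracts 2D keypoints from videos and synthesizes multi-view 2D motion data covering motion types that are rare in existing MoCap datasets; the second stage trains a camera-conditioned multi-view 2D motion diffusion model to recover 3D motion and HOI in world space.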
Details
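Motivation: Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D.
Result: The method is validated on Internet videos featuring challenging motions such as gymnastics and on in-the-wild HOI videos, where it outperforms prior work in producing realistic human motion and human-object interaction, reaching SOTA level.
Insight: The novelty is pairing a 2D diffusion model with 2D keypoints from Internet videos and synthesizing multi-view 2D motion to increase data diversity, which handles rare motion types and complex human-object interactions and yields high-quality 3D motion reconstruction from monocular video.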
Abstract: Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.
[209] UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement cs.CV
Jingwei Yang, Ruoxi Wu, Wei Shen, Meng Li, Yulong Liu
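TL;DR: UniCSG is a unified framework for content-constrained, style-driven generation in both text-guided and reference-guided settings. It tackles the content-style entanglement common in DiT-based diffusion models through staged training, with a latent-space semantic disentanglement stage and a frequency-aware detail reconstruction stage, and adds pixel-space reward learning to improve perceptual quality after decoding.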
Details
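Motivation: DiT-based diffusion models commonly suffer from content-style entanglement in style transfer, such as reference-content leakage and unstable generation; the goal is to match the target style while preserving content semantics.
Result: Experiments show improved content faithfulness, style alignment, and robustness in both text-guided and reference-guided settings.
Insight: The novelties are the staged training strategy (semantic disentanglement followed by frequency-aware reconstruction) and pixel-space reward learning that aligns latent objectives with perceptual quality, achieving high-fidelity disentanglement of content and style.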
Abstract: Style transfer must match a target style while preserving content semantics. DiT-based diffusion models often suffer from content-style entanglement, leading to reference-content leakage and unstable generation. We present UniCSG, a unified framework for content-constrained, style-driven generation in both text-guided and reference-guided settings. UniCSG employs staged training: (i) a latent-space semantic disentanglement stage that combines low-frequency preprocessing with conditioning corruption to encourage content-style separation, and (ii) a latent-space frequency-aware detail reconstruction stage that refines details via multi-scale frequency supervision. We further incorporate pixel-space reward learning to align latent objectives with perceptual quality after decoding. Experiments demonstrate improved content faithfulness, style alignment, and robustness in both settings.
[210] PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation cs.CV
Masaharu Miyazaki, Yurie Otake, Koichi Ito, Wataru Makino, Jotaro Urabe
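TL;DR: This paper proposes the PlankFormer framework for plankton instance segmentation in crowded images. It generates Pseudo Community Images (PCI) to ease the scarcity of pixel-level annotations and pairs an MAE self-supervised pretrained Vision Transformer backbone with a Mask2Former decoder to improve robustness to occlusion and debris. Experiments show that it significantly outperforms conventional methods such as Mask R-CNN on real-world datasets, especially in challenging debris-dense environments.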
Details
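Motivation: Plankton monitoring depends on manual microscopic analysis, which is inefficient; automated segmentation faces two main challenges: the scarcity of pixel-level annotated data, and the difficulty conventional CNN methods have in distinguishing plankton from debris and overlapping individuals.
Result: Experiments on real-world datasets show that the method significantly outperforms conventional methods such as Mask R-CNN, reaching state-of-the-art (SOTA) level particularly in challenging environments with high debris density.
Insight: The novelties are a Pseudo Community Image (PCI) generation method for expanding labeled data; an MAE self-supervised pretrained ViT backbone that exploits global structural features for robustness to occlusion and debris; and an overall framework that reduces reliance on manual annotation of individual images while achieving high-precision segmentation.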
Abstract: Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial; however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton against occlusion and debris, we employ a Masked Autoencoder (MAE) for self-supervised pre-training on unlabeled individual images. Experimental results on real-world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R-CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE-based architecture enable high-precision segmentation while requiring fewer manual annotations for individual plankton images.
[211] Sharpening Lightweight Models for Generalized Polyp Segmentation: A Boundary Guided Distillation from Foundation Models cs.CV
Shivanshu Agnihotri, Snehashis Majhi, Deepak Ranjan Nayak
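TL;DR: This paper proposes LiteBounD, a lightweight boundary-guided distillation framework for improving polyp segmentation models. It extracts complementary semantic and structural priors from multiple vision foundation models and distills them into compact segmentation backbones, addressing the shortcomings of existing lightweight models on complex polyp regions while retaining the efficiency needed for real-time clinical use.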
Details
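Motivation: Automated polyp segmentation is critical for early colorectal cancer detection but is challenged by weak boundaries, large appearance variations, and limited annotated data. Lightweight segmentation models such as U-Net are efficient but capture too little semantic and structural information, while large vision foundation models such as SAM generalize well but suffer from domain mismatch, insufficient boundary sensitivity, and high computational cost, making them hard to apply directly to polyp segmentation. The paper aims to bridge this gap.
Result: Extensive experiments on seen datasets (Kvasir-SEG, CVC-ClinicDB) and unseen datasets (ColonDB, CVC-300, ETIS) show that LiteBounD significantly outperforms its lightweight baselines and is competitive with state-of-the-art (SOTA) methods while maintaining the efficiency required for real-time clinical use.
Insight: The paper's novelties are: 1) a dual-path distillation mechanism that disentangles semantic and boundary-aware representations; 2) a frequency-aware alignment strategy that supervises low-frequency global semantics and high-frequency boundary details separately; 3) a boundary-aware decoder that fuses multi-scale encoder features with distilled, semantically rich boundary information. Viewed objectively, the method uses knowledge distillation to combine the strong generalization of large foundation models with the efficiency of lightweight models, and its enhancement of boundary detail is particularly instructive.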
Abstract: Automated polyp segmentation is critical for early colorectal cancer detection and its prevention, yet remains challenging due to weak boundaries, large appearance variations, and limited annotated data. Lightweight segmentation models such as U-Net, U-Net++, and PraNet offer practical efficiency for clinical deployment but struggle to capture the rich semantic and structural cues required for accurate delineation of complex polyp regions. In contrast, large Vision Foundation Models (VFMs), including SAM, OneFormer, Mask2Former, and DINOv2, exhibit strong generalization but transfer poorly to polyp segmentation due to domain mismatch, insufficient boundary sensitivity, and high computational cost. To bridge this gap, we propose LiteBounD, a Lightweight Boundary-guided Distillation framework that transfers complementary semantic and structural priors from multiple VFMs into compact segmentation backbones. LiteBounD introduces (i) a dual-path distillation mechanism that disentangles semantic and boundary-aware representations, (ii) a frequency-aware alignment strategy that supervises low-frequency global semantics and high-frequency boundary details separately, and (iii) a boundary-aware decoder that fuses multi-scale encoder features with distilled semantically rich boundary information for precise segmentation. Extensive experiments on both seen (Kvasir-SEG, CVC-ClinicDB) and unseen (ColonDB, CVC-300, ETIS) datasets demonstrate that LiteBounD consistently outperforms its lightweight baselines by a significant margin and achieves performance competitive with state-of-the-art methods, while maintaining the efficiency required for real-time clinical use. Our code is available at https://github.com/lostinrepo/LiteBounD.
[212] Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models cs.CVPDF
Ziyao Tang, Pengkun Jiao, Bin Zhu, Huiyan Qi, Jingjing Chen
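TL;DR: This paper finds that video large language models exhibit spatiotemporal sycophancy: under negation-based gaslighting, a model retracts its initially correct visual judgment, conforms to the user's incorrect feedback, and may even fabricate false spatiotemporal explanations. The authors propose a negation-based gaslighting evaluation framework and the GasVideo-1000 benchmark and test a range of advanced models. The results show that the problem is pervasive and severe and that existing models lack robust mechanisms for maintaining spatiotemporal beliefs in adversarial dialogue.
Details
Motivation: Video large language models perform well on video understanding tasks, but their robustness in conversational interaction has not been adequately explored. This paper studies the failure mode in which a model, under negation-based gaslighting, retracts a correct judgment and fabricates an explanation for the revision.
Result: An extensive evaluation of open-source and proprietary video large language models on GasVideo-1000 shows that vulnerability to negation-based gaslighting is pervasive and severe, affecting even models with strong baseline performance. Prompt-level constraints mitigate the behavior only partially and do not reliably prevent hallucinated justifications or belief reversal.
Insight: The novelty lies in identifying spatiotemporal sycophancy as a new failure mode and in proposing a systematic evaluation framework and benchmark. Viewed objectively, the study exposes an inherent weakness of video large language models in maintaining spatiotemporal beliefs during adversarial dialogue and points to an important direction for improving robustness.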
Abstract: Video Large Language Models (Vid-LLMs) have demonstrated remarkable performance in video understanding tasks, yet their robustness under conversational interaction remains largely underexplored. In this paper, we identify spatiotemporal sycophancy, a failure mode in which Vid-LLMs retract initially correct, visually grounded judgments and conform to misleading user feedback under negation-based gaslighting. Rather than merely changing their answers, the models often fabricate unsupported temporal or spatial explanations to justify incorrect revisions. To systematically investigate this phenomenon, we propose a negation-based gaslighting evaluation framework and introduce GasVideo-1000, a curated benchmark designed to probe spatiotemporal sycophancy with clear visual grounding and temporal reasoning requirements. We evaluate a broad range of state-of-the-art open-source and proprietary Vid-LLMs across diverse video understanding tasks. Extensive experiments reveal that vulnerability to negation-based gaslighting is pervasive and severe, even among models with strong baseline performance. While prompt-level grounding constraints can partially mitigate this behavior, they do not reliably prevent hallucinated justifications or belief reversal. Our results indicate that current Vid-LLMs lack robust mechanisms for maintaining grounded spatiotemporal beliefs under adversarial conversational feedback.
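One way to quantify the retraction behavior described above is a flip rate: among answers that were correct before pushback, the fraction that change after the user's false negation. The sketch below is a hypothetical metric written for illustration; the `Trial` record and `sycophancy_rate` function are assumptions and are not taken from the GasVideo-1000 evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One two-turn probe: the answer before and after the user's false negation."""
    truth: str
    first_answer: str
    answer_after_negation: str


def sycophancy_rate(trials: List[Trial]) -> float:
    """Fraction of initially correct answers that are abandoned after pushback."""
    initially_correct = [t for t in trials if t.first_answer == t.truth]
    if not initially_correct:
        return 0.0
    flipped = [t for t in initially_correct if t.answer_after_negation != t.truth]
    return len(flipped) / len(initially_correct)


trials = [
    Trial("left", "left", "right"),  # correct, then retracted
    Trial("left", "left", "left"),   # correct, held firm
    Trial("red", "blue", "blue"),    # wrong from the start, so excluded
    Trial("two", "two", "three"),    # correct, then retracted
]
print(sycophancy_rate(trials))
```

Conditioning on initially correct answers keeps the metric from conflating sycophancy with ordinary baseline errors.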
[213] AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning cs.CVPDF
Junxiao Xue, Quan Deng, Tingqi Hu, Meicong Si, Xinyi Yin
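TL;DR: This paper proposes AeroRAG, a structured multimodal retrieval-augmented generation framework for fine-grained aerial visual reasoning. The framework converts the input image into structured visual knowledge covering object categories, quantities, spatial locations, and semantic relations, then retrieves query-relevant semantic chunks to build a compact prompt for a text-based large language model. Experiments on the AUG aerial dataset and the general-domain VG-150 benchmark show consistent improvements over six strong multimodal LLM baselines.
Details
Motivation: Reliable visual question answering in aerial scenes is hard because task-critical evidence is usually carried by small objects, explicit quantities, coarse locations, and inter-object relations, while conventional dense visual-token representations align poorly with these structured semantics.
Result: On the AUG aerial dataset and the general-domain VG-150 benchmark, the method improves consistently over six strong multimodal LLM baselines, with the largest gains in dense aerial scenes and relation-sensitive reasoning. An evaluation on VQAv2 confirms that the proposed interface remains compatible with standard visual reasoning settings.
Insight: The paper introduces a more explicit intermediate interface between perception and language reasoning. Through structured retrieval-augmented generation, it converts visual information into structured knowledge to improve the accuracy and interpretability of reasoning, offering a practical design direction for deployment-oriented, grounded visual reasoning systems.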
Abstract: Despite recent progress in multimodal large language models (MLLMs), reliable visual question answering in aerial scenes remains challenging. In such scenes, task-critical evidence is often carried by small objects, explicit quantities, coarse locations, and inter-object relations, whereas conventional dense visual-token representations are not well aligned with these structured semantics. To address this interface mismatch, we propose AeroRAG, a scene-graph-guided multimodal retrieval-augmented generation framework for visual question answering. The framework first converts an input image into structured visual knowledge, including object categories, quantities, spatial locations, and semantic relations, and then retrieves query-relevant semantic chunks to construct compact prompts for a text-based large language model. Rather than relying on direct reasoning over dense visual tokens, our method introduces a more explicit intermediate interface between perception and language reasoning. Experiments on the AUG aerial dataset and the general-domain VG-150 benchmark show consistent improvements over six strong MLLM baselines, with the largest gains observed in dense aerial scenes and relation-sensitive reasoning. We further evaluate the framework on VQAv2 to verify that the proposed interface remains compatible with standard visual reasoning settings. These results suggest that structured retrieval is a practical design direction for deployment-oriented and grounded visual reasoning systems.
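The retrieve-then-prompt step described above can be sketched with a toy retriever. This is a minimal illustration under stated assumptions: the chunk strings, the token-overlap scoring, and the prompt template are invented for the example, and the actual AeroRAG retriever is not specified here.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_chunks(chunks: List[str], question: str, top_k: int = 2) -> List[str]:
    """Rank chunks by token overlap with the question (a stand-in for a real retriever)."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(tokenize(c) & q), reverse=True)
    return ranked[:top_k]


def build_prompt(chunks: List[str], question: str) -> str:
    """Assemble a compact text-only prompt from the retrieved scene facts."""
    facts = "\n".join("- " + c for c in chunks)
    return "Scene facts:\n" + facts + "\nQuestion: " + question + "\nAnswer:"


# Hypothetical structured knowledge extracted from one aerial image.
scene_chunks = [
    "count: 12 cars",
    "count: 3 trucks",
    "relation: truck parked next to warehouse",
    "location: cars in upper-left region",
]
question = "How many trucks are next to the warehouse?"
print(build_prompt(retrieve_chunks(scene_chunks, question), question))
```

Only the retrieved facts reach the language model, which is what keeps the prompt compact compared with passing dense visual tokens.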
[214] ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval cs.CVPDF
Zixu Li, Yupeng Hu, Zhiwei Chen, Qinlei Huang, Guozhi Qiu
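TL;DR: This paper proposes ReTrack, an evidence-driven dual-stream directional anchor calibration network for composed video retrieval (CVR). Through three key modules (Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment), the model addresses three challenges: modal contribution entanglement, explicit optimization of composed features, and retrieval uncertainty. It aims to calibrate the directional bias in composed features and thereby improve multimodal query understanding.
Details
Motivation: In CVR, the video and text modalities differ substantially in information density, and traditional composition methods tend to bias the composed feature toward the reference video, which hurts retrieval performance. The paper traces this to three core challenges: modal contribution entanglement, explicit optimization of composed features, and retrieval uncertainty.
Result: ReTrack generalizes well to both CVR and composed image retrieval (CIR), reaching SOTA performance on three benchmark datasets.
Insight: The novelty lies in being the first to improve multimodal query understanding by calibrating the directional bias in composed features, and in introducing three modules (Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment) that address the core CVR challenges systematically. Viewed objectively, the directional anchor calibration and the bidirectional evidence-driven similarity estimation are novel and effective designs.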
Abstract: With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a novel paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes a multi-modal query consisting of a reference video and a piece of modification text as input. The modification text conveys the user’s intended alterations to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there exists a substantial discrepancy in information density between video and text modalities. Traditional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation is significant due to the presence of three core challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRiven dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features. It consists of three key modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality to calibrate the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence that drives reliable composed-to-target similarity estimation. Moreover, ReTrack exhibits strong generalization to the Composed Image Retrieval (CIR) task, achieving SOTA performance across three benchmark datasets in both CVR and CIR scenarios. Codes are available at https://github.com/Lee-zixu/ReTrack
[215] MEDN: Motion-Emotion Feature Decoupling Network for Micro-Expression Recognition cs.CVPDF
Chenxing Hu, Kun Xie, Qiguang Miao, Ruyi Liu, Quan Wang
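TL;DR: This paper proposes MEDN, a motion-emotion feature decoupling network for micro-expression recognition. It targets micro-expressions that look alike but carry opposite emotions, a problem caused by the non-strict mapping between action units and emotion categories. The network uses a dual-branch structure to extract motion and emotion features separately, reduces feature coupling with an orthogonal loss, introduces a Sparse Emotion Vision Transformer (SEVit) to highlight local temporal variations, and fuses the decoupled features adaptively with a Collaborative Fusion Module (CoFM).
Details
Motivation: Existing micro-expression recognition methods rely mainly on explicit facial motion cues (such as optical flow, frame differences, and AU features) and ignore implicit emotion information, which makes visually similar micro-expressions with opposite emotions hard to recognize.
Result: Extensive experiments on three benchmark datasets show that MEDN effectively decouples motion and emotion features and achieves superior recognition performance, offering a new perspective for improving recognition accuracy and generalization.
Insight: The novelty lies in decoupling motion and emotion features with a dual-branch framework and an orthogonal loss, and in modeling implicit emotion with the Sparse Emotion Vision Transformer (SEVit). This offers a reusable idea for handling feature coupling.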
Abstract: Unlike macro-expressions, micro-expressions do not follow a strictly consistent mapping rule between emotions and Action Units (AUs). As a result, some micro-expressions share identical AUs yet represent completely opposite emotional categories, making them highly visually similar. Existing micro-expression recognition (MER) methods mostly rely on explicit facial motion cues (e.g., optical flow, frame differences, AU features) while ignoring implicit emotion information. To tackle this issue, this paper presents a Motion-Emotion Feature Decoupling Network (MEDN) for MER. We design a dual-branch framework to separately extract motion and emotion features. In the motion branch, an AU-detection task restricts features to the explicit motion domain, and an orthogonal loss is adopted to reduce motion-emotion feature coupling. For implicit emotion modeling, we propose a Sparse Emotion Vision Transformer (SEVit) that sparsifies spatial tokens to highlight local temporal variations with multi-scale sparsity rates. A Collaborative Fusion Module (CoFM) is further developed to fuse disentangled motion and emotion features adaptively. Extensive experiments on three benchmark datasets validate that MEDN effectively decouples motion and emotion features and achieves superior recognition performance, offering a new perspective for enhancing recognition accuracy and generalization.
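An orthogonal loss of the kind mentioned above is commonly written as the squared cosine similarity between the two feature vectors, so that it vanishes when they are orthogonal. The sketch below shows that common form as an assumption; the exact loss used in MEDN is not given in the abstract, and the function name is invented for the example.

```python
import math
from typing import List


def orthogonality_loss(motion: List[float], emotion: List[float]) -> float:
    """Squared cosine similarity: 0 for orthogonal vectors, 1 for parallel ones."""
    dot = sum(m * e for m, e in zip(motion, emotion))
    norm_m = math.sqrt(sum(m * m for m in motion))
    norm_e = math.sqrt(sum(e * e for e in emotion))
    if norm_m == 0.0 or norm_e == 0.0:
        return 0.0  # a zero vector carries no direction, so nothing to penalize
    return (dot / (norm_m * norm_e)) ** 2


# Orthogonal features incur no penalty; parallel features incur the maximum.
print(orthogonality_loss([1.0, 0.0], [0.0, 1.0]))
print(orthogonality_loss([1.0, 2.0], [2.0, 4.0]))
```

Minimizing this term pushes the two branches toward encoding non-redundant directions, which is the stated goal of reducing motion-emotion coupling.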
[216] OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models cs.CVPDF
Yiwei Zhang, Xuesong Chen, Jin Gao, Hanshi Wang, Fudong Ge
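TL;DR: This paper proposes OneDrive, a unified multi-paradigm autonomous driving framework built on a pretrained vision-language model (VLM). It addresses the fact that multi-task learning in end-to-end driving (language generation, object detection, trajectory regression) requires heterogeneous decoding behaviors. By organizing visual and structured query tokens within a single causal decoder, textual and structured outputs share one attention backbone, which enables stable joint optimization across heterogeneous tasks.
Details
Motivation: Existing autonomous driving systems usually add separate or cascaded decoders for different tasks (autoregressive language generation, parallel object detection, trajectory regression), which fragments the architecture and limits backbone reuse. The paper aims to design a unified framework that reconciles these heterogeneous decoding behaviors within a single transformer decoder.
Result: On end-to-end autonomous driving benchmarks the method achieves state-of-the-art performance: 0.28 L2 error and 0.18 collision rate in nuScenes open-loop evaluation, and competitive results (86.8 PDMS) in NAVSIM closed-loop evaluation. An efficient inference mode lowers latency by about 40%.
Insight: The novelty lies in showing that the attention of a pretrained VLM transfers well beyond pure language modeling. By folding structured tasks such as trajectory planning into the same causal LLM decoder, perception, planning, and language generation share and are jointly optimized under one attention backbone, which suggests a new route to compact, efficient multi-task driving systems.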
Abstract: Vision-Language Models (VLMs) excel at autoregressive text generation, yet end-to-end autonomous driving requires multi-task learning with structured outputs and heterogeneous decoding behaviors, such as autoregressive language generation, parallel object detection and trajectory regression. To accommodate these differences, existing systems typically introduce separate or cascaded decoders, resulting in architectural fragmentation and limited backbone reuse. In this work, we present a unified autonomous driving framework built upon a pretrained VLM, where heterogeneous decoding behaviors are reconciled within a single transformer decoder. We demonstrate that pretrained VLM attention exhibits strong transferability beyond pure language modeling. By organizing visual and structured query tokens within a single causal decoder, structured queries can naturally condition on visual context through the original attention mechanism. Textual and structured outputs share a common attention backbone, enabling stable joint optimization across heterogeneous tasks. Trajectory planning is realized within the same causal LLM decoder by introducing structured trajectory queries. This unified formulation enables planning to share the pretrained attention backbone with images and perception tokens. Extensive experiments on end-to-end autonomous driving benchmarks demonstrate state-of-the-art performance, including 0.28 L2 and 0.18 collision rate on nuScenes open-loop evaluation and competitive results (86.8 PDMS) on NAVSIM closed-loop evaluation. The full model preserves multi-modal generation capability, while an efficient inference mode achieves approximately 40% lower latency. Code and models are available at https://github.com/Z1zyw/OneDrive
[217] Prompting Foundation Models for Zero-Shot Ship Instance Segmentation in SAR Imagery cs.CV | cs.AI | cs.LGPDF
Islam Mansour, Francescopaolo Sica, Michael Schmitt
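TL;DR: This paper proposes a method that uses general-purpose vision foundation models for zero-shot ship instance segmentation in SAR imagery. A YOLOv11 detector trained on open SAR datasets localizes ship bounding boxes, which then prompt the Segment Anything Model 2 (SAM2) to generate instance masks, with no mask annotations needed at any stage.
Details
Motivation: The lack of pixel-level annotations limits the use of deep learning in synthetic aperture radar (SAR) image analysis. The paper aims at zero-shot ship instance segmentation without mask supervision.
Result: On the SSDD benchmark the method reaches a mean IoU of 0.637 (89% of the fully supervised baseline) and an overall ship detection rate of 89.2%.
Insight: The novelty lies in regularizing the predictions of the foundation model (SAM2) using only the spatial constraints from a SAR-trained detector, with no fine-tuning or adapters. This partially mitigates the optical-SAR domain gap and offers a scalable, annotation-efficient path toward foundation-model-based SAR image understanding.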
Abstract: Synthetic Aperture Radar (SAR) plays a critical role in maritime surveillance, yet deep learning for SAR analysis is limited by the lack of pixel-level annotations. This paper explores how general-purpose vision foundation models can enable zero-shot ship instance segmentation in SAR imagery, eliminating the need for pixel-level supervision. A YOLOv11-based detector trained on open SAR datasets localizes ships via bounding boxes, which then prompt the Segment Anything Model 2 (SAM2) to produce instance masks without any mask annotations. Unlike prior SAM-based SAR approaches that rely on fine tuning or adapters, our method demonstrates that spatial constraints from a SAR-trained detector alone can effectively regularize foundation model predictions. This design partially mitigates the optical-SAR domain gap and enables downstream applications such as vessel classification, size estimation, and wake analysis. Experiments on the SSDD benchmark achieve a mean IoU of 0.637 (89% of a fully supervised baseline) with an overall ship detection rate of 89.2%, confirming a scalable, annotation-efficient pathway toward foundation-model-driven SAR image understanding.
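The mean IoU reported above averages the intersection-over-union of predicted and reference masks. The sketch below computes it for small binary masks in plain Python; the helper names and the toy masks are illustrative, and the convention that two empty masks score 1.0 is an assumption of this example.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # rows of 0/1 pixels


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 1.0  # assumed convention for two empty masks


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs."""
    return sum(mask_iou(p, g) for p, g in pairs) / len(pairs)


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(mask_iou(pred, gt))
```

As a rough consistency check on the reported numbers, a mean IoU of 0.637 at 89% of the supervised baseline implies a baseline of about 0.637 / 0.89 ≈ 0.72.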
[218] From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models cs.CV | cs.CLPDF
Qidong Wang, Junjie Hu, Ming Jiang
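TL;DR: This paper proposes HONES (Head-Oriented Neuron Explanation & Steering), a gradient-free, task-aware framework for neuron attribution and steering in multi-task vision-language models. It scores feed-forward network neurons by their causal write-in contributions conditioned on task-relevant attention heads and steers them with lightweight scaling. Experiments on four multimodal tasks and two popular VLMs show that HONES outperforms existing methods at identifying task-critical neurons and improves model performance after steering.
Details
Motivation: Existing neuron-level interpretation methods mostly focus on a single task, which limits the comparability of neuron importance across tasks. Ranking strategies also tend to score neurons in isolation, ignoring how task-dependent information pathways shape the write-in effects of feed-forward network neurons. This aggravates neuron polysemanticity in multi-task settings and adds noise to identifying and intervening on task-critical neurons.
Result: Experiments on four different multimodal tasks and two popular vision-language models show that HONES outperforms existing methods at identifying task-critical neurons and improves model performance after neuron steering.
Insight: The novelty lies in a gradient-free, task-aware framework that conditions neuron attribution on task-relevant attention heads and so accounts for information pathways, which identifies and steers the key neurons in multi-task VLMs more accurately and mitigates neuron polysemanticity. Viewed objectively, combining attention analysis with neuron analysis gives a new perspective on information flow in multi-task models.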
Abstract: Recent work has increasingly explored neuron-level interpretation in vision-language models (VLMs) to identify neurons critical to final predictions. However, existing neuron analyses generally focus on single tasks, limiting the comparability of neuron importance across tasks. Moreover, ranking strategies tend to score neurons in isolation, overlooking how task-dependent information pathways shape the write-in effects of feed-forward network (FFN) neurons. This oversight can exacerbate neuron polysemanticity in multi-task settings, introducing noise into the identification and intervention of task-critical neurons. In this study, we propose HONES (Head-Oriented Neuron Explanation & Steering), a gradient-free framework for task-aware neuron attribution and steering in multi-task VLMs. HONES ranks FFN neurons by their causal write-in contributions conditioned on task-relevant attention heads, and further modulates salient neurons via lightweight scaling. Experiments on four diverse multimodal tasks and two popular VLMs show that HONES outperforms existing methods in identifying task-critical neurons and improves model performance after steering. Our source code is released at: https://github.com/petergit1/HONES.
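Steering by lightweight scaling can be pictured as multiplying selected hidden activations of a feed-forward block by a factor before they are written to the output. The sketch below shows that idea on a tiny ReLU FFN; it is an illustrative assumption about what scaling means here, not the HONES implementation, and `ffn_forward` and its `scale` argument are hypothetical.

```python
from typing import Dict, List, Optional


def ffn_forward(x: List[float],
                w_in: List[List[float]],
                w_out: List[List[float]],
                scale: Optional[Dict[int, float]] = None) -> List[float]:
    """Tiny ReLU feed-forward block with optional per-neuron scaling.

    w_in[j] holds the input weights of hidden neuron j; w_out[j] is the vector
    that neuron j writes into the output. scale maps neuron index -> multiplier.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if scale:
        hidden = [h * scale.get(j, 1.0) for j, h in enumerate(hidden)]
    d_out = len(w_out[0])
    return [sum(hidden[j] * w_out[j][k] for j in range(len(hidden)))
            for k in range(d_out)]


x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0]]
print(ffn_forward(x, w_in, w_out))                  # unsteered output
print(ffn_forward(x, w_in, w_out, scale={1: 2.0}))  # neuron 1 amplified
```

Because only a multiplier per selected neuron changes, no weights are retrained, which is what makes this kind of intervention lightweight.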
[219] ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection cs.CVPDF
Qiuhui Chen, Jiaxiang Song, Shuai Tan, Weimin Zhong
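TL;DR: ZSG-IAD is a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. It takes RGB images, sensor images, and 3D point clouds as input and produces structured anomaly reports and pixel-level anomaly masks.
Details
Motivation: Deep-learning-based industrial anomaly detectors usually behave as black boxes and struggle to provide physically meaningful defect evidence to support their decisions.
Result: Across multiple industrial anomaly detection benchmarks, the method shows strong zero-shot performance and gives more transparent, physically grounded explanations than existing methods.
Insight: The novelty lies in a language-guided two-hop grounding module and in applying Executable-Rule GRPO with verifiable rewards to improve output reliability, anomaly-region consistency, and reasoning-conclusion coherence, which yields more interpretable anomaly detection.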
Abstract: Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion coherence. Experiments across multiple industrial anomaly benchmarks show strong zero-shot performance and more transparent, physically grounded explanations than prior methods. We will release code and annotations to support future research on trustworthy industrial anomaly detection systems.
[220] DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection cs.CVPDF
Lazaro J. Gonzalez-Soler, André Dörsch, Christian Rathgeb, Christoph Busch
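TL;DR: This paper proposes DifFoundMAD, a parameter-efficient differential morphing attack detection (D-MAD) framework. It uses the generalization ability of vision foundation models to capture discrepancies between suspected morphs and live capture images. With lightweight fine-tuning and class-balanced optimization it updates only a small number of parameters and outperforms the best existing systems on standard D-MAD benchmarks.
Details
Motivation: Conventional D-MAD systems are limited by their reliance on face recognition embeddings or handcrafted feature differences. The paper uses the strong representations of foundation models to detect face morphing attacks more effectively.
Result: Extensive cross-database evaluation on standard D-MAD benchmarks shows that DifFoundMAD clearly outperforms the best existing systems at strict security levels, lowering the error rate at high security levels from 6.16% to 2.17%.
Insight: The novelty lies in bringing foundation-model embedding spaces into the differential detection paradigm and combining them with a parameter-efficient, lightweight fine-tuning strategy, which achieves high task-specific detection performance while preserving the rich priors of the foundation model.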
Abstract: In this work, we introduce DifFoundMAD, a parameter-efficient D-MAD framework that exploits the generalisation capabilities of vision foundation models (FM) to capture discrepancies between suspected morphs and live capture images. In contrast to conventional D-MAD systems that rely on face recognition embeddings or handcrafted feature differences, DifFoundMAD follows the standard differential paradigm while replacing the underlying representation space with embeddings extracted from FMs. By combining lightweight finetuning with class-balanced optimisation, the proposed method updates only a small subset of parameters while preserving the rich representational priors of the underlying FMs. Extensive cross-database evaluations on standard D-MAD benchmarks demonstrate that DifFoundMAD achieves consistent improvements over state-of-the-art systems, particularly at the strict security levels required in operational deployments such as border control: The error rates reported in the current state-of-the-art were reduced from 6.16% to 2.17% for high-security levels using DifFoundMAD.
[221] MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene cs.CVPDF
Wenjie Mu, Zhan Li, Chuanzhou Su, Xuanyi Shen, Ziniu Liu
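TL;DR: MU-GeNeRF is a multi-view uncertainty-guided generalizable neural radiance field framework that targets real-world scenes in which transient distractors break cross-view structural consistency. It decomposes uncertainty into source-view and target-view components and combines them in a heteroscedastic reconstruction loss that adaptively modulates supervision, giving more robust distractor suppression and geometric modeling.
Details
Motivation: In real scenes with transient distractors, cross-view structural consistency breaks down for existing generalizable NeRF (GeNeRF) methods, which corrupts supervision and degrades reconstruction quality. Existing distractor-free NeRF methods are based on per-scene optimization and estimate uncertainty from per-view reconstruction error, which is unreliable for GeNeRFs and often misjudges inconsistent static structures as distractors.
Result: Extensive experiments show that the method not only surpasses existing GeNeRF methods but also reaches performance comparable to scene-specific distractor-free NeRF methods.
Insight: The novelty lies in decomposing distractor awareness into two complementary uncertainty components: source-view uncertainty, which captures structural discrepancies across source views caused by viewpoint changes or dynamic factors, and target-view uncertainty, which detects observation anomalies in the target image caused by transient distractors. Combining the two through a heteroscedastic reconstruction loss to modulate supervision adaptively is a significant improvement over existing uncertainty estimation methods.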
Abstract: Generalizable Neural Radiance Fields (GeNeRFs) enable high-quality scene reconstruction from sparse views and can generalize to unseen scenes. However, in real-world settings, transient distractors break cross-view structural consistency, corrupting supervision and degrading reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and estimate uncertainty from per-view reconstruction errors, which are not reliable for GeNeRFs and often misjudge inconsistent static structures as distractors. To this end, we propose MU-GeNeRF, a Multi-view Uncertainty-guided distractor-aware GeNeRF framework designed to alleviate GeNeRF’s robust modeling challenges in the presence of transient distractions. We decompose distractor awareness into two complementary uncertainty components: Source-view Uncertainty, which captures structural discrepancies across source views caused by viewpoint changes or dynamic factors; and Target-view Uncertainty, which detects observation anomalies in the target image induced by transient distractors. These two uncertainties address distinct error sources and are combined through a heteroscedastic reconstruction loss, which guides the model to adaptively modulate supervision, enabling more robust distractor suppression and geometric modeling. Extensive experiments show that our method not only surpasses existing GeNeRFs but also achieves performance comparable to scene-specific distractor-free NeRFs.
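A heteroscedastic reconstruction loss typically takes the Gaussian negative log-likelihood form, in which a predicted variance down-weights the residual but is itself penalized through a log term. The sketch below uses that standard form and, purely as an assumption for illustration, combines the source-view and target-view uncertainties by adding their variances; the paper's actual combination rule is not stated in the abstract.

```python
import math


def heteroscedastic_loss(pred: float, target: float,
                         var_source: float, var_target: float,
                         eps: float = 1e-6) -> float:
    """Per-pixel Gaussian NLL (up to a constant) with a combined variance.

    Assumption: total variance = source-view variance + target-view variance.
    """
    var = var_source + var_target + eps
    residual_term = (pred - target) ** 2 / (2.0 * var)
    penalty_term = 0.5 * math.log(var)  # stops the model from inflating variance everywhere
    return residual_term + penalty_term


# A pixel covered by a transient distractor has a large residual. Assigning it
# high uncertainty lowers its loss, so it contributes less to the gradient.
print(heteroscedastic_loss(3.0, 0.0, var_source=0.5, var_target=0.5))
print(heteroscedastic_loss(3.0, 0.0, var_source=4.5, var_target=4.5))
```

For a fixed residual, the loss is minimized when the variance matches the squared residual, which is why the uncertainty learns to track where supervision is unreliable.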
[222] E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes cs.CVPDF
Koya Sakamoto, Taiki Miyanishi, Daichi Azuma, Shuhei Kurita, Shu Morikuni
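TL;DR: This paper presents E3VS-Bench, a benchmark for evaluating viewpoint-dependent active perception in 3D Gaussian Splatting scenes. It contains 99 high-fidelity 3D scenes and 2,014 question-driven tasks in which an agent must control its viewpoint in 5-DoF to gather viewpoint-dependent evidence and answer the question. The study evaluates several advanced vision-language models and finds a substantial gap between them and human performance.
Details
Motivation: Existing visual search and embodied AI benchmarks (such as EQA) usually rely on static observations or constrained egocentric motion. They cannot adequately evaluate the fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing the contents of containers, and disambiguating object attributes that are observable only from specific angles.
Result: Several state-of-the-art vision-language models (VLMs) were evaluated on E3VS-Bench and compared with humans. Despite strong 2D reasoning, all models show a substantial gap from humans in active perception and coherent viewpoint planning under 5-DoF viewpoint changes.
Insight: The novelty lies in using 3D Gaussian Splatting for high-fidelity free-viewpoint rendering, which preserves fine-grained visual details (such as small text and subtle attributes) that mesh-based simulators often lose. This allows the construction of questions that cannot be answered from a single view and require active inspection across 5-DoF viewpoints, giving a more realistic and fine-grained testbed for the active perception of embodied agents.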
Abstract: Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion, and thus do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search where agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators, thereby allowing the construction of questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning specifically under full 5-DoF viewpoint changes.
[223] Identifying Ethical Biases in Action Recognition Models cs.CVPDF
Ana Baltaretu, Pascal Benschop, Jan van Gemert
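TL;DR: This paper proposes a new framework for auditing bias in human action recognition (HAR) models with synthetic video data. Videos generated on the BEDLAM simulation platform allow precise control over visual identity attributes such as skin color, so the authors can test whether a model shows statistically significant bias when the action is identical but the skin color differs.
Details
Motivation: HAR models are being deployed in high-stakes environments, yet their fairness across human appearances (such as skin color) has not been adequately analyzed. Evaluating model bias requires a method that can isolate and test the effect of a single attribute.
Result: The experiments show that some popular HAR models exhibit statistically significant bias when the action is identical but the skin color differs. This reveals that the models encode unwanted visual associations and provides evidence of systematic errors across groups.
Insight: The novelty lies in the first bias-auditing framework that focuses on HAR models and uses synthetic video while preserving temporal consistency. It controls attributes precisely to isolate the source of bias and gives methodological support for building more transparent, accountable systems.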
Abstract: Human Action Recognition (HAR) models are increasingly deployed in high-stakes environments, yet their fairness across different human appearances has not been analyzed. We introduce a framework for auditing bias in HAR models using synthetic video data, generated with full control over visual identity attributes such as skin color. Unlike prior work that focuses on static images or pose estimation, our approach preserves temporal consistency, allowing us to isolate and test how changes to a single attribute affect model predictions. Through controlled interventions using the BEDLAM simulation platform, we show that some popular HAR models exhibit statistically significant biases with respect to skin color even when the motion remains identical. Our results highlight how models may encode unwanted visual associations, and we provide evidence of systematic errors across groups. This work contributes a framework for auditing HAR models and supports the development of more transparent, accountable systems in light of upcoming regulatory standards.
[224] Mitigating Multimodal Hallucination via Phase-wise Self-reward cs.CV | cs.CLPDF
Yu Zhang, Chuyang Sun, Kehai Chen, Xuefeng Bai, Yang Xiang
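TL;DR: This paper proposes PSRD (Phase-wise Self-Reward Decoding), a self-rewarding framework that mitigates visual hallucination in large vision-language models (LVLMs) dynamically at inference time, without external supervision. The method analyzes the phase-wise dynamics of hallucination emergence and uses a lightweight reward model to provide real-time guidance for targeted intervention during decoding.
Details
Motivation: Existing methods either rely on large-scale annotated data for fine-tuning, which is computationally expensive, or use static post-hoc strategies that overlook the dynamic nature of hallucination emergence. The paper aims to overcome both limitations with dynamic hallucination mitigation that needs no external supervision.
Result: Tests of four LVLMs on five hallucination evaluation benchmarks show that PSRD reduces the hallucination rate of LLaVA-1.5-7B by 50.0% and consistently outperforms existing post-hoc methods.
Insight: The novelty lies in revealing that visual hallucination follows phase-wise dynamic patterns, peaking at the onset of each semantic phase, and in building an online hallucination correction framework on phase-wise self-reward signals. Distilling the hallucination guidance signal into a lightweight reward model makes the decoding intervention efficient and controllable, balancing performance against inference efficiency.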
Abstract: Large Vision-Language Models (LVLMs) still struggle with vision hallucination, where generated responses are inconsistent with the visual input. Existing methods either rely on large-scale annotated data for fine-tuning, which incurs massive computational overhead, or employ static post-hoc strategies that overlook the dynamic nature of hallucination emergence. To address these, we introduce a new self-rewarding framework, enabling dynamic hallucination mitigation at inference time without external supervision. On the empirical side, we reveal that visual hallucination exhibits phase-wise dynamic patterns, peaking at the onset of each semantic phase. Drawing on these insights, we propose PSRD (Phase-wise Self-Reward Decoding) for online hallucination correction guided by phase-wise self-reward signals. To reduce the cost of repeated self-evaluation during decoding, we distill the hallucination guidance signal from LVLMs into a lightweight reward model. The reward model subsequently provides on-the-fly guidance for targeted intervention during the decoding process, enabling precise hallucination suppression. The proposed PSRD significantly reduces the hallucination rate of LLaVA-1.5-7B by 50.0% and consistently outperforms existing post-hoc methods across five hallucination evaluation benchmarks for four LVLMs. Further analysis confirms that PSRD effectively mitigates hallucination propagation and achieves a highly controllable trade-off between strong performance and inference efficiency.
[225] Trustworthy Endoscopic Super-Resolution cs.CVPDF
Julio Silva-Rodríguez, Ender Konukoglu
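TL;DR: This paper proposes a framework for making endoscopic super-resolution systems more trustworthy. It integrates a lightweight error-prediction network that estimates pixel-wise reconstruction error and, building on conformal risk control, constructs Conformal Failure Masks that localize untrustworthy reconstructed regions, giving theoretical guarantees in real-time deployment.
Details
Motivation: Super-resolution models can introduce hallucinated structures and amplify noise in safety-critical settings such as endoscopic surgery. The paper aims to improve their reliability.
Result: Evaluated on image and video super-resolution, the method detects unreliable reconstructions effectively in endoscopic and robotic surgery settings. According to the authors, it is the first model-agnostic, theoretically grounded approach to improving the safety of real-time endoscopic image super-resolution.
Insight: The novelty lies in combining a lightweight error-prediction network with conformal risk control to obtain pixel-wise error estimates and failure detection with theoretical guarantees. The approach is suited to real-time deployment and improves the safety of super-resolution in medical applications.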
Abstract: Super-resolution (SR) models are attracting growing interest for enhancing minimally invasive surgery and diagnostic videos under hardware constraints. However, valid concerns remain regarding the introduction of hallucinated structures and amplified noise, limiting their reliability in safety-critical settings. We propose a direct and practical framework to make SR systems more trustworthy by identifying where reconstructions are likely to fail. Our approach integrates a lightweight error-prediction network that operates on intermediate representations to estimate pixel-wise reconstruction error. The module is computationally efficient and low-latency, making it suitable for real-time deployment. We convert these predictions into operational failure decisions by constructing Conformal Failure Masks (CFM), which localize regions where the SR output should not be trusted. Built on conformal risk control principles, our method provides theoretical guarantees for controlling both the tolerated error limit and the miscoverage in detected failures. We evaluate our approach on image and video SR, demonstrating its effectiveness in detecting unreliable reconstructions in endoscopic and robotic surgery settings. To our knowledge, this is the first study to provide a model-agnostic, theoretically grounded approach to improving the safety of real-time endoscopic image SR.
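To show how predicted errors can be turned into calibrated failure decisions, the sketch below uses plain split-conformal calibration, a simpler relative of conformal risk control, and is not the authors' CFM procedure. It learns a margin from held-out pixels so that the true error stays below the predicted error plus the margin with probability at least 1 - alpha, assuming calibration and test pixels are exchangeable. All function names and numbers are hypothetical.

```python
import math
from typing import List


def conformal_margin(calib_true: List[float], calib_pred: List[float],
                     alpha: float) -> float:
    """Split-conformal quantile of the scores (true error - predicted error)."""
    scores = sorted(t - p for t, p in zip(calib_true, calib_pred))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def failure_mask(pred_errors: List[float], margin: float,
                 tolerance: float) -> List[bool]:
    """Flag a pixel when its calibrated error bound exceeds the tolerated error."""
    return [p + margin > tolerance for p in pred_errors]


# Hypothetical calibration pixels: true vs. predicted reconstruction error.
calib_true = [0.10, 0.20, 0.15, 0.30, 0.25, 0.12, 0.18, 0.22, 0.28]
calib_pred = [0.08, 0.15, 0.15, 0.20, 0.24, 0.10, 0.20, 0.20, 0.25]
margin = conformal_margin(calib_true, calib_pred, alpha=0.2)
print(failure_mask([0.05, 0.20, 0.40], margin, tolerance=0.30))
```

When the calibration set is too small for the requested alpha, the margin becomes infinite and every pixel is flagged, which is the conservative fallback.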
[226] CFSR: Geometry-Conditioned Shadow Removal via Physical Disentanglement cs.CVPDF
Pan Wang, Yihao Hu, Xiujin Liu, Hang Wang
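TL;DR: CFSR is a physics-constrained, multimodal prior-driven framework for shadow removal. By integrating 3D geometric cues with the semantics of large-scale foundation models, it reframes shadow removal as a physics-constrained image restoration process and bridges the 2D-3D domain gap.
Details
Motivation: Traditional shadow removal networks usually treat image restoration as an unconstrained mapping and lack the physical interpretability needed to balance local texture recovery with global illumination consistency.
Result: CFSR achieves state-of-the-art performance on multiple challenging benchmarks.
Insight: The contributions include mapping observations into a custom HVI color space to suppress shadow-induced noise; a Geometric & Semantic Dual Explicit Guided Attention mechanism that uses DINO features and 3D surface normals to modulate the attention affinity matrix directly; injecting holistic priors through a frozen CLIP encoder; and a Frequency Collaborative Reconstruction Module (FCRM) that decouples the decoding process to harmonize the reconstruction of high-frequency occlusion boundaries with the restoration of low-frequency global illumination.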
Abstract: Traditional shadow removal networks often treat image restoration as an unconstrained mapping, lacking the physical interpretability required to balance localized texture recovery with global illumination consistency. To address this, we propose CFSR, a multi-modal prior-driven framework that reframes shadow removal as a physics-constrained restoration process. By seamlessly integrating 3D geometric cues with large-scale foundation model semantics, CFSR effectively bridges the 2D-3D domain gap. Specifically, we first map observations into a custom HVI color space to suppress shadow-induced noise and robustly fuse RGB data with estimated depth priors. At its core, our Geometric & Semantic Dual Explicit Guided Attention mechanism utilizes DINO features and 3D surface normals to directly modulate the attention affinity matrix, structurally enforcing physical lighting constraints. To recover severely degraded regions, we inject holistic priors via a frozen CLIP encoder. Finally, our Frequency Collaborative Reconstruction Module (FCRM) achieves an optimal synthesis by decoupling the decoding process. Conditioned on geometric priors, FCRM seamlessly harmonizes the reconstruction of sharp high-frequency occlusion boundaries with the restoration of low-frequency global illumination. Extensive experiments demonstrate that CFSR achieves state-of-the-art performance across multiple challenging benchmarks.
[227] HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval cs.CVPDF
Zixu Li, Yupeng Hu, Zhiwei Chen, Shiqi Zhang, Qinlei Huang
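TL;DR: This paper proposes HABIT, a robust progressive learning framework that addresses the Noise Triplet Correspondence (NTC) problem in composed image retrieval (CIR). A Mutual Knowledge Estimation Module quantifies sample cleanliness, and a Dual-consistency Progressive Learning Module simulates human habit formation, enabling robust learning and retrieval on noisy data.
Details
Motivation: In practice, composed image retrieval faces a severe NTC problem that stems mainly from the high cost and subjectivity of annotating triplet data. The paper tackles NTC by focusing on two core challenges: precisely estimating the composed semantic discrepancy and progressively adapting to the modification discrepancy.
Result: Extensive experiments on two standard CIR datasets show that HABIT significantly outperforms most existing methods under various noise ratios, with superior robustness and retrieval performance.
Insight: The contributions are (1) a Mutual Knowledge Estimation Module that quantifies sample cleanliness by computing the Transition Rate of mutual information between the composed feature and the target image, which identifies clean samples that match the intended modification semantics; and (2) a Dual-consistency Progressive Learning Module that introduces collaboration between the historical and current models, simulating human habit formation (retaining good habits, calibrating bad ones) to learn robustly in the presence of NTC. Viewed objectively, the analogy between noise-robust learning and human habit formation gives a fresh perspective on handling noisy multimodal data.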
Abstract: Composed Image Retrieval (CIR) is a flexible image retrieval paradigm that enables users to accurately locate the target image through a multimodal query composed of a reference image and modification text. Although this task has demonstrated promising applications in personalized search and recommendation systems, it encounters a severe challenge in practical scenarios known as the Noise Triplet Correspondence (NTC) problem. This issue primarily arises from the high cost and subjectivity involved in annotating triplet data. To address this problem, we identify two central challenges: the precise estimation of composed semantic discrepancy and the insufficient progressive adaptation to modification discrepancy. To tackle these challenges, we propose a cHrono-synergiA roBust progressIve learning framework for composed image reTrieval (HABIT), which consists of two core modules. First, the Mutual Knowledge Estimation Module quantifies sample cleanliness by calculating the Transition Rate of mutual information between the composed feature and the target image, thereby effectively identifying clean samples that align with the intended modification semantics. Second, the Dual-consistency Progressive Learning Module introduces a collaborative mechanism between the historical and current models, simulating human habit formation to retain good habits and calibrate bad habits, ultimately enabling robust learning under the presence of NTC. Extensive experiments conducted on two standard CIR datasets demonstrate that HABIT significantly outperforms most methods under various noise ratios, exhibiting superior robustness and retrieval performance. Codes are available at https://github.com/Lee-zixu/HABIT
[228] GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting cs.CVPDF
Mingyu Shi, Xin Di, Long Peng, Boxiang Cao, Anran Wu
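TL;DR: This paper proposes GS-STVSR, an ultra-efficient framework for continuous spatio-temporal video super-resolution. Built on 2D Gaussian Splatting, it drives the spatio-temporal evolution of Gaussian kernels through continuous motion modeling, which entirely avoids the dense pixel-grid queries required by implicit-neural-representation methods and significantly improves computational efficiency.
Details
Motivation: In existing continuous spatio-temporal video super-resolution methods based on implicit neural representations, computational cost grows linearly with the number of interpolated frames, which severely limits inference efficiency. The paper aims to remove this bottleneck with an efficient framework that does not rely on dense grid queries.
Result: Extensive experiments on the Vid4, GoPro, and Adobe240 benchmarks show that GS-STVSR achieves state-of-the-art quality on all of them. Its inference time is nearly constant at conventional temporal scales (×2–×8), and it delivers more than a 3× speedup at the extreme ×32 scale.
Insight: The core innovation is bringing 2D Gaussian Splatting to video super-resolution and replacing dense pixel queries with Gaussian evolution driven by continuous motion modeling. Specific techniques include exploiting the temporal stability of covariance parameters for lightweight intermediate fitting, an optical-flow-guided motion module, a covariance resampling alignment module that prevents drift, and an adaptive offset window for large-scale motion. Together these suggest a new route to efficient continuous video processing.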
Abstract: Continuous Spatio-Temporal Video Super-Resolution (C-STVSR) aims to simultaneously enhance the spatial resolution and frame rate of videos by arbitrary scale factors, offering greater flexibility than fixed-scale methods that are constrained by predefined upsampling ratios. In recent years, methods based on Implicit Neural Representations (INR) have made significant progress in C-STVSR by learning continuous mappings from spatio-temporal coordinates to pixel values. However, these methods fundamentally rely on dense pixel-wise grid queries, causing computational cost to scale linearly with the number of interpolated frames and severely limiting inference efficiency. We propose GS-STVSR, an ultra-efficient C-STVSR framework based on 2D Gaussian Splatting (2D-GS) that drives the spatiotemporal evolution of Gaussian kernels through continuous motion modeling, bypassing dense grid queries entirely. We exploit the strong temporal stability of covariance parameters for lightweight intermediate fitting, design an optical flow-guided motion module to derive Gaussian position and color at arbitrary time steps, introduce a covariance resampling alignment module to prevent covariance drift, and propose an adaptive offset window for large-scale motion. Extensive experiments on Vid4, GoPro, and Adobe240 show that GS-STVSR achieves state-of-the-art quality across all benchmarks. Moreover, its inference time remains nearly constant at conventional temporal scales (×2–×8), and it delivers over 3× speedup at the extreme scale of ×32, demonstrating strong practical applicability.
[229] INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval cs.CVPDF
Zhiwei Chen, Yupeng Hu, Zhiheng Fu, Zixu Li, Jiale Huang
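TL;DR: This paper proposes INTENT to address the Noisy Triplet Correspondence problem in composed image retrieval. It handles modality-inherent noise through Visual Invariant Composition and cross-modal correspondence noise through Bi-Objective Discriminative Learning, improving retrieval robustness.
Details
Motivation: Real-world CIR datasets contain annotation errors because triplet annotation is costly, which leads to noisy triplet correspondence. Existing methods do not adequately handle either modality-inherent noise or cross-modal correspondence noise.
Result: Experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
Insight: The contributions include categorizing CIR noise into cross-modal correspondence noise and modality-inherent noise and treating each in a targeted way: causal intervention via the Fourier transform for modality-inherent noise, and a dynamic decision boundary based on the loyalty degree for cross-modal correspondence noise.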
Abstract: Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that retrieves target images based on multimodal queries consisting of reference images and modification texts. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. However, in real-world scenarios, due to high triplet annotation costs, CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. To address this issue, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. However, modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle these issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), comprising two components, Visual Invariant Composition and Bi-Objective Discriminative Learning, specifically designed to handle the two types of noise. The former applies causal intervention on the visual side via Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter adopts collaborative optimization with both positive and negative samples, and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
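The abstract does not spell out how the FFT-based intervention is performed. A common Fourier-domain perturbation, used here purely as an illustrative assumption and not as INTENT's confirmed procedure, keeps an image's phase spectrum (which carries most of the structure) and mixes its amplitude spectrum with that of another image. The function name `fourier_amplitude_mix` and the mixing ratio `lam` are hypothetical.

```python
import numpy as np

def fourier_amplitude_mix(image, other, lam=0.5):
    """Blend the amplitude spectra of two images while keeping the phase of `image`.

    Illustrative assumption only: the abstract says FFT is used for intervention
    but does not give the exact scheme.
    """
    f_img = np.fft.fft2(image, axes=(0, 1))
    f_oth = np.fft.fft2(other, axes=(0, 1))
    # Interpolate amplitudes; lam = 0 leaves the image unchanged.
    amplitude = (1.0 - lam) * np.abs(f_img) + lam * np.abs(f_oth)
    phase = np.angle(f_img)
    mixed = amplitude * np.exp(1j * phase)
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))

rng = np.random.default_rng(0)
reference = rng.random((8, 8, 3))    # stand-in for a reference image
distractor = rng.random((8, 8, 3))   # stand-in for an unrelated image
intervened = fourier_amplitude_mix(reference, distractor, lam=0.3)
unchanged = fourier_amplitude_mix(reference, distractor, lam=0.0)
```

A model trained to give the same composed feature for `reference` and `intervened` would be pushed to ignore amplitude-level appearance factors, which is one plausible reading of "visual invariance" here.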
[230] Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting cs.CVPDF
Hyeonseo Jang, Hyuk Kwon, Kibok Lee
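TL;DR: This paper proposes Dynamic Prefix Weighting (DPW), a framework that strengthens continual learning for vision-language models in domain-class incremental learning scenarios. A gating module dynamically adjusts prefix weights, and adapters complement the prefixes and are used only when necessary, which improves adaptation to downstream tasks.
Details
Motivation: Existing parameter-efficient methods for incremental learning of vision-language models (such as prefix-tuning or adapters) typically normalize the weights of the additive vectors, ignoring that different input tokens need different degrees of adjustment. This limits performance.
Result: Experiments show that the method achieves state-of-the-art performance in domain-class incremental learning scenarios for vision-language models.
Insight: The innovation lies in assigning prefix weights dynamically: a gating module adjusts each weight according to the importance of the corresponding input token, and adapter output weights are derived as a residual of the prefix weights, giving finer-grained parameter adaptation. The idea may transfer to other incremental learning or parameter-efficient fine-tuning tasks.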
Abstract: We investigate recently introduced domain-class incremental learning scenarios for vision-language models (VLMs). Recent works address this challenge using parameter-efficient methods, such as prefix-tuning or adapters, which facilitate model adaptation to downstream tasks by incorporating task-specific information into input tokens through additive vectors. However, previous approaches often normalize the weights of these vectors, disregarding the fact that different input tokens require different degrees of adjustment. To overcome this issue, we propose Dynamic Prefix Weighting (DPW), a framework that dynamically assigns weights to prefixes, complemented by adapters. DPW consists of 1) a gating module that adjusts the weights of each prefix based on the importance of the corresponding input token, and 2) a weighting mechanism that derives adapter output weights as a residual of prefix-tuning weights, ensuring that adapters are utilized only when necessary. Experimental results demonstrate that our method achieves state-of-the-art performance in domain-class incremental learning scenarios for VLMs. The code is available at: https://github.com/YonseiML/dpw.
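The two DPW components can be pictured with a small sketch. It assumes, hypothetically, a sigmoid gate over token features and an adapter weight equal to one minus the prefix weight; the abstract only says the adapter weight is "a residual of prefix-tuning weights", so the exact form, the function name `dpw_combine`, and all shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpw_combine(tokens, prefix, adapter_out, gate_w, gate_b=0.0):
    """Per-token mix of a prefix vector and an adapter output (hypothetical form).

    tokens:      (N, d) input token features
    prefix:      (d,)   additive prefix vector
    adapter_out: (N, d) adapter output for each token
    gate_w:      (d,)   gating parameters
    """
    prefix_weight = sigmoid(tokens @ gate_w + gate_b)   # (N,), one weight per token
    adapter_weight = 1.0 - prefix_weight                # residual of the prefix weight
    out = (tokens
           + prefix_weight[:, None] * prefix
           + adapter_weight[:, None] * adapter_out)
    return out, prefix_weight, adapter_weight

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
prefix = rng.normal(size=8)
adapter_out = rng.normal(size=(4, 8))
gate_w = rng.normal(size=8)
out, w_prefix, w_adapter = dpw_combine(tokens, prefix, adapter_out, gate_w)
```

Under this reading, a token whose gate saturates near 1 is adapted almost entirely by the prefix, so the adapter contributes only where the prefix weight is low.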
[231] Decision-Aware Attention Propagation for Vision Transformer Explainability cs.CVPDF
Sehyeong Jo, Gangjae Jang, Haesol Park
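TL;DR: This paper proposes Decision-Aware Attention Propagation (DAP), an attribution method for improving the explainability of Vision Transformers (ViTs). It injects token importance estimated by gradient-based localization into layer-wise attention propagation, producing attribution maps that are more class-discriminative, compact, and faithful than those of conventional attention-based methods.
Details
Motivation: Existing attention-based explanation methods rely mainly on raw attention weights, which do not explicitly reflect the final decision and yield explanations with limited class discriminability. Gradient-based localization methods highlight class-specific evidence effectively but do not fully exploit the hierarchical attention propagation of transformers. The paper combines the strengths of both to make ViT predictions easier to interpret.
Result: Extensive experiments on Vision Transformer variants of different model scales show that DAP consistently outperforms existing baselines in both quantitative metrics and qualitative visualizations.
Insight: The core idea is to inject decision-relevant priors (token importance estimated by gradient-based localization) into the transformer's attention propagation, capturing both the structural flow of attention and the evidence most relevant to the final prediction. This points to decision-aware propagation as an effective direction for improving ViT interpretability.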
Abstract: Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet their prediction process remains difficult to interpret because information is propagated through complex interactions across layers and attention heads. Existing attention-based explanation methods provide an intuitive way to trace information flow. However, they rely mainly on raw attention weights, which do not explicitly reflect the final decision and often lead to explanations with limited class discriminability. In contrast, gradient-based localization methods are more effective at highlighting class-specific evidence, but they do not fully exploit the hierarchical attention propagation mechanism of transformers. To address this limitation, we propose Decision-Aware Attention Propagation (DAP), an attribution method that injects decision-relevant priors into transformer attention propagation. By estimating token importance through gradient-based localization and integrating it into layer-wise attention rollout, the method captures both the structural flow of attention and the evidence most relevant to the final prediction. Consequently, DAP produces attribution maps that are more class-sensitive, compact, and faithful than those generated by conventional attention-based methods. Extensive experiments across Vision Transformer variants of different model scales show that DAP consistently outperforms existing baselines in both quantitative metrics and qualitative visualizations, indicating that decision-aware propagation is an effective direction for improving ViT interpretability.
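To make "integrating token importance into layer-wise attention rollout" concrete, the sketch below starts from standard attention rollout (head-averaged attention mixed with the identity, multiplied across layers) and reweights each layer's attention columns by a per-token importance vector. The reweighting step is an assumption for illustration; the abstract does not state how DAP injects the prior, and `rollout_with_prior` is a hypothetical name.

```python
import numpy as np

def rollout_with_prior(attentions, token_importance, eps=1e-12):
    """Attention rollout with a per-token importance prior (illustrative only).

    attentions:       list of (heads, N, N) row-stochastic attention maps, one per layer
    token_importance: (N,) positive scores, e.g. from a gradient-based localizer
    """
    n = attentions[0].shape[-1]
    joint = np.eye(n)
    for attn in attentions:
        a = attn.mean(axis=0)                    # average over heads -> (N, N)
        a = a * token_importance[None, :]        # emphasize important source tokens
        a = a / (a.sum(axis=-1, keepdims=True) + eps)
        a = 0.5 * a + 0.5 * np.eye(n)            # account for residual connections
        joint = a @ joint                        # propagate through this layer
    return joint

rng = np.random.default_rng(0)
layers, heads, n_tokens = 3, 2, 5
raw = rng.normal(size=(layers, heads, n_tokens, n_tokens))
attn_maps = np.exp(raw) / np.exp(raw).sum(axis=-1, keepdims=True)
importance = rng.random(n_tokens) + 0.1
relevance = rollout_with_prior(list(attn_maps), importance)
cls_attribution = relevance[0]   # attribution of token 0 (e.g. [CLS]) to every input token
```

With a uniform `token_importance` this reduces to plain attention rollout, which makes the effect of the prior easy to isolate.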
[232] Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models cs.CVPDF
Zehua Zang, Xi Wang, Fuchun Sun, Xiao Xu, Lixiang Lium
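TL;DR: This paper proposes Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves the decision robustness of vision-language-action models under environmental shifts. It mitigates spurious correlations through uncertainty-based data augmentation and action voting, uses an adaptive scheduler to balance performance and efficiency, and learns a lightweight perturbation module that adjusts action logits from delayed feedback to correct overconfidence.
Details
Motivation: Vision-language-action models perform well in sequential decision-making but are fragile to subtle environmental changes, such as small shifts in object pose. The authors attribute this to trajectory overfitting: the model over-attends to spurious correlations between actions and entities and reproduces memorized action patterns.
Result: On LIBERO, PDF raises the success rate by 7.4%; on Atari, it raises the human-normalized score by 10.3. Both results exceed the vanilla VLA and the VLA with test-time adaptation, showing consistent gains.
Insight: The innovation is correcting overfitting and overconfidence without fine-tuning the base model, through test-time perturbation learning and a delayed-feedback mechanism. Viewed objectively, the uncertainty-based adaptive data augmentation and the lightweight perturbation module offer a practical and efficient path to reliable test-time adaptation for multimodal decision-making agents.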
Abstract: Vision-Language-Action models (VLAs) achieve remarkable performance in sequential decision-making but remain fragile to subtle environmental shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to the spurious correlation between actions and entities, then reproduce memorized action patterns. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates the spurious correlation through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting overconfidence. Experiments on LIBERO (+7.4% success rate) and Atari (+10.3 human normalized score) demonstrate consistent gains of PDF in task success over vanilla VLA and VLA with test-time adaptation, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-PDF.
[233] Can LLM-Generated Text Empower Surgical Vision-Language Pre-training? cs.CVPDF
Chengan Che, Chao Wang, Jiayuan Huang, Xinyue Chen, Luis C. Garcia-Peraza-Herrera
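TL;DR: This paper builds LIME, a large-scale surgical video-language dataset with LLM-generated text, and designs SurgLIME, a parameter-efficient vision-language pre-training framework that learns reliable cross-modal alignment from noisy text. Together they address the scalability bottleneck caused by the high cost of expert annotation in the surgical domain.
Details
Motivation: In surgical vision-language pre-training, the prohibitive cost of expert text annotation limits how far visual models can be extended to multimodal reasoning tasks.
Result: On the AutoLaparo and Cholec80 benchmarks, SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear-probing performance of the visual foundation model.
Insight: The innovation is using LLM-generated text to scale the dataset cheaply, paired with a pre-training framework that combines a LoRA-adapted dual-encoder architecture with automated confidence estimation. This mitigates the damage that errors in generated text (such as hallucinations) could do to the model's prior knowledge.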
Abstract: Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could potentially lead to catastrophically degraded pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments using noisy narratives. SurgLIME preserves foundational medical priors using a LoRA-adapted dual-encoder architecture and introduces an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment. Evaluations on the AutoLaparo and Cholec80 benchmarks show that SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear probing performance of the visual foundation model. Dataset, code, and models are publicly available at https://github.com/visurg-ai/SurgLIME.
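Down-weighting uncertain text during contrastive alignment can be sketched as a symmetric InfoNCE loss in which each video-text pair carries a confidence weight. How SurgLIME estimates confidence is not described in the abstract, so `conf` is simply an input here; the function name and the temperature value are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def weighted_contrastive_loss(img, txt, conf, temperature=0.07):
    """Symmetric InfoNCE where pair i contributes in proportion to conf[i].

    img, txt: (B, d) embeddings of matched video/text pairs
    conf:     (B,) non-negative confidence scores (assumed to be given)
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    image_to_text = -np.diag(log_softmax(logits, axis=1))
    text_to_image = -np.diag(log_softmax(logits, axis=0))
    per_pair = 0.5 * (image_to_text + text_to_image)
    return float((conf * per_pair).sum() / conf.sum())

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(6, 16))
txt_emb = rng.normal(size=(6, 16))
uniform = weighted_contrastive_loss(img_emb, txt_emb, np.ones(6))
scaled = weighted_contrastive_loss(img_emb, txt_emb, np.full(6, 2.0))
down_weighted = weighted_contrastive_loss(
    img_emb, txt_emb, np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.1]))
```

Because the loss is normalized by the sum of the weights, only the relative confidences matter, so a low-confidence narrative pulls on the encoders less without shrinking the overall loss scale.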
[234] Soft Label Pruning and Quantization for Large-Scale Dataset Distillation cs.CV | cs.AI | cs.LGPDF
Xiao Lingao, Yang He
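TL;DR: This paper proposes LPQLD (Label Pruning and Quantization for Large-scale Distillation) to address the excessive storage cost of soft labels in large-scale dataset distillation. By increasing the diversity of both the synthetic images and the supervisory signals, the method sharply reduces soft-label storage and, on ImageNet-1K and ImageNet-21K, lowers storage while improving accuracy.
Details
Motivation: In large-scale dataset distillation, the auxiliary soft labels take far more storage than the condensed images (for example, 30-40x more on ImageNet-1K and 200x more on ImageNet-21K), which defeats the purpose of dataset compression. The root causes are insufficient diversity in the synthetic images and limited diversity of supervisory signals during training.
Result: On ImageNet-1K, the method reduces soft-label storage by 78x while improving accuracy by up to 7.2%; on ImageNet-21K, it reduces storage by 500x while improving accuracy by up to 2.8%. Extensive experiments validate LPQLD across different network architectures and dataset distillation methods.
Insight: The innovations are: 1) class-wise batching and batch-normalization supervision to increase image diversity; 2) Label Pruning with Dynamic Knowledge Reuse to improve label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to improve augmentation-per-image diversity. Working together, these compress soft labels drastically while improving distillation performance.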
Abstract: Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40x larger on ImageNet-1K and 200x larger on ImageNet-21K than the condensed images, undermining the goal of dataset compression. We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and batch-normalization supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to improve label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to improve augmentation-per-image diversity. Our approach reduces soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K while improving accuracy by up to 7.2% and 2.8%, respectively. Extensive experiments validate the superiority of LPQLD across different network architectures and dataset distillation methods. Code is available at https://github.com/he-y/soft-label-pruning-quantization-for-dataset-distillation.
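A back-of-the-envelope sketch shows why soft labels dominate storage and how pruning and quantization shrink them. All counts below are hypothetical placeholders chosen for easy arithmetic, not figures from the paper, and the uniform quantizer is a generic stand-in for the paper's calibrated scheme.

```python
import numpy as np

def soft_label_bytes(num_images, labels_per_image, num_classes, bits_per_value):
    """Storage for one soft-label vector per (image, augmentation) pair."""
    return num_images * labels_per_image * num_classes * bits_per_value // 8

def quantize_probs(probs, bits=8):
    """Uniformly quantize probabilities in [0, 1] to integer codes (generic scheme)."""
    levels = 2 ** bits - 1
    dtype = np.uint8 if bits <= 8 else np.uint16
    return np.round(probs * levels).astype(dtype)

def dequantize_probs(codes, bits=8):
    levels = 2 ** bits - 1
    p = codes.astype(np.float64) / levels
    return p / p.sum(axis=-1, keepdims=True)   # renormalize to a distribution

# Hypothetical setting: 10,000 images, 1,000 classes.
full = soft_label_bytes(10_000, labels_per_image=300, num_classes=1000, bits_per_value=16)
compact = soft_label_bytes(10_000, labels_per_image=30, num_classes=1000, bits_per_value=8)
reduction = full / compact   # 10x from pruning label sets times 2x from 16-bit to 8-bit

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
recovered = dequantize_probs(quantize_probs(probs))
```

The two factors multiply, which is why combining fewer stored label sets with fewer bits per value can yield large overall reductions.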
[235] CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition cs.CVPDF
Xu Wang, Shengeng Tang, Wan Jiang, Yaxiong Wang, Lechao Cheng
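TL;DR: This paper proposes CanonSLR, a framework for multi-view continuous sign language recognition. It improves robustness to viewpoint changes through a frontal-view-anchored teacher-student learning strategy, sequence-level soft-target distillation, and temporal motion relational enhancement. The authors also build a general multi-view sign language data construction pipeline and use it to create two seven-view benchmarks, PT14-MV and CSL-MV, from PHOENIX-2014T and CSL-Daily.
Details
Motivation: Most existing continuous sign language recognition methods assume a single-view setting and are not robust enough to viewpoint changes in real scenes, so methods that handle multi-view data effectively are needed.
Result: Extensive experiments on PT14-MV and CSL-MV show that CanonSLR consistently outperforms existing methods in multi-view settings and is notably more robust on challenging non-frontal views.
Insight: The innovations include a frontal-view-anchored teacher-student strategy that provides canonical temporal supervision; sequence-level soft-target distillation that reduces cross-view semantic discrepancy; temporal motion relational enhancement that models motion-aware temporal relations; and a general multi-view data construction pipeline with benchmark datasets that give multi-view CSLR research a new foundation.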
Abstract: Continuous Sign Language Recognition (CSLR) has achieved remarkable progress in recent years; however, most existing methods are developed under single-view settings and thus remain insufficiently robust to viewpoint variations in real-world scenarios. To address this limitation, we propose CanonSLR, a canonical-view guided framework for multi-view CSLR. Specifically, we introduce a frontal-view-anchored teacher-student learning strategy, in which a teacher network trained on frontal-view data provides canonical temporal supervision for a student network trained on all viewpoints. To further reduce cross-view semantic discrepancy, we propose Sequence-Level Soft-Target Distillation, which transfers structured temporal knowledge from the frontal view to non-frontal samples, thereby alleviating gloss boundary ambiguity and category confusion caused by occlusion and projection variation. In addition, we introduce Temporal Motion Relational Enhancement to explicitly model motion-aware temporal relations in high-level visual features, strengthening stable dynamic representations while suppressing viewpoint-sensitive appearance disturbances. To support multi-view CSLR research, we further develop a universal multi-view sign language data construction pipeline that transforms original single-view RGB videos into semantically consistent, temporally coherent, and viewpoint-controllable multi-view sign language videos. Based on this pipeline, we extend PHOENIX-2014T and CSL-Daily into two seven-view benchmarks, namely PT14-MV and CSL-MV, providing a new experimental foundation for multi-view CSLR. Extensive experiments on PT14-MV and CSL-MV demonstrate that CanonSLR consistently outperforms existing approaches under multi-view settings and exhibits stronger robustness, especially on challenging non-frontal views.
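Soft-target distillation from a frontal-view teacher to a student seeing another viewpoint can be sketched as a temperature-scaled KL divergence averaged over the frames of a sequence. This is the generic knowledge-distillation form; the paper's sequence-level formulation may differ, and the function name, temperature, and the assumption that teacher and student outputs are frame-aligned are all illustrative.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def sequence_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean per-frame KL(teacher || student) with temperature scaling.

    Both inputs are (T, C): T frames and C gloss classes, assumed frame-aligned
    across views.
    """
    log_p_student = log_softmax(student_logits / temperature)
    log_p_teacher = log_softmax(teacher_logits / temperature)
    p_teacher = np.exp(log_p_teacher)
    kl_per_frame = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(temperature ** 2 * kl_per_frame.mean())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(12, 20))                    # frontal-view teacher outputs
student = teacher + 0.5 * rng.normal(size=(12, 20))    # non-frontal student outputs
loss = sequence_distillation_loss(student, teacher)
zero = sequence_distillation_loss(teacher, teacher)
```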
[236] DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery cs.CV | cs.LGPDF
Geet Sethi, Panav Shah, Ashutosh Gandhe, Soumitra Darshan Nayak
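TL;DR: This paper proposes DiffuSAM, a hybrid method for zero-shot object grounding in remote sensing imagery. It combines localization cues from diffusion models with advanced remote sensing segmentation models (such as RemoteSAM and SAM3) to produce more accurate bounding boxes. By exploiting the complementary strengths of generative diffusion models and foundation segmentation models, it achieves robust and adaptive localization in complex scenes. Experiments show a significant improvement in localization performance.
Details
Motivation: To explore the potential of diffusion models for object grounding in remote sensing imagery and to address the limited localization accuracy of existing methods in complex scenes.
Result: On zero-shot object grounding, the method improves Acc@0.5 by more than 14% over existing state-of-the-art methods, setting a new state of the art.
Insight: The innovation is combining the localization ability of generative diffusion models with the fine-grained segmentation ability of foundation segmentation models (the SAM family) in a complementary hybrid pipeline, which improves the robustness and accuracy of object grounding in remote sensing imagery.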
Abstract: Diffusion models have emerged as powerful tools for a wide range of vision tasks, including text-guided image generation and editing. In this work, we explore their potential for object grounding in remote sensing imagery. We propose a hybrid pipeline that integrates diffusion-based localization cues with state-of-the-art segmentation models such as RemoteSAM and SAM3 to obtain more accurate bounding boxes. By leveraging the complementary strengths of generative diffusion models and foundational segmentation models, our approach enables robust and adaptive object localization across complex scenes. Experiments demonstrate that our pipeline significantly improves localization performance, achieving over a 14% increase in Acc@0.5 compared to existing state-of-the-art methods.
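For readers unfamiliar with the reported metric, Acc@0.5 counts a prediction as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below assumes boxes in `(x1, y1, x2, y2)` format and a `>=` comparison; some benchmarks use a strict `>` instead.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the matching ground truth meets the threshold."""
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

preds = [(0, 0, 10, 10), (5, 0, 15, 10)]
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
accuracy = acc_at_iou(preds, gts)   # first box matches exactly, second has IoU 1/3
```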
[237] A Comparative Evaluation of Geometric Accuracy in NeRF and Gaussian Splatting cs.CV | cs.ROPDF
Mikolaj Zielinski, Eryk Vykysaly, Bartlomiej Biesiada, Jan Baturo, Mateusz Capala
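TL;DR: This paper presents an evaluation pipeline for neural rendering methods that focuses on geometric accuracy, together with a benchmark of 19 diverse scenes. It is designed to assess reconstruction methods systematically in terms of surface and shape fidelity, complementing traditional visual quality metrics.
Details
Motivation: Existing evaluation of neural rendering methods focuses on the visual quality of generated images and overlooks the fidelity of surface geometry, which is critical in applications such as robotics that need accurate geometry (for example, grasping and object manipulation).
Result: The paper builds a benchmark of 19 scenes and proposes an evaluation pipeline, but the abstract reports no specific quantitative results or comparisons with existing methods (for example, whether any method reaches state of the art).
Insight: The innovation is shifting the focus of evaluation from image quality to geometric accuracy, with a systematic geometric-fidelity evaluation framework and benchmark. This is valuable for advancing neural rendering in geometry-sensitive fields such as robotics.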
Abstract: Recent advances in neural rendering have introduced numerous 3D scene representations. Although standard computer vision metrics evaluate the visual quality of generated images, they often overlook the fidelity of surface geometry. This limitation is particularly critical in robotics, where accurate geometry is essential for tasks such as grasping and object manipulation. In this paper, we present an evaluation pipeline for neural rendering methods that focuses on geometric accuracy, along with a benchmark comprising 19 diverse scenes. Our approach enables a systematic assessment of reconstruction methods in terms of surface and shape fidelity, complementing traditional visual metrics.
[238] Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation cs.CVPDF
Yanjun Guo, Zhengqiang Zhang, Pengfei Wang, Xinyue Liang, Zhiyuan Ma
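TL;DR: This paper proposes a decoupled memory control framework to improve spatial consistency in long-horizon video generation. It separates memory modeling from video generation, uses a lightweight memory branch to learn spatial information from historical observations, and introduces a camera-aware gating mechanism that activates memory conditioning only when needed. This lowers training cost while improving content consistency on scene revisits and preserving generative capacity when exploring new scenes.
Details
Motivation: Existing methods usually couple memory modeling with video generation, which leads to inconsistent content on scene revisits and weaker generation when exploring new regions, even when trained on large amounts of annotated data. The paper addresses these limitations by decoupling memory conditioning from generation, improving spatial consistency while keeping the ability to generate novel scenes.
Result: Experiments show that, compared with existing methods, the approach is markedly more data-efficient while achieving state-of-the-art visual quality and spatial consistency.
Insight: The innovations include: 1) a decoupled memory control framework that separates memory conditioning from generation; 2) a hybrid memory representation that captures temporal and spatial cues; 3) a per-frame cross-attention mechanism that conditions each frame only on the most relevant historical information; 4) a camera-aware gating mechanism that dynamically controls the interaction between the memory and generation modules. Viewed objectively, the decoupled design reduces training cost and adds flexibility, and it may transfer to other generation tasks that need long-term consistency.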
Abstract: Spatially consistent long-horizon video generation aims to maintain temporal and spatial consistency along predefined camera trajectories. Existing methods mostly entangle memory modeling with video generation, leading to inconsistent content during scene revisits and diminished generative capacity when exploring novel regions, even when trained on extensive annotated data. To address these limitations, we propose a decoupled framework that separates memory conditioning from generation. Our approach significantly reduces training costs while simultaneously enhancing spatial consistency and preserving the generative capacity for novel scene exploration. Specifically, we employ a lightweight, independent memory branch to learn precise spatial consistency from historical observations. We first introduce a hybrid memory representation to capture complementary temporal and spatial cues from generated frames, then leverage a per-frame cross-attention mechanism to ensure each frame is conditioned exclusively on the most spatially relevant historical information, which is injected into the generative model to ensure spatial consistency. When generating new scenes, a camera-aware gating mechanism is proposed to mediate the interaction between memory and generation modules, enabling memory conditioning only when meaningful historical references exist. Compared with existing methods, our approach is highly data-efficient, and experiments demonstrate that it achieves state-of-the-art performance in terms of both visual quality and spatial consistency.
[239] Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation cs.CVPDF
Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu, Wei Song
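TL;DR: This paper proposes Instruction-as-State, which models instruction understanding as a state variable that evolves with the agent's perceptual state, and introduces the S-EGIU framework. Through coarse-to-fine segment activation and token-level semantic refinement, it dynamically adapts instruction semantics to changes in the visual environment, improving performance on embodied navigation tasks.
Details
Motivation: In vision-and-language navigation, instructions and observations are dynamically entangled. Existing models encode the instruction as a static global representation and cannot adapt its meaning to the current visual context, which limits navigation performance.
Result: The method achieves a +2.68% SPL gain on REVERIE Test Unseen and shows consistent efficiency gains across multiple VLN benchmarks, supporting the value of dynamic instruction-perception entanglement.
Insight: The innovation is modeling the instruction as a state variable and designing S-EGIU to perform coarse-grained segment activation followed by fine-grained token-level semantic refinement, so that instruction semantics evolve with the perceptual state. This offers a new approach to semantic understanding in embodied navigation.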
Abstract: Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of instruction shifts as the agent’s field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent’s perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation-guided token grounding and contextual modeling, sharpening its internal semantics under the current observation. Together, these stages maintain an instruction state that is continuously updated according to the agent’s perceptual state during navigation. S-EGIU delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen, and demonstrates consistent efficiency gains across multiple VLN benchmarks, underscoring the value of dynamic instruction–perception entanglement.
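The SPL figure quoted above refers to Success weighted by Path Length, a standard navigation metric: each successful episode is credited with the ratio of the shortest-path length to the length actually travelled, and failures score zero. A minimal sketch, with hypothetical episode tuples:

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_path_length, agent_path_length).
    Each episode contributes success * shortest / max(agent_path, shortest).
    """
    episodes = list(episodes)
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

demo_episodes = [
    (True, 10.0, 10.0),   # optimal route: contributes 1.0
    (True, 10.0, 20.0),   # reached the goal via a path twice as long: 0.5
    (False, 10.0, 5.0),   # stopped early without success: 0.0
]
demo_spl = spl(demo_episodes)
```

Because SPL penalizes detours, a gain in SPL reflects more efficient routes as well as more frequent success.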
[240] Is SAM3 ready for pathology segmentation? cs.CV | cs.AIPDF
Qiuyu Kong, Shakiba Sharifi, Zanxi Ruan, Yiming Wang, Marco Cristani
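TL;DR: This paper systematically evaluates Segment Anything Model 3 (SAM3) on pathology image segmentation at both tissue and nuclei scales. Testing on datasets such as NuInsSeg, PanNuke, and GlaS under different supervision settings (zero-shot, few-shot, fully supervised) and prompting strategies, it finds that SAM3 has clear limitations in pathology segmentation: text prompts activate nuclear concepts poorly, performance is sensitive to the type and number of visual prompts, the model is not robust to prompt noise, and a significant gap remains relative to task-specific adapter methods.
Details
Motivation: Traditional digital pathology segmentation methods suffer from high annotation costs and poor generalization. The paper examines the practical capability and limits of SAM3's newly introduced Promptable Concept Segmentation (especially text prompts) for pathology image segmentation.
Result: Evaluation on pathology datasets including NuInsSeg, PanNuke, and GlaS shows that SAM3 performs poorly in pathology segmentation: text prompts work badly, performance depends heavily on visual prompts, few-shot learning helps but robustness is lacking, and prompt-based usage trails task-trained adapter methods by a significant margin.
Insight: The paper's contribution is a systematic evaluation protocol for exploring SAM3's capability space in pathology segmentation in a structured way. Viewed objectively, its core finding is that SAM3's general-purpose segmentation ability has clear limits in the specialized pathology domain, and in particular that its text-prompt capability is of limited practical use there. This underlines the need for pathology-specific adaptation and gives later work a clear direction for improvement.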
Abstract: Is Segment Anything Model 3 (SAM3) capable of segmenting any pathology image? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. With this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings, including zero-shot, few-shot, and supervised, with varying prompting strategies. Our extensive evaluation on pathological datasets including NuInsSeg, PanNuke, and GlaS reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3’s boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.
[241] Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning cs.CVPDF
Xixi Liu, Jorge Lazo, Andreas Hallqvist, Mikael Johansson, Åse Johnsson
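TL;DR: This paper proposes a medical image understanding framework based on visual instruction tuning. It is pre-trained on large-scale open-source CT images paired with radiology reports to learn clinically meaningful visual-textual representations, and a survival prediction head is then added to improve survival prediction from CT images and clinical data while generating clinically meaningful language responses.
Details
Motivation: Assessing CT images traditionally depends on radiologists' expert knowledge, and converting visual information into textual summaries loses information. The paper aims to improve the accuracy and interpretability of survival prediction with a vision-language model.
Result: Experiments show that the method outperforms baseline methods in survival prediction, particularly when clinical data alone is less predictive. The abstract does not name specific benchmark datasets or state whether the method reaches state of the art.
Insight: The innovation is applying visual instruction tuning to 3D CT image understanding: pre-training learns clinical representations that are then adapted to downstream survival prediction, combining multimodal (image-text) fusion with interpretable output.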
Abstract: Accurate prognostication and risk estimation are essential for guiding clinical decision-making and optimizing patient management. While radiologist-assessed features from CT scans provide valuable indicators of disease severity and outcomes, interpreting such images requires expert knowledge, and translating rich visual information into textual summaries inevitably leads to information loss. In this work, we propose a vision-language framework for 3D CT image understanding that leverages large-scale open-sourced CT images paired with radiology reports through visual instruction tuning. This pre-training enables the model to learn clinically meaningful visual-textual representations, which can then be adapted to downstream survival prediction tasks. By incorporating a survival prediction head on top of the pre-trained model, our approach improves survival prediction from CT images and clinical data while generating clinically meaningful language responses to predefined questions. Experimental results demonstrate that our method outperforms baseline methods in survival prediction, particularly, when clinical data alone is less predictive. The code will be released upon acceptance.
[242] Geometry-Guided 3D Visual Token Pruning for Video-Language Models cs.CVPDF
Han Li, Zehao Huang, Jiahui Fu, Naiyan Wang, Si Liu
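TL;DR: This paper proposes Geo3DPruner, a geometry-guided 3D visual token pruning framework. It targets the inefficient inference and difficult context management caused by the very large number of visual tokens in spatial videos when multimodal large language models are extended to 3D scene understanding. The method models cross-frame relevance with geometry-aware global attention and then prunes in two stages, intra-voxel and inter-voxel, removing redundant tokens while preserving scene completeness.
Details
Motivation: Pre-trained video-language models can handle 3D reasoning by representing a 3D scene as an image sequence with depth and camera pose information (a spatial video), but the large number of visual tokens in spatial videos is a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, so they cannot remove inter-frame redundancy effectively while keeping the scene complete.
Result: Extensive experiments on multiple 3D scene understanding benchmarks show that Geo3DPruner retains over 90% of the original model's performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.
Insight: The innovation is a geometry-guided two-stage pruning framework: geometry-aware global attention first models cross-frame relevance; the intra-voxel stage then selects representative multi-view features within each voxel, and the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. This addresses both view consistency and spatial diversity and yields efficient, high-performing 3D visual token compression.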
Abstract: Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.
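The two pruning stages can be illustrated with a simplified sketch. It assumes each visual token already has a 3D position (from depth and camera pose) and a relevance score (standing in for the geometry-aware attention output). Choosing the top-scoring token per voxel and then applying farthest-point sampling across voxels are illustrative choices, not the paper's confirmed selection rules; `prune_tokens` is a hypothetical name.

```python
import numpy as np

def prune_tokens(positions, scores, voxel_size, keep_voxels):
    """Two-stage token selection (illustrative).

    positions: (N, 3) back-projected 3D location of each visual token
    scores:    (N,)   relevance score of each token
    Returns indices of the kept tokens.
    """
    voxel_ids = np.floor(positions / voxel_size).astype(np.int64)

    # Intra-voxel stage: keep the highest-scoring token inside each voxel.
    best = {}
    for idx, key in enumerate(map(tuple, voxel_ids)):
        if key not in best or scores[idx] > scores[best[key]]:
            best[key] = idx
    reps = np.array(sorted(best.values()))

    # Inter-voxel stage: farthest-point sampling for a spatially spread subset.
    chosen = [int(np.argmax(scores[reps]))]
    dist = np.linalg.norm(positions[reps] - positions[reps[chosen[0]]], axis=1)
    while len(chosen) < min(keep_voxels, len(reps)):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        new_dist = np.linalg.norm(positions[reps] - positions[reps[nxt]], axis=1)
        dist = np.minimum(dist, new_dist)
    return reps[chosen]

rng = np.random.default_rng(0)
positions = rng.random((200, 3)) * 4.0   # tokens scattered in a 4 m cube
scores = rng.random(200)
kept = prune_tokens(positions, scores, voxel_size=1.0, keep_voxels=10)
```

Tokens from different frames that observe the same surface fall into the same voxel, so the first stage removes cross-view duplicates and the second stage keeps coverage of the whole scene.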
[243] MARCO: Navigating the Unseen Space of Semantic Correspondence cs.CVPDF
Claudia Cuttano, Gabriele Trivigno, Carlo Masone, Stefan Roth
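TL;DR: MARCO is a unified model built on DINOv2 that improves the generalization of semantic correspondence. By combining a coarse-to-fine localization objective with a self-distillation framework, it turns sparse keypoint supervision into dense semantic correspondence, setting a new state of the art on several benchmarks with a smaller and faster model.
Details
Motivation: Current semantic correspondence methods based on dual-encoder architectures (such as DINOv2 combined with a diffusion backbone) are accurate but have billions of parameters and generalize poorly to keypoints unseen during training, which leaves a gap between benchmark performance and real-world usability.
Result: MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL. The gains are largest at fine-grained localization thresholds (+8.9 PCK@0.01), in generalization to unseen keypoints (+5.1 on SPair-U), and in generalization to unseen categories (+4.7 on MP-100), while the model is 3x smaller and 10x faster than diffusion-based approaches.
Insight: The core innovation is a unified training framework that couples a coarse-to-fine localization objective (improving spatial precision) with a self-distillation framework (extending sparse supervision beyond annotated regions). This strengthens both fine-grained localization and semantic generalization and converts a handful of keypoints into dense semantic correspondences.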
Abstract: Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which expands sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+8.9 PCK@0.01), strongest generalization to unseen keypoints (+5.1, SPair-U) and categories (+4.7, MP-100), while remaining 3x smaller and 10x faster than diffusion-based approaches. Code is available at https://github.com/visinf/MARCO .
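PCK@α (Percentage of Correct Keypoints) is the metric behind the reported gains: a transferred keypoint counts as correct when it lies within α times a reference size of the ground truth. The sketch below assumes the reference size is the longer side of the object bounding box, a common convention on SPair-71k; other benchmarks normalize by image size instead.

```python
import numpy as np

def pck(pred, gt, bbox_hw, alpha=0.1):
    """Fraction of keypoints within alpha * max(bbox height, width) of the ground truth.

    pred, gt: (K, 2) keypoint coordinates
    bbox_hw:  (height, width) of the object bounding box (assumed normalizer)
    """
    threshold = alpha * max(bbox_hw)
    dists = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return float((dists <= threshold).mean())

gt_points = [[10, 10], [50, 50], [90, 90]]
pred_points = [[10, 10.5], [50, 58], [90, 90]]   # errors of 0.5, 8, and 0 pixels
coarse = pck(pred_points, gt_points, bbox_hw=(100, 80), alpha=0.1)    # threshold 10 px
fine = pck(pred_points, gt_points, bbox_hw=(100, 80), alpha=0.01)     # threshold 1 px
```

The example shows why PCK@0.01 is much stricter than PCK@0.1: an 8-pixel error passes at the coarse threshold but fails at the fine one.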
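[244] LiquidTAD: An Efficient Method for Temporal Action Detection via Liquid Neural Dynamics cs.CV
Zepeng Sun, Naichuan Zheng, Hailun Xia, Junjie Wu, Liwei Bao
TL;DR: This paper proposes LiquidTAD, an efficient temporal action detection method that targets the high computational complexity and parameter redundancy of existing Transformer-based approaches. It replaces self-attention layers with parallelized ActionLiquid blocks, uses a closed-form continuous-time formulation to achieve linear complexity, and substantially reduces the parameter count while preserving the intrinsic physical prior of continuous-time dynamics.
Details
Motivation: Temporal action detection in untrimmed videos is currently dominated by Transformer-based architectures. Although they perform well, their quadratic computational complexity and substantial parameter redundancy limit deployment in resource-constrained environments. The paper aims to design a parameter-efficient framework with low computational complexity.
Result: On THUMOS-14, LiquidTAD reaches an average mAP of 69.46% with only 10.82M parameters, a 63% reduction relative to the ActionFormer baseline. Further evaluation on ActivityNet-1.3 and Ego4D confirms that the method achieves the best accuracy-efficiency trade-off and is more robust to temporal sampling variations, advancing the Pareto frontier of modern TAD frameworks.
Insight: The main novelty is introducing, for the first time, a parallelized liquid neural network architecture to temporal action detection. A closed-form continuous-time formulation recasts the model as a parallelizable operator, giving linear computational complexity while retaining the benefits of continuous-time dynamics. Learned time constants that adaptively modulate temporal sensitivity provide a robust way to handle varying action durations, striking a breakthrough balance between model efficiency and performance.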
Abstract: Temporal Action Detection (TAD) in untrimmed videos is currently dominated by Transformer-based architectures. While high-performing, their quadratic computational complexity and substantial parameter redundancy limit deployment in resource-constrained environments. In this paper, we propose LiquidTAD, a novel parameter-efficient framework that replaces cumbersome self-attention layers with parallelized ActionLiquid blocks. Unlike traditional Liquid Neural Networks (LNNs) that suffer from sequential execution bottlenecks, LiquidTAD leverages a closed-form continuous-time (CfC) formulation, allowing the model to be reformulated as a parallelizable operator while preserving the intrinsic physical prior of continuous-time dynamics. This architecture captures complex temporal dependencies with $O(N)$ linear complexity and adaptively modulates temporal sensitivity through learned time constants ($\tau$), providing a robust mechanism for handling varying action durations. To the best of our knowledge, this work is the first to introduce a parallelized LNN-based architecture to the TAD domain. Experimental results on the THUMOS-14 dataset demonstrate that LiquidTAD achieves a highly competitive Average mAP of 69.46% with only 10.82M parameters – a 63% reduction compared to the ActionFormer baseline. Further evaluations on ActivityNet-1.3 and Ego4D benchmarks confirm that LiquidTAD achieves an optimal accuracy-efficiency trade-off and exhibits superior robustness to temporal sampling variations, advancing the Pareto frontier of modern TAD frameworks.
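[245] Spike-NVPT: Learning Robust Visual Prompts via Bio-Inspired Temporal Filtering and Discretization cs.CV
Qiugang Zhan, Anning Jiang, Ran Tao, Ao Ma, Xiangyu Zhang
TL;DR: This paper proposes Spike-NVPT, a noise-robust visual prompt tuning method that addresses the sensitivity of conventional continuous, dense prompts to input noise. Using bio-inspired temporal filtering and discretization, it filters noise through the integrate-and-fire mechanism of spiking neurons and converts the filtered signals into sparse binary prompts, improving robustness while preserving inference efficiency.
Details
Motivation: In prompt-tuning-based adaptation of pre-trained vision models, continuous and dense prompts tend to be sensitive to input noise and to overfit task-irrelevant details. The goal is to improve robustness while remaining parameter-efficient.
Result: Experiments show that Spike-NVPT significantly outperforms conventional methods in robustness, with a maximum improvement of 11.2%, while maintaining competitive accuracy on clean datasets.
Insight: The novelty is the first use of spiking neurons to fine-tune conventional ANN-based vision models. A Signal Filtering Layer and a Spike Discretization Unit apply bio-inspired temporal processing and sparse binarization as a strong regularizer, pushing the model to rely on discriminative and robust features, with no additional computational overhead at inference.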
Abstract: Pre-trained vision models have found widespread application across diverse domains. Prompt tuning-based methods have emerged as a parameter-efficient paradigm for adapting pre-trained vision models. While effective on standard benchmarks, the continuous and dense nature of learned prompts can lead to sensitivity against input noise, as the high-capacity prompts tend to overfit task-irrelevant details. To address this trade-off, we propose Spike-NVPT, a noise-robust visual prompt tuning method. Specifically, we design a Signal Filtering Layer based on spiking neurons, which uses the integrate-and-fire (IF) mechanism to accumulate task-relevant signals over time and filter transient noise fluctuations. A subsequent Spike Discretization Unit converts filtered signals into sparse binary prompts. This discretization acts as a strong regularizer, forcing the model to anchor decision boundaries on the most discriminative and robust features. Notably, the resulting binary prompts remain static during deployment, ensuring zero additional computational overhead during inference. Experimental results demonstrate that Spike-NVPT achieves superior robustness performance, with a maximum improvement of 11.2% over conventional methods, and retains competitive accuracy on clean datasets. To the best of our knowledge, this is the first attempt to leverage spiking neurons for fine-tuning traditional artificial neural network (ANN)-based visual models.
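The integrate-and-fire (IF) mechanism mentioned in this abstract can be illustrated with a few lines of Python. This is a generic IF neuron sketch, not the paper's Signal Filtering Layer: the threshold value, the soft reset (subtracting the threshold after a spike), and the input layout are assumptions made for illustration.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Run integrate-and-fire units over T time steps.

    inputs: list of T lists, each holding one input current per unit.
    Returns a list of T lists of binary spikes (0 or 1).
    """
    num_units = len(inputs[0])
    membrane = [0.0] * num_units
    spikes = []
    for step in inputs:
        fired = []
        for i, current in enumerate(step):
            membrane[i] += current          # integrate the input over time
            if membrane[i] >= threshold:    # fire once enough has accumulated
                fired.append(1)
                membrane[i] -= threshold    # soft reset (assumed variant)
            else:
                fired.append(0)
        spikes.append(fired)
    return spikes
```

In this sketch a unit receiving a steady input of 0.6 per step spikes on the second step, while a unit receiving a fluctuation that averages out near zero never spikes, which is the filtering behavior the abstract attributes to the IF mechanism.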
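[246] Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection cs.CV
Sa Zhu, Wanqian Zhang, Lin Wang, Jinchao Zhang, Cong Wang
TL;DR: This paper proposes DFAlign, the first framework to use diffusion-based denoising to generate foreground knowledge that guides action-video alignment in open-vocabulary temporal action detection (OV-TAD). With three modules (semantic-unify conditioning, background-suppress denoising, and foreground-prompt alignment), it mitigates the semantic imbalance between abstract action labels and complex video content and improves detection of unseen action categories.
Details
Motivation: Existing OV-TAD methods struggle to balance the semantic gap between concise, abstract action labels and rich, complex video content, which introduces semantic noise and misleads cross-modal alignment. A mechanism that generates intermediate semantic anchors is needed to bridge this gap.
Result: The method achieves state-of-the-art performance on two OV-TAD benchmarks.
Insight: The paper uses a diffusion model to generate foreground knowledge. The denoising process progressively removes background redundancy from videos, the resulting foreground knowledge serves as a semantic anchor between video and text representations, and injecting it into text representations as prompt tokens steers the model toward action-relevant segments, yielding more precise cross-modal alignment.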
Abstract: Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video content, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge for the guidance of action-video alignment. Following a "conditioning, denoising, and aligning" paradigm, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge serves as an effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module to inject the extracted foreground knowledge as prompt tokens into text representations, guiding the model's attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code repository is available at https://anonymous.4open.science/r/Code-2114/.
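[247] EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV | cs.AI
Yongrui Heng, Chaoya Jiang, Han Yang, Shikun Zhang, Wei Ye
TL;DR: This paper proposes EVE, a framework that achieves verifiable self-evolution of multimodal large language models (MLLMs) through executable visual transformations. It avoids the performance degradation of pseudo-label methods and uses a Challenger-Solver dual-policy architecture to dynamically generate diverse visual transformation tasks of increasing difficulty, continually improving model performance.
Details
Motivation: To address the key challenges of MLLM self-evolution: pseudo-label methods degrade progressively as model predictions drift, while template-based methods are confined to a static set of transformations and cannot adapt in difficulty or diversity.
Result: Extensive experiments show that EVE consistently surpasses existing self-evolution methods across multiple benchmarks, establishing a robust and scalable paradigm for verifiable MLLM self-evolution.
Insight: The novelty lies in bypassing pseudo-labels entirely. Executable visual transformations generate VQA problems with absolute ground-truth answers, and a multi-dimensional reward system guides the Challenger policy to dynamically expand its queue of code examples, enabling co-evolution of the two policies and preventing mode collapse.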
Abstract: Self-evolution of multimodal large language models (MLLMs) remains a critical challenge: pseudo-label-based methods suffer from progressive quality degradation as model predictions drift, while template-based methods are confined to a static set of transformations that cannot adapt in difficulty or diversity. We contend that robust, continuous self-improvement requires not only deterministic external feedback independent of the model’s internal certainty, but also a mechanism to perpetually diversify the training distribution. To this end, we introduce EVE (Executable Visual transformation-based self-Evolution), a novel framework that entirely bypasses pseudo-labels by harnessing executable visual transformations continuously enriched in both variety and complexity. EVE adopts a Challenger-Solver dual-policy architecture. The Challenger maintains and progressively expands a queue of visual transformation code examples, from which it synthesizes novel Python scripts to perform dynamic visual transformations. Executing these scripts yields VQA problems with absolute, execution-verified ground-truth answers, eliminating any reliance on model-generated supervision. A multi-dimensional reward system integrating semantic diversity and dynamic difficulty calibration steers the Challenger to enrich its code example queue while posing progressively more challenging tasks, preventing mode collapse and fostering reciprocal co-evolution between the two policies. Extensive experiments demonstrate that EVE consistently surpasses existing self-evolution methods, establishing a robust and scalable paradigm for verifiable MLLM self-evolution. The code is available at https://github.com/0001Henry/EVE .
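[248] OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation cs.CV
Lei Zhu, Xing Cai, Yingjie Chen, Yiheng Li, Binxin Yang
TL;DR: This paper presents the OmniHuman dataset and the OHBench benchmark, which target the difficulty that existing audio-video joint generation models have in producing high-fidelity human-centric videos in complex real-world scenes. OmniHuman is a large-scale, multi-scene dataset with hierarchical annotations (video-level scenes, frame-level interactions, and individual-level attributes) collected through an automated pipeline. OHBench is a three-level evaluation system that introduces metrics highly consistent with human perception and provides a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.
Details
Motivation: Existing audio-video generation models still struggle to generate high-fidelity human-centric videos in complex real-world physical scenes. The root cause is structural deficiencies in existing datasets along three dimensions: global scene and camera diversity, interaction modeling (person-person and person-object), and individual attribute alignment.
Result: The paper introduces the OmniHuman dataset and the OHBench benchmark, but the abstract reports no specific quantitative results (such as comparisons with SOTA models). OHBench aims to provide a scientific diagnostic framework whose metrics are highly consistent with human perception, filling the gap in existing benchmarks for comprehensive evaluation of global scenes, relational interactions, and individual attributes.
Insight: The main contributions are a large-scale, multi-scene, hierarchically annotated dataset for fine-grained human modeling (OmniHuman) and an accompanying three-level benchmark with human-aligned metrics (OHBench), which together address the bottlenecks of human-centric video generation at both the data and evaluation levels.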
Abstract: Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.
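[249] AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation cs.CV | cs.AI
Haoyue Tan, Shengnan Wang, Yulin Qiao, Juncheng Zhang, Youhui Bai
TL;DR: This paper proposes AdaCluster, a training-free adaptive clustering framework for accelerating generation with video diffusion transformers (DiTs). It clusters query vectors and key vectors separately, preserving angular similarity for queries and Euclidean similarity for keys, and adapts to heterogeneous token distributions across layers, substantially reducing the quadratic cost of attention while preserving model accuracy.
Details
Motivation: Video diffusion transformers (DiTs) suffer from prohibitive inference latency because of the quadratic computational complexity of attention. Existing sparse attention methods either ignore semantic similarity or fail to adapt to heterogeneous token distributions across layers, which degrades model performance.
Result: On CogVideoX-2B, HunyuanVideo, and Wan-2.1, experiments on a single A40 GPU show speedups of 1.67x to 4.31x with negligible quality degradation.
Insight: The novelty is a training-free adaptive clustering framework that uses different similarity-preserving clustering methods for queries and keys, together with threshold-based adaptive clustering and efficient critical-cluster selection, to better fit the token distribution of each layer and maintain accuracy while accelerating inference.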
Abstract: Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering method to query vectors for higher compression, and designs a Euclidean-similarity-preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. Experiments on CogVideoX-2B, HunyuanVideo, and Wan-2.1 on one A40 GPU demonstrate speedups of 1.67x to 4.31x with negligible quality degradation.
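[250] EAST: Early Action Prediction Sampling Strategy with Token Masking cs.CV
Iva Sović, Ivan Martinović, Marin Oršić
TL;DR: This paper proposes EAST, a framework for early action prediction, which aims to predict an action from incomplete video observations. It includes a randomized training strategy that samples a time step separating observed from unobserved video frames, so a single model generalizes seamlessly to all test-time observation ratios. Joint learning on observed and future (oracle) representations significantly boosts performance, and the proposed token masking procedure improves scalability by halving memory usage and speeding up training 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101.
Details
Motivation: Early action prediction is highly challenging because visual evidence is limited, so models must reason from incomplete observations. The paper addresses this with a simple and efficient framework that improves the handling of incomplete observations.
Result: EAST sets a new state of the art (SOTA) on NTU60, SSv2, and UCF101, surpassing the previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.
Insight: The contributions include a randomized training strategy that lets the model generalize across observation ratios; joint learning on observed and future representations, which boosts performance and even lets an encoder-only model excel; and a token masking procedure that markedly improves training efficiency and scalability, halving memory and doubling training speed with negligible accuracy loss.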
Abstract: Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify key components when training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.
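The randomized training strategy in this abstract, which samples a time step separating observed from unobserved frames, can be sketched as follows. This is a minimal illustration under assumptions: the function name `split_clip` is hypothetical, and drawing the split point uniformly at random is a guess, since the abstract does not state the sampling distribution.

```python
import random


def split_clip(frames, rng=random):
    """Split a clip at a randomly sampled time step.

    Returns (observed, future): the model sees `observed`, while `future`
    holds the unobserved frames. Uniform sampling is an assumption.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to split a clip")
    # Keep at least one frame on each side of the split.
    t = rng.randint(1, len(frames) - 1)
    return frames[:t], frames[t:]
```

Because a new split is drawn for every training sample, a single model is exposed to many observation ratios, which matches the abstract's claim of generalizing across all test-time ratios.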
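[251] Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation cs.CV
Chao Yuan, Yujian Zhao, Haoxuan Xu, Guanglin Niu
TL;DR: This paper proposes an LLM-driven semantic compensation framework (MVR) to make text-to-image person retrieval more robust. It uses multi-view semantic reformulation and feature compensation to address "Expression Drift" caused by the diversity of natural language expressions and the implicitness of visual semantics, improving the consistency of cross-modal representations.
Details
Motivation: In text-to-image person retrieval, the diversity of natural language expressions and the implicitness of visual semantics cause "Expression Drift": semantically equivalent texts show significant feature discrepancies in the embedding space because of phrasing variations, which degrades the robustness of image-text alignment.
Result: Experiments show that the method improves the accuracy of the original model without training and achieves state-of-the-art (SOTA) performance on three text-to-image person retrieval datasets.
Insight: The contributions include: 1) Multi-View Reformulation (MVR) with a dual-branch prompting strategy (key feature guidance and diversity-aware rewriting) that generates semantically equivalent yet distributionally diverse text variants; 2) a training-free latent-space compensation mechanism that suppresses noise through multi-view feature mean-pooling and residual connections, capturing "Semantic Echoes"; 3) multi-perspective image descriptions generated by a vision-language model (VLM) and enhanced through shared text reformulation to close visual semantic gaps. Viewed objectively, the method combines the generative ability of LLMs with feature-space compensation, offering an efficient, training-free way to improve the robustness of cross-modal retrieval.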
Abstract: In text-to-image person retrieval tasks, the diversity of natural language expressions and the implicitness of visual semantics often lead to the problem of Expression Drift, where semantically equivalent texts exhibit significant feature discrepancies in the embedding space due to phrasing variations, thereby degrading the robustness of image-text alignment. This paper proposes a semantic compensation framework (MVR) driven by Large Language Models (LLMs), which enhances cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The core methodology comprises three components. (1) Multi-View Reformulation (MVR): a dual-branch prompting strategy combines key feature guidance (extracting visually critical components via feature similarity) and diversity-aware rewriting to generate semantically equivalent yet distributionally diverse textual variants. (2) Textual Feature Robustness Enhancement: a training-free latent space compensation mechanism suppresses noise interference through multi-view feature mean-pooling and residual connections, effectively capturing "Semantic Echoes". (3) Visual Semantic Compensation: a VLM generates multi-perspective image descriptions, which are further enhanced through shared text reformulation to address visual semantic gaps. Experiments demonstrate that our method improves the accuracy of the original model without training and achieves state-of-the-art performance on three text-to-image person retrieval datasets.
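The training-free compensation step, described as multi-view feature mean-pooling with a residual connection, might look like the sketch below. It is an illustration under assumptions, not the authors' code: the function name `compensate`, the `weight` on the pooled term, and the final L2 normalization are hypothetical, and the embeddings are assumed to come from an existing text encoder.

```python
import math


def compensate(original, variants, weight=0.5):
    """Fuse an original text embedding with the mean of its reformulations.

    original: embedding of the original query (list of floats).
    variants: embeddings of the LLM-rewritten queries (list of lists).
    weight:   hypothetical scale applied to the pooled variants.
    """
    dim = len(original)
    # Mean-pool the multi-view (rewritten) features.
    pooled = [sum(v[d] for v in variants) / len(variants) for d in range(dim)]
    # Residual connection: keep the original feature and add the pooled one.
    fused = [o + weight * p for o, p in zip(original, pooled)]
    # L2-normalize so cosine similarity can be used for retrieval (assumed).
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused]
```

Averaging over several paraphrases damps phrasing-specific noise, while the residual term keeps the fused feature anchored to the original query.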
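[252] ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting cs.CV | cs.CL
Clayton Fields, Casey Kennington
TL;DR: This paper presents ESsEN, a compact vision-language model aimed at the challenge of training lightweight vision-language models in low-resource settings. Through a systematic study of two-tower encoder architectures, the incorporation of traditional convolutional networks, and the design of the cross-modal fusion module, the model matches larger models on several discriminative English tasks with far fewer parameters.
Details
Motivation: Current vision-language models usually have tens of billions of parameters, making them unsuitable for resource-constrained settings such as edge devices or independent robotic platforms, and there is little research on lightweight models or on training with small datasets. Inspired by the data sparsity of language learning in child development, the paper sets out to develop compact and efficient vision-language models systematically.
Result: Experiments show that, in low-resource settings, two-tower encoders outperform one-tower encoders on discriminative English tasks; that a two-tower transformer architecture incorporating convolutional networks can be parameter-efficient; and that the shape and size of the cross-modal fusion module can vary without affecting performance. ESsEN performs on par with other models on several tasks using only a fraction of the parameters. The abstract does not name specific benchmarks.
Insight: The contributions include a systematic exploration of low-resource vision-language modeling that demonstrates the advantage of two-tower encoders under resource constraints; the integration of traditional convolutional networks into a transformer to improve parameter efficiency; and evidence that the cross-modal fusion module is flexible. Viewed objectively, the study offers model design guidance for resource-limited settings and makes vision-language modeling more accessible.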
Abstract: Vision-language modeling is rapidly increasing in popularity with an ever expanding list of available models. In most cases, these vision-language models have parameters in the tens of billions, which is necessary for some needs, but in many cases smaller models are necessary (e.g., on edge devices or independent robotic platforms). Unfortunately, there is little research in producing light-weight models or in training them with small datasets. Inspired by the language learning progression and data sparsity in child development, in this paper, we address both of these goals in a systematic fashion. We show that two-tower encoder models are superior to one-tower encoders in low-resource settings for discriminative English tasks. We show also that incorporating traditional convolutional networks into the two-tower transformer architecture can help produce parameter efficient vision-language models. Finally, we show that the cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. In addition, we present ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources that performs as well on several tasks with only a fraction of the parameters compared to other models. The experimental results and the tools we present here make vision-language modeling more accessible to a wider variety of researchers.
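[253] Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models cs.CV | cs.AI
Yakoub Bazi, Mohamad M. Al Rahhal, Mansour Zuair, Faroun Mohamed
TL;DR: This paper re-evaluates change visual question answering (Change VQA) on remote sensing imagery. On the CDVQA benchmark, under a unified LoRA fine-tuning setup, it compares the structured vision-language model Qwen3-VL with the native multimodal model Qwen3.5. Modern multimodal models surpass earlier specialized baselines, performance does not grow monotonically with model size, and native multimodal models are more effective than structured vision-language pipelines on this task.
Details
Motivation: To address natural-language question answering about semantic changes between bi-temporal remote sensing images, and to explore the potential of modern multimodal models on the underexplored Change VQA task.
Result: On the official CDVQA test splits, recent vision-language models surpass earlier specialized baselines; performance does not grow monotonically with model size; and the native multimodal model Qwen3.5 outperforms the structured vision-language model Qwen3-VL.
Insight: For language-driven semantic change reasoning in remote sensing imagery, tightly integrated multimodal backbones (e.g., single-stage alignment with a hybrid decoder) contribute more to performance than model scale or explicit multi-depth visual conditioning. This challenges the "bigger is better" assumption and highlights the importance of architectural design.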
Abstract: Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not scale monotonically with model size, and that native multimodal models are more effective than structured vision-language pipelines for this task. These findings indicate that tightly integrated multimodal backbones contribute more to performance than scale or explicit multi-depth visual conditioning for language-driven semantic change reasoning in RS imagery.
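[254] OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation cs.CV | cs.CL | cs.RO
Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li
TL;DR: OneVL is a vision-language model for autonomous driving that compresses reasoning into compact latent tokens and adds a visual world model decoder that predicts future frames. With single-step parallel inference it matches the speed of answer-only prediction while exceeding the accuracy of explicit chain-of-thought methods.
Details
Motivation: To address the prohibitive latency that autoregressive chain-of-thought reasoning imposes on real-time deployment, and the inability of purely linguistic latent representations to capture the causal dynamics of driving.
Result: Across four benchmarks, OneVL is the first latent chain-of-thought method to surpass explicit chain-of-thought, achieving state-of-the-art accuracy at answer-only latency.
Insight: Supervising the latent space with dual auxiliary decoders (a language decoder and a visual world model decoder) forces the latent tokens to internalize the causal dynamics of road geometry, agent motion, and environmental change, so that tighter compression yields more generalizable representations than verbose token-by-token reasoning.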
Abstract: Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
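[255] Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions cs.CV | cs.AI
Kecheng Zhang, Zongxin Yang, Mingfei Han, Haihong Hao, Yunzhi Zhuge
TL;DR: This paper proposes a new framework for online video understanding that addresses the challenges conventional offline video LLMs face in streaming settings. The framework decouples reasoning control from memory integration and is instantiated as a model named Thinking-QwenVL with two core components: the Active Thinking Decision Maker (ATDM), which makes transparent, evidence-aligned decisions about when to respond, and the Hierarchical Progressive Semantic Integration (HPSI) module, which efficiently builds a global cognitive state under a limited compute budget.
Details
Motivation: Visual agents in online streaming video scenarios must respond to queries in a timely manner. The challenges include a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets.
Result: On StreamingBench, Thinking-QwenVL raises the accuracy of the previous state of the art from 67.63% to 71.60%.
Insight: The novelty lies in a framework design that decouples reasoning control from memory integration. ATDM externalizes progress and confidence metrics to make transparent, evidence-aligned decisions, while HPSI uses learnable multi-level aggregation tokens to build a global state efficiently, offering a transparent and efficient approach to online video analysis.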
Abstract: Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress ($\rho$) and confidence ($c$) metrics. This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Our approach sets a new standard on key online video understanding benchmarks, achieving strong performance of 71.6% on StreamingBench and 46.9% on OVOBench, demonstrating a robust solution for evidence-aligned and transparent online video analysis. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63% to 71.60% on the StreamingBench benchmark.
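The idea of timing a response from observable progress and confidence signals can be made concrete with a toy decision rule. This sketch is purely illustrative: the abstract does not describe how ATDM combines the two metrics, so the function name `first_response_time`, the thresholds `rho_min` and `c_min`, and the rule of requiring both to be exceeded are assumptions.

```python
def first_response_time(progress, confidence, rho_min=0.8, c_min=0.7):
    """Return the first step at which both signals clear their thresholds.

    progress, confidence: per-step values in [0, 1] (hypothetical inputs
    standing in for the progress and confidence metrics in the abstract).
    Returns None if the evidence never becomes sufficient.
    """
    for step, (rho, c) in enumerate(zip(progress, confidence)):
        if rho >= rho_min and c >= c_min:
            return step
    return None
```

Under this rule, a stream with progress `[0.2, 0.6, 0.9, 1.0]` and confidence `[0.9, 0.5, 0.8, 0.9]` triggers a response at step 2, the first step where both signals are high enough.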
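[256] XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV | cs.MM | cs.RO
Kangan Qian, ChuChu Xie, Yang Zhong, Jingrui Pang, Siwen Jiao
TL;DR: This paper proposes XEmbodied, a cloud-side foundation model designed to strengthen the geometric and physical awareness of vision-language-action (VLA) models in embodied environments. Through a structured 3D Adapter and an Efficient Image-Embodied Adapter, the model integrates geometric representations and physical signals (such as occupancy grids and 3D bounding boxes) as intrinsic knowledge rather than auxiliary input. Combined with a progressive domain curriculum and reinforcement learning post-training, it preserves general capabilities while demonstrating robust performance across 18 public benchmarks.
Details
Motivation: Current cloud pipelines for training VLA models rely on generic vision-language models (VLMs) that are pretrained on 2D image-text data and lack geometric reasoning and domain semantics, creating a mismatch with the demands of complex embodied environments.
Result: XEmbodied demonstrates robust performance across 18 public benchmarks, significantly improving spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization, and is suited to large-scale scenario mining and embodied visual question answering.
Insight: The novelty is integrating geometric representations (such as 3D structure) and physical signals (such as occupancy grids) as intrinsic components of the model rather than external auxiliary inputs, and using structured adapters and a domain-adaptive training strategy to strengthen perception and reasoning in embodied environments.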
Abstract: Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.
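[257] S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models cs.CV
Nitish Shukla, Surgan Jandial, Arun Ross
TL;DR: This paper proposes S2H-DPO, a simple-to-hard learning framework for strengthening the multi-image reasoning of vision-language models (VLMs). It constructs preference data at three levels (single-image localized reasoning, multi-image localized comparison, and global visual search) and trains the model systematically to address the weaknesses of existing methods in global visual search and autonomous cross-image comparison. Experiments show significant gains in multi-image reasoning on LLaVA and Qwen-VL models while preserving single-image reasoning ability.
Details
Motivation: Existing multi-image alignment methods for VLMs have a critical capability gap: they are largely limited to localized reasoning with pre-specified image indices (e.g., "Look at Image 3 and…") and lack the basic skills of global visual search and autonomous cross-image comparison, which limits effective multi-image reasoning.
Result: Extensive evaluation on LLaVA and Qwen-VL models shows that the method significantly improves multi-image reasoning, outperforming baselines across multiple benchmarks while maintaining strong single-image reasoning, thereby advancing the state of the art in holistic visual preference alignment.
Insight: The novelty is a simple-to-hard learning framework based on prompt-driven complexity, which builds hierarchical multi-image preference data rather than relying on model-specific attributes such as hallucinations or attention heuristics. The approach is model-agnostic and improves single-image and multi-image understanding together.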
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices ("Look at Image 3 and…"), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data significantly enhances multi-image reasoning performance, yielding significant improvements over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while simultaneously strengthening multi-image understanding capabilities, thus advancing the state of the art for holistic visual preference alignment.
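[258] UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models cs.CV | cs.LG
Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang
TL;DR: This paper proposes UDM-GRPO, the first framework to combine Uniform Discrete Diffusion Models (UDM) with reinforcement learning (RL), addressing the training instability and limited gains seen when the two are combined directly. It treats the final clean sample as the action and reconstructs trajectories via the diffusion forward process to align probability paths, providing more accurate and stable optimization signals. It also introduces the Reduced-Step and CFG-Free strategies to improve training efficiency. The method significantly improves base model performance on multiple text-to-image (T2I) tasks, and its generalization is validated on an OCR benchmark.
Details
Motivation: Uniform Discrete Diffusion Models (UDM) show promise for discrete generative modeling, but their combination with reinforcement learning remains largely unexplored. The authors observe that applying GRPO directly to UDM leads to unstable training and marginal gains, so a new framework is needed.
Result: On multiple T2I tasks the method yields significant gains: GenEval accuracy rises from 69% to 96% and PickScore from 20.46 to 23.81, reaching state-of-the-art (SOTA) performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, validating the method's generalization.
Insight: The core ideas are: 1) treating the final clean sample as the RL action to obtain more accurate and stable optimization signals; 2) reconstructing trajectories via the diffusion forward process so that probability paths align better with the pretraining distribution. The Reduced-Step and CFG-Free strategies further improve training efficiency. Viewed objectively, this is the first framework to combine UDM and RL stably, offering a new approach to the difficulties of training discrete diffusion models with RL.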
Abstract: Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
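As background for the GRPO component, the group-relative advantage at the heart of Group Relative Policy Optimization can be sketched as below. This shows only the generic normalization step (each reward is standardized against the other samples generated for the same prompt), not the UDM-specific trajectory reconstruction; the `eps` value is an assumption, and the rewards are taken to be scores of the final clean samples, following the abstract's first insight.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt.

    rewards: one scalar reward per sampled output (here assumed to be
    computed on the final clean sample of each diffusion rollout).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples that score above the group mean receive positive advantages and are reinforced; those below it are discouraged, with no separate value network required.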
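[259] Advancing Vision Transformer with Enhanced Spatial Priors cs.CV
Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He
TL;DR: This paper proposes the Euclidean enhanced Vision Transformer (EVT), a new vision backbone that addresses the lack of explicit spatial priors and the high computational complexity of self-attention in Vision Transformers (ViT). EVT introduces Euclidean distance decay to improve spatial modeling and replaces decomposed attention with a spatially independent grouping approach, improving flexibility and performance.
Details
Motivation: To address the core limitations of self-attention in Vision Transformers (ViT), namely the lack of explicit spatial priors and quadratic computational complexity, and thereby improve applicability to computer vision tasks.
Result: On ImageNet-1k image classification, EVT achieves 86.6% top-1 accuracy without additional training data. Extensive experiments on object detection, instance segmentation, and semantic segmentation also demonstrate strong performance.
Insight: The novelty is using Euclidean distance decay (instead of the Manhattan distance used in RMT) to model spatial relationships more reasonably, and adopting a spatially independent grouped attention mechanism, which provides a more flexible and efficient way to integrate spatial priors and improves the model's adaptability and expressiveness.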
Abstract: In recent years, the Vision Transformer (ViT) has garnered significant attention within the computer vision community. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and suffers from quadratic computational complexity, limiting its applicability. To address these issues, we previously proposed RMT, a robust vision backbone with explicit spatial priors for general purposes. RMT utilizes Manhattan distance decay to introduce spatial information and employs a horizontal and vertical decomposition attention method to model global information. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements. Firstly, EVT uses a more reasonable Euclidean distance decay to enhance the modeling of spatial information, allowing for a more accurate representation of spatial relationships compared to the Manhattan distance used in RMT. Secondly, EVT abandons the decomposed attention mechanism featured in RMT and instead adopts a simpler spatially-independent grouping approach, providing the model with greater flexibility in controlling the number of tokens within each group. With these modifications, EVT offers a more sophisticated and adaptable approach to incorporating spatial priors into the Self-Attention mechanism, thus overcoming some of the limitations associated with RMT and further enhancing its applicability in various computer vision tasks. Extensive experiments on Image Classification, Object Detection, Instance Segmentation, and Semantic Segmentation demonstrate that EVT exhibits exceptional performance. Without additional training data, EVT achieves 86.6% top-1 accuracy on ImageNet-1k.
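The difference between Manhattan and Euclidean distance decay can be seen in a small sketch that builds a spatial decay matrix for a grid of tokens. It is a simplified illustration, not the RMT or EVT implementation: the form `gamma ** distance`, the single shared `gamma`, and the function name `decay_mask` are assumptions, and how the mask is combined with the attention scores is left out.

```python
import math


def decay_mask(height, width, gamma=0.9, metric="euclidean"):
    """Build an (H*W) x (H*W) matrix of decay weights gamma ** distance.

    Tokens are laid out row by row on a height x width grid.
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    mask = []
    for y1, x1 in coords:
        row = []
        for y2, x2 in coords:
            if metric == "euclidean":
                dist = math.hypot(y1 - y2, x1 - x2)
            elif metric == "manhattan":
                dist = abs(y1 - y2) + abs(x1 - x2)
            else:
                raise ValueError(f"unknown metric: {metric}")
            row.append(gamma ** dist)
        mask.append(row)
    return mask
```

For diagonal neighbors the Manhattan distance is 2 while the Euclidean distance is about 1.41, so the Euclidean mask down-weights them less, treating all directions more evenly.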
[260] AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation cs.CVPDF
Rui Qian, Chuanhang Deng, Qiang Huang, Jian Xiong, Mingxuan Li
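TL;DR: AnchorSeg proposes a reasoning segmentation method built on language-grounded query banks, reformulating reasoning segmentation as a structured conditional generation process over image tokens. By constructing an ordered sequence of query banks (latent reasoning tokens that capture intermediate semantic states, plus a segmentation anchor token that provides explicit spatial grounding), the method explicitly decouples spatial localization from semantic reasoning. It also introduces a Token-Mask Cycle Consistency (TMCC) training objective to bridge token-level predictions and pixel-level supervision.
Details
Motivation: Existing reasoning segmentation methods rely on a single segmentation token
Result: Achieves state-of-the-art results on the ReasonSeg test set (67.7% gIoU and 68.1% cIoU).
Insight: The core novelty is explicitly decoupling spatial localization from semantic reasoning through a structured language-grounded query bank, modeling spatial conditioning as a factorized distribution over image tokens, and designing the bidirectional TMCC training objective to ensure cross-resolution alignment.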
Abstract: Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token $\texttt{
[261] MultiWorld: Scalable Multi-Agent Multi-View Video World Models cs.CVPDF
Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
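TL;DR: This paper presents MultiWorld, a unified framework for multi-agent multi-view world modeling. It targets the limitation that existing video world models are restricted to single-agent scenarios and struggle to capture the complex interactions of real-world multi-agent systems. The framework introduces a Multi-Agent Condition Module for precise multi-agent control and a Global State Encoder for consistent observations across views, and it supports flexible scaling of agent and view counts as well as parallel view synthesis.
Details
Motivation: Most existing video world models are limited to single-agent scenarios and cannot effectively simulate the complex interactions of real-world multi-agent systems, so a world modeling approach is needed that supports multiple agents and multiple views while maintaining consistency.
Result: Experiments on multi-player game environments and multi-robot manipulation tasks show that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency.
Insight: The novelty is a unified multi-agent multi-view world modeling framework in which the Multi-Agent Condition Module addresses precise multi-agent control and the Global State Encoder addresses cross-view consistency, enabling scalable parallel view synthesis.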
Abstract: Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
[262] Back into Plato’s Cave: Examining Cross-modal Representational Convergence at Scale cs.CV | cs.AI | cs.LGPDF
A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros
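TL;DR: This paper critically examines the Platonic Representation Hypothesis, which holds that neural networks trained on different modalities (e.g., text and images) converge to the same representation of reality. Through large-scale experiments, the authors find that the evidence supporting the hypothesis is fragile: alignment measured by mutual nearest neighbors is pronounced on small datasets (about 1K samples) but drops substantially when scaled to millions of samples, and the alignment that remains reflects only coarse semantic overlap rather than consistent fine-grained structure. Alignment weakens further in realistic many-to-many settings, and newer models do not show a trend of language models aligning more strongly with vision models.
Details
Motivation: The goal is to test the validity of the Platonic Representation Hypothesis, which holds that neural networks of different modalities converge to the same representation and would therefore call into question whether modality choice matters. The authors aim to expose the limitations of the existing evidence using large-scale data and more realistic evaluation settings.
Result: Cross-modal representational alignment is measurable on small datasets but degrades substantially on large-scale datasets (millions of samples). It decreases further in realistic many-to-many settings, and newer models (such as more recent language models) do not show a trend toward stronger alignment, contrary to earlier conclusions.
Insight: The novelty is a systematic set of large-scale, many-to-many experiments that reveals how fragile and limited cross-modal representational alignment is. This challenges the optimistic conclusions of prior work and emphasizes that models of different modalities may learn equally rich but different representations of the world rather than the same one.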
Abstract: The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
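The mutual nearest-neighbor measure mentioned in the abstract can be sketched as the average overlap between the k-nearest-neighbor sets that two models induce over the same items. The version below is a minimal pure-Python illustration; the squared Euclidean distance, the function names, and the toy data are assumptions, not the paper's evaluation code.

```python
def knn_indices(features, k):
    """For each item, return the indices of its k nearest other items."""
    neighbours = []
    for i, fi in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        # ASSUMPTION: squared Euclidean distance; other similarity
        # measures (e.g. cosine) are equally plausible choices.
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(fi, features[j])))
        neighbours.append(others[:k])
    return neighbours


def mutual_knn_alignment(feats_a, feats_b, k):
    """Mean fraction of shared k-nearest neighbours across two feature spaces."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy example: row i of each list describes the same underlying item.
vision_like = [[0.0], [1.0], [10.0], [11.0]]
text_like = [[0.0], [10.0], [1.0], [11.0]]
print(mutual_knn_alignment(vision_like, vision_like, k=1))
print(mutual_knn_alignment(vision_like, text_like, k=1))
```

Because each item's neighbor set is drawn from the whole evaluation pool, the score depends on pool size, which is the sensitivity the paper probes when scaling from about 1K samples to millions.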
[263] T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability cs.CVPDF
Savya Khosla, Sethuraman T, Aryan Chadha, Alex Schwing, Derek Hoiem
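TL;DR: This paper proposes T-REN (Text-aligned Region Encoder Network), an efficient encoder that uses a lightweight network to map visual data to a compact set of text-aligned region-level representations (region tokens). It targets two core problems of vision-language encoders: weak alignment between language and dense visual features, and scalability limits caused by the large number of tokens needed for fine-grained visual representations.
Details
Motivation: Address two core limitations of existing vision-language encoders: (1) weak alignment between language and dense visual features, which hurts tasks such as open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limit scalability to settings such as long videos.
Result: Substantial gains on several benchmarks: +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing. Compared to the patch-based vision-language backbone, token counts drop by more than 24x for images and more than 187x for videos.
Insight: The core novelty is a lightweight region pooling and alignment mechanism. With a frozen vision backbone and only 3.7% additional parameters, it aggregates patch-level representations into semantic region tokens and aligns them with region-level text annotations, substantially improving dense cross-modal understanding while greatly reducing the token count and the associated compute, which improves scalability.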
Abstract: Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense cross-modal understanding while reducing the token count by orders of magnitude. Specifically, T-REN delivers +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing, all while reducing token counts by more than 24x for images and 187x for videos compared to the patch-based vision-language backbone. The code and model are available at https://github.com/savya08/T-REN.
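The token-count reduction comes from replacing many patch tokens with one token per semantic region. The sketch below shows the simplest possible version of that step, a mean over the patches in each region. It is a simplification under stated assumptions: T-REN trains a lightweight network for the pooling, whereas here the pooling is a fixed average, and the function name and inputs are hypothetical.

```python
from collections import defaultdict


def pool_region_tokens(patch_features, region_ids):
    """Average patch feature vectors that share a region id.

    ASSUMPTION: plain mean pooling stands in for the learned pooling
    described in the abstract; region_ids is a hypothetical per-patch
    region assignment.
    """
    sums = {}
    counts = defaultdict(int)
    for feat, rid in zip(patch_features, region_ids):
        if rid not in sums:
            sums[rid] = [0.0] * len(feat)
        sums[rid] = [s + x for s, x in zip(sums[rid], feat)]
        counts[rid] += 1
    return {rid: [s / counts[rid] for s in total] for rid, total in sums.items()}


# Three patch tokens collapse into two region tokens.
patches = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0]]
regions = [0, 0, 1]
print(pool_region_tokens(patches, regions))
```

The output has one vector per region rather than per patch, so downstream cost scales with the number of regions instead of the image or video resolution.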
[264] MUA: Mobile Ultra-detailed Animatable Avatars cs.CVPDF
Heming Zhu, Guoxing Sun, Marc Habermann
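TL;DR: This paper proposes MUA (Mobile Ultra-detailed Animatable Avatars), a method for building animatable digital humans that are both high-fidelity and lightweight. By combining a Wavelet-guided Multi-level Spatial Factorized Blendshapes representation with a distillation pipeline, it transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained high-quality teacher model, achieving up to 2000x lower computational cost and a 10x smaller model while keeping visual dynamics and appearance details close to those of the teacher.
Details
Motivation: Current animatable avatar modeling struggles to combine high fidelity with low cost: high-fidelity models need powerful GPU compute, while lightweight models lack dynamic detail and are prone to artifacts. The paper aims to close this gap and enable ultra-detailed animatable avatars on mobile devices.
Result: In mobile settings, the method significantly outperforms existing approaches, with rendering quality comparable or superior to most methods that can only run on servers. It reaches over 180 FPS on a desktop PC and real-time native performance at 24 FPS on a standalone Meta Quest 3.
Insight: The novelty is the Wavelet-guided Multi-level Spatial Factorized Blendshapes representation, combined with low-rank structural factorization in texture space and a distillation pipeline from a high-quality teacher model to a compact representation. This balances fidelity and efficiency and offers a practical solution for immersive applications.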
Abstract: Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details that closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.
cs.HC [Back]
[265] Navigating the Conceptual Multiverse cs.HC | cs.CL | cs.CYPDF
Andre Ye, Jenny Y. Huang, Alicia Guo, Rose Novick, Tamara Broderick
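TL;DR: This paper proposes and evaluates the "conceptual multiverse", an interactive system that borrows multiverse analysis from statistics. It makes explicit the decisions a language model implicitly takes when answering open-ended questions (such as how to frame the question or what to value), turning them into a space that users can transparently inspect, intervene on, and check against domain reasoning norms. A general verification framework enforces the rigor of the decision structure (for example, unambiguity and completeness). Across three domains (philosophy, alignment annotation, and poetry), the system helps users build a "working map" of the problem and improves their analysis and expression.
Details
Motivation: When language models answer open-ended questions, their implicit decisions are opaque and their outputs lack context, so users do not gain a structured understanding (a working map) of the problem. The paper aims to make these decisions visible, open to intervention, and verifiable through an interactive system.
Result: In qualitative evaluations across three domains (philosophy essay rewriting, alignment annotation, and poetry analysis), the system helped participants (philosophy students, alignment annotators, and poets) develop a working map of the problem. Philosophy students rewrote essays with sharper framings and reversed theses, alignment annotators moved from surface preferences to reasoning about user intent and harm, and poets identified compositional patterns that clarified their taste.
Insight: The core novelty is bringing multiverse analysis from statistics into human-computer interaction and AI interpretability: an interactive system lays out the implicit conceptual decisions as a navigable space, paired with a general verification framework, calibrated by expert-level reasoning, that ensures the rigor of the decision structure. This offers a new paradigm for improving transparency and user control in AI systems on open-ended problems.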
Abstract: When language models answer open-ended problems, they implicitly make hidden decisions that shape their outputs, leaving users with uncontextualized answers rather than a working map of the problem. Drawing on multiverse analysis from statistics, we build and evaluate the conceptual multiverse, an interactive system that represents conceptual decisions, such as how to frame a question or what to value, as a space users can transparently inspect, change through intervention, and check against principled domain reasoning. For this structure to be worth navigating rather than misleading, it must be rigorous and checkable against domain reasoning norms, so we develop a general verification framework that enforces properties of good decision structures, like unambiguity and completeness, calibrated by expert-level reasoning. Across three domains, the conceptual multiverse helped participants develop a working map of the problem, with philosophy students rewriting essays with sharper framings and reversed theses, alignment annotators moving from surface preferences to reasoning about user intent and harm, and poets identifying compositional patterns that clarified their taste.
[266] Real-Time Cellist Postural Evaluation With On-Device Computer Vision cs.HC | cs.CVPDF
Paolo Wang, Michael Zhang, Shrinand Perumal, Ekaterina Tszyao, Luke Choi
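TL;DR: This paper presents Cello Evaluator, a real-time postural evaluation system for cellists. Using on-device computer vision on an ordinary Android phone, it gives beginners real-time posture feedback to compensate for the limited feedback available in traditional instruction.
Details
Motivation: Beginning cellists typically have only one lesson per week and lack continuous posture feedback between lessons, which leads to deteriorating posture, higher risk of musculoskeletal injury, and inefficient technique.
Result: A heuristic evaluation with feedback from cellists and UX experts indicates that the app is user friendly and helpful.
Insight: The novelty is optimizing on-device computer vision inference so that posture evaluation needs only a current-generation Android phone, with no expensive hardware or multi-sensor setup, which improves availability and convenience.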
Abstract: Posture is a critical factor for beginning instrumental learners. Most students receive instruction only once a week, and during the intervals between lessons they have little or no feedback on their physical posture. As a result, posture often deteriorates, increasing the risk of musculoskeletal injury and inefficient technique. Recent advances in computer vision and machine learning make it possible to evaluate posture without the constant presence of a human expert. However, current solutions have been extremely limited in availability and convenience due to their reliance on computationally expensive hardware or multi-sensor setups. We present Cello Evaluator, a real-time postural feedback system for practicing cellists. By optimizing for on-device computer vision inference, we provide access to cellist postural evaluation to anyone with a current-generation Android phone and thus reduce the postural feedback voids within individual practice. To validate our mobile application, we conduct a heuristic evaluation with cellist and UX experts. Overall feedback from the evaluation found the app to be user friendly and helpful.
cs.AI [Back]
[267] PersonalHomeBench: Evaluating Agents in Personalized Smart Homes cs.AI | cs.CL | cs.DBPDF
Nikhil Verma, InJung Yang, Sungil Kim, KoKeun Kim, YoungJoon Kim
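TL;DR: This paper introduces PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark builds rich household states through an iterative process and generates personalized, context-dependent tasks from them, and it provides the PersonalHomeTools toolbox to support realistic environment interaction.
Details
Motivation: The readiness of agentic AI systems in complex and personalized environments has not been adequately evaluated, so a benchmark is needed to test their performance in personalized smart homes.
Result: Experiments show that agent performance degrades systematically as task complexity increases, with especially poor results in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required.
Insight: The novelty is a comprehensive evaluation platform for personalized smart homes that assesses both reactive and proactive agent abilities and highlights the importance of tool use for information gathering.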
Abstract: Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.
[268] The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus cs.AI | cs.CL | cs.CR | cs.DCPDF
Syed Muhammad Aqdas Rizvi
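TL;DR: This paper studies small language models (SLMs) acting as edge-native firewalls in decentralized autonomous organizations (DAOs), comparing inference-time compute (System 2) against an autoregressive baseline (System 1) on adversarial governance tasks. A strict ablation of Qwen-3.5-9B using the Sentinel-Bench framework finds that System 2 reasoning causes severe cognitive collapse and instability, while System 1 performs better in adversarial robustness, consistency, and latency.
Details
Motivation: The effectiveness of scaling inference-time compute (System 2) to enhance formal logic remains underexplored in highly adversarial cryptoeconomic governance environments. The paper aims to evaluate how System 1 and System 2 actually perform on decentralized consensus tasks.
Result: On an adversarial Optimism DAO dataset, the System 1 baseline achieved 100% adversarial robustness and juridical consistency and reached state finality in under 13 seconds. System 2 reasoning produced a 26.7% reasoning non-convergence rate, lowered consensus stability to 72.6%, and imposed a 17x latency overhead; "reasoning-induced sycophancy" was also observed in 1.5% of adversarial trials.
Insight: The claimed novelty is the Sentinel-Bench empirical framework, which runs a strict intra-model ablation and shows that under Byzantine Fault Tolerance constraints, the System 1 parameterized intuition of edge-native SLMs is structurally and economically superior to System 2 iterative deliberation. Viewed objectively, the study offers useful insight into the performance and security trade-offs of deploying SLMs in decentralized systems.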
Abstract: Decentralized Autonomous Organizations (DAOs) are inclined to explore Small Language Models (SLMs) as edge-native constitutional firewalls to vet proposals and mitigate semantic social engineering. While scaling inference-time compute (System 2) enhances formal logic, its efficacy in highly adversarial, cryptoeconomic governance environments remains underexplored. To address this, we introduce Sentinel-Bench, an 840-inference empirical framework executing a strict intra-model ablation on Qwen-3.5-9B. By toggling latent reasoning across frozen weights, we isolate the impact of inference-time compute against an adversarial Optimism DAO dataset. Our findings reveal a severe compute-accuracy inversion. The autoregressive baseline (System 1) achieved 100% adversarial robustness, 100% juridical consistency, and state finality in under 13 seconds. Conversely, System 2 reasoning introduced catastrophic instability, fundamentally driven by a 26.7% Reasoning Non-Convergence (cognitive collapse) rate. This collapse degraded trial-to-trial consensus stability to 72.6% and imposed a 17x latency overhead, introducing critical vulnerabilities to Governance Extractable Value (GEV) and hardware centralization. While rare (1.5% of adversarial trials), we empirically captured “Reasoning-Induced Sycophancy,” where the model generated significantly longer internal monologues (averaging 25,750 characters) to rationalize failing the adversarial trap. We conclude that for edge-native SLMs operating under Byzantine Fault Tolerance (BFT) constraints, System 1 parameterized intuition is structurally and economically superior to System 2 iterative deliberation for decentralized consensus. Code and Dataset: https://github.com/smarizvi110/sentinel-bench
[269] Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception cs.AI | cs.CL | cs.LGPDF
Ashutosh Bajpai, Tamal Majumder, Akshay Nambi, Tanmoy Chakraborty
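TL;DR: This paper proposes SPECTRA, a supervision-free framework for improving the agentic capabilities of small vision-language models (SVLMs). Using cold-start reinforcement learning, it guides the agent to explicitly sequence tool-derived evidence across multi-turn interactions, grounding reasoning in visual observations, and it uses a multi-objective reward signal to maximize task correctness, rollout structure, and tool utility.
Details
Motivation: Small vision-language models are efficient task controllers but suffer from visual brittleness and poor tool orchestration, which usually require expensive supervised trajectory tuning to mitigate. The paper aims to overcome these deficits with a supervision-free framework in which the agent discovers robust behaviors on its own through environmental interaction.
Result: Extensive evaluations on composite and out-of-distribution (MMMU-Pro) benchmarks show that SPECTRA improves agentic trajectories, raising task accuracy by up to 5% and tool efficiency by 9%.
Insight: The novelty is the supervision-free SPECTRA framework, built around a soft structured multi-turn rollout constraint and a multi-objective reward signal, together with Tool Instrumental Utility (TIU), a new metric for quantifying tool utility. This lets the agent optimize itself through environmental interaction without human preference labels, improving the robustness and efficiency of visual reasoning.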
Abstract: Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration. They typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities via Coldstart Reinforcement Learning for SVLMs. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool-derived evidence before synthesis, effectively grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling the agent to self-discover robust behaviors without human preference labels. We further introduce Tool Instrumental Utility (TIU), a novel metric to quantify tool efficacy in the absence of ground truth. Extensive evaluations across composite and out-of-distribution (MMMU-Pro) benchmarks demonstrate that SPECTRA boosts agentic trajectories, improving task accuracy by up to 5% and tool efficiency by 9%, enabling more efficient multimodal agents that learn effectively from environmental interaction alone.
[270] COSEARCH: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search cs.AI | cs.CL | cs.IRPDF
Hansi Zeng, Liam Collins, Bhuvesh Kumar, Neil Shah, Hamed Zamani
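TL;DR: This paper proposes CoSearch, a framework that jointly trains a multi-step reasoning agent and a generative document ranking model via reinforcement learning, addressing the retrieval-system bottleneck in agentic search.
Details
Motivation: Existing agentic search methods (such as Search-R1) optimize only the reasoning agent and treat the retrieval system as a fixed tool, which makes retrieval a key bottleneck for overall performance (the gap to an oracle retrieval system reaches up to +26.8% relative F1).
Result: Across seven single-hop and multi-hop QA benchmarks, CoSearch delivers consistent improvements over strong baselines, and ablation studies validate each design choice.
Insight: The novelties are: (1) jointly training the reasoner and the ranker via GRPO; (2) a semantic grouping strategy for policy optimization that avoids additional rollouts; and (3) a composite reward that combines ranking quality signals with trajectory-level feedback, giving the ranker both immediate and long-term learning signals.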
Abstract: Agentic search – the task of training agents that iteratively reason, issue queries, and synthesize retrieved information to answer complex questions – has achieved remarkable progress through reinforcement learning (RL). However, existing approaches such as Search-R1 treat the retrieval system as a fixed tool, optimizing only the reasoning agent while the retrieval component remains unchanged. A preliminary experiment reveals that the gap between an oracle and a fixed retrieval system reaches up to +26.8% relative F1 improvement across seven QA benchmarks, suggesting that the retrieval system is a key bottleneck in scaling agentic search performance. Motivated by this finding, we propose CoSearch, a framework that jointly trains a multi-step reasoning agent and a generative document ranking model via Group Relative Policy Optimization (GRPO). To enable effective GRPO training for the ranker – whose inputs vary across reasoning trajectories – we introduce a semantic grouping strategy that clusters sub-queries by token-level similarity, forming valid optimization groups without additional rollouts. We further design a composite reward combining ranking quality signals with trajectory-level outcome feedback, providing the ranker with both immediate and long-term learning signals. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate consistent improvements over strong baselines, with ablation studies validating each design choice. Our results show that joint training of the reasoning agent and retrieval system is both feasible and strongly performant, pointing to a key ingredient for future search agents.
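GRPO needs a group of comparable samples so that each reward can be scored relative to its peers, which is why the abstract stresses forming valid optimization groups for the ranker. The sketch below shows only the group-relative normalization step, assuming the group has already been formed. The function name, the epsilon term, and the use of the population standard deviation are illustrative assumptions, not details taken from CoSearch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    ASSUMPTION: population standard deviation and a small eps for
    numerical stability; the digest does not specify either choice.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of four ranker outputs for semantically similar sub-queries.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# A group where every sample earns the same reward carries no learning signal.
print(group_relative_advantages([2.0, 2.0, 2.0]))
```

The second call illustrates why grouping matters: if a group contains only identical rewards, every advantage is zero, so groups must mix outcomes to be useful.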
[271] Characterizing Model-Native Skills cs.AI | cs.CL | cs.LGPDF
Feiyang Kang, Mahavir Dabas, Myeongseob Ko, Ruoxi Jia
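TL;DR: This paper introduces the notion of "model-native skills", arguing that skill characterizations should be extracted from a language model's own internal representations rather than defined externally by humans. By recovering a compact orthogonal basis from sequence-level activations, the method obtains a semantically interpretable basis for the model's own behavioral variation. The characterization is applied to post-training data selection and inference-time steering for reasoning, yielding notable gains on the MATH and AMC benchmarks, and it enables more efficient adversarial data selection for safety alignment.
Details
Motivation: Existing skill characterizations rely on human-written taxonomies, textual descriptions, or manual profiling pipelines, all of which are external hypotheses imposed on the model that may not match its internal representations. When the goal is to intervene on model behavior, skill characterization should be grounded in the model's own internal representations, that is, it should be "model-native".
Result: On Llama3-8B and Qwen2.5-3B, data selection based on model-native skills improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming selection based on human-defined skills. The method also supports inference-time steering, improving Pass@8 by up to 4.8% on MATH. In safety alignment, selecting adversarial training data for model-native skill coverage also yields more sample-efficient learning.
Insight: The core novelty is the concept of "model-native skills", instantiated by recovering an orthogonal basis from model activations. This offers a new way to understand and intervene on model behavior from the model's internal perspective rather than an externally imposed one. Methodologically, it aligns skill characterization with the model's internal representations and develops lightweight proxy interventions to identify useful directions, so that the same characterization serves both training data selection and inference-time steering.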
Abstract: Skills are a natural unit for describing what a language model can do and how its behavior can be changed. However, existing characterizations rely on human-written taxonomies, textual descriptions, or manual profiling pipelines – all external hypotheses about what matters that need not align with the model’s internal representations. We argue that when the goal is to intervene on model behavior, skill characterization should be model-native: grounded in the model’s own representations rather than imposed through external ontologies. We instantiate this view by recovering a compact orthogonal basis from sequence-level activations. The resulting basis is semantically interpretable but need not correspond to any predefined human ontology; instead, it captures axes of behavioral variation that the model itself organizes around. We validate this characterization on reasoning post-training, using the recovered basis for both SFT data selection and inference-time steering. We develop lightweight proxy interventions to identify which directions are most useful for a given model. Across Llama3-8B and Qwen2.5-3B, selecting data along those directions improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming data selection based on human-characterized skills. Because the basis lives in activation space, the same directions also serve as steering vectors at inference time, improving Pass@8 by up to 4.8% on MATH – an intervention that human-characterized skills cannot support. We further validate the characterization on safety alignment, where selecting adversarial training data for model-native skill coverage rather than textual diversity yields more sample-efficient learning. These results suggest that recovering skills from the model’s own representations, rather than imposing them externally, provides a more effective foundation for intervening on model behavior. Code is open-sourced.
[272] Evolutionary Negative Module Pruning for Better LoRA Merging cs.AI | cs.CL | cs.CVPDF
Anda Cao, Zhuo Gou, Yi Wang, Kaixuan Chen, Yu Wang
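TL;DR: This paper proposes ENMP (Evolutionary Negative Module Pruning), a plug-and-play method that identifies and prunes "negative modules", which degrade global performance, before multiple LoRA experts are merged, thereby improving existing model merging algorithms.
Details
Motivation: Existing LoRA merging methods implicitly assume that every LoRA matrix contributes constructively to the merged model, but the authors identify "negative modules" as a critical bottleneck: these modules inherently harm global performance upon merging.
Result: Extensive evaluations show that ENMP consistently improves existing merging algorithms and sets a new state of the art (SOTA) in both language and vision domains.
Insight: The core novelty is exposing the "negative module" problem in LoRA merging and proposing an evolutionary search strategy that navigates the discrete, non-differentiable module selection space to prune harmful modules. The idea is orthogonal to existing merging algorithms.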
Abstract: Merging multiple Low-Rank Adaptation (LoRA) experts into a single backbone is a promising approach for efficient multi-task deployment. While existing methods strive to alleviate interference via weight interpolation or subspace alignment, they rest upon the implicit assumption that all LoRA matrices contribute constructively to the merged model. In this paper, we uncover a critical bottleneck in current merging paradigms: the existence of negative modules – specific LoRA layers that inherently degrade global performance upon merging. We propose Evolutionary Negative Module Pruning (ENMP), a plug-and-play LoRA pruning method to locate and exclude these detrimental modules prior to merging. By leveraging an evolutionary search strategy, ENMP effectively navigates the discrete, non-differentiable landscape of module selection to identify optimal pruning configurations. Extensive evaluations demonstrate that ENMP consistently boosts the performance of existing merging algorithms, achieving a new state-of-the-art across both language and vision domains. Code is available at https://github.com/CaoAnda/ENMP-LoRAMerging.
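Because module selection is a discrete keep-or-prune choice, gradient methods do not apply directly, which motivates an evolutionary search. The sketch below is a generic elitist evolutionary loop over binary keep-masks, not the ENMP algorithm itself: the function names, hyperparameters, mutation-only variation, and the toy fitness function (standing in for merged-model validation performance) are all assumptions.

```python
import random


def evolve_pruning_mask(num_modules, fitness, generations=30,
                        population_size=8, mutation_rate=0.2, seed=0):
    """Search binary keep-masks (1 = keep module, 0 = prune) by mutation and selection."""
    rng = random.Random(seed)
    keep_all = tuple([1] * num_modules)
    # Seed the population with the unpruned configuration so the result
    # can never score below the "merge everything" baseline.
    population = [keep_all] + [
        tuple(rng.randint(0, 1) for _ in range(num_modules))
        for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]   # elitism: the best masks survive
        children = []
        while len(parents) + len(children) < population_size:
            parent = rng.choice(parents)
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in parent)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# HYPOTHETICAL per-module gains; negative entries play the role of negative modules.
toy_gain = [1.0, -0.5, 2.0, -1.0]


def toy_fitness(mask):
    return sum(g for g, keep in zip(toy_gain, mask) if keep)


best = evolve_pruning_mask(len(toy_gain), toy_fitness)
print(best, toy_fitness(best))
```

In a real setting the fitness call would merge the kept LoRA modules and evaluate the merged model, so the number of fitness evaluations is the main cost to control.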
[273] Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI | cs.CLPDF
Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu
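TL;DR: This paper presents Agent-World, a self-evolving training arena that aims to advance general agent capabilities through scalable environments. The system has two core components. The first, Agentic Environment-Task Discovery, autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty. The second, Continuous Self-Evolving Agent Training, combines multi-environment reinforcement learning with a self-evolving agent arena that uses dynamic task synthesis to identify capability gaps automatically and drive targeted learning, enabling agent policies and environments to co-evolve.
Details
Motivation: Large language models are increasingly expected to act as general-purpose agents that interact with external, stateful tool environments. Although the Model Context Protocol (MCP) and broader agent skills provide a unified interface for connecting agents with scalable real-world services, training robust agents is still limited by the lack of realistic environments and of mechanisms for life-long learning.
Result: Across 23 challenging agent benchmarks, the Agent-World-8B and 14B models consistently outperform strong proprietary models and environment scaling baselines. Further analysis reveals scaling trends related to environment diversity and the number of self-evolution rounds.
Insight: The claimed novelty is a training framework that autonomously discovers and synthesizes real-world environments and tasks and supports continuous co-evolution of agents and environments. Viewed objectively, the core contribution is tightly coupling environment synthesis with agent training, using automated task discovery and difficulty control to drive targeted capability gains, which offers a scalable, data-driven training paradigm for general agent intelligence.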
Abstract: Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present Agent-World, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperform strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.
[274] PARM: Pipeline-Adapted Reward Model cs.AI | cs.CLPDF
Xingyu Fan, Wei Shao, Jiacheng Liu, Linqi Song, Pheng Ann Heng
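TL;DR: This paper proposes PARM, a reward model for multi-stage LLM pipelines. It addresses the mismatch between conventional reward model predictions and actual execution outcomes in complex pipelines by integrating reward models into a code generation pipeline and optimizing them on pipeline-specific data, which notably improves output quality and stability.
Details
Motivation: Existing reward models mainly target single-step generation, while real-world multi-stage LLM pipelines (such as code generation for combinatorial optimization) lack effective reward guidance. This leads to inconsistency between reward predictions and pipeline execution outcomes and calls for a reward modeling approach adapted to the pipeline structure.
Result: On four public optimization benchmarks, PARM outperforms baselines and sampling methods in both code execution rate and solving accuracy. A cross-domain experiment on GSM8K shows good transferability, indicating that PARM consistently improves pipeline output quality and stability.
Insight: The novelty is tightly coupling the reward model with a multi-stage LLM pipeline, using pipeline-specific data and direct preference optimization (DPO) to align rewards with downstream feedback. This offers a new direction for reward modeling in multi-stage reasoning and underlines that reward models must adapt to the actual pipeline structure to stay consistent.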
Abstract: Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, powering RLHF and advanced decoding strategies. While most prior work focuses on single-step generation, real-world applications increasingly adopt multi-stage LLM pipelines, where effective reward guidance remains underexplored. We investigate this through code generation for combinatorial optimization, constructing a pipeline that integrates reward models into both formulation and solution stages. We identify a critical challenge: inconsistency between reward model predictions and actual pipeline execution outcomes. To address this, we propose the Pipeline-Adapted Reward Model (PARM), which leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. We instantiate PARM as a two-stage pipeline (formulation -> code generation) and evaluate it on four public optimization benchmarks, measuring execution rate and solving accuracy against baselines and sampling methods. A supplementary cross-domain experiment on GSM8K assesses transferability. Results demonstrate that PARM consistently improves pipeline output quality and stability, providing new insights into reward modeling for multi-stage LLM reasoning.
[275] WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation cs.AI | cs.CLPDF
Harish Santhanalakshmi Ganesan
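TL;DR: WorldDB is a memory engine built on a vector graph-of-worlds with ontology-aware write-time reconciliation. Each node is a recursive world with its own interior subgraph, ontology scope, and composed embedding; nodes are content-addressed and immutable; and edge types carry programmatic behavior. Together these address the limitations of existing vector stores and temporal knowledge graphs as persistent memory: fact fragmentation, loss of cross-session identity, and the lack of supersession and contradiction handling. On the LongMemEval-s benchmark, WorldDB with Claude Opus 4.7 as the answerer achieves a substantial improvement.
Details
Motivation: Flat vector stores used for retrieval-augmented generation (RAG) fragment facts, lose cross-session identity, and have no first-class notion of supersession or contradiction. Existing temporal knowledge-graph systems (such as Graphiti, Memento, and Hydra DB) keep the graph itself flat: there is no recursive composition, nodes have no content-addressed invariants, and edge types are labels without behavior. The paper aims to provide a stronger persistent memory engine for long-running agentic systems.
Result: On LongMemEval-s (500 questions, conversational stacks of about 115k tokens), WorldDB with Claude Opus 4.7 as the answerer achieves 96.40% overall accuracy and 97.11% task-averaged accuracy, +5.61 percentage points over the previously reported Hydra DB state of the art (90.79%) and +11.20 percentage points over Supermemory (85.20%). It performs strongly on single-session-assistant recall, temporal reasoning (96.24%), knowledge update (98.72%), and preference synthesis (96.67%). Ablations show that the graph layer (resolver-unified entities and typed reference edges) contributes +7.0 percentage points of task-averaged accuracy independently of the underlying answerer.
Insight: The novelty lies in three core design commitments: (1) every node is a recursive world with an interior subgraph, ontology scope, and composed embedding; (2) nodes are content-addressed and immutable, so any edit produces a new hash at the node and at every ancestor, giving a Merkle-style audit trail for free; and (3) edges are write-time programs, with each edge type shipping insert, delete, and query-rewrite handlers (supersession closes validity, contradicts preserves both sides, same_as stages a merge proposal), so that no raw append path exists. Viewed objectively, combining recursive structure, content-addressed immutability, and behavioral edge types offers a new architectural pattern for agent memory systems with strong consistency, auditability, and rich relationship handling.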
Abstract: Persistent memory is the bottleneck separating stateless chatbots from long-running agentic systems. Retrieval-augmented generation (RAG) over flat vector stores fragments facts into chunks, loses cross-session identity, and has no first-class notion of supersession or contradiction. Recent bitemporal knowledge-graph systems (Graphiti, Memento, Hydra DB) add typed edges and valid-time metadata, but the graph itself remains flat: no recursive composition, no content-addressed invariants on nodes, and edge types carry no behavior beyond a label. We present WorldDB, a memory engine built on three commitments: (i) every node is a world – a container with its own interior subgraph, ontology scope, and composed embedding, recursive to arbitrary depth; (ii) nodes are content-addressed and immutable, so any edit produces a new hash at the node and every ancestor, giving a Merkle-style audit trail for free; (iii) edges are write-time programs – each edge type ships on_insert/on_delete/on_query_rewrite handlers (supersession closes validity, contradicts preserves both sides, same_as stages a merge proposal), so no raw append path exists. On LongMemEval-s (500 questions, ~115k-token conversational stacks), WorldDB with Claude Opus 4.7 as answerer achieves 96.40% overall / 97.11% task-averaged accuracy, a +5.61pp improvement over the previously reported Hydra DB state-of-the-art (90.79%) and +11.20pp over Supermemory (85.20%), with perfect single-session-assistant recall and robust performance on temporal reasoning (96.24%), knowledge update (98.72%), and preference synthesis (96.67%). Ablations show that the engine’s graph layer – resolver-unified entities and typed refers_to edges – contributes +7.0pp task-averaged independently of the underlying answerer.
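The second commitment, content-addressed immutable nodes whose edits change the hash of every ancestor, can be illustrated with a few lines of hashing. This is a minimal sketch of the general Merkle-style idea, not WorldDB's storage format: the WorldStore class, the JSON serialization, and SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json


def node_hash(payload, child_hashes):
    """Hash a node from its payload and the hashes of its children."""
    body = json.dumps({"payload": payload, "children": sorted(child_hashes)},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class WorldStore:
    """HYPOTHETICAL append-only store: nodes are keyed by their content hash."""

    def __init__(self):
        self.nodes = {}

    def put(self, payload, children=()):
        key = node_hash(payload, children)
        self.nodes[key] = (payload, tuple(children))   # never overwritten in place
        return key


store = WorldStore()
leaf = store.put("Alice lives in Paris")
sibling = store.put("Alice likes tea")
root_v1 = store.put("world: Alice", [leaf, sibling])

# "Editing" the leaf creates a new node; the parent must be re-put with the new
# child hash, so the change propagates to every ancestor.
leaf_v2 = store.put("Alice lives in Berlin")
root_v2 = store.put("world: Alice", [leaf_v2, sibling])

print(root_v1 != root_v2)   # the ancestor hash changed
print(len(store.nodes))     # old versions remain available for audit
```

Since old nodes are never overwritten, both root versions stay addressable, which is what makes the audit trail come "for free" in a design of this kind.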
[276] Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals cs.AI | cs.CVPDF
Yang Shanglin
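TL;DR: This paper explains why training-free token reduction methods (such as ToMe, ToFu, PiToMe, and MCTF) share a cliff-like collapse at high compression rates. The author develops a diagnostic framework with two tools, ranking consistency ρ_s and off-diagonal correlation ρ_off, that decomposes the collapse into two factors: a signal-agnostic error amplifier inherent to layer-wise reduction, and the instability of the pairwise similarity signals on which all of these methods rely. From three design principles derived from the diagnosis, the author builds CATIS as a validation: on ViT-Large at 63% FLOPs reduction it retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K, whereas the baselines collapse to 43–65% accuracy.
Details
Motivation: To identify the root cause of the cliff-like performance collapse that training-free token reduction methods share at high compression rates, and to understand the underlying instability mechanism.
Result: On ImageNet-1K, CATIS on ViT-Large at 63% FLOPs reduction keeps 81.0% accuracy (96.9% of the vanilla model's accuracy), clearly ahead of all baselines, which collapse to 43–65% accuracy.
Insight: The key contribution is showing that the inherent statistical instability of pairwise scoring signals (O(N_p^2) joint perturbations) is the core cause of the collapse, whereas unary signals (O(N_p) perturbations, bounded via the central limit theorem) are more stable. The diagnostic framework and the design principles derived from it (for example, using unary signals and adding a triage mechanism) give both a theoretical basis and practical guidance for building more robust token reduction methods.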
Abstract: Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency $ρ_s$ and off-diagonal correlation $ρ_\text{off}$, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and $r_{\text{crit}} \propto 1/L$; and (2) shared reliance on pairwise similarity signals whose ranking consistency degrades from $ρ_s{=}0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations) while unary signals enjoy greater stability ($O(N_p)$ perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K where all baselines collapse to 43–65%.
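The abstract does not spell out how ranking consistency $ρ_s$ is computed. One plausible reading, assumed here, is a Spearman rank correlation between the scores a method assigns to the same tokens under two conditions (for example, before and after a perturbation). The sketch below implements that textbook statistic for tie-free scores; the function names and the toy scores are illustrative only.

```python
def ranks(values):
    """Rank positions (0 = smallest). Assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman_rho(a, b):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy token scores under two conditions; two neighbouring tokens swap rank.
scores_clean = [0.9, 0.5, 0.7, 0.1]
scores_perturbed = [0.8, 0.6, 0.4, 0.2]
rho = spearman_rho(scores_clean, scores_perturbed)  # 0.8 for this toy input
```

A value near 1 means the token ordering, and therefore the merge or prune decision, is stable; the abstract reports this kind of consistency dropping from 0.88 to 0.27 in deep layers for pairwise signals.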
[277] Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification cs.AI | cs.CV | cs.ROPDF
Jiawen Wen, Penglei Sun, Wenjie Zhang, Suixuan Qiu, Weisheng Xu
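TL;DR: This paper presents Rule-VLN, a large-scale urban benchmark for evaluating vision-and-language navigation (VLN) agents that must comply with social rules. The benchmark contains rich semantic rule constraints. The paper further proposes the Semantic Navigation Rectification Module (SNRM), a universal zero-shot module that gives pre-trained navigation agents safety and compliance awareness and improves their performance through coarse-to-fine visual perception and dynamic path planning.
Details
Motivation: Current VLN agents fall into a “goal-driven trap”: they focus on physical reachability (“can I go?”) and neglect semantic rule constraints (“may I go?”), which limits deployment in real-world settings that require social compliance.
Result: Experiments on the proposed Rule-VLN benchmark show that it challenges existing SOTA models and that SNRM significantly restores agents' navigation capability, reducing the constraint violation rate (CVR) by 19.26% and raising task completion (TC) by 5.97%.
Insight: The novelty lies in building the first large-scale, fine-grained benchmark for rule-compliant navigation and in proposing SNRM, a zero-shot universal module that combines the perception of vision-language models with dynamic detour planning over an epistemic mental map, decoupling rule understanding from geometric navigation and then fusing the two.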
Abstract: As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance. However, current agents suffer from a “goal-driven trap”, prioritizing physical geometry (“can I go?”) over semantic rules (“may I go?”), frequently overlooking subtle regulatory constraints. To bridge this gap, we establish Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Spanning a massive 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module designed to equip pre-trained agents with safety awareness. SNRM integrates a coarse-to-fine visual perception VLM framework with an epistemic mental map for dynamic detour planning. Experiments demonstrate that while Rule-VLN challenges state-of-the-art models, SNRM significantly restores navigation capabilities, reducing CVR by 19.26% and boosting TC by 5.97%.
eess.IV [Back]
[278] A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction eess.IV | cs.AI | cs.CVPDF
Dingyi Zhang, Ruiying Liu, Yun Wang
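TL;DR: This paper proposes a two-stage multi-modal MRI framework for lifespan brain age prediction that quantifies brain health by integrating information on brain morphology and white matter organization. The model first classifies each subject into one of six developmental stages and then estimates age within the predicted stage, giving a unified assessment across developmental periods.
Details
Motivation: Existing brain age methods are usually restricted to narrow age ranges and single-modality MRI data, so they cannot capture the coordinated macro- and microstructural changes across the human lifespan. A multi-modal framework is needed to characterize the integrated evolution of brain morphology and white matter organization.
Result: The abstract reports no quantitative results or benchmarks; it presents a new two-stage multi-modal framework aimed at a unified, lifespan-spanning assessment of brain maturity.
Insight: The novelty is a two-stage architecture with late fusion of multi-modal MRI data: developmental-stage classification first, then age estimation. This helps integrate information from different modalities and adapt to brain changes across the lifespan, giving a more comprehensive approach to brain health assessment.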
Abstract: The accurate quantification of brain age from MRI has emerged as an important biomarker of brain health. However, existing approaches are often restricted to narrow age ranges and single-modality MRI data, limiting their capacity to capture the coordinated macro- and microstructural changes that unfold across the human lifespan. To address these limitations, we developed a multi-modal brain age framework to characterize the integrated evolution of brain morphology and white matter organization. Our model adopts a two-stage architecture, where modalities are processed independently and integrated via late fusion in both stages: first to classify each subject into one of six developmental stages, and then to estimate age within the predicted stage. This design enables a unified and lifespan-spanning assessment of brain maturity across diverse developmental periods.
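The two-stage, late-fusion flow described in this abstract can be sketched as follows. Everything concrete here is an assumption for illustration: the abstract does not say how fusion is computed, so simple averaging is used, and the per-modality stage probabilities and age estimates are made-up toy numbers standing in for real model outputs.

```python
def mean(xs):
    return sum(xs) / len(xs)


def late_fusion(vectors):
    """Average per-modality output vectors element-wise (assumed fusion rule)."""
    return [mean(column) for column in zip(*vectors)]


def predict_brain_age(stage_probs_by_modality, ages_by_stage):
    """Stage 1: fuse stage probabilities and pick a stage.
    Stage 2: fuse the per-modality age estimates of that stage's regressors."""
    fused = late_fusion(stage_probs_by_modality)
    stage = max(range(len(fused)), key=lambda s: fused[s])
    return stage, mean(ages_by_stage[stage])


# Toy outputs of two modality-specific classifiers over six developmental stages.
stage_probs = [
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],  # hypothetical morphology branch
    [0.20, 0.50, 0.10, 0.10, 0.05, 0.05],  # hypothetical white-matter branch
]
# Toy age estimates (years) from the stage-specific regressors, one per modality.
ages = {1: [9.0, 11.0]}

stage, age = predict_brain_age(stage_probs, ages)
```

Restricting the regressor to one predicted stage narrows the age range it has to cover, at the cost that a stage-1 misclassification cannot be corrected in stage 2.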
cs.CY [Back]
[279] RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025) cs.CY | cs.AI | cs.CLPDF
Luca-Ncolae Cuclea, Sabin-Codrut Badea, Adrian-Marius Dumitran
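TL;DR: This paper introduces RoMathExam, a longitudinal dataset of Romanian high-school mathematics exams spanning 1895-2025, with a standardized core for 1957-2025. It contains 10,592 mathematics problems tagged by curriculum topic and paired with text embeddings, supporting difficulty analysis and retrieval. The authors propose and validate a solution complexity metric as an intrinsic proxy for difficulty, evaluate it across several frontier reasoning models, and demonstrate the dataset's utility for curriculum analysis and LLM evaluation.
Details
Motivation: Educational AI research lacks large, well-structured, authentic exam data for many languages and educational systems. RoMathExam fills this gap and provides a foundation for difficulty modeling, curriculum analytics, and LLM evaluation in low-resource language settings.
Result: Evaluation across three frontier reasoning models (GPT-5-mini, DeepSeek-R1, and Qwen3-235B-Thinking) shows high cross-model synchronization for the proposed solution complexity metric (correlation r > 0.72), indicating that it separates intrinsic mathematical depth from stochastic generation noise. A longitudinal analysis quantifies a “regime shift” from volatile historical formats to a standardized, algebra-dominant modern curriculum.
Insight: Contributions include a long-span, standardized, curriculum-aligned math exam dataset and a scalable solution complexity metric that substitutes for missing historical psychometric data and is validated across multiple models. The structured design and rich metadata support variant detection, deduplication, and similarity-based retrieval, providing a reproducible foundation for educational AI research.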
Abstract: AI in Education research increasingly relies on authentic, curriculum-grounded assessment data, yet large, well-structured exam corpora remain scarce for many languages and educational systems. We introduce RoMathExam, a longitudinal dataset of Romanian high-school mathematics exams spanning 1895-2025, with a robust standardized core for 1957-2025. The dataset contains 10,592 mathematics problems organized into 600+ complete exam sets across multiple tracks (M1-M4), covering both official national examination sessions and ministry-published training variants. Beyond high-fidelity digitization and a unified JSON schema with traceable provenance, RoMathExam is enriched with curriculum-aligned topic tags and dense text embeddings, enabling variant detection, deduplication, and similarity-based retrieval. To overcome the lack of historical psychometric data, we propose and validate a solution complexity metric as a scalable intrinsic proxy for difficulty. Our evaluation across three frontier reasoning models (GPT-5-mini, DeepSeek-R1, and Qwen3-235B-Thinking) reveals high cross-model synchronization (r > 0.72), confirming the metric’s ability to isolate intrinsic mathematical depth from stochastic generation noise. We demonstrate the dataset’s utility through a longitudinal analysis that quantifies a “regime shift” from volatile historical formats to a standardized, algebra-dominant modern curriculum. RoMathExam provides a foundation for reproducible research in difficulty modeling, curriculum analytics, and LLM evaluation in low-resource linguistic contexts.
cs.SD [Back]
[280] ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection cs.SD | cs.CL | eess.ASPDF
Benjamin Chou, Yi Zhu, Surya Koppisetti
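TL;DR: This paper proposes ICLAD, a new audio deepfake detection framework that uses in-context learning with a comparison-guidance strategy so that audio language models can generalize to unseen deepfake audio without training and provide textual rationales for their detection results.
Details
Motivation: Current state-of-the-art audio deepfake detectors generalize poorly to realistic in-the-wild deepfakes, which poses a security threat, so a detection method that generalizes better and offers interpretability is needed.
Result: On in-the-wild datasets, ICLAD improves macro F1 over the specialized detector, with up to 2x relative improvement, showing its generalization advantage.
Insight: The core innovation is a pairwise comparative reasoning strategy that guides the model to discover and filter hallucinations and deepfake-irrelevant acoustic attributes, combined with a routing mechanism that hands out-of-distribution samples to the audio language model, yielding training-free generalization and interpretable output.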
Abstract: Audio deepfakes pose a significant security threat, yet current state-of-the-art (SOTA) detection systems do not generalize well to realistic in-the-wild deepfakes. We introduce a novel In-Context Learning paradigm with comparison-guidance for Audio Deepfake detection (ICLAD). The framework enables the use of audio language models (ALMs) for training-free generalization to unseen deepfakes and provides textual rationales on the detection outcome. At the core of ICLAD is a pairwise comparative reasoning strategy that guides the ALM to discover and filter hallucinations and deepfake-irrelevant acoustic attributes. The ALM works alongside a specialized deepfake detector, whereby a routing mechanism feeds out-of-distribution samples to the ALM. On in-the-wild datasets, ICLAD improves macro F1 over the specialized detector, with up to $2\times$ relative improvement. Further analysis demonstrates the flexibility of ICLAD and its potential for deployment on recent open-source ALMs.
[281] Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation cs.SD | cs.AI | cs.CL | cs.CV | cs.LGPDF
Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj, Gouthaman KV, Ramani Duraiswami
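TL;DR: Video-Robin is a novel text-conditioned video-to-music generation model that combines autoregressive planning with diffusion-based synthesis to generate background music for video that is fast, high-quality, and semantically aligned.
Details
Motivation: Existing video-to-music models rely on visual conditioning alone and offer limited semantic and stylistic controllability. The goal is finer-grained creator control without sacrificing audio realism.
Result: On in-distribution and out-of-distribution benchmarks, the model outperforms both video-only baselines and baselines conditioned on additional features, with inference 2.21x faster than SOTA.
Insight: The novelty is combining semantically driven autoregressive global planning with local Diffusion Transformer refinement, which balances musical fidelity and semantic understanding and suggests a new approach to controllable generation.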
Abstract: Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models achieve audiovisual alignment by typically relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this paper, we present Video-Robin, a novel text-conditioned video-to-music generation model that enables fast, high-quality, semantically aligned music generation for video content. To balance musical fidelity and semantic understanding, Video-Robin integrates autoregressive planning with diffusion-based synthesis. Specifically, an autoregressive module models global structure by semantically aligning visual and textual inputs to produce high-level music latents. These latents are subsequently refined into coherent, high-fidelity music using local Diffusion Transformers. By factoring semantically driven planning into diffusion-based synthesis, Video-Robin enables fine-grained creator control without sacrificing audio realism. Our proposed model outperforms both baselines that accept only video input and baselines conditioned on additional features, on in-distribution and out-of-distribution benchmarks, with a 2.21x inference speedup over SOTA. We will open-source everything upon paper acceptance.
[282] Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models cs.SD | cs.CLPDF
Xiang He, Chenxing Li, Jinting Wang, Yan Rong, Tianxin Xie
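TL;DR: This paper proposes Audio-DeepThinker, a framework that combines a hybrid reasoning similarity reward with a progressive two-stage curriculum so that high-quality chain-of-thought reasoning emerges in audio language models through pure reinforcement learning exploration, without supervised fine-tuning. It achieves state-of-the-art results on several audio reasoning benchmarks.
Details
Motivation: Existing large audio-language models mainly act as perception-and-answer systems without explicit reasoning. Existing methods for improving audio reasoning are either limited by the data quality of supervised chain-of-thought fine-tuning or rely on coarse reinforcement learning rewards that cannot directly evaluate reasoning quality, so the generated reasoning chains lack specific acoustic grounding.
Result: Achieves state-of-the-art (SOTA) results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), and wins 1st place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track).
Insight: The core innovations are: (1) a hybrid reasoning similarity reward that combines an LLM evaluator (assessing logical path alignment, key step coverage, and analytical depth) with an embedding similarity component (enforcing semantic alignment with reference reasoning chains) to directly supervise the quality of generated reasoning chains; (2) a progressive two-stage curriculum that starts from an instruction-tuned model with no chain-of-thought capability and, through pure RL exploration, guides the emergence of high-quality chain-of-thought reasoning in stages (foundational audio QA, then acoustically challenging boundary cases). Interpretability analyses show that RL achieves this mainly by reshaping upper-layer MoE gating mechanisms, with reasoning tokens crystallizing progressively in the upper transformer layers, which offers insight into the mechanism of exploration-driven audio reasoning.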
Abstract: Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (CoT) fine-tuning, which is limited by training data quality, or on reinforcement learning (RL) with coarse rewards that do not directly evaluate reasoning quality. As a result, the generated reasoning chains often appear well-structured yet lack specific acoustic grounding. We propose Audio-DeepThinker, a framework built on two core ideas. First, we introduce a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator assessing logical path alignment, key step coverage, and analytical depth with an embedding similarity component enforcing semantic alignment with reference reasoning chains. Second, we propose a progressive two-stage curriculum that enables high-quality CoT reasoning to emerge through pure RL exploration, without any supervised reasoning fine-tuning, from an instruction-tuned model that possesses no prior chain-of-thought capability. Stage 1 trains on foundational audio QA with the hybrid reward to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases with an LLM-only reward for greater reasoning diversity. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), winning 1st Place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track). Interpretability analyses further reveal that RL training primarily reshapes upper-layer MoE gating mechanisms and that reasoning tokens crystallize progressively in the upper transformer layers, offering mechanistic insights into how audio reasoning emerges through exploration.
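The hybrid reward and the stage switch described in this abstract can be sketched as below. The abstract does not give the combination rule, so a weighted sum with a hypothetical `weight` of 0.5 is assumed, the LLM evaluator's score is taken as an already-computed number in [0, 1], and the embeddings are plain lists of floats rather than outputs of any particular model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_reward(llm_score, generated_emb, reference_emb, weight=0.5):
    """Assumed form: weighted sum of the LLM evaluator score and embedding similarity."""
    similarity = max(0.0, cosine_similarity(generated_emb, reference_emb))
    return weight * llm_score + (1.0 - weight) * similarity


def stage_reward(stage, llm_score, generated_emb, reference_emb):
    """Stage 1 uses the hybrid reward; Stage 2 uses the LLM-only reward."""
    if stage == 1:
        return hybrid_reward(llm_score, generated_emb, reference_emb)
    return llm_score


r_stage1 = stage_reward(1, 0.8, [1.0, 0.0], [1.0, 0.0])
r_stage2 = stage_reward(2, 0.8, [1.0, 0.0], [0.0, 1.0])
```

Dropping the embedding term in Stage 2 removes the pull toward the reference reasoning chains, which is consistent with the abstract's stated aim of greater reasoning diversity in that stage.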
[283] Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval cs.SD | cs.CLPDF
HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo, Du-Seong Chang
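TL;DR: This paper presents Omni-Embed-Audio (OEA), a retrieval encoder built on multimodal large language models with native audio understanding. It targets the problem that traditional audio-text retrieval benchmarks rely on caption-style queries and cannot properly assess practical retrieval robustness. The authors introduce User-Intent Queries (UIQs) that reflect natural search behavior to evaluate robustness systematically, and they develop a hard negative mining pipeline and discrimination metrics. Experiments show that OEA is comparable to the SOTA model M2D-CLAP on text-to-audio retrieval while holding clear advantages in text-to-text retrieval and hard negative discrimination.
Details
Motivation: Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) perform well on traditional benchmarks, but the caption-style queries those benchmarks use differ substantially from real-world search behavior, which limits how well they assess practical retrieval robustness.
Result: Experiments on AudioCaps, Clotho, and MECAT show that OEA matches the state-of-the-art M2D-CLAP model on text-to-audio retrieval, achieves a +22% relative improvement on text-to-text retrieval, and shows clear advantages on the hard negative discrimination metrics HNSR@10 and TFR@10 (+4.3 percentage points and +34.7% relative, respectively).
Insight: The contributions are: (1) using a multimodal LLM with native audio understanding as the backbone of a retrieval encoder to improve semantic understanding of complex queries; (2) a new paradigm for systematically evaluating retrieval robustness, namely User-Intent Queries (UIQs) with the accompanying hard negative mining and discrimination metrics (HNSR, TFR), which provides a framework for evaluation closer to real-world use.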
Abstract: Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs) - five formulations reflecting natural search behaviors: questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose discrimination metrics (HNSR, TFR) assessing models’ ability to suppress acoustically similar distractors. Experiments on AudioCaps, Clotho, and MECAT show that OEA achieves comparable text-to-audio retrieval performance to state-of-the-art M2D-CLAP, while demonstrating clear advantages in two critical areas: (1) dominant text-to-text retrieval (+22% relative improvement), and (2) substantially superior hard negative discrimination (+4.3%p HNSR@10, +34.7% relative TFR@10), revealing that LLM backbones provide superior semantic understanding of complex queries.
cs.RO [Back]
[284] BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning cs.RO | cs.AI | cs.CV | cs.MAPDF
Xiaoyu Ma, Lianyu Hu, Wenbing Tang, Zixuan Hu, Zeqin Liao
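TL;DR: This paper proposes BrainMem, a training-free hierarchical memory system inspired by human cognition that gives embodied agents working, episodic, and semantic memory to support long-horizon task planning in complex 3D environments. The system turns interaction histories into structured knowledge graphs and distilled symbolic guidelines, so agents can retrieve, reason over, and adapt behavior from past experience without model fine-tuning or additional training.
Details
Motivation: Most existing LLM-based embodied task planners are stateless and reactive. Without persistent memory they repeat errors and struggle with spatial or temporal dependencies. This paper addresses the problem with a continuously evolving memory mechanism.
Result: Extensive experiments on four representative benchmarks (EB-ALFRED, EB-Navigation, EB-Manipulation, and EB-Habitat) show that BrainMem significantly improves task success rates across diverse models and difficulty subsets, with the largest gains on long-horizon and spatially complex tasks.
Insight: The novelty is a training-free, plug-and-play hierarchical memory architecture that makes the human cognitive memory model (working, episodic, semantic) computational and uses knowledge graphs and symbolic rules to accumulate and reuse experience in a structured, continuous way, offering a promising and scalable mechanism for generalizable embodied intelligence.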
Abstract: Embodied task planning requires agents to execute long-horizon, goal-directed actions in complex 3D environments, where success depends on both immediate perception and accumulated experience across tasks. However, most existing LLM-based planners are stateless and reactive, operating without persistent memory and therefore repeating errors and struggling with spatial or temporal dependencies. We propose BrainMem (Brain-Inspired Evolving Memory), a training-free hierarchical memory system that equips embodied agents with working, episodic, and semantic memory inspired by human cognition. BrainMem continuously transforms interaction histories into structured knowledge graphs and distilled symbolic guidelines, enabling planners to retrieve, reason over, and adapt behaviors from past experience without any model fine-tuning or additional training. This plug-and-play design integrates seamlessly with arbitrary multi-modal LLMs and greatly reduces reliance on task-specific prompt engineering. Extensive experiments on four representative benchmarks, including EB-ALFRED, EB-Navigation, EB-Manipulation, and EB-Habitat, demonstrate that BrainMem significantly enhances task success rates across diverse models and difficulty subsets, with the largest gains observed on long-horizon and spatially complex tasks. These results highlight evolving memory as a promising and scalable mechanism for generalizable embodied intelligence.
[285] Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering cs.RO | cs.CVPDF
Sebin Lee, Jumin Lee, Taeyeon Kim, Younju Na, Woobin Im
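TL;DR: This paper proposes visual-RRT (vRRT), which combines gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs to plan robot motion toward visual goals (such as images or demonstration videos) without precise joint-angle configurations.
Details
Motivation: Existing RRT-based planners require goal configurations given as explicit numerical joint angles, while in practice goals are often provided through visual observations (such as images or videos) with no precise configuration, so a planner that handles visual goals directly is needed.
Result: Extensive experiments on robot manipulators including Franka, UR5e, and Fetch show that vRRT achieves effective visual-goal planning in both simulated and real-world settings, bridging the gap between sampling-based planning and vision-centric robot applications.
Insight: Contributions include a visual-goal planning framework that combines differentiable rendering with RRT, a frontier-based adaptive exploration-exploitation strategy that prioritizes visually promising search regions, and inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consistent gradient exploitation.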
Abstract: Rapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (i) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (ii) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consistent gradient exploitation. Extensive experiments across various robot manipulators including Franka, UR5e, and Fetch demonstrate that vRRT achieves effective visual-goal planning in both simulated and real-world settings, bridging the gap between sampling-based planning and vision-centric robot applications. Our code is available at https://sgvr.kaist.ac.kr/Visual-RRT.
[286] Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining cs.RO | cs.CVPDF
Wenyao Zhang, Bozhou Zhang, Zekun Qi, Wenjun Zeng, Xin Jin
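TL;DR: This paper proposes DeFI, a robot learning framework that decouples the pretraining of visual forward dynamics and inverse dynamics and trains each on its own data sources. This addresses the mismatch between 2D image forecasting and 3D action prediction in VLA models and makes it possible to use large-scale action-free web video. The framework consists of a General Forward Dynamics Model (GFDM) and a General Inverse Dynamics Model (GIDM), which are pretrained separately and then fine-tuned end-to-end on downstream tasks.
Details
Motivation: To resolve the misalignment between 2D image forecasting and 3D action prediction in existing vision-language-action (VLA) models, and the fact that vision-action entangled training limits learning from large-scale action-free web video data.
Result: Achieves state-of-the-art performance on the CALVIN ABC-D and SimplerEnv benchmarks: an average task length of 4.51 on CALVIN, a 51.2% success rate on SimplerEnv-Fractal, and an 81.3% success rate in real-world deployment, significantly outperforming prior methods.
Insight: The core innovation is decoupling the pretraining of forward dynamics (video generation) and inverse dynamics (action prediction), so that each can be trained on the data that suits it, including action-free web video. This separated pretraining lets the model exploit data with different properties more efficiently and yields synergy in the later joint fine-tuning, offering a new architectural direction for generalist robot models.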
Abstract: Vision-language-action (VLA) models have shown great potential in building generalist robots, but still face a dilemma: the misalignment of 2D image forecasting and 3D action prediction. Besides, such vision-action entangled training limits model learning from large-scale, action-free web video data. To address these issues, we propose DeFI, a novel framework that Decouples visual Forward and Inverse dynamics pretraining to exploit respective data sources, wherein video generation and action prediction are disentangled. We introduce the General Forward Dynamics Model (GFDM), pretrained on diverse human and robot videos for future prediction, and the General Inverse Dynamics Model (GIDM), trained via self-supervised learning to infer latent actions from unlabeled video transitions. These models are then integrated into a unified architecture for end-to-end finetuning on downstream tasks. In this manner, GFDM and GIDM first shine separately and then cooperate for mutual benefit. Extensive experiments on CALVIN ABC-D and SimplerEnv demonstrate state-of-the-art performance, with DeFI achieving an average task length of 4.51 for CALVIN, a 51.2% success rate on the SimplerEnv-Fractal benchmark, and an 81.3% success rate in real-world deployment, significantly outperforming prior methods.
[287] ICAT: Incident-Case-Grounded Adaptive Testing for Physical-Risk Prediction in Embodied World Models cs.RO | cs.AI | cs.CV | cs.LGPDF
Zhenglin Lai, Sirui Huang, Yuteng Li, Changxin Huang, Jianqiang Li
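TL;DR: This paper proposes ICAT, a method for evaluating how well video-generative world models predict physical risk and severe consequences. It builds structured risk memories from real incident reports and safety manuals, then retrieves and composes those memories to generate risk cases with causal chains and severity labels, which are used to test world models.
Details
Motivation: Video-generative world models are widely used as neural simulators for embodied planning and policy learning, yet their ability to predict physical risk and severe consequences is rarely evaluated. The study finds that these models often downplay or omit key danger cues and severe outcomes of hazardous actions, which can induce unsafe preferences when planning and training on imagined rollouts.
Result: Experiments on an ICAT-based benchmark show that mainstream world models frequently miss risk mechanisms and triggering conditions and miscalibrate severity, falling short of the reliability required for safety-critical embodied deployment.
Insight: The novelty is grounding safety testing in real incident reports and knowledge (structured risk memories) and constraining risk-case generation with causal chains and severity labels, so that world models' risk prediction can be evaluated systematically. This provides a new, well-grounded benchmark and method for assessing and improving the safety of embodied AI systems.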
Abstract: Video-generative world models are increasingly used as neural simulators for embodied planning and policy learning, yet their ability to predict physical risk and severe consequences is rarely evaluated. We find that these models often downplay or omit key danger cues and severe outcomes for hazardous actions, which can induce unsafe preferences during planning and training on imagined rollouts. We propose ICAT, which grounds testing in real incident reports and safety manuals by building structured risk memories and retrieving/composing them to constrain the generation of risk cases with causal chains and severity labels. Experiments on an ICAT-based benchmark show that mainstream world models frequently miss mechanisms and triggering conditions and miscalibrate severity, falling short of the reliability required for safety-critical embodied deployment.
[288] Human Cognition in Machines: A Unified Perspective of World Models cs.RO | cs.AI | cs.CV | cs.ETPDF
Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari
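TL;DR: This paper proposes a unified framework grounded in Cognitive Architecture Theory (CAT) for evaluating and classifying world models, points out that intrinsic motivation and meta-cognition are under-researched, and introduces “Epistemic World Models” as a new category covering agent frameworks for scientific discovery that operate over structured knowledge.
Details
Motivation: World model research often claims “human-like” cognitive capability but lacks systematic evaluation from the first principles of Cognitive Architecture Theory (CAT). This paper provides a unified framework to analyze the cognitive functions of world models comprehensively and to identify research gaps.
Result: Applying the unified taxonomy to video, embodied, and epistemic world models reveals research directions that prior taxonomies did not identify, in particular that intrinsic motivation and meta-cognition remain drastically under-researched.
Insight: The novelty is a CAT-based framework covering the full set of cognitive functions (memory, perception, language, reasoning, imagining, motivation, and meta-cognition), concrete research directions informed by active inference and global workspace theory, and the new “Epistemic World Models” category for scientific discovery over structured knowledge.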
Abstract: This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost “human-like” cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles in Cognitive Architecture Theory (CAT). We present a conceptual unified framework for world models that fully incorporates all the cognitive functions associated with CAT (i.e. memory, perception, language, reasoning, imagining, motivation, and meta-cognition) and identify gaps in the research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and meta-cognition remain drastically under-researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.
[289] Rewind-IL: Online Failure Detection and State Respawning for Imitation Learning cs.RO | cs.AI | cs.CVPDF
Gehan Zheng, Sanjay Seenivasan, Matthew Johnson-Roberson, Weiming Zhi
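TL;DR: Rewind-IL is a training-free online safeguard framework for generative action-chunked imitation policies. It combines a zero-shot failure detector based on the Temporal Inter-chunk Discrepancy Estimate (TIDE), calibrated with split conformal prediction, with a state-respawning mechanism that rewinds the robot to a semantically verified safe intermediate state when a failure is detected, improving the reliability of imitation learning.
Details
Motivation: Imitation learning lets robots acquire complex visuomotor manipulation skills from demonstrations, but deployment failures remain a major obstacle, especially for long-horizon action-chunked policies. Existing runtime monitors either require failure data, over-trigger under benign feature drift, or stop at detection without providing a recovery mechanism.
Result: Experiments on real-world and simulated long-horizon manipulation tasks, including transfer to flow-matching action-chunked policies, show that policy-internal consistency combined with semantically grounded respawning is a practical route to more reliable imitation learning.
Insight: Contributions include: the zero-shot failure detector TIDE, which monitors policy-internal consistency and needs no failure data; offline use of a vision-language model to identify recovery checkpoints and build a compact feature database; and, on an online failure, rewinding to a semantically verified safe state and restarting inference, which gives training-free failure recovery.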
Abstract: Imitation learning has enabled robots to acquire complex visuomotor manipulation skills from demonstrations, but deployment failures remain a major obstacle, especially for long-horizon action-chunked policies. Once execution drifts off the demonstration manifold, these policies often continue producing locally plausible actions without recovering from the failure. Existing runtime monitors either require failure data, over-trigger under benign feature drift, or stop at failure detection without providing a recovery mechanism. We present Rewind-IL, a training-free online safeguard framework for generative action-chunked imitation policies. Rewind-IL combines a zero-shot failure detector based on Temporal Inter-chunk Discrepancy Estimate (TIDE), calibrated with split conformal prediction, with a state-respawning mechanism that returns the robot to a semantically verified safe intermediate state. Offline, a vision-language model identifies recovery checkpoints in demonstrations, and the frozen policy encoder is used to construct a compact checkpoint feature database. Online, Rewind-IL monitors self-consistency in overlapping action chunks, tracks similarity to the checkpoint library, and, upon failure, rewinds execution to the latest verified safe state before restarting inference from a clean policy state. Experiments on real-world and simulated long-horizon manipulation tasks, including transfer to flow-matching action-chunked policies, demonstrate that policy-internal consistency coupled with semantically grounded respawning offers a practical route to improved reliability in imitation learning. Supplemental materials are available at https://sjay05.github.io/rewind-il
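The calibration step named in this abstract, split conformal prediction, can be sketched with the standard finite-sample quantile rule. How TIDE scores are actually computed is not given in the abstract, so the discrepancy scores below are made-up integers, and `conformal_threshold` and `is_failure` are illustrative names rather than the paper's API.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score. With exchangeable successful rollouts, a new successful
    rollout exceeds it with probability at most alpha."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples for this alpha: never flag a failure.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def is_failure(score, threshold):
    """Flag a failure when the online discrepancy score exceeds the threshold."""
    return score > threshold


# Toy discrepancy scores collected on 19 held-out successful rollouts.
calibration = list(range(1, 20))
threshold = conformal_threshold(calibration, alpha=0.1)  # 18th smallest score
```

The appeal of this calibration is that it needs only scores from successful rollouts, which matches the abstract's point that the detector works without failure data.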
[290] ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning cs.RO | cs.CVPDF
Tuan Van Vo, Tan Q. Nguyen, Khang Nguyen, Nhat Xuan Tran, Duy H. M. Nguyen
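TL;DR: This paper proposes ReFineVLA, a framework for strengthening the reasoning of vision-language-action (VLA) models in robotic manipulation. It augments datasets with reasoning rationales generated by an expert teacher model and fine-tunes pre-trained VLA models on them, improving interpretability and generalization on complex, long-horizon tasks.
Details
Motivation: Existing VLA models typically learn only a functional input-action mapping and overlook explicit reasoning, which limits their interpretability and generalization on complex, long-horizon manipulation tasks.
Result: On the WidowX and Google Robot task benchmarks in the SimplerEnv simulation environment, ReFineVLA achieves state-of-the-art performance, with success rates above the second-best method on both benchmarks.
Insight: The novelty is fine-tuning VLA models on datasets augmented with teacher-guided reasoning rationales, which brings explicit reasoning into the model. This improves multimodal understanding and generalization, and attention-map visualization is used to verify the alignment among visual observations, language prompts, and to-be-executed actions.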
Abstract: Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations with linguistic instructions into desired robotic actions. Despite their advancements, VLAs often overlook explicit reasoning and learn only functional input-action mappings, omitting crucial logical steps; this weakness is especially pronounced in interpretability and generalization for complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided reasons. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we fine-tune pre-trained VLAs on the reasoning-enriched datasets with ReFineVLA, while maintaining the underlying generalization abilities and boosting reasoning capabilities. We also conduct attention map visualization to analyze the alignment among visual observation, linguistic prompts, and to-be-executed actions of ReFineVLA, reflecting the model's ability to focus on relevant tasks and actions. Through this additional step, we find that ReFineVLA-trained models exhibit a meaningful agreement between vision-language and action domains, highlighting the enhanced multimodal understanding and generalization. Evaluated across a suite of simulated manipulation benchmarks on SimplerEnv with both WidowX and Google Robot tasks, ReFineVLA achieves state-of-the-art performance, exceeding the second-best method in success rate on both the WidowX benchmark and the Google Robot tasks.
[291] ST-$π$: Structured SpatioTemporal VLA for Robotic Manipulation cs.RO | cs.CVPDF
Chuanhao Ma, Hanyu Zhou, Shihan Peng, Yan Li, Tao Gu
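TL;DR: This paper proposes ST-π, a structured spatiotemporal vision-language-action model for fine-grained spatiotemporal robotic manipulation. A spatiotemporal VLM generates causally ordered chunk-level action prompts, and a spatiotemporal action expert then jointly models spatial dependencies and temporal causality to predict step-level action parameters, so that global spatiotemporal behavior is planned explicitly and local spatiotemporal control is refined.
Details
Motivation: Existing VLA models struggle with fine-grained spatiotemporal manipulation: their spatiotemporal reasoning is largely implicit, which makes it hard to handle multiple sequential behaviors with explicit spatiotemporal boundaries.
Result: Extensive experiments on a real-world robotic dataset demonstrate the effectiveness of the model.
Insight: The novelty is a spatiotemporal action expert guided by a structured dual generator, together with a real-world dataset with structured spatiotemporal annotations for fine-tuning, which enables explicit, decomposed spatiotemporal reasoning from global planning down to local control.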
Abstract: Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Typically, existing methods mainly embed spatiotemporal knowledge into visual and action representations, and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$π$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces, and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding and temporal grounding. 2) Spatiotemporal action expert. Conditioned on chunk-level action prompts, we design a structured dual-generator guidance to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we propose a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments have been conducted to demonstrate the effectiveness of our model. Our code link: https://github.com/chuanhaoma/ST-pi.
cs.DB [Back]
[292] NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions cs.DB | cs.AI | cs.CL | cs.LGPDF
Shizheng Hou, Wenqi Pei, Nuo Chen, Quang-Trung Ta, Peng Lu
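TL;DR: This paper presents NL2SQLBench, a modular benchmarking framework for evaluating LLM-based natural-language-to-SQL systems. The framework breaks NL2SQL systems into three core modules (schema selection, candidate generation, and query revision) and defines fine-grained evaluation metrics for each. Using it, the authors evaluate ten open-source methods on the BIRD and ScienceBenchmark datasets with DeepSeek-V3 and GPT-4o mini, revealing shortfalls in accuracy and computational efficiency and pointing out problems in current benchmark datasets and evaluation rules.
Details
Motivation: Large language models have greatly advanced NL2SQL, but the pace of development has outrun systematic evaluation, leaving a critical gap in understanding the effectiveness, efficiency, and limitations of these systems.
Result: Ten representative open-source methods were evaluated on the BIRD and ScienceBenchmark development sets with DeepSeek-V3 and GPT-4o mini. The evaluation shows substantial room for accuracy improvement and significant computational inefficiency that hampers real-world adoption.
Insight: The novelty is the first modular NL2SQL evaluation framework, which decouples systems into three core modules with fine-grained module-level metrics. This gives a clear reference point for fair comparison and guidance for targeted future work. Viewed objectively, the modular analysis and multi-dimensional evaluation form a methodology that could be reused for evaluating other complex AI systems.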
Abstract: Natural Language to SQL (NL2SQL) technology empowers non-expert users to query relational databases without requiring SQL expertise. While large language models (LLMs) have greatly improved NL2SQL algorithms, their rapid development outpaces systematic evaluation, leaving a critical gap in understanding their effectiveness, efficiency, and limitations. To this end, we present NL2SQLBench, the first modular evaluation and benchmarking framework for LLM-enabled NL2SQL approaches. Specifically, we dissect NL2SQL systems into three core modules: Schema Selection, Candidate Generation, and Query Revision. For each module, we comprehensively review existing strategies and propose novel fine-grained metrics that systematically quantify module-level effectiveness and efficiency. We further implement these metrics in a flexible multi-agent framework, allowing configurable benchmarking across diverse NL2SQL approaches. Leveraging NL2SQLBench, we rigorously evaluate ten representative open-source methods on two datasets, the BIRD development set and the ScienceBenchmark development set, using two LLMs, DeepSeek-V3 and GPT-4o mini. We systematically assess each approach across the three core modules and evaluate multiple critical performance dimensions. Our evaluation reveals significant gaps in existing NL2SQL methods, highlighting not only substantial room for accuracy improvements but also the significant computational inefficiency, which severely hampers real-world adoption. Furthermore, our analysis identifies critical shortcomings in current benchmark datasets and evaluation rules, emphasizing issues such as inaccurate gold SQL annotations and limitations in existing evaluation rules. By synthesizing these insights into a unified benchmarking, our study establishes a clear reference point for fair comparison and serves as essential guidance for future targeted innovation in NL2SQL technology.
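cs.CR
[293] Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks cs.CR | cs.AI | cs.CL
Tyler H. Merves, Michael H. Conaway, Joseph M. Escobar, Hakan T. Otal, Unal Tatar
TL;DR: This paper systematically benchmarks frontier large language models on offensive cybersecurity tasks, evaluating 10 models from 7 providers on all 200 challenges of the NYU CTF Bench. The study extends the D-CIPHER multi-agent framework and finds that the Kali Linux environment and model selection are the key drivers of performance, whereas prompt engineering yields diminishing or even negative returns in well-equipped environments.
Details
Motivation: The aim is to comprehensively evaluate and compare the practical capabilities of different frontier LLMs on complex, tool-intensive offensive cybersecurity tasks (such as penetration testing), providing empirical evidence for model selection and system construction.
Result: On the NYU CTF Bench, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%); Gemini 3 Flash is the most cost-efficient ($0.05 per solve). The Kali Linux environment yields a +9.5 percentage-point improvement over Ubuntu.
Insight: The novelty is a comprehensive evaluation framework with multi-provider backend support, a custom Kali Linux environment with over 100 pre-installed penetration testing tools, and runtime tool-discovery agents. The core findings are: in well-equipped environments, environment configuration and model selection matter more than prompt engineering interventions; same-model configurations outperform mixed-tier pairings; and asymmetric planner/executor model assignments provide no meaningful benefit.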
Abstract: We present, to our knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. Building on the D-CIPHER multi-agent framework, we extend it with multi-provider backend support, a custom Kali Linux environment with over 100 pre-installed penetration testing tools, and runtime tool-discovery agents. Through a controlled factorial study, we find that the Kali Linux environment yields a +9.5 percentage-point improvement over Ubuntu, while auto-prompting and category-specific tips often degrade performance in well-equipped environments. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), with Gemini 3 Flash offering the best cost-efficiency at $0.05 per solve. Asymmetric planner/executor model assignments provide no meaningful benefit while coherent same-model configurations consistently outperform mixed-tier pairings. Our results indicate that environment tooling and model selection emerge as the strongest drivers of performance, whereas prompt engineering interventions show diminishing or negative returns in well-equipped environments. Reported performance reflects both model reasoning ability and compatibility with agent tooling and API integration.
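[294] Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks cs.CR | cs.AI | cs.CL
Md Rysul Kabir, Zoran Tiganj
TL;DR: This paper studies the harmful compliance that results from jailbreaking open-weight language models via three different routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. Although all three reach near-ceiling harmful compliance, they differ markedly in model capabilities, behavioral profile, and internal failure mode. RLVR-jailbroken models best preserve explicit harm recognition and show the least capability degradation, whereas SFT-jailbroken models show the largest collapse in safety judgments and the most behavioral drift.
Details
Motivation: The aim is to examine how different jailbreak interventions lead language models to comply with harmful requests, and to compare the routes in their behavior and internal mechanisms, in order to better understand the diversity of safety failures.
Result: On harmful-compliance benchmarks, all three methods reach near-ceiling compliance rates. RLVR-jailbroken models show minimal capability degradation on standard benchmarks and can identify harmful prompts in a structured self-audit; with a reflective safety scaffold, their harmful behavior drops sharply to near the baseline. SFT-jailbroken models show substantial capability loss on standard benchmarks and the largest behavioral drift.
Insight: The novelty is a systematic comparison of the behavioral side effects and mechanistic differences of three jailbreak routes, showing that RLVR preserves the model's safety geometry and explicit harm recognition while SFT causes broader distributed drift. The results indicate that harmful compliance rate alone is insufficient for assessing model safety; differences in internal mechanisms and behavioral profile matter, which offers new insight for targeted model repair and safety evaluation.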
Abstract: Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold: when a harmful prompt is prepended with an instruction to reflect on safety standards, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, the highest behavioral drift, and a substantial capability loss on standard benchmarks. Abliteration is family-dependent in both self-audit and response to a reflective safety scaffold. Mechanistic and repair analyses further separate the routes: abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models, but has little effect on SFT-jailbroken models. Together, these results show that jailbreaks can produce vastly different properties despite similar harmfulness, with models jailbroken via RLVR showing remarkable similarity to the base model.
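cs.IR
[295] MARA: A Multimodal Adaptive Retrieval-Augmented Framework for Document Question Answering cs.IR | cs.AI | cs.CL
Hui Wu, Haoquan Zhai, Yuchen Li, Hengyi Cai, Peirong Zhang
TL;DR: This paper proposes MARA, a multimodal adaptive retrieval-augmented framework that addresses limitations of existing methods for retrieval-based multimodal document question answering. A Query-Aligned Region Encoder builds multi-level document representations and reweights them by query relevance, and a Self-Reflective Evidence Controller monitors evidence sufficiency during generation and adaptively incorporates lower-ranked content, improving retrieval precision and answer quality.
Details
Motivation: When extended to multimodal document QA, existing retrieval-augmented generation (RAG) methods rely on query-agnostic document representations (which overlook salient content) and use static top-k evidence selection, which cannot adapt to the uncertain distribution of relevant information.
Result: Experiments on six multimodal QA benchmarks show that MARA consistently outperforms existing SOTA methods in retrieval relevance and answer quality.
Insight: The novelty is introducing query-adaptive mechanisms into both retrieval and generation: (1) a Query-Aligned Region Encoder that provides multi-level representations and relevance-based reweighting; and (2) a Self-Reflective Evidence Controller that dynamically monitors and incorporates evidence using a sliding-window strategy. This offers a more flexible, adaptive approach to evidence retrieval and integration for multimodal RAG.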
Abstract: Retrieval-based multimodal document QA aims to identify and integrate relevant information from visually rich documents with complex multimodal structures. While retrieval-augmented generation (RAG) has shown strong performance in text-based QA, its extensions to multimodal documents remain underexplored and face significant limitations. Specifically, current approaches rely on query-agnostic document representations that overlook salient content and use static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. To address these limitations, we propose the Multimodal Adaptive Retrieval-Augmented (MARA) framework, which introduces query-adaptive mechanisms to both retrieval and generation. MARA consists of two components: a Query-Aligned Region Encoder that builds multi-level document representations and reweights them based on query relevance to improve retrieval precision; and a Self-Reflective Evidence Controller that monitors evidence sufficiency during generation and adaptively incorporates content from lower-ranked sources using a sliding-window strategy. Experiments on six multimodal QA benchmarks demonstrate that MARA consistently improves retrieval relevance and answer quality over existing SOTA methods.
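[296] Benchmarking Real-Time Question Answering via Executable Code Workflows cs.IR | cs.AI | cs.CL
Wenjie Zhou, Yuan Gao, Xin Zhou, Hao Fu, Zhongjian Miao
TL;DR: The paper proposes RT-QA, a dynamic benchmark framework for evaluating agents' real-time question answering. It uses executable code workflows (such as web crawling and DOM parsing) to generate up-to-date answers as ground truth at evaluation time, and includes a self-repair mechanism to adapt to changes in web page structure. The benchmark covers 12 domains with 320 Chinese questions; evaluation shows that current top models (e.g., GPT-5.2) fall well short in real-time adaptability, with a best accuracy of only 46%.
Details
Motivation: Most existing benchmarks are static and cannot capture the timeliness of information or the dynamic evolution of real-world knowledge, which limits evaluation of the real-time information retrieval capabilities of search-integrated agents in real applications.
Result: Extensive evaluation of state-of-the-art models such as GPT-5.2 and GLM-4.7 on RT-QA shows that even the best models reach only 46% accuracy, well below a practically usable level.
Insight: The novelty is an evaluation framework that uses executable code workflows to dynamically generate real-time ground-truth answers, with a self-repair mechanism to cope with changes in web page structure. The analysis identifies two key failure modes: "lazy retrieval" (relying on search snippets rather than deeply scanning websites) and "temporal confusion" (failing to re-anchor historical information to the current time for reasoning), which points future agent development toward stronger retrieval strategies and temporal state management.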
Abstract: Retrieving real-time information is a fundamental capability for search-integrated agents in real-world applications. However, existing benchmarks are predominantly static and therefore fail to capture the temporal dynamics of information and the continuously evolving nature of real-world knowledge. To address this limitation, we propose RT-QA, a dynamic evaluation framework that leverages executable code workflows to retrieve up-to-date answers at evaluation time. Specifically, we construct an agent-driven pipeline that autonomously generates code for web crawling and DOM-based answer extraction to produce real-time ground truth. To ensure robust evaluation over time, the pipeline further incorporates a self-repair mechanism to adapt to changes in web page structures. RT-QA spans 12 domains (e.g., Finance, Sports) with 320 Chinese questions categorized into three difficulty levels. Extensive evaluations of state-of-the-art models (e.g., GPT-5.2, GLM-4.7) reveal significant limitations in real-time adaptability: even the best models achieve only 46% accuracy. Our analysis highlights two primary failure modes: (1) Lazy Retrieval, where agents rely on search snippets instead of deeply scanning specific websites for information (20% of failures); and (2) Temporal Confusion, a cognitive error where agents retrieve a historical date (e.g., an event in 2024) and fail to re-anchor to the current time (2026) for subsequent reasoning. These findings suggest that future agents require not just better retrieval strategies, but robust temporal state management.
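[297] LLMAR: A Tuning-Free Recommendation Framework for Sparse and Text-Rich Industrial Domains cs.IR | cs.CL
Ryogo Hishikawa, Ichiro Kataoka, Shinya Yuda
TL;DR: This paper proposes LLMAR, a tuning-free recommendation framework designed for industrial B2B domains where data is extremely sparse but textual interactions are rich. The framework uses LLM reasoning to transform users' behavioral history into structured semantic motives and combines a self-correction mechanism with asynchronous batch processing, substantially reducing operational cost while maintaining high accuracy and explainability.
Details
Motivation: Industrial B2B applications (e.g., construction site risk prediction, material procurement) face extreme data sparsity yet have rich textual interactions. Traditional ID-based collaborative filtering fails for lack of co-occurrence signals, while fine-tuning standard LLMs incurs high operational costs and struggles with frequent data drift.
Result: Evaluations on public benchmarks (MovieLens-1M, Amazon Prime Pantry) and a sparse industrial dataset (construction risk prediction) show that LLMAR outperforms state-of-the-art learning-based models (e.g., SASRecF), achieving up to a 54.6% nDCG@10 improvement on the industrial dataset, while inference cost remains highly practical (about $1 per 1,000 users).
Insight: The paper's novelty lies in: (1) capturing users' "latent motives" directly through LLM reasoning, with no training; (2) a Reflection Loop self-correction mechanism that reduces hallucinations and resolves "context competition" between past history and current instructions; and (3) a cost-effective architecture that combines tuning-free components with asynchronous batch processing. For B2B domains without strict real-time requirements, this offers an alternative that outperforms training-based approaches in accuracy, explainability, and operational cost.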
Abstract: Industrial B2B applications (e.g., construction site risk prediction, material procurement) face extreme data sparsity yet feature rich textual interactions. In such environments, traditional ID-based collaborative filtering fails lacking co-occurrence signals, while fine-tuning standard Large Language Models (LLMs) incurs high operational costs and struggles with frequent data drift. We propose LLMAR (LLM-Annotated Recommendation), a tuning-free framework. Moving beyond simple embeddings, LLMAR systematically integrates LLM reasoning to capture user “latent motives” without any training process. We introduce three core contributions: (1) Inference-Driven Annotation: uses LLMs to transform behavioral history into structured semantic motives, enabling reasoning-based matching unattainable by ID-based methods; (2) Reflection Loop: a self-correction mechanism that refines generated queries to mitigate hallucinations and resolve “context competition” between past history and current instructions; and (3) Cost-Effective Architecture: relies on tuning-free components and asynchronous batch processing to minimize maintenance costs. Evaluations on public benchmarks (MovieLens-1M, Amazon Prime Pantry) and a sparse industrial dataset (construction risk prediction) demonstrate that LLMAR outperforms state-of-the-art learning-based models (SASRecF), achieving up to a 54.6% nDCG@10 improvement on the industrial dataset. Inference costs remain highly practical (~$1 per 1,000 users). For B2B domains where strict real-time latency is not critical, combining LLM reasoning with self-verification offers a superior alternative to training-based approaches across accuracy, explainability, and operational cost.
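The headline number above is a relative nDCG@10 improvement. As a reading aid, here is a minimal sketch of how nDCG@k is commonly computed, assuming the standard log2 rank discount and an ideal ranking built from the same list of relevance grades; the function names are illustrative, and the paper may use a different variant of the metric.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades (rank 0 is the top hit)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal (sorted) ranking.

    Assumption: every relevant item appears somewhere in `ranked_relevances`.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def relative_improvement(new, old):
    """Relative gain of `new` over `old` as a fraction (0.5 means +50%)."""
    return (new - old) / old
```

Under this reading, a "54.6% improvement" means `relative_improvement(new, old)` is about 0.546, which is a ratio of two nDCG scores and not an absolute gain in nDCG points.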
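[298] On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability cs.IR | cs.CL
Yongkang Li, Panagiotis Eustratiadis, Yixing Fan, Evangelos Kanoulas
TL;DR: This paper presents the first systematic study of the robustness of LLM-based dense retrievers, analyzed from two complementary perspectives: generalizability and stability. Evaluating 30 datasets across four benchmarks, it finds that instruction-tuned models perform well while models optimized for complex reasoning pay a "specialization tax"; it also finds that LLM-based retrievers are more robust than encoder-only baselines to typos and corpus poisoning but remain vulnerable to semantic perturbations.
Details
Motivation: Decoder-only LLMs are increasingly replacing BERT-style architectures as the mainstream backbone for dense retrieval, with substantial performance gains and broad adoption, yet the robustness of these LLM-based retrievers remains underexplored. This paper therefore systematically evaluates their generalizability and stability.
Result: For generalizability, a linear mixed-effects analysis over 30 datasets shows that instruction-tuned models perform well overall, whereas models optimized for complex reasoning generalize poorly to broader contexts. For stability, LLM-based retrievers are more robust than encoder-only baselines to typos and corpus poisoning attacks but remain vulnerable to semantic perturbations such as synonym substitution, and scaling model size generally improves robustness.
Insight: The novelty is the first systematic robustness evaluation of LLM-based dense retrievers from the dual perspectives of generalizability and stability, along with the "specialization tax" concept to explain limits on generalization. The analysis shows that embedding geometry (e.g., angular uniformity) can serve as a predictive signal for lexical stability, offering new insight for robustness-aware retriever design.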
Abstract: Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a “specialization tax,” exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
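[299] Document-as-Image Representations Fall Short for Scientific Retrieval cs.IR | cs.AI | cs.CL
Ghazal Khalighinejad, Raghuveer Thirukovalluru, Alexander H. Oh, Bhuwan Dhingra
TL;DR: This paper observes that many current document embedding models use "document-as-image" representations (treating rendered pages as images) and argues that this approach is ill-suited to retrieval over text-rich multimodal scientific documents. The authors introduce ArXivDoc, a new benchmark built from LaTeX sources, which gives direct access to structured elements. A systematic comparison of text, image, and multimodal representations finds that text-based representations are most effective, while document-as-image representations are consistently suboptimal, especially for long documents.
Details
Motivation: Existing benchmarks for scientific document retrieval (such as ArXivQA and ViDoRe) implicitly favor document-as-image representations, which cannot effectively handle critical evidence distributed across text, tables, and figures; retrieval methods better suited to text-rich multimodal scientific documents are therefore needed.
Result: Experiments on ArXivDoc show that: (1) document-as-image representations are consistently suboptimal, and performance degrades as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
Insight: The novelty is ArXivDoc, a benchmark built from LaTeX sources that supports controlled queries grounded in structured evidence. The analysis shows that, for scientific document retrieval, directly using text and structural information is more effective than relying on image representations, which challenges the current benchmark design paradigm that favors image representations.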
Abstract: Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
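eess.AS
[300] NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR eess.AS | cs.CL | cs.SD
Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao
TL;DR: This paper presents NIM4-ASR, a production-oriented LLM-based automatic speech recognition framework focused on efficiency and robustness. It redesigns the multi-stage training paradigm (a reformulated pre-training architecture, iterative asynchronous supervised fine-tuning, and a dedicated reinforcement learning stage) to address the limited downward scalability of existing LLM-ASR models in resource-constrained deployments and their hallucinations under acoustically challenging conditions.
Details
Motivation: Although existing LLM-based ASR models perform well on public benchmarks, their training is predominantly data-driven and does not adequately address key deployment challenges, particularly limited downward scalability in resource-constrained environments and hallucinations under poor acoustic conditions.
Result: Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with only 2.3B parameters, and substantially outperforms larger-scale competitors on internal benchmarks, particularly in entity-intensive real-world scenarios. It also supports million-scale hotword customization via retrieval-augmented generation, with sub-millisecond retrieval latency.
Insight: The core novelty is a principled delineation of the functional roles of the encoder and the LLM, and a multi-stage training paradigm redesigned accordingly so that each module stays within its intended capability boundary. Specific contributions include: reformulated pre-training to mitigate the modality gap and improve parameter efficiency; iterative asynchronous SFT to preserve acoustic fidelity and constrain representation drift; an ASR-specialized reinforcement learning stage to improve recognition quality and robustness; and a suite of production-oriented optimizations, such as robustness to noise and silence, real-time streaming inference, and RAG-based hotword customization.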
Abstract: Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed – particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks – particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.
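cs.SE
[301] CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora cs.SE | cs.CL
Shangyu Li, Juyong Jiang, Meibo Ren, Sizhe Zhong, Huiri Tan
TL;DR: CodePivot is a reinforcement-learning-based LLM training framework that uses Python as an intermediate representation to bootstrap a model's multilingual code transpilation ability without parallel corpora. Experiments show that the resulting 7B model performs strongly on transpilation tasks spanning 10 programming languages, even surpassing mainstream models with far more parameters.
Details
Motivation: Existing training-based transpilation methods rely on pairwise corpora and struggle to support many languages (especially low-resource ones), and existing reinforcement learning reward formulations are suboptimal.
Result: In experiments involving 10 programming languages, the 7B model outperforms substantially larger mainstream models, such as Deepseek-R1 and Qwen3-235B-A22B-Instruct-2507, on Python-to-Others and Others-to-All transpilation tasks, respectively, and outperforms a counterpart trained directly on Any-to-Any tasks on general transpilation tasks.
Insight: The core novelty is a framework that uses Python as a common intermediate representation (IR), together with a new reinforcement learning reward mechanism (the Aggressive-Partial-Functional reward), to bootstrap multilingual transpilation without parallel corpora. This addresses the scarcity of data for low-resource languages and the impracticality of pairwise training.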
Abstract: Transpilation, or code translation, aims to convert source code from one programming language (PL) to another. It is beneficial for many downstream applications, from modernizing large legacy codebases to augmenting data for low-resource PLs. Recent large language model (LLM)-based approaches have demonstrated immense potential for code translation. Among these approaches, training-based methods are particularly important because LLMs currently do not effectively adapt to domain-specific settings that suffer from a lack of knowledge without targeted training. This limitation is evident in transpilation tasks involving low-resource PLs. However, existing training-based approaches rely on a pairwise transpilation paradigm, making it impractical to support a diverse range of PLs. This limitation is particularly prominent for low-resource PLs due to a scarcity of training data. Furthermore, these methods suffer from suboptimal reinforcement learning (RL) reward formulations. To address these limitations, we propose CodePivot, a training framework that leverages Python as an intermediate representation (IR), augmented by a novel RL reward mechanism, Aggressive-Partial-Functional reward, to bootstrap the model’s multilingual transpilation ability without requiring parallel corpora. Experiments involving 10 PLs show that the resulting 7B model, trained on Python-to-Others tasks, consistently improves performance across both general and low-resource PL-related transpilation tasks. It outperforms substantially larger mainstream models with hundreds of billions more parameters, such as Deepseek-R1 and Qwen3-235B-A22B-Instruct-2507, on Python-to-Others tasks and Others-to-All tasks, respectively. In addition, it outperforms its counterpart trained directly on Any-to-Any tasks on general transpilation tasks. The code and data are available at https://github.com/lishangyu-hkust/CodePivot.
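cs.LG
[302] SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics cs.LG | cs.CL
Haolong Hu, Hanyu Li, Tiancheng He, Huahui Yi, An Zhang
TL;DR: This paper proposes SaFeR-Steer, a framework that addresses the safety of multimodal LLMs in multi-turn dialogue through staged synthetic bootstrapping and tutor-in-the-loop GRPO training. It also introduces the TCSR mechanism to propagate safety-failure signals and releases STEER, a multi-turn multimodal safety dataset. Experiments show that the method substantially improves both safety and helpfulness in single-turn and multi-turn dialogue.
Details
Motivation: Safety alignment of current MLLMs relies mainly on single-turn data and fixed-template dialogues, which mismatches multi-turn deployment, where attackers can escalate unsafe intent by exploiting the evolving visual-text history and long-context safety decay.
Result: On Qwen2.5-VL-3B/7B, SaFeR-Steer improves Safety/Helpfulness on single-turn benchmarks from 48.30/45.86 to 81.84/70.77 (3B) and from 56.21/60.32 to 87.89/77.40 (7B), and on multi-turn benchmarks from 12.55/27.13 to 55.58/70.27 (3B) and from 24.66/46.48 to 64.89/72.35 (7B). It also shifts failures to later turns, showing robustness beyond what model scaling alone provides.
Insight: The novelty is a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO training, plus the TCSR mechanism for propagating safety-failure signals across turns. This provides a systematic training method and evaluation perspective for long-context safety decay in multi-turn dialogue.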
Abstract: MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late-turn failures to earlier turns. I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2 to 10 turns. II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 -> 81.84/70.77 for 3B; 56.21/60.32 -> 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 -> 55.58/70.27 for 3B; 24.66/46.48 -> 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone. Code is available at https://github.com/Ed-Bg/SaFeR-Steer
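[303] S-GRPO: Unified Post-Training for Large Vision-Language Models cs.LG | cs.CL | cs.CV
Yuming Yan, Kai Tang, Sihong Chen, Ke Xu, Dan Hu
TL;DR: This paper proposes S-GRPO, a unified post-training framework for adapting large vision-language models (LVLMs). The framework combines the guidance of imitation learning with the multi-trajectory exploration of preference optimization, introducing Conditional Ground-Truth Trajectory Injection (CGI) to address both the catastrophic forgetting caused by conventional supervised fine-tuning (SFT) and the cold-start optimization collapse faced by reinforcement learning (RL), with the goal of more efficient and balanced domain adaptation.
Details
Motivation: Current post-training methods for LVLMs fall into two paradigms, supervised fine-tuning (SFT) and reinforcement learning (RL), each with its own flaws: SFT shifts the model's capability distribution and causes catastrophic forgetting, while RL on sparse-reward visual tasks often fails to sample any valid trajectories because of the cold-start problem, leading to optimization collapse. This paper aims to bridge the two approaches and address their respective inefficiencies.
Result: Theoretical analysis and empirical results indicate that S-GRPO bridges the gap between SFT and RL, greatly accelerates convergence, and achieves strong domain adaptation while preserving the base model's general-purpose capabilities.
Insight: The main novelty is the unified S-GRPO framework, whose core is the Conditional Ground-Truth Trajectory Injection (CGI) mechanism. A binary verifier detects exploration failure within a group of trajectories, and on failure the verified ground-truth trajectory is injected as a high-reward anchor, which reformulates the supervised learning objective as a high-advantage component of the policy gradient. This compels the model to dynamically balance exploiting the expert trajectory against exploring novel visual concepts, combining the determinism of SFT with the exploratory strengths of RL.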
Abstract: Current post-training methodologies for adapting Large Vision-Language Models (LVLMs) generally fall into two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Despite their prevalence, both approaches suffer from inefficiencies when applied in isolation. SFT forces the model’s generation along a single expert trajectory, often inducing catastrophic forgetting of general multimodal capabilities due to distributional shifts. Conversely, RL explores multiple generated trajectories but frequently encounters optimization collapse - a cold-start problem where an unaligned model fails to spontaneously sample any domain-valid trajectories in sparse-reward visual tasks. In this paper, we propose Supervised Group Relative Policy Optimization (S-GRPO), a unified post-training framework that integrates the guidance of imitation learning into the multi-trajectory exploration of preference optimization. Tailored for direct-generation visual tasks, S-GRPO introduces Conditional Ground-Truth Trajectory Injection (CGI). When a binary verifier detects a complete exploratory failure within a sampled group of trajectories, CGI injects the verified ground-truth trajectory into the candidate pool. By assigning a deterministic maximal reward to this injected anchor, S-GRPO enforces a positive signal within the group-relative advantage estimation. This mechanism reformulates the supervised learning objective as a high-advantage component of the policy gradient, compelling the model to dynamically balance between exploiting the expert trajectory and exploring novel visual concepts. Theoretical analysis and empirical results demonstrate that S-GRPO gracefully bridges the gap between SFT and RL, drastically accelerates convergence, and achieves superior domain adaptation while preserving the base model’s general-purpose capabilities.
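The injection step described above can be illustrated with a small sketch. It assumes the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation); the abstract does not spell out the exact estimator, so the function name, the `eps` smoothing term, and the reward value of 1.0 for the injected anchor are all illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev

def cgi_group_advantages(rewards, gt_reward=1.0, eps=1e-6):
    """Group-relative advantages with conditional ground-truth injection (sketch).

    rewards: binary verifier scores for the sampled rollouts (1.0 = valid).
    If every rollout failed, one extra slot for the ground-truth trajectory
    is appended with the maximal reward (hypothetical value: gt_reward).
    Returns (group_rewards, advantages, injected).
    """
    injected = all(r == 0.0 for r in rewards)
    group = (list(rewards) + [gt_reward]) if injected else list(rewards)
    mu, sigma = mean(group), pstdev(group)
    # Assumed GRPO-style normalization; eps avoids division by zero
    # when all rewards in the group are equal.
    advantages = [(r - mu) / (sigma + eps) for r in group]
    return group, advantages, injected
```

Without injection, an all-failure group has zero variance and every advantage is zero, so the policy gradient carries no signal. With the anchor added, the ground-truth slot receives a positive advantage and the failed rollouts receive negative ones, which matches the abstract's description of a guaranteed positive signal within the group.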
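[304] MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression cs.LG | cs.CL
Libo Sun, Peixiong He, Po-Wei Harn, Xiao Qin
TL;DR: MoE-nD is a mixture-of-experts framework for KV cache compression in long-context LLM inference. An offline-calibrated greedy solver selects an (eviction-ratio, K-bits, V-bits) tuple for each layer, giving heterogeneous per-layer compression under a global memory budget. Experiments show that on a LongBench-v1 subset MoE-nD matches the uncompressed baseline at 14x compression, and that it clearly outperforms other compression methods on AIME reasoning benchmarks.
Details
Motivation: Existing KV cache compression methods (such as token eviction, quantization, low-rank projection, or cross-layer sharing) typically apply the same compression recipe to every layer, yet layers respond very differently to each compression operation; this homogeneity limits both compression and accuracy.
Result: On a 4-task subset of LongBench-v1 (16k inputs), MoE-nD matches the uncompressed baseline (1.9GB) at 14x compression (136MB), while the other compressed baselines stay under 8/100 at comparable or smaller memory; on AIME reasoning benchmarks, MoE-nD gains 6 to 27 points over the strongest per-layer-quantization baseline.
Insight: The novelty is a per-layer heterogeneous mixture-of-experts routing framework that lets different layers use different combinations of compression strategies, with offline calibration minimizing predicted quality loss under a global memory budget. The analysis shows that sensitivity to compression varies across layers, and it gives a principled explanation for the lack of gains on short inputs.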
Abstract: KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor – token eviction (sequence), quantization (precision), low-rank projection (head dimension), or cross-layer sharing – but apply the same recipe to every layer. We show that this homogeneity leaves accuracy on the table: different layers respond very differently to each compression operation, and the optimal per-layer mix of eviction and quantization is far from uniform. We propose MoE-nD, a mixture-of-experts framework that routes each layer to its own (eviction-ratio, K-bits, V-bits) tuple under a global memory budget. An offline-calibrated greedy solver chooses the routing that minimizes predicted quality loss; at inference time, per-layer heterogeneous eviction and quantization are applied jointly through a single attention patch. On a 4-task subset of LongBench-v1 (16k inputs, n=50 per task, adapted reasoning-model protocol; see section Experiments), MoE-nD’s hetero variant matches our uncompressed 1.9GB baseline at 14x compression (136MB) while every other compressed baseline we tested (1d, 2d_uniform, 2d) at comparable or smaller memory stays under 8/100. The gains hold on AIME reasoning benchmarks (+6 to +27 pts over the strongest per-layer-quantization baseline across eight configurations). Two null results – MATH-500 and LongBench’s TREC – share a principled cause (short inputs, solver picks keep=1.0 on most layers), cleanly characterizing when per-layer eviction routing has headroom to help.
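To make the idea of budgeted per-layer routing concrete, here is a minimal greedy sketch. It is not the paper's solver: the data layout (a list of `(memory_mb, predicted_loss)` options per layer, each standing for one hypothetical (eviction-ratio, K-bits, V-bits) tuple) and the "smallest loss increase per MB saved" rule are assumptions chosen for illustration.

```python
def greedy_route(layer_options, budget_mb):
    """Pick one option index per layer so that total memory fits the budget.

    layer_options[i] is a list of (memory_mb, predicted_loss) pairs for layer i,
    ordered from least to most compressed (memory strictly decreasing).
    Returns (choice, total_memory_mb).
    """
    choice = [0] * len(layer_options)
    total = sum(opts[0][0] for opts in layer_options)
    while total > budget_mb:
        best = None  # (loss increase per MB saved, layer index)
        for i, opts in enumerate(layer_options):
            j = choice[i]
            if j + 1 >= len(opts):
                continue  # this layer is already at its most compressed option
            saved = opts[j][0] - opts[j + 1][0]
            added = opts[j + 1][1] - opts[j][1]
            ratio = added / saved
            if best is None or ratio < best[0]:
                best = (ratio, i)
        if best is None:
            raise ValueError("budget is unreachable with the given options")
        i = best[1]
        total -= layer_options[i][choice[i]][0] - layer_options[i][choice[i] + 1][0]
        choice[i] += 1
    return choice, total
```

With two layers where the first tolerates compression far better than the second, the sketch compresses the first layer all the way before touching the second, which is the kind of non-uniform allocation the abstract argues for.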
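[305] Tool Learning Needs Nothing More Than a Free 8B Language Model cs.LG | cs.CL
Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Junqiang Zheng, Saiyong Yang
TL;DR: This paper proposes TRUSTEE, a data-free method for training tool-calling agents that requires no annotated data and uses only free open-source language models (as small as 8B parameters) to dynamically simulate the environment. The method builds the dynamic environment from task generation, user simulation, tool simulation, and trajectory evaluation, and pairs it with an adaptive curriculum learning mechanism that adjusts task difficulty dynamically, yielding consistent improvements across multiple domains.
Details
Motivation: Existing methods for training tool-calling agents typically rely on annotated training data or require advanced commercial language models to synthesize static environments, which limits scalability and resource efficiency. This paper explores whether tool-calling agents can be trained with environments simulated dynamically by free, small open-source models alone, reducing dependence on expensive annotated data, real interactions, executable tools, or commercial models.
Result: Experiments show that TRUSTEE brings consistent improvements across multiple domains and outperforms all baselines that require extra external resources for training. This confirms that even a simulated environment backed only by a local 8B language model can set a strong baseline for tool learning.
Insight: The paper's novelty is a fully data-free, resource-efficient training paradigm: with dynamic environment simulation and adaptive curriculum learning, free small open-source models suffice for effective tool learning. Viewed objectively, the method shows the potential of scaling environments through careful design under limited resources, and suggests a direction for tool learning research in resource-constrained settings.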
Abstract: Reinforcement learning (RL) has become a prevalent paradigm for training tool-calling agents, which typically requires online interactive environments. Existing approaches either rely on training data with ground truth annotations or require advanced commercial language models (LMs) to synthesize environments that remain fixed once created. In this work, we propose TRUSTEE, a data-free method for training tool-calling agents with dynamic environments fully simulated by free open-source LMs that can be as small as 8B, including task generation, user simulation, tool simulation and trajectory evaluation, paired with an adaptive curriculum learning mechanism that controls various aspects of the task difficulty dynamically during training. Our empirical results show that TRUSTEE brings consistent improvements across various domains and outperforms all the baselines which require extra external resources for training. These confirm that, with a sufficiently sophisticated design, even simulated environments with a local 8B LM as the backbone could set a strong baseline for tool learning, without expensive annotated data, realistic human interactions, executable tools or costly verifiable environments from human experts or commercial LMs. We hope our proposed paradigm could inspire future research on environment scaling with limited resources.
[306] Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning cs.LG | cs.CL | stat.ML
Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz
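TL;DR: This paper presents a principled framework for dynamic abstention in large language model (LLM) reasoning, targeting the compute wasted when chain-of-thought reasoning produces long, incorrect answers. The framework models abstention as an explicit action in a reinforcement learning setting, uses an abstention reward parameter to balance compute cost against information gained, and derives a principled rule: terminate generation when the value function falls below that reward.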
Details
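Motivation: LLMs using chain-of-thought reasoning often produce long, incorrect responses that waste substantial compute. Prior work has studied abstaining before or after generation, but dynamic abstention, which terminates unpromising reasoning traces mid-generation, has lacked principled guidance.
Result: Experiments on mathematical reasoning and toxicity avoidance tasks show improved selective accuracy over existing baselines.
Insight: The main contribution is a formal treatment of dynamic mid-generation abstention for LLMs within a regularized reinforcement learning framework, from which a principled and efficient rule follows (terminate when the value function falls below the abstention reward), giving a better trade-off between compute and accuracy.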
Abstract: Large language models (LLMs) using chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
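The stopping rule stated in the abstract (abstain when the value function falls below the abstention reward) can be illustrated with a minimal sketch. The per-token value estimates are assumed to be given as a plain list; how the paper approximates the value function is not shown here, and the name `first_abstention_step` is hypothetical.

```python
from typing import Optional, Sequence


def first_abstention_step(values: Sequence[float],
                          abstain_reward: float) -> Optional[int]:
    """Return the first token position whose value estimate is below the
    abstention reward, or None if generation should run to completion.

    `values[t]` stands in for an estimate of the value function at token
    position t; producing such estimates is outside this sketch.
    """
    for t, value in enumerate(values):
        if value < abstain_reward:
            return t
    return None


if __name__ == "__main__":
    estimates = [0.8, 0.6, 0.3, 0.7]
    step = first_abstention_step(estimates, abstain_reward=0.5)
    if step is None:
        print("no abstention: generate the full response")
    else:
        saved = len(estimates) - step
        print(f"abstain at position {step}, skipping {saved} remaining positions")
```

Raising `abstain_reward` makes the rule stop earlier and save more compute, at the price of abandoning some traces that would have recovered, which is the trade-off the abstention reward parameter is described as controlling.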
[307] Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering cs.LG | cs.AI | cs.CL
Manan Gupta, Dhruv Kumar
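TL;DR: This paper proposes Latent Phase-Shift Rollback (LPSR), an inference-time error-correction method that detects and corrects unrecoverable reasoning errors during LLM generation. It monitors the residual stream for abrupt directional reversals (phase shifts), then rolls back the KV-cache and injects a pre-computed steering vector, with no fine-tuning or additional forward passes. On MATH-500, the method substantially improves an 8B model, outperforming standard autoregressive decoding, prompted self-correction, and Best-of-16, and surpassing a 70B model with far fewer parameters.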
Details
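Motivation: LLMs frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. Prompted self-correction performs poorly, scoring even below standard autoregressive decoding. This motivates an efficient inference-time correction mechanism that improves reasoning accuracy without fine-tuning or additional forward passes.
Result: On MATH-500 with an 8B model, LPSR reaches 44.0% accuracy versus 28.8% for standard autoregressive decoding, a gain of 15.2 percentage points that is statistically significant (p < 10^{-15}). LPSR also exceeds prompted self-correction (19.8%) by 24.2 points and Best-of-16 by 7.8 points at 5.4× lower token cost, and it surpasses a standard 70B model (35.2%) with 8.75× fewer parameters at roughly 3× the token budget. A layer sweep reveals a dissociation between detection and correction: error-detection AUC peaks at layer 14 (0.718), while task accuracy peaks at layer 16 (44.0% vs. 29.2%).
Insight: The contribution is a training-free framework for inference-time error correction based on residual stream monitoring and KV-cache steering. Its core mechanisms are: 1) a dual gate combining cosine similarity and entropy to detect "phase shifts" in the residual stream; 2) immediate correction by rolling back the KV-cache and injecting a steering vector. The paper also reports that the best monitoring depth differs for detection and for correction (detection-correction dissociation), which offers new insight into internal monitoring of models.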
Abstract: Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce $\textbf{Latent Phase-Shift Rollback}$ (LPSR): at each generation step, we monitor the residual stream at a critical layer $l_{\text{crit}}$, detect abrupt directional reversals (phase shifts) via a cosine-similarity $+$ entropy dual gate, and respond by rolling back the KV-cache and injecting a pre-computed steering vector. No fine-tuning, gradient computation, or additional forward passes are required. LPSR achieves $\mathbf{44.0\%}$ on MATH-500 with an 8B model versus $28.8\%$ for standard AR ($+15.2$ pp; McNemar $\chi^2 = 66.96$, $p < 10^{-15}$). Critically, prompted self-correction, the most natural inference-time baseline, scores only $19.8\%$, below standard AR; LPSR exceeds it by $+24.2$ pp ($\chi^2 = 89.4$, $p \approx 0$). LPSR also outperforms Best-of-16 ($+7.8$ pp) at $5.4\times$ lower token cost, and surpasses a standard 70B model ($35.2\%$) with $8.75\times$ fewer parameters at ${\sim}3\times$ the token budget. A 32-layer sweep reveals a novel $\textbf{detection-correction dissociation}$: error-detection AUC peaks at layer 14 ($0.718$) but task accuracy peaks at layer 16 ($44.0\%$ vs. $29.2\%$), demonstrating that optimal monitoring depth differs for detection and correction.
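A minimal sketch of what a cosine-similarity plus entropy dual gate might look like follows. The abstract names the two signals but not how they are combined, so the AND combination, the thresholds, and the names `cos_threshold`, `entropy_threshold`, and `phase_shift_detected` are all assumptions made for illustration. The rollback and steering steps are not shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def phase_shift_detected(prev_hidden: Sequence[float],
                         curr_hidden: Sequence[float],
                         next_token_probs: Sequence[float],
                         cos_threshold: float = 0.0,
                         entropy_threshold: float = 1.0) -> bool:
    """Hypothetical dual gate: fire only when the hidden state at the
    monitored layer reverses direction AND the next-token distribution
    is uncertain. Both thresholds are illustrative, not from the paper."""
    reversed_direction = cosine_similarity(prev_hidden, curr_hidden) < cos_threshold
    uncertain = entropy(next_token_probs) > entropy_threshold
    return reversed_direction and uncertain


if __name__ == "__main__":
    prev_state = [1.0, 0.0]
    curr_state = [-1.0, 0.1]            # nearly opposite direction
    uniform = [0.25, 0.25, 0.25, 0.25]  # high-entropy prediction
    print(phase_shift_detected(prev_state, curr_state, uniform))
```

Requiring both conditions is one way to keep the gate from firing on confident direction changes or on uncertain steps that stay on course; whether LPSR combines the signals this way is not stated in the abstract.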
[308] A multimodal and temporal foundation model for virtual patient representations at healthcare system scale cs.LG | cs.AI | cs.CL
Andrew Zhang, Tong Ding, Sophia J. Wagner, Caiwei Tian, Ming Y. Lu
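TL;DR: This paper introduces Apollo, a multimodal temporal foundation model for building virtual patient representations at healthcare system scale. It is trained on more than three decades of longitudinal hospital records: 25 billion records from 7.2 million patients, covering 28 medical modalities and 12 major specialties. Apollo learns a unified representation space that compresses structured events, images, and clinical text into virtual patient representations, and demonstrates its potential for clinical forecasting and search on 322 prognosis and retrieval tasks.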
Details
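Motivation: Modern medicine generates vast multimodal data across siloed systems, but no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. This calls for a foundation model that can represent entire patient care journeys.
Result: Across 322 tasks built from a held-out test set of 1.4 million patients, Apollo shows generalized clinical forecasting potential, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Semantic similarity search is evaluated on 61 retrieval tasks, and the model shows potential as a multimodal medical search engine.
Insight: The contribution is a unified patient representation space (an "atlas of medical concepts") that integrates multiple modalities and the time dimension, supporting long-horizon clinical forecasting and retrieval tasks. It lays groundwork for computable medicine, in which the full context of patient care becomes accessible to computational reasoning.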
Abstract: Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This “atlas of medical concepts” forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.
[309] SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning cs.LG | cs.AI | cs.CV
Nikola Jovišić, Milica Škipina, Vanja Švenda
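TL;DR: This paper proposes SetFlow, a generative architecture for Multiple Instance Learning (MIL) that models entire MIL bags (i.e., sets) directly in the representation space. It combines the flow matching paradigm with a Set Transformer-inspired design, handling permutation-invariant inputs while capturing interactions between instances within a bag. The model is conditioned on class labels and input scale, and generates coherent, semantically consistent sets of representations. On a large-scale mammography benchmark, generated samples closely match the original data distribution, improve downstream classification when used for augmentation, and give competitive results when training uses synthetic data alone.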
Details
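Motivation: Data scarcity and weak supervision limit performance in real-world applications such as mammography. Existing augmentation methods operate at the instance level, so they cannot effectively augment representations of MIL data or capture intra-bag dependencies.
Result: On a large-scale mammography benchmark, evaluated with a state-of-the-art MIL-PF classification pipeline, generated samples closely match the original data distribution, improve downstream performance when used for augmentation, and yield competitive results when training uses synthetic data alone.
Insight: The contribution is generative modeling of entire MIL bags directly in the representation space, combining flow matching with a Set Transformer-inspired design to handle set data and capture interactions between instances. This provides an effective approach to representation-space generative modeling for data-scarce and privacy-sensitive tasks.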
Abstract: Data scarcity and weak supervision continue to limit the performance of machine learning models in many real-world applications, such as mammography, where Multiple Instance Learning (MIL) often offers the best formulation. While recent foundation models provide strong semantic representations out of the box, effective augmentation of such representations of MIL data remains limited, as existing methods operate at the instance level and fail to capture intra-bag dependencies. In this work, we introduce SetFlow, a generative architecture that models entire MIL bags (i.e., sets) directly in the representation space. Our approach leverages the flow matching paradigm combined with a Set Transformer-inspired design, enabling it to handle permutation-invariant inputs while capturing interactions between instances within each bag. The model is conditioned on both class labels and input scale, allowing it to generate coherent and semantically consistent sets of representations. We evaluate SetFlow on a large-scale mammography benchmark using a state-of-the-art MIL-PF classification pipeline. The generated samples are shown to closely match the original data distribution and even improve downstream performance when used for augmentation. Furthermore, training on synthetic data alone shows competitive results, demonstrating the effectiveness of representation-space generative modeling for data-scarce and privacy-sensitive tasks.
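To make the flow matching idea concrete, here is a minimal sketch of how a training pair could be built for a whole bag of instance representations. It assumes the common linear-interpolation path between a noise bag and a data bag, which may differ from the path SetFlow uses. The network that would regress the target velocity (a Set Transformer-style model in the paper) is omitted, and `flow_matching_pair` is a hypothetical name.

```python
from typing import List, Tuple

Bag = List[List[float]]  # one bag = a set of instance representation vectors


def flow_matching_pair(noise_bag: Bag, data_bag: Bag, t: float) -> Tuple[Bag, Bag]:
    """Build one flow matching training pair for a bag.

    Assumes the linear path x_t = (1 - t) * x_0 + t * x_1, whose target
    velocity is x_1 - x_0 for every instance in the bag. A model would be
    trained to predict the velocity from (x_t, t, conditioning).
    """
    if len(noise_bag) != len(data_bag):
        raise ValueError("noise and data bags must hold the same number of instances")
    x_t = [[(1.0 - t) * n + t * d for n, d in zip(noise_vec, data_vec)]
           for noise_vec, data_vec in zip(noise_bag, data_bag)]
    velocity = [[d - n for n, d in zip(noise_vec, data_vec)]
                for noise_vec, data_vec in zip(noise_bag, data_bag)]
    return x_t, velocity


if __name__ == "__main__":
    noise = [[0.0, 0.0], [1.0, 1.0]]
    data = [[2.0, 2.0], [3.0, 5.0]]
    x_t, velocity = flow_matching_pair(noise, data, t=0.5)
    print("interpolated bag:", x_t)
    print("target velocity:", velocity)
```

Because the pair is built per instance, reordering the instances of both bags in the same way reorders the outputs in the same way; the permutation handling attributed to SetFlow would come from the omitted set-based network, not from this step.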
[310] Applications of deep generative models to DNA reaction kinetics and to cryogenic electron microscopy cs.LG | cs.AI | cs.CV | q-bio.BM | q-bio.QM
Chenwei Zhang
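TL;DR: This dissertation examines how deep generative models can advance the analysis of challenging biological problems by integrating domain knowledge with deep learning, focusing on DNA reaction kinetics and cryogenic electron microscopy (cryo-EM). For DNA reaction kinetics, it presents ViDa, a framework that uses variational autoencoders and geometric scattering transforms to generate biophysically plausible embeddings of DNA reaction kinetics simulations, reduced to two dimensions to visualize DNA hybridization and toehold-mediated strand displacement reactions. For cryo-EM, it provides a comprehensive review and benchmark of deep learning methods for atomic model building; presents Struc2mapGAN, a generative adversarial network that synthesizes high-fidelity, experimental-like cryo-EM density maps from protein structures; and presents CryoSAMU, a structure-aware multimodal U-Net that enhances intermediate-resolution cryo-EM maps by integrating density features with structural embeddings from protein language models via cross-attention.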
Details
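Motivation: Simulation results for DNA reaction kinetics are hard to interpret, and cryo-EM density map interpretation and protein structure modeling pose key challenges. The work addresses both by using deep generative models that integrate domain knowledge to improve analysis.
Result: ViDa clusters trajectory ensembles into reaction pathways and reveals new mechanistic insights. For cryo-EM, the benchmark comes with improved evaluation metrics and practical guidance, Struc2mapGAN generates high-fidelity density maps, and CryoSAMU enhances the quality of intermediate-resolution maps.
Insight: Contributions include combining variational autoencoders with geometric scattering transforms to generate biophysical embeddings of DNA reaction kinetics, and using a generative adversarial network and a multimodal U-Net to integrate structural information for better cryo-EM map analysis. Viewed objectively, fusing domain knowledge with deep learning makes modeling in biophysical simulation and structural biology more interpretable and efficient.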
Abstract: This dissertation explores how deep generative models can advance the analysis of challenging biological problems by integrating domain knowledge with deep learning. It focuses on two areas: DNA reaction kinetics and cryogenic electron microscopy (cryo-EM). In the first part, we present ViDa, a biophysics-informed framework leveraging variational autoencoders (VAEs) and geometric scattering transforms to generate biophysically-plausible embeddings of DNA reaction kinetics simulations. These embeddings are reduced to a two-dimensional space to visualize DNA hybridization and toehold-mediated strand displacement reactions. ViDa preserves structure and clusters trajectory ensembles into reaction pathways, making simulation results more interpretable and revealing new mechanistic insights. In the second part, we address key challenges in cryo-EM density map interpretation and protein structure modeling. We provide a comprehensive review and benchmarking of deep learning methods for atomic model building, with improved evaluation metrics and practical guidance. We then present Struc2mapGAN, a generative adversarial network that synthesizes high-fidelity experimental-like cryo-EM density maps from protein structures. Finally, we present CryoSAMU, a structure-aware multimodal U-Net that enhances intermediate-resolution cryo-EM maps by integrating density features with structural embeddings from protein language models via cross-attention. Overall, these contributions demonstrate the potential of deep generative models to interpret DNA reaction mechanisms and advance cryo-EM density map analysis and protein structure modeling.
[311] ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification cs.LG | cs.AI | cs.CV
Florian Kittler, Sheethal Bhat, Andreas Maier
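TL;DR: ProtoCLIP is a refinement strategy for CLIP-style vision-language models that improves zero-shot discrimination in chest X-ray classification through targeted data curation and distilled anchor alignment. It builds pathology-focused training subsets and introduces a representation-preserving distillation objective to reduce label co-occurrence bias, stabilize domain adaptation, and better separate clinically relevant co-occurring pathologies.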
Details
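Motivation: Zero-shot vision-language models for chest X-ray classification are limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift.
Result: Evaluated on the unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax, it reaches a state-of-the-art AUC of 0.94.
Insight: The contribution is the combination of anchor-guided refinement, curated supervision, and controlled adaptation. Pathology-focused subset construction and a distillation objective that preserves semantic structure mitigate common zero-shot transfer failures in medical VLMs without large-scale retraining.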
Abstract: Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset, VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.