cs.CL

[1] The Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts cs.CL | cs.AI

Warren Johnson

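TL;DR: Validating across multiple code and reasoning benchmarks, this paper examines the "perplexity paradox" in LLM prompt compression: code generation tolerates aggressive compression (r >= 0.6), whereas math reasoning degrades because critical numerical values are pruned. Its per-token perplexity analysis, presented as the first of its kind, shows that code syntax tokens (high perplexity) are preserved while numerical values in math problems (low perplexity) are wrongly pruned. Signature injection recovers much of the lost performance (pass rate up 34 percentage points), and the proposed task-aware adaptive compression algorithm, TAAC, cuts cost by 22% while preserving 96% of quality.

Details

Motivation: Addresses three limitations of the earlier study (Johnson, 2026): it was evaluated only on HumanEval, left the "perplexity paradox" mechanism unvalidated, and offered no adaptive compression algorithm. This paper sets out to show that the compression threshold generalizes across benchmarks, to explain the mechanism behind the paradox, and to design an adaptive compression method.

Result: The compression threshold is validated on six code benchmarks (including HumanEval and MBPP) and four reasoning benchmarks (including GSM8K and MATH). Signature injection raises the pass rate from 5.3% to 39.3% (Cohen's h=0.890). On MBPP (1,800 trials), TAAC achieves a 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by 7%. Performance varies systematically with the compression ratio, from 3.6% at r=0.3 to 54.6% at r=1.0.

Insight: Contributions include: a per-token perplexity analysis that explains the "perplexity paradox" (the perplexity gap between code syntax and math numerals leads to different compression outcomes); signature injection as a way to recover lost performance; and TAAC, a task-aware adaptive compression algorithm that balances cost against quality. Viewed objectively, the study gives LLM prompt compression both a fine-grained explanation and a practical algorithm, and underlines how much task type and token properties matter for compression.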

Abstract: In “Compress or Route?” (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r >= 0.6) while chain-of-thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the “perplexity paradox” mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across six code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL-E) and four reasoning benchmarks (GSM8K, MATH, ARC-Challenge, MMLU-STEM), confirming the compression threshold generalizes across languages and difficulties. Second, we conduct the first per-token perplexity analysis (n=723 tokens), revealing a “perplexity paradox”: code syntax tokens are preserved (high perplexity) while numerical values in math problems are pruned despite being task-critical (low perplexity). Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen’s h=0.890). Third, we propose TAAC (Task-Aware Adaptive Compression), achieving 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by 7%. MBPP validation (n=1,800 trials) confirms systematic variation: 3.6% at r=0.3 to 54.6% at r=1.0.
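The effect size quoted above can be sanity-checked by hand. The sketch below is not taken from the paper; it assumes the standard definition of Cohen's h (the difference between arcsine-transformed proportions) and plugs in the two rounded pass rates from the abstract. The function name `cohens_h` is an illustrative choice.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of the arcsine-transformed proportions p1 and p2."""
    def phi(p: float) -> float:
        return 2.0 * math.asin(math.sqrt(p))
    return phi(p2) - phi(p1)


# Rounded pass rates quoted in the abstract (without / with signature injection).
baseline, injected = 0.053, 0.393

h = cohens_h(baseline, injected)
gain_pp = (injected - baseline) * 100  # gain in percentage points

print(f"gain = {gain_pp:.1f} pp, Cohen's h = {h:.3f}")
```

With the rounded inputs the result lands at roughly 0.89, in line with the reported h=0.890; a small difference in the third decimal is expected because the paper presumably computed it from exact counts.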


[2] KD4MT: A Survey of Knowledge Distillation for Machine Translation cs.CL

Ona de Gibert, Joseph Attieh, Timothee Mickus, Yves Scherrer, Jörg Tiedemann

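TL;DR: This survey systematically reviews knowledge distillation (KD) in machine translation (MT), covering 105 papers published through October 1, 2025. It introduces the basics of MT and KD, outlines the standard KD approaches relevant to MT, and categorizes the KD4MT literature by methodological contribution and practical application. Through qualitative and quantitative analysis it identifies common trends, key research gaps, and the lack of unified evaluation practice. It also offers practical guidelines for choosing a KD method in concrete settings and flags risks of applying KD to MT, such as increased hallucination and bias amplification. Finally, it discusses how LLMs are reshaping KD4MT and provides a public database and a glossary to support further research.

Details

Motivation: As NLP models keep growing, knowledge distillation has attracted wide attention as a model compression tool. In MT, KD is used not only for compression but also as a general-purpose knowledge transfer mechanism that shapes supervision, translation quality, and efficiency. The field nevertheless lacks a systematic survey and suffers from research gaps and inconsistent evaluation practice, so this paper aims to map the current state of KD in MT comprehensively.

Result: From a qualitative and quantitative analysis of 105 papers, the survey summarizes common trends in KD4MT and identifies key research gaps, notably the absence of unified evaluation practice. It reports no SOTA performance figures of its own, but emphasizes both the versatility of KD as a knowledge transfer mechanism in MT and its potential risks.

Insight: The novelty lies in a systematic survey of KD in MT, presented as the first of its kind, together with a taxonomy based on methodology and application. Viewed objectively, the paper not only summarizes existing techniques but also identifies research gaps (such as inconsistent evaluation standards) and potential risks (such as hallucination and bias), and it gives future work practical guidelines and public resources (a database and a glossary). Its discussion of how LLMs are reshaping the field is forward-looking.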

Abstract: Knowledge Distillation (KD) as a research area has gained a lot of traction in recent years as a compression tool to address challenges related to ever-larger models in NLP. Remarkably, Machine Translation (MT) offers a much more nuanced take on this narrative: in MT, KD also functions as a general-purpose knowledge transfer mechanism that shapes supervision and translation quality as well as efficiency. This survey synthesizes KD for MT (KD4MT) across 105 papers (through October 1, 2025). We begin by introducing both MT and KD for non-experts, followed by an overview of the standard KD approaches relevant to MT applications. Subsequently, we categorize advances in the KD4MT literature based on (i) their methodological contributions and (ii) their practical applications. Our qualitative and quantitative analyses identify common trends in the field and highlight key research gaps as well as the absence of unified evaluation practice for KD methods in MT. We further provide practical guidelines for selecting a KD method in concrete settings and highlight potential risks associated with the application of KD to MT such as increased hallucination and bias amplification. Finally, we discuss the role of LLMs in re-shaping the KD4MT field. To support further research, we complement our survey with a publicly available database summarizing the main characteristics of the surveyed KD methods and a glossary of key terms.


[3] Gated Tree Cross-attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs cs.CL

Xinyu Gao, Shaonan Wang, Nai Ding

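TL;DR: This paper proposes a gated tree cross-attention (GTCA) branch module that injects syntactic structure into decoder-only large language models to improve their syntactic robustness while preserving the capabilities of the pretrained checkpoint. The module reads precomputed constituency chunk memory and uses a token update mask and staged training to control the scope and timing of structural updates, which keeps it compatible with existing checkpoints.

Details

Motivation: Decoder-only LLMs perform strongly across a wide range of tasks but are brittle to minor grammatical perturbations, which undermines reliability in downstream reasoning. Directly injecting explicit syntactic structure into an existing checkpoint can interfere with its pretrained competence, so a checkpoint-compatible way of strengthening syntactic robustness is needed.

Result: Across multiple benchmarks and Transformer backbones, GTCA strengthens syntactic robustness beyond continued-training baselines without harming multiple-choice QA performance or commonsense reasoning, offering a practical route to syntax enhancement for decoder-only LLMs.

Insight: The novelty is a checkpoint-compatible GTCA branch that, through a token update mask and a staged training strategy, injects syntactic information while leaving the backbone architecture unchanged. Viewed objectively, its modular structural augmentation and training-control mechanisms are worth borrowing: they improve a targeted capability without degrading the pretrained model's overall performance.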

Abstract: Decoder-only large language models achieve strong broad performance but are brittle to minor grammatical perturbations, undermining reliability for downstream reasoning. However, directly injecting explicit syntactic structure into an existing checkpoint can interfere with its pretrained competence. We introduce a checkpoint-compatible gated tree cross-attention (GTCA) branch that reads precomputed constituency chunk memory while leaving backbone architecture unchanged. Our design uses a token update mask and staged training to control the scope and timing of structural updates. Across benchmarks and Transformer backbones, GTCA strengthens syntactic robustness beyond continued-training baselines without compromising Multiple-Choice QA performance or commonsense reasoning, providing a practical checkpoint-compatible route to more syntax-robust decoder-only LLMs.


[4] Preference Optimization for Review Question Generation Improves Writing Quality cs.CL | cs.AI

Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun

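TL;DR: This paper presents the IntelliReward reward model and the IntelliAsk question-generation model, which aim to make LLM-generated peer-review questions substantive, evidence-based, and deep rather than superficial. Trained with Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), the model shows measurable gains on several reasoning and writing benchmarks.

Details

Motivation: Existing LLM-based approaches to generating peer-review questions tend to produce superficial questions that lack depth and evidential support, with over 50% of question tokens drawn from a paper's first page. This paper aims to generate review questions that meet the standards of human experts: questions that require effort, evidence, and grounding.

Result: On the reasoning benchmark MuSR, IntelliAsk raises accuracy from 64.7 to 68.3 over the Qwen3-32B base model; on the complex writing evaluation WritingBench, the score rises from 8.07 to 8.31. IntelliReward outperforms API-based SFT baselines at predicting expert human preferences.

Insight: Contributions include: 1) the IntelliReward reward model, built from a frozen autoregressive LLM with trainable multi-head transformers; 2) preference optimization with Decoupled Clip and DAPO; 3) the finding that review-question quality correlates with broader reasoning and writing ability; 4) an automatic evaluation benchmark and open-source resources.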

Abstract: Peer review relies on substantive, evidence-based questions, yet existing LLM-based approaches often generate surface-level queries, drawing over 50% of their question tokens from a paper’s first page. To bridge this gap, we develop IntelliReward, a novel reward model built from a frozen autoregressive LLM with trainable multi-head transformers over the final 50 token states, which outperforms API-based SFT baselines in predicting expert-level human preferences. By applying Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward, we train IntelliAsk, a question-generation model aligned with human standards of effort, evidence, and grounding. We find consistent improvements on reasoning and writing benchmarks, suggesting reviewer-question quality correlates with broader capabilities. Compared to the Qwen3-32B base model, IntelliAsk shows measurable gains across diverse benchmarks, specifically improving performance on reasoning tasks like MuSR (68.3 vs 64.7 Acc) and complex writing evaluations such as WritingBench (8.31 vs 8.07). We release our implementation, expert preference annotations, and the IntelliReward model to provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated review questions.


[5] Large Language Models for Assisting American College Applications cs.CL

Zhengliang Liu, Weihang You, Peng Shu, Junhao Chen, Yi Pan

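TL;DR: This paper presents EZCollegeApp, an LLM-powered system that helps high-school students cope with the complex forms, ambiguous questions, and fragmented admissions policies of American college applications. The system follows a mapping-first paradigm that separates form understanding from answer generation, and it integrates ingestion of official admissions documents, retrieval-augmented question answering, and a human-in-the-loop chatbot interface, so that final answers remain fully under the user's control.

Details

Motivation: Addresses the challenges students face when applying to American colleges: fragmented admissions policies, repetitive and conditional application forms, and ambiguous questions that require cross-referencing multiple sources.

Result: The system is evaluated through automated testing and human quality assessment, and the source code is released on GitHub to facilitate broader impact.

Insight: The novelty is the mapping-first paradigm, which decouples form understanding from answer generation and enables consistent reasoning across heterogeneous application portals. The system combines document ingestion, retrieval-augmented question answering, and a human-in-the-loop interface, offering intelligent suggestions while leaving the user in full control of the final answers.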

Abstract: American college applications require students to navigate fragmented admissions policies, repetitive and conditional forms, and ambiguous questions that often demand cross-referencing multiple sources. We present EZCollegeApp, a large language model (LLM)-powered system that assists high-school students by structuring application forms, grounding suggested answers in authoritative admissions documents, and maintaining full human control over final responses. The system introduces a mapping-first paradigm that separates form understanding from answer generation, enabling consistent reasoning across heterogeneous application portals. EZCollegeApp integrates document ingestion from official admissions websites, retrieval-augmented question answering, and a human-in-the-loop chatbot interface that presents suggestions alongside application fields without automated submission. We describe the system architecture, data pipeline, internal representations, security and privacy measures, and evaluation through automated testing and human quality assessment. Our source code is released on GitHub (https://github.com/ezcollegeapp-public/ezcollegeapp-public) to facilitate the broader impact of this work.


[6] Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization cs.CL | cs.AI

Jingyi Xu, Xingyu Ren, Zhiqiang You, Yumeng Zhang, Zhoupeng Shou

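TL;DR: This paper proposes Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework for task-oriented dialogue systems. Using an Expert Agent and a Customer Service Agent, the framework decouples long-term strategy planning from immediate response generation in order to align multi-turn dialogue with long-horizon task success.

Details

Motivation: Existing training methods based on token-level likelihood or preference optimization align poorly with long-horizon task success, so a training framework is needed that decouples strategy planning from execution and focuses on long-horizon task goals.

Result: On the Mgshop dataset, GOPO improves the newly proposed sequence-level metric TSE by 7.7% over PPO and 10.3% over Memento. In addition, a 14B model trained with GOPO exceeds Qwen-235B and GPT-5.2 on TSE by 2.7% and 1.5%, respectively. Consistent gains are reported on other datasets as well.

Insight: The core innovation is hierarchical reinforcement learning that decouples dialogue-trajectory-level strategy planning (the Expert Agent) from response generation that strictly follows the selected strategy (the Customer Service Agent). The sequence-level metric TSE, derived from real e-commerce data, is a further contribution: it gives task-oriented dialogue an evaluation standard closer to actual business value.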

Abstract: Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent’s critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.


[7] Multi-source Heterogeneous Public Opinion Analysis via Collaborative Reasoning and Adaptive Fusion: A Systematically Integrated Approach cs.CL

Yi Liu

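TL;DR: This paper proposes CRAF (Collaborative Reasoning and Adaptive Fusion), a new framework for analyzing public opinion from multiple heterogeneous sources. Through a structured multi-stage reasoning mechanism, it systematically integrates traditional feature-based methods with large language models (LLMs) to handle differences in structure, semantics, and bias across platforms.

Details

Motivation: Addresses the significant challenges that structural differences, semantic variation, and platform-specific biases pose when analyzing public opinion from multiple heterogeneous sources, such as different social media platforms.

Result: Comprehensive experiments on three multi-platform datasets (Weibo-12, CrossPlatform-15, NewsForum-8) show that CRAF achieves an average topic clustering ARI of 0.76 (a 4.1% improvement over the best baseline) and a sentiment analysis F1-score of 0.84 (a 3.8% improvement). The framework shows strong cross-platform adaptability, reducing the labeled data required for new platforms by 75%. Theoretical analysis indicates that CRAF achieves a tighter generalization bound than independent source modeling, with a reduction of O(sqrt(d log K / m)).

Insight: The innovations claimed in the abstract are: 1) a cross-platform collaborative attention module; 2) a hierarchical adaptive fusion mechanism; 3) a joint optimization strategy that learns topic representations and sentiment distributions simultaneously through shared latent spaces; 4) a new multimodal extraction capability that integrates OCR, ASR, and visual sentiment analysis to process video content from platforms such as Douyin and Kuaishou. Viewed objectively, the core contribution is the systematic fusion of traditional methods with LLMs, using a structured reasoning mechanism to align and fuse heterogeneous multi-source information, which improves performance as well as cross-platform adaptability and data efficiency.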

Abstract: The analysis of public opinion from multiple heterogeneous sources presents significant challenges due to structural differences, semantic variations, and platform-specific biases. This paper introduces a novel Collaborative Reasoning and Adaptive Fusion (CRAF) framework that systematically integrates traditional feature-based methods with large language models (LLMs) through a structured multi-stage reasoning mechanism. Our approach features four key innovations: (1) a cross-platform collaborative attention module that aligns semantic representations while preserving source-specific characteristics, (2) a hierarchical adaptive fusion mechanism that dynamically weights features based on both data quality and task requirements, (3) a joint optimization strategy that simultaneously learns topic representations and sentiment distributions through shared latent spaces, and (4) a novel multimodal extraction capability that processes video content from platforms like Douyin and Kuaishou by integrating OCR, ASR, and visual sentiment analysis. Theoretical analysis demonstrates that CRAF achieves a tighter generalization bound with a reduction of O(sqrt(d log K / m)) compared to independent source modeling, where d is feature dimensionality, K is the number of sources, and m is sample size. Comprehensive experiments on three multi-platform datasets (Weibo-12, CrossPlatform-15, NewsForum-8) show that CRAF achieves an average topic clustering ARI of 0.76 (4.1% improvement over best baseline) and sentiment analysis F1-score of 0.84 (3.8% improvement). The framework exhibits strong cross-platform adaptability, reducing the labeled data requirement for new platforms by 75%.


[8] State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models cs.CL | cs.AI

Annie Wong, Aske Plaat, Thomas Bäck, Niki van Stein, Anna V. Kononova

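TL;DR: This paper examines how the design of the state representation (granularity, structure, and spatial grounding) affects the performance of large language models on dynamic reasoning tasks. It finds that trajectory summarization improves performance, that natural language representations are the most robust, and that text-based spatial encodings are the most effective, with the advantage coming from the act of constructing them rather than from the spatial information itself.

Details

Motivation: Addresses the underexplored question of how the state representation (granularity, structure, spatial grounding) affects LLM performance when reasoning in dynamic environments.

Result: On sequential decision-making benchmarks, trajectory summarization improves performance, natural language representations are the most robust, and text-based spatial encodings are the most effective. Current LLMs and VLMs nevertheless remain brittle over long horizons, particularly when they must synthesize information across multiple subtasks.

Insight: State representation design is a decisive performance factor, distinct from the availability of information. The key insight is that the act of construction (for example, generating a text-based spatial encoding) forces the model to perform reasoning that static input does not elicit. Model priors, such as coding ability, affect how well a model exploits structured representations.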

Abstract: As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. We find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured output priors, such as JSON schemas. Third, while image-inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.


[9] CAST: Achieving Stable LLM-based Text Analysis for Data Analytics cs.CL | cs.AI

Jinxiang Xie, Zihao Li, Wei He, Rui Ding, Shi Han

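TL;DR: The paper proposes CAST, a framework that constrains a large language model's latent reasoning path through Algorithmic Prompting and a Stable Thinking mechanism in order to improve output stability on text analysis tasks over tabular data (such as summarization and tagging). It also designs matching stability metrics.

Details

Motivation: Existing LLMs struggle to meet the high standard of output stability that data analytics demands for text analysis of tabular data, so a method for improving output stability is needed.

Result: Across several public benchmarks and multiple LLM backbones, CAST achieves the best stability among all baselines, improving the Stability Score by up to 16.2% while maintaining or improving output quality.

Insight: The novelty lies in building a procedural reasoning scaffold through Algorithmic Prompting and combining it with a Thinking-before-Speaking mechanism that forces the model to make explicit intermediate commitments before final generation, thereby constraining the reasoning path and improving stability. The paper also designs dedicated stability metrics for summarization and tagging.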

Abstract: Text analysis of tabular data relies on two core operations: summarization for corpus-level theme extraction and tagging for row-level labeling. A critical limitation of employing large language models (LLMs) for these tasks is their inability to meet the high standards of output stability demanded by data analytics. To address this challenge, we introduce CAST (Consistency via Algorithmic Prompting and Stable Thinking), a framework that enhances output stability by constraining the model’s latent reasoning path. CAST combines (i) Algorithmic Prompting to impose a procedural scaffold over valid reasoning transitions and (ii) Thinking-before-Speaking to enforce explicit intermediate commitments before final generation. To measure progress, we introduce CAST-S and CAST-T, stability metrics for bulleted summarization and tagging, and validate their alignment with human judgments. Experiments across publicly available benchmarks on multiple LLM backbones show that CAST consistently achieves the best stability among all baselines, improving Stability Score by up to 16.2%, while maintaining or improving output quality.


[10] Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation cs.CL | cs.AI

Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo

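TL;DR: This paper proposes a semantically grounded framework that predicts and validates actions and ingredients as internal context for recipe generation, improving the semantic accuracy of multimodal large language models on this task. The framework uses a two-stage fine-tuning pipeline (supervised fine-tuning followed by reinforcement fine-tuning) together with a Semantic Confidence Scoring and Rectification module, and reaches state-of-the-art performance on Recipe1M.

Details

Motivation: Existing multimodal LLMs that generate recipes from food images often produce semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). The paper aims to improve the semantic fidelity of the generated recipes.

Result: State-of-the-art performance on the Recipe1M dataset, with markedly improved semantic fidelity.

Insight: Contributions include: semantic grounding that uses predicted actions and ingredients as internal context; a two-stage pipeline combining supervised and reinforcement fine-tuning, where the reinforcement stage uses frequency-aware rewards to improve long-tail action prediction and ingredient generalization; and a Semantic Confidence Scoring and Rectification module that filters and corrects predictions.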

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.


[11] Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning cs.CL | cs.AI

Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park

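TL;DR: This paper studies how large language models (LLMs) improve their reasoning performance through self-generated few-shot examples. It finds that the gains come not from the generated examples themselves but from the process of creating them. Three strategies are compared: zero-shot prompting, integrated prompting (the model creates and solves problems within a single prompt), and decoupled prompting (the model's self-generated examples are reused as in-context examples, with the context of their creation excluded). Integrated prompting consistently outperforms the other two across multiple LLM architectures.

Details

Motivation: Prior work has shown that LLMs can improve reasoning through self-generated few-shot examples, even matching manually curated examples, but the underlying mechanism remains unclear, which limits effective use of the technique. This paper investigates the root cause of the gains in order to guide the design of more effective prompting strategies.

Result: Experiments on reasoning-intensive tasks across five widely used model architectures show that integrated prompting consistently outperforms both zero-shot and decoupled prompting, whereas decoupled prompting offers only marginal gains over zero-shot. An attention analysis further reveals significant differences in attention patterns between integrated and decoupled prompting.

Insight: The core finding is that the advantage of self-generation prompting stems from the problem-creation process itself rather than from the content of the generated examples. This offers a key insight for designing more effective prompting strategies, for example having the model take part in constructing problems instead of merely reusing its outputs. Viewed objectively, the study's systematic experimental design and attention analysis provide valuable empirical evidence for understanding in-context learning in LLMs.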

Abstract: Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.
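The three strategies differ only in how the prompt is assembled, which a small sketch can make concrete. The templates below are hypothetical wordings written for illustration, not the prompts used in the paper; the function names `zero_shot`, `integrated`, and `decoupled` are likewise assumptions.

```python
def zero_shot(question: str) -> str:
    """Zero-shot: ask for the answer directly, with no examples."""
    return f"Solve the following problem.\n\nProblem: {question}\nAnswer:"


def integrated(question: str, n_examples: int = 2) -> str:
    """Integrated: the model creates and solves examples in the same prompt
    in which it answers the target question."""
    return (
        f"First, write {n_examples} problems similar to the one below and solve each.\n"
        "Then solve the original problem.\n\n"
        f"Problem: {question}\n"
    )


def decoupled(question: str, examples: list[tuple[str, str]]) -> str:
    """Decoupled: previously generated (problem, answer) pairs are reused as
    ordinary few-shot examples; the creation context is not included."""
    shots = "\n\n".join(f"Problem: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nProblem: {question}\nAnswer:"


target = "A train travels 120 km in 1.5 hours. What is its average speed?"
reused = [("A car travels 90 km in 2 hours. What is its average speed?", "45 km/h")]

for prompt in (zero_shot(target), integrated(target), decoupled(target, reused)):
    print(prompt, end="\n---\n")
```

In this framing, the decoupled prompt carries the same example content as an integrated run would produce, but none of the creation step, which is the contrast the abstract draws.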


[12] Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork? cs.CL | cs.AI

Berry Gerrits

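TL;DR: This paper evaluates the problem-solving and reasoning abilities of frontier large language models (LLMs), including ChatGPT, Claude, and Gemini, by having them play the 1977 text adventure game Zork. All models average less than 10% completion, and even the best performer, Claude Opus 4.5, scores only about 75 of 350 possible points. Neither detailed game instructions nor enabling "extended thinking" brings any improvement. Qualitative analysis reveals fundamental limitations in metacognition and strategy learning.

Details

Motivation: The aim is to use a dialogue-based text adventure such as Zork as a controlled environment for assessing how LLMs interpret natural language descriptions and generate suitable action sequences to solve problems, and thereby to test their reasoning and problem-solving abilities.

Result: In Zork, all tested proprietary models (ChatGPT, Claude, Gemini) average less than 10% completion, and the best model (Claude Opus 4.5) reaches only about 75/350 points. Neither detailed instructions nor "extended thinking" improves performance.

Insight: The paper presents the use of a classic text adventure game as a novel benchmark for LLM reasoning. Viewed objectively, its central insight is that current LLMs show marked deficits in metacognition (such as reflecting on their own thinking), in strategy consistency, and in learning from earlier attempts, which raises serious questions about their actual reasoning ability.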

Abstract: In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game’s dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models - ChatGPT, Claude, and Gemini - under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling “extended thinking”. Qualitative analysis of the models’ reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one’s own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs’ metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.


[13] Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning cs.CL

Magnus Boman

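TL;DR: The paper proposes a formal framework based on a deterministic multi-tape Turing machine for analyzing the reasoning failure modes of large language models (LLMs). The framework decomposes an LLM interaction into separate tapes, one each for input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text, so that an error can be localized to the specific processing stage where it occurs.

Details

Motivation: The goal is to explain why LLMs fail on seemingly simple tasks and to offer a rigorous, falsifiable analytical tool for understanding their systematic errors, as an alternative to the geometric metaphors commonly used.

Result: The formal model exposes concrete causes of failure, for example how tokenization obscures the character-level structure that counting tasks require. It also explains qualitatively why techniques such as chain-of-thought prompting help, namely by externalizing computation onto the output tape, and where their fundamental limits lie.

Insight: The main contribution is a formal analysis framework, based on a multi-tape Turing machine, that can localize errors to a precise stage of the pipeline. It offers a new theoretical perspective on the internal workings and failure modes of LLMs, complementing empirical scaling laws with principled error analysis.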

Abstract: Large language models (LLMs) exhibit failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction using a deterministic multi-tape Turing machine, where each tape represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, e.g., how tokenisation obscures character-level structure needed for counting tasks. The model clarifies why techniques like chain-of-thought prompting help, by externalising computation on the output tape, while also revealing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.
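The tokenization example in the abstract can be illustrated with a toy pipeline. The sketch below is an illustration only, not the paper's formalism: the vocabulary, token ids, and the greedy longest-match tokenizer `toy_tokenize` are all hypothetical. It shows that a character-level count is trivial on the input-characters tape, while the token tape no longer exposes those characters.

```python
def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match tokenizer over a hypothetical vocabulary."""
    ids: list[int] = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


# Hypothetical vocabulary: two subword pieces plus the single letter "r".
VOCAB = {"straw": 101, "berry": 102, "r": 7}

word = "strawberry"
char_count = word.count("r")           # counting on the character tape
token_ids = toy_tokenize(word, VOCAB)  # what the model actually receives
r_tokens = token_ids.count(VOCAB["r"]) # the letter "r" never appears as a token

print(char_count, token_ids, r_tokens)
```

Here the word becomes two opaque ids, so any counting has to be recovered from what the model has learned about those ids rather than read off the input.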


[14] VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering cs.CL

Shuhui Qu

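TL;DR: VDLM is a modular variable diffusion language model that separates semantic planning from text rendering, addressing the inability of autoregressive language models to revise iteratively during reasoning. It applies masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, post-trains the planner with trajectory-aware optimization using embedding-space rewards and value functions, and finally converts the planned embeddings back to text with a Vec2Text renderer.

Details

Motivation: Autoregressive language models decode left to right with irreversible commitments, which limits revision during multi-step reasoning, so a model structure that supports iterative refinement is needed.

Result: Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre-training and yields substantial post-training improvements on long-form generation tasks, outperforming other baselines.

Insight: Contributions include a modular design that decouples semantic planning from text rendering, post-training in embedding space that avoids text decoding inside the RL loop, and embedding perturbations that make the renderer robust to planner noise.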

Abstract: Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose VDLM, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA-style masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, then post-trains the planner with trajectory-aware optimization using embedding-space rewards and values, avoiding text decoding inside the RL loop. To convert planned embeddings back to text, we use a Vec2Text renderer and introduce embedding perturbations to robustify decoding under planner noise. Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre-training and yields substantial post-training improvements on long-form generation tasks, outperforming other baselines. These results highlight the effectiveness of embedding-space post-training and robust latent-to-text rendering for diffusion language modeling.


[15] P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA cs.CL | cs.LG

Xingda Lyu, Gongfu Lyu, Zitai Yan, Yuxin Jiang

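TL;DR: This paper proposes P-RAG, a prompt-enhanced parametric retrieval-augmented generation model that combines the LLM's internal parametric knowledge with externally retrieved evidence, using low-rank adaptation (LoRA) fine-tuning and chain-of-thought prompting, to improve performance on biomedical and multi-hop question answering.

Details

Motivation: Addresses two limitations: LLMs rely on static training data, and conventional RAG depends heavily on knowledge base quality. The paper explores whether combining parametric knowledge with retrieved evidence can improve accuracy and adaptability in complex domain question answering.

Result: On PubMedQA, P-RAG reaches an F1 of 93.33%, 10.47 percentage points above Standard RAG. On 2WikiMultihopQA, P-RAG reaches an overall score of 33.44%, nearly double that of Standard RAG, and 44.03% on the Compare subset. The paper reports these as state-of-the-art results on both datasets.

Insight: Contributions include: 1) LoRA fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA; 2) the P-RAG hybrid architecture with chain-of-thought prompting; 3) SOTA results on biomedical and multi-hop QA benchmarks, demonstrating the value of combining parametric and retrieved knowledge.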

Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on knowledge base quality. To explore potential improvements, we evaluated three RAG variants on both general and biomedical datasets: Standard RAG, DA-RAG, and our proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture that integrates parametric knowledge within the LLM and retrieved evidence, guided by Chain-of-Thought (CoT) prompting and Low-Rank Adaptation (LoRA) fine-tuning. Using LLaMA-3.2-1B-Instruct fine-tuned via LoRA, we evaluate on PubMedQA and 2WikiMultihopQA. P-RAG outperforms Standard RAG on PubMedQA by 10.47 percentage points in F1 (93.33% vs. 82.86%; 12.64% relative). On 2WikiMultihopQA, P-RAG nearly doubles the overall score vs. Standard RAG (33.44% vs. 17.83%) and achieves 44.03% on the Compare subset (with 42.74% Bridge, 21.84% Inference, 8.60% Compose). CoT prompting substantially improves multi-hop reasoning but yields mixed results for simpler, single-hop queries. These findings underscore P-RAG’s potential for accurate, scalable, and contextually adaptive biomedical question answering. Our contributions include: (1) LoRA-based fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA, (2) introduction of P-RAG with Chain-of-Thought prompting, and (3) state-of-the-art results on PubMedQA and 2WikiMultihopQA.
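The abstract mixes absolute gains (percentage points) with relative gains (percent), so a short worked check may help when comparing the numbers. The sketch below only re-derives the quoted figures from the scores given in the abstract; the helper names are illustrative.

```python
def abs_gain_pp(new: float, base: float) -> float:
    """Absolute gain in percentage points between two percentage scores."""
    return new - base


def rel_gain_pct(new: float, base: float) -> float:
    """Relative gain of `new` over `base`, expressed in percent."""
    return (new - base) / base * 100.0


# Scores quoted in the abstract (all in percent).
pubmed_prag, pubmed_std = 93.33, 82.86
wiki_prag, wiki_std = 33.44, 17.83

pubmed_abs = abs_gain_pp(pubmed_prag, pubmed_std)
pubmed_rel = rel_gain_pct(pubmed_prag, pubmed_std)
wiki_ratio = wiki_prag / wiki_std

print(f"PubMedQA: +{pubmed_abs:.2f} pp, +{pubmed_rel:.2f}% relative")
print(f"2WikiMultihopQA: {wiki_ratio:.2f}x the Standard RAG score")
```

The PubMedQA figures reproduce the quoted 10.47 points and 12.64% relative gain, and the 2WikiMultihopQA ratio comes out a little under 1.9, consistent with "nearly doubles".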


[16] Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion cs.CL | cs.AI

Pengcheng Zhou, Haochen Li, Zhiqiang Nie, JiaLe Chen, Qing Gong

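TL;DR: This paper proposes CogitoRAG, a retrieval-augmented generation framework inspired by human episodic memory that strengthens semantic integrity by extracting and evolving a semantic gist. It has an offline indexing stage, which turns the corpus into a gist memory corpus and a multi-dimensional knowledge graph, and an online retrieval stage, which uses query decomposition, entity diffusion, and CogniRank reranking. Evidence is finally delivered in a passage-memory pairing format. Experiments show that it significantly outperforms existing SOTA methods on several QA benchmarks and on GraphBench.

Details

Motivation: Addresses the loss of semantic integrity and the retrieval deviations caused by the discrete representation of text in existing RAG frameworks, and aims to improve complex knowledge integration and reasoning by simulating human cognitive memory processes.

Result: On five mainstream QA benchmarks and on multi-task generation in GraphBench, CogitoRAG significantly outperforms state-of-the-art RAG methods, showing superior capabilities in complex knowledge integration and reasoning.

Insight: Contributions include the extraction and evolution of the semantic gist, the construction of a multi-dimensional knowledge graph, a Query Decomposition Module that mimics human cognitive decomposition, an Entity Diffusion Module guided by structural relevance and an entity-frequency reward mechanism, and the CogniRank algorithm, which fuses diffusion scores with semantic similarity for precise reranking. Viewed objectively, the framework brings principles from cognitive science into RAG and improves semantic coherence and retrieval accuracy.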

Abstract: Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human cognitive memory processes. The core of this framework lies in the extraction and evolution of the Semantic Gist. During the offline indexing stage, CogitoRAG first deduces unstructured corpora into gist memory corpora, which are then transformed into a multi-dimensional knowledge graph integrating entities, relational facts, and memory nodes. In the online retrieval stage, the framework handles complex queries via a Query Decomposition Module that breaks them into comprehensive sub-queries, mimicking the cognitive decomposition humans employ for complex information. Subsequently, an Entity Diffusion Module performs associative retrieval across the graph, guided by structural relevance and an entity-frequency reward mechanism. Furthermore, we propose the CogniRank algorithm, which precisely reranks candidate passages by fusing diffusion-derived scores with semantic similarity. The final evidence is delivered to the generator in a passage-memory pairing format, providing high-density information support. Experimental results across five mainstream QA benchmarks and multi-task generation on GraphBench demonstrate that CogitoRAG significantly outperforms state-of-the-art RAG methods, showcasing superior capabilities in complex knowledge integration and reasoning.
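The abstract does not specify how CogniRank fuses its two signals, so the following is not that algorithm. It is a generic sketch of score-fusion reranking under stated assumptions: both scores are min-max normalized and combined with a hypothetical weight `alpha`. The `Candidate` fields and the example values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    passage_id: str
    diffusion_score: float  # assumed: score from graph-based associative retrieval
    semantic_sim: float     # assumed: query-passage embedding similarity


def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def fuse_and_rerank(cands: list[Candidate], alpha: float = 0.5) -> list[tuple[str, float]]:
    """Rank candidates by alpha * diffusion + (1 - alpha) * similarity.

    The linear fusion and the weight alpha are assumptions for this sketch.
    """
    d = min_max([c.diffusion_score for c in cands])
    s = min_max([c.semantic_sim for c in cands])
    fused = [alpha * di + (1.0 - alpha) * si for di, si in zip(d, s)]
    order = sorted(range(len(cands)), key=lambda i: fused[i], reverse=True)
    return [(cands[i].passage_id, fused[i]) for i in order]


candidates = [
    Candidate("p1", diffusion_score=0.2, semantic_sim=0.9),
    Candidate("p2", diffusion_score=0.8, semantic_sim=0.7),
    Candidate("p3", diffusion_score=0.5, semantic_sim=0.1),
]
ranked = fuse_and_rerank(candidates)
print([pid for pid, _ in ranked])
```

Normalizing before fusing keeps one signal from dominating purely because of its scale; whether the paper does this is not stated in the abstract.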


[17] Every Little Helps: Building Knowledge Graph Foundation Model with Fine-grained Transferable Multi-modal Tokens cs.CL

Yichi Zhang, Zhuo Chen, Lingbing Guo, Wen Zhang, Huajun Chen

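TL;DR: This paper proposes TOFU, a token-based knowledge graph foundation model for multi-modal knowledge graph reasoning. By discretizing structural, visual, and textual information into modality-specific tokens and using a hierarchical fusion architecture with mixture-of-message mechanisms, the model generalizes strongly across different multi-modal knowledge graphs.

Details

Motivation: Most existing methods are designed for a transductive setting and struggle to generalize to new knowledge graphs, while existing knowledge graph foundation models mainly exploit structural patterns and ignore rich multi-modal signals. This paper addresses both shortcomings by building a model that uses multi-modal content effectively and transfers across graphs.

Result: Experiments on 17 transductive, inductive, and fully-inductive multi-modal knowledge graphs show that TOFU consistently outperforms strong knowledge graph foundation model baselines and multi-modal knowledge graph reasoning baselines, with strong performance on unseen multi-modal knowledge graphs.

Insight: The novelty lies in discretizing multi-modal information into tokens and enabling cross-modal interaction through hierarchical fusion and mixture-of-message mechanisms, which yields transferable features. This suggests a new direction for building general multi-modal knowledge graph foundation models and highlights the importance of fine-grained tokenization and effective fusion strategies.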

Abstract: Multi-modal knowledge graph reasoning (MMKGR) aims to predict the missing links by exploiting both graph structure information and multi-modal entity contents. Most existing works are designed for a transductive setting, which learns dataset-specific embeddings and struggles to generalize to new KGs. Recent knowledge graph foundation models (KGFMs) improve cross-KG transfer, but they mainly exploit structural patterns and ignore rich multi-modal signals. We address these gaps by proposing a token-based foundation model (TOFU) for MMKGR, which exhibits strong generalization across different MMKGs. TOFU discretizes structural, visual, and textual information into modality-specific tokens. TOFU then employs a hierarchical fusion architecture with mixture-of-message mechanisms, aiming to process these tokens and obtain transferable features for MMKGR. Experimental results on 17 transductive, inductive, and fully-inductive MMKGs show that TOFU consistently outperforms strong KGFM and MMKGR baselines, delivering strong performance on unseen MMKGs.


[18] MultiCube-RAG for Multi-hop Question Answering cs.CLPDF

Jimeng Shi, Wei Hu, Runchu Tian, Bowen Jin, Wonbin Kweon

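TL;DR: This paper proposes MultiCube-RAG, a training-free method built on an ontology-based cube structure for multi-hop question answering, where existing retrieval-augmented generation methods struggle to capture structural semantics accurately, are computationally expensive, and rely on single-step retrieval. The method models subjects, attributes, and relations with multi-dimensional cubes and decomposes complex queries into simple subqueries along cube dimensions for sequential reasoning and retrieval.

Details

Motivation: Existing graph-based RAG methods build graphs that are noisy and computationally expensive, and most methods rely on single-step retrieval, neglecting the multi-hop reasoning process. Training-based methods try to incentivize LLMs to reason and retrieve iteratively, but their training is prone to unstable convergence and high computational overhead.

Result: Experiments on four multi-hop QA datasets show that MultiCube-RAG improves response accuracy by 8.9% over the average performance of various baselines, while also being more efficient and inherently explainable.

Insight: The novelty is an ontology-based multi-dimensional cube structure for modeling structural semantics, which enables training-free multi-step reasoning and retrieval. Specializing cubes and decomposing queries along cube dimensions improves retrieval precision and efficiency and makes the process more explainable.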

Abstract: Multi-hop question answering (QA) necessitates multi-step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval-augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph-based RAGs structure such information in graphs, but the resulting graphs are often noisy and computationally expensive. Moreover, most methods rely on single-step retrieval, neglecting the need for multi-hop reasoning processes. Recent training-based approaches attempt to incentivize the large language models (LLMs) for iterative reasoning and retrieval, but their training processes are prone to unstable convergence and high computational overhead. To address these limitations, we devise an ontology-based cube structure with multiple and orthogonal dimensions to model structural subjects, attributes, and relations. Built on the cube structure, we propose MultiCube-RAG, a training-free method consisting of multiple cubes for multi-step reasoning and retrieval. Each cube specializes in modeling a class of subjects, so that MultiCube-RAG flexibly selects the most suitable cubes to acquire the relevant knowledge precisely. To enhance the query-based reasoning and retrieval, our method decomposes a complex multi-hop query into a set of simple subqueries along cube dimensions and conquers each of them sequentially. Experiments on four multi-hop QA datasets show that MultiCube-RAG improves response accuracy by 8.9% over the average performance of various baselines. Notably, we also demonstrate that our method performs with greater efficiency and inherent explainability.


[19] Doc-to-LoRA: Learning to Instantly Internalize Contexts cs.CL | cs.AIPDF

Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, Robert Tjarko Lange

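TL;DR: This paper proposes Doc-to-LoRA (D2L), a lightweight hypernetwork that targets the high attention cost and the memory-intensive, slow inference of LLMs on long input sequences. Through meta-learning, D2L generates a LoRA adapter for a given prompt in a single forward pass, so the target LLM can answer subsequent queries without re-processing the original context, reducing latency and KV-cache memory consumption.

Details

Motivation: The quadratic attention cost of Transformers on long contexts causes high inference latency and memory consumption, and conventional context distillation cannot be applied per prompt on the fly because of its training cost and latency.

Result: On a long-context needle-in-a-haystack task, D2L learns to map contexts into adapters that store the key information, achieving near-perfect zero-shot accuracy at sequence lengths more than 4x the target LLM's native context window. On real-world QA datasets with limited compute, D2L outperforms standard context distillation while significantly reducing peak memory consumption and update latency.

Insight: The claimed novelty is a meta-learned hypernetwork architecture that instantly (in a single forward pass) generates a parameter-efficient LoRA adapter for any new prompt, approximating context distillation and thereby supporting rapid knowledge updates and personalization of LLMs. Viewed objectively, compressing dynamic context information into lightweight adapter parameters offers a new approach to efficient long-context inference and on-the-fly model adaptation.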

Abstract: Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM’s native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.


[20] DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting cs.CL | cs.AI | cs.CVPDF

Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain

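TL;DR: This paper presents DocSplit, the first comprehensive benchmark for document packet splitting, comprising five datasets of varying complexity that cover diverse document types, layouts, and multimodal settings. It defines the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain page order, and introduces new evaluation metrics. Extensive experiments on multimodal LLMs reveal significant performance gaps on complex document splitting tasks.

Details

Motivation: Real-world document understanding often involves heterogeneous multi-page packets made of several documents stitched together, yet the fundamental task of document packet splitting remains under-studied.

Result: An extensive evaluation of multimodal LLMs on the proposed DocSplit datasets shows significant performance gaps in current models on complex document splitting tasks.

Insight: The novelty is the first comprehensive benchmark for document packet splitting together with new evaluation metrics, and a systematic formalization of the task covering boundary identification, type classification, and page ordering. This provides a framework for document understanding research in document-intensive domains such as legal, financial, and healthcare.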

Abstract: Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models’ ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.


[21] Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs cs.CL | cs.AIPDF

Sean Trott, Samuel Taylor, Cameron Jones, James A. Michaelov, Pamela D. Rivière

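TL;DR: This study evaluates mental state reasoning on the false belief task across 41 open-weight language models. 34% of the models are sensitive to knowledge states, but none fully accounts for the human effect; larger models perform better, and model behavior supports a new hypothesis about human cognition.

Details

Motivation: To test theories of human mental state reasoning (such as the language exposure hypothesis) with a large sample of open-weight language models and to evaluate the models' own capacities, addressing prior work's reliance on small samples of closed-source models.

Result: Across 41 open-weight models, 34% are sensitive to knowledge states, but none reaches the human level; model size correlates positively with sensitivity and with psychometric predictive power; the human bias toward attributing false beliefs when cued with a non-factive verb falls within the distribution of model effect sizes.

Insight: Large samples of open-weight models allow more rigorous tests of psychological theories. The distributional statistics of language can partly explain human cognitive biases (such as the verb cue effect) but are not sufficient to explain the core difference in knowledge-state sensitivity. Model size is a key factor in mental state reasoning ability.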

Abstract: Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition (such as the theory that mental state reasoning emerges in part from language exposure) and our understanding of LMs themselves. Yet much published work on LMs relies on a relatively small sample of closed-source LMs, limiting our ability to rigorously test psychological theories and evaluate LM capacities. Here, we replicate and extend published work on the false belief task by assessing LM mental state reasoning behavior across 41 open-weight models (from distinct model families). We find sensitivity to implied knowledge states in 34% of the LMs tested; however, consistent with prior work, none fully "explain away" the effect in humans. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power. Finally, we use LM behavior to generate and test a novel hypothesis about human cognition: both humans and LMs show a stronger bias towards attributing false beliefs when knowledge states are cued using a non-factive verb ("John thinks…") than when cued indirectly ("John looks in the…"). Unlike the primary effect of knowledge states, where human sensitivity exceeds that of LMs, the magnitude of the human knowledge cue effect falls squarely within the distribution of LM effect sizes, suggesting that distributional statistics of language can in principle account for the latter but not the former in humans. These results demonstrate the value of using larger samples of open-weight LMs to test theories of human cognition and evaluate LM capacities.


[22] Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities cs.CL | cs.AIPDF

Shankar Padmanabhan, Mustafa Omer Gul, Tanya Goyal

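TL;DR: This paper proposes DiSC (Distillation via Split Contexts), a simple context-distillation method that addresses the tension between learning new knowledge and forgetting existing skills during continual knowledge adaptation of LLMs. It derives student and teacher distributions by splitting each training example into distinct segments and minimizes the KL divergence over the shared tokens, so context distillation can be applied efficiently without an explicit generation step.

Details

Motivation: Post-trained LLMs only encode knowledge up to a cut-off date and need continual adaptation to new knowledge, but existing methods cannot both learn new knowledge and mitigate forgetting of previously acquired capabilities such as instruction following and reasoning.

Result: Experiments on four post-trained models and two adaptation domains show that, compared with prior finetuning and distillation methods for continual adaptation, DiSC consistently achieves the best trade-off between learning new knowledge and mitigating forgetting of prior skills such as instruction following, reasoning, and factual knowledge.

Insight: The novelty is a split-context distillation method (DiSC) that derives student and teacher distributions by conditioning on distinct segments of a training example, avoiding an explicit generation step during training and thereby enabling efficient continual knowledge adaptation that balances new knowledge acquisition against retention of existing skills. Viewed objectively, the method simplifies context distillation and may reduce compute cost and improve adaptation efficiency.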

Abstract: Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously learn new knowledge from an adaptation document corpus and mitigate the forgetting of earlier learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation based approach for continual knowledge adaptation. DiSC derives student and teacher distributions by conditioning on distinct segments of the training example and minimizes the KL divergence between the shared tokens. This allows us to efficiently apply context-distillation without requiring explicit generation steps during training. We run experiments on four post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently reports the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills like instruction-following, reasoning, and factual knowledge.
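The objective described above, a KL divergence between teacher and student distributions over the tokens they share, can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the KL direction (teacher to student), the averaging over positions, and the toy three-word vocabulary are all assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def shared_token_kl(teacher_dists, student_dists):
    """Mean KL(teacher || student) over the positions both models predict."""
    if len(teacher_dists) != len(student_dists):
        raise ValueError("teacher and student must cover the same positions")
    total = sum(kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists))
    return total / len(teacher_dists)


# Toy next-token distributions at two shared positions (made-up numbers).
# Teacher: conditioned on one segment of the example; student: on another.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.4, 0.4, 0.2], [0.3, 0.4, 0.3]]
loss = shared_token_kl(teacher, student)
```

Because both distributions come from forward passes over existing text, no sampling from the teacher is needed, which matches the abstract's point about avoiding explicit generation steps.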


[23] Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis cs.CL | cs.LGPDF

Rong Fu, Wenxin Zhang, Ziming Wang, Chunlei Meng, Jiaxuan Lu

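TL;DR: This paper proposes Missing-by-Design (MBD), a framework for revocable multimodal sentiment analysis that deletes specific data modalities through structured representation learning and a certifiable parameter-modification pipeline, in response to privacy compliance requirements.

Details

Motivation: As multimodal systems process more sensitive personal data, the ability to selectively revoke specific data modalities has become a key requirement for privacy compliance and user autonomy.

Result: Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and offers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.

Insight: The novelty lies in combining property-aware embedding learning with generator-based reconstruction and introducing a machine-verifiable Modality Deletion Certificate, with certifiable modality deletion achieved through saliency-driven candidate selection and a calibrated Gaussian update.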

Abstract: As multimodal systems increasingly process sensitive personal data, the ability to selectively revoke specific data modalities has become a critical requirement for privacy compliance and user autonomy. We present Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis that combines structured representation learning with a certifiable parameter-modification pipeline. Revocability is critical in privacy-sensitive applications where users or regulators may request removal of modality-specific information. MBD learns property-aware embeddings and employs generator-based reconstruction to recover missing channels while preserving task-relevant signals. For deletion requests, the framework applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.


[24] Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution cs.CL | cs.AIPDF

Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani

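TL;DR: This paper proposes REMUL, a multi-party reinforcement learning method that addresses the trade-off between faithfulness and task performance in chain-of-thought reasoning. A speaker model generates a reasoning trace and multiple listener models execute it to check its clarity, which improves faithfulness, while masked supervised finetuning preserves accuracy.

Details

Motivation: Chain-of-thought reasoning sometimes fails to faithfully reflect an LLM's true computation, which limits its usefulness for explaining how the model reaches its answers, and optimizing for faithfulness and interpretability often degrades task performance.

Result: On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL substantially improves three faithfulness measures (hint attribution, early answering area over the curve, and mistake injection area over the curve) while also improving accuracy.

Insight: The novelty is a multi-party reinforcement learning framework built on the hypothesis that reasoning traces other parties can follow are more faithful. It jointly optimizes faithfulness and performance through speaker-listener interaction and correctness regularization, offering a new approach to more interpretable and reliable reasoning.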

Abstract: Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who “execute” the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness – hint attribution, early answering area over the curve (AOC), and mistake injection AOC – while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.


[25] LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers cs.CLPDF

Peiqi Sui

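TL;DR: Using an information-theoretic approach, this paper quantifies the "uncertainty gap" between humans and LLMs in creative writing. Human writing shows significantly higher uncertainty than model-generated text, instruction-tuned and reasoning models widen the gap, and the gap correlates strongly with writing quality.

Details

Motivation: The paper examines uncertainty as a key limiting factor in LLM creative writing: current alignment strategies suppress uncertain outputs in pursuit of factuality, which may harm the literary richness that creative expression requires.

Result: A controlled analysis of 28 LLMs on high-quality storytelling datasets shows that human writing consistently exhibits significantly higher uncertainty than model outputs; the gap is more pronounced in creative writing than in functional domains and correlates strongly with writing quality.

Insight: The novelty is formalizing and quantifying uncertainty as an "uncertainty gap". The paper argues that human-level creativity requires new uncertainty-aware alignment paradigms that distinguish destructive hallucinations from the constructive ambiguity needed for literary richness.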

Abstract: We argue that uncertainty is a key and understudied limitation of LLMs’ performance in creative writing, which is often characterized as trite and cliché-ridden. Literary theory identifies uncertainty as a necessary condition for creative expression, while current alignment strategies steer models away from uncertain outputs to ensure factuality and reduce hallucination. We formalize this tension by quantifying the “uncertainty gap” between human-authored stories and model-generated continuations. Through a controlled information-theoretic analysis of 28 LLMs on high-quality storytelling datasets, we demonstrate that human writing consistently exhibits significantly higher uncertainty than model outputs. We find that instruction-tuned and reasoning models exacerbate this trend compared to their base counterparts; furthermore, the gap is more pronounced in creative writing than in functional domains, and strongly correlates to writing quality. Achieving human-level creativity requires new uncertainty-aware alignment paradigms that can distinguish between destructive hallucinations and the constructive ambiguity required for literary richness.
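One way to make the "uncertainty gap" concrete is to compare the average entropy of next-token distributions along a human-written text with that along a model-generated continuation. The abstract does not specify the estimator, so the use of mean Shannon entropy, the function names, and the toy distributions below are assumptions for illustration.

```python
import math


def entropy_bits(dist):
    """Shannon entropy (in bits) of one next-token distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def mean_entropy(dists):
    """Average entropy over all token positions of a text."""
    return sum(entropy_bits(d) for d in dists) / len(dists)


def uncertainty_gap(human_dists, model_dists):
    """Positive when the human text is, on average, less predictable."""
    return mean_entropy(human_dists) - mean_entropy(model_dists)


# Toy distributions over a 4-word vocabulary at two positions each.
human = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5, 0.0, 0.0]]
model = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
gap = uncertainty_gap(human, model)
```

In this toy case the human text averages 1.5 bits per position and the model text 0.5 bits, so the gap is 1 bit.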


[26] MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks cs.CLPDF

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen

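TL;DR: This paper introduces MemoryArena, a unified benchmark for evaluating agent memory in multi-session settings with interdependent tasks. It consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory and then use that memory to guide later actions and solve the overall task.

Details

Motivation: Existing evaluations of agents with memory usually test memorization and action in isolation, fail to capture how memory guides future decisions, and mostly focus on single-session tasks, leaving realistic settings where memory and action are tightly coupled unevaluated.

Result: Agents whose performance is near-saturated on existing long-context memory benchmarks (such as LoCoMo) perform poorly in MemoryArena's agentic setting, exposing a gap in current evaluations of agents with memory.

Insight: The novelty is a unified evaluation framework built around multi-session Memory-Agent-Environment loops that emphasizes the coupling between acquiring memory and using it to guide action, with tasks designed around explicitly interdependent subtasks to measure the practical usefulness of agent memory more realistically.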

Abstract: Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.


[27] TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers cs.CLPDF

Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif

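TL;DR: TabAgent is a framework for replacing generative decision components in agentic systems. It uses a compact textual-tabular classifier trained on execution traces in place of repeated LLM calls for closed-set decision tasks (such as routing, shortlisting, gating, and verification), significantly reducing latency and inference cost.

Details

Motivation: Current LLM-based agentic systems typically call an LLM repeatedly for closed-set decisions when executing multi-step workflows, which makes deployments slow and expensive due to cumulative latency and token usage.

Result: On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%.

Insight: The novelty lies in extracting structured schema, state, and dependency features from trajectories, augmenting coverage with schema-aligned synthetic supervision, and scoring candidates with a lightweight classifier, which provides a learned discriminative replacement for generative bottlenecks in production agent architectures.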

Abstract: Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative latency and token usage. We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces. TabAgent (i) extracts structured schema, state, and dependency features from trajectories (TabSchema), (ii) augments coverage with schema-aligned synthetic supervision (TabSynth), and (iii) scores candidates with a lightweight classifier (TabHead). On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%. Beyond tool shortlisting, TabAgent generalizes to other agentic decision heads, establishing a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures.


[28] IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models cs.CL | cs.AIPDF

Saurabh Bharti, Gaurav Azad, Abhinaw Jagtap, Nachiket Tapas

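TL;DR: This paper introduces IndicEval, a bilingual Indian educational evaluation framework for large language models (LLMs). It uses English and Hindi questions from real high-stakes examinations such as UPSC, JEE, and NEET, spanning STEM and humanities, and automates the evaluation of LLM reasoning, domain knowledge, and bilingual adaptability under zero-shot, few-shot, and chain-of-thought prompting.

Details

Motivation: Existing LLM evaluation frameworks lack benchmarks that reflect real academic rigor and multilingual complexity. IndicEval aims to fill this gap with a scalable evaluation platform grounded in real examination standards, giving a more realistic measure of LLM performance in bilingual educational settings.

Result: Experiments on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B show that (1) chain-of-thought prompting consistently improves reasoning accuracy across subjects and languages; (2) there are significant performance disparities across models, particularly on high-complexity examinations; and (3) multilingual degradation is a key challenge, with Hindi accuracy dropping markedly relative to English, especially under zero-shot conditions.

Insight: The novelty is an evaluation benchmark built from real, high-difficulty bilingual examination questions, with an emphasis on practical orientation and extensibility. Viewed objectively, the work exposes persistent gaps in current LLMs' bilingual reasoning and domain knowledge transfer, and provides concrete research directions and data for improving reasoning robustness and language adaptability.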

Abstract: The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly in high-complexity examinations. Third, multilingual degradation remains a critical challenge, with marked accuracy drops in Hindi compared to English, especially under Zero-Shot conditions. These results highlight persistent gaps in bilingual reasoning and domain transfer. Overall, IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable evaluation of LLMs in multilingual educational settings and offers actionable insights for improving reasoning robustness and language adaptability.


[29] Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling cs.CL | cs.AI | cs.MAPDF

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao

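TL;DR: This paper proposes Team-of-Thoughts, a novel multi-agent system architecture that uses an orchestrator-tool paradigm to exploit the complementary capabilities of heterogeneous agents, addressing existing systems' reliance on static, homogeneous model configurations.

Details

Motivation: Existing multi-agent systems typically rely on static, homogeneous model configurations, which limits their ability to exploit the distinct strengths of differently post-trained models.

Result: Experiments on five reasoning and code generation benchmarks show consistently superior task performance. On AIME24 and LiveCodeBench in particular, accuracy reaches 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines (80% and 65.93%).

Insight: The novelty is an orchestrator calibration scheme that identifies models with superior coordination capabilities, and a self-assessment protocol in which tool agents profile their own domain expertise, so that the most suitable tool agents can be activated dynamically at inference time. This offers a reusable design for efficient, heterogeneous multi-agent systems.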

Abstract: Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.


[30] From Growing to Looping: A Unified View of Iterative Computation in LLMs cs.CL | cs.AI | cs.LGPDF

Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer

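TL;DR: This paper gives a mechanistic unification of depth growing and looping. Both techniques show similar depth-wise signatures, such as increased reliance on late layers and recurring patterns within the looped or grown block, suggesting that they stem from a common form of iterative computation. Building on this connection, the paper shows that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to 2x, even though the model was never trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Depth-grown models achieve the largest reasoning gains with higher-quality, math-heavy cooldown mixtures, and adapting a middle block to loop boosts performance further. Overall, the results position depth growing and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.

Details

Motivation: Looping (reusing a block of layers across depth) and depth growing (training shallow-to-deep models by duplicating middle layers) have both been linked to stronger reasoning, but their relationship is unclear. This paper aims to unify the two techniques mechanistically, examine whether they share a common form of iterative computation, and explore their adaptability and composability.

Result: Looped and depth-grown models exhibit convergent depth-wise signatures. Applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to 2x. Both approaches adapt better than the baseline with more in-context examples or additional supervised fine-tuning data. Depth-grown models gain the most in reasoning from high-quality, math-heavy cooldown mixtures, and adapting a middle block to loop improves performance further.

Insight: The novelty is a mechanistic unification of looping and depth growing that reveals their shared iterative-computation nature, together with empirical evidence that the two techniques are adaptable and composable, providing complementary practical methods for improving LLM reasoning by inducing and scaling iterative computation. Viewed objectively, this unified view helps in understanding and designing iterative computation modules in LLMs more systematically.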

Abstract: Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to $2\times$, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.
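Inference-time looping of a middle block, as described above, amounts to running the same slice of layers several times before continuing to the later layers. The sketch below shows only this control flow on toy scalar "layers"; the function name, the block boundaries, and the toy layers are hypothetical, and a real model would apply Transformer blocks to hidden states instead.

```python
def forward_with_loop(layers, x, loop_start, loop_end, n_loops=1):
    """Run layers in order, repeating layers[loop_start:loop_end] n_loops times."""
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(n_loops):
        # The same weights are reused on every pass through the block.
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x


# Three toy "layers" acting on a number; the middle one is the looped block.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
single_pass = forward_with_loop(layers, 0, loop_start=1, loop_end=2, n_loops=1)
three_loops = forward_with_loop(layers, 0, loop_start=1, loop_end=2, n_loops=3)
```

With `n_loops=1` this is the ordinary forward pass, so looping can be switched on at inference time without changing any weights.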


[31] Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models cs.CL | cs.AI | cs.CV | cs.LGPDF

Melkamu Abay Mersha, Jugal Kalita

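TL;DR: This paper proposes Context-Aware Layer-wise Integrated Gradients (CA-LIG), a unified hierarchical attribution framework for explaining the predictions of Transformer models. It computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients, producing signed, context-sensitive attribution maps that capture supportive and opposing evidence and trace the hierarchical flow of relevance through the Transformer layers.

Details

Motivation: Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unifying them, and lack context-awareness of inter-token dependencies and structural components. They cannot capture how relevance evolves across layers or how structural components shape decisions.

Result: Across tasks, domains, and Transformer model families, including sentiment analysis, long and multi-class document classification (with BERT), hate speech detection in a low-resource language setting (with XLM-R and AfroLM), and image classification (with a Masked Autoencoder vision Transformer), CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than existing explainability methods.

Insight: The novelty is a unified, context-aware hierarchical attribution framework that combines layer-wise Integrated Gradients with class-specific attention gradients to capture how relevance evolves across layers and how structural components contribute to Transformer decisions, improving both the interpretability and the conceptual understanding of deep neural networks.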

Abstract: Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
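CA-LIG builds on Integrated Gradients, which attributes a prediction to each input feature by averaging gradients along a straight path from a baseline to the input. The sketch below shows only that standard building block on a toy function with an analytic gradient; it is not the CA-LIG framework itself, and the layer-wise computation and attention-gradient fusion described in the abstract are not reproduced here. The function names and the midpoint Riemann sum are illustrative choices.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the path-averaged gradient by the input-baseline difference.
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


def f(v):
    """Toy model: f(v) = v0^2 + 3 * v1."""
    return v[0] ** 2 + 3.0 * v[1]


def grad_f(v):
    """Analytic gradient of f."""
    return [2.0 * v[0], 3.0]


x, baseline = [2.0, 1.0], [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to `f(x) - f(baseline)` (the completeness property), which is one reason the method is a common base for attribution frameworks; the signs of the attributions are what allow supportive and opposing evidence to be distinguished.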


[32] Who can we trust? LLM-as-a-jury for Comparative Assessment cs.CL | cs.AI | cs.LGPDF

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill

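TL;DR: This paper studies the problem of inconsistent probabilities when LLMs act as automatic evaluators in pairwise comparative assessment of natural language generation. It proposes BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter per judge to jointly infer item rankings and judge reliability, achieving unsupervised calibration without human-labelled supervision.

Details

Motivation: Existing LLM evaluation approaches typically rely on a single judge or aggregate multiple judges under the assumption of equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, their judgment probabilities may be biased and inconsistent, and human-labelled data for calibration may be unavailable.

Result: Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation, and that the learned discriminator parameter correlates strongly with independent measures of the cycle consistency of LLM judgments.

Insight: The novelty is treating LLMs as a jury and modelling judge reliability, using an unsupervised extension of the Bradley-Terry model to calibrate probabilities. This gives a principled framework for aggregating multiple LLM evaluators and an interpretable measure of judge reliability.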

Abstract: Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
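To illustrate what a per-judge discriminator parameter could do in a Bradley-Terry model, the sketch below scales the skill difference between two items by a judge-specific factor before applying the logistic function. The abstract does not give BT-sigma's exact likelihood, so this parameterization, the function name, and the numbers are assumptions for illustration only.

```python
import math


def judge_preference_prob(score_i, score_j, discriminator):
    """P(judge prefers item i over item j) under an assumed judge-aware model.

    With discriminator = 1 this is the standard Bradley-Terry probability
    (on log-strength scores). A discriminator near 0 models an unreliable
    judge whose preferences approach a coin flip; a large value models a
    judge who sharply separates items of different quality.
    """
    return 1.0 / (1.0 + math.exp(-discriminator * (score_i - score_j)))


# Same pair of items (latent scores 1.0 and 0.0) seen by three judges.
noisy_judge = judge_preference_prob(1.0, 0.0, discriminator=0.0)
standard_judge = judge_preference_prob(1.0, 0.0, discriminator=1.0)
sharp_judge = judge_preference_prob(1.0, 0.0, discriminator=4.0)
```

Fitting the item scores and the discriminators jointly from pairwise comparisons would then let reliable judges carry more weight than simple averaging allows.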


Subrit Dikshit

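TL;DR: This paper introduces Quecto-V1, a Small Language Model (SLM) for the Indian legal domain that addresses the difficulty of deploying Large Language Models (LLMs) in resource-constrained environments. The model is trained from scratch on a corpus of Indian statutes using the GPT-2 architecture (124 million parameters) and is compressed to under 150 MB through 8-bit quantization (GGUF format), so it can run offline on consumer-grade CPUs and retrieve legal provisions with high fidelity.

Details

Motivation: State-of-the-art legal intelligence systems depend on massive parameter counts (7B+) and cloud-based inference, which creates a "resource divide": they are unusable in resource-constrained environments and pose data sovereignty risks. The paper aims to provide a democratized, privacy-preserving, accessible tool for Indian legal intelligence.

Result: On domain-specific exact match tasks, Quecto-V1 retrieves statutory definitions and penal provisions with high fidelity and outperforms general-purpose SLMs. 8-bit quantization reduces model size by 74% with less than 3.5% degradation in retrieval accuracy compared with full-precision baselines.

Insight: The novelty is combining domain-specific training (maximizing "lexical density") with aggressive quantization (8-bit GGUF) for a highly specialized domain such as law, offering a viable alternative for privacy-preserving, offline deployment and demonstrating that small, specialized models can be effective on specific tasks.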

Abstract: The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a “resource divide.” State-of-the-art legal intelligence systems typically rely on massive parameter counts (7B+) and cloud-based inference, rendering them inaccessible to practitioners in resource-constrained environments and posing significant data sovereignty risks. This paper introduces Quecto-V1, a domain-specific Small Language Model (SLM) engineered to democratize access to Indian legal intelligence. Built upon a custom configuration of the GPT-2 architecture (124 million parameters), Quecto-V1 was trained from scratch exclusively on a corpus of Indian statutes, including the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India. Unlike generalist models, which prioritize broad world knowledge, our approach maximizes “lexical density” within the legal domain. Furthermore, we address the deployment bottleneck by applying post-training 8-bit quantization (GGUF format), compressing the model to a memory footprint of under 150 MB. Our empirical analysis demonstrates that Quecto-V1 achieves high fidelity in retrieving statutory definitions and penal provisions, outperforming general-purpose SLMs in domain-specific exact match tasks while running entirely offline on consumer-grade CPUs. We further present an ablation study showing that 8-bit quantization yields a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines. These findings suggest that for specialized, high-stakes domains like law, domain-specific training coupled with aggressive quantization offers a viable, privacy-preserving alternative to monolithic cloud models.
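
As a rough illustration of what post-training 8-bit quantization does, the sketch below applies a simple per-tensor symmetric (absmax) scheme. This is not the GGUF storage layout, which is not reproduced here; the function names are hypothetical. The size arithmetic at the end assumes a 4-byte-per-parameter full-precision baseline, which the abstract does not state.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization of a list of floats to int8 codes."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.003, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Back-of-envelope size check (ASSUMES a 32-bit baseline and ignores metadata).
parameter_count = 124_000_000
fp32_megabytes = parameter_count * 4 / 1e6   # 496 MB
int8_megabytes = parameter_count * 1 / 1e6   # 124 MB
reduction = 1.0 - int8_megabytes / fp32_megabytes
```

Under these assumptions the reduction comes out at 75% and the weights occupy about 124 MB, which is in line with the reported 74% reduction and sub-150 MB footprint once scale factors and file metadata are accounted for.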


[34] Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment cs.CL | cs.AI | cs.LGPDF

Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai

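TL;DR: This paper proposes a resource-efficient method for multilingual safety alignment. It introduces a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. The loss aligns multiple languages simultaneously using only multilingual prompt variants, with no additional response-level supervision in low-resource languages, and improves cross-lingual generalization.

Details

Motivation: When large language models are deployed across linguistic communities, existing approaches to extending alignment need either large amounts of high-quality supervision in the target language or pairwise alignment with high-resource languages. Both are resource-intensive and limit scalability.

Result: The method is validated across different model architectures and alignment paradigms. It improves multilingual safety with limited impact on general model utility, and further evaluation across languages and tasks shows improved cross-lingual generalization.

Insight: The novelty lies in improving collinearity between multilingual representation vectors, which enforces directional consistency at the multilingual semantic level in a single update. It is a practical solution for multilingual consistency alignment under limited supervision.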

Abstract: The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, either through large-scale, high-quality supervision in the target language or through pairwise alignment with high-resource languages, which limits scalability. In this work, we propose a resource-efficient method for improving multilingual safety alignment. We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. By improving collinearity between multilingual representation vectors, our method encourages directional consistency at the multilingual semantic level in a single update. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting the proposed approach as a practical solution for multilingual consistency alignment under limited supervision.
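
The abstract does not spell out the MLC loss, so the following is only a sketch of one simple way to reward collinearity: penalise one minus the mean pairwise cosine similarity between the representations of the same prompt in different languages. The function names and the choice of cosine similarity are assumptions, not the paper's formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def consistency_loss(representations):
    """1 minus the mean pairwise cosine similarity across language variants.

    representations: one vector per language variant of the same prompt.
    The loss is 0 when all vectors point in the same direction.
    """
    n = len(representations)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_cos = sum(cosine(representations[i], representations[j]) for i, j in pairs) / len(pairs)
    return 1.0 - mean_cos

# Hypothetical 2-D representations of one prompt in three languages.
aligned = consistency_loss([[1.0, 2.0], [2.0, 4.0], [0.5, 1.0]])
misaligned = consistency_loss([[1.0, 0.0], [0.0, 1.0]])
```

A term of this kind needs only prompts, not responses, which matches the abstract's claim that no response-level supervision in low-resource languages is required.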


[35] Reinforced Fast Weights with Next-Sequence Prediction cs.CLPDF

Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky

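TL;DR: This paper proposes REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models with next-sequence prediction (NSP) to overcome the limitations of conventional next-token prediction (NTP) in long-context modeling. It selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO), improving the model's ability to capture long-range dependencies.

Details

Motivation: Fast weight architectures offer constant memory overhead for long-context modeling, but their potential is limited by the NTP training paradigm. NTP optimizes only single-token predictions and ignores semantic coherence across the multiple tokens that follow a prefix, so models learn suboptimal representations that fail to capture long-range dependencies.

Result: Experiments on LaCT-760M and DeltaNet-1.3B show that REFINE consistently outperforms supervised fine-tuning with NTP on needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench.

Insight: The novelty lies in combining reinforcement learning with the next-sequence prediction objective: sequence-level rewards and GRPO optimization let fast weight models better capture long-range semantic dependencies. Viewed objectively, the method is a general framework applicable throughout the training lifecycle of pre-trained language models (mid-training, post-training, and test-time training) and improves long-context modeling.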

Abstract: Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
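
Two of the steps named in the abstract can be sketched in a few lines: choosing positions by prediction entropy, and turning a group of rollout rewards into group-relative advantages as GRPO does. The helper names, the top-k selection rule, and the toy inputs are assumptions; the paper's reward function and selection criterion are not specified in the abstract.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_positions(distributions, k):
    """Indices of the k token positions with the highest predictive entropy.

    ASSUMPTION: a plain top-k rule; the paper may sample or threshold instead.
    """
    ranked = sorted(range(len(distributions)),
                    key=lambda t: entropy(distributions[t]), reverse=True)
    return sorted(ranked[:k])

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of rollouts (the GRPO baseline)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy next-token distributions at three positions, and rewards for two rollouts.
positions = select_positions([[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]], k=2)
advantages = group_relative_advantages([1.0, 0.0])
```

Because the advantage is computed relative to the other rollouts from the same prefix, no separate value network is needed, which is the main practical appeal of GRPO.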


cs.CV [Back]

[36] Egocentric Bias in Vision-Language Models cs.CV | cs.AIPDF

Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji

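TL;DR: This paper introduces FlipSet, a benchmark for evaluating vision-language models on Level-2 visual perspective taking (L2 VPT). Existing models show a pervasive egocentric bias: they perform below chance on a task that requires simulating a 180-degree rotation of viewpoint, and most errors reproduce the camera viewpoint.

Details

Motivation: Visual perspective taking is foundational to social cognition. The paper aims to diagnose systematic deficits in how current vision-language models understand another agent's viewpoint, in particular the missing mechanisms for binding social awareness to spatial operations.

Result: Of the 103 vision-language models evaluated, the vast majority perform below chance on FlipSet, and roughly three-quarters of errors directly reproduce the camera viewpoint. Control experiments show that models do reasonably well on mental rotation or theory-of-mind tasks in isolation but fail catastrophically when the two must be integrated.

Insight: The novelty is FlipSet itself, a diagnostic benchmark that isolates the task from 3D scene complexity. It reveals a fundamental limitation of current multimodal models in compositional reasoning: social perception and spatial manipulation abilities cannot be effectively combined.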

Abstract: Visual perspective taking, inferring how the world appears from another’s viewpoint, is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent’s perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit: models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.


[37] Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment cs.CVPDF

Jingwei Li, Jiaxin Tong, Pengfei Wu

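TL;DR: This paper proposes MSBA-CLIP, a new deepfake detection framework that combines CLIP's multimodal alignment capabilities with a novel multivariate soft blending data augmentation strategy to improve detection accuracy and generalization.

Details

Motivation: Existing deepfake detection methods often suffer from limited accuracy and poor generalization because samples produced by different forgery techniques exhibit significant distribution shifts. The paper addresses these challenges to build a more robust and more generalizable detector.

Result: Experiments show state-of-the-art performance. On in-domain tests, Accuracy and AUC improve by 3.32% and 4.02%, respectively, over the best baseline. In cross-domain evaluations across five datasets, the average AUC gain is 3.27%. Ablation studies confirm the effectiveness of the proposed components.

Insight: The main novelties are using CLIP's multimodal alignment to capture subtle forgery traces and introducing multivariate soft blending augmentation to synthesize training data that generalizes better. In addition, the Multivariate Forgery Intensity Estimation module explicitly guides the model to learn features related to different forgery modes and intensities, which offers a new direction for improving detector generalization.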

Abstract: The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features related to varied forgery modes and intensities. Extensive experiments demonstrate state-of-the-art performance. On in-domain tests, our method improves Accuracy and AUC by 3.32% and 4.02%, respectively, over the best baseline. In cross-domain evaluations across five datasets, it achieves an average AUC gain of 3.27%. Ablation studies confirm the efficacy of both proposed components. While the reliance on a large vision-language model entails higher computational cost, our work presents a significant step towards more generalizable and robust deepfake detection.
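
The blending step described in the abstract can be pictured as a convex combination of several forged versions of the same image. The sketch below is a simplification under stated assumptions: images are flat lists of pixel values, the weights are drawn uniformly and then normalised, and the blend is applied to the whole image. The abstract does not say how the weights are sampled or whether blending is region-based, and the function names are hypothetical.

```python
import random

def random_convex_weights(n, rng):
    """Draw n positive weights that sum to 1 (ASSUMED sampling scheme)."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def soft_blend(images, weights):
    """Per-pixel weighted sum of same-sized images given as flat pixel lists."""
    return [sum(w * img[p] for w, img in zip(weights, images))
            for p in range(len(images[0]))]

rng = random.Random(0)
# Hypothetical outputs of three forgery methods on the same two-pixel image.
forgeries = [[0.0, 100.0], [50.0, 100.0], [100.0, 100.0]]
weights = random_convex_weights(len(forgeries), rng)
blended = soft_blend(forgeries, weights)
```

In a scheme like this the same weights could also serve as soft targets describing how strongly each forgery method contributes, which is the kind of signal an intensity-estimation module would need.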


[38] MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering cs.CV | cs.AIPDF

Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang

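TL;DR: This paper proposes MaS-VQA, a framework for knowledge-based visual question answering that addresses noisy external knowledge, misalignment with visual content, and hard-to-control internal knowledge. A Mask-and-Select mechanism jointly filters irrelevant image regions and weakly relevant knowledge fragments to produce compact multimodal knowledge, which then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge.

Details

Motivation: In KB-VQA, retrieved knowledge is noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. The paper aims to make knowledge fusion and reasoning more effective.

Result: On the Encyclopedic-VQA and InfoSeek benchmarks, the method yields consistent gains across multiple multimodal LLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.

Insight: The novelty lies in tightly coupling explicit knowledge filtering with implicit knowledge reasoning: the Mask-and-Select mechanism performs joint cross-modal pruning, and the filtered knowledge constrains the activation space of internal knowledge, improving the robustness and interpretability of knowledge fusion.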

Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.


[39] EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery cs.CV | cs.AIPDF

Zelin Xu, Yupu Zhang, Saugat Adhikari, Saiful Islam, Tingsong Xiao

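TL;DR: This paper presents EarthSpatialBench, a comprehensive benchmark for evaluating the spatial reasoning of multimodal large language models on Earth imagery. It contains over 325K question-answer pairs covering qualitative and quantitative distance and direction reasoning, systematic topological relations, multiple query types, and multiple object representations.

Details

Motivation: Existing Earth imagery benchmarks focus mainly on 2D spatial grounding, image captioning, and coarse spatial relations. They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex geometries, so a new benchmark is needed to fill the gap.

Result: The paper runs extensive experiments on open-source and proprietary models to identify limitations in MLLM spatial reasoning, but the abstract gives no specific quantitative results or comparison with the state of the art.

Insight: The novelty is a comprehensive, fine-grained benchmark for spatial reasoning on Earth imagery. It is distinctive in combining visual cues with vector geometry coordinates and in supporting complex geometries such as polygons and polylines as well as compositional aggregate queries, which provides a new tool for evaluating and improving MLLMs' precise spatial understanding in real geographic settings.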

Abstract: Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.


[40] A Study on Real-time Object Detection using Deep Learning cs.CV | cs.LGPDF

Ankita Bose, Jayasravani Bhumireddy, Naveen N

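TL;DR: This survey covers deep learning for real-time object detection. It discusses in detail how deep learning algorithms (such as Faster R-CNN, YOLO, and SSD) improve real-time object recognition, and covers existing models, open benchmark datasets, application studies, and a comparative analysis of different strategies.

Details

Motivation: Real-time object detection is widely applied in many areas (such as human-computer interaction, security surveillance, and autonomous driving) and requires dynamic analysis of visual information to support immediate decisions. Advances in deep learning algorithms make more accurate and efficient solutions possible.

Result: The article compares different strategies through controlled studies and reports some illuminating findings, but it gives no specific quantitative results on particular benchmark datasets and does not say whether state-of-the-art performance is reached.

Insight: The contribution is a systematic survey of models, datasets, and applications in deep-learning-based real-time object detection. It also identifies future research challenges and directions, giving later work a comprehensive reference framework.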

Abstract: Object detection has compelling applications across a range of domains, including human-computer interfaces, security and video surveillance, navigation and road traffic monitoring, transportation systems, industrial automation, healthcare, Augmented Reality (AR) and Virtual Reality (VR), environment monitoring, and activity identification. Applications of real-time object detection in all these areas provide dynamic analysis of visual information that helps in immediate decision making. Furthermore, advanced deep learning algorithms have driven progress in object detection, providing more accurate and efficient solutions. Prominent deep learning algorithms for object detection include Faster R-CNN (Region-based Convolutional Neural Network), Mask R-CNN, Cascade R-CNN, YOLO (You Only Look Once), SSD (Single Shot Multibox Detector), and RetinaNet. This article describes in detail how deep learning algorithms are used to enhance real-time object recognition. It covers the available object detection models, open benchmark datasets, and studies on the use of object detection models in a range of applications. Additionally, controlled studies are provided to compare various strategies and produce some illuminating findings. Finally, a number of encouraging challenges and approaches are offered as suggestions for further investigation in both relevant deep learning approaches and object recognition.


[41] Visual Memory Injection Attacks for Multi-Turn Conversations cs.CV | cs.LGPDF

Christian Schlarmann, Matthias Hein

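TL;DR: This paper proposes a new Visual Memory Injection attack that targets the security of generative large vision-language models (LVLMs) in multi-turn conversations. The attacker uploads a manipulated image to the web. When a benign user downloads it and uses it in a multi-turn conversation with an LVLM, the model behaves normally on ordinary prompts, but once the user gives a specific triggering prompt it outputs a prescribed target message to manipulate the user.

Details

Motivation: The security of LVLMs in long-context multi-turn settings is underexplored. The paper explores the feasibility of attacks in this realistic scenario, for example for adversarial marketing or political persuasion.

Result: The attack is demonstrated on several recent open-weight LVLMs, showing that large-scale manipulation of users through perturbed images is feasible in multi-turn conversation settings.

Insight: The novelty is the first stealthy visual memory injection attack designed for multi-turn LVLM conversations. It exposes a security vulnerability in long-context interaction and calls for better LVLM robustness against such attacks.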

Abstract: Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario in which an attacker uploads a manipulated image to the web/social media. A benign user downloads this image and uses it as input to the LVLM. Our novel stealthy Visual Memory Injection (VMI) attack is designed such that on normal prompts the LVLM exhibits nominal behavior, but once the user gives a triggering prompt, the LVLM outputs a specific prescribed target message to manipulate the user, e.g. for adversarial marketing or political persuasion. Compared to previous work that focused on single-turn attacks, VMI is effective even after a long multi-turn conversation with the user. We demonstrate our attack on several recent open-weight LVLMs. This article thereby shows that large-scale manipulation of users is feasible with perturbed images in multi-turn conversation settings, calling for better robustness of LVLMs against these attacks. We release the source code at https://github.com/chs20/visual-memory-injection


[42] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families cs.CV | cs.LGPDF

Yuval Levental

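TL;DR: Through a simple experiment, this paper exposes a fundamental limitation of vision-language models (VLMs): they cannot accurately localize filled cells in binary grids when those cells lack textual identity. The experiment generates fifteen 15x15 grids, renders each as two image types (text symbols and plain filled squares), and tests the transcription ability of three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking). Models do well in the text-symbol condition but degrade sharply in the filled-squares condition, indicating that VLMs rely on a text-recognition pathway for spatial reasoning rather than on native visual processing.

Details

Motivation: The aim is to expose a basic limitation of VLM spatial reasoning: when visual elements lack textual identity, models struggle to localize non-textual filled cells, which reveals a deficit in their visual processing.

Result: In the text-symbol condition, Claude and ChatGPT reach about 91% cell accuracy and 84% F1, and Gemini reaches 84% accuracy and 63% F1. In the filled-squares condition, all models collapse to 60-73% accuracy and 29-39% F1, and the F1 gap between conditions is 34 to 54 points.

Insight: The novelty is a contrastive experiment showing that VLMs rely on a high-fidelity text-recognition pathway for spatial reasoning rather than on native visual ability. Viewed objectively, this exposes a spatial localization deficit for non-textual visual elements. The models fail in different ways (systematic under-counting, massive over-counting, and template hallucination) but share the same root cause.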

Abstract: We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types – text symbols (. and #) and filled squares without gridlines – then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder – the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway. Each model exhibits a distinct failure mode in the squares condition – systematic under-counting (Claude), massive over-counting (ChatGPT), and template hallucination (Gemini) – but all share the same underlying deficit: severely degraded spatial localization for non-textual visual elements.
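
The gap between cell accuracy and F1 in these numbers follows from the grids being sparse: most cells are empty, so a transcription can agree on most cells while still misplacing most of the filled ones. The sketch below computes both metrics for a toy 4x4 grid, assuming F1 is taken over the filled class, which the abstract does not state explicitly. The function name and the example grids are hypothetical.

```python
def cell_metrics(truth, predicted):
    """Cell accuracy and F1 (filled cells as the positive class) for binary grids."""
    pairs = [(t, p) for row_t, row_p in zip(truth, predicted)
             for t, p in zip(row_t, row_p)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1

truth = [[1, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
# One filled cell found, one misplaced: 14 of 16 cells still match.
predicted = [[1, 0, 0, 0],
             [0, 0, 1, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
accuracy, f1 = cell_metrics(truth, predicted)
```

Here accuracy is 0.875 while F1 is only 0.5, the same qualitative pattern as the 60-73% accuracy versus 29-39% F1 reported for the filled-squares condition.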


[43] Non-Contact Physiological Monitoring in Pediatric Intensive Care Units via Adaptive Masking and Self-Supervised Learning cs.CVPDF

Mohamed Khalil Ben Salah, Philippe Jouvet, Rita Noumeir

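TL;DR: This paper proposes a non-contact physiological monitoring framework based on adaptive masking and self-supervised learning for heart rate estimation in the Pediatric Intensive Care Unit (PICU). It uses a progressive curriculum strategy, combines the VisionMamba architecture with an adaptive masking mechanism, and exploits unlabeled clinical data through teacher-student distillation, substantially improving remote photoplethysmography (rPPG) performance in complex clinical environments.

Details

Motivation: Contact-based sensors in the PICU (such as pulse oximeters) may cause skin irritation, increased infection risk, and patient discomfort. Conventional rPPG methods, meanwhile, are limited in clinical settings by motion artifacts, occlusions, lighting variation, and domain shift.

Result: In the PICU setting, the method reduces mean absolute error (MAE) by 42% relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching a final MAE of 3.2 bpm. It is also robust under clinical occlusions and noise.

Insight: The novelties include: a VisionMamba-based adaptive masking mechanism in which a lightweight Mamba controller assigns spatiotemporal importance scores to guide probabilistic patch sampling; a progressive curriculum strategy combined with teacher-student distillation that makes effective use of unlabeled clinical data; and, with no explicit region-of-interest extraction, a model that attends to pulse-rich regions on its own, which improves clinical practicality.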

Abstract: Continuous monitoring of vital signs in Pediatric Intensive Care Units (PICUs) is essential for early detection of clinical deterioration and effective clinical decision-making. However, contact-based sensors such as pulse oximeters may cause skin irritation, increase infection risk, and lead to patient discomfort. Remote photoplethysmography (rPPG) offers a contactless alternative to monitor heart rate using facial video, but remains underutilized in PICUs due to motion artifacts, occlusions, variable lighting, and domain shifts between laboratory and clinical data. We introduce a self-supervised pretraining framework for rPPG estimation in the PICU setting, based on a progressive curriculum strategy. The approach leverages the VisionMamba architecture and integrates an adaptive masking mechanism, where a lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling. This strategy dynamically increases reconstruction difficulty while preserving physiological relevance. To address the lack of labeled clinical data, we adopt a teacher-student distillation setup. A supervised expert model, trained on public datasets, provides latent physiological guidance to the student. The curriculum progresses through three stages: clean public videos, synthetic occlusion scenarios, and unlabeled videos from 500 pediatric patients. Our framework achieves a 42% reduction in mean absolute error relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching a final MAE of 3.2 bpm. Without explicit region-of-interest extraction, the model consistently attends to pulse-rich areas and demonstrates robustness under clinical occlusions and noise.


[44] LAND: A Longitudinal Analysis of Neuromorphic Datasets cs.CV | cs.DBPDF

Gregory Cohen, Alexandre Marcireau

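TL;DR: This paper presents a longitudinal analysis of neuromorphic datasets. It reviews over 423 existing datasets, examines the nature of their tasks, their data structure, standardisation problems, and access difficulties, and pays particular attention to the growth of synthetic datasets and its potential impact.

Details

Motivation: The paper addresses the data problem in neuromorphic engineering: datasets are hard to find, understand, and use, and standardisation is lacking, which hinders further research on and application of neuromorphic technologies.

Result: By analysing a large number of datasets, the paper reveals challenges arising from dataset size, missing standardisation, and access difficulties, and notes that the growth of synthetic datasets may introduce bias.

Insight: The novelty is the concept of meta-datasets, built from existing datasets to reduce the need for new data and to remove bias in task definition, together with a discussion of the pros and cons of synthetic data for algorithm testing.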

Abstract: Neuromorphic engineering has a data problem. Despite the meteoric rise in the number of neuromorphic datasets published over the past ten years, the conclusion of a significant portion of neuromorphic research papers still states that there is a need for yet more data and even larger datasets. Whilst this need is driven in part by the sheer volume of data required by modern deep learning approaches, it is also fuelled by the current state of the available neuromorphic datasets and the difficulties in finding them, understanding their purpose, and determining the nature of their underlying task. This is further compounded by practical difficulties in downloading and using these datasets. This review starts by capturing a snapshot of the existing neuromorphic datasets, covering over 423 datasets, and then explores the nature of their tasks and the underlying structure of the presented data. Analysing these datasets shows the difficulties arising from their size, the lack of standardisation, and difficulties in accessing the actual data. This paper also highlights the growth in the size of individual datasets and the complexities involved in working with the data. However, a more important concern is the rise of synthetic datasets, created by either simulation or video-to-events methods. This review explores the benefits of simulated data for testing existing algorithms and applications, highlighting the potential pitfalls for exploring new applications of neuromorphic technologies. This review also introduces the concepts of meta-datasets, created from existing datasets, as a way of both reducing the need for more data, and to remove potential bias arising from defining both the dataset and the task.


[45] BTReport: A Framework for Brain Tumor Radiology Report Generation with Clinically Relevant Features cs.CVPDF

Juampablo E. Heras Rivera, Dickson T. Chen, Tianyi Ren, Daniel K. Low, Asma Ben Abacha

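TL;DR: This paper presents BTReport, a framework for generating brain tumor radiology reports that deterministically extracts imaging features and uses a large language model to produce structured reports. It also releases the BTReport-BraTS dataset.

Details

Motivation: Neuro-oncology lacks open paired image-report datasets, and existing methods that rely on large vision-language models suffer from hallucination and poor interpretability. The paper addresses both problems.

Result: On radiology report generation, BTReport's reports are closer to reference clinical reports than those of existing baselines, and the extracted features predict key clinical outcomes such as survival and IDH mutation status.

Insight: Splitting radiology report generation into deterministic feature extraction followed by report generation improves interpretability and reliability, and the synthetic report dataset supports further research in the field.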

Abstract: Recent advances in radiology report generation (RRG) have been driven by large paired image-text datasets; however, progress in neuro-oncology has been limited due to a lack of open paired image-report datasets. Here, we introduce BTReport, an open-source framework for brain tumor RRG that constructs natural language radiology reports using deterministically extracted imaging features. Unlike existing approaches that rely on large general-purpose or fine-tuned vision-language models for both image interpretation and report composition, BTReport performs deterministic feature extraction for image analysis and uses large language models only for syntactic structuring and narrative formatting. By separating RRG into a deterministic feature extraction step and a report generation step, the generated reports are completely interpretable and less prone to hallucinations. We show that the features used for report generation are predictive of key clinical outcomes, including survival and IDH mutation status, and reports generated by BTReport are more closely aligned with reference clinical reports than existing baselines for RRG. Finally, we introduce BTReport-BraTS, a companion dataset that augments BraTS imaging with synthetically generated radiology reports produced with BTReport. Code for this project can be found at https://github.com/KurtLabUW/BTReport.


[46] MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval cs.CV | cs.AIPDF

Ahmad Elallaf, Yu Zhang, Yuktha Priya Masupalli, Jeong Yang, Young Lee

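TL;DR: This paper proposes MedProbCLIP, a probabilistic vision-language learning framework for representation learning and bidirectional retrieval over chest X-rays and radiology reports. A probabilistic contrastive objective models image and text representations as Gaussian embeddings, explicitly capturing uncertainty and the many-to-many correspondences between radiographs and clinical narratives, and a variational information bottleneck mitigates overconfident predictions. On the MIMIC-CXR dataset, MedProbCLIP outperforms existing deterministic and probabilistic baselines in retrieval and zero-shot classification.

Details

Motivation: The deterministic embeddings of existing vision-language foundation models often fall short of the reliability required for high-stakes biomedical applications, so a probabilistic framework is needed that explicitly models uncertainty and handles many-to-many correspondences between images and reports.

Result: On the MIMIC-CXR dataset, MedProbCLIP surpasses deterministic and probabilistic baselines including CLIP, CXR-CLIP, and PCME++ in retrieval and zero-shot classification, and shows better calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions.

Insight: The novelties are: modelling vision-language representations as probability distributions (Gaussian embeddings), with a probabilistic contrastive loss that explicitly captures uncertainty and many-to-many correspondence; a variational information bottleneck that mitigates overconfident predictions; and multi-view radiograph encoding with multi-section report encoding during training for fine-grained supervision, while inference needs only a single radiograph and a single report. Together these improve clinical trustworthiness and safety.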

Abstract: Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.


[47] OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis cs.CV | cs.AIPDF

Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie

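TL;DR: This paper presents OmniCT, a unified slice-volume large vision-language model for CT image analysis. With Spatial Consistency Enhancement and Organ-level Semantic Enhancement modules, it bridges the divide between slice and volume understanding in existing methods, and with the large-scale MedEval-CT dataset it provides unified evaluation and improved performance on diverse clinical tasks.

Details

Motivation: Existing large vision-language models for CT analysis are split between slice-driven and volume-driven approaches: slice models lack cross-slice spatial consistency, while volume models have coarse granularity and poor compatibility with slice inputs. This split impedes the clinical translation of medical LVLMs.

Result: OmniCT outperforms existing methods by a substantial margin across diverse clinical tasks, supports unified evaluation on the MedEval-CT benchmark, and satisfies both micro-level detail sensitivity and macro-level spatial reasoning.

Insight: The novelty is a unified slice-volume modeling paradigm. Spatial Consistency Enhancement (volumetric slice composition with tri-axial positional embedding) and Organ-level Semantic Enhancement (segmentation and ROI localization that explicitly align anatomical regions) fuse local detail with global spatial information, and a large-scale hybrid benchmark dataset is built for evaluation.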

Abstract: Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding that introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and hybrid benchmark integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods with a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.


[48] CHAI: CacHe Attention Inference for text2video cs.CV | cs.LGPDF

Joel Mathew Cherian, Ashutosh Muralidhara Bharadwaj, Vima Gupta, Anand Padmanabha Iyer

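TL;DR: This paper proposes CHAI (CacHe Attention Inference) to speed up inference in text-to-video diffusion models. It introduces a Cache Attention mechanism that reuses information about shared objects and scenes across cross-inference latents, preserving video quality with fewer denoising steps (as few as 8).

Details

Motivation: Text-to-video diffusion models are slow at inference because 3D latents must be denoised sequentially. Existing acceleration methods either require expensive model retraining or rely on heuristic step skipping, which struggles to maintain video quality as the number of denoising steps decreases.

Result: Experiments show that Cache Attention can generate high-quality videos with only 8 denoising steps. Integrated into the full system, CHAI is 1.65x to 3.35x faster than the baseline OpenSora 1.2 while maintaining video quality.

Insight: The novelty is Cache Attention, which selectively attends to shared semantic content in cross-inference latents so that latents can be reused effectively. This substantially speeds up inference without retraining while maintaining quality, and offers a new direction for efficient diffusion model inference.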

Abstract: Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.


[49] IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models cs.CVPDF

Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein

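TL;DR: This paper proposes IRIS, a novel training-free method that uses real-time eye-tracking data to resolve ambiguity in open-ended visual question answering (VQA). A user study with 500 unique image-question pairs finds that the fixations closest to the moment participants begin asking their question aloud are the most effective for disambiguation in large vision-language models (VLMs), raising answer accuracy on ambiguous questions from 35.2% to 77.2% while maintaining performance on unambiguous questions. Evaluation across several state-of-the-art VLMs shows consistent improvements, and the authors release a new benchmark dataset, a real-time interactive protocol, and an evaluation suite.

Details

Motivation: Open-ended VQA suffers from ambiguity when the image or question is unclear. Conventional approaches may rely on additional training or complex models, whereas IRIS aims for training-free intent resolution using real-time eye-tracking data.

Result: On ambiguous image-question pairs, IRIS raises answer accuracy from 35.2% to 77.2% and improves several SOTA VLMs regardless of architecture, while maintaining performance on unambiguous queries.

Insight: The novelty is the use of real-time eye-tracking data, specifically the fixations at the onset of the spoken question, as an external disambiguation signal, which makes the method training-free and model-agnostic. Viewed objectively, this brings real-time feedback about human intent into VQA systems, may reduce dependence on large-scale annotated data, and makes human-computer interaction more natural.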

Abstract: We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.
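The finding that fixations closest to speech onset are most informative suggests a simple selection step, sketched below. The abstract does not describe the data format or how the selected gaze point is passed to the VLM, so the `Fixation` record, its fields, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    """Hypothetical fixation record: time span in seconds, position in image coordinates."""
    t_start: float
    t_end: float
    x: float
    y: float


def time_distance(fixation, onset_s):
    """Temporal distance from a fixation to the speech onset (0 if it spans the onset)."""
    if fixation.t_start <= onset_s <= fixation.t_end:
        return 0.0
    return min(abs(fixation.t_start - onset_s), abs(fixation.t_end - onset_s))


def closest_fixation(fixations, onset_s):
    """Pick the fixation nearest in time to the moment the user starts speaking."""
    if not fixations:
        raise ValueError("no fixations recorded")
    return min(fixations, key=lambda f: time_distance(f, onset_s))
```

The selected fixation's `(x, y)` could then be supplied to the model as a region hint, for example as a marked point or a crop; which form works best is not stated in the abstract.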


[50] Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing cs.CVPDF

Huichan Seo, Minki Hong, Sieun Choi, Jihie Kim, Jean Oh

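TL;DR: This paper studies demographic bias in instruction-guided image-to-image (I2I) editing and finds that identical edit instructions yield systematically different results depending on the subject's demographic attributes (race, gender, age). It formalizes two failure modes, Soft Erasure and Stereotype Replacement, and introduces a controlled benchmark for evaluating multiple editors. Identity preservation failures prove pervasive and unevenly distributed, revealing asymmetric identity priors in the editors. Finally, the paper proposes a prompt-level identity constraint that requires no model updates and effectively reduces demographic change for minority groups.

Details

Motivation: Demographic bias in text-to-image (T2I) generation is well studied, but demographic-conditioned failure modes in instruction-guided I2I editing remain underexplored. The paper asks whether identical edit instructions produce systematically different outcomes for subjects with different demographic attributes, in order to expose potential bias in I2I editing systems.

Result: Multiple editors are evaluated on the proposed controlled benchmark using vision-language model (VLM) scoring and human evaluation. Identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors such as occupation-driven gender inference. The proposed prompt-level identity constraint, without model updates, substantially reduces demographic change for minority groups while leaving majority-group portraits largely unchanged.

Insight: The paper formalizes two new demographic-bias failure modes in I2I editing (Soft Erasure and Stereotype Replacement) and builds a diagnostic benchmark to quantify them. Viewed objectively, the study reveals asymmetric identity priors in current I2I editors and shows that simple prompt engineering (an identity constraint) can mitigate the bias, which suggests a path toward demographically robust editing systems.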

Abstract: Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems. Project page: https://seochan99.github.io/i2i-demographic-bias


[51] Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking cs.CVPDF

Patrick Poggi, Divake Kumar, Theja Tulabandhula, Amit Ranjan Trivedi

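TL;DR: This paper proposes UncL-STARK, a dynamic depth adaptation method for transformer-based single-object trackers. Guided by uncertainty, it adjusts encoder and decoder depth at inference time according to prediction confidence, reducing compute, latency, and energy consumption while maintaining tracking accuracy.

Details

Motivation: Transformer-based single-object trackers reach SOTA accuracy but use fixed-depth inference, running the full encoder-decoder stack on every frame. On long video sequences with strong temporal coherence, this incurs unnecessary computation.

Result: On GOT-10k and LaSOT, the method achieves up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings, while tracking accuracy stays within 0.2% of the full-depth baseline.

Insight: The novelty is an architecture-preserving dynamic depth adaptation strategy. The model is fine-tuned with random-depth training and knowledge distillation, a lightweight uncertainty estimate is derived from the model's own corner localization heatmaps, and a feedback-driven policy selects the depth at runtime, exploiting the temporal coherence of video.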

Abstract: Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder–decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model’s corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
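The runtime loop described above (estimate uncertainty from a localization heatmap, then choose the depth for the next frame) can be sketched as follows. The abstract does not give the exact uncertainty estimator or policy, so this sketch assumes normalized Shannon entropy as the uncertainty measure and a simple threshold policy; `heatmap_uncertainty`, `next_frame_depth`, and the depth and threshold values are hypothetical.

```python
import math


def heatmap_uncertainty(heatmap):
    """Normalized Shannon entropy of a non-negative 2D heatmap, in [0, 1].

    A sharply peaked heatmap gives a value near 0 (confident localization);
    a flat heatmap gives a value near 1 (uncertain localization).
    """
    flat = [v for row in heatmap for v in row]
    if len(flat) < 2 or any(v < 0 for v in flat):
        raise ValueError("need at least two non-negative cells")
    total = sum(flat)
    if total <= 0:
        raise ValueError("heatmap must have positive mass")
    probs = [v / total for v in flat]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(0.0, entropy / math.log(len(flat)))


def next_frame_depth(uncertainty, depths=(2, 4, 6), thresholds=(0.3, 0.6)):
    """Assumed threshold policy: low uncertainty selects a shallow stack."""
    for depth, limit in zip(depths, thresholds):
        if uncertainty < limit:
            return depth
    return depths[-1]
```

Because consecutive frames are usually similar, the confidence on frame t is used as a proxy for how much depth frame t+1 needs; a drop in confidence pushes the next frame back to full depth.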


[52] DataCube: A Video Retrieval Platform via Natural Language Semantic Profiling cs.CVPDF

Yiming Ju, Hanyu Zhao, Quanyue Ma, Donglin Hao, Chengwei Wu

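TL;DR: DataCube is a video retrieval platform based on natural language semantic profiling. It automatically processes videos, annotates them along multiple semantic dimensions, and supports query-driven retrieval. The system builds semantic representations of video clips and performs hybrid retrieval that combines neural re-ranking with deep semantic matching, helping users efficiently construct customized video subsets from large repositories for training, analysis, and evaluation.

Details

Motivation: Turning raw videos from large-scale repositories into high-quality, task-specific datasets is costly and inefficient.

Result: The abstract reports no quantitative results or benchmarks, but the authors provide a publicly accessible platform and a demo video.

Insight: The novelty is the combination of automatic video processing, structured semantic representations, and hybrid retrieval. An interactive web interface lets users build custom video subsets and searchable systems over private video collections, improving the efficiency and flexibility of video data management.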

Abstract: Large-scale video repositories are increasingly available for modern video understanding and generation tasks. However, transforming raw videos into high-quality, task-specific datasets remains costly and inefficient. We present DataCube, an intelligent platform for automatic video processing, multi-dimensional profiling, and query-driven retrieval. DataCube constructs structured semantic representations of video clips and supports hybrid retrieval with neural re-ranking and deep semantic matching. Through an interactive web interface, users can efficiently construct customized video subsets from massive repositories for training, analysis, and evaluation, and build searchable systems over their own private video collections. The system is publicly accessible at https://datacube.baai.ac.cn/. Demo Video: https://baai-data-cube.ks3-cn-beijing.ksyuncs.com/custom/Adobe%20Express%20-%202%E6%9C%8818%E6%97%A5%20%281%29%281%29%20%281%29.mp4


[53] HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis cs.CVPDF

J. Dhar, M. K. Pandey, D. Chakladar, M. Haghighat, A. Alavi

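TL;DR: This paper proposes HyPCA-Net, a new multimodal fusion network for medical image analysis. It has two core modules: a computationally efficient residual adaptive learning attention block that captures refined modality-specific representations, and a dual-view cascaded attention block that learns robust shared representations across modalities. Experiments on ten public datasets show that it significantly outperforms leading methods in both performance and computational efficiency.

Details

Motivation: Existing multimodal fusion methods for medical images have two main problems. First, their high computational cost limits their use in resource-constrained settings. Second, they often use cascaded attention modules, which can lose information between modules and struggle to capture robust shared representations across modalities, limiting generalization in multi-disease analysis tasks.

Result: Extensive experiments on ten public datasets show that HyPCA-Net significantly outperforms leading methods, with performance improvements of up to 5.2% and computational cost reductions of up to 73.1%.

Insight: The novelty is a hybrid parallel-fusion cascaded attention architecture built around two new modules that separately optimize modality-specific and cross-modal shared representation learning. Viewed objectively, pairing computational efficiency with representation robustness, and using dual-view cascaded attention to reduce information loss, is a useful reference for resource-sensitive multimodal fusion in medical imaging.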

Abstract: Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase the risk of information loss during inter-module transitions and hinder their capacity to effectively capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two core novel blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets show that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: https://github.com/misti1203/HyPCA-Net.


[54] AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards cs.CVPDF

David Smerkous, Zian Wang, Behzad Najafian

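TL;DR: AFFMAE is an efficient vision pretraining framework designed for desktop graphics cards. Through adaptive off-grid token merging and discarding masked tokens, it resolves the computational and structural challenges of combining MAE with hierarchical downsampling architectures, enabling high-resolution training with much lower compute and memory use at comparable performance.

Details

Motivation: High-resolution self-supervised pretraining typically requires server-scale infrastructure, and combining MAE with hierarchical architectures is structurally challenging because of dense grid priors and mask-aware design compromises.

Result: On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and training faster on a single RTX 5090.

Insight: Contributions include a masking-friendly hierarchical framework built on adaptive off-grid token merging, discarding masked tokens to remove dense-grid assumptions, numerically stable mixed-precision Flash-style cluster attention kernels, and deep supervision to mitigate sparse-stage representation collapse. Together they offer an efficient route to foundation-model development in resource-constrained settings.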

Abstract: Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We develop numerically stable mixed-precision Flash-style cluster attention kernels, and mitigate sparse-stage representation collapse via deep supervision. On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and achieving faster training on a single RTX 5090. Code available at https://github.com/najafian-lab/affmae.


[55] Breaking the Sub-Millimeter Barrier: Eyeframe Acquisition from Color Images cs.CVPDF

Manel Guzmán, Antonio Agudo

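TL;DR: This paper proposes a computer-vision method for acquiring eyeglass frame contours from color images with high precision, replacing the conventional frame-tracing workflow, which depends on mechanical tools, is time-consuming, and requires dedicated equipment. Using multi-view information, the pipeline performs image acquisition, frame segmentation, depth estimation, and multi-view processing to obtain sub-millimeter measurements from still color images.

Details

Motivation: Conventional frame tracing relies on mechanical tools that need precise positioning and calibration. The process is time-consuming and requires extra equipment, which makes opticians' workflows inefficient. The paper aims to achieve high-precision frame measurement from color images alone using artificial vision, simplifying the workflow and removing the need for dedicated tracing equipment.

Result: Different configurations and variants are analyzed and evaluated on real data. The measurements obtained from still color images alone are competitive with other solutions and reach sub-millimeter precision.

Insight: The novelty is a complete vision pipeline that integrates image segmentation, depth estimation, and multi-view processing, fusing RGB images with depth data to measure 3D contours precisely from ordinary color images. This gives the optical industry an alternative that needs no specialized hardware and simplifies the workflow.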

Abstract: Eyeframe lens tracing is an important process in the optical industry that requires sub-millimeter precision to ensure proper lens fitting and optimal vision correction. Traditional frame tracers rely on mechanical tools that need precise positioning and calibration, which are time-consuming and require additional equipment, creating an inefficient workflow for opticians. This work presents a novel approach based on artificial vision that utilizes multi-view information. The proposed algorithm operates on images captured from an InVision system. The full pipeline includes image acquisition, frame segmentation to isolate the eyeframe from background, depth estimation to obtain 3D spatial information, and multi-view processing that integrates segmented RGB images with depth data for precise frame contour measurement. To this end, different configurations and variants are proposed and analyzed on real data, providing competitive measurements from still color images with respect to other solutions, while eliminating the need for specialized tracing equipment and reducing workflow complexity for optical technicians.


[56] Parameter-Free Adaptive Multi-Scale Channel-Spatial Attention Aggregation framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired cs.CVPDF

Qi He, XiangXiang Wang, Jingtao Zhang, Yongbin Yu, Hongxiang Chu

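TL;DR: This paper proposes an Adaptive Multi-scale Attention Aggregation (AMAA) framework for 3D indoor semantic scene completion, aiming to improve the structural coherence and semantic consistency of indoor assistive perception under monocular vision. Built on the MonoScene pipeline, it calibrates voxel features through parallel channel-spatial attention aggregation and stabilizes multi-scale fusion with a hierarchical adaptive feature-gating strategy, mitigating projection diffusion and feature entanglement.

Details

Motivation: Existing monocular semantic scene completion methods lack explicit modeling of voxel-feature reliability and cross-scale information propagation during 2D-3D projection and multi-scale fusion. This limits structural stability and falls short of the safety-critical scene understanding needed for indoor assistive perception for visually impaired users.

Result: On the NYUv2 benchmark, AMAA improves consistently over MonoScene, reaching 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59) without significantly increasing system complexity. Deployment on an NVIDIA Jetson platform verifies that it can run on embedded hardware.

Insight: Contributions include joint semantic and spatial calibration of voxel features through parallel channel-spatial attention aggregation, and a hierarchical adaptive feature-gating strategy that regulates cross-scale information injection. Together they improve feature reliability and fusion stability within a lightweight framework, yielding a deployable perception solution for embedded assistive systems.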

Abstract: In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural stability. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.


[57] ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding cs.CVPDF

Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura

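TL;DR: This paper proposes ReMoRa, a multimodal large language model for long-video understanding based on a refined motion representation. It operates directly on the compressed representation of a video to avoid the computational burden of processing the full RGB frame sequence, keeping sparse keyframes for appearance and encoding temporal dynamics as a motion representation, so that cost scales linearly with sequence length.

Details

Motivation: Multimodal large language models struggle with long-video understanding because processing the full RGB frame sequence is computationally intractable and highly redundant, given that self-attention has quadratic complexity in sequence length.

Result: On several long-video understanding benchmarks, including LongVideoBench, NExT-QA, and MLVU, ReMoRa outperforms the baseline methods.

Insight: Contributions include using the motion representation from the compressed video as a compact proxy for optical flow to capture temporal dynamics, and a denoising module that produces a fine-grained motion representation to improve the quality of block-based motion. Viewed objectively, separating appearance from motion and compressing features linearly gives an efficient, scalable approach to long-video processing.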

Abstract: While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To refine the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.


[58] Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems cs.CV | cs.AIPDF

Ali Faraz, Raja Kolla, Ashish Kulkarni, Shubham Agarwal

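TL;DR: This paper studies strategies for designing production-scale OCR systems for India, presenting the Chitrapathak series of multilingual OCR models and the Parichay series of domain-specific OCR models. Comparing two training strategies (end-to-end training versus fine-tuning an existing OCR model), it finds that fine-tuning gives the better accuracy-latency trade-off. Chitrapathak-2 is SOTA in Telugu and second best in the other languages while achieving a 3-6x speedup; Parichay achieves an 89.8% Exact Match score on 9 types of Indian government documents.

Details

Motivation: OCR systems for India must cope with linguistic diversity, document heterogeneity, and deployment constraints. The paper aims to build efficient, practical, production-scale multilingual OCR systems under these conditions.

Result: On multilingual Indic OCR benchmarks and deployment-oriented metrics, the fine-tuning strategy (Chitrapathak-2) is SOTA in Telugu (6.69 char ANLS), second best in the other languages, and 3-6x faster. Parichay achieves an 89.8% Exact Match score on government documents with faster inference.

Insight: Contributions include a comparison of two multilingual OCR training strategies, showing that fine-tuning an existing model beats end-to-end training on the accuracy-latency trade-off, and an independent OCR model series built specifically for Indian government documents that extracts structured fields with high accuracy. Viewed objectively, the work provides a practical framework for deploying multilingual, multi-domain OCR in production.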

Abstract: Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, despite it not being trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves 3-6x speedup over its predecessor while being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving 89.8% Exact Match score with faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.


[59] Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing cs.CVPDF

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang

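TL;DR: This paper proposes a new paradigm called Visual Self-Refine (VSR) to address visual perception errors that Large Vision-Language Models (LVLMs) make on visually dense tasks such as chart parsing. The model generates pixel-level localization outputs, visualizes them, and feeds the visualizations back to itself, so that it can intuitively inspect and correct its own errors. The authors instantiate the paradigm for chart parsing as ChartVSR, which splits parsing into a Refine Stage (iteratively using visual feedback to ensure accurate pixel-level localization of data points) and a Decode Stage (using the verified localizations as precise visual anchors to parse the final structured data). They also construct ChartP-Bench, a new and highly challenging benchmark.

Details

Motivation: LVLMs are strong at text-level reasoning and self-correction, but these strengths help little on complex tasks centered on visual perception, such as chart parsing. Existing models handling visually dense charts are prone to data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" when reading complex charts, the paper targets visual perception accuracy in chart parsing.

Result: The paper builds the new, highly challenging ChartP-Bench and evaluates ChartVSR on it. The abstract gives no specific quantitative results (such as accuracy gains), but it indicates that the method improves accuracy and presents VSR as a general-purpose visual feedback mechanism that opens a new direction for improving accuracy on a wide range of vision-centric tasks.

Insight: The core contribution is the VSR paradigm, a pixel-guided self-correction mechanism driven by visual feedback. It visualizes the model's internal pixel-level localization outputs and feeds them back as external input, mimicking the human "visual anchor" strategy and extending the model's text-level self-correction ability to visual perception. The two-stage design (Refine, then Decode), which decouples localization from recognition and secures localization accuracy through iterative visual feedback, offers a general framework for reducing hallucination and errors in visually dense tasks.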

Abstract: While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a “visual anchor” to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points’ Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.


[60] MMA: Multimodal Memory Agent cs.CVPDF

Yihao Lu, Wanru Cheng, Zeyu Zhang, Hao Tang

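TL;DR: This paper proposes MMA (Multimodal Memory Agent), a multimodal agent that improves external memory retrieval with a dynamic reliability scoring mechanism. It targets the overconfident errors that arise when conventional similarity-based retrieval surfaces stale, low-credibility, or conflicting information. The paper also introduces the MMA-Bench benchmark for evaluating belief dynamics and reveals a "Visual Placebo Effect" in RAG agents.

Details

Motivation: Long-horizon multimodal agents depend on external memory, but similarity-based retrieval tends to return outdated, low-credibility, or mutually conflicting memory items, which leads agents to make overconfident wrong decisions.

Result: On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility. On LoCoMo, its safety-oriented configuration improves actionable accuracy and reduces wrong answers. On the authors' MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol.

Insight: The main contribution is assigning each retrieved memory item a dynamic reliability score that combines source credibility, temporal decay, and conflict-aware network consensus, then using that signal to reweight evidence or abstain when support is insufficient. In addition, MMA-Bench exposes how RAG-based agents inherit latent visual biases from foundation models (the "Visual Placebo Effect").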

Abstract: Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the “Visual Placebo Effect”, revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
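The reliability-weighted retrieval described above can be sketched in a few lines. The abstract names the three ingredients (source credibility, temporal decay, consensus) but not how they are combined, so this sketch assumes a simple product with exponential half-life decay; `MemoryItem`, `reliability`, `answer_or_abstain`, and the numeric defaults are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """Hypothetical retrieved memory item; all scores are assumed to lie in [0, 1]."""
    claim: str
    similarity: float   # retrieval similarity to the query
    credibility: float  # trust in the item's source
    age_days: float     # time since the item was stored
    consensus: float    # agreement with other retrieved items


def reliability(item, half_life_days=30.0):
    """Assumed combination: credibility x temporal decay x consensus."""
    decay = 0.5 ** (item.age_days / half_life_days)
    return item.credibility * decay * item.consensus


def answer_or_abstain(items, min_support=0.2):
    """Reweight evidence by reliability; return None (abstain) if support is too weak."""
    support = {}
    for item in items:
        weight = item.similarity * reliability(item)
        support[item.claim] = support.get(item.claim, 0.0) + weight
    if not support:
        return None
    best = max(support, key=support.get)
    return best if support[best] >= min_support else None
```

With this weighting, a stale item from a low-credibility source contributes little even when its retrieval similarity is high, which is the failure case pure similarity ranking cannot handle.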


[61] Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection cs.CVPDF

Alexis Winter, Jean-Vincent Martini, Romaric Audigier, Angelique Loesch, Bertrand Luvison

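TL;DR: This paper addresses the adversarial robustness of object detection models. It proposes a unified benchmark framework for fairly comparing attack methods and studies both the transferability of attacks between CNN and Vision Transformer architectures and the most effective adversarial training strategy.

Details

Motivation: The sensitivity of object detection models to adversarial attacks is a security risk. Because evaluation is not standardized (different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost), existing attacks and defenses are hard to compare fairly, which slows progress on defenses.

Result: Experiments show that modern adversarial attacks transfer poorly to transformer-based architectures. Adversarial training on a dataset that mixes high-perturbation attacks with different objectives (e.g., spatial and semantic) is more robust than training on any single attack.

Insight: The novelty is a unified benchmark framework focused on digital, non-patch-based attacks, with specific metrics that separate localization errors from classification errors and multiple perceptual metrics for attack cost. Viewed objectively, the framework is a useful tool for standardized evaluation, and the findings on attack transferability and mixed-attack training offer practical guidance.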

Abstract: Object detection models are critical components of automated systems, such as autonomous vehicles and perception-based robots, but their sensitivity to adversarial attacks poses a serious security risk. Progress in defending these models lags behind classification, hindered by a lack of standardized evaluation. It is nearly impossible to thoroughly compare attack or defense methods, as existing work uses different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost. This paper addresses this gap by investigating three key questions: (1) How can we create a fair benchmark to impartially compare attacks? (2) How well do modern attacks transfer across different architectures, especially from Convolutional Neural Networks to Vision Transformers? (3) What is the most effective adversarial training strategy for robust defense? To answer these, we first propose a unified benchmark framework focused on digital, non-patch-based attacks. This framework introduces specific metrics to disentangle localization and classification errors and evaluates attack cost using multiple perceptual metrics. Using this benchmark, we conduct extensive experiments on state-of-the-art attacks and a wide range of detectors. Our findings reveal two major conclusions: first, modern adversarial attacks against object detection models show a significant lack of transferability to transformer-based architectures. Second, we demonstrate that the most robust adversarial training strategy leverages a dataset composed of a mix of high-perturbation attacks with different objectives (e.g., spatial and semantic), which outperforms training on any single attack.


[62] DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images cs.CVPDF

Zeng Tao, Ying Jiang, Yunuo Chen, Tianyi Xie, Huamin Wang

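TL;DR: This paper proposes DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. The method uses vision-language models (VLMs) to normalize pose variations at the image level, extracts pose-aware, 3D-informed garment features, and fuses them with a transformer-based encoder to predict sewing pattern parameters, which can be applied directly to physical simulation, texture synthesis, and multi-layer virtual try-on.

Details

Motivation: Existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based methods are computationally expensive and hard to scale. The paper aims to provide an efficient, scalable sewing pattern generation solution for garment modeling and fabrication applications that need editable, separable, and simulation-ready garments.

Result: Extensive experiments show that the method robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.

Insight: The novelty is combining VLM-based pose normalization with 3D-informed feature extraction to predict sewing patterns, enabling feed-forward generation of simulation-ready garments from a single in-the-wild image and improving robustness and efficiency under diverse poses.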

Abstract: Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extract pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.


[63] Let’s Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding cs.CV | cs.LGPDF

Kaiting Liu, Hazel Doughty

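TL;DR: This paper introduces a new task called category splitting, in which an existing video classifier is edited to refine a coarse category into finer subcategories without retraining the model or collecting additional annotations. It proposes a zero-shot editing method that uses the latent compositional structure of video classifiers to expose fine-grained distinctions, and it shows that low-shot fine-tuning is effective.

Details

Motivation: Existing video recognition models are tied to fixed, coarse taxonomies and cannot adapt to newly emerging fine-grained task definitions. The paper aims to avoid the high cost of collecting new annotations and retraining whenever tasks evolve.

Result: On the paper's new video benchmarks for category splitting, the proposed method substantially outperforms vision-language baselines, improving accuracy on the newly split categories while preserving performance on the remaining categories.

Insight: The novelty is the task paradigm of zero-shot classifier editing, which splits fine-grained concepts by exploiting the internal compositional structure of a pretrained model. This offers an efficient, low-cost approach to model adaptation and continual learning.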

Abstract: Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category-Splitting/.


[64] A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification cs.CV | cs.AI | cs.LGPDF

Qi You, Yitai Cheng, Zichao Zeng, James Haworth

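TL;DR: This paper proposes CLIP-MHAdapter, a lightweight adapter within a contrastive learning framework. It applies multi-head self-attention to image patch tokens so that the pretrained vision-language model CLIP can better model fine-grained local features in complex street-view image attribute classification.

Details

Motivation: Existing CLIP-based adaptation and fine-tuning methods rely mainly on global image embeddings and struggle to capture the fine-grained local attributes that matter in complex, cluttered street scenes. Training from scratch, initializing from pretrained weights, or fine-tuning large models is also computationally expensive.

Result: Across eight attribute classification tasks on the Global StreetScapes dataset, CLIP-MHAdapter achieves superior or competitive accuracy with only about 1.4 million trainable parameters, setting new state-of-the-art results while keeping computational cost low.

Insight: The novelty is integrating multi-head self-attention into a lightweight bottleneck MLP adapter to model dependencies between image patches, making effective use of CLIP's pretrained knowledge while improving sensitivity to local features. This offers a new approach to efficiently adapting large pretrained models to specific fine-grained vision tasks.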

Abstract: Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.


[65] Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge cs.CVPDF

Jiaming Liu, Felix Petersen, Yunhe Gao, Yabin Zhang, Hyojin Kim

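TL;DR: This paper proposes the Self-Supervised Semantic Bridge (SSB), an unpaired image-to-image translation framework that integrates external semantic priors into diffusion bridge models to achieve spatially faithful translation without cross-domain supervision.

Details

Motivation: Existing adversarial diffusion and diffusion-inversion methods have limitations in unpaired image translation. Adversarial methods generalize poorly to unseen data, and diffusion-inversion methods yield low-fidelity translations because inversion into noise-latent representations is imperfect.

Result: On challenging medical image synthesis tasks, SSB outperforms strong prior baselines in both in-domain and out-of-domain settings, and it extends easily to high-quality text-guided editing.

Insight: The novelty is using self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridge and improves the spatial fidelity and generalization of translation.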

Abstract: Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.


[66] PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction cs.CV

Bo Lang, Nirav Savaliya, Zhihao Zheng, Jinglun Feng, Zheng-Hang Yeh

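TL;DR: This paper proposes PredMapNet, an end-to-end framework for online HD map construction that jointly performs map instance tracking and short-term prediction, addressing the temporal inconsistency and instability of existing methods when building a global map.

Details

Motivation: Existing query-based methods typically use random query initialization and rely on implicit temporal modeling, which causes temporal inconsistency and instability when constructing a global map. A method that explicitly exploits historical and future information is needed to improve consistency.

Result: Extensive experiments on the nuScenes and Argoverse2 datasets show that the method outperforms existing state-of-the-art methods with good efficiency.

Insight: The novel components are a Semantic-Aware Query Generator, a History Rasterized Map Memory, a History-Map Guidance Module, and a Short-Term Future Guidance module. By explicitly integrating historical and future information, they strengthen temporal continuity and the plausibility of predictions.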

Abstract: High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and keep temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.


[67] VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection cs.CV

Yingyuan Yang, Tian Lan, Yifei Gao, Yimeng Lu, Wenjun He

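TL;DR: VETime is a zero-shot time-series anomaly detection framework that unifies the temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. It addresses the trade-off in existing foundation models between point-anomaly localization and context-anomaly detection by introducing Reversible Image Conversion, Patch-Level Temporal Alignment, Anomaly Window Contrastive Learning, and Task-Adaptive Multi-Modal Fusion. In zero-shot scenarios it significantly outperforms existing methods at lower computational overhead.

Details

Motivation: Addresses a fundamental trade-off in existing foundation models for time-series anomaly detection: 1D temporal models localize point anomalies precisely but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks caused by missing temporal alignment and coarse-grained pointwise detection.

Result: In zero-shot scenarios, VETime significantly outperforms state-of-the-art models, achieving better localization precision with lower computational overhead than current vision-based approaches.

Insight: The novel elements are a shared visual-temporal timeline built through reversible image conversion and patch-level temporal alignment, together with anomaly window contrastive learning and task-adaptive multimodal fusion that adaptively integrate the complementary perceptual strengths of the two modalities. The result unifies fine-grained alignment with global context capture.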

Abstract: Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: https://github.com/yyyangcoder/VETime.


[68] Learning Situated Awareness in the Real World cs.CV

Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan

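TL;DR: This paper introduces SAW-Bench, a new benchmark built on real-world videos for evaluating egocentric situated awareness in multimodal foundation models. It contains 786 videos recorded with smart glasses and 2,071 human-annotated question-answer pairs, and it tests a model's understanding of observer-centric relationships across six tasks.

Details

Motivation: Most existing benchmarks for multimodal foundation models focus on environment-centric spatial relations and overlook observer-centric relationships that require reasoning from the agent's viewpoint, pose, and motion. A new benchmark is needed to close this gap.

Result: Comprehensive evaluation on SAW-Bench shows that even the best-performing multimodal foundation model, Gemini 3 Flash, still trails human performance by 37.66%. In-depth analysis finds that models can exploit partial geometric cues in egocentric videos but often fail to infer a coherent camera geometry, which leads to systematic spatial reasoning errors.

Insight: The novelty is SAW-Bench, the first benchmark focused on evaluating egocentric situated awareness in the real world. It emphasizes the shift from passive observation to understanding physically grounded, observer-centric dynamics, and offers a new direction for evaluating situated spatial intelligence in models.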

Abstract: A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent’s viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model’s observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.


[69] Are Object-Centric Representations Better At Compositional Generalization? cs.CV | cs.LG

Ferdinand Kapl, Amir Mohammad Karimi Mamaghan, Maximilian Seitzer, Karl Henrik Johansson, Carsten Marr

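TL;DR: By building a visual question answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C), this paper systematically evaluates how well vision encoders with and without object-centric biases generalize compositionally to unseen combinations of object properties. Object-centric methods perform better in harder compositional generalization settings; the original dense representations surpass them only in easier settings and typically need more downstream compute; and object-centric models are more sample efficient.

Details

Motivation: Compositional generalization is fundamental to human cognition and a core challenge for machine learning. Object-centric representations are often argued to support such generalization, but systematic evidence in visually rich settings is limited. This paper uses a fair, comprehensive benchmark to test empirically whether object-centric representations have an advantage in compositional generalization.

Result: Tests were run on a VQA benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, MOVi-C), with DINOv2 and SigLIP2 and their object-centric counterparts as foundation models. Object-centric methods perform better in harder compositional generalization settings; the original dense representations surpass them only in easier settings and typically require substantially more downstream compute; object-centric models are more sample efficient, achieving stronger generalization with fewer images.

Insight: The core contribution is a systematic, fair benchmark that quantifies how much object-centric representations contribute to compositional generalization. The key insight is that when any one of dataset size, training data diversity, or downstream compute is constrained, object-centric representations provide stronger compositional generalization, which offers useful guidance for model design, especially in resource-constrained settings.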

Abstract: Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.


[70] Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning cs.CV

Mingjia Shi, Yinhan He, Yaochen Zhu, Jundong Li

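TL;DR: This paper proposes Saliency-Aware Principle (SAP) selection to address two problems in vision-language model (VLM) reasoning: under-use of visual information and the accumulation of early visual grounding errors. SAP operates on high-level reasoning principles rather than token-level trajectories and supports multi-route inference, so visual evidence can be re-consulted when needed. It achieves more stable reasoning and lower response latency without additional training.

Details

Motivation: VLMs typically receive visual input only once, at the start of generation, while textual reasoning is generated autoregressively. Reasoning therefore becomes increasingly text-dominated and early visual grounding errors accumulate. In addition, guidance for visual grounding during inference is often coarse and noisy, which makes it hard to steer reasoning over long texts.

Result: Under comparable token-generation budgets, SAP achieves competitive performance, especially in reducing object hallucination, while yielding more stable reasoning and lower response latency than chain-of-thought (CoT) style long sequential reasoning.

Insight: The novelty is SAP, which operates on high-level reasoning principles and supports multi-route inference and dynamic re-consultation of visual evidence. It is model-agnostic and data-free, and it mitigates the degradation of visual information over long textual reasoning, improving the stability and efficiency of inference.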

Abstract: Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enables stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.


[71] TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos cs.CV

Namitha Padmanabhan, Matthew Gwilliam, Abhinav Shrivastava

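TL;DR: TeCoNeRV is a compressible neural representation method for video compression that uses temporal coherence to improve hypernetwork-based implicit neural representations. Through three key techniques (spatial-temporal decomposition, residual storage, and temporal coherence regularization), it substantially reduces memory overhead, shrinks the bitstream, and improves reconstruction quality.

Details

Motivation: Addresses the problems that existing hypernetwork-based implicit neural representation methods face in video compression: low quality, large compressed size, high memory requirements, and difficulty scaling to high-resolution video.

Result: On the UVG dataset, PSNR improves over the baseline by 2.47 dB at 480p and 5.35 dB at 720p, with 36% lower bitrate and 1.5-3x faster encoding. It is the first hypernetwork approach to report results at 480p, 720p, and 1080p on the UVG, HEVC, and MCL-JCV datasets.

Insight: The novel elements are decomposing the weight prediction task along spatial and temporal dimensions to reduce memory, a residual storage scheme that reduces the bitstream, and temporal coherence regularization that ties weight changes to video content. Together they balance compression efficiency, quality, and compute.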

Abstract: Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speeds, but with low quality, large compressed size, and prohibitive memory needs at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, by breaking short video segments into patch tubelets, to reduce the pretraining memory overhead by 20×; (2) a residual-based storage scheme that captures only differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in the weight space to be correlated with video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47dB and 5.35dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3× faster encoding speeds. With our low memory usage, we are the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV. Our project page is available at https://namithap10.github.io/teconerv/.
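
The residual-based storage idea in contribution (2) can be illustrated with a minimal sketch. This is not the TeCoNeRV implementation: the flat lists of floats standing in for per-segment weights, and the function names `encode_residuals` and `decode_residuals`, are assumptions made for illustration, and the quantization and entropy coding that a real codec would apply are omitted.

```python
from typing import List

Weights = List[float]  # hypothetical stand-in for one segment's predicted weights


def encode_residuals(segments: List[Weights]) -> List[Weights]:
    """Keep the first segment whole; store each later segment as a difference
    from the segment before it."""
    if not segments:
        return []
    encoded = [list(segments[0])]
    for prev, curr in zip(segments, segments[1:]):
        encoded.append([c - p for p, c in zip(prev, curr)])
    return encoded


def decode_residuals(encoded: List[Weights]) -> List[Weights]:
    """Rebuild every segment by accumulating the stored differences."""
    if not encoded:
        return []
    decoded = [list(encoded[0])]
    for delta in encoded[1:]:
        decoded.append([p + d for p, d in zip(decoded[-1], delta)])
    return decoded


# Toy values chosen to be exactly representable as binary floats.
segments = [[0.50, -1.25, 2.00], [0.75, -1.25, 2.25], [0.75, -1.00, 2.25]]
encoded = encode_residuals(segments)
restored = decode_residuals(encoded)
```

When consecutive segments are similar, most stored differences are small or zero, which is what makes them cheaper to code than the raw weights.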


cs.MM

[72] Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection cs.MM | cs.CL | cs.LG

Rong Fu, Ziming Wang, Shuo Yin, Wenxin Zhang, Haiyun Wei

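TL;DR: This paper proposes Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion modeling. It represents modality hierarchies with Poincaré-ball embeddings, passes messages bidirectionally between nodes and hyperedges through a hypergraph mechanism, and uses decoupled radial and angular contrastive objectives to sharpen class separation. Experiments show that EC-Net produces robust, semantically coherent representations on standard multimodal emotion benchmarks, with notable accuracy gains when modalities are partially missing or contaminated by noise.

Details

Motivation: Emotional expression underpins natural communication and human-computer interaction. Existing methods under-exploit hierarchical structure and high-order semantic relations in multimodal emotion modeling, and their performance drops when modalities are incomplete or noisy. This paper aims to make multimodal emotion understanding more robust by introducing explicit hyperbolic geometry combined with hypergraph fusion.

Result: Experiments on standard multimodal emotion benchmarks (e.g., CMU-MOSEI, IEMOCAP) show consistent gains in emotion classification accuracy. EC-Net outperforms baselines particularly when modalities are partially available or contaminated by noise, reaching state-of-the-art (SOTA) performance.

Insight: The novel elements are: 1) combining hyperbolic geometry (Poincaré-ball embeddings) with hypergraph fusion to model modality hierarchies explicitly; 2) decoupled radial and angular contrastive objectives that sharpen class separation in hyperbolic space; 3) adaptive hyperedge construction that preserves high-order semantic relations across time steps and modalities. Viewed objectively, the method offers a geometry-aware framework for robust representation learning in multimodal emotion analysis.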

Abstract: Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality hierarchies using Poincare-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction. Empirical results on standard multimodal emotion benchmarks show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise. These findings indicate that explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding.
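
For readers unfamiliar with Poincaré-ball embeddings, the sketch below computes the standard geodesic distance on the unit Poincaré ball (curvature -1). It only illustrates the geometry; the abstract does not say how EC-Net uses distances, so nothing here should be read as the paper's method. The function name `poincare_distance` is a hypothetical helper.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit Poincare ball:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# The same Euclidean gap (0.05) is a much longer hyperbolic distance near the
# boundary than near the origin.
near_origin = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.95, 0.0])
```

Distances growing rapidly toward the boundary is the property that makes the ball a natural fit for tree-like hierarchies, with general concepts near the origin and specific ones near the rim.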


cs.SD

[73] MAEB: Massive Audio Embedding Benchmark cs.SD | cs.AI | cs.CL | cs.LG

Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha

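TL;DR: This paper introduces the Massive Audio Embedding Benchmark (MAEB), which covers 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in more than 100 languages. An evaluation of more than 50 models finds that no single model is best on all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best model achieving only modest results. The study also shows that an audio encoder's performance on MAEB correlates highly with its performance inside audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks; it is designed to preserve task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities.

Details

Motivation: Addresses the lack of a unified, large-scale benchmark for evaluating audio embedding models, in order to measure performance comprehensively across diverse audio tasks and to expose performance differences between task types (such as acoustic understanding versus linguistic tasks).

Result: More than 50 models were evaluated on the 30 MAEB tasks, and no model dominates all of them: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the reverse. Clustering results are weak across the board, with the best model achieving only modest results. Audio encoder performance on MAEB correlates highly with performance inside audio large language models.

Insight: The novelty is MAEB, the first large-scale, multi-task, multilingual audio embedding benchmark, integrated into the MTEB ecosystem to support unified cross-modal evaluation. The analysis reveals a performance trade-off between acoustic and linguistic tasks, which provides a key insight for model design and optimization and underscores the need for models specialized to particular task types.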

Abstract: We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.


cs.IR

[74] Variable-Length Semantic IDs for Recommender Systems cs.IR | cs.CL | cs.LG

Kirill Khrylchenko

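TL;DR: This paper proposes variable-length semantic IDs for recommender systems, using a discrete variational autoencoder to learn item representations of adaptive length. The goal is to fix the inefficiency of conventional fixed-length semantic IDs when describing items of differing popularity.

Details

Motivation: Generative models in existing recommender systems face extremely large item-space cardinality. Fixed-length semantic IDs cannot adapt to the skewed popularity distribution of real catalogs, which makes descriptions inefficient and misaligned with natural language.

Result: The abstract reports no specific experimental results or benchmarks. It proposes a discrete variational autoencoder framework based on Gumbel-Softmax reparameterization, intended to avoid the instability of REINFORCE-based training and the constraints of prior fixed-length methods.

Insight: The novelty is bringing the idea of variable-length messages from emergent communication into recommender systems and generating adaptive-length semantic IDs within a probabilistic framework. This may improve description efficiency for long-tail items and align better with the characteristics of natural language.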

Abstract: Generative models are increasingly used in recommender systems, both for modeling user behavior as event sequences and for integrating large language models into recommendation pipelines. A key challenge in this setting is the extremely large cardinality of item spaces, which makes training generative models difficult and introduces a vocabulary gap between natural language and item identifiers. Semantic identifiers (semantic IDs), which represent items as sequences of low-cardinality tokens, have recently emerged as an effective solution to this problem. However, existing approaches generate semantic identifiers of fixed length, assigning the same description length to all items. This is inefficient, misaligned with natural language, and ignores the highly skewed frequency structure of real-world catalogs, where popular items and rare long-tail items exhibit fundamentally different information requirements. In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions. Despite the conceptual similarity, these ideas have not been systematically adopted in recommender systems. In this work, we bridge recommender systems and emergent communication by introducing variable-length semantic identifiers for recommendation. We propose a discrete variational autoencoder with Gumbel-Softmax reparameterization that learns item representations of adaptive length under a principled probabilistic framework, avoiding the instability of REINFORCE-based training and the fixed-length constraints of prior semantic ID methods.
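
The Gumbel-Softmax reparameterization named in the abstract can be sketched as follows. This shows only the generic relaxation for a single token position, using the standard library; the variable-length mechanism, the encoder, and the training objective of the paper are not reproduced, and the names `sample_gumbel` and `gumbel_softmax` as well as the logits and temperature are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Sequence


def sample_gumbel(rng: random.Random) -> float:
    """Draw Gumbel(0, 1) noise as -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax(logits: Sequence[float], temperature: float,
                   rng: random.Random) -> List[float]:
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = [(logit + sample_gumbel(rng)) / temperature for logit in logits]
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical logits over a 3-token codebook for one position of a semantic ID.
soft_token = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
```

Lower temperatures push the output toward a one-hot vector. Because the sample is a differentiable function of the logits, gradients can flow through the discrete choice in an autodiff framework, which is the usual reason to prefer this relaxation over score-function estimators such as REINFORCE.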


cs.AI

[75] Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination cs.AI | cs.CL

Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin

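TL;DR: This paper evaluates Mirror, an evidence-grounded clinical reasoning system, on a 2025 endocrinology board-style examination. The system integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture and operates under a closed-evidence constraint, with no external retrieval. Compared with frontier large language models that had real-time web access, Mirror achieved higher accuracy on the 120-question exam and provided evidence traceability.

Details

Motivation: Large language models perform well on general medical examinations, but subspecialty clinical reasoning remains challenging because of rapidly evolving guidelines and nuanced evidence hierarchies. This study evaluates how a clinical reasoning system built on curated evidence performs on a subspecialty exam.

Result: Mirror reached 87.5% accuracy on the 120-question endocrinology exam, exceeding the human reference (62.3%) and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 hardest questions, Mirror's accuracy was 76.7%. Its top-2 accuracy was 92.5%, versus 85.25% for GPT-5.2.

Insight: The novelty is a system that integrates a curated subspecialty evidence corpus with a structured reasoning architecture and, in a closed setting, outperforms general-purpose LLMs that have web retrieval, while providing evidence traceability (74.2% of outputs cited a guideline-tier source, with 100% citation accuracy). This suggests that for subspecialty clinical reasoning a curated evidence corpus with explicit provenance can outperform unconstrained web retrieval, and that it supports the auditability needed for clinical deployment.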

Abstract: Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
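
As a quick arithmetic check on the headline figure, the sketch below computes a Wilson score interval for 105 correct answers out of 120. The abstract does not state which interval method the authors used; the Wilson interval is an assumption that happens to reproduce the reported 80.4-92.3% range after rounding. The function name `wilson_interval` is a hypothetical helper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials
                               + z2 / (4.0 * trials * trials)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(105, 120)
print(f"{105 / 120:.1%} accuracy, 95% CI {low:.1%} to {high:.1%}")
```

With 120 questions the interval spans roughly twelve percentage points, which is worth keeping in mind when comparing Mirror's 87.5% against the comparator models' scores.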


cs.RO

[76] ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation cs.RO | cs.CV

Yuzhuo Ao, Anbang Wang, Yu-Wing Tai, Chi-Keung Tang

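TL;DR: ReasonNavi is a human-inspired zero-shot embodied navigation framework that couples a multimodal large language model (MLLM) with a deterministic planner to realize a "reason globally first, then act locally" paradigm. It converts a top-down map into a discrete reasoning space, uses the MLLM for multi-stage semantic reasoning to select a target node, and then generates an executable trajectory with the deterministic planner, without fine-tuning the MLLM.

Details

Motivation: Existing embodied agents navigate mainly from local egocentric observations, which limits global foresight and makes exploration inefficient. The paper aims to imitate the efficient way humans navigate: plan globally with a map, then act locally.

Result: Across three navigation tasks, ReasonNavi consistently outperforms prior methods that require extensive training or heavy scene modeling, offering a scalable, interpretable, and globally grounded solution.

Insight: The core novelty is operationalizing the human reason-then-act paradigm. Discretizing the continuous map space and using the MLLM's semantic reasoning to choose a target sidesteps the MLLM's weakness in continuous coordinate prediction, while the deterministic planner keeps action execution robust. The result is a unified zero-shot framework that needs no fine-tuning and scales naturally as foundation models improve.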

Abstract: Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations, which restrict global foresight and lead to inefficient exploration. In contrast, humans plan using maps: we reason globally first, then act locally. We introduce ReasonNavi, a human-inspired framework that operationalizes this reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners. ReasonNavi converts a top-down map into a discrete reasoning space by room segmentation and candidate target nodes sampling. An MLLM is then queried in a multi-stage process to identify the candidate most consistent with the instruction (object, image, or text goal), effectively leveraging the model’s semantic reasoning ability while sidestepping its weakness in continuous coordinate prediction. The selected waypoint is grounded into executable trajectories using a deterministic action planner over an online-built occupancy map, while pretrained object detectors and segmenters ensure robust recognition at the goal. This yields a unified zero-shot navigation framework that requires no MLLM fine-tuning, circumvents the brittleness of RL-based policies and scales naturally with foundation model improvements. Across three navigation tasks, ReasonNavi consistently outperforms prior methods that demand extensive training or heavy scene modeling, offering a scalable, interpretable, and globally grounded solution to embodied navigation. Project page: https://reasonnavi.github.io/


[77] MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models cs.RO | cs.CV | cs.LG

Xunlan Zhou, Xuanlin Chen, Shaowei Zhang, Xiangkun Li, ShengHua Wan

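TL;DR: This paper proposes MARVL, which uses multi-stage guidance from a vision-language model (VLM) to design dense reward functions for robotic reinforcement learning. The method fine-tunes the VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks, adding task direction projection to increase trajectory sensitivity. On the Meta-World benchmark, MARVL significantly outperforms existing VLM-reward methods, showing higher sample efficiency and robustness on sparse-reward manipulation tasks.

Details

Motivation: Dense reward design depends on manual engineering, which limits the scalability and automation of reinforcement learning. Existing VLM-reward methods misalign with task progress, struggle with spatial grounding, and have limited understanding of task semantics. MARVL aims to overcome these problems by improving VLM-based reward design.

Result: On the Meta-World benchmark, MARVL significantly outperforms existing VLM-reward methods, achieving higher sample efficiency and robustness on sparse-reward manipulation tasks and reaching state-of-the-art (SOTA) performance.

Insight: The novel elements are fine-tuning the VLM for spatial and semantic consistency, and decomposing tasks into multi-stage subtasks with task direction projection to improve trajectory sensitivity. Viewed objectively, structured task decomposition and VLM optimization address the misalignment and grounding problems of VLM rewards and offer a new approach to automated reward design.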

Abstract: Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.


[78] World Action Models are Zero-shot Policies cs.RO | cs.CV | cs.LG

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu

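TL;DR: This paper proposes DreamZero, a World Action Model built on a pretrained video diffusion model. By jointly modeling video and action it learns physical dynamics, achieves zero-shot generalization to unseen tasks and environments, and significantly outperforms existing VLA models in real-robot experiments.

Details

Motivation: Current vision-language-action (VLA) models excel at semantic generalization but generalize poorly to unseen physical motions in novel environments.

Result: In real-robot experiments, DreamZero improves generalization to new tasks and environments by more than 2x over state-of-the-art VLA models. With optimizations, a 14B autoregressive video diffusion model performs real-time closed-loop control at 7 Hz. Video demonstrations from other robots or humans, totaling only 10-20 minutes, yield a relative improvement of over 42% on unseen tasks. Only 30 minutes of play data are enough for few-shot embodiment adaptation while retaining zero-shot generalization.

Insight: The novelty is using video as a dense representation of how the world evolves and jointly modeling video and action to learn physical dynamics. This lets the model learn diverse skills from heterogeneous robot data without repetitive demonstrations, and it enables efficient cross-embodiment transfer and few-shot adaptation.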

Abstract: State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.


[79] Articulated 3D Scene Graphs for Open-World Mobile Manipulation cs.RO | cs.AI | cs.CV

Martin Büchner, Adrian Röfer, Tim Engelbracht, Tim Welschehold, Zuria Bauer

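TL;DR: This paper proposes MoMa-SG, a framework for building semantic-kinematic 3D scene graphs of articulated scenes that contain many interactable objects. From RGB-D sequences it performs temporal segmentation and point tracking, estimates object motion parameters, and associates objects with articulations, supporting long-horizon mobile manipulation by robots in the open world.

Details

Motivation: Robots in real environments cannot anticipate how objects move. The paper aims to close the gap between semantics, geometry, and kinematics so that long-horizon mobile manipulation becomes possible.

Result: The method is evaluated extensively on the Arti4D-Semantic dataset (62 real-world RGB-D sequences with 600 object interactions) and one further dataset. Real-world experiments on a quadruped and a mobile manipulator confirm that it enables robust manipulation of articulated objects in everyday home environments.

Insight: The novel elements are: 1) a unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass; 2) the Arti4D-Semantic dataset, which combines hierarchical object semantics with object axis annotations; 3) open-world mobile manipulation enabled by semantic-kinematic scene graphs.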

Abstract: Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics including parent-child relation labels with object axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate the performance of MoMa-SG on two datasets and ablate key design choices of our approach. In addition, real-world experiments on both a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments. We provide code and data at: https://momasg.cs.uni-freiburg.de.


[80] Markerless 6D Pose Estimation and Position-Based Visual Servoing for Endoscopic Continuum Manipulators cs.RO | cs.CV

Junhyun Park, Chunggil An, Myeongbo Park, Ihsan Ullah, Sihyeong Park

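TL;DR: This paper proposes a unified framework for markerless stereo 6D pose estimation and position-based visual servoing of endoscopic continuum manipulators. Photo-realistic simulation generates large-scale automatically annotated data, a stereo-aware multi-feature fusion network improves geometric observability, and a feed-forward rendering-based refinement module enforces geometric consistency in a single pass. Combined with a self-supervised sim-to-real adaptation strategy that improves real-world performance, the framework achieves high-precision closed-loop control.

Details

Motivation: Accurate pose estimation and closed-loop control of flexible endoscopic continuum manipulators are difficult because of hysteresis, compliance, and limited distal sensing. The paper also aims to overcome the limited geometric observability and high computational overhead of existing vision-based methods.

Result: Across 1,000 real samples, the method reaches a mean translation error of 0.83 mm and a mean rotation error of 2.76 degrees. Markerless closed-loop visual servoing tracks trajectories with a mean translation error of 2.07 mm and a mean rotation error of 7.41 degrees, reductions of 85% and 59% relative to open-loop control, and shows high repeatability in repeated point-reaching tasks.

Insight: The novel elements are a stereo-aware multi-feature fusion network that jointly exploits segmentation masks, keypoints, heatmaps, and bounding boxes; a feed-forward rendering-based refinement module that needs no iterative optimization; a self-supervised sim-to-real adaptation strategy; and the first fully markerless, pose-estimation-driven visual servoing framework, which requires no physical markers or embedded sensing.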

Abstract: Continuum manipulators in flexible endoscopic surgical systems offer high dexterity for minimally invasive procedures; however, accurate pose estimation and closed-loop control remain challenging due to hysteresis, compliance, and limited distal sensing. Vision-based approaches reduce hardware complexity but are often constrained by limited geometric observability and high computational overhead, restricting real-time closed-loop applicability. This paper presents a unified framework for markerless stereo 6D pose estimation and position-based visual servoing of continuum manipulators. A photo-realistic simulation pipeline enables large-scale automatic training with pixel-accurate annotations. A stereo-aware multi-feature fusion network jointly exploits segmentation masks, keypoints, heatmaps, and bounding boxes to enhance geometric observability. To enforce geometric consistency without iterative optimization, a feed-forward rendering-based refinement module predicts residual pose corrections in a single pass. A self-supervised sim-to-real adaptation strategy further improves real-world performance using unlabeled data. Extensive real-world validation achieves a mean translation error of 0.83 mm and a mean rotation error of 2.76° across 1,000 samples. Markerless closed-loop visual servoing driven by the estimated pose attains accurate trajectory tracking with a mean translation error of 2.07 mm and a mean rotation error of 7.41°, corresponding to 85% and 59% reductions compared to open-loop control, together with high repeatability in repeated point-reaching tasks. To the best of our knowledge, this work presents the first fully markerless pose-estimation-driven position-based visual servoing framework for continuum manipulators, enabling precise closed-loop control without physical markers or embedded sensing.
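
The abstract reports translation and rotation errors without defining the metrics. A common convention, assumed here purely for illustration, is the Euclidean distance between estimated and ground-truth positions and the geodesic angle between the two rotation matrices. The function names and the toy poses below are hypothetical.

```python
import math
from typing import List, Sequence

Matrix3 = List[List[float]]


def translation_error(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between estimated and ground-truth positions."""
    return math.dist(t_est, t_gt)


def rotation_error_deg(r_est: Matrix3, r_gt: Matrix3) -> float:
    """Geodesic angle, in degrees, of the relative rotation r_est^T r_gt."""
    # trace(A^T B) equals the sum of elementwise products of A and B.
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def rot_z(angle_deg: float) -> Matrix3:
    """Rotation about the z-axis, used here only to build a toy example."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
angle_err = rotation_error_deg(rot_z(30.0), identity)         # about 30 degrees
pos_err = translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])  # 5.0 in input units
```

Averaging these two quantities over a test set gives figures of the same kind as the reported 0.83 mm and 2.76°, provided the paper uses the same definitions.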


[81] Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation cs.RO | cs.CV

Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta

TL;DR: 本文提出了HERO,一种用于人形机器人物体移动操作的新范式,结合了大型视觉模型的强泛化与开放词汇理解能力以及模拟训练的强控制性能。通过设计精确的残差感知末端执行器跟踪策略,显著降低了跟踪误差,并构建了一个模块化系统,使机器人能够在多样化的真实环境中可靠操作日常物体。

Details

Motivation: 解决人形机器人在野外对任意物体进行视觉移动操作时,现有基于真实世界模仿学习的方法因数据收集困难而泛化能力有限的问题。

Result: 所提出的末端执行器跟踪策略将跟踪误差降低了3.2倍;系统在从办公室到咖啡店等多样化真实环境中可靠操作了各种日常物体(如杯子、苹果、玩具),操作表面高度范围从43厘米到92厘米。

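Insight: The key innovation is combining the open-vocabulary understanding of large vision models with the strong control performance of simulated training. The residual-aware end-effector tracking policy is built from inverse kinematics, a learned neural forward kinematics model, goal adjustment, and replanning, which together yield accurate control and strong generalization.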

Abstract: Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.


eess.IV

[82] Automated Assessment of Kidney Ureteroscopy Exploration for Training eess.IV | cs.CV | cs.HC

Fangjie Li, Nicholas Kavoussi, Charan Mohan, Matthieu Chabanas, Jie Ying Wu

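TL;DR: This paper proposes an automated localization framework based on ureteroscope video that identifies the calyces a trainee missed while exploring a kidney phantom, with the goal of improving clinical training through an automated feedback system.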

Details

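Motivation: Kidney ureteroscopic navigation has a steep learning curve. Current clinical training depends on one-on-one expert feedback and takes place in the operating room, which limits access. A phantom training system with automated feedback is therefore needed to expand training opportunities.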

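Result: Across 15 exploration videos, the system correctly classified 69 of 74 calyces with a camera pose localization error below 4 mm. Processing a typical exploration video (1-2 minutes long) takes 10 minutes.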

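Insight: The key innovation is using a prior exploration video to build a reference reconstruction, which enables purely video-based automated localization. This provides accurate feedback on kidney phantom explorations without expert supervision and supports surgical training outside the operating room.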

Abstract: Purpose: Kidney ureteroscopic navigation is challenging with a steep learning curve. However, current clinical training has major deficiencies, as it requires one-on-one feedback from experts and occurs in the operating room (OR). Therefore, there is a need for a phantom training system with automated feedback to greatly expand training opportunities. Methods: We propose a novel, purely ureteroscope video-based scope localization framework that automatically identifies calyces missed by the trainee in a phantom kidney exploration. We use a slow, thorough, prior exploration video of the kidney to generate a reference reconstruction. Then, this reference reconstruction can be used to localize any exploration video of the same phantom. Results: In 15 exploration videos, a total of 69 out of 74 calyces were correctly classified. We achieve < 4 mm camera pose localization error. Given the reference reconstruction, the system takes 10 minutes to generate the results for a typical exploration (1-2 minutes long). Conclusion: We demonstrate a novel camera localization framework that can provide accurate and automatic feedback for kidney phantom explorations. We show its ability as a valid tool that enables out-of-OR training without requiring supervision from an expert.


[83] Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model eess.IV | cs.AI | cs.CV

Ahmet Halici, Ece Tugba Cebeci, Musa Balci, Mustafa Cini, Serkan Sokmen

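TL;DR: This paper proposes a hierarchical vision-language framework for automatically generating diagnostic reports from histopathology whole slide images (WSIs). It handles the very large input through multi-resolution pyramidal patch selection, extracts features with a frozen UNI Vision Transformer, and generates text with a Transformer decoder. To improve reliability, it also adds BioGPT-based tokenization and a retrieval-based verification step.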

Details

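Motivation: Generating precise, domain-specific diagnostic text from gigapixel histopathology whole slide images is challenging because the input is enormous and the output requires specialized language.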

Result: The abstract reports no specific quantitative results, benchmarks, or comparisons with the state of the art.

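Insight: The innovations include a hierarchical framework that pairs a frozen foundation model with a decoder; multi-resolution pyramidal patch selection with background removal for processing WSIs; and a retrieval-based verification step that replaces the generated report with a highly similar ground-truth reference report to improve reliability.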

Abstract: Generating diagnostic text from histopathology whole slide images (WSIs) is challenging due to the gigapixel scale of the input and the requirement for precise, domain-specific language. We propose a hierarchical vision-language framework that combines a frozen pathology foundation model with a Transformer decoder for report generation. To make WSI processing tractable, we perform multi-resolution pyramidal patch selection (downsampling factors 2^3 to 2^6) and remove background and artifacts using Laplacian variance and HSV-based criteria. Patch features are extracted with the UNI Vision Transformer and projected to a 6-layer Transformer decoder that generates diagnostic text via cross-attention. To better represent biomedical terminology, we tokenize the output using BioGPT. Finally, we add a retrieval-based verification step that compares generated reports with a reference corpus using Sentence-BERT embeddings; if a high-similarity match is found, the generated report is replaced with the retrieved ground-truth reference to improve reliability.
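
The patch-filtering step can be illustrated with a small, dependency-free sketch. It assumes a patch is a 2-D list of (R, G, B) tuples in the 0-255 range, rejects blurry or empty patches by the variance of a 4-neighbour Laplacian, and rejects near-white background by mean HSV saturation. The thresholds `min_lap_var` and `min_sat`, the helper names, and the grayscale weights are illustrative assumptions; the abstract does not give the authors' exact criteria.

```python
import colorsys

# Downsampling factors named in the abstract: 2^3 .. 2^6.
PYRAMID_FACTORS = [2 ** k for k in range(3, 7)]  # [8, 16, 32, 64]

def to_gray(rgb):
    """Convert a 2-D list of (r, g, b) tuples to luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Requires a patch of at least 3x3 pixels. Low values indicate a flat
    or out-of-focus region.
    """
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            responses.append(gray[y - 1][x] + gray[y + 1][x]
                             + gray[y][x - 1] + gray[y][x + 1]
                             - 4 * gray[y][x])
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses) / len(responses)

def mean_saturation(rgb):
    """Mean HSV saturation (0..1) of the patch; white glass background is near 0."""
    sats = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1]
            for row in rgb for (r, g, b) in row]
    return sum(sats) / len(sats)

def keep_patch(rgb, min_lap_var=50.0, min_sat=0.05):
    """Keep a patch only if it is both textured and coloured (hypothetical thresholds)."""
    return (laplacian_variance(to_gray(rgb)) >= min_lap_var
            and mean_saturation(rgb) >= min_sat)
```

In a full pipeline the same filter would be applied to patches at each level in `PYRAMID_FACTORS` before feature extraction; a library such as OpenCV would normally replace the pure-Python loops for speed.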


cs.LG

[84] B-DENSE: Branching For Dense Ensemble Network Learning cs.LG | cs.AI | cs.CV | cs.NE

Cherish Puniani, Tushar Kumar, Arnav Bendre, Gaurav Kumar, Shree Singhi

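TL;DR: This paper proposes B-DENSE, a framework that improves diffusion model distillation through multi-branch trajectory alignment. It addresses the loss of structural information and the discretization errors that sparse supervision causes in existing distillation methods, improving generated image quality while still accelerating inference.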

Details

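Motivation: Iterative sampling gives diffusion models high inference latency. Existing distillation techniques accelerate sampling but discard intermediate trajectory steps, which loses structural information and introduces significant discretization errors.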

Result: On image generation tasks, B-DENSE achieves better generation quality than baseline distillation frameworks.

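Insight: The key innovation is modifying the student architecture to output K-fold expanded channels, where each channel subset corresponds to a specific discrete intermediate step in the teacher's trajectory. Training with dense intermediate trajectory alignment lets the student learn to navigate the solution space from the earliest stages of training.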

Abstract: Inspired by non-equilibrium thermodynamics, diffusion models have achieved state-of-the-art performance in generative modeling. However, their iterative sampling nature results in high inference latency. While recent distillation techniques accelerate sampling, they discard intermediate trajectory steps. This sparse supervision leads to a loss of structural information and introduces significant discretization errors. To mitigate this, we propose B-DENSE, a novel framework that leverages multi-branch trajectory alignment. We modify the student architecture to output $K$-fold expanded channels, where each subset corresponds to a specific branch representing a discrete intermediate step in the teacher’s trajectory. By training these branches to simultaneously map to the entire sequence of the teacher’s target timesteps, we enforce dense intermediate trajectory alignment. Consequently, the student model learns to navigate the solution space from the earliest stages of training, demonstrating superior image generation quality compared to baseline distillation frameworks.
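
The branching idea can be sketched in a few lines. This is a simplified illustration, assuming each channel is reduced to a single scalar, that branch i is matched to the teacher's i-th intermediate state, and that the per-branch loss is mean squared error; the abstract does not specify the loss, and the function names are hypothetical.

```python
def split_branches(student_out, k):
    """Split a flat list of K*C channel values into K branches of C channels each."""
    if k <= 0 or len(student_out) % k != 0:
        raise ValueError("student output length must be a positive multiple of k")
    c = len(student_out) // k
    return [student_out[i * c:(i + 1) * c] for i in range(k)]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dense_alignment_loss(student_out, teacher_states):
    """Average per-branch MSE against the teacher's intermediate states.

    teacher_states[i] is the teacher's state at the i-th intermediate
    timestep; every branch receives supervision, not only the final one.
    """
    k = len(teacher_states)
    branches = split_branches(student_out, k)
    return sum(mse(b, t) for b, t in zip(branches, teacher_states)) / k
```

In contrast, a sparse-supervision baseline in this toy setting would compare only one branch with the teacher's final state, leaving the intermediate steps unconstrained.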


[85] Extracting and Analyzing Rail Crossing Behavior Signatures from Videos using Tensor Methods cs.LG | cs.CV

Dawon Ahn, Het Patel, Aemal Khattak, Jia Chen, Evangelos E. Papalexakis

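TL;DR: This paper proposes a multi-view tensor decomposition framework for extracting and analyzing driver behavior patterns from surveillance videos of multiple railway crossings. The method extracts video features with TimeSformer and divides behavior into three phases: approach, waiting, and clearance. By constructing phase-specific similarity matrices and applying non-negative symmetric CP decomposition, it discovers latent behavioral components with distinct temporal signatures. The analysis shows that crossing location influences behavior patterns more than time of day, and that approach-phase behavior is particularly discriminative. The framework enables scalable pattern discovery across crossings and lays a foundation for grouping crossings by behavioral similarity to guide targeted safety interventions.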

Details

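Motivation: Railway crossings pose complex safety challenges, and driver behavior varies by location, time, and conditions. Traditional approaches analyze each crossing individually, which limits the ability to identify behavioral patterns shared across locations.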

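Result: The tensor analysis reveals that crossing location appears to be a stronger determinant of behavior patterns than time of day, and that approach-phase behavior provides particularly discriminative signatures. Visualization of the learned component space confirms location-based clustering, with certain crossings forming distinct behavioral clusters.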

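Insight: The key innovation is a multi-view tensor decomposition framework that analyzes behavior in three temporal phases and uses non-negative symmetric CP decomposition to discover latent behavioral components shared across locations. The method enables automated, scalable behavior pattern discovery across multiple crossings and offers a new route to data-driven safety interventions.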

Abstract: Railway crossings present complex safety challenges where driver behavior varies by location, time, and conditions. Traditional approaches analyze crossings individually, limiting the ability to identify shared behavioral patterns across locations. We propose a multi-view tensor decomposition framework that captures behavioral similarities across three temporal phases: Approach (warning activation to gate lowering), Waiting (gates down to train passage), and Clearance (train passage to gate raising). We analyze railway crossing videos from multiple locations using TimeSformer embeddings to represent each phase. By constructing phase-specific similarity matrices and applying non-negative symmetric CP decomposition, we discover latent behavioral components with distinct temporal signatures. Our tensor analysis reveals that crossing location appears to be a stronger determinant of behavior patterns than time of day, and that approach-phase behavior provides particularly discriminative signatures. Visualization of the learned component space confirms location-based clustering, with certain crossings forming distinct behavioral clusters. This automated framework enables scalable pattern discovery across multiple crossings, providing a foundation for grouping locations by behavioral similarity to inform targeted safety interventions.
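
The first step of this pipeline, building phase-specific similarity matrices, can be sketched as follows. The sketch assumes each video already has one embedding vector per phase, uses cosine similarity, and clips negative values to zero so that the stacked tensor is non-negative and symmetric; the abstract does not state the similarity measure, so these choices and the function names are assumptions. The decomposition itself is not shown.

```python
import math

PHASES = ("approach", "waiting", "clearance")

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def phase_similarity_tensor(embeddings):
    """Build tensor[p][i][j]: similarity of videos i and j in phase p.

    embeddings maps each phase name to a list of per-video vectors, with
    videos in the same order for every phase. Negative similarities are
    clipped to 0 (an assumption) to keep the tensor non-negative.
    """
    tensor = []
    for phase in PHASES:
        vecs = embeddings[phase]
        tensor.append([[max(0.0, cosine(u, v)) for v in vecs] for u in vecs])
    return tensor
```

Each frontal slice `tensor[p]` is a symmetric videos-by-videos matrix, which is the shape a symmetric CP model would factor into shared per-video loadings and per-phase weights.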


[86] ModalImmune: Immunity Driven Unlearning via Self Destructive Training cs.LG | cs.CL | cs.MM

Rong Fu, Jia Yee Tan, Wenxin Zhang, Zijian Zhang, Ziming Wang

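TL;DR: ModalImmune is a training framework that makes multimodal systems more robust to partial or complete loss of input channels. By intentionally and controllably collapsing selected modality information during training, it teaches the model joint representations that are immune to destructive modality influence, improving reliability in real-world deployment.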

Details

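Motivation: Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines their reliability in real-world settings. The paper addresses this by training models to be immune to modality collapse.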

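Result: Empirical evaluation on standard multimodal benchmarks shows that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.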

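Insight: The innovations include a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Together these enable controllable collapse of modality information and robust representation learning.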

Abstract: Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. This paper presents ModalImmune, a training framework that enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.