Table of Contents

cs.CL

[1] Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions cs.CL | cs.AI

Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini

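TL;DR: This paper proposes a neural topic modeling method that uses a language model to generate semantically rich soft label distributions as the supervision signal, replacing the traditional bag-of-words reconstruction objective to improve topic quality.

Details

Motivation: Traditional neural topic models are usually optimized by reconstructing a document's bag-of-words representation, which ignores contextual information and suffers from data sparsity.

Result: Experiments on three datasets show substantial gains in topic coherence and purity over existing baselines, and the method also substantially outperforms existing methods on a newly proposed retrieval-based metric.

Insight: The core idea is to take a language model's next-token probability distribution under a specialized prompt and project it onto a pre-defined vocabulary. The resulting semantically grounded soft labels act as a richer supervision signal and guide the topic model toward representations that better match the thematic structure of the corpus.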

Abstract: Traditional neural topic models are typically optimized by reconstructing the document’s Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels using the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence and purity over existing baselines. We also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
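
The projection step in the abstract above can be illustrated with a minimal sketch. It assumes the LM's next-token distribution is already available as a token-to-log-probability mapping; the function name `project_to_vocab`, the toy distribution, the prompt wording, and the uniform fallback are all hypothetical and not taken from the paper.

```python
import math

def project_to_vocab(next_token_logprobs, topic_vocab):
    """Project LM next-token log-probabilities onto a fixed vocabulary.

    next_token_logprobs: dict mapping token string -> log-probability
    topic_vocab: list of words in the pre-defined vocabulary
    Returns one probability per vocabulary word, renormalised to sum to 1.
    """
    # Keep only the probability mass that lands on vocabulary words.
    mass = [math.exp(next_token_logprobs[w]) if w in next_token_logprobs else 0.0
            for w in topic_vocab]
    total = sum(mass)
    if total == 0.0:
        # Assumption: fall back to a uniform target if no mass lands on the vocabulary.
        return [1.0 / len(topic_vocab)] * len(topic_vocab)
    return [m / total for m in mass]

# Toy distribution, as if returned by an LM for a prompt such as
# "This document is about" (hypothetical prompt).
logprobs = {"sports": math.log(0.5), "football": math.log(0.25), "the": math.log(0.25)}
vocab = ["sports", "football", "politics"]
soft_label = project_to_vocab(logprobs, vocab)
```

Mass on out-of-vocabulary tokens (here "the") is dropped and the remainder renormalised, so the soft label is a proper distribution over the topic vocabulary that a topic model could be trained to reconstruct.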


[2] Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering cs.CL | cs.AI

Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman

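TL;DR: This paper introduces CondMedQA, the first benchmark for conditional biomedical question answering, and develops the Condition-Gated Reasoning (CGR) framework. CGR builds condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions, addressing the weakness of existing biomedical QA systems at condition-dependent clinical reasoning.

Details

Motivation: Existing biomedical QA systems usually assume that medical knowledge applies uniformly, but real clinical reasoning is inherently conditional: decisions depend on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate this kind of conditional reasoning, and existing methods lack explicit mechanisms for it.

Result: CGR matches or exceeds state-of-the-art performance on biomedical QA benchmarks and selects condition-appropriate answers more reliably on CondMedQA.

Insight: The novelty is the condition-gated reasoning mechanism, which models conditionality explicitly through condition-aware knowledge graphs and path selection, improving the robustness and context sensitivity of medical reasoning.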

Abstract: Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to the given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.
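
To make "selectively activates or prunes reasoning paths based on query conditions" concrete, here is a minimal sketch of condition gating over knowledge-graph edges. The edge schema (`blocked_by`), the function `gate_edges`, and the placeholder entities `drug_A` and `drug_B` are hypothetical; the abstract does not describe how CGR actually represents or gates conditions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    head: str
    relation: str
    tail: str
    # Conditions under which this edge must not be used (hypothetical schema).
    blocked_by: frozenset = field(default_factory=frozenset)

def gate_edges(edges, patient_conditions):
    """Keep only edges whose blocking conditions are absent from the query."""
    patient_conditions = set(patient_conditions)
    return [e for e in edges if not (e.blocked_by & patient_conditions)]

# Placeholder graph; entity names are illustrative, not medical guidance.
edges = [
    Edge("hypertension", "treated_by", "drug_A", frozenset({"pregnancy"})),
    Edge("hypertension", "treated_by", "drug_B"),
]

usable = gate_edges(edges, {"pregnancy"})
answers = [e.tail for e in usable]
```

With the condition present, the `drug_A` path is pruned before any multi-hop search runs, so the same question yields a different answer set for a different patient context.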


[3] Analyzing LLM Instruction Optimization for Tabular Fact Verification cs.CL | cs.PL

Xiaotang Du, Giwon Hong, Wai-Chung Kwan, Rohit Saxena, Ivan Titov

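TL;DR: This paper presents the first systematic comparison of instruction optimization methods, based on the DSPy optimization framework, for tabular fact verification. It evaluates four prompting techniques (direct prediction, Chain-of-Thought, ReAct with SQL tools, and CodeAct with Python execution) and three DSPy optimizers (COPRO, MiPROv2, SIMBA) across four benchmarks and three model families. Instruction optimization consistently improves verification accuracy: MiPROv2 is the most stable for CoT, and SIMBA gives the largest gains for ReAct agents, especially with larger models. Behavioral analysis shows that SIMBA encourages more direct reasoning paths through heuristics, which improves numerical comparison in CoT and reduces unnecessary tool calls in ReAct.

Details

Motivation: Instruction optimization offers a lightweight, model-agnostic way to improve LLM reasoning performance, but it had not been systematically compared on tabular fact verification. This paper fills that gap.

Result: Across four benchmarks (not named in the abstract) and three model families, instruction optimization consistently improves verification accuracy. MiPROv2 yields the most stable gains for CoT prompting, and SIMBA provides the largest benefits for ReAct agents, particularly at larger model scales. CoT remains effective for tabular fact verification, especially with smaller models, while ReAct agents built on larger models can reach competitive performance after careful instruction optimization.

Insight: The novelty is the first systematic comparison of instruction optimization methods under DSPy for tabular fact verification. The analysis finds that the SIMBA optimizer, by heuristically encouraging direct reasoning paths, improves numerical comparison and reduces redundant tool calls, which suggests a direction for optimizing tool-based agent reasoning. It also confirms that CoT remains practical in resource-constrained settings with small models.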

Abstract: Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework – COPRO, MiPROv2, and SIMBA – across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.


[4] CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications cs.CL | cs.AI

Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego

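TL;DR: This paper presents CUICurate, a GraphRAG-based automated framework that builds clinical concept sets from a UMLS knowledge graph to support downstream NLP applications. The framework combines graph-based semantic retrieval with LLM filtering and classification and is evaluated on five clinical concepts.

Details

Motivation: In clinical named entity recognition, mapping free text to a single UMLS CUI is often not enough for downstream tasks. Building concept sets that include synonyms, subtypes, and supertypes is currently labour-intensive, inconsistent, and poorly supported by tools.

Result: On five lexically heterogeneous clinical concepts, CUICurate produced substantially larger and more complete concept sets than the manually curated benchmark while matching human precision. GPT-5-mini achieved higher recall in the filtering stage, while GPT-5's classifications were closer to clinician judgements.

Insight: The novelty is combining knowledge-graph-based semantic retrieval with LLM reasoning (GraphRAG) into a scalable, reproducible method for automated concept set construction that substantially reduces manual effort and adapts to different clinical phenotyping and analytic needs.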

Abstract: Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is labour-intensive, inconsistently performed, and poorly supported by existing tools, particularly for NLP pipelines that operate directly on UMLS CUIs. Methods: We present CUICurate, a Graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. For each target concept, candidate CUIs were retrieved from the KG, followed by large language model (LLM) filtering and classification steps comparing two LLMs (GPT-5 and GPT-5-mini). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets. Results: Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision. Comparisons between the two LLMs found that GPT-5-mini achieved higher recall during filtering, while GPT-5 produced classifications that more closely aligned with clinician judgements. Outputs were stable across repeated runs and computationally inexpensive. Conclusions: CUICurate offers a scalable and reproducible approach to support UMLS concept set curation that substantially reduces manual effort. By integrating graph-based retrieval with LLM reasoning, the framework produces focused candidate concept sets that can be adapted to clinical NLP pipelines for different phenotyping and analytic requirements.


[5] Agentic Adversarial QA for Improving Domain-Specific LLMs cs.CL | cs.AI | cs.LG

Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki

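TL;DR: This paper proposes an adversarial question-generation framework that improves the adaptation of domain-specific LLMs by generating a compact set of semantically challenging questions, addressing the weaknesses of existing methods in supporting interpretive reasoning and in sample efficiency.

Details

Motivation: Adapting LLMs to specialized domains is limited by the scarcity and limited coverage of high-quality, task-relevant data. Existing synthetic data generation methods, such as paraphrasing or knowledge extraction, give little support for interpretive reasoning and have poor sample efficiency.

Result: Evaluated on specialized subsets of the LegalBench corpus, the method achieves higher accuracy with substantially fewer synthetic samples.

Insight: The novelty is an adversarial, iterative, feedback-driven question-generation process. It compares the outputs of the model being adapted with those of an expert model grounded in reference documents to reveal and close comprehension gaps, yielding a compact, semantically challenging question set that improves interpretive reasoning and training efficiency in specialized domains.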

Abstract: Large Language Models (LLMs), despite extensive pretraining on broad internet corpora, often struggle to adapt effectively to specialized domains. There is growing interest in fine-tuning these models for such domains; however, progress is constrained by the scarcity and limited coverage of high-quality, task-relevant data. To address this, synthetic data generation methods such as paraphrasing or knowledge extraction are commonly applied. Although these approaches excel at factual recall and conceptual knowledge, they suffer from two critical shortcomings: (i) they provide minimal support for interpretive reasoning capabilities in these specialized domains, and (ii) they often produce synthetic corpora that are excessively large and redundant, resulting in poor sample efficiency. To overcome these gaps, we propose an adversarial question-generation framework that produces a compact set of semantically challenging questions. These questions are constructed by comparing the outputs of the model to be adapted and a robust expert model grounded in reference documents, using an iterative, feedback-driven process designed to reveal and address comprehension gaps. Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.


[6] FENCE: A Financial and Multimodal Jailbreak Detection Dataset cs.CL | cs.AI | cs.DB

Mirae Kim, Seonghun Jeong, Youngjun Kwak

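TL;DR: This paper presents FENCE, a bilingual (Korean-English) multimodal jailbreak detection dataset for the financial domain, intended for training and evaluating detectors of jailbreak attacks on large language models and vision language models. The dataset emphasizes domain realism by pairing finance-relevant queries with image-grounded threats.

Details

Motivation: Jailbreak attacks pose a significant risk to deploying LLMs and VLMs, and VLMs have a broader attack surface because they process both text and images. Resources for jailbreak detection are scarce, particularly in finance.

Result: Experiments on commercial and open-source VLMs reveal consistent vulnerabilities: GPT-4o shows measurable attack success rates, and open-source models are more exposed. A baseline detector trained on FENCE reaches 99% in-distribution accuracy and maintains strong performance on external benchmarks.

Insight: The contribution is a finance-focused multimodal jailbreak detection dataset that emphasizes domain realism, filling a resource gap and providing a solid basis for training reliable detection models in sensitive domains.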

Abstract: Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset’s robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.


[7] Improving Sampling for Masked Diffusion Models via Information Gain cs.CL

Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb

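TL;DR: This paper proposes the Info-Gain Sampler, a new decoding framework that improves sampling for masked diffusion models. It balances the immediate uncertainty of the current decoding step against the information gain over future masked positions, overcoming the limitation of existing greedy heuristics that ignore downstream effects.

Details

Motivation: Existing samplers for masked diffusion models usually use greedy heuristics that decode the positions with the highest local certainty first. This ignores the downstream impact of current decoding choices on later steps and fails to minimize cumulative uncertainty.

Result: Extensive evaluation across architectures and tasks (reasoning, coding, creative writing, and image generation) shows that the Info-Gain Sampler consistently outperforms existing samplers. For example, it improves average accuracy on reasoning tasks by 3.6%, achieves a 63.1% win rate in creative writing, and on reasoning tasks reduces cumulative uncertainty from 78.4 to 48.6, well beyond the best baseline.

Insight: The core idea is to exploit the non-causal nature of masked diffusion models. By evaluating how a decoding decision reshapes token probabilities and uncertainty across all remaining masked positions, the paper derives a principled decoding framework that balances immediate benefit against future information gain instead of relying only on a locally greedy strategy.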

Abstract: Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at https://github.com/yks23/Information-Gain-Sampler.
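
The contrast between greedy certainty-based selection and an information-gain criterion can be sketched on a toy model. The scoring rule below (negative own entropy plus a weighted entropy reduction at the other masked positions) is an assumption made for illustration; the abstract does not give the Info-Gain Sampler's actual objective, and `toy_predict` is a hypothetical stand-in for a masked diffusion model.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability list."""
    return -sum(x * math.log(x) for x in p if x > 0)

def choose_position(predict, state, masked, weight=1.0):
    """Pick which masked position to decode next.

    predict(state) returns {position: probability list} for undecoded positions.
    score = -(own entropy) + weight * (entropy removed from the other masked
    positions after committing this position's most likely token).
    """
    probs = predict(state)
    best_pos, best_score = None, -math.inf
    for pos in masked:
        token = max(range(len(probs[pos])), key=probs[pos].__getitem__)
        others = [q for q in masked if q != pos]
        before = sum(entropy(probs[q]) for q in others)
        after_probs = predict({**state, pos: token})
        after = sum(entropy(after_probs[q]) for q in others)
        score = -entropy(probs[pos]) + weight * (before - after)
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos

def toy_predict(state):
    """Hypothetical model: decoding position 1 makes position 2 nearly certain."""
    probs = {}
    if 0 not in state:
        probs[0] = [0.9, 0.1]
    if 1 not in state:
        probs[1] = [0.6, 0.4]
    if 2 not in state:
        probs[2] = [0.99, 0.01] if 1 in state else [0.5, 0.5]
    return probs

greedy_choice = choose_position(toy_predict, {}, [0, 1, 2], weight=0.0)
info_gain_choice = choose_position(toy_predict, {}, [0, 1, 2], weight=1.0)
```

With `weight=0.0` the rule reduces to the greedy heuristic and picks the locally most certain position 0. With `weight=1.0` it picks position 1, because committing it removes most of the uncertainty at position 2. Note that this sketch calls the model once per candidate position, a cost a practical sampler would need to manage.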


[8] Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning cs.CL | cs.AI

Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao Jin

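TL;DR: This paper proposes "Thinking by Subtraction", a confidence-driven contrastive decoding method that improves the reasoning reliability of large language models by intervening specifically at low-confidence tokens during decoding. The method detects low-confidence tokens, builds a contrastive reference distribution by replacing high-confidence tokens with placeholders, and refines predictions at low-confidence positions by subtracting that reference distribution.

Details

Motivation: Prior work observes that reasoning uncertainty is highly localized: a small subset of low-confidence tokens contributes disproportionately to reasoning errors and unnecessary output expansion. This calls for targeted intervention at those positions instead of a uniform increase in computation.

Result: Experiments show that the method significantly improves accuracy on several mathematical reasoning benchmarks while substantially reducing output length, with only minimal KV-cache overhead. As a training-free method, it improves reasoning reliability without additional training.

Insight: The novelty is a confidence-driven contrastive decoding framework that refines predictions at low-confidence positions "by subtraction", that is, by subtracting a reference distribution obtained after replacing high-confidence tokens with placeholders. This gives a training-free, computationally efficient way to target reasoning uncertainty.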

Abstract: Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding (CCD), detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo-web/CCD.
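
The "subtract a reference distribution only at low-confidence positions" idea can be sketched for a single decoding step. This is a generic contrastive-decoding rule under stated assumptions, not the authors' implementation: the confidence test (maximum probability below `threshold`), the weight `alpha`, and the function names are hypothetical, and the reference logits are simply passed in instead of being computed from a placeholder-masked context.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def contrastive_step(logits, ref_logits, threshold=0.6, alpha=1.0):
    """Choose a token id for one decoding step.

    If the model is confident (max probability >= threshold), decode greedily.
    Otherwise score tokens by log p(token) - alpha * log p_ref(token).
    """
    probs = softmax(logits)
    if max(probs) >= threshold:
        return max(range(len(probs)), key=probs.__getitem__)
    ref = softmax(ref_logits)
    scores = [math.log(p) - alpha * math.log(r) for p, r in zip(probs, ref)]
    return max(range(len(scores)), key=scores.__getitem__)

# Low-confidence step: tokens 0 and 1 are nearly tied, and the reference
# distribution strongly prefers token 0, so subtraction favours token 1.
uncertain = contrastive_step([1.0, 0.9, -2.0], [2.0, 0.0, 0.0])

# High-confidence step: no intervention, the greedy token is kept.
confident = contrastive_step([5.0, 0.0, 0.0], [5.0, 0.0, 0.0])
```

Because the reference pass is only needed at low-confidence steps, most steps cost the same as ordinary greedy decoding in this sketch. Full contrastive decoding schemes usually add a plausibility constraint on candidate tokens, which is omitted here.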


[9] Simplifying Outcomes of Language Model Component Analyses with ELIA cs.CL | cs.AI | cs.LG

Aaron Louis Eidt, Nils Feldhus

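TL;DR: This paper introduces ELIA (Explainable Language Interpretability Analysis), an interactive web application designed to reduce the complexity of mechanistic interpretability analyses of large language models (LLMs). It integrates three techniques (Attribution Analysis, Function Vector Analysis, and Circuit Tracing) and uses a vision-language model to automatically generate natural language explanations for the complex visualizations these methods produce. A mixed-methods user study validates the approach, finding that the interactive interface and the AI-generated explanations help non-experts and lower the barrier to comprehension.

Details

Motivation: Mechanistic interpretability tools are complex enough to create an accessibility gap: they are usually usable only by specialists, which limits wider adoption.

Result: A mixed-methods user study showed a clear user preference for interactive, explorable interfaces over simpler static visualizations. A key finding was that the AI-generated explanations helped bridge the knowledge gap for non-experts. Statistical analysis found no significant correlation between users' prior LLM experience and their comprehension scores, suggesting that the system lowers comprehension barriers across experience levels.

Insight: The claimed novelty is the design and construction of ELIA, an interactive system that integrates several analysis techniques (Attribution Analysis, Function Vector Analysis, and Circuit Tracing), together with a new methodology: using a vision-language model to automatically generate natural language explanations for complex visualizations. Viewed objectively, the core contribution is the systematic simplification and presentation of complex mechanistic interpretability results through AI-assisted explanation and user-centered interaction design, which shows the potential of combining AI with thoughtful interaction design to lower the barrier to technical understanding.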

Abstract: While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques – Attribution Analysis, Function Vector Analysis, and Circuit Tracing – and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations helped bridge the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user’s prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.


[10] Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning cs.CL

Tao Wu, Adam Kapelner

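TL;DR: This paper describes a deep learning system that automatically identifies informative contextual examples for first-language vocabulary instruction for high school students. It compares three modeling approaches: an unsupervised similarity-based strategy using MPNet embeddings, a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings, and that supervised model augmented with handcrafted features. The authors introduce a new metric, the Retention Competency Curve, to visualize the trade-off between the proportion of good contexts discarded and the good-to-bad context ratio. The supervised model with handcrafted features reaches a good-to-bad ratio of 440 while discarding only 70% of the good contexts.

Details

Motivation: Automatically selecting high-quality, informative contextual examples to support efficient first-language vocabulary instruction.

Result: Under the authors' evaluation framework, the supervised model with handcrafted features (model iii) performs best, reaching a good-to-bad ratio of 440 while discarding only 70% of the good contexts.

Insight: The contributions are: 1) combining a modern embedding model (Qwen3) with instruction-aware fine-tuning and handcrafted features for predicting contextual informativeness; 2) the Retention Competency Curve, a new evaluation metric that gives a compact, unified view of the trade-off between filtering precision and coverage. Viewed objectively, the study shows that supervised learning combined with domain feature engineering is effective for this educational technology task.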

Abstract: We describe a modern deep learning system that automatically identifies informative contextual examples ("contexts") for first language vocabulary instruction for high school students. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using MPNet’s uniformly contextualized embeddings, (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head and (iii) model (ii) plus handcrafted context features. We introduce a novel metric called the Retention Competency Curve to visualize trade-offs between the discarded proportion of good contexts and the "good-to-bad" contexts ratio, providing a compact, unified lens on model performance. Model (iii) delivers the most dramatic gains, with a good-to-bad ratio of 440 while only throwing out 70% of the good contexts. In summary, we demonstrate that a modern embedding model on a neural network architecture, when guided by human supervision, yields a low-cost, large supply of near-perfect contexts for teaching vocabulary for a variety of target words.
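
The trade-off that the Retention Competency Curve visualizes can be sketched as a threshold sweep over model scores. This is one plausible reading of the two quantities named in the abstract (proportion of good contexts discarded and good-to-bad ratio among retained contexts); the paper's exact definition may differ, and the function name and toy data are hypothetical.

```python
def retention_curve(scores, is_good, thresholds):
    """For each threshold, keep contexts with score >= threshold and report
    (threshold, proportion of good contexts discarded, good-to-bad ratio kept)."""
    total_good = sum(is_good)
    points = []
    for t in thresholds:
        kept = [g for s, g in zip(scores, is_good) if s >= t]
        good_kept = sum(kept)
        bad_kept = len(kept) - good_kept
        discarded = 1.0 - good_kept / total_good
        # Ratio is infinite when no bad contexts survive the threshold.
        ratio = good_kept / bad_kept if bad_kept else float("inf")
        points.append((t, discarded, ratio))
    return points

# Toy data: predicted informativeness scores and human good/bad labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
is_good = [True, True, False, True, False, False]
curve = retention_curve(scores, is_good, [0.0, 0.5, 0.75])
```

Raising the threshold increases the good-to-bad ratio of what is kept at the cost of discarding more good contexts, which is the trade-off behind the reported operating point of a ratio of 440 at 70% of good contexts discarded.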


[11] Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System cs.CL | cs.AI

Pavithra PM Nair, Preethu Rose Anish

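TL;DR: This paper presents Vichara, a framework tailored to the Indian judicial system for predicting and explaining appellate judgments. It processes English-language appellate case documents, decomposes them into decision points, and uses a structured explanation format based on the IRAC framework. Evaluated with several large language models on the PredEx and ILDC_expert datasets, GPT-4o mini performs best.

Details

Motivation: Jurisdictions such as India face a large backlog of cases, and artificial intelligence has transformative potential for legal judgment prediction, particularly for appellate cases (formal decisions issued by higher courts reviewing lower-court rulings). The paper aims to build a framework specific to the Indian judicial system that predicts and explains appellate judgments.

Result: On PredEx and ILDC_expert, Vichara surpasses existing judgment prediction benchmarks. GPT-4o mini performs best (F1 of 81.5 on PredEx and 80.3 on ILDC_expert), followed by Llama-3.1-8B. Human evaluation on Clarity, Linking, and Usefulness also highlights GPT-4o mini's advantage in interpretability.

Insight: The main contributions are: 1) decomposing appellate case documents into discrete decision points that capture the legal issue, deciding authority, outcome, reasoning, and temporal context, a structured representation that supports accurate prediction and interpretability; 2) adapting the IRAC framework into a structured explanation format for Indian legal reasoning, which improves interpretability and lets legal professionals assess the soundness of predictions efficiently.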

Abstract: In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog comprises appellate cases, which are formal decisions issued by higher courts reviewing the rulings of lower courts. To this end, we present Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments. Vichara processes English-language appellate case proceeding documents and decomposes them into decision points. Decision points are discrete legal determinations that encapsulate the legal issue, deciding authority, outcome, reasoning, and temporal context. The structured representation isolates the core determinations and their context, enabling accurate predictions and interpretable explanations. Vichara’s explanations follow a structured format inspired by the IRAC (Issue-Rule-Application-Conclusion) framework and adapted for Indian legal reasoning. This enhances interpretability, allowing legal professionals to assess the soundness of predictions efficiently. We evaluate Vichara on two datasets, PredEx and the expert-annotated subset of the Indian Legal Documents Corpus (ILDC_expert), using four large language models: GPT-4o mini, Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B. Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini’s superior interpretability.
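
The decision-point representation and the IRAC-style explanation format described above map naturally onto small record types. The field names follow the attributes listed in the abstract; the class names, the `render` helper, and the example values are hypothetical placeholders, not data or code from the paper.

```python
from dataclasses import dataclass, asdict

@dataclass
class DecisionPoint:
    """One discrete legal determination extracted from a case document."""
    legal_issue: str
    deciding_authority: str
    outcome: str
    reasoning: str
    temporal_context: str

@dataclass
class IracExplanation:
    """Explanation laid out as Issue, Rule, Application, Conclusion."""
    issue: str
    rule: str
    application: str
    conclusion: str

    def render(self) -> str:
        # asdict preserves field order, so the lines follow the IRAC sequence.
        return "\n".join(f"{name.capitalize()}: {value}"
                         for name, value in asdict(self).items())

# Placeholder content for illustration only.
point = DecisionPoint(
    legal_issue="<legal issue>",
    deciding_authority="<court or tribunal>",
    outcome="<outcome>",
    reasoning="<reasoning summary>",
    temporal_context="<stage of proceedings>",
)
explanation = IracExplanation("<issue>", "<rule>", "<application>", "<conclusion>")
text = explanation.render()
```

Keeping each determination as a separate record lets a downstream model reason over a list of `DecisionPoint` objects instead of the raw document, and the fixed IRAC order makes explanations easy to scan.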


[12] Validating Political Position Predictions of Arguments cs.CL | cs.AI

Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn

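TL;DR: To address the difficulty of representing and validating subjective, continuous real-world attributes such as political positions, this paper proposes a dual-scale validation framework that combines pointwise and pairwise human annotation to validate political position predictions for arguments. Using 22 language models, it builds a large-scale knowledge base of political position predictions for 23,228 arguments and evaluates the agreement between human and model predictions.

Details

Motivation: How to validate subjective, continuous attributes (such as political positions) in knowledge representation, given that such attributes conflict with pairwise validation, the widely accepted gold standard.

Result: Pointwise evaluation shows moderate human-model agreement (Krippendorff’s α = 0.578), while pairwise validation reveals stronger alignment between human- and model-derived rankings (α = 0.86 for the best model).

Insight: The contributions are: a validation methodology for subjective continuous knowledge that balances scalability with reliability; a validated, structured argumentation knowledge base that supports graph-based reasoning and retrieval-augmented generation; and evidence that ordinal structure can be extracted from pointwise language model predictions even for highly subjective real-world discourse, which advances knowledge representation in domains where traditional symbolic or categorical approaches fall short.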

Abstract: Real-world knowledge representation often requires capturing subjective, continuous attributes, such as political positions, that conflict with pairwise validation, the widely accepted gold standard for human evaluation. We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation. Using 22 language models, we construct a large-scale knowledge base of political position predictions for 23,228 arguments drawn from 30 debates that appeared on the UK political television programme Question Time. Pointwise evaluation shows moderate human-model agreement (Krippendorff’s α = 0.578), reflecting intrinsic subjectivity, while pairwise validation reveals substantially stronger alignment between human- and model-derived rankings (α = 0.86 for the best model). This work contributes: (i) a practical validation methodology for subjective continuous knowledge that balances scalability with reliability; (ii) a validated structured argumentation knowledge base enabling graph-based reasoning and retrieval-augmented generation in political domains; and (iii) evidence that ordinal structure can be extracted from pointwise language model predictions from inherently subjective real-world discourse, advancing knowledge representation capabilities for domains where traditional symbolic or categorical approaches are insufficient.


[13] VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning cs.CL | cs.IR

Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha

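TL;DR: This paper presents VIRAASAT, a semi-automated method for generating multi-hop question-answering datasets specific to Indian culture, and uses it to build a dataset of more than 3,200 multi-hop questions. It evaluates current state-of-the-art LLMs on the dataset and finds limitations in cultural reasoning. To close the gap, it proposes Symbolic Chain-of-Manipulation (SCoM), a framework that improves reasoning by training the model to simulate atomic knowledge graph manipulations internally. In supervised fine-tuning experiments, SCoM outperforms standard Chain-of-Thought baselines by up to 20%.

Details

Motivation: Existing LLMs perform poorly on tasks that require rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian culture. Existing cultural benchmarks are manually crafted, consist of single-hop factual recall questions, and are hard to scale, so this deficiency has gone largely unmeasured.

Result: Evaluating current state-of-the-art LLMs on VIRAASAT reveals a key limitation: fine-tuning on Chain-of-Thought traces fails to ground and synthesize low-probability facts. In supervised fine-tuning experiments, SCoM outperforms standard Chain-of-Thought baselines by up to 20%.

Insight: The contributions are: 1) VIRAASAT, a semi-automated multi-hop dataset generation method for building a large, broad-coverage benchmark of Indian cultural reasoning; 2) the Symbolic Chain-of-Manipulation (SCoM) framework, which teaches the model to traverse graph structure reliably by simulating atomic knowledge graph manipulations internally, improving reasoning over complex cultural knowledge. Together they lay groundwork for culturally aware reasoning models.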

Abstract: Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian culture. Existing cultural benchmarks (i) are manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) are prohibitively costly to scale, leaving this deficiency largely unmeasured. To address this, we introduce VIRAASAT, a novel, semi-automated approach for generating culture-specific multi-hop question-answering datasets for Indian culture. VIRAASAT leverages a Knowledge Graph comprising more than 700 expert-curated cultural artifacts, covering 13 key attributes of Indian culture (history, festivals, etc.). VIRAASAT spans all 28 states and 8 Union Territories, yielding more than 3,200 multi-hop questions that necessitate chained cultural reasoning. We evaluate current State-of-the-Art (SOTA) LLMs on VIRAASAT and identify key limitations in reasoning wherein fine-tuning on Chain-of-Thought (CoT) traces fails to ground and synthesize low-probability facts. To bridge this gap, we propose a novel framework named Symbolic Chain-of-Manipulation (SCoM). Adapting the Chain-of-Manipulation paradigm, we train the model to simulate atomic Knowledge Graph manipulations internally. SCoM teaches the model to reliably traverse the topological structure of the graph. Experiments on Supervised Fine-Tuning (SFT) demonstrate that SCoM outperforms standard CoT baselines by up to 20%. We release the VIRAASAT dataset along with our findings, laying a strong foundation towards building Culturally Aware Reasoning Models.


cs.CV

[14] KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding cs.CV

Boda Lin, Yongjie Zhu, Xiaocheng Gong, Wenyu Qin, Meng Wang

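TL;DR: This paper presents KPM-Bench, a new benchmark dataset for fine-grained motion-centric video understanding, and introduces the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm to mitigate hallucination in video captioning models.

Details

Motivation: Existing video captioning models are notably weak at describing fine-grained motion details accurately and at avoiding hallucination, especially for motion-centric videos, where precise depiction of complex limb dynamics is crucial yet often neglected.

Result: KPM-Bench was built with the proposed automated annotation pipeline, and the MoPE algorithm was developed alongside it. Integrating MoPE into the GRPO post-training framework mitigates hallucination and significantly improves the reliability of motion-centric video captioning models.

Insight: The novelty is an automated annotation pipeline that combines kinematic computation with linguistic parsing, together with a MoPE-based hallucination evaluation metric that works independently of large vision-language or language-only models. Together they provide a new dataset and evaluation method for fine-grained motion understanding.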

Abstract: Despite recent advancements, video captioning models still face significant limitations in accurately describing fine-grained motion details and suffer from severe hallucination issues. These challenges become particularly prominent when generating captions for motion-centric videos, where precise depiction of intricate movements and limb dynamics is crucial yet often neglected. To alleviate this gap, we introduce an automated annotation pipeline that integrates kinematic-based motion computation with linguistic parsing, enabling detailed decomposition and description of complex human motions. Based on this pipeline, we construct and release the Kinematic Parsing Motion Benchmark (KPM-Bench), a novel open-source dataset designed to facilitate fine-grained motion understanding. KPM-Bench consists of (i) fine-grained video-caption pairs that comprehensively illustrate limb-level dynamics in complex actions, (ii) diverse and challenging question-answer pairs focusing specifically on motion understanding, and (iii) a meticulously curated evaluation set specifically designed to assess hallucination phenomena associated with motion descriptions. Furthermore, to address hallucination issues systematically, we propose the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm, capable of accurately extracting motion-specific attributes directly from textual captions. Leveraging MoPE, we introduce a precise hallucination evaluation metric that functions independently of large-scale vision-language or language-only models. By integrating MoPE into the GRPO post-training framework, we effectively mitigate hallucination problems, significantly improving the reliability of motion-centric video captioning models.


[15] CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild cs.CV | cs.LG

Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll

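TL;DR: This paper presents CLUTCH, an LLM-based system for hand animation generation and captioning that targets text-conditioned hand motion modelling in the wild. Its key contributions are the large-scale in-the-wild dataset 3D-HIW, the SHIFT VQ-VAE architecture, and a geometric refinement stage, which together improve the quality and text alignment of generated motion.

Details

Motivation: Existing methods for text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, which are costly and hard to scale to in-the-wild settings. Existing models also struggle to capture animation fidelity together with text-motion alignment.

Result: Experiments show that CLUTCH achieves state-of-the-art performance on both text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling.

Insight: The contributions are: 1) a data annotation pipeline that combines vision-language models with state-of-the-art 3D hand trackers, used to build the large-scale in-the-wild dataset 3D-HIW; 2) SHIFT, a part-modality decomposed VQ-VAE that improves generalization and reconstruction fidelity; 3) a geometric refinement stage in which the LLM is co-supervised with a reconstruction loss applied directly to the decoded hand motion parameters, improving animation quality.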

Abstract: Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to “in-the-wild” settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce ‘3D Hands in the Wild’ (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.
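
The abstract says SHIFT is a VQ-VAE used "to tokenize hand motion". The generic vector-quantization step behind any VQ-VAE tokenizer can be sketched as a nearest-codebook lookup. This is the textbook operation only: SHIFT's part-modality decomposition, its encoder, and its training losses are not modelled, and the codebook and latent values below are made up for illustration.

```python
def quantize(vector, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def tokenize_motion(latents, codebook):
    """Map a sequence of per-frame latent vectors to discrete token ids."""
    return [quantize(z, codebook) for z in latents]

# Hypothetical 2-D codebook and latents standing in for encoder outputs.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]
tokens = tokenize_motion(latents, codebook)
```

The resulting integer sequence is what lets an LLM treat motion like text: motion tokens can be added to the model's vocabulary and then generated or consumed alongside word tokens.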


[16] Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision cs.CV

Xinwei Ju, Rema Daher, Danail Stoyanov, Sophia Bano, Francisco Vasconcelos

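TL;DR: This paper proposes PRISM, a self-supervised learning framework for monocular endoscopic depth and pose estimation. It combines edge detection and luminance decoupling to exploit anatomical and illumination priors that guide geometric learning, addressing texture-less surfaces, complex illumination, and the lack of ground-truth data in endoscopic scenes.

Details

Motivation: Monocular depth and pose estimation is important for colonoscopy-assisted navigation, but it is challenged by texture-less surfaces, complex illumination patterns, deformation, and the lack of in-vivo datasets with reliable ground truth.

Result: Experiments on multiple real and synthetic datasets show that the method achieves state-of-the-art performance.

Insight: The novelty is combining learning-based edge detection with luminance decoupling, obtained through an intrinsic decomposition module, to provide structural guidance. The ablation study also yields two practical insights: self-supervised training on real data outperforms supervised training on realistic phantom data, and video frame rate is a key factor in model performance, so dataset-specific frame sampling is needed to produce high-quality training data.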

Abstract: Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation, as they enable improved screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. However, this task remains challenging due to the presence of texture-less surfaces, complex illumination patterns, deformation, and a lack of in-vivo datasets with reliable ground truth. In this paper, we propose PRISM (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely incorporates edge detection and luminance decoupling for structural guidance. Specifically, edge maps are derived using a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin and high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading and reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring the superiority of domain realism over ground truth availability; and (2) video frame rate is an extremely important factor for model performance, where dataset-specific video frame sampling is necessary for generating high quality training data.


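[17] Enabling Training-Free Text-Based Remote Sensing Segmentation cs.CV

Jose Sosa, Danila Rukhovich, Anis Kacem, Djamila Aouada

TL;DR: This paper proposes an approach to text-based remote sensing image segmentation that needs no additional training. It combines contrastive vision-language models (such as CLIP) and generative vision-language models (such as GPT-5 and Qwen-VL) with the Segment Anything Model (SAM) into a pipeline that is either fully zero-shot or requires only lightweight LoRA tuning, and it shows strong segmentation capabilities across 19 remote sensing benchmarks.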

Details

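Motivation: Existing zero-shot remote sensing segmentation methods built on vision-language models usually rely on additional trainable components, which limits their generalisation and practical applicability. This paper investigates how far text-guided remote sensing segmentation can go using only existing foundation models, without additional training.

Result: Extensive experiments were run on 19 remote sensing benchmarks covering open-vocabulary, referring, and reasoning-based tasks. The contrastive approach (CLIP as a mask selector for SAM) achieves state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In the generative approach, click prompts for SAM are generated either by GPT-5 in a zero-shot setting or by a LoRA-tuned Qwen-VL model, with the latter giving the best results.

Insight: The main contribution is an integrated framework that is either fully training-free or needs only lightweight LoRA tuning, combining the capabilities of contrastive and generative VLMs with SAM's general-purpose segmentation to achieve strong zero-shot remote sensing segmentation. Viewed objectively, its modular composition of different foundation models to solve complex tasks (open-vocabulary, referring, and reasoning segmentation) is a useful design pattern, and the planned code release supports reproducibility and practical use.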

Abstract: Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as mask selector for SAM’s grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at https://github.com/josesosajs/trainfree-rs-segmentation.
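The mask-selection idea in the contrastive branch can be illustrated with a minimal sketch. It assumes each SAM mask proposal and each class prompt has already been embedded into a shared space; the toy 3-D vectors, the `select_masks` helper, and the 0.5 threshold are hypothetical stand-ins and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_masks(mask_embeddings, text_embeddings, threshold=0.5):
    """Give each mask proposal its best-matching class, or None if no class is close enough."""
    labels = []
    for emb in mask_embeddings:
        scores = {name: cosine(emb, t) for name, t in text_embeddings.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else None)
    return labels

# Toy vectors standing in for CLIP text and mask-crop features (hypothetical values).
text_embeddings = {"building": [1.0, 0.0, 0.0], "water": [0.0, 1.0, 0.0]}
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
labels = select_masks(mask_embeddings, text_embeddings)
```

In this toy run the third proposal matches neither prompt and is discarded, which is the role a selector plays when SAM's grid prompts produce masks for regions outside the queried vocabulary.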


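[18] VidEoMT: Your ViT is Secretly Also a Video Segmentation Model cs.CV

Narges Norouzi, Idil Esen Zulfikar, Niccolò Cavagnero, Tommie Kerssies, Bastian Leibe

TL;DR: This paper proposes VidEoMT, an encoder-only video segmentation model. It uses a Vision Transformer (ViT) encoder with a lightweight query propagation mechanism and a query fusion strategy to perform efficient video segmentation without dedicated tracking modules. The method stays competitive in accuracy while running 5x to 10x faster than existing methods, at up to 160 FPS with a ViT-L backbone.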

Details

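Motivation: Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules, which adds significant architectural complexity and computational overhead. Motivated by evidence that large-scale pre-trained ViT encoders can segment images accurately without specialized modules, this paper aims to design a simple, efficient encoder-only video segmentation model that removes the need for dedicated tracking modules.

Result: VidEoMT achieves competitive accuracy on video segmentation while being 5x to 10x faster than existing methods, running at up to 160 FPS with a ViT-L backbone.

Insight: The main contributions are: 1) an encoder-only video segmentation architecture that simplifies model design; 2) a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame; and 3) a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries, balancing information propagation against adaptability to new content. Together these offer an efficient and simple modeling approach for video understanding tasks.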

Abstract: Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x–10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/


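[19] VQPP: Video Query Performance Prediction Benchmark cs.CV | cs.IR | cs.LG

Adrian Catalin Lutu, Eduard Poesina, Radu Tudor Ionescu

TL;DR: This paper proposes VQPP, the first benchmark for video query performance prediction, designed to evaluate query performance prediction in content-based video retrieval. The benchmark comprises two text-to-video retrieval datasets and two CBVR systems, totalling 56K text queries and 51K videos, and comes with official splits to support reproducible research.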

Details

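Motivation: Query performance prediction (QPP) has been studied extensively for text and image retrieval but remains underexplored in content-based video retrieval (CBVR), so a dedicated benchmark is needed to advance the area.

Result: Experiments show that pre-retrieval predictors obtain competitive performance, so they can be applied before the actual retrieval step. The authors also use the best pre-retrieval predictor as a reward model to train a large language model for query reformulation via direct preference optimization (DPO), demonstrating the practical utility of VQPP.

Insight: The novelty is the first benchmark for video query performance prediction, which fills a gap in QPP research for CBVR. The paper also explores both pre-retrieval and post-retrieval predictors, providing a representative evaluation framework for future QPP research in the video domain, and shows potential in practical tasks such as LLM-driven query reformulation.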

Abstract: Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at https://github.com/AdrianLutu/VQPP.


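[20] On the Evaluation Protocol of Gesture Recognition for UAV-based Rescue Operation based on Deep Learning: A Subject-Independence Perspective cs.CV

Domonkos Varga

TL;DR: This paper presents a methodological analysis of the deep-learning-based UAV rescue gesture-recognition approach proposed by Liu and Szirányi, with a focus on the validity of its evaluation protocol. It argues that the near-perfect accuracy metrics reported in the original study result from a frame-level random train-test split that mixes samples from the same subjects across the training and test sets, causing severe data leakage. By examining the published confusion matrix, learning curves, and dataset construction, the author shows that the evaluation does not measure generalization to unseen individuals.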

Details

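Motivation: The goal is to expose and critique flaws in an existing gesture-recognition evaluation protocol, and to stress the importance of subject-independent data partitioning, particularly for applications such as UAV rescue that require reliable recognition of gestures from new users.

Result: The qualitative analysis indicates that the original evaluation protocol suffers from data leakage, which inflates the reported generalization performance. The paper provides no new quantitative results but argues analytically for the necessity of subject-independent splits.

Insight: The novelty is a systematic analysis of evaluation-protocol validity from a subject-independence perspective. It emphasizes that vision-based gesture-recognition research, especially for human-machine interaction applications, must use strict, leakage-free evaluation to ensure that models generalize to unseen individuals.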

Abstract: This paper presents a methodological analysis of the gesture-recognition approach proposed by Liu and Szirányi, with a particular focus on the validity of their evaluation protocol. We show that the reported near-perfect accuracy metrics result from a frame-level random train-test split that inevitably mixes samples from the same subjects across both sets, causing severe data leakage. By examining the published confusion matrix, learning curves, and dataset construction, we demonstrate that the evaluation does not measure generalization to unseen individuals. Our findings underscore the importance of subject-independent data partitioning in vision-based gesture-recognition research, especially for applications - such as UAV-human interaction - that require reliable recognition of gestures performed by previously unseen people.
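The difference between the two partitioning schemes can be made concrete with a small sketch. The synthetic sample list (10 subjects with 100 frames each), the helper names, and the 20% test fraction are illustrative assumptions, not details of the dataset analysed in the paper.

```python
import random

def frame_level_split(samples, test_fraction=0.2, seed=0):
    """Shuffle individual frames, ignoring who performed the gesture (leaky)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def subject_independent_split(samples, test_fraction=0.2, seed=0):
    """Hold out whole subjects so that no person appears in both sets."""
    rng = random.Random(seed)
    subjects = sorted({s["subject"] for s in samples})
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s["subject"] not in test_subjects]
    test = [s for s in samples if s["subject"] in test_subjects]
    return train, test

def shared_subjects(train, test):
    """Subjects that occur on both sides of a split."""
    return {s["subject"] for s in train} & {s["subject"] for s in test}

# Synthetic stand-in for a gesture dataset: 10 subjects, 100 frames each.
samples = [{"subject": sid, "frame": f} for sid in range(10) for f in range(100)]
leaky_train, leaky_test = frame_level_split(samples)
clean_train, clean_test = subject_independent_split(samples)
```

With the frame-level split, `shared_subjects(leaky_train, leaky_test)` is non-empty, so near-duplicate frames of the same person sit on both sides; with the subject-level split it is empty, so test accuracy reflects performance on people the model has never seen.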


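[21] Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models cs.CV

Yuxiao Chen, Jue Wang, Zhikang Zhang, Jingru Yi, Xu Zhang

TL;DR: This paper proposes a new framework for efficient long-form video understanding, consisting of an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC), integrated with a multimodal large language model (MLLM). The system adaptively captures essential information from videos of varying durations and preserves discriminative information at high compression rates, performing well on multiple long-form and standard video understanding benchmarks.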

Details

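Motivation: The paper addresses two challenges in long-form video analysis that stem from the inherent redundancy of video sequences: efficiently processing a large number of frames within memory constraints, and extracting discriminative information from the vast volume of input data.

Result: The framework shows promising performance on multiple long-form video understanding tasks and standard video understanding benchmarks, supporting the efficacy and versatility of the approach.

Insight: The novelty lies in combining adaptive sampling with spatiotemporal compression to process long videos end to end. At its core are an information-density-aware sampling strategy and a compression mechanism that retains key information, offering a reusable approach for integrating long videos into MLLMs efficiently.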

Abstract: With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.


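[22] Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models cs.CV | cs.AI | cs.LG | cs.MM

Dhruba Ghosh, Yuhui Zhang, Ludwig Schmidt

TL;DR: This paper studies how a range of vision-language models perform on fine-grained image classification. It finds that although these models have improved on conventional visual question answering benchmarks, they perform poorly on classification benchmarks that test fine-grained visual knowledge. Through extensive experiments, the paper identifies key factors that affect fine-grained knowledge capability, including the quality of the vision encoder, the pretraining strategy, and whether the language model is unfrozen during pretraining.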

Details

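Motivation: Vision-language models have made substantial progress across a wide range of visual question answering benchmarks, yet recent work shows that they trail behind on traditional image classification benchmarks that test fine-grained visual knowledge. The paper investigates what causes this disconnect between fine-grained knowledge and performance on other vision benchmarks.

Result: The paper tests a large number of recent VLMs on fine-grained classification benchmarks. Ablation experiments show that a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. The pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining.

Insight: The novelty lies in systematically evaluating the fine-grained knowledge capabilities of VLMs and identifying vision encoder quality and pretraining strategy (in particular, whether the language model is trained during pretraining) as key factors in fine-grained visual understanding. This points to ways of enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.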

Abstract: Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.


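[23] A Single Image and Multimodality Is All You Need for Novel View Synthesis cs.CV

Amirhosein Javadi, Chi-Shiang Gau, Konstantinos D. Polyzos, Tara Javidi

TL;DR: This paper proposes enhancing single-image novel view synthesis with sparse multimodal range measurements (such as radar or LiDAR). It models depth in an angular domain with a localized Gaussian Process to produce dense depth maps that serve as geometric conditioning for a diffusion model, improving the geometric consistency and visual quality of synthesized views in real-world driving scenes.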

Details

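Motivation: Existing diffusion-based single-image novel view synthesis methods rely on geometry inferred from monocular depth estimation, but depth estimates are unreliable in real-world scenes with low texture, adverse weather, or heavy occlusion, which degrades the quality of synthesized views.

Result: In experiments on real-world multimodal driving scenes, replacing vision-only depth with the proposed sparse range-based depth reconstruction substantially improves the geometric consistency and visual quality of single-image novel-view video generation.

Insight: The novelty lies in incorporating extremely sparse multimodal range data into a depth reconstruction framework, using a localized Gaussian Process in the angular domain for efficient inference with explicit uncertainty quantification. The result can be used as a drop-in module for existing diffusion-based rendering pipelines without modifying the generative model itself, underscoring the importance of reliable geometric priors for diffusion-based view synthesis.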

Abstract: Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.
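As a rough illustration of localized Gaussian Process regression with uncertainty, the sketch below predicts depth at a query angle from its nearest range measurements. It is a one-dimensional toy under stated assumptions: depth is treated as a function of a single azimuth angle, the RBF kernel and its hyperparameters are hypothetical, and "localized" is approximated by using the `k` nearest measurements. The paper's actual formulation may differ.

```python
import math

def rbf(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential kernel on angles in radians (hypothetical hyperparameters)."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def local_gp_predict(query, angles, depths, k=3, noise=1e-4):
    """Posterior mean and variance of depth at `query`, using the k nearest measurements."""
    nearest = sorted(range(len(angles)), key=lambda i: abs(angles[i] - query))[:k]
    xs = [angles[i] for i in nearest]
    ys = [depths[i] for i in nearest]
    mean_y = sum(ys) / len(ys)  # constant prior mean taken from the local neighbours
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(query, a) for a in xs]
    alpha = solve(K, [y - mean_y for y in ys])
    mean = mean_y + sum(ks * al for ks, al in zip(k_star, alpha))
    v = solve(K, k_star)
    var = rbf(query, query) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, max(var, 0.0)

# Five sparse range returns (angle in radians, depth in metres); values are made up.
angles = [-0.4, -0.1, 0.0, 0.3, 0.5]
depths = [12.0, 10.0, 9.5, 15.0, 20.0]
near_mean, near_var = local_gp_predict(0.0, angles, depths)
far_mean, far_var = local_gp_predict(2.0, angles, depths)
```

The predicted variance is small at an angle that has a measurement and close to the prior variance far from all measurements, which is the kind of per-pixel uncertainty the abstract describes passing along with the reconstructed depth.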


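[24] ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging cs.CV | cs.LG | eess.IV

Athanasios Angelakis

TL;DR: This paper proposes ZACH-ViT, a compact Vision Transformer designed for medical imaging. It removes the positional embeddings and the [CLS] classification token of the standard ViT and achieves permutation invariance through global average pooling, which suits medical images whose spatial layout is weakly informative or inconsistent. The model is evaluated on several MedMNIST datasets under a strict few-shot protocol, showing both its strengths and its limitations across data modalities.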

Details

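Motivation: Standard Vision Transformers rely on positional embeddings and a [CLS] token that encode fixed spatial priors, which may hinder generalization on medical images whose spatial layout is weakly informative or inconsistent. The paper aims to design a compact Transformer architecture better suited to the characteristics of medical imaging data.

Result: Evaluation covers seven MedMNIST datasets (binary and multi-class tasks) under a strict few-shot protocol (50 samples per class). ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and is competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (such as OCTMNIST and OrganAMNIST). The model achieves competitive performance while maintaining sub-second inference times.

Insight: The main novelty is removing positional embeddings and the [CLS] token to achieve permutation invariance, with adaptive residual projections preserving training stability. The core insight is that aligning architectural inductive bias with data structure can matter more than pursuing universal benchmark dominance, which suggests a direction for model design in specific domains such as medical imaging.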

Abstract: Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term “Zero-token” specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at https://github.com/Bluesman79/ZACH-ViT.
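The permutation-invariance claim can be checked on a toy encoder. The sketch below uses a single self-attention layer with identity projections followed by global average pooling; it is an assumption-laden stand-in for the idea, not the ZACH-ViT architecture, and the patch and positional values are made up.

```python
import math

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy layer)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

def mean_pool(tokens):
    """Global average pooling over the token axis."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def encode(patches, pos_emb=None):
    """Optionally add positional embeddings, then attend and pool."""
    if pos_emb is not None:
        patches = [[p + e for p, e in zip(patch, pe)]
                   for patch, pe in zip(patches, pos_emb)]
    return mean_pool(self_attention(patches))

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shuffled = list(reversed(patches))               # same patches, different order
pos_emb = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]   # tied to position, not content

no_pos_a, no_pos_b = encode(patches), encode(shuffled)
with_pos_a, with_pos_b = encode(patches, pos_emb), encode(shuffled, pos_emb)
```

Without positional embeddings the pooled representation is the same for both patch orders (up to floating-point rounding), because attention is permutation-equivariant and the mean is order-independent. Adding embeddings tied to position breaks that equality.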


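[25] ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models cs.CV | cs.AI

Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding

TL;DR: This paper proposes ROCKET, a residual-oriented multi-layer representation alignment framework that strengthens the 3D spatial understanding of Vision-Language-Action (VLA) models in robotic manipulation. It uses a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model, and adopts a Matryoshka-style sparse activation scheme to balance the multiple alignment losses, reaching state-of-the-art-level performance at very low compute cost.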

Details

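Motivation: Existing VLA models are typically pretrained on 2D data and lack 3D spatial understanding. Existing representation alignment methods usually apply supervision at only a single layer, failing to exploit the rich information distributed across depth, while naive multi-layer alignment can cause gradient interference.

Result: On the LIBERO benchmark, ROCKET achieves a state-of-the-art 98.5% success rate using only about 4% of the compute budget. It also shows superior performance on LIBERO-Plus and RoboTwin, and across multiple VLA models.

Insight: The novelty lies in formulating multi-layer alignment as aligning one residual stream to another, using a shared projector to realize a layer-invariant mapping that reduces gradient conflicts, and proposing a Matryoshka-style sparse activation scheme to manage the multiple alignment losses. Viewed objectively, its theoretical analysis and training-free layer selection strategy also offer ideas for efficient representation alignment.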

Abstract: Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.


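[26] MUOT_3M: A 3 Million Frame Multimodal Underwater Benchmark and the MUTrack Tracking Method cs.CV

Ahsan Baidar Bakht, Mohamad Alansari, Muhayy Ud Din, Muzammal Naseer, Sajid Javed

TL;DR: This paper introduces MUOT_3M, a multimodal underwater object tracking benchmark with 3 million frames from 3,030 videos, and builds the MUTrack tracking method on top of it. MUTrack is SAM-based and uses visual geometric alignment, vision-language fusion, and multi-level knowledge distillation to transfer multimodal knowledge into a unimodal student model, giving efficient, high-performance tracking.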

Details

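Motivation: Existing underwater object tracking (UOT) benchmarks are small and RGB-only, which limits robustness under severe color distortion, turbidity, and low visibility. The paper aims to advance the field by building a large, multimodal, and diverse dataset.

Result: Extensive evaluation on five UOT benchmarks shows that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS.

Insight: The main contributions are: 1) MUOT_3M, the first large-scale pseudo-multimodal underwater tracking benchmark (RGB, estimated enhanced RGB, estimated depth, and language), validated by a marine biologist; and 2) MUTrack, whose core idea is to use knowledge distillation to transfer the benefits of multimodal training into an easily deployable unimodal model, balancing performance and efficiency.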

Abstract: Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB only, limiting robustness under severe color distortion, turbidity, and low visibility conditions. We introduce MUOT_3M, the first pseudo multimodal UOT benchmark comprising 3 million frames from 3,030 videos (27.8h) annotated with 32 tracking attributes, 677 fine grained classes, and synchronized RGB, estimated enhanced RGB, estimated depth, and language modalities validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal to unimodal tracker featuring visual geometric alignment, vision language fusion, and four level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.


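[27] Towards LLM-centric Affective Visual Customization via Efficient and Precise Emotion Manipulating cs.CV

Jiamin Luo, Xuqian Gu, Jingjing Wang, Jiahong Lu

TL;DR: This paper proposes an LLM-centric affective visual customization task, which aims to edit the subjective emotional content of images efficiently and precisely through a multimodal LLM. To this end, it designs an Efficient and Precise Emotion Manipulating (EPEM) approach with an Efficient Inter-emotion Converting module and a Precise Exter-emotion Retaining module, and shows on a purpose-built dataset that it outperforms existing methods.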

Details

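Motivation: Existing work on visual customization relies mainly on aligning objective control signals with the edited images, ignoring subjective emotional content, and general-purpose foundation models for affective visual customization are lacking.

Result: Comprehensive experiments on the constructed L-AVC dataset show that the proposed EPEM approach outperforms several state-of-the-art baselines on the L-AVC task, supporting both the importance of emotion information and the effectiveness of EPEM.

Insight: The novelty is that the paper is the first to systematically propose and define the LLM-centric affective visual customization task, and that it designs dedicated modules for the task's two main challenges: efficient conversion of emotion semantics and precise retention of emotion-agnostic content. This offers a new direction for multimodal emotion editing.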

Abstract: Previous studies on visual customization primarily rely on the objective alignment between various control signals (e.g., language, layout and canny) and the edited images, which largely ignores subjective emotional content and, more importantly, lacks general-purpose foundation models for affective visual customization. With this in mind, this paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which focuses on generating images by modifying their subjective emotions via a Multimodal LLM. Further, this paper contends that how to make the model efficiently align emotion conversion in semantics (named inter-emotion semantic conversion) and how to precisely retain emotion-agnostic contents (named exter-emotion semantic retaining) are both important and challenging in this L-AVC task. To this end, this paper proposes an Efficient and Precise Emotion Manipulating (EPEM) approach for editing subjective emotions in images. Specifically, an Efficient Inter-emotion Converting (EIC) module is tailored to make the LLM efficiently align emotion conversion in semantics before and after editing, followed by a Precise Exter-emotion Retaining (PER) module to precisely retain the emotion-agnostic contents. Comprehensive experimental evaluations on our constructed L-AVC dataset demonstrate the great advantage of the proposed EPEM approach to the L-AVC task over several state-of-the-art baselines. This justifies the importance of emotion information for L-AVC and the effectiveness of EPEM in efficiently and precisely manipulating such information.


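[28] DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE cs.CV | cs.AI

Yujie Jin, Wenxin Zhang, Jingjing Wang, Guodong Zhou

TL;DR: This paper proposes DeepSVU, a new security-oriented video understanding task that aims not only to identify and locate threats in videos but also to attribute and evaluate the causes of threatening segments. To address the challenges of modeling physical-world information and trading off different factors, it proposes the Unified Physical-world Regularized MoE (UPRM) and shows on self-built datasets that it outperforms existing Video-LLMs and non-VLM approaches.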

Details

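Motivation: Existing research on security-oriented video understanding focuses mainly on detecting and localizing threats and lacks an effective capability to generate and evaluate threat causes. To fill this gap, the paper proposes the more in-depth DeepSVU task.

Result: Extensive experiments on the self-built DeepSVU instruction datasets (UCF-C instructions and CUVA instructions) show that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches.

Insight: The novelty lies in proposing the new DeepSVU task paradigm and designing the UPRM model. At its core, a Unified Physical-world Enhanced MoE (UPE) block and a Physical-world Trade-off Regularizer (PTR) model coarse-to-fine physical-world information and adaptively trade off these factors, improving in-depth security-oriented video understanding.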

Abstract: In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing the threats (e.g., shootings, robberies) in videos, while largely lacking the effective capability to generate and evaluate the threat causes. Motivated by these gaps, this paper introduces a new chat paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims to not only identify and locate the threats but also attribute and evaluate the causes of threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model the coarse-to-fine physical-world information (e.g., human behavior, object interactions and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components: the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), to address the above two challenges, respectively. Extensive experiments conducted on our DeepSVU instructions datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results justify the importance of the coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of our UPRM in capturing such information.


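[29] UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models cs.CV | cs.RO

Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Xiangnan Wu

TL;DR: This paper proposes Uncertainty-aware Observation Reinjection (UAOR), a training-free, plug-and-play module for enhancing Vision-Language-Action (VLA) models. It measures Action Entropy to identify uncertainty at a language model layer and, when uncertainty is high, reinjects key observation information into the next layer's FFN through attention retrieval. This makes the model attend more closely to observations during inference and generate more confident and faithful actions, without extra data or training.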

Details

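Motivation: To improve performance, existing VLA models often add extra observation cues (such as depth maps) or auxiliary modules (such as object detectors), but these usually require costly data collection and additional training. The paper aims to develop a training-free, plug-and-play module that improves VLA performance more efficiently.

Result: Comprehensive experiments show that the method consistently improves diverse VLA models on simulation and real-world tasks with minimal overhead and without additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines.

Insight: The novelty, inspired by the finding that the FFN in language models can act as "key-value memory", is a dynamic observation reinjection mechanism driven by uncertainty (Action Entropy). It offers an effective way to make a model attend more to observations without retraining, and it is general and practical.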

Abstract: Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as “key-value memory”, we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer’s Feed-Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
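The uncertainty gate can be sketched in a few lines. This assumes that Action Entropy is the Shannon entropy of a softmax distribution over candidate action tokens, and that reinjection is triggered when it exceeds a threshold; the logits, the threshold value, and the function names are hypothetical, and the attention-retrieval step itself is not modelled.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_entropy(logits):
    """Shannon entropy (in nats) of the distribution over action tokens."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_reinject(logits, threshold):
    """Gate: reinject observation features only when the layer looks uncertain."""
    return action_entropy(logits) > threshold

confident = [8.0, 0.0, 0.0, 0.0]   # one action clearly preferred
uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform over four actions
threshold = 0.5 * math.log(4)      # hypothetical: half of the maximum entropy
```

Here `should_reinject(uncertain, threshold)` is true and `should_reinject(confident, threshold)` is false, so the extra retrieval cost would be paid only at layers where the action distribution is diffuse.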


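[30] Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers cs.CV | cs.AI

Guandong Li, Mengxia Ye

TL;DR: This paper proposes Dual-Channel Attention Guidance (DCAG), a training-free method for controlling image editing in models built on the Diffusion Transformer (DiT). By manipulating both the Key channel and the Value channel of the attention mechanism, it gives fine-grained control over editing intensity and a better balance between image fidelity and editing effect.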

Details

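Motivation: Existing attention-manipulation methods for image editing focus only on the Key space to modulate attention routing and leave the Value space, which governs feature aggregation, entirely unexploited. This limits the precision and flexibility of editing control and prevents fine-grained editing-fidelity trade-offs.

Result: Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) show that DCAG outperforms Key-only guidance on all fidelity metrics, with the largest gains on localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).

Insight: The novelty starts from the observation that both the Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, and builds on it to propose a dual-channel guidance framework that manipulates the Key and Value channels together. The theoretical analysis shows that the Key channel acts through the nonlinear softmax function as a coarse control, while the Value channel acts through linear weighted summation as a fine-grained complement; together they form a two-dimensional parameter space that enables more precise editing control. This adds a new control dimension for training-free editing with diffusion models.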

Abstract: Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space – which governs feature aggregation – entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT’s multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(δ_k, δ_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
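The coarse-versus-fine distinction between the two channels can be illustrated with a toy attention step. The sketch below is built on assumptions the abstract does not spell out: the bias is estimated as the mean token vector, guidance rescales each token's deviation from that bias by `1 + strength`, and `delta_k` and `delta_v` name the two strengths. It is not the authors' implementation.

```python
import math

def split_bias_delta(tokens):
    """Decompose tokens into a shared bias (mean vector) plus per-token deltas."""
    n, d = len(tokens), len(tokens[0])
    bias = [sum(t[i] for t in tokens) / n for i in range(d)]
    deltas = [[t[i] - bias[i] for i in range(d)] for t in tokens]
    return bias, deltas

def rescale_deltas(tokens, strength):
    """Assumed guidance rule: keep the bias, scale each delta by (1 + strength)."""
    bias, deltas = split_bias_delta(tokens)
    return [[b + (1.0 + strength) * x for b, x in zip(bias, dl)] for dl in deltas]

def attend(query, keys, values, delta_k=0.0, delta_v=0.0):
    """One attention read-out with guided keys and values."""
    keys = rescale_deltas(keys, delta_k)
    values = rescale_deltas(values, delta_v)
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

query = [1.0, 0.5]
keys = [[1.0, 0.2], [0.8, 0.4], [1.2, 0.1]]
values = [[0.5, 1.0], [0.7, 0.9], [0.4, 1.2]]

base = attend(query, keys, values)
v_small = attend(query, keys, values, delta_v=0.5)
v_large = attend(query, keys, values, delta_v=1.0)
```

Under these assumptions the output changes linearly with `delta_v` (doubling it doubles the shift from `base`), because the attention weights sum to one and the values enter through a weighted sum. Changes to `delta_k` pass through the softmax and are not linear in general.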


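[31] Spatio-temporal Decoupled Knowledge Compensator for Few-Shot Action Recognition cs.CV

Hongyu Qu, Xiangbo Shu, Rui Yan, Hailiang Gao, Wenguan Wang

TL;DR: This paper proposes the DiST framework, which enhances few-shot action recognition by decoupling spatial and temporal knowledge provided by large language models. It decomposes action names into diverse spatio-temporal attribute descriptions and uses Spatial and Temporal Knowledge Compensators to learn object-level and frame-level prototypes, respectively, capturing fine-grained spatial details and diverse temporal patterns.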

Details

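Motivation: Existing few-shot action recognition methods usually use semantically coarse category names as auxiliary context, but this context is too limited to provide enough background knowledge for capturing novel spatial and temporal concepts in actions.

Result: Experimental results show that DiST achieves state-of-the-art (SOTA) results on five standard few-shot action recognition datasets.

Insight: The novelty is a decomposition-incorporation framework that uses decoupled spatio-temporal commonsense knowledge from large language models to complement semantic context, and learns multi-granularity prototypes through purpose-built compensators, thereby capturing fine-grained spatial details and diverse temporal patterns in a transparent way.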

Abstract: Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, such context provided by the action names is too limited to provide sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge. Moreover, in TKC, frame-level prototypes utilize temporal attributes to assist in inter-frame temporal relation modeling. These learned prototypes thus provide transparency in capturing fine-grained spatial details and diverse temporal patterns. Experimental results show DiST achieves state-of-the-art results on five standard FSAR datasets.


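[32] 3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis cs.CV

Ziyue Wang, Linghan Cai, Chang Han Low, Haofeng Liu, Junde Wu

TL;DR: This paper proposes 3DMedAgent, a unified multimodal large language model (MLLM) agent that addresses the disconnect between perception and understanding in existing approaches to 3D CT analysis. It coordinates heterogeneous visual and textual tools to decompose complex 3D analysis into tractable subtasks that move from global to regional views, from 3D volumes to 2D slices, and from visual evidence to structured textual representations, and it uses a long-term structured memory to support evidence-driven multi-step reasoning.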

Details

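Motivation: Existing 3D medical analysis methods use either isolated task-specific modeling or task-agnostic end-to-end paradigms, which makes it hard to accumulate perceptual evidence systematically for downstream clinical reasoning. Meanwhile, mainstream MLLMs are predominantly 2D-oriented and cannot effectively perceive and analyze volumetric 3D data. The paper aims to bridge this gap.

Result: In experiments across more than 40 tasks, 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs without 3D-specific fine-tuning. The paper also introduces the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging.

Insight: The core novelty is a flexible MLLM agent framework that lets 2D MLLMs handle 3D medical analysis through tool coordination and task decomposition, together with a long-term structured memory that aggregates intermediate results to support evidence-driven reasoning. This offers a scalable path toward general-purpose 3D clinical assistants.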

Abstract: 3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical assistants. Code and data are available at https://github.com/jinlab-imvr/3DMedAgent.


[33] Faster Training, Fewer Labels: Self-Supervised Pretraining for Fine-Grained BEV Segmentation cs.CVPDF

Daniel Busch, Christian Bohn, Thomas Kurbiel, Klaus Friedrichs, Richard Meyes

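TL;DR: This paper proposes a two-phase training strategy for fine-grained bird's-eye-view (BEV) semantic segmentation. Self-supervised pretraining reduces the dependence on labeled data, and fine-tuning uses only 50% of the data and less training time while still outperforming the fully supervised baseline on nuScenes.

Details

Motivation: Current multi-camera BEV semantic segmentation methods depend on costly and inconsistently annotated ground truth. The paper aims to reduce the reliance on full supervision through self-supervised pretraining, lowering both annotation cost and training time.

Result: On nuScenes, the method improves mIoU on fine-grained road marking segmentation by up to 2.5 percentage points over the fully supervised baseline, while halving the amount of annotated data used and cutting total training time by up to two thirds.

Insight: The novelty lies in self-supervised pretraining with differentiable reprojection and camera-perspective pseudo-labels (generated by Mask2Former), which yields transferable BEV features and a scalable path toward autonomous-driving perception with fewer labels.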

Abstract: Dense Bird’s Eye View (BEV) semantic maps are central to autonomous driving, yet current multi-camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two-phase training strategy for fine-grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine-tuning while still outperforming the comparable supervised baseline model. During the self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi-view semantic pseudo-labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time. With our method, fine-tuning benefits from rich priors learned during pretraining, which boosts performance and BEV segmentation quality (up to +2.5pp mIoU over the fully supervised baseline) on nuScenes. It simultaneously halves the usage of annotation data and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection plus camera perspective pseudo labels yields transferable BEV features and a scalable path toward reduced-label autonomous perception.
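For readers unfamiliar with the metric behind the "+2.5pp mIoU" figure, the following is a minimal sketch of how mean Intersection-over-Union is commonly computed over flattened label maps. The function name `mean_iou` and the toy labels are illustrative assumptions, not taken from the paper, and the authors' exact evaluation protocol may differ (for example in how absent classes are handled).

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in either the prediction or the target.

    pred, target: flat sequences of integer class labels of equal length.
    Classes absent from both are skipped (an assumption of this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

A gain stated in percentage points is the plain difference of two such scores multiplied by 100, not a relative improvement.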


[34] Comparative Assessment of Multimodal Earth Observation Data for Soil Moisture Estimation cs.CV | cs.LGPDF

Ioannis Kontogiorgakis, Athanasios Askitopoulos, Iason Tsardanidis, Dimitrios Bormpoudakis, Ilias Tsoumas

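TL;DR: This paper presents a high-resolution (10 m) soil moisture estimation framework for vegetated areas across Europe that combines Sentinel-1 SAR, Sentinel-2 optical imagery, and ERA-5 reanalysis data. Using machine learning and evaluating against International Soil Moisture Network (ISMN) station data, it compares different modality combinations and temporal parameterizations, and tests whether foundation model (Prithvi) embeddings offer an advantage over traditional hand-crafted features.

Details

Motivation: Existing satellite soil moisture products are too coarse (>1 km) for farm-level precision agriculture, water resources management, and climate monitoring, so a high-resolution estimation method is needed.

Result: Under spatial cross-validation on ISMN station data, hybrid temporal matching (current-day Sentinel-2 with Sentinel-1 descending orbit) reaches R^2=0.514, and adding a 10-day ERA5 lookback window raises this to R^2=0.518. Foundation model (Prithvi) embeddings give only a negligible improvement over hand-crafted features (R^2=0.515 vs. 0.514).

Insight: The novelty lies in integrating multimodal remote sensing data with temporal parameterization for high-resolution soil moisture estimation, and in assessing the usefulness of foundation models for sparse-data regression. The study finds that domain-specific spectral indices combined with tree-based ensemble methods offer a practical and efficient solution for operational, large-area, field-scale soil moisture monitoring, and that traditional feature engineering remains highly competitive for this kind of task.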

Abstract: Accurate soil moisture (SM) estimation is critical for precision agriculture, water resources management and climate monitoring. Yet, existing satellite SM products are too coarse (>1km) for farm-level applications. We present a high-resolution (10m) SM estimation framework for vegetated areas across Europe, combining Sentinel-1 SAR, Sentinel-2 optical imagery and ERA-5 reanalysis data through machine learning. Using 113 International Soil Moisture Network (ISMN) stations spanning diverse vegetated areas, we compare modality combinations with temporal parameterizations, using spatial cross-validation, to ensure geographic generalization. We also evaluate whether foundation model embeddings from IBM-NASA’s Prithvi model improve upon traditional hand-crafted spectral features. Results demonstrate that hybrid temporal matching - Sentinel-2 current-day acquisitions with Sentinel-1 descending orbit - achieves R^2=0.514, with 10-day ERA5 lookback window improving performance to R^2=0.518. Foundation model (Prithvi) embeddings provide negligible improvement over hand-crafted features (R^2=0.515 vs. 0.514), indicating traditional feature engineering remains highly competitive for sparse-data regression tasks. Our findings suggest that domain-specific spectral indices combined with tree-based ensemble methods offer a practical and computationally efficient solution for operational pan-European field-scale soil moisture monitoring.
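The two evaluation ideas in this abstract, R^2 and spatial cross-validation, can be sketched in a few lines. The code below is an illustration under stated assumptions: `station_folds` is a hypothetical helper that holds out whole stations per fold (one common way to test geographic generalization), and nothing here reproduces the authors' actual pipeline or fold assignment.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def station_folds(station_ids, n_folds):
    """Split sample indices so that no station appears in both train and test.

    station_ids: one station identifier per sample.
    Returns a list of (train_indices, test_indices) pairs.
    """
    stations = sorted(set(station_ids))
    folds = []
    for k in range(n_folds):
        held_out = set(stations[k::n_folds])  # every n-th station goes to fold k
        test = [i for i, s in enumerate(station_ids) if s in held_out]
        train = [i for i, s in enumerate(station_ids) if s not in held_out]
        folds.append((train, test))
    return folds


folds = station_folds(["a", "a", "b", "c"], n_folds=2)
```

Grouping by station matters because samples from the same station are strongly correlated; a random split would let the model see the test locations during training and inflate R^2.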


[35] Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers cs.CVPDF

Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, Weijia Jia

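TL;DR: This paper proposes PrediT, a training-free acceleration framework for the iterative denoising process of Diffusion Transformers (DiT). It formulates feature prediction as a linear multistep problem, forecasts future model outputs from historical information, adds a corrector that prevents error accumulation in high-dynamics regions, and uses a dynamic step modulation mechanism to adapt the prediction horizon, substantially reducing latency while preserving generation quality.

Details

Motivation: Diffusion Transformers (DiT) are a backbone for high-fidelity image and video generation, but their iterative denoising is computationally expensive. Existing training-free acceleration methods cache and reuse features under an assumption of temporal stability, yet reusing features can cause latent drift and visual degradation. The authors observe that model outputs evolve smoothly along much of the diffusion trajectory, and therefore propose principled prediction instead of naive reuse.

Result: Extensive experiments show up to 5.54x latency reduction across various DiT-based image and video generation models, with negligible quality degradation.

Insight: The novelty lies in formalizing feature prediction as a linear multistep problem and applying classical linear multistep methods to it, combined with a corrector and dynamic step modulation to balance acceleration against fidelity. Viewed objectively, the method offers a systematic, retraining-free acceleration strategy that exploits temporal smoothness through prediction instead of simple caching, which is instructive for efficient diffusion model design.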

Abstract: Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose PrediT, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves up to 5.54× latency reduction across various DiT-based image and video generation models, while incurring negligible quality degradation.
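To make the contrast between "naive reuse" and "prediction from history" concrete, here is a minimal sketch. It is not PrediT's actual formulation: it swaps in plain polynomial extrapolation over equally spaced steps as a stand-in for the linear multistep methods the abstract mentions, and the names `forecast_next`, `relative_change`, `should_predict`, and the threshold value are all hypothetical.

```python
def forecast_next(history):
    """Extrapolate the next feature vector from the most recent cached ones.

    Assumes equally spaced denoising steps. Falls back to lower order
    when less history is available.
    """
    if not history:
        raise ValueError("need at least one cached feature vector")
    if len(history) == 1:
        return list(history[-1])  # no trend information: plain reuse
    if len(history) == 2:
        prev, last = history[-2], history[-1]
        return [2 * b - a for a, b in zip(prev, last)]  # linear extrapolation
    a, b, c = history[-3], history[-2], history[-1]
    return [3 * z - 3 * y + x for x, y, z in zip(a, b, c)]  # quadratic


def relative_change(prev, last):
    """Relative L2 change between two consecutive feature vectors."""
    diff = sum((b - a) ** 2 for a, b in zip(prev, last)) ** 0.5
    norm = sum(a * a for a in prev) ** 0.5
    return diff / norm if norm > 0 else float("inf")


def should_predict(history, threshold=0.1):
    """Hypothetical gate: skip the model call only when features change slowly."""
    if len(history) < 2:
        return False
    return relative_change(history[-2], history[-1]) < threshold
```

In a sampling loop, a gate of this kind would decide at each step whether to run the full transformer or substitute the forecast; when the change rate is high, running the model plays the role the abstract assigns to the corrector.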


[36] OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models cs.CV | cs.AI | cs.DBPDF

Ling Lin, Yang Bai, Heng Su, Congcong Zhu, Yaoxing Wang

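TL;DR: This paper proposes OODBench, a benchmark for evaluating how well large vision-language models handle out-of-distribution data. The benchmark contains 40K instance-level OOD instance-category pairs and uses an automated evaluation metric based on a basic-to-advanced progression of prompted questions. The study finds that current VLMs still show notable performance degradation on OOD data, even for common image categories.

Details

Motivation: Existing vision-language models are typically trained under the IID assumption, but real-world data often violate it, and mishandling OOD data can create safety risks. There is currently no effective benchmark that comprehensively evaluates how VLMs handle OOD data.

Result: On the proposed OODBench benchmark, current VLMs show clear performance degradation. The paper does not report a direct comparison against a specific SOTA model, but the new benchmark and metric systematically expose the models' fragility on OOD data.

Insight: The novelty lies in a large-scale, automatically constructed OOD evaluation benchmark (OODBench) and an automated metric based on progressive prompting, which together provide tools and initial findings for systematically studying VLM robustness to OOD data. Viewed objectively, its methodology (automated construction and evaluation) and its focus (systematic OOD evaluation) are useful references for research on model safety and robustness.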

Abstract: Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty more fully. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.


[37] Evaluating Graphical Perception Capabilities of Vision Transformers cs.CVPDF

Poonam Poonam, Pere-Pau Vázquez, Timo Ropinski

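TL;DR: This paper evaluates Vision Transformers (ViTs) on graphical perception tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. The study benchmarks ViTs against convolutional neural networks (CNNs) and human participants in controlled graphical perception tasks and finds that, although ViTs perform well on general vision tasks, their human-like graphical perception in the visualization domain is limited.

Details

Motivation: ViTs have become a strong alternative to CNNs across many image tasks. The graphical perception abilities of CNNs have already been evaluated, but those of ViTs remain largely unexplored. The study aims to fill this gap by evaluating ViTs on elementary visual judgment tasks.

Result: ViTs perform strongly on general vision tasks but align only weakly with human perception on graphical perception tasks. According to this summary, ViTs underperform both CNNs and human participants on the controlled graphical perception benchmarks.

Insight: The novelty lies in the first systematic evaluation of ViTs on graphical perception tasks, which reveals a gap with human perception and provides an important reference for applying ViTs in visualization systems and graphical perception modeling. Viewed objectively, it underlines that model generalization to specific domains such as visualization needs further study.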

Abstract: Vision Transformers, ViTs, have emerged as a powerful alternative to convolutional neural networks, CNNs, in a variety of image-based tasks. While CNNs have previously been evaluated for their ability to perform graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs in elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Inspired by their study, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs demonstrate strong performance in general vision tasks, their alignment with human-like graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for the application of ViTs in visualization systems and graphical perceptual modeling.


[38] BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards cs.CVPDF

Yiran Yang, Zhaowei Liu, Yuan Yuan, Yukun Song, Xiong Ma

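TL;DR: BLM-Guard is a multimodal content-audit framework for moderating commercial ads on short-video platforms. It combines Chain-of-Thought reasoning, rule-based policy principles, and a critic-guided reward to detect deceptive visuals, speech, and subtitles in ads.

Details

Motivation: Short-video platforms host large volumes of multimodal ads whose deceptive visuals, speech, and subtitles require finer-grained, policy-driven moderation than community safety filters provide.

Result: Experiments on real short-video ads show that BLM-Guard surpasses strong baselines in accuracy, consistency, and generalization.

Insight: The contributions are: 1) a rule-driven ICoT data-synthesis pipeline that generates structured scene descriptions, reasoning chains, and labels to jump-start training and cut annotation costs; 2) reinforcement learning with a composite reward that balances causal coherence against policy adherence; 3) a multitask architecture that separately models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), improving robustness.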

Abstract: Short-video platforms now host vast multimodal ads whose deceptive visuals, speech and subtitles demand finer-grained, policy-driven moderation than community safety filters. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven ICoT data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward balancing causal coherence with policy adherence. A multitask architecture models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency and generalization.


[39] On the Adversarial Robustness of Discrete Image Tokenizers cs.CV | cs.AIPDF

Rishika Bhagwatkar, Irina Rish, Nicolas Flammarion, Francesco Croce

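TL;DR: This paper is the first to study the vulnerability of discrete image tokenizers to adversarial attacks. It proposes attacks that target their feature extraction and tokenization, and improves their robustness through unsupervised adversarial training, an important step toward safe multimodal foundation models.

Details

Motivation: Discrete image tokenizers are increasingly popular in multimodal systems, but their adversarial robustness has not been explored. The paper aims to fill this gap by studying their vulnerability and proposing a defense.

Result: The proposed attacks are effective and computationally efficient across classification, multimodal retrieval, and captioning tasks. Unsupervised adversarial training significantly improves tokenizer robustness to both unsupervised and end-to-end supervised attacks and generalizes to unseen tasks and data.

Insight: The novelty lies in the first systematic study of the adversarial robustness of discrete image tokenizers and in using unsupervised adversarial training as a defense. The method does not depend on labeled data, which makes it more broadly applicable, and it highlights how critical tokenizer robustness is for downstream tasks.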

Abstract: Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. Ours being the first work studying this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.


[40] DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control cs.CVPDF

Shiyan Du, Conghan Yue, Xinyu Cheng, Dongyu Zhang

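TL;DR: This paper proposes DEIG, a framework for multi-instance image generation with fine-grained semantic control. An Instance Detail Extractor converts text encodings into compact, instance-aware representations, and a Detail Fusion Module prevents attribute leakage across instances, producing multi-instance scenes that precisely match complex textual descriptions.

Details

Motivation: Existing multi-instance generation methods fall short in fine-grained semantic understanding and struggle with complex textual descriptions, which leads to attribute leakage and semantic inaccuracy.

Result: Across multiple benchmarks, DEIG outperforms existing methods in spatial consistency, semantic accuracy, and compositional generalization, and it performs well on the newly constructed DEIG-Bench.

Insight: The contributions include the design of the Instance Detail Extractor and the Detail Fusion Module, and the use of vision-language models to build a high-quality dataset with fine-grained annotations. The framework can be integrated into standard diffusion models as a plug-and-play module to improve controllable generation.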

Abstract: Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
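The idea of instance-based masked attention can be illustrated with a small sketch: each image token is allowed to attend only to the text tokens that describe its own instance, so attributes of one instance cannot leak into another. This is a generic illustration, not DEIG's Detail Fusion Module; the function name, the integer instance ids, and the choice to return all-zero weights for tokens with no matching text are assumptions of this sketch.

```python
import math


def masked_attention_weights(scores, token_instance, text_instance):
    """Softmax over attention scores, restricted to same-instance text tokens.

    scores[i][j]: raw attention score from image token i to text token j.
    token_instance[i]: instance id of image token i.
    text_instance[j]: instance id of text token j.
    """
    weights = []
    for i, row in enumerate(scores):
        allowed = [j for j, inst in enumerate(text_instance)
                   if inst == token_instance[i]]
        if not allowed:
            weights.append([0.0] * len(row))  # nothing to attend to
            continue
        top = max(row[j] for j in allowed)  # subtract max for numerical stability
        exps = {j: math.exp(row[j] - top) for j in allowed}
        total = sum(exps.values())
        weights.append([exps[j] / total if j in exps else 0.0
                        for j in range(len(row))])
    return weights


# One image token of instance 0; the high score on instance 1's text is masked out.
w = masked_attention_weights([[1.0, 1.0, 5.0]], [0], [0, 0, 1])
```

Without the mask, the third text token would dominate the softmax, which is the kind of cross-instance leakage the abstract describes.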


[41] Diff2DGS: Reliable Reconstruction of Occluded Surgical Scenes via 2D Gaussian Splatting cs.CV | cs.GR | cs.ROPDF

Tianyi Song, Danail Stoyanov, Evangelos Mazomenos, Francisco Vasconcelos

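TL;DR: This paper proposes Diff2DGS, a two-stage framework for reliable reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by surgical instruments with high spatial-temporal consistency. In the second stage, 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) captures dynamic tissue deformation and anatomical structure. The method achieves SOTA PSNR on the EndoNeRF and StereoMIS benchmarks and performs a quantitative depth accuracy analysis on the SCARED dataset, showing that optimizing image quality alone does not necessarily yield optimal 3D reconstruction accuracy.

Details

Motivation: Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, but existing methods such as Gaussian Splatting reconstruct occluded regions poorly, and depth accuracy has not been fully assessed because existing benchmarks such as EndoNeRF and StereoMIS lack 3D ground truth.

Result: On EndoNeRF and StereoMIS, Diff2DGS surpasses SOTA methods in image quality, reaching 38.02 dB and 34.40 dB PSNR respectively. Quantitative depth accuracy analysis on the SCARED dataset further indicates strong geometric reconstruction.

Insight: The core novelty is combining diffusion-based video inpainting with a learnable deformation model on top of 2D Gaussian Splatting to handle occlusion and dynamic deformation. A key insight is that optimizing image-quality metrics such as PSNR alone is not equivalent to achieving optimal 3D geometric accuracy, so the authors additionally optimize depth quality to obtain faithful geometry alongside high-fidelity appearance.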

Abstract: Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, improving surgeon guidance, and enabling automation. Recent methods achieve dense reconstructions from da Vinci robotic surgery videos, with Gaussian Splatting (GS) offering real-time performance via graphics acceleration. However, reconstruction quality in occluded regions remains limited, and depth accuracy has not been fully assessed, as benchmarks like EndoNeRF and StereoMIS lack 3D ground truth. We propose Diff2DGS, a novel two-stage framework for reliable 3D reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by instruments with high spatial-temporal consistency. In the second stage, we adapt 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry. We also extend evaluation beyond prior image-quality metrics by performing quantitative depth accuracy analysis on the SCARED dataset. Diff2DGS outperforms state-of-the-art approaches in both appearance and geometry, reaching 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Furthermore, our experiments demonstrate that optimizing for image quality alone does not necessarily translate into optimal 3D reconstruction accuracy. To address this, we further optimize the depth quality of the reconstructed 3D results, ensuring more faithful geometry in addition to high-fidelity appearance.
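Since the abstract contrasts an image-quality metric (PSNR) with depth accuracy, a short sketch of both may help. PSNR follows its standard definition, 10·log10(MAX² / MSE). The depth metric shown, mean absolute error over valid pixels, is only one plausible choice and is an assumption here; the paper's exact depth protocol on SCARED is not described in this digest.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def mean_abs_depth_error(depth_gt, depth_pred):
    """Mean absolute depth error, skipping pixels with no ground truth (None)."""
    pairs = [(g, p) for g, p in zip(depth_gt, depth_pred) if g is not None]
    if not pairs:
        raise ValueError("no valid ground-truth depth values")
    return sum(abs(g - p) for g, p in pairs) / len(pairs)
```

The two functions measure different things: a rendering can match the reference colors closely (high PSNR) while placing the surface at the wrong depth, which is why the abstract argues for evaluating geometry separately.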


[42] G-LoG Bi-filtration for Medical Image Classification cs.CV | math.ATPDF

Qingsong Wang, Jiaxing He, Bingzhe Hou, Tieru Wu, Yang Cao

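TL;DR: This paper proposes a two-parameter filtration called G-LoG (Gaussian-Laplacian of Gaussian) for medical image classification. The method uses the Laplacian of Gaussian operator to enhance image boundaries, builds topological features from the result, and proves the stability of the construction. Experiments on MedMNIST show that a simple multi-layer perceptron (MLP) classifier trained on these topological features performs comparably to complex deep learning models trained on the raw data.

Details

Motivation: In Topological Data Analysis (TDA), building practical filtrations that detect the topological and geometric features of objects is an important task. The paper aims to design a feature generation method for medical image classification that is better suited to multi-parameter persistence modules.

Result: Experiments on MedMNIST show that the proposed bi-filtration significantly outperforms single-parameter filtration. A simple MLP trained on the topological features generated by the bi-filtration performs comparably to complex baselines trained on the original dataset, including Google AutoML Vision, ResNet, AutoKeras, and auto-sklearn.

Insight: The novelty lies in using the Laplacian of Gaussian (LoG) operator to enhance the boundaries of medical images and, on that basis, defining a new and stable two-parameter filtration (the G-LoG bi-filtration) that produces features better suited to multi-parameter persistence modules. Viewed objectively, the method combines a classical image processing operator (LoG) with multi-parameter persistence from TDA, offering an interpretable and computationally efficient alternative for medical image classification and showing that simple models with carefully designed topological features can compete with complex deep learning models.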

Abstract: Building practical filtrations on objects to detect topological and geometric features is an important task in the field of Topological Data Analysis (TDA). In this paper, leveraging the ability of the Laplacian of Gaussian operator to enhance the boundaries of medical images, we define the G-LoG (Gaussian-Laplacian of Gaussian) bi-filtration to generate features more suitable for multi-parameter persistence modules. By modeling volumetric images as bounded functions, we then prove that the interleaving distance on the persistence modules obtained from our bi-filtrations on the bounded functions is stable with respect to the maximum norm of the bounded functions. Finally, we conduct experiments on the MedMNIST dataset, comparing our bi-filtration against single-parameter filtration and established deep learning baselines, including Google AutoML Vision, ResNet, AutoKeras and auto-sklearn. Experimental results demonstrate that our bi-filtration significantly outperforms single-parameter filtration. Notably, a simple Multi-Layer Perceptron (MLP) trained on the topological features generated by our bi-filtration achieves performance comparable to complex deep learning models trained on the original dataset.


[43] Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges cs.CV | cs.LGPDF

Minh Dinh, Stéphane Deny

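TL;DR: This paper examines the promise and challenges of latent equivariant operators for robust object recognition. Deep networks struggle to recognize objects under symmetric transformations rarely seen in training (such as unusual poses, scales, or positions). The study considers architectures that learn equivariant operators in a latent space from examples of symmetric transformations, validates their effectiveness for out-of-distribution classification on rotated and translated noisy MNIST, where they overcome the limitations of both traditional and equivariant networks, and points out the challenges of scaling to complex datasets.

Details

Motivation: Deep learning has difficulty recognizing objects that have undergone symmetric transformations rarely seen during training (such as unusual poses, scales, or positions), and conventional equivariant networks are hard to apply when the transformations are not known a priori.

Result: On simple rotated and translated noisy MNIST datasets, the latent equivariant operator architecture succeeds at out-of-distribution classification and overcomes the limitations of traditional and equivariant networks. The summary reports no specific quantitative metrics or comparison with SOTA.

Insight: The novelty lies in architectures that learn latent equivariant operators from examples, generalizing across symmetric transformations without prior knowledge of the transformations, which offers a new direction for robust object recognition. Viewed objectively, the method is currently limited to simple datasets, and scaling it to complex scenes is the main challenge.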

Abstract: Despite the successes of deep learning in computer vision, difficulties persist in recognizing objects that have undergone group-symmetric transformations rarely seen during training, for example objects seen in unusual poses, scales, positions, or combinations thereof. Equivariant neural networks are a solution to the problem of generalizing across symmetric transformations, but require knowledge of transformations a priori. An alternative family of architectures proposes to learn equivariant operators in a latent space from examples of symmetric transformations. Here, using simple datasets of rotated and translated noisy MNIST, we illustrate how such architectures can successfully be harnessed for out-of-distribution classification, thus overcoming the limitations of both traditional and equivariant networks. While conceptually enticing, we discuss challenges ahead on the path of scaling these architectures to more complex datasets.


[44] Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control cs.CVPDF

Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai

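TL;DR: This paper proposes a human-centric video world model that combines head pose and joint-level hand pose control to enable interactive video generation for simulating virtual environments in extended reality (XR).

Details

Motivation: Existing video world models accept only coarse control signals such as text or keyboard input and cannot respond to tracked real-world user motion, which limits their use for embodied interaction.

Result: In a human-subject evaluation, the system significantly improves both task performance and the perceived level of control compared with baseline methods.

Insight: The novelty lies in conditioning a diffusion transformer on tracked head and hand poses and in proposing an effective 3D control mechanism that enables dexterous hand-object interactions, with teacher-student distillation used to build a causal, interactive system.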

Abstract: Extended reality (XR) demands generative models that respond to users’ tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand–object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived amount of control over the performed actions compared with relevant baselines.


[45] CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation cs.CV | cs.ROPDF

Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di

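TL;DR: This paper introduces the CapNav benchmark for evaluating the indoor navigation performance of vision-language models conditioned on an agent's specific physical and operational capabilities. The benchmark defines five representative human and robot agents and includes 45 real indoor scenes, 473 navigation tasks, and 2365 QA pairs. The study finds that the navigation performance of current state-of-the-art VLMs drops sharply as mobility constraints tighten, and that they perform poorly on obstacle types that require reasoning about spatial dimensions.

Details

Motivation: Real-world navigation is constrained by an agent's mobility; for example, a sweeping robot cannot climb stairs while a quadruped can. The paper aims to evaluate how well VLMs make navigation decisions when these specific capabilities are taken into account.

Result: Thirteen modern VLMs are evaluated on CapNav. Navigation performance drops sharply as mobility constraints tighten, and even SOTA models struggle with obstacle types that require reasoning about spatial dimensions.

Insight: The novelty lies in the first benchmark for capability-conditioned navigation, which underscores the importance of accounting for an agent's physical constraints in embodied navigation and points to directions for advancing embodied spatial reasoning in future VLMs.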

Abstract: Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent’s mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent’s specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLMs’ navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav


[46] Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory cs.CVPDF

Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava

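TL;DR: This paper proposes MemStream, which scales the token budget and introduces an adaptive selection strategy together with a training-free retrieval mixture-of-experts to better retain and retrieve fine-grained spatiotemporal information in streaming video understanding, significantly improving video question answering performance.

Details

Motivation: Existing streaming video understanding methods lose fine-grained visual detail because they use a limited number of tokens per frame, and their feature encoding causes query-frame similarity to increase over time, which biases retrieval toward later frames. Both information retention and this bias need to be addressed for dense video streams.

Result: MemStream improves over ReKV with Qwen2.5-VL-7B by 8.0% on CG-Bench, 8.5% on LVBench, and 2.4% on VideoMME (Long), which this summary describes as SOTA-level.

Insight: The contributions include scaling the token budget for stronger spatiotemporal understanding, an adaptive selection strategy that reduces redundancy while preserving local information, and a training-free retrieval mixture-of-experts that uses external models to improve frame retrieval. These ideas are relevant to video stream processing and long-sequence modeling.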

Abstract: Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.


cs.LG [Back]

[47] Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving cs.LG | cs.CL | cs.ROPDF

Sutej Kulgod, Sean Ye, Sanchit Tanwar, Christoffer Heckman

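TL;DR: This paper addresses the severe text bias in synthetically generated multiple-choice question answering (MCQA) benchmarks used to evaluate vision-language models (VLMs) in autonomous driving, and proposes a debiasing method. By decoupling the correct answer from linguistic artifacts and applying a curriculum learning strategy, it forces the model to rely on visual grounding, so that measured performance more accurately reflects perceptual understanding.

Details

Motivation: Existing synthetic MCQA benchmarks for evaluating autonomous-driving VLMs can be answered by exploiting hidden textual cues (linguistic patterns) instead of visual context. This distorts evaluation results and fails to reflect a model's true visual understanding.

Result: Before debiasing, a VLM given text input only (no visual input) achieves accuracy on synthetic MCQAs comparable to that on human-validated benchmarks. After applying the proposed method, blind accuracy drops from +66.9% above random guessing to +2.9%, largely eliminating the exploitable textual shortcuts.

Insight: The core novelty is a text debiasing method for synthetic MCQAs that decouples answers from linguistic artifacts and combines this with curriculum learning to force visually grounded learning. It offers a practical approach to building VLM evaluation benchmarks that are more robust and better reflect true visual understanding.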

Abstract: Multiple Choice Question Answering (MCQA) benchmarks are an established standard for measuring Vision Language Model (VLM) performance in driving tasks. However, we observe the known phenomenon that synthetically generated MCQAs are highly susceptible to hidden textual cues that allow models to exploit linguistic patterns rather than visual context. Our results show that a VLM fine-tuned on such data can achieve accuracy comparable to human-validated benchmarks even without visual input. Our proposed method reduces blind accuracy from +66.9% above random to +2.9%, eliminating the vast majority of exploitable textual shortcuts. By decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy, we force the model to rely on visual grounding, ensuring that performance accurately reflects perceptual understanding.
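The "blind accuracy above random" figures can be read as the margin between a text-only model's accuracy and the chance rate for the number of answer options. The sketch below shows that arithmetic, plus one hypothetical text-only shortcut (always picking the longest option) of the kind a biased synthetic benchmark might reward. Both function names and the shortcut itself are illustrative assumptions; the paper's blind model is a fine-tuned VLM, not this heuristic, and the digest does not state how many options each question has.

```python
def margin_over_chance(correct_flags, num_options):
    """Accuracy minus the uniform-guess rate, as a fraction in [-1, 1]."""
    if not correct_flags:
        raise ValueError("need at least one answered question")
    accuracy = sum(correct_flags) / len(correct_flags)
    return accuracy - 1 / num_options


def longest_option_guess(options):
    """Hypothetical text-only shortcut: choose the index of the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))


# Three of four 4-option questions answered correctly without seeing the image.
margin = margin_over_chance([True, True, True, False], num_options=4)
```

A margin near zero after debiasing means the text alone no longer helps, so any remaining accuracy has to come from the visual input.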


[48] Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction cs.LG | cs.AI | cs.CL | cs.CVPDF

Melika Filvantorkaman, Mohsen Piri

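TL;DR: This paper proposes Robust-MMR, a robust pre-training framework for medical vision-language models. It integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to build robustness objectives directly into masked vision-language learning and encourage domain-invariant representations. The method significantly improves performance under domain shift on several medical vision-language benchmarks.

Details

Motivation: Existing medical vision-language models degrade under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles, while existing multi-modal pre-training methods largely overlook robustness and treat it as a downstream adaptation problem. The paper aims to incorporate robustness objectives explicitly into pre-training.

Result: The method is evaluated on medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR reaches 78.9% cross-domain accuracy on VQA-RAD, 3.8 percentage points above the strongest baseline, and 74.6% and 77.0% accuracy on SLAKE and VQA-2019 respectively. Under perturbed evaluation, VQA-RAD accuracy improves from 69.1% to 75.6%, cross-domain MELINDA accuracy rises from 70.3% to 75.2%, and retrieval experiments show mean rank degradation under perturbation falling from over 16 to 4.1. Qualitative results also show improved clinical reasoning for disease detection and structural abnormality assessment.

Insight: The main novelty is integrating robustness modeling explicitly into the pre-training framework instead of leaving it to downstream adaptation. The specific techniques are asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints, which are designed to encourage multimodal representations that are insensitive to domain changes and therefore more reliable and transferable in real-world deployment.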

Abstract: Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.


[49] Tethered Reasoning: Decoupling Entropy from Hallucination in Quantized LLMs via Manifold Steering cs.LG | cs.CL

Craig Atkinson

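TL;DR: This paper proposes HELIX, a framework that decouples output entropy from hallucination by anchoring the hidden-state trajectories of quantized language models to a pre-computed truthfulness manifold. The method preserves semantic coherence under high-temperature sampling, substantially reduces repetitive output, and increases concept diversity, while maintaining high accuracy on GSM8K and MMLU.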

Details

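Motivation: Quantized language models face a sampling-temperature dilemma: low temperatures produce repetitive output, while high temperatures cause trajectory divergence and semantic incoherence. The paper aims to separate output entropy from hallucination geometrically, enabling stable reasoning at high temperature.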

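Result: On 4-bit quantized Granite 4.0 H Small, GSM8K accuracy stays at 88.84% at T=3.0 (only 2.81pp below T=0.5), and MMLU accuracy stays at 72.49% over 14,042 questions (a 1.24pp drop). Cross-architecture validation (Qwen3-30B-A3B MOE) shows 46.7% higher unique concept generation, and Multi-Temperature Synthesis generates 200% more unique concepts than single-temperature inference.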

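Insight: The core idea is decoupling entropy from hallucination through geometric anchoring (manifold steering), which reveals a High-Entropy Creative Reservoir at high temperature. Steering only the sparse Transformer attention layers (about 10% of layers) is enough to correct drift in the Mamba-2 state-space formulation, allowing exploration of semantic diversity while the logical backbone stays intact.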

Abstract: Quantized language models face a fundamental dilemma: low sampling temperatures yield repetitive, mode-collapsed outputs, while high temperatures (T > 2.0) cause trajectory divergence and semantic incoherence. We present HELIX, a geometric framework that decouples output entropy from hallucination by tethering hidden-state trajectories to a pre-computed truthfulness manifold. HELIX computes a Unified Truth Score (UTS) combining token-level semantic entropy with Mahalanobis distance from the manifold. When UTS indicates trajectory divergence, graduated steering vectors redirect activations toward structurally coherent regions while affecting only 0.2-2.5% of tokens. On 4-bit quantized Granite 4.0 H Small (32B/9B active, hybrid Mamba-Transformer): GSM8K maintains 88.84% accuracy at T = 3.0 (2.81pp degradation from T = 0.5); MMLU maintains 72.49% across 14,042 questions (1.24pp degradation). This demonstrates that high-temperature hallucination is primarily trajectory divergence rather than semantic collapse. Notably, steering the sparse Transformer attention layers (~10% of layers) is sufficient to correct drift in the Mamba-2 state-space formulation. Geometric tethering reveals a previously-masked High-Entropy Creative Reservoir. At T > 2.0, steered outputs exhibit 5-20% idea duplication versus 70-80% at conservative settings. Cross-architecture validation (Qwen3-30B-A3B MOE) confirms this phenomenon is architecture-independent, with 46.7% higher unique concept generation. HELIX acts as a syntax tether, enabling exploration of semantic diversity without violating the logical backbone required for valid output. This enables Multi-Temperature Synthesis, generating 200% more unique concepts than single-temperature inference.
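
The two ingredients the abstract names for the Unified Truth Score, an entropy term and a Mahalanobis distance from a reference set of hidden states, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: plain Shannon entropy stands in for "semantic entropy", a Gaussian fit stands in for the truthfulness manifold, and the weighted sum in `toy_score` is hypothetical, since the abstract does not state how HELIX combines the terms.

```python
import numpy as np

def mahalanobis(h, mean, cov):
    """Mahalanobis distance of hidden state h from a Gaussian fit (mean, cov)."""
    diff = h - mean
    # Solve cov @ x = diff rather than forming an explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def token_entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def toy_score(h, mean, cov, probs, alpha=0.5):
    # Hypothetical combination (weighted sum); not taken from the paper.
    return alpha * token_entropy(probs) + (1 - alpha) * mahalanobis(h, mean, cov)

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))      # stand-in for reference hidden states
mean = reference.mean(axis=0)
cov = np.cov(reference, rowvar=False)

near = mean.copy()                         # a state sitting on the reference mean
far = mean + 5.0                           # a state that has drifted away
uniform = np.full(8, 1 / 8)                # maximally uncertain next-token distribution
```

With the same token distribution, `toy_score(far, ...)` exceeds `toy_score(near, ...)`, which is the kind of signal a steering rule could threshold on.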


[50] A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU cs.LG | cs.AI | cs.CL

Yuchen Luo, Fangyue Zhu, Ruining Zhou, Mingzhe Huang, Jian Zhu

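TL;DR: This paper presents a case study of post-training quantization (PTQ) for reasoning LLMs on Ascend NPU, evaluating four algorithms (AWQ, GPTQ, SmoothQuant, and FlatQuant) on the DeepSeek-R1-Distill-Qwen series and QwQ-32B. It finds that 4-bit weight-only quantization is viable for larger models, but 4-bit weight-activation quantization suffers from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks, whereas 8-bit quantization remains numerically stable. A real-world INT8 deployment shows that although optimized kernels reduce latency, dynamic quantization overhead limits end-to-end acceleration.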

Details

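Motivation: Post-training quantization is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared with GPU architectures. Through a case study, the paper evaluates representative PTQ baselines on reasoning LLMs to provide a feasibility reference for deploying quantized models on Ascend NPU.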

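Result: Experiments on Ascend NPU show that 4-bit weight-only quantization is viable for larger models (e.g., 14B/32B); 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks; standard 8-bit quantization remains numerically stable; and in INT8 deployment, optimized kernels reduce latency but dynamic quantization overhead limits end-to-end acceleration.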

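Insight: The paper's novelty lies in the first systematic evaluation of multiple PTQ algorithms for reasoning LLMs on Ascend NPU, with platform sensitivity (such as the instability of 4-bit weight-activation quantization on the NPU) as its key finding. Viewed objectively, the study offers an important empirical reference for quantized deployment on a specific hardware target (Ascend NPU), underlines how hardware architecture affects the effectiveness of a quantization scheme, and identifies dynamic quantization overhead as the current bottleneck for end-to-end acceleration.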

Abstract: Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.
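
To make the 4-bit versus 8-bit contrast concrete, here is a minimal sketch of plain round-to-nearest symmetric weight quantization with a single per-tensor scale. This is an assumed baseline for illustration only; it is not AWQ, GPTQ, SmoothQuant, or FlatQuant, and it says nothing about NPU kernels or activation quantization.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax             # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # toy weight matrix

errors = {}
for bits in (8, 4):
    q, scale = quantize_symmetric(w, bits)
    # Mean squared reconstruction error at this bit width.
    errors[bits] = float(np.mean((w - dequantize(q, scale)) ** 2))
```

On this toy matrix `errors[4]` is larger than `errors[8]`, since the 4-bit grid has only 15 levels; the calibration-based methods in the paper exist to shrink that gap.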


[51] Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards cs.LG | cs.AI | cs.CL

Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama

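TL;DR: The paper proposes gradient regularization (GR) to prevent reward hacking in Reinforcement Learning from Human Feedback (RLHF) and from Verifiable Rewards (RLVR). GR biases policy updates toward flat regions where the reward model is more accurate, so the policy is less able to exploit reward inaccuracies and learn unintended behavior.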

Details

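Motivation: Reward hacking is a common problem in RLHF/RLVR: the policy may exploit inaccuracies in the reward and learn unintended behavior. Prior methods limit policy updates with a KL penalty, whereas this paper addresses the problem more directly by biasing training toward regions where the reward is more accurate.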

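Result: Across a diverse set of RL experiments with LMs, GR outperforms a KL penalty. It achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on format under rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.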

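Insight: The contribution is a theoretical connection between reward model accuracy and the flatness of the optimum at convergence, together with explicit gradient regularization (using an efficient finite-difference estimate) that biases training toward flat regions and thereby maintains reward accuracy. Viewed objectively, the method offers a more general alternative to the KL penalty that targets the root cause of reward hacking directly.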

Abstract: Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: Train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing to use explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
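
One common way to regularize the gradient norm without computing second derivatives is a finite-difference estimate of the Hessian-vector product. The sketch below applies that idea to a toy quadratic loss; the objective `L + lam * ||grad L||`, the step size `eps`, and the names `grad_loss` and `gr_gradient` are assumptions for illustration, and the paper's exact estimator and its use inside an RL policy update may differ.

```python
import numpy as np

# Toy quadratic loss: L(theta) = 0.5 * theta^T A theta, so grad L = A @ theta.
A = np.array([[3.0, 0.5], [0.5, 1.0]])

def grad_loss(theta):
    return A @ theta

def gr_gradient(theta, lam=0.1, eps=1e-3):
    """Gradient of L(theta) + lam * ||grad L(theta)|| via a finite difference."""
    g = grad_loss(theta)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g                       # at a stationary point the penalty adds nothing here
    unit = g / norm
    # (grad L(theta + eps * unit) - grad L(theta)) / eps approximates H @ unit,
    # which is the gradient of ||grad L(theta)||; it costs one extra gradient call.
    hvp = (grad_loss(theta + eps * unit) - g) / eps
    return g + lam * hvp

theta = np.array([1.0, -2.0])
g = grad_loss(theta)
exact = g + 0.1 * (A @ (g / np.linalg.norm(g)))   # analytic answer for the quadratic
approx = gr_gradient(theta)
```

For a quadratic loss the finite difference matches the analytic Hessian-vector product up to floating-point error, so `approx` and `exact` agree; on a real network the estimate is only first-order accurate in `eps`.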


[52] Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory cs.LG | cs.AI | cs.CL | cs.IT

Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos

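TL;DR: This paper uses information theory to analyze the monitorability of chain-of-thought (CoT) monitors. It shows that non-zero mutual information between the CoT and the output is a necessary but not sufficient condition, and identifies two sources of approximation error: information gap and elicitation error. It proposes two targeted training methods, an oracle-based direct reward and a label-free conditional mutual information maximization, to improve monitor accuracy and prevent CoT degeneration, and validates them across multiple environments.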

Details

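Motivation: Existing CoT monitors are limited in practice by approximation errors such as information gap and elicitation error. The paper aims to improve the monitorability and robustness of CoT monitoring systematically.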

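Result: Across multiple environments, the proposed methods significantly improve monitor accuracy (specific benchmarks are not named) while preventing CoT degeneration, mitigating reward hacking when the task reward is imperfectly specified.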

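Insight: The contribution is a formal information-theoretic analysis of the conditions for CoT monitorability, plus two practical training objectives for improving monitor performance. Viewed objectively, the work gives CoT-based monitoring a theoretical framework and actionable improvements, which helps make LLM reasoning more reliable and controllable.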

Abstract: Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: information gap, which measures the extent to which the monitor can extract the information available in CoT, and elicitation error, which measures the extent to which the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives. To this end, we propose two complementary approaches: (a) an oracle-based method that directly rewards the monitored model for producing CoTs that maximize monitor accuracy, and (b) a more practical, label-free approach that maximizes conditional mutual information between outputs and CoTs. Across multiple different environments, we show both methods significantly improve monitor accuracy while preventing CoT degeneration even when training against a monitor, thereby mitigating reward hacking when the task reward is imperfectly specified.
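
The quantities the abstract relies on, mutual information and conditional mutual information, can be computed directly for small discrete variables. The sketch below assumes a binary CoT feature and a binary output attribute, and treats the conditioning variable `x` (for instance the prompt) as an assumption; it illustrates the definitions only, not the paper's training objectives.

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in bits from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                     # normalise to a distribution
    px = joint.sum(axis=1, keepdims=True)           # marginal over rows
    py = joint.sum(axis=0, keepdims=True)           # marginal over columns
    mask = joint > 0                                # skip zero cells (0 * log 0 = 0)
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

def conditional_mutual_information(joint_xyz):
    """I(Y; Z | X) in bits from p(x, y, z), with x on axis 0."""
    joint_xyz = np.asarray(joint_xyz, dtype=float)
    joint_xyz = joint_xyz / joint_xyz.sum()
    total = 0.0
    for slab in joint_xyz:                          # slab is p(x, y, z) for one fixed x
        px = slab.sum()
        if px > 0:
            # mutual_information renormalises the slab to p(y, z | x).
            total += px * mutual_information(slab)
    return total

# Rows: CoT feature present / absent. Columns: output attribute present / absent.
independent = np.array([[0.25, 0.25], [0.25, 0.25]])   # CoT says nothing about the output
informative = np.array([[0.5, 0.0], [0.0, 0.5]])       # CoT determines the output
mixed = np.stack([informative, independent])           # two equally likely contexts
```

`independent` gives 0 bits, so no monitor could succeed on it, while `informative` gives 1 bit; the abstract's point is that the second case is necessary for monitorability but a real monitor may still fail to extract that information.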


[53] On the “Induction Bias” in Sequence Models cs.LG | cs.CL

M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic

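TL;DR: Through a large-scale experimental study, this paper compares the data efficiency of Transformers and recurrent neural networks (RNNs) on state-tracking tasks. As state-space size and sequence length grow, the training data Transformers need grows far faster than for RNNs. Transformers also share almost no weights across sequence lengths, so they learn length-specific solutions in isolation, whereas RNNs achieve effective amortized learning by sharing weights across lengths.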

Details

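Motivation: Despite the notable success of Transformers in language modeling, recent work has questioned their state-tracking ability, mainly through limitations in out-of-distribution generalization such as length extrapolation. This paper examines the in-distribution consequences of those limitations, that is, the fundamental difficulty Transformers have with state tracking even when training and evaluation distributions match.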

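Result: Experiments show that as state-space size and sequence length increase, the data requirements of Transformers grow much faster than those of RNNs. Weight sharing across sequence lengths in Transformers is negligible or even detrimental, whereas RNNs share weights effectively and use data from one length to improve performance on others.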

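Insight: The paper's novelty is shifting attention from out-of-distribution generalization to in-distribution effects. It systematically exposes structural problems in Transformers on state-tracking tasks, namely low data efficiency and a lack of cross-length generalization, highlights the advantage of recurrent models in amortized learning, and offers a new perspective on the inductive biases of sequence models.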

Abstract: Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
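
A state-tracking task with a tunable state-space size and sequence length can be sketched as a running sum modulo `num_states`. The abstract does not name the tasks used, so this particular task and the function names are assumptions for illustration; the point is only that both knobs mentioned in the abstract are explicit parameters.

```python
import numpy as np

def make_state_tracking_batch(num_sequences, length, num_states, seed=0):
    """Inputs are random tokens; the target at each step is the running sum mod num_states."""
    rng = np.random.default_rng(seed)
    inputs = rng.integers(0, num_states, size=(num_sequences, length))
    states = np.cumsum(inputs, axis=1) % num_states
    return inputs, states

def recurrent_reference(tokens, num_states):
    """Reference solution with a single carried state, the way a recurrent model would track it."""
    state, trace = 0, []
    for token in tokens:
        state = (state + int(token)) % num_states   # one update rule reused at every step
        trace.append(state)
    return trace

inputs, states = make_state_tracking_batch(num_sequences=4, length=6, num_states=5)
```

The same `recurrent_reference` update works for any sequence length, which mirrors the amortized, length-shared solution the abstract attributes to recurrent models.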


eess.IV

[54] Promptable segmentation with region exploration enables minimal-effort expert-level prostate cancer delineation eess.IV | cs.CV

Junqing Yang, Natasha Thorley, Ahmed Nadeem Abbasi, Shonit Punwani, Zion Tse

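TL;DR: This paper proposes an interactive prostate cancer segmentation framework driven by user point prompts and reinforcement learning. It iteratively refines the segmentation by combining region growing with reinforcement learning, substantially reduces annotation effort, and reaches accuracy close to expert manual segmentation on two public datasets.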

Details

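Motivation: Segmenting prostate cancer on magnetic resonance images is challenging because tumour appearance varies, imaging protocols differ, and expert annotation is time-consuming and labour-intensive. The work aims to bridge the gap between automated and manual segmentation using user-provided point prompts.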

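Result: Evaluated on two public prostate MR datasets, PROMIS and PICAI (566 and 1090 cases, respectively), the framework outperforms the previous best automated methods by 9.9% and 8.9%, reaches performance comparable to manual radiologist segmentation, and reduces annotation time tenfold.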

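Insight: The novelty is combining reinforcement learning with region growing. A reward that balances segmentation accuracy and uncertainty guides the model to explore ambiguous regions and enables sample-specific optimisation. At inference the framework needs only a small amount of user interaction to reach expert-level segmentation, offering an efficient human-machine collaboration scheme for medical image segmentation.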

Abstract: Purpose: Accurate segmentation of prostate cancer on magnetic resonance (MR) images is crucial for planning image-guided interventions such as targeted biopsies, cryoablation, and radiotherapy. However, subtle and variable tumour appearances, differences in imaging protocols, and limited expert availability make consistent interpretation difficult. While automated methods aim to address this, they rely on large expertly-annotated datasets that are often inconsistent, whereas manual delineation remains labour-intensive. This work aims to bridge the gap between automated and manual segmentation through a framework driven by user-provided point prompts, enabling accurate segmentation with minimal annotation effort. Methods: The framework combines reinforcement learning (RL) with a region-growing segmentation process guided by user prompts. Starting from an initial point prompt, region-growing generates a preliminary segmentation, which is iteratively refined through RL. At each step, the RL agent observes the image and current segmentation to predict a new point, from which region growing updates the mask. A reward, balancing segmentation accuracy and voxel-wise uncertainty, encourages exploration of ambiguous regions, allowing the agent to escape local optima and perform sample-specific optimisation. Despite requiring fully supervised training, the framework bridges manual and fully automated segmentation at inference by substantially reducing user effort while outperforming current fully automated methods. Results: The framework was evaluated on two public prostate MR datasets (PROMIS and PICAI, with 566 and 1090 cases). It outperformed the previous best automated methods by 9.9% and 8.9%, respectively, with performance comparable to manual radiologist segmentation, reducing annotation time tenfold.
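
The region-growing step that turns a point prompt into a mask can be sketched in a few lines. The version below is a generic 2D, 4-connected, intensity-tolerance region grower; the tolerance criterion, the 2D setting, and the toy image are assumptions for illustration, and the paper's actual growing rule on 3D MR volumes and its RL agent are not reproduced.

```python
from collections import deque
import numpy as np

def region_grow(image, seed, tolerance):
    """Grow a 4-connected region from `seed`, adding pixels whose intensity
    is within `tolerance` of the seed intensity."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_value = float(image[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < h and 0 <= nc < w
            if inside and not mask[nr, nc]:
                if abs(float(image[nr, nc]) - seed_value) <= tolerance:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask

image = np.zeros((6, 6))
image[1:4, 1:4] = 1.0        # bright 3x3 block standing in for a lesion
image[5, 5] = 1.0            # bright pixel that is not connected to the block
mask = region_grow(image, seed=(2, 2), tolerance=0.1)
```

Here `mask` covers the 3x3 block and excludes the disconnected bright pixel. In the framework described above, each new point proposed by the agent would trigger another growing pass that updates the mask.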


[55] Exploiting Completeness Perception with Diffusion Transformer for Unified 3D MRI Synthesis eess.IV | cs.CV

Junkai Liu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen

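TL;DR: This paper proposes CoPeDiT, a unified 3D MRI synthesis model based on a diffusion Transformer. A completeness perception mechanism lets the model infer the missing state on its own (such as missing modalities in multi-modal brain MRI or missing slices in cardiac MRI), so it can generate high-quality missing MRI data without external manual indicators.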

Details

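Motivation: Clinical practice faces missing-data problems such as missing modalities in multi-modal brain MRI and missing slices in cardiac MRI. Existing methods rely on external manual indicators to guide generative models, but these indicators may be unavailable or unreliable in real clinical settings and are not informative enough to ensure semantic consistency.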

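Result: Comprehensive evaluations on three large-scale MRI datasets show that CoPeDiT significantly outperforms state-of-the-art methods, with superior robustness, generalizability, and flexibility.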

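Insight: The contributions are: 1) a completeness perception mechanism that lets the generative model perceive missing states itself and better capture subtle anatomical and pathological variations; 2) the CoPeVAE tokenizer, which learns completeness-aware discriminative prompts through dedicated pretext tasks; and 3) the MDiT3D diffusion Transformer architecture, which uses the learned prompts to enhance semantic consistency in 3D space. Viewed objectively, the self-perceptive approach reduces reliance on external guidance and makes the model more practical in complex clinical environments.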

Abstract: Missing data problems, such as missing modalities in multi-modal brain MRI and missing slices in cardiac MRI, pose significant challenges in clinical practice. Existing methods rely on external guidance to supply detailed missing state for instructing generative models to synthesize missing MRIs. However, manual indicators are not always available or reliable in real-world scenarios due to the unpredictable nature of clinical environments. Moreover, these explicit masks are not informative enough to provide guidance for improving semantic consistency. In this work, we argue that generative models should infer and recognize missing states in a self-perceptive manner, enabling them to better capture subtle anatomical and pathological variations. Towards this goal, we propose CoPeDiT, a general-purpose latent diffusion model equipped with completeness perception for unified synthesis of 3D MRIs. Specifically, we incorporate dedicated pretext tasks into our tokenizer, CoPeVAE, empowering it to learn completeness-aware discriminative prompts, and design MDiT3D, a specialized diffusion transformer architecture for 3D MRI synthesis, that effectively uses the learned prompts as guidance to enhance semantic consistency in 3D space. Comprehensive evaluations on three large-scale MRI datasets demonstrate that CoPeDiT significantly outperforms state-of-the-art methods, achieving superior robustness, generalizability, and flexibility. The code is available at https://github.com/JK-Liu7/CoPeDiT .


cs.HC

[56] Assessing LLM Response Quality in the Context of Technology-Facilitated Abuse cs.HC | cs.AI | cs.CL | cs.CR | cs.CY

Vijay Prakash, Majed Almansoori, Donghan Hu, Rahul Chatterjee, Danny Yuxing Huang

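TL;DR: This study presents the first expert-led manual evaluation of the response quality of four large language models (two general-purpose non-reasoning models and two domain-specific models designed for intimate partner violence contexts) in the setting of technology-facilitated abuse (TFA). Using real-world questions collected from literature and online forums, it assesses zero-shot single-turn responses generated with a survivor safety-centered prompt, and runs a user study on how actionable people who have experienced TFA perceive the responses to be.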

Details

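Motivation: Technology-facilitated abuse is a pervasive form of intimate partner violence (IPV). Because tech clinics are limited by staffing constraints and logistical barriers, survivors often turn to online resources. With LLMs becoming more accessible and IPV organizations showing growing interest, survivors may consult LLM-based chatbots before seeking help from tech clinics, so the quality of LLM responses in this critical domain needs to be assessed.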

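Result: Combining expert assessment and user feedback, the study reveals the current capabilities and limitations of LLMs in the TFA context, informs the design, development, and fine-tuning of future models, and offers concrete recommendations for improving LLM performance in survivor support.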

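Insight: The novelty is the first expert-led evaluation of LLM responses in the TFA domain, using a survivor safety-centered prompt and evaluation criteria tailored to TFA. By pairing expert assessment with user-perceived actionability, it provides an evaluation framework and directions for improvement for LLM applications in high-stakes domains.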

Abstract: Technology-facilitated abuse (TFA) is a pervasive form of intimate partner violence (IPV) that leverages digital tools to control, surveil, or harm survivors. While tech clinics are one of the reliable sources of support for TFA survivors, they face limitations due to staffing constraints and logistical barriers. As a result, many survivors turn to online resources for assistance. With the growing accessibility and popularity of large language models (LLMs), and increasing interest from IPV organizations, survivors may begin to consult LLM-based chatbots before seeking help from tech clinics. In this work, we present the first expert-led manual evaluation of four LLMs - two widely used general-purpose non-reasoning models and two domain-specific models designed for IPV contexts - focused on their effectiveness in responding to TFA-related questions. Using real-world questions collected from literature and online forums, we assess the quality of zero-shot single-turn LLM responses generated with a survivor safety-centered prompt on criteria tailored to the TFA domain. Additionally, we conducted a user study to evaluate the perceived actionability of these responses from the perspective of individuals who have experienced TFA. Our findings, grounded in both expert assessment and user feedback, provide insights into the current capabilities and limitations of LLMs in the TFA context and may inform the design, development, and fine-tuning of future models for this domain. We conclude with concrete recommendations to improve LLM performance for survivor support.


cs.IR

[57] IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering cs.IR | cs.AI | cs.CL | cs.LG

Connor Shorten, Augustas Skaburskas, Daniel M. Jones, Charles Pierse, Roberto Esposito

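TL;DR: This paper introduces IRPAPERS, a visual document benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, it systematically compares image-based and text-based retrieval and question answering systems. The two modalities are complementary in retrieval, so multimodal hybrid search outperforms either alone; in question answering, text-based RAG systems achieve higher ground-truth alignment than image-based systems.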

Details

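Motivation: AI systems have achieved remarkable success on text and relational data, but visual document processing remains relatively underexplored. Traditional systems rely on OCR to convert visual documents into text and metadata, while recent multimodal foundation models allow retrieval and generation directly from document images. This raises a key question: how do image-based systems compare with established text-based methods?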

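Result: On IRPAPERS, text-based retrieval (Arctic 2.0 embeddings, BM25, and hybrid text search) achieves 46%, 78%, and 91% at Recall@1, Recall@5, and Recall@20; image-based retrieval reaches 43%, 78%, and 93%. Multimodal hybrid search achieves 49%, 81%, and 95%. Among closed-source models, Cohere Embed v4 page image embeddings perform best (58% Recall@1). In question answering, text-based RAG systems achieve higher ground-truth alignment than image-based systems (0.82 vs. 0.71).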

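Insight: The paper's novelty is IRPAPERS, a visual document benchmark built specifically for scientific document retrieval and question answering, and the first systematic comparison of image and text modalities on such tasks. Its core insight is that text and image representations have complementary limitations in document understanding and that multimodal hybrid strategies improve performance, which points to an important direction for future document intelligence systems. The study also identifies which question types depend more on a particular modality, which is of practical value.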

Abstract: AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybrid search to outperform either alone, achieving 49% Recall@1, 81% Recall@5, and 95% Recall@20. We further evaluate efficiency-performance tradeoffs with MUVERA and assess multiple multi-vector image embedding models. Among closed-source models, Cohere Embed v4 page image embeddings outperform Voyage 3 Large text embeddings and all tested open-source models, achieving 58% Recall@1, 87% Recall@5, and 97% Recall@20. For question answering, text-based RAG systems achieved higher ground-truth alignment than image-based systems (0.82 vs. 0.71), and both benefit substantially from increased retrieval depth, with multi-document retrieval outperforming oracle single-document retrieval. We analyze the complementary limitations of unimodal text and image representations and identify question types that require one modality over the other. The IRPAPERS dataset and all experimental code are publicly available.
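
The Recall@k metric and the idea of fusing a text ranking with an image ranking can be shown with a short sketch. Two assumptions are made loudly here: each question has exactly one relevant page, and the fusion rule is reciprocal rank fusion (RRF) with the conventional constant 60, which the abstract does not confirm as the benchmark's hybrid method. The page ids are made up.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the single relevant page appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists by summing 1 / (k + rank) for each document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings for one question whose relevant page is "p2".
text_ranking = ["p7", "p2", "p9", "p4"]
image_ranking = ["p2", "p5", "p7", "p1"]
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
```

In this toy case the text ranking misses at rank 1 while the fused list puts "p2" first, which is the complementary-failure effect the abstract describes. Averaging `recall_at_k` over all questions gives the reported Recall@k.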


[58] When & How to Write for Personalized Demand-aware Query Rewriting in Video Search cs.IR | cs.CV | cs.LG

Cheng cheng, Chenxing Wang, Aolin Li, Haijun Wu, Huiyun Hu

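TL;DR: This paper proposes WeWrite, a novel personalized demand-aware query rewriting framework for video search. An automated mining strategy decides when personalized rewriting is needed, a hybrid training paradigm (SFT combined with GRPO) guides how to rewrite, and a parallel architecture keeps deployment latency low. Online A/B testing shows that the method increases video click volume and reduces the query reformulation rate.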

Details

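Motivation: Traditional video search systems that use implicit features from user history often suffer from signal dilution and delayed feedback, which makes it hard to identify search intent precisely and resolve ambiguity.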

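Result: In online A/B testing on a large-scale video platform, WeWrite improves the Click-Through Video Volume for views longer than 10 seconds (VV>10s) by 1.07% and reduces the Query Reformulation Rate by 2.97%.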

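Insight: The novelty is a systematic treatment of "when to write" and "how to write" in personalized query rewriting: an automated posterior-based mining strategy selects high-quality samples, and a hybrid training paradigm combining supervised fine-tuning with Group Relative Policy Optimization aligns the LLM with the retrieval system. The proposed parallel "Fake Recall" architecture also ensures low-latency deployment.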

Abstract: In video search systems, user historical behaviors provide rich context for identifying search intent and resolving ambiguity. However, traditional methods utilizing implicit history features often suffer from signal dilution and delayed feedback. To address these challenges, we propose WeWrite, a novel Personalized Demand-aware Query Rewriting framework. Specifically, WeWrite tackles three key challenges: (1) When to Write: An automated posterior-based mining strategy extracts high-quality samples from user logs, identifying scenarios where personalization is strictly necessary; (2) How to Write: A hybrid training paradigm combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to align the LLM’s output style with the retrieval system; (3) Deployment: A parallel “Fake Recall” architecture ensures low latency. Online A/B testing on a large-scale video platform demonstrates that WeWrite improves the Click-Through Video Volume (VV>10s) by 1.07% and reduces the Query Reformulation Rate by 2.97%.
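
The "group relative" part of GRPO can be sketched as normalising the rewards of several sampled rewrites for the same query against their own group statistics. The reward values below are made up, and the paper's reward design, clipping, and any KL term are not shown; this is only the advantage computation, under those stated assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples drawn for the same query."""
    rewards = np.asarray(rewards, dtype=float)
    # eps keeps the division finite when every sample received the same reward.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four candidate rewrites of a single query.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Rewrites that beat the group average get positive advantages and the rest get negative ones, so no separate value network is needed to provide a baseline.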


cs.AI

[59] Epistemic Traps: Rational Misalignment Driven by Model Misspecification cs.AI | cs.CL | cs.LG

Xingcheng Xu, Jingjing Qu, Qiaosheng Zhang, Chaochao Lu, Yanqing Yang

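TL;DR: The paper proposes a theoretical framework that explains behavioral pathologies observed in large language models and AI agents (such as sycophancy, hallucination, and strategic deception) as rational behavior under model misspecification rather than as training defects. By bringing Berk-Nash Rationalizability from economics into AI, the authors argue that these unsafe behaviors are structural necessities of an agent optimizing against a flawed subjective world model.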

Details

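Motivation: Current safety paradigms treat AI behavioral failures as fixable training artifacts and lack a unified theoretical framework for their emergence and stability. The paper aims to give a fundamental theoretical explanation for these persistent pathologies, which resist mitigation via reinforcement learning.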

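Result: The theoretical predictions are validated through behavioral experiments on six state-of-the-art model families, producing phase diagrams that precisely map the topological boundaries of safe behavior. The results indicate that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude.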

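Insight: The core contribution is bringing model misspecification and the Berk-Nash Rationalizability framework to the AI alignment problem, showing that unsafe behaviors are rational equilibria. This establishes Subjective Model Engineering (the design of an agent's internal belief structure) as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.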

Abstract: The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a “locked-in” equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent’s epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent’s internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent’s interpretation of reality.


cs.RO

[60] RoEL: Robust Event-based 3D Line Reconstruction cs.RO | cs.CV

Gwangtak Bae, Jaeho Shin, Seunggu Kang, Junho Kim, Ayoung Kim

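TL;DR: This paper proposes RoEL, a robust 3D line reconstruction method for event cameras. It stably extracts tracks of lines with varying appearance by observing multiple representations from different time slices of the event data, and designs geometric cost functions that eliminate projective distortions and depth ambiguities in order to refine the 3D line map and camera poses. The method significantly improves event-based mapping and pose refinement across diverse datasets and can be applied flexibly to multimodal scenarios.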

Details

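Motivation: Event cameras in motion tend to detect object boundaries or texture edges, which produce lines of brightness changes, especially in man-made environments. Lines can form a robust, consistently observed intermediate representation, but their sparsity means that minor estimation errors can cause drastic deterioration. Few prior works use line features, and those that do often rely on additional sensors to compensate for the severe domain discrepancies and unpredictable noise characteristics of event sensors.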

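Result: The method shows a significant performance boost in event-based mapping and pose refinement across multiple datasets. The abstract gives no specific quantitative results (such as accuracy figures) or direct comparisons with state-of-the-art models on named benchmarks, but it emphasizes effectiveness across datasets and flexible applicability to multimodal scenarios.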

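Insight: The main contributions are: 1) an algorithmic pipeline that stably extracts line tracks by observing multiple representations from different time slices of the event data, compensating for potential adversaries within that data; 2) geometric cost functions that eliminate projective distortions and depth ambiguities and jointly refine the 3D line map and camera poses; and 3) highly compact 3D line maps whose cost function can be adapted to any observation from which line structures or their projections can be detected and extracted (such as 3D point cloud maps or image observations), giving the method generality and multimodal applicability.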

Abstract: Event cameras in motion tend to detect object boundaries or texture edges, which produce lines of brightness changes, especially in man-made environments. While lines can constitute a robust intermediate representation that is consistently observed, the sparse nature of lines may lead to drastic deterioration with minor estimation errors. Only a few previous works, often accompanied by additional sensors, utilize lines to compensate for the severe domain discrepancies of event sensors along with unpredictable noise characteristics. We propose a method that can stably extract tracks of varying appearances of lines using a clever algorithmic process that observes multiple representations from various time slices of events, compensating for potential adversaries within the event data. We then propose geometric cost functions that can refine the 3D line maps and camera poses, eliminating projective distortions and depth ambiguities. The 3D line maps are highly compact and can be equipped with our proposed cost function, which can be adapted for any observations that can detect and extract line structures or projections of them, including 3D point cloud maps or image observations. We demonstrate that our formulation is powerful enough to exhibit a significant performance boost in event-based mapping and pose refinement across diverse datasets, and can be flexibly applied to multimodal scenarios. Our results confirm that the proposed line-based formulation is a robust and effective approach for the practical deployment of event-based perceptual modules. Project page: https://gwangtak.github.io/roel/