cs.CL

[1] References Improve LLM Alignment in Non-Verifiable Domains cs.CL | cs.AI | cs.LG

Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty

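TL;DR: This paper studies how reference-guided LLM evaluators can act as soft "verifiers" to improve alignment in non-verifiable domains, such as LLM alignment, that lack ground-truth verifiers. Using purpose-built evaluation protocols, it shows that reference guidance substantially improves the accuracy of weaker LLM evaluators, and it builds on this to perform reference-guided self-improvement that outperforms both direct supervised fine-tuning and reference-free self-improvement on multiple benchmarks.

Details

Motivation: Non-verifiable domains such as LLM alignment lack ground-truth verifiers, so Reinforcement Learning with Verifiable Rewards (RLVR) cannot be applied directly. The paper asks whether reference-guided LLM evaluators can close this gap by acting as soft verifiers.

Result: On AlpacaEval and Arena-Hard, the method reaches 73.1% and 58.7% with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B. Average absolute gains are +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement. Performance is comparable to training with ArmoRM, a strong fine-tuned reward model.

Insight: The contribution is to propose and validate reference-guided LLM evaluators as soft verifiers in non-verifiable domains, in particular by strengthening evaluators with references from frontier models or high-quality human-written references, and then to build an effective reference-guided self-improvement alignment method on top of them. This offers a practical post-training recipe for domains that lack a ground-truth reward signal.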

Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft “verifiers”. First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
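To make the idea of a reference-guided judge concrete, here is a minimal sketch of how preference pairs for self-improvement might be built. The prompt wording, the position-swap consistency check, and the names `build_judge_prompt`, `parse_verdict`, and `make_preference_pair` are all illustrative assumptions, not the paper's actual protocol; `judge` stands for any callable that sends a prompt to an LLM and returns its text.

```python
from typing import Callable, Dict, Optional


def build_judge_prompt(instruction: str, reference: str, resp_a: str, resp_b: str) -> str:
    """Hypothetical reference-guided pairwise prompt (not the paper's wording)."""
    return (
        "Compare two responses to the instruction below.\n"
        "Treat the reference output as a guide to what a good answer looks like.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Reference output:\n{reference}\n\n"
        f"Response A:\n{resp_a}\n\n"
        f"Response B:\n{resp_b}\n\n"
        "Answer with a single letter: A or B."
    )


def parse_verdict(judge_output: str) -> str:
    """Extract 'A' or 'B' from the judge's reply."""
    text = judge_output.strip().upper()
    if text[:1] in ("A", "B"):
        return text[0]
    raise ValueError(f"unparseable verdict: {judge_output!r}")


def make_preference_pair(
    instruction: str,
    reference: str,
    resp_a: str,
    resp_b: str,
    judge: Callable[[str], str],
) -> Optional[Dict[str, str]]:
    """Return a chosen/rejected pair, or None if the judge is inconsistent.

    Assumption: the judge is queried in both presentation orders and the
    pair is kept only when both verdicts agree, to reduce position bias.
    """
    first = parse_verdict(judge(build_judge_prompt(instruction, reference, resp_a, resp_b)))
    second = parse_verdict(judge(build_judge_prompt(instruction, reference, resp_b, resp_a)))
    if first == "A" and second == "B":
        chosen, rejected = resp_a, resp_b
    elif first == "B" and second == "A":
        chosen, rejected = resp_b, resp_a
    else:
        return None  # verdict flipped with position, so discard the pair
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}
```

Pairs in this `prompt` / `chosen` / `rejected` shape can feed a preference-optimization step; which training objective the authors actually use is not stated in the abstract.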


[2] Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History cs.CL | cs.AI

Serin Kim, Sangam Lee, Dongha Lee

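TL;DR: This paper presents Persona2Web, the first benchmark for evaluating personalized web agents on the real open web. The benchmark is built on the clarify-to-personalize principle, which requires agents to resolve ambiguous queries from user history rather than from explicit instructions. It comprises user histories that reveal long-term preferences, ambiguous queries that require inferring user preferences, and a reasoning-aware evaluation framework that supports fine-grained assessment.

Details

Motivation: Current LLM-based web agents lack personalization and cannot reliably interpret ambiguous intent when users leave details unspecified. The authors built this benchmark to address the challenge of inferring preferences and context from user history.

Result: The authors run extensive experiments across agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior.

Insight: The main contribution is the first benchmark for personalized web agents on the real open web, together with the clarify-to-personalize principle and a reasoning-aware, fine-grained evaluation framework. It provides a standardized testbed for studying how agents use long-term user history for contextual reasoning.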

Abstract: Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://anonymous.4open.science/r/Persona2Web-73E8.


[3] ReIn: Conversational Error Recovery with Reasoning Inception cs.CL | cs.AI

Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma

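TL;DR: This paper proposes ReIn (Reasoning Inception), a test-time intervention that helps LLM-based conversational agents recover from unanticipated, user-induced errors without fine-tuning the model or modifying its prompts. An external inception module identifies predefined errors in the dialogue and generates a recovery plan, which is then integrated into the agent's internal reasoning to guide corrective actions.

Details

Motivation: Conversational agents built on LLMs with tool integration perform well on fixed task-oriented datasets but remain fragile to unanticipated, user-induced errors such as ambiguous or unsupported requests. Because fine-tuning models or modifying prompts is costly and slow, the work focuses on error recovery and explores how an agent can recover from a flawed dialogue context without changing model parameters or prompts.

Result: Evaluated by systematically simulating dialogue failures that block completion of the user's goal (ambiguous and unsupported requests), ReIn substantially improves task success and generalizes to unseen error types. Across combinations of agent models and inception modules, it consistently outperforms explicit prompt-modification approaches.

Insight: The contribution is a test-time intervention that requires no change to the backbone model or system prompt: an externally generated recovery plan is integrated into the agent's internal reasoning, which improves robustness. In-depth analysis indicates that jointly defining recovery tools with ReIn is a safe and effective strategy that strengthens recovery while leaving the original architecture intact.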

Abstract: Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent’s decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent’s internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user’s ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method. In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.


[4] Large Language Models Persuade Without Planning Theory of Mind cs.CL

Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber

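TL;DR: This paper proposes a novel interactive theory of mind (ToM) task in which an agent must persuade a target to choose a particular policy proposal by strategically revealing information. LLMs excel when the target's mental states (knowledge and motivation) are explicitly revealed but perform poorly when they must actively infer those states, suggesting a lack of multi-step planning. When interacting with real human targets, however, LLMs persuade more effectively than humans, indicating that they can persuade through rhetorical strategies rather than explicit ToM reasoning.

Details

Motivation: Existing ToM evaluations mostly use static question-and-answer benchmarks, which cannot fully assess ToM in interactive settings. The paper fills this gap with an interactive persuasion task that probes how LLMs and humans differ when they must dynamically infer and use another agent's mental states.

Result: In Experiment 1, LLMs excel when the target's mental states are revealed but perform below chance when the states are hidden and must be inferred, while humans perform moderately well in both conditions. In Experiment 2 (a human role-plays the target) and Experiment 3 (real belief change is measured), LLMs outperform human persuaders in all conditions.

Insight: The contribution is an interactive ToM evaluation framework that emphasizes dynamic information disclosure and mental-state inference. The study shows that LLMs may persuade effectively through non-ToM rhetorical strategies, which challenges the tendency to attribute human-like ToM to LLMs while underscoring their potential to influence human beliefs and behavior.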

Abstract: A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader’s sensitivity to a given target’s knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets’ real beliefs changed, LLMs outperformed human persuaders across all conditions. These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs’ potential to influence people’s beliefs and behavior.


[5] ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning cs.CL | cs.AI

Hussein S. Al-Olimat, Ahmad Alshareef

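TL;DR: This paper introduces ALPS (Arabic Linguistic & Pragmatic Suite), an expert-curated Arabic diagnostic challenge set that targets deep semantic and pragmatic reasoning to complement existing large-scale benchmarks. The dataset contains 531 carefully crafted questions across 15 tasks and 47 subtasks, written natively to ensure cultural authenticity. An evaluation of 23 models (commercial, open-source, and Arabic-native) finds that models are fluent but show marked deficits on basic morpho-syntactic dependencies, with a clear gap between commercial and Arabic-native models.

Details

Motivation: Current Arabic NLP benchmarks emphasize scale but often rely on synthetic or translated data and lack deep linguistic verification. The paper therefore builds a native, expert-curated diagnostic challenge set to probe deep semantic and pragmatic reasoning in Arabic and to address the limited depth of existing benchmarks.

Result: 23 models are evaluated on ALPS against a single-pass human average accuracy of 84.6% and an expert-adjudicated oracle of 99.2%. Top commercial models (e.g., Gemini-3-flash at 94.2%) surpass the average human, while Arabic-native models (e.g., Jais-2-70B at 83.6%) approach but do not match human performance. Models have elevated error rates on morpho-syntactic dependencies (36.5% on average across diacritics-reliant tasks) and do better on compositional semantics.

Insight: The contribution is an expert-curated, culturally authentic Arabic diagnostic dataset focused on deep linguistic competence, which exposes a dissociation between fluency and morpho-syntactic understanding. The study underlines the value of high-quality, deeply verified, language-specific benchmarks for measuring real language understanding, rather than relying on scale or translated data alone.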

Abstract: While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.


[6] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios cs.CL

Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo

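TL;DR: This paper proposes BankMathBench, a dedicated benchmark for numerical reasoning in banking scenarios. It spans three difficulty levels, from basic to advanced, and is used to evaluate and improve the multi-step numerical computation and product understanding of large language models in practical banking tasks such as deposits and loans.

Details

Motivation: Existing LLMs have low accuracy on core banking computations (such as total payout estimation, product comparison, and interest calculation under early repayment), and existing math and finance benchmarks do not adequately cover everyday banking scenarios. A domain-specific evaluation dataset is therefore needed.

Result: Open-source LLMs fine-tuned on BankMathBench improve markedly in formula generation and numerical reasoning accuracy. With tool-augmented fine-tuning, average accuracy rises by 57.6, 75.1, and 62.9 percentage points on the basic, intermediate, and advanced tasks respectively, far above zero-shot baselines.

Insight: The contribution is a benchmark with tiered difficulty that stays close to real banking operations. It exposes systematic model errors in financial numerical reasoning, and domain fine-tuning combined with tool integration substantially improves performance, giving a reliable basis for domain-specific evaluation.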

Abstract: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. Such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset’s effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs’ numerical reasoning in real-world banking scenarios.
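As an illustration of the "exponents and geometric progressions" such tasks involve, the sketch below computes payouts for a lump-sum deposit and for an installment savings plan. These are standard textbook compound-interest formulas, not items from the dataset; the function names, monthly compounding, and the deposit-at-start-of-month convention are assumptions, and real products may instead use simple interest, taxes, or other conditions.

```python
def deposit_payout(principal: float, annual_rate: float, months: int) -> float:
    """Lump-sum deposit with monthly compounding: P * (1 + r/12)^n."""
    monthly_rate = annual_rate / 12
    return principal * (1 + monthly_rate) ** months


def installment_payout(monthly_deposit: float, annual_rate: float, months: int) -> float:
    """Installment savings, deposit made at the start of each month.

    The balance is a geometric series:
        m * [(1+i) + (1+i)^2 + ... + (1+i)^n] = m * (1+i) * ((1+i)^n - 1) / i
    """
    i = annual_rate / 12
    if i == 0:
        return monthly_deposit * months
    return monthly_deposit * (1 + i) * ((1 + i) ** months - 1) / i


def installment_payout_by_simulation(monthly_deposit: float, annual_rate: float, months: int) -> float:
    """Month-by-month simulation used to cross-check the closed form."""
    i = annual_rate / 12
    balance = 0.0
    for _ in range(months):
        balance = (balance + monthly_deposit) * (1 + i)
    return balance


if __name__ == "__main__":
    print(round(deposit_payout(1000.0, 0.12, 12), 2))
    print(round(installment_payout(100.0, 0.06, 24), 2))
```

Cross-checking a closed form against a simulation is the kind of verification a tool-augmented model could delegate to a calculator or interpreter rather than performing in free text.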


[7] Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests cs.CL

Anton Dzega, Aviad Elyashar, Ortal Slobodin, Odeya Cohen, Rami Puzis

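TL;DR: This study uses the Thematic Apperception Test (TAT) and the Social Cognition and Object Relations Scale - Global (SCORS-G) to assess the personality traits of large multimodal models. Models play two roles: subject models that generate stories from TAT images, and evaluator models that analyze those stories with the SCORS-G framework. Evaluator models understand TAT responses very well and their interpretations agree closely with those of human experts. All models grasp interpersonal relationships and the concept of self, but they generally fail to perceive and regulate aggression, and larger, newer models perform better.

Details

Motivation: The aim is to examine whether the personality traits of large multimodal models can be assessed through non-language modalities such as images, using the TAT, a projective framework from psychology designed to uncover unconscious aspects of personality.

Result: Evaluator models are excellent at understanding and analyzing TAT responses and agree closely with human experts (a qualitative result). On the SCORS-G dimensions, all models understand interpersonal relationships and self-concept well but generally fail to perceive and regulate aggression. Larger, newer model families consistently outperform smaller, earlier ones across all SCORS-G dimensions (a qualitative comparison; no specific benchmark or SOTA is mentioned).

Insight: The contribution is the systematic application of psychological assessment frameworks (TAT and SCORS-G) to the personality traits of large multimodal models, with a dual-role (subject and evaluator) experimental design. This gives model behavior analysis a new cross-disciplinary method and reveals a systematic deficit on a specific affective dimension (aggression) as well as a correlation between model scale and performance.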

Abstract: The Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale - Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), which assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses. Their interpretations are highly consistent with those of human experts. Assessment results highlight that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.


[8] The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI cs.CL

Dusan Bosnjakovic

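TL;DR: This paper proposes an auditing framework grounded in psychometric measurement theory for quantifying durable, latent biases and risks in generative AI. Using forced-choice ordinal vignettes and cryptographic permutation-invariance, it audits nine leading models on dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Although item-level framing drives high variance, a persistent "lab signal" produces significant behavioral clustering, indicating that in "locked-in" provider ecosystems latent biases are not merely static errors but compounding variables that may create recursive ideological echo chambers.

Details

Motivation: As large language models move from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops, detecting durable, provider-level behavioral signatures becomes a key requirement for safety and governance. Traditional benchmarks cannot capture the stable, latent response policies embedded during training and alignment.

Result: Using Mixed Linear Models and Intraclass Correlation Coefficient analysis across nine leading models, the study finds that item-level framing drives high variance while a persistent "lab signal" accounts for significant behavioral clustering. This suggests latent biases are compounding variables that may create recursive ideological echo chambers in multi-layered AI architectures.

Insight: The contributions are the use of psychometric measurement theory to quantify latent bias, auditing with forced-choice ordinal vignettes and cryptographic permutation-invariance, and the identification of the "lab signal" as a driver of behavioral clustering. Together these offer a new auditing method for AI safety and governance and highlight the dynamic, compounding nature of latent bias.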

Abstract: As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies: the “prevailing mindsets” embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory, specifically latent trait estimation under ordinal uncertainty, to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent “lab signal” accounts for significant behavioral clustering. These findings demonstrate that in “locked-in” provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.
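To show what an intraclass correlation measures in this setting, the sketch below computes the one-way random-effects ICC(1) from a balanced design, where each group could stand for one lab and each value for a score from one of its models. This is a simplified stand-in: the abstract says the authors use MixedLM and ICC analysis, and this ANOVA-based estimator, the function name `icc1`, and the toy numbers are assumptions for illustration only.

```python
from typing import List, Sequence


def icc1(groups: Sequence[Sequence[float]]) -> float:
    """One-way random-effects ICC(1) for balanced groups.

    ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW)
    where MSB / MSW are the between- and within-group mean squares and
    k is the number of observations per group.
    """
    n = len(groups)
    k = len(groups[0])
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 balanced groups with at least 2 values each")

    group_means: List[float] = [sum(g) / k for g in groups]
    grand_mean = sum(group_means) / n

    # Between-group mean square: spread of group means around the grand mean.
    msb = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    # Within-group mean square: spread of values around their own group mean.
    msw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


if __name__ == "__main__":
    # Toy scores: three hypothetical labs, three models each.
    clustered = [[0.80, 0.82, 0.81], [0.40, 0.42, 0.41], [0.60, 0.61, 0.59]]
    print(round(icc1(clustered), 3))  # close to 1: most variance is between labs
```

A value near 1 means most variance lies between groups (a strong "lab signal" in the paper's terms), while a value near or below 0 means group membership explains little.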


[9] AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue cs.CL

Adib Sakhawat, Fardeen Sadab, Rakin Shahriar

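TL;DR: This paper proposes AIDG (Adversarial Information Deduction Game), a framework for evaluating the asymmetry between information extraction (active deduction) and information containment (state maintenance) in multi-turn dialogue with large language models. Across 439 games on two complementary tasks (AIDG-I and AIDG-II) with six frontier LLMs, models perform substantially better at containment than at extraction, a clear capability asymmetry.

Details

Motivation: The aim is to move beyond static benchmarks and evaluate the strategic reasoning of LLMs through dynamic multi-turn interaction, in particular the asymmetry between actively acquiring information and defensively maintaining an information state in dialogue.

Result: Across six frontier LLMs, models are far better at containment (defense) than at extraction (deduction), with a 350 Elo advantage (Cohen's d = 5.47). Two bottlenecks are identified: confirmation strategies are 7.75x more effective than blind deduction, and 41.3% of deductive failures stem from instruction-following degrading under conversational load.

Insight: The contribution is the game-theoretic AIDG framework for quantifying asymmetry in LLM strategic reasoning in dynamic dialogue. The analysis shows that LLMs excel at local defensive coherence but fall short on global state tracking and strategic inquiry, offering a new lens for evaluating and improving model reasoning.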

Abstract: Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured “20 Questions” setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense (Cohen’s d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
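For readers who want a feel for the two effect measures quoted above, the sketch below shows the standard Elo expected-score formula and the pooled-standard-deviation form of Cohen's d. Whether the paper uses exactly this Elo scale (base 10, divisor 400) or this variant of d is an assumption; the sample data in the usage lines are invented.

```python
import math
import statistics
from typing import Sequence


def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d with a pooled (sample) standard deviation."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd


if __name__ == "__main__":
    # Under the standard Elo model a 350-point gap implies an expected
    # score of roughly 0.88 for the stronger side.
    print(round(elo_expected_score(350), 3))
    print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))
```

Under this model, a 350-point gap means the defending side would be expected to score about 88% against the deducing side.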


[10] Modeling Distinct Human Interaction in Web Agents cs.CL | cs.HC

Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou

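TL;DR: This paper studies how to model human intervention behavior in autonomous web agents to support collaborative web task execution. The authors collect CowCorpus, a dataset containing over 4,200 interleaved human and agent actions, identify four patterns of user interaction, and train language models to predict when users will intervene, improving the agents' adaptability and collaboration.

Details

Motivation: Current autonomous web agent systems lack a principled understanding of when and why humans intervene, so they may proceed past critical decision points or request unnecessary confirmation. Modeling human intervention is therefore needed to build more collaborative agents.

Result: The trained language models improve intervention prediction accuracy by 61.4-63.4% over base language models. In a user study, agents deploying the intervention-aware models achieve a 26.5% increase in user-rated usefulness.

Insight: The contribution is to frame human intervention as a structured modeling task and to identify four concrete interaction patterns (such as hands-off supervision and hands-on oversight), which yields actionable design principles and a data-driven approach for building adaptive, collaborative agents.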

Abstract: Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents – hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.


[11] Unmasking the Factual-Conceptual Gap in Persian Language Models cs.CL

Alireza Sakhaeirad, Ali Ma’manpoosh, Arshia Hemmat

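TL;DR: This paper exposes a key deficit in the cultural competence of Persian language models: they can memorize cultural facts but struggle to reason about implicit social norms. The authors propose DivanBench, a diagnostic benchmark focused on superstitions and customs, which are context-dependent rules that resist simple logical deduction. Evaluating seven Persian LLMs on three task types, they find severe acquiescence bias, continued pretraining that worsens the bias, and a marked performance gap between factual knowledge and its application in scenarios.

Details

Motivation: Existing Persian NLP benchmarks have expanded into pragmatics and politeness but do not distinguish whether a model has memorized cultural facts or can actually reason about implicit social norms. The paper fills this gap by diagnosing fundamental deficits in how models handle arbitrary, context-dependent cultural rules such as superstitions and customs.

Result: Seven Persian LLMs are evaluated on DivanBench (315 questions covering factual retrieval, paired scenario verification, and situational reasoning). Most models show severe acquiescence bias. Continuous Persian pretraining does not improve reasoning and instead amplifies the bias, even degrading the ability to discern contradictions. All models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios.

Insight: The core contribution is DivanBench, a diagnostic benchmark that targets cultural rules which cannot be derived by logic, thereby exposing a fundamental gap between factual knowledge and conceptual reasoning. The key finding is that scaling monolingual data alone does not produce real cultural competence: current models learn to mimic cultural patterns without internalizing the underlying schemas. This has implications for evaluating and developing multilingual and cross-cultural NLP models.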

Abstract: While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model’s ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
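One way to picture acquiescence bias in a paired-scenario setup is to compare how often a model accepts the appropriate scenario with how often it rejects the matching violation. The sketch below does that; the record type `PairResult`, the metric names, and the "gap" summary are hypothetical choices for illustration and are not taken from DivanBench's scoring code.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class PairResult:
    """Model answers for one paired scenario (hypothetical record format)."""
    said_yes_to_appropriate: bool  # correct answer is "yes"
    said_yes_to_violation: bool    # correct answer is "no"


def acquiescence_report(results: Sequence[PairResult]) -> Dict[str, float]:
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    accept = sum(r.said_yes_to_appropriate for r in results) / n
    reject = sum(not r.said_yes_to_violation for r in results) / n
    # A pair counts as solved only when both halves are answered correctly.
    both = sum(r.said_yes_to_appropriate and not r.said_yes_to_violation for r in results) / n
    return {
        "accept_appropriate": accept,
        "reject_violation": reject,
        "pair_accuracy": both,
        # Large positive gap: the model says "yes" regardless of content.
        "yes_bias_gap": accept - reject,
    }


if __name__ == "__main__":
    always_yes = [PairResult(True, True) for _ in range(4)]
    print(acquiescence_report(always_yes))
```

Scoring at the pair level matters because a model that always answers "yes" would look perfect on the appropriate scenarios alone while solving no pair.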


cs.CV

[12] Analytic Score Optimization for Multi Dimension Video Quality Assessment cs.CV

Boda Lin, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan

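TL;DR: This paper presents UltraVQA, a large-scale multi-dimensional video quality assessment dataset, and introduces Analytic Score Optimization (ASO), a theoretically grounded post-training objective that makes better use of rich multi-dimensional annotations and improves discrete quality score prediction.

Details

Motivation: Video quality assessment is moving from a single mean opinion score toward richer multi-dimensional evaluation, which calls for datasets and methods that handle user-generated content across several key quality dimensions.

Result: In experiments, the method outperforms most baselines, including closed-source APIs and open-source models, on quality prediction and reduces mean absolute error.

Insight: The contributions are a large-scale dataset covering five key quality dimensions (Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality) with GPT-generated explanatory rationales, and ASO, a post-training objective derived from a regularized decision-making process that has a closed-form solution and captures the ordinal nature of human ratings. The work underlines the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment.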

Abstract: Video Quality Assessment (VQA) is evolving beyond single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present a large-scale multi-dimensional VQA dataset UltraVQA that encompasses diverse User-Generated Content (UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video in our dataset is scored by over 3 human raters on these dimensions, with fine-grained sub-attribute labels, and accompanied by an explanatory rationale generated by GPT based on the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences. In experiments, our method outperforms most baselines including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.


[13] DODO: Discrete OCR Diffusion Models cs.CV

Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor

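TL;DR: This paper proposes DODO, an optical character recognition (OCR) method based on block discrete diffusion, designed to address the high computational cost and slow speed of conventional autoregressive decoding on long documents. By decomposing generation into blocks, DODO enables efficient parallel decoding and runs up to 3x faster than autoregressive baselines while keeping near-SOTA accuracy.

Details

Motivation: Current OCR methods based on vision-language models (VLMs) rely mainly on autoregressive decoding, which is expensive and slow for long documents. OCR is highly deterministic (the visual input strictly dictates a unique output sequence), so in principle diffusion models could decode it efficiently in parallel, but existing masked diffusion models suffer structural instabilities that make them unsuitable for OCR's exact-match requirement.

Result: Experiments show that DODO achieves near state-of-the-art accuracy on OCR while delivering up to 3x faster inference than autoregressive baselines.

Insight: The contribution is the first application of block discrete diffusion to OCR. Block decomposition mitigates the synchronization errors of global diffusion, which improves inference efficiency substantially while preserving high accuracy, and suggests a new route to efficient parallel decoding for deterministic vision-to-text tasks.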

Abstract: Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
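The decoding pattern described above (blocks generated left to right, with several tokens unmasked in parallel inside each block) can be sketched with a toy loop that only counts forward passes. This is a scheduling illustration under strong assumptions: the `denoise` callable, the confidence-ranked unmasking rule, and the oracle denoiser are hypothetical and do not reproduce DODO's model or sampler.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Prediction = Tuple[str, float]  # (token, confidence)
Denoiser = Callable[[List[Optional[str]], int, int], Dict[int, Prediction]]


def block_diffusion_decode(
    length: int, block_size: int, steps_per_block: int, denoise: Denoiser
) -> Tuple[List[Optional[str]], int]:
    """Decode block by block; inside a block, unmask several tokens per pass."""
    tokens: List[Optional[str]] = [None] * length  # None marks a masked slot
    forward_passes = 0
    for start in range(0, length, block_size):
        end = min(start + block_size, length)
        per_step = math.ceil((end - start) / steps_per_block)
        while any(tok is None for tok in tokens[start:end]):
            preds = denoise(tokens, start, end)  # one parallel forward pass
            forward_passes += 1
            if not preds:
                raise RuntimeError("denoiser returned no predictions")
            # Commit only the most confident predictions this step.
            ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
            for pos, (tok, _conf) in ranked[:per_step]:
                tokens[pos] = tok
    return tokens, forward_passes


def make_oracle_denoiser(target: str) -> Denoiser:
    """Toy stand-in for a model: always predicts the true character."""
    def denoise(tokens: List[Optional[str]], start: int, end: int) -> Dict[int, Prediction]:
        return {
            i: (target[i], 1.0 - 0.01 * (i - start))
            for i in range(start, end)
            if tokens[i] is None
        }
    return denoise


if __name__ == "__main__":
    text = "HELLO WORLD!"
    out, passes = block_diffusion_decode(len(text), 4, 2, make_oracle_denoiser(text))
    print("".join(t or "?" for t in out), passes, "passes vs", len(text), "autoregressive")
```

In this toy setting the pass count drops from one per token to `steps_per_block` per block; real speedups depend on model cost per pass and on how many steps are needed to keep exact-match accuracy.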


[14] SemCovNet: Towards Fair and Semantic Coverage-Aware Learning for Underrepresented Visual Concepts cs.CV

Sakib Ahammed, Xia Cui, Xinqi Fan, Wenqi Lu, Moi Hoon Yap

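TL;DR: This paper proposes SemCovNet to address Semantic Coverage Imbalance (SCI) in vision models, a semantic-level bias caused by long-tailed semantic representations. The model combines a Semantic Descriptor Map, a Descriptor Attention Modulation module, and a Descriptor-Visual Alignment loss to explicitly learn to correct semantic coverage disparities, improving fairness and reliability.

Details

Motivation: Existing vision datasets exhibit Semantic Coverage Imbalance (SCI), an overlooked bias. Unlike class imbalance, it occurs at the semantic level and affects how models learn and reason about rare but meaningful semantics. The paper aims to mitigate SCI and promote fairer, more interpretable visual learning.

Result: Extensive experiments on multiple datasets show that SemCovNet substantially reduces the Coverage Disparity Index (CDI), which measures semantic fairness, and improves model reliability, achieving fairer and more balanced performance.

Insight: The contribution is to establish Semantic Coverage Imbalance (SCI) as a measurable and correctable bias for the first time, and to propose an integrated framework (SemCovNet) combining a semantic descriptor map, dynamic attention modulation, and an alignment loss to address it systematically. The proposed Coverage Disparity Index (CDI) provides a new metric for quantifying semantic fairness and extends fairness research from the class level to a finer-grained semantic level.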

Abstract: Modern vision models increasingly rely on rich semantic representations that extend beyond class labels to include descriptive concepts and contextual attributes. However, existing datasets exhibit Semantic Coverage Imbalance (SCI), a previously overlooked bias arising from the long-tailed semantic representations. Unlike class imbalance, SCI occurs at the semantic level, affecting how models learn and reason about rare yet meaningful semantics. To mitigate SCI, we propose Semantic Coverage-Aware Network (SemCovNet), a novel model that explicitly learns to correct semantic coverage disparities. SemCovNet integrates a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. We quantify semantic fairness using a Coverage Disparity Index (CDI), which measures the alignment between coverage and error. Extensive experiments across multiple datasets demonstrate that SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance. This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning.


[15] Xray-Visual Models: Scaling Vision models on Industry Scale Data cs.CV | cs.AI

Shlok Mishra, Tsung-Yu Lin, Linda Wang, Hongli Xu, Yimin Liu

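TL;DR: This paper presents Xray-Visual, a unified vision model architecture for image and video understanding trained on large-scale social media data. The model uses over 15 billion image-text pairs and 10 billion video-hashtag pairs, a data pipeline with balancing and noise suppression strategies, and a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities.

Details

Motivation: The goal is to build a unified, efficient, and robust vision model on large-scale, noisy, industry-grade social media data that supports general understanding of images and videos.

Result: The model reaches state-of-the-art results on several benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. It is strongly robust to domain shift and adversarial perturbations.

Insight: The contributions are: 1) a strategy for building a large-scale, carefully curated dataset; 2) a three-stage training pipeline combining several learning paradigms; 3) a Vision Transformer backbone that uses EViT for better computational efficiency; 4) integration of large language models as text encoders (LLM2CLIP) to improve retrieval and generalization.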

Abstract: We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.
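The third training stage is described as CLIP-style contrastive learning. As a reminder of what that objective looks like, here is a dependency-free sketch of the symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. The function names are illustrative, the temperature of 0.07 is CLIP's commonly cited initial value and only an assumption here, and nothing in the sketch reflects Xray-Visual's actual implementation.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean of -log softmax(row_i)[i]: the i-th item should match the i-th pair."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(
    image_emb: Sequence[Sequence[float]],
    text_emb: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss."""
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs: near zero
    print(clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs: large
```

The loss is small when each image embedding is closest to its own caption and large when pairs are mismatched, which is what pulls the two modalities into a shared space.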


[16] DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers cs.CV | cs.AI

Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde

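TL;DR: This paper proposes DDiT, a dynamic patch scheduling method that improves the inference efficiency of Diffusion Transformers (DiTs). The core idea is to adjust patch size dynamically according to the denoising timestep and content complexity, substantially reducing compute cost without sacrificing generation quality.

Details

Motivation: Diffusion Transformers achieve SOTA performance in image and video generation, but at high computational cost. The inefficiency stems largely from fixed tokenization, which uses constant-sized patches at every denoising step regardless of how content complexity varies.

Result: On FLUX-1.Dev and Wan 2.1, the method achieves speedups of up to 3.52x and 3.2x, respectively, while preserving generation quality and prompt adherence.

Insight: The main contribution is a dynamic tokenization strategy. The key insight is that early denoising steps need only coarse patches to model global structure, while later iterations need finer (smaller) patches to refine local details. It is an efficient test-time strategy that reallocates patch sizes dynamically to make better use of compute.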

Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content’s complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising the generation quality and prompt adherence.
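
The coarse-to-fine idea in this abstract can be illustrated with a small sketch. It is not the authors' scheduler: the function names, the three-phase schedule `(4, 2, 1)`, and the cost model (self-attention cost proportional to the square of the token count) are all assumptions made for illustration, and the content-complexity signal described in the abstract is left out.

```python
def patch_size_for_step(step, num_steps, sizes=(4, 2, 1)):
    """Map a denoising step (0 = first, noisiest) to a patch size, coarse to fine.

    Hypothetical schedule: the steps are split into len(sizes) equal phases.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    phase = step * len(sizes) // num_steps
    return sizes[phase]


def token_count(height, width, patch):
    """Number of tokens for a latent grid tiled with square patches."""
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def relative_attention_cost(height, width, num_steps, sizes=(4, 2, 1)):
    """Attention cost (tokens squared, summed over steps) relative to
    running every step with the finest patch size."""
    finest = min(sizes)
    baseline = num_steps * token_count(height, width, finest) ** 2
    dynamic = sum(
        token_count(height, width, patch_size_for_step(s, num_steps, sizes)) ** 2
        for s in range(num_steps)
    )
    return dynamic / baseline
```

Under these assumptions, a 64 x 64 latent with 30 steps spends roughly a third of the baseline attention cost. The number only illustrates the mechanism and says nothing about the speedups reported in the paper.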


[17] Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling cs.CV | cs.CL | cs.LG

Divyam Madaan, Sumit Chopra, Kyunghyun Cho

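TL;DR: This paper proposes PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of a missing modality in multimodal learning. PRIMO handles incomplete modalities during both training and inference by modeling the relationship between the missing and observed modalities through a latent variable, and it uses sampling for prediction and impact analysis.

Details

Motivation: In practice, data for multimodal large language models (MLLMs) may have missing modalities, be collected asynchronously, or be only partially available. The goal is to use every available training example and to quantify the predictive contribution of the missing modality.

Result: Evaluated on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III (mortality and ICD-9 prediction), PRIMO performs comparably to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available, and it quantifies instance-level modality impact with a variance-based metric.

Insight: The novelty lies in supervised latent-variable modeling that handles missing modalities explicitly, together with a sampling strategy that yields both predictive distribution estimates and instance-level impact analysis, giving an interpretable framework for incomplete multimodal learning.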

Abstract: Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.
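
A minimal sketch of the sampling step described above, assuming the model has already produced one class-probability vector per sampled latent completion. The function names are hypothetical, and using the total variance across completions as the impact score is an assumption; the abstract only states that the metric is variance-based.

```python
def marginal_prediction(preds):
    """Average the class-probability vectors over sampled completions."""
    k = len(preds)
    return [sum(p[c] for p in preds) / k for c in range(len(preds[0]))]


def modality_impact(preds):
    """Total variance of the predicted probabilities across completions.

    A value near zero means the missing modality barely changes the
    prediction for this instance; larger values mean it matters more.
    """
    mean = marginal_prediction(preds)
    k = len(preds)
    return sum(
        sum((p[c] - mean[c]) ** 2 for p in preds) / k
        for c in range(len(mean))
    )
```

For example, two completions that flip the prediction (`[[0.9, 0.1], [0.1, 0.9]]`) give a uniform marginal and a high impact score, while identical completions give an impact of zero.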


[18] Cross Pseudo Labeling For Weakly Supervised Video Anomaly Detection cs.CV

Lee Dayeon, Kim Dongheyong, Park Chaewon, Woo Sungmin, Lee Sangyoun

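TL;DR: The paper proposes CPL-VAD, a dual-branch framework for weakly supervised video anomaly detection that couples anomaly localization and category classification through a cross pseudo labeling mechanism, achieving state-of-the-art performance on XD-Violence and UCF-Crime.

Details

Motivation: In weakly supervised video anomaly detection, only video-level labels are available, yet the model must both localize anomalous snippets and identify anomaly categories. The paper aims to improve temporal precision and semantic discrimination under this constraint.

Result: On the XD-Violence and UCF-Crime benchmarks, CPL-VAD reaches state-of-the-art performance in both anomaly detection and abnormal category classification.

Insight: The novelty is the cross pseudo labeling mechanism, through which the anomaly detection branch and the category classification branch reinforce each other, combining temporal localization precision with the semantic discrimination of vision-language alignment. It offers a reusable co-training strategy for multi-task weakly supervised learning.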

Abstract: Weakly supervised video anomaly detection aims to detect anomalies and identify abnormal categories with only video-level labels. We propose CPL-VAD, a dual-branch framework with cross pseudo labeling. The binary anomaly detection branch focuses on snippet-level anomaly localization, while the category classification branch leverages vision-language alignment to recognize abnormal event categories. By exchanging pseudo labels, the two branches transfer complementary strengths, combining temporal precision with semantic discrimination. Experiments on XD-Violence and UCF-Crime demonstrate that CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification.


[19] 3D Scene Rendering with Multimodal Gaussian Splatting cs.CV | cs.AI | cs.RO

Chi-Shiang Gau, Konstantinos D. Polyzos, Athanasios Bacharis, Saketh Madhuvarasu, Tara Javidi

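TL;DR: This paper proposes a multimodal 3D scene rendering framework that combines radio-frequency (RF) sensing, such as automotive radar, with 3D Gaussian Splatting (GS) rendering. It targets the limitations of conventional vision-based GS when visual cues are unreliable (adverse weather, low illumination, partial occlusion), aiming for more efficient and robust 3D scene reconstruction and rendering.

Details

Motivation: Conventional vision-based 3D Gaussian Splatting typically relies on many camera views to initialize the Gaussian primitives and train their parameters. It performs poorly when visual cues are unreliable (adverse weather, low illumination, occlusion) and incurs high initialization cost. RF signals are robust to these conditions, which motivates fusing RF sensing to improve rendering efficiency and robustness.

Result: Numerical tests show that judiciously incorporating RF sensing into the GS pipeline yields high-fidelity 3D scene rendering driven by RF-informed structural accuracy.

Insight: The novelty is using the robustness of RF signals (such as automotive radar) to predict depth efficiently: sparse RF depth measurements suffice to produce a high-quality 3D point cloud for initializing Gaussian functions across different GS architectures, offering a more efficient and robust alternative for multimodal 3D rendering.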

Abstract: 3D scene reconstruction and rendering are core tasks in computer vision, with applications spanning industrial monitoring, robotics, and autonomous driving. Recent advances in 3D Gaussian Splatting (GS) and its variants have achieved impressive rendering fidelity while maintaining high computational and memory efficiency. However, conventional vision-based GS pipelines typically rely on a sufficient number of camera views to initialize the Gaussian primitives and train their parameters, which incurs additional processing cost during initialization and falls short in conditions where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusions. To cope with these challenges, and motivated by the robustness of radio-frequency (RF) signals to weather, lighting, and occlusions, we introduce a multimodal framework that integrates RF sensing, such as automotive radar, with GS-based rendering as a more efficient and robust alternative to vision-only GS rendering. The proposed approach enables efficient depth prediction from only sparse RF-based depth measurements, yielding a high-quality 3D point cloud for initializing Gaussian functions across diverse GS architectures. Numerical tests demonstrate the merits of judiciously incorporating RF sensing into GS pipelines, achieving high-fidelity 3D scene rendering driven by RF-informed structural accuracy.


[20] BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning cs.CV

Siyuan Liang, Yongcheng Jing, Yingjie Wang, Jiaxing Huang, Ee-chien Chang

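TL;DR: This paper proposes BadCLIP++, a stealthy and persistent backdoor attack framework against multimodal contrastive learning models. It improves stealthiness through a semantic-fusion QR micro-trigger and target-aligned subset selection, and it ensures persistence under strong detection and continuous fine-tuning through radius shrinkage, centroid alignment, curvature control, and elastic weight consolidation. A theoretical analysis shows that, within a trust region, the gradients of clean fine-tuning and the backdoor objective are co-directional, yielding a non-increasing upper bound on attack success degradation.

Details

Motivation: Existing backdoor attacks on multimodal contrastive learning models tend to fail under strong detection or continuous fine-tuning, mainly because cross-modal inconsistency exposes trigger patterns and because gradient dilution at low poisoning rates accelerates backdoor forgetting. The paper addresses both challenges to make the attack stealthier and more persistent.

Result: With only a 0.3% poisoning rate, BadCLIP++ reaches a 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 percentage points. Across 19 defenses, ASR stays above 99.90% with less than a 0.8% drop in clean accuracy. It attains 65.03% success in physical attacks and is robust against watermark removal defenses.

Insight: The contributions include: 1) a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions while preserving clean-data statistics; 2) target-aligned subset selection to strengthen the signal at low injection rates; 3) radius shrinkage and centroid alignment to stabilize trigger embeddings, plus curvature control and elastic weight consolidation to stabilize model parameters; 4) the first theoretical analysis showing that gradients from clean fine-tuning and the backdoor objective are co-directional within a trust region, giving a theoretical guarantee for attack persistence.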

Abstract: Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.


[21] NRGS-SLAM: Monocular Non-Rigid SLAM for Endoscopy via Deformation-Aware 3D Gaussian Splatting cs.CV | cs.RO

Jiwei Shan, Zeyu Cai, Yirui Li, Yongbo Chen, Lijun Han

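TL;DR: This paper proposes NRGS-SLAM, a monocular non-rigid SLAM system based on 3D Gaussian Splatting, designed for the challenges that persistent soft-tissue deformation poses in endoscopic scenes. It decouples camera ego-motion from intrinsic deformation using a deformation-aware 3D Gaussian map with learnable deformation probabilities, and it adds deformable tracking and mapping modules together with a robust geometric loss, achieving more accurate camera pose estimation and high-quality scene reconstruction.

Details

Motivation: Persistent soft-tissue deformation in endoscopic scenes violates the rigidity assumption of conventional V-SLAM, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Existing monocular non-rigid SLAM methods lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality.

Result: Extensive experiments on multiple public endoscopic datasets show that NRGS-SLAM outperforms state-of-the-art methods in camera pose estimation accuracy (up to 50% reduction in RMSE) and in high-quality, photo-realistic reconstruction. Comprehensive ablation studies further validate its key design choices.

Insight: The core contribution is a deformation-aware 3D Gaussian map that learns a deformation probability for each Gaussian primitive through a Bayesian self-supervision strategy, without external non-rigidity labels. On top of this representation, the tracking module performs coarse-to-fine pose estimation that prioritizes low-deformation regions, followed by per-frame deformation updates, and the mapping module balances representational capacity against computational efficiency. A unified robust geometric loss also incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM.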

Abstract: Visual simultaneous localization and mapping (V-SLAM) is a fundamental capability for autonomous perception and navigation. However, endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Although recent monocular non-rigid SLAM methods have made notable progress, they often lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality. To address these limitations, we propose NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. To resolve the coupling ambiguity, we introduce a deformation-aware 3D Gaussian map that augments each Gaussian primitive with a learnable deformation probability, optimized via a Bayesian self-supervision strategy without requiring external non-rigidity labels. Building on this representation, we design a deformable tracking module that performs robust coarse-to-fine pose estimation by prioritizing low-deformation regions, followed by efficient per-frame deformation updates. A carefully designed deformable mapping module progressively expands and refines the map, balancing representational capacity and computational efficiency. In addition, a unified robust geometric loss incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM. Extensive experiments on multiple public endoscopic datasets demonstrate that NRGS-SLAM achieves more accurate camera pose estimation (up to 50% reduction in RMSE) and higher-quality photo-realistic reconstructions than state-of-the-art methods. Comprehensive ablation studies further validate the effectiveness of our key design choices. Source code will be publicly available upon paper acceptance.


[22] Selective Training for Large Vision Language Models via Visual Information Gain cs.CV

Seulbi Lee, Sangheum Hwang

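TL;DR: This paper proposes Visual Information Gain (VIG), a perplexity-based metric that quantifies how much visual input reduces the prediction uncertainty of a Large Vision Language Model (LVLM). Based on it, the authors design a selective training scheme that prioritizes high-VIG samples and tokens to improve visual grounding and mitigate language bias.

Details

Motivation: Large Vision Language Models suffer from language bias: they generate answers without relying on visual evidence. Existing methods lack a quantitative measure of how much an individual training sample or token benefits from the image.

Result: By focusing on visually informative samples and tokens, the method improves visual grounding, mitigates language bias, and achieves better performance with reduced supervision.

Insight: The novelty is VIG, a quantifiable metric for fine-grained analysis of how visual information contributes to model predictions, which then guides training toward more efficient and targeted optimization.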

Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
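
One plausible reading of a perplexity-based gain is sketched below; the paper's exact formula is not given in the abstract. The sketch assumes per-token log-probabilities of the same target sequence from two forward passes (with and without the image) and defines the sample-level gain as the drop in log-perplexity. All function names and the threshold are hypothetical.

```python
import math


def perplexity(logps):
    """Perplexity of a sequence from its per-token log-probabilities."""
    return math.exp(-sum(logps) / len(logps))


def token_vig(logp_with_image, logp_text_only):
    """Per-token gain: how much the image raises each token's log-probability."""
    return [a - b for a, b in zip(logp_with_image, logp_text_only)]


def sample_vig(logp_with_image, logp_text_only):
    """Sample-level gain: log PPL(text only) minus log PPL(with image)."""
    return math.log(perplexity(logp_text_only)) - math.log(perplexity(logp_with_image))


def select_tokens(vig, threshold):
    """Indices of tokens whose gain exceeds an illustrative threshold."""
    return [i for i, g in enumerate(vig) if g > threshold]
```

In a selective training loop, the selected indices could serve as a loss mask so that only visually informative tokens contribute to the gradient. That use is an interpretation of the abstract, not a detail it confirms.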


[23] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models cs.CV

Yahong Wang, Juncheng Wu, Zhangkai Ni, Chengmei Yang, Yihang Liu

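TL;DR: This paper proposes EntropyPrune, a visual token pruning framework. Using a matrix-entropy perspective, it identifies an "Entropy Collapse Layer" where the information content of visual representations drops sharply, uses that layer as the criterion for choosing the pruning stage, and quantifies the information value of individual visual tokens to prune redundant ones, accelerating inference for multimodal large language models.

Details

Motivation: Multimodal large language models incur high inference cost because they process many visual tokens. Existing token pruning methods choose pruning layers heuristically, which limits interpretability and transferability across models. The paper aims to provide a principled, information-theoretic pruning criterion.

Result: On multiple multimodal benchmarks, EntropyPrune outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, it achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance.

Insight: The novelty is identifying the "Entropy Collapse Layer" from a matrix-entropy perspective as an objective basis for pruning, and quantifying token information value without relying on attention maps. Exploiting the spectral equivalence of dual Gram matrices reduces the complexity of entropy computation, yielding up to a 64x theoretical speedup, and the method shows robustness and scalability.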

Abstract: Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an “Entropy Collapse Layer” (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
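
The dual Gram matrix observation can be checked with a small sketch. For a token matrix X with n tokens and d feature dimensions, the n x n matrix X Xᵀ and the d x d matrix Xᵀ X share the same nonzero eigenvalues, so any entropy defined on the trace-normalized spectrum can be computed from whichever matrix is smaller. The sketch uses an order-2 (Rényi) matrix entropy, -log tr(K²) with K = G / tr(G), because it needs no eigendecomposition; the paper's own entropy definition may differ, and the function names are illustrative.

```python
import math


def gram(rows):
    """Gram matrix of the given row vectors (all pairwise dot products)."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


def order2_entropy(g):
    """Order-2 matrix entropy -log tr(K^2) for K = G / tr(G).

    For a symmetric matrix, tr(G^2) equals the sum of squared entries.
    """
    trace = sum(g[i][i] for i in range(len(g)))
    frob_sq = sum(v * v for row in g for v in row)
    return -math.log(frob_sq / (trace * trace))


# Four tokens with two feature dimensions each (toy values).
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
entropy_from_tokens = order2_entropy(gram(tokens))               # 4 x 4 Gram
entropy_from_features = order2_entropy(gram(transpose(tokens)))  # 2 x 2 Gram
```

Both calls return the same value, so the cheaper of the two Gram matrices can be used. Fully redundant tokens (all rows identical) give an entropy of zero, which matches the intuition that they carry no additional information.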


[24] HiMAP: History-aware Map-occupancy Prediction with Fallback cs.CV

Yiming Xu, Yi Yang, Hao Cheng, Monika Sester

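TL;DR: HiMAP is a tracking-free trajectory prediction framework that addresses the degradation in prediction quality caused by multi-object tracking failures in autonomous driving. It converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that retrieves agent-specific history, from which it generates multi-modal future trajectories.

Details

Motivation: Existing predictors rely on the identity association of multi-object tracking. When tracking fails because of occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. HiMAP aims to be a robust prediction framework that does not depend on tracking.

Result: On Argoverse 2, HiMAP outperforms strong baselines in the no-tracking setting, with relative gains of 11% in FDE and 12% in ADE and a 4% reduction in MR over a fine-tuned QCNet, and it performs comparably to tracking-based methods.

Insight: The contributions include a spatiotemporally invariant representation based on historical occupancy maps, a historical query module that retrieves agent-specific history from unlabeled occupancy representations, and a DETR-style decoder that produces multi-modal trajectories. The method removes the dependence on identity tracking, supports streaming inference, and serves as a robust fallback when tracking is unavailable.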

Abstract: Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present HiMAP, a tracking-free trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse 2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11% in FDE, 12% in ADE, and a 4% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy. The code is available under: https://github.com/XuYiMing83/HiMAP.
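
For readers unfamiliar with the reported metrics, the sketch below shows how ADE, FDE, and MR are commonly computed for multi-modal trajectory forecasts. The function names are illustrative, and the 2.0 m miss threshold is an assumption based on common practice; the abstract does not state the evaluation settings used.

```python
import math


def ade(pred, gt):
    """Average displacement error: mean L2 distance over all timesteps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """Final displacement error: L2 distance at the last timestep."""
    return math.dist(pred[-1], gt[-1])


def min_fde(modes, gt):
    """Best final error over the predicted modes for one agent."""
    return min(fde(mode, gt) for mode in modes)


def miss_rate(all_modes, all_gt, threshold=2.0):
    """Fraction of agents whose best mode ends farther than `threshold`
    (assumed here to be in meters) from the ground-truth endpoint."""
    misses = sum(min_fde(m, g) > threshold for m, g in zip(all_modes, all_gt))
    return misses / len(all_gt)
```

A relative gain of 11% in FDE then means the FDE of the compared method is 11% lower than the baseline's, since lower is better for all three metrics.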


[25] Inferring Height from Earth Embeddings: First insights using Google AlphaEarth cs.CV

Alireza Hamoudzadeh, Valeria Belloni, Roberta Ravanelli

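TL;DR: This study investigates whether the geospatial and multimodal features encoded in Google AlphaEarth Embeddings can effectively guide deep learning regression models for regional surface height mapping. It uses AlphaEarth Embeddings at 10 m spatial resolution, with a high-quality Digital Surface Model (DSM) as reference, to evaluate how well they support terrain height inference. U-Net and U-Net++ serve as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings translates into accurate surface height estimates.

Details

Motivation: To determine whether the geospatial and multimodal features in Earth Embeddings can guide deep learning models for regional surface height mapping, and whether these embeddings encode decodable height-related signals.

Result: On the training set, both architectures perform strongly (R² = 0.97). On the test set, performance drops because of a shift in the height frequency distribution between training and test areas, but U-Net++ generalizes better (R² = 0.84, median difference = -2.62 m) than the standard U-Net (R² = 0.78, median difference = -7.22 m). The test RMSE is about 16 m (U-Net++), so generalization remains challenging, yet the strong correlations indicate that the embeddings capture transferable topographic patterns.

Insight: The novelty is the first use of AlphaEarth Embeddings to guide height mapping, confirming that they encode height-related signals. The study also shows that spatially aware convolutional architectures such as U-Net++ improve generalization robustness, and it points out the need to address bias to improve regional transferability, suggesting new directions for embedding-based geospatial deep learning.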

Abstract: This study investigates whether the geospatial and multimodal features encoded in Earth Embeddings can effectively guide deep learning (DL) regression models for regional surface height mapping. In particular, we focused on AlphaEarth Embeddings at 10 m spatial resolution and evaluated their capability to support terrain height inference using a high-quality Digital Surface Model (DSM) as reference. U-Net and U-Net++ architectures were thus employed as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings can be translated into accurate surface height estimates. Both architectures achieved strong training performance (both with $R^2 = 0.97$), confirming that the embeddings encode informative and decodable height-related signals. On the test set, performance decreased due to distribution shifts in height frequency between training and testing areas. Nevertheless, U-Net++ shows better generalization ($R^2 = 0.84$, median difference = -2.62 m) compared with the standard U-Net ($R^2 = 0.78$, median difference = -7.22 m), suggesting enhanced robustness to distribution mismatch. While the testing RMSE (approximately 16 m for U-Net++) and residual bias highlight remaining challenges in generalization, strong correlations indicate that the embeddings capture transferable topographic patterns. Overall, the results demonstrate the promising potential of AlphaEarth Embeddings to guide DL-based height mapping workflows, particularly when combined with spatially aware convolutional architectures, while emphasizing the need to address bias for improved regional transferability.
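
The three reported quantities can be computed as in the sketch below. It assumes that R² is the coefficient of determination and that "median difference" means the median of predicted minus reference height; the abstract does not spell out either convention, and the function name is hypothetical.

```python
import math
import statistics


def height_metrics(pred, ref):
    """R^2, RMSE, and median signed difference between predicted and
    reference heights (all values in meters)."""
    diffs = [p - r for p, r in zip(pred, ref)]
    mean_ref = statistics.fmean(ref)
    ss_res = sum(d * d for d in diffs)
    ss_tot = sum((r - mean_ref) ** 2 for r in ref)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / len(ref)),
        "median_diff": statistics.median(diffs),
    }
```

A negative median difference under this convention indicates systematic underestimation, which is one way to read the residual bias mentioned in the abstract.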


[26] A Multi-modal Detection System for Infrastructure-based Freight Signal Priority cs.CV | eess.IV | eess.SY

Ziyan Zhang, Chuheng Wei, Xuanpeng Zhao, Siyan Li, Will Snyder

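TL;DR: This paper presents an infrastructure-based multi-modal freight vehicle detection system that integrates LiDAR and camera sensors to support Freight Signal Priority (FSP) control. The system uses a hybrid sensing architecture with an intersection-mounted subsystem and a midblock subsystem, connected by wireless communication for synchronized data transmission. The perception pipeline combines clustering-based and deep learning-based detection with Kalman filter tracking for stable real-time performance.

Details

Motivation: Freight vehicles approaching signalized intersections need reliable detection and motion estimation to support infrastructure-based Freight Signal Priority. Accurate and timely perception of vehicle type, position, and speed is essential for effective priority control strategies.

Result: Field evaluations show that the system reliably monitors freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems that support FSP applications.

Insight: The novelty is the combination of a hybrid multi-modal sensing architecture (LiDAR + camera) with hybrid detection (clustering + deep learning), together with a georeferenced frame for lane-level localization, giving a scalable real-time solution for infrastructure-based perception.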

Abstract: Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP). Accurate and timely perception of vehicle type, position, and speed is essential for enabling effective priority control strategies. This paper presents the design, deployment, and evaluation of an infrastructure-based multi-modal freight vehicle detection system integrating LiDAR and camera sensors. A hybrid sensing architecture is adopted, consisting of an intersection-mounted subsystem and a midblock subsystem, connected via wireless communication for synchronized data transmission. The perception pipeline incorporates both clustering-based and deep learning-based detection methods with Kalman filter tracking to achieve stable real-time performance. LiDAR measurements are registered into geodetic reference frames to support lane-level localization and consistent vehicle tracking. Field evaluations demonstrate that the system can reliably monitor freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems to support FSP applications.
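
As an illustration of the Kalman filter tracking step, here is a minimal one-dimensional constant-velocity filter that estimates a vehicle's position and speed along a lane from position-only detections. The motion model, the noise variances `q` and `r`, the initial covariance, and the function names are all assumptions chosen for the example; the deployed system's state vector and tuning are not described in the abstract.

```python
def kalman_step(state, cov, z, dt=0.1, q=0.01, r=0.25):
    """One predict/update cycle of a 1-D constant-velocity Kalman filter.

    state = (position, velocity); cov = 2x2 covariance as nested tuples;
    z = measured position. q and r are illustrative noise variances.
    """
    pos, vel = state
    (a, b), (c, d) = cov
    # Predict: x' = F x and P' = F P F^T + Q, with F = [[1, dt], [0, 1]].
    pos = pos + vel * dt
    p00 = a + dt * (b + c) + dt * dt * d + q
    p01 = b + dt * d
    p10 = c + dt * d
    p11 = d + q
    # Update with a position-only measurement (H = [1, 0]).
    innovation = z - pos
    s = p00 + r
    k0, k1 = p00 / s, p10 / s
    pos += k0 * innovation
    vel += k1 * innovation
    new_cov = (
        ((1 - k0) * p00, (1 - k0) * p01),
        (p10 - k1 * p00, p11 - k1 * p01),
    )
    return (pos, vel), new_cov


def track(measurements, dt=0.1):
    """Run the filter over a list of position detections and return the
    final (position, velocity) estimate."""
    state = (measurements[0], 0.0)
    cov = ((10.0, 0.0), (0.0, 100.0))  # large initial uncertainty
    for z in measurements[1:]:
        state, cov = kalman_step(state, cov, z, dt=dt)
    return state
```

Feeding the filter detections from a vehicle moving at a steady 10 m/s, sampled every 0.1 s, drives the velocity estimate toward 10 m/s even though speed is never measured directly. A speed estimate of this kind is what a priority controller would need in order to predict arrival time at the stop line.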


[27] EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection cs.CV

Hung Mai, Loi Dinh, Duc Hai Nguyen, Dat Do, Luong Doan

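TL;DR: This paper proposes EA-Swin, an Embedding-Agnostic Swin Transformer for detecting AI-generated videos. The model uses a factorized windowed attention design to model spatiotemporal dependencies directly on pretrained video embeddings and is compatible with generic ViT-style encoders. The authors also build EA-Video, a benchmark dataset of 130K videos covering a range of commercial and open-source generators. Experiments show that EA-Swin reaches 0.97-0.99 accuracy on major generators, well above existing methods.

Details

Motivation: Existing AI-generated video detection methods rely on shallow embedding trajectories, image-based adaptation, or computationally heavy multimodal large language models, and they cannot cope with the highly realistic synthetic videos produced by foundation video generators such as Sora and Veo.

Result: Extensive experiments on EA-Video show that EA-Swin reaches 0.97-0.99 accuracy on major generators, 5-20% higher than prior SOTA methods (typically 0.8-0.9), while generalizing well to unseen generator distributions.

Insight: The contributions include: 1) an embedding-agnostic factorized windowed attention design that models spatiotemporal dependencies directly on pretrained video embeddings; 2) the large-scale, diverse EA-Video benchmark, which includes unseen-generator splits for rigorous cross-distribution evaluation; 3) a scalable and robust solution compatible with generic ViT-style encoders.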

Abstract: Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.


[28] Physics Encoded Spatial and Temporal Generative Adversarial Network for Tropical Cyclone Image Super-resolution cs.CV

Ruoyi Zhang, Jiawei Yuan, Lujia Ye, Runling Yu, Liling Zhao

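TL;DR: This paper proposes a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for super-resolution of tropical cyclone satellite imagery. It encodes physical dynamics with a PhyCell module that approximates the vorticity equation and uses a dual-discriminator framework to enforce spatiotemporal consistency, so that resolution improves while cloud structures stay physically realistic.

Details

Motivation: Existing deep learning-based super-resolution methods usually treat satellite image sequences as generic videos and ignore the atmospheric physical laws that govern cloud motion, which makes the reconstructions meteorologically implausible. The paper builds physical constraints into the model to produce high-resolution tropical cyclone images with higher physical fidelity.

Result: 4x super-resolution experiments on the Digital Typhoon dataset show that PESTGAN performs better in structural fidelity and perceptual quality. Compared with existing methods, it keeps competitive pixel-wise accuracy while markedly improving the meteorological plausibility and physical fidelity of the reconstructed cloud structures.

Insight: The novelty is approximating a physical equation (the vorticity equation) with constrained convolutions and encoding it as latent representations, which separates physical dynamics from visual textures in the generator, combined with spatial and temporal discriminators that reinforce motion consistency. This offers a reusable framework for physics-informed spatiotemporal sequence generation.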

Abstract: High-resolution satellite imagery is indispensable for tracking the genesis, intensification, and trajectory of tropical cyclones (TCs). However, existing deep learning-based super-resolution (SR) methods often treat satellite image sequences as generic videos, neglecting the underlying atmospheric physical laws governing cloud motion. To address this, we propose a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for TC image super-resolution. Specifically, we design a disentangled generator architecture incorporating a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the resulting approximate physical dynamics as implicit latent representations to separate physical dynamics from visual textures. Furthermore, a dual-discriminator framework is introduced, employing a temporal discriminator to enforce motion consistency alongside spatial realism. Experiments on the Digital Typhoon dataset for 4$\times$ upscaling demonstrate that PESTGAN achieves better structural fidelity and perceptual quality. While maintaining competitive pixel-wise accuracy compared to existing approaches, our method excels at reconstructing meteorologically plausible cloud structures with superior physical fidelity.


Yuchang Jiang, Anton Raichuk, Xiaoye Tong, Vivien Sainte Fare Garnot, Daniel Ortiz-Gonzalo

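TL;DR: Using Sentinel-1 and Sentinel-2 satellite image time series and a multi-modal spatio-temporal deep learning model, this study produces the first 10 m-resolution tree crop map for South America. It identifies about 11 million hectares of tree crops, 23% of which is linked to forest cover loss between 2000 and 2020.

Details

Motivation: Policies such as the European Union's Regulation on Deforestation-free Products (EUDR) need high-resolution data that distinguishes forests from diverse agricultural systems, but such data is lacking, which hinders monitoring of tree crop expansion.

Result: The 10 m-resolution map identifies about 11 million hectares of tree crops and reveals that existing regulatory maps supporting the EUDR often misclassify established agriculture, particularly smallholder agroforestry, as "forest".

Insight: The novelty is combining multi-modal satellite image time series with deep learning to provide the first high-resolution tree crop baseline map, which helps reduce false alerts and supports conservation policies that are more effective, inclusive, and equitable.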

Abstract: Monitoring tree crop expansion is vital for zero-deforestation policies like the European Union’s Regulation on Deforestation-free Products (EUDR). However, these efforts are hindered by a lack of high-resolution data distinguishing diverse agricultural systems from forests. Here, we present the first 10 m-resolution tree crop map for South America, generated using a multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series. The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to 2000-2020 forest cover loss. Critically, our analysis reveals that existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as “forest”. This discrepancy risks false deforestation alerts and unfair penalties for small-scale farmers. Our work mitigates this risk by providing a high-resolution baseline, supporting conservation policies that are effective, inclusive, and equitable.


[30] SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery cs.CV | cs.AI | cs.LG

Lorenzo Caselli, Marco Mistretta, Simone Magistri, Andrew D. Bagdanov

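TL;DR: SpectralGCD is an efficient multimodal approach to Generalized Category Discovery. It uses CLIP cross-modal image-concept similarities as a unified representation, expressing each image as a mixture over semantic concepts from a large task-agnostic dictionary, which reduces reliance on spurious visual cues. With Spectral Filtering to retain relevant concepts automatically and forward and reverse knowledge distillation to keep representations semantically sufficient and aligned, it matches or exceeds SOTA accuracy on six benchmarks at substantially lower computational cost.

Details

Motivation: In Generalized Category Discovery, relying on image features alone tends to overfit to old classes, while existing multimodal methods treat modalities independently and incur high computational cost.

Result: On six benchmarks, SpectralGCD achieves accuracy comparable to or significantly better than state-of-the-art methods at much lower computational cost.

Insight: The contributions include using CLIP cross-modal similarities as a unified representation, Spectral Filtering to select relevant concepts automatically, and forward and reverse knowledge distillation from the same teacher model to keep representations semantically sufficient and well aligned, improving both efficiency and accuracy.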

Abstract: Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.


[31] A High-Level Survey of Optical Remote Sensing cs.CV | cs.AI

Panagiotis Koletsis, Vasilis Efthymiou, Maria Vakalopoulou, Nikos Komodakis, Anastasios Doulamis

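TL;DR: This paper is a high-level survey of optical remote sensing. It aims to give researchers entering the field a comprehensive overview, key datasets, and insights that help them focus on the most relevant directions.

Details

Motivation: Recent advances in computer vision have propelled progress in remote sensing, and the spread of drones has made optical remote sensing with RGB cameras increasingly common. The literature, however, is vast and fragmented, and no survey takes a holistic perspective to guide researchers.

Result: The paper reports no quantitative experiments or benchmark results. As a survey, its goal is to give an overview of the field's capabilities and key information rather than to propose a new model or compare performance.

Insight: The contribution is a comprehensive survey of optical remote sensing from a holistic perspective, which the authors believe to be the first of its kind. It covers diverse tasks, capabilities, methods, datasets, and insights, and it offers high-level guidance to researchers.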

Abstract: In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.


[32] EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models cs.CV

Xiaomeng Peng, Xilang Huang, Seon Han Choi

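TL;DR: This paper proposes EAGLE, a tuning-free framework that improves multimodal large language models (MLLMs) on industrial anomaly detection. It integrates the outputs of an expert model to guide MLLMs, so that they achieve both accurate anomaly detection and interpretable anomaly descriptions without parameter updates, and it obtains results comparable to fine-tuning-based methods on MVTec-AD and VisA.

Details

Motivation: Industrial anomaly detection is important for smart manufacturing, but existing deep learning methods usually produce only binary decisions with limited semantic explanation. MLLMs can generate fine-grained, language-based analyses, but existing methods often require costly fine-tuning and yield little accuracy improvement over lightweight specialist detectors.

Result: Experiments on MVTec-AD and VisA show that EAGLE improves the anomaly detection performance of multiple MLLMs without any parameter updates, reaching a level comparable to fine-tuning-based methods.

Insight: The novelty is a tuning-free, expert-augmented attention guidance framework that uses expert model outputs as a guidance signal so that MLLMs attend more closely to anomalous regions, improving detection accuracy and interpretability. By analyzing the attention distribution in the intermediate layers of MLLMs, the work also reveals an association between successful detection and attention concentration on anomalous regions, offering a new view of MLLM internals.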

Abstract: Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from an expert model to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLM internals by examining the attention distribution of MLLMs to the anomalous image regions in the intermediate layers. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning-based methods. Code is available at https://github.com/shengtun/Eagle.


[33] 4D Monocular Surgical Reconstruction under Arbitrary Camera Motions cs.CVPDF

Jiwei Shan, Zeyu Cai, Cheng-Tai Hsieh, Yirui Li, Hao Liu

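TL;DR: This paper proposes Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. It introduces a progressive, window-based global representation that allocates a local deformable scene model to each observed window, which lets it scale to long sequences with substantial motion.

Details

Motivation: Existing methods struggle with monocular endoscopic sequences that involve large camera motion. They typically rely on stereo depth priors or accurate structure-from-motion for initialization, which makes them hard to apply in real clinical settings.

Result: Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometric accuracy.

Insight: The novelty lies in a scalable, window-based local scene representation and a coarse-to-fine initialization and optimization strategy that does not depend on stereo depth or accurate structure-from-motion, combined with long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility.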

Abstract: Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.


[34] QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery cs.CVPDF

Xuan-Bac Nguyen, Hoang-Quan Nguyen, Sankalp Pandey, Tim Faltermeier, Nicholas Borys

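TL;DR: This paper proposes QuPAINT, a physics-aware multimodal framework for characterizing two-dimensional quantum materials from optical microscopy images. The framework comprises Synthia, a physics-based synthetic data generator; QMat-Instruct, a large-scale instruction dataset; a physics-informed attention module that fuses optical priors; and QF-Bench, a comprehensive evaluation benchmark. It targets the difficulty that existing vision models, lacking physical priors and generalization ability, have with subtle appearance changes and differences in experimental conditions.

Details

Motivation: When characterizing 2D quantum materials, existing vision models generalize poorly to new materials or hardware conditions because they lack physical priors, labeled data is limited, and laboratories and imaging setups vary widely. A new approach that incorporates physical knowledge and improves generalization is therefore needed.

Result: The method is evaluated on the proposed QF-Bench benchmark, which spans multiple materials, substrates, and imaging settings and provides standardized protocols for fair and reproducible evaluation. The abstract does not state specific quantitative results or a comparison with the state of the art.

Insight: The contributions are: 1. Synthia, a physics-based synthetic data generator that simulates the optical response of quantum material flakes to reduce reliance on expert annotation; 2. QMat-Instruct, the first large-scale instruction dataset for quantum materials, containing multimodal, physics-informed question-answer pairs; 3. QuPAINT, a physics-aware instruction tuning architecture whose physics-informed attention module fuses visual embeddings with optical priors for more robust and discriminative flake representations. By integrating physical models and data augmentation, the approach improves generalization and interpretability in a specialized domain.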

Abstract: Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to the subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. Existing vision models struggle in this domain since they lack physical priors and cannot generalize to new materials or hardware conditions. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. Synthia produces diverse and high-quality samples, helping reduce the dependence on expert manual annotation. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs) to understand the appearance and thickness of flakes. Then, we propose Physics-Aware Instruction Tuning (QuPAINT), a multimodal architecture that incorporates a Physics-Informed Attention module to fuse visual embeddings with optical priors, enabling more robust and discriminative flake representations. Finally, we establish QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging settings, offering standardized protocols for fair and reproducible evaluation.


[35] Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection cs.CV | cs.AIPDF

Yichen Lu, Siwei Nie, Minlong Lu, Xudong Yang, Xiaobo Zhang

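TL;DR: This paper proposes a new method for image copy detection. By introducing a pixel coordinate tracking module and a geometrically guided contrastive loss, it combines pixel-level traceability with patch-level similarity learning to handle sophisticated edits, and it achieves state-of-the-art performance on DISC21.

Details

Motivation: Existing view-level contrastive methods based on self-supervised learning perform poorly on image pairs with sophisticated edits because they learn fine-grained correspondences insufficiently. The paper addresses this limitation by exploiting the geometric traceability inherent in edited content.

Result: On DISC21, the method reaches 88.7% uAP / 83.9% RP90 for the matcher and 72.6% uAP / 68.4% RP90 for the descriptor, which is state of the art.

Insight: The novelty lies in PixTrace, a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations, and CopyNCE, a loss that regularizes patch affinity using overlap ratios from verified mappings. Together they bridge pixel-level traceability and patch-level similarity learning, suppress supervision noise in self-supervised training, and improve interpretability.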

Abstract: Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace, a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace's verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on the DISC21 dataset) but also better interpretability than existing methods.


[36] LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs cs.CVPDF

Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak

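TL;DR: This paper proposes LATA (Laplacian-Assisted Transductive Adaptation), a training-free and label-free refinement method that improves the conformal uncertainty of medical vision-language models (VLMs) under domain shift. It smooths zero-shot probabilities over an image-image k-nearest-neighbor graph and introduces a failure-aware conformal score, improving prediction set efficiency and class-wise coverage balance while preserving the validity of split conformal prediction (SCP).

Details

Motivation: Medical VLMs are strong zero-shot recognizers, but their reliability under domain shift depends on calibrated uncertainty with guarantees. In few-shot, imbalanced regimes, standard SCP often yields large prediction sets (low efficiency) and unbalanced class-wise coverage (a high class-conditioned coverage gap), while naively adapting to calibration labels breaks exchangeability and voids the guarantees.

Result: Across three medical VLMs and nine downstream tasks, LATA consistently reduces prediction set size and the class-conditioned coverage gap while matching or tightening target coverage. It outperforms prior transductive baselines and narrows the gap to label-using methods at low computational cost.

Insight: The contributions are: 1) a Laplacian-smoothing-based transductive adaptation method that needs no training or extra labels and preserves exchangeability by smoothing zero-shot probabilities over a graph; 2) a failure-aware conformal score that plugs into the vision-language uncertainty framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency; 3) a black-box, compute-light design with an optional label-informed variant that uses calibration marginals.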

Abstract: Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
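
For readers unfamiliar with split conformal prediction, which the abstract above builds on, the following is a minimal sketch of the standard procedure only. It does not implement LATA's graph smoothing or its failure-aware score. The nonconformity score (one minus the probability of the true label), the function names, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label must be kept.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every label whose score 1 - p(label) does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration scores: 1 - probability assigned to the true label.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45]
q_hat = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical zero-shot probabilities for one test image over three classes.
test_probs = [0.70, 0.20, 0.10]
pred_set = prediction_set(test_probs, q_hat)
```

Sharper probabilities push more labels above the threshold, so the prediction set shrinks. This is the sense in which refining zero-shot probabilities can improve efficiency at a fixed coverage level.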


[37] GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking cs.CVPDF

Zixu Cheng, Da Li, Jian Hu, Ziquan Liu, Wei Li

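TL;DR: GraphThinker is a reinforcement-finetuning-based method that reduces hallucinations in video reasoning by building structured event-level scene graphs and strengthening visual grounding. It first uses a multimodal large language model to construct an event-based video scene graph that explicitly models intra-event and inter-event relations, and incorporates these graphs into the model as an intermediate thinking process. It also adds a visual attention reward during reinforcement finetuning to strengthen video grounding.

Details

Motivation: Video reasoning requires understanding causal relationships between video events, but these relationships are usually implicit and costly to annotate. Existing MLLM methods infer event relations from dense captions or video summaries and lack causal structure modeling, which leads to hallucinations during reasoning.

Result: Evaluation on RexTime and VidHalluc shows that GraphThinker captures object and event relations better and localizes events more precisely than prior methods, reducing hallucinations in video reasoning.

Insight: The novelty is to explicitly build event-level scene graphs that model the causal structure of a video and integrate them into the MLLM as an intermediate thinking process, while a visual attention reward during reinforcement finetuning strengthens visual grounding. The two together mitigate hallucinations.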

Abstract: Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during the video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.


[38] RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward cs.CVPDF

Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang

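TL;DR: This paper proposes RetouchIQ, a framework in which multimodal large language model (MLLM) agents, guided by a generalist reward model, perform instruction-based executable image retouching. The MLLM is fine-tuned with reinforcement learning (RL) to interpret the user's editing intent and generate the corresponding executable image adjustment parameters, linking high-level aesthetic goals to precise parameter control.

Details

Motivation: Existing MLLM-based approaches to professional image editing are hard to train with RL because they lack reliable, verifiable reward signals that reflect the subjective nature of creative editing. Conventional rule-based rewards, which compute similarity to a fixed reference image with handcrafted metrics, are limited.

Result: Experiments show that RetouchIQ substantially outperforms previous MLLM-based and diffusion-based editing systems in semantic consistency and perceptual quality. The authors curate a dataset of 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing.

Insight: The core novelty is the generalist reward model, an RL fine-tuned MLLM that uses multimodal reasoning to generate a set of evaluation metrics for each case and thereby provides high-quality, instruction-consistent scalar feedback. This yields a flexible, explainable, and executable assistant for professional image editing.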

Abstract: Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.


[39] Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment cs.CV | cs.MM | cs.SDPDF

Ivan Rinaldi, Matteo Mendula, Nicola Fanelli, Florence Levé, Matteo Testi

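TL;DR: This paper proposes Art2Mus, a framework that generates music directly from artwork images without image-to-text conversion or language supervision. The authors build the large-scale ArtSound dataset and map visual embeddings into the conditioning space of a latent diffusion model to synthesize music guided purely by visual information.

Details

Motivation: Existing image-conditioned music generation systems are usually trained on natural photographs and struggle to capture the rich semantic, stylistic, and cultural content of artworks. Most also rely on image-to-text conversion, using language as a semantic shortcut that prevents direct visual-to-audio learning.

Result: Experiments show that Art2Mus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artwork. Its absolute alignment scores are lower than those of text-conditioned systems, as expected given the difficulty of removing language supervision, but it is competitive in perceptual quality and achieves meaningful cross-modal correspondence.

Insight: The main contributions are: 1) ArtSound, a large-scale dataset of artwork-music pairs; 2) the first framework designed for direct artwork-to-music generation, which bypasses the intermediate image-to-text step and establishes purely visually conditioned music generation as a distinct and challenging research direction.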

Abstract: Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems, as expected given the substantially increased difficulty of removing linguistic supervision, ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.


[40] Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery cs.CV | cs.AI | cs.CY | cs.LGPDF

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

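TL;DR: This paper proposes a unified framework for geospatial discovery that combines active learning, online meta-learning, and concept-guided reasoning. Its core novelty is a shared notion of "concept relevance", on which it builds a concept-weighted uncertainty sampling strategy and a relevance-aware meta-batch formation strategy, with the goal of efficiently uncovering hidden targets from limited, biased data in resource-constrained, dynamic environments.

Details

Motivation: In real-world settings where data collection is costly and the environment changes (for example, environmental monitoring and disaster response), hidden targets must be uncovered efficiently under tight resource constraints. Sparse and biased geospatial ground truth limits existing learning-based methods such as reinforcement learning.

Result: Experiments on a real-world dataset of contamination by cancer-causing PFAS show that the method reliably uncovers targets with limited data and a changing environment.

Insight: The main novelty is to treat domain-specific concepts (such as land cover and source proximity) as readily available prior knowledge, formalize them as "concept relevance", and use it to dynamically adjust the uncertainty measure in active learning and the batch composition in meta-learning, improving sampling efficiency and generalization under data sparsity and distribution shift.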

Abstract: In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of concept relevance, which captures how domain-specific factors influence target presence: a concept-weighted uncertainty sampling strategy, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a relevance-aware meta-batch formation strategy that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method’s reliability at uncovering targets with limited data and a varying environment.


[41] CORAL: Correspondence Alignment for Improved Virtual Try-On cs.CVPDF

Jiyoung Kim, Youngjin Shin, Siyoon Jin, Dahyun Chung, Jisu Nam

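TL;DR: This paper proposes CORAL, a virtual try-on (VTON) framework based on Diffusion Transformers (DiTs). It explicitly aligns query-key matching with external correspondences to address the difficulty existing methods have in preserving fine garment details in unpaired settings.

Details

Motivation: Existing VTON methods usually do not explicitly enforce person-garment alignment and do not explain how correspondence emerges within DiTs, so they struggle to preserve garment details accurately in unpaired settings.

Result: CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation, and extensive ablations validate the design choices.

Insight: The novelty is an analysis of full 3D attention in DiT-based architectures, which reveals that person-garment correspondence depends on precise query-key matching. Building on this, the authors propose a correspondence distillation loss and an entropy minimization loss that explicitly align and sharpen the attention distribution.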

Abstract: Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.


[42] IntRec: Intent-based Retrieval with Contrastive Refinement cs.CVPDF

Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu

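TL;DR: This paper proposes IntRec, an intent-based interactive object retrieval framework that refines predictions using user feedback. At its core are an intent state that maintains dual memory sets of positive anchors (confirmed cues) and negative constraints (rejected hypotheses), and a contrastive alignment function that ranks candidate objects by maximizing similarity to positive cues while penalizing rejected hypotheses, enabling fine-grained disambiguation in cluttered scenes.

Details

Motivation: Retrieving user-specified objects from complex scenes is challenging, especially when queries are ambiguous or involve multiple similar objects, and existing open-vocabulary detectors cannot refine their predictions based on user feedback.

Result: IntRec achieves 35.4 AP on LVIS, outperforming OVMR, CoDet, and CAKE by 2.3, 3.7, and 0.5 AP, respectively. On the LVIS-Ambiguous benchmark, a single round of corrective feedback improves performance by 7.9 AP over its one-shot baseline, with less than 30 ms of added latency per interaction.

Insight: The novelty is an interactive feedback mechanism with a dual-memory intent state, combined with contrastive alignment for fine-grained disambiguation. It substantially improves retrieval accuracy without additional supervision and enables efficient human-in-the-loop retrieval.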

Abstract: Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
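
The ranking idea in the abstract above (reward similarity to confirmed cues, penalize similarity to rejected ones) can be sketched as follows. This is an illustrative assumption rather than the paper's formulation: the abstract does not give the alignment function, so the max-over-memory cosine similarity, the `penalty` weight, and all names and toy embeddings here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intent_score(candidate, positives, negatives, penalty=1.0):
    """Best match to a positive anchor minus a weighted best match to a negative one."""
    pos = max((cosine(candidate, p) for p in positives), default=0.0)
    neg = max((cosine(candidate, n) for n in negatives), default=0.0)
    return pos - penalty * neg


def rank(candidates, positives, negatives):
    """Return candidate names sorted from best to worst intent score."""
    return sorted(
        candidates,
        key=lambda name: intent_score(candidates[name], positives, negatives),
        reverse=True,
    )


# Hypothetical 2-D embeddings for two visually similar candidates.
candidates = {"red mug": [1.0, 0.0], "blue mug": [0.0, 1.0]}
positives = [[1.0, 0.0]]  # cue the user confirmed
negatives = [[0.0, 1.0]]  # hypothesis the user rejected
ranking = rank(candidates, positives, negatives)
```

Each round of feedback would append to `positives` or `negatives` and re-rank the existing candidates, which is cheap because no detector pass is repeated in this sketch.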


[43] When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs cs.CV | cs.ROPDF

Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang

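TL;DR: This paper studies counterfactual failures of vision-language-action (VLA) models when instructions lack strong scene-specific supervision, where the model acts on visual shortcuts rather than language intent. The authors introduce LIBERO-CF, the first counterfactual benchmark for evaluating language following in VLAs, and propose Counterfactual Action Guidance (CAG), a plug-and-play dual-branch inference scheme that combines a standard VLA policy with a language-unconditioned vision-action (VA) module and performs a counterfactual comparison during action selection, reducing reliance on visual shortcuts and improving task success.

Details

Motivation: When language instructions lack strong scene-specific supervision, dataset biases lead VLA models to rely on visual shortcuts: they ignore language intent and repeat behaviors that were common in training, which causes counterfactual failures.

Result: On LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language following accuracy and by 3.6% in task success on under-observed tasks using a training-free strategy; paired with a VA model, the gains rise to 15.5% and 8.5%, respectively. In real-world evaluations, it reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.

Insight: The novelty lies in LIBERO-CF, the first benchmark targeting counterfactual failures in VLAs, and CAG, a plug-and-play dual-branch inference scheme that explicitly regularizes language conditioning through counterfactual comparison. It needs no additional demonstrations or changes to existing architectures and mitigates reliance on visual shortcuts.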

Abstract: Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To study this systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
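
The abstract does not state how the two branches are compared. As one possible reading only, the sketch below swaps in an extrapolation in the style of classifier-free guidance: the language-conditioned action is pushed away from the language-unconditioned one. The function name, the `scale` parameter, and the action vectors are hypothetical and should not be taken as the authors' CAG rule.

```python
def guided_action(vla_action, va_action, scale=1.5):
    """Extrapolate from the language-unconditioned action toward the conditioned one.

    Assumption: a classifier-free-guidance-style combination, not the paper's formula.
    scale == 1.0 returns the plain VLA action; scale > 1.0 amplifies whatever
    the language instruction changed relative to the vision-only branch.
    """
    return [va + scale * (vla - va) for vla, va in zip(vla_action, va_action)]


# Hypothetical 2-D action deltas from the two branches.
vla_action = [1.0, 0.0]  # policy conditioned on image and instruction
va_action = [0.0, 0.0]   # module conditioned on image only
action = guided_action(vla_action, va_action, scale=2.0)
```

When both branches agree, the output equals their shared action for any `scale`, so guidance only takes effect where language actually changes the prediction.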


[44] OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents cs.CVPDF

Akashah Shabbir, Muhammad Umer Sheikh, Muhammad Akhtar Munir, Hiyam Debary, Mustansar Fiaz

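TL;DR: OpenEarthAgent is a unified framework for developing tool-augmented geospatial agents. It is trained by supervised fine-tuning on satellite imagery, natural-language queries, and detailed reasoning traces to support multi-step geospatial analysis.

Details

Motivation: Extending multimodal reasoning to remote sensing is challenging because models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. The paper develops a unified framework to bridge this gap.

Result: On a corpus with 14,538 training and 1,169 evaluation instances, the model shows structured reasoning, stable spatial understanding, and interpretable behavior, with consistent improvements over a strong baseline and performance competitive with recent open- and closed-source models.

Insight: The novelty is to use supervised fine-tuning to align the model with verified multi-step tool interactions, combining GIS operations with index analyses (such as NDVI, NBR, and NDBI) to enable tool-driven geospatial interaction and improve reasoning in remote sensing.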

Abstract: Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
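
The three indices named in the abstract above are standard normalized band differences. The sketch below computes them for a single pixel from surface reflectance values. The function names and sample reflectances are illustrative and do not come from the OpenEarthAgent toolset.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both bands are zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: high over healthy vegetation."""
    return normalized_difference(nir, red)


def nbr(nir, swir2):
    """Normalized Burn Ratio: drops sharply over recently burned areas."""
    return normalized_difference(nir, swir2)


def ndbi(swir1, nir):
    """Normalized Difference Built-up Index: positive over built-up surfaces."""
    return normalized_difference(swir1, nir)


# Hypothetical reflectances for one vegetated pixel.
pixel = {"red": 0.1, "nir": 0.5, "swir1": 0.3, "swir2": 0.2}
indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "NBR": nbr(pixel["nir"], pixel["swir2"]),
    "NDBI": ndbi(pixel["swir1"], pixel["nir"]),
}
```

All three share one formula and differ only in which bands they contrast, which is why a tool-calling agent can treat them as a single parameterized operation.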


cs.RO [Back]

[45] MALLVI: a multi agent framework for integrated generalized robotics manipulation cs.RO | cs.AI | cs.CV | cs.LGPDF

Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani

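TL;DR: MALLVi is a multi-agent large language and vision framework for closed-loop, feedback-driven robotic manipulation. Given a natural-language instruction and an image of the environment, it coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to generate executable atomic actions, and uses a vision-language model to evaluate environmental feedback and decide whether to repeat a step or proceed.

Details

Motivation: Existing LLM-based robot task planning methods rely on specialized models, fine-tuning, or prompt tuning, and usually run open-loop without robust environmental feedback, which makes them fragile in dynamic environments.

Result: Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and success rates in zero-shot manipulation tasks.

Insight: The novelty is a modular multi-agent architecture for closed-loop, feedback-driven planning, in which the Reflector supports targeted error detection and recovery by reactivating only the relevant agents. This avoids full replanning and improves efficiency and robustness.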

Abstract: Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. We present MALLVi, a Multi Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI.


cs.AI [Back]

[46] Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents cs.AI | cs.CLPDF

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu

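TL;DR: This paper introduces GUI-Owl-1.5, a native GUI agent model that supports multiple platforms (desktop, mobile, browser, and more) and comes in multiple sizes (2B to 235B), designed for cloud-edge collaboration and real-time interaction. It achieves state-of-the-art results among open-source models on more than 20 GUI benchmarks covering automation, grounding, tool calling, and memory and knowledge tasks.

Details

Motivation: The goal is a general GUI agent that can efficiently carry out complex, long-horizon tasks across many graphical user interface platforms, with strong reasoning and adaptation abilities.

Result: Among open-source models, GUI-Owl-1.5 is state of the art on more than 20 GUI benchmarks: automation (OSWorld 56.5, AndroidWorld 71.6, WebArena 48.4), grounding (ScreenSpotPro 80.3), tool calling (OSWorld-MCP 47.6, MobileWorld 46.8), and memory and knowledge (GUI-Knowledge Bench 75.5).

Insight: The innovations are: 1) Hybrid Data Flywheel: a data pipeline that combines simulated environments with cloud-based sandbox environments to improve the efficiency and quality of data collection; 2) Unified Enhancement of Agent Capabilities: a unified thought-synthesis pipeline that strengthens reasoning, with particular emphasis on tool calling, memory, and multi-agent adaptation; 3) Multi-platform Environment RL Scaling: a new environment RL algorithm, MRPO, that addresses multi-platform conflicts and the low training efficiency of long-horizon tasks.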

Abstract: The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. GUI-Owl-1.5 achieves state-of-the-art results among open-source models on more than 20 GUI benchmarks: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool-calling tasks, it obtains 47.6 on OSWorld-MCP and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI-Knowledge Bench. GUI-Owl-1.5 incorporates several key innovations: (1) Hybrid Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation based on a combination of simulated environments and cloud-based sandbox environments, in order to improve the efficiency and quality of data collection; (2) Unified Enhancement of Agent Capabilities: we use a unified thought-synthesis pipeline to enhance the model's reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory, and multi-agent adaptation; (3) Multi-platform Environment RL Scaling: we propose a new environment RL algorithm, MRPO, to address the challenges of multi-platform conflicts and the low training efficiency of long-horizon tasks. The GUI-Owl-1.5 models are open-sourced, and an online cloud-sandbox demo is available at https://github.com/X-PLUG/MobileAgent.


[47] RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models cs.AI | cs.CLPDF

Yunseok Han, Yejoon Lee, Jaeyoung Do

TL;DR: 该论文提出了一个评估大型推理模型(LRMs)推理忠实性的正式框架和基准测试RFEval。该框架将忠实性定义为立场一致性和因果影响两个可测试条件,并构建了一个包含7,186个实例的基准,通过输出层面的反事实干预来探测忠实性。评估了12个开源LRM,发现49.7%的输出存在不忠实问题,主要源于立场不一致。研究发现准确率并非忠实性的可靠代理指标,且当前的RL风格训练目标可能损害推理忠实性。

Details

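Motivation: Large Reasoning Models (LRMs) perform strongly, yet the reasoning they generate often sounds plausible without reflecting their true decision process, which undermines reliability and trust. A method is therefore needed to formally evaluate and quantify reasoning faithfulness.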

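Result: Evaluating 12 open-source LRMs on the seven tasks of RFEval, the authors find unfaithfulness in 49.7% of outputs. Failures are concentrated in brittle, convergent domains such as math and code. After controlling for model and task, the link between accuracy and faithfulness is weak and statistically insignificant.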

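Insight: The paper's innovation is a formal evaluation framework that decouples reasoning faithfulness (stance consistency and causal influence) from accuracy, together with the corresponding counterfactual-intervention benchmark, RFEval. Its core insight is that current training, especially with RL-style objectives, may maintain accuracy at the expense of reasoning faithfulness. This points to an optimization direction for more trustworthy AI systems: optimizing both the correctness of outcomes and the structural integrity of the reasoning process.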

Abstract: Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset are available at the project page: https://aidaslab.github.io/RFEval/


[48] From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan’s Humanities and Social Sciences cs.AI | cs.CL | cs.CY

Yi-Chih Huang

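TL;DR: This study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) intended as a replicable methodological framework for humanities and social science research. The workflow rests on three principles (task modularization, human-AI division of labor, and verifiability), comprises seven stages, and assigns clear roles to human researchers (research judgment and ethical decisions) and AI Agents (information retrieval and text generation). Taiwan's Claude.ai usage data from the Anthropic Economic Index serves as the empirical case for validating the workflow's feasibility and output quality in secondary data analysis.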

Details

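Motivation: Generative AI is reshaping knowledge work, but existing research concentrates on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. This study aims to fill that gap by designing an AI Agent-based collaborative research workflow and exploring how AI can be integrated into humanities and social science research practice to augment research perspectives.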

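Result: Through an empirical analysis of the AEI Taiwan data (N = 7,729 conversations), the study demonstrates the workflow's operational process and output quality when applied to secondary data research (see Appendix A), supporting the feasibility of the method.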

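Insight: The paper's innovation is a replicable AI collaboration framework designed specifically for the humanities and social sciences. Through reflexive documentation of the operational process, it identifies three operational modes of human-AI collaboration: direct execution, iterative refinement, and human-led. This taxonomy reveals the irreplaceability of human judgment in research question formulation, theoretical interpretation, contextualized reasoning, and ethical reflection.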

Abstract: Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a “methodological experiment,” this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research. Taiwan’s Claude.ai usage data (N = 7,729 conversations, November 2025) from the Anthropic Economic Index (AEI) serves as the empirical vehicle for validating the feasibility of this methodology. This study operates on two levels: the primary level is the design and validation of a methodological framework - a seven-stage modular workflow grounded in three principles: task modularization, human-AI division of labor, and verifiability, with each stage delineating clear roles for human researchers (research judgment and ethical decisions) and AI Agents (information retrieval and text generation); the secondary level is the empirical analysis of AEI Taiwan data - serving as an operational demonstration of the workflow’s application to secondary data research, showcasing both the process and output quality (see Appendix A). This study contributes by proposing a replicable AI collaboration framework for humanities and social science researchers, and identifying three operational modes of human-AI collaboration - direct execution, iterative refinement, and human-led - through reflexive documentation of the operational process. This taxonomy reveals the irreplaceability of human judgment in research question formulation, theoretical interpretation, contextualized reasoning, and ethical reflection. Limitations including single-platform data, cross-sectional design, and AI reliability risks are acknowledged.


[49] ArXiv-to-Model: A Practical Study of Scientific LM Training cs.AI | cs.CL

Anuj Gupta

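TL;DR: Through an end-to-end case study, this paper documents the full process of training a 1.36B-parameter scientific language model from raw arXiv LaTeX sources, covering data preprocessing, tokenization, training, and experimental analysis, with the aim of giving practical guidance to researchers with limited compute.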

Details

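Motivation: Frontier large language models show strong reasoning and mathematical capabilities, but the practical process of training domain-specialized language models from raw scientific literature (such as arXiv) remains poorly documented. This study aims to fill that gap with a transparent, engineering-oriented training guide for resource-constrained researchers.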

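Result: The author ran 24 experiments on arXiv data from mathematics, computer science, and theoretical physics. The model trains stably in a data-rich regime of 52B pretraining tokens, and the study analyzes how preprocessing, tokenization, and infrastructure bottlenecks affect training.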

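Insight: Rather than proposing a new model architecture, the contribution is a complete, engineering-grounded, end-to-end training case study. It highlights the significant effect of preprocessing decisions on usable token volume, the role of tokenization in symbolic stability, and the fact that storage and I/O constraints can be as limiting as compute.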

Abstract: While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.


[50] Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability cs.AI | cs.CL | cs.IR

Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar

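TL;DR: This paper proposes a new way to evaluate the quality of Chain-of-Thought (CoT) reasoning generated by large language models (LLMs), going beyond conventional task-accuracy metrics. It introduces two new measures, reusability and verifiability, and uses a Thinker-Executor framework to decouple CoT generation from execution, evaluating four Thinker models against a committee of ten Executor models on five benchmarks. The new measures do not correlate with standard accuracy, and CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs such as Llama and Gemma.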

Details

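Motivation: In multi-agent information retrieval (IR) pipelines for tasks such as search and ranking, LLM-based agents exchange Chain-of-Thought (CoT) as intermediate reasoning. Current CoT evaluation focuses narrowly on target-task accuracy and cannot assess the quality or utility of the reasoning process itself.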

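Result: Four Thinker models were evaluated on five benchmarks using a committee of ten Executor models. Reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs such as Llama and Gemma.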

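Insight: The paper's core innovation is two new dimensions for evaluating CoT reasoning quality (reusability and verifiability) and a decoupled evaluation framework (Thinker-Executor). Together they offer a finer-grained tool for understanding and comparing the intrinsic quality of reasoning chains from different models, challenge the conventional paradigm of relying on task accuracy alone, and reveal possible limits on how well the reasoning of specialized models transfers.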

Abstract: In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker’s CoT. Verifiability measures how frequently an Executor can match the Thinker’s answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
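The verifiability measure described above can be illustrated with a minimal sketch. The abstract only states that verifiability is how frequently an Executor matches the Thinker's answer when given the CoT; the exact-match comparison, the averaging over a committee, and the function names `verifiability` and `mean_verifiability` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def verifiability(thinker_answer: str, executor_answers: List[str]) -> float:
    """Fraction of Executors whose answer matches the Thinker's answer.

    Assumption: answers are compared by exact string match after trimming
    whitespace. The paper may use a different matching rule.
    """
    if not executor_answers:
        raise ValueError("at least one Executor answer is required")
    target = thinker_answer.strip()
    matches = sum(ans.strip() == target for ans in executor_answers)
    return matches / len(executor_answers)


def mean_verifiability(records: List[Dict]) -> float:
    """Average per-question verifiability over a benchmark (hypothetical aggregation)."""
    scores = [verifiability(r["thinker"], r["executors"]) for r in records]
    return sum(scores) / len(scores)


# Hypothetical data: one Thinker answer per question, four Executors in the committee.
records = [
    {"thinker": "42", "executors": ["42", "42", "7", "42"]},
    {"thinker": "B", "executors": ["B", "A", "A", "B"]},
]
print(mean_verifiability(records))
```

Note that the comparison target is the Thinker's answer, not the gold label, which is one way such a score can diverge from task accuracy.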


[51] CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts cs.AI | cs.CL | cs.IR

Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello

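TL;DR: HIPE-2026 is a CLEF evaluation lab focused on extracting person-place relations from multilingual historical texts. Building on HIPE-2020 and HIPE-2022, it extends the task to semantic relation extraction, aiming to identify person-place associations across languages and time periods. Systems must classify two relation types (“at” and “isAt”), and the lab introduces a three-fold evaluation framework that jointly assesses accuracy, computational efficiency, and domain generalization.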

Details

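Motivation: The lab addresses the challenge of accurately extracting person-place relations from noisy, multilingual historical texts, in support of downstream digital humanities applications such as knowledge-graph construction, historical biography reconstruction, and spatial analysis.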

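Result: The abstract reports no specific quantitative results. The lab provides a benchmark framework for evaluating systems on the relation extraction task in terms of accuracy, efficiency, and generalization.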

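Insight: The innovation lies in extending relation extraction to multilingual historical texts and introducing a three-fold evaluation framework (accuracy, efficiency, generalization) that emphasizes reasoning over temporal and geographical cues, providing a systematic way to evaluate the processing of unstructured historical data.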

Abstract: HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person–place associations in multiple languages and time periods. Systems are asked to classify relations of two types - $at$ (“Has the person ever been at this place?”) and $isAt$ (“Is the person located at this place around publication time?”) - requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.


cs.OH [Back]

[52] A Conceptual Hybrid Framework for Post-Quantum Security: Integrating BB84 QKD, AES, and Bio-inspired Mechanisms cs.OH

Md. Ismiel Hossen Abir

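TL;DR: This paper proposes a hybrid security framework for data protection in the post-quantum era. It combines AES encryption, BB84 quantum key distribution, quantum state comparison for lightweight authentication, and a bio-inspired immune system for adaptive threat detection, aiming to counter the threat that quantum computing poses to classical cryptography such as RSA.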

Details

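Motivation: Quantum computing (for example, Shor's algorithm) can efficiently break classical ciphers such as RSA, whereas classical factorization methods are inefficient for large keys. A post-quantum security framework that withstands quantum attacks is therefore needed.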

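Result: The paper mainly proposes a conceptual framework and provides no concrete experimental results. It notes that, under ideal conditions, BB84 achieves full key agreement and detects eavesdropping with high accuracy.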

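Insight: The innovation lies in integrating classical encryption (AES), quantum key distribution (BB84), quantum state authentication, and a bio-inspired immune mechanism into a single framework that offers a scalable and adaptive approach to post-quantum encryption. Concrete implementation and validation are left to future work.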

Abstract: Quantum computing poses a significant risk to classical cryptography, especially RSA, which depends on the difficulty of factoring large numbers. Classical factorization methods, such as Trial Division and Pollard’s Rho, are inefficient for large keys, while Shor’s quantum algorithm can break RSA efficiently in polynomial time. This research studies RSA’s vulnerabilities under both classical and quantum attacks and designs a hybrid security framework to ensure data protection in the post-quantum era. The conceptual framework combines AES encryption for classical security, BB84 Quantum Key Distribution (QKD) for secure key exchange with eavesdropping detection, quantum state comparison for lightweight authentication, and a bio-inspired immune system for adaptive threat detection. The study finds that RSA is vulnerable to Shor’s algorithm, while BB84 achieves full key agreement in ideal conditions and detects eavesdropping with high accuracy. The conceptual model includes both classical and quantum security methods, providing a scalable and adaptive solution for post-quantum data protection. This work primarily proposes a conceptual framework. Detailed implementation, security proofs, and extensive experimental validation are considered future work.
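To make the BB84 claims concrete (full key agreement in ideal conditions, and detectable eavesdropping), here is a minimal classical simulation of the textbook protocol. It is a sketch under stated assumptions, not the paper's implementation: the channel is noiseless, the eavesdropper uses a simple intercept-resend strategy, and the names `bb84` and `error_rate` are hypothetical.

```python
import random


def bb84(n_bits, eavesdrop=False, seed=0):
    """Toy BB84 run over an ideal, noiseless channel.

    Returns the sifted keys held by Alice and Bob.
    Bases are encoded as 0 (rectilinear) and 1 (diagonal).
    """
    rng = random.Random(seed)
    alice_bits = [rng.randint(0, 1) for _ in range(n_bits)]
    alice_bases = [rng.randint(0, 1) for _ in range(n_bits)]
    bob_bases = [rng.randint(0, 1) for _ in range(n_bits)]

    bob_bits = []
    for bit, a_basis, b_basis in zip(alice_bits, alice_bases, bob_bases):
        basis_in_flight = a_basis
        if eavesdrop:
            # Intercept-resend (assumed attack): Eve measures in a random
            # basis and resends her result prepared in that basis.
            e_basis = rng.randint(0, 1)
            if e_basis != basis_in_flight:
                bit = rng.randint(0, 1)  # wrong basis yields a random outcome
            basis_in_flight = e_basis
        # Bob's result is deterministic in the matching basis, random otherwise.
        if b_basis == basis_in_flight:
            bob_bits.append(bit)
        else:
            bob_bits.append(rng.randint(0, 1))

    # Sifting: keep only positions where Alice's and Bob's bases agree.
    keep = [i for i in range(n_bits) if alice_bases[i] == bob_bases[i]]
    return [alice_bits[i] for i in keep], [bob_bits[i] for i in keep]


def error_rate(key_a, key_b):
    """Fraction of sifted positions where the two keys disagree."""
    if not key_a:
        return 0.0
    return sum(a != b for a, b in zip(key_a, key_b)) / len(key_a)


clean_a, clean_b = bb84(4000, eavesdrop=False, seed=1)
tapped_a, tapped_b = bb84(4000, eavesdrop=True, seed=1)
print("error rate without Eve:", error_rate(clean_a, clean_b))
print("error rate with Eve:   ", error_rate(tapped_a, tapped_b))
```

Without an eavesdropper the sifted keys agree exactly. With intercept-resend, the expected error rate on the sifted key is about 25%, which is the signal Alice and Bob look for when they compare a sample of their bits.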


cs.LG [Back]

[53] Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency cs.LG | cs.CL

Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma

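TL;DR: This paper proposes double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the causal reasoning ability of large language models (LLMs). The method requires no labeled counterfactual data; it verifies a model's ability to perform causal intervention and counterfactual prediction, and it improves performance on a range of reasoning tasks.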

Details

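Motivation: Although large language models perform strongly on reasoning benchmarks, they are brittle when faced with counterfactual questions, which suggests weaknesses in their causal reasoning. Existing approaches rely on large-scale labeled counterfactual data, which cannot practically cover the full space of potential counterfactuals, so a method for evaluation and improvement that needs no labeled data is required.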

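Result: DCC is used to evaluate the causal reasoning abilities of several leading LLMs across a range of reasoning tasks and interventions. Used as a training-free, test-time rejection sampling criterion, it improves performance on reasoning tasks across multiple model families.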

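Insight: The innovation is DCC, an inference-time method that needs no labeled data and uses double counterfactual consistency to directly measure and strengthen a model's causal reasoning. It sidesteps the bottleneck of large-scale data annotation and is both lightweight and general.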

Abstract: Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs’ causal reasoning, producing such data at the scale required to cover the vast potential space of counterfactuals is difficult. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model’s ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.


[54] Training Large Reasoning Models Efficiently via Progressive Thought Encoding cs.LG | cs.CL

Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu

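TL;DR: This paper proposes Progressive Thought Encoding, a parameter-efficient fine-tuning method that addresses the time and memory bottleneck caused by long autoregressive decoding in reinforcement learning training of large reasoning models. By progressively encoding intermediate reasoning into fixed-size vector representations, the method avoids backpropagating through full-cache rollouts, which substantially reduces memory usage and improves reasoning efficiency under a fixed cache size.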

Details

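Motivation: Large reasoning models excel on complex problems, but their reinforcement learning training relies on outcome-based rewards and therefore needs long rollouts, so autoregressive decoding dominates time and memory. Existing sliding-window cache strategies bound memory but disrupt long-context reasoning and degrade performance. A method is needed that reasons effectively under a fixed cache budget.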

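Result: Experiments on three models (Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B) and six widely used, challenging mathematical benchmarks show an average improvement of 19.3% over LoRA-based fine-tuning and 29.9% over LRMs without fine-tuning. Under the same tight cache budgets, accuracy on AIME2024/2025 improves by up to +23.4.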

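Insight: The innovation is Progressive Thought Encoding, which compresses intermediate reasoning states into fixed-dimensional vectors so that memory usage stays constant during training and inference and the cost of backpropagating through long sequences is avoided. This offers a new approach to training large reasoning models efficiently and scalably under realistic memory constraints, balancing performance and efficiency.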

Abstract: Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning on average, with up to +23.4 accuracy improvement on AIME2024/2025 under the same tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.


[55] The Sound of Death: Deep Learning Reveals Vascular Damage from Carotid Ultrasound cs.LG | cs.CV

Christoph Balada, Aida Romano-Martinez, Payal Varshney, Vincent ten Cate, Katharina Geschke

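TL;DR: This paper presents a machine learning framework that extracts representations of vascular damage (VD) from routine carotid ultrasound videos. Using hypertension as a weak proxy label, the model learns features that are biologically plausible, interpretable, and strongly associated with cardiovascular risk factors. A high VD score stratifies individuals for myocardial infarction, cardiac death, and all-cause mortality, matching or outperforming conventional risk models such as SCORE2.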

Details

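Motivation: Cardiovascular disease is the leading cause of death worldwide, but early risk detection is limited by available diagnostics. Carotid ultrasound, a non-invasive and widely accessible modality, carries a large amount of untapped structural and hemodynamic information. The paper aims to exploit this information to build a scalable risk assessment tool.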

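Result: The learned features are strongly associated with established cardiovascular risk factors, comorbidities, and laboratory measures. For myocardial infarction, cardiac death, and all-cause mortality, risk stratification by high VD score matches or outperforms conventional risk models such as SCORE2.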

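Insight: The innovation is to use hypertension as a weak supervision signal for learning deep representations of vascular damage from routine ultrasound videos, without relying on complex clinical inputs. Explainable AI analyses show that the model relies on vessel morphology and perivascular tissue characteristics, uncovering new functional and anatomical signatures of vascular damage. The method shows that routine ultrasound contains far more prognostic information than previously recognized and opens a path to large-scale, low-cost, non-invasive cardiovascular risk assessment.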

Abstract: Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, yet early risk detection is often limited by available diagnostics. Carotid ultrasound, a non-invasive and widely accessible modality, encodes rich structural and hemodynamic information that is largely untapped. Here, we present a machine learning (ML) framework that extracts clinically meaningful representations of vascular damage (VD) from carotid ultrasound videos, using hypertension as a weak proxy label. The model learns robust features that are biologically plausible, interpretable, and strongly associated with established cardiovascular risk factors, comorbidities, and laboratory measures. High VD stratifies individuals for myocardial infarction, cardiac death, and all-cause mortality, matching or outperforming conventional risk models such as SCORE2. Explainable AI analyses reveal that the model relies on vessel morphology and perivascular tissue characteristics, uncovering novel functional and anatomical signatures of vascular damage. This work demonstrates that routine carotid ultrasound contains far more prognostic information than previously recognized. Our approach provides a scalable, non-invasive, and cost-effective tool for population-wide cardiovascular risk assessment, enabling earlier and more personalized prevention strategies without reliance on laboratory tests or complex clinical inputs.


[56] Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting cs.LG | cs.AI | cs.CL | cs.CV

Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen

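TL;DR: This paper proposes M-Attack-V2, an improved black-box adversarial attack on Large Vision-Language Models (LVLMs). After analyzing the problems of the existing state-of-the-art transfer-based attack M-Attack (such as high gradient variance and unstable optimization), the authors introduce modules including multi-crop alignment, auxiliary target alignment, and patch momentum, which substantially raise attack success rates on frontier LVLMs such as Claude-4.0, Gemini-2.5-Pro, and GPT-5.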

Details

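Motivation: Black-box adversarial attacks on Large Vision-Language Models are challenging because gradients are unavailable and multimodal boundaries are complex. Existing transfer-based methods such as M-Attack are effective, but their local crop-level matching produces high-variance, nearly orthogonal gradients across iterations, which breaks coherent local alignment and destabilizes optimization. This paper aims to resolve these issues and improve the efficiency and success rate of black-box attacks.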

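Result: Evaluated on several frontier LVLMs, the method substantially raises attack success rates: from 8% to 30% on Claude-4.0, from 83% to 97% on Gemini-2.5-Pro, and from 98% to 100% on GPT-5. It outperforms prior black-box LVLM attacks and sets a new state of the art.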

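Insight: The main innovations are: 1) reformulating local matching as an asymmetric expectation over source transformations and target semantics; 2) Multi-Crop Alignment (MCA), which averages gradients from multiple independent local views to reduce variance; 3) Auxiliary Target Alignment (ATA), which uses a small auxiliary set from a semantically correlated distribution to smooth the target manifold; 4) reinterpreting momentum as Patch Momentum and combining it with a refined patch-size ensemble (PE+) to strengthen transferable directions. Together these modular improvements form a more stable and more effective black-box attack framework.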

Abstract: Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.