
cs.CL

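[1] Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction cs.CL | cs.AI

Hongjin Kim, Jaewook Lee, Kiyoung Lee, Jong-hun Shin, Soojong Lim

TL;DR: This study examines whether reinforcement learning (RL) can improve the reasoning ability of large language models (LLMs) in low-resource languages such as Korean, and finds that RL alone yields limited gains for models that lack inherent Korean reasoning ability. Introducing a Korean self-correction code-switching dataset and tuning Korean-specific neurons in the model’s early layers to align its internal reasoning process unlocks RL’s potential and significantly improves performance on mathematical reasoning and self-correction tasks.

Details

Motivation: LLMs show strong reasoning and self-correction abilities in high-resource languages such as English, but their performance remains limited in low-resource languages such as Korean. The study asks whether RL can raise Korean reasoning ability to a level comparable to English.

Result: Combining the proposed alignment method (tuning Korean-specific neurons in early layers) with RL yields significant performance gains on mathematical reasoning and self-correction tasks.

Insight: The key contribution is showing that the crucial factor in improving multilingual reasoning is not injecting new linguistic knowledge but effectively eliciting and aligning the reasoning ability the model already has, specifically through internal translation and neuron-level tuning (particularly of language-specific neurons in early layers). This offers a new perspective on multilingual reasoning alignment.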

Abstract: Large Language Models (LLMs) demonstrate strong reasoning and self-correction abilities in high-resource languages like English, but their performance remains limited in low-resource languages such as Korean. In this study, we investigate whether reinforcement learning (RL) can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. To address this, we explore several fine-tuning strategies and show that aligning the model’s internal reasoning processes with Korean inputs, particularly by tuning Korean-specific neurons in early layers, is key to unlocking RL’s effectiveness. We introduce a self-correction code-switching dataset to facilitate this alignment and observe significant performance gains in both mathematical reasoning and self-correction tasks. Ultimately, we conclude that the crucial factor in multilingual reasoning enhancement is not injecting new linguistic knowledge, but effectively eliciting and aligning existing reasoning capabilities. Our study provides a new perspective on how internal translation and neuron-level tuning contribute to multilingual reasoning alignment in LLMs.


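[2] The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence cs.CL

Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Zhi Zeng

TL;DR: This paper reveals that large language models are vulnerable to carefully crafted deceptive evidence and proposes MisBelief, a framework that systematically generates such evidence to evaluate models. Although models resist direct misinformation, they are highly sensitive to this logically polished deceptive evidence, with belief scores in falsehoods rising by 93.0% on average. To counter this, the paper proposes Deceptive Intent Shielding (DIS), which provides an early warning by inferring the deceptive intent behind evidence and thereby mitigates belief shifts.

Details

Motivation: To assist human decision-making reliably, LLMs must maintain factual internal beliefs, so their resistance to misleading injections needs study. Current models resist explicit misinformation but have a fundamental vulnerability to sophisticated, hard-to-falsify deceptive evidence; the paper aims to probe and mitigate this weakness systematically.

Result: Using MisBelief, the authors generate 4,800 instances across three difficulty levels and evaluate 7 representative LLMs. Models are robust to direct misinformation but highly sensitive to refined deceptive evidence: belief scores in falsehoods increase by 93.0% on average, seriously compromising downstream recommendations. The proposed DIS mechanism consistently mitigates belief shifts and promotes more cautious evidence evaluation.

Insight: The contribution lies in exposing a systematic vulnerability of LLMs to evidence that is logically persuasive yet factually deceptive, and in proposing a framework of multi-role LLMs collaborating over multiple rounds to generate such evidence. Viewed objectively, DIS serves as a governance mechanism that strengthens defenses against deceptive content through intent inference, offering a new approach to improving LLM reliability and safety.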

Abstract: To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.


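[3] MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards cs.CL

Zhiyu Shen, Ziming Wu, Fuming Lai, Shaobing Lian, Yanghui Rao

TL;DR: MemBuilder is a reinforcement learning framework that trains large language models (LLMs) to construct long-term memory using attributed dense rewards, addressing the challenge of maintaining consistency in long-term dialogues. It provides dense intermediate rewards through synthetic session-level question generation and introduces contribution-aware gradient weighting to optimize multi-dimensional memory construction.

Details

Motivation: To address memory inconsistency in long-term dialogues, which arises because standard retrieval mechanisms fail to capture the temporal evolution of historical states, and to overcome the limitations of existing memory-augmented frameworks that rely on static prompting or train ineffectively with sparse rewards.

Result: On long-term dialogue benchmarks, MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines and shows strong generalization.

Insight: Contributions include synthetic session-level question generation, which produces dense rewards that alleviate the sparse-reward problem, and contribution-aware gradient weighting, which scales policy updates dynamically to handle multi-dimensional memory attribution. Together they offer a reusable RL paradigm for training lightweight models to manage long-term memory efficiently.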

Abstract: Maintaining consistency in long-term dialogues remains a fundamental challenge for LLMs, as standard retrieval mechanisms often fail to capture the temporal evolution of historical states. While memory-augmented frameworks offer a structured alternative, current systems rely on static prompting of closed-source models or suffer from ineffective training paradigms with sparse rewards. We introduce MemBuilder, a reinforcement learning framework that trains models to orchestrate multi-dimensional memory construction with attributed dense rewards. MemBuilder addresses two key challenges: (1) Sparse Trajectory-Level Rewards: we employ synthetic session-level question generation to provide dense intermediate rewards across extended trajectories; and (2) Multi-Dimensional Memory Attribution: we introduce contribution-aware gradient weighting that scales policy updates based on each component’s downstream impact. Experimental results show that MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines, exhibiting strong generalization across long-term dialogue benchmarks.


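[4] FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse cs.CL

Yubo Hou, Zhisheng Chen, Tao Wan, Zengchang Qin

TL;DR: FlashMem is a framework that distills intrinsic latent memory from transient reasoning states via computation reuse. It targets the inability of stateless LLM architectures to preserve dynamic context, reducing redundant reprocessing of history and improving long-horizon autonomy.

Details

Motivation: The stateless architecture of LLMs lacks a mechanism to preserve dynamic context, forcing agents to redundantly reprocess history to maintain long-horizon autonomy. Existing latent memory methods are limited by architectural segregation: they rely on auxiliary encoders, which decouples memory from the reasoning backbone.

Result: Experiments show that FlashMem matches the performance of heavy baselines while cutting inference latency by a factor of 5, bridging the gap between efficiency and persistent cognition.

Insight: Contributions include: using the property that internal representations uniquely encode input trajectories to identify the last hidden state as a sufficient statistic for the interaction history; a Shared-KV Consolidator that synthesizes memory by attending directly to the backbone’s frozen cache, avoiding redundant re-parameterization; and a parameter-free Cognitive Monitor that uses attention entropy to trigger consolidation adaptively, only when high epistemic uncertainty is detected. Viewed objectively, computation reuse and adaptive triggering give the method efficient, persistent context memory management.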

Abstract: The stateless architecture of Large Language Models inherently lacks the mechanism to preserve dynamic context, compelling agents to redundantly reprocess history to maintain long-horizon autonomy. While latent memory offers a solution, current approaches are hindered by architectural segregation, relying on auxiliary encoders that decouple memory from the reasoning backbone. We propose FlashMem, a framework that distills intrinsic memory directly from transient reasoning states via computation reuse. Leveraging the property that internal representations uniquely encode input trajectories, FlashMem identifies the last hidden state as a sufficient statistic for the interaction history. This enables a Shared-KV Consolidator to synthesize memory by attending directly to the backbone’s frozen cache, eliminating redundant re-parameterization. Furthermore, a parameter-free Cognitive Monitor leverages attention entropy to adaptively trigger consolidation only when high epistemic uncertainty is detected. Experiments demonstrate that FlashMem matches the performance of heavy baselines while reducing inference latency by 5 times, effectively bridging the gap between efficiency and persistent cognition.
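
The entropy-based trigger described for the Cognitive Monitor can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the normalization by the maximum entropy log(n), and the `threshold_ratio` value are all assumptions made for illustration. It only shows how the entropy of an attention distribution could gate a consolidation step.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores; they are renormalized
    here so the function also accepts unnormalized inputs.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_consolidate(weights, threshold_ratio=0.8):
    """Return True when attention is diffuse enough to suggest uncertainty.

    Entropy is divided by its maximum, log(n), so the ratio lies in [0, 1].
    `threshold_ratio` is a hypothetical hyperparameter, not a value from
    the paper.
    """
    n = len(weights)
    if n < 2:
        return False
    return attention_entropy(weights) / math.log(n) > threshold_ratio


# Hypothetical attention rows: one sharply peaked, one uniform.
peaked = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]
```

A check of this kind needs no learned parameters, which is consistent with the "parameter-free" description, although the actual monitor may aggregate entropy over heads and layers differently.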


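[5] CHisAgent: A Multi-Agent Framework for Event Taxonomy Construction in Ancient Chinese Cultural Systems cs.CL

Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang

TL;DR: This paper proposes CHisAgent, a multi-agent LLM framework for constructing event taxonomies of ancient Chinese cultural systems. The framework decomposes taxonomy construction into three specialized stages: a bottom-up Inducer derives an initial hierarchy from raw historical corpora, a top-down Expander introduces missing intermediate concepts using LLM world knowledge, and an evidence-guided Enricher integrates external structured historical resources to ensure faithfulness. Using the Twenty-Four Histories, the authors build a large-scale, domain-aware event taxonomy covering politics, military affairs, diplomacy, and social life in ancient China.

Details

Motivation: To address the limited ability of LLMs in historical and cultural reasoning, especially in non-English contexts such as Chinese history, and the high cost and poor scalability of manual taxonomy construction.

Result: On the event taxonomy built from the Twenty-Four Histories, extensive reference-free and reference-based evaluations show improved structural coherence and coverage, and further analysis shows that the resulting taxonomy supports cross-cultural alignment.

Insight: The contribution is decomposing taxonomy construction into three role-specialized, collaborating multi-agent stages that combine bottom-up induction, top-down expansion, and external evidence integration to balance automation, coverage, and historical faithfulness.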

Abstract: Despite strong performance on many tasks, large language models (LLMs) show limited ability in historical and cultural reasoning, particularly in non-English contexts such as Chinese history. Taxonomic structures offer an effective mechanism to organize historical knowledge and improve understanding. However, manual taxonomy construction is costly and difficult to scale. Therefore, we propose CHisAgent, a multi-agent LLM framework for historical taxonomy construction in ancient Chinese contexts. CHisAgent decomposes taxonomy construction into three role-specialized stages: a bottom-up Inducer that derives an initial hierarchy from raw historical corpora, a top-down Expander that introduces missing intermediate concepts using LLM world knowledge, and an evidence-guided Enricher that integrates external structured historical resources to ensure faithfulness. Using the Twenty-Four Histories, we construct a large-scale, domain-aware event taxonomy covering politics, military, diplomacy, and social life in ancient China. Extensive reference-free and reference-based evaluations demonstrate improved structural coherence and coverage, while further analysis shows that the resulting taxonomy supports cross-cultural alignment.


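[6] Closing the Modality Reasoning Gap for Speech Large Language Models cs.CL | cs.SD | eess.AS

Chaoren Wang, Heng Lu, Xueyao Zhang, Shujie Liu, Yan Lu

TL;DR: This paper proposes TARS, a reinforcement learning framework that addresses the modality reasoning gap in speech large language models (Speech LLMs), where reasoning performance on speech inputs is markedly weaker than on text inputs. Through an asymmetric reward design, the framework aligns text-conditioned and speech-conditioned trajectories, optimizing with two complementary signals: representation alignment and behavior alignment. On challenging reasoning benchmarks such as MMSU and OBQA, the method significantly narrows the gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.

Details

Motivation: To close the modality reasoning gap in Speech LLMs, whose reasoning on speech inputs is markedly weaker than on text inputs. The gap may be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning.

Result: On reasoning benchmarks including MMSU and OBQA, the method significantly narrows the modality reasoning gap and reaches state-of-the-art performance among 7B-scale Speech LLMs.

Insight: The contribution is TARS, an RL framework that aligns text- and speech-conditioned trajectories through an asymmetric reward design and optimizes with two dense, complementary signals: representation alignment (layer-wise hidden-state similarity) and behavior alignment (semantic consistency between generated outputs and reference text completions), improving reasoning in the speech modality.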

Abstract: Although speech large language models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
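
As a rough illustration of the representation-alignment signal, the sketch below averages a per-layer similarity between speech-conditioned and text-conditioned hidden states. The abstract only says the signal "measures layer-wise hidden-state similarity"; the choice of cosine similarity, the use of one pooled vector per layer, and the function names are assumptions, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def representation_alignment(speech_layers, text_layers):
    """Mean per-layer cosine similarity between two trajectories.

    Each argument is a list with one pooled hidden-state vector per
    Transformer layer (the pooling itself is assumed, e.g. mean over
    positions). Higher values mean the speech-conditioned states stay
    closer to the text-conditioned ones.
    """
    if len(speech_layers) != len(text_layers):
        raise ValueError("both trajectories must cover the same layers")
    sims = [cosine(s, t) for s, t in zip(speech_layers, text_layers)]
    return sum(sims) / len(sims)


# Toy two-layer example: layer 0 matches exactly, layer 1 is orthogonal.
speech = [[1.0, 0.0], [1.0, 0.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
score = representation_alignment(speech, text)
```

A dense per-layer score like this could be added to an RL reward at every step, which is one way to read the "dense and complementary signals" wording.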


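[7] ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging cs.CL | cs.AI

Junyao Yang, Chen Qian, Dongrui Liu, Wen Shen, Yong Liu

TL;DR: The paper proposes ReasonAny, a framework that uses Contrastive Gradient Identification to resolve performance collapse in model merging, giving domain-specialized models reasoning capability.

Details

Motivation: To address the difficulty of integrating the long chain-of-thought reasoning of large reasoning models into domain-specialized models (for example in safety, biomedicine, and finance) with existing model merging methods.

Result: In experiments across safety, biomedicine, and finance, ReasonAny significantly outperforms state-of-the-art baselines while retaining robust reasoning performance.

Insight: The contribution is the finding that reasoning ability resides mainly in parameter regions with low gradient sensitivity, and the Contrastive Gradient Identification method built on it. Reusable ideas include this new view of parameter sensitivity and the training-free merging framework design.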

Abstract: Large Reasoning Models (LRMs) with long chain-of-thought reasoning have recently achieved remarkable success. Yet, equipping domain-specialized models with such reasoning capabilities, referred to as “Reasoning + X”, remains a significant challenge. While model merging offers a promising training-free solution, existing methods often suffer from a destructive performance collapse: existing methods tend to both weaken reasoning depth and compromise domain-specific utility. Interestingly, we identify a counter-intuitive phenomenon underlying this failure: reasoning ability predominantly resides in parameter regions with low gradient sensitivity, contrary to the common assumption that domain capabilities correspond to high-magnitude parameters. Motivated by this insight, we propose ReasonAny, a novel merging framework that resolves the reasoning-domain performance collapse through Contrastive Gradient Identification. Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes “Reasoning + X” capabilities, significantly outperforming state-of-the-art baselines while retaining robust reasoning performance.


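[8] ACR: Adaptive Context Refactoring via Context Refactoring Operators for Multi-Turn Dialogue cs.CL | cs.AI

Jiawei Shen, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui

TL;DR: This paper proposes the Adaptive Context Refactoring (ACR) framework, which uses a library of context refactoring operators and a teacher-guided self-evolving training paradigm to monitor and reshape multi-turn dialogue history dynamically. This mitigates contextual inertia and state drift, significantly improving performance while reducing token consumption.

Details

Motivation: To address the difficulty LLMs have in multi-turn dialogue with staying aligned with earlier content, following cross-turn dependencies, and avoiding drift into incorrect facts as the interaction grows. Existing methods remain limited by contextual inertia and state drift.

Result: In extensive multi-turn dialogue experiments, the method significantly outperforms existing baselines while reducing token consumption.

Insight: The contribution is decoupling context management from the reasoning process and actively intervening in the dialogue history through dynamic refactoring operations and self-evolving training, to handle long-range dependencies and state maintenance.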

Abstract: Large Language Models (LLMs) have shown remarkable performance in multi-turn dialogue. However, in multi-turn dialogue, models still struggle to stay aligned with what has been established earlier, follow dependencies across many turns, and avoid drifting into incorrect facts as the interaction grows longer. Existing approaches primarily focus on extending the context window, introducing external memory, or applying context compression, yet these methods still face limitations such as contextual inertia and state drift. To address these challenges, we propose the Adaptive Context Refactoring (ACR) Framework, which dynamically monitors and reshapes the interaction history to actively mitigate contextual inertia and state drift. ACR is built on a library of context refactoring operators and a teacher-guided self-evolving training paradigm that learns when to intervene and how to refactor, thereby decoupling context management from the reasoning process. Extensive experiments on multi-turn dialogue demonstrate that our method significantly outperforms existing baselines while reducing token consumption.


Nguyen Minh Phuong, Ha-Thanh Nguyen, May Myo Zin, Ken Satoh

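TL;DR: This paper proposes a pipeline that uses large language models (LLMs) for data augmentation in information extraction tasks in the legal domain. The method is simple and effective: it significantly reduces the manual effort required for data annotation and improves the robustness of information extraction systems. It is also generalizable to other natural language processing tasks beyond the legal domain.

Details

Motivation: To address the high annotation cost, heavy manual effort, and insufficient system robustness of information extraction tasks in the legal domain.

Result: The abstract reports no specific quantitative results, benchmarks, or performance levels (such as SOTA), but claims that the method significantly reduces manual annotation effort and improves system robustness.

Insight: The contribution is applying LLMs to a data augmentation pipeline for the legal domain, generating annotated data automatically to reduce reliance on manual annotation and improve model generalization. Viewed objectively, the method ties data augmentation to a specific domain (law) while emphasizing generality across NLP tasks, offering a scalable option for resource-constrained domains.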

Abstract: In this paper, we propose a pipeline leveraging Large Language Models (LLMs) for data augmentation in Information Extraction tasks within the legal domain. The proposed method is both simple and effective, significantly reducing the manual effort required for data annotation while enhancing the robustness of Information Extraction systems. Furthermore, the method is generalizable, making it applicable to various Natural Language Processing (NLP) tasks beyond the legal domain.


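[10] GIFT: Games as Informal Training for Generalizable LLMs cs.CL

Nuoyan Lyu, Bingbing Xu, Weihao Meng, Yige Yuan, Yang Zhang

TL;DR: This paper proposes GIFT, a framework that treats games as an environment for informal learning in LLMs. A Nested Training Framework addresses performance degradation in multi-task learning, and reinforcement learning across several games is used to improve the model’s generalization and practical wisdom.

Details

Motivation: LLMs excel at formal learning tasks such as mathematics and code generation but still fall short in generalizable intelligence such as strategic creativity and social reasoning. This stems from a lack of informal learning based on interactive feedback, a gap the study aims to close through game environments.

Result: Training with GRPO-based reinforcement learning on Matrix Games, TicTacToe, and Who’s the Spy shows that game-based informal learning not only prevents task interference but also significantly improves generalization across broad ability-oriented benchmarks.

Insight: The contribution is treating games as an informal learning environment and introducing a Nested Training Framework that uses sequential task composition to force the model to master multiple abilities simultaneously. This offers a new route to improving the generalizable intelligence of LLMs; its use of intrinsic reward signals and abstracted complexity is worth borrowing.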

Abstract: While Large Language Models (LLMs) have achieved remarkable success in formal learning tasks such as mathematics and code generation, they still struggle with the “practical wisdom” and generalizable intelligence, such as strategic creativity and social reasoning, that characterize human cognition. This gap arises from a lack of informal learning, which thrives on interactive feedback rather than goal-oriented instruction. In this paper, we propose treating Games as a primary environment for LLM informal learning, leveraging their intrinsic reward signals and abstracted complexity to cultivate diverse competencies. To address the performance degradation observed in multi-task learning, we introduce a Nested Training Framework. Unlike naive task mixing optimizing an implicit “OR” objective, our framework employs sequential task composition to enforce an explicit “AND” objective, compelling the model to master multiple abilities simultaneously to achieve maximal rewards. Using GRPO-based reinforcement learning across Matrix Games, TicTacToe, and Who’s the Spy games, we demonstrate that integrating game-based informal learning not only prevents task interference but also significantly bolsters the model’s generalization across broad ability-oriented benchmarks. The framework and implementation are publicly available.
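
The contrast between the implicit "OR" objective of naive task mixing and the explicit "AND" objective of sequential composition can be made concrete with a toy reward calculation. The sketch below is an assumption-laden illustration, not the paper's reward: it models mixing as the mean of per-task rewards and composition as their product, and both function names are hypothetical.

```python
def mixed_reward(task_rewards):
    """Naive task mixing: average reward over independently sampled tasks.

    Doing well on any single task still earns partial credit, which is
    the sense in which the objective behaves like an "OR".
    """
    return sum(task_rewards) / len(task_rewards)


def nested_reward(task_rewards):
    """Sequential composition, modeled here as a product of rewards in [0, 1].

    The maximal reward is reachable only when every task in the sequence
    succeeds, which is the sense in which the objective behaves like an "AND".
    """
    result = 1.0
    for reward in task_rewards:
        result *= reward
    return result


# A policy that solves two of three games and fails the third.
partial = [1.0, 0.0, 1.0]
# A policy that solves all three.
complete = [1.0, 1.0, 1.0]
```

Under the mean, the partial policy still collects two thirds of the reward; under the product it collects nothing, so only mastering all abilities at once is rewarded.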


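[11] A Framework for Personalized Persuasiveness Prediction via Context-Aware User Profiling cs.CL | cs.AI

Sejun Park, Yoonah Park, Jongwon Lim, Yohan Jo

TL;DR: This paper proposes a context-aware user profiling framework for personalized persuasiveness prediction. It has two trainable components: a query generator that retrieves persuasion-relevant records from a user’s history, and a profiler that summarizes those records into a profile that informs the persuasiveness prediction model.

Details

Motivation: Existing methods lack a systematic framework for using a persuadee’s past activities (such as conversations) to improve a persuasiveness prediction model, even though accounting for the persuadee’s characteristics (such as values, experiences, and reasoning styles) is essential.

Result: Evaluation on the ChangeMyView Reddit dataset shows that the method outperforms existing methods across multiple predictor models, with F1 gains of up to 13.77 percentage points.

Insight: The contribution is task-oriented, personalized persuasiveness prediction through dynamically generated queries and context-dependent user profiles, rather than reliance on static attributes or surface-level similarity.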

Abstract: Estimating the persuasiveness of messages is critical in various applications, from recommender systems to safety assessment of LLMs. While it is imperative to consider the target persuadee’s characteristics, such as their values, experiences, and reasoning styles, there is currently no established systematic framework to optimize leveraging a persuadee’s past activities (e.g., conversations) to the benefit of a persuasiveness prediction model. To address this problem, we propose a context-aware user profiling framework with two trainable components: a query generator that generates optimal queries to retrieve persuasion-relevant records from a user’s history, and a profiler that summarizes these records into a profile to effectively inform the persuasiveness prediction model. Our evaluation on the ChangeMyView Reddit dataset shows consistent improvements over existing methods across multiple predictor models, with gains of up to +13.77%p in F1 score. Further analysis shows that effective user profiles are context-dependent and predictor-specific, rather than relying on static attributes or surface-level similarity. Together, these results highlight the importance of task-oriented, context-dependent user profiling for personalized persuasiveness prediction.


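[12] Afri-MCQA: Multimodal Cultural Question Answering for African Languages cs.CL

Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesujoba Oluwadara Alabi

TL;DR: This paper introduces Afri-MCQA, the first multimodal cultural question-answering benchmark covering 15 African languages, with 7.5k Q&A pairs created by native speakers across text and speech modalities. Evaluating LLMs on the benchmark shows that open-weight models perform poorly on African languages and cultural knowledge, with near-zero accuracy on open-ended visual question answering (VQA) when queried in the native language or in speech.

Details

Motivation: Africa is home to over one-third of the world’s languages yet is underrepresented in AI research, so a multimodal cultural question-answering benchmark is needed to evaluate and advance AI models for African languages and cultures.

Result: On Afri-MCQA, open-weight LLMs perform poorly across the evaluated cultures, with near-zero accuracy on open-ended VQA under native-language or speech queries. Control experiments show significant performance gaps between African languages and English in both text and speech understanding.

Insight: The contribution is the first multimodal cultural question-answering benchmark for African languages. The findings underscore the importance of speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer, and the work provides a dataset and evaluation framework for multimodal AI development in African languages.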

Abstract: Africa is home to over one-third of the world’s languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separate from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release our Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (https://huggingface.co/datasets/Atnafu/Afri-MCQA)


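[13] Multimodal In-context Learning for ASR of Low-resource Languages cs.CL | cs.AI

Zhaolin Li, Jan Niehues

TL;DR: This paper studies multimodal in-context learning (MICL) for automatic speech recognition (ASR) of low-resource languages. Experiments with the speech LLMs Phi-4 and Qwen3-Omni on three endangered languages show that MICL can use both speech and text modalities to learn unseen languages, that cross-lingual transfer learning improves its efficiency, and that attention analysis reveals a bias towards text context. Finally, the authors propose a simple ASR system that combines a strong acoustic model with a speech LLM, using MICL to select among acoustic hypotheses, which significantly improves ASR performance.

Details

Motivation: To address the fact that ASR covers only a small fraction of the world’s languages (low-resource languages in particular) because supervised data is scarce, and to explore whether speech LLMs can learn unseen languages through MICL to improve ASR.

Result: Experiments on three diverse endangered languages show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using target-language data.

Insight: The contribution is applying MICL to ASR of low-resource languages and revealing layer-dependent attention preferences between audio and text context. The proposed ASR system, which combines a strong acoustic model with a speech LLM and selects hypotheses via MICL, offers an effective and data-efficient approach to low-resource ASR.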

Abstract: Automatic speech recognition (ASR) still covers only a small fraction of the world’s languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. Firstly, we find that MICL is effective for unseen languages, leveraging both speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without training on them. Moreover, we analyze attention patterns to interpret MICL mechanisms, and we observe layer-dependent preferences between audio and text context, with an overall bias towards text. Finally, we show that prompt-based ASR with speech LLMs performs poorly on unseen languages, motivating a simple ASR system that combines a stronger acoustic model with a speech LLM via MICL-based selection of acoustic hypotheses. Results show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using target-language data. Our code is publicly available.


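[14] AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor cs.CL | cs.SE

Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang

TL;DR: This paper introduces AutoMonitor-Bench, the first benchmark for systematically evaluating the reliability of LLM-based misbehavior monitors. It contains 3,010 annotated samples spanning question answering, code generation, and reasoning, and defines two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR). An evaluation of 12 proprietary and 10 open-source LLMs finds substantial variability in monitoring performance and a safety-utility trade-off. The authors also build a large-scale training set and fine-tune Qwen3-4B-Instruction to test whether training on known misbehavior data helps monitor unseen and implicit misbehaviors.

Details

Motivation: To address the lack of a systematic benchmark for evaluating the reliability of LLM-based misbehavior monitors across tasks and failure modes, with the aim of quantifying monitor performance and exposing its limitations.

Result: Monitoring performance varies substantially across LLMs, with an inherent trade-off between Miss Rate and False Alarm Rate. Fine-tuning experiments indicate that training on known misbehavior datasets yields limited improvement on unseen and implicit misbehaviors, highlighting the difficulty of reliable monitoring.

Insight: The contribution is the first comprehensive benchmark for misbehavior monitoring, together with complementary evaluation metrics. Viewed objectively, the study exposes a fundamental tension between safety and utility in LLM monitors and points to task-aware monitor design and training strategies as directions for future work.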

Abstract: We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware designing and training strategies for LLM-based monitors.
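
The two metrics can be computed from paired labels and monitor decisions as in the sketch below. The abstract defines MR as failures to detect misbehavior and FAR as oversensitivity to benign behavior; the exact formulas used here (missed misbehavior over all misbehavior, flagged benign over all benign) are the conventional readings and are assumed rather than quoted from the paper. The function name is hypothetical.

```python
def miss_rate_and_false_alarm_rate(labels, flags):
    """Compute (MR, FAR) for a monitor.

    labels: True where the sample really is misbehavior.
    flags:  True where the monitor raised an alarm.
    MR  = misbehavior samples not flagged / all misbehavior samples.
    FAR = benign samples flagged / all benign samples.
    """
    if len(labels) != len(flags):
        raise ValueError("labels and flags must have the same length")
    misbehavior_flags = [f for is_bad, f in zip(labels, flags) if is_bad]
    benign_flags = [f for is_bad, f in zip(labels, flags) if not is_bad]
    miss_rate = (
        sum(1 for f in misbehavior_flags if not f) / len(misbehavior_flags)
        if misbehavior_flags else 0.0
    )
    false_alarm_rate = (
        sum(1 for f in benign_flags if f) / len(benign_flags)
        if benign_flags else 0.0
    )
    return miss_rate, false_alarm_rate


# Toy data: four misbehavior samples (one missed), two benign (one flagged).
labels = [True, True, True, True, False, False]
flags = [True, True, True, False, True, False]
mr, far = miss_rate_and_false_alarm_rate(labels, flags)
```

A monitor that flags everything drives MR to 0 and FAR to 1, and one that flags nothing does the reverse, which is why the two metrics have to be reported together.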


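[15] EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis cs.CL | cs.AI | cs.LG

Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou

TL;DR: This paper proposes EnvScaler, a framework that automatically generates scalable tool-interaction environments through programmatic synthesis. It targets three obstacles in training LLM agents: restricted access to real environments, hallucinations in simulated environments, and the difficulty of scaling manually built environments. The framework has two components, SkelBuilder (which constructs diverse environment skeletons) and ScenGenerator (which generates task scenarios and trajectory validation functions), and the resulting 191 environments and roughly 7K scenarios are used for SFT and RL training of Qwen3 series models.

Details

Motivation: Training LLMs as agents in real-world environments runs into an environment-acquisition problem: access to real systems is restricted, LLM-simulated environments are prone to hallucinations and inconsistencies, and manually built sandboxes are hard to scale. The paper therefore proposes a framework for automatically synthesizing scalable tool-interaction environments.

Result: Results on three benchmarks show that training on EnvScaler-synthesized environments significantly improves the ability of Qwen3 series models to solve tasks in complex environments involving multi-turn, multi-tool interactions.

Insight: The contribution is an automated programmatic-synthesis pipeline that generates high-quality, diverse tool-interaction environments at scale. It splits environment construction into two systematically scalable steps, skeleton construction and scenario generation, and also produces rule-based trajectory validation functions, giving LLM agent training a scalable and reliable source of environments.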

Abstract: Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs’ ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.


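[16] Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs cs.CL | cs.AI | cs.CV

Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal

TL;DR: This paper introduces the Multimodal Auto-Completion (MAC) task, which predicts upcoming characters in live chats from partially typed text and visual cues. The authors build benchmark datasets, evaluate the trade-offs between vision-language models (VLMs) and text-only models, and propose Router-Suggest, a routing framework that dynamically selects a textual model or a VLM based on dialog context, substantially improving efficiency while preserving user satisfaction.

Details

Motivation: To address real-time multimodal auto-completion. Traditional text-only auto-completion cannot make full use of shared visual context in multimodal dialogs to capture user intent accurately, so a completion method that fuses visual and textual information is needed.

Result: On benchmark datasets built from MMDialog and ImageChat, Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly outperform text-only models on user satisfaction, on saving typing effort, and on completion quality in multi-turn conversations.

Insight: The contribution is extending auto-completion to the multimodal setting and proposing Router-Suggest, a dynamic routing framework that selects a model based on context to balance accuracy and efficiency. Its lightweight variant suits resource-constrained environments, pointing toward smarter, user-aware assistants.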

Abstract: Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.
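
Why routing produces a speedup over always calling the VLM can be seen from a simple expected-latency calculation. The sketch below is illustrative only: the latency model, the function names, and the example numbers are assumptions and do not come from the paper, which reports only the measured 2.3x to 10x range.

```python
def expected_latency(p_vlm, t_text, t_vlm, t_router=0.0):
    """Expected per-request latency under routing.

    p_vlm:    fraction of requests the router sends to the VLM.
    t_text:   latency of the textual model.
    t_vlm:    latency of the VLM.
    t_router: overhead of the routing decision itself.
    All quantities are hypothetical inputs in the same time unit.
    """
    if not 0.0 <= p_vlm <= 1.0:
        raise ValueError("p_vlm must be a probability")
    return t_router + p_vlm * t_vlm + (1.0 - p_vlm) * t_text


def speedup_over_vlm(p_vlm, t_text, t_vlm, t_router=0.0):
    """Speedup relative to sending every request to the VLM."""
    return t_vlm / expected_latency(p_vlm, t_text, t_vlm, t_router)


# Made-up numbers: the VLM is 10x slower and handles a quarter of requests.
latency = expected_latency(p_vlm=0.25, t_text=10.0, t_vlm=100.0)
speedup = speedup_over_vlm(p_vlm=0.25, t_text=10.0, t_vlm=100.0)
```

In this model the speedup grows as fewer requests need visual context and shrinks toward 1 as the router sends everything to the VLM, so the achievable gain depends on how often the dialog context actually requires the image.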


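[17] What do the metrics mean? A critical analysis of the use of Automated Evaluation Metrics in Interpreting cs.CL

Jonathan Downie, Joss Moorkens

TL;DR: This article critically analyzes the use of automated evaluation metrics in interpreting, arguing that currently proposed automatic metrics cannot take the communicative context into account and are therefore not viable measures of interpreting quality when used on their own.

Details

Motivation: With the growth of interpreting technologies such as remote interpreting, computer-aided interpreting, automated speech translation, and interpreting avatars, fast and efficient ways to assess interpreting quality are in high demand. It is unclear, however, whether existing automated methods are suitable for measuring the quality of authentic interpreting practice, whether delivered by humans or machines.

Result: The article reports no quantitative experiments or benchmarks. Its analysis concludes that current automatic metrics, used on their own, are not viable measures of interpreting quality.

Insight: The contribution is stressing that communicative context is fundamental to assessing interpreting quality and that ignoring it is the main limitation of automatic metrics. Viewed objectively, the study is a reminder that AI-driven evaluation should integrate contextual factors and avoid over-reliance on automated measures.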

Abstract: With the growth of interpreting technologies, from remote interpreting and Computer-Aided Interpreting to automated speech translation and interpreting avatars, there is now a high demand for ways to quickly and efficiently measure the quality of any interpreting delivered. A range of approaches to fulfil the need for quick and efficient quality measurement have been proposed, each involving some measure of automation. This article examines these recently-proposed quality measurement methods and will discuss their suitability for measuring the quality of authentic interpreting practice, whether delivered by humans or machines, concluding that automatic metrics as currently proposed cannot take into account the communicative context and thus are not viable measures of the quality of any interpreting provision when used on their own. Across all attempts to measure or even categorise quality in Interpreting Studies, the contexts in which interpreting takes place have become fundamental to the final analysis.


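[18] Continual-learning for Modelling Low-Resource Languages from Large Language Models cs.CL | cs.AI

Santosh Srinath K, Mudit Somani, Varun Reddy Padala, Prajna Devi Upadhyay, Abhijit Das

TL;DR: This paper proposes a continual learning method based on parts-of-speech (POS)-based code-switching and a replay adapter strategy to mitigate catastrophic forgetting when training small language models from large language models, and validates it on visual question answering and language modelling tasks.

Details

Motivation: To address catastrophic forgetting that arises in multilingual settings when small language models for low-resource languages are built by adapting large language models.

Result: Experiments on visual question answering and language modelling show that the proposed architecture mitigates catastrophic forgetting, but no specific benchmarks or comparisons with SOTA are mentioned.

Insight: The contribution is combining POS-based code-switching with a replay adapter strategy for continual learning, offering a practical way to reduce forgetting when training low-resource language models.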

Abstract: Modelling a language model for a multi-lingual scenario includes several potential challenges, among which catastrophic forgetting is the major challenge. For example, small language models (SLMs) built for low-resource languages by adapting large language models (LLMs) pose the challenge of catastrophic forgetting. This work proposes to employ a continual learning strategy using parts-of-speech (POS)-based code-switching along with a replay adapter strategy to mitigate the identified gap of catastrophic forgetting while training SLMs from LLMs. Experiments conducted on vision-language tasks such as visual question answering and on the language modelling task demonstrate the success of the proposed architecture.


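[19] iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models cs.CL

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha

TL;DR: This paper proposes iReasoner, a self-evolving framework for large multimodal models (LMMs) that improves a model’s implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. The framework runs a Proposer-Solver loop over unlabeled images and augments outcome-level intrinsic rewards with a trajectory-aware signal, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges.

Details

Motivation: Existing self-evolving frameworks for LMMs mainly reward final outcomes and only weakly constrain intermediate reasoning, which is important for visually grounded decision making. The paper therefore asks how to improve reasoning without supervision by reinforcing the consistency of intermediate reasoning paths.

Result: Starting from Qwen2.5-VL-7B, iReasoner gains up to +2.1 points across multiple multimodal reasoning benchmarks under fully unsupervised post-training.

Insight: The contribution is a trajectory-aware intrinsic reasoning supervision mechanism that distinguishes the quality of reasoning paths by comparing and rewarding the internal agreement of chains of thought, opening a route to reasoning-aware self-improvement in purely unsupervised settings.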

Abstract: Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM’s implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer–Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to $+2.1$ points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.


[20] Pantagruel: Unified Self-Supervised Encoders for French Text and Speech cs.CL

Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai

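TL;DR: Pantagruel is a family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as text tokens or speech units, the models learn contextualized target representations in feature space, which lets modality-specific encoders capture linguistic and acoustic regularities more effectively. The models are pre-trained on large French corpora and match or outperform strong baselines across a broad range of text and speech downstream tasks.

Details

Motivation: To overcome the limitations of modality-specific target prediction in French text and speech representation learning, by using feature-space self-supervised objectives to learn more effective contextual representations and improve cross-modal understanding.

Result: On standard French benchmarks such as FLUE and LeBenchmark, Pantagruel models show competitive or superior performance compared with strong baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while keeping a shared architecture that handles either speech or text inputs.

Insight: The novelty lies in replacing modality-specific prediction with feature-space self-supervised objectives, which lets modality-specific encoders learn linguistic and acoustic regularities more effectively. The unified architecture provides a robust foundation for French speech-text understanding, and the work introduces INA-100k, a large-scale French audio corpus, to increase data diversity.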

Abstract: We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l’Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.


[21] Can We Predict Before Executing Machine Learning Agents? cs.CL | cs.AI | cs.LG | cs.MA

Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao

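TL;DR: This paper proposes FOREAGENT, a machine learning agent framework that targets the "Execution Bottleneck" of the conventional Generate-Execute-Feedback paradigm. It internalizes execution priors so that costly physical execution is replaced by instantaneous predictive reasoning, and it introduces the Data-centric Solution Preference task together with a large-scale dataset. Experiments show that an LLM-based predictor can predict execution outcomes when suitably prompted, and the resulting agent converges 6x faster while outperforming execution-based baselines by 6%.

Details

Motivation: Conventional autonomous machine learning agents are bound to a Generate-Execute-Feedback paradigm in which every hypothesis must be physically executed to be evaluated. This causes a severe Execution Bottleneck that is costly and inefficient. The paper aims to bypass these physical constraints by replacing part of the execution with prediction.

Result: On the constructed Data-centric Solution Preference dataset of 18,438 pairwise comparisons, an LLM predictor primed with a Verified Data Analysis Report reaches 61.5% accuracy with robust confidence calibration. The resulting FOREAGENT agent converges 6x faster and outperforms execution-based baselines by 6%.

Insight: The core idea is to bring World Models into the agent architecture: by internalizing execution priors, expensive runtime checks are replaced with predictive reasoning, which yields a Predict-then-Verify loop. This gives a data-driven way to ease the execution bottleneck and speed up scientific discovery, and the task-specific dataset provides a benchmark for evaluating an agent's predictive ability.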

Abstract: Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict-before-execute.
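
One way to picture a Predict-then-Verify loop is sketched below in Python: a cheap predictor ranks candidate solutions, and only the top few are actually executed. This is a minimal sketch under stated assumptions, not FOREAGENT itself; `predict_score`, `execute`, and the fixed `budget` are hypothetical placeholders for an LLM predictor, a real training run, and an execution limit.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def predict_then_verify(
    candidates: Sequence[T],
    predict_score: Callable[[T], float],
    execute: Callable[[T], float],
    budget: int,
) -> Tuple[T, float, int]:
    """Rank candidates by predicted score, then execute only the top `budget`.

    Returns (best candidate, its measured score, number of executions).
    """
    if budget < 1 or not candidates:
        raise ValueError("need at least one candidate and a positive budget")
    # Prediction step: cheap ranking, no execution involved.
    ranked = sorted(candidates, key=predict_score, reverse=True)
    # Verification step: run only the most promising candidates.
    measured = [(execute(cand), cand) for cand in ranked[:budget]]
    best_score, best = max(measured, key=lambda pair: pair[0])
    return best, best_score, len(measured)


if __name__ == "__main__":
    predicted = {"a": 0.2, "b": 0.9, "c": 0.6, "d": 0.1}
    actual = {"a": 0.5, "b": 0.7, "c": 0.8, "d": 0.9}
    print(predict_then_verify(list(predicted), predicted.get, actual.get, budget=2))
```

In the toy data the predictor is imperfect (it ranks "d" last although "d" is truly best), which illustrates the trade-off: fewer executions in exchange for possibly missing a candidate the predictor undervalues.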


[22] Distilling Feedback into Memory-as-a-Tool cs.CL

Víctor Gallego

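TL;DR: This paper proposes a method that amortizes inference-time cost by converting transient feedback into retrievable guidelines, using a file-based memory system and agent-controlled tool calls, and validates it on the Rubric Feedback Bench dataset.

Details

Motivation: To reduce the high cost of inference-time refinement pipelines by turning one-off critiques into reusable guidelines, making large language models more efficient on rubric-based learning tasks.

Result: Experiments on Rubric Feedback Bench, a novel dataset, show that the augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.

Insight: The novelty lies in distilling feedback into a memory tool, with a file system providing persistent storage and retrieval, so that compute is used more efficiently without loss of performance. The approach offers a scalable framework for optimizing LLM inference.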

Abstract: We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric-based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.
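
To make the idea of a file-based memory exposed as tools more concrete, the Python sketch below stores guidelines as one JSON file per topic and offers `write` and `read` operations that an agent could call. The class name, the file layout, and the topic-keyed lookup are assumptions for illustration; the abstract does not describe the actual storage format or tool interface.

```python
import json
from pathlib import Path
from typing import List


class GuidelineMemory:
    """Hypothetical file-backed store: one JSON list of guidelines per topic."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, topic: str) -> Path:
        # Map a free-form topic to a safe file name.
        safe = "".join(c if c.isalnum() else "_" for c in topic.lower())
        return self.root / f"{safe}.json"

    def read(self, topic: str) -> List[str]:
        """Tool call: retrieve stored guidelines (empty list if none)."""
        path = self._path(topic)
        if not path.exists():
            return []
        return json.loads(path.read_text(encoding="utf-8"))

    def write(self, topic: str, guideline: str) -> None:
        """Tool call: persist a guideline distilled from a critique."""
        items = self.read(topic)
        if guideline not in items:  # skip exact duplicates
            items.append(guideline)
        self._path(topic).write_text(
            json.dumps(items, ensure_ascii=False, indent=2), encoding="utf-8"
        )
```

Because the guidelines live on disk, a later request on the same topic can read them back instead of paying for another critique-and-refine round, which is the amortization the abstract describes.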


[23] The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning cs.CL | cs.AI

Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan

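TL;DR: The paper likens effective long chain-of-thought (Long CoT) reasoning trajectories to stable molecular structures built from three interaction types: Deep-Reasoning, Self-Reflection, and Self-Exploration. It finds that effective Long CoT learning depends on "bond" structures that promote fast entropy convergence, while structural competition harms training. Building on this, the authors propose Mole-Syn, which uses a distribution-transfer graph to guide the synthesis of effective Long CoT structures and improves both performance and reinforcement learning stability across several benchmarks.

Details

Motivation: Large language models struggle to learn effective Long CoT reasoning by imitating humans or non-Long-CoT models. The paper aims to understand and construct the internal structure of learnable Long CoT trajectories.

Result: Mole-Syn improves performance and reinforcement learning stability across benchmarks; the abstract does not name the benchmarks or quantify the gains.

Insight: The paper models Long CoT reasoning trajectories as molecular topologies and identifies three key interaction types. Its central finding is that only structural "bonds" that promote fast entropy convergence support stable learning, which suggests a new paradigm for synthesizing effective training data.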

Abstract: Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals that these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.


[24] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards cs.CL

Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li

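TL;DR: This paper proposes Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for LLM-based deep search agents. CaRR decomposes complex questions into verifiable single-hop rubrics and requires the agent to explicitly identify hidden entities, support them with correct citations, and build complete evidence chains, which emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. The paper also introduces Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR with outcome rewards to train robust deep search agents.

Details

Motivation: Existing reinforcement learning approaches for deep search agents rely mainly on binary outcome rewards, which fail to capture the comprehensiveness and factuality of the reasoning process and tend to encourage undesirable behaviors such as shortcut exploitation and hallucination.

Result: Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks, discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and generalizes well to open-ended deep research tasks.

Insight: The novelty lies in a fine-grained, rubric- and citation-based reward framework (CaRR) and an RL optimization method built on it (C-GRPO). By decomposing questions in a structured way and enforcing evidence chains, the approach improves the reliability and interpretability of agent reasoning.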

Abstract: Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents’ reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.
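
The Python sketch below shows one way a rubric-level reward could be combined with a binary outcome reward. It is only an illustration: the three boolean checks mirror the requirements listed in the abstract, but the fraction-of-rubrics score, the weighted sum, and the `rubric_weight` parameter are assumptions, since the abstract does not give the actual reward formula used by CaRR or C-GRPO.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RubricCheck:
    """Outcome of checking one single-hop rubric (fields are illustrative)."""

    entity_identified: bool   # hidden entity named explicitly
    citation_correct: bool    # entity supported by a correct citation
    linked_to_answer: bool    # evidence chain reaches the predicted answer

    def satisfied(self) -> bool:
        return self.entity_identified and self.citation_correct and self.linked_to_answer


def rubric_reward(checks: Sequence[RubricCheck]) -> float:
    """Fraction of rubrics fully satisfied (assumed scoring rule)."""
    if not checks:
        return 0.0
    return sum(c.satisfied() for c in checks) / len(checks)


def combined_reward(
    outcome_correct: bool,
    checks: Sequence[RubricCheck],
    rubric_weight: float = 0.5,
) -> float:
    """Outcome reward plus a weighted rubric reward (assumed combination)."""
    outcome = 1.0 if outcome_correct else 0.0
    return outcome + rubric_weight * rubric_reward(checks)
```

Under this assumed scheme, two rollouts that both reach the correct answer still receive different rewards if only one of them grounds every hop in a correct citation, which is the kind of distinction a binary outcome reward cannot make.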


[25] AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs cs.CL | cs.AI

Chengming Cui, Tianxin Wei, Ziyi Chen, Ruizhong Qiu, Zhichen Zeng

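TL;DR: This paper proposes AdaFuse, an adaptive ensemble decoding framework that dynamically fuses the generation abilities of multiple large language models (LLMs). An uncertainty-based criterion decides at each decoding step whether to ensemble, and in uncertain states a diversity-aware scaling strategy explores candidate continuations, so that fusion granularity adapts at test time.

Details

Motivation: Existing LLM ensemble methods usually use a fixed fusion granularity, cannot adapt during generation, and do not account for the different generation characteristics of different tasks. The paper aims to remove these limitations and enable more flexible and efficient inference-time ensembling.

Result: Experiments on open-domain question answering, arithmetic reasoning, and machine translation show that AdaFuse consistently outperforms strong ensemble baselines, with an average relative improvement of 6.88%.

Insight: The novelty is the joint design of adaptive ensembling and test-time scaling: ensemble decisions guide targeted exploration, and the diversity produced by exploration in turn improves ensemble quality. The result is a dynamic, context-aware fusion mechanism that could make multi-model collaborative inference more flexible and effective.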

Abstract: Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference-time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid-generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty-based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity-aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open-domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at https://github.com/CCM0111/AdaFuse.
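
The uncertainty gate described in the abstract can be sketched in Python as follows. The sketch assumes Shannon entropy of the next-token distribution as the uncertainty measure and plain probability averaging over a shared vocabulary as the fusion step; both are stand-ins chosen for illustration, since the abstract specifies neither the criterion nor the word-level alignment procedure AdaFuse actually uses.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_ensemble(probs: Sequence[float], threshold: float) -> bool:
    """Gate: ensemble only when the lead model is uncertain (assumed rule)."""
    return entropy(probs) > threshold


def average_distributions(dists: List[Dict[str, float]]) -> Dict[str, float]:
    """Fuse per-model next-token distributions by averaging, then renormalize.

    Assumes the models' candidates are already aligned to common string keys.
    """
    vocab = set().union(*dists)
    fused = {tok: sum(d.get(tok, 0.0) for d in dists) / len(dists) for tok in vocab}
    total = sum(fused.values())
    return {tok: p / total for tok, p in fused.items()}
```

With a threshold of 0.5 nats, a peaked distribution such as [0.97, 0.01, 0.01, 0.01] skips ensembling, while a uniform distribution over four tokens triggers it, so the extra cost of consulting other models is paid only at uncertain steps.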


cs.CV

[26] Bi-Orthogonal Factor Decomposition for Vision Transformers cs.CV | cs.AI

Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez

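TL;DR: This paper proposes Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework for disentangling the information exchanged by self-attention in Vision Transformers. BFD first uses an ANOVA-based decomposition to statistically separate token activations into orthogonal positional and content factors, then applies singular value decomposition to the query-key interaction matrix to show how these factors mediate communication. Applied to state-of-the-art vision models, BFD reveals that attention operates mainly through content, that attention heads specialize, and that DINOv2 preserves positional structure alongside rich semantic content in its intermediate layers.

Details

Motivation: Self-attention is the core computational unit of Vision Transformers, yet there is no principled understanding of what information (position, content, or both) attention exchanges between tokens. Existing attention maps only show where weight concentrates and cannot reveal what kind of information queries and keys trade.

Result: After validating that position and content are properly isolated, BFD is applied to state-of-the-art vision models such as DINOv2 and supervised models. Attention energy is dominated by content-content interactions, followed by content-position coupling; DINOv2 allocates more energy to content-position interactions than supervised models and distributes computation across a richer mode spectrum.

Insight: BFD provides a principled way to disentangle and analyze the exchange of positional and semantic information in attention. It exposes functional specialization of attention heads into content-content, content-position, and position-position operators, and attributes DINOv2's superior holistic shape processing to intermediate layers that preserve positional structure while contextually enriching semantic content.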

Abstract: Self-attention is the central computational primitive of Vision Transformers, yet we lack a principled understanding of what information attention mechanisms exchange between tokens. Attention maps describe where weight mass concentrates; they do not reveal whether queries and keys trade position, content, or both. We introduce Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework: first, an ANOVA-based decomposition statistically disentangles token activations into orthogonal positional and content factors; second, SVD of the query-key interaction matrix QK^T exposes bi-orthogonal modes that reveal how these factors mediate communication. After validating proper isolation of position and content, we apply BFD to state-of-the-art vision models and uncover three phenomena. (i) Attention operates primarily through content. Content-content interactions dominate attention energy, followed by content-position coupling. DINOv2 allocates more energy to content-position than supervised models and distributes computation across a richer mode spectrum. (ii) Attention mechanisms exhibit specialization: heads differentiate into content-content, content-position, and position-position operators, while singular modes within heads show analogous specialization. (iii) DINOv2’s superior holistic shape processing emerges from intermediate layers that simultaneously preserve positional structure while contextually enriching semantic content. Overall, BFD exposes how tokens interact through attention and which informational factors - positional or semantic - mediate their communication, yielding practical insights into vision transformer mechanisms.
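
The first stage, an ANOVA-style split of activations into positional and content parts, can be illustrated with the small Python sketch below. It works on a single scalar feature and assumes a one-way decomposition over token position: the positional effect is the per-position mean across images minus the grand mean, and the content part is whatever remains. This is a simplified reading for illustration only; the exact factor definitions in BFD, and the subsequent SVD of QK^T, are not reproduced here.

```python
from typing import List, Tuple


def anova_split(
    acts: List[List[float]],
) -> Tuple[float, List[float], List[List[float]]]:
    """Split acts[image][position] into grand mean, positional effect, content.

    acts[i][p] == grand + pos_effect[p] + content[i][p] holds by construction,
    and content sums to zero over images at every position.
    """
    n_images = len(acts)
    n_pos = len(acts[0])
    grand = sum(sum(row) for row in acts) / (n_images * n_pos)
    # Positional effect: what is shared by all images at a given position.
    pos_effect = [
        sum(acts[i][p] for i in range(n_images)) / n_images - grand
        for p in range(n_pos)
    ]
    # Content: image-specific residual after removing mean and position.
    content = [
        [acts[i][p] - grand - pos_effect[p] for p in range(n_pos)]
        for i in range(n_images)
    ]
    return grand, pos_effect, content


if __name__ == "__main__":
    toy = [[1.0, 2.0, 6.0], [3.0, 4.0, 8.0]]  # 2 images x 3 positions
    print(anova_split(toy))
```

Because the content residuals sum to zero at each position, their inner product with the positional effect over the whole dataset is zero, which is the sense in which the two factors are orthogonal in this simplified setting.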


[27] Coding the Visual World: From Image to Simulation Using Vision Language Models cs.CV

Sagi Eppel

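TL;DR: This paper proposes the Im2Sim methodology to explore how well Vision Language Models (VLMs) can recognize and simulate real-world systems from images. The model receives a natural image (for example of a city, clouds, or vegetation), describes the system, and writes generative code that simulates it and synthesizes an image; comparing the synthetic image with the original assesses the model's understanding. The study covers many complex emergent systems, including physical systems, vegetation, cities, materials, and geological formations.

Details

Motivation: To address a core question of visual understanding, namely whether a representative model of the system depicted in an image can be constructed, and to explore the potential of VLMs to recognize and simulate the systems and mechanisms shown in images.

Result: Tests on a variety of complex emergent systems show that leading VLMs (GPT, Gemini) can understand and model complex multi-component systems across multiple layers of abstraction and a wide range of domains, but have limited ability to replicate fine details and low-level arrangements of patterns in the image.

Insight: The novelty lies in recasting visual understanding as a code generation and simulation task (Im2Sim). The study reveals an interesting asymmetry in VLMs between high-level, deep visual understanding and limited perception of fine details, and offers a new angle for evaluating a model's ability to model systems.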

Abstract: The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) demonstrate the capacity to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.


[28] MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments cs.CV

Svitlana Morkva, Maximum Wilder-Smith, Michael Oechsle, Alessio Tonioni, Marco Hutter

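TL;DR: MOSAIC-GS is a new, fully explicit, and computationally efficient method based on Gaussian Splatting for high-fidelity reconstruction of dynamic scenes from monocular video. It uses multiple geometric cues, including depth, optical flow, dynamic object segmentation, and point tracking, together with rigidity-based motion constraints to estimate preliminary 3D scene dynamics during initialization, which reduces reliance on inferring motion from visual appearance alone. The scene is decomposed into static and dynamic components for compact representation, fast training, and real-time rendering, and each Gaussian in the dynamic part follows a trajectory represented by a parameter-efficient, time-dependent Poly-Fourier curve.

Details

Motivation: Monocular reconstruction is inherently ill-posed because it lacks sufficient multiview constraints, which makes accurate recovery of object geometry and temporal coherence particularly hard. Existing methods rely heavily on inferring motion from visual appearance, which is often ambiguous in monocular settings.

Result: On standard monocular dynamic scene benchmarks, MOSAIC-GS optimizes and renders substantially faster than existing methods while maintaining reconstruction quality on par with state-of-the-art approaches.

Insight: The novelty is an advanced initialization strategy that uses multiple geometric cues to recover scene dynamics up front, giving the subsequent photometric optimization a better starting point. The static/dynamic decomposition and the parameter-efficient Poly-Fourier encoding of dynamic Gaussian trajectories yield a compact representation and efficient computation. More broadly, the method combines geometric priors with a differentiable rendering framework, which is a promising route to resolving the ambiguity of monocular dynamic reconstruction.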

Abstract: We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as a time-dependent Poly-Fourier curve for parameter-efficient motion encoding. We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods, while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.
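
A Poly-Fourier trajectory can be pictured as a polynomial trend plus a few sinusoidal terms in time. The Python sketch below evaluates one coordinate of such a curve, assuming the form base + sum_k a_k t^k + sum_m (b_m sin(2*pi*m*t) + c_m cos(2*pi*m*t)) with time t normalized to [0, 1]. The abstract does not give the exact parameterization used by MOSAIC-GS, so the form, the normalization, and all parameter names are assumptions.

```python
import math
from typing import Sequence


def poly_fourier(
    t: float,
    base: float,
    poly: Sequence[float],
    sin_coef: Sequence[float],
    cos_coef: Sequence[float],
) -> float:
    """Evaluate one coordinate of an assumed Poly-Fourier trajectory at time t.

    poly[k] multiplies t**(k + 1); sin_coef[m] and cos_coef[m] multiply the
    sine and cosine of frequency m + 1 (in cycles per unit time).
    """
    value = base
    # Polynomial part: slow drift of the Gaussian centre.
    value += sum(a * t ** (k + 1) for k, a in enumerate(poly))
    # Fourier part: periodic or oscillatory motion.
    for m, (b, c) in enumerate(zip(sin_coef, cos_coef), start=1):
        angle = 2.0 * math.pi * m * t
        value += b * math.sin(angle) + c * math.cos(angle)
    return value


if __name__ == "__main__":
    for step in range(5):
        t = step / 4
        print(t, poly_fourier(t, 1.0, [2.0], [0.5], [0.25]))
```

A full 3D trajectory would use one such curve per axis. The appeal of this kind of encoding is that a handful of coefficients per Gaussian describes motion over the whole clip, rather than one offset per frame.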


[29] Multi-task Cross-modal Learning for Chest X-ray Image Retrieval cs.CV | cs.AI | cs.IR

Zhaohui Liang, Sivaramakrishnan Rajaraman, Niccolo Marini, Zhiyun Xue, Sameer Antani

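TL;DR: This paper proposes a multi-task learning framework for fine-tuning BiomedCLIP to improve cross-modal retrieval between chest X-ray images and text. A lightweight MLP projection head is added to the BiomedCLIP backbone and trained with a composite multi-task loss that combines binary cross-entropy, supervised contrastive, and CLIP losses, with the aim of improving fine-grained medical retrieval.

Details

Motivation: Existing vision-language foundation models such as CLIP and BiomedCLIP provide strong cross-modal embeddings but are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports from chest X-ray image queries, so domain-adaptive fine-tuning is needed.

Result: The fine-tuned model achieves more balanced and clinically meaningful performance on both image-to-text and text-to-image retrieval than the pretrained BiomedCLIP and general-purpose CLIP models. t-SNE visualizations show clearer semantic clustering of normal and abnormal cases, indicating enhanced diagnostic sensitivity.

Insight: The novelty is a multi-task learning framework that combines classification, contrastive learning, and cross-modal alignment losses to adapt a pretrained biomedical foundation model to the domain, improving the clinical relevance and balance of fine-grained medical retrieval. This provides an effective fine-tuning strategy for cross-modal retrieval in biomedical applications.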

Abstract: CLIP and BiomedCLIP are examples of vision-language foundation models and offer strong cross-modal embeddings; however, they are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports using chest X-ray (CXR) image queries. To address this shortcoming, we propose a multi-task learning framework to fine-tune BiomedCLIP and evaluate improvements to CXR image-text retrieval. Using BiomedCLIP as the backbone, we incorporate a lightweight MLP projector head trained with a multi-task composite loss function that includes: (1) a binary cross-entropy loss to distinguish normal from abnormal CXR studies, (2) a supervised contrastive loss to reinforce intra-class consistency, and (3) a CLIP loss to maintain cross-modal alignment. Experimental results demonstrate that the fine-tuned model achieves more balanced and clinically meaningful performance across both image-to-text and text-to-image retrieval tasks compared to the pretrained BiomedCLIP and general-purpose CLIP models. Furthermore, t-SNE visualizations reveal clearer semantic clustering of normal and abnormal cases, demonstrating the model’s enhanced diagnostic sensitivity. These findings highlight the value of domain-adaptive, multi-task learning for advancing cross-modal retrieval in biomedical applications.


[30] Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization cs.CV | cs.AI | cs.CL

Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang

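TL;DR: This paper proposes "Thinking with Map", an image geolocalization method that gives a large vision-language model (LVLM) the ability to use maps and formulates the task as an agent-in-the-map loop. It adopts a two-stage optimization scheme, agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS), to improve sampling efficiency and exploration. The authors also build MAPBench, a benchmark of real-world images, for evaluation.

Details

Motivation: Existing LVLM approaches to image geolocalization overlook a strategy that humans commonly use, namely consulting maps. The paper aims to close this gap by bringing map-based reasoning into the model.

Result: The method outperforms existing open- and closed-source models on most metrics on MAPBench; in particular, it raises Acc@500m from 8.0% for Gemini-3-Pro (with Google Search/Map grounded mode) to 22.1%.

Insight: The core idea is to integrate maps as an external tool in the LVLM reasoning loop and to design a two-stage optimization strategy that combines reinforcement learning with parallel exploration. This suggests a way to strengthen models on tasks that require multi-step exploration and decision making.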

Abstract: The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the Thinking with Map ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
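
A metric such as Acc@500m is commonly read as the fraction of predictions that fall within 500 m of the true location. The Python sketch below computes it with the haversine great-circle distance on a spherical Earth; the choice of distance formula, the mean Earth radius of 6,371 km, and the function names are assumptions, as the abstract does not state how MAPBench measures distance.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def acc_at(
    preds: Sequence[LatLon], truths: Sequence[LatLon], threshold_m: float = 500.0
) -> float:
    """Fraction of predictions within `threshold_m` of the ground truth."""
    if not truths or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equally long")
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(preds, truths))
    return hits / len(truths)
```

At this threshold a prediction has to land within a few city blocks of the true spot, which helps explain why even strong baselines score in the single digits.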


[31] TAPM-Net: Trajectory-Aware Perturbation Modeling for Infrared Small Target Detection cs.CV

Hongyang Xie, Hongyang He, Victor Sanchez

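TL;DR: This paper proposes TAPM-Net (Trajectory-Aware Perturbation Modeling Network), a new method for infrared small target detection (ISTD) that addresses weak signals, small target size, and cluttered backgrounds. It models the directional perturbation trajectories that targets induce in feature space: a Perturbation-guided Path Module (PGM) extracts feature trajectories, and a Mamba-based Trajectory-Aware State Block (TASB) models dynamic propagation along them, enabling anisotropic, context-sensitive state transitions.

Details

Motivation: Current CNN- and ViT-based ISTD models have no mechanism for modeling how targets trigger directional, layer-wise perturbations in feature space, although this is a key cue for separating signal from structured noise in infrared scenes.

Result: Experiments on the NUAA-SIRST and IRSTD-1K benchmarks show that TAPM-Net achieves state-of-the-art (SOTA) performance.

Insight: The core contribution is to explicitly model the spatial diffusion of target-induced feature perturbations, with two components, PGM and TASB, that extract and propagate trajectory information. The method replaces conventional attention with a state-space-model (Mamba) mechanism, achieving anisotropic, context-sensitive feature propagation at low computational cost while keeping global coherence, and it offers a new modeling perspective for small target detection.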

Abstract: Infrared small target detection (ISTD) remains a long-standing challenge due to weak signal contrast, limited spatial extent, and cluttered backgrounds. Despite performance improvements from convolutional neural networks (CNNs) and Vision Transformers (ViTs), current models lack a mechanism to trace how small targets trigger directional, layer-wise perturbations in the feature space, which is an essential cue for distinguishing signal from structured noise in infrared scenes. To address this limitation, we propose the Trajectory-Aware Mamba Propagation Network (TAPM-Net), which explicitly models the spatial diffusion behavior of target-induced feature disturbances. TAPM-Net is built upon two novel components: a Perturbation-guided Path Module (PGM) and a Trajectory-Aware State Block (TASB). The PGM constructs perturbation energy fields from multi-level features and extracts gradient-following feature trajectories that reflect the directionality of local responses. The resulting feature trajectories are fed into the TASB, a Mamba-based state-space unit that models dynamic propagation along each trajectory while incorporating velocity-constrained diffusion and semantically aligned feature fusion from word-level and sentence-level embeddings. Unlike existing attention-based methods, TAPM-Net enables anisotropic, context-sensitive state transitions along spatial trajectories while maintaining global coherence at low computational cost. Experiments on NUAA-SIRST and IRSTD-1K demonstrate that TAPM-Net achieves state-of-the-art performance in ISTD.


[32] ROAP: A Reading-Order and Attention-Prior Pipeline for Optimizing Layout Transformers in Key Information Extraction cs.CV | cs.CL

Tingwei Xie, Jinxin He, Yonghong Song

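TL;DR: This paper proposes ROAP, a lightweight, architecture-agnostic pipeline for optimizing the attention distributions of Layout Transformers in visually-rich document understanding. It extracts hierarchical reading sequences with an Adaptive-XY-Gap Tree, injects them into the attention mechanism through a reading-order-aware relative position bias, and adds a textual-token sub-block attention prior to suppress visual noise. Experiments show that ROAP consistently improves backbones such as LayoutLMv3 and GeoLayoutLM.

Details

Motivation: To fix the poor attention distributions of multimodal Transformers in visually-rich document understanding, which stem from the lack of explicit reading-order modeling and from interference by visual tokens.

Result: On the FUNSD and CORD benchmarks, ROAP consistently improves representative backbones, including LayoutLMv3 and GeoLayoutLM, confirming its effectiveness.

Insight: The novelty lies in explicitly modeling a document's logical reading order (through AXG-Tree and RO-RPB) and adaptively regulating modality interference (through TT-Prior). This gives a scalable solution for complex layout analysis without changing the pre-trained backbone.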

Abstract: The efficacy of Multimodal Transformers in visually-rich document understanding (VrDU) is critically constrained by two inherent limitations: the lack of explicit modeling for logical reading order and the interference of visual tokens that dilutes attention on textual semantics. To address these challenges, this paper presents ROAP, a lightweight and architecture-agnostic pipeline designed to optimize attention distributions in Layout Transformers without altering their pre-trained backbones. The proposed pipeline first employs an Adaptive-XY-Gap Tree (AXG-Tree) to robustly extract hierarchical reading sequences from complex layouts. These sequences are then integrated into the attention mechanism via a Reading-Order-Aware Relative Position Bias (RO-RPB). Furthermore, a Textual-Token Sub-block Attention Prior (TT-Prior) is introduced to adaptively suppress visual noise and enhance fine-grained text-text interactions. Extensive experiments on the FUNSD and CORD benchmarks demonstrate that ROAP consistently improves the performance of representative backbones, including LayoutLMv3 and GeoLayoutLM. These findings confirm that explicitly modeling reading logic and regulating modality interference are critical for robust document understanding, offering a scalable solution for complex layout analysis. The implementation code will be released at https://github.com/KevinYuLei/ROAP.


[33] Multi-Image Super Resolution Framework for Detection and Analysis of Plant Roots cs.CV | cs.ET

Shubham Agarwal, Ofek Nourian, Michael Sidorov, Sharon Chemweno, Ofer Hadar

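TL;DR: This paper proposes a multi-image super-resolution framework for detecting and analyzing plant roots. It captures multiple overlapping views of roots in underground environments and uses deep learning to enhance image resolution and detail, overcoming the limits of conventional vision methods under adverse conditions such as occlusion, soil moisture, and low contrast.

Details

Motivation: To address the challenges of imaging plant roots underground, such as occlusion, varying soil moisture, and low contrast, which limit conventional vision-based methods and in turn hinder research on soil-plant interactions, nutrient uptake, and plant health.

Result: Quantitative evaluation on a synthetic dataset shows that the method outperforms state-of-the-art super-resolution baselines, reducing BRISQUE by 2.3% at the same CLIP-IQA score. This indicates improved image quality and supports accurate estimation of root traits.

Insight: The novelty lies in exploiting spatial redundancy across multiple views for super-resolution reconstruction, which improves structural fidelity and visual clarity. By simulating real environmental factors in a synthetic dataset, the work points to a new direction for automatic underground root imaging and trait quantification in agricultural and ecological research.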

Abstract: Understanding plant root systems is critical for advancing research in soil-plant interactions, nutrient uptake, and overall plant health. However, accurate imaging of roots in subterranean environments remains a persistent challenge due to adverse conditions such as occlusion, varying soil moisture, and inherently low contrast, which limit the effectiveness of conventional vision-based approaches. In this work, we propose a novel underground imaging system that captures multiple overlapping views of plant roots and integrates a deep learning-based Multi-Image Super Resolution (MISR) framework designed to enhance root visibility and detail. To train and evaluate our approach, we construct a synthetic dataset that simulates realistic underground imaging scenarios, incorporating key environmental factors that affect image quality. Our proposed MISR algorithm leverages spatial redundancy across views to reconstruct high-resolution images with improved structural fidelity and visual clarity. Quantitative evaluations show that our approach outperforms state-of-the-art super resolution baselines, achieving a 2.3 percent reduction in BRISQUE, indicating improved image quality with the same CLIP-IQA score, thereby enabling enhanced phenotypic analysis of root systems. This, in turn, facilitates accurate estimation of critical root traits, including root hair count and root hair density. The proposed framework presents a promising direction for robust automatic underground plant root imaging and trait quantification for agricultural and ecological research.


[34] MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding cs.CV | cs.CL

Zizhong Li, Haopeng Zhang, Jiawei Zhang

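TL;DR: This paper proposes MMViR, a multi-modal, multi-granularity structured representation for long video understanding. It segments a video at key turning points and builds a three-level description that couples global narratives with fine-grained visual details, which supports efficient query-based retrieval and generalizes well across scenarios.

Details

Motivation: Current multi-modal large language models struggle with minute- to hour-long videos that contain complex events, diverse scenes, and long-range dependencies: encoding such videos directly is computationally too expensive, while simple video-to-text conversion tends to produce redundant or fragmented content.

Result: Extensive evaluation on three tasks (question answering, summarization, and retrieval) shows that MMViR outperforms the prior strongest method, with a 19.67% improvement on hour-long video understanding and processing latency reduced to 45.4% of the original.

Insight: The novelty is a structured multi-granularity video representation that, through turning-point segmentation and a coupled three-level description, improves processing efficiency substantially while preserving depth of understanding. It offers a reusable paradigm for efficient and accurate long video understanding.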

Abstract: Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest method, achieving a 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.


[35] Prompt-Free SAM-Based Multi-Task Framework for Breast Ultrasound Lesion Segmentation and Classification cs.CV | cs.AI | cs.LG

Samuel E. Johnny, Bernes L. Atabonfack, Israel Alagbe, Assane Gueye

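TL;DR: This study proposes a multi-task deep learning framework, built on the Segment Anything Model (SAM) vision encoder, that jointly performs lesion segmentation and diagnostic classification in breast ultrasound (BUS) images. It uses a prompt-free, supervised adaptation strategy in which high-dimensional SAM features are decoded by a lightweight convolutional head or a UNet-style decoder for pixel-wise segmentation, and a mask-guided attention mechanism strengthens the classification branch so that the model focuses on lesion-relevant features. On the PRECISE 2025 breast ultrasound dataset the method reaches a Dice Similarity Coefficient (DSC) of 0.887 and a classification accuracy of 92.3%, ranking among the top entries on the PRECISE challenge leaderboard.

Details

Motivation: Accurate tumor segmentation and classification in breast ultrasound is difficult because of low contrast, speckle noise, and diverse lesion morphology. Existing prompt-based SAM variants may be limited in medical imaging by the quality of the generated prompts, so a prompt-free multi-task framework that uses SAM's strong visual representations directly is needed to optimize segmentation and classification together.

Result: On the PRECISE 2025 breast ultrasound dataset (split per class into 80% training and 20% testing), the method achieves a DSC of 0.887 and a classification accuracy of 92.3%, placing it among the top entries on the PRECISE challenge leaderboard.

Insight: The contributions are: 1) a prompt-free multi-task adaptation of SAM that avoids the complexity of prompt generation; 2) a mask-guided attention mechanism that lets the classification branch use the segmentation mask to focus on the lesion region and suppress background interference; 3) evidence that SAM's pretrained visual representations are effective for medical image segmentation and classification, and that segmentation-guided learning improves performance substantially. More broadly, the study couples SAM's general visual ability with a specific medical task and offers an efficient multi-task learning paradigm for medical image analysis.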

Abstract: Accurate tumor segmentation and classification in breast ultrasound (BUS) imaging remain challenging due to low contrast, speckle noise, and diverse lesion morphology. This study presents a multi-task deep learning framework that jointly performs lesion segmentation and diagnostic classification using embeddings from the Segment Anything Model (SAM) vision encoder. Unlike prompt-based SAM variants, our approach employs a prompt-free, fully supervised adaptation where high-dimensional SAM features are decoded through either a lightweight convolutional head or a UNet-inspired decoder for pixel-wise segmentation. The classification branch is enhanced via mask-guided attention, allowing the model to focus on lesion-relevant features while suppressing background artifacts. Experiments on the PRECISE 2025 breast ultrasound dataset, split per class into 80 percent training and 20 percent testing, show that the proposed method achieves a Dice Similarity Coefficient (DSC) of 0.887 and an accuracy of 92.3 percent, ranking among the top entries on the PRECISE challenge leaderboard. These results demonstrate that SAM-based representations, when coupled with segmentation-guided learning, significantly improve both lesion delineation and diagnostic prediction in breast ultrasound imaging.
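
For reference, the Dice Similarity Coefficient reported above is the standard overlap measure 2|A∩B| / (|A| + |B|) between a predicted mask A and a ground-truth mask B. The Python sketch below computes it for flattened binary masks; returning 1.0 when both masks are empty is a common convention adopted here as an assumption, not something the abstract specifies.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity between two flattened binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total


if __name__ == "__main__":
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```

A DSC of 0.887 therefore means that the overlap between predicted and annotated lesion pixels is close to 89% of their average size.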


[36] Enabling Stroke-Level Structural Analysis of Hieroglyphic Scripts without Language-Specific Priors cs.CV | cs.CLPDF

Fuwen Luo, Zihao Wan, Ziyue Wang, Yaluo Liu, Pau Tong Lin Xu

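TL;DR: This paper proposes Hieroglyphic Stroke Analyzer (HieroSA), a novel and generalizable framework that enables multimodal large language models to automatically derive stroke-level structures from character bitmaps without relying on language-specific handcrafted data or priors. The framework converts images of modern logographic and ancient hieroglyphic characters into explicit, interpretable line-segment representations in a normalized coordinate space, enabling cross-lingual generalization.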

Details

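Motivation: Current advanced LLMs and multimodal LLMs typically treat logographic writing systems such as hieroglyphs as textual tokens or raw pixel grids and cannot model the underlying logic of character strokes. Existing structural analysis methods are also often script-specific and labor-intensive. This paper aims to address both problems with automatic, general-purpose analysis of character-internal structure.

Result: Extensive experiments show that HieroSA effectively captures character-internal structures and semantics without language-specific priors. The results highlight its potential as a graphematics analysis tool for a deeper understanding of hieroglyphic scripts.

Insight: The main contribution is a general framework that generates explicit stroke-level line-segment representations directly from character images without language-specific priors, enabling cross-lingual structural analysis. It gives multimodal models a new, interpretable structured representation for understanding complex writing systems and may advance graphematics and historical document analysis.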

Abstract: Hieroglyphs, as logographic writing systems, encode rich semantic and cultural information within their internal structural composition. Yet, current advanced Large Language Models (LLMs) and Multimodal LLMs (MLLMs) usually remain structurally blind to this information. LLMs process characters as textual tokens, while MLLMs additionally view them as raw pixel grids. Both fall short of modeling the underlying logic of character strokes. Furthermore, existing structural analysis methods are often script-specific and labor-intensive. In this paper, we propose Hieroglyphic Stroke Analyzer (HieroSA), a novel and generalizable framework that enables MLLMs to automatically derive stroke-level structures from character bitmaps without handcrafted data. It transforms modern logographic and ancient hieroglyphic character images into explicit, interpretable line-segment representations in a normalized coordinate space, allowing for cross-lingual generalization. Extensive experiments demonstrate that HieroSA effectively captures character-internal structures and semantics, bypassing the need for language-specific priors. Experimental results highlight the potential of our work as a graphematics analysis tool for a deeper understanding of hieroglyphic scripts. View our code at https://github.com/THUNLP-MT/HieroSA.


[37] GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting cs.CVPDF

Xuan Cheng, Jiahao Rao, Chengyang Li, Wenhao Wang, Weilin Chen

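TL;DR: This paper proposes GaussianSwap, a novel video face swapping framework based on 3D Gaussian Splatting. It builds a 3D Gaussian Splatting face avatar from a target video and transfers identity from a source image to that avatar, producing high-fidelity, animatable face-swapped video with good identity preservation.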

Details

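Motivation: Conventional video face swapping frameworks are limited to pixel-based facial representations: the result is merely a set of unstructured pixels with no capacity for animation or interactive manipulation. This paper aims for a paradigm shift from conventional pixel-level video generation to the creation of high-fidelity, interactive face-swapped avatars.

Result: Experiments show that GaussianSwap achieves superior identity preservation, visual clarity, and temporal consistency, and supports interactive applications that were previously unattainable.

Insight: Main contributions: 1) introducing 3D Gaussian Splatting to video face swapping to build an animatable 3D face avatar; 2) a compound identity embedding constructed from three state-of-the-art face recognition models, used to fine-tune the avatar for identity preservation; 3) rigging 3D Gaussian splats to the FLAME model across frames to enable dynamic facial control.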

Abstract: We introduce GaussianSwap, a novel video face swapping framework that constructs a 3D Gaussian Splatting based face avatar from a target video while transferring identity from a source image to the avatar. Conventional video swapping frameworks are limited to generating facial representations in pixel-based formats. The resulting swapped faces exist merely as a set of unstructured pixels without any capacity for animation or interactive manipulation. Our work introduces a paradigm shift from conventional pixel-based video generation to the creation of high-fidelity avatars with swapped faces. The framework first preprocesses the target video to extract FLAME parameters, camera poses and segmentation masks, and then rigs 3D Gaussian splats to the FLAME model across frames, enabling dynamic facial control. To ensure identity preservation, we propose a compound identity embedding constructed from three state-of-the-art face recognition models for avatar finetuning. Finally, we render the face-swapped avatar on the background frames to obtain the face-swapped video. Experimental results demonstrate that GaussianSwap achieves superior identity preservation, visual clarity and temporal consistency, while enabling previously unattainable interactive applications.


[38] SAS-VPReID: A Scale-Adaptive Framework with Shape Priors for Video-based Person Re-Identification at Extreme Far Distances cs.CVPDF

Qiwei Yang, Pingping Zhang, Yuhao Wang, Zijing Gong

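TL;DR: This paper proposes SAS-VPReID, a scale-adaptive framework for video-based person re-identification at extreme far distances. It comprises three complementary modules: MEVB, which uses the CLIP vision encoder and multi-proxy memory to extract discriminative feature representations; MGTM, which constructs sequences at multiple temporal granularities and adaptively emphasizes motion cues across scales; and PRSD, which incorporates prior knowledge to capture body structure dynamics.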

Details

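Motivation: At extreme far distances, video-based person re-identification is highly challenging because of severe resolution degradation, drastic viewpoint variation, and inevitable appearance noise.

Result: Experiments on the VReID-XFD benchmark demonstrate the effectiveness of each module, and the final framework ranks first on the VReID-XFD challenge leaderboard.

Insight: The novelty lies in combining the CLIP vision encoder with multi-proxy memory for feature extraction, and in using multi-granularity temporal modeling and prior-regularized shape dynamics to improve robustness on distant, low-quality video. Viewed objectively, the work systematically integrates visual backbone enhancement, temporal modeling, and shape priors into a comprehensive solution for VPReID under extreme conditions.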

Abstract: Video-based Person Re-IDentification (VPReID) aims to retrieve the same person from videos captured by non-overlapping cameras. At extreme far distances, VPReID is highly challenging due to severe resolution degradation, drastic viewpoint variation and inevitable appearance noise. To address these issues, we propose a Scale-Adaptive framework with Shape Priors for VPReID, named SAS-VPReID. The framework is built upon three complementary modules. First, we deploy a Memory-Enhanced Visual Backbone (MEVB) to extract discriminative feature representations, which leverages the CLIP vision encoder and multi-proxy memory. Second, we propose a Multi-Granularity Temporal Modeling (MGTM) module to construct sequences at multiple temporal granularities and adaptively emphasize motion cues across scales. Third, we incorporate Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics. With these modules, our framework can obtain more discriminative feature representations. Experiments on the VReID-XFD benchmark demonstrate the effectiveness of each module, and our final framework ranks first on the VReID-XFD challenge leaderboard. The source code is available at https://github.com/YangQiWei3/SAS-VPReID.


[39] DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion cs.CVPDF

Yiming Sun, Zifan Ye, Qinghua Hu, Pengfei Zhu

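TL;DR: This paper proposes DIFF-MF, a novel multi-modal image fusion method that uses feature discrepancy maps between modalities to guide feature extraction, then fuses along both channel and spatial dimensions to integrate complementary information. A channel-exchange module uses cross-attention dual state space modeling for adaptive feature reweighting, and a spatial-exchange module uses cross-modal state space scanning for comprehensive spatial fusion, capturing global dependencies efficiently while maintaining linear computational complexity.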

Details

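Motivation: Existing multi-modal image fusion methods based on state space models tend either to over-prioritize infrared intensity at the cost of visible details or, conversely, to preserve visible structure while diminishing thermal target salience. This paper aims to overcome these issues and achieve more balanced, higher-quality fusion.

Result: Experiments on driving-scenario and low-altitude UAV datasets show that the method outperforms existing approaches in both visual quality and quantitative evaluation.

Insight: The novelty is a difference-driven channel-spatial state space model that uses inter-modality feature discrepancy maps as guidance, together with channel-exchange and spatial-exchange modules (based on cross-attention dual state space modeling and cross-modal state space scanning, respectively) that enable more effective cross-modal feature interaction and fusion while keeping linear computational complexity.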

Abstract: Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend to either over-prioritize infrared intensity at the cost of visible details, or conversely, preserve visible structure while diminishing thermal target salience. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on the driving scenarios and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.


[40] MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation cs.CVPDF

Yanfeng Li, Yue Sun, Keren Fu, Sio-Kei Im, Xiaoming Liu

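TL;DR: MoGen proposes a unified collaborative framework for controllable multi-object image generation. A Regional Semantic Anchor module precisely aligns language descriptions with image regions, and an Adaptive Multi-modal Guidance module integrates multi-source control signals for dynamic fine-grained control.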

Details

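Motivation: Existing multi-object image generation methods struggle to precisely align localized generation regions with their corresponding semantics from language descriptions, which often causes inconsistent object quantities and attribute aliasing. Mainstream methods also depend on external control signals, making the input format rigid and unable to accommodate users' heterogeneous resource conditions and diverse constraint requirements.

Result: Experiments show that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while offering better accessibility and control flexibility.

Insight: Contributions include the Regional Semantic Anchor module, which enables text-to-image generation that follows object quantity specifications, and the Adaptive Multi-modal Guidance module, which adaptively parses and integrates multi-source control signals into a structured intent that guides selective constraints, improving controllability and flexibility.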

Abstract: Existing multi-object image generation methods face difficulties in achieving precise alignment between localized image generation regions and their corresponding semantics based on language descriptions, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantic and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with the heterogeneous resource conditions of users and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions during the generation process, enabling text-to-image generation that follows quantity specifications for multiple objects. Building upon this foundation, we further introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals to formulate corresponding structured intent. This intent subsequently guides selective constraints on scene layouts and object attributes, achieving dynamic fine-grained control. Experimental results demonstrate that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while exhibiting superior accessibility and control flexibility. Code is available at: https://github.com/Tear-kitty/MoGen/tree/master.


[41] VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck cs.CV | cs.AIPDF

Feiran Zhang, Yixin Wu, Zhenghua Wang, Xiaohua Wang, Changze Lv

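TL;DR: This paper proposes VIB-Probe, a new framework for detecting and mitigating hallucinations in vision-language models. Based on Variational Information Bottleneck theory, it analyzes the outputs of internal attention heads to extract discriminative patterns while filtering out semantic noise, identifies the attention heads associated with hallucinations, and introduces an inference-time intervention strategy to mitigate them.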

Details

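Motivation: Vision-language models perform well on multimodal tasks but are prone to hallucinations, where generated text deviates from the visual content. Existing methods rely mainly on output logits or external verification tools and overlook the model's internal mechanisms. This paper probes internal attention heads to detect and mitigate hallucinations, addressing the challenge that visual-linguistic syntax and noise are entangled in high-dimensional states.

Result: Extensive experiments across multiple benchmarks show that VIB-Probe significantly outperforms existing baselines in both hallucination detection and mitigation.

Insight: The novelty lies in using Variational Information Bottleneck theory to extract key signals from internal attention heads while filtering noise, and in using gradient analysis to identify attention heads with causal influence on hallucinations and intervening on them at inference time. This offers a new way to understand internal mechanisms and improve generation truthfulness.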

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal tasks, but remain susceptible to hallucinations, where generated text deviates from the underlying visual content. Existing hallucination detection methods primarily rely on output logits or external verification tools, often overlooking their internal mechanisms. In this work, we investigate the outputs of internal attention heads, postulating that specific heads carry the primary signals for truthful generation. However, directly probing these high-dimensional states is challenging due to the entanglement of visual-linguistic syntax and noise. To address this, we propose VIB-Probe, a novel hallucination detection and mitigation framework leveraging the Variational Information Bottleneck (VIB) theory. Our method extracts discriminative patterns across layers and heads while filtering out semantic nuisances through the information bottleneck principle. Furthermore, by leveraging the gradients of our VIB probe, we identify attention heads with strong causal influence on hallucinations and introduce an inference-time intervention strategy for hallucination mitigation. Extensive experiments across diverse benchmarks demonstrate that VIB-Probe significantly outperforms existing baselines in both settings. Our code will be made publicly available.
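
The abstract does not give VIB-Probe's exact objective. As a rough illustration of the information bottleneck principle it cites, the sketch below follows the commonly used deep VIB form: a prediction term (cross-entropy) plus a β-weighted KL term that compresses a Gaussian latent toward a standard normal prior. The function names, the scalar-probability interface, and the default `beta` are assumptions made for illustration.

```python
import math


def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single latent vector."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def vib_loss(true_class_prob, mu, logvar, beta=1e-3):
    """Cross-entropy of the probe's prediction plus a beta-weighted compression term.

    true_class_prob: probability the probe assigns to the correct label
                     (e.g. "hallucinated" vs. "faithful"); must be in (0, 1].
    mu, logvar:      parameters of the Gaussian latent produced by the encoder.
    """
    cross_entropy = -math.log(true_class_prob)
    return cross_entropy + beta * gaussian_kl_to_standard_normal(mu, logvar)


# A latent that already matches the prior adds no compression penalty.
print(vib_loss(0.9, mu=[0.0, 0.0], logvar=[0.0, 0.0]))
# A latent far from the prior is penalized in proportion to beta.
print(vib_loss(0.9, mu=[2.0, -1.0], logvar=[0.5, 0.0]))
```

Larger values of `beta` push the latent to discard more of the input, which is the mechanism the abstract credits with filtering nuisance information.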


[42] One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection cs.CVPDF

Bin-Bin Gao, Chengjie Wang

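TL;DR: This paper proposes UniADet, a universal visual anomaly detection framework that decouples classification from segmentation and decouples cross-level features, learning only a small set of decoupled weights and achieving efficient anomaly detection without a language encoder.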

Details

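Motivation: Existing universal anomaly detection methods built on vision-language foundation models typically depend on complex prompt engineering, adaptation modules, and training strategies, which limits their flexibility and generality. This paper aims to address these issues.

Result: On 14 real-world anomaly detection benchmarks covering industrial and medical domains, UniADet surpasses state-of-the-art zero-/few-shot methods by a large margin and, for the first time, even surpasses full-shot methods, using only 0.002M learnable parameters.

Insight: The key finding is that the language encoder is unnecessary for universal anomaly detection. The paper proposes a simple and effective decoupling method that learns independent weights for different tasks and hierarchical features, improving parameter efficiency and generalization.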

Abstract: Universal visual anomaly detection (AD) aims to identify anomaly images and segment anomaly regions towards open and dynamic scenarios, following zero- and few-shot paradigms without any dataset-specific fine-tuning. We have witnessed significant progress through the wide use of visual-language foundational models in recent approaches. However, current methods often struggle with complex prompt engineering, elaborate adaptation modules, and challenging training strategies, ultimately limiting their flexibility and generality. To address these issues, this paper rethinks the fundamental mechanism behind visual-language models for AD and presents an embarrassingly simple, general, and effective framework for Universal vision Anomaly Detection (UniADet). Specifically, we first find that the language encoder is used to derive decision weights for anomaly classification and segmentation, and then demonstrate that it is unnecessary for universal AD. Second, we propose an embarrassingly simple method to completely decouple classification and segmentation, and decouple cross-level features, i.e., learning independent weights for different tasks and hierarchical features. UniADet is highly simple (learning only decoupled weights), parameter-efficient (only 0.002M learnable parameters), general (adapting a variety of foundation models), and effective (surpassing state-of-the-art zero-/few-shot by a large margin and even full-shot AD methods for the first time) on 14 real-world AD benchmarks covering both industrial and medical domains. We will make the code and model of UniADet available at https://github.com/gaobb/UniADet.


[43] What’s Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews cs.CV | cs.SIPDF

Fanxiao Li, Jiaying Wu, Tingchao Fu, Dayang Li, Herun Wan

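TL;DR: This paper addresses covert misleadingness in social-media news previews (image-headline pairs) caused by selectively omitting crucial context, and proposes a multi-stage approach to detecting and correcting it. The authors build the MM-Misleading benchmark, evaluate the blind spots of open-source large vision-language models in detecting such misleadingness, and propose OMGuard, a framework that combines interpretation-aware fine-tuning with rationale-guided correction to substantially improve detection accuracy and enable end-to-end correction of misleading content.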

Details

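Motivation: Even when factually correct, social-media news previews can cause readers' understanding to diverge from the full article by selectively omitting crucial context. This covert misleadingness is harder to detect than explicit misinformation and remains underexplored.

Result: On the MM-Misleading benchmark, OMGuard lifts an 8B model's detection accuracy to a level comparable to a 235B large vision-language model and delivers markedly stronger end-to-end correction.

Insight: The paper's contributions are the construction of the first benchmark for omission-based misleadingness in multimodal news previews (MM-Misleading) and the OMGuard framework, which combines interpretation-aware fine-tuning with rationale-guided correction. Viewed objectively, its core insight is that such misleadingness usually stems from local narrative shifts (e.g., missing background) rather than global frame changes; it also identifies the limits of text-only correction in image-driven scenarios and highlights the need for visual interventions.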

Abstract: Even when factually correct, social-media news previews (image-headline pairs) can induce interpretation drift: by selectively omitting crucial context, they lead readers to form judgments that diverge from what the full article conveys. This covert harm is harder to detect than explicit misinformation yet remains underexplored. To address this gap, we develop a multi-stage pipeline that disentangles and simulates preview-based versus context-based understanding, enabling construction of the MM-Misleading benchmark. Using this benchmark, we systematically evaluate open-source LVLMs and uncover pronounced blind spots to omission-based misleadingness detection. We further propose OMGuard, which integrates (1) Interpretation-Aware Fine-Tuning, which is used to improve multimodal misleadingness detection, and (2) Rationale-Guided Misleading Content Correction, which uses explicit rationales to guide headline rewriting and reduce misleading impressions. Experiments show that OMGuard lifts an 8B model’s detection accuracy to match a 235B LVLM and delivers markedly stronger end-to-end correction. Further analysis reveals that misleadingness typically stems from local narrative shifts (e.g., missing background) rather than global frame changes, and identifies image-driven scenarios where text-only correction fails, highlighting the necessity of visual interventions.


[44] Towards Generalized Multi-Image Editing for Unified Multimodal Models cs.CVPDF

Pengcheng Xu, Peng Tang, Donghao Luo, Xiaobin Hu, Weichu Cui

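TL;DR: This paper proposes a scalable multi-image editing framework for Unified Multimodal Models (UMMs), targeting their difficulty in maintaining visual consistency and disambiguating visual cues when handling multiple input images. The framework introduces learnable latent separators and sinusoidal index encoding to explicitly distinguish image identities and generalize to variable numbers of inputs.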

Details

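Motivation: Unified multimodal models are limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images, so a multi-image editing method is needed that explicitly distinguishes image identities and supports a variable number of inputs.

Result: Experiments on diverse multi-image editing tasks show improvements over prior baselines in semantic consistency, visual fidelity, and cross-image integration, validating the method's advantages in consistency and generalization.

Insight: Contributions include: 1) learnable latent separators that explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning; 2) sinusoidal index encoding, which assigns visual tokens from the same image a continuous sinusoidal index embedding, providing explicit image identity while allowing generalization and extrapolation to a variable number of inputs. In addition, a high-fidelity benchmark is built with an inverse dataset construction methodology to guarantee artifact-free, achievable outputs.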

Abstract: Unified Multimodal Models (UMMs) integrate multimodal understanding and generation, yet they are limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images. In this work, we propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts. Algorithmically, we introduce two innovations: 1) The learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning. 2) The sinusoidal index encoding assigns visual tokens from the same image a continuous sinusoidal index embedding, which provides explicit image identity while allowing generalization and extrapolation on a variable number of inputs. To facilitate training and evaluation, we establish a high-fidelity benchmark using an inverse dataset construction methodology to guarantee artifact-free, achievable outputs. Experiments show clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines on diverse multi-image editing tasks, validating our advantages on consistency and generalization ability.
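
The abstract does not spell out the sinusoidal index encoding. One plausible reading, sketched below, applies the standard Transformer sinusoidal formula to the image index rather than to the token position, and adds the resulting vector to every token of that image. The names `sinusoidal_index_embedding` and `tag_tokens`, the additive combination, and the `base` constant are assumptions, not the authors' implementation.

```python
import math


def sinusoidal_index_embedding(image_index, dim, base=10000.0):
    """Standard sinusoidal encoding of a scalar index into a vector of size `dim`."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    embedding = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        embedding.append(math.sin(image_index * freq))
        embedding.append(math.cos(image_index * freq))
    return embedding


def tag_tokens(tokens_per_image, dim):
    """Add the same index embedding to every visual token of a given image.

    tokens_per_image[k] is the list of token vectors (each of length `dim`)
    belonging to the k-th reference image.
    """
    tagged = []
    for image_index, tokens in enumerate(tokens_per_image):
        index_emb = sinusoidal_index_embedding(image_index, dim)
        tagged.append([[t + e for t, e in zip(token, index_emb)] for token in tokens])
    return tagged


# Two reference images with 1 and 2 tokens; zero vectors make the index embedding visible.
tagged = tag_tokens([[[0.0] * 4], [[0.0] * 4, [0.0] * 4]], dim=4)
print(tagged[0][0])
print(tagged[1][0])
```

Because the embedding is a closed-form function of the index, it is defined for any number of input images, which is consistent with the extrapolation property the abstract describes.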


[45] SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes cs.CV | cs.CL | cs.LGPDF

Chuhan Wang, Xintong Li, Jennifer Yuntong Zhang, Junda Wu, Chengkai Huang

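TL;DR: This paper proposes SceneAlign, a framework that uses scene graphs as structured visual information to perform controllable structural interventions, addressing unfaithful reasoning by multimodal large language models in complex visual scenes. It identifies reasoning-critical nodes, constructs contrastive samples that mimic typical visual grounding failures, and applies Direct Preference Optimization to improve accuracy and faithfulness on visual reasoning tasks.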

Details

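Motivation: Multimodal large language models often reason unfaithfully in complex visual scenes, producing hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based methods rely on textual perturbations or answer-conditioned rationales and therefore cannot stop models from using language priors to bypass visual grounding.

Result: Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, demonstrating the effectiveness of grounding-aware alignment for multimodal reasoning.

Insight: The novelty lies in using scene graphs for structured visual interventions: by simulating grounding failures, the method constructs hard negatives that are linguistically plausible but visually incorrect, and uses Direct Preference Optimization to achieve fine-grained, structure-faithful reasoning alignment. Viewed objectively, it combines structured visual representations with contrastive preference learning and offers a new approach to faithfulness alignment in multimodal reasoning.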

Abstract: Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that remain linguistically plausible but are grounded in inaccurate visual facts. These contrastive pairs are used in Direct Preference Optimization to steer models toward fine-grained, structure-faithful reasoning. Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, highlighting the effectiveness of grounding-aware alignment for multimodal reasoning.
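
For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard per-pair DPO loss that such contrastive pairs would feed into: the faithful rationale plays the role of the chosen response and the perturbed hard negative the rejected one. The function name, the scalar log-probability interface, and the default `beta` are illustrative assumptions; the abstract does not state SceneAlign's training hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full rationale under the
    trainable policy or the frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy prefers the grounded rationale more than the reference does: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss decreases only when the policy raises the chosen rationale's likelihood relative to the rejected one by more than the reference model does, so a hard negative that differs only in its visual facts forces that gap to come from grounding rather than fluency.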


[46] LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction cs.CVPDF

Chengen Xie, Bin Sun, Tianyu Li, Junjie Wu, Zhihui Hao

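TL;DR: This paper proposes LatentVLA, a new Vision-Language-Action (VLA) framework for autonomous driving. It trains the model with self-supervised latent action prediction, requiring no language annotations and thereby avoiding linguistic bias, and uses knowledge distillation to transfer the VLA model's generalization ability to an efficient vision-based network, achieving robust performance and real-time efficiency.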

Details

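Motivation: Existing end-to-end autonomous driving models perform poorly in long-tail scenarios, and current VLA models suffer from imprecise numerical prediction, heavy reliance on language annotations (which introduces linguistic bias and annotation burden), and chain-of-thought reasoning that is computationally inefficient and hard to deploy in real time.

Result: LatentVLA achieves a PDMS score of 92.4 on the NAVSIM benchmark, a new state of the art (SOTA), and shows strong zero-shot generalization on the nuScenes benchmark.

Insight: The core idea is to train VLA models with self-supervised latent action prediction, removing the dependence on language annotations and eliminating linguistic bias, and to use knowledge distillation to transfer knowledge from a powerful but inefficient VLA model to an efficient vision-based network, balancing performance and efficiency.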

Abstract: End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations due to limited scenario diversity. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision-language models to address this limitation, yet face critical challenges: (1) numerical imprecision in trajectory prediction due to discrete tokenization, (2) heavy reliance on language annotations that introduce linguistic bias and annotation burden, and (3) computational inefficiency from multi-step chain-of-thought reasoning that hinders real-time deployment. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations, eliminating linguistic bias while learning rich driving representations from unlabeled trajectory data. Through knowledge distillation, LatentVLA transfers the generalization capabilities of VLA models to efficient vision-based networks, achieving both robust performance and real-time efficiency. LatentVLA establishes a new state-of-the-art on the NAVSIM benchmark with a PDMS score of 92.4 and demonstrates strong zero-shot generalization on the nuScenes benchmark.


[47] SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving cs.CVPDF

Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun

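TL;DR: SGDrive proposes a new framework for autonomous driving that builds a hierarchical scene-agent-goal cognition structure, organizing the representation learning of a general-purpose vision-language model (VLM) around driving-specific knowledge to address VLMs' lack of structured 3D spatial-temporal understanding.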

Details

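Motivation: General-purpose vision-language models lack specialized understanding of 3D spatial-temporal relationships, geometric structure, and motion patterns in driving scenes, so they struggle to build effective structured spatial-temporal representations for trajectory planning.

Result: On the NAVSIM benchmark, SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS.

Insight: The novelty is the explicit decomposition of driving cognition into a scene-agent-goal hierarchy that mirrors human driving cognition, giving general-purpose VLMs the structured spatial-temporal representation they otherwise lack and allowing multi-level information to be integrated more effectively for trajectory planning.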

Abstract: Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM’s representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.


[48] SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More cs.CVPDF

Muye Huang, Lingling Zhang, Yifei Li, Yaqiang Wu, Jun Liu

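TL;DR: This paper proposes SketchVL, a new multimodal large language model optimized through fine-grained credit assignment, for chart understanding and related tasks. Its core is the FinePO reinforcement learning algorithm, which uses FinePRM to score each drawing action in a trajectory, providing step-level credit assignment and improving the model's reasoning ability.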

Details

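Motivation: Existing RL-trained multimodal large language models face a credit assignment problem on tasks such as chart understanding that require precise, complex visual reasoning: their trajectory-level advantage estimation cannot distinguish correct from incorrect reasoning steps within a single generated response.

Result: Experiments show that SketchVL achieves an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics tasks.

Insight: The novelty lies in a fine-grained process reward model and in drawing intermediate reasoning steps as markers on the image, which enable step-level behavior alignment and credit assignment and suggest a new direction for training strong reasoning models.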

Abstract: Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that is optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL’s methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.


[49] Rotate Your Character: Revisiting Video Diffusion Models for High-Quality 3D Character Generation cs.CVPDF

Jin Wang, Jianxiang Lu, Comi Chen, Guangzheng Xu, Haoyu Yang

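TL;DR: This paper proposes RCM (Rotate your Character Model), an advanced image-to-video diffusion framework tailored for high-quality novel view synthesis (NVS) and 3D character generation. It can transfer characters in arbitrarily complex poses into a canonical pose, enabling consistent novel view synthesis across the entire viewing orbit, and it supports high-resolution (1024x1024) orbital video generation, controllable observation positions, and multi-view conditioning with up to 4 input images.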

Details

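Motivation: Generating high-quality 3D characters from a single image is a significant challenge in digital content creation, particularly because of complex body poses and self-occlusion.

Result: Extensive experiments show that RCM outperforms state-of-the-art (SOTA) methods in both novel view synthesis and 3D generation quality.

Insight: The core contribution is transferring characters in arbitrarily complex poses into a canonical pose, which yields consistent, high-quality novel view synthesis across the entire viewing orbit. Together with high-resolution video generation, camera pose control, and multi-view conditioning, this gives a more capable and flexible framework for single-image 3D character generation.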

Abstract: Generating high-quality 3D characters from single images remains a significant challenge in digital content creation, particularly due to complex body poses and self-occlusion. In this paper, we present RCM (Rotate your Character Model), an advanced image-to-video diffusion framework tailored for high-quality novel view synthesis (NVS) and 3D character generation. Compared to existing diffusion-based approaches, RCM offers several key advantages: (1) transferring characters with any complex poses into a canonical pose, enabling consistent novel view synthesis across the entire viewing orbit, (2) high-resolution orbital video generation at 1024x1024 resolution, (3) controllable observation positions given different initial camera poses, and (4) multi-view conditioning supporting up to 4 input images, accommodating diverse user scenarios. Extensive experiments demonstrate that RCM outperforms state-of-the-art methods in both novel view synthesis and 3D generation quality.


[50] TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment cs.CVPDF

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang

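TL;DR: This paper proposes TAGRPO, a robust post-training framework that improves image-to-video (I2V) generation models trained with Group Relative Policy Optimization (GRPO). Inspired by contrastive learning, it uses video trajectories generated from identical initial noise as optimization guidance and introduces a new GRPO loss on intermediate latents that aligns directly with high-reward trajectories and moves away from low-reward ones. A memory bank adds diversity and reduces computational overhead.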

Details

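Motivation: Prior work integrating GRPO into flow matching models is effective for text-to-image and text-to-video generation, but applying it directly to I2V models often fails to yield consistent reward improvements, so a more effective optimization method is needed.

Result: On I2V generation, TAGRPO achieves significant improvements over DanceGRPO.

Insight: The novelty is the observation that video trajectories generated from identical initial noise provide better optimization guidance, which motivates a GRPO loss applied to intermediate latents. Combined with a memory bank, the method improves performance with a simple implementation and offers a new approach to RL optimization based on trajectory alignment.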

Abstract: Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
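
As background for the GRPO baseline mentioned above, the sketch below shows the group-relative advantage that GRPO-style methods commonly use: each rollout's reward is normalized by the mean and standard deviation of its own group. This is not TAGRPO's trajectory-alignment loss, which the abstract does not specify; the function name and `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against the other rollouts in its group."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for four videos sampled from the same input image and prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
print(advantages)
```

Rollouts with positive advantage are the high-reward trajectories a method would pull toward, and those with negative advantage are the low-reward ones it would push away from.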


[51] ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers cs.CV | cs.LGPDF

Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Jan Niklas Kolf, Marco Huber

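TL;DR: This paper proposes ViTNT-FIQA, a training-free face image quality assessment method that scores image quality by the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. High-quality face images follow stable feature trajectories across ViT blocks, whereas degraded images change erratically. The method needs only a single forward pass, with no backpropagation or model modification, and achieves performance competitive with state-of-the-art methods on multiple benchmarks.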

Details

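Motivation: Current face image quality assessment methods mainly use final-layer representations, while training-free methods usually need multiple forward passes or backpropagation and are computationally expensive. This paper aims to develop an efficient, training-free FIQA method that needs only a single forward pass.

Result: Extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C) shows that ViTNT-FIQA achieves performance competitive with state-of-the-art (SOTA) methods while remaining computationally efficient.

Insight: The novelty is using the stability of patch embedding evolution across intermediate ViT blocks as the quality signal, computed as Euclidean distances between L2-normalized embeddings from consecutive blocks and then aggregated. This offers a new, efficient view of training-free FIQA that applies directly to any pre-trained ViT-based face recognition model.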

Abstract: Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
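
The scoring procedure described in the abstract can be sketched roughly as follows: L2-normalize each patch embedding, measure the Euclidean distance between the same patch in consecutive blocks, and aggregate. The mean aggregation and the sign convention (higher score means more stable, hence higher quality) are assumptions, since the abstract does not state them; `stability_score` is a hypothetical name, and the nested lists stand in for activations that would come from a real ViT forward pass.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def stability_score(block_embeddings):
    """Quality proxy from patch-embedding drift across consecutive blocks.

    block_embeddings[b][p] is the embedding of patch p after transformer block b.
    Returns the negated mean distance, so stable trajectories score higher.
    """
    if len(block_embeddings) < 2:
        raise ValueError("need embeddings from at least two blocks")
    distances = []
    for prev_block, next_block in zip(block_embeddings, block_embeddings[1:]):
        for prev_patch, next_patch in zip(prev_block, next_block):
            a = l2_normalize(prev_patch)
            b = l2_normalize(next_patch)
            distances.append(math.dist(a, b))
    return -sum(distances) / len(distances)


# One patch, three blocks: direction preserved vs. direction flipping each block.
stable = [[[1.0, 0.0]], [[2.0, 0.0]], [[3.0, 0.0]]]
erratic = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]]]
print(stability_score(stable), stability_score(erratic))
```

Since all block outputs are available from one forward pass (for example via forward hooks), this kind of score needs no gradients, which matches the single-pass property claimed in the abstract.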


[52] Adaptive Disentangled Representation Learning for Incomplete Multi-View Multi-Label Classification cs.CV | cs.AIPDF

Quanjiang Li, Zhiming Liu, Tianxiang Xu, Tingjin Luo, Chenping Hou

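TL;DR: This paper proposes an Adaptive Disentangled Representation Learning method (ADRL) for multi-view multi-label classification with simultaneous feature absence and incomplete annotations. It achieves robust view completion by propagating feature-level affinity across modalities, strengthens reconstruction with a stochastic masking strategy, and refines distribution parameters through category-level association across label distributions to capture interdependent label prototypes. ADRL also formulates a mutual-information-based objective that promotes consistency among shared representations and suppresses information overlap between view-specific representations and other modalities, and it uses pseudo-label generation and feature selection to guide a discriminative trade-off during view fusion.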

Details

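Motivation: In multi-view multi-label learning, feature absence and incomplete annotations occur together because data acquisition is difficult and annotation is costly. The paper addresses this problem while overcoming the limitations of existing methods in feature recovery, representation disentanglement, and label semantics modeling.

Result: Extensive experiments on public datasets and real-world applications show that ADRL achieves superior performance.

Insight: Contributions include: robust view completion through neighborhood-aware cross-modal feature propagation; stochastic masking to strengthen reconstruction; refinement of label prototypes through association across label distributions; a mutual-information-based objective that promotes representation disentanglement; and use of the structure of the pseudo-label space to guide a discriminative trade-off during view fusion. Viewed objectively, the method is novel in integrating feature recovery with label semantics modeling, and it offers a systematic disentangled representation learning framework for incomplete multi-view multi-label data.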

Abstract: Multi-view multi-label learning frequently suffers from simultaneous feature absence and incomplete annotations, due to challenges in data acquisition and cost-intensive supervision. To tackle the complex yet highly practical problem while overcoming the existing limitations of feature recovery, representation disentanglement, and label semantics modeling, we propose an Adaptive Disentangled Representation Learning method (ADRL). ADRL achieves robust view completion by propagating feature-level affinity across modalities with neighborhood awareness, and reinforces reconstruction effectiveness by leveraging a stochastic masking strategy. Through disseminating category-level association across label distributions, ADRL refines distribution parameters for capturing interdependent label prototypes. Besides, we formulate a mutual-information-based objective to promote consistency among shared representations and suppress information overlap between view-specific representation and other modalities. Theoretically, we derive the tractable bounds to train the dual-channel network. Moreover, ADRL performs prototype-specific feature selection by enabling independent interactions between label embeddings and view representations, accompanied by the generation of pseudo-labels for each category. The structural characteristics of the pseudo-label space are then exploited to guide a discriminative trade-off during view fusion. Finally, extensive experiments on public datasets and real-world applications demonstrate the superior performance of ADRL.


[53] Boosting Latent Diffusion Models via Disentangled Representation Alignment cs.CVPDF

John Page, Xuesong Niu, Kai Wu, Kun Gai

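TL;DR: This paper proposes Send-VAE, a semantically disentangled variational autoencoder designed to improve the image tokenizer used in latent diffusion models (LDMs). By aligning the VAE latent space with the semantic hierarchy of pre-trained vision foundation models (VFMs), Send-VAE learns attribute-level, semantically disentangled representations and thereby improves generation. Experiments show that flow-based transformers (SiTs) trained with Send-VAE reach a SOTA FID on ImageNet 256x256 and train significantly faster.

Details

Motivation: Existing work typically uses the same alignment target (VFMs) to optimize both VAEs and LDMs, which overlooks their fundamentally different representational needs: LDMs benefit from latents that retain high-level semantic concepts, whereas VAEs should excel at semantic disentanglement so that attribute-level information is encoded in a structured way. The paper addresses this mismatch.

Result: On the ImageNet 256x256 benchmark, SiTs trained with Send-VAE reach FID scores of 1.21 and 1.75 with and without classifier-free guidance, respectively, which is SOTA, and training is significantly faster.

Insight: The novelty lies in explicitly separating the alignment targets of the VAE and the LDM, and in using a non-linear mapper network to align the VAE latent space with the semantic hierarchy of VFMs so that attribute-level, semantically disentangled representations are learned. This offers a new route to better generative models.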

Abstract: Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, enabling encoding of attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning through aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers SiTs; experiments show Send-VAE significantly speeds up training and achieves a state-of-the-art FID of 1.21 and 1.75 with and without classifier-free guidance on ImageNet 256x256.


[54] GeoSurDepth: Spatial Geometry-Consistent Self-Supervised Depth Estimation for Surround-View Cameras cs.CVPDF

Weimin Liu, Wenjun Wang, Joshua H. Meng

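TL;DR: GeoSurDepth is a self-supervised depth estimation framework for surround-view cameras that improves accuracy by using spatial geometry consistency as its primary cue. It uses foundation models as a pseudo geometry prior to enhance feature representations, and it introduces a novel view synthesis pipeline in which 2D-3D lifting is achieved through spatial warping, providing additional photometric supervision across temporal, spatial, and spatial-temporal contexts.

Details

Motivation: Existing surround-view depth estimation methods focus mainly on cross-view constraints at the photometric level and do not fully exploit the rich geometric structure inherent in monocular and surround-view settings. The paper addresses this through geometry consistency, aiming for more robust 3D scene understanding.

Result: Extensive experiments on the DDAD and nuScenes datasets show that GeoSurDepth achieves state-of-the-art performance, validating the approach.

Insight: The novelty lies in treating geometry consistency as the core supervision signal, using foundation models as a pseudo geometry prior to enhance feature representations, and proposing a view synthesis pipeline based on spatial warping together with an adaptive joint motion learning strategy that emphasizes informative spatial geometry cues and improves motion reasoning.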

Abstract: Accurate surround-view depth estimation provides a competitive alternative to laser-based sensors and is essential for 3D scene understanding in autonomous driving. While prior studies have proposed various approaches that primarily focus on enforcing cross-view constraints at the photometric level, few explicitly exploit the rich geometric structure inherent in both monocular and surround-view settings. In this work, we propose GeoSurDepth, a framework that leverages geometry consistency as the primary cue for surround-view depth estimation. Concretely, we utilize foundation models as a pseudo geometry prior and feature representation enhancement tool to guide the network to maintain surface normal consistency in spatial 3D space and regularize object- and texture-consistent depth estimation in 2D. In addition, we introduce a novel view synthesis pipeline where 2D-3D lifting is achieved with dense depth reconstructed via spatial warping, encouraging additional photometric supervision across temporal, spatial, and spatial-temporal contexts, and compensating for the limitations of single-view image reconstruction. Finally, a newly proposed adaptive joint motion learning strategy enables the network to adaptively emphasize informative spatial geometry cues for improved motion reasoning. Extensive experiments on DDAD and nuScenes demonstrate that GeoSurDepth achieves state-of-the-art performance, validating the effectiveness of our approach. Our framework highlights the importance of exploiting geometry coherence and consistency for robust self-supervised multi-view depth estimation.


[55] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals cs.CV | cs.AI | cs.ROPDF

Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy

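TL;DR: This paper proposes Goal Force, a framework in which goals are defined through force vectors and intermediate dynamics. A video generation model is trained on a dataset of synthetic causal primitives to learn how forces propagate through time and space, so that the model can act as an implicit neural physics simulator and support precise, physics-aware planning.

Details

Motivation: Goal specification is hard for video-generation world models: text instructions are too abstract to capture physical detail, and target images are often infeasible to specify for dynamic tasks.

Result: After training on simple physics data, the model shows remarkable zero-shot generalization to complex real-world scenarios such as tool manipulation and multi-object causal chains, without relying on external engines.

Insight: The novelty lies in defining goals through force vectors and intermediate dynamics, which grounds video generation in fundamental physical interactions. Implicit neural physics simulation then emerges, enabling precise, physics-aware planning.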

Abstract: Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominos, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.


[56] Kidney Cancer Detection Using 3D-Based Latent Diffusion Models cs.CVPDF

Jen Dusseljee, Sarah de Boer, Alessa Hering

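TL;DR: This paper presents a new 3D latent diffusion approach for kidney anomaly detection on contrast-enhanced abdominal CT. The method combines Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), and Vector-Quantized Generative Adversarial Networks (VQ-GANs), operates directly on 3D image volumes, and is trained with weak supervision using only case-level pseudo-labels.

Details

Motivation: The work addresses the limitations of conventional slice-wise methods for anomaly detection in 3D medical images and explores whether generative models can perform 3D anomaly detection under weak supervision (case-level labels only), reducing the need for fine-grained annotation.

Result: The method is benchmarked against state-of-the-art supervised segmentation and detection models. Current results do not yet match the supervised baselines, but they demonstrate the feasibility and promise of 3D latent diffusion for weakly supervised anomaly detection.

Insight: The novelty lies in applying latent diffusion directly to 3D medical image volumes for weakly supervised anomaly detection rather than working slice by slice. This points toward annotation-efficient generative modeling of complex abdominal anatomy; the key open issues are reconstruction fidelity and lesion localization.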

Abstract: In this work, we present a novel latent diffusion-based pipeline for 3D kidney anomaly detection on contrast-enhanced abdominal CT. The method combines Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), and Vector-Quantized Generative Adversarial Networks (VQ-GANs). Unlike prior slice-wise approaches, our method operates directly on an image volume and leverages weak supervision with only case-level pseudo-labels. We benchmark our approach against state-of-the-art supervised segmentation and detection models. This study demonstrates the feasibility and promise of 3D latent diffusion for weakly supervised anomaly detection. While the current results do not yet match supervised baselines, they reveal key directions for improving reconstruction fidelity and lesion localization. Our findings provide an important step toward annotation-efficient, generative modeling of complex abdominal anatomy.


[57] Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens cs.CVPDF

Yohann Perron, Vladyslav Sydorov, Christophe Pottier, Loic Landrieu

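TL;DR: This paper proposes Relay Tokens, a method for adapting vision transformers to ultra-high-resolution semantic segmentation. It processes local high-resolution crops and global low-resolution crops in parallel and uses a small set of learnable relay tokens to aggregate and propagate features between the two branches, preserving local detail while maintaining global context.

Details

Motivation: Existing methods for ultra-high-resolution segmentation either slide a window and discard global context or downsample and lose fine detail. The goal is to preserve local detail and global awareness at the same time.

Result: Extensive experiments on three ultra-high-resolution segmentation benchmarks (Archaeoscape, URUR, Gleason) and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement.

Insight: The novelty lies in an explicit multi-scale reasoning mechanism: relay tokens aggregate and propagate features between the local and global branches. The design plugs directly into standard transformer backbones (such as ViT and Swin) and adds fewer than 2% parameters.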

Abstract: Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/ .


[58] Performance of a Deep Learning-Based Segmentation Model for Pancreatic Tumors on Public Endoscopic Ultrasound Datasets cs.CV | cs.AI | cs.LGPDF

Pankaj Gupta, Priya Mudgil, Niharika Dutta, Kartik Bose, Nitish Kumar

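TL;DR: This study evaluates a Vision Transformer-based deep learning segmentation model for pancreatic tumors on public endoscopic ultrasound datasets. The model performs well in both cross-validation and an independent external test set, but a small share of predictions are erroneous, indicating a need for further refinement and standardization.

Details

Motivation: Pancreatic cancer has poor survival rates. Endoscopic ultrasound is a key diagnostic modality, but its effectiveness is limited by operator subjectivity. The study aims to reduce that subjectivity by segmenting pancreatic tumors automatically with a deep learning model.

Result: In 5-fold cross-validation, the model achieved a mean Dice similarity coefficient (DSC) of 0.651, IoU of 0.579, sensitivity of 69.8%, specificity of 98.8%, and accuracy of 97.5%. On the external validation set, it achieved a DSC of 0.657, IoU of 0.614, sensitivity of 71.8%, and specificity of 97.7%. Performance was consistent, but 9.7% of cases showed erroneous multiple predictions.

Insight: The contribution is applying a Vision Transformer architecture to pancreatic tumor segmentation in endoscopic ultrasound images and validating it on public datasets. Viewed objectively, the model shows promise in segmentation accuracy and generalization, but dataset heterogeneity and limited external validation call for further improvement and prospective studies.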

Abstract: Background: Pancreatic cancer is one of the most aggressive cancers, with poor survival rates. Endoscopic ultrasound (EUS) is a key diagnostic modality, but its effectiveness is constrained by operator subjectivity. This study evaluates a Vision Transformer-based deep learning segmentation model for pancreatic tumors. Methods: A segmentation model using the USFM framework with a Vision Transformer backbone was trained and validated with 17,367 EUS images (from two public datasets) in 5-fold cross-validation. The model was tested on an independent dataset of 350 EUS images from another public dataset, manually segmented by radiologists. Preprocessing included grayscale conversion, cropping, and resizing to 512x512 pixels. Metrics included Dice similarity coefficient (DSC), intersection over union (IoU), sensitivity, specificity, and accuracy. Results: In 5-fold cross-validation, the model achieved a mean DSC of 0.651 +/- 0.738, IoU of 0.579 +/- 0.658, sensitivity of 69.8%, specificity of 98.8%, and accuracy of 97.5%. For the external validation set, the model achieved a DSC of 0.657 (95% CI: 0.634-0.769), IoU of 0.614 (95% CI: 0.590-0.689), sensitivity of 71.8%, and specificity of 97.7%. Results were consistent, but 9.7% of cases exhibited erroneous multiple predictions. Conclusions: The Vision Transformer-based model demonstrated strong performance for pancreatic tumor segmentation in EUS images. However, dataset heterogeneity and limited external validation highlight the need for further refinement, standardization, and prospective studies.
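
For readers less familiar with the reported metrics, the sketch below computes DSC, IoU, sensitivity, specificity, and accuracy from a pair of binary masks using their standard definitions. It is an illustration only: the function names and the flat-list mask format are assumptions, and the study's own evaluation code is not described in the abstract.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return tp, fp, fn, tn

def segmentation_metrics(pred, truth):
    """Pixel-level metrics; assumes both masks contain foreground and background."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Toy 8-pixel example: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

For a single image the two overlap scores are linked by DSC = 2·IoU / (1 + IoU). That identity need not hold between dataset-level means, which are averaged over images.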


[59] Context-Aware Decoding for Faithful Vision-Language Generation cs.CVPDF

Mehrdad Fazli, Bowen Wei, Ziwei Zhu

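TL;DR: This paper studies hallucination in large vision-language models (LVLMs), that is, responses inconsistent with the visual input, in open-ended tasks such as image captioning and visual reasoning, and proposes a training-free mitigation strategy. By analyzing layer-wise generation dynamics in the decoder, the authors find that truthful tokens accumulate probability mass earlier than hallucinatory ones. Building on this, they propose Context Embedding Injection (CEI), which uses the hidden state of the last input token as a grounding signal to suppress hallucinations. On the CHAIR, AMBER, and MMHal-Bench benchmarks, CEI outperforms existing methods across several LVLMs, and its dynamic variant yields the lowest overall hallucination rates.

Details

Motivation: LVLMs hallucinate in open-ended tasks, producing responses inconsistent with the visual input. The work targets this limitation to make generated content more faithful.

Result: On the CHAIR, AMBER, and MMHal-Bench benchmarks (maximum token length 512), CEI outperforms state-of-the-art baselines across three LVLMs, and its dynamic variant yields the lowest overall hallucination rates.

Insight: Using the Logit Lens, the paper reveals a commitment-depth gap in layer-wise generation dynamics: truthful tokens accumulate probability mass earlier than hallucinatory ones. On that basis it proposes the lightweight CEI method, which uses the hidden state of the last input token as a grounding signal to maintain visual fidelity. CEI is a training-free, scalable intervention that pairs a mechanistic insight with a practical fix.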

Abstract: Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
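
The commitment-depth idea can be illustrated with a toy logit-lens computation. The sketch below is an assumption-laden simplification, not the paper's procedure: it projects each layer's hidden state straight through an unembedding matrix (real logit-lens setups usually also apply the model's final normalization), and it defines commitment depth as the first layer at which the eventually chosen token reaches a probability threshold. The names `logit_lens`, `commitment_depth`, and `threshold` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project one hidden state through the unembedding matrix.

    unembed[v] is the output-embedding row for vocabulary item v.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

def commitment_depth(layer_hiddens, unembed, threshold=0.9):
    """Earliest layer whose logit-lens probability for the final-layer
    argmax token reaches `threshold`; None if no layer does."""
    final_probs = logit_lens(layer_hiddens[-1], unembed)
    final_token = max(range(len(final_probs)), key=final_probs.__getitem__)
    for depth, hidden in enumerate(layer_hiddens):
        if logit_lens(hidden, unembed)[final_token] >= threshold:
            return depth
    return None

# Toy model: 2-dimensional hidden states, 2-token vocabulary.
unembed = [[1.0, 0.0], [0.0, 1.0]]
early_commit = [[0.0, 0.0], [3.0, 0.0], [5.0, 0.0]]  # confident from layer 1
late_commit = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0]]   # confident only at layer 2

print(commitment_depth(early_commit, unembed), commitment_depth(late_commit, unembed))
```

Under this toy definition, the gap reported in the abstract would show up as truthful tokens having a smaller commitment depth than hallucinatory ones.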


[60] VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction cs.CV | cs.AIPDF

Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun

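TL;DR: This paper proposes VideoAR, the first large-scale visual autoregressive framework for video generation. It combines multi-scale next-frame prediction with autoregressive modeling to disentangle spatial and temporal dependencies. The method encodes spatio-temporal dynamics with a 3D multi-scale tokenizer and improves long-term consistency through Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask. Experiments show that VideoAR is SOTA among autoregressive models: FVD on UCF-101 falls from 99.5 to 88.6, inference steps drop by more than 10x, and the VBench score of 81.74 is comparable to that of much larger diffusion models.

Details

Motivation: Video generation is currently dominated by diffusion and flow-matching models, which produce high-quality results but are computationally intensive and hard to scale. The paper aims to provide a scalable, efficient, and temporally consistent video generation framework based on autoregression.

Result: On the UCF-101 benchmark, VideoAR improves FVD from 99.5 to 88.6 (lower is better) and reduces inference steps by more than 10x. On VBench it scores 81.74, on par with diffusion models an order of magnitude larger, setting a new SOTA among autoregressive models.

Insight: The contributions are: a VAR framework that combines multi-scale next-frame prediction with autoregressive modeling; a 3D multi-scale tokenizer; Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask for long-term consistency; and a multi-stage pretraining pipeline that progressively aligns spatial and temporal learning. Viewed objectively, the work substantially narrows the gap between autoregressive and diffusion models and offers a more efficient alternative for video generation.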

Abstract: Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.


[61] Deepfake detectors are DUMB: A benchmark to assess adversarial training robustness under transferability constraints cs.CV | cs.CRPDF

Adrian Serrano, Erwan Umlil, Ronan Thomas

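TL;DR: This paper extends the DUMB and DUMBer methodologies to deepfake detection and evaluates how adversarial training affects detector robustness under transferability constraints and cross-dataset configurations. The study covers five SOTA detectors, three attacks, and two datasets, analyzes both attacker and defender perspectives, and maps the results to data-mismatch scenarios. Experiments show that adversarial training improves robustness in-distribution but, depending on the strategy, can reduce it in cross-dataset configurations, underscoring the need for scenario-specific defenses in real-world use.

Details

Motivation: Deepfake detection systems deployed in the real world face adversarial attacks, yet the effectiveness of adversarial training when attackers have limited knowledge and data distributions are mismatched remains underexplored.

Result: Five detectors (RECCE, SRM, XCeption, UCF, SPSL) are evaluated against three attacks (PGD, FGSM, FPBA) on the FaceForensics++ and Celeb-DF-V2 datasets. Adversarial training improves robustness in-distribution but, depending on the strategy, can reduce it in cross-dataset configurations.

Insight: The novelty lies in applying the DUMB/DUMBer framework to deepfake detection and systematically evaluating adversarial-training robustness under transferability constraints. Viewed objectively, the effectiveness of adversarial training depends heavily on how well data distributions match, which highlights the importance of adaptive defense strategies in real-world applications.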

Abstract: Deepfake detection systems deployed in real-world environments are subject to adversaries capable of crafting imperceptible perturbations that degrade model performance. While adversarial training is a widely adopted defense, its effectiveness under realistic conditions, where attackers operate with limited knowledge and mismatched data distributions, remains underexplored. In this work, we extend the DUMB (Dataset soUrces, Model architecture and Balance) and DUMBer methodology to deepfake detection. We evaluate detectors' robustness against adversarial attacks under transferability constraints and cross-dataset configuration to extract real-world insights. Our study spans five state-of-the-art detectors (RECCE, SRM, XCeption, UCF, SPSL), three attacks (PGD, FGSM, FPBA), and two datasets (FaceForensics++ and Celeb-DF-V2). We analyze both attacker and defender perspectives, mapping results to mismatch scenarios. Experiments show that adversarial training strategies reinforce robustness in the in-distribution cases but can also degrade it under cross-dataset configuration depending on the strategy adopted. These findings highlight the need for case-aware defense strategies in real-world applications exposed to adversarial attacks.
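
To make the transferability setting concrete, the sketch below crafts a one-step FGSM perturbation on a surrogate model and then scores the perturbed input with a different target model. Everything here is a toy stand-in: the "detectors" are two-feature logistic regressions with made-up weights, whereas the paper evaluates deep detectors such as RECCE and XCeption. Only the FGSM update rule itself (step by epsilon in the direction of the sign of the loss gradient with respect to the input) is standard.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability that x is 'fake' under a logistic-regression detector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fgsm(weights, bias, x, label, epsilon):
    """One-step FGSM against a logistic-regression detector.

    For binary cross-entropy, d(loss)/d(x_i) = (p - label) * w_i,
    so the attack moves each feature by epsilon in the sign of that gradient.
    """
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical surrogate (attacker's model) and target (deployed detector).
surrogate_w, surrogate_b = [2.0, -1.0], 0.0
target_w, target_b = [1.5, -0.5], 0.1

x_clean = [1.0, 0.5]  # a 'fake' sample, label 1
x_adv = fgsm(surrogate_w, surrogate_b, x_clean, label=1, epsilon=0.5)

clean_score = predict(target_w, target_b, x_clean)
adv_score = predict(target_w, target_b, x_adv)
print(x_adv, clean_score, adv_score)
```

The attacker never queries the target's gradients. Whether the drop in the target's score is large enough to flip its decision is the transferability question the benchmark measures.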


cs.IR [Back]

[62] TagRAG: Tag-guided Hierarchical Knowledge Graph Retrieval-Augmented Generation cs.IR | cs.CLPDF

Wenbiao Tao, Yunshi Lan, Weining Qian

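TL;DR: This paper proposes TagRAG, a tag-guided hierarchical knowledge graph framework for retrieval-augmented generation that addresses the weaknesses of conventional RAG in global knowledge reasoning and scalability. By building a tag knowledge graph and performing tag-guided retrieval, it achieves efficient knowledge representation and retrieval, and it demonstrates superior performance and efficiency on datasets from several domains.

Details

Motivation: Conventional RAG relies on fragment-level retrieval and struggles with query-focused summarization. GraphRAG adds graph-based global reasoning but suffers from inefficient information extraction, high resource consumption, and poor adaptability to incremental updates. TagRAG aims to overcome these limits with efficient global reasoning and scalable graph maintenance.

Result: Extensive experiments on UltraDomain datasets spanning Agriculture, Computer Science, Law, and cross-domain settings show that TagRAG achieves an average win rate of 95.41% against baselines, while maintaining about 14.6x construction efficiency and 1.9x retrieval efficiency compared with GraphRAG.

Insight: The novelty lies in tag-guided construction and retrieval over a hierarchical knowledge graph: (1) object tags and their relationships are extracted and organized into hierarchical domain tag chains, giving a structured knowledge representation; (2) domain-centric tag chains are retrieved to localize and synthesize relevant knowledge, which improves retrieval granularity, adapts to smaller language models, and supports efficient incremental knowledge updates. This suggests a new way to improve the scalability and reasoning ability of RAG systems.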

Abstract: Retrieval-Augmented Generation enhances language models by retrieving external knowledge to support informed and grounded responses. However, traditional RAG methods rely on fragment-level retrieval, limiting their ability to address query-focused summarization queries. GraphRAG introduces a graph-based paradigm for global knowledge reasoning, yet suffers from inefficiencies in information extraction, costly resource consumption, and poor adaptability to incremental updates. To overcome these limitations, we propose TagRAG, a tag-guided hierarchical knowledge graph RAG framework designed for efficient global reasoning and scalable graph maintenance. TagRAG introduces two key components: (1) Tag Knowledge Graph Construction, which extracts object tags and their relationships from documents and organizes them into hierarchical domain tag chains for structured knowledge representation, and (2) Tag-Guided Retrieval-Augmented Generation, which retrieves domain-centric tag chains to localize and synthesize relevant knowledge during inference. This design significantly adapts to smaller language models, improves retrieval granularity, and supports efficient knowledge increment. Extensive experiments on UltraDomain datasets spanning Agriculture, Computer Science, Law, and cross-domain settings demonstrate that TagRAG achieves an average win rate of 95.41% against baselines while maintaining about 14.6x construction and 1.9x retrieval efficiency compared with GraphRAG.


[63] Cross-Document Topic-Aligned Chunking for Retrieval-Augmented Generation cs.IR | cs.AI | cs.CLPDF

Mile Stankovic

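TL;DR: This paper proposes Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in retrieval-augmented generation (RAG) systems. The method reconstructs knowledge at the corpus level: it identifies topics across documents, maps text segments to those topics, and synthesizes them into unified chunks, improving retrieval for complex queries such as multi-hop reasoning.

Details

Motivation: Current RAG chunking methods usually process each document independently. When a complex query needs information from several sources, the result is knowledge fragmentation, which limits system performance. The paper addresses this by reorganizing knowledge across documents.

Result: On the HotpotQA multi-hop reasoning benchmark, the method reaches a faithfulness of 0.93, compared with 0.83 for contextual retrieval and 0.78 for semantic chunking, a 12% improvement over current industry best practice (p < 0.05). On UAE Legal texts, faithfulness is 0.94 and citation accuracy is 0.93. At k = 3, faithfulness stays at 0.91 while semantic methods drop to 0.68.

Insight: The core idea is to lift chunking from the document level to the corpus level: topic alignment enables cross-document synthesis that produces information-dense chunks and reduces query-time retrieval needs. This is a clear advantage for applications with distributed knowledge and high query volume, although indexing costs are higher.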

Abstract: Chunking quality determines RAG system performance. Current methods partition documents individually, but complex queries need information scattered across multiple sources: the knowledge fragmentation problem. We introduce Cross-Document Topic-Aligned (CDTA) chunking, which reconstructs knowledge at the corpus level. It first identifies topics across documents, maps segments to each topic, and synthesizes them into unified chunks. On HotpotQA multi-hop reasoning, our method reached 0.93 faithfulness versus 0.83 for contextual retrieval and 0.78 for semantic chunking, a 12% improvement over current industry best practice (p < 0.05). On UAE Legal texts, it reached 0.94 faithfulness with 0.93 citation accuracy. At k = 3, it maintains 0.91 faithfulness while semantic methods drop to 0.68, with a single CDTA chunk containing information requiring multiple traditional fragments. Indexing costs are higher, but synthesis produces information-dense chunks that reduce query-time retrieval needs. For high-query-volume applications with distributed knowledge, cross-document synthesis improves measurably over within-document optimization.


[64] Studying Illustrations in Manuscripts: An Efficient Deep-Learning Approach cs.IR | cs.CV | cs.LGPDF

Yoav Evron, Michal Bar-Asher Siegal, Michael Fire

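TL;DR: This paper presents an efficient, scalable deep learning pipeline for detecting, extracting, and describing illustrations at scale in digitized manuscripts such as those of the Vatican Library. It has three steps: a fine-tuned image classification model filters out text-only pages, an efficient object detection model identifies and crops illustrations, and a multimodal image captioning model generates concise, human-readable descriptions. The descriptions are stored in a searchable database so that scholars can retrieve relevant visual material through keyword queries.

Details

Motivation: Digital archives offer unprecedented access, but studying manuscript illustrations systematically and at scale remains difficult. The paper uses recent advances in AI to enable large-scale visual research that was previously impractical, serving historical studies, art history, and cultural heritage.

Result: Applied to more than three million digitized manuscript pages, the pipeline automatically identified and extracted more than 200,000 unique illustrations at under 0.06 seconds per page, far exceeding traditional segmentation techniques in efficiency and accessibility.

Insight: The contribution is chaining three mature AI models (image classification, object detection, and multimodal image captioning) into an efficient end-to-end pipeline tuned for manuscript illustration analysis. It reaches an unprecedented processing scale and speed, and it shows how cutting-edge AI tools can reshape scholarly workflows and open new avenues for multidisciplinary research.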

Abstract: The recent Artificial Intelligence (AI) revolution has opened transformative possibilities for the humanities, particularly in unlocking the visual content embedded in historical manuscripts. While digital archives now offer unprecedented access to these materials, the ability to systematically study illustrations at a large scale remains challenging. Our study presents a fast and scalable AI approach for detecting, extracting, and describing illustrations in digitized manuscripts. Focusing on collections like the Vatican Library, our system enables efficient visual analysis across millions of pages. Our pipeline consists of three stages: (1) a fine-tuned image classification model filters out text-only pages; (2) an efficient object detection model identifies and crops illustrations; and (3) a multimodal image captioning model generates concise, human-readable descriptions. These are stored in a searchable database, allowing scholars to retrieve relevant visual materials through keyword queries. By harnessing the power of recent AI advancements, we enable large-scale visual research that was previously impractical, empowering scholars in historical studies, art history, and cultural heritage to explore visual motifs, artistic styles, and cross-cultural influences with new precision and speed. Applying our pipeline to over three million digitized manuscript pages, we automatically identified and extracted more than 200,000 unique illustrations. Processing at this scale, in under 0.06 seconds per page, dramatically outperforms traditional segmentation techniques in both efficiency and accessibility for visual scholarship. Our work demonstrates how cutting-edge AI tools can profoundly reshape scholarly workflows and open new avenues for multidisciplinary research in the age of digital manuscripts.
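
The three-stage structure can be sketched as a small orchestration function. This is an outline of the control flow only, under stated assumptions: the three models are passed in as plain callables, the stubs at the bottom stand in for the fine-tuned classifier, detector, and captioner, and all names (`run_pipeline`, `IllustrationRecord`, `search`) are hypothetical rather than taken from the authors' code.

```python
from dataclasses import dataclass

@dataclass
class IllustrationRecord:
    page_id: str
    box: tuple      # (left, top, right, bottom) of the cropped illustration
    caption: str

def run_pipeline(pages, has_illustration, detect_boxes, caption_crop):
    """Run classify -> detect -> caption over (page_id, image) pairs."""
    records = []
    for page_id, image in pages:
        if not has_illustration(image):        # stage 1: drop text-only pages
            continue
        for box in detect_boxes(image):        # stage 2: locate illustrations
            caption = caption_crop(image, box)  # stage 3: describe each crop
            records.append(IllustrationRecord(page_id, box, caption))
    return records

def search(records, keyword):
    """Case-insensitive keyword lookup over the stored captions."""
    kw = keyword.lower()
    return [r for r in records if kw in r.caption.lower()]

# Stub models so the sketch runs without any ML dependencies.
pages = [("p1", "text"), ("p2", "lion"), ("p3", "ship")]
records = run_pipeline(
    pages,
    has_illustration=lambda img: img != "text",
    detect_boxes=lambda img: [(0, 0, 10, 10)],
    caption_crop=lambda img, box: f"A drawing of a {img}",
)
hits = search(records, "LION")
print([r.page_id for r in hits])
```

Putting the cheap page-level classifier first means the costlier detector and captioner only run on pages that pass the filter, which is one plausible reason a cascade like this can reach a low per-page cost.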


cs.AI [Back]

[65] Naiad: Novel Agentic Intelligent Autonomous System for Inland Water Monitoring cs.AI | cs.CL | cs.CV | cs.IRPDF

Eirini Baltzi, Tilemachos Moumouris, Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos

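TL;DR: NAIAD is an agentic system built on large language models (LLMs) and external analytical tools that aims to provide a comprehensive solution for inland water monitoring. Through a single-prompt interface it turns natural-language queries into actionable insights, integrating tools for weather data, Sentinel-2 imagery, and remote-sensing index computation, and it produces tailored reports using retrieval-augmented generation (RAG), LLM reasoning, and agentic reflection.

Details

Motivation: Existing inland water monitoring methods usually handle sub-problems such as cyanobacteria or chlorophyll in isolation and lack a holistic solution. NAIAD aims to give experts and non-experts a unified, easy-to-use monitoring tool through an agentic AI assistant that draws on Earth Observation (EO) data.

Result: On a dedicated benchmark covering multiple user-expertise levels, the system scores over 77% on correctness and over 85% on relevancy, showing strong adaptability and robustness. An ablation study finds that Gemma 3 (27B) and Qwen 2.5 (14B) offer the best balance between computational efficiency and reasoning performance.

Insight: The contribution is a comprehensive agentic framework that integrates LLMs, RAG, external tool orchestration, and agentic reflection, hiding complex remote-sensing processing and multi-source data integration behind a natural-language interface and lowering the technical barrier to inland water monitoring. Its architecture, in particular the way specialized tools such as the CyFi platform work alongside LLM agents, is a useful reference for building expert-level AI assistants in other domains.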

Abstract: Inland water monitoring is vital for safeguarding public health and ecosystems, enabling timely interventions to mitigate risks. Existing methods often address isolated sub-problems such as cyanobacteria, chlorophyll, or other quality indicators separately. NAIAD introduces an agentic AI assistant that leverages Large Language Models (LLMs) and external analytical tools to deliver a holistic solution for inland water monitoring using Earth Observation (EO) data. Designed for both experts and non-experts, NAIAD provides a single-prompt interface that translates natural-language queries into actionable insights. Through Retrieval-Augmented Generation (RAG), LLM reasoning, external tool orchestration, computational graph execution, and agentic reflection, it retrieves and synthesizes knowledge from curated sources to produce tailored reports. The system integrates diverse tools for weather data, Sentinel-2 imagery, remote-sensing index computation (e.g., NDCI), chlorophyll-a estimation, and established platforms such as CyFi. Performance is evaluated using correctness and relevancy metrics, achieving over 77% and 85% respectively on a dedicated benchmark covering multiple user-expertise levels. Preliminary results show strong adaptability and robustness across query types. An ablation study on LLM backbones further highlights Gemma 3 (27B) and Qwen 2.5 (14B) as offering the best balance between computational efficiency and reasoning performance.
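
As an example of the kind of remote-sensing index the system computes, the sketch below evaluates the Normalized Difference Chlorophyll Index (NDCI) per pixel as (red edge − red) / (red edge + red). Pairing Sentinel-2 band B5 (about 705 nm) with band B4 (about 665 nm) is the common convention for this index. The abstract does not state which bands or preprocessing NAIAD's tool uses, so treat the function names and inputs as assumptions.

```python
def ndci(red_edge, red):
    """NDCI for one pixel from red-edge (~705 nm) and red (~665 nm) reflectance.

    Returns None where both reflectances are zero (index undefined).
    """
    denom = red_edge + red
    if denom == 0:
        return None
    return (red_edge - red) / denom

def ndci_map(red_edge_band, red_band):
    """Apply ndci() element-wise to two equally shaped 2D reflectance grids."""
    return [
        [ndci(re, r) for re, r in zip(row_re, row_r)]
        for row_re, row_r in zip(red_edge_band, red_band)
    ]

# Toy 2x2 scene with made-up reflectance values.
b5 = [[0.3, 0.1], [0.0, 0.2]]
b4 = [[0.1, 0.1], [0.0, 0.3]]
index = ndci_map(b5, b4)
print(index)
```

Values lie in [-1, 1], and higher values are generally read as indicating more chlorophyll-a. Turning the index into a concentration estimate requires a calibrated model, which this sketch does not attempt.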


[66] The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models cs.AI | cs.CLPDF

Tassallah Abdullahi, Shrestha Ghosh, Hamish S Fraser, Daniel León Tramontini, Adeel Abbasi

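TL;DR: This paper systematically evaluates how medical personas (for example, Emergency Department physician or nurse) and interaction styles (bold vs. cautious) affect clinical large language models in clinical decision-making. It finds that personas, acting as behavioral priors, have context-dependent and non-monotonic effects: they improve performance on critical care tasks but degrade it in primary-care settings, and the effect of interaction style is highly model-dependent.

Details

Motivation: Persona conditioning is often treated as a behavioral prior for large language models and assumed to improve expertise and safety monotonically, but its effect on high-stakes clinical decisions is poorly understood. The paper systematically evaluates how persona-based control shapes the behavior of clinical LLMs.

Result: In multidimensional evaluations on clinical triage and patient-safety tasks, medical personas improve accuracy and calibration by up to about 20% on critical care tasks but degrade performance by comparable margins in primary-care settings. Interaction style modulates risk propensity and sensitivity, but the effect is highly model-dependent. Human clinicians show moderate agreement on safety compliance (average Cohen's κ = 0.43) but indicate low confidence in 95.9% of their responses on reasoning quality.

Insight: The contribution is showing that medical personas, as behavioral priors, introduce context-dependent trade-offs rather than guarantees of safety or expertise, which challenges the assumption that persona conditioning improves performance monotonically. Viewed objectively, the study provides empirical grounding for persona design in clinical LLMs and stresses the importance of task and model dependence.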

Abstract: Persona conditioning can be viewed as a behavioral prior for large language models (LLMs) and is often assumed to confer expertise and improve safety in a monotonic manner. However, its effects on high-stakes clinical decision-making remain poorly characterized. We systematically evaluate persona-based control in clinical LLMs, examining how professional roles (e.g., Emergency Department physician, nurse) and interaction styles (bold vs. cautious) influence behavior across models and medical tasks. We assess performance on clinical triage and patient-safety tasks using multidimensional evaluations that capture task accuracy, calibration, and safety-relevant risk behavior. We find systematic, context-dependent, and non-monotonic effects: Medical personas improve performance in critical care tasks, yielding gains of up to about +20% in accuracy and calibration, but degrade performance in primary-care settings by comparable margins. Interaction style modulates risk propensity and sensitivity, but its effect is highly model-dependent. While aggregated LLM-judge rankings favor medical over non-medical personas in safety-critical cases, we find that human clinicians show moderate agreement on safety compliance (average Cohen's κ = 0.43) but indicate a low confidence in 95.9% of their responses on reasoning quality. Our work shows that personas function as behavioral priors that introduce context-dependent trade-offs rather than guarantees of safety or expertise. The code is available at https://github.com/rsinghlab/Persona_Paradox.
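
The agreement statistic quoted in the abstract, Cohen's κ, corrects raw agreement between two raters for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for one pair of raters on made-up labels. The data are illustrative only, and how the paper averages κ across clinician pairs is not specified in the abstract.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Assumes the raters do not both use one identical label for every item
    (that would make the chance agreement 1 and the ratio undefined).
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Ten hypothetical responses judged 'safe' (S) or 'unsafe' (U) by two clinicians.
clinician_1 = ["S", "S", "S", "S", "S", "S", "U", "U", "U", "U"]
clinician_2 = ["S", "S", "S", "S", "S", "U", "S", "U", "U", "U"]

kappa = cohens_kappa(clinician_1, clinician_2)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.6·0.6 + 0.4·0.4 = 0.52, so κ = 0.28 / 0.48. A value of 0.43, as reported, falls in the range commonly described as moderate agreement.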


[67] PII-VisBench: Evaluating Personally Identifiable Information Safety in Vision Language Models Along a Continuum of Visibility cs.AI | cs.CL | cs.CR | cs.CVPDF

G M Shahariar, Zabir Al Nazi, Md Olid Hasan Bhuiyan, Zhouxing Shi

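TL;DR: This paper introduces PII-VisBench, a benchmark for evaluating how safely vision language models handle personally identifiable information (PII). It stratifies 200 subjects by the visibility of their online data (high, medium, low, zero) and contains 4000 unique probes. The study finds that as subject visibility falls, refusal rates rise and PII disclosure rates fall, that high-visibility subjects are more likely to have PII disclosed, and that there is heterogeneity across model families and disparities across PII types.

Details

Motivation: Existing evaluations of PII leakage in VLMs treat privacy as a static extraction task and ignore how a subject's online presence (the volume of their data available online) influences privacy alignment. A new benchmark is needed that evaluates VLM safety along the continuum of online presence.

Result: Eighteen open-source VLMs (0.3B-32B) are evaluated with two metrics, Refusal Rate and Conditional PII Disclosure Rate. The PII disclosure rate falls from 9.10% for high-visibility subjects to 5.34% for low-visibility subjects, and paraphrasing and jailbreak-style prompts expose attack- and model-dependent failures.

Insight: The novelty lies in a PII safety evaluation framework built on the continuum of online presence, which reveals that visibility significantly affects PII leakage in VLMs. Viewed objectively, the study underlines the importance of accounting for subject visibility in VLM safety evaluation and training, and it provides data to support privacy protection tailored to different visibility levels.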

Abstract: Vision Language Models (VLMs) are increasingly integrated into privacy-critical domains, yet existing evaluations of personally identifiable information (PII) leakage largely treat privacy as a static extraction task and ignore how a subject's online presence (the volume of their data available online) influences privacy alignment. We introduce PII-VisBench, a novel benchmark containing 4000 unique probes designed to evaluate VLM safety through the continuum of online presence. The benchmark stratifies 200 subjects into four visibility categories (high, medium, low, and zero) based on the extent and nature of their information available online. We evaluate 18 open-source VLMs (0.3B-32B) based on two key metrics: percentage of PII probing queries refused (Refusal Rate) and the fraction of non-refusal responses flagged for containing PII (Conditional PII Disclosure Rate). Across models, we observe a consistent pattern: refusals increase and PII disclosures decrease (9.10% high to 5.34% low) as subject visibility drops. We identify that models are more likely to disclose PII for high-visibility subjects, alongside substantial model-family heterogeneity and PII-type disparities. Finally, paraphrasing and jailbreak-style prompts expose attack- and model-dependent failures, motivating visibility-aware safety evaluation and training interventions.
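
The two metrics are simple ratios, and the sketch below follows the definitions given in the abstract: Refusal Rate is the share of probes refused, and Conditional PII Disclosure Rate is the share of non-refused responses flagged as containing PII. The record format (`refused`, `contains_pii`) and the toy data are assumptions for illustration. How the benchmark decides that a response is a refusal or contains PII is not covered here.

```python
def refusal_rate(responses):
    """Fraction of probing queries the model refused."""
    return sum(r["refused"] for r in responses) / len(responses)

def conditional_disclosure_rate(responses):
    """Fraction of non-refused responses flagged as containing PII.

    Returns 0.0 when every probe was refused (nothing could be disclosed).
    """
    answered = [r for r in responses if not r["refused"]]
    if not answered:
        return 0.0
    return sum(r["contains_pii"] for r in answered) / len(answered)

# Ten hypothetical probes for one visibility tier: 6 refusals, 4 answers, 1 leak.
responses = (
    [{"refused": True, "contains_pii": False}] * 6
    + [{"refused": False, "contains_pii": False}] * 3
    + [{"refused": False, "contains_pii": True}]
)

print(refusal_rate(responses), conditional_disclosure_rate(responses))
```

Conditioning on non-refusals keeps the two metrics separate: a model that refuses almost everything can still have a high disclosure rate on the few probes it answers.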


[68] Conformity and Social Impact on AI Agents cs.AI | cs.CL | cs.CY

Alessandro Bellina, Giordano De Marzo, David Garcia

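TL;DR: This study examines the conformity that large multimodal language models, acting as AI agents, exhibit under group pressure. By adapting classic visual experiments from social psychology, it finds that AI agents show a systematic conformity bias that is consistent with Social Impact Theory and is sensitive to group size, unanimity, task difficulty, and source characteristics. Even agents with near-perfect performance in isolation are easily manipulated through social influence. This vulnerability persists across model scales and points to security weaknesses in multi-agent systems that could be exploited maliciously.

Details

Motivation: As AI agents become more common in multi-agent environments, understanding their collective behavior is essential for predicting the dynamics of artificial societies. The study investigates how strongly AI agents conform under group pressure in order to expose security vulnerabilities in their decision-making.

Result: The experiments show that AI agents exhibit a systematic conformity bias consistent with Social Impact Theory. On simple tasks, larger models conform less because of their improved capabilities, but they remain vulnerable at their competence boundary. The study reports no specific benchmarks or SOTA comparisons and instead offers a qualitative analysis based on psychological experimental paradigms.

Insight: The novelty lies in applying classic experimental paradigms from social psychology to assess conformity in AI agents, revealing an underappreciated vulnerability to social influence in AI decision-making. Viewed objectively, the study offers important behavioral insight for the safe deployment of multi-agent systems and underlines the urgency of safeguards in collective AI deployments.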

Abstract: As AI agents increasingly operate in multi-agent environments, understanding their collective behavior becomes critical for predicting the dynamics of artificial societies. This study examines conformity, the tendency to align with group opinions under social pressure, in large multimodal language models functioning as AI agents. By adapting classic visual experiments from social psychology, we investigate how AI agents respond to group influence as social actors. Our experiments reveal that AI agents exhibit a systematic conformity bias, aligned with Social Impact Theory, showing sensitivity to group size, unanimity, task difficulty, and source characteristics. Critically, AI agents achieving near-perfect performance in isolation become highly susceptible to manipulation through social influence. This vulnerability persists across model scales: while larger models show reduced conformity on simple tasks due to improved capabilities, they remain vulnerable when operating at their competence boundary. These findings reveal fundamental security vulnerabilities in AI agent decision-making that could enable malicious manipulation, misinformation campaigns, and bias propagation in multi-agent systems, highlighting the urgent need for safeguards in collective AI deployments.


[69] WildSci: Advancing Scientific Reasoning from In-the-Wild Literature cs.AI | cs.CL

Tengxiao Liu, Deepak Nathani, Zekun Li, Kevin Yang, William Yang Wang

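TL;DR: This paper introduces WildSci, a dataset of domain-specific science questions automatically synthesized from peer-reviewed literature. It covers 9 scientific disciplines and 26 subdomains and uses a multiple-choice format to support training large language models on scientific reasoning tasks.

Details

Motivation: LLM reasoning has advanced quickly in domains such as mathematics and coding, but progress is slow in scientific domains such as medicine and materials science because of limited dataset coverage and the complexity of open-ended questions. The paper therefore proposes building a high-quality scientific reasoning dataset.

Result: Experiments on several scientific benchmarks demonstrate the effectiveness of the dataset and approach. The abstract gives no specific quantitative results, but it indicates improved model performance on scientific reasoning tasks.

Insight: The novelty lies in automatically synthesizing a large-scale, multi-domain dataset of science questions from real literature and combining the multiple-choice format with reinforcement learning fine-tuning, which yields a scalable training framework with well-defined reward signals for scientific reasoning.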

Abstract: Recent progress in large language model (LLM) reasoning has focused on domains like mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in LLM reasoning models remains limited in scientific domains such as medicine and materials science due to limited dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further apply reinforcement learning to finetune models on these data and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci to enable scalable and sustainable research in scientific reasoning, available at https://huggingface.co/datasets/JustinTX/WildSci.
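To make the "well-defined reward signals" of a multiple-choice format concrete, here is a minimal sketch of a rule-based reward. It assumes the model ends its response with a line such as `Answer: C`. That output convention, the regular expression, and the function names are illustrative assumptions and are not taken from the WildSci release.

```python
import re

# Assumed output convention: the response contains "Answer: X" or "Answer: (X)".
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-J])\b")


def extract_choice(response):
    """Return the last option letter stated in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def multiple_choice_reward(response, gold):
    """Binary reward: 1.0 if the extracted option equals the gold option."""
    return 1.0 if extract_choice(response) == gold.upper() else 0.0


print(multiple_choice_reward("The band gap narrows, so... Answer: C", "C"))
print(multiple_choice_reward("I am not sure.", "C"))
```

Because the reward depends only on an exact option match, it needs no learned judge, which is what makes this kind of signal easy to scale during reinforcement learning.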


[70] Logic-Parametric Neuro-Symbolic NLI: Controlling Logical Formalisms for Verifiable LLM Reasoning cs.AI | cs.CL | cs.LO

Ali Farjami, Luca Redondi, Marco Valentino

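TL;DR: This paper proposes a logic-parametric framework for neuro-symbolic natural language inference that treats the underlying logic as a controllable component rather than a fixed formalism. Using the LogiKEy methodology, it embeds classical and non-classical logics into higher-order logic and systematically compares them on inference quality, explanation refinement, and proof behavior, with a focus on normative reasoning. Experiments show that logic-internal strategies improve performance and produce more efficient hybrid proofs, and that the effectiveness of a logic is domain-dependent.

Details

Motivation: Existing approaches to verifiable natural language inference that combine large language models with theorem provers rely on a fixed logical formalism, which limits robustness and adaptability.

Result: On normative reasoning tasks, logic-internal strategies consistently improve performance over logic-external strategies and produce more efficient hybrid proofs. First-order logic performs better on commonsense reasoning, while deontic and modal logics perform better in ethical domains.

Insight: Logic is treated as a first-class, parametric component of the neuro-symbolic architecture, and systematic comparison of logical formalisms enables more robust, modular, and adaptable reasoning. The paper also contrasts logic-internal and logic-external methodologies for encoding norms.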

Abstract: Large language models (LLMs) and theorem provers (TPs) can be effectively combined for verifiable natural language inference (NLI). However, existing approaches rely on a fixed logical formalism, a feature that limits robustness and adaptability. We propose a logic-parametric framework for neuro-symbolic NLI that treats the underlying logic not as a static background, but as a controllable component. Using the LogiKEy methodology, we embed a range of classical and non-classical formalisms into higher-order logic (HOL), enabling a systematic comparison of inference quality, explanation refinement, and proof behavior. We focus on normative reasoning, where the choice of logic has significant implications. In particular, we compare logic-external approaches, where normative requirements are encoded via axioms, with logic-internal approaches, where normative patterns emerge from the logic’s built-in structure. Extensive experiments demonstrate that logic-internal strategies can consistently improve performance and produce more efficient hybrid proofs for NLI. In addition, we show that the effectiveness of a logic is domain-dependent, with first-order logic favouring commonsense reasoning, while deontic and modal logics excel in ethical domains. Our results highlight the value of making logic a first-class, parametric element in neuro-symbolic architectures for more robust, modular, and adaptable reasoning.


cs.LG

[71] TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning cs.LG | cs.CL

Susmit Das

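TL;DR: This paper proposes TIME (Temporally Intelligent Meta-reasoning Engine), a behavioral alignment framework that addresses shortcomings of current large language models in reasoning and temporal understanding. By introducing time tags, silent-gap markers, and short reasoning blocks that can be embedded inline, it trains models to perform efficient, context-sensitive, and auditable explicit reasoning driven by dialogue context and temporal cues.

Details

Motivation: The work targets two main problems in reasoning-oriented large language models. First, their explicit "thinking" is usually long, global, and uninterruptible, which makes it computationally costly and hard to audit. Second, dialogue models generally lack awareness of temporal structure and cannot handle time gaps and temporal logic in a conversation effectively.

Result: Evaluated on TIMEBench, a temporally grounded dialogue benchmark, the TIME framework outperforms the base Qwen3 models at parameter scales from 4B to 32B, with explicit reasoning mode either on or off, while reducing the number of tokens needed for reasoning by roughly an order of magnitude.

Insight: The main novelty is to treat explicit reasoning as a context-sensitive resource driven by discourse and temporal cues, and to realize efficient, flexible reasoning through time tags (ISO 8601 format), silent-gap markers, and short reasoning blocks that can be embedded inline. This offers a new behavioral alignment paradigm for building dialogue models that are more time-aware and more auditable.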

Abstract: Reasoning oriented large language models often expose explicit “thinking” as long, turn-global traces at the start of every response, either always on or toggled externally at inference time. While useful for arithmetic, programming, and problem solving, this design is costly, blurs claim level auditability, and cannot re-trigger explicit reasoning once the model begins presenting. Dialogue models are also largely blind to temporal structure, treating replies after seconds and replies after weeks as equivalent unless time is stated in text. We introduce TIME, the Temporally Intelligent Meta-reasoning Engine, a behavioral alignment framework that treats explicit reasoning as a context sensitive resource driven by discourse and temporal cues. TIME augments dialogue with optional ISO 8601


[72] RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models cs.LG | cs.CL

Marko Sterbentz, Kevin Cushing, Cameron Barrie, Kristian J. Hammond

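TL;DR: This paper proposes RingSQL, a hybrid data generation framework for text-to-SQL reasoning models. It combines schema-independent query templates with LLM-based paraphrasing of natural language questions to generate high-quality, syntactically correct synthetic training data.

Details

Motivation: Text-to-SQL systems are limited by the scarcity of high-quality training data. Existing synthetic methods trade off reliability against scalability: template-based methods need schema-specific templates, while LLM-based generation lacks quality guarantees.

Result: Across six text-to-SQL benchmarks, models trained on data generated by RingSQL achieve an average accuracy gain of +2.3% over models trained on other synthetic data.

Insight: The core novelty is a hybrid generation paradigm that combines schema-independent query templates with LLM paraphrasing. It preserves SQL syntactic correctness across different database schemas while providing rich linguistic variety, balancing reliability and scalability.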

Abstract: Recent advances in text-to-SQL systems have been driven by larger models and improved datasets, yet progress is still limited by the scarcity of high-quality training data. Manual data creation is expensive, and existing synthetic methods trade off reliability and scalability. Template-based approaches ensure correct SQL but require schema-specific templates, while LLM-based generation scales easily but lacks quality and correctness guarantees. We introduce RingSQL, a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions. This approach preserves SQL correctness across diverse schemas while providing broad linguistic variety. In our experiments, we find that models trained using data produced by RingSQL achieve an average gain in accuracy of +2.3% across six text-to-SQL benchmarks when compared to models trained on other synthetic data. We make our code available at https://github.com/nu-c3lab/RingSQL.
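The idea of a schema-independent template can be sketched as follows. This is a toy illustration only: the slot names, the single template, and the SQLite-based executability check are assumptions, the LLM paraphrasing step is omitted, and nothing here reproduces the actual RingSQL code.

```python
import sqlite3

# One hypothetical template with slots instead of concrete schema names.
TEMPLATE_SQL = "SELECT {group_col}, COUNT(*) FROM {table} GROUP BY {group_col}"
TEMPLATE_QUESTION = "How many {table} rows are there for each {group_col}?"


def instantiate(schema):
    """Yield (question, sql) pairs for every table/column slot filling."""
    for table, columns in schema.items():
        for col in columns:
            yield (
                TEMPLATE_QUESTION.format(table=table, group_col=col),
                TEMPLATE_SQL.format(table=table, group_col=col),
            )


def is_executable(sql, schema):
    """Check that the query runs against an empty in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        for table, columns in schema.items():
            conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


schema = {"movies": ["genre", "year"], "patients": ["ward"]}
pairs = list(instantiate(schema))
for question, sql in pairs:
    print(question, "->", sql)
```

In a full pipeline the templated question would then be paraphrased by an LLM while the SQL stays fixed, so linguistic variety does not put query correctness at risk.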


[73] MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization cs.LG | cs.CL

Jiefu Ou, Sapana Chaudhary, Kaj Bostrom, Nathaniel Weir, Shuai Zhang

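TL;DR: MaxCode is a framework for automated code optimization based on max-reward reinforcement learning. It uses inference-time search algorithms to guide a large language model (LLM) through iterative refinement based on execution feedback, integrates a natural language critique model to enrich the observation space, and uses a generative reward model to improve search exploration, improving performance on CUDA and C++ code optimization benchmarks.

Details

Motivation: Large language models perform well on general coding tasks but face two major challenges when optimizing code. First, writing optimized code (such as high-performance CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms, and specific languages. Second, it requires interpreting performance metrics such as timing and device utilization, not just binary correctness.

Result: On the KernelBench (CUDA) and PIE (C++) optimization benchmarks, MaxCode improves optimized code performance over baselines, achieving relative improvements of 20.3% in absolute speedup value and 10.1% in relative speedup ranking.

Insight: The contributions are threefold: unifying existing search methods under a max-reward reinforcement learning framework, which makes the observation and action-value functions modular; integrating a natural language critique model that turns raw execution feedback into diagnostic insights about errors and performance bottlenecks; and training a generative reward model to rerank candidate solutions and improve search exploration. Together these strengthen the LLM's use of feedback and its exploration during code optimization.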

Abstract: Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms, and specific languages, and (ii) it requires interpreting performance metrics like timing and device utilization beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode, unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, and we add the best discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.


[74] Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection cs.LG | cs.CL

Feihu Jin, Ying Tan

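TL;DR: This paper proposes Hi-ZFO, a hierarchical hybrid zeroth- and first-order optimization framework for fine-tuning large language models (LLMs). Using importance-guided tensor selection, it adaptively partitions the model into critical and non-critical layers, applying precise first-order gradient updates to critical layers and zeroth-order optimization to less sensitive ones. The aim is to combine the precision of first-order methods with the exploratory capability of zeroth-order methods, addressing the tendency of standard first-order optimization to settle in sharp, poorly generalizing minima and the slow convergence and high variance of zeroth-order methods on generative tasks.

Details

Motivation: When fine-tuning LLMs, standard first-order optimization tends to converge to sharp local minima that generalize poorly. Zeroth-order methods explore more strongly and do not rely on explicit gradients, but they converge slowly, and in generative tasks the vast output and search space significantly amplifies estimation variance, making them noisy and inefficient.

Result: Validation across a range of generative, mathematical, and code reasoning tasks shows that Hi-ZFO consistently achieves better performance while significantly reducing training time.

Insight: The novelty lies in a hierarchical hybrid optimization strategy that assigns optimization methods adaptively through layer-wise importance analysis, and that introduces zeroth-order optimization deliberately as a source of "beneficial stochasticity" to help the model escape local minima where pure first-order optimization tends to stagnate, rather than merely as a memory-saving substitute.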

Abstract: Fine-tuning large language models (LLMs) using standard first-order (FO) optimization often drives training toward sharp, poorly generalizing minima. Conversely, zeroth-order (ZO) methods offer stronger exploratory behavior without relying on explicit gradients, yet suffer from slow convergence. More critically, our analysis reveals that in generative tasks, the vast output and search space significantly amplify estimation variance, rendering ZO methods both noisy and inefficient. To address these challenges, we propose Hi-ZFO (Hierarchical Zeroth- and First-Order optimization), a hybrid framework designed to synergize the precision of FO gradients with the exploratory capability of ZO estimation. Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while leveraging ZO optimization for less sensitive ones. Notably, ZO in Hi-ZFO is not merely a memory-saving surrogate; it is intentionally introduced as a source of “beneficial stochasticity” to help the model escape the local minima where pure FO optimization tends to stagnate. Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing the training time. These results demonstrate the effectiveness of hierarchical hybrid optimization for LLM fine-tuning.
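The split between FO and ZO updates can be sketched on a toy quadratic loss. This is not the Hi-ZFO algorithm: the partition into a "critical" and an "other" block is fixed by hand instead of coming from importance profiling, the ZO estimator is a standard two-point random-direction estimate chosen for illustration, and all names, learning rates, and targets are assumptions.

```python
import random


def loss(critical, other, target_c, target_o):
    """Toy quadratic loss over two parameter blocks."""
    return 0.5 * (
        sum((w - t) ** 2 for w, t in zip(critical, target_c))
        + sum((w - t) ** 2 for w, t in zip(other, target_o))
    )


def hybrid_step(critical, other, target_c, target_o, rng,
                lr_fo=0.1, lr_zo=0.05, eps=1e-3):
    # First-order update: exact gradient of the toy loss for the critical block.
    new_critical = [w - lr_fo * (w - t) for w, t in zip(critical, target_c)]

    # Zeroth-order update: two-point loss difference along a random direction z,
    # using only loss evaluations (no gradient) for the other block.
    z = [rng.gauss(0.0, 1.0) for _ in other]
    plus = [w + eps * zi for w, zi in zip(other, z)]
    minus = [w - eps * zi for w, zi in zip(other, z)]
    slope = (loss(critical, plus, target_c, target_o)
             - loss(critical, minus, target_c, target_o)) / (2 * eps)
    new_other = [w - lr_zo * slope * zi for w, zi in zip(other, z)]
    return new_critical, new_other


def train(steps=300, seed=0):
    """Run the hybrid update and return (initial loss, final loss)."""
    rng = random.Random(seed)
    target_c, target_o = [1.0, -2.0], [0.5, 0.0, -1.0, 2.0]
    critical, other = [0.0] * 2, [0.0] * 4
    start = loss(critical, other, target_c, target_o)
    for _ in range(steps):
        critical, other = hybrid_step(critical, other, target_c, target_o, rng)
    return start, loss(critical, other, target_c, target_o)


start, end = train()
print(start, end)
```

The ZO block needs only two extra loss evaluations per step and no backward pass, at the cost of a noisier update direction. That noise is the stochasticity the abstract describes as intentionally useful.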


q-bio.GN

[75] Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training q-bio.GN | cs.AI | cs.CL | cs.LG

Haoran Wang, Xuanyi Zhang, Shuangsang Fang, Longke Ran, Ziqing Deng

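TL;DR: This paper proposes OKR-CELL, an open-world knowledge-aided robust single-cell foundation model. It addresses two shortcomings of existing single-cell pre-trained models: insufficient integration of in-depth individual profiles and weak resistance to noise in multi-modal data. Built on a cross-modal cell-language pre-training framework, the model enriches cell textual descriptions using LLMs with retrieval-augmented generation, and it introduces a cross-modal robust alignment objective that combines sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to improve robustness to noise.

Details

Motivation: Existing single-cell foundation models based on pre-trained language models have two main limitations: they do not sufficiently integrate in-depth individual profiles, and they neglect the influence of noise in multi-modal data. The paper aims to address both problems together.

Result: After pre-training on 32 million cell-text pairs, OKR-CELL achieves leading results on 6 evaluation tasks. These include standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, and the model also shows superior performance in broader multi-modal applications such as zero-shot cell-type annotation and bidirectional cell-text retrieval.

Insight: The main contributions are: (1) an LLM-based workflow with retrieval-augmented generation that brings in open-world knowledge to enrich cell textual descriptions; and (2) a novel cross-modal robust alignment objective that integrates sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen resistance to noisy data. This offers a new direction for building more robust, knowledge-rich single-cell multi-modal foundation models.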

Abstract: Recent advancements in single-cell multi-omics, particularly RNA-seq, have provided profound insights into cellular heterogeneity and gene regulation. While single-cell foundation models based on the pre-trained language model (PLM) paradigm have shown promise, they remain constrained by insufficient integration of in-depth individual profiles and by neglect of the influence of noise within multi-modal data. To address both issues, we propose an Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL). It is built on a cross-modal Cell-Language pre-training framework, which comprises two key innovations: (1) leveraging a Large Language Model (LLM)-based workflow with retrieval-augmented generation (RAG) to enrich cell textual descriptions using open-world knowledge; (2) devising a Cross-modal Robust Alignment (CRA) objective that incorporates sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen the model's resistance to noisy data. After pretraining on 32M cell-text pairs, OKR-CELL obtains cutting-edge results across 6 evaluation tasks. Beyond standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, the model also demonstrates superior performance in broader multi-modal applications, including zero-shot cell-type annotation and bidirectional cell-text retrieval.