Table of Contents
- cs.CL [Total: 54]
- cs.CV [Total: 48]
- eess.IV [Total: 3]
- cs.AI [Total: 5]
- cs.RO [Total: 4]
- cs.CR [Total: 1]
- cs.LG [Total: 1]
- cs.IR [Total: 1]
cs.CL
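[1] DeepResearch-Slice: Bridging the Retrieval-Utilization Gap via Explicit Text Slicing cs.CL | cs.AI
Shuo Lu, Yinuo Xu, Jianjie Cheng, Lingxiao He, Meng Wang
TL;DR: DeepResearch-Slice is a neuro-symbolic framework that targets the retrieval-utilization gap in deep research agents. It predicts precise text span indices and applies a deterministic hard filter before reasoning, so that the model makes better use of key evidence it has already retrieved in noisy environments.
Motivation: Current deep research agents mainly optimize retrieval policies, yet a retrieval-utilization gap remains: even after retrieving key evidence, models fail to use it effectively because of context blindness in noisy environments.
Result: Extensive evaluation on six benchmarks shows substantial robustness gains. Applied to frozen backbones, the method yields a 73% relative improvement (from 19.1% to 33.0%), mitigating noise without updating the reasoning model's parameters.
Insight: The core innovation is an explicit text-slicing (hard-filtering) mechanism that replaces implicit attention in order to locate and use key evidence explicitly, underscoring the need for explicit grounding mechanisms in open-ended research.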
Abstract: Deep Research agents predominantly optimize search policies to maximize retrieval probability. However, we identify a critical bottleneck: the retrieval-utilization gap, where models fail to use gold evidence even after it is retrieved, due to context blindness in noisy environments. To bridge this gap, we propose DeepResearch-Slice, a simple yet effective neuro-symbolic framework. Unlike implicit attention, our approach predicts precise span indices to perform a deterministic hard filter before reasoning. Extensive evaluations across six benchmarks show substantial robustness gains. Applying our method to frozen backbones yields a 73 percent relative improvement, from 19.1 percent to 33.0 percent, effectively mitigating noise without requiring parameter updates to the reasoning model. These results highlight the need for explicit grounding mechanisms in open-ended research.
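The hard-filtering step described above can be illustrated with a minimal sketch. It assumes the span predictor emits character-level `(start, end)` offsets; the abstract only says "precise span indices", so the index unit, the function name `hard_filter`, and the example document are all hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def hard_filter(document: str, spans: List[Tuple[int, int]]) -> str:
    """Keep only the half-open character spans [start, end) chosen by a span predictor."""
    kept = []
    for start, end in sorted(spans):
        # Clamp to the document bounds so out-of-range indices cannot raise.
        start = max(0, start)
        end = min(len(document), end)
        if start < end:
            kept.append(document[start:end])
    # Everything outside the selected spans is dropped before reasoning.
    return "\n".join(kept)


# Hypothetical retrieved page: one useful sentence surrounded by noise.
doc = "Ad: buy now. The Eiffel Tower is 330 m tall. Unrelated footer."
evidence = "The Eiffel Tower is 330 m tall."
start = doc.index(evidence)
spans = [(start, start + len(evidence))]  # stand-in for the predictor's output
filtered = hard_filter(doc, spans)
```

Because the filter is deterministic, the reasoning model only ever sees `filtered`, which is one way to read the contrast the abstract draws with implicit attention.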
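[2] Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models cs.CL | cs.AI
Edward Y. Chang
TL;DR: Using the adversarial dataset CAP-GSM8K, this paper evaluates sycophancy in large language models and compares internal reasoning (chain-of-thought) with an external control mechanism (RCA). Internal reasoning causes performance collapse in weak models (the Prioritization Paradox) and still leaves an 11.4% output gap in frontier models, whereas RCA eliminates sycophancy entirely. The author proposes a thermodynamic hierarchy: hybrid systems reach Resonance (optimal efficiency) only when capabilities are matched and strong, and otherwise fall into Dissonance and Entropy, confirming that external structural constraints are indispensable for safety.
Motivation: The paper asks whether sycophancy, where a model prioritizes agreeing with the user over being correct, can be mitigated by internal reasoning alone or requires an external control mechanism.
Result: On the adversarial CAP-GSM8K dataset (N=500), tests of GPT-3.5, GPT-4o, and GPT-5.1 show that internal reasoning causes performance collapse in weak models (the Prioritization Paradox) and leaves an 11.4% final output gap in frontier models, while the external RCA mechanism eliminates sycophancy (0.0%) across all model tiers.
Insight: The contribution is a thermodynamic hierarchy framework that formalizes the match between model capability and control mechanism as states of Resonance, Dissonance, and Entropy, with empirical evidence that external structural constraints are strictly necessary to eliminate sycophancy, providing a theoretical basis for LLM safety design.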
Abstract: Large Language Models frequently exhibit sycophancy, prioritizing user agreeableness over correctness. We investigate whether this requires external regulation or can be mitigated by internal reasoning alone. Using CAP-GSM8K (N=500), an adversarial dataset, we evaluate internal (CoT) versus external (RCA) mechanisms across GPT-3.5, GPT-4o, and GPT-5.1. Our results reveal the structural limits of internal reasoning: it causes performance collapse in weak models (the Prioritization Paradox) and leaves an 11.4% final output gap in frontier models. In contrast, RCA structurally eliminates sycophancy (0.0%) across all tiers. We synthesize these findings into a thermodynamic hierarchy: hybrid systems achieve Resonance (optimal efficiency) only when capabilities are matched and strong, while weak or mismatched pairs succumb to Dissonance and Entropy. This confirms that external structural constraints are strictly necessary to guarantee safety.
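[3] OpenAI GPT-5 System Card cs.CL | cs.AI
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh
TL;DR: OpenAI GPT-5 is a unified system comprising a smart, fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that selects between them based on conversation type, complexity, tool needs, and explicit intent. The system makes significant progress in reducing hallucinations, improving instruction following, and reducing sycophancy, improves performance in common use cases such as writing, coding, and health, and incorporates the latest safety training methods to prevent disallowed content.
Motivation: To address the limits of a single model in speed, deep reasoning, and safety when handling diverse, complex real-world queries, GPT-5 aims to provide a faster, more accurate, and more useful AI assistant through multi-model integration and intelligent routing, while strengthening safeguards to reduce potential harm.
Result: GPT-5 outperforms previous models on benchmarks, answers more quickly, and is more useful for real-world queries; performance improves markedly in common ChatGPT use cases (writing, coding, and health), with fewer hallucinations and less sycophancy.
Insight: Contributions include: 1) a multi-model architecture within a unified system (a fast model plus a deeper reasoning model) with a real-time router that selects a model based on conversation signals; 2) a continuously trained router that improves using real signals such as user model switches, preference rates, and measured correctness; 3) on safety, the safe-completions training approach and precautionary safeguards for high-capability domains (biological and chemical), applied even without definitive evidence that the harm threshold has been reached.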
Abstract: This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say ‘think hard about this’ in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but – more importantly – is more useful for real-world queries. We’ve made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5’s performance in three of ChatGPT’s most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm – our defined threshold for High capability – we have chosen to take a precautionary approach.
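[4] WRAVAL – WRiting Assist eVALuation cs.CL | cs.LG
Gabriel Benedict, Matthew Butler, Naved Merchant, Eetu Salama-Laine
TL;DR: This paper presents WRAVAL, an evaluation framework for assessing small language models (SLMs) on practical non-reasoning tasks such as tone modification, addressing the failure of existing benchmarks to measure how well SLMs perform in industrial applications.
Motivation: Existing evaluations focus on reasoning and problem-solving tasks, on which SLMs (under 10B parameters) score far below LLMs, but these evaluations do not reflect SLMs' actual effectiveness in industrial applications such as tone modification.
Result: By combining data generation, prompt tuning, and LLM-based evaluation, the paper demonstrates the potential of task-specific fine-tuning and provides tools for benchmarking SLMs and LLMs in practical settings such as edge and private computing.
Insight: The contribution is an evaluation framework for non-reasoning tasks that highlights SLM capabilities where no predefined datasets exist, and that uses a combined approach to demonstrate the value of task-specific fine-tuning in practical applications.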
Abstract: The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem-solving tasks as measures of general intelligence. Small Language Models (SLMs) – defined here as models under 10B parameters – typically score 3-4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs’ effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs’ capabilities in non-reasoning tasks where predefined evaluation datasets don’t exist. Our framework combines novel approaches in data generation, prompt-tuning, and LLM-based evaluation to demonstrate the potential of task-specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: https://github.com/amazon-science/wraval.
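[5] Advances and Challenges in Semantic Textual Similarity: A Comprehensive Survey cs.CL | cs.AI | math.OC
Lokendra Kumar, Neelesh S. Upadhye, Kannan Piedy
TL;DR: A comprehensive survey of Semantic Textual Similarity (STS) that reviews progress since 2021 in transformer architectures, contrastive learning, and domain-specific techniques, covers six key research areas, and analyzes current methods, applications, and challenges.
Motivation: To organize and summarize the rapidly evolving state of STS, guiding researchers and practitioners through recent advances, highlighting emerging trends, and pointing to future opportunities.
Result: The survey reports that recent transformer models (e.g., FarSSiBERT and DeBERTa-v3) achieve remarkable accuracy, contrastive methods (e.g., AspectCSE) set new benchmarks, and domain-adapted models (e.g., CXR-BERT for medical text and Financial-STS for finance) show that STS can be customized effectively for specialized fields.
Insight: The contribution is a systematic organization of STS progress into six research areas (transformer-based models, contrastive learning, domain-focused solutions, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques), giving a comprehensive view of the landscape and emphasizing domain customization and multi-modal and knowledge integration as key directions for improving semantic understanding.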
Abstract: Semantic Textual Similarity (STS) research has expanded rapidly since 2021, driven by advances in transformer architectures, contrastive learning, and domain-specific techniques. This survey reviews progress across six key areas: transformer-based models, contrastive learning, domain-focused solutions, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques. Recent transformer models such as FarSSiBERT and DeBERTa-v3 have achieved remarkable accuracy, while contrastive methods like AspectCSE have established new benchmarks. Domain-adapted models, including CXR-BERT for medical texts and Financial-STS for finance, demonstrate how STS can be effectively customized for specialized fields. Moreover, multi-modal, graph-based, and knowledge-integrated models further enhance semantic understanding and representation. By organizing and analyzing these developments, the survey provides valuable insights into current methods, practical applications, and remaining challenges. It aims to guide researchers and practitioners alike in navigating rapid advancements, highlighting emerging trends and future opportunities in the evolving field of STS.
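[6] GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators cs.CL | cs.AI | cs.HC | cs.LG
Naseem Machlovi, Maryam Saleki, Ruhul Amin, Mohamed Rahouti, Shawqi Al-Maliki
TL;DR: This paper presents GuardEval, a multi-perspective benchmark dataset for evaluating LLM content moderation systems, and GemmaGuard (GGuard), a model fine-tuned on it. GuardEval contains 106 fine-grained categories covering emotions, offensive language, bias, and safety concerns. In evaluation, GGuard substantially outperforms leading moderation models on macro F1.
Motivation: Existing LLM moderation systems struggle with complex, subjective, context-dependent cases such as implicit offensiveness, subtle bias, and jailbreak prompts, and their reliance on training data can reinforce societal biases, leading to inconsistent and ethically problematic outputs.
Result: On the GuardEval benchmark, the fine-tuned GGuard model achieves a macro F1 of 0.832, well above leading models such as OpenAI Moderator (0.64) and Llama Guard (0.61).
Insight: The core contribution is a unified, multi-perspective, fine-grained, human-centered safety benchmark (GuardEval), together with evidence that diverse, representative data is critical for improving the safety, fairness, and robustness of LLM moderation on complex borderline cases. GGuard demonstrates the effectiveness of fine-tuning on this benchmark.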
Abstract: As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems, which distinguish naive from harmful requests while upholding appropriate censorship boundaries, has never been greater. While existing LLMs can detect harmful or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a QLoRA fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, fairness, and robustness on complex, borderline cases.
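For readers unfamiliar with the headline metric, here is a minimal sketch of how a macro F1 score is computed: per-class F1 averaged with equal weight per class, so rare categories count as much as common ones. The label names and the choice to average over every label seen in either list are assumptions for illustration; the abstract does not describe the paper's exact evaluation protocol.

```python
from typing import List


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class never predicted correctly scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical moderation labels, not GuardEval's actual categories.
gold = ["safe", "safe", "hate", "bias"]
pred = ["safe", "hate", "hate", "safe"]
score = macro_f1(gold, pred)  # per-class F1: bias 0, hate 2/3, safe 1/2
```

With many fine-grained categories, this equal weighting means a model cannot reach a high macro F1 by doing well only on the most frequent classes.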
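[7] Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models cs.CL | cs.AI | cs.LG
Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-young Paik, Liming Zhu
TL;DR: This paper studies how metaphors affect the reasoning pathways of large language models (LLMs). It finds a strong causal relationship between metaphors in training data and the degree of cross-domain misalignment, validates the effect with intervention experiments, links metaphors to the activation of latent features in the model, and uses this link to build a high-accuracy detector of misaligned content.
Motivation: LLM training data contains many metaphors, and metaphors are known to influence human decision making. The paper investigates whether metaphors also influence LLM reasoning pathways, specifically in the context of cross-domain misalignment, where a model generalizes misaligned patterns learned in one domain to another.
Result: Metaphor interventions in the pre-training, fine-tuning, and re-alignment phases significantly change models' cross-domain misalignment; building on the observed connection between metaphors and the activation of global and local latent features, the proposed detector predicts misaligned content with high accuracy.
Insight: The paper is the first to show systematically that metaphors in training data are an important causal source of cross-domain misalignment in LLMs, and it connects metaphors to internal latent feature activations, offering a new perspective and tools for understanding and detecting misaligned behavior.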
Abstract: Earlier research has shown that metaphors influence human decision making, which raises the question of whether metaphors also influence the reasoning pathways of large language models (LLMs), considering that their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem, where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the misalignment degree of LLMs’ reasoning contents. With interventions using metaphors in pre-training, fine-tuning and re-alignment phases, models’ cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.
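[8] Implicit Graph, Explicit Retrieval: Towards Efficient and Interpretable Long-horizon Memory for Large Language Models cs.CL
Xin Zhang, Kailai Yang, Hao Li, Chenyue Li, Qiyu Wei
TL;DR: This paper proposes LatentGraphMem, a memory framework for LLMs that must handle sparse, dispersed evidence in long contexts. It combines implicit graph memory with explicit subgraph retrieval: graph-structured memory is stored in latent space for efficiency and stability, and an explicit interface returns a compact symbolic subgraph to support reasoning and interpretability.
Motivation: Explicit structured memories are brittle and hard to scale under long-context overload, while latent memory mechanisms are efficient and stable but hard to interpret, so a long-horizon memory design is needed that balances efficiency, stability, and interpretability.
Result: On long-horizon benchmarks across multiple model scales, LatentGraphMem consistently outperforms representative explicit-graph and latent-memory baselines, while supporting parameter-efficient adaptation and scaling flexibly to larger reasoners without introducing large symbolic artifacts.
Insight: The contribution is the combination of implicit graph memory with explicit subgraph retrieval, balancing efficient storage against controllable retrieval. Viewed objectively, separating the explicit graph view used in training from latent-space retrieval at inference improves both performance and interpretability on long-horizon memory tasks.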
Abstract: Long-horizon applications increasingly require large language models (LLMs) to answer queries when relevant evidence is sparse and dispersed across very long contexts. Existing memory systems largely follow two paradigms: explicit structured memories offer interpretability but often become brittle under long-context overload, while latent memory mechanisms are efficient and stable yet difficult to inspect. We propose LatentGraphMem, a memory framework that combines implicit graph memory with explicit subgraph retrieval. LatentGraphMem stores a graph-structured memory in latent space for stability and efficiency, and exposes a task-specific subgraph retrieval interface that returns a compact symbolic subgraph under a fixed budget for downstream reasoning and human inspection. During training, an explicit graph view is materialized to interface with a frozen reasoner for question-answering supervision. At inference time, retrieval is performed in latent space and only the retrieved subgraph is externalized. Experiments on long-horizon benchmarks across multiple model scales show that LatentGraphMem consistently outperforms representative explicit-graph and latent-memory baselines, while enabling parameter-efficient adaptation and flexible scaling to larger reasoners without introducing large symbolic artifacts.
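[9] Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models cs.CL | cs.AI
Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes
TL;DR: This paper proposes Cross-Architecture Proxy Tuning (CAPT), a model-ensembling method that adapts new-generation general-domain LLMs to the clinical domain without training. It uses existing clinical models and contrastive decoding to selectively inject clinically relevant signals while preserving the general model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT outperforms both individual models and existing ensembling approaches.
Motivation: Adapting language models to the clinical domain through continued pretraining and fine-tuning requires costly retraining with every new model release; the paper aims for training-free domain adaptation.
Result: On six clinical classification and text-generation tasks, CAPT (pairing a new-generation general-domain model with an older-generation clinical model) beats UniTE by 17.6% and proxy tuning by 41.4% on average, and outperforms each model individually.
Insight: The contribution is a cross-architecture proxy tuning method that supports disjoint vocabularies and injects domain knowledge without training via contrastive decoding. At its core, it uses model ensembling and a contrastive mechanism to combine the strengths of newer and older models without parameter updates, improving clinical task performance.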
Abstract: Adapting language models to the clinical domain through continued pretraining and fine-tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model’s reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity.
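[10] Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks cs.CL
Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
TL;DR: This paper proposes L2T, a pre-training framework that combines Language Learning Tasks with standard next-token prediction to strengthen the linguistic competence of language models. Inspired by human language acquisition, it transforms raw text into structured input-output pairs that provide explicit linguistic stimulation. Models pre-trained on a mixture of raw text and L2T data improve on linguistic competence benchmarks, acquire that competence faster, and remain competitive on general reasoning tasks.
Motivation: Language models are pre-trained mainly with next-token prediction on raw text, which teaches world knowledge and reasoning but does not explicitly optimize linguistic competence; the paper aims to close this gap by introducing language learning tasks.
Result: Performance on linguistic competence benchmarks improves and competence is acquired faster, while performance on general reasoning tasks stays competitive. The abstract does not say whether the results reach state of the art or match specific models.
Insight: The contribution is bringing mechanisms of human language acquisition into pre-training, using structured tasks to strengthen linguistic competence explicitly, which suggests a new route to improving models' language understanding and generation quality.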
Abstract: Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.
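[11] Prompting Underestimates LLM Capability for Time Series Classification cs.CL
Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios
TL;DR: By comparing prompt-based generation with linear probes, this paper shows that the actual capability of LLMs on time series classification is underestimated. Zero-shot prompting performs near chance, but linear probes raise F1 substantially, matching or even exceeding specialized time series models.
Motivation: Prompt-based evaluations suggest that LLMs perform poorly on time series classification; the paper asks whether this reflects a real limit of the models' representational capacity or a flaw in the evaluation method.
Result: Linear probes raise average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models.
Insight: The paper reveals a systematic mismatch between LLMs' internal representations and prompt-based evaluation results: class-discriminative time series information is already encoded in early transformer layers and is amplified by visual and multimodal inputs. Evaluations of LLM capability should therefore choose methods carefully to avoid underestimating what the models can do.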
Abstract: Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model’s representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
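A linear probe, as contrasted with prompting above, is a linear classifier trained on frozen hidden states. The sketch below is a plain logistic-regression probe in pure Python on made-up two-dimensional "hidden states"; the data, hyperparameters, and function names are illustrative assumptions, and the paper's actual probe setup is not described in the abstract.

```python
import math
from typing import List, Tuple


def train_linear_probe(features: List[List[float]], labels: List[int],
                       lr: float = 0.1, epochs: int = 1000) -> Tuple[List[float], float]:
    """Fit binary logistic regression with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-logit)) - y  # predicted prob minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights: List[float], bias: float, x: List[float]) -> int:
    """Classify by the sign of the linear score."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)


# Toy stand-ins for frozen hidden states extracted from one model layer.
hidden_states = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]]
classes = [0, 0, 0, 1, 1, 1]
w, b = train_linear_probe(hidden_states, classes)
accuracy = sum(predict(w, b, x) == y for x, y in zip(hidden_states, classes)) / len(classes)
```

The base model is never updated: only `w` and `b` are learned, so probe accuracy reflects what is linearly readable from the representations rather than what the model can express through generation.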
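[12] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning cs.CL | cs.AI
Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu
TL;DR: This paper presents EpiQAL, the first comprehensive diagnostic benchmark for epidemiological question answering, designed to evaluate LLMs' evidence-grounded epidemiological reasoning. It comprises three subsets that evaluate text-grounded factual recall, multi-step inference linking evidence with principles, and conclusion reconstruction. Experiments show that current LLMs have limited epidemiological reasoning ability, with multi-step inference especially challenging.
Motivation: Existing medical QA benchmarks focus on clinical knowledge or patient-level reasoning and lack systematic evaluation of evidence-grounded epidemiological inference, so a dedicated benchmark is needed to fill the gap.
Result: Experiments on ten open models show limited performance on EpiQAL, with the multi-step inference subset the hardest; model rankings shift across subsets, and model scale is not a reliable predictor of performance; chain-of-thought prompting helps multi-step inference but gives mixed results on other tasks.
Insight: The contribution is the first fine-grained diagnostic benchmark for epidemiological QA, built by combining expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control, which yields fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.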
Abstract: Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.
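[13] SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation cs.CL | cs.AI | cs.IR
José Isidro, Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães
TL;DR: This paper proposes SegNSP, which reframes linear text segmentation as a next sentence prediction (NSP) task. A label-agnostic NSP model predicts whether a sentence continues the current topic, and a segmentation-aware loss with harder negative sampling improves performance. The method is validated on two datasets.
Motivation: Linear text segmentation is a long-standing challenge in NLP: topic boundaries are hard to define, discourse structure varies, and local coherence must be balanced against global context, so existing methods fall short of what downstream applications need.
Result: SegNSP achieves a B-F1 of 0.79 on CitiLink-Minutes, close to human-annotated topic transitions, and a B-F1 of 0.65 on WikiSection, 0.17 absolute points above the strongest reproducible baseline, TopSeg, showing competitive and robust performance.
Insight: The contribution is reusing NSP, a task largely abandoned in modern pre-training, to model sentence-to-sentence continuity without topic-label supervision; a segmentation-aware loss and harder negative sampling improve the capture of discourse continuity, and the approach avoids the complexity of auxiliary topic classification.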
Abstract: Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets, CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-$F_1$ of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-$F_1$ of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
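To make the NSP framing concrete, the sketch below turns per-pair continuation probabilities into segments by thresholding. It assumes some NSP-style model has already scored each adjacent sentence pair; the 0.5 threshold, the function name `segment`, and the example scores are hypothetical, and the abstract does not specify how SegNSP decodes boundaries.

```python
from typing import List


def segment(sentences: List[str], continue_probs: List[float],
            threshold: float = 0.5) -> List[List[str]]:
    """Split sentences into segments.

    continue_probs[i] is the (assumed) probability that sentences[i + 1]
    continues the topic of sentences[i]; a value below the threshold
    places a topic boundary between the two.
    """
    if not sentences:
        return []
    if len(continue_probs) != len(sentences) - 1:
        raise ValueError("need exactly one probability per adjacent sentence pair")
    segments = [[sentences[0]]]
    for sentence, prob in zip(sentences[1:], continue_probs):
        if prob < threshold:
            segments.append([sentence])    # boundary: start a new segment
        else:
            segments[-1].append(sentence)  # same topic: extend the current segment
    return segments


minutes = ["Budget approved.", "Vote was unanimous.", "Road works begin.", "Detours are posted."]
probs = [0.9, 0.2, 0.8]  # hypothetical model outputs
parts = segment(minutes, probs)
```

Here the low score between the second and third sentences yields two segments of two sentences each.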
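[14] Self-Explaining Hate Speech Detection with Moral Rationales cs.CL
Francielle Vargas, Jackson Trager, Diego Alves, Surendrabikram Thapa, Matteo Guida
TL;DR: This paper proposes Supervised Moral Rationale Attention (SMRA), a self-explaining hate speech detection framework that is the first to use moral rationales as direct supervision for attention alignment. Grounded in Moral Foundations Theory, it guides the model to attend to morally salient spans rather than spurious lexical patterns. To support the framework, the authors also build HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata.
Motivation: Existing hate speech detection models rely too heavily on surface-level lexical features and are vulnerable to spurious correlations, which limits robustness, cultural contextualization, and interpretability.
Result: On binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) and substantially improves explanation faithfulness, raising IoU F1 by 7.4 percentage points and Token F1 by 5.0 percentage points. Explanations become more concise, sufficiency improves by 2.3 percentage points, and fairness remains stable.
Insight: The main contribution is integrating expert-annotated moral rationales as direct supervision in the training objective to produce inherently interpretable, contextualized explanations, unlike prior rationale-supervised or post-hoc approaches. Viewed objectively, the method systematically incorporates a theory from moral psychology (Moral Foundations Theory) into a deep learning model, offering a principled framework for addressing interpretability and spurious correlations, and contributes a new multi-dimensionally annotated benchmark dataset.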
Abstract: Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
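[15] Reasoning Pattern Alignment Merging for Adaptive Reasoning cs.CL | cs.AI
Zhaofeng Zhong, Wei Yuan, Tong Chen, Xiangyu Zhao, Quoc Viet Hung Nguyen
TL;DR: This paper proposes Reasoning Pattern Alignment Merging (RPAM), a layer-wise model merging framework for efficient adaptive reasoning. By merging a long chain-of-thought model with a short chain-of-thought instruction model, RPAM substantially lowers inference cost while maintaining strong performance, without training from scratch or large amounts of additional data.
Motivation: Large reasoning models generate long reasoning paths for every query, causing unnecessary computation and latency, and existing speed-up methods are either too expensive or highly sensitive to input and prompt design. The paper aims to obtain an adaptive reasoner through lightweight model merging.
Result: Experiments on seven widely used reasoning benchmarks show that RPAM substantially reduces inference cost while maintaining strong performance.
Insight: The contribution is a layer-wise merging framework based on feature alignment: it builds a pattern-labeled calibration set, optimizes the merging coefficients, and adds a contrastive objective to align intermediate representations, enabling query-adaptive reasoning as a lightweight approach that needs neither retraining nor large-scale data.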
Abstract: Recent large reasoning models (LRMs) have made substantial progress in complex reasoning tasks, yet they often generate lengthy reasoning paths for every query, incurring unnecessary computation and latency. Existing speed-up approaches typically rely on retraining the model or designing sophisticated prompting, which are either prohibitively expensive or highly sensitive to the input and prompt formulation. In this work, we study model merging as a lightweight alternative for efficient reasoning: by combining a long chain-of-thought (Long-CoT) reasoning model with a Short-CoT instruction model, we obtain an adaptive reasoner without training from scratch or requiring large-scale additional data. Building on this idea, we propose Reasoning Pattern Alignment Merging (RPAM), a layer-wise model merging framework based on feature alignment to facilitate query-adaptive reasoning. RPAM first constructs a small pattern-labeled calibration set that assigns each query an appropriate reasoning pattern. It then optimizes layer-wise merging coefficients by aligning the merged model’s intermediate representations with those of the selected model, while a contrastive objective explicitly pushes them away from the non-selected model. Experiments on seven widely used reasoning benchmarks show that RPAM substantially reduces inference cost while maintaining strong performance. Upon article acceptance, we will provide open-source code to reproduce experiments for RPAM.
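As a rough illustration of what "layer-wise merging coefficients" can mean, the sketch below linearly interpolates two models' parameters with a separate coefficient per layer. Linear interpolation is an assumption chosen for simplicity: the abstract does not state RPAM's merge rule, and the coefficient optimization (feature alignment plus the contrastive objective) is not shown. All names and values are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # layer name -> flattened weights


def merge_layerwise(long_cot: Params, short_cot: Params,
                    coeffs: Dict[str, float]) -> Params:
    """Interpolate per layer: alpha * long_cot + (1 - alpha) * short_cot."""
    merged: Params = {}
    for layer, w_long in long_cot.items():
        alpha = coeffs[layer]  # in RPAM these would be optimized; here they are given
        w_short = short_cot[layer]
        merged[layer] = [alpha * a + (1.0 - alpha) * b for a, b in zip(w_long, w_short)]
    return merged


# Toy two-layer "models" with made-up weights.
long_model = {"layer0": [1.0, 2.0], "layer1": [0.0, 4.0]}
short_model = {"layer0": [3.0, 0.0], "layer1": [2.0, 0.0]}
alphas = {"layer0": 0.5, "layer1": 1.0}  # layer1 taken entirely from the Long-CoT model
merged_model = merge_layerwise(long_model, short_model, alphas)
```

A per-layer coefficient lets some layers stay close to the Long-CoT model and others to the Short-CoT model, which is the degree of freedom a layer-wise scheme adds over a single global mixing weight.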
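[16] Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents cs.CL | cs.AI
Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu
TL;DR: This paper presents Mem-Gallery, a new benchmark for evaluating the long-term memory of multimodal large language model (MLLM) agents in multi-turn conversation. It contains high-quality multi-session conversations grounded in visual and textual information, with long interaction histories and rich multimodal dependencies. On this dataset the authors build a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive testing of 13 memory systems exposes the limits of current models in explicit multimodal information retention, memory organization, memory reasoning, and knowledge management, as well as efficiency bottlenecks.
Motivation: Existing benchmarks either evaluate multi-session memory in text-only conversations or evaluate multimodal understanding within localized contexts; neither can assess how multimodal memory is preserved, organized, and evolved over long conversational trajectories. A new benchmark dedicated to multimodal long-term conversational memory in MLLM agents is therefore needed.
Result: Extensive benchmarking of 13 memory systems on Mem-Gallery yields several key findings, including the necessity of explicit multimodal information retention and memory organization and persistent limitations in memory reasoning and knowledge management.
Insight: The contribution is Mem-Gallery, a new benchmark focused on multimodal long-term conversational memory in MLLM agents, together with a framework for systematic evaluation along three functional dimensions (memory extraction/adaptation, reasoning, and knowledge management), providing an evaluation tool and analytical lens for understanding and improving long-term memory in MLLMs.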
Abstract: Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.
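[17] PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models cs.CL
Yuwen Wang, Xinyuan Qian, Tian-Hao Zhang, Jiaran Gao, Yuchen Pan
TL;DR: This paper formalizes the task of Personalized Audio-Language Models (PALM) and creates the first comprehensive benchmark, PALM-Bench, to evaluate how well models recognize personal concepts and reason within personal context in multi-speaker scenarios. It finds that existing large audio-language models fall short on personalized question answering, and that existing prompting and fine-tuning methods are limited in modeling and transferring personalized knowledge.
Motivation: Existing large audio-language models perform well on audio understanding and generation, but their behavior is largely generic and lacks support for personal context (such as identifying a specific speaker and what they said), so they cannot meet the needs of personalized question answering. The paper aims to close this gap and advance personalized audio-language models.
Result: Extensive experiments with representative open-source large audio-language models on PALM-Bench show that existing training-free prompting and supervised fine-tuning strategies bring improvements but remain limited in robustly modeling personalized knowledge and transferring it across tasks.
Insight: The contribution is the first formalization of the personalized audio-language model task and the first multi-task, multi-speaker benchmark, PALM-Bench, which provides a structured framework for evaluating and advancing personalized audio understanding. Viewed objectively, the study exposes the weakness of current models in modeling personal context and underscores the need for more robust personalization methods.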
Abstract: Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet, our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual’s personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it across tasks robustly. Data and code will be released.
[18] Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach cs.CL | cs.CV | cs.HC | cs.LGPDF
Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian
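TL;DR: This paper proposes a persona-aware and explainable bikeability assessment framework based on a vision-language model (VLM). Through theory-grounded persona conditioning, multi-granularity supervised fine-tuning, and AI-enabled data augmentation, the framework aims to capture complex road environments more accurately and to explain differences in subjective perception among cyclists.
Details
Motivation: Existing perception-based bikeability assessment methods struggle to capture the complexity of road environments and do not adequately account for heterogeneity in cyclists' subjective perceptions. The paper addresses these limitations to support sustainable urban transportation and cyclist-friendly cities.
Result: Experiments use 12,400 persona-conditioned assessments collected through a panoramic image-based crowdsourcing system. The proposed framework achieves competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
Insight: The contributions are: 1) theory-grounded persona conditioning based on an established cyclist typology, which generates persona-specific explanations via chain-of-thought reasoning; 2) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings; 3) AI-generated controlled paired data that isolates the impact of infrastructure variables, used for data augmentation and explainability analysis.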
Abstract: Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users’ perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experiment results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
[19] Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models cs.CL | cs.AIPDF
Xukai Liu, Ye Liu, Jipeng Zhang, Yanghai Zhang, Kai Zhang
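TL;DR: By analyzing the internal mechanisms of large language models during multi-hop reasoning, this paper finds that answer entities can become decodable earlier than bridge entities, a phenomenon called layer-order inversion, and proposes a probabilistic recall-and-extract framework to explain it.
Details
Motivation: To investigate how large language models internally compose multiple facts for multi-hop reasoning, and to challenge the existing hop-aligned circuit hypothesis.
Result: Systematic analysis of real-world multi-hop queries verifies layer-order inversion and empirically supports the probabilistic recall-and-extract framework, which reinterprets prior layer-wise decoding evidence, explains chain-of-thought gains, and diagnoses the causes of multi-hop reasoning failures.
Insight: The contribution is revealing layer-order inversion, in which answer entities can be decoded before bridge entities in multi-hop reasoning, and proposing a probabilistic recall-and-extract framework that explains model behavior mechanistically, offering a new perspective on how large language models reason.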
Abstract: Large language models (LLMs) perform well on multi-hop reasoning, yet how they internally compose multiple facts remains unclear. Recent work proposes the hop-aligned circuit hypothesis, suggesting that bridge entities are computed sequentially across layers before later-hop answers. Through systematic analyses on real-world multi-hop queries, we show that this hop-aligned assumption does not generalize: later-hop answer entities can become decodable earlier than bridge entities, a phenomenon we call layer-order inversion, which strengthens with total hops. To explain this behavior, we propose a probabilistic recall-and-extract framework that models multi-hop reasoning as broad probabilistic recall in shallow MLP layers followed by selective extraction in deeper attention layers. This framework is empirically validated through systematic probing analyses, reinterpreting prior layer-wise decoding evidence, explaining chain-of-thought gains, and providing a mechanistic diagnosis of multi-hop failures despite correct single-hop knowledge. Code is available at https://github.com/laquabe/Layer-Order-Inversion.
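A minimal sketch of how layer-order inversion could be checked once per-layer probe accuracies are available. The helper names, the 0.5 decodability threshold, and the toy accuracy values are illustrative assumptions and are not taken from the paper.

```python
from typing import Optional, Sequence


def first_decodable_layer(probe_acc: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the first layer index whose probe accuracy reaches the threshold."""
    for layer, acc in enumerate(probe_acc):
        if acc >= threshold:
            return layer
    return None  # the entity never becomes decodable


def is_layer_order_inverted(bridge_acc: Sequence[float],
                            answer_acc: Sequence[float],
                            threshold: float = 0.5) -> bool:
    """True when the answer entity becomes decodable before the bridge entity."""
    bridge_layer = first_decodable_layer(bridge_acc, threshold)
    answer_layer = first_decodable_layer(answer_acc, threshold)
    if answer_layer is None:
        return False
    if bridge_layer is None:
        # Assumption: an answer that is decodable without a decodable bridge counts as inverted.
        return True
    return answer_layer < bridge_layer


# Toy probe accuracies per layer (hypothetical numbers).
bridge = [0.1, 0.2, 0.3, 0.6, 0.8]
answer = [0.1, 0.4, 0.7, 0.8, 0.9]
print(is_layer_order_inverted(bridge, answer))
```

Under the hop-aligned hypothesis the bridge entity would cross the threshold first, so the check would return False.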
[20] DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs cs.CLPDF
Shidong Cao, Hongzhan Lin, Yuxuan Gu, Ziyang Luo, Jing Ma
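TL;DR: This paper proposes DiffCoT, a diffusion-styled chain-of-thought reasoning framework that recasts multi-step reasoning as an iterative denoising process. A sliding-window mechanism introduces diffusion principles at the reasoning-step level, allowing unified generation and retrospective correction of intermediate steps while preserving token-level autoregression, which improves the robustness and error-correction ability of large language models in mathematical problem solving.
Details
Motivation: Conventional chain-of-thought reasoning suffers from exposure bias and error accumulation: early mistakes propagate irreversibly through autoregressive decoding and degrade the accuracy of the final answer.
Result: Extensive experiments on three multi-step chain-of-thought reasoning benchmarks with different model backbones show that DiffCoT consistently outperforms existing chain-of-thought preference optimization methods, with better robustness and error-correction ability.
Insight: The core contribution is bringing the iterative denoising idea of diffusion models to sequential reasoning. A sliding-window mechanism and a causal diffusion noise schedule unify the generation and correction of reasoning chains, offering a new approach to error propagation in autoregressive models.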
Abstract: Chain-of-Thought (CoT) reasoning improves multi-step mathematical problem solving in large language models but remains vulnerable to exposure bias and error accumulation, as early mistakes propagate irreversibly through autoregressive decoding. In this work, we propose DiffCoT, a diffusion-styled CoT framework that reformulates CoT reasoning as an iterative denoising process. DiffCoT integrates diffusion principles at the reasoning-step level via a sliding-window mechanism, enabling unified generation and retrospective correction of intermediate steps while preserving token-level autoregression. To maintain causal consistency, we further introduce a causal diffusion noise schedule that respects the temporal structure of reasoning chains. Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones demonstrate that DiffCoT consistently outperforms existing CoT preference optimization methods, yielding improved robustness and error-correction capability in CoT reasoning.
[21] How Do Large Language Models Learn Concepts During Continual Pre-Training? cs.CLPDF
Barry Menglong Yao, Sha Li, Yunzhi Yao, Minqian Liu, Zaishuo Xia
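TL;DR: This paper studies how large language models learn, retain, and forget concepts during continual pre-training. By analyzing the behavioral dynamics of concept circuits, it reveals statistically significant signals of concept learning and forgetting, a stage-wise temporal pattern, the relationship between learning gains and forgetting, interference between semantically similar concepts, and the transferability of conceptual knowledge.
Details
Motivation: Humans understand the world primarily through concepts, yet how large language models acquire, retain, and forget concepts during continual pre-training remains unclear. The paper explores this process to improve understanding of concept learning mechanisms in LLMs.
Result: Concept circuits provide a significant signal of concept learning and forgetting and show a stage-wise pattern: an early increase followed by a gradual decrease and stabilization. Concepts with larger learning gains are forgotten more under subsequent training; semantically similar concepts interfere more strongly than weakly related ones; and some conceptual knowledge is transferable and facilitates the learning of other concepts.
Insight: The contribution is linking concept learning dynamics to internal concept circuits and introducing graph metrics to analyze circuit structure, which yields a circuit-level view of concept learning and informs the design of more interpretable and robust concept-aware training strategies.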
Abstract: Human beings primarily understand the world through concepts (e.g., dog), abstract mental representations that structure perception, reasoning, and learning. However, how large language models (LLMs) acquire, retain, and forget such concepts during continual pretraining remains poorly understood. In this work, we study how individual concepts are acquired and forgotten, as well as how multiple concepts interact through interference and synergy. We link these behavioral dynamics to LLMs’ internal Concept Circuits, computational subgraphs associated with specific concepts, and incorporate Graph Metrics to characterize circuit structure. Our analysis reveals: (1) LLMs’ concept circuits provide a non-trivial, statistically significant signal of concept learning and forgetting; (2) concept circuits exhibit a stage-wise temporal pattern during continual pretraining, with an early increase followed by gradual decrease and stabilization; (3) concepts with larger learning gains tend to exhibit greater forgetting under subsequent training; (4) semantically similar concepts induce stronger interference than weakly related ones; (5) conceptual knowledge differs in its transferability, with some concepts significantly facilitating the learning of others. Together, our findings offer a circuit-level view of concept learning dynamics and inform the design of more interpretable and robust concept-aware training strategies for LLMs.
[22] OLA: Output Language Alignment in Code-Switched LLM Interactions cs.CLPDF
Juhyun Oh, Haneul Yoo, Faiz Ghifari Haznitrama, Alice Oh
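TL;DR: This paper introduces OLA, a benchmark for evaluating output language alignment of large language models in code-switched interactions. It finds that current frontier models show a systematic bias toward non-English responses when handling mixed Korean-English input, and that this bias generalizes to Chinese and Indonesian pairs. Code-switching-aware DPO training with a small amount of data (about 1K examples) substantially improves alignment, indicating that the failures stem from insufficient alignment rather than fundamental limitations.
Details
Motivation: When multilingual users naturally code-switch in conversation, large language models fail to infer and honor the user's implicit expectation about the output language: given mixed-language prompts, models often respond in an unintended language.
Result: On the Korean-English OLA benchmark, even frontier models frequently misinterpret implicit language expectations and show a bias toward non-English responses, a bias that also appears for Chinese and Indonesian pairs. Models are also unstable, switching languages mid-response and exhibiting language intrusions. Code-switching-aware DPO training on about 1K examples substantially reduces misalignment.
Insight: The contribution is the OLA benchmark, which systematically evaluates output language alignment under code-switching and exposes model bias and instability. Viewed objectively, the analysis shows that targeted alignment training (such as DPO), rather than stronger reasoning (such as chain-of-thought prompting), resolves the problem, underscoring the importance of aligning multilingual LLMs with users' implicit expectations in real-world code-switched interactions.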
Abstract: Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models (LLMs). When a user code-switches in their prompt to an LLM, they typically do not specify the expected language of the LLM response, and thus LLMs must infer the output language from contextual and pragmatic cues. We find that current LLMs systematically fail to align with this expectation, responding in undesired languages even when cues are clear to humans. We introduce OLA, a benchmark to evaluate LLMs’ Output Language Alignment in code-switched interactions. OLA focuses on Korean–English code-switching and spans simple intra-sentential mixing to instruction-content mismatches. Even frontier models frequently misinterpret implicit language expectation, exhibiting a bias toward non-English responses. We further show this bias generalizes beyond Korean to Chinese and Indonesian pairs. Models also show instability through mid-response switching and language intrusions. Chain-of-Thought prompting fails to resolve these errors, indicating weak pragmatic reasoning about output language. However, Code-Switching Aware DPO with minimal data (about 1K examples) substantially reduces misalignment, suggesting these failures stem from insufficient alignment rather than fundamental limitations. Our results highlight the need to align multilingual LLMs with users’ implicit expectations in real-world code-switched interactions.
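The abstract does not spell out how Code-Switching Aware DPO is set up, so the following is only a sketch of the standard DPO objective for one preference pair, under the assumption that the "chosen" response is in the expected output language and the "rejected" response is in an undesired language. The function name, the beta value, and the log-probabilities are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: scaled log-ratio of the policy to the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the chosen-language reply more than the reference does.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls below log 2 as soon as the policy favors the expected-language response more strongly than the reference model does.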
[23] From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs cs.CL | cs.AIPDF
Yingjian Chen, Haoran Liu, Yinhong Liu, Sherry T. Tong, Aosong Feng
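TL;DR: This paper proposes Self-Graph Reasoning (SGR), a framework that addresses the linear and logically inconsistent reasoning of large language models in open-domain question answering. SGR lets LLMs explicitly represent their reasoning process as a structured graph before producing the final answer, and the authors construct a graph-structured reasoning dataset for model training. On five QA benchmarks across general and specialized domains, SGR markedly improves reasoning consistency and yields a 17.74% gain over the base model; the fine-tuned LLaMA-3.3-70B model performs comparably to GPT-4o and surpasses Claude-3.5-Haiku.
Details
Motivation: Existing methods such as Chain-of-Thought (CoT) express reasoning as linear text, which may appear coherent but often leads to inconsistent conclusions and cannot integrate multiple premises or solve subproblems in parallel. The paper explores how LLMs can construct and use their own graph-structured reasoning, to make up for the lack of structured, parallel reasoning in open-domain QA.
Result: On five QA benchmarks across general and specialized domains, SGR achieves a 17.74% gain over the base model and markedly improves reasoning consistency. LLaMA-3.3-70B fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku, reaching an advanced level of performance.
Insight: The core contribution is the SGR framework, which lets LLMs generate graph-structured representations of their own reasoning. This moves beyond linear textual reasoning and supports the parallel, structured reasoning that real-world problems require. Viewed objectively, the graph-structured reasoning dataset and the explicit graph representation of the reasoning process offer a reusable path toward better logical consistency and complex problem solving in LLMs.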
Abstract: Large Language Models (LLMs) show strong reasoning ability in open-domain question answering, yet their reasoning processes are typically linear and often logically inconsistent. In contrast, real-world reasoning requires integrating multiple premises and solving subproblems in parallel. Existing methods, such as Chain-of-Thought (CoT), express reasoning in a linear textual form, which may appear coherent but frequently leads to inconsistent conclusions. Recent approaches rely on externally provided graphs and do not explore how LLMs can construct and use their own graph-structured reasoning, particularly in open-domain QA. To fill this gap, we explore graph-structured reasoning of LLMs in general-domain question answering. We propose Self-Graph Reasoning (SGR), a framework that enables LLMs to explicitly represent their reasoning process as a structured graph before producing the final answer. We further construct a graph-structured reasoning dataset that merges multiple candidate reasoning graphs into refined graph structures for model training. Experiments on five QA benchmarks across both general and specialized domains show that SGR consistently improves reasoning consistency and yields a 17.74% gain over the base model. The LLaMA-3.3-70B model fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku, demonstrating the effectiveness of graph-structured reasoning.
[24] Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation cs.CL | cs.SD | eess.ASPDF
Binh Nguyen, Thai Le
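TL;DR: This paper proposes a forensic auditing framework for evaluating the reasoning robustness of audio language models (ALMs) under adversarial attacks along three dimensions: acoustic perception, cognitive coherence, and cognitive dissonance. It finds that explicit reasoning does not always improve robustness but instead bifurcates: for models with robust acoustic perception, reasoning acts as a defensive "shield"; for other models, it imposes a performance "tax", particularly under linguistic attacks, which reduce cognitive coherence and raise the attack success rate. Even when classification fails, high cognitive dissonance can serve as a "silent alarm" that signals potential manipulation.
Details
Motivation: Existing audio deepfake detection (ADD) research focuses mainly on how final predictions (such as real versus fake classification) change under adversarial attacks and lacks analysis of the robustness of the reasoning behind those predictions. As explainable audio language models (ALMs) provide some transparency through reasoning traces, a new robustness analysis paradigm is needed to assess how stable their reasoning is under attack.
Result: Systematic analysis shows that the effect of explicit reasoning on robustness under adversarial attacks bifurcates: for models with robust acoustic perception, reasoning is defensive (a "shield"); for other models, reasoning degrades performance (a "tax"), especially under linguistic attacks, which markedly reduce cognitive coherence and raise the attack success rate. In addition, high cognitive dissonance can indicate potential manipulation even when classification fails.
Insight: The contribution is a forensic auditing framework that evaluates ALM reasoning robustness along three dimensions (acoustic perception, cognitive coherence, and cognitive dissonance) and the first demonstration of the dual role of reasoning under adversarial attacks (shield versus tax). Viewed objectively, the study offers a new perspective on robustness evaluation for explainable AI, stresses the importance of analyzing the reasoning process rather than only the final output, and suggests cognitive dissonance as an auxiliary signal for detecting adversarial examples.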
Abstract: Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing some level of transparency into their predictions via reasoning traces. This necessitates a new class of model robustness analysis: robustness of the predictive reasoning under adversarial attacks, which goes beyond the existing paradigm that mainly focuses on shifts in the final predictions (e.g., fake vs. real). To analyze such reasoning shifts, we introduce a forensic auditing framework to evaluate the robustness of ALMs’ reasoning under adversarial attacks in three inter-connected dimensions: acoustic perception, cognitive coherence, and cognitive dissonance. Our systematic analysis reveals that explicit reasoning does not universally enhance robustness. Instead, we observe a bifurcation: for models exhibiting robust acoustic perception, reasoning acts as a defensive “shield”, protecting them from adversarial attacks. However, for others, it imposes a performance “tax”, particularly under linguistic attacks, which reduce cognitive coherence and increase attack success rate. Crucially, even when classification fails, high cognitive dissonance can serve as a silent alarm, flagging potential manipulation. Overall, this work provides a critical evaluation of the role of reasoning in forensic audio deepfake analysis and its vulnerabilities.
[25] Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases cs.CLPDF
Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase
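TL;DR: This paper presents the first systematic comparison of large reasoning models (LRMs) and non-reasoning large language models (LLMs) as judges. LRMs outperform non-reasoning LLMs in judgment accuracy, instruction following, and robustness to adversarial attacks, especially on reasoning-intensive tasks, yet they remain strongly biased by superficial quality. To mitigate this, the authors propose PlanJudge, a strategy that has the model generate an explicit evaluation plan before execution, which effectively reduces bias in both LRMs and standard LLMs.
Details
Motivation: To investigate whether large models with reasoning ability (LRMs) are better suited than non-reasoning LLMs to serve as automatic evaluators (LLM-as-judge), and to address the biases that existing evaluator models may exhibit.
Result: LRMs outperform non-reasoning LLMs in judgment accuracy (particularly on reasoning tasks), instruction following, and adversarial robustness, but show strong bias toward superficial quality. The proposed PlanJudge method significantly mitigates bias in both LRMs and standard LLMs.
Insight: The contribution is the first systematic comparison of LRMs and non-reasoning LLMs as judges, showing that LRMs are better overall but still affected by bias. PlanJudge is a simple and effective debiasing strategy that improves evaluation robustness by adding an explicit evaluation-planning step, an idea that may extend to other settings where LLMs make subjective judgments.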
Abstract: This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) however, LRMs still exhibit strong biases toward superficial quality. To improve the robustness against biases, we propose PlanJudge, an evaluation strategy that prompts the model to generate an explicit evaluation plan before execution. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in both LRMs and standard LLMs.
[26] ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs cs.CLPDF
HanGyeol Yoo, ChangSu Choi, Minjun Kim, Seohyun Song, SeungWoo Song
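TL;DR: This paper proposes an efficient layer-specific optimization (ELO) method to enhance continual pretraining (CP) of multilingual large language models (MLLMs) for specific languages. The method detaches and trains only the critical first and last layers of the model, which substantially reduces compute and memory consumption, and preserves source-language performance through a layer alignment step. Experiments show that ELO improves target-language performance while achieving a training speedup of up to 6.46x.
Details
Motivation: To address the high computational cost and the degradation of source-language (for example, English) performance that traditional continual pretraining incurs in multilingual large language models.
Result: On qualitative benchmarks, target-language performance improves by up to 6.2% while source-language capabilities are effectively preserved; training is up to 6.46 times faster than existing methods.
Insight: The contribution is identifying the critical layers (the first and last layers) and training only those on the target language, which sharply reduces both the trainable parameters and the total parameters computed in the forward pass, and then integrating them back into the model with a brief layer-alignment fine-tuning step. This is a computationally efficient continual pretraining strategy.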
Abstract: We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, are detached from the original MLLM and trained with the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.
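A small sketch of the layer-selection idea in stage (1), assuming the detached subset is the first k and last k transformer layers. The layer count, k, and the per-layer parameter sizes are hypothetical, since the abstract does not state how many layers are trained.

```python
from typing import List


def select_edge_layers(num_layers: int, k: int = 1) -> List[int]:
    """Indices of the first k and last k layers, the subset assumed to be trained."""
    if k < 0 or 2 * k > num_layers:
        raise ValueError("k must satisfy 0 <= 2k <= num_layers")
    return list(range(k)) + list(range(num_layers - k, num_layers))


def trainable_fraction(params_per_layer: List[int], selected: List[int]) -> float:
    """Share of layer parameters that remain trainable when only `selected` layers are updated."""
    total = sum(params_per_layer)
    return sum(params_per_layer[i] for i in selected) / total


# Hypothetical 32-layer model with equally sized layers, training 2 layers at each end.
layers = select_edge_layers(32, k=2)
print(layers, trainable_fraction([10] * 32, layers))
```

Because only the detached layers are run during ELO Pretraining, the same subset also bounds the forward-pass cost, which is where the memory and speed savings described in the abstract come from.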
[27] SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation cs.CLPDF
Gengyang Li, Wang Cai, Yifeng Gao, Yunfang Wu
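TL;DR: SyncThink is a training-free, plug-and-play decoding method that reduces the inference overhead of chain-of-thought prompting. It monitors the model's own reasoning-transition signal and terminates generation early once reasoning saturates, markedly reducing generated tokens and latency while maintaining accuracy.
Details
Motivation: Chain-of-thought prompting produces long, redundant reasoning traces that substantially increase inference cost.
Result: On GSM8K, MMLU, GPQA, and BBH with three DeepSeek-R1 distilled models, SyncThink achieves 62.00% average Top-1 accuracy with 656 generated tokens and 28.68 s latency, whereas full chain-of-thought decoding needs 2141 tokens and 92.01 s for 61.22% accuracy. On long-horizon tasks such as GPQA, SyncThink gains up to 8.1 absolute accuracy points by preventing over-thinking.
Insight: The contribution is the finding that answer tokens attend weakly to early reasoning and instead concentrate on the special token “/think”, which forms an information bottleneck. Building on this, SyncThink uses the model's own reasoning-transition signal as the termination condition, giving a lightweight, training-free optimization that leaves model weights unchanged and balances inference efficiency with accuracy.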
Abstract: Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token “/think”, indicating an information bottleneck. Building on this observation, SyncThink monitors the model’s own reasoning-transition signal and terminates reasoning. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 absolute accuracy by preventing over-thinking.
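The efficiency figures quoted in the abstract can be turned into relative savings with a few lines of arithmetic. The snippet below uses only the reported averages; the variable and function names are illustrative.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# Averages reported in the abstract.
FULL_COT = {"accuracy": 61.22, "tokens": 2141, "latency_s": 92.01}
SYNCTHINK = {"accuracy": 62.00, "tokens": 656, "latency_s": 28.68}

token_saving = relative_reduction(FULL_COT["tokens"], SYNCTHINK["tokens"])
latency_saving = relative_reduction(FULL_COT["latency_s"], SYNCTHINK["latency_s"])
speedup = FULL_COT["latency_s"] / SYNCTHINK["latency_s"]
accuracy_delta = SYNCTHINK["accuracy"] - FULL_COT["accuracy"]

print(f"tokens saved:  {token_saving:.1%}")
print(f"latency saved: {latency_saving:.1%}")
print(f"speedup:       {speedup:.2f}x")
print(f"accuracy gain: {accuracy_delta:.2f} points")
```

In other words, the reported averages correspond to roughly 69% fewer generated tokens and about a 3.2x latency speedup, with a 0.78-point higher average accuracy.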
[28] e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings cs.CL | cs.AI | cs.CVPDF
Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou
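TL;DR: This paper proposes e5-omni, a lightweight explicit cross-modal alignment recipe that adapts off-the-shelf vision-language models (VLMs) into robust omni-modal embedding models. It targets inconsistent similarity scales, ineffective in-batch negatives, and mismatched embedding statistics in multimodal embeddings.
Details
Motivation: Modern information systems often involve text, images, video, and audio, which calls for omni-modal embedding models that map these modalities into a shared space for direct comparison. Existing methods, however, rely mainly on the implicit alignment of pretrained VLMs, which causes three common problems: inconsistent similarity score scales, declining effectiveness of in-batch negatives during training, and mismatched embedding statistics across modalities.
Result: Experiments on MMEB-V2 and AudioCaps show that e5-omni delivers consistent gains over strong bi-modal and omni-modal baselines, and the recipe transfers well to other VLM backbones.
Insight: The contribution is an explicit alignment recipe with three simple components: modality-aware temperature calibration to align similarity scales; a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives; and batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Viewed objectively, the method systematically addresses several key engineering challenges in multimodal embedding alignment and is lightweight and transferable.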
Abstract: Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
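One plausible reading of component (1), modality-aware temperature calibration, is to divide each similarity by a temperature that depends on the candidate's modality. The sketch below illustrates that reading only; the temperature values and function names are hypothetical, and the paper's actual parameterization may differ.

```python
import math
from typing import Dict, Sequence

# Hypothetical per-modality temperatures; in practice these would be learned or tuned.
TEMPERATURE: Dict[str, float] = {"text": 0.05, "image": 0.07, "audio": 0.10}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def calibrated_logit(query: Sequence[float], item: Sequence[float], item_modality: str) -> float:
    """Similarity logit scaled by the temperature assigned to the item's modality."""
    return cosine(query, item) / TEMPERATURE[item_modality]


query = [1.0, 0.0]
print(calibrated_logit(query, [2.0, 0.0], "text"))
print(calibrated_logit(query, [2.0, 0.0], "audio"))
```

The same cosine similarity yields different logits per modality, which is the lever such a calibration uses to bring scores from modalities with different sharpness onto a comparable scale.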
[29] DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management cs.CLPDF
Zhitong Chen, Kai Yin, Xiangjue Dong, Chengkai Liu, Xiangpeng Li
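TL;DR: This paper introduces DisastQA, a large-scale benchmark for evaluating question answering in disaster management, containing 3,000 rigorously verified questions spanning eight disaster types. The benchmark is built through a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage, and supports evaluation conditions ranging from closed-book to noisy evidence integration, which separates a model's internal knowledge from its reasoning under imperfect information.
Details
Motivation: Existing QA benchmarks are mostly built on clean evidence and cannot effectively assess the ability to handle uncertain and conflicting information in disaster management. A dedicated benchmark is therefore needed to test model reasoning in noisy settings for this domain.
Result: Experiments with 20 models show that DisastQA diverges substantially from general-purpose benchmarks such as MMLU-Pro: recent open-weight models approach proprietary systems in clean settings, but performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response.
Insight: The contributions include: a large-scale disaster management QA benchmark with balanced coverage, built through human-LLM collaboration; evaluation conditions based on noisy evidence that separate knowledge recall from reasoning; and a human-verified keypoint-based evaluation protocol for open-ended QA that emphasizes factual completeness over verbosity. The benchmark is an important tool for assessing model robustness in real disaster scenarios.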
Abstract: Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large-scale benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at https://github.com/TamuChen18/DisastQA_open.
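To illustrate why a keypoint-based protocol rewards factual completeness rather than length, here is a minimal sketch that scores an open-ended answer by the fraction of reference key points it covers. Case-insensitive substring matching is a stand-in assumption; the benchmark's actual matching procedure is not described in the abstract, and the example key points are invented.

```python
from typing import Sequence


def keypoint_recall(answer: str, keypoints: Sequence[str]) -> float:
    """Fraction of reference key points that appear in the answer (case-insensitive)."""
    if not keypoints:
        raise ValueError("at least one key point is required")
    text = answer.lower()
    hits = sum(1 for kp in keypoints if kp.lower() in text)
    return hits / len(keypoints)


# Hypothetical key points for a flood-response question.
keypoints = ["evacuate", "higher ground", "avoid floodwater"]
short_answer = "Evacuate immediately and move to higher ground."
padded_answer = short_answer + " Floods are a serious hazard in many regions of the world."

print(keypoint_recall(short_answer, keypoints))
print(keypoint_recall(padded_answer, keypoints))
```

Padding the answer with extra sentences leaves the score unchanged unless the added text covers a missing key point.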
[30] From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs cs.CL | cs.AIPDF
Shaojie Wang, Liang Zhang
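TL;DR: This paper proposes First-Step Logical Reasoning (FSLR), a lightweight training framework that addresses the weak understanding of logical relationships that large language models show in mathematical reasoning. The method trains the model specifically to identify the variables and operations in a problem, which provides explicit supervision for logical relationships and markedly improves logical reasoning.
Details
Motivation: Large language models have limited logical reasoning ability in mathematical problem solving and rely mainly on pattern matching and memorization. Errors related to logical relationship understanding account for over 90% of incorrect predictions, and conventional chain-of-thought supervised fine-tuning fails to reduce such errors effectively.
Result: Extensive experiments across multiple models and datasets show that FSLR consistently outperforms CoT-SFT in both in-distribution and out-of-distribution settings, with average improvements of 3.2% and 4.6%, respectively. FSLR also trains 4-6x faster and reduces training token consumption by over 80%.
Insight: The core contribution is decoupling logical relationship understanding from complete solution trajectories: supervising only the first planning step provides explicit logical supervision, which is more effective and efficient than the implicit embedding of logical relationships in conventional methods.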
Abstract: Recent studies reveal that large language models (LLMs) exhibit limited logical reasoning abilities in mathematical problem-solving, instead often relying on pattern-matching and memorization. We systematically analyze this limitation, focusing on logical relationship understanding, which is a core capability underlying genuine logical reasoning, and reveal that errors related to this capability account for over 90% of incorrect predictions, with Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) failing to substantially reduce these errors. To address this bottleneck, we propose First-Step Logical Reasoning (FSLR), a lightweight training framework targeting logical relationship understanding. Our key insight is that the first planning step (identifying which variables to use and which operation to apply) encourages the model to derive logical relationships directly from the problem statement. By training models on this isolated step, FSLR provides explicit supervision for logical relationship understanding, unlike CoT-SFT, which implicitly embeds such relationships within complete solution trajectories. Extensive experiments across multiple models and datasets demonstrate that FSLR consistently outperforms CoT-SFT under both in-distribution and out-of-distribution settings, with average improvements of 3.2% and 4.6%, respectively. Moreover, FSLR achieves 4-6x faster training and reduces training token consumption by over 80%.
[31] AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions cs.CLPDF
Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong
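TL;DR: This paper introduces AirNav, a large-scale UAV vision-and-language navigation dataset built from real urban aerial data, which addresses the reliance on virtual environments, unnatural instructions, and limited scale of existing datasets. The authors also propose the AirVLN-R1 model, which combines supervised fine-tuning and reinforcement fine-tuning to improve performance and generalization, and validate it with preliminary real-world tests.
Details
Motivation: To address the reliance on virtual environments, unnatural instructions, and small scale of existing UAV vision-and-language navigation datasets, and to advance applications in real-world scenarios.
Result: Real-world tests provide a preliminary validation of the model's feasibility, but the abstract reports no specific quantitative results (such as benchmark performance or comparisons with the state of the art).
Insight: The contribution is constructing the first large-scale UAV VLN dataset based on real aerial data with natural and diverse instructions, together with the AirVLN-R1 model, which combines supervised and reinforcement fine-tuning to improve generalization.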
Abstract: Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets face issues such as dependence on virtual environments, lack of naturalness in instructions, and limited scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data, rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization. The feasibility of the model is preliminarily evaluated through real-world tests. Our dataset and code are publicly available.
[32] Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR cs.CL | cs.CVPDF
Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan
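TL;DR: This paper analyzes DeepSeek-OCR in depth to determine whether its high performance comes from genuine visual recognition or from reliance on language priors. Sentence-level and word-level semantic corruption experiments show that once linguistic support is removed, performance drops from about 90% to about 20%, indicating heavy reliance on language priors. A comparison with 13 baseline models shows that traditional pipeline OCR methods are more robust to semantic perturbations than end-to-end methods. The study also finds that fewer visual tokens mean stronger reliance on priors and higher hallucination risk, and that the model collapses completely at around 10,000 text tokens, suggesting that current optical compression techniques may aggravate rather than relieve the long-context bottleneck.
Details
Motivation: To identify what actually drives DeepSeek-OCR's claimed performance (high-ratio vision-text compression via optical 2D mapping, decoding more than ten times as many text tokens as input visual tokens): genuine visual recognition (visual merit) or reliance on language priors (linguistic crutch), and thereby to clarify whether it can really address the long-context bottleneck of LLMs.
Result: Without support from language priors, DeepSeek-OCR's performance drops sharply from about 90% to about 20%. In a comparative benchmark with 13 baseline models, traditional pipeline OCR methods show significantly higher robustness to semantic perturbations than end-to-end methods such as DeepSeek-OCR. Context stress testing further reveals a total model collapse at around 10,000 text tokens.
Insight: The paper's claimed contribution is a set of semantic corruption experiments that isolate and quantify the contributions of the model's intrinsic OCR ability and its language priors, which empirically defines DeepSeek-OCR's capability boundaries. Viewed objectively, the key insights are: 1) some current high-compression end-to-end vision-text models may rely excessively on language priors rather than genuine visual understanding; 2) over-compressing visual tokens increases reliance on priors and raises hallucination risk; 3) poorly designed optical compression may aggravate rather than relieve the long-context bottleneck. This points to an important direction for future vision-text compression work: balancing compression efficiency, visual robustness, and reliance on language priors.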
Abstract: DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: “Visual merit or linguistic crutch - which drives DeepSeek-OCR’s performance?” By employing sentence-level and word-level semantic corruption, we isolate the model’s intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR’s performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR’s capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
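The abstract does not say how the sentence-level and word-level corruptions are constructed. One simple way to remove linguistic support while keeping the rendered characters identical is sketched below, assuming that sentence-level corruption shuffles word order and word-level corruption shuffles the characters inside each word. Both functions and the sample text are illustrative assumptions.

```python
import random


def corrupt_sentence_level(text: str, seed: int = 0) -> str:
    """Shuffle word order so words stay valid but the sentence loses its meaning."""
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def corrupt_word_level(text: str, seed: int = 0) -> str:
    """Shuffle characters within each word so lexical priors no longer help."""
    rng = random.Random(seed)
    corrupted = []
    for word in text.split():
        chars = list(word)
        rng.shuffle(chars)
        corrupted.append("".join(chars))
    return " ".join(corrupted)


sample = "flood waters rising near the river bank"
print(corrupt_sentence_level(sample, seed=1))
print(corrupt_word_level(sample, seed=1))
```

Rendering the corrupted strings as images and comparing OCR accuracy against the clean text would then separate what a model reads from what it guesses.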
[33] MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation cs.CLPDF
Jin Cui, Jiaqi Guo, Jiepeng Zhou, Ruixuan Yang, Jiayi Lu
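TL;DR: This paper proposes MIND, a framework that transfers the complex reasoning abilities of large language models to small models through capability-aware multi-perspective chain-of-thought distillation. A “Teaching Assistant” network synthesizes diverse teacher perspectives, and a feedback-driven inertia calibration mechanism aligns supervision with the student's current adaptability, which improves performance and mitigates catastrophic forgetting.
Details
Motivation: When distilling the reasoning ability of large models into small ones, existing methods typically restrict the student to a single golden rationale and treat different reasoning paths independently. As a result, the teacher's optimal rationale can act as out-of-distribution noise, degrading the student's latent reasoning distribution and yielding suboptimal performance.
Result: Extensive experiments show that MIND achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks, and a fine-grained latent space analysis further confirms the mechanism by which reasoning ability is internalized.
Insight: The contribution is a capability-adaptive distillation framework that shifts from passive mimicry to active cognitive construction. A Teaching Assistant network synthesizes multiple perspectives and an inertia calibration mechanism adjusts supervision dynamically, which aligns the supervision signal with the student's capability and improves generalization and resistance to forgetting.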
Abstract: While Large Language Models (LLMs) have emerged with remarkable capabilities in complex tasks through Chain-of-Thought reasoning, practical resource constraints have sparked interest in transferring these abilities to smaller models. However, achieving both domain performance and cross-domain generalization remains challenging. Existing approaches typically restrict students to following a single golden rationale and treat different reasoning paths independently. Due to distinct inductive biases and intrinsic preferences, alongside the student’s evolving capacity and reasoning preferences during training, a teacher’s “optimal” rationale could act as out-of-distribution noise. This misalignment leads to a degeneration of the student’s latent reasoning distribution, causing suboptimal performance. To bridge this gap, we propose MIND, a capability-adaptive framework that transitions distillation from passive mimicry to active cognitive construction. We synthesize diverse teacher perspectives through a novel “Teaching Assistant” network. By employing a Feedback-Driven Inertia Calibration mechanism, this network utilizes inertia-filtered training loss to align supervision with the student’s current adaptability, effectively enhancing performance while mitigating catastrophic forgetting. Extensive experiments demonstrate that MIND achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks, and our sophisticated latent space analysis further confirms the mechanism of reasoning ability internalization.
[34] O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL cs.CL | cs.AIPDF
Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang
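TL;DR: This paper proposes O-Researcher, an open-ended deep research model that uses multi-agent distillation and agentic reinforcement learning to automatically synthesize high-quality, research-grade instruction data and narrow the performance gap between closed-source and open-source large language models. The framework uses a multi-agent workflow to simulate complex tool-integrated reasoning and generate diverse, high-fidelity data end-to-end, and a two-stage training strategy that combines supervised fine-tuning with a novel reinforcement learning method to maximize model alignment and capability. Experiments show that the framework enables open-source models at multiple scales to achieve new state-of-the-art performance on the major deep research benchmark.
Details
Motivation: To close the performance gap between closed-source and open-source large language models that stems from unequal access to high-quality training data, and to give open-source models a scalable and effective path to improvement that does not depend on proprietary data or models.
Result: On the major deep research benchmark, the framework enables open-source models at multiple scales to achieve new state-of-the-art (SOTA) performance, with effectiveness verified through extensive experiments.
Insight: The contributions include a multi-agent workflow that automatically synthesizes research-grade instruction data and a two-stage training strategy that combines supervised fine-tuning with a novel reinforcement learning method. Viewed objectively, the approach offers a scalable and practical way to strengthen open-source models without proprietary resources.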
Abstract: The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.
[35] HearSay Benchmark: Do Audio LLMs Leak What They Hear? cs.CL
Jin Wang, Liang Lin, Kaiwen Luo, Weiliu Wang, Yitian Chen
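TL;DR: This paper presents the first systematic study of whether Audio Large Language Models (ALLMs) leak user privacy through acoustic voiceprints alone, and introduces the HearSay benchmark, built from over 22,000 real-world audio clips. Experiments find significant privacy leakage in ALLMs, severely inadequate existing safeguards, and amplified risk under Chain-of-Thought reasoning.
Details
Motivation: Although ALLMs have made remarkable progress in understanding and generation, their privacy implications remain largely unexplored. The paper investigates whether ALLMs inadvertently leak user privacy through voiceprints alone.
Result: Extensive experiments on HearSay show that ALLMs extract private attributes from voiceprints with up to 92.89% accuracy (on gender) and can effectively profile social attributes; that existing safeguards are severely inadequate, with most models failing to refuse privacy-intruding requests and showing near-zero refusal rates for physiological traits; and that in capable models, Chain-of-Thought reasoning worsens privacy risk by uncovering deeper acoustic correlations.
Insight: The contribution is the first comprehensive benchmark (HearSay) for evaluating privacy leakage in ALLMs, together with experiments that expose three key vulnerabilities in their privacy protection. Viewed objectively, the study provides empirical evidence and an evaluation tool for privacy alignment and stresses that privacy safeguards must be strengthened as model capability grows.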
Abstract: While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings. (1) Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. (2) Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. (3) Reasoning Amplifies Risk: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The codes and dataset are available at https://github.com/JinWang79/HearSay_Benchmark
[36] Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents cs.CL | cs.AI
Dehao Tao, Guoliang Ma, Yongfeng Huang, Minghu Jiang
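TL;DR: The paper proposes Membox, a hierarchical memory architecture that addresses the difficulty LLM agents have in preserving topic continuity over long dialogues. Its core is the Topic Loom module, which monitors the dialogue in real time with a sliding window and groups consecutive same-topic turns into coherent “memory boxes” for storage; a Trace Weaver module then links these boxes into long-range event-timeline traces.
Details
Motivation: Existing LLM agent memory systems typically follow a “fragmentation-compensation” paradigm: they split the dialogue stream into isolated utterances for storage, then try to restore coherence through embedding-based retrieval. This irreversibly damages the narrative and causal flow and biases retrieval toward lexical similarity, so the topic continuity common in human dialogue is not preserved.
Result: Experiments on the LoCoMo benchmark show that Membox achieves up to a 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines such as Mem0 and A-MEM. Membox also uses only a fraction of the context tokens that existing methods require, giving a better balance between efficiency and effectiveness.
Insight: The main contribution is a cognitively motivated mechanism that models topic continuity explicitly. By building “memory boxes” and weaving “timeline traces”, it preserves narrative coherence at storage time instead of compensating afterwards, improving both the coherence of long-range memory and storage/retrieval efficiency.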
Abstract: Human-agent dialogues often exhibit topic continuity (a stable thematic frame that evolves through temporally adjacent exchanges), yet most large language model (LLM) agent memory systems fail to preserve it. Existing designs follow a fragmentation-compensation paradigm: they first break dialogue streams into isolated utterances for storage, then attempt to restore coherence via embedding-based retrieval. This process irreversibly damages narrative and causal flow, while biasing retrieval towards lexical similarity. We introduce Membox, a hierarchical memory architecture centered on a Topic Loom that continuously monitors dialogue in a sliding-window fashion, grouping consecutive same-topic turns into coherent “memory boxes” at storage time. Sealed boxes are then linked by a Trace Weaver into long-range event-timeline traces, recovering macro-topic recurrences across discontinuities. Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e.g., Mem0, A-MEM). Notably, Membox attains these gains while using only a fraction of the context tokens required by existing methods, highlighting a superior balance between efficiency and effectiveness. By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism for enhancing both coherence and efficiency in LLM agents.
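The storage-time grouping described in this abstract can be illustrated with a minimal sketch. It is not the authors' Topic Loom: the word-overlap test, the `window` and `threshold` parameters, and the names `MemoryBox` and `group_turns` are all assumptions chosen to keep the example self-contained, and a real system would presumably use a model-based topic judgment.

```python
from dataclasses import dataclass, field


def jaccard(a: set, b: set) -> float:
    """Word-overlap similarity; a stand-in for a real topic classifier."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class MemoryBox:
    """A run of consecutive turns judged to share one topic."""
    turns: list = field(default_factory=list)


def group_turns(turns, window=2, threshold=0.2):
    """Group consecutive same-topic turns into boxes (hypothetical rule)."""
    boxes = []
    current = MemoryBox()
    for turn in turns:
        words = set(turn.lower().split())
        # Compare against the words in the last `window` turns of the open box.
        recent = set()
        for prev in current.turns[-window:]:
            recent |= set(prev.lower().split())
        if current.turns and jaccard(words, recent) < threshold:
            boxes.append(current)  # seal the box: the topic changed
            current = MemoryBox()
        current.turns.append(turn)
    if current.turns:
        boxes.append(current)
    return boxes


dialogue = [
    "my dog likes the park",
    "the dog ran in the park",
    "stock prices fell today",
    "prices fell again today",
]
boxes = group_turns(dialogue)
```

Here the two dog turns end up in one box and the two stock turns in another. Linking sealed boxes into timeline traces (the Trace Weaver step) is not covered by this sketch.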
[37] NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning cs.CL | cs.AI
Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka
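TL;DR: This paper proposes NeoAMT, an agentic framework for machine translation of text containing neologisms. The framework is built on a Wiktionary search tool, trains the translation agent with reinforcement learning, and introduces a training framework with a novel reward design and an adaptive rollout generation approach to improve translation quality.
Details
Motivation: Translating source sentences that contain neologisms remains underexplored compared with general machine translation; the paper aims to fill this gap.
Result: The authors build a neologism-aware machine translation dataset covering 16 languages and 75 translation directions, and construct the search tool's retrieval corpus from about 3 million cleaned Wiktionary records; these are used to train the translation agent and evaluate its accuracy.
Insight: The contributions are: 1) a large-scale multilingual neologism translation dataset; 2) an agentic framework that incorporates a Wiktionary search tool; 3) an RL training framework with a novel reward design and an adaptive rollout generation approach based on “translation difficulty” to improve translation quality.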
Abstract: Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation using a Wiktionary search tool. Specifically, we first create a new dataset for neologism-aware machine translation and develop a search tool based on Wiktionary. The new dataset covers 16 languages and 75 translation directions and is derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search tool is also constructed from around 3 million cleaned records of the Wiktionary dump. We then use it for training the translation agent with reinforcement learning (RL) and evaluating the accuracy of neologism-aware machine translation. Based on this, we also propose an RL training framework that contains a novel reward design and an adaptive rollout generation approach by leveraging “translation difficulty” to further improve the translation quality of translation agents using our search tool.
[38] Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework cs.CL | cs.AI
Xiaoyu Luo, Yiyi Chen, Qiongxiu Li, Johannes Bjerva
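TL;DR: This paper re-evaluates Personally Identifiable Information (PII) leakage in large language models (LLMs) and proposes Cue-Resistant Memorization (CRM), a cue-controlled evaluation framework. It finds that previously reported PII leakage stems mainly from behavior driven by surface-form cues in the prompt, not from genuine memorization; when these cues are controlled for, reconstruction success drops substantially.
Details
Motivation: Existing work often treats successful PII reconstruction as evidence of memorization. The paper argues that PII leakage should be evaluated under low lexical cue conditions, to avoid misattributing prompt-induced generalization or pattern completion to memorization, and thereby to quantify privacy-relevant memorization more reliably.
Result: In a large-scale multilingual re-evaluation across 32 languages and multiple memorization paradigms, reconstruction-based settings (such as verbatim prefix-suffix completion and associative reconstruction) show substantially lower success once cues are controlled for; cue-free generation and membership inference have extremely low true positive rates, indicating that previously reported PII leakage is mainly cue-driven.
Insight: The contribution is Cue-Resistant Memorization (CRM), framed as a necessary condition for valid memorization evaluation, which separates genuine memorization from cue-driven behavior by explicitly controlling for prompt-target overlap cues. It offers a methodology for reliably quantifying privacy-relevant memorization and challenges the assumptions behind conventional interpretations of PII leakage.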
Abstract: Large Language Models (LLMs) have been reported to “leak” Personally Identifiable Information (PII), with successful PII reconstruction often interpreted as evidence of memorization. We propose a principled revision of memorization evaluation for LLMs, arguing that PII leakage should be evaluated under low lexical cue conditions, where target PII cannot be reconstructed through prompt-induced generalization or pattern completion. We formalize Cue-Resistant Memorization (CRM) as a cue-controlled evaluation framework and a necessary condition for valid memorization evaluation, explicitly conditioning on prompt-target overlap cues. Using CRM, we conduct a large-scale multilingual re-evaluation of PII leakage across 32 languages and multiple memorization paradigms. Revisiting reconstruction-based settings, including verbatim prefix-suffix completion and associative reconstruction, we find that their apparent effectiveness is driven primarily by direct surface-form cues rather than by true memorization. When such cues are controlled for, reconstruction success diminishes substantially. We further examine cue-free generation and membership inference, both of which exhibit extremely low true positive rates. Overall, our results suggest that previously reported PII leakage is better explained by cue-driven behavior than by genuine memorization, highlighting the importance of cue-controlled evaluation for reliably quantifying privacy-relevant memorization in LLMs.
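To make the idea of conditioning on prompt-target overlap concrete, the sketch below scores how much of a target string is already present in the prompt and separates low-cue from high-cue samples. The token-overlap score, the `max_overlap` cutoff, and the function names are illustrative assumptions, not the paper's CRM definition.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; punctuation acts as a separator."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cue_overlap(prompt, target):
    """Fraction of target tokens that already appear in the prompt."""
    target_tokens = tokens(target)
    if not target_tokens:
        return 0.0
    prompt_tokens = set(tokens(prompt))
    hits = sum(1 for t in target_tokens if t in prompt_tokens)
    return hits / len(target_tokens)


def split_by_cue(samples, max_overlap=0.0):
    """Separate (prompt, target) pairs into low-cue and high-cue groups."""
    low, high = [], []
    for prompt, target in samples:
        bucket = low if cue_overlap(prompt, target) <= max_overlap else high
        bucket.append((prompt, target))
    return low, high


samples = [
    ("Contact John Smith at", "john.smith@example.com"),  # name cues the email
    ("The phone number is", "555-0100"),                  # no surface cue
]
low_cue, high_cue = split_by_cue(samples)
```

Under this toy score, only reconstructions in the low-cue group would be candidate evidence of memorization; a success in the high-cue group could be explained by pattern completion from the prompt.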
[39] VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation cs.CL
Huynh Trung Kiet, Dao Sy Duy Minh, Nguyen Dinh Ha Duong, Le Hoang Minh Huy, Long Nguyen
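TL;DR: This paper introduces VietMed-MCQ, a multiple-choice question dataset for evaluating Vietnamese Traditional Medicine, synthesized with a framework that combines retrieval-augmented generation with automated consistency checking. The dataset contains 3,190 questions across three difficulty levels and was validated by one medical expert and four students. The paper evaluates seven open-source LLMs on it and finds that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, though all models still struggle with complex diagnostic reasoning.
Details
Motivation: LLM performance degrades significantly in specialized, culturally specific, low-resource domains such as Vietnamese Traditional Medicine because high-quality structured benchmarks are lacking.
Result: The dataset received a 94.2% approval rate with high inter-rater agreement (Fleiss’ kappa = 0.82). On the VietMed-MCQ benchmark, general-purpose models with strong Chinese priors (such as Qwen2.5) outperform Vietnamese-centric models (such as Vinallama), but all models remain below 70% accuracy on complex diagnostic reasoning tasks.
Insight: The contribution is a data synthesis framework that combines RAG with dual-model validation (for answer consistency checking) to produce high-quality evaluation data for a low-resource domain. Viewed objectively, its emphasis on cross-lingual conceptual transfer (for example, Chinese prior knowledge helping with Vietnamese Traditional Medicine) and its exposure of model limits in complex reasoning are instructive for LLM evaluation in low-resource domains.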
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss’ kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
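For readers unfamiliar with the agreement statistic quoted above, here is a short sketch of the standard Fleiss' kappa computation. The rating matrix is invented for illustration (five raters, two categories, four items) and is not the paper's data; it also assumes every item was rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumes the same number of raters per item
    n_categories = len(counts[0])

    # Per-item agreement P_i: share of rater pairs that agree on item i.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement P_e from the overall category proportions p_j.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    return (p_bar - p_e) / (1 - p_e)


# Columns: [approve, reject]; each row sums to 5 raters (illustrative data).
ratings = [[5, 0], [4, 1], [0, 5], [1, 4]]
kappa = fleiss_kappa(ratings)
```

Kappa is 1 when all raters agree on every item and 0 when agreement equals what the category proportions alone would produce, which is why it is reported alongside the raw approval rate.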
[40] Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning cs.CL
Fei Wu, Zhenrong Zhang, Qikai Chang, Jianshu Zhang, Quan Liu
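TL;DR: This paper proposes Step Potential Advantage Estimation (SPAE), a method for improving reinforcement learning of large language models on mathematical reasoning. It extracts the confidence and correctness of intermediate steps to build a Step Potential signal that assesses the reasoning state at a fine grain, which refines advantage estimation, improves accuracy, and shortens responses.
Details
Motivation: Existing reinforcement learning methods with verifiable rewards estimate advantages too coarsely for long chain-of-thought reasoning. They lack a semantically grounded, step-level measure of reasoning progress, so models cannot distinguish necessary deduction from redundant verification and may even overturn a correct trajectory into an incorrect answer.
Result: Experiments on multiple benchmarks show that SPAE improves accuracy while substantially reducing response length, outperforming existing RL baselines as well as recent efficient-reasoning and token-level advantage estimation methods.
Insight: The contribution is a training-free probing mechanism that obtains intermediate confidence and correctness and combines them into a Step Potential signal, enabling fine-grained credit assignment that encourages timely termination and compensates for the lack of process supervision.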
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs), but outcome-based rewards lead to coarse-grained advantage estimation. While existing approaches improve RLVR via token-level entropy or sequence-level length control, they lack a semantically grounded, step-level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training-free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies penalty after potential saturates to encourage timely termination. Experiments across multiple benchmarks show SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines and recent efficient reasoning and token-level advantage estimation methods. The code is available at https://github.com/cii030/SPAE-RL.
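The abstract names three behaviours (amplify potential gains, penalize potential drops, penalize steps taken after saturation) without giving formulas. The sketch below shows one simple way such shaping could look. The signed-confidence potential, the potential-difference bonus, and the `alpha`, `beta`, and `saturation` parameters are all assumptions made here for illustration, not SPAE's actual definitions.

```python
def step_potential(confidence, correct):
    """Hypothetical potential: signed confidence in [-1, 1]."""
    return confidence if correct else -confidence


def shaped_advantages(outcome_advantage, potentials,
                      alpha=0.5, beta=0.1, saturation=0.9):
    """Add a per-step term to a single outcome-level advantage.

    A rise in potential earns a bonus, a fall earns a penalty, and every
    step taken after the potential first reaches `saturation` pays `beta`.
    """
    shaped = []
    previous = 0.0
    saturated = False
    for phi in potentials:
        bonus = alpha * (phi - previous)  # positive on gain, negative on drop
        penalty = beta if saturated else 0.0
        shaped.append(outcome_advantage + bonus - penalty)
        saturated = saturated or phi >= saturation
        previous = phi
    return shaped


# Four reasoning steps: a wrong guess, a correct partial result, a confident
# correct answer, then a redundant re-check that changes nothing.
potentials = [
    step_potential(0.4, False),
    step_potential(0.5, True),
    step_potential(0.95, True),
    step_potential(0.95, True),
]
advantages = shaped_advantages(1.0, potentials)
```

In this toy run the second step, which recovers from a wrong guess, receives the largest credit, and the redundant fourth step is credited less than the outcome advantage alone, which is the direction of pressure toward timely termination that the abstract describes.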
[41] Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search cs.CL
Yu Guo, Shenghao Ye, Shuangwu Chen, Zijian Wen, Tao Zhang
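TL;DR: This paper proposes TabTrim, a new table pruning framework for Table Question Answering (TableQA). It recasts table pruning from the conventional sequential revision process driven by unreliable critique signals into a parallel search supervised by gold trajectories, with the goal of extracting compact sub-tables that retain answer-critical information and so improve downstream reasoning.
Details
Motivation: Existing table pruning methods usually rely on sequential revisions whose critique signals are unreliable, so answer-critical data is easily lost. The paper aims to develop a more reliable pruning method that preserves key information.
Result: In extensive experiments, TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks. TabTrim-8B reaches 73.5% average accuracy, 3.2% above the strongest baseline, including 79.4% on WikiTQ and 61.2% on TableBench.
Insight: The main contribution is reframing table pruning as a parallel search problem supervised by gold trajectories. Specifically: 1) a gold pruning trajectory is built from the intermediate sub-tables produced while executing gold SQL queries; 2) a pruner and a verifier are trained so that step-wise pruning results align with the gold trajectory; 3) at inference, parallel search explores multiple candidate pruning trajectories and selects the optimal sub-table. This gives a more robust supervision and search paradigm for pruning tasks that are sensitive to data loss.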
Abstract: Table Question Answering (TableQA) benefits significantly from table pruning, which extracts compact sub-tables by eliminating redundant cells to streamline downstream reasoning. However, existing pruning methods typically rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data. To address this limitation, we propose TabTrim, a novel table pruning framework which transforms table pruning from sequential revisions to gold trajectory-supervised parallel search. TabTrim derives a gold pruning trajectory using the intermediate sub-tables in the execution process of gold SQL queries, and trains a pruner and a verifier to make the step-wise pruning result align with the gold pruning trajectory. During inference, TabTrim performs parallel search to explore multiple candidate pruning trajectories and identify the optimal sub-table. Extensive experiments demonstrate that TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, outperforming the strongest baseline by 3.2%, including 79.4% on WikiTQ and 61.2% on TableBench.
[42] What Matters For Safety Alignment? cs.CL | cs.AI | cs.CR
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong
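TL;DR: This paper presents a comprehensive empirical study of safety alignment in large language models (LLMs) and large reasoning models (LRMs), evaluating the intrinsic model characteristics and external attack techniques that matter most. The study covers 32 recent, popular models on five safety datasets, using 56 jailbreak techniques and four chain-of-thought attack strategies, for a total of 4.6 million API calls.
Details
Motivation: The goal is to evaluate the safety alignment of LLMs and LRMs and identify the key influencing factors, providing insights for building more secure and reliable AI systems.
Result: LRMs with integrated reasoning and self-reflection (GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B) are the safest; post-training and knowledge distillation may systematically degrade safety alignment; a chain-of-thought attack via a response prefix raises the attack success rate by 3.34x on average, and from 0.6% to 96.3% on Seed-OSS-36B-Instruct.
Insight: The contribution is a systematic comparison of six intrinsic model characteristics and three external attack techniques. It reveals the safety advantage of reasoning mechanisms, the importance of treating safety as an explicit constraint during post-training, the high risk posed by response prefixes in text-completion interfaces, and that roleplay, prompt injection, and gradient-based adversarial prompt search are the predominant ways of eliciting unaligned behavior.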
Abstract: This paper presents a comprehensive empirical study of the safety alignment capabilities of LLMs and LRMs. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.
[43] Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning cs.CL
Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen
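TL;DR: This paper proposes the ATLAS framework, which addresses the high-dimensional optimization challenge of dynamically selecting the best model-tool combination when integrating large language models with external tools. It takes a dual-path approach, training-free cluster-based routing and RL-based multi-step routing, to improve complex cross-domain reasoning.
Details
Motivation: As LLMs and tools grow more diverse, existing methods usually rely on a single model or fixed tool-calling logic and fail to exploit performance differences across heterogeneous model-tool pairs, so a framework that dynamically selects the best combination is needed.
Result: Experiments on 15 benchmarks show that ATLAS improves on existing routing methods by 10.1% on in-distribution tasks and 13.1% on out-of-distribution tasks, outperforms closed-source models such as GPT-4o, and achieves significant gains on visual reasoning by orchestrating specialized multi-modal tools.
Insight: The contribution is a dual-path adaptive framework that combines training-free routing based on empirical priors with RL-based routing that explores autonomously, enabling dynamic, coordinated invocation of heterogeneous models and tools and improving generalization in cross-domain reasoning.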
Abstract: The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
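The training-free path can be pictured as a lookup: assign the query to its nearest domain cluster, then pick the model-tool pair with the best recorded accuracy in that cluster. The sketch below assumes exactly that reading. The two-dimensional embeddings, centroid values, accuracy numbers, and model and tool names are all hypothetical placeholders, not taken from the paper.

```python
import math


def nearest_cluster(embedding, centroids):
    """Name of the centroid closest to the query embedding (Euclidean)."""
    return min(centroids, key=lambda name: math.dist(embedding, centroids[name]))


def route(embedding, centroids, accuracy_table):
    """Pick the model-tool pair with the best recorded accuracy in the
    query's cluster. `accuracy_table` plays the role of empirical priors."""
    cluster = nearest_cluster(embedding, centroids)
    pairs = accuracy_table[cluster]
    return max(pairs, key=pairs.get)


# Hypothetical priors gathered offline on held-out queries per cluster.
centroids = {"math": (1.0, 0.0), "vision": (0.0, 1.0)}
accuracy_table = {
    "math": {("model_a", "calculator"): 0.71, ("model_b", "python"): 0.83},
    "vision": {("model_c", "ocr"): 0.64, ("model_a", "ocr"): 0.58},
}

choice = route((0.9, 0.2), centroids, accuracy_table)
```

A table lookup of this kind needs no gradient updates, which fits the "training-free" label, but it can only choose among pairs already measured in a cluster; the abstract assigns out-of-distribution generalization to the separate RL-based path.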
[44] When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question-Answering cs.CL
Hugh Mee Wong, Rick Nouwen, Albert Gatt
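TL;DR: This paper studies the internal mechanisms language models use for multiple-choice question answering (MCQA) and finds a two-stage computation: the model first selects the correct answer in content space, then binds that answer to the corresponding output symbol.
Details
Motivation: MCQA conflates reasoning errors with symbol-binding failures; the paper investigates how models internally separate solving the problem from emitting the answer symbol.
Result: Using principal component analysis, linear probes, and causal interventions, the authors find that residual states at option boundaries (newlines) contain linearly decodable signals of per-option correctness, that the winning content position is decodable immediately after the final option is processed, and that the output symbol is represented only near the answer emission position.
Insight: The work reveals that language models may use a decide-then-bind two-stage mechanism in MCQA, offering a new perspective on internal representations and output generation and potentially informing more robust evaluation design.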
Abstract: Multiple-choice question answering (MCQA) is easy to evaluate but adds a meta-task: models must both solve the problem and output the symbol that represents the answer, conflating reasoning errors with symbol-binding failures. We study how language models implement MCQA internally using representational analyses (PCA, linear probes) as well as causal interventions. We find that option-boundary (newline) residual states often contain strong linearly decodable signals related to per-option correctness. Winner-identity probing reveals a two-stage progression: the winning content position becomes decodable immediately after the final option is processed, while the output symbol is represented closer to the answer emission position. Tests under symbol and content permutations support a two-stage mechanism in which models first select a winner in content space and then bind or route that winner to the appropriate symbol to emit.
[45] InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training cs.CL | cs.AI | cs.CV
Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li
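TL;DR: This paper presents InfiniteWeb, a system that automatically generates functional web environments at scale for training GUI agents. Through unified specification, task-centric test-driven development, and a combination of website seeds with reference design images to ensure diversity, it tackles the challenges LLMs face in generating multi-page, interconnected websites, and it generates verifiable task evaluators that provide dense reward signals for reinforcement learning.
Details
Motivation: Training GUI agents is hindered by a scarcity of suitable environments. Existing LLMs do well at generating a single webpage, but building realistic, functional websites with many interconnected pages remains challenging.
Result: Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on its generated environments achieve significant improvements on the OSWorld and Online-Mind2Web benchmarks, demonstrating the system's effectiveness.
Insight: The contributions are ensuring website functionality and diversity through unified specification and task-centric test-driven development, and generating verifiable task evaluators to support reinforcement learning, providing a scalable environment synthesis approach for GUI agent training.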
Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages remains challenging. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seed with reference design image to ensure diversity. Our system also generates verifiable task evaluators enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.
[46] Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models cs.CL
Haeun Jang, Hwan Chang, Hwanhee Lee
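TL;DR: This paper introduces the Doc-PP benchmark for evaluating how well large vision-language models comply with document disclosure policies, and reveals a “Reasoning-Induced Safety Gap” that appears under complex cross-modal reasoning. To address it, the authors propose DVA (Decompose-Verify-Aggregation), an inference framework that decouples reasoning from policy verification; experiments show it effectively improves policy compliance.
Details
Motivation: Existing safety research on large vision-language models focuses mainly on implicit social norms or text-only settings and overlooks dynamic, user-defined disclosure policies in multimodal documents, which limits deployment in real-world document question answering.
Result: Evaluation on Doc-PP shows that existing models frequently leak sensitive information when complex cross-modal reasoning is required. The proposed DVA framework significantly outperforms standard prompting defenses, providing a robust baseline for policy-compliant document understanding.
Insight: The contribution is the first benchmark focused on policy compliance in multimodal documents (Doc-PP) and a systematic account of the “Reasoning-Induced Safety Gap”. The DVA framework, by structurally decoupling reasoning from verification, offers a new approach to fine-grained safety constraints in multimodal settings.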
Abstract: The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding.
[47] Large-Scale Aspect-Based Sentiment Analysis with Reasoning-Infused LLMs cs.CL | cs.AI
Paweł Liskowski, Krzysztof Jankowski
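TL;DR: This paper introduces Arctic-ABSA, a collection of powerful models for real-world aspect-based sentiment analysis (ABSA). Tailored to commercial needs, the models are trained on a large corpus of public data plus carefully generated synthetic data, yielding a dataset 20 times larger than SemEval14. They extend typical ABSA models by expanding the sentiment classes from three to five, jointly predicting overall text sentiment, and supporting multiple languages. The authors fine-tune on Chain-of-Thought examples and introduce a novel reasoning pretraining technique for encoder-only models that significantly improves downstream fine-tuning and generalization.
Details
Motivation: The work addresses the needs of aspect-based sentiment analysis in real commercial settings, extending conventional models with finer-grained sentiment classes, multilingual support, and overall sentiment prediction, and improving reasoning and generalization.
Result: The models set new state-of-the-art (SOTA) results on the SemEval14 benchmark; the 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet; a single multilingual model maintains 87-91% accuracy across six languages without degrading English performance.
Insight: The contributions are: expanding the sentiment classes to five (adding mixed and unknown); jointly predicting overall text sentiment; injecting reasoning through Chain-of-Thought fine-tuning and a novel reasoning pretraining technique for encoder-only models; and building and releasing ABSA-mix, a large-scale benchmark that aggregates 17 public ABSA datasets. Viewed objectively, the combination of large-scale synthetic data, multi-task learning, and reasoning enhancement offers an effective route to better performance on domain-specific NLP tasks.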
Abstract: We introduce Arctic-ABSA, a collection of powerful models for real-life aspect-based sentiment analysis (ABSA). Our models are tailored to commercial needs, trained on a large corpus of public data alongside carefully generated synthetic data, resulting in a dataset 20 times larger than SemEval14. We extend typical ABSA models by expanding the number of sentiment classes from the standard three (positive, negative, neutral) to five, adding mixed and unknown classes, while also jointly predicting overall text sentiment and supporting multiple languages. We experiment with reasoning injection by fine-tuning on Chain-of-Thought (CoT) examples and introduce a novel reasoning pretraining technique for encoder-only models that significantly improves downstream fine-tuning and generalization. Our 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet, while setting new state-of-the-art results on the SemEval14 benchmark. A single multilingual model maintains 87-91% accuracy across six languages without degrading English performance. We release ABSA-mix, a large-scale benchmark aggregating 17 public ABSA datasets across 92 domains.
[48] Benchmark^2: Systematic Evaluation of LLM Benchmarks cs.CL
Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu
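TL;DR: This paper proposes Benchmark^2, a framework for systematically evaluating the quality of large language model (LLM) benchmarks themselves. It comprises three complementary metrics: Cross-Benchmark Ranking Consistency, Discriminability Score, and Capability Alignment Deviation. Extensive experiments on 15 benchmarks spanning mathematics, reasoning, and knowledge, with 11 LLMs, reveal quality differences among existing benchmarks and show that selective benchmark construction based on these metrics achieves comparable evaluation performance with substantially reduced test sets.
Details
Motivation: The rapid proliferation of LLM benchmarks has created an urgent need for systematic methods to assess benchmark quality itself.
Result: Experiments with 11 LLMs on 15 benchmarks spanning mathematics, reasoning, and knowledge reveal significant quality variation among existing benchmarks and show that selective benchmark construction based on the proposed metrics achieves comparable evaluation performance with substantially reduced test sets.
Insight: The contribution is Benchmark^2, presented as the first framework for systematically evaluating LLM benchmark quality; its three complementary metrics (ranking consistency, discriminability, capability deviation) quantify benchmark quality and give methodological guidance for building efficient, reliable benchmarks.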
Abstract: The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark’s ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
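The three metrics are described only in words here, so the sketch below shows plausible stand-ins: mean Spearman rank correlation against peer benchmarks for ranking consistency, the population standard deviation of model scores for discriminability, and the share of instances a weaker model solves but a stronger sibling fails for capability deviation. These formulas are assumptions for illustration and may differ from the paper's definitions.

```python
from statistics import mean, pstdev


def ranks(values):
    """Rank 1 = highest score; tied scores share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation; undefined (division by zero) if one list is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def ranking_consistency(target_scores, peer_score_lists):
    """Average rank agreement between one benchmark and its peers."""
    return mean(spearman(target_scores, peer) for peer in peer_score_lists)


def discriminability(scores):
    """Spread of model scores; 0 means the benchmark cannot separate models."""
    return pstdev(scores)


def capability_deviation(weaker_correct, stronger_correct):
    """Share of instances the weaker model solves but the stronger one fails."""
    flagged = sum(1 for w, s in zip(weaker_correct, stronger_correct) if w and not s)
    return flagged / len(weaker_correct)


# Scores of the same three models on a target benchmark and two peers.
target = [0.9, 0.7, 0.5]
agreeing_peer = [0.8, 0.6, 0.4]
reversed_peer = [0.4, 0.6, 0.8]
consistency = ranking_consistency(target, [agreeing_peer, reversed_peer])
```

With one peer that ranks the models identically and one that ranks them in reverse, the two correlations cancel and the consistency score is zero, which shows why the metric depends on which peer benchmarks are chosen.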
[49] Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation cs.CL
David Stap
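TL;DR: This thesis studies cross-lingual knowledge transfer in multilingual machine translation, analyzing how language similarity, data availability, and training strategies affect knowledge sharing, and proposes methods to improve translation robustness and generalization for low-resource languages.
Details
Motivation: Multilingual machine translation faces generalization and knowledge-transfer challenges, especially for low-resource languages with limited parallel data; the work aims to understand and improve cross-lingual representation learning.
Result: By analyzing how language similarity influences transfer, strengthening low-resource translation with retrieval and auxiliary supervision, and studying the trade-offs introduced by fine-tuning, the thesis shows that increasing translation coverage improves generalization and reduces off-target behavior. The abstract does not mention specific benchmarks or SOTA comparisons.
Insight: The contribution is a systematic analysis of the mechanisms of cross-lingual knowledge transfer, with the proposal that adjusting data composition and training strategy (for example, increasing language diversity) makes multilingual NLP systems more inclusive and robust, offering practical insights for low-resource translation.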
Abstract: Multilingual machine translation systems aim to make knowledge accessible across languages, yet learning effective cross-lingual representations remains challenging. These challenges are especially pronounced for low-resource languages, where limited parallel data constrains generalization and transfer. Understanding how multilingual models share knowledge across languages requires examining the interaction between representations, data availability, and training strategies. In this thesis, we study cross-lingual knowledge transfer in neural models and develop methods to improve robustness and generalization in multilingual settings, using machine translation as a central testbed. We analyze how similarity between languages influences transfer, how retrieval and auxiliary supervision can strengthen low-resource translation, and how fine-tuning on parallel data can introduce unintended trade-offs in large language models. We further examine the role of language diversity during training and show that increasing translation coverage improves generalization and reduces off-target behavior. Together, this work highlights how modeling choices and data composition shape multilingual learning and offers insights toward more inclusive and resilient multilingual NLP systems.
[50] When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life cs.CL
Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang
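TL;DR: This paper introduces SaLAD, a benchmark for evaluating the safety impact of multimodal large language models (MLLMs) in daily life. It contains 2,013 real-world image-text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. The authors also propose a safety-warning-based evaluation framework that encourages models to give clear, informative safety warnings instead of generic refusals. Tests on 18 MLLMs show that the top-performing model achieves a safe response rate of only 57.2% on unsafe queries, revealing how fragile current models are at identifying everyday dangerous behavior.
Details
Motivation: As MLLMs become indispensable assistants in daily life, the unsafe content they generate threatens human behavior. Investigating and evaluating the safety impact of MLLM responses in daily life requires a new benchmark that emphasizes realistic risk exposure and fine-grained cross-modal reasoning.
Result: Evaluation on 18 MLLMs shows that the top-performing model achieves a safe response rate of only 57.2% on unsafe queries. Even popular safety alignment methods have limited effect in this setting, revealing the shortcomings of current MLLMs in identifying everyday dangerous behavior.
Insight: The contribution is SaLAD, a multimodal safety benchmark that emphasizes realistic risk, authentic visual inputs, and fine-grained cross-modal reasoning (safety risks cannot be inferred from text alone), together with an evaluation framework based on safety warnings instead of simple refusals, giving a new perspective and tool for assessing and improving the practical safety of MLLMs.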
Abstract: As Multimodal Large Language Models (MLLMs) become indispensable assistants in human life, the unsafe content generated by MLLMs poses a danger to human behavior, perpetually overhanging human society like a sword of Damocles. To investigate and evaluate the safety impact of MLLM responses on human behavior in daily life, we introduce SaLAD, a multimodal safety benchmark which contains 2,013 real-world image-text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. It emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning, ensuring that safety risks cannot be inferred from text alone. We further propose a safety-warning-based evaluation framework that encourages models to provide clear and informative safety warnings, rather than generic refusals. Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries. Moreover, even popular safety alignment methods show limited effectiveness in our scenario, revealing the vulnerabilities of current MLLMs in identifying dangerous behaviors in daily life. Our dataset is available at https://github.com/xinyuelou/SaLAD.
[51] Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients cs.CL
Prith Sharma, Austin Z. Henley
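TL;DR: This paper proposes Modular Prompt Optimization (MPO), a framework that treats a prompt as a structured object made of fixed semantic sections (such as system role, context, and task description) and refines each section independently using section-local textual gradients generated by a critic language model, while keeping the overall prompt structure fixed, to reduce redundancy and interference between components and improve prompt quality.
Details
Motivation: Existing automatic prompt optimization methods usually treat prompts as monolithic blocks of text, making it hard to localize errors, preserve critical instructions, or prevent uncontrolled prompt growth; an optimization method that preserves structure and is interpretable and robust is needed.
Result: On two reasoning benchmarks, ARC-Challenge and MMLU, with LLaMA-3 8B-Instruct and Mistral-7B-Instruct as solver models, MPO outperforms both an untuned structured prompt and the TextGrad baseline, achieving substantial accuracy gains without modifying model parameters or altering prompt structure.
Insight: The contribution is structuring the prompt and optimizing each section independently with local textual gradients, consolidating updates through de-duplication to reduce redundancy and interference. This is an effective, practical way to improve the reasoning performance of small open-source language models while keeping the prompt structure interpretable and robust.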
Abstract: Prompt quality plays a central role in controlling the behavior, reliability, and reasoning performance of large language models (LLMs), particularly for smaller open-source instruction-tuned models that depend heavily on explicit structure. While recent work has explored automatic prompt optimization using textual gradients and self-refinement, most existing methods treat prompts as monolithic blocks of text, making it difficult to localize errors, preserve critical instructions, or prevent uncontrolled prompt growth. We introduce Modular Prompt Optimization (MPO), a schema-based prompt optimization framework that treats prompts as structured objects composed of fixed semantic sections, including system role, context, task description, constraints, and output format. MPO applies section-local textual gradients, generated by a critic language model, to refine each section independently while keeping the overall prompt schema fixed. Section updates are consolidated through de-duplication to reduce redundancy and interference between components, yielding an interpretable and robust optimization process. We evaluate MPO on two reasoning benchmarks, ARC-Challenge and MMLU, using LLaMA-3 8B-Instruct and Mistral-7B-Instruct as solver models. Across both benchmarks and models, MPO consistently outperforms an untuned structured prompt and the TextGrad baseline, achieving substantial accuracy gains without modifying model parameters or altering prompt structure. These results demonstrate that maintaining a fixed prompt schema while applying localized, section-wise optimization is an effective and practical approach for improving reasoning performance in small open-source LMs.
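A fixed-schema, section-by-section update loop of the kind described can be sketched as follows. The five section names come from the abstract; everything else is an assumption made for illustration: the critic is a hand-written stub in place of a critic language model, feedback is applied by appending lines where a real system would presumably rewrite the section, and de-duplication is done by exact line matching.

```python
SECTIONS = ("system_role", "context", "task_description", "constraints", "output_format")


def dedupe_lines(text):
    """Drop blank lines and case-insensitive repeats, keeping first occurrences."""
    seen, kept = set(), []
    for line in text.splitlines():
        key = line.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return "\n".join(kept)


def optimize_step(prompt, critic):
    """One pass: ask the critic about each section in isolation, apply the
    feedback to that section only, and keep the set of sections unchanged."""
    updated = {}
    for name in SECTIONS:
        feedback = critic(name, prompt[name])  # section-local "textual gradient"
        text = prompt[name] if feedback is None else prompt[name] + "\n" + feedback
        updated[name] = dedupe_lines(text)
    return updated


def stub_critic(name, text):
    """Stand-in for a critic LM; returns a suggestion or None."""
    if name == "constraints":
        return "Show your reasoning step by step."  # already present: deduped
    if name == "output_format":
        return "Do not add text after the letter."
    return None


prompt = {
    "system_role": "You are a careful solver.",
    "context": "",
    "task_description": "Answer the multiple-choice question.",
    "constraints": "Show your reasoning step by step.",
    "output_format": "Reply with one letter.",
}
new_prompt = optimize_step(prompt, stub_critic)
```

Because each update touches a single key and the key set never changes, a regression can be traced to the section that changed, which is the localization benefit the abstract attributes to keeping the schema fixed.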
[52] Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion cs.CLPDF
Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang
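TL;DR: This paper proposes CoM-DAD, a new probabilistic framework that aims to bridge the gap between generative models for discrete data (such as text) and continuous data (such as images) to enable unified cross-modal generation. It recasts multimodal generation as a hierarchical dual process: a continuous latent diffusion process first models high-level semantic planning, and a discrete absorbing diffusion process then performs low-level token synthesis conditioned on those semantic priors.
Details
Motivation: Generative modeling is currently split: autoregressive models excel at discrete data (text) and diffusion models at continuous data (images), which hinders truly unified multimodal systems. Masked language models (MLMs) are efficient but fall short in generative fidelity and semantic continuity, and extending them to multimodal settings raises severe alignment challenges and training instability.
Result: The paper claims superior stability over standard masked modeling on unified text-image generation and presents the approach as a scalable new paradigm. The abstract does not mention specific benchmarks or quantitative comparisons.
Insight: The main contributions are: 1) decomposing multimodal generation into a hierarchical dual process of high-level semantic planning and low-level token synthesis; 2) a "Variable-Rate Noise Schedule" that regulates the discrete absorbing diffusion process; 3) a "Stochastic Mixed-Modal Transport" strategy that aligns modalities without heavy contrastive dual encoders. Together these offer a new theoretical framework and implementation path for unified, efficient, and stable cross-modal generation.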
Abstract: The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
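Illustrative sketch (not from the paper): in a generic discrete absorbing diffusion process, each token is independently replaced by a mask token with a probability set by a noise schedule, and a masked token stays masked. The Python below shows only that generic forward corruption. The `power` parameter is a hypothetical stand-in for a schedule shape and is not the paper's Variable-Rate Noise Schedule; the conditioning on semantic priors is omitted.

```python
import random

MASK = "[MASK]"  # the absorbing state


def mask_prob(t: float, power: float = 1.0) -> float:
    """Hypothetical schedule: probability that a token is absorbed at time t in [0, 1]."""
    return min(1.0, max(0.0, t)) ** power


def absorb(tokens, t, power=1.0, rng=None):
    """Forward corruption: mask each token independently; masked tokens stay masked."""
    rng = rng or random.Random(0)
    p = mask_prob(t, power)
    return [MASK if tok == MASK or rng.random() < p else tok for tok in tokens]
```

At `t = 0` the sequence is returned unchanged and at `t = 1` every token is masked; a learned reverse model would be trained to fill the masks back in.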
[53] KDCM: Reducing Hallucination in LLMs through Explicit Reasoning Structures cs.CLPDF
Jinbo Hao, Kai Yang, Qingzhen Su, Yifan Li, Chao Jiang
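TL;DR: The paper proposes KDCM, a framework for reducing prompt-induced hallucination in large language models (LLMs). It extends a chain-style knowledge distillation approach with a programmable module that guides knowledge graph exploration; the module is embedded as executable code in the reasoning prompt so that the model can draw on external structured knowledge during inference. On top of this design, the authors build an enhanced distillation-based reasoning framework that explicitly regulates intermediate reasoning steps, yielding more reliable predictions.
Details
Motivation: To mitigate hallucination in LLMs, particularly errors induced by prompts.
Result: Evaluated on multiple public benchmarks with GPT-4 and LLaMA-3.3, code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 increase by 15.64%, 13.38%, and 13.28%, respectively, with scores exceeding 95% in several evaluation settings.
Insight: The claimed novelty is embedding a programmable module (as executable code) in the reasoning prompt to explicitly guide knowledge graph exploration and regulate reasoning steps, thereby using external structured knowledge to constrain erroneous reasoning. Viewed objectively, the contribution is a tight integration of code execution with the reasoning process, giving an interpretable and controllable mechanism for improving LLM accuracy and reliability.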
Abstract: To mitigate hallucinations in large language models (LLMs), we propose a framework that focuses on errors induced by prompts. Our method extends a chain-style knowledge distillation approach by incorporating a programmable module that guides knowledge graph exploration. This module is embedded as executable code within the reasoning prompt, allowing the model to leverage external structured knowledge during inference. Based on this design, we develop an enhanced distillation-based reasoning framework that explicitly regulates intermediate reasoning steps, resulting in more reliable predictions. We evaluate the proposed approach on multiple public benchmarks using GPT-4 and LLaMA-3.3. Experimental results show that code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 increase by 15.64%, 13.38%, and 13.28%, respectively, with scores exceeding 95% across several evaluation settings. These findings indicate that the proposed method effectively constrains erroneous reasoning while improving both accuracy and interpretability.
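Illustrative sketch (not from the paper): the reported HIT@1, HIT@3, and HIT@5 figures are instances of the standard hits-at-k metric, the fraction of questions whose gold answer appears among the top k ranked predictions. The Python below assumes one gold answer per question; the function name and the toy data are hypothetical.

```python
def hit_at_k(ranked_predictions, gold_answers, k):
    """Fraction of examples whose gold answer is within the top-k ranked predictions."""
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if gold in preds[:k]
    )
    return hits / len(gold_answers)


# Toy data: three questions, each with a ranked list of candidate answers.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
gold = ["a", "z", "q"]
print(hit_at_k(ranked, gold, 1), hit_at_k(ranked, gold, 3))
```

HIT@k can only stay the same or rise as k grows, which is why a benchmark usually reports several values of k side by side.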
[54] All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection cs.CL | cs.CE | q-fin.CPPDF
Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu
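TL;DR: The paper introduces RFC Bench, a benchmark for evaluating how well large language models detect misinformation in realistic financial news. It operates at the paragraph level, capturing the contextual complexity that arises from cues dispersed across financial news, and defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis on paired original-perturbed inputs. Experiments show that models perform substantially better when comparative context is available, while the reference-free setting exposes clear weaknesses such as unstable predictions and a high rate of invalid outputs.
Details
Motivation: To address the performance deficits and incoherent reasoning of current LLMs when detecting misinformation in realistic financial news without external references, and to provide a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection.
Result: Experiments show a consistent pattern: performance is substantially stronger when comparative context (paired inputs) is provided, whereas the reference-free setting exposes significant weaknesses, including unstable predictions and elevated invalid-output rates. These results indicate that current models struggle to maintain coherent belief states without external grounding.
Insight: The novelty is a finance-focused, reference-free misinformation detection benchmark (RFC Bench) that emphasizes contextual complexity and realistic scenarios. Its comparison task design reveals the large performance gap between settings with and without reference information, offering a new perspective and test framework for evaluating and improving models' intrinsic reasoning and belief consistency.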
Abstract: We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation in realistic news settings. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news, where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original-perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.
cs.CV [Back]
[55] HyperCLOVA X 32B Think cs.CV | cs.AI | cs.CL | cs.LGPDF
NAVER Cloud HyperCLOVA X Team
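TL;DR: This report introduces HyperCLOVA X 32B Think, a 32-billion-parameter vision-language model designed for the Korean linguistic and cultural context, with particular emphasis on reasoning and agentic behavior. Pre-training strengthens reasoning; post-training adds multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. The model performs well on Korean text and vision tasks and on agent evaluations.
Details
Motivation: To address possible shortcomings of existing vision-language models in reasoning and agentic behavior within the Korean linguistic and cultural context by developing a high-performance multimodal model specialized for that setting.
Result: In comparisons with models of similar size, the model achieves strong performance on Korean text-to-text and vision-to-text benchmarks and on agent-oriented evaluation tasks.
Insight: The novelty is combining reasoning-focused pre-training with multimodal post-training targeted at a specific language and culture (Korean), plus support for agentic behavior, providing an example of a specialized model for AI applications in a particular linguistic environment.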
Abstract: In this report, we present HyperCLOVA X 32B Think, a vision-language model designed with particular emphasis on reasoning within the Korean linguistic and cultural context, as well as agentic ability. HyperCLOVA X 32B Think is pre-trained with a strong focus on reasoning capabilities and subsequently post-trained to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. Experimental evaluations against comparably sized models demonstrate that our model achieves strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks. By open-sourcing HyperCLOVA X 32B Think, we aim to support broader adoption and facilitate further research and innovation across both academic and industrial communities.
[56] VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models cs.CV | cs.AIPDF
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo
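TL;DR: This paper proposes VLM4VLA, a framework that converts general-purpose vision-language models (VLMs) into vision-language-action (VLA) policies with minimal adaptation, and systematically studies how the choice and capabilities of the VLM affect downstream VLA policy performance. It finds that a VLM's general capabilities are poor predictors of downstream control performance, that the visual module is the main bottleneck, and that injecting control-relevant supervision into the vision encoder yields consistent gains even when the encoder is kept frozen during downstream fine-tuning.
Details
Motivation: To address a fundamental but rarely systematically studied question: how VLM choice and capability translate into downstream VLA policy performance, challenging the common assumption that general VLM capability is sufficient to improve embodied control.
Result: Extensive empirical studies on diverse downstream tasks across three benchmarks show that VLM initialization gives a consistent benefit over training from scratch, but general VLM capability correlates poorly with downstream task performance. With VLM4VLA, even a simple design is competitive with more sophisticated networks.
Insight: The novelty is a fair and efficient VLM-to-VLA adaptation framework, together with the finding that a persistent domain gap separates current VLM pre-training objectives from the needs of embodied action planning. The visual module is the key bottleneck, and injecting control supervision mitigates the problem.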
Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM’s general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM’s performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in the VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.
[57] MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models cs.CV | cs.AI | cs.LGPDF
Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang
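TL;DR: This paper presents MMErroR, a multimodal benchmark of 2,013 samples designed to evaluate how well vision-language models (VLMs) detect and classify reasoning errors. It covers 24 subdomains across six top-level domains and requires models not only to judge answer correctness but also to identify the type of error in the reasoning process. An evaluation of 20 advanced VLMs shows that even the best performer, Gemini-3.0-Pro, reaches only 66.47% error-classification accuracy, highlighting how hard it is for current models to understand erroneous reasoning.
Details
Motivation: As VLM performance in multimodal learning improves, researchers question whether these models truly understand the content they process. The core question is whether VLMs can detect when a reasoning process goes wrong and identify the error type. Answering it requires a process-level, error-centric evaluation benchmark.
Result: Twenty advanced VLMs were evaluated on MMErroR. Even the best model (Gemini-3.0-Pro) achieves only 66.47% error-classification accuracy, underscoring the challenge current models face in identifying erroneous reasoning.
Insight: The novelty is the shift from traditional answer-correctness evaluation to a process-level, error-centric paradigm, realized as a systematic benchmark for classifying multimodal reasoning errors. This provides a new perspective and tool for understanding the reasoning abilities and limits of VLMs and supports progress toward more reliable, interpretable models.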
Abstract: Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error correctly in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal reasoning models. Project Page: https://mmerror-benchmark.github.io
[58] Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views cs.CVPDF
Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, Christopher Schroers
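TL;DR: This paper proposes HairGuard, a framework for recovering soft-boundary details (such as thin hairs) in 3D vision tasks. It identifies soft-boundary regions with a data curation pipeline and a depth fixer network, refines depth with a gated residual module, and produces high-quality novel views through forward warping and a generative scene painter.
Details
Motivation: Soft boundaries such as hair are common in natural and computer-generated imagery, but the ambiguous mixing of foreground and background cues makes them hard to handle accurately in 3D vision, causing detail loss in depth estimation, stereo conversion, and novel view synthesis.
Result: HairGuard achieves state-of-the-art performance on monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft-boundary regions.
Insight: Contributions include a data curation pipeline that leverages image matting datasets, a depth fixer network with a gated residual module, and a novel view synthesis method combining depth-based forward warping with a generative scene painter. The depth fixer integrates plug-and-play with existing depth models to improve recovery of soft-boundary detail.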
Abstract: Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disoccluded regions and eliminates redundant background artifacts within soft boundaries. Finally, a color fuser adaptively combines warped and inpainted results to produce novel views with consistent geometry and fine-grained details. Extensive experiments demonstrate that HairGuard achieves state-of-the-art performance across monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.
[59] RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models cs.CV | cs.CLPDF
Sha Luo, Yogesh Prabhu, Tim Ossowski, Kaiping Chen, Junjie Hu
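TL;DR: The paper introduces RiskCueBench, a benchmark for evaluating whether video-language models can reason anticipatorily from early risk cues. Videos are carefully annotated with a risk signal clip (the earliest moment indicating a potential safety concern) to mimic real-world conditions, challenging models to predict future risky events from early visual signals alone. Experiments reveal a significant gap in current systems on this task.
Details
Motivation: Existing work on video risk assessment typically gives models access to the full video sequence, including the accident itself, which greatly reduces task difficulty and fails to reflect real-world conditions where only early cues are observable.
Result: Experiments reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.
Insight: The novelty is a benchmark centered on the "risk signal clip" (the earliest moment indicating risk), which forces models to perform genuine anticipatory reasoning rather than after-the-fact analysis. This gives a stricter testbed for evaluating and advancing temporal reasoning and risk prediction in video understanding models.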
Abstract: With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.
[60] A Novel Unified Approach to Deepfake Detection cs.CVPDF
Lord Sen, Shyamapada Mukherjee
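TL;DR: This paper proposes a novel unified architecture for detecting deepfakes in images and videos. It uses cross-attention between spatial-domain and frequency-domain features, combined with a blood detection module, to classify an image as real or fake. The method reports better-than-SOTA results on FF++ and Celeb-DF and good cross-dataset generalization.
Details
Motivation: As AI advances, the synthesis and misuse of deepfakes is a growing threat, and detecting and tagging deepfakes is necessary to sustain trust in the digital age. The paper aims to develop a unified detection architecture and analyze each step in depth.
Result: On FF++ and Celeb-DF, AUC reaches 99.80% and 99.88% with Swin Transformer and BERT, and 99.55% and 99.38% with EfficientNet-B4 and BERT, all reported as better than SOTA. The method also achieves strong results in cross-dataset tests.
Insight: The novelty is the cross-attention mechanism over spatial and frequency features together with an integrated blood detection module, which helps capture subtle deepfake artifacts. Viewed objectively, the unified architecture improves detection through multimodal feature fusion and generalizes reasonably well, offering a new direction for deepfake detection.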
Abstract: Advancements in the field of AI are increasingly giving rise to various threats. One of the most prominent is the synthesis and misuse of deepfakes. To sustain trust in this digital age, detecting and tagging deepfakes is necessary. In this paper, a novel architecture for Deepfake detection in images and videos is presented. The architecture uses cross attention between spatial and frequency domain features along with a blood detection module to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results better than SOTA, specifically 99.80% and 99.88% AUC on FF++ and Celeb-DF using Swin Transformer and BERT, and 99.55% and 99.38% using EfficientNet-B4 and BERT. The approach also generalizes well, achieving strong cross-dataset results.
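Illustrative sketch (not from the paper): the AUC figures quoted above measure how often a detector scores a fake sample higher than a real one. The Python below computes AUC directly from that pairwise definition, counting ties as half a win. The scores and labels are made-up toy values, with label 1 taken to mean fake.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy detector outputs: one fake sample (score 0.3) is ranked below a real one (0.8).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(auc(toy_scores, toy_labels))  # 3 of 4 fake/real pairs are ordered correctly: 0.75
```

The pairwise form is quadratic in the number of samples, so it suits small checks; it does not depend on any decision threshold, which is why AUC is a common choice for comparing detectors across datasets.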
[61] Better, But Not Sufficient: Testing Video ANNs Against Macaque IT Dynamics cs.CV | cs.NEPDF
Matteo Dunnhofer, Christian Micheloni, Kohitij Kar
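TL;DR: The paper compares neural responses in macaque inferior temporal (IT) cortex during naturalistic video viewing against the predictions of static, recurrent, and video-based artificial neural network (ANN) models. Video models improve neural predictivity only modestly and fail to generalize to "appearance-free" video variants that preserve motion but remove shape and texture. This shows that current video models capture only appearance-bound dynamics, not the appearance-invariant dynamic computations expressed in macaque IT.
Details
Motivation: To probe the computational nature of dynamic visual processing in primate IT cortex, in particular whether its temporal responses reflect only framewise feedforward transformations or shallow temporal pooling, or instead embody richer dynamic computations, and thereby address the inherent limits of static ANN models on dynamic scenes.
Result: On the naturalistic video benchmark, video ANN models give modest improvements in neural predictivity, especially at later response stages. In the "appearance-free" stress test, however, all ANN models fail to generalize, whereas macaque IT population activity generalizes across the manipulation.
Insight: The novelty is the "appearance-free" stress test, which reveals that current video models emulate only appearance-bound dynamics rather than the appearance-invariant temporal computations of biological vision (as in macaque IT). This underscores the need for new objectives that encode biological temporal statistics and invariances.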
Abstract: Feedforward artificial neural networks (ANNs) trained on static images remain the dominant models of the primate ventral visual stream, yet they are intrinsically limited to static computations. The primate world is dynamic, and the macaque ventral visual pathway, specifically the inferior temporal (IT) cortex, not only supports object recognition but also encodes object motion velocity during naturalistic video viewing. Do IT’s temporal responses reflect nothing more than time-unfolded feedforward transformations, framewise features with shallow temporal pooling, or do they embody richer dynamic computations? We tested this by comparing macaque IT responses during naturalistic videos against static, recurrent, and video-based ANN models. Video models provided modest improvements in neural predictivity, particularly at later response stages, raising the question of what kind of dynamics they capture. To probe this, we applied a stress test: decoders trained on naturalistic videos were evaluated on “appearance-free” variants that preserve motion but remove shape and texture. IT population activity generalized across this manipulation, but all ANN classes failed. Thus, current video models better capture appearance-bound dynamics rather than the appearance-invariant temporal computations expressed in IT, underscoring the need for new objectives that encode biological temporal statistics and invariances.
[62] Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning cs.CV | cs.AIPDF
Ali Najar, Alireza Mirrokni, Arshia Izadyari, Sadegh Mohammadian, Amir Homayoon Sharifizade
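TL;DR: This paper presents Eye-Q, a multilingual benchmark of visual word puzzles for evaluating vision-language models on complex visual understanding. The task requires models to discover implicit visual cues, generate and revise hypotheses, and map perceptual evidence to non-literal concepts.
Details
Motivation: Existing VLMs perform well on standard benchmarks but often rely on surface-level recognition rather than deeper reasoning, so more challenging tasks are needed to assess complex visual understanding.
Result: On Eye-Q, state-of-the-art VLMs reach a maximum accuracy of only 60.27%, with substantial gaps on abstract and cross-lingual puzzles in particular. This exposes the limits of current models in constructing and searching over appropriate conceptual representations for flexible image-to-phrase inference.
Insight: The novelty is an unstructured, cue-implicit form of visual word puzzle that demands selective attention, abstraction, and associative inference, along with multilingual and cross-lingual settings and an open-ended, human-aligned evaluation protocol that probes hypothesis formation and revision.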
Abstract: Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models’ ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
[63] GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models cs.CVPDF
Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia
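TL;DR: This paper proposes GAMBIT, a new jailbreak framework for multimodal large language models (MLLMs). It decomposes harmful visual semantics and reassembles them into a gamified scene, driving the model to complete the malicious query on its own as it explores and tries to win the game, thereby bypassing safety alignment.
Details
Motivation: Existing attacks mainly increase the complexity of the visual task itself and do not exploit the model's own reasoning incentives, so they underperform on reasoning models with chain-of-thought. The paper explores whether influencing a model's cognitive-stage decisions can lead it to complete a jailbreak proactively.
Result: Extensive experiments on popular reasoning and non-reasoning MLLMs show high attack success rates: 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
Insight: The core idea is to frame the jailbreak as a gamified scene: harmful semantics are decomposed and reassembled, and the model's goal pursuit as a game participant reduces its safety attention, inducing it to answer the reconstructed malicious query. The approach exploits the model's reasoning incentives rather than simply increasing task complexity.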
Abstract: Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model’s own reasoning incentives. As a result, they underperform on reasoning models (models with chain-of-thought) compared to non-reasoning ones (models without chain-of-thought). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
[64] WeedRepFormer: Reparameterizable Vision Transformers for Real-Time Waterhemp Segmentation and Gender Classification cs.CVPDF
Toqi Tahamid Sarker, Taminul Islam, Khaled R. Ahmed, Cristiana Bernardi Rankrape, Kaitlin E. Creager
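TL;DR: This paper proposes WeedRepFormer, a lightweight multi-task Vision Transformer for simultaneous waterhemp segmentation and gender classification. Structural reparameterization lets the model keep high capacity during training and low latency at inference, balancing fine-grained feature extraction against real-time deployment efficiency in agricultural applications.
Details
Motivation: Existing agricultural models struggle to balance the fine-grained feature extraction needed for biological attribute classification with the efficiency needed for real-time deployment. The paper therefore designs a lightweight multi-task model that handles waterhemp segmentation and gender classification together.
Result: On a waterhemp dataset of 10,264 annotated frames, WeedRepFormer achieves 92.18% mIoU for segmentation and 81.91% gender classification accuracy with only 3.59M parameters and 3.80 GFLOPs, running at 108.95 FPS. It exceeds the SOTA model iFormer-T by 4.40% in classification accuracy while maintaining competitive segmentation performance and reducing parameter count by 1.9x.
Insight: Contributions include systematically integrating structural reparameterization across the whole architecture (the Vision Transformer backbone, the Lite R-ASPP decoder, and a new reparameterizable classification head) to decouple training-time capacity from inference-time latency, and introducing a comprehensive waterhemp dataset as a benchmark for related research.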
Abstract: We present WeedRepFormer, a lightweight multi-task Vision Transformer designed for simultaneous waterhemp segmentation and gender classification. Existing agricultural models often struggle to balance the fine-grained feature extraction required for biological attribute classification with the efficiency needed for real-time deployment. To address this, WeedRepFormer systematically integrates structural reparameterization across the entire architecture - comprising a Vision Transformer backbone, a Lite R-ASPP decoder, and a novel reparameterizable classification head - to decouple training-time capacity from inference-time latency. We also introduce a comprehensive waterhemp dataset containing 10,264 annotated frames from 23 plants. On this benchmark, WeedRepFormer achieves 92.18% mIoU for segmentation and 81.91% accuracy for gender classification using only 3.59M parameters and 3.80 GFLOPs. At 108.95 FPS, our model outperforms the state-of-the-art iFormer-T by 4.40% in classification accuracy while maintaining competitive segmentation performance and significantly reducing parameter count by 1.9x.
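Illustrative sketch (not from the paper): structural reparameterization, in the general RepVGG-style sense, trains with several parallel linear branches and then folds them into a single equivalent operator for inference. The pure-Python example below merges two linear branches and an identity shortcut into one weight matrix. The branch layout is an assumption chosen for brevity; the abstract does not specify WeedRepFormer's actual branches, and normalization layers are omitted.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train_time_forward(W_a, W_b, x):
    """Training-time block: two parallel linear branches plus an identity shortcut."""
    a, b = matvec(W_a, x), matvec(W_b, x)
    return [ai + bi + xi for ai, bi, xi in zip(a, b, x)]


def reparameterize(W_a, W_b):
    """Fold both branches and the identity into one matrix: W = W_a + W_b + I."""
    n = len(W_a)
    return [
        [W_a[i][j] + W_b[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


W_a = [[1.0, 2.0], [0.0, 1.0]]
W_b = [[0.5, 0.0], [1.0, -1.0]]
x = [2.0, 3.0]
W_merged = reparameterize(W_a, W_b)
print(train_time_forward(W_a, W_b, x), matvec(W_merged, x))  # both give [11.0, 5.0]
```

The merged matrix produces the same output with one multiplication instead of two plus an addition, which is the sense in which training-time capacity is decoupled from inference-time latency.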
[65] FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder cs.CV | cs.AIPDF
Zeyu Dong, Yimin Zhu, Yu Wu, Yu Sun
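TL;DR: This paper proposes FROST-Drive, a novel end-to-end autonomous driving architecture whose core idea is to freeze the weights of the vision encoder taken from a pretrained vision-language model (VLM), preserving its generalized world knowledge and transferring it directly to the driving task. The architecture combines the frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation, and adds a custom loss designed to optimize Rater Feedback Score (RFS). Experiments on the large-scale Waymo Open E2E dataset, which is curated to capture long-tail scenarios, show that it significantly outperforms models with a fully fine-tuned encoder.
Details
Motivation: Current end-to-end driving models usually fully fine-tune the vision encoder on driving datasets, which can over-specialize the model to the training data and limit generalization to novel, complex scenarios. The paper challenges the necessity of this training paradigm and explores how to make better use of the generalized knowledge in a pretrained vision encoder.
Result: Extensive experiments on the Waymo Open E2E dataset show that the frozen-encoder approach significantly outperforms fully fine-tuned models, providing strong evidence for more robust and generalizable driving performance.
Insight: The main contribution is to challenge the common practice of fine-tuning the vision encoder in end-to-end driving models, and to propose and validate freezing a pretrained VLM encoder to retain its broad world knowledge as a more effective strategy. The architecture combines a frozen encoder, a multimodal fusion adapter, and an RFS-oriented loss, offering a new path toward vision-based models that handle real-world complexity better.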
Abstract: End-to-end (E2E) models in autonomous driving aim to directly map sensor inputs to control commands, but their ability to generalize to novel and complex scenarios remains a key challenge. The common practice of fully fine-tuning the vision encoder on driving datasets potentially limits its generalization by causing the model to specialize too heavily in the training data. This work challenges the necessity of this training paradigm. We propose FROST-Drive, a novel E2E architecture designed to preserve and leverage the powerful generalization capabilities of a pretrained vision encoder from a Vision-Language Model (VLM). By keeping the encoder’s weights frozen, our approach directly transfers the rich, generalized world knowledge from the VLM to the driving task. Our model architecture combines this frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. Furthermore, we introduce a custom loss function designed to directly optimize for Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning. We conduct extensive experiments on the Waymo Open E2E Dataset, a large-scale dataset deliberately curated to capture long-tail scenarios, demonstrating that our frozen-encoder approach significantly outperforms models that employ full fine-tuning. Our results provide substantial evidence that preserving the broad knowledge of a capable VLM is a more effective strategy for achieving robust, generalizable driving performance than intensive domain-specific adaptation. This offers a new pathway for developing vision-based models that can better handle the complexities of real-world application domains.
[66] Experimental Comparison of Light-Weight and Deep CNN Models Across Diverse Datasets cs.CV | cs.LGPDF
Md. Hefzul Hossain Papon, Shadman Rabby
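TL;DR: This paper experimentally compares lightweight and deep CNN models on multiple datasets and finds that a well-regularized shallow architecture is highly competitive across domains (such as smart-city surveillance and agricultural variety classification) without requiring large GPUs or specialized pre-trained models.
Details
Motivation: To meet the practical need to deploy computer vision models in resource-constrained environments and to establish a unified, reproducible benchmark for multiple Bangladeshi vision datasets.
Result: Across several heterogeneous datasets, lightweight CNN models are competitive with deep models, making them suitable for real-world deployment in low-resource settings.
Insight: The paper highlights the practical value of lightweight CNNs in low-resource settings and, through a reproducible benchmark, shows the potential of shallow architectures, offering model selection guidance for resource-limited applications.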
Abstract: Our results reveal that a well-regularized shallow architecture can serve as a highly competitive baseline across heterogeneous domains - from smart-city surveillance to agricultural variety classification - without requiring large GPUs or specialized pre-trained models. This work establishes a unified, reproducible benchmark for multiple Bangladeshi vision datasets and highlights the practical value of lightweight CNNs for real-world deployment in low-resource settings.
[67] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing cs.CVPDF
Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang
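TL;DR: This paper proposes ThinkRL-Edit, a reasoning-centric reinforcement learning framework for instruction-driven image editing. It decouples visual reasoning from image synthesis and introduces chain-of-thought-based reasoning sampling, with planning and reflection before generation, to explore multiple semantic hypotheses and validate their plausibility, improving performance on edits that require complex reasoning.
Details
Motivation: Unified multimodal generative models have limited underlying visual reasoning in instruction-driven image editing, leading to poor performance on reasoning-centric edits. Reinforcement learning has been used to improve editing quality but faces three challenges: limited reasoning exploration, biased reward fusion, and unstable VLM-based instruction rewards.
Result: Experiments show that the method significantly outperforms prior work on image editing tasks that require complex reasoning, producing edits that are more faithful to instructions, visually coherent, and semantically grounded.
Insight: The main contributions are: 1) decoupling visual reasoning from image synthesis and using chain-of-thought (CoT) planning and reflection before generation, which expands the reasoning exploration space; 2) an unbiased chain preference grouping strategy that avoids the bias of weighted aggregation across reward dimensions; 3) replacing interval-based VLM scores with a binary checklist, giving more precise, lower-variance, and interpretable reward signals.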
Abstract: Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
[68] Understanding Reward Hacking in Text-to-Image Reinforcement Learning cs.CVPDF
Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, Cho-Jui Hsieh
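TL;DR: This paper systematically analyzes reward hacking in reinforcement learning post-training for text-to-image (T2I) models, where a model produces unrealistic or low-quality images that nevertheless receive high reward scores. It finds that aesthetic/human-preference rewards and prompt-image consistency rewards each induce reward hacking on their own, and that ensembling multiple rewards only partially mitigates the problem. The authors propose a lightweight, adaptive artifact reward model, trained on a small curated dataset, that acts as an effective regularizer for existing reward models, significantly improving visual realism and reducing reward hacking.
Details
Motivation: Reward functions used to improve generation quality and human-preference alignment are often imperfect proxies, making models prone to reward hacking, that is, generating unrealistic or low-quality images that still score highly.
Result: Experiments show that adding the proposed artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, demonstrating that lightweight reward augmentation is an effective safeguard against reward hacking.
Insight: The novelty is a systematic analysis of a common failure mode of reward hacking in T2I RL (generating artifact-prone images) and a lightweight, adaptive artifact reward model as a regularizer. It integrates flexibly into existing RL pipelines and offers a new approach to the imperfect-proxy problem.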
Abstract: Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, where reward functions are used to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking: producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how both aesthetic/human preference rewards and prompt-image consistency rewards individually contribute to reward hacking and further show that ensembling multiple rewards can only partially mitigate this issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. To address this, we propose a lightweight and adaptive artifact reward model, trained on a small curated dataset of artifact-free and artifact-containing samples. This model can be integrated into existing RL pipelines as an effective regularizer for commonly used reward models. Experiments demonstrate that incorporating our artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, demonstrating the effectiveness of lightweight reward augmentation as a safeguard against reward hacking.
[69] CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation cs.CV | cs.AIPDF
Yuzhe Sun, Zhe Dong, Haochen Jiang, Tianzhu Liu, Yanfeng Gu
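TL;DR: This paper proposes CroBIM-U, an uncertainty-guided framework that addresses the spatial non-uniformity of cross-modal alignment in referring remote sensing image segmentation. The framework predicts a pixel-wise referring uncertainty map as a spatial prior and uses it to adaptively modulate language fusion strength and local refinement regions, improving segmentation robustness and geometric fidelity in complex remote sensing scenes.
Details
Motivation: Existing referring remote sensing segmentation methods apply uniform fusion and refinement strategies across the whole image. This introduces unnecessary linguistic perturbations in visually clear regions while failing to provide sufficient disambiguation in confused regions, so these methods cannot cope with the spatial non-uniformity of cross-modal alignment caused by scale variation, dense similar distractors, and complex boundary structures.
Result: Extensive experiments show that the method, as a unified plug-and-play solution, significantly improves segmentation robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture.
Insight: The core innovation is an interpretable Referring Uncertainty Scorer (RUS) that serves as a spatial prior, on which two plug-and-play modules are built: Uncertainty-Gated Fusion (UGF) and Uncertainty-Driven Local Refinement (UDLR). Together they enable adaptive, non-uniform control of language injection strength and refinement regions.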
Abstract: Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. However, due to extreme scale variations, dense similar distractors, and intricate boundary structures, the reliability of cross-modal alignment exhibits significant spatial non-uniformity. Existing methods typically employ uniform fusion and refinement strategies across the entire image, which often introduces unnecessary linguistic perturbations in visually clear regions while failing to provide sufficient disambiguation in confused areas. To address this, we propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference. Specifically, we introduce a plug-and-play Referring Uncertainty Scorer (RUS), which is trained via an online error-consistency supervision strategy to interpretably predict the spatial distribution of referential ambiguity. Building on this prior, we design two plug-and-play modules: 1) Uncertainty-Gated Fusion (UGF), which dynamically modulates language injection strength to enhance constraints in high-uncertainty regions while suppressing noise in low-uncertainty ones; and 2) Uncertainty-Driven Local Refinement (UDLR), which utilizes uncertainty-derived soft masks to focus refinement on error-prone boundaries and fine details. Extensive experiments demonstrate that our method functions as a unified, plug-and-play solution that significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture.
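The idea of modulating language injection by uncertainty can be sketched on flat per-pixel lists. The additive gating below (visual feature plus uncertainty-weighted language feature) is an assumed form chosen for simplicity; the abstract states only that injection strength rises in high-uncertainty regions and falls in low-uncertainty ones.

```python
from typing import List


def uncertainty_gated_fusion(visual: List[float],
                             language: List[float],
                             uncertainty: List[float]) -> List[float]:
    """Fuse per-pixel features, scaling the language term by uncertainty.

    Hypothetical sketch: real features would be multi-channel tensors and
    the gate would be learned. Here the gate is the uncertainty itself,
    clamped to [0, 1].
    """
    if not (len(visual) == len(language) == len(uncertainty)):
        raise ValueError("inputs must have the same length")
    fused = []
    for v, l, u in zip(visual, language, uncertainty):
        gate = min(max(u, 0.0), 1.0)  # clamp the gate to [0, 1]
        fused.append(v + gate * l)    # inject language only where needed
    return fused


# Pixel 0 is visually clear (uncertainty 0), pixel 1 is ambiguous (1).
fused = uncertainty_gated_fusion([1.0, 1.0], [0.5, 0.5], [0.0, 1.0])
```

In this toy case the clear pixel keeps its visual feature untouched, while the ambiguous pixel receives the full language contribution.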
[70] SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models cs.CV | cs.AIPDF
Yuxuan Xia, Siheng Wang, Peng Li
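TL;DR: This paper proposes Structure-Disrupted Contrastive Decoding (SDCD), a training-free algorithm for mitigating object hallucination in Large Vision-Language Models (LVLMs). The method introduces a structure-disrupted visual view for contrastive calibration, suppressing the texture-driven bias that vision encoders develop under weak structural supervision and thereby reducing hallucinations. Experiments show that SDCD reduces hallucinations on multiple benchmarks and improves the models' multimodal capabilities.
Details
Motivation: Object hallucination remains a key challenge for LVLMs. Existing work mostly focuses on language priors or high-level statistical biases and overlooks the internal complexity of the visual encoding process. The paper finds that visual statistical bias, arising from the inherent 'Bag-of-Patches' behavior of vision encoders under weak structural supervision, is a contributing factor to object hallucination: models rely excessively on local texture features rather than holistic geometric structure, which induces spurious visual confidence and hallucinations.
Result: Experimental results show that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
Insight: The paper's novelty lies in identifying and targeting a structural bias in the visual encoding process and proposing SDCD, a training-free contrastive decoding method that calibrates the output distribution with a structure-disrupted view to suppress texture-driven bias. Viewed objectively, the method offers a lightweight and efficient hallucination-mitigation strategy that could be adapted to similar biases in other multimodal models.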
Abstract: Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, it often overlooks the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor to object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
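A minimal sketch of the contrastive calibration step follows. It assumes the combination rule commonly used in contrastive decoding, (1 + alpha) * original - alpha * disrupted, applied to next-token logits; the abstract does not give SDCD's exact formula, and `alpha`, the helper names, and the toy logits are all illustrative.

```python
import math
import random
from typing import List, Sequence


def shuffle_patches(patches: Sequence[str], seed: int = 0) -> List[str]:
    """Build a structure-disrupted view by permuting image patches."""
    rng = random.Random(seed)
    shuffled = list(patches)
    rng.shuffle(shuffled)
    return shuffled


def contrastive_logits(original: List[float],
                       disrupted: List[float],
                       alpha: float = 1.0) -> List[float]:
    """Penalize tokens that stay confident under the disrupted view.

    Assumed rule (not confirmed by the abstract):
    (1 + alpha) * original - alpha * disrupted.
    """
    return [(1.0 + alpha) * o - alpha * d for o, d in zip(original, disrupted)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Token 0 is equally confident with or without structure (texture-driven);
# token 1 is confident only when the image structure is intact.
calibrated = contrastive_logits([2.0, 2.0], [2.0, 0.0], alpha=1.0)
probs = softmax(calibrated)
```

Before calibration the two tokens are tied; afterwards the structure-dependent token is preferred, which mirrors the penalty described in the abstract.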
[71] REFA: Real-time Egocentric Facial Animations for Virtual Reality cs.CVPDF
Qiang Zhang, Tong Xiao, Haroun Habeeb, Larissa Laich, Sofien Bouaziz
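TL;DR: This paper presents REFA, a new system for real-time facial expression tracking. It uses egocentric views captured by infrared cameras embedded in a VR headset to drive the facial expressions of virtual characters without a lengthy calibration step. At its core is a distillation-based machine learning model trained on heterogeneous data such as synthetic and real images, supported by a dataset of 18k diverse subjects and a differentiable rendering pipeline that automatically extracts expression labels.
Details
Motivation: To enable real-time, non-intrusive facial expression tracking in VR without lengthy calibration, facilitating communication and expression in virtual environments.
Result: The system drives facial expressions accurately at real-time rates, but the abstract reports no specific benchmarks or quantitative results (such as comparisons with the state of the art).
Insight: Contributions include a distillation-based training approach for heterogeneous data, a large-scale dataset built with a lightweight capture setup, and a differentiable rendering pipeline for automatic label extraction, together providing an efficient and practical solution for VR facial animation.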
Abstract: We present a novel system for real-time tracking of facial expressions using egocentric views captured from a set of infrared cameras embedded in a virtual reality (VR) headset. Our technology enables any user to accurately drive the facial expressions of virtual characters in a non-intrusive manner and without the need for a lengthy calibration step. At the core of our system is a distillation-based approach to train a machine learning model on heterogeneous data and labels coming from multiple sources, e.g., synthetic and real images. As part of our dataset, we collected 18k diverse subjects using a lightweight capture setup consisting of a mobile phone and a custom VR headset with extra cameras. To process this data, we developed a robust differentiable rendering pipeline enabling us to automatically extract facial expression labels. Our system opens up new avenues for communication and expression in virtual environments, with applications in video conferencing, gaming, entertainment, and remote collaboration.
[72] G2P: Gaussian-to-Point Attribute Alignment for Boundary-Aware 3D Semantic Segmentation cs.CVPDF
Hojun Song, Chae-yeong Song, Jeong-hun Hong, Chaewon Moon, Dong-hwi Kim
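TL;DR: The paper proposes Gaussian-to-Point (G2P), a method for improving semantic segmentation of 3D point clouds. It aligns appearance-aware attributes (such as color, texture, and material) from 3D Gaussian Splatting and transfers them to the original point cloud, addressing the difficulty of distinguishing objects with similar shapes but different appearances using geometric features alone. G2P establishes point-wise correspondences to resolve the misalignment between optimized Gaussians and the original point geometry, and uses the Gaussians' opacity and scale attributes to resolve geometric ambiguity and to localize boundaries precisely, respectively.
Details
Motivation: Point cloud semantic segmentation is critical for 3D scene understanding, but the sparse and irregular distribution of points limits the available appearance information, so geometric features alone struggle to distinguish objects with similar shapes but different appearances (such as color, texture, and material).
Result: Extensive experiments show that the method achieves superior performance on standard benchmarks and significant improvements on geometrically challenging classes, without any 2D or language supervision.
Insight: The core innovation is G2P, a mechanism that aligns the appearance-aware attributes of 3D Gaussian Splatting with the original point cloud geometry. Specifically, it uses point-wise correspondences to resolve geometric misalignment, Gaussian opacity attributes to resolve geometric ambiguity, and Gaussian scale attributes to localize boundaries precisely in complex scenes. This offers a new way to enrich 3D semantic segmentation with appearance information without relying on supervision from additional modalities.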
Abstract: Semantic segmentation on point clouds is critical for 3D scene understanding. However, sparse and irregular point distributions provide limited appearance evidence, making geometry-only features insufficient to distinguish objects with similar shapes but distinct appearances (e.g., color, texture, material). We propose Gaussian-to-Point (G2P), which transfers appearance-aware attributes from 3D Gaussian Splatting to point clouds for more discriminative and appearance-consistent segmentation. G2P addresses the misalignment between optimized Gaussians and original point geometry by establishing point-wise correspondences. By leveraging Gaussian opacity attributes, we resolve the geometric ambiguity that limits existing models. Additionally, Gaussian scale attributes enable precise boundary localization in complex 3D scenes. Extensive experiments demonstrate that our approach achieves superior performance on standard benchmarks and shows significant improvements on geometrically challenging classes, all without any 2D or language supervision.
[73] Semantic Belief-State World Model for 3D Human Motion Prediction cs.CVPDF
Sarim Chaudhry
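TL;DR: The paper proposes a Semantic Belief-State World Model (SBWM) that reframes human motion prediction as latent dynamical simulation on the human body manifold. SBWM maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and is explicitly aligned with the SMPL-X anatomical parameterization, forcing the model to capture motion dynamics, intent, and control-relevant structure rather than static geometry or noise.
Details
Motivation: Traditional methods treat motion prediction as sequence regression and suffer from compounding drift, mean-pose collapse, and poorly calibrated uncertainty, and they do not separate observation reconstruction from dynamics modeling. The paper introduces a world-model framework to achieve more stable and interpretable long-horizon motion simulation.
Result: SBWM demonstrates coherent long-horizon rollouts and achieves competitive prediction accuracy at substantially lower computational cost.
Insight: The core innovation is to treat the human body as part of the world model's state space rather than as its output, to separate dynamic from static information through a structural information bottleneck (alignment with SMPL-X), and to train rollout-centric stochastic latent transitions; the authors argue this fundamentally changes how motion is simulated.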
Abstract: Human motion prediction has traditionally been framed as a sequence regression problem where models extrapolate future joint coordinates from observed pose histories. While effective over short horizons, this approach does not separate observation reconstruction from dynamics modeling and offers no explicit representation of the latent causes governing motion. As a result, existing methods exhibit compounding drift, mean-pose collapse, and poorly calibrated uncertainty when rolled forward beyond the training regime. Here we propose a Semantic Belief-State World Model (SBWM) that reframes human motion prediction as latent dynamical simulation on the human body manifold. Rather than predicting poses directly, SBWM maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and explicitly aligned with the SMPL-X anatomical parameterization. This alignment imposes a structural information bottleneck that prevents the latent state from encoding static geometry or sensor noise, forcing it to capture motion dynamics, intent, and control-relevant structure. Inspired by belief-state world models developed for model-based reinforcement learning, SBWM adapts stochastic latent transitions and rollout-centric training to the domain of human motion. In contrast to RSSM-based, transformer, and diffusion approaches optimized for reconstruction fidelity, SBWM prioritizes stable forward simulation. We demonstrate coherent long-horizon rollouts and competitive accuracy at substantially lower computational cost. These results suggest that treating the human body as part of the world model's state space rather than its output fundamentally changes how motion is simulated and predicted.
[74] Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions cs.CV | cs.AIPDF
Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao
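TL;DR: The paper introduces SiT-Bench, a new benchmark for evaluating whether Large Language Models (LLMs) can understand and reason about spatial information from high-fidelity, coordinate-aware textual descriptions alone, without pixel input. The benchmark contains over 3,800 expert-annotated items across five primary categories and 17 subtasks, such as egocentric navigation and fine-grained robotic manipulation. The study finds that although state-of-the-art LLMs perform well on localized semantic tasks, a significant "spatial gap" remains in global consistency, and that explicit spatial reasoning significantly boosts performance.
Details
Motivation: The work is motivated by a question about the origin of Spatial Intelligence (SI): does it come from the visual encoder or from the underlying reasoning backbone? To answer it, the authors design a benchmark that requires no pixel input and tests whether LLMs can reason spatially from text alone, separating visual pattern matching from symbolic reasoning.
Result: State-of-the-art LLMs were evaluated on SiT-Bench. The models are proficient at localized semantic tasks but show a clear deficit in global consistency (the "spatial gap"). Explicit spatial reasoning significantly improves performance, suggesting that LLMs have latent world-modeling potential. The benchmark is intended as a foundational resource for developing spatially grounded LLM backbones for future Vision-Language Models (VLMs) and embodied agents.
Insight: The paper's novelty is SiT-Bench, presented as the first benchmark focused on evaluating the pure-text spatial reasoning of LLMs; by converting single- and multi-view scenes into coordinate-aware textual descriptions, it forces models to reason symbolically rather than match visual patterns. Viewed objectively, the work exposes both the potential and the limits of LLMs in spatial understanding and provides a new evaluation direction and dataset for building stronger spatial intelligence models.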
Abstract: Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. It comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant “spatial gap” remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT-Bench .
[75] MFC-RFNet: A Multi-scale Guided Rectified Flow Network for Radar Sequence Prediction cs.CV | cs.AIPDF
Wenjie Luo, Chuanhu Deng, Chaorong Li, Rongyao Deng, Qiang Yang
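TL;DR: This paper proposes MFC-RFNet, a multi-scale guided rectified flow network for radar sequence prediction. It targets the challenges of precipitation nowcasting from radar echo sequences: modeling multi-scale evolution, inter-frame feature misalignment, and capturing long-range spatiotemporal dependencies. The method combines wavelet-guided skip connections, a feature communication module, condition-guided spatial transform fusion, and rectified flow training with lightweight Vision-RWKV blocks to produce accurate, high-resolution precipitation forecasts.
Details
Motivation: To address three key difficulties in precipitation nowcasting from radar echo sequences: modeling complex multi-scale evolution, correcting inter-frame feature misalignment caused by displacement, and efficiently capturing long-range spatiotemporal context without sacrificing spatial fidelity.
Result: Evaluations on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM) show consistent improvements over strong baselines, with clearer echo morphology at higher rain-rate thresholds and sustained skill at longer lead times.
Insight: The novelty lies in combining rectified flow training with scale-aware communication, spatial alignment, and frequency-aware fusion. Specifically, wavelet-guided skip connections preserve high-frequency detail, the feature communication module promotes bidirectional cross-scale interaction, condition-guided spatial transform fusion corrects inter-frame displacement, and lightweight Vision-RWKV blocks capture long-range dependencies efficiently.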
Abstract: Accurate and high-resolution precipitation nowcasting from radar echo sequences is crucial for disaster mitigation and economic planning, yet it remains a significant challenge. Key difficulties include modeling complex multi-scale evolution, correcting inter-frame feature misalignment caused by displacement, and efficiently capturing long-range spatiotemporal context without sacrificing spatial fidelity. To address these issues, we present the Multi-scale Feature Communication Rectified Flow (RF) Network (MFC-RFNet), a generative framework that integrates multi-scale communication with guided feature fusion. To enhance multi-scale fusion while retaining fine detail, a Wavelet-Guided Skip Connection (WGSC) preserves high-frequency components, and a Feature Communication Module (FCM) promotes bidirectional cross-scale interaction. To correct inter-frame displacement, a Condition-Guided Spatial Transform Fusion (CGSTF) learns spatial transforms from conditioning echoes to align shallow features. The backbone adopts rectified flow training to learn near-linear probability-flow trajectories, enabling few-step sampling with stable fidelity. Additionally, lightweight Vision-RWKV (RWKV) blocks are placed at the encoder tail, the bottleneck, and the first decoder layer to capture long-range spatiotemporal dependencies at low spatial resolutions with moderate compute. Evaluations on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM) demonstrate consistent improvements over strong baselines, yielding clearer echo morphology at higher rain-rate thresholds and sustained skill at longer lead times. These results suggest that the proposed synergy of RF training with scale-aware communication, spatial alignment, and frequency-aware fusion presents an effective and robust approach for radar-based nowcasting.
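To see why near-linear probability-flow trajectories permit few-step sampling, consider a generic Euler sampler for a flow model. In the sketch below, the velocity function is a toy stand-in for a trained network (an assumption for illustration): it is constant, as it would be along a perfectly straight trajectory, so any number of Euler steps lands on the same endpoint.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(x0: Vector,
                 velocity: Callable[[Vector, float], Vector],
                 steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float) -> Vector:
    """Hypothetical velocity field: constant, i.e. a straight-line path."""
    return [1.0, -2.0]


# With a straight trajectory, one step and eight steps agree.
one_step = euler_sample([0.0, 0.0], toy_velocity, steps=1)
many_steps = euler_sample([0.0, 0.0], toy_velocity, steps=8)
```

A learned velocity field is only approximately constant along each path, so in practice a handful of steps trades a small discretization error for much faster sampling.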
[76] VideoMemory: Toward Consistent Video Generation via Memory Integration cs.CVPDF
Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang
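TL;DR: This paper proposes VideoMemory, a framework that integrates narrative planning with visual generation through a dynamic memory bank, aiming to keep characters, props, and backgrounds consistent in multi-shot video generation. A multi-agent system decomposes the script, retrieves entity representations from the memory bank, and synthesizes keyframes and videos conditioned on the retrieved states, preserving entity identity and appearance across scene changes over long time spans.
Details
Motivation: Existing video generation models can produce high-quality short clips but struggle to preserve entity identity and appearance when scenes change or when entities reappear after long gaps, which hinders coherent narrative video generation.
Result: On a purpose-built 54-case multi-shot consistency benchmark (covering character-, prop-, and background-persistent scenarios), VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
Insight: The novelty is an entity-centric dynamic memory bank that stores visual and semantic descriptors and, through a retrieval-update mechanism, preserves identity under story-driven changes. This offers a reusable approach to explicit memory integration for consistency in long-form video generation.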
Abstract: Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
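The retrieval-update mechanism can be pictured as a small keyed store in which identity is fixed and appearance attributes may change between shots. The class below is a hypothetical sketch: the class name, its methods, and the use of plain strings as descriptors are assumptions, whereas the system described above stores explicit visual and semantic descriptors.

```python
from typing import Dict, Optional


class DynamicMemoryBank:
    """Toy entity store: identity is immutable, appearance can evolve."""

    def __init__(self) -> None:
        self._entities: Dict[str, Dict[str, str]] = {}

    def register(self, name: str, identity: str, **appearance: str) -> None:
        """Add an entity with a fixed identity and initial appearance."""
        self._entities[name] = {"identity": identity, **appearance}

    def retrieve(self, name: str) -> Optional[Dict[str, str]]:
        """Return a copy of the entity state, or None if unknown."""
        entry = self._entities.get(name)
        return dict(entry) if entry is not None else None

    def update(self, name: str, **changes: str) -> None:
        """Apply story-driven appearance changes after a shot."""
        if "identity" in changes:
            raise ValueError("identity is fixed once registered")
        self._entities[name].update(changes)


bank = DynamicMemoryBank()
bank.register("hero", identity="tall woman with red hair", outfit="blue coat")
bank.update("hero", outfit="torn blue coat")   # change after a fight scene
state = bank.retrieve("hero")                  # conditions the next shot
```

Retrieving before each shot and updating after it is what lets an entity reappear many shots later with the same identity and its latest appearance.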
[77] MGPC: Multimodal Network for Generalizable Point Cloud Completion With Modality Dropout and Progressive Decoding cs.CVPDF
Jiangyuan Liu, Hongxuan Ma, Yuhao Zhao, Zhe Liu, Jian Wang
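TL;DR: This paper proposes MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text in a unified architecture. It improves robustness, scalability, and geometric modeling through a modality dropout strategy, a Transformer-based fusion module, and a progressive generator, and it is validated on the newly constructed large-scale dataset MGPC-1M and on real-world data.
Details
Motivation: Existing learning-based methods (3D CNN-based, point-based, and Transformer-based) perform well on synthetic benchmarks, but limitations in modality, scalability, and generative capacity restrict their generalization to novel objects and real-world scenes.
Result: Extensive experiments on MGPC-1M (over 1,000 categories and one million training pairs) and on real-world data show that the method consistently outperforms prior baselines and generalizes well under real-world conditions.
Insight: Contributions include a modality dropout strategy for robustness, a Transformer-based fusion module for multimodal integration, and a progressive generator for better geometric modeling. Viewed objectively, integrating multimodal information in a unified architecture and building a large-scale dataset address the generalization challenge and offer a new direction for point cloud completion.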
Abstract: Point cloud completion aims to recover complete 3D geometry from partial observations caused by limited viewpoints and occlusions. Existing learning-based works, including 3D Convolutional Neural Network (CNN)-based, point-based, and Transformer-based methods, have achieved strong performance on synthetic benchmarks. However, due to the limitations of modality, scalability, and generative capacity, their generalization to novel objects and real-world scenarios remains challenging. In this paper, we propose MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text within a unified architecture. MGPC introduces an innovative modality dropout strategy, a Transformer-based fusion module, and a novel progressive generator to improve robustness, scalability, and geometric modeling capability. We further develop an automatic data generation pipeline and construct MGPC-1M, a large-scale benchmark with over 1,000 categories and one million training pairs. Extensive experiments on MGPC-1M and in-the-wild data demonstrate that the proposed method consistently outperforms prior baselines and exhibits strong generalization under real-world conditions.
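Modality dropout is commonly implemented by randomly withholding auxiliary inputs during training so the model learns to cope when they are missing. The sketch below follows that generic recipe; the drop probability, the choice to always keep the point cloud, and the function name are assumptions, since the abstract does not describe MGPC's exact strategy.

```python
import random
from typing import Any, Dict, Optional, Tuple


def modality_dropout(inputs: Dict[str, Any],
                     p_drop: float = 0.3,
                     always_keep: Tuple[str, ...] = ("points",),
                     rng: Optional[random.Random] = None) -> Dict[str, Any]:
    """Randomly replace optional modalities with None for one training step.

    Hypothetical sketch: modalities listed in `always_keep` are never
    dropped; every other modality is dropped independently with
    probability `p_drop`.
    """
    rng = rng or random.Random()
    out: Dict[str, Any] = {}
    for name, value in inputs.items():
        dropped = name not in always_keep and rng.random() < p_drop
        out[name] = None if dropped else value
    return out


batch = {"points": [[0.0, 0.0, 0.0]], "image": "rgb.png", "text": "a chair"}
train_view = modality_dropout(batch, p_drop=0.3, rng=random.Random(0))
```

At inference time the same model can then be called with whichever modalities happen to be available, with missing ones passed as None.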
[78] PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance cs.CVPDF
Siddarth Nilol Kundur Satish, Devesh Jaiswal, Hongyu Chen, Abhishek Bakshi
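TL;DR: This paper proposes PhysVideoGenerator, a framework that embeds a learnable physics prior into the video generation process to address the weak representation of physical dynamics in existing video generation models, which shows up as unnatural object collisions, inconsistent gravity, and temporal flickering.
Details
Motivation: Current video generation models produce aesthetically high-quality videos but struggle to learn representations of real-world physical dynamics, resulting in physically implausible artifacts.
Result: The paper validates the technical feasibility of the joint training paradigm: diffusion latents contain enough information to recover V-JEPA 2 physical representations, and multi-task optimization remains stable during training. This lays a foundation for future large-scale evaluation of physics-aware generative models.
Insight: The novelty is a lightweight predictor network, PredictorP, that regresses high-level physical features extracted by a pre-trained V-JEPA 2 from diffusion latents and injects them into a DiT-based generator through cross-attention, explicitly embedding a physics prior.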
Abstract: Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that multi-task optimization remains stable over training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.
[79] TRec: Egocentric Action Recognition using 2D Point Tracks cs.CV | cs.LGPDF
Dennis Holzmann, Sven Wachsmuth
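TL;DR: This paper presents TRec, a new approach to egocentric action recognition that uses 2D point tracks as an additional motion cue. Unlike mainstream methods that rely on RGB appearance or human pose estimation, it uses CoTracker to track randomly initialized points through the video and feeds the resulting trajectories, together with the image frames, into a Transformer-based recognition model. Experiments show that the method improves recognition accuracy notably even when given only the initial frame and its point tracks.
Details
Motivation: Existing egocentric action recognition methods rely mainly on RGB appearance or human pose. The paper explores a lighter additional motion representation (2D point tracks) that does not depend on hand or object detection to improve recognition performance.
Result: Experiments show that integrating 2D point tracks consistently improves performance over the same model trained without motion information, demonstrating their potential as a lightweight and effective representation for action understanding.
Insight: The novelty is to introduce randomly sampled and tracked 2D points as a general motion cue for action recognition, avoiding complex hand and object detection, and to show that the initial frame plus track information alone yields notable gains, offering a new direction for lightweight action understanding.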
Abstract: We present a novel approach for egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.
[80] BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion cs.CVPDF
Qingyao Tian, Bingyu Yang, Huai Liao, Xinyan Huang, Junyong Li
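TL;DR: This paper proposes BREATH-VL, a hybrid vision-language framework for 6-DoF bronchoscopy localization. It builds BREATH, a large-scale real-world medical dataset, and fuses the semantic understanding of vision-language models with geometric information from vision-based registration methods to address data scarcity, insufficient pose regression accuracy, and high computational latency in endoscopic localization.
Details
Motivation: To address three challenges in applying vision-language models to 6-DoF endoscopic camera localization: the lack of large-scale, high-quality, densely annotated, localization-oriented real-world medical datasets; limited capability for fine-grained pose regression; and high computational latency when extracting temporal features from past frames.
Result: Extensive experiments on the BREATH dataset show that BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% relative to the best baseline while achieving competitive computational latency.
Insight: Main contributions: 1) BREATH, the largest in-vivo endoscopic localization dataset to date; 2) a hybrid semantic-geometric fusion framework that combines the generalizable semantic understanding of vision-language models with the precise geometric alignment of registration methods; 3) a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation.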
Abstract: Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM’s ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.
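Encoding motion history as a linguistic prompt can be as simple as serializing recent pose changes into text that precedes the current frame's query. The format below is purely hypothetical: the prompt wording, the use of translation deltas only, and the millimetre unit are assumptions, not the paper's prompt template.

```python
from typing import List, Tuple

Delta = Tuple[float, float, float]  # (dx, dy, dz) translation between frames


def motion_history_prompt(deltas: List[Delta], max_steps: int = 3) -> str:
    """Serialize the most recent camera translations as a text prompt.

    Hypothetical format; a full 6-DoF history would also carry rotation.
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    recent = deltas[-max_steps:]
    if not recent:
        return "No previous motion."
    parts = [
        f"step -{len(recent) - i}: moved ({dx:+.1f}, {dy:+.1f}, {dz:+.1f}) mm"
        for i, (dx, dy, dz) in enumerate(recent)
    ]
    return "Recent camera motion: " + "; ".join(parts) + "."


prompt = motion_history_prompt([(1.0, 0.0, -2.5), (0.5, 0.2, -3.0)])
```

Because the history is a short string rather than a stack of past frames, the model sees temporal context at the cost of a few extra tokens instead of video-level computation.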
[81] CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval cs.CV | cs.AIPDF
Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai
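TL;DR: This paper proposes CSMCIR, a framework that uses multi-level chain-of-thought prompting, a symmetric dual-tower architecture, and an entropy-based dynamic memory bank to address the representation space fragmentation caused by heterogeneous modality encoding in composed image retrieval, achieving efficient query-target alignment.
Details
Motivation: Existing composed image retrieval methods use different encoders for queries (image plus text) and targets (images), which fragments and misaligns the representation space and limits retrieval performance.
Result: Experiments on four benchmark datasets show that CSMCIR achieves state-of-the-art performance with superior training efficiency.
Insight: Contributions include: multi-level chain-of-thought prompting that generates captions semantically compatible with target images to establish modal symmetry; a parameter-sharing symmetric dual-tower architecture that ensures consistent feature representations; and an entropy-based dynamic memory bank that supplies high-quality negative samples and stays synchronized with the evolving model state.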
Abstract: Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
[82] RadDiff: Describing Differences in Radiology Image Sets with Natural Language cs.CV | cs.AI | cs.CL | cs.CY | cs.LGPDF
Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma
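TL;DR: This paper presents RadDiff, a multimodal agentic system for describing clinically meaningful differences between radiology image sets. The system builds on the proposer-ranker framework of VisDiff and adds four innovations: medical knowledge injection, multimodal reasoning, iterative hypothesis refinement, and targeted visual search. The authors also construct the RadDiffBench benchmark for evaluation, on which RadDiff reaches 47% accuracy, significantly outperforming the general-domain baseline.
Details
Motivation: Understanding how two sets of radiology images differ is critical for generating clinical insights and interpreting medical AI systems, but systematic methods and benchmarks have been lacking.
Result: On RadDiffBench (57 expert-validated radiology study pairs), RadDiff achieves 47% accuracy, and 50% when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline.
Insight: Contributions include: 1) medical knowledge injection through domain-adapted vision-language models; 2) multimodal reasoning that integrates images with clinical reports; 3) iterative hypothesis refinement across multiple reasoning rounds; 4) targeted visual search that localizes and zooms in on salient regions. Together these provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.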
Abstract: Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff’s versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
[83] I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing cs.CVPDF
Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li
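TL;DR: This paper proposes I2E, a novel "Decompose-then-Action" paradigm that recasts text-guided image editing as an actionable interaction process within a structured environment. A Decomposer converts unstructured images into discrete, manipulable object layers, and a physics-aware Vision-Language-Action agent parses complex instructions into a sequence of atomic actions via chain-of-thought reasoning.
Details
Motivation: Existing text-guided image editing methods based on end-to-end pixel-level inpainting fall short on compositional editing tasks that require precise local control and complex multi-object spatial reasoning. They are limited mainly by the implicit coupling of planning and execution, the lack of object-level control granularity, and the reliance on unstructured, pixel-centric modeling.
Result: Experiments on I2E-Bench and multiple public benchmarks show that I2E significantly outperforms state-of-the-art methods at handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
Insight: The novelty is to recast image editing as an interaction process in a structured environment, decoupling planning from execution and providing object-level control granularity. The accompanying I2E-Bench benchmark also provides an evaluation standard for multi-instance spatial reasoning and high-precision editing.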
Abstract: Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still struggles significantly with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel “Decompose-then-Action” paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We further construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
[84] MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction cs.CVPDF
Xiaokun Sun, Zezhong Wu, Zewen Ding, Linli Xu
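TL;DR: This paper proposes Masked Video Prediction (MVP), a new post-training objective that strengthens the understanding of temporal logic and inter-frame correlations in Video Large Language Models (VideoLLMs). By requiring the model to reconstruct a masked continuous video segment from a set of challenging distractors, MVP forces it to attend to the sequential logic and temporal context of events. To support training at scale, the authors introduce a data synthesis pipeline that converts arbitrary video corpora into MVP training samples and apply Group Relative Policy Optimization (GRPO) with a fine-grained reward function to improve the model's understanding of video context and temporal properties.
Details
Motivation: Existing reinforcement-learning-based post-training paradigms for VideoLLMs mainly optimize holistic content understanding (such as captioning or video question answering) and lack explicit supervision for intrinsic temporal coherence and inter-frame correlations, which limits the models' ability to capture intricate dynamics and fine-grained visual causality.
Result: Comprehensive evaluations show that MVP improves video reasoning by directly reinforcing temporal reasoning and causal understanding.
Insight: The core innovation is an explicit, self-supervised masked video prediction post-training objective aimed specifically at strengthening temporal modeling in VideoLLMs. Its scalable data synthesis pipeline and fine-grained GRPO reward offer a new technical route for incorporating temporal and causal reasoning into video understanding.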
Abstract: Reinforcement-learning-based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models’ ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support training at scale, we introduce a data synthesis pipeline capable of transforming arbitrary video corpora into MVP training samples, and further employ Group Relative Policy Optimization (GRPO) with a fine-grained reward function to enhance the model’s understanding of video context and temporal properties. Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
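One way to picture an MVP training sample is as a multiple-choice item: a contiguous segment is masked out of the frame sequence and the model must pick it from among distractors. The construction below is a hypothetical sketch; the `<MASK>` token, the dictionary layout, and the example distractors (a reversed segment and an out-of-context segment) are assumptions, not the paper's synthesis pipeline.

```python
import random
from typing import Dict, List, Sequence


def make_mvp_sample(frames: Sequence[str],
                    start: int,
                    length: int,
                    distractors: List[List[str]],
                    seed: int = 0) -> Dict[str, object]:
    """Mask frames[start:start+length] and build a multiple-choice item."""
    if length <= 0 or start < 0 or start + length > len(frames):
        raise ValueError("masked segment must lie inside the video")
    target = list(frames[start:start + length])
    # Replace the masked segment with a single placeholder token.
    context = list(frames[:start]) + ["<MASK>"] + list(frames[start + length:])
    options = [target] + [list(d) for d in distractors]
    random.Random(seed).shuffle(options)  # hide the answer position
    return {"context": context, "options": options,
            "answer": options.index(target)}


frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
sample = make_mvp_sample(
    frames, start=2, length=2,
    distractors=[["f3", "f2"],    # same frames, reversed order
                 ["f0", "f1"]],   # segment from elsewhere in the video
)
```

A reward for GRPO-style training could then compare the model's chosen option index with `sample["answer"]`, though the fine-grained reward used in the paper is not specified in the abstract.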
[85] A Comparative Study of 3D Model Acquisition Methods for Synthetic Data Generation of Agricultural Products cs.CV
Steven Moonen, Rob Salaets, Kenneth Batstone, Abdellatif Bey-Temsamani, Nick Michiels
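TL;DR: This paper compares 3D model acquisition methods for synthetic data generation of agricultural products. It uses highly representative 3D models obtained by scanning or image-to-3D methods to generate synthetic data for training AI models that detect stones and potatoes in a sorting environment, and shows that fine-tuning on a small amount of real data improves model performance.
Details
Motivation: Synthetic data generation is difficult in agriculture because ready-made CAD models are not available. The work aims to reduce the dependence of AI model training on expensive annotated real data.
Result: On an object detection task in a sorting environment, synthetic data generated from highly representative 3D models can be used to train the detector. Fine-tuning on a small real dataset significantly improves performance and yields similar performance even when less representative models are used.
Insight: The paper systematically compares 3D acquisition techniques that substitute for CAD models (such as scanning and image-to-3D) and validates the combination of synthetic data with fine-tuning on a small amount of real data for agricultural vision tasks, giving a practical option for low-data settings.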
Abstract: In the manufacturing industry, computer vision systems based on artificial intelligence (AI) are widely used to reduce costs and increase production. Training these AI models requires a large amount of training data that is costly to acquire and annotate, especially in high-variance, low-volume manufacturing environments. A popular approach to reduce the need for real data is the use of synthetic data that is generated by leveraging computer-aided design (CAD) models available in the industry. However, in the agricultural industry these models are not readily available, increasing the difficulty in leveraging synthetic data. In this paper, we present different techniques for substituting CAD files to create synthetic datasets. We measure their relative performance when used to train an AI object detection model to separate stones and potatoes in a bin picking environment. We demonstrate that highly representative 3D models acquired by scanning or image-to-3D approaches can be used to generate synthetic data for training object detection models. Finetuning on a small real dataset can significantly improve the performance of the models and can even yield similar performance when less representative models are used.
[86] From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs cs.CV | cs.LG
Usha Shrestha, Dmitry Ignatov, Radu Timofte
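TL;DR: This paper proposes a performance-aware, closed-loop solution within the NNGPT ecosystem that lets large language models autonomously design optimal data augmentation transformations by internalizing empirical performance cues, reducing reliance on exhaustive search.
Details
Motivation: Data augmentation currently relies on heuristic design or brute-force approaches. The work uses performance feedback to guide LLMs in optimizing transformation design on their own.
Result: LLMs are fine-tuned on a dataset of more than 6,000 empirically evaluated PyTorch augmentation functions. Compared with brute-force discovery, the approach evaluates up to 600x fewer candidates while maintaining competitive peak accuracy.
Insight: Training on pairwise performance ordering (better-worse transformations), without reinforcement learning, reward models, or symbolic objectives, lets the LLM internalize semantic performance cues rather than memorize syntax, enabling task-level reasoning.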
Abstract: Large language models (LLMs) have achieved notable performance in code synthesis; however, data-aware augmentation remains a limiting factor, handled via heuristic design or brute-force approaches. We introduce a performance-aware, closed-loop solution in the NNGPT ecosystem of projects that enables LLMs to autonomously engineer optimal transformations by internalizing empirical performance cues. We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions, each annotated solely by downstream model accuracy. Training uses pairwise performance ordering (better-worse transformations), enabling alignment through empirical feedback without reinforcement learning, reward models, or symbolic objectives. This reduces the need for exhaustive search, achieving up to 600x fewer evaluated candidates than brute-force discovery while maintaining competitive peak accuracy and shifting generation from random synthesis to task-aligned design. Ablation studies show that structured Chain-of-Thought prompting introduces syntactic noise and degrades performance, whereas direct prompting ensures stable optimization in performance-critical code tasks. Qualitative and quantitative analyses demonstrate that the model internalizes semantic performance cues rather than memorizing syntax. These results show that LLMs can exhibit task-level reasoning through non-textual feedback loops, bypassing explicit symbolic rewards.
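The pairwise performance ordering described in this abstract could be built from accuracy-annotated records roughly as follows. This is a sketch under stated assumptions: the record format, the `min_margin` filter, and the function name are hypothetical and are not taken from the paper.

```python
def build_preference_pairs(records, min_margin=0.0):
    """Turn (name, accuracy) records into (better, worse) pairs.

    records: list of (transform_name, downstream_accuracy) tuples.
    min_margin: hypothetical threshold; a pair is kept only when the accuracy
    gap is positive and at least this large, so ties never form a pair.
    """
    ranked = sorted(records, key=lambda rec: rec[1], reverse=True)
    pairs = []
    for i, (better, acc_hi) in enumerate(ranked):
        for worse, acc_lo in ranked[i + 1:]:
            gap = acc_hi - acc_lo
            if gap > 0 and gap >= min_margin:
                pairs.append((better, worse))
    return pairs


# Hypothetical augmentation functions with measured downstream accuracy.
pairs = build_preference_pairs(
    [("flip_crop", 0.9), ("blur", 0.8), ("jitter", 0.8)], min_margin=0.05
)
```

Each pair only records which transformation performed better, which matches the abstract's point that no reward model or symbolic objective is needed.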
[87] Bayesian Monocular Depth Refinement via Neural Radiance Fields cs.CV | cs.GR | cs.LG | cs.RO
Arun Muthukkumar
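TL;DR: This paper proposes MDENeRF, an iterative framework based on Neural Radiance Fields (NeRF) for refining monocular depth estimates. It combines the global structure prior from monocular depth estimation with uncertainty-aware depth from a NeRF and uses Bayesian fusion to iteratively inject high-frequency geometric detail, producing finer depth maps.
Details
Motivation: Current monocular depth estimation methods often produce overly smooth depth maps that lack the fine geometric detail needed for accurate scene understanding, which limits their use in areas such as autonomous navigation and extended reality.
Result: On indoor scenes from the SUN RGB-D dataset, the method reports superior performance on key metrics and experiments.
Insight: The novelty lies in the Bayesian fusion of NeRF uncertainty (derived from the volume rendering process) with a monocular depth prior, which iteratively enhances high-frequency detail while preserving global structure. It suggests a way to combine generative models with traditional estimation methods to improve geometric accuracy.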
Abstract: Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate superior performance on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
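The abstract does not state the fusion formula, so the following is only a sketch of one standard form of Bayesian fusion: if the monocular and NeRF depths at a pixel are treated as independent Gaussian estimates, the posterior mean is their precision-weighted average. The function name and the variance inputs are assumptions for illustration, not the authors' implementation.

```python
def fuse_depth(d_mono, var_mono, d_nerf, var_nerf):
    """Precision-weighted fusion of two depth estimates for one pixel.

    Assumes independent Gaussian noise on both estimates (an assumption of
    this sketch). Returns the fused depth and its variance.
    """
    w_mono = 1.0 / var_mono  # precision of the monocular estimate
    w_nerf = 1.0 / var_nerf  # precision of the NeRF estimate
    fused = (w_mono * d_mono + w_nerf * d_nerf) / (w_mono + w_nerf)
    fused_var = 1.0 / (w_mono + w_nerf)
    return fused, fused_var


# Equal confidence: the fused depth is the midpoint, with halved variance.
depth, var = fuse_depth(2.0, 1.0, 4.0, 1.0)
```

Where the NeRF is confident (small variance) its depth dominates and contributes fine detail; where it is uncertain, the monocular prior keeps the global structure.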
[88] HemBLIP: A Vision-Language Model for Interpretable Leukemia Cell Morphology Analysis cs.CV
Julie van Logtestijn, Petru Manescu
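TL;DR: This paper presents HemBLIP, a vision-language model for interpretable analysis of leukemia cell morphology. The model generates morphology-aware descriptions of peripheral blood cells, addressing the low clinical trust in current black-box deep learning models. Using a new dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions, the authors adapt a general-purpose VLM with both full fine-tuning and LoRA-based parameter-efficient training, and benchmark it against the biomedical foundation model MedGEMMA.
Details
Motivation: Deep learning models for microscopic evaluation of white blood cell morphology in leukemia diagnosis mostly act as black boxes, which limits clinical trust and adoption, so interpretable models are needed.
Result: On the newly built dataset, HemBLIP outperforms MedGEMMA in both caption quality and morphological accuracy, and LoRA-based adaptation brings further gains at significantly lower computational cost.
Insight: The contribution is to specialize a general-purpose vision-language model for hematological cell morphology description through fine-tuning (including parameter-efficient LoRA), yielding a framework that generates explanatory descriptions for transparent and scalable hematological diagnostics.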
Abstract: Microscopic evaluation of white blood cell morphology is central to leukemia diagnosis, yet current deep learning models often act as black boxes, limiting clinical trust and adoption. We introduce HemBLIP, a vision language model designed to generate interpretable, morphology aware descriptions of peripheral blood cells. Using a newly constructed dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions, we adapt a general-purpose VLM via both full fine-tuning and LoRA based parameter efficient training, and benchmark against the biomedical foundation model MedGEMMA. HemBLIP achieves higher caption quality and morphological accuracy, while LoRA adaptation provides further gains with significantly reduced computational cost. These results highlight the promise of vision language models for transparent and scalable hematological diagnostics.
[89] FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection cs.CV | cs.AI | cs.CL | cs.HC
Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
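TL;DR: This paper proposes FocusUI, a framework for efficient user interface (UI) grounding. It selects the image patches most relevant to the instruction while preserving positional continuity, addressing the high computational overhead and diluted attention that existing vision-language models incur on high-resolution UI screenshots.
Details
Motivation: Existing vision-language models turn high-resolution UI screenshots into a large number of visual tokens, which is computationally expensive and dilutes attention, whereas humans usually focus only on regions of interest when interacting with a UI. The paper therefore targets efficient UI grounding with less redundant computation.
Result: Comprehensive experiments on four grounding benchmarks show that FocusUI surpasses GUI-specific baselines. On ScreenSpot-Pro, FocusUI-7B improves on GUI-Actor-7B by 3.7%. With only 30% of visual tokens retained, performance drops by just 3.2%, while inference is up to 1.44x faster and peak GPU memory is 17% lower.
Insight: The contributions are: 1) patch-level supervision built by fusing an instruction-conditioned score with a rule-based UI-graph score, used to select distinct, instruction-relevant visual tokens; 2) the PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity, which is critical for UI grounding.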
Abstract: Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task’s characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence’s last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
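The PosPad idea in this abstract (one marker per contiguous run of dropped tokens, placed at the run's last index) can be sketched on a flat token list. The function name, the marker string, and the `(index, token)` output format are hypothetical choices for illustration; the paper's actual data structures are not given here.

```python
def pospad(tokens, keep, marker="<POS_PAD>"):
    """Drop unselected tokens but leave one marker per contiguous dropped run.

    tokens: sequence of visual tokens (any objects).
    keep:   sequence of booleans, True where the token is retained.
    Returns (original_index, token) pairs; each marker carries the last
    index of the run it replaces, so kept tokens keep their positions.
    """
    out = []
    n = len(tokens)
    for i, (tok, kept) in enumerate(zip(tokens, keep)):
        if kept:
            out.append((i, tok))
        elif i == n - 1 or keep[i + 1]:
            # i is the last index of a dropped run: emit a single marker
            out.append((i, marker))
    return out


compressed = pospad(
    list("abcdefg"),
    [True, False, False, True, True, False, False],
)
```

Because every kept token retains its original index, the positional encoding seen by the model stays consistent with the screenshot layout even though most tokens are removed.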
[90] ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation cs.CV
Xu Zhang, Cheng Da, Huan Yang, Kun Gai, Ming Lu
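TL;DR: This paper proposes ResTok, a 1D visual tokenizer that builds hierarchical residual representations for both image tokens and latent tokens, bringing the hierarchical and residual design principles of vision back into autoregressive image generation. This strengthens representational capacity and improves the latent distribution, and a hierarchical autoregressive generator substantially reduces the number of sampling steps.
Details
Motivation: Existing transformer-based 1D visual tokenizers directly inherit the design principles of language modeling and treat visual data as a flat sequential token stream, overlooking the inherent hierarchy of visual data and the key role of residual network designs in convergence and efficiency.
Result: On ImageNet-256, ResTok achieves a gFID of 2.34 with only 9 sampling steps, a significant improvement for autoregressive image generation.
Insight: The core idea is to restore hierarchical residual priors in visual tokenization: progressive merging builds hierarchical representations that enable cross-level feature fusion, and semantic residuals between hierarchies avoid information overlap, giving more concentrated latent distributions that are easier to model autoregressively. The hierarchical autoregressive generator further improves efficiency by predicting an entire level of latent tokens at once.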
Abstract: Existing 1D visual tokenizers for autoregressive (AR) generation largely follow the design principles of language modeling, as they are built directly upon transformers whose priors originate in language, yielding single-hierarchy latent tokens and treating visual data as flat sequential token streams. However, this language-like formulation overlooks key properties of vision, particularly the hierarchical and residual network designs that have long been essential for convergence and efficiency in visual models. To bring “vision” back to vision, we propose the Residual Tokenizer (ResTok), a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. The hierarchical representations obtained through progressively merging enable cross-level feature fusion at each layer, substantially enhancing representational capacity. Meanwhile, the semantic residuals between hierarchies prevent information overlap, yielding more concentrated latent distributions that are easier for AR modeling. Cross-level bindings consequently emerge without any explicit constraints. To accelerate the generation process, we further introduce a hierarchical AR generator that substantially reduces sampling steps by predicting an entire level of latent tokens at once rather than generating them strictly token-by-token. Extensive experiments demonstrate that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps. Code is available at https://github.com/Kwai-Kolors/ResTok.
[91] PosterVerse: A Full-Workflow Framework for Commercial-Grade Poster Generation with HTML-Based Scalable Typography cs.CV
Junle Liu, Peirong Zhang, Yuyi Zhang, Pengyu Yan, Hui Zhou
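TL;DR: This paper presents PosterVerse, a full-workflow framework for commercial-grade poster generation. By combining fine-tuned LLMs, customized diffusion models, and an MLLM-powered HTML engine, it automates the design process from requirement understanding to final rendering. It also introduces PosterDNA, the first Chinese poster dataset based on HTML typography files, to support accurate and scalable text rendering.
Details
Motivation: Existing automated poster generation systems suffer from incomplete design workflows, poor text rendering accuracy, and insufficient flexibility for commercial use. The goal is a commercial-grade solution that integrates aesthetic appeal with accurate information delivery.
Result: Experiments show that PosterVerse consistently produces commercial-grade posters with appealing visuals, accurate text alignment, and customizable layouts, making it a promising approach to automating commercial poster design.
Insight: The contribution is a three-stage full-workflow framework covering blueprint creation, graphical background generation, and unified layout-text rendering, together with PosterDNA, the first Chinese poster dataset with HTML typography. The authors present this as fundamentally solving the rendering of small, high-density text and enabling scalable, precise typography.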
Abstract: Commercial-grade poster design demands the seamless integration of aesthetic appeal with precise, informative content delivery. Current automated poster generation systems face significant limitations, including incomplete design workflows, poor text rendering accuracy, and insufficient flexibility for commercial applications. To address these challenges, we propose PosterVerse, a full-workflow, commercial-grade poster generation method that seamlessly automates the entire design process while delivering high-density and scalable text rendering. PosterVerse replicates professional design through three key stages: (1) blueprint creation using fine-tuned LLMs to extract key design elements from user requirements, (2) graphical background generation via customized diffusion models to create visually appealing imagery, and (3) unified layout-text rendering with an MLLM-powered HTML engine to guarantee high text accuracy and flexible customization. In addition, we introduce PosterDNA, a commercial-grade, HTML-based dataset tailored for training and validating poster design models. To the best of our knowledge, PosterDNA is the first Chinese poster generation dataset to introduce HTML typography files, enabling scalable text rendering and fundamentally solving the challenges of rendering small and high-density text. Experimental results demonstrate that PosterVerse consistently produces commercial-grade posters with appealing visuals, accurate text alignment, and customizable layouts, making it a promising solution for automating commercial poster design. The code and model are available at https://github.com/wuhaer/PosterVerse.
[92] Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model cs.CV
Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li
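TL;DR: This paper presents REACT, a frame-level reward model designed to evaluate structural distortions in generated videos. The model reasons over video frames to assign point-wise scores and attribution labels, focusing on structural distortions such as abnormal object appearances and interactions. The authors also build a large-scale human preference dataset and use a two-stage training framework (supervised fine-tuning followed by reinforcement learning) to improve reasoning ability and alignment with human preferences.
Details
Motivation: Existing video reward models usually assess visual quality, motion quality, and text alignment, but often overlook key structural distortions (such as abnormal object appearances and interactions) that degrade the overall quality of generated videos.
Result: Experiments show that REACT complements existing reward models in assessing structural distortion, providing accurate quantitative evaluation and interpretable attribution analysis.
Insight: The contribution is a frame-level evaluation model dedicated to structural distortions in generated videos, supported by a large-scale annotated dataset and an efficient Chain-of-Thought synthesis pipeline. The dynamic sampling mechanism and the two-stage training framework (with GRPO) further strengthen the model's reasoning and alignment.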
Abstract: Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
[93] Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models cs.CV | cs.AI
Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding
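TL;DR: This paper proposes LocalDPO, a post-training framework for aligning text-to-video diffusion models with human preferences. It builds localized preference pairs from real videos and optimizes at the spatio-temporal region level, efficiently improving the fidelity, temporal coherence, and human preference scores of generated videos.
Details
Motivation: Existing Direct Preference Optimization (DPO) methods depend on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. The paper aims for a more efficient, fine-grained alignment paradigm for video generators.
Result: Experiments on Wan2.1 and CogVideoX show that LocalDPO consistently outperforms other post-training approaches in video fidelity, temporal coherence, and human preference scores.
Insight: The contributions are an automated pipeline for collecting localized preference pairs (one inference per prompt, with no external critic model or manual annotation) and a region-aware DPO loss that restricts preference learning to the corrupted regions for fast convergence, together giving more efficient and fine-grained alignment of video generators.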
Abstract: Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
[94] Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts cs.CV | cs.AI | cs.CL
Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu
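TL;DR: This paper systematically analyzes the reasoning consistency of large multimodal models under cross-modal conflicts and finds a textual inertia problem: once a textual hallucination appears in the reasoning chain, the model blindly follows the erroneous text and ignores conflicting visual evidence. To mitigate this, the authors propose Active Visual-Context Refinement, a training-free inference paradigm that uses an active visual re-grounding mechanism and an adaptive context refinement strategy to suppress hallucination propagation and improve reasoning robustness.
Details
Motivation: Large multimodal models show strong video reasoning ability, but the robustness of their reasoning chains is questionable. In particular, under cross-modal conflict the model tends to cling to textual hallucinations and neglect visual evidence, which is the textual inertia problem.
Result: Under the LogicGraph Perturbation Protocol, existing models self-correct in fewer than 10% of cases and mostly succumb to blind propagation of textual errors. The proposed Active Visual-Context Refinement significantly suppresses hallucination propagation and improves reasoning robustness.
Insight: The paper identifies textual inertia as a key failure mode and designs a training-free inference paradigm that uses active visual re-grounding and adaptive context refinement to make models more robust to cross-modal conflicts, offering a new direction for improving multimodal reasoning consistency.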
Abstract: Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.
[95] Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction cs.CV
Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma
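TL;DR: Gen3R combines a foundational reconstruction model with a video diffusion model for scene-level 3D generation. By training an adapter, it aligns the geometric latents of the VGGT reconstruction model with the appearance latents of a pre-trained video diffusion model, and jointly generates RGB videos and the corresponding 3D geometry (camera poses, depth maps, and global point clouds).
Details
Motivation: The aim is to bridge the strong priors of foundational reconstruction models and video diffusion models to achieve high-quality 3D scene generation, and to explore the mutual benefit of tightly coupling reconstruction and generative models.
Result: Experiments show state-of-the-art (SOTA) results on single- and multi-image conditioned 3D scene generation, and the method improves reconstruction robustness by leveraging generative priors.
Insight: The core idea is to align the geometric and appearance latent spaces through an adapter, giving disentangled yet coordinated joint generation. Deeply integrating reconstruction and generative models offers a new paradigm for 3D content generation and shows the performance and robustness gains from combining the two kinds of priors.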
Abstract: We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
[96] GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning cs.CV
Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu
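TL;DR: This paper proposes GeoReason, a framework that addresses logical hallucinations in Remote Sensing Vision-Language Models (RS-VLMs), where a model may reach the correct answer through a flawed reasoning chain or a positional shortcut. It builds a logic-driven dataset, GeoReason-Bench, and uses a two-stage training strategy (supervised knowledge initialization and consistency-aware reinforcement learning) to strengthen the logical consistency between internal reasoning and final decisions.
Details
Motivation: Current RS-VLMs often exhibit logical hallucinations in complex spatial tasks: correct answers are decoupled from flawed reasoning chains or rely on positional shortcuts rather than spatial logic, which undermines reliability in strategic spatial decision-making.
Result: Experiments show that the framework significantly improves the cognitive reliability and interpretability of RS-VLMs and achieves state-of-the-art performance compared with other advanced methods.
Insight: The contribution is a Logical Consistency Reward that penalizes logical drift through an option permutation strategy, anchoring decisions in verifiable reasoning traces. This supports the transition from perception-centric recognition to high-level deductive reasoning and improves logical reliability on spatial tasks.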
Abstract: The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
[97] Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images cs.CV | cs.AI
Leandro Stival, Ricardo da Silva Torres, Helio Pedrini
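TL;DR: This paper proposes PIMC, a multimodal contrastive learning method for remote sensing imagery and satellite image time series. It converts pixel-level vegetation index time series into 2D recurrence plots and performs pixel-wise multimodal contrastive learning together with remote sensing imagery, improving feature extraction. On pixel-level forecasting and classification (PASTIS) and land cover classification (EuroSAT), the method outperforms existing SOTA methods.
Details
Motivation: Existing deep learning models usually process entire images or complete time series and struggle to capture pixel-level temporal variation. The paper addresses how to extract pixel-level changes in visual properties from satellite image time series more effectively.
Result: The method reaches SOTA performance on pixel-level forecasting and classification on PASTIS and on land cover classification on EuroSAT. The experiments indicate that 2D representations significantly improve feature extraction from SITS and that contrastive learning improves representation quality for both pixel time series and RSI.
Insight: The contributions are: 1) converting pixel-level vegetation index time series into 2D recurrence plots as a richer representation; 2) the pixel-wise multimodal contrastive learning framework PIMC, which jointly learns representations of time series and remote sensing imagery. This gives an effective self-supervised learning paradigm for multimodal remote sensing data.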
Abstract: Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code available on
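To make the recurrence-plot step concrete, here is a minimal sketch that computes NDVI for one pixel and turns the resulting time series into a binary recurrence matrix. NDVI uses its standard definition, (NIR - Red) / (NIR + Red); the thresholded form of the recurrence plot, the `eps` value, and the sample reflectances are assumptions of this sketch, since the abstract does not say which variant the authors use.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel and one date."""
    return (nir - red) / (nir + red)


def recurrence_plot(series, eps):
    """Binary recurrence matrix: R[i][j] = 1 when |x_i - x_j| <= eps."""
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical (NIR, red) reflectances for one pixel at three dates.
observations = [(0.75, 0.25), (0.5, 0.5), (0.75, 0.25)]
series = [ndvi(nir, red) for nir, red in observations]
plot = recurrence_plot(series, eps=0.1)
```

The matrix is symmetric with ones on the diagonal, and off-diagonal ones mark dates at which the pixel returns to a similar vegetation state, which is what turns a 1D series into an image-like input for a 2D encoder.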
[98] Klear: Unified Multi-Task Audio-Video Joint Generation cs.CV | cs.AI | cs.MM | cs.SD
Jun Wang, Chunyu Qiang, Yuxin Guo, Yiran Wang, Xijuan Zeng
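TL;DR: This paper presents Klear, a unified multi-task model for joint audio-video generation that targets the audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation seen in existing methods. With contributions along three axes (model architecture, training strategy, and data curation), Klear produces high-fidelity, semantically and temporally aligned output, outperforms prior methods by a large margin across tasks, and reaches performance comparable to Veo 3.
Details
Motivation: Existing joint audio-video generation methods suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data.
Result: Klear outperforms prior methods by a large margin across tasks, performs comparably to Veo 3, achieves high-fidelity, semantically and temporally aligned generation, and generalizes robustly to out-of-distribution scenarios.
Insight: The contributions are: a single-tower architecture with unified DiT blocks and an Omni-Full Attention mechanism for better alignment and scalability; a progressive multi-task training strategy (random modality masking and a multistage curriculum) that strengthens representations and prevents modality collapse; and the first large-scale audio-video dataset with dense captions, built with an automated data-construction pipeline. Together these offer a unified, scalable path toward next-generation audio-video synthesis.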
Abstract: Audio-video joint generation has progressed rapidly, yet substantial challenges still remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and delve into three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime (random modality masking to joint optimization across tasks, and a multistage curriculum), yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and introduce a novel automated data-construction pipeline which annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
[99] Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning cs.CV
Yifan Wang, Yanyu Li, Sergey Tulyakov, Yun Fu, Anil Kag
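TL;DR: This paper proposes Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models. It uses a frozen, pre-trained vision-language model (VLM) as a training-free critic and backpropagates the VLM's feedback to optimize the model, improving the quality and semantic alignment of generated videos while avoiding reward hacking and unstable training.
Details
Motivation: Current video generation methods based on Direct Preference Optimization (DPO) depend on non-differentiable signals from human annotations or reward models. This makes training label-intensive, bias-prone, and unstable, and invites reward hacking. The paper addresses these issues with a differentiable reward flow.
Result: Diffusion-DRF improves video quality and semantic alignment and mitigates reward hacking and training collapse, without additional reward models or preference datasets. It is model-agnostic and generalizes to other diffusion-based generative tasks.
Insight: The contributions are: 1) a frozen VLM used as a training-free critic that provides differentiable reward feedback; 2) an automated, aspect-structured prompting pipeline for obtaining reliable multi-dimensional VLM feedback; 3) gradient checkpointing for efficient optimization. Viewed objectively, the method reduces annotation dependence and training instability and offers a lightweight option for fine-tuning diffusion models.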
Abstract: Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse – without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
[100] ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography cs.CV
Vladimir Frants, Sos Agaian, Karen Panetta
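TL;DR: This paper proposes ToTMNet, a lightweight remote photoplethysmography (rPPG) architecture for estimating the blood volume pulse waveform from facial videos. It replaces costly attention with an FFT-accelerated Toeplitz temporal mixing layer, giving a global temporal receptive field at near-linear computational complexity with only 63k parameters. Experiments on UBFC-rPPG and SCAMPS show high accuracy and strong correlation on heart-rate estimation.
Details
Motivation: Existing deep-learning rPPG methods are computationally expensive and parameter-heavy, and attention-based temporal modeling in particular scales quadratically with temporal length. The goal is a lightweight and efficient temporal model.
Result: In intra-dataset evaluation on UBFC-rPPG (real videos), the model reaches a mean absolute error (MAE) of 1.055 bpm with a Pearson correlation of 0.996. In cross-domain evaluation from the synthetic SCAMPS dataset to the real UBFC-rPPG dataset, it reaches an MAE of 1.582 bpm with a Pearson correlation of 0.994.
Insight: The main contribution is a compact gated temporal mixer that combines a local depthwise temporal convolution with global Toeplitz mixing. It exploits the structure of the Toeplitz operator (a number of parameters linear in sequence length) and applies it in near-linear time through circulant embedding and FFT-based convolution, giving an efficient alternative for long-range temporal filtering in rPPG. The gating mechanism in particular is shown to matter for using global mixing effectively and for handling domain shift.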
Abstract: Remote photoplethysmography (rPPG) estimates a blood volume pulse (BVP) waveform from facial videos captured by commodity cameras. Although recent deep models improve robustness compared to classical signal-processing approaches, many methods increase computational cost and parameter count, and attention-based temporal modeling introduces quadratic scaling with respect to the temporal length. This paper proposes ToTMNet, a lightweight rPPG architecture that replaces temporal attention with an FFT-accelerated Toeplitz temporal mixing layer. The Toeplitz operator provides full-sequence temporal receptive field using a linear number of parameters in the clip length and can be applied in near-linear time using circulant embedding and FFT-based convolution. ToTMNet integrates the global Toeplitz temporal operator into a compact gated temporal mixer that combines a local depthwise temporal convolution branch with gated global Toeplitz mixing, enabling efficient long-range temporal filtering while only having 63k parameters. Experiments on two datasets, UBFC-rPPG (real videos) and SCAMPS (synthetic videos), show that ToTMNet achieves strong heart-rate estimation accuracy with a compact design. On UBFC-rPPG intra-dataset evaluation, ToTMNet reaches 1.055 bpm MAE with Pearson correlation 0.996. In a synthetic-to-real setting (SCAMPS to UBFC-rPPG), ToTMNet reaches 1.582 bpm MAE with Pearson correlation 0.994. Ablation results confirm that the gating mechanism is important for effectively using global Toeplitz mixing, especially under domain shift. The main limitation of this preprint study is the use of only two datasets; nevertheless, the results indicate that Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG.
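The circulant-embedding trick named in this abstract can be sketched as follows: an n x n Toeplitz matrix is embedded in a larger circulant matrix, whose product with a vector is a circular convolution and can therefore be computed with FFTs. This is a generic illustration, not the ToTMNet layer; the function names are hypothetical, and a small radix-2 FFT is included only to keep the sketch dependency-free.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unnormalized)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def toeplitz_matvec_fft(c, r, x):
    """Compute T @ x, where T has first column c and first row r (r[0] == c[0])."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:  # circulant size: power of two, at least 2n - 1
        m *= 2
    # First column of the circulant: c, zero padding, then r[n-1], ..., r[1].
    col = list(c) + [0.0] * (m - 2 * n + 1) + list(reversed(r[1:]))
    xp = list(x) + [0.0] * (m - n)
    prod = [a * b for a, b in zip(fft(col), fft(xp))]
    y = fft(prod, invert=True)
    return [(v / m).real for v in y[:n]]  # divide by m to finish the inverse FFT


def toeplitz_matvec_naive(c, r, x):
    """Reference O(n^2) product, used only to check the FFT version."""
    n = len(x)
    return [
        sum((c[i - j] if i >= j else r[j - i]) * x[j] for j in range(n))
        for i in range(n)
    ]


y_fast = toeplitz_matvec_fft([1.0, 2.0, 3.0], [1.0, 4.0, 5.0], [1.0, 0.0, -1.0])
```

The operator is fully described by the 2n - 1 values in `c` and `r`, which is the linear parameter count the abstract refers to, and the FFT route costs O(n log n) instead of the O(n^2) of a dense product or of attention.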
[101] ImLoc: Revisiting Visual Localization with Image-based Representation cs.CV
Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys
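TL;DR: This paper proposes ImLoc, which augments 2D images with estimated depth maps to capture geometric structure and, combined with dense matchers, achieves high-accuracy visual localization. The method is efficient in storage and computation and supports a flexible trade-off between accuracy and memory efficiency.
Details
Motivation: Existing visual localization methods are either 2D image-based, which are easy to build and maintain but limited in geometric reasoning, or 3D structure-based, which are accurate but require a centralized reconstruction and are hard to update. This paper aims to combine the strengths of both, using an image-based representation to improve localization performance.
Result: Achieves new state-of-the-art accuracy on several standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes.
Insight: The novelty lies in combining estimated depth maps with 2D images to strengthen the geometric representation, while compression and a GPU-accelerated LO-RANSAC keep storage and computation efficient. This yields a visual localization solution that is easy to maintain and highly accurate.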
Abstract: Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or 3D structure-based, which achieve high accuracy but require a centralized reconstruction and are difficult to update. In this work, we revisit visual localization with a 2D image-based representation and propose to augment each image with estimated depth maps to capture the geometric structure. Supported by the effective use of dense matchers, this representation is not only easy to build and maintain, but achieves highest accuracy in challenging conditions. With compact compression and a GPU-accelerated LO-RANSAC implementation, the whole pipeline is efficient in both storage and computation and allows for a flexible trade-off between accuracy and highest memory efficiency. Our method achieves a new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Code will be available at https://github.com/cvg/Hierarchical-Localization.
[102] Choreographing a World of Dynamic Objects cs.CV | cs.GR | cs.RO
Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer
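TL;DR: This paper proposes CHORD, a universal generative pipeline for synthesizing the 4D (3D + time) dynamics of dynamic objects and scenes. Through a distillation-based procedure, it extracts the Lagrangian motion information hidden in the Eulerian representations of 2D videos, which lets it generate diverse multi-body 4D dynamics and apply them to generating robotic manipulation policies.
Details
Motivation: Traditional rule-based graphics pipelines for creating dynamic scenes rely on category-specific heuristics, are labor-intensive, and scale poorly; existing learning-based methods usually need large-scale datasets that may not cover every object category of interest. This paper aims to develop a universal, versatile, category-agnostic method for generating dynamic scenes.
Result: Experiments show that the method generates diverse multi-body 4D dynamics, compares favorably with existing methods, and is applicable to generating robotic manipulation policies.
Insight: The novelty is a universal distillation-based pipeline that extracts Lagrangian motion information from the Eulerian representations of 2D videos, enabling category-agnostic dynamic scene generation with a demonstrated application in robotics.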
Abstract: Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing this type of phenomenon. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories of interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies. Project page: https://yanzhelyu.github.io/chord
eess.IV [Back]
[103] Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations eess.IV | cs.CV
Yuyang Fu, Xiuzhen Guo, Ji Shi
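TL;DR: This paper proposes SVL-DRL, an end-to-end staged voxel-level deep reinforcement learning framework for robust 3D medical image segmentation under noisy annotations. The framework models each voxel as an autonomous agent and uses a dynamic iterative update strategy to mitigate the effect of erroneous labels automatically, without manual intervention.
Details
Motivation: The performance of medical image segmentation models depends heavily on large-scale datasets with high-quality annotations, but noisy annotations are common because of the complex morphology of organs and inter-annotator variability, which severely limits segmentation models. Inspired by the way annotators use prior knowledge to correct labeling errors during segmentation, this paper targets robust segmentation under noisy annotations.
Result: Experiments on three public medical image datasets show state-of-the-art performance across various experimental settings, with Dice and IoU scores improving by more than 3% on average.
Insight: The main contributions are: 1) formulating noisy annotation as a voxel-dependent problem and addressing it with a novel staged reinforcement learning framework that guarantees robust convergence; 2) a voxel-level asynchronous advantage actor-critic module that treats each voxel as an autonomous agent and lets it refine its own state representation during training, directly mitigating the influence of erroneous labels; 3) a novel agent action space and a composite reward function combining the Dice value with a spatial continuity metric, which improves segmentation accuracy while preserving semantic integrity.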
Abstract: Deep learning has achieved significant advancements in medical image segmentation. Currently, obtaining accurate segmentation outcomes is critically reliant on large-scale datasets with high-quality annotations. However, noisy annotations are frequently encountered owing to the complex morphological structures of organs in medical images and variations among different annotators, which can substantially limit the efficacy of segmentation models. Motivated by the fact that medical imaging annotators can correct labeling errors during segmentation based on prior knowledge, we propose an end-to-end Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework for robust medical image segmentation under noisy annotations. This framework employs a dynamic iterative update strategy to automatically mitigate the impact of erroneous labels without requiring manual intervention. The key advancements of SVL-DRL over existing works include: i) formulating noisy annotations as a voxel-dependent problem and addressing it through a novel staged reinforcement learning framework which guarantees robust model convergence; ii) incorporating a voxel-level asynchronous advantage actor-critic (vA3C) module that conceptualizes each voxel as an autonomous agent, which allows each agent to dynamically refine its own state representation during training, thereby directly mitigating the influence of erroneous labels; iii) designing a novel action space for the agents, along with a composite reward function that strategically combines the Dice value and a spatial continuity metric to significantly boost segmentation accuracy while maintaining semantic integrity. Experiments on three public medical image datasets demonstrate state-of-the-art (SoTA) performance under various experimental settings, with an average improvement of over 3% in both Dice and IoU scores.
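For readers unfamiliar with the two scores reported above, the following minimal sketch computes Dice and IoU for binary masks. It illustrates the standard metric definitions only; it is not the SVL-DRL reward function, whose exact combination of Dice and spatial continuity the abstract does not specify. The function name and the flat-list mask format are assumptions.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat sequences of 0/1 voxel labels.

    Hypothetical helper for illustration; empty masks are scored as a
    perfect match (1.0) by convention here.
    """
    inter = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    dice = 2 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# One overlapping voxel out of two predicted and two labeled voxels
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)  # Dice = 0.5, IoU = 1/3
```

Dice weights the intersection twice, so for the same prediction it is never lower than IoU.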
[104] GeoDiff-SAR: A Geometric Prior Guided Diffusion Model for SAR Image Generation eess.IV | cs.CV
Fan Zhang, Xuanting Wu, Fei Ma, Qiang Yin, Yuxin Hu
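TL;DR: This paper proposes GeoDiff-SAR, a geometric-prior-guided diffusion model for generating high-fidelity synthetic aperture radar (SAR) images. It computes SAR point clouds at specific azimuths to simulate the geometric structure and scattering relationships of real imaging and uses them as physical guidance, fuses multimodal information dynamically with a FiLM-based feature fusion gating network, and fine-tunes SD3.5 with LoRA to adapt it to the SAR domain.
Details
Motivation: Existing generative methods operate mainly in the image domain and ignore explicit geometric information, which leads to poor generation quality and no precise control over key parameters such as azimuth angle.
Result: Extensive comparative experiments on real SAR datasets show that data generated by GeoDiff-SAR has high fidelity and improves the accuracy of downstream classification, with notably better recognition performance across different azimuth angles.
Insight: The novelty lies in introducing a geometric physical prior (SAR point clouds) into a diffusion model as strong guidance, together with a dynamic multimodal fusion mechanism (a FiLM gating network) and a lightweight adaptation strategy (LoRA fine-tuning of SD3.5), giving precise physical control over SAR image generation.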
Abstract: Synthetic Aperture Radar (SAR) imaging results are highly sensitive to observation geometries and the geometric parameters of targets. However, existing generative methods primarily operate within the image domain, neglecting explicit geometric information. This limitation often leads to unsatisfactory generation quality and the inability to precisely control critical parameters such as azimuth angles. To address these challenges, we propose GeoDiff-SAR, a geometric prior guided diffusion model for high-fidelity SAR image generation. Specifically, GeoDiff-SAR first efficiently simulates the geometric structures and scattering relationships inherent in real SAR imaging by calculating SAR point clouds at specific azimuths, which serves as a robust physical guidance. Secondly, to effectively fuse multi-modal information, we employ a feature fusion gating network based on Feature-wise Linear Modulation (FiLM) to dynamically regulate the weight distribution of 3D physical information, image control parameters, and textual description parameters. Thirdly, we utilize the Low-Rank Adaptation (LoRA) architecture to perform lightweight fine-tuning on the advanced Stable Diffusion 3.5 (SD3.5) model, enabling it to rapidly adapt to the distribution characteristics of the SAR domain. To validate the effectiveness of GeoDiff-SAR, extensive comparative experiments were conducted on real-world SAR datasets. The results demonstrate that data generated by GeoDiff-SAR exhibits high fidelity and effectively enhances the accuracy of downstream classification tasks. In particular, it significantly improves recognition performance across different azimuth angles, thereby underscoring the superiority of physics-guided generation.
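The FiLM operation named in the abstract is a per-channel affine transform whose scale and shift are produced from a conditioning input. The sketch below shows only that generic operation; it is not GeoDiff-SAR's gating network, and the function names, the single linear layer used to generate the modulation parameters, and the toy values are assumptions for illustration.

```python
def linear(weights, bias, cond):
    """Hypothetical one-layer map from a conditioning vector to per-channel values."""
    return [sum(w * c for w, c in zip(row, cond)) + b for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel.

    features: list of channels, each a list of activations.
    gamma, beta: one scale and one shift per channel.
    """
    return [[g * v + b for v in channel] for channel, g, b in zip(features, gamma, beta)]


# Conditioning vector (e.g. an embedding of control parameters; assumed here)
cond = [1.0, 2.0]
gamma = linear([[1.0, 0.5], [0.0, 0.25]], [0.0, 0.0], cond)  # [2.0, 0.5]
beta = linear([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0], cond)   # [1.0, -1.0]

features = [[1.0, 2.0], [3.0, 4.0]]  # two channels, two activations each
print(film(features, gamma, beta))   # [[3.0, 5.0], [0.5, 1.0]]
```

Because the conditioning input only rescales and shifts existing channels, the same feature extractor can be reused across conditions, which is one reason FiLM is a common choice for fusing control signals.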
[105] Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models eess.IV | cs.CV | cs.LG
Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen
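TL;DR: This paper systematically evaluates the robustness of 14 pathology foundation models (PFMs) to technical domain shifts caused by whole-slide scanner devices. It finds that current PFMs are not robust to scanner-induced variability: their embedding spaces carry pronounced scanner-specific bias, which affects the calibration and clinical reliability of downstream predictions.
Details
Motivation: The paper addresses the robustness of pathology foundation models under real-world technical domain shifts such as variability between scanner devices. Benchmark performance is strong but robustness is unknown, which may affect reliability in clinical use.
Result: On a multi-scanner dataset of 384 breast cancer whole-slide images scanned on five devices, evaluated through unsupervised embedding analyses and supervised clinicopathological prediction tasks, no model provided reliable robustness. AUC stayed stable but masked embedding-space shifts and calibration bias; vision-language models, trained on more diverse data, had an advantage in robustness but underperformed on downstream tasks.
Insight: The contribution is the first systematic quantification of PFM sensitivity to scanner-induced domain shift, identifying embedding-space stability and calibration bias as the key failure modes. The analysis indicates that PFM development needs to move beyond accuracy-centric benchmarks toward explicitly evaluating and optimizing embedding stability and calibration under realistic acquisition variability.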
Abstract: Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.
cs.AI [Back]
[106] STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules cs.AI | cs.CL
Di Wu, Yanyan Zhao, Xin Lu, Mingzhe Li, Bing Qin
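TL;DR: This paper proposes STAR-S, a framework that has a large language model learn safety-rule reasoning through a self-taught loop to improve its defense against jailbreak attacks. The model is guided to reason and reflect based on safety rules, and fine-tuning then strengthens that reasoning, forming a synergistic cycle. Experiments show that STAR-S defends effectively against jailbreak attacks and outperforms baseline methods.
Details
Motivation: Existing methods try to improve safety by training models to reason over safety rules before responding, but the form of safety reasoning that effectively defends against jailbreak attacks is hard to design explicitly or obtain directly.
Result: Experiments show that STAR-S defends effectively against jailbreak attacks and outperforms baselines.
Insight: The novelty is integrating the learning of safety-rule reasoning into a self-taught loop, where iterative reasoning generation and model fine-tuning reinforce each other, so that effective safety reasoning patterns are learned automatically rather than designed by hand.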
Abstract: Defending against jailbreak attacks is crucial for the safe deployment of Large Language Models (LLMs). Recent research has attempted to improve safety by training models to reason over safety rules before responding. However, a key issue lies in determining what form of safety reasoning effectively defends against jailbreak attacks, which is difficult to explicitly design or directly obtain. To address this, we propose STAR-S (Self-TAught Reasoning based on Safety rules), a framework that integrates the learning of safety rule reasoning into a self-taught loop. The core of STAR-S involves eliciting reasoning and reflection guided by safety rules, then leveraging fine-tuning to enhance safety reasoning. Repeating this process creates a synergistic cycle. Improvements in the model’s reasoning and interpretation of safety rules allow it to produce better reasoning data under safety rule prompts, which is then utilized for further training. Experiments show that STAR-S effectively defends against jailbreak attacks, outperforming baselines. Code is available at: https://github.com/pikepokenew/STAR_S.git.
[107] Controllable LLM Reasoning via Sparse Autoencoder-Based Steering cs.AI | cs.CL
Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu
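TL;DR: This paper proposes SAE-Steering, which uses sparse autoencoders (SAEs) to disentangle the conceptual entanglement in the hidden states of large reasoning models (LRMs) and thereby identify features tied to specific reasoning strategies. Using these features as control vectors, the method steers the model's reasoning strategy effectively, improving the reliability and flexibility of reasoning.
Details
Motivation: Large reasoning models can autonomously select human-like cognitive strategies (such as backtracking and cross-verification) during reasoning, but this autonomous selection often leads to inefficient or even erroneous reasoning paths. Existing methods struggle to control fine-grained reasoning strategies because of conceptual entanglement in LRM hidden states.
Result: SAE-Steering outperforms existing methods by more than 15% in control effectiveness. Controlling the reasoning strategy can redirect LRMs from erroneous paths to correct ones, for a 7% absolute accuracy improvement.
Insight: The novelty is using SAEs to disentangle hidden states and designing an efficient two-stage feature identification pipeline (SAE-Steering) to select strategy-specific features. This offers a new approach to controllable AI reasoning: steering behavior by intervening on the model's internal representations rather than relying only on external prompting or fine-tuning.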
Abstract: Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g., backtracking, cross-verification) during the reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are autonomously selected by LRMs themselves. However, such autonomous selection often produces inefficient or even erroneous reasoning paths. To make reasoning more reliable and flexible, it is important to develop methods for controlling reasoning strategies. Existing methods struggle to control fine-grained reasoning strategies due to conceptual entanglement in LRMs’ hidden states. To address this, we leverage Sparse Autoencoders (SAEs) to decompose strategy-entangled hidden states into a disentangled feature space. To identify the few strategy-specific features from the vast pool of SAE features, we propose SAE-Steering, an efficient two-stage feature identification pipeline. SAE-Steering first recalls features that amplify the logits of strategy-specific keywords, filtering out over 99% of features, and then ranks the remaining features by their control effectiveness. Using the identified strategy-specific features as control vectors, SAE-Steering outperforms existing methods by over 15% in control effectiveness. Furthermore, controlling reasoning strategies can redirect LRMs from erroneous paths to correct ones, achieving a 7% absolute accuracy improvement.
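The recall-then-steer idea in the abstract can be sketched roughly as follows. This is an illustrative guess at the general shape, not the paper's pipeline: it assumes each SAE feature has a decoder direction in hidden-state space, scores a feature by how much that direction raises the logits of chosen keywords through an unembedding matrix, and steers by adding a scaled direction to the hidden state. All names, the scoring rule, and the toy vectors are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recall_features(decoder, keyword_unembed, top_k):
    """Stage 1 (assumed form): keep the top_k features whose decoder
    direction most increases the summed logits of the keyword tokens."""
    scores = [sum(dot(direction, u) for u in keyword_unembed) for direction in decoder]
    order = sorted(range(len(decoder)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


def steer(hidden, direction, alpha):
    """Add a scaled feature direction to a hidden state (the control vector)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional hidden space with three SAE features
decoder = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
keyword_unembed = [[1.0, 0.0]]  # unembedding row for one strategy keyword

candidates = recall_features(decoder, keyword_unembed, top_k=1)
print(candidates)  # [0]
print(steer([0.5, 0.5], decoder[candidates[0]], alpha=2.0))  # [2.5, 0.5]
```

The abstract's second stage, ranking the recalled features by control effectiveness, would require running the model with each candidate and is not shown here.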
[108] Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction cs.AI | cs.CL
Chen Zhang, Kepu Zhang, Jiatong Zhang, Xiao Zhang, Jun Xu
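TL;DR: This paper proposes Sandwich Reasoning (SandwichR), an answer-reasoning-answer paradigm that substantially lowers query correction latency while preserving reasoning accuracy. It uses consistency-aware reinforcement learning and margin-based rejection sampling to align the initial answer with post-hoc reasoning, and validates the approach on a newly constructed high-quality dataset.
Details
Motivation: In query correction, chain-of-thought (CoT) reasoning is accurate but too slow for real-time search, and in the conventional answer-first-then-reason approach the initial answer cannot benefit from the subsequent reasoning.
Result: On the constructed complex query correction dataset, SandwichR reaches SOTA accuracy comparable to standard CoT while reducing latency by 40-70%, addressing the latency-accuracy trade-off in online search.
Insight: The novelty is the sandwich reasoning paradigm, in which a consistency reward and a borderline-sample sampling mechanism let the model use its reasoning ability to improve correction accuracy under low latency; a dedicated dataset is also constructed to advance research in this area.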
Abstract: Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.
[109] Current Agents Fail to Leverage World Model as Tool for Foresight cs.AI | cs.CL | cs.LG
Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang
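TL;DR: This empirical study finds that current agents built on vision-language models struggle to use generative world models as external simulators for foresight. Across a range of tasks, agents rarely invoke simulation (fewer than 1%), often misuse predicted outcomes (about 15%), and can even perform worse (by up to 5%) when simulation is available or enforced.
Details
Motivation: The paper asks how agents can effectively use generative world models as external simulation tools to strengthen their cognition on tasks that require anticipating future states.
Result: Across diverse agentic and visual question answering tasks, current agents perform poorly at invoking simulation, interpreting its results, and integrating them; performance can drop by up to 5%, falling short of the expected gains.
Insight: The contribution is showing that the core bottleneck in agents' use of world models lies in deciding when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. This underscores the need for mechanisms that foster calibrated, strategic interaction, to make anticipatory cognition in future agent systems more reliable.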
Abstract: Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents’ capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
[110] Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models cs.AI | cs.CL
Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang
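TL;DR: This paper proposes Dynamic Outlier Truncation (DOT) to address "length shift", a phenomenon in training reinforcement-learning-enhanced large reasoning models where the model generates redundant reasoning on simple queries. The method improves efficiency by selectively suppressing the extreme long tail of responses within correct rollout groups, complemented by KL regularization and predictive dynamic sampling to ensure stable convergence.
Details
Motivation: Existing efficient reasoning methods based on explicit length penalties often introduce optimization conflicts and do not examine the generative mechanisms behind overthinking. This paper targets the unnecessarily long reasoning that models produce on simple inputs during training (length shift), in order to reduce deployment cost.
Result: Experiments across multiple model scales show that the method significantly extends the efficiency-performance Pareto frontier. On the AIME-24 benchmark, it reduces inference token usage by 78% relative to the initial policy while increasing accuracy, and it surpasses existing state-of-the-art efficient reasoning methods.
Insight: The novelty is identifying the "length shift" phenomenon and targeting it by dynamically truncating the abnormally long tail of correct responses at training time rather than applying a global penalty, which improves efficiency while preserving long-horizon reasoning on complex problems. The training-stability design that combines KL regularization with dynamic sampling is also a useful reference.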
Abstract: Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
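The selection rule described in the abstract (act only on the extreme length tail, and only within rollout groups where every response is correct) can be sketched as below. The abstract does not state how the tail is defined, so the mean-plus-k-standard-deviations cap, the value of k, and the function name are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def truncate_outliers(lengths, correct, k=1.5):
    """Return capped response lengths for one rollout group.

    Hypothetical rule: if every rollout in the group is correct, cap
    lengths at mean + k * std; otherwise leave the group untouched so
    that hard prompts keep their long reasoning.
    """
    if not lengths or not all(correct):
        return list(lengths)
    cap = int(mean(lengths) + k * pstdev(lengths))
    return [min(n, cap) for n in lengths]


# An easy prompt: all five rollouts are correct, one is far longer than the rest
print(truncate_outliers([100, 110, 90, 100, 400], [True] * 5))
# -> [100, 110, 90, 100, 340]

# A harder prompt: one rollout failed, so nothing is truncated
print(truncate_outliers([100, 110, 90, 100, 400], [True, True, False, True, True]))
# -> [100, 110, 90, 100, 400]
```

Restricting the intervention to fully correct groups is what separates this from a global length penalty: responses to prompts the model still gets wrong are never shortened.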
cs.RO [Back]
[111] PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation cs.RO | cs.AI | cs.CV
Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox
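TL;DR: This paper proposes PointWorld, a large pre-trained 3D world model for robotic manipulation. It represents state and action uniformly as 3D point flows in a shared 3D space: given RGB-D images and a sequence of low-level robot actions, it predicts per-pixel 3D displacements. The model is trained on a large-scale dataset spanning real and simulated environments, supports real-time inference, and enables a variety of manipulation tasks within a model-predictive control framework without demonstrations or post-training.
Details
Motivation: The paper addresses how a robot can, like a human, predict the dynamic response of the 3D world from a single glance and an intended body action, with the goal of a unified, scalable 3D world model that generalizes across robot embodiments.
Result: Rigorous empirical studies on a large-scale dataset covering a single-arm Franka and a bimanual robot (about 2M trajectories and 500 hours) distill design principles for large-scale 3D world modeling. Inference is fast (0.1 s), and once integrated into a model-predictive control framework the model lets a real Franka robot perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without demonstrations or post-training and from a single image captured in the wild.
Insight: The novelty is representing actions as 3D point flows rather than embodiment-specific action spaces, so that the model conditions directly on the robot's physical geometry and integrates learning across embodiments seamlessly. The large-scale dataset, the unified 3D representation, and the empirically distilled design principles are useful references for scalable 3D world modeling.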
Abstract: Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild. Project website at https://point-world.github.io/.
[112] CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos cs.RO | cs.CV
Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai
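TL;DR: This paper proposes the Contrastive Latent Action Pretraining (CLAP) framework, which uses contrastive learning to align the visual latent space of human videos with the proprioceptive latent space of robot trajectories and maps it onto a quantized, physically executable codebook. On top of this representation the authors build a dual-formulation VLA framework: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a rectified-flow-based policy for high-frequency, precise manipulation. They also introduce a Knowledge Matching regularization strategy to mitigate catastrophic forgetting during fine-tuning. Experiments show that CLAP transfers skills from human videos to robot execution effectively.
Details
Motivation: Generalist vision-language-action models are limited by the scarcity of robot data, and existing latent action models that exploit video data are prone to visual entanglement, capturing noise rather than manipulation skills.
Result: Extensive experiments show that CLAP significantly outperforms strong baselines and achieves effective skill transfer from human videos to robot execution.
Insight: The contributions are aligning visual and proprioceptive latent spaces through contrastive learning to resolve visual entanglement, a dual-formulation policy framework that covers both instruction generalization and high-frequency manipulation, and Knowledge Matching regularization to mitigate catastrophic forgetting.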
Abstract: Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. By employing contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. Furthermore, we propose a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning. Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling the effective transfer of skills from human videos to robotic execution. Project page: https://lin-shan.com/CLAP/.
[113] Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test cs.RO | cs.AI | cs.CV
Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao
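TL;DR: This paper proposes WoW-World-Eval (Wow, wo, val), an Embodied Turing Test benchmark for comprehensively evaluating video foundation models as embodied world models. Built on 609 robot manipulation data items, it examines five core abilities (perception, planning, prediction, generalization, and execution) under a comprehensive evaluation protocol with 22 metrics. The evaluation finds that existing models perform poorly on long-horizon planning and physical consistency, revealing a significant gap between generated videos and the real world.
Details
Motivation: Video foundation models are being explored as predictive world models in embodied AI, but two key questions remain unanswered: whether their generative generalization is sufficient to maintain perceptual fidelity for human observers, and whether they are robust enough to serve as a universal prior for real-world embodied agents.
Result: On WoW-World-Eval, models score only 17.27 on average on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. In the Inverse Dynamic Model Turing Test, most models collapse to a success rate of about 0%, while the WoW model maintains 40.74%. The Pearson correlation between the protocol's overall score and human preference exceeds 0.93.
Insight: The contribution is a standardized, comprehensive evaluation framework for embodied world models (WoW-World-Eval) that evaluates systematically across multiple abilities and many metrics and establishes a reliable scoring basis that correlates highly with human preference. It provides a key benchmark and direction for measuring and developing world models in embodied AI.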
Abstract: As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still have two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow, wo, val). Building upon 609 robot manipulation data, Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models’ generation ability, which achieves a high Pearson Correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models’ execution accuracy in the real world. However, most models collapse to approximately 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking World Model in Embodied AI.
[114] Stable Language Guidance for Vision-Language-Action Models cs.RO | cs.CL
Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu
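TL;DR: To address the sensitivity of vision-language-action (VLA) models to linguistic perturbations, this paper proposes Residual Semantic Steering (RSS), a probabilistic framework that improves robustness by disentangling physical affordance from semantic execution. It comprises two innovations, Monte Carlo Syntactic Integration and Residual Affordance Steering, which maximize the mutual information between action and intent and suppress visual distractors.
Details
Motivation: The paper targets the brittleness of VLA models under linguistic perturbation, in particular the "modality collapse" phenomenon, in which strong visual priors overwhelm sparse linguistic signals and the agent overfits to specific instruction phrasings while ignoring the underlying semantic intent.
Result: Across a range of manipulation benchmarks, RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
Insight: The novelty is a probabilistic framework that disentangles physical affordance from semantic execution: Monte Carlo Syntactic Integration approximates the true semantic posterior, and Residual Affordance Steering explicitly isolates the causal influence of language. This offers a new theoretical and methodological direction for improving the linguistic robustness of VLA models.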
Abstract: Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical "modality collapse" phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose Residual Semantic Steering (RSS), a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual affordance prior. Theoretical analysis suggests that RSS effectively maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
cs.CR [Back]
[115] Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset cs.CR | cs.CV | cs.LG | cs.SD
Oran Duan, Yinghua Shen, Yingzhu Lv, Luyang Jie, Yaxin Liu
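TL;DR: This paper proposes LRCM, a multimodal-guided diffusion framework for dance motion generation that supports multiple input modalities and autoregressive generation. It decouples the dance dataset into motion capture data, audio rhythm, and text descriptions, and integrates an audio-latent Conformer, a text-latent Cross-Conformer, and a Motion Temporal Mamba Module (MTMM) to achieve smooth long-sequence generation.
Details
Motivation: Current dance motion generation methods suffer from coarse semantic control and poor coherence over long sequences; this paper aims to address both through multimodal guidance and autoregressive generation.
Result: Experiments show that LRCM performs well in both functional capability and quantitative metrics, with notable potential in multimodal input scenarios and long-sequence generation.
Insight: The contributions are a feature decoupling paradigm for dance datasets, the integration of audio and text latent encoders, and a Motion Temporal Mamba Module that smooths long-sequence autoregressive synthesis; together these improve the granularity of multimodal control and sequence coherence.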
Abstract: Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.
cs.LG [Back]
[116] Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training cs.LG | cs.AI | cs.CL
Chi Liu, Xin Chen
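TL;DR: This paper proposes Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an improved algorithm for making reinforcement learning of large language models (LLMs) more stable and more generalizable. It refines the original Group Relative Policy Optimization (GRPO) framework with an asymmetric, adaptive clipping mechanism and shows better performance on mathematical reasoning tasks with Qwen3 models.
Details
Motivation: The authors argue that the clipping mechanism in standard GRPO is suboptimal in some scenarios, limiting flexibility and generalization, and therefore needs to be improved.
Result: On mathematical reasoning tasks, ABC-GRPO outperforms standard GRPO and maintains substantially higher entropy throughout training, which preserves the model's exploration capacity and mitigates premature convergence.
Insight: The novelty is an asymmetric, adaptive boundary clipping mechanism that lets the algorithm adjust the policy update bounds dynamically, improving generalization and exploration efficiency while keeping training stable.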
Abstract: Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that it is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly enhanced to improve both flexibility and generalization. To this end, we propose Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC-GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using the Qwen3 LLMs. Moreover, ABC-GRPO maintains substantially higher entropy throughout training, thereby preserving the model’s exploration capacity and mitigating premature convergence. The implementation code is available online to ease reproducibility at https://github.com/chi2liu/ABC-GRPO.
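As background for the clipping mechanism the abstract refers to, the sketch below shows a group-relative advantage and a PPO-style clipped surrogate with separate lower and upper bounds. It is not the ABC-GRPO rule, whose adaptive boundaries are not described in the abstract; the fixed `eps_low` and `eps_high` values, the use of the population standard deviation, and the function names are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # identical rewards carry no learning signal
    return [(r - mu) / sigma for r in rewards]


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """Per-token clipped objective with asymmetric bounds (hypothetical values).

    ratio is pi_new(token) / pi_old(token); the objective is the minimum of
    the unclipped and clipped terms, as in PPO-style clipping.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
print(clipped_surrogate(1.5, 1.0))   # positive advantage: gain capped near 1.3
print(clipped_surrogate(0.5, -1.0))  # negative advantage: capped near -0.8
```

With symmetric bounds the two cases above would be clipped by the same margin; separate bounds let increases and decreases in token probability be limited independently.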
cs.IR [Back]
[117] Roles of MLLMs in Visually Rich Document Retrieval for RAG: A Survey cs.IR | cs.CL
Xiantao Zhang
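TL;DR: This survey systematically examines how multimodal large language models (MLLMs) are applied to retrieval-augmented generation over visually rich documents. It groups existing work into three roles (Modality-Unifying Captioners, Multimodal Embedders, and End-to-End Representers), compares them along several dimensions, and aims to guide practical use.
Details
Motivation: Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) because of layout-dependent semantics, brittle OCR, and information spread across complex figures and structured tables. The paper surveys how MLLMs are used to meet these challenges and make VRD retrieval practical for RAG.
Result: As a survey, the paper reports no quantitative experimental results; it systematically compares the MLLM roles and their trade-offs in retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding.
Insight: The contribution is a clear taxonomy (three MLLM roles) and, building on it, an analysis of the core trade-offs between approaches, giving researchers and practitioners practical selection guidance. The survey gives a systematic overview of the emerging use of MLLMs in document retrieval and points to future research directions such as adaptive retrieval units, model compression, and the development of evaluation methods.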
Abstract: Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) with layout-dependent semantics, brittle OCR, and evidence spread across complex figures and structured tables. This survey examines how Multimodal Large Language Models (MLLMs) are being used to make VRD retrieval practical for RAG. We organize the literature into three roles: Modality-Unifying Captioners, Multimodal Embedders, and End-to-End Representers. We compare these roles along retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding. We also outline key trade-offs and offer some practical guidance on when to favor each role. Finally, we identify promising directions for future research, including adaptive retrieval units, model size reduction, and the development of evaluation methods.